| OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | Oct 30, 2024 | Natural Language Visual Grounding | Code Available | 3 |
| SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents | Jan 17, 2024 | Natural Language Visual Grounding | Code Available | 3 |
| GUICourse: From General Vision Language Models to Versatile GUI Agents | Jun 17, 2024 | Natural Language Visual Grounding, Optical Character Recognition (OCR) | Code Available | 2 |
| Improved GUI Grounding via Iterative Narrowing | Nov 18, 2024 | Language Modeling | Code Available | 1 |
| Learning Cross-modal Context Graph for Visual Grounding | Feb 13, 2020 | Graph Matching, Graph Neural Network | Code Available | 1 |
| Localizing Moments in Long Video Via Multimodal Guidance | Feb 26, 2023 | Natural Language Moment Retrieval, Natural Language Visual Grounding | Code Available | 1 |
| CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks | Dec 6, 2021 | Continuous Control, Imitation Learning | Code Available | 1 |
| TubeDETR: Spatio-Temporal Video Grounding with Transformers | Mar 30, 2022 | Decoder, Language-Based Temporal Localization | Code Available | 1 |
| Belief Revision based Caption Re-ranker with Visual Semantic Information | Sep 16, 2022 | Caption Generation, Image Captioning | Code Available | 1 |
| Panoptic Narrative Grounding | Jan 1, 2021 | Natural Language Visual Grounding, Panoptic Segmentation | Code Available | 1 |