| OmniParser for Pure Vision Based GUI Agent | Aug 1, 2024 | Natural Language Visual Grounding | Code Available | 12 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Sep 18, 2024 | Natural Language Visual Grounding | Code Available | 11 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Oct 14, 2023 | Image Classification, Image Description | Code Available | 7 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Aug 24, 2023 | Chart Question Answering, FS-MEVQA | Code Available | 5 |
| CogAgent: A Visual Language Model for GUI Agents | Dec 14, 2023 | Language Modeling | Code Available | 5 |
| ShowUI: One Vision-Language-Action Model for GUI Visual Agent | Nov 26, 2024 | Instruction Following, Natural Language Visual Grounding | Code Available | 5 |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Apr 19, 2024 | Language Modeling, Language Modelling | Code Available | 4 |
| Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction | Dec 5, 2024 | Multimodal Reasoning, Natural Language Visual Grounding | Code Available | 3 |
| Aria-UI: Visual Grounding for GUI Instructions | Dec 20, 2024 | Natural Language Visual Grounding, Visual Grounding | Code Available | 3 |
| Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | Oct 7, 2024 | Natural Language Visual Grounding, Navigate | Code Available | 3 |
| OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | Oct 30, 2024 | Natural Language Visual Grounding | Code Available | 3 |
| SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents | Jan 17, 2024 | Natural Language Visual Grounding | Code Available | 3 |
| GUICourse: From General Vision Language Models to Versatile GUI Agents | Jun 17, 2024 | Natural Language Visual Grounding, Optical Character Recognition (OCR) | Code Available | 2 |
| Improved GUI Grounding via Iterative Narrowing | Nov 18, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Learning Cross-modal Context Graph for Visual Grounding | Feb 13, 2020 | Graph Matching, Graph Neural Network | Code Available | 1 |
| Localizing Moments in Long Video Via Multimodal Guidance | Feb 26, 2023 | Natural Language Moment Retrieval, Natural Language Visual Grounding | Code Available | 1 |
| CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks | Dec 6, 2021 | Continuous Control, Imitation Learning | Code Available | 1 |
| TubeDETR: Spatio-Temporal Video Grounding with Transformers | Mar 30, 2022 | Decoder, Language-Based Temporal Localization | Code Available | 1 |
| Belief Revision based Caption Re-ranker with Visual Semantic Information | Sep 16, 2022 | Caption Generation, Image Captioning | Code Available | 1 |
| Panoptic Narrative Grounding | Jan 1, 2021 | Natural Language Visual Grounding, Panoptic Segmentation | Code Available | 1 |
| Panoptic Narrative Grounding | Sep 10, 2021 | Natural Language Visual Grounding, Panoptic Segmentation | Code Available | 1 |
| A Linguistic Analysis of Visually Grounded Dialogues Based on Spatial Expressions | Oct 7, 2020 | Coreference Resolution, Natural Language Visual Grounding | Code Available | 1 |
| ALFWorld: Aligning Text and Embodied Environments for Interactive Learning | Oct 8, 2020 | Natural Language Visual Grounding, Scene Understanding | Code Available | 1 |
| Self-Monitoring Navigation Agent via Auxiliary Progress Estimation | Jan 10, 2019 | Natural Language Visual Grounding, Vision and Language Navigation | Code Available | 1 |
| ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks | Dec 3, 2019 | Natural Language Visual Grounding | Code Available | 1 |
| Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences | Jan 20, 2023 | Coherence Evaluation, Grounded language learning | Unverified | 0 |
| Learning to Assemble Neural Module Tree Networks for Visual Grounding | Dec 8, 2018 | Dependency Parsing, Natural Language Visual Grounding | Unverified | 0 |
| Searching for Ambiguous Objects in Videos using Relational Referring Expressions | Aug 3, 2019 | Deep Attention, Natural Language Visual Grounding | Code Available | 0 |
| Modularized Textual Grounding for Counterfactual Resilience | Apr 7, 2019 | Attribute, counterfactual | Code Available | 0 |
| Grounding of Textual Phrases in Images by Reconstruction | Nov 12, 2015 | Language Modeling, Language Modelling | Code Available | 0 |
| Composing Pick-and-Place Tasks By Grounding Language | Feb 16, 2021 | Natural Language Visual Grounding, Robotic Grasping | Code Available | 0 |
| Robust Change Captioning | Jan 8, 2019 | Natural Language Visual Grounding | Code Available | 0 |