| OmniParser for Pure Vision Based GUI Agent | Aug 1, 2024 | Natural Language Visual Grounding | Code Available | 12 | 5 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Sep 18, 2024 | Natural Language Visual Grounding | Code Available | 11 | 5 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Oct 14, 2023 | Image Classification, Image Description | Code Available | 7 | 5 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Aug 24, 2023 | Chart Question Answering, FS-MEVQA | Code Available | 5 | 5 |
| ShowUI: One Vision-Language-Action Model for GUI Visual Agent | Nov 26, 2024 | Instruction Following, Natural Language Visual Grounding | Code Available | 5 | 5 |
| CogAgent: A Visual Language Model for GUI Agents | Dec 14, 2023 | Language Modeling | Code Available | 5 | 5 |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Apr 19, 2024 | Language Modeling, Language Modelling | Code Available | 4 | 5 |
| Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | Oct 7, 2024 | Natural Language Visual Grounding, Navigate | Code Available | 3 | 5 |
| OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | Oct 30, 2024 | Natural Language Visual Grounding | Code Available | 3 | 5 |
| Aria-UI: Visual Grounding for GUI Instructions | Dec 20, 2024 | Natural Language Visual Grounding, Visual Grounding | Code Available | 3 | 5 |