SOTAVerified

Natural Language Visual Grounding

Papers

Showing 1–32 of 32 papers

Title | Status | Hype
OmniParser for Pure Vision Based GUI Agent | Code | 12
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Code | 11
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Code | 7
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Code | 5
CogAgent: A Visual Language Model for GUI Agents | Code | 5
ShowUI: One Vision-Language-Action Model for GUI Visual Agent | Code | 5
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Code | 4
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction | Code | 3
Aria-UI: Visual Grounding for GUI Instructions | Code | 3
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | Code | 3
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | Code | 3
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents | Code | 3
GUICourse: From General Vision Language Models to Versatile GUI Agents | Code | 2
Improved GUI Grounding via Iterative Narrowing | Code | 1
Learning Cross-modal Context Graph for Visual Grounding | Code | 1
Localizing Moments in Long Video Via Multimodal Guidance | Code | 1
CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks | Code | 1
TubeDETR: Spatio-Temporal Video Grounding with Transformers | Code | 1
Belief Revision based Caption Re-ranker with Visual Semantic Information | Code | 1
Panoptic Narrative Grounding | Code | 1
Panoptic Narrative Grounding | Code | 1
A Linguistic Analysis of Visually Grounded Dialogues Based on Spatial Expressions | Code | 1
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning | Code | 1
Self-Monitoring Navigation Agent via Auxiliary Progress Estimation | Code | 1
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks | Code | 1
Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences | | 0
Learning to Assemble Neural Module Tree Networks for Visual Grounding | | 0
Searching for Ambiguous Objects in Videos using Relational Referring Expressions | Code | 0
Modularized Textual Grounding for Counterfactual Resilience | Code | 0
Grounding of Textual Phrases in Images by Reconstruction | Code | 0
Composing Pick-and-Place Tasks By Grounding Language | Code | 0
Robust Change Captioning | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | UGround-V1-7B | Accuracy (%) | 86.34 | | Unverified
2 | Aguvis-7B | Accuracy (%) | 83 | | Unverified
3 | OS-Atlas-Base-7B | Accuracy (%) | 82.47 | | Unverified
4 | Aria-UI | Accuracy (%) | 81.1 | | Unverified
5 | Aguvis-G-7B | Accuracy (%) | 81 | | Unverified
6 | UGround-V1-2B | Accuracy (%) | 77.67 | | Unverified
7 | ShowUI | Accuracy (%) | 75.1 | | Unverified
8 | ShowUI-G | Accuracy (%) | 75 | | Unverified
9 | UGround | Accuracy (%) | 73.3 | | Unverified
10 | OmniParser | Accuracy (%) | 73 | | Unverified