SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

  • What is the main focus in a query?
  • How to understand an image?
  • How to locate an object?

Papers

Showing 301–350 of 571 papers

Title | Status | Hype
Improved Visual Grounding through Self-Consistent Explanations | | 0
Improving Visually Grounded Sentence Representations with Self-Attention | | 0
Individuation in Neural Models with and without Visual Grounding | | 0
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention | | 0
Interactive Reinforcement Learning for Object Grounding via Self-Talking | | 0
Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction | | 0
Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining | | 0
Interpretable Visual Question Answering via Reasoning Supervision | | 0
INVIGORATE: Interactive Visual Grounding and Grasping in Clutter | | 0
I Speak and You Find: Robust 3D Visual Grounding with Noisy and Ambiguous Speech Inputs | | 0
Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding | | 0
Knowledge Supports Visual Language Grounding: A Case Study on Colour Terms | | 0
Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving | | 0
Language learning using Speech to Image retrieval | | 0
LanguageRefer: Spatial-Language Model for 3D Visual Grounding | | 0
LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering | | 0
Learning from Synthetic Data for Visual Grounding | | 0
Visually Consistent Hierarchical Image Classification | | 0
Learning Language Structures through Grounding | | 0
Learning to Compose and Reason with Language Tree Structures for Visual Grounding | | 0
Learning to Ground VLMs without Forgetting | | 0
Learning Unsupervised Visual Grounding Through Semantic Self-Supervision | | 0
Learning Visual Grounding from Generative Vision and Language Model | | 0
Learning with Difference Attention for Visually Grounded Self-supervised Representations | | 0
Less is More: Generating Grounded Navigation Instructions from Landmarks | | 0
Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring | | 0
Leveraging Past References for Robust Language Grounding | | 0
LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers | | 0
Lightweight In-Context Tuning for Multimodal Unified Models | | 0
Like a bilingual baby: The advantage of visually grounding a bilingual language model | | 0
LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding | | 0
LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation | | 0
M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation | | 0
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning | | 0
Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment | | 0
MedRG: Medical Report Grounding with Multi-modal Large Language Model | | 0
MedSG-Bench: A Benchmark for Medical Image Sequences Grounding | | 0
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration | | 0
MMR: Evaluating Reading Ability of Large Multimodal Models | | 0
MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding | | 0
MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs | | 0
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models | | 0
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level | | 0
Movie Box Office Prediction With Self-Supervised and Visually Grounded Pretraining | | 0
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation | | 0
Multi-Granularity Modularized Network for Abstract Visual Reasoning | | 0
Multimodal Reference Visual Grounding | | 0
Multimodal Unified Attention Networks for Vision-and-Language Interactions | | 0
Multi-task Learning of Hierarchical Vision-Language Representation | | 0
NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | | Unverified
7 | HYDRA | IoU | 61.7 | | Unverified
8 | HYDRA | IoU | 61.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | | Unverified
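
The results above are reported as Accuracy (%) and IoU, but this page does not say how either is computed. As a rough illustration only, the sketch below assumes a convention that is common for box-based grounding: a prediction counts as correct when its intersection over union (IoU) with the ground-truth box reaches a threshold, here assumed to be 0.5. The box format `(x1, y1, x2, y2)`, the threshold, and the function names `box_iou` and `grounding_accuracy` are assumptions made for this example and are not taken from any of the listed papers.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corners with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of predictions whose IoU with the target is >= threshold.

    The 0.5 default is an assumed convention, not a value stated on this page.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    # One exact match and one weak overlap (IoU = 1/7).
    predictions = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
    ground_truth = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
    print(grounding_accuracy(predictions, ground_truth))
```

Under these assumptions, an IoU figure such as the ones reported for HYDRA would be an overlap score on a 0 to 100 scale rather than a thresholded hit rate, which is one reason the two metrics are not directly comparable.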