SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

  • What is the main focus in a query?
  • How to understand an image?
  • How to locate an object?

Papers

Showing 501–550 of 571 papers

Title | Status | Hype
Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses | | 0
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos | | 0
VideoGLaMM : A Large Multimodal Model for Pixel-Level Visual Grounding in Videos | | 0
A Visual Tour Of Current Challenges In Multimodal Language Models | | 0
VidLA: Video-Language Alignment at Scale | | 0
Viewpoint-Aware Visual Grounding in 3D Scenes | | 0
A Vision Centric Remote Sensing Benchmark | | 0
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding | | 0
ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition | | 0
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding | | 0
3D Scene Graph Guided Vision-Language Pre-training | | 0
YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding | | 0
AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring | | 0
VIMI: Grounding Video Generation through Multi-modal Instruction | | 0
Attention-Based Keyword Localisation in Speech using Visual Grounding | | 0
Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding | | 0
VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs? | | 0
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding | | 0
Zero-Shot Visual Grounding of Referring Utterances in Dialogue | | 0
Visual Grounding Annotation of Recipe Flow Graph | | 0
Learning from Synthetic Data for Visual Grounding | | 0
Visually Consistent Hierarchical Image Classification | | 0
Learning Language Structures through Grounding | | 0
Visual grounding for desktop graphical user interfaces | | 0
Learning to Compose and Reason with Language Tree Structures for Visual Grounding | | 0
Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer | | 0
Learning to Ground VLMs without Forgetting | | 0
Attending Self-Attention: A Case Study of Visually Grounded Supervision in Vision-and-Language Transformers | | 0
Learning Unsupervised Visual Grounding Through Semantic Self-Supervision | | 0
Learning Visual Grounding from Generative Vision and Language Model | | 0
Learning with Difference Attention for Visually Grounded Self-supervised Representations | | 0
How direct is the link between words and images? | | 0
Less is More: Generating Grounded Navigation Instructions from Landmarks | | 0
Visual Grounding of Inter-lingual Word-Embeddings | | 0
Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring | | 0
Leveraging Past References for Robust Language Grounding | | 0
A survey on knowledge-enhanced multimodal learning | | 0
LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering | | 0
LanguageRefer: Spatial-Language Model for 3D Visual Grounding | | 0
LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers | | 0
Lightweight In-Context Tuning for Multimodal Unified Models | | 0
Like a bilingual baby: The advantage of visually grounding a bilingual language model | | 0
Language learning using Speech to Image retrieval | | 0
Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving | | 0
LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding | | 0
Knowledge Supports Visual Language Grounding: A Case Study on Colour Terms | | 0
Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding | | 0
I Speak and You Find: Robust 3D Visual Grounding with Noisy and Ambiguous Speech Inputs | | 0
INVIGORATE: Interactive Visual Grounding and Grasping in Clutter | | 0
LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation | | 0
Page 11 of 12

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | | Unverified
7 | HYDRA | IoU | 61.7 | | Unverified
8 | HYDRA | IoU | 61.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | | Unverified
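
This page does not define the Accuracy (%) and IoU metrics listed in the benchmark results. The sketch below shows one common convention in visual grounding, assumed here rather than confirmed for any specific benchmark on this page: a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. The corner box format `(x1, y1, x2, y2)`, the function names, and the 0.5 threshold are illustrative assumptions.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)
```

For example, `grounding_accuracy([(0, 0, 2, 2), (0, 0, 2, 2)], [(0, 0, 2, 2), (5, 5, 6, 6)])` scores one exact match and one miss. Individual benchmarks may use a different threshold or a mask-based IoU, so the original papers remain the authority on how each claimed number was computed.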