SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. VG poses three main challenges:

  • How to identify the main focus of the query?
  • How to understand the image?
  • How to locate the referred object?
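The three challenges above can be caricatured as a toy pipeline. Everything here (`Region`, `query_focus`, the keyword-matching heuristic) is a hypothetical illustration; real VG systems replace each step with learned vision-language components.

```python
# Toy sketch of the three visual-grounding sub-problems.
# All names and logic here are illustrative, not a real system.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Region:
    label: str    # category of a detected region
    box: tuple    # (x1, y1, x2, y2) in pixels

def query_focus(query: str, vocabulary: set) -> str:
    """Challenge 1: find the main focus of the query (naive keyword match)."""
    for word in query.lower().split():
        if word in vocabulary:
            return word
    return ""

def understand_image(regions: list) -> dict:
    """Challenge 2: index the image as label -> regions (stand-in for a visual encoder)."""
    index = {}
    for r in regions:
        index.setdefault(r.label, []).append(r)
    return index

def locate(query: str, regions: list) -> Optional[Region]:
    """Challenge 3: pick the region matching the query focus (largest if several)."""
    index = understand_image(regions)
    focus = query_focus(query, set(index))
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return max(index.get(focus, []), key=lambda r: area(r.box), default=None)

regions = [Region("dog", (10, 10, 60, 60)), Region("cat", (70, 20, 120, 90))]
print(locate("the sleeping cat on the sofa", regions))
```

In a learned system the focus extraction, image representation, and localization are trained jointly rather than hand-coded, but the decomposition is the same.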

Papers

Showing 451–500 of 571 papers

Title | Status | Hype
Visual Grounding via Accumulated Attention | — | 0
Visual Grounding with Attention-Driven Constraint Balancing | — | 0
Visual Intention Grounding for Egocentric Assistants | — | 0
Visually grounded cross-lingual keyword spotting in speech | — | 0
Visually Grounded Neural Syntax Acquisition | — | 0
Visual Prompting in Multimodal Large Language Models: A Survey | — | 0
Visual Reference Resolution using Attention Memory for Visual Dialog | — | 0
VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation | — | 0
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks | — | 0
VLMAE: Vision-Language Masked Autoencoder | — | 0
VQD: Visual Query Detection in Natural Scenes | — | 0
WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model | — | 0
WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar | — | 0
Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment | — | 0
Weakly-supervised segmentation of referring expressions | — | 0
Weakly-supervised Visual Grounding of Phrases with Linguistic Structures | — | 0
When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach | — | 0
Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding | — | 0
YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding | — | 0
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding | — | 0
ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue | — | 0
Zero-Shot 3D Visual Grounding from Vision-Language Models | — | 0
Zero-Shot Visual Grounding of Referring Utterances in Dialogue | — | 0
A Better Loss for Visual-Textual Grounding | Code | 0
Context-Infused Visual Grounding for Art | Code | 0
Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding | Code | 0
Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network | Code | 0
Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models | Code | 0
Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction for Visual Grounding | Code | 0
ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding | Code | 0
Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization | Code | 0
Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding | Code | 0
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language | Code | 0
G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training | Code | 0
AttnGrounder: Talking to Cars with Attention | Code | 0
Revisiting Visual Question Answering Baselines | Code | 0
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition | Code | 0
NICE: Improving Panoptic Narrative Detection and Segmentation with Cascading Collaborative Learning | Code | 0
Neural Twins Talk | Code | 0
Uncovering the Full Potential of Visual Grounding Methods in VQA | Code | 0
Connecting Vision and Language with Localized Narratives | Code | 0
Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer | Code | 0
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | Code | 0
RoViST: Learning Robust Metrics for Visual Storytelling | Code | 0
Flexible Visual Grounding | Code | 0
UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings | Code | 0
FiVL: A Framework for Improved Vision-Language Alignment | Code | 0
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding | Code | 0
Page 10 of 12

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | — | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | — | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | — | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | — | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | — | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | — | Unverified
7 | HYDRA | IoU | 61.7 | — | Unverified
8 | HYDRA | IoU | 61.1 | — | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | — | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | — | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | — | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | — | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | — | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | — | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | — | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | — | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | — | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | — | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | — | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | — | Unverified
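Accuracy in visual grounding is commonly computed as the fraction of queries whose predicted box overlaps the ground-truth box with IoU above a threshold (often 0.5). The sketch below shows that computation in plain Python; it is a minimal illustration, not the evaluation code behind the tables above, and the function names are my own.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def accuracy_at_iou(preds, gts, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)
```

A row reporting "IoU" instead of "Accuracy (%)" would instead average `iou(p, g)` over the test set rather than thresholding it.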