Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

  • What is the main focus of the query?
  • How should the image be understood?
  • How should the referred object be located?

Papers

Showing 51-100 of 571 papers

Title | Status | Hype
F-LMM: Grounding Frozen Large Multimodal Models | Code | 2
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language | Code | 2
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Code | 2
HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding | Code | 2
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis | Code | 2
MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis | Code | 2
The Revolution of Multimodal Large Language Models: A Survey | Code | 2
ChatterBox: Multi-round Multimodal Referring and Grounding | Code | 2
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model | Code | 2
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation | Code | 2
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts | Code | 2
Aligning and Prompting Everything All at Once for Universal Visual Perception | Code | 2
NExT-Chat: An LMM for Chat, Detection and Segmentation | Code | 2
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models | Code | 2
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent | Code | 2
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | Code | 2
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Code | 2
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | Code | 2
Referring Image Matting | Code | 2
Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs | Code | 1
GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents | Code | 1
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | Code | 1
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection | Code | 1
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning | Code | 1
Visual Position Prompt for MLLM based Visual Grounding | Code | 1
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Code | 1
SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding | Code | 1
Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection | Code | 1
NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning | Code | 1
PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model | Code | 1
When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis | Code | 1
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | Code | 1
Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs | Code | 1
Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems | Code | 1
VividMed: Vision Language Model with Versatile Visual Grounding for Medicine | Code | 1
Visual Grounding with Multi-modal Conditional Adaptation | Code | 1
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities | Code | 1
Visual Grounding for Object-Level Generalization in Reinforcement Learning | Code | 1
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding | Code | 1
3D Vision and Language Pretraining with Large-Scale Synthetic Data | Code | 1
Multi-branch Collaborative Learning Network for 3D Visual Grounding | Code | 1
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation | Code | 1
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation | Code | 1
Instruction-Guided Visual Masking | Code | 1
Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension | Code | 1
DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding | Code | 1
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling | Code | 1
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory | Code | 1
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | Code | 1
MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | | Unverified
7 | HYDRA | IoU | 61.7 | | Unverified
8 | HYDRA | IoU | 61.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | | Unverified
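
The results above are reported as Accuracy (%) and IoU. The page does not say how these are computed, so the following is only a minimal sketch under common assumptions: boxes are given as (x1, y1, x2, y2) corners, and a prediction counts as correct when its IoU with the ground-truth box is at least 0.5. The function names `box_iou` and `grounding_accuracy`, the box format, and the 0.5 threshold are all illustrative choices, not taken from any listed paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold.

    The 0.5 default is an assumed convention, not a value stated on this page.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)
```

For example, a predicted box (0, 0, 2, 2) against a target (1, 1, 3, 3) overlaps in a 1 by 1 square out of a union of area 7, so its IoU is 1/7 and it would count as a miss under the assumed 0.5 threshold.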