SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

  • What is the main focus in a query?
  • How to understand an image?
  • How to locate an object?

Papers

Showing 101–150 of 571 papers

Title | Status | Hype
InfMLLM: A Unified Framework for Visual-Language Tasks | Code | 1
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation | Code | 1
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling | Code | 1
A Unified Framework for 3D Point Cloud Visual Grounding | Code | 1
Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans | Code | 1
Instruction-Guided Visual Masking | Code | 1
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models | Code | 1
A Fast and Accurate One-Stage Approach to Visual Grounding | Code | 1
Position-guided Text Prompt for Vision-Language Pre-training | Code | 1
PROGrasp: Pragmatic Human-Robot Communication for Object Grasping | Code | 1
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Code | 1
Joint Visual Grounding and Tracking with Natural Language Specification | Code | 1
CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding | Code | 1
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding | Code | 1
OCID-Ref: A 3D Robotic Dataset with Embodied Language for Clutter Scene Grounding | Code | 1
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | Code | 1
Multi-View Transformer for 3D Visual Grounding | Code | 1
Multi-Modal Dynamic Graph Transformer for Visual Grounding | Code | 1
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding | Code | 1
Visual Grounding Methods for VQA are Working for the Wrong Reasons! | Code | 1
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory | Code | 1
Learning Cross-modal Context Graph for Visual Grounding | Code | 1
Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation | Code | 1
NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning | Code | 1
Context Disentangling and Prototype Inheriting for Robust Visual Grounding | Code | 1
SAT: 2D Semantics Assisted Training for 3D Visual Grounding | Code | 1
3D Vision and Language Pretraining with Large-Scale Synthetic Data | Code | 1
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | Code | 1
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training | Code | 1
Connecting What to Say With Where to Look by Modeling Human Attention Traces | Code | 1
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | Code | 1
Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs | Code | 1
Visual Grounding for Object-Level Generalization in Reinforcement Learning | Code | 1
Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding | Code | 1
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection | Code | 1
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | Code | 1
MixGen: A New Multi-Modal Data Augmentation | Code | 1
Grounded Situation Recognition with Transformers | Code | 1
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method | Code | 1
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | Code | 1
Collaborative Transformers for Grounded Situation Recognition | Code | 1
GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Code | 1
Guessing State Tracking for Visual Dialogue | Code | 1
MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding | Code | 1
Mono3DVG: 3D Visual Grounding in Monocular Images | Code | 1
Multi3DRefer: Grounding Text Description to Multiple 3D Objects | Code | 1
GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models | Code | 1
CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding | Code | 1
Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding | Code | 1
CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | | Unverified
7 | HYDRA | IoU | 61.7 | | Unverified
8 | HYDRA | IoU | 61.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | | Unverified
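
The benchmark results report Accuracy (%) and IoU without defining either metric. As a rough sketch only, assuming predictions and ground truth are axis-aligned boxes in (x1, y1, x2, y2) form, and assuming a prediction counts as correct when its IoU with the ground-truth box is at least 0.5 (a common convention in visual grounding work, not something this page states), the two metrics could be computed as below. The function names `box_iou` and `grounding_accuracy` are hypothetical and do not come from any of the listed papers.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes yield no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(preds, targets, threshold=0.5):
    """Percentage of predictions whose IoU with the target meets the threshold.

    The 0.5 default is an assumed convention, not taken from this page.
    """
    if not targets:
        raise ValueError("targets must not be empty")
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(targets)
```

For example, `grounding_accuracy([(0, 0, 2, 2)], [(0, 0, 2, 2)])` scores a perfect match as correct, while a prediction that does not overlap its target scores as a miss. The leaderboard entries may use different thresholds or mask-based IoU, so their figures should be compared against each paper's own evaluation protocol.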