SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

  • What is the main focus in a query?
  • How to understand an image?
  • How to locate an object?

Papers

Showing 251–300 of 571 papers

| Title | Status | Hype |
|---|---|---|
| SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling | | 0 |
| LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering | | 0 |
| ChatterBox: Multi-round Multimodal Referring and Grounding | Code | 2 |
| Unifying Visual and Vision-Language Tracking via Contrastive Learning | Code | 1 |
| Veagle: Advancements in Multimodal Representation Learning | Code | 1 |
| SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model | Code | 2 |
| SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding | | 0 |
| Uncovering the Full Potential of Visual Grounding Methods in VQA | Code | 0 |
| Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | Code | 3 |
| Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers | | 0 |
| Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation | Code | 2 |
| Investigating Compositional Challenges in Vision-Language Models for Visual Grounding | Code | 0 |
| Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency | Code | 0 |
| LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation | | 0 |
| G^3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding | | 0 |
| When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach | | 0 |
| Multi-Attribute Interactions Matter for 3D Visual Grounding | Code | 0 |
| Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding | | 0 |
| Viewpoint-Aware Visual Grounding in 3D Scenes | | 0 |
| V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | Code | 4 |
| Bridging Modality Gap for Visual Grounding with Effecitve Cross-modal Distillation | | 0 |
| One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts | Code | 2 |
| Cycle-Consistency Learning for Captioning and Grounding | | 0 |
| GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection | Code | 1 |
| Mask Grounding for Referring Image Segmentation | Code | 1 |
| Context Disentangling and Prototype Inheriting for Robust Visual Grounding | Code | 1 |
| Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment | | 0 |
| Mono3DVG: 3D Visual Grounding in Monocular Images | Code | 1 |
| Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation | Code | 1 |
| Visual Grounding of Whole Radiology Reports for 3D CT Images | | 0 |
| Improved Visual Grounding through Self-Consistent Explanations | | 0 |
| GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models | Code | 1 |
| Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment | Code | 0 |
| Uni3DL: Unified Model for 3D and Language Understanding | | 0 |
| Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment | | 0 |
| Aligning and Prompting Everything All at Once for Universal Visual Perception | Code | 2 |
| Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models | Code | 0 |
| G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training | Code | 0 |
| Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions | Code | 1 |
| Context-Aware Indoor Point Cloud Object Generation through User Instructions | | 0 |
| Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | Code | 1 |
| Enhancing Visual Grounding and Generalization: A Multi-Task Cycle Training Approach for Vision-Language Models | Code | 0 |
| InfMLLM: A Unified Framework for Visual-Language Tasks | Code | 1 |
| Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | Code | 1 |
| Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter | Code | 1 |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | Code | 2 |
| GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Code | 1 |
| A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis | | 0 |
| CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data | Code | 1 |
| GROOViST: A Metric for Grounding Objects in Visual Storytelling | Code | 0 |

Benchmark Results

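| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Florence-2-large-ft | Accuracy (%) | 95.3 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 92.8 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 92.1 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 90.4 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 90.3 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 89 | | Unverified |
| 7 | HYDRA | IoU | 61.7 | | Unverified |
| 8 | HYDRA | IoU | 61.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Florence-2-large-ft | Accuracy (%) | 92 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 86.05 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 81.8 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 79.8 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 78.4 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 76.91 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Florence-2-large-ft | Accuracy (%) | 93.4 | | Unverified |
| 2 | mPLUG-2 | Accuracy (%) | 90.33 | | Unverified |
| 3 | X2-VLM (large) | Accuracy (%) | 87.6 | | Unverified |
| 4 | XFM (base) | Accuracy (%) | 86.1 | | Unverified |
| 5 | X2-VLM (base) | Accuracy (%) | 85.2 | | Unverified |
| 6 | X-VLM (base) | Accuracy (%) | 84.51 | | Unverified |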
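
The results above are reported as Accuracy (%) and IoU, but this page does not say how either metric is computed. As a rough illustration only, the sketch below assumes a convention that is common in visual grounding work: a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the 0.5 threshold are all assumptions for this example, not details taken from any of the listed models or benchmarks.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Overlap rectangle; width/height are clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold.

    The 0.5 default is an assumed convention, not something this page states.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)
```

For example, `grounding_accuracy([(0, 0, 2, 2)], [(0, 0, 2, 2)])` scores a perfect match, while a prediction that overlaps its target only slightly falls below the threshold and counts as a miss. Individual benchmarks may use a different threshold or a mask-based IoU instead of boxes.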