SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

  • What is the main focus in a query?
  • How to understand an image?
  • How to locate an object?

Papers

Showing 401-450 of 571 papers

Title | Status | Hype
Semantic sentence similarity: size does not always matter | | 0
Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity Resolution | | 0
Spatio-Temporal Graph for Video Captioning with Knowledge Distillation | | 0
SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding | | 0
Structured Preference Optimization for Vision-Language Long-Horizon Task Planning | | 0
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery | | 0
Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding | | 0
Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded | | 0
Talk to Parallel LiDARs: A Human-LiDAR Interaction Method Based on 3D Visual Grounding | | 0
Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding | | 0
Task-oriented Sequential Grounding in 3D Scenes | | 0
Teaching Metric Distance to Autoregressive Multimodal Foundational Models | | 0
Tell Me the Evidence? Dual Visual-Linguistic Interaction for Answer Grounding | | 0
The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA | | 0
Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding | | 0
TinyRS-R1: Compact Multimodal Language Model for Remote Sensing | | 0
Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases | | 0
Towards Open-World Grasping with Large Vision-Language Models | | 0
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers | | 0
Towards Visual Text Grounding of Multimodal Large Language Model | | 0
Training-Free Reasoning and Reflection in MLLMs | | 0
Transfer Learning from Audio-Visual Grounding to Speech Recognition | | 0
Transformers in Vision: A Survey | | 0
TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding | | 0
TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation | | 0
TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding | | 0
Two Causally Related Needles in a Video Haystack | | 0
Uni3DL: Unified Model for 3D and Language Understanding | | 0
Unified Representation Space for 3D Visual Grounding | | 0
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding | | 0
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | | 0
Unveiling and Mitigating Bias in Audio Visual Segmentation | | 0
UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models | | 0
Using Multiple Instance Learning to Build Multimodal Representations | | 0
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos | | 0
VideoGLaMM : A Large Multimodal Model for Pixel-Level Visual Grounding in Videos | | 0
VidLA: Video-Language Alignment at Scale | | 0
Viewpoint-Aware Visual Grounding in 3D Scenes | | 0
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding | | 0
ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition | | 0
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding | | 0
VIMI: Grounding Video Generation through Multi-modal Instruction | | 0
Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding | | 0
VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs? | | 0
Visual Grounding Annotation of Recipe Flow Graph | | 0
Visual grounding for desktop graphical user interfaces | | 0
How direct is the link between words and images? | | 0
Visual Grounding of Inter-lingual Word-Embeddings | | 0
Visual Grounding of Whole Radiology Reports for 3D CT Images | | 0
Visual Grounding Strategies for Text-Only Natural Language Processing | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 95.3 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 92.8 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 92.1 | | Unverified
4 | XFM (base) | Accuracy (%) | 90.4 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 90.3 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 89 | | Unverified
7 | HYDRA | IoU | 61.7 | | Unverified
8 | HYDRA | IoU | 61.1 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 92 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 86.05 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 81.8 | | Unverified
4 | XFM (base) | Accuracy (%) | 79.8 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 78.4 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 76.91 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Florence-2-large-ft | Accuracy (%) | 93.4 | | Unverified
2 | mPLUG-2 | Accuracy (%) | 90.33 | | Unverified
3 | X2-VLM (large) | Accuracy (%) | 87.6 | | Unverified
4 | XFM (base) | Accuracy (%) | 86.1 | | Unverified
5 | X2-VLM (base) | Accuracy (%) | 85.2 | | Unverified
6 | X-VLM (base) | Accuracy (%) | 84.51 | | Unverified
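
This page does not say how the Accuracy (%) and IoU figures in the benchmark tables are computed. As a rough illustration only, the sketch below assumes the convention common in visual grounding work: each prediction is an axis-aligned box, and it counts as correct when its intersection over union (IoU) with the ground-truth box reaches a threshold, usually 0.5. The `(x_min, y_min, x_max, y_max)` box format, the function names, and the default threshold are assumptions made for this example, not details taken from the listed papers.

```python
from typing import Sequence, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    # Hypothetical boxes: the first prediction matches exactly, the second barely overlaps.
    predictions = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
    ground_truth = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
    print(grounding_accuracy(predictions, ground_truth))
```

Individual benchmarks may use a different threshold, segmentation masks instead of boxes, or a mean IoU rather than a thresholded accuracy, so treat this only as a way to read the numbers above.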