SOTAVerified

Visual Grounding

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. VG poses three main challenges:

  • Identifying the main focus of the query
  • Understanding the image content
  • Localizing the referred object
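These challenges all meet in one interface: a model takes an image and a query and returns a bounding box with a confidence score. A minimal sketch of that interface (the names `ground` and `GroundingResult` are illustrative, not taken from any specific library; the model call is stubbed out):

```python
from dataclasses import dataclass


@dataclass
class GroundingResult:
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    score: float                            # model confidence in [0, 1]


def ground(image, query: str) -> GroundingResult:
    """Stub standing in for a VG model: real systems (e.g. TransVG, MDETR)
    reduce to the same image + text -> box interface."""
    # A real model would encode the image and the query jointly and regress
    # a box; here a fixed placeholder keeps the sketch runnable.
    return GroundingResult(box=(0.0, 0.0, 1.0, 1.0), score=1.0)


result = ground(image=None, query="the dog on the left")
```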

Papers

Showing 151–200 of 571 papers

Every paper on this page lists Status "Code" and Hype 1:

  • Fine-Grained Semantically Aligned Vision-Language Pre-Training
  • Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations
  • MixGen: A New Multi-Modal Data Augmentation
  • TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer
  • mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
  • Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning
  • 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection
  • Multi-View Transformer for 3D Visual Grounding
  • SeqTR: A Simple yet Universal Network for Visual Grounding
  • Collaborative Transformers for Grounded Situation Recognition
  • TubeDETR: Spatio-Temporal Video Grounding with Transformers
  • Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
  • Word Discovery in Visually Grounded, Self-Supervised Speech Models
  • Local-Global Context Aware Transformer for Language-Guided Video Segmentation
  • Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
  • REX: Reasoning-aware and Grounded Explanation
  • Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
  • Multi-Modal Dynamic Graph Transformer for Visual Grounding
  • CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision
  • UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
  • Grounded Situation Recognition with Transformers
  • Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
  • CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models
  • Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation
  • Panoptic Narrative Grounding
  • VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer
  • Referring Transformer: A One-step Approach to Multi-task Visual Grounding
  • SAT: 2D Semantics Assisted Training for 3D Visual Grounding
  • Connecting What to Say With Where to Look by Modeling Human Attention Traces
  • MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
  • TransVG: End-to-End Visual Grounding with Transformers
  • Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding
  • Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation
  • Relation-aware Instance Refinement for Weakly Supervised Visual Grounding
  • Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images
  • OCID-Ref: A 3D Robotic Dataset with Embodied Language for Clutter Scene Grounding
  • InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring
  • Panoptic Narrative Grounding
  • Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
  • Text-to-Image Generation Grounded by Fine-Grained User Attention
  • X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
  • Improving One-stage Visual Grounding by Recursive Sub-query Construction
  • Spatially Aware Multimodal Transformers for TextVQA
  • Visual Relation Grounding in Videos
  • Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation
  • Visual Grounding of Learned Physical Models
  • Deep Multimodal Neural Architecture Search
  • Visual Grounding Methods for VQA are Working for the Wrong Reasons!
  • Visual Grounding in Video for Unsupervised Word Translation
  • Guessing State Tracking for Visual Dialogue
Page 4 of 12

Benchmark Results

#  Model                Metric        Claimed  Verified  Status
1  Florence-2-large-ft  Accuracy (%)  95.3     -         Unverified
2  mPLUG-2              Accuracy (%)  92.8     -         Unverified
3  X2-VLM (large)       Accuracy (%)  92.1     -         Unverified
4  XFM (base)           Accuracy (%)  90.4     -         Unverified
5  X2-VLM (base)        Accuracy (%)  90.3     -         Unverified
6  X-VLM (base)         Accuracy (%)  89       -         Unverified
7  HYDRA                IoU           61.7     -         Unverified
8  HYDRA                IoU           61.1     -         Unverified
#  Model                Metric        Claimed  Verified  Status
1  Florence-2-large-ft  Accuracy (%)  92       -         Unverified
2  mPLUG-2              Accuracy (%)  86.05    -         Unverified
3  X2-VLM (large)       Accuracy (%)  81.8     -         Unverified
4  XFM (base)           Accuracy (%)  79.8     -         Unverified
5  X2-VLM (base)        Accuracy (%)  78.4     -         Unverified
6  X-VLM (base)         Accuracy (%)  76.91    -         Unverified
#  Model                Metric        Claimed  Verified  Status
1  Florence-2-large-ft  Accuracy (%)  93.4     -         Unverified
2  mPLUG-2              Accuracy (%)  90.33    -         Unverified
3  X2-VLM (large)       Accuracy (%)  87.6     -         Unverified
4  XFM (base)           Accuracy (%)  86.1     -         Unverified
5  X2-VLM (base)        Accuracy (%)  85.2     -         Unverified
6  X-VLM (base)         Accuracy (%)  84.51    -         Unverified
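In visual grounding leaderboards, Accuracy (%) conventionally means Acc@0.5: a prediction counts as correct when its IoU with the ground-truth box is at least 0.5. A minimal sketch of both metrics (corner-format `(x1, y1, x2, y2)` boxes and the 0.5 threshold are assumptions about the usual convention, not taken from this page):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(preds, gts, thresh=0.5):
    """Percentage of predictions whose IoU with the ground truth reaches thresh."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


# One exact hit, one complete miss -> 50.0 % accuracy.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
acc = grounding_accuracy(preds, gts)  # 50.0
```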