SOTAVerified

Phrase Grounding

Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.

Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
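
The benchmark tables on this page report R@1 and Pointing Game Accuracy. The following is a minimal sketch of how such scores are commonly computed, not the evaluation code of any listed paper. It assumes boxes are given as (x_min, y_min, x_max, y_max), that a predicted box counts as correct at an IoU of at least 0.5 with the ground-truth box, and that a pointing-game prediction counts as a hit when the predicted point lies inside the ground-truth box. The function names and the phrase-to-box dictionaries are hypothetical.

```python
from typing import Dict, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(pred: Dict[str, Box], gold: Dict[str, Box],
                threshold: float = 0.5) -> float:
    """Fraction of gold phrases whose top predicted box reaches the IoU threshold.

    The 0.5 default is an assumption of this sketch.
    """
    if not gold:
        return 0.0
    hits = sum(
        1
        for phrase, gold_box in gold.items()
        if phrase in pred and iou(pred[phrase], gold_box) >= threshold
    )
    return hits / len(gold)


def pointing_game_hit(point: Tuple[float, float], gold_box: Box) -> bool:
    """True when the predicted point falls inside the ground-truth box."""
    x, y = point
    return gold_box[0] <= x <= gold_box[2] and gold_box[1] <= y <= gold_box[3]


# Toy caption with two noun phrases (made-up coordinates).
gold = {"a man": (10, 10, 110, 210), "a red frisbee": (150, 40, 190, 80)}
pred = {"a man": (20, 20, 110, 210), "a red frisbee": (0, 0, 50, 50)}
print(recall_at_1(pred, gold))  # one of the two phrases is grounded correctly
```

In this toy case the first phrase is matched and the second is not, so the score is one half. Averaging per-phrase hits over a whole test set yields a percentage comparable in form to the figures in the tables.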

Papers

Showing 1-50 of 88 papers

| Title | Status | Hype |
|---|---|---|
| GLIPv2: Unifying Localization and Vision-Language Understanding | Code | 4 |
| OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network | Code | 3 |
| Towards Visual Grounding: A Survey | Code | 3 |
| MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | Code | 2 |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Code | 2 |
| Learning Cross-modal Context Graph for Visual Grounding | Code | 1 |
| What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs | Code | 1 |
| A Survey on Interpretable Cross-modal Reasoning | Code | 1 |
| Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | Code | 1 |
| Contrastive Learning for Weakly Supervised Phrase Grounding | Code | 1 |
| Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships | Code | 1 |
| Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation | Code | 1 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Code | 1 |
| MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding | Code | 1 |
| An Open and Comprehensive Pipeline for Unified Object Grounding and Detection | Code | 1 |
| MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | Code | 1 |
| DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding | Code | 1 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Code | 1 |
| Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning | Code | 1 |
| PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | Code | 1 |
| CAVL: Learning Contrastive and Adaptive Representations of Vision and Language | | 0 |
| A Comparison of Object Detection and Phrase Grounding Models in Chest X-ray Abnormality Localization using Eye-tracking Data | | 0 |
| Grounding Plural Phrases: Countering Evaluation Biases by Individuation | | 0 |
| Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension | | 0 |
| How to Understand "Support"? An Implicit-enhanced Causal Inference Approach for Weakly-supervised Phrase Grounding | | 0 |
| Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding | | 0 |
| Knowledge Aided Consistency for Weakly Supervised Phrase Grounding | | 0 |
| Language Features Matter: Effective Language Representations for Vision-Language Tasks | | 0 |
| Learning Deep Structure-Preserving Image-Text Embeddings | | 0 |
| Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding | | 0 |
| LIMITR: Leveraging Local Information for Medical Image-Text Representation | | 0 |
| Lite-MDETR: A Lightweight Multi-Modal Detector | | 0 |
| Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training | | 0 |
| CXR-Agent: Vision-language models for chest X-ray interpretation with uncertainty aware radiology reporting | | 0 |
| Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment | | 0 |
| MedRG: Medical Report Grounding with Multi-modal Large Language Model | | 0 |
| Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment | | 0 |
| Detailed Annotations of Chest X-Rays via CT Projection for Report Understanding | | 0 |
| Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models | | 0 |
| Disentangled Motif-aware Graph Learning for Phrase Grounding | | 0 |
| Neural Sequential Phrase Grounding (SeqGROUND) | | 0 |
| Dynamic Conditional Networks for Few-Shot Learning | | 0 |
| Phrase Grounding-based Style Transfer for Single-Domain Generalized Object Detection | | 0 |
| ELVIS: Empowering Locality of Vision Language Pre-training with Intra-modal Similarity | | 0 |
| PIRC Net : Using Proposal Indexing, Relationships and Context for Phrase Grounding | | 0 |
| Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data | | 0 |
| Progressive Local Alignment for Medical Multimodal Pre-training | | 0 |
| Propagating Over Phrase Relations for One-Stage Visual Grounding | | 0 |
| Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM | | 0 |
| Query-guided Regression Network with Context Policy for Phrase Grounding | | 0 |

Benchmark Results

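| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GLIPv2 | R@1 | 87.7 | | Unverified |
| 2 | FIBER-B | R@1 | 87.4 | | Unverified |
| 3 | GLIP | R@1 | 87.1 | | Unverified |
| 4 | PEVL | R@1 | 84.4 | | Unverified |
| 5 | MDETR-ENB5 | R@1 | 84.3 | | Unverified |
| 6 | DIGN | R@1 | 78.73 | | Unverified |
| 7 | LCMCG | R@1 | 76.74 | | Unverified |
| 8 | Soft-Label Chain CRF (SL-CCRF) | R@1 | 74.69 | | Unverified |
| 9 | DDPN (ResNet-101) | R@1 | 73.3 | | Unverified |
| 10 | VisualBERT | R@1 | 71.33 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GBS Ensemble + 12-in-1 | Pointing Game Accuracy | 85.9 | | Unverified |
| 2 | GbS Ensemble MS-COCO | Pointing Game Accuracy | 75.6 | | Unverified |
| 3 | COCO_ELMo_PNASNet | Pointing Game Accuracy | 69.19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Fiber-B | R@1 | 87.1 | | Unverified |
| 2 | PEVL | R@1 | 84.1 | | Unverified |
| 3 | VisualBERT | R@1 | 70.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VG_BiLSTM_VGG | Pointing Game Accuracy | 62.76 | | Unverified |
| 2 | GbS Ensemble MS-COCO | Pointing Game Accuracy | 58.21 | | Unverified |
| 3 | MCB | Accuracy | 28.91 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GbS VG | Pointing Game Accuracy | 55.91 | | Unverified |
| 2 | VG_ELMo_PNASNet | Pointing Game Accuracy | 55.16 | | Unverified |
| 3 | GbS Ensemble MS-COCO | Pointing Game Accuracy | 54.55 | | Unverified |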