Phrase Grounding

Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.

Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
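
To make the task and the R@1 metric used in the benchmark tables concrete, here is a minimal sketch in Python. It assumes the common convention that a phrase counts as correctly grounded when its top-ranked predicted box overlaps the reference box with an intersection over union (IoU) of at least 0.5. The caption, phrases, and box coordinates are hypothetical illustration data, and `iou` and `recall_at_1` are illustrative helper names, not taken from any listed paper.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(predictions: Dict[str, Box],
                ground_truth: Dict[str, Box],
                threshold: float = 0.5) -> float:
    """Fraction of phrases whose top-ranked box matches the reference box."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for phrase, gt_box in ground_truth.items()
        if phrase in predictions and iou(predictions[phrase], gt_box) >= threshold
    )
    return hits / len(ground_truth)


# Hypothetical caption: "A man in a red hat holds a dog."
# Each noun phrase maps to one reference box and one top-ranked prediction.
ground_truth = {
    "a man": (10, 20, 110, 220),
    "a red hat": (40, 20, 80, 50),
    "a dog": (120, 150, 200, 220),
}
predictions = {
    "a man": (15, 25, 105, 215),     # large overlap: counted as a hit
    "a red hat": (60, 10, 120, 60),  # small overlap: counted as a miss
    "a dog": (118, 148, 205, 225),   # large overlap: counted as a hit
}

score = recall_at_1(predictions, ground_truth)  # 2 of 3 phrases grounded
```

The benchmark tables report this quantity as a percentage, aggregated over every annotated phrase in the evaluation split rather than over a single caption.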

Papers

Showing 51–75 of 88 papers

Title	Date	Tasks	Status
Read, look and detect: Bounding box annotation from image-caption pairs	Jun 9, 2023	Object, Object Detection	Unverified
ELVIS: Empowering Locality of Vision Language Pre-training with Intra-modal Similarity	Apr 11, 2023	Phrase Grounding	Unverified
Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications	Aug 30, 2023	Decoder, Object Detection	Unverified
Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection	Mar 17, 2023	Attribute, Contrastive Learning	Unverified
Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling	Sep 29, 2021	Contrastive Learning, Phrase Grounding	Unverified
Utilizing Every Image Object for Semi-supervised Phrase Grounding	Nov 5, 2020	Phrase Grounding, Referring Expression	Unverified
ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation	Dec 12, 2024	Phrase Grounding, Question Answering	Unverified
Enhancing the vision-language foundation model with key semantic knowledge-emphasized report refinement	Jan 21, 2024	Medical Image Analysis, Phrase Grounding	Unverified
Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models	Apr 19, 2024	Contrastive Learning, Phrase Grounding	Code Available
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models	Sep 6, 2023	Phrase Grounding	Code Available
Anatomical grounding pre-training for medical phrase grounding	Feb 23, 2025	Phrase Grounding, Zero-Shot Learning	Code Available
Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models	Nov 5, 2023	Data Augmentation, Phrase Grounding	Code Available
Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks	Sep 7, 2023	Object Discovery, Phrase Grounding	Code Available
Conditional Image-Text Embedding Networks	Nov 22, 2017	Phrase Grounding	Code Available
Context-Infused Visual Grounding for Art	Oct 16, 2024	Object Detection	Code Available
Detector-Free Weakly Supervised Grounding by Separation	Apr 20, 2021	Phrase Grounding	Code Available
Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures	May 16, 2025	Coreference Resolution	Code Available
Empathic Grounding: Explorations using Multimodal Interaction and Large Language Models with Conversational Agents	Jul 1, 2024	Emotional Intelligence, Emotion Classification	Code Available
Extending Phrase Grounding with Pronouns in Visual Dialogues	Oct 23, 2022	Phrase Grounding	Code Available
Grounding of Textual Phrases in Images by Reconstruction	Nov 12, 2015	Language Modeling	Code Available
Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing	Jan 11, 2023	Phrase Grounding, Self-Supervised Learning	Code Available
Learning to ground medical text in a 3D human atlas	Nov 1, 2020	Phrase Grounding, Visual Grounding	Code Available
A Lightweight Modular Framework for Low-Cost Open-Vocabulary Object Detection Training	Aug 20, 2024	Autonomous Vehicles, Computational Efficiency	Code Available
Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge	Oct 23, 2023	Phrase Grounding, World Knowledge	Code Available
Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing	Apr 21, 2022	Contrastive Learning, Language Modeling	Code Available

Datasets: Flickr30k Entities Test, Flickr30k, Flickr30k Entities Dev, ReferIt, Visual Genome

Benchmark Results

Flickr30k Entities Test

#	Model	Metric	Claimed	Verified	Status
1	GLIPv2	R@1	87.7	—	Unverified
2	FIBER-B	R@1	87.4	—	Unverified
3	GLIP	R@1	87.1	—	Unverified
4	PEVL	R@1	84.4	—	Unverified
5	MDETR-ENB5	R@1	84.3	—	Unverified
6	DIGN	R@1	78.73	—	Unverified
7	LCMCG	R@1	76.74	—	Unverified
8	Soft-Label Chain CRF (SL-CCRF)	R@1	74.69	—	Unverified
9	DDPN (ResNet-101)	R@1	73.3	—	Unverified
10	VisualBERT	R@1	71.33	—	Unverified

Flickr30k

#	Model	Metric	Claimed	Verified	Status
1	GbS Ensemble + 12-in-1	Pointing Game Accuracy	85.9	—	Unverified
2	GbS Ensemble MS-COCO	Pointing Game Accuracy	75.6	—	Unverified
3	COCO_ELMo_PNASNet	Pointing Game Accuracy	69.19	—	Unverified

Flickr30k Entities Dev

#	Model	Metric	Claimed	Verified	Status
1	FIBER-B	R@1	87.1	—	Unverified
2	PEVL	R@1	84.1	—	Unverified
3	VisualBERT	R@1	70.4	—	Unverified

ReferIt

#	Model	Metric	Claimed	Verified	Status
1	VG_BiLSTM_VGG	Pointing Game Accuracy	62.76	—	Unverified
2	GbS Ensemble MS-COCO	Pointing Game Accuracy	58.21	—	Unverified
3	MCB	Accuracy	28.91	—	Unverified

Visual Genome

#	Model	Metric	Claimed	Verified	Status
1	GbS VG	Pointing Game Accuracy	55.91	—	Unverified
2	VG_ELMo_PNASNet	Pointing Game Accuracy	55.16	—	Unverified
3	GbS Ensemble MS-COCO	Pointing Game Accuracy	54.55	—	Unverified
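
Several of the tables above report Pointing Game Accuracy instead of R@1. As a rough sketch, assuming the usual formulation in which a model outputs a heatmap per phrase and scores a hit when the heatmap's highest-scoring pixel falls inside the reference box, the metric can be computed as below. The heatmaps and boxes are hypothetical, `peak_location` and `pointing_game_accuracy` are illustrative names, and individual papers may differ in details such as pixel tolerance or handling of multiple reference boxes.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]      # (x1, y1, x2, y2), inclusive pixel coordinates
Heatmap = Sequence[Sequence[float]]  # rows of scores, indexed as heatmap[y][x]


def peak_location(heatmap: Heatmap) -> Tuple[int, int]:
    """Return (x, y) of the highest-scoring pixel (the first one on ties)."""
    best_xy, best_score = (0, 0), float("-inf")
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_xy, best_score = (x, y), score
    return best_xy


def pointing_game_accuracy(samples: List[Tuple[Heatmap, Box]]) -> float:
    """Fraction of phrases whose heatmap peak lies inside the reference box."""
    if not samples:
        return 0.0
    hits = 0
    for heatmap, (x1, y1, x2, y2) in samples:
        x, y = peak_location(heatmap)
        if x1 <= x <= x2 and y1 <= y <= y2:
            hits += 1
    return hits / len(samples)


# Two hypothetical 3x4 heatmaps, each paired with a reference box.
hit_map = [[0.1, 0.2, 0.1, 0.0],
           [0.0, 0.9, 0.3, 0.1],
           [0.0, 0.1, 0.2, 0.0]]   # peak at (x=1, y=1)
miss_map = [[0.0, 0.1, 0.0, 0.8],
            [0.1, 0.2, 0.1, 0.0],
            [0.0, 0.0, 0.1, 0.0]]  # peak at (x=3, y=0)

samples = [(hit_map, (0, 0, 2, 2)), (miss_map, (0, 1, 2, 2))]
accuracy = pointing_game_accuracy(samples)  # 1 hit out of 2 phrases
```

Because the pointing game only checks a single point, it suits weakly supervised and detector-free methods that produce heatmaps rather than boxes, and its scores are not directly comparable with R@1.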