Phrase Grounding

Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.

Source: Phrase Grounding by Soft-Label Chain Conditional Random Field

Papers

Showing 26–50 of 88 papers

Title	Date	Tasks	Status	Hype
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models	Nov 22, 2023	Benchmarking, Phrase Grounding	Code Available	2
Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models	Nov 5, 2023	Data Augmentation, Phrase Grounding	Code Available	0
Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge	Oct 23, 2023	Phrase Grounding, World Knowledge	Code Available	0
Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning	Sep 12, 2023	Contrastive Learning, Medical Image Analysis	Code Available	1
Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks	Sep 7, 2023	Object Discovery, Phrase Grounding	Code Available	0
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models	Sep 6, 2023	Phrase Grounding	Code Available	0
A Survey on Interpretable Cross-modal Reasoning	Sep 5, 2023	Cross-Modal Retrieval, Decision Making	Code Available	1
Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications	Aug 30, 2023	Decoder, object-detection	Unverified	0
Kosmos-2: Grounding Multimodal Large Language Models to the World	Jun 26, 2023	Image Captioning, In-Context Learning	Code Available	1
Read, look and detect: Bounding box annotation from image-caption pairs	Jun 9, 2023	Object, object-detection	Unverified	0
ELVIS: Empowering Locality of Vision Language Pre-training with Intra-modal Similarity	Apr 11, 2023	Phrase Grounding	Unverified	0
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language	Apr 10, 2023	Image Retrieval, Phrase Grounding	Unverified	0
Trade-offs in Fine-tuned Diffusion Models Between Accuracy and Interpretability	Mar 31, 2023	Conditional Image Generation, Image Generation	Code Available	0
LIMITR: Leveraging Local Information for Medical Image-Text Representation	Mar 21, 2023	Image Retrieval, Phrase Grounding	Unverified	0
Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection	Mar 17, 2023	Attribute, Contrastive Learning	Unverified	0
Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment	Mar 14, 2023	Medical Image Analysis, Phrase Grounding	Unverified	0
Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing	Jan 11, 2023	Phrase Grounding, Self-Supervised Learning	Code Available	0
Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding	Jan 1, 2023	Phrase Grounding	Code Available	0
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding	Nov 28, 2022	object-detection, Object Detection	Code Available	1
Extending Phrase Grounding with Pronouns in Visual Dialogues	Oct 23, 2022	Phrase Grounding	Code Available	0
Detailed Annotations of Chest X-Rays via CT Projection for Report Understanding	Oct 7, 2022	Anatomy, Phrase Grounding	Unverified	0
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network	Sep 10, 2022	Continual Learning, Object	Code Available	3
What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs	Jun 19, 2022	Benchmarking, Image Captioning	Code Available	1
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone	Jun 15, 2022	Described Object Detection, Image Captioning	Code Available	1
GLIPv2: Unifying Localization and Vision-Language Understanding	Jun 12, 2022	2D Object Detection, Contrastive Learning	Code Available	4

Datasets: Flickr30k Entities Test, Flickr30k, Flickr30k Entities Dev, ReferIt, Visual Genome

Benchmark Results

Dataset: Flickr30k Entities Test
#	Model	Metric	Claimed	Verified	Status
1	GLIPv2	R@1	87.7	—	Unverified
2	FIBER-B	R@1	87.4	—	Unverified
3	GLIP	R@1	87.1	—	Unverified
4	PEVL	R@1	84.4	—	Unverified
5	MDETR-ENB5	R@1	84.3	—	Unverified
6	DIGN	R@1	78.73	—	Unverified
7	LCMCG	R@1	76.74	—	Unverified
8	Soft-Label Chain CRF (SL-CCRF)	R@1	74.69	—	Unverified
9	DDPN (ResNet-101)	R@1	73.3	—	Unverified
10	VisualBERT	R@1	71.33	—	Unverified
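
The page does not define R@1. A minimal sketch, assuming the convention commonly used for box-level phrase grounding: a phrase counts as correctly grounded when the top-ranked predicted box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5. The names `iou` and `recall_at_1`, the `(x1, y1, x2, y2)` box layout, and the example phrases are illustrative assumptions, not taken from any listed paper.

```python
from typing import Dict, Tuple

# Assumed box layout: (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(
    predictions: Dict[str, Box],
    ground_truth: Dict[str, Box],
    iou_threshold: float = 0.5,  # assumed threshold, not stated on the page
) -> float:
    """Percentage of phrases whose top-1 box meets the IoU threshold."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for phrase, gt_box in ground_truth.items()
        if phrase in predictions
        and iou(predictions[phrase], gt_box) >= iou_threshold
    )
    return 100.0 * hits / len(ground_truth)


# Hypothetical caption with two noun phrases.
ground_truth = {"a dog": (10, 10, 110, 110), "a red ball": (200, 50, 260, 110)}
predictions = {"a dog": (20, 20, 120, 120), "a red ball": (0, 0, 50, 50)}
print(recall_at_1(predictions, ground_truth))  # 50.0
```

Here the first phrase is a hit (IoU of about 0.68) and the second is a miss (no overlap), so the score is 50.0. A phrase with no prediction is counted as a miss rather than skipped, which is one possible design choice; benchmarks may handle such cases differently.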

Dataset: Flickr30k
#	Model	Metric	Claimed	Verified	Status
1	GbS Ensemble + 12-in-1	Pointing Game Accuracy	85.9	—	Unverified
2	GbS Ensemble MS-COCO	Pointing Game Accuracy	75.6	—	Unverified
3	COCO_ELMo_PNASNet	Pointing Game Accuracy	69.19	—	Unverified
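
Pointing Game Accuracy is likewise not defined on the page. A minimal sketch, assuming the usual formulation for heatmap-based grounding: the model produces a score map over the image for each phrase, and a sample is a hit when the highest-scoring pixel falls inside the ground-truth box. The function names, the `[y][x]` heatmap indexing, and the inclusive pixel-coordinate box are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Heatmap = Sequence[Sequence[float]]  # assumed indexing: heatmap[y][x]
PixelBox = Tuple[int, int, int, int]  # assumed inclusive (x1, y1, x2, y2)


def argmax_2d(heatmap: Heatmap) -> Tuple[int, int]:
    """Return (x, y) of the first maximum-score pixel in row-major order."""
    best_val = float("-inf")
    best_xy = (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_val:
                best_val = value
                best_xy = (x, y)
    return best_xy


def pointing_game_accuracy(samples: List[Tuple[Heatmap, PixelBox]]) -> float:
    """Percentage of samples whose peak pixel lies inside the target box."""
    if not samples:
        return 0.0
    hits = 0
    for heatmap, (x1, y1, x2, y2) in samples:
        x, y = argmax_2d(heatmap)
        if x1 <= x <= x2 and y1 <= y <= y2:
            hits += 1
    return 100.0 * hits / len(samples)


# Two hypothetical phrase heatmaps with their ground-truth boxes.
samples = [
    ([[0.1, 0.2, 0.1], [0.1, 0.9, 0.3], [0.0, 0.1, 0.2]], (1, 1, 2, 2)),  # peak at (1, 1): hit
    ([[0.8, 0.1], [0.1, 0.2]], (1, 1, 1, 1)),  # peak at (0, 0): miss
]
print(pointing_game_accuracy(samples))  # 50.0
```

Because this metric only checks a single point, it does not require the model to predict a box at all, which is why it appears for weakly supervised methods in the tables here while box-predicting models report R@1.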

Dataset: Flickr30k Entities Dev
#	Model	Metric	Claimed	Verified	Status
1	FIBER-B	R@1	87.1	—	Unverified
2	PEVL	R@1	84.1	—	Unverified
3	VisualBERT	R@1	70.4	—	Unverified

Dataset: ReferIt
#	Model	Metric	Claimed	Verified	Status
1	VG_BiLSTM_VGG	Pointing Game Accuracy	62.76	—	Unverified
2	GbS Ensemble MS-COCO	Pointing Game Accuracy	58.21	—	Unverified
3	MCB	Accuracy	28.91	—	Unverified

Dataset: Visual Genome
#	Model	Metric	Claimed	Verified	Status
1	GbS VG	Pointing Game Accuracy	55.91	—	Unverified
2	VG_ELMo_PNASNet	Pointing Game Accuracy	55.16	—	Unverified
3	GbS Ensemble MS-COCO	Pointing Game Accuracy	54.55	—	Unverified