SOTAVerified

Phrase Grounding

Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.

Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
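
The benchmark tables on this page report R@1 and Pointing Game Accuracy, but the page does not define either metric. The following is a minimal sketch of how they are commonly computed, not the evaluation code of any listed paper. It assumes a prediction counts as correct for R@1 when its box overlaps the ground-truth box with IoU of at least 0.5 (a common convention, assumed here), and that a pointing-game hit means the predicted point falls inside the ground-truth box. The function names, the box format `(x1, y1, x2, y2)`, and the example caption and coordinates are all hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical pixel format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(pred: Dict[str, Box], gold: Dict[str, Box],
                threshold: float = 0.5) -> float:
    """Fraction of gold phrases whose top predicted box reaches the IoU threshold.

    The 0.5 default is an assumed convention, not taken from this page.
    """
    if not gold:
        return 0.0
    hits = sum(
        1 for phrase, box in gold.items()
        if phrase in pred and iou(pred[phrase], box) >= threshold
    )
    return hits / len(gold)


def pointing_hit(point: Tuple[float, float], box: Box) -> bool:
    """Pointing game: the predicted point must lie inside the gold box."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


# Hypothetical caption: "A man throws a frisbee to a dog"
gold = {
    "A man": (10, 20, 110, 220),
    "a frisbee": (150, 60, 190, 90),
    "a dog": (220, 140, 330, 230),
}
pred = {
    "A man": (15, 25, 105, 215),      # close to the gold box
    "a frisbee": (300, 10, 340, 40),  # misses the gold box entirely
    "a dog": (225, 150, 335, 235),    # close to the gold box
}
score = recall_at_1(pred, gold)  # two of the three phrases are grounded correctly
```

Each noun phrase is scored independently, so a caption with three phrases contributes three evaluation instances rather than one.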

Papers

Showing 51-75 of 88 papers

| Title | Status | Hype |
|---|---|---|
| Read, look and detect: Bounding box annotation from image-caption pairs | | 0 |
| ELVIS: Empowering Locality of Vision Language Pre-training with Intra-modal Similarity | | 0 |
| Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications | | 0 |
| Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection | | 0 |
| Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling | | 0 |
| Utilizing Every Image Object for Semi-supervised Phrase Grounding | | 0 |
| ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation | | 0 |
| Enhancing the vision-language foundation model with key semantic knowledge-emphasized report refinement | | 0 |
| Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models | Code | 0 |
| A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models | Code | 0 |
| Anatomical grounding pre-training for medical phrase grounding | Code | 0 |
| Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models | Code | 0 |
| Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks | Code | 0 |
| Conditional Image-Text Embedding Networks | Code | 0 |
| Context-Infused Visual Grounding for Art | Code | 0 |
| Detector-Free Weakly Supervised Grounding by Separation | Code | 0 |
| Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures | Code | 0 |
| Empathic Grounding: Explorations using Multimodal Interaction and Large Language Models with Conversational Agents | Code | 0 |
| Extending Phrase Grounding with Pronouns in Visual Dialogues | Code | 0 |
| Grounding of Textual Phrases in Images by Reconstruction | Code | 0 |
| Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing | Code | 0 |
| Learning to ground medical text in a 3D human atlas | Code | 0 |
| A Lightweight Modular Framework for Low-Cost Open-Vocabulary Object Detection Training | Code | 0 |
| Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge | Code | 0 |
| Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing | Code | 0 |

Benchmark Results

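| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GLIPv2 | R@1 | 87.7 | | Unverified |
| 2 | FIBER-B | R@1 | 87.4 | | Unverified |
| 3 | GLIP | R@1 | 87.1 | | Unverified |
| 4 | PEVL | R@1 | 84.4 | | Unverified |
| 5 | MDETR-ENB5 | R@1 | 84.3 | | Unverified |
| 6 | DIGN | R@1 | 78.73 | | Unverified |
| 7 | LCMCG | R@1 | 76.74 | | Unverified |
| 8 | Soft-Label Chain CRF (SL-CCRF) | R@1 | 74.69 | | Unverified |
| 9 | DDPN (ResNet-101) | R@1 | 73.3 | | Unverified |
| 10 | VisualBERT | R@1 | 71.33 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GBS Ensemble + 12-in-1 | Pointing Game Accuracy | 85.9 | | Unverified |
| 2 | GbS Ensemble MS-COCO | Pointing Game Accuracy | 75.6 | | Unverified |
| 3 | COCO_ELMo_PNASNet | Pointing Game Accuracy | 69.19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Fiber-B | R@1 | 87.1 | | Unverified |
| 2 | PEVL | R@1 | 84.1 | | Unverified |
| 3 | VisualBERT | R@1 | 70.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VG_BiLSTM_VGG | Pointing Game Accuracy | 62.76 | | Unverified |
| 2 | GbS Ensemble MS-COCO | Pointing Game Accuracy | 58.21 | | Unverified |
| 3 | MCB | Accuracy | 28.91 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GbS VG | Pointing Game Accuracy | 55.91 | | Unverified |
| 2 | VG_ELMo_PNASNet | Pointing Game Accuracy | 55.16 | | Unverified |
| 3 | GbS Ensemble MS-COCO | Pointing Game Accuracy | 54.55 | | Unverified |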