SOTAVerified

Phrase Grounding

Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.

Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
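
As a rough illustration of the task and of the R@1 metric that appears in the benchmark results on this page, the sketch below scores one predicted box per phrase against its ground-truth box. The box format `(x1, y1, x2, y2)`, the function names, the example caption, and the IoU threshold of 0.5 are all assumptions made for illustration; the threshold is a common convention in the phrase grounding literature, not something this page states.

```python
from typing import Dict, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(pred: Dict[str, Box], gold: Dict[str, Box],
                thresh: float = 0.5) -> float:
    """Fraction of phrases whose top-1 predicted box has IoU >= thresh
    with the ground-truth box (threshold is an assumed convention)."""
    if not gold:
        return 0.0
    hits = sum(
        1 for phrase, gt_box in gold.items()
        if phrase in pred and iou(pred[phrase], gt_box) >= thresh
    )
    return hits / len(gold)


# Made-up example for the caption "A man throws a frisbee to a dog".
gold_boxes = {
    "A man": (10.0, 20.0, 110.0, 220.0),
    "a frisbee": (150.0, 60.0, 190.0, 90.0),
    "a dog": (300.0, 150.0, 420.0, 260.0),
}
pred_boxes = {
    "A man": (12.0, 25.0, 108.0, 215.0),      # close to the ground truth
    "a frisbee": (150.0, 60.0, 190.0, 90.0),  # exact match
    "a dog": (0.0, 0.0, 50.0, 50.0),          # wrong region
}
score = recall_at_1(pred_boxes, gold_boxes)  # two of three phrases grounded
```

Published benchmarks may differ in details such as how phrases with several ground-truth boxes are handled, so this sketch should not be read as the evaluation protocol of any listed paper.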

Papers

Showing 1-50 of 88 papers

| Title | Status | Hype |
|---|---|---|
| GLIPv2: Unifying Localization and Vision-Language Understanding | Code | 4 |
| OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network | Code | 3 |
| Towards Visual Grounding: A Survey | Code | 3 |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Code | 2 |
| MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | Code | 2 |
| Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning | Code | 1 |
| Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation | Code | 1 |
| Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | Code | 1 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Code | 1 |
| An Open and Comprehensive Pipeline for Unified Object Grounding and Detection | Code | 1 |
| A Survey on Interpretable Cross-modal Reasoning | Code | 1 |
| PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | Code | 1 |
| Contrastive Learning for Weakly Supervised Phrase Grounding | Code | 1 |
| MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | Code | 1 |
| Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships | Code | 1 |
| What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs | Code | 1 |
| MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding | Code | 1 |
| DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding | Code | 1 |
| Learning Cross-modal Context Graph for Visual Grounding | Code | 1 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Code | 1 |
| Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models | Code | 0 |
| A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models | Code | 0 |
| Anatomical grounding pre-training for medical phrase grounding | Code | 0 |
| Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models | Code | 0 |
| Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks | Code | 0 |
| Conditional Image-Text Embedding Networks | Code | 0 |
| Context-Infused Visual Grounding for Art | Code | 0 |
| Detector-Free Weakly Supervised Grounding by Separation | Code | 0 |
| Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures | Code | 0 |
| Empathic Grounding: Explorations using Multimodal Interaction and Large Language Models with Conversational Agents | Code | 0 |
| Extending Phrase Grounding with Pronouns in Visual Dialogues | Code | 0 |
| Grounding of Textual Phrases in Images by Reconstruction | Code | 0 |
| Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing | Code | 0 |
| Learning to ground medical text in a 3D human atlas | Code | 0 |
| A Lightweight Modular Framework for Low-Cost Open-Vocabulary Object Detection Training | Code | 0 |
| Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge | Code | 0 |
| Making the Most of Text Semantics to Improve Biomedical Vision--Language Processing | Code | 0 |
| Modularized Textual Grounding for Counterfactual Resilience | Code | 0 |
| Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding | Code | 0 |
| Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding | Code | 0 |
| Natural Language Object Retrieval | Code | 0 |
| Revisiting Image-Language Networks for Open-ended Phrase Detection | Code | 0 |
| Trade-offs in Fine-tuned Diffusion Models Between Accuracy and Interpretability | Code | 0 |
| Phrase Grounding by Soft-Label Chain Conditional Random Field | Code | 0 |
| Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding | Code | 0 |
| Neural Parameter Allocation Search | Code | 0 |
| Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding | Code | 0 |
| Transformer with Controlled Attention for Synchronous Motion Captioning | Code | 0 |
| VICCA: Visual Interpretation and Comprehension of Chest X-ray Anomalies in Generated Report Without Human Feedback | Code | 0 |
| Zero-Shot Grounding of Objects from Natural Language Queries | Code | 0 |

Benchmark Results

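| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GLIPv2 | R@1 | 87.7 | | Unverified |
| 2 | FIBER-B | R@1 | 87.4 | | Unverified |
| 3 | GLIP | R@1 | 87.1 | | Unverified |
| 4 | PEVL | R@1 | 84.4 | | Unverified |
| 5 | MDETR-ENB5 | R@1 | 84.3 | | Unverified |
| 6 | DIGN | R@1 | 78.73 | | Unverified |
| 7 | LCMCG | R@1 | 76.74 | | Unverified |
| 8 | Soft-Label Chain CRF (SL-CCRF) | R@1 | 74.69 | | Unverified |
| 9 | DDPN (ResNet-101) | R@1 | 73.3 | | Unverified |
| 10 | VisualBERT | R@1 | 71.33 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GBS Ensemble + 12-in-1 | Pointing Game Accuracy | 85.9 | | Unverified |
| 2 | GbS Ensemble MS-COCO | Pointing Game Accuracy | 75.6 | | Unverified |
| 3 | COCO_ELMo_PNASNet | Pointing Game Accuracy | 69.19 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Fiber-B | R@1 | 87.1 | | Unverified |
| 2 | PEVL | R@1 | 84.1 | | Unverified |
| 3 | VisualBERT | R@1 | 70.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VG_BiLSTM_VGG | Pointing Game Accuracy | 62.76 | | Unverified |
| 2 | GbS Ensemble MS-COCO | Pointing Game Accuracy | 58.21 | | Unverified |
| 3 | MCB | Accuracy | 28.91 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GbS VG | Pointing Game Accuracy | 55.91 | | Unverified |
| 2 | VG_ELMo_PNASNet | Pointing Game Accuracy | 55.16 | | Unverified |
| 3 | GbS Ensemble MS-COCO | Pointing Game Accuracy | 54.55 | | Unverified |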
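
Several of the results above report Pointing Game Accuracy. As commonly described, a prediction counts as a hit when the highest-scoring location of a model's heatmap for a phrase falls inside the ground-truth box, and accuracy is the fraction of hits. The sketch below illustrates that idea under assumptions: the heatmap is a plain list of rows, the box is `(x1, y1, x2, y2)` with inclusive pixel bounds, and all function names and data are hypothetical rather than taken from any listed paper.

```python
from typing import List, Sequence, Tuple

Heatmap = List[List[float]]       # heatmap[row][col], assumed layout
Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2), inclusive, assumed


def argmax_location(heatmap: Heatmap) -> Tuple[int, int]:
    """Return (row, col) of the first maximum value in the heatmap."""
    best_value = float("-inf")
    best_rc = (0, 0)
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > best_value:
                best_value, best_rc = value, (r, c)
    return best_rc


def pointing_game_hit(heatmap: Heatmap, box: Box) -> bool:
    """True when the heatmap peak lies inside the ground-truth box."""
    r, c = argmax_location(heatmap)
    x1, y1, x2, y2 = box
    return x1 <= c <= x2 and y1 <= r <= y2


def pointing_game_accuracy(samples: Sequence[Tuple[Heatmap, Box]]) -> float:
    """Fraction of (heatmap, box) pairs that score a hit."""
    if not samples:
        return 0.0
    hits = sum(1 for heatmap, box in samples if pointing_game_hit(heatmap, box))
    return hits / len(samples)


# Made-up 3x4 heatmap whose peak is at row 1, column 1.
example_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.0, 0.9, 0.3, 0.1],
    [0.0, 0.1, 0.2, 0.4],
]
example_samples = [
    (example_map, (1, 0, 2, 1)),  # peak inside the box: hit
    (example_map, (2, 2, 3, 2)),  # peak outside the box: miss
]
example_accuracy = pointing_game_accuracy(example_samples)
```

Individual papers may resize heatmaps, skip boxes that cover most of the image, or treat ties differently, so the numbers in the tables are not necessarily reproducible with this simplified version.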