SOTAVerified

Phrase Grounding

Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.

Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
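
As a rough illustration of the task definition above and of the R@1 metric reported in the benchmark results, the sketch below maps each noun phrase to one predicted box and counts a phrase as correct when that box overlaps the annotated box with IoU at or above a threshold. The threshold of 0.5, the box format, the `recall_at_1` helper, and the example caption and coordinates are all assumptions made for illustration; individual papers may use different evaluation protocols.

```python
from typing import Dict, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(pred: Dict[str, Box], gold: Dict[str, Box],
                threshold: float = 0.5) -> float:
    """Fraction of annotated phrases whose top predicted box reaches the
    IoU threshold (hypothetical helper; threshold is an assumed default)."""
    if not gold:
        return 0.0
    hits = sum(
        1 for phrase, gold_box in gold.items()
        if phrase in pred and iou(pred[phrase], gold_box) >= threshold
    )
    return hits / len(gold)


# Hypothetical caption: "A man in a red hat throws a frisbee."
gold = {
    "A man": (10, 20, 110, 220),
    "a red hat": (40, 20, 80, 50),
    "a frisbee": (200, 60, 240, 90),
}
pred = {
    "A man": (12, 25, 108, 215),
    "a red hat": (60, 30, 100, 60),   # poorly localized on purpose
    "a frisbee": (198, 58, 242, 92),
}

score = recall_at_1(pred, gold)
print(f"R@1 = {score:.3f}")
```

In this toy example two of the three phrases are localized well enough to count, so the score is 2/3. Reported leaderboard numbers are the same kind of fraction expressed as a percentage.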

Papers

Showing 1-50 of 88 papers

Title | Status | Hype
GLIPv2: Unifying Localization and Vision-Language Understanding | Code | 4
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network | Code | 3
Towards Visual Grounding: A Survey | Code | 3
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Code | 2
MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | Code | 2
Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning | Code | 1
Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation | Code | 1
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | Code | 1
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Code | 1
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection | Code | 1
A Survey on Interpretable Cross-modal Reasoning | Code | 1
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | Code | 1
Contrastive Learning for Weakly Supervised Phrase Grounding | Code | 1
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | Code | 1
Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships | Code | 1
What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs | Code | 1
MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding | Code | 1
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding | Code | 1
Learning Cross-modal Context Graph for Visual Grounding | Code | 1
Kosmos-2: Grounding Multimodal Large Language Models to the World | Code | 1
Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models | Code | 0
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models | Code | 0
Anatomical grounding pre-training for medical phrase grounding | Code | 0
Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models | Code | 0
Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks | Code | 0
Conditional Image-Text Embedding Networks | Code | 0
Context-Infused Visual Grounding for Art | Code | 0
Detector-Free Weakly Supervised Grounding by Separation | Code | 0
Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures | Code | 0
Empathic Grounding: Explorations using Multimodal Interaction and Large Language Models with Conversational Agents | Code | 0
Extending Phrase Grounding with Pronouns in Visual Dialogues | Code | 0
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring | Code | 0
Grounding of Textual Phrases in Images by Reconstruction | Code | 0
Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing | Code | 0
Learning to ground medical text in a 3D human atlas | Code | 0
A Lightweight Modular Framework for Low-Cost Open-Vocabulary Object Detection Training | Code | 0
Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge | Code | 0
Making the Most of Text Semantics to Improve Biomedical Vision--Language Processing | Code | 0
Modularized Textual Grounding for Counterfactual Resilience | Code | 0
Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding | Code | 0
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding | Code | 0
Natural Language Object Retrieval | Code | 0
Revisiting Image-Language Networks for Open-ended Phrase Detection | Code | 0
Trade-offs in Fine-tuned Diffusion Models Between Accuracy and Interpretability | Code | 0
Phrase Grounding by Soft-Label Chain Conditional Random Field | Code | 0
Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding | Code | 0
Neural Parameter Allocation Search | Code | 0
Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding | Code | 0
Transformer with Controlled Attention for Synchronous Motion Captioning | Code | 0
VICCA: Visual Interpretation and Comprehension of Chest X-ray Anomalies in Generated Report Without Human Feedback | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GLIPv2 | R@1 | 87.7 |  | Unverified
2 | FIBER-B | R@1 | 87.4 |  | Unverified
3 | GLIP | R@1 | 87.1 |  | Unverified
4 | PEVL | R@1 | 84.4 |  | Unverified
5 | MDETR-ENB5 | R@1 | 84.3 |  | Unverified
6 | DIGN | R@1 | 78.73 |  | Unverified
7 | LCMCG | R@1 | 76.74 |  | Unverified
8 | Soft-Label Chain CRF (SL-CCRF) | R@1 | 74.69 |  | Unverified
9 | DDPN (ResNet-101) | R@1 | 73.3 |  | Unverified
10 | VisualBERT | R@1 | 71.33 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GBS Ensemble + 12-in-1 | Pointing Game Accuracy | 85.9 |  | Unverified
2 | GbS Ensemble MS-COCO | Pointing Game Accuracy | 75.6 |  | Unverified
3 | COCO_ELMo_PNASNet | Pointing Game Accuracy | 69.19 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Fiber-B | R@1 | 87.1 |  | Unverified
2 | PEVL | R@1 | 84.1 |  | Unverified
3 | VisualBERT | R@1 | 70.4 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VG_BiLSTM_VGG | Pointing Game Accuracy | 62.76 |  | Unverified
2 | GbS Ensemble MS-COCO | Pointing Game Accuracy | 58.21 |  | Unverified
3 | MCB | Accuracy | 28.91 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GbS VG | Pointing Game Accuracy | 55.91 |  | Unverified
2 | VG_ELMo_PNASNet | Pointing Game Accuracy | 55.16 |  | Unverified
3 | GbS Ensemble MS-COCO | Pointing Game Accuracy | 54.55 |  | Unverified