Image-text matching

Image-text matching is a subtask of cross-modal retrieval (CMR) that associates images with their corresponding textual descriptions. The goal is to retrieve an image given a textual query or, conversely, to retrieve a textual description given an image query. The task is challenging because of the heterogeneity gap between image and text representations. Image-text matching is used in applications such as content-based image search, visual question answering, and multimodal summarization.
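The two retrieval directions described above can be illustrated with a minimal sketch. It assumes that images and texts have already been encoded into a shared embedding space by some model; the toy vectors, the function names (`cosine`, `rank_candidates`, `recall_at_k`), and the choice of cosine similarity with Recall@K are illustrative assumptions, not the method of any paper listed on this page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def rank_candidates(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda j: scores[j], reverse=True)

def recall_at_k(queries, candidates, k):
    """Fraction of queries whose true match (same index) is in the top k.

    Assumes queries[i] and candidates[i] form the ground-truth pair.
    """
    hits = sum(
        1 for i, q in enumerate(queries)
        if i in rank_candidates(q, candidates)[:k]
    )
    return hits / len(queries)

# Hypothetical, hand-made embeddings: row i of each list describes the same item.
image_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_emb = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.7]]

# Image-to-text retrieval: each image queries the pool of texts.
i2t_r1 = recall_at_k(image_emb, text_emb, k=1)
# Text-to-image retrieval: each text queries the pool of images.
t2i_r1 = recall_at_k(text_emb, image_emb, k=1)
```

In practice the embeddings would come from trained image and text encoders, and the heterogeneity gap mentioned above is what makes learning such a shared space difficult; the ranking step itself stays this simple.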

Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective

Papers


Showing 51–100 of 188 papers

Title	Date	Tasks	Status	Hype	Score
Learning Semantic Relationship Among Instances for Image-Text Matching	Jan 1, 2023	Cross-Modal Retrieval, Image Retrieval	Code Available	1	5
Are Diffusion Models Vision-And-Language Reasoners?	May 25, 2023	Denoising, Image Generation	Code Available	1	5
Learning with Noisy Correspondence for Cross-modal Matching	Dec 1, 2021	Cross-Modal Retrieval, Cross-modal retrieval with noisy correspondence	Code Available	1	5
LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation	May 18, 2023	Attribute, Image Generation	Code Available	1	5
Adaptive Offline Quintuplet Loss for Image-Text Matching	Mar 7, 2020	Image-text matching, Text Matching	Code Available	1	5
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model	Oct 11, 2022	Contrastive Learning, Image-text matching	Code Available	1	5
BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency	Mar 22, 2023	Cross-modal retrieval with noisy correspondence, Image-text matching	Code Available	1	5
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval	Sep 29, 2023	Cross-Modal Retrieval, Image-text matching	Code Available	1	5
RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training	Mar 15, 2024	Diagnostic, image-classification	Code Available	1	5
ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning	Feb 27, 2025	Cross-Modal Retrieval, Cross-modal retrieval with noisy correspondence	Code Available	1	5
BrainCLIP: Bridging Brain and Visual-Linguistic Representation Via CLIP for Generic Natural Visual Stimulus Decoding	Feb 25, 2023	Brain Decoding, Image Generation	Code Available	1	5
ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO	Apr 7, 2022	Image-text matching, Text Matching	Code Available	1	5
Self-supervised vision-language pretraining for Medical visual question answering	Nov 24, 2022	Contrastive Learning, Image-text matching	Code Available	1	5
Similarity Reasoning and Filtration for Image-Text Matching	Jan 5, 2021	Cross-Modal Retrieval, Image Retrieval	Code Available	1	5
Stacked Cross Attention for Image-Text Matching	Mar 21, 2018	Cross-Modal Retrieval, Image Retrieval	Code Available	1	5
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations	May 6, 2023	Image-text matching, Text Matching	Code Available	1	5
Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models	Jun 10, 2025	Contrastive Learning, Image-text matching	Code Available	1	5
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding	Nov 30, 2023	Attribute, Compositional Zero-Shot Learning	Code Available	1	5
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method	Jul 21, 2023	Image-text matching, Text Matching	Code Available	1	5
Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark	Jun 5, 2023	Attribute, Image-text matching	Code Available	1	5
Transformer Reasoning Network for Image-Text Matching and Retrieval	Apr 20, 2020	Image Retrieval, Image-text matching	Code Available	1	5
CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP	Mar 5, 2025	Adversarial Robustness, Image-text matching	Code Available	1	5
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation	Jul 16, 2021	Cross-Modal Retrieval, Grounded language learning	Code Available	1	5
UGNCL: Uncertainty-Guided Noisy Correspondence Learning for Efficient Cross-Modal Matching	Jul 11, 2024	Cross-Modal Retrieval, Cross-modal retrieval with noisy correspondence	Code Available	1	5
Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network	Jan 1, 2023	Image-text matching, Retrieval	Code Available	1	5
CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation	Feb 27, 2025	Image-text matching, Object	Code Available	1	5
Graph Structured Network for Image-Text Matching	Apr 1, 2020	Attribute, Cross-Modal Retrieval	Code Available	1	5
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?	Oct 21, 2022	Image-text matching, Language Modeling	Code Available	0	5
Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking	Jan 29, 2024	Image-text matching, Text Matching	Code Available	0	5
Learning Two-Branch Neural Networks for Image-Text Matching Tasks	Apr 11, 2017	Image-text matching, Retrieval	Code Available	0	5
Integrating Language Guidance Into Image-Text Matching for Correcting False Negatives	Mar 24, 2023	Cross-modal retrieval with noisy correspondence, Image-text matching	Code Available	0	5
Learning fragment self-attention embeddings for image-text matching	Oct 1, 2019	Image-text matching, Sentence	Code Available	0	5
Dual Attention Networks for Multimodal Reasoning and Matching	Nov 2, 2016	Collaborative Inference, Image-text matching	Code Available	0	5
Position Focused Attention Network for Image-Text Matching	Jul 23, 2019	Image-text matching, Position	Code Available	0	5
RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models	Apr 21, 2023	Cross-Modal Retrieval, Image-text matching	Code Available	0	5
Generative Visual Instruction Tuning	Jun 17, 2024	Image Generation, Image-text matching	Code Available	0	5
Backdoor Attack on Unpaired Medical Image-Text Foundation Models: A Pilot Study on MedCLIP	Jan 1, 2024	Backdoor Attack, Contrastive Learning	Code Available	0	5
MALM: Mask Augmentation based Local Matching for Food-Recipe Retrieval	May 18, 2023	Image-text matching, Retrieval	Code Available	0	5
GR-GAN: Gradual Refinement Text-to-image Generation	May 23, 2022	Generative Adversarial Network, Image Generation	Code Available	0	5
Deep Cross-Modal Projection Learning for Image-Text Matching	Sep 1, 2018	Cross-Modal Retrieval, Image-text matching	Code Available	0	5
Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model	Jun 18, 2024	Image-text matching, Language Modeling	Code Available	0	5
Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks	Sep 14, 2023	Image-text matching, Sarcasm Detection	Code Available	0	5
Increasing Textual Context Size Boosts Medical Image-Text Matching	Mar 23, 2023	Image-text matching, Text Matching	Code Available	0	5
Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search	Sep 28, 2023	cross-modal alignment, Cross-Modal Retrieval	Code Available	0	5
Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking	Aug 12, 2019	Binary Classification, General Classification	Code Available	0	5
Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering	Sep 9, 2023	Image Captioning, Image-text matching	Code Available	0	5
Enhancing Image-Text Matching with Adaptive Feature Aggregation	Jan 18, 2024	Image-text matching, Image-text Retrieval	Code Available	0	5
ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval	Jul 29, 2022	Cross-Modal Retrieval, Image-text matching	Code Available	0	5
Evaluating Attribute Comprehension in Large Vision-Language Models	Aug 25, 2024	Attribute, Image-text matching	Code Available	0	5
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets	Mar 5, 2024	Diversity, Image Description	Code Available	0	5


No leaderboard results yet.