SOTAVerified

Image-to-Text Retrieval

Image-text retrieval is the task of retrieving relevant images from a textual description, or of finding the matching textual descriptions for a given image. The task is interdisciplinary, combining techniques from computer vision and natural language processing. The primary challenge is bridging the semantic gap: the difference between how visual data is represented in images and how humans describe that information in language. To address this, many methods learn a shared embedding space in which both images and text are represented in a comparable way, so that their similarity can be measured directly and retrieval becomes more accurate.

Source: Extending CLIP for Category-to-Image Retrieval in E-commerce
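
The shared-embedding idea described above can be illustrated with a minimal sketch. It assumes the image and the candidate captions have already been encoded into vectors of the same dimension by some model; the toy vectors and the helper names (`l2_normalize`, `cosine_similarity`, `retrieve_captions`) are hypothetical and not taken from any of the listed papers. Cosine similarity is used here as one common choice of similarity measure.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve_captions(image_embedding, caption_embeddings, top_k=1):
    """Return indices of the top_k captions most similar to the image."""
    scored = [
        (cosine_similarity(image_embedding, emb), idx)
        for idx, emb in enumerate(caption_embeddings)
    ]
    # Highest similarity first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]


# Toy embeddings (hypothetical values standing in for encoder outputs).
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = [
    [0.0, 1.0, 0.0],  # caption 0
    [1.0, 0.2, 0.0],  # caption 1, closest to the image
    [0.0, 0.0, 1.0],  # caption 2
]

ranked = retrieve_captions(image_embedding, caption_embeddings, top_k=2)
print(ranked)
```

Text-to-image retrieval works the same way with the roles swapped: the caption embedding is the query and the image embeddings are the candidates.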

Papers

Showing 1-10 of 59 papers

Title | Status | Hype
Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration | | 0
Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models | Code | 1
Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution | | 0
SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs | | 0
DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation | | 0
ABC: Achieving Better Control of Multimodal Embeddings using VLMs | | 0
Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation | | 0
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | | 0
Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization | | 0
Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Oscar | Recall@10 | 99.8 | | Unverified
2 | Oscar | Recall@10 | 98.3 | | Unverified
3 | Unicoder-VL | Recall@10 | 97.2 | | Unverified
4 | BLIP-2 (ViT-G, fine-tuned) | Recall@1 | 85.4 | | Unverified
5 | ONE-PEACE (ViT-G, w/o ranking) | Recall@1 | 84.1 | | Unverified
6 | BLIP-2 (ViT-L, fine-tuned) | Recall@1 | 83.5 | | Unverified
7 | DVSA | Recall@10 | 74.8 | | Unverified
8 | IAIS | Recall@1 | 67.78 | | Unverified
9 | CLIP (zero-shot) | Recall@1 | 58.4 | | Unverified
10 | FLAVA (ViT-B, zero-shot) | Recall@1 | 42.74 | | Unverified
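
The Recall@K metric reported above is not defined on this page. As it is commonly used in retrieval benchmarks, it is the percentage of queries for which at least one correct item appears among the top K results. The sketch below follows that common definition; the function name `recall_at_k` and the toy rankings are hypothetical, and the datasets and evaluation protocols behind the claimed scores are not stated here and may differ between entries.

```python
def recall_at_k(ranked_lists, relevant_sets, k):
    """Percentage of queries with at least one relevant item in the top k.

    ranked_lists: one list of candidate indices per query, best match first.
    relevant_sets: one set of ground-truth candidate indices per query.
    """
    if not ranked_lists:
        raise ValueError("need at least one query")
    hits = 0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        if any(idx in relevant for idx in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(ranked_lists)


# Toy example: four image queries, three candidate captions each.
ranked_lists = [[2, 0, 1], [1, 2, 0], [0, 1, 2], [1, 0, 2]]
relevant_sets = [{2}, {2}, {0}, {2}]

r1 = recall_at_k(ranked_lists, relevant_sets, k=1)
r2 = recall_at_k(ranked_lists, relevant_sets, k=2)
r3 = recall_at_k(ranked_lists, relevant_sets, k=3)
print(r1, r2, r3)
```

Because a larger K can only add hits, Recall@10 is never lower than Recall@1 for the same model and test set, so rows that report different values of K are not directly comparable.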