SOTAVerified

Image-text Retrieval

Papers

Showing 51–100 of 248 papers

Title | Status | Hype
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Code | 1
LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval | Code | 1
A Survey of Medical Vision-and-Language Applications and Their Techniques | Code | 1
Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval | Code | 1
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval | Code | 1
A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval | Code | 1
Mr. Right: Multimodal Retrieval on Representation of ImaGe witH Text | Code | 1
Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval | Code | 1
Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code | 1
AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning | Code | 1
PIR: Remote Sensing Image-Text Retrieval with Prior Instruction Representation Learning | Code | 1
PC^2: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval | Code | 1
S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions | Code | 1
Composing Object Relations and Attributes for Image-Text Matching | Code | 1
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Code | 1
Rethinking Benchmarks for Cross-modal Image-text Retrieval | Code | 1
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Code | 1
ComCLIP: Training-Free Compositional Image and Text Matching | Code | 1
Dynamic Modality Interaction Modeling for Image-Text Retrieval | Code | 1
Hyperbolic Image-Text Representations | Code | 1
I0T: Embedding Standardization Method Towards Zero Modality Gap | Code | 1
IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval | Code | 1
Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training | Code | 1
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Code | 1
A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval | Code | 1
Learnable Pillar-based Re-ranking for Image-Text Retrieval | Code | 1
Learning Relation Alignment for Calibrated Cross-modal Retrieval | Code | 1
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner | Code | 1
Equivariant Similarity for Vision-Language Foundation Models | Code | 1
ESA: External Space Attention Aggregation for Image-Text Retrieval | Code | 1
Learning the Best Pooling Strategy for Visual Semantic Embedding | Code | 1
LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval | Code | 1
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Code | 1
GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-Efficient Medical Image Recognition | Code | 1
Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning | Code | 1
FETA: Towards Specializing Foundation Models for Expert Task Applications | Code | 1
Global and Local Semantic Completion Learning for Vision-Language Pre-training | Code | 1
A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports | Code | 1
Graph Optimal Transport for Cross-Domain Alignment | Code | 1
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding | Code | 1
Image-text Retrieval via Preserving Main Semantics of Vision | Code | 1
FlexiViT: One Model for All Patch Sizes | Code | 1
CoSMo: Content-Style Modulation for Image Retrieval With Text Feedback | Code | 1
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Code | 1
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | Code | 1
Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models | Code | 1
Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark | Code | 1
UGNCL: Uncertainty-Guided Noisy Correspondence Learning for Efficient Cross-Modal Matching | Code | 1
An Unsupervised Cross-Modal Hashing Method Robust to Noisy Training Image-Text Correspondences in Remote Sensing | Code | 0
Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | Code | 0

No leaderboard results yet.