SOTAVerified

Image-text Retrieval

Papers

Showing 1-50 of 248 papers

Title | Status | Hype
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Code | 5
Multi-label Cluster Discrimination for Visual Representation Learning | Code | 4
FG-CLIP: Fine-Grained Visual and Textual Alignment | Code | 4
M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models | Code | 3
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Code | 3
Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Code | 3
AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation | Code | 3
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | Code | 3
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature | Code | 2
Frozen Transformers in Language Models Are Effective Visual Encoder Layers | Code | 2
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | Code | 2
PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents | Code | 2
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | Code | 2
Accelerating Transformers with Spectrum-Preserving Token Merging | Code | 2
Vision-Language Pre-Training with Triple Contrastive Learning | Code | 2
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning | Code | 2
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Code | 2
RWKV-CLIP: A Robust Vision-Language Representation Learner | Code | 2
Towards Vision-Language Geo-Foundation Model: A Survey | Code | 2
FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation | Code | 2
Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis | Code | 2
Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval | Code | 2
VeCLIP: Improving CLIP Training via Visual-enriched Captions | Code | 2
Cross-lingual and Multilingual CLIP | Code | 2
DreamLIP: Language-Image Pre-training with Long Captions | Code | 2
IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval | Code | 1
A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval | Code | 1
I0T: Embedding Standardization Method Towards Zero Modality Gap | Code | 1
Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code | 1
Image-text Retrieval via Preserving Main Semantics of Vision | Code | 1
ALIP: Adaptive Language-Image Pre-training with Synthetic Caption | Code | 1
Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations | Code | 1
Graph Optimal Transport for Cross-Domain Alignment | Code | 1
Global and Local Semantic Completion Learning for Vision-Language Pre-training | Code | 1
GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-Efficient Medical Image Recognition | Code | 1
Hyperbolic Image-Text Representations | Code | 1
Learnable Pillar-based Re-ranking for Image-Text Retrieval | Code | 1
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Code | 1
FILIP: Fine-grained Interactive Language-Image Pre-Training | Code | 1
FlexiViT: One Model for All Patch Sizes | Code | 1
Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning | Code | 1
ESA: External Space Attention Aggregation for Image-Text Retrieval | Code | 1
FETA: Towards Specializing Foundation Models for Expert Task Applications | Code | 1
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding | Code | 1
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner | Code | 1
CoSMo: Content-Style Modulation for Image Retrieval With Text Feedback | Code | 1
A Survey of Medical Vision-and-Language Applications and Their Techniques | Code | 1
Equivariant Similarity for Vision-Language Foundation Models | Code | 1
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | Code | 1
From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping | Code | 1

No leaderboard results yet.