SOTAVerified

Image-text Retrieval

Papers

Showing 101-150 of 248 papers

| Title | Status | Hype |
| --- | --- | --- |
| EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE |  | 0 |
| ALIP: Adaptive Language-Image Pre-training with Synthetic Caption | Code | 1 |
| AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning | Code | 1 |
| Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks |  | 0 |
| Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models | Code | 1 |
| Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP |  | 0 |
| mCLIP: Multilingual CLIP via Cross-lingual Transfer | Code | 1 |
| Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages | Code | 0 |
| Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input |  | 0 |
| RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | Code | 2 |
| Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding | Code | 1 |
| Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training | Code | 1 |
| Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations | Code | 1 |
| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Code | 1 |
| Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark | Code | 1 |
| Revisiting the Role of Language Priors in Vision-Language Models | Code | 1 |
| CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | Code | 1 |
| Integrating Listwise Ranking into Pairwise-based Image-Text Retrieval | Code | 0 |
| S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions | Code | 1 |
| Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner | Code | 1 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | Code | 3 |
| Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | Code | 1 |
| From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping | Code | 1 |
| Hypernymization of named entity-rich captions for grounding-based multi-modal pretraining |  | 0 |
| Learnable Pillar-based Re-ranking for Image-Text Retrieval | Code | 1 |
| Rethinking Benchmarks for Cross-modal Image-text Retrieval | Code | 1 |
| Image-text Retrieval via Preserving Main Semantics of Vision | Code | 1 |
| Hyperbolic Image-Text Representations | Code | 1 |
| RECLIP: Resource-efficient CLIP by Training with Small Images |  | 0 |
| Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval | Code | 0 |
| AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation | Code | 3 |
| Equivariant Similarity for Vision-Language Foundation Models | Code | 1 |
| Scene Graph Based Fusion Network For Image-Text Retrieval |  | 0 |
| Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening |  | 0 |
| PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents | Code | 2 |
| Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning |  | 0 |
| Semantic-Preserving Augmentation for Robust Image-Text Retrieval | Code | 0 |
| The style transformer with common knowledge optimization for image-text retrieval |  | 0 |
| Multimodal Federated Learning via Contrastive Representation Ensemble | Code | 1 |
| UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling | Code | 1 |
| Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | Code | 0 |
| LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval | Code | 1 |
| UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers | Code | 1 |
| USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval | Code | 0 |
| HADA: A Graph-based Amalgamation Framework in Image-text Retrieval | Code | 0 |
| NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings | Code | 0 |
| VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching |  | 0 |
| LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval | Code | 1 |
| Multilateral Semantic Relations Modeling for Image Text Retrieval |  | 0 |
| ViLEM: Visual-Language Error Modeling for Image-Text Retrieval |  | 0 |

No leaderboard results yet.