SOTAVerified

Image-text Retrieval

Papers

Showing 151–200 of 248 papers

Title | Status | Hype
Improving Adversarial Transferability of Vision-Language Pre-training Models through Collaborative Multimodal Interaction | — | 0
Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples | — | 0
Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning | Code | 0
Enhancing Image-Text Matching with Adaptive Feature Aggregation | Code | 0
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment | — | 0
Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data | — | 0
LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models | — | 0
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers | — | 0
A New Fine-grained Alignment Method for Image-text Matching | — | 0
MCAD: Multi-teacher Cross-modal Alignment Distillation for Efficient Image-text Retrieval | — | 0
Direction-Oriented Visual-semantic Embedding Model for Remote Sensing Image-text Retrieval | — | 0
Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning | — | 0
Constructing Image-Text Pair Dataset from Books | — | 0
Dual Relation Alignment for Composed Image Retrieval | — | 0
MultiWay-Adapter: Adapting Large-scale Multi-modal Models for Scalable Image-Text Retrieval | Code | 0
Contrastive Feature Masking Open-Vocabulary Vision Transformer | — | 0
DLIP: Distilling Language-Image Pre-training | — | 0
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE | — | 0
Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks | — | 0
Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP | — | 0
Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages | Code | 0
Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input | — | 0
Integrating Listwise Ranking into Pairwise-based Image-Text Retrieval | Code | 0
Hypernymization of Named Entity-rich Captions for Grounding-based Multi-modal Pretraining | — | 0
RECLIP: Resource-efficient CLIP by Training with Small Images | — | 0
Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval | Code | 0
Scene Graph Based Fusion Network for Image-Text Retrieval | — | 0
Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening | — | 0
Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning | — | 0
Semantic-Preserving Augmentation for Robust Image-Text Retrieval | Code | 0
The Style Transformer with Common Knowledge Optimization for Image-Text Retrieval | — | 0
Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | Code | 0
USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval | Code | 0
HADA: A Graph-based Amalgamation Framework in Image-text Retrieval | Code | 0
NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings | Code | 0
VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching | — | 0
Multilateral Semantic Relations Modeling for Image Text Retrieval | — | 0
GAFNet: A Global Fourier Self Attention Based Novel Network for Multi-modal Downstream Tasks | — | 0
ViLEM: Visual-Language Error Modeling for Image-Text Retrieval | — | 0
Efficient Image Captioning for Edge Devices | — | 0
HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval | — | 0
NLIP: Noise-robust Language-Image Pre-training | — | 0
Scale-Semantic Joint Decoupling Network for Image-text Retrieval in Remote Sensing | — | 0
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval | — | 0
Generative Negative Text Replay for Continual Vision-Language Pretraining | — | 0
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | Code | 0
Dissecting Deep Metric Learning Losses for Image-Text Retrieval | Code | 0
Image-Text Retrieval with Binary and Continuous Label Supervision | — | 0
CPL: Counterfactual Prompt Learning for Vision and Language Models | — | 0
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning | — | 0
Page 4 of 5

No leaderboard results yet.