| EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE | Aug 23, 2023 | Image-text matchingImage-text Retrieval | —Unverified | 0 |
| ALIP: Adaptive Language-Image Pre-training with Synthetic Caption | Aug 16, 2023 | Action ClassificationImage-text Retrieval | CodeCode Available | 1 |
| AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning | Aug 14, 2023 | Contrastive LearningGenerative Adversarial Network | CodeCode Available | 1 |
| Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks | Aug 13, 2023 | Contrastive Learningimage-classification | —Unverified | 0 |
| Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models | Jul 26, 2023 | Image-text RetrievalRetrieval | CodeCode Available | 1 |
| Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP | Jul 18, 2023 | AttributeImage-text Retrieval | —Unverified | 0 |
| mCLIP: Multilingual CLIP via Cross-lingual Transfer | Jul 10, 2023 | Contrastive LearningCross-Lingual Transfer | CodeCode Available | 1 |
| Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages | Jun 29, 2023 | Image-text RetrievalMachine Translation | CodeCode Available | 0 |
| Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input | Jun 25, 2023 | DiversityImage-text Retrieval | —Unverified | 0 |
| RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | Jun 19, 2023 | ClassificationCross-Modal Retrieval | CodeCode Available | 2 |
| Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding | Jun 15, 2023 | Contrastive Learningimage-classification | CodeCode Available | 1 |
| Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training | Jun 15, 2023 | Image-text RetrievalRepresentation Learning | CodeCode Available | 1 |
| Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations | Jun 14, 2023 | image-classificationImage Classification | CodeCode Available | 1 |
| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Jun 12, 2023 | cross-modal alignmentImage-text Retrieval | CodeCode Available | 1 |
| Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark | Jun 10, 2023 | Image-text RetrievalMedical Report Generation | CodeCode Available | 1 |
| Revisiting the Role of Language Priors in Vision-Language Models | Jun 2, 2023 | Image-text matchingImage-text Retrieval | CodeCode Available | 1 |
| CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | May 27, 2023 | Image CaptioningImage Retrieval | CodeCode Available | 1 |
| Integrating Listwise Ranking into Pairwise-based Image-Text Retrieval | May 26, 2023 | Image-text RetrievalRetrieval | CodeCode Available | 0 |
| S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions | May 23, 2023 | Contrastive LearningImage-text Retrieval | CodeCode Available | 1 |
| Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner | May 19, 2023 | Dense CaptioningImage Captioning | CodeCode Available | 1 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | May 18, 2023 | 1 Image, 2*2 StitchiAction Classification | CodeCode Available | 3 |
| Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | May 11, 2023 | Contrastive LearningImage-text Retrieval | CodeCode Available | 1 |
| From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping | Apr 26, 2023 | DecoderImage Captioning | CodeCode Available | 1 |
| Hypernymization of named entity-rich captions for grounding-based multi-modal pretraining | Apr 25, 2023 | ArticlesImage-text Retrieval | —Unverified | 0 |
| Learnable Pillar-based Re-ranking for Image-Text Retrieval | Apr 25, 2023 | Image-text RetrievalRe-Ranking | CodeCode Available | 1 |
| Rethinking Benchmarks for Cross-modal Image-text Retrieval | Apr 21, 2023 | Cross-Modal RetrievalImage-text Retrieval | CodeCode Available | 1 |
| Image-text Retrieval via Preserving Main Semantics of Vision | Apr 20, 2023 | Cross-Modal RetrievalImage-text Retrieval | CodeCode Available | 1 |
| Hyperbolic Image-Text Representations | Apr 18, 2023 | image-classificationImage Classification | CodeCode Available | 1 |
| RECLIP: Resource-efficient CLIP by Training with Small Images | Apr 12, 2023 | Contrastive LearningImage-text Retrieval | —Unverified | 0 |
| Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval | Apr 6, 2023 | Cross-Modal RetrievalImage-text Retrieval | CodeCode Available | 0 |
| AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation | Apr 4, 2023 | Cross-Modal RetrievalImage-text Retrieval | CodeCode Available | 3 |
| Equivariant Similarity for Vision-Language Foundation Models | Mar 25, 2023 | Image-text RetrievalRetrieval | CodeCode Available | 1 |
| Scene Graph Based Fusion Network For Image-Text Retrieval | Mar 20, 2023 | Image-text RetrievalRetrieval | —Unverified | 0 |
| Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening | Mar 14, 2023 | Image-text RetrievalMulti-Label Classification | —Unverified | 0 |
| PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents | Mar 13, 2023 | image-classificationImage Classification | CodeCode Available | 2 |
| Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning | Mar 10, 2023 | Few-Shot Image Classificationimage-classification | —Unverified | 0 |
| Semantic-Preserving Augmentation for Robust Image-Text Retrieval | Mar 10, 2023 | Image-text RetrievalRetrieval | CodeCode Available | 0 |
| The style transformer with common knowledge optimization for image-text retrieval | Mar 1, 2023 | Image-text RetrievalRetrieval | —Unverified | 0 |
| Multimodal Federated Learning via Contrastive Representation Ensemble | Feb 17, 2023 | Federated LearningImage-text Retrieval | CodeCode Available | 1 |
| UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling | Feb 13, 2023 | Image-text RetrievalRetrieval | CodeCode Available | 1 |
| Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | Feb 11, 2023 | Image-text RetrievalKnowledge Graphs | CodeCode Available | 0 |
| LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval | Feb 6, 2023 | Image-text RetrievalRetrieval | CodeCode Available | 1 |
| UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers | Jan 31, 2023 | Image CaptioningImage Classification | CodeCode Available | 1 |
| USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval | Jan 17, 2023 | Contrastive LearningImage-text Retrieval | CodeCode Available | 0 |
| HADA: A Graph-based Amalgamation Framework in Image-text Retrieval | Jan 11, 2023 | Graph Neural NetworkImage Retrieval | CodeCode Available | 0 |
| NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings | Jan 7, 2023 | Cross-Modal RetrievalImage-text Retrieval | CodeCode Available | 0 |
| VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching | Jan 1, 2023 | Image-text matchingImage-text Retrieval | —Unverified | 0 |
| LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval | Jan 1, 2023 | image-classificationImage Classification | CodeCode Available | 1 |
| Multilateral Semantic Relations Modeling for Image Text Retrieval | Jan 1, 2023 | Image-text RetrievalRetrieval | —Unverified | 0 |
| ViLEM: Visual-Language Error Modeling for Image-Text Retrieval | Jan 1, 2023 | Contrastive LearningImage-text Retrieval | —Unverified | 0 |