| BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Jan 28, 2022 | Image Captioning; Image-text matching | Code Available | 5 |
| FG-CLIP: Fine-Grained Visual and Textual Alignment | May 8, 2025 | Image-text Retrieval; object-detection | Code Available | 4 |
| Multi-label Cluster Discrimination for Visual Representation Learning | Jul 24, 2024 | Contrastive Learning; Image-text Retrieval | Code Available | 4 |
| Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Feb 9, 2025 | Image Captioning; Image-text Retrieval | Code Available | 3 |
| M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models | Mar 31, 2024 | Image-text Retrieval; Language Modeling | Code Available | 3 |
| ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | May 18, 2023 | 1 Image, 2*2 Stitchi; Action Classification | Code Available | 3 |
| AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation | Apr 4, 2023 | Cross-Modal Retrieval; Image-text Retrieval | Code Available | 3 |
| Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Oct 17, 2022 | Few-Shot Learning; Image Captioning | Code Available | 3 |
| FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation | Jun 10, 2025 | Image-text Retrieval; Question Answering | Code Available | 2 |
| Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis | Mar 25, 2025 | Contrastive Learning; Image-text Retrieval | Code Available | 2 |
| BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature | Jan 13, 2025 | Articles; Image-text Retrieval | Code Available | 2 |
| Towards Vision-Language Geo-Foundation Model: A Survey | Jun 13, 2024 | Earth Observation; Image Captioning | Code Available | 2 |
| RWKV-CLIP: A Robust Vision-Language Representation Learner | Jun 11, 2024 | Image-text Retrieval; Representation Learning | Code Available | 2 |
| Accelerating Transformers with Spectrum-Preserving Token Merging | May 25, 2024 | image-classification; Image Classification | Code Available | 2 |
| DreamLIP: Language-Image Pre-training with Long Captions | Mar 25, 2024 | Contrastive Learning; Image-text Retrieval | Code Available | 2 |
| Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval | Mar 8, 2024 | Image-text Retrieval; Retrieval | Code Available | 2 |
| Frozen Transformers in Language Models Are Effective Visual Encoder Layers | Oct 19, 2023 | Action Recognition; Image-text Retrieval | Code Available | 2 |
| VeCLIP: Improving CLIP Training via Visual-enriched Captions | Oct 11, 2023 | Image-text Retrieval; Retrieval | Code Available | 2 |
| RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | Jun 19, 2023 | Classification; Cross-Modal Retrieval | Code Available | 2 |
| PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents | Mar 13, 2023 | image-classification; Image Classification | Code Available | 2 |
| MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | Oct 18, 2022 | Contrastive Learning; Image-text Retrieval | Code Available | 2 |
| Cross-lingual and Multilingual CLIP | Jun 1, 2022 | Contrastive Learning; Image-text Retrieval | Code Available | 2 |
| Vision-Language Pre-Training with Triple Contrastive Learning | Feb 21, 2022 | Contrastive Learning; cross-modal alignment | Code Available | 2 |
| WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning | Mar 2, 2021 | BIG-bench Machine Learning; Image Retrieval | Code Available | 2 |
| Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Feb 11, 2021 | Cross-Modal Retrieval; Fine-Grained Image Classification | Code Available | 2 |
| Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Mar 25, 2025 | Benchmarking; Image Captioning | Code Available | 1 |
| ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning | Feb 27, 2025 | Cross-Modal Retrieval; Cross-modal retrieval with noisy correspondence | Code Available | 1 |
| I0T: Embedding Standardization Method Towards Zero Modality Gap | Dec 18, 2024 | Contrastive Learning; Image-text Retrieval | Code Available | 1 |
| A Survey of Medical Vision-and-Language Applications and Their Techniques | Nov 19, 2024 | Decision Making; Diagnostic | Code Available | 1 |
| Nearest Neighbor Normalization Improves Multimodal Retrieval | Oct 31, 2024 | Cross-Modal Retrieval; Image Captioning | Code Available | 1 |
| PC^2: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval | Aug 2, 2024 | Cross-modal retrieval with noisy correspondence; Image-text Retrieval | Code Available | 1 |
| UGNCL: Uncertainty-Guided Noisy Correspondence Learning for Efficient Cross-Modal Matching | Jul 11, 2024 | Cross-Modal Retrieval; Cross-modal retrieval with noisy correspondence | Code Available | 1 |
| CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation | Jul 1, 2024 | Image-text Retrieval; Question Answering | Code Available | 1 |
| Composing Object Relations and Attributes for Image-Text Matching | Jun 17, 2024 | Attribute; Graph Attention | Code Available | 1 |
| Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval | May 29, 2024 | cross-modal alignment; Image-text Retrieval | Code Available | 1 |
| PIR: Remote Sensing Image-Text Retrieval with Prior Instruction Representation Learning | May 16, 2024 | Image-text Retrieval; Representation Learning | Code Available | 1 |
| Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning | Mar 19, 2024 | Diagnostic; image-classification | Code Available | 1 |
| MLLMs-Augmented Visual-Language Representation Learning | Nov 30, 2023 | Image-text Retrieval; Representation Learning | Code Available | 1 |
| A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval | Oct 27, 2023 | Cross-Modal Retrieval; Image-text Retrieval | Code Available | 1 |
| ESA: External Space Attention Aggregation for Image-Text Retrieval | Oct 10, 2023 | Image-text Retrieval; Retrieval | Code Available | 1 |
| Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment | Aug 27, 2023 | Contrastive Learning; Image-text Retrieval | Code Available | 1 |
| Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval | Aug 24, 2023 | Cross-Modal Retrieval; Image-text matching | Code Available | 1 |
| ALIP: Adaptive Language-Image Pre-training with Synthetic Caption | Aug 16, 2023 | Action Classification; Image-text Retrieval | Code Available | 1 |
| AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning | Aug 14, 2023 | Contrastive Learning; Generative Adversarial Network | Code Available | 1 |
| Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models | Jul 26, 2023 | Image-text Retrieval; Retrieval | Code Available | 1 |
| mCLIP: Multilingual CLIP via Cross-lingual Transfer | Jul 10, 2023 | Contrastive Learning; Cross-Lingual Transfer | Code Available | 1 |
| Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding | Jun 15, 2023 | Contrastive Learning; image-classification | Code Available | 1 |
| Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training | Jun 15, 2023 | Image-text Retrieval; Representation Learning | Code Available | 1 |
| Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations | Jun 14, 2023 | image-classification; Image Classification | Code Available | 1 |
| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Jun 12, 2023 | cross-modal alignment; Image-text Retrieval | Code Available | 1 |