| Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset | May 25, 2022 | Image Captioning, Image Retrieval | —Unverified | 0 |
| A New Fine-grained Alignment Method for Image-text Matching | Nov 3, 2023 | Image-text matching, Image-text Retrieval | —Unverified | 0 |
| CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning | Oct 15, 2024 | Image-text Retrieval, Text Retrieval | —Unverified | 0 |
| DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions | Feb 7, 2025 | Anomaly Detection, Image-text Retrieval | —Unverified | 0 |
| Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals | Jan 9, 2019 | Cross-Modal Retrieval, Deep Hashing | —Unverified | 0 |
| Direction-Oriented Visual-semantic Embedding Model for Remote Sensing Image-text Retrieval | Oct 12, 2023 | Cross-Modal Retrieval, Image-text Retrieval | —Unverified | 0 |
| VladVA: Discriminative Fine-tuning of LVLMs | Dec 5, 2024 | Image-text Retrieval, Representation Learning | —Unverified | 0 |
| Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation | May 25, 2025 | Contrastive Learning, Image-text Retrieval | —Unverified | 0 |
| DLIP: Distilling Language-Image Pre-training | Aug 24, 2023 | Image Captioning, Image-text Retrieval | —Unverified | 0 |
| Dual Relation Alignment for Composed Image Retrieval | Sep 5, 2023 | Image Retrieval, Image-text Retrieval | —Unverified | 0 |
| Dynamic Contrastive Distillation for Image-Text Retrieval | Jul 4, 2022 | Contrastive Learning, GPU | —Unverified | 0 |
| Efficient Image Captioning for Edge Devices | Dec 18, 2022 | CPU, Image Captioning | —Unverified | 0 |
| Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening | Mar 14, 2023 | Image-text Retrieval, Multi-Label Classification | —Unverified | 0 |
| Efficient Multilingual Multi-modal Pre-training through Triple Contrastive Loss | Oct 1, 2022 | image-classification, Image Classification | —Unverified | 0 |
| Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples | Mar 5, 2024 | Concept Alignment, Contrastive Learning | —Unverified | 0 |
| EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models | May 24, 2025 | Image-text Retrieval, Language Modeling | —Unverified | 0 |
| EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE | Aug 23, 2023 | Image-text matching, Image-text Retrieval | —Unverified | 0 |
| Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning | Dec 10, 2024 | Contrastive Learning, Image-text Retrieval | —Unverified | 0 |
| Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach | Feb 10, 2025 | Federated Learning, Image-text Retrieval | —Unverified | 0 |
| FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations | Apr 11, 2025 | image-classification, Image Classification | —Unverified | 0 |
| Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks | Aug 13, 2023 | Contrastive Learning, image-classification | —Unverified | 0 |
| GAFNet: A Global Fourier Self Attention Based Novel Network for multi-modal downstream tasks | Jan 1, 2023 | Image Generation, Image-text Retrieval | —Unverified | 0 |
| Generative Negative Text Replay for Continual Vision-Language Pretraining | Oct 31, 2022 | Continual Learning, image-classification | —Unverified | 0 |
| Global–Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image–Text Retrieval | May 14, 2024 | Cross-Modal Retrieval, Cross-Modal Retrieval on RSITMD | —Unverified | 0 |
| HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval | Dec 16, 2022 | Image-text Retrieval, Retrieval | —Unverified | 0 |