| MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Oct 11, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 | 5 |
| LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval | Jan 1, 2023 | image-classification, Image Classification | Code Available | 1 | 5 |
| A Survey of Medical Vision-and-Language Applications and Their Techniques | Nov 19, 2024 | Decision Making, Diagnostic | Code Available | 1 | 5 |
| Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval | Oct 11, 2019 | Graph Matching, Image-text Retrieval | Code Available | 1 | 5 |
| LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval | Mar 16, 2021 | Image-text Retrieval, Re-Ranking | Code Available | 1 | 5 |
| A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval | Jun 4, 2021 | Graph Matching, Image Retrieval | Code Available | 1 | 5 |
| Mr. Right: Multimodal Retrieval on Representation of ImaGe witH Text | Sep 28, 2022 | Image Captioning, Image Retrieval | Code Available | 1 | 5 |
| Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval | May 29, 2024 | cross-modal alignment, Image-text Retrieval | Code Available | 1 | 5 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 | 5 |
| AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning | Aug 14, 2023 | Contrastive Learning, Generative Adversarial Network | Code Available | 1 | 5 |
| PIR: Remote Sensing Image-Text Retrieval with Prior Instruction Representation Learning | May 16, 2024 | Image-text Retrieval, Representation Learning | Code Available | 1 | 5 |
| PC^2: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval | Aug 2, 2024 | Cross-modal retrieval with noisy correspondence, Image-text Retrieval | Code Available | 1 | 5 |
| S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions | May 23, 2023 | Contrastive Learning, Image-text Retrieval | Code Available | 1 | 5 |
| Composing Object Relations and Attributes for Image-Text Matching | Jun 17, 2024 | Attribute, Graph Attention | Code Available | 1 | 5 |
| Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Dec 15, 2022 | Benchmarking, Image Captioning | Code Available | 1 | 5 |
| Rethinking Benchmarks for Cross-modal Image-text Retrieval | Apr 21, 2023 | Cross-Modal Retrieval, Image-text Retrieval | Code Available | 1 | 5 |
| Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning | Nov 24, 2022 | cross-modal alignment, Image-text Retrieval | Code Available | 1 | 5 |
| ComCLIP: Training-Free Compositional Image and Text Matching | Nov 25, 2022 | Image-text matching, Image-text Retrieval | Code Available | 1 | 5 |
| Dynamic Modality Interaction Modeling for Image-Text Retrieval | Jul 11, 2021 | cross-modal alignment, Cross-Modal Retrieval | Code Available | 1 | 5 |
| Hyperbolic Image-Text Representations | Apr 18, 2023 | image-classification, Image Classification | Code Available | 1 | 5 |
| I0T: Embedding Standardization Method Towards Zero Modality Gap | Dec 18, 2024 | Contrastive Learning, Image-text Retrieval | Code Available | 1 | 5 |
| IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval | Mar 8, 2020 | Cross-Modal Retrieval, Image-text Retrieval | Code Available | 1 | 5 |
| Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training | Jun 15, 2023 | Image-text Retrieval, Representation Learning | Code Available | 1 | 5 |
| Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Aug 29, 2022 | cross-modal alignment, Image-text Retrieval | Code Available | 1 | 5 |
| A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval | Oct 27, 2023 | Cross-Modal Retrieval, Image-text Retrieval | Code Available | 1 | 5 |
| Learnable Pillar-based Re-ranking for Image-Text Retrieval | Apr 25, 2023 | Image-text Retrieval, Re-Ranking | Code Available | 1 | 5 |
| Learning Relation Alignment for Calibrated Cross-modal Retrieval | May 28, 2021 | Cross-Modal Retrieval, Image-text Retrieval | Code Available | 1 | 5 |
| Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner | May 19, 2023 | Dense Captioning, Image Captioning | Code Available | 1 | 5 |
| Equivariant Similarity for Vision-Language Foundation Models | Mar 25, 2023 | Image-text Retrieval, Retrieval | Code Available | 1 | 5 |
| ESA: External Space Attention Aggregation for Image-Text Retrieval | Oct 10, 2023 | Image-text Retrieval, Retrieval | Code Available | 1 | 5 |
| Learning the Best Pooling Strategy for Visual Semantic Embedding | Nov 9, 2020 | Cross-Modal Information Retrieval, Image-text Retrieval | Code Available | 1 | 5 |
| LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval | Feb 6, 2023 | Image-text Retrieval, Retrieval | Code Available | 1 | 5 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Jun 15, 2022 | Described Object Detection, Image Captioning | Code Available | 1 | 5 |
| GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-Efficient Medical Image Recognition | Jan 1, 2021 | Image-text Retrieval, Medical Image Analysis | Code Available | 1 | 5 |
| Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning | Mar 19, 2024 | Diagnostic, image-classification | Code Available | 1 | 5 |
| FETA: Towards Specializing Foundation Models for Expert Task Applications | Sep 8, 2022 | Domain Generalization, Few-Shot Learning | Code Available | 1 | 5 |
| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Jun 12, 2023 | cross-modal alignment, Image-text Retrieval | Code Available | 1 | 5 |
| A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports | Sep 3, 2020 | Image-text Retrieval, Medical Visual Question Answering | Code Available | 1 | 5 |
| Graph Optimal Transport for Cross-Domain Alignment | Jun 26, 2020 | Graph Matching, Image Captioning | Code Available | 1 | 5 |
| Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding | Jun 15, 2023 | Contrastive Learning, image-classification | Code Available | 1 | 5 |
| Image-text Retrieval via Preserving Main Semantics of Vision | Apr 20, 2023 | Cross-Modal Retrieval, Image-text Retrieval | Code Available | 1 | 5 |
| FlexiViT: One Model for All Patch Sizes | Dec 15, 2022 | All, Image-text Retrieval | Code Available | 1 | 5 |
| CoSMo: Content-Style Modulation for Image Retrieval With Text Feedback | Jun 19, 2021 | Image Retrieval, Image-text Retrieval | Code Available | 1 | 5 |
| Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Mar 25, 2025 | Benchmarking, Image Captioning | Code Available | 1 | 5 |
| Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | May 11, 2023 | Contrastive Learning, Image-text Retrieval | Code Available | 1 | 5 |
| Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models | Jul 26, 2023 | Image-text Retrieval, Retrieval | Code Available | 1 | 5 |
| Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark | Jun 10, 2023 | Image-text Retrieval, Medical Report Generation | Code Available | 1 | 5 |
| UGNCL: Uncertainty-Guided Noisy Correspondence Learning for Efficient Cross-Modal Matching | Jul 11, 2024 | Cross-Modal Retrieval, Cross-modal retrieval with noisy correspondence | Code Available | 1 | 5 |
| An Unsupervised Cross-Modal Hashing Method Robust to Noisy Training Image-Text Correspondences in Remote Sensing | Feb 26, 2022 | Image-text Retrieval, Meta-Learning | Code Available | 0 | 5 |
| Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | Feb 11, 2023 | Image-text Retrieval, Knowledge Graphs | Code Available | 0 | 5 |