| Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Mar 25, 2025 | Benchmarking, Image Captioning | Code Available | 1 |
| ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning | Feb 27, 2025 | Cross-Modal Retrieval, Cross-modal retrieval with noisy correspondence | Code Available | 1 |
| I0T: Embedding Standardization Method Towards Zero Modality Gap | Dec 18, 2024 | Contrastive Learning, Image-text Retrieval | Code Available | 1 |
| A Survey of Medical Vision-and-Language Applications and Their Techniques | Nov 19, 2024 | Decision Making, Diagnostic | Code Available | 1 |
| Nearest Neighbor Normalization Improves Multimodal Retrieval | Oct 31, 2024 | Cross-Modal Retrieval, Image Captioning | Code Available | 1 |
| PC^2: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval | Aug 2, 2024 | Cross-modal retrieval with noisy correspondence, Image-text Retrieval | Code Available | 1 |
| UGNCL: Uncertainty-Guided Noisy Correspondence Learning for Efficient Cross-Modal Matching | Jul 11, 2024 | Cross-Modal Retrieval, Cross-modal retrieval with noisy correspondence | Code Available | 1 |
| CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation | Jul 1, 2024 | Image-text Retrieval, Question Answering | Code Available | 1 |
| Composing Object Relations and Attributes for Image-Text Matching | Jun 17, 2024 | Attribute, Graph Attention | Code Available | 1 |
| Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval | May 29, 2024 | cross-modal alignment, Image-text Retrieval | Code Available | 1 |
| PIR: Remote Sensing Image-Text Retrieval with Prior Instruction Representation Learning | May 16, 2024 | Image-text Retrieval, Representation Learning | Code Available | 1 |
| Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning | Mar 19, 2024 | Diagnostic, image-classification | Code Available | 1 |
| MLLMs-Augmented Visual-Language Representation Learning | Nov 30, 2023 | Image-text Retrieval, Representation Learning | Code Available | 1 |
| A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval | Oct 27, 2023 | Cross-Modal Retrieval, Image-text Retrieval | Code Available | 1 |
| ESA: External Space Attention Aggregation for Image-Text Retrieval | Oct 10, 2023 | Image-text Retrieval, Retrieval | Code Available | 1 |
| Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment | Aug 27, 2023 | Contrastive Learning, Image-text Retrieval | Code Available | 1 |
| Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval | Aug 24, 2023 | Cross-Modal Retrieval, Image-text matching | Code Available | 1 |
| ALIP: Adaptive Language-Image Pre-training with Synthetic Caption | Aug 16, 2023 | Action Classification, Image-text Retrieval | Code Available | 1 |
| AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning | Aug 14, 2023 | Contrastive Learning, Generative Adversarial Network | Code Available | 1 |
| Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models | Jul 26, 2023 | Image-text Retrieval, Retrieval | Code Available | 1 |
| mCLIP: Multilingual CLIP via Cross-lingual Transfer | Jul 10, 2023 | Contrastive Learning, Cross-Lingual Transfer | Code Available | 1 |
| Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding | Jun 15, 2023 | Contrastive Learning, image-classification | Code Available | 1 |
| Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training | Jun 15, 2023 | Image-text Retrieval, Representation Learning | Code Available | 1 |
| Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations | Jun 14, 2023 | image-classification, Image Classification | Code Available | 1 |
| Global and Local Semantic Completion Learning for Vision-Language Pre-training | Jun 12, 2023 | cross-modal alignment, Image-text Retrieval | Code Available | 1 |