| Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval | Jun 26, 2025 | Cross-Modal Retrieval, Image-text Retrieval | Unverified | 0 |
| Adding simple structure at inference improves Vision-Language Compositionality | Jun 11, 2025 | Attribute, Image-text Retrieval | Code Available | 0 |
| FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation | Jun 10, 2025 | Image-text Retrieval, Question Answering | Code Available | 2 |
| Attacking Attention of Foundation Models Disrupts Downstream Tasks | Jun 3, 2025 | Depth Estimation, Image-text Retrieval | Code Available | 0 |
| Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation | May 25, 2025 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models | May 24, 2025 | Image-text Retrieval, Language Modeling | Unverified | 0 |
| Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval | May 22, 2025 | cross-modal alignment, Image-text Retrieval | Unverified | 0 |
| Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models | May 20, 2025 | Image-text Retrieval, Text Retrieval | Unverified | 0 |
| A Vision-Language Foundation Model for Leaf Disease Identification | May 11, 2025 | Contrastive Learning, image-classification | Code Available | 0 |
| FG-CLIP: Fine-Grained Visual and Textual Alignment | May 8, 2025 | Image-text Retrieval, object-detection | Code Available | 4 |
| AGATE: Stealthy Black-box Watermarking for Multimodal Model Copyright Protection | Apr 28, 2025 | Adversarial Attack, Anomaly Detection | Unverified | 0 |
| Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs | Apr 24, 2025 | Image-text Retrieval, Instruction Following | Unverified | 0 |
| FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations | Apr 11, 2025 | image-classification, Image Classification | Unverified | 0 |
| Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Mar 25, 2025 | Benchmarking, Image Captioning | Code Available | 1 |
| SeLIP: Similarity Enhanced Contrastive Language Image Pretraining for Multi-modal Head MRI | Mar 25, 2025 | Contrastive Learning, Image Segmentation | Unverified | 0 |
| Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis | Mar 25, 2025 | Contrastive Learning, Image-text Retrieval | Code Available | 2 |
| Anatomy-Aware Conditional Image-Text Retrieval | Mar 10, 2025 | Anatomy, Contrastive Learning | Unverified | 0 |
| Variance-Aware Loss Scheduling for Multimodal Alignment in Low-Data Settings | Mar 5, 2025 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning | Mar 4, 2025 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations | Mar 2, 2025 | image-classification, Image Classification | Unverified | 0 |
| ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning | Feb 27, 2025 | Cross-Modal Retrieval, Cross-modal retrieval with noisy correspondence | Code Available | 1 |
| Progressive Local Alignment for Medical Multimodal Pre-training | Feb 25, 2025 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features | Feb 20, 2025 | Fairness, Image-text Retrieval | Code Available | 0 |
| Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach | Feb 10, 2025 | Federated Learning, Image-text Retrieval | Unverified | 0 |
| Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Feb 9, 2025 | Image Captioning, Image-text Retrieval | Code Available | 3 |
| DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions | Feb 7, 2025 | Anomaly Detection, Image-text Retrieval | Unverified | 0 |
| MASS: Overcoming Language Bias in Image-Text Matching | Jan 20, 2025 | Image-text matching, Image-text Retrieval | Unverified | 0 |
| TSVC: Tripartite Learning with Semantic Variation Consistency for Robust Image-Text Retrieval | Jan 19, 2025 | Cross-Modal Retrieval, Image-text Retrieval | Unverified | 0 |
| BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature | Jan 13, 2025 | Articles, Image-text Retrieval | Code Available | 2 |
| Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training | Jan 1, 2025 | Image-text Retrieval, Image to text | Unverified | 0 |
| Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval | Dec 26, 2024 | Image-text Retrieval, Information Retrieval | Code Available | 0 |
| I0T: Embedding Standardization Method Towards Zero Modality Gap | Dec 18, 2024 | Contrastive Learning, Image-text Retrieval | Code Available | 1 |
| Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses | Dec 11, 2024 | Image-text Retrieval, Question Answering | Unverified | 0 |
| Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning | Dec 10, 2024 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| VladVA: Discriminative Fine-tuning of LVLMs | Dec 5, 2024 | Image-text Retrieval, Representation Learning | Unverified | 0 |
| Approximate Fiber Product: A Preliminary Algebraic-Geometric Perspective on Multimodal Embedding Alignment | Nov 30, 2024 | Image-text Retrieval, Representation Learning | Unverified | 0 |
| Knowledge Transfer Across Modalities with Natural Language Supervision | Nov 23, 2024 | Image-text Retrieval, Novel Concepts | Unverified | 0 |
| Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training | Nov 20, 2024 | Contrastive Learning, image-classification | Unverified | 0 |
| A Survey of Medical Vision-and-Language Applications and Their Techniques | Nov 19, 2024 | Decision Making, Diagnostic | Code Available | 1 |
| Nearest Neighbor Normalization Improves Multimodal Retrieval | Oct 31, 2024 | Cross-Modal Retrieval, Image Captioning | Code Available | 1 |
| Multilingual Vision-Language Pre-training for the Remote Sensing Domain | Oct 30, 2024 | Cross-Modal Retrieval, image-classification | Code Available | 0 |
| GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning | Oct 20, 2024 | Image Retrieval, Image-text Retrieval | Code Available | 0 |
| CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning | Oct 15, 2024 | Image-text Retrieval, Text Retrieval | Unverified | 0 |
| AnyAttack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models | Oct 7, 2024 | Image Captioning, Image-text Retrieval | Unverified | 0 |
| From Unimodal to Multimodal: Scaling up Projectors to Align Modalities | Sep 28, 2024 | Image-text Retrieval, Semantic Similarity | Code Available | 0 |
| NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training | Sep 15, 2024 | Contrastive Learning, cross-modal alignment | Unverified | 0 |
| Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations | Sep 11, 2024 | Image-text Retrieval, Text Retrieval | Unverified | 0 |
| Toward Automatic Relevance Judgment using Vision-Language Models for Image-Text Retrieval Evaluation | Aug 2, 2024 | Image-text Retrieval, Retrieval | Unverified | 0 |
| PC^2: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval | Aug 2, 2024 | Cross-modal retrieval with noisy correspondence, Image-text Retrieval | Code Available | 1 |
| FiCo-ITR: bridging fine-grained and coarse-grained image-text retrieval for comparative performance analysis | Jul 29, 2024 | Image-text Retrieval, Model Selection | Code Available | 0 |