| Attacking Attention of Foundation Models Disrupts Downstream Tasks | Jun 3, 2025 | Depth Estimation, Image-text Retrieval | Code Available | 0 |
| Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation | May 25, 2025 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models | May 24, 2025 | Image-text Retrieval, Language Modeling | Unverified | 0 |
| Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval | May 22, 2025 | cross-modal alignment, Image-text Retrieval | Unverified | 0 |
| Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models | May 20, 2025 | Image-text Retrieval, Text Retrieval | Unverified | 0 |
| A Vision-Language Foundation Model for Leaf Disease Identification | May 11, 2025 | Contrastive Learning, image-classification | Code Available | 0 |
| AGATE: Stealthy Black-box Watermarking for Multimodal Model Copyright Protection | Apr 28, 2025 | Adversarial Attack, Anomaly Detection | Unverified | 0 |
| Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs | Apr 24, 2025 | Image-text Retrieval, Instruction Following | Unverified | 0 |
| FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations | Apr 11, 2025 | image-classification, Image Classification | Unverified | 0 |
| SeLIP: Similarity Enhanced Contrastive Language Image Pretraining for Multi-modal Head MRI | Mar 25, 2025 | Contrastive Learning, Image Segmentation | Unverified | 0 |
| Anatomy-Aware Conditional Image-Text Retrieval | Mar 10, 2025 | Anatomy, Contrastive Learning | Unverified | 0 |
| Variance-Aware Loss Scheduling for Multimodal Alignment in Low-Data Settings | Mar 5, 2025 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning | Mar 4, 2025 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations | Mar 2, 2025 | image-classification, Image Classification | Unverified | 0 |
| Progressive Local Alignment for Medical Multimodal Pre-training | Feb 25, 2025 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features | Feb 20, 2025 | Fairness, Image-text Retrieval | Code Available | 0 |
| Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach | Feb 10, 2025 | Federated Learning, Image-text Retrieval | Unverified | 0 |
| DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions | Feb 7, 2025 | Anomaly Detection, Image-text Retrieval | Unverified | 0 |
| MASS: Overcoming Language Bias in Image-Text Matching | Jan 20, 2025 | Image-text matching, Image-text Retrieval | Unverified | 0 |
| TSVC: Tripartite Learning with Semantic Variation Consistency for Robust Image-Text Retrieval | Jan 19, 2025 | Cross-Modal Retrieval, Image-text Retrieval | Unverified | 0 |
| Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training | Jan 1, 2025 | Image-text Retrieval, Image to text | Unverified | 0 |
| Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval | Dec 26, 2024 | Image-text Retrieval, Information Retrieval | Code Available | 0 |
| Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses | Dec 11, 2024 | Image-text Retrieval, Question Answering | Unverified | 0 |
| Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning | Dec 10, 2024 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| VladVA: Discriminative Fine-tuning of LVLMs | Dec 5, 2024 | Image-text Retrieval, Representation Learning | Unverified | 0 |
| Approximate Fiber Product: A Preliminary Algebraic-Geometric Perspective on Multimodal Embedding Alignment | Nov 30, 2024 | Image-text Retrieval, Representation Learning | Unverified | 0 |
| Knowledge Transfer Across Modalities with Natural Language Supervision | Nov 23, 2024 | Image-text Retrieval, Novel Concepts | Unverified | 0 |
| Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training | Nov 20, 2024 | Contrastive Learning, image-classification | Unverified | 0 |
| Multilingual Vision-Language Pre-training for the Remote Sensing Domain | Oct 30, 2024 | Cross-Modal Retrieval, image-classification | Code Available | 0 |
| GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning | Oct 20, 2024 | Image Retrieval, Image-text Retrieval | Code Available | 0 |
| CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning | Oct 15, 2024 | Image-text Retrieval, Text Retrieval | Unverified | 0 |
| AnyAttack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models | Oct 7, 2024 | Image Captioning, Image-text Retrieval | Unverified | 0 |
| From Unimodal to Multimodal: Scaling up Projectors to Align Modalities | Sep 28, 2024 | Image-text Retrieval, Semantic Similarity | Code Available | 0 |
| NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training | Sep 15, 2024 | Contrastive Learning, cross-modal alignment | Unverified | 0 |
| Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations | Sep 11, 2024 | Image-text Retrieval, Text Retrieval | Unverified | 0 |
| Toward Automatic Relevance Judgment using Vision-Language Models for Image-Text Retrieval Evaluation | Aug 2, 2024 | Image-text Retrieval, Retrieval | Unverified | 0 |
| FiCo-ITR: bridging fine-grained and coarse-grained image-text retrieval for comparative performance analysis | Jul 29, 2024 | Image-text Retrieval, Model Selection | Code Available | 0 |
| Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective | Jul 21, 2024 | Image-text Retrieval, Information Retrieval | Unverified | 0 |
| Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval | Jul 17, 2024 | Image-text Retrieval, Object | Code Available | 0 |
| CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging | Jul 10, 2024 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval? | Jul 10, 2024 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| Improving the Consistency in Cross-Lingual Cross-Modal Retrieval with 1-to-K Contrastive Learning | Jun 26, 2024 | Contrastive Learning, Cross-Modal Retrieval | Code Available | 0 |
| Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval | Jun 9, 2024 | Image-text Retrieval, Person Retrieval | Unverified | 0 |
| Knowledge-grounded Adaptation Strategy for Vision-language Models: Building Unique Case-set for Screening Mammograms for Residents Training | May 30, 2024 | Image-text Retrieval, Language Modeling | Unverified | 0 |
| Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships | May 29, 2024 | Adversarial Defense, Adversarial Robustness | Unverified | 0 |
| Active Learning for Finely-Categorized Image-Text Retrieval by Selecting Hard Negative Unpaired Samples | May 25, 2024 | Active Learning, Image-text Retrieval | Unverified | 0 |
| Global–Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image–Text Retrieval | May 14, 2024 | Cross-Modal Retrieval, Cross-Modal Retrieval on RSITMD | Unverified | 0 |
| UrbanCross: Enhancing Satellite Image-Text Retrieval with Cross-Domain Adaptation | Apr 22, 2024 | Diversity, Domain Adaptation | Unverified | 0 |
| Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement | Apr 6, 2024 | Image-text Retrieval, object-detection | Unverified | 0 |
| LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrival | Mar 16, 2024 | Caption Generation, Image-text Retrieval | Unverified | 0 |