| Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval | Jun 26, 2025 | Cross-Modal Retrieval, Image-text Retrieval | Unverified | 0 |
| Adding simple structure at inference improves Vision-Language Compositionality | Jun 11, 2025 | Attribute, Image-text Retrieval | Code Available | 0 |
| FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation | Jun 10, 2025 | Image-text Retrieval, Question Answering | Code Available | 2 |
| Attacking Attention of Foundation Models Disrupts Downstream Tasks | Jun 3, 2025 | Depth Estimation, Image-text Retrieval | Code Available | 0 |
| Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation | May 25, 2025 | Contrastive Learning, Image-text Retrieval | Unverified | 0 |
| EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models | May 24, 2025 | Image-text Retrieval, Language Modeling | Unverified | 0 |
| Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval | May 22, 2025 | cross-modal alignment, Image-text Retrieval | Unverified | 0 |
| Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models | May 20, 2025 | Image-text Retrieval, Text Retrieval | Unverified | 0 |
| A Vision-Language Foundation Model for Leaf Disease Identification | May 11, 2025 | Contrastive Learning, image-classification | Code Available | 0 |
| FG-CLIP: Fine-Grained Visual and Textual Alignment | May 8, 2025 | Image-text Retrieval, object-detection | Code Available | 4 |