| MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding | Jan 11, 2020 | Image Captioning, Image-text Retrieval | Code Available | 0 |
| Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval | Jun 11, 2018 | Image-text Retrieval, Retrieval | Code Available | 0 |
| Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | Feb 11, 2023 | Image-text Retrieval, Knowledge Graphs | Code Available | 0 |
| Adding simple structure at inference improves Vision-Language Compositionality | Jun 11, 2025 | Attribute, Image-text Retrieval | Code Available | 0 |
| Semantic-Preserving Augmentation for Robust Image-Text Retrieval | Mar 10, 2023 | Image-text Retrieval, Retrieval | Code Available | 0 |
| Intra-Modal Constraint Loss For Image-Text Retrieval | Jul 11, 2022 | Cross-Modal Retrieval, Image-text Retrieval | Code Available | 0 |
| Attacking Attention of Foundation Models Disrupts Downstream Tasks | Jun 3, 2025 | Depth Estimation, Image-text Retrieval | Code Available | 0 |
| Single-Stream Multi-Level Alignment for Vision-Language Pretraining | Mar 27, 2022 | Image-text Retrieval, Question Answering | Code Available | 0 |
| Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages | Jun 29, 2023 | Image-text Retrieval, Machine Translation | Code Available | 0 |
| An Unsupervised Cross-Modal Hashing Method Robust to Noisy Training Image-Text Correspondences in Remote Sensing | Feb 26, 2022 | Image-text Retrieval, Meta-Learning | Code Available | 0 |