| Learning the Best Pooling Strategy for Visual Semantic Embedding | Nov 9, 2020 | Cross-Modal Information Retrieval, Image-text Retrieval | Code Available | 1 |
| A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports | Sep 3, 2020 | Image-text Retrieval, Medical Visual Question Answering | Code Available | 1 |
| Graph Optimal Transport for Cross-Domain Alignment | Jun 26, 2020 | Graph Matching, Image Captioning | Code Available | 1 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 |
| Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers | Apr 2, 2020 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval | Mar 8, 2020 | Cross-Modal Retrieval, Image-text Retrieval | Code Available | 1 |
| Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval | Oct 11, 2019 | Graph Matching, Image-text Retrieval | Code Available | 1 |
| UNITER: UNiversal Image-TExt Representation Learning | Sep 25, 2019 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval | Jun 26, 2025 | Cross-Modal Retrieval, Image-text Retrieval | Unverified | 0 |
| Adding simple structure at inference improves Vision-Language Compositionality | Jun 11, 2025 | Attribute, Image-text Retrieval | Code Available | 0 |