| Describing Videos by Exploiting Temporal Structure | Feb 27, 2015 | Action Recognition, Image Description | Code Available | 0 | 5 |
| Bridging Languages through Images with Deep Partial Canonical Correlation Analysis | Jul 1, 2018 | Image Description, Image Retrieval | Code Available | 0 | 5 |
| Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval | Oct 10, 2022 | Cross-Modal Information Retrieval, Image Description | Code Available | 0 | 5 |
| Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs | Oct 15, 2024 | Image Description, Multiple-choice | Code Available | 0 | 5 |
| Deep Imbalanced Attribute Classification using Visual Attention Aggregation | Jul 10, 2018 | Attribute, Classification | Code Available | 0 | 5 |
| Does Multimodality Help Human and Machine for Translation and Image Captioning? | May 30, 2016 | Image Captioning, Image Description | Code Available | 0 | 5 |
| Bounding and Filling: A Fast and Flexible Framework for Image Captioning | Oct 15, 2023 | Image Captioning, Image Description | Code Available | 0 | 5 |
| IDEA: Image Description Enhanced CLIP-Adapter | Jan 15, 2025 | Few-Shot Image Classification, image-classification | Code Available | 0 | 5 |
| Efficient Decentralized Visual Place Recognition From Full-Image Descriptors | May 30, 2017 | Clustering, Image Description | Code Available | 0 | 5 |
| Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze | Nov 9, 2020 | cross-modal alignment, Image Captioning | Code Available | 0 | 5 |