| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Oct 14, 2023 | Image Classification, Image Description | Code Available | 7 | 5 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Apr 20, 2023 | Image Description, Language Modelling | Code Available | 7 | 5 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Aug 24, 2023 | Chart Question Answering, FS-MEVQA | Code Available | 5 | 5 |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | May 4, 2023 | controllable image captioning, Image Captioning | Code Available | 3 | 5 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | May 16, 2025 | Cross-Modal Retrieval, Diagnostic | Code Available | 2 | 5 |
| Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions | Jun 11, 2024 | Hallucination, Image Description | Code Available | 2 | 5 |
| Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model | Mar 10, 2025 | Image Description, Image Generation | Code Available | 2 | 5 |
| PandaGPT: One Model To Instruction-Follow Them All | May 25, 2023 | All, Image Description | Code Available | 2 | 5 |
| Revisiting Binary Local Image Description for Resource Limited Devices | Aug 18, 2021 | Image Description, Triplet | Code Available | 1 | 5 |
| A skeletonization algorithm for gradient-based optimization | Sep 5, 2023 | Benchmarking, Deep Learning | Code Available | 1 | 5 |
| CIDEr: Consensus-based Image Description Evaluation | Nov 20, 2014 | Action Recognition, Attribute | Code Available | 1 | 5 |
| Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation | Oct 20, 2022 | Decoder, Image Captioning | Code Available | 1 | 5 |
| Chatting Makes Perfect: Chat-based Image Retrieval | May 31, 2023 | Chat-based Image Retrieval, Image Description | Code Available | 1 | 5 |
| Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | Feb 23, 2016 | image-classification, Image Classification | Code Available | 1 | 5 |
| Towards image compression with perfect realism at ultra-low bitrates | Oct 16, 2023 | Image Compression, Image Description | Code Available | 1 | 5 |
| Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | May 22, 2025 | Hallucination, Image Description | Code Available | 1 | 5 |
| SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models | Mar 4, 2025 | Image Description | Code Available | 1 | 5 |
| Zero-Shot Out-of-Distribution Detection Based on the Pre-trained Model CLIP | Sep 6, 2021 | Image Description, Out-of-Distribution Detection | Code Available | 1 | 5 |
| DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset | Dec 8, 2022 | Diversity, Image Description | Code Available | 1 | 5 |
| Can Large Multimodal Models Uncover Deep Semantics Behind Images? | Feb 17, 2024 | Image Description | Code Available | 1 | 5 |
| UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling | Nov 23, 2021 | Image Captioning, Image Description | Code Available | 1 | 5 |
| Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | May 19, 2015 | Image Description, Phrase Grounding | Code Available | 1 | 5 |
| Text-Visual Semantic Constrained AI-Generated Image Quality Assessment | Jul 14, 2025 | Image Description, Image Quality Assessment | Code Available | 1 | 5 |
| Grounded Video Description | Dec 17, 2018 | Image Description, Sentence | Code Available | 1 | 5 |
| ContextRef: Evaluating Referenceless Metrics For Image Description Generation | Sep 21, 2023 | Image Description | Code Available | 0 | 5 |
| Human Attention in Image Captioning: Dataset and Analysis | Mar 6, 2019 | Image Captioning, Image Description | Code Available | 0 | 5 |
| Compositional Obverter Communication Learning From Raw Visual Input | Apr 6, 2018 | Image Description | Code Available | 0 | 5 |
| Pragmatic factors in image description: the case of negations | Jun 20, 2016 | Image Description, Negation | Code Available | 0 | 5 |
| Multimodal Word Sense Disambiguation in Creative Practice | Jul 15, 2020 | Classification, Descriptive | Code Available | 0 | 5 |
| Contextualize, Show and Tell: A Neural Visual Storyteller | Jun 3, 2018 | Decoder, Image Description | Code Available | 0 | 5 |
| On Architectures for Including Visual Information in Neural Language Models for Image Description | Nov 9, 2019 | Image Description, Language Modeling | Code Available | 0 | 5 |
| CIDEr-R: Robust Consensus-based Image Description Evaluation | Sep 28, 2021 | Descriptive, Image Description | Code Available | 0 | 5 |
| Multi30K: Multilingual English-German Image Descriptions | May 2, 2016 | Image Description, Machine Translation | Code Available | 0 | 5 |
| Multilingual Image Description with Neural Sequence Models | Oct 15, 2015 | Image Captioning, Image Description | Code Available | 0 | 5 |
| Room for improvement in automatic image description: an error analysis | Apr 13, 2017 | Image Description | Code Available | 0 | 5 |
| Measuring the Diversity of Automatic Image Descriptions | Aug 1, 2018 | Diversity, Image Description | Code Available | 0 | 5 |
| MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets | Mar 5, 2024 | Diversity, Image Description | Code Available | 0 | 5 |
| MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps | Oct 18, 2024 | Image Description, Informativeness | Code Available | 0 | 5 |
| Localized Symbolic Knowledge Distillation for Visual Commonsense Models | Dec 8, 2023 | Image Description, Instruction Following | Code Available | 0 | 5 |
| Long-term Recurrent Convolutional Networks for Visual Recognition and Description | Nov 17, 2014 | Image Description, Retrieval | Code Available | 0 | 5 |
| Describing Videos by Exploiting Temporal Structure | Feb 27, 2015 | Action Recognition, Image Description | Code Available | 0 | 5 |
| Bridging Languages through Images with Deep Partial Canonical Correlation Analysis | Jul 1, 2018 | Image Description, Image Retrieval | Code Available | 0 | 5 |
| Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval | Oct 10, 2022 | Cross-Modal Information Retrieval, Image Description | Code Available | 0 | 5 |
| Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs | Oct 15, 2024 | Image Description, Multiple-choice | Code Available | 0 | 5 |
| Deep Imbalanced Attribute Classification using Visual Attention Aggregation | Jul 10, 2018 | Attribute, Classification | Code Available | 0 | 5 |
| Does Multimodality Help Human and Machine for Translation and Image Captioning? | May 30, 2016 | Image Captioning, Image Description | Code Available | 0 | 5 |
| Bounding and Filling: A Fast and Flexible Framework for Image Captioning | Oct 15, 2023 | Image Captioning, Image Description | Code Available | 0 | 5 |
| IDEA: Image Description Enhanced CLIP-Adapter | Jan 15, 2025 | Few-Shot Image Classification, image-classification | Code Available | 0 | 5 |
| Efficient Decentralized Visual Place Recognition From Full-Image Descriptors | May 30, 2017 | Clustering, Image Description | Code Available | 0 | 5 |
| Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze | Nov 9, 2020 | cross-modal alignment, Image Captioning | Code Available | 0 | 5 |