| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Oct 14, 2023 | Image Classification, Image Description | Code Available | 7 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Apr 20, 2023 | Image Description, Language Modelling | Code Available | 7 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Aug 24, 2023 | Chart Question Answering, FS-MEVQA | Code Available | 5 |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | May 4, 2023 | controllable image captioning, Image Captioning | Code Available | 3 |
| PandaGPT: One Model To Instruction-Follow Them All | May 25, 2023 | All, Image Description | Code Available | 2 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | May 16, 2025 | Cross-Modal Retrieval, Diagnostic | Code Available | 2 |
| Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model | Mar 10, 2025 | Image Description, Image Generation | Code Available | 2 |
| Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions | Jun 11, 2024 | Hallucination, Image Description | Code Available | 2 |
| Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | May 19, 2015 | Image Description, Phrase Grounding | Code Available | 1 |
| A skeletonization algorithm for gradient-based optimization | Sep 5, 2023 | Benchmarking, Deep Learning | Code Available | 1 |
| UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling | Nov 23, 2021 | Image Captioning, Image Description | Code Available | 1 |
| DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset | Dec 8, 2022 | Diversity, Image Description | Code Available | 1 |
| Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation | Oct 20, 2022 | Decoder, Image Captioning | Code Available | 1 |
| Zero-Shot Out-of-Distribution Detection Based on the Pre-trained Model CLIP | Sep 6, 2021 | Image Description, Out-of-Distribution Detection | Code Available | 1 |
| Revisiting Binary Local Image Description for Resource Limited Devices | Aug 18, 2021 | Image Description, Triplet | Code Available | 1 |
| Towards image compression with perfect realism at ultra-low bitrates | Oct 16, 2023 | Image Compression, Image Description | Code Available | 1 |
| Grounded Video Description | Dec 17, 2018 | Image Description, Sentence | Code Available | 1 |
| Text-Visual Semantic Constrained AI-Generated Image Quality Assessment | Jul 14, 2025 | Image Description, Image Quality Assessment | Code Available | 1 |
| Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | May 22, 2025 | Hallucination, Image Description | Code Available | 1 |
| Chatting Makes Perfect: Chat-based Image Retrieval | May 31, 2023 | Chat-based Image Retrieval, Image Description | Code Available | 1 |
| CIDEr: Consensus-based Image Description Evaluation | Nov 20, 2014 | Action Recognition, Attribute | Code Available | 1 |
| Can Large Multimodal Models Uncover Deep Semantics Behind Images? | Feb 17, 2024 | Image Description | Code Available | 1 |
| Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | Feb 23, 2016 | image-classification, Image Classification | Code Available | 1 |
| SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models | Mar 4, 2025 | Image Description | Code Available | 1 |
| Focused Evaluation for Image Description with Binary Forced-Choice Tasks | Aug 1, 2016 | Image Captioning, Image Description | Unverified | 0 |
| Computer Vision and Conflicting Values: Describing People with Automated Alt Text | May 26, 2021 | Image Description | Unverified | 0 |
| A Fine-Grained Image Description Generation Method Based on Joint Objectives | Sep 2, 2023 | Image Description, Object | Unverified | 0 |
| A Genetic Algorithm Approach for Image Representation Learning through Color Quantization | Nov 18, 2017 | Content-Based Image Retrieval, Image Description | Unverified | 0 |
| A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching | Jun 1, 2013 | Image Description, Video Description | Unverified | 0 |
| From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning | Oct 11, 2016 | Form, Grounded language learning | Unverified | 0 |
| Comparing Automatic Evaluation Measures for Image Description | Jun 1, 2014 | Image Description, Slot Filling | Unverified | 0 |
| Collecting Image Description Datasets using Crowdsourcing | Nov 12, 2014 | Image Description, Sentence | Unverified | 0 |
| Advanced Chest X-Ray Analysis via Transformer-Based Image Descriptors and Cross-Model Attention Mechanism | Apr 23, 2025 | Decoder, Image Description | Unverified | 0 |
| Doubly-Attentive Decoder for Multi-modal Neural Machine Translation | Feb 4, 2017 | Decoder, Image Description | Unverified | 0 |
| A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models | Feb 28, 2024 | Image Description, Question Answering | Unverified | 0 |
| Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description | Oct 19, 2017 | Image Description, Machine Translation | Unverified | 0 |
| Generating Image Captions in Arabic using Root-Word Based Recurrent Neural Networks and Deep Neural Networks | Jun 1, 2018 | Caption Generation, Image Captioning | Unverified | 0 |
| Exploring the Use of Contrastive Language-Image Pre-Training for Human Posture Classification: Insights from Yoga Pose Analysis | Jan 13, 2025 | Image Description, Transfer Learning | Unverified | 0 |
| Artwork Explanation in Large-scale Vision Language Models | Feb 29, 2024 | Explanation Generation, Image Description | Unverified | 0 |
| Exploring Visual Relationship for Image Captioning | Sep 19, 2018 | Decoder, Image Captioning | Unverified | 0 |
| DiffCap: Exploring Continuous Diffusion on Image Captioning | May 20, 2023 | Caption Generation, Diversity | Unverified | 0 |
| DIDEC: The Dutch Image Description and Eye-tracking Corpus | Aug 1, 2018 | Image Description, Specificity | Unverified | 0 |
| A Preliminary Survey of Semantic Descriptive Model for Images | Jan 13, 2025 | Descriptive, Image Description | Unverified | 0 |
| Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space | Nov 19, 2017 | Caption Generation, Image Description | Unverified | 0 |
| Adding the Third Dimension to Spatial Relation Detection in 2D Images | Nov 1, 2018 | Image Description, Object | Unverified | 0 |
| Don't Mention the Shoe! A Learning to Rank Approach to Content Selection for Image Description Generation | Sep 1, 2016 | Image Description, Image Retrieval | Unverified | 0 |
| Exploring the Behavior of Classic REG Algorithms in the Description of Characters in 3D Images | Sep 1, 2017 | Image Description, Referring Expression | Unverified | 0 |
| Draw and Tell: Multimodal Descriptions Outperform Verbal- or Sketch-Only Descriptions in an Image Retrieval Task | Nov 1, 2017 | Image Description, Image Retrieval | Unverified | 0 |
| A Shared Task on Multimodal Machine Translation and Crosslingual Image Description | Aug 1, 2016 | Image Description, Image Retrieval | Unverified | 0 |
| Face2Text revisited: Improved data set and baseline results | May 24, 2022 | Image Description, Transfer Learning | Unverified | 0 |