| Seeing the Unseen: Visual Common Sense for Semantic Placement | Jan 15, 2024 | Common Sense Reasoning, Image Description | Unverified | 0 |
| InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models | Dec 21, 2023 | Image Description | Unverified | 0 |
| Localized Symbolic Knowledge Distillation for Visual Commonsense Models | Dec 8, 2023 | Image Description, Instruction Following | Code Available | 0 |
| Impressions: Understanding Visual Semiotics and Aesthetic Impact | Oct 27, 2023 | Image Captioning, Image Description | Unverified | 0 |
| Large Language Models can Share Images, Too! | Oct 23, 2023 | Image Description, Sentence | Code Available | 0 |
| Towards image compression with perfect realism at ultra-low bitrates | Oct 16, 2023 | Image Compression, Image Description | Code Available | 1 |
| Bounding and Filling: A Fast and Flexible Framework for Image Captioning | Oct 15, 2023 | Image Captioning, Image Description | Code Available | 0 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Oct 14, 2023 | Image Classification, Image Description | Code Available | 7 |
| ContextRef: Evaluating Referenceless Metrics For Image Description Generation | Sep 21, 2023 | Image Description | Code Available | 0 |
| A skeletonization algorithm for gradient-based optimization | Sep 5, 2023 | Benchmarking, Deep Learning | Code Available | 1 |
| A Fine-Grained Image Description Generation Method Based on Joint Objectives | Sep 2, 2023 | Image Description, Object | Unverified | 0 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Aug 24, 2023 | Chart Question Answering, FS-MEVQA | Code Available | 5 |
| Chatting Makes Perfect: Chat-based Image Retrieval | May 31, 2023 | Chat-based Image Retrieval, Image Description | Code Available | 1 |
| PandaGPT: One Model To Instruction-Follow Them All | May 25, 2023 | All, Image Description | Code Available | 2 |
| DiffCap: Exploring Continuous Diffusion on Image Captioning | May 20, 2023 | Caption Generation, Diversity | Unverified | 0 |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | May 4, 2023 | controllable image captioning, Image Captioning | Code Available | 3 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Apr 20, 2023 | Image Description, Language Modelling | Code Available | 7 |
| Fan-Beam Binarization Difference Projection (FB-BDP): A Novel Local Object Descriptor for Fine-Grained Leaf Image Retrieval | Jan 1, 2023 | Binarization, Image Description | Code Available | 0 |
| DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset | Dec 8, 2022 | Diversity, Image Description | Code Available | 1 |
| Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation | Oct 20, 2022 | Decoder, Image Captioning | Code Available | 1 |
| Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval | Oct 10, 2022 | Cross-Modal Information Retrieval, Image Description | Code Available | 0 |
| Facial Expression Recognition and Image Description Generation in Vietnamese | Aug 12, 2022 | Descriptive, Emotion Recognition | Unverified | 0 |
| Skeletal Human Action Recognition using Hybrid Attention based Graph Convolutional Network | Jul 12, 2022 | Action Recognition, Image Description | Code Available | 0 |
| Image Description Dataset for Language Learners | Jun 1, 2022 | Image Description, Sentence | Unverified | 0 |
| Multilingual Image Corpus – Towards a Multimodal and Multilingual Dataset | Jun 1, 2022 | Caption Generation, image-classification | Unverified | 0 |