| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Apr 20, 2023 | Image Description, Language Modelling | Code Available | 7 | 5 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Oct 14, 2023 | Image Classification, Image Description | Code Available | 7 | 5 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Aug 24, 2023 | Chart Question Answering, FS-MEVQA | Code Available | 5 | 5 |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | May 4, 2023 | controllable image captioning, Image Captioning | Code Available | 3 | 5 |
| Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model | Mar 10, 2025 | Image Description, Image Generation | Code Available | 2 | 5 |
| PandaGPT: One Model To Instruction-Follow Them All | May 25, 2023 | All, Image Description | Code Available | 2 | 5 |
| Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions | Jun 11, 2024 | Hallucination, Image Description | Code Available | 2 | 5 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | May 16, 2025 | Cross-Modal Retrieval, Diagnostic | Code Available | 2 | 5 |
| Can Large Multimodal Models Uncover Deep Semantics Behind Images? | Feb 17, 2024 | Image Description | Code Available | 1 | 5 |
| Chatting Makes Perfect: Chat-based Image Retrieval | May 31, 2023 | Chat-based Image Retrieval, Image Description | Code Available | 1 | 5 |