| Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review | Mar 4, 2024 | Medical Report Generation, Question Answering | Code Available | 3 |
| ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models | Feb 18, 2024 | Language Modelling, Question Answering | Code Available | 3 |
| PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers | Feb 13, 2024 | Question Answering, Retrieval | Code Available | 3 |
| Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs | Feb 11, 2024 | Image Quality Assessment, Question Answering | Code Available | 3 |
| Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey | Feb 8, 2024 | Entity Alignment | Code Available | 3 |
| LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model | Jan 4, 2024 | Language Modelling | Code Available | 3 |
| TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones | Dec 28, 2023 | Computational Efficiency, Image Captioning | Code Available | 3 |
| DriveLM: Driving with Graph Visual Question Answering | Dec 21, 2023 | Autonomous Driving, Question Answering | Code Available | 3 |
| Generative Multimodal Models are In-Context Learners | Dec 20, 2023 | In-Context Learning, Personalized Image Generation | Code Available | 3 |
| SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery | Dec 15, 2023 | Contrastive Learning, Earth Observation | Code Available | 3 |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | Dec 11, 2023 | Chart Understanding, Decoder | Code Available | 3 |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | Nov 11, 2023 | Image Captioning, MMR total | Code Available | 3 |
| Emu: Generative Pretraining in Multimodality | Jul 11, 2023 | Image Captioning, Image Generation | Code Available | 3 |
| Champion Solution for the WSDM2023 Toloka VQA Challenge | Jan 22, 2023 | Question Answering, Visual Grounding | Code Available | 3 |
| Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Oct 17, 2022 | Few-Shot Learning, Image Captioning | Code Available | 3 |
| All You May Need for VQA are Image Captions | May 4, 2022 | Image Captioning | Code Available | 3 |
| Bilinear Attention Networks | May 21, 2018 | Visual Question Answering (VQA) | Code Available | 3 |
| FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation | Jun 10, 2025 | Image-text Retrieval, Question Answering | Code Available | 2 |
| VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use | May 25, 2025 | Multimodal Reasoning, Question Answering | Code Available | 2 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | May 16, 2025 | Cross-Modal Retrieval, Diagnostic | Code Available | 2 |
| MedM-VL: What Makes a Good Medical LVLM? | Apr 6, 2025 | Medical Image Analysis, Question Answering | Code Available | 2 |
| ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement | Apr 2, 2025 | Decoder, Image Generation | Code Available | 2 |
| FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning | Apr 1, 2025 | Audio-Visual Question Answering (AVQA) | Code Available | 2 |
| Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis | Mar 25, 2025 | Contrastive Learning, Image-text Retrieval | Code Available | 2 |
| MC-LLaVA: Multi-Concept Personalized Vision-Language Model | Mar 24, 2025 | Language Modelling | Code Available | 2 |