| AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM | Mar 6, 2025 | Anomaly Detection; Language Modeling | Code Available | 2 | 5 |
| A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding | Jul 2, 2024 | document understanding; Key Information Extraction | Code Available | 2 | 5 |
| Doe-1: Closed-Loop Autonomous Driving with Large World Model | Dec 12, 2024 | Autonomous Driving; Decision Making | Code Available | 2 | 5 |
| MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | Oct 3, 2023 | Chatbot; Image Captioning | Code Available | 2 | 5 |
| Med-Flamingo: a Multimodal Medical Few-shot Learner | Jul 27, 2023 | Medical Visual Question Answering; Question Answering | Code Available | 2 | 5 |
| Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models | Jan 25, 2025 | Attribute; Contrastive Learning | Code Available | 2 | 5 |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Aug 19, 2023 | MME; Optical Character Recognition (OCR) | Code Available | 2 | 5 |
| BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks | May 26, 2023 | Image Captioning; Medical Visual Question Answering | Code Available | 2 | 5 |
| Efficient Large Multi-modal Models via Visual Context Compression | Jun 28, 2024 | Question Answering; Visual Question Answering | Code Available | 2 | 5 |
| BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities | Dec 10, 2024 | Medical Visual Question Answering; Question Answering | Code Available | 2 | 5 |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | Oct 9, 2024 | cross-modal alignment; Visual Question Answering | Code Available | 2 | 5 |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | Nov 9, 2023 | Instruction Following; LLM real-life tasks | Code Available | 2 | 5 |
| LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models | Mar 22, 2024 | Language Modelling; Large Language Model | Code Available | 2 | 5 |
| LLMGA: Multimodal Large Language Model based Generation Assistant | Nov 27, 2023 | Image Generation; Language Modeling | Code Available | 2 | 5 |
| Beyond Text: Frozen Large Language Models in Visual Signal Comprehension | Mar 12, 2024 | Deblurring; Decoder | Code Available | 2 | 5 |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Apr 25, 2024 | Visual Grounding; Visual Question Answering | Code Available | 2 | 5 |
| Large Continual Instruction Assistant | Oct 8, 2024 | Question Answering; Semantic Similarity | Code Available | 2 | 5 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Nov 28, 2023 | Image Captioning; Question Answering | Code Available | 2 | 5 |
| Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Mar 6, 2025 | General Knowledge; Image Captioning | Code Available | 2 | 5 |
| LingoQA: Visual Question Answering for Autonomous Driving | Dec 21, 2023 | Autonomous Driving; Decision Making | Code Available | 2 | 5 |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | May 9, 2024 | Image Captioning; Instruction Following | Code Available | 2 | 5 |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | Dec 6, 2024 | Language Modeling; Language Modelling | Code Available | 2 | 5 |
| LOVA3: Learning to Visual Question Answering, Asking and Assessment | May 23, 2024 | Question Answering; Visual Question Answering | Code Available | 2 | 5 |
| Aligning Modalities in Vision Large Language Models via Preference Fine-tuning | Feb 18, 2024 | Hallucination; Instruction Following | Code Available | 2 | 5 |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | May 11, 2023 | 1 Image, 2*2 Stitching; Diversity | Code Available | 2 | 5 |