| Where do Large Vision-Language Models Look at when Answering Questions? | Mar 18, 2025 | Question Answering, Visual Question Answering | Code Available | 2 |
| DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding | Mar 13, 2025 | 4k, Autonomous Driving | Code Available | 2 |
| AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM | Mar 6, 2025 | Anomaly Detection, Language Modeling | Code Available | 2 |
| Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Mar 6, 2025 | General Knowledge, Image Captioning | Code Available | 2 |
| Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models | Feb 20, 2025 | Question Answering, Visual Question Answering | Code Available | 2 |
| Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization | Feb 18, 2025 | Image Retrieval, Question Answering | Code Available | 2 |
| Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models | Jan 25, 2025 | Attribute, Contrastive Learning | Code Available | 2 |
| A Simple Aerial Detection Baseline of Multimodal Language Models | Jan 16, 2025 | object-detection, Object Detection | Code Available | 2 |
| Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | Jan 14, 2025 | image-classification, Image Classification | Code Available | 2 |
| Dual Diffusion for Unified Image Generation and Understanding | Dec 31, 2024 | Image Generation, Language Modeling | Code Available | 2 |
| AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Dec 19, 2024 | Autonomous Driving, Benchmarking | Code Available | 2 |
| Doe-1: Closed-Loop Autonomous Driving with Large World Model | Dec 12, 2024 | Autonomous Driving, Decision Making | Code Available | 2 |
| Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine | Dec 12, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities | Dec 10, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 2 |
| TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | Dec 7, 2024 | Depth Estimation, Mathematical Reasoning | Code Available | 2 |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | Dec 6, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression | Dec 5, 2024 | Descriptive, Visual Question Answering | Code Available | 2 |
| Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification | Dec 1, 2024 | GPU, Visual Question Answering | Code Available | 2 |
| Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering | Nov 26, 2024 | Prognosis, Question Answering | Code Available | 2 |
| Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment | Nov 26, 2024 | Image Quality Assessment, Question Answering | Code Available | 2 |
| ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration | Nov 25, 2024 | AI Agent, Visual Question Answering | Code Available | 2 |
| Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering | Nov 25, 2024 | Question Answering, Visual Question Answering | Code Available | 2 |
| GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI | Nov 21, 2024 | Decision Making, Language Modeling | Code Available | 2 |
| MC-LLaVA: Multi-Concept Personalized Vision-Language Model | Nov 18, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| VQA^2: Visual Question Answering for Video Quality Assessment | Nov 6, 2024 | Question Answering, Video Quality Assessment | Code Available | 2 |
| MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding | Oct 15, 2024 | Visual Question Answering | Code Available | 2 |
| VoxelPrompt: A Vision-Language Agent for Grounded Medical Image Analysis | Oct 10, 2024 | Medical Image Analysis, Question Answering | Code Available | 2 |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | Oct 9, 2024 | cross-modal alignment, Visual Question Answering | Code Available | 2 |
| TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | Oct 8, 2024 | Change Detection, Earth Observation | Code Available | 2 |
| Large Continual Instruction Assistant | Oct 8, 2024 | Question Answering, Semantic Similarity | Code Available | 2 |
| Phantom of Latent for Large Language and Vision Models | Sep 23, 2024 | Visual Question Answering | Code Available | 2 |
| One missing piece in Vision and Language: A Survey on Comics Understanding | Sep 14, 2024 | document understanding, image-classification | Code Available | 2 |
| EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis | Sep 10, 2024 | Contrastive Learning, Cross-Modal Retrieval | Code Available | 2 |
| PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding | Aug 18, 2024 | Language Modelling, Question Answering | Code Available | 2 |
| A Survey on Benchmarks of Multimodal Large Language Models | Aug 16, 2024 | Question Answering, Survey | Code Available | 2 |
| GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI | Aug 6, 2024 | Question Answering, Visual Question Answering | Code Available | 2 |
| Towards A Generalizable Pathology Foundation Model via Unified Knowledge Distillation | Jul 26, 2024 | Knowledge Distillation, Question Answering | Code Available | 2 |
| MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity | Jul 22, 2024 | Diversity, Multiple-choice | Code Available | 2 |
| DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception | Jul 11, 2024 | Visual Question Answering | Code Available | 2 |
| WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering | Jul 8, 2024 | Diagnostic, Generative Visual Question Answering | Code Available | 2 |
| MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis | Jul 4, 2024 | Diagnostic, Language Modeling | Code Available | 2 |
| A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding | Jul 2, 2024 | document understanding, Key Information Extraction | Code Available | 2 |
| Efficient Large Multi-modal Models via Visual Context Compression | Jun 28, 2024 | Question Answering, Visual Question Answering | Code Available | 2 |
| MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning | Jun 25, 2024 | Object, Object Recognition | Code Available | 2 |
| TroL: Traversal of Layers for Large Language and Vision Models | Jun 18, 2024 | Visual Question Answering | Code Available | 2 |
| VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | Jun 18, 2024 | Image Captioning, Question Answering | Code Available | 2 |
| MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs | Jun 17, 2024 | Visual Question Answering | Code Available | 2 |
| Explore the Limits of Omni-modal Pretraining at Scale | Jun 13, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| Yo'LLaVA: Your Personalized Language and Vision Assistant | Jun 13, 2024 | Image Captioning, Question Answering | Code Available | 2 |
| Towards Vision-Language Geo-Foundation Model: A Survey | Jun 13, 2024 | Earth Observation, Image Captioning | Code Available | 2 |