| TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba | Feb 21, 2025 | Image Classification | Unverified | 0 |
| Directional Gradient Projection for Robust Fine-Tuning of Foundation Models | Feb 21, 2025 | Image Classification | Unverified | 0 |
| Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models | Feb 20, 2025 | Question Answering, Visual Question Answering | Code Available | 2 |
| ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model | Feb 20, 2025 | Mixture-of-Experts, Question Answering | Code Available | 1 |
| Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison | Feb 20, 2025 | Diversity, Language Modeling | Unverified | 0 |
| Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning | Feb 19, 2025 | Autonomous Driving, Bench2Drive | Unverified | 0 |
| PitVQA++: Vector Matrix-Low-Rank Adaptation for Open-Ended Visual Question Answering in Pituitary Surgery | Feb 19, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models | Feb 18, 2025 | Image Comprehension, Question Answering | Unverified | 0 |
| Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization | Feb 18, 2025 | Image Retrieval, Question Answering | Code Available | 2 |
| SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation | Feb 18, 2025 | Object Rearrangement, Robot Manipulation | Code Available | 3 |
| MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression | Feb 17, 2025 | Diagnostic, Question Answering | Code Available | 1 |
| "See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models | Feb 17, 2025 | Object Recognition, Question Answering | Unverified | 0 |
| Visual Graph Question Answering with ASP and LLMs for Language Parsing | Feb 13, 2025 | Graph Question Answering, Optical Character Recognition | Unverified | 0 |
| Abduction of Domain Relationships from Data for VQA | Feb 13, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| EmoAssist: Emotional Assistant for Visual Impairment Community | Feb 13, 2025 | Emotional Intelligence, Question Answering | Unverified | 0 |
| Vision-Language Models for Edge Networks: A Comprehensive Survey | Feb 11, 2025 | Autonomous Vehicles, Image Captioning | Unverified | 0 |
| ClinKD: Cross-Modal Clinical Knowledge Distiller For Multi-Task Medical Images | Feb 9, 2025 | Clinical Knowledge, Medical Visual Question Answering | Code Available | 0 |
| Performance Analysis of Traditional VQA Models Under Limited Computational Resources | Feb 9, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment | Feb 7, 2025 | Diversity, Human-Object Interaction Detection | Unverified | 0 |
| Efficient Few-Shot Continual Learning in Vision-Language Models | Feb 6, 2025 | Continual Learning, Image Captioning | Unverified | 0 |
| No Images, No Problem: Retaining Knowledge in Continual VQA with Questions-Only Memory | Feb 6, 2025 | Continual Learning, Question Answering | Code Available | 0 |
| PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? | Feb 6, 2025 | Question Answering, Referring Expression | Code Available | 1 |
| DocMIA: Document-Level Membership Inference Attacks against DocVQA Models | Feb 6, 2025 | document understanding, Inference Attack | Code Available | 0 |
| Exploring Spatial Language Grounding Through Referring Expressions | Feb 4, 2025 | Image Captioning, Negation | Unverified | 0 |
| Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models | Feb 3, 2025 | Adversarial Robustness, Image Captioning | Code Available | 1 |