| Treble Counterfactual VLMs: A Causal Approach to Hallucination | Mar 8, 2025 | Autonomous Driving, counterfactual | Code Available | 0 |
| Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models | Mar 8, 2025 | Caption Generation, Question Answering | Unverified | 0 |
| SplatTalk: 3D VQA with Gaussian Splatting | Mar 8, 2025 | 3DGS, Question Answering | Unverified | 0 |
| MoEMoE: Question Guided Dense and Scalable Sparse Mixture-of-Expert for Multi-source Multi-modal Answering | Mar 8, 2025 | Answer Generation, Mixture-of-Experts | Unverified | 0 |
| Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Mar 6, 2025 | General Knowledge, Image Captioning | Code Available | 2 |
| Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation | Mar 6, 2025 | Active Learning, Image Segmentation | Unverified | 0 |
| AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM | Mar 6, 2025 | Anomaly Detection, Language Modeling | Code Available | 2 |
| Question-Aware Gaussian Experts for Audio-Visual Question Answering | Mar 6, 2025 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 |
| Enhancing Vietnamese VQA through Curriculum Learning on Raw and Augmented Text Representations | Mar 5, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| OWLViz: An Open-World Benchmark for Visual Question Answering | Mar 4, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA | Mar 4, 2025 | Medical Diagnosis, Question Answering | Code Available | 0 |
| Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models | Mar 3, 2025 | Memorization, Question Answering | Code Available | 0 |
| FunBench: Benchmarking Fundus Reading Skills of MLLMs | Mar 2, 2025 | Anatomy, Benchmarking | Unverified | 0 |
| CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering | Mar 1, 2025 | Continual Learning, Language Modeling | Unverified | 0 |
| Fine-Grained Retrieval-Augmented Generation for Visual Question Answering | Feb 28, 2025 | Question Answering, RAG | Unverified | 0 |
| MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models | Feb 28, 2025 | Decision Making, Hallucination | Code Available | 0 |
| Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios | Feb 27, 2025 | Data Integration, Question Answering | Unverified | 0 |
| MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning | Feb 26, 2025 | Domain Generalization, Medical Image Analysis | Unverified | 0 |
| Talking to the brain: Using Large Language Models as Proxies to Model Brain Semantic Representation | Feb 26, 2025 | Question Answering, valid | Unverified | 0 |
| Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference | Feb 25, 2025 | Question Answering, RAG | Code Available | 0 |
| FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA | Feb 25, 2025 | Question Answering, Retrieval | Unverified | 0 |
| MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs | Feb 24, 2025 | Question Answering, Visual Question Answering | Code Available | 3 |
| All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark | Feb 24, 2025 | All, Multimodal Reasoning | Unverified | 0 |
| Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines | Feb 23, 2025 | Answer Generation, Language Modeling | Unverified | 0 |
| Tracking the Copyright of Large Vision-Language Models through Parameter Learning Adversarial Images | Feb 23, 2025 | Adversarial Attack, Question Answering | Unverified | 0 |
| TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba | Feb 21, 2025 | image-classification, Image Classification | Unverified | 0 |
| Directional Gradient Projection for Robust Fine-Tuning of Foundation Models | Feb 21, 2025 | image-classification, Image Classification | Unverified | 0 |
| Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models | Feb 20, 2025 | Question Answering, Visual Question Answering | Code Available | 2 |
| ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model | Feb 20, 2025 | Mixture-of-Experts, Question Answering | Code Available | 1 |
| Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison | Feb 20, 2025 | Diversity, Language Modeling | Unverified | 0 |
| Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning | Feb 19, 2025 | Autonomous Driving, Bench2Drive | Unverified | 0 |
| PitVQA++: Vector Matrix-Low-Rank Adaptation for Open-Ended Visual Question Answering in Pituitary Surgery | Feb 19, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models | Feb 18, 2025 | Image Comprehension, Question Answering | Unverified | 0 |
| Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization | Feb 18, 2025 | Image Retrieval, Question Answering | Code Available | 2 |
| SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation | Feb 18, 2025 | Object Rearrangement, Robot Manipulation | Code Available | 3 |
| MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression | Feb 17, 2025 | Diagnostic, Question Answering | Code Available | 1 |
| "See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models | Feb 17, 2025 | Object Recognition, Question Answering | Unverified | 0 |
| Visual Graph Question Answering with ASP and LLMs for Language Parsing | Feb 13, 2025 | Graph Question Answering, Optical Character Recognition | Unverified | 0 |
| Abduction of Domain Relationships from Data for VQA | Feb 13, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| EmoAssist: Emotional Assistant for Visual Impairment Community | Feb 13, 2025 | Emotional Intelligence, Question Answering | Unverified | 0 |
| Vision-Language Models for Edge Networks: A Comprehensive Survey | Feb 11, 2025 | Autonomous Vehicles, Image Captioning | Unverified | 0 |
| ClinKD: Cross-Modal Clinical Knowledge Distiller For Multi-Task Medical Images | Feb 9, 2025 | Clinical Knowledge, Medical Visual Question Answering | Code Available | 0 |
| Performance Analysis of Traditional VQA Models Under Limited Computational Resources | Feb 9, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment | Feb 7, 2025 | Diversity, Human-Object Interaction Detection | Unverified | 0 |
| Efficient Few-Shot Continual Learning in Vision-Language Models | Feb 6, 2025 | Continual Learning, Image Captioning | Unverified | 0 |
| No Images, No Problem: Retaining Knowledge in Continual VQA with Questions-Only Memory | Feb 6, 2025 | Continual Learning, Question Answering | Code Available | 0 |
| PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? | Feb 6, 2025 | Question Answering, Referring Expression | Code Available | 1 |
| DocMIA: Document-Level Membership Inference Attacks against DocVQA Models | Feb 6, 2025 | document understanding, Inference Attack | Code Available | 0 |
| Exploring Spatial Language Grounding Through Referring Expressions | Feb 4, 2025 | Image Captioning, Negation | Unverified | 0 |
| Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models | Feb 3, 2025 | Adversarial Robustness, Image Captioning | Code Available | 1 |