| Multi-Sourced Compositional Generalization in Visual Question Answering | May 29, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| NegVQA: Can Vision Language Models Understand Negation? | May 28, 2025 | Negation, Question Answering | Unverified | 0 |
| FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering | May 27, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs | May 27, 2025 | Audio-visual Question Answering, Question Answering | Code Available | 0 |
| MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | May 26, 2025 | Benchmarking, Minecraft | Code Available | 1 |
| Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat | May 26, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding | May 26, 2025 | Question Answering, Visual Question Answering | Code Available | 1 |
| MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering | May 26, 2025 | Continual Learning, Question Answering | Code Available | 0 |
| Visualized Text-to-Image Retrieval | May 26, 2025 | Image Retrieval, Question Answering | Code Available | 1 |
| VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use | May 25, 2025 | Multimodal Reasoning, Question Answering | Code Available | 2 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | Anatomy, Benchmarking | Code Available | 1 |
| GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance | May 25, 2025 | Caption Generation, Question Answering | Unverified | 0 |
| InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts | May 25, 2025 | Chart Understanding, Question Answering | Code Available | 3 |
| SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards | May 25, 2025 | Image Captioning, Multimodal Reasoning | Code Available | 1 |
| Scaling Up Biomedical Vision-Language Models: Fine-Tuning, Instruction Tuning, and Multi-Modal Learning | May 23, 2025 | Decoder, Image Captioning | Code Available | 4 |
| CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays | May 23, 2025 | Diagnostic, Question Answering | Code Available | 0 |
| VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models | May 23, 2025 | Question Answering, Visual Question Answering | Code Available | 1 |
| Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | May 22, 2025 | Hallucination, Image Description | Code Available | 1 |
| CT-Agent: A Multimodal-LLM Agent for 3D CT Radiology Question Answering | May 22, 2025 | Computed Tomography (CT), Question Answering | Unverified | 0 |
| A Causal Approach to Mitigate Modality Preference Bias in Medical Visual Question Answering | May 22, 2025 | counterfactual, Medical Visual Question Answering | Unverified | 0 |
| Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation | May 22, 2025 | Hallucination, Image Captioning | Unverified | 0 |
| Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | May 22, 2025 | Benchmarking, Evidence Selection | Code Available | 1 |
| Zero-Shot Anomaly Detection in Battery Thermal Images Using Visual Question Answering with Prior Knowledge | May 22, 2025 | Anomaly Detection, Question Answering | Unverified | 0 |
| Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding | May 22, 2025 | Causal Inference, Hallucination | Unverified | 0 |
| Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports | May 22, 2025 | Answer Generation, Question Answering | Unverified | 0 |
| Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification | May 21, 2025 | Data Augmentation, Large Language Model | Unverified | 0 |
| Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning | May 21, 2025 | Computational Efficiency, Diagnostic | Unverified | 0 |
| TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving | May 21, 2025 | Autonomous Driving, Question Answering | Unverified | 0 |
| Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | May 21, 2025 | Dataset Generation, Descriptive | Unverified | 0 |
| TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models | May 21, 2025 | Human Aging, Question Answering | Code Available | 0 |
| Visual Question Answering on Multiple Remote Sensing Image Modalities | May 21, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| SNAP: A Benchmark for Testing the Effects of Capture Conditions on Fundamental Vision Tasks | May 21, 2025 | Image Classification | Code Available | 0 |
| Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | May 21, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models | May 20, 2025 | Medical Visual Question Answering, Question Answering | Unverified | 0 |
| Debating for Better Reasoning: An Unsupervised Multimodal Approach | May 20, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method | May 20, 2025 | Hallucination, Object Localization | Unverified | 0 |
| Domain Adaptation of VLM for Soccer Video Understanding | May 20, 2025 | Action Classification, Domain Adaptation | Unverified | 0 |
| RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding | May 20, 2025 | Image Captioning, Question Answering | Code Available | 0 |
| Understanding Complexity in VideoQA via Visual Program Generation | May 19, 2025 | Code Generation, Question Answering | Unverified | 0 |
| Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues? | May 19, 2025 | Logical Reasoning, Optical Character Recognition | Code Available | 1 |
| MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks | May 18, 2025 | Benchmarking, Medical Visual Question Answering | Code Available | 1 |
| HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | May 16, 2025 | Benchmarking, Ethics | Code Available | 0 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | May 16, 2025 | Cross-Modal Retrieval, Diagnostic | Code Available | 2 |
| TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs | May 16, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| End-to-End Vision Tokenizer Tuning | May 15, 2025 | Image Generation, Question Answering | Unverified | 0 |
| Variational Visual Question Answering | May 14, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Visually Interpretable Subtask Reasoning for Visual Question Answering | May 12, 2025 | Attribute, Object Recognition | Code Available | 0 |
| Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration | May 11, 2025 | Benchmarking, Descriptive | Unverified | 0 |
| OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval | May 10, 2025 | Cross-Modal Retrieval, Question Answering | Unverified | 0 |
| Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving | May 9, 2025 | Autonomous Driving, Backdoor Attack | Unverified | 0 |