| PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly | Jun 10, 2025 | Question Answering, Scene Understanding | Unverified | 0 |
| HAIBU-ReMUD: Reasoning Multimodal Ultrasound Dataset and Model Bridging to General Specific Domains | Jun 9, 2025 | Diagnostic, Question Answering | Code Available | 0 |
| Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning | Jun 8, 2025 | Medical Report Generation, Question Answering | Unverified | 0 |
| Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning | Jun 8, 2025 | Attribute, Hallucination | Unverified | 0 |
| Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering | Jun 7, 2025 | In-Context Learning, Meta-Learning | Unverified | 0 |
| Ontology-based knowledge representation for bone disease diagnosis: a foundation for safe and sustainable medical artificial intelligence systems | Jun 5, 2025 | Diagnostic, Multimodal Deep Learning | Unverified | 0 |
| TextVidBench: A Benchmark for Long Video Scene Text Understanding | Jun 5, 2025 | Prompt Engineering, Question Answering | Unverified | 0 |
| ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding | Jun 4, 2025 | Negation, Negation Detection | Unverified | 0 |
| Learning Sparsity for Effective and Efficient Music Performance Question Answering | Jun 2, 2025 | Audio-visual Question Answering, Question Answering | Unverified | 0 |
| Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation | Jun 2, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering | Jun 1, 2025 | All, MME | Unverified | 0 |
| Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck | May 30, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models | May 30, 2025 | Image Captioning, Question Answering | Unverified | 0 |
| MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility | May 30, 2025 | Decision Making, Medical Diagnosis | Unverified | 0 |
| mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation | May 29, 2025 | Question Answering, RAG | Unverified | 0 |
| Synthetic Document Question Answering in Hungarian | May 29, 2025 | Optical Character Recognition (OCR), Question Answering | Code Available | 0 |
| QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining | May 29, 2025 | Question Answering, Representation Learning | Code Available | 0 |
| Multi-Sourced Compositional Generalization in Visual Question Answering | May 29, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| NegVQA: Can Vision Language Models Understand Negation? | May 28, 2025 | Negation, Question Answering | Unverified | 0 |
| Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs | May 27, 2025 | Audio-visual Question Answering, Question Answering | Code Available | 0 |
| FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering | May 27, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat | May 26, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering | May 26, 2025 | Continual Learning, Question Answering | Code Available | 0 |
| GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance | May 25, 2025 | Caption Generation, Question Answering | Unverified | 0 |
| CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays | May 23, 2025 | Diagnostic, Question Answering | Code Available | 0 |
| CT-Agent: A Multimodal-LLM Agent for 3D CT Radiology Question Answering | May 22, 2025 | Computed Tomography (CT), Question Answering | Unverified | 0 |
| Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation | May 22, 2025 | Hallucination, Image Captioning | Unverified | 0 |
| A Causal Approach to Mitigate Modality Preference Bias in Medical Visual Question Answering | May 22, 2025 | counterfactual, Medical Visual Question Answering | Unverified | 0 |
| Zero-Shot Anomaly Detection in Battery Thermal Images Using Visual Question Answering with Prior Knowledge | May 22, 2025 | Anomaly Detection, Question Answering | Unverified | 0 |
| Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports | May 22, 2025 | Answer Generation, Question Answering | Unverified | 0 |
| Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding | May 22, 2025 | Causal Inference, Hallucination | Unverified | 0 |
| Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | May 21, 2025 | Dataset Generation, Descriptive | Unverified | 0 |
| Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning | May 21, 2025 | Computational Efficiency, Diagnostic | Unverified | 0 |
| Visual Question Answering on Multiple Remote Sensing Image Modalities | May 21, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| SNAP: A Benchmark for Testing the Effects of Capture Conditions on Fundamental Vision Tasks | May 21, 2025 | image-classification, Image Classification | Code Available | 0 |
| Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | May 21, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving | May 21, 2025 | Autonomous Driving, Question Answering | Unverified | 0 |
| TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models | May 21, 2025 | Human Aging, Question Answering | Code Available | 0 |
| Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification | May 21, 2025 | Data Augmentation, Large Language Model | Unverified | 0 |
| Domain Adaptation of VLM for Soccer Video Understanding | May 20, 2025 | Action Classification, Domain Adaptation | Unverified | 0 |
| Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models | May 20, 2025 | Medical Visual Question Answering, Question Answering | Unverified | 0 |
| Debating for Better Reasoning: An Unsupervised Multimodal Approach | May 20, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method | May 20, 2025 | Hallucination, Object Localization | Unverified | 0 |
| RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding | May 20, 2025 | Image Captioning, Question Answering | Code Available | 0 |
| Understanding Complexity in VideoQA via Visual Program Generation | May 19, 2025 | Code Generation, Question Answering | Unverified | 0 |
| HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | May 16, 2025 | Benchmarking, Ethics | Code Available | 0 |
| TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs | May 16, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| End-to-End Vision Tokenizer Tuning | May 15, 2025 | Image Generation, Question Answering | Unverified | 0 |
| Variational Visual Question Answering | May 14, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Visually Interpretable Subtask Reasoning for Visual Question Answering | May 12, 2025 | Attribute, Object Recognition | Code Available | 0 |