| PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly | Jun 10, 2025 | Question Answering, Scene Understanding | Unverified | 0 |
| HAIBU-ReMUD: Reasoning Multimodal Ultrasound Dataset and Model Bridging to General Specific Domains | Jun 9, 2025 | Diagnostic, Question Answering | Code Available | 0 |
| Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning | Jun 8, 2025 | Attribute, Hallucination | Unverified | 0 |
| Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning | Jun 8, 2025 | Medical Report Generation, Question Answering | Unverified | 0 |
| Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering | Jun 7, 2025 | In-Context Learning, Meta-Learning | Unverified | 0 |
| Ontology-based knowledge representation for bone disease diagnosis: a foundation for safe and sustainable medical artificial intelligence systems | Jun 5, 2025 | Diagnostic, Multimodal Deep Learning | Unverified | 0 |
| TextVidBench: A Benchmark for Long Video Scene Text Understanding | Jun 5, 2025 | Prompt Engineering, Question Answering | Unverified | 0 |
| ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding | Jun 4, 2025 | Negation, Negation Detection | Unverified | 0 |
| Learning Sparsity for Effective and Efficient Music Performance Question Answering | Jun 2, 2025 | Audio-visual Question Answering, Question Answering | Unverified | 0 |
| Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation | Jun 2, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering | Jun 1, 2025 | All, MME | Unverified | 0 |
| MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility | May 30, 2025 | Decision Making, Medical Diagnosis | Unverified | 0 |
| Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models | May 30, 2025 | Image Captioning, Question Answering | Unverified | 0 |
| Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck | May 30, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Multi-Sourced Compositional Generalization in Visual Question Answering | May 29, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| Synthetic Document Question Answering in Hungarian | May 29, 2025 | Optical Character Recognition (OCR), Question Answering | Code Available | 0 |
| QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining | May 29, 2025 | Question Answering, Representation Learning | Code Available | 0 |
| mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation | May 29, 2025 | Question Answering, RAG | Unverified | 0 |
| NegVQA: Can Vision Language Models Understand Negation? | May 28, 2025 | Negation, Question Answering | Unverified | 0 |
| Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs | May 27, 2025 | Audio-visual Question Answering, Question Answering | Code Available | 0 |
| FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering | May 27, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat | May 26, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering | May 26, 2025 | Continual Learning, Question Answering | Code Available | 0 |
| GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance | May 25, 2025 | Caption Generation, Question Answering | Unverified | 0 |
| CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays | May 23, 2025 | Diagnostic, Question Answering | Code Available | 0 |