| Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification | May 21, 2025 | Data Augmentation, Large Language Model | Unverified | 0 |
| Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning | May 21, 2025 | Computational Efficiency, Diagnostic | Unverified | 0 |
| TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving | May 21, 2025 | Autonomous Driving, Question Answering | Unverified | 0 |
| Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | May 21, 2025 | Dataset Generation, Descriptive | Unverified | 0 |
| TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models | May 21, 2025 | Human Aging, Question Answering | Code Available | 0 |
| Visual Question Answering on Multiple Remote Sensing Image Modalities | May 21, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| SNAP: A Benchmark for Testing the Effects of Capture Conditions on Fundamental Vision Tasks | May 21, 2025 | image-classification, Image Classification | Code Available | 0 |
| Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | May 21, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models | May 20, 2025 | Medical Visual Question Answering, Question Answering | Unverified | 0 |
| Debating for Better Reasoning: An Unsupervised Multimodal Approach | May 20, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method | May 20, 2025 | Hallucination, Object Localization | Unverified | 0 |
| Domain Adaptation of VLM for Soccer Video Understanding | May 20, 2025 | Action Classification, Domain Adaptation | Unverified | 0 |
| RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding | May 20, 2025 | Image Captioning, Question Answering | Code Available | 0 |
| Understanding Complexity in VideoQA via Visual Program Generation | May 19, 2025 | Code Generation, Question Answering | Unverified | 0 |
| Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues? | May 19, 2025 | Logical Reasoning, Optical Character Recognition | Code Available | 1 |
| MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks | May 18, 2025 | Benchmarking, Medical Visual Question Answering | Code Available | 1 |
| HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | May 16, 2025 | Benchmarking, Ethics | Code Available | 0 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | May 16, 2025 | Cross-Modal Retrieval, Diagnostic | Code Available | 2 |
| TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs | May 16, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| End-to-End Vision Tokenizer Tuning | May 15, 2025 | Image Generation, Question Answering | Unverified | 0 |
| Variational Visual Question Answering | May 14, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Visually Interpretable Subtask Reasoning for Visual Question Answering | May 12, 2025 | Attribute, Object Recognition | Code Available | 0 |
| Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration | May 11, 2025 | Benchmarking, Descriptive | Unverified | 0 |
| OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval | May 10, 2025 | Cross-Modal Retrieval, Question Answering | Unverified | 0 |
| Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving | May 9, 2025 | Autonomous Driving, Backdoor Attack | Unverified | 0 |