Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning May 29, 2025 Diagnostic Question Answering
GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution May 27, 2025 8k Avg
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents May 26, 2025 Benchmarking Minecraft
Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging May 26, 2025 Language Modeling Language Modelling
SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards May 25, 2025 Image Captioning Multimodal Reasoning
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering May 25, 2025 Anatomy Benchmarking
Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework May 22, 2025 Multiple-choice Visual Question Answering (VQA)
MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks May 18, 2025 Benchmarking Medical Visual Question Answering
MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks May 9, 2025 Diagnostic Instruction Following
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering Apr 25, 2025 Caption Generation EgoSchema
NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results Apr 17, 2025 Form Image Super-Resolution
Mimic In-Context Learning for Multimodal Tasks Apr 11, 2025 In-Context Learning Visual Question Answering (VQA)
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection Apr 3, 2025 Instruction Following Language Modeling
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs Mar 27, 2025 Attribute Benchmarking
AMD-Hummingbird: Towards an Efficient Text-to-Video Model Mar 24, 2025 Computational Efficiency Video Generation
NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models Mar 17, 2025 Question Answering Scene Understanding
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research Mar 17, 2025 Articles Benchmarking
Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space Mar 14, 2025 Language Modeling Language Modelling
KVQ: Boosting Video Quality Assessment via Saliency-guided Local Perception Mar 13, 2025 Video Quality Assessment Visual Question Answering (VQA)
MMUnlearner: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models Feb 16, 2025 Language Modeling Language Modelling
Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency Feb 6, 2025 Video Generation Video Quality Assessment
Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models Feb 3, 2025 Adversarial Robustness Image Captioning
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation Jan 6, 2025 Language Model Evaluation Language Modeling
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? Jan 5, 2025 Image Captioning Image to text
Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering Jan 1, 2025 Large Language Model Multimodal Large Language Model
MedCoT: Medical Chain of Thought via Hierarchical Expert Dec 18, 2024 Diagnostic Medical Visual Question Answering
Fast Prompt Alignment for Text-to-Image Generation Dec 11, 2024 Image Generation In-Context Learning
IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents Dec 10, 2024 Cross-Modal Retrieval Image Classification
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale Dec 6, 2024 Multimodal Reasoning Visual Question Answering
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM Nov 26, 2024 Benchmarking Text-to-Video Generation
Teaching VLMs to Localize Specific Objects from In-context Examples Nov 20, 2024 Object Object Tracking
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts Nov 16, 2024 Mixture-of-Experts Optical Character Recognition (OCR)
An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models Nov 9, 2024 object-detection Object Detection
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset Nov 5, 2024 Benchmarking Language Modeling
Progressive Compositionality In Text-to-Image Generative Models Oct 22, 2024 Attribute Contrastive Learning
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines Oct 16, 2024 Question Answering Visual Question Answering
VividMed: Vision Language Model with Versatile Visual Grounding for Medicine Oct 16, 2024 Language Modeling Language Modelling
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content Oct 14, 2024 Visual Question Answering (VQA) World Knowledge
Towards Foundation Models for 3D Vision: How Close Are We? Oct 14, 2024 Question Answering Visual Question Answering
Skipping Computations in Multimodal LLMs Oct 12, 2024 Question Answering Visual Question Answering
DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback Oct 8, 2024 Math Sequential Decision Making
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models Oct 7, 2024 Question Answering Visual Question Answering
MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration Oct 6, 2024 Medical Visual Question Answering Question Answering
BadCM: Invisible Backdoor Attack Against Cross-Modal Learning Oct 3, 2024 Backdoor Attack Cross-Modal Retrieval
A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning Oct 1, 2024 Common Sense Reasoning DeepFake Detection
T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition Sep 29, 2024 In-Context Learning Question Answering
MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models Sep 23, 2024 Medical Visual Question Answering Question Answering
Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering Sep 19, 2024 Hallucination Hallucination Evaluation
Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs Sep 17, 2024 Question Answering Token Reduction
AIM 2024 Challenge on Compressed Video Quality Assessment: Methods and Results Aug 21, 2024 Image Manipulation valid