VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning | Jul 17, 2025 | Language Modeling Language Modelling | Code Available 0
Describe Anything Model for Visual Question Answering on Text-rich Images | Jul 16, 2025 | Descriptive Language Modeling | Code Available 1
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM | Jul 16, 2025 | Attribute Face Swapping | Unverified 0
LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation | Jul 9, 2025 | Question Answering Visual Question Answering | Unverified 0
Evaluating Attribute Confusion in Fashion Text-to-Image Generation | Jul 9, 2025 | Attribute cross-modal alignment | Unverified 0
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder | Jun 28, 2025 | Image Segmentation Large Language Model | Code Available 1
SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning | Jun 26, 2025 | In-Context Learning Medical Visual Question Answering | Unverified 0
Bridging Video Quality Scoring and Justification via Large Multimodal Models | Jun 26, 2025 | Video Quality Assessment Visual Question Answering (VQA) | Unverified 0
DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images | Jun 26, 2025 | document understanding Optical Character Recognition (OCR) | Code Available 0
FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering | Jun 25, 2025 | Question Answering Visual Question Answering | Unverified 0
HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction | Jun 25, 2025 | Benchmarking Person Identification | Code Available 0
MMSearch-R1: Incentivizing LMMs to Search | Jun 25, 2025 | RAG Retrieval-augmented Generation | Code Available 3
GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning | Jun 22, 2025 | Answer Generation Decision Making | Unverified 0
How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | Jun 19, 2025 | Multiple-choice Question Answering | Unverified 0
Can Common VLMs Rival Medical VLMs? Evaluation and Strategic Insights | Jun 19, 2025 | Question Answering Visual Question Answering | Unverified 0
MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering | Jun 18, 2025 | Multimodal Reasoning Question Answering | Unverified 0
ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM | Jun 17, 2025 | Hallucination Language Modeling | Unverified 0
Adapting Lightweight Vision Language Models for Radiological Visual Question Answering | Jun 17, 2025 | Diagnostic Question Answering | Code Available 0
Connecting phases of matter to the flatness of the loss landscape in analog variational quantum algorithms | Jun 16, 2025 | Visual Question Answering (VQA) | Unverified 0
CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making | Jun 15, 2025 | Answer Generation Decision Making | Unverified 0
EyeSim-VQA: A Free-Energy-Guided Eye Simulation Framework for Video Quality Assessment | Jun 13, 2025 | Image Quality Assessment Video Quality Assessment | Unverified 0
SlotPi: Physics-informed Object-centric Reasoning Models | Jun 12, 2025 | Object Question Answering | Code Available 0
HalLoc: Token-level Localization of Hallucinations for Vision Language Models | Jun 12, 2025 | Hallucination Image Captioning | Code Available 0
Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning | Jun 12, 2025 | Attribute Multimodal Reasoning | Unverified 0
Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning | Jun 11, 2025 | In-Context Learning Question Answering | Unverified 0
CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models | Jun 11, 2025 | counterfactual Descriptive | Code Available 2
Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos | Jun 11, 2025 | Question Answering Visual Question Answering | Code Available 0
Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy | Jun 11, 2025 | Medical Visual Question Answering Question Answering | Code Available 0
PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly | Jun 10, 2025 | Question Answering Scene Understanding | Unverified 0
From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge | Jun 10, 2025 | Knowledge Graphs Language Modeling | Unverified 0
Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning | Jun 9, 2025 | Future prediction Question Answering | Code Available 0
HAIBU-ReMUD: Reasoning Multimodal Ultrasound Dataset and Model Bridging to General Specific Domains | Jun 9, 2025 | Diagnostic Question Answering | Code Available 0
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning | Jun 8, 2025 | Medical Report Generation Question Answering | Unverified 0
Ontology-based knowledge representation for bone disease diagnosis: a foundation for safe and sustainable medical artificial intelligence systems | Jun 5, 2025 | Diagnostic Multimodal Deep Learning | Unverified 0
ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding | Jun 4, 2025 | Negation Negation Detection | Unverified 0
CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG | Jun 3, 2025 | Answer Generation RAG | Unverified 0
Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering | Jun 1, 2025 | All MME | Unverified 0
MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning | May 31, 2025 | Diagnostic Reinforcement Learning (RL) | Unverified 0
Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck | May 30, 2025 | Question Answering Visual Question Answering | Unverified 0
Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting | May 30, 2025 | image-classification Image Classification | Unverified 0
VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software | May 30, 2025 | Question Answering Spatial Reasoning | Code Available 1
A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis | May 29, 2025 | Diagnostic Visual Prompting | Unverified 0
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence | May 29, 2025 | Multiple-choice Spatial Reasoning | Unverified 0
Spoken question answering for visual queries | May 29, 2025 | Question Answering Visual Question Answering (VQA) | Unverified 0
Synthetic Document Question Answering in Hungarian | May 29, 2025 | Optical Character Recognition (OCR) Question Answering | Code Available 0
Multi-Sourced Compositional Generalization in Visual Question Answering | May 29, 2025 | Question Answering Visual Question Answering | Code Available 0
Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning | May 29, 2025 | Diagnostic Question Answering | Code Available 1
NegVQA: Can Vision Language Models Understand Negation? | May 28, 2025 | Negation Question Answering | Unverified 0
VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models | May 28, 2025 | Decision Making Question Answering | Code Available 0
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering | May 27, 2025 | Benchmarking Question Answering | Code Available 0