SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models | May 1, 2025 | Spatial Reasoning, Visual Question Answering (VQA) | Unverified | 0
Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs | Apr 30, 2025 | Hallucination, Hallucination Evaluation | Unverified | 0
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | Apr 25, 2025 | Caption Generation, EgoSchema | Code Available | 1
An Empirical Study on Prompt Compression for Large Language Models | Apr 24, 2025 | Articles, Math | Code Available | 3
Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction | Apr 24, 2025 | Conformal Prediction, Hallucination | Unverified | 0
A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task | Apr 24, 2025 | Question Answering, Retrieval | Unverified | 0
NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results | Apr 17, 2025 | Form, Image Super-Resolution | Code Available | 1
Instruction-augmented Multimodal Alignment for Image-Text and Element Matching | Apr 16, 2025 | Image Augmentation, Image Generation | Unverified | 0
Bridging the Semantic Gaps: Improving Medical VQA Consistency with LLM-Augmented Question Sets | Apr 16, 2025 | Diversity, Medical Visual Question Answering | Unverified | 0
DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment | Apr 16, 2025 | Language Modeling, Language Modelling | Unverified | 0
PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving | Apr 15, 2025 | Logical Reasoning, Visual Question Answering (VQA) | Unverified | 0
QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models | Apr 15, 2025 | Question Answering, Visual Question Answering | Code Available | 0
Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks | Apr 14, 2025 | Ethics, Fairness | Unverified | 0
MMKB-RAG: A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework | Apr 14, 2025 | Question Answering, RAG | Unverified | 0
FVQ: A Large-Scale Dataset and A LMM-based Method for Face Video Quality Assessment | Apr 12, 2025 | Video Quality Assessment, Visual Question Answering (VQA) | Code Available | 0
PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks | Apr 12, 2025 | Computed Tomography (CT), Question Answering | Unverified | 0
NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding | Apr 12, 2025 | Benchmarking, Document AI | Unverified | 0
Mimic In-Context Learning for Multimodal Tasks | Apr 11, 2025 | In-Context Learning, Visual Question Answering (VQA) | Code Available | 1
TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs | Apr 10, 2025 | Ensemble Learning, Position | Unverified | 0
UniRVQA: A Unified Framework for Retrieval-Augmented Vision Question Answering via Self-Reflective Joint Training | Apr 5, 2025 | Articles, Question Answering | Unverified | 0
Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion | Apr 4, 2025 | Diagnostic, Medical Visual Question Answering | Unverified | 0
QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning | Apr 4, 2025 | Data Augmentation, Image Generation | Unverified | 0
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection | Apr 3, 2025 | Instruction Following, Language Modeling | Code Available | 1
SocialGesture: Delving into Multi-person Gesture Understanding | Apr 3, 2025 | Gesture Recognition, Question Answering | Unverified | 0
Reasoning LLMs for User-Aware Multimodal Conversational Agents | Apr 2, 2025 | RAG, Retrieval-augmented Generation | Unverified | 0
MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving | Apr 1, 2025 | Autonomous Driving, Prompt Learning | Unverified | 0
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language | Mar 31, 2025 | Form, Question Answering | Code Available | 0
How Well Can Vison-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark | Mar 28, 2025 | Question Answering, Visual Question Answering | Unverified | 0
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute, Benchmarking | Code Available | 1
Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields | Mar 26, 2025 | Question Answering, Visual Question Answering | Unverified | 0
Vision-Amplified Semantic Entropy for Hallucination Detection in Medical Visual Question Answering | Mar 26, 2025 | Diagnostic, Hallucination | Unverified | 0
VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction | Mar 25, 2025 | Generative Visual Question Answering, Question Answering | Code Available | 0
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? | Mar 25, 2025 | Autonomous Navigation, Question Answering | Unverified | 0
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation | Mar 25, 2025 | Action Generation, Autonomous Driving | Unverified | 0
Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis | Mar 25, 2025 | Contrastive Learning, Image-text Retrieval | Code Available | 2
Where is this coming from? Making groundedness count in the evaluation of Document VQA models | Mar 24, 2025 | Question Answering, Visual Question Answering | Unverified | 0
MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering | Mar 24, 2025 | Graph Neural Network, Question Answering | Unverified | 0
DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels | Mar 24, 2025 | Medical Visual Question Answering, Question Answering | Unverified | 0
AMD-Hummingbird: Towards an Efficient Text-to-Video Model | Mar 24, 2025 | Computational Efficiency, Video Generation | Code Available | 1
Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models | Mar 23, 2025 | Question Answering, Visual Question Answering | Unverified | 0
Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models | Mar 22, 2025 | Question Answering, Visual Question Answering | Code Available | 0
A Vision Centric Remote Sensing Benchmark | Mar 20, 2025 | Question Answering, Representation Learning | Unverified | 0
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | Mar 19, 2025 | Language Model Evaluation, Language Modeling | Unverified | 0
TruthLens: A Training-Free Paradigm for DeepFake Detection | Mar 19, 2025 | Binary Classification, DeepFake Detection | Unverified | 0
ChatBEV: A Visual Language Model that Understands BEV Maps | Mar 18, 2025 | Autonomous Driving, Language Modeling | Unverified | 0
Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding | Mar 18, 2025 | document understanding, Question Answering | Code Available | 0
NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models | Mar 17, 2025 | Question Answering, Scene Understanding | Code Available | 1
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Mar 17, 2025 | Articles, Benchmarking | Code Available | 1
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | Mar 16, 2025 | Change Detection, Image Captioning | Unverified | 0
DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models | Mar 14, 2025 | Autonomous Driving, Computational Efficiency | Unverified | 0