| KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language | Mar 31, 2025 | Form, Question Answering | Code Available | 0 |
| OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | Mar 30, 2025 | Autonomous Driving, Decision Making | Code Available | 4 |
| How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark | Mar 28, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| JEEM: Vision-Language Understanding in Four Arabic Dialects | Mar 27, 2025 | Image Captioning, Question Answering | Unverified | 0 |
| Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving | Mar 27, 2025 | Attribute, Autonomous Driving | Code Available | 1 |
| CTRL-O: Language-Controllable Object-Centric Visual Representation Learning | Mar 27, 2025 | Image Generation, Object | Unverified | 0 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute, Benchmarking | Code Available | 1 |
| Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy | Mar 26, 2025 | Hallucination, Image Captioning | Unverified | 0 |
| Vision-Amplified Semantic Entropy for Hallucination Detection in Medical Visual Question Answering | Mar 26, 2025 | Diagnostic, Hallucination | Unverified | 0 |
| Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs | Mar 26, 2025 | Hallucination, Hallucination Evaluation | Unverified | 0 |
| Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields | Mar 26, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction | Mar 25, 2025 | Generative Visual Question Answering, Question Answering | Code Available | 0 |
| Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Mar 25, 2025 | Benchmarking, Image Captioning | Code Available | 1 |
| PAVE: Patching and Adapting Video Large Language Models | Mar 25, 2025 | Audio-visual Question Answering, Multi-Task Learning | Code Available | 1 |
| Improved Alignment of Modalities in Large Vision Language Models | Mar 25, 2025 | GPU, Image Captioning | Unverified | 0 |
| LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? | Mar 25, 2025 | Autonomous Navigation, Question Answering | Unverified | 0 |
| ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation | Mar 25, 2025 | Action Generation, Autonomous Driving | Unverified | 0 |
| Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis | Mar 25, 2025 | Contrastive Learning, Image-text Retrieval | Code Available | 2 |
| Where is this coming from? Making groundedness count in the evaluation of Document VQA models | Mar 24, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels | Mar 24, 2025 | Medical Visual Question Answering, Question Answering | Unverified | 0 |
| MC-LLaVA: Multi-Concept Personalized Vision-Language Model | Mar 24, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering | Mar 24, 2025 | Graph Neural Network, Question Answering | Unverified | 0 |
| Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models | Mar 23, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models | Mar 22, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study | Mar 21, 2025 | Attribute, Mathematical Problem-Solving | Code Available | 0 |
| A Vision Centric Remote Sensing Benchmark | Mar 20, 2025 | Question Answering, Representation Learning | Unverified | 0 |
| UMIT: Unifying Medical Imaging Tasks via Vision-Language Models | Mar 20, 2025 | Diagnostic, Medical Image Analysis | Code Available | 0 |
| UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | Mar 19, 2025 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models | Mar 19, 2025 | MM-Vet, Multimodal Reasoning | Unverified | 0 |
| GraspCorrect: Robotic Grasp Correction via Vision-Language Model-Guided Feedback | Mar 19, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| TruthLens: A Training-Free Paradigm for DeepFake Detection | Mar 19, 2025 | Binary Classification, DeepFake Detection | Unverified | 0 |
| Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding | Mar 18, 2025 | document understanding, Question Answering | Code Available | 0 |
| Where do Large Vision-Language Models Look at when Answering Questions? | Mar 18, 2025 | Question Answering, Visual Question Answering | Code Available | 2 |
| NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models | Mar 17, 2025 | Question Answering, Scene Understanding | Code Available | 1 |
| MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Mar 17, 2025 | Articles, Benchmarking | Code Available | 1 |
| Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference | Mar 17, 2025 | Feature Compression, Image Compression | Unverified | 0 |
| From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration | Mar 17, 2025 | Denoising, Question Answering | Unverified | 0 |
| GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | Mar 16, 2025 | Change Detection, Image Captioning | Unverified | 0 |
| PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models | Mar 16, 2025 | Machine Unlearning, Privacy Preserving | Unverified | 0 |
| DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models | Mar 14, 2025 | Autonomous Driving, Computational Efficiency | Unverified | 0 |
| T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation | Mar 14, 2025 | Attribute, Question Answering | Code Available | 0 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning, Question Answering | Code Available | 1 |
| DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding | Mar 13, 2025 | 4k, Autonomous Driving | Code Available | 2 |
| On the Limitations of Vision-Language Models in Understanding Image Transforms | Mar 12, 2025 | Question Answering, Video Generation | Unverified | 0 |
| SurgicalVLM-Agent: Towards an Interactive AI Co-Pilot for Pituitary Surgery | Mar 12, 2025 | Activity Recognition, Anatomy | Unverified | 0 |
| SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment | Mar 12, 2025 | Autonomous Driving, Bench2Drive | Code Available | 3 |
| Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework | Mar 11, 2025 | Conformal Prediction, Multimodal Reasoning | Unverified | 0 |
| From Text to Visuals: Using LLMs to Generate Math Diagrams with Vector Graphics | Mar 10, 2025 | Math, Question Answering | Unverified | 0 |
| Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru | Mar 10, 2025 | Autonomous Driving, Question Answering | Unverified | 0 |
| TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems | Mar 9, 2025 | Multimodal Sentiment Analysis, Question Answering | Unverified | 0 |