| Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields | Mar 26, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Improved Alignment of Modalities in Large Vision Language Models | Mar 25, 2025 | GPU, Image Captioning | Unverified | 0 |
| VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction | Mar 25, 2025 | Generative Visual Question Answering, Question Answering | Code Available | 0 |
| LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? | Mar 25, 2025 | Autonomous Navigation, Question Answering | Unverified | 0 |
| ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation | Mar 25, 2025 | Action Generation, Autonomous Driving | Unverified | 0 |
| DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels | Mar 24, 2025 | Medical Visual Question Answering, Question Answering | Unverified | 0 |
| MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering | Mar 24, 2025 | Graph Neural Network, Question Answering | Unverified | 0 |
| Where is this coming from? Making groundedness count in the evaluation of Document VQA models | Mar 24, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models | Mar 23, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models | Mar 22, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study | Mar 21, 2025 | Attribute, Mathematical Problem-Solving | Code Available | 0 |
| UMIT: Unifying Medical Imaging Tasks via Vision-Language Models | Mar 20, 2025 | Diagnostic, Medical Image Analysis | Code Available | 0 |
| A Vision Centric Remote Sensing Benchmark | Mar 20, 2025 | Question Answering, Representation Learning | Unverified | 0 |
| GraspCorrect: Robotic Grasp Correction via Vision-Language Model-Guided Feedback | Mar 19, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| TruthLens: A Training-Free Paradigm for DeepFake Detection | Mar 19, 2025 | Binary Classification, DeepFake Detection | Unverified | 0 |
| UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | Mar 19, 2025 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models | Mar 19, 2025 | MM-Vet, Multimodal Reasoning | Unverified | 0 |
| Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding | Mar 18, 2025 | document understanding, Question Answering | Code Available | 0 |
| Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference | Mar 17, 2025 | Feature Compression, Image Compression | Unverified | 0 |
| From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration | Mar 17, 2025 | Denoising, Question Answering | Unverified | 0 |
| GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | Mar 16, 2025 | Change Detection, Image Captioning | Unverified | 0 |
| PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models | Mar 16, 2025 | Machine Unlearning, Privacy Preserving | Unverified | 0 |
| T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation | Mar 14, 2025 | Attribute, Question Answering | Code Available | 0 |
| DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models | Mar 14, 2025 | Autonomous Driving, Computational Efficiency | Unverified | 0 |
| SurgicalVLM-Agent: Towards an Interactive AI Co-Pilot for Pituitary Surgery | Mar 12, 2025 | Activity Recognition, Anatomy | Unverified | 0 |
| On the Limitations of Vision-Language Models in Understanding Image Transforms | Mar 12, 2025 | Question Answering, Video Generation | Unverified | 0 |
| Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework | Mar 11, 2025 | Conformal Prediction, Multimodal Reasoning | Unverified | 0 |
| Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru | Mar 10, 2025 | Autonomous Driving, Question Answering | Unverified | 0 |
| From Text to Visuals: Using LLMs to Generate Math Diagrams with Vector Graphics | Mar 10, 2025 | Math, Question Answering | Unverified | 0 |
| TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems | Mar 9, 2025 | Multimodal Sentiment Analysis, Question Answering | Unverified | 0 |
| Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models | Mar 8, 2025 | Caption Generation, Question Answering | Unverified | 0 |
| MoEMoE: Question Guided Dense and Scalable Sparse Mixture-of-Expert for Multi-source Multi-modal Answering | Mar 8, 2025 | Answer Generation, Mixture-of-Experts | Unverified | 0 |
| Treble Counterfactual VLMs: A Causal Approach to Hallucination | Mar 8, 2025 | Autonomous Driving, counterfactual | Code Available | 0 |
| SplatTalk: 3D VQA with Gaussian Splatting | Mar 8, 2025 | 3DGS, Question Answering | Unverified | 0 |
| Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation | Mar 6, 2025 | Active Learning, Image Segmentation | Unverified | 0 |
| Enhancing Vietnamese VQA through Curriculum Learning on Raw and Augmented Text Representations | Mar 5, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA | Mar 4, 2025 | Medical Diagnosis, Question Answering | Code Available | 0 |
| OWLViz: An Open-World Benchmark for Visual Question Answering | Mar 4, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models | Mar 3, 2025 | Memorization, Question Answering | Code Available | 0 |
| FunBench: Benchmarking Fundus Reading Skills of MLLMs | Mar 2, 2025 | Anatomy, Benchmarking | Unverified | 0 |
| CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering | Mar 1, 2025 | Continual Learning, Language Modeling | Unverified | 0 |
| Fine-Grained Retrieval-Augmented Generation for Visual Question Answering | Feb 28, 2025 | Question Answering, RAG | Unverified | 0 |
| MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models | Feb 28, 2025 | Decision Making, Hallucination | Code Available | 0 |
| Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios | Feb 27, 2025 | Data Integration, Question Answering | Unverified | 0 |
| Talking to the brain: Using Large Language Models as Proxies to Model Brain Semantic Representation | Feb 26, 2025 | Question Answering, valid | Unverified | 0 |
| MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning | Feb 26, 2025 | Domain Generalization, Medical Image Analysis | Unverified | 0 |
| Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference | Feb 25, 2025 | Question Answering, RAG | Code Available | 0 |
| FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA | Feb 25, 2025 | Question Answering, Retrieval | Unverified | 0 |
| All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark | Feb 24, 2025 | All, Multimodal Reasoning | Unverified | 0 |
| Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines | Feb 23, 2025 | Answer Generation, Language Modeling | Unverified | 0 |