DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models | Mar 14, 2025 | Autonomous Driving, Computational Efficiency | Unverified (0)
Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space | Mar 14, 2025 | Language Modeling, Language Modelling | Code Available (1)
KVQ: Boosting Video Quality Assessment via Saliency-guided Local Perception | Mar 13, 2025 | Video Quality Assessment, Visual Question Answering (VQA) | Code Available (1)
DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding | Mar 13, 2025 | 4k, Autonomous Driving | Code Available (2)
Astrea: A MOE-based Visual Understanding Model with Progressive Alignment | Mar 12, 2025 | Contrastive Learning, Cross-Modal Retrieval | Unverified (0)
SurgicalVLM-Agent: Towards an Interactive AI Co-Pilot for Pituitary Surgery | Mar 12, 2025 | Activity Recognition, Anatomy | Unverified (0)
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Mar 11, 2025 | Decision Making, Interactive Segmentation | Code Available (2)
ComicsPAP: understanding comic strips by picking the correct panel | Mar 11, 2025 | Image Captioning, Visual Question Answering (VQA) | Unverified (0)
Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method | Mar 11, 2025 | Language Modeling, Language Modelling | Unverified (0)
Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework | Mar 11, 2025 | Conformal Prediction, Multimodal Reasoning | Unverified (0)
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning | Mar 10, 2025 | Language Modeling, Language Modelling | Code Available (2)
Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru | Mar 10, 2025 | Autonomous Driving, Question Answering | Unverified (0)
CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model | Mar 9, 2025 | Hallucination, Language Modeling | Unverified (0)
SplatTalk: 3D VQA with Gaussian Splatting | Mar 8, 2025 | 3DGS, Question Answering | Unverified (0)
Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model | Mar 8, 2025 | Image Quality Assessment, Language Modeling | Code Available (2)
Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models | Mar 8, 2025 | Caption Generation, Question Answering | Unverified (0)
MoEMoE: Question Guided Dense and Scalable Sparse Mixture-of-Expert for Multi-source Multi-modal Answering | Mar 8, 2025 | Answer Generation, Mixture-of-Experts | Unverified (0)
Enhancing Vietnamese VQA through Curriculum Learning on Raw and Augmented Text Representations | Mar 5, 2025 | Question Answering, Visual Question Answering | Code Available (0)
A Token-level Text Image Foundation Model for Document Understanding | Mar 4, 2025 | document understanding, Visual Question Answering (VQA) | Unverified (0)
BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA | Mar 4, 2025 | Medical Diagnosis, Question Answering | Code Available (0)
V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts | Mar 3, 2025 | Contrastive Learning, Text Retrieval | Unverified (0)
Enhancing Multi-hop Reasoning in Vision-Language Models via Self-Distillation with Multi-Prompt Ensembling | Mar 3, 2025 | Answer Generation, Computational Efficiency | Unverified (0)
Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models | Mar 3, 2025 | Memorization, Question Answering | Code Available (0)
FunBench: Benchmarking Fundus Reading Skills of MLLMs | Mar 2, 2025 | Anatomy, Benchmarking | Unverified (0)
ABC: Achieving Better Control of Multimodal Embeddings using VLMs | Mar 1, 2025 | Image to text, Image-to-Text Retrieval | Unverified (0)
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering | Mar 1, 2025 | Continual Learning, Language Modeling | Unverified (0)
Fine-Grained Retrieval-Augmented Generation for Visual Question Answering | Feb 28, 2025 | Question Answering, RAG | Unverified (0)
MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models | Feb 28, 2025 | Decision Making, Hallucination | Code Available (0)
Adaptive Score Alignment Learning for Continual Perceptual Quality Assessment of 360-Degree Videos in Virtual Reality | Feb 27, 2025 | Video Quality Assessment, Visual Question Answering (VQA) | Code Available (0)
ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models | Feb 27, 2025 | Person Re-Identification, Person Retrieval | Unverified (0)
Talking to the brain: Using Large Language Models as Proxies to Model Brain Semantic Representation | Feb 26, 2025 | Question Answering, valid | Unverified (0)
FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA | Feb 25, 2025 | Question Answering, Retrieval | Unverified (0)
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | Feb 25, 2025 | Visual Question Answering (VQA) | Code Available (2)
Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines | Feb 23, 2025 | Answer Generation, Language Modeling | Unverified (0)
Directional Gradient Projection for Robust Fine-Tuning of Foundation Models | Feb 21, 2025 | image-classification, Image Classification | Unverified (0)
Hardware-Friendly Static Quantization Method for Video Diffusion Transformers | Feb 20, 2025 | Quantization, Video Generation | Unverified (0)
Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling | Feb 20, 2025 | Decoder, GPU | Code Available (0)
Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison | Feb 20, 2025 | Diversity, Language Modeling | Unverified (0)
Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning | Feb 19, 2025 | Autonomous Driving, Bench2Drive | Unverified (0)
PitVQA++: Vector Matrix-Low-Rank Adaptation for Open-Ended Visual Question Answering in Pituitary Surgery | Feb 19, 2025 | Question Answering, Visual Question Answering | Code Available (0)
Qwen2.5-VL Technical Report | Feb 19, 2025 | document understanding | Code Available (11)
SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning | Feb 18, 2025 | Machine Unlearning, Visual Question Answering (VQA) | Unverified (0)
Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization | Feb 18, 2025 | Image Retrieval, Question Answering | Code Available (2)
Multi-Modal Retrieval Augmentation for Open-Ended and Knowledge-Intensive Video Question Answering | Feb 17, 2025 | Multiple-choice, Question Answering | Unverified (0)
MMUnlearner: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models | Feb 16, 2025 | Language Modeling, Language Modelling | Code Available (1)
USER-VLM 360: Personalized Vision Language Models with User-aware Tuning for Social Human-Robot Interactions | Feb 15, 2025 | Multimodal Reasoning, Visual Question Answering (VQA) | Code Available (0)
VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models | Feb 14, 2025 | Image Captioning, Large Language Model | Unverified (0)
Exploring the Potential of Encoder-free Architectures in 3D LMMs | Feb 13, 2025 | Inductive Bias, Visual Question Answering (VQA) | Code Available (0)
Abduction of Domain Relationships from Data for VQA | Feb 13, 2025 | Question Answering, Visual Question Answering | Unverified (0)