| Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models | May 8, 2025 | Active Learning, cross-modal alignment | Code Available | 0 |
| SITE: towards Spatial Intelligence Thorough Evaluation | May 8, 2025 | Question Answering, Spatial Reasoning | Unverified | 0 |
| Task-Oriented Semantic Communication in Large Multimodal Models-based Vehicle Networks | May 5, 2025 | Question Answering, Semantic Communication | Unverified | 0 |
| Structure Causal Models and LLMs Integration in Medical Visual Question Answering | May 5, 2025 | Causal Inference, Medical Visual Question Answering | Unverified | 0 |
| Sim2Real Transfer for Vision-Based Grasp Verification | May 5, 2025 | Object, object-detection | Code Available | 0 |
| Compositional Image-Text Matching and Retrieval by Grounding Entities | May 4, 2025 | Image Captioning, Image-text matching | Code Available | 0 |
| Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs | May 3, 2025 | Chunking, Question Answering | Unverified | 0 |
| Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings | May 3, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Grounding Task Assistance with Multimodal Cues from a Single Demonstration | May 2, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Transferable Adversarial Attacks on Black-Box Vision-Language Models | May 2, 2025 | Image Captioning, Object Recognition | Unverified | 0 |
| AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care | May 1, 2025 | Language Modeling, Language Modelling | Code Available | 0 |
| Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation | May 1, 2025 | Question Answering, Specificity | Code Available | 0 |
| UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation | Apr 30, 2025 | Diagnostic, Large Language Model | Code Available | 1 |
| Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding | Apr 30, 2025 | Medical Question Answering, Question Answering | Unverified | 0 |
| ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification | Apr 29, 2025 | Diagnostic, Question Answering | Code Available | 1 |
| LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs | Apr 29, 2025 | Benchmarking, Face Generation | Unverified | 0 |
| SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning | Apr 28, 2025 | Question Answering, Spatial Reasoning | Unverified | 0 |
| Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction | Apr 24, 2025 | Conformal Prediction, Hallucination | Unverified | 0 |
| Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Apr 24, 2025 | Benchmarking, Math | Code Available | 1 |
| TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance | Apr 23, 2025 | Question Answering, Scene Understanding | Unverified | 0 |
| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Apr 20, 2025 | Autonomous Driving, Image Captioning | Code Available | 0 |
| Neglected Risks: The Disturbing Reality of Children's Images in Datasets and the Urgent Call for Accountability | Apr 20, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Hadamard product in deep learning: Introduction, Advances and Challenges | Apr 17, 2025 | Computational Efficiency, Deep Learning | Unverified | 0 |
| Bridging the Semantic Gaps: Improving Medical VQA Consistency with LLM-Augmented Question Sets | Apr 16, 2025 | Diversity, Medical Visual Question Answering | Unverified | 0 |
| Instruction-augmented Multimodal Alignment for Image-Text and Element Matching | Apr 16, 2025 | Image Augmentation, Image Generation | Unverified | 0 |
| QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models | Apr 15, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation | Apr 15, 2025 | Image Captioning, Question Answering | Unverified | 0 |
| VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents | Apr 14, 2025 | Question Answering, RAG | Unverified | 0 |
| Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks | Apr 14, 2025 | Ethics, Fairness | Unverified | 0 |
| MMKB-RAG: A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework | Apr 14, 2025 | Question Answering, RAG | Unverified | 0 |
| ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models | Apr 14, 2025 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 |
| A Survey on Efficient Vision-Language Models | Apr 13, 2025 | Image Captioning, Question Answering | Code Available | 1 |
| NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding | Apr 12, 2025 | Benchmarking, Document AI | Unverified | 0 |
| AstroLLaVA: towards the unification of astronomical data and natural language | Apr 11, 2025 | Astronomy, Image Captioning | Unverified | 0 |
| Data Metabolism: An Efficient Data Design Schema For Vision Language Model | Apr 10, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos | Apr 10, 2025 | Question Answering, Video Generation | Unverified | 0 |
| TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs | Apr 10, 2025 | Ensemble Learning, Position | Unverified | 0 |
| Resource-efficient Inference with Foundation Model Programs | Apr 9, 2025 | model, Question Answering | Code Available | 0 |
| Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data | Apr 7, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model | Apr 7, 2025 | Image Captioning, image-classification | Unverified | 0 |
| MedM-VL: What Makes a Good Medical LVLM? | Apr 6, 2025 | Medical Image Analysis, Question Answering | Code Available | 2 |
| Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion | Apr 4, 2025 | Diagnostic, Medical Visual Question Answering | Unverified | 0 |
| QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning | Apr 4, 2025 | Data Augmentation, Image Generation | Unverified | 0 |
| STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection | Apr 3, 2025 | Instruction Following, Language Modeling | Code Available | 1 |
| SocialGesture: Delving into Multi-person Gesture Understanding | Apr 3, 2025 | Gesture Recognition, Question Answering | Unverified | 0 |
| ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement | Apr 2, 2025 | Decoder, Image Generation | Code Available | 2 |
| GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning | Apr 2, 2025 | Decision Making, Diagnostic | Code Available | 1 |
| MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving | Apr 1, 2025 | Autonomous Driving, Prompt Learning | Unverified | 0 |
| SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering | Apr 1, 2025 | cross-modal alignment, Question Answering | Unverified | 0 |
| FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning | Apr 1, 2025 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 2 |