| QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models | Apr 15, 2025 | Question AnsweringVisual Question Answering | CodeCode Available | 0 |
| LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation | Apr 15, 2025 | Image CaptioningQuestion Answering | —Unverified | 0 |
| Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks | Apr 14, 2025 | EthicsFairness | —Unverified | 0 |
| VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents | Apr 14, 2025 | Question AnsweringRAG | —Unverified | 0 |
| MMKB-RAG: A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework | Apr 14, 2025 | Question AnsweringRAG | —Unverified | 0 |
| NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding | Apr 12, 2025 | BenchmarkingDocument AI | —Unverified | 0 |
| AstroLLaVA: towards the unification of astronomical data and natural language | Apr 11, 2025 | AstronomyImage Captioning | —Unverified | 0 |
| Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos | Apr 10, 2025 | Question AnsweringVideo Generation | —Unverified | 0 |
| Data Metabolism: An Efficient Data Design Schema For Vision Language Model | Apr 10, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs | Apr 10, 2025 | Ensemble LearningPosition | —Unverified | 0 |
| Resource-efficient Inference with Foundation Model Programs | Apr 9, 2025 | modelQuestion Answering | CodeCode Available | 0 |
| RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model | Apr 7, 2025 | Image Captioningimage-classification | —Unverified | 0 |
| Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data | Apr 7, 2025 | Question AnsweringVisual Question Answering | CodeCode Available | 0 |
| Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion | Apr 4, 2025 | DiagnosticMedical Visual Question Answering | —Unverified | 0 |
| QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning | Apr 4, 2025 | Data AugmentationImage Generation | —Unverified | 0 |
| SocialGesture: Delving into Multi-person Gesture Understanding | Apr 3, 2025 | Gesture RecognitionQuestion Answering | —Unverified | 0 |
| SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering | Apr 1, 2025 | cross-modal alignmentQuestion Answering | —Unverified | 0 |
| MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving | Apr 1, 2025 | Autonomous DrivingPrompt Learning | —Unverified | 0 |
| KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language | Mar 31, 2025 | FormQuestion Answering | CodeCode Available | 0 |
| How Well Can Vison-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark | Mar 28, 2025 | Question AnsweringVisual Question Answering | —Unverified | 0 |
| JEEM: Vision-Language Understanding in Four Arabic Dialects | Mar 27, 2025 | Image CaptioningQuestion Answering | —Unverified | 0 |
| CTRL-O: Language-Controllable Object-Centric Visual Representation Learning | Mar 27, 2025 | Image GenerationObject | —Unverified | 0 |
| Vision-Amplified Semantic Entropy for Hallucination Detection in Medical Visual Question Answering | Mar 26, 2025 | DiagnosticHallucination | —Unverified | 0 |
| Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs | Mar 26, 2025 | HallucinationHallucination Evaluation | —Unverified | 0 |
| Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields | Mar 26, 2025 | Question AnsweringVisual Question Answering | —Unverified | 0 |