| QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models | Apr 15, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation | Apr 15, 2025 | Image Captioning, Question Answering | Unverified | 0 |
| VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents | Apr 14, 2025 | Question Answering, RAG | Unverified | 0 |
| Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks | Apr 14, 2025 | Ethics, Fairness | Unverified | 0 |
| MMKB-RAG: A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework | Apr 14, 2025 | Question Answering, RAG | Unverified | 0 |
| ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models | Apr 14, 2025 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 |
| A Survey on Efficient Vision-Language Models | Apr 13, 2025 | Image Captioning, Question Answering | Code Available | 1 |
| NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding | Apr 12, 2025 | Benchmarking, Document AI | Unverified | 0 |
| AstroLLaVA: towards the unification of astronomical data and natural language | Apr 11, 2025 | Astronomy, Image Captioning | Unverified | 0 |
| Data Metabolism: An Efficient Data Design Schema For Vision Language Model | Apr 10, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos | Apr 10, 2025 | Question Answering, Video Generation | Unverified | 0 |
| TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs | Apr 10, 2025 | Ensemble Learning, Position | Unverified | 0 |
| Resource-efficient Inference with Foundation Model Programs | Apr 9, 2025 | model, Question Answering | Code Available | 0 |
| Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data | Apr 7, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model | Apr 7, 2025 | Image Captioning, image-classification | Unverified | 0 |
| MedM-VL: What Makes a Good Medical LVLM? | Apr 6, 2025 | Medical Image Analysis, Question Answering | Code Available | 2 |
| Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion | Apr 4, 2025 | Diagnostic, Medical Visual Question Answering | Unverified | 0 |
| QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning | Apr 4, 2025 | Data Augmentation, Image Generation | Unverified | 0 |
| STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection | Apr 3, 2025 | Instruction Following, Language Modeling | Code Available | 1 |
| SocialGesture: Delving into Multi-person Gesture Understanding | Apr 3, 2025 | Gesture Recognition, Question Answering | Unverified | 0 |
| ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement | Apr 2, 2025 | Decoder, Image Generation | Code Available | 2 |
| GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning | Apr 2, 2025 | Decision Making, Diagnostic | Code Available | 1 |
| MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving | Apr 1, 2025 | Autonomous Driving, Prompt Learning | Unverified | 0 |
| SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering | Apr 1, 2025 | cross-modal alignment, Question Answering | Unverified | 0 |
| FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning | Apr 1, 2025 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 2 |