| Tracking the Copyright of Large Vision-Language Models through Parameter Learning Adversarial Images | Feb 23, 2025 | Adversarial Attack, Question Answering | Unverified | 0 |
| Directional Gradient Projection for Robust Fine-Tuning of Foundation Models | Feb 21, 2025 | image-classification, Image Classification | Unverified | 0 |
| TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba | Feb 21, 2025 | image-classification, Image Classification | Unverified | 0 |
| Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison | Feb 20, 2025 | Diversity, Language Modeling | Unverified | 0 |
| Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning | Feb 19, 2025 | Autonomous Driving, Bench2Drive | Unverified | 0 |
| PitVQA++: Vector Matrix-Low-Rank Adaptation for Open-Ended Visual Question Answering in Pituitary Surgery | Feb 19, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models | Feb 18, 2025 | Image Comprehension, Question Answering | Unverified | 0 |
| "See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models | Feb 17, 2025 | Object Recognition, Question Answering | Unverified | 0 |
| Visual Graph Question Answering with ASP and LLMs for Language Parsing | Feb 13, 2025 | Graph Question Answering, Optical Character Recognition | Unverified | 0 |
| Abduction of Domain Relationships from Data for VQA | Feb 13, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| EmoAssist: Emotional Assistant for Visual Impairment Community | Feb 13, 2025 | Emotional Intelligence, Question Answering | Unverified | 0 |
| Vision-Language Models for Edge Networks: A Comprehensive Survey | Feb 11, 2025 | Autonomous Vehicles, Image Captioning | Unverified | 0 |
| Performance Analysis of Traditional VQA Models Under Limited Computational Resources | Feb 9, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| ClinKD: Cross-Modal Clinical Knowledge Distiller For Multi-Task Medical Images | Feb 9, 2025 | Clinical Knowledge, Medical Visual Question Answering | Code Available | 0 |
| Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment | Feb 7, 2025 | Diversity, Human-Object Interaction Detection | Unverified | 0 |
| DocMIA: Document-Level Membership Inference Attacks against DocVQA Models | Feb 6, 2025 | document understanding, Inference Attack | Code Available | 0 |
| No Images, No Problem: Retaining Knowledge in Continual VQA with Questions-Only Memory | Feb 6, 2025 | Continual Learning, Question Answering | Code Available | 0 |
| Efficient Few-Shot Continual Learning in Vision-Language Models | Feb 6, 2025 | Continual Learning, Image Captioning | Unverified | 0 |
| Exploring Spatial Language Grounding Through Referring Expressions | Feb 4, 2025 | Image Captioning, Negation | Unverified | 0 |
| Hypo3D: Exploring Hypothetical Reasoning in 3D | Feb 2, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| VLM-Assisted Continual learning for Visual Question Answering in Self-Driving | Feb 2, 2025 | Autonomous Driving, Continual Learning | Unverified | 0 |
| Anatomy Might Be All You Need: Forecasting What to Do During Surgery | Jan 29, 2025 | All, Anatomy | Unverified | 0 |
| Large Models in Dialogue for Active Perception and Anomaly Detection | Jan 27, 2025 | Anomaly Detection, Question Answering | Code Available | 0 |
| Scaling Large Vision-Language Models for Enhanced Multimodal Comprehension In Biomedical Image Analysis | Jan 26, 2025 | Articles, Hallucination | Unverified | 0 |
| Scene Understanding Enabled Semantic Communication with Open Channel Coding | Jan 24, 2025 | Question Answering, Scene Understanding | Unverified | 0 |
| Combining Knowledge Graph and LLMs for Enhanced Zero-shot Visual Question Answering | Jan 22, 2025 | Knowledge Graphs, Question Answering | Unverified | 0 |
| Patent Figure Classification using Large Vision-language Models | Jan 22, 2025 | Classification, Few-Shot Learning | Code Available | 0 |
| Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! | Jan 18, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness | Jan 16, 2025 | Adversarial Defense, Adversarial Robustness | Unverified | 0 |
| Embodied Scene Understanding for Vision Language Models via MetaVQA | Jan 15, 2025 | Decision Making, Question Answering | Unverified | 0 |
| Dynamic Knowledge Integration for Enhanced Vision-Language Reasoning | Jan 15, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| SAR Strikes Back: A New Hope for RSVQA | Jan 14, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| The Quest for Visual Understanding: A Journey Through the Evolution of Visual Question Answering | Jan 13, 2025 | Common Sense Reasoning, Question Answering | Unverified | 0 |
| GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing | Jan 12, 2025 | Image Captioning, Language Modeling | Unverified | 0 |
| Overcoming Language Priors for Visual Question Answering Based on Knowledge Distillation | Jan 10, 2025 | Knowledge Distillation, Question Answering | Unverified | 0 |
| LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding | Jan 9, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Feedback-Driven Vision-Language Alignment with Minimal Human Supervision | Jan 8, 2025 | Hallucination, Question Answering | Unverified | 0 |
| KAnoCLIP: Zero-Shot Anomaly Detection through Knowledge-Driven Prompt Learning and Enhanced Cross-Modal Integration | Jan 7, 2025 | Anomaly Detection, Anomaly Segmentation | Unverified | 0 |
| Visual question answering: from early developments to recent advances -- a survey | Jan 7, 2025 | Descriptive, Natural Language Understanding | Unverified | 0 |
| ReDiT: Re-evaluating large visual question answering model confidence by defining input scenario Difficulty and applying Temperature mapping | Jan 6, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| Accounting for Focus Ambiguity in Visual Questions | Jan 4, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models | Jan 3, 2025 | Binary Classification, Face Anti-Spoofing | Unverified | 0 |
| MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning | Jan 3, 2025 | Diagnostic, General Knowledge | Unverified | 0 |
| CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering | Jan 2, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| AVQACL: A Novel Benchmark for Audio-Visual Question Answering Continual Learning | Jan 1, 2025 | Audio-visual Question Answering, Continual Learning | Code Available | 0 |
| Separation of Powers: On Segregating Knowledge from Observation in LLM-enabled Knowledge-based Visual Question Answering | Jan 1, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| AdaDARE-gamma: Balancing Stability and Plasticity in Multi-modal LLMs through Efficient Adaptation | Jan 1, 2025 | Image Captioning, Question Answering | Unverified | 0 |
| Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering | Jan 1, 2025 | Contrastive Learning, Medical Visual Question Answering | Unverified | 0 |
| JTD-UAV: MLLM-Enhanced Joint Tracking and Description Framework for Anti-UAV Systems | Jan 1, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models | Jan 1, 2025 | MM-Vet, Multimodal Reasoning | Unverified | 0 |