| VLM-Assisted Continual learning for Visual Question Answering in Self-Driving | Feb 2, 2025 | Autonomous Driving, Continual Learning | Unverified | 0 |
| Hypo3D: Exploring Hypothetical Reasoning in 3D | Feb 2, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Anatomy Might Be All You Need: Forecasting What to Do During Surgery | Jan 29, 2025 | All, Anatomy | Unverified | 0 |
| Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | Jan 29, 2025 | Image Generation | Code Available | 11 |
| Large Models in Dialogue for Active Perception and Anomaly Detection | Jan 27, 2025 | Anomaly Detection, Question Answering | Code Available | 0 |
| Scaling Large Vision-Language Models for Enhanced Multimodal Comprehension In Biomedical Image Analysis | Jan 26, 2025 | Articles, Hallucination | Unverified | 0 |
| Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models | Jan 25, 2025 | Attribute, Contrastive Learning | Code Available | 2 |
| Scene Understanding Enabled Semantic Communication with Open Channel Coding | Jan 24, 2025 | Question Answering, Scene Understanding | Unverified | 0 |
| Combining Knowledge Graph and LLMs for Enhanced Zero-shot Visual Question Answering | Jan 22, 2025 | Knowledge Graphs, Question Answering | Unverified | 0 |
| Patent Figure Classification using Large Vision-language Models | Jan 22, 2025 | Classification, Few-Shot Learning | Code Available | 0 |
| VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | Jan 21, 2025 | Image Generation, Instruction Following | Code Available | 3 |
| Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! | Jan 18, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness | Jan 16, 2025 | Adversarial Defense, Adversarial Robustness | Unverified | 0 |
| A Simple Aerial Detection Baseline of Multimodal Language Models | Jan 16, 2025 | object-detection, Object Detection | Code Available | 2 |
| Embodied Scene Understanding for Vision Language Models via MetaVQA | Jan 15, 2025 | Decision Making, Question Answering | Unverified | 0 |
| Dynamic Knowledge Integration for Enhanced Vision-Language Reasoning | Jan 15, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | Jan 14, 2025 | image-classification, Image Classification | Code Available | 2 |
| SAR Strikes Back: A New Hope for RSVQA | Jan 14, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| The Quest for Visual Understanding: A Journey Through the Evolution of Visual Question Answering | Jan 13, 2025 | Common Sense Reasoning, Question Answering | Unverified | 0 |
| GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing | Jan 12, 2025 | Image Captioning, Language Modeling | Unverified | 0 |
| Overcoming Language Priors for Visual Question Answering Based on Knowledge Distillation | Jan 10, 2025 | Knowledge Distillation, Question Answering | Unverified | 0 |
| LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding | Jan 9, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Feedback-Driven Vision-Language Alignment with Minimal Human Supervision | Jan 8, 2025 | Hallucination, Question Answering | Unverified | 0 |
| KAnoCLIP: Zero-Shot Anomaly Detection through Knowledge-Driven Prompt Learning and Enhanced Cross-Modal Integration | Jan 7, 2025 | Anomaly Detection, Anomaly Segmentation | Unverified | 0 |
| Visual question answering: from early developments to recent advances -- a survey | Jan 7, 2025 | Descriptive, Natural Language Understanding | Unverified | 0 |
| ReDiT: Re‑evaluating large visual question answering model confidence by defining input scenario Difficulty and applying Temperature mapping | Jan 6, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Jan 6, 2025 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Jan 5, 2025 | Image Captioning, Image to text | Code Available | 1 |
| Accounting for Focus Ambiguity in Visual Questions | Jan 4, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning | Jan 3, 2025 | Diagnostic, General Knowledge | Unverified | 0 |
| Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models | Jan 3, 2025 | Binary Classification, Face Anti-Spoofing | Unverified | 0 |
| CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering | Jan 2, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| Separation of Powers: On Segregating Knowledge from Observation in LLM-enabled Knowledge-based Visual Question Answering | Jan 1, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| AdaDARE-gamma: Balancing Stability and Plasticity in Multi-modal LLMs through Efficient Adaptation | Jan 1, 2025 | Image Captioning, Question Answering | Unverified | 0 |
| Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering | Jan 1, 2025 | Large Language Model, Multimodal Large Language Model | Code Available | 1 |
| EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models | Jan 1, 2025 | MM-Vet, Multimodal Reasoning | Unverified | 0 |
| JTD-UAV: MLLM-Enhanced Joint Tracking and Description Framework for Anti-UAV Systems | Jan 1, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| AVQACL: A Novel Benchmark for Audio-Visual Question Answering Continual Learning | Jan 1, 2025 | Audio-visual Question Answering, Continual Learning | Code Available | 0 |
| Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering | Jan 1, 2025 | Contrastive Learning, Medical Visual Question Answering | Unverified | 0 |
| Probing Visual Language Priors in VLMs | Dec 31, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Dual Diffusion for Unified Image Generation and Understanding | Dec 31, 2024 | Image Generation, Language Modeling | Code Available | 2 |
| MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models | Dec 31, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models | Dec 30, 2024 | Question Answering, Scene Classification | Code Available | 0 |
| Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering | Dec 30, 2024 | Image Captioning, Object Recognition | Unverified | 0 |
| HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models | Dec 29, 2024 | Hallucination, Object | Code Available | 0 |
| ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers | Dec 27, 2024 | Image Captioning, Question Answering | Unverified | 0 |
| LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering | Dec 24, 2024 | Explanatory Visual Question Answering, Multimodal Reasoning | Code Available | 0 |
| TextMatch: Enhancing Image-Text Consistency Through Multimodal Optimization | Dec 24, 2024 | In-Context Learning, Question Answering | Unverified | 0 |
| Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering | Dec 24, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective | Dec 23, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |