| VLM-Assisted Continual learning for Visual Question Answering in Self-Driving | Feb 2, 2025 | Autonomous Driving, Continual Learning | Unverified | 0 |
| Hypo3D: Exploring Hypothetical Reasoning in 3D | Feb 2, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Anatomy Might Be All You Need: Forecasting What to Do During Surgery | Jan 29, 2025 | All, Anatomy | Unverified | 0 |
| Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | Jan 29, 2025 | Image Generation | Code Available | 11 |
| Large Models in Dialogue for Active Perception and Anomaly Detection | Jan 27, 2025 | Anomaly Detection, Question Answering | Code Available | 0 |
| Scaling Large Vision-Language Models for Enhanced Multimodal Comprehension In Biomedical Image Analysis | Jan 26, 2025 | Articles, Hallucination | Unverified | 0 |
| Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models | Jan 25, 2025 | Attribute, Contrastive Learning | Code Available | 2 |
| Scene Understanding Enabled Semantic Communication with Open Channel Coding | Jan 24, 2025 | Question Answering, Scene Understanding | Unverified | 0 |
| Patent Figure Classification using Large Vision-language Models | Jan 22, 2025 | Classification, Few-Shot Learning | Code Available | 0 |
| Combining Knowledge Graph and LLMs for Enhanced Zero-shot Visual Question Answering | Jan 22, 2025 | Knowledge Graphs, Question Answering | Unverified | 0 |
| VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | Jan 21, 2025 | Image Generation, Instruction Following | Code Available | 3 |
| Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! | Jan 18, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness | Jan 16, 2025 | Adversarial Defense, Adversarial Robustness | Unverified | 0 |
| A Simple Aerial Detection Baseline of Multimodal Language Models | Jan 16, 2025 | object-detection, Object Detection | Code Available | 2 |
| Embodied Scene Understanding for Vision Language Models via MetaVQA | Jan 15, 2025 | Decision Making, Question Answering | Unverified | 0 |
| Dynamic Knowledge Integration for Enhanced Vision-Language Reasoning | Jan 15, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | Jan 14, 2025 | image-classification, Image Classification | Code Available | 2 |
| SAR Strikes Back: A New Hope for RSVQA | Jan 14, 2025 | Question Answering, Visual Question Answering | Unverified | 0 |
| The Quest for Visual Understanding: A Journey Through the Evolution of Visual Question Answering | Jan 13, 2025 | Common Sense Reasoning, Question Answering | Unverified | 0 |
| GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing | Jan 12, 2025 | Image Captioning, Language Modeling | Unverified | 0 |
| Overcoming Language Priors for Visual Question Answering Based on Knowledge Distillation | Jan 10, 2025 | Knowledge Distillation, Question Answering | Unverified | 0 |
| LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding | Jan 9, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Feedback-Driven Vision-Language Alignment with Minimal Human Supervision | Jan 8, 2025 | Hallucination, Question Answering | Unverified | 0 |
| KAnoCLIP: Zero-Shot Anomaly Detection through Knowledge-Driven Prompt Learning and Enhanced Cross-Modal Integration | Jan 7, 2025 | Anomaly Detection, Anomaly Segmentation | Unverified | 0 |
| Visual question answering: from early developments to recent advances -- a survey | Jan 7, 2025 | Descriptive, Natural Language Understanding | Unverified | 0 |