| Curriculum Script Distillation for Multilingual Visual Question Answering | Jan 17, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| ParsVQA-Caps: A Benchmark for Visual Question Answering and Image Captioning in Persian | Dec 7, 2022 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| Curriculum Learning for Compositional Visual Reasoning | Mar 27, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Patch-level Sounding Object Tracking for Audio-Visual Question Answering | Dec 14, 2024 | Audio-visual Question Answering, Object Tracking | —Unverified | 0 | 0 |
| A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions | Mar 26, 2024 | Gaze Target Estimation, Question Answering | —Unverified | 0 | 0 |
| Pathological Visual Question Answering | Oct 6, 2020 | AI Agent, Question Answering | —Unverified | 0 | 0 |
| Curriculum Learning Effectively Improves Low Data VQA | Dec 1, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| CTRL-O: Language-Controllable Object-Centric Visual Representation Learning | Mar 27, 2025 | Image Generation, Object | —Unverified | 0 | 0 |
| PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese | Jul 17, 2023 | Question Answering, Vietnamese Visual Question Answering | —Unverified | 0 | 0 |
| CT-Agent: A Multimodal-LLM Agent for 3D CT Radiology Question Answering | May 22, 2025 | Computed Tomography (CT), Question Answering | —Unverified | 0 | 0 |
| PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering | Apr 19, 2024 | Articles, Information Retrieval | —Unverified | 0 | 0 |
| PDFVQA: A New Dataset for Real-World VQA on PDF Documents | Apr 13, 2023 | document understanding, Key Information Extraction | —Unverified | 0 | 0 |
| PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models | Mar 16, 2025 | Machine Unlearning, Privacy Preserving | —Unverified | 0 | 0 |
| CS-VQA: Visual Question Answering with Compressively Sensed Images | Jun 8, 2018 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| CrossVQA: Scalably Generating Benchmarks for Systematically Testing VQA Generalization | Nov 1, 2021 | Answer Generation, Question-Answer-Generation | —Unverified | 0 | 0 |
| A Free Lunch in Generating Datasets: Building a VQG and VQA System with Attention and Humans in the Loop | Nov 30, 2019 | Question Answering, Question Generation | —Unverified | 0 | 0 |
| Performance Analysis of Traditional VQA Models Under Limited Computational Resources | Feb 9, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Parameter Efficient Reinforcement Learning from Human Feedback | Mar 15, 2024 | Question Answering, reinforcement-learning | —Unverified | 0 | 0 |
| Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models | Oct 16, 2024 | Visual Question Answering | —Unverified | 0 | 0 |
| PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly | Jun 10, 2025 | Question Answering, Scene Understanding | —Unverified | 0 | 0 |
| Physically Grounded Vision-Language Models for Robotic Manipulation | Sep 5, 2023 | Image Captioning, Language Modelling | —Unverified | 0 | 0 |
| PiggyBack: Pretrained Visual Question Answering Environment for Backing up Non-deep Learning Professionals | Nov 29, 2022 | Deep Learning, Question Answering | —Unverified | 0 | 0 |
| Cross-Modal Retrieval Augmentation for Multi-Modal Classification | Apr 16, 2021 | Classification, Cross-Modal Retrieval | —Unverified | 0 | 0 |
| Why Does a Visual Question Have Different Answers? | Aug 12, 2019 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs | Feb 12, 2024 | Instruction Following, Logical Reasoning | —Unverified | 0 | 0 |
| Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering | Aug 31, 2020 | Knowledge Graphs, Question Answering | —Unverified | 0 | 0 |
| Cross-Modal Generative Augmentation for Visual Question Answering | May 11, 2021 | Data Augmentation, Question Answering | —Unverified | 0 | 0 |
| A Focused Dynamic Attention Model for Visual Question Answering | Apr 6, 2016 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Playing Lottery Tickets with Vision and Language | Apr 23, 2021 | Image-text Retrieval, Question Answering | —Unverified | 0 | 0 |
| Crossformer: Transformer with Alternated Cross-Layer Guidance | Sep 29, 2021 | Inductive Bias, Machine Translation | —Unverified | 0 | 0 |
| Why Does the VQA Model Answer No?: Improving Reasoning through Visual and Linguistic Inference | Sep 25, 2019 | Common Sense Reasoning, Question Answering | —Unverified | 0 | 0 |
| Cross-Dataset Adaptation for Visual Question Answering | Jun 10, 2018 | Domain Adaptation, Question Answering | —Unverified | 0 | 0 |
| CROME: Cross-Modal Adapters for Efficient Multimodal LLM | Aug 13, 2024 | Instruction Following, Language Modeling | —Unverified | 0 | 0 |
| POINTS: Improving Your Vision-language Model with Affordable Strategies | Sep 7, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Polar-VQA: Visual Question Answering on Remote Sensed Ice sheet Imagery from Polar Region | Mar 13, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| CREPE: Coordinate-Aware End-to-End Document Parser | May 1, 2024 | document understanding, Optical Character Recognition (OCR) | —Unverified | 0 | 0 |
| Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain? | Dec 27, 2021 | Articles, Medical Visual Question Answering | —Unverified | 0 | 0 |
| Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models | Jun 14, 2024 | Decoder, Knowledge Graphs | —Unverified | 0 | 0 |
| CQ-VQA: Visual Question Answering on Categorized Questions | Feb 17, 2020 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Predicting Relative Depth between Objects from Semantic Features | Jan 12, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| CPL: Counterfactual Prompt Learning for Vision and Language Models | Oct 19, 2022 | counterfactual, image-classification | —Unverified | 0 | 0 |
| PreSTU: Pre-Training for Scene-Text Understanding | Sep 12, 2022 | Decoder, Image Captioning | —Unverified | 0 | 0 |
| Pre-training image-language transformers for open-vocabulary tasks | Sep 9, 2022 | Question Answering, Visual Entailment | —Unverified | 0 | 0 |
| Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | Apr 23, 2024 | Question Answering, Retrieval | —Unverified | 0 | 0 |
| A Fast, Reliable, and Secure Programming Language for LLM Agents with Code Actions | Jun 13, 2025 | Conformal Prediction, Question Answering | —Unverified | 0 | 0 |
| CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology | Dec 16, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Co-VQA : Answering by Interactive Sub Question Sequence | Apr 2, 2022 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Privacy Preserving Visual Question Answering | Feb 15, 2022 | Privacy Preserving, Question Answering | —Unverified | 0 | 0 |
| Aesthetic Visual Question Answering of Photographs | Aug 10, 2022 | Question Answering, Sentiment Analysis | —Unverified | 0 | 0 |
| Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering | Feb 21, 2019 | counterfactual, Question Answering | —Unverified | 0 | 0 |