| LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering | Jan 29, 2024 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression | Nov 21, 2024 | Visual Question Answering | Unverified | 0 | 0 |
| Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering | Oct 17, 2020 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Learning Answer Embeddings for Visual Question Answering | Jun 10, 2018 | Question Answering, Transfer Learning | Unverified | 0 | 0 |
| Learning by Asking Questions | Dec 4, 2017 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| A Novel Framework for Robustness Analysis of Visual QA Models | Nov 16, 2017 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision | Oct 24, 2022 | cross-modal alignment, Cross-Modal Retrieval | Unverified | 0 | 0 |
| Learning Compositional Representation for Few-shot Visual Question Answering | Feb 21, 2021 | Attribute, Question Answering | Unverified | 0 | 0 |
| Variational Disentangled Attention for Regularized Visual Dialog | Sep 29, 2021 | Question Answering, Visual Dialog | Unverified | 0 | 0 |
| Variational Visual Question Answering | May 14, 2025 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| A Novel Attention-based Aggregation Function to Combine Vision and Language | Apr 27, 2020 | General Classification, Image Captioning | Unverified | 0 | 0 |
| FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering | Jun 25, 2025 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| VCD: Knowledge Base Guided Visual Commonsense Discovery in Images | Feb 27, 2024 | Decision Making, Language Modelling | Unverified | 0 | 0 |
| Learning How To Ask: Cycle-Consistency Refines Prompts in Multimodal Foundation Models | Feb 13, 2024 | Code Generation, HumanEval | Unverified | 0 | 0 |
| Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering | Apr 16, 2016 | General Classification, Human-Object Interaction Detection | Unverified | 0 | 0 |
| Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues | Mar 1, 2021 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models | Jun 10, 2025 | Action Generation, Image Captioning | Unverified | 0 | 0 |
| Learning Rich Image Region Representation for Visual Question Answering | Oct 29, 2019 | Language Modeling, Language Modelling | Unverified | 0 | 0 |
| FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | Oct 1, 2024 | Benchmarking, Fairness | Unverified | 0 | 0 |
| Learning Sparse Mixture of Experts for Visual Question Answering | Sep 19, 2019 | Mixture-of-Experts, Question Answering | Unverified | 0 | 0 |
| Learning Sparsity for Effective and Efficient Music Performance Question Answering | Jun 2, 2025 | Audio-visual Question Answering, Question Answering | Unverified | 0 | 0 |
| Annotation Methodologies for Vision and Language Dataset Creation | Jul 10, 2016 | Action Recognition, Image Description | Unverified | 0 | 0 |
| FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts | Jun 27, 2024 | Decision Making, Logical Reasoning | Unverified | 0 | 0 |
| FlexCap: Describe Anything in Images in Controllable Detail | Mar 18, 2024 | Attribute, Dense Captioning | Unverified | 0 | 0 |
| Learning to Compose Diversified Prompts for Image Emotion Classification | Jan 26, 2022 | Classification, Emotion Classification | Unverified | 0 | 0 |