| Pragmatic Issue-Sensitive Image Captioning | Apr 29, 2020 | Descriptive, Image Captioning | Code Available | 0 |
| Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data | Apr 7, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| Learning Visual Question Answering by Bootstrapping Hard Attention | Aug 1, 2018 | Hard Attention, Question Answering | Code Available | 0 |
| Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory | Jul 4, 2021 | Question Answering, Scene Understanding | Code Available | 0 |
| Learning to Reason: End-to-End Module Networks for Visual Question Answering | Apr 18, 2017 | Visual Dialog, Visual Question Answering | Code Available | 0 |
| Black-box Model Ensembling for Textual and Visual Question Answering via Information Fusion | Jul 4, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| End-to-End Instance Segmentation with Recurrent Attention | May 30, 2016 | Autonomous Driving, Image Captioning | Code Available | 0 |
| Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering | Jun 28, 2023 | Passage Retrieval, Question Answering | Code Available | 0 |
| Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays | Feb 14, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles | Nov 7, 2020 | Natural Language Inference, Question Answering | Code Available | 0 |
| Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | Apr 11, 2024 | Descriptive, Hallucination | Code Available | 0 |
| End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features | Jun 21, 2018 | Question Answering, Video Description | Code Available | 0 |
| Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models | May 8, 2025 | Active Learning, cross-modal alignment | Code Available | 0 |
| Co-attending Regions and Detections with Multi-modal Multiplicative Embedding for VQA | Nov 18, 2017 | Form, Question Answering | Code Available | 0 |
| Learning to Follow Object-Centric Image Editing Instructions Faithfully | Oct 29, 2023 | Object, Question Answering | Code Available | 0 |
| Co-attending Free-form Regions and Detections with Multi-modal Multiplicative Feature Embedding for Visual Question Answering | Nov 18, 2017 | Form, Visual Question Answering | Code Available | 0 |
| Augmenting Visual Question Answering with Semantic Frame Information in a Multitask Learning Approach | Jan 31, 2020 | Question Answering, Visual Question Answering | Code Available | 0 |
| VinVL+L: Enriching Visual Representation with Location Context in VQA | Feb 22, 2023 | Question Answering, TAG | Code Available | 0 |
| CluMo: Cluster-based Modality Fusion Prompt for Continual Learning in Visual Question Answering | Aug 21, 2024 | Continual Learning, Question Answering | Code Available | 0 |
| TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | Apr 14, 2017 | Question Answering, Visual Question Answering | Code Available | 0 |
| Learning to Count Objects in Natural Images for Visual Question Answering | Feb 15, 2018 | Visual Question Answering, Visual Question Answering (VQA) | Code Available | 0 |
| Effective Approaches to Batch Parallelization for Dynamic Neural Network Architectures | Jul 8, 2017 | Mixture-of-Experts, Question Answering | Code Available | 0 |
| EaSe: A Diagnostic Tool for VQA based on Answer Diversity | Jun 1, 2021 | Diagnostic, Diversity | Code Available | 0 |
| Learning the meanings of function words from grounded language using a visual question answering model | Aug 16, 2023 | Logical Reasoning, Question Answering | Code Available | 0 |
| Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models | Mar 22, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| Learning Representations of Sets through Optimized Permutations | Dec 10, 2018 | General Classification, Question Answering | Code Available | 0 |
| ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities | Nov 16, 2021 | Articles, Face Recognition | Code Available | 0 |
| ClinKD: Cross-Modal Clinical Knowledge Distiller For Multi-Task Medical Images | Feb 9, 2025 | Clinical Knowledge, Medical Visual Question Answering | Code Available | 0 |
| VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives | Jun 22, 2022 | Feature Importance, Question Answering | Code Available | 0 |
| Learning from Lexical Perturbations for Consistent Visual Question Answering | Nov 26, 2020 | Question Answering, Visual Question Answering | Code Available | 0 |
| The Illusion of Competence: Evaluating the Effect of Explanations on Users' Mental Models of Visual Question Answering Systems | Jun 27, 2024 | Question Answering, Visual Question Answering | Code Available | 0 |
| Learning Convolutional Text Representations for Visual Question Answering | May 18, 2017 | General Classification, image-classification | Code Available | 0 |
| Attribute Diversity Determines the Systematicity Gap in VQA | Nov 15, 2023 | Attribute, Diagnostic | Code Available | 0 |
| What value do explicit high level concepts have in vision to language problems? | Jun 3, 2015 | Image Captioning, Question Answering | Code Available | 0 |
| CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions | Jan 3, 2019 | Diagnostic, Image Segmentation | Code Available | 0 |
| Learning content and context with language bias for Visual Question Answering | Dec 21, 2020 | Question Answering, Visual Question Answering | Code Available | 0 |
| The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision | Apr 26, 2019 | Image-text Retrieval, Object | Code Available | 0 |
| The Promise of Premise: Harnessing Question Premises in Visual Question Answering | May 1, 2017 | Question Answering, Relevance Detection | Code Available | 0 |
| Attention on Attention: Architectures for Visual Question Answering (VQA) | Mar 21, 2018 | GPU, Question Answering | Code Available | 0 |
| Dynamic Task and Weight Prioritization Curriculum Learning for Multimodal Imagery | Oct 29, 2023 | Deep Learning, Multimodal Deep Learning | Code Available | 0 |
| Ask Your Neurons: A Deep Learning Approach to Visual Question Answering | May 9, 2016 | Question Answering, Visual Question Answering | Code Available | 0 |
| Learning Conditioned Graph Structures for Interpretable Visual Question Answering | Jun 19, 2018 | Question Answering, Visual Question Answering | Code Available | 0 |
| QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models | Apr 15, 2025 | Question Answering, Visual Question Answering | Code Available | 0 |
| Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning | Apr 1, 2024 | Image Captioning, Instruction Following | Code Available | 0 |
| VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers | Mar 30, 2022 | Question Answering, Visual Commonsense Reasoning | Code Available | 0 |
| QLEVR: A Diagnostic Dataset for Quantificational Language and Elementary Visual Reasoning | May 6, 2022 | Diagnostic, Question Answering | Code Available | 0 |
| QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining | May 29, 2025 | Question Answering, Representation Learning | Code Available | 0 |
| Quantifying and Alleviating the Language Prior Problem in Visual Question Answering | May 13, 2019 | Information Retrieval, Question Answering | Code Available | 0 |
| Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning | Nov 17, 2024 | Image Captioning, Language Modeling | Code Available | 0 |
| Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts | Nov 18, 2024 | Benchmarking, Multimodal Large Language Model | Code Available | 0 |