| Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos | Sep 21, 2022 | Action Detection, Action Recognition | Code Available | 0 | 5 |
| A simple neural network module for relational reasoning | Jun 5, 2017 | Image Retrieval with Multi-Modal Query, Question Answering | Code Available | 0 | 5 |
| Kvasir-VQA: A Text-Image Pair GI Tract Dataset | Sep 2, 2024 | Image Captioning, Image Generation | Code Available | 0 | 5 |
| Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy | Jun 11, 2025 | Medical Visual Question Answering, Question Answering | Code Available | 0 | 5 |
| Modularized Zero-shot VQA with Pre-trained Models | May 27, 2023 | object-detection, Object Detection | Code Available | 0 | 5 |
| Modulating early visual processing by language | Jul 2, 2017 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| A Simple Loss Function for Improving the Convergence and Accuracy of Visual Question Answering Models | Aug 2, 2017 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| A Simple Baseline for Knowledge-Based Visual Question Answering | Oct 20, 2023 | In-Context Learning, Question Answering | Code Available | 0 | 5 |
| MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models | Dec 10, 2024 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering | May 26, 2025 | Continual Learning, Question Answering | Code Available | 0 | 5 |
| Examining Gender and Racial Bias in Large Vision-Language Models Using a Novel Dataset of Parallel Images | Feb 8, 2024 | Image Captioning, Question Answering | Code Available | 0 | 5 |
| MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering | Nov 1, 2021 | multimodal interaction, Multiple-choice | Code Available | 0 | 5 |
| Mixture-of-Subspaces in Low-Rank Adaptation | Jun 16, 2024 | Common Sense Reasoning, Image Generation | Code Available | 0 | 5 |
| Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts | Jun 25, 2024 | Fairness, Question Answering | Code Available | 0 | 5 |
| ArtQuest: Countering Hidden Language Biases in ArtVQA | Jan 4, 2024 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| Evaluating Attribute Comprehension in Large Vision-Language Models | Aug 25, 2024 | Attribute, Image-text matching | Code Available | 0 | 5 |
| ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments | Oct 8, 2024 | Decoder, Question Answering | Code Available | 0 | 5 |
| MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding | Jan 11, 2020 | Image Captioning, Image-text Retrieval | Code Available | 0 | 5 |
| MUREL: Multimodal Relational Reasoning for Visual Question Answering | Feb 25, 2019 | Relational Reasoning, Visual Question Answering | Code Available | 0 | 5 |
| Med-PMC: Medical Personalized Multi-modal Consultation with a Proactive Ask-First-Observe-Next Paradigm | Aug 16, 2024 | Decision Making, Medical Visual Question Answering | Code Available | 0 | 5 |
| MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models | Feb 28, 2025 | Decision Making, Hallucination | Code Available | 0 | 5 |
| Enhancing Vietnamese VQA through Curriculum Learning on Raw and Augmented Text Representations | Mar 5, 2025 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| Measuring Faithful and Plausible Visual Grounding in VQA | May 24, 2023 | Question Answering, Visual Grounding | Code Available | 0 | 5 |
| Are VLMs Really Blind | Oct 29, 2024 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering | Apr 22, 2024 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| Latent Alignment and Variational Attention | Jul 10, 2018 | Hard Attention, Machine Translation | Code Available | 0 | 5 |
| Answer Questions with Right Image Regions: A Visual Attention Regularization Approach | Feb 3, 2021 | Question Answering, Visual Grounding | Code Available | 0 | 5 |
| CAST: Cross-modal Alignment Similarity Test for Vision Language Models | Sep 17, 2024 | cross-modal alignment, Question Answering | Code Available | 0 | 5 |
| Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens | Jun 19, 2024 | Caption Generation, image-classification | Code Available | 0 | 5 |
| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Apr 20, 2025 | Autonomous Driving, Image Captioning | Code Available | 0 | 5 |
| Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation | Jun 27, 2024 | Continual Learning, Question Answering | Code Available | 0 | 5 |
| LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | Apr 18, 2022 | cross-modal alignment, Document AI | Code Available | 0 | 5 |
| Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data | Apr 7, 2025 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| Cascaded Mutual Modulation for Visual Reasoning | Sep 6, 2018 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning | Nov 17, 2024 | Image Captioning, Language Modeling | Code Available | 0 | 5 |
| Answer Them All! Toward Universal Visual Question Answering Models | Mar 1, 2019 | All, Question Answering | Code Available | 0 | 5 |
| MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models | Dec 31, 2024 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning | Apr 1, 2024 | Image Captioning, Instruction Following | Code Available | 0 | 5 |
| MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | Mar 29, 2023 | Cross-Modal Retrieval, Decoder | Code Available | 0 | 5 |
| Visual Question Answering: which investigated applications? | Mar 4, 2021 | Image Captioning, Question Answering | Code Available | 0 | 5 |
| End-to-End Instance Segmentation with Recurrent Attention | May 30, 2016 | Autonomous Driving, Image Captioning | Code Available | 0 | 5 |
| End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features | Jun 21, 2018 | Question Answering, Video Description | Code Available | 0 | 5 |
| LPF: A Language-Prior Feedback Objective Function for De-biased Visual Question Answering | May 29, 2021 | Question Answering, Visual Question Answering | Code Available | 0 | 5 |
| LXMERT Model Compression for Visual Question Answering | Oct 23, 2023 | model, Model Compression | Code Available | 0 | 5 |
| Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | Dec 2, 2016 | Visual Question Answering, Visual Question Answering (VQA) | Code Available | 0 | 5 |
| Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding | Mar 18, 2025 | document understanding, Question Answering | Code Available | 0 | 5 |
| Logical Implications for Visual Question Answering Consistency | Mar 16, 2023 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| Locally Smoothed Neural Networks | Nov 22, 2017 | Face Verification, Question Answering | Code Available | 0 | 5 |
| LLM-Assisted Multi-Teacher Continual Learning for Visual Question Answering in Robotic Surgery | Feb 26, 2024 | Continual Learning, Exemplar-Free | Code Available | 0 | 5 |
| Loss re-scaling VQA: Revisiting the LanguagePrior Problem from a Class-imbalance View | Oct 30, 2020 | Face Recognition, image-classification | Code Available | 0 | 5 |