| Improved Alignment of Modalities in Large Vision Language Models | Mar 25, 2025 | GPU, Image Captioning | —Unverified | 0 |
| Domain Adaptation of VLM for Soccer Video Understanding | May 20, 2025 | Action Classification, Domain Adaptation | —Unverified | 0 |
| Do Explanations make VQA Models more Predictable to a Human? | Oct 29, 2018 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Adversarial Regularization for Visual Question Answering: Strengths, Shortcomings, and Side Effects | Jun 20, 2019 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Improved Baselines for Data-efficient Perceptual Augmentation of LLMs | Mar 20, 2024 | Audio captioning, Image Captioning | —Unverified | 0 |
| Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? | Jun 20, 2024 | Caption Generation, Hallucination | —Unverified | 0 |
| Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! | Oct 13, 2020 | Diagnostic, Image-text Classification | —Unverified | 0 |
| Boosting Cross-task Transferability of Adversarial Patches with Visual Relations | Apr 11, 2023 | Image Captioning, Object Recognition | —Unverified | 0 |
| Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering | May 2, 2022 | Decoder, Image Captioning | —Unverified | 0 |
| BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining | Jan 12, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Document Visual Question Answering Challenge 2020 | Aug 20, 2020 | Question Answering, Retrieval | —Unverified | 0 |
| Document Collection Visual Question Answering | Apr 27, 2021 | document understanding, Question Answering | —Unverified | 0 |
| Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models | Sep 3, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Improved Bilinear Pooling with CNNs | Jul 21, 2017 | GPU, Question Answering | —Unverified | 0 |
| Improving Users' Mental Model with Attention-directed Counterfactual Edits | Oct 13, 2021 | counterfactual, Question Answering | —Unverified | 0 |
| Document AI: Benchmarks, Models and Applications | Nov 16, 2021 | Deep Learning, Document AI | —Unverified | 0 |
| A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models | Feb 28, 2024 | Image Description, Question Answering | —Unverified | 0 |
| Image Semantic Relation Generation | Oct 19, 2022 | Image Retrieval, Image Segmentation | —Unverified | 0 |
| DLIP: Distilling Language-Image Pre-training | Aug 24, 2023 | Image Captioning, Image-text Retrieval | —Unverified | 0 |
| Generating Question Relevant Captions to Aid Visual Question Answering | Jun 3, 2019 | General Knowledge, Image Captioning | —Unverified | 0 |
| ImageTTR: Grounding Type Theory with Records in Image Classification for Visual Question Answering | Jun 1, 2019 | General Classification, image-classification | —Unverified | 0 |
| Diversity and Consistency: Exploring Visual Question-Answer Pair Generation | Nov 1, 2021 | Diversity, Question Answering | —Unverified | 0 |
| Diversifying Joint Vision-Language Tokenization Learning | Jun 6, 2023 | Question Answering, Representation Learning | —Unverified | 0 |
| DistilDoc: Knowledge Distillation for Visually-Rich Document Applications | Jun 12, 2024 | document-image-classification, Document Image Classification | —Unverified | 0 |
| Adversarial Attacks Beyond the Image Space | Nov 20, 2017 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models | Aug 8, 2024 | Contrastive Learning, Fine-Grained Image Recognition | —Unverified | 0 |
| Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA | Jun 27, 2024 | General Knowledge, Question Answering | —Unverified | 0 |
| Adventurer's Treasure Hunt: A Transparent System for Visually Grounded Compositional Visual Question Answering based on Scene Graphs | Jun 28, 2021 | Question Answering, Task 2 | —Unverified | 0 |
| Image Captioning with Compositional Neural Module Networks | Jul 10, 2020 | Image Captioning, Question Answering | —Unverified | 0 |
| Answer-checking in Context: A Multi-modal FullyAttention Network for Visual Question Answering | Oct 17, 2020 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Image Manipulation via Multi-Hop Instructions -- A New Dataset and Weakly-Supervised Neuro-Symbolic Approach | May 23, 2023 | Image Manipulation, Question Answering | —Unverified | 0 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | May 22, 2024 | Multimodal Reasoning, Visual Question Answering | —Unverified | 0 |
| Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning | May 21, 2025 | Computational Efficiency, Diagnostic | —Unverified | 0 |
| Directional Gradient Projection for Robust Fine-Tuning of Foundation Models | Feb 21, 2025 | image-classification, Image Classification | —Unverified | 0 |
| DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels | Mar 24, 2025 | Medical Visual Question Answering, Question Answering | —Unverified | 0 |
| A Novel Framework for Robustness Analysis of Visual QA Models | Nov 16, 2017 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Image Captioning and Visual Question Answering Based on Attributes and External Knowledge | Mar 9, 2016 | General Knowledge, Image Captioning | —Unverified | 0 |
| Image Position Prediction in Multimodal Documents | May 1, 2020 | Articles, Caption Generation | —Unverified | 0 |
| ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | Dec 9, 2024 | Image Generation, Language Modeling | —Unverified | 0 |
| Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions | Oct 24, 2020 | General Classification, Multiple-choice | —Unverified | 0 |
| Differentiable End-to-End Program Executor for Sample and Computationally Efficient VQA | Jan 1, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| A Novel Attention-based Aggregation Function to Combine Vision and Language | Apr 27, 2020 | General Classification, Image Captioning | —Unverified | 0 |
| Beyond the Hype: A dispassionate look at vision-language models in medical scenario | Aug 16, 2024 | Question Answering, Spatial Reasoning | —Unverified | 0 |
| Advancing Surgical VQA with Scene Graph Knowledge | Dec 15, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Detection-based Intermediate Supervision for Visual Question Answering | Dec 26, 2023 | cross-modal alignment, Logical Reasoning | —Unverified | 0 |
| Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos | Apr 10, 2025 | Question Answering, Video Generation | —Unverified | 0 |
| An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models | Jun 10, 2025 | Action Generation, Image Captioning | —Unverified | 0 |
| CLIPPO: Image-and-Language Understanding from Pixels Only | Dec 15, 2022 | Contrastive Learning, image-classification | —Unverified | 0 |
| Detecting and Evaluating Medical Hallucinations in Large Vision Language Models | Jun 14, 2024 | Hallucination, Medical Visual Question Answering | —Unverified | 0 |
| Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation | Sep 23, 2024 | Multiple-choice, Question Answering | —Unverified | 0 |