| Multimodal Commonsense Knowledge Distillation for Visual Question Answering | Nov 5, 2024 | Knowledge Distillation, Question Answering | —Unverified | 0 | 0 |
| VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework | Mar 14, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Multimodal Compact Bilinear Pooling for Multimodal Neural Machine Translation | Mar 23, 2017 | Decoder, Machine Translation | —Unverified | 0 | 0 |
| Multimodal Continuous Visual Attention Mechanisms | Apr 7, 2021 | Clustering, Question Answering | —Unverified | 0 | 0 |
| Multi-modal Deep Analysis for Multimedia | Oct 11, 2019 | Multi-modal Recommendation, Question Answering | —Unverified | 0 | 0 |
| Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration | May 11, 2025 | Benchmarking, Descriptive | —Unverified | 0 | 0 |
| Vision-Language Models as Success Detectors | Mar 13, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Vision Language Models Can Parse Floor Plan Maps | Sep 19, 2024 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! | Oct 13, 2020 | Diagnostic, Image-text Classification | —Unverified | 0 | 0 |
| Multimodal Few-Shot Learning with Frozen Language Models | Jun 25, 2021 | Few-Shot Learning, Language Modeling | —Unverified | 0 | 0 |
| Document Visual Question Answering Challenge 2020 | Aug 20, 2020 | Question Answering, Retrieval | —Unverified | 0 | 0 |
| Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing | Oct 10, 2022 | Question Answering, Representation Learning | —Unverified | 0 | 0 |
| Multimodal Graph Networks for Compositional Generalization in Visual Question Answering | Dec 1, 2020 | Graph Neural Network, Question Answering | —Unverified | 0 | 0 |
| Multimodal grid features and cell pointers for Scene Text Visual Question Answering | Jun 1, 2020 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Multi-Modal Instruction-Tuning Small-Scale Language-and-Vision Assistant for Semiconductor Electron Micrograph Analysis | Aug 27, 2024 | Instruction Following, Question Answering | —Unverified | 0 | 0 |
| Multimodal Integration of Human-Like Attention in Visual Question Answering | Sep 27, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Multimodal Intelligence: Representation Learning, Information Fusion, and Applications | Nov 10, 2019 | Caption Generation, Image Generation | —Unverified | 0 | 0 |
| Document Collection Visual Question Answering | Apr 27, 2021 | document understanding, Question Answering | —Unverified | 0 | 0 |
| Multi-modality Latent Interaction Network for Visual Question Answering | Aug 10, 2019 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Document AI: Benchmarks, Models and Applications | Nov 16, 2021 | Deep Learning, Document AI | —Unverified | 0 | 0 |
| Vision-Language Models for Edge Networks: A Comprehensive Survey | Feb 11, 2025 | Autonomous Vehicles, Image Captioning | —Unverified | 0 | 0 |
| Multimodal Learning and Reasoning for Visual Question Answering | Dec 1, 2017 | Question Answering, Representation Learning | —Unverified | 0 | 0 |
| Scene Graph Reasoning with Prior Visual Relationship for Visual Question Answering | Dec 23, 2018 | Cross-Modal Information Retrieval, Information Retrieval | —Unverified | 0 | 0 |
| Multimodal Neural Graph Memory Networks for Visual Question Answering | Jul 1, 2020 | Graph Neural Network, Question Answering | —Unverified | 0 | 0 |
| DLIP: Distilling Language-Image Pre-training | Aug 24, 2023 | Image Captioning, Image-text Retrieval | —Unverified | 0 | 0 |
| A Multimodal Memes Classification: A Survey and Open Research Issues | Sep 17, 2020 | Classification, General Classification | —Unverified | 0 | 0 |
| Diversity and Consistency: Exploring Visual Question-Answer Pair Generation | Nov 1, 2021 | Diversity, Question Answering | —Unverified | 0 | 0 |
| Diversifying Joint Vision-Language Tokenization Learning | Jun 6, 2023 | Question Answering, Representation Learning | —Unverified | 0 | 0 |
| Multimodal Representations for Teacher-Guided Compositional Visual Reasoning | Oct 24, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Multimodal Reranking for Knowledge-Intensive Visual Question Answering | Jul 17, 2024 | Answer Generation, Question Answering | —Unverified | 0 | 0 |
| American == White in Multimodal Language-and-Image AI | Jul 1, 2022 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| DistilDoc: Knowledge Distillation for Visually-Rich Document Applications | Jun 12, 2024 | document-image-classification, Document Image Classification | —Unverified | 0 | 0 |
| Multimodal Transformer With a Low-Computational-Cost Guarantee | Feb 23, 2024 | Action Recognition, Question Answering | —Unverified | 0 | 0 |
| Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA | Jun 27, 2024 | General Knowledge, Question Answering | —Unverified | 0 | 0 |
| Multimodal Unified Attention Networks for Vision-and-Language Interactions | Aug 12, 2019 | Question Answering, Visual Grounding | —Unverified | 0 | 0 |
| All You May Need for VQA are Image Captions | Jan 16, 2022 | All, Image Captioning | —Unverified | 0 | 0 |
| All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark | Feb 24, 2025 | All, Multimodal Reasoning | —Unverified | 0 | 0 |
| Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning | May 21, 2025 | Computational Efficiency, Diagnostic | —Unverified | 0 | 0 |
| Directional Gradient Projection for Robust Fine-Tuning of Foundation Models | Feb 21, 2025 | image-classification, Image Classification | —Unverified | 0 | 0 |
| Vision-Language Pretraining: Current Trends and the Future | May 1, 2022 | Question Answering, Representation Learning | —Unverified | 0 | 0 |
| DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels | Mar 24, 2025 | Medical Visual Question Answering, Question Answering | —Unverified | 0 | 0 |
| Multi-task Learning of Hierarchical Vision-Language Representation | Dec 3, 2018 | Multi-Task Learning, Question Answering | —Unverified | 0 | 0 |
| AlignVE: Visual Entailment Recognition Based on Alignment Relations | Nov 16, 2022 | Question Answering, Relation | —Unverified | 0 | 0 |
| Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck | May 30, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| MUST-VQA: MUltilingual Scene-text VQA | Sep 14, 2022 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering | Jan 1, 2025 | Contrastive Learning, Medical Visual Question Answering | —Unverified | 0 | 0 |
| Differentiable End-to-End Program Executor for Sample and Computationally Efficient VQA | Jan 1, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering | Jul 7, 2021 | Medical Visual Question Answering, Missing Labels | —Unverified | 0 | 0 |
| MyVLM: Personalizing VLMs for User-Specific Queries | Mar 21, 2024 | Image Captioning, Language Modelling | —Unverified | 0 | 0 |
| Vision-to-Language Tasks Based on Attributes and Attention Mechanism | May 29, 2019 | Image Captioning, Question Answering | —Unverified | 0 | 0 |