| Multimodal Commonsense Knowledge Distillation for Visual Question Answering | Nov 5, 2024 | Knowledge Distillation, Question Answering | —Unverified | 0 | 0 |
| VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework | Mar 14, 2024 | Language Modeling | —Unverified | 0 | 0 |
| Multimodal Compact Bilinear Pooling for Multimodal Neural Machine Translation | Mar 23, 2017 | Decoder, Machine Translation | —Unverified | 0 | 0 |
| Multimodal Continuous Visual Attention Mechanisms | Apr 7, 2021 | Clustering, Question Answering | —Unverified | 0 | 0 |
| Multi-modal Deep Analysis for Multimedia | Oct 11, 2019 | Multi-modal Recommendation, Question Answering | —Unverified | 0 | 0 |
| Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration | May 11, 2025 | Benchmarking, Descriptive | —Unverified | 0 | 0 |
| Vision-Language Models as Success Detectors | Mar 13, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Vision Language Models Can Parse Floor Plan Maps | Sep 19, 2024 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! | Oct 13, 2020 | Diagnostic, Image-text Classification | —Unverified | 0 | 0 |
| Multimodal Few-Shot Learning with Frozen Language Models | Jun 25, 2021 | Few-Shot Learning, Language Modeling | —Unverified | 0 | 0 |
| Document Visual Question Answering Challenge 2020 | Aug 20, 2020 | Question Answering, Retrieval | —Unverified | 0 | 0 |
| Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing | Oct 10, 2022 | Question Answering, Representation Learning | —Unverified | 0 | 0 |
| Multimodal Graph Networks for Compositional Generalization in Visual Question Answering | Dec 1, 2020 | Graph Neural Network, Question Answering | —Unverified | 0 | 0 |
| Multimodal grid features and cell pointers for Scene Text Visual Question Answering | Jun 1, 2020 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Multi-Modal Instruction-Tuning Small-Scale Language-and-Vision Assistant for Semiconductor Electron Micrograph Analysis | Aug 27, 2024 | Instruction Following, Question Answering | —Unverified | 0 | 0 |
| Multimodal Integration of Human-Like Attention in Visual Question Answering | Sep 27, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Multimodal Intelligence: Representation Learning, Information Fusion, and Applications | Nov 10, 2019 | Caption Generation, Image Generation | —Unverified | 0 | 0 |
| Document Collection Visual Question Answering | Apr 27, 2021 | Document Understanding, Question Answering | —Unverified | 0 | 0 |
| Multi-modality Latent Interaction Network for Visual Question Answering | Aug 10, 2019 | Language Modeling | —Unverified | 0 | 0 |
| Document AI: Benchmarks, Models and Applications | Nov 16, 2021 | Deep Learning, Document AI | —Unverified | 0 | 0 |
| Vision-Language Models for Edge Networks: A Comprehensive Survey | Feb 11, 2025 | Autonomous Vehicles, Image Captioning | —Unverified | 0 | 0 |
| Multimodal Learning and Reasoning for Visual Question Answering | Dec 1, 2017 | Question Answering, Representation Learning | —Unverified | 0 | 0 |
| Scene Graph Reasoning with Prior Visual Relationship for Visual Question Answering | Dec 23, 2018 | Cross-Modal Information Retrieval, Information Retrieval | —Unverified | 0 | 0 |
| Multimodal Neural Graph Memory Networks for Visual Question Answering | Jul 1, 2020 | Graph Neural Network, Question Answering | —Unverified | 0 | 0 |
| DLIP: Distilling Language-Image Pre-training | Aug 24, 2023 | Image Captioning, Image-text Retrieval | —Unverified | 0 | 0 |