| Enhancing Generalization in Medical Visual Question Answering Tasks via Gradient-Guided Model Perturbation | Mar 5, 2024 | Data Augmentation, Medical Visual Question Answering | —Unverified | 0 | 0 |
| ViLMedic: a framework for research at the intersection of vision and language in medical AI | May 1, 2022 | Medical Visual Question Answering, Question Answering | —Unverified | 0 | 0 |
| Enhancing Explainability in Multimodal Large Language Models Using Ontological Context | Sep 27, 2024 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents | Oct 25, 2023 | All, Document Classification | —Unverified | 0 | 0 |
| MIMOQA: Multimodal Input Multimodal Output Question Answering | Jun 1, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis | Jul 3, 2024 | Position, Question Answering | —Unverified | 0 | 0 |
| Mindstorms in Natural Language-Based Societies of Mind | May 26, 2023 | 3D Generation, Image Captioning | —Unverified | 0 | 0 |
| Enhancing BERT-Based Visual Question Answering through Keyword-Driven Sentence Selection | Oct 13, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Enhanced Textual Feature Extraction for Visual Question Answering: A Simple Convolutional Approach | May 1, 2024 | Computational Efficiency, Question Answering | —Unverified | 0 | 0 |
| Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering | Dec 30, 2024 | Image Captioning, Object Recognition | —Unverified | 0 | 0 |
| Enforcing Reasoning in Visual Commonsense Reasoning | Oct 21, 2019 | Question Answering, Reinforcement Learning | —Unverified | 0 | 0 |
| End-to-End Vision Tokenizer Tuning | May 15, 2025 | Image Generation, Question Answering | —Unverified | 0 | 0 |
| Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories | Jun 15, 2023 | Question Answering, Retrieval | —Unverified | 0 | 0 |
| Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | Mar 12, 2022 | Image Captioning, Knowledge Distillation | —Unverified | 0 | 0 |
| Where is this coming from? Making groundedness count in the evaluation of Document VQA models | Mar 24, 2025 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | Nov 16, 2021 | Image Captioning, Knowledge Distillation | —Unverified | 0 | 0 |
| Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding | Sep 10, 2024 | Hallucination, Image Captioning | —Unverified | 0 | 0 |
| EmoAssist: Emotional Assistant for Visual Impairment Community | Feb 13, 2025 | Emotional Intelligence, Question Answering | —Unverified | 0 | 0 |
| Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy | Mar 26, 2025 | Hallucination, Image Captioning | —Unverified | 0 | 0 |
| Data-augmented phrase-level alignment for mitigating object hallucination | May 28, 2024 | Data Augmentation, Hallucination | —Unverified | 0 | 0 |
| Mitigating the Impact of Attribute Editing on Face Recognition | Mar 12, 2024 | Attribute, Face Recognition | —Unverified | 0 | 0 |
| MIVC: Multiple Instance Visual Component for Visual-Language Models | Dec 28, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision | Oct 10, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Embodied Scene Understanding for Vision Language Models via MetaVQA | Jan 15, 2025 | Decision Making, Question Answering | —Unverified | 0 | 0 |
| Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering | Jun 3, 2024 | Diversity, Question Answering | —Unverified | 0 | 0 |
| Analysis of Visual Question Answering Algorithms with attention model | May 4, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | May 24, 2023 | Image Captioning, Language Modelling | —Unverified | 0 | 0 |
| MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | Sep 30, 2024 | Mixture-of-Experts, Optical Character Recognition (OCR) | —Unverified | 0 | 0 |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | Mar 14, 2024 | In-Context Learning, Mixture-of-Experts | —Unverified | 0 | 0 |
| MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling | Oct 14, 2024 | Denoising, Image Generation | —Unverified | 0 | 0 |
| ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders | Aug 2, 2023 | Contrastive Learning, Question Answering | —Unverified | 0 | 0 |
| Eliminating the Language Bias for Visual Question Answering with fine-grained Causal Intervention | Oct 14, 2024 | Contrastive Learning, counterfactual | —Unverified | 0 | 0 |
| MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models | Oct 13, 2024 | Cross-Modal Retrieval, Question Answering | —Unverified | 0 | 0 |
| MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | May 28, 2024 | Decision Making, Video Understanding | —Unverified | 0 | 0 |
| Eliminating Catastrophic Interference with Biased Competition | Jul 3, 2020 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| MMED: A Multi-domain and Multi-modality Event Dataset | Apr 4, 2019 | Articles, Question Answering | —Unverified | 0 | 0 |
| MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning | Nov 5, 2024 | MME, Question Answering | —Unverified | 0 | 0 |
| ElectroVizQA: How well do Multi-modal LLMs perform in Electronics Visual Question Answering? | Nov 27, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Efficient Multi-modal Large Language Models via Visual Token Grouping | Nov 26, 2024 | Image Captioning, Question Answering | —Unverified | 0 | 0 |
| EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models | Jan 1, 2025 | MM-Vet, Multimodal Reasoning | —Unverified | 0 | 0 |
| EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models | Mar 19, 2025 | MM-Vet, Multimodal Reasoning | —Unverified | 0 | 0 |
| MMIU: Dataset for Visual Intent Understanding in Multimodal Assistants | Oct 13, 2021 | intent-classification, Intent Classification | —Unverified | 0 | 0 |
| MMKB-RAG: A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework | Apr 14, 2025 | Question Answering, RAG | —Unverified | 0 | 0 |
| Efficient Few-Shot Continual Learning in Vision-Language Models | Feb 6, 2025 | Continual Learning, Image Captioning | —Unverified | 0 | 0 |
| Where To Look: Focus Regions for Visual Question Answering | Nov 23, 2015 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference | Nov 15, 2024 | Quantization, Question Answering | —Unverified | 0 | 0 |
| MM-R^3: On (In-)Consistency of Multi-modal Large Language Models (MLLMs) | Oct 7, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering | Oct 28, 2024 | Computational Efficiency, Decision Making | —Unverified | 0 | 0 |
| MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs | Jun 24, 2024 | Question Answering, Visual Question Answering | —Unverified | 0 | 0 |
| Efficiency in Focus: LayerNorm as a Catalyst for Fine-tuning Medical Visual Language Pre-trained Models | Apr 25, 2024 | Medical Visual Question Answering, parameter-efficient fine-tuning | —Unverified | 0 | 0 |