| CaMML: Context-Aware Multimodal Learner for Large Models | Jan 6, 2024 | Visual Question Answering | Code Available | 1 |
| Check It Again: Progressive Visual Question Answering via Visual Entailment | Aug 1, 2021 | Question Answering, Visual Entailment | Code Available | 1 |
| Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models | Dec 15, 2023 | Image Captioning, In-Context Learning | Code Available | 1 |
| GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Nov 5, 2023 | Anomaly Detection, Question Answering | Code Available | 1 |
| InfMLLM: A Unified Framework for Visual-Language Tasks | Nov 12, 2023 | GPU, Image Captioning | Code Available | 1 |
| MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model | Jun 17, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge | Jun 3, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Oct 11, 2024 | MME, Question Answering | Code Available | 1 |
| Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering | Apr 22, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images | Oct 1, 2021 | Question Answering, Visual Question Answering | Code Available | 1 |
| HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Mar 20, 2024 | MME, Visual Question Answering | Code Available | 1 |
| MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering | Mar 17, 2022 | Implicit Relations, Question Answering | Code Available | 1 |
| Dynamic Language Binding in Relational Visual Reasoning | Apr 30, 2020 | Object, Question Answering | Code Available | 1 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | May 31, 2022 | Common Sense Reasoning, Graph Generation | Code Available | 1 |
| FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding | Dec 5, 2020 | image-classification, Image Classification | Code Available | 1 |
| I2I: Initializing Adapters with Improvised Knowledge | Apr 4, 2023 | Continual Learning, Question Answering | Code Available | 1 |
| Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images | Jan 1, 2021 | Attribute, Multiple Instance Learning | Code Available | 1 |
| Multimodal Federated Learning via Contrastive Representation Ensemble | Feb 17, 2023 | Federated Learning, Image-text Retrieval | Code Available | 1 |
| LaPA: Latent Prompt Assist Model For Medical Visual Question Answering | Apr 19, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 1 |
| Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications | Feb 1, 2023 | Question Answering, Representation Learning | Code Available | 1 |
| Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark | Jun 10, 2023 | Image-text Retrieval, Medical Report Generation | Code Available | 1 |
| Faithful Multimodal Explanation for Visual Question Answering | Sep 8, 2018 | Explanatory Visual Question Answering, Question Answering | Code Available | 1 |
| Skipping Computations in Multimodal LLMs | Oct 12, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering | Jun 1, 2019 | Question Answering, Visual Question Answering | Unverified | 0 |
| Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering | Dec 13, 2018 | Question Answering, Visual Question Answering | Unverified | 0 |
| Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks | Apr 14, 2025 | Ethics, Fairness | Unverified | 0 |
| DUBLIN -- Document Understanding By Language-Image Network | May 23, 2023 | Document Classification, document understanding | Unverified | 0 |
| BuDDIE: A Business Document Dataset for Multi-task Information Extraction | Apr 5, 2024 | Document Classification, document understanding | Unverified | 0 |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | Sep 29, 2021 | Question Answering, Visual Entailment | Unverified | 0 |
| Adversarial Representation Learning for Text-to-Image Matching | Aug 28, 2019 | Image Captioning, Language Modeling | Unverified | 0 |
| AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making | Jun 14, 2025 | Decision Making, Question Answering | Unverified | 0 |
| Ontology-based knowledge representation for bone disease diagnosis: a foundation for safe and sustainable medical artificial intelligence systems | Jun 5, 2025 | Diagnostic, Multimodal Deep Learning | Unverified | 0 |
| DualNet: Domain-Invariant Network for Visual Question Answering | Jun 20, 2016 | Question Answering, Visual Question Answering | Unverified | 0 |
| Bridging the Semantic Gaps: Improving Medical VQA Consistency with LLM-Augmented Question Sets | Apr 16, 2025 | Diversity, Medical Visual Question Answering | Unverified | 0 |
| Dual Capsule Attention Mask Network with Mutual Learning for Visual Question Answering | Oct 1, 2022 | Question Answering, Visual Question Answering | Unverified | 0 |
| Bridge Damage Cause Estimation Using Multiple Images Based on Visual Question Answering | Feb 18, 2023 | Question Answering, Visual Question Answering | Unverified | 0 |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | Apr 25, 2024 | 4k, Language Modeling | Unverified | 0 |
| Breaking Neural Network Scaling Laws with Modularity | Sep 9, 2024 | Question Answering, Visual Question Answering | Unverified | 0 |
| DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback | Nov 29, 2023 | Image Generation, Question Answering | Unverified | 0 |
| Breaking Down Questions for Outside-Knowledge Visual Question Answering | Nov 16, 2021 | Graph Neural Network, Question Answering | Unverified | 0 |
| Answer-Type Prediction for Visual Question Answering | Jun 1, 2016 | Object Recognition, Prediction | Unverified | 0 |
| How good are deep models in understanding the generated images? | Aug 23, 2022 | Object, Object Recognition | Unverified | 0 |
| How to Design Sample and Computationally Efficient VQA Models | Mar 22, 2021 | Question Answering, Visual Question Answering | Unverified | 0 |
| Breaking Down Questions for Outside-Knowledge VQA | Sep 29, 2021 | Graph Neural Network, Question Answering | Unverified | 0 |
| Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness | Jan 16, 2025 | Adversarial Defense, Adversarial Robustness | Unverified | 0 |
| Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | Mar 13, 2023 | Common Sense Reasoning, Explanation Generation | Unverified | 0 |
| Adversarial Multimodal Network for Movie Question Answering | Jun 24, 2019 | Question Answering, Video Question Answering | Unverified | 0 |
| Domain-robust VQA with diverse datasets and methods but no target labels | Mar 29, 2021 | Domain Adaptation, Object Recognition | Unverified | 0 |
| Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion | Apr 4, 2025 | Diagnostic, Medical Visual Question Answering | Unverified | 0 |
| Domain Adaptation of VLM for Soccer Video Understanding | May 20, 2025 | Action Classification, Domain Adaptation | Unverified | 0 |