| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | May 31, 2022 | Common Sense Reasoning, Graph Generation | Code Available | 1 | 5 |
| Localized Questions in Medical Visual Question Answering | Jul 3, 2023 | Medical Visual Question Answering, Question Answering | Code Available | 1 | 5 |
| Learning Situation Hyper-Graphs for Video Question Answering | Apr 18, 2023 | Decoder, Question Answering | Code Available | 1 | 5 |
| Learning Trimodal Relation for AVQA with Missing Modality | Jul 23, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | Code Available | 1 | 5 |
| Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | Mar 27, 2024 | Representation Learning, Visual Question Answering | Code Available | 1 | 5 |
| BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs | Mar 2, 2023 | Articles, Medical Visual Question Answering | Code Available | 1 | 5 |
| LaPA: Latent Prompt Assist Model For Medical Visual Question Answering | Apr 19, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 1 | 5 |
| EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images | Oct 28, 2023 | Decision Making, Medical Visual Question Answering | Code Available | 1 | 5 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 | 5 |
| Large Scale Multimodal Classification Using an Ensemble of Transformer Models and Co-Attention | Nov 23, 2020 | Classification, General Classification | Code Available | 1 | 5 |
| Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA | Oct 10, 2022 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment | Feb 2, 2023 | Attribute, Few-Shot Image Classification | Code Available | 1 | 5 |
| Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Oct 11, 2024 | MME, Question Answering | Code Available | 1 | 5 |
| Advancing High Resolution Vision-Language Models in Biomedicine | Jun 12, 2024 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Dec 19, 2023 | Object, Object Counting | Code Available | 1 | 5 |
| Language Repository for Long Video Understanding | Mar 21, 2024 | EgoSchema, Question Answering | Code Available | 1 | 5 |
| Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline | Dec 5, 2019 | Language Modelling, Representation Learning | Code Available | 1 | 5 |
| Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | May 22, 2025 | Benchmarking, Evidence Selection | Code Available | 1 | 5 |
| LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | Jul 26, 2022 | Decoder, Knowledge Graphs | Code Available | 1 | 5 |
| ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning | Oct 23, 2024 | Image Captioning, Instruction Following | Code Available | 1 | 5 |
| Dynamic Language Binding in Relational Visual Reasoning | Apr 30, 2020 | Object, Question Answering | Code Available | 1 | 5 |
| Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts | Oct 31, 2023 | Image Captioning, Language Modeling | Code Available | 1 | 5 |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Dec 5, 2023 | Benchmarking, Visual Question Answering | Code Available | 1 | 5 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Jun 26, 2023 | Image Captioning, In-Context Learning | Code Available | 1 | 5 |
| Bayesian Attention Modules | Oct 20, 2020 | Image Captioning, Machine Translation | Code Available | 1 | 5 |
| Label-Descriptive Patterns and Their Application to Characterizing Classification Errors | Oct 18, 2021 | Descriptive, named-entity-recognition | Code Available | 1 | 5 |
| Language-Informed Visual Concept Learning | Dec 6, 2023 | Disentanglement, Novel Concepts | Code Available | 1 | 5 |
| LaTr: Layout-Aware Transformer for Scene-Text VQA | Dec 23, 2021 | Optical Character Recognition (OCR), Question Answering | Code Available | 1 | 5 |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | Sep 10, 2021 | Image Captioning, Question Answering | Code Available | 1 | 5 |
| Just Ask: Learning to Answer Questions from Millions of Narrated Videos | Dec 1, 2020 | Question Answering, Question Generation | Code Available | 1 | 5 |
| JDocQA: Japanese Document Question Answering Dataset for Generative Language Models | Mar 28, 2024 | Hallucination, Question Answering | Code Available | 1 | 5 |
| Dual-Key Multimodal Backdoors for Visual Question Answering | Dec 14, 2021 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Apr 24, 2025 | Benchmarking, Math | Code Available | 1 | 5 |
| Visual Grounding Methods for VQA are Working for the Wrong Reasons! | Apr 12, 2020 | Question Answering, Visual Grounding | Code Available | 1 | 5 |
| BackdoorMBTI: A Backdoor Learning Multimodal Benchmark Tool Kit for Backdoor Defense Evaluation | Nov 17, 2024 | Action Recognition, backdoor defense | Code Available | 1 | 5 |
| A Dataset and Baselines for Visual Question Answering on Art | Aug 28, 2020 | Question Answering, Question Generation | Code Available | 1 | 5 |
| Don't Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases | Sep 9, 2019 | Natural Language Inference, Question Answering | Code Available | 1 | 5 |
| An Approach to Solving the Abstraction and Reasoning Corpus (ARC) Challenge | Jun 6, 2023 | ARC, Question Answering | Code Available | 1 | 5 |
| Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning | May 29, 2025 | Diagnostic, Question Answering | Code Available | 1 | 5 |
| Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering | Jun 16, 2023 | Image Captioning, Question Answering | Code Available | 1 | 5 |
| InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 | Aug 23, 2023 | Instruction Following, Question Answering | Code Available | 1 | 5 |
| Does Vision-and-Language Pretraining Improve Lexical Grounding? | Sep 21, 2021 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| Instruction-Guided Visual Masking | May 30, 2024 | Instruction Following, Visual Grounding | Code Available | 1 | 5 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Dec 21, 2023 | Image Retrieval, Image-to-Text Retrieval | Code Available | 1 | 5 |
| Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding | Dec 14, 2020 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| LIVE: Learnable In-Context Vector for Visual Question Answering | Jun 19, 2024 | In-Context Learning, Question Answering | Code Available | 1 | 5 |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Aug 20, 2019 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Disentangling 3D Prototypical Networks For Few-Shot Concept Learning | Nov 6, 2020 | 3D geometry, 3D Object Detection | Code Available | 1 | 5 |
| In Defense of Grid Features for Visual Question Answering | Jan 10, 2020 | Image Captioning, Question Answering | Code Available | 1 | 5 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Jan 6, 2025 | Language Model Evaluation, Language Modeling | Code Available | 1 | 5 |