| Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering | Jun 16, 2023 | Image Captioning, Question Answering | Code Available | 1 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Jun 26, 2023 | Image Captioning, In-Context Learning | Code Available | 1 |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | Sep 10, 2021 | Image Captioning, Question Answering | Code Available | 1 |
| Improving Selective Visual Question Answering by Learning from Your Peers | Jun 14, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents | Dec 10, 2024 | Cross-Modal Retrieval, Image Classification | Code Available | 1 |
| IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Mar 23, 2024 | Common Sense Reasoning, In-Context Learning | Code Available | 1 |
| In Defense of Grid Features for Visual Question Answering | Jan 10, 2020 | Image Captioning, Question Answering | Code Available | 1 |
| IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | Oct 25, 2021 | Arithmetic Reasoning, Mathematical Question Answering | Code Available | 1 |
| Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Apr 24, 2025 | Benchmarking, Math | Code Available | 1 |
| IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages | Jan 27, 2022 | Cross-Modal Retrieval, Few-Shot Learning | Code Available | 1 |
| Visual Grounding Methods for VQA are Working for the Wrong Reasons! | Apr 12, 2020 | Question Answering, Visual Grounding | Code Available | 1 |
| Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models | Dec 15, 2023 | Image Captioning, In-Context Learning | Code Available | 1 |
| INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model | Jul 23, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| BackdoorMBTI: A Backdoor Learning Multimodal Benchmark Tool Kit for Backdoor Defense Evaluation | Nov 17, 2024 | Action Recognition, backdoor defense | Code Available | 1 |
| A Dataset and Baselines for Visual Question Answering on Art | Aug 28, 2020 | Question Answering, Question Generation | Code Available | 1 |
| HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Mar 20, 2024 | MME, Visual Question Answering | Code Available | 1 |
| CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery | Jul 11, 2023 | Question Answering, Scene Understanding | Code Available | 1 |
| An Approach to Solving the Abstraction and Reasoning Corpus (ARC) Challenge | Jun 6, 2023 | ARC, Question Answering | Code Available | 1 |
| COBRA: Contrastive Bi-Modal Representation Algorithm | May 7, 2020 | Cross-Modal Retrieval, Image Captioning | Code Available | 1 |
| I2I: Initializing Adapters with Improvised Knowledge | Apr 4, 2023 | Continual Learning, Question Answering | Code Available | 1 |
| Combo of Thinking and Observing for Outside-Knowledge VQA | May 10, 2023 | Decoder, Question Answering | Code Available | 1 |
| Contrast and Classify: Training Robust VQA Models | Oct 13, 2020 | Contrastive Learning, Data Augmentation | Code Available | 1 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | May 4, 2022 | Action Classification, Decoder | Code Available | 1 |
| Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering | Apr 22, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision | Nov 17, 2022 | Image Captioning, Question Answering | Code Available | 1 |