| IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages | Jan 27, 2022 | Cross-Modal Retrieval, Few-Shot Learning | Code Available | 1 | 5 |
| EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Dec 19, 2023 | Object, Object Counting | Code Available | 1 | 5 |
| Hierarchical multimodal transformers for Multi-Page DocVQA | Dec 7, 2022 | Decoder, Question Answering | Code Available | 1 | 5 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning, Question Answering | Code Available | 1 | 5 |
| Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering | Sep 19, 2024 | Hallucination, Hallucination Evaluation | Code Available | 1 | 5 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute, Benchmarking | Code Available | 1 | 5 |
| Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications | Feb 1, 2023 | Question Answering, Representation Learning | Code Available | 1 | 5 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | May 31, 2022 | Common Sense Reasoning, Graph Generation | Code Available | 1 | 5 |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | Jul 13, 2021 | Question Answering, Vision and Language Navigation | Code Available | 1 | 5 |
| CaMML: Context-Aware Multimodal Learner for Large Models | Jan 6, 2024 | Visual Question Answering | Code Available | 1 | 5 |
| Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering | Apr 22, 2022 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models | Feb 16, 2024 | Diversity, Instruction Following | Code Available | 1 | 5 |
| HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Mar 20, 2024 | MME, Visual Question Answering | Code Available | 1 | 5 |
| Change Detection Meets Visual Question Answering | Dec 12, 2021 | Answer Generation, Change Detection | Code Available | 1 | 5 |
| I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision | Nov 17, 2022 | Image Captioning, Question Answering | Code Available | 1 | 5 |
| Multiple Meta-model Quantifying for Medical Visual Question Answering | May 19, 2021 | Medical Visual Question Answering, Meta-Learning | Code Available | 1 | 5 |
| A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge | Jun 3, 2022 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Oct 11, 2024 | MME, Question Answering | Code Available | 1 | 5 |
| Multimodal Federated Learning via Contrastive Representation Ensemble | Feb 17, 2023 | Federated Learning, Image-text Retrieval | Code Available | 1 | 5 |
| Explaining Autonomous Driving Actions with Visual Question Answering | Jul 19, 2023 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 | 5 |
| Probing Image-Language Transformers for Verb Understanding | Jun 16, 2021 | Image Retrieval, Question Answering | Code Available | 1 | 5 |
| ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model | Feb 20, 2025 | Mixture-of-Experts, Question Answering | Code Available | 1 | 5 |
| Progressive Compositionality In Text-to-Image Generative Models | Oct 22, 2024 | Attribute, Contrastive Learning | Code Available | 1 | 5 |
| Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images | Oct 1, 2021 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering | Jun 16, 2023 | Image Captioning, Question Answering | Code Available | 1 | 5 |