| A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA | Jun 30, 2022 | Question Answering, Retrieval | Code Available | 1 |
| Cross-modal Information Flow in Multimodal Large Language Models | Nov 27, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding | Dec 14, 2020 | Question Answering, Visual Question Answering | Code Available | 1 |
| Language-Informed Visual Concept Learning | Dec 6, 2023 | Disentanglement, Novel Concepts | Code Available | 1 |
| Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering | Jun 16, 2023 | Image Captioning, Question Answering | Code Available | 1 |
| UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling | Nov 23, 2021 | Image Captioning, Image Description | Code Available | 1 |
| JDocQA: Japanese Document Question Answering Dataset for Generative Language Models | Mar 28, 2024 | Hallucination, Question Answering | Code Available | 1 |
| Instruction-Guided Visual Masking | May 30, 2024 | Instruction Following, Visual Grounding | Code Available | 1 |
| Cross-modal Retrieval for Knowledge-based Visual Question Answering | Jan 11, 2024 | Cross-Modal Retrieval, Question Answering | Code Available | 1 |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Dec 21, 2023 | Image Retrieval, Image-to-Text Retrieval | Code Available | 1 |
| Attention in Reasoning: Dataset, Analysis, and Modeling | Apr 20, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| M3-Jepa: Multimodal Alignment via Multi-directional MoE based on the JEPA framework | Sep 9, 2024 | Computational Efficiency, Cross-Modal Retrieval | Code Available | 1 |
| Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering | Jul 26, 2022 | Causal Inference, Question Answering | Code Available | 1 |
| Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning | May 29, 2025 | Diagnostic, Question Answering | Code Available | 1 |
| Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA | Oct 10, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| InfMLLM: A Unified Framework for Visual-Language Tasks | Nov 12, 2023 | GPU, Image Captioning | Code Available | 1 |
| Combo of Thinking and Observing for Outside-Knowledge VQA | May 10, 2023 | Decoder, Question Answering | Code Available | 1 |
| INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model | Jul 23, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Improving Selective Visual Question Answering by Learning from Your Peers | Jun 14, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents | Dec 10, 2024 | Cross-Modal Retrieval, Image Classification | Code Available | 1 |
| In Defense of Grid Features for Visual Question Answering | Jan 10, 2020 | Image Captioning, Question Answering | Code Available | 1 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Dec 4, 2024 | Multimodal Large Language Model, Video Understanding | Code Available | 1 |
| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal Retrieval, Decision Making | Code Available | 1 |
| IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages | Jan 27, 2022 | Cross-Modal Retrieval, Few-Shot Learning | Code Available | 1 |
| Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models | Dec 15, 2023 | Image Captioning, In-Context Learning | Code Available | 1 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | May 4, 2022 | Action Classification, Decoder | Code Available | 1 |
| A Survey on Efficient Vision-Language Models | Apr 13, 2025 | Image Captioning, Question Answering | Code Available | 1 |
| IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | Oct 25, 2021 | Arithmetic Reasoning, Mathematical Question Answering | Code Available | 1 |
| IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Mar 23, 2024 | Common Sense Reasoning, In-Context Learning | Code Available | 1 |
| COBRA: Contrastive Bi-Modal Representation Algorithm | May 7, 2020 | Cross-Modal Retrieval, Image Captioning | Code Available | 1 |
| CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery | Jul 11, 2023 | Question Answering, Scene Understanding | Code Available | 1 |
| Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering | Apr 22, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | Jul 13, 2021 | Question Answering, Vision and Language Navigation | Code Available | 1 |
| A Survey of Medical Vision-and-Language Applications and Their Techniques | Nov 19, 2024 | Decision Making, Diagnostic | Code Available | 1 |
| How to Configure Good In-Context Sequence for Visual Question Answering | Dec 4, 2023 | In-Context Learning, Question Answering | Code Available | 1 |
| HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Mar 20, 2024 | MME, Visual Question Answering | Code Available | 1 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Jun 15, 2022 | Described Object Detection, Image Captioning | Code Available | 1 |
| Coarse-to-Fine Reasoning for Visual Question Answering | Oct 6, 2021 | Question Answering, Visual Question Answering | Code Available | 1 |
| Hierarchical Question-Image Co-Attention for Visual Question Answering | May 31, 2016 | Visual Dialog, Visual Question Answering | Code Available | 1 |
| Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment | Feb 21, 2024 | Language Modelling, Question Answering | Code Available | 1 |
| Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models | May 24, 2023 | document understanding, Image Captioning | Code Available | 1 |
| ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models | Oct 7, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| Hierarchical multimodal transformers for Multi-Page DocVQA | Dec 7, 2022 | Decoder, Question Answering | Code Available | 1 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning, Question Answering | Code Available | 1 |
| A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering | Apr 26, 2023 | Decoder, Knowledge Distillation | Code Available | 1 |
| I2I: Initializing Adapters with Improvised Knowledge | Apr 4, 2023 | Continual Learning, Question Answering | Code Available | 1 |
| Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning | Jun 11, 2020 | Question Answering, Reinforcement Learning (RL) | Code Available | 1 |
| HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles | Dec 18, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| Attention-Based Context Aware Reasoning for Situation Recognition | Jun 1, 2020 | Action Recognition, Fine-grained Action Recognition | Code Available | 1 |
| GRIT: General Robust Image Task Benchmark | Apr 28, 2022 | Instance Segmentation, Keypoint Detection | Code Available | 1 |