| A Survey on Efficient Vision-Language Models | Apr 13, 2025 | Image Captioning, Question Answering | Code Available | 1 |
| IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages | Jan 27, 2022 | Cross-Modal Retrieval, Few-Shot Learning | Code Available | 1 |
| IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents | Dec 10, 2024 | Cross-Modal Retrieval, Image Classification | Code Available | 1 |
| COBRA: Contrastive Bi-Modal Representation Algorithm | May 7, 2020 | Cross-Modal Retrieval, Image Captioning | Code Available | 1 |
| CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery | Jul 11, 2023 | Question Answering, Scene Understanding | Code Available | 1 |
| HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Mar 20, 2024 | MME, Visual Question Answering | Code Available | 1 |
| How to Configure Good In-Context Sequence for Visual Question Answering | Dec 4, 2023 | In-Context Learning, Question Answering | Code Available | 1 |
| A Survey of Medical Vision-and-Language Applications and Their Techniques | Nov 19, 2024 | Decision Making, Diagnostic | Code Available | 1 |
| Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering | Apr 22, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| I2I: Initializing Adapters with Improvised Knowledge | Apr 4, 2023 | Continual Learning, Question Answering | Code Available | 1 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Jun 15, 2022 | Described Object Detection, Image Captioning | Code Available | 1 |
| Coarse-to-Fine Reasoning for Visual Question Answering | Oct 6, 2021 | Question Answering, Visual Question Answering | Code Available | 1 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning, Question Answering | Code Available | 1 |
| Hierarchical multimodal transformers for Multi-Page DocVQA | Dec 7, 2022 | Decoder, Question Answering | Code Available | 1 |
| Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment | Feb 21, 2024 | Language Modelling, Question Answering | Code Available | 1 |
| ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models | Oct 7, 2024 | Question Answering, Visual Question Answering | Code Available | 1 |
| Hierarchical Question-Image Co-Attention for Visual Question Answering | May 31, 2016 | Visual Dialog, Visual Question Answering | Code Available | 1 |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | Jul 13, 2021 | Question Answering, Vision and Language Navigation | Code Available | 1 |
| I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision | Nov 17, 2022 | Image Captioning, Question Answering | Code Available | 1 |
| A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering | Apr 26, 2023 | Decoder, Knowledge Distillation | Code Available | 1 |
| Improving Selective Visual Question Answering by Learning from Your Peers | Jun 14, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning | Jun 11, 2020 | Question Answering, Reinforcement Learning (RL) | Code Available | 1 |
| HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles | Dec 18, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| Attention-Based Context Aware Reasoning for Situation Recognition | Jun 1, 2020 | Action Recognition, Fine-grained Action Recognition | Code Available | 1 |
| GRIT: General Robust Image Task Benchmark | Apr 28, 2022 | Instance Segmentation, Keypoint Detection | Code Available | 1 |