| How Much Can CLIP Benefit Vision-and-Language Tasks? | Jul 13, 2021 | Question Answering, Vision and Language Navigation | Code Available | 1 |
| Maintaining Reasoning Consistency in Compositional Visual Question Answering | Jan 1, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| Making Large Language Models Better Data Creators | Oct 31, 2023 | Instruction Following, Prompt Engineering | Code Available | 1 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute, Benchmarking | Code Available | 1 |
| Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering | Sep 19, 2024 | Hallucination, Hallucination Evaluation | Code Available | 1 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning, Question Answering | Code Available | 1 |
| How to Configure Good In-Context Sequence for Visual Question Answering | Dec 4, 2023 | In-Context Learning, Question Answering | Code Available | 1 |
| Hierarchical multimodal transformers for Multi-Page DocVQA | Dec 7, 2022 | Decoder, Question Answering | Code Available | 1 |
| Faithful Multimodal Explanation for Visual Question Answering | Sep 8, 2018 | Explanatory Visual Question Answering, Question Answering | Code Available | 1 |
| HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Jul 22, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| Hierarchical Question-Image Co-Attention for Visual Question Answering | May 31, 2016 | Visual Dialog, Visual Question Answering | Code Available | 1 |
| Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering | Apr 22, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Dec 19, 2023 | Object, Object Counting | Code Available | 1 |
| HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles | Dec 18, 2023 | Question Answering, Visual Question Answering | Code Available | 1 |
| GRIT: General Robust Image Task Benchmark | Apr 28, 2022 | Instance Segmentation, Keypoint Detection | Code Available | 1 |
| CaMML: Context-Aware Multimodal Learner for Large Models | Jan 6, 2024 | Visual Question Answering | Code Available | 1 |
| Hallucination Augmented Contrastive Learning for Multimodal Large Language Model | Dec 12, 2023 | Contrastive Learning, Hallucination | Code Available | 1 |
| HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models | Mar 20, 2024 | MME, Visual Question Answering | Code Available | 1 |
| A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge | Jun 3, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping | Oct 11, 2024 | MME, Question Answering | Code Available | 1 |
| Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering | Jul 13, 2021 | Navigate, Question Answering | Code Available | 1 |
| Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images | Oct 1, 2021 | Question Answering, Visual Question Answering | Code Available | 1 |
| MemeCap: A Dataset for Captioning and Interpreting Memes | May 23, 2023 | Image Captioning, Meme Captioning | Code Available | 1 |
| GraghVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering | Apr 20, 2021 | Graph Neural Network, Graph Question Answering | Code Available | 1 |
| Graph Optimal Transport for Cross-Domain Alignment | Jun 26, 2020 | Graph Matching, Image Captioning | Code Available | 1 |