| MapQA: A Dataset for Question Answering on Choropleth Maps | Nov 15, 2022 | Articles, Question Answering | Code Available | 1 |
| Visual Named Entity Linking: A New Dataset and A Baseline | Nov 9, 2022 | Entity Linking, Image Retrieval | Code Available | 1 |
| VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge | Oct 24, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting | Oct 13, 2022 | Image Captioning, Question Answering | Code Available | 1 |
| SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models | Oct 12, 2022 | Object, Question Answering | Code Available | 1 |
| ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | Oct 12, 2022 | document-image-classification, Document Image Classification | Code Available | 1 |
| MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Oct 11, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 |
| Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA | Oct 10, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning | Oct 10, 2022 | Contrastive Learning, Question Answering | Code Available | 1 |
| Linearly Mapping from Image to Text Space | Sep 30, 2022 | Image Captioning, Image to text | Code Available | 1 |
| TVLT: Textless Vision-Language Transformer | Sep 28, 2022 | Automatic Speech Recognition (ASR), Image Retrieval | Code Available | 1 |
| Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline | Sep 24, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| MaXM: Towards Multilingual Visual Question Answering | Sep 12, 2022 | Question Answering, Translation | Code Available | 1 |
| Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task | Aug 24, 2022 | Continual Learning, Question Answering | Code Available | 1 |
| CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning | Aug 10, 2022 | Math, Mathematical Reasoning | Code Available | 1 |
| ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding | Aug 5, 2022 | Image Retrieval, Question Answering | Code Available | 1 |
| Generative Bias for Robust Visual Question Answering | Aug 1, 2022 | Knowledge Distillation, Question Answering | Code Available | 1 |
| LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | Jul 26, 2022 | Decoder, Knowledge Graphs | Code Available | 1 |
| Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering | Jul 26, 2022 | Causal Inference, Question Answering | Code Available | 1 |
| Rethinking Data Augmentation for Robust Visual Question Answering | Jul 18, 2022 | Data Augmentation, Knowledge Distillation | Code Available | 1 |
| ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities | Jul 11, 2022 | Articles, Few-Shot Learning | Code Available | 1 |
| Weakly Supervised Grounding for VQA in Vision-Language Transformers | Jul 5, 2022 | Question Answering, Representation Learning | Code Available | 1 |
| A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA | Jun 30, 2022 | Question Answering, Retrieval | Code Available | 1 |
| Consistency-preserving Visual Question Answering in Medical Imaging | Jun 27, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer | Jun 22, 2022 | Question Answering, Sentence | Code Available | 1 |
| MixGen: A New Multi-Modal Data Augmentation | Jun 16, 2022 | Data Augmentation, Image-text Retrieval | Code Available | 1 |
| Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | Jun 16, 2022 | Fill Mask, Language Modeling | Code Available | 1 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Jun 15, 2022 | Described Object Detection, Image Captioning | Code Available | 1 |
| A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge | Jun 3, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | Jun 2, 2022 | Question Answering, Retrieval | Code Available | 1 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | May 31, 2022 | Common Sense Reasoning, Graph Generation | Code Available | 1 |
| mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | May 24, 2022 | Computational Efficiency, cross-modal alignment | Code Available | 1 |
| Learning to Answer Visual Questions from Web Videos | May 10, 2022 | Dataset Generation, Question Answering | Code Available | 1 |
| Declaration-based Prompt Tuning for Visual Question Answering | May 5, 2022 | Image-text matching, Language Modeling | Code Available | 1 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | May 4, 2022 | Action Classification, Decoder | Code Available | 1 |
| Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly | Apr 28, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| GRIT: General Robust Image Task Benchmark | Apr 28, 2022 | Instance Segmentation, Keypoint Detection | Code Available | 1 |
| Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering | Apr 22, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| Attention in Reasoning: Dataset, Analysis, and Modeling | Apr 20, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Apr 5, 2022 | Explanation Generation, Question Answering | Code Available | 1 |
| SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering | Apr 5, 2022 | Data Augmentation, Question Answering | Code Available | 1 |
| Learning to Answer Questions in Dynamic Audio-Visual Scenarios | Mar 26, 2022 | audio-visual learning, Audio-visual Question Answering | Code Available | 1 |
| A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration | Mar 25, 2022 | image-classification, Image Classification | Code Available | 1 |
| MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering | Mar 17, 2022 | Implicit Relations, Question Answering | Code Available | 1 |
| IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages | Jan 27, 2022 | Cross-Modal Retrieval, Few-Shot Learning | Code Available | 1 |
| Maintaining Reasoning Consistency in Compositional Visual Question Answering | Jan 1, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| LaTr: Layout-Aware Transformer for Scene-Text VQA | Dec 23, 2021 | Optical Character Recognition (OCR), Question Answering | Code Available | 1 |
| Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Dec 22, 2021 | Common Sense Reasoning, Question Answering | Code Available | 1 |
| Distilled Dual-Encoder Model for Vision-Language Understanding | Dec 16, 2021 | Image to text, model | Code Available | 1 |
| Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering | Dec 14, 2021 | Graph Matching, Question Answering | Code Available | 1 |