| Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation | Dec 10, 2021 | Image-text matching, Image-text Retrieval | Unverified | 0 |
| MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering | Dec 6, 2021 | Language Modelling, Question Answering | Unverified | 0 |
| eaVQA: An Experimental Analysis on Visual Question Answering Models | Dec 1, 2021 | Question Answering, Visual Question Answering | Unverified | 0 |
| Curriculum Learning Effectively Improves Low Data VQA | Dec 1, 2021 | Question Answering, Visual Question Answering | Unverified | 0 |
| Debiased Visual Question Answering from Feature and Sample Perspectives | Dec 1, 2021 | Bias Detection, Question Answering | Code Available | 1 |
| Scallop: From Probabilistic Deductive Databases to Scalable Differentiable Reasoning | Dec 1, 2021 | Logical Reasoning, Question Answering | Unverified | 0 |
| Robust Visual Reasoning via Language Guided Neural Module Networks | Dec 1, 2021 | Question Answering, Referring Expression | Unverified | 0 |
| Searching the Search Space of Vision Transformer | Nov 29, 2021 | Neural Architecture Search, object-detection | Code Available | 1 |
| Scene Graph Generation with Geometric Context | Nov 25, 2021 | Activity Recognition, Graph Generation | Unverified | 0 |
| UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling | Nov 23, 2021 | Image Captioning, Image Description | Code Available | 1 |
| Florence: A New Foundation Model for Computer Vision | Nov 22, 2021 | Action Classification, Action Recognition | Code Available | 1 |
| Many Heads but One Brain: Fusion Brain -- a Competition and a Single Multimodal Multitask Architecture | Nov 22, 2021 | Handwritten Text Recognition, object-detection | Code Available | 1 |
| A Confidence-Based Interface for Neuro-Symbolic Visual Question Answering | Nov 21, 2021 | Question Answering, Translation | Unverified | 0 |
| UFO: A UniFied TransfOrmer for Vision-Language Representation Learning | Nov 19, 2021 | Image Captioning, Image-text matching | Unverified | 0 |
| Medical Visual Question Answering: A Survey | Nov 19, 2021 | Medical Visual Question Answering, Question Answering | Unverified | 0 |
| Achieving Human Parity on Visual Question Answering | Nov 17, 2021 | Question Answering, Visual Question Answering | Unverified | 0 |
| Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | Nov 16, 2021 | Image Captioning, Knowledge Distillation | Unverified | 0 |
| Uncertainty-based Visual Question Answering: Estimating Semantic Inconsistency between Image and Knowledge Base | Nov 16, 2021 | Question Answering, Semantic Similarity | Unverified | 0 |
| Co-VQA: Answering by Interactive Sub Question Sequence | Nov 16, 2021 | Question Answering, Visual Question Answering | Unverified | 0 |
| Breaking Down Questions for Outside-Knowledge Visual Question Answering | Nov 16, 2021 | Graph Neural Network, Question Answering | Unverified | 0 |
| Question-Led Semantic Structure Enhanced Attentions for VQA | Nov 16, 2021 | Question Answering, Visual Question Answering | Unverified | 0 |
| ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities | Nov 16, 2021 | Articles, Face Recognition | Code Available | 0 |
| Document AI: Benchmarks, Models and Applications | Nov 16, 2021 | Deep Learning, Document AI | Unverified | 0 |
| Language bias in Visual Question Answering: A Survey and Taxonomy | Nov 16, 2021 | Question Answering, Visual Question Answering | Unverified | 0 |
| Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture | Nov 11, 2021 | Graph Attention, Question Answering | Unverified | 0 |
| Visual Question Answering based on Formal Logic | Nov 8, 2021 | Formal Logic, Question Answering | Unverified | 0 |
| ViVQA: Vietnamese Visual Question Answering | Nov 1, 2021 | Question Answering, Vietnamese Visual Question Answering | Code Available | 1 |
| CrossVQA: Scalably Generating Benchmarks for Systematically Testing VQA Generalization | Nov 1, 2021 | Answer Generation, Question-Answer-Generation | Unverified | 0 |
| Diversity and Consistency: Exploring Visual Question-Answer Pair Generation | Nov 1, 2021 | Diversity, Question Answering | Unverified | 0 |
| MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering | Nov 1, 2021 | multimodal interaction, Multiple-choice | Code Available | 0 |
| Perceptual Score: What Data Modalities Does Your Model Perceive? | Oct 27, 2021 | Question Answering, Visual Dialog | Code Available | 0 |
| Alignment Attention by Matching Key and Query Distributions | Oct 25, 2021 | Graph Attention, Question Answering | Code Available | 0 |
| IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | Oct 25, 2021 | Arithmetic Reasoning, Mathematical Question Answering | Code Available | 1 |
| Single-Modal Entropy based Active Learning for Visual Question Answering | Oct 21, 2021 | Active Learning, Question Answering | Unverified | 0 |
| Robustness through Data Augmentation Loss Consistency | Oct 21, 2021 | Multi-domain Dialogue State Tracking, Visual Question Answering | Code Available | 0 |
| Label-Descriptive Patterns and Their Application to Characterizing Classification Errors | Oct 18, 2021 | Descriptive, named-entity-recognition | Code Available | 1 |
| Towards Language-guided Visual Recognition via Dynamic Convolutions | Oct 17, 2021 | Question Answering, Referring Expression | Code Available | 0 |
| xGQA: Cross-Lingual Visual Question Answering | Oct 16, 2021 | Cross-Lingual Transfer, Language Modeling | Unverified | 0 |
| MMIU: Dataset for Visual Intent Understanding in Multimodal Assistants | Oct 13, 2021 | intent-classification, Intent Classification | Unverified | 0 |
| Improving Users' Mental Model with Attention-directed Counterfactual Edits | Oct 13, 2021 | counterfactual, Question Answering | Unverified | 0 |
| Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos | Oct 11, 2021 | Audio-visual Question Answering, Question Answering | Code Available | 1 |
| Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking | Oct 11, 2021 | Benchmarking, Question Answering | Code Available | 0 |
| Coarse-to-Fine Reasoning for Visual Question Answering | Oct 6, 2021 | Question Answering, Visual Question Answering | Code Available | 1 |
| Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering | Oct 3, 2021 | counterfactual, Diagnostic | Code Available | 1 |
| Asking questions on handwritten document collections | Oct 2, 2021 | Optical Character Recognition (OCR), Question Answering | Unverified | 0 |
| The Spoon Is in the Sink: Assisting Visually Impaired People in the Kitchen | Oct 1, 2021 | Question Answering, Visual Question Answering | Code Available | 1 |
| Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images | Oct 1, 2021 | Question Answering, Visual Question Answering | Code Available | 1 |
| Breaking Down Questions for Outside-Knowledge VQA | Sep 29, 2021 | Graph Neural Network, Question Answering | Unverified | 0 |
| Variational Disentangled Attention for Regularized Visual Dialog | Sep 29, 2021 | Question Answering, Visual Dialog | Unverified | 0 |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | Sep 29, 2021 | Question Answering, Visual Entailment | Unverified | 0 |