| GRIT: General Robust Image Task Benchmark | Apr 28, 2022 | Instance Segmentation, Keypoint Detection | Code Available | 1 |
| Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks | Apr 22, 2022 | Question Answering, Visual Commonsense Reasoning | Unverified | 0 |
| Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering | Apr 22, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| Attention in Reasoning: Dataset, Analysis, and Modeling | Apr 20, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | Apr 18, 2022 | cross-modal alignment, Document AI | Code Available | 0 |
| Attention Mechanism based Cognition-level Scene Understanding | Apr 17, 2022 | Question Answering, Scene Understanding | Unverified | 0 |
| Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning | Apr 15, 2022 | Contrastive Learning, Question Answering | Unverified | 0 |
| CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Apr 5, 2022 | Explanation Generation, Question Answering | Code Available | 1 |
| SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering | Apr 5, 2022 | Data Augmentation, Question Answering | Code Available | 1 |
| Question-Driven Graph Fusion Network For Visual Question Answering | Apr 3, 2022 | Graph Attention, Object | Unverified | 0 |
| Co-VQA : Answering by Interactive Sub Question Sequence | Apr 2, 2022 | Question Answering, Visual Question Answering | Unverified | 0 |
| SimVQA: Exploring Simulated Environments for Visual Question Answering | Mar 31, 2022 | Data Augmentation, Diversity | Unverified | 0 |
| VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers | Mar 30, 2022 | Question Answering, Visual Commonsense Reasoning | Code Available | 0 |
| Single-Stream Multi-Level Alignment for Vision-Language Pretraining | Mar 27, 2022 | Image-text Retrieval, Question Answering | Code Available | 0 |
| Learning to Answer Questions in Dynamic Audio-Visual Scenarios | Mar 26, 2022 | audio-visual learning, Audio-visual Question Answering | Code Available | 1 |
| A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration | Mar 25, 2022 | image-classification, Image Classification | Code Available | 1 |
| Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering | Mar 24, 2022 | GPU, Question Answering | Code Available | 0 |
| Towards Escaping from Language Bias and OCR Error: Semantics-Centered Text Visual Question Answering | Mar 24, 2022 | Optical Character Recognition, Optical Character Recognition (OCR) | Unverified | 0 |
| MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering | Mar 17, 2022 | Implicit Relations, Question Answering | Code Available | 1 |
| Can you even tell left from right? Presenting a new challenge for VQA | Mar 15, 2022 | Question Answering, Visual Question Answering | Unverified | 0 |
| CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment | Mar 14, 2022 | parameter-efficient fine-tuning, Question Answering | Unverified | 0 |
| Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | Mar 12, 2022 | Image Captioning, Knowledge Distillation | Unverified | 0 |
| Barlow constrained optimization for Visual Question Answering | Mar 7, 2022 | Question Answering, Visual Question Answering | Code Available | 0 |
| Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for Knowledge-based Visual Question Answering | Mar 6, 2022 | Graph Attention, Question Answering | Code Available | 0 |
| Modeling Coreference Relations in Visual Dialog | Mar 6, 2022 | Question Answering, Visual Dialog | Unverified | 0 |