| PromptCap: Prompt-Guided Task-Aware Image Captioning | Nov 15, 2022 | Image Captioning, Language Modelling | Code Available | 1 |
| Visual Named Entity Linking: A New Dataset and A Baseline | Nov 9, 2022 | Entity Linking, Image Retrieval | Code Available | 1 |
| VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge | Oct 24, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting | Oct 13, 2022 | Image Captioning, Question Answering | Code Available | 1 |
| SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models | Oct 12, 2022 | Object, Question Answering | Code Available | 1 |
| ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | Oct 12, 2022 | Document Image Classification | Code Available | 1 |
| MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Oct 11, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 |
| Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning | Oct 10, 2022 | Contrastive Learning, Question Answering | Code Available | 1 |
| Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA | Oct 10, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| Linearly Mapping from Image to Text Space | Sep 30, 2022 | Image Captioning, Image to text | Code Available | 1 |
| TVLT: Textless Vision-Language Transformer | Sep 28, 2022 | Automatic Speech Recognition (ASR), Image Retrieval | Code Available | 1 |
| Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline | Sep 24, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| MaXM: Towards Multilingual Visual Question Answering | Sep 12, 2022 | Question Answering, Translation | Code Available | 1 |
| Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task | Aug 24, 2022 | Continual Learning, Question Answering | Code Available | 1 |
| CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning | Aug 10, 2022 | Math, Mathematical Reasoning | Code Available | 1 |
| ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding | Aug 5, 2022 | Image Retrieval, Question Answering | Code Available | 1 |
| Generative Bias for Robust Visual Question Answering | Aug 1, 2022 | Knowledge Distillation, Question Answering | Code Available | 1 |
| Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering | Jul 26, 2022 | Causal Inference, Question Answering | Code Available | 1 |
| LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | Jul 26, 2022 | Decoder, Knowledge Graphs | Code Available | 1 |
| Rethinking Data Augmentation for Robust Visual Question Answering | Jul 18, 2022 | Data Augmentation, Knowledge Distillation | Code Available | 1 |
| ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities | Jul 11, 2022 | Articles, Few-Shot Learning | Code Available | 1 |
| Weakly Supervised Grounding for VQA in Vision-Language Transformers | Jul 5, 2022 | Question Answering, Representation Learning | Code Available | 1 |
| A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA | Jun 30, 2022 | Question Answering, Retrieval | Code Available | 1 |
| Consistency-preserving Visual Question Answering in Medical Imaging | Jun 27, 2022 | Question Answering, Visual Question Answering | Code Available | 1 |
| Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer | Jun 22, 2022 | Question Answering, Sentence | Code Available | 1 |