| TRAR: Routing the Attention Spans in Transformer for Visual Question Answering | Jan 1, 2021 | Question Answering, Referring Expression | Code Available | 1 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 |
| Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation | Mar 19, 2020 | Generalized Referring Expression Comprehension, Referring Expression | Code Available | 1 |
| UNITER: UNiversal Image-TExt Representation Learning | Sep 25, 2019 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| Talk2Car: Taking Control of Your Self-Driving Car | Sep 24, 2019 | Autonomous Driving, Object | Code Available | 1 |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Aug 22, 2019 | Image-text matching, Language Modelling | Code Available | 1 |
| A Fast and Accurate One-Stage Approach to Visual Grounding | Aug 18, 2019 | Referring Expression, Referring Expression Comprehension | Code Available | 1 |
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Aug 6, 2019 | Image Retrieval, Question Answering | Code Available | 1 |
| Explainable Neural Computation via Stack Neural Module Networks | Jul 23, 2018 | Decision Making, Question Answering | Code Available | 1 |
| Compositional Attention Networks for Machine Reasoning | Mar 8, 2018 | Referring Expression Comprehension, Visual Question Answering (VQA) | Code Available | 1 |