| Attention Mechanism based Cognition-level Scene Understanding | Apr 17, 2022 | Question Answering, Scene Understanding | Unverified | 0 |
| VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers | Mar 30, 2022 | Question Answering, Visual Commonsense Reasoning | Code Available | 0 |
| All in One: Exploring Unified Video-Language Pre-training | Mar 14, 2022 | All, Language Modelling | Code Available | 2 |
| Joint Answering and Explanation for Visual Commonsense Reasoning | Feb 25, 2022 | Knowledge Distillation, Question Answering | Code Available | 0 |
| CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks | Jan 15, 2022 | Question Answering, Visual Commonsense Reasoning | Unverified | 0 |
| MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | Jan 7, 2022 | Action Classification, Navigate | Unverified | 0 |
| SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning | Dec 16, 2021 | Visual Commonsense Reasoning | Unverified | 0 |
| Towards artificial general intelligence via a multimodal foundation model | Oct 27, 2021 | Image Classification, Reading Comprehension | Code Available | 1 |
| Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning | Sep 14, 2021 | Cultural Vocal Bursts Intensity Prediction, Visual Commonsense Reasoning | Code Available | 1 |
| X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics | Aug 18, 2021 | Cross-Modal Retrieval, Decoder | Code Available | 1 |