| ILLUME: Rationalizing Vision-Language Models through Human Interactions | Aug 17, 2022 | Image Captioning, Question Answering | Code Available | 0 |
| On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization | May 24, 2022 | Descriptive, Image Captioning | Unverified | 0 |
| PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | May 23, 2022 | Language Modelling | Code Available | 1 |
| Super-Prompting: Utilizing Model-Independent Contextual Data to Reduce Data Annotation Required in Visual Commonsense Tasks | Apr 25, 2022 | Few-Shot Learning, In-Context Learning | Unverified | 0 |
| Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks | Apr 22, 2022 | Question Answering, Visual Commonsense Reasoning | Unverified | 0 |
| Attention Mechanism based Cognition-level Scene Understanding | Apr 17, 2022 | Question Answering, Scene Understanding | Unverified | 0 |
| VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers | Mar 30, 2022 | Question Answering, Visual Commonsense Reasoning | Code Available | 0 |
| All in One: Exploring Unified Video-Language Pre-training | Mar 14, 2022 | All, Language Modelling | Code Available | 2 |
| Joint Answering and Explanation for Visual Commonsense Reasoning | Feb 25, 2022 | Knowledge Distillation, Question Answering | Code Available | 0 |
| CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks | Jan 15, 2022 | Question Answering, Visual Commonsense Reasoning | Unverified | 0 |
| MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | Jan 7, 2022 | Action Classification, Navigate | Unverified | 0 |
| SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning | Dec 16, 2021 | Visual Commonsense Reasoning | Unverified | 0 |
| Towards artificial general intelligence via a multimodal foundation model | Oct 27, 2021 | Image Classification, Reading Comprehension | Code Available | 1 |
| Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning | Sep 14, 2021 | Cultural Vocal Bursts Intensity Prediction, Visual Commonsense Reasoning | Code Available | 1 |
| X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics | Aug 18, 2021 | Cross-Modal Retrieval, Decoder | Code Available | 1 |
| Interpretable Visual Understanding with Cognitive Attention Network | Aug 6, 2021 | Scene Understanding, Visual Commonsense Reasoning | Code Available | 0 |
| Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory | Jul 4, 2021 | Question Answering, Scene Understanding | Code Available | 0 |
| MERLOT: Multimodal Neural Script Knowledge Models | Jun 4, 2021 | Multimodal Reasoning, Visual Commonsense Reasoning | Code Available | 1 |
| Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues | May 15, 2021 | Multimodal Reasoning, Natural Language Inference | Unverified | 0 |
| Playing Lottery Tickets with Vision and Language | Apr 23, 2021 | Image-text Retrieval, Question Answering | Unverified | 0 |
| Unifying Vision-and-Language Tasks via Text Generation | Feb 4, 2021 | Conditional Text Generation, Decoder | Code Available | 1 |
| KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning | Dec 13, 2020 | Sentence, Visual Commonsense Reasoning | Unverified | 0 |
| Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs | Oct 15, 2020 | Language Modelling | Code Available | 1 |
| To Root Artificial Intelligence Deeply in Basic Science for a New Generation of AI | Sep 11, 2020 | Brain Computer Interface, Decision Making | Unverified | 0 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 |