| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal Retrieval, Decision Making | Code Available | 1 |
| MERLOT: Multimodal Neural Script Knowledge Models | Jun 4, 2021 | Multimodal Reasoning, Visual Commonsense Reasoning | Code Available | 1 |
| Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning | Sep 14, 2021 | Cultural Vocal Bursts Intensity Prediction, Visual Commonsense Reasoning | Code Available | 1 |
| Towards artificial general intelligence via a multimodal foundation model | Oct 27, 2021 | Image Classification, Reading Comprehension | Code Available | 1 |
| Improving Visual Commonsense in Language Models via Multiple Image Generation | Jun 19, 2024 | Common Sense Reasoning, Image Generation | Code Available | 1 |
| Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs | Oct 15, 2020 | Language Modeling, Language Modelling | Code Available | 1 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 |
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Aug 6, 2019 | Image Retrieval, Question Answering | Code Available | 1 |
| Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR | May 27, 2024 | Question Answering, TAG | Unverified | 0 |
| Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning | May 26, 2023 | Object Recognition, Visual Commonsense Reasoning | Unverified | 0 |