| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal Retrieval, Decision Making | Code Available | 1 | 5 |
| PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | May 23, 2022 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning | Sep 14, 2021 | Cultural Vocal Bursts Intensity Prediction, Visual Commonsense Reasoning | Code Available | 1 | 5 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 | 5 |
| MERLOT: Multimodal Neural Script Knowledge Models | Jun 4, 2021 | Multimodal Reasoning, Visual Commonsense Reasoning | Code Available | 1 | 5 |
| Unifying Vision-and-Language Tasks via Text Generation | Feb 4, 2021 | Conditional Text Generation, Decoder | Code Available | 1 | 5 |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Aug 22, 2019 | Image-text matching, Language Modelling | Code Available | 1 | 5 |
| Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs | Oct 15, 2020 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | Dec 8, 2024 | Misconceptions, Multiple-choice | Code Available | 0 | 5 |
| Connective Cognition Network for Directional Visual Commonsense Reasoning | Dec 1, 2019 | Sentence, Visual Commonsense Reasoning | Code Available | 0 | 5 |