| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Sep 20, 2022 | Multimodal Deep LearningMultimodal Reasoning | CodeCode Available | 2 |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | Jul 7, 2023 | AttributeCommon Sense Reasoning | CodeCode Available | 2 |
| All in One: Exploring Unified Video-Language Pre-training | Mar 14, 2022 | AllLanguage Modelling | CodeCode Available | 2 |
| Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | Jun 3, 2024 | Image CaptioningLanguage Modelling | CodeCode Available | 2 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text RetrievalQuestion Answering | CodeCode Available | 1 |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Aug 22, 2019 | Image-text matchingLanguage Modelling | CodeCode Available | 1 |
| Unifying Vision-and-Language Tasks via Text Generation | Feb 4, 2021 | Conditional Text GenerationDecoder | CodeCode Available | 1 |
| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal RetrievalDecision Making | CodeCode Available | 1 |
| PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | May 23, 2022 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning | Sep 14, 2021 | Cultural Vocal Bursts Intensity PredictionVisual Commonsense Reasoning | CodeCode Available | 1 |
| Improving Visual Commonsense in Language Models via Multiple Image Generation | Jun 19, 2024 | Common Sense ReasoningImage Generation | CodeCode Available | 1 |
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Aug 6, 2019 | Image RetrievalQuestion Answering | CodeCode Available | 1 |
| Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs | Oct 15, 2020 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| MERLOT: Multimodal Neural Script Knowledge Models | Jun 4, 2021 | Multimodal ReasoningVisual Commonsense Reasoning | CodeCode Available | 1 |
| X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics | Aug 18, 2021 | Cross-Modal RetrievalDecoder | CodeCode Available | 1 |
| Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning | Jan 1, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| UNITER: UNiversal Image-TExt Representation Learning | Sep 25, 2019 | Image-text matchingImage-text Retrieval | CodeCode Available | 1 |
| Towards artificial general intelligence via a multimodal foundation model | Oct 27, 2021 | Image ClassificationReading Comprehension | CodeCode Available | 1 |
| Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks | Apr 22, 2022 | Question AnsweringVisual Commonsense Reasoning | —Unverified | 0 |
| A survey on knowledge-enhanced multimodal learning | Nov 19, 2022 | Conditional Image GenerationFactual Visual Question Answering | —Unverified | 0 |
| Attention Mechanism based Cognition-level Scene Understanding | Apr 17, 2022 | Question AnsweringScene Understanding | —Unverified | 0 |
| Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | Mar 13, 2023 | Common Sense ReasoningExplanation Generation | —Unverified | 0 |
| CAVL: Learning Contrastive and Adaptive Representations of Vision and Language | Apr 10, 2023 | Image RetrievalPhrase Grounding | —Unverified | 0 |
| CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks | Jan 15, 2022 | Question AnsweringVisual Commonsense Reasoning | —Unverified | 0 |
| Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? | Jun 11, 2024 | Adversarial TextImage Generation | —Unverified | 0 |
| Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning | May 26, 2023 | Object RecognitionVisual Commonsense Reasoning | —Unverified | 0 |
| Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR | May 27, 2024 | Question AnsweringTAG | —Unverified | 0 |
| Enforcing Reasoning in Visual Commonsense Reasoning | Oct 21, 2019 | Question AnsweringReinforcement Learning | —Unverified | 0 |
| EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning | Apr 22, 2024 | Visual Commonsense Reasoning | —Unverified | 0 |
| Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing | Jan 15, 2025 | Visual Commonsense Reasoning | —Unverified | 0 |
| GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions | May 24, 2023 | ObjectQuestion Answering | —Unverified | 0 |
| How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey | Dec 11, 2024 | Image CaptioningQuestion Answering | —Unverified | 0 |
| Improving Vision-and-Language Reasoning via Spatial Relations Modeling | Nov 9, 2023 | Position regressionRelation | —Unverified | 0 |
| InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining | Mar 30, 2020 | Image RetrievalImage-text matching | —Unverified | 0 |
| KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning | Dec 13, 2020 | SentenceVisual Commonsense Reasoning | —Unverified | 0 |
| Learning to Agree on Vision Attention for Visual Commonsense Reasoning | Feb 4, 2023 | Visual Commonsense ReasoningVisual Reasoning | —Unverified | 0 |
| MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | Jan 7, 2022 | Action ClassificationNavigate | —Unverified | 0 |
| ALGO: Object-Grounded Visual Commonsense Reasoning for Open-World Egocentric Action Recognition | Jun 9, 2024 | Action RecognitionObject Recognition | —Unverified | 0 |
| On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization | May 24, 2022 | DescriptiveImage Captioning | —Unverified | 0 |
| Playing Lottery Tickets with Vision and Language | Apr 23, 2021 | Image-text RetrievalQuestion Answering | —Unverified | 0 |
| Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues | May 15, 2021 | Multimodal ReasoningNatural Language Inference | —Unverified | 0 |
| Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning | Jan 30, 2023 | Language ModelingLanguage Modelling | —Unverified | 0 |
| SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning | Dec 16, 2021 | Visual Commonsense Reasoning | —Unverified | 0 |
| Super-Prompting: Utilizing Model-Independent Contextual Data to Reduce Data Annotation Required in Visual Commonsense Tasks | Apr 25, 2022 | Few-Shot LearningIn-Context Learning | —Unverified | 0 |
| To Root Artificial Intelligence Deeply in Basic Science for a New Generation of AI | Sep 11, 2020 | Brain Computer InterfaceDecision Making | —Unverified | 0 |
| Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training | Aug 16, 2019 | Image-text matchingImage-text Retrieval | —Unverified | 0 |
| UNITER: Learning UNiversal Image-TExt Representations | Sep 25, 2019 | Image-text matchingImage-text Retrieval | —Unverified | 0 |
| ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models | Oct 9, 2023 | Image CaptioningVisual Commonsense Reasoning | —Unverified | 0 |
| VisualCOMET: Reasoning about the Dynamic Context of a Still Image | Apr 22, 2020 | Visual Commonsense Reasoning | —Unverified | 0 |
| TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines | Oct 31, 2019 | AttributeQuestion Answering | CodeCode Available | 0 |