| Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | Feb 22, 2024 | Diversity, Math | Code Available | 2 |
| BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models | Feb 21, 2024 | Geometry Problem Solving, Molecular Property Prediction | Unverified | 0 |
| Question Aware Vision Transformer for Multimodal Reasoning | Feb 8, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion | Feb 8, 2024 | Computational Efficiency, Multimodal Reasoning | Code Available | 2 |
| On the generalization capacity of neural networks during generic multimodal reasoning | Jan 26, 2024 | Multimodal Reasoning, Systematic Generalization | Code Available | 0 |
| Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine | Jan 16, 2024 | Diagnostic, Image Comprehension | Unverified | 0 |
| Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning | Jan 10, 2024 | Multimodal Reasoning, Survey | Unverified | 0 |
| Towards a Unified Multimodal Reasoning Framework | Dec 22, 2023 | Multimodal Reasoning, Multiple-choice | Code Available | 0 |
| Assessing GPT4-V on Structured Reasoning Tasks | Dec 13, 2023 | Code Generation, Language Modeling | Unverified | 0 |
| Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models | Dec 9, 2023 | Multimodal Reasoning | Code Available | 1 |
| Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training | Nov 23, 2023 | Multimodal Reasoning, Science Question Answering | Code Available | 1 |
| Apollo: Zero-shot MultiModal Reasoning with Multiple Experts | Oct 25, 2023 | Image Captioning, Multimodal Reasoning | Code Available | 0 |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | Oct 25, 2023 | Multimodal Reasoning | Unverified | 0 |
| MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks | Oct 13, 2023 | multimodal interaction, Multimodal Reasoning | Code Available | 1 |
| DOMINO: A Dual-System for Multi-step Visual Language Reasoning | Oct 4, 2023 | Arithmetic Reasoning, Language Modeling | Code Available | 1 |
| Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Jul 24, 2023 | Contrastive Learning, Multimodal Reasoning | Code Available | 1 |
| Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models | May 26, 2023 | GSM8K, Multimodal Reasoning | Code Available | 3 |
| LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation | May 19, 2023 | Image Generation, Instruction Following | Code Available | 1 |
| Personality-aware Human-centric Multimodal Reasoning: A New Task, Dataset and Baselines | Apr 5, 2023 | Decision Making, Multimodal Reasoning | Unverified | 0 |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | Mar 20, 2023 | Multimodal Reasoning, Visual Question Answering | Code Available | 2 |
| AutoFraudNet: A Multimodal Network to Detect Fraud in the Auto Insurance Industry | Jan 15, 2023 | Fraud Detection, Multimodal Reasoning | Unverified | 0 |
| Variational Causal Inference Network for Explanatory Visual Question Answering | Jan 1, 2023 | Explanation Generation, Explanatory Visual Question Answering | Code Available | 1 |
| Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval | Oct 23, 2022 | Moment Retrieval, Multimodal Reasoning | Code Available | 0 |
| Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? | Oct 21, 2022 | Image-text matching, Language Modeling | Code Available | 0 |
| Multimodal Analogical Reasoning over Knowledge Graphs | Oct 1, 2022 | Graph Embedding, Knowledge Graph Embedding | Code Available | 2 |
| Deep Neural Networks for Visual Reasoning | Sep 24, 2022 | Multimodal Reasoning, Visual Reasoning | Unverified | 0 |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Sep 20, 2022 | Multimodal Deep Learning, Multimodal Reasoning | Code Available | 2 |
| Reducing the Vision and Language Bias for Temporal Sentence Grounding | Jul 27, 2022 | Information Retrieval, Multimodal Reasoning | Unverified | 0 |
| DisinfoMeme: A Multimodal Dataset for Detecting Meme Intentionally Spreading Out Disinformation | May 25, 2022 | Multimodal Reasoning, Optical Character Recognition (OCR) | Unverified | 0 |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Apr 1, 2022 | Diversity, Image Captioning | Code Available | 0 |
| Do Vision-Language Pretrained Models Learn Composable Primitive Concepts? | Mar 31, 2022 | Fine-Grained Visual Recognition, Multimodal Reasoning | Code Available | 0 |
| Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding | Mar 29, 2022 | Multimodal Reasoning, Visual Grounding | Code Available | 1 |
| Fine-Grained Visual Entailment | Mar 29, 2022 | Multimodal Reasoning, Visual Entailment | Code Available | 1 |
| PACS: A Dataset for Physical Audiovisual CommonSense Reasoning | Mar 21, 2022 | Common Sense Reasoning, Multimodal Reasoning | Code Available | 1 |
| Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering | Dec 1, 2021 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding | Nov 1, 2021 | Multimodal Reasoning, Phrase Grounding | Unverified | 0 |
| TxT: Crossmodal End-to-End Learning with Transformers | Sep 9, 2021 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| WebQA: Multihop and Multimodal QA | Sep 1, 2021 | Image Retrieval, Multimodal Reasoning | Code Available | 1 |
| Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision | Aug 12, 2021 | 3D geometry, Descriptive | Code Available | 1 |
| C^3: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues | Jun 16, 2021 | Contrastive Learning, counterfactual | Unverified | 0 |
| MERLOT: Multimodal Neural Script Knowledge Models | Jun 4, 2021 | Multimodal Reasoning, Visual Commonsense Reasoning | Code Available | 1 |
| Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues | May 15, 2021 | Multimodal Reasoning, Natural Language Inference | Unverified | 0 |
| Visual Goal-Step Inference using wikiHow | Apr 12, 2021 | Multimodal Reasoning, VGSI | Code Available | 0 |
| UniT: Multimodal Multitask Learning with a Unified Transformer | Feb 22, 2021 | Decoder, Multimodal Reasoning | Code Available | 0 |
| DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents | Jan 28, 2021 | Document Summarization, Multimodal Reasoning | Unverified | 0 |
| A Multimodal Framework for the Detection of Hateful Memes | Dec 23, 2020 | Ensemble Learning, Multimodal Reasoning | Code Available | 1 |
| Sound2Sight: Generating Visual Dynamics from Sound and Context | Jul 23, 2020 | Multimodal Reasoning, Video Forecasting | Unverified | 0 |
| e-SNLI-VE: Corrected Visual-Textual Entailment with Natural Language Explanations | Apr 7, 2020 | Multimodal Reasoning, Natural Language Inference | Code Available | 1 |
| DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog | Dec 18, 2019 | AI Agent, Decoder | Code Available | 0 |
| Multimodal Transformer with Multi-View Visual Representation for Image Captioning | May 20, 2019 | Decoder, Image Captioning | Unverified | 0 |