| Knowledge-Aware Reasoning over Multimodal Semi-structured Tables | Aug 25, 2024 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| Towards Holistic Disease Risk Prediction using Small Language Models | Aug 13, 2024 | Multimodal Reasoning | Unverified | 0 |
| User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance | Aug 4, 2024 | Action Anticipation, Benchmarking | Unverified | 0 |
| Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights | Jul 16, 2024 | Image Captioning, Multimodal Reasoning | Code Available | 0 |
| On scalable oversight with weak LLMs judging strong LLMs | Jul 5, 2024 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| Improving Multi-Agent Debate with Sparse Communication Topology | Jun 17, 2024 | Multimodal Reasoning | Unverified | 0 |
| POEM: Interactive Prompt Optimization for Enhancing Multimodal Reasoning of Large Language Models | Jun 6, 2024 | Multimodal Reasoning, Prompt Engineering | Unverified | 0 |
| Multimodal Reasoning with Multimodal Knowledge Graph | Jun 4, 2024 | cross-modal alignment, Graph Attention | Unverified | 0 |
| Don't Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models | May 31, 2024 | Multimodal Reasoning, Retrieval | Code Available | 0 |
| Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning | May 31, 2024 | Answer Generation, Multimodal Reasoning | Unverified | 0 |
| M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models | May 24, 2024 | Multimodal Reasoning | Code Available | 0 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | May 22, 2024 | Multimodal Reasoning, Visual Question Answering | Unverified | 0 |
| Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning | May 19, 2024 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| AccidentBlip: Agent of Accident Warning based on MA-former | Apr 18, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V | Apr 16, 2024 | Instruction Following, Multimodal Reasoning | Unverified | 0 |
| MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification | Apr 7, 2024 | Image Comprehension, Math | Code Available | 0 |
| Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval | Mar 26, 2024 | Multimodal Reasoning, Retrieval | Unverified | 0 |
| VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing | Mar 5, 2024 | Multimodal Reasoning, Sentence | Code Available | 0 |
| Measuring Vision-Language STEM Skills of Neural Models | Feb 27, 2024 | Multimodal Reasoning | Code Available | 0 |
| RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis | Feb 25, 2024 | Code Generation, Multimodal Reasoning | Unverified | 0 |
| Exploring Failure Cases in Multimodal Reasoning About Physical Dynamics | Feb 24, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models | Feb 21, 2024 | Geometry Problem Solving, Molecular Property Prediction | Unverified | 0 |
| Question Aware Vision Transformer for Multimodal Reasoning | Feb 8, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| On the generalization capacity of neural networks during generic multimodal reasoning | Jan 26, 2024 | Multimodal Reasoning, Systematic Generalization | Code Available | 0 |
| Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine | Jan 16, 2024 | Diagnostic, Image Comprehension | Unverified | 0 |
| Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning | Jan 10, 2024 | Multimodal Reasoning, Survey | Unverified | 0 |
| Towards a Unified Multimodal Reasoning Framework | Dec 22, 2023 | Multimodal Reasoning, Multiple-choice | Code Available | 0 |
| Assessing GPT4-V on Structured Reasoning Tasks | Dec 13, 2023 | Code Generation, Language Modeling | Unverified | 0 |
| Apollo: Zero-shot MultiModal Reasoning with Multiple Experts | Oct 25, 2023 | Image Captioning, Multimodal Reasoning | Code Available | 0 |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | Oct 25, 2023 | Multimodal Reasoning | Unverified | 0 |
| Personality-aware Human-centric Multimodal Reasoning: A New Task, Dataset and Baselines | Apr 5, 2023 | Decision Making, Multimodal Reasoning | Unverified | 0 |
| AutoFraudNet: A Multimodal Network to Detect Fraud in the Auto Insurance Industry | Jan 15, 2023 | Fraud Detection, Multimodal Reasoning | Unverified | 0 |
| Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval | Oct 23, 2022 | Moment Retrieval, Multimodal Reasoning | Code Available | 0 |
| Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? | Oct 21, 2022 | Image-text matching, Language Modeling | Code Available | 0 |
| Deep Neural Networks for Visual Reasoning | Sep 24, 2022 | Multimodal Reasoning, Visual Reasoning | Unverified | 0 |
| Reducing the Vision and Language Bias for Temporal Sentence Grounding | Jul 27, 2022 | Information Retrieval, Multimodal Reasoning | Unverified | 0 |
| DisinfoMeme: A Multimodal Dataset for Detecting Meme Intentionally Spreading Out Disinformation | May 25, 2022 | Multimodal Reasoning, Optical Character Recognition (OCR) | Unverified | 0 |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Apr 1, 2022 | Diversity, Image Captioning | Code Available | 0 |
| Do Vision-Language Pretrained Models Learn Composable Primitive Concepts? | Mar 31, 2022 | Fine-Grained Visual Recognition, Multimodal Reasoning | Code Available | 0 |
| Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering | Dec 1, 2021 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding | Nov 1, 2021 | Multimodal Reasoning, Phrase Grounding | Unverified | 0 |
| TxT: Crossmodal End-to-End Learning with Transformers | Sep 9, 2021 | Multimodal Reasoning, Question Answering | Unverified | 0 |
| C^3: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues | Jun 16, 2021 | Contrastive Learning, counterfactual | Unverified | 0 |
| Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues | May 15, 2021 | Multimodal Reasoning, Natural Language Inference | Unverified | 0 |
| Visual Goal-Step Inference using wikiHow | Apr 12, 2021 | Multimodal Reasoning, VGSI | Code Available | 0 |
| UniT: Multimodal Multitask Learning with a Unified Transformer | Feb 22, 2021 | Decoder, Multimodal Reasoning | Code Available | 0 |
| DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents | Jan 28, 2021 | Document Summarization, Multimodal Reasoning | Unverified | 0 |
| Sound2Sight: Generating Visual Dynamics from Sound and Context | Jul 23, 2020 | Multimodal Reasoning, Video Forecasting | Unverified | 0 |
| DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog | Dec 18, 2019 | AI Agent, Decoder | Code Available | 0 |
| Multimodal Transformer with Multi-View Visual Representation for Image Captioning | May 20, 2019 | Decoder, Image Captioning | Unverified | 0 |