| Learning Compact Vision Tokens for Efficient Large Multimodal Models | Jun 8, 2025 | Multimodal Reasoning, Token Reduction | Code Available | 1 | 5 |
| Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning | Jun 16, 2025 | Multimodal Reasoning, Reinforcement Learning (RL) | Code Available | 1 | 5 |
| MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models | Jun 17, 2024 | Benchmarking, Fact Checking | Code Available | 1 | 5 |
| MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models | Apr 8, 2025 | Math, Multimodal Reasoning | Code Available | 1 | 5 |
| Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning | Aug 16, 2024 | Math, Mathematical Reasoning | Code Available | 1 | 5 |
| Fine-Grained Visual Entailment | Mar 29, 2022 | Multimodal Reasoning, Visual Entailment | Code Available | 1 | 5 |
| MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | Dec 6, 2024 | Multimodal Reasoning, Visual Question Answering | Code Available | 1 | 5 |
| MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Mar 17, 2025 | Articles, Benchmarking | Code Available | 1 | 5 |
| GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking | Jun 1, 2025 | 4k, Math | Code Available | 0 | 5 |
| Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights | Jul 16, 2024 | Image Captioning, Multimodal Reasoning | Code Available | 0 | 5 |
| VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing | Mar 5, 2024 | Multimodal Reasoning, Sentence | Code Available | 0 | 5 |
| Visual Goal-Step Inference using wikiHow | Apr 12, 2021 | Multimodal Reasoning, VGSI | Code Available | 0 | 5 |
| UniT: Multimodal Multitask Learning with a Unified Transformer | Feb 22, 2021 | Decoder, Multimodal Reasoning | Code Available | 0 | 5 |
| FiVL: A Framework for Improved Vision-Language Alignment | Dec 19, 2024 | Answer Generation, Multimodal Reasoning | Code Available | 0 | 5 |
| Apollo: Zero-shot MultiModal Reasoning with Multiple Experts | Oct 25, 2023 | Image Captioning, Multimodal Reasoning | Code Available | 0 | 5 |
| Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Oct 16, 2024 | Benchmarking, Large Language Model | Code Available | 0 | 5 |
| APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization | Jun 26, 2025 | Multimodal Reasoning, Reinforcement Learning (RL) | Code Available | 0 | 5 |
| MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification | Apr 7, 2024 | Image Comprehension, Math | Code Available | 0 | 5 |
| USER-VLM 360: Personalized Vision Language Models with User-aware Tuning for Social Human-Robot Interactions | Feb 15, 2025 | Multimodal Reasoning, Visual Question Answering (VQA) | Code Available | 0 | 5 |
| Towards a Unified Multimodal Reasoning Framework | Dec 22, 2023 | Multimodal Reasoning, Multiple-choice | Code Available | 0 | 5 |
| Towards Low-Resource Harmful Meme Detection with LMM Agents | Nov 8, 2024 | Multimodal Reasoning | Code Available | 0 | 5 |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Apr 1, 2022 | Diversity, Image Captioning | Code Available | 0 | 5 |
| Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild | Jan 6, 2025 | Hallucination, Multimodal Reasoning | Code Available | 0 | 5 |
| Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization | Nov 15, 2024 | Multimodal Reasoning | Code Available | 0 | 5 |
| SegSub: Evaluating Robustness to Knowledge Conflicts and Hallucinations in Vision-Language Models | Feb 19, 2025 | counterfactual, Hallucination | Code Available | 0 | 5 |
| SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization | Dec 21, 2024 | Image Captioning, Multimodal Reasoning | Code Available | 0 | 5 |
| Dual Attention Networks for Multimodal Reasoning and Matching | Nov 2, 2016 | Collaborative Inference, Image-text matching | Code Available | 0 | 5 |
| Do Vision-Language Pretrained Models Learn Composable Primitive Concepts? | Mar 31, 2022 | Fine-Grained Visual Recognition, Multimodal Reasoning | Code Available | 0 | 5 |
| Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? | Oct 21, 2022 | Image-text matching, Language Modeling | Code Available | 0 | 5 |
| Don't Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models | May 31, 2024 | Multimodal Reasoning, Retrieval | Code Available | 0 | 5 |
| On the generalization capacity of neural networks during generic multimodal reasoning | Jan 26, 2024 | Multimodal Reasoning, Systematic Generalization | Code Available | 0 | 5 |
| M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models | May 24, 2024 | Multimodal Reasoning | Code Available | 0 | 5 |
| DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog | Dec 18, 2019 | AI Agent, Decoder | Code Available | 0 | 5 |
| Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval | Oct 23, 2022 | Moment Retrieval, Multimodal Reasoning | Code Available | 0 | 5 |
| LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering | Dec 24, 2024 | Explanatory Visual Question Answering, Multimodal Reasoning | Code Available | 0 | 5 |
| LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models | May 21, 2025 | Multimodal Reasoning | Code Available | 0 | 5 |
| MindGYM: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions | Mar 12, 2025 | Computational Efficiency, Multimodal Reasoning | Code Available | 0 | 5 |
| Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning | Feb 17, 2025 | In-Context Learning, Multimodal Reasoning | Code Available | 0 | 5 |
| MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration | May 29, 2025 | Hallucination, Multimodal Reasoning | Code Available | 0 | 5 |
| MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval | Jun 14, 2025 | Instruction Following, Multimodal Reasoning | Code Available | 0 | 5 |
| Measuring Vision-Language STEM Skills of Neural Models | Feb 27, 2024 | Multimodal Reasoning | Code Available | 0 | 5 |
| KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection | May 18, 2025 | Fake News Detection, Misinformation | Code Available | 0 | 5 |
| JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images | Sep 19, 2024 | Hallucination, Image Captioning | Code Available | 0 | 5 |
| Infi-Med: Low-Resource Medical MLLMs with Robust Reasoning Evaluation | May 29, 2025 | Diagnostic, Multimodal Reasoning | Unverified | 0 | 0 |
| Incentivizing Multimodal Reasoning in Large Models for Direct Robot Manipulation | May 19, 2025 | Multimodal Reasoning, Robot Manipulation | Unverified | 0 | 0 |
| Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding | Nov 1, 2021 | Multimodal Reasoning, Phrase Grounding | Unverified | 0 | 0 |
| Improving Multi-Agent Debate with Sparse Communication Topology | Jun 17, 2024 | Multimodal Reasoning | Unverified | 0 | 0 |
| CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base | Feb 18, 2025 | Attribute, Hallucination | Unverified | 0 | 0 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | May 22, 2024 | Multimodal Reasoning, Visual Question Answering | Unverified | 0 | 0 |
| Critique Before Thinking: Mitigating Hallucination through Rationale-Augmented Instruction Tuning | May 12, 2025 | Hallucination, Multimodal Reasoning | Unverified | 0 | 0 |