| Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark | Jul 17, 2025 | Multimodal Reasoning, Pose Estimation | Unverified | 0 |
| RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis | Feb 25, 2024 | Code Generation, Multimodal Reasoning | Unverified | 0 |
| RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought | Jun 4, 2025 | Multimodal Reasoning, Reasoning Segmentation | Unverified | 0 |
| VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge | Apr 14, 2025 | Logical Reasoning, Multimodal Reasoning | Unverified | 0 |
| SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning | May 28, 2025 | Image Segmentation, Multimodal Reasoning | Unverified | 0 |
| VisualQuest: A Diverse Image Dataset for Evaluating Visual Recognition in LLMs | Mar 25, 2025 | Diversity, Multimodal Reasoning | Unverified | 0 |
| Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning | Jun 12, 2025 | Attribute, Multimodal Reasoning | Unverified | 0 |
| Seed1.5-VL Technical Report | May 11, 2025 | Mixture-of-Experts, Multimodal Reasoning | Unverified | 0 |
| Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework | Mar 11, 2025 | Conformal Prediction, Multimodal Reasoning | Unverified | 0 |
| Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI | Feb 24, 2025 | document understanding, Multimodal Reasoning | Unverified | 0 |
| VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL | May 29, 2025 | Arithmetic Reasoning, Image Generation | Unverified | 0 |
| Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning | Jun 4, 2025 | Multimodal Reasoning, Reinforcement Learning (RL) | Unverified | 0 |
| Advancing Conversational Diagnostic AI with Multimodal Reasoning | May 6, 2025 | Diagnostic, Management | Unverified | 0 |
| Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning | May 12, 2025 | Multimodal Reasoning | Unverified | 0 |
| SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model | Apr 14, 2025 | Anomaly Detection, Domain Adaptation | Unverified | 0 |
| Sound2Sight: Generating Visual Dynamics from Sound and Context | Jul 23, 2020 | Multimodal Reasoning, Video Forecasting | Unverified | 0 |
| SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning | Jun 2, 2025 | Multimodal Reasoning, reinforcement-learning | Unverified | 0 |
| GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking | Jun 1, 2025 | 4k, Math | Code Available | 0 |
| APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization | Jun 26, 2025 | Multimodal Reasoning, Reinforcement Learning (RL) | Code Available | 0 |
| MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval | Jun 14, 2025 | Instruction Following, Multimodal Reasoning | Code Available | 0 |
| Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Oct 16, 2024 | Benchmarking, Large Language Model | Code Available | 0 |
| DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog | Dec 18, 2019 | AI Agent, Decoder | Code Available | 0 |
| USER-VLM 360: Personalized Vision Language Models with User-aware Tuning for Social Human-Robot Interactions | Feb 15, 2025 | Multimodal Reasoning, Visual Question Answering (VQA) | Code Available | 0 |
| Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization | Nov 15, 2024 | Multimodal Reasoning | Code Available | 0 |
| Measuring Vision-Language STEM Skills of Neural Models | Feb 27, 2024 | Multimodal Reasoning | Code Available | 0 |
| SegSub: Evaluating Robustness to Knowledge Conflicts and Hallucinations in Vision-Language Models | Feb 19, 2025 | counterfactual, Hallucination | Code Available | 0 |
| FiVL: A Framework for Improved Vision-Language Alignment | Dec 19, 2024 | Answer Generation, Multimodal Reasoning | Code Available | 0 |
| VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing | Mar 5, 2024 | Multimodal Reasoning, Sentence | Code Available | 0 |
| Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval | Oct 23, 2022 | Moment Retrieval, Multimodal Reasoning | Code Available | 0 |
| Towards a Unified Multimodal Reasoning Framework | Dec 22, 2023 | Multimodal Reasoning, Multiple-choice | Code Available | 0 |
| SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization | Dec 21, 2024 | Image Captioning, Multimodal Reasoning | Code Available | 0 |
| MindGYM: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions | Mar 12, 2025 | Computational Efficiency, Multimodal Reasoning | Code Available | 0 |
| JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images | Sep 19, 2024 | Hallucination, Image Captioning | Code Available | 0 |
| KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection | May 18, 2025 | Fake News Detection, Misinformation | Code Available | 0 |
| On the generalization capacity of neural networks during generic multimodal reasoning | Jan 26, 2024 | Multimodal Reasoning, Systematic Generalization | Code Available | 0 |
| Don't Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models | May 31, 2024 | Multimodal Reasoning, Retrieval | Code Available | 0 |
| MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification | Apr 7, 2024 | Image Comprehension, Math | Code Available | 0 |
| Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning | Feb 17, 2025 | In-Context Learning, Multimodal Reasoning | Code Available | 0 |
| Dual Attention Networks for Multimodal Reasoning and Matching | Nov 2, 2016 | Collaborative Inference, Image-text matching | Code Available | 0 |
| Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? | Oct 21, 2022 | Image-text matching, Language Modeling | Code Available | 0 |
| MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration | May 29, 2025 | Hallucination, Multimodal Reasoning | Code Available | 0 |
| Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights | Jul 16, 2024 | Image Captioning, Multimodal Reasoning | Code Available | 0 |
| LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models | May 21, 2025 | Multimodal Reasoning | Code Available | 0 |
| Do Vision-Language Pretrained Models Learn Composable Primitive Concepts? | Mar 31, 2022 | Fine-Grained Visual Recognition, Multimodal Reasoning | Code Available | 0 |
| LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering | Dec 24, 2024 | Explanatory Visual Question Answering, Multimodal Reasoning | Code Available | 0 |
| Visual Goal-Step Inference using wikiHow | Apr 12, 2021 | Multimodal Reasoning, VGSI | Code Available | 0 |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Apr 1, 2022 | Diversity, Image Captioning | Code Available | 0 |
| UniT: Multimodal Multitask Learning with a Unified Transformer | Feb 22, 2021 | Decoder, Multimodal Reasoning | Code Available | 0 |
| Apollo: Zero-shot MultiModal Reasoning with Multiple Experts | Oct 25, 2023 | Image Captioning, Multimodal Reasoning | Code Available | 0 |
| Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild | Jan 6, 2025 | Hallucination, Multimodal Reasoning | Code Available | 0 |