SOTAVerified

Multimodal Reasoning

Reasoning over multimodal inputs.

Papers

Showing 101-150 of 302 papers

Title | Status | Hype
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding | Code | 1
Fine-Grained Visual Entailment | Code | 1
PACS: A Dataset for Physical Audiovisual CommonSense Reasoning | Code | 1
WebQA: Multihop and Multimodal QA | Code | 1
Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision | Code | 1
MERLOT: Multimodal Neural Script Knowledge Models | Code | 1
A Multimodal Framework for the Detection of Hateful Memes | Code | 1
e-SNLI-VE: Corrected Visual-Textual Entailment with Natural Language Explanations | Code | 1
EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent | - | 0
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark | - | 0
The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs | - | 0
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning | - | 0
Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling | - | 0
Perception-Aware Policy Optimization for Multimodal Reasoning | - | 0
APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization | Code | 0
MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering | - | 0
Adapting Vision-Language Models for Evaluating World Models | - | 0
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning | - | 0
GeoGuess: Multimodal Reasoning based on Hierarchy of Visual Information in Street View | - | 0
MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering | - | 0
RadFabric: Agentic AI System with Reasoning Capability for Radiology | - | 0
PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning | - | 0
FinLMM-R1: Enhancing Financial Reasoning in LMM through Scalable Data and Reward Design | - | 0
VL-GenRM: Enhancing Vision-Language Verification via Vision Experts and Iterative Training | - | 0
MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval | Code | 0
MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning | - | 0
FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models | - | 0
Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning | - | 0
Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts | - | 0
ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering | - | 0
Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency | - | 0
KokushiMD-10: Benchmark for Evaluating Large Language Models on Ten Japanese National Healthcare Licensing Examinations | - | 0
Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations | - | 0
Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation | - | 0
MuSciClaims: Multimodal Scientific Claim Verification | - | 0
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning | - | 0
RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought | - | 0
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos | - | 0
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning | - | 0
GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking | Code | 0
MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM | - | 0
Infi-Med: Low-Resource Medical MLLMs with Robust Reasoning Evaluation | - | 0
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought | - | 0
Preemptive Hallucination Reduction: An Input-Level Approach for Multimodal Language Model | - | 0
GAM-Agent: Game-Theoretic and Uncertainty-Aware Collaboration for Complex Visual Reasoning | - | 0
MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration | Code | 0
Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios | - | 0
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL | - | 0
Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models | - | 0
SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4V | Accuracy | 24 | - | Unverified
2 | Gemini Pro | Accuracy | 13.2 | - | Unverified
3 | LLaVa-1.5-13B | Accuracy | 1.8 | - | Unverified
4 | LLaVa-1.5-7B | Accuracy | 1.5 | - | Unverified
5 | BLIP2-FLAN-T5-XXL | Accuracy | 0.9 | - | Unverified
6 | QWEN | Accuracy | 0.9 | - | Unverified
7 | CogVLM | Accuracy | 0.9 | - | Unverified
8 | InstructBLIP | Accuracy | 0.6 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GPT4V | Accuracy | 22.76 | - | Unverified
2 | Gemini Pro | Accuracy | 17.66 | - | Unverified
3 | Qwen-VL-Max | Accuracy | 15.59 | - | Unverified
4 | InternLM-XComposer2-VL | Accuracy | 14.54 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 | Acc | 30.3 | - | Unverified