SOTAVerified

Multimodal Reasoning

Reasoning over multimodal inputs.

Papers

Showing 251–300 of 302 papers

Title | Status | Hype
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark | - | 0
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis | - | 0
RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought | - | 0
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge | - | 0
SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning | - | 0
VisualQuest: A Diverse Image Dataset for Evaluating Visual Recognition in LLMs | - | 0
Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning | - | 0
Seed1.5-VL Technical Report | - | 0
Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework | - | 0
Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI | - | 0
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL | - | 0
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning | - | 0
Advancing Conversational Diagnostic AI with Multimodal Reasoning | - | 0
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning | - | 0
SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model | - | 0
Sound2Sight: Generating Visual Dynamics from Sound and Context | - | 0
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning | - | 0
GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking | Code | 0
APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization | Code | 0
MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval | Code | 0
Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Code | 0
DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog | Code | 0
USER-VLM 360: Personalized Vision Language Models with User-aware Tuning for Social Human-Robot Interactions | Code | 0
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization | Code | 0
Measuring Vision-Language STEM Skills of Neural Models | Code | 0
SegSub: Evaluating Robustness to Knowledge Conflicts and Hallucinations in Vision-Language Models | Code | 0
FiVL: A Framework for Improved Vision-Language Alignment | Code | 0
VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing | Code | 0
Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval | Code | 0
Towards a Unified Multimodal Reasoning Framework | Code | 0
SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization | Code | 0
MindGYM: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions | Code | 0
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images | Code | 0
KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection | Code | 0
On the generalization capacity of neural networks during generic multimodal reasoning | Code | 0
Don't Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models | Code | 0
MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification | Code | 0
Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning | Code | 0
Dual Attention Networks for Multimodal Reasoning and Matching | Code | 0
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? | Code | 0
MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration | Code | 0
Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights | Code | 0
LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models | Code | 0
Do Vision-Language Pretrained Models Learn Composable Primitive Concepts? | Code | 0
LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering | Code | 0
Visual Goal-Step Inference using wikiHow | Code | 0
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Code | 0
UniT: Multimodal Multitask Learning with a Unified Transformer | Code | 0
Apollo: Zero-shot MultiModal Reasoning with Multiple Experts | Code | 0
Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild | Code | 0
Page 6 of 7

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4V | Accuracy | 24 | - | Unverified
2 | Gemini Pro | Accuracy | 13.2 | - | Unverified
3 | LLaVA-1.5-13B | Accuracy | 1.8 | - | Unverified
4 | LLaVA-1.5-7B | Accuracy | 1.5 | - | Unverified
5 | BLIP2-FLAN-T5-XXL | Accuracy | 0.9 | - | Unverified
6 | Qwen | Accuracy | 0.9 | - | Unverified
7 | CogVLM | Accuracy | 0.9 | - | Unverified
8 | InstructBLIP | Accuracy | 0.6 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GPT-4V | Accuracy | 22.76 | - | Unverified
2 | Gemini Pro | Accuracy | 17.66 | - | Unverified
3 | Qwen-VL-Max | Accuracy | 15.59 | - | Unverified
4 | InternLM-XComposer2-VL | Accuracy | 14.54 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 | Accuracy | 30.3 | - | Unverified