SOTAVerified

Multimodal Reasoning

Reasoning over multimodal inputs.

Papers

Showing 251–300 of 302 papers

Title | Status | Hype
Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Unveiling AI's Potential Through Tools, Techniques, and Applications | | 0
EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent | | 0
Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios | | 0
Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations | | 0
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | | 0
Towards Agentic Recommender Systems in the Era of Multimodal Large Language Models | | 0
Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling | | 0
EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges | | 0
EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications | | 0
EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing | | 0
Evolutionary Prompt Optimization Discovers Emergent Multimodal Reasoning Strategies in Vision-Language Models | | 0
Exploring Advanced Techniques for Visual Question Answering: A Comprehensive Comparison | | 0
Exploring Failure Cases in Multimodal Reasoning About Physical Dynamics | | 0
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning | | 0
Towards Holistic Disease Risk Prediction using Small Language Models | | 0
FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models | | 0
VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering | | 0
FinLMM-R1: Enhancing Financial Reasoning in LMM through Scalable Data and Reward Design | | 0
CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base | | 0
GAM-Agent: Game-Theoretic and Uncertainty-Aware Collaboration for Complex Visual Reasoning | | 0
GeoGuess: Multimodal Reasoning based on Hierarchy of Visual Information in Street View | | 0
GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning | | 0
Critique Before Thinking: Mitigating Hallucination through Rationale-Augmented Instruction Tuning | | 0
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning | | 0
Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans | | 0
Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine | | 0
Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving | | 0
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning | | 0
Training-Free Personalization via Retrieval and Reasoning on Fingerprints | | 0
Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence | | 0
Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | | 0
Improving Multi-Agent Debate with Sparse Communication Topology | | 0
Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding | | 0
Incentivizing Multimodal Reasoning in Large Models for Direct Robot Manipulation | | 0
Infi-Med: Low-Resource Medical MLLMs with Robust Reasoning Evaluation | | 0
Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models | | 0
Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning | | 0
Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning | | 0
COSINT-Agent: A Knowledge-Driven Multimodal Agent for Chinese Open Source Intelligence | | 0
Knowledge-Aware Reasoning over Multimodal Semi-structured Tables | | 0
KokushiMD-10: Benchmark for Evaluating Large Language Models on Ten Japanese National Healthcare Licensing Examinations | | 0
Training-Free Reasoning and Reflection in MLLMs | | 0
Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering | | 0
Learning to Ground VLMs without Forgetting | | 0
Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes | | 0
Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V | | 0
Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data | | 0
ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering | | 0
TxT: Crossmodal End-to-End Learning with Transformers | | 0
Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4V | Accuracy | 24 | | Unverified
2 | Gemini Pro | Accuracy | 13.2 | | Unverified
3 | LLaVa-1.5-13B | Accuracy | 1.8 | | Unverified
4 | LLaVa-1.5-7B | Accuracy | 1.5 | | Unverified
5 | BLIP2-FLAN-T5-XXL | Accuracy | 0.9 | | Unverified
6 | QWEN | Accuracy | 0.9 | | Unverified
7 | CogVLM | Accuracy | 0.9 | | Unverified
8 | InstructBLIP | Accuracy | 0.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GPT4V | Accuracy | 22.76 | | Unverified
2 | Gemini Pro | Accuracy | 17.66 | | Unverified
3 | Qwen-VL-Max | Accuracy | 15.59 | | Unverified
4 | InternLM-XComposer2-VL | Accuracy | 14.54 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 | Acc | 30.3 | | Unverified