SOTAVerified

Multimodal Reasoning

Reasoning over multimodal inputs.

Papers

Showing 201-250 of 302 papers

Title | Status | Hype
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning | - | 0
Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving | - | 0
Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination | Code | 1
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step | Code | 7
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization | Code | 0
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level | - | 0
Towards Low-Resource Harmful Meme Detection with LMM Agents | Code | 0
Distill Visual Chart Reasoning Ability from LLMs to MLLMs | Code | 2
Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Code | 0
Learning to Ground VLMs without Forgetting | - | 0
An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation | - | 0
NL-Eye: Abductive NLI for Images | - | 0
Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Unveiling AI's Potential Through Tools, Techniques, and Applications | - | 0
Proof of Thought : Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning | - | 0
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images | Code | 0
NVLM: Open Frontier-Class Multimodal LLMs | - | 0
Knowledge-Aware Reasoning over Multimodal Semi-structured Tables | - | 0
Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning | Code | 1
Towards Holistic Disease Risk Prediction using Small Language Models | - | 0
DC3DO: Diffusion Classifier for 3D Objects | Code | 1
User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance | - | 0
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models | Code | 3
LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models | Code | 1
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Code | 1
Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights | Code | 0
On scalable oversight with weak LLMs judging strong LLMs | - | 0
Improving Multi-Agent Debate with Sparse Communication Topology | - | 0
MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models | Code | 1
POEM: Interactive Prompt Optimization for Enhancing Multimodal Reasoning of Large Language Models | - | 0
Multimodal Reasoning with Multimodal Knowledge Graph | - | 0
Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning | - | 0
Don't Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models | Code | 0
M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models | Code | 0
Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models | - | 0
Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning | - | 0
CofiPara: A Coarse-to-fine Paradigm for Multimodal Sarcasm Target Identification with Large Multimodal Models | Code | 1
AccidentBlip: Agent of Accident Warning based on MA-former | - | 0
Exploring the Transferability of Visual Prompting for Multimodal Large Language Models | Code | 1
Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V | - | 0
MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification | Code | 0
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval | - | 0
A Picture Is Worth a Graph: A Blueprint Debate Paradigm for Multimodal Reasoning | Code | 1
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns | Code | 2
Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning | Code | 2
VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing | Code | 0
All in an Aggregated Image for In-Image Learning | Code | 1
Measuring Vision-Language STEM Skills of Neural Models | Code | 0
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis | - | 0
Exploring Failure Cases in Multimodal Reasoning About Physical Dynamics | - | 0
Stop Reasoning! When Multimodal LLM with Chain-of-Thought Reasoning Meets Adversarial Image | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4V | Accuracy | 24 | - | Unverified
2 | Gemini Pro | Accuracy | 13.2 | - | Unverified
3 | LLaVa-1.5-13B | Accuracy | 1.8 | - | Unverified
4 | LLaVa-1.5-7B | Accuracy | 1.5 | - | Unverified
5 | BLIP2-FLAN-T5-XXL | Accuracy | 0.9 | - | Unverified
6 | QWEN | Accuracy | 0.9 | - | Unverified
7 | CogVLM | Accuracy | 0.9 | - | Unverified
8 | InstructBLIP | Accuracy | 0.6 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GPT4V | Accuracy | 22.76 | - | Unverified
2 | Gemini Pro | Accuracy | 17.66 | - | Unverified
3 | Qwen-VL-Max | Accuracy | 15.59 | - | Unverified
4 | InternLM-XComposer2-VL | Accuracy | 14.54 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 | Acc | 30.3 | - | Unverified