SOTAVerified

Visual Reasoning

The ability to understand actions, and the reasoning associated with them, in visual images.

Papers

Showing 51–100 of 698 papers

Title | Status | Hype
Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models | - | 0
SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning | Code | 1
VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use | Code | 2
The Eye of Sherlock Holmes: Uncovering User Private Attribute Profiling via Vision-Language Model Agentic Framework | - | 0
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | Code | 1
ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding | Code | 0
Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning | - | 0
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps | - | 0
GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains | Code | 1
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving | - | 0
One RL to See Them All: Visual Triple Unified Reinforcement Learning | - | 0
DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding | Code | 2
Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | Code | 1
OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning | Code | 0
ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark | Code | 1
From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization | Code | 1
RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs | - | 0
OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning | Code | 1
LaViDa: A Large Diffusion Language Model for Multimodal Understanding | Code | 3
GRIT: Teaching MLLMs to Think with Images | - | 0
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | - | 0
STAR-R1: Spacial TrAnsformation Reasoning by Reinforcing Multimodal LLMs | Code | 0
Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL | - | 0
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank | Code | 2
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning | Code | 5
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning | - | 0
ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models | - | 0
Neurosymbolic Diffusion Models | Code | 2
Advancing Generalization Across a Variety of Abstract Visual Reasoning Tasks | - | 0
ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models | Code | 0
RVTBench: A Benchmark for Visual Reasoning Tasks | Code | 0
Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans | - | 0
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning | Code | 3
Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI | Code | 0
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning | Code | 2
VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making | - | 0
A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law | - | 0
Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs | - | 0
NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks | - | 0
Doxing via the Lens: Revealing Location-related Privacy Leakage on Multi-modal Large Reasoning Models | - | 0
A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task | - | 0
LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | - | 0
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models | - | 0
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? | - | 0
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation | Code | 2
LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation | - | 0
Visual Language Models show widespread visual deficits on neuropsychological tests | - | 0
CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography | - | 0
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge | - | 0
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models | Code | 2
Page 2 of 14

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4o + CA | Text Score | 75.5 | - | Unverified
2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 | - | Unverified
3 | GPT-4V (pick b/w two options) | Text Score | 69.25 | - | Unverified
4 | MMICL + CoCoT | Text Score | 64.25 | - | Unverified
5 | GPT-4V + CoCoT | Text Score | 58.5 | - | Unverified
6 | OpenFlamingo + CoCoT | Text Score | 58.25 | - | Unverified
7 | GPT-4V | Text Score | 54.5 | - | Unverified
8 | FIBER (EqSim) | Text Score | 51.5 | - | Unverified
9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 | - | Unverified
10 | MMICL + CCoT | Text Score | 51 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 91.51 | - | Unverified
2 | X2-VLM (large) | Accuracy | 88.7 | - | Unverified
3 | XFM (base) | Accuracy | 87.6 | - | Unverified
4 | X2-VLM (base) | Accuracy | 86.2 | - | Unverified
5 | CoCa | Accuracy | 86.1 | - | Unverified
6 | VLMo | Accuracy | 85.64 | - | Unverified
7 | VK-OOD | Accuracy | 84.6 | - | Unverified
8 | SimVLM | Accuracy | 84.53 | - | Unverified
9 | X-VLM (base) | Accuracy | 84.41 | - | Unverified
10 | VK-OOD | Accuracy | 83.9 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 92.58 | - | Unverified
2 | X2-VLM (large) | Accuracy | 89.4 | - | Unverified
3 | XFM (base) | Accuracy | 88.4 | - | Unverified
4 | CoCa | Accuracy | 87 | - | Unverified
5 | X2-VLM (base) | Accuracy | 87 | - | Unverified
6 | VLMo | Accuracy | 86.86 | - | Unverified
7 | SimVLM | Accuracy | 85.15 | - | Unverified
8 | X-VLM (base) | Accuracy | 84.76 | - | Unverified
9 | BLIP-129M | Accuracy | 83.09 | - | Unverified
10 | ALBEF (14M) | Accuracy | 82.55 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | AI Core | Average-per ques. | 95.24 | - | Unverified
2 | redherring | Average-per ques. | 91.14 | - | Unverified
3 | VRDP | Average-per ques. | 90.24 | - | Unverified
4 | Fighttttt | Average-per ques. | 88.71 | - | Unverified
5 | neural | Average-per ques. | 88.27 | - | Unverified
6 | NERV | Average-per ques. | 88.05 | - | Unverified
7 | DCL | Average-per ques. | 75.52 | - | Unverified
8 | troublesolver | Average-per ques. | 73.3 | - | Unverified
9 | v0.1 | Average-per ques. | 73.1 | - | Unverified
10 | First_test | Average-per ques. | 69.65 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 | - | Unverified
2 | GPT-4o + CA | 2-Class Accuracy | 92.8 | - | Unverified
3 | Human | 2-Class Accuracy | 91 | - | Unverified
4 | SNAIL | 2-Class Accuracy | 64 | - | Unverified
5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 | - | Unverified
6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 | - | Unverified
7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 | - | Unverified
8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 | - | Unverified
9 | Otter | 2-Class Accuracy | 49.3 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Humans | Jaccard Index | 90 | - | Unverified
2 | ViLT (Zero-Shot) | Jaccard Index | 52 | - | Unverified
3 | X-VLM (Zero-Shot) | Jaccard Index | 46 | - | Unverified
4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 | - | Unverified
5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 | - | Unverified
6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 | - | Unverified
7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 | - | Unverified
8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 | - | Unverified
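The Jaccard Index reported in the table above is the standard intersection-over-union between a predicted label set and a reference label set (the benchmark's exact inputs are not shown here, so treat this as a generic sketch of the metric):

```python
def jaccard_index(pred: set, ref: set) -> float:
    """Intersection-over-union of two label sets; defined as 1.0 when both are empty."""
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)

# One shared label out of three distinct labels overall.
print(jaccard_index({"dog", "ball"}, {"dog", "frisbee"}))  # 1/3 ≈ 0.333
```

A score of 90 in the table corresponds to a mean Jaccard Index of 0.90 expressed as a percentage.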
# | Model | Metric | Claimed | Verified | Status
1 | LXMERT | accuracy | 70.1 | - | Unverified
2 | ViLT | accuracy | 69.3 | - | Unverified
3 | CLIP (finetuned) | accuracy | 65.1 | - | Unverified
4 | CLIP (frozen) | accuracy | 56 | - | Unverified
5 | VisualBERT | accuracy | 55.2 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | RPIN | AUCCESS | 42.2 | - | Unverified
2 | Dec[Joint]1f | AUCCESS | 40.3 | - | Unverified
3 | Dynamics-Aware DQN | AUCCESS | 39.9 | - | Unverified
4 | DQN | AUCCESS | 36.8 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Dynamics-Aware DQN | AUCCESS | 85.2 | - | Unverified
2 | RPIN | AUCCESS | 85.2 | - | Unverified
3 | Dec[Joint]1f | AUCCESS | 80 | - | Unverified
4 | DQN | AUCCESS | 77.6 | - | Unverified
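AUCCESS, as used in the physical-reasoning tables above, comes from the PHYRE benchmark: a log-weighted average of the cumulative success percentage after k = 1..100 solution attempts, so early solutions count much more than late ones. A minimal sketch, assuming the PHYRE weighting w_k = log(k+1) − log(k):

```python
import math

def auccess(success_at_k: list) -> float:
    """Log-weighted average of cumulative success percentages after
    k = 1..100 attempts (assumed PHYRE-style weights w_k = ln(k+1) - ln(k))."""
    assert len(success_at_k) == 100
    weights = [math.log(k + 1) - math.log(k) for k in range(1, 101)]
    return sum(w * s for w, s in zip(weights, success_at_k)) / sum(weights)

# An agent that solves every task on its first attempt scores (approximately) 100.
print(auccess([100.0] * 100))
```

Because the weights shrink like 1/k, an agent needing ten attempts per task scores far below one that succeeds immediately, even if both eventually solve everything.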
# | Model | Metric | Claimed | Verified | Status
1 | Swin | 1:1 Accuracy | 52.9 | - | Unverified
2 | ConvNeXt | 1:1 Accuracy | 51.2 | - | Unverified
3 | ViT | 1:1 Accuracy | 50.3 | - | Unverified
4 | DEiT | 1:1 Accuracy | 47.2 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Humans | 1-of-100 Accuracy | 100 | - | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VisualBERT | Accuracy (Dev) | 67.4 | - | Unverified