SOTAVerified

Visual Reasoning

The ability to understand actions and reasoning associated with visual images.

Papers

Showing 101-150 of 698 papers

Title | Status | Hype
ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness | Code | 1
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement | Code | 2
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model | Code | 9
OmniCaptioner: One Captioner to Rule Them All | Code | 2
V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models | Code | 1
TGraphX: Tensor-Aware Graph Neural Network for Multi-Dimensional Feature Learning | Code | 0
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme | Code | 2
On Data Synthesis and Post-training for Visual Abstract Reasoning | | 0
TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images | Code | 0
GenVP: Generating Visual Puzzles with Contrastive Hierarchical VAEs | | 0
Q-Insight: Understanding Image Quality via Visual Reinforcement Learning | Code | 2
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks | Code | 2
Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning | Code | 3
DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning | | 0
RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models | | 0
Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation | | 0
Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data | | 0
Agentic Keyframe Search for Video Question Answering | Code | 1
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration | | 0
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity | | 0
Interpretable Image Classification via Non-parametric Part Prototype Learning | Code | 1
SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems | | 0
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Code | 1
DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding | Code | 2
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Code | 2
PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability | Code | 1
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning | Code | 1
Does Acceleration Cause Hidden Instability in Vision Language Models? Uncovering Instance-Level Divergence Through a Large-Scale Empirical Study | | 0
Poisoned-MRAG: Knowledge Poisoning Attacks to Multimodal Retrieval Augmented Generation | | 0
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model | Code | 4
LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression | Code | 0
Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection | | 0
EXCLAIM: An Explainable Cross-Modal Agentic System for Misinformation Detection with Hierarchical Retrieval | | 0
MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems | | 0
M-LLM Based Video Frame Selection for Efficient Video Understanding | | 0
End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models | | 0
Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI | | 0
Unraveling the geometry of visual relational reasoning | Code | 0
R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning | Code | 4
VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models | Code | 0
Visual Reasoning Evaluation of Grok, Deepseek Janus, Gemini, Qwen, Mistral, and ChatGPT | | 0
Chitrarth: Bridging Vision and Language for a Billion People | | 0
Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Code | 1
KnowZRel: Common Sense Knowledge-based Zero-Shot Relationship Retrieval for Generalised Scene Graph Generation | Code | 0
AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO | Code | 2
Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data | Code | 1
CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space | Code | 1
Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models | | 0
Learning to Stop Overthinking at Test Time | | 0
Page 3 of 14

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4o + CA | Text Score | 75.5 | | Unverified
2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 | | Unverified
3 | GPT-4V (pick b/w two options) | Text Score | 69.25 | | Unverified
4 | MMICL + CoCoT | Text Score | 64.25 | | Unverified
5 | GPT-4V + CoCoT | Text Score | 58.5 | | Unverified
6 | OpenFlamingo + CoCoT | Text Score | 58.25 | | Unverified
7 | GPT-4V | Text Score | 54.5 | | Unverified
8 | FIBER (EqSim) | Text Score | 51.5 | | Unverified
9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 | | Unverified
10 | MMICL + CCoT | Text Score | 51 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 91.51 | | Unverified
2 | X2-VLM (large) | Accuracy | 88.7 | | Unverified
3 | XFM (base) | Accuracy | 87.6 | | Unverified
4 | X2-VLM (base) | Accuracy | 86.2 | | Unverified
5 | CoCa | Accuracy | 86.1 | | Unverified
6 | VLMo | Accuracy | 85.64 | | Unverified
7 | VK-OOD | Accuracy | 84.6 | | Unverified
8 | SimVLM | Accuracy | 84.53 | | Unverified
9 | X-VLM (base) | Accuracy | 84.41 | | Unverified
10 | VK-OOD | Accuracy | 83.9 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 92.58 | | Unverified
2 | X2-VLM (large) | Accuracy | 89.4 | | Unverified
3 | XFM (base) | Accuracy | 88.4 | | Unverified
4 | X2-VLM (base) | Accuracy | 87 | | Unverified
5 | CoCa | Accuracy | 87 | | Unverified
6 | VLMo | Accuracy | 86.86 | | Unverified
7 | SimVLM | Accuracy | 85.15 | | Unverified
8 | X-VLM (base) | Accuracy | 84.76 | | Unverified
9 | BLIP-129M | Accuracy | 83.09 | | Unverified
10 | ALBEF (14M) | Accuracy | 82.55 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | AI Core | Average-per ques. | 95.24 | | Unverified
2 | redherring | Average-per ques. | 91.14 | | Unverified
3 | VRDP | Average-per ques. | 90.24 | | Unverified
4 | Fighttttt | Average-per ques. | 88.71 | | Unverified
5 | neural | Average-per ques. | 88.27 | | Unverified
6 | NERV | Average-per ques. | 88.05 | | Unverified
7 | DCL | Average-per ques. | 75.52 | | Unverified
8 | troublesolver | Average-per ques. | 73.3 | | Unverified
9 | v0.1 | Average-per ques. | 73.1 | | Unverified
10 | First_test | Average-per ques. | 69.65 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 | | Unverified
2 | GPT-4o + CA | 2-Class Accuracy | 92.8 | | Unverified
3 | Human | 2-Class Accuracy | 91 | | Unverified
4 | SNAIL | 2-Class Accuracy | 64 | | Unverified
5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 | | Unverified
6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 | | Unverified
7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 | | Unverified
8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 | | Unverified
9 | Otter | 2-Class Accuracy | 49.3 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Humans | Jaccard Index | 90 | | Unverified
2 | ViLT (Zero-Shot) | Jaccard Index | 52 | | Unverified
3 | X-VLM (Zero-Shot) | Jaccard Index | 46 | | Unverified
4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 | | Unverified
5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 | | Unverified
6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 | | Unverified
7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 | | Unverified
8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 | | Unverified
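The Jaccard Index used in the table above is the standard intersection-over-union of two sets. A minimal sketch, assuming the benchmark compares a model's predicted answer set against a human-annotated reference set (the set-of-labels framing here is an illustrative assumption, not taken from this leaderboard):

```python
def jaccard_index(predicted: set, reference: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|, in [0, 1]."""
    if not predicted and not reference:
        # Both empty: conventionally treated as a perfect match.
        return 1.0
    return len(predicted & reference) / len(predicted | reference)

# Hypothetical example: two of four distinct labels overlap.
pred = {"dog", "ball", "grass"}
ref = {"dog", "ball", "frisbee"}
print(round(jaccard_index(pred, ref) * 100))  # → 50, on the 0-100 scale used above
```

On this scale, the human score of 90 corresponds to near-total overlap between predicted and reference sets, while CLIP-ViL's 15 indicates very little.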
# | Model | Metric | Claimed | Verified | Status
1 | LXMERT | accuracy | 70.1 | | Unverified
2 | ViLT | accuracy | 69.3 | | Unverified
3 | CLIP (finetuned) | accuracy | 65.1 | | Unverified
4 | CLIP (frozen) | accuracy | 56 | | Unverified
5 | VisualBERT | accuracy | 55.2 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | RPIN | AUCCESS | 42.2 | | Unverified
2 | Dec[Joint]1f | AUCCESS | 40.3 | | Unverified
3 | Dynamics-Aware DQN | AUCCESS | 39.9 | | Unverified
4 | DQN | AUCCESS | 36.8 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | RPIN | AUCCESS | 85.2 | | Unverified
2 | Dynamics-Aware DQN | AUCCESS | 85.2 | | Unverified
3 | Dec[Joint]1f | AUCCESS | 80 | | Unverified
4 | DQN | AUCCESS | 77.6 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Swin | 1:1 Accuracy | 52.9 | | Unverified
2 | ConvNeXt | 1:1 Accuracy | 51.2 | | Unverified
3 | ViT | 1:1 Accuracy | 50.3 | | Unverified
4 | DEiT | 1:1 Accuracy | 47.2 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Humans | 1-of-100 Accuracy | 100 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VisualBERT | Accuracy (Dev) | 67.4 | | Unverified