SOTAVerified

Visual Reasoning

The ability to understand actions and reasoning associated with visual images.

Papers

Showing 201–250 of 698 papers

| Title | Status | Hype |
|---|---|---|
| Distill Visual Chart Reasoning Ability from LLMs to MLLMs | Code | 2 |
| CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Code | 1 |
| ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom | — | 0 |
| HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks | Code | 1 |
| MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark | Code | 0 |
| ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization | — | 0 |
| Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks | Code | 1 |
| TVBench: Redesigning Video-Language Evaluation | — | 0 |
| Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects | Code | 1 |
| Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends | — | 0 |
| Mind the GAP: Glimpse-based Active Perception improves generalization and sample efficiency of visual reasoning | Code | 0 |
| From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Code | 1 |
| Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing | — | 0 |
| GSON: A Group-based Social Navigation Framework with Large Multimodal Model | — | 0 |
| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Code | 1 |
| Enhancing Advanced Visual Reasoning Ability of Large Language Models | — | 0 |
| Impact of ML Optimization Tactics on Greener Pre-Trained ML Models | — | 0 |
| JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images | Code | 0 |
| What Makes a Maze Look Like a Maze? | — | 0 |
| Critical Features Tracking on Triangulated Irregular Networks by a Scale-Space Method | — | 0 |
| MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct | — | 0 |
| How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model? | Code | 0 |
| Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis | — | 0 |
| Multi-Modal Dialogue State Tracking for Playing GuessWhich Game | Code | 0 |
| UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling | Code | 3 |
| ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling | Code | 0 |
| Compromising Embodied Agents with Contextual Backdoor Attacks | — | 0 |
| ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning | — | 0 |
| Chat2Layout: Interactive 3D Furniture Layout with a Multimodal LLM | — | 0 |
| A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap | Code | 0 |
| Pyramid Coder: Hierarchical Code Generator for Compositional Visual Question Answering | — | 0 |
| Take A Step Back: Rethinking the Two Stages in Visual Reasoning | — | 0 |
| Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs | Code | 1 |
| KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models | Code | 1 |
| Untrained neural networks can demonstrate memorization-independent abstract reasoning | Code | 0 |
| LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models | Code | 1 |
| Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators | — | 0 |
| I Know About "Up"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction | — | 0 |
| X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs | — | 0 |
| Open-World Visual Reasoning by a Neuro-Symbolic Program of Zero-Shot Symbols | — | 0 |
| SwitchCIT: Switching for Continual Instruction Tuning | — | 0 |
| NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models | — | 0 |
| Affordance-Guided Reinforcement Learning via Visual Prompting | — | 0 |
| NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning | — | 0 |
| Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model | Code | 2 |
| TokenPacker: Efficient Visual Projector for Multimodal LLM | Code | 3 |
| We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? | Code | 2 |
| From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis | Code | 1 |
| MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics? | — | 0 |
| Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA | — | 0 |
Page 5 of 14

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4o + CA | Text Score | 75.5 | — | Unverified |
| 2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 | — | Unverified |
| 3 | GPT-4V (pick b/w two options) | Text Score | 69.25 | — | Unverified |
| 4 | MMICL + CoCoT | Text Score | 64.25 | — | Unverified |
| 5 | GPT-4V + CoCoT | Text Score | 58.5 | — | Unverified |
| 6 | OpenFlamingo + CoCoT | Text Score | 58.25 | — | Unverified |
| 7 | GPT-4V | Text Score | 54.5 | — | Unverified |
| 8 | FIBER (EqSim) | Text Score | 51.5 | — | Unverified |
| 9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 | — | Unverified |
| 10 | MMICL + CCoT | Text Score | 51 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | BEiT-3 | Accuracy | 91.51 | — | Unverified |
| 2 | X2-VLM (large) | Accuracy | 88.7 | — | Unverified |
| 3 | XFM (base) | Accuracy | 87.6 | — | Unverified |
| 4 | X2-VLM (base) | Accuracy | 86.2 | — | Unverified |
| 5 | CoCa | Accuracy | 86.1 | — | Unverified |
| 6 | VLMo | Accuracy | 85.64 | — | Unverified |
| 7 | VK-OOD | Accuracy | 84.6 | — | Unverified |
| 8 | SimVLM | Accuracy | 84.53 | — | Unverified |
| 9 | X-VLM (base) | Accuracy | 84.41 | — | Unverified |
| 10 | VK-OOD | Accuracy | 83.9 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | BEiT-3 | Accuracy | 92.58 | — | Unverified |
| 2 | X2-VLM (large) | Accuracy | 89.4 | — | Unverified |
| 3 | XFM (base) | Accuracy | 88.4 | — | Unverified |
| 4 | CoCa | Accuracy | 87 | — | Unverified |
| 5 | X2-VLM (base) | Accuracy | 87 | — | Unverified |
| 6 | VLMo | Accuracy | 86.86 | — | Unverified |
| 7 | SimVLM | Accuracy | 85.15 | — | Unverified |
| 8 | X-VLM (base) | Accuracy | 84.76 | — | Unverified |
| 9 | BLIP-129M | Accuracy | 83.09 | — | Unverified |
| 10 | ALBEF (14M) | Accuracy | 82.55 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | AI Core | Average-per ques. | 95.24 | — | Unverified |
| 2 | redherring | Average-per ques. | 91.14 | — | Unverified |
| 3 | VRDP | Average-per ques. | 90.24 | — | Unverified |
| 4 | Fighttttt | Average-per ques. | 88.71 | — | Unverified |
| 5 | neural | Average-per ques. | 88.27 | — | Unverified |
| 6 | NERV | Average-per ques. | 88.05 | — | Unverified |
| 7 | DCL | Average-per ques. | 75.52 | — | Unverified |
| 8 | troublesolver | Average-per ques. | 73.3 | — | Unverified |
| 9 | v0.1 | Average-per ques. | 73.1 | — | Unverified |
| 10 | First_test | Average-per ques. | 69.65 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 | — | Unverified |
| 2 | GPT-4o + CA | 2-Class Accuracy | 92.8 | — | Unverified |
| 3 | Human | 2-Class Accuracy | 91 | — | Unverified |
| 4 | SNAIL | 2-Class Accuracy | 64 | — | Unverified |
| 5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 | — | Unverified |
| 6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 | — | Unverified |
| 7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 | — | Unverified |
| 8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 | — | Unverified |
| 9 | Otter | 2-Class Accuracy | 49.3 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Humans | Jaccard Index | 90 | — | Unverified |
| 2 | ViLT (Zero-Shot) | Jaccard Index | 52 | — | Unverified |
| 3 | X-VLM (Zero-Shot) | Jaccard Index | 46 | — | Unverified |
| 4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 | — | Unverified |
| 5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 | — | Unverified |
| 6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 | — | Unverified |
| 7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 | — | Unverified |
| 8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | LXMERT | Accuracy | 70.1 | — | Unverified |
| 2 | ViLT | Accuracy | 69.3 | — | Unverified |
| 3 | CLIP (finetuned) | Accuracy | 65.1 | — | Unverified |
| 4 | CLIP (frozen) | Accuracy | 56 | — | Unverified |
| 5 | VisualBERT | Accuracy | 55.2 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | RPIN | AUCCESS | 42.2 | — | Unverified |
| 2 | Dec[Joint]1f | AUCCESS | 40.3 | — | Unverified |
| 3 | Dynamics-Aware DQN | AUCCESS | 39.9 | — | Unverified |
| 4 | DQN | AUCCESS | 36.8 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Dynamics-Aware DQN | AUCCESS | 85.2 | — | Unverified |
| 2 | RPIN | AUCCESS | 85.2 | — | Unverified |
| 3 | Dec[Joint]1f | AUCCESS | 80 | — | Unverified |
| 4 | DQN | AUCCESS | 77.6 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Swin | 1:1 Accuracy | 52.9 | — | Unverified |
| 2 | ConvNeXt | 1:1 Accuracy | 51.2 | — | Unverified |
| 3 | ViT | 1:1 Accuracy | 50.3 | — | Unverified |
| 4 | DEiT | 1:1 Accuracy | 47.2 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Humans | 1-of-100 Accuracy | 100 | — | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VisualBERT | Accuracy (Dev) | 67.4 | — | Unverified |