SOTAVerified

Visual Reasoning

The ability to understand the actions and reasoning associated with visual images.

Papers

Showing 401–450 of 698 papers

| Title | Status | Hype |
| --- | --- | --- |
| Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners | | 0 |
| Navigating to Objects Specified by Images | | 0 |
| Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language | | 0 |
| Neural-guided, Bidirectional Program Search for Abstraction and Reasoning | | 0 |
| Dynamic Graph Attention for Referring Expression Comprehension | | 0 |
| Neural Structure Mapping For Learning Abstract Visual Analogies | | 0 |
| DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning | | 0 |
| Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation | | 0 |
| Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" | | 0 |
| Dual Local-Global Contextual Pathways for Recognition in Aerial Imagery | | 0 |
| VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge | | 0 |
| NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning | | 0 |
| DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests | | 0 |
| NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks | | 0 |
| Not-So-CLEVR: Visual Relations Strain Feedforward Neural Networks | | 0 |
| Doxing via the Lens: Revealing Location-related Privacy Leakage on Multi-modal Large Reasoning Models | | 0 |
| NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models | | 0 |
| Attention over learned object embeddings enables complex visual reasoning | | 0 |
| Object-Centric Diagnosis of Visual Reasoning | | 0 |
| Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing | | 0 |
| Object Ordering with Bidirectional Matchings for Visual Reasoning | | 0 |
| OC-NMN: Object-centric Compositional Neural Module Network for Generative Visual Analogical Reasoning | | 0 |
| 3D Concept Learning and Reasoning from Multi-View Images | | 0 |
| Visual Question Answering in the Medical Domain | | 0 |
| OmniAD: Detect and Understand Industrial Anomaly via Multimodal Reasoning | | 0 |
| Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models | | 0 |
| On Data Synthesis and Post-training for Visual Abstract Reasoning | | 0 |
| One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning | | 0 |
| One RL to See Them All: Visual Triple Unified Reinforcement Learning | | 0 |
| 3D Concept Grounding on Neural Fields | | 0 |
| Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis | | 0 |
| Two-stage Rule-induction Visual Reasoning on RPMs with an Application to Video Prediction | | 0 |
| On the Potential of CLIP for Compositional Logical Reasoning | | 0 |
| Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR | | 0 |
| Does Visual Pretraining Help End-to-End Reasoning? | | 0 |
| Open Set Video HOI detection from Action-Centric Chain-of-Look Prompting | | 0 |
| Does Structural Attention Improve Compositional Representations in Vision-Language Models? | | 0 |
| Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning | | 0 |
| Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting | | 0 |
| Open-World Visual Reasoning by a Neuro-Symbolic Program of Zero-Shot Symbols | | 0 |
| Visual Reasoning Evaluation of Grok, Deepseek Janus, Gemini, Qwen, Mistral, and ChatGPT | | 0 |
| Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? | | 0 |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | | 0 |
| Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning | | 0 |
| Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA | | 0 |
| Deep Visual Reasoning: Learning to Predict Action Sequences for Task and Motion Planning from an Initial Scene Image | | 0 |
| Perception Tokens Enhance Visual Reasoning in Multimodal Language Models | | 0 |
| PhD Thesis: Exploring the role of (self-)attention in cognitive and computer vision architecture | | 0 |
| Deep Reason: A Strong Baseline for Real-World Visual Reasoning | | 0 |
| Advancing Generalization Across a Variety of Abstract Visual Reasoning Tasks | | 0 |

Benchmark Results

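| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4o + CA | Text Score | 75.5 | | Unverified |
| 2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 | | Unverified |
| 3 | GPT-4V (pick b/w two options) | Text Score | 69.25 | | Unverified |
| 4 | MMICL + CoCoT | Text Score | 64.25 | | Unverified |
| 5 | GPT-4V + CoCoT | Text Score | 58.5 | | Unverified |
| 6 | OpenFlamingo + CoCoT | Text Score | 58.25 | | Unverified |
| 7 | GPT-4V | Text Score | 54.5 | | Unverified |
| 8 | FIBER (EqSim) | Text Score | 51.5 | | Unverified |
| 9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 | | Unverified |
| 10 | MMICL + CCoT | Text Score | 51 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | BEiT-3 | Accuracy | 91.51 | | Unverified |
| 2 | X2-VLM (large) | Accuracy | 88.7 | | Unverified |
| 3 | XFM (base) | Accuracy | 87.6 | | Unverified |
| 4 | X2-VLM (base) | Accuracy | 86.2 | | Unverified |
| 5 | CoCa | Accuracy | 86.1 | | Unverified |
| 6 | VLMo | Accuracy | 85.64 | | Unverified |
| 7 | VK-OOD | Accuracy | 84.6 | | Unverified |
| 8 | SimVLM | Accuracy | 84.53 | | Unverified |
| 9 | X-VLM (base) | Accuracy | 84.41 | | Unverified |
| 10 | VK-OOD | Accuracy | 83.9 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | BEiT-3 | Accuracy | 92.58 | | Unverified |
| 2 | X2-VLM (large) | Accuracy | 89.4 | | Unverified |
| 3 | XFM (base) | Accuracy | 88.4 | | Unverified |
| 4 | X2-VLM (base) | Accuracy | 87 | | Unverified |
| 5 | CoCa | Accuracy | 87 | | Unverified |
| 6 | VLMo | Accuracy | 86.86 | | Unverified |
| 7 | SimVLM | Accuracy | 85.15 | | Unverified |
| 8 | X-VLM (base) | Accuracy | 84.76 | | Unverified |
| 9 | BLIP-129M | Accuracy | 83.09 | | Unverified |
| 10 | ALBEF (14M) | Accuracy | 82.55 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | AI Core | Average-per ques. | 95.24 | | Unverified |
| 2 | redherring | Average-per ques. | 91.14 | | Unverified |
| 3 | VRDP | Average-per ques. | 90.24 | | Unverified |
| 4 | Fightttttt | Average-per ques. | 88.71 | | Unverified |
| 5 | neural | Average-per ques. | 88.27 | | Unverified |
| 6 | NERV | Average-per ques. | 88.05 | | Unverified |
| 7 | DCL | Average-per ques. | 75.52 | | Unverified |
| 8 | troublesolver | Average-per ques. | 73.3 | | Unverified |
| 9 | v0.1 | Average-per ques. | 73.1 | | Unverified |
| 10 | First_test | Average-per ques. | 69.65 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 | | Unverified |
| 2 | GPT-4o + CA | 2-Class Accuracy | 92.8 | | Unverified |
| 3 | Human | 2-Class Accuracy | 91 | | Unverified |
| 4 | SNAIL | 2-Class Accuracy | 64 | | Unverified |
| 5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 | | Unverified |
| 6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 | | Unverified |
| 7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 | | Unverified |
| 8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 | | Unverified |
| 9 | Otter | 2-Class Accuracy | 49.3 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Humans | Jaccard Index | 90 | | Unverified |
| 2 | ViLT (Zero-Shot) | Jaccard Index | 52 | | Unverified |
| 3 | X-VLM (Zero-Shot) | Jaccard Index | 46 | | Unverified |
| 4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 | | Unverified |
| 5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 | | Unverified |
| 6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 | | Unverified |
| 7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 | | Unverified |
| 8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | LXMERT | accuracy | 70.1 | | Unverified |
| 2 | ViLT | accuracy | 69.3 | | Unverified |
| 3 | CLIP (finetuned) | accuracy | 65.1 | | Unverified |
| 4 | CLIP (frozen) | accuracy | 56 | | Unverified |
| 5 | VisualBERT | accuracy | 55.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | RPIN | AUCCESS | 42.2 | | Unverified |
| 2 | Dec[Joint]1f | AUCCESS | 40.3 | | Unverified |
| 3 | Dynamics-Aware DQN | AUCCESS | 39.9 | | Unverified |
| 4 | DQN | AUCCESS | 36.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | RPIN | AUCCESS | 85.2 | | Unverified |
| 2 | Dynamics-Aware DQN | AUCCESS | 85.2 | | Unverified |
| 3 | Dec[Joint]1f | AUCCESS | 80 | | Unverified |
| 4 | DQN | AUCCESS | 77.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Swin | 1:1 Accuracy | 52.9 | | Unverified |
| 2 | ConvNeXt | 1:1 Accuracy | 51.2 | | Unverified |
| 3 | ViT | 1:1 Accuracy | 50.3 | | Unverified |
| 4 | DEiT | 1:1 Accuracy | 47.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Humans | 1-of-100 Accuracy | 100 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | VisualBERT | Accuracy (Dev) | 67.4 | | Unverified |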