SOTAVerified

Visual Reasoning

The ability to understand the actions and reasoning associated with visual images.

Papers

Showing 251-300 of 698 papers

| Title | Status | Hype |
|---|---|---|
| Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding | Code | 1 |
| Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA |  | 0 |
| Visual Reasoning and Multi-Agent Approach in Multimodal Large Language Models (MLLMs): Solving TSP and mTSP Combinatorial Challenges | Code | 0 |
| Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration |  | 0 |
| Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects | Code | 0 |
| Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities |  | 0 |
| GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs |  | 0 |
| VDebugger: Harnessing Execution Feedback for Debugging Visual Programs | Code | 0 |
| RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding | Code | 1 |
| Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning |  | 0 |
| Slot State Space Models | Code | 1 |
| ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension | Code | 1 |
| A Unified View of Abstract Visual Reasoning Problems |  | 0 |
| A-I-RAVEN and I-RAVEN-Mesh: Two New Benchmarks for Abstract Visual Reasoning |  | 0 |
| What is the Visual Cognition Gap between Humans and Multimodal LLMs? | Code | 0 |
| Neural Concept Binder | Code | 1 |
| Comparison Visual Instruction Tuning |  | 0 |
| Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models | Code | 3 |
| INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Code | 1 |
| Eyeballing Combinatorial Problems: A Case Study of Using Multimodal Large Language Models to Solve Traveling Salesman Problems |  | 0 |
| HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model |  | 0 |
| MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning |  | 0 |
| Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR |  | 0 |
| Code Repair with LLMs gives an Exploration-Exploitation Tradeoff |  | 0 |
| Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs | Code | 1 |
| Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models |  | 0 |
| Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model |  | 0 |
| CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering |  | 0 |
| Learning to Compose: Improving Object Centric Learning by Injecting Compositionality | Code | 0 |
| Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners |  | 0 |
| BlenderAlchemy: Editing 3D Graphics with Vision-Language Models |  | 0 |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Code | 2 |
| Cantor: Inspiring Multimodal Chain-of-Thought of MLLM |  | 0 |
| Think-Program-reCtify: 3D Situated Reasoning with Large Language Models |  | 0 |
| MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning | Code | 0 |
| Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases |  | 0 |
| MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems | Code | 1 |
| Visually Descriptive Language Model for Vector Graphics Reasoning | Code | 9 |
| Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry |  | 0 |
| Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models |  | 0 |
| Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | Code | 1 |
| PropTest: Automatic Property Testing for Improved Visual Programming |  | 0 |
| LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models | Code | 2 |
| VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding | Code | 0 |
| Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models | Code | 1 |
| HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | Code | 1 |
| Just Say the Name: Online Continual Learning with Category Names Only via Data Generation |  | 0 |
| Test-time Distribution Learning Adapter for Cross-modal Visual Reasoning |  | 0 |
| How Far Are We from Intelligent Visual Deductive Reasoning? | Code | 1 |
| Slot Abstractors: Toward Scalable Abstract Visual Reasoning | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4o + CA | Text Score | 75.5 |  | Unverified |
| 2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 |  | Unverified |
| 3 | GPT-4V (pick b/w two options) | Text Score | 69.25 |  | Unverified |
| 4 | MMICL + CoCoT | Text Score | 64.25 |  | Unverified |
| 5 | GPT-4V + CoCoT | Text Score | 58.5 |  | Unverified |
| 6 | OpenFlamingo + CoCoT | Text Score | 58.25 |  | Unverified |
| 7 | GPT-4V | Text Score | 54.5 |  | Unverified |
| 8 | FIBER (EqSim) | Text Score | 51.5 |  | Unverified |
| 9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 |  | Unverified |
| 10 | MMICL + CCoT | Text Score | 51 |  | Unverified |

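| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | BEiT-3 | Accuracy | 91.51 |  | Unverified |
| 2 | X2-VLM (large) | Accuracy | 88.7 |  | Unverified |
| 3 | XFM (base) | Accuracy | 87.6 |  | Unverified |
| 4 | X2-VLM (base) | Accuracy | 86.2 |  | Unverified |
| 5 | CoCa | Accuracy | 86.1 |  | Unverified |
| 6 | VLMo | Accuracy | 85.64 |  | Unverified |
| 7 | VK-OOD | Accuracy | 84.6 |  | Unverified |
| 8 | SimVLM | Accuracy | 84.53 |  | Unverified |
| 9 | X-VLM (base) | Accuracy | 84.41 |  | Unverified |
| 10 | VK-OOD | Accuracy | 83.9 |  | Unverified |
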
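| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | BEiT-3 | Accuracy | 92.58 |  | Unverified |
| 2 | X2-VLM (large) | Accuracy | 89.4 |  | Unverified |
| 3 | XFM (base) | Accuracy | 88.4 |  | Unverified |
| 4 | X2-VLM (base) | Accuracy | 87 |  | Unverified |
| 5 | CoCa | Accuracy | 87 |  | Unverified |
| 6 | VLMo | Accuracy | 86.86 |  | Unverified |
| 7 | SimVLM | Accuracy | 85.15 |  | Unverified |
| 8 | X-VLM (base) | Accuracy | 84.76 |  | Unverified |
| 9 | BLIP-129M | Accuracy | 83.09 |  | Unverified |
| 10 | ALBEF (14M) | Accuracy | 82.55 |  | Unverified |
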
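| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | AI Core | Average-per ques. | 95.24 |  | Unverified |
| 2 | redherring | Average-per ques. | 91.14 |  | Unverified |
| 3 | VRDP | Average-per ques. | 90.24 |  | Unverified |
| 4 | Fightttttt | Average-per ques. | 88.71 |  | Unverified |
| 5 | neural | Average-per ques. | 88.27 |  | Unverified |
| 6 | NERV | Average-per ques. | 88.05 |  | Unverified |
| 7 | DCL | Average-per ques. | 75.52 |  | Unverified |
| 8 | troublesolver | Average-per ques. | 73.3 |  | Unverified |
| 9 | v0.1 | Average-per ques. | 73.1 |  | Unverified |
| 10 | First_test | Average-per ques. | 69.65 |  | Unverified |
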
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 |  | Unverified |
| 2 | GPT-4o + CA | 2-Class Accuracy | 92.8 |  | Unverified |
| 3 | Human | 2-Class Accuracy | 91 |  | Unverified |
| 4 | SNAIL | 2-Class Accuracy | 64 |  | Unverified |
| 5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 |  | Unverified |
| 6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 |  | Unverified |
| 7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 |  | Unverified |
| 8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 |  | Unverified |
| 9 | Otter | 2-Class Accuracy | 49.3 |  | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Humans | Jaccard Index | 90 |  | Unverified |
| 2 | ViLT (Zero-Shot) | Jaccard Index | 52 |  | Unverified |
| 3 | X-VLM (Zero-Shot) | Jaccard Index | 46 |  | Unverified |
| 4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 |  | Unverified |
| 5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 |  | Unverified |
| 6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 |  | Unverified |
| 7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 |  | Unverified |
| 8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 |  | Unverified |

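| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | LXMERT | accuracy | 70.1 |  | Unverified |
| 2 | ViLT | accuracy | 69.3 |  | Unverified |
| 3 | CLIP (finetuned) | accuracy | 65.1 |  | Unverified |
| 4 | CLIP (frozen) | accuracy | 56 |  | Unverified |
| 5 | VisualBERT | accuracy | 55.2 |  | Unverified |
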
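| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | RPIN | AUCCESS | 42.2 |  | Unverified |
| 2 | Dec[Joint]1f | AUCCESS | 40.3 |  | Unverified |
| 3 | Dynamics-Aware DQN | AUCCESS | 39.9 |  | Unverified |
| 4 | DQN | AUCCESS | 36.8 |  | Unverified |
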
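| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | RPIN | AUCCESS | 85.2 |  | Unverified |
| 2 | Dynamics-Aware DQN | AUCCESS | 85.2 |  | Unverified |
| 3 | Dec[Joint]1f | AUCCESS | 80 |  | Unverified |
| 4 | DQN | AUCCESS | 77.6 |  | Unverified |
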
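| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Swin | 1:1 Accuracy | 52.9 |  | Unverified |
| 2 | ConvNeXt | 1:1 Accuracy | 51.2 |  | Unverified |
| 3 | ViT | 1:1 Accuracy | 50.3 |  | Unverified |
| 4 | DEiT | 1:1 Accuracy | 47.2 |  | Unverified |
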
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Humans | 1-of-100 Accuracy | 100 |  | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VisualBERT | Accuracy (Dev) | 67.4 |  | Unverified |