SOTAVerified

Visual Reasoning

The ability to understand the actions shown in visual images and to reason about them.

Papers

Showing 301-350 of 698 papers

Title | Status | Hype
Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models | Code | 0
Prompting Large Vision-Language Models for Compositional Reasoning | Code | 0
QLEVR: A Diagnostic Dataset for Quantificational Language and Elementary Visual Reasoning | Code | 0
Raven's Progressive Matrices Completion with Latent Gaussian Process Priors | Code | 0
Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning | Code | 0
RVTBench: A Benchmark for Visual Reasoning Tasks | Code | 0
SAViR-T: Spatially Attentive Visual Reasoning with Transformers | Code | 0
Slot Abstractors: Toward Scalable Abstract Visual Reasoning | Code | 0
Smart Home Appliances: Chat with Your Fridge | Code | 0
Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild | Code | 0
Solving ARC visual analogies with neural embeddings and vector arithmetic: A generalized method | Code | 0
STAR-R1: Spacial TrAnsformation Reasoning by Reinforcing Multimodal LLMs | Code | 0
Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages | Code | 0
Systematic Visual Reasoning through Object-Centric Relational Abstraction | Code | 0
TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images | Code | 0
Techniques for Symbol Grounding with SATNet | Code | 0
Temporal Reasoning via Audio Question Answering | Code | 0
TGraphX: Tensor-Aware Graph Neural Network for Multi-Dimensional Feature Learning | Code | 0
The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning | Code | 0
Five Points to Check when Comparing Visual Perception in Humans and Machines | Code | 0
Thinking with Generated Images | Code | 0
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks | Code | 0
Toward Multi-Granularity Decision-Making: Explicit Visual Reasoning with Hierarchical Knowledge | Code | 0
UniT: Multimodal Multitask Learning with a Unified Transformer | Code | 0
Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning | Code | 0
Unicode Analogies: An Anti-Objectivist Visual Reasoning Challenge | Code | 0
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | Code | 0
Unraveling the geometry of visual relational reasoning | Code | 0
VASR: Visual Analogies of Situation Recognition | Code | 0
VDebugger: Harnessing Execution Feedback for Debugging Visual Programs | Code | 0
ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese | Code | 0
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling | Code | 0
VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models | Code | 0
VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives | Code | 0
Visual Choice of Plausible Alternatives: An Evaluation of Image-based Commonsense Causal Reasoning | Code | 0
Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset | Code | 0
Visual Question Answering From Another Perspective: CLEVR Mental Rotation Tests | Code | 0
Visual Reasoning and Multi-Agent Approach in Multimodal Large Language Models (MLLMs): Solving TSP and mTSP Combinatorial Challenges | Code | 0
Visual Reasoning by Progressive Module Networks | Code | 0
Visual Reasoning in Object-Centric Deep Neural Networks: A Comparative Cognition Approach | Code | 0
Visual Reasoning with Multi-hop Feature Modulation | Code | 0
Visual Transformation Telling | Code | 0
V-LoL: A Diagnostic Dataset for Visual Logical Learning | Code | 0
VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism | Code | 0
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding | Code | 0
Weakly Supervised Relative Spatial Reasoning for Visual Question Answering | Code | 0
Weakly-supervised Semantic Parsing with Abstract Examples | Code | 0
What Is Missing in Multilingual Visual Reasoning and How to Fix It | Code | 0
What is the Visual Cognition Gap between Humans and Multimodal LLMs? | Code | 0
When Causal Intervention Meets Adversarial Examples and Image Masking for Deep Neural Networks | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4o + CA | Text Score | 75.5 |  | Unverified
2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 |  | Unverified
3 | GPT-4V (pick b/w two options) | Text Score | 69.25 |  | Unverified
4 | MMICL + CoCoT | Text Score | 64.25 |  | Unverified
5 | GPT-4V + CoCoT | Text Score | 58.5 |  | Unverified
6 | OpenFlamingo + CoCoT | Text Score | 58.25 |  | Unverified
7 | GPT-4V | Text Score | 54.5 |  | Unverified
8 | FIBER (EqSim) | Text Score | 51.5 |  | Unverified
9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 |  | Unverified
10 | MMICL + CCoT | Text Score | 51 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 91.51 |  | Unverified
2 | X2-VLM (large) | Accuracy | 88.7 |  | Unverified
3 | XFM (base) | Accuracy | 87.6 |  | Unverified
4 | X2-VLM (base) | Accuracy | 86.2 |  | Unverified
5 | CoCa | Accuracy | 86.1 |  | Unverified
6 | VLMo | Accuracy | 85.64 |  | Unverified
7 | VK-OOD | Accuracy | 84.6 |  | Unverified
8 | SimVLM | Accuracy | 84.53 |  | Unverified
9 | X-VLM (base) | Accuracy | 84.41 |  | Unverified
10 | VK-OOD | Accuracy | 83.9 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 92.58 |  | Unverified
2 | X2-VLM (large) | Accuracy | 89.4 |  | Unverified
3 | XFM (base) | Accuracy | 88.4 |  | Unverified
4 | CoCa | Accuracy | 87 |  | Unverified
5 | X2-VLM (base) | Accuracy | 87 |  | Unverified
6 | VLMo | Accuracy | 86.86 |  | Unverified
7 | SimVLM | Accuracy | 85.15 |  | Unverified
8 | X-VLM (base) | Accuracy | 84.76 |  | Unverified
9 | BLIP-129M | Accuracy | 83.09 |  | Unverified
10 | ALBEF (14M) | Accuracy | 82.55 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | AI Core | Average-per ques. | 95.24 |  | Unverified
2 | redherring | Average-per ques. | 91.14 |  | Unverified
3 | VRDP | Average-per ques. | 90.24 |  | Unverified
4 | Fightttttt | Average-per ques. | 88.71 |  | Unverified
5 | neural | Average-per ques. | 88.27 |  | Unverified
6 | NERV | Average-per ques. | 88.05 |  | Unverified
7 | DCL | Average-per ques. | 75.52 |  | Unverified
8 | troublesolver | Average-per ques. | 73.3 |  | Unverified
9 | v0.1 | Average-per ques. | 73.1 |  | Unverified
10 | First_test | Average-per ques. | 69.65 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 |  | Unverified
2 | GPT-4o + CA | 2-Class Accuracy | 92.8 |  | Unverified
3 | Human | 2-Class Accuracy | 91 |  | Unverified
4 | SNAIL | 2-Class Accuracy | 64 |  | Unverified
5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 |  | Unverified
6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 |  | Unverified
7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 |  | Unverified
8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 |  | Unverified
9 | Otter | 2-Class Accuracy | 49.3 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Humans | Jaccard Index | 90 |  | Unverified
2 | ViLT (Zero-Shot) | Jaccard Index | 52 |  | Unverified
3 | X-VLM (Zero-Shot) | Jaccard Index | 46 |  | Unverified
4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 |  | Unverified
5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 |  | Unverified
6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 |  | Unverified
7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 |  | Unverified
8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | LXMERT | accuracy | 70.1 |  | Unverified
2 | ViLT | accuracy | 69.3 |  | Unverified
3 | CLIP (finetuned) | accuracy | 65.1 |  | Unverified
4 | CLIP (frozen) | accuracy | 56 |  | Unverified
5 | VisualBERT | accuracy | 55.2 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | RPIN | AUCCESS | 42.2 |  | Unverified
2 | Dec[Joint]1f | AUCCESS | 40.3 |  | Unverified
3 | Dynamics-Aware DQN | AUCCESS | 39.9 |  | Unverified
4 | DQN | AUCCESS | 36.8 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Dynamics-Aware DQN | AUCCESS | 85.2 |  | Unverified
2 | RPIN | AUCCESS | 85.2 |  | Unverified
3 | Dec[Joint]1f | AUCCESS | 80 |  | Unverified
4 | DQN | AUCCESS | 77.6 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Swin | 1:1 Accuracy | 52.9 |  | Unverified
2 | ConvNeXt | 1:1 Accuracy | 51.2 |  | Unverified
3 | ViT | 1:1 Accuracy | 50.3 |  | Unverified
4 | DEiT | 1:1 Accuracy | 47.2 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Humans | 1-of-100 Accuracy | 100 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VisualBERT | Accuracy (Dev) | 67.4 |  | Unverified