SOTAVerified

Visual Reasoning

The ability to understand the actions and reasoning associated with visual images.

Papers

Showing 201–250 of 698 papers

| Title | Status | Hype |
| --- | --- | --- |
| Agentic Keyframe Search for Video Question Answering | Code | 1 |
| Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | Code | 1 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | Code | 1 |
| Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification | Code | 1 |
| Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs | Code | 1 |
| LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models | Code | 1 |
| Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | Code | 1 |
| Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT | Code | 1 |
| FiLM: Visual Reasoning with a General Conditioning Layer | Code | 1 |
| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Code | 1 |
| SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning | Code | 1 |
| ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension | Code | 1 |
| Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Code | 1 |
| Forward Prediction for Physical Reasoning | Code | 1 |
| ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models | Code | 1 |
| Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training | Code | 1 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Code | 1 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | Code | 1 |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Code | 1 |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | Code | 1 |
| From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis | Code | 1 |
| Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models | Code | 1 |
| See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning | Code | 1 |
| Abstract Visual Reasoning: An Algebraic Approach for Solving Raven's Progressive Matrices | Code | 1 |
| Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos | Code | 1 |
| Complete 3D Scene Parsing from an RGBD Image | Code | 0 |
| HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models | Code | 0 |
| Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning | Code | 0 |
| QLEVR: A Diagnostic Dataset for Quantificational Language and Elementary Visual Reasoning | Code | 0 |
| Raven's Progressive Matrices Completion with Latent Gaussian Process Priors | Code | 0 |
| Prompting Large Vision-Language Models for Compositional Reasoning | Code | 0 |
| Program synthesis performance constrained by non-linear spatial relations in Synthetic Visual Reasoning Test | Code | 0 |
| Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models | Code | 0 |
| Grounded Reinforcement Learning for Visual Reasoning | Code | 0 |
| A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering | Code | 0 |
| Physical Reasoning Using Dynamics-Aware Models | Code | 0 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | Code | 0 |
| GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives | Code | 0 |
| A Survey on Multimodal Large Language Models | Code | 0 |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | Code | 0 |
| Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning | Code | 0 |
| Predicting Complete 3D Models of Indoor Scenes | Code | 0 |
| Collecting Visually-Grounded Dialogue with A Game Of Sorts | Code | 0 |
| CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions | Code | 0 |
| Odd-One-Out Representation Learning | Code | 0 |
| CLEVR Parser: A Graph Parser Library for Geometric Learning on Language Grounded Image Scenes | Code | 0 |
| CLEVRER: CoLlision Events for Video REpresentation and Reasoning | Code | 0 |
| Object Level Visual Reasoning in Videos | Code | 0 |
| Abstracting Concept-Changing Rules for Solving Raven's Progressive Matrix Problems | Code | 0 |
| Attention over learned object embeddings enables complex visual reasoning | Code | 0 |

Benchmark Results

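| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4o + CA | Text Score | 75.5 | | Unverified |
| 2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 | | Unverified |
| 3 | GPT-4V (pick b/w two options) | Text Score | 69.25 | | Unverified |
| 4 | MMICL + CoCoT | Text Score | 64.25 | | Unverified |
| 5 | GPT-4V + CoCoT | Text Score | 58.5 | | Unverified |
| 6 | OpenFlamingo + CoCoT | Text Score | 58.25 | | Unverified |
| 7 | GPT-4V | Text Score | 54.5 | | Unverified |
| 8 | FIBER (EqSim) | Text Score | 51.5 | | Unverified |
| 9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 | | Unverified |
| 10 | MMICL + CCoT | Text Score | 51 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | BEiT-3 | Accuracy | 91.51 | | Unverified |
| 2 | X2-VLM (large) | Accuracy | 88.7 | | Unverified |
| 3 | XFM (base) | Accuracy | 87.6 | | Unverified |
| 4 | X2-VLM (base) | Accuracy | 86.2 | | Unverified |
| 5 | CoCa | Accuracy | 86.1 | | Unverified |
| 6 | VLMo | Accuracy | 85.64 | | Unverified |
| 7 | VK-OOD | Accuracy | 84.6 | | Unverified |
| 8 | SimVLM | Accuracy | 84.53 | | Unverified |
| 9 | X-VLM (base) | Accuracy | 84.41 | | Unverified |
| 10 | VK-OOD | Accuracy | 83.9 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | BEiT-3 | Accuracy | 92.58 | | Unverified |
| 2 | X2-VLM (large) | Accuracy | 89.4 | | Unverified |
| 3 | XFM (base) | Accuracy | 88.4 | | Unverified |
| 4 | X2-VLM (base) | Accuracy | 87 | | Unverified |
| 5 | CoCa | Accuracy | 87 | | Unverified |
| 6 | VLMo | Accuracy | 86.86 | | Unverified |
| 7 | SimVLM | Accuracy | 85.15 | | Unverified |
| 8 | X-VLM (base) | Accuracy | 84.76 | | Unverified |
| 9 | BLIP-129M | Accuracy | 83.09 | | Unverified |
| 10 | ALBEF (14M) | Accuracy | 82.55 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | AI Core | Average-per ques. | 95.24 | | Unverified |
| 2 | redherring | Average-per ques. | 91.14 | | Unverified |
| 3 | VRDP | Average-per ques. | 90.24 | | Unverified |
| 4 | Fighttttt | Average-per ques. | 88.71 | | Unverified |
| 5 | neural | Average-per ques. | 88.27 | | Unverified |
| 6 | NERV | Average-per ques. | 88.05 | | Unverified |
| 7 | DCL | Average-per ques. | 75.52 | | Unverified |
| 8 | troublesolver | Average-per ques. | 73.3 | | Unverified |
| 9 | v0.1 | Average-per ques. | 73.1 | | Unverified |
| 10 | First_test | Average-per ques. | 69.65 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 | | Unverified |
| 2 | GPT-4o + CA | 2-Class Accuracy | 92.8 | | Unverified |
| 3 | Human | 2-Class Accuracy | 91 | | Unverified |
| 4 | SNAIL | 2-Class Accuracy | 64 | | Unverified |
| 5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 | | Unverified |
| 6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 | | Unverified |
| 7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 | | Unverified |
| 8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 | | Unverified |
| 9 | Otter | 2-Class Accuracy | 49.3 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Humans | Jaccard Index | 90 | | Unverified |
| 2 | ViLT (Zero-Shot) | Jaccard Index | 52 | | Unverified |
| 3 | X-VLM (Zero-Shot) | Jaccard Index | 46 | | Unverified |
| 4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 | | Unverified |
| 5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 | | Unverified |
| 6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 | | Unverified |
| 7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 | | Unverified |
| 8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | LXMERT | accuracy | 70.1 | | Unverified |
| 2 | ViLT | accuracy | 69.3 | | Unverified |
| 3 | CLIP (finetuned) | accuracy | 65.1 | | Unverified |
| 4 | CLIP (frozen) | accuracy | 56 | | Unverified |
| 5 | VisualBERT | accuracy | 55.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | RPIN | AUCCESS | 42.2 | | Unverified |
| 2 | Dec[Joint]1f | AUCCESS | 40.3 | | Unverified |
| 3 | Dynamics-Aware DQN | AUCCESS | 39.9 | | Unverified |
| 4 | DQN | AUCCESS | 36.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | RPIN | AUCCESS | 85.2 | | Unverified |
| 2 | Dynamics-Aware DQN | AUCCESS | 85.2 | | Unverified |
| 3 | Dec[Joint]1f | AUCCESS | 80 | | Unverified |
| 4 | DQN | AUCCESS | 77.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Swin | 1:1 Accuracy | 52.9 | | Unverified |
| 2 | ConvNeXt | 1:1 Accuracy | 51.2 | | Unverified |
| 3 | ViT | 1:1 Accuracy | 50.3 | | Unverified |
| 4 | DEiT | 1:1 Accuracy | 47.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Humans | 1-of-100 Accuracy | 100 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | VisualBERT | Accuracy (Dev) | 67.4 | | Unverified |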