SOTA Verified

Visual Reasoning

The ability to understand actions and reasoning associated with visual images.

Papers

Showing 101–150 of 698 papers

Title | Status | Hype
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Code | 1
Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs | Code | 1
KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models | Code | 1
LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models | Code | 1
From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis | Code | 1
Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding | Code | 1
Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos | Code | 1
RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding | Code | 1
Slot State Space Models | Code | 1
ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension | Code | 1
Neural Concept Binder | Code | 1
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Code | 1
Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs | Code | 1
MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems | Code | 1
Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | Code | 1
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | Code | 1
Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models | Code | 1
How Far Are We from Intelligent Visual Deductive Reasoning? | Code | 1
Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks | Code | 1
Stop Reasoning! When Multimodal LLM with Chain-of-Thought Reasoning Meets Adversarial Image | Code | 1
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models | Code | 1
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Code | 1
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning | Code | 1
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Code | 1
Compositional Chain-of-Thought Prompting for Large Multimodal Models | Code | 1
GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Code | 1
NeuSyRE: Neuro-Symbolic Visual Understanding and Reasoning Framework based on Scene Graph Enrichment | Code | 1
Weakly Supervised Semantic Parsing with Execution-based Spurious Program Filtering | Code | 1
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Code | 1
Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection | Code | 1
What's Left? Concept Grounding with Logic-Enhanced Foundation Models | Code | 1
Interpreting and Controlling Vision Foundation Models via Text Explanations | Code | 1
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World | Code | 1
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models | Code | 1
A Survey on Interpretable Cross-modal Reasoning | Code | 1
Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Code | 1
An Examination of the Compositionality of Large Generative Vision-Language Models | Code | 1
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control | Code | 1
Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks | Code | 1
Learning Differentiable Logic Programs for Abstract Visual Reasoning | Code | 1
Revisiting the Role of Language Priors in Vision-Language Models | Code | 1
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | Code | 1
What You See is What You Read? Improving Text-Image Alignment Evaluation | Code | 1
Measuring Progress in Fine-grained Vision-and-Language Understanding | Code | 1
Visual Reasoning: from State to Transformation | Code | 1
Going Beyond Nouns With Vision & Language Models Using Synthetic Data | Code | 1
IRFL: Image Recognition of Figurative Language | Code | 1
Equivariant Similarity for Vision-Language Foundation Models | Code | 1
NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations | Code | 1
Abstract Visual Reasoning: An Algebraic Approach for Solving Raven's Progressive Matrices | Code | 1
Page 3 of 14

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4o + CA | Text Score | 75.5 | - | Unverified
2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 | - | Unverified
3 | GPT-4V (pick b/w two options) | Text Score | 69.25 | - | Unverified
4 | MMICL + CoCoT | Text Score | 64.25 | - | Unverified
5 | GPT-4V + CoCoT | Text Score | 58.5 | - | Unverified
6 | OpenFlamingo + CoCoT | Text Score | 58.25 | - | Unverified
7 | GPT-4V | Text Score | 54.5 | - | Unverified
8 | FIBER (EqSim) | Text Score | 51.5 | - | Unverified
9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 | - | Unverified
10 | MMICL + CCoT | Text Score | 51 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 91.51 | - | Unverified
2 | X2-VLM (large) | Accuracy | 88.7 | - | Unverified
3 | XFM (base) | Accuracy | 87.6 | - | Unverified
4 | X2-VLM (base) | Accuracy | 86.2 | - | Unverified
5 | CoCa | Accuracy | 86.1 | - | Unverified
6 | VLMo | Accuracy | 85.64 | - | Unverified
7 | VK-OOD | Accuracy | 84.6 | - | Unverified
8 | SimVLM | Accuracy | 84.53 | - | Unverified
9 | X-VLM (base) | Accuracy | 84.41 | - | Unverified
10 | VK-OOD | Accuracy | 83.9 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 92.58 | - | Unverified
2 | X2-VLM (large) | Accuracy | 89.4 | - | Unverified
3 | XFM (base) | Accuracy | 88.4 | - | Unverified
4 | CoCa | Accuracy | 87 | - | Unverified
5 | X2-VLM (base) | Accuracy | 87 | - | Unverified
6 | VLMo | Accuracy | 86.86 | - | Unverified
7 | SimVLM | Accuracy | 85.15 | - | Unverified
8 | X-VLM (base) | Accuracy | 84.76 | - | Unverified
9 | BLIP-129M | Accuracy | 83.09 | - | Unverified
10 | ALBEF (14M) | Accuracy | 82.55 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | AI Core | Average-per ques. | 95.24 | - | Unverified
2 | redherring | Average-per ques. | 91.14 | - | Unverified
3 | VRDP | Average-per ques. | 90.24 | - | Unverified
4 | Fighttttt | Average-per ques. | 88.71 | - | Unverified
5 | neural | Average-per ques. | 88.27 | - | Unverified
6 | NERV | Average-per ques. | 88.05 | - | Unverified
7 | DCL | Average-per ques. | 75.52 | - | Unverified
8 | troublesolver | Average-per ques. | 73.3 | - | Unverified
9 | v0.1 | Average-per ques. | 73.1 | - | Unverified
10 | First_test | Average-per ques. | 69.65 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 | - | Unverified
2 | GPT-4o + CA | 2-Class Accuracy | 92.8 | - | Unverified
3 | Human | 2-Class Accuracy | 91 | - | Unverified
4 | SNAIL | 2-Class Accuracy | 64 | - | Unverified
5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 | - | Unverified
6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 | - | Unverified
7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 | - | Unverified
8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 | - | Unverified
9 | Otter | 2-Class Accuracy | 49.3 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Humans | Jaccard Index | 90 | - | Unverified
2 | ViLT (Zero-Shot) | Jaccard Index | 52 | - | Unverified
3 | X-VLM (Zero-Shot) | Jaccard Index | 46 | - | Unverified
4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 | - | Unverified
5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 | - | Unverified
6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 | - | Unverified
7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 | - | Unverified
8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | LXMERT | Accuracy | 70.1 | - | Unverified
2 | ViLT | Accuracy | 69.3 | - | Unverified
3 | CLIP (finetuned) | Accuracy | 65.1 | - | Unverified
4 | CLIP (frozen) | Accuracy | 56 | - | Unverified
5 | VisualBERT | Accuracy | 55.2 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | RPIN | AUCCESS | 42.2 | - | Unverified
2 | Dec[Joint]1f | AUCCESS | 40.3 | - | Unverified
3 | Dynamics-Aware DQN | AUCCESS | 39.9 | - | Unverified
4 | DQN | AUCCESS | 36.8 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Dynamics-Aware DQN | AUCCESS | 85.2 | - | Unverified
2 | RPIN | AUCCESS | 85.2 | - | Unverified
3 | Dec[Joint]1f | AUCCESS | 80 | - | Unverified
4 | DQN | AUCCESS | 77.6 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Swin | 1:1 Accuracy | 52.9 | - | Unverified
2 | ConvNeXt | 1:1 Accuracy | 51.2 | - | Unverified
3 | ViT | 1:1 Accuracy | 50.3 | - | Unverified
4 | DEiT | 1:1 Accuracy | 47.2 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Humans | 1-of-100 Accuracy | 100 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VisualBERT | Accuracy (Dev) | 67.4 | - | Unverified