SOTAVerified

Visual Reasoning

The ability to understand the actions and reasoning associated with visual images.

Papers

Showing 101-150 of 698 papers

| Title | Status | Hype |
|---|---|---|
| INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Code | 1 |
| How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Code | 1 |
| CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space | Code | 1 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | Code | 1 |
| ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension | Code | 1 |
| Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Code | 1 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Code | 1 |
| MixGen: A New Multi-Modal Data Augmentation | Code | 1 |
| A Survey on Interpretable Cross-modal Reasoning | Code | 1 |
| MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems | Code | 1 |
| Grounded Situation Recognition with Transformers | Code | 1 |
| A Benchmark for Compositional Visual Reasoning | Code | 1 |
| MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Code | 1 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Code | 1 |
| Going Beyond Nouns With Vision & Language Models Using Synthetic Data | Code | 1 |
| Attention-Based Context Aware Reasoning for Situation Recognition | Code | 1 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code | 1 |
| GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Code | 1 |
| GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | Code | 1 |
| Collaborative Transformers for Grounded Situation Recognition | Code | 1 |
| Learning Differentiable Logic Programs for Abstract Visual Reasoning | Code | 1 |
| ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness | Code | 1 |
| Learning Long-term Visual Dynamics with Region Proposal Interaction Networks | Code | 1 |
| PHYRE: A New Benchmark for Physical Reasoning | Code | 1 |
| Belief Revision based Caption Re-ranker with Visual Semantic Information | Code | 1 |
| Compositional Attention Networks for Machine Reasoning | Code | 1 |
| CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Code | 1 |
| An Examination of the Compositionality of Large Generative Vision-Language Models | Code | 1 |
| Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Code | 1 |
| HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks | Code | 1 |
| From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Code | 1 |
| An Empirical Study of Training End-to-End Vision-and-Language Transformers | Code | 1 |
| From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis | Code | 1 |
| Forward Prediction for Physical Reasoning | Code | 1 |
| Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Code | 1 |
| From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization | Code | 1 |
| GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains | Code | 1 |
| Measuring Progress in Fine-grained Vision-and-Language Understanding | Code | 1 |
| Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World | Code | 1 |
| Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning | Code | 1 |
| Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions | Code | 1 |
| FiLM: Visual Reasoning with a General Conditioning Layer | Code | 1 |
| Machine Number Sense: A Dataset of Visual Arithmetic Problems for Abstract and Relational Reasoning | Code | 1 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Code | 1 |
| LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models | Code | 1 |
| LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation | Code | 1 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | Code | 1 |
| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Code | 1 |
| Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | Code | 1 |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4o + CA | Text Score | 75.5 | | Unverified |
| 2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 | | Unverified |
| 3 | GPT-4V (pick b/w two options) | Text Score | 69.25 | | Unverified |
| 4 | MMICL + CoCoT | Text Score | 64.25 | | Unverified |
| 5 | GPT-4V + CoCoT | Text Score | 58.5 | | Unverified |
| 6 | OpenFlamingo + CoCoT | Text Score | 58.25 | | Unverified |
| 7 | GPT-4V | Text Score | 54.5 | | Unverified |
| 8 | FIBER (EqSim) | Text Score | 51.5 | | Unverified |
| 9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 | | Unverified |
| 10 | MMICL + CCoT | Text Score | 51 | | Unverified |
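
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | BEiT-3 | Accuracy | 91.51 | | Unverified |
| 2 | X2-VLM (large) | Accuracy | 88.7 | | Unverified |
| 3 | XFM (base) | Accuracy | 87.6 | | Unverified |
| 4 | X2-VLM (base) | Accuracy | 86.2 | | Unverified |
| 5 | CoCa | Accuracy | 86.1 | | Unverified |
| 6 | VLMo | Accuracy | 85.64 | | Unverified |
| 7 | VK-OOD | Accuracy | 84.6 | | Unverified |
| 8 | SimVLM | Accuracy | 84.53 | | Unverified |
| 9 | X-VLM (base) | Accuracy | 84.41 | | Unverified |
| 10 | VK-OOD | Accuracy | 83.9 | | Unverified |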
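
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | BEiT-3 | Accuracy | 92.58 | | Unverified |
| 2 | X2-VLM (large) | Accuracy | 89.4 | | Unverified |
| 3 | XFM (base) | Accuracy | 88.4 | | Unverified |
| 4 | X2-VLM (base) | Accuracy | 87 | | Unverified |
| 5 | CoCa | Accuracy | 87 | | Unverified |
| 6 | VLMo | Accuracy | 86.86 | | Unverified |
| 7 | SimVLM | Accuracy | 85.15 | | Unverified |
| 8 | X-VLM (base) | Accuracy | 84.76 | | Unverified |
| 9 | BLIP-129M | Accuracy | 83.09 | | Unverified |
| 10 | ALBEF (14M) | Accuracy | 82.55 | | Unverified |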
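
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | AI Core | Average-per ques. | 95.24 | | Unverified |
| 2 | redherring | Average-per ques. | 91.14 | | Unverified |
| 3 | VRDP | Average-per ques. | 90.24 | | Unverified |
| 4 | Fightttttt | Average-per ques. | 88.71 | | Unverified |
| 5 | neural | Average-per ques. | 88.27 | | Unverified |
| 6 | NERV | Average-per ques. | 88.05 | | Unverified |
| 7 | DCL | Average-per ques. | 75.52 | | Unverified |
| 8 | troublesolver | Average-per ques. | 73.3 | | Unverified |
| 9 | v0.1 | Average-per ques. | 73.1 | | Unverified |
| 10 | First_test | Average-per ques. | 69.65 | | Unverified |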

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 | | Unverified |
| 2 | GPT-4o + CA | 2-Class Accuracy | 92.8 | | Unverified |
| 3 | Human | 2-Class Accuracy | 91 | | Unverified |
| 4 | SNAIL | 2-Class Accuracy | 64 | | Unverified |
| 5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 | | Unverified |
| 6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 | | Unverified |
| 7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 | | Unverified |
| 8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 | | Unverified |
| 9 | Otter | 2-Class Accuracy | 49.3 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Humans | Jaccard Index | 90 | | Unverified |
| 2 | ViLT (Zero-Shot) | Jaccard Index | 52 | | Unverified |
| 3 | X-VLM (Zero-Shot) | Jaccard Index | 46 | | Unverified |
| 4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 | | Unverified |
| 5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 | | Unverified |
| 6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 | | Unverified |
| 7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 | | Unverified |
| 8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 | | Unverified |
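
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | LXMERT | accuracy | 70.1 | | Unverified |
| 2 | ViLT | accuracy | 69.3 | | Unverified |
| 3 | CLIP (finetuned) | accuracy | 65.1 | | Unverified |
| 4 | CLIP (frozen) | accuracy | 56 | | Unverified |
| 5 | VisualBERT | accuracy | 55.2 | | Unverified |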
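
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | RPIN | AUCCESS | 42.2 | | Unverified |
| 2 | Dec[Joint]1f | AUCCESS | 40.3 | | Unverified |
| 3 | Dynamics-Aware DQN | AUCCESS | 39.9 | | Unverified |
| 4 | DQN | AUCCESS | 36.8 | | Unverified |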
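
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | RPIN | AUCCESS | 85.2 | | Unverified |
| 2 | Dynamics-Aware DQN | AUCCESS | 85.2 | | Unverified |
| 3 | Dec[Joint]1f | AUCCESS | 80 | | Unverified |
| 4 | DQN | AUCCESS | 77.6 | | Unverified |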
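
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Swin | 1:1 Accuracy | 52.9 | | Unverified |
| 2 | ConvNeXt | 1:1 Accuracy | 51.2 | | Unverified |
| 3 | ViT | 1:1 Accuracy | 50.3 | | Unverified |
| 4 | DEiT | 1:1 Accuracy | 47.2 | | Unverified |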

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Humans | 1-of-100 Accuracy | 100 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VisualBERT | Accuracy (Dev) | 67.4 | | Unverified |