SOTAVerified

Visual Reasoning

The ability to understand the actions and reasoning associated with visual images.

Papers

Showing 351–400 of 698 papers

Title | Status | Hype
WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models | Code | 0
Making History Matter: History-Advantage Sequence Training for Visual Dialog | | 0
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning | | 0
Explicit3D: Graph Network with Spatial Inference for Single Image 3D Object Detection | | 0
Abductive Symbolic Solver on Abstraction and Reasoning Corpus | | 0
MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science | | 0
Visual Commonsense based Heterogeneous Graph Contrastive Learning | | 0
3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow | | 0
MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems? | | 0
Visual Entailment: A Novel Task for Fine-Grained Image Understanding | | 0
Explainable AI And Visual Reasoning: Insights From Radiology | | 0
Measuring CLEVRness: Black-box Testing of Visual Reasoning Models | | 0
Measuring CLEVRness: Blackbox testing of Visual Reasoning Models | | 0
Learning to Assemble Neural Module Tree Networks for Visual Grounding | | 0
Analysis of Visual Reasoning on One-Stage Object Detection | | 0
MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models | | 0
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning | | 0
Visual In-Context Learning for Large Vision-Language Models | | 0
EXCLAIM: An Explainable Cross-Modal Agentic System for Misinformation Detection with Hierarchical Retrieval | | 0
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE | | 0
MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM | | 0
Leveraging VLM-Based Pipelines to Annotate 3D Objects | | 0
Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration | | 0
M-LLM Based Video Frame Selection for Efficient Video Understanding | | 0
Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark | | 0
MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | | 0
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct | | 0
EuclidNet: Deep Visual Reasoning for Constructible Problems in Geometry | | 0
Interactive Visual Reasoning under Uncertainty | | 0
Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model | | 0
MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics? | | 0
MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems | | 0
Modeling Gestalt Visual Reasoning on the Raven's Progressive Matrices Intelligence Test Using Generative Image Inpainting Techniques | | 0
Modelling Working Memory using Deep Recurrent Reinforcement Learning | | 0
Modularity Matters: Learning Invariant Relational Reasoning Tasks | | 0
Modulated Self-attention Convolutional Network for VQA | | 0
Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | | 0
Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models | | 0
Multi-Granularity Modularized Network for Abstract Visual Reasoning | | 0
Visual Language Models show widespread visual deficits on neuropsychological tests | | 0
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning | | 0
A Continual Learning Paradigm for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks | | 0
Affordance-Guided Reinforcement Learning via Visual Prompting | | 0
Enhancing Advanced Visual Reasoning Ability of Large Language Models | | 0
Multimodal Representations for Teacher-Guided Compositional Visual Reasoning | | 0
End-to-End Learning of Semantic Grasping | | 0
Superpixel Semantics Representation and Pre-training for Vision-Language Task | | 0
End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models | | 0
EgoReID: Cross-view Self-Identification and Human Re-identification in Egocentric and Surveillance Videos | | 0
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4o + CA | Text Score | 75.5 | | Unverified
2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 | | Unverified
3 | GPT-4V (pick b/w two options) | Text Score | 69.25 | | Unverified
4 | MMICL + CoCoT | Text Score | 64.25 | | Unverified
5 | GPT-4V + CoCoT | Text Score | 58.5 | | Unverified
6 | OpenFlamingo + CoCoT | Text Score | 58.25 | | Unverified
7 | GPT-4V | Text Score | 54.5 | | Unverified
8 | FIBER (EqSim) | Text Score | 51.5 | | Unverified
9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 | | Unverified
10 | MMICL + CCoT | Text Score | 51 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 91.51 | | Unverified
2 | X2-VLM (large) | Accuracy | 88.7 | | Unverified
3 | XFM (base) | Accuracy | 87.6 | | Unverified
4 | X2-VLM (base) | Accuracy | 86.2 | | Unverified
5 | CoCa | Accuracy | 86.1 | | Unverified
6 | VLMo | Accuracy | 85.64 | | Unverified
7 | VK-OOD | Accuracy | 84.6 | | Unverified
8 | SimVLM | Accuracy | 84.53 | | Unverified
9 | X-VLM (base) | Accuracy | 84.41 | | Unverified
10 | VK-OOD | Accuracy | 83.9 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 92.58 | | Unverified
2 | X2-VLM (large) | Accuracy | 89.4 | | Unverified
3 | XFM (base) | Accuracy | 88.4 | | Unverified
4 | X2-VLM (base) | Accuracy | 87 | | Unverified
5 | CoCa | Accuracy | 87 | | Unverified
6 | VLMo | Accuracy | 86.86 | | Unverified
7 | SimVLM | Accuracy | 85.15 | | Unverified
8 | X-VLM (base) | Accuracy | 84.76 | | Unverified
9 | BLIP-129M | Accuracy | 83.09 | | Unverified
10 | ALBEF (14M) | Accuracy | 82.55 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | AI Core | Average-per ques. | 95.24 | | Unverified
2 | redherring | Average-per ques. | 91.14 | | Unverified
3 | VRDP | Average-per ques. | 90.24 | | Unverified
4 | Fightttttt | Average-per ques. | 88.71 | | Unverified
5 | neural | Average-per ques. | 88.27 | | Unverified
6 | NERV | Average-per ques. | 88.05 | | Unverified
7 | DCL | Average-per ques. | 75.52 | | Unverified
8 | troublesolver | Average-per ques. | 73.3 | | Unverified
9 | v0.1 | Average-per ques. | 73.1 | | Unverified
10 | First_test | Average-per ques. | 69.65 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 | | Unverified
2 | GPT-4o + CA | 2-Class Accuracy | 92.8 | | Unverified
3 | Human | 2-Class Accuracy | 91 | | Unverified
4 | SNAIL | 2-Class Accuracy | 64 | | Unverified
5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 | | Unverified
6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 | | Unverified
7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 | | Unverified
8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 | | Unverified
9 | Otter | 2-Class Accuracy | 49.3 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Humans | Jaccard Index | 90 | | Unverified
2 | ViLT (Zero-Shot) | Jaccard Index | 52 | | Unverified
3 | X-VLM (Zero-Shot) | Jaccard Index | 46 | | Unverified
4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 | | Unverified
5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 | | Unverified
6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 | | Unverified
7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 | | Unverified
8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | LXMERT | accuracy | 70.1 | | Unverified
2 | ViLT | accuracy | 69.3 | | Unverified
3 | CLIP (finetuned) | accuracy | 65.1 | | Unverified
4 | CLIP (frozen) | accuracy | 56 | | Unverified
5 | VisualBERT | accuracy | 55.2 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | RPIN | AUCCESS | 42.2 | | Unverified
2 | Dec[Joint]1f | AUCCESS | 40.3 | | Unverified
3 | Dynamics-Aware DQN | AUCCESS | 39.9 | | Unverified
4 | DQN | AUCCESS | 36.8 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | RPIN | AUCCESS | 85.2 | | Unverified
2 | Dynamics-Aware DQN | AUCCESS | 85.2 | | Unverified
3 | Dec[Joint]1f | AUCCESS | 80 | | Unverified
4 | DQN | AUCCESS | 77.6 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Swin | 1:1 Accuracy | 52.9 | | Unverified
2 | ConvNeXt | 1:1 Accuracy | 51.2 | | Unverified
3 | ViT | 1:1 Accuracy | 50.3 | | Unverified
4 | DEiT | 1:1 Accuracy | 47.2 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Humans | 1-of-100 Accuracy | 100 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VisualBERT | Accuracy (Dev) | 67.4 | | Unverified