SOTAVerified

Visual Reasoning

The ability to understand the actions and reasoning associated with visual images.

Papers

Showing 301-350 of 698 papers

Title | Status | Hype
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps | | 0
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM | | 0
Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators | | 0
Can We Automate Diagrammatic Reasoning? | | 0
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language | | 0
Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL | | 0
Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data | | 0
ChartBench: A Benchmark for Complex Visual Reasoning in Charts | | 0
ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models | | 0
ChartNet: Visual Reasoning over Statistical Charts using MAC-Networks | | 0
ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering | | 0
Chat2Layout: Interactive 3D Furniture Layout with a Multimodal LLM | | 0
Chitrarth: Bridging Vision and Language for a Billion People | | 0
Chop Chop BERT: Visual Question Answering by Chopping VisualBERT's Heads | | 0
CityLoc: 6DoF Pose Distributional Localization for Text Descriptions in Large-Scale Scenes with Gaussian Representation | | 0
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering | | 0
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | | 0
Code Repair with LLMs gives an Exploration-Exploitation Tradeoff | | 0
A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs | | 0
Comparing Visual Reasoning in Humans and AI | | 0
Comparison Visual Instruction Tuning | | 0
Compositional Law Parsing with Latent Random Functions | | 0
Continual learning on 3D point clouds with random compressed rehearsal | | 0
Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension | | 0
Co-VQA : Answering by Interactive Sub Question Sequence | | 0
Co-VQA : Answering by Interactive Sub Question Sequence | | 0
Critical Features Tracking on Triangulated Irregular Networks by a Scale-Space Method | | 0
Curriculum Learning for Compositional Visual Reasoning | | 0
DAReN: A Collaborative Approach Towards Reasoning And Disentangling | | 0
Multimodal Analysis Of Google Bard And GPT-Vision: Experiments In Visual Reasoning | | 0
Deep Learning Methods for Abstract Visual Reasoning: A Survey on Raven's Progressive Matrices | | 0
Deep Neural Networks for Visual Reasoning | | 0
Deep Reason: A Strong Baseline for Real-World Visual Reasoning | | 0
Deep Visual Reasoning: Learning to Predict Action Sequences for Task and Motion Planning from an Initial Scene Image | | 0
Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA | | 0
Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning | | 0
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? | | 0
Does Structural Attention Improve Compositional Representations in Vision-Language Models? | | 0
Does Visual Pretraining Help End-to-End Reasoning? | | 0
Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR | | 0
Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models | | 0
Doxing via the Lens: Revealing Location-related Privacy Leakage on Multi-modal Large Reasoning Models | | 0
DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests | | 0
Dual Local-Global Contextual Pathways for Recognition in Aerial Imagery | | 0
DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning | | 0
Dynamic Graph Attention for Referring Expression Comprehension | | 0
Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language | | 0
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | | 0
EgoReID: Cross-view Self-Identification and Human Re-identification in Egocentric and Surveillance Videos | | 0
End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4o + CA | Text Score | 75.5 | | Unverified
2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 | | Unverified
3 | GPT-4V (pick b/w two options) | Text Score | 69.25 | | Unverified
4 | MMICL + CoCoT | Text Score | 64.25 | | Unverified
5 | GPT-4V + CoCoT | Text Score | 58.5 | | Unverified
6 | OpenFlamingo + CoCoT | Text Score | 58.25 | | Unverified
7 | GPT-4V | Text Score | 54.5 | | Unverified
8 | FIBER (EqSim) | Text Score | 51.5 | | Unverified
9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 | | Unverified
10 | MMICL + CCoT | Text Score | 51 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 91.51 | | Unverified
2 | X2-VLM (large) | Accuracy | 88.7 | | Unverified
3 | XFM (base) | Accuracy | 87.6 | | Unverified
4 | X2-VLM (base) | Accuracy | 86.2 | | Unverified
5 | CoCa | Accuracy | 86.1 | | Unverified
6 | VLMo | Accuracy | 85.64 | | Unverified
7 | VK-OOD | Accuracy | 84.6 | | Unverified
8 | SimVLM | Accuracy | 84.53 | | Unverified
9 | X-VLM (base) | Accuracy | 84.41 | | Unverified
10 | VK-OOD | Accuracy | 83.9 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 92.58 | | Unverified
2 | X2-VLM (large) | Accuracy | 89.4 | | Unverified
3 | XFM (base) | Accuracy | 88.4 | | Unverified
4 | X2-VLM (base) | Accuracy | 87 | | Unverified
5 | CoCa | Accuracy | 87 | | Unverified
6 | VLMo | Accuracy | 86.86 | | Unverified
7 | SimVLM | Accuracy | 85.15 | | Unverified
8 | X-VLM (base) | Accuracy | 84.76 | | Unverified
9 | BLIP-129M | Accuracy | 83.09 | | Unverified
10 | ALBEF (14M) | Accuracy | 82.55 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | AI Core | Average-per ques. | 95.24 | | Unverified
2 | redherring | Average-per ques. | 91.14 | | Unverified
3 | VRDP | Average-per ques. | 90.24 | | Unverified
4 | Fightttttt | Average-per ques. | 88.71 | | Unverified
5 | neural | Average-per ques. | 88.27 | | Unverified
6 | NERV | Average-per ques. | 88.05 | | Unverified
7 | DCL | Average-per ques. | 75.52 | | Unverified
8 | troublesolver | Average-per ques. | 73.3 | | Unverified
9 | v0.1 | Average-per ques. | 73.1 | | Unverified
10 | First_test | Average-per ques. | 69.65 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 | | Unverified
2 | GPT-4o + CA | 2-Class Accuracy | 92.8 | | Unverified
3 | Human | 2-Class Accuracy | 91 | | Unverified
4 | SNAIL | 2-Class Accuracy | 64 | | Unverified
5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 | | Unverified
6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 | | Unverified
7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 | | Unverified
8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 | | Unverified
9 | Otter | 2-Class Accuracy | 49.3 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Humans | Jaccard Index | 90 | | Unverified
2 | ViLT (Zero-Shot) | Jaccard Index | 52 | | Unverified
3 | X-VLM (Zero-Shot) | Jaccard Index | 46 | | Unverified
4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 | | Unverified
5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 | | Unverified
6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 | | Unverified
7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 | | Unverified
8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | LXMERT | accuracy | 70.1 | | Unverified
2 | ViLT | accuracy | 69.3 | | Unverified
3 | CLIP (finetuned) | accuracy | 65.1 | | Unverified
4 | CLIP (frozen) | accuracy | 56 | | Unverified
5 | VisualBERT | accuracy | 55.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | RPIN | AUCCESS | 42.2 | | Unverified
2 | Dec[Joint]1f | AUCCESS | 40.3 | | Unverified
3 | Dynamics-Aware DQN | AUCCESS | 39.9 | | Unverified
4 | DQN | AUCCESS | 36.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | RPIN | AUCCESS | 85.2 | | Unverified
2 | Dynamics-Aware DQN | AUCCESS | 85.2 | | Unverified
3 | Dec[Joint]1f | AUCCESS | 80 | | Unverified
4 | DQN | AUCCESS | 77.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Swin | 1:1 Accuracy | 52.9 | | Unverified
2 | ConvNeXt | 1:1 Accuracy | 51.2 | | Unverified
3 | ViT | 1:1 Accuracy | 50.3 | | Unverified
4 | DEiT | 1:1 Accuracy | 47.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Humans | 1-of-100 Accuracy | 100 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VisualBERT | Accuracy (Dev) | 67.4 | | Unverified