SOTAVerified

Visual Reasoning

The ability to understand the actions and reasoning associated with visual images.

Papers

Showing 451-500 of 698 papers

| Title | Status | Hype |
| Reason from Context with Self-supervised Learning | | 0 |
| X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | Code | 2 |
| Unifying Vision-Language Representation Space with Single-tower Transformer | | 0 |
| A survey on knowledge-enhanced multimodal learning | | 0 |
| Visual Programming: Compositional visual reasoning without training | Code | 2 |
| lilGym: Natural Language Visual Reasoning with Reinforcement Learning | | 0 |
| MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Code | 1 |
| MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning | | 0 |
| When and why vision-language models behave like bags-of-words, and what to do about it? | Code | 2 |
| Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning | Code | 0 |
| Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach | Code | 0 |
| A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering | Code | 0 |
| Zero-shot visual reasoning through probabilistic analogical mapping | | 0 |
| Deep Neural Networks for Visual Reasoning | | 0 |
| Belief Revision based Caption Re-ranker with Visual Semantic Information | Code | 1 |
| Compositional Law Parsing with Latent Random Functions | | 0 |
| VIPHY: Probing "Visible" Physical Commonsense Knowledge | Code | 1 |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | | 0 |
| Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Code | 1 |
| Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | Code | 0 |
| One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning | | 0 |
| WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models | Code | 0 |
| 3D Concept Grounding on Neural Fields | | 0 |
| From Shallow to Deep: Compositional Reasoning over Graphs for Visual Question Answering | | 0 |
| VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives | Code | 0 |
| SAViR-T: Spatially Attentive Visual Reasoning with Transformers | Code | 0 |
| Interactive Visual Reasoning under Uncertainty | | 0 |
| MixGen: A New Multi-Modal Data Augmentation | Code | 1 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Code | 1 |
| A Benchmark for Compositional Visual Reasoning | Code | 1 |
| GAMR: A Guided Attention Model for (visual) Reasoning | Code | 0 |
| VL-BEiT: Generative Vision-Language Pretraining | | 0 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | Code | 1 |
| Few-shot Subgoal Planning with Language Models | | 0 |
| CyCLIP: Cyclic Contrastive Language-Image Pretraining | Code | 1 |
| Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions | Code | 1 |
| Guiding Visual Question Answering with Attention Priors | | 0 |
| Continual learning on 3D point clouds with random compressed rehearsal | | 0 |
| Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering | Code | 0 |
| Introduction to Soar | | 0 |
| QLEVR: A Diagnostic Dataset for Quantificational Language and Elementary Visual Reasoning | Code | 0 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | Code | 1 |
| Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering | | 0 |
| Visual Spatial Reasoning | Code | 1 |
| RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning | Code | 1 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | Code | 1 |
| CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Code | 1 |
| Co-VQA : Answering by Interactive Sub Question Sequence | | 0 |
| Collaborative Transformers for Grounded Situation Recognition | Code | 1 |
| REX: Reasoning-aware and Grounded Explanation | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| 1 | GPT-4o + CA | Text Score | 75.5 | | Unverified |
| 2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 | | Unverified |
| 3 | GPT-4V (pick b/w two options) | Text Score | 69.25 | | Unverified |
| 4 | MMICL + CoCoT | Text Score | 64.25 | | Unverified |
| 5 | GPT-4V + CoCoT | Text Score | 58.5 | | Unverified |
| 6 | OpenFlamingo + CoCoT | Text Score | 58.25 | | Unverified |
| 7 | GPT-4V | Text Score | 54.5 | | Unverified |
| 8 | FIBER (EqSim) | Text Score | 51.5 | | Unverified |
| 9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 | | Unverified |
| 10 | MMICL + CCoT | Text Score | 51 | | Unverified |
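
| # | Model | Metric | Claimed | Verified | Status |
| 1 | BEiT-3 | Accuracy | 91.51 | | Unverified |
| 2 | X2-VLM (large) | Accuracy | 88.7 | | Unverified |
| 3 | XFM (base) | Accuracy | 87.6 | | Unverified |
| 4 | X2-VLM (base) | Accuracy | 86.2 | | Unverified |
| 5 | CoCa | Accuracy | 86.1 | | Unverified |
| 6 | VLMo | Accuracy | 85.64 | | Unverified |
| 7 | VK-OOD | Accuracy | 84.6 | | Unverified |
| 8 | SimVLM | Accuracy | 84.53 | | Unverified |
| 9 | X-VLM (base) | Accuracy | 84.41 | | Unverified |
| 10 | VK-OOD | Accuracy | 83.9 | | Unverified |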
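
| # | Model | Metric | Claimed | Verified | Status |
| 1 | BEiT-3 | Accuracy | 92.58 | | Unverified |
| 2 | X2-VLM (large) | Accuracy | 89.4 | | Unverified |
| 3 | XFM (base) | Accuracy | 88.4 | | Unverified |
| 4 | CoCa | Accuracy | 87 | | Unverified |
| 5 | X2-VLM (base) | Accuracy | 87 | | Unverified |
| 6 | VLMo | Accuracy | 86.86 | | Unverified |
| 7 | SimVLM | Accuracy | 85.15 | | Unverified |
| 8 | X-VLM (base) | Accuracy | 84.76 | | Unverified |
| 9 | BLIP-129M | Accuracy | 83.09 | | Unverified |
| 10 | ALBEF (14M) | Accuracy | 82.55 | | Unverified |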

| # | Model | Metric | Claimed | Verified | Status |
| 1 | AI Core | Average-per ques. | 95.24 | | Unverified |
| 2 | redherring | Average-per ques. | 91.14 | | Unverified |
| 3 | VRDP | Average-per ques. | 90.24 | | Unverified |
| 4 | Fightttttt | Average-per ques. | 88.71 | | Unverified |
| 5 | neural | Average-per ques. | 88.27 | | Unverified |
| 6 | NERV | Average-per ques. | 88.05 | | Unverified |
| 7 | DCL | Average-per ques. | 75.52 | | Unverified |
| 8 | troublesolver | Average-per ques. | 73.3 | | Unverified |
| 9 | v0.1 | Average-per ques. | 73.1 | | Unverified |
| 10 | First_test | Average-per ques. | 69.65 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| 1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 | | Unverified |
| 2 | GPT-4o + CA | 2-Class Accuracy | 92.8 | | Unverified |
| 3 | Human | 2-Class Accuracy | 91 | | Unverified |
| 4 | SNAIL | 2-Class Accuracy | 64 | | Unverified |
| 5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 | | Unverified |
| 6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 | | Unverified |
| 7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 | | Unverified |
| 8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 | | Unverified |
| 9 | Otter | 2-Class Accuracy | 49.3 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| 1 | Humans | Jaccard Index | 90 | | Unverified |
| 2 | ViLT (Zero-Shot) | Jaccard Index | 52 | | Unverified |
| 3 | X-VLM (Zero-Shot) | Jaccard Index | 46 | | Unverified |
| 4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 | | Unverified |
| 5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 | | Unverified |
| 6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 | | Unverified |
| 7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 | | Unverified |
| 8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 | | Unverified |
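
| # | Model | Metric | Claimed | Verified | Status |
| 1 | LXMERT | accuracy | 70.1 | | Unverified |
| 2 | ViLT | accuracy | 69.3 | | Unverified |
| 3 | CLIP (finetuned) | accuracy | 65.1 | | Unverified |
| 4 | CLIP (frozen) | accuracy | 56 | | Unverified |
| 5 | VisualBERT | accuracy | 55.2 | | Unverified |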

| # | Model | Metric | Claimed | Verified | Status |
| 1 | RPIN | AUCCESS | 42.2 | | Unverified |
| 2 | Dec[Joint]1f | AUCCESS | 40.3 | | Unverified |
| 3 | Dynamics-Aware DQN | AUCCESS | 39.9 | | Unverified |
| 4 | DQN | AUCCESS | 36.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| 1 | Dynamics-Aware DQN | AUCCESS | 85.2 | | Unverified |
| 2 | RPIN | AUCCESS | 85.2 | | Unverified |
| 3 | Dec[Joint]1f | AUCCESS | 80 | | Unverified |
| 4 | DQN | AUCCESS | 77.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| 1 | Swin | 1:1 Accuracy | 52.9 | | Unverified |
| 2 | ConvNeXt | 1:1 Accuracy | 51.2 | | Unverified |
| 3 | ViT | 1:1 Accuracy | 50.3 | | Unverified |
| 4 | DEiT | 1:1 Accuracy | 47.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| 1 | Humans | 1-of-100 Accuracy | 100 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| 1 | VisualBERT | Accuracy (Dev) | 67.4 | | Unverified |