SOTAVerified

Visual Reasoning

The ability to understand the actions and reasoning associated with visual images.

Papers

Showing 151-200 of 698 papers

| Title | Status | Hype |
|---|---|---|
| Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding | Code | 1 |
| Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning | Code | 1 |
| Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications | Code | 1 |
| UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers | Code | 1 |
| See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning | Code | 1 |
| Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training | Code | 1 |
| Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment | Code | 1 |
| MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | Code | 1 |
| Position-guided Text Prompt for Vision-Language Pre-training | Code | 1 |
| Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Code | 1 |
| Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning | Code | 1 |
| Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation | Code | 1 |
| MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Code | 1 |
| Belief Revision based Caption Re-ranker with Visual Semantic Information | Code | 1 |
| VIPHY: Probing "Visible" Physical Commonsense Knowledge | Code | 1 |
| Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | Code | 1 |
| MixGen: A New Multi-Modal Data Augmentation | Code | 1 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Code | 1 |
| A Benchmark for Compositional Visual Reasoning | Code | 1 |
| Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning | Code | 1 |
| CyCLIP: Cyclic Contrastive Language-Image Pretraining | Code | 1 |
| Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions | Code | 1 |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | Code | 1 |
| Visual Spatial Reasoning | Code | 1 |
| RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning | Code | 1 |
| Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | Code | 1 |
| CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Code | 1 |
| Collaborative Transformers for Grounded Situation Recognition | Code | 1 |
| REX: Reasoning-aware and Grounded Explanation | Code | 1 |
| Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation | Code | 1 |
| Distilled Dual-Encoder Model for Vision-Language Understanding | Code | 1 |
| FLAVA: A Foundational Language And Vision Alignment Model | Code | 1 |
| Grounded Situation Recognition with Transformers | Code | 1 |
| Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | Code | 1 |
| VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | Code | 1 |
| An Empirical Study of Training End-to-End Vision-and-Language Transformers | Code | 1 |
| ProTo: Program-Guided Transformer for Program-Guided Tasks | Code | 1 |
| Visually Grounded Reasoning across Languages and Cultures | Code | 1 |
| ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration | Code | 1 |
| Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models | Code | 1 |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Code | 1 |
| Understanding and Evaluating Racial Biases in Image Captioning | Code | 1 |
| Referring Transformer: A One-step Approach to Multi-task Visual Grounding | Code | 1 |
| Learning Relation Alignment for Calibrated Cross-modal Retrieval | Code | 1 |
| Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning | Code | 1 |
| Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | Code | 1 |
| ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | Code | 1 |
| DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue | Code | 1 |
| Transformation Driven Visual Reasoning | Code | 1 |
| Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs | Code | 1 |

Benchmark Results

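| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4o + CA | Text Score | 75.5 | | Unverified |
| 2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 | | Unverified |
| 3 | GPT-4V (pick b/w two options) | Text Score | 69.25 | | Unverified |
| 4 | MMICL + CoCoT | Text Score | 64.25 | | Unverified |
| 5 | GPT-4V + CoCoT | Text Score | 58.5 | | Unverified |
| 6 | OpenFlamingo + CoCoT | Text Score | 58.25 | | Unverified |
| 7 | GPT-4V | Text Score | 54.5 | | Unverified |
| 8 | FIBER (EqSim) | Text Score | 51.5 | | Unverified |
| 9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 | | Unverified |
| 10 | MMICL + CCoT | Text Score | 51 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | BEiT-3 | Accuracy | 91.51 | | Unverified |
| 2 | X2-VLM (large) | Accuracy | 88.7 | | Unverified |
| 3 | XFM (base) | Accuracy | 87.6 | | Unverified |
| 4 | X2-VLM (base) | Accuracy | 86.2 | | Unverified |
| 5 | CoCa | Accuracy | 86.1 | | Unverified |
| 6 | VLMo | Accuracy | 85.64 | | Unverified |
| 7 | VK-OOD | Accuracy | 84.6 | | Unverified |
| 8 | SimVLM | Accuracy | 84.53 | | Unverified |
| 9 | X-VLM (base) | Accuracy | 84.41 | | Unverified |
| 10 | VK-OOD | Accuracy | 83.9 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | BEiT-3 | Accuracy | 92.58 | | Unverified |
| 2 | X2-VLM (large) | Accuracy | 89.4 | | Unverified |
| 3 | XFM (base) | Accuracy | 88.4 | | Unverified |
| 4 | X2-VLM (base) | Accuracy | 87 | | Unverified |
| 5 | CoCa | Accuracy | 87 | | Unverified |
| 6 | VLMo | Accuracy | 86.86 | | Unverified |
| 7 | SimVLM | Accuracy | 85.15 | | Unverified |
| 8 | X-VLM (base) | Accuracy | 84.76 | | Unverified |
| 9 | BLIP-129M | Accuracy | 83.09 | | Unverified |
| 10 | ALBEF (14M) | Accuracy | 82.55 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | AI Core | Average-per ques. | 95.24 | | Unverified |
| 2 | redherring | Average-per ques. | 91.14 | | Unverified |
| 3 | VRDP | Average-per ques. | 90.24 | | Unverified |
| 4 | Fightttttt | Average-per ques. | 88.71 | | Unverified |
| 5 | neural | Average-per ques. | 88.27 | | Unverified |
| 6 | NERV | Average-per ques. | 88.05 | | Unverified |
| 7 | DCL | Average-per ques. | 75.52 | | Unverified |
| 8 | troublesolver | Average-per ques. | 73.3 | | Unverified |
| 9 | v0.1 | Average-per ques. | 73.1 | | Unverified |
| 10 | First_test | Average-per ques. | 69.65 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 | | Unverified |
| 2 | GPT-4o + CA | 2-Class Accuracy | 92.8 | | Unverified |
| 3 | Human | 2-Class Accuracy | 91 | | Unverified |
| 4 | SNAIL | 2-Class Accuracy | 64 | | Unverified |
| 5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 | | Unverified |
| 6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 | | Unverified |
| 7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 | | Unverified |
| 8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 | | Unverified |
| 9 | Otter | 2-Class Accuracy | 49.3 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Humans | Jaccard Index | 90 | | Unverified |
| 2 | ViLT (Zero-Shot) | Jaccard Index | 52 | | Unverified |
| 3 | X-VLM (Zero-Shot) | Jaccard Index | 46 | | Unverified |
| 4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 | | Unverified |
| 5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 | | Unverified |
| 6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 | | Unverified |
| 7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 | | Unverified |
| 8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | LXMERT | accuracy | 70.1 | | Unverified |
| 2 | ViLT | accuracy | 69.3 | | Unverified |
| 3 | CLIP (finetuned) | accuracy | 65.1 | | Unverified |
| 4 | CLIP (frozen) | accuracy | 56 | | Unverified |
| 5 | VisualBERT | accuracy | 55.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | RPIN | AUCCESS | 42.2 | | Unverified |
| 2 | Dec[Joint]1f | AUCCESS | 40.3 | | Unverified |
| 3 | Dynamics-Aware DQN | AUCCESS | 39.9 | | Unverified |
| 4 | DQN | AUCCESS | 36.8 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | RPIN | AUCCESS | 85.2 | | Unverified |
| 2 | Dynamics-Aware DQN | AUCCESS | 85.2 | | Unverified |
| 3 | Dec[Joint]1f | AUCCESS | 80 | | Unverified |
| 4 | DQN | AUCCESS | 77.6 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Swin | 1:1 Accuracy | 52.9 | | Unverified |
| 2 | ConvNeXt | 1:1 Accuracy | 51.2 | | Unverified |
| 3 | ViT | 1:1 Accuracy | 50.3 | | Unverified |
| 4 | DEiT | 1:1 Accuracy | 47.2 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Humans | 1-of-100 Accuracy | 100 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VisualBERT | Accuracy (Dev) | 67.4 | | Unverified |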