SOTAVerified

Visual Reasoning

The ability to understand actions and reasoning associated with visual images.

Papers

Showing 301-350 of 698 papers

Title | Status | Hype
SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection | | 0
What Is Missing in Multilingual Visual Reasoning and How to Fix It | Code | 0
Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks | Code | 1
Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning | Code | 0
VISREAS: Complex Visual Reasoning with Unanswerable Questions | | 0
Stop Reasoning! When Multimodal LLM with Chain-of-Thought Reasoning Meets Adversarial Image | Code | 1
PALO: A Polyglot Large Multimodal Model for 5B People | Code | 2
Visual Reasoning in Object-Centric Deep Neural Networks: A Comparative Cognition Approach | Code | 0
Visual In-Context Learning for Large Vision-Language Models | | 0
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling | Code | 0
CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | Code | 3
Neural networks for abstraction and reasoning: Towards broad generalization in machines | Code | 3
Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | | 0
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models | Code | 1
Prompting Large Vision-Language Models for Compositional Reasoning | Code | 0
Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually | Code | 0
Towards Generative Abstract Reasoning: Completing Raven's Progressive Matrix via Rule Abstraction and Selection | Code | 0
Language-Conditioned Robotic Manipulation with Fast and Slow Thinking | | 0
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | Code | 0
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers | | 0
Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts | | 0
ChartBench: A Benchmark for Complex Visual Reasoning in Charts | | 0
VCoder: Versatile Vision Encoders for Multimodal Large Language Models | Code | 2
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise | Code | 0
One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems | Code | 0
GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives | Code | 0
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Code | 1
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning | Code | 1
Leveraging VLM-Based Pipelines to Annotate 3D Objects | | 0
Compositional Chain-of-Thought Prompting for Large Multimodal Models | Code | 1
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | Code | 5
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Code | 1
From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation | | 0
SelfEval: Leveraging the discriminative nature of generative models for evaluation | | 0
The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | | 0
Solving ARC visual analogies with neural embeddings and vector arithmetic: A generalized method | Code | 0
Adaptive recurrent vision performs zero-shot computation scaling to unseen difficulty levels | | 0
Visual Commonsense based Heterogeneous Graph Contrastive Learning | | 0
Towards A Unified Neural Architecture for Visual Recognition and Reasoning | | 0
GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Code | 1
NeuSyRE: Neuro-Symbolic Visual Understanding and Reasoning Framework based on Scene Graph Enrichment | Code | 1
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Code | 1
Weakly Supervised Semantic Parsing with Execution-based Spurious Program Filtering | Code | 1
Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection | Code | 1
OC-NMN: Object-centric Compositional Neural Module Network for Generative Visual Analogical Reasoning | | 0
Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting | | 0
ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese | Code | 0
Multimodal Representations for Teacher-Guided Compositional Visual Reasoning | | 0
What's Left? Concept Grounding with Logic-Enhanced Foundation Models | Code | 1
Superpixel Semantics Representation and Pre-training for Vision-Language Task | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4o + CA | Text Score | 75.5 | | Unverified
2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 | | Unverified
3 | GPT-4V (pick b/w two options) | Text Score | 69.25 | | Unverified
4 | MMICL + CoCoT | Text Score | 64.25 | | Unverified
5 | GPT-4V + CoCoT | Text Score | 58.5 | | Unverified
6 | OpenFlamingo + CoCoT | Text Score | 58.25 | | Unverified
7 | GPT-4V | Text Score | 54.5 | | Unverified
8 | FIBER (EqSim) | Text Score | 51.5 | | Unverified
9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 | | Unverified
10 | MMICL + CCoT | Text Score | 51 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 91.51 | | Unverified
2 | X2-VLM (large) | Accuracy | 88.7 | | Unverified
3 | XFM (base) | Accuracy | 87.6 | | Unverified
4 | X2-VLM (base) | Accuracy | 86.2 | | Unverified
5 | CoCa | Accuracy | 86.1 | | Unverified
6 | VLMo | Accuracy | 85.64 | | Unverified
7 | VK-OOD | Accuracy | 84.6 | | Unverified
8 | SimVLM | Accuracy | 84.53 | | Unverified
9 | X-VLM (base) | Accuracy | 84.41 | | Unverified
10 | VK-OOD | Accuracy | 83.9 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 92.58 | | Unverified
2 | X2-VLM (large) | Accuracy | 89.4 | | Unverified
3 | XFM (base) | Accuracy | 88.4 | | Unverified
4 | CoCa | Accuracy | 87 | | Unverified
5 | X2-VLM (base) | Accuracy | 87 | | Unverified
6 | VLMo | Accuracy | 86.86 | | Unverified
7 | SimVLM | Accuracy | 85.15 | | Unverified
8 | X-VLM (base) | Accuracy | 84.76 | | Unverified
9 | BLIP-129M | Accuracy | 83.09 | | Unverified
10 | ALBEF (14M) | Accuracy | 82.55 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | AI Core | Average-per ques. | 95.24 | | Unverified
2 | redherring | Average-per ques. | 91.14 | | Unverified
3 | VRDP | Average-per ques. | 90.24 | | Unverified
4 | Fighttttt | Average-per ques. | 88.71 | | Unverified
5 | neural | Average-per ques. | 88.27 | | Unverified
6 | NERV | Average-per ques. | 88.05 | | Unverified
7 | DCL | Average-per ques. | 75.52 | | Unverified
8 | troublesolver | Average-per ques. | 73.3 | | Unverified
9 | v0.1 | Average-per ques. | 73.1 | | Unverified
10 | First_test | Average-per ques. | 69.65 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 | | Unverified
2 | GPT-4o + CA | 2-Class Accuracy | 92.8 | | Unverified
3 | Human | 2-Class Accuracy | 91 | | Unverified
4 | SNAIL | 2-Class Accuracy | 64 | | Unverified
5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 | | Unverified
6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 | | Unverified
7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 | | Unverified
8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 | | Unverified
9 | Otter | 2-Class Accuracy | 49.3 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Humans | Jaccard Index | 90 | | Unverified
2 | ViLT (Zero-Shot) | Jaccard Index | 52 | | Unverified
3 | X-VLM (Zero-Shot) | Jaccard Index | 46 | | Unverified
4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 | | Unverified
5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 | | Unverified
6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 | | Unverified
7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 | | Unverified
8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | LXMERT | accuracy | 70.1 | | Unverified
2 | ViLT | accuracy | 69.3 | | Unverified
3 | CLIP (finetuned) | accuracy | 65.1 | | Unverified
4 | CLIP (frozen) | accuracy | 56 | | Unverified
5 | VisualBERT | accuracy | 55.2 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | RPIN | AUCCESS | 42.2 | | Unverified
2 | Dec[Joint]1f | AUCCESS | 40.3 | | Unverified
3 | Dynamics-Aware DQN | AUCCESS | 39.9 | | Unverified
4 | DQN | AUCCESS | 36.8 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Dynamics-Aware DQN | AUCCESS | 85.2 | | Unverified
2 | RPIN | AUCCESS | 85.2 | | Unverified
3 | Dec[Joint]1f | AUCCESS | 80 | | Unverified
4 | DQN | AUCCESS | 77.6 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Swin | 1:1 Accuracy | 52.9 | | Unverified
2 | ConvNeXt | 1:1 Accuracy | 51.2 | | Unverified
3 | ViT | 1:1 Accuracy | 50.3 | | Unverified
4 | DEiT | 1:1 Accuracy | 47.2 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Humans | 1-of-100 Accuracy | 100 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VisualBERT | Accuracy (Dev) | 67.4 | | Unverified