SOTAVerified

Visual Reasoning

The ability to understand the actions and reasoning associated with visual images.

Papers

Showing 351–400 of 698 papers

Title | Status | Hype
Interpreting and Controlling Vision Foundation Models via Text Explanations | Code | 1
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World | Code | 1
Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | Code | 0
Visual Question Answering in the Medical Domain | | 0
A Continual Learning Paradigm for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks | | 0
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Code | 2
Collecting Visually-Grounded Dialogue with A Game Of Sorts | Code | 0
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models | Code | 1
A Survey on Interpretable Cross-modal Reasoning | Code | 1
Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Code | 1
On the Potential of CLIP for Compositional Logical Reasoning | | 0
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE | | 0
An Examination of the Compositionality of Large Generative Vision-Language Models | Code | 1
Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories | | 0
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models | | 0
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control | Code | 1
Tree-of-Mixed-Thought: Combining Fast and Slow Thinking for Multi-hop Visual Reasoning | | 0
Multimodal Analysis Of Google Bard And GPT-Vision: Experiments In Visual Reasoning | | 0
Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks | Code | 1
Learning logic programs by discovering higher-order abstractions | Code | 0
Learning Abstract Visual Reasoning via Task Decomposition: A Case Study in Raven Progressive Matrices | Code | 0
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | Code | 2
TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models | Code | 2
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks | | 0
LOIS: Looking Out of Instance Semantics for Visual Question Answering | | 0
Grounded Object Centric Learning | | 0
How is ChatGPT's behavior changing over time? | Code | 4
Does Visual Pretraining Help End-to-End Reasoning? | | 0
Abstracting Concept-Changing Rules for Solving Raven's Progressive Matrix Problems | | 0
Learning Differentiable Logic Programs for Abstract Visual Reasoning | Code | 1
Look, Remember and Reason: Grounded reasoning in videos with language models | | 0
Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages | Code | 0
PhD Thesis: Exploring the role of (self-)attention in cognitive and computer vision architecture | | 0
A Survey on Multimodal Large Language Models | | 0
V-LoL: A Diagnostic Dataset for Visual Logical Learning | Code | 0
A Domain-Independent Agent Architecture for Adaptive Operation in Evolving Open Worlds | | 0
Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding | Code | 0
Systematic Visual Reasoning through Object-Centric Relational Abstraction | Code | 0
Revisiting the Role of Language Priors in Vision-Language Models | Code | 1
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | Code | 1
What You See is What You Read? Improving Text-Image Alignment Evaluation | Code | 1
Measuring Progress in Fine-grained Vision-and-Language Understanding | Code | 1
Simple Token-Level Confidence Improves Caption Correctness | | 0
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | | 0
Otter: A Multi-Modal Model with In-Context Instruction Tuning | Code | 4
Visual Transformation Telling | Code | 0
Visual Reasoning: from State to Transformation | Code | 1
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Code | 7
Visual Instruction Tuning | Code | 6
The role of object-centric representations, guided attention, and external memory on generalizing visual relations | | 0
Page 8 of 14

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4o + CA | Text Score | 75.5 | | Unverified
2 | GPT-4V (CoT, pick b/w two options) | Text Score | 75.25 | | Unverified
3 | GPT-4V (pick b/w two options) | Text Score | 69.25 | | Unverified
4 | MMICL + CoCoT | Text Score | 64.25 | | Unverified
5 | GPT-4V + CoCoT | Text Score | 58.5 | | Unverified
6 | OpenFlamingo + CoCoT | Text Score | 58.25 | | Unverified
7 | GPT-4V | Text Score | 54.5 | | Unverified
8 | FIBER (EqSim) | Text Score | 51.5 | | Unverified
9 | FIBER (finetuned, Flickr30k) | Text Score | 51.25 | | Unverified
10 | MMICL + CCoT | Text Score | 51 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 91.51 | | Unverified
2 | X2-VLM (large) | Accuracy | 88.7 | | Unverified
3 | XFM (base) | Accuracy | 87.6 | | Unverified
4 | X2-VLM (base) | Accuracy | 86.2 | | Unverified
5 | CoCa | Accuracy | 86.1 | | Unverified
6 | VLMo | Accuracy | 85.64 | | Unverified
7 | VK-OOD | Accuracy | 84.6 | | Unverified
8 | SimVLM | Accuracy | 84.53 | | Unverified
9 | X-VLM (base) | Accuracy | 84.41 | | Unverified
10 | VK-OOD | Accuracy | 83.9 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | Accuracy | 92.58 | | Unverified
2 | X2-VLM (large) | Accuracy | 89.4 | | Unverified
3 | XFM (base) | Accuracy | 88.4 | | Unverified
4 | CoCa | Accuracy | 87 | | Unverified
5 | X2-VLM (base) | Accuracy | 87 | | Unverified
6 | VLMo | Accuracy | 86.86 | | Unverified
7 | SimVLM | Accuracy | 85.15 | | Unverified
8 | X-VLM (base) | Accuracy | 84.76 | | Unverified
9 | BLIP-129M | Accuracy | 83.09 | | Unverified
10 | ALBEF (14M) | Accuracy | 82.55 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | AI Core | Average-per ques. | 95.24 | | Unverified
2 | redherring | Average-per ques. | 91.14 | | Unverified
3 | VRDP | Average-per ques. | 90.24 | | Unverified
4 | Fightttttt | Average-per ques. | 88.71 | | Unverified
5 | neural | Average-per ques. | 88.27 | | Unverified
6 | NERV | Average-per ques. | 88.05 | | Unverified
7 | DCL | Average-per ques. | 75.52 | | Unverified
8 | troublesolver | Average-per ques. | 73.3 | | Unverified
9 | v0.1 | Average-per ques. | 73.1 | | Unverified
10 | First_test | Average-per ques. | 69.65 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Gemini-2.0 + CA | 2-Class Accuracy | 93.6 | | Unverified
2 | GPT-4o + CA | 2-Class Accuracy | 92.8 | | Unverified
3 | Human | 2-Class Accuracy | 91 | | Unverified
4 | SNAIL | 2-Class Accuracy | 64 | | Unverified
5 | InstructBLIP + GPT-4 | 2-Class Accuracy | 63.8 | | Unverified
6 | BLIP-2 + ChatGPT (Fine-tuned) | 2-Class Accuracy | 63.3 | | Unverified
7 | InstructBLIP + ChatGPT + Neuro-Symbolic | 2-Class Accuracy | 55.5 | | Unverified
8 | ChatCaptioner + ChatGPT | 2-Class Accuracy | 49.3 | | Unverified
9 | Otter | 2-Class Accuracy | 49.3 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Humans | Jaccard Index | 90 | | Unverified
2 | ViLT (Zero-Shot) | Jaccard Index | 52 | | Unverified
3 | X-VLM (Zero-Shot) | Jaccard Index | 46 | | Unverified
4 | CLIP-ViT-B/32 (Zero-Shot) | Jaccard Index | 41 | | Unverified
5 | CLIP-ViT-L/14 (Zero-Shot) | Jaccard Index | 40 | | Unverified
6 | CLIP-RN50x64/14 (Zero-Shot) | Jaccard Index | 38 | | Unverified
7 | CLIP-RN50 (Zero-Shot) | Jaccard Index | 35 | | Unverified
8 | CLIP-ViL (Zero-Shot) | Jaccard Index | 15 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | LXMERT | accuracy | 70.1 | | Unverified
2 | ViLT | accuracy | 69.3 | | Unverified
3 | CLIP (finetuned) | accuracy | 65.1 | | Unverified
4 | CLIP (frozen) | accuracy | 56 | | Unverified
5 | VisualBERT | accuracy | 55.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | RPIN | AUCCESS | 42.2 | | Unverified
2 | Dec[Joint]1f | AUCCESS | 40.3 | | Unverified
3 | Dynamics-Aware DQN | AUCCESS | 39.9 | | Unverified
4 | DQN | AUCCESS | 36.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Dynamics-Aware DQN | AUCCESS | 85.2 | | Unverified
2 | RPIN | AUCCESS | 85.2 | | Unverified
3 | Dec[Joint]1f | AUCCESS | 80 | | Unverified
4 | DQN | AUCCESS | 77.6 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Swin | 1:1 Accuracy | 52.9 | | Unverified
2 | ConvNeXt | 1:1 Accuracy | 51.2 | | Unverified
3 | ViT | 1:1 Accuracy | 50.3 | | Unverified
4 | DEiT | 1:1 Accuracy | 47.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Humans | 1-of-100 Accuracy | 100 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VisualBERT | Accuracy (Dev) | 67.4 | | Unverified