| Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning | Oct 2, 2020 | Novel Concepts, Representation Learning | Code Available | 1 |
| Learning Long-term Visual Dynamics with Region Proposal Interaction Networks | Aug 5, 2020 | Common Sense Reasoning, Object | Code Available | 1 |
| A Closer Look at Generalisation in RAVEN | Aug 1, 2020 | Visual Reasoning | Code Available | 1 |
| Learning to Discretely Compose Reasoning Module Networks for Video Captioning | Jul 17, 2020 | Decoder, Question Answering | Code Available | 1 |
| Forward Prediction for Physical Reasoning | Jun 18, 2020 | Prediction, Visual Reasoning | Code Available | 1 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 |
| Structured Multimodal Attentions for TextVQA | Jun 1, 2020 | Graph Attention, Optical Character Recognition (OCR) | Code Available | 1 |
| Attention-Based Context Aware Reasoning for Situation Recognition | Jun 1, 2020 | Action Recognition, Fine-grained Action Recognition | Code Available | 1 |
| Cross-Modality Relevance for Reasoning on Language and Vision | May 12, 2020 | Question Answering, Visual Question Answering | Code Available | 1 |
| Dynamic Language Binding in Relational Visual Reasoning | Apr 30, 2020 | Object, Question Answering | Code Available | 1 |
| Differentiable Adaptive Computation Time for Visual Reasoning | Apr 27, 2020 | Visual Reasoning | Code Available | 1 |
| Machine Number Sense: A Dataset of Visual Arithmetic Problems for Abstract and Relational Reasoning | Apr 25, 2020 | Relational Reasoning, Visual Reasoning | Code Available | 1 |
| Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers | Apr 2, 2020 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| Weakly Supervised Visual Semantic Parsing | Jan 8, 2020 | Graph Generation, Image Retrieval | Code Available | 1 |
| UNITER: UNiversal Image-TExt Representation Learning | Sep 25, 2019 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| Visual Semantic Reasoning for Image-Text Matching | Sep 6, 2019 | Cross-Modal Retrieval, Image Retrieval | Code Available | 1 |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Aug 20, 2019 | Language Modeling, Language Modelling | Code Available | 1 |
| PHYRE: A New Benchmark for Physical Reasoning | Aug 15, 2019 | Visual Reasoning | Code Available | 1 |
| VisualBERT: A Simple and Performant Baseline for Vision and Language | Aug 9, 2019 | Language Modeling, Language Modelling | Code Available | 1 |
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Aug 6, 2019 | Image Retrieval, Question Answering | Code Available | 1 |
| GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | Feb 25, 2019 | Question Answering, Visual Question Answering (VQA) | Code Available | 1 |
| Compositional Attention Networks for Machine Reasoning | Mar 8, 2018 | Referring Expression Comprehension, Visual Question Answering (VQA) | Code Available | 1 |
| FiLM: Visual Reasoning with a General Conditioning Layer | Sep 22, 2017 | Image Retrieval with Multi-Modal Query, Visual Question Answering (VQA) | Code Available | 1 |
| VSE++: Improving Visual-Semantic Embeddings with Hard Negatives | Jul 18, 2017 | Cross-Modal Retrieval, Image Retrieval | Code Available | 1 |
| CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning | Dec 20, 2016 | Diagnostic, Question Answering | Code Available | 1 |
| LaViPlan : Language-Guided Visual Path Planning with RLVR | Jul 17, 2025 | Autonomous Driving, Vision-Language-Action | Unverified | 0 |
| Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning | Jul 15, 2025 | Visual Reasoning | Unverified | 0 |
| PyVision: Agentic Vision with Dynamic Tooling | Jul 10, 2025 | Visual Reasoning | Unverified | 0 |
| Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning | Jul 9, 2025 | Benchmarking, Image Retrieval | Code Available | 0 |
| MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning | Jul 9, 2025 | Diagnostic, Multimodal Reasoning | Unverified | 0 |
| Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning | Jul 7, 2025 | Reinforcement Learning (RL), Visual Reasoning | Unverified | 0 |
| Foundation Models for Zero-Shot Segmentation of Scientific Images without AI-Ready Data | Jun 30, 2025 | Visual Reasoning, Zero Shot Segmentation | Unverified | 0 |
| Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs | Jun 27, 2025 | Visual Reasoning | Unverified | 0 |
| MiCo: Multi-image Contrast for Reinforcement Visual Reasoning | Jun 27, 2025 | Logical Reasoning, Representation Learning | Unverified | 0 |
| HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation | Jun 26, 2025 | counterfactual, Counterfactual Reasoning | Unverified | 0 |
| World-aware Planning Narratives Enhance Large Vision-Language Model Planner | Jun 26, 2025 | Imitation Learning, Language Modeling | Unverified | 0 |
| VLM@school -- Evaluation of AI image understanding on German middle school knowledge | Jun 13, 2025 | Visual Reasoning | Unverified | 0 |
| VGR: Visual Grounded Reasoning | Jun 13, 2025 | Large Language Model, Math | Unverified | 0 |
| LLMs Are Not Yet Ready for Deepfake Image Detection | Jun 12, 2025 | DeepFake Detection, Face Swapping | Unverified | 0 |
| ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering | Jun 11, 2025 | Chart Question Answering, Image to text | Unverified | 0 |
| VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism | Jun 10, 2025 | Mathematical Reasoning, Visual Reasoning | Code Available | 0 |
| Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions | Jun 10, 2025 | Visual Reasoning | Unverified | 0 |
| VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning | Jun 10, 2025 | Task Planning, Visual Reasoning | Unverified | 0 |
| Language-Vision Planner and Executor for Text-to-Visual Reasoning | Jun 9, 2025 | In-Context Learning, MME | Unverified | 0 |
| KokushiMD-10: Benchmark for Evaluating Large Language Models on Ten Japanese National Healthcare Licensing Examinations | Jun 9, 2025 | Multimodal Reasoning, Visual Reasoning | Unverified | 0 |
| Synthetic Visual Genome | Jun 9, 2025 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning | Jun 8, 2025 | Attribute, Hallucination | Unverified | 0 |
| MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems? | Jun 6, 2025 | Automated Theorem Proving, Visual Reasoning | Unverified | 0 |
| Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark | Jun 4, 2025 | Sentence, Visual Reasoning | Unverified | 0 |
| ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning | Jun 4, 2025 | Image Generation, Visual Reasoning | Code Available | 0 |