| WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models | Jul 25, 2022 | Common Sense Reasoning, General Knowledge | Code Available | 0 | 5 |
| Making History Matter: History-Advantage Sequence Training for Visual Dialog | Feb 25, 2019 | Answer Generation, Decoder | Unverified | 0 | 0 |
| MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning | Oct 9, 2022 | Image-text Retrieval, multimodal interaction | Unverified | 0 | 0 |
| Explicit3D: Graph Network with Spatial Inference for Single Image 3D Object Detection | Feb 13, 2023 | 3D Object Detection, Graph Generation | Unverified | 0 | 0 |
| Abductive Symbolic Solver on Abstraction and Reasoning Corpus | Nov 27, 2024 | ARC, Visual Reasoning | Unverified | 0 | 0 |
| MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science | Jan 18, 2025 | Visual Reasoning | Unverified | 0 | 0 |
| Visual Commonsense based Heterogeneous Graph Contrastive Learning | Nov 11, 2023 | Contrastive Learning, Question Answering | Unverified | 0 | 0 |
| 3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow | Jan 28, 2025 | Instruction Following, Mixture-of-Experts | Unverified | 0 | 0 |
| MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems? | Jun 6, 2025 | Automated Theorem Proving, Visual Reasoning | Unverified | 0 | 0 |
| Visual Entailment: A Novel Task for Fine-Grained Image Understanding | Jan 20, 2019 | Natural Language Inference, Question Answering | Unverified | 0 | 0 |
| Explainable AI And Visual Reasoning: Insights From Radiology | Apr 6, 2023 | Diagnostic, Explainable Artificial Intelligence (XAI) | Unverified | 0 | 0 |
| Measuring CLEVRness: Black-box Testing of Visual Reasoning Models | Sep 29, 2021 | Benchmarking, Diagnostic | Unverified | 0 | 0 |
| Measuring CLEVRness: Blackbox testing of Visual Reasoning Models | Feb 24, 2022 | Benchmarking, Diagnostic | Unverified | 0 | 0 |
| Learning to Assemble Neural Module Tree Networks for Visual Grounding | Dec 8, 2018 | Dependency Parsing, Natural Language Visual Grounding | Unverified | 0 | 0 |
| Analysis of Visual Reasoning on One-Stage Object Detection | Feb 26, 2022 | Object, object-detection | Unverified | 0 | 0 |
| MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models | Feb 15, 2025 | Natural Language Understanding, Visual Reasoning | Unverified | 0 | 0 |
| MiCo: Multi-image Contrast for Reinforcement Visual Reasoning | Jun 27, 2025 | Logical Reasoning, Representation Learning | Unverified | 0 | 0 |
| Visual In-Context Learning for Large Vision-Language Models | Feb 18, 2024 | In-Context Learning, Position | Unverified | 0 | 0 |
| EXCLAIM: An Explainable Cross-Modal Agentic System for Misinformation Detection with Hierarchical Retrieval | Mar 1, 2025 | Explanation Generation, Misinformation | Unverified | 0 | 0 |
| EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE | Aug 23, 2023 | Image-text matching, Image-text Retrieval | Unverified | 0 | 0 |
| MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM | May 30, 2025 | Hallucination, Multimodal Reasoning | Unverified | 0 | 0 |
| Leveraging VLM-Based Pipelines to Annotate 3D Objects | Nov 29, 2023 | In-Context Learning, Language Modeling | Unverified | 0 | 0 |
| Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration | Jun 24, 2024 | Diversity, Multiple-choice | Unverified | 0 | 0 |
| M-LLM Based Video Frame Selection for Efficient Video Understanding | Feb 27, 2025 | EgoSchema, Language Modeling | Unverified | 0 | 0 |
| Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark | Jun 4, 2025 | Sentence, Visual Reasoning | Unverified | 0 | 0 |
| MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning | May 28, 2024 | Decision Making, Video Understanding | Unverified | 0 | 0 |
| MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct | Sep 9, 2024 | Diversity, Visual Reasoning | Unverified | 0 | 0 |
| EuclidNet: Deep Visual Reasoning for Constructible Problems in Geometry | Dec 27, 2022 | Automated Theorem Proving, Visual Reasoning | Unverified | 0 | 0 |
| Interactive Visual Reasoning under Uncertainty | Jun 18, 2022 | Visual Reasoning | Unverified | 0 | 0 |
| Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model | May 16, 2024 | Image Inpainting, In-Context Learning | Unverified | 0 | 0 |
| MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics? | Jun 28, 2024 | Task Planning, Visual Reasoning | Unverified | 0 | 0 |
| MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems | Feb 27, 2025 | Benchmarking, Visual Reasoning | Unverified | 0 | 0 |
| Modeling Gestalt Visual Reasoning on the Raven's Progressive Matrices Intelligence Test Using Generative Image Inpainting Techniques | Nov 18, 2019 | Image Inpainting, Visual Reasoning | Unverified | 0 | 0 |
| Modelling Working Memory using Deep Recurrent Reinforcement Learning | Sep 11, 2019 | Decision Making, reinforcement-learning | Unverified | 0 | 0 |
| Modularity Matters: Learning Invariant Relational Reasoning Tasks | Jun 18, 2018 | Mixture-of-Experts, Relational Reasoning | Unverified | 0 | 0 |
| Modulated Self-attention Convolutional Network for VQA | Oct 8, 2019 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | Jan 29, 2024 | Benchmarking, Image Comprehension | Unverified | 0 | 0 |
| Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models | Nov 27, 2024 | Visual Reasoning | Unverified | 0 | 0 |
| Multi-Granularity Modularized Network for Abstract Visual Reasoning | Jul 9, 2020 | Visual Grounding, Visual Reasoning | Unverified | 0 | 0 |
| Visual Language Models show widespread visual deficits on neuropsychological tests | Apr 15, 2025 | Object Recognition, Visual Reasoning | Unverified | 0 | 0 |
| AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning | Mar 30, 2021 | Question Answering, Video Question Answering | Unverified | 0 | 0 |
| A Continual Learning Paradigm for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks | Sep 18, 2023 | Continual Learning, Visual Reasoning | Unverified | 0 | 0 |
| Affordance-Guided Reinforcement Learning via Visual Prompting | Jul 14, 2024 | reinforcement-learning, Reinforcement Learning | Unverified | 0 | 0 |
| Enhancing Advanced Visual Reasoning Ability of Large Language Models | Sep 21, 2024 | In-Context Learning, Visual Reasoning | Unverified | 0 | 0 |
| Multimodal Representations for Teacher-Guided Compositional Visual Reasoning | Oct 24, 2023 | Question Answering, Visual Question Answering | Unverified | 0 | 0 |
| End-to-End Learning of Semantic Grasping | Jul 6, 2017 | Object, object-detection | Unverified | 0 | 0 |
| Superpixel Semantics Representation and Pre-training for Vision-Language Task | Oct 20, 2023 | Self-Supervised Learning, Superpixels | Unverified | 0 | 0 |
| End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models | Feb 24, 2025 | Visual Reasoning | Unverified | 0 | 0 |
| EgoReID: Cross-view Self-Identification and Human Re-identification in Egocentric and Surveillance Videos | Dec 24, 2016 | Person Re-Identification, Visual Reasoning | Unverified | 0 | 0 |
| EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | Dec 19, 2024 | Change Detection, Disaster Response | Unverified | 0 | 0 |