| Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks | May 30, 2025 | Autonomous Driving, Math | Code Available | 1 | 5 |
| Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification | Jun 8, 2025 | Question Answering, Visual Question Answering | Code Available | 1 | 5 |
| HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | Mar 19, 2024 | Reinforcement Learning (RL), Visual Grounding | Code Available | 1 | 5 |
| Structured Multimodal Attentions for TextVQA | Jun 1, 2020 | Graph Attention, Optical Character Recognition (OCR) | Code Available | 1 | 5 |
| How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game | Mar 13, 2025 | Multimodal Reasoning, Question Answering | Code Available | 1 | 5 |
| LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models | Jul 23, 2024 | Multimodal Reasoning, Prompt Engineering | Code Available | 1 | 5 |
| Agentic Keyframe Search for Video Question Answering | Mar 20, 2025 | EgoSchema, Question Answering | Code Available | 1 | 5 |
| How Far Are We from Intelligent Visual Deductive Reasoning? | Mar 7, 2024 | In-Context Learning, Visual Reasoning | Code Available | 1 | 5 |
| Beyond Embeddings: The Promise of Visual Table in Visual Reasoning | Mar 27, 2024 | Representation Learning, Visual Question Answering | Code Available | 1 | 5 |
| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Sep 23, 2024 | Image Comprehension, Referring Expression | Code Available | 1 | 5 |
| Slot State Space Models | Jun 18, 2024 | Mamba, State Space Models | Code Available | 1 | 5 |
| ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension | Jun 17, 2024 | Decoder, Visual Reasoning | Code Available | 1 | 5 |
| Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Feb 21, 2025 | Math, Mathematical Problem-Solving | Code Available | 1 | 5 |
| Forward Prediction for Physical Reasoning | Jun 18, 2020 | Prediction, Visual Reasoning | Code Available | 1 | 5 |
| ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models | Jan 24, 2024 | Visual Reasoning | Code Available | 1 | 5 |
| Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training | Jan 1, 2023 | 3D dense captioning, 3D visual grounding | Code Available | 1 | 5 |
| From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization | May 22, 2025 | Visual Reasoning | Code Available | 1 | 5 |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Dec 5, 2023 | Benchmarking, Visual Question Answering | Code Available | 1 | 5 |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | Nov 27, 2023 | Language Modelling, Large Language Model | Code Available | 1 | 5 |
| A Survey on Interpretable Cross-modal Reasoning | Sep 5, 2023 | Cross-Modal Retrieval, Decision Making | Code Available | 1 | 5 |
| From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis | Jun 28, 2024 | Visual Question Answering (VQA), Visual Reasoning | Code Available | 1 | 5 |
| Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs | Jul 26, 2024 | Action Generation, Large Language Model | Code Available | 1 | 5 |
| Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning | Dec 1, 2022 | Domain Generalization, Question Answering | Code Available | 1 | 5 |
| Abstract Visual Reasoning: An Algebraic Approach for Solving Raven's Progressive Matrices | Mar 21, 2023 | Visual Reasoning | Code Available | 1 | 5 |
| CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations | Apr 5, 2022 | Explanation Generation, Question Answering | Code Available | 1 | 5 |
| Complete 3D Scene Parsing from an RGBD Image | Oct 25, 2017 | Diversity, Retrieval | Code Available | 0 | 5 |
| SAViR-T: Spatially Attentive Visual Reasoning with Transformers | Jun 18, 2022 | Inductive Bias, Visual Reasoning | Code Available | 0 | 5 |
| HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models | Dec 29, 2024 | Hallucination, Object | Code Available | 0 | 5 |
| Collecting Visually-Grounded Dialogue with A Game Of Sorts | Sep 10, 2023 | Coreference Resolution, Image Retrieval | Code Available | 0 | 5 |
| Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning | Mar 1, 2024 | Disentanglement, Informativeness | Code Available | 0 | 5 |
| RVTBench: A Benchmark for Visual Reasoning Tasks | May 17, 2025 | Reasoning Segmentation, Visual Question Answering (VQA) | Code Available | 0 | 5 |
| QLEVR: A Diagnostic Dataset for Quantificational Language and Elementary Visual Reasoning | May 6, 2022 | Diagnostic, Question Answering | Code Available | 0 | 5 |
| A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering | Oct 1, 2022 | Medical Visual Question Answering, Question Answering | Code Available | 0 | 5 |
| Prompting Large Vision-Language Models for Compositional Reasoning | Jan 20, 2024 | Retrieval, Visual Reasoning | Code Available | 0 | 5 |
| GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives | Dec 7, 2023 | Graph Generation, Language Modelling | Code Available | 0 | 5 |
| Predicting Complete 3D Models of Indoor Scenes | Apr 9, 2015 | Diversity, Visual Reasoning | Code Available | 0 | 5 |
| Program synthesis performance constrained by non-linear spatial relations in Synthetic Visual Reasoning Test | Nov 18, 2019 | Few-Shot Learning, Program Synthesis | Code Available | 0 | 5 |
| Physical Reasoning Using Dynamics-Aware Models | Feb 20, 2021 | Visual Reasoning | Code Available | 0 | 5 |
| Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models | Dec 11, 2024 | Question Answering, Visual Grounding | Code Available | 0 | 5 |
| Raven's Progressive Matrices Completion with Latent Gaussian Process Priors | Mar 22, 2021 | Answer Selection, Gaussian Processes | Code Available | 0 | 5 |
| CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions | Jan 3, 2019 | Diagnostic, Image Segmentation | Code Available | 0 | 5 |
| CLEVR Parser: A Graph Parser Library for Geometric Learning on Language Grounded Image Scenes | Sep 19, 2020 | Graph Neural Network, Visual Reasoning | Code Available | 0 | 5 |
| Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning | Jul 9, 2025 | Benchmarking, Image Retrieval | Code Available | 0 | 5 |
| CLEVRER: CoLlision Events for Video REpresentation and Reasoning | Oct 3, 2019 | counterfactual, Descriptive | Code Available | 0 | 5 |
| On Erroneous Agreements of CLIP Image Embeddings | Nov 7, 2024 | Visual Reasoning | Code Available | 0 | 5 |
| One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems | Dec 15, 2023 | Odd One Out, Transfer Learning | Code Available | 0 | 5 |
| A Distance-preserving Matrix Sketch | Sep 8, 2020 | Clustering, feature selection | Code Available | 0 | 5 |
| Odd-One-Out Representation Learning | Dec 14, 2020 | Disentanglement, Metric Learning | Code Available | 0 | 5 |
| FigureQA: An Annotated Figure Dataset for Visual Reasoning | Oct 19, 2017 | BIG-bench Machine Learning, Chart Question Answering | Code Available | 0 | 5 |
| Object Level Visual Reasoning in Videos | Jun 16, 2018 | Activity Recognition, Human Activity Recognition | Code Available | 0 | 5 |