| Grammar-Based Grounded Lexicon Learning | Feb 17, 2022 | Network Embedding, Sentence | —Unverified | 0 |
| CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | Jan 5, 2024 | Image Comprehension, Image to text | —Unverified | 0 |
| A Domain-Independent Agent Architecture for Adaptive Operation in Evolving Open Worlds | Jun 9, 2023 | Minecraft, Visual Reasoning | —Unverified | 0 |
| 3D Concept Learning and Reasoning from Multi-View Images | Mar 20, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 |
| GenVP: Generating Visual Puzzles with Contrastive Hierarchical VAEs | Mar 30, 2025 | Visual Reasoning | —Unverified | 0 |
| Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts | Jan 1, 2024 | Image Generation, Instruction Following | —Unverified | 0 |
| CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering | May 13, 2024 | Audio-visual Question Answering, Audio-Visual Question Answering (AVQA) | —Unverified | 0 |
| A Survey on Multimodal Large Language Models | Jun 23, 2023 | Hallucination, In-Context Learning | —Unverified | 0 |
| A-I-RAVEN and I-RAVEN-Mesh: Two New Benchmarks for Abstract Visual Reasoning | Jun 16, 2024 | Transfer Learning, Visual Reasoning | —Unverified | 0 |
| GAM-Agent: Game-Theoretic and Uncertainty-Aware Collaboration for Complex Visual Reasoning | May 29, 2025 | Multimodal Reasoning, MVBench | —Unverified | 0 |
| FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving | May 23, 2025 | Autonomous Driving, Image Generation | —Unverified | 0 |
| From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation | Nov 21, 2023 | Explanation Generation, Visual Question Answering (VQA) | —Unverified | 0 |
| A survey on knowledge-enhanced multimodal learning | Nov 19, 2022 | Conditional Image Generation, Factual Visual Question Answering | —Unverified | 0 |
| Look, Remember and Reason: Grounded reasoning in videos with language models | Jun 30, 2023 | Object, object-detection | —Unverified | 0 |
| MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning | Jul 9, 2025 | Diagnostic, Multimodal Reasoning | —Unverified | 0 |
| From Visual to Acoustic Question Answering | Feb 28, 2019 | Acoustic Question Answering, Position | —Unverified | 0 |
| Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans | May 16, 2025 | Multimodal Reasoning, Visual Reasoning | —Unverified | 0 |
| From Shallow to Deep: Compositional Reasoning over Graphs for Visual Question Answering | Jun 25, 2022 | Question Answering, Visual Question Answering | —Unverified | 0 |
| LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction | Jan 3, 2025 | Anomaly Detection, Visual Reasoning | —Unverified | 0 |
| From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration | Mar 17, 2025 | Denoising, Question Answering | —Unverified | 0 |
| A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law | May 5, 2025 | Math, Medical Diagnosis | —Unverified | 0 |
| From Code to Compliance: Assessing ChatGPT's Utility in Designing an Accessible Webpage -- A Case Study | Jan 7, 2025 | Prompt Engineering, Visual Reasoning | —Unverified | 0 |
| Foundation Models for Zero-Shot Segmentation of Scientific Images without AI-Ready Data | Jun 30, 2025 | Visual Reasoning, Zero Shot Segmentation | —Unverified | 0 |
| A Divide-Align-Conquer Strategy for Program Synthesis | Jan 8, 2023 | ARC, Inductive logic programming | —Unverified | 0 |
| ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization | Oct 14, 2024 | Explanation Generation, Image Forgery Detection | —Unverified | 0 |
| ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling | Aug 7, 2024 | Attribute, Language Modeling | —Unverified | 0 |
| Filling in the details: Perceiving from low fidelity images | Apr 14, 2016 | Foveation, Visual Reasoning | —Unverified | 0 |
| CityLoc: 6DoF Pose Distributional Localization for Text Descriptions in Large-Scale Scenes with Gaussian Representation | Jan 15, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |
| Abstracting Concept-Changing Rules for Solving Raven's Progressive Matrix Problems | Jul 15, 2023 | Answer Generation, Answer Selection | —Unverified | 0 |
| Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs | Apr 30, 2025 | Hallucination, Hallucination Evaluation | —Unverified | 0 |
| LOIS: Looking Out of Instance Semantics for Visual Question Answering | Jul 26, 2023 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Few-shot Visual Reasoning with Meta-analogical Contrastive Learning | Jul 23, 2020 | Contrastive Learning, Logical Reasoning | —Unverified | 0 |
| Chop Chop BERT: Visual Question Answering by Chopping VisualBERT's Heads | Apr 30, 2021 | Question Answering, Visual Question Answering | —Unverified | 0 |
| Few-shot Subgoal Planning with Language Models | May 28, 2022 | Language Modeling, Language Modelling | —Unverified | 0 |
| Few-Shot Abstract Visual Reasoning With Spectral Features | Oct 4, 2019 | Few-Shot Learning, Visual Reasoning | —Unverified | 0 |
| Chitrarth: Bridging Vision and Language for a Billion People | Feb 21, 2025 | Diversity, Language Modeling | —Unverified | 0 |
| A Review of Emerging Research Directions in Abstract Visual Reasoning | Feb 21, 2022 | Visual Reasoning | —Unverified | 0 |
| Factorization of View-Object Manifolds for Joint Object Recognition and Pose Estimation | Mar 23, 2015 | Object, Object Recognition | —Unverified | 0 |
| Eyeballing Combinatorial Problems: A Case Study of Using Multimodal Large Language Models to Solve Traveling Salesman Problems | Jun 11, 2024 | In-Context Learning, Traveling Salesman Problem | —Unverified | 0 |
| Explicit Knowledge Incorporation for Visual Reasoning | Jun 19, 2021 | Visual Reasoning | —Unverified | 0 |
| Chat2Layout: Interactive 3D Furniture Layout with a Multimodal LLM | Jul 31, 2024 | In-Context Learning, Layout Design | —Unverified | 0 |
| Abstract Diagrammatic Reasoning with Multiplex Graph Networks | Jun 19, 2020 | Graph Neural Network, Visual Reasoning | —Unverified | 0 |
| Explicit3D: Graph Network with Spatial Inference for Single Image 3D Object Detection | Feb 13, 2023 | 3D Object Detection, Graph Generation | —Unverified | 0 |
| Are Elephants Bigger than Butterflies? Reasoning about Sizes of Objects | Feb 2, 2016 | Visual Reasoning | —Unverified | 0 |
| Data augmentation by morphological mixup for solving Raven's Progressive Matrices | Mar 9, 2021 | Data Augmentation, Visual Reasoning | —Unverified | 0 |
| Explainable AI And Visual Reasoning: Insights From Radiology | Apr 6, 2023 | Diagnostic, Explainable Artificial Intelligence (XAI) | —Unverified | 0 |
| Learning to Assemble Neural Module Tree Networks for Visual Grounding | Dec 8, 2018 | Dependency Parsing, Natural Language Visual Grounding | —Unverified | 0 |
| ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering | Jun 11, 2025 | Chart Question Answering, Image to text | —Unverified | 0 |
| 3D Concept Grounding on Neural Fields | Jul 13, 2022 | Instance Segmentation, Question Answering | —Unverified | 0 |
| EXCLAIM: An Explainable Cross-Modal Agentic System for Misinformation Detection with Hierarchical Retrieval | Mar 1, 2025 | Explanation Generation, Misinformation | —Unverified | 0 |