| LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models | May 23, 2023 | Common Sense Reasoning, Image Generation | Code Available | 2 | 5 |
| ConceptFusion: Open-set Multimodal 3D Mapping | Feb 14, 2023 | 3D geometry, Autonomous Driving | Code Available | 2 | 5 |
| Introducing Visual Perception Token into Multimodal Large Language Model | Feb 24, 2025 | Language Modeling, Language Modelling | Code Available | 2 | 5 |
| InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners | Apr 19, 2025 | Action Generation, Logical Reasoning | Code Available | 2 | 5 |
| GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning | May 22, 2025 | Attribute, Image Generation | Code Available | 2 | 5 |
| Getting it Right: Improving Spatial Consistency in Text-to-Image Models | Apr 1, 2024 | Spatial Reasoning | Code Available | 2 | 5 |
| IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes | Mar 20, 2025 | Scene Understanding, Spatial Reasoning | Code Available | 2 | 5 |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Aug 19, 2023 | MME, Optical Character Recognition (OCR) | Code Available | 2 | 5 |
| Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks | Mar 27, 2025 | Imitation Learning, Mathematical Reasoning | Code Available | 2 | 5 |
| Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead | Mar 31, 2025 | Math, Spatial Reasoning | Code Available | 2 | 5 |