| EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings | Mar 13, 2026 | | —Unverified | 2 |
| Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation | Mar 13, 2026 | | —Unverified | 2 |
| XSkill: Continual Learning from Experience and Skills in Multimodal Agents | Mar 13, 2026 | | —Unverified | 2 |
| OmniForcing: Unleashing Real-time Joint Audio-Visual Generation | Mar 13, 2026 | | —Unverified | 2 |
| IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse | Mar 12, 2026 | | —Unverified | 2 |
| OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams | Mar 12, 2026 | | —Unverified | 2 |
| NeuralOS: Towards Simulating Operating Systems via Neural Generative Models | Mar 12, 2026 | | —Unverified | 2 |
| Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training | Mar 12, 2026 | | —Unverified | 2 |
| Mobile-GS: Real-time Gaussian Splatting for Mobile Devices | Mar 12, 2026 | | —Unverified | 2 |
| Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct | Mar 12, 2026 | | —Unverified | 2 |
| ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation | Mar 12, 2026 | | —Unverified | 2 |
| Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention | Mar 11, 2026 | | —Unverified | 2 |
| Streaming Autoregressive Video Generation via Diagonal Distillation | Mar 11, 2026 | | —Unverified | 2 |
| LLM2Vec-Gen: Generative Embeddings from Large Language Models | Mar 11, 2026 | | —Unverified | 2 |
| MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data | Mar 10, 2026 | | —Unverified | 2 |
| ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA | Mar 10, 2026 | | —Unverified | 2 |
| Robot Control Stack: A Lean Ecosystem for Robot Learning at Scale | Mar 10, 2026 | | —Unverified | 2 |
| Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports | Mar 10, 2026 | | —Unverified | 2 |
| HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising | Mar 9, 2026 | | —Unverified | 2 |
| Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding | Mar 9, 2026 | | —Unverified | 2 |
| OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning | Mar 9, 2026 | | —Unverified | 2 |
| WildActor: Unconstrained Identity-Preserving Video Generation | Mar 9, 2026 | | —Unverified | 2 |
| ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning | Mar 7, 2026 | | —Unverified | 2 |
| Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion | Mar 6, 2026 | | —Unverified | 2 |
| Physical Simulator In-the-Loop Video Generation | Mar 6, 2026 | | —Unverified | 2 |
| Lost in Stories: Consistency Bugs in Long Story Generation by LLMs | Mar 6, 2026 | | —Unverified | 2 |
| NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation | Mar 5, 2026 | | —Unverified | 2 |
| MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing | Mar 5, 2026 | | —Unverified | 2 |
| From Word to World: Can Large Language Models be Implicit Text-based World Models? | Mar 5, 2026 | | —Unverified | 2 |
| Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs | Mar 5, 2026 | | —Unverified | 2 |
| Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator | Mar 5, 2026 | | —Unverified | 2 |
| Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels | Mar 5, 2026 | | —Unverified | 2 |
| RealWonder: Real-Time Physical Action-Conditioned Video Generation | Mar 5, 2026 | | —Unverified | 2 |
| ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding | Mar 5, 2026 | | —Unverified | 2 |
| OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs | Mar 5, 2026 | | —Unverified | 2 |
| Hyperspherical Latents Improve Continuous-Token Autoregressive Generation | Mar 5, 2026 | | —Unverified | 2 |
| EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding | Mar 4, 2026 | | —Unverified | 2 |
| CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video | Mar 4, 2026 | | —Unverified | 2 |
| Stochastic Self-Guidance for Training-Free Enhancement of Diffusion Models | Mar 4, 2026 | | —Unverified | 2 |
| RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies | Mar 4, 2026 | | —Unverified | 2 |
| Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents | Mar 4, 2026 | | —Unverified | 2 |
| VidEoMT: Your ViT is Secretly Also a Video Segmentation Model | Mar 4, 2026 | | —Unverified | 2 |
| Phi-4-reasoning-vision-15B Technical Report | Mar 4, 2026 | | —Unverified | 2 |
| Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling | Mar 4, 2026 | | —Unverified | 2 |
| SimRecon: SimReady Compositional Scene Reconstruction from Real Videos | Mar 3, 2026 | | —Unverified | 2 |
| Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle | Mar 3, 2026 | | —Unverified | 2 |
| InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation | Mar 3, 2026 | | —Unverified | 2 |
| Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization | Mar 3, 2026 | | —Unverified | 2 |
| Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing | Mar 3, 2026 | | —Unverified | 2 |
| HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images | Mar 3, 2026 | | —Unverified | 2 |