| Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs | Jan 27, 2026 | | —Unverified | 2 |
| OmniGAIA: Towards Native Omni-Modal AI Agents | Feb 28, 2026 | | —Unverified | 2 |
| DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories | Feb 11, 2026 | | —Unverified | 2 |
| CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering | Feb 28, 2026 | | —Unverified | 2 |
| StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets? | Mar 2, 2026 | | —Unverified | 2 |
| Deforming Videos to Masks: Flow Matching for Referring Video Segmentation | Feb 26, 2026 | | —Unverified | 2 |
| AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories | Feb 16, 2026 | | —Unverified | 2 |
| Enhancing Spatial Understanding in Image Generation via Reward Modeling | Feb 27, 2026 | | —Unverified | 2 |
| EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing | Mar 19, 2026 | | —Unverified | 2 |
| Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion | Mar 18, 2026 | | —Unverified | 2 |
| Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator | Mar 5, 2026 | | —Unverified | 2 |
| ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding | Mar 5, 2026 | | —Unverified | 2 |
| Latent Denoising Makes Good Tokenizers | Feb 14, 2026 | | —Unverified | 2 |
| VLANeXt: Recipes for Building Strong VLA Models | Feb 20, 2026 | | —Unverified | 2 |
| NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents | Feb 24, 2026 | | —Unverified | 2 |
| Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation | Feb 2, 2026 | | —Unverified | 2 |
| WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories | Mar 2, 2026 | | —Unverified | 2 |
| Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play? | Feb 28, 2026 | | —Unverified | 2 |
| How to Correctly Report LLM-as-a-Judge Evaluations | Feb 9, 2026 | | —Unverified | 2 |
| The Trinity of Consistency as a Defining Principle for General World Models | Feb 26, 2026 | | —Unverified | 2 |
| Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling | Jan 31, 2026 | | —Unverified | 2 |
| XSkill: Continual Learning from Experience and Skills in Multimodal Agents | Mar 13, 2026 | | —Unverified | 2 |
| OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot | Feb 23, 2026 | | —Unverified | 2 |
| Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention | Mar 11, 2026 | | —Unverified | 2 |
| UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation | Feb 24, 2026 | | —Unverified | 2 |
| Unified Multimodal Models as Auto-Encoders | Feb 26, 2026 | | —Unverified | 2 |
| OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams | Mar 12, 2026 | | —Unverified | 2 |
| Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding | Jan 28, 2026 | | —Unverified | 2 |
| Streaming Autoregressive Video Generation via Diagonal Distillation | Mar 11, 2026 | | —Unverified | 2 |
| Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs | Mar 5, 2026 | | —Unverified | 2 |
| Experiential Reinforcement Learning | Feb 15, 2026 | | —Unverified | 2 |
| SimVLA: A Simple VLA Baseline for Robotic Manipulation | Feb 20, 2026 | | —Unverified | 2 |
| InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation | Mar 3, 2026 | | —Unverified | 2 |
| SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation | Mar 13, 2026 | | —Unverified | 2 |
| Efficient Reasoning with Balanced Thinking | Mar 19, 2026 | | —Unverified | 2 |
| Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs | Feb 12, 2026 | | —Unverified | 2 |
| Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion | Mar 6, 2026 | | —Unverified | 2 |
| MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation | Feb 19, 2026 | | —Unverified | 2 |
| EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings | Mar 13, 2026 | | —Unverified | 2 |
| Scaling Behavior Cloning Improves Causal Reasoning: An Open Model for Real-Time Video Game Playing | Jan 28, 2026 | | —Unverified | 2 |
| Hyperspherical Latents Improve Continuous-Token Autoregressive Generation | Mar 5, 2026 | | —Unverified | 2 |
| RealWonder: Real-Time Physical Action-Conditioned Video Generation | Mar 5, 2026 | | —Unverified | 2 |
| X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests | Feb 1, 2026 | | —Unverified | 2 |
| From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors | Feb 27, 2026 | | —Unverified | 2 |
| Towards Pixel-Level VLM Perception via Simple Points Prediction | Jan 27, 2026 | | —Unverified | 2 |
| The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs | Mar 13, 2026 | | —Unverified | 2 |
| Learning a Generative Meta-Model of LLM Activations | Feb 6, 2026 | | —Unverified | 2 |
| Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration? | Feb 4, 2026 | | —Unverified | 2 |
| ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation | Mar 12, 2026 | | —Unverified | 2 |
| EEG Foundation Models: Progresses, Benchmarking, and Open Problems | Feb 5, 2026 | | —Unverified | 2 |