
MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation

2026-02-10 · Code Available

Yangcheng Yu, Xin Jin, Yu Shang, Xin Zhang, Haisheng Su, Wei Wu, Yong Li


Abstract

Embodied action planning is a core challenge in robotics, requiring models to generate precise actions from visual observations and language instructions. While video generation world models are promising, their reliance on pixel-level reconstruction often introduces visual redundancies that hinder action decoding and generalization. Latent world models offer a compact, motion-aware representation but overlook the fine-grained details critical for precise manipulation. To overcome these limitations, we propose MoWM, a mixture-of-world-models framework that fuses representations from hybrid world models for embodied action planning. Our approach combines motion-aware features from a latent world model with pixel-space features via latent-to-pixel modulation, enabling MoWM to emphasize the action-relevant visual details needed for action decoding. Extensive evaluations on the CALVIN benchmark and on real-world manipulation tasks demonstrate that our method achieves state-of-the-art task success rates and superior generalization. We also provide a comprehensive analysis of the strengths of each feature space, offering valuable insights for future research in embodied planning. The code is available at: https://github.com/tsinghua-fib-lab/MoWM.
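The abstract's "latent-to-pixel feature modulation" suggests a FiLM-style conditioning in which compact, motion-aware latent features produce per-channel scale and shift parameters applied to pixel-space feature maps before action decoding. The sketch below is a minimal illustration of that general idea, not the paper's actual implementation; the module name, dimensions, and the `(1 + gamma) * x + beta` form are all assumptions (see the released code for the real architecture).

```python
import torch
import torch.nn as nn


class LatentToPixelModulation(nn.Module):
    """FiLM-style sketch: latent world-model features generate per-channel
    scale/shift that modulate pixel-space feature maps.

    Hypothetical illustration only -- names and shapes are not from MoWM.
    """

    def __init__(self, latent_dim: int, pixel_channels: int):
        super().__init__()
        # Project the compact latent feature to a scale (gamma) and
        # shift (beta) for each pixel-feature channel.
        self.to_gamma_beta = nn.Linear(latent_dim, 2 * pixel_channels)

    def forward(self, latent_feat: torch.Tensor, pixel_feat: torch.Tensor) -> torch.Tensor:
        # latent_feat: (B, latent_dim) motion-aware latent world-model features
        # pixel_feat:  (B, C, H, W) pixel-space (video world model) features
        gamma, beta = self.to_gamma_beta(latent_feat).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1) for broadcasting
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        # Re-weight pixel features so action-relevant channels are emphasized.
        return (1 + gamma) * pixel_feat + beta


# Toy usage: the fused features would then feed an action-decoding head.
mod = LatentToPixelModulation(latent_dim=256, pixel_channels=64)
latent = torch.randn(2, 256)
pixels = torch.randn(2, 64, 16, 16)
fused = mod(latent, pixels)
print(fused.shape)  # torch.Size([2, 64, 16, 16])
```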
