Emerging Properties in Unified Multimodal Pretraining

2025-05-20

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan

Code

github.com/ByteDance-Seed/Bagel (PyTorch, ★ 5,762)
github.com/neverbiasu/ComfyUI-BAGEL (PyTorch, ★ 187)

Abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/

Tasks

Image Editing, Image Generation, Image Manipulation, Multimodal Generation, Multimodal Reasoning

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
GEdit-Bench-EN	BAGEL	Overall	6.52	—	Unverified
ImgEdit-Data	BAGEL	Overall	3.2	—	Unverified
