
Setting the Stage: Text-Driven Scene-Consistent Image Generation

2026-03-16

Cong Xie, Che Wang, Yan Zhang, Ruiqi Yu, Han Zou, Zheng Pan, Zhenpeng Zhan


Abstract

We focus on the foundational task of Scene Staging: given a reference scene image and a text condition that specifies an actor category and its spatial relation to the scene, the goal is to synthesize an output image that preserves the scene identity of the reference while generating the actor according to the spatial relation described in the text. Existing methods struggle with this task, largely due to the scarcity of high-quality paired data and unconstrained generation objectives. To overcome the data bottleneck, we propose a novel data construction pipeline that combines real-world photographs, entity removal, and image-to-video diffusion models to generate training pairs with diverse scenes, viewpoints, and correct entity-scene relationships. We further introduce a novel correspondence-guided attention loss that leverages cross-view cues to enforce spatial alignment with the reference scene. Experiments on our scene-consistent benchmark show that our approach achieves better scene alignment and text-image alignment than state-of-the-art baselines, according to both automatic metrics and human preference studies. Our method generates images with diverse viewpoints and compositions while faithfully following textual instructions and preserving the reference scene identity.
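The abstract does not give the form of the correspondence-guided attention loss. One plausible reading, sketched here purely as an illustration (the function name, mask convention, and loss form are all assumptions, not the paper's definition), is to penalize attention maps that place little mass on reference-scene locations known to correspond to each query:

```python
import numpy as np

def correspondence_guided_attention_loss(attn, corr_mask, eps=1e-8):
    """Hypothetical sketch of a correspondence-guided attention loss.

    attn:      (Q, K) row-stochastic attention map (each query row sums to 1),
               where keys index locations in the reference scene image.
    corr_mask: (Q, K) binary mask; corr_mask[q, k] = 1 if query location q
               corresponds to reference location k under cross-view cues.
    Returns the mean negative log of the attention mass that each matched
    query places on its corresponding reference locations.
    """
    mass = (attn * corr_mask).sum(axis=-1)   # per-query mass on matched keys
    valid = corr_mask.any(axis=-1)           # only queries with a known match
    return float(-np.log(mass[valid] + eps).mean())
```

Under this sketch, attention concentrated on the corresponding locations drives the loss toward zero, while attention scattered elsewhere inflates it, which is one way to "enforce spatial alignment with the reference scene" as the abstract describes.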
