Illiterate DALLE Learns to Compose

2021-09-29ICLR 2022Unverified0· sign in to hype

Gautam Singh, Fei Deng, Sungjin Ahn

Unverified — Be the first to reproduce this paper.

Abstract

DALLE has shown an impressive ability of composition-based systematic generalization in image generation. This is possible because it utilizes the dataset of text-image pairs where the text provides the source of compositionality. Following this result, an important extending question is whether this compositionality can still be achieved even without conditioning on text. In this paper, we propose an architecture called Slot2Seq that achieves this text-free DALLE by learning compositional slot-based representations purely from images, an ability lacking in DALLE. Unlike existing object-centric representation models that decode pixels independently for each slot and each pixel location and compose them via mixture-based alpha composition, we propose to use the Image GPT decoder conditioned on the slots for a more flexible generation by capturing complex interaction among the pixels and the slots. In experiments, we show that this simple architecture achieves zero-shot generation of novel images without text and better quality in generation than the models based on mixture decoders.

Tasks

Decoder Image Generation Systematic Generalization

Illiterate DALLE Learns to Compose

Abstract

Tasks

Reproductions