
User-Friendly Customized Generation with Multi-Modal Prompts

2024-05-26

Linhao Zhong, Yan Hong, Wentao Chen, Binglin Zhou, Yiyi Zhang, Jianfu Zhang, Liqing Zhang


Abstract

Text-to-image generation models have seen considerable advancement, catering to the increasing interest in personalized image creation. Current customization techniques often require users to provide multiple images (typically 3-5) of each customized object, along with the object's class and descriptive textual prompts for scenes. This paper asks whether the process can be made more user-friendly and the customization more intricate. We propose a method in which users need only provide an image along with text for each customization topic, requiring just a single image per visual concept. We introduce the concept of a "multi-modal prompt", a novel integration of text and images tailored to each customization concept, which simplifies user interaction and enables precise customization of both objects and scenes. Our proposed paradigm for customized text-to-image generation surpasses existing finetuning-based methods in user-friendliness and in its ability to customize complex objects. Our code is available at https://github.com/zhongzero/Multi-Modal-Prompt.
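
To make the idea concrete, the sketch below shows one plausible way a multi-modal prompt could be represented as data: a scene template whose placeholder tokens each bind a single reference image to a short text description. This is a minimal illustration of the input format the abstract describes; the class names, field names, and file paths are assumptions for exposition, not the repository's actual interface.

from dataclasses import dataclass

@dataclass
class Concept:
    token: str        # placeholder used in the scene text, e.g. "<dog>"
    image_path: str   # the single reference image for this visual concept
    text: str         # free-form description accompanying the image

@dataclass
class MultiModalPrompt:
    scene: str               # scene prompt containing the concept tokens
    concepts: list[Concept]  # one entry per customized object or scene

# A user supplies one image plus text per concept, instead of 3-5 images
# and a class label as in earlier customization methods.
prompt = MultiModalPrompt(
    scene="a photo of <dog> playing in <garden> at sunset",
    concepts=[
        Concept("<dog>", "inputs/dog.jpg", "a small corgi with a red collar"),
        Concept("<garden>", "inputs/garden.jpg", "a lush backyard flower garden"),
    ],
)
print(prompt.scene)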
