Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition

*Equal Contribution
UC Berkeley · Oxford · NTU · UW · HKU · Harvard · UC Davis
BMVC 2025

Figure 1. Gen4Gen composes personalized concepts into realistic scenes with complex compositions, accompanied by detailed text descriptions.

TL;DR
  • The problem: Current personalization techniques for text-to-image diffusion models fail to extend to multiple concepts, likely due to the mismatch between complex scenes and simple text descriptions in pre-training datasets.
  • Our approach: Gen4Gen is a semi-automated data creation pipeline that leverages image foreground extraction, LLMs, MLLMs, and inpainting to compose realistic, personalized images with accurate text descriptions, producing the MyCanvas dataset.
  • The payoff: By improving data quality and prompting strategies alone, we significantly increase multi-concept personalized image generation quality, without requiring any modifications to model architecture or training algorithms.

Abstract

Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts (e.g., their own pets or specific items) with just a few examples for training. This paper tackles two interconnected issues within this realm of personalizing text-to-image diffusion models. First, current personalization techniques fail to extend to multiple concepts; we hypothesize this to be due to the mismatch between complex scenes and simple text descriptions in the pre-training dataset (e.g., LAION). Second, given an image containing multiple personalized concepts, there is no holistic metric that evaluates not just how closely each personalized concept resembles its reference, but also whether all concepts are present in the image and whether the image accurately reflects the overall text description. To address these issues, we create MyCanvas, a semi-automatically created dataset containing multiple personalized concepts in complex compositions, accompanied by accurate text descriptions. In addition, we design a comprehensive metric comprising two scores (CP-CLIP and TI-CLIP) to better quantify the performance of multi-concept, personalized text-to-image diffusion methods. We provide a simple baseline built on top of Custom Diffusion with empirical prompting strategies for future researchers to evaluate on MyCanvas. We show that by improving data quality and prompting strategies, we can significantly increase multi-concept personalized image generation quality without requiring any modifications to model architecture or training algorithms. We demonstrate that chaining strong foundation models could be a promising direction for generating high-quality datasets targeting a variety of challenging tasks in the computer vision community.

Key Findings


Data Quality Over Model Complexity

By improving data quality and prompting strategies alone, multi-concept personalized image generation quality increases significantly, without requiring any changes to model architecture or training algorithms.

MyCanvas Dataset and Comprehensive Metrics

We introduce MyCanvas, a semi-automatically created dataset with multiple personalized concepts in complex compositions, along with CP-CLIP and TI-CLIP scores for holistic multi-concept evaluation.
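
For intuition, the sketch below shows one way such CLIP-based scores could be computed. It is a minimal illustration, not the paper's exact protocol: we assume CP-CLIP averages image-image CLIP similarities between per-concept crops of the generated image and their reference photos, and TI-CLIP is the text-image CLIP similarity between the generated image and its prompt. The helper names, the Hugging Face checkpoint, and the cropping step are our assumptions.

```python
# Minimal sketch in the spirit of CP-CLIP / TI-CLIP (assumptions noted above).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def image_embed(img):
    inputs = processor(images=img, return_tensors="pt")
    feat = model.get_image_features(**inputs)
    return feat / feat.norm(dim=-1, keepdim=True)

@torch.no_grad()
def text_embed(text):
    inputs = processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    feat = model.get_text_features(**inputs)
    return feat / feat.norm(dim=-1, keepdim=True)

def cp_clip(concept_crops, reference_images):
    # Average image-image similarity over (generated crop, reference) pairs.
    sims = [(image_embed(c) @ image_embed(r).T).item()
            for c, r in zip(concept_crops, reference_images)]
    return sum(sims) / len(sims)

def ti_clip(generated_image, prompt):
    # Text-image similarity between the full generated image and its prompt.
    return (image_embed(generated_image) @ text_embed(prompt).T).item()
```

In practice the per-concept crops would come from a detector or from the known composition layout of the generated scene.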

Chaining Foundation Models for Data Generation

Gen4Gen demonstrates that chaining strong foundation models (foreground extraction, LLMs, MLLMs, inpainting) is a promising direction for generating high-quality datasets for challenging vision tasks.

Our Data Creation Pipeline: Gen4Gen

Given source images representing multiple concepts, our Gen4Gen pipeline leverages recent advances in image foreground extraction, LLMs, MLLMs, and inpainting to compose realistic, personalized images and paired text descriptions, producing the MyCanvas dataset.
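
The following Python sketch outlines how such a chained pipeline might be organized. All four model calls are hypothetical placeholders standing in for off-the-shelf components (a foreground extractor, an LLM for layout planning, a diffusion inpainter, and an MLLM captioner); it illustrates the data flow, not the paper's actual implementation.

```python
from PIL import Image

def extract_foreground(img):
    """Hypothetical: segment the concept and return an RGBA cutout."""
    raise NotImplementedError

def llm_plan_layout(names, canvas_size):
    """Hypothetical: ask an LLM for per-concept boxes {name: (x, y, w, h)}."""
    raise NotImplementedError

def inpaint_background(image, mask, scene_prompt):
    """Hypothetical: diffusion inpainting of the masked (empty) region."""
    raise NotImplementedError

def mllm_caption(image):
    """Hypothetical: an MLLM writes a detailed scene description."""
    raise NotImplementedError

def compose(concept_images, concept_names, scene_prompt, size=(1024, 1024)):
    canvas = Image.new("RGBA", size)                  # fully transparent canvas
    boxes = llm_plan_layout(concept_names, size)
    for name, img in zip(concept_names, concept_images):
        cutout = extract_foreground(img)              # RGBA cutout of one concept
        x, y, w, h = boxes[name]
        canvas.alpha_composite(cutout.resize((w, h)), (x, y))
    # Inpaint everywhere the canvas is still transparent, so the background is
    # synthesized to match the scene prompt around the pasted concepts.
    mask = canvas.getchannel("A").point(lambda a: 255 if a == 0 else 0)
    image = inpaint_background(canvas.convert("RGB"), mask, scene_prompt)
    return image, mllm_caption(image)                 # image + paired caption
```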


Figure 2. Semi-Automated Data Creation Overview: Given source images representing multiple concepts, Gen4Gen leverages image foreground extraction, LLMs, MLLMs, and inpainting to compose realistic, personalized images and paired text descriptions.

Qualitative Results for Multi-Concept Composition

We present four sets of results in ascending order of composition difficulty (i.e., with an increasing number of personalized concepts). When paired with training methods such as Custom Diffusion, our generated MyCanvas dataset brings drastic improvements in disentangling objects whose identities are similar in the latent space (e.g., cat and lion, tractor1 and tractor2), preserving the distinctiveness of each object.
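
As a toy illustration of what multi-concept prompting can look like in the Custom Diffusion style (rare modifier tokens such as <new1> bound to each personalized concept), consider the sketch below. The token names and the repetition trick are our assumptions, not necessarily the paper's exact prompting strategy.

```python
concepts = {"<new1>": "cat", "<new2>": "lion"}  # modifier token -> class name

def build_prompt(scene, repeat=True):
    phrases = [f"{tok} {cls}" for tok, cls in concepts.items()]
    prompt = f"a photo of {' and '.join(phrases)} {scene}"
    if repeat:
        # Repeating the concept phrases is one simple way to strengthen each
        # concept's presence (our assumption, not necessarily the paper's recipe).
        prompt += ", " + ", ".join(phrases)
    return prompt

print(build_prompt("sitting on a rock in the savanna"))
# -> a photo of <new1> cat and <new2> lion sitting on a rock in the savanna,
#    <new1> cat, <new2> lion
```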


Figure 3. Qualitative comparison showing that training on MyCanvas dataset drastically improves multi-concept composition, particularly in disentangling visually similar objects.

MyCanvas Dataset Examples

Our semi-automatically generated dataset contains multiple personalized objects in complex compositions, with high-resolution, realistic images and accurate text descriptions. Compositions with 3 and 5 concepts are shown below.


Figure 4. Examples from the MyCanvas dataset showing compositions with 3 and 5 personalized concepts.

MyCanvas Dataset Statistics

(a) A pie chart showing that roughly 30% of the images in MyCanvas are paired with text descriptions longer than 20 words. (b) A word cloud of the object categories appearing in the images, illustrating the variety of objects used. (c) and (d): Word clouds of the descriptions most frequently used during training and inference; the two sets differ substantially to ensure a fair comparison.
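
As a small reproducibility aid, statistic (a) could be recomputed from the dataset's captions along these lines; the file name and JSON schema here are assumptions, not MyCanvas's actual layout.

```python
import json

# Hypothetical file layout: a list of {"caption": "..."} records.
with open("mycanvas_captions.json") as f:
    captions = [entry["caption"] for entry in json.load(f)]

long_frac = sum(len(c.split()) > 20 for c in captions) / len(captions)
print(f"{long_frac:.0%} of captions exceed 20 words")  # the paper reports ~30%
```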


Figure 5. MyCanvas dataset statistics: (a) text length, (b) object categories, (c) training descriptions, (d) inference descriptions.

BibTeX

@misc{yeh2024gen4gen,
      title={Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition},
      author={Chun-Hsiao Yeh and Ta-Ying Cheng and He-Yen Hsieh and Chuan-En Lin and Yi Ma and Andrew Markham and Niki Trigoni and H. T. Kung and Yubei Chen},
      year={2024},
      eprint={2402.15504},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}