Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition

*Equal Contribution
UC Berkeley · Oxford · NTU · UW · HKU · Harvard · UC Davis
BMVC 2025

Figure 1. Gen4Gen composes personalized concepts into realistic scenes with complex compositions, accompanied by detailed text descriptions.

TL;DR
  • The problem: Current personalization techniques for text-to-image diffusion models fail to extend to multiple concepts, likely due to the mismatch between complex scenes and simple text descriptions in pre-training datasets.
  • Our approach: Gen4Gen is a semi-automated data creation pipeline that leverages image foreground extraction, LLMs, MLLMs, and inpainting to compose realistic, personalized images with accurate text descriptions, producing the MyCanvas dataset.
  • The payoff: By improving data quality and prompting strategies alone, we significantly increase multi-concept personalized image generation quality, without requiring any modifications to model architecture or training algorithms.

Abstract

Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts (e.g., their own pets or specific items) with just a few examples for training. This paper tackles two interconnected issues within this realm of personalizing text-to-image diffusion models. First, current personalization techniques fail to extend to multiple concepts; we hypothesize this to be due to the mismatch between complex scenes and simple text descriptions in the pre-training dataset (e.g., LAION). Second, given an image containing multiple personalized concepts, there is no holistic metric that evaluates not just how closely each personalized concept resembles its reference, but also whether all concepts are present in the image and whether the image accurately reflects the overall text description. To address these issues, we create MyCanvas, a semi-automatically created dataset containing multiple personalized concepts in complex compositions, accompanied by accurate text descriptions. In addition, we design a comprehensive metric comprising two scores (CP-CLIP and TI-CLIP) to better quantify the performance of multi-concept, personalized text-to-image diffusion methods. We provide a simple baseline built on top of Custom Diffusion with empirical prompting strategies for future researchers to evaluate on MyCanvas. We show that by improving data quality and prompting strategies, we can significantly increase multi-concept personalized image generation quality without requiring any modifications to model architecture or training algorithms. We demonstrate that chaining strong foundation models could be a promising direction for generating high-quality datasets targeting a variety of challenging tasks in the computer vision community.

Key Findings


Data Quality Over Model Complexity

By improving data quality and prompting strategies alone, multi-concept personalized image generation quality increases significantly, without requiring any changes to model architecture or training algorithms.

MyCanvas Dataset and Comprehensive Metrics

We introduce MyCanvas, a semi-automatically created dataset with multiple personalized concepts in complex compositions, along with CP-CLIP and TI-CLIP scores for holistic multi-concept evaluation.
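
For intuition, the sketch below shows one way such CLIP-based scores could be computed. It is a minimal illustration, not the paper's exact protocol: we assume CP-CLIP averages image-image CLIP similarities between per-concept crops of the generated image and their reference photos, and TI-CLIP is the text-image CLIP similarity between the generated image and its prompt. The helper names, the Hugging Face checkpoint, and the cropping step are our assumptions.

```python
# Minimal sketch in the spirit of CP-CLIP / TI-CLIP (assumptions noted above).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def image_embed(img):
    inputs = processor(images=img, return_tensors="pt")
    feat = model.get_image_features(**inputs)
    return feat / feat.norm(dim=-1, keepdim=True)

@torch.no_grad()
def text_embed(text):
    inputs = processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    feat = model.get_text_features(**inputs)
    return feat / feat.norm(dim=-1, keepdim=True)

def cp_clip(concept_crops, reference_images):
    # Average image-image similarity over (generated crop, reference) pairs.
    sims = [(image_embed(c) @ image_embed(r).T).item()
            for c, r in zip(concept_crops, reference_images)]
    return sum(sims) / len(sims)

def ti_clip(generated_image, prompt):
    # Text-image similarity between the full generated image and its prompt.
    return (image_embed(generated_image) @ text_embed(prompt).T).item()
```

In practice the per-concept crops would come from a detector or from the known composition layout of the generated scene.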

Chaining Foundation Models for Data Generation

Gen4Gen demonstrates that chaining strong foundation models (foreground extraction, LLMs, MLLMs, inpainting) is a promising direction for generating high-quality datasets for challenging vision tasks.

Our Data Creation Pipeline: Gen4Gen

Given source images representing multiple concepts, our Gen4Gen pipeline leverages recent advances in image foreground extraction, LLMs, MLLMs, and inpainting to compose realistic, personalized images and paired text descriptions, producing the MyCanvas dataset.
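
The following Python sketch outlines how such a chained pipeline might be organized. All four model calls are hypothetical placeholders standing in for off-the-shelf components (a foreground extractor, an LLM for layout planning, a diffusion inpainter, and an MLLM captioner); it illustrates the data flow, not the paper's actual implementation.

```python
from PIL import Image

def extract_foreground(img):
    """Hypothetical: segment the concept and return an RGBA cutout."""
    raise NotImplementedError

def llm_plan_layout(names, canvas_size):
    """Hypothetical: ask an LLM for per-concept boxes {name: (x, y, w, h)}."""
    raise NotImplementedError

def inpaint_background(image, mask, scene_prompt):
    """Hypothetical: diffusion inpainting of the masked (empty) region."""
    raise NotImplementedError

def mllm_caption(image):
    """Hypothetical: an MLLM writes a detailed scene description."""
    raise NotImplementedError

def compose(concept_images, concept_names, scene_prompt, size=(1024, 1024)):
    canvas = Image.new("RGBA", size)                  # fully transparent canvas
    boxes = llm_plan_layout(concept_names, size)
    for name, img in zip(concept_names, concept_images):
        cutout = extract_foreground(img)              # RGBA cutout of one concept
        x, y, w, h = boxes[name]
        canvas.alpha_composite(cutout.resize((w, h)), (x, y))
    # Inpaint everywhere the canvas is still transparent, so the background is
    # synthesized to match the scene prompt around the pasted concepts.
    mask = canvas.getchannel("A").point(lambda a: 255 if a == 0 else 0)
    image = inpaint_background(canvas.convert("RGB"), mask, scene_prompt)
    return image, mllm_caption(image)                 # image + paired caption
```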


Figure 2. Semi-Automated Data Creation Overview: Given source images representing multiple concepts, Gen4Gen leverages image foreground extraction, LLMs, MLLMs, and inpainting to compose realistic, personalized images and paired text descriptions.

Qualitative Results for Multi-Concept Composition

We present four sets of results in ascending order of composition difficulty (i.e., with an increasing number of personalized concepts). When paired with training methods such as Custom Diffusion, our generated MyCanvas dataset brings drastic improvements in disentangling objects whose identities are similar in the latent space (e.g., cat and lion, tractor1 and tractor2), preserving the distinctiveness of each object.
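
As a toy illustration of what multi-concept prompting can look like in the Custom Diffusion style (rare modifier tokens such as <new1> bound to each personalized concept), consider the sketch below. The token names and the repetition trick are our assumptions, not necessarily the paper's exact prompting strategy.

```python
concepts = {"<new1>": "cat", "<new2>": "lion"}  # modifier token -> class name

def build_prompt(scene, repeat=True):
    phrases = [f"{tok} {cls}" for tok, cls in concepts.items()]
    prompt = f"a photo of {' and '.join(phrases)} {scene}"
    if repeat:
        # Repeating the concept phrases is one simple way to strengthen each
        # concept's presence (our assumption, not necessarily the paper's recipe).
        prompt += ", " + ", ".join(phrases)
    return prompt

print(build_prompt("sitting on a rock in the savanna"))
# -> a photo of <new1> cat and <new2> lion sitting on a rock in the savanna,
#    <new1> cat, <new2> lion
```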


Figure 3. Qualitative comparison showing that training on MyCanvas dataset drastically improves multi-concept composition, particularly in disentangling visually similar objects.

MyCanvas Dataset Examples

Our semi-automatically generated dataset contains multiple personalized objects in complex compositions, with high-resolution, realistic images and accurate text descriptions. Compositions with 3 and 5 concepts are shown below.


Figure 4. Examples from the MyCanvas dataset showing compositions with 3 and 5 personalized concepts.

MyCanvas Dataset Statistics

(a) A pie chart showing that roughly 30% of the images in MyCanvas are paired with text descriptions longer than 20 words. (b) A word cloud of the object categories appearing in the images, illustrating the variety of objects used. (c) and (d): Word clouds of the descriptions most frequently used during training and inference; the two sets differ substantially to ensure a fair comparison.
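
As a small reproducibility aid, statistic (a) could be recomputed from the dataset's captions along these lines; the file name and JSON schema here are assumptions, not MyCanvas's actual layout.

```python
import json

# Hypothetical file layout: a list of {"caption": "..."} records.
with open("mycanvas_captions.json") as f:
    captions = [entry["caption"] for entry in json.load(f)]

long_frac = sum(len(c.split()) > 20 for c in captions) / len(captions)
print(f"{long_frac:.0%} of captions exceed 20 words")  # the paper reports ~30%
```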


Figure 5. MyCanvas dataset statistics: (a) text length, (b) object categories, (c) training descriptions, (d) inference descriptions.

BibTeX

@misc{yeh2024gen4gen,
      title={Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition},
      author={Chun-Hsiao Yeh and Ta-Ying Cheng and He-Yen Hsieh and Chuan-En Lin and Yi Ma and Andrew Markham and Niki Trigoni and H. T. Kung and Yubei Chen},
      year={2024},
      eprint={2402.15504},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}