Official implementation of Würstchen: Efficient Pretraining of Text-to-Image Models

jcho19/Wuerstchen_FT

 
 

Würstchen

What is this?

Würstchen is a framework for training text-conditional models that moves the computationally expensive text-conditional stage into a highly compressed latent space. Common approaches use a single compression stage, whereas Würstchen adds a second stage that compresses even further. In total, Stages A and B are responsible for compressing images, and Stage C learns the text-conditional part in the low-dimensional latent space. With this design, Würstchen achieves a 42x compression factor while still reconstructing images faithfully, which makes training Stage C fast and computationally cheap. We refer to the paper for details.

Use Würstchen

You can use the model through the notebooks here. The Stage B notebook is only for reconstruction, and the Stage C notebook is for text-conditional generation. You can also try text-to-image generation on Google Colab.

Using in 🧨 diffusers

Würstchen is fully integrated into the diffusers library. Here's how to use it:

# pip install -U transformers accelerate diffusers

import torch
from diffusers import AutoPipelineForText2Image
from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS

pipe = AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen", torch_dtype=torch.float16).to("cuda")

caption = "Anthropomorphic cat dressed as a fire fighter"
images = pipe(
    caption, 
    width=1024,
    height=1536,
    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
    prior_guidance_scale=4.0,
    num_images_per_prompt=2,
).images

Refer to the official documentation to learn more.
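
The combined pipeline above runs Stage C and then the decoder stages in one call. If you want to inspect or reuse the intermediate image embeddings, a minimal sketch of running the two parts separately might look like the following. It assumes your installed diffusers version exposes WuerstchenPriorPipeline and WuerstchenDecoderPipeline and that the prior weights are published as "warp-ai/wuerstchen-prior"; the function name generate_two_stage and its defaults are illustrative only, so check the official documentation for the exact arguments.

```python
def generate_two_stage(caption, height=1024, width=1536, num_images=2, device="cuda"):
    """Sketch: run Stage C (prior) first, then decode its embeddings to images."""
    # Imported lazily so that defining this helper does not require a GPU setup.
    import torch
    from diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline
    from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS

    # Assumption: the prior (Stage C) weights live in a separate repository.
    prior = WuerstchenPriorPipeline.from_pretrained(
        "warp-ai/wuerstchen-prior", torch_dtype=torch.float16
    ).to(device)
    decoder = WuerstchenDecoderPipeline.from_pretrained(
        "warp-ai/wuerstchen", torch_dtype=torch.float16
    ).to(device)

    # Stage C: text-conditional generation in the compressed latent space.
    prior_output = prior(
        prompt=caption,
        height=height,
        width=width,
        timesteps=DEFAULT_STAGE_C_TIMESTEPS,
        guidance_scale=4.0,
        num_images_per_prompt=num_images,
    )

    # Decoder: turn the image embeddings back into full-resolution images.
    return decoder(
        image_embeddings=prior_output.image_embeddings,
        prompt=caption,
        guidance_scale=0.0,
        output_type="pil",
    ).images
```

Calling generate_two_stage("Anthropomorphic cat dressed as a fire fighter") should return a list of PIL images, provided a CUDA device and the model weights are available.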

Train your own Würstchen

Training Würstchen is considerably faster and cheaper than training other text-to-image models because it trains in a much smaller latent space of 12x12. We provide training scripts for both Stage B and Stage C.
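
To see how the 42x compression factor relates to the 12x12 latent space, here is a rough back-of-the-envelope sketch. It assumes the factor applies to each spatial side of the image and that the 12x12 figure refers to 512x512 inputs; the helper latent_side and the rounding are illustrative and are not taken from the training code.

```python
def latent_side(pixels: int, compression: float = 42.0) -> int:
    """Approximate latent side length, assuming per-side compression."""
    return round(pixels / compression)


for side in (512, 1024):
    s = latent_side(side)
    print(f"{side}x{side} image -> roughly {s}x{s} latent")
# 512x512 image -> roughly 12x12 latent
# 1024x1024 image -> roughly 24x24 latent
```

Under these assumptions, Stage C works on only a few hundred spatial positions per image, which is where the training savings come from.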

Download Models

| Model | Download | Parameters | Conditioning | Training Steps | Resolution |
| --- | --- | --- | --- | --- | --- |
| Würstchen v1 | Hugging Face | 1B (Stage C) + 600M (Stage B) + 19M (Stage A) | CLIP-H-Text | 800,000 | 512x512 |
| Würstchen v2 | Hugging Face | 1B (Stage C) + 600M (Stage B) + 19M (Stage A) | CLIP-bigG-Text | 918,000 | 1024x1024 |

Acknowledgment

Special thanks to Stability AI for providing compute for our research.
