Stable Diffusion 3 on Hugging Face

Stable Diffusion 3 (SD3), the latest release from Stability AI, is now available on the Hugging Face Hub and can be used with Diffusers. The newly released model is Stable Diffusion 3 Medium, featuring 2 billion parameters.

This release includes:

Models on the Hub
Diffusers Integration
SD3 Dreambooth and LoRA training scripts

Stable Diffusion 3 Model

SD3 is a latent diffusion model with three different text encoders (CLIP L/14, OpenCLIP bigG/14, and T5-v1.1-XXL), a novel Multimodal Diffusion Transformer (MMDiT) model, and a 16-channel AutoEncoder model similar to that in Stable Diffusion XL.

SD3 processes text inputs and pixel latents as sequences of embeddings. Positional encodings are added to 2×2 patches of latents, which are then flattened into a patch encoding sequence. This sequence, along with the text encoding sequence, is fed into the MMDiT blocks, where they are embedded to a common dimensionality, concatenated, and passed through a sequence of modulated attentions and MLPs.
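
As a rough illustration of the patching step, the sketch below computes how many patch tokens an image produces and how large each raw patch vector is before it is projected to the model width. The 2×2 patch size and 16 latent channels come from the description above; the 8× spatial downsampling of the autoencoder and the function name `patch_sequence_shape` are assumptions made for this example.

```python
def patch_sequence_shape(height, width, vae_scale=8, patch_size=2, latent_channels=16):
    """Return (num_tokens, token_dim) for an image of the given pixel size.

    vae_scale=8 is an assumption about the autoencoder's downsampling factor.
    """
    step = vae_scale * patch_size
    if height % step or width % step:
        raise ValueError("image size must be divisible by vae_scale * patch_size")
    # Latent grid produced by the autoencoder
    latent_h, latent_w = height // vae_scale, width // vae_scale
    # One token per non-overlapping patch_size x patch_size patch
    num_tokens = (latent_h // patch_size) * (latent_w // patch_size)
    # Each token flattens all channels of its patch
    token_dim = latent_channels * patch_size * patch_size
    return num_tokens, token_dim


print(patch_sequence_shape(1024, 1024))  # (4096, 64)
```

Under these assumptions a 1024×1024 image becomes a sequence of 4096 patch tokens, which is the sequence that gets joined with the text tokens.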

To account for differences between the two modalities, the MMDiT blocks use separate sets of weights to embed the text and image sequences to a common dimensionality. These sequences are joined before the attention operation, allowing both representations to operate in their own space while considering the other during attention. This bi-directional flow of information between text and image data is a departure from previous text-to-image synthesis approaches, which incorporated text information into the latent via cross-attention with a fixed text representation.
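
The idea of separate weights per modality with one shared attention operation can be sketched in a few lines of plain Python. This is a deliberately simplified, single-head toy, not the actual MMDiT block: it omits modulation, normalization, output projections, and the MLPs, and every name in it (`joint_attention`, `text_w`, `image_w`) is hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def joint_attention(text_tokens, image_tokens, text_w, image_w):
    """Single-head attention over the concatenated text + image sequence."""
    # Each modality is projected with its own q/k/v weights,
    # then the projected rows are joined (text first, image second).
    q = matmul(text_tokens, text_w["q"]) + matmul(image_tokens, image_w["q"])
    k = matmul(text_tokens, text_w["k"]) + matmul(image_tokens, image_w["k"])
    v = matmul(text_tokens, text_w["v"]) + matmul(image_tokens, image_w["v"])
    # Attention runs over the joined sequence, so every token sees both modalities.
    scale = 1.0 / math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s * scale for s in row]) for row in scores]
    out = matmul(weights, v)
    # Split the result back into the two streams.
    n_text = len(text_tokens)
    return out[:n_text], out[n_text:], weights


rng = random.Random(0)
dim = 4
text = random_matrix(3, dim, rng)    # 3 text tokens
image = random_matrix(5, dim, rng)   # 5 image patch tokens
text_w = {name: random_matrix(dim, dim, rng) for name in "qkv"}
image_w = {name: random_matrix(dim, dim, rng) for name in "qkv"}

text_out, image_out, weights = joint_attention(text, image, text_w, image_w)
print(len(text_out), len(image_out), len(weights[0]))  # 3 5 8
```

Each row of `weights` spans all 8 positions, which is what lets text tokens attend to image tokens and vice versa, in contrast to cross-attention against a fixed text representation.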

SD3 also uses pooled text embeddings from both its CLIP models as part of its timestep conditioning. These embeddings are concatenated and added to the timestep embedding before being passed to each MMDiT block.

Training with Rectified Flow Matching

Beyond architectural changes, SD3 uses a conditional flow-matching objective for training. In this approach, the forward noising process is defined as a rectified flow connecting the data and noise distributions in a straight line.
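
A small numeric sketch may make the straight-line path concrete. It assumes the convention that t = 0 is the data sample and t = 1 is pure noise, so a point on the path is (1 - t) * x0 + t * noise and the velocity along it is the constant noise - x0, which is the regression target for the network. The helper names are hypothetical, and an oracle velocity stands in for a trained model.

```python
import random


def noisy_sample(x0, noise, t):
    """Point at time t on the straight line from data (t=0) to noise (t=1)."""
    return [(1.0 - t) * x + t * n for x, n in zip(x0, noise)]


def velocity_target(x0, noise):
    """Constant velocity along that line; the quantity the network would regress."""
    return [n - x for x, n in zip(x0, noise)]


def euler_sample(z, velocity_fn, num_steps):
    """Integrate from t=1 (noise) down to t=0 (data) with fixed Euler steps."""
    dt = 1.0 / num_steps
    t = 1.0
    for _ in range(num_steps):
        v = velocity_fn(z, t)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
        t -= dt
    return z


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
noise = [rng.gauss(0.0, 1.0) for _ in x0]


def oracle_velocity(z, t):
    # Stand-in for a trained model: returns the exact target velocity.
    return velocity_target(x0, noise)


halfway = noisy_sample(x0, noise, 0.5)
recovered = euler_sample(noise, oracle_velocity, num_steps=4)
```

Because the path is a straight line, even a handful of Euler steps with the exact velocity lands back on `x0` up to floating-point error. A trained model only approximates that velocity, so real sampling still needs more steps.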

The rectified flow-matching sampling process is simpler and performs well with fewer sampling steps. For SD3 inference, a new scheduler (FlowMatchEulerDiscreteScheduler) with a rectified flow-matching formulation and Euler method steps has been introduced. It also implements resolution-dependent shifting of the timestep schedule via a shift parameter. Increasing the shift value improves noise scaling for higher resolutions, with shift=3.0 recommended for the 2B model.
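
To give a feel for what the shift does, the sketch below applies the remapping sigma' = shift * sigma / (1 + (shift - 1) * sigma) to a linearly spaced noise schedule. The formula is the commonly cited form of the timestep shift; treat the exact schedule construction here (the linear spacing and the helper names `shift_sigma` and `shifted_schedule`) as assumptions for illustration rather than the scheduler's actual code.

```python
def shift_sigma(sigma, shift=3.0):
    """Remap a noise level in [0, 1]; shift > 1 pushes it toward the noisy end."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(num_steps, shift=3.0):
    """Linearly spaced noise levels starting at 1.0, after applying the shift."""
    sigmas = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift_sigma(s, shift) for s in sigmas]


print([round(s, 3) for s in shifted_schedule(4, shift=1.0)])  # [1.0, 0.75, 0.5, 0.25]
print([round(s, 3) for s in shifted_schedule(4, shift=3.0)])  # [1.0, 0.9, 0.75, 0.5]
```

With shift=3.0 the same number of steps spends more of its budget at high noise levels, and shift=1.0 leaves the schedule unchanged.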

Using SD3 with Hugging Face Diffusers

To use SD3 with Diffusers, ensure you have the latest Diffusers release by upgrading your installation:

pip install --upgrade diffusers

Since the model is gated, you need to visit the Stable Diffusion 3 Medium Hugging Face page, complete the form, and accept the gate. Once accepted, log in to Hugging Face using the following command:

huggingface-cli login

The snippet below will download the 2B parameter version of SD3 in fp16 precision, which is the recommended format for running inference:

Text-To-Image

Prompt:

“A cat holding a sign that says hello world”

Code:

import torch
from diffusers import StableDiffusion3Pipeline

# Load the pipeline
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", 
    torch_dtype=torch.float16
)

# Move the pipeline to GPU
pipe = pipe.to("cuda")

# Generate an image
image = pipe(
    prompt="A cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]

# Display the generated image (works in a notebook; in a script,
# save it instead, e.g. image.save("output.png"))
image

Memory Optimizations for SD3

SD3 uses three text encoders, including the very large T5-XXL model, making it challenging to run the model on GPUs with less than 24GB of VRAM, even with fp16 precision.

To address this, the Diffusers integration includes memory optimizations to enable SD3 to run on a broader range of devices.

import torch
from diffusers import StableDiffusion3Pipeline

# Load the pipeline in fp16 precision
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
)

# Apply memory optimization settings
pipe.enable_attention_slicing()
# Sequential offload (requires the accelerate package) manages device
# placement itself, so do not call pipe.to("cuda") afterwards
pipe.enable_sequential_cpu_offload()

# Generate an image
image = pipe(
    prompt="A cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]

For more details, see the SD3 pipeline documentation in the Diffusers docs.

Key Memory Optimization Techniques

Attention Slicing: This splits the attention computation into smaller chunks, reducing peak memory usage.
Sequential CPU Offload: This offloads layers to the CPU when they are not needed on the GPU, allowing larger models to fit into limited VRAM.

These optimizations allow SD3 to be used on devices with less VRAM, expanding the range of compatible hardware.
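
Diffusers also offers a coarser offloading mode, `enable_model_cpu_offload()`, which moves whole components (text encoders, transformer, VAE) on and off the GPU rather than individual layers. It is usually faster than sequential offload but saves less memory. The helper below is a hypothetical sketch of choosing between the two; the name `load_sd3` and its `low_vram` flag are illustrative, and the imports are kept inside the function so that defining it does not require a GPU environment.

```python
def load_sd3(low_vram=False):
    """Load SD3 Medium with one of two CPU offloading strategies.

    Calling this needs torch, diffusers, accelerate, a CUDA GPU,
    and access to the gated model repository.
    """
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers",
        torch_dtype=torch.float16,
    )
    if low_vram:
        # Smallest footprint, slowest: layers are moved to the GPU as needed
        pipe.enable_sequential_cpu_offload()
    else:
        # Faster: one whole component lives on the GPU at a time
        pipe.enable_model_cpu_offload()
    # Offloading handles device placement, so pipe.to("cuda") is not called
    return pipe
```

A typical call would be `pipe = load_sd3(low_vram=True)` on a card with little VRAM, followed by the same `pipe(prompt=...)` call shown earlier.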
