From Prompt to Pixels

The Generation Pipeline

A. Prompt Tokenization

Your text is split into sub-word tokens using byte-pair encoding (the same method used by language models).

Example: "an astronaut" → ["an", "Ġastro", "naut"]

Each token gets its own attention map, making word choice and order crucial for the final result.
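As a rough illustration of this step, the sketch below uses the Hugging Face transformers library and assumes the Stable Diffusion v1.5 tokenizer published under the hub id runwayml/stable-diffusion-v1-5; the exact sub-word split depends on the tokenizer's vocabulary:

```python
from transformers import CLIPTokenizer

# Load the tokenizer used by Stable Diffusion v1.5 (assumes the
# weights under "runwayml/stable-diffusion-v1-5" are available).
tokenizer = CLIPTokenizer.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="tokenizer"
)

tokens = tokenizer.tokenize("an astronaut")
print(tokens)  # sub-word pieces; the exact split depends on the vocabulary

# Prompts are padded/truncated to a fixed length of 77 token positions.
ids = tokenizer("an astronaut", padding="max_length", max_length=77).input_ids
print(len(ids))  # 77: start token, prompt tokens, then end/padding tokens
```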

B. Text-to-Vector Encoding

The tokens are processed through a frozen text encoder (such as CLIP's text encoder or T5), creating a "semantic cloud" of embedding vectors that represents your prompt's meaning. In Stable Diffusion v1.5, this is a 77 × 768 matrix: one 768-dimensional vector for each of the 77 token positions.
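A minimal sketch of this encoding step, again assuming the transformers library and the runwayml/stable-diffusion-v1-5 weights:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="tokenizer"
)
text_encoder = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="text_encoder"
)

inputs = tokenizer(
    "an astronaut", padding="max_length", max_length=77, return_tensors="pt"
)
with torch.no_grad():  # the encoder stays frozen during generation
    embeddings = text_encoder(inputs.input_ids).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768])
```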

C. Conditioned Diffusion

The system starts with a 64 × 64 × 4 Gaussian noise tensor and begins the denoising process:

  • Feeds the noise, timestep, and text embedding into the U-Net
  • Repeats 20-40 times, gradually predicting and subtracting noise
  • Works in the compressed latent space for efficiency (a simplified loop is sketched below)
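
The loop described above can be sketched roughly as follows, assuming the Hugging Face diffusers library, the SD v1.5 weights under runwayml/stable-diffusion-v1-5, and a random tensor standing in for the real prompt embedding:

```python
import torch
from diffusers import UNet2DConditionModel, DDIMScheduler

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
scheduler = DDIMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler"
)

text_emb = torch.randn(1, 77, 768)  # stand-in for the real prompt embedding

scheduler.set_timesteps(30)  # number of denoising steps
latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma  # pure noise

for t in scheduler.timesteps:
    with torch.no_grad():
        # Predict the noise present in the current latent, given the prompt.
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
    # The scheduler subtracts the predicted noise and steps toward a cleaner latent.
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```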

D. Cross-Attention

Inside each U-Net block, image queries attend to keys/values from the text embedding. Early steps establish global composition, while later steps refine details. This is where your prompt's structure really matters.
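To make the mechanism concrete, here is a toy single-head cross-attention in PyTorch; the projection width (320) and token counts are illustrative, not the exact SD v1.5 layer sizes:

```python
import torch

def cross_attention(image_feats, text_emb, dim=320):
    """Toy single-head cross-attention: image tokens query the text embedding."""
    to_q = torch.nn.Linear(image_feats.shape[-1], dim, bias=False)
    to_k = torch.nn.Linear(text_emb.shape[-1], dim, bias=False)
    to_v = torch.nn.Linear(text_emb.shape[-1], dim, bias=False)

    q = to_q(image_feats)                   # queries come from spatial (image) tokens
    k, v = to_k(text_emb), to_v(text_emb)   # keys/values come from the prompt tokens

    attn = torch.softmax(q @ k.transpose(-1, -2) / dim ** 0.5, dim=-1)
    return attn @ v                         # each image token mixes in prompt information

# 64*64 = 4096 spatial tokens of width 320, attending over 77 text tokens of width 768
out = cross_attention(torch.randn(1, 4096, 320), torch.randn(1, 77, 768))
print(out.shape)  # torch.Size([1, 4096, 320])
```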

E. Classifier-Free Guidance

At each step the model makes two noise predictions, one conditioned on the prompt and one unconditioned, and blends them:

x = x_uncond + s · (x_cond − x_uncond)

The scale s (typically 5-12) lets you balance realism against prompt faithfulness: higher values follow the prompt more closely but can reduce variety and over-saturate the image.
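The blend itself is a one-liner; the sketch below assumes PyTorch, uses the common default scale of 7.5, and the function name is ours:

```python
import torch

def classifier_free_guidance(noise_uncond, noise_cond, scale=7.5):
    """Blend the unconditional and text-conditioned noise predictions."""
    return noise_uncond + scale * (noise_cond - noise_uncond)

# In practice both predictions usually come from one batched U-Net call per step.
guided = classifier_free_guidance(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64))
```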

F. VAE Decoding

A Variational Auto-Encoder transforms the 64 × 64 × 4 latent into a full-resolution RGB image (512 × 512 for Stable Diffusion v1.5, an 8× spatial upscale), restoring the fine pixel-level detail that was compressed into the latent space.
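A minimal decode sketch, assuming the diffusers AutoencoderKL and the SD v1.5 VAE weights, with a random tensor standing in for the real denoised latents:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
)

latents = torch.randn(1, 4, 64, 64)  # stand-in for the denoised latents
with torch.no_grad():
    # Undo the latent scaling, then decode back to pixel space.
    image = vae.decode(latents / vae.config.scaling_factor).sample

print(image.shape)  # torch.Size([1, 3, 512, 512]): 8x spatial upscale to RGB
```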

G. Post-Processing

Optional steps include safety filtering and upscaling. This is why your 768px preview can become a crisp poster-quality image.
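As one illustrative (not production-grade) example of the upscaling step, here is a plain Lanczos resize with Pillow; real pipelines often use learned super-resolution models instead, and the file names here are hypothetical:

```python
from PIL import Image

# Simple 4x resample of a hypothetical preview image.
img = Image.open("preview_768.png")
poster = img.resize((img.width * 4, img.height * 4), Image.LANCZOS)
poster.save("poster_3072.png")
```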

Next: Learn how to write effective prompts in our Prompt Crafting Guide.