From Prompt to Pixels
The Generation Pipeline
Your text is split into sub-word tokens using byte-pair encoding (the same method used by language models).
Example: "an astronaut" → ["an", "Ġastro", "naut"]
Each token gets its own attention map, making word choice and order crucial for the final result.
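A quick way to inspect the real splits is to ask the tokenizer itself. A minimal sketch, assuming the Hugging Face transformers package and the CLIP checkpoint that Stable Diffusion v1.5 builds on:

```python
from transformers import CLIPTokenizer

# Tokenizer used by Stable Diffusion v1.5's text encoder (assumed checkpoint name).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# Sub-word pieces and their vocabulary IDs for a short prompt.
pieces = tokenizer.tokenize("an astronaut")
ids = tokenizer.convert_tokens_to_ids(pieces)
print(pieces, ids)
```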
The tokens are then run through a frozen text encoder (the text half of CLIP in Stable Diffusion; some newer models use T5), producing a sequence of embedding vectors that captures your prompt's meaning. In Stable Diffusion v1.5 this is a 77 × 768 matrix: 77 token positions, each represented by a 768-dimensional vector.
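You can check that shape directly. A minimal sketch, again assuming transformers and the same checkpoint:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Pad/truncate to the 77-token context window Stable Diffusion expects.
inputs = tokenizer("an astronaut", padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = text_encoder(input_ids=inputs.input_ids).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768])
```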
The system starts with a 64 × 64 × 4 tensor of Gaussian noise (the latent-space counterpart of a 512 × 512 image) and begins the denoising loop:
- Feeds the current noisy latent, the timestep, and the text embedding into the U-Net
- Repeats 20-40 times, each step predicting and subtracting a little more noise
- Works in the compressed latent space for efficiency (a loop sketch follows below)
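A condensed sketch of that loop, assuming the diffusers package, the Stable Diffusion v1.5 weights, and the `embeddings` tensor from the encoding sketch above. Classifier-free guidance is left out here; it is covered in the next step.

```python
import torch
from diffusers import DDIMScheduler, UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")
scheduler = DDIMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler")

text_embeddings = embeddings          # 1 × 77 × 768, from the text-encoding sketch
latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma  # pure noise
scheduler.set_timesteps(30)           # 20-40 steps is typical

for t in scheduler.timesteps:
    model_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        # Predict the noise present in the latent, conditioned on the prompt.
        noise_pred = unet(model_input, t,
                          encoder_hidden_states=text_embeddings).sample
    # The scheduler subtracts the predicted noise, giving a slightly cleaner latent.
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```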
Inside the U-Net's cross-attention layers, queries computed from the image latent attend to keys and values computed from the text embedding. Early steps establish global composition, while later steps refine details. This is where your prompt's structure really matters.
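A toy sketch of that core operation, with made-up shapes (64 × 64 = 4096 image positions attending to 77 text tokens, a single head); real blocks add learned projections and multiple heads:

```python
import torch

d = 64                                     # head dimension (illustrative)
image_features = torch.randn(1, 4096, d)   # queries come from the image latent
text_features = torch.randn(1, 77, d)      # keys/values come from the text embedding

q, k, v = image_features, text_features, text_features
attn = torch.softmax(q @ k.transpose(-1, -2) / d**0.5, dim=-1)  # 1 × 4096 × 77
out = attn @ v   # each image position pulls in a weighted mix of token information
```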
This is classifier-free guidance: at every step the U-Net actually runs twice, once with an empty ("unconditional") prompt and once with yours, and the two predictions are blended:
x = x_uncond + s · (x_cond – x_uncond)
The scale s (typically 5-12) lets you balance between realism and prompt faithfulness.
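In code, the blend mirrors the formula, applied to the two noise predictions. A sketch of a helper that could replace the single U-Net call in the loop above; `empty_prompt_embeddings` (the encoder output for an empty string) is a hypothetical name:

```python
import torch

def guided_noise(unet, model_input, t, text_embeddings, empty_prompt_embeddings,
                 guidance_scale=7.5):
    """Classifier-free guidance: blend unconditional and prompt-conditioned predictions."""
    with torch.no_grad():
        noise_uncond = unet(model_input, t,
                            encoder_hidden_states=empty_prompt_embeddings).sample
        noise_cond = unet(model_input, t,
                          encoder_hidden_states=text_embeddings).sample
    # x_uncond + s * (x_cond - x_uncond), with s = guidance_scale
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```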
The decoder half of a Variational Auto-Encoder (VAE) expands the 64 × 64 latent back into a full-resolution RGB image (512 × 512 for v1.5), restoring the pixel-level detail that was compressed away in latent space.
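A sketch of the decoding step, assuming diffusers, the v1.5 VAE, and the final `latents` tensor from the denoising loop:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae")

with torch.no_grad():
    # Undo the scaling applied when images were encoded, then decode the
    # 1 × 4 × 64 × 64 latent into a 1 × 3 × 512 × 512 RGB tensor in [-1, 1].
    image = vae.decode(latents / vae.config.scaling_factor).sample
```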
Optional post-processing steps include safety filtering and upscaling; upscaling is how a 768px preview can become a crisp, poster-quality image.
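In practice, all of the steps above are bundled into a single pipeline call. A minimal end-to-end sketch, assuming diffusers, a CUDA GPU, and the v1.5 weights (the bundled safety checker runs on the output by default):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

image = pipe("an astronaut riding a horse",
             num_inference_steps=30,        # the 20-40 denoising steps
             guidance_scale=7.5).images[0]  # the scale s from above
image.save("astronaut.png")
```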
Next: Learn how to write effective prompts in our Prompt Crafting Guide.