Introduction
Stable Diffusion is a tool that creates images. You give it a text description like "a cat sitting on a beach at sunset" and it makes that image for you.
Now, how does it work?
Think of it like this: computers don't naturally know how to create images. They need to learn what things look like.
The team that built Stable Diffusion showed their computer millions of images paired with text descriptions. The computer studied these images and learned patterns: what cats look like, what beaches look like, what sunsets look like.
Noise in images
First, let's understand "noise" in images. Noise is like random dots or static that makes an image less clear. If you've ever seen TV static (those black and white dots when there's no signal), that's pure noise.
Diffusion explained
Now, the diffusion process has two main parts:
Part 1: The Forward Process (Adding Noise)
We start with a clear image, like a photo of a cat
We add a tiny bit of noise to it, making it slightly less clear
Then we add a bit more noise, making it even less clear
We keep adding more and more noise in steps
Eventually, after many steps, the image becomes just random noise with no recognizable features
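The forward process above can be sketched in a few lines of Python. This is a minimal illustration, not Stable Diffusion's actual code: the 8-pixel "image", the linear noise schedule, and its start and end values (`beta_start`, `beta_end`) are assumptions chosen for the example. It relies on a standard shortcut: instead of adding noise one step at a time, the noisy image at any step can be computed directly as a weighted mix of the clean image and random noise.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after each step (assumed linear schedule)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta  # each step keeps a little less of the image
        alpha_bars.append(running)
    return alpha_bars

def add_noise(image, alpha_bar, rng):
    """Jump straight to one noise level: sqrt(alpha_bar) * image + sqrt(1 - alpha_bar) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
             for x, n in zip(image, noise)]
    return noisy, noise

rng = random.Random(0)
image = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.5, 0.0, -0.5]  # toy 8-pixel "image", values in [-1, 1]
alpha_bars = make_alpha_bars(1000)

# Early steps barely change the image; by the last step almost none of it is left.
for step in (0, 249, 499, 999):
    noisy, _ = add_noise(image, alpha_bars[step], rng)
    print(step + 1, round(alpha_bars[step], 4), [round(v, 2) for v in noisy])
```

At the first step the printed values are almost identical to the original pixels. At the last step the signal fraction is close to zero, so what remains is essentially random noise.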
Part 2: The Reverse Process (Removing Noise)
The computer studies many examples of this forward process
It learns what happens at each step when noise is added
Then it learns how to reverse each step: how to take a noisy image and make it slightly less noisy
It practices this over and over with millions of images
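One training "practice round" might look like the sketch below. It assumes a common formulation in which the model is asked to predict the noise that was added, which is equivalent to knowing how to make the image less noisy. The function `untrained_model` is a hypothetical stand-in for the real neural network, and the schedule values are illustrative assumptions.

```python
import math
import random

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after each step (assumed linear schedule)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        running *= 1.0 - (beta_start + (beta_end - beta_start) * t / (num_steps - 1))
        alpha_bars.append(running)
    return alpha_bars

def make_training_example(clean_image, alpha_bars, rng):
    """Pick a random step, noise the image to that level, and keep the noise as the answer."""
    step = rng.randrange(len(alpha_bars))
    a = alpha_bars[step]
    noise = [rng.gauss(0.0, 1.0) for _ in clean_image]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(clean_image, noise)]
    return noisy, step, noise

def mean_squared_error(prediction, target):
    """How far the model's guess is from the noise that was really added."""
    return sum((p - q) ** 2 for p, q in zip(prediction, target)) / len(target)

def untrained_model(noisy_image, step):
    """Hypothetical placeholder for the network: it always guesses 'no noise'."""
    return [0.0 for _ in noisy_image]

rng = random.Random(0)
alpha_bars = linear_alpha_bars(1000)
clean_image = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.5, 0.0, -0.5]  # toy 8-pixel "image"

noisy, step, true_noise = make_training_example(clean_image, alpha_bars, rng)
loss = mean_squared_error(untrained_model(noisy, step), true_noise)
print(f"step={step} loss={loss:.3f}")
```

Real training repeats this with millions of images and adjusts the network after each round so the loss shrinks. A perfect prediction would give a loss of zero.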
When Stable Diffusion creates a new image:
It starts with complete noise: just random pixels
It uses what it learned to remove a little bit of noise
Then it removes a little more noise
It keeps doing this step by step
Eventually, a clear image appears that matches the text description you gave it
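The generation loop above can be sketched as follows. This is a toy illustration under stated assumptions, not Stable Diffusion's real sampler: `make_oracle` is a hypothetical stand-in for the trained network that already "knows" a single target image, and the schedule values are assumed. The update rule is the standard step from the basic diffusion sampler (subtract the predicted noise, rescale, then add a small amount of fresh noise on all but the last step).

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Amount of noise added at each forward step (assumed linear schedule)."""
    return [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
            for t in range(num_steps)]

def cumulative_alpha_bars(betas):
    """Fraction of the original signal left after each step."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def make_oracle(target, alpha_bars):
    """Hypothetical stand-in for a trained network: exact when the 'dataset' is one known image."""
    def predict_noise(x, t):
        a = alpha_bars[t]
        return [(xi - math.sqrt(a) * x0) / math.sqrt(1.0 - a) for xi, x0 in zip(x, target)]
    return predict_noise

def sample(predict_noise, betas, alpha_bars, size, rng):
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]  # start from pure noise
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Remove a little of the predicted noise and rescale.
        mean = [(xi - coef * e) / math.sqrt(1.0 - betas[t]) for xi, e in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])  # small fresh noise on every step except the last
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x

target = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.5, 0.0, -0.5]  # toy 8-pixel "image"
betas = linear_betas(1000)
alpha_bars = cumulative_alpha_bars(betas)
result = sample(make_oracle(target, alpha_bars), betas, alpha_bars, len(target), random.Random(0))
print([round(v, 3) for v in result])
```

Because the stand-in predictor is exact for its one image, the loop recovers that image from random noise. A real model has learned from millions of images, so the same loop produces a new image instead of a copy.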
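The name "diffusion" comes from the forward process, in which noise gradually spreads through the image until nothing else is left. Stable Diffusion creates images by running that process in reverse, removing noise step by step.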
Recap
During training:
The model sees clean images get progressively noisier (the forward process)
It learns how this noise affects images at each step
It practices predicting what the less noisy version would look like
During image generation:
We start with pure random noise
We use the trained model to gradually remove noise, step by step
We end up with a clear image that matches the text description
This approach is powerful because the model doesn't have to learn to create a complete image in one go. It only needs to learn how to make small improvements to a noisy image, removing a little bit of noise at each step. This makes the learning process more manageable.
And the text description helps guide this process by telling the model what kind of image should come out of the noise.
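The text above does not say how the description steers the denoising, so the following is only an illustration of one widely used technique, classifier-free guidance. The idea is to ask the model for two noise predictions at each step, one that ignores the text and one that uses it, and then push the result further in the direction the text suggests. The function name and the default `guidance_scale` of 7.5 are assumptions for this sketch.

```python
def guided_noise(noise_without_text, noise_with_text, guidance_scale=7.5):
    """Exaggerate the difference the text makes to the predicted noise.

    guidance_scale = 0 ignores the text, 1 uses the text prediction as-is,
    and larger values follow the text more strongly.
    """
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_without_text, noise_with_text)]

# Toy two-pixel predictions: the text only changes the first pixel.
without_text = [0.0, 1.0]
with_text = [1.0, 1.0]
print(guided_noise(without_text, with_text, guidance_scale=2.0))
```

Only the pixel where the two predictions disagree is changed, and the change grows with the scale.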
High level architecture
Problems with early diffusion models
Early diffusion models worked directly with full-sized images, processing every pixel in the image. This created three big problems:
They needed powerful, expensive computers (high-end GPUs)
They were extremely slow, taking minutes or even hours to generate a single image
They often produced inconsistent results with artifacts or distortions
These limitations meant that AI image generation was restricted to research labs and couldn't be used by regular people or in practical applications.
What makes Stable Diffusion "stable"
Latent Space Approach: Instead of working with full images pixel-by-pixel, Stable Diffusion first compresses images into a simplified representation (called "latent space"). Think of this like working with a blueprint instead of the actual building. This compression makes everything much faster and requires much less computing power.
Better Training Method: The model is also trained in this compressed space, so each training step handles far less data than a full image would require. This makes training more efficient and produces more consistent results with less computing power.
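To put the latent space idea in numbers, here is a small calculation. The sizes are assumptions based on the commonly cited Stable Diffusion v1 setup (a 512x512 color image, compressed 8 times along each side into 4 channels); the text above does not state them.

```python
def value_count(height, width, channels):
    """Number of values the model has to process for one image."""
    return height * width * channels

pixel_values = value_count(512, 512, 3)             # full-size color image
latent_values = value_count(512 // 8, 512 // 8, 4)  # assumed compressed representation

print(pixel_values)                  # 786432
print(latent_values)                 # 16384
print(pixel_values / latent_values)  # 48.0
```

Under these assumptions, every denoising step works on about 48 times fewer values than it would on the full image, which is where most of the speed and memory savings come from.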
The result of these improvements:
Image generation now takes seconds instead of hours
It can run on regular computers instead of needing special research equipment
It produces higher-quality, more consistent images with fewer glitches or artifacts
The name itself is usually traced to Stability AI, the company that backed the model's release, but it also fits the result: these improvements made the technology stable enough for practical, everyday use by regular developers and users.