McCormick How Stable Diffusion Works Dec 2022
The ability of a computer to generate art from nothing but a written
description is fascinating! I know that I, for one, would be desperately
curious to see what’s actually going on “under the hood” to make this
possible, so I wanted to do what I can here to provide a less superficial
explanation, even for those who aren’t familiar with the concepts in
artificial intelligence.
Overview
In the first section, I’ll give you the high‐level explanation (that you may
already be familiar with). It’s a good start, but I know that it wouldn’t
satisfy my curiosity. I’d be asking, “Ok, great, but how does it do that?”
To address this, I’ll show you some of Stable Diffusion’s inner workings.
The insides are more complex than you might be hoping, but I at least
wanted to show you more concretely what’s going on, so that it’s not a
complete mystery anymore.
More specifically:
We use Stable Diffusion to generate art, but what it actually does behind
the scenes is “clean up” images!
It’s much more sophisticated than the noise removal slider in your phone’s
image editor, though. It actually has an understanding of what the world
looks like, and an understanding of written language, and it leverages
these to guide the process.
For example, imagine if I gave the below image on the left to a skilled
graphic artist and told them that it’s a painting of an alien playing a guitar
in the style of H.R. Giger. I bet they could go in and painstakingly clean it
up to create something like the image on the right.
(These are actual images from Stable Diffusion!)
“Inference Steps”
Are you familiar with the “Inference Steps” slider in most art generation
tools? It’s there because Stable Diffusion removes noise incrementally, and
the slider sets how many steps it takes.
In fact, that noisy alien example was taken from about halfway through
the process–it started out as complete noise!
If you gave a graphic artist an image of complete noise, they’d throw up
their hands–“I can’t help you, the image is completely unrecognizable!”
So how does Stable Diffusion manage it?
At the simplest level, the answer is that it’s a computer program and it has
no choice but to do its thing and produce something for us.
A deeper answer has to do with the fact that AI models (more technically,
“Machine Learning” models) like Stable Diffusion are heavily based on
statistics. They estimate probabilities for all of their options, and even if all
of the options have extremely low probability of being right, they still just
pick whichever path has the highest probability.
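Here’s a tiny sketch of that “pick the best of a bad bunch” behavior. The option names and probabilities are made up for illustration; Stable Diffusion doesn’t literally keep a list like this, but the principle is the same.

```python
def pick_most_likely(probabilities):
    """Return the option with the highest estimated probability."""
    return max(probabilities, key=probabilities.get)

# Hypothetical estimates: every option is unlikely to be "right",
# but one of them is still the most likely of the bunch.
guesses = {"guitar edge": 0.03, "alien arm": 0.02, "background": 0.01}

choice = pick_most_likely(guesses)
print(choice)
```

Even though the winner only has a 3% chance here, it still gets picked, because nothing else scores higher.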
So, for example, it has some idea of the places where a guitar might go in
an image, and it could look for whatever part of the noise seems most like
it could be the edge of the guitar (even though there really is no “right”
choice), and start filling things in.
Since there’s no right answer, every time you give it a different image of
pure noise it’s going to come up with a different piece of artwork!
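To make that concrete, here’s a rough sketch in Python. The “denoiser” below is only a stand-in that blurs the image a little on each step (the real model is vastly more sophisticated), and the function names, step count, and image size are my own assumptions for illustration. What it does show is the overall shape of the process: start from noise chosen by a seed, then clean it up a little at a time.

```python
import numpy as np

def toy_denoise_step(image):
    """Stand-in for the real model: blend each pixel with its 4 neighbors."""
    neighbors = (
        np.roll(image, 1, axis=0) + np.roll(image, -1, axis=0)
        + np.roll(image, 1, axis=1) + np.roll(image, -1, axis=1)
    ) / 4.0
    return 0.5 * image + 0.5 * neighbors

def generate(seed, steps=20, size=64):
    """Start from pure noise and remove a little of it on every step."""
    rng = np.random.default_rng(seed)
    image = rng.standard_normal((size, size))  # the starting noise
    for _ in range(steps):
        image = toy_denoise_step(image)
    return image

a = generate(seed=1)
b = generate(seed=1)  # same starting noise as `a`
c = generate(seed=2)  # different starting noise

print(np.array_equal(a, b))  # same noise in, same result out
print(np.allclose(a, c))     # different noise in, different result out
```

The same seed reproduces the same result, while a new seed gives a different one, which is why art tools let you fix or randomize the seed.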
Under the hood, Stable Diffusion is all math.
And I don’t mean that in the sense of “well, sure, computers are ultimately
just big calculators, and everything they do boils down to math”. I’m
talking about the “bewildering equations on a chalkboard” kind of math,
like the ones below:
(That’s from a technical tutorial I wrote on one of the many building blocks
of Stable Diffusion called “Attention”.)
The full set of equations that define each of the different building blocks
would fill a few pages, at least.
You might already be familiar with how images are represented, but let’s
look at an example. Here’s a long exposure photo I took at high tide:
And here’s how it’s represented mathematically. It’s 512 x 512 pixels, so we
represent it as a table with 512 rows and 512 columns. But we actually
need three tables to represent an image, because each pixel is made up of
a mixture of Red, Green, and Blue (RGB). Here are the actual values for the
above image.
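If you’d like to poke at this yourself, here’s a minimal sketch using NumPy. I’m using a blank placeholder image rather than my photo, and I’m assuming the common convention of whole numbers from 0 to 255 for each color value.

```python
import numpy as np

height, width, channels = 512, 512, 3

# Placeholder image: all zeros (black). A real photo would hold
# a value from 0 to 255 for each color at each pixel (assumed 8-bit).
image = np.zeros((height, width, channels), dtype=np.uint8)

# The three "tables", one per color, each with 512 rows and 512 columns.
red = image[:, :, 0]
green = image[:, :, 1]
blue = image[:, :, 2]

total_values = image.size  # 512 * 512 * 3 = 786432
print(red.shape, total_values)
```

So a single 512 x 512 color image already takes a little under 790 thousand numbers to describe.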
With Stable Diffusion, we also work with text. Here’s a description I might
write for the image:
A long exposure color photograph of decaying concrete steps leading dow
And here’s how this is represented as a table of numbers. There is one row
for each of the words, and each word is represented by 768 numbers.
These are the actual numbers used in Stable Diffusion v1.5 to represent
these words:
How we choose the numbers to represent a word is a fascinating topic, but
also fairly technical. You can loosely think of those numbers as each
representing a different aspect of the meaning of a word.
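As a rough sketch, the table for a prompt might look like this in NumPy. The values are random placeholders (the real ones come out of the model’s text encoder), and the 33 rows match the number of tokens in this example’s prompt.

```python
import numpy as np

num_tokens = 33         # token count for this example's prompt
values_per_token = 768  # numbers per token in Stable Diffusion v1.5

# Placeholder values only; the real table is produced by the text encoder.
rng = np.random.default_rng(0)
text_table = rng.standard_normal((num_tokens, values_per_token))

first_token = text_table[0]     # the 768 numbers for the first token
total_values = text_table.size  # 33 * 768 = 25344
print(first_token.shape, total_values)
```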
The most important and mind‐bending part of all of this, though, is the
concept of parameters.
A Billion Parameters
The initial noise and our text description are what we call our inputs to
Stable Diffusion, and different inputs will have different values in those
tables.
There is a much, much larger set of numbers that we plug into those
equations as well, though, that are the same every time–these are called
Stable Diffusion’s parameters.
The input image was represented by about 790k values, and the 33
“tokens” in our prompt are represented by about 25k values. The
parameters, by contrast, number about 1 billion.
Those 1 billion numbers are spread out across about 1,100 different
matrices of varying sizes. Each matrix is used at a different point in the
math.
I’ve printed out the full list of these matrices here, if you’re curious!
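Counting parameters is just a matter of multiplying out each matrix’s dimensions and adding everything up. Here’s a sketch with three made-up matrices; the names and shapes are hypothetical, not taken from the real model.

```python
import math

# Hypothetical matrix names and shapes, for illustration only.
matrix_shapes = {
    "example_layer_a": (320, 768),
    "example_layer_b": (320, 320),
    "example_layer_c": (1280, 1280),
}

# Each matrix contributes rows * columns parameters.
counts = {name: math.prod(shape) for name, shape in matrix_shapes.items()}
total_parameters = sum(counts.values())
print(total_parameters)
```

Do the same thing across roughly 1,100 real matrices and you arrive at about 1 billion.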
Stable Diffusion works because we figured out the right values to use for
each of those 1 billion numbers. How absurd is that?!
Not only did we not choose these numbers–we can’t even explain a single
one of them! This is why we can’t fully explain how Stable Diffusion works.
We have some decent intuition about what those equations are doing, but
a lot of what’s going on is hidden in the values of those numbers, and we
can’t fully make sense of it.
Insane, right?
The values are learned through a process called training.
When we run the very first training input through (with completely random
parameter values), what the model spits out is going to be nothing like the
desired output.
But, using the difference between the actual output and desired output,
we can apply some very basic calculus on those equations that will tell us,
for every one of those 1 billion numbers, a specific amount that we should
add or subtract. (Each individual parameter is tweaked by a different, small
amount!)
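Here’s that idea boiled down to a toy example with just three parameters instead of a billion. The “model” is a simple weighted sum, and the input, desired output, and step size are all made up for illustration; this is a sketch of the general technique (gradient descent), not Stable Diffusion’s actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)
params = rng.standard_normal(3)     # start with random parameter values
inputs = np.array([1.0, 2.0, 3.0])  # hypothetical training input
desired = 10.0                      # hypothetical desired output
learning_rate = 0.01                # keeps each tweak small

def model(params, inputs):
    """A stand-in 'model': a weighted sum of the inputs."""
    return float(params @ inputs)

losses = []
for _ in range(100):
    actual = model(params, inputs)
    error = actual - desired         # difference from the desired output
    losses.append(error ** 2)
    gradient = 2.0 * error * inputs  # calculus: slope of error^2 per parameter
    params = params - learning_rate * gradient  # a different tweak for each

print(losses[0], losses[-1])
```

Because the gradient is different for each parameter, each one gets nudged by its own amount, and the error shrinks with every pass.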
Once the authors finished training the model, they published the
parameter values for everyone to use freely!
Conclusion
I won’t be offended if you’re a little disappointed by the explanation here,
and that it’s not more understandable, but hopefully you at least feel like
the veil has been lifted, and that what you saw was mind‐bending and
inspiring!