Chapter 1. An Introduction to Generative Media

A Note for Early Release Readers
With Early Release ebooks, you get books in their earliest form—the authors’ raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles.

This will be the first chapter of the final book. Please note that the GitHub repo will be made active later on.

If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the editor.
Generative models have become widely popular in recent years. If you’re reading this book,
you’ve probably interacted with a generative model at some point. Maybe you’ve used
ChatGPT to generate text, used style transfer in apps like Instagram, or seen the deepfake
videos that have been making headlines. These are all examples of generative models in
action!
In this book, we’ll explore the world of generative models, starting with the basics of two
families of generative models, transformers and diffusion, and working our way up to more
advanced topics. We’ll cover the different types of generative models, how they work, and
how to use them. We’ll also look at some of the ethical and societal implications of
generative models and how they’re being used in the real world. In this chapter, we’ll cover
some of the history of how we got here and take a first look at the capabilities of a few of the models we’ll explore in more depth throughout the book.
What exactly is generative modeling? The high-level idea is to provide data to a model to
train it so afterward it can generate new data that looks similar to the training data. For
example, if I train a model on a dataset of images of cats, I can then use that model to
generate new images of cats that look like they could have come from the original dataset.
This is a powerful idea, and it has a wide range of applications, from creating novel images to generating text and audio, as we’ll see throughout this book.
As we’ll see throughout the book, there are popular tools that allow us to use existing models easily. In the world of Machine Learning, one can find open-access models that are trained on large datasets and are available for anyone to use. Training such models usually costs a lot of money and time, so having open access to them is very convenient. These pre-trained models can be used to generate new data, classify existing data, and more. We can even modify these models for novel use cases. One of the most popular places to find open-access models is Hugging Face, a platform with over a million models for all kinds of tasks. To generate our first images, we’ll use diffusers, an open-source library that provides access to state-of-the-art diffusion models. It’s a powerful, simple toolbox that lets us load and run pre-trained diffusion models in just a few lines of code.
By going to the Hugging Face Hub and filtering for models that generate images from a text prompt (text-to-image), we can find some of the most popular models, such as Stable
Diffusion and SDXL. We’ll use Stable Diffusion version 1.5, a diffusion model capable of
generating high-quality images! If you browse the model website, you can read the model
card, an essential document for discoverability and reproducibility. There, you can read about
the model, how it was trained, intended use cases, and more.
Given we have a model (Stable Diffusion) and a tool to use the model (diffusers), we can now
generate our first image! When we load models, we’ll need to send them to a specific
hardware device, such as a CPU (cpu), a GPU (cuda or cuda:0), or Apple Silicon via Metal (mps). The following code will frequently appear in the book: it assigns a device variable based on the hardware available in our machine.
import torch
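Here is a minimal sketch of how that device variable can be chosen (the exact checks may differ slightly from the book’s final code):

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)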
To load the model, we’ll use StableDiffusionPipeline from diffusers, which is ideal for this use case. Don’t worry about all the parameters for now; the most important ones are:
● There are many models with the Stable Diffusion architecture, so we need to specify the repository of the exact model we want to load from the Hub: runwayml/stable-diffusion-v1-5 in this case.
● We need to specify the precision we’ll load the model with. Precision is something we’ll learn more about later. At a high level, generative models are composed of many parameters (millions or even billions of them) whose values are learned during training, and we can store these parameters with different levels of precision (in other words, we can use more or fewer bits to store each one). A larger precision means more memory and computation but usually also means better results; half precision (float16) is typically enough for running inference, as the rough memory estimate below illustrates.
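As a rough, illustrative estimate (the parameter count below is approximate; Stable Diffusion 1.5’s UNet has on the order of 860 million parameters), precision translates into memory like this:

params = 860_000_000  # approximate parameter count, for illustration only
print(f"float32: {params * 4 / 1e9:.1f} GB")  # 4 bytes per parameter -> ~3.4 GB
print(f"float16: {params * 2 / 1e9:.1f} GB")  # 2 bytes per parameter -> ~1.7 GB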
The first time you run the following code, it can take a while: the pipeline downloads a model of multiple
gigabytes, after all! If you load the pipeline a second time, it will only re-download the model
if there has been a change in the remote repository that hosts the model on Hugging Face.
Hugging Face libraries store the model locally in a cache, making things much faster for
subsequent loads.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)
Now that the model is loaded, we can define a prompt, the text input the model will receive. We can then pass the prompt through the model and generate an image based on that text! (The prompt below is just an example; any description works.)

prompt = "a photograph of an astronaut riding a horse"
pipe(prompt).images[0]
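If you want reproducible results, diffusers pipelines accept a generator argument. Here is a minimal sketch (the seed value and file name are arbitrary):

generator = torch.Generator().manual_seed(42)  # fixed seed for reproducibility
image = pipe(prompt, generator=generator).images[0]
image.save("generated_image.png")  # save the resulting PIL image to disk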
Feel free to experiment with different prompts and generate new images. You will notice that there is room for improvement in the generations. Later chapters will explore how we can exert more control over these generations and introduce more recent models with better generation capabilities.
● Chapters 4 and 5 dive into all the components behind diffusion models and how to get from text to new images. Before that, Chapter 3 introduces methods, like autoencoders, that can learn efficient representations of input data and reduce its dimensionality.
● Other chapters show how to teach these models new concepts using just a few example images. For example, we can teach Stable Diffusion the concept of "my dog" to generate images of the author’s dog in novel scenarios, such as "my dog visiting the moon."
● Chapter 8 shows how diffusion models can be used for more than just image generation, for example to edit or build upon an existing image.
Just as diffusers is a very convenient library for diffusion models, the popular transformers library is extremely useful for running transformer-based models and adapting them to new use cases. It provides a standardized interface for a wide range of tasks, such as generating text, classifying the sentiment of a sentence, or transcribing audio.

The transformers library provides different layers of abstraction. For example, if you don’t care about all the internals, the easiest option is to use pipeline, which abstracts away all the pre- and post-processing needed to get a prediction. Let’s begin with a text classification task (text-classification).
from transformers import pipeline
classifier = pipeline("text-classification")
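Passing a sentence to the classifier returns a label and a score. The book’s original example sentence isn’t reproduced here, so the one below is only illustrative:

classifier("I think this book is going to be a great read!")
# returns something like: [{'label': 'POSITIVE', 'score': ...}]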
The model correctly predicted that the sentiment in the input text was positive. By default,
the text classification pipeline uses a sentiment analysis model under the hood, but we can also specify any other text classification model from the Hugging Face Hub.
Similarly, we can switch the task to text generation (text-generation), with which we
can generate new text based on an input prompt. By default, the pipeline will use the GPT-2
model.
generator = pipeline("text-generation")
prompt = "It was a dark and stormy"
generator(prompt)[0]["generated_text"]
'It was a dark and stormy afternoon, and it was quite crowded
and I looked up and saw a large group of people in a black
mitten wearing white and a yellow hat covering their
faces.\n\nThe weather forecast was low (7.'
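The pipeline forwards generation arguments to the underlying model, so we can, for instance, sample longer or multiple continuations (the values below are arbitrary):

generator(prompt, max_new_tokens=40, num_return_sequences=2, do_sample=True)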
Although GPT-2 is not a great model by today’s standards, it gives us an initial example of
transformers’ generation capabilities while using a small model. The same concepts we learn
about GPT-2 can be applied to models such as Llama or Mistral, some of the most powerful
open-access models (at the time of writing). Throughout the book, we’ll strike a balance
between the quality and size of the models. Usually, larger models have higher-quality
generations. At the same time, we want people with consumer computers or access to free
services (as mentioned in the Preface) to be able to create new generations by running code.
● Chapter 2 will teach how transformer models work under the hood. We’ll dive
into different types of transformer models and how to use them for generating
text.
● Chapter 5 will teach us how to continue training transformer models with our data
for different use cases. This will allow us to make conversational models like
those you might have seen in ChatGPT or Bard. We’ll also discuss techniques to do this training efficiently.
Generative models are not limited to images and text. Models can generate videos, short music clips, synthetic speech, and more; there are even models for transcribing meetings and generating sound effects! For now, we can limit ourselves to the now familiar transformers pipeline and use the small version of MusicGen, a model released by Meta that can generate music from a text description.
pipe = pipeline("text-to-audio", model="facebook/musicgen-small", device=device)

# The description below is illustrative; any short text prompt works.
data = pipe("a calm lo-fi beat with soft piano")
print(data)
Later, we’ll learn how audio data is represented and what these numbers are. Of course,
there’s no way for us to print the audio file directly in the book! The best alternative is to
embed an audio player in our notebook or save the audio to a file we can play with our favorite audio player.
import IPython.display as ipd

display(ipd.Audio(data["audio"][0], rate=data["sampling_rate"]))
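If you would rather save the clip to disk and play it with an external player, one option is scipy; this is a sketch, and the exact array shape may vary across transformers versions:

import scipy.io.wavfile

# data["audio"][0] is assumed to be a (channels, samples) float array
scipy.io.wavfile.write(
    "musicgen_out.wav", rate=data["sampling_rate"], data=data["audio"][0].T
)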
Ethical and Societal Implications
While generative models offer remarkable capabilities, their widespread adoption raises
important considerations around ethics and societal impact. It’s important to keep them in
mind as we explore the capabilities of generative models. Here are a few key areas to
consider:
● Privacy: The ability of generative models to create realistic images and videos based on very little data poses significant challenges to privacy. For example, creating synthetic images from a small set of real images of an individual raises questions about using personal data without consent. It also opens the door to impersonation and deepfakes, like those mentioned at the beginning of this chapter.
● Bias and fairness: Generative models are trained on large datasets that contain biases. These biases can be inherited and amplified by the generative models, as we’ll explore in Chapter 2. For example, biased datasets used to train image generation models can produce stereotyped or unrepresentative depictions of people. It’s important to consider mitigating these biases and to ensure that generative models are developed and used fairly.
● Regulation: Given the potential risks associated with generative models, there is a growing push from governments and regulatory bodies to define rules for the development and deployment of generative models.
It’s important to approach generative models with a thoughtful and ethical mindset. As we
explore the capabilities of these models, we’ll also consider the ethical implications and how to use them responsibly.
Generative models began decades ago with efforts focused on rule-based systems. As
computing power and data availability increased, generative models evolved to use statistical
methods and machine learning. With the emergence of deep learning as a powerful paradigm
in Machine Learning and breakthroughs in the fields of image and speech recognition,
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have
become widely popular in the last decade. CNNs revolutionized image processing tasks, and
RNNs brought sequential data modeling capabilities, enabling tasks like machine translation and text generation.
The introduction of Generative Adversarial Networks (GANs) by Ian Goodfellow in 2014,
and variants such as DCGAN, conditional GANs, and others, brought a new era of generative
models. GANs have been used to generate high-quality images and applied to tasks like style
transfer, enabling users to apply artistic styles to their images with astonishing realism.
Although quite powerful, the quality of GANs has been surpassed by diffusion models in
recent years.
Similarly, although RNNs were the go-to tool for language modeling, transformer models, introduced in 2017, quickly overtook them, topping benchmarks and setting new standards in NLP performance. GPT, in particular, became extremely popular due to its ability to generate coherent and contextually relevant text.
Generative AI began with simple models that could generate random data, such as random
images or text. Over time, the field has advanced significantly with the development of deep
learning and the rise of generative models that can generate highly realistic and complex data.
We’ve seen the emergence of models like GANs, which can generate high-quality images,
and transformers, which can generate coherent and contextually relevant text. We’ve also
seen the development of models like DALL·E and Stable Diffusion, which can generate images from text descriptions.
With the rapid expansion of research, resources, and development in generative AI in recent
years, a growing community interested in the area, a rich open-source ecosystem, and
research facilitating deployment, the field of generative AI is more accessible than ever,
leading to a wide range of applications and use cases. Between 2023 and 2024, we’ve seen a
new generation of models that can generate high-quality images, text, code, videos, and
more; examples include ChatGPT, DALL·E, Imagen, Stable Diffusion, Llama, Mistral, and
many others.
Several of the most impressive generative models we’ve seen in the past couple of years were
created by influential research labs in big, private companies. OpenAI developed ChatGPT,
DALL·E, and Sora; Google built Imagen, Bard, and Gemini; and Meta created Llama and
Code Llama. There’s a varying degree of openness in the way these models are released.
Some can be used via specific UIs, some have access through developer APIs, and some are
just announced as research reports with no public access at all. In some cases, code and
model weights are released as well: these are usually called open-source releases because
those are the essential artifacts necessary to run the model on your own hardware. Frequently, a vibrant community of practitioners grows around these releases, using open-source models as the clay for their creativity. All types of practitioners, including
researchers, engineers, tinkerers, and amateurs, build on top of each other’s work and come
up with novel solutions and clever ideas that push the field forward, one commit at a time.
Some of these ideas make their way into the theoretical corpus of knowledge that researchers draw from, and impressive new models that use them come out after a while.
Big models, even when hidden, serve as inspiration for the community, whose work in turn yields new ideas, tools, and open models.
This cycle can only work because some of the models are open-sourced and can be used by
the community. Companies that release open-source models don’t do it for altruistic reasons
but because they see economic value in this strategy. By providing code and models that are
adopted by the community, they receive public scrutiny with bug fixes, new ideas, derived
model architectures, or even new datasets that work well with the models released. Because
all these contributions are based on the assets they published, these companies can quickly
adopt them and thus move faster than they would on their own. When Meta released Llama,
one of the most popular LLMs, a thriving ecosystem organically grew around it. Both
established and new companies, including Meta, Stability AI (Stable Diffusion), and
Mistral AI, have embraced varying degrees of open source as part of their business strategy.
This is as legitimate as the strategy of competing companies that prefer to keep their trade
secrets behind closed doors (even if those companies can also draw from the open-source
community).
At this point, we’d like to clarify that model releases are rarely truly open-source. Unlike in
the software world, source code is not enough to fully understand a machine learning system.
Model weights are not enough either: they are just the final output of the model training
process. Being able to exactly replicate an existing model would require the source code used
to train the model (not just the modeling code or the inference code), the training regime and
parameters, and, crucially, all the data used for training. None of these, and particularly the
data, are usually released. If there were access to these details, it would be possible for the
community and the public to understand how the model works, explore the biases that may
afflict it, and better assess its strengths and limitations. Access to the weights and model code
provides an imperfect estimation of all this knowledge, but the actual hard data would be
much better. On top of that, even when the models are publicly released, they often come out with a special license that does not adhere to the Open Source Initiative’s definition of open source. This is not to say that the models are not useful or that the companies are not doing a good thing by releasing them, but it’s important context to keep in mind, and one of the reasons we often refer to such models as open-access rather than open-source.

Be that as it may, there has never been a better time to build generative models or to build with generative models. You don’t need to be an engineer in a top-notch research lab to come up with ideas to solve the problems that interest you, or to contribute to the field if you are so inclined.
Hopefully, after generating your first images, audio clips, and texts, you’ll be excited to learn how diffusion and transformer models work under the hood, how to adapt them to new use cases, and how to use them for different creative applications. Although this chapter focused on high-level tools, we’ll build solid foundations and intuition about how these models work as we embark on our generative journey. Let’s go ahead and learn about the principles of transformer models!
1 You might wonder about the variant parameter. In some repositories, you might find the same weights stored in multiple precisions. If we only pass torch_dtype=torch.float16, we download the default model (float32) and convert it to float16. By also specifying the fp16 variant, we download a smaller checkpoint already stored in float16 precision, which requires half the bandwidth and storage. Check the repo you are using to see which variants are available.
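As an illustrative comparison, both calls below end up with a half-precision pipeline in memory; they only differ in which files are downloaded:

# Downloads the default float32 weights, then casts them to float16 in memory
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Downloads the smaller checkpoint already stored in float16
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
)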