
Chapter 1. An Introduction to Generative Media

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form—the authors’ raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles.

This will be the first chapter of the final book. Please note that the GitHub repo will be made active later on.

If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the editor at [email protected].

Generative models have become widely popular in recent years. If you’re reading this book, you’ve probably interacted with a generative model at some point. Maybe you’ve used ChatGPT to generate text, used style transfer in apps like Instagram, or seen the deepfake videos that have been making headlines. These are all examples of generative models in action!

In this book, we’ll explore the world of generative models, starting with the basics of two families of generative models, transformers and diffusion, and working our way up to more advanced topics. We’ll cover the different types of generative models, how they work, and how to use them. We’ll also look at some of the ethical and societal implications of generative models and how they’re being used in the real world. In this chapter, we’ll cover some of the history of how we got here and take a look at some of the capabilities offered by some of the models, which we’ll explore in more depth throughout the book.

What exactly is generative modeling? The high-level idea is to provide data to a model to train it so afterward it can generate new data that looks similar to the training data. For example, if I train a model on a dataset of images of cats, I can then use that model to generate new images of cats that look like they could have come from the original dataset. This is a powerful idea, and it has a wide range of applications, from creating novel images and videos to generating text with a specific style.

As we’ll see through the book, there are popular tools that allow us to use existing models easily. In the world of Machine Learning, one can find open-access models that are trained on large datasets and are available for anyone to use. Usually, training such models can cost a lot of money and time, so having open access to them is very convenient. These pre-trained models can be used to generate new data, classify existing data, and more. We can even modify these models to use them in novel use cases. One of the most popular places to find open-access models is Hugging Face, a platform with over a million models for all kinds of Machine Learning tasks (including generating images!).

Generating Our First Image


As an example of an open-source library, we’ll kick off with diffusers. diffusers is a popular library that provides access to state-of-the-art diffusion models. It’s a powerful, simple toolbox that allows us to quickly load and train diffusion models!

By going to the Hugging Face Hub and filtering for models that generate images based on a text prompt (text-to-image), we can find some of the most popular models, such as Stable Diffusion and SDXL. We’ll use Stable Diffusion version 1.5, a diffusion model capable of generating high-quality images! If you browse the model website, you can read the model card, an essential document for discoverability and reproducibility. There, you can read about the model, how it was trained, intended use cases, and more.

Given we have a model (Stable Diffusion) and a tool to use the model (diffusers), we can now generate our first image! When we load models, we’ll need to send them to a specific hardware device, such as CPU (cpu), GPU (cuda or cuda:0), or Mac hardware called Metal (mps). The following code will frequently appear in the book: it assigns a variable to cuda:0 if a GPU is available; otherwise, it will use a CPU.

import torch

# Prefer the first GPU if one is available; otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda:0
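
The check above covers CPUs and NVIDIA GPUs. If you’re on a Mac with Apple Silicon, a minimal sketch of the same idea can also fall back to the Metal backend (mps) mentioned earlier; keep in mind that not every model or operation runs on mps, so your mileage may vary:

import torch

# A sketch: prefer CUDA if present, then Apple's Metal backend (mps), then the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")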


Next, we’ll load Stable Diffusion 1.5. diffusers offers a convenient, high-level wrapper called StableDiffusionPipeline, which is ideal for this use case. Don’t worry about all the parameters for now - the highlights are:

● There are many models with the Stable Diffusion architecture, so we need to specify the one we want to use, runwayml/stable-diffusion-v1-5, in this case.

● We need to specify the precision we’ll load the model with. Precision is something we’ll learn more about later. At a high level, generative models are composed of many parameters (millions or billions of them). Each parameter is a number learned during training, and we can store these parameters with different levels of precision (in other words, we can use more or fewer bits to store each parameter). A larger precision means more memory and computation but usually also means a better model. On the other hand, we can use a lower precision by setting torch_dtype=float16 and use less memory than the default float32.1 A quick back-of-the-envelope estimate of the difference follows this list.
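
To make the memory impact concrete, here is that quick estimate. The parameter count is an illustrative assumption, not the actual size of Stable Diffusion 1.5:

# Rough memory needed just to store the weights, ignoring activations and overhead.
num_parameters = 1_000_000_000  # assume a 1-billion-parameter model for illustration

bytes_float32 = num_parameters * 4  # float32 stores each parameter in 4 bytes
bytes_float16 = num_parameters * 2  # float16 stores each parameter in 2 bytes

print(f"float32: ~{bytes_float32 / 1e9:.1f} GB")  # ~4.0 GB
print(f"float16: ~{bytes_float16 / 1e9:.1f} GB")  # ~2.0 GB

Halving the precision roughly halves the memory needed for the weights, which can be the difference between fitting a model on a consumer GPU or not.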

The first time you run this code, it can take a bit: the pipeline downloads a model of multiple gigabytes, after all! If you load the pipeline a second time, it will only re-download the model if there has been a change in the remote repository that hosts the model on Hugging Face. Hugging Face libraries store the model locally in a cache, making things much faster for subsequent loads.

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)

Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]

`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["bos_token_id"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["eos_token_id"]` will be overriden.

Now that the model is loaded, we can define a prompt, the text input the model will receive. We can then pass the prompt through the model and generate an image based on that text!

prompt = "a photograph of an astronaut riding a horse"

pipe(prompt).images[0]

0%| | 0/50 [00:00<?, ?it/s]


Exciting! With a couple of lines of code, we generated a novel image. Play with the prompt and generate new images. You will notice that there is room for improvement in the generations. Later chapters will explore how to exert more control over these generations, as well as more recent models with better generation capabilities.
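
As a small preview of that control, the pipeline call accepts optional parameters such as a seeded random generator, the number of denoising steps, and the guidance scale. The values below are just examples to play with, not recommended settings:

# A sketch of common generation knobs (example values):
# - generator: a seeded random source so the same prompt reproduces the same image
# - num_inference_steps: how many denoising steps to run (more steps is slower)
# - guidance_scale: how strongly the image should follow the prompt
generator = torch.Generator(device=device).manual_seed(42)
image = pipe(
    prompt,
    generator=generator,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image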

● Chapters 4 and 5 dive into all the components behind diffusion models and how to get from text to new images. Before that, Chapter 3 introduces methods, like autoencoders, that can learn efficient representations from input data and reduce the compute requirements to build diffusion and other generative models.


● In Chapter 7, we’ll learn how to teach new concepts to Stable Diffusion. For example, we can teach Stable Diffusion the concept of "my dog" to generate images of the author’s dog in novel scenarios, such as "my dog visiting the moon".

● Chapter 8 shows how diffusion models can be used for more than just image generation, such as editing images with a prompt or filling empty parts of an image.

Generating Our First Text

Just as diffusers is a very convenient library for diffusion models, the popular transformers library is extremely useful for running transformers-based models and adapting them to new use cases. It provides a standardized interface for a wide range of tasks, such as generating text, detecting objects in images, and transcribing an audio file into text.

The transformers library provides different layers of abstraction. For example, if you don’t care about all the internals, the easiest option is to use pipeline, which abstracts all the processing required to get a prediction. We can instantiate a pipeline by calling the pipeline() function and specifying which task we want to solve, such as text-classification.
from transformers import pipeline

classifier = pipeline("text-classification")

classifier("This movie is disgustingly good !")

[{'label': 'POSITIVE', 'score': 0.9998536109924316}]

The model correctly predicted that the sentiment in the input text was positive. By default, the text classification pipeline uses a sentiment analysis model under the hood, but we can also specify other transformers-based text classification models.
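
For example, we can pass an explicit model ID from the Hub. The checkpoint below is a popular sentiment analysis model; the exact ID is just an example, and any other text classification model would work the same way:

from transformers import pipeline

# Pick a specific text classification model from the Hub instead of the default.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
classifier("This movie is disgustingly good !")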

Similarly, we can switch the task to text generation (text-generation), with which we can generate new text based on an input prompt. By default, the pipeline will use the GPT-2 model.

generator = pipeline("text-generation")
prompt = "It was a dark and stormy"

generator(prompt)[0]["generated_text"]

'It was a dark and stormy afternoon, and it was quite crowded
and I looked up and saw a large group of people in a black
mitten wearing white and a yellow hat covering their
faces.\n\nThe weather forecast was low (7.'

Although GPT-2 is not a great model by today’s standards, it gives us an initial example of transformers’ generation capabilities while using a small model. The same concepts we learn about GPT-2 can be applied to models such as Llama or Mistral, some of the most powerful open-access models (at the time of writing). Throughout the book, we’ll strike a balance between the quality and size of the models. Usually, larger models have higher-quality generations. At the same time, we want people with consumer computers or access to free services (as mentioned in the Preface) to be able to create new generations by running code.
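
Swapping in a different model is a one-line change, and the pipeline forwards generation parameters to the underlying model. The model ID and parameter values below are illustrative, not recommendations; larger models such as Llama or Mistral follow the same pattern but need much more memory and may require accepting a license on the Hub:

from transformers import pipeline

# Example: a slightly larger GPT-2 checkpoint with a few common generation knobs.
generator = pipeline("text-generation", model="gpt2-medium", device=device)

generator(
    "It was a dark and stormy",
    max_new_tokens=40,  # cap how much new text is generated
    do_sample=True,     # sample instead of greedy decoding for more variety
    temperature=0.8,    # higher values give more diverse (and less predictable) text
)[0]["generated_text"]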

● Chapter 2 will teach how transformer models work under the hood. We’ll dive into different types of transformer models and how to use them for generating text.

● Chapter 5 will teach us how to continue training transformer models with our data for different use cases. This will allow us to make conversational models like those you might have seen in ChatGPT or Bard. We’ll also discuss efficient training approaches so you can run the training on your computer!

Generating Our First Sound Clip

Generative models are not limited to images and text. Models can generate videos, short songs, synthetic spoken speech, protein proposals, and more!


Chapter 9 dives deeper into audio-related tasks that can be solved with Machine Learning, such as transcribing meetings and generating sound effects! For now, we can limit ourselves to the now familiar transformers pipeline and use the small version of MusicGen, a model released by Meta to generate music conditioned on text.

pipe = pipeline(
    "text-to-audio", model="facebook/musicgen-small", device=device
)

data = pipe("electric rock solo, very intense")
print(data)

{'audio': array([[[0.12342193, 0.11794732, 0.14775363, ..., 0.0265964 ,
         0.02168683, 0.03067675]]], dtype=float32),
 'sampling_rate': 32000}

Later, we’ll learn how audio data is represented and what these numbers are. Of course, there’s no way for us to print the audio file directly in the book! The best alternative is to show a player in our notebook or save the audio to a file we can play with our favorite audio application. For example, we can use IPython.display for this!

import IPython.display as ipd

display(ipd.Audio(data["audio"][0], rate=data["sampling_rate"]))
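
If you’d rather save the clip and open it with your favorite audio application, as mentioned above, one option is scipy’s WAV writer (assuming scipy is installed; the filename is arbitrary):

import scipy.io.wavfile

# The generated audio has shape (batch, channels, samples); grab the mono channel
# and write it as a WAV file at the sampling rate reported by the pipeline.
scipy.io.wavfile.write(
    "musicgen_sample.wav",
    rate=data["sampling_rate"],
    data=data["audio"][0][0],
)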
Ethical and Societal Implications

While generative models offer remarkable capabilities, their widespread adoption raises important considerations around ethics and societal impact. It’s important to keep them in mind as we explore the capabilities of generative models. Here are a few key areas to consider:

● Privacy and consent: The ability of generative models to generate realistic images and videos based on very little data poses significant challenges to privacy. For example, creating synthetic images from a small set of real images of an individual raises questions about using personal data without consent. It also increases the risk of creating deepfakes, which can be used to spread misinformation or harm individuals.

● Bias and fairness: Generative models are trained on large datasets that contain biases. These biases can be inherited and amplified by the generative models, as we’ll explore in Chapter 2. For example, image generation models trained on biased datasets may generate stereotypical or discriminatory images. It’s important to mitigate these biases and ensure that generative models are used fairly and ethically.

● Regulation: Given the potential risks associated with generative models, there is a growing call for regulatory oversight and accountability mechanisms to ensure responsible development and deployment. This includes transparency requirements, ethical guidelines, and legal frameworks to address the misuse of generative models.

It’s important to approach generative models with a thoughtful and ethical mindset. As we explore the capabilities of these models, we’ll also consider the ethical implications and how to use them responsibly.

Where We’ve Been and Where Things Stand

Generative models began decades ago with efforts focused on rule-based systems. As computing power and data availability increased, generative models evolved to use statistical methods and machine learning. With the emergence of deep learning as a powerful paradigm in Machine Learning and breakthroughs in the fields of image and speech recognition, generative models have advanced significantly. Although invented decades ago, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have become widely popular in the last decade. CNNs revolutionized image processing tasks, and RNNs brought sequential data modeling capabilities, enabling tasks like machine translation and text generation.
The introduction of Generative Adversarial Networks (GANs) by Ian Goodfellow in 2014, and variants such as DCGAN, conditional GANs, and others, brought a new era of generative models. GANs have been used to generate high-quality images and applied to tasks like style transfer, enabling users to apply artistic styles to their images with astonishing realism. Although quite powerful, the quality of GANs has been surpassed by diffusion models in recent years.

Similarly, although RNNs were the go-to tool for language modeling, transformer models, including architectures like GPT, achieved state-of-the-art performance in natural language processing. These models have demonstrated remarkable capabilities in tasks such as language understanding, text generation, and machine translation, surpassing previous benchmarks and setting new standards in NLP performance. GPT, in particular, became extremely popular due to its ability to generate coherent and contextually relevant text. Not long afterward, a huge wave of generative language models emerged!

Generative AI began with simple models that could generate random data, such as random images or text. Over time, the field has advanced significantly with the development of deep learning and the rise of generative models that can generate highly realistic and complex data. We’ve seen the emergence of models like GANs, which can generate high-quality images, and transformers, which can generate coherent and contextually relevant text. We’ve also seen the development of models like DALL·E and Stable Diffusion, which can generate images based on textual prompts.

With the rapid expansion of research, resources, and development in generative AI in recent years, a growing community interested in the area, a rich open-source ecosystem, and research facilitating deployment, the field of generative AI is more accessible than ever, leading to a wide range of applications and use cases. Between 2023 and 2024, we’ve seen a new generation of models that can generate high-quality images, text, code, videos, and more; examples include ChatGPT, DALL·E, Imagen, Stable Diffusion, Llama, Mistral, and many others.

How Are Generative AI Models Created? Big Budgets and Open Source

Several of the most impressive generative models we’ve seen in the past couple of years were created by influential research labs in big, private companies. OpenAI developed ChatGPT, DALL·E, and Sora; Google built Imagen, Bard, and Gemini; and Meta created Llama and Code Llama. There’s a varying degree of openness in the way these models are released. Some can be used via specific UIs, some have access through developer APIs, and some are just announced as research reports with no public access at all. In some cases, code and model weights are released as well: these are usually called open-source releases because those are the essential artifacts necessary to run the model on your hardware. Frequently, however, they are kept hidden for strategic reasons.


At the same time, an ever-increasing, energetic, flourishing, and enthusiastic community uses open-source models as the clay for their creativity. All types of practitioners, including researchers, engineers, tinkerers, and amateurs, build on top of each other’s work and come up with novel solutions and clever ideas that push the field forward, one commit at a time. Some of these ideas make their way into the theoretical corpus of knowledge that researchers draw from, and new impressive models that use them come out after a while. Big models, even when hidden, serve as inspiration for the community, whose work yields fruits that serve the field as a whole.

This cycle can only work because some of the models are open-sourced and can be used by the community. Companies that release open-source models don’t do it for altruistic reasons but because they see economic value in this strategy. By providing code and models that are adopted by the community, they receive public scrutiny along with bug fixes, new ideas, derived model architectures, or even new datasets that work well with the models released. Because all these contributions are based on the assets they published, these companies can quickly adopt them and thus move faster than they would on their own. When Meta released Llama, one of the most popular LLMs, a thriving ecosystem organically grew around it. Established and new companies alike, including Meta, Stability AI (Stable Diffusion), and Mistral AI, have embraced varying degrees of open source as part of their business strategy. This is as legitimate as the strategy of competing companies that prefer to keep their trade secrets behind closed doors (even if those companies can also draw from the open-source community).

At this point, we’d like to clarify that model releases are rarely truly open-source. Unlike in the software world, source code is not enough to fully understand a machine learning system. Model weights are not enough either: they are just the final output of the model training process. Being able to exactly replicate an existing model would require the source code used to train the model (not just the modeling code or the inference code), the training regime and parameters, and, crucially, all the data used for training. None of these, and particularly the data, are usually released. If there were access to these details, it would be possible for the community and the public to understand how the model works, explore the biases that may afflict it, and better assess its strengths and limitations. Access to the weights and model code provides an imperfect estimation of all this knowledge, but the actual hard data would be much better. On top of that, even when the models are publicly released, they often come out with a special license that does not adhere to the Open Source Initiative’s definition of open source. This is not to say that the models are not useful or that the companies are not doing a good thing by releasing them, but it’s important context to keep in mind and one of the reasons we’ll often say open access instead of open source.

Be that as it may, there has never been a better time to build generative models, or to build with them. You don’t need to be an engineer in a top-notch research lab to come up with ideas to solve the problems that interest you or to contribute to the field if you are so inclined. We hope you find these pages helpful in your journey!

The Path Ahead

Hopefully, after generating your first images, audio clips, and texts, you’ll be excited to learn how diffusion and transformers work under the hood, how to adapt them for new use cases, and how to use them for different creative applications. Although this chapter focused on high-level tools, we’ll build solid foundations and intuition on how these models work as we embark on our generative journey. Let’s go ahead and learn about the principles of transformer models!

1 You might wonder about the variant parameter. In some repositories, you might find multiple checkpoints with different precision. When specifying torch_dtype=float16, we download the default model (float32) and convert it to float16. By also specifying the fp16 variant, we’re downloading a smaller checkpoint already stored in float16 precision, which requires half the bandwidth and storage to download. Check the repo you want to use to see if there are multiple precision variants!
