
CIS433

AI and Deep Learning

Huaxia Rui

February 29, 2024


Agenda

8 Generative AI
Generative Pre-trained Transformer (GPT)
Variational Autoencoder (VAE)
Generative Adversarial Network (GAN)
Diffusion
Multimodal Learning


Generative AI

2014–2017: VAE and GAN Era


2018–2019: Transformer Era
2020–2022: Big Model Era

Generative Pre-trained Transformer (GPT)

GPT

def sample_next(probs, temperature=1.0):
    probs = np.log(np.asarray(probs).astype("float64"))
    probs = np.exp(probs / temperature)
    probs = probs / np.sum(probs)
    sample = np.random.multinomial(1, probs, 1)
    return np.argmax(sample)  # sample is a one-hot ndarray
Since p > q implies (p/q)^(1/t) → ∞ as t → 0, greedy sampling (t → 0) always chooses the most likely token, while stochastic sampling (t = 1) samples from the original distribution. Two sample generations:
This movie is nothing short of crap that they might start off with an oscar winner but i still wanted to kill ...
This movie is one of my favourite martial arts action martial arts action plays a jackie policeman who is it hard representation decides to fight mobsters
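A small numerical illustration (not from the slides) of how temperature reshapes a toy next-token distribution, using the same reweighting as sample_next above:

import numpy as np

probs = [0.5, 0.3, 0.2]            # toy next-token distribution
for t in (0.1, 1.0, 2.0):
    logp = np.log(np.asarray(probs, dtype="float64"))
    p = np.exp(logp / t)
    p = p / np.sum(p)
    print(t, np.round(p, 3))
# t=0.1 -> [0.994 0.006 0.   ]   nearly greedy
# t=1.0 -> [0.5   0.3   0.2  ]   the original distribution
# t=2.0 -> [0.415 0.322 0.263]   flatter, more random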

class TextGenerator(keras.callbacks.Callback):
    def __init__(self, prompt, generate_len, input_len,
                 temperatures=(1.,), print_freq=1):
        self.prompt = prompt
        self.generate_len = generate_len
        self.model_input_len = input_len
        self.temperatures = temperatures
        self.print_freq = print_freq
        vprompt = text_vectorization([prompt])[0].numpy()
        # np.nonzero() returns a tuple of arrays; the first padding zero
        # marks where the vectorized prompt ends
        self.prompt_len = np.nonzero(vprompt == 0)[0][0]

    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.print_freq != 0:
            return
        for temperature in self.temperatures:
            sentence = self.prompt
            for i in range(self.generate_len):
                tokens = text_vectorization([sentence])
                predictions = self.model(tokens)
                next_token = sample_next(
                    predictions[0, self.prompt_len - 1 + i, :],
                    temperature)  # pass the current temperature
                sentence += " " + tokens_index[next_token]
            print(sentence)  # the slide is cut off here; printing the
                             # generated sentence is the natural last step

def prepare_lm_dataset(text_batch):
    vectorized_sequences = text_vectorization(text_batch)
    x = vectorized_sequences[:, :-1]  # remove the last token
    y = vectorized_sequences[:, 1:]   # offset by one
    return x, y

lm_dataset = dataset.map(prepare_lm_dataset, num_parallel_calls=4)

inputs = Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(seq_len, 10000, d_emb)(inputs)
x = GT(d_emb=d_emb, n_heads=n_heads, d_ff=d_ff)(x)
outputs = Dense(10000, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="rmsprop")
prompt = "This movie"
cb = TextGenerator(prompt, input_len=seq_len,
                   generate_len=50, temperatures=(0.7, 1))
model.fit(lm_dataset, epochs=200, callbacks=[cb])
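The code above assumes a few objects the slides do not show: the text_vectorization layer, the raw text dataset, and the tokens_index lookup used by TextGenerator (PositionalEmbedding and GT, the Transformer block, are likewise assumed to be defined elsewhere in the course). A hedged sketch of that setup; the directory name and hyperparameters are illustrative, not the instructor's exact values.

from tensorflow import keras
from tensorflow.keras.layers import TextVectorization

seq_len = 100
text_vectorization = TextVectorization(
    max_tokens=10000,                      # vocabulary size used by the model above
    output_mode="int",
    output_sequence_length=seq_len + 1)    # +1 so the x/y shift keeps seq_len tokens

# a tf.data.Dataset of raw review strings (e.g., the IMDB reviews that the
# sample generations above suggest), batched and without labels
dataset = keras.utils.text_dataset_from_directory(
    "aclImdb", label_mode=None, batch_size=256)
text_vectorization.adapt(dataset)

# integer id -> word string, used when decoding sampled tokens
tokens_index = dict(enumerate(text_vectorization.get_vocabulary()))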
Variational Autoencoder (VAE)

Image Generation
The key idea of image generation is to learn a low-dimensional vector space where any point represents a "valid" image, that is, one resembling a real thing. The module that maps a latent point to an image, called a generator or a decoder, is then used to sample from the latent space and produce fake but realistic images that are essentially in-betweens of the training images.

Autoencoders
By learning to recover the input data, autoencoders learn a low-dimensional latent space of "valid" images. Clearly, an overly powerful encoder may simply learn to map each input to a single arbitrary number, with the decoder acting as its inverse. Note that PCA is an autoencoder with MSE loss and linear activations.

encoder = Sequential([
    Flatten(input_shape=[28, 28]),
    Dense(100, activation="selu"),
    Dense(30, activation="selu")])
decoder = Sequential([
    Dense(100, activation="selu", input_shape=[30]),
    Dense(28*28, activation="sigmoid"),
    Reshape([28, 28])])
ae = Sequential([encoder, decoder])

Variational Autoencoders (VAE)

Variational autoencoders are probabilistic autoencoders.
Autoencoder: maps an image directly to a point in the latent space.
VAE: maps an image to a Gaussian distribution N(µ, σ²) in the latent space, which is then used to sample an actual representation, or coding.

For numerical stability, the encoder outputs log(σ²) instead of σ.
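The train_step shown later calls self.sampler, which the slides do not define. A minimal sketch of that sampling step (the reparameterization trick), assuming the encoder outputs z_mean and z_log_var as described above:

import tensorflow as tf
from tensorflow import keras

class Sampler(keras.layers.Layer):
    def call(self, z_mean, z_log_var):
        batch = tf.shape(z_mean)[0]
        d_latent = tf.shape(z_mean)[1]
        epsilon = tf.random.normal(shape=(batch, d_latent))
        # z = mu + sigma * epsilon, with sigma = exp(log(sigma^2) / 2)
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon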

img = Input(shape=(28, 28, 1))
x = Conv2D(32, 3, activation="relu", padding="same",
           strides=2)(img)  # use strides to downsample
x = Conv2D(64, 3, activation="relu", padding="same", strides=2)(x)
x = Flatten()(x)  # output shape: 3136 = 7*7*64
x = Dense(16, activation="relu")(x)
z_mean = Dense(d_latent, name="z_mean")(x)
z_log_var = Dense(d_latent, name="z_log_var")(x)
encoder = Model(img, [z_mean, z_log_var], name="encoder")
# Unlike pooling, striding retains spatial information.

z = Input(shape=(d_latent,))
x = Dense(7 * 7 * 64, activation="relu")(z)
x = Reshape((7, 7, 64))(x)  # revert Flatten()
x = Conv2DTranspose(64, 3, activation="relu", strides=2,
                    padding="same")(x)  # output shape (14, 14, 64)
x = Conv2DTranspose(32, 3, activation="relu", strides=2,
                    padding="same")(x)  # output shape (28, 28, 32)
outputs = Conv2D(1, 3, activation="sigmoid",
                 padding="same")(x)  # shape (28, 28, 1)
decoder = Model(z, outputs, name="decoder")

def train_step(self, data):  # for class VAE(keras.Model)
    with tf.GradientTape() as tape:
        z_mean, z_log_var = self.encoder(data)
        z = self.sampler(z_mean, z_log_var)  # Gaussian sample
        recon = self.decoder(z)
        recon_loss = tf.reduce_mean(tf.reduce_sum(
            binary_crossentropy(data, recon), axis=(1, 2)))
        kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean)
                          - tf.exp(z_log_var))
        loss = recon_loss + tf.reduce_mean(kl_loss)

    grads = tape.gradient(loss, self.trainable_weights)
    self.optimizer.apply_gradients(
        zip(grads, self.trainable_weights))

    self.loss_tracker.update_state(loss)
    self.recon_loss_tracker.update_state(recon_loss)
    self.kl_loss_tracker.update_state(kl_loss)
    return {"loss": self.loss_tracker.result(),
            "recon_loss": self.recon_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result()}

Even though labels were not used for training, the VAE has learnt the different digits by itself in order to help minimize the reconstruction loss.

The VAE generates a continuous distribution of image classes, with one class morphing into another as we traverse the latent space. VAEs are great for learning a well-structured latent space, where specific directions encode meaningful axes of variation.
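One way to see this morphing, as a hedged sketch that assumes a 2-dimensional latent space (d_latent = 2) and the decoder defined earlier: decode a regular grid of latent points and tile the resulting digits.

import numpy as np

n = 15                                   # 15 x 15 grid of digits
grid = np.linspace(-2, 2, n)
canvas = np.zeros((28 * n, 28 * n))
for i, zx in enumerate(grid):
    for j, zy in enumerate(grid):
        z = np.array([[zx, zy]])         # a single latent point
        digit = decoder.predict(z, verbose=0)[0, :, :, 0]
        canvas[i * 28:(i + 1) * 28, j * 28:(j + 1) * 28] = digit
# plotting canvas shows neighbouring digits morphing smoothly into one another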

Generative Adversarial Network (GAN)

Generative Adversarial Networks (GANs)

GANs consist of a generator trying to generate fake data resembling the real data and a discriminator trying to distinguish the two. Each training iteration is divided into two phases with opposite goals:
1. Train the discriminator using a balanced sample of real and fake data.
2. Train the generator with the discriminator component frozen.


discriminator = Sequential([
    Input(shape=(64, 64, 3)),
    Conv2D(64, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2D(128, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2D(128, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Flatten(), Dropout(0.2), Dense(1, activation="sigmoid")])

generator = Sequential([
    Input(shape=(d_latent,)),
    Dense(8 * 8 * 128),
    Reshape((8, 8, 128)),
    Conv2DTranspose(128, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2DTranspose(256, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2DTranspose(512, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2D(3, 5, padding="same", activation="sigmoid")])

Generated images are labeled as 1. Let the latent dimension be 128, the batch size be 32, and the dataset of real images be real_images.
class GAN(keras.Model):
    def __init__(self, discriminator, generator, d_latent):
        super().__init__()
        self.discriminator = discriminator
        self.generator = generator
        self.d_latent = d_latent
        self.d_loss_metric = keras.metrics.Mean(name="d_loss")
        self.g_loss_metric = keras.metrics.Mean(name="g_loss")

    def compile(self, d_optimizer, g_optimizer, loss_fn):
        super().compile()
        self.d_optimizer = d_optimizer
        self.g_optimizer = g_optimizer
        self.loss_fn = loss_fn

    # the decorator implements a getter
    @property
    def metrics(self):
        return [self.d_loss_metric, self.g_loss_metric]

def train_step(self, real_images):
    batch = tf.shape(real_images)[0]

    # train the discriminator, for 1 step
    x = tf.random.normal(shape=(batch, self.d_latent))
    fake_imgs = self.generator(x)
    imgs = tf.concat([fake_imgs, real_images], axis=0)
    y = tf.concat([tf.ones((batch, 1)),                 # 1: fake
                   tf.zeros((batch, 1))], axis=0)       # 0: real
    y += 0.05 * tf.random.uniform(tf.shape(y))          # noisy labels

    with tf.GradientTape() as tape:
        predictions = self.discriminator(imgs)
        d_loss = self.loss_fn(y, predictions)

    grads = tape.gradient(d_loss,
                          self.discriminator.trainable_weights)
    self.d_optimizer.apply_gradients(
        zip(grads, self.discriminator.trainable_weights))

    # train the generator, for 1 step (train_step continued)
    x = tf.random.normal(shape=(batch, self.d_latent))
    with tf.GradientTape() as tape:
        predictions = self.discriminator(self.generator(x))
        g_loss = self.loss_fn(tf.zeros((batch, 1)),     # 0: real
                              predictions)
    grads = tape.gradient(g_loss, self.generator.trainable_weights)
    self.g_optimizer.apply_gradients(
        zip(grads, self.generator.trainable_weights))

    # update metrics
    self.d_loss_metric.update_state(d_loss)
    self.g_loss_metric.update_state(g_loss)
    return {"d_loss": self.d_loss_metric.result(),
            "g_loss": self.g_loss_metric.result()}


Wasserstein GAN with Gradient Penalty

The binary cross-entropy loss −(1/n) Σᵢ [ yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ) ] labels the response from {1, 0}. The Wasserstein loss labels the response from {1, −1} and requires no activation in the final layer of the discriminator.

GAN discriminator optimization:  min_D −( E_{x∼pX} log D(x) + E_{z∼pZ} log(1 − D(G(z))) )
GAN generator optimization:      min_G −E_{z∼pZ} log D(G(z))
WGAN critic optimization:        min_D −( E_{x∼pX} D(x) − E_{z∼pZ} D(G(z)) )
WGAN generator optimization:     min_G −E_{z∼pZ} D(G(z))

The WGAN critic outputs a score in (−∞, ∞) and tries to maximize the difference between the scores it assigns to real images and fake images. The Wasserstein loss can be very large, which is troubling for a neural network. To counter this, we require the critic to be a 1-Lipschitz continuous function:

|D(x₁) − D(x₂)| / ∥x₁ − x₂∥₁ ≤ 1

where ∥x₁ − x₂∥₁ is the average pixelwise absolute difference between images x₁ and x₂. Geometrically, a double cone can be moved along the graph of D with its origin on the graph, and the whole graph always stays outside the double cone.

def gradient_penalty(self, batch_size, real, fake):
    alpha = tf.random.normal([batch_size, 1, 1, 1], 0, 1)
    # calculate a set of interpolated images
    interpolated = real + alpha * (fake - real)
    with tf.GradientTape() as gp_tape:
        gp_tape.watch(interpolated)
        pred = self.critic(interpolated, training=True)
    grads = gp_tape.gradient(pred, [interpolated])[0]
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]))
    # penalize deviation from 1
    return tf.reduce_mean((norm - 1) ** 2)
An advantage of Wasserstein loss is that we don’t need to worry about
balancing the training of the critic and the generator. In fact, the critic
must be trained to convergence before updating the generator.
Batch normalization shouldn’t be used in the critic.
c_wass_loss = (tf.reduce_mean(fake_scores)
               - tf.reduce_mean(real_scores))
c_gp = self.gradient_penalty(batch_size, real, fake)
c_loss = c_wass_loss + c_gp * self.gp_weight
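For symmetry, a hedged sketch of the corresponding generator update under the Wasserstein formulation, written as it would appear inside the same model class (the conditional GAN code below uses the same loss form):

x = tf.random.normal(shape=(batch_size, self.d_latent))
with tf.GradientTape() as tape:
    fake_scores = self.critic(self.generator(x, training=True),
                              training=True)
    # the generator wants the critic to assign high scores to its fakes
    g_loss = -tf.reduce_mean(fake_scores)
grads = tape.gradient(g_loss, self.generator.trainable_weights)
self.g_optimizer.apply_gradients(zip(grads, self.generator.trainable_weights))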

Conditional GAN (CGAN)

Because the critic now has access to extra information about the image content, the generator is forced to fake images that agree with the content label so as to keep fooling the critic. Otherwise, the critic can detect fake images by noticing the discrepancy between the image content and the content label.

critic_input = Input(shape=(64, 64, 3))
label_input = Input(shape=(64, 64, 2))
x = Concatenate(axis=-1)([critic_input, label_input])

generator_input = Input(shape=(D_LATENT,))
label_input = Input(shape=(2,))
x = Concatenate(axis=-1)([generator_input, label_input])

def train_step(self, data):
    real, labels = data  # unpack images and labels
    label_images = labels[:, None, None, :]  # match the image rank
    label_images = tf.repeat(label_images, repeats=64, axis=1)
    label_images = tf.repeat(label_images, repeats=64, axis=2)
    b = tf.shape(real)[0]
    # train the critic
    for i in range(self.critic_steps):
        x = tf.random.normal(shape=(b, self.d_latent))
        with tf.GradientTape() as tape:
            fake = self.generator([x, labels], training=True)
            fake_scores = self.critic([fake, label_images],
                                      training=True)
            real_scores = self.critic([real, label_images],
                                      training=True)
            c_wass_loss = (tf.reduce_mean(fake_scores)
                           - tf.reduce_mean(real_scores))
            c_loss = c_wass_loss + self.gp_weight * self.gradient_penalty(
                b, real, fake, label_images)

        c_gradient = tape.gradient(c_loss,
                                   self.critic.trainable_variables)
        self.c_optimizer.apply_gradients(
            zip(c_gradient, self.critic.trainable_variables))
    # train the generator
    x = tf.random.normal(shape=(b, self.d_latent))
    with tf.GradientTape() as tape:
        # feed with both latent vectors and label vectors
        fake = self.generator([x, labels], training=True)
        # feed with both fake images and label images
        fake_scores = self.critic([fake, label_images],
                                  training=True)
        g_loss = -tf.reduce_mean(fake_scores)

    gen_gradient = tape.gradient(g_loss,
                                 self.generator.trainable_variables)
    self.g_optimizer.apply_gradients(
        zip(gen_gradient, self.generator.trainable_variables))

Diffusion
If we normalize the original image x₀ to have zero mean and unit variance, x_T will approximate a standard Gaussian distribution N(0, I) for large enough T.

x_t ≡ √α_t · x_{t−1} + √(1 − α_t) · ϵ_{t−1}

where {ϵ_t} are i.i.d. standard Gaussian. Expanding one step back,

x_t = √α_t ( √α_{t−1} x_{t−2} + √(1 − α_{t−1}) ϵ_{t−2} ) + √(1 − α_t) ϵ_{t−1}
    = √(α_t α_{t−1}) x_{t−2} + √(α_t (1 − α_{t−1})) ϵ_{t−2} + √(1 − α_t) ϵ_{t−1}
    = √(α_t α_{t−1}) x_{t−2} + √(1 − α_t α_{t−1}) ϵ̃_{t−2}    (the two noise terms combine into one draw from N(0, (1 − α_t α_{t−1}) I))
    ⋮
    = √ᾱ_t x₀ + √(1 − ᾱ_t) ϵ̃₀,   where ᾱ_t ≡ ∏_{i=1}^{t} α_i.

Given a diffusion schedule {ᾱ_t}, we can jump from x₀ to any step of the forward diffusion process: x_t ∼ N(√ᾱ_t x₀, (1 − ᾱ_t) I) ≡ q(x_t | x₀).
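A hedged sketch of this closed-form jump, assuming a precomputed tensor alpha_bar holding the T cumulative products ᾱ_1, ..., ᾱ_T (names are illustrative):

import tensorflow as tf

def q_sample(x0, t, alpha_bar):
    # sample x_t ~ q(x_t | x_0) in a single step
    a_bar = tf.reshape(tf.gather(alpha_bar, t), (-1, 1, 1, 1))
    eps = tf.random.normal(tf.shape(x0))
    return tf.sqrt(a_bar) * x0 + tf.sqrt(1.0 - a_bar) * eps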

The cosine diffusion schedule sets

ᾱ_t = cos²( t/T · π/2 ),    1 − ᾱ_t = sin²( t/T · π/2 )

def cosine_diffusion_schedule(diffusion_times):
    signal_rates = tf.cos(diffusion_times * math.pi / 2)
    noise_rates = tf.sin(diffusion_times * math.pi / 2)
    return noise_rates, signal_rates

def linear_diffusion_schedule(diffusion_times):
    alphas = 0.9999 - diffusion_times * 0.0199
    alpha_bars = tf.math.cumprod(alphas)
    # return (noise_rates, signal_rates) to match the cosine schedule
    return tf.sqrt(1 - alpha_bars), tf.sqrt(alpha_bars)

Algorithm 1: Training a Denoising Diffusion Model

repeat
    x₀ ∼ q(x₀)                                 ▷ sample an image
    t ∼ Uniform({1, ..., T}),  ϵ ∼ N(0, I)
    x_t = √ᾱ_t x₀ + √(1 − ᾱ_t) ϵ               ▷ transform x₀ by t noising steps
    the neural network predicts the noise ϵ_θ from (x_t, ᾱ_t)
    take a gradient descent step on ∇_θ ∥ϵ − ϵ_θ∥²
until convergence

def train_step(self, images):
    images = self.normalizer(images, training=True)
    # sample noise to match the shape of images
    noises = tf.random.normal(shape=(32, 64, 64, 3))
    # sample one diffusion time step for each image
    t = tf.random.uniform(shape=(32, 1, 1, 1), minval=0.0, maxval=1.0)
    # obtain 1 noise rate and 1 signal rate for each image
    noise_rates, signal_rates = self.diffusion_schedule(t)
    noisy_images = signal_rates * images + noise_rates * noises

Unlike the VAE encoder, the forward process is unparameterized; but like the VAE decoder, the reverse diffusion process also aims to transform random input into meaningful output using a network.

    with tf.GradientTape() as tape:
        pred_noises, pred_images = self.denoise(noisy_images,
            noise_rates, signal_rates, training=True)
        # denoise() passes the noisy images to the network
        noise_loss = self.loss(noises, pred_noises)
    gradients = tape.gradient(noise_loss,
                              self.network.trainable_weights)
    self.optimizer.apply_gradients(
        zip(gradients, self.network.trainable_weights))
    self.noise_loss_tracker.update_state(noise_loss)
    # the diffusion model maintains 2 networks; the exponential moving
    # average is more robust for generation than the actively trained network
    for w, ema_w in zip(self.network.weights,
                        self.ema_network.weights):
        ema_w.assign(EMA * ema_w + (1 - EMA) * w)
    return {m.name: m.result() for m in self.metrics}

def denoise(self, noisy_images, noise_rates, signal_rates, training):
    if training:
        nn = self.network
    else:
        nn = self.ema_network

    pred_noises = nn([noisy_images, noise_rates**2],
                     training=training)
    pred_images = (noisy_images
                   - noise_rates * pred_noises) / signal_rates
    return pred_noises, pred_images


Inside generate(self, n_img, n_steps):

step = 1.0 / n_steps
current_images = tf.random.normal(shape=(n_img, 64, 64, 3))
for i in range(n_steps):
    t = tf.ones((n_img, 1, 1, 1)) - i * step
    n_rates, s_rates = self.diffusion_schedule(t)
    pred_noises, pred_images = self.denoise(current_images,
                                            n_rates, s_rates,
                                            training=False)
    # the next image is based on the currently predicted one
    n_rates, s_rates = self.diffusion_schedule(t - step)
    current_images = (s_rates * pred_images
                      + n_rates * pred_noises)
images = (self.normalizer.mean
          + pred_images * self.normalizer.variance ** 0.5)
return tf.clip_by_value(images, 0.0, 1.0)

Increasing the number of diffusion steps (i.e., n_steps) in the reverse process improves the image quality, as is shown in the figure.
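A hedged usage sketch, assuming the train_step, denoise, and generate methods above belong to one trained diffusion-model class (the object name ddm is illustrative):

generated = ddm.generate(n_img=8, n_steps=20)    # 8 images, 20 reverse steps
better = ddm.generate(n_img=8, n_steps=100)      # more steps, sharper samples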

We can interpolate between points in the Gaussian latent space to smoothly transition between images in the pixel space.

The initial noise map at each interpolation step is a·sin(π/2 · t) + b·cos(π/2 · t), where t ranges from 0 to 1 and a, b are two randomly sampled Gaussian noise tensors.
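A hedged sketch of this spherical interpolation; it assumes a hypothetical variant of generate() (here called generate_from) that accepts an initial noise tensor instead of drawing its own:

import math
import tensorflow as tf

a = tf.random.normal(shape=(1, 64, 64, 3))    # the two endpoint noise maps
b = tf.random.normal(shape=(1, 64, 64, 3))
frames = []
for t in tf.linspace(0.0, 1.0, 10):
    noise = a * tf.sin(math.pi / 2 * t) + b * tf.cos(math.pi / 2 * t)
    # run the reverse diffusion process starting from this noise map
    frames.append(ddm.generate_from(noise, n_steps=20))  # hypothetical method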

Each successive DownBlock first increases the number of channels via ResidualBlock and then halves the image size via AveragePooling2D.
Each successive UpBlock decreases the number of channels via ResidualBlock while also concatenating the outputs from the corresponding DownBlock through skip connections across the U-Net.

def ResidualBlock(n_channels):
    def f(x):
        if x.shape[3] == n_channels:
            identity = x
        else:
            identity = Conv2D(n_channels, kernel_size=1)(x)
        x = BatchNormalization(center=False, scale=False)(x)
        x = Conv2D(n_channels, 3, padding="same",
                   activation=keras.activations.swish)(x)
        x = Conv2D(n_channels, 3, padding="same")(x)
        x = Add()([x, identity])
        return x
    return f

def DownBlock(n, depth):
    def f(x):
        x, skips = x
        for _ in range(depth):
            x = ResidualBlock(n)(x)
            skips.append(x)
        x = AveragePooling2D(pool_size=2)(x)
        return x
    return f

def UpBlock(n, depth):
    def f(x):
        x, skips = x
        # By default, UpSampling2D() copies pixels to match the size
        x = UpSampling2D(size=2, interpolation="bilinear")(x)
        for _ in range(depth):
            x = Concatenate()([x, skips.pop()])
            x = ResidualBlock(n)(x)
        return x
    return f

sin_embedding() converts a scalar into a 32-dimensional vector; keras.layers.Lambda() wraps it as a Layer object for the model.

noisy_images = Input(shape=(64, 64, 3))
x = Conv2D(32, 1)(noisy_images)
noise_variances = Input(shape=(1, 1, 1))
embedding = Lambda(sin_embedding)(noise_variances)
embedding = UpSampling2D(size=(64, 64))(embedding)
x = Concatenate()([x, embedding])
skips = []  # hold outputs from DownBlocks
x = DownBlock(32, depth=2)([x, skips])
x = DownBlock(64, depth=2)([x, skips])
x = DownBlock(96, depth=2)([x, skips])
x = ResidualBlock(128)(x)
x = ResidualBlock(128)(x)
x = UpBlock(96, depth=2)([x, skips])
x = UpBlock(64, depth=2)([x, skips])
x = UpBlock(32, depth=2)([x, skips])
x = Conv2D(3, 1, kernel_initializer="zeros")(x)
unet = keras.Model([noisy_images, noise_variances], x)
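sin_embedding() itself is not shown on the slides; a hedged sketch of a typical sinusoidal embedding that maps the scalar noise variance to a 32-dimensional vector (the frequency range of 1 to 1000 is an assumption):

import math
import tensorflow as tf

def sin_embedding(x):
    # 16 log-spaced frequencies -> 16 sines + 16 cosines = 32 channels
    frequencies = tf.exp(tf.linspace(tf.math.log(1.0),
                                     tf.math.log(1000.0), 16))
    angular_speeds = 2.0 * math.pi * frequencies
    return tf.concat([tf.sin(angular_speeds * x),
                      tf.cos(angular_speeds * x)], axis=3)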
Multimodal Learning

Multimodal Models
CLIP, or Contrastive Language-Image Pre-training (OpenAI, 2021), is a "neural network that efficiently learns visual concepts from natural language supervision". It was trained on 400 million text-image pairs by maximizing the cosine similarity between matching text-image pairs and minimizing the cosine similarity between incorrect text-image pairs.

Both the text encoder and the image encoder are Transformers, where the Vision Transformer (ViT) applies an encoder Transformer to a sequence of nonoverlapping input patches of an image with positional embeddings.
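A hedged sketch of the symmetric contrastive objective described above, assuming a batch of already-computed, L2-normalized image and text embeddings (function and argument names are illustrative):

import tensorflow as tf

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # cosine similarities between every image and every text in the batch
    logits = tf.matmul(img_emb, txt_emb, transpose_b=True) / temperature
    labels = tf.range(tf.shape(logits)[0])   # matching pairs sit on the diagonal
    loss_i = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)                 # image -> text
    loss_t = tf.keras.losses.sparse_categorical_crossentropy(
        labels, tf.transpose(logits), from_logits=True)   # text -> image
    return tf.reduce_mean((loss_i + loss_t) / 2)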

DALL·E 2

The CLIP training process learns a joint representation space for text and images, as depicted above the dotted line. The text-to-image generation process is depicted below the dotted line.
1. A CLIP text embedding is first fed to a diffusion prior to produce an image embedding.
2. The image embedding is used to condition a diffusion decoder which produces the final image.
The CLIP model is frozen during the training of the prior and the decoder.

The diffusion decoder of DALL·E 2 borrows from GLIDE, or Guided Language to Image Diffusion for Generation and Editing (OpenAI, 2021), which is trained from scratch on text prompts instead of using CLIP embeddings.
GLIDE uses a U-Net for the denoiser, the Transformer architecture for the text encoder, and an upsampler to scale the image to 1024×1024.
GLIDE trains its 3.5B parameters from scratch, including 2.3B for the visual part and 1.2B for the Transformer.

Stable Diffusion
Stable Diffusion (Stability AI, 2022) wraps the diffusion model within an autoencoder, so the diffusion process operates on a latent space of the image rather than the pixel space.

Compared with U-Net models that operate in pixel space, the denoising U-Net of Stable Diffusion is lighter.
The autoencoder does the heavy lifting of encoding image details into the latent space and decoding the latent space back to the pixel space.
The diffusion model works purely in a latent conceptual space.
