8 Generative AI
Huaxia Rui
Agenda
Generative Pre-trained Transformer (GPT)
Variational Autoencoder (VAE)
Generative Adversarial Network (GAN)
Diffusion
Multimodal Learning
Generative Pre-trained Transformer (GPT)
Image Generation
The key idea of image generation is to learn a low-dimensional vector space
where any point represents a “valid” image — one resembling a real thing.
Autoencoders
By reconstructing their input, autoencoders learn a low-dimensional latent
space of "valid" images. Note that an overly powerful encoder may simply learn
to map each input to an arbitrary number, with the decoder acting as its inverse.
(PCA is an autoencoder with MSE loss and linear activations.)
from tensorflow.keras.layers import Dense, Flatten, Reshape
from tensorflow.keras.models import Sequential

# encoder: compress a 28x28 image into a 30-dimensional latent vector
encoder = Sequential([
    Flatten(input_shape=[28, 28]),
    Dense(100, activation="selu"),
    Dense(30, activation="selu")])

# decoder: reconstruct the 28x28 image from the latent vector
decoder = Sequential([
    Dense(100, activation="selu", input_shape=[30]),
    Dense(28 * 28, activation="sigmoid"),
    Reshape([28, 28])])

ae = Sequential([encoder, decoder])
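A hedged usage sketch: train the autoencoder to reconstruct its own input
under MSE loss (cf. the PCA remark above). The MNIST dataset and the optimizer
settings are assumptions for illustration, not from the slides:

from tensorflow.keras.datasets import mnist

(x_train, _), _ = mnist.load_data()          # assumed dataset
x_train = x_train.astype("float32") / 255.0  # scale pixels to [0, 1]

ae.compile(optimizer="adam", loss="mse")     # reconstruction objective
ae.fit(x_train, x_train, epochs=5, batch_size=128)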
Variational Autoencoder (VAE)
Generative Adversarial Network (GAN)
from tensorflow.keras.layers import (Conv2D, Conv2DTranspose, Dense,
                                     Dropout, Flatten, Input, LeakyReLU,
                                     Reshape)
from tensorflow.keras.models import Sequential

d_latent = 128  # latent dimension (set to 128 below)

# discriminator: downsample a 64x64 RGB image to a real/fake probability
discriminator = Sequential([
    Input(shape=(64, 64, 3)),
    Conv2D(64, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2D(128, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2D(128, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Flatten(), Dropout(0.2), Dense(1, activation="sigmoid")])

# generator: upsample a latent vector into a 64x64 RGB image (8 -> 16 -> 32 -> 64)
generator = Sequential([
    Input(shape=(d_latent,)),
    Dense(8 * 8 * 128),
    Reshape((8, 8, 128)),
    Conv2DTranspose(128, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2DTranspose(256, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2DTranspose(512, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2D(3, kernel_size=5, padding="same", activation="sigmoid")])
Generated images are labeled 1 and real images 0. Let the latent dimension be
128, the batch size be 32, and the dataset of real images be real_img.
import tensorflow as tf
from tensorflow import keras

class GAN(keras.Model):
    def __init__(self, discriminator, generator, d_latent):
        super().__init__()
        self.discriminator = discriminator
        self.generator = generator
        self.d_latent = d_latent
        self.d_loss_metric = keras.metrics.Mean(name="d_loss")
        self.g_loss_metric = keras.metrics.Mean(name="g_loss")

    def compile(self, d_optimizer, g_optimizer, loss_fn):
        super().compile()
        self.d_optimizer = d_optimizer
        self.g_optimizer = g_optimizer
        self.loss_fn = loss_fn

    # the @property decorator implements a getter, so Keras can find
    # and reset these metrics at the start of each epoch
    @property
    def metrics(self):
        return [self.d_loss_metric, self.g_loss_metric]
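The train_step is not on the extracted slides; below is a minimal sketch in
the style of the standard Keras DCGAN example, continuing the GAN class above
(label-smoothing tricks from the full example are omitted):

    def train_step(self, real_img):
        batch_size = tf.shape(real_img)[0]
        # train the discriminator: generated images labeled 1, real ones 0
        z = tf.random.normal(shape=(batch_size, self.d_latent))
        fake_img = self.generator(z)
        images = tf.concat([fake_img, real_img], axis=0)
        labels = tf.concat([tf.ones((batch_size, 1)),
                            tf.zeros((batch_size, 1))], axis=0)
        with tf.GradientTape() as tape:
            d_loss = self.loss_fn(labels, self.discriminator(images))
        grads = tape.gradient(d_loss, self.discriminator.trainable_weights)
        self.d_optimizer.apply_gradients(
            zip(grads, self.discriminator.trainable_weights))
        # train the generator: make the discriminator output 0 ("real")
        # on freshly generated images
        z = tf.random.normal(shape=(batch_size, self.d_latent))
        with tf.GradientTape() as tape:
            preds = self.discriminator(self.generator(z))
            g_loss = self.loss_fn(tf.zeros((batch_size, 1)), preds)
        grads = tape.gradient(g_loss, self.generator.trainable_weights)
        self.g_optimizer.apply_gradients(
            zip(grads, self.generator.trainable_weights))
        self.d_loss_metric.update_state(d_loss)
        self.g_loss_metric.update_state(g_loss)
        return {m.name: m.result() for m in self.metrics}

A usage sketch with the slide's hyperparameters (the optimizer settings are
assumptions):

gan = GAN(discriminator, generator, d_latent=128)
gan.compile(d_optimizer=keras.optimizers.Adam(1e-4),
            g_optimizer=keras.optimizers.Adam(1e-4),
            loss_fn=keras.losses.BinaryCrossentropy())
gan.fit(real_img, batch_size=32, epochs=10)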
The WGAN critic outputs a score in (−∞, ∞) and tries to maximize the gap
between the scores it assigns to real images and to fake images. The
Wasserstein loss can grow arbitrarily large, which destabilizes neural network
training. To counter this, we require the critic D to be a 1-Lipschitz
continuous function:

|D(x1) − D(x2)| ≤ ‖x1 − x2‖ for any two images x1, x2.
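A hedged sketch of the Wasserstein losses, using weight clipping as in the
original WGAN paper to (crudely) enforce the Lipschitz constraint; the clip
value c = 0.01 is the paper's default, and a gradient penalty (WGAN-GP) is the
more common modern alternative:

import tensorflow as tf

# critic: maximize mean score on real minus mean score on fake,
# i.e., minimize the negated difference
def critic_loss(real_scores, fake_scores):
    return tf.reduce_mean(fake_scores) - tf.reduce_mean(real_scores)

# generator: maximize the critic's score on generated images
def generator_loss(fake_scores):
    return -tf.reduce_mean(fake_scores)

# after each critic update, clip every weight to [-c, c] so the
# critic stays (approximately) 1-Lipschitz
def clip_critic_weights(critic, c=0.01):
    for w in critic.trainable_weights:
        w.assign(tf.clip_by_value(w, -c, c))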
Diffusion
If we normalize the original image x0 to have zero mean and unit variance, xT
will approximate a standard Gaussian distribution N(0, I) for large enough T.
Each forward step mixes the previous image with fresh Gaussian noise
εt−1 ∼ N(0, I):

xt ≡ √αt · xt−1 + √(1 − αt) · εt−1

Given a diffusion schedule {ᾱt}, where ᾱt ≡ α1 α2 · · · αt, we can jump from x0
to any step of the forward diffusion process:

xt ∼ N(√ᾱt · x0, (1 − ᾱt) I) ≡ q(xt | x0)
The cosine schedule sets

ᾱt = cos²(t/T · π/2),    1 − ᾱt = sin²(t/T · π/2),

so the signal rate is √ᾱt = cos(t/T · π/2) and the noise rate is
√(1 − ᾱt) = sin(t/T · π/2):
import math
import tensorflow as tf

def cosine_diffusion_schedule(diffusion_times):
    # diffusion_times in [0, 1]; signal rate = sqrt(alpha_bar)
    signal_rates = tf.cos(diffusion_times * math.pi / 2)
    noise_rates = tf.sin(diffusion_times * math.pi / 2)
    return noise_rates, signal_rates

def linear_diffusion_schedule(diffusion_times):
    # per-step alphas fall linearly from 0.9999 to 0.98
    alphas = 0.9999 - diffusion_times * 0.0199
    alpha_bars = tf.math.cumprod(alphas)
    # same return order as the cosine schedule: (noise, signal)
    return tf.sqrt(1 - alpha_bars), tf.sqrt(alpha_bars)
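Since cos² + sin² = 1, the signal and noise rates preserve unit variance at
every diffusion time. A quick sanity check of the cosine schedule:

t = tf.linspace(0.0, 1.0, 5)
noise_rates, signal_rates = cosine_diffusion_schedule(t)
print(signal_rates ** 2 + noise_rates ** 2)  # ≈ 1 everywhere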
Unlike the VAE encoder, the forward diffusion process has no trainable
parameters; like the VAE decoder, the reverse diffusion process aims to
transform random input into meaningful output using a network.
def train_step(self, images):
    # sample noise and diffusion times, then build the noisy inputs
    # (these opening lines are reconstructed; the slide begins at the tape)
    noises = tf.random.normal(shape=tf.shape(images))
    diffusion_times = tf.random.uniform(
        shape=(tf.shape(images)[0], 1, 1, 1))
    noise_rates, signal_rates = cosine_diffusion_schedule(diffusion_times)
    noisy_images = signal_rates * images + noise_rates * noises

    with tf.GradientTape() as tape:
        # denoise() passes the noisy images through the network
        pred_noises, pred_images = self.denoise(
            noisy_images, noise_rates, signal_rates, training=True)
        # train against the true noise, not the image itself
        noise_loss = self.loss(noises, pred_noises)
    gradients = tape.gradient(noise_loss, self.network.trainable_weights)
    self.optimizer.apply_gradients(
        zip(gradients, self.network.trainable_weights))
    self.noise_loss_tracker.update_state(noise_loss)

    # the diffusion model maintains 2 networks; the exponential moving
    # average is more robust for generation than the actively trained one
    for w, ema_w in zip(self.network.weights, self.ema_network.weights):
        ema_w.assign(EMA * ema_w + (1 - EMA) * w)
    return {m.name: m.result() for m in self.metrics}
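The generation loop is not on the extracted slides; below is a hedged,
DDIM-style sketch that walks pure noise back to an image using the same
denoise() and schedule, run with training=False (in practice on the EMA
network):

import tensorflow as tf

def reverse_diffusion(model, initial_noise, diffusion_steps):
    # walk from pure noise at t = 1 back toward t = 0
    step_size = 1.0 / diffusion_steps
    num_images = initial_noise.shape[0]
    current_images = initial_noise
    for step in range(diffusion_steps):
        diffusion_times = tf.ones((num_images, 1, 1, 1)) - step * step_size
        noise_rates, signal_rates = cosine_diffusion_schedule(diffusion_times)
        # predict the noise and the implied clean image
        pred_noises, pred_images = model.denoise(
            current_images, noise_rates, signal_rates, training=False)
        # re-noise the predicted image to the next, smaller diffusion time
        next_times = diffusion_times - step_size
        next_noise, next_signal = cosine_diffusion_schedule(next_times)
        current_images = next_signal * pred_images + next_noise * pred_noises
    return pred_images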
Multimodal Learning
CLIP
CLIP, or Contrastive Language-Image Pre-training (OpenAI, 2021), is a "neural
network that efficiently learns visual concepts from natural language
supervision." It is trained on 400 million text-image pairs by maximizing the
cosine similarity between the embeddings of matching text-image pairs and
minimizing it for mismatched pairs.
Both the text encoder and the image encoder are Transformers; the Vision
Transformer (ViT) applies a Transformer encoder to a sequence of
nonoverlapping image patches with positional embeddings.
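A hedged sketch of CLIP's symmetric contrastive objective for a batch of n
matching (image, text) embedding pairs; the names img_emb and txt_emb and the
temperature value are illustrative assumptions:

import tensorflow as tf

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # cosine similarity = dot product of L2-normalized embeddings
    img_emb = tf.math.l2_normalize(img_emb, axis=1)
    txt_emb = tf.math.l2_normalize(txt_emb, axis=1)
    logits = tf.matmul(img_emb, txt_emb, transpose_b=True) / temperature
    # matching pairs sit on the diagonal of the n x n similarity matrix
    labels = tf.range(tf.shape(logits)[0])
    loss_i = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)                 # image -> text
    loss_t = tf.keras.losses.sparse_categorical_crossentropy(
        labels, tf.transpose(logits), from_logits=True)   # text -> image
    return (tf.reduce_mean(loss_i) + tf.reduce_mean(loss_t)) / 2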
DALL·E 2
The CLIP training process learns a joint representation space for text and
images; the text-to-image generation process then runs in two stages. (In the
original figure, CLIP training is depicted above a dotted line and generation
below it.)
1. A CLIP text embedding is first fed to a diffusion prior to produce an
image embedding.
2. The image embedding is used to condition a diffusion decoder, which
produces the final image.
The CLIP model is frozen during training of the prior and the decoder.
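A hedged sketch of the two-stage pipeline; the three components are passed in
as assumed pre-trained models (the names are illustrative, not OpenAI's API):

def dalle2_generate(caption, clip_text_encoder, diffusion_prior,
                    diffusion_decoder):
    txt_emb = clip_text_encoder(caption)  # frozen CLIP text encoder
    img_emb = diffusion_prior(txt_emb)    # stage 1: text emb -> image emb
    return diffusion_decoder(img_emb)     # stage 2: image emb -> pixels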
Stable Diffusion
Stable Diffusion (Stability AI, 2022) wraps the diffusion model within an
autoencoder, so the diffusion process operates on a latent space of the image
rather than the pixel space.
Compared with U-Net models that operate in pixel space, the denoising U-Net
of Stable Diffusion is lighter.
The autoencoder does the heavy lifting of encoding image details into
latent space and decoding the latent space back to the pixel space.
The diffusion model works purely in a latent, conceptual space.
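A hedged sketch of the latent-diffusion idea (not Stability AI's actual code):
the reverse diffusion loop runs on low-dimensional latents, and the
heavyweight decoder is applied once at the end. decoder and denoise_step are
assumed pre-trained components:

import tensorflow as tf

def latent_diffusion_generate(decoder, denoise_step, latent_shape, steps=50):
    # start from pure Gaussian noise in the latent space
    z = tf.random.normal(shape=(1, *latent_shape))
    for t in reversed(range(steps)):
        z = denoise_step(z, t)  # reverse diffusion on latents (cheap)
    return decoder(z)           # one decoder pass: latents -> pixels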