3rd Unit Notes
A Generative Adversarial Network (GAN) is a type of generative model that learns to create
data samples (such as images) that mimic a target distribution, for example photos of human
faces, by playing a game between two competing neural networks: a generator and a
discriminator. GANs were introduced by Ian Goodfellow and his colleagues in 2014.
The Idea behind GAN
The adventures of Gene and Di hunting elusive nocturnal ganimals (we will explore this
example later in this chapter) are a metaphor for one of the most important deep learning
advancements of recent years: generative adversarial networks.
A GAN consists of two neural networks that are trained simultaneously in a game-theoretic
setup:
1. Generator (G):
o Takes random noise as input and generates synthetic data (e.g., images, audio).
o Its goal is to produce data that looks like it came from the real dataset.
2. Discriminator (D):
o Takes real and generated data as input and classifies them as real or fake.
o Its goal is to distinguish between real data and the fake data produced by the
generator.
Training Process
1. Initialize G and D.
2. Train D:
o Use real data labeled as “real.”
o Use G's output labeled as “fake.”
o Update D to improve classification accuracy.
3. Train G:
o Generate data.
o Pass it to D.
o Use D’s feedback to adjust G to produce more realistic data.
4. Repeat: Alternately update D and G to improve both.
Examples of the inputs and outputs to the two networks are shown in the accompanying figure.
Ganimals is a project that uses Generative Adversarial Networks (GANs) to generate hybrid
animal images by blending features from two or more real animals—like a mix between a
dog and a lion, or a cat and a horse.
Applications of GANs
Image synthesis (e.g., generating human faces)
Image-to-image translation (e.g., turning sketches into photos)
Super-resolution (enhancing image quality)
Data augmentation
Style transfer
Art and music generation
1. Load and Preprocess the Data
Code:
import numpy as np

# Load the dataset of 28x28 grayscale images stored as a NumPy array
data = np.load('cat.npy')
data = (data.astype(np.float32) - 127.5) / 127.5  # Normalize pixel values to [-1, 1]
data = data.reshape(-1, 28, 28, 1)                # Reshape for CNN input
2. Define the Generator
• Input: random noise vector (e.g., 100-dimensional).
• Output: generated 28×28 grayscale image with values in [-1, 1].
Code:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU, Reshape

def build_generator():
    model = Sequential()
    model.add(Dense(128, input_dim=100))          # Project the 100-dim noise vector
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(28*28, activation='tanh'))    # Pixel values in [-1, 1]
    model.add(Reshape((28, 28, 1)))               # Reshape to image format
    return model
3. Define the Discriminator
Code:
from tensorflow.keras.layers import Flatten

def build_discriminator():
    model = Sequential()
    model.add(Flatten(input_shape=(28, 28, 1)))   # Flatten the image to a vector
    model.add(Dense(128))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(1, activation='sigmoid'))     # Probability that the input is real
    return model
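The training loop in the next step uses a combined gan model that is not defined elsewhere in
these notes. Below is a minimal sketch of one way to build and compile it from the two models
above, assuming binary cross-entropy losses and the Adam optimizer; the hyperparameters are
illustrative.
Code:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.optimizers import Adam

generator = build_generator()
discriminator = build_discriminator()
discriminator.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5))

# Freeze the discriminator's weights while the generator is trained through the
# combined model, so only the generator is updated by gan.train_on_batch
discriminator.trainable = False
z = Input(shape=(100,))
validity = discriminator(generator(z))
gan = Model(z, validity)
gan.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5))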
4. Train the GAN
Code:
for epoch in range(epochs):
    # Get a batch of real images
    real_imgs = get_real_images(batch_size)

    # Generate fake images from random noise
    noise = np.random.normal(0, 1, (batch_size, 100))
    fake_imgs = generator.predict(noise)

    # Train the discriminator: real images labeled 1, fake images labeled 0
    d_loss_real = discriminator.train_on_batch(real_imgs, np.ones((batch_size, 1)))
    d_loss_fake = discriminator.train_on_batch(fake_imgs, np.zeros((batch_size, 1)))

    # Train the generator (via the combined GAN model) to make D output "real"
    g_loss = gan.train_on_batch(noise, np.ones((batch_size, 1)))
CycleGAN
A CycleGAN is a type of Generative Adversarial Network (GAN) used for unpaired image-
to-image translation, such as turning a photo into a Monet-style painting and vice versa,
without needing one-to-one image pairs for training.
The CycleGAN paper was released only a few months after the pix2pix paper and shows how
it is possible to train a model to tackle problems where we do not have pairs of images in the
source and target domains. The accompanying figure shows the difference between the paired
and unpaired datasets of pix2pix and CycleGAN, respectively.
Two Generators:
o G_AB: Translates images from domain A to domain B.
o G_BA: Translates images from domain B to domain A.
Two Discriminators:
o D_A: Tries to distinguish real images in domain A from fake ones generated
by G_BA.
o D_B: Tries to distinguish real images in domain B from fake ones generated
by G_AB.
Diagram of the four CycleGAN models (the two generators and the two discriminators)
Generator (U-Net style)
Each downsampling/upsampling block in the generator is built from:
o Conv2D
o InstanceNormalization
o ReLU activation
o Optional Dropout
Skip Connections
At each level, corresponding layers in the encoder and decoder are connected via skip
connections.
These are implemented using Concatenate() layers in Keras.
They preserve spatial information that would otherwise be lost during downsampling.
Output Layer
The final layer is a Conv2D with tanh activation.
Produces a 3-channel RGB image with values scaled to [-1, 1].
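The sketch below shows one possible down/up-sampling block pair with a Concatenate() skip
connection and a tanh output layer. It assumes TensorFlow/Keras with InstanceNormalization
from the tensorflow_addons package; the filter counts, network depth, and image size are
illustrative choices, not the book's exact values.
Code:
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_addons as tfa  # provides InstanceNormalization

def downsample(x, filters):
    # Conv2D -> InstanceNormalization -> ReLU, halving the spatial resolution
    x = layers.Conv2D(filters, kernel_size=3, strides=2, padding='same')(x)
    x = tfa.layers.InstanceNormalization()(x)
    return layers.ReLU()(x)

def upsample(x, skip, filters, dropout=False):
    # Upsample, then concatenate the matching encoder output (the skip connection)
    x = layers.UpSampling2D()(x)
    x = layers.Conv2D(filters, kernel_size=3, padding='same')(x)
    x = tfa.layers.InstanceNormalization()(x)
    x = layers.ReLU()(x)
    if dropout:
        x = layers.Dropout(0.5)(x)
    return layers.Concatenate()([x, skip])

def build_unet_generator(img_shape=(128, 128, 3)):
    inp = layers.Input(shape=img_shape)
    d1 = downsample(inp, 32)
    d2 = downsample(d1, 64)
    u1 = upsample(d2, d1, 32)
    u2 = layers.UpSampling2D()(u1)
    # Final layer: Conv2D with tanh, producing a 3-channel image in [-1, 1]
    out = layers.Conv2D(3, kernel_size=3, padding='same', activation='tanh')(u2)
    return tf.keras.Model(inp, out)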
Discriminator
The discriminators that we have seen so far have output a single number: the predicted
probability that the input image is “real.” The discriminators in the CycleGAN that we will be
building output an 8 × 8 single-channel tensor rather than a single number.
The reason for this is that the CycleGAN inherits its discriminator architecture from a model
known as a PatchGAN, where the discriminator divides the image into square overlapping
“patches” and guesses if each patch is real or fake, rather than predicting for the image as a
whole. Therefore, the output of the discriminator is a tensor that contains the predicted
probability for each patch, rather than just a single number.
Purpose:
The discriminator’s job is to distinguish real images from fake images produced by
the generator.
In CycleGAN, you have two discriminators — one for each domain (e.g., apples and
oranges).
PatchGAN Discriminator:
Instead of classifying the whole image as real or fake, the discriminator classifies each
patch (typically 70×70 pixels) in the image.
This patch-based approach focuses on high-frequency local structures and texture
details rather than the entire image.
The output is a matrix (a “patch” of probabilities) indicating real/fake for each patch,
not just a single scalar.
Architecture:
A series of convolutional layers that gradually reduce the spatial size while increasing
the number of channels.
Uses LeakyReLU activations (usually with slope 0.2).
Uses Instance Normalization after conv layers except the first.
The final layer outputs a 1-channel feature map (patch map) without a sigmoid
activation — because CycleGAN uses a least squares GAN loss, which doesn’t
require sigmoid at the output.
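Below is a minimal sketch of such a PatchGAN discriminator, assuming TensorFlow/Keras with
InstanceNormalization from tensorflow_addons; the filter counts are illustrative. With a
128×128 input, four stride-2 convolutions produce the 8×8 single-channel patch map described
earlier.
Code:
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_addons as tfa  # provides InstanceNormalization

def build_patchgan_discriminator(img_shape=(128, 128, 3)):
    inp = layers.Input(shape=img_shape)

    def conv_block(x, filters, norm=True):
        # Strided convolution halves the spatial size while increasing channels
        x = layers.Conv2D(filters, kernel_size=4, strides=2, padding='same')(x)
        if norm:  # no normalization after the first convolution
            x = tfa.layers.InstanceNormalization()(x)
        return layers.LeakyReLU(0.2)(x)

    x = conv_block(inp, 32, norm=False)
    x = conv_block(x, 64)
    x = conv_block(x, 128)
    x = conv_block(x, 256)

    # 1-channel patch map; no sigmoid, since a least squares GAN loss (mse) is used
    patch_out = layers.Conv2D(1, kernel_size=4, padding='same')(x)
    return tf.keras.Model(inp, patch_out)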
Why InstanceNorm?
InstanceNorm helps the discriminator generalize better to style and texture variations,
which is important for style transfer tasks like CycleGAN.
Output:
The discriminator outputs a grid of values, each representing the likelihood of the
corresponding patch being real or fake.
This helps the model learn local realism rather than just global realism.
However, we cannot compile the generators directly, as we do not have paired images in our
dataset. Instead, we judge the generators simultaneously on three criteria:
1. Validity. Do the images produced by each generator fool the relevant discriminator? (For
example, does output from g_BA fool d_A and does output from g_AB fool d_B?)
2. Reconstruction. If we apply the two generators one after the other (in both directions), do
we return to the original image? The CycleGAN gets its name from this cyclic reconstruction
criterion.
3. Identity. If we apply each generator to images from its own target domain, does the image
remain unchanged?
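The sketch below shows one way the combined generator-training model might be compiled
against these three criteria. It assumes generator models g_AB and g_BA, discriminators d_A
and d_B, and an image shape img_shape are already built (for example from the sketches
above); the loss choices and weights are illustrative assumptions.
Code:
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative weights: the cycle (reconstruction) and identity terms are
# usually weighted more heavily than the adversarial (validity) term
lambda_cycle, lambda_id = 10.0, 5.0

img_A = layers.Input(shape=img_shape)
img_B = layers.Input(shape=img_shape)

fake_B = g_AB(img_A)          # translate A -> B
fake_A = g_BA(img_B)          # translate B -> A
reconstr_A = g_BA(fake_B)     # cycle: A -> B -> A
reconstr_B = g_AB(fake_A)     # cycle: B -> A -> B
img_A_id = g_BA(img_A)        # identity: A fed to the B->A generator
img_B_id = g_AB(img_B)        # identity: B fed to the A->B generator

# Freeze the discriminators while the generators are trained
d_A.trainable = False
d_B.trainable = False
valid_A = d_A(fake_A)
valid_B = d_B(fake_B)

combined = tf.keras.Model(
    inputs=[img_A, img_B],
    outputs=[valid_A, valid_B, reconstr_A, reconstr_B, img_A_id, img_B_id])
combined.compile(
    loss=['mse', 'mse', 'mae', 'mae', 'mae', 'mae'],   # validity, reconstruction, identity
    loss_weights=[1, 1, lambda_cycle, lambda_cycle, lambda_id, lambda_id],
    optimizer='adam')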
Neural Style Transfer
The idea works on the premise that we want to minimize a loss function that is a weighted
sum of three distinct parts:
Content loss
We would like the combined image to contain the same content as the base image.
Style loss
We would like the combined image to have the same general style as the style image.
Total variance loss
We would like the combined image to appear smooth rather than pixelated.
We minimize this combined loss via gradient descent: we update each pixel value by an
amount proportional to the negative gradient of the loss function, over many iterations. This
way, the loss gradually decreases with each iteration and we end up with an image that merges
the content of one image with the style of another.
Content loss
Content Loss is a key component in Neural Style Transfer (NST), which is a technique used
to blend the content of one image (e.g., a photo) with the style of another (e.g., a painting).
Two images that contain similar-looking scenes (e.g., a photo of a row of buildings and
another photo of the same buildings taken in different light from a different angle) should
have a smaller loss than two images that contain completely different scenes. Simply
comparing the pixel values of the two images won’t do, because even in two distinct images
of the same scene, we wouldn’t expect individual pixel values to be similar. We don’t really
want the content loss to care about the values of individual pixels; we’d rather that it scores
images based on the presence and approximate position of higher-level features such as
buildings, sky, or river.
To extract these high-level features, we use a pretrained VGG19 network, a 19-layer CNN
trained on the ImageNet dataset, which contains over a million images across 1,000
categories. VGG19 naturally learns a hierarchy of visual features, making it ideal for
capturing image content at deeper layers.
The content loss function is then computed as the mean squared error (MSE) between the
feature representations of the base image and the generated image at a chosen deep layer
(e.g., block5_conv2). This guides the training process to preserve the content structure while
allowing stylistic transformation.
The content loss function can be computed using the VGG19 model as follows.
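A minimal sketch, assuming TensorFlow/Keras with the pretrained VGG19 from
tf.keras.applications and block5_conv2 as the content layer, as described above; variable
names are illustrative.
Code:
import tensorflow as tf
from tensorflow.keras.applications import VGG19
from tensorflow.keras.applications.vgg19 import preprocess_input

# Pretrained VGG19 (ImageNet weights), used only as a fixed feature extractor
vgg = VGG19(weights='imagenet', include_top=False)
content_extractor = tf.keras.Model(
    inputs=vgg.input, outputs=vgg.get_layer('block5_conv2').output)
content_extractor.trainable = False

def content_loss(base_img, combined_img):
    # Mean squared error between the deep feature maps of the two images
    base_features = content_extractor(preprocess_input(base_img))
    combined_features = content_extractor(preprocess_input(combined_img))
    return tf.reduce_mean(tf.square(base_features - combined_features))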
Style Loss
Style Loss is a key component in Neural Style Transfer (NST), which aims to combine the
style of one image (like a painting) with the content of another (like a photograph).
While content loss ensures the generated image preserves the layout of the content image,
style loss ensures it reflects the artistic style (textures, colors, brushstrokes) of the style
reference image.
How Style Loss Works (Conceptual Basis)
1. Lower CNN layers capture textures and fine details (edges, strokes), while higher
layers capture more abstract style components (color composition, broader patterns).
2. The style of an image can be expressed as feature correlations — that is, how different
filter activations relate to each other at a given layer.
3. These correlations are represented using a Gram matrix, which is a square matrix
capturing the inner products between feature maps.
Implementation Details
The book uses VGG19 pretrained on ImageNet to extract the style features.
Style layers typically include: block1_conv1, block2_conv1, block3_conv1,
block4_conv1, and block5_conv1.
The Gram matrix is computed using the dot product of the feature maps reshaped into
a 2D matrix.
The generated image is trained to minimize this style loss in conjunction with the
content loss.
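A minimal sketch of the Gram matrix and style loss, continuing from the content loss sketch
above (it reuses vgg, preprocess_input, and the tf import) and using the style layers listed
here; the simple averaging over layers is an illustrative choice.
Code:
style_layer_names = ['block1_conv1', 'block2_conv1', 'block3_conv1',
                     'block4_conv1', 'block5_conv1']
style_extractor = tf.keras.Model(
    inputs=vgg.input,
    outputs=[vgg.get_layer(name).output for name in style_layer_names])

def gram_matrix(feature_map):
    # Reshape (H, W, C) features to (H*W, C) and take inner products between channels
    channels = tf.shape(feature_map)[-1]
    flat = tf.reshape(feature_map, [-1, channels])
    gram = tf.matmul(flat, flat, transpose_a=True)
    return gram / tf.cast(tf.shape(flat)[0], tf.float32)

def style_loss(style_img, combined_img):
    style_features = style_extractor(preprocess_input(style_img))
    combined_features = style_extractor(preprocess_input(combined_img))
    loss = 0.0
    for s, c in zip(style_features, combined_features):
        # Compare feature correlations (Gram matrices) layer by layer
        loss += tf.reduce_mean(tf.square(gram_matrix(s[0]) - gram_matrix(c[0])))
    return loss / len(style_layer_names)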
Total Variance Loss
The total variance loss is simply a measure of noise in the combined image. To judge how
noisy an image is, we can shift it one pixel to the right and calculate the sum of the squared
difference between the translated and original images. For balance, we can also do the same
procedure but shift the image one pixel down. The sum of these two terms is the total
variance loss.
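A minimal sketch of the total variance loss following the description above, assuming img is a
batched image tensor of shape (batch, height, width, channels).
Code:
def total_variance_loss(img):
    # Squared difference between the image and a copy shifted one pixel down
    shift_down = tf.square(img[:, 1:, :, :] - img[:, :-1, :, :])
    # Squared difference between the image and a copy shifted one pixel right
    shift_right = tf.square(img[:, :, 1:, :] - img[:, :, :-1, :])
    return tf.reduce_sum(shift_down) + tf.reduce_sum(shift_right)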
Running the Neural Style Transfer
The learning process involves running gradient descent to minimize this loss function, with
respect to the pixels in the combined image.
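A minimal sketch of this optimization loop, assuming base_img and style_img are batched
float32 tensors with pixel values in [0, 255] and using the loss functions sketched above; the
loss weights, learning rate, and iteration count are illustrative assumptions.
Code:
# Illustrative loss weights; the actual values are tuning choices
content_weight, style_weight, tv_weight = 1.0, 100.0, 20.0

# Start the combined image from a copy of the base (content) image
combined = tf.Variable(tf.cast(base_img, tf.float32))
optimizer = tf.keras.optimizers.Adam(learning_rate=5.0)

for iteration in range(1000):
    with tf.GradientTape() as tape:
        loss = (content_weight * content_loss(base_img, combined)
                + style_weight * style_loss(style_img, combined)
                + tv_weight * total_variance_loss(combined))
    # Update the pixels of the combined image along the negative gradient of the loss
    grads = tape.gradient(loss, combined)
    optimizer.apply_gradients([(grads, combined)])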