
3rd Unit Notes

Generative Adversarial Networks (GANs) consist of two competing neural networks, a generator that creates synthetic data and a discriminator that evaluates its authenticity. GANs can be applied in various domains, including image synthesis and style transfer, and can be trained using unpaired datasets, as demonstrated by CycleGANs. The document also outlines the process of building a GAN, including data preparation, defining the generator and discriminator, and training the model.


Introduction to GANs (Generative Adversarial Networks)

A GAN is a type of generative model that learns to create data samples (such as images) that
mimic a target distribution, for example photos of human faces, by playing a game between
two competing neural networks: a generator and a discriminator. GANs were introduced by
Ian Goodfellow and his colleagues in 2014.
The Idea behind GAN
The adventures of Gene and Di hunting elusive nocturnal ganimals (we will explore this
example later in this chapter) are a metaphor for one of the most important deep learning
advancements of recent years: generative adversarial networks.
A GAN consists of two neural networks that are trained simultaneously in a game-theoretic
setup:
1. Generator (G):
o Takes random noise as input and generates synthetic data (e.g., images, audio).
o Its goal is to produce data that looks like it came from the real dataset.
2. Discriminator (D):
o Takes real and generated data as input and classifies them as real or fake.
o Its goal is to distinguish between real data and the fake data produced by the
generator.
Training Process
1. Initialize G and D.
2. Train D:
o Use real data labeled as “real.”
o Use G's output labeled as “fake.”
o Update D to improve classification accuracy.
3. Train G:
o Generate data.
o Pass it to D.
o Use D’s feedback to adjust G to produce more realistic data.
4. Repeat: Alternately update D and G to improve both.
Examples of the inputs and outputs of the two networks are shown in the figure above.
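Formally, this alternating training corresponds to the minimax objective from the original 2014 paper, in which the discriminator tries to maximize the value function V and the generator tries to minimize it:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

Here x is a real sample, z is the random noise fed to the generator, and D(·) is the discriminator's predicted probability that its input is real.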

Ganimals – A Creative Application of GANs

Ganimals is a project that uses Generative Adversarial Networks (GANs) to generate hybrid
animal images by blending features from two or more real animals—like a mix between a
dog and a lion, or a cat and a horse.

How Ganimals Works

1. Training on Real Animal Images
 The GAN is first trained on a large collection of real animal photos: dogs, lions, cats, horses, etc.
 The Generator learns to create realistic animal images from random input vectors.
 The Discriminator learns to distinguish between real animal photos and fake images created by the Generator.
2. Learning the Latent Space
 During training, the GAN learns a latent space: a mathematical space where each point (vector) corresponds to a meaningful animal image.
 Nearby points in this space represent similar images.
 Different directions encode different features (e.g., “more fur,” “longer snout,” “ear shape”).
3. Blending Animals by Vector Arithmetic
 To create a hybrid animal (a “ganimal”), you take the latent vectors representing two or more animals and combine them using weighted addition.

4. Generating Hybrid Images
 The Generator takes this blended vector z_ganimal and produces an image that visually mixes features from both animals (see the code sketch after this list).
 The result might have a dog’s face shape with a lion’s mane, or any combination depending on the weights.

5. Improvement Through Feedback
 The Discriminator helps improve the Generator by scoring how realistic the hybrids look during training, pushing the Generator to create better blends.
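The code sketch below illustrates steps 3 and 4 (blending latent vectors and generating a hybrid). It is a minimal sketch, not code from the original project: latent_dim, the weight w, and the generator model are all assumptions.

Code (sketch):

import numpy as np

latent_dim = 100  # assumed latent dimensionality

# In practice z_dog and z_lion would be latent vectors known to produce a dog
# and a lion image; random vectors stand in for them in this sketch.
z_dog = np.random.normal(0, 1, (1, latent_dim))
z_lion = np.random.normal(0, 1, (1, latent_dim))

# Weighted addition: 60% dog, 40% lion.
w = 0.6
z_ganimal = w * z_dog + (1 - w) * z_lion

# The (assumed trained) generator maps the blended vector to a hybrid image.
hybrid_image = generator.predict(z_ganimal)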

Applications of GANs
 Image synthesis (e.g., generating human faces)
 Image-to-image translation (e.g., turning sketches into photos)
 Super-resolution (enhancing image quality)
 Data augmentation
 Style transfer
 Art and music generation

Your First GAN


First, you’ll need to download the training data. We’ll be using the Quick, Draw! dataset
from Google. This is a crowdsourced collection of 28 × 28–pixel grayscale doodles, labeled
by subject. The dataset was collected as part of an online game that challenged players to
draw a picture of an object or concept, while a neural network tries to guess the subject of the
doodle. It’s a really useful and fun dataset for learning the fundamentals of deep learning.
Steps to Build Your First GAN
Let’s build our first GAN by using the cat class from the Quick, Draw! dataset from Google.

1. Prepare the Data


• Download the `.npy` file for your chosen class (e.g., cat).
• Normalize pixel values to [-1, 1], which helps stabilize GAN training.
• Reshape data for CNN input (28×28 grayscale images with 1 channel).

Code:

import numpy as np

data = np.load('cat.npy')
data = (data.astype(np.float32) - 127.5) / 127.5 # Normalize to [-1, 1]
data = data.reshape(-1, 28, 28, 1) # Shape for CNN input
2. Define the Generator
• Input: random noise vector (e.g., 100-dimensional).
• Output: generated 28×28 grayscale image with values in [-1, 1].

Code:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU, Reshape

def build_generator():
    model = Sequential()
    model.add(Dense(128, input_dim=100))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(28*28, activation='tanh'))
    model.add(Reshape((28, 28, 1)))
    return model

3. Define the Discriminator


• Input: 28×28 grayscale image.
• Output: probability that the image is real (close to 1) or fake (close to 0).

Code:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, LeakyReLU

def build_discriminator():
    model = Sequential()
    model.add(Flatten(input_shape=(28, 28, 1)))
    model.add(Dense(128))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(1, activation='sigmoid'))
    return model

4. Train the GAN


• Alternate training the Discriminator and Generator:
- Train Discriminator on real images (label=1) and fake images (label=0).
- Train Generator to fool the Discriminator (train with label=1).

Code:
for epoch in range(epochs):
    # Get a batch of real images
    real_imgs = get_real_images(batch_size)

    # Generate fake images from random noise
    noise = np.random.normal(0, 1, (batch_size, 100))
    fake_imgs = generator.predict(noise)

    # Train Discriminator
    d_loss_real = discriminator.train_on_batch(real_imgs, np.ones((batch_size, 1)))
    d_loss_fake = discriminator.train_on_batch(fake_imgs, np.zeros((batch_size, 1)))

    # Train Generator (via GAN model)
    g_loss = gan.train_on_batch(noise, np.ones((batch_size, 1)))

    # Optional: save or display generated images periodically
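The loop above assumes that generator and discriminator were built with the functions defined earlier and wired into a combined gan model in which the discriminator is frozen, so that training the GAN updates only the generator. It also assumes a get_real_images helper for sampling batches. A minimal sketch of this wiring (the optimizer settings are assumptions):

Code (sketch):

from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

generator = build_generator()
discriminator = build_discriminator()
discriminator.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5))

# Freeze the discriminator inside the combined model so that gan.train_on_batch
# only updates the generator's weights (the discriminator was compiled first,
# so its own train_on_batch calls still update it).
discriminator.trainable = False
gan = Sequential([generator, discriminator])
gan.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5))

def get_real_images(batch_size):
    # Sample a random batch from the prepared Quick, Draw! data array.
    idx = np.random.randint(0, data.shape[0], batch_size)
    return data[idx]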

Cycle GAN
A CycleGAN is a type of Generative Adversarial Network (GAN) used for unpaired image-
to-image translation, such as turning a photo into a Monet-style painting and vice versa,
without needing one-to-one image pairs for training.
The CycleGAN paper was released only a few months after the pix2pix paper and shows how
it is possible to train a model to tackle problems where we do not have pairs of images in the
source and target domains. The figure below shows the difference between the paired and
unpaired datasets of pix2pix and CycleGAN, respectively.

pix2pix dataset and domain mapping example


Cycle GAN and pix2pix are both generative models designed for image-to-image translation,
but they differ significantly in their approach and use cases. The most fundamental distinction
is in the type of data each model requires. pix2pix needs paired training data, meaning that
for every image in the source domain, there must be a corresponding image in the target
domain. This makes pix2pix ideal for tasks like edge-to-photo conversion, satellite maps to
street maps, or sketches to real images, where a clear alignment between the input and output
is available. In contrast, Cycle GAN excels in situations where such paired data is unavailable
or difficult to collect. It uses unpaired datasets, allowing it to learn transformations between
domains like apples and oranges, or photographs and Monet paintings, where exact pixel-
level correspondences do not exist.

Cycle GAN Overview


A Cycle GAN is a framework designed for unpaired image-to-image translation. Unlike
models such as pix2pix, which require paired images from source and target domains,
CycleGAN learns to map between two image domains without requiring paired examples.
Key Components:
 Two Generators:
o G_AB: Translates images from domain A (e.g., apples) to domain B (e.g.,
oranges).
o G_BA: Translates images from domain B to domain A.

 Two Discriminators:
o D_A: Tries to distinguish real images in domain A from fake ones generated
by G_BA.
o D_B: Tries to distinguish real images in domain B from fake ones generated
by G_AB.
Diagram of the four Cycle GAN models

The Generators (U-Net)


The above figure depicts the U-Net architecture used in the CycleGAN generators. It is a
symmetric encoder-decoder model with skip connections that forms a “U” shape — hence the
name U-Net.
Encoder (Downsampling Path)
 The left half of the U-Net compresses the image spatially using convolutional layers.
 As we move down the U, the width and height decrease while the number of channels
increases.
 This part learns what is in the image — high-level abstract features.
 It uses:
o Conv2D layers with stride 2

o InstanceNormalization

o ReLU activation

Decoder (Upsampling Path)


 The right half of the U-Net expands the compressed feature maps back to the original
resolution.
 Spatial dimensions increase while channels decrease.
 This path recovers where things are in the image.
 It uses:
o UpSampling2D

o Conv2D

o InstanceNormalization

o ReLU

o Optional Dropout

Skip Connections
 At each level, corresponding layers in the encoder and decoder are connected via skip
connections.
 These are implemented using Concatenate() layers in Keras.
 They preserve spatial information that would otherwise be lost during downsampling.
Output Layer
 The final layer is a Conv2D with tanh activation.
 Produces a 3-channel RGB image with values scaled to [-1, 1].
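As a rough sketch (not the book's exact code), one downsampling block and one upsampling block in this style could look as follows; filter counts are illustrative, and GroupNormalization with groups equal to the number of channels is used here as an instance-normalization equivalent available in recent Keras versions:

Code (sketch):

from tensorflow.keras import layers

def downsample(x, filters):
    # Stride-2 convolution halves the spatial size while increasing channels.
    x = layers.Conv2D(filters, kernel_size=3, strides=2, padding='same')(x)
    x = layers.GroupNormalization(groups=filters)(x)  # acts as instance normalization
    return layers.ReLU()(x)

def upsample(x, skip, filters, dropout=0.0):
    # Upsample, convolve, then concatenate the matching encoder feature map
    # (the skip connection that preserves spatial detail).
    x = layers.UpSampling2D(size=2)(x)
    x = layers.Conv2D(filters, kernel_size=3, padding='same')(x)
    x = layers.GroupNormalization(groups=filters)(x)
    x = layers.ReLU()(x)
    if dropout:
        x = layers.Dropout(dropout)(x)
    return layers.Concatenate()([x, skip])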

Discriminator
The discriminators that we have seen so far have output a single number: the predicted
probability that the input image is “real.” The discriminators in the CycleGAN that we will be
building output an 8 × 8 single-channel tensor rather than a single number.
The reason for this is that the CycleGAN inherits its discriminator architecture from a model
known as a PatchGAN, where the discriminator divides the image into square overlapping
“patches” and guesses if each patch is real or fake, rather than predicting for the image as a
whole. Therefore, the output of the discriminator is a tensor that contains the predicted
probability for each patch, rather than just a single number.
 Purpose:
 The discriminator’s job is to distinguish real images from fake images produced by
the generator.
 In CycleGAN, you have two discriminators — one for each domain (e.g., apples and
oranges).
 PatchGAN Discriminator:
 Instead of classifying the whole image as real or fake, the discriminator classifies each
patch (typically 70×70 pixels) in the image.
 This patch-based approach focuses on high-frequency local structures and texture
details rather than the entire image.
 The output is a matrix (a “patch” of probabilities) indicating real/fake for each patch,
not just a single scalar.
 Architecture:
 A series of convolutional layers that gradually reduce the spatial size while increasing
the number of channels.
 Uses LeakyReLU activations (usually with slope 0.2).
 Uses Instance Normalization after conv layers except the first.
 The final layer outputs a 1-channel feature map (patch map) without a sigmoid
activation — because CycleGAN uses a least squares GAN loss, which doesn’t
require sigmoid at the output.
 Why InstanceNorm?
 InstanceNorm helps the discriminator generalize better to style and texture variations,
which is important for style transfer tasks like CycleGAN.
 Output:
 The discriminator outputs a grid of values, each representing the likelihood of the
corresponding patch being real or fake.
 This helps the model learn local realism rather than just global realism.
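A minimal sketch of such a PatchGAN discriminator (the input size and filter counts are assumptions, chosen so that a 128 × 128 input yields an 8 × 8 single-channel patch map; GroupNormalization with groups equal to the channel count again stands in for instance normalization):

Code (sketch):

from tensorflow.keras import layers, models

def build_patchgan_discriminator(img_shape=(128, 128, 3)):
    inp = layers.Input(shape=img_shape)

    # First block: no normalization, as described above.
    x = layers.Conv2D(64, 4, strides=2, padding='same')(inp)
    x = layers.LeakyReLU(0.2)(x)

    # Each further stride-2 block halves the spatial size and increases channels.
    for filters in (128, 256, 512):
        x = layers.Conv2D(filters, 4, strides=2, padding='same')(x)
        x = layers.GroupNormalization(groups=filters)(x)  # instance normalization
        x = layers.LeakyReLU(0.2)(x)

    # One-channel patch map, no sigmoid (least squares GAN loss).
    out = layers.Conv2D(1, 4, padding='same')(x)
    return models.Model(inp, out)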

Compiling the CycleGAN


As the CycleGAN has two generators and two discriminators, we need to compile four distinct models, as follows:
 g_AB: learns to convert an image from domain A to domain B.
 g_BA: learns to convert an image from domain B to domain A.
 d_A: learns the difference between real images from domain A and fake images generated by g_BA.
 d_B: learns the difference between real images from domain B and fake images generated by g_AB.
We can compile the two discriminators directly, as we have the inputs (images from each
domain) and outputs (binary responses: 1 if the image was from the domain or 0 if it was a
generated fake). This is shown in the example below:
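A minimal sketch of compiling the two discriminators, assuming d_A and d_B have been built (for instance with a PatchGAN builder like the sketch in the previous section); the optimizer settings are assumptions, and 'mse' corresponds to the least squares GAN loss described above:

Code (sketch):

from tensorflow.keras.optimizers import Adam

d_A.compile(loss='mse', optimizer=Adam(0.0002, 0.5), metrics=['accuracy'])
d_B.compile(loss='mse', optimizer=Adam(0.0002, 0.5), metrics=['accuracy'])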

However, we cannot compile the generators directly, as we do not have paired images in our
dataset. Instead, we judge the generators simultaneously on three criteria:
1. Validity. Do the images produced by each generator fool the relevant discriminator? (For
example, does output from g_BA fool d_A and does output from g_AB fool d_B?)
2. Reconstruction. If we apply the two generators one after the other (in both directions), do
we return to the original image? The CycleGAN gets its name from this cyclic reconstruction
criterion.
3. Identity. If we apply each generator to images from its own target domain, does the image
remain unchanged?
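Putting these three criteria together, the combined model used to train the generators can be compiled roughly as follows. This is a minimal sketch, assuming the generators g_AB, g_BA and the discriminators d_A, d_B defined above; the image shape, loss weights, and optimizer settings are assumptions rather than the book's exact values.

Code (sketch):

from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

img_shape = (128, 128, 3)              # assumed image size
img_A = Input(shape=img_shape)
img_B = Input(shape=img_shape)

fake_B = g_AB(img_A)                   # validity: should fool d_B
fake_A = g_BA(img_B)                   # validity: should fool d_A
reconstr_A = g_BA(fake_B)              # reconstruction: A -> B -> A
reconstr_B = g_AB(fake_A)              # reconstruction: B -> A -> B
img_A_id = g_BA(img_A)                 # identity: B->A generator applied to an A image
img_B_id = g_AB(img_B)                 # identity: A->B generator applied to a B image

# Freeze the discriminators while the generators are being trained.
d_A.trainable = False
d_B.trainable = False
valid_A = d_A(fake_A)
valid_B = d_B(fake_B)

combined = Model(inputs=[img_A, img_B],
                 outputs=[valid_A, valid_B,
                          reconstr_A, reconstr_B,
                          img_A_id, img_B_id])
combined.compile(loss=['mse', 'mse', 'mae', 'mae', 'mae', 'mae'],
                 loss_weights=[1, 1, 10, 10, 1, 1],   # assumed weighting
                 optimizer=Adam(0.0002, 0.5))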

Training the CycleGAN & Analysis of the CycleGAN


Please refer to the textbook.
Neural Style Transfer
So far we have seen how a CycleGAN can translate images between two domains using a training dataset that need not be paired. Now we shall look at a different application of style transfer, where we do not have a training set at all, but instead wish to transfer the style of one single image onto another, as shown in the figure. This is known as neural style transfer.

The idea works on the premise that we want to minimize a loss function that is a weighted
sum of three distinct parts:
Content loss
We would like the combined image to contain the same content as the base image.
Style loss
We would like the combined image to have the same general style as the style image.
Total variance loss
We would like the combined image to appear smooth rather than pixelated.
We minimize the overall weighted loss via gradient descent; that is, we update each pixel value by an amount proportional to the negative gradient of the loss function, over many iterations. This way, the loss gradually decreases with each iteration and we end up with an image that merges the content of one image with the style of another.
Content loss
Content Loss is a key component in Neural Style Transfer (NST), which is a technique used
to blend the content of one image (e.g., a photo) with the style of another (e.g., a painting).
Two images that contain similar-looking scenes (e.g., a photo of a row of buildings and
another photo of the same buildings taken in different light from a different angle) should
have a smaller loss than two images that contain completely different scenes. Simply
comparing the pixel values of the two images won’t do, because even in two distinct images
of the same scene, we wouldn’t expect individual pixel values to be similar. We don’t really
want the content loss to care about the values of individual pixels; we’d rather that it scores
images based on the presence and approximate position of higher-level features such as
buildings, sky, or river.
To extract these high-level features, we use a pretrained VGG19 network, a 19-layer CNN
trained on the ImageNet dataset, which contains over a million images across 1,000
categories. VGG19 naturally learns a hierarchy of visual features, making it ideal for
capturing image content at deeper layers.
The content loss function is then computed as the mean squared error (MSE) between the
feature representations of the base image and the generated image at a chosen deep layer
(e.g., block5_conv2). This guides the training process to preserve the content structure while
allowing stylistic transformation.
The content loss function using VGG19 model
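As a minimal sketch of this computation (the layer choice block5_conv2 follows the description above; the function and variable names are illustrative, not the book's exact code):

Code (sketch):

import tensorflow as tf
from tensorflow.keras.applications import vgg19

# Feature extractor that returns the activations of a deep layer (block5_conv2).
vgg = vgg19.VGG19(weights='imagenet', include_top=False)
feature_extractor = tf.keras.Model(vgg.input, vgg.get_layer('block5_conv2').output)

def content_loss(base_img, combined_img):
    # Mean squared error between the deep feature maps of the two images.
    base_features = feature_extractor(vgg19.preprocess_input(base_img))
    combined_features = feature_extractor(vgg19.preprocess_input(combined_img))
    return tf.reduce_mean(tf.square(combined_features - base_features))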

Style Loss
Style Loss is a key component in Neural Style Transfer (NST), which aims to combine the
style of one image (like a painting) with the content of another (like a photograph).
While content loss ensures the generated image preserves the layout of the content image,
style loss ensures it reflects the artistic style (textures, colors, brushstrokes) of the style
reference image.
How Style Loss Works (Conceptual Basis)
1. Lower CNN layers capture textures and fine details (edges, strokes), while higher
layers capture more abstract style components (color composition, broader patterns).
2. The style of an image can be expressed as feature correlations — that is, how different
filter activations relate to each other at a given layer.
3. These correlations are represented using a Gram matrix, which is a square matrix
capturing the inner products between feature maps.

Mathematical detail of Style loss
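In the standard formulation (Gatys et al.), which the Gram-matrix description above follows, the Gram matrix and the style loss at a layer l are:

G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}

E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G^l_{ij} - A^l_{ij}\right)^2

where F^l is the matrix of flattened feature maps at layer l, G^l and A^l are the Gram matrices of the generated image and the style image, N_l is the number of feature maps, and M_l is their spatial size. The total style loss is a weighted sum over the chosen style layers, L_{style} = \sum_l w_l E_l.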

Implementation Details
 The book uses VGG19 pretrained on ImageNet to extract the style features.
 Style layers typically include: block1_conv1, block2_conv1, block3_conv1,
block4_conv1, and block5_conv1.
 The Gram matrix is computed using the dot product of the feature maps reshaped into
a 2D matrix.
 The generated image is trained to minimize this style loss in conjunction with the
content loss.
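A minimal sketch of the Gram matrix and per-layer style loss in code (the function names and the normalization constant are illustrative assumptions):

Code (sketch):

import tensorflow as tf

def gram_matrix(feature_map):
    # Flatten the (H, W, C) feature map to (H*W, C) and take the inner
    # products between channels: a (C, C) matrix of feature correlations.
    channels = tf.shape(feature_map)[-1]
    flattened = tf.reshape(feature_map, (-1, channels))
    gram = tf.matmul(flattened, flattened, transpose_a=True)
    return gram / tf.cast(tf.shape(flattened)[0], tf.float32)

def style_loss_for_layer(style_features, combined_features):
    # Mean squared difference between the two Gram matrices at one layer;
    # the total style loss sums this over the chosen style layers
    # (block1_conv1 ... block5_conv1), usually with per-layer weights.
    return tf.reduce_mean(tf.square(gram_matrix(style_features) -
                                    gram_matrix(combined_features)))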
Total Variance Loss
The total variance loss is simply a measure of noise in the combined image. To judge how
noisy an image is, we can shift it one pixel to the right and calculate the sum of the squared
difference between the translated and original images. For balance, we can also do the same
procedure but shift the image one pixel down. The sum of these two terms is the total
variance loss.
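A minimal sketch of this computation for a batched image tensor of shape (1, H, W, 3) (the function name is illustrative):

Code (sketch):

import tensorflow as tf

def total_variance_loss(img):
    # Shift the image one pixel down and one pixel to the right and sum the
    # squared differences with the original, exactly as described above.
    a = tf.square(img[:, 1:, :-1, :] - img[:, :-1, :-1, :])   # shifted down
    b = tf.square(img[:, :-1, 1:, :] - img[:, :-1, :-1, :])   # shifted right
    return tf.reduce_sum(a + b)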
Running the Neural Style Transfer
The learning process involves running gradient descent to minimize this total loss function with respect to the pixels in the combined image, as sketched below.
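A minimal sketch of this optimization loop, reusing the hypothetical helpers sketched in the previous sections; the loss weights, learning rate, iteration count, and the total_style_loss helper are assumptions:

Code (sketch):

import tensorflow as tf

content_weight, style_weight, tv_weight = 1.0, 100.0, 20.0   # assumed weights

# base_image and style_image are assumed to be (1, H, W, 3) float tensors.
combined = tf.Variable(base_image, dtype=tf.float32)  # start from the content image
optimizer = tf.keras.optimizers.Adam(learning_rate=5.0)

for step in range(1000):
    with tf.GradientTape() as tape:
        # total_style_loss is a hypothetical helper that sums
        # style_loss_for_layer over the VGG19 style layers listed earlier.
        loss = (content_weight * content_loss(base_image, combined)
                + style_weight * total_style_loss(style_image, combined)
                + tv_weight * total_variance_loss(combined))
    grads = tape.gradient(loss, combined)
    # Update the pixels of the combined image along the negative gradient.
    optimizer.apply_gradients([(grads, combined)])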

Analysis of the Neural Style Transfer Model


Refer to the textbook.
