Unit 5 Deep Unsupervised Learning
Deep learning has become one of the most popular and visible areas of machine learning, owing to its success in a variety of applications such as computer vision, natural language processing, and reinforcement learning.
Deep learning can be used for supervised, unsupervised, and reinforcement machine learning, and it uses different architectures and training procedures for each.
Supervised Machine Learning: Supervised machine learning is the technique in which a neural network learns to make predictions or classify data based on labeled datasets. Here we provide both the input features and the target variables. The network learns from the cost or error computed from the difference between the predicted and the actual targets; the process of propagating this error back through the network to update its weights is known as backpropagation. Deep learning architectures such as convolutional neural networks and recurrent neural networks are used for many supervised tasks like image classification and recognition, sentiment analysis, and language translation.
Unsupervised Machine Learning: Unsupervised machine learning is the technique in which a neural network learns to discover patterns in, or to cluster, unlabeled datasets. Here there are no target variables; the machine has to determine the hidden patterns or relationships within the data on its own. Deep learning models such as autoencoders and generative models are used for unsupervised tasks like clustering, dimensionality reduction, and anomaly detection.
Reinforcement Machine Learning: Reinforcement machine learning is the technique in which an agent learns to make decisions in an environment so as to maximize a reward signal. The agent interacts with the environment by taking actions and observing the resulting rewards. Deep learning can be used to learn policies, i.e., mappings from states to actions, that maximize the cumulative reward over time. Deep reinforcement learning algorithms such as Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG) are used for tasks like robotics and game playing.
Unsupervised learning
The task of unsupervised learning is to uncover structure in data, without using
human-provided supervision as a guide to what is salient or interesting about particular observations.
When doing unsupervised learning we seek to explain or analyze our data, or to provide useful inputs
for further applications. In practice there exists a varied body of unsupervised methods that each aim to
characterize different kinds of structure in the data. For instance, cluster analysis is an unsupervised
learning method where the goal is to identify groups, or clusters, of statistically similar observations
(Jain et al., 1999). Collaborative filtering seeks to “complete” a partial array of data, by leveraging
correlations between data variables (Su and Khoshgoftaar, 2009). Dimensionality reduction posits that
many datasets exhibit substantial redundancy across variables, and aims to reduce the data to its
essential directions of variability (van der Maaten et al., 2009).
Figure 2.1: Unsupervised learning methods include (a) clustering, where data is separated into statistically similar groups, (b) dimensionality reduction, where the aim is to capture the low-dimensional structure of the data, and (c) density estimation, where the aim is to approximate the true data distribution.
Dimensionality reduction is closely related to representation learning in which the aim is to learn
transformations of data that serve as useful representations for down-stream tasks (Bengio et al., 2013).
Many unsupervised learning approaches can be understood from a probabilistic perspective, where the
goal is to find a model pθ that closely matches the observed data. When dealing with continuous data
this task is often referred to as density estimation. However, one issue with framing unsupervised
learning in this way is that successful density estimation does not always incentivize the learning of
useful structure in the data. For instance, consider a perfect black-box model that outputs calibrated
probability densities for any input x. Such a model has perfectly characterized the statistical
dependencies between data variables, but it may not be useful to us if we are interested in cluster-
structure, or low-dimensional representations of data for use in alternative tasks. To reconcile the goals
of unsupervised learning, with the generic probabilistic objective of density estimation we can impose
structure on the parametric model pθ. In doing so we obtain probabilistic analogs to a number of
classical unsupervised objectives. For instance, if we assume the data are generated using unobserved
latent variables z, and that the latent variables are of lower dimensionality than the observed variables,
then by performing inference we also do dimensionality reduction. Under certain modeling
assumptions that are discussed in the following section, this reduces to a probabilistic variant of the
classic dimensionality reduction method PCA (Tipping and Bishop, 1999). Alternatively if the latent
variables are categorical cluster indicators, as in a mixture model, then we naturally recover a
clustering method through the EM-algorithm.
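As a small illustration (not from the text), the following sketch uses scikit-learn's GaussianMixture, which fits a mixture model with the EM algorithm; the inferred categorical latent variables act as cluster labels.
Python3
# Illustrative sketch: clustering as latent-variable inference with a
# Gaussian mixture model fitted by the EM algorithm (scikit-learn).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic clusters in 2-D
data = np.vstack([
    rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[+2.0, 0.0], scale=0.5, size=(200, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
labels = gmm.predict(data)           # hard cluster assignments (latent z)
resp = gmm.predict_proba(data[:5])   # posterior p(z | x) for a few points
print(labels[:10], resp.round(2))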
Figure 2.2: The data generating process for images from the CLEVR dataset (Johnson et al., 2017). To
generate an image, latent scene variables (left) including object properties and lighting conditions are
chosen from a prior model, and then transformed through the rendering process to a photo-realistic
image (right).
Latent variable models assume that data variables x are generated via interactions with unobserved, or latent, variables,
typically denoted z. Intuitively, it is reasonable to believe that in a particular dataset we will not have
observed all the relevant variables, and that correlations between variables may be caused by some
unobserved source. For example, if we collect data about umbrella sales and car accidents over time,
we will probably observe positive correlations in the variables. However, these variables are really
independent, given the knowledge that it is raining, or not raining, which in this case is a latent variable.
Alternatively, for perceptual data like images of faces, we know that there exist some underlying
factors that explain most of the variability across images: skin tone, face shape, camera pose, facial
expression, etc. We expect the dimensionality of these latent factors to be smaller than the number of
pixels in an image. Latent variable models formalize these assumptions by describing a data-generating
process where latent variables are first sampled, and then data variables are
generated conditioned on these latent variables. For instance, to generate an image of a person we first
generate the latent features z such as skin tone or camera pose, and then transform these features to
pixels x via a digital image-formation process. Figure 2.2 depicts a similar generative process for
images in the CLEVR dataset, with scene attributes such as object types, colours and materials being
fed into a renderer to produce photo-realistic images (Johnson et al., 2017). Latent variable models are
useful ways to describe natural observations, and we can obtain the model distribution over the
observed variables by marginalizing out the latent variables:
p(x; θ) = ∫ p(x|z; θ) p(z; θ) dz. (2.1)
This means that to evaluate the probability of a given observation, we need to consider all possible settings of the latent variables, weight them by their prior probabilities, and evaluate the probability of the observation assuming those particular latent variables.
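As an illustrative sketch (assuming a toy Gaussian model, not from the text), the marginal likelihood of equation (2.1) can be approximated by sampling latent variables from the prior and averaging the conditional densities:
Python3
# Monte Carlo estimate of p(x) = ∫ p(x|z) p(z) dz for a toy model where
# z ~ N(0, 1) and x | z ~ N(z, 0.5^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x_obs = 1.3                                        # a single observed data point
z_samples = rng.normal(0.0, 1.0, size=100_000)     # z ~ p(z)
p_x = norm.pdf(x_obs, loc=z_samples, scale=0.5).mean()   # average of p(x|z)
# Exact value for this Gaussian toy model: x ~ N(0, 1 + 0.5^2)
print(p_x, norm.pdf(x_obs, loc=0.0, scale=np.sqrt(1.25)))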
We will present a simple scenario to clarify the concept of latent variables. An IT company wants to hire an employee for one of its open positions. The candidates have the following features: i) high school grade, ii) university grade, iii) IQ score, iv) phone interview score. The company wants to bring some candidates in for an onsite interview.
Fig 1: Dataset of features of a hiring candidate
They cannot bring all of them because of the expense involved, so they decide to predict an onsite interview score and use it to decide whether or not to bring a candidate in for an onsite interview. They have historical data on the performance of previous onsite candidates. This looks like a trivial, standard regression problem where one simply predicts the score, but it is not, for basically two reasons:
i) Missing values in the dataset: we can fill them with all the tricks we have learned from machine learning foundations, but they will still induce uncertainty that hampers any probabilistic model built from the data.
ii) Quantifying the uncertainty in predictions: consider two candidates with the same predicted score of 50/100, one with some missing data and one with none. For the candidate with no missing data we are fairly confident the low score reflects likely poor performance, but we are much less sure about the candidate with missing values, who may be good after all.
Thus, for the reasons above, we opt for a probabilistic model of the data. We introduce random variables and model the connections between them. In our case, all the random variables might be connected to each other: a high IQ score might affect the high school grade, the GPA might affect the phone interview score, and so on.
To deal with this situation, the concept of a latent variable comes into play: we introduce an unobserved variable that explains the correlations between the observed features and quantifies the uncertainty. In our case it might be Intelligence. The latent variable is modeled as a direct cause of all the observed features. Our model is now much simpler to work with, and we get the same performance without reducing the flexibility of the model:
i) Fewer parameters
ii) Simpler model
iii) Easier to interpret.
Drawback: harder to train.
Auto Encoders
An autoencoder is an unsupervised artificial neural network that learns to encode data by compressing it into a lower-dimensional bottleneck layer (or code) and then decoding it to reconstruct the original input. The bottleneck layer (or code) holds the compressed representation of the input data. In an autoencoder the number of output units must equal the number of input units, since we are attempting to reconstruct the input data. Autoencoders usually consist of an encoder and a decoder. The encoder compresses the provided data down to the size of the bottleneck layer, and the decoder decodes the compressed data back into its original form. The number of neurons in the encoder layers decreases layer by layer, whereas the number of neurons in the decoder layers increases layer by layer. In the following example, three layers are used in both the encoder and the decoder: the encoder contains 32, 16, and 7 units in its layers respectively, and the decoder contains 7, 16, and 32 units respectively. The code size (the number of neurons in the bottleneck) must be less than the number of features in the data. Before feeding the data into the autoencoder, it should be scaled between 0 and 1, for example with a Min-Max scaler, since we use a sigmoid activation function in the output layer, which outputs values between 0 and 1. When we use autoencoders for dimensionality reduction, we extract the bottleneck layer and use its activations as the reduced representation; this process can be viewed as feature extraction. The type of autoencoder used here is a deep autoencoder in which the encoder and the decoder are symmetrical, although autoencoders do not have to be symmetrical. A sketch of such a model follows below.
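A minimal sketch of this architecture, written here in PyTorch (the 64-feature input size is a placeholder, and the layer sizes follow one reasonable reading of the description above):
Python3
# Deep autoencoder sketch: encoder 32-16-7 units, symmetric decoder 7-16-32,
# sigmoid output for inputs scaled to [0, 1].
import torch
import torch.nn as nn

n_features = 64  # hypothetical number of input features

encoder = nn.Sequential(
    nn.Linear(n_features, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 7), nn.ReLU(),              # bottleneck / code
)
decoder = nn.Sequential(
    nn.Linear(7, 16), nn.ReLU(),
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, n_features), nn.Sigmoid(),  # reconstruct inputs in [0, 1]
)
autoencoder = nn.Sequential(encoder, decoder)

x = torch.rand(8, n_features)                 # dummy mini-batch
loss = nn.functional.mse_loss(autoencoder(x), x)   # reconstruction loss
loss.backward()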
Architecture of Autoencoder in Deep Learning
The general architecture of an autoencoder includes an encoder, decoder, and bottleneck layer.
Encoder
1. The input layer takes the raw input data.
2. The hidden layers progressively reduce the dimensionality of the input, capturing important features and patterns. These layers compose the encoder.
3. The bottleneck layer (latent space) is the final hidden layer, where the dimensionality is significantly reduced. This layer represents the compressed encoding of the input data.
Decoder
1. The decoder takes the encoded representation from the bottleneck layer and expands it back to the dimensionality of the original input.
2. The hidden layers progressively increase the dimensionality and aim to reconstruct the original
input.
3. The output layer produces the reconstructed output, which ideally should be as close as possible
to the input data.
4. The loss function used during training is typically a reconstruction loss, measuring the
difference between the input and the reconstructed output. Common choices include mean
squared error (MSE) for continuous data or binary cross-entropy for binary data.
5. During training, the autoencoder learns to minimize the reconstruction loss, forcing the network
to capture the most important features of the input data in the bottleneck layer.
After the training process, only the encoder part of the autoencoder is retained to encode a similar
type of data used in the training process. The different ways to constrain the network are: –
Keep small Hidden Layers: If the size of each hidden layer is kept as small as possible,
then the network will be forced to pick up only the representative features of the data thus
encoding the data.
Regularization: In this method, a loss term is added to the cost function which encourages
the network to train in ways other than copying the input.
Denoising: Another way of constraining the network is to add noise to the input and teach
the network how to remove the noise from the data.
Tuning the Activation Functions: This method involves changing the activation functions
of various nodes so that a majority of the nodes are dormant thus, effectively reducing the size
of the hidden layers.
Types of Autoencoders
There are several types of autoencoders; below we describe them and analyze the advantages and disadvantages associated with each variation.
1. Denoising Autoencoder
A denoising autoencoder works on a partially corrupted input and is trained to recover the original, undistorted input. As mentioned above, this is an effective way to prevent the network from simply copying its input, forcing it to learn the underlying structure and important features of the data.
During the training phase, the denoising autoencoder (DAE) is presented with a collection of clean input examples along with their respective noisy counterparts. The objective is to learn a function that maps a noisy input to a relatively clean output using an encoder-decoder architecture. To achieve this, a reconstruction loss function is typically employed to evaluate the disparity between the clean input and the reconstructed output. A DAE is trained by minimizing this loss through backpropagation, which updates the weights of both the encoder and decoder components; a code sketch of this training step is given at the end of this subsection. Applications of denoising autoencoders (DAEs) span a variety of domains, including computer vision, speech processing, and natural language processing.
Examples
Image Denoising: DAEs are effective in removing noise from images, such as Gaussian
noise or salt-and-pepper noise.
Fraud Detection: DAEs can contribute to identifying fraudulent transactions by learning
to reconstruct common transactions from their noisy counterparts.
Data Imputation: To reconstruct missing values from available data by learning, DAEs
can facilitate data imputation in datasets with incomplete information.
Data Compression: DAEs can compress data by obtaining a concise representation of
the data in the encoding space.
Anomaly Detection: Using DAEs, anomalies in a dataset can be detected by training a model to reconstruct normal data and then flagging inputs that it reconstructs poorly as potentially abnormal.
Advantages
1. This type of autoencoder can extract important features and reduce the noise or the useless
features.
2. Denoising autoencoders can be used as a form of data augmentation, the restored images
can be used as augmented data thus generating additional training samples.
Disadvantages
1. Selecting the right type and level of noise to introduce can be challenging and may require
domain knowledge.
2. The denoising process can result in the loss of some information that is needed from the original input, which can impact the accuracy of the output.
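As a sketch of the training procedure described above (the Gaussian noise and MSE loss are assumed details; `autoencoder` can be any encoder-decoder network, e.g. the one sketched earlier):
Python3
# One denoising-autoencoder training step: corrupt the inputs with noise and
# train the model to reconstruct the clean originals.
import torch
import torch.nn as nn

def dae_training_step(autoencoder, optimizer, clean_batch, noise_std=0.1):
    noisy_batch = clean_batch + noise_std * torch.randn_like(clean_batch)
    noisy_batch = noisy_batch.clamp(0.0, 1.0)        # keep inputs in [0, 1]
    optimizer.zero_grad()
    reconstruction = autoencoder(noisy_batch)
    loss = nn.functional.mse_loss(reconstruction, clean_batch)  # compare to clean input
    loss.backward()
    optimizer.step()
    return loss.item()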
2.Sparse Autoencoder
This type of autoencoder typically contains more hidden units than the input but only a few are
allowed to be active at once. This property is called the sparsity of the network. The sparsity of the
network can be controlled by either manually zeroing the required hidden units, tuning the activation
functions or by adding a loss term to the cost function.
Advantages
1. The sparsity constraint in sparse autoencoders helps in filtering out noise and irrelevant
features during the encoding process.
2. These autoencoders often learn important and meaningful features due to their emphasis on
sparse activations.
Disadvantages
1. The choice of hyperparameters plays a significant role in the performance of this autoencoder. Different inputs should result in the activation of different nodes of the network.
2. The application of sparsity constraint increases computational complexity.
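A minimal sketch (not from the text) of the loss-term approach to sparsity, adding an L1 penalty on the bottleneck activations to the reconstruction loss; `encoder` and `decoder` are assumed to be defined as in the earlier sketches:
Python3
# Sparse autoencoder loss: reconstruction error plus an L1 activity penalty,
# so only a few hidden units stay active for any given input.
import torch
import torch.nn as nn

def sparse_ae_loss(encoder, decoder, x, sparsity_weight=1e-3):
    code = encoder(x)                            # bottleneck activations
    reconstruction = decoder(code)
    recon_loss = nn.functional.mse_loss(reconstruction, x)
    sparsity_loss = code.abs().mean()            # L1 penalty on activations
    return recon_loss + sparsity_weight * sparsity_loss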
3. Convolutional Autoencoder
Convolutional autoencoders are a type of autoencoder that use convolutional neural networks (CNNs) as their building blocks. The encoder consists of multiple layers that take an image or a grid as input and pass it through convolution layers, forming a compressed representation of the input. The decoder is the mirror image of the encoder: it deconvolves the compressed representation and tries to reconstruct the original image.
Advantages
1. Convolutional autoencoder can compress high-dimensional image data into a lower-
dimensional data. This improves storage efficiency and transmission of image data.
2. Convolutional autoencoder can reconstruct missing parts of an image. It can also handle
images with slight variations in object position or orientation.
Disadvantages
1. These autoencoders are prone to overfitting. Proper regularization techniques should be used to tackle this issue.
2. Compression of data can cause data loss which can result in reconstruction of a lower
quality image.
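A minimal sketch (with assumed details such as 1-channel 28x28 inputs) of a convolutional autoencoder, with strided convolutions in the encoder mirrored by transposed convolutions in the decoder:
Python3
# Convolutional autoencoder: the encoder halves the spatial size twice,
# and the decoder mirrors it to reconstruct the image.
import torch
import torch.nn as nn

conv_encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
    nn.ReLU(),
)
conv_decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                       padding=1, output_padding=1),        # 7x7 -> 14x14
    nn.ReLU(),
    nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                       padding=1, output_padding=1),        # 14x14 -> 28x28
    nn.Sigmoid(),
)
conv_autoencoder = nn.Sequential(conv_encoder, conv_decoder)

x = torch.rand(4, 1, 28, 28)              # dummy batch
print(conv_autoencoder(x).shape)          # torch.Size([4, 1, 28, 28])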
Autoencoders have emerged as an architecture for data representation and generation. Among them,
Variational Autoencoders (VAEs) stand out, introducing probabilistic encoding and opening new
avenues for diverse applications. In this section, we explore the architecture and foundational concepts of variational autoencoders (VAEs).
Autoencoders are neural network architectures intended for the compression and reconstruction of data. An autoencoder consists of an encoder and a decoder; together these networks learn a compact representation of the input data. A reconstruction loss ensures a close match between output and input, which is the basis for understanding more advanced architectures such as VAEs. The encoder learns an efficient encoding of the data and passes it into a bottleneck layer. The other part of the autoencoder, the decoder, uses the latent representation in the bottleneck layer to regenerate data similar to the dataset. The reconstruction error is backpropagated through the network as the loss function.
Variational Autoencoder
The variational autoencoder was proposed in 2013 by Diederik P. Kingma and Max Welling. A variational autoencoder (VAE) provides a probabilistic manner of describing an observation in latent space. Thus, rather than building an encoder that outputs a single value to describe each latent state attribute, we formulate our encoder to describe a probability distribution for each latent attribute. VAEs have many applications, such as data compression and synthetic data creation.
Variational autoencoder is different from an autoencoder in a way that it provides a statistical
manner for describing the samples of the dataset in latent space. Therefore, in the variational
autoencoder, the encoder outputs a probability distribution in the bottleneck layer instead of a single
output value.
The variational autoencoder uses KL-divergence as part of its loss function; the goal is to minimize the difference between the assumed (approximate) distribution and the original distribution of the dataset.
Suppose we have a latent variable z and we want to generate an observation x from it. In other words, we want to infer the posterior
p(z|x) = p(x|z) p(z) / p(x),
where the evidence p(x) = ∫ p(x|z) p(z) dz is usually intractable. Hence, we approximate p(z|x) with a simpler distribution q(z|x). To make q(z|x) close to p(z|x), we minimize the KL-divergence between them, which measures how similar two distributions are; rearranging this objective gives the quantity to be maximized:
E_q(z|x)[ log p(x|z) ] − KL( q(z|x) || p(z) ).
The first term represents the reconstruction likelihood and the other term ensures that our learned distribution q is similar to the true prior distribution p.
Thus our total loss consists of two terms, one is the reconstruction error and the other is the KL-divergence loss:
Loss = reconstruction error + KL( q(z|x) || p(z) ).
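A minimal sketch (assumed shapes and a standard-normal prior; not the text's code) of this two-term loss, using the reparameterization trick to sample from q(z|x):
Python3
# VAE objective: the encoder outputs a mean and log-variance per latent
# dimension, a sample z is drawn via the reparameterization trick, and the
# loss is the reconstruction error plus the KL term against N(0, 1).
import torch
import torch.nn as nn

def vae_loss(decoder, x, mu, logvar):
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)            # reparameterization trick
    x_hat = decoder(z)
    recon = nn.functional.mse_loss(x_hat, x, reduction='sum')
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl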
Advantages
1. Variational Autoencoders are used to generate new data points that resemble the original
training data. These samples are learned from the latent space.
2. Variational Autoencoder is probabilistic framework that is used to learn a compressed
representation of the data that captures its underlying structure and variations, so it is useful in
detecting anomalies and data exploration.
Disadvantages
1. Variational Autoencoder use approximations to estimate the true distribution of the latent
variables. This approximation introduces some level of error, which can affect the quality of
generated samples.
2. The generated samples may only cover a limited subset of the true data distribution. This
can result in a lack of diversity in generated samples.
A GAN consists of two competing networks:
1. Generator: it is trained to generate a new dataset; for example, in computer vision it generates new images that resemble existing real-world images.
2. Discriminator: it compares those images with real-world examples and classifies them as real or fake.
Example:
The Generator generates some random images (e.g. tables) and then the Discriminator compares those images with some real-world table images and sends the feedback to itself and to the Generator. Look at the GAN structure in fig. 1.
1. First, some random noise signal is sent to the Generator, which creates some useless images containing noise.
2. Two inputs are given to the Discriminator: the first is the sample output images generated by the Generator, and the second is the set of real-world sample images.
3. Thereafter, the Discriminator outputs some values (probabilities) after comparing both sets of images, as shown in fig. 2. Say it calculates 0.8, 0.3 and 0.5 for the generator output images and 0.1, 0.9 and 0.2 for the real-world images.
4. Now, an error is calculated by comparing the probabilities of the generated images with 0 (zero) and the probabilities of the real-world images with 1 (one) (e.g. 0 vs. 0.8, 0 vs. 0.3, 0 vs. 0.5 and 1 vs. 0.1, 1 vs. 0.9, 1 vs. 0.2).
5. After calculating the individual errors, the cumulative error (loss) is computed and backpropagated, and the weights of the Discriminator are adjusted. This is how a Discriminator is trained. A small numeric sketch of this computation is given after these steps.
After a few iterations, you will see that the Generator starts generating images close to real-world images.
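A small numeric sketch of steps 4 and 5 (the probabilities follow the example above; using binary cross-entropy for the error is an assumption):
Python3
# Discriminator error: generated-image probabilities are compared against the
# target 0, real-image probabilities against the target 1.
import torch
import torch.nn as nn

fake_probs = torch.tensor([0.8, 0.3, 0.5])   # discriminator outputs for generated images
real_probs = torch.tensor([0.1, 0.9, 0.2])   # discriminator outputs for real images

bce = nn.BCELoss()
d_loss = bce(fake_probs, torch.zeros(3)) + bce(real_probs, torch.ones(3))
print(d_loss.item())   # cumulative discriminator loss to be backpropagated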
Applications of GAN:
1. Generating Images
2. Super Resolution
3. Image Modification
4. Photo realistic images
5. Face Ageing
A GAN (Generative Adversarial Network) represents a cutting-edge approach to generative modeling within deep learning, often leveraging architectures like convolutional neural networks. The goal of generative modeling is to autonomously identify patterns in input data, enabling the model to produce new examples that plausibly resemble the original dataset.
What is a Generative Adversarial Network?
Generative Adversarial Networks (GANs) are a powerful class of neural networks used for unsupervised learning. GANs are made up of two neural networks, a discriminator and a generator. They use adversarial training to produce artificial data that closely resembles real data.
The Generator, starting from random noise samples, attempts to fool the Discriminator, which is tasked with accurately distinguishing between produced and genuine data.
Realistic, high-quality samples are produced as a result of this competitive interaction, which drives both networks toward improvement.
GANs are proving to be highly versatile artificial intelligence tools, as evidenced by their
extensive use in image synthesis, style transfer, and text-to-image synthesis.
They have also revolutionized generative modeling.
Through adversarial training, these models engage in a competitive interplay until the generator
becomes adept at creating realistic samples, fooling the discriminator approximately half the time.
Generative Adversarial Networks (GANs) can be broken down into three parts:
Generative: To learn a generative model, which describes how data is generated in terms of
a probabilistic model.
Adversarial: The word adversarial refers to setting one thing up against another. This
means that, in the context of GANs, the generative result is compared with the actual images in
the data set. A mechanism known as a discriminator is used to apply a model that attempts to
distinguish between real and fake images.
Networks: Use deep neural networks as artificial intelligence (AI) algorithms for training
purposes.
Types of GANs
1. Vanilla GAN: This is the simplest type of GAN. Here, the Generator and the Discriminator are simple, basic multi-layer perceptrons. In a vanilla GAN, the algorithm is straightforward: it tries to optimize the minimax objective using stochastic gradient descent.
2. Conditional GAN (CGAN): CGAN can be described as a deep learning method in
which some conditional parameters are put into place.
In CGAN, an additional parameter ‘y’ is added to the Generator for generating the
corresponding data.
Labels are also put into the input to the Discriminator in order for the
Discriminator to help distinguish the real data from the fake generated data.
3. Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular and also the
most successful implementations of GAN. It is composed of ConvNets in place of multi-layer
perceptrons.
The ConvNets are implemented without max pooling, which is in fact replaced by
convolutional stride.
Also, the layers are not fully connected.
4. Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible image
representation consisting of a set of band-pass images, spaced an octave apart, plus a low-
frequency residual.
This approach uses multiple numbers of Generator and Discriminator
networks and different levels of the Laplacian Pyramid.
This approach is mainly used because it produces very high-quality images. The
image is down-sampled at first at each layer of the pyramid and then it is again up-scaled at each
layer in a backward pass where the image acquires some noise from the Conditional GAN at
these layers until it reaches its original size.
5. Super Resolution GAN (SRGAN): SRGAN as the name suggests is a way of designing a
GAN in which a deep neural network is used along with an adversarial network in order to
produce higher-resolution images. This type of GAN is particularly useful in optimally up-scaling native low-resolution images to enhance their details while minimizing errors.
Architecture of GANs
A Generative Adversarial Network (GAN) is composed of two primary parts, which are the
Generator and the Discriminator.
Generator Model
A key element responsible for creating fresh, accurate data in a Generative Adversarial Network (GAN) is the generator model. The generator takes random noise as input and converts it into complex data samples, such as text or images. It is commonly implemented as a deep neural network.
The training data’s underlying distribution is captured by layers of learnable parameters in its design
through training. The generator adjusts its output to produce samples that closely mimic real data as
it is being trained by using backpropagation to fine-tune its parameters.
The generator’s ability to generate high-quality, varied samples that can fool the discriminator is
what makes it successful.
Generator Loss
The objective of the generator in a GAN is to produce synthetic samples that are realistic enough to fool the discriminator. The generator achieves this by minimizing its loss function J_G. The loss is minimized when the log probability is maximized, i.e., when the discriminator is highly likely to classify the generated samples as real. The loss is given by:
J_G = -(1/m) Σ_{i=1}^{m} log D(G(z_i))
Where,
J_G measures how well the generator is fooling the discriminator.
log D(G(z_i)) represents the log probability of the discriminator classifying a generated sample as real.
The generator aims to minimize this loss, encouraging the production of samples that the discriminator classifies as real (D(G(z_i)) close to 1).
Discriminator Model
An artificial neural network called a discriminator model is used in Generative Adversarial Networks
(GANs) to differentiate between generated and actual input. By evaluating input samples and
allocating probability of authenticity, the discriminator functions as a binary classifier.
Over time, the discriminator learns to differentiate between genuine data from the dataset and
artificial samples created by the generator. This allows it to progressively hone its parameters and
increase its level of proficiency.
Convolutional layers or pertinent structures for other modalities are usually used in its architecture
when dealing with picture data. Maximizing the discriminator’s capacity to accurately identify
generated samples as fraudulent and real samples as authentic is the aim of the adversarial training
procedure. The discriminator grows increasingly discriminating as a result of the generator and
discriminator’s interaction, which helps the GAN produce extremely realistic-looking synthetic data
overall.
Discriminator Loss
The discriminator reduces the negative log likelihood of correctly classifying both produced and real samples. This loss incentivizes the discriminator to accurately categorize generated samples as fake and real samples as real:
J_D = -(1/m) Σ_{i=1}^{m} [ log D(x_i) + log(1 − D(G(z_i))) ]
Where,
J_D assesses the discriminator's ability to discern between produced and actual samples.
log D(x_i) represents the log likelihood that the discriminator will accurately categorize real data as real.
log(1 − D(G(z_i))) represents the log likelihood that the discriminator will correctly categorize generated samples as fake.
The discriminator aims to reduce this loss by accurately identifying artificial and real samples.
MinMax Loss
In a Generative Adversarial Network (GAN), the minimax objective is:
min_G max_D V(D, G) = E_{x∼p_data(x)}[ log D(x) ] + E_{z∼p_z(z)}[ log(1 − D(G(z))) ]
Where,
G is the generator network and D is the discriminator network.
Actual data samples obtained from the true data distribution p_data(x) are represented by x.
Random noise sampled from a prior distribution p_z(z) (usually a normal or uniform distribution) is represented by z.
D(x) represents the discriminator's estimated probability that real data x is real.
D(G(z)) is the discriminator's estimated probability that data produced by the generator is real.
How does a GAN work?
The steps involved in how a GAN works:
1. Initialization: Two neural networks are created: a Generator (G) and a Discriminator (D).
G is tasked with creating new data, like images or text, that closely resembles real data.
D acts as a critic, trying to distinguish between real data (from a training dataset) and the data
generated by G.
2. Generator’s First Move:
G takes a random noise vector as input. This noise vector contains random values and acts as the
starting point for G’s creation process. Using its internal layers and learned patterns, G transforms
the noise vector into a new data sample, like a generated image.
3. Discriminator’s Turn: D receives two kinds of inputs:
Real data samples from the training dataset.
The data samples generated by G in the previous step. D’s job is to analyze each input and
determine whether it’s real data or something G cooked up. It outputs a probability score
between 0 and 1. A score of 1 indicates the data is likely real, and 0 suggests it’s fake.
4. The Learning Process: Now, the adversarial part comes in:
If D correctly identifies real data as real (score close to 1) and generated data as fake (score close to
0), both G and D are rewarded to a small degree. This is because they’re both doing their jobs well.
However, the key is to continuously improve. If D consistently identifies everything correctly, it
won’t learn much. So, the goal is for G to eventually trick D.
5. Generator’s Improvement:
When D mistakenly labels G’s creation as real (score close to 1), it’s a sign that G is on the right
track. In this case, G receives a significant positive update, while D receives a penalty for being
fooled.
This feedback helps G improve its generation process to create more realistic data.
6. Discriminator’s Adaptation:
Conversely, if D correctly identifies G’s fake data (score close to 0), G receives no reward and D is further strengthened in its discrimination abilities.
This ongoing duel between G and D refines both networks over time.
As training progresses, G gets better at generating realistic data, making it harder for D to tell the
difference. Ideally, G becomes so adept that D can’t reliably distinguish real from fake data. At this
point, G is considered well-trained and can be used to generate new, realistic data samples.
Implementation of Generative Adversarial Network (GAN)
We will follow and understand the steps to understand how GAN is implemented:
Step1 : Importing the required libraries
Python3
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np
# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
For training on the CIFAR-10 image dataset, this PyTorch module creates a Generative Adversarial
Network (GAN), switching between generator and discriminator training. Visualization of the
generated images occurs every tenth epoch, and the development of the GAN is tracked.
Step 2: Defining a Transform
The code uses PyTorch’s transforms.Compose to define a simple image transform that converts images to tensors and normalizes them, and then loads the CIFAR-10 training set with that transform. (The transform definition is not shown in the original excerpt; a standard choice consistent with the generator's Tanh output is used here.)
Python3
# Convert images to tensors and normalize to [-1, 1]
transform = transforms.Compose([transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_dataset = datasets.CIFAR10(root='./data', train=True,
                                 download=True, transform=transform)
dataloader = torch.utils.data.DataLoader(train_dataset,
                                         batch_size=32, shuffle=True)
Python3
# Hyperparameters
latent_dim = 100
lr = 0.0002
beta1 = 0.5
beta2 = 0.999
num_epochs = 10
Python3
# Generator: maps a latent noise vector to a 3x32x32 image
class Generator(nn.Module):
    def __init__(self, latent_dim):
        super(Generator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128, momentum=0.78),
            nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64, momentum=0.78),
            nn.ReLU(),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),
            nn.Tanh()
        )

    def forward(self, z):
        return self.model(z)
Python3
# Discriminator: classifies a 3x32x32 image as real (1) or fake (0)
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.model = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.25),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ZeroPad2d((0, 1, 0, 1)),
            nn.BatchNorm2d(64, momentum=0.82),
            nn.LeakyReLU(0.25),
            nn.Dropout(0.25),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128, momentum=0.82),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.25),
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(256, momentum=0.8),
            nn.LeakyReLU(0.25),
            nn.Dropout(0.25),
            nn.Flatten(),
            nn.Linear(256 * 5 * 5, 1),
            nn.Sigmoid()
        )

    def forward(self, img):
        return self.model(img)
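The excerpt does not show the step that instantiates the networks, the loss, and the optimizers used by the training loop below; a minimal sketch consistent with that loop (which refers to generator, discriminator, adversarial_loss, optimizer_G, and optimizer_D) is:
Python3
# Build the networks, the loss, and the optimizers used by the training loop
generator = Generator(latent_dim).to(device)
discriminator = Discriminator().to(device)
adversarial_loss = nn.BCELoss()
optimizer_G = optim.Adam(generator.parameters(), lr=lr, betas=(beta1, beta2))
optimizer_D = optim.Adam(discriminator.parameters(), lr=lr, betas=(beta1, beta2))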
Python3
# Training loop
for epoch in range(num_epochs):
    for i, batch in enumerate(dataloader):
        # Convert list to tensor
        real_images = batch[0].to(device)

        # Adversarial ground truths
        valid = torch.ones(real_images.size(0), 1, device=device)
        fake = torch.zeros(real_images.size(0), 1, device=device)

        # ---------------------
        #  Train Discriminator
        # ---------------------
        optimizer_D.zero_grad()

        # Sample noise as generator input
        z = torch.randn(real_images.size(0), latent_dim, device=device)

        # Generate a batch of images
        fake_images = generator(z)

        # Discriminator loss on real and generated images
        real_loss = adversarial_loss(discriminator(real_images), valid)
        fake_loss = adversarial_loss(discriminator(fake_images.detach()), fake)
        d_loss = (real_loss + fake_loss) / 2

        # Backward pass and optimize
        d_loss.backward()
        optimizer_D.step()

        # -----------------
        #  Train Generator
        # -----------------
        optimizer_G.zero_grad()

        # Generate a batch of images
        gen_images = generator(z)

        # Adversarial loss
        g_loss = adversarial_loss(discriminator(gen_images), valid)

        # Backward pass and optimize
        g_loss.backward()
        optimizer_G.step()

        # ---------------------
        #  Progress Monitoring
        # ---------------------
        if (i + 1) % 100 == 0:
            print(
                f"Epoch [{epoch+1}/{num_epochs}] "
                f"Batch {i+1}/{len(dataloader)} "
                f"Discriminator Loss: {d_loss.item():.4f} "
                f"Generator Loss: {g_loss.item():.4f}"
            )

    # Save generated images every 10th epoch
    if (epoch + 1) % 10 == 0:
        with torch.no_grad():
            z = torch.randn(16, latent_dim, device=device)
            generated = generator(z).detach().cpu()
            grid = torchvision.utils.make_grid(generated, nrow=4, normalize=True)
            plt.imshow(np.transpose(grid, (1, 2, 0)))
            plt.axis("off")
            plt.show()
Output:
Epoch [10/10] Batch 1300/1563 Discriminator Loss: 0.4473 Generator Loss: 0.9555
Epoch [10/10] Batch 1400/1563 Discriminator Loss: 0.6643 Generator Loss: 1.0215
Epoch [10/10] Batch 1500/1563 Discriminator Loss: 0.4720 Generator Loss: 2.5027
GAN Output
Application Of Generative Adversarial Networks (GANs)
GANs, or Generative Adversarial Networks, have many uses in many different fields. Here are some
of the widely recognized uses of GANs:
1. Image Synthesis and Generation : GANs are often used for image synthesis and generation tasks. They can create fresh, lifelike images that mimic the training data by learning the distribution that explains the dataset. These generative networks have facilitated the development of lifelike avatars, high-resolution photographs, and fresh artwork.
2. Image-to-Image Translation : GANs may be used for problems involving image-to-image
translation, where the objective is to convert an input picture from one domain to another while
maintaining its key features. GANs may be used, for instance, to change pictures from day to
night, transform drawings into realistic images, or change the creative style of an image.
3. Text-to-Image Synthesis : GANs have been used to create visuals from textual descriptions. Given a text input, such as a phrase or a caption, GANs can produce images that correspond to the description. This application has an impact on how realistic visual material is produced from text-based instructions.
4. Data Augmentation : GANs can augment present data and increase the robustness and
generalizability of machine-learning models by creating synthetic data samples.
5. Super-Resolution : GANs can enhance the resolution and quality of low-resolution images. By training on pairs of low-resolution and high-resolution images, GANs can generate high-resolution images from low-resolution inputs, enabling improved image quality in applications such as medical imaging, satellite imaging, and video enhancement.
Advantages of GAN
1. Synthetic data generation: GANs can generate new, synthetic data that resembles some
known data distribution, which can be useful for data augmentation, anomaly detection, or
creative applications.
2. High-quality results: GANs can produce high-quality, photorealistic results in image
synthesis, video synthesis, music synthesis, and other tasks.
3. Unsupervised learning: GANs can be trained without labeled data, making them suitable
for unsupervised learning tasks, where labeled data is scarce or difficult to obtain.
4. Versatility: GANs can be applied to a wide range of tasks, including image synthesis, text-
to-image synthesis, image-to-image translation, anomaly detection, data augmentation, and
others.
Disadvantages of GAN
The disadvantages of the GANs are as follows:
1. Training Instability: GANs can be difficult to train, with the risk of instability, mode
collapse, or failure to converge.
2. Computational Cost: GANs can require a lot of computational resources and can be slow
to train, especially for high-resolution images or large datasets.
3. Overfitting: GANs can overfit the training data, producing synthetic data that is too similar
to the training data and lacking diversity.
4. Bias and Fairness: GANs can reflect the biases and unfairness present in the training data,
leading to discriminatory or biased synthetic data.
5. Interpretability and Accountability: GANs can be opaque and difficult to interpret or
explain, making it challenging to ensure accountability, transparency, or fairness in their
applications.
Use cases of Generative Adversarial Networks (GANs):
1. Text-to-Image synthesis: Generating images from text descriptions, such as scene descriptions, object descriptions, or attributes.
2. Image-to-Image translation: Translating images from one domain to another, such as
converting grayscale images to color, changing the season of a scene, or transforming sketches into
photorealistic images.
3. Anomaly detection: Identifying anomalies or outliers in data, such as detecting fraud in
financial transactions, detecting network intrusions, or identifying medical conditions in medical
imaging.
4. Data augmentation: Increasing the size and diversity of a dataset for training deep learning
models, such as in computer vision, speech recognition, or natural language processing.
5. Video synthesis: Generating new, realistic video sequences from a given data distribution,
such as human action sequences, animal behaviors, or animated sequences.
6. Music synthesis: Generating new, original music from a given data distribution, such as
musical genres, styles, or instrumentations.
7. 3D model synthesis: Generating new, realistic 3D models from a given data distribution, such as objects, scenes, or shapes.
Generative Adversarial Networks (GANs) are most popular for generating images from a given dataset of images, but apart from that, GANs are now being used for a variety of applications. They are a class of neural networks with a discriminator block and a generator block that work together and are able to produce new samples, rather than just classifying or predicting the class of a sample.
Some of the newly discovered use cases of GANs are:
Security: Artificial intelligence has proved to be a boon to many industries, but it is also surrounded by the problem of cyber threats. GANs have proved to be a great help in handling adversarial attacks. Adversarial attacks use a variety of techniques to fool deep learning architectures; by creating fake examples and training the model to identify them, we counter these attacks.
Generating Data using GANs: Data is the most important ingredient for any deep learning algorithm. In general, the more data, the better the performance of a deep learning algorithm. But in many cases, such as health diagnostics, the amount of data is restricted; in such cases there is a need to generate good-quality data, for which GANs are being used.
Privacy-Preserving: There are many cases when our data needs to be kept confidential, which is especially important in defense and military applications. We have many data encryption schemes, but each has its own limitations; in such cases GANs can be useful. In 2016, Google opened a new research path on using a GAN-style competitive framework for encryption problems, where two networks had to compete in creating a code and cracking it.
Data Manipulation:
We can use GANs for pseudo style transfer, i.e. modifying a part of the subject without complete style transfer. For example, in many applications we want to add a smile to an image, or modify just the eyes. This can also be extended to other domains such as natural language processing and speech processing; for example, we can rework some selected words of a paragraph without modifying the whole paragraph.
Advantages of Generative Adversarial Network (GAN) use cases:
1. Image synthesis: GANs can generate high-quality, photorealistic images, which can be used in
a variety of applications, such as entertainment, art, or marketing.
2. Text-to-Image synthesis: GANs can generate images from text descriptions, which can be
useful for generating illustrations, animations, or virtual environments.
3. Image-to-Image translation: GANs can translate images from one domain to another, which
can be used for colorization, style transfer, or data augmentation.
4. Anomaly detection: GANs can identify anomalies or outliers in data, which can be useful for
detecting fraud, network intrusions, or medical conditions.
5. Data augmentation: GANs can increase the size and diversity of a dataset for training deep
learning models, which can improve their performance, robustness, or generalization.
6. Video synthesis: GANs can generate high-quality, realistic video sequences, which can be
used in animation, film, or video games.
7. Music synthesis: GANs can generate new, original music, which can be used in music
composition, performance, or entertainment.
8. 3D model synthesis: GANs can generate high-quality, realistic 3D models, which can be used
in architecture, design, or engineering.
Disadvantages of Generative Adversarial Network (GAN) use cases:
1. Training difficulty: GANs can be difficult to train and require a lot of computational resources,
which can be a barrier for some applications.
2. Overfitting: GANs can overfit to the training data, producing synthetic data that is too similar
to the training data and lacking diversity.
3. Bias and fairness: GANs can reflect the biases and unfairness present in the training data,
leading to discriminatory or biased synthetic data.
4. Interpretability and accountability: GANs can be opaque and difficult to interpret or explain,
making it challenging to ensure accountability, transparency, or fairness in their applications.
5. Quality control: GANs can generate unrealistic or irrelevant synthetic data if the generator and
discriminator are not properly trained, which can affect the quality of the results.
Artificial Neural Network (ANN)
An artificial neural network is built from the following layers:
1. Input Layer: It’s the layer in which we give input to our model. The number of neurons in this layer is equal to the total number of features in our data (the number of pixels in the case of an image).
2. Hidden Layer: The input from the input layer is then fed into the hidden layer. There can be many hidden layers, depending on our model and data size. Each hidden layer can have a different number of neurons, generally greater than the number of features. The output of each layer is computed by matrix multiplication of the output of the previous layer with the learnable weights of that layer, followed by the addition of learnable biases and an activation function, which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic function like
sigmoid or softmax which converts the output of each class into the probability score of each class.
The data is fed into the model and the output of each layer is computed as above; this forward computation is called the feedforward pass. We then calculate the error using an error function; some common error functions are cross-entropy, squared error, etc. The error function measures how well the network is performing. After that, we propagate derivatives back through the model; this step is called backpropagation and is used to minimize the loss.
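A minimal sketch (not from the text) of this feedforward-and-backpropagation cycle, using a small multi-layer perceptron on dummy data:
Python3
# One training step: feedforward, error computation, backpropagation, update.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),    # hidden layer
    nn.Linear(64, 3),                # output layer (3 classes, logits)
)
loss_fn = nn.CrossEntropyLoss()      # softmax + cross-entropy error function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 20)              # dummy batch of 32 samples, 20 features
y = torch.randint(0, 3, (32,))       # dummy class labels

logits = model(x)                    # feedforward pass
loss = loss_fn(logits, y)            # measure the error
loss.backward()                      # backpropagation: compute derivatives
optimizer.step()                     # update weights to reduce the loss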
Convolution Neural Network
A Convolutional Neural Network (CNN) is an extended version of the artificial neural network (ANN) that is predominantly used to extract features from grid-like data. Typical examples are visual datasets such as images or videos, where spatial patterns play an extensive role.
CNN architecture
Convolutional Neural Network consists of multiple layers like the input layer, Convolutional layer,
Pooling layer, and fully connected layers.
Simple CNN architecture
The Convolutional layer applies filters to the input image to extract features, the Pooling layer
downsamples the image to reduce computation, and the fully connected layer makes the final
prediction. The network learns the optimal filters through backpropagation and gradient descent.
How Convolutional Layers works
Convolutional neural networks, or convnets, are neural networks that share their parameters. Imagine you have an image. It can be represented as a cuboid having a length and width (the dimensions of the image) and a height (the channels, as images generally have red, green, and blue channels).
Now imagine taking a small patch of this image and running a small neural network, called a filter or kernel, on it, with say K outputs, and representing them vertically. Now slide that neural network across the whole image; as a result, we get another image with a different width, height, and depth. Instead of just the R, G, and B channels, we now have more channels but smaller width and height. This operation is called convolution. If the patch size were the same as that of the image, it would be a regular neural network. Because of this small patch, we have fewer weights.
Now let’s talk about a bit of mathematics that is involved in the whole convolution process.
Convolution layers consist of a set of learnable filters (or kernels) having small widths and
heights and the same depth as that of input volume (3 if the input layer is image input).
For example, if we have to run a convolution on an image with dimensions 34x34x3, the possible size of the filters can be a x a x 3, where 'a' can be 3, 5, or 7, but small compared to the image dimensions.
During the forward pass, we slide each filter across the whole input volume step by step where
each step is called stride (which can have a value of 2, 3, or even 4 for high-dimensional images)
and compute the dot product between the kernel weights and patch from input volume.
As we slide our filters we’ll get a 2-D output for each filter and we’ll stack them together as a
result, we’ll get output volume having a depth equal to the number of filters. The network will
learn all the filters.
Flattening: The resulting feature maps are flattened into a one-dimensional vector after the convolution and pooling layers so they can be passed into a fully connected layer for classification or regression.
Fully Connected Layers: These take the input from the previous layer and compute the final classification or regression output.
Output Layer: The output from the fully connected layers is then fed into a logistic function for
classification tasks like sigmoid or softmax which converts the output of each class into the probability
score of each class.
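A minimal sketch (not from the text; the CIFAR-10-sized 3x32x32 input and 10 classes are assumptions) of the convolution, pooling, flattening, and fully connected stages described above:
Python3
# Small CNN: convolution + pooling extract features, Flatten reshapes them,
# and a fully connected layer produces class scores.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
    nn.ReLU(),
    nn.MaxPool2d(2),                              # pooling: 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # pooling: 16x16 -> 8x8
    nn.Flatten(),                                 # flatten feature maps
    nn.Linear(32 * 8 * 8, 10),                    # fully connected output layer
)

x = torch.randn(4, 3, 32, 32)                     # dummy batch of 4 images
logits = cnn(x)
probs = torch.softmax(logits, dim=1)              # class probability scores
print(probs.shape)                                # torch.Size([4, 10])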
Transfer Learning
We humans are very good at transferring knowledge between tasks. This means that whenever we encounter a new problem or task, we recognize it and apply relevant knowledge from our previous learning experiences. This makes our work easy and fast to finish. For instance, if you know how to ride a bicycle and you are asked to ride a motorbike, which you have never done before, your experience with the bicycle will come into play and handle tasks like balancing and steering. This makes things easier compared to being a complete beginner. Such transfers are very useful in real life, as they make us more effective and allow us to gain more experience. Following the same idea, the term transfer learning was introduced in the field of machine learning. This approach involves using knowledge learned on one task to solve a problem in a related target task. While most machine learning is designed to address a single task, the development of algorithms that facilitate transfer learning is a topic of ongoing interest in the machine-learning community.
What is Transfer Learning?
Transfer learning is a technique in machine learning where a model trained on one task is used as the
starting point for a model on a second task. This can be useful when the second task is similar to the
first task, or when there is limited data available for the second task. By using the learned features
from the first task as a starting point, the model can learn more quickly and effectively on the second
task. This can also help to prevent overfitting, as the model will have already learned general
features that are likely to be useful in the second task.
Many deep neural networks trained on images have a curious phenomenon in common: in the early layers of the network, the model learns low-level features, like edges, colours, and variations of intensity. Such features appear not to be specific to a particular dataset or task, because no matter what type of image we are processing, whether to detect a lion or a car, we have to detect these low-level features. All these features occur regardless of the exact cost function or image dataset. Thus, features learned in one task, such as detecting lions, can be reused in other tasks, such as detecting humans.
How does Transfer Learning work?
Transfer Learning
Low-level features learned for task A should be beneficial for learning a model for task B.
This is what transfer learning is. Nowadays, it is rare to see people training whole convolutional neural networks from scratch; it is common instead to use a model pre-trained on a large image dataset for a similar task, e.g. models trained on ImageNet (1.2 million images with 1000 categories), and to use its features to solve a new task. When dealing with transfer learning, we come across the notion of freezing layers. A layer, whether a CNN layer, a hidden layer, a block of layers, or any subset of the layers, is said to be frozen when it is no longer trained; the weights of frozen layers are not updated during training, while layers that are not frozen follow the regular training procedure. When we use transfer learning to solve a problem, we select a pre-trained model as our base model. There are then two possible approaches to using knowledge from the pre-trained model. The first is to freeze a few layers of the pre-trained model and train the remaining layers on our new dataset for the new task. The second is to build a new model, taking some features from layers of the pre-trained model and using them in the newly created model. In both cases, we keep some of the learned features and train the rest of the model. This ensures that the features likely to be shared by both tasks are taken from the pre-trained model, while the rest of the model is adapted to the new dataset by training. A sketch of the first approach is given below.
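A minimal sketch of the first approach (assumptions: torchvision's ResNet-18 as the pre-trained base and a hypothetical 5-class target task): all pre-trained layers are frozen and only a newly attached classification head is trained.
Python3
# Transfer learning by freezing a pre-trained backbone (requires a recent
# torchvision with the weights API) and training a new output head.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for param in model.parameters():       # freeze all pre-trained layers
    param.requires_grad = False

num_classes = 5                        # hypothetical new task
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new trainable head

# Only the new head's parameters are passed to the optimizer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)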
Disadvantages of Transfer Learning:
Domain mismatch: The pre-trained model may not be well-suited to the second task if the
two tasks are vastly different or the data distribution between the two tasks is very different.
Overfitting: Transfer learning can lead to overfitting if the model is fine-tuned too much on
the second task, as it may learn task-specific features that do not generalize well to new data.
Complexity: The pre-trained model and the fine-tuning process can be computationally
expensive and may require specialized hardware.