Unit III

Generative Adversarial Network (GAN)

GAN (Generative Adversarial Network) represents a cutting-edge approach to generative modeling within deep learning, often leveraging architectures like convolutional neural networks. The goal of generative modeling is to autonomously identify patterns in input data, enabling the model to produce new examples that feasibly resemble the original dataset.

Generative Adversarial Networks (GANs) are a powerful class of neural networks used for unsupervised learning. A GAN is made up of two neural networks, a generator and a discriminator, which use adversarial training to produce artificial data that closely resembles actual data.

The Generator attempts to fool the Discriminator, which is tasked with accurately distinguishing between generated and genuine data, by transforming random noise samples into realistic outputs. This competitive interaction drives both networks to improve, and as a result the generator learns to produce realistic, high-quality samples. GANs are proving to be highly versatile artificial intelligence tools, as evidenced by their extensive use in image synthesis, style transfer, and text-to-image synthesis. They have also revolutionized generative modeling.

Generative Adversarial Networks (GANs) can be broken down into three parts:

Generative: To learn a generative model, which describes how data is generated in terms of a
probabilistic model.

Adversarial: The word adversarial refers to setting one thing up against another. In the context of GANs, this means that the generated output is compared with the real images in the dataset. A mechanism known as the discriminator applies a model that attempts to distinguish between real and fake images.

Networks: Deep neural networks are used as the artificial intelligence (AI) algorithms for training.

Architecture of GANs

A Generative Adversarial Network (GAN) is composed of two primary parts, which are the Generator
and the Discriminator.

Generator Model

The generator model is the key element responsible for creating fresh, realistic data in a Generative Adversarial Network (GAN). The generator takes random noise as input and converts it into complex data samples, such as text or images. It is commonly implemented as a deep neural network.

Through training, the layers of learnable parameters in its design capture the training data's underlying distribution. As it trains, the generator uses backpropagation to fine-tune its parameters, adjusting its output to produce samples that closely mimic real data.

The generator’s ability to generate high-quality, varied samples that can fool the discriminator is what
makes it successful.
Generator Loss

The objective of the generator in a GAN is to produce synthetic samples that are realistic enough to fool the discriminator. The generator achieves this by minimizing its loss function $J_G$. The loss is minimized when the log probability is maximized, i.e., when the discriminator is highly likely to classify the generated samples as real.

Discriminator Model

A discriminator model is an artificial neural network used in Generative Adversarial Networks (GANs) to differentiate between generated and actual input. By evaluating input samples and assigning each a probability of authenticity, the discriminator functions as a binary classifier.

Over time, the discriminator learns to differentiate between genuine data from the dataset and artificial
samples created by the generator. This allows it to progressively hone its parameters and increase its level
of proficiency.

When dealing with image data, its architecture usually uses convolutional layers (or structures appropriate to other modalities). The aim of the adversarial training procedure is to maximize the discriminator's capacity to identify generated samples as fake and real samples as authentic. Through its interaction with the generator, the discriminator grows increasingly discriminating, which helps the GAN produce extremely realistic-looking synthetic data overall. The discriminator aims to reduce its loss by accurately identifying artificial and real samples.

Discriminator Loss

The discriminator minimizes the negative log-likelihood of correctly classifying both generated and real samples. This loss incentivizes the discriminator to accurately categorize generated samples as fake and real samples as genuine.

The formulas for the generator and discriminator losses are given below:

$J_G = -\frac{1}{m}\sum_{i=1}^{m} \log D(G(z_i))$

$J_D = -\frac{1}{m}\sum_{i=1}^{m} \left[ \log D(x_i) + \log\big(1 - D(G(z_i))\big) \right]$

● $\log D(G(z_i))$ is the log-probability the discriminator assigns to a generated sample being real. The generator aims to minimize $J_G$, encouraging the production of samples that the discriminator classifies as real ($D(G(z_i))$ close to 1).
● $J_D$ assesses the discriminator's ability to discern between generated and actual samples.

MinMax Loss
In a Generative Adversarial Network (GAN), the minimax loss is given by:

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

Where,

● $G$ is the generator network and $D$ is the discriminator network.
● $x$ represents actual data samples drawn from the true data distribution $p_{data}(x)$.
● $z$ represents random noise sampled from a prior distribution $p_z(z)$ (usually a normal or uniform distribution).
● $D(x)$ is the discriminator's estimated probability that actual data is real.
● $D(G(z))$ is the discriminator's estimated probability that generated data is real.
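
To make the objective concrete, here is a minimal NumPy sketch of how the two loss terms could be estimated on a mini-batch. This is a sketch under assumptions: `D` and `G` are assumed to be callables returning probabilities and generated samples respectively, and the generator term uses the commonly used non-saturating form rather than the raw minimax term.

import numpy as np

def gan_losses(D, G, x_real, z, eps=1e-8):
    # Discriminator outputs: probability that a sample is real
    d_real = D(x_real)   # D(x)
    d_fake = D(G(z))     # D(G(z))
    # Discriminator loss: negative log-likelihood of correct classification
    j_d = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    # Generator loss (non-saturating form): push D(G(z)) toward 1
    j_g = -np.mean(np.log(d_fake + eps))
    return j_d, j_g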


Types of GANs

Vanilla GAN: This is the simplest type of GAN. Here, the Generator and the Discriminator are simple multi-layer perceptrons. The algorithm itself is straightforward: it tries to optimize the minimax equation using stochastic gradient descent.

Conditional GAN (CGAN): CGAN can be described as a deep learning method in which some
conditional parameters are put into place. In CGAN, an additional parameter ‘y’ is added to the Generator
for generating the corresponding data. Labels are also put into the input to the Discriminator in order for
the Discriminator to help distinguish the real data from the fake generated data.

Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular and most successful implementations of GAN. It is composed of ConvNets in place of multi-layer perceptrons. The ConvNets are implemented without max pooling, which is replaced by strided convolution, and the layers are not fully connected.

Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible image representation consisting of a set of band-pass images, spaced an octave apart, plus a low-frequency residual. This approach uses multiple Generator and Discriminator networks at different levels of the Laplacian pyramid. It is mainly used because it produces very high-quality images: the image is first down-sampled at each layer of the pyramid, then up-scaled again at each layer in a backward pass, where the image acquires some noise from the Conditional GAN at these levels until it reaches its original size.

Super Resolution GAN (SRGAN): SRGAN, as the name suggests, is a way of designing a GAN in which a deep neural network is used along with an adversarial network in order to produce higher-resolution images. This type of GAN is particularly useful in optimally up-scaling native low-resolution images to enhance their details while minimizing errors.


How does a GAN work?

The steps involved in how a GAN works are as follows:

Initialization: Two neural networks are created: a Generator (G) and a Discriminator (D).

G is tasked with creating new data, like images or text, that closely resembles real data.

D acts as a critic, trying to distinguish between real data (from a training dataset) and the data generated
by G.

Generator’s First Move: G takes a random noise vector as input. This noise vector contains random
values and acts as the starting point for G’s creation process. Using its internal layers and learned
patterns, G transforms the noise vector into a new data sample, like a generated image.

Discriminator's Turn: D receives two kinds of inputs: real data samples from the training dataset, and the data samples generated by G in the previous step. D's job is to analyze each input and determine whether it's real data or something G cooked up. It outputs a probability score between 0 and 1. A score of 1 indicates the data is likely real, and 0 suggests it's fake.

The Learning Process: Now the adversarial part comes in. If D correctly identifies real data as real (score close to 1) and generated data as fake (score close to 0), both networks are doing their jobs well. However, the key is continuous improvement: if D consistently identifies everything correctly, it won't learn much, so the goal is for G to eventually trick D.

Generator’s Improvement:

When D mistakenly labels G’s creation as real (score close to 1), it’s a sign that G is on the right track. In
this case, G receives a significant positive update, while D receives a penalty for being fooled. This
feedback helps G improve its generation process to create more realistic data.

Discriminator's Adaptation:
Conversely, if D correctly identifies G's fake data (score close to 0), G receives a penalty and D is further strengthened in its discrimination abilities. This ongoing duel between G and D refines both networks over time. As training progresses, G gets better at generating realistic data, making it harder for D to tell the difference. Ideally, G becomes so adept that D can't reliably distinguish real from fake data. At this point, G is considered well-trained and can be used to generate new, realistic data samples.
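
The alternating procedure above can be condensed into a short Python-style sketch. The helper names here (`sample_real`, `sample_noise`, `update_discriminator`, `update_generator`) are hypothetical stand-ins for a concrete implementation:

for step in range(num_steps):
    # 1) Train D: real samples labeled 1, generated samples labeled 0
    x_real = sample_real(batch_size)
    x_fake = G(sample_noise(batch_size))
    d_loss = update_discriminator(D, x_real, x_fake)
    # 2) Train G through a frozen D: generated samples labeled 1,
    #    so G is rewarded when D is fooled
    z = sample_noise(batch_size)
    g_loss = update_generator(G, D, z)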

Discriminator

A discriminator that tells how real an image is, is basically a deep Convolutional Neural Network (CNN), as shown in Figure 1. For the MNIST dataset, the input is an image (28 pixels x 28 pixels x 1 channel). The sigmoid output is a scalar probability of how real the image is (0.0 is certainly fake, 1.0 is certainly real, anything in between is a gray area). The difference from a typical CNN is the absence of max-pooling between layers; instead, strided convolution is used for downsampling. The activation function in each CNN layer is a leaky ReLU. Dropout between 0.4 and 0.7 between layers prevents overfitting and memorization. Listing 1 shows the implementation in Keras.

Figure 1. The discriminator of the DCGAN tells how real an input image of a digit is. The MNIST dataset is used as ground truth for real images. Strided convolution instead of max-pooling downsamples the image.

# Assumed imports for this snippet (Keras):
from keras.models import Sequential
from keras.layers import Conv2D, Dropout, Flatten, Dense, Activation, LeakyReLU

self.D = Sequential()
depth = 64
dropout = 0.4
# In:  28 x 28 x 1
# Out: 14 x 14 x 64
input_shape = (self.img_rows, self.img_cols, self.channel)
self.D.add(Conv2D(depth*1, 5, strides=2, input_shape=input_shape,
    padding='same', activation=LeakyReLU(alpha=0.2)))
self.D.add(Dropout(dropout))
self.D.add(Conv2D(depth*2, 5, strides=2, padding='same',
    activation=LeakyReLU(alpha=0.2)))
self.D.add(Dropout(dropout))
self.D.add(Conv2D(depth*4, 5, strides=2, padding='same',
    activation=LeakyReLU(alpha=0.2)))
self.D.add(Dropout(dropout))
self.D.add(Conv2D(depth*8, 5, strides=1, padding='same',
    activation=LeakyReLU(alpha=0.2)))
self.D.add(Dropout(dropout))
# Out: 1-dim probability that the input image is real
self.D.add(Flatten())
self.D.add(Dense(1))
self.D.add(Activation('sigmoid'))
self.D.summary()

Listing 1. Discriminator implemented in Keras.

Generator

The generator synthesizes fake images. In Figure 2, the fake image is generated from 100-dimensional noise (uniform distribution between -1.0 and 1.0) using the inverse of convolution, called transposed convolution. Instead of the fractionally-strided convolution suggested in DCGAN, upsampling between the first three layers is used, since it synthesizes more realistic handwriting images. Between layers, batch normalization stabilizes learning. The activation function after each layer is a ReLU. The sigmoid output at the last layer produces the fake image. Dropout of between 0.3 and 0.5 at the first layer prevents overfitting. Listing 2 shows the implementation in Keras.

Figure 2. The generator model synthesizes fake MNIST images from noise. Upsampling is used instead of fractionally-strided transposed convolution.

# Assumed imports for this snippet (Keras):
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation, Reshape, Dropout
from keras.layers import UpSampling2D, Conv2DTranspose

self.G = Sequential()
dropout = 0.4
depth = 64+64+64+64   # 256
dim = 7
# In:  100-dimensional noise vector
# Out: dim x dim x depth
self.G.add(Dense(dim*dim*depth, input_dim=100))
self.G.add(BatchNormalization(momentum=0.9))
self.G.add(Activation('relu'))
self.G.add(Reshape((dim, dim, depth)))
self.G.add(Dropout(dropout))
# In:  dim x dim x depth
# Out: 2*dim x 2*dim x depth/2
self.G.add(UpSampling2D())
self.G.add(Conv2DTranspose(int(depth/2), 5, padding='same'))
self.G.add(BatchNormalization(momentum=0.9))
self.G.add(Activation('relu'))
self.G.add(UpSampling2D())
self.G.add(Conv2DTranspose(int(depth/4), 5, padding='same'))
self.G.add(BatchNormalization(momentum=0.9))
self.G.add(Activation('relu'))
self.G.add(Conv2DTranspose(int(depth/8), 5, padding='same'))
self.G.add(BatchNormalization(momentum=0.9))
self.G.add(Activation('relu'))
# Out: 28 x 28 x 1 grayscale image, [0.0, 1.0] per pixel
self.G.add(Conv2DTranspose(1, 5, padding='same'))
self.G.add(Activation('sigmoid'))
self.G.summary()
return self.G   # this snippet is the body of a generator() method

Listing 2. Generator implemented in Keras.

GAN Model

So far, we have only defined the networks; it is time to build the models for training. We need two models: 1) the Discriminator Model (the police) and 2) the Adversarial Model, or Generator-Discriminator (the counterfeiter learning from the police).

Discriminator Model

Listing 3 shows the Keras code for the Discriminator Model. It is the Discriminator described above with a loss function defined for training. Since the output of the Discriminator is a sigmoid, we use binary cross-entropy for the loss. RMSprop as the optimizer generates more realistic fake images than Adam in this case. The learning rate is 0.0008. Weight decay and clip value stabilize learning during the latter part of training. You have to adjust the decay if you adjust the learning rate.

# Assumed import: from keras.optimizers import RMSprop (older Keras API with lr/decay)
optimizer = RMSprop(lr=0.0008, clipvalue=1.0, decay=6e-8)
self.DM = Sequential()
self.DM.add(self.discriminator())
self.DM.compile(loss='binary_crossentropy', optimizer=optimizer,
    metrics=['accuracy'])

Listing 3. Discriminator Model implemented in Keras.

Adversarial Model

The adversarial model is just the generator and discriminator stacked together, as shown in Figure 3. The Generator part is trying to fool the Discriminator while learning from its feedback at the same time. Listing 4 shows the implementation in Keras. The training parameters are the same as in the Discriminator Model except for a reduced learning rate and a correspondingly reduced weight decay.
Figure 3. The Adversarial Model is simply the generator with its output connected to the input of the discriminator. Also shown is the training process, wherein the Generator labels its fake image output with 1.0, trying to fool the Discriminator.

optimizer = RMSprop(lr=0.0004, clipvalue=1.0, decay=3e-8)
self.AM = Sequential()
self.AM.add(self.generator())
self.AM.add(self.discriminator())
self.AM.compile(loss='binary_crossentropy', optimizer=optimizer,
    metrics=['accuracy'])

Listing 4. Adversarial Model as shown in Figure 3 implemented in Keras.

Training

Training is the hardest part. We first verify that the Discriminator Model is learning correctly by training it alone on real and fake images. Afterwards, the Discriminator and Adversarial Models are trained one after the other. Figure 4 shows the Discriminator Model and Figure 3 the Adversarial Model during training. Listing 5 shows the training code in Keras.
Figure 4. Discriminator model is trained to distinguish real from fake handwritten images.

# One training step (assumes: import numpy as np):
# first train D on a labeled real/fake batch, then train G through the stacked model.
images_train = self.x_train[np.random.randint(0,
    self.x_train.shape[0], size=batch_size), :, :, :]
noise = np.random.uniform(-1.0, 1.0, size=[batch_size, 100])
images_fake = self.generator.predict(noise)
x = np.concatenate((images_train, images_fake))
y = np.ones([2*batch_size, 1])
y[batch_size:, :] = 0                      # real samples labeled 1, fake labeled 0
d_loss = self.discriminator.train_on_batch(x, y)
y = np.ones([batch_size, 1])               # fake samples labeled 1 to fool D
noise = np.random.uniform(-1.0, 1.0, size=[batch_size, 100])
a_loss = self.adversarial.train_on_batch(noise, y)

Listing 5. Sequential training of the Discriminator Model and Adversarial Model. More than about 1,000 training steps generate respectable outputs.

Sample Output

What are Optimizers in Deep Learning?

In deep learning, optimizers are crucial algorithms that dynamically fine-tune a model's parameters throughout the training process, aiming to minimize a predefined loss function. These specialized algorithms facilitate the learning process of neural networks by iteratively refining the weights and biases based on the feedback received from the data. Well-known optimizers in deep learning include Stochastic Gradient Descent (SGD), Adam, and RMSprop, each equipped with distinct update rules, learning rates, and momentum strategies, all geared towards the overarching goal of converging upon optimal model parameters and thereby enhancing overall performance.

Choosing the Right Optimizer

Optimizer algorithms are optimization methods that help improve a deep learning model's performance. These optimization algorithms, or optimizers, widely affect the accuracy and training speed of the deep learning model. But first of all, the question arises: what is an optimizer?
While training a deep learning model, an optimizer modifies the weights at each epoch and minimizes the loss function. An optimizer is a function or algorithm that adjusts the attributes of the neural network, such as its weights and learning rate. Thus, it helps in reducing the overall loss and improving accuracy. Choosing the right weights for the model is a daunting task, as a deep learning model generally consists of millions of parameters. This raises the need to choose a suitable optimization algorithm for your application. Hence, understanding these algorithms is necessary for data scientists before diving deep into the field.

You can use different optimizers in a machine learning model to change the weights and learning rate. However, the best optimizer depends upon the application. When dealing with hundreds of gigabytes of data, even a single epoch can take considerable time, so randomly choosing an algorithm amounts to gambling with your time, as you will realize sooner or later.

There are various deep-learning optimizers, such as Gradient Descent, Stochastic Gradient Descent,
Stochastic Gradient descent with momentum, Mini-Batch Gradient Descent, Adagrad, RMSProp,
AdaDelta, and Adam.

Important Deep Learning Terms

Before proceeding, there are a few terms that you should be familiar with.

• Epoch – The number of times the algorithm runs over the whole training dataset.

• Sample – A single row of a dataset.

• Batch – The number of samples used for one update of the model parameters.

• Learning rate – A parameter that controls how much the model weights are updated at each step.

• Cost Function/Loss Function – A function used to calculate the cost, which is the difference between the predicted value and the actual value.

• Weights/Bias – The learnable parameters in a model that control the signal between two neurons.

Gradient Descent Deep Learning Optimizer

Gradient Descent can be considered the popular kid among the class of optimizers in deep learning. This optimization algorithm uses calculus to iteratively modify parameter values and approach a local minimum. Before moving ahead, you might question what a gradient is.

In simple terms, imagine you are holding a ball resting at the top of a bowl. When you release the ball, it rolls along the steepest direction and eventually settles at the bottom of the bowl. The gradient points the ball in the steepest direction toward the local minimum, which is the bottom of the bowl.
The gradient descent update is

$w_{t+1} = w_t - \alpha \, \nabla J(w_t)$

Here alpha is the step size that determines how far to move against the gradient at each iteration.

Gradient descent works as follows:

1. Initialize Coefficients: Start with initial coefficients.

2. Evaluate Cost: Calculate the cost associated with these coefficients.

3. Search for Lower Cost: Look for a cost value lower than the current one.

4. Update Coefficients: Move towards the lower cost by updating the coefficients’ values.

5. Repeat Process: Continue this process iteratively.

6. Reach Local Minimum: Stop when a local minimum is reached, where further cost reduction is
not possible.

Gradient descent works well for most purposes. However, it has some downsides too: computing the gradients is expensive when the dataset is huge, and while gradient descent works well for convex functions, it has no way of knowing how far to travel along the gradient for nonconvex functions.
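
As a concrete illustration, here is a minimal NumPy sketch of batch gradient descent. The `grad` argument is an assumed stand-in for the gradient of the cost with respect to the weights:

import numpy as np

def gradient_descent(grad, w0, alpha=0.1, n_iters=100):
    # Repeatedly apply w <- w - alpha * grad(w)
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        w = w - alpha * grad(w)  # step against the gradient
    return w

# Example: minimize f(w) = ||w||^2, whose gradient is 2w
w_min = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0])
print(w_min)  # close to [0, 0]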

Stochastic Gradient Descent Deep Learning Optimizer

At the end of the previous section, you learned why there might be better options than using gradient
descent on massive data. To tackle the challenges large datasets pose, we have stochastic gradient
descent, a popular approach among optimizers in deep learning. The term stochastic denotes the element
of randomness upon which the algorithm relies. In stochastic gradient descent, instead of processing the
entire dataset during each iteration, we randomly select batches of data. This implies that only a few
samples from the dataset are considered at a time, allowing for more efficient and computationally
feasible optimization in deep learning models.

The procedure first selects the initial parameters $w$ and learning rate $\eta$. The update for a single sample is $w \leftarrow w - \eta \, \nabla Q_i(w)$, where $Q_i(w)$ is the loss computed on the $i$-th sample. The data is then randomly shuffled at each iteration so that the algorithm reaches an approximate minimum.

Since we are not using the whole dataset but only batches of it at each iteration, the path taken by the algorithm is noisy compared to the gradient descent algorithm. Thus, SGD needs a higher number of iterations to reach the local minimum, and this increases the overall number of update steps. But even with more iterations, the total computational cost is still lower than that of the full gradient descent optimizer. The conclusion: if the data is enormous and computation time is an essential factor, stochastic gradient descent should be preferred over the batch gradient descent algorithm.
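
A minimal NumPy sketch of the per-sample update; `grad_i` is an assumed stand-in that returns the gradient of the loss on a single (x, y) pair:

import numpy as np

def sgd(grad_i, X, y, w0, lr=0.01, epochs=10):
    w = np.asarray(w0, dtype=float)
    n = len(X)
    for _ in range(epochs):
        for i in np.random.permutation(n):      # shuffle each epoch
            w = w - lr * grad_i(w, X[i], y[i])  # update on one sample
    return w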

Stochastic Gradient Descent With Momentum Deep Learning Optimizer


Stochastic Gradient Descent requires a larger number of iterations to reach the optimal minimum, so training is slow. To overcome this problem, we use stochastic gradient descent with momentum. The update is

$v_t = \gamma \, v_{t-1} + \eta \, \nabla J(w_t), \qquad w_{t+1} = w_t - v_t$

Here $v_t$ is called velocity, and it accelerates gradients in the direction that leads to convergence.

Momentum helps the loss function converge faster. Plain stochastic gradient descent oscillates between directions of the gradient while updating the weights; adding a fraction of the previous update to the current update makes the process faster. One thing to remember while using this algorithm is that the learning rate should be decreased when using a high momentum term.
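
A sketch of the momentum update under the same assumptions (`grad` is a stand-in for the gradient function):

import numpy as np

def sgd_momentum(grad, w0, lr=0.01, gamma=0.9, n_iters=100):
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)                 # velocity
    for _ in range(n_iters):
        v = gamma * v + lr * grad(w)     # accumulate a fraction of past updates
        w = w - v
    return w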

Mini Batch Gradient Descent Deep Learning Optimizer

In this variant of gradient descent, instead of taking all the training data, only a subset of the dataset is
used for calculating the loss function. Since we are using a batch of data instead of taking the whole
dataset, fewer iterations are needed. That is why the mini-batch gradient descent algorithm is faster than
both stochastic gradient descent and batch gradient descent algorithms. This algorithm is more efficient
and robust than the earlier variants of gradient descent. As the algorithm uses batching, all the training
data need not be loaded in the memory, thus making the process more efficient to implement. Moreover,
the cost function in mini-batch gradient descent is noisier than the batch gradient descent algorithm but
smoother than that of the stochastic gradient descent algorithm. Because of this, mini-batch gradient
descent is ideal and provides a good balance between speed and accuracy.

Despite all that, the mini-batch gradient descent algorithm has some downsides too. It introduces a hyperparameter, the mini-batch size, which needs to be tuned to achieve the required accuracy (that said, a batch size of 32 is considered appropriate for almost every case). Also, in some cases it results in poor final accuracy, which motivates the search for further alternatives.
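
A sketch of the mini-batch loop, assuming `grad_batch` returns the gradient averaged over a batch of samples:

import numpy as np

def minibatch_gd(grad_batch, X, y, w0, lr=0.01, batch_size=32, epochs=10):
    w = np.asarray(w0, dtype=float)
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]       # one mini-batch of indices
            w = w - lr * grad_batch(w, X[b], y[b])  # gradient averaged over the batch
    return w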

Adagrad (Adaptive Gradient Descent) Deep Learning Optimizer

The adaptive gradient descent algorithm is slightly different from other gradient descent algorithms: it uses a different learning rate for each parameter at each iteration. The effective learning rate of a parameter depends on how much that parameter has been updated during training; the more a parameter changes, the smaller its learning rate becomes. This modification is highly beneficial because real-world datasets contain both sparse and dense features, so it is unfair to use the same learning rate for all features. The Adagrad update is

$w_{t+1} = w_t - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_\tau^2 + \epsilon}} \, g_t$

Here the effective learning rate $\alpha_t$ differs at each iteration, $\eta$ is a constant, and $\epsilon$ is a small positive value to avoid division by 0.

The benefit of using Adagrad is that it eliminates the need to tune the learning rate manually. It is more reliable than plain gradient descent and its variants, and it reaches convergence faster. One downside of the AdaGrad optimizer is that it decreases the learning rate aggressively and monotonically: the squared gradients in the denominator keep accumulating, so the denominator keeps increasing, and there may come a point where the learning rate becomes extremely small. With such small learning rates the model eventually becomes unable to acquire new knowledge, and the accuracy of the model is compromised.
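
A sketch of the Adagrad update rule above (same `grad` stand-in as before):

import numpy as np

def adagrad(grad, w0, eta=0.01, eps=1e-8, n_iters=100):
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)                      # running sum of squared gradients
    for _ in range(n_iters):
        g = grad(w)
        G += g ** 2                           # accumulates monotonically
        w = w - (eta / np.sqrt(G + eps)) * g  # per-parameter learning rate
    return w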

RMS Prop (Root Mean Square) Deep Learning Optimizer

RMSprop is one of the popular optimizers among deep learning enthusiasts, perhaps surprisingly so given that it was never formally published. RMSprop is essentially an extension of Rprop (resilient propagation) and resolves the problem of widely varying gradients: some gradients are small while others may be huge, so defining a single learning rate might not be the best idea. Rprop uses the sign of the gradient, adapting the step size individually for each weight. In that algorithm, the two most recent gradients are first compared by sign: if they have the same sign, we are going in the right direction and the step size is increased by a small fraction; if they have opposite signs, the step size is decreased. The step size is then bounded, and the weight update is applied.

The problem with Rprop is that it doesn't work well with large datasets or mini-batch updates. Achieving the robustness of Rprop together with the efficiency of mini-batches was the main motivation behind RMSprop. RMSprop is also an improvement over the AdaGrad optimizer, as it avoids the monotonically decreasing learning rate.

RMS Prop Formula

The algorithm mainly focuses on accelerating the optimization process by decreasing the number of function evaluations needed to reach the local minimum. It keeps a moving average of squared gradients for every weight and divides the gradient by the square root of this mean square:

$v_t = \beta \, v_{t-1} + (1-\beta) \, g_t^2$

where $\beta$ is the forgetting factor. Weights are updated by

$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \, g_t$

In simpler terms, if there exists a parameter due to which the cost function oscillates a lot, we want to penalize its updates. Suppose you built a model to classify a variety of fishes, and the model relies mainly on the feature 'color' to differentiate between them, causing many errors. What RMSprop does is penalize the updates of the 'color' parameter so that the model can rely on other features too; this prevents the algorithm from adapting too quickly to changes in the 'color' parameter compared to other parameters. This algorithm has several benefits compared to earlier versions of gradient descent: it converges quickly and requires less tuning than plain gradient descent algorithms and their variants.

The problem with RMSprop is that the learning rate still has to be defined manually, and the suggested default value doesn't work for every application.
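
A sketch of the RMSprop update (again with an assumed `grad` function):

import numpy as np

def rmsprop(grad, w0, eta=0.001, beta=0.9, eps=1e-8, n_iters=100):
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)                    # moving average of squared gradients
    for _ in range(n_iters):
        g = grad(w)
        v = beta * v + (1 - beta) * g ** 2  # leaky average; old gradients decay
        w = w - eta * g / (np.sqrt(v) + eps)
    return w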
AdaDelta Deep Learning Optimizer

AdaDelta can be seen as a more robust version of the AdaGrad optimizer. It is based on adaptive learning and is designed to deal with significant drawbacks of AdaGrad and RMSprop: the initial learning rate of those optimizers must be defined manually, and the decaying learning rate becomes infinitesimally small at some point, after which the model can no longer learn anything new.

To deal with these problems, AdaDelta uses two state variables: a leaky average of the second moment of the gradient, and a leaky average of the second moment of the parameter updates:

$s_t = \rho \, s_{t-1} + (1-\rho) \, g_t^2$
$g'_t = \sqrt{\frac{\Delta x_{t-1} + \epsilon}{s_t + \epsilon}} \, g_t$
$\Delta x_t = \rho \, \Delta x_{t-1} + (1-\rho) \, g'^2_t$
$w_{t+1} = w_t - g'_t$

Here $s_t$ and $\Delta x_t$ denote the state variables, $g'_t$ denotes the rescaled gradient, $\Delta x_{t-1}$ denotes the leaky average of squared rescaled gradients, and $\epsilon$ is a small positive constant to handle division by 0.
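
A sketch of the AdaDelta updates above (same assumptions as the earlier sketches):

import numpy as np

def adadelta(grad, w0, rho=0.9, eps=1e-6, n_iters=100):
    w = np.asarray(w0, dtype=float)
    s = np.zeros_like(w)     # leaky average of squared gradients
    dx2 = np.zeros_like(w)   # leaky average of squared parameter updates
    for _ in range(n_iters):
        g = grad(w)
        s = rho * s + (1 - rho) * g ** 2
        g_rescaled = np.sqrt(dx2 + eps) / np.sqrt(s + eps) * g  # no global lr needed
        dx2 = rho * dx2 + (1 - rho) * g_rescaled ** 2
        w = w - g_rescaled
    return w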

Adam Optimizer in Deep Learning

Adam optimizer, short for Adaptive Moment Estimation optimizer, is an optimization algorithm
commonly used in deep learning. It is an extension of the stochastic gradient descent (SGD) algorithm
and is designed to update the weights of a neural network during training.

The name “Adam” is derived from “adaptive moment estimation,” highlighting its ability to adaptively
adjust the learning rate for each network weight individually. Unlike SGD, which maintains a single
learning rate throughout training, Adam optimizer dynamically computes individual learning rates based
on the past gradients and their second moments.

The creators of the Adam optimizer incorporated the beneficial features of other optimization algorithms such as AdaGrad and RMSProp. Similar to RMSProp, the Adam optimizer considers the second moment of the gradients, computed as an uncentered variance (i.e., without subtracting the mean).

By incorporating both the first moment (mean) and second moment (uncentered variance) of the
gradients, Adam optimizer achieves an adaptive learning rate that can efficiently navigate the
optimization landscape during training. This adaptivity helps in faster convergence and improved
performance of the neural network.

In summary, Adam optimizer is an optimization algorithm that extends SGD by dynamically adjusting
learning rates based on individual weights. It combines the features of AdaGrad and RMSProp to provide
efficient and adaptive updates to the network weights during deep learning training.

Adam Optimizer Formula

The Adam optimizer has several benefits, due to which it is widely used. It is adopted as a benchmark in deep learning papers and recommended as a default optimization algorithm. Moreover, the algorithm is straightforward to implement, has a fast running time and low memory requirements, and requires less tuning than other optimization algorithms. The update rules are

$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$
$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$
$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$

Here $\beta_1$ and $\beta_2$ represent the decay rates of the moving averages of the gradient and the squared gradient.
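
A sketch of the Adam updates, including the bias-correction step (same `grad` stand-in as before):

import numpy as np

def adam(grad, w0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=100):
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)   # first moment (mean of gradients)
    v = np.zeros_like(w)   # second moment (uncentered variance)
    for t in range(1, n_iters + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for zero initialization
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w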

Batch Normalization:

What is Batch Normalization?

Batch normalization was introduced by Sergey Ioffe and Christian Szegedy in 2015 to mitigate the internal covariate shift problem in neural networks. The normalization process involves calculating the mean and variance of each feature in a mini-batch and then scaling and shifting the features using these statistics. This ensures that the input to each layer remains roughly in the same distribution, regardless of changes in the distribution of earlier layers' outputs. Consequently, Batch Normalization helps stabilize the training process, enabling higher learning rates and faster convergence.

Need for Batch Normalization

Batch Normalization extends the concept of normalization from the input layer to the activations of each hidden layer throughout the neural network. By normalizing the activations of each layer, Batch Normalization helps alleviate the internal covariate shift problem, which can hinder the convergence of the network during training. The inputs to each hidden layer are the activations from the previous layer; if these activations are normalized, the network is consistently presented with inputs that have a similar distribution, regardless of the training stage. This stability in the distribution of inputs allows for smoother and more efficient training.

By applying Batch Normalization to the hidden layers of the network, the gradients propagated during backpropagation are less likely to vanish or explode, leading to more stable training dynamics. This ultimately facilitates faster convergence and better performance of the neural network on the given task.

Fundamentals of Batch Normalization

In this section, we are going to discuss the steps taken to perform batch normalization.

Step 1: Compute the Mean and Variance of the Mini-Batch

For a mini-batch of activations $x_1, x_2, \ldots, x_m$, compute the mean $\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$ and variance $\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$.

Step 2: Normalization

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$

Step 3: Scale and Shift the Normalized Activations

$y_i = \gamma \, \hat{x}_i + \beta$, where $\gamma$ and $\beta$ are learnable parameters.

import tensorflow as tf

# Define a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(784,)),
    tf.keras.layers.BatchNormalization(),  # add a Batch Normalization layer
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(10),
    tf.keras.layers.Activation('softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model (x_train and y_train are assumed to be available,
# e.g. flattened 28x28 MNIST images and their integer labels)
model.fit(x_train, y_train, epochs=5, batch_size=32)

What is ReLU?

ReLU stands for Rectified Linear Unit. The function is defined as f(x) = max(0, x), which returns the
input value if it is positive and zero if it is negative. The output of the ReLU function is, therefore, always
non-negative.

The ReLU function has become a popular choice for activation functions in neural networks because it is
computationally efficient and does not suffer from the vanishing gradient problem that can occur with
other activation functions like the sigmoid or hyperbolic tangent functions.

Leaky ReLU
The Leaky ReLU activation function is f(x) = max(0.01*x, x). This function returns x if it receives any positive input, but for any negative value of x it returns a small value, 0.01 times x. Thus it gives a nonzero output for negative values as well.
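
Both activations are easy to verify numerically; a minimal NumPy sketch:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # f(x) = max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)    # f(x) = max(0.01*x, x) for alpha=0.01

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.    0.    0.    1.5 ]
print(leaky_relu(x))  # [-0.02  -0.005  0.    1.5 ]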

Texture Generation Using GANs

Generative Adversarial Networks (GANs) have shown great potential in generating high-quality
images, but they can also be used to generate realistic textures. One of the challenges in texture
generation is to produce samples with high visual quality while preserving the coherence and
consistency of the texture. Spatial GAN builds upon the traditional GAN architecture by
turning the generator and discriminator into fully convolutional networks. In this way, a
spatial input is mapped to an output texture image, which can be grown in size by enlarging
the input. Periodic Spatial GAN (PSGAN) is an extension of Spatial GAN proposed to generate
textures with periodic patterns; it incorporates a periodic input into the generator
network to ensure that the generated textures have a periodic structure. The main drawback of
these methods is memory scalability: increasing the size of the output image increases
the required GPU memory, which hinders generation at unbounded sizes. TileGAN was able
to generate textures of hundreds of megapixels; however, it focuses on a different task, where the
aim was to combine different patches generated by GANs by passing pre-stored latent tiles to
the trained generator. This requires searching for the closest latents using a neighbourhood
similarity match, which is not a scalable and deployable solution.


Generating Music with a Generative Adversarial Network


Introduction

Generative Adversarial Networks (GANs) have become extraordinarily popular in recent years due to
their success with image generation. Websites like thispersondoesnotexist.com showcase the capabilities
of GANs in generating extraordinarily realistic human faces. Given their positive results regarding image
generation, we sought to find out if GANs could be applied to musical compositions with a similar
outcome.

Background

For some context, let's briefly examine what a GAN actually is. GANs consist of two neural networks with conflicting goals: a discriminator and a generator. The discriminator has the task of determining whether the input it is given is "real" or "fake". The generator is challenged with creating authentic-looking content that fools the discriminator into believing it's real. The idea is that when one of these networks gets better at its job, the other has to learn how to better counteract its adversary. This feedback loop yields better and better generated content.

Time-Frequency Data Representation

With this brief introduction to GANs out of the way, we’ll take a look at how we applied this concept to
music. With images, data representation is relatively straightforward; images are just 2-dimensional
arrays with a number of color channels (e.g. 1 channel for greyscale or 3 channels for red-green-blue).
However, music is structured differently than images. A single song can have multiple instruments playing their own parts at any time. This introduces a significant amount of variability that must be captured, certainly too much for a single 2D array. To make this task more feasible, we only used single-track songs from the classical music genre, and we fixed the amount of data that we used from each song.

In order to effectively utilize convolution within a GAN, the data used must maintain translational
invariance. In order to facilitate this, every musical training input was extracted as a 16 beat segment from
a song, where each beat was divided into 24 time slices. Each slice contained a vector of size 128 to hold
the volume of each possible note that could be played. This resulted in our discriminator input matrix
(and generator output matrix) being of size 384 x 128. Songs could be sampled multiple times to produce
additional training samples, with some danger of overlap between samples. These transformation steps
combined with the data filtering discussed in the paragraph above reduced our original input dataset of
about 113,000 midi files to roughly 6,000.

GAN

Convolutional models can be notoriously tricky to train, so we studied examples such as those provided here for guidance. We needed to strike a balance between the additional complexity introduced by deeper layers and larger filters, and the model's potential to underfit the data and produce noise. We also needed to find a training heuristic that would help avoid problems like non-convergence and mode collapse. Keeping these ideas in mind, we came up with the following architectures for our generator and discriminator networks.

Generator architecture

The generator element in our GAN takes a vector of 100 random real numbers as input and feeds it through 5 hidden layers to produce the output song. Each input is drawn from a normal distribution with mean 0 and variance 1. The hidden layers are organized like so: the first is a fully-connected layer whose output is reshaped to (6, 8, 256). This is then fed to a convolutional transpose layer using a (5, 5) filter, followed by a third convolutional layer using a (4, 4) filter. Layers 4 and 5 both use (4, 2) filters to ultimately output a pianoroll matrix of shape (384, 128). The last layer in the generator is a ReLU activation layer, which limits each cell in the output to be between 0 and 2. Every other convolutional layer uses a modified form of ReLU activation, which acts as a pass-through for all positive outputs but reduces the value by a factor of 3 if it is negative. We also use batch normalization between layers to help control the magnitudes of the weights.

Discriminator architecture

The discriminator works in reverse. It takes the (384, 128) song as input and feeds it into a network of 3 convolutional layers to output a single scalar representing the probability that the input is real. The first layer uses a (4, 2) filter to create an output of shape (96, 64, 32). This is fed to the second convolutional layer with a (4, 4) filter, and then to the third layer with another (4, 4) filter. The final layer feeds a fully connected layer with a single output: the class estimate. The same modified ReLU activation seen in the generator is used for each convolutional layer, and a sigmoid activation is applied to the final fully-connected layer. We also utilize dropout layers between the convolutional layers, which randomly reduce 30% of the inputs to 0 to help the model generalize to new data.
We applied cross-entropy as our loss function: the generator loss is based on how well it tricks the discriminator into identifying a fake song as real, while the discriminator loss is the sum of how well it identifies both real and fake songs as their respective classes. Both models use the Adam algorithm as their gradient optimizer, but the discriminator uses a learning rate of 1e-6, while the generator uses a learning rate of 1e-4.

Training, Results, & Evaluation

We trained our model on a randomly selected batch of 200 of the 6,000 input samples for each of 10,000 epochs, saving a generated song every 250 epochs for later analysis and visualization. At the conclusion of training, our model demonstrated that it was able to pick up some structural details early on, but it failed to generalize and ultimately was unable to produce compelling results.

The generator produced random noise after the first training epoch, which aligned well with our expectations. After the first 250 epochs, some musical structure became clear as notes began to be played at specific times in vertical sequence with other notes. However, this pattern did not persist, and as the remaining training epochs progressed the output increasingly resembled random sounds played at random times, albeit at a lower density than when training started.
Given time constraints on completing this project, we were unable to reach more conclusive results; however, there are several key ideas we would use to improve the model if we were to revisit this project in the future. First, the filter sizes used by the discriminator are likely much too small to capture any large structural patterns in the music; these would need to be scaled up significantly to find patterns that persist through each sample. Second, we may not be using enough hidden layers in either model to capture and reproduce the complex structure of music; adding more convolutional layers could potentially improve performance. Third, our data could be selected with stricter criteria, such as enforcing a 4/4 time signature, removing samples that contain key changes, and starting each sample on the first beat of a measure.
