Neural-Networks-Unit-3 Edited
UNIT 3
As the image data progresses through the layers of the CNN, it starts to recognize larger elements or shapes of the object until it finally identifies the intended object.
Convolutional layer
The convolutional layer is the core building block of a CNN, and it is where the majority of
computation occurs. It requires a few components, which are input data, a filter, and a feature map. Let’s
assume that the input will be a color image, which is made up of a matrix of pixels in 3D. This means that
the input will have three dimensions—a height, width, and depth—which correspond to RGB in an image.
We also have a feature detector, also known as a kernel or a filter, which will move across the receptive
fields of the image, checking if the feature is present. This process is known as a convolution.
The feature detector is a two-dimensional (2-D) array of weights, which represents part of the image. While
they can vary in size, the filter size is typically a 3x3 matrix; this also determines the size of the receptive
field. The filter is then applied to an area of the image, and a dot product is calculated between the input
pixels and the filter. This dot product is then fed into an output array. Afterwards, the filter shifts by a stride,
repeating the process until the kernel has swept across the entire image. The final output from the series of
dot products from the input and the filter is known as a feature map, activation map, or a convolved feature.
Note that the weights in the feature detector remain fixed as it moves across the image, which is also known
as parameter sharing. Some parameters, like the weight values, adjust during training through the process of
backpropagation and gradient descent. However, there are three hyperparameters which affect the volume
size of the output that need to be set before the training of the neural network begins. These include:
1. The number of filters affects the depth of the output. For example, three distinct filters would yield
three different feature maps, creating a depth of three.
2. Stride is the distance, or number of pixels, that the kernel moves over the input matrix. While stride
values of two or greater are rare, a larger stride yields a smaller output.
3. Zero-padding is usually used when the filters do not fit the input image. This sets all elements that
fall outside of the input matrix to zero, producing a larger or equally sized output. There are three types
of padding:
Valid padding: This is also known as no padding. In this case, the last convolution is dropped if
dimensions do not align.
Same padding: This padding ensures that the output layer has the same size as the input layer.
Full padding: This type of padding increases the size of the output by adding zeros to the border
of the input.
After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation to the
feature map, introducing nonlinearity to the model.
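To make the sliding-dot-product-plus-ReLU step concrete, here is a minimal NumPy sketch (not part of the original text; the array values are made up for illustration):

```python
import numpy as np

def conv2d_relu(image, kernel, stride=1):
    """Slide `kernel` over `image`, take dot products, then apply ReLU."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            feature_map[i, j] = np.sum(patch * kernel)   # dot product with the receptive field
    return np.maximum(feature_map, 0)                    # ReLU nonlinearity

# Toy 5x5 "image" and a 3x3 vertical-edge filter (illustrative values only)
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])
print(conv2d_relu(image, kernel))   # 3x3 feature map
```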
Pooling layer
Pooling layers, also known as downsampling, conduct dimensionality reduction, reducing the
number of parameters in the input. Similar to the convolutional layer, the pooling operation sweeps a filter
across the entire input, but the difference is that this filter does not have any weights. Instead, the kernel
applies an aggregation function to the values within the receptive field, populating the output array. There
are two main types of pooling:
Max pooling: As the filter moves across the input, it selects the pixel with the maximum value to
send to the output array. As an aside, this approach tends to be used more often compared to
average pooling.
Average pooling: As the filter moves across the input, it calculates the average value within the
receptive field to send to the output array.
While a lot of information is lost in the pooling layer, it also brings a number of benefits to the CNN: it helps
to reduce complexity, improve efficiency, and limit the risk of overfitting.
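As a rough illustration of the two pooling types (a sketch with made-up values, assuming NumPy):

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Aggregate each size x size window with max or average (no learned weights)."""
    h, w = feature_map.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 7., 8.],
                 [3., 2., 1., 0.],
                 [1., 2., 3., 4.]])
print(pool2d(fmap, mode="max"))      # keeps the strongest activation per window
print(pool2d(fmap, mode="average"))  # keeps the mean activation per window
```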
Fully-connected layer
The name of the fully-connected layer aptly describes itself. As mentioned earlier, the pixel values of
the input image are not directly connected to the output layer in partially connected layers. However, in the
fully-connected layer, each node in the output layer connects directly to a node in the previous layer.
This layer performs the task of classification based on the features extracted through the previous layers
and their different filters. While convolutional and pooling layers tend to use ReLU functions, FC layers
usually leverage a softmax activation function to classify inputs appropriately, producing a probability from
0 to 1.
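For illustration only, here is a small NumPy sketch of a fully-connected layer followed by softmax; the flattened feature vector and the weights are made-up values:

```python
import numpy as np

def softmax(z):
    """Convert raw scores into probabilities between 0 and 1 that sum to 1."""
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Made-up example: a flattened feature vector of length 4 and 3 output classes
features = np.array([0.5, -1.2, 3.0, 0.7])
weights = np.random.randn(3, 4) * 0.1   # fully connected: every input feeds every output node
bias = np.zeros(3)

scores = weights @ features + bias
probs = softmax(scores)
print(probs, probs.sum())   # class probabilities, summing to 1
```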
Neural nets represented an immense stride forward in the field of deep learning.
However, it took decades for machine learning (and especially deep learning) to gain prominence.
We’ll explore why in the next section.
Why Deep Learning Did Not Immediately Work
If deep learning was originally conceived decades ago, why is it just beginning to gain momentum today?
It’s because any mature deep learning model requires an abundance of two resources:
Data
Computing power
At the time of deep learning’s conceptual birth, researchers did not have access to enough of either data or
computing power to build and train meaningful deep learning models. This has changed over time, which
has led to deep learning’s prominence today.
As you can see, neurons have quite an interesting structure. Groups of neurons work together inside
the human brain to perform the functionality that we require in our day-to-day lives.
The question that Geoffrey Hinton asked during his seminal research in neural networks was whether we
could build computer algorithms that behave similarly to neurons in the brain. The hope was that by
mimicking the brain’s structure, we might capture some of its capability.
To do this, researchers studied the way that neurons behaved in the brain. One important observation was
that a neuron by itself is useless. Instead, you require networks of neurons to generate any meaningful
functionality.
This is because neurons function by receiving and sending signals. More specifically, the neuron’s dendrites
receive signals and pass along those signals through the axon.
The dendrites of one neuron are connected to the axon of another neuron. These connections are called
synapses, which is a concept that has been generalized to the field of deep learning.
What is a Neuron in Deep Learning?
Neurons in deep learning models are nodes through which data and computations flow.
Neurons work like this:
They receive one or more input signals. These input signals can come from either the raw data set
or from neurons positioned at a previous layer of the neural net.
They perform some calculations.
They send some output signals to neurons deeper in the neural net through a
synapse. Here is a diagram of the functionality of a neuron in a deep learning neural net:
The activation function calculates the output value for the neuron. This output value is then passed on to the
next layer of the neural network through another synapse.
This serves as a broad overview of deep learning neurons. Do not worry if it was a lot to take in – we’ll learn
much more about neurons in the rest of this tutorial. For now, it’s sufficient for you to have a high-level
understanding of how they are structured in a deep learning model.
As the image above suggests, the threshold function is sometimes also called a unit step function.
Threshold functions are similar to boolean variables in computer programming. Their computed value is
either 1 (similar to True) or 0 (equivalent to False).
The Sigmoid Function
The sigmoid function is well-known among the data science community because of its use in logistic
regression, one of the core machine learning techniques used to solve classification problems.
The sigmoid function can accept any value, but always computes a value between 0 and 1.
Here is the mathematical definition of the sigmoid function:
σ(x) = 1 / (1 + e^(−x))
One benefit of the sigmoid function over the threshold function is that its curve is smooth. This means it is
possible to calculate derivatives at any point along the curve.
The Rectifier Function
The rectifier function does not have the same smoothness property as the sigmoid function from the
last section. However, it is still very popular in the field of deep learning.
The rectifier function is defined as follows:
If the input value is less than 0, then the function outputs 0
If not, the function outputs its input value
Here is this concept explained mathematically: f(x) = max(0, x)
Rectifier functions are often called Rectified Linear Unit activation functions, or ReLUs for short.
The Hyperbolic Tangent Function
The hyperbolic tangent function is the only activation function included in this tutorial that is based on a
hyperbolic (rather than ordinary trigonometric) identity.
It’s mathematical definition is below:
lOMoARcPSD|31606405
The hyperbolic tangent function is similar in appearance to the sigmoid function, but its output values are all
shifted downwards.
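As a quick side-by-side sketch of the four activation functions discussed above (assuming plain NumPy; not from the original tutorial):

```python
import numpy as np

def threshold(x):           # unit step: 1 (similar to True) or 0 (similar to False)
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):             # smooth curve, outputs between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):                # 0 for negative inputs, identity otherwise
    return np.maximum(0.0, x)

def tanh(x):                # sigmoid-shaped, but ranges from -1 to 1
    return np.tanh(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (threshold, sigmoid, relu, tanh):
    print(fn.__name__, fn(x))
```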
House age
These four parameters will form the input layer of the artificial neural network. Note that in reality, there are
likely many more parameters that you could use to train a neural network to predict housing prices. We have
constrained this number to four to keep the example reasonably simple.
The Most Basic Form of a Neural Network
In its most basic form, a neural network only has two layers - the input layer and the output layer. The
output layer is the component of the neural net that actually makes predictions.
For example, if you wanted to make predictions using a simple weighted sum (also called linear regression)
model, your neural network would take the following form:
While this diagram is a bit abstract, the point is that most neural networks can be visualized in this manner:
An input layer
Possibly some hidden layers
An output layer
It is the hidden layer of neurons that causes neural networks to be so powerful for calculating predictions.
For each neuron in a hidden layer, it performs calculations using some (or all) of the neurons in the last
layer of the neural network. These values are then used in the next layer of the neural network.
The Purpose of Neurons in the Hidden Layer of a Neural Network
You are probably wondering – what exactly does each neuron in the hidden layer mean? Said differently,
how should machine learning practitioners interpret these values?
Generally speaking, neurons in the hidden layers of a neural net are activated (meaning their activation
function returns 1) for an input value that satisfies certain sub-properties.
For our housing price prediction model, one example might be 5-bedroom houses with small distances to the
city center.
In most other cases, describing the characteristics that would cause a neuron in a hidden layer to activate is
not so easy.
How Neurons Determine Their Input Values
Earlier in this tutorial, I wrote “For each neuron in a hidden layer, it performs calculations using some (or
all) of the neurons in the last layer of the neural network.”
This illustrates an important point – that each neuron in a neural net does not need to use every neuron in the
preceding layer.
The process through which neurons determine which input values to use from the preceding layer of the
neural net is called training the model. We will learn more about training neural nets in the next section of
this course.
Visualizing A Neural Net’s Prediction Process
When visualizing a neural network, we generally draw lines from the previous layer to the current layer
whenever the preceding neuron has a weight above 0 in the weighted sum formula for the current
neuron.
The following image will help visualize this:
As you can see, not every neuron-neuron pair has a synapse. x4 only feeds three out of the five neurons in the
hidden layer, as an example. This illustrates an important point when building neural networks – that not
every neuron in a preceding layer must be used in the next layer of a neural network.
Hard-coding: you use specific parameters to predict whether an animal is a cat. More specifically,
you might say that if an animal's weight and length lie within certain ranges, then it is a cat.
Soft-coding: you provide a data set that contains animals labelled with their species type and
characteristics about those animals. Then you build a computer program to predict whether an animal
is a cat or not based on the characteristics in the data set.
As you might imagine, training neural networks falls into the category of soft-coding. Keep this in mind as
you proceed through this course.
Training A Neural Network Using A Cost Function
Neural networks are trained using a cost function, which is an equation used to measure the error contained
in a network’s prediction.
The formula for a deep learning cost function (of which there are many – this is just one example) is below:
MSE = (ŷ − y)² / 2
where ŷ is the predicted output value and y is the actual output value.
Note: this cost function is called the mean squared error, which is why there is an MSE on the left side of the
equal sign.
While there is plenty of formula mathematics in this equation, it is best summarized as follows:
Take the difference between the predicted output value of an observation and the actual output value of that
observation. Square that difference and divide it by 2.
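As a worked example with made-up numbers: if the actual value is 10 and the prediction is 12, the cost for that observation is (12 − 10)² / 2 = 2. A minimal NumPy sketch that averages this quantity over several observations:

```python
import numpy as np

def mse_cost(y_pred, y_true):
    """Half mean squared error: average of (prediction - actual)^2 / 2."""
    return np.mean((y_pred - y_true) ** 2) / 2.0

y_true = np.array([10.0, 3.0, 7.5])   # actual output values (made up)
y_pred = np.array([12.0, 2.5, 7.0])   # network predictions (made up)
print(mse_cost(y_pred, y_true))       # smaller is better
```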
To reiterate, note that this is simply one example of a cost function that could be used in machine
learning (although it is admittedly the most popular choice). The choice of which cost function to use is a
complex and interesting topic on its own, and outside the scope of this tutorial.
As mentioned, the goal of an artificial neural network is to minimize the value of the cost function. The cost
function is minimized when your algorithm’s predicted value is as close to the actual value as possible.
Said differently, the goal of a neural network is to minimize the error it makes in its predictions!
Modifying A Neural Network
After an initial neural network is created and its cost function is computed, changes are made to the neural
network to see if they reduce the value of the cost function.
More specifically, the actual components of the neural network that are modified are the weights of each
neuron at its synapses that communicate with the next layer of the network.
The mechanism through which the weights are modified to move the neural network to weights with less
error is called gradient descent. For now, it’s enough for you to understand that the process of training
neural networks looks like this:
Initial weights for the input values of each neuron are assigned
Predictions are calculated using these initial values
The predictions are fed into a cost function to measure the error of the neural network
A gradient descent algorithm changes the weights for each neuron’s input values
This process is continued until the weights stop changing (or until the amount of their change at each
iteration falls below a specified threshold)
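The steps above can be sketched as a tiny training loop. This is an illustrative single-weight linear model fitted with gradient descent on made-up data, not the housing example from the text:

```python
import numpy as np

# Made-up data: y is roughly 3*x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 1 + 0.1 * rng.standard_normal(100)

w, b = 0.0, 0.0            # step 1: initial weights
learning_rate = 0.1
for _ in range(500):       # repeat until the weights (almost) stop changing
    y_pred = w * x + b                     # step 2: predictions from current weights
    error = y_pred - y
    cost = np.mean(error ** 2) / 2         # step 3: cost function (half MSE)
    grad_w = np.mean(error * x)            # step 4: gradients of the cost
    grad_b = np.mean(error)
    w -= learning_rate * grad_w            # gradient descent weight update
    b -= learning_rate * grad_b

print(w, b)   # should end up close to 3 and 1
```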
This may seem very abstract - and that’s OK! These concepts are usually only fully understood when you
begin training your first machine learning models.
What is ELM?
ELMs (Extreme Learning Machines) are feedforward neural networks, "invented" in 2006 by G. Huang.
Hence the word "Extreme" in ELM (though the real reason for the name may vary depending on the source).
First, we look at standard SLFN (Single hidden Layer Feedforward Neural network):
Single hidden Layer Feedforward Neural network, Source: Shifei Ding under CC BY 3.0
5. calculate output
6. backpropagate
7. repeat everything
ELM removes step 4 (because it’s always SLFN), replaces step 6 with matrix inverse, and does it only once,
so step 7 goes away as well.
More details
Before going into details we need to look at how ELM output is calculated:
Hβ = T

Where:
m is the number of outputs
H is called the Hidden Layer Output Matrix
β is the output weight matrix
T is the training data target matrix
The theory behind the learning (You can skip this section if you want)
Now we have to dig deeper into the theories behind the network to decide what to do next.
I’m not going to prove those theorems but if you’re interested please refer Page 3, ELM-NC-2006 for further
explanation.
Now what we have to do is to define our cost function. Basing our assumptions on Capabilities of a four-layered feedforward neural network: four layers versus three, we can see that SLFN is a linear system if the input weights and the hidden layer biases can be chosen randomly.
Because our ELM is a linear system, we can create the optimization objective: minimize ‖Hβ − T‖.
To approximate the solution we need to use Rao's and Mitra's work on the Moore-Penrose generalized inverse again.
Now we can see that, because H is invertible in the generalized (Moore-Penrose) sense, we can calculate Beta hat as β̂ = H†T, where H† is the generalized inverse of H.
Learning algorithm
After going through some difficult math, we can now define the learning algorithm. The algorithm itself is
relatively easy:
https://github.com/burnpiro/elm-pure
https://github.com/burnpiro/elm-pure/blob/master/ELM%20example.ipynb
As you can see, a simple version of ELM achieves >91% accuracy on the MNIST dataset, and it takes
around 3 s to train the network on an Intel i7 7820X CPU.
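For orientation, here is a minimal NumPy sketch of the ELM recipe described above (random hidden layer, pseudoinverse output weights); it is not the code from the linked repository, and the data is made up:

```python
import numpy as np

class SimpleELM:
    """Minimal ELM sketch: random hidden layer + pseudoinverse output weights."""

    def __init__(self, n_hidden=100, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, T):
        # 1. Randomly assign input weights and hidden biases (they are never trained)
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        # 2. Compute the hidden layer output matrix H
        H = np.tanh(X @ self.W + self.b)
        # 3. Solve H @ beta = T with the Moore-Penrose pseudoinverse (no backpropagation)
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        H = np.tanh(X @ self.W + self.b)
        return H @ self.beta

# Toy usage with made-up data: 200 samples, 5 features, 3 one-hot classes
X = np.random.randn(200, 5)
T = np.eye(3)[np.random.randint(0, 3, 200)]
model = SimpleELM(n_hidden=50).fit(X, T)
print(model.predict(X[:2]))
```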
Performance comparison
I’m going to use metrics from the original paper in this section and it might surprise you how long some
training is done in compare with previous MNIST example, but remember that original paper was published
in 2006 and networks were trained on Pentium 4 1.9GHz CPU.
Datasets
Results
We can ignore training time for now because it's obvious that gradient descent takes longer than a matrix
inversion. The most important information from this result table is Accuracy and Nodes. In the first two
datasets, you can see that the author used different sizes of BP networks to achieve the same results as ELM. The
BP network was 5x smaller in the first case and 2x smaller in the second case. That affects testing times (it's
faster to run a 100-node NN than a 500-node NN). That tells us how accurate our method is in approximating the
dataset.
It is hard to find any tests of ELM networks on popular datasets but I’ve managed to do so. Here is a
benchmark on CIFAR-10 and MNIST
I didn’t find training times for ELMs so there was no way to compare them with results from other networks
but all those multipliers ( 20x, 30x) are relative differences in training time based on the training of ELM
1000 on CIFAR-10. If there is a 30x time increase between ELM 1000 and ELM 3500 then you can
imagine how long it would take to train DELM which has 15000 neurons.
The Universal Approximation Theorem (UAT) says that Deep Neural Networks (DNNs) are powerful
function approximators.
However, fully connected DNNs (a fully connected network is one in which every neuron in a layer
is connected to all the neurons in the previous layer) are prone to overfitting, because the network is very
deep and the number of parameters is very large.
The second problem with fully connected networks is that some gradients might vanish due
to long chains of layers. Since the network is very deep, the gradients in the first few layers might
vanish as they flow back, and the corresponding weights are therefore never trained.
So the objective is to have a network that is complex (having non-linearities), since in most real-world
problems the output is a complex function of the input, but that has fewer parameters and is therefore
less prone to overfitting. CNNs belong to the family of networks that serves this objective.
Convolutional Operation
The convolutional operation means that for a given input we re-estimate it as the weighted average of all the inputs
around it. We have some weights assigned to the neighboring values and we take the weighted sum of those
values to estimate the value of the current input/pixel.
For a 2D input, the classic example would be an image, where we re-calculate the value of every pixel by taking
the weighted sum of the pixels (neighbors) around it. For example, let's say the input image is as given below:
Input Image
Now in this input image, we calculate the value of each and every pixel by considering the weighted sum
of pixels around it
Here we are calculating the value of the circled pixel considering 3 neighbors around it (4 pixels in total,
including the pixel itself); assume that the weights w1, w2, w3, w4 are associated with these 4 pixels respectively.
Now, this matrix of weights is referred to as the Kernel or Filter. In the above case, we have a kernel
of size 2X2.
We compute the output (the re-estimated value of the current pixel) using the following formula:
output(i, j) = Σ over a = 0..m−1, b = 0..n−1 of W(a, b) · X(i + a, j + b)
Here m refers to the number of rows of the kernel (which is 2 in this case) and n refers to the number of columns
(which is 2 in this case).
Now we place the 2X2 filter over the first 2X2 portion of the image and take the weighted sum and
that would give the new value of the first pixel.
We map the 2X2 kernel/filter over the 2X2 portion of the input.
Then we move the filter horizontally by one and place it over the next 2 X 2 portion of the input; in this
case the pixels of interest would be b, c, f, g, and we compute the output using the same technique and we
would get:
And then again we move the kernel/filter by 1 in the horizontal direction and take the weighted sum.
So, after this, the output from the first layer would look like:
Then we move the kernel down by 1 in the vertical direction, calculate the output, and move the kernel in the
horizontal direction again. In general we move the kernel like this: first, we start off with the top-left portion
of the image, move the filter in the horizontal direction until we have covered the row completely, then we move the filter
in the vertical direction (by some amount relative to the top-left portion of the image), again stride it horizontally
through the entire row, and continue like this. In essence, we move the kernel left to right, top to bottom.
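A short sketch of this left-to-right, top-to-bottom sweep with a 2X2 kernel (weights w1..w4); all pixel and weight values below are made up:

```python
import numpy as np

# Made-up 3x4 "image" with pixels a..l laid out row by row
image = np.array([[1., 2., 3., 4.],      # a b c d
                  [5., 6., 7., 8.],      # e f g h
                  [9., 10., 11., 12.]])  # i j k l
kernel = np.array([[0.1, 0.2],           # w1 w2
                   [0.3, 0.4]])          # w3 w4

rows = image.shape[0] - kernel.shape[0] + 1
cols = image.shape[1] - kernel.shape[1] + 1
output = np.zeros((rows, cols))
for i in range(rows):           # top to bottom
    for j in range(cols):       # left to right within each row
        patch = image[i:i+2, j:j+2]            # e.g. (a, b, e, f), then (b, c, f, g), ...
        output[i, j] = np.sum(patch * kernel)  # w1*a + w2*b + w3*e + w4*f, and so on
print(output)
```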
Instead of considering pixels only in the forward direction, we can also consider the previous neighbors.
To consider the previous neighbors, the formula for computing the output would be:
output(i, j) = Σ over a = −m/2..m/2, b = −n/2..n/2 of W(a, b) · X(i + a, j + b)
We take the limits from −m/2 to m/2, i.e. we take half of the rows from the previous neighbors and the other
half from the forward direction (forward neighbors), and the same is the case in the vertical direction (−n/2 to
n/2).
In that case we use a kernel/filter of size 3X3 and, for each pixel, we take the 3 X 3 neighborhood around it (the pixel
itself is part of this 3 X 3 neighborhood and would be at the center), just like in the below image:
Input Image, we consider 3X3 portions of this image as the kernel is of size 3X3
Let’s say this input is a 30X30 image, we go over every pixel systematically, place the filter such that the
pixel is at the center of the kernel and re-estimate the value of that pixel as the weighted sum of pixels
around it.
So, in this way, we get back the re-estimated value of all the pixels.
We all have seen the convolutional operation in practice. Let’s say the kernel that we are using is as below:
Kernel
So, we move this kernel all over the image and re-compute every pixel as the weighted sum of its
neighborhood. In this case, since all the weights are 1/9, the re-estimated value of each pixel is the sum of the
9 pixels in its neighborhood, each scaled by 1/9; in other words, this kernel takes the average of all 9 pixels.
Taking the average for each pixel/color in the image (dividing the weighted sum by 9) dilutes the values and
blurs the image, and the output we get by applying this convolutional operation is:
So, the blur operation that we all might have used in any photo editing application actually applies the
convolution operation behind the scenes.
Now in the below-mentioned scenario, we are using 5 as the weight for the central pixel, 0 for the corner
pixels and −1 for the remaining (edge-adjacent) pixels, so the net effect is that the value/color intensity of
the central pixel is boosted while its neighborhood information is subtracted; the result is that it sharpens the image.
Let’s take one more example: in the below case, the value for the central pixel is -8 and for all other pixels it
is 1, so if we have the same color in the 3X3 portion of the image(just like for the marked pixel in the below
image), let say the pixel intensity for this current pixel is denoted by ‘x’ then we get (8x from the central
pixel and -8x from the weighted sum of all other pixels and summation of the these results into 0).
So, wherever we have the same color in the 3X3 portion(some sample regions marked in the below image)
or to say the neighbors are exactly the same as the current pixel, we get the output intensity as 0.
So, in effect, wherever there is a boundary (yellow highlighted in the below image), the neighboring pixels
cannot all be the same as the current pixel; only in such regions do we get a non-zero value, and everywhere
else we get a zero value. So, in effect, we end up detecting all the edges in the input image.
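To see both effects numerically, here is a sketch using SciPy's convolve2d on an arbitrary made-up image (assuming SciPy is available): the averaging kernel smooths the values, while the kernel with −8 at the centre gives 0 wherever the 3X3 neighborhood is uniform and non-zero along the boundary:

```python
import numpy as np
from scipy.signal import convolve2d

# Arbitrary image: a flat region of 5s with a brighter 2x2 block (an "edge" around it)
img = np.full((6, 6), 5.0)
img[2:4, 2:4] = 9.0

blur_kernel = np.full((3, 3), 1/9)                 # average of the 9 pixels
edge_kernel = np.array([[1., 1., 1.],
                        [1., -8., 1.],
                        [1., 1., 1.]])             # -8 at the centre, 1 elsewhere

blurred = convolve2d(img, blur_kernel, mode="same", boundary="symm")
edges = convolve2d(img, edge_kernel, mode="same", boundary="symm")

print(np.round(blurred, 2))  # values pulled toward the local average (blur)
print(np.round(edges, 2))    # 0 in uniform regions, non-zero along the boundary
```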
Below is a complete picture of how the 2D convolutional operation is performed over the input: we start
at the top-left corner, apply the kernel over that area, move the kernel horizontally towards the right, and once
we have reached the end on the right side (completed the entire row), we move the kernel downwards by
some steps and again start from the left side and move towards the right.
Once we complete the entire row, we slide the kernel vertically in the downward direction and start again from the
left side.
Now consider the case of a 3D input. An image is also a 3D input, as it has 3 channels corresponding to Red, Green
and Blue; these channels are superimposed on each other and that's how we get the final image. In other words, every
pixel in the image has 3 values associated with it, so we can look at that as the depth. With 3
channels (depth), one corresponding to each of R, G and B, we use a filter of the same depth as
the input, place the filter over the input, and compute the weighted sum across all 3 dimensions.
In most cases when we use convolution for 3D inputs, we use a 3D convolution filter (as depicted in the
below image). That means if we place the filter at a given location in the image, we take a weighted
average of its 3D neighborhood, but we are not going to slide it along the depth. The kernel has the same
depth as the original input, so there is no scope to move it through the depth. For example, if the input
image depth is 3 and the kernel depth is also 3, there is simply no room to move along the depth.
In this case, also, we move the filter horizontally and vertically as in the 2D case. We don’t move the filter
along the depth as the input image depth is the same as the filter depth and there is no scope to move across
the depth.
So, what we do in practice is take this 3D kernel and start moving it: we move it along the horizontal
direction first, and we keep doing this through the entire image (left to right, top to bottom) until we reach
the last position. At the end of this, although our input was 3-dimensional, we get back a 2D output.
Points to consider:
Input is 3D
The convolutional operation that we perform is effectively 2D, as we slide the filter only horizontally and vertically (not along the depth)
This is because the depth of the filter is the same as the depth of the input
In practice, we apply multiple kernels/filters to the same input and get different representations/outputs
from the same input depending on the kernel used. For example, one filter might detect the vertical edges in the
input, a second might detect the horizontal edges in the image, another filter might blur the image, and so on.
In the above image, we are using 3 different filters and we are getting 3 outputs, one corresponding to each
filter. We can combine these different output representations into one single volume (each output
representation has a width and a height, and after combining all of the representations we get the depth
as well). So, if we apply 3 filters to the input, we get an output of depth 3; if we apply 100 filters to the
input, we get an output of depth 100.
Terminology
Let’s define some terminology and find out the relation between the input dimensions and the output
dimensions:
The spatial extent of a filter (F) is the extent of the neighborhood we are looking at, i.e. the dimension of the
filter, which would be F X F. Usually, we have an odd-dimensional filter, and the depth of the filter is
the same as the depth of the input (Di in this case).
Now we want to relate the output dimensions with the input dimensions:
Let’s take 2D input of dimension ‘7 X 7’ and we have a filter of size ‘3 X 3’ over it.
As we slide the filter over it (from left to right and top to bottom), we keep computing the output values,
and it's very clear that the output is smaller than the input.
The reason why this is happening is obvious: we can't place the kernel at the corners, as it would cross the
boundary.
We can't place the filter at the crossed pixel (below image), because if we place it there, part of the filter (the
yellow highlighted portion) would fall outside the image. And in practice, we would stop at the crossed pixel
(as in the below image), where the filter still lies completely inside the image:
And this is why we get the smaller output: we would not be able to apply the filter in any part of
the shaded region in the below image:
Hence we are not computing the re-estimated value for every pixel in the input, and therefore the number
of pixels in the output is less than the number of pixels in the input.
This was the case for a '3 X 3' kernel; now let's see what happens when we have a '5 X 5' kernel:
Now we cannot place the kernel at the crossed pixel in the above image. We cannot place the kernel at
the yellow highlighted pixel either. So, in this case, we cannot place the kernel at any of the shaded
regions in the below image:
If we want the output to be the same size as the input, then we need to pad the input appropriately:
Here we pad the input with 0 all around the input image, apply the 3X3 filter over the input, and we get an
output of the same dimension as the input.
If we place the kernel at the crossed pixel in the below image, we now have 5 artificial pixels with a value
of 0 and we are able to re-estimate the value of this crossed pixel.
Now the output would again be '7 X 7', as we have introduced this artificial boundary around the original input.
If we have a '5 X 5' filter, it would still go outside the image even after this artificial padding.
So, in this case, we need to increase the padding. Earlier we added a padding of 1 (meaning 1 row at the top, 1 at
the bottom, 1 column at the left and 1 at the right). And it's obvious from the above image that if we want to use a
'5 X 5' filter, then we should use a padding of 2.
The bigger the kernel size, the larger the padding required, and the updated formula for the relation
between input and output dimensions (with padding P and stride 1) is:
W_out = W − F + 2P + 1, H_out = H − F + 2P + 1
Stride (S): Stride defines the interval at which the filter is applied. Till now we discussed all the cases
considering the stride to be 1, i.e. we move the filter by 1 in the horizontal and vertical direction, as depicted
in the below image:
In some cases, we may not want this; say we don't want a full replica of the image and just need a summary
of it. In that case, we may choose to apply the filter only at alternate locations in the input.
Here we use S = 2, i.e. we move the filter by 2 in the horizontal as well as the vertical direction.
This interval between two successive positions where we apply the kernel is termed the Stride. And in the
above case, the output would be roughly half the size of the input, as we are skipping every other location in the
image.
Now, if we are using a stride 'S', then the formula to compute the output width and height is given by:
W_out = (W − F + 2P) / S + 1, H_out = (H − F + 2P) / S + 1
The depth of the output is going to be the same as the number of filters that we have.
Each 3D filter applied over a 3D input gives one 2D output; if we use K such filters, we get K such 2D
outputs, and if we stack up all these K outputs we get an output of depth K. So, the depth of the
output is the same as the number of filters used.
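A small helper (a sketch; the variable names follow the terminology above: W input width/height, F filter size, P padding, S stride, K number of filters) that computes the output shape:

```python
def conv_output_shape(W, F, P, S, K):
    """W: input width/height, F: filter size, P: padding, S: stride, K: number of filters."""
    side = (W - F + 2 * P) // S + 1   # output width and height
    return side, side, K              # depth equals the number of filters

print(conv_output_shape(W=7, F=3, P=0, S=1, K=1))    # (5, 5, 1): 7x7 input, 3x3 filter, no padding
print(conv_output_shape(W=7, F=3, P=1, S=1, K=10))   # (7, 7, 10): 'same' padding keeps the size
print(conv_output_shape(W=7, F=3, P=1, S=2, K=10))   # (4, 4, 10): stride 2 roughly halves the size
```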
MOTIVATION:
A digital image is a 2D grid of pixels. Since a neural network expects a vector as input, one idea for dealing
with images would be to flatten the image and feed the output of the flattening operation to the
neural network, and this would work to some extent.
But eventually, that flattened vector won't be the same for a translated image.
The neural network would have to learn very different parameters in order to classify the objects, which is a
difficult job since natural images are highly variable (lighting, translation, viewing angles, ...).
It is also worth mentioning that the input vector would be relatively big, 64*64*3 for RGB images, which
can cause memory problems when using a neural network, since even a first layer with
just 10 neurons would already have 64*64*3*10 weights to train.
Sparse Connectivity : when processing an image, the input image might have thousands or millions of
pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or
hundreds of pixels. This means that we need to store fewer parameters, which both reduces the memory
requirements of the model and improves its statistical efficiency. It also means that computing the
output requires fewer operations. These improvements in efficiency are usually quite large. If there are
m inputs and n outputs, then matrix multiplication requires m×n parameters and the algorithms used in
practice have O(m × n) runtime (per example). If we limit the number of connections each output may
have to k, then the sparsely connected approach requires only k × n parameters and O(k × n) runtime
Parameter sharing : In a convolutional neural net, each member of the kernel is used at every position
of the input (except perhaps some of the boundary pixels, depending on the design decisions regarding
the boundary). The parameter sharing used by the convolution operation means that rather than learning
a separate set of parameters for every location, we learn only one set. This does not affect the runtime
of forward propagation (it is still O(k × n)), but it does further reduce the storage requirements of the
model to k parameters.
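To put rough numbers on the sparse connectivity and parameter sharing points above (the figures are illustrative, not from the text):

```python
# Illustrative figures: m inputs, n outputs, k connections per output
m, n, k = 320 * 280, 320 * 280, 3 * 3

dense_params = m * n        # fully connected: every output sees every input
sparse_params = k * n       # sparse connectivity: each output sees only k inputs
shared_params = k           # parameter sharing: one small kernel reused at every position

print(f"dense:  {dense_params:,} parameters")
print(f"sparse: {sparse_params:,} parameters")
print(f"shared: {shared_params:,} parameters")
```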
Equivariance: In the case of convolution, the particular form of parameter sharing causes the layer
to have a property called equivariance to translation. We use the same network parameters to detect local
patterns at many locations in the image.
So in a practical setting, the convolutional operation is implemented by making the kernel slide across
the image and produce an output value at each position.
We also convolve different kernels and as a result obtain different feature maps or channels.
Same Convolution: pads in a way that the output size is the same as the input size.
Full Convolution: we compute an output wherever the kernel and the input overlap by at least
1 pixel.
Strided Convolution: the kernel slides along the image with a step > 1.
Dilated Convolution: the kernel is spread out, with a step > 1 between kernel elements.
Depthwise Convolution: each output channel is connected only to one input channel.
POOLING:
A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby
outputs. For example, the max pooling operation reports the maximum output within a rectangular
neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the L2
norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel. In
all cases, pooling helps to make the representation become approximately invariant to small translations of
the input. Invariance to translation means that if we translate the input by a small amount, the values of most
of the pooled outputs do not change.
Kernel K with element K_{i,j,k,l} giving the connection strength between a unit in channel
i of the output and a unit in channel j of the input, with an offset of k rows and l columns between the
output unit and the input unit.
Input: V_{i,j,k} with channel i, row j and column k
Output Z, same format as V
Use 1 as first entry
Full Convolution
0 padding, 1 stride:
Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} K_{i,l,m,n}
0 padding, s stride:
Z_{i,j,k} = c(K, V, s)_{i,j,k} = Σ_{l,m,n} [ V_{l, s·(j−1)+m, s·(k−1)+n} K_{i,l,m,n} ]
Convolution with a stride greater than 1 pixel is equivalent to conv with 1 stride followed by downsampling:
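A quick numerical check of this equivalence, sketched with 1D arrays in NumPy (values are arbitrary):

```python
import numpy as np

def conv1d_valid(v, k):
    """Stride-1 'valid' cross-correlation of a 1D signal with a kernel."""
    n = len(v) - len(k) + 1
    return np.array([np.dot(v[i:i+len(k)], k) for i in range(n)])

def conv1d_strided(v, k, s):
    """Apply the kernel only at every s-th position."""
    n = (len(v) - len(k)) // s + 1
    return np.array([np.dot(v[i*s:i*s+len(k)], k) for i in range(n)])

v = np.array([1., 4., 2., 7., 3., 0., 5., 6.])
k = np.array([0.5, 1.0, -0.5])
s = 2

strided = conv1d_strided(v, k, s)
downsampled = conv1d_valid(v, k)[::s]   # stride-1 convolution followed by downsampling
print(strided, downsampled)             # identical results
```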
Usually the optimal amount of 0 padding lies somewhere between 'Valid' and 'Same'.
Unshared Convolution
In some cases we do not want to use convolution but want to use a locally connected layer instead. In that case
we use unshared convolution, indexing into a weight tensor W:
Z_{i,j,k} = Σ_{l,m,n} [ V_{l, j+m−1, k+n−1} W_{i,j,k,l,m,n} ]
Comparison on local connections, convolution and full connection
Useful when we know that each feature should be a function of a small part of space, but there is no reason to think
that the same feature should occur across all of space, e.g. look for a mouth only in the bottom half of the
image.
It can also be useful to make versions of convolution or locally connected layers in which the connectivity is
further restricted, e.g. constrain each output channel i to be a function of only a subset of the input
channels.
Advantages: reduce memory consumption, increase statistical efficiency, and reduce computation for both forward
and backward propagation.
Tiled Convolution
Learn a set of kernels that we rotate through as we move through space. Immediately neighboring locations
will have different filters, but the memory requirement for storing the parameters increases only by a factor of
the size of this set of kernels. Comparison of locally connected layers, tiled convolution and standard
convolution:
Z_{i,j,k} = Σ_{l,m,n} [ V_{l, j+m−1, k+n−1} K_{i,l,m,n, j%t+1, k%t+1} ]
Locally connected layers and tiled convolutional layers with max pooling: the detector units of these layers are
driven by different filters. If the filters learn to detect different transformed versions of the same underlying
feature, then the max-pooled units become invariant to the learned transformation.
Review:
K: Kernel stack
V: Input image
Z: Output of conv layer
G: gradient on Z
STRUCTURED OUTPUTS:
A deep neural network model is a powerful framework for learning representations. Usually, it is used
to learn the relation x→y by exploiting the regularities in the input x. In structured output prediction
problems, y is multi-dimensional and structural relations often exist between the dimensions. The
motivation of this work is to learn the output dependencies that may lie in the output data in order to
improve the prediction accuracy. Unfortunately, feedforward networks are unable to exploit the relations
between the outputs. In order to overcome this issue, we propose in this paper a regularization scheme for
training neural networks for these particular tasks using a multi-task framework. Our scheme aims at
incorporating the learning of the output representation y in the training process in an unsupervised fashion
while learning the supervised mapping function x→y.
TYPES OF DATA:
Deep learning can be applied to any data type. The data types you work with, and the data you gather, will
depend on the problem you’re trying to solve.
Use Cases
Deep learning can solve almost any problem of machine perception, including classifying data, clustering it,
or making predictions about it.
Classification: This image represents a horse; this email looks like spam; this transaction
is fraudulent
Clustering: These two sounds are similar. This document is probably what user X is looking for
Predictions: Given their web log activity, Customer A looks like they are going to stop using
your service
Deep learning is best applied to unstructured data like images, video, sound or text. An image is just a blob
of pixels, a message is just a blob of text. This data is not organized in a typical, relational database by rows
and columns. That makes it more difficult to specify its features manually.
Common use cases for deep learning include sentiment analysis, classifying images, predictive analytics,
recommendation systems, anomaly detection and more.
Data Attributes
For deep learning to succeed, your data needs to have certain characteristics.
Relevancy
The data you use to train your neural net must be directly relevant to your problem; that is, it must resemble
as much as possible the real-world data you hope to process. Neural networks are born as blank slates, and
they only learn what you teach them. If you want them to solve a problem involving certain kinds of data,
like CCTV video, then you have to train them on CCTV video, or something similar to it. The training data
should resemble the real-world data that they will classify in production.
Proper Classification
If a client wants to build a deep-learning solution that classifies data, then they need to have a labeled
dataset. That is, someone needs to apply labels to the raw data: “This image is a flower, that image is a
panda.” With time and tuning, this training dataset can teach a neural network to classify new images it has
not seen before.
Formatting
Neural networks eat vectors of data and spit out decisions about those vectors. All data needs to be
vectorized, and the vectors should be the same length when they enter the neural net. To get vectors of the
same length, it’s helpful to have, say, images of the same size (the same height and width). So sometimes
you need to resize the images. This is called data pre-processing.
Accessibility
The data needs to be stored in a place that’s easy to work with. A local file system, or HDFS (the Hadoop
file system), or an S3 bucket on AWS, for example. If the data is stored in many different databases that are
unconnected, you will have to build data pipelines. Building data pipelines and performing preprocessing
can account for at least half the time you spend building deep-learning solutions.
The minimums vary with the complexity of the problem, but 100,000 instances in total, across all categories,
is a good place to start.
If you have labeled data (i.e. categories A, B, C and D), it’s preferable to have an evenly balanced dataset
with 25,000 instances of each label; that is, 25,000 instances of A, 25,000 instances of B and so forth.
Efficient convolution algorithms are essential for various signal and image processing tasks, as well
as deep learning and computer vision applications. Convolution is a fundamental operation that involves
multiplying and summing values from two input arrays, and it can be computationally expensive, especially
for large input data and filter kernels. Several efficient algorithms and techniques have been developed to
speed up convolution operations. Here are some of the most important ones:
1. Naive (Direct) Convolution:
The most straightforward way to compute a convolution is to perform the element-wise multiplication and
sum for each possible location of the filter over the input. While this is conceptually simple, it is highly
inefficient and slow for large inputs and filters.
2. Fast Fourier Transform (FFT) Convolution:
One efficient technique for convolution is to use the FFT. The idea is to convert the input and filter into
the frequency domain, perform element-wise multiplication, and then convert the result back to the time
domain using the inverse FFT. This approach can significantly reduce the computational complexity,
especially for large filters and inputs. However, it may introduce some artifacts due to the finite precision of
floating-point arithmetic.
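A minimal sketch of the FFT route for 1D signals, using only NumPy (the circular convolution is made linear by zero-padding to the full output length):

```python
import numpy as np

def fft_convolve(x, h):
    """Linear convolution via FFT: pad, multiply in the frequency domain, invert."""
    n = len(x) + len(h) - 1
    X = np.fft.rfft(x, n)
    H = np.fft.rfft(h, n)
    return np.fft.irfft(X * H, n)

x = np.array([1., 2., 3., 4., 5.])
h = np.array([0.25, 0.5, 0.25])
print(np.round(fft_convolve(x, h), 6))
print(np.convolve(x, h))   # direct convolution gives the same result (up to float error)
```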
3. Winograd Convolution:
Winograd convolution is an algorithm that minimizes the number of multiplications required for
convolution. It uses a small set of precomputed matrices to transform the input and filter, allowing for faster
computation with less computational cost. This method is particularly effective for small filter sizes.
4. Strassen Algorithm:
Originally developed for matrix multiplication, the Strassen algorithm has also been adapted
for convolution. It reduces the number of multiplicative operations by recursively breaking down
the convolution into smaller sub-convolutions. This can be more efficient for large convolutions.
5. 2D FFT Convolution:
For 2D convolutions, you can use the FFT approach by performing separate 1D FFTs along each
dimension of the input and filter, and then combining them. This is particularly useful when dealing
with images and 2D data.
6. Matrix Multiplication Methods:
These techniques involve reformatting the input and filter data into matrix form, where convolution can
be performed as a simple matrix multiplication operation. While this approach can be more efficient for
hardware implementations, it requires additional memory for the reformatted data.
7. Depthwise Separable Convolution:
Depthwise separable convolution is a technique used in deep learning, where a convolution operation
is split into two parts: depthwise convolution (applying a single filter to each input channel separately)
and pointwise convolution (applying 1x1 convolutions to combine the results). This reduces the
number of parameters and computations, making it efficient for mobile and embedded devices.
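A back-of-the-envelope comparison of parameter counts (illustrative channel and kernel sizes, not from the text):

```python
def standard_conv_params(c_in, c_out, k):
    # One KxK filter per (input channel, output channel) pair
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    depthwise = c_in * k * k       # one KxK filter per input channel
    pointwise = c_in * c_out       # 1x1 convolutions to mix channels
    return depthwise + pointwise

c_in, c_out, k = 64, 128, 3        # illustrative sizes
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(std, sep, round(std / sep, 1))   # 73728 vs 8768: roughly 8x fewer parameters
```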
8. Winograd-like Transformations:
Variations of the Winograd algorithm exist for different input sizes and filter dimensions,
providing options for optimizing convolution for specific scenarios.
The choice of convolution algorithm depends on the specific use case, hardware, and trade-offs
between speed and memory usage.
NEUROSCIENTIFIC BASIS:
The history of convolutional networks begins with neuroscientific experiments long before the
relevant computational models were developed.
Neurophysiologists David Hubel and Torsten Wiesel observed how neurons in the cat's brain responded to visual stimuli:
“Their great discovery was that neurons in the early visual system responded most strongly to very specific
patterns of light, such as precisely oriented bars, but responded hardly at all to other patterns”
The Neurons in the early visual cortex are organized in a hierarchical fashion, where the first cells
connected to the cat’s retinas are responsible for detecting simple patterns like edges and bars, followed by
later layers responding to more complex patterns by combining the earlier neuronal activities.
A Convolutional Neural Network may learn to detect edges from raw pixels in the first layer, then use
the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level
features, such as facial shapes, in higher layers.
The Visual Cortex of the brain is a part of the cerebral cortex that processes visual information. V1 is
the first area of the brain that begins to
perform significantly advanced processing of visual input.
1. V1 is arranged in a spatial map. It actually has a two-dimensional structure mirroring the structure of
the image in the retina. Convolutional networks capture this property by having their features
defined in terms of two-dimensional maps.
2. V1 contains many simple cells. A simple cell's activity can be characterized by a linear function of the
localized receptive field. The detector units of a convolutional network are designed to emulate
these properties of simple cells.
3. V1 also contains many complex cells. These cells respond to features that
are similar to those detected by simple cells, but complex cells are invariant to small shifts in the
position of the feature. This inspires the pooling units of convolutional networks.
1. The human eye is mostly very low resolution, except for a tiny patch called the fovea.
Most convolutional networks receive large full resolution photographs as input.
2. The human visual system is integrated with many other senses, such as
hearing, and factors like our moods and thoughts. Convolutional
networks so far are purely visual.
3. Even simple brain areas like V1 are heavily impacted by feedback from higher levels. Feedback has
been explored extensively in neural network models but has not yet been shown to offer a compelling
improvement.
APPLICATIONS
COMPUTER VISION:
Computer vision in AI is dedicated to the development of automated systems that can interpret visual data
(such as photographs or motion pictures) in the same manner as people do. The idea behind computer vision
is to instruct computers to interpret and comprehend images on a pixel-by-pixel basis. This is the foundation
of the computer vision field. Regarding the technical side of things, computers will seek to extract visual
data, manage it, and analyze the outcomes using sophisticated software programs.
The amount of data that we generate today is tremendous - 2.5 quintillion bytes of data every single day.
This growth in data has proven to be one of the driving factors behind the growth of computer vision.
Massive amounts of information are required for computer vision. Repeated data analyses are performed
until the system can differentiate between objects and identify visuals. Deep learning, a specific kind of
machine learning, and convolutional neural networks, an important form of a neural network, are the two
key techniques that are used to achieve this goal.
With the help of pre-programmed algorithmic frameworks, a machine learning system may automatically
learn about the interpretation of visual data. The model can learn to distinguish between similar pictures if it
is given a large enough dataset. Algorithms make it possible for the system to learn on its own, so that it
may replace human labor in tasks like image recognition.
Convolutional neural networks aid machine learning and deep learning models in understanding by dividing
visuals into smaller sections that may be tagged. With the help of the tags, it performs convolutions and
then leverages the tertiary function to make recommendations about the scene it is observing. With each
cycle, the neural network performs convolutions and evaluates the veracity of its recommendations. And
that's when it starts perceiving and identifying pictures like a human.
Computer vision is similar to solving a jigsaw puzzle in the real world. Imagine that you have all these
jigsaw pieces together and you need to assemble them in order to form a real image. That is exactly how
the neural networks inside a computer vision work. Through a series of filtering and actions, computers can
put all the parts of the image together and then think on their own. However, the computer is not just given
a puzzle of an image - rather, it is often fed with thousands of images that train it to recognize certain
objects.
For example, instead of training a computer to look for pointy ears, long tails, paws and whiskers that make
up a cat, software programmers upload and feed millions of images of cats to the computer. This enables the
computer to understand the different features that make up a cat and recognize it instantly.
History
For almost 60 years, researchers and developers have sought to teach computers how to perceive and make
sense of visual information. In 1959, neurophysiologists started showing a cat a variety of sights in an effort
to correlate a reaction in the animal's brain. They found that it was particularly sensitive to sharp corners and
lines, which technically indicates that straight lines and other basic forms are the foundation upon which
image analysis is built.
Around the same period, the first image-scanning technology emerged that enabled computers to scan
images and obtain digital copies of them. This gave computers the ability to digitize and store images. In
the 1960s, artificial intelligence (AI) emerged as an area of research, and the effort to address AI's inability
to mimic human vision began.
Neuroscientists demonstrated in 1982 that vision operates hierarchically and presented techniques enabling
computers to recognize edges, vertices, arcs, and other fundamental structures. At the same time, data
scientists created a pattern-recognition network of cells. By the year 2000, researchers were concentrating
their efforts on object identification, and by the following year, the industry saw the first-ever real-time face
recognition solutions.
Examining the algorithms upon which modern computer vision technology is based is essential to
understanding its development. Deep learning is a kind of machine learning that modern computer vision
utilizes to get data-based insights.
When it comes to computer vision, deep learning is the way to go. An algorithm known as a neural network
is used. Patterns in the data are extracted using neural networks. Algorithms are based on our current
knowledge of the brain's structure and operation, specifically the linkages between neurons within the
cerebral cortex.
The perceptron, a mathematical model of a biological neuron, is the fundamental unit of a neural network. It
is possible to have many layers of linked perceptrons, much like the layers of neurons in the biological
cerebral cortex. As raw data is fed into the perceptron-generated network, it is gradually transformed into
predictions.
Extremely fast CPUs and associated technology, together with a swift, dependable internet and cloud-based
infrastructures, make the entire process blistering fast nowadays. Importantly, several of the largest
businesses investing in AI research, like Google, Facebook, Microsoft, and IBM, have been upfront about
their research and development in the field. In this way, people may build upon the foundation they've laid.
This has resulted in the AI sector heating up, and studies that used to take weeks to complete may now be
completed in a few minutes. In addition, for many computer vision tasks in the actual world, this whole
process takes place constantly in a matter of microseconds. As a result, a computer may currently achieve
what researchers refer to as "circumstantially conscious" status.
One field of Machine Learning where fundamental ideas are already included in mainstream products is
computer vision. The applications include:
Self-Driving Cars
With the use of computer vision, autonomous vehicles can understand their environment. Multiple cameras
record the environment surrounding the vehicle, which is then sent into computer vision algorithms that
analyze the photos in perfect sync to locate road edges, decipher signposts, and see other vehicles,
obstacles, and people. Then, the autonomous vehicle can navigate streets and highways on its own, swerve
around obstructions, and get its passengers where they need to go safely.
Facial Recognition
Facial recognition programs, which use computer vision to recognize individuals in photographs, rely
heavily on this field of study. Facial traits in photos are identified by computer vision algorithms, which
then match those aspects to stored face profiles. In order to verify the identity of the people using consumer
electronics, face recognition is increasingly being used. Facial recognition is used in social networking
applications for both user detection and user tagging. For the same reason, law enforcement uses face
recognition software to track down criminals using surveillance footage.
Augmented Reality
Augmented reality, which allows computers like smartphones and wearable technology to superimpose or
embed digital content onto real-world environments, also relies heavily on computer vision. Virtual items
may be placed in the actual environment through computer vision in augmented reality equipment. In order
to properly generate depth and proportions and position virtual items in the real environment, augmented
reality apps rely on computer vision techniques to recognize surfaces like tabletops, ceilings, and floors.
Healthcare
Computer vision has contributed significantly to the development of health tech. Automating the search for malignant moles on a person's skin or locating indicators in an X-ray or MRI scan are just two of the many applications of computer vision algorithms.
Examples
The following are some examples of well-established activities using computer vision:
Categorization of Images
A computer program that uses image categorization can determine what an image is of (a dog, a banana, a
human face, etc.). In particular, it may confidently assert that an input picture matches a specific category. It
might be used by a social networking platform, for instance, to filter out offensive photos that people post.
Object Detection
Object detection builds on image classification: it uses the class information to search for and locate instances of the desired class of object within an image. In the manufacturing industry, this can include finding defects on the production line or locating broken equipment.
Object Tracking
Once an object is detected, object tracking follows it as it moves. A common way of doing this is with a live video stream or a series of sequentially taken photos. For example, driverless cars must not only identify and categorize moving things like pedestrians, other motorists, and road infrastructure, but also track them in order to prevent crashes and adhere to traffic regulations.
Content-Based Image Retrieval
In contrast to traditional image retrieval methods, which rely on metadata labels, a content-based retrieval system employs computer vision to search, browse, and retrieve pictures from huge data warehouses based on the actual image content. Automatic picture annotation, which can replace manual tagging, may be used for this work.
Computer vision algorithms include the different methods used to understand the objects in digital images
and extract high-dimensional data from the real world to produce numerical or symbolic information. There
are many other computer vision algorithms involved in recognizing things in photographs. Some common
ones are:
Object Classification - What is the main category of the object present in this photograph?
Object Recognition - What are the objects present in this photograph and where are they located?
Object Landmark Detection - What are the key points for the object in this photograph?
Many other advanced computer vision algorithms such as style transfer, colorization, human pose
estimation, action recognition, and more can be learned alongside deep learning algorithms.
Creating a machine with human-level vision is surprisingly challenging, and not only because of the
technical challenges involved in doing so with computers. We still have a lot to learn about the nature of
human vision.
To fully grasp biological vision, one must learn not just how various receptors like the eye work, but also
how the brain processes what it sees. The process has been mapped out, and its tricks and shortcuts have
been discovered, but, as with any study of the brain, there is still a considerable distance to cover.
Computer vision can automate several tasks without the need for human intervention. As a result, it
provides organizations with a number of benefits:
Faster and simpler process - Computer vision systems can carry out repetitive and monotonous tasks at a
faster rate, which simplifies the work for humans.
Better products and services - Well-trained computer vision systems make very few mistakes, which results in faster delivery of high-quality products and services.
Cost reduction - Because computer vision catches faulty products and flawed processes early, companies spend less money on fixing them.
No technology is free from flaws, and this is true for computer vision systems as well. Here are a few limitations of computer vision:
Lack of specialists - Companies need to have a team of highly trained professionals with deep knowledge
of the differences between AI vs. Machine Learning vs. Deep Learning technologies to train computer
vision systems. There is a need for more specialists that can help shape this future of technology.
Need for regular monitoring - If a computer vision system faces a technical glitch or breaks down, this
can cause immense loss to companies. Hence, companies need to have a dedicated team on board to
monitor and evaluate these systems.
IMAGE GENERATION:
Generative Adversarial Networks, popularly known as GANs, are a deep learning, unsupervised machine learning technique proposed in 2014 by Goodfellow et al. The main blocks of this architecture are:
1. Generator: This block tries to generate images that are very similar to those in the original dataset by taking noise as input. It tries to learn the joint probability of the input data (X) and output data (Y), P(X, Y).
2. Discriminator: This block accepts two inputs, one from the main dataset and the other from the images generated by the Generator, and classifies them as Real or Fake.
To make this generative and adversarial process simple, both of these blocks are built from deep neural network based architectures which can be trained through forward and backward propagation.
To understand this concept in depth, we will implement GAN architectures through tensorflow-keras. We will focus on the generation of MNIST images through simple GANs, Deep Convolutional GANs (DC GANs), and Super Resolution GANs (SR GANs), with working examples.
With the above architecture of simple GANs in mind, we will first look at the architecture of the Generator model. The Generator consists of four dense layers, where 100-dimensional noise data is passed as input. The last dense layer of the Generator produces a 784-dimensional (28x28 = 784) vector, which is the flattened vector corresponding to an individual MNIST image.
For the last Dense layer, we use a tanh activation unit because we normalize each image to [-1, +1]. This generated vector from the Generator is then passed to the next block, the Discriminator network of the GAN.
The Discriminator's main task is to predict, with as much confidence as possible, whether its input is real or fake, so we pass our 784-dimensional generator output vector to it. This block also comprises four dense layers as shown below.
A sigmoid activation is used for the last layer, which gives the probability of the input image being real or fake.
The LeakyReLU activation function is used in both the Generator and the Discriminator, which helps the model converge faster.
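As a rough sketch of these two blocks (assuming tensorflow.keras; the hidden-layer widths are illustrative and not taken from the original text), they could be defined as follows:

from tensorflow.keras import layers, models, optimizers

def build_generator(noise_dim=100):
    # Four dense layers that map 100-dimensional noise to a flattened 28x28 image
    model = models.Sequential([
        layers.Dense(256, input_shape=(noise_dim,)),
        layers.LeakyReLU(0.2),
        layers.Dense(512),
        layers.LeakyReLU(0.2),
        layers.Dense(1024),
        layers.LeakyReLU(0.2),
        layers.Dense(784, activation='tanh'),  # tanh because images are scaled to [-1, +1]
    ])
    return model

def build_discriminator():
    # Four dense layers that map a 784-dimensional vector to a real/fake probability
    model = models.Sequential([
        layers.Dense(1024, input_shape=(784,)),
        layers.LeakyReLU(0.2),
        layers.Dense(512),
        layers.LeakyReLU(0.2),
        layers.Dense(256),
        layers.LeakyReLU(0.2),
        layers.Dense(1, activation='sigmoid'),  # probability of the input being real
    ])
    model.compile(loss='binary_crossentropy',
                  optimizer=optimizers.Adam(learning_rate=0.0002, beta_1=0.5))
    return model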
Both the Generator and Discriminator blocks are then combined together as shown below:
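Continuing the sketch above (and reusing its imports and helper functions), the combined model might look like this:

def build_gan(generator, discriminator):
    # Freeze the discriminator inside the stacked model so that training the GAN
    # updates only the generator's weights
    discriminator.trainable = False
    gan = models.Sequential([generator, discriminator])
    gan.compile(loss='binary_crossentropy',
                optimizer=optimizers.Adam(learning_rate=0.0002, beta_1=0.5))
    return gan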
To perform the actual training, we initialize the generator, discriminator and gan objects by building each functional block. We generate 100-dimensional noise as input for the generator. As we normalize the images to [-1, +1], we draw the random noise from a normal distribution so that it lies roughly in the same range.
With the above code, we first generate sample images with the generator by passing random noise to it. These images are then combined with real images to form a batch of real and fake images. This batch is passed to the discriminator, which predicts the probability of each image being real or fake.
Up to this stage of discriminator prediction, we keep the discriminator trainable, as the prediction loss needs to be back-propagated through the network to update the weights of each layer.
Then, by freezing the layers of the discriminator, we back-propagate the GAN loss through the generator to update the weights of each of its layers.
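A hedged sketch of this training loop, reusing the helper functions from the sketches above (batch size, number of steps and optimizer settings are illustrative):

import numpy as np
import tensorflow as tf

# Load MNIST, flatten to 784 and scale to [-1, +1]
(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = (x_train.reshape(-1, 784).astype('float32') - 127.5) / 127.5

generator = build_generator()
discriminator = build_discriminator()      # compiled while trainable, so it can be updated directly
gan = build_gan(generator, discriminator)  # discriminator frozen inside the stacked model

batch_size, noise_dim = 128, 100
for step in range(10000):
    # 1) Train the discriminator on real images (label 1) and generated images (label 0)
    noise = np.random.normal(0, 1, (batch_size, noise_dim))
    fake_images = generator.predict(noise, verbose=0)
    real_images = x_train[np.random.randint(0, x_train.shape[0], batch_size)]
    discriminator.train_on_batch(real_images, np.ones((batch_size, 1)))
    discriminator.train_on_batch(fake_images, np.zeros((batch_size, 1)))
    # 2) Train the generator through the stacked GAN: the discriminator's layers were
    #    frozen when the GAN was compiled, so only the generator's weights change here
    gan.train_on_batch(noise, np.ones((batch_size, 1)))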
The image below shows the progress of the GAN architecture, where, starting from simple noise input, the generator gradually learns to produce images resembling MNIST digits.
It was very interesting to see the generation of MNIST-like images with a plain deep neural network, yet this model does not use the most important deep learning building block for images: the Convolutional Neural Network.
Hence, instead of flattening the image into dense layers, we will use convolutional filters to generate the image from the noise input.
The Generator of the DC GAN starts with a Dense layer followed by a Batch Normalization layer. Here, the noise input is first projected to FxFxK elements, and this output is then reshaped into an FxFxK tensor.
Conv2DTranspose layers are then used in DC GANs; their main objective is to upsample the input feature maps to the full image size. The complete architecture is as below:
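A minimal sketch of such a generator (assuming tensorflow.keras, with F = 7 and K = 128 as illustrative values for 28x28 MNIST images):

from tensorflow.keras import layers, models

def build_dcgan_generator(noise_dim=100, F=7, K=128):
    # Project the noise to F*F*K units, reshape, then upsample with Conv2DTranspose
    model = models.Sequential([
        layers.Dense(F * F * K, input_shape=(noise_dim,)),
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Reshape((F, F, K)),
        layers.Conv2DTranspose(64, kernel_size=5, strides=2, padding='same'),  # 7x7 -> 14x14
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(1, kernel_size=5, strides=2, padding='same',
                               activation='tanh'),                             # 14x14 -> 28x28
    ])
    return model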
The Discriminator of the DC GAN, unlike in the previous example, accepts an image as input instead of a flattened vector. Images generated by the Generator and images sampled from the original data are both passed to the discriminator.
LeakyReLU activation is used along with a small amount of Dropout to avoid overfitting of the model.
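A corresponding discriminator sketch (filter counts and dropout rate are illustrative, assuming 28x28x1 input images):

from tensorflow.keras import layers, models

def build_dcgan_discriminator():
    # Convolutional discriminator mapping a 28x28x1 image to a real/fake probability
    model = models.Sequential([
        layers.Conv2D(64, kernel_size=5, strides=2, padding='same', input_shape=(28, 28, 1)),
        layers.LeakyReLU(0.2),
        layers.Dropout(0.3),
        layers.Conv2D(128, kernel_size=5, strides=2, padding='same'),
        layers.LeakyReLU(0.2),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(1, activation='sigmoid'),
    ])
    return model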
The remaining flow of the DC GAN is the same as that of the simple GAN: we first let the discriminator update its weights through backpropagation of the training loss. After the discriminator is updated, we freeze it and fit the generator on fake data; the generator loss is then back-propagated through it to update its weights.
The image below shows the progress of the DC-GAN on MNIST data over 400 epochs.
Now we will look into one of the more advanced GAN architectures, called the SR GAN. Its main purpose is the generation of a Super Resolution image by a Generator that accepts a Low Resolution image as input. This Super Resolution image is very similar to the original High Resolution image.
The original dataset consists of High Resolution (HR) images, which are downsampled to obtain the Low Resolution (LR) images.
These LR images are then passed to the SR GAN Generator, which generates Super Resolution (SR) images that closely match the HR images.
Batches of these SR and HR images are then passed to the SR GAN Discriminator, which predicts whether the images are real or fake.
The final loss of the SR GAN is then back-propagated to the Generator and Discriminator networks.
Now that we understand how the architecture works, we can look at the details of each of the core blocks: the Generator and the Discriminator.
SR GAN Generator
As stated in the original research paper, the image above shows the block diagram of the SR GAN Generator. An LR input image is passed through a convolution layer followed by a Parametric ReLU activation. The output is then passed to a set of 16 residual blocks. The output of the residual blocks is passed to a couple of convolutional blocks and then to an up-sampling block, which increases the resolution of the image to the desired level.
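A rough sketch of one residual block and one up-sampling block as described above (layer sizes follow the common SR GAN setup but should be treated as illustrative; the pixel-shuffle up-sampling of the paper is approximated here with UpSampling2D):

from tensorflow.keras import layers

def residual_block(x, filters=64):
    # Conv -> BatchNorm -> PReLU -> Conv -> BatchNorm, with a skip connection around the block
    y = layers.Conv2D(filters, kernel_size=3, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.PReLU(shared_axes=[1, 2])(y)
    y = layers.Conv2D(filters, kernel_size=3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    return layers.Add()([x, y])

def upsample_block(x, filters=256):
    # Convolution followed by 2x spatial up-sampling and a Parametric ReLU
    y = layers.Conv2D(filters, kernel_size=3, padding='same')(x)
    y = layers.UpSampling2D(size=2)(y)
    return layers.PReLU(shared_axes=[1, 2])(y)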
SR GAN Discriminator
The Discriminator of the SR GAN, like in every other GAN architecture, has the job of separating fake from real images; it accepts both kinds of image during training. Here we can see a somewhat more complex structure than in the architectures seen previously.
The loss function in the SR GAN is a combination of a Content Loss and an Adversarial Loss. The content loss can be captured pixel-wise using the MSE between the HR and SR images, or it can be calculated between feature vectors extracted for each input image by passing it through a VGG19 network.
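A hedged sketch of the VGG19-based content loss (assuming tensorflow.keras and inputs already scaled/preprocessed as VGG19 expects; the choice of feature layer is illustrative):

import tensorflow as tf
from tensorflow.keras.applications import VGG19
from tensorflow.keras.models import Model

# Frozen VGG19 feature extractor; 'block5_conv4' is one common choice of feature layer
vgg = VGG19(include_top=False, weights='imagenet')
feature_extractor = Model(vgg.input, vgg.get_layer('block5_conv4').output)
feature_extractor.trainable = False

def content_loss(hr_images, sr_images):
    # MSE between the VGG19 feature maps of the high-resolution and super-resolved images
    return tf.reduce_mean(tf.square(feature_extractor(hr_images) - feature_extractor(sr_images)))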
While running the complete SR GAN model, we initialize the Generator, Discriminator and SR-GAN objects.
The Generator object is compiled with the Adam optimizer and only the content loss (i.e., the VGG19-based MSE).
The Discriminator object, on the other hand, is compiled with binary_crossentropy and the Adam optimizer.
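In code, this compilation step could look roughly as follows (generator, discriminator and the combined srgan are assumed to be Keras models built as described above, and content_loss is the VGG19-based loss from the previous sketch):

from tensorflow.keras.optimizers import Adam

# Generator trained with the content loss only
generator.compile(loss=content_loss, optimizer=Adam(learning_rate=1e-4))
# Discriminator trained with binary cross-entropy
discriminator.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=1e-4))
# The combined srgan object additionally back-propagates the adversarial loss
# through the frozen discriminator into the generator.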
Instead of training on a mixed batch, we first train the discriminator on a batch of HR images (labelled real) and then on a batch of SR images generated from the LR inputs (labelled fake), with discriminator.trainable = True.
Then, to train the generator, we freeze the discriminator and train the srgan object on the LR images.
Running the above network with a batch size of 64 for 400 epochs, we were able to get significant output from the SR GAN architecture.
The results below were generated using the SR GAN architecture itself. We can clearly see that the model was able to reproduce the edges of the bridge and the canal very well.
As the model progresses through the epochs, from Epoch 1 to Epoch 200 the bridge structure becomes clearly visible, and the cliff, its greenery and some of the surrounding buildings are also clearly visible in the SR image (center).
By the end of Epoch 400, the colour of the sky has improved considerably, as has the water near the bridge. So, starting from a low resolution image in which individual pixels are visible to the eye, the model was able to regenerate the image with much finer detail.
Conclusion
Even with these relatively simple architectures, we were able to generate reasonable, if not perfect, images with all three forms of GAN. By providing more image data and more time to learn the features and fine detail of the images, the models would certainly produce outputs much closer to the original data.
IMAGE COMPRESSION:
Compression-decompression description
The compression-decompression task involves compressing data, sending it with low internet traffic usage, and then decompressing it. The objective of the process is to achieve a minimal difference between the original and the decompressed images, i.e., to obtain the same image quality after compression-decompression as before the data transfer.
Figure 2. Schema of the compression-decompression method. Data is the initial image file, the encoder is the compression process, 'data compressed' is the file after compression, the decoder is the decompression process, and Data* is the decompressed file.
To compare the performance of different methods, we first measure the compression coefficient and then apply the SSIM and PSNR metrics to measure the similarity between the original image and the decompressed image (all these metrics are described in the Metrics section below).
As we demonstrate in the Results section, different methods achieve different objectives: some produce high-quality image results while having low compression efficiency, others reach high compression efficiency while producing low-quality image results.
Dataset
We selected 10 images to compare and test the different methods on the compression task. The dataset shows 5 bottles of Italian wine and 1 bottle of sauce (we chose this type of picture so that the methods can later be reused for the bottle detection task within the company's 'Bottle detection and classification' project). Examples of the images are presented in Figure 3:
Figure 3. Dataset for experiments with image compression methods (test data).
For the JPEG compression method, we use the PIL library for Python to compress the .bmp images to .png and JPEG formats (the code for running this is posted on GitHub). JPEG (Joint Photographic Experts Group)[10] is a standard image format for containing lossy, compressed image data. The format was introduced in the early '90s, and since then it has become the most widely used image compression standard in the world[11]. The main basis of JPEG's lossy compression algorithm is the discrete cosine transform: this mathematical operation converts each block of the image from the spatial (2D) domain into the frequency domain. The JPEG standard specifies the codec, which defines how an image is compressed into a stream of bytes and decompressed back into an image.
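As a sketch of this step (file names are illustrative; the quality parameter controls the strength of the JPEG compression):

from PIL import Image

img = Image.open('1.bmp')               # original uncompressed image
img.save('1.png')                       # lossless PNG version
img.save('1.jpg', 'JPEG', quality=90)   # lossy JPEG version; lower quality gives a smaller file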
We tested several machine learning models (the code for testing is posted on GitHub) and chose the most suitable ones (models which are effortless to run, require minimal GPU, and can be evaluated using the selected metrics).
The first model is taken from the paper "Variational image compression with a scale hyperprior"[5]. Its architecture is shown in Figure 4.
Figure 4. The architecture of the network proposed in "Variational image compression with a scale hyperprior".
We employed the TensorFlow framework[9] to compare the models because all of them can be run within the same framework, which is convenient for our task. We used Google Colab to run the models because it provides a free GPU. Below, we show the code for running the framework for the Factorized Prior Autoencoder model (installation instructions are given in the Colab notebook).
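Following the command pattern used for the other models below, the compression call for this model is presumably:
!python tfci.py compress bmshj2018-factorized-msssim-6 /1.png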
The number 6 at the end of the name indicates the quality level (1: lowest, 8: highest).
We experimented with several quality levels, and in the results table we include the models which give approximately similar performance on the SSIM metric (around 0.97), namely bmshj2018-factorized-msssim-6 in Table 2.
This script runs compression and produces a compressed file with a .tfci extension alongside the target input image (1.png). This file, 1.png.tfci, is the so-called compressed data from the schema in Figure 2.
Decompression in TensorFlow:
!python tfci.py decompress /1.png.tfci
This script produces a file with extension .png in addition to the compressed file name, for example,
1.png.tfci.png. The decompression code is the same for other models described below.
The second model is a nonlinear transform coder with factorized priors (entropy models) optimized for MSE, with GDN (generalized divisive normalization) activation functions and 128 filters per layer[4]. Its architecture is shown in Figure 5. It was also run on the TensorFlow framework[9].
Figure 5. Schema of model architecture for nonlinear transform coder with factorized priors (entropy
models) optimized for MSE, with GDN[12].
GDN is typically applied to linear filter responses z = Hx, where x is a vector of image data, or to linear filter responses inside a composite function such as an ANN (artificial neural network). Its general form is defined as
y_i = z_i / (β_i + Σ_j γ_ij |z_j|^α_ij)^ε_i,
where y represents the vector of normalized responses, and the vectors β, ε and matrices α, γ represent the parameters of the transformation (all non-negative).
Compression in TensorFlow for the nonlinear transform coder model with factorized priors (entropy models), optimized for MSE, with GDN activation functions:
!python tfci.py compress b2018-gdn-128-4 /1.png
The number 1-4 at the end indicates the quality level (1: lowest, 4: highest). We experiment with different quality levels and choose the model which produces an SSIM quality of approximately 0.97 (b2018-gdn-128-4 in Table 2).
The third model is a hyperprior model with non-zero-mean Gaussian conditionals (without autoregression), optimized for MS-SSIM (multiscale SSIM)[6]. Its architecture is shown in Figure 6. It was also run on the TensorFlow framework[9].
Figure 6. Model architecture for hyperprior model with non zero-mean Gaussian conditionals (without
autoregression) [6].
Compression in TensorFlow for the hyperprior model with non-zero-mean Gaussian conditionals (without autoregression) uses the same script.
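By analogy with the commands shown for the previous models, the call is presumably:
!python tfci.py compress mbt2018-mean-msssim-5 /1.png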
The number 1-8 at the end indicates the quality level (1: lowest, 8: highest). We experiment with different quality levels and choose the model which produces an SSIM quality of approximately 0.97 (mbt2018-mean-msssim-5 in Table 2).
Metrics
The performance of image compression-decompression methods can be evaluated using several metrics[4]:
Compression efficiency / compression coefficient - the ratio between the compressed and the initial data (image) size,
Image quality (distortion measurement) - the difference between the original image and the compressed/decompressed image,
Computational cost - the time required to compute the compression and any additional hardware required, such as GPU units.
Below, we summarize two metrics used for comparison, namely, compression efficiency/compression
coefficient, and image quality.
N_compression is the compression coefficient, equal to the size of the compressed data divided by the size of the initial data. Size(compressed data) is the file size in bytes after the model's compression. Size(uncompressed data) equals the image's height * width * channels in bytes (one byte per channel). Our evaluation dataset has 10 images of equal size, with width 576 px, height 768 px and channels = 3, so size(uncompressed data) = 576 * 768 * 3 = 1,327,104 bytes.
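As a small sketch of this computation (assuming the compressed .tfci file produced by one of the models above):

import os

height, width, channels = 768, 576, 3
uncompressed_size = height * width * channels        # bytes, one byte per channel
compressed_size = os.path.getsize('1.png.tfci')      # file size in bytes after compression
n_compression = compressed_size / uncompressed_size
print(f'N_compression = {n_compression:.3f}')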
Image quality
To compare the quality of the compression we use two quality metrics, computed between each original image and its decompressed counterpart, where Quality_metric is either SSIM or PSNR. Below, we show the formulas for those metrics.
SSIM
In image comparison, the mean squared error (MSE) is simple to implement, but it is not highly indicative
of the perceived similarity. Structural similarity aims to address this shortcoming by taking texture into
account[7].
SSIM(x, y) = ((2 μx μy + c1)(2 σxy + c2)) / ((μx² + μy² + c1)(σx² + σy² + c2)),
where x, y are the images to compare, μx, μy are the averages of x and y, σx², σy² are the variances of x and y, σxy is the covariance of x and y, and c1 and c2 are two variables that stabilize the division when the denominator is weak.
from skimage.metrics import structural_similarity
SSIM = structural_similarity(img1, img2, multichannel=True)
PSNR
PSNR is defined as PSNR = 10 · log10(R² / MSE), where MSE is the mean squared error between the two images and R is the maximum fluctuation in the input image data type. For example, if the input image has a double-precision floating-point data type, then R is 1. If it has an 8-bit unsigned integer data type, R is 255.
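A small sketch using scikit-image (img1 and img2 are assumed to be the original and decompressed images as 8-bit arrays):

from skimage.metrics import peak_signal_noise_ratio

PSNR = peak_signal_noise_ratio(img1, img2, data_range=255)  # R = 255 for 8-bit images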
Table 2. Results obtained for three different neural network models: the Factorized Prior Autoencoder bmshj2018-factorized-msssim-6[5], the nonlinear transform coder with factorized priors b2018-gdn-128-4[4], and the hyperprior model with non-zero-mean Gaussian conditionals mbt2018-mean-msssim-5[6].
Conclusions
We compared the classical JPEG compression method with three different machine learning models for the compression-decompression task using the TensorFlow framework, applying several metrics to compare their performance. The results are as follows: at roughly equal SSIM quality (about 0.97), the best compression was produced by the mbt2018-mean-msssim-5 model (N_compression approximately 0.13). The next best compression model is bmshj2018-factorized-msssim-6 (N_compression approximately 0.23). After this follows the classical JPEG compression method with an N_compression of around 0.288. The weakest compression is given by the b2018-gdn-128-4 model (N_compression approximately 0.29). At the same time, the PSNR metric is approximately the same for all the neural network models (about 35), meaning that the MSE-based quality of the images after compression-decompression is almost the same for every model. It is also interesting to note that the PSNR metric is higher for the JPEG method.