
Convolutional Neural Networks

Let's understand the notion of a convolutional neural network (CNN) and how it is
different from the kind of neural network we learned earlier.

Let's start by understanding the first layer. Say our input is a grayscale image of size 256 x 256. A grayscale image has only one channel, so the image size is 256 x 256 x 1.


We can break this image up into 8 x 8 image patches. The patch is nothing but a small
portion of an image.

In total, we have a 32 x 32 grid of these patches that comprise the image.

Now let's consider a linear function on just an 8 x 8 image patch. This function is sometimes called a filter.

In our case, a filter is an 8 x 8 grid of weights, learned during backpropagation, that captures patterns in the image. When we apply a filter to an image patch, we take the inner product between them as vectors.

So if we apply a filter to each one of the patches, we get a new 32 x 32 grid of numbers. What good is this? Remember, the intuition is that the first few layers detect simple features like edges.

We can think of a filter as a simple feature detector. Instead of learning different parameters for each image patch, why not share the same parameters across all of them? This drastically reduces the number of parameters, so there is far less to learn. If we take this idea to its natural conclusion, we get convolutional neural networks.
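The patch-and-shared-filter idea above can be sketched in a few lines of NumPy. This is a minimal illustration with non-overlapping patches and a single filter; the function name and the random inputs are made up for the example.

```python
import numpy as np

def apply_shared_filter(image, filt):
    """Apply one shared filter to every non-overlapping patch of the image.

    image: (H, W) grayscale array; filt: (p, p) grid of weights.
    Returns an (H // p, W // p) grid of inner products.
    """
    p = filt.shape[0]
    gh, gw = image.shape[0] // p, image.shape[1] // p
    out = np.empty((gh, gw))
    for i in range(gh):
        for j in range(gw):
            patch = image[i * p:(i + 1) * p, j * p:(j + 1) * p]
            out[i, j] = np.sum(patch * filt)  # inner product as vectors
    return out

image = np.random.rand(256, 256)   # a 256 x 256 grayscale image
filt = np.random.randn(8, 8)       # one 8 x 8 filter, shared by all patches
grid = apply_shared_filter(image, filt)
print(grid.shape)  # (32, 32)
```

Note that the shared filter contributes only 64 weights in total, instead of a separate set of 64 weights for each of the 1024 patches.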

In the first layer, we have a collection of filters, each one is applied to each of the image
patches,
and together they give the output of the first layer, after applying a
non-linearity at the end. This is already a major innovation because it means we can
work with much larger neural networks in practice. Just the first few layers are
convolutional and the others are general and fully connected.

Another important idea is the notion of dropout.

Here when we compute how well a neural network classifies some image, say through
the quadratic cost function, we instead randomly delete some fraction of the network,

and then compute the new function from the inputs to the outputs. The idea is that if a neural network continues to work even when we drop perceptrons from the intermediate layers, it must be spreading information out so that no node is a single point of failure. Training a neural network with dropout makes the function that we learn more robust.
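The random deletion described above can be sketched as follows. This version is the common "inverted dropout" variant, which is an assumption on my part; the text itself only describes deleting a fraction of the units.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5):
    """Randomly zero a fraction p_drop of the activations (inverted dropout).

    Scaling the survivors by 1 / (1 - p_drop) keeps the expected activation
    unchanged, so nothing needs to be rescaled at test time.
    """
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

a = np.ones(10_000)
dropped = dropout(a, p_drop=0.5)
print((dropped == 0).mean())  # roughly half the units are deleted
```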

Additional Content:

We have understood neural networks and how they can be used to build robust
models using numerical data. Let's assume we have an image with height = 6, width =
6, and the number of channels = 3 (colored image).

So 6 x 6 x 3 = 108 numbers are required to fully describe the image. Suppose the first hidden layer of a neural network has 10 units. The total number of parameters (weights) is then 108 x 10 = 1080. So we need 1080 weights for just one layer, and in practice images are much larger, typically around 224 x 224. In such cases we get a huge number of parameters to train, which is computationally expensive and does not necessarily improve performance. To deal with this kind of problem, we have special neural networks called Convolutional Neural Networks (CNNs).

A convolutional neural network is a type of neural network used in image processing and image classification. It takes the pixels of an image as input and generates the desired output.

Let’s understand the various building blocks of CNN:


1. Convolution
2. Pooling
3. Padding
4. Stride
5. Fully Connected Layer

Convolution:
The first step of a CNN is to detect features like edges and shapes. This is done by applying a convolution to the image using filters (a filter is responsible for detecting some kind of shape).

Let’s understand this using an example:


We take an input image of 6 X 6 and convolve this 6 X 6 matrix with a 3 X 3 filter:

Note: The blue matrix on the right above represents just the first application of the filter to the first 3 X 3 portion of the 6 X 6 image. It does not represent the final output of the convolution. The blue matrix will actually produce just one final number: the sum of the element-wise products (the product of the big number and the small number inside each square).

So, for example, the blue matrix gives:

3*1 + 0*0 + 1*(-1) + 1*1 + 5*0 + 8*(-1) + 2*1 + 7*0 + 2*(-1) = -5

Similar numbers result from a moving application of the 3 X 3 filter to corresponding 3 X 3 regions in the 6 X 6 image, first horizontally across each row and then down the rows of the whole image. This eventually gives a 4 X 4 final output after the convolution, where each number of the 4 X 4 output is computed from a sum of products, like the -5 computed above.

So after the convolution, we finally get a 4 X 4 image. The first element of the 4 X 4 matrix is calculated by taking the first 3 X 3 matrix from the 6 X 6 image, multiplying it element-wise with the filter, and summing:

3*1 + 0*0 + 1*(-1) + 1*1 + 5*0 + 8*(-1) + 2*1 + 7*0 + 2*(-1) = -5

Similarly, we convolve over the entire image and get a 4 X 4 matrix in the end.
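The sliding computation above can be written out directly. Only the top-left 3 X 3 patch and the filter's products are given in the text, so the remaining entries of the 6 X 6 image below are made up to complete the example; the filter is the vertical edge detector implied by the products shown.

```python
import numpy as np

def convolve2d(image, filt):
    """'Valid' convolution: slide filt over image, summing element-wise products."""
    n, f = image.shape[0], filt.shape[0]
    out = np.empty((n - f + 1, n - f + 1))
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * filt)
    return out

# Top-left 3 x 3 patch matches the text; the rest is illustrative filler.
image = np.array([
    [3, 0, 1, 2, 7, 4],
    [1, 5, 8, 9, 3, 1],
    [2, 7, 2, 5, 1, 3],
    [0, 1, 3, 1, 7, 8],
    [4, 2, 1, 6, 2, 8],
    [2, 4, 5, 2, 3, 9],
])
filt = np.array([   # vertical edge detector: columns of 1, 0, -1
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
])
out = convolve2d(image, filt)
print(out.shape)   # (4, 4)
print(out[0, 0])   # -5.0, the sum computed above
```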

In this convolution the total number of parameters = [3 x 3 x 3 (numbers in each filter) x 10 (number of filters)] + 10 (biases) = 280.

280 is much smaller than the corresponding number of parameters we would need in a fully connected neural network, which demonstrates the computational efficiency of the convolution operation.
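The two parameter counts from this section can be checked side by side. The dense layer here is the 10-unit fully connected alternative from the earlier 6 x 6 x 3 example:

```python
# Convolutional layer: 10 filters of size 3 x 3 x 3, each with one bias term.
conv_params = 3 * 3 * 3 * 10 + 10
print(conv_params)   # 280

# Fully connected alternative: flatten the 6 x 6 x 3 input (108 numbers)
# and connect it to a layer of 10 units.
dense_params = 6 * 6 * 3 * 10
print(dense_params)  # 1080

# For a realistic 224 x 224 x 3 image the gap explodes,
# while the convolutional layer still needs only 280 parameters.
print(224 * 224 * 3 * 10)  # 1505280
```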

Filters:
Filters are responsible for locating objects in an image by detecting changes in the intensity values of the image. A classic example is an edge detector, a filter whose weights are chosen so that it responds to edges in an image.
For example:


In images, we have a lot of complex features that need to be detected other than
edges. For that purpose, we randomly initialize filter values, and the model itself will
learn the best filter values for feature detection during the backpropagation phase.

Pooling:
Pooling is another technique used to reduce the spatial size of the representation, in order to reduce the number of parameters and the computational cost of the network. For example, max pooling keeps only the largest value in each window of the input.
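A minimal sketch of max pooling, assuming non-overlapping windows (i.e. a stride equal to the window size); the function name and the sample matrix are made up for the example:

```python
import numpy as np

def max_pool(image, size=2):
    """Max pooling: keep the largest value in each size x size window."""
    n = image.shape[0] // size
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = image[i * size:(i + 1) * size,
                              j * size:(j + 1) * size].max()
    return out

x = np.array([
    [1, 3, 2, 1],
    [2, 9, 1, 1],
    [1, 3, 2, 3],
    [5, 6, 1, 2],
])
print(max_pool(x))
# [[9. 2.]
#  [6. 3.]]
```

Note that the 4 x 4 input shrinks to 2 x 2 with no learned parameters at all, which is exactly why pooling is cheap.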

Strides:
When performing convolution, we slide the filter from the top-left corner to the bottom-right corner of the image; the size of each shift is called the stride. The stride also helps with dimensionality reduction: as the stride increases, the output shrinks and the required computation decreases.
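The effect of the stride on the output size follows the standard formula floor((n - f) / s) + 1 for an n x n input, f x f filter, and stride s (this formula is standard but not stated in the text, so treat it as an added assumption):

```python
def conv_output_size(n, f, s=1):
    """Output side length for an n x n input, f x f filter, stride s, no padding."""
    return (n - f) // s + 1

print(conv_output_size(6, 3, s=1))  # 4, matching the 4 x 4 output above
print(conv_output_size(6, 3, s=2))  # 2: a larger stride shrinks the output faster
```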

Padding:
Convolutional layers reduce the size of the output. In cases where we want to preserve the size of the output and retain the information at the corners of the image, we can use padding, which adds extra rows and columns around the outer edges of the image. The size of the output then stays the same as the input. We usually fill the extra rows and columns with zeros (zero padding).

Fully Connected Layers:

The result of applying the different filters is a matrix, so we have to flatten that matrix into a vector to feed it into the fully connected layer. In the picture shown below, the first matrix is the result we get after the image goes through the convolutional layers, and the second is the flattened layer that acts as the input to the fully connected layers.

After flattening, we pass this vector as input to the fully connected part of the network in order to get the results.

We now have an understanding of the building blocks of a CNN. CNNs do nothing more than arrange these building blocks in the right order. Usually, that order is a convolution layer followed by a pooling layer (repeated multiple times), and finally one or more fully connected layers.
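A shape-level walk through such a stack makes the arrangement concrete. The input size, filter sizes, and filter count below are hypothetical, chosen only to show how each block transforms the spatial dimensions:

```python
def conv(n, f):       # 'valid' convolution, stride 1: output side = n - f + 1
    return n - f + 1

def pool(n, size=2):  # non-overlapping pooling halves the side length
    return n // size

# Hypothetical stack on a 28 x 28 input: conv -> pool -> conv -> pool -> flatten.
n = 28
n = conv(n, 3)   # 26
n = pool(n)      # 13
n = conv(n, 4)   # 10
n = pool(n)      # 5
channels = 16    # assume the last convolution layer has 16 filters
flattened = n * n * channels
print(flattened)  # 400: length of the vector fed to the fully connected layer
```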

Let’s look at one of the historically famous CNN architectures in Deep Learning.

AlexNet :
● AlexNet is a masterpiece created by the SuperVision group, which included Alex
Krizhevsky, Geoffrey Hinton, and Ilya Sutskever from the University of Toronto.
● The winner of ImageNet 2012, AlexNet showed that deep learning was the way forward towards achieving the lowest error rates in computer vision tasks.


What is the architectural structure of AlexNet?

● A distinctive feature of AlexNet is its use of overlapping pooling to reduce the size of the network.
● With five convolutional layers and three fully connected layers, and the ReLU activation applied after every convolutional and fully connected layer, AlexNet showed the way towards state-of-the-art image classification results.
● ReLU speeds up training and improves accuracy; the regularization technique AlexNet uses is Dropout.

There are several other architectures that can be explored to get a better
understanding of which building blocks are to be used in the implementation of CNNs.
