DL Unit 3
CNN Architectures:
In deep learning there are several types of models, such as Artificial Neural Networks (ANN), Autoencoders, Recurrent Neural Networks (RNN) and Reinforcement Learning. One model in particular has contributed enormously to computer vision and image analysis: the Convolutional Neural Network (CNN), also called a ConvNet.
CNN is very useful as it minimises human effort by automatically detecting the features.
For example, for apples and mangoes, it would automatically detect the distinct features
of each class on its own.
CNNs are a class of deep neural networks that can recognise and classify particular features from images and are widely used for analysing visual data. Their applications include image and video recognition, image classification, medical image analysis, computer vision and natural language processing.
CNNs achieve high accuracy, which makes them well suited to image recognition. Image recognition itself has a wide range of uses across industries, such as medical image analysis, smartphones, security and recommendation systems.
The term "convolution" in CNN refers to the mathematical convolution operation, a special kind of linear operation in which two functions are combined to produce a third function that expresses how the shape of one is modified by the other. In simple terms, the image and a filter, both represented as matrices, are combined to give an output that is used to extract features from the image.
Basic Architecture
A CNN architecture has two main parts:
A convolution tool that separates and identifies the various features of the image for analysis, in a process called feature extraction.
A fully connected layer that takes the output of the convolution process and predicts the class of the image based on the features extracted in the previous stages.
The feature-extraction part of the CNN aims to reduce the number of features present in the dataset: it creates new features that summarise the features contained in the original set. There are several CNN layers, as shown in the CNN architecture diagram.
1. Convolution Layers
There are three types of layers that make up a CNN: convolutional layers, pooling layers and fully-connected (FC) layers. Stacking these layers forms a CNN architecture. In addition to these three layer types, two more components are important: the dropout layer and the activation function, both of which are described below. In a convolution layer, a kernel (filter) slides over the input image and computes a dot product at each position.
The output is termed the feature map, which gives us information about the image such as corners and edges. This feature map is then fed to subsequent layers to learn further features of the input image. The convolution layer passes its result to the next layer after applying the convolution operation to the input. Convolutional layers benefit the network because they keep the spatial relationship between pixels intact.
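To make this concrete, here is a minimal TensorFlow/Keras sketch (the image here is random data standing in for a real input) showing a convolution layer producing one feature map per kernel while preserving the spatial layout:

import numpy as np
import tensorflow as tf

# A dummy batch containing one 28 x 28 RGB image; stands in for a real image.
image = np.random.rand(1, 28, 28, 3).astype("float32")

# A convolution layer with 8 learnable 3 x 3 kernels; 'same' padding keeps
# the spatial relationship between neighbouring pixels intact.
conv = tf.keras.layers.Conv2D(filters=8, kernel_size=3, padding="same", activation="relu")

feature_maps = conv(image)
print(feature_maps.shape)   # (1, 28, 28, 8): one feature map per kernel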
2. Pooling Layer
In most cases, a convolutional layer is followed by a pooling layer. The pooling layer summarises the outputs of neighbouring positions, gradually reducing the spatial size of the representation and hence the amount of computation and the number of parameters (pooling is discussed in more detail later in this unit).
3. Fully Connected Layer
The Fully Connected (FC) layer consists of the weights and biases along with the neurons and is used to connect the neurons between two different layers. These layers are usually placed before the output layer and form the last few layers of a CNN architecture.
Here, the output from the previous layers is flattened and fed to the FC layer. The flattened vector then passes through a few more FC layers, where the mathematical operations usually take place, and the classification process begins. The reason two layers are connected is that two fully connected layers
perform better than a single one. These layers reduce the need for human supervision.
4. Dropout
Usually, when all the features are connected to the FC layer, the model can overfit the training dataset. Overfitting occurs when a model works so well on the training data that it performs poorly on new data.
To overcome this problem, a dropout layer is used: a fraction of the neurons is randomly dropped from the network during training, which effectively reduces the size of the model. With a dropout rate of 0.3, 30% of the nodes are dropped out at random.
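A minimal sketch of this behaviour with tf.keras.layers.Dropout (the all-ones input is just for illustration):

import tensorflow as tf

dropout = tf.keras.layers.Dropout(rate=0.3)   # drop roughly 30% of the units

x = tf.ones((1, 10))
# training=True enables dropping; surviving units are scaled up by 1 / (1 - 0.3)
print(dropout(x, training=True))
# training=False (inference) passes the input through unchanged
print(dropout(x, training=False))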
5. Activation Functions
Finally, one of the most important components of the CNN model is the activation function. Activation functions are used to learn and approximate any kind of continuous and complex relationship between the variables of the network. In simple words, they decide which information should be passed forward through the network and which should not.
They add non-linearity to the network. Several activation functions are commonly used, such as ReLU, softmax, tanh and sigmoid, and each has a specific role. For a binary classification CNN model, sigmoid and softmax functions are preferred, and for multi-class classification, softmax is generally used. In simple terms, an activation function determines whether a neuron should be activated or not, i.e. whether its input is important for the prediction.
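As a small illustration of these output-layer choices (the feature size of 64 and the class count of 10 are arbitrary assumptions):

import tensorflow as tf

# Binary classification head: one sigmoid unit gives the probability of class 1.
binary_head = tf.keras.layers.Dense(1, activation="sigmoid")

# Multi-class head: softmax turns 10 scores into a probability distribution.
multiclass_head = tf.keras.layers.Dense(10, activation="softmax")

features = tf.random.normal((1, 64))             # stand-in for flattened CNN features
print(binary_head(features))                      # shape (1, 1), value between 0 and 1
print(tf.reduce_sum(multiclass_head(features)))   # probabilities sum to 1.0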
2. Analyzing Documents
Convolutional neural networks can also be used for document analysis. This is not only useful for handwriting analysis but also plays a major role in recognisers. For a machine to scan an individual's handwriting and compare it to the wide database it holds, it must execute almost a million commands a minute. With CNNs and newer models and algorithms, the error rate has reportedly been brought down to as low as 0.4% at character level, although complete testing of this is yet to be widely seen.
3. Collecting Historic and Environmental Elements
CNNs are also used for more complex purposes such as natural history
collections. These collections act as key players in documenting major parts of history
such as biodiversity, evolution, habitat loss, biological invasion, and climate change.
4. Understanding Climate
CNNs can play a major role in the fight against climate change, especially in understanding why we see such drastic changes and how we could experiment with curbing their effects. It is said that the data in such natural history collections can also provide greater social and scientific insights, but this would require skilled human resources, such as researchers who can physically visit these repositories, and more manpower to carry out deeper experiments in this field.
5. Understanding Gray Areas
The introduction of gray areas into CNNs is expected to provide a much more realistic picture of the real world. Currently, CNNs largely function like a machine, seeing a true or false value for every question. However, as humans, we understand that the real world plays out in a thousand shades of gray. Allowing the machine to understand and process fuzzier logic will help it understand the gray area we humans live in, and will give CNNs a more holistic view of what humans see.
6. Advertising
CNNs are poised to be the future, with their introduction into driverless cars, robots that can mimic human behaviour, aids to human genome mapping projects, prediction of earthquakes and natural disasters, and perhaps even self-diagnosis of medical problems. You might not even have to drive to a clinic or schedule an appointment with a doctor to check that a sneezing attack or a high fever is just the simple flu and not the symptom of some rare disease. One problem that researchers are working on with CNNs is brain cancer detection; earlier detection of brain cancer can be a big step in saving more lives affected by this illness.
Convolution Layer
The convolution layer is the core building block of the CNN. It carries the main
portion of the network’s computational load.
This layer performs a dot product between two matrices, where one matrix is the
set of learnable parameters otherwise known as a kernel, and the other matrix is the
restricted portion of the receptive field. The kernel is spatially smaller than the image but extends through its full depth. This means that, if the image is composed of three (RGB) channels, the kernel height and width will be spatially small, but its depth extends across all three channels.
During the forward pass, the kernel slides across the height and width of the image, producing a representation of each receptive region. The result is a two-dimensional representation of the image known as an activation map, which gives the response of the kernel at each spatial position. The step size with which the kernel slides is called the stride.
If we have an input of size W x W x D and Dout kernels with a spatial size of F, stride S and amount of padding P, then the size of the output volume can be determined by the following formula:

Wout = (W - F + 2P) / S + 1

giving an output volume of size Wout x Wout x Dout.
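As a quick check of the formula, a small helper (written here for illustration, not part of the notes):

def conv_output_size(W, F, S, P):
    # Spatial size of the convolution output: (W - F + 2P) / S + 1
    return (W - F + 2 * P) // S + 1

# Example: a 32 x 32 x D input with 5 x 5 kernels, stride 1 and padding 2
# keeps its 32 x 32 spatial size; the depth of the output is Dout.
print(conv_output_size(W=32, F=5, S=1, P=2))   # 32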
Convolution leverages three important ideas that motivated computer vision researchers:
sparse interaction, parameter sharing, and equivariant representation. Let’s describe each
one of them in detail.
Sparse interaction: traditional neural network layers use matrix multiplication by a matrix of parameters that describes the interaction between every input unit and every output unit. Convolutional neural networks, however, have sparse interactions. This is achieved by making the kernel smaller than the input: an image may have millions or thousands of pixels, but while processing it with a kernel we can detect meaningful information spanning only tens or hundreds of pixels. This means that we need to store fewer parameters, which not only reduces the memory requirement of the model but also improves its statistical efficiency.
Parameter sharing: if computing a feature at one spatial point (x1, y1) is useful, it should also be useful at another spatial point, say (x2, y2). In other words, for a single two-dimensional slice (i.e. for creating one activation map), neurons are constrained to use the same set of weights. In a traditional neural network, each element of the weight matrix is used once and never revisited, whereas a convolutional network has shared parameters: the weights applied to one part of the input are the same as the weights applied elsewhere.
Equivariant representation: due to parameter sharing, the layers of a convolutional neural network have a property called equivariance to translation. This means that if the input is shifted, the output is shifted in the same way.
Pooling Layer
The pooling layer replaces the output of the network at certain locations by deriving a
summary statistic of the nearby outputs. This helps in reducing the spatial size of the
representation, which decreases the required amount of computation and weights. The
pooling operation is processed on every slice of the representation individually.
There are several pooling functions such as the average of the rectangular neighbourhood,
L2 norm of the rectangular neighbourhood, and a weighted average based on the distance
from the central pixel. However, the most popular process is max pooling, which reports
the maximum output from the neighbourhood.
In all cases, pooling provides some translation invariance which means that an object
would be recognizable regardless of where it appears on the frame.
Fully Connected Layer
Neurons in this layer have full connectivity with all neurons in the preceding and succeeding layers, as in a regular fully connected neural network (FCNN). Its output can therefore be computed as usual by a matrix multiplication followed by a bias offset.
The FC layer helps to map the representation between the input and the output.
Non-Linearity Layers
Since convolution is a linear operation and images are far from linear, non-linearity layers
are often placed directly after the convolutional layer to introduce non-linearity to the
activation map.
There are several types of non-linear operations, the popular ones being:
1. Sigmoid
The sigmoid non-linearity has the mathematical form σ(κ) = 1 / (1 + e^(−κ)). It takes a real-valued number and "squashes" it into the range between 0 and 1.
However, a very undesirable property of the sigmoid is that when the activation saturates at either tail, the gradient becomes almost zero. If the local gradient is very small, it will effectively "kill" the gradient during backpropagation. Also, if the data coming into a neuron is always positive, then the gradients on its weights will be either all positive or all negative, resulting in a zig-zag dynamic of gradient updates for the weights.
2. Tanh - The tanh non-linearity squashes a real-valued number into the range [-1, 1]. Like sigmoid, its activation saturates, but, unlike the sigmoid neuron, its output is zero-centred.
3. ReLU - The Rectified Linear Unit (ReLU) has become very popular in the last few years. It computes the function ƒ(κ) = max(0, κ). In other words, the activation is simply thresholded at zero.
In comparison to sigmoid and tanh, ReLU is more reliable and accelerates convergence by six times.
Unfortunately, ReLU can be fragile during training: a large gradient flowing through a ReLU neuron can update its weights in such a way that the neuron never activates again and is never updated further. This can be mitigated by setting a proper learning rate.
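Evaluating the three non-linearities on a few sample values (chosen arbitrarily) makes the saturation of sigmoid and tanh and the zero threshold of ReLU easy to see:

import tensorflow as tf

x = tf.constant([-10.0, -1.0, 0.0, 1.0, 10.0])

print(tf.math.sigmoid(x))   # squashes into (0, 1); nearly flat (saturated) at the tails
print(tf.math.tanh(x))      # squashes into (-1, 1); zero-centred but also saturates
print(tf.nn.relu(x))        # thresholds at zero: [0, 0, 0, 1, 10]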
[Example: designing a small CNN with two convolution layers and two pooling layers.]
For both convolution layers, we will use kernels of spatial size 5 x 5 with stride 1 and padding 2. For both pooling layers, we will use the max-pool operation with kernel size 2, stride 2 and zero padding.
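A possible Keras sketch of such a network is shown below; the number of filters (6 and 16), the 32 x 32 x 1 input size and the 10 output classes are assumptions for illustration, since the notes only fix the kernel, stride and padding choices (padding 2 with a 5 x 5 kernel and stride 1 corresponds to 'same' padding in Keras):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 1)),
    # Conv layer 1: 5 x 5 kernels, stride 1, padding 2 ('same')
    tf.keras.layers.Conv2D(6, kernel_size=5, strides=1, padding="same", activation="relu"),
    # Pooling layer 1: max pool, kernel size 2, stride 2
    tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
    # Conv layer 2: 5 x 5 kernels, stride 1, padding 2 ('same')
    tf.keras.layers.Conv2D(16, kernel_size=5, strides=1, padding="same", activation="relu"),
    # Pooling layer 2: max pool, kernel size 2, stride 2
    tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()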
Pooling layers are one of the building blocks of convolutional neural networks. Where convolutional layers extract features from images, pooling layers consolidate the features learned by the CNN. Their purpose is to gradually shrink the spatial dimension of the representation to minimise the number of parameters and computations in the network.
Pooling makes the CNN invariant to translations: even if the input of the CNN is translated, the CNN will still be able to recognise the features in it.
In all cases, pooling helps make the representation approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change.
How do pooling layers achieve this? A pooling layer is added after the convolutional layer(s), as seen in the structure of a CNN above. It downsamples the output of the convolutional layers by sliding a filter of some size over it with some stride and computing the maximum or the average of the input in each window.
Let's explore the working of pooling layers using TensorFlow. Create a NumPy array and reshape it to the (batch, height, width, channels) format that Keras layers expect.

import numpy as np
import tensorflow as tf

matrix = np.array([[3., 2., 0., 0.],
                   [0., 7., 1., 3.],
                   [5., 2., 3., 0.],
                   [0., 9., 2., 3.]]).reshape(1, 4, 4, 1)
Max Pooling
Create a MaxPool2D layer with pool size = 2 and strides = 2. Apply the MaxPool2D
layer to the matrix, and you will get the MaxPooled output in the tensor form. By
applying it to the matrix, the Max pooling layer will go through the matrix by computing
the max of each 2×2 pool with a jump of 2. Print the shape of the tensor. Use tf.squeeze
to remove dimensions of size 1 from the shape of a tensor.
max_pooling=tf.keras.layers.MaxPool2D(pool_size=2,strides=2)
max_pooled_matrix=max_pooling(matrix)
print(max_pooled_matrix.shape)
print(tf.squeeze(max_pooled_matrix))
Average Pooling
Create an AveragePooling2D layer with the same pool_size of 2 and strides of 2, and apply it to the matrix. The average-pooling layer will go through the matrix, computing the average of each 2×2 pool with a jump of 2. Print the shape of the output and use tf.squeeze to convert it into a readable form by removing all dimensions of size 1.
average_pooling = tf.keras.layers.AveragePooling2D(pool_size=2, strides=2)
average_pooled_matrix = average_pooling(matrix)
print(average_pooled_matrix.shape)
print(tf.squeeze(average_pooled_matrix))
Global Pooling
Global pooling layers often replace the classifier's fully connected or Flatten layer. The model instead ends with a convolutional layer that produces as many feature maps as there are target classes, and global average pooling is applied to each feature map to combine it into a single value.
Create the same NumPy array but with a different shape. Whatever the spatial shape of the feature maps, the global pooling layer reduces each one to a single value.
matrix=np.array([[[3.,2.,0.,0.],
[0.,7.,1.,3.]],
[[5.,2.,3.,0.],
[0.,9.,2.,3.]]]).reshape(1,2,2,4)
global_average_pooling=tf.keras.layers.GlobalAveragePooling2D()
global_average_pooled_matrix=global_average_pooling(matrix)
print(global_average_pooled_matrix)
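For the array above, each of the four channels is averaged over its 2×2 spatial positions, so the printed tensor is [[2.0, 5.0, 1.5, 1.5]] with shape (1, 4): one value per feature map.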
Conclusion
In general, pooling layers are useful when you want to detect an object in an image regardless of its position. The consequences of adding pooling layers are reduced overfitting, increased efficiency and faster training of a CNN model. While max pooling draws out the most prominent features of an image, average pooling smooths the image while retaining the essence of its features. Global pooling layers often replace the Flatten or Dense output layers.
Transfer Learning for Deep Learning
The reuse of a previously learned model on a new problem is known as transfer learning. It is particularly popular in deep learning right now, since it allows deep neural networks to be trained with comparatively little data.
With transfer learning, we basically try to use what has been learned in one task to better understand concepts in another. Weights learned by a network on "task A" are transferred to a network that must perform a new "task B."
Because of the massive amount of computational power required, transfer learning is typically applied to computer vision and natural language processing tasks such as sentiment analysis.
In computer vision, neural networks typically learn to detect edges in the first layers, shapes in the middle layers and task-specific features in the later layers. In transfer learning, the early and middle layers are reused and only the later layers are retrained, so the model makes use of the labelled data from the task it was originally trained on.
Consider, for example, a model that was trained to identify a backpack in an image and will now be used to detect sunglasses. Because the model has already learned to recognise objects in its earlier layers, we simply retrain the later layers so it learns what distinguishes sunglasses from other objects.
Transfer learning is particularly valuable in data science, since most real-world problems do not come with millions of labelled data points with which to train complicated models from scratch.
Because the model has already been pre-trained, a good machine learning model can be produced with fairly little training data using transfer learning. This is especially useful in natural language processing, where creating huge labelled datasets requires a lot of expert knowledge. Training time is also reduced, because building a deep neural network from scratch for a complex task can take days or even weeks.
Transfer learning only works, however, if the features learned in the first task are general, meaning they can be applied to the new task. Furthermore, the input to the model must be the same size as the input it was originally trained on; if it is not, add a step that resizes your input to the required size.
3. EXTRACTION OF FEATURES
Another option is to utilise deep learning to identify the optimum representation of your
problem, which comprises identifying the key features. This method is known as
representation learning, and it can often produce significantly better results than hand-
designed representations.
Feature creation in machine learning is mainly done by hand by researchers and domain
specialists. Deep learning, fortunately, can extract features automatically. Of course, this
does not diminish the importance of feature engineering and domain knowledge; you
must still choose which features to include in your network.
Neural networks, on the other hand, have the ability to learn which features are critical
and which aren’t. Even for complicated tasks that would otherwise necessitate a lot of
human effort, a representation learning algorithm can find a decent combination of
characteristics in a short amount of time.
The learned representation can then be applied to a variety of other problems: simply use the initial layers to obtain the feature representation, but avoid using the network's final output, which is too task-specific. Instead, feed data into the network and take the output of one of its intermediate layers.
This layer's output can then be interpreted as a representation of the raw data.
This method is commonly used in computer vision because it can reduce the size of your representation, cutting computation time and making the data more suitable for classical algorithms.
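A minimal sketch of this idea in Keras is given below; VGG16 and the layer name 'block4_pool' are only illustrative choices, and the random image stands in for real data:

import numpy as np
import tensorflow as tf

# A network pre-trained on ImageNet, loaded without its task-specific classifier head.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False)

# A model whose output is an intermediate layer rather than the final prediction.
feature_extractor = tf.keras.Model(
    inputs=base.input,
    outputs=base.get_layer("block4_pool").output,
)

# Dummy image batch in [0, 255], preprocessed the way VGG16 expects.
images = np.random.rand(1, 224, 224, 3).astype("float32") * 255.0
features = feature_extractor(tf.keras.applications.vgg16.preprocess_input(images))
print(features.shape)   # the intermediate representation of the input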
Models That Have Been Pre-Trained
There are a number of popular pre-trained machine learning models available. One of them is the Inception-v3 model, which was developed for the ImageNet "Large Scale Visual Recognition Challenge." Participants in this challenge had to classify pictures into 1,000 categories such as "zebra," "Dalmatian" and "dishwasher."
Topic II
Transfer learning is about leveraging feature representations from a pre-trained model, so
you don’t have to train a new model from scratch.
The pre-trained models are usually trained on massive datasets that are standard benchmarks in computer vision. The weights obtained from these models can be reused in other computer vision tasks.
These models can be used directly in making predictions on new tasks or integrated into
the process of training a new model. Including the pre-trained models in a new model
leads to lower training time and lower generalization error.
Transfer learning is particularly very useful when you have a small training dataset. In
this case, you can, for example, use the weights from the pre-trained models to initialize
the weights of the new model. As you will see later, transfer learning can also be applied
to natural language processing problems.
Models trained on the ImageNet can be used in real-world image classification problems.
This is because the dataset contains over 1000 classes. Let’s say you are an insect
researcher. You can use these models and fine-tune them to classify insects.
Classifying text requires knowledge of word representations in some vector space. You
can train vector representations yourself. The challenge here is that you might not have
enough data to train the embeddings. Furthermore, training will take a long time. In this
case, you can use a pre-trained word embedding like GloVe to hasten your development
process.
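As a rough sketch of that idea (the GloVe file path and the tiny vocabulary below are assumptions purely for illustration; in practice the vocabulary comes from a tokenizer fitted on your corpus):

import numpy as np
import tensorflow as tf

EMBED_DIM = 100
word_index = {"good": 1, "bad": 2, "movie": 3}   # toy vocabulary

# Parse the downloaded GloVe vectors (one word and its vector per line).
embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *vector = line.split()
        embeddings[word] = np.asarray(vector, dtype="float32")

# Copy the pre-trained vectors into an embedding matrix; unknown words stay zero.
embedding_matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
for word, i in word_index.items():
    if word in embeddings:
        embedding_matrix[i] = embeddings[word]

# Freeze the layer so the pre-trained vectors are not changed while training.
embedding_layer = tf.keras.layers.Embedding(
    input_dim=len(word_index) + 1,
    output_dim=EMBED_DIM,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)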
Overfitting can be avoided by retraining the model, or part of it, with a low learning rate. This is important because it prevents significant updates to the weights, which would otherwise lead to poor performance. Using a callback that stops the training process once the model has stopped improving is also helpful.
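A brief sketch of both safeguards (the learning rate and patience values are arbitrary examples):

import tensorflow as tf

# A small learning rate keeps weight updates small when retraining a pre-trained model.
low_lr_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

# Stop training once the validation loss has stopped improving,
# and restore the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

# These would then be used as:
#   model.compile(optimizer=low_lr_optimizer, loss=..., metrics=[...])
#   model.fit(..., callbacks=[early_stop])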
Freeze the pre-trained base model so that its weights are not updated during training:

base_model.trainable = False
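The step that originally defined base_model is not included in these notes; a plausible sketch, assuming a MobileNetV2 base pre-trained on ImageNet as in TensorFlow's tutorial on this same cats-and-dogs dataset, would be:

import tensorflow as tf

IMG_SIZE = (160, 160)

# Pre-trained network loaded without its classification head, to serve as the
# frozen feature-extraction base (MobileNetV2 is an assumption, not from the notes).
base_model = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,),
    include_top=False,
    weights="imagenet",
)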
Step 2: Download the cats-vs-dogs dataset and create the training dataset.

import os
import tensorflow as tf

_URL = 'https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip'
path_to_zip = tf.keras.utils.get_file('cats_and_dogs.zip', origin=_URL, extract=True)
PATH = os.path.join(os.path.dirname(path_to_zip), 'cats_and_dogs_filtered')

# Train and validation directories inside the extracted dataset
train_dir = os.path.join(PATH, 'train')
validation_dir = os.path.join(PATH, 'validation')

BATCH_SIZE = 32
IMG_SIZE = (160, 160)

train_dataset = tf.keras.utils.image_dataset_from_directory(train_dir,
                                                            shuffle=True,
                                                            batch_size=BATCH_SIZE,
                                                            image_size=IMG_SIZE)
Output:
Step 3:
validation_dataset = tf.keras.utils.image_dataset_from_directory(validation_dir,
shuffle=True,
batch_size=BATCH_SIZE,
image_size=IMG_SIZE)
Output:
Step 4: Visualise a few training images with their class names.

import matplotlib.pyplot as plt

class_names = train_dataset.class_names

plt.figure(figsize=(10, 10))
for images, labels in train_dataset.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(class_names[labels[i]])
        plt.axis("off")
Output:
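The remaining steps of the walkthrough are not included in these notes. A hedged sketch of how the frozen base model is typically combined with a small classification head for this binary cats-vs-dogs task, assuming the MobileNetV2 base sketched earlier (layer choices follow the common pattern, not these notes), is:

import tensorflow as tf

# Assemble the transfer-learning model: frozen base + small trainable head.
inputs = tf.keras.Input(shape=IMG_SIZE + (3,))
x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)   # scale pixels to [-1, 1]
x = base_model(x, training=False)                 # frozen feature extractor
x = tf.keras.layers.GlobalAveragePooling2D()(x)   # one value per feature map
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)       # cat vs. dog
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_dataset, validation_data=validation_dataset, epochs=10)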