Module 2
Convolutional Neural Networks
Dr. R. Bhargavi
Professor
SCOPE
VIT University
Computer Vision - Applications
• Image Classification (e.g., Malignant/Benign)
• Object Detection
• Style Transfer
Dr. R Bhargavi, VIT 2
Working with Images - Fully connected DNN
• A fully connected DNN/MLP takes only tabular (flat vector) data as input, so an image must be flattened first.
• It does not work well with images because it relies heavily on specific pixel positions. Hence any positional variance results in misclassification (example shown in the figure below).
• Features are treated as independent of each other, so the spatial relationship between neighbouring pixels is ignored.
• A traditional fully connected DNN has a huge number of learnable parameters.
  • Example: how many parameters are needed for images of size 1024 x 1024 x 3 with 2 hidden layers of size 1000?
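To put a number on the 1024 x 1024 x 3 question above, the count can be worked out directly. This is a minimal sketch, not part of the original slides: the helper name `dense_params` is hypothetical, and only the two hidden layers are counted because the number of output classes is not given.

```python
def dense_params(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total


n_inputs = 1024 * 1024 * 3  # flattened image: 3,145,728 input values
# Assumption: two hidden layers of 1000 units each, no output layer counted.
fc_total = dense_params([n_inputs, 1000, 1000])
print(f"{fc_total:,}")
```

Almost all of the roughly 3.1 billion parameters sit in the first layer, which connects every pixel value to every hidden unit.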
Working with Images – DNN (cont…)
[Figure: an input image, and the flattened input image fed to a fully connected DNN]
• As the conv layers learn the representations, the name "representation learning" is also used for DNNs.
Convolutional Neural Network - Architecture
[Figure: CNN architecture. Labels: convolutional layers (abstract features); FC layers (for classification). Blocks: Convolution, Pooling, Convolution, Pooling, Fully Connected, Fully Connected, Fully Connected. Marked: trainable layers.]
Convolutional Layer
• The convolutional layer is the core building block of a convolutional network.
• A conv layer's parameters consist of a set of learnable filters.
• Local connectivity: each neuron is connected only to a small region of the input, called the receptive field of that neuron in the feature map.
[Figure: a 2 x 2 input patch (x1, x2, x3, x4) is multiplied element-wise with a 2 x 2 filter (w1, w2, w3, w4) and summed to give one output z]
$z = b + \sum_{i} w_i x_i$
Convolutional Layer (cont…)
• Parameter sharing / weight sharing: within one conv layer, the same filter is used across the entire image.
• Rationale: if detecting a horizontal edge is important at some location in the image, it should intuitively be useful at other locations as well, due to the translationally invariant structure of images. There is therefore no need to relearn to detect a horizontal edge at every one of the distinct locations in the conv layer output volume.
Convolution Operation
[Figure: a kernel slides over the input to produce the feature map]
• With stride 1 and no padding, the output size is (nh - kh + 1) x (nw - kw + 1), where (nh x nw) is the size (height and width) of the input tensor and (kh x kw) is the size of the kernel.
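The output-size formula can be checked with a few lines of code. This is a minimal sketch in plain Python, assuming stride 1 and no padding; the function name `conv2d_valid` and the toy input are illustrative and not from the slides. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
def conv2d_valid(x, k):
    """Slide kernel k over input x with stride 1 and no padding."""
    nh, nw = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = []
    for i in range(nh - kh + 1):          # nh - kh + 1 output rows
        row = []
        for j in range(nw - kw + 1):      # nw - kw + 1 output columns
            row.append(sum(x[i + a][j + b] * k[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


# Toy data: 5 x 5 input holding 0..24, 3 x 3 kernel of ones.
x = [[5 * r + c for c in range(5)] for r in range(5)]
k = [[1, 1, 1] for _ in range(3)]
feature_map = conv2d_valid(x, k)
print(len(feature_map), len(feature_map[0]))  # (5 - 3 + 1) x (5 - 3 + 1)
```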
Convolutions with Multiple Channels
• Each kernel produces one output channel.
• The same convolution operation is used for each of the output channels.
• Each kernel learns its own parameters, so different kernels act as different filters.
Convolutions with Multiple Channels (cont…)
Multiple input channels (3 channels) and a single output channel
Convolutions with Multiple Channels (cont…)
Multiple input channels (3 channels) and multiple output channels
• Input: 3 channels
• Kernel 1 to Kernel 5: 3 channels each
• Output: 5 channels
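The channel bookkeeping above (3 input channels, five 3-channel kernels, 5 output channels) can be sketched as follows. This is an illustrative plain-Python version: the function name `conv2d_multi`, the 4 x 4 input, and the constant-valued kernels are assumptions chosen only to make the shapes easy to follow.

```python
def conv2d_multi(x, kernels, biases):
    """x: [c_in][h][w]; kernels: [c_out][c_in][kh][kw]; biases: [c_out].

    Each kernel spans all input channels and yields one output channel.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(kernels[0][0]), len(kernels[0][0][0])
    out = []
    for kern, b in zip(kernels, biases):
        channel = []
        for i in range(h - kh + 1):
            row = []
            for j in range(w - kw + 1):
                s = b
                for c in range(c_in):      # sum over every input channel
                    for a in range(kh):
                        for d in range(kw):
                            s += x[c][i + a][j + d] * kern[c][a][d]
                row.append(s)
            channel.append(row)
        out.append(channel)
    return out


# Toy data: 3-channel 4 x 4 input of ones; kernel n is filled with the value n + 1.
x = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
kernels = [[[[float(n + 1)] * 3 for _ in range(3)] for _ in range(3)]
           for n in range(5)]
biases = [0.0] * 5
y = conv2d_multi(x, kernels, biases)
out_shape = (len(y), len(y[0]), len(y[0][0]))
print(out_shape)  # channels, height, width
```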
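Padding size = 1

Padded input (5 x 5 image with a one-pixel border of zeros):
0 0 0 0 0 0 0
0 1 0 2 2 1 0
0 2 1 1 2 1 0
0 2 1 1 1 1 0
0 0 1 1 2 2 0
0 2 2 1 1 1 0
0 0 0 0 0 0 0

Kernel:
-1  0  0
-1  1  0
 0  1  0

Output (5 x 5; the slide fills in the first two rows and the first entry of the remaining rows):
 3  0  3  2  0
 4 -1  1  0 -2
 2  .  .  .  .
 2  .  .  .  .
 2  .  .  .  .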
Stride = 2

Padded input:
0 0 0 0 0 0 0
0 1 0 2 2 1 0
0 2 1 1 2 1 0
0 2 1 1 1 1 0
0 0 1 1 2 2 0
0 2 2 1 1 1 0
0 0 0 0 0 0 0

Kernel:
-1  0  0
-1  1  0
 0  1  0

Output (3 x 3):
 3  3  0
 2  0  0
 2 -2 -2
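The padding and stride examples above can be reproduced with a short script. This is a minimal sketch in plain Python: the function name `conv2d` and its arguments are illustrative, while the 5 x 5 image and the 3 x 3 kernel are read off the slides. With padding p and stride s, an n x n input and a k x k kernel give an output of size $\lfloor (n + 2p - k)/s \rfloor + 1$ per side.

```python
def conv2d(x, k, padding=0, stride=1):
    """Zero-pad x, then slide k over it, moving `stride` cells per step."""
    n = len(x) + 2 * padding
    m = len(x[0]) + 2 * padding
    padded = [[0] * m for _ in range(n)]
    for i, row in enumerate(x):
        for j, v in enumerate(row):
            padded[i + padding][j + padding] = v
    kh, kw = len(k), len(k[0])
    out = []
    for i in range(0, n - kh + 1, stride):
        out.append([
            sum(padded[i + a][j + b] * k[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(0, m - kw + 1, stride)
        ])
    return out


image = [[1, 0, 2, 2, 1],
         [2, 1, 1, 2, 1],
         [2, 1, 1, 1, 1],
         [0, 1, 1, 2, 2],
         [2, 2, 1, 1, 1]]
kernel = [[-1, 0, 0],
          [-1, 1, 0],
          [0, 1, 0]]

same_out = conv2d(image, kernel, padding=1, stride=1)     # 5 x 5 output
strided_out = conv2d(image, kernel, padding=1, stride=2)  # 3 x 3 output
for row in strided_out:
    print(row)
```

With padding 1 and stride 1 the output keeps the 5 x 5 size of the input; with stride 2 every second position is skipped, leaving a 3 x 3 output.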
• … different parts of the image.
• The above two assumptions are called inductive biases.
• Inductive biases help CNNs learn more quickly and generalize better than fully connected NNs.
Pooling
• Used between the conv layers.
• Reduces the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network.
• Helps control overfitting.
• Accepts a volume of size W1×H1×D1.
• Requires two hyperparameters:
  • the spatial extent F
  • the stride S
• Produces a volume of size W2×H2×D2, where:
  • W2 = ((W1−F)/S) + 1
  • H2 = ((H1−F)/S) + 1
  • D2 = D1
• Has no learnable parameters.
• Zero-padding the input is not commonly done for a pooling layer.
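The pooling size formulas above can be tried on a small example. This is a sketch under stated assumptions: max pooling is chosen for illustration (the formulas are the same for average pooling), and the names `pool_output_size` and `max_pool2d` as well as the 4 x 4 feature map are hypothetical.

```python
def pool_output_size(w1, h1, d1, f, s):
    """Apply W2 = (W1 - F)/S + 1, H2 = (H1 - F)/S + 1, D2 = D1."""
    return ((w1 - f) // s + 1, (h1 - f) // s + 1, d1)


def max_pool2d(x, f, s):
    """Max-pool one channel x with an f x f window and stride s."""
    h, w = len(x), len(x[0])
    return [[max(x[i + a][j + b] for a in range(f) for b in range(f))
             for j in range(0, w - f + 1, s)]
            for i in range(0, h - f + 1, s)]


fmap = [[1, 3, 2, 4],
        [5, 6, 1, 2],
        [7, 2, 9, 0],
        [1, 8, 3, 4]]
pooled = max_pool2d(fmap, f=2, s=2)
pool_shape = pool_output_size(4, 4, 1, f=2, s=2)
print(pooled, pool_shape)
```

The layer only selects values from its input, which is why it adds no learnable parameters.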
CNN Architectures
Source: https://arxiv.org/pdf/1605.07678.pdf
LeNet
• LeNet, proposed by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner in 1998, laid the groundwork for convolutional neural networks (CNNs) and their applications in handwritten digit recognition.
• LeNet was trained using stochastic gradient descent (SGD) with backpropagation.
• The network was trained on the MNIST dataset, comprising 60,000 training examples and 10,000 test examples.
• Data augmentation techniques such as translation, rotation, and scaling were employed to increase the diversity of training samples and improve generalization.
• LeNet achieved a remarkable accuracy of over 99% on the MNIST dataset.
LeNet-5 (cont…)
• Used sigmoid and tanh activations.
• Has approx. 60k learnable parameters.
• LeNet was used to read zip codes, digits, etc.
[Figure: LeNet-5 architecture. Conv, 6 kernels of 5 x 5, stride = 1 → Avg pool 2 x 2, stride = 2 → Conv, 16 kernels of 5 x 5, stride = 1 → Avg pool 2 x 2, stride = 2]
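The "approx. 60k" figure can be sanity-checked from the layer settings above. This is a rough sketch, not the original implementation. The 32 x 32 grayscale input and the fully connected sizes 120, 84 and 10 are assumptions taken from the commonly cited LeNet-5 description, not from these slides, and the second conv layer is assumed to connect to all 6 input channels (the 1998 network used a sparser connection scheme, so its exact count differs slightly).

```python
def conv_out(n, k, stride=1, padding=0):
    """Output size per side for a square input, kernel, stride and padding."""
    return (n + 2 * padding - k) // stride + 1


def conv_params(k, c_in, c_out):
    return (k * k * c_in + 1) * c_out    # weights plus one bias per kernel


def dense_params(n_in, n_out):
    return (n_in + 1) * n_out            # weights plus one bias per unit


size, channels, lenet_total = 32, 1, 0   # assumed 32 x 32 x 1 input

# Conv: 6 kernels of 5 x 5, stride 1; then 2 x 2 average pooling, stride 2.
lenet_total += conv_params(5, channels, 6)
size, channels = conv_out(size, 5), 6
size = conv_out(size, 2, stride=2)

# Conv: 16 kernels of 5 x 5, stride 1; then 2 x 2 average pooling, stride 2.
lenet_total += conv_params(5, channels, 16)
size, channels = conv_out(size, 5), 16
size = conv_out(size, 2, stride=2)

flat_features = size * size * channels   # flattened input to the FC layers
n_in = flat_features
for n_out in (120, 84, 10):              # assumed FC layer sizes
    lenet_total += dense_params(n_in, n_out)
    n_in = n_out

print(flat_features, lenet_total)
```

Under these assumptions the total comes to a little over 61 thousand parameters, consistent with the approximate figure on the slide.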
• Test data performance: achieved top-1 and top-5 error rates of 37.5% and 17.0% on the ILSVRC-2010 test set.
• In the ILSVRC-2012 competition, a variant of this model achieved a winning top-5 test error rate of 15.3%.
Source: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
AlexNet
• AlexNet consists of eight learned layers:
  • five convolutional layers, some of which are followed by max-pooling layers, and
  • three fully connected layers.
• Rectified Linear Units (ReLU) were used as activation functions, providing faster convergence and alleviating the vanishing gradient problem.
• The neural network has 60 million parameters and 650,000 neurons.
• Local Response Normalization (LRN) was introduced to normalize activations within local regions of the feature maps.
• LRN operates on local groups of neurons, normalizing activity within each group and across feature channels.
• $a^{i}_{x,y}$ is the activity of a neuron computed by applying kernel i at position (x, y) and then applying ReLU.
• n is the number of "adjacent" kernel maps at the same spatial position over which the normalization sums.
• N is the total number of kernels in the layer.
• The constants k, n, α, and β are hyperparameters with values k = 2, n = 5, α = 10^-4, and β = 0.75.
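For reference, the normalization these symbols belong to is, as given in the AlexNet paper, $b^{i}_{x,y} = a^{i}_{x,y} / \big(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} (a^{j}_{x,y})^2\big)^{\beta}$. The sketch below applies it across channels at a single spatial position. It is an illustrative plain-Python version: the function name `lrn` and the toy activations are hypothetical, and n/2 is assumed to be rounded down.

```python
def lrn(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Local response normalization at one spatial position (x, y).

    a[i] is the ReLU activation of kernel i at that position.
    """
    num_kernels = len(a)               # N in the formula
    half = n // 2                      # assumption: n/2 rounded down
    b = []
    for i in range(num_kernels):
        lo = max(0, i - half)
        hi = min(num_kernels - 1, i + half)
        denom = (k + alpha * sum(a[j] ** 2 for j in range(lo, hi + 1))) ** beta
        b.append(a[i] / denom)
    return b


activations = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # toy values for N = 6 kernels
normalized = lrn(activations)
print([round(v, 4) for v in normalized])
```

Each activation is divided by a term that grows with the squared activity of its neighbouring channels, so strongly active neighbours damp one another.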
• AlexNet was trained using stochastic gradient descent (SGD) with momentum.
• Data augmentation techniques were employed to increase the diversity of training samples.
• The network was trained on two NVIDIA GTX 580 GPUs, marking one of the earliest instances of utilizing GPU acceleration for deep learning.
AlexNet - Architecture
[Figure: AlexNet architecture]
Input 227 x 227 x 3
→ Conv 11 x 11, S = 4, ReLU → 55 x 55 x 96
→ Maxpool 3 x 3, S = 2
→ Conv 5 x 5, S = 1, same padding, ReLU
→ Maxpool 3 x 3, S = 2
→ Conv 3 x 3, S = 1, same padding, ReLU
→ Conv 3 x 3, S = 1, same padding, ReLU
→ Conv 3 x 3, S = 1, same padding, ReLU
→ Maxpool 3 x 3, S = 2
→ FC → FC → FC
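The spatial sizes in the AlexNet diagram follow from the kernel, stride and padding of each layer. The sketch below traces only height and width, using the settings shown above. The helper `out_size` is hypothetical, and "same" padding for a k x k kernel at stride 1 is taken to be (k - 1) / 2 pixels per side. Channel counts are left out because only the first one (96) appears on the slide.

```python
def out_size(n, k, stride=1, padding=0):
    """Output size per side: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1


# (layer, kernel size, stride, padding per side)
layers = [
    ("conv 11x11", 11, 4, 0),
    ("maxpool 3x3", 3, 2, 0),
    ("conv 5x5 same", 5, 1, 2),
    ("maxpool 3x3", 3, 2, 0),
    ("conv 3x3 same", 3, 1, 1),
    ("conv 3x3 same", 3, 1, 1),
    ("conv 3x3 same", 3, 1, 1),
    ("maxpool 3x3", 3, 2, 0),
]

size = 227
sizes = [size]
for name, k, s, p in layers:
    size = out_size(size, k, s, p)
    sizes.append(size)
    print(f"{name:14s} -> {size} x {size}")
```

The first step reproduces the 55 x 55 map from the diagram: (227 - 11) / 4 + 1 = 55.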
How to Compute the Number of Parameters in a CNN
What will be the output size of the following network? How many learnable parameters exist? No padding is used and stride = 1.
[Figure: 10 x 10 x 1 grayscale image → Conv 3 x 3 x 1]
Output size = (10 - 3 + 1, 10 - 3 + 1, 1) = (8, 8, 1)
Parameters = (3 x 3 x 1) + 1 = 10
Number of parameters in CNN (cont…)
What will be the output size of the following network? How many learnable parameters exist? No padding is used and stride = 1.
[Figure: 10 x 10 x 1 grayscale image → Conv 3 x 3 x 5 → Conv 3 x 3 x 2]
After the first conv: output size = (10 - 3 + 1, 10 - 3 + 1, 5) = (8, 8, 5)
Parameters: each filter has (3 x 3 x 1) + 1 = 10; for 5 filters = 50
After the second conv: output size = (8 - 3 + 1, 8 - 3 + 1, 2) = (6, 6, 2)
Parameters: each filter has (3 x 3 x 5) + 1 = 46; for 2 filters = 92
Total parameters = 50 + 92 = 142
Number of parameters in CNN (cont…)
What will be the output size of the following network? How many learnable parameters exist? No padding is used and stride = 1.
[Figure: 100 x 100 x 3 color image → Conv 3 x 3 x 8 → Conv 3 x 3 x 1]
After the first conv: output size = (100 - 3 + 1, 100 - 3 + 1, 8) = (98, 98, 8)
Parameters: each filter has (3 x 3 x 3) + 1 = 28; for 8 filters = 224
After the second conv: output size = (98 - 3 + 1, 98 - 3 + 1, 1) = (96, 96, 1)
Parameters: each filter has (3 x 3 x 8) + 1 = 73; only one filter = 73
Total parameters = 224 + 73 = 297
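The three worked 2D examples above all follow one rule: each filter has (k x k x input channels) weights plus one bias. A small sketch can apply that rule layer by layer. The function names `conv2d_layer` and `run_network` are hypothetical, and stride 1 with no padding is assumed, as in the exercises.

```python
def conv2d_layer(in_shape, k, n_filters):
    """Return (output shape, parameter count) for a k x k conv layer."""
    h, w, c = in_shape
    out_shape = (h - k + 1, w - k + 1, n_filters)
    params = (k * k * c + 1) * n_filters   # weights plus one bias per filter
    return out_shape, params


def run_network(in_shape, layers):
    """layers is a list of (kernel size, number of filters) pairs."""
    shape, total = in_shape, 0
    for k, n_filters in layers:
        shape, params = conv2d_layer(shape, k, n_filters)
        total += params
    return shape, total


ex1 = run_network((10, 10, 1), [(3, 1)])
ex2 = run_network((10, 10, 1), [(3, 5), (3, 2)])
ex3 = run_network((100, 100, 3), [(3, 8), (3, 1)])
print(ex1, ex2, ex3)
```

The parameter count does not depend on the image height or width, only on the kernel size and the channel counts.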
Number of parameters in CNN (cont…)
What will be the output size of the following network? How many learnable parameters exist? No padding is used and stride = 1.
[Figure: 1D input of length 100 with 5 channels → Conv, kernel size 3, 8 filters → Conv, kernel size 3, 1 filter]
After the first conv: output size = (100 - 3 + 1, 8) = (98, 8)
Parameters: each filter has (3 x 5) + 1 = 16; for 8 filters = 128
After the second conv: output size = (98 - 3 + 1, 1) = (96, 1)
Parameters: each filter has (3 x 8) + 1 = 25; only one filter = 25
Total parameters = 128 + 25 = 153
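The 1D case above works the same way, except that a filter covers (kernel length x input channels) weights. A minimal sketch, assuming stride 1 and no padding; the function name `conv1d_layer` is hypothetical.

```python
def conv1d_layer(length, channels, k, n_filters):
    """Return (output length, output channels, parameter count)."""
    out_length = length - k + 1
    params = (k * channels + 1) * n_filters   # weights plus one bias per filter
    return out_length, n_filters, params


length, channels, total_1d = 100, 5, 0        # input: length 100, 5 channels
for k, n_filters in [(3, 8), (3, 1)]:
    length, channels, params = conv1d_layer(length, channels, k, n_filters)
    total_1d += params
    print(length, channels, params)
```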
INCEPTION Module
GOOGLENET / INCEPTION NET
[Figure: GoogLeNet / Inception Net architecture; label: Auxiliary Loss]
INCEPTION NET (cont…)