DL Unit 2
Convolutional Neural Networks
Syllabus
Convolution Operation-Sparse Interactions-Parameter
Sharing-Equivariance-Pooling-Convolution
Variants : Strided-Tiled-Transposed and Dilated convolutions; CNN Learning : Nonlinearity
Functions-Loss Functions-Regularization-Optimizers-Gradient Computation.
Contents
2.1 Introduction to Convolutional Neural Networks
2.2 Convolution Operation
2.3 Pooling
2.4 Convolution Variants : Tiled
2.5 Fully Connected Layers
2.6 CNN Learning : Nonlinearity Functions
2.7 Loss Function
2.8 Gradient Computation
2.9 Two Marks Questions with Answers
neural network.
" CNNs make use of the same knowledge across all image locations.
2. Disadvantages :
• Adversarial attacks are cases of feeding the network 'bad' examples to cause misclassification.
• CNN requires a lot of training data.
• CNNs tend to be much slower because of operations like max pooling.
2.1.2 Application of CNN
" CNN is mostly used for image classification, for example to determine the satellite
images containing mountains and valleys or recognition of handwriting, etc. image
Segmentation, signal processing,etc. are the areas where CNN are used.
Object detection : Self-driving cars, Al-powered surveillance systems and smart homes
often use CNN to be able to identify and mark objects. CNN can identify objects on the
photos and in real-time, classify and label them.
Yoice synthesis : Google Assistant's voice synthesizer uses Deepmind's WaveNet
ConvNet model.
Astrophysics : They are used to make sense of radio telescope data and predict the
Probable visual image to represent that data.
4. Pool layer : This function mainly reduces the volume of the intermediate output, which enables fast computation of the network model, thus preventing it from overfitting.
" In general form, convolution is an operation on two functions of a real valued argument.
To motivate the definition of convolution, we start with examples of two functions we
might use.
" Suppose we are tracking the location of aspaceship with a laser sensor. Laser sensor
provides a single output x(), the position of the spaceship at time t. Both "x" and t" are
real-valued, i.e., We can get adifferent reading from the laser sensor at any instant in
time.
" Nowsuppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of
the spaceship's position, we would like to average together several measurements. Of
course, more recent measurements are more relevant, so we will want this to be a
weighted average that gives more weight to recent measurements.
" We can do this with a weighting function w(a), where "a" is the age of a measurement.
If we apply such a weighted average operation at every moment, we obtain a new
L49
function providing a smoothed estimate of the position s" of the spaceship.
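• Written out, this smoothed estimate is exactly the standard convolution of the input x with the weighting function w :
  s(t) = ∫ x(a) w(t - a) da, usually denoted s(t) = (x ∗ w)(t).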
" Convolution operation uses three parameters :Input image, Feature detector and Feature
map.
Convolution operation involves an input matrix and a filter, also known as the kernel.
Input matrix can be pixel values of a grayscale image whereas a filter is a relatively
small matrix that detects edges by darkening areas of input image where there are
transitions from brighter to darker areas. There can be different types of filters
depending upon what type of features we want to detect, e.g. vertical, horizontal, or
diagonal, etc.
• The input image is converted into binary 1s and 0s. The convolution operation, shown in Fig. 2.2.1, is known as the feature detector of a CNN. The input to a convolution can be raw data or a feature map output from another convolution. It is often interpreted as a filter in which the kernel filters the input data for certain kinds of information.
• Sometimes a 5 x 5 or a 7 x 7 matrix is used as a feature detector. The feature detector is often referred to as a "kernel" or a "filter". At each step, the kernel is multiplied by the input data values within its bounds, creating a single entry in the output feature map.
Fig. 2.2.1 Convolution of the input data with a kernel to produce the convolved feature
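As an illustration of this multiply-and-sum step, the following sketch (not code from the text; the small binary image and the vertical-edge kernel are illustrative assumptions) computes a valid convolution in NumPy. As is conventional in deep learning, the kernel is not flipped, so strictly this is a cross-correlation.

    import numpy as np

    def conv2d_valid(image, kernel):
        # Slide the kernel over the image and sum element-wise products (valid convolution).
        ih, iw = image.shape
        kh, kw = kernel.shape
        oh, ow = ih - kh + 1, iw - kw + 1      # output shrinks to m - k + 1 per dimension
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.array([[0, 0, 1, 1],
                      [0, 0, 1, 1],
                      [0, 0, 1, 1],
                      [0, 0, 1, 1]], dtype=float)     # binary input with a vertical edge
    vertical_edge = np.array([[1, 0, -1],
                              [1, 0, -1],
                              [1, 0, -1]], dtype=float)
    print(conv2d_valid(image, vertical_edge))          # strong response where the window spans the edge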
• Convolutional layers perform transformations on the input data volume that are a function of the activations in the input volume and the parameters.
• In reality, convolutional neural networks develop multiple feature detectors and use them to develop several feature maps, which are referred to as convolutional layers; this is shown in Fig. 2.2.3.
• Through training, the network determines what features it finds important in order to be able to scan images and categorize them more accurately.
• Convolutional layers have parameters for the layer and additional hyper-parameters. Gradient descent is used to train the parameters in this layer such that the class scores are consistent with the labels in the training set.
Fig. 2.2.3 Feature detectors (input image → convolutional layer)
" If all neurons in a single depth slice are using the same weight vector, then the forward
pass of the convolutional layer can be computed in each depth slice as a convolution of
the neuron"s weights with the input volume. This is the reason why it is common to refer
to the sets of weights as a filter (or a kernel), that is convolved with the input.
" Fig. 2.2.4 shows convolution shares the same paranmeters across all spatial locations.
Fig. 2.2.4 Convolution shares the same parameters across all spatial locations
2.2.3 Padding
• Padding is the process of adding one or more pixels of zeros all around the boundaries of an image, in order to increase its effective size. Zero padding helps to make the output dimensions and kernel size independent.
• One observation is that the convolution operation reduces the size of the (q + 1)-th layer in comparison with the size of the q-th layer. This type of reduction in size is not desirable in general, because it tends to lose some information along the borders of the image. This problem can be resolved by using padding.
• Three common zero-padding strategies are :
a) Valid convolution : The extreme case in which no zero-padding is used whatsoever, and the convolution kernel is only allowed to visit positions where the entire kernel is contained entirely within the input. For a kernel of size k in any dimension, an input of size m in that dimension shrinks to m - k + 1 in the output. This shrinkage restricts architecture depth.
b) Same convolution : Just enough zero-padding is added to keep the size of the output equal to the size of the input. Essentially, for a dimension where the kernel size is k, the input is padded by k - 1 zeros in that dimension.
c) Full convolution : The other extreme case, where enough zeros are added for every pixel to be visited k times in each direction, resulting in an output image of width m + k - 1.
d) The 1D block is composed of a configurable number of filters, where each filter has a set size; a convolution operation is performed between the vector and the filter, producing as output a new vector with as many channels as the number of filters. Every value in the tensor is then fed through an activation function to introduce nonlinearity.
• When padding is not used, the resulting convolution is also referred to as a valid convolution. Valid padding generally does not work well from an experimental point of view. In the case of valid padding, the contributions of the pixels on the borders of the layer will be under-represented compared to the central pixels in the next hidden layer, which is undesirable.
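The three output-size rules above can be summarised in a small helper function (a sketch, not code from the text):

    def conv_output_size(m, k, padding="valid"):
        # Output length along one dimension for input length m and kernel length k.
        if padding == "valid":   # no padding: output shrinks
            return m - k + 1
        if padding == "same":    # pad with k - 1 zeros in total: size preserved
            return m
        if padding == "full":    # pad so every pixel is visited k times: output grows
            return m + k - 1
        raise ValueError("padding must be 'valid', 'same' or 'full'")

    for mode in ("valid", "same", "full"):
        print(mode, conv_output_size(m=32, k=5, padding=mode))   # 28, 32, 36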
2.2.4 Stride
• Convolution functions used in practice differ slightly from the convolution operation as it is usually understood in the mathematical literature.
• In general, a convolution layer consists of the application of several different kernels to the input. This allows the extraction of several different features at all locations in the input. This means that in each layer a single kernel is not applied; multiple kernels are used as different feature detectors.
• The input is generally not real-valued but instead vector-valued. Multi-channel convolutions are commutative only if the number of output and input channels is the same.
• In order to allow calculation of features at a coarser level, strided convolutions can be used. The effect of a strided convolution is the same as that of a convolution followed by a down-sampling stage. This can be used to reduce the representation size.
• The stride indicates the pace by which the filter moves horizontally and vertically over the pixels of the input image during convolution. Fig. 2.2.5 shows stride during convolution.
Fig. 2.2.5 Stride during convolution (stride = 1)
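The equivalence mentioned above - a strided convolution gives the same result as a stride-1 convolution followed by down-sampling - can be checked with a short sketch (the random input, the 3 x 3 kernel and the use of scipy.signal.correlate2d are illustrative assumptions):

    import numpy as np
    from scipy.signal import correlate2d

    rng = np.random.default_rng(0)
    x = rng.standard_normal((7, 7))
    k = rng.standard_normal((3, 3))
    s = 2                                        # stride

    full = correlate2d(x, k, mode="valid")       # stride-1 convolution (cross-correlation)
    downsampled = full[::s, ::s]                 # keep every s-th output in each direction

    # Direct strided convolution: evaluate the kernel only at positions 0, s, 2s, ...
    strided = np.array([[np.sum(x[i:i + 3, j:j + 3] * k)
                         for j in range(0, 7 - 3 + 1, s)]
                        for i in range(0, 7 - 3 + 1, s)])

    print(np.allclose(downsampled, strided))     # True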
• For example, one can extract square patches of the image to create the training data. The number of filters in each layer is often a power of 2, because this often results in more efficient processing. Such an approach also leads to hidden layer depths that are powers of 2.
2.2.6 ReLU Layer
• In this layer we remove every negative value from the filtered image and replace it with zero. This function only activates when the node input is above a certain quantity, so when the input is below zero the output is zero.
• However, when the input rises above the threshold it has a linear relationship with the dependent variable. This allows it to accelerate the training of a deep neural network, making training faster than with other activation functions.
• In traditional neural networks, the activation function is combined with a linear transformation with a matrix of weights to create the next layer of activations.
• The reason why the rectifier function is typically used as the activation function in a convolutional neural network is to increase the nonlinearity of the data set. By removing negative values from the neurons' input signals, the rectifier function is effectively removing black pixels from the image and replacing them with gray pixels.
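A minimal sketch of the rectifier described above (the example values are illustrative; np.maximum does the thresholding):

    import numpy as np

    def relu(x):
        # Negative inputs become zero; positive inputs pass through unchanged.
        return np.maximum(0, x)

    feature_map = np.array([[-2.0, 0.5],
                            [ 3.0, -0.1]])
    print(relu(feature_map))   # [[0.  0.5] [3.  0. ]]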
2.3 Pooling
• Pooling helps the representation become slightly invariant to small translations of the input. A pooling function takes the output of the previous layer at a certain location and computes a "summary" of the neighborhood around that location.
• The pooling layer reduces the height and width of the input. It helps reduce computation, as well as helping make feature detectors more invariant to their position in the input.
• The function of the pooling layer is to progressively reduce the spatial size of the representation, to reduce the amount of parameters and computation in the network, and hence also to control overfitting. No learning takes place in the pooling layers.
• The addition of a pooling layer after the convolutional layer is a common pattern used for ordering layers within a convolutional neural network that may be repeated one or more times in a given model.
• The pooling layer operates upon each feature map separately to create a new set of the same number of pooled feature maps. Pooling involves selecting a pooling operation, much like a filter to be applied to feature maps.
• The size of the pooling operation or filter is smaller than the size of the feature map. This means that the pooling layer will always reduce the size of each feature map by a factor of 2, e.g. each dimension is halved, reducing the number of pixels or values in each feature map to one quarter the size.
• For example, a pooling layer applied to a feature map of 6 x 6 (36 pixels) will result in an output pooled feature map of 3 x 3 (9 pixels). The pooling operation is specified, rather than learned.
• The pooling operation, also called subsampling, is used to reduce the dimensionality of feature maps from the convolution operation. Max pooling and average pooling are the most common pooling operations used in the CNN.
• Pooling units are obtained using functions like max-pooling, average pooling and even L2-norm pooling. At the pooling layer, forward propagation results in an N x N pooling block being reduced to a single value - the value of the "winning unit". Back-propagation of the pooling layer then computes the error, which is acquired by this single "winning unit" value.
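A sketch of 2 x 2 max pooling with stride 2, matching the 6 x 6 → 3 x 3 example above (the reshape trick is an implementation choice, not from the text):

    import numpy as np

    def max_pool_2x2(feature_map):
        h, w = feature_map.shape
        assert h % 2 == 0 and w % 2 == 0, "sketch assumes even dimensions"
        # Group the map into non-overlapping 2 x 2 blocks and take the maximum of each block.
        blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
        return blocks.max(axis=(1, 3))

    fmap = np.arange(36, dtype=float).reshape(6, 6)
    print(max_pool_2x2(fmap).shape)   # (3, 3)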
" Let kbe a 6-D tensor, where two of the dimensions correspond to different locations in
the output map. Rather than having a separate index for each location in the output
map, output locations cycle through a set of t different choices of kernel stack in each
direction. Ift is equal to the output width, this is the same as alocally connected layer.
2.4.1 Transposed and Dilated Convolutions
• Transposed convolutions : These types of convolutions are also known as deconvolutions or fractionally strided convolutions. A transposed convolutional layer carries out a regular convolution but reverts its spatial transformation.
• Fig. 2.4.1 shows how a transposed convolution with a 2 x 2 kernel is computed for a 2 x 2 input tensor.
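The computation in Fig. 2.4.1 can be sketched as follows: each input element scales a copy of the kernel, and the scaled copies are accumulated into the larger output. The particular 2 x 2 input and kernel values below are illustrative assumptions, not taken verbatim from the text.

    import numpy as np

    def transposed_conv2d(x, kernel, stride=1):
        ih, iw = x.shape
        kh, kw = kernel.shape
        out = np.zeros(((ih - 1) * stride + kh, (iw - 1) * stride + kw))
        for i in range(ih):
            for j in range(iw):
                # Place a copy of the kernel scaled by x[i, j] at the (strided) output position.
                out[i * stride:i * stride + kh, j * stride:j * stride + kw] += x[i, j] * kernel
        return out

    x = np.array([[0.0, 1.0], [2.0, 3.0]])
    k = np.array([[0.0, 1.0], [2.0, 3.0]])
    print(transposed_conv2d(x, k))
    # [[ 0.  0.  1.]
    #  [ 0.  4.  6.]
    #  [ 4. 12.  9.]]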
Fig. 2.4.1 Transposed convolution of a 2 x 2 input tensor with a 2 x 2 kernel
Fig. 2.4.2 Convolution with a dilated filter where the dilation factor is d = 2
• Dilation by a factor of d means that the original filter is expanded by inserting d - 1 spaces between each element, and the intermediate empty locations are filled in with zeros.
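This expansion of the filter can be sketched directly (the 2 x 2 kernel values are illustrative):

    import numpy as np

    def dilate_kernel(kernel, d):
        # With dilation factor d, d - 1 zeros are inserted between neighbouring kernel elements.
        kh, kw = kernel.shape
        dilated = np.zeros((d * (kh - 1) + 1, d * (kw - 1) + 1))
        dilated[::d, ::d] = kernel        # original weights keep their values; the gaps stay zero
        return dilated

    k = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    print(dilate_kernel(k, d=2))
    # [[1. 0. 2.]
    #  [0. 0. 0.]
    #  [3. 0. 4.]]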
2.5 Fully Connected Layers
• Fully connected layers have the normal parameters for the layer and hyper-parameters. This layer performs transformations on the input data volume that are a function of the activations in the input volume and the parameters.
• Neural networks are a set of dependent non-linear functions. Each individual function consists of a neuron (or a perceptron).
• In fully connected layers, the neuron applies a linear transformation to the input vector through a weights matrix. A non-linear transformation is then applied to the product through a non-linear activation function f.
y = f(W · x)
" Here. we are taking the dot product between the weights matrix W and the ínput
I will ígnore ít
vector x The bias term (W,) can be added inside the non-linear function.
decision-making and is
for the rest of the article as it doesn't affect the output sizes or
just another weight.
the input of the layer and the
" The activation function "f wTaps the dot product between
weights matrix of that layer.
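A minimal sketch of one fully connected layer under these definitions (the layer sizes and the choice of ReLU as f are illustrative assumptions):

    import numpy as np

    def fully_connected(x, W, b, f=lambda z: np.maximum(0, z)):
        # y = f(W x + b): W has shape (out_features, in_features), x has shape (in_features,).
        return f(W @ x + b)

    rng = np.random.default_rng(1)
    x = rng.standard_normal(84)            # e.g. the 84 activations of layer F6 below
    W = rng.standard_normal((10, 84))      # 10 output units
    b = np.zeros(10)
    print(fully_connected(x, W, b).shape)  # (10,)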
[Figure : CNN architecture - input 32 x 32; feature maps 28 x 28 (6), 14 x 14 (6), 10 x 10 (16), 5 x 5 (16); then layers of 120, 84 and 10 units]
• Layer C1 is a convolutional layer with six feature maps, where the size of the feature maps is 28 x 28;
• Layer S2 is a sub-sampling layer with six feature maps, where the size of the feature maps is 14 x 14;
• Layer C3 is a convolutional layer with sixteen feature maps, where the size of the feature maps is 10 x 10;
• Layer S4 is a sub-sampling layer with sixteen feature maps, where the size of the feature maps is 5 x 5;
• Layer C5 is a convolutional layer with 120 feature maps, where the size of the feature maps is 1 x 1;
• Layer F6 contains 84 units and is fully connected to the C5 convolutional layer.
2.6 CNN Learning: Nonlinearity Functions
• The weight layers in a CNN are often followed by a nonlinear activation function. The activation function takes a real-valued input and squashes it within a small range such as [0, 1] or [-1, 1]. The application of a nonlinear function after the weight layers is highly important, since it allows a neural network to learn nonlinear mappings.
• In the absence of nonlinearities, a stacked network of weight layers is equivalent to a linear mapping from the input domain to the output domain.
f) h(X_i) = Y_i : the neural network transforming the input X_i to give the output Y_i.
g) Thus y_ij refers to the j-th feature observed in the i-th sample collected.
h) Loss function = L(W, b)
• Advantages : For small errors, MSE helps converge to the minima efficiently, as the gradient reduces gradually.
• Drawbacks :
a) Squaring the values does increase the rate of training, but at the same time an extremely large loss may lead to a drastic jump during backpropagation, which is not desirable.
b) MSE is also sensitive to outliers.
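A short sketch of MSE and its gradient with respect to the predictions (the example values are illustrative; the 1/n scaling is one common convention):

    import numpy as np

    def mse(y_pred, y_true):
        # Mean of the squared differences between predictions and targets.
        return np.mean((y_pred - y_true) ** 2)

    def mse_grad(y_pred, y_true):
        # Gradient w.r.t. the predictions; it shrinks gradually as the error shrinks.
        return 2 * (y_pred - y_true) / y_true.size

    y_true = np.array([1.0, 2.0, 3.0])
    y_pred = np.array([1.1, 1.9, 3.5])
    print(mse(y_pred, y_true))
    print(mse_grad(y_pred, y_true))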
• One of the best methods to do validation is "K-fold cross validation", which helps to provide ample data for training the model and ample data for validation.
• With this technique, we simply build a model for each possible combination of all of the hyperparameter values provided, evaluating each model and selecting the architecture which produces the best results.
• For example, say you have two continuous parameters α and β, where the manually selected values for the parameters are the following :
α ∈ {0, 1, 2}
β ∈ {0.25, 0.50, 0.75}
• Then the pairing of the selected hyperparameter values, H, can take on any of the following :
H ∈ {(0, 0.25), (0, 0.50), (0, 0.75), (1, 0.25), (1, 0.50), (1, 0.75), (2, 0.25), (2, 0.50), (2, 0.75)}
• Grid search will examine each pairing of α and β to determine the best performing combination. The resulting pairs, H, are simply each output that results from taking the Cartesian product of α and β.
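The pairings H above are exactly the Cartesian product of the two value lists, which can be generated with itertools.product (a sketch; alpha_values and beta_values are just illustrative names):

    from itertools import product

    alpha_values = [0, 1, 2]
    beta_values = [0.25, 0.50, 0.75]

    # Every (alpha, beta) combination that grid search would evaluate.
    H = list(product(alpha_values, beta_values))
    print(H)
    # [(0, 0.25), (0, 0.5), (0, 0.75), (1, 0.25), (1, 0.5), (1, 0.75), (2, 0.25), (2, 0.5), (2, 0.75)]

    # In a real grid search, each pairing would be used to train and evaluate a model,
    # and the best-performing combination would be kept.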
" While straightforward, this "brute force" approach for
hyperparameter optimization has
some drawbacks. Higher-dimensional hyperparametric spaces are far more
time
consuming to test than the simple two-dimensional problem presented here.
" Also, because there will always be a fixed number of
training samples for any given
model, the model's predictive power will decrease as the number of dimensions
increases. This is known as Hughes phenomenon.
2.8.2 Vanishing Gradient Problem
• When back-propagation is used, the earlier layers will receive very small updates compared to the later layers. This problem is referred to as the vanishing gradient problem.
Q.13 Define sparse interactions.
Ans. : Sparse interactions are also referred to as sparse connectivity or sparse weights. Sparse interaction is implemented by using kernels or feature detectors smaller than the input image, i.e. making the kernel smaller than the input.
Q.14 What are loss functions ?
Ans. : A loss function is a measure of how good a prediction model does in terms of being able to predict the expected outcome. Loss functions are used to calculate the difference between the predicted output and the actual output.
Q.15 List the components of convolutional layers.
Ans. : Components of convolutional layers are filters, activation maps, parameter sharing and layer-specific hyper-parameters.
Q.16 What is the use of parameter sharing in CNN ?
Ans. : Parameter sharing is used in CNN to control the total parameter count. Convolutional layers reduce the parameter count further by using a technique called parameter sharing.
Q.17 Explain padding in CNN.
Ans. : Padding is the process of adding one or more pixels of zeros all around the boundaries of an image, in order to increase its effective size. Zero padding helps to make output dimensions and kernel size independent.
Q.18 How many filters must a CNN have ?
Ans. : A CNN does not learn with a single filter; it learns through multiple filters in parallel for a given input. For example, it is usual for a convolution layer to learn from 32 to 512 filters in parallel for a given input.
Q.19 What is tiled convolution ?
Ans. : • Tiled convolution learns a set of kernels that is rotated through as we move through space, rather than learning a separate set of weights at every spatial location as in a locally connected layer. It offers a compromise between a convolutional layer and a locally connected layer.
• Memory requirements for storing the parameters will increase only by a factor of the size of this set of kernels.