
DL Unit2

DL UNIT2

Uploaded by

Vinoth Kumar
Copyright
© All Rights Reserved
UNIT II

2 Convolutional Neural Networks
Syllabus
Convolution Operation-Sparse Interactions-Parameter Sharing-Equivariance-Pooling-Convolution
Variants: Strided-Tiled-Transposed and dilated convolutions; CNN Learning : Nonlinearity
Functions-Loss Functions-Regularization-Optimizers-Gradient Computation.
Contents
2.1 Introduction to Convolutional Neural Networks
2.2 Convolution Operation
2.3 Pooling
2.4 Convolution Variants :Tiled
2.5 Fully Connected Layers
2.6 CNN Learning : Nonlinearity Functions
2.7 Loss Function
2.8 Gradient Computation
2.9 Two Marks Questions with Answers

Convolutional Neural Networks
Deep Learning

2.1 Introduction to Convolutional Neural Networks

• A Convolutional Neural Network (CNN) is a deep learning neural network designed for processing structured arrays of data such as images. A CNN is a feed-forward neural network, often with up to 20 or 30 layers. The power of a convolutional neural network comes from a special kind of layer called the convolutional layer.
• A convolutional neural network is also called a ConvNet.
• In CNN, 'convolution' refers to the mathematical function. It is a type of linear operation in which two functions are multiplied to create a third function that expresses how one function's shape is changed by the other. In simple terms, two images that are represented in the form of two matrices are multiplied to provide an output that is used to extract information from the image.
• A CNN represents the input data in the form of multidimensional arrays. It works well for a large amount of labeled data. A CNN examines each portion of the input image, which is known as the receptive field. It assigns weights to each neuron based on the significance of its receptive field.
• Instead of preprocessing the data to derive features like textures and shapes, a CNN takes just the image's raw pixel data as input and "learns" how to extract these features and ultimately infer what object they constitute.
• The goal of a CNN is to reduce the images so that they are easier to process without losing features that are valuable for accurate prediction.
• A convolutional neural network is made up of numerous layers, such as convolution layers, pooling layers and fully connected layers, and it uses a back-propagation algorithm to learn spatial hierarchies of data automatically and adaptively.
• To understand the concept of Convolutional Neural Networks (CNNs), let us take an example of the images our brain can interpret.
• As soon as we see an image, our brain starts categorizing it based on the color, shape and sometimes also the message that the image is conveying. A similar thing can be done by machines, but only after rigorous training. The difficulty is that there is a huge difference between what humans interpret and what a machine does. For a machine, the image is merely an array of pixels. There is a unique pattern included in each object present in the image, and the computer tries to find these patterns to get information about the image.
• Machines can be trained by giving them tons of images to increase their ability to recognize the objects included in a given input image.
TECHNICAL PUBLICATIONS- an up-thrust for knowledge
• Most of the digital companies have opted for CNNs for image recognition; some of these include Google, Amazon, Instagram, Pinterest, Facebook, etc.
• Hence, we define a convolutional neural network as: "A neural network consisting of multiple convolutional layers which are used mainly for image processing, classification, segmentation and other correlated data".


2.1.1 Advantages and Disadvantages of CNN
1. Advantages :
• CNN automatically detects the important features without any human supervision.
• CNN is also computationally efficient.
• Higher accuracy.
• Weight sharing is another major advantage of CNNs.
• Convolutional neural networks also minimize computation in comparison with a regular neural network.
• CNNs make use of the same knowledge across all image locations.
2. Disadvantages :
• CNNs are vulnerable to adversarial attacks, i.e. cases of feeding the network 'bad' examples to cause misclassification.
• CNN requires a lot of training data.
• CNNs tend to be much slower because of operations like maxpool.
2.1.2 Application of CNN
• CNN is mostly used for image classification, for example to identify satellite images containing mountains and valleys, or for recognition of handwriting. Image segmentation, signal processing, etc. are other areas where CNNs are used.
• Object detection : Self-driving cars, AI-powered surveillance systems and smart homes often use CNNs to identify and mark objects. A CNN can identify objects in photos and in real time, and classify and label them.
• Voice synthesis : Google Assistant's voice synthesizer uses DeepMind's WaveNet ConvNet model.
• Astrophysics : CNNs are used to make sense of radio telescope data and predict the probable visual image to represent that data.

2.1.3 Basic Structure of CNN
[Figure: Input → Convolution → Pooling (feature extraction) → Fully connected → Output (classification)]

Fig. 2.1.1 Basic architecture of CNN

• A convolutional neural network, as discussed above, has the following layers that are useful for various deep learning algorithms. Let us see the working of these layers taking the example of an image having dimensions 12 × 12 × 4. These are :
1. Input layer : This layer accepts the image of width 12, height 12 and depth 4.
2. Convolution layer : It computes the output volume by taking the dot product between each filter and each image patch. For example, if there are 10 filters (and padding preserves the spatial size), the output volume will be 12 × 12 × 10.
3. Activation function layer : This layer applies an activation function to each element in the output of the convolutional layer. Some of the well accepted activation functions are ReLU, Sigmoid, Tanh, Leaky ReLU, etc. These functions do not change the volume obtained at the convolutional layer and hence it remains equal to 12 × 12 × 10.
4. Pool layer : This layer mainly reduces the volume of the intermediate outputs, which enables fast computation of the network model and helps to prevent overfitting.
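The four layers above can be traced on a 12 × 12 × 4 input with a few lines of NumPy. This is only a sketch under assumptions that are not in the text: 3 × 3 filters with random (untrained) weights, zero padding of one pixel so that the 12 × 12 size is preserved, and 2 × 2 max pooling with stride 2.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((12, 12, 4))                 # input layer: 12 x 12 x 4
filters = rng.standard_normal((10, 3, 3, 4))    # assumed: 10 filters of size 3 x 3 x 4

# Convolution layer: zero-pad by 1 pixel so the 12 x 12 spatial size is kept.
padded = np.pad(image, ((1, 1), (1, 1), (0, 0)))
conv = np.zeros((12, 12, 10))
for f in range(10):
    for i in range(12):
        for j in range(12):
            patch = padded[i:i + 3, j:j + 3, :]
            conv[i, j, f] = np.sum(patch * filters[f])   # dot product of filter and patch

# Activation function layer: element-wise ReLU, the volume is unchanged.
relu = np.maximum(conv, 0)

# Pool layer: 2 x 2 max pooling with stride 2 halves width and height.
pooled = relu.reshape(6, 2, 6, 2, 10).max(axis=(1, 3))

print(conv.shape, relu.shape, pooled.shape)
```

Under these assumptions the volume goes from 12 × 12 × 4 to 12 × 12 × 10 after convolution and activation, and to 6 × 6 × 10 after pooling.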

2.2 Convolution Operation

• The convolution operation focuses on extracting/preserving important features from the input. It allows the network to detect horizontal and vertical edges of an image and then, based on those edges, build high-level features in the following layers of the neural network.
• In general form, convolution is an operation on two functions of a real-valued argument. To motivate the definition of convolution, we start with examples of two functions we might use.
• Suppose we are tracking the location of a spaceship with a laser sensor. The laser sensor provides a single output x(t), the position of the spaceship at time t. Both "x" and "t" are real-valued, i.e., we can get a different reading from the laser sensor at any instant in time.
• Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship's position, we would like to average together several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements.
• We can do this with a weighting function w(a), where "a" is the age of a measurement. If we apply such a weighted average operation at every moment, we obtain a new function "s" providing a smoothed estimate of the position of the spaceship.
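The spaceship example can be written in discrete time as s(t) = Σ_a x(t − a) w(a). The sketch below assumes a made-up straight-line trajectory, Gaussian sensor noise and a five-tap weighting function that halves with each step of age; all of these values are illustrative and not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(50)
true_position = 0.5 * t                            # hypothetical trajectory
x = true_position + rng.normal(0, 2.0, t.size)     # noisy laser readings x(t)

# w(a): weight of a measurement with age a; recent readings (small a) weigh more.
ages = np.arange(5)
w = 0.5 ** ages
w = w / w.sum()                                    # weights sum to 1 (a weighted average)

# s(t) = sum over a of x(t - a) * w(a), computed where all 5 readings exist (t >= 4).
s = np.convolve(x, w, mode="valid")
```

Here `s[0]` is the smoothed estimate for t = 4, because the four oldest readings are needed before the first full weighted average can be formed.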
" Convolution operation uses three parameters :Input image, Feature detector and Feature
map.
• The convolution operation involves an input matrix and a filter, also known as the kernel. The input matrix can be the pixel values of a grayscale image, whereas a filter is a relatively small matrix that detects edges by darkening areas of the input image where there are transitions from brighter to darker areas. There can be different types of filters depending upon what type of features we want to detect, e.g. vertical, horizontal or diagonal edges.
• The input image is converted into binary values 1 and 0. The convolution operation, shown in Fig. 2.2.1, is known as the feature detector of a CNN. The input to a convolution can be raw data or a feature map output from another convolution. It is often interpreted as a filter in which the kernel filters input data for certain kinds of information.
• Sometimes a 5 × 5 or a 7 × 7 matrix is used as a feature detector. The feature detector is often referred to as a "kernel" or a "filter". At each step, the kernel is multiplied by the input data values within its bounds, creating a single entry in the output feature map.
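The step "multiply the kernel by the input values within its bounds to get one feature-map entry" can be sketched as follows. The 5 × 5 binary image and 3 × 3 kernel are illustrative values chosen here, and, as in most CNN libraries, the kernel is not flipped (strictly speaking this is cross-correlation).

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image; each position yields one feature-map entry."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]            # input values within the kernel bounds
            feature_map[i, j] = np.sum(patch * kernel)   # element-wise multiply, then sum
    return feature_map

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]
```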

[Figure: a kernel slides over the binary input data to produce the convolved feature]

Fig. 2.2.1 Convolution operation


• Generally, an image can be considered as a matrix whose elements are numbers between 0 and 255. The size of the image matrix is : image height × image width × number of image channels.
• A grayscale image has 1 channel, whereas a colour image has 3 channels.
• Kernel : A kernel is a small matrix of numbers that is used in image convolutions. Differently sized kernels containing different patterns of numbers produce different results under convolution. The size of a kernel is arbitrary but 3 × 3 is often used. Fig. 2.2.2 shows an example of a kernel.

Fig. 2.2.2 Example of kernel

• Convolutional layers perform transformations on the input data volume that are a function of the activations in the input volume and the parameters.
• In reality, convolutional neural networks develop multiple feature detectors and use them to develop several feature maps, which are referred to as convolutional layers, as shown in Fig. 2.2.3.
• Through training, the network determines what features it finds important in order for it to be able to scan images and categorize them more accurately.
• Convolutional layers have parameters for the layer and additional hyper-parameters. Gradient descent is used to train the parameters in this layer such that the class scores are consistent with the labels in the training set.

[Figure: many feature detectors are applied to the input image; the resulting feature maps together form the first convolutional layer]

Fig. 2.2.3 Feature detectors

Components of convolutional layers are as follows :
a) Filters
b) Activation maps
c) Parameter sharing
d) Layer-specific hyper-parameters
• Filters are functions that have a width and height smaller than the width and height of the input volume. The filters are tensors and they are used to convolve the input tensor when the tensor is passed to the layer instance. The values inside the filter tensors (initially random) are the weights of the convolutional layer.
• Each filter is slid across the spatial dimensions (width, height) of the input volume during the forward pass of information through the CNN. This produces a two-dimensional output called an activation map for that specific filter.
2.2.1 Parameter Sharing
• Parameter sharing is a technique used in the convolutional layers of a CNN to control, and further reduce, the total parameter count.
• The number of parameters can be reduced by making the assumption that if one feature is useful to compute at some spatial position (x1, y1), then it is also useful to compute at a different position (x2, y2).
• In other words, denoting a single 2D slice of depth as a depth slice, the neurons in each depth slice are constrained to use the same weights and bias. For example, during back-propagation, every neuron in the network will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice.
• If all neurons in a single depth slice are using the same weight vector, then the forward pass of the convolutional layer can be computed in each depth slice as a convolution of the neuron's weights with the input volume. This is the reason why it is common to refer to the sets of weights as a filter (or a kernel) that is convolved with the input.
• Fig. 2.2.4 shows that convolution shares the same parameters across all spatial locations.

Fig. 2.2.4 Convolution shares the same parameters across all spatial locations
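The statement that gradients are "added up across each depth slice" can be checked on a small one-dimensional case. The input, filter and upstream gradient values below are made up for illustration; the helper name `conv1d_valid` is also an assumption of this sketch.

```python
import numpy as np

def conv1d_valid(x, w):
    """One shared filter w applied at every position of x (no padding)."""
    n_out = x.size - w.size + 1
    return np.array([np.dot(x[i:i + w.size], w) for i in range(n_out)])

x = np.array([1.0, 2.0, -1.0, 0.5, 3.0, -2.0])
w = np.array([0.2, -0.5, 0.1])                  # 3 shared weights serve all 4 positions
upstream = np.array([1.0, -1.0, 0.5, 2.0])      # hypothetical dLoss/dOutput per position

# Every output position computes a gradient for the weights it used ...
per_position = np.array([upstream[i] * x[i:i + 3] for i in range(4)])
# ... and because the weights are shared, these gradients are summed
# into a single update for the one set of weights.
shared_grad = per_position.sum(axis=0)
print(shared_grad)
```

Without sharing, the four positions would need 4 × 3 = 12 separate weights; with sharing there are only 3, and each receives the sum of four gradient contributions.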

2.2.2 Equivariant Representation

• The convolution function is equivariant to translation. This means that shifting the input and then applying convolution is equivalent to applying convolution to the input and then shifting the result.
• If we move the object in the input, its representation will move the same amount in the output.
• General definition : If representation(transform(x)) = transform(representation(x)), then the representation is equivariant to the transform.
• Convolution is equivariant to translation. This is a direct consequence of parameter sharing.
• It is useful when detecting structures that are common in the input, for example, edges in an image.
• Equivariance in early layers is good. We are able to achieve translation-invariance (via max-pooling) due to this property.
• Convolution is not equivariant to other operations such as change in scale or rotation.
• Example of equivariance : With 2D images, convolution creates a map of where certain features appear in the input. If we move the object in the input, the representation will move the same amount in the output. It is useful to detect edges in the first layer of a convolutional network. The same edges appear everywhere in the image, so it is practical to share parameters across the entire image.
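The general definition, representation(transform(x)) = transform(representation(x)), can be tested directly for translation. The sketch below uses a one-dimensional signal with zeros at both ends so that a circular shift (`np.roll`) behaves like a plain translation; the signal and kernel values are illustrative.

```python
import numpy as np

def correlate1d(x, w):
    """Apply the same kernel w at every position of x (no padding)."""
    return np.array([np.dot(x[i:i + w.size], w) for i in range(x.size - w.size + 1)])

x = np.zeros(12)
x[4:7] = [1.0, 3.0, 2.0]               # an "object" in the middle of the input
w = np.array([1.0, 0.0, -1.0])         # a simple edge-detecting kernel

shift = 2
shift_then_conv = correlate1d(np.roll(x, shift), w)   # representation(transform(x))
conv_then_shift = np.roll(correlate1d(x, w), shift)   # transform(representation(x))

print(np.allclose(shift_then_conv, conv_then_shift))  # True
```

The strongest response moves from output index 5 to index 7, i.e. by the same two positions as the object.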


2.2.3 Padding
• Padding is the process of adding one or more pixels of zeros all around the boundaries of an image, in order to increase its effective size. Zero padding helps to make output dimensions and kernel size independent.
• One observation is that the convolution operation reduces the size of the (q + 1)th layer in comparison with the size of the qth layer. This type of reduction in size is not desirable in general, because it tends to lose some information along the borders of the image. This problem can be resolved by using padding.
• Three common zero padding strategies are :
a) Valid convolution : Extreme case in which no zero-padding is used whatsoever, and the convolution kernel is only allowed to visit positions where the kernel is contained entirely within the input. For a kernel of size k in any dimension, an input of size m in that dimension becomes m − k + 1 in the output. This shrinkage restricts architecture depth.
b) Same convolution : Just enough zero-padding is added to keep the size of the output equal to the size of the input. Essentially, for a dimension where the kernel size is k, the input is padded by a total of k − 1 zeros in that dimension.
c) Full convolution : Other extreme case where enough zeros are added for every pixel to be visited k times in each direction, resulting in an output image of width m + k − 1.
• A 1D convolutional block is composed of a configurable number of filters, each of a set size. A convolution operation is performed between the input vector and each filter, producing as output a new vector with as many channels as the number of filters. Every value in the tensor is then fed through an activation function to introduce nonlinearity.
• When padding is not used, the resulting "padding" is also referred to as valid padding. Valid padding generally does not work well from an experimental point of view. In the case of valid padding, the contributions of the pixels on the borders of the layer will be under-represented compared to the central pixels in the next hidden layer, which is undesirable.
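The three output sizes can be confirmed with NumPy's one-dimensional `np.convolve`, whose `mode` argument happens to use the same three names. The input length m = 7 and kernel size k = 3 are arbitrary choices for this sketch; the last few lines count how many kernel positions touch each input pixel under valid convolution, which illustrates why border pixels are under-represented.

```python
import numpy as np

m, k = 7, 3
x = np.arange(1.0, m + 1)
w = np.ones(k) / k

valid = np.convolve(x, w, mode="valid")   # no padding: m - k + 1 outputs
same = np.convolve(x, w, mode="same")     # k - 1 zeros in total: m outputs
full = np.convolve(x, w, mode="full")     # k - 1 zeros on each side: m + k - 1 outputs
print(len(valid), len(same), len(full))   # 5 7 9

# Valid convolution: number of kernel positions that include each input pixel.
visits = np.zeros(m, dtype=int)
for start in range(m - k + 1):
    visits[start:start + k] += 1
print(visits)                             # [1 2 3 3 3 2 1]
```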

2.2.4 Stride
• Convolution functions used in practice differ slightly from the convolution operation as it is usually understood in the mathematical literature.
• In general, a convolution layer consists of the application of several different kernels to the input. This allows the extraction of several different features at all locations in the input. This means that in each layer, a single kernel is not applied; multiple kernels are used as different feature detectors.
• The input is generally not real-valued but instead vector-valued. Multi-channel convolutions are commutative only if the number of output and input channels is the same.
• In order to allow for calculation of features at a coarser level, strided convolutions can be used. The effect of strided convolution is the same as that of a convolution followed by a downsampling stage. This can be used to reduce the representation size.
• The stride indicates the pace by which the filter moves horizontally and vertically over the pixels of the input image during convolution. Fig. 2.2.5 shows stride during convolution.
[Figure: a filter moving across the input image with stride = 1]

Fig. 2.2.5 Stride during convolution

• Stride is a parameter of the neural network's filter that modifies the amount of movement over the image or video. Stride is a component for the compression of images and video data. For example, if a neural network's stride is set to 1, the filter will move one pixel, or unit, at a time.
• Stride depends on what we expect in our output image. We prefer a smaller stride size if we expect several fine-grained features to be reflected in our output. On the other hand, if we are only interested in the macro-level features, we choose a larger stride size.
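The claim that a strided convolution equals a stride-1 convolution followed by downsampling can be verified numerically. The sketch assumes a 7 × 7 input, a 3 × 3 kernel and no padding, for which the output size is floor((m − k) / stride) + 1 in each dimension; the function name `correlate2d` is chosen here for illustration.

```python
import numpy as np

def correlate2d(image, kernel, stride=1):
    """Move the kernel `stride` pixels at a time, horizontally and vertically."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))

dense = correlate2d(image, kernel, stride=1)     # 5 x 5 output
strided = correlate2d(image, kernel, stride=2)   # 3 x 3 output
downsampled = dense[::2, ::2]                    # stride-1 result, every 2nd entry kept

print(dense.shape, strided.shape, np.array_equal(strided, downsampled))
```

The strided version gives the same numbers while computing only the entries that are kept, which is why it is preferred over convolving densely and discarding values.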
2.2.5 Typical Setting
• We use stride sizes of 1 in most settings. Even when strides are used, small strides of size 2 are used. It is desirable to work with square images; in cases where the input images are not square, preprocessing is used to enforce this property.
• For example, one can extract square patches of the image to create the training data. The number of filters in each layer is often a power of 2, because this often results in more efficient processing. Such an approach also leads to hidden layer depths that are powers of 2.

2.2.6 ReLU Layer
• In this layer we remove every negative value from the filtered image and replace it with zero. This function only activates when the node input is above a certain quantity. So, when the input is below zero the output is zero.
• However, when the input rises above this threshold, the output has a linear relationship with the input. This means that it can speed up the training of a deep neural network compared with other activation functions.
• In traditional neural networks, the activation function is combined with a linear transformation with a matrix of weights to create the next layer of activations.
• The reason why the rectifier function is typically used as the activation function in a convolutional neural network is to increase the nonlinearity of the network. By removing negative values from the neurons' input signals, the rectifier function is effectively removing black pixels from the image and replacing them with gray pixels.

2.2.7 Sparse Interactions

• Sparse interactions are also referred to as sparse connectivity or sparse weights. Sparse interaction is implemented by using kernels or feature detectors smaller than the input image, i.e. by making the kernel smaller than the input.
• If we have an input image of size 256 × 256, meaningful features such as edges may occupy only a small subset of the pixels in the image, so a small kernel is enough to detect them. This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations.
• The sparse interaction idea uses a convolution kernel to interact with a local region in the image. This region is called the receptive field; it reduces the number of parameters and improves efficiency compared with a fully connected layer.
• For example, when processing a three-channel picture, the image may contain thousands of pixels, but when we only need to detect the edge information in the image, we do not need to connect the pixels of the whole picture; we only need to use a convolution kernel containing hundreds of pixels to detect it. This calculation method not only improves the calculation efficiency, but also saves a large part of the parameter space.
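The savings can be made concrete with simple arithmetic for the 256 × 256 input mentioned above. The sketch assumes a single-channel image, an output of the same size as the input and a 3 × 3 kernel; these assumptions are for illustration only.

```python
height, width = 256, 256
n_in = height * width
n_out = n_in                         # assumed: output has the same size as the input

# Fully connected: every output unit has a weight for every input pixel.
dense_weights = n_in * n_out

# Sparse interactions only: each output unit sees a 3 x 3 receptive field,
# but every location still has its own weights (a locally connected layer).
k = 3
local_weights = n_out * k * k

# Sparse interactions plus parameter sharing: one 3 x 3 kernel for all locations.
conv_weights = k * k

print(dense_weights, local_weights, conv_weights)   # 4294967296 589824 9
```

The number of multiplications per forward pass also drops from `n_in * n_out` for the fully connected case to `n_out * k * k` for the sparse cases.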

2.3 Pooling
• Pooling helps the representation become slightly invariant to small translations of the input. A pooling function takes the output of the previous layer at a certain location L and computes a "summary" of the neighborhood around L.
• The pooling layer reduces the height and width of the input. It helps reduce computation, and helps make feature detectors more invariant to their position in the input.
• The function of the pooling layer is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting. No learning takes place in the pooling layers.
• The addition of a pooling layer after the convolutional layer is a common pattern used for ordering layers within a convolutional neural network that may be repeated one or more times in a given model.
• The pooling layer operates upon each feature map separately to create a new set of the same number of pooled feature maps. Pooling involves selecting a pooling operation, much like a filter to be applied to feature maps.
• The size of the pooling operation or filter is smaller than the size of the feature map. In the common case of a 2 × 2 filter applied with a stride of 2, the pooling layer reduces the size of each feature map by a factor of 2, i.e. each dimension is halved, reducing the number of pixels or values in each feature map to one quarter the size.
• For example, a pooling layer applied to a feature map of 6 × 6 (36 pixels) will result in an output pooled feature map of 3 × 3 (9 pixels). The pooling operation is specified rather than learned.
• The pooling operation, also called subsampling, is used to reduce the dimensionality of feature maps from the convolution operation. Max pooling and average pooling are the most common pooling operations used in the CNN.
• Pooling units are obtained using functions like max-pooling, average pooling and even L2-norm pooling. At the pooling layer, forward propagation results in an N × N pooling block being reduced to a single value, the value of the "winning unit". Back-propagation of the pooling layer then computes the error which is acquired by this single value "winning unit".
• Pooling layers, also known as downsampling layers, conduct dimensionality reduction, reducing the number of parameters in the input. Similar to the convolutional layer, the pooling operation sweeps a filter across the entire input, but the difference is that this filter does not have any weights. Instead, the kernel applies an aggregation function to the values within the receptive field, populating the output array. There are two main types of pooling :
• Max pooling : As the filter moves across the input, it selects the pixel with the maximum value to send to the output array. As an aside, this approach tends to be used more often compared to average pooling.
• Average pooling : As the filter moves across the input, it calculates the average value within the receptive field to send to the output array.
• Invariance to local translation can be useful if we care more about whether a certain feature is present rather than exactly where it is.
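The 6 × 6 to 3 × 3 example above can be reproduced with a small pooling function. This sketch assumes non-overlapping 2 × 2 blocks (stride equal to the block size) and a feature map filled with the numbers 0 to 35; the function name `pool2d` is chosen here for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Summarise each non-overlapping size x size block by its max or its average."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % size, :w - w % size]      # drop any ragged border
    blocks = blocks.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

feature_map = np.arange(36, dtype=float).reshape(6, 6)      # 6 x 6 = 36 values
max_pooled = pool2d(feature_map, mode="max")                # 3 x 3 = 9 values
avg_pooled = pool2d(feature_map, mode="average")

print(max_pooled.shape, max_pooled[0, 0], avg_pooled[0, 0])  # (3, 3) 7.0 3.5
```

The pooling function has no weights: the block size and the aggregation rule are specified in advance, which matches the statement that no learning takes place in the pooling layers.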
2.4 Convolution Variants : Tiled
• Tiled convolution learns a set of kernels that is rotated through as we move through space, rather than learning a separate set of weights at every spatial location as in a locally connected layer.
• It offers a compromise between a convolutional layer and a locally connected layer. Memory requirements for storing the parameters will increase only by a factor of the size of this set of kernels.
• Let K be a 6-D tensor, where two of the dimensions correspond to different locations in the output map. Rather than having a separate index for each location in the output map, output locations cycle through a set of t different choices of kernel stack in each direction. If t is equal to the output width, this is the same as a locally connected layer.
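A one-dimensional, single-channel simplification shows the cycling through t kernels. In this sketch the kernel set has shape (t, k) rather than being a 6-D tensor, and the input and kernel values are made up for illustration.

```python
import numpy as np

def tiled_correlate1d(x, kernels):
    """kernels has shape (t, k): output position i uses kernel number i % t."""
    t, k = kernels.shape
    n_out = x.size - k + 1
    return np.array([np.dot(x[i:i + k], kernels[i % t]) for i in range(n_out)])

x = np.arange(1.0, 9.0)                          # 8 inputs -> 6 outputs for k = 3
kernels = np.array([[1.0, 0.0, -1.0],
                    [0.5, 0.5, 0.5]])            # t = 2 kernels, cycled through

tiled = tiled_correlate1d(x, kernels)            # alternates between the two kernels
standard = tiled_correlate1d(x, kernels[:1])     # t = 1: ordinary convolution
print(tiled)                                     # [-2.   4.5 -2.   7.5 -2.  10.5]
```

With t = 1 the same kernel is used everywhere, and with t equal to the number of outputs (6 here) every position would have its own kernel, which is the locally connected case.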
2.4.1 Transposed and Dilated Convolutions
• Transposed convolutions : These types of convolutions are also known as deconvolutions or fractionally strided convolutions. A transposed convolutional layer carries out a regular convolution but reverts its spatial transformation.
• Fig. 2.4.1 shows how transposed convolution with a 2 × 2 kernel is computed for a 2 × 2 input tensor.

[Figure: a 2 × 2 input and a 2 × 2 kernel; each input element scales the kernel, and the shifted intermediate results are summed to give the 3 × 3 output]

Fig. 2.4.1 Transposed convolution with a 2 × 2 kernel

• The shaded portions are a portion of an intermediate tensor as well as the input and kernel tensor elements used for the computation.
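The computation for a 2 × 2 input and a 2 × 2 kernel can be sketched as "scale the kernel by each input element, place it at that element's position, and sum the overlaps". Stride 1 and no padding are assumed, and the input and kernel values below are illustrative choices rather than values taken from the text.

```python
import numpy as np

def transposed_conv2d(x, kernel):
    """Each input element adds a scaled copy of the kernel into the output."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros((h + kh - 1, w + kw - 1))
    for i in range(h):
        for j in range(w):
            out[i:i + kh, j:j + kw] += x[i, j] * kernel   # intermediate tensor, accumulated
    return out

x = np.array([[0.0, 1.0],
              [2.0, 3.0]])
kernel = np.array([[0.0, 1.0],
                   [2.0, 3.0]])

y = transposed_conv2d(x, kernel)
print(y)
# [[ 0.  0.  1.]
#  [ 0.  4.  6.]
#  [ 4. 12.  9.]]
```

The output is larger (3 × 3) than the input (2 × 2), the reverse of what a regular valid convolution with the same kernel does to a 3 × 3 input.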
" Dilated convolution operation expands window size without increasing the number of
weights by inserting zero-values into convolution kernels. Dilated convolutions can be used in real-time applications and in applications where the processing power is low, as the RAM requirements are less intensive.
• Dilated convolutions are also called atrous convolutions. The central idea is that a new dilation parameter (d) is introduced, which decides the spacing between the filter weights while performing convolution.
• Fig. 2.4.2 shows convolution with a dilated filter where the dilation factor is d = 2.

[Figure: panels (a) to (d) showing convolution with a dilated filter]

Fig. 2.4.2 Convolution with a dilated filter where the dilation factor is d = 2

• Dilation by a factor of "d" means that the original filter is expanded by d − 1 spaces between each element and the intermediate empty locations are filled in with zeros.
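Inserting d − 1 zeros between neighbouring filter elements turns a k × k filter into one of size k + (k − 1)(d − 1) without adding weights. The sketch below assumes a square kernel, and the 3 × 3 values 1 to 9 are illustrative.

```python
import numpy as np

def dilate_kernel(kernel, d):
    """Insert d - 1 zeros between neighbouring elements of a square kernel."""
    k = kernel.shape[0]
    size = k + (k - 1) * (d - 1)
    dilated = np.zeros((size, size), dtype=kernel.dtype)
    dilated[::d, ::d] = kernel            # original weights, spaced d apart
    return dilated

kernel = np.arange(1, 10).reshape(3, 3)
dilated = dilate_kernel(kernel, d=2)
print(dilated)
# [[1 0 2 0 3]
#  [0 0 0 0 0]
#  [4 0 5 0 6]
#  [0 0 0 0 0]
#  [7 0 8 0 9]]
```

The dilated filter covers a 5 × 5 window but still has only 9 non-zero weights, so convolving with it widens the receptive field at no extra parameter cost.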
2.5 Fully Connected Layers
• Fully connected layers have the normal parameters for the layer and hyper-parameters. This layer performs transformations on the input data volume that are a function of the activations in the input volume and the parameters.
• Neural networks are a set of dependent non-linear functions. Each individual function consists of a neuron (or a perceptron).
• In fully connected layers, the neuron applies a linear transformation to the input vector through a weights matrix. A non-linear transformation is then applied to the product through a non-linear activation function f.

$$y(x) = f(Wx + W_0)$$

• Here, we are taking the dot product between the weights matrix W and the input vector x. The bias term (W_0) can be added inside the non-linear function. It is ignored in the rest of this section as it doesn't affect the output sizes or decision-making and is just another weight.
• The activation function "f" wraps the dot product between the input of the layer and the weights matrix of that layer.
2.5.1 The Interleaving between Layers

• The convolution, pooling and ReLU layers are typically interleaved in a neural network in order to increase the expressive power of the network. The ReLU layers often follow the convolutional layers. After two or three sets of convolutional-ReLU combinations, one might have a max-pooling layer.
• For example : CRCRP, CRCRCRP.
• Here, "C" is a convolutional layer, "R" is a ReLU layer and the max-pooling layer is denoted by "P". This entire pattern (including the max-pooling layer) might be repeated a few times in order to create a deep neural network.
• For example, if the first pattern above is repeated three times and followed by a fully connected layer (denoted by "F"), then we have the following neural network pattern : CRCRPCRCRPCRCRPF.
• LeNet-5 is a convolutional neural network architecture that was created by Yann LeCun in 1998. It includes 7 layers, excluding the input layer, which contain the trainable parameters called weights.
• LeNet-5 consists of two parts : (i) a convolutional encoder consisting of two convolutional layers and (ii) a dense block consisting of three fully connected layers.
• Fig. 2.5.1 shows the architecture of LeNet-5.

[Figure: Input 32 × 32 → convolution (5 × 5) → feature maps 28 × 28 × 6 → subsampling → feature maps 14 × 14 × 6 → convolution (5 × 5) → feature maps 10 × 10 × 16 → subsampling → feature maps 5 × 5 × 16 → convolution (5 × 5) → 120 → fully connected layer → 84 → output 10]

Fig. 2.5.1 LeNet-5 architecture

• Layer C1 is a convolutional layer with six feature maps where the size of the feature maps is 28 × 28.
• Layer S2 is a sub-sampling layer with six feature maps where the size of the feature maps is 14 × 14.
• Layer C3 is a convolutional layer with sixteen feature maps where the size of the feature maps is 10 × 10.
• Layer S4 is a sub-sampling layer with sixteen feature maps where the size of the feature maps is 5 × 5.
• Layer C5 is a convolutional layer with 120 feature maps where the size of the feature maps is 1 × 1.
• Layer F6 contains 84 units and is fully connected to the C5 convolutional layer.
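The feature-map sizes listed for LeNet-5 follow from the output-size formula floor((size + 2 × padding − kernel) / stride) + 1. The sketch assumes 5 × 5 convolutions with stride 1 and no padding, and 2 × 2 sub-sampling with stride 2; the sub-sampling window is an assumption consistent with each map being halved.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                   # LeNet-5 input is 32 x 32
shapes = []
for name, kind, maps in [("C1", "conv", 6), ("S2", "pool", 6),
                         ("C3", "conv", 16), ("S4", "pool", 16),
                         ("C5", "conv", 120)]:
    if kind == "conv":
        size = conv_out(size, kernel=5)              # 5 x 5 kernel, stride 1
    else:
        size = conv_out(size, kernel=2, stride=2)    # assumed 2 x 2 sub-sampling
    shapes.append((name, size, size, maps))

for entry in shapes:
    print(entry)
# ('C1', 28, 28, 6)
# ('S2', 14, 14, 6)
# ('C3', 10, 10, 16)
# ('S4', 5, 5, 16)
# ('C5', 1, 1, 120)
```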
2.6 CNN Learning: Nonlinearity Functions
The weight layers in a CNN are often followed by a nonlinear activation function. Ih%
activation function takes a real valued input and squashes it within a small range suen
[0; 1]and [1; I), The application of anonlinear function after the weight layers
highly important, since it allows a neural network to learn nonlinear mappings.
" In the absence of nonlinearities, a stacked network of weight layers is equivalent
linear mapping from the input domain to the output domain.

TECHNICAL PUBLICATIONS an up-thrust for knowledge


DeepLeaming 2-17 Convolutional Neural Networks

Anonlinear function can also be understood as a switching or a selection mechanism,


which decides whether a neuron will fire or not given all of its inputs. The activation
functions that are commonly used in deep networks are differentiable to enable error
back propagation.
Common activation functions that are used in deep neural networks are Sigmoid, Tanh,
Algebraic Sigmoid, ReLU, Leaky ReLU/PReLU and Exponential Linear Unit. These
activation function are shown below:
1
1

Tanh Algebraic sigmoid


Sigmoid

Leaky ReLUIPReLU Exponentia! linear unit


ReLU
Sigmoid : The sigmoid activation function takes in a real number as its input and
outputs a number in the range of [0, 1].
Tanh : The tanh activation function implements the hyperbolic tangent function to
squash the input values within the range of [-1, 1].
Algebraic sigmoid function : The algebraic sigmoid function also maps the input
within the range [-1, 1].
Rectifier linear unit : The ReLU is a simple activation function which is of a special
practical importance because of its quick computation. A ReLU function maps the input
to a 0 if it is negative and keeps its value unchanged if it is positive.
Leaky ReLU : The rectifier function completely switches off the output if the input is
negative. A leaky ReLU function does not reduce the output to a zero value, rather it
outputs a down-scaled version of the negative input.
Exponential linear units : The exponential linear units have both positive and negative
values and they therefore try to push the mean activations toward zero. It helps in
speeding up the training process while achieving a better performance.
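The activation functions described above can be sketched in a few lines of NumPy. This is an illustration rather than a reference implementation: the algebraic sigmoid is written in one common form, x / sqrt(1 + x^2), and the leaky ReLU slope (0.01) and ELU scale (1.0) are assumed default values that the text does not specify.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the interval between 0 and 1.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Hyperbolic tangent, output between -1 and 1.
    return np.tanh(x)

def algebraic_sigmoid(x):
    # Assumed form x / sqrt(1 + x^2); output between -1 and 1.
    return x / np.sqrt(1.0 + x ** 2)

def relu(x):
    # Zero for negative inputs, identity for positive inputs.
    return np.maximum(x, 0.0)

def leaky_relu(x, slope=0.01):
    # Negative inputs are scaled down by `slope` instead of being zeroed.
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    # Negative inputs follow alpha * (exp(x) - 1), which is bounded below by -alpha.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-2.0, 0.0, 3.0])
for fn in (sigmoid, tanh, algebraic_sigmoid, relu, leaky_relu, elu):
    print(fn.__name__, fn(x))
```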
2.7 Loss Function
" A loss function computes the difference between the estimated output of the model
(the prediction) and the correct output (the ground truth).
Most algorithms in machine learning rely on minimizing or maximizing a function
which we call the "objective function". The group of functions that are minimized are called
"loss functions". A loss function is a measure of how well a prediction model does in
terms of being able to predict the expected outcome.
Loss functions are used to calculate the difference between the predicted output and the
actual output.
The loss function is the function that computes the distance between the current output
of the algorithm and the expected output. It is a method to evaluate how well our algorithm
models the data. Loss functions can be categorized into two groups: one for classification (discrete
values, 0, 1, 2, ...) and the other for regression (continuous values).
The type of loss function used in a CNN model depends on the end problem. The generic set of
problems for which neural networks are usually used can be categorized into the
following categories:
1. Binary Classification (SVM hinge loss, Squared hinge loss).
2. Identity Verification (Contrastive loss).
3. Multi-class Classification (Softmax loss, Expectation loss).
4. Regression (SSIM, L1 error, Euclidean loss).
Loss Function Notation
Loss function notation is as follows :
a) $N$ = The number of samples collected.
b) $P$ = The number of input features gathered.
c) $M$ = The number of output features that have been observed.
d) $(X, Y)$ denotes the input and output data collected. There will be $N$ such pairs, where
the input $X$ is a collection of $P$ values and the output $Y$ is a collection of $M$ values. We
will denote the $i$-th pair in the dataset as $X_i$ and $Y_i$.
e) $\hat{Y}$ = Output of the neural net.
f) $h(X_i) = \hat{Y}_i$ : the neural network transforming the input $X_i$ to give the output.
g) Thus $y_{ij}$ refers to the $j$-th feature observed in the $i$-th sample collected.
h) Loss function = $L(W, b)$


2.7.1 Loss Functions for Regression


Loss functions for regression : Regression involves predicting a specific value that is
continuous in nature. Estimating the price of a house or predicting stock prices are
examples of regression because one works towards building a model that would predict
a real-valued quantity.
Mean Square Error
Mean Square Error (MSE) is the most commonly used regression loss function. MSE is
the mean of the squared distances between our target variable and predicted values.
Mean Squared Error is the average of the squared differences between the actual and the
predicted values. For a data point $Y_i$ and its predicted value $\hat{Y}_i$, where $n$ is the total
number of data points in the dataset, the mean squared error is defined as :
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
Advantages : For small errors, MSE helps converge to the minima efficiently, as the
gradient reduces gradually.
Drawbacks :

a) Squaring the values does increase the rate of training, but at the same time, an
extremely large loss may lead to a drastic jump during backpropagation, which is not
desirable.
b) MSE is also sensitive to outliers.
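A small NumPy sketch can make the definition and the outlier sensitivity concrete. The helper name `mse` and the toy values are assumptions chosen for illustration only.

```python
import numpy as np

def mse(y_true, y_pred):
    # Average of the squared differences between targets and predictions.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 7.5]

# Squared errors: 0.25, 0.0, 1.0, 0.25
clean = mse(y_true, y_pred)

# Adding one badly predicted point (target 10, prediction 30) dominates the loss.
with_outlier = mse(y_true + [10.0], y_pred + [30.0])

print(clean, with_outlier)
```

A single point with an error of 20 contributes 400 to the sum of squares, which is why the second value is far larger than the first.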

2.7.2 Loss Functions for Classification


Loss functions for classification : Classification problems involve predicting a discrete
class output. It involves dividing the dataset into different and unique classes based on
different parameters so that a new and unseen record can be put into one of the classes.
1. Hinge loss
Hinge loss is a specific loss function used by Support Vector Machines (SVM). This
loss function helps an SVM to make a decision boundary with a certain margin
distance.
The equation for hinge loss when data points must be categorized as -1 or 1 is as
follows:
$$L(W, b) = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i \times \hat{y}_i)$$
where $y_i$ is the true label and $\hat{y}_i$ is the output of the model for the $i$-th sample.
Hinge loss is mostly used for binary classification.
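The hinge loss can be sketched directly from its definition. In this illustration the function name `hinge_loss`, the labels and the raw model scores are all assumed toy values.

```python
import numpy as np

def hinge_loss(y_true, scores):
    # y_true holds labels in {-1, 1}; scores are raw model outputs.
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Samples classified correctly with a margin of at least 1 contribute zero.
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y_true = [1, -1, 1, -1]
scores = [2.0, -0.5, -1.0, 0.3]

# Per-sample losses: 0.0, 0.5, 2.0, 1.3
loss = hinge_loss(y_true, scores)
print(loss)
```

The second sample is classified correctly but still incurs a loss of 0.5 because it lies inside the margin.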

Squared hinge loss :

Many extensions of hinge loss are available for use with SVM models. One of
the popular extensions is called squared hinge loss. It simply calculates the square of
the hinge loss value.
Squared hinge loss has the effect of smoothing the surface of the error function and
making it numerically easier to work with.
When hinge loss does not give good enough performance on a given binary classification
problem, a squared hinge loss may be appropriate to use. As
with the hinge loss function, the target variable must be modified to have values in the
set {-1, 1}.
It is simple to implement in Python: we only have to change the loss function name to
"squared_hinge" in the compile() function when building the model.
" A typical application can be classifying email into 'spam' and 'not spam' and we are
only interested in the classification accuracy. Let us see how squared hinge can be used
with Keras. It just involves specifying it as the loss function during the model
compilation step:
import tensorflow
# Compile the model
model.compile(loss='squared_hinge',
              optimizer=tensorflow.keras.optimizers.Adam(learning_rate=0.03),
              metrics=['accuracy'])
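To see what the squared variant computes, the per-sample hinge value can be squared before averaging. This is a minimal NumPy sketch; the function name and the toy labels and scores are assumptions, and it is meant to mirror the idea rather than reproduce the Keras internals.

```python
import numpy as np

def squared_hinge_loss(y_true, scores):
    # y_true holds labels in {-1, 1}; each hinge term is squared before averaging.
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores) ** 2)

y_true = [1, -1, 1, -1]
scores = [2.0, -0.5, -1.0, 0.3]

# Hinge terms 0.0, 0.5, 2.0, 1.3 become 0.0, 0.25, 4.0, 1.69 after squaring.
loss = squared_hinge_loss(y_true, scores)
print(loss)
```

Squaring shrinks small margin violations (0.5 becomes 0.25) and enlarges big ones (2.0 becomes 4.0), so badly misclassified samples weigh more heavily.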

2.8 Gradient Computation


Much of machine learning can be written as an optimization problem.
Example loss functions : Logistic regression, linear regression, principal component
analysis, neural network loss.
A very efficient way to train logistic models is with Stochastic Gradient Descent
(SGD).
One challenge with training on power law data (i.e. most data) is that the terms in the
gradient can have very different strengths.
The idea behind stochastic gradient descent is iterating a weight update based on the
gradient of the loss function :
$$w(k+1) = w(k) - \gamma \nabla L(w(k))$$
where $\gamma$ is the learning rate.
Logistic regression is designed as a binary classifier (output, say, {0, 1}) but actually
outputs the probability that the input instance is in the "1" class.

A logistic classifier has the form:
$$p(X) = \frac{1}{1 + \exp(-X\beta)}$$
where $X = (X_1, \ldots, X_n)$ is a vector of features.
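The update rule and the logistic form can be combined into a short training loop. The following is a minimal sketch under stated assumptions: the toy two-cluster data, the learning rate of 0.1, the 50 epochs and the absence of an intercept term are all choices made for illustration and do not come from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, lr=0.1, epochs=50, seed=0):
    # Stochastic gradient descent: one weight update per training sample.
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            p = sigmoid(X[i] @ beta)      # predicted probability of class 1
            grad = (p - y[i]) * X[i]      # gradient of the log loss for sample i
            beta -= lr * grad             # beta(k+1) = beta(k) - lr * gradient
    return beta

# Assumed toy data: two Gaussian clusters placed symmetrically about the origin.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

beta = sgd_logistic(X, y)
accuracy = np.mean((sigmoid(X @ beta) > 0.5) == y)
print(beta, accuracy)
```

Because the clusters are symmetric about the origin, a model without an intercept is enough here; real data would normally need one.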
Stochastic gradient descent has some serious limitations, however, especially if the gradients
vary widely in magnitude. Some coefficients change very fast, others very slowly.
This happens for text, user activity and social media data (and other power-law data),
because gradient magnitudes scale with feature frequency, i.e. over several orders of
magnitude.
It is not possible to set a single learning rate that trains the frequent and infrequent
features at the same time.
An example of stochastic gradient descent with perceptron loss is shown as follows:
from sklearn.linear_model import SGDClassifier
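One way `SGDClassifier` might be used with the perceptron loss is sketched here. The toy data and the `max_iter`, `tol` and `random_state` values are assumptions chosen for illustration, not settings taken from the text.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Assumed toy data: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# loss="perceptron" selects the perceptron loss; training uses SGD updates.
clf = SGDClassifier(loss="perceptron", max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)

train_accuracy = clf.score(X, y)
print(train_accuracy)
```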
2.8.1 Finding the Optimal Hyper-parameters through Grid Search
In statistics, a hyperparameter is a parameter of a prior distribution; it captures the prior
belief before data is observed.
In any machine learning algorithm, these parameters need to be initialized before
training a model.
Model hyperparameters are the properties that govern the entire training process.
Hyperparameters are important because they directly control the behaviour of the
training algorithm and have a significant impact on the performance of the model
being trained.
Choosing appropriate hyperparameters plays a crucial role in the success of our neural
network architecture, since it makes a huge impact on the learned model.
For example, if the learning rate is too low, the model will miss the important patterns in
the data. If it is too high, the updates may overshoot and training may fail to converge.
Choosing good hyperparameters gives two benefits :
1. The space of possible hyperparameters is searched efficiently.
2. A large set of experiments for hyperparameter tuning is easy to manage.
The process of finding the optimal hyperparameters in machine learning is called
hyperparameter optimisation.

Grid search is a very traditional technique for tuning hyperparameters. It brute-forces
all combinations. Grid search requires creating a set of candidate values for each
hyperparameter, for example two sets :
1. Learning rate 2. Number of layers.
Grid search trains the algorithm for all combinations of values from the two sets of
hyperparameters and measures the performance using the "Cross Validation" technique.
This validation technique gives assurance that our trained model got most of the patterns
from the dataset.

One of the best methods to do validation is "K-Fold Cross Validation", which
helps to provide ample data for training the model and ample data for validation.
With this technique, we simply build a model for each possible combination of all of the
hyperparameter values provided, evaluating each model and selecting the architecture
which produces the best results.
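The way K-fold cross validation divides the data can be sketched as follows. This is an illustrative helper, not a library routine: the name `k_fold_indices`, the shuffling seed and the sizes (10 samples, 5 folds) are assumptions.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    # Shuffle the sample indices once, then cut them into k folds.
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        # Fold i is held out for validation; the remaining folds are used for training.
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print(sorted(train_idx.tolist()), sorted(val_idx.tolist()))
```

Every sample is used for validation exactly once and for training in the other k - 1 rounds, which is how the technique provides data for both purposes.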
For example, say you have two continuous parameters $\alpha$ and $\beta$, where manually selected
values for the parameters are the following :
$\alpha \in \{0, 1, 2\}$
$\beta \in \{.25, .50, .75\}$
Then the pairing of the selected hyperparametric values, $H$, can take on any of the
following:
$H \in \{(0, .25), (0, .50), (0, .75), (1, .25), (1, .50), (1, .75), (2, .25), (2, .50), (2, .75)\}$
Grid search will examine each pairing of $\alpha$ and $\beta$ to determine the best performing
combination. The resulting pairs, $H$, are simply each output that results from taking the
Cartesian product of $\alpha$ and $\beta$.
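The Cartesian product in this example can be enumerated with the standard library. In the sketch below, `validation_score` is a hypothetical stand-in for training a model and returning its cross-validated score; its formula is invented purely so that the example runs.

```python
from itertools import product

alphas = [0, 1, 2]
betas = [0.25, 0.50, 0.75]

def validation_score(alpha, beta):
    # Hypothetical placeholder: a real search would train and validate a model here.
    return -((alpha - 1) ** 2 + (beta - 0.5) ** 2)

# All nine (alpha, beta) pairings, i.e. the set H.
grid = list(product(alphas, betas))

# Evaluate every pairing and keep the best-scoring one.
best_alpha, best_beta = max(grid, key=lambda pair: validation_score(*pair))

print(len(grid), best_alpha, best_beta)
```

With three values per hyperparameter the grid has 3 × 3 = 9 points; every additional hyperparameter multiplies that count again.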
" While straightforward, this "brute force" approach for
hyperparameter optimization has some drawbacks. Higher-dimensional hyperparametric
spaces are far more time consuming to test than the simple two-dimensional problem
presented here.
Also, because there will always be a fixed number of training samples for any given
model, the model's predictive power will decrease as the number of dimensions
increases. This is known as the Hughes phenomenon.
2.8.2 Vanishing Gradient Problem
When back-propagation is used, the earlier layers will receive very small updates
compared to the later layers. This problem is referred to as the vanishing gradient
problem.

The vanishing gradient problem is essentially a situation in which a deep multilayer
feed-forward network or a Recurrent Neural Network (RNN) does not have the ability to
propagate useful gradient information from the output end of the model back to the
layers near the input end of the model.
Weight initialization is one technique that can be used to solve the vanishing gradient
problem. It involves artificially creating an initial value for weights in a neural network
to prevent the backpropagation algorithm from assigning weights that are unrealistically
small.
The most important solution to the vanishing gradient problem is a specific type of
neural network called Long Short-Term Memory Networks (LSTMs).
Indication of vanishing gradient problem :
a) The parameters of the higher layers change to a great extent, while the parameters of
lower layers barely change.
b) The model weights could become 0 during training.
c) The model learns at a particularly slow pace and the training could stagnate at a very
early phase after only a few iterations.
Some methods that are proposed to overcome the vanishing gradient problem :
a) Residual neural networks (ResNets)
b) Multi-level hierarchy
c) Long short term memory (LSTM)
d) Faster hardware
e) ReLU
f) Batch normalization
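Why gradients shrink with depth, and why ReLU appears in the list above, can be illustrated with a deliberately simplified chain of scalar units. The assumptions are loud ones: every layer has unit weight, every pre-activation is fixed at the same hypothetical value 0.5, and the depth of 20 is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid; it never exceeds 0.25.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0

depth = 20
z = 0.5  # assumed pre-activation, identical at every layer for simplicity

# With unit weights, the back-propagated gradient through the chain is the
# product of the local activation derivatives.
sigmoid_chain = np.prod([sigmoid_grad(z)] * depth)
relu_chain = np.prod([relu_grad(z)] * depth)

print(sigmoid_chain, relu_chain)
```

Each sigmoid layer multiplies the gradient by a factor below 0.25, so the product collapses towards zero, while an active ReLU passes the gradient through unchanged.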

2.9 Two Marks Questions with Answers


Q.1 Define convolutional networks.
Ans. : Convolutional networks are simply neural networks that use convolution in place of general
matrix multiplication in at least one of their layers.
Q.2 How are sparse interactions used in convolutional networks ? What are the benefits of it ?
Ans. : Sparse interaction is implemented by using kernels or feature detectors smaller than the
input image, i.e. making the kernel smaller than the input.
Q.3 Why are sparse interactions beneficial ?
Ans. : Fewer parameters : reduces the memory requirements and improves statistical
efficiency.
Computing the output requires fewer operations.

Q.4 Would sparse interactions cause a reduction in performance in convolutional networks ?
Ans. : Not really, since we have deep layers. Even though direct connections in a convolutional net
are very sparse, units in the deeper layers can be indirectly connected to all or most of the input
image.
Q.5 What is equivariant representation ?
Ans. : In case of convolution, the particular form of parameter sharing causes the layer to have a
property called equivariance to translation.
Q.6 List the types of pooling.
Ans. :Types of pooling are max pooling, average pooling, L2 norm and weighted average.
Q.7 Explain the advantages of tiled convolution.
Ans. : It offers a compromise between a convolutional layer and a locally connected layer.
Memory requirements for storing the parameters will increase only by a factor of the size
of this set of kernels.
Q.8 What is a convolution ?
Ans. : Convolution is an orderly procedure where two sources of information are intertwined; it is an
operation that changes a function into something else.
Q.9 Which are four main operations in a CNN ?
Ans. : Four main operations in a CNN are Convolution, Non-Linearity (ReLU), Pooling or
Sub-Sampling and Classification (Fully Connected Layer).
Q.10 Define full convolution.
Ans. : Full convolution applies the maximum possible padding to the input feature maps before
convolution. The maximum possible padding is the one where at least one valid input value is
involved in all convolution cases.
Q.11 What is gradient descent ?
Ans. : Gradient descent is a first-order optimization algorithm. To find a local minimum of a
function using gradient descent, one takes steps proportional to the negative of the gradient of the
function at the current point.
Q.12 What is difference between linear unit and rectified linear unit ?
Ans. : The only difference between a linear unit and a rectified linear unit is that a rectified linear
unit outputs zero across half its domain. This makes the derivatives through a rectified linear unit
remain large whenever the unit is active. The gradients are not only large but also consistent.
Q.13 Define sparse interactions.
Ans. : Sparse interactions are also referred to as sparse connectivity or sparse weights. Sparse
interaction is implemented by using kernels or feature detectors smaller than the input image, i.e.
making the kernel smaller than the input.
Q.14 What are loss functions ?
Ans. : A loss function is a measure of how well a prediction model does in terms of being able to
predict the expected outcome. Loss functions are used to calculate the difference between the
predicted output and the actual output.
Q.15 List the components of convolutional layers.
Ans. : Components of convolutional layers are filters, activation maps, parameter sharing and
layer-specific hyper-parameters.
Q.16 What is use of parameter sharing in CNN ?
Ans. : Parameter sharing is used in CNN to control the total parameter count. Convolutional layers
reduce the parameter count further by using a technique called parameter sharing.
Q.17 Explain padding in CNN.
Ans. : Padding is the process of adding one or more pixels of zeros all around the boundaries of an
image, in order to increase its effective size. Zero padding helps to make output dimensions and
kernel size independent.
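The effect of padding on output size can be sketched with the standard output-size relation for a convolution along one dimension, (input + 2 × padding - kernel) / stride + 1. The function name and the 32-pixel input with a 5-pixel kernel are assumed values for illustration.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    # Number of output positions along one spatial dimension.
    return (input_size + 2 * padding - kernel_size) // stride + 1

# Assumed example: a 32-pixel-wide input and a 5-pixel-wide kernel.
no_padding = conv_output_size(32, 5)               # output shrinks
same_padding = conv_output_size(32, 5, padding=2)  # output keeps the input size
full_padding = conv_output_size(32, 5, padding=4)  # padding of kernel_size - 1

print(no_padding, same_padding, full_padding)
```

With a padding of kernel_size - 1 every output position still overlaps at least one real input pixel, which corresponds to the full convolution described in Q.10.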
Q.18 How many filters must a CNN have ?
Ans. : A CNN does not learn a single filter; it learns multiple features in parallel for a
given input. For example, it is usual for a convolution layer to learn from 32 to 512 filters in parallel
for a given input.
Q.19 What is tiled convolution ?
Ans. : Tiled convolution learns a set of kernels that is rotated through as we move through space,
rather than learning a separate set of weights at every spatial location as in a locally connected layer.
It offers a compromise between a convolutional layer and a locally connected layer.
Memory requirements for storing the parameters will increase only by a factor of the size
of this set of kernels.
