DL Mod4
If we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:
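The two-dimensional discrete convolution implied here is the standard definition (a reconstruction of the omitted formula; S denotes the output feature map):

```latex
% 2-D convolution of input image I with kernel K:
S(i,j) = (I * K)(i,j) = \sum_{m}\sum_{n} I(m,n)\,K(i-m,\,j-n)
% equivalently, by commutativity (the kernel "flipped" relative to the input):
S(i,j) = (K * I)(i,j) = \sum_{m}\sum_{n} I(i-m,\,j-n)\,K(m,n)
```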
The commutative property of convolution arises because we have flipped the kernel relative
to the input, in the sense that as m increases, the index into the input increases, but the index
into the kernel decreases. The only reason to flip the kernel is to obtain the commutative
property.
Convolution is an operation equivariant to translation: because the same kernel is shared across every position of the input (applied with the same stride from start to end), shifting the input simply shifts the output by the same amount.
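A quick numerical check of this equivariance (a minimal sketch using numpy/scipy; circular padding is used so that the shift is exact even at the borders, and the array contents are arbitrary):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
image = rng.random((6, 6))    # arbitrary "image"
kernel = rng.random((3, 3))   # arbitrary kernel

# Shift the input down by one row, then convolve ...
a = convolve2d(np.roll(image, 1, axis=0), kernel, mode="same", boundary="wrap")
# ... versus convolve first, then shift the output by the same amount.
b = np.roll(convolve2d(image, kernel, mode="same", boundary="wrap"), 1, axis=0)

print(np.allclose(a, b))   # True: translating the input just translates the output
```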
Q)Explain Convolutional Network Components.
Ans) A typical layer of a convolutional network consists of three stages:
In the first stage, the layer performs several convolutions in parallel to produce a set of
linear activations.
In the second stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called the detector stage.
In the third stage, we use a pooling function to modify the output of the layer
further.
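A minimal sketch of these three stages, assuming PyTorch (the channel counts, kernel size, and pool size are illustrative choices, not fixed by the text):

```python
import torch
import torch.nn as nn

# One "convolutional layer" in the three-stage sense:
layer = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # stage 1: convolution (linear activations)
    nn.ReLU(),                                                            # stage 2: detector (nonlinearity)
    nn.MaxPool2d(kernel_size=2, stride=2),                                # stage 3: pooling
)

x = torch.randn(1, 3, 32, 32)   # a dummy batch of one RGB 32x32 image
print(layer(x).shape)           # torch.Size([1, 16, 16, 16])
```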
Q)What is Pooling? How does Max Pooling provide Invariance to Translation? V imp
Ans)The pooling function replaces the output of the net at a certain location with a summary
statistic of the nearby outputs.
• Pooling helps to make the representation become approximately invariant to small
translations of the input.
• Invariance to translation means that if we translate the input by a small amount, the
values of most of the pooled outputs do not change.
• Invariance to local translation can be a very useful property if we care more about
whether some feature is present than exactly where it is.
• For example, when determining whether an image contains a face, we need not know
the location of the eyes with pixel-perfect accuracy, we just need to know that there is
an eye on the left side of the face and an eye on the right side of the face.
• In other contexts, it is more important to preserve the location of a feature. For example,
if we want to find a corner defined by two edges meeting at a specific orientation, we
need to preserve the location of the edges well enough to test whether they meet.
• Max pooling reports the maximum output within a rectangular neighborhood.
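A tiny numeric illustration of this invariance (a sketch with numpy; the 4×4 image and the single "detected feature" are arbitrary):

```python
import numpy as np

def max_pool_2x2(x):
    # non-overlapping 2x2 max pooling over a 2-D array
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.zeros((4, 4))
img[1, 1] = 1.0                  # a "feature" detected at position (1, 1)

shifted = np.zeros((4, 4))
shifted[0, 1] = 1.0              # the same feature, translated up by one pixel

print(max_pool_2x2(img))         # [[1. 0.] [0. 0.]]
print(max_pool_2x2(shifted))     # identical: the pooled output did not change
```

Note that the shift here stays inside the same pooling window; a larger translation would move the feature into a different pooled cell, which is why the invariance is only to *small* translations.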
□ One special case of the zero-padding setting is the extreme case in which no zero-padding is used whatsoever, and the convolution kernel is only allowed to visit positions where the entire kernel is contained entirely within the image. This is called valid convolution. In this case, all pixels in the output are a function of the same number of pixels in the input. However, the size of the output shrinks at each layer: if the input image has width m and the kernel has width k, the output will be of width m − k + 1.
□ Another special case of the zero-padding setting is when just enough zero-
padding is added to keep the size of the output equal to the size of the input.
This is called same convolution.
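These two cases correspond to the "valid" and "same" modes of common convolution routines; a quick check of the resulting sizes (a sketch using scipy, with an arbitrary input of width m = 8 and kernel of width k = 3):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.ones((8, 8))      # input of width m = 8
kernel = np.ones((3, 3))     # kernel of width k = 3

print(convolve2d(image, kernel, mode="valid").shape)  # (6, 6): m - k + 1 = 8 - 3 + 1
print(convolve2d(image, kernel, mode="same").shape)   # (8, 8): zero-padded to keep the input size
```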
□ Another way to obtain convolution kernels is to design them by hand, for example by setting each kernel to detect edges at a certain orientation or scale.
□ Finally, one can learn the kernels with an unsupervised criterion. For example,
apply k-means clustering to small image patches, then use each learned centroid
as a convolution kernel.
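As an illustration of the hand-designed option, a Sobel-style kernel for vertical edges (a sketch; the test image is an arbitrary dark/bright split):

```python
import numpy as np
from scipy.signal import convolve2d

# Hand-designed kernel: responds strongly to vertical edges (horizontal intensity changes)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# An image that is dark on the left half and bright on the right half
image = np.zeros((6, 6))
image[:, 3:] = 1.0

edges = convolve2d(image, sobel_x, mode="valid")
print(edges)   # nonzero (magnitude 4) only in the columns straddling the dark/bright boundary
```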
• Learning the features with an unsupervised criterion allows them to be determined
separately from the classifier layer at the top of the architecture. One can then extract
the features for the entire training set just once, essentially constructing a new training
set for the last layer.
• Random filters often work surprisingly well in convolutional networks: it has been shown that layers consisting of convolution followed by pooling naturally become frequency selective and translation invariant when assigned random weights.
• This provides an inexpensive way to choose the architecture of a convolutional
network: first evaluate the performance of several convolutional network architectures
by training only the last layer, then take the best of these architectures and train the
entire architecture using a more expensive approach.
• An intermediate approach is to learn the features, but with methods that do not require full forward and back-propagation at every gradient step. As with multilayer perceptrons, we can use greedy layer-wise pretraining: train the first layer in isolation, extract all features from the first layer only once, then train the second layer in isolation given those features, and so on.
• Instead of training an entire convolutional layer at a time, we can train a model of a small patch, as is done with k-means. We can then use the parameters from this patch-based model to define the kernels of a convolutional layer.
• This means that it is possible to use unsupervised learning to train a convolutional
network without ever using convolution during the training process.
• Using this approach, we can train very large models and incur a high computational
cost only at inference time.
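A sketch of the patch-based k-means idea described above, assuming scikit-learn and scipy are available (the image sizes, patch size, and number of clusters are arbitrary illustrative choices; random kernels could be dropped in instead, as the notes point out):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
images = rng.random((100, 16, 16))          # stand-in for a set of small training images

# 1) Collect small patches from the training images.
patch = 5
patches = np.array([img[i:i + patch, j:j + patch].ravel()
                    for img in images
                    for i in range(0, 16 - patch, 4)
                    for j in range(0, 16 - patch, 4)])

# 2) Cluster the patches; each learned centroid becomes one convolution kernel.
kmeans = MiniBatchKMeans(n_clusters=8, random_state=0).fit(patches)
kernels = kmeans.cluster_centers_.reshape(8, patch, patch)

# 3) Use the fixed kernels as a convolutional feature extractor; only a
#    classifier on top of these features would then be trained.
features = np.stack([convolve2d(images[0], k, mode="valid") for k in kernels])
print(features.shape)   # (8, 12, 12): one feature map per kernel
```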
• Architecture: LeNet
• In LeNet-1, with convolutional and subsampling/pooling layers introduced, the error rate on the test data came down substantially.
• It is noted that, at the time the authors invented LeNet, they used an average pooling layer, which outputs the average value of each 2×2 feature map. Nowadays many LeNet implementations use max pooling instead, where only the maximum value of each 2×2 feature map is output; it turns out this can help speed up training, as only the strongest response in each neighborhood is passed on.
• LeNet-4, with more feature maps and one more fully connected layer, reduced the error rate further.
• LeNet-5, the most popular LeNet people talk about, has only slight differences from LeNet-4. With more feature maps and one more fully connected layer, its error rate is 0.95% on the test data.
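A minimal LeNet-5-style sketch in PyTorch for 32×32 grayscale inputs (an illustration of the commonly cited layer layout, not the original implementation; it uses max pooling where the original used average pooling/subsampling):

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # C1: 6 feature maps, 28x28
    nn.Tanh(),
    nn.MaxPool2d(2),                  # S2: 6 feature maps, 14x14 (originally average pooling)
    nn.Conv2d(6, 16, kernel_size=5),  # C3: 16 feature maps, 10x10
    nn.Tanh(),
    nn.MaxPool2d(2),                  # S4: 16 feature maps, 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),       # C5: fully connected
    nn.Tanh(),
    nn.Linear(120, 84),               # F6: fully connected
    nn.Tanh(),
    nn.Linear(84, 10),                # output: 10 digit classes
)

x = torch.randn(1, 1, 32, 32)         # one 32x32 grayscale input
print(lenet5(x).shape)                # torch.Size([1, 10])
```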
AlexNet
The convolutional neural network (CNN) architecture known as AlexNet was created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, who served as Krizhevsky’s PhD advisor. It was one of the first architectures to use GPUs to boost training performance. AlexNet consists of 5 convolution layers, 3 max-pooling layers, 2 normalization layers, 2 fully connected layers and 1 softmax layer. Each convolution layer consists of a convolution filter and a non-linear activation function called “ReLU”. The pooling layers perform the max-pooling function, and the input size is fixed because of the presence of the fully connected layers. The input size is usually quoted as 224x224x3, but because of the padding involved it effectively works out to 227x227x3. In total, AlexNet has over 60 million parameters.
Key Features:
• ‘ReLU’ is used as an activation function rather than ‘tanh’
• Batch size of 128
• SGD with momentum is used as the learning algorithm
• Data augmentation is carried out, such as flipping, jittering, cropping, colour normalization, etc.
AlexNet was trained on GTX 580 GPUs with only 3 GB of memory each, which could not fit the entire network, so the network was split across 2 GPUs, with half of the neurons (feature maps) on each GPU.
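A sketch of an AlexNet-style training step, assuming torchvision is available (the batch size, momentum, learning rate, and weight decay follow the commonly cited values from the key features and the original paper; the data pipeline is omitted):

```python
import torch
import torchvision

# AlexNet-style model (torchvision's version is a single-GPU variant,
# not the original two-GPU split described above, and omits the LRN layers)
model = torchvision.models.alexnet(num_classes=1000)

# SGD with momentum, as listed in the key features
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

# One illustrative step on a dummy batch of 128 images (227x227x3 inputs)
images = torch.randn(128, 3, 227, 227)
labels = torch.randint(0, 1000, (128,))

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(loss.item())
```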