
Module-4(CNN)

Q) What do you mean by convolutional neural networks? Explain the convolution operation with an example. V imp
Ans) CNNs are a specialized kind of neural network for processing data that has a known, grid-like topology.
Example
• 1D grid – time series data (1D grid taking samples at regular time intervals)
• 2D grid – image data (2D grid of pixels)
Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
Note that the convolution used in neural networks does not correspond precisely to the definition of convolution used in other fields such as engineering or pure mathematics.
Convolution operation
Convolution is an operation on two functions of a real-valued argument. The convolution operation is typically denoted with an asterisk:
s(t) = (x ∗ w)(t)
• In convolutional network terminology, the first argument (in this example, the function x) to the convolution is often referred to as the input, and the second argument (in this example, the function w) as the kernel.
• The output is sometimes referred to as the feature map.
• As a motivating example, suppose a laser sensor tracks the position of an object, giving a noisy reading x(t) at time t; to reduce the noise we average several measurements with a weighting function w, obtaining the smoothed estimate s(t) above.
• The idea of a laser sensor that can provide measurements at every instant in time is not realistic. Usually, when we work with data on a computer, time will be discretized, and our sensor will provide data at regular intervals.
• In our example, it is more realistic to assume that our laser provides a measurement once per second, so the time index t can take on only integer values.
• If we now assume that x and w are defined only on integer t, we can define the discrete convolution:
s(t) = (x ∗ w)(t) = Σ_a x(a) w(t − a)
If we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:
S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)
The commutative property of convolution arises because we have flipped the kernel relative to the input: as m increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. Many neural network libraries instead implement the closely related cross-correlation, which is convolution without flipping the kernel, but still call it convolution.

The output image is reduced after convolution: if the input is n × n and the kernel is f × f, the output is (n − f + 1) × (n − f + 1).
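A minimal sketch of this shrinking "valid" convolution in Python with NumPy; the 6×6 image and the averaging kernel are arbitrary illustrations, not part of the original notes:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid 2-D convolution: flip the kernel, then slide it over the image."""
    k = np.flipud(np.fliplr(kernel))         # flip the kernel (true convolution)
    n, f = image.shape[0], k.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))   # output shrinks to (n-f+1) x (n-f+1)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * k)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # n = 6
kernel = np.ones((3, 3)) / 9.0                     # f = 3, a simple averaging kernel
print(conv2d_valid(image, kernel).shape)           # (4, 4) = (6-3+1, 6-3+1)
```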
Q)Explain how convolution improves a machine learning system. V.v imp
Ans)Convolution leverages three important ideas that can help improve a machine-learning
system:
1. Sparse interactions
2. Parameter sharing
3. Equivariant representations
Sparse Interactions
• Traditional neural network layers use matrix multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit. This means every output unit interacts with every input unit.
• Convolutional networks, however, typically have sparse interactions (also called sparse connectivity or sparse weights), since they use the convolution operation instead of matrix multiplication.
• Sparse connectivity is accomplished by making the kernel smaller than the input.
• With m inputs and n outputs, matrix multiplication requires m × n parameters and O(m × n) runtime per example. If we limit the number of connections each output may have to k, the sparsely connected approach requires only k × n parameters and O(k × n) runtime per example.
• k is typically several orders of magnitude smaller than m.
Parameter Sharing
• In a traditional neural network, each element of the weight matrix is used exactly once when computing the output of a layer: it is multiplied by one element of the input and then never revisited.
• Parameter sharing refers to using the same parameter for more than one function in a model. The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set.
• The network is said to have tied weights, because the value of the weight applied to one input is tied to the value of a weight applied elsewhere.
• Each member of the kernel is used at every position of the input.
• Parameter sharing has two main effects (see the parameter-count sketch below):
- The same features are extracted at every location.
- Memory requirements are greatly reduced.
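A back-of-the-envelope comparison of the parameter counts implied by sparse interactions and parameter sharing; the image size and kernel size below are hypothetical, chosen only to make the contrast concrete:

```python
# Hypothetical sizes: a 320x280 single-channel image mapped to an
# equally sized output, with a 3x3 kernel.
m = 320 * 280        # input units
n = 320 * 280        # output units
k = 3 * 3            # connections per output unit for a 3x3 kernel

dense_params = m * n     # fully connected: one weight per input-output pair
local_params = k * n     # sparse but unshared: k weights per output unit
shared_params = k        # convolution: one shared k-element kernel everywhere

print(f"dense: {dense_params:,}, local: {local_params:,}, shared: {shared_params}")
# dense: 8,028,160,000, local: 806,400, shared: 9
```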
Equivariant Representations
• In the case of convolution, the particular form of parameter sharing causes the layer to
have a property called equivariance to translation.
• To say a function is equivariant means that if the input changes, the output changes in
the same way.
• Specifically, a function f (x) is equivariant to a function g if f(g(x)) = g(f (x)).
In the case of convolution, if we let g be any function that translates the input, i.e., shifts it, then the convolution function is equivariant to g. For example, let I be a function giving image brightness at integer coordinates. Let g be a function mapping one image function to another image function, such that I' = g(I) is the image function with I'(x, y) = I(x − 1, y); this shifts every pixel of I one unit to the right. If we apply this transformation to I, then apply convolution, the result will be the same as if we applied convolution to I, then applied the transformation g to the output.

Convolution is thus equivariant to translation because the same shared kernel is applied, with the same stride, at every position of the input.
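A small numeric check of this equivariance with NumPy; the signal and the smoothing kernel are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(50)          # 1-D input signal
w = np.array([0.25, 0.5, 0.25])      # kernel

y = np.convolve(x, w, mode='valid')          # convolve, then shift
x_shifted = np.roll(x, 1)                    # shift the input right by one step
y_shifted = np.convolve(x_shifted, w, mode='valid')

# Away from the wrap-around boundary, the two orders agree exactly:
# convolving the shifted input equals shifting the convolved output.
print(np.allclose(y_shifted[1:], y[:-1]))    # True
```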
Q) Explain Convolutional Network Components.
Ans) A typical layer of a convolutional network consists of three stages:
1. In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations.
2. In the second stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called the detector stage.
3. In the third stage, we use a pooling function to modify the output of the layer further.
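A sketch of the three stages on a single channel, assuming NumPy/SciPy; the shapes and random values are illustrative, not a full layer implementation:

```python
import numpy as np
from scipy.signal import convolve2d

x = np.random.default_rng(1).standard_normal((8, 8))   # input
k = np.random.default_rng(2).standard_normal((3, 3))   # one kernel

z = convolve2d(x, k, mode='valid')   # stage 1: convolution (linear activation)
a = np.maximum(z, 0.0)               # stage 2: detector (ReLU nonlinearity)

# stage 3: 2x2 max pooling with stride 2
h, w = a.shape[0] // 2, a.shape[1] // 2
pooled = a[:h * 2, :w * 2].reshape(h, 2, w, 2).max(axis=(1, 3))

print(z.shape, a.shape, pooled.shape)   # (6, 6) (6, 6) (3, 3)
```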
Q) What is pooling? How does max pooling provide invariance to translation? V imp
Ans) A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs.
• Pooling helps to make the representation become approximately invariant to small
translations of the input.
• Invariance to translation means that if we translate the input by a small amount, the
values of most of the pooled outputs do not change.
• Invariance to local translation can be a very useful property if we care more about
whether some feature is present than exactly where it is.
• For example, when determining whether an image contains a face, we need not know
the location of the eyes with pixel-perfect accuracy, we just need to know that there is
an eye on the left side of the face and an eye on the right side of the face.
• In other contexts, it is more important to preserve the location of a feature. For example,
if we want to find a corner defined by two edges meeting at a specific orientation, we
need to preserve the location of the edges well enough to test whether they meet.
• Max pooling reports the maximum output within a rectangular neighborhood.

• Average pooling reports the average output of a rectangular neighborhood.
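A small illustration of this invariance, with made-up detector values in the style of the standard textbook figure; `max_pool` is a hypothetical stride-1 pooling helper:

```python
import numpy as np

def max_pool(v, width=3):
    """Max pooling over a width-3 window with stride 1."""
    return np.array([v[i:i + width].max() for i in range(len(v) - width + 1)])

a       = np.array([0.1, 1.0, 0.2, 0.1])   # detector-stage outputs
a_shift = np.array([0.3, 0.1, 1.0, 0.2])   # same input shifted one step right

print(max_pool(a))        # [1. 1.]
print(max_pool(a_shift))  # [1. 1.] -- pooled outputs unchanged, even though
                          # every detector value has moved
```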


Q) Explain convolution and pooling as an infinitely strong prior.
Ans)
• A prior probability distribution is a probability distribution over the parameters of a model that encodes our beliefs about what models are reasonable, before we have seen any data.
• Prior probabilities (beliefs held before we see actual data) can be strong or weak:
• A weak prior is a prior distribution with high entropy (e.g., a Gaussian distribution with high variance); it allows the data to move the parameters more or less freely.
• A strong prior is a prior distribution with low entropy (e.g., a Gaussian distribution with low variance); it plays a more active role in determining the parameters.
• An infinitely strong prior places zero probability on some parameter values and says they are completely forbidden, regardless of how much support the data gives to those values.
• A convolutional net can be viewed as a fully connected net with an infinitely strong prior over its weights:
• The weights must be zero except in each unit's small receptive field.
• The weights must be identical for neighboring hidden units, only shifted in space.
• Thinking of a convolutional net as a fully connected net with an infinitely strong prior can give us some insights into how convolutional nets work.
• One key insight is that convolution and pooling can cause underfitting:
• The prior is useful only when the assumptions made by
the prior are reasonably accurate.
• If a task relies on preserving precise spatial information, then pooling on all
features can increase training error.
• When a task involves incorporating information from very distant locations in
the input, then the prior imposed by convolution may be inappropriate.
• Another key insight from this view is that we should only compare
convolutional models to other convolutional models in benchmarks of statistical
learning performance.
Q) Explain variants of the basic convolution function with an example. V imp
Ans) The main variants are:
• Stride: the amount of downsampling.
• Zero padding: avoids layer-to-layer shrinking.
• Unshared convolution: like convolution, but without sharing parameters across locations.
• Partial connectivity between channels.
• Tiled convolution: cycles between a set of shared parameter groups.
1. Convolution with Stride
• We may want to skip over some positions of the kernel in order to reduce the computational cost (at the expense of not extracting our features as finely).
• We can think of this as downsampling the output of the full convolution function.
• If we want to sample only every s pixels in each direction in the output, we refer to s as the stride of this downsampled convolution. For a valid convolution of input width m with kernel width k and stride s, the output width is ⌊(m − k)/s⌋ + 1.

2. Zero Padding Controls Size


• One essential feature of any convolutional network implementation is the ability to
implicitly zero-pad the input V in order to make it wider.
• Without this feature, the width of the representation shrinks by one pixel less than the
kernel width at each layer.
• Zero padding the input allows us to control the kernel width and the size of the output
independently.
• Three special cases of the zero-padding setting -

□ One is the extreme case in which no zero-padding is used whatsoever, and the
convolution kernel is only allowed to visit positions where the entire kernel is
contained entirely within the image. This is called valid convolution. In this
case, all pixels in the output are a function of the same number of pixels in the
input. However, the size of the output shrinks at each layer. If the input image
has width m and the kernel has width k, the output will be of width m − k + 1.

□ Another special case of the zero-padding setting is when just enough zero-
padding is added to keep the size of the output equal to the size of the input.
This is called same convolution.

□ The other extreme case, referred to as full convolution, adds enough zeroes for every input pixel to be visited k times in each direction, resulting in an output image of width m + k − 1.
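SciPy's convolve2d exposes these three zero-padding regimes directly, so the output widths above (and the stride arithmetic from case 1) can be checked; a small sketch:

```python
import numpy as np
from scipy.signal import convolve2d

m, k = 6, 3
I = np.ones((m, m))
K = np.ones((k, k))

print(convolve2d(I, K, mode='valid').shape)  # (4, 4): width m - k + 1
print(convolve2d(I, K, mode='same').shape)   # (6, 6): width m
print(convolve2d(I, K, mode='full').shape)   # (8, 8): width m + k - 1

s = 2  # strided convolution: keep every s-th output, width floor((m-k)/s) + 1
print(convolve2d(I, K, mode='valid')[::s, ::s].shape)  # (2, 2)
```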
3. Unshared Convolution
• In some cases, we do not actually want to use convolution, but rather locally connected
layers.
• This is sometimes also called unshared convolution, because it is a similar operation
to convolution with a small kernel, but without sharing parameters across locations.
• Locally connected layers are useful - when we know that each feature should be a
function of a small part of space, but there is no reason to think that the same feature
should occur across all of space.
For example, if we want to tell if an image is a picture of a face, we only need to look for the
mouth in the bottom half of the image.
4. Partial Connectivity Between Channels
• It can also be useful to make versions of convolution or locally connected layers in which the connectivity is further restricted, for example to constrain each output channel i to be a function of only a subset of the input channels.
• A common way to do this is to make the first m output channels connect to only the first n input channels, the second m output channels connect to only the second n input channels, and so on.
• Modeling interactions between fewer channels gives the network fewer parameters, which reduces memory consumption, increases statistical efficiency, and also reduces the amount of computation.
• It accomplishes these goals without reducing the number of hidden units.
5. Tiled Convolution
• Tiled convolution offers a compromise between a convolutional layer and a locally
connected layer.
• Rather than learning a separate set of weights at every spatial location, we learn a set of
kernels that we rotate through as we move through space.
• This means that immediately neighboring locations will have different filters, like in a
locally connected layer, but the memory requirements for storing the parameters will
increase only by a factor of the size of this set of kernels, rather than the size of the
entire output feature map.
Structured Outputs
• Convolutional networks can be used to output a high-dimensional, structured
object, rather than just predicting a class label for a classification task or a real value
for a regression task.
• Typically this object is just a tensor, emitted by a standard convolutional layer.
• For example, the model might emit a tensor S, where S_{i,j,k} is the probability that pixel (j, k) of the input to the network belongs to class i.
• This allows the model to label every pixel in an image and draw precise masks that
follow the outlines of individual objects.

Structured Outputs - Strategy for pixel-wise labeling of images


X : Input image tensor
Y : Probability distribution over labels for each pixel
H : Hidden representation
U : Tensor of convolution kernels
V : Tensor of kernels to produce an estimate of the labels
W : Kernel tensor to convolve over Y to provide input to H
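A sketch of how such a tensor S can be produced, assuming NumPy: a hypothetical 1×1 kernel V maps hidden features H to per-pixel class scores, and a softmax over the class axis turns them into probabilities (all shapes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((16, 10, 10))   # hidden representation: channels x H x W
V = rng.standard_normal((5, 16))        # 1x1 kernel: classes x channels

scores = np.einsum('cf,fjk->cjk', V, H)                          # class scores per pixel
S = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)   # softmax over classes

# S[i, j, k] = estimated probability that pixel (j, k) belongs to class i
print(S.shape, np.allclose(S.sum(axis=0), 1.0))   # (5, 10, 10) True
```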

Q) Explain different data types that can be processed by CNN. V imp


• The data used with a convolutional network usually consists of several channels, each
channel being the observation of a different quantity at some point in space or time.
• CNN can also process inputs with varying spatial extents - These kinds of input simply
cannot be represented by traditional, matrix multiplication-based neural networks. This
provides a compelling reason to use convolutional networks.
• For example, consider a collection of images, where each image has a different width
and height.
• It is unclear how to model such inputs with a weight matrix of fixed size.
• Convolution is straightforward to apply; the kernel is simply applied a different number
of times depending on the size of the input, and the output of the convolution operation
scales accordingly.
• Sometimes the output of the network is allowed to have variable size as well as the
input, for example if we want to assign a class label to each pixel of the input.
• In other cases, the network must produce some fixed-size output, for example if we
want to assign a single class label to the entire image.
Examples of formats of data that can be used with convolutional networks include 1-D data (e.g., audio waveforms), 2-D data (e.g., images), and 3-D data (e.g., volumetric scans or video), each in single-channel or multi-channel form.
Q) Discuss efficient convolution algorithms.
Ans)
• Modern CNNs often involve more than one million units, so exploiting parallel computation resources is essential.
• Selecting an appropriate convolution algorithm is also important to speed up convolution.
• When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector per dimension, the kernel is called separable. If the kernel is separable, naive convolution is inefficient.
• Naive d-dimensional convolution requires O(w^d) runtime and parameter storage space, where w is the kernel width in each dimension.
• Separable convolution requires only O(w × d) runtime and parameter storage space.
Devising faster ways of performing convolution, or approximate convolution that does not harm the accuracy of the model, is an active area of research.
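A quick check of the separable trick with SciPy: a separable 3×3 kernel built as an outer product can be applied as two cheap 1-D convolutions (the smoothing vector [1, 2, 1] is an arbitrary choice):

```python
import numpy as np
from scipy.signal import convolve2d

v = np.array([1.0, 2.0, 1.0])
K = np.outer(v, v)                     # separable 3x3 kernel

I = np.random.default_rng(0).standard_normal((7, 7))

full2d = convolve2d(I, K, mode='valid')   # naive 2-D convolution: O(w^2) per output

# Separable version: convolve each row with v, then each column: O(w*d) per output.
rows = np.apply_along_axis(lambda r: np.convolve(r, v, mode='valid'), 1, I)
sep = np.apply_along_axis(lambda c: np.convolve(c, v, mode='valid'), 0, rows)

print(np.allclose(full2d, sep))        # True
```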
Q) Explain strategies for obtaining convolution kernels without supervised training.
Ans) Random or unsupervised feature learning
• This approach was popular from roughly 2007 to 2013, when labelled datasets were small and computational power was more limited. Today, most CNNs are trained in a purely supervised fashion.
• Typically, the most expensive part of convolutional network training is learning the features.
• The output layer is relatively inexpensive, because only a small number of features reach it after passing through several layers of pooling.
• One way to reduce the cost of convolutional network training is therefore to use features that are not trained in a supervised fashion.
• There are three basic strategies for obtaining convolution kernels without supervised training.

□ One is to simply initialize them randomly.

□ Another is to design them by hand, for example by setting each kernel to detect
edges at a certain orientation or scale.

□ Finally, one can learn the kernels with an unsupervised criterion. For example,
apply k-means clustering to small image patches, then use each learned centroid
as a convolution kernel.
• Learning the features with an unsupervised criterion allows them to be determined separately from the classifier layer at the top of the architecture. One can then extract the features for the entire training set just once, essentially constructing a new training set for the last layer.
• Random filters often work surprisingly well in convolutional networks: layers consisting of convolution followed by pooling naturally become frequency-selective and translation-invariant when assigned random weights.
• This provides an inexpensive way to choose the architecture of a convolutional network: first evaluate the performance of several convolutional network architectures by training only the last layer, then take the best of these architectures and train the entire architecture using a more expensive approach.
• An intermediate approach is to learn the features, but with methods that do not require full forward and back-propagation at every gradient step. As with multilayer perceptrons, we can use greedy layer-wise pretraining: train the first layer in isolation, extract all features from the first layer once, then train the second layer in isolation given those features, and so on.
• Instead of training an entire convolutional layer at a time, we can train a model of a small patch, for example with k-means clustering. We can then use the parameters from this patch-based model to define the kernels of a convolutional layer.
• This means that it is possible to use unsupervised learning to train a convolutional network without ever using convolution during the training process.
• Using this approach, we can train very large models and incur a high computational cost only at inference time.
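A minimal sketch of the third strategy with scikit-learn: cluster small image patches with k-means and use the centroids as convolution kernels. The random "image" is only a stand-in for real training data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.image import extract_patches_2d

image = np.random.default_rng(0).random((32, 32))   # stand-in for a training image
patches = extract_patches_2d(image, (5, 5), max_patches=500, random_state=0)

X = patches.reshape(len(patches), -1)    # flatten 5x5 patches to vectors
X = X - X.mean(axis=1, keepdims=True)    # simple per-patch normalization

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
kernels = kmeans.cluster_centers_.reshape(8, 5, 5)   # 8 learned 5x5 kernels

print(kernels.shape)   # (8, 5, 5)
```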

Q) Write short notes on the following.
i) LeNet
ii) AlexNet

i) LeNet
LeNet is a convolutional neural network (CNN) architecture that was proposed in 1989 by
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. It was originally designed to
recognize handwritten digits in images, and was later used to identify zip codes provided by
the US Postal Service.
Characteristics of LeNet:
• Architecture of LeNet-1 (see the sketch below):
28×28 input image >
Four 24×24 feature maps, convolutional layer (5×5 kernels) >
Average pooling layer (2×2) >
Eight 8×8 feature maps, convolutional layer (5×5 kernels) >
Average pooling layer (2×2) >
Directly fully connected to the output
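A sketch of this LeNet-1 topology, assuming PyTorch as the framework (the original 1989 network predates modern libraries); the spatial sizes follow from the 5×5 convolution and 2×2 pooling arithmetic:

```python
import torch
import torch.nn as nn

lenet1 = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=5),   # 28x28 input -> four 24x24 feature maps
    nn.AvgPool2d(2),                  # -> four 12x12 maps
    nn.Conv2d(4, 8, kernel_size=5),   # -> eight 8x8 feature maps
    nn.AvgPool2d(2),                  # -> eight 4x4 maps
    nn.Flatten(),
    nn.Linear(8 * 4 * 4, 10),         # directly fully connected to 10 outputs
)

print(lenet1(torch.zeros(1, 1, 28, 28)).shape)   # torch.Size([1, 10])
```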

• With convolutional and subsampling/pooling layers introduced, LeNet-1 achieved an error rate of 1.7% on test data.
• Note that when the authors invented LeNet, they used average pooling layers that output the average value of each 2×2 region of a feature map. Many LeNet implementations now use max pooling, where only the maximum value in each 2×2 region is output; this can help speed up training, since choosing the strongest feature yields a larger gradient during back-propagation.


• LeNet-4, LeNet-5, and Boosted LeNet-4 came later.
• LeNet-4: with more feature maps and one more fully connected layer, the error rate is 1.1% on test data.
• LeNet-5, the most popular LeNet people talk about, has only slight differences compared with LeNet-4: with more feature maps and one more fully connected layer, its error rate is 0.95% on test data.

ii) AlexNet
The convolutional neural network (CNN) architecture known as AlexNet was created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, who served as Krizhevsky's PhD advisor. It was the first architecture to use GPUs to boost training performance. AlexNet consists of 5 convolution layers, 3 max-pooling layers, 2 normalization layers, 2 fully connected layers, and 1 softmax layer. Each convolution layer consists of convolution filters and a nonlinear activation function, ReLU. The pooling layers perform max pooling, and the input size is fixed due to the presence of the fully connected layers. The input size is usually quoted as 224×224×3, but due to padding it effectively works out to 227×227×3. In total, AlexNet has over 60 million parameters.
Key Features:
• ReLU is used as the activation function rather than tanh.
• Batch size of 128.
• SGD with momentum is used as the learning algorithm.
• Data augmentation is carried out: flipping, jittering, cropping, color normalization, etc.
AlexNet was trained on GTX 580 GPUs with only 3 GB of memory each, which could not fit the entire network. So the network was split across 2 GPUs, with half of the neurons (feature maps) on each GPU.
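For reference, torchvision ships an AlexNet implementation, which makes it easy to inspect the layer stack and confirm the parameter count quoted above; a minimal sketch:

```python
import torch
from torchvision.models import alexnet

model = alexnet(weights=None)   # architecture only, no pretrained weights

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,}")          # ~61 million parameters

print(model(torch.zeros(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])
```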
