21CS743 DL Module4 Notes
Module 4: The Convolution Operation, Pooling, Convolution and Pooling as an Infinitely Strong Prior,
Variants of the Basic Convolution Function, Structured Outputs, Data Types, Efficient Convolution
Algorithms, Random or Unsupervised Features- LeNet, AlexNet.
Text Book: Ian Goodfellow, Yoshua Bengio, Aaron Courville, “Deep Learning”, MIT Press, 2016.
Convolutional networks (LeCun, 1989), also known as convolutional neural networks or CNNs, are a
specialized kind of neural network for processing data that has a known, grid-like topology. Examples include
time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image
data, which can be thought of as a 2D grid of pixels. Convolutional networks have been tremendously
successful in practical applications. The name “convolutional neural network” indicates that the network
employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation.
Convolutional networks are simply neural networks that use convolution in place of general matrix
multiplication in at least one of their layers.
• Suppose we are tracking the position of a spaceship with a noisy sensor. To obtain a less noisy estimate, we can average several measurements, giving more weight to recent ones. We can do this with a weighting function w(a), where a is the age of a measurement. If we apply such a weighted average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship:
s(t) = ∫ x(a) w(t − a) da
• This operation is called convolution. The convolution operation is typically denoted with an asterisk:
s(t) = (x * w)(t)
• In the above equation, x represents the input, * denotes the convolution operation, and w is the filter (kernel) that is applied.
• In the above example, w needs to be a valid probability density function, or the output is not a weighted average. Also, w needs to be 0 for all negative arguments, or it will look into the future, which is presumably beyond our capabilities. These limitations, however, are particular to the example discussed above.
• In general, convolution is defined for any functions for which the above integral is defined, and
may be used for other purposes besides taking weighted averages.
• In convolutional network terminology, the first argument (x) to the convolution is often referred to
as the input and the second argument (the function w) as the kernel. The output is sometimes
referred to as the feature map.
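The weighted-average view of convolution can be sketched with NumPy (a minimal illustration; the signal and kernel values below are made up for the example):

```python
import numpy as np

# Noisy 1D input signal x (e.g., sensor readings over time).
x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])

# Kernel w: a valid probability density (non-negative, sums to 1),
# so the output is a weighted average of nearby inputs.
w = np.array([0.25, 0.5, 0.25])

# 'valid' mode keeps only positions where the kernel fully overlaps x.
s = np.convolve(x, w, mode="valid")
print(s)  # the feature map: a smoothed version of x -> [2.25 3. 4. 4.75]
```

Note that because w here is symmetric, flipping the kernel (true convolution) and not flipping it (cross-correlation) give the same result.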
• Similarly, for 2D input, re-estimation of each pixel is done by taking a weighted average of all its
neighbours.
➢ Convolutional Neural Network Architecture:
• A CNN is typically built from three types of layers: convolutional layers, pooling layers, and fully connected layers.
• Convolution Layer:
• The convolution layer is the core building block of the CNN. It carries the main portion of the
network’s computational load. The CNN architecture is as shown in Figure 1.
• This layer performs a dot product between two matrices, where one matrix is the set of learnable
parameters otherwise known as a kernel, and the other matrix is the restricted portion of the
receptive field.
• The kernel is spatially smaller than the image but extends through the full depth of the input.
• During the forward pass, the kernel slides across the height and width of the image, producing a representation of each receptive region.
• This produces a two-dimensional representation of the image known as an activation map, which gives the response of the kernel at each spatial position of the image. The step size with which the kernel slides is called the stride.
• If we have an input of size W x W x D and Dout kernels, each with a spatial size of F, applied with stride S and amount of zero-padding P, then the spatial size of the output volume can be determined by the following formula:
Wout = (W − F + 2P)/S + 1
The resulting output volume has size Wout x Wout x Dout.
• Number of Filters (Dout): This represents the number of filters (kernels) used in the convolution layer, determining the depth of the output volume.
• Each filter produces one output channel, so the output will have Dout channels. Figure 2 depicts the process of how the activation map is obtained.
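The output-size formula can be checked with a small helper function (a sketch; the layer dimensions below are arbitrary example values):

```python
def conv_output_size(W, F, S, P):
    """Spatial size of a conv layer's output:
    input width W, kernel size F, stride S, zero-padding P."""
    assert (W - F + 2 * P) % S == 0, "kernel does not tile the input evenly"
    return (W - F + 2 * P) // S + 1

# Example: a 32x32xD input, kernels of size 5x5, stride 1, padding 2.
W_out = conv_output_size(W=32, F=5, S=1, P=2)
print(W_out)  # 32: this padding preserves the spatial size

# With Dout = 10 such kernels, the output volume is 32 x 32 x 10.
```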
• Before we go on to the next layer let us try and understand Cross-Correlation and its Role in
CNNs.
• In convolutional networks, the operation commonly referred to as "convolution" is, in fact, cross-
correlation.
• Cross-correlation computes the similarity between the input signal and the kernel as the kernel
slides over the input. Mathematically, this can be written as:
(f ⋆ g)[n] = Σ_m f[m] g[n + m]
• Here, f represents the input, g the kernel, and n the spatial or temporal shift. Cross-correlation
captures localized patterns in the data, which is essential for feature extraction in CNNs.
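The relationship between cross-correlation and true convolution (which flips the kernel) can be made explicit with a short NumPy sketch; the input and kernel values are invented for illustration:

```python
import numpy as np

def cross_correlate_1d(f, g):
    """Valid cross-correlation: out[n] = sum_m f[n + m] g[m],
    i.e. the kernel g slides over the input f without flipping
    (the same operation as the definition in the text, up to
    which factor carries the shift)."""
    n_out = len(f) - len(g) + 1
    return np.array([np.dot(f[n:n + len(g)], g) for n in range(n_out)])

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input
g = np.array([1.0, 0.0, -1.0])            # kernel

# Cross-correlation with g equals true convolution with g flipped.
print(cross_correlate_1d(f, g))               # [-2. -2. -2.]
print(np.convolve(f, g[::-1], mode="valid"))  # same result
```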
• Toeplitz Matrix in Convolution: The convolution operation can be expressed in matrix form
using a Toeplitz matrix, where each diagonal contains the same elements.
• For a 1D convolution, the input vector can be expanded into a Toeplitz matrix, enabling the
convolution to be represented as:
Output = Toeplitz(Input) × Kernel
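This matrix form can be sketched in NumPy by unrolling the input into its sliding windows, so the whole (valid) convolution becomes a single matrix-vector product (illustrative values only):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input
w = np.array([1.0, 0.0, -1.0])            # kernel

# Expand the input into a matrix whose rows are the sliding
# windows of x; multiplying by the kernel then performs the
# whole (valid) operation as one matrix-vector product.
n_out = len(x) - len(w) + 1
X = np.array([x[n:n + len(w)] for n in range(n_out)])

print(X @ w)                              # [-2. -2. -2.]
print(np.correlate(x, w, mode="valid"))   # same result
```

The same idea generalizes to 2D convolution, where the expanded matrix becomes doubly blocked.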
➢ Pooling Layer:
• A typical layer of a convolutional network consists of three stages. In the first stage, the layer
performs several convolutions in parallel to produce a set of linear activations. In the second stage,
each linear activation is run through a nonlinear activation function, such as the rectified linear
activation function. This stage is sometimes called the detector stage.
• In the third stage, we use a pooling function to modify the output of the layer further. A pooling
function replaces the output of the net at a certain location with a summary statistic of the nearby
outputs. For example, the max pooling operation reports the maximum output within a rectangular
neighbourhood.
• Other popular pooling functions include the average of a rectangular neighbourhood, the L2 norm
of a rectangular neighbourhood, or a weighted average based on the distance from the central pixel.
• In all cases, pooling helps make the representation approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change.
• Translation Invariance: Invariance to local translation can be a very useful property if we care more
about whether some feature is present than exactly where it is.
• For example, when determining whether an image contains a face, we need not know the location of the eyes with pixel-perfect accuracy; we just need to know that there is an eye on the left side of the face and an eye on the right side of the face.
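This invariance can be illustrated with a tiny max-pooling sketch in NumPy (made-up detector values; pooling width 3, stride 1): shifting the input by one position changes every raw value, yet many of the pooled values stay the same.

```python
import numpy as np

def max_pool_1d(x, size=3, stride=1):
    """Max pooling over sliding windows of width `size`."""
    n_out = (len(x) - size) // stride + 1
    return np.array([x[i * stride:i * stride + size].max()
                     for i in range(n_out)])

x = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0])
x_shifted = np.roll(x, 1)   # translate the input by one position

print(max_pool_1d(x))         # [1. 1. 1. 0. 2. 2. 2.]
print(max_pool_1d(x_shifted)) # [0. 1. 1. 1. 0. 2. 2.]
```

Every element of the raw input moved, but several pooled outputs are identical, reflecting the approximate translation invariance described above.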
• In other contexts, it is more important to preserve the location of a feature. For example, if we want
to find a corner defined by two edges meeting at a specific orientation, we need to preserve the
location of the edges well enough to test whether they meet.