21CS743 DL Module4 Notes

Deep Learning Module 4: Convolutional Networks

Module 4: The Convolution Operation, Pooling, Convolution and Pooling as an Infinitely Strong Prior,
Variants of the Basic Convolution Function, Structured Outputs, Data Types, Efficient Convolution
Algorithms, Random or Unsupervised Features- LeNet, AlexNet.

Text Book: Ian Goodfellow, Yoshua Bengio, Aaron Courville, “Deep Learning”, MIT Press, 2016.

(Chapters 9.1, 9.9)

➢ Convolutional Neural Networks:

Convolutional networks (LeCun, 1989), also known as convolutional neural networks or CNNs, are a
specialized kind of neural network for processing data that has a known, grid-like topology. Examples include
time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image
data, which can be thought of as a 2D grid of pixels. Convolutional networks have been tremendously
successful in practical applications. The name “convolutional neural network” indicates that the network
employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation.
Convolutional networks are simply neural networks that use convolution in place of general matrix
multiplication in at least one of their layers.

➢ The Convolutional Operation:


• Convolution Operation: Convolution is a mathematical operation on two functions that produces a third function expressing how the shape of one function is modified by the other.
• In a CNN, we apply the same idea to measurements: each value is re-estimated as a weighted average of nearby measurements, with the nearer ones assigned more weight than those taken earlier.
• Let us assume we are tracking the location of a spaceship with a laser sensor. Our laser sensor
provides a single output x(t), the position of the spaceship at time t. Both x and t are real-valued,
i.e., we can get a different reading from the laser sensor at any instant in time.
• Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the
spaceship’s position, we would like to average together several measurements.
• Of course, more recent measurements are more relevant, so we will want this to be a weighted
average that gives more weight to recent measurements.

Department of CSE, Vemana I.T Page 1 of 7



• We can do this with a weighting function w(a), where a is the age of a measurement. If we apply such a weighted-average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship:

s(t) = ∫ x(a) w(t − a) da

• This operation is called convolution. The convolution operation is typically denoted with an asterisk:

s(t) = (x ∗ w)(t)

• In the above equation, x represents the input, ∗ the convolution operation, and w the kernel (filter) that is applied.
• In this example, w needs to be a valid probability density function, or the output will not be a weighted average. Also, w needs to be 0 for all negative arguments, or it will look into the future, which is presumably beyond our capabilities. These limitations, though, are particular to the example discussed above.
• In general, convolution is defined for any functions for which the above integral is defined, and
may be used for other purposes besides taking weighted averages.
• In convolutional network terminology, the first argument (x) to the convolution is often referred to
as the input and the second argument (the function w) as the kernel. The output is sometimes
referred to as the feature map.
• Similarly, for 2D input, re-estimation of each pixel is done by taking a weighted average of all its
neighbours.
➢ Convolutional Neural Network Architecture:
• A CNN typically has three types of layers: a convolutional layer, a pooling layer, and a fully connected layer.
• Convolution Layer:
• The convolution layer is the core building block of the CNN. It carries the main portion of the
network’s computational load. The CNN architecture is as shown in Figure 1.
• This layer performs a dot product between two matrices: one matrix is the set of learnable parameters, known as the kernel, and the other is the restricted portion of the receptive field.
• The kernel is spatially smaller than the image but extends through the full depth of the input (all of its channels).


Fig 1. The CNN Architecture

• During the forward pass, the kernel slides across the height and width of the image, producing a representation of each receptive region.
• This yields a two-dimensional representation of the image known as an activation map, which gives the response of the kernel at each spatial position. The step size with which the kernel slides is called the stride.
• If we have an input of size W × W × D, and Dout kernels with a spatial size of F, stride S, and amount of padding P, then the size of the output volume can be determined by the following formula:

Wout = (W − F + 2P) / S + 1

• This will yield an output volume of size Wout × Wout × Dout.


• Input Volume Dimensions (W × W × D): The input volume has a width and height of W and a depth of D.
• In CNNs, the depth D typically represents the number of channels (for example, D = 3 for RGB images with three color channels).
• F: The spatial size of the filter (kernel). If F = 3, the kernel is 3 × 3.
• S: The stride is the number of pixels by which the filter slides over the input each time. If S = 1, the filter moves one pixel at a time; if S = 2, it moves two pixels, and so on.
• P: Padding refers to the number of pixels added around the input's border. Padding can help
preserve spatial dimensions after convolution. For example, if padding P=1, one pixel layer is
added to all four sides of the input.


• Number of Filters (Dout): This represents the number of filters (kernels) used in the convolution layer and determines the depth of the output volume.
• Each filter produces one output channel, so the output will have Dout channels. Figure 2 depicts how the activation map is obtained.

Fig 2. Representation of how the activation map/ feature map is obtained.
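The output-size formula Wout = (W − F + 2P) / S + 1 can be checked with a tiny helper. This is a sketch; the 227 × 227 input with 11 × 11 kernels at stride 4 is the well-known first layer of AlexNet, and the other numbers are illustrative.

```python
def conv_output_size(W, F, S, P):
    """Spatial output size of a convolution layer: (W - F + 2P) / S + 1."""
    assert (W - F + 2 * P) % S == 0, "parameters do not tile the input evenly"
    return (W - F + 2 * P) // S + 1

# 32x32 input, 5x5 kernels, stride 1, padding 2: spatial size is preserved.
print(conv_output_size(32, 5, 1, 2))    # 32
# 227x227 input, 11x11 kernels, stride 4, no padding (AlexNet's first layer):
print(conv_output_size(227, 11, 4, 0))  # 55
```

Note the divisibility check: if (W − F + 2P) is not a multiple of S, the kernel does not tile the input evenly and the configuration is invalid.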

• Before we go on to the next layer let us try and understand Cross-Correlation and its Role in
CNNs.
• In convolutional networks, the operation commonly referred to as "convolution" is, in fact, cross-
correlation.
• Cross-correlation computes the similarity between the input signal and the kernel as the kernel
slides over the input. Mathematically, this can be written as:

(f ⋆ g)[n] = Σ_m f[m] g[n + m]

• Here, f represents the input, g the kernel, and n the spatial or temporal shift. Cross-correlation
captures localized patterns in the data, which is essential for feature extraction in CNNs.
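A minimal NumPy sketch of CNN-style cross-correlation follows. The function name and example values are illustrative; the code uses the convention in which the kernel slides over the input and only fully overlapping ("valid") positions are kept, which matches the formula above up to how the shift is indexed.

```python
import numpy as np

def cross_correlate_1d(f, g):
    """CNN-style cross-correlation: the kernel g slides over the input f
    WITHOUT being flipped; only fully overlapping positions are returned."""
    n_out = len(f) - len(g) + 1
    return np.array([np.dot(f[n:n + len(g)], g) for n in range(n_out)])

f = np.array([1.0, 2.0, 3.0, 4.0])   # input
g = np.array([1.0, 0.0, -1.0])       # kernel (a simple edge detector)
print(cross_correlate_1d(f, g))          # [-2. -2.]
print(np.convolve(f, g[::-1], "valid"))  # convolution with the flipped kernel agrees
```

The last two lines make the relationship concrete: cross-correlation equals convolution with a flipped kernel, which is why deep learning libraries can use either name interchangeably.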
• Toeplitz Matrix in Convolution: The convolution operation can be expressed in matrix form
using a Toeplitz matrix, where each diagonal contains the same elements.
• For a 1D convolution, the input vector can be expanded into a Toeplitz matrix, enabling the
convolution to be represented as:

Output = Toeplitz(Input) × Kernel

• This matrix-based representation helps in understanding the linear transformations performed by convolutional layers.
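The Toeplitz view can be verified numerically: expanding the input into a Toeplitz matrix and multiplying by the kernel vector reproduces an ordinary "valid" convolution. This is a sketch; `input_toeplitz` is an illustrative helper, not a standard API.

```python
import numpy as np

def input_toeplitz(x, k):
    """Expand input x into a Toeplitz matrix T such that T @ w equals
    np.convolve(x, w, 'valid') for any kernel w of length k.
    Each row is a reversed length-k window of x, so every diagonal of T
    holds the same element (the Toeplitz property)."""
    n_out = len(x) - k + 1
    return np.array([x[i:i + k][::-1] for i in range(n_out)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, -1.0])
T = input_toeplitz(x, len(w))
print(T @ w)                           # matrix form of the convolution
print(np.convolve(x, w, mode="valid")) # direct convolution: same result
```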


• Block-Circulant Matrices for Efficiency: In advanced implementations, the Toeplitz matrix is often extended into a block-circulant matrix to optimize memory usage and computation. Block-circulant structures allow the convolution operation to be decomposed into efficient computations using the FFT, reducing the computational cost significantly.
• Circulant Matrices and FFT:
• A circulant matrix is a specific type of Toeplitz matrix where each row is a circular shift of the
previous row. Circulant matrices are fundamental to the FFT-based acceleration of convolutions,
making them integral to efficient deep learning frameworks.
• By leveraging these mathematical structures, modern CNNs achieve high efficiency in both
training and inference.
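A small numerical illustration of this equivalence (the sizes and kernel values are made up): a circulant matrix built from a zero-padded kernel performs circular convolution, and the FFT computes the identical result by pointwise multiplication in the frequency domain, in O(n log n) instead of O(n²).

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
w = np.zeros(n)
w[:3] = [0.5, 0.3, 0.2]   # small kernel, zero-padded to length n

# Circulant matrix of w: C[i, j] = w[(i - j) mod n]; each row is a
# circular shift of the row above it.
idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
C = w[idx]

# Circular convolution two ways: dense matrix multiply (O(n^2)) versus
# pointwise multiplication in the Fourier domain (O(n log n)).
direct  = C @ x
via_fft = np.real(np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)))
print(np.allclose(direct, via_fft))  # True
```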
➢ Motivation behind Convolution:
• Convolution leverages three important ideas that help improve a machine learning system: sparse interactions, parameter sharing, and equivariant representations. Let us describe each of them in detail.
• Traditional neural network layers use matrix multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit. This means every output unit interacts with every input unit. Convolutional networks, however, have sparse interactions.
• This is achieved by making the kernel smaller than the input: an image may have thousands or millions of pixels, but by processing it with a small kernel we can detect meaningful features that occupy only tens or hundreds of pixels.
• This means we need to store fewer parameters, which not only reduces the memory requirements of the model but also improves its statistical efficiency.
• If computing one feature at a spatial point (x1, y1) is useful, then it should also be useful at some other spatial point, say (x2, y2). This means that for a single two-dimensional slice, i.e., for creating one activation map, neurons are constrained to use the same set of weights.
• In a traditional neural network, each element of the weight matrix is used once and then never
revisited, while convolution network has shared parameters i.e., for getting output, weights
applied to one input are the same as the weight applied elsewhere.
• Due to parameter sharing, the layers of a convolutional neural network have the property of equivariance to translation: if we shift the input in a certain way, the output shifts in the same way.
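The parameter savings from sparse interactions and sharing can be made concrete with back-of-the-envelope arithmetic. The 100 × 100 input and 3 × 3 kernel below are illustrative sizes, not from the notes.

```python
# A dense layer mapping a 100x100 input to a 100x100 output versus a
# single shared 3x3 convolution kernel applied at every position.
H = W = 100
fc_params = (H * W) * (H * W)   # every output unit connects to every input unit
conv_params = 3 * 3             # one kernel, shared across all spatial positions
print(fc_params)    # 100000000
print(conv_params)  # 9
```

The dense mapping needs a hundred million weights; the shared kernel needs nine, which is exactly why convolutional layers are so memory- and statistically efficient.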


➢ Pooling Layer:
• A typical layer of a convolutional network consists of three stages. In the first stage, the layer
performs several convolutions in parallel to produce a set of linear activations. In the second stage,
each linear activation is run through a nonlinear activation function, such as the rectified linear
activation function. This stage is sometimes called the detector stage.
• In the third stage, we use a pooling function to modify the output of the layer further. A pooling
function replaces the output of the net at a certain location with a summary statistic of the nearby
outputs. For example, the max pooling operation reports the maximum output within a rectangular
neighbourhood.
• Other popular pooling functions include the average of a rectangular neighbourhood, the L2 norm
of a rectangular neighbourhood, or a weighted average based on the distance from the central pixel.
• In all cases, pooling helps to make the representation become approximately invariant to small
translations of the input. Invariance to translation means that if we translate the input by a small
amount, the values of most of the pooled outputs do not change.
• Translation Invariance: Invariance to local translation can be a very useful property if we care more
about whether some feature is present than exactly where it is.
• For example, when determining whether an image contains a face, we need not know the location of the eyes with pixel-perfect accuracy; we just need to know that there is an eye on the left side of the face and an eye on the right side of the face.
• In other contexts, it is more important to preserve the location of a feature. For example, if we want
to find a corner defined by two edges meeting at a specific orientation, we need to preserve the
location of the edges well enough to test whether they meet.
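Max pooling over 2 × 2 neighbourhoods with stride 2 can be sketched with a NumPy reshape trick; the feature-map values below are made up.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) feature map (H, W even).
    Reshaping splits the map into 2x2 blocks; max is taken within each block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 3, 2],
                 [2, 6, 0, 1]])
print(max_pool_2x2(fmap))
# [[4 5]
#  [6 3]]
```

Each pooled value summarizes a 2 × 2 neighbourhood by its maximum, so small shifts of a feature within a block leave the output unchanged — the approximate translation invariance described above.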

➢ Fully Connected Layer:


• Neurons in this layer have full connectivity with all neurons in the preceding and succeeding layers, as in a regular fully connected neural network (FCNN).
• Its output can therefore be computed as usual: a matrix multiplication followed by a bias offset.
• The FC layer helps map the representation between the input and the output.
• Nonlinearity is introduced in this fully connected layer through activation functions.
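A fully connected layer is then just a matrix multiplication, a bias, and an activation. This is a sketch with illustrative shapes (6 flattened features in, 4 units out); ReLU is assumed as the nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(6)        # flattened features from the last conv/pool layer
W = rng.standard_normal((4, 6))   # weight matrix: full connectivity (4 units, 6 inputs)
b = np.zeros(4)                   # bias

relu = lambda z: np.maximum(z, 0.0)   # the activation supplies the nonlinearity
y = relu(W @ x + b)                   # matrix multiply + bias + activation
print(y.shape)  # (4,)
```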


➢ Convolution and Pooling as an Infinitely Strong Prior:


• A prior is something we already assume to be true about a problem.
• Convolution and pooling assume that:
• Patterns matter more than exact positions (e.g., a "tail" is a tail no matter where it is in the photo).
• Small details add up to the big picture (e.g., spotting "fur" and "paws" helps identify a tiger).
• These assumptions (or priors) are so strong that they work incredibly well for many tasks, especially
image recognition.
• They save time, computation, and generalize well to real-world images without needing extra data.
• An infinitely strong prior is an assumption about how something works that is so powerful and rigid that it dominates how we process or interpret data, no matter what the actual data says.
• In a CNN, the assumptions made by convolution and pooling are that patterns matter more than their exact location.
• A cat’s ears will look the same whether they are in the top-left corner or the bottom-right corner of
the image. This is called translation invariance.
• Local relationships are key: Convolution assumes the meaningful information in an image (e.g., an
eye, a whisker) can be found by looking at small patches of the image at a time.
• These assumptions are so strong that the model focuses entirely on patterns and ignores other
possibilities.
• For example: If an image’s pattern looks slightly like an ear, the convolutional model might still say,
“This must be part of a cat!” even if it’s just a random shape.
• In other words, convolution and pooling force the model to think in terms of patterns and local
information no matter what the data might suggest otherwise.
