
Module 3

The document provides an overview of Convolutional Neural Networks (CNNs), detailing their structure, including input, convolutional, pooling, fully connected, and output layers, and their roles in image processing. It explains key concepts such as convolution operations, parameter sharing, and pooling as a means of reducing data size while preserving important information. Additionally, it discusses various CNN architectures, applications, and efficient convolution algorithms, highlighting the significance of CNNs in tasks like image classification and object detection.

Uploaded by

akshaylalsp6

Module – 3

Convolutional Neural Network (CNN)

Convolutional Neural Networks – convolution operation, motivation, pooling,
convolution and pooling as an infinitely strong prior, variants of convolution
functions, structured outputs, data types, efficient convolution algorithms.

Reena Thomas, Asst. Prof., CSE dept., CEMP


1
Convolutional Neural Network (CNN)

• It is a special type of deep learning model used to recognize patterns in images.


• It works by extracting important features like edges, shapes, and textures and
then making predictions.
• A CNN has different layers, each with a specific role in processing an image.

2
1. Input Layer

• The input layer is where the CNN receives the image for processing.
• What it Takes: An image in the form of a numerical matrix (grid of pixel values).
• Structure:
• A grayscale image has 1 channel (e.g., 28×28×1 for a black-and-white image).
• A color image has 3 channels (RGB: Red, Green, Blue), e.g., 32×32×3.
• Preprocessing:
• Rescaling: Pixel values are often normalized (e.g., between 0 and 1) for better
training.
• Resizing: Images may be resized to a fixed shape (e.g., 224×224×3 for deep
CNNs).
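
The preprocessing steps above can be sketched in NumPy. This is a minimal illustration, assuming a randomly generated 28×28 grayscale image with 8-bit pixel values; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical 28x28 grayscale image with 8-bit pixel values (0-255).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Rescaling: map pixel values into the range [0, 1].
scaled = image.astype(np.float32) / 255.0

# Add an explicit channel axis so the shape is height x width x channels.
with_channel = scaled[..., np.newaxis]

print(with_channel.shape)  # (28, 28, 1)
```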

3
2. Convolutional Layer (Feature Extraction)

• The most important layer in a CNN.


• It detects patterns in the image by applying small filters (kernels).
• A filter slides over the image and performs a mathematical operation (dot
product) to create a feature map (new representation of the image).
Feature Map = Input ∗ Kernel + Bias
• Example: A 3×3 filter can detect small edges.
• Followed by a ReLU activation function (which sets negative values to zero) to
keep only useful features.
• Think of this like looking at an image through different lenses to find
important details.
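
Written out per output position, the feature map formula above can be sketched as follows. The symbols are assumptions for illustration: I is the input, K is a k_h × k_w kernel, b is the bias, S is the feature map and A is the map after ReLU. Like most deep learning libraries, this sketch slides the kernel without flipping it (cross-correlation).

```latex
S(i, j) = \sum_{m=0}^{k_h - 1} \sum_{n=0}^{k_w - 1} I(i + m,\, j + n)\, K(m, n) + b,
\qquad
A(i, j) = \max\bigl(0,\; S(i, j)\bigr)
```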

4
3. Pooling Layer (Downsampling)

• Reduces the size of the feature maps, making the model faster and more
efficient.
• Downsampling is the process of reducing the size of data while preserving
important information.
• Helps reduce overfitting by discarding fine detail and keeping only the most
important information.
• Types of pooling:
• Max Pooling
• Average Pooling
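
The two pooling types can be compared on a small feature map. This is a minimal sketch, assuming non-overlapping 2×2 windows; the function name `pool2d` and the matrix values are made up for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window (stride = size)."""
    h, w = feature_map.shape
    # Crop so that both dimensions divide evenly into windows.
    cropped = feature_map[: h - h % size, : w - w % size]
    blocks = cropped.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 8, 3, 4]], dtype=float)

print(pool2d(fm, mode="max"))   # [[6. 4.] [8. 9.]]
print(pool2d(fm, mode="mean"))  # [[3.75 2.25] [4.5 4.]]
```

Both variants halve each spatial dimension here; max pooling keeps the strongest response in each window, while average pooling keeps the mean response.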

5
4. Fully Connected (Dense) Layer

• Converts the extracted features into a single long vector (Flattening).


• Connects every neuron to all neurons in the previous layer (like a regular neural network).
• Helps the model understand high-level patterns.
• Uses activation functions (like ReLU) to introduce non-linearity.
• Think of this like a decision-making stage where all detected features combine
to classify an image.

6
5. Output Layer (Prediction)

• Produces the final result based on the processed information.


• Uses an activation function based on the type of task:
• Softmax for multi-class classification (e.g., dog, cat, bird).
• Sigmoid for binary classification (e.g., cancerous vs. non-cancerous).
• Linear for continuous value predictions.
• Think of this like making the final decision based on all observations.
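
The output activations named above can be sketched in NumPy. The class scores are hypothetical values chosen for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class_scores = np.array([2.0, 1.0, 0.1])  # hypothetical scores for dog, cat, bird
probs = softmax(class_scores)
print(probs, probs.sum())   # three probabilities that sum to 1
print(sigmoid(0.0))         # 0.5
```

Softmax turns a vector of scores into probabilities over several classes, while sigmoid maps a single score to a probability for the binary case.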

7
Convolution Operation
The convolution operation in a CNN is a fundamental process that extracts
spatial features from input data (such as images).

Step 1: Define the Input Image


Step 2: Define the Filter (Kernel)
Step 3: Perform Element-wise Multiplication
The filter moves over the input image, and at each step it:
i. Takes a region (of the same size as the filter).
ii. Multiplies corresponding elements.
iii. Sums up the results.
iv. Places the summed value in the output matrix.
Step 4: Slide the Filter Over the Image
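
The four steps above can be sketched as two nested loops. This is a minimal version, assuming no padding and no kernel flip (the usual CNN convention); the function name `convolve2d_valid` and the example matrices are illustrative.

```python
import numpy as np

def convolve2d_valid(image, kernel, stride=1):
    """Slide the kernel over the image (no padding, no kernel flip)."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            region = image[top:top + kh, left:left + kw]  # same size as the filter
            output[i, j] = np.sum(region * kernel)        # multiply, then sum
    return output

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(convolve2d_valid(image, kernel))  # [[4. 3. 4.] [2. 4. 3.] [2. 3. 4.]]
```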
8
PROBLEM 3 : Let's take a smaller 4×4 input matrix and a 3×3 filter to perform
convolution.
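
A problem of this shape can be checked numerically. The matrix values below are hypothetical, since the slide's own numbers are not reproduced here; the sketch only shows that a 4×4 input and a 3×3 filter (stride 1, no padding) give a 2×2 feature map.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical 4x4 input and 3x3 vertical-edge filter.
x = np.array([[1, 2, 3, 0],
              [0, 1, 2, 3],
              [3, 0, 1, 2],
              [2, 3, 0, 1]])
k = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]])

windows = sliding_window_view(x, k.shape)            # shape (2, 2, 3, 3)
feature_map = np.einsum("ijab,ab->ij", windows, k)   # multiply and sum per window

print(feature_map.shape)  # (2, 2)
print(feature_map)        # [[-2 -2] [ 2 -2]]
```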

15
Motivation
• Convolution leverages three key ideas: sparse interactions, parameter sharing, and
equivariant representations.
• Sparse interactions occur because each output unit interacts with only a small
subset of input units, unlike traditional neural networks where all input units
influence every output unit.
• Parameter sharing allows the same kernel to be applied across different parts of the
input, reducing the number of parameters and improving efficiency.
• Equivariant representations mean that when the input shifts, the output shifts in
the same way, so a pattern detected in one region of the input is also detected when
it appears in another. This makes convolution useful for tasks like image processing.
• Convolutional networks can handle variable-sized inputs, unlike traditional networks
that require fixed input dimensions.

21
• Traditional neural network layers rely on matrix multiplication, where each
output unit is connected to all input units, making connectivity dense.
• Convolutional layers achieve sparse connectivity (Sparse interactions or sparse
weights) by using small kernels, ensuring only a limited number of input units
affect each output unit.

When s is formed by convolution with a kernel of width 3, only three output
units are affected by the input unit x3.

In contrast, when s is formed by matrix multiplication, the connectivity is no
longer sparse: every output unit is influenced by every input unit,
including x3.
22
• We highlight one output unit, s3, and
also highlight the input units in x that
affect this unit.
• These units are known as the
receptive field of s3.
• When s is formed by convolution
with a kernel of width 3, only three
inputs affect s3.
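
The connectivity pattern described above can be written as a boolean mask. This is a minimal sketch, assuming five inputs x1..x5, five outputs s1..s5 and zero padding at the borders; the sizes are illustrative.

```python
import numpy as np

n, width = 5, 3   # five inputs, kernel of width 3 (illustrative sizes)
half = width // 2

# conv_mask[i, j] is True when input x_{j+1} feeds output s_{i+1}.
idx = np.arange(n)
conv_mask = np.abs(idx[:, None] - idx[None, :]) <= half
dense_mask = np.ones((n, n), dtype=bool)

# Receptive field of s3 (row index 2): the inputs that affect it.
print(np.flatnonzero(conv_mask[2]) + 1)      # [2 3 4]
# Outputs affected by x3 (column index 2).
print(np.flatnonzero(conv_mask[:, 2]) + 1)   # [2 3 4]
# Number of connections: sparse versus dense.
print(conv_mask.sum(), dense_mask.sum())     # 13 25
```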

23
• The receptive field of the units in the deeper layers of a convolutional network is
larger than the receptive field of the units in the shallow layers.
• This effect increases if the network includes architectural features like strided
convolution or pooling.
• This means that even though direct connections in a convolutional net are very
sparse, units in the deeper layers can be indirectly connected to all or most of the
input image.
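
The growth of the receptive field with depth can be computed layer by layer. This is a rough sketch, assuming 1D layers described only by kernel width and stride; the function name `receptive_field` is hypothetical.

```python
def receptive_field(layers):
    """layers: list of (kernel_width, stride) pairs, ordered from input to output."""
    size, jump = 1, 1   # jump = distance in input pixels between neighboring units
    sizes = []
    for kernel_width, stride in layers:
        size += (kernel_width - 1) * jump
        jump *= stride
        sizes.append(size)
    return sizes

print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # [3, 7, 15]
```

With stride 1 the receptive field grows by two inputs per layer; with stride 2 it roughly doubles at each layer, which matches the point about strided convolution and pooling.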
24
Parameter Sharing
• Parameter sharing means using the same parameter for multiple functions in a
model, also known as "tied weights" because a weight applied at one input is tied
to its value elsewhere.
• Traditional Neural Networks: Each weight in the matrix is used exactly once per
computation—multiplied by a single input element and never reused.
• Convolutional Neural Networks (CNNs): Each kernel parameter is applied across all
positions of the input.
• Advantage: Instead of learning separate parameters for every location, CNNs learn
a single set of parameters, reducing the model’s complexity while improving
efficiency.
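
The saving from parameter sharing can be made concrete with a quick count. The sizes below (a 32×32 single-channel input and one 3×3 kernel, biases ignored) are assumptions chosen for illustration.

```python
in_h, in_w = 32, 32
k = 3
out_h, out_w = in_h - k + 1, in_w - k + 1   # 30 x 30 output, no padding

# Convolution: one shared 3x3 kernel, reused at every position.
conv_params = k * k
# Fully connected: a separate weight for every (input, output) pair.
dense_params = (in_h * in_w) * (out_h * out_w)

print(conv_params, dense_params)  # 9 921600
```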

25
• The black arrows indicate uses of the central element of a 3-element kernel in a
convolutional model. Due to parameter sharing, this single parameter is used at
all input locations.

• The single black arrow indicates the use of the central element of the weight
matrix in a fully connected model. This model has no parameter sharing so the
parameter is used only once.

26
27
• Complex-layer view: A CNN consists of a few complex layers, each made up of
multiple stages. There is a direct one-to-one mapping between kernels (filters) and
network layers, meaning each layer applies specific filters to detect patterns in the
input.
• Simple-layer view: A convolutional network consists of many simple layers, where
each processing step is considered a separate layer. Some layers perform operations
like activation or pooling without having their own parameters.
28
Pooling

29
A view of the middle of the output of a
convolutional layer. The bottom row
shows outputs of the nonlinearity. The
top row shows the outputs of max
pooling, with a stride of one pixel
between pooling regions and a pooling
region width of three pixels.

A view of the same network, after the
input has been shifted to the right by
one pixel. Every value in the bottom
row has changed, but only half of the
values in the top row have changed,
because the max pooling units are
sensitive only to the maximum value
in the neighborhood, not its exact
location.
Figure: Max pooling introduces invariance.
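
The effect in the figure can be reproduced with a few numbers. This is a minimal sketch, assuming a pooling width of three and a stride of one; the detector values are made up and are not read from the figure.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def max_pool_1d(x, width=3):
    """Max pooling with stride 1 over windows of the given width."""
    return sliding_window_view(x, width).max(axis=-1)

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])   # hypothetical detector outputs
shifted = np.concatenate(([0.3], detector[:-1]))       # same values shifted right by one

pooled = max_pool_1d(detector)
pooled_shifted = max_pool_1d(shifted)

print((detector != shifted).sum(), "of", detector.size)      # 6 of 6
print((pooled != pooled_shifted).sum(), "of", pooled.size)   # 2 of 4
```

Every detector value changes after the shift, but only half of the pooled values do.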
30
31
• The use of pooling can be viewed as adding an infinitely strong prior that the
function the layer learns must be invariant to small translations.
• It may greatly improve the statistical efficiency of the network.
• Pooling over spatial regions produces invariance to translation.
• But if we pool over the outputs of separately parametrized convolutions, the
features can learn which transformations to become invariant to.

Figure: Example of learned invariances.
32


• A pooling unit that pools over multiple features that are learned with separate
parameters can learn to be invariant to transformations of the input.
• Here we show how a set of three learned filters and a max pooling unit can learn to
become invariant to rotation. All three filters are intended to detect a handwritten 5.
• Each filter attempts to match a slightly different orientation of the 5. When a 5
appears in the input, the corresponding filter will match it and cause a large activation
in a detector unit.
• The max pooling unit then has a large activation regardless of which detector unit was
activated.
• We show here how the network processes two different inputs, resulting in two
different detector units being activated.
• Max pooling over spatial positions is naturally invariant to translation; this
multichannel approach is only necessary for learning other transformations.

33
• Pooling summarizes the responses over a whole neighborhood.
• It is possible to use fewer pooling units than detector units, by reporting
summary statistics for pooling regions spaced k pixels apart rather than 1 pixel
apart.

Figure : Pooling with downsampling.

• Here we use max pooling with a pool width of three and a stride between
pools of two.
• This reduces the representation size by a factor of two, which reduces the
computational and statistical burden on the next layer.
• Note that the rightmost pooling region has a smaller size but must be
included if we do not want to ignore some of the detector units.
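
Pooling with downsampling can be sketched as follows, assuming a pool width of three and a stride of two as described above. The detector values and the function name `max_pool_downsample` are illustrative.

```python
import numpy as np

def max_pool_downsample(x, width=3, stride=2):
    """Max pooling with a stride; the last window may be smaller than width."""
    pooled = []
    start = 0
    while True:
        pooled.append(x[start:start + width].max())
        if start + width >= len(x):   # this window already reached the right edge
            break
        start += stride
    return np.array(pooled)

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])   # hypothetical detector outputs
print(max_pool_downsample(detector))  # [1.  0.2 0.1]
```

Six detector units are summarized by three pooling units, and the rightmost pooling region covers only two units.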
34
• When the number of parameters in the next layer is a function of its
input size (for example, when the next layer is fully connected), this
reduction in the input size can also result in improved statistical
efficiency and reduced memory requirements for storing the
parameters.
• For many tasks, pooling is essential for handling inputs of varying size.
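
One way pooling can handle inputs of varying size is to make the pooling regions scale with the input so that the number of outputs stays fixed. The sketch below assumes one summary value per image quadrant; the function name `quadrant_max_pool` is hypothetical.

```python
import numpy as np

def quadrant_max_pool(feature_map):
    """Return one max per image quadrant, so the output is always 2 x 2."""
    h, w = feature_map.shape
    mid_h, mid_w = h // 2, w // 2
    return np.array([
        [feature_map[:mid_h, :mid_w].max(), feature_map[:mid_h, mid_w:].max()],
        [feature_map[mid_h:, :mid_w].max(), feature_map[mid_h:, mid_w:].max()],
    ])

rng = np.random.default_rng(0)
for shape in [(6, 8), (11, 5)]:
    print(shape, "->", quadrant_max_pool(rng.random(shape)).shape)   # always (2, 2)
```

Because the output size no longer depends on the input size, a fully connected classifier with a fixed number of weights can follow.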

35
Convolution and Pooling as an infinitely strong prior

36
Prior Probability Distribution

• Imagine you're guessing a fruit's weight. If you've seen similar fruits before, you can
estimate the weight from past experience before measuring it.
• This initial belief before seeing any data is called a prior probability distribution.
• In machine learning and statistics, we use a prior distribution to express what we
believe about the parameters of a model before looking at any data.
• Weak Prior (High Uncertainty)
• Wide and spread-out belief – We let the data decide most of the outcome.
• A weak prior allows the data to shape the model more freely.
• E.g.: A Gaussian (Normal) distribution with high variance.
• Strong Prior (Low Uncertainty)
• Narrow and confident belief – It has a big influence on the final outcome.
• A strong prior influences the model more, making it less sensitive to new data.
• E.g.: A Gaussian distribution with low variance.
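
The difference between a weak and a strong prior can be seen in a small calculation. This is a sketch under stated assumptions: the quantity being estimated is a mean, the prior on it is Gaussian, the measurement noise is Gaussian with known variance, and the fruit weights are made-up numbers.

```python
import numpy as np

def posterior_mean(data, prior_mean, prior_var, noise_var=1.0):
    """Posterior mean of a Gaussian mean with a Gaussian prior and known noise variance."""
    n = len(data)
    precision = 1.0 / prior_var + n / noise_var
    return (prior_mean / prior_var + np.sum(data) / noise_var) / precision

weights = np.array([150.0, 160.0, 155.0, 165.0])   # hypothetical fruit weights in grams

weak = posterior_mean(weights, prior_mean=100.0, prior_var=1e4)     # high-variance prior
strong = posterior_mean(weights, prior_mean=100.0, prior_var=1e-2)  # low-variance prior
print(weak, strong)
```

With the weak prior the estimate lands close to the data average of 157.5; with the strong prior it stays close to the prior belief of 100 even though the data disagree.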
37
An infinitely strong prior means certain parameter values are completely forbidden
(Zero Probability), no matter how much data supports them.

• In a neural network, this can mean forcing the weights of one hidden unit to be
identical to the weights of its neighbor, but shifted in space.
• It can also mean setting most weights to zero, except in a small region assigned to
each unit.
• Convolutional layers apply this idea by restricting how weights are shared and used,
effectively enforcing an infinitely strong prior on the model's parameters.
• This prior says that the layer's function should capture only local interactions and be
translation equivariant.
• Pooling acts as an infinitely strong prior, enforcing invariance to small translations.

38
• A convolutional net can be seen as a fully connected net with an infinitely strong
prior, helping us understand its behavior.
• Convolution and pooling may cause underfitting if the prior assumptions don’t
match the data.
• They are useful only when their assumptions hold true.

• Some architectures use pooling on some channels for invariance and skip it on
others, to avoid underfitting when the translation-invariance assumption isn't accurate.

• Convolutional models should only be compared to other convolutional models in
statistical learning benchmarks.
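
The view of a convolutional layer as a constrained fully connected layer can be checked directly. This is a minimal 1D sketch, assuming no padding and no kernel flip; the function name `conv_as_dense_matrix` and the numbers are illustrative.

```python
import numpy as np

def conv_as_dense_matrix(kernel, n_inputs):
    """Build the fully connected weight matrix that equals a 'valid' 1D convolution."""
    k = len(kernel)
    n_outputs = n_inputs - k + 1
    weights = np.zeros((n_outputs, n_inputs))
    for i in range(n_outputs):
        # Same kernel in every row, shifted by one position; all other weights stay zero.
        weights[i, i:i + k] = kernel
    return weights

kernel = np.array([1.0, 0.0, -1.0])
x = np.array([2.0, 4.0, 1.0, 3.0, 5.0, 0.0])
W = conv_as_dense_matrix(kernel, len(x))

print(W @ x)                                   # [ 1.  1. -4.  3.]
print(np.correlate(x, kernel, mode="valid"))   # same values
```

The matrix `W` has 24 entries, but the prior fixes most of them to zero and ties the rest to just three free parameters.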

39
40
CNN based architectures

• LeNet – LeNet-5 (named after Yann LeCun)
• AlexNet – Named after Alex Krizhevsky
• ZFNet – Zeiler and Fergus Network
• VGG – Visual Geometry Group (developed by Oxford's VGG team)
• GoogLeNet – Also known as Inception v1 (developed by Google)
• ResNet – Residual Network

41
Applications of CNN

• Image Classification – Identifying objects in images (e.g., ImageNet).


• Object Detection – Locating objects within images (e.g., YOLO).
• Face Recognition – Used in security systems and social media tagging.
• Medical Image Analysis – Detecting diseases in X-rays, MRIs, CT scans.
• Self-Driving Cars – Lane detection, pedestrian recognition.
• Satellite Image Processing – Land-use classification, weather prediction.
• Optical Character Recognition (OCR) – Reading handwritten and printed text.
• Gesture Recognition – Used in human-computer interaction and gaming.
• Deepfake Generation – Face swapping and synthetic media creation.
• Defect Detection in Manufacturing – Identifying flaws in products.
• Video Analysis – Action recognition, surveillance systems.
• Speech to Image Conversion – Generating images from spoken descriptions.

42
Variants of convolution functions
( https://youtu.be/CChuD_wD2UI?si=PzfL95olvlpZqyIb )

43
Structured outputs

• Convolutional networks can generate high-dimensional structured outputs,
typically represented as a tensor.
• The output tensor S is emitted by a convolutional layer, where S_{i,j,k}
represents the probability that pixel (j, k) belongs to class i.
• This enables pixel-wise labeling, allowing the model to create detailed
object masks that follow the exact outlines of objects.
• When used for single-object classification, the largest spatial dimension
reduction comes from pooling layers with large strides.
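
The pixel-wise output tensor can be sketched in NumPy. The scores here are random placeholders standing in for the output of a convolutional layer, and the sizes (3 classes, a 4×5 image) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, height, width = 3, 4, 5
scores = rng.normal(size=(num_classes, height, width))   # hypothetical per-pixel class scores

# Softmax over the class axis so that S[:, j, k] sums to 1 for every pixel (j, k).
exp = np.exp(scores - scores.max(axis=0, keepdims=True))
S = exp / exp.sum(axis=0, keepdims=True)

label_mask = S.argmax(axis=0)   # most probable class index for each pixel
print(S.shape, label_mask.shape)  # (3, 4, 5) (4, 5)
```

`S[i, j, k]` plays the role of S_{i,j,k}, and `label_mask` is the pixel-wise labeling from which object masks can be drawn.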

54
A Recurrent Convolutional Network (RCN)
iteratively refines pixel labels. U extracts
features, V predicts labels, and W updates
predictions in later steps. The same
parameters are reused, making it recurrent.

55
Data types

• The data used with a convolutional network usually consists of several channels,
each channel being the observation of a different quantity at some point in
space or time.
• One advantage to convolutional networks is that they can also process inputs
with varying spatial extents.
• Inputs of this kind cannot be represented by traditional neural networks based on
fixed-size matrix multiplication.

56
57
Efficient Convolution Algorithms
• Modern convolutional networks often contain millions of units, requiring powerful
implementations that leverage parallel computing for efficiency.
• Frequency Domain Convolution: Convolution can be done by converting both the input
and kernel to the Fourier domain, performing point-wise multiplication, and then
converting back using the inverse Fourier transform.
• This is often faster than direct convolution for larger kernels.
• Separable Kernels: A d-dimensional kernel is called separable if it can be expressed as
the outer product of d vectors.
• Separable Convolution Efficiency: Instead of using a naïve approach, separable kernels
allow convolution to be broken into multiple 1D convolutions, reducing computational
cost.
• It also reduces the number of parameters needed to store the kernel.
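
Frequency-domain convolution can be sketched with NumPy's FFT routines. This is a minimal 1D version, assuming zero padding to the full output length; the function name `fft_convolve` is hypothetical.

```python
import numpy as np

def fft_convolve(signal, kernel):
    """Full 1D convolution computed in the frequency domain."""
    # Zero-pad to the full output length so circular convolution equals linear convolution.
    n = len(signal) + len(kernel) - 1
    spectrum = np.fft.rfft(signal, n) * np.fft.rfft(kernel, n)   # point-wise multiplication
    return np.fft.irfft(spectrum, n)                             # back to the signal domain

rng = np.random.default_rng(0)
signal, kernel = rng.normal(size=64), rng.normal(size=9)
print(np.allclose(fft_convolve(signal, kernel), np.convolve(signal, kernel)))  # True
```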
58
• If the kernel is w elements wide in each dimension, naïve multidimensional
convolution requires O(w^d) runtime and parameter storage, while separable
convolution reduces this to O(w × d).
• Three Major Approaches for Efficient Convolution:
• Naïve approach
• Convolution with separable kernel
• Recursive filtering

59
1. Naïve Convolution Approach
Convolution computes the weighted sum of an input signal using a kernel.

• For each output position n, multiply the flipped kernel with the corresponding input
values and sum the results.
• This method is slow and memory-intensive, making it inefficient for large data.
• More efficient methods like separable convolution and Fourier transforms are
used to speed up computation.
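
The naïve approach can be written as two nested loops. This is a minimal 1D sketch with made-up values; the function name `naive_convolve` is hypothetical.

```python
import numpy as np

def naive_convolve(x, h):
    """Direct 1D convolution: y[n] = sum over k of x[k] * h[n - k]."""
    y = np.zeros(len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):      # h[n - k] is the kernel flipped and shifted to n
                y[n] += x[k] * h[n - k]
    return y

x = np.array([1.0, 2.0, 3.0])
h = np.array([0.0, 1.0, 0.5])
print(naive_convolve(x, h))  # [0.  1.  2.5 4.  1.5]
```

The cost grows with the product of the input length and the kernel length, which is why the faster methods matter for large data.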

60
2. Separable Convolution
• A convolution with a separable kernel can be decomposed into multiple lower-
dimensional convolutions, reducing computational cost.
• A kernel is separable if its matrix has rank 1. To construct such a kernel, consider
one row vector u = (u1, u2, u3, . . . , um) and one column vector v with
v^T = (v1, v2, v3, . . . , vn), and convolve them together: A = v ∗ u, an n × m matrix
whose entry in row i and column j is vi · uj.

• A represents the convolution kernel, while u and v are lower-dimensional
convolution filters, making computation more efficient.
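
The equivalence between one 2D pass and two 1D passes can be checked numerically. This is a sketch under assumptions: the filter values for `u` and `v` are made up, the image is random, and the kernel is applied without flipping in both versions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

u = np.array([1.0, 0.0, -1.0])   # row filter (hypothetical values)
v = np.array([1.0, 2.0, 1.0])    # column filter (hypothetical values)
A = np.outer(v, u)               # rank-1 3x3 kernel with A[i, j] = v[i] * u[j]

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))

# Direct 2D pass: 9 multiplications per output value.
direct = np.einsum("ijab,ab->ij", sliding_window_view(image, A.shape), A)

# Two 1D passes: filter every row with u, then every column with v.
rows = np.apply_along_axis(np.correlate, 1, image, u, mode="valid")
separable = np.apply_along_axis(np.correlate, 0, rows, v, mode="valid")

print(np.linalg.matrix_rank(A), np.allclose(direct, separable))  # 1 True
```

Only the six values in `u` and `v` need to be stored instead of the nine entries of `A`.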

61
62
Challenges in Recursive Filtering
• Replication – Given a slow but accurate non-recursive filter, finding an equivalent
recursive version can be complex.
• Stability – The recursive formula may cause numerical instability, leading to
incorrect results or divergence.
• Accuracy – Small computational errors can accumulate over time, reducing the
precision of the filtering process.
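
The stability issue can be seen with a very small recursive filter. This is a sketch under an assumption: a first-order recursion y[n] = x[n] + a·y[n−1] is used as a stand-in example, and the coefficient values are chosen for illustration.

```python
import numpy as np

def recursive_filter(x, a):
    """First-order recursive filter: y[n] = x[n] + a * y[n - 1]."""
    y = np.zeros(len(x))
    prev = 0.0
    for n, value in enumerate(x):
        prev = value + a * prev   # each output reuses the previous output
        y[n] = prev
    return y

impulse = np.zeros(20)
impulse[0] = 1.0

stable = recursive_filter(impulse, a=0.5)     # response decays: 1, 0.5, 0.25, ...
unstable = recursive_filter(impulse, a=1.5)   # response keeps growing
print(stable[-1], unstable[-1])
```

With |a| < 1 the response to a single impulse dies out, while with |a| > 1 it keeps growing, which is the divergence mentioned above.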

Architectures for Recursive Filters


• Streaming architectures – Process data in a stream with minimal memory.
• Parallel architectures – Speed up computation using mathematical optimizations.

63
Previous year questions

64
• What is the size of the feature map if the input image is 64x64, the convolutional
kernel size is 8x8, the stride is 3, and there is symmetric padding of 2 pixels on
each side?

• What are structured outputs in the context of CNNs?


• Illustrate the role of convolutional layers in CNNs with an example.
• Explain the significance of efficient convolution algorithms in CNNs.
• List and explain the common data types used in deep learning.
• Suggest a method to make the convolution algorithm more efficient. Justify your
answer.
65
• What is CNN, and how is it different from a fully connected neural network?

• Give two benefits of using convolutional layers instead of fully connected ones for
visual tasks.
• Describe the motivation behind convolutional neural networks.
• Sketch the diagram of Convolutional Neural Network architecture and explain
different stages in detail.
• Explain in detail the variants of convolution functions.

66
• Assume an input volume of dimension 64 x 64 x 3. What are the dimensions of the
resulting volume after convolving a 5 x 5 kernel with zero padding, stride of 1 and 2
filters?
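
Feature-map size questions like the ones above can be checked with the standard output-size formula. This is a sketch under assumptions: the input and kernel are square, the padding is the same on every side, and "zero padding" in the last question is read as a padding width of 0.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output width for input width n: floor((n + 2 * padding - kernel) / stride) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

print(conv_output_size(64, kernel=8, stride=3, padding=2))  # 21
print(conv_output_size(64, kernel=5, stride=1, padding=0))  # 60
```

The output depth equals the number of filters, so under these assumptions two 5 × 5 filters on a 64 x 64 x 3 input give a 60 x 60 x 2 volume.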

67
