
Convolutional Neural Network

Dr. Thomas Abraham J V


SCOPE, VIT Chennai
1

Feature Selection in Image
• Suppose you have a dataset of images, and you want to classify whether the images contain
cats or dogs.

• In a traditional machine learning approach, you might use handcrafted features such as color
histograms, texture descriptors, and edge information as input to a classifier (e.g., Support
Vector Machine or Random Forest).

2
Shortcomings Of Traditional Feature Selection
• Loss of Spatial Information

• Issue: Traditional feature selection methods might discard important spatial information in
images. Features like color histograms or texture descriptors don't capture the spatial
relationships between pixels effectively.

• Example: The arrangement of pixels in a specific pattern that represents a cat's ear or a dog's
tail might not be adequately captured by traditional features.

• Limited Robustness to Variations

• Issue: Handcrafted features are often designed based on assumptions about the data
distribution. They may not be robust to variations in scale, rotation, or lighting conditions.

• Example: If the cat or dog images vary significantly in pose or lighting, manually selected
features may struggle to generalize across different scenarios.
3
Shortcomings Of Traditional Feature Selection
• Dependency on Expert Knowledge

• Issue: The selection of handcrafted features relies on domain expertise and may not adapt well to
diverse datasets.

• Example: Features designed for one dataset may not be as effective when applied to a different dataset
with distinct characteristics.

• Solution

• Transition to CNN (Convolutional Neural Network)

• CNNs automatically learn relevant features, leveraging the spatial relationships within images and
adapting to the complexities of the data.

• This approach is especially powerful in tasks where the structure and arrangement of features are
crucial, such as in computer vision applications.
4
Introduction to CNN

• A convolutional neural network is a class of deep, feed-forward artificial neural networks.

• CNNs, like neural networks, are made up of neurons with learnable weights and biases.

• A CNN has an input layer, a number of hidden layers, and an output layer.

• Computer vision through CNNs has several applications such as self-driving cars and
robotics.

5
Inspiration Behind CNN
• Hierarchical architecture

• Local connectivity

• Translation invariance

• Multiple feature maps

• Non-linearity

Source
Composing CNNs for complex tasks

7
Source
CNN vs RNN

8
Basic CNN Architecture

9
What are Features

10
What are Features

11
Feature Extraction
• Feature extraction refers to the process of transforming raw input data into a set of
meaningful features that are more representative and informative for solving a particular task
or problem. In the context of machine learning, including Convolutional Neural Networks
(CNNs), feature extraction involves identifying and selecting relevant features from the input
data that can be used for analysis, classification, or other purposes.

• Feature extraction is essential because raw input data, such as images, can contain vast
amounts of information, much of which may be redundant or irrelevant for a specific task.
By extracting relevant features, the model focuses on the most informative aspects of the
data, making learning more efficient and effective. The extracted features serve as inputs to
subsequent layers or classifiers for tasks like image classification, object detection,
segmentation, etc.

12
Receptive Field
There are two types of receptive fields:

• Local Receptive Field: the spatial area in the input data that a single neuron in a specific layer is sensitive to. In a convolutional layer, for instance, each neuron is associated with a small local receptive field defined by the size of the filter/kernel applied to the input data. This local receptive field represents the region of the input image that the neuron "sees" or is influenced by.

• Global Receptive Field: the entire spatial area of the input data that influences the output of a particular neuron in the network. It is the combined effect of all the local receptive fields from preceding layers that contribute to the activation of a specific neuron in deeper layers.
13
Components of CNN
• Convolutional layer

• Pooling or downsampling layer

• Flattening layer

• Fully connected layer

14
Architecture of the CNNs applied to digit recognition

15

Source
Components of CNN (contd)
• Convolutional Layers: These layers perform feature extraction by applying convolution
operations using learnable filters (also known as kernels) to the input data. Filters slide across
the input image, extracting features such as edges, textures, or shapes, preserving spatial
relationships.

• Pooling Layers: Pooling layers reduce the spatial dimensions of the feature maps generated
by convolutional layers. Common pooling operations include max pooling and average
pooling, which downsample the data, extracting the most relevant information while
reducing computational complexity.

• Fully Connected Layers: These layers integrate the features learned by the previous layers
and perform classification or regression tasks based on the extracted features. The output of
these layers represents the final prediction or decision made by the network.

16
Filters
• Weight matrix applied to extract local-region features from an image

• Many filters can be used to extract more features
• Typical image filter

17
Image checking

Given an image of X, how do we check that it is an X?

By extracting features.

18
Features of X in the image

19
Using Filters
filter1

20
filter2

21
filter3

22
Convolution operation

23
Edge Detection Algorithm

24
Image filtering

25
Filters in CNN
• In convolutional neural networks we don't design the filters by hand; we only specify the number of kernel filters in each convolutional layer.

• The values of the kernel filters are learned automatically by the network during training, so the kernels that produce the most useful features for the particular classification or detection task are learned rather than chosen (see the sketch below).

• The values of the kernel filters are the weights of the CNN, and those values are learned rather than decided.

• For a classification model, a CNN doesn't need to look at every pixel of the image; it looks abstractly at different parts of the object in the image. In a segmentation problem, however, the model needs to look at each and every pixel.

26
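To make this concrete, here is a minimal sketch (TensorFlow/Keras assumed, matching the code used later in this deck): we specify only the number and size of the kernels, and the framework creates their values as trainable weights.

import tensorflow as tf

# We choose only the filter count and size; the values themselves are learned.
conv = tf.keras.layers.Conv2D(filters=8, kernel_size=3)
conv.build(input_shape=(None, 32, 32, 3))   # allocate the weight tensors

kernels, biases = conv.get_weights()
print(kernels.shape)   # (3, 3, 3, 8): height x width x in_channels x n_filters
print(biases.shape)    # (8,)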
How Convolution Works

• The convolution of a 5x5 image and a 3x3 filter:

• Slide the 3x3 filter over the input image, element-wise multiply, and add the outputs (see the sketch below).
27
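A minimal NumPy sketch of this sliding-window operation (strictly speaking, the cross-correlation that CNN layers actually compute), assuming a 5x5 image and a 3x3 filter:

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over every valid position, multiply element-wise, sum.
    h, w = kernel.shape
    out_h = image.shape[0] - h + 1
    out_w = image.shape[1] - w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., -1.]] * 3)    # a simple vertical-edge filter
print(convolve2d(image, kernel))          # 3x3 output: 5 - 3 + 1 = 3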
Convolution Operation

28
(Slides 29-36: the convolution operation illustrated step by step; figures only.)
Sliding filter to extract local feature

Image Courtesy: https://towardsdatascience.com/convolutional-neural-network-in-natural-language-processing-96d67f91275c
37
Stride
• During convolution, the filter slides from left to right and from top to bottom until it has passed over the entire input image.

• The stride is the step size of the filter. So, when we want to downsample the input image and end up with a smaller output, we set the stride S > 1 (for example, S = 2).

38
Padding

39
(Slides 40-49: padding illustrated step by step; figures only.)
Padding (contd)
• We have seen that convolving an input of 6 X 6 dimension with a 3 X 3 filter results in a 4 X 4 output. We can generalize: if the input is n X n and the filter size is f X f, then the output size will be (n-f+1) X (n-f+1).

• There are primarily two disadvantages here:

• Every time we apply a convolutional operation, the size of the image shrinks.

• Pixels in the corners of the image are used far fewer times during convolution than the central pixels, so the corners contribute less to the output, which can lead to information loss.

• To overcome these issues, we can pad the image with an additional border.

50
Count of each pixel usage

51
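The count shown on this slide can be reproduced with a small NumPy sketch (illustrative only): a pixel's usage count is the number of valid 3x3 filter positions that cover it.

import numpy as np

n, f = 6, 3                           # 6x6 image, 3x3 filter, no padding
usage = np.zeros((n, n), dtype=int)
for i in range(n - f + 1):            # every position the filter can take
    for j in range(n - f + 1):
        usage[i:i + f, j:j + f] += 1
print(usage)                          # corners are used once, the centre 9 times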
Zero Padding

52
(Slides 53-55: zero padding examples; figures only.)
• Valid padding: no padding is used; convolution normally reduces the spatial output.

• Full padding: full padding increases the spatial output (see the sketch below).

56
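A quick way to compare these modes (SciPy assumed; zero "same" padding included for contrast):

import numpy as np
from scipy.signal import convolve2d

image = np.ones((6, 6))
kernel = np.ones((3, 3))

print(convolve2d(image, kernel, mode='valid').shape)   # (4, 4): n - f + 1
print(convolve2d(image, kernel, mode='same').shape)    # (6, 6): zero-padded
print(convolve2d(image, kernel, mode='full').shape)    # (8, 8): n + f - 1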
Channels

57
Convolution for Multiple Channels

58
Convolution for Multiple Channels

59
60
Multiple Filter Edges
• Generalized dimensions can be given as:

Input: n X n X nc

Filter: f X f X nc

Padding: p

Stride: s

Output:

[(n+2p-f)/s+1] X [(n+2p-f)/s+1] X nf

• Here, nc is the number of channels in the input and filter, while nf is the number of filters (a helper computing these dimensions follows below).
61
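A small Python helper implementing this formula for square inputs (the function name is illustrative):

def conv_output_size(n, f, p=0, s=1):
    """Output side length for an n x n input, f x f filter, padding p, stride s."""
    size = (n + 2 * p - f) / s + 1
    assert size == int(size), "non-integer output: stride/padding set incorrectly"
    return int(size)

print(conv_output_size(6, 3))              # 4  (valid convolution)
print(conv_output_size(6, 3, p=1))         # 6  ('same' padding)
print(conv_output_size(7, 3, p=1, s=2))    # 4  (strided downsampling)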
How to Choose Kernel Size in CNN?
• Understand the Task and Data

• Consider Input Size and Complexity

• Balance Between Local and Global Information

• Avoid Information Loss

• Experiment and Validate

• Consider Computational Resources

62
Pooling Layer
• The pooling layer is responsible for reducing the spatial size of the convolved feature. This decreases the computational power required to process the data, through dimensionality reduction.

• It is useful for extracting dominant features that are rotationally and positionally invariant, thus maintaining the process of effectively training the model.

• There are two types of pooling: Max Pooling and Average Pooling (a sketch of max pooling follows below).

• Max Pooling also acts as a noise suppressant, whereas Average Pooling simply performs dimensionality reduction as a noise-suppressing mechanism.
63
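A minimal NumPy sketch of 2x2 max pooling with stride 2 (the most common setting): each 2x2 block is replaced by its maximum, halving both spatial dimensions.

import numpy as np

def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]    # drop odd edges if any
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 5, 6]], dtype=float)
print(max_pool_2x2(fmap))   # [[6. 4.]
                            #  [7. 9.]]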
Activation / Feature Map Dimension
• If the input image dimension is W x H, the filter dimension is K x K, the stride is S, and the padding is P, the output activation map will have the following dimensions:

• Wout = (W - K + 2P)/S + 1

• Hout = (H - K + 2P)/S + 1

• If the output dimensions are not integers, it means that we haven't set the stride correctly.

• We have two exceptional cases:

• When there is no padding at all, the output dimensions are

• ((W - K)/S + 1, (H - K)/S + 1)
64
Example

• Suppose we have an input image of size 128 x 128, a filter of size 5 x 5, padding P = 2 and stride S = 2. Then Wout = (128 - 5 + 4)/2 + 1 = 64.5, which is not an integer, so this stride is set incorrectly for this input (illustrating the rule above).

• With stride S = 1 instead, Wout = (128 - 5 + 4)/1 + 1 = 128, so the output activation map has dimensions 128 x 128 x nf.

65
Fully Connected Layer

66
Fully Connected Layer
• These layers form the final stage of the convolutional neural network, and their inputs correspond to the flattened one-dimensional matrix generated by the last pooling layer. ReLU activation functions are applied to them for non-linearity.

• Finally, a softmax prediction layer is used to generate probability values for each of the
possible output labels, and the final label predicted is the one with the highest probability
score.

67
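A minimal Keras sketch of such a classification head (the 4x4x64 input shape is illustrative, matching the last feature map of the model shown later in this deck):

import tensorflow as tf

head = tf.keras.Sequential([
    tf.keras.Input(shape=(4, 4, 64)),                  # last pooled feature maps
    tf.keras.layers.Flatten(),                         # to a 1-D vector (1024)
    tf.keras.layers.Dense(64, activation='relu'),      # non-linear dense layer
    tf.keras.layers.Dense(10, activation='softmax'),   # one probability per class
])
head.summary()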
Overfitting and Regularization in CNNs
• Deep learning models, especially Convolutional Neural Networks (CNNs), are particularly susceptible to
overfitting due to their capacity for high complexity and their ability to learn detailed patterns in large-scale data.

• Several regularization techniques can be applied to mitigate overfitting in CNNs, and some are illustrated below:

68

Source
# Build the CNN model (TensorFlow/Keras; the import below is implied by the
# layer names used in the slide)
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),  # 32x32 RGB input
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10)                 # logits for 10 classes
])
69
Layer (type)                   Output Shape         Param #
conv2d (Conv2D)                (None, 30, 30, 32)       896
max_pooling2d (MaxPooling2D)   (None, 15, 15, 32)         0
conv2d_1 (Conv2D)              (None, 13, 13, 64)    18,496
max_pooling2d_1 (MaxPooling2D) (None, 6, 6, 64)           0
conv2d_2 (Conv2D)              (None, 4, 4, 64)      36,928
flatten (Flatten)              (None, 1024)               0
dense (Dense)                  (None, 64)            65,600
70
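As a hedged sketch of how regularization might be added to the model above (the slides do not prescribe particular settings), one common combination is an augmentation layer at the input and Dropout before the output layer:

from tensorflow.keras import layers, models

regularized = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.RandomFlip('horizontal'),        # data augmentation (training only)
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),                    # zero half the units at random in training
    layers.Dense(10),
])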
Pre-trained Models
A pre-trained model is a saved network that was previously trained on a large dataset, typically
on a large-scale image classification task. You either use the pretrained model as is or use
transfer learning to customize this model to a given task.

• LeNet

• AlexNet

• ZF-Net

• GoogLeNet

• VGG16/VGG19

• ResNet
71
LeNet – First CNN Architecture
• LeNet was developed in 1998 by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner for handwritten digit recognition problems. The LeNet architecture consists of three convolutional layers and two subsampling layers, followed by two fully connected layers.

Layer Structure:

• Input Layer: Takes in 32x32 grayscale images (MNIST digits were originally 28x28, but they were
padded to 32x32).

• Convolutional Layers: Three convolutional layers, with average pooling (then called subsampling) following each of the first two.

• First Convolutional Layer (C1): 6 filters of size 5x5, followed by subsampling.

• Second Convolutional Layer (C3): 16 filters of size 5x5, followed by subsampling.

• Third Convolutional Layer (C5): 120 filters of size 5x5.
72


LeNet - Key Features
• Fully Connected Layers: After the convolutional layers, the output is fed into two fully connected
layers.

• The first fully connected layer (F6) has 84 neurons.

• The output layer has 10 neurons (one for each digit class 0-9).

• Average pooling layers are used for subsampling, and tanh is used as the activation function.

73
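A rough Keras sketch of this LeNet-5-style layer structure (an approximation for illustration, not a faithful reimplementation of the 1998 model):

from tensorflow.keras import layers, models

lenet = models.Sequential([
    layers.Conv2D(6, (5, 5), activation='tanh', input_shape=(32, 32, 1)),   # C1
    layers.AveragePooling2D((2, 2)),                                        # S2
    layers.Conv2D(16, (5, 5), activation='tanh'),                           # C3
    layers.AveragePooling2D((2, 2)),                                        # S4
    layers.Conv2D(120, (5, 5), activation='tanh'),                          # C5
    layers.Flatten(),
    layers.Dense(84, activation='tanh'),                                    # F6
    layers.Dense(10, activation='softmax'),                                 # output
])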
AlexNet – DL Architecture that popularized CNN
• AlexNet was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. The AlexNet network had an architecture very similar to LeNet, but was deeper, bigger, and featured convolutional layers stacked on top of each other.

• The AlexNet architecture was designed for large-scale image datasets, and it achieved state-of-the-art results at the time of its publication.

• AlexNet is composed of 5 convolutional layers with a combination of max-pooling layers, 3 fully connected layers, and 2 dropout layers.

• The activation function used in all layers is ReLU. The activation function used in the output layer is Softmax.

• The input size is usually quoted as 224x224x3, but due to the padding involved it effectively works out to 227x227x3. The total number of parameters in this architecture is around 60 million.
74
AlexNet - Key Features
Key Features:

• First to use ReLU (Rectified Linear Unit) activation function, which helps in faster training.

• Introduced dropout layers to reduce overfitting.

• Utilized overlapping max-pooling layers to downsample feature maps.

• Batch size of 128

• SGD Momentum is used as a learning algorithm

• Data augmentation is carried out, such as flipping, jittering, cropping, colour normalization, etc.

• Made use of GPU computation for training, enabling deeper networks.


75
AlexNet Architecture

76
AlexNet Architecture

77
ZF Net
• ZFNet is a CNN architecture that uses a combination of convolutional and fully connected layers. ZF Net was developed by Matthew Zeiler and Rob Fergus.

• It was an improvement on AlexNet, achieved by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and making the stride and filter size of the first layer smaller.

• The ZF Net CNN architecture consists of a total of seven layers: a convolutional layer, a max-pooling layer (downscaling), a concatenation layer, a convolutional layer with linear activation function and stride one, and dropout for regularization purposes applied before the fully connected output.

• The network has relatively fewer parameters than AlexNet and is computationally more efficient, introducing an approximate inference stage through deconvolutional layers in the middle of the CNN.
78
Difference between AlexNet and ZFNet
• Architecture

• AlexNet consists of eight layers: five convolutional layers followed by three fully connected layers. ZFNet retained the basic architecture of AlexNet but made some adjustments, particularly in the first few layers.

• Filters

• AlexNet used 11x11, 5x5 and 3x3 filter sizes, while ZFNet used a 7x7 filter in the first layer and 3x3 filters in the later layers.

• Strides

• AlexNet uses a stride of 4 in the first layer, while ZFNet uses a stride of 2.

• Normalization

• AlexNet used Local Response Normalization, while ZFNet used Local Contrast Normalization.
79
GoogLeNet – CNN Architecture used by Google
• GoogLeNet is the CNN architecture used by Google to win the ILSVRC 2014 classification task. It was developed by Christian Szegedy and colleagues at Google.

• It achieves a deeper architecture by employing a number of distinct techniques, including 1x1 convolutions and global average pooling. The GoogLeNet CNN architecture is computationally expensive.

• The architecture consists of a 22-layer deep CNN but reduces the number of parameters from 60 million (AlexNet) to about 4 million. The key features of GoogLeNet: the Inception module, the 1x1 convolution, global average pooling, and auxiliary classifiers for training.

• To reduce the parameters that must be learned, it relies heavily on 1x1 convolutions before the larger filters and replaces the fully connected layers at the end with global average pooling.

• Real-world applications/examples of the GoogLeNet CNN architecture include the Street View House Numbers (SVHN) digit recognition task.
80
GoogLeNet - Key Features
• Introduced the Inception module, which allows the network to extract features at multiple
scales by using convolutional filters of different sizes (1x1, 3x3, 5x5) in parallel.

• Reduced the number of parameters significantly by using 1x1 convolutions, leading to a


deeper network without a massive increase in computational cost.

• Employed global average pooling instead of fully connected layers, reducing the number of
parameters further.

81
GoogLeNet Architecture

82

Source
VGG Net
• The convolutional neural network model called the VGG model, or VGGNet, in its 16-layer form is known as VGG16. It was developed by K. Simonyan and A. Zisserman of the University of Oxford.

• VGGNet accepts 224x224-pixel images as input.

• VGG's convolutional layers use the smallest feasible receptive field, 3x3, to capture left-to-right and up-to-down patterns. Additionally, 1x1 convolution filters are used to transform the input linearly, each followed by a ReLU activation layer.

• VGGNet contains three fully connected layers. The first two have 4096 channels each, while the third has 1000 channels, one for each class.

• It is very slow to train (the original VGG model was trained on Nvidia Titan GPUs for 2-3 weeks), and it takes quite a lot of disk space and bandwidth, which makes it inefficient.

• Its 138 million parameters lead to an exploding gradients problem.
83


VGG - Key Features
• Simplified architecture using only 3x3 convolutional layers stacked on top of each other,
with depth increasing progressively.

• Utilized a consistent design pattern (same filter size, consistent max-pooling) making it
easier to understand and implement.

• The architecture was deeper than previous models (VGG16 with 16 layers and VGG19 with
19 layers), which improved performance.

84
VGG Architecture

85

Source
ResNet
• ResNet is the CNN architecture that was developed by Kaiming He et al. to win the ILSVRC 2015 classification task, with a top-five error of only 3.57%. The network has 152 layers and around 60 million parameters.
86
ResNet
• The skip connection bypasses some layers in between, linking one layer's activations to subsequent layers. This creates a residual block; these residual blocks are stacked to build ResNets.

• The following ResNet implementations are part of Keras Applications and offer ResNet V1 and ResNet V2 with 50, 101, or 152 layers:

• ResNet50, ResNet101, ResNet152

• ResNet50V2, ResNet101V2, ResNet152V2

• ResNet V2 and the original ResNet (V1) differ primarily in that V2 applies batch normalization before each weight layer.

87
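A minimal sketch of one such residual block in the Keras functional API (the filter count and input shape are illustrative):

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                   # the skip connection
    y = layers.Conv2D(filters, 3, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                # add the input back: F(x) + x
    return layers.ReLU()(y)

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs, 64)
block = tf.keras.Model(inputs, outputs)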
ResNet-34 Architecture

88
Transfer Learning
• Transfer learning is an approach to machine learning where a model trained on one task is
used as the starting point for a model on a new task. This is done by transferring the
knowledge that the first model has learned about the features of the data to the second model.

• In deep learning, transfer learning is often used to solve problems with limited data. This is
because deep learning models typically require a large amount of data to train, which can be
difficult or expensive to obtain.

89
Transfer Learning

90
Need of Transfer Learning in Deep Learning
• Limited Data: When the dataset is not large enough to train a deep neural network from
scratch.

• Time and Resource Efficiency: Training deep networks from scratch is computationally
expensive and time-consuming.

• Improved Performance: Pre-trained models often lead to better performance as they start
with learned features rather than random initialization.

91
TL Types
• Feature Extraction: Use the representations learned by a previous network to extract meaningful features from new samples. You simply add a new classifier, which will be trained from scratch, on top of the pretrained model so that you can repurpose the feature maps learned previously for the dataset.

• Fine-Tuning: Unfreeze a few of the top layers of a frozen model base and jointly train both
the newly-added classifier layers and the last layers of the base model. This allows us to
"fine-tune" the higher-order feature representations in the base model in order to make them
more relevant for the specific task.

92
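A hedged Keras sketch of both approaches, assuming ResNet50 from Keras Applications as the pre-trained base and a hypothetical 5-class target task:

import tensorflow as tf

base = tf.keras.applications.ResNet50(include_top=False, weights='imagenet',
                                      input_shape=(224, 224, 3), pooling='avg')

# 1) Feature extraction: freeze the base, train only the new classifier.
base.trainable = False
model = tf.keras.Sequential([base,
                             tf.keras.layers.Dense(5, activation='softmax')])

# 2) Fine-tuning: later, unfreeze only the top of the base and continue
#    training jointly with a low learning rate.
base.trainable = True
for layer in base.layers[:-10]:      # keep all but the last few layers frozen
    layer.trainable = False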
Transfer learning scenarios
• The target dataset is small and similar to the base training dataset.

• Freeze all the layers except the last, remove the last layer, add the new FC layer with
randomized weights

• The target dataset is large and similar to the base training dataset.

• Unfreeze all the layers of the model and continue the training process with the target dataset or
Initialize the model with pre-trained weights from the base model, then train on the target dataset
as if it were a new task.

• The target dataset is small and different from the base training dataset.

• Freeze most of the layers in the pre-trained model and only train the final few layers on the target
dataset.

• The target dataset is large and different from the base training dataset.

• Unfreeze all layers of the pre-trained model and fine-tune the entire model on the target dataset.
93
