CPCS432 Lecture 5 Deep Learning and Artificial Neural Networks Techniques in Computer Vision
Applying Deep Learning Algorithms for Computer Vision Tasks
Introduction to Convolutional Neural Networks (CNNs)
In the previous part, we discussed feature extraction and feature selection.
We now understand that the better the features are, the more accurate the results are going to be.
In recent years, features have become more precise, and better accuracy has been achieved as a result. This is
due to a new kind of feature extractor called the convolutional neural network (CNN). CNNs have shown
remarkable accuracy in complex tasks, such as object detection in challenging domains and image classification,
and are now ubiquitous in applications ranging from smartphone photo enhancement to
satellite image analysis.
Lakshmanan, V., Görner, M., & Gillard, R. (2021). Practical Machine Learning for Computer Vision. O'Reilly Media.
CPCS432 Lecture 5 28/09/2023 64
Convolutional Neural Network
A convolutional neural network (CNN) is a special artificial neural network that automatically
extracts and selects relevant features from input data. This characteristic makes CNNs well-suited
for handling grid-like structured data, such as images.
[Figure: CNN architecture, with a feature extraction & selection stage followed by an ANN classification stage]
A CNN model can be thought of as a combination of two components: feature extraction and
classification.
• The convolution and pooling layers perform feature extraction and selection.
• The ANN layers act as a classifier on top of these features. They learn how to use the feature maps to
classify the images correctly by assigning a probability to each possible class of the input image.
Key Components:
Convolutional Layers
Pooling Layers
Fully Connected Layers
Additional Components:
Batch Normalization
Dropout
CNN Components: Convolutional Layers
What Is Convolution?
Example: an input image of size (n, n) = 4x4 filtered with a kernel of size (k, k) = 3x3 gives an output of size ((n-k+1), (n-k+1)) = 2x2.
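The size rule above, an (n, n) image and a (k, k) kernel giving an (n-k+1, n-k+1) output, can be checked with a minimal sketch. The function name `conv2d_valid` and the 4x4 image and 3x3 kernel values are made up for illustration. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide a k x k kernel over an n x n image with no padding."""
    n, k = len(image), len(kernel)
    out_size = n - k + 1
    output = [[0] * out_size for _ in range(out_size)]
    for i in range(out_size):
        for j in range(out_size):
            # Multiply the kernel with the image patch under it, then sum.
            output[i][j] = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            )
    return output


# Hypothetical 4x4 image and 3x3 kernel, chosen only for illustration.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = conv2d_valid(image, kernel)
print(len(feature_map), len(feature_map[0]))  # 2 2
print(feature_map)  # [[-6, -6], [-6, -6]]
```

Every output value is the same here because the image intensity grows by the same amount from left to right everywhere, and this particular kernel responds to horizontal change.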
CNN Components: Convolutional Layers
What Is Padding?
Example: filtering a 5x5 image with a 3x3 kernel and no padding gives a (5-3+1) x (5-3+1) = 3x3 output.
Padding: valid. Input length = N, kernel length = K, output length = N - K + 1. The image is smaller after filtering.
Padding: same. Input length = N, kernel length = K, output length = N. The image stays the same size after filtering.
Padding: full. Input length = N, kernel length = K, output length = N + K - 1. The image is bigger after filtering.
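The three padding modes can be summarised in a few lines of code. This is only a sketch: the function names are hypothetical, and the per-side padding amounts assume zero padding and, for "same" mode, an odd kernel length.

```python
def output_length(n, k, mode):
    """Output length for input length n and kernel length k."""
    if mode == "valid":  # no padding: the output is smaller than the input
        return n - k + 1
    if mode == "same":   # pad so that the output matches the input
        return n
    if mode == "full":   # pad so that every partial overlap is kept
        return n + k - 1
    raise ValueError(f"unknown padding mode: {mode}")


def padding_per_side(k, mode):
    """Zeros added on each side of the input (odd k assumed for 'same')."""
    return {"valid": 0, "same": (k - 1) // 2, "full": k - 1}[mode]


for mode in ("valid", "same", "full"):
    print(mode, output_length(5, 3, mode), padding_per_side(3, mode))
# valid 3 0
# same 5 1
# full 7 2
```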
[Figure: a 3x3x3 filter applied to a colour image]
Joseph Nelson. (Feb 5, 2020). When to Use Grayscale as a Preprocessing Step. Roboflow Blog: https://blog.roboflow.com/when-to-use-grayscale-as-a-preprocessing-step/
CNN Components: Convolutional Layers
[Figure: 2-D convolution (image and filter each addressed by 2 indices) compared with 3-D convolution (image and filter each addressed by 3 indices)]
CNN Components: Convolutional Layers
Convolution on Colour Images: Multiple Features
• Example: A.shape = H x W x 3, with filters F₁.shape = K x K x 3 and F₂.shape = K x K x 3
• If we use "same" mode, then B₁.shape = H x W and B₂.shape = H x W
• If we stack B₁ and B₂, we get B.shape = H x W x 2
• We can add any number of features!
• Consider that we should have more than one filter per image, because each filter is looking for something different
• We call these "feature maps" (e.g. each 2-D image is a map that tells us where the feature is found)
[Figure: an H x W x 3 input image convolved with 3-D filters 1 and 2 to produce the output feature maps]
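The shape bookkeeping for colour images can be traced with a small pure-Python sketch. The names `conv_same` and `stack_feature_maps`, the image values, and the two filters are assumptions made for illustration. Zero padding and an odd K are assumed for "same" mode.

```python
def conv_same(image, filt):
    """Filter an H x W x C image with a K x K x C filter ("same" mode).

    Returns one H x W feature map. Pixels outside the image count as zero.
    """
    h, w, c = len(image), len(image[0]), len(image[0][0])
    k = len(filt)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for a in range(k):
                for b in range(k):
                    y, x = i + a - pad, j + b - pad
                    if 0 <= y < h and 0 <= x < w:
                        for ch in range(c):
                            total += image[y][x][ch] * filt[a][b][ch]
            out[i][j] = total
    return out


def stack_feature_maps(maps):
    """Stack N feature maps of shape H x W into one H x W x N output."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[[m[i][j] for m in maps] for j in range(w)] for i in range(h)]


H, W, K = 4, 5, 3
A = [[[1.0, 2.0, 3.0] for _ in range(W)] for _ in range(H)]    # H x W x 3
F1 = [[[1.0, 1.0, 1.0] for _ in range(K)] for _ in range(K)]   # K x K x 3
F2 = [[[0.0, 0.0, 0.0] for _ in range(K)] for _ in range(K)]   # K x K x 3
F2[1][1] = [1.0, 0.0, 0.0]  # responds only to the first channel of the centre pixel

B1, B2 = conv_same(A, F1), conv_same(A, F2)
B = stack_feature_maps([B1, B2])
print(len(B), len(B[0]), len(B[0][0]))  # 4 5 2
```

Each filter spans all three colour channels and still yields a single 2-D map, so the depth of the output equals the number of filters, not the number of input channels.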
CNN Components: Convolutional Layers
[Figure: feature map from an edge detection-based method]
CNN Components: Convolutional Layers
How much do we save?
• Input image: 32 x 32 x 3 = 3072 values
• Filter: 3 x 5 x 5 x 64 = 4800 parameters (ignoring the bias term)
• Output image: 28 x 28 x 64 = 50176 values (32 - 5 + 1 = 28)
• Weight matrix of an equivalent fully connected layer: 3072 x 50176 = 154,140,672, i.e. ~154 MILLION parameters
• Compared to convolution, the fully connected layer has 154,140,672 / 4800 ≈ 32,000 times more parameters
• It would also perform suboptimally, because we want to use the same pattern finder in
multiple places
• Without shared weights, we need to learn to find the pattern in every possible location it
might appear, separately
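The arithmetic in this comparison can be reproduced directly. This is a small sketch using the slide's numbers, with variable names chosen here for illustration.

```python
# Sizes taken from the example: a 32 x 32 x 3 input and 64 filters of size 5 x 5.
in_h, in_w, in_c = 32, 32, 3
k, n_filters = 5, 64

out_h, out_w = in_h - k + 1, in_w - k + 1   # "valid" convolution: 28 x 28
conv_params = in_c * k * k * n_filters      # shared weights, bias ignored

input_size = in_h * in_w * in_c
output_size = out_h * out_w * n_filters
dense_params = input_size * output_size     # one weight per input/output pair

print(conv_params)                          # 4800
print(dense_params)                         # 154140672
print(round(dense_params / conv_params))    # 32113
```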
Activation Function
• The activation function serves as a decision function and helps the network learn complex patterns.
• Activation functions are necessary to prevent linearity. Without them, the data would pass
through the nodes and layers of the network going only through linear functions, so the whole
network would collapse into a single linear function.
• The selection of an appropriate activation function can accelerate the learning process.
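A tiny numerical sketch shows why non-linearity matters. Two stacked linear layers are always equivalent to one linear layer, while inserting an activation between them breaks that equivalence. ReLU is used here as one common choice, and the weights are arbitrary values picked for illustration.

```python
def relu(x):
    return max(0.0, x)


def two_linear_layers(x, w1, b1, w2, b2):
    return w2 * (w1 * x + b1) + b2


def collapsed_layer(x, w1, b1, w2, b2):
    # The same function written as ONE linear layer.
    return (w2 * w1) * x + (w2 * b1 + b2)


def two_layers_with_relu(x, w1, b1, w2, b2):
    return w2 * relu(w1 * x + b1) + b2


params = (2.0, -1.0, 3.0, 0.5)  # hypothetical w1, b1, w2, b2
for x in (-2.0, 0.0, 2.0):
    print(x,
          two_linear_layers(x, *params),
          collapsed_layer(x, *params),
          two_layers_with_relu(x, *params))
# -2.0 -14.5 -14.5 0.5
# 0.0 -2.5 -2.5 0.5
# 2.0 9.5 9.5 9.5
```

The first two columns always agree, while the ReLU version differs wherever the hidden value is negative.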
Pooling/Subsampling/Downsampling
The pooling layer performs “downsampling,” much like feature selection.
Pooling Layer
• Pooling, or down-sampling, is a local operation.
• It summarizes similar information in the neighborhood of the receptive
field and outputs the dominant response within this local region.
• Pooling helps to extract a combination of features that is invariant
to translational shifts and small distortions.
CNN Components: Pooling Layer
Pooling
At a high level, pooling is downsampling.
E.g. output a smaller image from a bigger image.
If the input is 100x100, a pool size of 2 (with stride 2) would yield 50x50, a.k.a. "downsample by 2".
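Downsampling by 2 can be sketched with max pooling, one common pooling choice. The function name `max_pool2d` and the sample values are hypothetical, and the stride is assumed to equal the pool size.

```python
def max_pool2d(image, pool_size=2, stride=2):
    """Keep the largest value in each pool_size x pool_size window."""
    h, w = len(image), len(image[0])
    out_h = (h - pool_size) // stride + 1
    out_w = (w - pool_size) // stride + 1
    return [
        [
            max(
                image[i * stride + a][j * stride + b]
                for a in range(pool_size)
                for b in range(pool_size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


small = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [7, 0, 3, 3],
    [1, 8, 2, 6],
]
print(max_pool2d(small))  # [[4, 5], [8, 6]]

big = [[0] * 100 for _ in range(100)]
pooled = max_pool2d(big)
print(len(pooled), len(pooled[0]))  # 50 50
```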
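CNN Components: Pooling Layer, Stride
Stride/Padding Equation
Output size = $\left(\left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1,\ \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1,\ N_f\right)$
where n is the input size, k is the filter size, p is the padding, s is the stride, and $N_f$ is the number of filters.
Example: n = 37, p = 0, k = 3, s = 1 gives (37 + 2(0) - 3)/1 + 1 = 35, i.e. a 35 x 35 output for each of the $N_f$ filters.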
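The stride/padding equation, output side = floor((n + 2p - k) / s) + 1 with one output channel per filter, can be wrapped in a helper for quick checks. The function name and the filter count of 10 in the first call are assumptions for illustration.

```python
def conv_output_shape(n, k, p=0, s=1, n_filters=1):
    """Output shape for an n x n input, k x k filters, padding p, stride s."""
    side = (n + 2 * p - k) // s + 1
    return (side, side, n_filters)


print(conv_output_shape(37, 3, p=0, s=1, n_filters=10))  # (35, 35, 10)
print(conv_output_shape(32, 5, n_filters=64))            # (28, 28, 64)
print(conv_output_shape(5, 3, p=1))                      # (5, 5, 1)
print(conv_output_shape(100, 2, s=2))                    # (50, 50, 1)
```

The last two calls reproduce "same" padding for a 3x3 kernel and the "downsample by 2" pooling case.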
CNN Components: Fully Connected Layer
• Fully connected layers form the last few layers in the network.
• The input to the fully connected layer is the output of the preceding layer (activation maps of
high-level features), flattened into a vector; the layer outputs an n-dimensional vector.
• The flattened output is fed to a feed-forward neural network, and backpropagation is applied in
every iteration of training.
• Over a series of epochs, the model learns to distinguish between dominating and certain low-level
features in images and classifies them using the softmax classification technique.
• Pooling Layers: Additionally, CNNs use pooling operations (e.g., max pooling)
to progressively reduce the spatial dimensions of the input, further reducing the
number of parameters and making the model computationally more efficient
without losing important features.
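The flatten, fully connected, softmax pipeline described for the fully connected layer can be sketched as follows. The helper names, the tiny 2 x 2 x 1 feature map, and the weights for three hypothetical classes are all assumptions. In a real network the weights would be learned by backpropagation.

```python
import math


def flatten(feature_maps):
    """Turn an H x W x C block of activations into one flat vector."""
    return [v for row in feature_maps for pixel in row for v in pixel]


def dense(vector, weights, biases):
    """Fully connected layer: one weighted sum per output unit."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the maximum avoids overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


feature_maps = [[[1.0], [0.0]], [[0.0], [1.0]]]  # 2 x 2 x 1
weights = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
biases = [0.0, 0.0, 0.0]

scores = dense(flatten(feature_maps), weights, biases)
probs = softmax(scores)
print(scores)                    # [2.0, 0.0, 1.0]
print(probs.index(max(probs)))   # 0
```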
In traditional machine learning models, when an object shifts position in an image, the
model might not be able to recognize it because the exact pixel locations have changed.
This requires extensive data augmentation (training on multiple versions of the same image
in different locations) for good performance.
CNNs, on the other hand, inherently provide translational invariance due to two key aspects:
• Convolutional Layers: CNNs apply filters (kernels) across the entire input image using
convolution operations. These filters are small matrices that slide over the image and are
shared across the whole input space. The same filter is applied to different parts of the
image, meaning that it can recognize specific patterns, like edges or textures, regardless of
where they appear. This allows the network to detect the same feature (like an eye, wheel, or
object) even if it moves to a different part of the image.
• Pooling Layers: Pooling operations, like max pooling, further contribute to translational
invariance by reducing the spatial dimensions of the input. Max pooling takes the maximum
value from a specific region, making the exact position of a feature less important while still
retaining its presence. This down-sampling ensures that minor translations (small
movements) in the input do not change the output too much, hence making the model more
robust to shifts in objects' positions.
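Both aspects can be seen in a simplified 1-D sketch. A shared filter responds to the same pattern wherever it appears, and a max over the whole response (global max pooling, used here as a simplification of local max pooling) discards the position. The signals and the filter are made-up values.

```python
def correlate_valid(signal, kernel):
    """Slide one shared kernel over the whole signal (no padding)."""
    k = len(kernel)
    return [
        sum(signal[i + a] * kernel[a] for a in range(k))
        for i in range(len(signal) - k + 1)
    ]


def global_max_pool(values):
    """Keep only the strongest response, discarding where it occurred."""
    return max(values)


kernel = [1, 2, 1]                       # hypothetical filter matched to the pattern 1, 2, 1
left = [1, 2, 1, 0, 0, 0, 0, 0]          # pattern at the start
right = [0, 0, 0, 0, 0, 1, 2, 1]         # same pattern, shifted to the end

resp_left = correlate_valid(left, kernel)
resp_right = correlate_valid(right, kernel)
print(resp_left)    # [6, 4, 1, 0, 0, 0]
print(resp_right)   # [0, 0, 0, 1, 4, 6]
print(global_max_pool(resp_left), global_max_pool(resp_right))  # 6 6
```

The feature map moves with the pattern, while the pooled value is identical for both positions.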
Convolutional Neural Network
Why are convolutional neural networks (CNNs) efficient?
Capture complex dependencies
• Learning Hierarchical Relationships: CNNs excel at capturing complex
dependencies by using multiple layers of convolution, pooling, and non-linearity
(e.g., ReLU). The successive layers of a CNN are able to combine lower-level
features (like edges or textures) to recognize higher-level patterns (like objects or
faces). These complex dependencies between features are automatically learned
during training without the need for manual feature engineering.
[Figure: CNN architecture, with a feature extraction & selection stage followed by an ANN classification stage]
Typical CNN Architecture
Why convolution followed by pooling?
Losing Information
We lose spatial information: we don't care where the feature was found.
We haven't yet considered the number of feature maps.
Generally, these increase at each layer.
So, we gain information in terms of what features were found.
The LeNet-5 CNN architecture, introduced in 1998 by LeCun et al. in their paper “Gradient-Based Learning Applied
to Document Recognition,” was mainly used for recognizing handwritten and machine-printed characters (optical
character recognition [OCR]) in documents.
It is a CNN consisting of seven layers.
• There are three convolutional layers (C1, C3 and C5).
• There are two subsampling layers (S2 and S4).
• There is one fully connected layer (F6) and one output layer.
• The convolutional layers use 5×5 convolution kernels with stride 1.
• The subsampling layers are 2×2 average pooling layers.
• The entire network uses the tanh activation function except for the
output layer, which uses softmax.
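The layer list can be turned into a shape trace. The 5×5 kernels, stride 1, and 2×2 pooling come from the description above. The 32×32×1 input and the filter counts (6, 16, 120) are not stated in this lecture text and are taken from the original LeNet-5 paper, so treat them as assumptions of this sketch.

```python
def conv(shape, n_filters, k=5, stride=1):
    """Shape after a 'valid' convolution with n_filters k x k kernels."""
    h, w, _ = shape
    return ((h - k) // stride + 1, (w - k) // stride + 1, n_filters)


def avg_pool(shape, size=2):
    """Shape after size x size average pooling with stride equal to size."""
    h, w, c = shape
    return (h // size, w // size, c)


layers = [
    ("C1", lambda s: conv(s, 6)),
    ("S2", avg_pool),
    ("C3", lambda s: conv(s, 16)),
    ("S4", avg_pool),
    ("C5", lambda s: conv(s, 120)),
]

shape = (32, 32, 1)  # assumed greyscale input size
trace = {"input": shape}
for name, layer in layers:
    shape = layer(shape)
    trace[name] = shape

for name, s in trace.items():
    print(name, s)
# input (32, 32, 1)
# C1 (28, 28, 6)
# S2 (14, 14, 6)
# C3 (10, 10, 16)
# S4 (5, 5, 16)
# C5 (1, 1, 120)
```

After C5 the 120 values feed the fully connected layer F6 and then the output layer. The spatial size shrinks at every step while the number of feature maps grows.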
CPCS432 Lecture 5 Dr. Arwa Basbrain
Examples of Popular CNNs
AlexNet
The input is a 224×224×3 colour image.
Convolution layer 3: kernel 3×3, 384 filters, strides 1×1, activation ReLU
Convolution layer 4: kernel 3×3, 384 filters, strides 1×1, activation ReLU
Convolution layer 5: kernel 3×3, 256 filters, strides 1×1, activation ReLU
Pooling layer 5: max pooling with kernel size 3×3, strides 2×2
The last three layers are a fully connected MLP.
All convolution layers use ReLU activation functions.
The output layer uses softmax activation.
There are 1,000 classes in the output layer.
Examples of Popular CNNs
GoogLeNet
Input size: 224×224×3
The Inception architecture introduces several key ideas that allow it to achieve
high performance, both in terms of accuracy and computational cost.
Inception: multiple convolutions in parallel branches.
Instead of trying to choose between different filter
sizes (1x1, 3x3, 5x5, etc.), just try them all!
Examples of Popular CNNs
ResNet
The key innovation of ResNet is the introduction of "residual blocks" that allow
for training substantially deeper networks than what was previously possible.
Residual Block: a CNN with branches (one branch is the identity function, so the other learns the
residual).
Variations: ResNet50, ResNet101, ResNet152, ResNet_v2, ResNeXt.
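The branch structure of a residual block can be sketched in a few lines. To stay short, this illustration replaces the convolutions of a real ResNet block with small fully connected layers, and the helper names and weights are hypothetical. Only the "identity branch plus learned residual" idea is taken from the description above.

```python
def relu_vec(v):
    return [max(0.0, x) for x in v]


def linear(v, weights):
    """Weighted sums of v, one per row of weights (no bias, for brevity)."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """y = relu(x + F(x)): the identity branch plus the learned residual F."""
    fx = linear(relu_vec(linear(x, w1)), w2)    # residual branch F(x)
    return relu_vec([a + b for a, b in zip(x, fx)])


x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
half = [[0.5, 0.0], [0.0, 0.5]]

print(residual_block(x, zeros, zeros))    # [1.0, 2.0]
print(residual_block(x, identity, half))  # [1.5, 3.0]
```

With all residual weights at zero the block simply passes its input through, which is one intuition for why stacking many such blocks does not make the signal harder to propagate.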
Transfer Learning Intuition
• Main idea: The features learned for one task may be useful for another task
• Transfer Learning took off in the field of computer vision
• ImageNet - Large-scale image dataset (millions of images, 1k categories)
• Because the dataset is so diverse, weights trained on this dataset can be applied
to a large number of vision tasks
• Cats vs Dogs
• Cars vs Trucks
• Even microscope images/images never seen before.