3 - Deep Learning and CNNs
Intelligent Systems
ELE 4643
What is Deep Learning?
Artificial Intelligence:
• A broad concept where machines think and act more like humans
Machine Learning:
• An application of AI where machines use data to automatically improve at performing tasks
Deep Learning:
• A machine learning technique that processes data through a multi-layered neural network, much like the human brain
Deep Learning Animation
How neural networks learn | Deep learning
https://youtu.be/IHZwWFHWa-w
Backpropagation
• How do you train a Multi-Layer Neural Network's weights? How does it learn?
• The Backpropagation algorithm uses Gradient Descent
• For each training step (a minimal sketch follows this list):
– Compute the output error
– Compute how much each neuron in the previous hidden layer contributed to that error
– Back-propagate that error in a reverse pass
– Tweak the weights to reduce the error using gradient descent
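As a minimal sketch of these steps, here is a one-hidden-layer network trained with backpropagation and gradient descent in NumPy. The sigmoid activation, squared-error loss, and layer sizes are illustrative assumptions, not fixed by the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hidden-layer weights (4 neurons, 3 inputs) - assumed sizes
W2 = rng.normal(size=(1, 4))   # output-layer weights
x = rng.normal(size=(3, 1))    # one training example
y = np.array([[1.0]])          # desired output
lr = 0.1                       # learning rate

for step in range(100):
    # Forward pass
    h = sigmoid(W1 @ x)                       # hidden activations
    y_hat = sigmoid(W2 @ h)                   # network output
    # 1. Compute the output error (gradient of squared-error loss)
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)
    # 2./3. Reverse pass: how much each hidden neuron contributed
    delta1 = (W2.T @ delta2) * h * (1 - h)
    # 4. Tweak the weights with gradient descent
    W2 -= lr * delta2 @ h.T
    W1 -= lr * delta1 @ x.T
```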
Backpropagation
Information flow
$\mathbf{y} = f_{\mathrm{NN}}(\mathbf{x}) = o\big(\mathbf{W}_{L}\,\sigma\big(\mathbf{W}_{L-1}\,\sigma\big(\cdots\,\mathbf{W}_{1}\mathbf{x}\big)\big)\big)$
[Figure: information flows forward from the input layer, through the hidden layers, to the output layer; the error is calculated by comparing the network output with the desired output from the training data.]
[Figure] Potential problems during gradient descent: a) finding local minima instead of the global minimum, b) near halting due to small gradients, c) oscillation in valleys, d) leaving good minima.
Learning Flow
[Figure: the training loop. Step 1: random initialization of the weights. Inputs pass through the network to produce an output; a loss function compares the output with the desired output, and an optimizer updates the weights. Learning is performed on the training set.]
How is this achieved?
• ReLU activation function
• Loss functions
• Interactive demo: playground.tensorflow.org
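For concreteness, a sketch of the ReLU activation and two common loss functions in NumPy (the slide names ReLU but does not fix the losses; mean squared error and cross-entropy are the usual choices):

```python
import numpy as np

def relu(z):
    # ReLU: keep positive values, clamp negatives to zero
    return np.maximum(0.0, z)

def mse_loss(y_hat, y):
    # Mean squared error, typical for regression
    return np.mean((y_hat - y) ** 2)

def cross_entropy_loss(probs, y_onehot):
    # Cross-entropy over predicted class probabilities, typical for classification
    eps = 1e-12  # avoid log(0)
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))
```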
Convolutional Neural Network (CNN)
• What is a Convolutional Neural Network?
– A CNN is a neural network that uses convolution operations instead of matrix multiplications in at least one of its layers
[Figure: a 32 x 32 RGB image fed to a classifier over categories such as bird, car, cat, deer, dog, horse, ship.]
An RGB image can be represented as 32 x 32 x 3 pixels, so the input vector dimension is N = 32 × 32 × 3 = 3072.
CNN – Motivation
Property 1: Some patterns are much smaller than the whole image → Convolution
Property 2: The same patterns appear in different regions → Convolution
Property 3: Downsampling the pixels does not change the object → Max Pooling
Typical pipeline (the Convolution + Max Pooling pair can repeat many times):
Convolution → Max Pooling → Convolution → Max Pooling → … → Flatten → Fully Connected Feedforward Network → Prediction
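This pipeline could be sketched in Keras roughly as follows; the filter counts and layer sizes here are illustrative assumptions, not values from the slides:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),           # a 32 x 32 RGB image
    layers.Conv2D(32, 3, activation="relu"),   # Convolution
    layers.MaxPooling2D(2),                    # Max Pooling
    layers.Conv2D(64, 3, activation="relu"),   # Convolution (repeated)
    layers.MaxPooling2D(2),                    # Max Pooling (repeated)
    layers.Flatten(),                          # Flatten
    layers.Dense(64, activation="relu"),       # Fully connected feedforward
    layers.Dense(10, activation="softmax"),    # Prediction over 10 classes
])
```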
Convolutional Layers
• Apply small filters to detect small patterns
• Each filter has a size of 3 x 3

6 x 6 image:      Filter 1:      Filter 2:
0 1 0 0 1 0       -1  1 -1        1 -1 -1
0 1 0 0 1 0       -1  1 -1       -1  1 -1
0 1 0 0 1 0       -1  1 -1       -1 -1  1
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0

• Note: Only the size of the filters is specified; the weights are initialised to arbitrary values before the start of training.
• The weights of the filters are learnt through the CNN training process.
Convolutional Layers
• Key Parameters
– Filter size – defines the height and width of the filter kernel
• E.g., a filter kernel of size 3 x 3 would have nine weights
– Stride – determines the number of steps to move in each spatial direction while performing convolution
– Padding – appends zeroes to the boundary of an image to control the size of the output of convolution
• When we convolve an image of a specific size with a filter, the resulting image is generally smaller than the original image
Convolutional Layers
stride = 1
Compute the dot product between the filter and a small 3 x 3 chunk of the image:
• First window (top-left 3 x 3 chunk of the 6 x 6 image, with Filter 1): dot product = 3
• Slide the filter one pixel to the right: dot product = -3
Convolutional Layers
stride = 2
With stride = 2 the filter moves two pixels at a time, so fewer windows are evaluated:
• First window: dot product = 3
• Two pixels to the right: dot product = -3
We set stride = 1 below.
Compute the dot product between the filter and a small 3 x 3 chunk of the image.
Convolutional Layers
stride = 1
Sliding Filter 1 over the whole 6 x 6 image with stride 1 gives a 4 x 4 feature map:
 3 -3 -3  3
 1 -2 -2  1
 1 -2 -2  1
-1 -1 -1 -1
Convolutional Layers
A 6 x 6 image convolved with a 3 x 3 filter yields a 4 x 4 image:
output size: (6 - 3) / 1 + 1 = 4
Convolutional Layers
For an N x N image and filter size F:
output size = (N - F) / stride + 1
For example, N = 6, F = 3:
stride = 1 → (6 - 3)/1 + 1 = 4
stride = 2 → (6 - 3)/2 + 1 = 2.5 (not an integer, so stride 2 does not fit this image)
stride = 3 → (6 - 3)/3 + 1 = 2
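A small helper makes it easy to check which strides fit; this function is an illustration, not part of the slides:

```python
def conv_output_size(n, f, stride):
    """Output size of an N x N image convolved with an F x F filter (no padding)."""
    out = (n - f) / stride + 1
    if out != int(out):
        raise ValueError(f"stride {stride} does not fit: output would be {out}")
    return int(out)

print(conv_output_size(6, 3, 1))   # 4
print(conv_output_size(6, 3, 3))   # 2
# conv_output_size(6, 3, 2) raises an error: output would be 2.5
```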
Convolutional Layers
stride = 1
Filter 1 detects a vertical line: the two vertical lines in the 6 x 6 image both produce the maximum response, 3, in the feature map.
The same pattern in different locations is detected with the same filter.
Convolutional Layers
stride = 1
Do the same process for every filter. Each filter produces its own 4 x 4 feature map:

Filter 1:             Filter 2:
 3 -3 -3  3           -1 -1 -1 -1
 1 -2 -2  1           -1  0 -2 -1
 1 -2 -2  1           -3  0  0 -3
-1 -1 -1 -1            3 -1 -3 -1

Stacking the feature maps gives an output of size 4 x 4 x (#filters).
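A NumPy sketch of this sliding-window computation (technically a cross-correlation, which is what CNN layers compute) reproduces both 4 x 4 feature maps above:

```python
import numpy as np

image = np.array([[0, 1, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0]])

filter1 = np.array([[-1, 1, -1], [-1, 1, -1], [-1, 1, -1]])   # vertical-line detector
filter2 = np.array([[1, -1, -1], [-1, 1, -1], [-1, -1, 1]])   # diagonal detector

def convolve(img, filt, stride=1):
    f = filt.shape[0]
    out = (img.shape[0] - f) // stride + 1
    fmap = np.zeros((out, out), dtype=int)
    for i in range(out):
        for j in range(out):
            # dot product between the filter and a 3 x 3 chunk of the image
            chunk = img[i*stride : i*stride+f, j*stride : j*stride+f]
            fmap[i, j] = np.sum(chunk * filt)
    return fmap

print(convolve(image, filter1))   # first row: [ 3 -3 -3  3]
print(convolve(image, filter2))   # first row: [-1 -1 -1 -1]
```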
Convolutional Layers
RGB images
[Figure: a 6 x 6 x 3 RGB image (3 channels) convolved with two 3 x 3 x 3 filters; each filter has one 3 x 3 slice per channel.]
Filters always extend the full depth of the input volume: for a 3-channel 6 x 6 x 3 image, each filter is 3 x 3 x 3.
Convolutional Layers – Parameters
• Key Parameters:
● Accepts an input of size W1 x H1 x D1
● Requires 4 hyperparameters:
○ Number of filters K
○ Size of the filters F
○ The stride S
○ The amount of zero padding P
● Common settings:
K: powers of 2, such as 32, 64, 128, 512
F = 3, S = 1, P = 1
F = 5, S = 1, P = 2
● Produces an output of size W2 x H2 x D2, where:
○ W2 = (W1 - F + 2P)/S + 1
○ H2 = (H1 - F + 2P)/S + 1
○ D2 = K
● With parameter sharing, it introduces F x F x D1 weights per filter, for a total of (F x F x D1) x K weights and K biases.
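These formulas can be checked with a few lines of Python; the example input and filter settings below are assumptions for illustration:

```python
def conv_layer_stats(w1, h1, d1, k, f, s, p):
    """Output size and learnable-parameter count of a convolutional layer."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    weights = (f * f * d1) * k   # parameter sharing: F x F x D1 weights per filter
    biases = k                   # one bias per filter
    return (w2, h2, k), weights + biases

# A 32 x 32 x 3 input with K = 32 filters, F = 3, S = 1, P = 1:
print(conv_layer_stats(32, 32, 3, k=32, f=3, s=1, p=1))
# ((32, 32, 32), 896)  ->  3*3*3*32 weights + 32 biases = 896 parameters
```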
Convolution vs Fully Connected
[Figure: the same 6 x 6 image processed two ways. Top: convolution with Filter 1 gives the 4 x 4 feature map (3, -3, -3, 3, …). Bottom: the image is flattened into a vector x1 … xN and fed to a fully connected layer.]
Convolution vs Fully Connected
[Figure: the flattened image as inputs 1 … 36; the first feature-map value, 3, is connected only to the 9 pixels under the filter (inputs 1, 2, 3, 7, 8, 9, 13, 14, 15).]
Instead of 36, only 9 inputs are connected to each output value.
Fewer parameters to learn!
Convolution vs Fully Connected
[Figure: the next feature-map value, -3, is computed from a shifted set of 9 pixels using the same 9 filter weights.]
The weights are shared between output cells.
Even fewer parameters to learn!
Max Pooling
• Down-sampling operation: pooling size = 2 x 2, stride = 2
• Operates over each feature map independently
• Invariant to small differences in the input

Convolving the 6 x 6 image with Filter 1 and Filter 2 gives two 4 x 4 feature maps; max pooling then keeps the largest value in each 2 x 2 window:

Feature map 1:        Feature map 2:
 3 -3 -3  3           -1 -1 -1 -1
 1 -2 -2  1           -1  0 -2 -1
 1 -2 -2  1           -3  0  0 -3
-1 -1 -1 -1            3 -1 -3 -1

After max pooling:
 3  3                  0 -1
 1  1                  3  0
Key Parameters:
● Accepts an input of size W1 x H1 x D1
● Requires 2 hyperparameters:
○ Size of the filters F
○ The stride S
● Common settings: F = 2, S = 2 or F = 3, S = 2
● Produces an output of size W2 x H2 x D2, where:
○ W2 = (W1 - F)/S + 1
○ H2 = (H1 - F)/S + 1
○ D2 = D1
● It introduces zero learnable parameters since it computes a fixed function of the input.
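A NumPy sketch of 2 x 2 max pooling with stride 2, reproducing the pooled result for feature map 1 above:

```python
import numpy as np

def max_pool(fmap, f=2, stride=2):
    out = (fmap.shape[0] - f) // stride + 1
    pooled = np.zeros((out, out), dtype=fmap.dtype)
    for i in range(out):
        for j in range(out):
            # keep only the maximum of each F x F window
            pooled[i, j] = fmap[i*stride : i*stride+f, j*stride : j*stride+f].max()
    return pooled

fmap1 = np.array([[ 3, -3, -3,  3],
                  [ 1, -2, -2,  1],
                  [ 1, -2, -2,  1],
                  [-1, -1, -1, -1]])
print(max_pool(fmap1))   # [[3 3]
                         #  [1 1]]
```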
Convolve, Pool, Repeat
[Figure: Convolution followed by Max Pooling, repeated several times.]
The output can be regarded as a new image:
• Smaller than the original image
• Its depth is the number of filters
Transfer Learning
• Features learned by CNNs on a large dataset can be helpful for other tasks. It is very common to pre-train a CNN on ImageNet and then use it as a fixed feature extractor or as a network initialisation.
• Feature extractor: remove the last layer and use the remaining network to extract representations from the hidden layers directly, which can then be utilised as features for other applications.
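A sketch of the feature-extractor recipe in Keras, assuming a ResNet50 backbone pre-trained on ImageNet (the backbone, input size, and head are illustrative choices, not specified in the slides):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Pre-trained backbone with its final classification layer removed
base = keras.applications.ResNet50(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3), pooling="avg")
base.trainable = False   # use it as a fixed feature extractor

# New task-specific head trained on the extracted representations
model = keras.Sequential([
    base,
    layers.Dense(10, activation="softmax"),   # e.g. a new 10-class task
])
```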