
Deep Learning & Convolutional Neural Networks
Intelligent Systems
ELE 4643
What is Deep Learning?

Artificial Intelligence:
• A broad concept where machines think and act more like humans

Machine Learning:
• An application of AI where machines use data to automatically improve at performing tasks

Deep Learning:
• A machine learning technique that processes data through a multi-layered neural network, much like the human brain
Deep Learning Animation
How neural networks learn | Deep learning
https://youtu.be/IHZwWFHWa-w
Backpropagation
• How do you train a multi-layer neural network's weights? How does it learn?
• The backpropagation algorithm uses gradient descent
• For each training step:
  • Compute the output error
  • Compute how much each neuron in the previous hidden layer contributed to that error
  • Back-propagate that error in a reverse pass
  • Tweak the weights to reduce the error using gradient descent
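As a rough illustration (not part of the original slides), here is a minimal NumPy sketch of these training steps for a single-hidden-layer network; the toy data, layer sizes, sigmoid activations and learning rate are all assumptions made for the example:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 samples, 3 features, one binary target each
X = rng.normal(size=(4, 3))
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Random initialization of the weights
W1 = rng.normal(scale=0.5, size=(3, 5))
W2 = rng.normal(scale=0.5, size=(5, 1))
lr = 0.1

for step in range(1000):
    # Forward pass
    h = sigmoid(X @ W1)          # hidden activations
    out = sigmoid(h @ W2)        # network output

    # Output error
    err = out - y

    # Reverse pass: how much each neuron contributed to the error
    d_out = err * out * (1 - out)          # delta at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)     # contribution of each hidden neuron

    # Tweak the weights to reduce the error using gradient descent
    W2 -= lr * h.T @ d_out
    W1 -= lr * X.T @ d_h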
Backpropagation

Information flow:

𝐲 = f_NN(𝐱) = o( 𝐖^(L) σ( 𝐖^(L−1) σ( … 𝐖^(1) 𝐱 ) ) )

[Figure: forward propagation (data stream) of the training data through the input layer, hidden layers and output layer; the output is compared with the desired output, the error is calculated, and the error stream is back-propagated in the reverse direction.]
Gradient Descent
Gradient descent measures the gradient (slope): the change in error caused by a change in a weight during neural network training.
Based on the gradient, the algorithm can tell whether each weight should be increased or decreased in order to move towards the direction where the slope flattens out and the error is minimized.
The gradient is obtained by calculating the partial derivative of the error function with respect to the weights and biases.
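A minimal sketch (a toy example of my own, not from the slides) of gradient descent on a one-weight error function E(w) = (w − 3)²; the starting weight and learning rate are arbitrary:

# Toy error function and its derivative with respect to the weight
def error(w):
    return (w - 3.0) ** 2

def gradient(w):
    # Partial derivative of the error: dE/dw = 2(w - 3)
    return 2.0 * (w - 3.0)

w = -5.0              # arbitrary starting weight
learning_rate = 0.1

for step in range(50):
    # Move the weight against the gradient, towards lower error
    w -= learning_rate * gradient(w)

print(w)  # approaches 3.0, where the slope flattens out and the error is minimized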
Gradient Descent Animation
Gradient Descent Potential Problems

[Figure: an error surface with the global minimum marked, illustrating potential problems during gradient descent: a) finding only a local minimum, b) near halting due to small gradients, c) oscillation in valleys, d) leaving a good minimum.]
Learning Flow
Step 1: Randomly initialise the weights/model
Step 2: Feed the inputs forward through the network
Step 3: Calculate the loss function by comparing the output with the desired output – the loss (error) metric
Step 4: Calculate the error derivative (the gradient for the last layer)
Step 5: Backpropagate to obtain the gradient for all layers
Step 6: Update the weights using the optimizer function
Step 7: Iterate until convergence
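A minimal sketch of this seven-step loop using PyTorch; the library choice, toy data, model shape and hyperparameters are assumptions for illustration, not taken from the slides:

import torch
import torch.nn as nn

# Toy data: 16 samples, 8 features, 3 classes
X = torch.randn(16, 8)
y = torch.randint(0, 3, (16,))

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 3))  # Step 1: random initialization
loss_fn = nn.CrossEntropyLoss()                                       # loss (error) metric
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)               # optimizer function

for epoch in range(100):            # Step 7: iterate until convergence
    outputs = model(X)              # Step 2: feed forward
    loss = loss_fn(outputs, y)      # Step 3: calculate the loss
    optimizer.zero_grad()
    loss.backward()                 # Steps 4-5: error derivative + backpropagate through all layers
    optimizer.step()                # Step 6: update the weights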
Learning

Overfitting refers to a model that fits the training data too well (memorizes the training data).
Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
Regularization is a set of techniques that can prevent overfitting in neural networks and thus improve the accuracy of a deep learning model when facing completely new data from the problem domain.
Learning

Regularization: Learning While Avoiding Overfitting

With thousands of weights to tune, overfitting is a problem. These are some techniques to avoid it:
• Early stopping (stop when validation performance starts dropping)
• Dropout – randomly ignore, say, 50% of all neurons at each training step (sketched below)
  • Works surprisingly well!
  • Forces your model to spread out its learning

There are other techniques too…
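As referenced above, a minimal NumPy sketch of (inverted) dropout; the 50% keep probability matches the slide, while the function name and the rescaling of the surviving neurons are my own choices:

import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, keep_prob=0.5, training=True):
    if not training:
        return activations       # at test time the layer is used unchanged
    # Randomly zero out neurons; scale the survivors so the expected
    # activation magnitude stays the same.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

hidden = rng.normal(size=(4, 8))  # pretend hidden-layer activations
print(dropout(hidden))            # roughly half the entries are zeroed at each step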


Learning and Overfitting

Overfitting: fitting the model too closely to the training data, so that performance on validation data departs from performance on training data.

[Figure: error curves during training. On the training set the error is still decreasing with familiar training data, while on the validation/testing set the error is starting to increase with new data – overfitting is starting to happen.]
How is this achieved?

• Role of Activation Functions
  – Good activation functions are nonlinear
    • Allow for selective correlation: increase or decrease how correlated the neuron is to all the other incoming signals
  – Other core properties
    • The function must be continuous and defined over an infinite domain
    • The function should be monotonic
      – i.e., no two input values have the same output value
    • The function and its derivative should be easily computable
      – Enables efficiency when training and deploying the network
Activation Functions

• Linear
  – f(x) = ax
  – Range: −∞ to ∞
• Sigmoid
  – σ(x)
  – Range: 0 to 1
• Hyperbolic Tangent
  – tanh(x)
  – Range: −1 to 1
• Rectified Linear Unit
  – ReLU(x) = max(0, x)
  – Range: 0 to ∞
• Leaky Rectified Linear Unit
  – LReLU(x) = x for x ≥ 0; 0.01x for x < 0
  – Range: −∞ to ∞
• Softmax
  – softmax(x)
  – Range: 0 to 1; used in the output classification layer
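For reference, minimal NumPy sketches of the activations listed above; the function names and the max-subtraction inside softmax (for numerical stability) are my own choices:

import numpy as np

def linear(x, a=1.0):
    return a * x                              # range: -inf to inf

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # range: 0 to 1

def tanh(x):
    return np.tanh(x)                         # range: -1 to 1

def relu(x):
    return np.maximum(0.0, x)                 # range: 0 to inf

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)     # range: -inf to inf

def softmax(x):
    # Subtract the max for numerical stability; the outputs sum to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))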
Activation Functions – ReLU (aka rectifier)

• Step functions don't work with gradient descent – there is no gradient!
  • Mathematically, they have no useful derivative.
• ReLU is common. Fast to compute and works well.
• Also: "Leaky ReLU", "Noisy ReLU"
  • ELU can sometimes lead to faster learning, though.

[Figure: plot of the ReLU function]
Loss Functions

• Regression
  – Predicting a single numerical value
  – Final activation: Linear
  – Loss function: Mean Squared Error
• Binary outcome
  – Data is or isn't a class
  – Final activation: Sigmoid
  – Loss function: Binary Cross Entropy
• Single label from multiple classes
  – Multiple classes which are exclusive
  – Final activation: Softmax
  – Loss function: Cross Entropy
• Multiple labels from multiple classes
  – If there are multiple labels in your data
  – Final activation: Sigmoid
  – Loss function: Binary Cross Entropy
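Minimal NumPy sketches of the loss functions named above; the function names and the small epsilon used for numerical stability are assumptions of the example:

import numpy as np

def mean_squared_error(y_true, y_pred):
    # Regression: paired with a linear final activation
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Binary / multi-label: paired with a sigmoid final activation
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    # Single label from multiple classes: paired with a softmax final activation
    y_pred_probs = np.clip(y_pred_probs, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(y_pred_probs), axis=1))

print(mean_squared_error(np.array([1.0, 2.0]), np.array([1.5, 1.5])))  # 0.25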
Tuning Your Topology
• Trial & error is one way
• Evaluate a smaller network with fewer neurons in the hidden layers
• Evaluate a larger network with more layers
• Try reducing the size of each layer as you progress – forming a funnel
• More layers can yield faster learning
• Or just use more layers and neurons than you need, and don't worry about it, because you use early stopping.
Students may experiment with an artificial neural network using Google's playground at the link below.

playground.tensorflow.org
Convolutional Neural Networks (CNN)
• What is a Convolutional Neural Network?
  – A CNN is a neural network with convolution operations instead of matrix multiplications in at least one of its layers
  – A special form of feed-forward network
    • Reuses the same neurons for repetitive convolution tasks
  – Convolution ≈ Filtering
    • Convolutional kernels recognise patterns in a signal
    • The same weights are reused to detect the same patterns in multiple places
    • This reduces overfitting and leads to much more accurate models
How Does a CNN Work?
CNN – Motivation

• Convolutional Neural Network (CNN)
  – Convolutional kernels perform feature extraction
CNN – Motivation

• When applying a fully connected feedforward neural network
  – Many thousands of weights and connections
  – However, the network can be simplified by considering the properties of the input signal

[Figure: an RGB image represented as 32×32×3 pixels is flattened into an input vector x1 … xN of dimension N = 32×32×3 = 3072 and fully connected to output classes (flower, bird, car, cat, dog, horse, ship, deer).]
CNN – Motivation

Property 1: Some patterns are much smaller than the whole image
Property 2: The same patterns appear in different regions
Property 3: Downsampling the pixels does not change the object

Typical pipeline: [Convolution → Max Pooling] (can be repeated many times) → Flatten → Fully Connected Feedforward Network → Prediction
Convolutional Layers

Apply small filters to detect small patterns. Here each filter has a size of 3 x 3.

6 x 6 image:
0 1 0 0 1 0
0 1 0 0 1 0
0 1 0 0 1 0
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0

Filter 1:
-1  1 -1
-1  1 -1
-1  1 -1

Filter 2:
 1 -1 -1
-1  1 -1
-1 -1  1

• Note: only the size of the filters is specified; the weights are initialised to arbitrary values before the start of training.
• The weights of the filters are learnt through the CNN training process.
Convolutional Layers

• Key Parameters
  – Filter size – defines the height and width of the filter kernel
    • E.g., a filter kernel of size 3 x 3 would have nine weights
  – Stride – determines the number of steps to move in each spatial direction while performing convolution
  – Padding – appends zeroes to the boundary of an image to control the size of the output of convolution
    • When we convolve an image of a specific size with a filter, the resulting image is generally smaller than the original image
Convolutional Layers

stride = 1

Slide Filter 1 over the 6 x 6 image one pixel at a time and compute the dot product between the filter and each small 3 x 3 chunk of the image. The first two chunks (one pixel apart) give the output values 3 and -3.
Convolutional Layers

stride = 2

With stride = 2 the filter moves two pixels at a time, so the first two dot products (3 and -3) come from 3 x 3 chunks that are two pixels apart. We set stride = 1 in the slides below.
Convolutional Layers

stride = 1

Sliding Filter 1 over the full 6 x 6 image with stride 1 produces a 4 x 4 output:

 3 -3 -3  3
 1 -2 -2  1
 1 -2 -2  1
-1 -1 -1 -1
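A minimal NumPy sketch of this computation (the image and Filter 1 are taken from the slides; the helper function itself is my own). Sliding the filter with stride 1 and taking dot products reproduces the 4 x 4 output shown above:

import numpy as np

image = np.array([
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
])

filter1 = np.array([
    [-1, 1, -1],
    [-1, 1, -1],
    [-1, 1, -1],
])

def convolve2d(img, kernel, stride=1):
    n, f = img.shape[0], kernel.shape[0]
    out_size = (n - f) // stride + 1
    out = np.zeros((out_size, out_size), dtype=int)
    for i in range(out_size):
        for j in range(out_size):
            # Dot product between the filter and a 3 x 3 chunk of the image
            chunk = img[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = np.sum(chunk * kernel)
    return out

print(convolve2d(image, filter1, stride=1))
# [[ 3 -3 -3  3]
#  [ 1 -2 -2  1]
#  [ 1 -2 -2  1]
#  [-1 -1 -1 -1]]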
Convolutional Layers

stride = 1, filter size = 3

6 x 6 image → 4 x 4 output:

 3 -3 -3  3
 1 -2 -2  1
 1 -2 -2  1
-1 -1 -1 -1

output size: (6 - 3) / 1 + 1 = 4
Convolutional Layers

For an N x N image and filter size F:

output size = (N - F) / stride + 1

For example, N = 6, F = 3:
stride = 1 -> (6 - 3)/1 + 1 = 4
stride = 2 -> (6 - 3)/2 + 1 = 2.5 (the filter does not fit evenly)
stride = 3 -> (6 - 3)/3 + 1 = 2
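A small helper (my own, not from the slides) implementing the output-size formula, including the optional zero padding introduced on the next slide:

def conv_output_size(n: int, f: int, stride: int = 1, padding: int = 0) -> int:
    """Output size of convolving an N x N image with an F x F filter."""
    span = n - f + 2 * padding
    if span % stride != 0:
        raise ValueError("the filter does not fit the image evenly with this stride")
    return span // stride + 1

print(conv_output_size(6, 3, stride=1))             # 4
print(conv_output_size(6, 3, stride=3))             # 2
print(conv_output_size(6, 3, stride=1, padding=1))  # 6 (zero padding preserves the size)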
Convolutional Layers

Zero-padding at the border (a border of zeroes is appended around the N x N image):

For example, N = 6, F = 3, stride = 1
• Without zero-padding: output size is (6 - 3)/1 + 1 = 4
• With a zero-padding border of 1 pixel: output size is (6 - 3 + 2x1)/1 + 1 = 6

The output size is then the same as the input!
Convolutional Layers

Zero-padding at the border:

In general, with stride = 1 and filters of size F x F, zero-padding with (F - 1)/2 pixels preserves the spatial size.

e.g.
F = 3 -> zero pad with a 1-pixel border
F = 5 -> zero pad with a 2-pixel border
F = 7 -> zero pad with a 3-pixel border
Convolutional Layers

stride = 1

Filter 1 detects a vertical line. Applying it across the 6 x 6 image produces the 4 x 4 feature map:

 3 -3 -3  3
 1 -2 -2  1
 1 -2 -2  1
-1 -1 -1 -1

The same pattern in different locations is detected with the same filter.
Convolutional Layers

stride = 1

Do the same process for every filter: each filter produces its own 4 x 4 feature map.

Feature map 1 (Filter 1):
 3 -3 -3  3
 1 -2 -2  1
 1 -2 -2  1
-1 -1 -1 -1

Feature map 2 (Filter 2):
-1 -1 -1 -1
-1  0 -2 -1
-3  0  0 -3
 3 -1 -3 -1

The output is a 4 x 4 image with a depth equal to the number of filters.
Convolutional Layers

RGB images: the input now has 3 channels (6 x 6 x 3), so each filter also has 3 channels (3 x 3 x 3) – filters always extend through the full depth of the input volume.
Convolutional Layers – Parameters

• Key Parameters:
  ● Accepts an input of size W1 x H1 x D1
  ● Requires 4 hyperparameters:
    ○ Number of filters K
    ○ Size of the filters F
    ○ The stride S
    ○ The amount of zero padding P
  ● Common settings:
    ○ K: powers of 2, such as 32, 64, 128, 512
    ○ F = 3, S = 1, P = 1
    ○ F = 5, S = 1, P = 2
  ● Produces an output of size W2 x H2 x D2, where:
    ○ W2 = (W1 - F + 2P)/S + 1
    ○ H2 = (H1 - F + 2P)/S + 1
    ○ D2 = K
  ● With parameter sharing, it introduces F x F x D1 weights per filter, for a total of (F x F x D1) x K weights and K biases.
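A minimal sketch (my own helper) of this bookkeeping: it returns the output volume W2 x H2 x D2 and the parameter count (F x F x D1) x K + K; the 32 x 32 x 3 example input is an assumption:

def conv_layer_shape(w1, h1, d1, k, f, s, p):
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    d2 = k
    weights = f * f * d1 * k   # F x F x D1 weights per filter, K filters
    biases = k                 # one bias per filter
    return (w2, h2, d2), weights + biases

# Example: 32x32x3 input, 64 filters of size 3x3, stride 1, padding 1
print(conv_layer_shape(32, 32, 3, k=64, f=3, s=1, p=1))
# ((32, 32, 64), 1792)  ->  3*3*3*64 + 64 = 1792 learnable parameters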
Convolution vs Fully Connected

[Figure: the 6 x 6 image convolved with Filter 1 yields the 4 x 4 feature maps (convolution path), compared with flattening the image into inputs x1 … xN that feed a fully connected layer.]
Convolution vs Fully Connected

[Figure: convolution viewed as a sparse layer. The 6 x 6 image is flattened into 36 inputs; each output value of the feature map (e.g., the 3 in the top-left corner) is connected to only 9 of the 36 inputs, with the filter entries as the weights. Fewer parameters to learn!]
Convolution vs Fully Connected

[Figure: two output cells of the feature map (3 and -3) are computed from different 9-input chunks of the flattened image using the same filter weights – the weights are shared between cells. Even fewer parameters to learn!]
Convolution vs Fully Connected

• The core idea
  – Instead of having a large, dense linear layer with a connection from every input to every output, we have lots of small convolutional layers, each usually with fewer inputs and a single output
  – The result is a smaller set of kernel predictions, which are used as input to the next layer
  – Convolutional layers usually have many kernels
Convolution vs Fully Connected

• The core idea
  – Each kernel learns a particular pattern and then searches for the existence of that pattern somewhere in the image
  – A single, small set of weights can train over a much larger set of training examples
  – This changes the ratio of weights to datapoints on which those weights are being trained
  – This has a powerful impact on the network, drastically reducing its ability to overfit to the training data and increasing its ability to generalise
Max Pooling

Pooling layers are usually placed after a convolutional layer. They provide a down-sampled version of the convolution output.

In this example, a 2 x 2 region is used as the input of the pooling. There are different types of pooling; the most used is max pooling.
Max Pooling

Pooling size = 2 x 2, stride = 2
• Operates over each feature map independently
• Invariant to small differences in the input

Filter 1:
-1  1 -1
-1  1 -1
-1  1 -1

Feature map 1:
 3 -3 -3  3
 1 -2 -2  1
 1 -2 -2  1
-1 -1 -1 -1

Filter 2:
 1 -1 -1
-1  1 -1
-1 -1  1

Feature map 2:
-1 -1 -1 -1
-1  0 -2 -1
-3  0  0 -3
 3 -1 -3 -1
Max Pooling

Applying convolution to the 6 x 6 image and then 2 x 2 max pooling gives:

From feature map 1:
3 3
1 1

From feature map 2:
 0 -1
 3  0

Each filter is a channel, so the result is 2 feature maps, each of size 2 x 2 – smaller and more manageable.
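A minimal NumPy sketch of 2 x 2 max pooling with stride 2 applied to feature map 1 from the slides; the helper function is my own:

import numpy as np

feature_map_1 = np.array([
    [ 3, -3, -3,  3],
    [ 1, -2, -2,  1],
    [ 1, -2, -2,  1],
    [-1, -1, -1, -1],
])

def max_pool(fmap, size=2, stride=2):
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w), dtype=fmap.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Take the maximum of each 2 x 2 region
            region = fmap[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = region.max()
    return out

print(max_pool(feature_map_1))
# [[3 3]
#  [1 1]]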
Max Pooling – Parameters

Key Parameters:
● Accepts an input of size W1 x H1 x D1
● Requires 2 hyperparameters:
  ○ Size of the filters F
  ○ The stride S
● Common settings: F = 2, S = 2 or F = 3, S = 2
● Produces an output of size W2 x H2 x D2, where:
  ○ W2 = (W1 - F)/S + 1
  ○ H2 = (H1 - F)/S + 1
  ○ D2 = D1
● It introduces zero learnable parameters since it computes a fixed function of the input.
Convolve, Pool, Repeat

Convolution → Max Pooling → Convolution → Max Pooling → … (can be repeated many times)

The output can be regarded as a set of new images:
• Smaller than the original images
• The depth of the new images is the number of filters
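A minimal PyTorch sketch of this convolve–pool–repeat pipeline; the channel counts, kernel sizes and the 32 x 32 RGB input are illustrative assumptions, not taken from the slides:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),   # convolution
    nn.ReLU(),
    nn.MaxPool2d(2),                              # max pooling: 32x32 -> 16x16
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # convolution (repeat)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # max pooling: 16x16 -> 8x8
    nn.Flatten(),                                 # flatten
    nn.Linear(64 * 8 * 8, 10),                    # fully connected feedforward layer
)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
print(model(x).shape)           # torch.Size([1, 10]) -> class scores / prediction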
Transfer Learning

What is Transfer Learning?
Transfer learning is a machine learning technique where a model trained on one task is re-purposed on a second, related task.
Transfer Learning with CNNs

• Features learned by CNNs on a large dataset can be helpful for other tasks. It is very common to pre-train a CNN on ImageNet and then use it either as a fixed feature extractor or as a network initialisation.
• Feature extractor: remove the last layer and then use the remaining network to extract representations from the hidden layers directly, which can then be utilised as features for other applications.
• Network initialisation: use the pre-trained network and continue its training on your own data, fine-tuning the weights for your own specific problem as the training becomes progressively specific to the details of the problem.
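A minimal sketch of both options using a torchvision ResNet-18 pre-trained on ImageNet; the choice of ResNet-18, the 5-class head and the learning rate are assumptions for illustration:

import torch
import torch.nn as nn
from torchvision import models

# Pre-trained on ImageNet (weights API of recent torchvision versions)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Option 1 – fixed feature extractor: freeze all weights and replace the last layer
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 5)   # new head for a 5-class problem

# Option 2 – network initialisation / fine-tuning: skip the freezing loop above
# and train the whole network on your own data at a small learning rate
optimizer = torch.optim.SGD(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3
)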