7 Deep Learning
Neural Network (Deep Learning)
[Figure: a feed-forward neural network with inputs x1–x4 (Layer L1), hidden layers L2 and L3, and an output layer L4.]
What is Deep Learning (DL)?
A subfield of machine learning concerned with learning representations of data.
Exceptionally effective at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers.
If you provide the system with tons of information, it begins to understand it and respond in useful ways.
Why is DL useful?
o Manually designed features are often over-specified, incomplete, and take a long time to design and validate.
o Learned features are easy to adapt and fast to learn.
o Deep learning provides a very flexible, (almost) universal, learnable framework for representing world, visual, and linguistic information.
o Enables effective end-to-end joint system learning.
o Can utilize large amounts of training data.
What exactly is deep learning?
[Figure: performance vs. amount of data. Most learning algorithms plateau as data grows, while new AI methods (deep learning) keep improving with more data.]
[Figure: pipeline from data to tasks. A learning algorithm consumes features (e.g. features for vision, computed from images or audio); feature learning replaces manually designed features with a learned representation.]
Dropout: some units are randomly omitted from the input/hidden layers during training.
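The dropout idea can be sketched in a few lines of NumPy (a toy illustration, not any particular framework's API): units are zeroed at random with keep probability p, and the survivors are rescaled ("inverted dropout") so the expected activation is unchanged at training time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, keep_prob=0.8):
    """Inverted dropout: randomly zero units, then rescale the survivors
    by 1/keep_prob so the expected activation is unchanged."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

h = np.ones(10)                      # toy hidden-layer activations
h_dropped = dropout(h, keep_prob=0.8)
# each unit is either dropped (0) or rescaled (1 / 0.8 = 1.25)
```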
Deep Learning: Vanishing Gradient
If a logistic activation function f is used, the weight changes are proportional to f(x)·(1 − f(x)).
This factor is also propagated back, but cannot be larger than ¼.
⇒ The gradient tends to vanish if many layers are backpropagated through.
Learning in the early hidden layers can become very slow [Hochreiter 1991].
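The ¼ bound can be checked numerically (a small NumPy sketch, assuming nothing beyond the logistic function itself):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# evaluate f'(x) = f(x)(1 - f(x)) on a fine grid
x = np.linspace(-10, 10, 10001)
fprime = sigmoid(x) * (1 - sigmoid(x))
# the maximum is at x = 0, where f(0) = 1/2 and so f'(0) = 1/4
```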
Deep Learning: Vanishing Gradient (cont.)
The logistic activation function is a contracting function:
(obvious from the fact that its derivative is always < 1; actually ≤ ¼).
• If several logistic functions are chained, these contractions combine and yield an even stronger contraction of the input range.
• As a consequence, a rather large change of the input values will produce only a rather small change in the output values, and the more so, the more logistic functions are chained together.
• Therefore the function that maps the inputs of a multilayer perceptron to its outputs usually becomes flatter the more layers the multilayer perceptron has.
• Consequently the gradient in the first hidden layer (where the inputs are processed) becomes smaller.
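The chained-contraction effect is easy to demonstrate numerically (a minimal sketch): repeatedly applying the logistic function to an interval shrinks its range at every step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 1001)   # an input interval of width 10
y = x.copy()
ranges = []
for depth in range(5):          # chain 5 logistic functions
    y = sigmoid(y)
    ranges.append(y.max() - y.min())
# each application contracts the output range further,
# mirroring how deeper networks of logistic units become flatter
```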
Convolutional Neural Networks (CNN)
We know it is good to learn a small model.
Starting from a fully connected model: do we really need all the connections? Can some of them be shared?
Deep Learning: Convolutional Neural Networks (CNNs)
It is advantageous if the features constructed in the hidden layers are not localized to a specific part of the image.
• Special form of deep learning multilayer perceptron: the convolutional neural network.
• Inspired by the human retina, where sensory neurons have a receptive field, that is, a limited region in which they respond to a (visual) stimulus.
• Each neuron of the (first) hidden layer is connected to a small number of input neurons that refer to a contiguous region of the input image.
• Connection weights are shared: the same network is evaluated at different locations, as the input field is "moved" step by step over the whole image.
• This is equivalent to a convolution with a small-size kernel.
Convolutional Neural Networks (CNNs)
Consider learning from an image:
Some patterns are much smaller than the whole image, e.g. a "beak" detector.
The same pattern appears in different places (an "upper-left beak" detector, a "middle beak" detector), so the detectors can be compressed: train a number of such "small" detectors and let each detector "move around" the image.
A convolutional layer
A filter acts as a beak detector; the filter weights are the network parameters to be learned.

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:        Filter 2:
 1 -1 -1         -1  1 -1
-1  1 -1         -1  1 -1
-1 -1  1         -1  1 -1
…
Each filter detects a small (3 x 3) pattern.
Convolution, stride = 1
Place Filter 1
 1 -1 -1
-1  1 -1
-1 -1  1
on the top-left 3 x 3 patch of the 6 x 6 image and take the dot product: the result is 3. Moving the filter one step to the right gives -1.
Convolution, stride = 2
With stride = 2 the filter moves two positions at a time: the first row of results for Filter 1 on the 6 x 6 image is 3, -3.
Convolution, stride = 1 (all positions)
Sliding Filter 1 over the whole 6 x 6 image with stride = 1 yields a 4 x 4 feature map:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
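The feature map above can be reproduced with a direct NumPy implementation of the sliding-window dot product (a minimal sketch; real frameworks use optimized kernels):

```python
import numpy as np

image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])
filter1 = np.array([
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
])

def conv2d(img, kernel, stride=1):
    """Valid convolution: slide the kernel over the image and take the
    dot product of kernel and image patch at each position."""
    k = kernel.shape[0]
    n = (img.shape[0] - k) // stride + 1
    out = np.empty((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            patch = img[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * kernel)
    return out

feature_map = conv2d(image, filter1, stride=1)
# -> [[ 3, -1, -3, -1],
#     [-3,  1,  0, -3],
#     [-3, -3,  0,  1],
#     [ 3, -2, -2, -1]]
# with stride=2, the first row of the 2 x 2 output is 3, -3
```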
Convolution, stride = 1, Filter 2
Repeat this for each filter. Filter 2
-1  1 -1
-1  1 -1
-1  1 -1
gives a second 4 x 4 feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
The two 4 x 4 feature maps together form a 2 x 4 x 4 matrix.
Color image: RGB, 3 channels
For a color image, each filter is a 3 x 3 x 3 cube: one 3 x 3 weight slice per channel (R, G, B). The 6 x 6 image likewise has three channels, and at each filter position the per-channel products are summed into a single feature-map value.
https://www.researchgate.net/post/How_will_channels_RGB_effect_convolutional_neural_network
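A minimal sketch of multichannel convolution; since the slide's RGB pixel values are not recoverable, a random toy image is used here. The point is that the per-channel products are summed, so a 3-channel filter still produces a single 2-D feature map.

```python
import numpy as np

rng = np.random.default_rng(1)
image_rgb = rng.integers(0, 2, size=(6, 6, 3))    # toy 6 x 6 RGB image
kernel_rgb = rng.integers(-1, 2, size=(3, 3, 3))  # one 3x3 slice per channel

def conv2d_multichannel(img, kernel):
    """Each output value sums the products over ALL channels, collapsing
    the channel dimension into one 2-D feature map."""
    k = kernel.shape[0]
    n = img.shape[0] - k + 1
    out = np.empty((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sum(img[i:i+k, j:j+k, :] * kernel)
    return out

fmap = conv2d_multichannel(image_rgb, kernel_rgb)  # 4 x 4, not 4 x 4 x 3
```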
Convolution vs. Fully Connected
Convolving the 6 x 6 image with Filter 1 can be viewed as a special case of a fully connected layer:
• Flatten the image into 36 inputs x1 … x36; a fully connected layer would connect every input to every output.
• The first feature-map value (3) is connected to only 9 of the 36 inputs (x1, x2, x3, x7, x8, x9, x13, x14, x15), weighted by the 9 filter entries, with the remaining 27 weights fixed at zero: fewer parameters!
• The second feature-map value (-1) reuses the same 9 weights on the inputs shifted by one position (x2, x3, x4, x8, x9, x10, x14, x15, x16): shared weights, so even fewer parameters.
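The parameter-sharing argument can be checked numerically (a sketch using the 6 x 6 image and Filter 1 from the earlier slides):

```python
import numpy as np

image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])
filter1 = np.array([[1, -1, -1], [-1, 1, -1], [-1, -1, 1]])

x = image.flatten()                  # inputs x1 ... x36

# Fully connected weight rows for the first two output neurons:
# only the 9 positions covered by the 3x3 filter get a (shared) weight,
# the other 27 weights stay at zero.
w_first = np.zeros(36)
w_second = np.zeros(36)
for r in range(3):
    for c in range(3):
        w_first[6*r + c] = filter1[r, c]       # filter at the top-left corner
        w_second[6*r + c + 1] = filter1[r, c]  # same weights, shifted by one

out_first = w_first @ x    # -> 3, the first feature-map value
out_second = w_second @ x  # -> -1, the second value, from the SAME 9 weights
```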
The whole CNN
Input image → Convolution → Max Pooling → (can repeat many times) → Convolution → Max Pooling → Flattened → Fully Connected Feedforward network → output (cat, dog, …)
Max Pooling
Convolving the image with Filter 1 and Filter 2 gives the two 4 x 4 feature maps:
Filter 1:            Filter 2:
 3 -1 -3 -1          -1 -1 -1 -1
-3  1  0 -3          -1 -1 -2  1
-3 -3  0  1          -1 -1 -2  1
 3 -2 -2 -1          -1  0 -4  3
Why Pooling
Subsampling pixels does not change the object: a subsampled bird is still a bird. Subsampling yields a new, smaller image.
Max pooling keeps the maximum of each 2 x 2 block, so each 4 x 4 feature map becomes a 2 x 2 image, and each filter contributes one channel:
Filter 1 pooled:   Filter 2 pooled:
3 0                -1 1
3 1                 0 3
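Max pooling on the two feature maps can be sketched directly (a minimal NumPy illustration of the 2 x 2 block maximum):

```python
import numpy as np

fmap1 = np.array([[3, -1, -3, -1], [-3, 1, 0, -3], [-3, -3, 0, 1], [3, -2, -2, -1]])
fmap2 = np.array([[-1, -1, -1, -1], [-1, -1, -2, 1], [-1, -1, -2, 1], [-1, 0, -4, 3]])

def max_pool2x2(m):
    """Take the maximum over each non-overlapping 2x2 block."""
    n = m.shape[0] // 2
    return np.array([[m[2*i:2*i+2, 2*j:2*j+2].max() for j in range(n)]
                     for i in range(n)])

pooled1 = max_pool2x2(fmap1)   # -> [[3, 0], [3, 1]]
pooled2 = max_pool2x2(fmap2)   # -> [[-1, 1], [0, 3]]
```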
The whole CNN (after one round)
Convolution followed by max pooling turns the 6 x 6 image into a new 2 x 2 image with one channel per filter:
3 0      -1 1
3 1       0 3
This can be repeated many times; each round produces a new image smaller than the previous one.
Flattening
The pooled 2 x 2 x 2 output is flattened into a single vector of 8 values, which is fed into a fully connected feedforward network.
CNN in speech recognition
A spectrogram can be treated as an image, with time along one axis, so the CNN techniques above apply.

Recurrent Neural Networks (RNNs)
Notice: the same function and the same set of parameters are used at every time step.
[Figure: RNN architectures, including one-to-many and encoder–decoder; e.g. video classification on the frame level.]
RNN forward pass
At each step t: a<t> = tanh(Waa a<t−1> + Wax x<t> + ba), and ŷ<t> = g(Wya a<t> + by).
Basic RNN backward pass
Gradients are backpropagated through time, from the last time step back to the first, through the shared parameters.
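A minimal NumPy sketch of the forward pass (toy sizes; the weight names Wax, Waa, Wya follow the common a/x/y convention, and the output is left as pre-softmax scores):

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, n_y, T = 3, 4, 2, 5   # input, hidden, output sizes; time steps

# the SAME parameters are used at every time step (weight sharing over time)
Wax = rng.standard_normal((n_h, n_x)) * 0.1
Waa = rng.standard_normal((n_h, n_h)) * 0.1
Wya = rng.standard_normal((n_y, n_h)) * 0.1
ba, by = np.zeros(n_h), np.zeros(n_y)

def rnn_forward(xs):
    a = np.zeros(n_h)                         # a<0> = 0
    outputs = []
    for x in xs:                              # one step per input
        a = np.tanh(Waa @ a + Wax @ x + ba)   # a<t> = tanh(Waa a<t-1> + Wax x<t> + ba)
        y = Wya @ a + by                      # scores for y-hat<t>
        outputs.append(y)
    return np.array(outputs)

ys = rnn_forward(rng.standard_normal((T, n_x)))   # one output per time step
```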
Bidirectional RNN
In the discussed RNN architecture, the current output ŷ<t> depends only on the previous inputs and activations.
Consider the example 'He said, "Teddy Roosevelt was a great president"'. Here "Teddy" is a person's name, but we know that from the word "president" that comes after "Teddy", not from "He" and "said" that come before it.
So a limitation of the discussed architecture is that it cannot learn from elements later in the sequence. To address this problem, Bidirectional RNNs (BRNNs) were introduced.
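The bidirectional idea can be sketched with scalar states and a made-up step function (an illustration only, not a trained model): one RNN reads the sequence left to right, another right to left, and the output at each step can use both states, i.e. both past and future context.

```python
import numpy as np

def brnn_forward(xs, step_fwd, step_bwd):
    """Run one recurrence left-to-right and another right-to-left;
    ŷ<t> can then be computed from both states at position t."""
    a_f, a_b = 0.0, 0.0
    fwd, bwd = [], []
    for x in xs:                  # forward pass over the sequence
        a_f = step_fwd(a_f, x)
        fwd.append(a_f)
    for x in reversed(xs):        # backward pass over the sequence
        a_b = step_bwd(a_b, x)
        bwd.append(a_b)
    bwd.reverse()                 # align backward states with positions
    return list(zip(fwd, bwd))

# toy recurrence: a<t> = tanh(0.5 a<t-1> + x<t>)
step = lambda a, x: float(np.tanh(0.5 * a + x))
states = brnn_forward([1.0, 2.0, 3.0], step, step)
```

Note that the forward state at t = 1 has seen only x1, while the backward state at t = 1 has seen the whole rest of the sequence; combining them is what lets a BRNN use later words such as "president".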