Deep Learning Tutorial
李宏毅 (Hung-yi Lee)
Framework
Image Recognition: f(image) = "cat"
Step 1: a set of functions — a Model: f1, f2, ……
e.g. f1(cat image) = "cat"   f2(cat image) = "monkey"
     f1(dog image) = "dog"   f2(dog image) = "snake"
Framework
Image Recognition: f(image) = "cat"
Step 2: goodness of a function f — decide which function in the Model is better.
Framework — Supervised Learning
Training: use training data (images labeled "monkey", "cat", "dog") to pick the best function from the Model (Step 1).
Testing: apply the chosen function to a new image: f(image) = "cat".
Three Steps for Deep Learning
Neural Network — Neuron: a simple function

z = a1 w1 + … + ak wk + … + aK wK + b
a = σ(z)

w1 … wK: weights,  b: bias,  σ: activation function

Sigmoid Function: σ(z) = 1 / (1 + e^(−z))
Example: inputs (2, −1), weights (1, −1), bias 1:
z = 2·1 + (−1)·(−1) + 1 = 4,  a = σ(4) ≈ 0.98
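As a quick check, the same computation in numpy (a minimal sketch; numpy code is not part of the slides):

```python
import numpy as np

def sigmoid(z):
    # sigmoid activation: squashes any z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# the neuron from the slide: inputs (2, -1), weights (1, -1), bias 1
a = np.array([2.0, -1.0])
w = np.array([1.0, -1.0])
b = 1.0
z = np.dot(w, a) + b      # 2*1 + (-1)*(-1) + 1 = 4
print(sigmoid(z))         # ~0.98
```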
Neural Network
Different connections lead to different network structures.
Each neuron computes z = Σ w a + b and outputs a = σ(z).
Input Layer (x1 …… xN) → Hidden Layers → Output Layer (y1 …… yM)
Deeper is better? ImageNet error rates (source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf):

AlexNet (2012), 8 layers: 16.4%
VGG (2014), 19 layers: 7.3%
GoogleNet (2014): 6.7%
Residual Net (2015), special structure: 3.57%

(The slide compares the depth of Residual Net to Taipei 101.)
Example Application
Input: a 16 × 16 image as a 256-dim vector x1 … x256 (ink → 1, no ink → 0).
Output: a 10-dim vector; each dimension represents the confidence of a digit, e.g. y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), …, y10 = 0.2 ("is 0").
Here y2 is the largest, so the image is "2".
Example Application
• Handwriting Digit Recognition
The neural network is a machine: input x1 … x256, outputs y1 ("is 1") … y10 ("is 0") → "2".
What is needed is a function with a 256-dim vector as input and a 10-dim vector as output.
Example Application
Handwriting Digit Recognition with a deep network:
Input Layer (x1 … x256) → Layer 1 → Layer 2 → …… → Layer L → Output Layer (y1 "is 1", y2 "is 2", …, y10 "is 0")
A network structure defines a function set containing the candidates for handwriting digit recognition.
The learning target
The output layer uses Softmax. Input: a 16 × 16 = 256-dim vector (ink → 1, no ink → 0).
For an image of "1", the target is that y1 has the maximum value:
y1 → 1, y2 → 0, …, y10 → 0
(given a set of parameters, the network maps x1 … x256 to y1 … y10)
Over the whole training set, each example x^r fed to the NN gives an output y^r that should match its target (r = 1, …, R).
(Toolkit example: libdnn, by NTU student Po-wei Chou. Ref: https://www.youtube.com/watch?v=ibJpTrp5mcE)
Three Steps for Deep Learning
• Step 1: define a set of functions
• Step 2: goodness of function
• Step 3: pick the best function
Deep Learning is so simple ……
Keras
TensorFlow and Theano are very flexible but need some effort to learn; Keras is an interface for them that is easy to learn and use.
Example network for handwriting digit recognition (28 × 28 input, parameters randomly initialized):
input 28 × 28 → Dense 500 → Dense 500 → Softmax → y1 y2 …… y10
Keras
Step 3.1: Configuration — set the loss and the optimizer (e.g. SGD with learning rate 0.1).
Step 3.2: Find the optimal network parameters.
Training data: images as 28 × 28 = 784-dim vectors, labels as 10-dim vectors.
After training — case 1: evaluate accuracy on a test set; case 2: predict classes for new images.
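Putting the three steps together, a minimal sketch in today's Keras API (the lecture used the older Keras 1 / Theano syntax; the layer sizes and the SGD learning rate follow the slides, the rest is a filled-in assumption):

```python
import numpy as np
from tensorflow import keras

# Step 1: define a set of functions -- the network structure
model = keras.Sequential([
    keras.layers.Dense(500, activation='sigmoid', input_shape=(784,)),
    keras.layers.Dense(500, activation='sigmoid'),
    keras.layers.Dense(10, activation='softmax'),
])

# Step 2 + 3.1: goodness of function (loss) and configuration (optimizer)
model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.SGD(learning_rate=0.1),
              metrics=['accuracy'])

# Step 3.2: find the optimal network parameters
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255   # 784-dim vectors
x_test = x_test.reshape(-1, 784).astype('float32') / 255
y_train = keras.utils.to_categorical(y_train, 10)            # 10-dim labels
y_test = keras.utils.to_categorical(y_test, 10)
model.fit(x_train, y_train, batch_size=100, epochs=20)

score = model.evaluate(x_test, y_test)   # case 1: evaluation
pred = model.predict(x_test)             # case 2: prediction
```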
Using GPU to speed up training
• THEANO_FLAGS=device=gpu0 python YourCode.py
Live Demo
• You can find the code for demo today at the
following link:
• http://speech.ee.ntu.edu.tw/~tlkagk/DL_tutorial/DeepLecture_HelloWorld.py
Lesson we learned ...
(image source: http://ent.ltn.com.tw/news/breakingnews/1144545)
• Step 1: define a set of functions
• Step 2: goodness of function
• Step 3: pick the best function
→ Good results on training data? NO: go back to the three steps.
→ YES: Good results on testing data? NO: Overfitting!
Do not always blame Overfitting
It is overfitting only when results are good on training data but bad on testing data. Different approaches for different problems: e.g. "dropout" is for good results on testing data, not for bad results on training data.
Recipe of Deep Learning — good results on training data? Tips: choosing proper loss, mini-batch, new activation function, adaptive learning rate and momentum.
Choosing Proper Loss
For an image of "1", the target for the Softmax outputs y1 … y10 is ŷ1 = 1, ŷ2 = 0, …, ŷ10 = 0.
Square Error: Σᵢ (yᵢ − ŷᵢ)²  (= 0 when the output matches the target)
Cross Entropy: −Σᵢ ŷᵢ ln yᵢ  (= 0 when the output matches the target)
Which one is better?
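A small numpy illustration of the two losses on one example (the softmax output values here are made up for illustration):

```python
import numpy as np

y_hat = np.zeros(10); y_hat[0] = 1.0          # target for "1": (1, 0, ..., 0)
y = np.array([0.7, 0.1, 0.05, 0.05, 0.02,
              0.02, 0.02, 0.02, 0.01, 0.01])  # some softmax output (sums to 1)

square_error  = np.sum((y - y_hat) ** 2)      # = 0 only when y == y_hat
cross_entropy = -np.sum(y_hat * np.log(y))    # = 0 only when y == y_hat
print(square_error, cross_entropy)            # ~0.107, ~0.357
```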
Let's try it — testing accuracy:
Square Error: 0.11
Cross Entropy: 0.84
(Training figure: accuracy over training for cross entropy vs. square error.)
Choosing Proper Loss
When using a softmax output layer, choose cross entropy.
(Figure: total loss surface over parameters w1, w2 — cross entropy stays steep away from the minimum, while square error is flat there, so gradient descent makes progress with cross entropy. See http://jmlr.org/proceedings/papers/v9/glorot10a/)
Recipe of Deep Learning — next tip: Mini-batch.
Mini-batch: we do not really minimize total loss!
➢ Randomly initialize network parameters
➢ Pick the 1st mini-batch, compute its loss, update parameters
➢ Pick the 2nd mini-batch, compute its loss, update parameters
➢ ……
➢ Until all mini-batches have been picked: one epoch
Repeat the process above, e.g. 20 times (20 epochs).
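A sketch of one epoch of mini-batch training in numpy (the `update` gradient step is a hypothetical placeholder, not from the slides):

```python
import numpy as np

def one_epoch(x, y, params, update, batch_size=100):
    """Run one epoch: every mini-batch is picked exactly once."""
    order = np.random.permutation(len(x))        # shuffle each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]    # pick the next mini-batch
        # update parameters using the loss of THIS mini-batch only
        params = update(params, x[idx], y[idx])
    return params

# repeat, e.g., 20 times:
# for _ in range(20):
#     params = one_epoch(x_train, y_train, params, update)
```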
We do not really minimize total loss — the loss of a single mini-batch is an unstable estimate of it!
Yet mini-batch training is much faster. Accuracy after 1 epoch:
Mini-batch: 0.84
No batch: 0.12
(Training figure: accuracy vs. epoch for mini-batch and no batch.)
Shuffle the training examples for each epoch: in Epoch 1 and Epoch 2 the same examples (x1, x31, x2, ……) are grouped into different mini-batches.
Don't worry — this is the default of Keras.
Recipe of Deep Learning — next tip: New activation function.
Hard to get the power of Deep …
With sigmoid activations, deeper does not always mean better: the gradient vanishes.
A large change near the input (x1 … xN) causes only a small change of the output (y1 … yM), because each sigmoid layer attenuates the signal; layers near the input therefore receive small gradients and learn very slowly.
Intuitive way to compute the derivatives: ∂l/∂w ≈ Δl/Δw — perturb a weight w near the input and observe how little the loss changes.
ReLU: a = z if z > 0, a = 0 if z ≤ 0.
Neurons with output 0 contribute nothing and can be removed, leaving a thinner linear network from input (x1, x2) to output (y1, y2).
The active part of the network is linear, so it does not have smaller gradients.
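ReLU itself is one line; in Keras you would just swap the activation string (a sketch, not the lecture's demo code):

```python
import numpy as np

def relu(z):
    # a = z for z > 0, a = 0 otherwise
    return np.maximum(0.0, z)

# the gradient is 1 wherever the neuron is active, so it is not attenuated
print(relu(np.array([-2.0, 0.5, 3.0])))   # [0.  0.5 3. ]

# in Keras: Dense(500, activation='relu') instead of activation='sigmoid'
```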
Let's try it — 9 layers, testing accuracy:
Sigmoid: 0.11
ReLU: 0.96
(Training figure: training accuracy curves for ReLU vs. sigmoid.)
Recipe of Deep Learning — next tip: Adaptive learning rate and momentum.
Momentum — in the physical world ……
A ball rolling on the error surface keeps moving even where 𝜕𝐿∕𝜕𝑤 = 0 (a plateau or local minimum): its real movement is the negative gradient plus the momentum of the last step.
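A minimal numpy sketch of the momentum update on a toy one-dimensional loss (the quadratic loss and the coefficients are made-up examples):

```python
def grad(w):
    # gradient of the toy loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0        # parameter and accumulated "movement"
lr, lam = 0.1, 0.9     # learning rate, momentum coefficient
for _ in range(100):
    v = lam * v - lr * grad(w)   # movement = last movement + effect of gradient
    w += v                       # keeps moving even where grad(w) is ~0
print(w)                         # converges near 3.0
```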
Adam = adaptive learning rate + momentum
Let's try it — testing accuracy (ReLU, 3 layers):
Original: 0.96
Adam: 0.97
(Training figure: Adam converges faster.)
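Switching the model from the earlier sketch to Adam is a one-line change in the compile step ('adam' is a built-in Keras optimizer name):

```python
model.compile(loss='categorical_crossentropy',
              optimizer='adam',            # adaptive learning rate + momentum
              metrics=['accuracy'])
```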
Recipe of Deep Learning — good results on testing data? Tips: regularization, dropout, network structure.
Why Overfitting?
• Training data and testing data can be different.
Handwriting recognition example: created training data = original training data shifted by 15°.
Recipe of Deep Learning — next tip: Dropout.
Dropout
Training: before each parameter update, each neuron has some probability of being dropped out, so each update trains a thinner network!
(Image sources: http://big5.xinhuanet.com/gate/big5/news.xinhuanet.com/sports/2012-07/03/c_123363695.htm, https://www.youtube.com/watch?v=pn5dP9s9yiM; idea from Prof. Min Sun)
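In Keras, dropout is a layer inserted between the dense layers (a sketch; the 0.5 drop rate is a common default, not taken from the slides):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(500, activation='relu', input_shape=(784,)),
    keras.layers.Dropout(0.5),   # each neuron dropped with prob. 0.5 (training only)
    keras.layers.Dense(500, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation='softmax'),
])
# Keras automatically disables dropout at testing time.
```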
Recipe of Deep Learning — next tip: Network structure.
CNN is a very good example! (next lecture)
Lecture III: Convolutional Neural Network (CNN)
Why CNN for Image?
• When processing an image, the first layer of a fully connected network would be very large.
Example: a 100 × 100 × 3 image flattened into a 30,000-dim vector, feeding a first hidden layer of 1000 neurons, already needs 3 × 10⁷ weights before the Softmax.
Can the fully connected network be simplified by considering the properties of image recognition?
Why CNN for Image
• Some patterns are much smaller than the whole image.
A neuron does not have to see the whole image to discover the pattern — e.g. a "beak" detector. Connecting to a small region needs fewer parameters.
Why CNN for Image
• The same patterns appear in different regions — an "upper-left beak" detector and a "middle beak" detector do almost the same job.
Why CNN for Image
• Subsampling the pixels will not change the object: a subsampled bird is still a bird.
The whole CNN
image → Convolution → Max Pooling → Convolution → Max Pooling (can repeat many times) → Flatten → Fully Connected Feedforward network → output ("cat", "dog", ……)

Property 1 (some patterns are much smaller than the whole image) and Property 2 (the same patterns appear in different regions) motivate Convolution; Property 3 (subsampling the pixels will not change the object) motivates Max Pooling.
CNN – Convolution
A convolution layer has a set of filters (Filter 1, Filter 2, …… — here 3 × 3 matrices); the values in the matrices are learned from training data.
Each filter detects a small 3 × 3 pattern in the 6 × 6 image (Property 1).
CNN – Convolution (Filter 1, stride = 1)

Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1

6 × 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Result (4 × 4):
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

The two 3s show the same diagonal pattern detected in two different regions (Property 2).
CNN – Convolution (Filter 2, stride = 1) — do the same process for every filter.

Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1

Result (4 × 4):
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

Together, the 4 × 4 outputs of all filters form the Feature Map of the 6 × 6 image.
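The feature maps above can be reproduced with a few lines of numpy (a sketch of the stride-1 convolution as used on the slide, i.e. cross-correlation without filter flipping):

```python
import numpy as np

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]])
filter1 = np.array([[ 1,-1,-1],
                    [-1, 1,-1],
                    [-1,-1, 1]])
filter2 = np.array([[-1, 1,-1],
                    [-1, 1,-1],
                    [-1, 1,-1]])

def conv(img, filt):
    # stride-1 "valid" convolution: slide the 3x3 filter over the 6x6 image
    n = img.shape[0] - filt.shape[0] + 1      # 6 - 3 + 1 = 4
    return np.array([[np.sum(img[i:i+3, j:j+3] * filt)
                      for j in range(n)] for i in range(n)])

print(conv(image, filter1))   # the 3s appear top-left and bottom-left
print(conv(image, filter2))
```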
CNN – Colorful image
A colorful image has 3 channels (RGB), so the 6 × 6 image becomes a 6 × 6 × 3 tensor, and each filter becomes a 3 × 3 × 3 cube (one 3 × 3 slice per channel — the slide shows Filter 1 and Filter 2 stacked three times); convolution sums over all channels.
Convolution v.s. Fully Connected
Convolving the 6 × 6 image with a filter can be seen as a fully-connected layer: flatten the image into inputs x1 …… x36; convolution is then a fully-connected layer with most connections removed.
Filter 1 as a neuron: flatten the 6 × 6 image into 36 pixels (pixel 1 = 1, pixel 2 = 0, pixel 3 = 0, …).
The neuron for the upper-left position connects only to 9 inputs — pixels 1, 2, 3, 7, 8, 9, 13, 14, 15 — with the filter values as weights, and outputs 3.
Fewer parameters! Only connected to 9 inputs, not fully connected.
The neuron for the next position connects to pixels 2, 3, 4, 8, 9, 10, 14, 15, 16 and outputs −1 — using the same 9 weights (shared weights).
Fewer parameters — and even fewer with weight sharing!
CNN – Max Pooling
Take the two 4 × 4 outputs of Filter 1 and Filter 2:
 3 -1 -3 -1     -1 -1 -1 -1
-3  1  0 -3     -1 -1 -2  1
-3 -3  0  1     -1 -1 -2  1
 3 -2 -2 -1     -1  0 -4  3
Group each into 2 × 2 regions and keep the maximum of each region.
CNN – Max Pooling
6 × 6 image → Conv → 4 × 4 maps → Max Pooling → a new image, but smaller (2 × 2):
Filter 1 channel:  3 0 / 3 1
Filter 2 channel: -1 1 / 0 3
Each filter is a channel.
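The pooled channels can be checked the same way (a sketch; `fmap1`/`fmap2` are the 4 × 4 filter outputs computed above):

```python
import numpy as np

fmap1 = np.array([[ 3,-1,-3,-1], [-3, 1, 0,-3], [-3,-3, 0, 1], [ 3,-2,-2,-1]])
fmap2 = np.array([[-1,-1,-1,-1], [-1,-1,-2, 1], [-1,-1,-2, 1], [-1, 0,-4, 3]])

def max_pool_2x2(m):
    # keep the maximum of each non-overlapping 2x2 region
    return np.array([[m[i:i+2, j:j+2].max() for j in range(0, 4, 2)]
                     for i in range(0, 4, 2)])

print(max_pool_2x2(fmap1))   # [[3 0] [3 1]]
print(max_pool_2x2(fmap2))   # [[-1 1] [0 3]]
```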
The whole CNN
Convolution + Max Pooling turn the image into a new, smaller image (here 2 × 2, with channels 3 0 / 3 1 and -1 1 / 0 3); the number of channels is the number of filters. This can repeat many times.
The whole CNN — Flatten
After the last Max Pooling, the new image (2 × 2 × 2 here) is flattened into a vector of its 8 values and fed into a fully connected feedforward network, which produces the final output ("cat", "dog", ……).
CNN in Keras
Only the network structure and the input format are modified (vector → 3-D tensor).
Input_shape = (1, 28, 28): 1 channel (black/white; 3 for RGB), 28 × 28 pixels.
First convolution: 25 filters of size 3 × 3 (e.g. 1 -1 -1 / -1 1 -1 / -1 -1 1, ……), then Max Pooling, then Convolution again, ……
CNN in Keras
input: 1 × 28 × 28
Convolution (25 filters) → 25 × 26 × 26 — how many parameters for each filter? 9 (= 3 × 3 × 1)
Max Pooling → 25 × 13 × 13
Convolution (50 filters) → 50 × 11 × 11 — how many parameters for each filter? 225 (= 3 × 3 × 25)
Max Pooling → 50 × 5 × 5
CNN in Keras
… → Max Pooling → 50 × 5 × 5 → Flatten → 1250 → Fully Connected Feedforward network → output
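The shapes above map onto a short Keras model (a sketch in today's channels-last API; the lecture's Keras 1 code used channels-first (1, 28, 28)):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(25, (3, 3), input_shape=(28, 28, 1)),  # -> 26 x 26 x 25
    keras.layers.MaxPooling2D((2, 2)),                         # -> 13 x 13 x 25
    keras.layers.Conv2D(50, (3, 3)),                           # -> 11 x 11 x 50
    keras.layers.MaxPooling2D((2, 2)),                         # -> 5 x 5 x 50
    keras.layers.Flatten(),                                    # -> 1250
    keras.layers.Dense(10, activation='softmax'),
])
model.summary()   # check the 9- and 225-weight-per-filter counts (plus biases)
```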
Live Demo
What does CNN learn?
Consider the second convolution layer (input → Convolution with 25 3×3 filters → Max Pooling → Convolution with 50 3×3 filters → output 50 × 11 × 11): the output of the k-th filter is an 11 × 11 matrix with elements aᵏᵢⱼ (e.g. 3 -1 …… -1 / -3 1 …… -3 / …… / 3 -2 …… -1).
Degree of the activation of the k-th filter: aᵏ = Σᵢ₌₁¹¹ Σⱼ₌₁¹¹ aᵏᵢⱼ
Find the input x* = arg maxₓ aᵏ (by gradient ascent).
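Gradient ascent on the input can be sketched with TensorFlow (assumptions: `model` is the trained CNN from the sketch above, whose second conv layer is `model.layers[2]`; the step size and iteration count are arbitrary):

```python
import tensorflow as tf

k = 0                                                  # which filter to look at
sub = tf.keras.Model(model.input, model.layers[2].output)  # up to 2nd conv layer
x = tf.Variable(tf.random.uniform((1, 28, 28, 1)))     # start from noise
opt = tf.keras.optimizers.Adam(learning_rate=0.1)

for _ in range(100):
    with tf.GradientTape() as tape:
        a_k = tf.reduce_sum(sub(x)[0, :, :, k])   # a^k: sum of the 11x11 map
        loss = -a_k                               # ascent = descent on -a^k
    opt.apply_gradients([(tape.gradient(loss, x), x)])

x_star = x.numpy()   # the input that most activates filter k
```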
[Figure: the same gradient-ascent analysis applied after the flatten layer — the inputs x* that maximize the neurons corresponding to digits 0–8.]
CNN applications — Deep Dream
• Given a photo, the machine adds what it sees …… (http://deepdreamgenerator.com/)

Deep Style
• Given a photo, make its style like famous paintings (https://dreamscopeapp.com/)
Deep Style: one CNN captures the content of the photo, another CNN captures the style of the painting; the machine then finds an image with the content of the first and the style of the second.
Application: Playing Go
Network input: the board as a 19 × 19 matrix (like an image) — black: 1, white: −1, none: 0.
Network output: the next move, a 19 × 19 vector of positions.
A fully-connected feedforward network can be used, but CNN performs much better. (http://lgs.tw/qwwheue)
Training: collect the records of many previous plays; for each board position, the CNN learns to output the move that was played next.
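A sketch of the board encoding the slide describes (the string representation and helper names are made up for illustration):

```python
import numpy as np

def encode_board(rows):
    """rows: 19 strings of 'B' (black), 'W' (white), '.' (none)."""
    value = {'B': 1.0, 'W': -1.0, '.': 0.0}   # black: 1, white: -1, none: 0
    board = np.array([[value[c] for c in r] for r in rows], dtype=np.float32)
    return board.reshape(19, 19, 1)           # one-channel "image" for the CNN

def encode_move(row, col):
    """The target: the next move as a one-hot 19 x 19 = 361-dim vector."""
    t = np.zeros(361, dtype=np.float32)
    t[row * 19 + col] = 1.0
    return t
```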
Why CNN for Go playing?
• Some patterns are much smaller than the whole board: AlphaGo uses 5 × 5 filters for its first layer.