Deep Learning
Hung-yi Lee 李宏毅
Deep learning attracts a lot of attention.
• I believe you have seen many exciting results before.
[Figure: a “neuron” computes a weighted sum z of its inputs plus a bias, then applies an activation function σ(z).]
Neural Network
Different connections lead to different network structures.
Network parameters: all the weights and biases in the “neurons”.
Fully Connected Feedforward Network
[Figure: a layer of two sigmoid neurons. With input (1, −1), the first neuron (weights 1 and −2, bias 1) gets z = 4 and outputs 0.98; the second neuron (weights −1 and 1, bias 0) gets z = −2 and outputs 0.12.]
Sigmoid Function: $\sigma(z)=\dfrac{1}{1+e^{-z}}$
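To make the sigmoid concrete, here is a minimal Python sketch (NumPy assumed; the function name `sigmoid` is my own) that evaluates $\sigma(z)$ at the z values from the example layer above.

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# z values appearing in the example neurons above.
print(sigmoid(4))   # ~0.98
print(sigmoid(-2))  # ~0.12
```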
Fully Connected Feedforward Network
[Figure: the same network evaluated layer by layer on two inputs. With input (1, −1), the first layer outputs (0.98, 0.12) and the final output is (0.62, 0.83); with input (0, 0), the final output is (0.51, 0.85).]
This is a function.
Input vector, output vector:
$f\left(\begin{bmatrix}1\\-1\end{bmatrix}\right)=\begin{bmatrix}0.62\\0.83\end{bmatrix}$   $f\left(\begin{bmatrix}0\\0\end{bmatrix}\right)=\begin{bmatrix}0.51\\0.85\end{bmatrix}$
Given a network structure, we define a function set.
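As a hedged illustration of “given a network structure, we define a function set”: the sketch below (helper names `make_network` and `forward` are my own; NumPy assumed) builds the same small structure twice with different randomly drawn parameters, and the two resulting networks map the same input to different outputs, i.e. they are two different functions drawn from the same set.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_network(rng, sizes=(2, 2, 2)):
    # One (W, b) pair per layer; choosing parameter values picks one function from the set.
    return [(rng.standard_normal((n_out, n_in)), rng.standard_normal(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    a = np.asarray(x, dtype=float)
    for W, b in params:
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
f1 = make_network(rng)   # same structure,
f2 = make_network(rng)   # different parameters
x = np.array([1.0, -1.0])
print(forward(f1, x), forward(f2, x))  # two different outputs: two different functions
```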
Fully Connected Feedforward Network
[Figure: input layer x1 … xN, hidden layers (Layer 1, Layer 2, …, Layer L) of neurons, output layer y1 … yM.]
Deep = Many hidden layers
[Figure: depth vs. ImageNet error rate for well-known networks]
AlexNet (2012): 8 layers, 16.4% error
VGG (2014): 19 layers, 7.3% error
GoogleNet (2014): 22 layers, 6.7% error
Residual Net (2015): special structure, 3.57% error (drawn next to Taipei 101 for scale)
Ref: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf
Ref: https://www.youtube.com/watch?v=dxB6299gpvI
Matrix Operation
$\sigma\left(\begin{bmatrix}1 & -2\\ -1 & 1\end{bmatrix}\begin{bmatrix}1\\ -1\end{bmatrix}+\begin{bmatrix}1\\ 0\end{bmatrix}\right)=\sigma\left(\begin{bmatrix}4\\ -2\end{bmatrix}\right)=\begin{bmatrix}0.98\\ 0.12\end{bmatrix}$
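A minimal NumPy check of the matrix form above, using the weights and biases read off the example layer (variable names are my own):

```python
import numpy as np

W = np.array([[1.0, -2.0],
              [-1.0, 1.0]])   # one row of weights per neuron
b = np.array([1.0, 0.0])      # one bias per neuron
x = np.array([1.0, -1.0])     # input vector

z = W @ x + b                 # [4, -2]
a = 1.0 / (1.0 + np.exp(-z))  # sigmoid, elementwise
print(z, a)                   # [ 4. -2.] [~0.98 ~0.12]
```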
Neural Network
[Figure: x = (x1 … xN) passes through layers with parameters (W1, b1), (W2, b2), …, (WL, bL) to produce y = (y1 … yM).]
$a^1 = \sigma(W^1 x + b^1)$
$a^2 = \sigma(W^2 a^1 + b^2)$
$\quad\vdots$
$y = \sigma(W^L a^{L-1} + b^L)$
Neural Network
[Figure: the same network written as one nested function.]
$y = f(x) = \sigma\big(W^L \cdots \sigma\big(W^2\,\sigma(W^1 x + b^1) + b^2\big) \cdots + b^L\big)$
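The nested expression is just L repeated matrix operations, so the whole forward pass can be sketched as a loop; this assumes the per-layer weight matrices and biases are given as Python lists (names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def network_forward(Ws, bs, x):
    # y = f(x) = sigma(W^L ... sigma(W^2 sigma(W^1 x + b^1) + b^2) ... + b^L)
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)  # one layer = one matrix operation + activation
    return a

# With the single-layer example from the Matrix Operation slide:
Ws = [np.array([[1.0, -2.0], [-1.0, 1.0]])]
bs = [np.array([1.0, 0.0])]
print(network_forward(Ws, bs, np.array([1.0, -1.0])))  # ~[0.98, 0.12]
```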
Output Layer as Multi-Class Classifier
The hidden layers act as a feature extractor, replacing hand-crafted feature engineering.
[Figure: input layer x1 … xK → hidden layers → Softmax output layer y1 … yM, used as a multi-class classifier.]
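A brief sketch of a softmax output layer in NumPy (the max-subtraction for numerical stability is my addition, not from the slides): it turns the last layer's scores into positive values that sum to 1, which is what makes the output usable as a multi-class classifier.

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([3.0, 1.0, -2.0])
print(softmax(scores), softmax(scores).sum())  # class probabilities, summing to 1.0
```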
Example Application
[Figure: a 16 × 16 image of a handwritten digit is fed to the network.]
Input: 16 × 16 = 256 pixels; ink → 1, no ink → 0.
Output: each dimension represents the confidence of a digit, e.g. y1 = 0.1 (“is 1”), y2 = 0.7 (“is 2”), …, y10 = 0.2 (“is 0”), so the image is recognized as “2”.
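A small sketch of the input encoding just described, assuming the 16 × 16 image is stored as a NumPy array with ink = 1 and no ink = 0; flattening it gives the 256-dim input vector.

```python
import numpy as np

image = np.zeros((16, 16))   # 16 x 16 image: 0 = no ink
image[4:12, 7] = 1.0         # a crude vertical stroke of "ink" (illustrative only)

x = image.flatten()          # 256-dim input vector x1 ... x256
print(x.shape, x.sum())      # (256,) and the number of inked pixels
```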
Example Application
• Handwriting Digit Recognition
[Figure: image → Neural Network → “2”]
What is needed is a function with
Input: 256-dim vector → Output: 10-dim vector (y1 “is 1”, y2 “is 2”, …, y10 “is 0”).
Example Application
[Figure: input layer x1 … xN → Layer 1, Layer 2, …, Layer L → output layer y1 “is 1”, y2 “is 2”, …, y10 “is 0”.]
A function set containing the candidates for handwriting digit recognition.
[Figure: given a set of parameters, an input image x (x1 … x256) is mapped by the network, with a Softmax output layer, to y (y1 … y10), which is compared against the one-hot target ŷ (…, ŷ2 = 0, …, ŷ10 = 0).]
Cross Entropy:
$l(y,\hat{y}) = -\sum_{i=1}^{10} \hat{y}_i \ln y_i$
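A direct transcription of the cross-entropy formula into NumPy (names `y` for the softmax output and `y_hat` for the one-hot target are my own; a 3-class example is used for brevity):

```python
import numpy as np

def cross_entropy(y, y_hat):
    # l(y, y_hat) = -sum_i y_hat_i * ln(y_i)
    return -np.sum(y_hat * np.log(y))

y = np.array([0.7, 0.2, 0.1])      # network output (after softmax)
y_hat = np.array([1.0, 0.0, 0.0])  # one-hot target
print(cross_entropy(y, y_hat))     # -ln(0.7) ~ 0.357
```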
Total Loss
For all training data:
$L = \sum_{n=1}^{N} l^n$
[Figure: each training example x^n is mapped by the NN to y^n and compared with its target ŷ^n, giving loss l^n; n = 1, 2, 3, ….]
Find a function in the function set that minimizes the total loss L.
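The total loss is simply the sum of the per-example losses; a minimal sketch with hypothetical outputs and targets for N = 3 training examples:

```python
import numpy as np

def cross_entropy(y, y_hat):
    return -np.sum(y_hat * np.log(y))

# Hypothetical predictions and one-hot targets for N = 3 training examples.
outputs = [np.array([0.7, 0.3]), np.array([0.4, 0.6]), np.array([0.9, 0.1])]
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]

L = sum(cross_entropy(y, y_hat) for y, y_hat in zip(outputs, targets))
print(L)  # total loss over all training data
```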
Gradient Descent
Compute the gradient of the total loss with respect to every parameter in θ:
$\nabla L = \begin{bmatrix} \partial L/\partial w_1 \\ \partial L/\partial w_2 \\ \vdots \\ \partial L/\partial b_1 \\ \vdots \end{bmatrix}$
Then repeatedly update each parameter by $-\mu\,\partial L/\partial(\cdot)$, e.g.:
$w_1$: 0.2 → 0.15 → 0.09 → …
$w_2$: −0.1 → 0.05 → 0.15 → …
$b_1$: 0.3 → 0.2 → 0.10 → …
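A hedged sketch of the update rule shown above: every parameter moves by −μ times its partial derivative of the total loss. The toy loss and the finite-difference gradient below are stand-ins of my own; the slides obtain the gradient by backpropagation.

```python
import numpy as np

mu = 0.1  # learning rate

def loss(theta):
    # Toy loss with a known minimum at theta = [1, -2] (illustrative only).
    return np.sum((theta - np.array([1.0, -2.0])) ** 2)

def numerical_gradient(f, theta, eps=1e-6):
    # Finite-difference stand-in for backpropagation.
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta); d[i] = eps
        grad[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return grad

theta = np.array([0.2, -0.1])  # initial parameter values (as in the table above)
for step in range(3):
    theta = theta - mu * numerical_gradient(loss, theta)  # theta <- theta - mu * dL/dtheta
    print(step, theta)
```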
Gradient Descent
This is the “learning” of machines in deep learning. Even AlphaGo uses this approach.
People imagine …… Actually ……
libdnn (developed by NTU student 周伯威)
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
Three Steps for Deep Learning
Step 1: define a set of functions (a Neural Network)
Step 2: goodness of function
Step 3: pick the best function
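To tie the three steps together, a compact end-to-end sketch (toy data, helper names, and the finite-difference gradient are all my own stand-ins; backpropagation would normally supply the gradient): Step 1 fixes a network structure, i.e. a function set; Step 2 scores each candidate by total cross-entropy; Step 3 picks the best function with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2-dim inputs, 2 classes (one-hot targets).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Step 1: define a set of functions -- a fixed 2-3-2 network structure;
# the parameter vector theta selects one function from that set.
sizes = [(3, 2), (2, 3)]  # (n_out, n_in) per layer
def unpack(theta):
    params, i = [], 0
    for n_out, n_in in sizes:
        W = theta[i:i + n_out * n_in].reshape(n_out, n_in); i += n_out * n_in
        b = theta[i:i + n_out]; i += n_out
        params.append((W, b))
    return params

def forward(theta, x):
    (W1, b1), (W2, b2) = unpack(theta)
    a = sigmoid(W1 @ x + b1)
    return softmax(W2 @ a + b2)

# Step 2: goodness of function -- total cross-entropy loss over all training data.
def total_loss(theta):
    return sum(-np.sum(y_hat * np.log(forward(theta, x))) for x, y_hat in zip(X, Y))

# Step 3: pick the best function -- gradient descent
# (finite differences stand in for backpropagation here).
def numerical_gradient(f, theta, eps=1e-5):
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta); d[i] = eps
        grad[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return grad

n_params = sum(n_out * n_in + n_out for n_out, n_in in sizes)
theta = rng.standard_normal(n_params) * 0.5
print("initial loss:", total_loss(theta))
mu = 0.5
for step in range(300):
    theta = theta - mu * numerical_gradient(total_loss, theta)
print("final loss:  ", total_loss(theta))  # typically far below the initial loss
```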