Fundamentals of Deep Learning
Part 2: How a Neural Network Trains
Part 1: An Introduction to Deep Learning
[Figure: a handwritten digit as a 28 × 28 grid of pixel values, flattened into a single array, e.g. [0, 0, 0, 24, 75, 184, 185, 78, 32, 55, 0, 0, 0, …]]
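A minimal sketch of this flattening step, assuming the image is held in a NumPy array (the array name and sample pixel values are illustrative):

```python
import numpy as np

image = np.zeros((28, 28), dtype=np.uint8)                 # hypothetical 28 x 28 grayscale digit
image[10, 3:13] = [24, 75, 184, 185, 78, 32, 55, 0, 0, 0]  # a few sample pixel values

flattened = image.reshape(-1)   # one value per pixel
print(flattened.shape)          # (784,)
```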
DATA PREPARATION
Targets as categories
0 [1,0,0,0,0,0,0,0,0,0]
1 [0,1,0,0,0,0,0,0,0,0]
2 [0,0,1,0,0,0,0,0,0,0]
3 [0,0,0,1,0,0,0,0,0,0]
…
9 [0,0,0,0,0,0,0,0,0,1]
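A minimal sketch of this one-hot encoding in NumPy (the array name `labels` is illustrative):

```python
import numpy as np

labels = np.array([0, 1, 2, 3, 9])   # example digit labels
num_classes = 10

# Build a (num_labels, 10) matrix of zeros, then set a 1 in each row
# at the column matching that row's digit.
one_hot = np.zeros((labels.size, num_classes), dtype=int)
one_hot[np.arange(labels.size), labels] = 1

print(one_hot[0])   # [1 0 0 0 0 0 0 0 0 0]
```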
AN UNTRAINED MODEL
[Figure: an untrained network. Input vector [0, 0, …, 0] of size (784,), two hidden layers of size (512,), and an output layer of size (10,).]
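A minimal NumPy sketch of what this untrained model amounts to, assuming randomly initialized weights (names like `W1` and `b1` are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialized weights and biases: 784 -> 512 -> 512 -> 10
W1, b1 = rng.normal(size=(784, 512)), np.zeros(512)
W2, b2 = rng.normal(size=(512, 512)), np.zeros(512)
W3, b3 = rng.normal(size=(512, 10)), np.zeros(10)

x = np.zeros(784)        # a flattened 28 x 28 image
h1 = x @ W1 + b1         # shape (512,)
h2 = h1 @ W2 + b2        # shape (512,)
y_hat = h2 @ W3 + b3     # shape (10,), meaningless until trained
print(y_hat.shape)
```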
A SIMPLER MODEL
A SIMPLER MODEL
ŷ = m·x + b

x   y
1   3
2   5

m = ?   b = ?

[Figure: the two data points plotted on the x–y plane; the slope m and intercept b of the line are still unknown.]
A SIMPLER MODEL
ŷ = m·x + b
Start with random parameters: m = −1, b = 5

x   y   ŷ
1   3   4
2   5   3

[Figure: the randomly initialized line plotted against the two data points.]
A SIMPLER MODEL
ŷ = m·x + b

x   y   ŷ   (y − ŷ)²
1   3   4   1
2   5   3   4

RMSE = √( (1/n) · Σᵢ (yᵢ − ŷᵢ)² )

MSE = 2.5
RMSE ≈ 1.6

[Figure: the current line against the data points, with the prediction errors marked.]
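A minimal sketch of this error calculation in NumPy (variable names are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 5.0])

m, b = -1.0, 5.0        # the random starting parameters
y_hat = m * x + b       # predictions: [4.0, 3.0]

mse = np.mean((y - y_hat) ** 2)   # (1 + 4) / 2 = 2.5
rmse = np.sqrt(mse)               # about 1.58
print(mse, rmse)
```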
THE LOSS CURVE
[Figure: the loss surface, MSE plotted over combinations of m and b.]
THE LOSS CURVE
The current parameters (m = −1, b = 5) sit at one point on the loss surface; the target is the combination of m and b with the lowest MSE.
[Figure: the current line against the data (left) and the loss surface (right), with the current position and the target marked.]
THE LOSS CURVE
Changing b from 5 to 4 moves us from the old position to a new current position on the loss surface.
[Figure: the updated line (left) and the loss surface (right), with the old position, the current position (m = −1, b = 4), and the target marked.]
THE LOSS CURVE
Changing m from −1 to 0 moves the current position (m = 0, b = 4) closer to the target.
[Figure: the updated line (left) and the loss surface (right), with the current position and the target marked.]
THE LOSS CURVE
• The gradient: which direction the loss decreases the most
• The learning rate: how far to travel on each step
• Batch: a sample of the full dataset used to compute each step
[Figure: the loss surface with the target marked.]
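A minimal sketch tying these three terms together on the line-fitting example (plain NumPy; the gradient expressions below are derived directly from the two-point MSE for this sketch, not copied from the slides):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 5.0])

m, b = -1.0, 5.0      # random starting point
learning_rate = 0.1   # how far to travel on each step

for step in range(1000):
    # Batch: here the "batch" is the full (tiny) dataset; with more data we
    # could sample a subset of the rows instead.
    y_hat = m * x + b
    error = y_hat - y

    # Gradient of the MSE with respect to m and b.
    grad_m = np.mean(2 * error * x)
    grad_b = np.mean(2 * error)

    # Step against the gradient, scaled by the learning rate.
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

print(m, b)   # converges toward m = 2, b = 1, which fits both points exactly
```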
OPTIMIZERS
FROM NEURON TO NETWORK
BUILDING A NETWORK
[Figure: a single neuron; weighted inputs (weights w1 and w2) combine to produce the prediction ŷ.]
BUILDING A NETWORK
• Scales to more inputs
• Can chain neurons
[Figure: inputs x1 and x2 feed two hidden neurons through weights w1–w4; the hidden outputs feed the prediction ŷ through weights w5 and w6.]
BUILDING A NETWORK
• Scales to more inputs
• Can chain neurons
• If all the regressions are linear, then the output will also be a linear regression (see the expansion below)
[Figure: the same two-input network producing ŷ.]
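To make the last point concrete, here is the chained network from the figure written out (a worked expansion; for simplicity the biases are dropped):

ŷ = w5·(w1·x1 + w2·x2) + w6·(w3·x1 + w4·x2)
  = (w5·w1 + w6·w3)·x1 + (w5·w2 + w6·w4)·x2

This is just another linear regression in x1 and x2 with different coefficients; stacking more purely linear layers never changes that.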
ACTIVATION FUNCTIONS

Linear:   ŷ = w·x + b
ReLU:     ŷ = w·x + b  if  w·x + b > 0,  otherwise 0
Sigmoid:  ŷ = 1 / (1 + e^(−(w·x + b)))

[Figure: the three functions plotted: a straight line, a hinge that is flat below zero, and an S-shaped curve bounded between 0 and 1.]
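A minimal NumPy sketch of these three activations (the function names are illustrative):

```python
import numpy as np

def linear(z):
    return z

def relu(z):
    # Pass positive values through, clamp everything else to 0.
    return np.maximum(z, 0.0)

def sigmoid(z):
    # Squash any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])   # example pre-activations (w*x + b)
print(linear(z))    # [-2.  0.  3.]
print(relu(z))      # [0. 0. 3.]
print(sigmoid(z))   # [~0.12  0.5  ~0.95]
```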
ACTIVATION FUNCTIONS
[Figure: the two-input network (x1, x2) with an activation function applied at each neuron; the surface plotted for each hidden neuron and for the output ŷ is no longer a flat plane.]
OVERFITTING
Why not have a super large neural network?
OVERFITTING
Which trendline is better?
[Figure: two trendlines fit to the same points. Left: a curve that passes through every point (MSE = .0000). Right: a simpler fit (MSE = .0113).]
OVERFITTING
Which trendline is better?
[Figure: the same two trendlines judged against additional data. Left: MSE = .0308. Right: MSE = .0062. The simpler fit now has the lower error.]
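A minimal sketch of the same idea using NumPy polynomial fits (the data and degrees here are illustrative, not the curves from the figure): a high-degree fit can hit every training point exactly, yet a straight line typically does better on points it has not seen.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0, 1, n)
    y = 0.3 + 0.5 * x + rng.normal(scale=0.05, size=n)   # noisy line
    return x, y

x_train, y_train = make_data(10)
x_new, y_new = make_data(50)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    new_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    # Degree 9 interpolates the 10 training points (train MSE near 0)
    # but usually scores worse than degree 1 on the new points.
    print(degree, round(train_mse, 4), round(new_mse, 4))
```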
TRAINING VS VALIDATION DATA
Avoid memorization

Validation data
• New data for the model, used to check whether it truly understands (can generalize)

Overfitting
• When the model performs well on the training data but not on the validation data (evidence of memorization)
• Ideally, accuracy and loss should be similar between the two datasets

[Figure: MSE vs. epoch for the training MSE, the expected validation MSE, and a validation MSE that diverges from the training curve when overfitting.]
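A minimal tf.keras sketch of tracking both curves (this exact code is illustrative, not taken from the course materials):

```python
import tensorflow as tf

# Load MNIST: 60,000 training and 10,000 validation images of handwritten digits.
(x_train, y_train), (x_valid, y_valid) = tf.keras.datasets.mnist.load_data()

# Flatten the 28 x 28 images to (784,) vectors and scale pixels to [0, 1].
x_train = x_train.reshape(-1, 784).astype("float32") / 255
x_valid = x_valid.reshape(-1, 784).astype("float32") / 255

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(x_train, y_train,
                    validation_data=(x_valid, y_valid),  # data the model never trains on
                    epochs=10)

# A growing gap between these two curves is the signature of overfitting.
print(history.history["loss"])
print(history.history["val_loss"])
```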
FROM REGRESSION TO CLASSIFICATION
AN MNIST MODEL

Layer                   Size
Input [0, 0, …, 0]      (784,)
Hidden                  (512,)
Hidden                  (512,)
Output                  (10,)
AN MNIST MODEL

Layer                   Size      Activation
Input [0, 0, …, 0]      (784,)
Hidden                  (512,)    ReLU
Hidden                  (512,)    ReLU
Output                  (10,)     Sigmoid
AN MNIST MODEL

Layer                   Size      Activation
Input [0, 0, …, 0]      (784,)
Hidden                  (512,)    ReLU
Hidden                  (512,)    ReLU
Output                  (10,)     Softmax
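Extending the earlier untrained-model sketch with these activations (again a minimal NumPy sketch; the small random weights are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.01, size=(784, 512)), np.zeros(512)
W2, b2 = rng.normal(scale=0.01, size=(512, 512)), np.zeros(512)
W3, b3 = rng.normal(scale=0.01, size=(512, 10)), np.zeros(10)

x = rng.random(784)              # a flattened, scaled 28 x 28 image
h1 = relu(x @ W1 + b1)           # (512,)
h2 = relu(h1 @ W2 + b2)          # (512,)
probs = softmax(h2 @ W3 + b3)    # (10,), one probability per digit

print(probs.sum())               # 1.0, softmax outputs a probability distribution
```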
RMSE FOR PROBABILITIES?
[Figure: the earlier scatter plot and fitted line (x from 0 to 3.5, y from 0 to 4); is RMSE the right loss when the outputs are probabilities?]
CROSS ENTROPY
[Figure: the cross-entropy loss for a blue point prediction, plotted against the assigned probability; the loss stays near 0 for confident correct predictions and grows steeply as the assigned probability of the correct answer approaches 0.]

Loss = −[ t(x) · log(p(x)) + (1 − t(x)) · log(1 − p(x)) ]

t(x) = target (0 if False, 1 if True)
p(x) = predicted probability for point x

When t(x) = 1 only the first term contributes (the loss if True); when t(x) = 0 only the second term contributes (the loss if False).
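A minimal worked example of this loss in Python (the function name and probabilities are illustrative):

```python
import math

def cross_entropy(t, p):
    # t: 1 if the point is truly "blue" (True), 0 otherwise
    # p: the model's assigned probability that the point is blue
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

print(cross_entropy(1, 0.99))   # about 0.01: confident and correct, tiny loss
print(cross_entropy(1, 0.50))   # about 0.69: unsure, moderate loss
print(cross_entropy(1, 0.05))   # about 3.0:  confident and wrong, large loss
```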
LET’S GO!
APPENDIX: GRADIENT DESCENT
HELPING THE COMPUTER CHEAT CALCULUS
Learning From Error

MSE = ½ · [ (3 − (m·1 + b))² + (5 − (m·2 + b))² ]

∂MSE/∂m = 9m + 5b − 23        ∂MSE/∂b = 5m + 3b − 13

At m = −1, b = 5:
∂MSE/∂m = −7        ∂MSE/∂b = −3
THE LOSS CURVE
[Figure: the loss surface with the current position and the target marked.]
THE LOSS CURVE
∂MSE/∂m = −7        ∂MSE/∂b = −3
[Figure: the loss surface with the gradient at the current position and the target marked.]
THE LOSS CURVE
∂MSE/∂m = −7        ∂MSE/∂b = −3

Update rules (λ is the learning rate):
m := m − λ · ∂MSE/∂m
b := b − λ · ∂MSE/∂b

[Figure: the loss surface with the target marked.]
THE LOSS CURVE
∂MSE/∂m = −7        ∂MSE/∂b = −3
m := m − λ · ∂MSE/∂m
b := b − λ · ∂MSE/∂b

With λ = .6 the step is large and overshoots the target.
[Figure: the resulting step on the loss surface.]
THE LOSS CURVE
∂MSE/∂m = −7        ∂MSE/∂b = −3
m := m − λ · ∂MSE/∂m
b := b − λ · ∂MSE/∂b

With λ = .005 the step is tiny and progress toward the target is slow.
[Figure: the resulting small step on the loss surface.]
THE LOSS CURVE
With λ = .1, b steps from 5 to 4.7, a step size that makes steady progress toward the target.
[Figure: the resulting step on the loss surface.]