Week 2: Deep Learning
1. Function with Unknown Parameters
(Figure: example data values such as 4.8k, 4.9k, 7.5k, 3.4k, 9.8k, ….)
Model: $y = b + w x_1$

Loss: $L = \frac{1}{N}\sum_n e_n$

(Figure: error surface of $L$ over $w$ and $b$.)
Gradient Descent
For one parameter $w$:
Ø (Randomly) pick an initial value $w^0$
Ø Compute $\left.\frac{\partial L}{\partial w}\right|_{w=w^0}$
Ø Update $w$ iteratively: $w^1 \leftarrow w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0}$, then $w^2, w^3, \dots$

Does local minima truly cause the problem? (Figure: a loss curve with a local minimum and the global minimum.)

For two parameters $w$ and $b$:
Ø (Randomly) pick initial values $w^0, b^0$
Ø Compute $\left.\frac{\partial L}{\partial w}\right|_{w=w^0,\,b=b^0}$ and $\left.\frac{\partial L}{\partial b}\right|_{w=w^0,\,b=b^0}$
Ø Update $w$ and $b$ iteratively:
  $w^1 \leftarrow w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0,\,b=b^0}$
  $b^1 \leftarrow b^0 - \eta \left.\frac{\partial L}{\partial b}\right|_{w=w^0,\,b=b^0}$

This can be done in one line in most deep learning frameworks.
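A minimal NumPy sketch of this gradient descent loop (the toy data and the squared-error loss are illustrative assumptions, not the slide's exact setup):

```python
import numpy as np

# Toy 1-D dataset (made-up numbers), model y = b + w * x1, squared-error loss.
x = np.array([4.8, 4.9, 7.5, 3.4, 9.8])
y_hat = np.array([4.9, 7.5, 3.4, 9.8, 5.2])   # labels, purely illustrative

w, b = 0.0, 0.0          # (randomly) picked initial values w0, b0
eta = 0.01               # learning rate

for step in range(1000):
    e = (b + w * x) - y_hat           # prediction error per example
    dL_dw = np.mean(2 * e * x)        # dL/dw for L = mean(e^2)
    dL_db = np.mean(2 * e)            # dL/db
    w -= eta * dL_dw                  # w^{t+1} <- w^t - eta * dL/dw
    b -= eta * dL_db                  # b^{t+1} <- b^t - eta * dL/db
```

With automatic differentiation (e.g. PyTorch), the two hand-derived gradient lines collapse into a single backward call, which is roughly what the slide means by "one line in most deep learning frameworks".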
Back to the three steps:
Step 1: define the function with unknown parameters. Model: $y = b + w x_1$
Step 2: define the loss from training data
Step 3: optimization: $w^*, b^* = \arg\min_{w,b} L$

Ø Compute $\partial L/\partial w$ and $\partial L/\partial b$, then move in the direction $(-\eta\,\partial L/\partial w,\ -\eta\,\partial L/\partial b)$.

Result: $w^* = 0.97$, $b^* = 0.1\text{k}$, with $L(w^*, b^*) = 0.48\text{k}$.

Machine Learning is so simple ……
Training
All Piecewise Linear Curves = constant + sum of a set of step-like pieces.
(Figure: a piecewise linear function of $x_1$ decomposed into a constant plus such pieces.)
Sigmoid Function

$y = c\,\frac{1}{1 + e^{-(b + w x_1)}} = c\,\mathrm{sigmoid}(b + w x_1)$

To have good approximation, we need sufficient pieces.
The three parameters shape each piece $c_i\,\mathrm{sigmoid}(b_i + w_i x_1)$:
• Different $w$: change slopes
• Different $b$: shift
• Different $c$: change height

Summing several pieces plus a constant gives
$y = b + \sum_i c_i\,\mathrm{sigmoid}(b_i + w_i x_1)$
(Figure: curve 0 = constant + pieces 1 + 2 + 3.)
New Model: More Features

From $y = b + w x_1$ to $y = b + \sum_j w_j x_j$, and from
$y = b + \sum_i c_i\,\mathrm{sigmoid}(b_i + w_i x_1)$ to
$y = b + \sum_i c_i\,\mathrm{sigmoid}\Big(b_i + \sum_j w_{ij} x_j\Big)$

$i$: index over the sigmoids ($i = 1, 2, 3$); $j$: index over the features ($j = 1, 2, 3$); $w_{ij}$: weight of $x_j$ for the $i$-th sigmoid.

Writing out the inputs to the three sigmoids:
$r_1 = b_1 + w_{11} x_1 + w_{12} x_2 + w_{13} x_3$
$r_2 = b_2 + w_{21} x_1 + w_{22} x_2 + w_{23} x_3$
$r_3 = b_3 + w_{31} x_1 + w_{32} x_2 + w_{33} x_3$
In matrix form:
$\begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} + \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$
i.e. $\boldsymbol{r} = \boldsymbol{b} + W\boldsymbol{x}$
Passing each $r_i$ through the sigmoid gives the activations:
$a_i = \mathrm{sigmoid}(r_i) = \frac{1}{1 + e^{-r_i}}$, i.e. $\boldsymbol{a} = \sigma(\boldsymbol{r})$

The output is a weighted sum of the activations plus a bias:
$y = b + \boldsymbol{c}^\top \boldsymbol{a}$
Putting the pieces together:
$\boldsymbol{r} = \boldsymbol{b} + W\boldsymbol{x}$, $\quad \boldsymbol{a} = \sigma(\boldsymbol{r})$, $\quad y = b + \boldsymbol{c}^\top\boldsymbol{a}$
so the whole model is
$y = b + \boldsymbol{c}^\top \sigma(\boldsymbol{b} + W\boldsymbol{x})$
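A minimal NumPy sketch of this forward pass with 3 features and 3 sigmoids (all numeric values are made up for illustration):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

x = np.array([0.5, 1.0, -0.3])     # features x1, x2, x3 (made-up values)
W = np.random.randn(3, 3)          # w_ij: weight of x_j for the i-th sigmoid
b_vec = np.random.randn(3)         # b_i
c = np.random.randn(3)             # c_i
b = 0.1                            # the scalar bias b

r = b_vec + W @ x                  # r = b + W x
a = sigmoid(r)                     # a = sigma(r)
y = b + c @ a                      # y = b + c^T a
```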
Unknown parameters: collect all the entries of $W$, $\boldsymbol{b}$, $\boldsymbol{c}^\top$, and $b$ into one long vector
$\boldsymbol{\theta} = \begin{bmatrix}\theta_1 \\ \theta_2 \\ \theta_3 \\ \vdots\end{bmatrix}$
for the model $y = b + \boldsymbol{c}^\top\sigma(\boldsymbol{b} + W\boldsymbol{x})$.
Loss: $L = \frac{1}{N}\sum_n e_n$ (same definition as before, now as a function of $\boldsymbol{\theta}$).
Ø (Randomly) pick initial values $\boldsymbol{\theta}^0$
Ø Compute the gradient $\boldsymbol{g} = \nabla L(\boldsymbol{\theta}^0) = \begin{bmatrix}\left.\partial L/\partial\theta_1\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \left.\partial L/\partial\theta_2\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \vdots \end{bmatrix}$, then update $\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta\,\boldsymbol{g}$
Ø Compute $\boldsymbol{g} = \nabla L(\boldsymbol{\theta}^1)$, update $\boldsymbol{\theta}^2 \leftarrow \boldsymbol{\theta}^1 - \eta\,\boldsymbol{g}$
Ø Compute $\boldsymbol{g} = \nabla L(\boldsymbol{\theta}^2)$, update $\boldsymbol{\theta}^3 \leftarrow \boldsymbol{\theta}^2 - \eta\,\boldsymbol{g}$
Ø …
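A sketch of the same update loop written with PyTorch autograd, so the gradient $\boldsymbol{g}$ does not have to be derived by hand (the data, shapes, and hyperparameters are assumptions for illustration):

```python
import torch

# Made-up data: 100 examples with 3 features each.
x = torch.randn(100, 3)
y_hat = torch.randn(100)

# theta collects W, b (vector), c, and the scalar b; kept as separate tensors here.
W = torch.randn(3, 3, requires_grad=True)
b_vec = torch.randn(3, requires_grad=True)
c = torch.randn(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
eta = 0.01

for step in range(100):
    a = torch.sigmoid(x @ W.T + b_vec)     # a = sigma(b + W x) for every example
    y = b + a @ c                          # y = b + c^T a
    L = torch.mean((y - y_hat) ** 2)       # loss over the data
    L.backward()                           # g = gradient of L w.r.t. all parameters
    with torch.no_grad():
        for p in (W, b_vec, c, b):
            p -= eta * p.grad              # theta <- theta - eta * g
            p.grad.zero_()
```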
More variety of models: replace the sigmoid in $y = b + \boldsymbol{c}^\top\sigma(\boldsymbol{b} + W\boldsymbol{x})$ with the Rectified Linear Unit (ReLU):
$c\,\max(0,\, b + w x_1)$ and $c'\,\max(0,\, b' + w' x_1)$
(Figure: the ReLU pieces as functions of $x_1$.)
Activation function

With ReLU as the activation function:
$y = b + \sum_i c_i \max\Big(0,\ b_i + \sum_j w_{ij} x_j\Big)$

Loss with different numbers of ReLU units:

              linear    10 ReLU    100 ReLU    1000 ReLU
2017 – 2020   0.32k     0.32k      0.28k       0.27k
2021          0.46k     0.45k      0.43k       0.43k
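A one-function NumPy sketch of the ReLU version of the model (shapes and values are assumptions):

```python
import numpy as np

def relu_model(x, W, b_vec, c, b):
    # y = b + sum_i c_i * max(0, b_i + sum_j w_ij * x_j)
    return b + c @ np.maximum(0.0, b_vec + W @ x)
```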
Back to ML Framework

Step 1: define the function with unknown parameters → Step 2: define the loss from training data → Step 3: optimization

$y = b + \boldsymbol{c}^\top\sigma(\boldsymbol{b} + W\boldsymbol{x})$
(Figure: the network diagram with inputs $x_1, x_2, x_3$, activations $a_1, a_2, a_3$, and output $y$.)
$y = b + \boldsymbol{c}^\top\sigma(\boldsymbol{b} + W\boldsymbol{x})$
Each of these units is a Neuron. It is not fancy enough. Let's give it a fancy name: a Neural Network! This mimics human brains … (???)
Many layers means "Deep" → Deep Learning.
Why do we want a "Deep" network, not a "Fat" network?

Image classification error rates (figure from http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf):
AlexNet (2012): 8 layers, 16.4%
VGG (2014): 19 layers, 7.3%
GoogleNet (2014): 6.7%
Residual Net (2015): 3.57%, a special structure with so many layers that the slide compares it to Taipei 101.
Optimization Issue vs. Model Bias
(Figure: loss of two functions $f_{\boldsymbol{\theta}^1}(\boldsymbol{x})$ and $f_{\boldsymbol{\theta}^2}(\boldsymbol{x})$ on training data and testing data. Ref: http://arxiv.org/abs/1512.03385)
• Solution: more powerful optimization technology (next lecture)
Overfitting
• Small loss on training data, large loss on testing data. Why?

An extreme example. Training data: $\{(\boldsymbol{x}^1, \hat{y}^1), (\boldsymbol{x}^2, \hat{y}^2), \dots, (\boldsymbol{x}^N, \hat{y}^N)\}$

$f(\boldsymbol{x}) = \begin{cases}\hat{y}^i & \exists\, \boldsymbol{x}^i = \boldsymbol{x} \\ \text{random} & \text{otherwise}\end{cases}$

This function obtains zero training loss, but large testing loss. Less than useless …
(Figure: a flexible model fit to the training data; the real data distribution is not observable, and the testing data drawn from it incurs large loss.)
More training data
Overfitting: constrain the model
A more constrained model, e.g. $y = a + bx + cx^2$:
• Less parameters, sharing parameters (e.g. fully-connected layers vs. CNN)
• Less features
• Early stopping
• Regularization
• Dropout
(Figure: the constrained model fit to the training data; the real data distribution is not observable.)
Overfitting: constrain too much
$y = a + bx$: if we constrain the model too much, we go back to model bias …

Bias-Complexity Trade-off
(Figure: training loss and testing loss versus model complexity; select the model with the lowest testing loss.)
Small Batch v.s. Large Batch
• Larger batch size does not require longer time to compute the gradient (unless the batch size is too large): with parallel computing, the time for each update stays almost flat until the hardware limit is reached.
• Smaller batch requires longer time for one epoch (longer time for seeing all the data once).
(Figure: MNIST digit classification, time for one update and time for one epoch as functions of the batch size.)
Small Batch v.s. Large Batch
Consider 20 examples (N = 20):
• Batch size = N (Full Batch): update after seeing all 20 examples. Long time for "cooldown", but powerful.
• Batch size = 1: update for each example, i.e. 20 updates in an epoch. Short time for "cooldown", but noisy.

(Figure: performance vs. batch size on MNIST and CIFAR-10.)
Ø Smaller batch size has better performance.
Ø What's wrong with large batch size? Optimization fails.
Small Batch v.s. Large Batch
• Smaller batch size has better performance: the "noisy" update is better for training.
(Figure: with Full Batch, training can get stuck once the gradient of $L$ is near zero; with Small Batch, each update uses a different batch loss $L^1, L^2, \dots$, so a point that is stuck for one batch is still trainable for the next.)

• Small batch is better on testing data? (SB = 256, LB = 0.1 × data set; see "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima", https://arxiv.org/abs/1609.04836)
Small Batch v.s. Large Batch
• Why is small batch better on testing data? (Figure: a flat minimum and a sharp minimum of the training loss. At a sharp minimum, a small mismatch between training loss and testing loss gives a large testing loss, so large batch, which tends to land in sharp minima, is bad for testing, while small batch is good for testing.)

Summary:
                                       Small        Large
Speed for one update (no parallel)     Faster       Slower
Speed for one update (with parallel)   Same         Same (not too large)
Time for one epoch                     Slower       Faster
Gradient                               Noisy        Stable
Optimization                           Better       Worse
Generalization                         Better       Worse

Batch size is a hyperparameter you have to decide.
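A minimal mini-batch gradient descent sketch for a linear model, just to make the batch-size bookkeeping concrete (dataset, loss, and hyperparameters are made up; batch_size = N gives full batch, batch_size = 1 gives 20 updates per epoch):

```python
import numpy as np

N, batch_size, eta = 20, 5, 0.01
X = np.random.randn(N, 3)            # 20 made-up examples, 3 features each
y_hat = np.random.randn(N)
w, b = np.zeros(3), 0.0

for epoch in range(10):              # one epoch = seeing all N examples once
    order = np.random.permutation(N) # shuffle the examples each epoch
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        e = X[idx] @ w + b - y_hat[idx]            # error on this batch only
        w -= eta * 2 * (X[idx].T @ e) / len(idx)   # gradient of the batch MSE w.r.t. w
        b -= eta * 2 * e.mean()                    # gradient of the batch MSE w.r.t. b
```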
(Figure: loss as a function of the value of a network parameter $w$.)
(Figure: the norm of the gradient during training. Source: https://docs.google.com/presentation/d/1siUFXARYRpNiMeSRwgFbt7mZVjkMPhR5od09w0Z8xaU/edit#slide=id.g3532c09be1_0_382)
(Figure: an error surface on which plain gradient descent needs about 100,000 updates.)
Parameter-dependent learning rate: each parameter $\theta_i$ (e.g. $w_1$) gets its own scale $\sigma_i^t$:
$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t}\, g_i^t$, $\quad \sigma_i^t = \sqrt{\frac{1}{t+1}\sum_{\tau=0}^{t}\left(g_i^\tau\right)^2}$
i.e. $\sigma_i^t$ is the root mean square of that parameter's gradients so far.
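An Adagrad-style sketch of this update in NumPy (the function name and hyperparameters are assumptions; `grads` stands for the sequence of gradients $g^0, g^1, \dots$ that would come from training):

```python
import numpy as np

def adagrad_like_updates(theta, grads, eta=0.1, eps=1e-8):
    sum_sq = np.zeros_like(theta)
    for t, g in enumerate(grads):
        sum_sq += g ** 2
        sigma = np.sqrt(sum_sq / (t + 1))        # sigma_i^t: RMS of the gradients so far
        theta = theta - eta / (sigma + eps) * g  # theta^{t+1} <- theta^t - eta/sigma^t * g^t
    return theta
```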
RMSProp: keep the same update $\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t}\, g_i^t$, but compute $\sigma_i^t$ as an exponentially weighted average of the gradients $g_i^0, g_i^1, \dots, g_i^t$, with $0 < \alpha < 1$:

$\sigma_i^0 = \sqrt{\left(g_i^0\right)^2}$, $\quad \theta_i^1 \leftarrow \theta_i^0 - \frac{\eta}{\sigma_i^0}\, g_i^0$
$\sigma_i^1 = \sqrt{\alpha\left(\sigma_i^0\right)^2 + (1-\alpha)\left(g_i^1\right)^2}$, $\quad \theta_i^2 \leftarrow \theta_i^1 - \frac{\eta}{\sigma_i^1}\, g_i^1$
$\sigma_i^2 = \sqrt{\alpha\left(\sigma_i^1\right)^2 + (1-\alpha)\left(g_i^2\right)^2}$, $\quad \theta_i^3 \leftarrow \theta_i^2 - \frac{\eta}{\sigma_i^2}\, g_i^2$
……
$\sigma_i^t = \sqrt{\alpha\left(\sigma_i^{t-1}\right)^2 + (1-\alpha)\left(g_i^t\right)^2}$, $\quad \theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t}\, g_i^t$

The recent gradient has larger influence, and the past gradients have less influence.
Small $\sigma_i^t$ gives a larger step; when the gradients grow, $\sigma_i^t$ increases and the step shrinks; when the gradients shrink, $\sigma_i^t$ decreases and the step grows again.
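A matching RMSProp-style sketch (same conventions and caveats as the Adagrad-style sketch above):

```python
import numpy as np

def rmsprop_like_updates(theta, grads, eta=0.01, alpha=0.9, eps=1e-8):
    sigma_sq = None
    for g in grads:
        # exponentially weighted average of squared gradients; recent gradients dominate
        sigma_sq = g ** 2 if sigma_sq is None else alpha * sigma_sq + (1 - alpha) * g ** 2
        theta = theta - eta / (np.sqrt(sigma_sq) + eps) * g
    return theta
```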
Momentum and RMSProp address different parts of the same update: momentum modifies the gradient term (the direction), RMSProp modifies $\sigma_i^t$ (the magnitude), where the baseline is
$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t}\, g_i^t$, $\quad \sigma_i^t = \sqrt{\frac{1}{t+1}\sum_{\tau=0}^{t}\left(g_i^\tau\right)^2}$
Warm Up
The learning rate $\eta^t$ first increases and then decreases over the training steps $t$. Why increase and then decrease? At the beginning of training, the estimate of $\sigma_i^t$ has large variance, so starting with a small learning rate keeps the early updates cautious until the statistics stabilize.
Warm up is used, for example, in the Transformer (https://arxiv.org/abs/1706.03762); please refer to RAdam (https://arxiv.org/abs/1908.03265).
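A small sketch of one possible warm-up-then-decay schedule; the particular shape and constants here are illustrative assumptions, not the schedule from the slides:

```python
def lr_schedule(step, peak_lr=1e-3, warmup_steps=1000):
    # increase linearly during warm up, then decay roughly like 1/sqrt(step)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * (warmup_steps / (step + 1)) ** 0.5
```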
Various Improvements
$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta^t}{\sigma_i^t}\, m_i^t$
• Learning rate scheduling: $\eta^t$
• Momentum $m_i^t$: a weighted sum of the previous gradients, so it keeps their direction.
• $\sigma_i^t$: the root mean square of the gradients, so it uses only their magnitude. (Similar to momentum? Different?)

Classification
• Classification as regression? $\boldsymbol{x} \to$ Model $\to y$, compared against the class label $\hat{y}$: class 1 as 1, class 2 as 2, class 3 as 3.
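A sketch that puts the three ingredients of the update above into one step function (the momentum and averaging constants, the schedule, and the epsilon are assumptions; this shows the general shape of the update, not a specific published optimizer):

```python
import numpy as np

def combined_step(theta, g, m, sigma_sq, step, lam=0.9, alpha=0.99, eps=1e-8):
    m = lam * m + (1 - lam) * g                        # momentum: weighted sum of past gradients
    sigma_sq = alpha * sigma_sq + (1 - alpha) * g ** 2 # running mean of squared gradients
    eta_t = 1e-3 * min((step + 1) / 1000, (1000 / (step + 1)) ** 0.5)  # warm up, then decay
    theta = theta - eta_t / (np.sqrt(sigma_sq) + eps) * m
    return theta, m, sigma_sq
```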
Class as one-hot vector:
$\hat{\boldsymbol{y}} = \begin{bmatrix}1\\0\\0\end{bmatrix}$ (class 1) or $\begin{bmatrix}0\\1\\0\end{bmatrix}$ (class 2) or $\begin{bmatrix}0\\0\\1\end{bmatrix}$ (class 3)

The network so far only outputs one value $y$. How to output multiple values? Give the same hidden activations $a_1, a_2, a_3$ three sets of output weights, producing $y_1, y_2, y_3$.
(Figure: the earlier network with a single output $y$ next to the same network with three outputs $y_1, y_2, y_3$.)
Softmax: $\boldsymbol{y}' = \mathrm{softmax}(\boldsymbol{y})$, $\quad y_i' = \frac{e^{y_i}}{\sum_j e^{y_j}}$

The logits $y_i$ can have any value; softmax makes all the $y_i'$ lie between 0 and 1 (and sum to 1), so they can be compared with the one-hot label $\hat{\boldsymbol{y}}$, whose entries are 0 or 1.
(Figure: a worked example, e.g. the logit $-3$ gives $e^{-3} \approx 0.05$, which normalizes to $y' \approx 0$.)
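A direct NumPy sketch of the softmax formula (the max-subtraction is a standard numerical-stability detail, not something from the slides):

```python
import numpy as np

def softmax(y):
    z = np.exp(y - np.max(y))   # subtracting the max does not change the result
    return z / z.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))   # roughly [0.88, 0.12, 0.002]
```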
Loss of Classification

$L = \frac{1}{N}\sum_n e_n$, where $e$ measures the difference between the label $\hat{\boldsymbol{y}}$ and the network output passed through softmax, $\boldsymbol{y}' = \mathrm{softmax}(\boldsymbol{y})$ with $\boldsymbol{y} = \mathrm{Network}(\boldsymbol{x})$.

Mean Square Error (MSE): $e = \sum_i \left(\hat{y}_i - y_i'\right)^2$
Cross-entropy: $e = -\sum_i \hat{y}_i \ln y_i'$

(Figure: error surfaces over the logits $y_1, y_2 \in [-10, 10]$ with $y_3$ fixed at $-1000$. With MSE, the region of large loss is flat, so training that starts there gets stuck; with cross-entropy, the large-loss region still has a usable gradient toward the low-loss region. Source: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Deep%20More%20(v2).ecm.mp4/index.html)
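A NumPy sketch of the two losses for a single example, to make the formulas concrete (the epsilon inside the log is an added numerical guard, not part of the slide's formula). In practice, frameworks usually fuse softmax and cross-entropy into one call, e.g. PyTorch's `torch.nn.CrossEntropyLoss` takes raw logits.

```python
import numpy as np

def softmax(y):
    z = np.exp(y - np.max(y))
    return z / z.sum()

def cross_entropy(y_hat_onehot, logits):
    # e = -sum_i yhat_i * ln(y'_i), with y' = softmax(logits)
    return -np.sum(y_hat_onehot * np.log(softmax(logits) + 1e-12))

def mse(y_hat_onehot, logits):
    # e = sum_i (yhat_i - y'_i)^2
    return np.sum((y_hat_onehot - softmax(logits)) ** 2)

y_hat = np.array([1.0, 0.0, 0.0])            # one-hot label for class 1
logits = np.array([3.0, 1.0, -3.0])
print(cross_entropy(y_hat, logits), mse(y_hat, logits))
```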