Week2 DL

9/12/23

1. Function with Unknown Parameters

Model: $y = b + wx_1$, based on domain knowledge.

• $y$: no. of views on 2/26; $x_1$: no. of views on 2/25 (the feature).
• $w$ (weight) and $b$ (bias) are unknown parameters, learned from data.

2. Define Loss from Training Data

• Loss is a function of the parameters: $L(b, w)$.
• Loss: how good a set of values is.

Example: $L(0.5k, 1)$, i.e. $y = b + wx_1$ with $b = 0.5k$, $w = 1$: $y = 0.5k + 1 \cdot x_1$. How good is it? Use the data from 2017/01/01 – 2020/12/31:

Date:   2017/01/01  01/02  01/03  ……  2020/12/30  12/31
Views:  4.8k        4.9k   7.5k   ……  3.4k        9.8k

Feeding 4.8k into $0.5k + 1 \cdot x_1 = y$ predicts 5.3k for 01/02, while the label is $\hat{y} = 4.9k$, so the error is $e_1 = |y - \hat{y}| = 0.4k$. Predicting 01/03 from 4.9k gives 5.4k against the label 7.5k, so $e_2 = |y - \hat{y}| = 2.1k$, and so on through $e_N$ over the whole period.


The loss averages the errors over all $N$ days:

$L = \frac{1}{N}\sum_n e_n$

• $e = |y - \hat{y}|$: $L$ is the mean absolute error (MAE).
• $e = (y - \hat{y})^2$: $L$ is the mean square error (MSE).
• If $y$ and $\hat{y}$ are both probability distributions: cross-entropy.

Plotting $L$ for every $(w, b)$ gives the error surface, with regions of small $L$ and large $L$.
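As a concrete illustration, a minimal sketch of computing this loss in Python (toy numbers from the table above, in thousands; the `squared` flag switching between MAE and MSE is our addition):

```python
def loss(b, w, views, squared=False):
    """Average error of y = b + w * x1, predicting each day from the previous one."""
    errors = []
    for x1, label in zip(views[:-1], views[1:]):
        y = b + w * x1                                        # model prediction
        errors.append((y - label) ** 2 if squared else abs(y - label))
    return sum(errors) / len(errors)                          # L = (1/N) sum_n e_n

views = [4.8, 4.9, 7.5]       # first days of the data, in thousands
print(loss(0.5, 1.0, views))  # e_1 = 0.4, e_2 = 2.1, so L = 1.25 on this fragment
```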


3. Optimization

$w^*, b^* = \arg\min_{w,b} L$

Gradient Descent (illustrated with one parameter $w$):

• (Randomly) pick an initial value $w^0$.
• Compute $\left.\frac{\partial L}{\partial w}\right|_{w=w^0}$. A negative slope means the loss drops as $w$ grows, so increase $w$; a positive slope means decrease $w$.
• Update: $w^1 \leftarrow w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0}$, where $\eta$ is the learning rate, one of the hyperparameters.
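A minimal sketch of these steps for a single parameter; the quadratic loss here is a stand-in, not the view-count loss:

```python
def grad(L, w, eps=1e-6):
    """Numerical estimate of dL/dw at w."""
    return (L(w + eps) - L(w - eps)) / (2 * eps)

L = lambda w: (w - 1.0) ** 2     # stand-in loss with its minimum at w = 1
eta = 0.1                        # learning rate (a hyperparameter)
w = 5.0                          # (randomly) picked initial value w0
for _ in range(100):
    w = w - eta * grad(L, w)     # w <- w - eta * dL/dw
print(w)                         # approaches 1.0
```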


• Update $w$ iteratively: $w^0 \to w^1 \to w^2 \to \cdots$. Depending on the starting point, gradient descent can stop at a local minimum rather than the global minimum. (Does the local-minimum problem truly matter? More on this later.)

With two parameters:

• (Randomly) pick initial values $w^0, b^0$.
• Compute $\left.\frac{\partial L}{\partial w}\right|_{w=w^0, b=b^0}$ and $\left.\frac{\partial L}{\partial b}\right|_{w=w^0, b=b^0}$.
• Update:
  $w^1 \leftarrow w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0, b=b^0}$
  $b^1 \leftarrow b^0 - \eta \left.\frac{\partial L}{\partial b}\right|_{w=w^0, b=b^0}$
• Update $w$ and $b$ iteratively.

Computing the gradients can be done in one line in most deep learning frameworks, as the sketch below shows.
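A sketch of the same loop in PyTorch on toy tensors (assuming MSE); `loss.backward()` is the one line that computes both partial derivatives:

```python
import torch

x1 = torch.tensor([4.8, 4.9])             # views on day t (toy data, in k)
label = torch.tensor([4.9, 7.5])          # views on day t + 1
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w, b], lr=0.01)

for _ in range(1000):
    loss = ((b + w * x1 - label) ** 2).mean()  # MSE over the data
    opt.zero_grad()
    loss.backward()   # computes dL/dw and dL/db automatically
    opt.step()        # w <- w - eta * dL/dw,  b <- b - eta * dL/db
```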

Running gradient descent on the error surface of $y = b + wx_1$ (each step moves by $(-\eta\,\partial L/\partial w,\; -\eta\,\partial L/\partial b)$) gives

$w^* = 0.97, \quad b^* = 0.1k, \quad L(w^*, b^*) = 0.48k$

Machine learning is so simple: Step 1, write a function with unknown parameters; Step 2, define the loss from training data; Step 3, optimization.


$y = 0.1k + 0.97x_1$ achieves the smallest loss $L = 0.48k$ on the data of 2017 – 2020 (the training data). How about the data of 2021, unseen during training? $L' = 0.58k$. (Figure: red = real no. of views, blue = estimated no. of views (k), over 2021/01/01 – 2021/02/14.)

Using more past days as features improves both losses:

Model                             2017 - 2020   2021
y = b + w x_1                     L  = 0.48k    L' = 0.58k
y = b + sum_{j=1..7}  w_j x_j     L  = 0.38k    L' = 0.49k
y = b + sum_{j=1..28} w_j x_j     L  = 0.33k    L' = 0.46k
y = b + sum_{j=1..56} w_j x_j     L  = 0.32k    L' = 0.46k

For the 7-day model, the learned parameters are:

b       w*_1   w*_2    w*_3   w*_4    w*_5    w*_6   w*_7
0.05k   0.79   -0.31   0.12   -0.01   -0.10   0.30   0.18

However, in $y = b + wx_1$ different $w$ only change the slope and different $b$ only shift the line: whatever the parameters, $y$ is a straight-line function of $x_1$. Linear models have a severe limitation, called model bias. We need a more flexible model!


All Piecewise Linear Curves

red curve = constant + sum of a set of hard-sigmoid (step-like) pieces: each segment of the red curve (components 1, 2, …) is contributed by one hard sigmoid on top of a constant (component 0). More pieces require more hard sigmoids.

Beyond Piecewise Linear?

A continuous curve can be approximated by a piecewise linear curve, and to have a good approximation we need sufficient pieces. But how do we represent the hard sigmoid itself as a function? Approximate it with the sigmoid function:

$y = c\,\frac{1}{1 + e^{-(b + wx_1)}} = c\;\mathrm{sigmoid}(b + wx_1)$


Each parameter shapes the sigmoid differently: different $w$ change the slope, different $b$ shift the curve, and different $c$ change the height. So a sum of sigmoids plus a constant reproduces the red piecewise linear curve, e.g. $c_1\,\mathrm{sigmoid}(b_1 + w_1x_1)$, $c_2\,\mathrm{sigmoid}(b_2 + w_2x_1)$ and $c_3\,\mathrm{sigmoid}(b_3 + w_3x_1)$ give red curve = 0 + 1 + 2 + 3:

$y = b + \sum_i c_i\;\mathrm{sigmoid}(b_i + w_ix_1)$

New Model: More Features

Just as $y = b + wx_1$ generalized to $y = b + \sum_j w_jx_j$, the sigmoid model $y = b + \sum_i c_i\,\mathrm{sigmoid}(b_i + w_ix_1)$ generalizes to

$y = b + \sum_i c_i\;\mathrm{sigmoid}\Bigl(b_i + \sum_j w_{ij}x_j\Bigr)$

where $i = 1, 2, 3$ indexes the sigmoids, $j = 1, 2, 3$ indexes the features, and $w_{ij}$ is the weight on feature $x_j$ for the $i$-th sigmoid. Writing out the arguments of the three sigmoids:

$r_1 = b_1 + w_{11}x_1 + w_{12}x_2 + w_{13}x_3$
$r_2 = b_2 + w_{21}x_1 + w_{22}x_2 + w_{23}x_3$
$r_3 = b_3 + w_{31}x_1 + w_{32}x_2 + w_{33}x_3$


In matrix form, the three equations become $\boldsymbol{r} = \boldsymbol{b} + W\boldsymbol{x}$:

$\begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} + \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$

Each $r_i$ then passes through its sigmoid:

$a_i = \mathrm{sigmoid}(r_i) = \frac{1}{1 + e^{-r_i}}, \quad \text{i.e.} \quad \boldsymbol{a} = \sigma(\boldsymbol{r})$

and the activations are weighted by $c_1, c_2, c_3$ and shifted by $b$:

$y = b + \boldsymbol{c}^{\top}\boldsymbol{a}$


Putting the pieces together:

$\boldsymbol{r} = \boldsymbol{b} + W\boldsymbol{x}, \qquad \boldsymbol{a} = \sigma(\boldsymbol{r}), \qquad y = b + \boldsymbol{c}^{\top}\boldsymbol{a}$

$y = b + \boldsymbol{c}^{\top}\sigma(\boldsymbol{b} + W\boldsymbol{x})$

(Here $b$ is a scalar and $\boldsymbol{b}$ a vector of biases.)
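A sketch of the whole function in NumPy, with 3 features and 3 sigmoids as in the slides (the random parameter values are placeholders for learned ones):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def model(x, W, b_vec, c, b):
    r = b_vec + W @ x   # r = b + W x
    a = sigmoid(r)      # a = sigma(r)
    return b + c @ a    # y = b + c^T a

rng = np.random.default_rng(0)
x = np.array([4.8, 4.9, 7.5])     # 3 input features
W = rng.normal(size=(3, 3))       # weights w_ij
b_vec = rng.normal(size=3)        # biases b_1..b_3
c = rng.normal(size=3)            # output weights c_1..c_3
print(model(x, W, b_vec, c, b=0.0))
```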

Back to the ML Framework — Step 1: function with unknown parameters

$y = b + \boldsymbol{c}^{\top}\sigma(\boldsymbol{b} + W\boldsymbol{x})$, with $\boldsymbol{x}$ the feature vector. The unknown parameters are $W$, $\boldsymbol{b}$, $\boldsymbol{c}^{\top}$ and $b$; collect all of them (the rows of $W$, the entries of $\boldsymbol{b}$, …) into one long vector

$\boldsymbol{\theta} = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \\ \vdots \end{bmatrix}$


Step 2: define loss from training data

• Loss is a function of the parameters: $L(\boldsymbol{\theta})$.
• Loss means how good a set of values is.

Given a set of values, feed each feature vector into $y = b + \boldsymbol{c}^{\top}\sigma(\boldsymbol{b} + W\boldsymbol{x})$, compare the output with the label $\hat{y}$ to get an error $e$, and average:

$L = \frac{1}{N}\sum_n e_n$

Step 3: Optimization of New Model

$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} L$

• (Randomly) pick initial values $\boldsymbol{\theta}^0$.
• Compute the gradient, the vector of all partial derivatives:

$\boldsymbol{g} = \nabla L(\boldsymbol{\theta}^0) = \begin{bmatrix} \left.\partial L/\partial\theta_1\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \left.\partial L/\partial\theta_2\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \vdots \end{bmatrix}$

• Update all parameters at once, $\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta\boldsymbol{g}$, i.e. $\theta_1^1 \leftarrow \theta_1^0 - \eta\,\partial L/\partial\theta_1$, $\theta_2^1 \leftarrow \theta_2^0 - \eta\,\partial L/\partial\theta_2$, and so on.
• Repeat: compute $\boldsymbol{g} = \nabla L(\boldsymbol{\theta}^1)$ and update $\boldsymbol{\theta}^2 \leftarrow \boldsymbol{\theta}^1 - \eta\boldsymbol{g}$; compute $\boldsymbol{g} = \nabla L(\boldsymbol{\theta}^2)$ and update $\boldsymbol{\theta}^3 \leftarrow \boldsymbol{\theta}^2 - \eta\boldsymbol{g}$; ….


In practice, the $N$ training examples are (randomly) split into batches of size $B$, and each update uses the loss computed on one batch only:

• (Randomly) pick initial values $\boldsymbol{\theta}^0$.
• Compute $\boldsymbol{g} = \nabla L^1(\boldsymbol{\theta}^0)$ on batch 1; update $\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta\boldsymbol{g}$.
• Compute $\boldsymbol{g} = \nabla L^2(\boldsymbol{\theta}^1)$ on batch 2; update $\boldsymbol{\theta}^2 \leftarrow \boldsymbol{\theta}^1 - \eta\boldsymbol{g}$.
• Compute $\boldsymbol{g} = \nabla L^3(\boldsymbol{\theta}^2)$ on batch 3; update $\boldsymbol{\theta}^3 \leftarrow \boldsymbol{\theta}^2 - \eta\boldsymbol{g}$; ….

1 epoch = seeing all the batches once.

Example 1: 10,000 examples ($N$ = 10,000) and batch size 10 ($B$ = 10). How many updates in 1 epoch? 10,000 / 10 = 1,000 updates.

Example 2: 1,000 examples ($N$ = 1,000) and batch size 100 ($B$ = 100). How many updates in 1 epoch? 1,000 / 100 = 10 updates.
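A sketch of the batching bookkeeping for Example 1 (the gradient computation and update themselves are elided):

```python
import numpy as np

N, B = 10_000, 10                     # Example 1: N / B = 1,000 updates per epoch
for epoch in range(3):
    idx = np.random.permutation(N)    # reshuffle the examples each epoch
    for batch in idx.reshape(-1, B):  # each row is one batch of B example indices
        # compute L^k and g = grad(L^k) on these B examples only,
        # then update: theta <- theta - eta * g
        pass
```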

Back to the ML Framework — more variety of models: Sigmoid → ReLU

How else can the hard sigmoid be represented? Exactly, as the sum of two Rectified Linear Units (ReLU), each of the form $c\,\max(0, b + wx_1)$:

$\text{hard sigmoid}(x_1) = c\,\max(0,\, b + wx_1) + c'\,\max(0,\, b' + w'x_1)$
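A sketch of the identity (slopes and breakpoints are arbitrary; with these defaults the curve rises from 0 to c between x1 = 0 and x1 = 1):

```python
def relu(z):
    return max(0.0, z)

def hard_sigmoid(x1, c=1.0, b=0.0, w=1.0):
    # c * max(0, b + w*x1) + c' * max(0, b' + w'*x1) with c' = -c, b' = b - 1, w' = w
    return c * relu(b + w * x1) - c * relu(b - 1.0 + w * x1)

print([hard_sigmoid(x) for x in (-1.0, 0.5, 2.0)])  # [0.0, 0.5, 1.0]
```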


Replacing each sigmoid with two ReLUs in the model:

$y = b + \sum_i c_i\,\mathrm{sigmoid}\Bigl(b_i + \sum_j w_{ij}x_j\Bigr) \;\;\to\;\; y = b + \sum_{2i} c_i\,\max\Bigl(0,\; b_i + \sum_j w_{ij}x_j\Bigr)$

Sigmoid and ReLU are called activation functions. Which one is better? Experimental results:

              linear   10 ReLU   100 ReLU   1000 ReLU
2017 – 2020   0.32k    0.32k     0.28k      0.27k
2021          0.46k    0.45k     0.43k      0.43k

Even more variety of models: apply the same transformation repeatedly. Compute $\boldsymbol{a} = \sigma(\boldsymbol{b} + W\boldsymbol{x})$, then feed it through another stage, $\boldsymbol{a}' = \sigma(\boldsymbol{b}' + W'\boldsymbol{a})$ (with sigmoid or ReLU at each stage), and so on before the final output.
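A sketch of the stacked computation in NumPy (two hidden layers of width 4; all sizes and values are illustrative):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def forward(x, layers, c, b):
    """layers is a list of (W, b_vec) pairs: a <- sigma(b_vec + W a), repeated."""
    a = x
    for W, b_vec in layers:
        a = sigmoid(b_vec + W @ a)
    return b + c @ a                      # final output y

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), rng.normal(size=4)),  # x (3) -> a  (4)
          (rng.normal(size=(4, 4)), rng.normal(size=4))]  # a (4) -> a' (4)
print(forward(np.array([4.8, 4.9, 7.5]), layers, rng.normal(size=4), 0.0))
```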


Experimental Results

• Loss for multiple hidden layers; 100 ReLU for each layer; input features are the no. of views in the past 56 days.

              1 layer   2 layers   3 layers   4 layers
2017 – 2020   0.28k     0.18k      0.14k      0.10k
2021          0.43k     0.39k      0.38k      0.44k

(Figure, 3-layer model: red = real no. of views, blue = estimated no. of views (k), 2021/01/01 – 2021/02/14; the estimate tracks the real curve except for one day, marked '?'.)

Back to the ML Framework — naming

$y = b + \boldsymbol{c}^{\top}\sigma(\boldsymbol{b} + W\boldsymbol{x})$ is not fancy enough, so let's give it a fancy name! Each sigmoid or ReLU unit is a neuron, and many neurons form a neural network. (This mimics human brains … (???)) Each stage of activations $\boldsymbol{a}$ is a hidden layer; many layers means deep — hence deep learning.


Deep = Many hidden layers

AlexNet (2012): 8 layers, 16.4% error rate. VGG (2014): 19 layers, 7.3%. GoogleNet (2014): 22 layers, 6.7%. Residual Net (2015): 152 layers with a special structure, 3.57% — taller than Taipei 101's 101 floors. (Source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf)

Why do we want a "deep" network rather than a "fat" one?

Why don't we go deeper?

• Loss for multiple hidden layers; 100 ReLU for each layer; input features are the no. of views in the past 56 days.

              1 layer   2 layers   3 layers   4 layers
2017 – 2020   0.28k     0.18k      0.14k      0.10k
2021          0.43k     0.39k      0.38k      0.44k

The 4-layer network is better on training data but worse on unseen data: overfitting.


Let's predict no. of views today!

• If we want to select a model for predicting no. of views today, which one will you use? Picking the 3-layer model just because its 2021 loss is lowest amounts to selecting on the test data: split your training data into a training set and a validation set for model selection.

General Guide: first check the loss on training data.

• Training loss large → model bias (make your model more complex) or an optimization issue (next lecture).
• Training loss small → check the loss on testing data.
  • Testing loss large → overfitting — more training data (not an option in the HWs, except HW 11), data augmentation, or make your model simpler (a trade-off against model bias) — or mismatch (see below).
  • Testing loss small → done.

Model Bias

• The model is too simple: the set of candidate functions $f_{\boldsymbol{\theta}}(\boldsymbol{x})$ (e.g. $f_{\boldsymbol{\theta}^1}(\boldsymbol{x})$, $f_{\boldsymbol{\theta}^2}(\boldsymbol{x})$, …) contains no function $f^*(\boldsymbol{x})$ with small loss. Finding the best parameters is like finding a needle in a haystack — but there is no needle: the haystack is too small.
• Solution: redesign your model to make it more flexible.
  • More features: $y = b + wx_1 \;\to\; y = b + \sum_{j=1}^{56} w_jx_j$
  • Deep learning (more neurons, layers): $y = b + \sum_i c_i\,\mathrm{sigmoid}\bigl(b_i + \sum_j w_{ij}x_j\bigr)$


Optimization Issue

• Large loss does not always imply model bias. There is another possibility: the model does contain a function $f_{\boldsymbol{\theta}^*}(\boldsymbol{x})$ with small loss, but gradient descent returns some $f_{\boldsymbol{\theta}}(\boldsymbol{x})$ with large $L$ — a needle is in the haystack, we just cannot find it.
• Given large training loss, which one is it: model bias or an optimization issue?

Model Bias v.s. Optimization Issue

• Gain insight from comparison (ref: http://arxiv.org/abs/1512.03385): in the ResNet paper, a 56-layer network has larger error than a 20-layer one — not only on testing data but also on training data, so it is not overfitting; it is an optimization issue.
• Start from shallower networks (or other models), which are easier to optimize.
• If deeper networks do not obtain smaller loss on training data, then there is an optimization issue. In our example (2017 – 2020 training loss): 1 layer 0.28k, 2 layers 0.18k, 3 layers 0.14k, 4 layers 0.10k, 5 layers 0.34k — the 5-layer network has an optimization issue.
• Solution: more powerful optimization technology (next lecture).


Overfitting

• Small loss on training data, large loss on testing data. Why?

An extreme example: given training data $\{(\boldsymbol{x}^1, \hat{y}^1), (\boldsymbol{x}^2, \hat{y}^2), \ldots, (\boldsymbol{x}^N, \hat{y}^N)\}$, take the "freestyle" function

$f(\boldsymbol{x}) = \begin{cases} \hat{y}^i & \exists\, \boldsymbol{x}^i = \boldsymbol{x} \\ \text{random} & \text{otherwise} \end{cases}$

It obtains zero training loss but large testing loss — less than useless. (In the figures of this part, the real data distribution is not observable; we only see training data and testing data sampled from it.)

Less extreme cases behave similarly: a flexible model can pass through every training point exactly while bending freely in between, producing large loss on testing points that fall between the training points. Remedies: more training data, or data augmentation (creating extra examples from the ones you have).


Alternatively, constrain the model, e.g. restrict it to $y = a + bx + cx^2$: the constrained model cannot contort itself to the training points, so it stays closer to the real distribution. Ways to constrain a model:

• Less parameters, or sharing parameters (e.g. CNN instead of fully-connected)
• Less features
• Early stopping
• Regularization
• Dropout

But do not constrain too much: with $y = a + bx$ the model cannot fit the data at all, and we are back to model bias.

Bias-Complexity Trade-off

As the model becomes more complex (e.g. more features, more parameters), training loss keeps decreasing while testing loss first decreases and then rises. Select the model where testing loss is smallest.


Mismatch

• Your training and testing data have different distributions; be aware of how the data is generated. Simply increasing the training data will not help. Most HWs do not have this problem, except HW11.

Small Gradient …

Gradient descent slows or stops wherever the gradient with respect to a parameter $w$ is small: it is very slow at a plateau ($\partial L/\partial w \approx 0$), and it gets stuck at saddle points and at local minima (both with $\partial L/\partial w = 0$).

Tips for training: Batch and Momentum

Batch


Review: Optimization with Batch

$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} L$. (Randomly) pick initial values $\boldsymbol{\theta}^0$; then, batch by batch: compute $\boldsymbol{g}^0 = \nabla L^1(\boldsymbol{\theta}^0)$ and update $\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta\boldsymbol{g}^0$; compute $\boldsymbol{g}^1 = \nabla L^2(\boldsymbol{\theta}^1)$ and update $\boldsymbol{\theta}^2 \leftarrow \boldsymbol{\theta}^1 - \eta\boldsymbol{g}^1$; compute $\boldsymbol{g}^2 = \nabla L^3(\boldsymbol{\theta}^2)$ and update $\boldsymbol{\theta}^3 \leftarrow \boldsymbol{\theta}^2 - \eta\boldsymbol{g}^2$; …. 1 epoch = see all the batches once; shuffle after each epoch.

Small Batch v.s. Large Batch

Consider 20 examples ($N$ = 20).

• Batch size = $N$ (full batch): update after seeing all 20 examples. Each update sees all examples — a long time for "cooldown" between updates, but a powerful, stable step.
• Batch size = 1: update for each example, i.e. 20 updates in an epoch. Each update sees only one example — a short cooldown, but a noisy step.

(Older versions of these slides: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20(v4).pdf and http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2017/Lecture/Keras.pdf)

• A larger batch size does not require a longer time to compute the gradient, thanks to parallel computing, unless the batch size is too large: for MNIST digit classification on a Tesla V100 GPU, the time for one update stays nearly flat over a wide range of batch sizes before it grows.
• A smaller batch requires a longer time for one epoch (a longer time to see all the data once): on MNIST's 60,000 examples, batch size 1 means 60,000 updates in one epoch, while a large batch needs only 60 updates, so the small-batch epoch is much slower.


Yet the smaller batch size has better performance: on both MNIST and CIFAR-10, accuracy degrades as the batch size grows — on training as well as validation data. What's wrong with large batch size? Optimization fails; it is not overfitting, since the training accuracy drops too.

• The "noisy" update is better for training. With full batch, once gradient descent is stuck where the gradient of $L$ is zero, it stays stuck; with small batches, each update uses a different loss $L^k$, so where $L^1$ is stuck, $L^2$ may still have a gradient — the model remains trainable.
• Small batch is also better on testing data ("On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima", https://arxiv.org/abs/1609.04836; SB = 256 examples, LB = 0.1 × data set size).


One hypothesis: the training and testing loss surfaces are slightly shifted from each other. A flat minimum keeps a small loss under that shift and is good for testing; a sharp minimum becomes a large-loss point and is bad for testing. Noisy small-batch updates tend to escape sharp minima and settle in flat ones, while large batches tend to end in sharp minima.

                                       Small batch   Large batch
Speed for one update (no parallel)     faster        slower
Speed for one update (with parallel)   same          same (if not too large)
Time for one epoch                     slower        faster
Gradient                               noisy         stable
Optimization                           better        worse
Generalization                         better        worse

Batch size is a hyperparameter you have to decide.

Can we have both fish and bear's paws (i.e. the best of both worlds)? Several papers push large-batch training that stays fast and still generalizes:

• Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes (https://arxiv.org/abs/1904.00962)
• Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes (https://arxiv.org/abs/1711.04325)
• Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well (https://arxiv.org/abs/2001.02312)
• Large Batch Training of Convolutional Networks (https://arxiv.org/abs/1708.03888)
• Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (https://arxiv.org/abs/1706.02677)

Momentum


Small Gradient … Consider the physical world: a ball rolling down the error surface does not stop the moment the slope vanishes — its momentum carries it across small dips and plateaus. How about putting this phenomenon into gradient descent?

(Vanilla) Gradient Descent: starting at $\boldsymbol{\theta}^0$, compute gradient $\boldsymbol{g}^0$ and move to $\boldsymbol{\theta}^1 = \boldsymbol{\theta}^0 - \eta\boldsymbol{g}^0$; compute $\boldsymbol{g}^1$ and move to $\boldsymbol{\theta}^2 = \boldsymbol{\theta}^1 - \eta\boldsymbol{g}^1$; ….

Gradient Descent + Momentum

Movement = movement of the last step minus the gradient at present:

• Start at $\boldsymbol{\theta}^0$, with movement $\boldsymbol{m}^0 = \boldsymbol{0}$.
• Compute gradient $\boldsymbol{g}^0$; movement $\boldsymbol{m}^1 = \lambda\boldsymbol{m}^0 - \eta\boldsymbol{g}^0$; move to $\boldsymbol{\theta}^1 = \boldsymbol{\theta}^0 + \boldsymbol{m}^1$.
• Compute gradient $\boldsymbol{g}^1$; movement $\boldsymbol{m}^2 = \lambda\boldsymbol{m}^1 - \eta\boldsymbol{g}^1$; move to $\boldsymbol{\theta}^2 = \boldsymbol{\theta}^1 + \boldsymbol{m}^2$.
• …

Movement is based not just on the current gradient but also on the previous movement. Equivalently, $\boldsymbol{m}^i$ is a weighted sum of all the previous gradients $\boldsymbol{g}^0, \boldsymbol{g}^1, \ldots, \boldsymbol{g}^{i-1}$:

$\boldsymbol{m}^0 = \boldsymbol{0}, \quad \boldsymbol{m}^1 = -\eta\boldsymbol{g}^0, \quad \boldsymbol{m}^2 = -\lambda\eta\boldsymbol{g}^0 - \eta\boldsymbol{g}^1, \quad \ldots$
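A sketch of the update rule on a stand-in one-parameter loss (the values of lambda and eta are illustrative):

```python
lam, eta = 0.9, 0.05          # momentum coefficient and learning rate
grad = lambda w: 2 * (w - 1)  # stand-in gradient with minimum at w = 1
w, m = 5.0, 0.0               # theta0 and m0 = 0
for _ in range(200):
    g = grad(w)
    m = lam * m - eta * g     # m_{t+1} = lambda * m_t - eta * g_t
    w = w + m                 # theta_{t+1} = theta_t + m_{t+1}
print(w)                      # approaches 1.0 (with some overshoot on the way)
```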


With momentum, the real movement is the negative of $\partial L/\partial w$ plus the last movement. Even at a point where $\partial L/\partial w = 0$, the last movement is generally nonzero, so the accumulated momentum can carry the parameters past local minima and saddle points.

Concluding Remarks

• Critical points have zero gradient; they can be either saddle points or local minima.
  • Which of the two can be determined from the Hessian matrix.
  • It is possible to escape saddle points along the directions of eigenvectors of the Hessian matrix.
  • Local minima may be rare.
• Smaller batch sizes and momentum help escape critical points.

Training stuck ≠ Small Gradient

• People believe training gets stuck because the parameters are around a critical point — but plotting the norm of the gradient together with the loss shows that when the loss stops decreasing, the gradient norm is often still large: the error surface is rugged, and the parameters may be bouncing between the walls of a valley rather than sitting at a critical point.

Tips for training: Adaptive Learning Rate


Wait a minute — training can be difficult even without critical points. Consider a convex error surface shaped like a long narrow valley: the learning rate cannot be one-size-fits-all. With a large learning rate ($\eta = 10^{-2}$) the parameters oscillate back and forth across the steep walls and never settle; with a small one ($\eta = 10^{-7}$) the oscillation stops, but progress along the nearly flat valley floor is so slow that even 100,000 updates cannot reach the minimum.

Different parameters need different learning rates: along a gently sloped direction (e.g. $w_1$) we want a larger learning rate, along a steep one (e.g. $w_2$) a smaller one. So make the learning rate parameter-dependent. Formulation for one parameter $\theta_i$, with $g_i^t = \left.\partial L/\partial\theta_i\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^t}$: replace

$\theta_i^{t+1} \leftarrow \theta_i^t - \eta\,g_i^t$

by

$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t}\,g_i^t$

Root Mean Square: take $\sigma_i^t$ to be the root mean square of all gradients so far,

$\sigma_i^t = \sqrt{\frac{1}{t+1}\sum_{k=0}^{t}\bigl(g_i^k\bigr)^2}$

so $\sigma_i^0 = \sqrt{(g_i^0)^2} = |g_i^0|$, $\sigma_i^1 = \sqrt{\frac{1}{2}\bigl[(g_i^0)^2 + (g_i^1)^2\bigr]}$, $\sigma_i^2 = \sqrt{\frac{1}{3}\bigl[(g_i^0)^2 + (g_i^1)^2 + (g_i^2)^2\bigr]}$, …. This rule is used in Adagrad: a parameter with consistently small gradients gets a small $\sigma_i^t$ and thus larger steps, while a parameter with large gradients gets smaller steps.
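A sketch of the Adagrad-style rule for one parameter (stand-in gradient; eps guards the division):

```python
eta, eps = 0.5, 1e-8
grad = lambda w: 2 * (w - 1)           # stand-in gradient with minimum at w = 1
w, sq_sum = 5.0, 0.0
for t in range(200):
    g = grad(w)
    sq_sum += g ** 2
    sigma = (sq_sum / (t + 1)) ** 0.5  # root mean square of g^0 .. g^t
    w -= eta / (sigma + eps) * g       # theta <- theta - (eta / sigma) * g
print(w)                               # slowly approaches 1.0
```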


Learning rate adapts dynamically: the error surface can be very complex, so even for the same parameter the appropriate learning rate changes over time — larger steps where the surface is gentle, smaller steps where it is steep. Averaging over the entire gradient history, as Adagrad does, reacts to such changes only slowly.

RMSProp

$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t}\,g_i^t$, with $0 < \alpha < 1$ and

$\sigma_i^0 = \sqrt{(g_i^0)^2}, \qquad \sigma_i^t = \sqrt{\alpha\,(\sigma_i^{t-1})^2 + (1 - \alpha)\,(g_i^t)^2}$

The recent gradient has larger influence and the past gradients have less. When the surface suddenly steepens, $(g_i^t)^2$ grows, $\sigma_i^t$ increases quickly, and the step shrinks; when the surface flattens, $\sigma_i^t$ decreases and the step grows.
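A sketch of the RMSProp-style rule, initializing sigma from the first gradient as on the slide (the alpha and eta values are illustrative):

```python
alpha, eta, eps = 0.9, 0.1, 1e-8
grad = lambda w: 2 * (w - 1)   # stand-in gradient with minimum at w = 1
w, sq = 5.0, None              # sq holds sigma^2
for _ in range(200):
    g = grad(w)
    sq = g ** 2 if sq is None else alpha * sq + (1 - alpha) * g ** 2
    w -= eta / (sq ** 0.5 + eps) * g   # recent gradients dominate sigma
print(w)                       # settles in a small neighbourhood of 1.0
```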


Adam: RMSProp + Momentum (original paper: https://arxiv.org/pdf/1412.6980.pdf). The update combines $m_i^t$ (the momentum term, which keeps directions) with $\sigma_i^t$ (the RMSProp root mean square, magnitude only):

$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t}\,m_i^t$
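In practice these updates come ready-made; a sketch of picking Adam in PyTorch (the model and the learning rate are placeholders):

```python
import torch

model = torch.nn.Linear(3, 1)                        # any model with parameters
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # momentum (m) + RMSProp (sigma)

x, label = torch.randn(8, 3), torch.randn(8, 1)      # toy batch
loss = ((model(x) - label) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```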

Learning Rate Scheduling

Make the learning rate itself time-dependent:

$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta^t}{\sigma_i^t}\,g_i^t$

• Learning Rate Decay: as the training goes, we are closer to the destination, so we reduce the learning rate.
• Warm Up: increase and then decrease the learning rate?


Warm up appears in classic architectures: both the Residual Network paper (https://arxiv.org/abs/1512.03385) and the Transformer paper (https://arxiv.org/abs/1706.03762) use it. One explanation: at the beginning of training, the estimate of $\sigma_i^t$ has large variance (it is based on very few gradients), so keep the learning rate small until the statistic becomes reliable, then decay as usual. Please refer to RAdam (https://arxiv.org/abs/1908.03265).
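A sketch of a warm-up-then-decay schedule; the shape is the point — the peak value and warm-up length are illustrative, not taken from the papers:

```python
def lr_at(step, peak=1e-3, warmup=1000):
    if step < warmup:
        return peak * step / warmup       # linearly increase ...
    return peak * (warmup / step) ** 0.5  # ... then decay

print(lr_at(500), lr_at(1000), lr_at(4000))  # rising, at the peak, decayed to half
```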

Summary of Optimization

(Vanilla) gradient descent: $\theta_i^{t+1} \leftarrow \theta_i^t - \eta\,g_i^t$.

Various improvements: $\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta^t}{\sigma_i^t}\,m_i^t$

• $\eta^t$: learning rate scheduling.
• $m_i^t$: momentum, a weighted sum of the previous gradients that keeps their directions.
• $\sigma_i^t$: root mean square of the gradients — magnitude only, no direction, so it does not simply cancel the momentum in the numerator.

Classification as Regression?

• Regression: $\boldsymbol{x} \to$ Model $\to y$, compared with the label $\hat{y}$.
• Classification as regression? $\boldsymbol{x} \to$ Model $\to y$, with the class encoded as a number ($\hat{y}$ = 1 for class 1, 2 for class 2, 3 for class 3)? This forces relationships that may not exist: it treats classes 1 and 2 as similar and classes 1 and 3 as different, yet the classes may have no such ordering.


Class as one-hot vector: represent the label as

$\hat{\boldsymbol{y}} = \begin{bmatrix}1\\0\\0\end{bmatrix} \text{(class 1)}, \quad \begin{bmatrix}0\\1\\0\end{bmatrix} \text{(class 2)}, \quad \begin{bmatrix}0\\0\\1\end{bmatrix} \text{(class 3)}$

so no ordering between classes is implied. The network so far outputs only one value $y$; to output multiple values, reuse the same hidden activations $a_1, a_2, a_3$ with three sets of output weights, producing $y_1, y_2, y_3$ — one per class.

Regression: $y = b + \boldsymbol{c}^{\top}\sigma(\boldsymbol{b} + W\boldsymbol{x})$.

Classification: $\boldsymbol{y} = \boldsymbol{b}' + W'\,\sigma(\boldsymbol{b} + W\boldsymbol{x})$, followed by $\boldsymbol{y}' = \mathrm{softmax}(\boldsymbol{y})$, and $\boldsymbol{y}'$ is compared with the one-hot label $\hat{\boldsymbol{y}}$ (whose entries are 0 or 1). The entries of $\boldsymbol{y}$ (the logits) can have any value; softmax makes all values fall between 0 and 1:

$y_i' = \frac{\exp(y_i)}{\sum_j \exp(y_j)}, \qquad 1 > y_i' > 0, \qquad \sum_i y_i' = 1$

Example: $\boldsymbol{y} = (3, 1, -3)$ → $\exp$ → $(20, 2.7, \approx 0)$ → normalize → $(0.88, 0.12, \approx 0)$.

How about binary classification? (A sigmoid output works there — it is equivalent to a two-class softmax. :)
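A sketch of softmax in NumPy, reproducing the slide's example:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())  # subtracting the max avoids overflow; result unchanged
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])).round(2))  # [0.88 0.12 0.  ]
```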


Loss of Classification

$L = \frac{1}{N}\sum_n e_n$, where $e$ now compares the label $\hat{\boldsymbol{y}}$ with the network output $\boldsymbol{y}' = \mathrm{softmax}(\boldsymbol{y})$:

• Mean square error (MSE): $e = \sum_i (\hat{y}_i - y_i')^2$
• Cross-entropy: $e = -\sum_i \hat{y}_i \ln y_i'$

Minimizing cross-entropy is equivalent to maximizing likelihood.

Example: a 3-class network with logits $y_1 \in [-10, 10]$, $y_2 \in [-10, 10]$ and $y_3 = -1000$ (so $y_3' \approx 0$), with target $\hat{\boldsymbol{y}} = [1\;0\;0]^{\top}$. Plot the error surface over $(y_1, y_2)$: both losses are large where $y_1$ is small and $y_2$ is large, and small in the opposite corner. With MSE, the large-loss region is flat, so training that starts there gets stuck; with cross-entropy, the same region still slopes toward the small-loss corner. Changing the loss function can change the difficulty of optimization.

(Demo: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Deep%20More%20(v2).ecm.mp4/index.html)
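Both errors on the slide's example, with the softmax sketch repeated so the snippet stands alone:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

y_hat = np.array([1.0, 0.0, 0.0])           # one-hot label, class 1
y_p = softmax(np.array([3.0, 1.0, -3.0]))   # network output y'
mse = ((y_hat - y_p) ** 2).sum()            # e = sum_i (y_hat_i - y'_i)^2  (~0.03)
ce = -(y_hat * np.log(y_p)).sum()           # e = -sum_i y_hat_i * ln y'_i  (~0.13)
print(mse, ce)
```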

Thanks to Prof. Hung-yi Lee for providing the slides.