Week2 DL

9/12/23

1. Function with Unknown Parameters

Model: $y = b + wx_1$, based on domain knowledge.

• $y$: no. of views on 2/26; $x_1$: no. of views on 2/25 (the feature).
• $w$ (weight) and $b$ (bias) are unknown parameters, learned from data.

2. Define Loss from Training Data

• Loss is a function of the parameters: $L(b, w)$.
• Loss: how good a set of values is.

Example: $L(0.5k, 1)$, i.e. $y = b + wx_1$ with $b = 0.5k$, $w = 1$: $y = 0.5k + 1 \cdot x_1$. How good is it? Use the data from 2017/01/01 – 2020/12/31:

Date:   2017/01/01  01/02  01/03  ……  2020/12/30  12/31
Views:  4.8k        4.9k   7.5k   ……  3.4k        9.8k

Feeding 4.8k into $0.5k + 1 \cdot x_1 = y$ predicts 5.3k for 01/02, while the label is $\hat{y} = 4.9k$, so the error is $e_1 = |y - \hat{y}| = 0.4k$. Predicting 01/03 from 4.9k gives 5.4k against the label 7.5k, so $e_2 = |y - \hat{y}| = 2.1k$, and so on through $e_N$ over the whole period.


The loss averages the errors over all $N$ days:

$L = \frac{1}{N}\sum_n e_n$

• $e = |y - \hat{y}|$: $L$ is the mean absolute error (MAE).
• $e = (y - \hat{y})^2$: $L$ is the mean square error (MSE).
• If $y$ and $\hat{y}$ are both probability distributions: cross-entropy.

Plotting $L$ for every $(w, b)$ gives the error surface, with regions of small $L$ and large $L$.
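As a concrete illustration, a minimal sketch of computing this loss in Python (toy numbers from the table above, in thousands; the `squared` flag switching between MAE and MSE is our addition):

```python
def loss(b, w, views, squared=False):
    """Average error of y = b + w * x1, predicting each day from the previous one."""
    errors = []
    for x1, label in zip(views[:-1], views[1:]):
        y = b + w * x1                                        # model prediction
        errors.append((y - label) ** 2 if squared else abs(y - label))
    return sum(errors) / len(errors)                          # L = (1/N) sum_n e_n

views = [4.8, 4.9, 7.5]       # first days of the data, in thousands
print(loss(0.5, 1.0, views))  # e_1 = 0.4, e_2 = 2.1, so L = 1.25 on this fragment
```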


3. Optimization

$w^*, b^* = \arg\min_{w,b} L$

Gradient Descent (illustrated with one parameter $w$):

• (Randomly) pick an initial value $w^0$.
• Compute $\left.\frac{\partial L}{\partial w}\right|_{w=w^0}$. A negative slope means the loss drops as $w$ grows, so increase $w$; a positive slope means decrease $w$.
• Update: $w^1 \leftarrow w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0}$, where $\eta$ is the learning rate, one of the hyperparameters.
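A minimal sketch of these steps for a single parameter; the quadratic loss here is a stand-in, not the view-count loss:

```python
def grad(L, w, eps=1e-6):
    """Numerical estimate of dL/dw at w."""
    return (L(w + eps) - L(w - eps)) / (2 * eps)

L = lambda w: (w - 1.0) ** 2     # stand-in loss with its minimum at w = 1
eta = 0.1                        # learning rate (a hyperparameter)
w = 5.0                          # (randomly) picked initial value w0
for _ in range(100):
    w = w - eta * grad(L, w)     # w <- w - eta * dL/dw
print(w)                         # approaches 1.0
```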


• Update $w$ iteratively: $w^0 \to w^1 \to w^2 \to \cdots$. Depending on the starting point, gradient descent can stop at a local minimum rather than the global minimum. (Does the local-minimum problem truly matter? More on this later.)

With two parameters:

• (Randomly) pick initial values $w^0, b^0$.
• Compute $\left.\frac{\partial L}{\partial w}\right|_{w=w^0, b=b^0}$ and $\left.\frac{\partial L}{\partial b}\right|_{w=w^0, b=b^0}$.
• Update:
  $w^1 \leftarrow w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0, b=b^0}$
  $b^1 \leftarrow b^0 - \eta \left.\frac{\partial L}{\partial b}\right|_{w=w^0, b=b^0}$
• Update $w$ and $b$ iteratively.

Computing the gradients can be done in one line in most deep learning frameworks, as the sketch below shows.
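A sketch of the same loop in PyTorch on toy tensors (assuming MSE); `loss.backward()` is the one line that computes both partial derivatives:

```python
import torch

x1 = torch.tensor([4.8, 4.9])             # views on day t (toy data, in k)
label = torch.tensor([4.9, 7.5])          # views on day t + 1
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w, b], lr=0.01)

for _ in range(1000):
    loss = ((b + w * x1 - label) ** 2).mean()  # MSE over the data
    opt.zero_grad()
    loss.backward()   # computes dL/dw and dL/db automatically
    opt.step()        # w <- w - eta * dL/dw,  b <- b - eta * dL/db
```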

Running gradient descent on the error surface of $y = b + wx_1$ (each step moves by $(-\eta\,\partial L/\partial w,\; -\eta\,\partial L/\partial b)$) gives

$w^* = 0.97, \quad b^* = 0.1k, \quad L(w^*, b^*) = 0.48k$

Machine learning is so simple: Step 1, write a function with unknown parameters; Step 2, define the loss from training data; Step 3, optimization.


$y = 0.1k + 0.97x_1$ achieves the smallest loss $L = 0.48k$ on the data of 2017 – 2020 (the training data). How about the data of 2021, unseen during training? $L' = 0.58k$. (Figure: red = real no. of views, blue = estimated no. of views (k), over 2021/01/01 – 2021/02/14.)

Using more past days as features improves both losses:

Model                             2017 - 2020   2021
y = b + w x_1                     L  = 0.48k    L' = 0.58k
y = b + sum_{j=1..7}  w_j x_j     L  = 0.38k    L' = 0.49k
y = b + sum_{j=1..28} w_j x_j     L  = 0.33k    L' = 0.46k
y = b + sum_{j=1..56} w_j x_j     L  = 0.32k    L' = 0.46k

For the 7-day model, the learned parameters are:

b       w*_1   w*_2    w*_3   w*_4    w*_5    w*_6   w*_7
0.05k   0.79   -0.31   0.12   -0.01   -0.10   0.30   0.18

However, in $y = b + wx_1$ different $w$ only change the slope and different $b$ only shift the line: whatever the parameters, $y$ is a straight-line function of $x_1$. Linear models have a severe limitation, called model bias. We need a more flexible model!


All Piecewise Linear Curves

red curve = constant + sum of a set of hard-sigmoid (step-like) pieces: each segment of the red curve (components 1, 2, …) is contributed by one hard sigmoid on top of a constant (component 0). More pieces require more hard sigmoids.

Beyond Piecewise Linear?

A continuous curve can be approximated by a piecewise linear curve, and to have a good approximation we need sufficient pieces. But how do we represent the hard sigmoid itself as a function? Approximate it with the sigmoid function:

$y = c\,\frac{1}{1 + e^{-(b + wx_1)}} = c\;\mathrm{sigmoid}(b + wx_1)$


Each parameter shapes the sigmoid differently: different $w$ change the slope, different $b$ shift the curve, and different $c$ change the height. So a sum of sigmoids plus a constant reproduces the red piecewise linear curve, e.g. $c_1\,\mathrm{sigmoid}(b_1 + w_1x_1)$, $c_2\,\mathrm{sigmoid}(b_2 + w_2x_1)$ and $c_3\,\mathrm{sigmoid}(b_3 + w_3x_1)$ give red curve = 0 + 1 + 2 + 3:

$y = b + \sum_i c_i\;\mathrm{sigmoid}(b_i + w_ix_1)$

New Model: More Features

Just as $y = b + wx_1$ generalized to $y = b + \sum_j w_jx_j$, the sigmoid model $y = b + \sum_i c_i\,\mathrm{sigmoid}(b_i + w_ix_1)$ generalizes to

$y = b + \sum_i c_i\;\mathrm{sigmoid}\Bigl(b_i + \sum_j w_{ij}x_j\Bigr)$

where $i = 1, 2, 3$ indexes the sigmoids, $j = 1, 2, 3$ indexes the features, and $w_{ij}$ is the weight on feature $x_j$ for the $i$-th sigmoid. Writing out the arguments of the three sigmoids:

$r_1 = b_1 + w_{11}x_1 + w_{12}x_2 + w_{13}x_3$
$r_2 = b_2 + w_{21}x_1 + w_{22}x_2 + w_{23}x_3$
$r_3 = b_3 + w_{31}x_1 + w_{32}x_2 + w_{33}x_3$


In matrix form, the three equations become $\boldsymbol{r} = \boldsymbol{b} + W\boldsymbol{x}$:

$\begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} + \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$

Each $r_i$ then passes through its sigmoid:

$a_i = \mathrm{sigmoid}(r_i) = \frac{1}{1 + e^{-r_i}}, \quad \text{i.e.} \quad \boldsymbol{a} = \sigma(\boldsymbol{r})$

and the activations are weighted by $c_1, c_2, c_3$ and shifted by $b$:

$y = b + \boldsymbol{c}^{\top}\boldsymbol{a}$


Putting the pieces together:

$\boldsymbol{r} = \boldsymbol{b} + W\boldsymbol{x}, \qquad \boldsymbol{a} = \sigma(\boldsymbol{r}), \qquad y = b + \boldsymbol{c}^{\top}\boldsymbol{a}$

$y = b + \boldsymbol{c}^{\top}\sigma(\boldsymbol{b} + W\boldsymbol{x})$

(Here $b$ is a scalar and $\boldsymbol{b}$ a vector of biases.)
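A sketch of the whole function in NumPy, with 3 features and 3 sigmoids as in the slides (the random parameter values are placeholders for learned ones):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def model(x, W, b_vec, c, b):
    r = b_vec + W @ x   # r = b + W x
    a = sigmoid(r)      # a = sigma(r)
    return b + c @ a    # y = b + c^T a

rng = np.random.default_rng(0)
x = np.array([4.8, 4.9, 7.5])     # 3 input features
W = rng.normal(size=(3, 3))       # weights w_ij
b_vec = rng.normal(size=3)        # biases b_1..b_3
c = rng.normal(size=3)            # output weights c_1..c_3
print(model(x, W, b_vec, c, b=0.0))
```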

Back to the ML Framework — Step 1: function with unknown parameters

$y = b + \boldsymbol{c}^{\top}\sigma(\boldsymbol{b} + W\boldsymbol{x})$, with $\boldsymbol{x}$ the feature vector. The unknown parameters are $W$, $\boldsymbol{b}$, $\boldsymbol{c}^{\top}$ and $b$; collect all of them (the rows of $W$, the entries of $\boldsymbol{b}$, …) into one long vector

$\boldsymbol{\theta} = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \\ \vdots \end{bmatrix}$


Step 2: define loss from training data

• Loss is a function of the parameters: $L(\boldsymbol{\theta})$.
• Loss means how good a set of values is.

Given a set of values, feed each feature vector into $y = b + \boldsymbol{c}^{\top}\sigma(\boldsymbol{b} + W\boldsymbol{x})$, compare the output with the label $\hat{y}$ to get an error $e$, and average:

$L = \frac{1}{N}\sum_n e_n$

Step 3: Optimization of New Model

$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} L$

• (Randomly) pick initial values $\boldsymbol{\theta}^0$.
• Compute the gradient, the vector of all partial derivatives:

$\boldsymbol{g} = \nabla L(\boldsymbol{\theta}^0) = \begin{bmatrix} \left.\partial L/\partial\theta_1\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \left.\partial L/\partial\theta_2\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \vdots \end{bmatrix}$

• Update all parameters at once, $\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta\boldsymbol{g}$, i.e. $\theta_1^1 \leftarrow \theta_1^0 - \eta\,\partial L/\partial\theta_1$, $\theta_2^1 \leftarrow \theta_2^0 - \eta\,\partial L/\partial\theta_2$, and so on.
• Repeat: compute $\boldsymbol{g} = \nabla L(\boldsymbol{\theta}^1)$ and update $\boldsymbol{\theta}^2 \leftarrow \boldsymbol{\theta}^1 - \eta\boldsymbol{g}$; compute $\boldsymbol{g} = \nabla L(\boldsymbol{\theta}^2)$ and update $\boldsymbol{\theta}^3 \leftarrow \boldsymbol{\theta}^2 - \eta\boldsymbol{g}$; ….


In practice, the $N$ training examples are (randomly) split into batches of size $B$, and each update uses the loss computed on one batch only:

• (Randomly) pick initial values $\boldsymbol{\theta}^0$.
• Compute $\boldsymbol{g} = \nabla L^1(\boldsymbol{\theta}^0)$ on batch 1; update $\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta\boldsymbol{g}$.
• Compute $\boldsymbol{g} = \nabla L^2(\boldsymbol{\theta}^1)$ on batch 2; update $\boldsymbol{\theta}^2 \leftarrow \boldsymbol{\theta}^1 - \eta\boldsymbol{g}$.
• Compute $\boldsymbol{g} = \nabla L^3(\boldsymbol{\theta}^2)$ on batch 3; update $\boldsymbol{\theta}^3 \leftarrow \boldsymbol{\theta}^2 - \eta\boldsymbol{g}$; ….

1 epoch = seeing all the batches once.

Example 1: 10,000 examples ($N$ = 10,000) and batch size 10 ($B$ = 10). How many updates in 1 epoch? 10,000 / 10 = 1,000 updates.

Example 2: 1,000 examples ($N$ = 1,000) and batch size 100 ($B$ = 100). How many updates in 1 epoch? 1,000 / 100 = 10 updates.
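A sketch of the batching bookkeeping for Example 1 (the gradient computation and update themselves are elided):

```python
import numpy as np

N, B = 10_000, 10                     # Example 1: N / B = 1,000 updates per epoch
for epoch in range(3):
    idx = np.random.permutation(N)    # reshuffle the examples each epoch
    for batch in idx.reshape(-1, B):  # each row is one batch of B example indices
        # compute L^k and g = grad(L^k) on these B examples only,
        # then update: theta <- theta - eta * g
        pass
```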

Back to the ML Framework — more variety of models: Sigmoid → ReLU

How else can the hard sigmoid be represented? Exactly, as the sum of two Rectified Linear Units (ReLU), each of the form $c\,\max(0, b + wx_1)$:

$\text{hard sigmoid}(x_1) = c\,\max(0,\, b + wx_1) + c'\,\max(0,\, b' + w'x_1)$
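A sketch of the identity (slopes and breakpoints are arbitrary; with these defaults the curve rises from 0 to c between x1 = 0 and x1 = 1):

```python
def relu(z):
    return max(0.0, z)

def hard_sigmoid(x1, c=1.0, b=0.0, w=1.0):
    # c * max(0, b + w*x1) + c' * max(0, b' + w'*x1) with c' = -c, b' = b - 1, w' = w
    return c * relu(b + w * x1) - c * relu(b - 1.0 + w * x1)

print([hard_sigmoid(x) for x in (-1.0, 0.5, 2.0)])  # [0.0, 0.5, 1.0]
```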


Replacing each sigmoid with two ReLUs in the model:

$y = b + \sum_i c_i\,\mathrm{sigmoid}\Bigl(b_i + \sum_j w_{ij}x_j\Bigr) \;\;\to\;\; y = b + \sum_{2i} c_i\,\max\Bigl(0,\; b_i + \sum_j w_{ij}x_j\Bigr)$

Sigmoid and ReLU are called activation functions. Which one is better? Experimental results:

              linear   10 ReLU   100 ReLU   1000 ReLU
2017 – 2020   0.32k    0.32k     0.28k      0.27k
2021          0.46k    0.45k     0.43k      0.43k

Even more variety of models: apply the same transformation repeatedly. Compute $\boldsymbol{a} = \sigma(\boldsymbol{b} + W\boldsymbol{x})$, then feed it through another stage, $\boldsymbol{a}' = \sigma(\boldsymbol{b}' + W'\boldsymbol{a})$ (with sigmoid or ReLU at each stage), and so on before the final output.
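A sketch of the stacked computation in NumPy (two hidden layers of width 4; all sizes and values are illustrative):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def forward(x, layers, c, b):
    """layers is a list of (W, b_vec) pairs: a <- sigma(b_vec + W a), repeated."""
    a = x
    for W, b_vec in layers:
        a = sigmoid(b_vec + W @ a)
    return b + c @ a                      # final output y

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), rng.normal(size=4)),  # x (3) -> a  (4)
          (rng.normal(size=(4, 4)), rng.normal(size=4))]  # a (4) -> a' (4)
print(forward(np.array([4.8, 4.9, 7.5]), layers, rng.normal(size=4), 0.0))
```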


Experimental Results

• Loss for multiple hidden layers; 100 ReLU for each layer; input features are the no. of views in the past 56 days.

              1 layer   2 layers   3 layers   4 layers
2017 – 2020   0.28k     0.18k      0.14k      0.10k
2021          0.43k     0.39k      0.38k      0.44k

(Figure, 3-layer model: red = real no. of views, blue = estimated no. of views (k), 2021/01/01 – 2021/02/14; the estimate tracks the real curve except for one day, marked '?'.)

Back to the ML Framework — naming

$y = b + \boldsymbol{c}^{\top}\sigma(\boldsymbol{b} + W\boldsymbol{x})$ is not fancy enough, so let's give it a fancy name! Each sigmoid or ReLU unit is a neuron, and many neurons form a neural network. (This mimics human brains … (???)) Each stage of activations $\boldsymbol{a}$ is a hidden layer; many layers means deep — hence deep learning.


Deep = Many hidden layers

AlexNet (2012): 8 layers, 16.4% error rate. VGG (2014): 19 layers, 7.3%. GoogleNet (2014): 22 layers, 6.7%. Residual Net (2015): 152 layers with a special structure, 3.57% — taller than Taipei 101's 101 floors. (Source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf)

Why do we want a "deep" network rather than a "fat" one?

Why don't we go deeper?

• Loss for multiple hidden layers; 100 ReLU for each layer; input features are the no. of views in the past 56 days.

              1 layer   2 layers   3 layers   4 layers
2017 – 2020   0.28k     0.18k      0.14k      0.10k
2021          0.43k     0.39k      0.38k      0.44k

The 4-layer network is better on training data but worse on unseen data: overfitting.


Let's predict no. of views today!

• If we want to select a model for predicting no. of views today, which one will you use? Picking the 3-layer model just because its 2021 loss is lowest amounts to selecting on the test data: split your training data into a training set and a validation set for model selection.

General Guide: first check the loss on training data.

• Training loss large → model bias (make your model more complex) or an optimization issue (next lecture).
• Training loss small → check the loss on testing data.
  • Testing loss large → overfitting — more training data (not an option in the HWs, except HW 11), data augmentation, or make your model simpler (a trade-off against model bias) — or mismatch (see below).
  • Testing loss small → done.

Model Bias

• The model is too simple: the set of candidate functions $f_{\boldsymbol{\theta}}(\boldsymbol{x})$ (e.g. $f_{\boldsymbol{\theta}^1}(\boldsymbol{x})$, $f_{\boldsymbol{\theta}^2}(\boldsymbol{x})$, …) contains no function $f^*(\boldsymbol{x})$ with small loss. Finding the best parameters is like finding a needle in a haystack — but there is no needle: the haystack is too small.
• Solution: redesign your model to make it more flexible.
  • More features: $y = b + wx_1 \;\to\; y = b + \sum_{j=1}^{56} w_jx_j$
  • Deep learning (more neurons, layers): $y = b + \sum_i c_i\,\mathrm{sigmoid}\bigl(b_i + \sum_j w_{ij}x_j\bigr)$


Optimization Issue

• Large loss does not always imply model bias. There is another possibility: the model does contain a function $f_{\boldsymbol{\theta}^*}(\boldsymbol{x})$ with small loss, but gradient descent returns some $f_{\boldsymbol{\theta}}(\boldsymbol{x})$ with large $L$ — a needle is in the haystack, we just cannot find it.
• Given large training loss, which one is it: model bias or an optimization issue?

Model Bias v.s. Optimization Issue

• Gain insight from comparison (ref: http://arxiv.org/abs/1512.03385): in the ResNet paper, a 56-layer network has larger error than a 20-layer one — not only on testing data but also on training data, so it is not overfitting; it is an optimization issue.
• Start from shallower networks (or other models), which are easier to optimize.
• If deeper networks do not obtain smaller loss on training data, then there is an optimization issue. In our example (2017 – 2020 training loss): 1 layer 0.28k, 2 layers 0.18k, 3 layers 0.14k, 4 layers 0.10k, 5 layers 0.34k — the 5-layer network has an optimization issue.
• Solution: more powerful optimization technology (next lecture).


Overfitting

• Small loss on training data, large loss on testing data. Why?

An extreme example: given training data $\{(\boldsymbol{x}^1, \hat{y}^1), (\boldsymbol{x}^2, \hat{y}^2), \ldots, (\boldsymbol{x}^N, \hat{y}^N)\}$, take the "freestyle" function

$f(\boldsymbol{x}) = \begin{cases} \hat{y}^i & \exists\, \boldsymbol{x}^i = \boldsymbol{x} \\ \text{random} & \text{otherwise} \end{cases}$

It obtains zero training loss but large testing loss — less than useless. (In the figures of this part, the real data distribution is not observable; we only see training data and testing data sampled from it.)

Less extreme cases behave similarly: a flexible model can pass through every training point exactly while bending freely in between, producing large loss on testing points that fall between the training points. Remedies: more training data, or data augmentation (creating extra examples from the ones you have).


Alternatively, constrain the model, e.g. restrict it to $y = a + bx + cx^2$: the constrained model cannot contort itself to the training points, so it stays closer to the real distribution. Ways to constrain a model:

• Less parameters, or sharing parameters (e.g. CNN instead of fully-connected)
• Less features
• Early stopping
• Regularization
• Dropout

But do not constrain too much: with $y = a + bx$ the model cannot fit the data at all, and we are back to model bias.

Bias-Complexity Trade-off

As the model becomes more complex (e.g. more features, more parameters), training loss keeps decreasing while testing loss first decreases and then rises. Select the model where testing loss is smallest.


Mismatch

• Your training and testing data have different distributions; be aware of how the data is generated. Simply increasing the training data will not help. Most HWs do not have this problem, except HW11.

Small Gradient …

Gradient descent slows or stops wherever the gradient with respect to a parameter $w$ is small: it is very slow at a plateau ($\partial L/\partial w \approx 0$), and it gets stuck at saddle points and at local minima (both with $\partial L/\partial w = 0$).

Tips for training: Batch and Momentum

Batch


Review: Optimization with Batch

$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} L$. (Randomly) pick initial values $\boldsymbol{\theta}^0$; then, batch by batch: compute $\boldsymbol{g}^0 = \nabla L^1(\boldsymbol{\theta}^0)$ and update $\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta\boldsymbol{g}^0$; compute $\boldsymbol{g}^1 = \nabla L^2(\boldsymbol{\theta}^1)$ and update $\boldsymbol{\theta}^2 \leftarrow \boldsymbol{\theta}^1 - \eta\boldsymbol{g}^1$; compute $\boldsymbol{g}^2 = \nabla L^3(\boldsymbol{\theta}^2)$ and update $\boldsymbol{\theta}^3 \leftarrow \boldsymbol{\theta}^2 - \eta\boldsymbol{g}^2$; …. 1 epoch = see all the batches once; shuffle after each epoch.

Small Batch v.s. Large Batch

Consider 20 examples ($N$ = 20).

• Batch size = $N$ (full batch): update after seeing all 20 examples. Each update sees all examples — a long time for "cooldown" between updates, but a powerful, stable step.
• Batch size = 1: update for each example, i.e. 20 updates in an epoch. Each update sees only one example — a short cooldown, but a noisy step.

(Older versions of these slides: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20(v4).pdf and http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2017/Lecture/Keras.pdf)

• A larger batch size does not require a longer time to compute the gradient, thanks to parallel computing, unless the batch size is too large: for MNIST digit classification on a Tesla V100 GPU, the time for one update stays nearly flat over a wide range of batch sizes before it grows.
• A smaller batch requires a longer time for one epoch (a longer time to see all the data once): on MNIST's 60,000 examples, batch size 1 means 60,000 updates in one epoch, while a large batch needs only 60 updates, so the small-batch epoch is much slower.


Yet the smaller batch size has better performance: on both MNIST and CIFAR-10, accuracy degrades as the batch size grows — on training as well as validation data. What's wrong with large batch size? Optimization fails; it is not overfitting, since the training accuracy drops too.

• The "noisy" update is better for training. With full batch, once gradient descent is stuck where the gradient of $L$ is zero, it stays stuck; with small batches, each update uses a different loss $L^k$, so where $L^1$ is stuck, $L^2$ may still have a gradient — the model remains trainable.
• Small batch is also better on testing data ("On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima", https://arxiv.org/abs/1609.04836; SB = 256 examples, LB = 0.1 × data set size).


One hypothesis: the training and testing loss surfaces are slightly shifted from each other. A flat minimum keeps a small loss under that shift and is good for testing; a sharp minimum becomes a large-loss point and is bad for testing. Noisy small-batch updates tend to escape sharp minima and settle in flat ones, while large batches tend to end in sharp minima.

                                       Small batch   Large batch
Speed for one update (no parallel)     faster        slower
Speed for one update (with parallel)   same          same (if not too large)
Time for one epoch                     slower        faster
Gradient                               noisy         stable
Optimization                           better        worse
Generalization                         better        worse

Batch size is a hyperparameter you have to decide.

Can we have both fish and bear's paws (i.e. the best of both worlds)? Several papers push large-batch training that stays fast and still generalizes:

• Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes (https://arxiv.org/abs/1904.00962)
• Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes (https://arxiv.org/abs/1711.04325)
• Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well (https://arxiv.org/abs/2001.02312)
• Large Batch Training of Convolutional Networks (https://arxiv.org/abs/1708.03888)
• Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (https://arxiv.org/abs/1706.02677)

Momentum


Small Gradient … Consider the physical world: a ball rolling down the error surface does not stop the moment the slope vanishes — its momentum carries it across small dips and plateaus. How about putting this phenomenon into gradient descent?

(Vanilla) Gradient Descent: starting at $\boldsymbol{\theta}^0$, compute gradient $\boldsymbol{g}^0$ and move to $\boldsymbol{\theta}^1 = \boldsymbol{\theta}^0 - \eta\boldsymbol{g}^0$; compute $\boldsymbol{g}^1$ and move to $\boldsymbol{\theta}^2 = \boldsymbol{\theta}^1 - \eta\boldsymbol{g}^1$; ….

Gradient Descent + Momentum

Movement = movement of the last step minus the gradient at present:

• Start at $\boldsymbol{\theta}^0$, with movement $\boldsymbol{m}^0 = \boldsymbol{0}$.
• Compute gradient $\boldsymbol{g}^0$; movement $\boldsymbol{m}^1 = \lambda\boldsymbol{m}^0 - \eta\boldsymbol{g}^0$; move to $\boldsymbol{\theta}^1 = \boldsymbol{\theta}^0 + \boldsymbol{m}^1$.
• Compute gradient $\boldsymbol{g}^1$; movement $\boldsymbol{m}^2 = \lambda\boldsymbol{m}^1 - \eta\boldsymbol{g}^1$; move to $\boldsymbol{\theta}^2 = \boldsymbol{\theta}^1 + \boldsymbol{m}^2$.
• …

Movement is based not just on the current gradient but also on the previous movement. Equivalently, $\boldsymbol{m}^i$ is a weighted sum of all the previous gradients $\boldsymbol{g}^0, \boldsymbol{g}^1, \ldots, \boldsymbol{g}^{i-1}$:

$\boldsymbol{m}^0 = \boldsymbol{0}, \quad \boldsymbol{m}^1 = -\eta\boldsymbol{g}^0, \quad \boldsymbol{m}^2 = -\lambda\eta\boldsymbol{g}^0 - \eta\boldsymbol{g}^1, \quad \ldots$
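A sketch of the update rule on a stand-in one-parameter loss (the values of lambda and eta are illustrative):

```python
lam, eta = 0.9, 0.05          # momentum coefficient and learning rate
grad = lambda w: 2 * (w - 1)  # stand-in gradient with minimum at w = 1
w, m = 5.0, 0.0               # theta0 and m0 = 0
for _ in range(200):
    g = grad(w)
    m = lam * m - eta * g     # m_{t+1} = lambda * m_t - eta * g_t
    w = w + m                 # theta_{t+1} = theta_t + m_{t+1}
print(w)                      # approaches 1.0 (with some overshoot on the way)
```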


With momentum, the real movement is the negative of $\partial L/\partial w$ plus the last movement. Even at a point where $\partial L/\partial w = 0$, the last movement is generally nonzero, so the accumulated momentum can carry the parameters past local minima and saddle points.

Concluding Remarks

• Critical points have zero gradient; they can be either saddle points or local minima.
  • Which of the two can be determined from the Hessian matrix.
  • It is possible to escape saddle points along the directions of eigenvectors of the Hessian matrix.
  • Local minima may be rare.
• Smaller batch sizes and momentum help escape critical points.

Training stuck ≠ Small Gradient

• People believe training gets stuck because the parameters are around a critical point — but plotting the norm of the gradient together with the loss shows that when the loss stops decreasing, the gradient norm is often still large: the error surface is rugged, and the parameters may be bouncing between the walls of a valley rather than sitting at a critical point.

Tips for training: Adaptive Learning Rate


Wait a minute — training can be difficult even without critical points. Consider a convex error surface shaped like a long narrow valley: the learning rate cannot be one-size-fits-all. With a large learning rate ($\eta = 10^{-2}$) the parameters oscillate back and forth across the steep walls and never settle; with a small one ($\eta = 10^{-7}$) the oscillation stops, but progress along the nearly flat valley floor is so slow that even 100,000 updates cannot reach the minimum.

Different parameters need different learning rates: along a gently sloped direction (e.g. $w_1$) we want a larger learning rate, along a steep one (e.g. $w_2$) a smaller one. So make the learning rate parameter-dependent. Formulation for one parameter $\theta_i$, with $g_i^t = \left.\partial L/\partial\theta_i\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^t}$: replace

$\theta_i^{t+1} \leftarrow \theta_i^t - \eta\,g_i^t$

by

$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t}\,g_i^t$

Root Mean Square: take $\sigma_i^t$ to be the root mean square of all gradients so far,

$\sigma_i^t = \sqrt{\frac{1}{t+1}\sum_{k=0}^{t}\bigl(g_i^k\bigr)^2}$

so $\sigma_i^0 = \sqrt{(g_i^0)^2} = |g_i^0|$, $\sigma_i^1 = \sqrt{\frac{1}{2}\bigl[(g_i^0)^2 + (g_i^1)^2\bigr]}$, $\sigma_i^2 = \sqrt{\frac{1}{3}\bigl[(g_i^0)^2 + (g_i^1)^2 + (g_i^2)^2\bigr]}$, …. This rule is used in Adagrad: a parameter with consistently small gradients gets a small $\sigma_i^t$ and thus larger steps, while a parameter with large gradients gets smaller steps.
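A sketch of the Adagrad-style rule for one parameter (stand-in gradient; eps guards the division):

```python
eta, eps = 0.5, 1e-8
grad = lambda w: 2 * (w - 1)           # stand-in gradient with minimum at w = 1
w, sq_sum = 5.0, 0.0
for t in range(200):
    g = grad(w)
    sq_sum += g ** 2
    sigma = (sq_sum / (t + 1)) ** 0.5  # root mean square of g^0 .. g^t
    w -= eta / (sigma + eps) * g       # theta <- theta - (eta / sigma) * g
print(w)                               # slowly approaches 1.0
```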


Learning rate adapts dynamically: the error surface can be very complex, so even for the same parameter the appropriate learning rate changes over time — larger steps where the surface is gentle, smaller steps where it is steep. Averaging over the entire gradient history, as Adagrad does, reacts to such changes only slowly.

RMSProp

$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t}\,g_i^t$, with $0 < \alpha < 1$ and

$\sigma_i^0 = \sqrt{(g_i^0)^2}, \qquad \sigma_i^t = \sqrt{\alpha\,(\sigma_i^{t-1})^2 + (1 - \alpha)\,(g_i^t)^2}$

The recent gradient has larger influence and the past gradients have less. When the surface suddenly steepens, $(g_i^t)^2$ grows, $\sigma_i^t$ increases quickly, and the step shrinks; when the surface flattens, $\sigma_i^t$ decreases and the step grows.
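A sketch of the RMSProp-style rule, initializing sigma from the first gradient as on the slide (the alpha and eta values are illustrative):

```python
alpha, eta, eps = 0.9, 0.1, 1e-8
grad = lambda w: 2 * (w - 1)   # stand-in gradient with minimum at w = 1
w, sq = 5.0, None              # sq holds sigma^2
for _ in range(200):
    g = grad(w)
    sq = g ** 2 if sq is None else alpha * sq + (1 - alpha) * g ** 2
    w -= eta / (sq ** 0.5 + eps) * g   # recent gradients dominate sigma
print(w)                       # settles in a small neighbourhood of 1.0
```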


Adam: RMSProp + Momentum (original paper: https://arxiv.org/pdf/1412.6980.pdf). The update combines $m_i^t$ (the momentum term, which keeps directions) with $\sigma_i^t$ (the RMSProp root mean square, magnitude only):

$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t}\,m_i^t$
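In practice these updates come ready-made; a sketch of picking Adam in PyTorch (the model and the learning rate are placeholders):

```python
import torch

model = torch.nn.Linear(3, 1)                        # any model with parameters
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # momentum (m) + RMSProp (sigma)

x, label = torch.randn(8, 3), torch.randn(8, 1)      # toy batch
loss = ((model(x) - label) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```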

Learning Rate Scheduling

Make the learning rate itself time-dependent:

$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta^t}{\sigma_i^t}\,g_i^t$

• Learning Rate Decay: as the training goes, we are closer to the destination, so we reduce the learning rate.
• Warm Up: increase and then decrease the learning rate?


Warm up appears in classic architectures: both the Residual Network paper (https://arxiv.org/abs/1512.03385) and the Transformer paper (https://arxiv.org/abs/1706.03762) use it. One explanation: at the beginning of training, the estimate of $\sigma_i^t$ has large variance (it is based on very few gradients), so keep the learning rate small until the statistic becomes reliable, then decay as usual. Please refer to RAdam (https://arxiv.org/abs/1908.03265).
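A sketch of a warm-up-then-decay schedule; the shape is the point — the peak value and warm-up length are illustrative, not taken from the papers:

```python
def lr_at(step, peak=1e-3, warmup=1000):
    if step < warmup:
        return peak * step / warmup       # linearly increase ...
    return peak * (warmup / step) ** 0.5  # ... then decay

print(lr_at(500), lr_at(1000), lr_at(4000))  # rising, at the peak, decayed to half
```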

Summary of Optimization

(Vanilla) gradient descent: $\theta_i^{t+1} \leftarrow \theta_i^t - \eta\,g_i^t$.

Various improvements: $\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta^t}{\sigma_i^t}\,m_i^t$

• $\eta^t$: learning rate scheduling.
• $m_i^t$: momentum, a weighted sum of the previous gradients that keeps their directions.
• $\sigma_i^t$: root mean square of the gradients — magnitude only, no direction, so it does not simply cancel the momentum in the numerator.

Classification as Regression?

• Regression: $\boldsymbol{x} \to$ Model $\to y$, compared with the label $\hat{y}$.
• Classification as regression? $\boldsymbol{x} \to$ Model $\to y$, with the class encoded as a number ($\hat{y}$ = 1 for class 1, 2 for class 2, 3 for class 3)? This forces relationships that may not exist: it treats classes 1 and 2 as similar and classes 1 and 3 as different, yet the classes may have no such ordering.


Class as one-hot vector: represent the label as

$\hat{\boldsymbol{y}} = \begin{bmatrix}1\\0\\0\end{bmatrix} \text{(class 1)}, \quad \begin{bmatrix}0\\1\\0\end{bmatrix} \text{(class 2)}, \quad \begin{bmatrix}0\\0\\1\end{bmatrix} \text{(class 3)}$

so no ordering between classes is implied. The network so far outputs only one value $y$; to output multiple values, reuse the same hidden activations $a_1, a_2, a_3$ with three sets of output weights, producing $y_1, y_2, y_3$ — one per class.

Regression: $y = b + \boldsymbol{c}^{\top}\sigma(\boldsymbol{b} + W\boldsymbol{x})$.

Classification: $\boldsymbol{y} = \boldsymbol{b}' + W'\,\sigma(\boldsymbol{b} + W\boldsymbol{x})$, followed by $\boldsymbol{y}' = \mathrm{softmax}(\boldsymbol{y})$, and $\boldsymbol{y}'$ is compared with the one-hot label $\hat{\boldsymbol{y}}$ (whose entries are 0 or 1). The entries of $\boldsymbol{y}$ (the logits) can have any value; softmax makes all values fall between 0 and 1:

$y_i' = \frac{\exp(y_i)}{\sum_j \exp(y_j)}, \qquad 1 > y_i' > 0, \qquad \sum_i y_i' = 1$

Example: $\boldsymbol{y} = (3, 1, -3)$ → $\exp$ → $(20, 2.7, \approx 0)$ → normalize → $(0.88, 0.12, \approx 0)$.

How about binary classification? (A sigmoid output works there — it is equivalent to a two-class softmax. :)
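A sketch of softmax in NumPy, reproducing the slide's example:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())  # subtracting the max avoids overflow; result unchanged
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])).round(2))  # [0.88 0.12 0.  ]
```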


Loss of Classification

$L = \frac{1}{N}\sum_n e_n$, where $e$ now compares the label $\hat{\boldsymbol{y}}$ with the network output $\boldsymbol{y}' = \mathrm{softmax}(\boldsymbol{y})$:

• Mean square error (MSE): $e = \sum_i (\hat{y}_i - y_i')^2$
• Cross-entropy: $e = -\sum_i \hat{y}_i \ln y_i'$

Minimizing cross-entropy is equivalent to maximizing likelihood.

Example: a 3-class network with logits $y_1 \in [-10, 10]$, $y_2 \in [-10, 10]$ and $y_3 = -1000$ (so $y_3' \approx 0$), with target $\hat{\boldsymbol{y}} = [1\;0\;0]^{\top}$. Plot the error surface over $(y_1, y_2)$: both losses are large where $y_1$ is small and $y_2$ is large, and small in the opposite corner. With MSE, the large-loss region is flat, so training that starts there gets stuck; with cross-entropy, the same region still slopes toward the small-loss corner. Changing the loss function can change the difficulty of optimization.

(Demo: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Deep%20More%20(v2).ecm.mp4/index.html)
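Both errors on the slide's example, with the softmax sketch repeated so the snippet stands alone:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

y_hat = np.array([1.0, 0.0, 0.0])           # one-hot label, class 1
y_p = softmax(np.array([3.0, 1.0, -3.0]))   # network output y'
mse = ((y_hat - y_p) ** 2).sum()            # e = sum_i (y_hat_i - y'_i)^2  (~0.03)
ce = -(y_hat * np.log(y_p)).sum()           # e = -sum_i y_hat_i * ln y'_i  (~0.13)
print(mse, ce)
```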

Thanks to Prof. Hung-yi Lee for providing the slides.