
CIV2020 – Week#9

[CIV7084] 2021-2 Special Topics in Deep Learning for Construction – Lecture Notes
(Optimizer and regularization slides adapted from Stanford CS231n)

01. Parameter Updates


Backpropagation through the network using the Gradient Descent Method

e.g. f(x, y, z) = (x + y) · z, with x = -2, y = 5, z = -4

Want: the gradients ∂f/∂x, ∂f/∂y, ∂f/∂z

[Figure: backpropagation through a single gate — the incoming activations, the gate's "local gradient", and the upstream gradient are combined by the chain rule and passed back.]
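As a worked illustration of the example above (assuming the function is f(x, y, z) = (x + y) · z, as the given values suggest), the forward and backward passes can be written out directly:

    # forward pass for f(x, y, z) = (x + y) * z with the example values
    x, y, z = -2.0, 5.0, -4.0
    q = x + y                # q = 3
    f = q * z                # f = -12

    # backward pass: chain rule, multiplying each local gradient by the upstream gradient
    df_dq = z                # local gradient of f w.r.t. q is z = -4
    df_dz = q                # local gradient of f w.r.t. z is q = 3
    df_dx = df_dq * 1.0      # dq/dx = 1, so df/dx = -4
    df_dy = df_dq * 1.0      # dq/dy = 1, so df/dy = -4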

1) Stochastic Gradient Descent (SGD)

Training a neural network, main loop: evaluate the loss on a batch of data, backpropagate to get the gradient, and apply a simple gradient descent update to the parameters. Up to now the update has been this simple step; from here on, the update rule gets more sophisticated.
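A minimal sketch of the simple gradient descent update on a toy objective (the quadratic loss and the names x, dx, learning_rate are illustrative stand-ins for a real network's parameters and backprop):

    import numpy as np

    def evaluate_gradient(x):
        return 2 * x                  # toy gradient of f(x) = ||x||^2

    x = np.array([1.5, -2.0])         # "parameters"
    learning_rate = 0.1

    for step in range(100):
        dx = evaluate_gradient(x)     # in a real network this comes from backprop
        x += -learning_rate * dx      # simple (vanilla) gradient descent update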

[Figure: animations of different parameter update methods on example loss surfaces. Image credits: Alec Radford.]

Suppose the loss function is steep vertically but shallow horizontally:

Q: What is the trajectory along which we converge towards the minimum with SGD?
A: very slow progress along the shallow (horizontal) direction, with jitter back and forth along the steep (vertical) direction.

2) Momentum Optimizer

Momentum update

- Physical interpretation: a ball rolling down the loss surface, with friction (coefficient mu).
- mu is usually ~0.5, 0.9, or 0.99 (sometimes annealed over time, e.g. from 0.5 -> 0.99).
- Allows a velocity to "build up" along shallow directions.
- Velocity becomes damped in steep directions because the gradient quickly changes sign.

SGD vs. Momentum: notice momentum overshooting the target, but overall getting to the minimum much faster.
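A minimal sketch of the momentum update on the same kind of toy objective (the mu and learning_rate values are illustrative):

    import numpy as np

    def evaluate_gradient(x):
        return 2 * x                      # toy gradient of f(x) = ||x||^2

    x = np.array([1.5, -2.0])
    v = np.zeros_like(x)                  # velocity, initialized to zero
    mu, learning_rate = 0.9, 0.1

    for step in range(100):
        dx = evaluate_gradient(x)
        v = mu * v - learning_rate * dx   # integrate velocity (mu acts as friction)
        x += v                            # integrate position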

3) Nesterov Momentum

Nesterov Momentum update

[Figure: momentum update vs. Nesterov momentum update. Ordinary momentum combines a momentum step with a gradient step evaluated at the current point to form the actual step; Nesterov first takes the momentum step and then evaluates a "lookahead" gradient step at that shifted point (a bit different from the original gradient), and combines those into the actual step. Nesterov: the lookahead gradient is the only difference.]

Written out, the Nesterov momentum update is:

v_t = mu * v_{t-1} - lr * ∇f(theta_{t-1} + mu * v_{t-1})
theta_t = theta_{t-1} + v_t

Slightly inconvenient… usually we have the loss and gradient evaluated at theta_{t-1} itself, not at the lookahead point theta_{t-1} + mu * v_{t-1}.

Variable transform and rearranging saves the day: let phi_{t-1} = theta_{t-1} + mu * v_{t-1}. Replace all thetas with phis, rearrange and obtain:

v_t = mu * v_{t-1} - lr * ∇f(phi_{t-1})
phi_t = phi_{t-1} - mu * v_{t-1} + (1 + mu) * v_t
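A sketch of the transformed (phi) form, which only needs the gradient at the current parameter vector (same toy objective as before):

    import numpy as np

    def evaluate_gradient(x):
        return 2 * x                           # toy gradient of f(x) = ||x||^2

    x = np.array([1.5, -2.0])                  # these play the role of the phi variables
    v = np.zeros_like(x)
    mu, learning_rate = 0.9, 0.1

    for step in range(100):
        dx = evaluate_gradient(x)
        v_prev = v
        v = mu * v - learning_rate * dx
        x += -mu * v_prev + (1 + mu) * v       # Nesterov update in the phi variables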

[Figure: optimizer-trajectory animation; in the legend, "nag" = Nesterov Accelerated Gradient.]

4) AdaGrad [Duchi et al., 2011]

AdaGrad update: adds element-wise scaling of the gradient based on the historical sum of squares in each dimension.

Q: What happens with AdaGrad? Dimensions with large gradients get their effective learning rate scaled down, while dimensions with small gradients get it scaled up.

Q2: What happens to the step size over long time? The accumulated sum of squares only grows, so the step size decays toward zero and learning eventually stalls.
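A minimal sketch of the AdaGrad update (eps is the usual small constant that avoids division by zero):

    import numpy as np

    def evaluate_gradient(x):
        return 2 * x                           # toy gradient

    x = np.array([1.5, -2.0])
    cache = np.zeros_like(x)                   # per-dimension sum of squared gradients
    learning_rate, eps = 0.1, 1e-7

    for step in range(100):
        dx = evaluate_gradient(x)
        cache += dx ** 2                       # grows monotonically -> step size decays
        x += -learning_rate * dx / (np.sqrt(cache) + eps)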

5) RMSProp [Tieleman and Hinton, 2012]

RMSProp update: like AdaGrad, but the squared-gradient cache is a leaky, exponentially decaying moving average instead of a running sum, so the step size does not decay to zero.

Introduced in a slide in Geoff Hinton's Coursera class, lecture 6, and cited by several papers as that lecture slide.

[Figure: training loss curves comparing adagrad and rmsprop.]
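A minimal sketch of the RMSProp update (a decay_rate of 0.99 is a typical illustrative value):

    import numpy as np

    def evaluate_gradient(x):
        return 2 * x                           # toy gradient

    x = np.array([1.5, -2.0])
    cache = np.zeros_like(x)
    learning_rate, decay_rate, eps = 0.01, 0.99, 1e-7

    for step in range(100):
        dx = evaluate_gradient(x)
        cache = decay_rate * cache + (1 - decay_rate) * dx ** 2   # leaky average
        x += -learning_rate * dx / (np.sqrt(cache) + eps)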

6) Adam [Kingma and Ba, 2014]

Adam update: a momentum-like estimate of the gradient combined with an RMSProp-like estimate of its square — it looks a bit like RMSProp with momentum.

Bias correction (only relevant in the first few iterations, when t is small) compensates for the fact that m and v are initialized at zero and need some time to "warm up".
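A minimal sketch of the Adam update, including the bias correction (the beta1, beta2, eps values shown are the commonly used defaults):

    import numpy as np

    def evaluate_gradient(x):
        return 2 * x                           # toy gradient

    x = np.array([1.5, -2.0])
    m = np.zeros_like(x)                       # first moment (momentum-like)
    v = np.zeros_like(x)                       # second moment (RMSProp-like)
    learning_rate, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

    for t in range(1, 101):
        dx = evaluate_gradient(x)
        m = beta1 * m + (1 - beta1) * dx
        v = beta2 * v + (1 - beta2) * dx ** 2
        m_hat = m / (1 - beta1 ** t)           # bias correction while m "warms up"
        v_hat = v / (1 - beta2 ** t)           # bias correction while v "warms up"
        x += -learning_rate * m_hat / (np.sqrt(v_hat) + eps)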

7) Other methods

Second-order optimization methods

Second-order Taylor expansion of the loss around the current parameters theta_0:

J(theta) ≈ J(theta_0) + (theta - theta_0)^T ∇J(theta_0) + 1/2 (theta - theta_0)^T H (theta - theta_0)

Solving for the critical point we obtain the Newton parameter update:

theta* = theta_0 - H^{-1} ∇J(theta_0)

- Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).
- L-BFGS (Limited-memory BFGS): does not form/store the full inverse Hessian.

L-BFGS in practice:
- Usually works very well in full-batch, deterministic mode, i.e. if you have a single, deterministic f(x), then L-BFGS will probably work very nicely.
- Does not transfer very well to the mini-batch setting and gives bad results. Adapting L-BFGS to the large-scale, stochastic setting is an active area of research.
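A small sketch of trying L-BFGS on a full-batch, deterministic objective via SciPy (the quadratic objective is a made-up stand-in for a noise-free loss):

    import numpy as np
    from scipy.optimize import minimize

    def loss(w):
        return np.sum((w - 3.0) ** 2)          # deterministic, full-batch objective

    def grad(w):
        return 2.0 * (w - 3.0)

    result = minimize(loss, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
    print(result.x)                            # converges close to [3, 3, 3, 3, 3]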

Which parameter update method is best?

SGD, SGD+Momentum, Adagrad, RMSProp, Adam

Which learning rate gets the most out of these parameter update methods?

SGD, SGD+Momentum, Adagrad, RMSProp, and Adam all have the learning rate as a hyperparameter.

Q: Which one of these learning rates is best to use?
=> Decay the learning rate over time! Common schedules (sketched below):

- step decay: e.g. decay the learning rate by half every few epochs.
- exponential decay: alpha = alpha_0 * exp(-k * t)
- 1/t decay: alpha = alpha_0 / (1 + k * t)
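A minimal sketch of the three schedules (alpha0, k and the drop interval are illustrative values):

    import numpy as np

    alpha0, k = 0.1, 0.05

    def step_decay(epoch, drop_every=10):
        return alpha0 * (0.5 ** (epoch // drop_every))   # halve every few epochs

    def exponential_decay(t):
        return alpha0 * np.exp(-k * t)

    def one_over_t_decay(t):
        return alpha0 / (1 + k * t)

    for epoch in (0, 10, 50):
        print(step_decay(epoch), exponential_decay(epoch), one_over_t_decay(epoch))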

Summary

In practice:
- Adam is a good default choice in most cases.
- If you can afford to do full-batch updates, then try out L-BFGS (and don't forget to disable all sources of noise).

00. Review
The effect of using regularization

What happens when regularization is used?
→ The model becomes able to generalize.

Ways to prevent overfitting

Weight regularization:
- L2 regularization
- L1 regularization
- Elastic net (L1 + L2)

Dropout

02. Dropout

Regularization: Dropout
“randomly set some neurons to zero in the forward pass”

[Srivastava et al., 2014]


Example forward pass with a 3-layer network using dropout:
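In the spirit of the CS231n example, a sketch of that forward pass (the toy shapes, random weights W1..W3 and biases b1..b3 are made up for illustration; p is the probability of keeping a unit):

    import numpy as np

    p = 0.5                                      # probability of keeping a unit active

    def train_step(X, W1, b1, W2, b2, W3, b3):
        H1 = np.maximum(0, np.dot(W1, X) + b1)   # first hidden layer (ReLU)
        U1 = np.random.rand(*H1.shape) < p       # first dropout mask
        H1 *= U1                                 # drop!
        H2 = np.maximum(0, np.dot(W2, H1) + b2)  # second hidden layer (ReLU)
        U2 = np.random.rand(*H2.shape) < p       # second dropout mask
        H2 *= U2                                 # drop!
        return np.dot(W3, H2) + b3               # output scores

    # toy shapes: 3072-dim input, two hidden layers of 50 units, 10 classes
    X = np.random.randn(3072)
    W1, b1 = 0.01 * np.random.randn(50, 3072), np.zeros(50)
    W2, b2 = 0.01 * np.random.randn(50, 50), np.zeros(50)
    W3, b3 = 0.01 * np.random.randn(10, 50), np.zeros(10)
    scores = train_step(X, W1, b1, W2, b2, W3, b3)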

Waaaait a second… How could this possibly be a good idea?

It forces the network to have a redundant representation.

[Figure: a cat classifier whose features — "has an ear", "has a tail", "is furry", "has claws", "mischievous look" — feed into the cat score; the features marked X are randomly dropped, so no single feature can be relied on.]

Another interpretation: Dropout is training a large ensemble of models (that share parameters). Each binary mask is one model, and it gets trained on only ~one datapoint.

At test time…

Ideally: we want to integrate out all the noise.

Monte Carlo approximation: do many forward passes with different dropout masks and average all the predictions.

At test time…

Can in fact do this with a single forward pass! (approximately)
Leave all input neurons turned on (no dropout).

Consider one output neuron a with inputs x, y, weights w0, w1, and dropout (p = 0.5) applied to the inputs during training:

during test:  a = w0*x + w1*y
during train: E[a] = ¼ * (w0*0 + w1*0
                        + w0*0 + w1*y
                        + w0*x + w1*0
                        + w0*x + w1*y)
                   = ¼ * (2 w0*x + 2 w1*y)
                   = ½ * (w0*x + w1*y)

With p = 0.5, using all inputs in the forward pass at test time would inflate the activations by 2x from what the network was "used to" during training, so we have to compensate by scaling the activations back down by ½.

We can do something approximate analytically.

At test time all neurons are always active
=> we must scale the activations so that, for each neuron:
   output at test time = expected output at training time.

Dropout summary:
- drop units in the forward pass during training
- scale the activations at test time
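A sketch of the matching test-time forward pass: every unit stays on and the activations are scaled by p, so each neuron's output equals its expected output during training (same made-up W1..W3, b1..b3 as in the training sketch above):

    import numpy as np

    p = 0.5

    def predict(X, W1, b1, W2, b2, W3, b3):
        H1 = np.maximum(0, np.dot(W1, X) + b1) * p   # NOTE: scale the activations
        H2 = np.maximum(0, np.dot(W2, H1) + b2) * p  # NOTE: scale the activations
        return np.dot(W3, H2) + b3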

More common: "inverted dropout" — scale at training time instead, so test time is unchanged!
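A sketch of inverted dropout under the same made-up network: the mask is divided by p at training time, so the test-time code needs no scaling at all:

    import numpy as np

    p = 0.5

    def train_step(X, W1, b1, W2, b2, W3, b3):
        H1 = np.maximum(0, np.dot(W1, X) + b1)
        U1 = (np.random.rand(*H1.shape) < p) / p     # drop and scale at train time
        H1 *= U1
        H2 = np.maximum(0, np.dot(W2, H1) + b2)
        U2 = (np.random.rand(*H2.shape) < p) / p     # drop and scale at train time
        H2 *= U2
        return np.dot(W3, H2) + b3

    def predict(X, W1, b1, W2, b2, W3, b3):
        H1 = np.maximum(0, np.dot(W1, X) + b1)       # test time is unchanged!
        H2 = np.maximum(0, np.dot(W2, H1) + b2)
        return np.dot(W3, H2) + b3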

03. Cross Validation

Trying out which hyperparameters work best on the test set?

Very bad idea. The test set is a proxy for generalization performance! Use it only VERY SPARINGLY, at the end.

Cross-validation: split the training data into folds, cycle through the choice of which fold is the validation fold, and average the results.
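A minimal k-fold sketch of that idea (train_and_evaluate is a hypothetical callback that trains on one split and returns a validation score):

    import numpy as np

    def cross_validate(X, y, train_and_evaluate, k=5):
        folds_X = np.array_split(X, k)
        folds_y = np.array_split(y, k)
        scores = []
        for i in range(k):                                    # cycle the validation fold
            X_val, y_val = folds_X[i], folds_y[i]
            X_tr = np.concatenate(folds_X[:i] + folds_X[i + 1:])
            y_tr = np.concatenate(folds_y[:i] + folds_y[i + 1:])
            scores.append(train_and_evaluate(X_tr, y_tr, X_val, y_val))
        return np.mean(scores)                                # average the results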

04. The Deep Learning Model Training Process
Developing models with various hyperparameters

Step 1: Set the objective of training the deep learning model

• Objective: classify the MNIST data
• Desired outcome of the development:
 - a deep learning model that, given an image of a single handwritten digit, automatically classifies which digit it shows
• Appropriate learning methodology for the development:
 - supervised learning
• Type and form of the data for training:
 - '0' (Class 1)
 - '1' (Class 2)
 - '2' (Class 3)
 - '…' (Class m)
 - '9' (Class 10)

Step 2: Preprocess the input data

• Labeling
 * dog (Class 1)
 * cat (Class 2)
 * all other images (Class 3)

[Figure: an example training pair — the X data is an M×N matrix of pixel values, and the Y data is a 3×1 one-hot label vector, e.g. (1, 0, 0) for Class 1.]

• Data normalization
 * Gaussian normalization
 * Min-max normalization
 * Centering, etc.
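A small sketch of these preprocessing options on a data matrix X of shape (N, D); the toy data and the small epsilon guards are added for illustration:

    import numpy as np

    X = np.random.rand(100, 20) * 255.0               # toy data, e.g. raw pixel values

    X_centered = X - X.mean(axis=0)                    # centering: zero mean per feature

    X_gaussian = X_centered / (X.std(axis=0) + 1e-8)   # Gaussian (z-score) normalization

    X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-8)  # min-max to [0, 1]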

Step 3: Construct the deep learning model

[Figure: a small fully connected network — input layer (CIFAR-10 images, 3072 numbers), one hidden layer (50 hidden neurons), output layer (10 output neurons, one per class).]
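A sketch of that 3072-50-10 network as a single forward pass (random weights, ReLU hidden layer; purely illustrative, not a trained model):

    import numpy as np

    D, H, C = 3072, 50, 10                    # input size, hidden neurons, classes

    x = np.random.randn(D)                    # one flattened CIFAR-10-sized image
    W1, b1 = 0.01 * np.random.randn(H, D), np.zeros(H)
    W2, b2 = 0.01 * np.random.randn(C, H), np.zeros(C)

    h = np.maximum(0, W1 @ x + b1)            # hidden layer (ReLU)
    scores = W2 @ h + b2                      # 10 class scores, one per class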

• Network architecture
 * Recurrent Neural Network
 * Deep Neural Network
 * Convolutional Neural Network

Step 4: Hyperparameter fine-tuning

• Activation function (Sigmoid, tanh, ReLU, …)
• Gradient method (Adam, RMSProp, …)
• Learning rate (0.01, 0.001, 0.0001, …)
• Regularization (L1, L2, Dropout, …)

While the model is training, the loss normally keeps getting smaller.

What if the loss does not decrease (or the accuracy does not increase) even as the number of epochs grows?
→ the learning rate is too low.

What if the loss blows up to Inf?
→ the learning rate is too large.

Step 5: Analyze the training results

• Check the loss and accuracy curves:
 - Training loss and accuracy
 - Validation loss and accuracy

[Figure: a loss-vs-time curve that stays flat before dropping — bad initialization is a prime suspect.]

• Comparing the training and validation results makes it possible to infer overfitting:
 - big gap = overfitting => increase regularization strength?
 - no gap => increase model capacity?

Step 6: Evaluate model accuracy

• Confusion matrix

                    Predicted Class 1    Predicted Class 2
 Actual Class 1            TP                   FN
 Actual Class 2            FP                   TN

 Accuracy  = (TP + TN) / (TP + TN + FP + FN)
 Precision = TP / (TP + FP)
 Recall    = TP / (TP + FN)
 F1 score  = 2 × (Precision × Recall) / (Precision + Recall)
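A minimal sketch computing these four metrics from binary predictions (1 = Class 1, 0 = Class 2; the example arrays are made up):

    import numpy as np

    y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])   # actual classes (illustrative)
    y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # predicted classes (illustrative)

    TP = np.sum((y_pred == 1) & (y_true == 1))
    TN = np.sum((y_pred == 0) & (y_true == 0))
    FP = np.sum((y_pred == 1) & (y_true == 0))
    FN = np.sum((y_pred == 0) & (y_true == 1))

    accuracy  = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    f1_score  = 2 * precision * recall / (precision + recall)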
Thank you.
