Deep Learning: Computer Science and Engineering

This document discusses regularization techniques in deep learning. It introduces regularization, which modifies learning algorithms to reduce generalization error without affecting training error. Common regularization strategies include parameter norm penalties like L2 regularization (weight decay), which adds a penalty term for high parameter norms to the objective function. This limits model capacity and prevents unbounded parameter growth. The document analyzes the effect of L2 regularization on the learning updates and shows that it shrinks parameters along eigenvectors of the Hessian matrix. In deep learning, regularization is widely used to reduce overfitting and obtain a well-generalizing model.


Deep Learning (CS60010)
Regularization and Batch Normalization
27 Feb 2020

Abir Das
Assistant Professor
Computer Science and Engineering Department
Indian Institute of Technology Kharagpur

http://cse.iitkgp.ac.in/~adas/

Agenda
• Introduce the concepts of
• Regularization
• Dropout
• Batch normalization

• Resource: Goodfellow Book (Chapter 7)


Recap: The Bias-Variance Decomposition

With a training set $\mathcal{D}$ of $n$ points, the learned hypothesis $g_n^{\mathcal{D}}$, the target function $f$, and the average hypothesis $\bar{g}(\boldsymbol{x}) = \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})\big]$:

$$
\begin{aligned}
E_{out}(\boldsymbol{x}) &= \mathbb{E}_{\mathcal{D}}\Big[\big(g_n^{\mathcal{D}}(\boldsymbol{x}) - f(\boldsymbol{x})\big)^2\Big] \\
&= \mathbb{E}_{\mathcal{D}}\Big[f(\boldsymbol{x})^2 - 2 f(\boldsymbol{x})\, g_n^{\mathcal{D}}(\boldsymbol{x}) + g_n^{\mathcal{D}}(\boldsymbol{x})^2\Big] \\
&= f(\boldsymbol{x})^2 - 2 f(\boldsymbol{x})\, \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})\big] + \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big] \\
&= f(\boldsymbol{x})^2 - 2 f(\boldsymbol{x})\, \bar{g}(\boldsymbol{x}) + \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big] \\
&= f(\boldsymbol{x})^2 - 2 f(\boldsymbol{x})\, \bar{g}(\boldsymbol{x}) + \bar{g}(\boldsymbol{x})^2 - \bar{g}(\boldsymbol{x})^2 + \mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big] \\
&= \underbrace{\big(f(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\big)^2}_{\text{Bias}} + \underbrace{\mathbb{E}_{\mathcal{D}}\big[g_n^{\mathcal{D}}(\boldsymbol{x})^2\big] - \bar{g}(\boldsymbol{x})^2}_{\text{Variance}}
\end{aligned}
$$

So $E_{out}(\boldsymbol{x}) = \text{Bias} + \text{Variance}$.

Slide motivation: Malik Magdon-Ismail
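
A small Monte Carlo check of this decomposition (a toy sketch: the sinusoidal target, straight-line hypothesis class, and sample sizes are hypothetical choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                                 # target function (toy choice)
x_eval = np.linspace(-np.pi, np.pi, 200)   # points at which E_out(x) is estimated
n, trials = 5, 2000                        # dataset size n and number of sampled datasets D

preds = np.empty((trials, x_eval.size))
for t in range(trials):
    x = rng.uniform(-np.pi, np.pi, n)      # sample a dataset D of n points
    y = f(x)
    a, b = np.polyfit(x, y, 1)             # g_n^D: fit a straight line to D
    preds[t] = a * x_eval + b

g_bar = preds.mean(axis=0)                 # g_bar(x) ~ E_D[g_n^D(x)]
bias = np.mean((f(x_eval) - g_bar) ** 2)                 # averaged over x
variance = np.mean(preds.var(axis=0))                    # E_D[g^2] - g_bar^2, averaged over x
e_out = np.mean((preds - f(x_eval)) ** 2)                # should equal bias + variance
print(f"bias={bias:.3f}  variance={variance:.3f}  "
      f"bias+var={bias + variance:.3f}  E_out={e_out:.3f}")
```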



Recap: Bias-Variance Trade-off

[Figure: learning curves of test error and training error versus the number of data points N, annotated with the bias and variance contributions.]


Regularization
• Machine learning is concerned more with performance on the test data than on the training data.

• According to the Goodfellow book, Chapter 7: "Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are collectively known as regularization."

• Also in the book, regularization is defined as "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error."


Regularization Strategies
• Adding restrictions on parameter values

• Adding constraints that are designed to encode specific kinds of prior knowledge

• Use of ensemble methods/dropout

• Dataset augmentation

• In practical deep learning scenarios, we almost always find that the best-fitting model (in the sense of minimizing generalization error) is a large model that has been regularized appropriately.


Parameter Norm Penalties

• The most traditional form of regularization applied to deep learning is to add a penalty for a high norm of the parameters.

• This approach limits the capacity of the model by adding a penalty term to the objective function, resulting in
$$\tilde{J}(\boldsymbol{\theta}; \boldsymbol{X}, \boldsymbol{y}) = J(\boldsymbol{\theta}; \boldsymbol{X}, \boldsymbol{y}) + \alpha\, \Omega(\boldsymbol{\theta})$$
where $\Omega(\boldsymbol{\theta})$ is the parameter norm penalty and $\alpha \in [0, \infty)$ is a hyperparameter weighting its contribution relative to the standard objective $J$.

• When the optimization procedure tries to minimize the regularized objective, it also keeps the parameters from growing in an unbounded manner, thus restricting the complexity of the model.
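
As a small illustrative sketch (the linear model, squared-error loss, and variable names here are hypothetical, not from the slides), the regularized objective for an L2 penalty could be computed as:

```python
import numpy as np

def regularized_loss(w, X, y, alpha):
    """J~(w; X, y) = J(w; X, y) + alpha * Omega(w), with
    J = mean squared error and Omega(w) = 0.5 * ||w||^2."""
    data_loss = 0.5 * np.mean((X @ w - y) ** 2)   # J(w; X, y)
    penalty = 0.5 * np.sum(w ** 2)                # Omega(w), an L2 norm penalty
    return data_loss + alpha * penalty
```

Larger values of `alpha` push the minimizer toward smaller-norm weights; `alpha = 0` recovers the unregularized objective.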

Parameter Norm Penalties

• The two most common choices of penalty are the L2 norm (also known as weight decay in the deep learning community) and the L1 norm.

• In neural networks, we typically choose only the weights $\boldsymbol{w}$, not the biases, as the parameters to regularize.

• Regularizing the bias parameters can introduce a significant amount of underfitting.

• Thus, for neural networks, the regularized objective is written in terms of the weights alone:
$$\tilde{J}(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y}) = J(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y}) + \alpha\, \Omega(\boldsymbol{w})$$


L-2 Parameter Norm Regularization

• The L2 parameter norm penalty $\Omega(\boldsymbol{w}) = \frac{1}{2}\lVert\boldsymbol{w}\rVert_2^2$ is commonly known as weight decay.

• We can gain some insight into the behavior of weight decay regularization by studying the gradient of the regularized objective function
$$\tilde{J}(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y}) = \frac{\alpha}{2}\boldsymbol{w}^\top\boldsymbol{w} + J(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y})$$

• The gradient is
$$\nabla_{\boldsymbol{w}}\tilde{J}(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y}) = \alpha\boldsymbol{w} + \nabla_{\boldsymbol{w}} J(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y})$$

• So, the update step (with learning rate $\epsilon$) is
$$\boldsymbol{w} \leftarrow \boldsymbol{w} - \epsilon\big(\alpha\boldsymbol{w} + \nabla_{\boldsymbol{w}} J(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y})\big) = (1 - \epsilon\alpha)\,\boldsymbol{w} - \epsilon\,\nabla_{\boldsymbol{w}} J(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y})$$

• The addition of the weight decay term modifies the learning rule to multiplicatively shrink the weight vector (by the factor $1 - \epsilon\alpha$) before performing the usual gradient update.
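
A minimal numpy sketch of this update rule (the quadratic objective, learning rate, and decay coefficient below are hypothetical illustration values):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_J, lr=0.1, alpha=1e-2):
    """One gradient step on the L2-regularized objective:
    w <- (1 - lr*alpha) * w - lr * grad_J(w)."""
    return (1.0 - lr * alpha) * w - lr * grad_J(w)

# Hypothetical quadratic objective J(w) = 0.5 * ||w - w_star||^2
w_star = np.array([3.0, -2.0])
grad_J = lambda w: w - w_star

w = np.zeros(2)
for _ in range(200):
    w = sgd_step_with_weight_decay(w, grad_J)
print(w)   # converges near w_star, shrunk slightly toward the origin by the decay
```

For this toy objective the Hessian is the identity, so the fixed point is $\boldsymbol{w}^*/(1+\alpha)$, a small shrinkage toward the origin, consistent with the analysis on the following slides.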

L-2 Parameter Norm Regularization

• We can simplify the analysis further by making a quadratic approximation to the unregularized objective function in the neighborhood of its optimum weights $\boldsymbol{w}^* = \arg\min_{\boldsymbol{w}} J(\boldsymbol{w})$:
$$\hat{J}(\boldsymbol{w}) = J(\boldsymbol{w}^*) + \frac{1}{2}(\boldsymbol{w} - \boldsymbol{w}^*)^\top \boldsymbol{H}\,(\boldsymbol{w} - \boldsymbol{w}^*)$$

• $\boldsymbol{H}$ is the Hessian matrix of $J$ with respect to $\boldsymbol{w}$, evaluated at $\boldsymbol{w}^*$.

• What rule/formula is used to get this approximation?
  • Taylor series expansion

• Where is the first-order term?
  • $\boldsymbol{w}^*$ being the minimizing value, the gradient $\nabla_{\boldsymbol{w}} J(\boldsymbol{w}^*)$ is 0, so the first-order term vanishes.


L-2 Parameter Norm Regularization

• With this approximation, the regularized objective is given by
$$\tilde{J}(\boldsymbol{w}) = \hat{J}(\boldsymbol{w}) + \frac{\alpha}{2}\boldsymbol{w}^\top\boldsymbol{w}$$

• Computing the gradient of the above and equating it to 0, we get the minimizer $\tilde{\boldsymbol{w}}$ of the regularized and approximated objective:
$$\alpha\tilde{\boldsymbol{w}} + \boldsymbol{H}(\tilde{\boldsymbol{w}} - \boldsymbol{w}^*) = 0 \quad\Rightarrow\quad \tilde{\boldsymbol{w}} = (\boldsymbol{H} + \alpha\boldsymbol{I})^{-1}\boldsymbol{H}\boldsymbol{w}^*$$

• As $\alpha \to 0$, $\tilde{\boldsymbol{w}} \to \boldsymbol{w}^*$.

• As $\alpha$ grows, we can see the effect by using the eigendecomposition of $\boldsymbol{H}$.


L-2 Parameter Norm Regularization

• Write the eigendecomposition $\boldsymbol{H} = \boldsymbol{Q}\boldsymbol{\Lambda}\boldsymbol{Q}^\top$, where $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues $\lambda_i$ and the columns of $\boldsymbol{Q}$ are the corresponding orthonormal eigenvectors.

• Then
$$\tilde{\boldsymbol{w}} = \big(\boldsymbol{Q}\boldsymbol{\Lambda}\boldsymbol{Q}^\top + \alpha\boldsymbol{I}\big)^{-1}\boldsymbol{Q}\boldsymbol{\Lambda}\boldsymbol{Q}^\top\boldsymbol{w}^* = \boldsymbol{Q}\big(\boldsymbol{\Lambda} + \alpha\boldsymbol{I}\big)^{-1}\boldsymbol{\Lambda}\,\boldsymbol{Q}^\top\boldsymbol{w}^*$$

• The effect of weight decay is to rescale $\boldsymbol{w}^*$ along the axes defined by the eigenvectors of $\boldsymbol{H}$. Specifically, the component of $\boldsymbol{w}^*$ that is aligned with the $i$-th eigenvector of $\boldsymbol{H}$ is rescaled by a factor of $\frac{\lambda_i}{\lambda_i + \alpha}$.
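
A quick numpy check of this rescaling (the Hessian and $\boldsymbol{w}^*$ below are made-up illustration values):

```python
import numpy as np

# Hypothetical positive-definite Hessian and unregularized optimum
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])
w_star = np.array([1.0, 1.0])
alpha = 0.5

# Direct solution: w~ = (H + alpha*I)^-1 H w*
w_tilde = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)

# Same thing via the eigendecomposition H = Q diag(lam) Q^T:
# each eigen-component of w* is rescaled by lam_i / (lam_i + alpha)
lam, Q = np.linalg.eigh(H)
w_tilde_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))

print(np.allclose(w_tilde, w_tilde_eig))   # True
```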


Regularization Strategies: Dataset Augmentation


• One way to get better generalization is to train on more data.
• But under most circumstances, data is limited. Furthermore, labelling is an extremely tedious task.
• Dataset Augmentation provides a cheap and easy way to increase the amount of training data.

[Figure: an example image with augmented variants such as color jitter, horizontal flip, and many more.]
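
An illustrative torchvision pipeline in this spirit (the specific transforms and parameter values are example choices, not prescribed by the slides):

```python
from torchvision import transforms

# A hypothetical augmentation pipeline: each training image is randomly
# flipped, cropped, and color-jittered every time it is drawn, so the
# network effectively sees many variants of every labelled example.
train_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
```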



Regularization Strategies: Dropout


• Bagging is a technique for reducing generalization error through combining several models (Breiman, 1994)
• Bagging: (1) Train k different models on k different subsets of training data, constructed to have the same number
of examples as the original dataset through random sampling from that dataset with replacement
• Bagging: (2) Have all of the models vote on the output for test examples
• Dropout is a computationally inexpensive but powerful extension of Bagging
• Training with dropout consists of training sub-networks that can be formed by removing non-output units from an
underlying base network

[Figure: dropout illustrated as sub-networks of a base network. Images courtesy: Goodfellow et al., Karpathy et al.]
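
A minimal numpy sketch of (inverted) dropout applied to a layer's activations during training (the keep probability of 0.8 is an arbitrary example value):

```python
import numpy as np

def dropout_forward(h, keep_prob=0.8, training=True, rng=np.random.default_rng()):
    """Randomly zero each unit with probability 1 - keep_prob during training.
    Scaling by 1/keep_prob ("inverted dropout") keeps the expected activation
    unchanged, so no rescaling is needed at test time."""
    if not training:
        return h
    mask = rng.random(h.shape) < keep_prob   # sample a sub-network
    return h * mask / keep_prob

h = np.ones((2, 5))
print(dropout_forward(h))                   # some units zeroed, the rest scaled up
print(dropout_forward(h, training=False))   # unchanged at test time
```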



Dropout (Fun Intuition)


Batch Normalization

[Figure: a single neuron computing $a(\boldsymbol{x}) = \sum_{i} w_i x_i$ over inputs $x_1, \ldots, x_d$ (with $w_0 = b$) followed by a nonlinearity $g(a)$, and a two-layer network with pre-activations $a^{(1)}, a^{(2)}$, hidden activations $h^{(1)}, h^{(2)}$, and output $\hat{y}$.]

• For the inputs, normalization uses the batch mean $\mu = \frac{1}{m}\sum_{i=1}^{m} \boldsymbol{x}^{(i)}$, which is subtracted elementwise: $\boldsymbol{x} \leftarrow \boldsymbol{x} - \mu$.

• Can we normalize the intermediate pre-activations $a^{(l)}$ in the same way, so as to train the downstream weights and biases faster?

Implementing BatchNorm

Given some intermediate values $a^{(i)}$, $i = 1, \ldots, m$, in a NN (one mini-batch):

$$\mu = \frac{1}{m}\sum_{i=1}^{m} a^{(i)}, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\big(a^{(i)} - \mu\big)^2$$

$$a^{(i)}_{norm} = \frac{a^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad \tilde{a}^{(i)} = \gamma\, a^{(i)}_{norm} + \beta$$

• $\gamma$ and $\beta$ are learnable parameters of the model.

• If $\gamma = \sqrt{\sigma^2 + \epsilon}$ and $\beta = \mu$, then $\tilde{a}^{(i)} = a^{(i)}$, i.e., the layer can recover the unnormalized values.

• Use $\tilde{a}^{(i)}$ in place of $a^{(i)}$ in the rest of the network.
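
A minimal numpy sketch of this forward computation over a mini-batch (the feature dimension, $\epsilon$ value, and $\gamma$, $\beta$ initialization are illustrative assumptions):

```python
import numpy as np

def batchnorm_forward(a, gamma, beta, eps=1e-5):
    """Normalize pre-activations a of shape (m, d) over the mini-batch,
    then scale and shift with the learnable parameters gamma and beta."""
    mu = a.mean(axis=0)                      # per-feature batch mean
    var = a.var(axis=0)                      # per-feature batch variance
    a_norm = (a - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * a_norm + beta             # learnable rescaling

m, d = 32, 4
a = np.random.randn(m, d) * 5.0 + 3.0
gamma, beta = np.ones(d), np.zeros(d)        # a typical initialization
out = batchnorm_forward(a, gamma, beta)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```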

Effect of Batch Normalization on Biases

• Consider a pre-activation with a bias term: $a^{(l)} = w^{(l)} h^{(l-1)} + b^{(l)}$.

• We know the batch mean that batch normalization subtracts:
$$\mu = \frac{1}{m}\sum_{i=1}^{m} a^{(l)(i)} = w^{(l)}\Big(\frac{1}{m}\sum_{i=1}^{m} h^{(l-1)(i)}\Big) + b^{(l)}$$

• So,
$$a^{(l)(i)} - \mu = w^{(l)}\Big(h^{(l-1)(i)} - \frac{1}{m}\sum_{j=1}^{m} h^{(l-1)(j)}\Big)$$
and the bias $b^{(l)}$ cancels: with batch normalization the bias term has no effect, and its role is taken over by the learnable shift $\beta$.

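A quick numerical check of this cancellation (toy numbers, a single unit, all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal((32, 1))          # previous-layer activations, one unit
w = 1.7

a_with_bias = w * h + 5.0                 # pre-activation with bias b = 5
a_no_bias = w * h                         # same pre-activation without the bias

# After subtracting the batch mean, the two are identical: the bias cancels.
centered_with = a_with_bias - a_with_bias.mean(axis=0)
centered_without = a_no_bias - a_no_bias.mean(axis=0)
print(np.allclose(centered_with, centered_without))   # True
```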