Deep Learning: Computer Science and Engineering
Deep Learning
Abir Das
Assistant Professor
Computer Science and Engineering Department
Indian Institute of Technology Kharagpur
http://cse.iitkgp.ac.in/~adas/
Agenda
• Introduce the concepts of
  • Regularization
  • Dropout
  • Batch normalization
Bias–Variance Decomposition

Let $f(\boldsymbol{x})$ be the target function, $g^{\mathcal{D}}(\boldsymbol{x})$ the hypothesis learned from a training set $\mathcal{D}$, and $\bar{g}(\boldsymbol{x}) = \mathbb{E}_{\mathcal{D}}\!\left[g^{\mathcal{D}}(\boldsymbol{x})\right]$ the average hypothesis over training sets. Then

$$
\begin{aligned}
E_{\text{out}}(\boldsymbol{x}) &= \mathbb{E}_{\mathcal{D}}\!\left[\left(g^{\mathcal{D}}(\boldsymbol{x}) - f(\boldsymbol{x})\right)^{2}\right] \\
&= f(\boldsymbol{x})^{2} - 2 f(\boldsymbol{x})\,\mathbb{E}_{\mathcal{D}}\!\left[g^{\mathcal{D}}(\boldsymbol{x})\right] + \mathbb{E}_{\mathcal{D}}\!\left[g^{\mathcal{D}}(\boldsymbol{x})^{2}\right] \\
&= f(\boldsymbol{x})^{2} - 2 f(\boldsymbol{x})\,\bar{g}(\boldsymbol{x}) + \mathbb{E}_{\mathcal{D}}\!\left[g^{\mathcal{D}}(\boldsymbol{x})^{2}\right] \\
&= f(\boldsymbol{x})^{2} - 2 f(\boldsymbol{x})\,\bar{g}(\boldsymbol{x}) + \bar{g}(\boldsymbol{x})^{2} \;-\; \bar{g}(\boldsymbol{x})^{2} + \mathbb{E}_{\mathcal{D}}\!\left[g^{\mathcal{D}}(\boldsymbol{x})^{2}\right] \\
&= \underbrace{\left(f(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\right)^{2}}_{\text{Bias}} \;+\; \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[g^{\mathcal{D}}(\boldsymbol{x})^{2}\right] - \bar{g}(\boldsymbol{x})^{2}}_{\text{Variance}}
\end{aligned}
$$

The variance term equals $\mathbb{E}_{\mathcal{D}}\!\left[\left(g^{\mathcal{D}}(\boldsymbol{x}) - \bar{g}(\boldsymbol{x})\right)^{2}\right]$, so $E_{\text{out}}(\boldsymbol{x}) = \text{Bias}(\boldsymbol{x}) + \text{Variance}(\boldsymbol{x})$.
[Figure: Error curves for Training Error and Test Error, with the bias and variance contributions to the test error indicated.]
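The decomposition above can be checked numerically. The sketch below is an illustration rather than anything from the slides: it repeatedly draws small training sets from a noisy sine target, fits a straight line to each, and averages over the fits to estimate the bias and variance terms. The target function, noise level, dataset size and model class are all illustrative assumptions.

```python
# Illustrative bias/variance estimate by simulation (all settings are assumptions):
# f is a known target, each dataset D yields a fitted hypothesis g^D (a line),
# and averaging over many datasets approximates the expectations E_D[.].
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                                  # target function f(x)
x_test = np.linspace(0, np.pi, 100)         # points where bias/variance are evaluated

preds = []
for _ in range(2000):                       # 2000 independent training sets D
    x = rng.uniform(0, np.pi, size=10)
    y = f(x) + 0.1 * rng.normal(size=10)    # noisy training targets
    coeffs = np.polyfit(x, y, deg=1)        # g^D: least-squares line fitted to D
    preds.append(np.polyval(coeffs, x_test))

preds = np.array(preds)                     # shape (2000, 100): g^D(x) for each D
g_bar = preds.mean(axis=0)                  # g_bar(x) = E_D[g^D(x)]
bias2 = ((f(x_test) - g_bar) ** 2).mean()   # (f - g_bar)^2, averaged over x
variance = preds.var(axis=0).mean()         # E_D[(g^D - g_bar)^2], averaged over x
print(f"bias ~ {bias2:.4f}, variance ~ {variance:.4f}")
```

Fitting a more flexible model (e.g. a higher-degree polynomial) in this sketch lowers the bias estimate and raises the variance estimate, which is the trade-off the figure above depicts.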
Regularization
• Machine learning is concerned more with performance on the test data than on the training data
• Regularization is any modification made to a learning algorithm that is intended to reduce its generalization (test) error, but not necessarily its training error
Regularization Strategies
• Adding restrictions on parameter values
• Adding constraints that are designed to encode specific kinds of prior knowledge
• Dataset augmentation
• In practical deep learning scenarios, we almost always find that the best-fitting model (in the sense of minimizing generalization error) is a large model that has been regularized appropriately
• A parameter norm penalty adds a term $\Omega(\boldsymbol{\theta})$ to the objective, giving the regularized objective $\tilde{J}(\boldsymbol{\theta}; X, y) = J(\boldsymbol{\theta}; X, y) + \alpha\,\Omega(\boldsymbol{\theta})$, where the hyperparameter $\alpha \geq 0$ weights the penalty relative to the original objective
• When the optimization procedure tries to minimize this regularized objective, it will also keep the parameters from growing in an unbounded manner, thus restricting the complexity of the model
• In neural networks, we typically choose only the weights $\boldsymbol{w}$ as the parameters $\boldsymbol{\theta}$ to penalize in $\Omega$ – not the biases
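To make the last two points concrete, here is a minimal sketch, under assumed toy dimensions and data, of a regularized objective $\tilde{J}(\boldsymbol{\theta}) = J(\boldsymbol{\theta}) + \alpha\,\Omega(\boldsymbol{w})$ for a small two-layer network, where the penalty $\Omega$ covers only the weight matrices and leaves the biases out. The architecture, data and $\alpha$ are illustrative, not from the slides.

```python
# A sketch of an L2-regularized objective that penalizes weights but not biases.
# The two-layer network, the data and alpha are illustrative assumptions.
import numpy as np

def l2_penalty(weight_matrices):
    # Omega(w) = (1/2) * sum of squared weight entries; biases are excluded by the caller
    return 0.5 * sum(np.sum(W ** 2) for W in weight_matrices)

def regularized_objective(params, X, y, alpha):
    W1, b1, W2, b2 = params
    h = np.tanh(X @ W1 + b1)                    # hidden layer
    y_hat = h @ W2 + b2                         # linear output layer
    J = np.mean((y_hat - y) ** 2)               # unregularized loss J(theta; X, y)
    return J + alpha * l2_penalty([W1, W2])     # penalize W1, W2 only, not b1, b2

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=(32, 1))
params = (rng.normal(size=(4, 8)), np.zeros(8), rng.normal(size=(8, 1)), np.zeros(1))
print(regularized_objective(params, X, y, alpha=0.01))
```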
• We can gain some insight into the behavior of weight decay regularization by studying the gradient of the regularized objective function
• With the $L^2$ penalty, the regularized objective is $\tilde{J}(\boldsymbol{w}; X, y) = \frac{\alpha}{2}\boldsymbol{w}^{\top}\boldsymbol{w} + J(\boldsymbol{w}; X, y)$, and its gradient is $\nabla_{\boldsymbol{w}} \tilde{J}(\boldsymbol{w}; X, y) = \alpha \boldsymbol{w} + \nabla_{\boldsymbol{w}} J(\boldsymbol{w}; X, y)$
• A single gradient step with learning rate $\varepsilon$ therefore becomes $\boldsymbol{w} \leftarrow \boldsymbol{w} - \varepsilon\left(\alpha \boldsymbol{w} + \nabla_{\boldsymbol{w}} J(\boldsymbol{w}; X, y)\right) = (1 - \varepsilon\alpha)\,\boldsymbol{w} - \varepsilon\,\nabla_{\boldsymbol{w}} J(\boldsymbol{w}; X, y)$
• The addition of the weight decay term thus modifies the learning rule to shrink the weight vector by a constant factor before performing the usual gradient update
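A minimal sketch of this update rule on a toy least-squares problem; the data, the learning rate $\varepsilon$ and the weight decay coefficient $\alpha$ are illustrative assumptions. Each step first shrinks $\boldsymbol{w}$ by the factor $(1 - \varepsilon\alpha)$ and then takes the usual gradient step.

```python
# Gradient descent with weight decay: shrink the weights, then apply the plain
# gradient step. The quadratic loss and all hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                              # toy design matrix
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(5)
lr, alpha = 0.05, 0.1                                      # learning rate and decay coefficient
for _ in range(500):
    grad_J = 2.0 / len(y) * X.T @ (X @ w - y)              # gradient of the unregularized loss J(w)
    w = (1.0 - lr * alpha) * w - lr * grad_J               # shrink first, then the usual update
print(w)                                                   # shrunk towards zero relative to the true weights
```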
• Computing the gradient of the above (the quadratic approximation of $J$ around its unregularized minimizer $\boldsymbol{w}^{*}$, plus the weight decay term $\frac{\alpha}{2}\boldsymbol{w}^{\top}\boldsymbol{w}$) and equating it to 0, we get the minimizing $\boldsymbol{w}$ of the regularized and approximated objective as $\tilde{\boldsymbol{w}} = (H + \alpha I)^{-1} H \boldsymbol{w}^{*}$, where $H$ is the Hessian of $J$ at $\boldsymbol{w}^{*}$
• As $\alpha \rightarrow 0$, $\tilde{\boldsymbol{w}}$ approaches $\boldsymbol{w}^{*}$; as $\alpha$ grows, the components of $\boldsymbol{w}^{*}$ along directions in which $H$ has small eigenvalues are shrunk towards zero, while the components along directions with large eigenvalues are affected relatively little
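The effect of this closed form is easy to see numerically. The sketch below uses an assumed diagonal toy Hessian: the component of $\boldsymbol{w}^{*}$ along the small-eigenvalue direction is shrunk strongly, while the component along the large-eigenvalue direction barely moves.

```python
# Illustration of w_tilde = (H + alpha*I)^{-1} H w_star with a toy diagonal Hessian.
import numpy as np

H = np.diag([10.0, 0.1])                # one stiff direction, one flat direction
w_star = np.array([1.0, 1.0])           # unregularized minimizer
alpha = 0.5

w_tilde = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)
print(w_tilde)                          # ~ [0.952, 0.167]: the flat direction is shrunk far more
```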
Color Jitter
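Color jitter randomly perturbs the brightness, contrast, saturation and hue of each training image, producing new label-preserving samples. A minimal sketch using torchvision (assumed to be available) is shown below; the jitter ranges and the input file name are illustrative, not from the slides.

```python
# Color-jitter data augmentation sketch using torchvision; all parameter values
# and the file name "example.jpg" are illustrative assumptions.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4,   # random brightness factor in [0.6, 1.4]
                           contrast=0.4,
                           saturation=0.4,
                           hue=0.1),          # random hue shift in [-0.1, 0.1]
    transforms.RandomHorizontalFlip(),        # another cheap label-preserving transform
    transforms.ToTensor(),
])

img = Image.open("example.jpg")               # hypothetical training image
augmented = augment(img)                      # a new, randomly perturbed sample each call
```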
Batch Normalization
• Consider first a single neuron with inputs $x_1, \dots, x_d$ (plus a constant input $1$), weights $w_1, \dots, w_d$ and bias $w_0 = b$: it computes a pre-activation $a(\boldsymbol{x})$ followed by a nonlinearity $g(a)$ to produce the output $y$
• Normalizing the inputs helps such a model train faster: compute the mean $\mu = \frac{1}{m}\sum_{i=1}^{m} \boldsymbol{x}^{(i)}$ over the $m$ training examples and subtract it elementwise, $X \leftarrow X - \mu$ (and similarly rescale each feature to unit variance)
• [Figure: a single neuron, and a two-layer network $\boldsymbol{x} \rightarrow a^{(1)} \rightarrow h^{(1)} \rightarrow a^{(2)} \rightarrow h^{(2)} \rightarrow \hat{y}$]
• In a deeper network, the activations $h^{(1)}$ act as the inputs to the next layer: can we normalize $a^{(1)}$ (or $h^{(1)}$) so as to train the next layer's weights and biases faster?
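Before turning to batch normalization itself, the input-normalization step referred to above can be sketched as follows; the batch shape and the small constant $\epsilon$ are illustrative assumptions.

```python
# Input standardization (zero mean, unit variance per feature) -- the preprocessing
# step that batch normalization extends to hidden activations. Shapes are assumptions.
import numpy as np

def standardize(X, eps=1e-8):
    mu = X.mean(axis=0)                                    # per-feature mean over the m examples
    sigma2 = X.var(axis=0)                                 # per-feature variance
    return (X - mu) / np.sqrt(sigma2 + eps), mu, sigma2    # keep mu, sigma2 for test time

X = np.random.randn(128, 3) * 5.0 + 2.0                    # toy batch with non-zero mean and large scale
X_norm, mu, sigma2 = standardize(X)
```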
Implementing BatchNorm

Given some intermediate values $a^{(1)}, \dots, a^{(m)}$ in the NN, computed over a mini-batch of size $m$:

$$
\mu = \frac{1}{m}\sum_{i=1}^{m} a^{(i)}, \qquad
\sigma^{2} = \frac{1}{m}\sum_{i=1}^{m}\left(a^{(i)} - \mu\right)^{2}, \qquad
a^{(i)}_{\text{norm}} = \frac{a^{(i)} - \mu}{\sqrt{\sigma^{2} + \epsilon}}, \qquad
\tilde{a}^{(i)} = \gamma\, a^{(i)}_{\text{norm}} + \beta
$$

• $\gamma$ and $\beta$ are learnable parameters of the model
• If $\gamma = \sqrt{\sigma^{2} + \epsilon}$ and $\beta = \mu$, then $\tilde{a}^{(i)} = a^{(i)}$, i.e. the network can recover the un-normalized activations if that is what works best
• Use $\tilde{a}^{(i)}$ instead of $a^{(i)}$ in the subsequent computations of the network
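A minimal NumPy sketch of this forward pass, assuming `A` holds one layer's pre-activations for a mini-batch of shape `(m, units)`; in a real network $\gamma$ and $\beta$ would be learned along with the other parameters. The final check verifies the observation above that choosing $\gamma = \sqrt{\sigma^{2}+\epsilon}$ and $\beta = \mu$ recovers the identity transformation.

```python
# Batch-norm forward pass over a mini-batch of pre-activations A with shape (m, units).
# gamma and beta are per-unit parameters; their values here are illustrative.
import numpy as np

def batchnorm_forward(A, gamma, beta, eps=1e-5):
    mu = A.mean(axis=0)                         # per-unit mean over the mini-batch
    sigma2 = A.var(axis=0)                      # per-unit variance over the mini-batch
    A_norm = (A - mu) / np.sqrt(sigma2 + eps)   # zero mean, unit variance
    return gamma * A_norm + beta                # learnable rescale and shift

m, units = 64, 10
A = np.random.randn(m, units) * 3.0 + 1.5       # toy pre-activations
A_tilde = batchnorm_forward(A, np.ones(units), np.zeros(units))

# Sanity check: gamma = sqrt(sigma^2 + eps), beta = mu gives back A exactly.
eps = 1e-5
identity = batchnorm_forward(A, np.sqrt(A.var(axis=0) + eps), A.mean(axis=0), eps)
assert np.allclose(identity, A)
```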