Deep Learning Basics Lecture 3 Regularization I

Regularization helps prevent overfitting by adding terms or constraints to the training objective. It can be viewed as imposing hard constraints during optimization, adding penalty terms such as the l2 or l1 norm as soft constraints, or placing priors over the parameters in a Bayesian view. L2 regularization rescales the parameters along the eigenvectors of the Hessian, while l1 regularization induces sparsity by driving small parameter values exactly to zero.


Deep Learning Basics

Lecture 3: Regularization I
Princeton University COS 495
Instructor: Yingyu Liang
What is regularization?
• In general: any method to prevent overfitting or help the optimization

• Specifically: additional terms in the training optimization objective to prevent overfitting or help the optimization
Review: overfitting
Overfitting example: regression using polynomials
$t = \sin(2\pi x) + \epsilon$

[Figure from Pattern Recognition and Machine Learning, Bishop]
Overfitting example: regression using polynomials

[Figure from Pattern Recognition and Machine Learning, Bishop]
Overfitting
• Empirical loss and expected loss are different
• The smaller the data set, the larger the difference between the two
• The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two
• Such a hypothesis has small training error but large test error (overfitting)
Prevent overfitting
• Larger data set helps
• Throwing away useless hypotheses also helps

• Classical regularization: some principal ways to constrain hypotheses


• Other types of regularization: data augmentation, early stopping, etc.
Different views of regularization
Regularization as hard constraint
• Training objective
  $\min_f \hat{L}(f) = \frac{1}{n}\sum_{i=1}^n l(f, x_i, y_i)$
  subject to: $f \in \mathcal{H}$

• When parametrized
  $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i)$
  subject to: $\theta \in \Omega$
Regularization as hard constraint
• When $\Omega$ is measured by some quantity $R$
  $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i)$
  subject to: $R(\theta) \le r$

• Example: $\ell_2$ regularization
  $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i)$
  subject to: $\|\theta\|_2^2 \le r^2$
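The hard-constraint view can be optimized directly with projected gradient descent: take a gradient step, then project back onto the feasible set. A minimal sketch, with a hypothetical toy loss $L(\theta) = \|\theta - c\|^2$ and all constants chosen for illustration:

```python
import numpy as np

def project_l2_ball(theta, r):
    """Project theta onto the feasible set {theta : ||theta||_2 <= r}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

# Toy loss L(theta) = ||theta - c||^2 with unconstrained minimizer c
c = np.array([3.0, 4.0])          # ||c||_2 = 5, outside the feasible ball
theta = np.zeros(2)
eta, r = 0.5, 1.0                 # step size and constraint radius
for _ in range(100):
    grad = 2 * (theta - c)        # gradient of the toy loss
    theta = project_l2_ball(theta - eta * grad, r)

# The constrained optimum lies on the ball's boundary, in the direction of c
print(theta)                      # close to [0.6, 0.8]
```

The projection step keeps every iterate feasible, which is the "projection back to feasible set" stability mentioned later in these slides.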
Regularization as soft constraint
• The hard-constraint optimization is equivalent to the soft-constraint version
  $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i) + \lambda^* R(\theta)$
  for some regularization parameter $\lambda^* > 0$

• Example: $\ell_2$ regularization
  $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$
Regularization as soft constraint
• Shown by the Lagrange multiplier method
  $\mathcal{L}(\theta, \lambda) := \hat{L}(\theta) + \lambda [R(\theta) - r]$
• Suppose $\theta^*$ is optimal for the hard-constraint optimization
  $\theta^* = \arg\min_\theta \max_{\lambda \ge 0} \mathcal{L}(\theta, \lambda)$
• Suppose $\lambda^*$ is the corresponding optimal $\lambda$ in the max; then
  $\theta^* = \arg\min_\theta \mathcal{L}(\theta, \lambda^*) = \arg\min_\theta \hat{L}(\theta) + \lambda^* [R(\theta) - r]$
  and since $-\lambda^* r$ is a constant in $\theta$, this matches the soft-constraint objective
Regularization as Bayesian prior
• Bayesian view: everything is a distribution
• Prior over the hypotheses: $p(\theta)$
• Posterior over the hypotheses: $p(\theta \mid \{x_i, y_i\})$
• Likelihood: $p(\{x_i, y_i\} \mid \theta)$

• Bayes' rule:
  $p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$
Regularization as Bayesian prior
• Bayes' rule:
  $p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$

• Maximum A Posteriori (MAP):
  $\max_\theta \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \log p(\theta) + \log p(\{x_i, y_i\} \mid \theta)$
  where $\log p(\theta)$ acts as the regularization term and $\log p(\{x_i, y_i\} \mid \theta)$ is the MLE loss
Regularization as Bayesian prior
• Example: $\ell_2$ loss with $\ell_2$ regularization
  $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^n \bigl(f_\theta(x_i) - y_i\bigr)^2 + \lambda^* \|\theta\|_2^2$

• Corresponds to a normal likelihood $p(x, y \mid \theta)$ and a normal prior $p(\theta)$
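To make the correspondence concrete, a short derivation sketch; the noise variance $\sigma^2$ and prior variance $\tau^2$ are assumed quantities, not given in the slides:

```latex
% Assume y_i = f_\theta(x_i) + \epsilon with \epsilon \sim \mathcal{N}(0, \sigma^2),
% and a prior \theta \sim \mathcal{N}(0, \tau^2 I). MAP then gives
\max_\theta \; \log p(\theta) + \sum_{i=1}^n \log p(x_i, y_i \mid \theta)
 = \max_\theta \; -\frac{1}{2\tau^2}\|\theta\|_2^2
   - \frac{1}{2\sigma^2}\sum_{i=1}^n \bigl(f_\theta(x_i) - y_i\bigr)^2 + \text{const}
% Multiplying by -2\sigma^2/n turns this into the l2-regularized objective
% above, with \lambda^* = \sigma^2 / (n \tau^2).
```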


Three views
• Typical choice for optimization: soft-constraint
  $\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \lambda R(\theta)$

• Hard-constraint and Bayesian views: conceptual, or used for derivation
Three views
• Hard-constraint preferred if
  • the explicit bound $R(\theta) \le r$ is known
  • the soft-constraint version gets trapped in local minima with small $\theta$
  • projection back to the feasible set leads to stability

• Bayesian view preferred if
  • the prior distribution is known
Some examples
Classical regularization
• Norm penalty
• $\ell_2$ regularization
• $\ell_1$ regularization

• Robustness to noise
$\ell_2$ regularization
  $\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \frac{\alpha}{2}\|\theta\|_2^2$

• Effect on (stochastic) gradient descent
• Effect on the optimal solution
Effect on gradient descent
• Gradient of the regularized objective
  $\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha\theta$
• Gradient descent update
  $\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla\hat{L}(\theta) - \eta\alpha\theta = (1 - \eta\alpha)\theta - \eta\nabla\hat{L}(\theta)$
• Terminology: weight decay
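A minimal numerical sketch of why the update is called weight decay; the gradient vector here is an arbitrary stand-in for $\nabla\hat{L}(\theta)$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
grad = rng.normal(size=5)          # stands in for the data-loss gradient
eta, alpha = 0.1, 0.01             # step size and regularization strength

# Update written with the explicit l2-penalty gradient ...
direct = theta - eta * (grad + alpha * theta)
# ... and as weight decay: first shrink theta by (1 - eta*alpha),
# then take the usual gradient step on the unregularized loss
decayed = (1 - eta * alpha) * theta - eta * grad

print(np.allclose(direct, decayed))  # True: the two forms are identical
```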
Effect on the optimal solution
• Consider a quadratic approximation around $\theta^*$
  $\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla\hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)$

• Since $\theta^*$ is optimal, $\nabla\hat{L}(\theta^*) = 0$, so
  $\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)$
  $\nabla\hat{L}(\theta) \approx H(\theta - \theta^*)$
Effect on the optimal solution
• Gradient of the regularized objective
  $\nabla\hat{L}_R(\theta) \approx H(\theta - \theta^*) + \alpha\theta$
• At the optimum $\theta_R^*$
  $0 = \nabla\hat{L}_R(\theta_R^*) \approx H(\theta_R^* - \theta^*) + \alpha\theta_R^*$
  $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$
Effect on the optimal solution
• The optimum
  $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$

• Suppose $H$ has the eigendecomposition $H = Q\Lambda Q^T$
  $\theta_R^* \approx (H + \alpha I)^{-1} H\theta^* = Q(\Lambda + \alpha I)^{-1}\Lambda Q^T \theta^*$

• Effect: rescale along the eigenvectors of $H$; the component of $\theta^*$ along the $i$-th eigenvector is shrunk by the factor $\lambda_i / (\lambda_i + \alpha)$


Effect on the optimal solution

Notation: $\theta^* = w^*$, $\theta_R^* = \tilde{w}$

[Figure from Deep Learning, Goodfellow, Bengio and Courville]
$\ell_1$ regularization
  $\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \alpha\|\theta\|_1$

• Effect on (stochastic) gradient descent
• Effect on the optimal solution
Effect on gradient descent
• Gradient of the regularized objective
  $\nabla\hat{L}_R(\theta) = \nabla\hat{L}(\theta) + \alpha\,\mathrm{sign}(\theta)$
  where sign applies to each element of $\theta$
• Gradient descent update
  $\theta \leftarrow \theta - \eta\nabla\hat{L}_R(\theta) = \theta - \eta\nabla\hat{L}(\theta) - \eta\alpha\,\mathrm{sign}(\theta)$
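A small sketch of one such subgradient step; the gradient vector is an arbitrary stand-in for $\nabla\hat{L}(\theta)$ and the constants are illustrative:

```python
import numpy as np

theta = np.array([0.5, -0.3, 0.01])
grad = np.array([0.2, -0.1, 0.0])   # stands in for the data-loss gradient
eta, alpha = 0.1, 1.0

# Subgradient step for the l1-regularized objective: the usual gradient
# step, plus a constant-magnitude push of each coordinate toward zero
theta_new = theta - eta * grad - eta * alpha * np.sign(theta)
print(theta_new)                     # [0.38, -0.19, -0.09]
```

Note the third coordinate overshoots zero rather than landing on it: plain subgradient steps oscillate around zero, which is one reason proximal (soft-thresholding) updates are often preferred for exact sparsity in practice.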
Effect on the optimal solution
• Consider a quadratic approximation around $\theta^*$
  $\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla\hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)$

• Since $\theta^*$ is optimal, $\nabla\hat{L}(\theta^*) = 0$, so
  $\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)$
Effect on the optimal solution
• Further assume that $H$ is diagonal and positive ($H_{ii} > 0$ for all $i$)
  • not true in general, but assumed here to build intuition
• The regularized objective is then (ignoring constants)
  $\hat{L}_R(\theta) \approx \sum_i \left[ \frac{1}{2} H_{ii} (\theta_i - \theta_i^*)^2 + \alpha |\theta_i| \right]$
• The optimum $\theta_R^*$
  $(\theta_R^*)_i \approx \begin{cases} \max\{\theta_i^* - \alpha/H_{ii},\ 0\} & \text{if } \theta_i^* \ge 0 \\ \min\{\theta_i^* + \alpha/H_{ii},\ 0\} & \text{if } \theta_i^* < 0 \end{cases}$
Effect on the optimal solution
• Effect: induce sparsity; components with $|\theta_i^*| \le \alpha/H_{ii}$ are set exactly to zero

[Figure: plot of $(\theta_R^*)_i$ against $(\theta^*)_i$, flat at zero on the interval $[-\alpha/H_{ii},\ \alpha/H_{ii}]$]
Effect on the optimal solution
• Further assume that $H$ is diagonal
• Compact expression for the optimum $\theta_R^*$
  $(\theta_R^*)_i \approx \mathrm{sign}(\theta_i^*)\,\max\{|\theta_i^*| - \alpha/H_{ii},\ 0\}$
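This compact expression is the soft-thresholding operator, easily implemented elementwise; the input values below are arbitrary illustrative choices:

```python
import numpy as np

def soft_threshold(theta_star, alpha, H_diag):
    """Closed-form l1-regularized optimum under the diagonal quadratic
    approximation: shrink each component by alpha/H_ii, clipping at zero."""
    t = alpha / H_diag
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - t, 0.0)

theta_star = np.array([2.0, 0.3, -0.1, -1.5])
H_diag = np.ones(4)                 # take H = I for simplicity
alpha = 0.5

print(soft_threshold(theta_star, alpha, H_diag))
# [1.5, 0.0, 0.0, -1.0]: small components are driven exactly to zero
```

Large components are shifted toward zero by $\alpha/H_{ii}$; components already smaller than the threshold become exactly zero, which is the sparsity effect.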
Bayesian view
• $\ell_1$ regularization corresponds to a Laplacian prior
  $p(\theta) \propto \exp\left(-\alpha \sum_i |\theta_i|\right)$
  $\log p(\theta) = -\alpha \sum_i |\theta_i| + \text{constant} = -\alpha \|\theta\|_1 + \text{constant}$
