Deep Learning Basics Lecture 3 Regularization I

Regularization helps prevent overfitting by adding terms or constraints to the training objective. It can be viewed as imposing hard constraints during optimization, adding penalty terms such as the l2 or l1 norm as soft constraints, or placing priors over the parameters in a Bayesian view. L2 regularization rescales the parameters along the eigenvectors of the Hessian, while l1 regularization induces sparsity by driving small parameter values exactly to zero.


Deep Learning Basics

Lecture 3: Regularization I
Princeton University COS 495
Instructor: Yingyu Liang
What is regularization?
• In general: any method to prevent overfitting or help the optimization

• Specifically: additional terms in the training optimization objective to prevent overfitting or help the optimization
Review: overfitting
Overfitting example: regression using polynomials
$t = \sin(2\pi x) + \epsilon$

[Figure from Pattern Recognition and Machine Learning, Bishop]
Overfitting example: regression using polynomials

[Figure from Pattern Recognition and Machine Learning, Bishop]
Overfitting
• Empirical loss and expected loss are different
• The smaller the data set, the larger the difference between the two
• The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two
• Such a hypothesis has small training error but large test error (overfitting)
Prevent overfitting
• Larger data set helps
• Throwing away useless hypotheses also helps

• Classical regularization: some principal ways to constrain hypotheses


• Other types of regularization: data augmentation, early stopping, etc.
Different views of regularization
Regularization as hard constraint
• Training objective
  $\min_f \hat{L}(f) = \frac{1}{n}\sum_{i=1}^n l(f, x_i, y_i)$
  subject to: $f \in \mathcal{H}$

• When parametrized
  $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i)$
  subject to: $\theta \in \Omega$
Regularization as hard constraint
• When $\Omega$ is measured by some quantity $R$
  $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i)$
  subject to: $R(\theta) \le r$

• Example: $\ell_2$ regularization
  $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i)$
  subject to: $\|\theta\|_2^2 \le r^2$
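The hard-constraint view can be optimized directly with projected gradient descent: take a gradient step, then project back onto the feasible set. A minimal sketch, with a hypothetical toy loss $L(\theta) = \|\theta - c\|^2$ and all constants chosen for illustration:

```python
import numpy as np

def project_l2_ball(theta, r):
    """Project theta onto the feasible set {theta : ||theta||_2 <= r}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

# Toy loss L(theta) = ||theta - c||^2 with unconstrained minimizer c
c = np.array([3.0, 4.0])          # ||c||_2 = 5, outside the feasible ball
theta = np.zeros(2)
eta, r = 0.5, 1.0                 # step size and constraint radius
for _ in range(100):
    grad = 2 * (theta - c)        # gradient of the toy loss
    theta = project_l2_ball(theta - eta * grad, r)

# The constrained optimum lies on the ball's boundary, in the direction of c
print(theta)                      # close to [0.6, 0.8]
```

The projection step keeps every iterate feasible, which is the "projection back to feasible set" stability mentioned later in these slides.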
Regularization as soft constraint
• The hard-constraint optimization is equivalent to the soft-constraint version
  $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i) + \lambda^* R(\theta)$
  for some regularization parameter $\lambda^* > 0$

• Example: $\ell_2$ regularization
  $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$
Regularization as soft constraint
• Shown by the Lagrange multiplier method
  $\mathcal{L}(\theta, \lambda) := \hat{L}(\theta) + \lambda [R(\theta) - r]$
• Suppose $\theta^*$ is optimal for the hard-constraint optimization
  $\theta^* = \arg\min_\theta \max_{\lambda \ge 0} \mathcal{L}(\theta, \lambda)$
• Suppose $\lambda^*$ is the corresponding optimal $\lambda$ in the max; then
  $\theta^* = \arg\min_\theta \mathcal{L}(\theta, \lambda^*) = \arg\min_\theta \hat{L}(\theta) + \lambda^* [R(\theta) - r]$
  and since $-\lambda^* r$ is a constant in $\theta$, this matches the soft-constraint objective
Regularization as Bayesian prior
• Bayesian view: everything is a distribution
• Prior over the hypotheses: $p(\theta)$
• Posterior over the hypotheses: $p(\theta \mid \{x_i, y_i\})$
• Likelihood: $p(\{x_i, y_i\} \mid \theta)$

• Bayes' rule:
  $p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$
Regularization as Bayesian prior
• Bayes' rule:
  $p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$

• Maximum A Posteriori (MAP):
  $\max_\theta \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \log p(\theta) + \log p(\{x_i, y_i\} \mid \theta)$
  where $\log p(\theta)$ acts as the regularization term and $\log p(\{x_i, y_i\} \mid \theta)$ is the MLE loss
Regularization as Bayesian prior
• Example: $\ell_2$ loss with $\ell_2$ regularization
  $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^n \bigl(f_\theta(x_i) - y_i\bigr)^2 + \lambda^* \|\theta\|_2^2$

• Corresponds to a normal likelihood $p(x, y \mid \theta)$ and a normal prior $p(\theta)$
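To make the correspondence concrete, a short derivation sketch; the noise variance $\sigma^2$ and prior variance $\tau^2$ are assumed quantities, not given in the slides:

```latex
% Assume y_i = f_\theta(x_i) + \epsilon with \epsilon \sim \mathcal{N}(0, \sigma^2),
% and a prior \theta \sim \mathcal{N}(0, \tau^2 I). MAP then gives
\max_\theta \; \log p(\theta) + \sum_{i=1}^n \log p(x_i, y_i \mid \theta)
 = \max_\theta \; -\frac{1}{2\tau^2}\|\theta\|_2^2
   - \frac{1}{2\sigma^2}\sum_{i=1}^n \bigl(f_\theta(x_i) - y_i\bigr)^2 + \text{const}
% Multiplying by -2\sigma^2/n turns this into the l2-regularized objective
% above, with \lambda^* = \sigma^2 / (n \tau^2).
```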


Three views
• Typical choice for optimization: soft-constraint
  $\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \lambda R(\theta)$

• Hard-constraint and Bayesian views: conceptual, or used for derivation
Three views
• Hard-constraint preferred if
  • the explicit bound $R(\theta) \le r$ is known
  • the soft-constraint version gets trapped in local minima with small $\theta$
  • projection back to the feasible set leads to stability

• Bayesian view preferred if
  • the prior distribution is known
Some examples
Classical regularization
• Norm penalty
• $\ell_2$ regularization
• $\ell_1$ regularization

• Robustness to noise
$\ell_2$ regularization
  $\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \frac{\alpha}{2}\|\theta\|_2^2$

• Effect on (stochastic) gradient descent
• Effect on the optimal solution
Effect on gradient descent
• Gradient of the regularized objective
  $\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha\theta$
• Gradient descent update
  $\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla\hat{L}(\theta) - \eta\alpha\theta = (1 - \eta\alpha)\theta - \eta\nabla\hat{L}(\theta)$
• Terminology: weight decay
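A minimal numerical sketch of why the update is called weight decay; the gradient vector here is an arbitrary stand-in for $\nabla\hat{L}(\theta)$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
grad = rng.normal(size=5)          # stands in for the data-loss gradient
eta, alpha = 0.1, 0.01             # step size and regularization strength

# Update written with the explicit l2-penalty gradient ...
direct = theta - eta * (grad + alpha * theta)
# ... and as weight decay: first shrink theta by (1 - eta*alpha),
# then take the usual gradient step on the unregularized loss
decayed = (1 - eta * alpha) * theta - eta * grad

print(np.allclose(direct, decayed))  # True: the two forms are identical
```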
Effect on the optimal solution
• Consider a quadratic approximation around $\theta^*$
  $\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla\hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)$

• Since $\theta^*$ is optimal, $\nabla\hat{L}(\theta^*) = 0$, so
  $\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)$
  $\nabla\hat{L}(\theta) \approx H(\theta - \theta^*)$
Effect on the optimal solution
• Gradient of the regularized objective
  $\nabla\hat{L}_R(\theta) \approx H(\theta - \theta^*) + \alpha\theta$
• At the optimum $\theta_R^*$
  $0 = \nabla\hat{L}_R(\theta_R^*) \approx H(\theta_R^* - \theta^*) + \alpha\theta_R^*$
  $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$
Effect on the optimal solution
• The optimum
  $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$

• Suppose $H$ has the eigendecomposition $H = Q\Lambda Q^T$
  $\theta_R^* \approx (H + \alpha I)^{-1} H\theta^* = Q(\Lambda + \alpha I)^{-1}\Lambda Q^T \theta^*$

• Effect: rescale along the eigenvectors of $H$; the component of $\theta^*$ along the $i$-th eigenvector is shrunk by the factor $\lambda_i / (\lambda_i + \alpha)$


Effect on the optimal solution

Notation: $\theta^* = w^*$, $\theta_R^* = \tilde{w}$

[Figure from Deep Learning, Goodfellow, Bengio and Courville]
$\ell_1$ regularization
  $\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \alpha\|\theta\|_1$

• Effect on (stochastic) gradient descent
• Effect on the optimal solution
Effect on gradient descent
• Gradient of the regularized objective
  $\nabla\hat{L}_R(\theta) = \nabla\hat{L}(\theta) + \alpha\,\mathrm{sign}(\theta)$
  where sign applies to each element of $\theta$
• Gradient descent update
  $\theta \leftarrow \theta - \eta\nabla\hat{L}_R(\theta) = \theta - \eta\nabla\hat{L}(\theta) - \eta\alpha\,\mathrm{sign}(\theta)$
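A small sketch of one such subgradient step; the gradient vector is an arbitrary stand-in for $\nabla\hat{L}(\theta)$ and the constants are illustrative:

```python
import numpy as np

theta = np.array([0.5, -0.3, 0.01])
grad = np.array([0.2, -0.1, 0.0])   # stands in for the data-loss gradient
eta, alpha = 0.1, 1.0

# Subgradient step for the l1-regularized objective: the usual gradient
# step, plus a constant-magnitude push of each coordinate toward zero
theta_new = theta - eta * grad - eta * alpha * np.sign(theta)
print(theta_new)                     # [0.38, -0.19, -0.09]
```

Note the third coordinate overshoots zero rather than landing on it: plain subgradient steps oscillate around zero, which is one reason proximal (soft-thresholding) updates are often preferred for exact sparsity in practice.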
Effect on the optimal solution
• Consider a quadratic approximation around $\theta^*$
  $\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla\hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)$

• Since $\theta^*$ is optimal, $\nabla\hat{L}(\theta^*) = 0$, so
  $\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)$
Effect on the optimal solution
• Further assume that $H$ is diagonal and positive ($H_{ii} > 0$ for all $i$)
  • not true in general, but assumed here to build intuition
• The regularized objective is then (ignoring constants)
  $\hat{L}_R(\theta) \approx \sum_i \left[ \frac{1}{2} H_{ii} (\theta_i - \theta_i^*)^2 + \alpha |\theta_i| \right]$
• The optimum $\theta_R^*$
  $(\theta_R^*)_i \approx \begin{cases} \max\{\theta_i^* - \alpha/H_{ii},\ 0\} & \text{if } \theta_i^* \ge 0 \\ \min\{\theta_i^* + \alpha/H_{ii},\ 0\} & \text{if } \theta_i^* < 0 \end{cases}$
Effect on the optimal solution
• Effect: induce sparsity; components with $|\theta_i^*| \le \alpha/H_{ii}$ are set exactly to zero

[Figure: plot of $(\theta_R^*)_i$ against $(\theta^*)_i$, flat at zero on the interval $[-\alpha/H_{ii},\ \alpha/H_{ii}]$]
Effect on the optimal solution
• Further assume that $H$ is diagonal
• Compact expression for the optimum $\theta_R^*$
  $(\theta_R^*)_i \approx \mathrm{sign}(\theta_i^*)\,\max\{|\theta_i^*| - \alpha/H_{ii},\ 0\}$
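This compact expression is the soft-thresholding operator, easily implemented elementwise; the input values below are arbitrary illustrative choices:

```python
import numpy as np

def soft_threshold(theta_star, alpha, H_diag):
    """Closed-form l1-regularized optimum under the diagonal quadratic
    approximation: shrink each component by alpha/H_ii, clipping at zero."""
    t = alpha / H_diag
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - t, 0.0)

theta_star = np.array([2.0, 0.3, -0.1, -1.5])
H_diag = np.ones(4)                 # take H = I for simplicity
alpha = 0.5

print(soft_threshold(theta_star, alpha, H_diag))
# [1.5, 0.0, 0.0, -1.0]: small components are driven exactly to zero
```

Large components are shifted toward zero by $\alpha/H_{ii}$; components already smaller than the threshold become exactly zero, which is the sparsity effect.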
Bayesian view
• $\ell_1$ regularization corresponds to a Laplacian prior
  $p(\theta) \propto \exp\left(-\alpha \sum_i |\theta_i|\right)$
  $\log p(\theta) = -\alpha \sum_i |\theta_i| + \text{constant} = -\alpha \|\theta\|_1 + \text{constant}$
