Deep Learning Basics Lecture 3 Regularization I
Lecture 3: Regularization I
Princeton University COS 495
Instructor: Yingyu Liang
What is regularization?
• In general: any method to prevent overfitting or help the optimization
• The smaller the data set, the larger the gap between training error and test error
• The larger the hypothesis class, the easier it is to find a hypothesis that fits this gap,
i.e., one with small training error but large test error (overfitting)
Prevent overfitting
• Larger data set helps
• Throwing away useless hypotheses also helps
• When the hypothesis class is parametrized by $\theta$:
$$\min_\theta \; L(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i)$$
subject to: $\theta \in \Omega$
Regularization as hard constraint
• When membership in $\Omega$ is measured by some quantity $R$:
$$\min_\theta \; L(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i)$$
subject to: $R(\theta) \le r$
• Example: $l_2$ regularization
$$\min_\theta \; L(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i)$$
subject to: $\|\theta\|_2^2 \le r^2$
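One common way to optimize under such a hard constraint is projected gradient descent: take a gradient step, then project back onto the constraint set. The sketch below is an illustrative assumption, not from the slides; the quadratic loss and radius are made-up test values.

```python
import numpy as np

def project_l2_ball(theta, r):
    """Project theta onto the set {x : ||x||_2 <= r}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

def projected_gradient_descent(grad, theta0, r, lr=0.1, steps=200):
    """Minimize a loss (given via its gradient) subject to ||theta||_2 <= r."""
    theta = theta0.copy()
    for _ in range(steps):
        theta = project_l2_ball(theta - lr * grad(theta), r)
    return theta

# Illustrative loss L(theta) = 0.5 * ||theta - target||^2, whose
# unconstrained minimizer lies outside the constraint ball of radius 1.
target = np.array([3.0, 4.0])           # ||target||_2 = 5
grad = lambda th: th - target
theta = projected_gradient_descent(grad, np.zeros(2), r=1.0)
# Constrained optimum: target rescaled onto the ball, i.e. [0.6, 0.8]
```

Here the constrained solution is simply the unconstrained one scaled back to the ball, because the loss is isotropic around `target`; for general losses, the projected iterate need not lie along that line.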
Regularization as soft constraint
• The hard-constraint optimization is equivalent to the soft-constraint version
$$\min_\theta \; L_R(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i) + \lambda^* R(\theta)$$
for some regularization parameter $\lambda^* > 0$
• Example: $l_2$ regularization
$$\min_\theta \; L_R(\theta) = \frac{1}{n}\sum_{i=1}^n l(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$$
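The soft-constraint form is what is usually implemented in practice. A minimal NumPy sketch for the squared-loss case (ridge regression); the data, true weights, and $\lambda$ below are made-up placeholders:

```python
import numpy as np

def ridge_objective(theta, X, y, lam):
    """L_R(theta) = (1/n) * sum_i (x_i . theta - y_i)^2 + lam * ||theta||_2^2"""
    n = len(y)
    residuals = X @ theta - y
    return residuals @ residuals / n + lam * theta @ theta

def ridge_closed_form(X, y, lam):
    """Minimizer of the objective above: solve (X^T X / n + lam I) theta = X^T y / n."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
theta = ridge_closed_form(X, y, lam=0.1)
```

Setting the gradient $\frac{2}{n}X^T(X\theta - y) + 2\lambda\theta$ to zero gives the linear system solved above; for non-quadratic losses one would use gradient descent on the same objective instead.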
Regularization as soft constraint
• Shown via the Lagrange multiplier method
$$\mathcal{L}(\theta, \lambda) := L(\theta) + \lambda [R(\theta) - r]$$
• Suppose $\theta^*$ is the optimum of the hard-constraint problem
$$\theta^* = \arg\min_\theta \max_{\lambda \ge 0} \mathcal{L}(\theta, \lambda)$$
• Suppose $\lambda^*$ is the corresponding optimal $\lambda$ for the inner max; then
$$\theta^* = \arg\min_\theta \mathcal{L}(\theta, \lambda^*) = \arg\min_\theta L(\theta) + \lambda^* R(\theta)$$
(the $-\lambda^* r$ term is constant in $\theta$, so it does not affect the minimizer)
Regularization as Bayesian prior
• Bayesian view: everything is a distribution
• Prior over the hypotheses: $p(\theta)$
• Posterior over the hypotheses: $p(\theta \mid \{x_i, y_i\})$
• Likelihood: $p(\{x_i, y_i\} \mid \theta)$
• Bayes' rule:
$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$
Regularization as Bayesian prior
• Maximum a posteriori (MAP) estimation picks the $\theta$ maximizing the posterior:
$$\max_\theta \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \left[ \log p(\{x_i, y_i\} \mid \theta) + \log p(\theta) \right] + \text{constant}$$
so the negative log-prior $-\log p(\theta)$ plays the role of the regularizer $R(\theta)$
• Regularization also improves robustness to noise
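To make the MAP connection concrete: with a Gaussian likelihood and a Gaussian prior on $\theta$, the negative log-posterior is a squared loss plus an $l_2$ penalty, and its minimizer can be found in closed form. A small numerical check, where the data, $\sigma$, and $\alpha$ are arbitrary test values:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, alpha = 1.0, 2.0
y = rng.normal(loc=1.5, scale=sigma, size=20)
n = len(y)

# Likelihood y_i ~ N(theta, sigma^2), prior theta ~ N(0, 1/alpha).
# The negative log-posterior (up to constants) is
#   (1/(2 sigma^2)) * sum_i (y_i - theta)^2 + (alpha/2) * theta^2,
# i.e. squared loss plus an l2 penalty. Setting its derivative to zero:
theta_map = np.sum(y) / (n + alpha * sigma ** 2)

# Brute-force minimization of the same regularized objective on a grid
grid = np.linspace(-5, 5, 200001)
obj = (np.sum((y[:, None] - grid) ** 2, axis=0) / (2 * sigma ** 2)
       + (alpha / 2) * grid ** 2)
theta_grid = grid[np.argmin(obj)]
# theta_map and theta_grid agree (up to grid resolution)
```

The prior pulls the estimate toward zero relative to the sample mean, exactly as the $l_2$ penalty does.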
𝑙2 regularization
$$\min_\theta \; L_R(\theta) = L(\theta) + \frac{\alpha}{2}\|\theta\|_2^2$$
• Let $\theta^*$ be the minimizer of the unregularized $L$; since $\theta^*$ is optimal, $\nabla L(\theta^*) = 0$
• Second-order Taylor expansion around $\theta^*$, with Hessian $H = \nabla^2 L(\theta^*)$:
$$L(\theta) \approx L(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)$$
$$\nabla L(\theta) \approx H(\theta - \theta^*)$$
Effect on the optimal solution
• Gradient of the regularized objective:
$$\nabla L_R(\theta) \approx H(\theta - \theta^*) + \alpha\theta$$
• At the regularized optimum $\theta_R^*$:
$$0 = \nabla L_R(\theta_R^*) \approx H(\theta_R^* - \theta^*) + \alpha\theta_R^*$$
$$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$$
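For an exactly quadratic loss the approximation above is exact, which is easy to verify numerically ($H$, $\theta^*$, and $\alpha$ below are arbitrary test values):

```python
import numpy as np

rng = np.random.default_rng(1)
d, alpha = 4, 0.3

# A random positive definite Hessian H and unregularized optimum theta_star
A = rng.normal(size=(d, d))
H = A @ A.T + d * np.eye(d)
theta_star = rng.normal(size=d)

# Quadratic loss L(theta) = 0.5 (theta - theta*)^T H (theta - theta*);
# minimize L_R = L + (alpha/2) ||theta||^2 by solving grad = 0:
#   H (theta - theta*) + alpha * theta = 0
theta_R = np.linalg.solve(H + alpha * np.eye(d), H @ theta_star)

# The gradient of the regularized objective vanishes at theta_R
grad_R = H @ (theta_R - theta_star) + alpha * theta_R
```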
Effect on the optimal solution
• The optimal solution, using the eigendecomposition $H = Q\Lambda Q^T$:
$$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^* = Q(\Lambda + \alpha I)^{-1}\Lambda Q^T \theta^*$$
• The component of $\theta^*$ along the $i$-th eigenvector of $H$ is rescaled by $\lambda_i / (\lambda_i + \alpha)$: directions with $\lambda_i \gg \alpha$ are preserved, while directions with $\lambda_i \ll \alpha$ are shrunk toward zero
(Figure: the unregularized optimum $w^* = \theta^*$ versus the regularized optimum $\tilde{w} = \theta_R^*$)
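The eigendecomposition makes the shrinkage explicit, and the two expressions for $\theta_R^*$ can be checked against each other numerically (all values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d, alpha = 4, 0.3
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)          # symmetric positive definite Hessian
theta_star = rng.normal(size=d)

# Eigendecomposition H = Q diag(lam) Q^T
lam, Q = np.linalg.eigh(H)

# Direct solve vs. rescaling each eigencomponent by lam_i / (lam_i + alpha)
direct = np.linalg.solve(H + alpha * np.eye(d), H @ theta_star)
rescaled = Q @ ((lam / (lam + alpha)) * (Q.T @ theta_star))
# direct and rescaled agree: eigendirections with lam_i >> alpha are nearly
# preserved, those with lam_i << alpha are shrunk toward zero.
```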
Effect on the optimal solution
• Now consider $l_1$ regularization: $L_R(\theta) = L(\theta) + \alpha\|\theta\|_1$
• Further assume that $H$ is diagonal and positive ($H_{ii} > 0, \forall i$)
• not true in general, but assumed here to get some intuition
• The regularized objective is (ignoring constants)
$$L_R(\theta) \approx \sum_i \left[ \frac{1}{2} H_{ii} (\theta_i - \theta_i^*)^2 + \alpha |\theta_i| \right]$$
• The optimal $\theta_R^*$:
$$(\theta_R^*)_i \approx \begin{cases} \max\left\{\theta_i^* - \frac{\alpha}{H_{ii}},\, 0\right\} & \text{if } \theta_i^* \ge 0 \\[4pt] \min\left\{\theta_i^* + \frac{\alpha}{H_{ii}},\, 0\right\} & \text{if } \theta_i^* < 0 \end{cases}$$
Effect on the optimal solution
• Effect: induce sparsity
(Figure: $(\theta_R^*)_i$ as a function of $(\theta^*)_i$, equal to zero on the interval $[-\alpha/H_{ii}, \alpha/H_{ii}]$)
Effect on the optimal solution
• Further assume that 𝐻 is diagonal
• Compact expression for the optimal $\theta_R^*$:
$$(\theta_R^*)_i \approx \operatorname{sign}(\theta_i^*)\max\left\{|\theta_i^*| - \frac{\alpha}{H_{ii}},\, 0\right\}$$
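The compact expression is the soft-thresholding operator. A short sketch that also checks it against the case-by-case formula above (the test values are arbitrary):

```python
import numpy as np

def soft_threshold(theta_star, alpha, H_diag):
    """(theta_R)_i = sign(theta*_i) * max(|theta*_i| - alpha / H_ii, 0)."""
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - alpha / H_diag, 0.0)

def piecewise(theta_star, alpha, H_diag):
    """The case-by-case formula: shrink toward 0, stopping at 0."""
    pos = np.maximum(theta_star - alpha / H_diag, 0.0)
    neg = np.minimum(theta_star + alpha / H_diag, 0.0)
    return np.where(theta_star >= 0, pos, neg)

theta_star = np.array([2.0, 0.1, -0.1, -2.0])
H_diag = np.ones(4)
out = soft_threshold(theta_star, alpha=0.5, H_diag=H_diag)
# Small components are zeroed out (sparsity), large ones shrunk by alpha/H_ii:
# out == [1.5, 0.0, 0.0, -1.5]
```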
Bayesian view
• $l_1$ regularization corresponds to a Laplacian prior:
$$p(\theta) \propto \exp\left(-\alpha \sum_i |\theta_i|\right)$$
$$-\log p(\theta) = \alpha \sum_i |\theta_i| + \text{constant} = \alpha\|\theta\|_1 + \text{constant}$$
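Concretely, the Laplace density with scale $1/\alpha$ is $p(\theta_i) = \frac{\alpha}{2}e^{-\alpha|\theta_i|}$, so for independent coordinates the negative log-prior is $\alpha\|\theta\|_1$ plus a constant. A quick check with arbitrary values of $\alpha$ and $\theta$:

```python
import numpy as np

alpha = 0.7
theta = np.array([1.0, -0.5, 2.0])

# Laplace(0, 1/alpha) log-density per coordinate: log(alpha/2) - alpha * |t|
log_prior = np.sum(np.log(alpha / 2) - alpha * np.abs(theta))

# Negative log-prior = alpha * ||theta||_1 + constant,
# so MAP estimation adds an l1 penalty to the training loss.
l1_penalty = alpha * np.sum(np.abs(theta))
constant = -len(theta) * np.log(alpha / 2)
# -log_prior == l1_penalty + constant
```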