Deep Learning Basics Lecture 4 Regularization II
Lecture 4: Regularization II
Princeton University COS 495
Instructor: Yingyu Liang
Review
Regularization as hard constraint
• Constrained optimization
$$\min_{\theta} \; L(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i)$$
subject to: $R(\theta) \le r$
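As an added illustration (not from the slides), the hard-constraint view can be optimized with projected gradient descent: take a gradient step on the data loss, then project back onto the feasible set $\{\theta : R(\theta) \le r\}$. A minimal NumPy sketch for $R(\theta) = \lVert\theta\rVert_2^2$; the names grad_L, theta0, r, lr, steps are illustrative choices.

import numpy as np

def project_l2_ball(theta, r):
    # Project theta onto the feasible set {theta : ||theta||_2^2 <= r}.
    norm_sq = float(np.dot(theta, theta))
    if norm_sq <= r:
        return theta
    return theta * np.sqrt(r / norm_sq)

def projected_gradient_descent(grad_L, theta0, r, lr=0.1, steps=1000):
    # Minimize L(theta) subject to ||theta||_2^2 <= r.
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta - lr * grad_L(theta)   # gradient step on the data loss
        theta = project_l2_ball(theta, r)    # enforce the hard constraint
    return theta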
Regularization as soft constraint
• Unconstrained optimization
$$\min_{\theta} \; L_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda R(\theta)$$
for some regularization parameter $\lambda > 0$
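In practice the soft-constraint form is what training code implements: the gradient of $\lambda R(\theta)$ is simply added to the data-loss gradient, which for $R(\theta) = \lVert\theta\rVert_2^2$ gives the familiar weight-decay update. A minimal NumPy sketch added here for illustration; grad_L, theta0, lam, lr, steps are illustrative names.

import numpy as np

def penalized_gradient_descent(grad_L, theta0, lam=1e-2, lr=0.1, steps=1000):
    # Minimize L(theta) + lam * ||theta||_2^2 by gradient descent.
    theta = theta0.copy()
    for _ in range(steps):
        grad = grad_L(theta) + 2.0 * lam * theta   # data gradient + penalty gradient
        theta = theta - lr * grad                  # equivalent to decaying theta each step
    return theta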
Regularization as Bayesian prior
• Bayesian rule:
$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$
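To connect this view to the penalty form above (a worked step added here, not quoted from the slides): take the maximum a posteriori (MAP) estimate and drop the normalizer $p(\{x_i, y_i\})$, which does not depend on $\theta$; a Gaussian prior then contributes an $\ell_2$ penalty on the negative-log scale.
$$\max_{\theta} \log p(\theta \mid \{x_i, y_i\}) = \max_{\theta} \; \log p(\{x_i, y_i\} \mid \theta) + \log p(\theta)$$
$$p(\theta) \propto \exp\left(-\lambda \lVert\theta\rVert_2^2\right) \;\Longrightarrow\; -\log p(\theta) = \lambda \lVert\theta\rVert_2^2 + \text{const}$$
So MAP estimation with a Gaussian prior is equivalent to minimizing the negative log-likelihood plus $\lambda \lVert\theta\rVert_2^2$.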
Add noise to the input
[Figures: Class +1 vs. Class −1 with candidate linear separators $w_1, w_2, w_3$; later panels show $w_2$ alone]
• Suppose the hypothesis is linear, $f(x) = w^\top x$, and the input noise is $\epsilon \sim N(0, \lambda I)$, independent of $(x, y)$
• The loss on noisy inputs is
$$L(f) = \mathbb{E}_{x,y,\epsilon}\left[f(x + \epsilon) - y\right]^2 = \mathbb{E}_{x,y,\epsilon}\left[f(x) + w^\top \epsilon - y\right]^2$$
• Expanding the square:
$$L(f) = \mathbb{E}_{x,y,\epsilon}\left[f(x) - y\right]^2 + 2\,\mathbb{E}_{x,y,\epsilon}\left[w^\top \epsilon \left(f(x) - y\right)\right] + \mathbb{E}_{x,y,\epsilon}\left[w^\top \epsilon\right]^2$$
• Since $\epsilon$ has mean zero and is independent of $(x, y)$, the cross term vanishes and $\mathbb{E}\left[w^\top \epsilon\right]^2 = \lambda \lVert w \rVert^2$, so
$$L(f) = \mathbb{E}_{x,y}\left[f(x) - y\right]^2 + \lambda \lVert w \rVert^2$$
• That is, adding Gaussian noise to the input is equivalent to $\ell_2$ regularization on the weights
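As a quick numerical sanity check, added here and not part of the slides, the identity above can be verified by Monte Carlo for a fixed linear hypothesis: the noisy-input loss should match the clean loss plus $\lambda \lVert w \rVert^2$. All names and constants below (n, d, lam, w, w_true) are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200_000, 5, 0.3

w_true = rng.normal(size=d)                        # generates the labels
w = rng.normal(size=d)                             # fixed linear hypothesis f(x) = w^T x
x = rng.normal(size=(n, d))
y = x @ w_true + 0.1 * rng.normal(size=n)
eps = rng.normal(scale=np.sqrt(lam), size=(n, d))  # eps ~ N(0, lam * I)

noisy_loss = np.mean(((x + eps) @ w - y) ** 2)     # E[(f(x + eps) - y)^2]
clean_loss = np.mean((x @ w - y) ** 2)             # E[(f(x) - y)^2]
print(noisy_loss, clean_loss + lam * np.sum(w ** 2))   # approximately equal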
Add noise to the weights
• For the loss on each data point, add a noise term to the weights
before computing the prediction
$$\epsilon \sim N(0, \eta I), \qquad w' = w + \epsilon$$
• Prediction: $f_{w'}(x)$ instead of $f_w(x)$
• Loss becomes
$$L(f) = \mathbb{E}_{x,y,\epsilon}\left[f_{w+\epsilon}(x) - y\right]^2$$
Add noise to the weights
• Loss becomes
$$L(f) = \mathbb{E}_{x,y,\epsilon}\left[f_{w+\epsilon}(x) - y\right]^2$$
• Advantages
• Efficient: happens along with training; only an extra copy of the weights needs to be stored
• Simple: no change to the model or the algorithm
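Below is a minimal PyTorch-style sketch of one training step with weight noise, added for illustration; the function name noisy_weight_step and the arguments model, loss_fn, optimizer, eta are my own, not from the lecture. It perturbs the weights, computes the loss at $w + \epsilon$, restores the stored clean copy, and then applies the update.

import torch

def noisy_weight_step(model, loss_fn, x, y, optimizer, eta=0.01):
    # One step with weight noise: w' = w + eps, eps ~ N(0, eta * I).
    clean = [p.detach().clone() for p in model.parameters()]   # the extra copy of weights
    with torch.no_grad():                                      # perturb the weights in place
        for p in model.parameters():
            p.add_(torch.randn_like(p) * eta ** 0.5)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)                                # prediction uses f_{w + eps}
    loss.backward()                                            # gradients taken at w + eps
    with torch.no_grad():                                      # restore the clean weights
        for p, c in zip(model.parameters(), clean):
            p.copy_(c)
    optimizer.step()                                           # update the clean weights
    return loss.item()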
Dropout
• Typical dropout probability: 0.2 for the input units and 0.5 for the hidden units
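For reference, a minimal NumPy sketch of (inverted) dropout on a layer's activations, added as an illustration; the function name and the train-time rescaling by 1/(1 − p) are my own conventions, not necessarily the lecture's.

import numpy as np

def dropout(h, p, training=True, rng=None):
    # Zero each activation independently with probability p during training,
    # and rescale the survivors by 1 / (1 - p) so the expected value is unchanged.
    if not training or p == 0.0:
        return h
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) >= p        # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)

For example, dropout(x, 0.2) on the input and dropout(h, 0.5) on hidden activations match the typical probabilities above.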