Deep Learning Basics

Lecture 4: Regularization II
Princeton University COS 495
Instructor: Yingyu Liang
Review
Regularization as hard constraint
• Constrained optimization
$$\min_{\theta} \; \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i)$$

subject to: $R(\theta) \le r$
Regularization as soft constraint
• Unconstrained optimization
$$\min_{\theta} \; \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda R(\theta)$$

for some regularization parameter $\lambda > 0$
Regularization as Bayesian prior
• Bayes' rule:

$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$

• Maximum A Posteriori (MAP):

$$\max_{\theta} \; \log p(\theta \mid \{x_i, y_i\}) = \max_{\theta} \left[ \log p(\theta) + \log p(\{x_i, y_i\} \mid \theta) \right]$$

where $\log p(\theta)$ corresponds to the regularization and $\log p(\{x_i, y_i\} \mid \theta)$ to the MLE loss.
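As a standard concrete instance (not spelled out on the slide), a Gaussian prior on the parameters recovers the $l_2$ penalty:

% Assume a Gaussian prior p(\theta) \propto \exp\!\left( -\tfrac{\|\theta\|_2^2}{2\sigma^2} \right).
% Then, up to an additive constant,
\max_{\theta} \; \log p(\theta \mid \{x_i, y_i\})
    = \max_{\theta} \left[ \log p(\{x_i, y_i\} \mid \theta) - \frac{1}{2\sigma^2} \|\theta\|_2^2 \right]
% i.e. maximum likelihood plus an l_2 penalty with \lambda = 1/(2\sigma^2).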


Classical regularizations
• Norm penalty
• 𝑙2 regularization
• 𝑙1 regularization
More examples
Other types of regularizations
• Robustness to noise
• Noise to the input
• Noise to the weights
• Noise to the output
• Data augmentation
• Early stopping
• Dropout
Multiple optimal solutions?

[Figure: linearly separable data from Class +1 and Class -1, with three separating hyperplanes $w_1$, $w_2$, $w_3$; prefer $w_2$ (higher confidence).]


Add noise to the input

[Figure: the same data with noise added to the inputs; $w_2$ (higher confidence) still separates the noisy points.]


Caution: not too much noise
Too much noise leads to data points crossing the boundary.

[Figure: with excessive input noise, noisy points from Class +1 and Class -1 cross the separator $w_2$, so even the preferred $w_2$ (higher confidence) misclassifies them.]


Equivalence to weight decay
• Suppose the hypothesis is $f(x) = w^T x$ and the noise is $\epsilon \sim N(0, \lambda I)$
• After adding noise to the input, the loss is

$$L(f) = \mathbb{E}_{x,y,\epsilon} \left[ f(x+\epsilon) - y \right]^2 = \mathbb{E}_{x,y,\epsilon} \left[ f(x) + w^T \epsilon - y \right]^2$$

$$L(f) = \mathbb{E}_{x,y,\epsilon} \left[ f(x) - y \right]^2 + 2\, \mathbb{E}_{x,y,\epsilon} \left[ w^T \epsilon \left( f(x) - y \right) \right] + \mathbb{E}_{x,y,\epsilon} \left[ w^T \epsilon \right]^2$$

$$L(f) = \mathbb{E}_{x,y} \left[ f(x) - y \right]^2 + \lambda \| w \|^2$$

The cross term vanishes because $\epsilon$ is zero-mean and independent of $(x, y)$, and $\mathbb{E}\left[ w^T \epsilon \right]^2 = w^T (\lambda I) w = \lambda \| w \|^2$.
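A minimal numerical sketch of this equivalence (not from the slides; the data, dimensions, and value of lam below are made-up illustration choices):

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100_000, 5, 0.1

# A fixed linear hypothesis f(x) = w^T x and some synthetic data.
w = rng.normal(size=d)
x = rng.normal(size=(n, d))
y = x @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Empirical loss with Gaussian noise eps ~ N(0, lam * I) added to the inputs.
eps = rng.normal(scale=np.sqrt(lam), size=(n, d))
noisy_loss = np.mean(((x + eps) @ w - y) ** 2)

# Clean loss plus the weight-decay term lam * ||w||^2.
clean_plus_decay = np.mean((x @ w - y) ** 2) + lam * (w @ w)

print(noisy_loss, clean_plus_decay)  # the two numbers should nearly agree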
Add noise to the weights
• For the loss on each data point, add a noise term to the weights before computing the prediction:

$$\epsilon \sim N(0, \eta I), \qquad w' = w + \epsilon$$

• Prediction: $f_{w'}(x)$ instead of $f_w(x)$
• The loss becomes

$$L(f) = \mathbb{E}_{x,y,\epsilon} \left[ f_{w+\epsilon}(x) - y \right]^2$$
Add noise to the weights
• The loss becomes

$$L(f) = \mathbb{E}_{x,y,\epsilon} \left[ f_{w+\epsilon}(x) - y \right]^2$$

• To simplify, use a Taylor expansion in $\epsilon$ around $w$:

$$f_{w+\epsilon}(x) \approx f_w(x) + \epsilon^T \nabla_w f_w(x) + \frac{\epsilon^T \nabla_w^2 f_w(x)\, \epsilon}{2}$$

• Plug in:

$$L(f) \approx \mathbb{E} \left[ f_w(x) - y \right]^2 + \eta\, \mathbb{E} \left[ \left( f_w(x) - y \right) \nabla_w^2 f_w(x) \right] + \eta\, \mathbb{E} \left\| \nabla_w f_w(x) \right\|^2$$

The second term is small and can be ignored; the last term acts as the regularization term.
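For the linear case $f_w(x) = w^T x$ the gradient $\nabla_w f_w(x)$ is simply $x$ (and the Hessian term is exactly zero), so the extra term is $\eta\, \mathbb{E}\|x\|^2$. A rough numerical check under the same kind of made-up synthetic setup as before (illustrative only):

import numpy as np

rng = np.random.default_rng(1)
n, d, eta = 200_000, 5, 0.05

w = rng.normal(size=d)
x = rng.normal(size=(n, d))
y = x @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Empirical loss with Gaussian noise eps ~ N(0, eta * I) added to the weights.
eps = rng.normal(scale=np.sqrt(eta), size=(n, d))
weight_noise_loss = np.mean((np.sum((w + eps) * x, axis=1) - y) ** 2)

# Clean loss plus eta * E||grad_w f_w(x)||^2, which equals eta * E||x||^2 here.
clean_plus_penalty = np.mean((x @ w - y) ** 2) + eta * np.mean(np.sum(x ** 2, axis=1))

print(weight_noise_loss, clean_plus_penalty)  # should be close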


Data augmentation

[Figure from "Image Classification with Pyramid Representation and Rotated Data Augmentation on Torch 7", by Keven Wang]
Data augmentation
• Adding noise to the input: a special kind of augmentation

• Be careful about which transformations are applied (see the sketch after this list):
• Example: classifying ‘b’ and ‘d’ (a horizontal flip changes the label)
• Example: classifying ‘6’ and ‘9’ (a 180° rotation changes the label)
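A minimal augmentation sketch using small random rotations (scipy and the 15-degree range are assumptions made for illustration, not part of the lecture):

import numpy as np
from scipy.ndimage import rotate

def augment_with_rotations(images, max_angle=15.0, rng=None):
    # Return one randomly rotated copy of each image in `images`
    # (an array of shape [num_images, height, width]).
    # Keep the angle range small: a 180-degree rotation would turn a '6'
    # into a '9' and silently change the correct label.
    rng = rng or np.random.default_rng()
    rotated = []
    for img in images:
        angle = rng.uniform(-max_angle, max_angle)
        rotated.append(rotate(img, angle, reshape=False, mode="nearest"))
    return np.stack(rotated)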
Early stopping
• Idea: don’t train the network to too small a training error

• Recall overfitting: the larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the training data and the true distribution

• Prevent overfitting: do not push the hypothesis too far; use the validation error to decide when to stop
Early stopping

[Figure from Deep Learning, Goodfellow, Bengio and Courville]
Early stopping
• When training, also monitor the validation error
• Every time the validation error improves, store a copy of the weights
• When the validation error has not improved for some time, stop
• Return the stored copy of the weights (a code sketch of this loop follows)
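A sketch of this loop in Python; `train_one_epoch`, `validation_error`, and the patience value are placeholders assumed for illustration, not part of the lecture:

import copy

def train_with_early_stopping(weights, train_one_epoch, validation_error,
                              patience=10, max_epochs=1000):
    # weights: any copyable parameter object
    # train_one_epoch(weights) -> weights after one more pass over the training data
    # validation_error(weights) -> scalar validation error
    best_weights = copy.deepcopy(weights)
    best_error = validation_error(weights)
    epochs_since_improvement = 0

    for _ in range(max_epochs):
        weights = train_one_epoch(weights)
        error = validation_error(weights)
        if error < best_error:
            # Validation error improved: store a copy of the weights.
            best_error = error
            best_weights = copy.deepcopy(weights)
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                # Not improved for `patience` epochs: stop.
                break

    # Return the stored copy, not the latest weights.
    return best_weights, best_error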
Early stopping
• Hyperparameter-selection view: the number of training steps is the hyperparameter being tuned

• Advantages
• Efficient: runs along with training; only an extra copy of the weights is stored
• Simple: no change to the model or algorithm

• Disadvantage: needs held-out validation data


Early stopping
• Strategy to get rid of the disadvantage
• After early stopping of the first run, train a second run and reuse the validation data

• How to reuse the validation data:
1. Start fresh; train with both the training data and the validation data up to the number of epochs found in the first run
2. Start from the weights of the first run; train with both the training data and the validation data until the validation loss falls below the training loss at the early-stopping point
Early stopping as a regularizer

[Figure from Deep Learning, Goodfellow, Bengio and Courville]
Dropout
• Randomly select which units (and hence which weights) participate in each update

• More precisely, in each update step:
• Randomly sample a different binary mask for all the input and hidden units
• Multiply the mask bits with the units and do the update as usual (see the sketch after this list)

• Typical dropout probabilities: 0.2 for input units and 0.5 for hidden units
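A minimal sketch of one training-time dropout forward pass for a two-layer network, assuming numpy arrays for the weights; the "inverted dropout" rescaling by the keep probability is a common implementation choice, not something stated on the slide:

import numpy as np

def dropout_forward(x, W1, b1, W2, b2, p_input=0.2, p_hidden=0.5, rng=None):
    # One forward pass with dropout masks on the input and hidden units.
    # Activations are divided by the keep probability ("inverted dropout")
    # so no rescaling is needed at test time.
    rng = rng or np.random.default_rng()

    # Mask input units with drop probability p_input.
    keep_in = 1.0 - p_input
    x = x * rng.binomial(1, keep_in, size=x.shape) / keep_in

    # Hidden layer (ReLU), then mask hidden units with drop probability p_hidden.
    h = np.maximum(0.0, x @ W1 + b1)
    keep_h = 1.0 - p_hidden
    h = h * rng.binomial(1, keep_h, size=h.shape) / keep_h

    # Output layer: no dropout applied to the outputs.
    return h @ W2 + b2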
Dropout

[Figures from Deep Learning, Goodfellow, Bengio and Courville]
What regularizations are frequently used?
• 𝑙2 regularization
• Early stopping
• Dropout

• Data augmentation, if the transformations are known and easy to implement
