
CM20315 - Machine Learning

Prof. Simon Prince


9. Regularization
Regularization
• Why is there a generalization gap between training and test data?
• Overfitting (model describes statistical peculiarities)
• Model unconstrained in areas where there are no training examples
• Regularization = methods to reduce the generalization gap
• Technically means adding terms to loss function
• But colloquially means any method (hack) to reduce gap
Regularization
• Explicit regularization
• Implicit regularization
• Early stopping
• Ensembling
• Dropout
• Adding noise
• Bayesian approaches
• Transfer learning, multi-task learning, self-supervised learning
• Data augmentation
Explicit regularization
• Standard loss function:

• Regularization adds an extra term

• Favors some parameters, disfavors others.

• λ > 0 controls the strength of the regularization
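The equations on this slide did not survive the export. A sketch in the style of the course text (assuming a per-example loss ℓ_i over I training pairs {x_i, y_i}, parameters φ, a regularizer g, and weight λ; the exact symbols are an assumption):

  \hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\operatorname{argmin}} \Big[ \sum_{i=1}^{I} \ell_i[\mathbf{x}_i, y_i] \Big]

  \hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\operatorname{argmin}} \Big[ \sum_{i=1}^{I} \ell_i[\mathbf{x}_i, y_i] + \lambda \, g[\boldsymbol{\phi}] \Big]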
Probabilistic interpretation
• Maximum likelihood:

• Regularization is equivalent to adding a prior over parameters

… what you know about the parameters before seeing the data


Equivalence
• Explicit regularization:

• Probabilistic interpretation:

• Mapping:
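The missing equations can be sketched as follows (assuming an independent likelihood Pr(y_i | x_i, φ) and a prior Pr(φ)):

  Maximum likelihood:    \hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\operatorname{argmax}} \Big[ \prod_{i=1}^{I} \Pr(y_i \mid \mathbf{x}_i, \boldsymbol{\phi}) \Big]

  Maximum a posteriori:  \hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\operatorname{argmax}} \Big[ \prod_{i=1}^{I} \Pr(y_i \mid \mathbf{x}_i, \boldsymbol{\phi}) \cdot \Pr(\boldsymbol{\phi}) \Big]

Taking the negative logarithm turns the second criterion into the regularized loss, with the regularization term playing the role of the negative log prior: λ · g[φ] ↔ −log Pr(φ) (up to scale and additive constants).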
L2 Regularization
• We rarely have prior knowledge about individual parameters, so only very general regularization terms can be used
• Most common is L2 regularization
• Favors smaller parameters

• Also called Tikhonov regularization or ridge regression


• In neural networks, it is usually applied just to the weights and is then called weight decay (sketched below)
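A minimal sketch of how the penalty enters a single gradient-descent step (Python/NumPy; loss_grad, the learning rate lr, and the regularization weight lam are assumed names, not part of the slides):

import numpy as np

def l2_regularized_step(phi, loss_grad, lr=0.01, lam=1e-3):
    """One gradient step on  loss + (lam/2) * ||phi||^2.

    phi       : parameter vector (np.ndarray)
    loss_grad : function returning dL/dphi at phi (assumed to be supplied by the user)
    """
    grad = loss_grad(phi) + lam * phi   # gradient of the L2 term is lam * phi
    return phi - lr * grad              # parameters shrink toward zero each step

The same update can be rewritten as phi ← (1 − lr·lam)·phi − lr·loss_grad(phi): every step first decays the weights toward zero, hence the name weight decay.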
Why does L2 regularization help?
• Discourages slavish adherence to the data (overfitting)
• Encourages smoothness between datapoints
L2 regularization
Regularization
• Explicit regularization
• Implicit regularization
• Early stopping
• Ensembling
• Dropout
• Adding noise
• Bayesian approaches
• Transfer learning, multi-task learning, self-supervised learning
• Data augmentation
Implicit regularization

(Figure: gradient descent approximates a differential equation in the limit of infinitesimal step size; a finite step size is equivalent to adding a regularization term to that differential equation; adding that regularization explicitly, the continuous equation converges to the same place as the discrete updates.)
Implicit regularization
• Gradient descent disfavors areas where gradients are steep

• SGD likes all batches to have similar gradients

• Depends on learning rate – perhaps why larger learning rates generalize better.
Generally, performance is
• best for larger learning rates
• best with smaller learning rates
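One way to make this concrete (a sketch following the implicit-gradient-regularization analysis; the formula is not on the slide): gradient descent with step size α approximately follows a continuous-time flow on a modified loss

  \tilde{L}[\boldsymbol{\phi}] = L[\boldsymbol{\phi}] + \frac{\alpha}{4} \Big\| \frac{\partial L}{\partial \boldsymbol{\phi}} \Big\|^2 ,

so larger step sizes penalize steep-gradient regions more heavily; for SGD an additional term penalizes the variance of the per-batch gradients around the full-batch gradient.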
Regularization
• Explicit regularization
• Implicit regularization
• Early stopping
• Ensembling
• Dropout
• Adding noise
• Bayesian approaches
• Transfer learning, multi-task learning, self-supervised learning
• Data augmentation
Early stopping
• If we stop training early, weights don’t have time to overfit to noise
• Weights start small, don’t have time to get large
• Reduces effective model complexity
• Known as early stopping
• Don’t have to re-train
Regularization
• Explicit regularization
• Implicit regularization
• Early stopping
• Ensembling
• Dropout
• Adding noise
• Bayesian approaches
• Transfer learning, multi-task learning, self-supervised learning
• Data augmentation
Ensembling
• Average together several models – an ensemble
• Can take mean or median
• Different initializations / different models
• Different subsets of the data, resampled with replacement (bagging), as sketched below
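A minimal sketch of bagging by averaging (Python/NumPy; make_model and the fit/predict interface are assumptions, not from the slides):

import numpy as np

def bagged_predictions(make_model, X_train, y_train, X_test, n_models=10, seed=0):
    rng = np.random.default_rng(seed)
    preds = []
    n = len(X_train)
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)     # resample the training data with replacement
        model = make_model()                 # fresh model (could also vary init or architecture)
        model.fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    preds = np.stack(preds)
    return preds.mean(axis=0)                # or np.median(preds, axis=0)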
Regularization
• Explicit regularization
• Implicit regularization
• Early stopping
• Ensembling
• Dropout
• Adding noise
• Bayesian approaches
• Transfer learning, multi-task learning, self-supervised learning
• Data augmentation
Dropout

• Can eliminate kinks in the function that are far from the data and don't contribute to the training loss
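Dropout randomly clamps a subset of hidden units to zero at each training iteration. A minimal sketch of inverted dropout applied to a hidden-layer activation (Python/NumPy; p is the probability of dropping a unit):

import numpy as np

def dropout(h, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if rng is None:
        rng = np.random.default_rng()
    if not training or p == 0.0:
        return h                              # at test time all units are kept
    mask = rng.random(h.shape) >= p           # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)               # rescale so the expected activation is unchanged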
Regularization
• Explicit regularization
• Implicit regularization
• Early stopping
• Ensembling
• Dropout
• Adding noise
• Bayesian approaches
• Transfer learning, multi-task learning, self-supervised learning
• Data augmentation
Adding noise

• to inputs
• to weights
• to outputs (labels)
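A minimal sketch of the three options (Python/NumPy; the noise scale sigma and the smoothing amount eps are assumed hyperparameters, and label smoothing is used here as one common way of adding noise to the targets):

import numpy as np

rng = np.random.default_rng(0)

def noisy_inputs(x, sigma=0.1):
    return x + sigma * rng.standard_normal(x.shape)       # perturb inputs each iteration

def noisy_weights(phi, sigma=0.01):
    return phi + sigma * rng.standard_normal(phi.shape)   # perturb weights before the forward pass

def smoothed_labels(y_onehot, eps=0.1):
    k = y_onehot.shape[-1]
    return (1 - eps) * y_onehot + eps / k                  # soften one-hot targets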
Regularization
• Explicit regularization
• Implicit regularization
• Early stopping
• Ensembling
• Dropout
• Adding noise
• Bayesian approaches
• Transfer learning, multi-task learning, self-supervised learning
• Data augmentation
Bayesian approaches
• There are many parameter values compatible with the data
• Can find a probability distribution over them (combining the likelihood with prior info about the parameters)

• Take all possible parameters into account when making a prediction
Bayesian approaches
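The missing equations can be sketched as follows (assuming a prior Pr(φ) and likelihood Pr(y | x, φ)):

  Posterior over parameters:  \Pr(\boldsymbol{\phi} \mid \{\mathbf{x}_i, y_i\}) \;\propto\; \Big[ \prod_{i=1}^{I} \Pr(y_i \mid \mathbf{x}_i, \boldsymbol{\phi}) \Big] \Pr(\boldsymbol{\phi})

  Prediction (marginalizing over parameters):  \Pr(y \mid \mathbf{x}, \{\mathbf{x}_i, y_i\}) = \int \Pr(y \mid \mathbf{x}, \boldsymbol{\phi}) \, \Pr(\boldsymbol{\phi} \mid \{\mathbf{x}_i, y_i\}) \, d\boldsymbol{\phi}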
Regularization
• Explicit regularization
• Implicit regularization
• Early stopping
• Ensembling
• Dropout
• Adding noise
• Bayesian approaches
• Transfer learning, multi-task learning, self-supervised learning
• Data augmentation
• Transfer learning

• Multi-task learning

• Self-supervised learning
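Of these, transfer learning is the simplest to illustrate. A minimal sketch (PyTorch; the pretrained backbone, its feature dimension, and the new task's class count are assumptions, not from the slides): reuse a network pretrained on a large secondary task, freeze its layers, and train a fresh output head on the smaller target dataset.

import torch.nn as nn
import torch.optim as optim

def build_finetune_model(pretrained_backbone, feature_dim, n_new_classes):
    """Hypothetical transfer-learning setup: frozen backbone + new trainable head."""
    for p in pretrained_backbone.parameters():
        p.requires_grad = False                    # keep the pretrained weights fixed initially
    head = nn.Linear(feature_dim, n_new_classes)   # new output layer for the target task
    model = nn.Sequential(pretrained_backbone, head)
    # Optimize only the head; later one can unfreeze and fine-tune everything with a small learning rate
    optimizer = optim.Adam(head.parameters(), lr=1e-3)
    return model, optimizer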
Regularization
• Explicit regularization
• Implicit regularization
• Early stopping
• Ensembling
• Dropout
• Adding noise
• Bayesian approaches
• Transfer learning, multi-task learning, self-supervised learning
• Data augmentation
Data augmentation
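A minimal sketch of on-the-fly augmentation for image data (Python/NumPy; the specific transformations, a horizontal flip and a small random crop, are assumed examples of label-preserving changes):

import numpy as np

rng = np.random.default_rng(0)

def augment(image, pad=4):
    """Random horizontal flip and random crop of an H x W x C image."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                            # horizontal flip
    h, w, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]             # crop back to the original size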
Regularization overview
