Lecture 7: Loss Function and Regularization
• How to regularize?
– Shrink coefficients
– Reduce features
Regularization is constraining a model
• How to regularize?
– Reduce the number of parameters
• Share weights in structure
– Constrain parameters to be small
– Encourage sparsity of output in loss
• Most commonly Tikhonov (or L2, or ridge) regularization (a.k.a. weight decay)
– Penalty on the sum of squares of the individual weights
J = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^2 + \frac{\lambda}{2}\sum_{j=1}^{n} w_j^2, \qquad f(x_i) = \sum_{j=0}^{n} w_j\, x_i^{\,j}
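As a minimal NumPy sketch of this objective (the function and variable names here are illustrative, not from the lecture):

    import numpy as np

    def ridge_objective(w, x, y, lam):
        # Polynomial model f(x_i) = sum_{j=0..n} w_j * x_i**j
        n = len(w) - 1
        X = np.vander(x, N=n + 1, increasing=True)   # columns x**0 ... x**n
        residuals = y - X @ w
        mse = np.mean(residuals ** 2)
        # Tikhonov / L2 / ridge penalty; the bias w_0 is excluded, as in the sum above
        penalty = 0.5 * lam * np.sum(w[1:] ** 2)
        return mse + penalty

Minimizing this over w shrinks the coefficients toward zero without setting them exactly to zero.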
Coefficient shrinkage using ridge
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
L2-regularization visualized
Contents
• Revisiting MSE and L2 regularization
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Other forms of regularization
• L1-regularization (sparsity-inducing norm)
– Penalty on the sum of absolute values of the weights (see the sketch below)
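A comparable sketch of the lasso (L1-penalized) objective, again with illustrative names; unlike the ridge penalty, minimizing this tends to drive some coefficients exactly to zero:

    import numpy as np

    def lasso_objective(w, X, y, lam):
        # Squared error plus an L1 (sparsity-inducing) penalty on the weights
        residuals = y - X @ w
        return np.mean(residuals ** 2) + lam * np.sum(np.abs(w))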
Lasso coefficient paths with decreasing λ
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Compare to the coefficient shrinkage path of ridge
Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
Thresholding in three cases: no alteration of large coefficients by SCAD and hard thresholding
Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
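For reference, the three thresholding rules compared by Fan and Li can be sketched as below (my reading of the paper; λ is the threshold and a > 2 is the SCAD shape parameter):

    import numpy as np

    def hard_threshold(z, lam):
        # Hard thresholding: kill small coefficients, keep large ones unchanged
        return z * (np.abs(z) > lam)

    def soft_threshold(z, lam):
        # Soft thresholding (lasso): every surviving coefficient is shrunk by lam
        return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

    def scad_threshold(z, lam, a=3.7):
        # SCAD: soft-thresholds small coefficients, leaves large ones unchanged
        # (a = 3.7 is the value suggested by Fan and Li)
        az = np.abs(z)
        small = np.sign(z) * np.maximum(az - lam, 0.0)        # |z| <= 2*lam
        mid = ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)  # 2*lam < |z| <= a*lam
        return np.where(az <= 2 * lam, small,
                        np.where(az <= a * lam, mid, z))

This is the point of the slide: SCAD and hard thresholding do not alter large coefficients, whereas soft thresholding (lasso) shrinks everything by λ.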
Motivation for elastic net
• The p >> n problem and grouped selection
– Microarrays: p > 10,000 and n < 100.
– For those genes sharing the same biological “pathway”, the correlations among them can be high.
• LASSO limitations
– If p > n, the lasso selects at most n variables; the number of selected variables is bounded by the sample size.
– Grouped variables: the lasso fails to do grouped selection. It tends to select one variable from a group and ignore the others.
Source: Elastic SCAD SVM, by Becker, Toedt, Lichter and Benner, in BMC Bioinformatics, 2011
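A minimal sketch of the elastic net objective, which simply adds the L2 penalty to the lasso objective above (this plain parameterization is illustrative; libraries such as scikit-learn use an equivalent alpha/l1_ratio form):

    import numpy as np

    def elastic_net_objective(w, X, y, lam1, lam2):
        # L1 term induces sparsity; L2 term stabilizes groups of correlated variables
        residuals = y - X @ w
        return (np.mean(residuals ** 2)
                + lam1 * np.sum(np.abs(w))
                + lam2 * np.sum(w ** 2))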
A family of loss functions
Source: “A General and Adaptive Robust Loss Function” Jonathan T. Barron, ArXiv 2017
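A sketch of the general loss, based on my reading of Barron's paper: a shape parameter α interpolates between familiar losses, and a scale c sets the size of the quadratic bowl:

    import numpy as np

    def general_robust_loss(x, alpha, c=1.0):
        # Barron's general robust loss for alpha not in {0, 2}; the L2 (alpha = 2),
        # Cauchy (alpha = 0) and Welsch (alpha -> -inf) losses arise as limits.
        z = (x / c) ** 2
        b = abs(alpha - 2.0)
        return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)

For example, alpha = 1 gives a smoothed-L1 (Charbonnier-like) loss, while alpha = -2 gives the Geman-McClure loss.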
Contents
• Revisiting MSE and L2 regularization
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
Losses for ranking and metric learning
• Margin loss
• Cosine similarity
• Ranking
– Point-wise
– Pair-wise
• φ(z) = (1 − z)+, e^(−z), or log(1 + e^(−z)) (see the sketch below)
– List-wise
Source: “Ranking Measures and Loss Functions in Learning to Rank” Chen et al, NIPS 2009
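A sketch of the three pair-wise surrogates φ(z) listed above, applied to the score difference z = s(positive) − s(negative) (function names are mine):

    import numpy as np

    def hinge_pairwise(z):      # phi(z) = (1 - z)_+
        return np.maximum(1.0 - z, 0.0)

    def exp_pairwise(z):        # phi(z) = exp(-z)
        return np.exp(-z)

    def logistic_pairwise(z):   # phi(z) = log(1 + exp(-z))
        return np.log1p(np.exp(-z))

Each penalizes pairs in which the negative item is scored close to, or above, the positive item.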
Dropout: Drop a unit out to prevent co-adaptation
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Why dropout?
• Make other features unreliable to break co-adaptation
• Equivalent to adding noise
• Train several (dropped-out) architectures within one architecture (O(2^n) of them; see the sketch below)
• Average architectures at run time
– Is this a good method for averaging?
– How about Bayesian averaging?
– Practically, this works well too
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
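As a minimal sketch of (inverted) dropout on a layer's activations, assuming NumPy and a keep probability p; rescaling by 1/p at training time is what lets a single unscaled forward pass at test time approximate the average over the O(2^n) thinned networks:

    import numpy as np

    def dropout(activations, p=0.5, train=True, rng=None):
        # Drop each unit with probability 1 - p and rescale the survivors,
        # so the expected activation is unchanged.
        rng = np.random.default_rng() if rng is None else rng
        if not train:
            return activations   # inverted dropout: no extra scaling at test time
        mask = rng.random(activations.shape) < p
        return activations * mask / p

(The original paper instead keeps activations unscaled during training and multiplies the weights by p at test time; the two are equivalent in expectation.)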
Model averaging
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Indeed, dropout leads to sparse activations
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
There is a sweet spot with dropout, even if you increase the number of neurons
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.