AML L2: Logistic Regression
Module 1: Generalized Regression and HDLSS Problems
Instructor: Amit Sethi
Co-developer: Neeraj Kumar
TAs: Gaurav Yadav, Niladri Bhattacharya
Page: AdvancedMachineLearning.weebly.com
IITG Course No: EE 622
Module objectives
• Understand linear and generalized linear regression
Source: Wikipedia
Solutions to linear regression
Source: Wikipedia
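To make the slide concrete, a minimal sketch of the closed-form least-squares solution via the normal equations; the synthetic data and variable names are illustrative, not from the slides:

```python
import numpy as np

# Illustrative synthetic data: y = X w + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Normal equations: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically more stable alternative: SVD-based least squares
w_svd, *_ = np.linalg.lstsq(X, y, rcond=None)
```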
Minimizing the Lp norm of error
Source: Wikipedia
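The Lp objective generalizes least squares (p = 2) and least absolute deviations (p = 1). A hedged sketch of a direct numerical minimizer, assuming SciPy is available (function name is illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def lp_fit(X, y, p=1.5):
    """Minimize the L_p norm of the residuals: sum_i |y_i - x_i^T w|^p.
    p = 2 recovers ordinary least squares; p = 1 is more robust to outliers."""
    loss = lambda w: np.sum(np.abs(y - X @ w) ** p)
    w0 = np.linalg.lstsq(X, y, rcond=None)[0]  # warm start at the OLS solution
    return minimize(loss, w0, method="Powell").x  # derivative-free, handles p = 1
```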
Logistic regression as a GLM
• GLM
• Exponential family
Sources: Wikipedia and Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
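In standard notation (not verbatim from the slide), a one-parameter exponential family and the GLM assumption read:

```latex
% Exponential family with natural parameter \eta
p(y \mid \eta) = h(y)\, \exp\{\eta\, T(y) - A(\eta)\}
% GLM: the natural parameter is linear in the inputs,
% and the mean is given by the derivative of the log-partition function
\eta = \theta^{\top} x, \qquad \mathbb{E}[y \mid x] = A'(\eta)
```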
Bernoulli distribution and the logistic function
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
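Writing the Bernoulli pmf in exponential-family form shows why the logistic function appears as the inverse of the canonical link:

```latex
p(y \mid \mu) = \mu^{y} (1-\mu)^{1-y}
             = \exp\Bigl\{ y \log\tfrac{\mu}{1-\mu} + \log(1-\mu) \Bigr\}
% Natural parameter: the log-odds (logit)
\eta = \log\tfrac{\mu}{1-\mu}
\;\;\Longrightarrow\;\;
\mu = \frac{1}{1 + e^{-\eta}} \quad \text{(the logistic function)}
```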
Two sides of the same coin
Generative vs. Discriminative (“diagnostic”)
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
Inference in the generative model
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
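The key step, consistent with Jordan's tutorial: by Bayes' rule the class posterior is a logistic function of the log-odds:

```latex
P(C_1 \mid x)
= \frac{p(x \mid C_1)\, P(C_1)}{p(x \mid C_1)\, P(C_1) + p(x \mid C_0)\, P(C_0)}
= \frac{1}{1 + e^{-a}},
\qquad
a = \log \frac{p(x \mid C_1)\, P(C_1)}{p(x \mid C_0)\, P(C_0)}
```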
Logistic function is a natural choice for Gaussian class-conditional densities
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
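For Gaussian class-conditional densities with a shared covariance Σ, the log-odds a(x) is linear in x, so the posterior is exactly a logistic function of a linear function:

```latex
p(x \mid C_k) = \mathcal{N}(x;\, \mu_k, \Sigma)
\;\;\Longrightarrow\;\;
a(x) = w^{\top} x + b,
\quad
w = \Sigma^{-1}(\mu_1 - \mu_0),
\quad
b = -\tfrac{1}{2}\mu_1^{\top}\Sigma^{-1}\mu_1
    + \tfrac{1}{2}\mu_0^{\top}\Sigma^{-1}\mu_0
    + \log\tfrac{P(C_1)}{P(C_0)}
```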
Looks familiar?
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
Generative vs. discriminative
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
Cross-entropy loss function for the ML estimate of θ
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
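For labels y_i ∈ {0, 1} and p_θ(x) = σ(θᵀx), maximizing the Bernoulli likelihood is equivalent to minimizing the cross-entropy:

```latex
\ell(\theta)
= -\sum_{i=1}^{n} \Bigl[ y_i \log p_\theta(x_i)
  + (1 - y_i)\log\bigl(1 - p_\theta(x_i)\bigr) \Bigr],
\qquad
p_\theta(x) = \sigma(\theta^{\top} x) = \frac{1}{1 + e^{-\theta^{\top} x}}
```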
Optimize for the parameters
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
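There is no closed form, so the loss is minimized iteratively. A minimal batch-gradient-descent sketch (Newton/IRLS is the classical alternative); function names and step size are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=5000):
    """Batch gradient descent on the cross-entropy loss.
    Gradient of the mean loss: (1/n) X^T (sigmoid(X theta) - y)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= lr * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta
```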
Other issues
• What about nonlinear discriminant functions?
• How to regularize?
• Shrink coefficients
• Reduce features
Coefficient shrinkage using ridge regression
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
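Ridge adds an L2 penalty λ‖β‖² to the squared error, which keeps the solution closed-form. A minimal sketch (names are illustrative):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate: (X^T X + lam * I)^{-1} X^T y.
    Shrinks all coefficients toward zero, but almost never to exactly zero."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```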
Subset selection
• Set the coefficients with the smallest absolute values to zero
LASSO both selects and shrinks
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
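The L1 penalty turns the coordinate-wise update into a soft-thresholding step, which is what produces exact zeros. A minimal cyclic coordinate-descent sketch (assumes the columns of X are scaled to unit L2 norm; this is a common modern solver, not the paper's original algorithm):

```python
import numpy as np

def soft_threshold(z, t):
    # Shrink toward zero by t; values inside [-t, t] become exactly zero
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    beta = np.zeros(X.shape[1])
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding j
            beta[j] = soft_threshold(X[:, j] @ r, lam)
    return beta
```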
Level sets of Lq norm of coefficients
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
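Formally (standard notation, not verbatim from the paper), these level sets are the constraint regions of the equivalent constrained problem:

```latex
\hat{\beta} = \arg\min_{\beta}\; \|y - X\beta\|_2^2
\quad \text{subject to} \quad \sum_{j} |\beta_j|^q \le t
% q = 1: diamond-shaped region (lasso); q = 2: spherical region (ridge);
% q < 1: non-convex regions with even sharper corners on the axes
```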
Geometry of Lasso and Ridge
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Coefficient shrinkage in orthonormal case
Lasso vs. garotte
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
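In the orthonormal-design case each method acts coordinate-wise on the least-squares estimates β̂_j; the forms below follow Tibshirani (1996):

```latex
\text{Ridge:}\;\; \hat{\beta}_j / (1 + \lambda)
\qquad
\text{Lasso:}\;\; \operatorname{sign}(\hat{\beta}_j)\,
                  \bigl(|\hat{\beta}_j| - \lambda\bigr)_{+}
\qquad
\text{Garotte:}\;\; \bigl(1 - \lambda / \hat{\beta}_j^{2}\bigr)_{+}\,\hat{\beta}_j
\qquad
\text{Subset selection (hard):}\;\; \hat{\beta}_j\, I\bigl(|\hat{\beta}_j| > \lambda\bigr)
```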
Can you argue the case for LASSO's coefficient shrinkage pattern in the orthonormal case?
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Lasso can flip the signs of least-squares coefficients for d > 2
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
What about non-orthonormal case?
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Lasso coefficient paths with decreasing λ
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
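To reproduce such a path plot, a sketch using scikit-learn's lasso_path (data and coefficient values are illustrative; matplotlib is used only for display):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path

# Illustrative data; any regression design works here
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
w = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.5])
y = X @ w + 0.5 * rng.normal(size=100)

# coefs has shape (n_features, n_alphas); alphas decrease along the path
alphas, coefs, _ = lasso_path(X, y)
for path in coefs:
    plt.plot(-np.log10(alphas), path)  # coefficients enter one by one
plt.xlabel("-log10(alpha)  (penalty decreasing to the right)")
plt.ylabel("coefficient")
plt.show()
```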
Compare to the coefficient shrinkage path of ridge
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Smoothly Clipped Absolute Deviation
(SCAD) Penalty
Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
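The SCAD penalty is usually specified through its derivative, which is flat like the lasso for small coefficients and decays to zero for large ones, leaving large coefficients unpenalized (Fan & Li, 2001, who suggest a = 3.7):

```latex
p'_{\lambda}(\beta) = \lambda \left\{ I(\beta \le \lambda)
  + \frac{(a\lambda - \beta)_{+}}{(a - 1)\lambda}\, I(\beta > \lambda) \right\},
\qquad \beta > 0,\; a > 2
```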
Thresholding in three cases: no alteration of large coefficients by SCAD and hard thresholding
Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
Motivation for elastic net
• The p >> n problem and grouped selection
• Microarrays: p > 10,000 and n < 100.
• For those genes sharing the same biological "pathway", the correlations among them can be high.
• LASSO limitations
• If p > n, the lasso selects at most n variables before it saturates; the number of selected variables is bounded by the sample size.
• Grouped variables: the lasso fails to do grouped selection. It tends to select one variable from a group and ignore the others.
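The elastic net (Zou & Hastie, 2005) addresses both limitations by combining the L1 and L2 penalties. A hedged sketch using scikit-learn's ElasticNet (parameter values are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# p >> n setting, as in the microarray motivation above
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))
beta_true = np.zeros(200)
beta_true[:5] = 2.0
y = X @ beta_true + 0.1 * rng.normal(size=50)

# Penalty: alpha * (l1_ratio * ||b||_1 + (1 - l1_ratio)/2 * ||b||_2^2)
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("nonzero coefficients:", np.sum(model.coef_ != 0))  # can exceed n,
# unlike the plain lasso, and correlated predictors tend to enter together
```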