
ADVANCED MACHINE LEARNING
Module 1: Generalized Regression and HDLSS problems
Instructor: Amit Sethi
Co-developer: Neeraj Kumar
TAs: Gaurav Yadav, Niladri Bhattacharya
Page: AdvancedMachineLearning.weebly.com
IITG Course No: EE 622
Module objectives
• Understand linear and generalized linear regression
• Understand logistic regression as a special GLM
• Appreciate the link between logistic regression and linear discriminants
• Understand the role of various penalties in the HDLSS case
Linear regression

Source: Wikipedia
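The slide's figure is not reproduced here. As a minimal, hypothetical sketch of an ordinary least squares fit (the data and variable names below are assumptions, not taken from the slide):

```python
import numpy as np

# Toy data: an intercept column plus one feature (illustrative only)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = 2.0 + 3.0 * X[:, 1] + rng.normal(scale=0.5, size=50)

# Ordinary least squares: beta = argmin ||y - X beta||_2^2,
# solved here with a numerically stable least-squares routine
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # roughly [2, 3]
```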
Solutions to linear regression

Generalized least squares

Source: Wikipedia
Minimizing the Lp norm of error

Source: Wikipedia
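As a rough numerical sketch of the same idea (hypothetical data; scipy is an assumed tool, and p = 2 recovers least squares while p = 1 gives least absolute deviations):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = 1.0 + 2.0 * X[:, 1] + rng.standard_t(df=2, size=100)  # heavy-tailed noise

def lp_loss(beta, p):
    # Sum of |residual|^p: the objective whose minimizer is the Lp-norm fit
    return np.sum(np.abs(y - X @ beta) ** p)

for p in (1.0, 1.5, 2.0):
    res = minimize(lp_loss, x0=np.zeros(2), args=(p,), method="Nelder-Mead")
    print(p, res.x)  # more robust fits for small p, least squares at p = 2
```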
Logistic regression as a GLM
• GLM

• Exponential family

Sources: Wikipedia and Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
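The slide's equations are not reproduced; as a standard reminder of the forms involved (the notation is assumed, not copied from the slide):

```latex
% Exponential family in natural-parameter form
p(y;\theta) = h(y)\,\exp\!\big(\theta^{\top} T(y) - A(\theta)\big)

% A GLM couples the mean to a linear predictor through a link function g
\mathbb{E}[y \mid x] = \mu, \qquad g(\mu) = w^{\top} x
```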
Bernoulli distribution → logistic function

Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
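One way to see the arrow in this slide's title (a standard derivation, written here in assumed notation): writing the Bernoulli likelihood in exponential-family form exposes the log-odds as the natural parameter, and inverting it gives the logistic function.

```latex
p(y;\mu) = \mu^{y}(1-\mu)^{1-y}
         = \exp\!\Big( y \log\tfrac{\mu}{1-\mu} + \log(1-\mu) \Big)
\quad\Rightarrow\quad
\theta = \log\frac{\mu}{1-\mu},
\qquad
\mu = \frac{1}{1+e^{-\theta}}
```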
Two sides of the same coin
Generative vs. Discriminative (“diagnostic”)

Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
Inference in the generative model

Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
Logistic function is a natural choice for
Gaussian class conditional densities

Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
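A brief sketch of the claim (a standard result, notation assumed): with two Gaussian class-conditional densities sharing a covariance Σ, Bayes' rule yields a posterior that is exactly a logistic function of a linear score.

```latex
P(C_1 \mid x)
= \frac{\pi_1 \,\mathcal{N}(x;\mu_1,\Sigma)}{\pi_1 \,\mathcal{N}(x;\mu_1,\Sigma) + \pi_0 \,\mathcal{N}(x;\mu_0,\Sigma)}
= \frac{1}{1+\exp\!\big(-(w^{\top}x + b)\big)},
\qquad
w = \Sigma^{-1}(\mu_1-\mu_0),
\quad
b = -\tfrac{1}{2}\mu_1^{\top}\Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_0^{\top}\Sigma^{-1}\mu_0 + \log\frac{\pi_1}{\pi_0}
```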
Looks familiar?

Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
Generative vs. discriminative
Generative
• Belief network A is more modular
• Class-conditional densities are likely to be local, characteristic functions of the objects being classified, invariant to the nature and number of the other classes
• More "natural": deciding what kind of object to generate and then generating it from a recipe
• More efficient to estimate the model, if correct

Discriminative
• More robust
• Don't need a precise model specification, so long as it is from the exponential family
• Requires fewer parameters: O(n) as opposed to O(n²)

Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
Cross-entropy loss function for maximum-likelihood estimation of θ

Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
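For reference (the slide's own equation is not reproduced): with μᵢ = σ(θᵀxᵢ), the negative Bernoulli log-likelihood is the cross-entropy loss, so maximizing the likelihood in θ is the same as minimizing

```latex
\ell(\theta) = -\sum_{i=1}^{N}\Big[\, y_i \log \mu_i + (1-y_i)\log(1-\mu_i) \,\Big],
\qquad
\mu_i = \sigma(\theta^{\top} x_i) = \frac{1}{1+e^{-\theta^{\top} x_i}}
```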
Optimize for the parameters

Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
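A minimal sketch of one such optimization, assuming plain gradient descent on the cross-entropy loss (the slide may instead use Newton/IRLS; the function and variable names here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Gradient descent on the mean cross-entropy loss for logistic regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = sigmoid(X @ theta)           # predicted probabilities
        grad = X.T @ (mu - y) / len(y)    # gradient of the mean cross entropy
        theta -= lr * grad
    return theta

# Tiny usage example on synthetic data
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)
print(fit_logistic(X, y))
```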
Other issues
• What about nonlinear discriminant functions?
• Is logistic nonlinearity required in hidden layers of neural networks?
Regularization in regression
• Why regularize?

• Reduce variance, at the cost of bias

• Increase test (validation) accuracy

• Get interpretable models

• How to regularize?

• Shrink coefficients

• Reduce features
Coefficient shrinkage using ridge

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
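A small, hypothetical sketch of ridge shrinkage (scikit-learn's alpha plays the role of the penalty weight λ; the data below are made up):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=40)

lam = 1.0
# Closed form for ridge without an intercept: (X^T X + lam I)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
beta_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_closed, beta_ridge, atol=1e-6))  # same estimator
```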
Subset selection
• Set the coefficients with lowest absolute value to zero
LASSO both selects and shrinks

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
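A short illustrative run (hypothetical data; note that scikit-learn's Lasso scales the squared error by 1/(2n), so its alpha is not numerically identical to the λ in the paper):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 8))
true_beta = np.array([4.0, 0.0, 0.0, -3.0, 0.0, 0.0, 2.0, 0.0])
y = X @ true_beta + rng.normal(scale=0.5, size=60)

lasso = Lasso(alpha=0.2).fit(X, y)
print(lasso.coef_)  # several coefficients are exactly zero, the rest are shrunk
```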
Level sets of Lq norm of coefficients

Which one is ridge? Subset selection? Lasso?

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Geometry of Lasso and Ridge

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Coefficient shrinkage in orthonormal case

Subset selection Ridge

Lasso Garotte
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
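For reference, in the orthonormal-design case (XᵀX = I) each penalized estimate is a simple function of the least-squares coefficient; up to the exact parametrization of the tuning constants these are (as in Tibshirani, 1996):

```latex
\text{Best subset (hard threshold):}\quad \hat\beta_j \,\mathbb{I}\big(|\hat\beta_j| > \lambda\big)
\qquad
\text{Ridge:}\quad \frac{\hat\beta_j}{1+\lambda}
\\[4pt]
\text{Lasso (soft threshold):}\quad \operatorname{sign}(\hat\beta_j)\big(|\hat\beta_j|-\lambda\big)_{+}
\qquad
\text{Garotte:}\quad \Big(1-\frac{\lambda}{\hat\beta_j^{2}}\Big)_{+}\hat\beta_j
```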
Can you argue the case for LASSO's coefficient shrinkage pattern in the orthonormal case?

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Lasso can flip the signs of least-squares coefficients for d > 2

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
What about non-orthonormal case?

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Lasso coefficient paths with decreasing λ

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Compare to the coefficient shrinkage paths of ridge

Source: Sci-kit learn tutorial
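A small sketch comparing the two paths numerically (hypothetical data; scikit-learn's lasso_path and repeated Ridge fits are assumed tools, not the slide's code):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, lasso_path

X, y = make_regression(n_samples=50, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Lasso path: coefficients hit exactly zero as the penalty grows
alphas_lasso, coefs_lasso, _ = lasso_path(X, y)

# Ridge path: coefficients shrink smoothly toward zero but stay nonzero
alphas_ridge = np.logspace(-2, 4, 50)
coefs_ridge = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in alphas_ridge])

print("lasso zeros:", (coefs_lasso == 0).mean(),
      "ridge zeros:", (coefs_ridge == 0).mean())
```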


Lasso as Bayes estimate

Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
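As a one-line reminder of the correspondence (a standard result): with Gaussian noise and independent Laplace (double-exponential) priors on the coefficients, the posterior mode is exactly the lasso estimate,

```latex
p(\beta_j) \propto \exp(-\tau |\beta_j|)
\;\Longrightarrow\;
\arg\max_{\beta}\, p(\beta \mid y)
= \arg\min_{\beta}\, \|y - X\beta\|_2^{2} + \lambda \|\beta\|_1
```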
Smoothly Clipped Absolute Deviation
(SCAD) Penalty

Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
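For reference (from Fan and Li, 2001, with a > 2 and a = 3.7 suggested; the exact notation here is an assumption), the SCAD penalty is defined through its derivative, which stays constant near zero and decays to zero for large coefficients:

```latex
p'_{\lambda}(\beta) = \lambda \left\{ \mathbb{I}(\beta \le \lambda)
  + \frac{(a\lambda - \beta)_{+}}{(a-1)\lambda}\, \mathbb{I}(\beta > \lambda) \right\},
  \qquad \beta > 0
```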
Thresholding in three cases: no alteration of large coefficients by SCAD and hard thresholding

Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
Motivation for elastic net
• The p >> n problem and grouped selection
• Microarrays: p > 10,000 and n < 100.
• For those genes sharing the same biological “pathway”,
the correlations among them can be high.
• LASSO limitations
• If p > n, the lasso selects at most n variables before it saturates; the number of selected variables is bounded by the sample size.
• Grouped variables: the lasso fails to do grouped selection. It tends to select one variable from a group and ignore the others.

Source: Elastic net, by Zou and Hastie


Elastic net: use both L1 and L2 penalties

Source: Elastic net, by Zou and Hastie
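The penalty itself (not reproduced on the slide above) combines the two norms; in the naive elastic-net form of Zou and Hastie it reads, with λ₁, λ₂ ≥ 0:

```latex
\hat\beta = \arg\min_{\beta}\; \|y - X\beta\|_2^{2}
  + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^{2}
```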


Geometry of elastic net

Source: Elastic net, by Zou and Hastie


Elastic net selects correlated variables as a "group"

Source: Elastic net, by Zou and Hastie


Elastic net selects correlated variables as a "group" and stabilizes the coefficient paths

Source: Elastic net, by Zou and Hastie


Why does the L2 penalty keep the coefficients of a group together?
• Try to think of an example with correlated variables (one such sketch follows below)
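One possible sketch (hypothetical data; the comments describe the usual grouping-effect behaviour, not a guarantee for every run): three nearly identical predictors stand in for genes on one pathway, and the L2 part of the elastic net encourages them to share the weight that the lasso tends to concentrate on a single column.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(5)
z = rng.normal(size=200)
# Three highly correlated predictors built from the same latent signal
X = np.column_stack([z + 0.1 * rng.normal(size=200) for _ in range(3)])
y = 3.0 * z + rng.normal(scale=0.5, size=200)

print(Lasso(alpha=0.1).fit(X, y).coef_)                      # weight tends to concentrate
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)   # weight tends to spread over the group
```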
Summary
• General
• Linear regression is a model with good mathematical properties
• Using a link function and iterative optimization, it can be extended to models from the exponential family
• Logistic regression is the natural, Bayes-optimal choice when the class-conditional densities come from exponential families
• The dispersion parameters of the two classes have to be the same
• Regularization and variable elimination in HDLSS
problems
• GLMs can be penalized for regularization
• Ridge penalty only shrinks the coefficients
• LASSO penalty selects a subset and produces constant shrinkage
• SCAD penalty selects a subset without altering large coefficients
• Elastic net keeps correlated variables together, while behaving like LASSO
