Notes 3
[Diagram: a learning algorithm produces the model (h), which maps inputs x to outputs y]
Model: $y = h_\theta(x) + \epsilon = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n + \epsilon = \theta^T x + \epsilon$
$\theta$ - parameters, $x$ - inputs
How to compute $\theta$? One option is to minimize the difference between $h_\theta(x)$ and $y$
Minimize the least-squares cost:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2 \qquad (1)$$
Data
i     y        x1        x2        ...   xn
1     y^(1)    x^(11)    x^(12)    ...   x^(1n)
...   ...      ...       ...       ...   ...
m     y^(m)    x^(m1)    x^(m2)    ...   x^(mn)
Let
$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \qquad x^{(i)} = \begin{bmatrix} x^{(i1)} \\ x^{(i2)} \\ \vdots \\ x^{(in)} \end{bmatrix}, \qquad Y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \qquad (2)$$
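As a quick illustration (not part of the original notes), the $\theta$ that minimizes $J(\theta)$ in Eq. (1) can be computed from the $X$ and $Y$ of Eq. (2) by solving the normal equations $(X^T X)\theta = X^T Y$; the function name and the synthetic data below are only illustrative.

```python
import numpy as np

def fit_least_squares(X, Y):
    """Return theta minimizing J(theta) = 0.5 * sum_i (theta^T x^(i) - y^(i))^2.

    Solves the normal equations (X^T X) theta = X^T Y rather than forming an
    explicit inverse, which is numerically preferable.
    """
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Illustrative usage with a synthetic data set (m = 100 examples, n = 3 inputs)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])  # prepend x_0 = 1 for theta_0
theta_true = np.array([0.5, 1.0, -2.0, 3.0])
Y = X @ theta_true + 0.1 * rng.normal(size=100)                  # y = theta^T x + noise
print(fit_least_squares(X, Y))                                   # close to theta_true
```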
Let us assume that the outputs and the inputs are related via $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$, where the noise terms $\epsilon^{(i)}$ are i.i.d. Gaussian, $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$.
MLE
$$\ell(\theta) = \ln \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \qquad (16)$$
$$= \underbrace{m \ln \frac{1}{\sqrt{2\pi}\,\sigma}}_{\text{const}} - \underbrace{\frac{1}{\sigma^2}}_{\text{scaling}} \cdot \frac{1}{2}\sum_{i=1}^{m}\bigl(y^{(i)} - \theta^T x^{(i)}\bigr)^2 \qquad (17)$$
Maximizing $\ell(\theta)$ is therefore the same as minimizing the least-squares cost $J(\theta)$ in Eq. (1).
Prediction: Output $\theta^T x$
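As a sanity check (not in the original notes), the sketch below numerically maximizes the log-likelihood of Eq. (17) by minimizing its negative with scipy and confirms it yields the same $\theta$ as the least-squares solution; the data and the assumed noise level $\sigma$ are synthetic and purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data: y = theta^T x + Gaussian noise (values are illustrative only)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(60), rng.normal(size=(60, 2))])
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true + rng.normal(scale=0.5, size=60)

sigma = 0.5  # noise standard deviation, assumed known here

def neg_log_likelihood(theta):
    # Negative of Eq. (17) with the constant term dropped:
    # (1 / sigma^2) * (1/2) * sum_i (y^(i) - theta^T x^(i))^2
    return 0.5 / sigma**2 * np.sum((y - X @ theta) ** 2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)          # least-squares estimate
print(np.allclose(theta_mle, theta_ls, atol=1e-3))    # True: same minimizer
```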
LWR (Locally Weighted Regression)
1 An online algorithm
2 Training: Fit $\theta$ to minimize $\sum_i w^{(i)}\bigl(y^{(i)} - \theta^T x^{(i)}\bigr)^2$
3 Prediction: Output $\theta^T x$
$w^{(i)}$ - non-negative weights
If $w^{(i)}$ is large, the penalization of $(y^{(i)} - \theta^T x^{(i)})^2$ is high; if $w^{(i)}$ is small, the penalization of $(y^{(i)} - \theta^T x^{(i)})^2$ is small
A standard choice: $w^{(i)} = \exp\!\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$, at a particular query point $x$
If $|x^{(i)} - x|$ is small, $w^{(i)} \approx 1$; if $|x^{(i)} - x|$ is large, $w^{(i)} \approx 0$
$\tau$ - bandwidth parameter, which decides how quickly the weight of a training example falls off with the distance of its $x^{(i)}$ from the query point $x$
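A minimal sketch of LWR prediction, assuming the standard weighted least-squares solution $\theta = (X^T W X)^{-1} X^T W y$ with $W = \mathrm{diag}(w^{(i)})$; the function name lwr_predict and the 1-D data set are illustrative, not taken from the notes.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau):
    """Locally weighted regression prediction at a single query point.

    Weights w^(i) = exp(-||x^(i) - x_query||^2 / (2 tau^2)); theta solves the
    weighted normal equations (X^T W X) theta = X^T W y, and the prediction
    is theta^T x_query.
    """
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau**2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Illustrative usage on a 1-D nonlinear data set (intercept column included)
rng = np.random.default_rng(0)
x1 = np.linspace(0, 3, 40)
X = np.column_stack([np.ones_like(x1), x1])
y = np.sin(x1) + 0.1 * rng.normal(size=x1.size)
print(lwr_predict(X, y, np.array([1.0, 1.5]), tau=0.3))  # roughly sin(1.5)
```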
Parametric and non-parametric models
Linear regression is parametric: once $\theta$ is fit, the training data can be discarded. LWR is non-parametric: the entire training set must be kept around to answer each query.
GLM (Generalized Linear Model)
$y\,|\,x;\theta \sim$ exponential family
Given $x$, our goal is to predict the expected value of $T(y)$ given $x$, that is, $h_\theta(x) = E[y\,|\,x]$. E.g. logistic regression: $p(y\,|\,x;\theta) = (h_\theta(x))^{y}(1 - h_\theta(x))^{1-y}$
The natural parameter $\eta$ and the inputs $x$ are related linearly: $\eta = \theta^T x$
Linear regression: $y = \theta^T x + \epsilon$
For $y\,|\,x;\theta \sim \mathcal{N}(\mu = \theta^T x,\ \sigma^2)$, which belongs to the exponential family, $h_\theta(x) = E[y\,|\,x] = \theta^T x$
In $\theta^T x$, the inputs enter linearly with respect to the parameters
Logistic regression: $h_\theta(x) = g(\theta^T x) = \frac{1}{1+\exp(-\theta^T x)}$ - success or failure - Bernoulli
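A minimal sketch of fitting logistic regression, assuming batch gradient ascent on the Bernoulli log-likelihood (whose gradient is $X^T(y - h)$); the step size, iteration count, and synthetic labels are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iter=5000):
    """Batch gradient ascent on the Bernoulli log-likelihood
    sum_i [ y^(i) log h(x^(i)) + (1 - y^(i)) log(1 - h(x^(i))) ]."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        h = sigmoid(X @ theta)                 # h_theta(x) = g(theta^T x)
        theta += alpha * X.T @ (y - h) / len(y)
    return theta

# Illustrative usage: success/failure labels from a synthetic linear rule
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (X @ np.array([-0.5, 2.0, -1.0]) + rng.normal(size=200) > 0).astype(float)
theta_hat = fit_logistic(X, y)
print(sigmoid(X @ theta_hat)[:5])  # predicted probabilities for first 5 examples
```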
There are two reasons why we are often not satisfied with the
least squares estimates
1 prediction accuracy: the least squares estimates often have low
bias but large variance. Prediction accuracy can sometimes be
improved by shrinking or setting some coefficients to zero. By
doing so we sacrifice a little bit of bias to reduce the variance of
the predicted values, and hence may improve the overall
prediction accuracy.
2 interpretation: with a large number of predictors, we often
would like to determine a smaller subset that exhibit the
strongest effects. In order to get the "big picture", we are willing
to sacrifice some of the small details.
$L_2$ penalty
Ridge regression
When there are many correlated variables in a linear regression
model, their coefficients can become poorly determined and
exhibit high variance.
A wildly large positive coefficient on one variable can be
canceled by a similarly large negative coefficient on its correlated
variable. By imposing a size constraint on the coefficients, this
problem is alleviated.
Ridge criterion: $\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\right\}$
Notice that the intercept $\beta_0$ has been left out of the penalty term, so that the procedure does not depend on the origin chosen for $y$
$\hat{\beta}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$, where $I$ is the $p \times p$ identity matrix
The solution adds a positive constant to the diagonal of $X^T X$ before inversion. This makes the problem nonsingular, even if $X^T X$ is not of full rank, and was the main motivation for ridge regression
Adding $\lambda I$ lifts $X^T X$ up from singularity, hence the name
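A minimal sketch of the ridge solution $\hat{\beta}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$, assuming inputs and outputs are centered so that the unpenalized intercept is just the mean of $y$; the nearly collinear synthetic data illustrates the large-opposite-coefficients problem described above. Function names and data are illustrative only.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Ridge solution beta = (X^T X + lam * I)^{-1} X^T y on centered data.

    Centering excludes the intercept from the penalty; with centered inputs
    the intercept is mean(y), and predictions use beta0 + (x - x_mean) @ beta.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y.mean()
    return beta0, beta

# Illustrative usage with two highly correlated predictors
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=100)])  # nearly collinear
y = 3.0 * x1 + rng.normal(size=100)
print(fit_ridge(X, y, lam=0.0)[1])   # typically large coefficients of opposite sign
print(fit_ridge(X, y, lam=10.0)[1])  # shrunken, stable coefficients
```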
Least Absolute Shrinkage and Selection
Operator (LASSO)
The lasso regression is a shrinkage method like ridge, with subtle
but important differences.
Lasso regression:
$$\hat{\beta}^{\text{Lasso}} = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2\right\} \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le t$$
[Figure: Lasso vs. Ridge]
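Since the lasso has no closed-form solution, the sketch below uses coordinate descent with soft-thresholding on the equivalent penalized form $\tfrac{1}{2}\|y - X\beta\|^2 + \lambda\sum_j|\beta_j|$ (for a suitable $\lambda$ corresponding to the bound $t$); the function names and data are illustrative only, not from the notes.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator, the key ingredient of the L1 penalty."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Coordinate descent for the penalized (Lagrangian) form of the lasso:
    minimize 0.5 * ||y - X beta||^2 + lam * sum_j |beta_j|
    (intercept omitted, so X and y are assumed centered)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with feature j removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho_j = X[:, j] @ r_j
            beta[j] = soft_threshold(rho_j, lam) / (X[:, j] @ X[:, j])
    return beta

# Illustrative usage: only the first of three predictors is truly active
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X -= X.mean(axis=0)
y = 3.0 * X[:, 0] + rng.normal(size=100)
y -= y.mean()
print(lasso_coordinate_descent(X, y, lam=15.0))  # irrelevant coefficients driven to 0
```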