M02 Logistic Regression
◼ So $\Delta\theta = -\eta \dfrac{\partial J(\theta)}{\partial \theta} = \dfrac{\eta}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}$
◼ Note the update formula above looks the SAME as the formula
for gradient descent for linear regression – but the
$h_\theta(x^{(i)}) = \mathrm{sigmoid}(\theta^T x^{(i)})$ here is different from the
$h_\theta(x^{(i)}) = \theta^T x^{(i)}$ in linear regression!
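As an illustration (not from the slides), below is a minimal NumPy sketch of this update rule; the names sigmoid, gradient_descent_step, and eta are chosen here for clarity.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(theta, X, y, eta):
    """One gradient descent update of theta for logistic regression.

    X: (m, n) input matrix (with a leading column of ones for theta_0),
    y: (m,) vector of 0/1 labels, eta: learning rate.
    Implements delta_theta = (eta / m) * sum_i (y_i - h_theta(x_i)) * x_i.
    """
    m = X.shape[0]
    h = sigmoid(X @ theta)                  # h_theta(x_i) = sigmoid(theta^T x_i)
    delta_theta = (eta / m) * (X.T @ (y - h))
    return theta + delta_theta
```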
Regularization
◼ When we have too many input variables, the model
may be too complex, with a risk of overfitting
◼ To handle it: add a regularization term to loss:
$J(\theta) = \dfrac{-1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \dfrac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \qquad (\lambda \ge 0)$
Note that the regularization term starts from j=1.
The gradient would also be changed (for $j \ge 1$):
$\dfrac{\partial J(\theta)}{\partial \theta_j} = \dfrac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \dfrac{\lambda}{m} \theta_j$
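The following is one possible NumPy sketch of the regularized loss and gradient above, assuming theta[0] is the intercept $\theta_0$ (which, as noted, is not regularized); the function name and the small epsilon guard against log(0) are additions for illustration, not part of the slides.

```python
import numpy as np

def regularized_cost_and_gradient(theta, X, y, lam):
    """Cross-entropy loss with an L2 penalty, and its gradient.

    theta[0] is the intercept and is NOT regularized; lam is lambda >= 0.
    """
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))           # sigmoid(theta^T x_i)
    eps = 1e-12                                       # guard against log(0)
    cost = (-1.0 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    cost += (lam / (2 * m)) * np.sum(theta[1:] ** 2)  # penalty sum starts at j = 1

    grad = (1.0 / m) * (X.T @ (h - y))
    grad[1:] += (lam / m) * theta[1:]                 # no penalty term for theta_0
    return cost, grad
```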
Regularization
◼ In the regularization term $\dfrac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$,
we do NOT regularize $\theta_0$
◼ When $\lambda = 0$ or very small, regularization
has NO effect → may overfit
◼ When $\lambda$ is very big, to minimize $\dfrac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$,
the $\theta_j$ would be forced to take very small
values, so the decision boundary would not
be good → may underfit
Generalization to classification with
more than two classes
Assume we have 3 classes 𝐶1 , 𝐶2 , and 𝐶3.
We build 3 binary classifiers 𝑀1 , 𝑀2 , 𝑀3 with logistic
regression – using the one-vs-all approach:
1. Generate training data 𝐷1 from the original data D: label
all examples of 𝐶1 as positive and all other examples as
negative.
2. Apply logistic regression to 𝐷1 and build 𝑀1 .
Repeat the above steps (1) and (2) to build 𝑀2 and 𝑀3 .
For a new data point x, we get 3 probabilities 𝑃1 , 𝑃2 , 𝑃3 by applying
𝑀1 , 𝑀2 , 𝑀3 to the data x. Predict class 𝐶𝑗 , if 𝑃𝑗 is the maximum
among {𝑃1 , 𝑃2 , 𝑃3 }.
Clearly, the above method can be generalized to handle any k-class
problem (k > 2).
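Below is a rough sketch of this one-vs-all procedure; fit_binary stands in for any routine that trains a single binary logistic regression model (for instance, the gradient descent step sketched earlier), and all names here are illustrative rather than from the slides.

```python
import numpy as np

def train_one_vs_all(X, y, classes, fit_binary):
    """Train one binary logistic regression classifier per class.

    fit_binary(X, labels) is assumed to return a parameter vector theta.
    """
    models = {}
    for c in classes:
        labels = (y == c).astype(float)   # class c -> positive, all others -> negative
        models[c] = fit_binary(X, labels)
    return models

def predict_one_vs_all(models, x):
    """Predict the class whose classifier gives the highest probability for x."""
    def prob(theta):
        return 1.0 / (1.0 + np.exp(-(x @ theta)))
    return max(models, key=lambda c: prob(models[c]))
```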
Practical considerations
◼ Feature scaling should be applied when the
input variables have rather different value
scales, just like in multivariate linear
regression (a standardization sketch follows this list)
◼ Learning rate 𝜂 selection should also be done
carefully
◼ The regularization parameter $\lambda$ should also be
selected carefully
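As a rough illustration of the feature-scaling point above, the following standardization sketch scales each column to zero mean and unit variance; the function name and the handling of constant columns are assumptions made here, not part of the slides.

```python
import numpy as np

def standardize(X):
    """Scale each input variable to zero mean and unit variance.

    One common form of feature scaling; the same mean/std computed on the
    training data should be reused when scaling new data points.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0            # avoid division by zero for constant columns
    return (X - mu) / sigma, mu, sigma
```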