DDA3020 Lecture 06 Logistic Regression
Jicong Fan
School of Data Science, CUHK-SZ
3 Logistic regression
Two solutions:
Closed-form solution: $w^* = (X^\top X)^{-1} X^\top y$
Gradient descent: $w \leftarrow w - \alpha X^\top (Xw - y)$, repeated for multiple iterations until convergence
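As a quick illustration of these two solutions, here is a minimal NumPy sketch on synthetic data (the problem sizes, noise level, step size, and iteration count are illustrative assumptions, not part of the lecture):

```python
import numpy as np

# Synthetic regression data; sizes and noise level are arbitrary choices.
rng = np.random.default_rng(0)
m, d = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, d))])  # first column absorbs the bias
w_true = rng.normal(size=d + 1)
y = X @ w_true + 0.01 * rng.normal(size=m)

# Closed-form solution: w* = (X^T X)^{-1} X^T y, computed via a linear solve.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: w <- w - alpha * X^T (X w - y), repeated until convergence.
alpha = 1.0 / np.linalg.norm(X.T @ X, 2)   # step size below 2 / lambda_max guarantees convergence
w_gd = np.zeros(d + 1)
for _ in range(5000):
    w_gd -= alpha * (X.T @ (X @ w_gd - y))

print(np.allclose(w_closed, w_gd, atol=1e-4))  # the two solutions should agree
```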
$p(y \mid x, w) = \mathcal{N}(\phi(x)^\top w, \sigma^2)$,   (8)
where
$\phi(x) = [1, x_1, \ldots, x_d, \ldots, x_i x_j, \ldots, x_i x_j x_k, \ldots]^\top$,
$w = [b, w_1, \ldots, w_d, \ldots, w_{ij}, \ldots, w_{ijk}, \ldots]^\top$.
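A minimal sketch of one way to construct such a polynomial feature map $\phi(x)$ in code (the helper name poly_features and the degree cutoff are illustrative assumptions):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(x, degree=3):
    """Map x = [x_1, ..., x_d] to [1, x_i, x_i*x_j, x_i*x_j*x_k, ...] up to `degree`."""
    feats = [1.0]  # constant feature, paired with the bias b in w
    for k in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), k):
            feats.append(np.prod([x[i] for i in idx]))  # monomial x_{i1} * ... * x_{ik}
    return np.array(feats)

print(poly_features(np.array([2.0, 3.0]), degree=2))  # [1. 2. 3. 4. 6. 9.]
```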
Variants of linear regression
[Figure: contours in the parameter space $(u_1, u_2)$, marking the ML estimate, the MAP estimate, and the prior mean.]
3 Logistic regression
It seems that the simple threshold classifier built on linear regression works well on this classification task.
However, if there is a positive sample with a very large tumor size (see the plot above), what will happen?
The hypothesis function will change significantly, causing some positive samples to be misclassified as negative (not malignant). How can we handle this? By adjusting the threshold value, or by adopting robust linear regression.
A desired hypothesis function for this task should satisfy $f_{w,b}(x) \in [0, 1]$.
To this end, we introduce the following function (the sigmoid, or logistic, function):
$f_{w,b}(x) = g(w^\top x) \in [0, 1], \qquad g(z) = \dfrac{1}{1 + \exp(-z)}$
[Figure: plots of the sigmoid function $g(z)$ for $z \in [-10, 10]$, with values ranging from 0 to 1.]
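A minimal sketch reproducing such a curve (NumPy and Matplotlib assumed; the plotting range matches the figure above):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)); outputs always lie in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))
plt.xlabel("z")
plt.ylabel("g(z)")
plt.show()
```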
3 Logistic regression
Cross-entropy loss:
$\mathrm{cost}(y(x), f_{w,b}(x)) = \begin{cases} -\log(f_{w,b}(x)), & \text{if } y(x) = 1 \\ -\log(1 - f_{w,b}(x)), & \text{if } y(x) = 0 \end{cases}$
$w \leftarrow w - \alpha \nabla_w J(w)$, where
$\nabla_w J(w) = \dfrac{1}{m} \sum_{i=1}^{m} \big[ f_{w,b}(x_i) - y_i \big]\, x_i$
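A minimal vectorized sketch of this update (binary labels in {0, 1}; the function name, default step size, and iteration count are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iters=1000):
    """Gradient descent on the cross-entropy loss.

    X: (m, d+1) design matrix whose first column is all ones (it absorbs b).
    y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        f = sigmoid(X @ w)          # f_{w,b}(x_i) for all i at once
        grad = X.T @ (f - y) / m    # (1/m) * sum_i [f_{w,b}(x_i) - y_i] x_i
        w -= alpha * grad           # w <- w - alpha * grad_w J(w)
    return w
```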
Exercise: Suppose you are running a logistic regression model and want to monitor the learning procedure in order to find a suitable learning rate α. Which of the following is a reasonable way to make sure α is set properly and that gradient descent is running correctly?
Plot $J(w) = -\frac{1}{m}\sum_{i}^{m} (y_i - f_{w,b}(x_i))^2$ as a function of the number of iterations (i.e., the horizontal axis is the iteration number) and make sure $J(w)$ is decreasing on every iteration.
Plot $J(w) = -\frac{1}{m}\sum_{i}^{m} \big[ y_i \log(f_{w,b}(x_i)) + (1 - y_i) \log(1 - f_{w,b}(x_i)) \big]$ as a function of the number of iterations (i.e., the horizontal axis is the iteration number) and make sure $J(w)$ is decreasing on every iteration.
Plot J(w) as a function of w and make sure it is decreasing on every
iteration.
Plot J(w) as a function of w and make sure it is convex.
Softmax function:
$f^{(j)}_{W,b}(x) = \dfrac{\exp(w_j^\top x + b_j)}{\sum_{c=1}^{C} \exp(w_c^\top x + b_c)} = P(y = j \mid x; W, b)$,   (12)
$w_j \leftarrow w_j - \alpha \dfrac{\partial J(W)}{\partial w_j}$,
where
$\dfrac{\partial J(W)}{\partial w_j} = -\dfrac{1}{m} \sum_{i}^{m} \left[ \dfrac{I(y_i = j)}{f_{w_j,b_j}(x_i)} \cdot \dfrac{\partial f_{w_j,b_j}(x_i)}{\partial w_j} + \sum_{c=1,\, c \neq j}^{C} \dfrac{I(y_i = c)}{f_{w_c,b_c}(x_i)} \cdot \dfrac{\partial f_{w_c,b_c}(x_i)}{\partial w_j} \right]$
with
$\dfrac{\partial f_{w_j,b_j}(x_i)}{\partial w_j} = f_{w_j,b_j}(x_i) \cdot (1 - f_{w_j,b_j}(x_i)) \cdot x_i$,
$\dfrac{\partial f_{w_c,b_c}(x_i)}{\partial w_j} = -f_{w_j,b_j}(x_i) \cdot f_{w_c,b_c}(x_i) \cdot x_i \quad (c \neq j)$,
$\Longrightarrow \dfrac{\partial J(W)}{\partial w_j} = \dfrac{1}{m} \sum_{i}^{m} \big[ f_{w_j,b_j}(x_i) - I(y_i = j) \big]\, x_i$.   (14)
Note: $\{w_c\}_{c=1}^{C}$ should be updated in parallel, rather than sequentially.
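A minimal sketch of this softmax-regression update (the layout of W, the one-hot labels Y, and the function names are assumptions; the bias is absorbed into x via a leading column of ones):

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, with max-subtraction for numerical stability."""
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def softmax_regression_step(W, X, Y, alpha):
    """One gradient step; all columns w_j are updated in parallel.

    X: (m, d+1) with a leading column of ones (absorbs the biases b_j).
    W: (d+1, C), column j stacks [b_j; w_j].
    Y: (m, C) one-hot labels, Y[i, j] = I(y_i = j).
    Column j of the gradient is (1/m) * sum_i [f_{w_j,b_j}(x_i) - I(y_i = j)] x_i, as in Eq. (14).
    """
    m = X.shape[0]
    F = softmax(X @ W)           # F[i, j] = f^{(j)}_{W,b}(x_i)
    grad = X.T @ (F - Y) / m     # (d+1, C): column j is dJ/dw_j
    return W - alpha * grad
```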
3 Logistic regression
Overfitting: If we have too many features, the learned hypothesis may fit the training data very well (low bias) but fail to generalize to new examples.
Generally, there are two approaches to addressing the overfitting problem:
Reducing the number of features:
Feature selection
Dimensionality reduction (introduced in later lectures)
Regularization:
Keep all features, but reduce the magnitude of each parameter, so that each feature contributes only a small amount to predicting y
In the following, we will focus on the regularization-based approach.
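As a sketch of how regularization plugs into the logistic-regression gradient (the L2 penalty form, its strength lam, and the choice to leave the bias unpenalized are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_grad(w, X, y, lam):
    """Gradient of the cross-entropy loss plus an L2 penalty (lam / 2m) * ||w||^2.

    The first entry of w is the bias (first column of X is all ones) and is not penalized.
    """
    m = X.shape[0]
    grad = X.T @ (sigmoid(X @ w) - y) / m    # cross-entropy part, as before
    penalty = (lam / m) * w                  # shrinks every weight toward zero
    penalty[0] = 0.0                         # leave the bias term unregularized
    return grad + penalty
```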
3 Logistic regression
Then, we have
$P(y \mid x; w) = \begin{cases} \mu & \text{if } y = 1, \\ 1 - \mu & \text{if } y = 0. \end{cases}$
Thus, we obtain the compact form $P(y \mid x; w) = \mu^{y} (1 - \mu)^{1-y}$.
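Taking the negative log-likelihood of this Bernoulli model over the training set recovers the cross-entropy loss used earlier; a standard derivation, writing $\mu_i = f_{w,b}(x_i)$:

```latex
-\log \prod_{i=1}^{m} P(y_i \mid x_i; w)
  = -\sum_{i=1}^{m} \log \left( \mu_i^{\,y_i} (1-\mu_i)^{1-y_i} \right)
  = -\sum_{i=1}^{m} \Big[ y_i \log \mu_i + (1-y_i) \log (1-\mu_i) \Big].
```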
3 Logistic regression
Note: Each variant of linear/logistic regression can be derived from both the deterministic and the probabilistic perspectives.
Own reading: Both linear regression and logistic regression are special cases of generalized linear models. If interested, you can find more details in Section 4 of the book "Pattern Recognition and Machine Learning", Bishop, 2006.