
DDA3020 Machine Learning

Lecture 06 Logistic Regression

Jicong Fan
School of Data Science, CUHK-SZ

October 10/12, 2022

Outline

1 Review of last week

2 Classification and representation

3 Logistic regression

4 Regularized logistic regression

5 Probabilistic perspective of logistic regression

6 Summary: linear regression vs. logistic regression

1 Review of last week
Linear regression: deterministic perspective

Linear hypothesis function: $f_{w,b}(x) = x^\top w + b$, or simply $f_w(x) = x^\top w$ by concatenating $b$ and $w$ together and augmenting $x$ to $[1; x]$
Linear regression by minimizing the residual sum of squares (RSS):
$$w^* = \arg\min_w J(w), \quad \text{where } J(w) = \frac{1}{2}\sum_{i=1}^{m}(x_i^\top w - y_i)^2 = \frac{1}{2}\|Xw - y\|^2$$
Two solutions:
Closed-form solution: $w^* = (X^\top X)^{-1} X^\top y$
Gradient descent: $w \leftarrow w - \alpha X^\top(Xw - y)$, for multiple iterations until convergence
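To make the review concrete, here is a minimal NumPy sketch (not from the slides) of both solutions on synthetic data, with the bias folded into X as a leading column of ones:

```python
import numpy as np

# Toy data: m samples, 3 features, plus a leading column of ones for the bias.
rng = np.random.default_rng(0)
m = 100
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 3))])
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)

# Closed-form solution: w* = (X^T X)^{-1} X^T y (computed via a linear solve).
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: w <- w - alpha * X^T (X w - y), repeated until convergence.
w_gd = np.zeros(X.shape[1])
alpha = 0.002
for _ in range(3000):
    w_gd -= alpha * X.T @ (X @ w_gd - y)

print(w_closed)  # both estimates should be close to w_true
print(w_gd)
```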

Linear regression: probabilistic perspective
We assume that $y = w^\top x + e$, where $e \sim \mathcal{N}(0, \sigma^2)$ is called the observation noise or residual error.
Then $y$ is also a random variable, and its conditional probability is
$$p(y|x, w) = \mathcal{N}(w^\top x, \sigma^2)$$
Maximum log-likelihood estimation:
$$\begin{aligned}
w_{MLE} &= \arg\max_w \log L(w|D) = \arg\max_w \log\Big(\prod_{i=1}^{m} p(y_i|x_i, w)\Big) &\text{(1)}\\
&= \arg\max_w \sum_{i=1}^{m} \log p(y_i|x_i, w) = \arg\max_w \sum_{i=1}^{m} \log \mathcal{N}(w^\top x_i, \sigma^2) &\text{(2)}\\
&= \arg\max_w \; -\log\big(\sigma^m (2\pi)^{m/2}\big) - \frac{1}{2\sigma^2}\sum_{i=1}^{m}(y_i - w^\top x_i)^2 &\text{(3)}\\
&= \arg\min_w \frac{1}{2}\sum_{i=1}^{m}(y_i - w^\top x_i)^2. &\text{(4)}
\end{aligned}$$

Variants of linear regression
Ridge regression to avoid over-fitting, through MAP estimation:
$$w_{MAP} = \arg\max_w \sum_{i=1}^{m} \log p(y_i|x_i, w) + \log p(w) \tag{5}$$
$$= \arg\max_w \sum_{i=1}^{m} \log \mathcal{N}(w^\top x_i, \sigma^2) + \log \mathcal{N}(w|0, \tau^2 I) \tag{6}$$
$$\equiv \arg\min_w \sum_{i=1}^{m} (w^\top x_i - y_i)^2 + \lambda\|w\|_2^2. \tag{7}$$

Polynomial regression: a linear model with basis expansion $\phi(x)$:
$$f_{w,b}(x) = b + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij} x_i x_j + \sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{k=1}^{d} w_{ijk} x_i x_j x_k + \ldots = \phi(x)^\top w, \tag{8}$$
$$\phi(x) = [1, x_1, \ldots, x_d, \ldots, x_i x_j, \ldots, x_i x_j x_k, \ldots]^\top, \quad w = [b, w_1, \ldots, w_d, \ldots, w_{ij}, \ldots, w_{ijk}, \ldots]^\top.$$
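For reference, (7) also has a closed-form solution; a minimal sketch, assuming X again carries a leading column of ones (note that this simple version penalizes every coefficient, whereas the MAP derivation above leaves the bias unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y,
    the minimizer of sum_i (w^T x_i - y_i)^2 + lam * ||w||_2^2
    (up to how the bias column is treated)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```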
Variants of linear regression

Lasso regression to obtain a sparse model:
$$w_{MAP} = \arg\max_w \sum_{i=1}^{m} \log \mathcal{N}(w^\top x_i, \sigma^2) + \log \mathrm{Lap}(w|0, b) \tag{9}$$
$$= \arg\min_w \sum_{i=1}^{m} (w^\top x_i - y_i)^2 + \lambda\|w\|_1. \tag{10}$$

Robust regression for data with outliers:
$$w_{MLE} = \arg\min_w \sum_{i=1}^{m} |w^\top x_i - y_i| \tag{11}$$
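The lasso objective (10) has no closed-form solution; one common simple solver is proximal gradient descent (ISTA) with soft-thresholding. The sketch below is my own illustration under a conservative step-size choice; ISTA itself is not covered in these slides:

```python
import numpy as np

def lasso_ista(X, y, lam, alpha=None, iters=2000):
    """Proximal gradient (ISTA) sketch for the lasso objective
    sum_i (w^T x_i - y_i)^2 + lam * ||w||_1.
    alpha is the step size; by default 1 / (2 * largest eigenvalue of X^T X)."""
    if alpha is None:
        alpha = 1.0 / (2.0 * np.linalg.eigvalsh(X.T @ X).max())
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - alpha * 2.0 * X.T @ (X @ w - y)                     # gradient step on the squared loss
        w = np.sign(w) * np.maximum(np.abs(w) - alpha * lam, 0.0)   # soft-thresholding (prox of the L1 term)
    return w
```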

Summary of different linear regressions
Note that a uniform prior does not change the location of the mode; thus, MAP estimation with a uniform prior corresponds to MLE.

p(y|x, w)   p(w)       regression method
Gaussian    Uniform    Least squares
Gaussian    Gaussian   Ridge regression
Gaussian    Laplace    Lasso regression
Laplace     Uniform    Robust regression
Student     Uniform    Robust regression

Figure: ML estimate, MAP estimate, and prior mean in the (u1, u2) parameter plane.

2 Classification and representation
Classification

Classification: classifying input data into discrete states


Email filtering: spam / not spam?
Weather forecast: sunny / not sunny?
Tumor: malignant / benign?
The label y ∈ {0, 1}:
y = 0: negative class, e.g., not spam, not sunny, benign
y = 1: positive class, e.g., spam, sunny, malignant

Threshold classifier with linear regression

We assume a linear hypothesis function $f_{w,b}(x) = x^\top w + b$

A simple threshold classifier with this hypothesis function is:
If $f_{w,b}(x) > 0.5$, then $y = 1$, i.e., malignant tumor
If $f_{w,b}(x) < 0.5$, then $y = 0$, i.e., benign tumor

Threshold classifier with linear regression

It seems that the simple threshold classifier with linear regression works well on this classification task.
However, if there is a positive sample with a very large tumor size (plot above), what will happen?
The hypothesis function will change significantly, causing some positive samples to be misclassified as negative (not malignant). How to handle it? Adjust the threshold value, or adopt robust linear regression.

Threshold classifier with linear regression

But there is still something weird.

Our goal is to predict $y \in \{0, 1\}$, but the prediction could be $f_{w,b}(x) > 1$ or $f_{w,b}(x) < 0$, which does not serve our purpose.
A desired hypothesis function for this task should satisfy $f_{w,b}(x) \in [0, 1]$.

Threshold classifier with linear regression

Exercise: Which statements are true?


If linear regression doesn’t work well like the above example, feature scaling
may help
If the training set satisfies $y_i \in [0, 1]$ for all points $(x_i, y_i)$, then the
linear hypothesis function satisfies $f_{w,b}(x) \in [0, 1]$ for all values of $x_i$
None of the above is correct

Hypothesis representation

A desired hypothesis function for this task should satisfy $f_{w,b}(x) \in [0, 1]$
To this end, we introduce a new function, as follows:
$$f_{w,b}(x) = g(w^\top x) \in [0, 1], \qquad g(z) = \frac{1}{1 + \exp(-z)},$$
where $g(\cdot)$ is called the sigmoid function or logistic function (shown below)


Figure: the sigmoid function $g(z)$, which increases monotonically from near 0 to near 1 over $z \in [-10, 10]$.
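In code, the hypothesis is a one-liner; a minimal NumPy sketch (mine, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), mapping R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(X, w, b):
    """f_{w,b}(x) = g(w^T x + b), interpreted later as P(y = 1 | x; w)."""
    return sigmoid(X @ w + b)
```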

Hypothesis representation
Interpretation of sigmoid/logistic function
$f_{w,b}(x)$ = estimated probability that $y = 1$ for input $x$.
For example, if $f_{w,b}(x) = 0.8$, this means that a patient with tumor size $x$ has an 80% chance of the tumor being malignant. In this task, a larger tumor size corresponds to a larger probability of the tumor being malignant.
Thus, we can say that
$$f_{w,b}(x) = P(y = 1|x; w).$$


Decision boundary

In logistic regression, we have
$$f_{w,b}(x) = g(w^\top x + b) = P(y = 1|x; w) \in [0, 1], \qquad g(z) = \frac{1}{1 + \exp(-z)}.$$
Suppose that if $f_{w,b}(x) \ge 0.5$, then we predict $y = 1$; if $f_{w,b}(x) < 0.5$, then we predict $y = 0$.
Correspondingly, if $w^\top x + b \ge 0$, we predict $y = 1$; if $w^\top x + b < 0$, we predict $y = 0$.
This determines the decision boundary, which is the curve/hyperplane corresponding to $f_{w,b}(x) = 0.5$, or $w^\top x + b = 0$.
Decision boundary

$$f_{w,b}(x) = g(b + w_1 x_1 + w_2 x_2) = g(-3 + x_1 + x_2)$$

Predict $y = 1$ if $-3 + x_1 + x_2 \ge 0$ (plot above)

Decision boundary

Figure: Non-linear decision boundary

$$f_{w,b}(x) = g(b + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2) = g(-1 + x_1^2 + x_2^2)$$

Predict $y = 1$ if $-1 + x_1^2 + x_2^2 \ge 0$ (plot above)
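For concreteness, a tiny sketch (my own code, using the slide's weights b = −1, w3 = w4 = 1) showing that this hypothesis classifies points by whether they lie inside or outside the unit circle:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Slide's example: b = -1, weights 0, 0, 1, 1 on the features [x1, x2, x1^2, x2^2].
b, w = -1.0, np.array([0.0, 0.0, 1.0, 1.0])

def predict(x1, x2):
    phi = np.array([x1, x2, x1**2, x2**2])       # basis expansion of the input
    return int(sigmoid(b + w @ phi) >= 0.5)      # threshold at 0.5

print(predict(0.5, 0.5))  # inside the circle x1^2 + x2^2 = 1  -> 0
print(predict(1.5, 0.0))  # outside the circle                 -> 1
```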

3 Logistic regression
Cost function

Training set: $m$ training examples $\{(x_i, y_i)\}_{i=1}^{m}$
Hypothesis function: $f_{w,b}(x) = g(w^\top x + b) = \frac{1}{1 + \exp(-w^\top x - b)}$
Cost function:
Linear regression: $J(w) = \frac{1}{2m}\sum_{i=1}^{m}(f_{w,b}(x_i) - y_i)^2 = \frac{1}{2m}\|Xw - y\|^2$, which is called the $\ell_2$ loss or residual sum of squares. It is convex w.r.t. $w$ for linear regression.
Logistic regression: if we adopt the same cost function for logistic regression, we have
$$J(w) = \frac{1}{2m}\sum_{i=1}^{m}\big(g(w^\top x_i) - y_i\big)^2.$$
However, it is non-convex w.r.t. $w$.

Exercise 1: Prove the $\ell_2$ loss is convex w.r.t. $w$ for linear regression.

Exercise 2: Prove the $\ell_2$ loss is non-convex w.r.t. $w$ for logistic regression.

Cost function
Cross-entropy:
$$H(p, q) = -\int_x p(x)\log(q(x))\,dx \quad \text{or} \quad -\sum_x p(x)\log(q(x)),$$
where $p(x), q(x)$ are probability density functions (PDFs) of $x$ if $x$ is a continuous random variable, or probability mass functions if $x$ is a discrete random variable.
We set
ground-truth posterior probability: $y(x) = P(y = 1|x)$,
predicted posterior probability: $f_{w,b}(x) = P(y = 1|x; w)$.
Cross-entropy loss:
$$\mathrm{cost}\big(y(x), f_{w,b}(x)\big) = H\big(y(x), f_{w,b}(x)\big) = -P(y = 1|x)\log P(y = 1|x; w) - P(y = 0|x)\log P(y = 0|x; w) = \begin{cases} -\log(f_{w,b}(x)), & \text{if } y(x) = 1 \\ -\log(1 - f_{w,b}(x)), & \text{if } y(x) = 0 \end{cases}$$
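A direct NumPy translation of this loss (illustrative sketch; the eps clipping is a numerical safeguard of mine, not part of the slide's formula):

```python
import numpy as np

def cross_entropy(y, f, eps=1e-12):
    """Per-example cross-entropy loss for binary labels y in {0, 1} and
    predicted probabilities f = f_{w,b}(x) = P(y=1|x; w).
    eps keeps the logarithm away from log(0)."""
    f = np.clip(f, eps, 1.0 - eps)
    return -(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))
```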

Cost function for logistic regression
Cross-entropy loss:
$$\mathrm{cost}\big(y(x), f_{w,b}(x)\big) = \begin{cases} -\log(f_{w,b}(x)), & \text{if } y(x) = 1 \\ -\log(1 - f_{w,b}(x)), & \text{if } y(x) = 0 \end{cases}$$

For $y = 1$, if $f_{w,b}(x) = 1$, i.e., $P(y = 1|x; w) = 1$, then the prediction equals the ground-truth label, and the cost is 0.
For $y = 1$, if $f_{w,b}(x) \to 0$, i.e., $P(y = 1|x; w) \to 0$, then it should be penalized with a very large cost. Here we have $\mathrm{cost}(y(x), f_{w,b}(x)) \to \infty$.

Cost function for logistic regression
Cross-entropy loss:
$$\mathrm{cost}\big(y(x), f_{w,b}(x)\big) = \begin{cases} -\log(f_{w,b}(x)), & \text{if } y(x) = 1 \\ -\log(1 - f_{w,b}(x)), & \text{if } y(x) = 0 \end{cases}$$
For $y = 0$, if $f_{w,b}(x) = 0$, i.e., $P(y = 1|x; w) = 0$, then the prediction equals the ground-truth label, and the cost is 0.
For $y = 0$, if $f_{w,b}(x) \to 1$, i.e., $P(y = 1|x; w) \to 1$, then it should be penalized with a very large cost. Here we have $\mathrm{cost}(y(x), f_{w,b}(x)) \to \infty$.

Cost function for logistic regression

Cross-entropy loss:
$$\mathrm{cost}\big(y(x), f_{w,b}(x)\big) = \begin{cases} -\log(f_{w,b}(x)), & \text{if } y(x) = 1 \\ -\log(1 - f_{w,b}(x)), & \text{if } y(x) = 0 \end{cases}$$

Exercise: Which statements are true?

If $f_{w,b}(x) = y$, then $\mathrm{cost}(y(x), f_{w,b}(x)) = 0$ for both $y = 0$ and $y = 1$
If $y = 0$, then $\mathrm{cost}(y(x), f_{w,b}(x)) \to \infty$ as $f_{w,b}(x) \to 1$
If $y = 0$, then $\mathrm{cost}(y(x), f_{w,b}(x)) \to \infty$ as $f_{w,b}(x) \to 0$
Regardless of whether $y = 0$ or $y = 1$, if $f_{w,b}(x) = 0.5$, then $\mathrm{cost}(y(x), f_{w,b}(x)) > 0$

Cost function of logistic regression

Cost function of logistic regression:
$$J(w) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{cost}\big(y_i, f_{w,b}(x_i)\big), \qquad \mathrm{cost}\big(y(x), f_{w,b}(x)\big) = \begin{cases} -\log(f_{w,b}(x)), & \text{if } y(x) = 1 \\ -\log(1 - f_{w,b}(x)), & \text{if } y(x) = 0 \end{cases}$$

The above cost function can be simplified as follows:
$$J(w) = -\frac{1}{m}\sum_{i=1}^{m}\big[y_i\log(f_{w,b}(x_i)) + (1 - y_i)\log(1 - f_{w,b}(x_i))\big].$$

Exercise: Please prove that J(w) is convex w.r.t. w.

Gradient descent for logistic regression

Learn $w$ by minimizing $J(w)$, i.e.,
$$w^* = \arg\min_w J(w) = \arg\min_w -\frac{1}{m}\sum_{i=1}^{m}\big[y_i\log(f_{w,b}(x_i)) + (1 - y_i)\log(1 - f_{w,b}(x_i))\big].$$
Gradient descent: repeat the following update until convergence:
$$w \leftarrow w - \alpha\nabla_w J(w), \qquad \nabla_w J(w) = \frac{1}{m}\sum_{i=1}^{m}\big[f_{w,b}(x_i) - y_i\big]x_i$$

How to define convergence? Compute the change of $J(w)$ or $w$ over the last $K$ steps; if the change is below a threshold, the algorithm can be regarded as converged. Remember that choosing a suitable learning rate $\alpha$ is important for reaching a good solution.
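A compact NumPy sketch of this training loop (illustrative; the tolerance-based stopping rule and the 1e-12 clipping inside the log are my own choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=5000, tol=1e-7):
    """Batch gradient descent for logistic regression (illustrative sketch).
    X: (m, d) design matrix with a leading column of ones for the bias;
    y: (m,) array of 0/1 labels."""
    m, d = X.shape
    w = np.zeros(d)
    prev_J = np.inf
    for _ in range(iters):
        p = sigmoid(X @ w)                      # f_w(x_i) for all i
        grad = X.T @ (p - y) / m                # (1/m) sum_i (f_w(x_i) - y_i) x_i
        w -= alpha * grad
        J = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        if abs(prev_J - J) < tol:               # simple convergence check on J(w)
            break
        prev_J = J
    return w
```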

Gradient descent for logistic regression

Exercise: Suppose you are running a logistic regression model and want to monitor the learning procedure in order to find a suitable learning rate $\alpha$. Which of the following is reasonable to make sure $\alpha$ is set properly and that gradient descent is running correctly?

Plot $J(w) = -\frac{1}{m}\sum_{i=1}^{m}(y_i - f_{w,b}(x_i))^2$ as a function of the number of iterations (i.e., the horizontal axis is the iteration number) and make sure $J(w)$ is decreasing on every iteration.
Plot $J(w) = -\frac{1}{m}\sum_{i=1}^{m}\big[y_i\log(f_{w,b}(x_i)) + (1 - y_i)\log(1 - f_{w,b}(x_i))\big]$ as a function of the number of iterations (i.e., the horizontal axis is the iteration number) and make sure $J(w)$ is decreasing on every iteration.
Plot $J(w)$ as a function of $w$ and make sure it is decreasing on every iteration.
Plot $J(w)$ as a function of $w$ and make sure it is convex.

Multi-class classification

Binary classification: in the above examples and derivations, we only considered the binary classification problem, i.e., $y \in \{0, 1\}$.
Multi-class/multi-category classification: however, many practical problems involve multi-category outputs, i.e., $y \in \{1, \ldots, C\}$:
Weather forecast: sunny, cloudy, rain, snow
Email tagging: work, friends, families, hobby

Multi-class classification: one-vs-all

One-vs-all logistic regression:

Train a binary logistic regression $f_{w_j,b_j}(\cdot)$ for each class $j$, treating all samples of the other classes as the negative class.
For a new test sample $x$, predict its class as $\arg\max_j f_{w_j,b_j}(x)$.
Pros: easy to implement.
Cons: the training cost is high, and it is difficult to scale to tasks with a large number of classes.
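A small sketch of this scheme (illustrative; fit_binary is a placeholder for any binary trainer that returns a weight vector, such as the gradient-descent sketch shown earlier, and X is assumed to carry a leading column of ones):

```python
import numpy as np

def one_vs_all_train(X, y, num_classes, fit_binary):
    """Train one binary logistic regression per class (one-vs-all).
    fit_binary(X, t) is assumed to return a weight vector for 0/1 targets t."""
    return [fit_binary(X, (y == c).astype(float)) for c in range(num_classes)]

def one_vs_all_predict(X, weights):
    """Predict the class whose binary classifier gives the largest score.
    Comparing raw scores w^T x is enough, since the sigmoid is monotone."""
    scores = np.column_stack([X @ w for w in weights])
    return np.argmax(scores, axis=1)
```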

Multi-class classification: Softmax regression

Softmax function:
$$f_{W,b}^{(j)}(x) = \frac{\exp(w_j^\top x + b_j)}{\sum_{c=1}^{C}\exp(w_c^\top x + b_c)} = P(y = j|x; W, b), \tag{12}$$
where $W = [w_1, \ldots, w_C]$, $b = [b_1; b_2; \ldots; b_C]$, with $C$ being the number of classes. For simplicity, in the following we write $f_{W,b}^{(j)}(\cdot)$ as $f_{w_j,b_j}(\cdot)$.
Cost function:
$$J(W) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{C} I(y_i = j)\log\big(f_{w_j,b_j}(x_i)\big), \tag{13}$$
where $I(a) = 1$ if $a$ is true, and $I(a) = 0$ otherwise.

Multi-class classification: Softmax regression
It can also be optimized by gradient descent:
$$w_j \leftarrow w_j - \alpha\frac{\partial J(W)}{\partial w_j},$$
$$\frac{\partial J(W)}{\partial w_j} = -\frac{1}{m}\sum_{i=1}^{m}\bigg[\frac{I(y_i = j)}{f_{w_j,b_j}(x_i)}\cdot\frac{\partial f_{w_j,b_j}(x_i)}{\partial w_j} + \sum_{c \neq j}\frac{I(y_i = c)}{f_{w_c,b_c}(x_i)}\cdot\frac{\partial f_{w_c,b_c}(x_i)}{\partial w_j}\bigg],$$
$$\frac{\partial f_{w_j,b_j}(x_i)}{\partial w_j} = f_{w_j,b_j}(x_i)\cdot\big(1 - f_{w_j,b_j}(x_i)\big)\cdot x_i, \qquad \frac{\partial f_{w_c,b_c}(x_i)}{\partial w_j} = -f_{w_j,b_j}(x_i)\cdot f_{w_c,b_c}(x_i)\cdot x_i \;\;(c \neq j)$$
$$\Longrightarrow \frac{\partial J(W)}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\big(f_{w_j,b_j}(x_i) - I(y_i = j)\big)x_i \tag{14}$$

Note: $\{w_c\}_{c=1}^{C}$ should be updated in parallel, rather than sequentially.
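A NumPy sketch of one update step implementing (14) (illustrative; the parallel update over all classes is done with a one-hot matrix, the max-subtraction is a standard numerical-stability trick, and the bias update is my analogous addition, not spelled out on the slide):

```python
import numpy as np

def softmax_grad_step(X, y, W, b, alpha):
    """One batch gradient-descent step for softmax regression, implementing
    gradient (14): (1/m) sum_i (f_j(x_i) - I(y_i = j)) x_i for all classes j
    in parallel. X: (m, d); y: (m,) integer labels in {0, ..., C-1};
    W: (d, C) with columns w_j; b: (C,)."""
    m, C = X.shape[0], W.shape[1]
    scores = X @ W + b                               # logits w_j^T x + b_j
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)                # (m, C) predicted probabilities
    Y = np.eye(C)[y]                                 # (m, C) one-hot labels I(y_i = j)
    G = X.T @ (P - Y) / m                            # (d, C) gradients, all classes at once
    return W - alpha * G, b - alpha * (P - Y).mean(axis=0)
```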
4 Regularized logistic regression
Overfitting in linear regression

Overfitting: If we have too many features, the learned hypothesis may fit the
training data very well (low bias), but fail to generalize to new examples.

Overfitting in logistic regression

Figure: illustrations of under-fitting, good fitting, and over-fitting.

Addressing Overfitting

Generally, there are two approaches to addressing the overfitting problem:
Reducing the number of features:
Feature selection
Dimensionality reduction (introduced in later lectures)
Regularization:
Keep all features, but reduce the magnitude/value of each parameter, so that each feature contributes only a small amount to the prediction of $y$
In the following, we will focus on the regularization-based approach.

Regularized logistic regression
The objective function of regularized logistic regression is formulated as follows:
$$\bar{J}(w) = J(w) + \frac{\lambda}{2m}\sum_{j=1}^{d} w_j^2 = -\frac{1}{m}\sum_{i=1}^{m}\big[y_i\log(f_{w,b}(x_i)) + (1 - y_i)\log(1 - f_{w,b}(x_i))\big] + \frac{\lambda}{2m}\sum_{j=1}^{d} w_j^2.$$
Note: the bias parameter $w_0$ (or $b$) is not regularized/penalized.

The above objective function can also be minimized by gradient descent, as follows:
$$w_0 \leftarrow w_0 - \frac{\alpha}{m}\sum_{i=1}^{m}(f_{w,b}(x_i) - y_i)\cdot x_i(0), \quad \text{where } x_i(0) = 1, \; \forall i,$$
$$w_j \leftarrow w_j - \frac{\alpha}{m}\Big[\sum_{i=1}^{m}(f_{w,b}(x_i) - y_i)\cdot x_i(j) + \lambda w_j\Big], \quad j = 1, \ldots, d,$$
where $x_i(j)$ denotes the $j$-th entry of $x_i$.
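A sketch of one such update in NumPy (illustrative; it assumes the bias has been folded into the first column of X so that w[0] plays the role of w_0 and is left unpenalized):

```python
import numpy as np

def regularized_logistic_step(X, y, w, alpha, lam):
    """One gradient-descent step for L2-regularized logistic regression.
    X: (m, d+1) with X[:, 0] == 1 for the bias; y: (m,) 0/1 labels;
    w: (d+1,) with w[0] = w_0, which is not penalized."""
    m = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # f_{w,b}(x_i) for all i
    grad = X.T @ (p - y) / m                 # unregularized gradient
    grad[1:] += (lam / m) * w[1:]            # add (lambda/m) * w_j for j >= 1 only
    return w - alpha * grad
```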


Regularized logistic regression

Exercise: When using regularized logistic regression, which of these is the best way to monitor whether gradient descent is working correctly?

Plot $J(w)$ as a function of the number of iterations and make sure it's decreasing
Plot $J(w) - \frac{\lambda}{2m}\sum_{j=1}^{d} w_j^2$ as a function of the number of iterations and make sure it's decreasing
Plot $J(w) + \frac{\lambda}{2m}\sum_{j=1}^{d} w_j^2$ as a function of the number of iterations and make sure it's decreasing
Plot $\sum_{j=1}^{d} w_j^2$ as a function of the number of iterations and make sure it's decreasing

5 Probabilistic perspective of logistic regression
Logistic regression: probabilistic modeling
Behind logistic regression for binary classification, we assume that both the feature $x$ and the label $y$ are random variables, as follows:
$$\mu(x|w) = \mathrm{Sigmoid}(w^\top x),$$
$$y(x|w) \sim \mathrm{Bernoulli}(\mu(x|w)).$$
Then, we have
$$P(y|x; w) = \begin{cases} \mu & \text{if } y = 1, \\ 1 - \mu & \text{if } y = 0. \end{cases}$$
The log-likelihood function of $P(y|x; w)$ is formulated as
$$L(w) = y\log(\mu) + (1 - y)\log(1 - \mu).$$
Thus, we obtain
$$\max_w L(w) \equiv \min_w J(w).$$

Logistic regression: probabilistic modeling

Behind logistic regression, we assume that
$$\mu(x|w) = \mathrm{Sigmoid}(w^\top x),$$
$$y(x|w) \sim \mathrm{Bernoulli}(\mu(x|w)).$$

$\ell_2$-regularized logistic regression: we further assume $w \sim \mathcal{N}(w|0, \sigma^2 I)$; then we have
$$\max_w L(w) + \log\mathcal{N}(w|0, \sigma^2 I) \equiv \min_w J(w) + \frac{\lambda}{2m}\sum_{j=1}^{d} w_j^2.$$

$\ell_1$-regularized logistic regression: if we assume $w \sim \mathrm{Laplace}(w|0, b)$, then we have
$$\max_w L(w) + \log\mathrm{Laplace}(w|0, b) \equiv \min_w J(w) + \frac{\lambda}{2m}\sum_{j=1}^{d} |w_j|.$$

Summary: linear regression vs. logistic regression

                        | Linear regression                                | Logistic regression
Task                    | regression                                       | classification
Hypothesis $f_{w,b}(x)$ | $w^\top x + b \in (-\infty, \infty)$             | $g(w^\top x + b) \in [0, 1]$
Objective $J(w)$        | $\frac{1}{2m}\sum_{i=1}^{m}(y_i - w^\top x_i)^2$ | $-\frac{1}{m}\sum_{i=1}^{m}\big[y_i\log(f_{w,b}(x_i)) + (1 - y_i)\log(1 - f_{w,b}(x_i))\big]$
Solution                | closed-form or gradient descent                  | gradient descent

Note that: for each variant of linear/logistic regression, you can derive it from both the deterministic and the probabilistic perspectives.
Own reading: both linear regression and logistic regression are special cases of generalized linear models. If interested, you can find more details in Section 4 of the book "Pattern Recognition and Machine Learning", Bishop, 2006.
