Logistic Regression

With Gradient Descent and Regularization


Logistic Regression
◼ A classification method for binary
classification – returns the probability of
target variable y=1
◼ Can be implemented as an NN with sigmoid
activation function
◼ Can be seen as a Bayesian learning method
(maximum likelihood learner) for learning to
predict probability
◼ Can represent non-linear decision boundaries
Hypothesis Representation
◼ Parameters: vector $\theta = (\theta_0, \theta_1, \dots, \theta_n)^T$
◼ The hypothesis is a non-linear function of the input variables $(x_1, \dots, x_n)$
◼ Input data: $D = \{\langle x^1, y^1 \rangle, \dots, \langle x^m, y^m \rangle\}$, each $y^i \in \{0, 1\}$,
  where $x^i = (x_0^i, x_1^i, \dots, x_n^i)^T$ and $x_0^i = 1$ (we add the 0-th feature $x_0 = 1$ to simplify the notation).
◼ The hypothesis $h_\theta$ is obtained by applying the sigmoid function to a linear function of the input variables:
  $h_\theta(x) = \mathrm{sigmoid}(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$ (for logistic regression).
  In contrast, $h_\theta(x) = \theta^T x$ for linear regression.
Sigmoid/logistic function
$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$
Sigmoid(x) is monotone and non-linear:
  when $x \to +\infty$, $\mathrm{Sigmoid}(x) \to 1$
  when $x \to -\infty$, $\mathrm{Sigmoid}(x) \to 0$
  when $x = 0$, $\mathrm{Sigmoid}(x) = 0.5$
  $\mathrm{Sigmoid}(x) > 0.5 \Leftrightarrow x > 0$
$0 < \mathrm{Sigmoid}(x) < 1$: Sigmoid is a bounded function, with values in the open interval (0, 1).
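To make the sigmoid and the hypothesis concrete, here is a minimal NumPy sketch (the function names `sigmoid` and `hypothesis` are ours, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid/logistic function: maps any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) = sigmoid(theta^T x), computed for each row of X.

    X is an (m, n+1) design matrix whose first column is the constant feature
    x_0 = 1; theta is a vector of length n+1.
    """
    return sigmoid(X @ theta)
```

For example, `sigmoid(0.0)` returns 0.5, matching the property listed above.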
Interpretation of the hypothesis –
returning a probability
$h_\theta(x) = \mathrm{sigmoid}(\theta^T x)$
• Given a training data pair (x, y), $h_\theta(x)$ can be interpreted as returning the probability $\Pr(y = 1 \mid x; \theta)$: the probability that the target variable y takes value 1, given the observed $x$ and the parameters $\theta$.
• So if we use $h_\theta$ as a classifier and predict y = 1 if $h_\theta(x) \ge 0.5$, this is equivalent to predicting y = 1 if $\theta^T x \ge 0$.
Interpretation of the hypothesis –
defining a decision boundary
$h_\theta(x) = \mathrm{sigmoid}(\theta^T x)$
Predict y = 1 if $h_\theta(x) \ge 0.5$; otherwise predict y = 0.
This is equivalent to predicting y = 1 if and only if $\theta^T x \ge 0$.
➔ So the decision boundary (separating class 1 and class 0) is defined by the line $\theta^T x = 0$. This is a linear decision boundary in the input variables (when we use only the original input variables). A small code sketch of this rule follows.
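A minimal sketch of the decision rule just described, assuming the same NumPy setup as above (`predict` is an illustrative name):

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    """Predict y = 1 when sigmoid(theta^T x) >= threshold.

    With threshold = 0.5 this is the same as predicting y = 1 when theta^T x >= 0.
    """
    probs = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return (probs >= threshold).astype(int)
```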
Illustration of a linear decision
boundary
• Two input variables $x_1$ and $x_2$. The three triangle points are positive examples and the circle points are negative examples.
• The decision boundary (the red line) is given by the linear equation $x_1 + x_2 - 3 = 0$. Namely, if $x_1 + x_2 \ge 3$, predict y = 1; otherwise predict y = 0.
[Figure: scatter plot in the $(x_1, x_2)$ plane; the red line $x_1 + x_2 = 3$ separates the triangle points from the circle points.]
Illustration of a non-linear decision
boundary
• Still two input variables $x_1$ and $x_2$. The decision boundary (the red circle) is NOT linear in $x_1$ and $x_2$. Using the quadratic features $(x_1 - 3)^2$ and $(x_2 - 3)^2$ with logistic regression, we can get this non-linear decision boundary.
• The red circle is centered at (3, 3) with radius 2; the points on it satisfy the equation $(x_1 - 3)^2 + (x_2 - 3)^2 = 4$.
• The decision: if $(x_1 - 3)^2 + (x_2 - 3)^2 \ge 4$, predict y = 1; otherwise predict y = 0. Note the decision boundary is linear in the new features $(x_1 - 3)^2$ and $(x_2 - 3)^2$.
[Figure: scatter plot in the $(x_1, x_2)$ plane with the red circle of radius 2 centered at (3, 3) as the decision boundary.]
Logistic regression for non-linear
decision boundaries
◼ From the previous slide: using polynomial features + logistic regression, we can handle classification problems with non-linear decision boundaries (a feature-mapping sketch follows below)
◼ The decision boundary is non-linear in the original input variables
◼ But it is linear in the polynomial features
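A sketch of the feature mapping used in the previous example; the center (3, 3) is hard-coded from the illustration, not learned:

```python
import numpy as np

def quadratic_features(X_raw):
    """Map raw inputs (x1, x2) to the feature vector [1, (x1-3)^2, (x2-3)^2].

    Logistic regression on these features has a linear decision boundary in the
    new features, which is a circle centered at (3, 3) in the original plane.
    """
    x1, x2 = X_raw[:, 0], X_raw[:, 1]
    return np.column_stack([np.ones(len(X_raw)), (x1 - 3.0) ** 2, (x2 - 3.0) ** 2])
```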
Loss function
◼ Parameters: vector $\theta = (\theta_0, \theta_1, \dots, \theta_n)^T$
◼ Loss/Cost (mean cross-entropy):
  $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^i \log h_\theta(x^i) + (1 - y^i) \log\left(1 - h_\theta(x^i)\right) \right]$
  (Remember: each $y^i \in \{0, 1\}$, so only one of the two terms is non-zero for each pair $(x^i, y^i)$.)
◼ Why use this loss function instead of the mean squared error used in linear regression? The reason is that the loss above is convex for logistic regression, whereas the squared-error loss of linear regression is NO longer convex once $h_\theta(x) = \mathrm{sigmoid}(\theta^T x)$ ➔ this would make gradient descent difficult. (A code sketch of this loss follows below.)
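A minimal sketch of the mean cross-entropy loss above (the epsilon clipping is our own addition to avoid log(0)):

```python
import numpy as np

def cross_entropy_loss(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum_i [ y_i * log(h_i) + (1 - y_i) * log(1 - h_i) ]."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x^i) for every example
    h = np.clip(h, eps, 1.0 - eps)           # guard against log(0)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h)) / m
```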
Convex function and gradient
descent
◼ For a function f(x) of one variable x, f is convex on the interval [a, b] if $f''(x) \ge 0$ on [a, b]
◼ A multivariate function $f(x_1, \dots, x_n)$ is convex if the Hessian matrix H of f is positive semi-definite (i.e., $z^T H z \ge 0$ for every vector z) – intuitively, the second derivative of f is non-negative in every direction
◼ For a convex function f, every local minimum is a global minimum – this makes gradient descent's job easy: as long as the learning rate $\eta$ is not too big, GD (Gradient Descent) is guaranteed to find an optimal solution.
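As a sketch of why the cross-entropy loss is convex for logistic regression (a standard derivation, not shown on the slide): the Hessian of $J(\theta)$ is

$$H = \frac{1}{m} \sum_{i=1}^{m} h_\theta(x^i)\left(1 - h_\theta(x^i)\right) x^i (x^i)^T,$$

so for any vector $z$,

$$z^T H z = \frac{1}{m} \sum_{i=1}^{m} h_\theta(x^i)\left(1 - h_\theta(x^i)\right) \left((x^i)^T z\right)^2 \ge 0,$$

i.e., $H$ is positive semi-definite and $J(\theta)$ is convex.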
Illustrations of convex and non-convex
functions
[Figure: left, an example of a non-convex function with local minimal points; right, an example of a convex function with only ONE global minimal point.]
Intuition about the loss function
$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^i \log h_\theta(x^i) + (1 - y^i) \log\left(1 - h_\theta(x^i)\right) \right]$
◼ Consider the term $-y^i \log h_\theta(x^i)$ (when $y^i = 1$): $h_\theta(x^i)$ takes values in the (0, 1) interval. Intuitively, when $y^i = 1$, if the predicted value $h_\theta(x^i)$ is close to 1, the loss should be small; the loss should be bigger as $h_\theta(x^i)$ approaches 0. The curve of the function $-\log(x)$ on the interval (0, 1] behaves exactly the desired way: when $h_\theta(x^i) \to 0$, $\log(h_\theta(x^i)) \to -\infty$, and thus $-\log(h_\theta(x^i)) \to +\infty$.
[Figure: plot of $-\log(x)$ for x in (0, 1].]
Intuition about the loss function
$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^i \log h_\theta(x^i) + (1 - y^i) \log\left(1 - h_\theta(x^i)\right) \right]$
◼ Consider the term $-(1 - y^i) \log\left(1 - h_\theta(x^i)\right)$ (when $y^i = 0$): $h_\theta(x^i)$ takes values in the (0, 1) interval. Intuitively, when $y^i = 0$, if the predicted value $h_\theta(x^i)$ is close to 0, the loss should be small; the loss should be bigger as $h_\theta(x^i)$ approaches 1. The curve of the function $-\log(1 - x)$ on the interval [0, 1) behaves exactly the desired way.
[Figure: plot of $-\log(1 - x)$ for x in [0, 1).]
Gradient descent for training logistic
regression classifier
◼ Initialize $\theta$, and select a learning rate $\eta > 0$
◼ Then loop until convergence/termination:
  ◼ Compute $\Delta\theta = -\eta \, \frac{\partial J(\theta)}{\partial \theta}$
  ◼ $\theta \leftarrow \theta + \Delta\theta$
◼ The gradient is $\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^i) - y^i \right) x^i$
◼ So $\Delta\theta = -\eta \, \frac{\partial J(\theta)}{\partial \theta} = \frac{\eta}{m} \sum_{i=1}^{m} \left( y^i - h_\theta(x^i) \right) x^i$
◼ Note the update formula above looks the SAME as the formula for gradient descent in linear regression – but the $h_\theta(x^i) = \mathrm{sigmoid}(\theta^T x^i)$ here is different from the $h_\theta(x^i) = \theta^T x^i$ of linear regression!
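A minimal batch gradient descent sketch following the update rule above (the function name and hyperparameter defaults are illustrative, not from the slides):

```python
import numpy as np

def train_logistic_regression(X, y, eta=0.1, num_iters=1000):
    """Batch gradient descent for logistic regression.

    X: (m, n+1) design matrix with x_0 = 1 in the first column; y: length-m 0/1 labels.
    Each iteration applies theta <- theta + (eta/m) * sum_i (y_i - h_i) * x_i.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))     # h_theta(x^i) for all examples
        theta += (eta / len(y)) * (X.T @ (y - h))  # gradient descent update
    return theta
```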
Regularization
◼ When we have too many input variables, the model may be too complex, with a risk of overfitting
◼ To handle this, add a regularization term to the loss:
  $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^i \log h_\theta(x^i) + (1 - y^i) \log\left(1 - h_\theta(x^i)\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$   (with $\lambda \ge 0$)
  Note that the regularization term starts from j = 1.
  The gradient also changes: for $j \ge 1$,
  $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^i) - y^i \right) x_j^i + \frac{\lambda}{m} \theta_j$
  (the j = 0 component keeps its unregularized form).
Regularization
◼ In the regularization term $\frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$, we do NOT regularize $\theta_0$
◼ When $\lambda = 0$ or very small, the regularization takes NO effect → may overfit
◼ When $\lambda$ is very big, to minimize $\frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$ the $\theta_j$ would be forced to take very small values, so the decision boundary would not be good → may underfit (a sketch of the regularized gradient follows below)
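A hedged sketch of the regularized gradient described above; note that $\theta_0$ is excluded from the penalty:

```python
import numpy as np

def regularized_gradient(theta, X, y, lam):
    """Gradient of the regularized loss: (1/m) * X^T (h - y) + (lam/m) * theta,
    with the penalty term zeroed out for theta_0 (the bias is not regularized)."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    grad = X.T @ (h - y) / m
    penalty = (lam / m) * theta
    penalty[0] = 0.0                  # do NOT regularize theta_0
    return grad + penalty
```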
Generalization to classification with more
than two classes
Assume we have 3 classes 𝐶1 , 𝐶2 , and 𝐶3.
We build 3 binary classifiers 𝑀1 , 𝑀2 , 𝑀3 with logistic
regression – using the one-vs-all approach:
1. Generate training data 𝐷1 from the original data D: label
all examples of 𝐶1 as positive and all other examples as
negative.
2. Apply logistic regression to 𝐷1 and build 𝑀1 .
Repeat the above steps (1) and (2) to build 𝑀2 , 𝑀3
For a new data point x, we get 3 probabilities 𝑃1 , 𝑃2 , 𝑃3 by applying
𝑀1 , 𝑀2 , 𝑀3 to the data x. Predict class 𝐶𝑗 , if 𝑃𝑗 is the maximum
among {𝑃1 , 𝑃2 , 𝑃3 }.
Clearly the above method generalizes to any k-class problem with k > 2 (a code sketch of the one-vs-all scheme follows below).
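A sketch of the one-vs-all scheme just described, reusing the hypothetical `train_logistic_regression` sketched on the gradient-descent slide:

```python
import numpy as np

def train_one_vs_all(X, y, classes, **gd_kwargs):
    """Train one binary logistic-regression model per class C_j (C_j vs. the rest)."""
    return {c: train_logistic_regression(X, (y == c).astype(int), **gd_kwargs)
            for c in classes}

def predict_one_vs_all(models, X):
    """For each x, pick the class whose model returns the highest probability P_j."""
    classes = list(models)
    probs = np.column_stack(
        [1.0 / (1.0 + np.exp(-(X @ models[c]))) for c in classes])
    return np.array(classes)[np.argmax(probs, axis=1)]
```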
Practical considerations
◼ Feature scaling should be applied when the
input variables have rather different value
scales, just like in multivariate linear
regression
◼ Learning rate 𝜂 selection should also be done
carefully
◼ The regularization parameter $\lambda$ should also be selected carefully
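For the feature-scaling point above, a minimal standardization sketch (apply it to the raw input variables, before adding the constant $x_0 = 1$ column):

```python
import numpy as np

def standardize(X_raw):
    """Scale each input variable to zero mean and unit variance.

    Returns the scaled features together with (mu, sigma), which must be reused
    to scale new data the same way before prediction.
    """
    mu = X_raw.mean(axis=0)
    sigma = X_raw.std(axis=0)
    sigma[sigma == 0] = 1.0           # avoid dividing by zero for constant features
    return (X_raw - mu) / sigma, mu, sigma
```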