3 Logistic Regression and Regularization

This document discusses logistic regression. It covers the hypothesis representation using the sigmoid function, the cost function for logistic regression using maximum likelihood estimation, training logistic regression using gradient descent, regularization to address overfitting, and multi-class classification using a one-vs-all approach with multiple logistic regression classifiers.


Logistic Regression

Logistic Regression
• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification
[Figure: malignant? (1 = Yes, 0 = No) plotted against tumor size, with a linear regression fit $h_\theta(x) = \theta^\top x$]

• Threshold the classifier output $h_\theta(x)$ at 0.5:
  – If $h_\theta(x) \geq 0.5$, predict "$y = 1$"
  – If $h_\theta(x) < 0.5$, predict "$y = 0$"
Slide credit: Andrew Ng
Classification: $y = 1$ or $y = 0$

$h_\theta(x) = \theta^\top x$ (from linear regression) can be $> 1$ or $< 0$

Logistic regression: $0 \leq h_\theta(x) \leq 1$

Despite its name, logistic regression is a classification algorithm.

Slide credit: Andrew Ng
Hypothesis representation
• Want $0 \leq h_\theta(x) \leq 1$

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$

• $h_\theta(x) = g(\theta^\top x)$, where $g(z) = \dfrac{1}{1 + e^{-z}}$ is the sigmoid (logistic) function

[Figure: plot of the sigmoid $g(z)$ against $z$]

Slide credit: Andrew Ng
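To make the hypothesis concrete, here is a minimal NumPy sketch (my own illustration, not part of the original slides) of the sigmoid function, the hypothesis $h_\theta(x) = g(\theta^\top x)$, and the 0.5-threshold prediction rule; it assumes the feature matrix X already contains the intercept column $x_0 = 1$.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) = g(theta^T x) for every row of X (X includes the x_0 = 1 column)."""
    return sigmoid(X @ theta)

def predict(theta, X, threshold=0.5):
    """Predict y = 1 when h_theta(x) >= threshold, else y = 0."""
    return (hypothesis(theta, X) >= threshold).astype(int)
```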
Interpretation of hypothesis output
• $h_\theta(x)$ = estimated probability that $y = 1$ on input $x$, i.e. $h_\theta(x) = P(y = 1 \mid x; \theta)$

• Example: if $x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$ and $h_\theta(x) = 0.7$,
  tell the patient there is a 70% chance of the tumor being malignant.

Slide credit: Andrew Ng


Logistic regression
$h_\theta(x) = g(\theta^\top x)$, where $g(z) = \dfrac{1}{1 + e^{-z}}$ and $z = \theta^\top x$

[Figure: sigmoid $g(z)$; note that $g(z) \geq 0.5$ exactly when $z \geq 0$]

Suppose we predict "$y = 1$" if $h_\theta(x) \geq 0.5$, i.e. $z = \theta^\top x \geq 0$,
and predict "$y = 0$" if $h_\theta(x) < 0.5$, i.e. $z = \theta^\top x < 0$.
Slide credit: Andrew Ng
Decision boundary
• $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$

  E.g., $\theta_0 = -3$, $\theta_1 = 1$, $\theta_2 = 1$

• Predict "$y = 1$" if $-3 + x_1 + x_2 \geq 0$

[Figure: training examples in the tumor size / age plane, separated by the linear decision boundary $x_1 + x_2 = 3$]
Slide credit: Andrew Ng
• $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$

  E.g., $\theta_0 = -1$, $\theta_1 = 0$, $\theta_2 = 0$, $\theta_3 = 1$, $\theta_4 = 1$

• Predict "$y = 1$" if $-1 + x_1^2 + x_2^2 \geq 0$ (a circular decision boundary)

• Higher-order polynomial features yield more complex decision boundaries:
  $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1^2 x_2 + \theta_5 x_1^2 x_2^2 + \theta_6 x_1^3 x_2 + \cdots)$
Slide credit: Andrew Ng
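As a hedged illustration of the circular-boundary example above (the map_features helper and the sample points are my own; only the parameter values $\theta = (-1, 0, 0, 1, 1)$ come from the slide), the quadratic features can be built explicitly and the decision rule $\theta^\top x \geq 0$ applied directly:

```python
import numpy as np

def map_features(x1, x2):
    """Build [1, x1, x2, x1^2, x2^2] for the quadratic hypothesis on the slide."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2])

# Example parameters from the slide: theta = (-1, 0, 0, 1, 1)
theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

# A few made-up test points
x1 = np.array([0.2, 1.5, -0.3, 2.0])
x2 = np.array([0.1, 0.0,  1.1, -1.0])
X = map_features(x1, x2)

# Predict y = 1 exactly when theta^T x >= 0, i.e. x1^2 + x2^2 >= 1 (a circle of radius 1)
predictions = (X @ theta >= 0).astype(int)
print(predictions)   # -> [0 1 1 1]
```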
Training set with 𝑚 examples
$$\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)}) \}$$

$$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}, \qquad x_0 = 1, \qquad y \in \{0, 1\}$$

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$
Slide credit: Andrew Ng
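One common way to lay out such a training set in code (an implementation assumption of mine, not something the slides prescribe) is an m × (n+1) matrix whose first column holds the constant feature $x_0 = 1$:

```python
import numpy as np

# Hypothetical raw features: m = 4 examples, n = 2 features each
X_raw = np.array([[2.0, 3.0],
                  [1.0, 0.5],
                  [4.0, 2.2],
                  [3.5, 1.0]])
y = np.array([0, 0, 1, 1])   # labels y in {0, 1}

# Prepend the x_0 = 1 column so that theta_0 acts as the intercept
X = np.column_stack([np.ones(X_raw.shape[0]), X_raw])
print(X.shape)   # (4, 3): m x (n + 1)
```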
Cost function for Linear Regression
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\!\left( h_\theta(x^{(i)}), y^{(i)} \right)$$

$$\mathrm{Cost}(h_\theta(x), y) = \frac{1}{2} \left( h_\theta(x) - y \right)^2$$

Slide credit: Andrew Ng


Cost function for Logistic Regression
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log\!\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$

[Figure: the two cost curves as functions of $h_\theta(x) \in [0, 1]$, for $y = 1$ (left) and $y = 0$ (right)]
Slide credit: Andrew Ng
Logistic regression cost function
• $\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log\!\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$

• Equivalently: $\mathrm{Cost}(h_\theta(x), y) = -y \log h_\theta(x) - (1 - y) \log\!\left(1 - h_\theta(x)\right)$

• If $y = 1$: $\mathrm{Cost}(h_\theta(x), y) = -\log h_\theta(x)$
• If $y = 0$: $\mathrm{Cost}(h_\theta(x), y) = -\log\!\left(1 - h_\theta(x)\right)$
Slide credit: Andrew Ng
Logistic regression
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\!\left( h_\theta(x^{(i)}), y^{(i)} \right) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\!\left(1 - h_\theta(x^{(i)})\right) \right]$$

Learning: fit the parameters $\theta$ by solving $\min_\theta J(\theta)$

Prediction: given a new $x$, output $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$

Slide credit: Andrew Ng
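A short NumPy sketch of this cost function (my own illustration; the small eps clip is a floating-point safeguard against log(0), not part of the formula):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """Cross-entropy cost J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    m = y.shape[0]
    h = sigmoid(X @ theta)
    h = np.clip(h, eps, 1.0 - eps)   # avoid log(0)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
```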


Gradient descent
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\!\left(1 - h_\theta(x^{(i)})\right) \right]$$

Goal: $\min_\theta J(\theta)$
Good news: $J(\theta)$ is convex!
Bad news: no analytical (closed-form) solution

Repeat {
  $\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta)$   (simultaneously update all $\theta_j$)
}

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
Slide credit: Andrew Ng
Gradient descent
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\!\left(1 - h_\theta(x^{(i)})\right) \right]$$

Goal: $\min_\theta J(\theta)$

Repeat {
  $\theta_j := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$   (simultaneously update all $\theta_j$)
}

Slide credit: Andrew Ng


Gradient descent for Linear Regression
Repeat {
  $\theta_j := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$   with $h_\theta(x) = \theta^\top x$
}

Gradient descent for Logistic Regression
Repeat {
  $\theta_j := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$   with $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
}

The update rule looks identical; the two algorithms differ only in the definition of the hypothesis $h_\theta(x)$.
Slide credit: Andrew Ng
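Putting the update rule into a loop gives the following sketch (the learning rate, iteration count, and zero initialization are placeholder choices of mine, not values from the slides); the vectorized expression Xᵀ(h − y)/m computes all partial derivatives at once, so every θ_j is updated simultaneously:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for logistic regression.

    X is m x (n+1) with a leading column of ones; y is a length-m 0/1 vector.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)        # h_theta(x^(i)) for all i
        grad = X.T @ (h - y) / m      # (1/m) * sum (h - y) * x_j, for all j at once
        theta -= alpha * grad         # simultaneous update of every theta_j
    return theta
```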
Multi-class classification
• Email foldering/tagging: Work, Friends, Family, Hobby

• Medical diagnosis: Not ill, Cold, Flu

• Weather: Sunny, Cloudy, Rain, Snow

Slide credit: Andrew Ng


Binary classification vs. multiclass classification

[Figure: left, binary classification data (two classes) in the $x_1$–$x_2$ plane; right, multiclass data (three classes)]
One-vs-all (one-vs-rest)
[Figure: the three-class data split into three binary problems; classifiers $h_\theta^{(1)}(x)$, $h_\theta^{(2)}(x)$, $h_\theta^{(3)}(x)$ each separate class 1, 2, or 3 from the rest in the $x_1$–$x_2$ plane]

$$h_\theta^{(i)}(x) = P(y = i \mid x; \theta) \qquad (i = 1, 2, 3)$$
Slide credit: Andrew Ng
One-vs-all
• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$

• Given a new input $x$, pick the class $i$ that maximizes
  $$\max_i \; h_\theta^{(i)}(x)$$
Slide credit: Andrew Ng
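A self-contained sketch of one-vs-all (the helper names train_binary, one_vs_all_train, and one_vs_all_predict are my own, hypothetical functions, not from the slides): train one binary classifier per class, then pick the class whose classifier outputs the largest probability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.1, num_iters=1000):
    """Plain batch gradient descent for a single binary logistic regression classifier."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta -= alpha * grad
    return theta

def one_vs_all_train(X, y, classes):
    """One classifier per class: examples of class c are the positives, everything else negative."""
    return {c: train_binary(X, (y == c).astype(float)) for c in classes}

def one_vs_all_predict(models, X):
    """For each row of X, pick the class whose classifier h_theta^(i)(x) is largest."""
    classes = list(models)
    scores = np.column_stack([sigmoid(X @ models[c]) for c in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]
```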
Regularization: The problem of overfitting
Example: Linear regression (housing prices)

[Figure: three fits of price vs. size, from an underfit straight line, to a good quadratic fit, to an overfit high-order polynomial]

Overfitting: If we have too many features, the learned hypothesis may fit the training set very well ($J(\theta) \approx 0$), but fail to generalize to new examples (e.g., predict prices for houses not in the training set).
Andrew Ng
Addressing overfitting:

Options:
1. Reduce the number of features.
   ― Manually select which features to keep.
   ― Model selection algorithm (later in the course).
2. Regularization.
   ― Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
   ― Works well when we have a lot of features, each of which contributes a bit to predicting $y$.

Andrew Ng
Regularization: Cost function intuition

[Figure: two fits of price vs. size of house, a quadratic fit and a higher-order polynomial fit]

Suppose we penalize two of the higher-order parameters and make them really small; the high-order fit then behaves essentially like the quadratic one.

Andrew Ng
Regularization

Small values for the parameters $\theta_0, \theta_1, \ldots, \theta_n$:
― "Simpler" hypothesis
― Less prone to overfitting

Housing example:
― Features: $x_1, x_2, \ldots, x_n$
― Parameters: $\theta_0, \theta_1, \ldots, \theta_n$

Andrew Ng
Regularization

[Figure: price vs. size of house, with a regularized fit]

Andrew Ng
In regularized linear regression, we choose $\theta$ to minimize

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

What if $\lambda$ is set to an extremely large value (perhaps too large for our problem)?
The penalty then drives $\theta_1, \ldots, \theta_n$ toward zero and the hypothesis underfits.

[Figure: price vs. size of house; with a very large $\lambda$ the fit degenerates to a nearly flat line $h_\theta(x) \approx \theta_0$]

Andrew Ng
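A minimal sketch of the regularized cost above (my own illustration, assuming the standard convention that the intercept $\theta_0$ is left out of the penalty):

```python
import numpy as np

def regularized_linear_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * [ sum (h - y)^2 + lam * sum_{j>=1} theta_j^2 ]."""
    m = y.shape[0]
    residual = X @ theta - y                 # h_theta(x) = theta^T x for linear regression
    penalty = lam * np.sum(theta[1:] ** 2)   # theta_0 is not regularized
    return (residual @ residual + penalty) / (2 * m)
```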
Regularization: Regularized linear regression
Gradient descent

Repeat {
  $\theta_0 := \theta_0 - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$

  $\theta_j := \theta_j - \alpha \left[ \dfrac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \dfrac{\lambda}{m} \theta_j \right]$   $(j = 1, 2, \ldots, n)$
}

Andrew Ng
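A sketch of this regularized update in vectorized form (learning rate and iteration count are placeholder values of mine): $\theta_0$ is updated without the penalty term, and every other $\theta_j$ gets the extra $(\lambda/m)\theta_j$.

```python
import numpy as np

def regularized_gradient_descent(X, y, lam, alpha=0.01, num_iters=1000):
    """Gradient descent for regularized linear regression (theta_0 left unpenalized)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        error = X @ theta - y
        grad = X.T @ error / m              # unregularized gradient, all j at once
        grad[1:] += (lam / m) * theta[1:]   # add the regularization term for j >= 1
        theta -= alpha * grad
    return theta
```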
Normal equation

$$\theta = \left( X^\top X + \lambda M \right)^{-1} X^\top y, \qquad M = \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$$

($M$ is the $(n+1) \times (n+1)$ identity matrix with its top-left entry set to 0, so that $\theta_0$ is not regularized)

Andrew Ng
Non-invertibility (optional/advanced)

Suppose $m \leq n$ ($m$ = #examples, $n$ = #features). Then $X^\top X$ is singular (non-invertible).

If $\lambda > 0$, the matrix $X^\top X + \lambda M$ above is invertible, so regularization also fixes this problem.

Andrew Ng
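And a sketch of the regularized normal equation from the previous slide (my own illustration); for $\lambda > 0$ the matrix being inverted is invertible even when $m \leq n$.

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """Closed-form theta = (X^T X + lam * M)^{-1} X^T y, with M = identity except M[0, 0] = 0."""
    n = X.shape[1]
    M = np.eye(n)
    M[0, 0] = 0.0   # do not regularize the intercept theta_0
    return np.linalg.solve(X.T @ X + lam * M, X.T @ y)
```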
References
 Andrew Ng’s slides on Logistic Regression and Regularization from his Machine Learning course on Coursera.

Andrew Ng
Disclaimer
 The content of this presentation is not original; it has been prepared from various sources for teaching purposes.

Andrew Ng
