Lecture Note #9 - PEC-CS701E

Logistic Regression

Logistic Regression is a machine learning algorithm used for classification problems. It is a predictive-analysis algorithm based on the concept of probability.
•Logistic regression predicts the output of a categorical dependent variable, so the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. However, instead of giving the exact value 0 or 1, it gives probabilistic values that lie between 0 and 1.

•In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function whose output is bounded by the two extreme values 0 and 1.
•The curve of the logistic function indicates the likelihood of an outcome, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.
Logistic Function (Sigmoid Function):
Logistic Function (Sigmoid Function):
•The sigmoid function is a mathematical function used to map predicted values to probabilities.
•It maps any real value to a value in the range 0 to 1.
•The output of logistic regression must stay between 0 and 1 and cannot go beyond this limit, so it forms an "S"-shaped curve. This S-shaped curve is called the sigmoid function or the logistic function.
•In logistic regression, we use the concept of a threshold value, which decides between 0 and 1: values above the threshold tend towards 1, and values below the threshold tend towards 0.
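As a concrete illustration, here is a minimal NumPy sketch (not from the lecture) of how the sigmoid squashes arbitrary real values into the interval (0, 1) and how a 0.5 threshold turns those probabilities into class labels; the sample values of z are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real value z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6.0, -1.5, 0.0, 1.5, 6.0])
p = sigmoid(z)                    # probabilities between 0 and 1
labels = (p >= 0.5).astype(int)   # threshold at 0.5 -> class 0 or class 1

print(np.round(p, 4))             # [0.0025 0.1824 0.5    0.8176 0.9975]
print(labels)                     # [0 0 1 1 1]
```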
Assumptions for Logistic Regression:

•The dependent variable must be categorical in nature.


•The independent variables should not exhibit multicollinearity.
Logistic Regression Equation:
The logistic regression equation can be obtained from the linear regression equation. The mathematical steps to get the logistic regression equation are given below:
•We know the equation of a straight line can be written as:
  $y = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n$
•In logistic regression, $y$ can only be between 0 and 1, so we divide $y$ by $(1 - y)$ to form the odds $\frac{y}{1-y}$, which ranges from 0 (for $y = 0$) to infinity (for $y = 1$).
•But we need a range from $-\infty$ to $+\infty$, so we take the logarithm, and the equation becomes:
  $\log\left[\frac{y}{1-y}\right] = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n$
The above equation is the final equation for logistic regression.
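To make the derivation concrete, the small numeric check below shows that the log-odds (logit) recovers the straight-line part, i.e., the logarithm of the odds undoes the sigmoid. The coefficients b0, b1 and the feature value x1 are hypothetical values chosen only for this demo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and feature value, chosen only for the demo.
b0, b1 = -4.0, 0.5
x1 = 10.0

z = b0 + b1 * x1                 # the straight-line part: b0 + b1*x1 = 1.0
y = sigmoid(z)                   # y is squeezed into (0, 1)
log_odds = np.log(y / (1 - y))   # log(y / (1 - y)) recovers b0 + b1*x1

print(round(y, 3), round(log_odds, 3))   # 0.731 1.0
```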


Types of Logistic Regression:
• On the basis of the categories, logistic regression can be classified into three types:
• Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
• Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
Linear Regression vs Logistic Regression
• Linear regression is used to predict a continuous dependent variable from a given set of independent variables; logistic regression is used to predict a categorical dependent variable from a given set of independent variables.
• Linear regression is used for solving regression problems; logistic regression is used for solving classification problems.
• In linear regression, we predict the values of continuous variables; in logistic regression, we predict the values of categorical variables.
• In linear regression, we find the best-fit line, by which we can easily predict the output; in logistic regression, we find the S-curve, by which we can classify the samples.
• Least-squares estimation is used to estimate the parameters of linear regression; maximum-likelihood estimation is used to estimate the parameters of logistic regression.
• The output of linear regression must be a continuous value, such as price or age; the output of logistic regression must be a categorical value, such as 0 or 1, Yes or No.
• In linear regression, the relationship between the dependent and independent variables must be linear; in logistic regression, a linear relationship between the dependent and independent variables is not required.
• In linear regression, there may be collinearity between the independent variables; in logistic regression, there should not be collinearity between the independent variables.
Logistic Regression

• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification
[Figure: tumor example — Malignant? (1 = Yes / 0 = No) plotted against Tumor Size, with a fitted line $h_\theta(x) = \theta^\top x$]

• Threshold the classifier output $h_\theta(x)$ at 0.5:
  • If $h_\theta(x) \ge 0.5$, predict "$y = 1$"
  • If $h_\theta(x) < 0.5$, predict "$y = 0$"
Classification: $y = 1$ or $y = 0$

$h_\theta(x) = \theta^\top x$ (from linear regression) can be $> 1$ or $< 0$

Logistic regression: $0 \le h_\theta(x) \le 1$

Despite its name, logistic regression is actually used for classification.

Hypothesis representation
• Want $0 \le h_\theta(x) \le 1$
• $h_\theta(x) = g(\theta^\top x)$, where $g(z) = \dfrac{1}{1 + e^{-z}}$, so $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
• $g(z)$ is called the sigmoid function or logistic function; its graph is the "S"-shaped curve of $g(z)$ versus $z$.
Interpretation of hypothesis output
• $h_\theta(x)$ = estimated probability that $y = 1$ on input $x$
• Example: if $x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$ and $h_\theta(x) = 0.7$,
  tell the patient there is a 70% chance of the tumor being malignant.


Logistic regression
$h_\theta(x) = g(\theta^\top x)$, where $g(z) = \dfrac{1}{1 + e^{-z}}$ and $z = \theta^\top x$

Suppose we predict "$y = 1$" if $h_\theta(x) \ge 0.5$, i.e., when $z = \theta^\top x \ge 0$,
and predict "$y = 0$" if $h_\theta(x) < 0.5$, i.e., when $z = \theta^\top x < 0$.
Decision boundary
• $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$
[Figure: training examples plotted by Tumor Size and Age]
• E.g., with $\theta_0 = -3$, $\theta_1 = 1$, $\theta_2 = 1$:
• Predict "$y = 1$" if $-3 + x_1 + x_2 \ge 0$ (a sketch of this rule is shown below)
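The following sketch applies the example parameters $\theta_0 = -3$, $\theta_1 = 1$, $\theta_2 = 1$ from this slide; the helper function `predict` and the test points are illustrative, not part of the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])       # theta_0, theta_1, theta_2 from the slide

def predict(x1, x2):
    """Return (probability of y = 1, predicted class) for a point (x1, x2)."""
    z = theta @ np.array([1.0, x1, x2])  # theta_0 + theta_1*x1 + theta_2*x2
    p = sigmoid(z)
    return p, int(p >= 0.5)

print(predict(1.0, 1.0))   # z = -1 -> p ~ 0.27, class 0 (x1 + x2 < 3)
print(predict(2.0, 2.0))   # z = +1 -> p ~ 0.73, class 1 (x1 + x2 >= 3)
```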
Hypothesis representation
• Logistic regression hypothesis representation:
  $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}} = \dfrac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n)}}$
• Consider learning $f: X \to Y$, where
  • $X$ is a vector of real-valued features $(X_1, \dots, X_n)^\top$
  • $Y$ is Boolean
  • Assume all $X_i$ are conditionally independent given $Y$
  • Model $P(X_i \mid Y = y_k)$ as Gaussian $N(\mu_{ik}, \sigma_i)$
  • Model $P(Y)$ as Bernoulli($\pi$)
Logistic Regression

• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification
Training set with $m$ examples
$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$

$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$, with $x_0 = 1$ and $y \in \{0, 1\}$

$h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$

How do we choose the parameters $\theta$?
Cost function for Linear Regression
$J(\theta) = \dfrac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \dfrac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$

where $\mathrm{Cost}(h_\theta(x), y) = \dfrac{1}{2}\left(h_\theta(x) - y\right)^2$
Cost function for Logistic Regression
$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

[Figure: plots of the cost against $h_\theta(x) \in [0, 1]$, one for $y = 1$ and one for $y = 0$]
Logistic regression cost function
• $\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
• Equivalently, $\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$
  • If $y = 1$: $\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$
  • If $y = 0$: $\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$
Logistic regression
$J(\theta) = \dfrac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$

Learning: fit parameters $\theta$ by solving $\min_\theta J(\theta)$
Prediction: given a new $x$, output $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
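A minimal sketch of $J(\theta)$ in NumPy is given below. The small epsilon guard inside the logarithm and the tiny toy data set are assumptions added for the demo, not part of the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost J(theta).

    X is the (m, n+1) design matrix whose first column is all ones (x0 = 1),
    y is the (m,) vector of 0/1 labels.
    """
    m = len(y)
    h = sigmoid(X @ theta)
    eps = 1e-12   # keeps log() finite if h ever hits exactly 0 or 1
    return -(1.0 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Tiny made-up data set: x0 = 1 plus a single feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(cost(np.zeros(2), X, y))   # ~0.693, i.e. -log(0.5) when h = 0.5 everywhere
```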
Where does the cost come from?
• Training set with $m$ examples: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$
• Maximum likelihood estimate for parameter $\theta$:
  $\theta_{\mathrm{MLE}} = \arg\max_\theta P_\theta\left(x^{(1)}, y^{(1)}, \dots, x^{(m)}, y^{(m)}\right) = \arg\max_\theta \prod_{i=1}^{m} P_\theta\left(x^{(i)}, y^{(i)}\right)$
• Maximum conditional likelihood estimate for parameter $\theta$:
  • Goal: choose $\theta$ to maximize the conditional likelihood of the training data
  • $P_\theta(Y = 1 \mid X = x) = h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
  • $P_\theta(Y = 0 \mid X = x) = 1 - h_\theta(x) = \dfrac{e^{-\theta^\top x}}{1 + e^{-\theta^\top x}}$
  • Training data $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$
  • Data likelihood $= \prod_{i=1}^{m} P_\theta\left(x^{(i)}, y^{(i)}\right)$
  • Data conditional likelihood $= \prod_{i=1}^{m} P_\theta\left(y^{(i)} \mid x^{(i)}\right)$
  $\theta_{\mathrm{MCLE}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta\left(y^{(i)} \mid x^{(i)}\right)$
Expressing the conditional log-likelihood
$L(\theta) = \log \prod_{i=1}^{m} P_\theta\left(y^{(i)} \mid x^{(i)}\right) = \sum_{i=1}^{m} \log P_\theta\left(y^{(i)} \mid x^{(i)}\right)$
$= \sum_{i=1}^{m} \left[ y^{(i)} \log P_\theta\left(y^{(i)} = 1 \mid x^{(i)}\right) + (1 - y^{(i)}) \log P_\theta\left(y^{(i)} = 0 \mid x^{(i)}\right) \right]$
$= \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$

Maximizing $L(\theta)$ is therefore the same as minimizing $J(\theta) = -\frac{1}{m} L(\theta)$, which is exactly the per-example cost
$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
Logistic Regression

• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification
Gradient descent
$J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$

Goal: $\min_\theta J(\theta)$.  Good news: $J(\theta)$ is a convex function!  Bad news: there is no analytical solution.

Repeat {
  $\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta)$   (simultaneously update all $\theta_j$)
}
where $\dfrac{\partial}{\partial \theta_j} J(\theta) = \dfrac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Gradient descent
$J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$

Goal: $\min_\theta J(\theta)$

Repeat {
  $\theta_j := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$   (simultaneously update all $\theta_j$)
}
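The sketch below vectorizes this update rule with NumPy. The learning rate, iteration count, and toy training set are illustrative choices, not values given in the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent for logistic regression.

    Implements theta_j := theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i),
    with all components of theta updated simultaneously (vectorized).
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)            # predictions for all m examples
        grad = (X.T @ (h - y)) / m        # (1/m) * sum_i (h - y) * x_j
        theta -= alpha * grad
    return theta

# Made-up training set: x0 = 1 plus one feature; larger feature -> label 1.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print(np.round(sigmoid(X @ theta), 2))   # probabilities close to [0, 0, 1, 1]
```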
Gradient descent for Linear Regression
Repeat {
  $\theta_j := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$, with $h_\theta(x) = \theta^\top x$
}

Gradient descent for Logistic Regression
Repeat {
  $\theta_j := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$, with $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
}

The update rule looks identical in both cases; the two algorithms differ only in the definition of the hypothesis $h_\theta(x)$.
Logistic Regression

• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification
How about MAP?
• Maximum conditional likelihood estimate (MCLE):
  $\theta_{\mathrm{MCLE}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta\left(y^{(i)} \mid x^{(i)}\right)$
• Maximum conditional a posteriori estimate (MCAP):
  $\theta_{\mathrm{MCAP}} = \arg\max_\theta \left[ \prod_{i=1}^{m} P_\theta\left(y^{(i)} \mid x^{(i)}\right) \right] P(\theta)$
Prior $P(\theta)$
• Common choice of $P(\theta)$:
  • Normal distribution, zero mean, identity covariance
  • "Pushes" parameters towards zero
  • Corresponds to regularization
  • Helps avoid very large weights and overfitting
MLE vs. MAP
• Maximum conditional likelihood estimate (MCLE):
  $\theta_j := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
• Maximum conditional a posteriori estimate (MCAP):
  $\theta_j := \theta_j - \alpha \lambda \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
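Below is a sketch of a single MCAP-style (L2-regularized) update step. Leaving the intercept $\theta_0$ out of the shrinkage term is a common convention assumed here, not something stated on the slide; the data used in the usage line is made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_step(theta, X, y, alpha=0.1, lam=1.0):
    """One MCAP-style (L2-regularized) gradient descent step:

    theta_j := theta_j - alpha*lam*theta_j - alpha*(1/m)*sum_i (h(x^(i)) - y^(i)) * x_j^(i)

    Assumption (not on the slide): the intercept theta_0 is conventionally
    left out of the shrinkage term, so it is not penalized here.
    """
    m = len(y)
    h = sigmoid(X @ theta)
    grad = (X.T @ (h - y)) / m
    shrink = lam * theta
    shrink[0] = 0.0                       # do not shrink the bias term
    return theta - alpha * (shrink + grad)

# One illustrative step on made-up data.
X = np.array([[1.0, 2.0, 1.0], [1.0, 0.5, 3.0]])
y = np.array([1.0, 0.0])
print(regularized_step(np.zeros(3), X, y))
```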
Logistic Regression

• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification
Multi-class classification
• Email foldering/tagging: Work, Friends, Family, Hobby
• Medical diagnosis: Not ill, Cold, Flu
• Weather: Sunny, Cloudy, Rain, Snow

[Figure: binary classification (two clusters in the $x_1$–$x_2$ plane) vs. multiclass classification (three clusters in the $x_1$–$x_2$ plane)]
One-vs-all (one-vs-rest)
[Figure: the three-class data set split into three binary problems, one per class; classifier $h_\theta^{(1)}(x)$ separates class 1 from the rest, $h_\theta^{(2)}(x)$ separates class 2, and $h_\theta^{(3)}(x)$ separates class 3]

$h_\theta^{(i)}(x) = P(y = i \mid x; \theta)$ for $i = 1, 2, 3$
One-vs-all
• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
• Given a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$:
  $\max_i h_\theta^{(i)}(x)$
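A sketch of the one-vs-all scheme using the gradient-descent classifier from earlier is shown below. The three-class toy data set, class count, and training settings are made up for the demo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, alpha=0.1, iters=3000):
    """Fit one binary logistic regression classifier h_theta^(i) per class i."""
    m, n = X.shape
    all_theta = np.zeros((num_classes, n))
    for i in range(num_classes):
        yi = (y == i).astype(float)           # 1 for class i, 0 for "all the rest"
        theta = np.zeros(n)
        for _ in range(iters):
            h = sigmoid(X @ theta)
            theta -= alpha * (X.T @ (h - yi)) / m
        all_theta[i] = theta
    return all_theta

def predict_one_vs_all(all_theta, x):
    """Pick the class whose classifier reports the highest probability."""
    return int(np.argmax(sigmoid(all_theta @ x)))

# Made-up three-class data: x0 = 1 plus two features, one cluster per class.
X = np.array([[1.0, 0.0, 0.0], [1.0, 0.5, 0.5],
              [1.0, 4.0, 0.0], [1.0, 4.5, 0.5],
              [1.0, 0.0, 4.0], [1.0, 0.5, 4.5]])
y = np.array([0, 0, 1, 1, 2, 2])
all_theta = train_one_vs_all(X, y, num_classes=3)
print([predict_one_vs_all(all_theta, x) for x in X])   # expected: [0, 0, 1, 1, 2, 2]
```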
Generative Approach vs. Discriminative Approach
• Generative approach (e.g., Naïve Bayes): estimate $P(Y)$ and $P(X \mid Y)$.
  Prediction: $\hat{y} = \arg\max_y P(Y = y)\, P(X = x \mid Y = y)$
• Discriminative approach (e.g., logistic regression): estimate $P(Y \mid X)$ directly (or a discriminant function, e.g., SVM).
  Prediction: $\hat{y} = \arg\max_y P(Y = y \mid X = x)$
Things to remember
• Hypothesis representation: $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
• Cost function: $\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
• Logistic regression with gradient descent: $\theta_j := \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
• Regularization: $\theta_j := \theta_j - \alpha \lambda \theta_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
• Multi-class classification: one-vs-all, predict the class with the largest $h_\theta^{(i)}(x)$
Logistic regression
Advantages:
– Makes no assumptions about distributions of classes in feature
space
– Easily extended to multiple classes (multinomial regression)
– Natural probabilistic view of class predictions
– Quick to train
– Very fast at classifying unknown records
– Good accuracy for many simple data sets
– Relatively resistant to overfitting, especially when regularization is used
– Can interpret model coefficients as indicators of feature
importance

Disadvantages:
– Linear decision boundary
