Lecture Note #9 - PEC-CS701E
Logistic Regression is a Machine Learning algorithm used for classification problems; it is a predictive-analysis algorithm based on the concept of probability.
•Logistic regression predicts the output of a categorical dependent variable, so the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. However, instead of returning exactly 0 or 1, it returns probabilistic values that lie between 0 and 1.
•In logistic regression the predicted value $y$ is a probability, so it can only lie between 0 and 1. Dividing $y$ by $(1 - y)$ gives the odds, $y/(1 - y)$, whose range is $0$ to $+\infty$.
•But we need a range from $-\infty$ to $+\infty$, so we take the logarithm, which gives the log-odds (logit), $\log\frac{y}{1-y}$; this quantity can then be modeled as a linear function of the input.
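Written out explicitly (using the same $\theta$ notation as the hypothesis introduced later), modeling the log-odds as a linear function of the input and solving for $y$ recovers the sigmoid form:
$\log\dfrac{y}{1 - y} = \theta^\top x \;\Rightarrow\; \dfrac{y}{1 - y} = e^{\theta^\top x} \;\Rightarrow\; y = \dfrac{1}{1 + e^{-\theta^\top x}}$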
Logistic Regression
• Hypothesis representation
• Cost function
• Regularization
• Multi-class classification
[Figure: classification example – Malignant? (1 = Yes, 0 = No) plotted against Tumor Size]
$h_\theta(x) = \theta^\top x$ (from linear regression) can be $> 1$ or $< 0$
Logistic regression: $0 \le h_\theta(x) \le 1$
• Sigmoid function (logistic function): $g(z) = \dfrac{1}{1 + e^{-z}}$, so $h_\theta(x) = g(\theta^\top x)$
[Figure: the sigmoid function $g(z)$ plotted against $z$]
Interpretation of hypothesis output
• $h_\theta(x)$ = estimated probability that $y = 1$ on input $x$
• Example: if $x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$ and $h_\theta(x) = 0.7$, the model estimates a 70% chance that the tumor is malignant.
[Figure: training data plotted with Tumor Size on the horizontal axis and Age on the vertical axis; the line $x_1 + x_2 = 3$ separates the two classes]
• E.g., $\theta_0 = -3$, $\theta_1 = 1$, $\theta_2 = 1$
• Predict “$y = 1$” if $-3 + x_1 + x_2 \ge 0$
Hypothesis representation
• Logistic regression hypothesis representation:
$h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}} = \dfrac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n)}}$
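As a concrete illustration, here is a minimal NumPy sketch of this hypothesis (function and variable names are mine, not from the lecture); it also reproduces the earlier $\theta_0 = -3$, $\theta_1 = 1$, $\theta_2 = 1$ decision-boundary example with illustrative feature values:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x); x must include x0 = 1."""
    return sigmoid(theta @ x)

theta = np.array([-3.0, 1.0, 1.0])   # theta0, theta1, theta2 from the example
x = np.array([1.0, 2.0, 2.0])        # x0 = 1, x1 = 2, x2 = 2 (illustrative values)
print(h(theta, x))                   # ~0.73 > 0.5, i.e. predict y = 1 since -3 + x1 + x2 >= 0
```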
• Consider learning $f: X \to Y$, where
  • $X$ is a vector of real-valued features $(X_1, \cdots, X_n)^\top$
  • $Y$ is Boolean
• Assume all $X_i$ are conditionally independent given $Y$
• Model $P(X_i \mid Y = y_k)$ as Gaussian $N(\mu_{ik}, \sigma_i)$
• Model $P(Y)$ as Bernoulli($\pi$)
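Under these assumptions the posterior takes exactly the logistic form used above; this is the standard Gaussian-Naive-Bayes-to-logistic-regression derivation, sketched here for completeness rather than quoted from the lecture:
$P(Y = 1 \mid X) = \dfrac{1}{1 + \frac{P(Y=0)\,P(X \mid Y=0)}{P(Y=1)\,P(X \mid Y=1)}} = \dfrac{1}{1 + \exp\left(w_0 + \sum_i w_i X_i\right)}$, with $w_i = \dfrac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}$ and $w_0 = \ln\dfrac{1 - \pi}{\pi} + \sum_i \dfrac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}$.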
Cost function
Training set with $m$ examples
$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$
$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$, $x_0 = 1$, $y \in \{0, 1\}$
$h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
How to choose parameters 𝜃?
Cost function for Linear Regression
$J(\theta) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \dfrac{1}{m}\sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$
$\mathrm{Cost}(h_\theta(x), y) = \dfrac{1}{2}\left(h_\theta(x) - y\right)^2$
Cost function for Logistic Regression
$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
[Figure: $-\log(h_\theta(x))$ plotted against $h_\theta(x) \in [0, 1]$ for $y = 1$, and $-\log(1 - h_\theta(x))$ for $y = 0$]
Logistic regression cost function
• $\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
• If $y = 1$: $\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$, which is 0 when $h_\theta(x) = 1$ and grows without bound as $h_\theta(x) \to 0$
• If $y = 0$: $\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$, which is 0 when $h_\theta(x) = 0$ and grows without bound as $h_\theta(x) \to 1$
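A small sketch of this per-example cost (names are illustrative, not from the lecture); note how the penalty blows up when the model is confidently wrong:

```python
import numpy as np

def cost(h_x, y):
    """Per-example logistic cost: -log(h) if y = 1, -log(1 - h) if y = 0."""
    return -np.log(h_x) if y == 1 else -np.log(1.0 - h_x)

print(cost(0.9, 1))   # ~0.105: confident and correct -> small cost
print(cost(0.1, 1))   # ~2.303: confident and wrong   -> large cost
```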
Logistic regression
$J(\theta) = \dfrac{1}{m}\sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\dfrac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})) \right]$
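The same averaged cost over a training set, written as a vectorized NumPy sketch (assuming the design matrix X already contains the $x_0 = 1$ column; names are illustrative):

```python
import numpy as np

def J(theta, X, y):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ], with h = sigmoid(X @ theta)."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

X = np.array([[1.0, 0.5], [1.0, 2.0]])   # two examples: x0 = 1 plus one feature
y = np.array([0.0, 1.0])
print(J(np.zeros(2), X, y))              # log(2) ~ 0.693 when theta = 0 (h = 0.5 everywhere)
```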
Maximum conditional likelihood estimate for parameter $\theta$
• Goal: choose $\theta$ to maximize the conditional likelihood of the training data
• $P_\theta(Y = 1 \mid X = x) = h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
• $P_\theta(Y = 0 \mid X = x) = 1 - h_\theta(x) = \dfrac{e^{-\theta^\top x}}{1 + e^{-\theta^\top x}}$
• Training data $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$
• Data likelihood $= \prod_{i=1}^{m} P_\theta(x^{(i)}, y^{(i)})$, so $\theta_{\mathrm{MLE}} = \operatorname{argmax}_\theta \prod_{i=1}^{m} P_\theta(x^{(i)}, y^{(i)})$
• Conditional data likelihood: $\theta_{\mathrm{MCLE}} = \operatorname{argmax}_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})$
Expressing conditional log-likelihood
$\ell(\theta) = \log \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)}) = \sum_{i=1}^{m} y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))$
Maximizing this conditional log-likelihood is therefore equivalent to minimizing $J(\theta)$, i.e. the average of
$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
Gradient descent
$J(\theta) = -\dfrac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})) \right]$
Goal: $\min_\theta J(\theta)$. Good news: convex function! Bad news: no analytical solution.
Regularization
How about MAP?
• Maximum conditional likelihood estimate (MCLE)
$\theta_{\mathrm{MCLE}} = \operatorname{argmax}_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})$
• Maximum conditional a posteriori (MCAP) estimate
$\theta_{\mathrm{MCAP}} = \operatorname{argmax}_\theta \left[\prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})\right] P(\theta)$
Prior 𝑃(𝜃)
• Common choice of 𝑃(𝜃):
• Normal distribution, zero mean, identity covariance
• “Pushes” parameters towards zeros
• Corresponds to Regularization
• Helps avoid very large weights and overfitting
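To make "corresponds to regularization" concrete (a sketch of the standard algebra, not spelled out in the lecture): with a zero-mean Gaussian prior $P(\theta) \propto \exp(-\|\theta\|^2 / 2\sigma^2)$, taking logarithms in the MCAP objective gives
$\theta_{\mathrm{MCAP}} = \operatorname{argmax}_\theta \sum_{i=1}^{m} \log P_\theta(y^{(i)} \mid x^{(i)}) - \dfrac{1}{2\sigma^2}\|\theta\|^2 = \operatorname{argmin}_\theta \; J(\theta) + \dfrac{\lambda}{2m}\|\theta\|^2, \quad \lambda = \dfrac{1}{\sigma^2}$,
so a stronger prior (smaller $\sigma^2$) means a larger penalty pushing the weights towards zero.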
MLE vs. MAP
• Maximum conditional likelihood estimate (MCLE): gradient descent update on $J(\theta)$
$\theta_j := \theta_j - \alpha \dfrac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
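A compact batch gradient descent sketch for this update (illustrative names; the optional λ-term is the standard L2 regularization implied by the Gaussian prior above, not an update written out in the lecture):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, lam=0.0, iters=1000):
    """Repeat: theta_j := theta_j - alpha * [ (1/m) * sum_i (h(x_i) - y_i) * x_ij + (lam/m) * theta_j ]."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x) for all m examples
        grad = X.T @ (h - y) / m               # gradient of J(theta)
        grad[1:] += (lam / m) * theta[1:]      # L2 penalty (skip the bias theta_0)
        theta -= alpha * grad
    return theta
```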
Multi-class classification
• Email foldering/tagging: Work, Friends, Family, Hobby
[Figure: binary classification data vs. multi-class classification data plotted in the $(x_1, x_2)$ plane]
One-vs-all (one-vs-rest)
[Figure: the three-class training set is split into three binary problems, one per class; classifier $h_\theta^{(1)}(x)$ separates class 1 from the rest, $h_\theta^{(2)}(x)$ separates class 2, and $h_\theta^{(3)}(x)$ separates class 3, each in the $(x_1, x_2)$ plane]
$h_\theta^{(i)}(x) = P(y = i \mid x; \theta) \quad (i = 1, 2, 3)$
One-vs-all
• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$
• On a new input $x$, predict the class $i$ whose classifier $h_\theta^{(i)}(x)$ is largest
Prediction with a generative model (e.g. Naïve Bayes): $\hat{y} = \operatorname{argmax}_y P(Y = y)\, P(X = x \mid Y = y)$
Prediction with a discriminative model (e.g. logistic regression): $\hat{y} = \operatorname{argmax}_y P(Y = y \mid X = x)$
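A one-vs-all sketch in the same spirit (illustrative names; `fit_binary` stands for any binary logistic regression trainer, e.g. the gradient_descent sketch above):

```python
import numpy as np

def train_one_vs_all(X, y, classes, fit_binary):
    """Fit one binary logistic regression classifier per class i (class i vs. the rest)."""
    return {c: fit_binary(X, (y == c).astype(float)) for c in classes}

def predict_one_vs_all(models, x):
    """Predict the class i whose classifier h^(i)(x) gives the highest probability."""
    probs = {c: 1.0 / (1.0 + np.exp(-(theta @ x))) for c, theta in models.items()}
    return max(probs, key=probs.get)
```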
Things to remember
• Hypothesis representation: $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
• Cost function: $\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
Disadvantages:
– Linear decision boundary