Machine Learning
Logistic Regression
Some content in the slides is based on Dr. Razvan’s and Dr. Andrew’s lectures
Logistic Regression
Supervised Learning: Types of Output
• Regression → Linear Regression
• Classification → Logistic Regression
• Binary classification:
• "y" can only be one of two values:
- false: 0: "negative class" = "absence"
- true: 1: "positive class" = "presence"
Linear Regression Approach
Fit a line $f_{w,b}(x) = wx + b$ (the intercept $b$ is sometimes written $w_0$) and predict $\hat{y} = 1$ when $f_{w,b}(x)$ is at or above a threshold, $\hat{y} = 0$ otherwise.

[Figures: malignant? ((No) 0 / (Yes) 1) versus tumor size $x$ (diameter in cm), with the fitted line $f_{w,b}(x) = wx + b$. Two panels use a threshold of 0.5 and a third uses 0.7, illustrating how the predicted labels depend on where the fitted line crosses the threshold.]
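To see the weakness of this approach numerically, here is a minimal sketch (assuming NumPy; the tumor sizes and labels below are made up for illustration). It fits a least-squares line to the binary labels and reports where $f(x)$ crosses 0.5, before and after adding one very large malignant tumor:

```python
import numpy as np

def threshold_crossing(x, y):
    # Fit a least-squares line f(x) = w*x + b; return the x where f(x) = 0.5.
    w, b = np.polyfit(x, y, 1)
    return (0.5 - b) / w

x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])  # tumor sizes in cm (illustrative)
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # malignant? (0 = no, 1 = yes)
print(threshold_crossing(x, y))   # ~4.5: cleanly separates the two groups

# One extra, unambiguously malignant tumor drags the fitted line down,
# shifting the 0.5 crossing past 6 cm and misclassifying the 6 cm example.
x2, y2 = np.append(x, 50.0), np.append(y, 1.0)
print(threshold_crossing(x2, y2))  # ~6.2
```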
$f_{\vec{w},b}(\vec{x}) = g(z) = g(w_1 x_1 + w_2 x_2 + b)$

Decision Boundary: $z = \vec{w} \cdot \vec{x} + b = 0$
(set $w_1 = 1$, $w_2 = 1$, $b = -3$):
$z = x_1 + x_2 - 3 = 0$, i.e. the line $x_1 + x_2 = 3$ in the $(x_1, x_2)$ plane.
With quadratic features, $z = w_1 x_1^2 + w_2 x_2^2 + b$.

Decision Boundary (set $w_1 = 1$, $w_2 = 1$, $b = -1$):
$z = x_1^2 + x_2^2 - 1 = 0$, i.e. the unit circle $x_1^2 + x_2^2 = 1$.
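Both boundaries follow from one fact: the sigmoid satisfies $g(z) \ge 0.5$ exactly when $z \ge 0$, so the predicted class flips precisely on the curve $z = 0$. A minimal sketch (assuming NumPy) checks this for the two examples above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_linear(x1, x2):
    # z = x1 + x2 - 3 (w1 = 1, w2 = 1, b = -3); boundary is the line x1 + x2 = 3
    z = x1 + x2 - 3.0
    return int(sigmoid(z) >= 0.5)  # equivalent to z >= 0

def predict_circle(x1, x2):
    # z = x1^2 + x2^2 - 1 (w1 = 1, w2 = 1, b = -1); boundary is the unit circle
    z = x1**2 + x2**2 - 1.0
    return int(sigmoid(z) >= 0.5)

print(predict_linear(1, 1), predict_linear(2, 2))      # 0 1: below/above the line
print(predict_circle(0.5, 0.5), predict_circle(1, 1))  # 0 1: inside/outside the circle
```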
Loss Function
Training Set
tumor size (cm)   …   patient's age   malignant?
x_1               …   x_n             y
10                …   52              1
2                 …   73              0
5                 …   55              0
12                …   49              1
…                 …   …               …

$i = 1, 2, \cdots, m$ ($m$: number of training samples)
$j = 1, 2, \cdots, n$ ($n$: number of features)
target $y$ is 0 or 1
$$f_{\vec{w},b}(\vec{x}) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}}$$

How to choose $\vec{w} = [w_1, w_2, w_3, \cdots, w_n]$ and $b$?
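In code, the model is just the sigmoid applied to a linear score; a minimal sketch (assuming NumPy; `f_wb` is an illustrative name):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}): maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def f_wb(x, w, b):
    # f_{w,b}(x) = g(w . x + b), read as the model's estimate of P(y = 1 | x)
    return sigmoid(np.dot(x, w) + b)
```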
Loss Function
$$L\left(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}\right) = \begin{cases} -\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) & \text{if } y^{(i)} = 1 \\ -\log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right) & \text{if } y^{(i)} = 0 \end{cases}$$

If $y^{(i)} = 1$: as $f_{\vec{w},b}(\vec{x}^{(i)}) \to 1$, loss $\to 0$; as $f_{\vec{w},b}(\vec{x}^{(i)}) \to 0$, loss $\to \infty$.
If $y^{(i)} = 0$: as $f_{\vec{w},b}(\vec{x}^{(i)}) \to 1$, loss $\to \infty$; as $f_{\vec{w},b}(\vec{x}^{(i)}) \to 0$, loss $\to 0$.
Simplified Loss Function
• Overall:
$$J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} L\left(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}\right) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log f_{\vec{w},b}(\vec{x}^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right) \right]$$
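In vectorized form the cost is a few lines; a sketch assuming NumPy, with a small clip added to guard against log(0):

```python
import numpy as np

def cost(X, y, w, b, eps=1e-12):
    # J(w, b) = -(1/m) * sum[ y*log(f) + (1 - y)*log(1 - f) ]
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predictions in (0, 1)
    f = np.clip(f, eps, 1.0 - eps)          # avoid log(0) at saturated predictions
    return -np.mean(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))
```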
• Gradient Descent:
Repeat {
$w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w}, b)$, where $\frac{\partial}{\partial w_j} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right) x_j^{(i)}$
$b = b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)$, where $\frac{\partial}{\partial b} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$
} simultaneous updates
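Putting the pieces together, a minimal batch gradient-descent trainer (a sketch assuming NumPy; `alpha` and `iters` are illustrative hyperparameters). Both gradients share the residual $(f - y)$, and both are computed before either parameter changes, which is what "simultaneous updates" means:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    # X: (m, n) feature matrix; y: (m,) labels in {0, 1}
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        f = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # f_{w,b}(x^(i)) for every sample
        err = f - y                             # shared (f - y) residual
        dj_dw = X.T @ err / m                   # dJ/dw_j = (1/m) sum (f - y) x_j
        dj_db = err.mean()                      # dJ/db  = (1/m) sum (f - y)
        w -= alpha * dj_dw                      # simultaneous updates:
        b -= alpha * dj_db                      # both gradients use the old w, b
    return w, b
```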
Bias & Variance
• Bias and Variance are two fundamental concepts in machine learning that
pertain to the errors associated with predictive models.
• Bias: The difference between the expected (or true) values and the predicted values is known as bias error, or error due to bias. Bias is a systematic error that occurs due to wrong assumptions in the machine learning process.
• Low Bias: In this case, the model will closely match the training dataset.
• High Bias: If a model has high bias, this means it can't capture the
patterns in the data, no matter how much you train it. The model is too
simplistic. This scenario is often referred to as underfitting.
• Variance: Variance is the amount by which the performance of a predictive model changes when it is trained on different subsets of the training data. More specifically, variance measures how sensitive the model is to the particular subset it was trained on, i.e., how much it adjusts when given a new subset of the training data.
• Low Variance: The model is less sensitive to changes in the training data and produces consistent estimates of the target function across different subsets of data from the same distribution.
• High Variance: The model is very sensitive to changes in the training data, and its estimate of the target function can change significantly when trained on different subsets of data from the same distribution.
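One way to see both failure modes is to fit the same kind of noisy data with a too-simple and a too-flexible model; a sketch (assuming NumPy; the sine data, polynomial degrees, and probe point are arbitrary illustrations):

```python
import numpy as np

x = np.linspace(0, 1, 20)
y_true = np.sin(2 * np.pi * x)   # underlying target; value 1.0 at x = 0.25

for degree in (1, 9):
    # Train on two different noise draws from the same distribution and
    # compare the two resulting fits at the probe point x = 0.25.
    fits = []
    for seed in (1, 2):
        noise = np.random.default_rng(seed).normal(0, 0.3, x.size)
        coeffs = np.polyfit(x, y_true + noise, degree)
        fits.append(np.polyval(coeffs, 0.25))
    print(degree, fits)
# degree 1 tends to give two similar fits that are both far from 1.0
#   (high bias, low variance: underfitting)
# degree 9 tends to give fits near 1.0 that disagree with each other,
#   since each one tracks its own noise draw (low bias, high variance)
```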
Polynomial Regression Examples
• Disadvantage:
• Useful features could be lost
Regularization
Without regularization (large weights, overfit):
$f(x) = 28x - 385x^2 + 39x^3 - 174x^4 + 100$

With regularization (higher-order weights shrunk toward zero):
$f(x) = 13x - 0.23x^2 + 0.000014x^3 - 0.0001x^4 + 10$
Regularized Logistic Regression
• Overall Loss with Regularizer:
$$J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log f_{\vec{w},b}(\vec{x}^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$
• Gradient Descent:
Repeat {
$w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w}, b)$, where $\frac{\partial}{\partial w_j} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m} w_j$
$b = b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)$, where $\frac{\partial}{\partial b} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$
} simultaneous updates
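The only change from the unregularized gradients is the extra $\frac{\lambda}{m} w_j$ term on the weights; a sketch (assuming NumPy, and following the convention above that the bias $b$ is not regularized):

```python
import numpy as np

def gradients_regularized(X, y, w, b, lam):
    # Same (f - y) residual as before, plus (lam / m) * w_j on each weight.
    m = X.shape[0]
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    err = f - y
    dj_dw = X.T @ err / m + (lam / m) * w  # regularizer pushes weights toward 0
    dj_db = err.mean()                     # bias gradient is unchanged
    return dj_dw, dj_db
```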
Machine Learning Objective
• Find a model M:
• that fits the training data + that is simple
• A real-world classifier
• Strategies:
• Multi-class classification:
• One-vs-All (One-vs-Rest): see the sketch after this list
• One-vs-One
• Softmax Regression (Later)
• Decision Trees (Later)
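As an illustration of One-vs-All, the sketch below (assuming NumPy; `train_binary` stands in for any binary trainer, such as the gradient-descent sketch earlier) trains one logistic classifier per class and predicts the class whose classifier is most confident:

```python
import numpy as np

def train_one_vs_rest(X, y, num_classes, train_binary):
    # One binary logistic classifier per class: class k vs. all other classes.
    models = []
    for k in range(num_classes):
        y_k = (y == k).astype(float)   # relabel: 1 for class k, 0 otherwise
        models.append(train_binary(X, y_k))
    return models

def predict_one_vs_rest(X, models):
    # Pick the class whose classifier assigns the highest probability.
    probs = np.column_stack(
        [1.0 / (1.0 + np.exp(-(X @ w + b))) for w, b in models]
    )
    return probs.argmax(axis=1)
```

One-vs-One, by contrast, trains a classifier for every pair of classes and takes a majority vote over the pairwise decisions.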