03-Logistic Regression
Ali Sharifi-Zarchi
CE Department
Sharif University of Technology
October 5, 2024
1 Introduction
2 Logistic Regression
3 Summary
4 Extra reading
Classification problem
• Classification (binary)
• Email: Spam / Not Spam?
• Online Transactions: Fraudulent / Genuine?
• Tumor: Malignant / Benign?
y ∈ {0, 1}, where 0 is the "Negative Class" (e.g., benign tumor) and 1 is the "Positive Class" (e.g., malignant tumor).
• Classification: y = 0 or y = 1
• The linear regression hypothesis hθ(x) can be > 1 or < 0, so its output cannot be read as a probability
• This is another drawback of using linear regression for this problem
• What we need: a hypothesis whose output always lies in the range [0, 1]
Logistic Regression: Fundamentals
Introduction
• We need a function that gives us an output in the range [0, 1], so that it can be read as a probability.
• Let's denote this function by σ(·) and call it the activation function.
σ(z) = 1 / (1 + e^(−z))
• The sigmoid (logistic) function is a good candidate for the activation function.
• It smoothly maps any real number to a value between 0 and 1.
• It is also differentiable.
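A minimal NumPy sketch of the sigmoid and its derivative (the helper names are ours, not from the slides):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}); maps R smoothly into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Differentiability: sigma'(z) = sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))  # 0.5 -- the decision threshold used later
```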
• With a bias feature x_0 = 1, the input and weight vectors are
x = [x_0 = 1, x_1 , . . . , x_d ],   w = [w_0 , w_1 , . . . , w_d ]
• The logistic regression model is then h(x) = σ(w^T x) = P(y = 1 | x, w).
Decision surface
• The decision boundary is a hyperplane, which always has one dimension less than the feature space.
• The boundary is where the two classes are equally likely:
σ(w^T x) = 1 / (1 + e^(−w^T x)) = 0.5, which holds exactly when w^T x = 0
• Decision surfaces are therefore linear functions of x.
• If σ(w^T x) ≥ 0.5 then ŷ = 1, else ŷ = 0.
• Equivalently, if w^T x ≥ 0 then decide ŷ = 1, else ŷ = 0 (the bias w_0 is already inside w, since x_0 = 1).
• Example: σ(w^T x) = σ(w_0 + w_1 x_1 + w_2 x_2) with w = [−3, 1, 1]
• Predict y = 1 if −3 + x_1 + x_2 ≥ 0, i.e., on or above the line x_1 + x_2 = 3
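A small sketch of this decision rule (the function name and test points are our own illustration):

```python
import numpy as np

w = np.array([-3.0, 1.0, 1.0])   # [w0, w1, w2] from the example above

def predict(x1, x2):
    # Prepend the bias feature x0 = 1; thresholding sigma(w^T x) at 0.5
    # is equivalent to thresholding w^T x at 0.
    x = np.array([1.0, x1, x2])
    return int(w @ x >= 0)

print(predict(2.0, 2.0))  # 1, since -3 + 2 + 2 = 1 >= 0
print(predict(1.0, 1.0))  # 0, since -3 + 1 + 1 = -1 < 0
```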
ML estimation
• Each label y^(i) ∈ {0, 1} is modeled as a Bernoulli variable with parameter σ(w^T x^(i)):
P(y^(i) | x^(i), w) = σ(w^T x^(i))^(y^(i)) · (1 − σ(w^T x^(i)))^(1 − y^(i))
• Taking the logarithm:
log P(y^(i) | x^(i), w) = y^(i) log σ(w^T x^(i)) + (1 − y^(i)) log(1 − σ(w^T x^(i)))
Cost function
• We should find the w that maximizes the likelihood: ŵ = argmax_w Σ_{i=1}^{n} log P(y^(i) | x^(i), w)
• MLE finds parameters that best describe a classification problem, so the cost function should be the negative of the log-likelihood term:
J(w) = − Σ_{i=1}^{n} log P(y^(i) | x^(i), w)
     = Σ_{i=1}^{n} [ −y^(i) log σ(w^T x^(i)) − (1 − y^(i)) log(1 − σ(w^T x^(i))) ]
• Taking the second derivative of the per-sample loss with respect to σ, you get:
y/σ² + (1 − y)/(1 − σ)²
• which is positive for both y = 0 and y = 1.
• Each term −log P(y^(i) | x^(i), w) is convex, hence the summation J(w) is convex as well.
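A minimal sketch of this cost in NumPy (the helper names and the eps guard against log(0) are our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y, eps=1e-12):
    # J(w) = sum_i [ -y_i log sigma(w^T x_i) - (1 - y_i) log(1 - sigma(w^T x_i)) ]
    # X: (n, d+1) design matrix with a leading ones column; y: (n,) in {0, 1}.
    p = np.clip(sigmoid(X @ w), eps, 1.0 - eps)   # keep the logs finite
    return np.sum(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p))
```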
• For example, if the model's predicted probability is ŷ = 0.16 and the true label is y = 1, the loss −log(0.16) ≈ 1.83 is high; but if the true label is y = 0, the loss −log(1 − 0.16) ≈ 0.17 is low.
Gradient descent
J(w) = Σ_{i=1}^{n} [ −y^(i) log σ(w^T x^(i)) − (1 − y^(i)) log(1 − σ(w^T x^(i))) ]

∇_w J(w) = Σ_{i=1}^{n} (σ(w^T x^(i)) − y^(i)) x^(i)
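A sketch of batch gradient descent on this cost (the learning rate and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.01, n_iters=1000):
    # X: (n, d+1) with a leading ones column; y: (n,) labels in {0, 1}.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y)   # sum_i (sigma(w^T x_i) - y_i) x_i
        w -= lr * grad                      # step against the gradient
    return w
```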
• Compare the gradient of logistic regression with the gradient of the SSE cost in linear regression:

Logistic regression: ∇_w J(w) = Σ_{i=1}^{n} (σ(w^T x^(i)) − y^(i)) x^(i)
Linear regression (SSE): ∇_w J(w) = Σ_{i=1}^{n} (w^T x^(i) − y^(i)) x^(i)

• Both have the same "(prediction − target) × input" form; only the prediction function differs.
Loss function
• The loss function is a single overall measure of the loss incurred for taking our decisions (over the entire dataset).
• For logistic regression we have the cross-entropy loss derived above.
• How is it related to the zero-one loss? (ŷ is the predicted label and y is the true label)
Loss(y, ŷ) = 1 if y ≠ ŷ,  0 if y = ŷ
Multi-class logistic regression
• Now consider a problem where we have K classes and every sample only belongs
to one class (for simplicity).
• The softmax function generalizes the sigmoid to K classes:
σ_k(x, W) = P(y = k | x) = exp(w_k^T x) / Σ_{j=1}^{K} exp(w_j^T x)
• Compare this with the Bayes posterior:
P(C_k | x) = P(x | C_k) P(C_k) / Σ_{j=1}^{K} P(x | C_j) P(C_j)
• The softmax outputs sum to one, so they form a valid probability distribution over the K classes:
Σ_{k=1}^{K} exp(w_k^T x) / Σ_{j=1}^{K} exp(w_j^T x) = 1
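A short NumPy sketch of the softmax (the max-subtraction is a standard numerical-stability shift, not from the slides; softmax is invariant to it):

```python
import numpy as np

def softmax(scores):
    # scores: (K,) vector of w_k^T x values, one per class.
    z = scores - np.max(scores)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # a distribution over K = 3 classes; sums to 1
```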
J(W) = − log Π_{i=1}^{n} P(y^(i) | x^(i), W)
     = − log Π_{i=1}^{n} Π_{k=1}^{K} σ_k(x^(i); W)^(y_k^(i))
     = − Σ_{i=1}^{n} Σ_{k=1}^{K} y_k^(i) log σ_k(x^(i); W)
• In which:
W = [w_1 , w_2 , . . . , w_K ],   Y = [ y^(1); y^(2); . . . ; y^(n) ], where row i of Y is y^(i) = (y_1^(i), . . . , y_K^(i))
• Each y^(i) is a vector of length K (1-of-K, i.e., one-hot encoding).
• For example, y = [0, 0, 1, 0]^T when the target class is C_3.
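A sketch of this cost with one-hot labels (shapes, helper names, and the eps guard are our own):

```python
import numpy as np

def softmax_rows(S):
    # Row-wise softmax: row i holds sigma_k(x^(i); W) for k = 1..K.
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def cost(W, X, Y, eps=1e-12):
    # J(W) = -sum_i sum_k y_k^(i) log sigma_k(x^(i); W)
    # W: (d+1, K); X: (n, d+1) with leading ones column; Y: (n, K) one-hot.
    P = softmax_rows(X @ W)
    return -np.sum(Y * np.log(P + eps))
```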
• w_j^t denotes the weight vector for class j in the t-th iteration (in multi-class LR, each class has its own weight vector). Its gradient-descent update mirrors the binary case:
w_j^(t+1) = w_j^t − η Σ_{i=1}^{n} (σ_j(x^(i); W^t) − y_j^(i)) x^(i)
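A sketch of the full multi-class training loop under this update (learning rate and iteration count are illustrative):

```python
import numpy as np

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax(X, Y, lr=0.01, n_iters=1000):
    # X: (n, d+1) with a leading ones column; Y: (n, K) one-hot labels.
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iters):
        P = softmax_rows(X @ W)        # P[i, k] = sigma_k(x^(i); W)
        W -= lr * (X.T @ (P - Y))      # column j is the gradient w.r.t. w_j
    return W
```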
Summary
• LR is a linear classifier.
• The LR optimization problem is obtained by maximum likelihood.
• There is no closed-form solution for its optimization problem.
• But the cost function is convex, so the global optimum can be found by gradient descent (equivalently, gradient ascent on the log-likelihood).
Extra reading
Probabilistic view in classification
• In a classification problem:
• Each feature is a random variable (e.g., a person's height).
• The class label is also considered a random variable (e.g., a person could be overweight or not).
• We observe the feature values for a random sample and intend to find its class label.
• Evidence: feature vector x
• Objective: class label
Definitions
• P(C_k | x): posterior probability of class C_k given the observation x
• P(C_k): prior probability of class C_k
• P(x | C_k): class-conditional density (likelihood) of x under class C_k
• P(x): PDF of feature vector x
• From the total probability theorem:
P(x) = Σ_{k=1}^{K} P(x | C_k) P(C_k)
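A tiny numeric sketch of how these quantities combine via Bayes' rule (the likelihood and prior values are made up for illustration):

```python
import numpy as np

likelihoods = np.array([0.05, 0.20, 0.10])   # hypothetical P(x | C_k), K = 3
priors      = np.array([0.50, 0.30, 0.20])   # hypothetical P(C_k)

evidence  = np.sum(likelihoods * priors)      # P(x), by total probability
posterior = likelihoods * priors / evidence   # P(C_k | x), by Bayes' rule

print(posterior, posterior.sum())  # a distribution over classes; sums to 1
```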
Probabilistic classifiers
• Let’s assume we have input data x and want to classify the data into labels y.
• A generative model learns the joint probability distribution P(x, y).
• A discriminative model learns the conditional probability distribution P(y|x)
CE Department (Sharif University of Technology) Machine Learning (CE 40717) October 5, 2024 48 / 59
• For example, suppose the joint distribution P(x, y) is:

          y = 0    y = 1
  x = 1    1/2       0
  x = 2    1/4      1/4

• Then the conditional distribution P(y | x) is:

          y = 0    y = 1
  x = 1     1        0
  x = 2    1/2      1/2
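A small sketch showing how the conditional table follows from the joint one by row normalization (the array layout is our own):

```python
import numpy as np

# Joint P(x, y): rows are x in {1, 2}, columns are y in {0, 1}.
joint = np.array([[0.50, 0.00],
                  [0.25, 0.25]])

p_x = joint.sum(axis=1, keepdims=True)   # marginal P(x)
conditional = joint / p_x                # P(y | x) = P(x, y) / P(x)
print(conditional)                       # [[1.0, 0.0], [0.5, 0.5]]
```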
• The distribution P(y | x) is the natural distribution for classifying a given sample x into class y.
• This is why algorithms that model it directly are called discriminative algorithms.
• Generative algorithms model P(x, y), which can be transformed into P(y | x) by Bayes' rule and then used for classification.
• However, the distribution P(x, y) can also be used for other purposes.
• For example, we can use P(x, y) to generate likely (x, y) pairs.
Generative approach
1 Inference
• Determine the class-conditional densities P(x | C_k) and the priors P(C_k).
• Use Bayes' theorem to find P(C_k | x).
2 Decision
• Make the optimal assignment for a new input (after learning the model in the inference stage):
• If P(C_i | x) > P(C_j | x) for all j ≠ i, then decide C_i.
Discriminative approach
1 Inference
• Determine the posterior class probabilities P(C_k | x) directly.
2 Decision
• Make the optimal assignment for a new input (after learning the model in the inference stage):
• If P(C_i | x) > P(C_j | x) for all j ≠ i, then decide C_i.
Any Questions?