Logistic Regression
TOC
01 Introduction
02 Hypothesis
03 Decision Boundary
04 Cost Function
05 Gradient Descent
06 Multi-Class Classification
07 Pros & Cons
1. Introduction
Classification is a supervised learning problem: it tries to learn a mapping function $h(\mathbf{x})$ from the input variables $\mathbf{x}$ to an output variable $y$ that takes discrete or categorical values.

$$\mathbf{x} \;\longrightarrow\; h(\mathbf{x}) \;\longrightarrow\; y \in \mathbb{N}$$
Examples: digit classification.
2. Logistic Regression
● Logistic Regression is a popular statistical model used in classification problems to identify the class of a given observation by predicting the value of a categorical output variable.
● It mostly deals with binary/binomial classification, whose output is a binary value, either 0 or 1 (true or false). However, it can also be extended to multiclass/multinomial classification.
● In this lecture, we will focus mainly on binary classification to simplify the problem.
2. How does it work?
● Logistic Regression measures the relationship between the dependent variable (the label, what we want to predict) and one or more independent variables (the input features) by estimating a probability with the underlying logistic function.
● This probability is then transformed into a binary value, in order to actually make a prediction, by applying a threshold.
[Diagram: input features $x_1, x_2, \ldots, x_n$ → Logistic Regression → probability $y$ → threshold 0.5 → output 1 or 0]
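As a minimal sketch of this pipeline (the weight vector and the observation below are made-up values, purely for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical, already-trained weights (bias first) for 3 input features.
w = np.array([-1.0, 0.8, 0.5, -0.3])

# One observation, with a leading 1 for the bias term.
x = np.array([1.0, 2.0, 1.5, 0.4])

probability = sigmoid(w @ x)                  # estimated P(y = 1 | x)
prediction = 1 if probability >= 0.5 else 0   # apply the 0.5 threshold

print(probability, prediction)
```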
2. Sigmoid Function
Sigmoid Function (or Logistic Function) is an S-shaped curve that takes any real-valued number and maps it to a value between 0 and 1, but never exactly at those limits.
$$g(x) = \frac{1}{1 + e^{-x}}$$

[Figure: the sigmoid function]
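A quick numerical check of the definition above (a small sketch, not tied to any particular library beyond NumPy) shows the S-shaped behaviour: $g(0) = 0.5$, and the values approach 0 and 1 at the extremes without ever reaching them.

```python
import numpy as np

def sigmoid(x):
    """g(x) = 1 / (1 + exp(-x))"""
    return 1.0 / (1.0 + np.exp(-x))

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"g({x:+.1f}) = {sigmoid(x):.5f}")
# g(-10.0) ~ 0.00005, g(0.0) = 0.50000, g(+10.0) ~ 0.99995:
# close to, but never exactly, 0 or 1.
```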
2. Why not Linear Regression?
Example: human obesity prediction.
[Figure: training observations labelled Not Obese/0 or Obese/1 plotted against weight]
● To deal with classification, we can fit $h_\mathbf{w}(\mathbf{x})$ and set a threshold on its output value, e.g. 0.5:

$$h_\mathbf{w}(\mathbf{x}) \ge 0.5 \;\Rightarrow\; \text{Obese}/1, \qquad h_\mathbf{w}(\mathbf{x}) < 0.5 \;\Rightarrow\; \text{Not Obese}/0$$

[Figure: linear regression line $h_\mathbf{w}(\mathbf{x})$ fitted to the obesity data, with the 0.5 threshold marked along the weight axis $x_1$]
2. Why not Linear Regression?
Linear Regression:
● Input variables can have any measurement level
● Requires a linear relationship between the dependent and independent variables
● Independent variables can be correlated with each other
● Predicted values are the mean of the target variable at the given values of the input variables
Logistic Regression:
● Input variables can have any measurement level
● Does not require a linear relationship between the dependent and independent variables
● Independent variables must not be correlated with each other
● Predicted values are the probability of a particular level of the target variable at the given values of the input variables
2. Linear Regression vs. Logistic Regression
In the human obesity prediction example, logistic regression is more robust than linear regression, since its output is always bounded between 0 and 1.
$$h_\mathbf{w}(\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}$$

$$h_\mathbf{w}(\mathbf{x}) \begin{cases} \ge 0.5, & \text{if } \mathbf{w}^T\mathbf{x} \ge 0 \\ < 0.5, & \text{if } \mathbf{w}^T\mathbf{x} < 0 \end{cases}$$

[Figure: sigmoid curve of $h_\mathbf{w}(\mathbf{x})$ plotted against $\mathbf{w}^T\mathbf{x}$]
3. Decision Boundary
● In order to map $h_\mathbf{w}(\mathbf{x})$ to a discrete value of either 0 or 1, a threshold value of 0.5 is defined as a tipping point: observations at or above it are classified into class 1 (the positive class), and observations below it into class 0 (the negative class).

$$h_\mathbf{w}(\mathbf{x}) \ge 0.5 \;\rightarrow\; \text{class } 1, \qquad h_\mathbf{w}(\mathbf{x}) < 0.5 \;\rightarrow\; \text{class } 0$$

[Figure: sigmoid curve of $h_\mathbf{w}(\mathbf{x})$ against $\mathbf{w}^T\mathbf{x}$, with the class 0 and class 1 regions on either side of the 0.5 threshold]
● This figure shows the decision boundary of Logistic Regression with two input variables $\mathbf{x} = (x_1, x_2)$.
● All observations classified into class 1, i.e. those with $h_\mathbf{w}(\mathbf{x}) \ge 0.5$, lie above the decision line, while the observations of class 0, with $h_\mathbf{w}(\mathbf{x}) < 0.5$, lie below it.

[Figure: linear decision boundary in the $(x_1, x_2)$ plane, with class 1 above the line and class 0 below it]
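Since $h_\mathbf{w}(\mathbf{x}) \ge 0.5$ exactly when $\mathbf{w}^T\mathbf{x} \ge 0$, the decision line in the figure is the set of points where $w_0 + w_1 x_1 + w_2 x_2 = 0$. A small sketch (the weights below are illustrative assumptions, not learned from data):

```python
# Hypothetical weights: bias w0, and coefficients w1, w2 for x1 and x2.
w0, w1, w2 = -4.0, 1.0, 2.0

def predict_class(x1, x2):
    """Class 1 if w^T x >= 0 (i.e. h_w(x) >= 0.5), otherwise class 0."""
    return 1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0

def boundary_x2(x1):
    """The decision line solved for x2: w0 + w1*x1 + w2*x2 = 0."""
    return -(w0 + w1 * x1) / w2

print(predict_class(5.0, 3.0))   # above the line -> class 1
print(predict_class(0.5, 0.5))   # below the line -> class 0
print(boundary_x2(2.0))          # x2 on the boundary at x1 = 2.0 -> 1.0
```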
4. Cost Function
In logistic regression, we cannot use the Mean Squared Error (MSE) because the sigmoid function makes the cost function $J(\mathbf{w})$ non-linear in the parameters. As a result, MSE becomes a non-convex function with many local minima, and gradient descent cannot reliably learn the parameters.
$$J(\mathbf{w}) = \frac{1}{2n}\sum_{i=1}^{n}\big(h_\mathbf{w}(\mathbf{x}_i) - y_i\big)^2, \qquad h_\mathbf{w}(\mathbf{x}_i) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}_i}}$$

[Figure: non-convex MSE cost $J(\mathbf{w})$ with several local minima plotted against $\mathbf{w}$]
● Instead of MSE, we use a cost function called Cross-Entropy Loss, also known as Negative Log-Likelihood, which is convex and can be minimized by gradient descent.
● For the whole training set of $n$ instances, the cost is the average of the individual costs:
$$J(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^{n}\Big[ y_i \log h_\mathbf{w}(\mathbf{x}_i) + (1 - y_i)\log\big(1 - h_\mathbf{w}(\mathbf{x}_i)\big) \Big]$$
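A minimal sketch of this cost, computed with NumPy over a few made-up labels and predicted probabilities (the small epsilon guards against log(0) and is an implementation detail, not part of the formula):

```python
import numpy as np

def cross_entropy_loss(y, p, eps=1e-12):
    """Average negative log-likelihood for binary labels y and
    predicted probabilities p = h_w(x)."""
    p = np.clip(p, eps, 1.0 - eps)        # avoid log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1, 0, 1, 0])                # true labels
p = np.array([0.9, 0.2, 0.7, 0.4])        # hypothetical model outputs

print(cross_entropy_loss(y, p))           # ~0.30; lower is better
```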
4. Cost Function – Intuition
● Cross-Entropy Loss aims to find parameters $\hat{\mathbf{w}}$ such that the model estimates high probabilities for the positive instances ($y = 1$) and low probabilities for the negative instances ($y = 0$).
● Cross-Entropy Loss is composed of two convex functions, one for each class:

$$J(\mathbf{w}) = \begin{cases} -\log h_\mathbf{w}(\mathbf{x}) & \text{if } y = 1 \\ -\log\big(1 - h_\mathbf{w}(\mathbf{x})\big) & \text{if } y = 0 \end{cases}$$

[Figure: the two loss curves plotted against $h_\mathbf{w}(\mathbf{x})$]
5. Gradient Descent
● Given $n$ examples $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$ such that $\mathbf{x}_i \in \mathbb{R}^d$, the cost function is:

$$J(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^{n}\Big[ y_i \log h_\mathbf{w}(\mathbf{x}_i) + (1 - y_i)\log\big(1 - h_\mathbf{w}(\mathbf{x}_i)\big) \Big]$$

● The gradient, or partial derivative of the cost function with respect to each parameter $w_j$, is:

$$\frac{\partial}{\partial w_j} J(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\big(h_\mathbf{w}(\mathbf{x}_i) - y_i\big)\, x_{i,j}$$
● The batch gradient descent algorithm of logistic regression computes the predicted probability $\hat{p} = h_\mathbf{w}(\mathbf{x})$ for every training example, then updates each parameter against the gradient above; a sketch is given below.
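A compact sketch of that loop, assuming the design matrix X already contains a leading column of ones for the bias term (the learning rate, iteration count, and toy data are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.
    X: (n, d) design matrix (first column = 1 for the bias),
    y: (n,) labels in {0, 1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p_hat = sigmoid(X @ w)              # predicted probability h_w(x_i)
        gradient = X.T @ (p_hat - y) / n    # (1/n) sum (h_w(x_i) - y_i) x_ij
        w -= lr * gradient                  # move against the gradient
    return w

# Tiny toy data set: one feature (plus bias); larger values tend to be class 1.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1])
w = fit_logistic_regression(X, y)
print(sigmoid(X @ w))   # probabilities should roughly follow the labels
```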
6. Multi-Class Classification
● What is the intuition behind multi-class classification in Logistic Regression?

[Figure: scatter plot in the $(x_1, x_2)$ plane with three classes (Class 1: +, Class 2: ×, Class 3: ∆)]
6. One vs. Rest Classification
● Logistic Regression mainly focuses on binary classification. However, it can be used to solve multi-class classification problems with a technique called "One vs. Rest Classification".
● The "One vs. Rest Classification" strategy involves training a single classifier per class, with the samples of that class as positive samples and all the other samples as negatives. Predictions are then made using the model that is the most confident, as sketched below.
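A sketch of that strategy, where fit_binary can be the fit_logistic_regression function from the gradient descent section (all names and data here are illustrative):

```python
import numpy as np

def fit_one_vs_rest(X, y, classes, fit_binary):
    """Train one binary classifier per class: samples of that class are the
    positives (1) and all other samples are the negatives (0)."""
    return {k: fit_binary(X, (y == k).astype(float)) for k in classes}

def predict_one_vs_rest(X, weights):
    """Pick, for each sample, the class whose classifier is most confident."""
    classes = list(weights)
    # One column of probabilities h_w^(k)(x) per class.
    probs = np.column_stack(
        [1.0 / (1.0 + np.exp(-(X @ weights[k]))) for k in classes]
    )
    return np.array([classes[i] for i in probs.argmax(axis=1)])
```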
For each class $k$, a separate binary classifier estimates

$$h^{(k)}_\mathbf{w}(\mathbf{x}) = P(y = k \mid \mathbf{x}, \mathbf{w})$$

[Figure: the three-class data set (Class 1: +, Class 2: ×, Class 3: ∆) in the $(x_1, x_2)$ plane, and the three binary classifiers $h^{(1)}_\mathbf{w}(\mathbf{x})$, $h^{(2)}_\mathbf{w}(\mathbf{x})$, $h^{(3)}_\mathbf{w}(\mathbf{x})$, each separating one class from the rest]
● To classify a new instance $\mathbf{x}$ (here a "+" point), the most confident classifier wins:

$$h^{(1)}_\mathbf{w}(\mathbf{x}) = P(y = 1 \mid \mathbf{x}) = 0.89, \quad h^{(2)}_\mathbf{w}(\mathbf{x}) = P(y = 2 \mid \mathbf{x}) = 0.10, \quad h^{(3)}_\mathbf{w}(\mathbf{x}) = P(y = 3 \mid \mathbf{x}) = 0.01 \;\Rightarrow\; y = 1$$
6. Softmax Regression
● Logistic Regression can be generalized to support multiple classes directly, without
having to train and combine multiple binary classifiers. This is called Softmax
Regression or Multinomial Logistic Regression.
● The idea is quite simple: when given an instance x, Softmax Regression first computes
a score 𝑠𝑘 (x) for each class k, then estimates the probability of each class by applying
the softmax function (also called the normalized exponential) to the scores.
$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

Example: $\mathbf{z} = (1.3,\ 5.1,\ 2.2,\ 0.7,\ 1.1) \;\xrightarrow{\ \text{softmax}\ \sigma\ }\; \sigma(\mathbf{z}) \approx (0.02,\ 0.90,\ 0.05,\ 0.01,\ 0.02)$
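A short sketch of the softmax function applied to the example scores above (subtracting the maximum score is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    """Normalized exponential: sigma(z)_i = exp(z_i) / sum_j exp(z_j)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

scores = [1.3, 5.1, 2.2, 0.7, 1.1]
print(softmax(scores).round(2))    # ~[0.02, 0.90, 0.05, 0.01, 0.02], sums to 1
```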
[Figure: softmax regression network with inputs $x_1, \ldots, x_d$, class scores $s_k = \mathbf{w}_k^T\mathbf{x}$ for $k = 1, 2, 3$ (with bias weights $w_{k0}$), a softmax layer, and outputs $P(y = k \mid \mathbf{x})$]
$$P(y = k \mid \mathbf{x}) = \sigma\big(\mathbf{s}(\mathbf{x})\big)_k = \frac{e^{s_k(\mathbf{x})}}{\sum_{j=1}^{K} e^{s_j(\mathbf{x})}}, \qquad J(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_k^{(i)} \log\big(P^{(i)}(y = k)\big)$$
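A minimal sketch of the full forward pass and cost for $K = 3$ classes. The weight matrix, design matrix (with a leading bias column), and one-hot labels below are illustrative assumptions, not real data:

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax over class scores."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

# Illustrative weights W (K classes x d features) and design matrix X (n x d).
W = np.array([[ 0.2,  1.0, -0.5],
              [-0.1, -0.4,  0.9],
              [ 0.0, -0.6, -0.4]])
X = np.array([[1.0, 2.0, 0.5],
              [1.0, 0.3, 1.8]])
Y = np.array([[1, 0, 0],          # one-hot labels y_k^(i)
              [0, 1, 0]])

S = X @ W.T                        # class scores s_k(x) = w_k^T x
P = softmax(S)                     # P(y = k | x) for every instance
J = -np.mean(np.sum(Y * np.log(P), axis=1))   # cross-entropy cost
print(P.round(3), J)
```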
7. Pros & Cons
● Pros:
○ Easy to implement and interpret results
○ Inexpensive computation
○ Does not require feature scaling
● Cons:
○ Not able to handle a large number of categories in the output variable
○ Vulnerable to overfitting
○ Cannot model non-linearly separable data
Q&A