Logistic Regression
Machine Learning
• Polynomial Regression
• Locally Weighted Linear Regression
• Logistic Regression / Softmax Regression
Polynomial Regression
What if your data is actually more complex than a simple straight line?
We can use a linear model to fit nonlinear data.
A model is said to be linear if it is linear in its parameters:
$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2$
$y = \theta_0 + \theta_1 x + \theta_2 x^2$
Both are linear in $\theta$, even though the second is quadratic in $x$.
Add powers of each feature as new features, then train a linear model on this extended set of features, as in the sketch below.
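As an illustration (not from the slides), a minimal scikit-learn sketch of this idea; the generated data and the choice degree=2 are illustrative:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Illustrative nonlinear data: y = 2x + 0.5x^2 + noise
rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 2 * X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(size=100)

# Add x^2 as a new feature, then fit an ordinary linear model
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)            # columns: [x, x^2]
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)  # intercept ~0, coefficients ~[2, 0.5]
```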
Polynomial Regression
[Figure: linear fit $y = 3.52 + 0.98x$ to the training data]
Polynomial Regression: Example
Transform each feature: $x^{(i)} \rightarrow [x^{(i)}, (x^{(i)})^2]$
Data generated from $y = 2x + 0.5x^2$
$\text{RMSE}(X, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2}$
The R² score, or coefficient of determination, measures how much of the total variance of the dependent variable is explained by the least-squares regression.
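A short sketch of computing both metrics with scikit-learn; the target and prediction values here are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# y_true: observed targets, y_pred: model predictions (illustrative values)
y_true = np.array([3.0, 5.5, 9.0, 13.5])
y_pred = np.array([2.8, 5.7, 9.1, 13.2])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt of the mean squared error
r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot
print(f"RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
```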
Linear Regression vs. Locally Weighted Linear Regression
[Figure series: data generated from $y = 2x + 0.5x^2$; a single global linear fit compared with locally weighted linear fits around the query point $x = 2.5$]
Gradient Descent
Update Rule:
$\theta_j := \theta_j - \eta \sum_{i=1}^{m} w^{(i)}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
The magnitude of the update is proportional to both the error on a sample and the weight of that sample.
Locally Weighted Linear Regression
$w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$, where $\tau$ is the bandwidth parameter.
$\theta$ is chosen by giving a much higher weight to (the errors on) training examples close to the query point $x$.
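A minimal NumPy sketch of this procedure using the weighted normal equations; the function name, the bandwidth value tau=0.5, and the generated data are illustrative assumptions:

```python
import numpy as np

def lwlr_predict(x_query, X, y, tau=0.5):
    """Predict at x_query by fitting a linear model weighted around it."""
    m = X.shape[0]
    Xb = np.column_stack([np.ones(m), X])  # add intercept column x0 = 1
    # Gaussian weights: higher for training points near the query
    w = np.exp(-((X[:, 0] - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (Xb^T W Xb)^(-1) Xb^T W y
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return theta[0] + theta[1] * x_query

rng = np.random.default_rng(0)
X = np.sort(6 * rng.random((100, 1)) - 3, axis=0)
y = 2 * X[:, 0] + 0.5 * X[:, 0] ** 2 + 0.3 * rng.normal(size=100)
print(lwlr_predict(2.5, X, y))  # local fit near the query point x = 2.5
```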
Locally Weighted Linear Regression
Mean normalization: $x_j := \frac{x_j - \mu_j}{\max(x_j) - \min(x_j)}$ or $x_j := \frac{x_j - \mu_j}{\sigma_j}$
Classification problem
Similar setup to the regression problem, but the target is discrete.
Assume a binary classifier: $y$ can take on only two values, 0 and 1.
Ex.: spam email (1, positive class), ham email (0, negative class)
Linear Regression
$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$
Classification Problem
$y \in \{0, 1\}$
So, we want $0 \le h_\theta(x) \le 1$
Logistic Regression
$h_\theta(x) = g(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}}$
$g(z) = \dfrac{1}{1 + e^{-z}}$ (logistic function or sigmoid function)
Logistic Regression
$g(z) = \dfrac{1}{1 + e^{-z}}$
$g(z)$ tends towards 1 as $z \to \infty$
$g(z)$ tends towards 0 as $z \to -\infty$
$h_\theta(x) = g(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}}$
$\theta^T x = \theta_0 x_0 + \sum_{j=1}^{n} \theta_j x_j$, with $x_0 = 1$
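A minimal NumPy sketch of the sigmoid and the hypothesis (the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is assumed to include x0 = 1."""
    return sigmoid(theta @ x)

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # -> [~0, 0.5, ~1]
```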
Logistic Regression
Predicts something that is True or False, instead of something continuous like a size.
Interpretation of Logistic Function
Assume: predict $y = 1$ if $h_\theta(x) \ge 0.5$, and $y = 0$ if $h_\theta(x) < 0.5$
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$
$\theta = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix}$
Predict $y = 1$ if $-3 + x_1 + x_2 \ge 0$
$x_1 + x_2 = 3$ is the decision boundary.
Logistic Regression
Training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$ of $m$ examples
$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$, with $x_0 = 1$, and $y \in \{0, 1\}$
$h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$
Cost Function
Linear Regression: $J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
$\text{Cost}(h_\theta(x), y) = \frac{1}{2}\left(h_\theta(x) - y\right)^2$
Using this cost function for logistic regression gives a non-convex function:
gradient descent is not guaranteed to converge to the global minimum.
Cost Function
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
$\text{Cost} = 0$ if $y = 1$ and $h_\theta(x) = 1$,
but as $h_\theta(x) \to 0$, $\text{Cost} \to \infty$.
If $h_\theta(x) = 0$, we predict $p(y = 1 \mid x; \theta) = 0$; if actually $y = 1$, the learning algorithm is penalized by a very large cost.
Cost Function
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)})$
[Figure: the two cost curves plotted against $h_\theta(x)$, one for $y = 1$ and one for $y = 0$]
Cost Function
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
Since $y = 1$ or $y = 0$ are the only two possible values, this can be written in one line:
$\text{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$
$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right]$
This is a convex function
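A NumPy sketch of this cost; the small epsilon added inside the logs to avoid log(0) is my addition, not from the slides:

```python
import numpy as np

def cross_entropy_cost(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum[y*log(h) + (1-y)*log(1-h)].

    X is the m-by-(n+1) design matrix including the x0 = 1 column;
    y holds 0/1 labels.
    """
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta(x) for all examples
    return -(1.0 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```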
Logistic Regression
To fit parameters $\theta$: solve $\min_\theta J(\theta)$ to obtain $\theta$.
Want $\min_\theta J(\theta)$:
Repeat {
$\theta_j := \theta_j - \eta\,\dfrac{\partial}{\partial \theta_j} J(\theta)$
}
$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right]$
$\dfrac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
Gradient Descent
Want $\min_\theta J(\theta)$, with $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$:
Repeat {
$\theta_j := \theta_j - \eta \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
}
The hypothesis gives probabilities; to get the predicted classes, just use log_reg.predict(). A sketch of the training loop follows.
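A minimal batch gradient descent sketch for logistic regression; the learning rate, iteration count, and toy data are illustrative choices:

```python
import numpy as np

def fit_logistic(X, y, eta=0.1, n_iters=1000):
    """Batch gradient descent on the cross-entropy cost.

    X includes the x0 = 1 column; the 1/m factor matches the gradient
    formula above (the slides absorb it into the learning rate).
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        theta -= eta * (1.0 / m) * X.T @ (h - y)  # simultaneous update of all theta_j
    return theta

# Toy 1-D example: class 1 when x > 2
X = np.column_stack([np.ones(8), np.array([0, 0.5, 1, 1.5, 2.5, 3, 3.5, 4])])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(fit_logistic(X, y))
```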
Example
Let's apply logistic regression to the Iris dataset.
Iris is a famous dataset containing the sepal and petal length and width of 150 iris flowers.
Three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica.
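A minimal scikit-learn sketch of the binary case, assuming we detect Iris-Virginica from petal width alone (that feature choice is an illustrative assumption):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:]                 # petal width (cm)
y = (iris.target == 2).astype(int)   # 1 if Iris-Virginica, else 0

log_reg = LogisticRegression().fit(X, y)
X_new = np.linspace(0, 3, 5).reshape(-1, 1)
print(log_reg.predict_proba(X_new))  # class probabilities
print(log_reg.predict(X_new))        # class labels
```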
Nonlinear Decision Boundary
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$
Add higher-order polynomial features:
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$
$\theta = \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$
Predict $y = 1$ if $-1 + x_1^2 + x_2^2 \ge 0$, i.e., $x_1^2 + x_2^2 \ge 1$
Decision boundary: $x_1^2 + x_2^2 = 1$ (a circle), as in the sketch below.
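A sketch of learning such a circular boundary by feeding squared features into logistic regression; the synthetic data generation is illustrative:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative data: label 1 outside the unit circle, 0 inside
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 >= 1).astype(int)

# Polynomial features supply x1^2 and x2^2, so a linear boundary in the
# expanded space becomes a circle in the original space
model = make_pipeline(PolynomialFeatures(degree=2),
                      LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.score(X, y))  # training accuracy; should be near 1.0
```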
Multiclass Classification
One-vs-all Classifier (one-vs-rest)
Class 1, Class 2, Class 3:
[Figure: the training set split into three binary sub-problems, one classifier $h_\theta^{(i)}(x)$ fitted per class]
One-vs-all Classifier (one-vs-rest)
$h_\theta^{(i)}(x) = p(y = i \mid x; \theta)$ for $i = 1, 2, 3$
One-vs-all:
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
On a new input $x$, to make a prediction, pick the class $i$ with the largest $h_\theta^{(i)}(x)$, i.e., $\max_i h_\theta^{(i)}(x)$, as in the sketch below.
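A minimal sketch of one-vs-rest using scikit-learn's OneVsRestClassifier; reusing the Iris dataset here is an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X, y = iris.data, iris.target   # 3 classes -> 3 binary classifiers

# One binary logistic regression per class; prediction picks the class
# whose classifier outputs the highest score
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:5]))
print(len(ovr.estimators_))     # 3 fitted one-vs-rest models
```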
Softmax Regression
Multi-class classification
Multinomial Logistic Regression
Generalization of logistic regression
𝑦 ∈ {1,2, … , 𝐾}
Softmax Regression
Given a test input $x$, estimate the probability $P(y = k \mid x)$ for each value of $k = 1, \ldots, K$;
that is, estimate the probability of the class label taking on each of the $K$ different possible values.
Our hypothesis will output a 𝐾-dimensional vector (whose elements sum to
1) giving 𝐾 estimated probabilities.
Softmax Regression: Hypothesis
$\boldsymbol{\theta} = \begin{bmatrix} | & & | \\ \theta^{(1)} & \cdots & \theta^{(K)} \\ | & & | \end{bmatrix}$, an $n$-by-$K$ matrix
Softmax Regression: Cost Function
$\mathbf{1}\{\cdot\}$ is the indicator function:
$\mathbf{1}\{\text{true statement}\} = 1$ and $\mathbf{1}\{\text{false statement}\} = 0$
Minimize the cost function:
$J(\theta) = -\left[\sum_{i=1}^{m}\sum_{k=1}^{K} \mathbf{1}\{y^{(i)} = k\}\,\log \dfrac{\exp(\theta^{(k)T} x^{(i)})}{\sum_{j=1}^{K}\exp(\theta^{(j)T} x^{(i)})}\right]$
$P(y^{(i)} = k \mid x^{(i)}; \theta) = \dfrac{\exp(\theta^{(k)T} x^{(i)})}{\sum_{j=1}^{K}\exp(\theta^{(j)T} x^{(i)})}$
Softmax Regression
$P(y^{(i)} = k \mid x^{(i)}; \theta) = \dfrac{\exp(\theta^{(k)T} x^{(i)})}{\sum_{j=1}^{K}\exp(\theta^{(j)T} x^{(i)})}$
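A NumPy sketch of these softmax probabilities and the cost; the max-subtraction trick for numerical stability is my addition, and the code indexes classes as 0..K-1 rather than the slides' 1..K:

```python
import numpy as np

def softmax_probs(Theta, X):
    """P(y = k | x; Theta) for all k; Theta is n-by-K, X is m-by-n."""
    scores = X @ Theta                           # m-by-K matrix of theta^(k)T x
    scores -= scores.max(axis=1, keepdims=True)  # stability: softmax is shift-invariant
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)      # each row sums to 1

def softmax_cost(Theta, X, y):
    """J(Theta): the indicator 1{y=k} picks out the log-probability of the true class."""
    P = softmax_probs(Theta, X)
    m = X.shape[0]
    return -np.log(P[np.arange(m), y]).sum()

# Tiny check: with Theta = 0, every class gets probability 1/K
Theta = np.zeros((3, 4))
X = np.ones((2, 3))
print(softmax_probs(Theta, X))  # each row ~ [0.25, 0.25, 0.25, 0.25]
```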
Example: Iris dataset
Python Implementation
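A minimal scikit-learn sketch of softmax regression on Iris; the two-feature selection and C=10 are illustrative choices (with the default lbfgs solver, LogisticRegression fits the multinomial model when there are more than two classes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data[:, 2:], iris.target  # petal length and width; 3 classes

# Default lbfgs solver -> multinomial (softmax) logistic regression
softmax_reg = LogisticRegression(C=10, max_iter=1000)
softmax_reg.fit(X, y)
print(softmax_reg.predict([[5, 2]]))        # predicted class
print(softmax_reg.predict_proba([[5, 2]]))  # K probabilities summing to 1
```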
Question
Suppose you train a logistic regression classifier and your hypothesis function H is as given. Which of the following figures represents the decision boundary produced by the above classifier?
Question
Imagine you are given the graph below for logistic regression, which shows the relationship
between the cost function and the number of iterations for 3 different learning rate values
(different colors show the curves for different learning rates).
Suppose you saved the graph for future reference but forgot to save the values of the
different learning rates for this graph. Now you want to find out the relation between
the learning rate values of these curves. Which of the following is the true relation?
Note:
1. The learning rate for blue is l1
2. The learning rate for red is l2
3. The learning rate for green is l3
A) l1>l2>l3
B) l1 = l2 = l3
C) l1 < l2 < l3
D) None of these
Question
Suppose you are dealing with a 4-class classification problem and you want to train a logistic
regression model on the data using the one-vs-all method.
How many logistic regression models do we need to fit for this 4-class classification problem?