
CMPE 442 Introduction to Machine Learning
• Polynomial Regression
• Locally Weighted Linear Regression
• Logistic Regression/ Softmax Regression
Polynomial Regression

 What if your data is actually more complex than a simple straight line?
 We can use a linear model to fit nonlinear data.
 A model is said to be linear if it is linear in its parameters:

y = β_0 + β_1 x_1 + β_2 x_2 + ε

y = β_0 + β_1 x + β_2 x² + ε

y = β_0 + β_1 x_1 + β_2 x_2 + β_11 x_1² + β_22 x_2² + β_12 x_1 x_2 + ε

 Add powers of each feature as new features, then train a linear model on
this extended set of features
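A minimal sketch of this feature-expansion step with scikit-learn follows; the slides' own code is not shown here, so the data X, y and the degree are placeholders, not the lecture's exact example.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 2)            # placeholder data with two features x1, x2
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]**2  # placeholder nonlinear target

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)        # adds x1^2, x2^2 and x1*x2 as new columns

lin_reg = LinearRegression().fit(X_poly, y)  # ordinary linear model on the extended features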
Polynomial Regression

 A kth-order polynomial model in one variable is given by

y = β_0 + β_1 x + β_2 x² + ... + β_k x^k + ε

 A second-order polynomial model in two variables is given by

y = β_0 + β_1 x_1 + β_2 x_2 + β_11 x_1² + β_22 x_2² + β_12 x_1 x_2 + ε

 An array containing n features can be transformed into an array containing
(n + d)! / (d! n!) features, where d is the polynomial order.
Polynomial Regression: Example

 Let’s generate some nonlinear data based on a simple quadratic equation
 x’s are assigned randomly

y = 2 + x + 0.5x² + ε
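A small NumPy sketch of how such data might be generated; the range of x and the noise scale are assumptions, since the slide does not state them.

import numpy as np

m = 100
X = 6 * np.random.rand(m, 1) - 3                # x's drawn uniformly at random in [-3, 3] (assumed range)
y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)  # quadratic signal plus Gaussian noise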
Polynomial Regression: Example

A straight line fits nonlinear data poorly.

A case of underfitting.

ŷ = 3.52 + 0.98x
Polynomial Regression: Example

Add powers of each feature as new features:

x^(i) → [x^(i), (x^(i))²]

y = 2 + x + 0.5x² + ε

h(x) = 2.04 + 0.96x + 0.51x²
Polynomial Regression: Example
Polynomial Regression: Example
Evaluating the Performance of the Model

 RMSE is the square root of the mean of the squared residuals:

RMSE(X, h) = √( (1/m) ∑_{i=1}^{m} (h(x^(i)) − y^(i))² )

 The R²-score, or coefficient of determination, explains how much of the total variance of
the dependent variable can be reduced by the least-squares regression:

R² = 1 − SS_res / SS_tot,   where SS_tot = ∑_{i=1}^{m} (y^(i) − ȳ)²  and  SS_res = ∑_{i=1}^{m} (h(x^(i)) − y^(i))²
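A short sketch of both metrics written directly from the formulas above; y_true and y_pred are placeholder names for the actual and predicted targets. scikit-learn also provides equivalents in sklearn.metrics.

import numpy as np

def rmse(y_true, y_pred):
    # square root of the mean of the squared residuals
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_pred - y_true) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot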


Locally Weighted Linear Regression

 In the original linear regression algorithm, to make a prediction at a query point x, we would:

1. Fit θ to minimize ∑_i (y^(i) − θᵀx^(i))²
2. Output θᵀx

 In the locally weighted linear regression algorithm, we:

1. Fit θ to minimize ∑_i w^(i) (y^(i) − θᵀx^(i))²
2. Output θᵀx
Linear Regression

y = 2 + x + 0.5x² + ε
Linear Regression

y = 2 + x + 0.5x² + ε    h(x) = 2.04 + 0.96x + 0.51x²


Linear Regression

y = 2 + x + 0.5x² + ε    h(x) = 2.04 + 0.96x + 0.51x²

x_new = 2.5    h(2.5) = 2.04 + 0.96·2.5 + 0.51·2.5² = 7.63


Locally Weighted Linear Regression

y = 2 + x + 0.5x² + ε
Locally Weighted Linear Regression

y = 2 + x + 0.5x² + ε

x_new = 2.5
Locally Weighted Linear Regression

y = 2 + x + 0.5x² + ε

x_new = 2.5
Locally Weighted Linear Regression

y = 2 + x + 0.5x² + ε    This line is learnt specifically for x_new.

x_new = 2.5
Locally Weighted Linear Regression

y = 2 + x + 0.5x² + ε

x_new = 2.5
Gradient Descent

 Update Rule:
 θ_j := θ_j − η ∑_i w^(i) (h_θ(x^(i)) − y^(i)) x_j^(i)

 The magnitude of the update is proportional to both the error and the weight of a sample.
Locally Weighted Linear Regression

 The w^(i)’s are non-negative weights.

 The weights depend on the particular query point x at which we are trying to make a prediction:

w^(i) = exp( −(x^(i) − x)² / (2τ²) ),    τ is the bandwidth parameter

 If |x^(i) − x| is small, then w^(i) is close to 1.

 If |x^(i) − x| is large, then w^(i) is small.

 θ is chosen giving a much higher weight to the (errors on) training examples
close to the query point x.
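A minimal sketch of a prediction at a single query point. It solves the weighted least-squares problem in closed form (weighted normal equations) rather than with the gradient-descent update shown above; the bandwidth default tau=0.5 and the function name are illustrative.

import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    # X: (m, n) design matrix whose first column is all ones; y: (m,) targets
    # x_query: (n,) query point, also with a leading 1
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))  # w^(i) for every example
    W = np.diag(w)                                    # m-by-m diagonal weight matrix
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y # weighted normal equations
    return x_query @ theta                            # theta^T x at the query point

For the example above one would call lwr_predict(X_b, y, np.array([1.0, 2.5])), where X_b is the training data with a bias column, to get the locally fitted prediction at x_new = 2.5.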
Locally Weighted Linear Regression

 The linear regression algorithm is a parametric learning algorithm:

 It has a fixed, finite number of parameters which are fit to the data
 Once we fit the θ’s and store them away, we don’t need to keep the training data for future predictions
 Locally weighted linear regression is a non-parametric algorithm:
 To make predictions we need to keep the entire training set around
 The amount of data we need to keep in order to represent the hypothesis h grows linearly with the size of the training set
Question

 Is the gradient descent algorithm affected by the scales of features?


Question

 Is the gradient descent algorithm affected by the scales of features?


Feature Normalization

 We can speed up gradient descent by normalizing the features (having them in approximately equal ranges):
 θ will descend quickly on small ranges

1) Feature scaling:  x_j := x_j / (max(x_j) − min(x_j))

2) Mean normalization:  x_j := (x_j − μ_j) / (max(x_j) − min(x_j))   or   x_j := (x_j − μ_j) / s_j
   (μ_j is the mean of feature j and s_j its standard deviation)

 For polynomial regression feature normalization is even more important.


 Not needed for Normal Equation.
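A small sketch of the two normalization options above applied column-wise to a NumPy feature matrix; the example data and scales are placeholders.

import numpy as np

X = np.random.rand(100, 3) * np.array([1.0, 100.0, 1000.0])  # placeholder features on very different scales

# 1) Feature scaling: divide each feature by its range
X_scaled = X / (X.max(axis=0) - X.min(axis=0))

# 2) Mean normalization: subtract the mean, then divide by the range (or the standard deviation)
X_mean_norm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)  # the standardization variant (what sklearn's StandardScaler does)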
Summary on Linear Regression

 We considered the case when the hypothesis function is linear.

 We can make the hypothesis nonlinear by combining existing features to obtain new features, making the function quadratic, cubic, etc.
 Closed-form solution: the Normal Equations
 Scaling the data is important for the gradient descent algorithm.
Logistic Regression

 Classification problem
 Similar to regression problem
 Assume a binary classifier: y can take on only two values 0 and 1.
 Ex.: spam email (1, positive class), ham email (0, negative class)
Linear Regression

h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ⋯ + θ_n x_n
Classification Problem

Threshold the classifier output at 0.5:

If h_θ(x) ≥ 0.5, predict y = 1
If h_θ(x) < 0.5, predict y = 0

h_θ(x) = −0.34 + 1.58x
Classification Problem

 y ∈ {0, 1}
 So, we want 0 ≤ h_θ(x) ≤ 1
Logistic Regression

 Let’s change the form of the hypothesis h_θ(x):

h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

g(z) = 1 / (1 + e^(−z))    Logistic function or sigmoid function
Logistic Regression

g(z) = 1 / (1 + e^(−z))

g(z) tends towards 1 as z → +∞

g(z) tends towards 0 as z → −∞

g(z) is always bounded between 0 and 1

h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

θᵀx = θ_0 + ∑_{j=1}^{n} θ_j x_j,    with x_0 = 1
Logistic Regression
Predicts something that is True or False, instead of something continuous like size.
Logistic Regression
Logistic Regression
Logistic Regression
Logistic Regression
Logistic Regression
Interpretation of Logistic Function

 h_θ(x) can be interpreted as the estimated probability that y = 1 on input x:

 h_θ(x) = P(y = 1 | x; θ)
 P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ)
Decision Boundary

 Assume:  y = 1 if h_θ(x) ≥ 0.5
          y = 0 if h_θ(x) < 0.5

h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2)

θ = [−3, 1, 1]ᵀ

Predict y = 1 if −3 + x_1 + x_2 ≥ 0

Decision boundary: x_1 + x_2 = 3
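A small sketch of this decision rule for θ = (−3, 1, 1); the helper name is illustrative.

import numpy as np

theta = np.array([-3.0, 1.0, 1.0])

def predict(x1, x2):
    z = theta @ np.array([1.0, x1, x2])  # theta^T x with x_0 = 1
    return int(z >= 0)                   # z >= 0 is equivalent to h_theta(x) >= 0.5

print(predict(2, 2))  # 1: x1 + x2 = 4 lies on the y = 1 side of x1 + x2 = 3
print(predict(1, 1))  # 0: x1 + x2 = 2 lies below the boundary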
Logistic Regression

 Given the logistic regression model, how do we fit θ for it?


Logistic Regression

 Training set: {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}
 m examples
 x = [x_0, x_1, ..., x_n]ᵀ,  with x_0 = 1
 y ∈ {0, 1}
 h_θ(x) = 1 / (1 + e^(−θᵀx))
Cost Function

 Linear Regression:  J(θ) = (1/2m) ∑_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

 Cost(h_θ(x^(i)), y^(i)) = ½ (h_θ(x^(i)) − y^(i))²
 Using this cost function for logistic regression gives a non-convex function
 Not guaranteed to converge to the global minimum
Cost Function

 Cost(h_θ(x), y) =  −log(h_θ(x))       if y = 1
                     −log(1 − h_θ(x))   if y = 0

 Cost = 0 if y = 1 and h_θ(x) = 1
 But as h_θ(x) → 0, Cost → ∞
 If h_θ(x) = 0, we predict that p(y = 1 | x; θ) = 0; if actually y = 1, the learning algorithm is penalized by a very large cost
Cost Function

 Cost(h_θ(x), y) =  −log(h_θ(x))       if y = 1
                     −log(1 − h_θ(x))   if y = 0

J(θ) = (1/m) ∑_{i=1}^{m} Cost(h_θ(x^(i)), y^(i))

(Plots of Cost versus h_θ(x) for the y = 1 and y = 0 cases.)
Cost Function

 Cost(h_θ(x), y) =  −log(h_θ(x))       if y = 1
                     −log(1 − h_θ(x))   if y = 0

 Only two values are possible, y = 1 or y = 0, so the two cases can be combined:
 Cost(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))

 J(θ) = (1/m) ∑_{i=1}^{m} Cost(h_θ(x^(i)), y^(i)) = −(1/m) ∑_{i=1}^{m} [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
 This is a convex function
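A minimal sketch of this cost written with NumPy; X, y and the small epsilon are illustrative choices rather than the slides' code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # X: (m, n) design matrix with a leading column of ones; y: (m,) vector of 0/1 labels
    m = len(y)
    h = sigmoid(X @ theta)
    eps = 1e-12   # keeps log() finite if h is exactly 0 or 1
    return -(1.0 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))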
Logistic Regression

 To fit the parameters θ:
 min_θ J(θ) to obtain θ

 To make a prediction on a given new x:

 Output h_θ(x) = 1 / (1 + e^(−θᵀx)) = p(y = 1 | x; θ)
Gradient Descent

 Want min_θ J(θ):

Repeat {

    θ_j := θ_j − η (∂/∂θ_j) J(θ)

} (Simultaneously update all θ_j)


Gradient Descent

J(θ) = (1/m) ∑_{i=1}^{m} Cost(h_θ(x^(i)), y^(i))
     = −(1/m) ∑_{i=1}^{m} [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]

(∂/∂θ_j) J(θ) = (1/m) ∑_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)
Gradient Descent

 Want min_θ J(θ):    h_θ(x) = 1 / (1 + e^(−θᵀx))

Repeat {

    θ_j := θ_j − η (1/m) ∑_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)

} (Simultaneously update all θ_j)
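A short batch gradient-descent sketch that follows this update rule; the learning rate eta, iteration count and function name are illustrative choices, not the slides' code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, n_iters=1000):
    # X: (m, n) design matrix that already contains the x_0 = 1 column; y: 0/1 labels
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = (1.0 / m) * (X.T @ (sigmoid(X @ theta) - y))
        theta -= eta * grad   # simultaneous update of all theta_j
    return theta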


Python Implementation

Gives probabilities.
To get the classes just use log_reg.predict()
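The slide's code is not reproduced here, but a typical scikit-learn call consistent with the note above might look like this; the data is a placeholder, log_reg matches the name on the slide, and predict_proba() is what returns the class probabilities.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 2)                 # placeholder features
y = (X[:, 0] + X[:, 1] > 1).astype(int)    # placeholder binary labels

log_reg = LogisticRegression()
log_reg.fit(X, y)

print(log_reg.predict_proba(X[:3]))   # estimated probabilities for each class
print(log_reg.predict(X[:3]))         # hard class labels (0 or 1)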
Example
Let’s apply Logistic Regression to the Iris dataset.
Iris is a famous dataset; it contains the sepal and petal length and width of 150 iris flowers.
There are three different species: Iris-Setosa, Iris-Versicolor and Iris-Virginica.
Iris Dataset
Example
Example
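One way the Iris example might be set up is sketched below, using the common petal-width/Virginica formulation; the exact feature choice used on the slides is not shown, so treat this as an assumption.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:]                 # petal width (cm) only
y = (iris.target == 2).astype(int)   # 1 if Iris-Virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)
print(log_reg.predict_proba([[1.7]]))  # [P(not Virginica), P(Virginica)] at petal width 1.7 cm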
Nonlinear Decision Boundary
Nonlinear Decision Boundary

Nonlinear Decision Boundary

h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2)

h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1² + θ_4 x_2²)

θ = [−1, 0, 0, 1, 1]ᵀ

Predict y = 1 if −1 + x_1² + x_2² ≥ 0

x_1² + x_2² ≥ 1    Decision boundary: x_1² + x_2² = 1
Multiclass Classification

 What if y can take one of k classes, y ∈ {1, 2, ..., k}?

 Example: weather prediction {sunny, cloudy, snowy, ...}
One-vs-all Classifier (one-vs-rest)

Class 1, Class 2, Class 3:

(Figure: three binary classifiers h_θ^(1)(x), h_θ^(2)(x), h_θ^(3)(x), each separating one class from the rest.)
One-vs-all Classifier (one-vs-rest)

 h_θ^(i)(x) = p(y = i | x; θ)  for i = 1, 2, 3
 One-vs-all:
 Train a logistic regression classifier h_θ^(i)(x) for each class i to predict the probability that y = i
 On a new input x, to make a prediction, pick the class i that maximizes h_θ^(i)(x)
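A sketch of one-vs-all built by hand from binary logistic regressions; scikit-learn's OneVsRestClassifier does the same job, and the helper names here are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_all_fit(X, y, classes):
    # one binary classifier per class: relabel y as 1 for that class, 0 for the rest
    return {c: LogisticRegression().fit(X, (y == c).astype(int)) for c in classes}

def one_vs_all_predict(models, X):
    # pick, for each row of X, the class whose classifier reports the highest P(y = 1)
    classes = list(models)
    probs = np.column_stack([models[c].predict_proba(X)[:, 1] for c in classes])
    return np.array(classes)[np.argmax(probs, axis=1)]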
Softmax Regression

 Multi-class classification
 Multinomial Logistic Regression
 Generalization of logistic regression
 𝑦 ∈ {1,2, … , 𝐾}
Softmax Regression

 Given a test input x, estimate the probability P(y = k | x) for each value of k = 1, ..., K
 Estimate the probability of the class label taking on each of the 𝐾 different
possible values.
 Our hypothesis will output a 𝐾-dimensional vector (whose elements sum to
1) giving 𝐾 estimated probabilities.
Softmax Regression: Hypothesis

 The hypothesis h_θ(x) takes the form:

h_θ(x) = [ P(y = 1 | x; θ), P(y = 2 | x; θ), ..., P(y = K | x; θ) ]ᵀ
       = (1 / ∑_{j=1}^{K} exp(θ^(j)ᵀ x)) · [ exp(θ^(1)ᵀ x), exp(θ^(2)ᵀ x), ..., exp(θ^(K)ᵀ x) ]ᵀ

 θ^(1), θ^(2), ..., θ^(K) ∈ ℝⁿ are the parameters of the model
 The factor 1 / ∑_{j=1}^{K} exp(θ^(j)ᵀ x) normalizes the distribution so that it sums to one
 θ = [ θ^(1)  θ^(2)  ...  θ^(K) ] is an n-by-K matrix whose columns are the θ^(k)
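A minimal sketch of this hypothesis as a function; Theta stores the θ^(k) as columns, as above, and the max-subtraction is a standard numerical-stability trick, not part of the formula.

import numpy as np

def softmax_hypothesis(Theta, x):
    # Theta: (n, K) parameter matrix whose columns are theta^(k); x: (n,) input
    scores = Theta.T @ x              # K scores theta^(k)^T x
    scores -= scores.max()            # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # K probabilities that sum to 1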
Softmax Regression: Cost Function

 1{·} is the indicator function:
 1{True statement} = 1 and 1{False statement} = 0
 Minimize the cost function:

J(θ) = − ∑_{i=1}^{m} ∑_{k=1}^{K} 1{y^(i) = k} log( exp(θ^(k)ᵀ x^(i)) / ∑_{j=1}^{K} exp(θ^(j)ᵀ x^(i)) )

P(y^(i) = k | x^(i); θ) = exp(θ^(k)ᵀ x^(i)) / ∑_{j=1}^{K} exp(θ^(j)ᵀ x^(i))
Softmax Regression

 We cannot solve for the minimum of J(θ) analytically

 Use an iterative optimization algorithm
 Taking derivatives, one can show that the gradient is:

∇_{θ^(k)} J(θ) = − ∑_{i=1}^{m} x^(i) ( 1{y^(i) = k} − P(y^(i) = k | x^(i); θ) )

P(y^(i) = k | x^(i); θ) = exp(θ^(k)ᵀ x^(i)) / ∑_{j=1}^{K} exp(θ^(j)ᵀ x^(i))
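A compact sketch of the cost and gradient above in matrix form; it assumes integer labels 0..K-1 (the formulas use 1..K) and a design matrix X with a bias column.

import numpy as np

def softmax_probs(X, Theta):
    scores = X @ Theta                                   # (m, K) matrix of theta^(k)^T x^(i)
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def cost_and_grad(X, y, Theta):
    # X: (m, n) with a bias column; y: integer labels 0..K-1; Theta: (n, K)
    K = Theta.shape[1]
    P = softmax_probs(X, Theta)
    Y = np.eye(K)[y]                  # one-hot rows encode the indicator 1{y(i) = k}
    cost = -np.sum(Y * np.log(P + 1e-12))
    grad = -X.T @ (Y - P)             # column k is the gradient with respect to theta^(k)
    return cost, grad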
Example: Iris dataset
Python Implementation

 Scikit-Learn’s LogisticRegression uses one-versus-all by default

 Switch to Softmax Regression by setting multi_class="multinomial"
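A sketch of that switch on the Iris data; the feature choice, C=10 and the sample point are illustrative, and very recent scikit-learn releases may warn about the multi_class parameter since multinomial has become the default behaviour there.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data[:, 2:], iris.target   # petal length and width; an assumed feature choice

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)
print(softmax_reg.predict_proba([[5, 2]]))  # one probability per species, summing to 1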
Example: MNIST dataset

 70,000 images of digits handwritten by high school students and employees of the US Census Bureau
 Each image is 28x28 pixels (hence has 784 features)
 Each feature simply represents one pixel’s intensity, from 0 (white) to 255 (black)
 The dataset is already split into a training set (the first 60,000 images) and a test set (the last 10,000 images)
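One way this setup might be reproduced is sketched below; fetch_openml downloads the data (internet access required), the solver settings are assumptions, and the resulting accuracy need not match the numbers reported further down exactly.

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

mnist = fetch_openml("mnist_784", version=1, as_frame=False)  # ~70,000 images, 784 features each
X = mnist.data / 255.0                   # scale pixel intensities from [0, 255] to [0, 1]
y = mnist.target.astype(int)

X_train, X_test = X[:60000], X[60000:]   # the standard 60,000 / 10,000 split
y_train, y_test = y[:60000], y[60000:]

softmax_reg = LogisticRegression(multi_class="multinomial", max_iter=100)
softmax_reg.fit(X_train, y_train)
print("Test accuracy:", softmax_reg.score(X_test, y_test))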
Example: MNIST dataset
Example: MNIST dataset
Learned Weights
Example: MNIST dataset

 Illustration of the decrease of the cost function over time

 Training Accuracy: 0.9209666666666667


 Test Accuracy: 0.9206
Question

 Suppose you train a logistic regression classifier and your hypothesis function H is

 Which of the following figures represents the decision boundary given by the above classifier?
Question

 Imagine you are given the graph below of logistic regression, which shows the relationship
between the cost function and the number of iterations for 3 different learning rate values
(the different colors show the curves for the different learning rates).
Suppose you saved the graph for future reference but forgot to save the values of the
different learning rates for this graph. Now you want to find out the relation between
the learning rate values of these curves. Which of the following is the true relation?
Note:
1. The learning rate for blue is l1
2. The learning rate for red is l2
3. The learning rate for green is l3

A) l1>l2>l3
B) l1 = l2 = l3
C) l1 < l2 < l3
D) None of these
Question

 Suppose you are dealing with a 4-class classification problem and you want to train a logistic
regression model on the data; for that you are using the One-vs-all method.
 How many logistic regression models do we need to fit for this 4-class classification problem?
