Lecture 6

1) The lecture discusses linear classification using logistic regression. Logistic regression models the probabilities of the different classes using a logistic function of a linear combination of the inputs.

2) Like linear regression, logistic regression learns the model parameters through maximum likelihood. However, a closed-form solution is not possible for logistic regression due to the nonlinear logistic function.

3) The maximum likelihood estimate finds the parameters that maximize the probability of the training data. This involves minimizing the negative log-likelihood of the training data given the model.


APL 405: Machine Learning for Mechanics

Lecture 6: Linear classification (logistic regression)

by

Rajdip Nayek
Assistant Professor,
Applied Mechanics Department,
IIT Delhi

Instructor email: [email protected]


Recap of last lecture
• We introduced the linear regression model, which is a parametric model, for solving the regression problem

• Now we will look at basic parametric modelling techniques, particularly


• Linear regression (covered in last lecture)
• Logistic regression

• Linear regression
• A loss-based perspective, using least squares error

• A statistical perspective based on maximum likelihood, where the log-likelihood function was used

• A closed form solution was derived

• One-hot encoding to handle categorical inputs

• We will see that in logistic regression, we will not obtain a closed form solution
How to handle categorical input variables?
▪ We had mentioned earlier that input variables $\mathbf{x}$ can be numerical, categorical, or mixed

▪ Assume that an input variable is categorical and takes only two classes, say A and B

▪ We can represent such an input variable $x$ using 0 and 1:
$$x = \begin{cases} 0, & \text{if A} \\ 1, & \text{if B} \end{cases}$$

▪ For linear regression, the model effectively looks like


$$y = \theta_0 + \theta_1 x + \epsilon = \begin{cases} \theta_0 + \epsilon, & \text{if A} \\ \theta_0 + \theta_1 + \epsilon, & \text{if B} \end{cases}$$

▪ If the input is a categorical variable with more than two classes, let’s say A, B, C, and D, use one-hot encoding

$$\mathbf{x} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \text{ if A}, \qquad \mathbf{x} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} \text{ if B}, \qquad \mathbf{x} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \text{ if C}, \qquad \mathbf{x} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \text{ if D}$$
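A minimal NumPy sketch of this encoding (the function name and class list are illustrative, not from the lecture):

```python
import numpy as np

def one_hot(label, classes):
    """Return the one-hot vector for `label`, given the ordered list of classes."""
    vec = np.zeros(len(classes))
    vec[classes.index(label)] = 1.0
    return vec

classes = ["A", "B", "C", "D"]
print(one_hot("C", classes))  # [0. 0. 1. 0.]
```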
A statistical view of the Classification problem
▪ Classification → learn relationships between some input variables $\mathbf{x} = [x_1 \; x_2 \; \dots \; x_p]^T$ and a categorical output $y$
▪ The goal in classification is to take an input vector $\mathbf{x}$ and assign it to one of $M$ discrete classes $1, 2, \dots, M$

▪ From a statistical perspective, classification amounts to predicting the conditional class probabilities
$$p(y = m \mid \mathbf{x}), \qquad m \in \{1, 2, \dots, M\}$$

▪ $p(y = m \mid \mathbf{x})$ describes the probability of class $m$ given that we know the input $\mathbf{x}$

▪ A probability over output 𝑦 implies the output label 𝑦 is a random variable (r.v.)

▪ We consider $y$ as a r.v. because real-world data will always involve a certain amount of randomness (much like the output of linear regression, which was probabilistic due to the random error $\epsilon$)

A statistical view of the Classification problem
▪ How to construct a classifier which can not only predict classes but also learn the class probabilities $p(y \mid \mathbf{x})$?

▪ Consider the simplest case of binary classification 𝑀 = 2 and 𝑦 = −1 or 1


▪ In this binary classification case, $p(y = 1 \mid \mathbf{x})$ will be modelled by $g(\mathbf{x})$

▪ By the laws of probability,

$$p(y = 1 \mid \mathbf{x}) + p(y = -1 \mid \mathbf{x}) = 1$$

so $p(y = -1 \mid \mathbf{x})$ will be modelled by $1 - g(\mathbf{x})$

▪ Since $g(\mathbf{x})$ is a model for a probability, it is natural to require that $0 \le g(\mathbf{x}) \le 1$ for any $\mathbf{x}$


▪ For a multi-class problem, the classifier should return a vector-valued function $\mathbf{g}(\mathbf{x})$, where
$$\begin{bmatrix} p(y = 1 \mid \mathbf{x}) \\ p(y = 2 \mid \mathbf{x}) \\ \vdots \\ p(y = M \mid \mathbf{x}) \end{bmatrix} \text{ is modelled by } \mathbf{g}(\mathbf{x}) = \begin{bmatrix} g_1(\mathbf{x}) \\ g_2(\mathbf{x}) \\ \vdots \\ g_M(\mathbf{x}) \end{bmatrix}$$
Since $\mathbf{g}(\mathbf{x})$ models a probability vector, each element $g_m(\mathbf{x}) \ge 0$ and $\sum_{m=1}^{M} g_m(\mathbf{x}) = 1$
Logistic Regression model for binary classification
▪ Logistic regression can be viewed as an extension of linear regression that does (binary) classification (instead of
regression)

▪ We wish to learn a function 𝑔(𝐱) that approximates the conditional probability of the positive class, 𝑝 𝑦 = 1|𝐱

▪ Idea of logistic regression: start from the linear regression model, without the noise term $\epsilon$
▪ Define the logit, $z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_p x_p = \mathbf{x}^T \boldsymbol{\theta}$
▪ The logit takes values on the entire real line, but we need a function that returns a value in the interval $[0, 1]$
▪ Squash the logit $z = \mathbf{x}^T \boldsymbol{\theta}$ into the interval $[0, 1]$ by using the logistic function
$$h(z) = \frac{e^z}{1 + e^z}$$

Logistic Regression
▪ Idea of logistic regression: start from the linear regression model, without the noise term $\epsilon$
▪ Define the logit, $z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_p x_p = \mathbf{x}^T \boldsymbol{\theta}$
▪ The logit takes values on the entire real line, but we need a function that returns a value in the interval $[0, 1]$
▪ Squash the logit $z = \mathbf{x}^T \boldsymbol{\theta}$ into the interval $[0, 1]$ by using the logistic function $h(z) = \frac{e^z}{1 + e^z}$

▪ Recall that $g(\mathbf{x})$ was used as a model for $p(y = 1 \mid \mathbf{x})$


▪ Using the logistic function for $g(\mathbf{x})$ restricts its values to the interval $[0, 1]$, so it can be interpreted as a probability
$$g(\mathbf{x}; \boldsymbol{\theta}) = \frac{e^{\mathbf{x}^T \boldsymbol{\theta}}}{1 + e^{\mathbf{x}^T \boldsymbol{\theta}}}$$

▪ It implicitly means that the model for $p(y = -1 \mid \mathbf{x})$ is

$$1 - g(\mathbf{x}; \boldsymbol{\theta}) = 1 - \frac{e^{\mathbf{x}^T \boldsymbol{\theta}}}{1 + e^{\mathbf{x}^T \boldsymbol{\theta}}} = \frac{1}{1 + e^{\mathbf{x}^T \boldsymbol{\theta}}} = \frac{e^{-\mathbf{x}^T \boldsymbol{\theta}}}{1 + e^{-\mathbf{x}^T \boldsymbol{\theta}}}$$
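As a quick numerical illustration, here is a minimal sketch of the two class probabilities (the input and parameter values are made up; $\mathbf{x}$ is assumed to carry a leading 1 so that $\theta_0$ acts as the intercept):

```python
import numpy as np

def g(x, theta):
    """Model for p(y = 1 | x): logistic function applied to the logit z = x^T theta."""
    z = x @ theta
    return np.exp(z) / (1.0 + np.exp(z))

x = np.array([1.0, 2.0, -1.0])       # leading 1 pairs with theta_0
theta = np.array([0.5, 1.0, 0.25])
p_pos = g(x, theta)                  # model for p(y = 1  | x)
p_neg = 1.0 - p_pos                  # model for p(y = -1 | x)
print(p_pos, p_neg, p_pos + p_neg)   # the two probabilities sum to 1
```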
Logistic Regression
▪ Logistic regression: essentially linear regression appended with the logistic function
▪ Logit, $z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_p x_p = \mathbf{x}^T \boldsymbol{\theta}$
$$p(y = 1 \mid \mathbf{x}; \boldsymbol{\theta}) = g(\mathbf{x}; \boldsymbol{\theta}) = \frac{e^{\mathbf{x}^T \boldsymbol{\theta}}}{1 + e^{\mathbf{x}^T \boldsymbol{\theta}}}, \qquad p(y = -1 \mid \mathbf{x}; \boldsymbol{\theta}) = 1 - g(\mathbf{x}; \boldsymbol{\theta}) = \frac{e^{-\mathbf{x}^T \boldsymbol{\theta}}}{1 + e^{-\mathbf{x}^T \boldsymbol{\theta}}}$$

▪ Logistic regression is a method for classification, not regression!

▪ The randomness in classification is statistically modelled by the class probability 𝑝 𝑦 = 𝑚|𝐱 , instead of additive noise 𝜖

▪ Like linear regression, logistic regression is also a parametric model, and we learn the parameters 𝜽 from training data

Training binary classification model with Maximum Likelihood
▪ Logistic function is a nonlinear function
▪ Therefore, a closed-form solution to logistic regression cannot be derived

▪ Maximum likelihood perspective of learning 𝜽 from training data


$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \; p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta})$$

▪ Similar to linear regression, we assume that the training data points are independent, and we consider the logarithm of
the likelihood function for numerical reasons
$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \ln p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \ln p\left(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\right) = \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^{N} -\ln p\left(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\right)$$

▪ Note that $p(y = 1 \mid \mathbf{x}; \boldsymbol{\theta})$ is modelled using $g(\mathbf{x}; \boldsymbol{\theta})$, which implies

$$-\ln p\left(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\right) = \begin{cases} -\ln g\left(\mathbf{x}^{(i)}; \boldsymbol{\theta}\right), & \text{if } y^{(i)} = 1 \\[4pt] -\ln\left(1 - g\left(\mathbf{x}^{(i)}; \boldsymbol{\theta}\right)\right), & \text{if } y^{(i)} = -1 \end{cases}$$

Training binary classification model with Maximum Likelihood
▪ Assume that the training data points are independent, and we consider the logarithm of the likelihood function for
numerical reasons
$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \ln p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \ln p\left(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\right) = \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^{N} -\ln p\left(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\right)$$

▪ $p(y = 1 \mid \mathbf{x}; \boldsymbol{\theta})$ is modelled using $g(\mathbf{x}; \boldsymbol{\theta})$

$$-\ln p\left(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\right) = \begin{cases} -\ln g\left(\mathbf{x}^{(i)}; \boldsymbol{\theta}\right), & \text{if } y^{(i)} = 1 \\[4pt] -\ln\left(1 - g\left(\mathbf{x}^{(i)}; \boldsymbol{\theta}\right)\right), & \text{if } y^{(i)} = -1 \end{cases}$$

This is the cross-entropy loss function, $L\left(y^{(i)}, g\left(\mathbf{x}^{(i)}; \boldsymbol{\theta}\right)\right)$

▪ The cross-entropy loss can be used for any binary classifier that predicts class probabilities $g(\mathbf{x}; \boldsymbol{\theta})$, not just logistic regression

▪ The corresponding cost function (or average loss function)


$$J(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} -\ln g\left(\mathbf{x}^{(i)}; \boldsymbol{\theta}\right), & \text{if } y^{(i)} = 1 \\[4pt] -\ln\left(1 - g\left(\mathbf{x}^{(i)}; \boldsymbol{\theta}\right)\right), & \text{if } y^{(i)} = -1 \end{cases}$$
Training Logistic Regression model with Maximum Likelihood
▪ We can write the cost function in more detail for logistic regression
For $y^{(i)} = 1$:
$$g\left(\mathbf{x}^{(i)}; \boldsymbol{\theta}\right) = \frac{e^{\mathbf{x}^{(i)T} \boldsymbol{\theta}}}{1 + e^{\mathbf{x}^{(i)T} \boldsymbol{\theta}}} = \frac{e^{y^{(i)} \mathbf{x}^{(i)T} \boldsymbol{\theta}}}{1 + e^{y^{(i)} \mathbf{x}^{(i)T} \boldsymbol{\theta}}}$$

For $y^{(i)} = -1$:
$$1 - g\left(\mathbf{x}^{(i)}; \boldsymbol{\theta}\right) = \frac{1}{1 + e^{\mathbf{x}^{(i)T} \boldsymbol{\theta}}} = \frac{e^{-\mathbf{x}^{(i)T} \boldsymbol{\theta}}}{1 + e^{-\mathbf{x}^{(i)T} \boldsymbol{\theta}}} = \frac{e^{y^{(i)} \mathbf{x}^{(i)T} \boldsymbol{\theta}}}{1 + e^{y^{(i)} \mathbf{x}^{(i)T} \boldsymbol{\theta}}}$$

▪ Hence, we get the same expression in both cases and can write the cost function compactly as:
$$J(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} -\ln g\left(\mathbf{x}^{(i)}; \boldsymbol{\theta}\right), & \text{if } y^{(i)} = 1 \\[4pt] -\ln\left(1 - g\left(\mathbf{x}^{(i)}; \boldsymbol{\theta}\right)\right), & \text{if } y^{(i)} = -1 \end{cases} = \frac{1}{N} \sum_{i=1}^{N} -\ln \frac{e^{y^{(i)} \mathbf{x}^{(i)T} \boldsymbol{\theta}}}{1 + e^{y^{(i)} \mathbf{x}^{(i)T} \boldsymbol{\theta}}} = \frac{1}{N} \sum_{i=1}^{N} -\ln \frac{1}{1 + e^{-y^{(i)} \mathbf{x}^{(i)T} \boldsymbol{\theta}}} = \frac{1}{N} \sum_{i=1}^{N} \ln\left(1 + e^{-y^{(i)} \mathbf{x}^{(i)T} \boldsymbol{\theta}}\right)$$
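The equivalence of the case-by-case form and the compact form can be checked numerically; below is a small sketch (the toy data and seed are illustrative assumptions, not from the lecture):

```python
import numpy as np

def cost_cases(X, y, theta):
    """Cross-entropy cost written out in the two cases y = +1 and y = -1."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))            # g(x; theta) for each row of X
    return np.mean(np.where(y == 1, -np.log(p), -np.log(1.0 - p)))

def cost_compact(X, y, theta):
    """Equivalent compact logistic loss: mean of ln(1 + exp(-y * x^T theta))."""
    return np.mean(np.log1p(np.exp(-y * (X @ theta))))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(4), rng.normal(size=(4, 2))])  # intercept column + 2 features
y = np.array([1, -1, 1, -1])
theta = rng.normal(size=3)
print(np.isclose(cost_cases(X, y, theta), cost_compact(X, y, theta)))  # True
```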
Training Logistic Regression model with Maximum Likelihood
▪ Cost function in logistic regression is given by:
$$J(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \ln\left(1 + e^{-y^{(i)} \mathbf{x}^{(i)T} \boldsymbol{\theta}}\right)$$

This is the logistic loss function, $L\left(y^{(i)}, \mathbf{x}^{(i)}; \boldsymbol{\theta}\right)$

▪ The logistic loss above is a special case of the cross-entropy loss

▪ Learning a logistic regression model thus amounts to solving the optimization problem:
$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \arg\min_{\boldsymbol{\theta}} \frac{1}{N} \sum_{i=1}^{N} \ln\left(1 + e^{-y^{(i)} \mathbf{x}^{(i)T} \boldsymbol{\theta}}\right)$$

▪ Unlike linear regression with the squared-error loss, the above problem has no closed-form solution, so we have to use numerical optimization instead
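One simple numerical option is plain gradient descent on $J(\boldsymbol{\theta})$; below is a sketch (the step size and iteration count are illustrative choices, and in practice a library optimiser or a Newton-type method would typically be used):

```python
import numpy as np

def grad_J(X, y, theta):
    """Gradient of J(theta) = mean_i ln(1 + exp(-y_i x_i^T theta)).
    Each data point contributes -y_i * x_i * sigmoid(-y_i x_i^T theta)."""
    s = 1.0 / (1.0 + np.exp(y * (X @ theta)))   # sigmoid(-y_i x_i^T theta)
    return -(X * (y * s)[:, None]).mean(axis=0)

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Minimise the logistic loss by gradient descent, starting from theta = 0."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta -= lr * grad_J(X, y, theta)
    return theta
```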

Predictions using Logistic Regression
▪ Logistic regression predicts class probabilities for a test input 𝐱 ∗
▪ by first learning 𝜽 from training data, and
▪ then computing $g(\mathbf{x}^*)$, which is the model for $p(y^* = 1 \mid \mathbf{x}^*)$

▪ However, sometimes we want to make a “hard” prediction for the test input 𝐱 ∗
▪ E.g., in binary classification, is $\hat{y}(\mathbf{x}^*) = 1$ or $\hat{y}(\mathbf{x}^*) = -1$?
▪ Recall, in 𝑘NN and decision trees, we made “hard” predictions

▪ To make hard predictions with logistic regression model, we add a final step, in which the predicted probabilities are
turned into a class prediction

▪ The most common approach is to let 𝑦ො 𝐱 ∗ be the most probable class ← the class having the highest probability

▪ For binary classification, we can express this as
$$\hat{y}(\mathbf{x}^*) = \begin{cases} 1, & \text{if } g(\mathbf{x}^*) > r \\ -1, & \text{if } g(\mathbf{x}^*) \le r \end{cases}$$
with decision threshold $r = 0.5$; the choice $r = 0.5$ minimises the so-called misclassification rate (why?)
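A minimal sketch of this final hard-prediction step (the function and variable names are illustrative):

```python
import numpy as np

def predict_class(x_star, theta, r=0.5):
    """Threshold the predicted probability g(x*) to get a hard class prediction."""
    g = 1.0 / (1.0 + np.exp(-(x_star @ theta)))  # g(x*) models p(y* = 1 | x*)
    return 1 if g > r else -1
```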
Decision Boundaries of Logistic Regression
▪ Decision boundary ← the point(s) where the prediction changes from one class to another

[Figure: the grey plane is the decision boundary]

▪ The decision boundary for binary classification can be computed by solving the equation
$$g(\mathbf{x}) = 1 - g(\mathbf{x}), \quad \text{meaning} \quad p(y = 1 \mid \mathbf{x}; \boldsymbol{\theta}) = p(y = -1 \mid \mathbf{x}; \boldsymbol{\theta})$$
▪ The solutions to this equation are points in the input space for which the two classes are predicted to be equally probable
Decision Boundaries of Logistic Regression
▪ The decision boundary for binary classification can be computed by solving the equation
$$g(\mathbf{x}) = 1 - g(\mathbf{x}), \quad \text{meaning} \quad p(y = 1 \mid \mathbf{x}; \boldsymbol{\theta}) = p(y = -1 \mid \mathbf{x}; \boldsymbol{\theta})$$
▪ The solutions to this equation are points in the input space for which the two classes are predicted to be equally probable
▪ For binary logistic regression, it means
$$\frac{e^{\mathbf{x}^T \boldsymbol{\theta}}}{1 + e^{\mathbf{x}^T \boldsymbol{\theta}}} = \frac{1}{1 + e^{\mathbf{x}^T \boldsymbol{\theta}}} \quad \Longleftrightarrow \quad e^{\mathbf{x}^T \boldsymbol{\theta}} = 1 \quad \Longleftrightarrow \quad \mathbf{x}^T \boldsymbol{\theta} = 0$$

▪ The equation 𝐱 𝑇 𝜽 = 0 parameterises a (linear) hyperplane


▪ Therefore, the decision boundaries in logistic regression always have the shape of a (linear) hyperplane
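For a hypothetical two-dimensional input with an intercept term, the boundary $\mathbf{x}^T \boldsymbol{\theta} = 0$ can be traced out explicitly (the parameter values below are made up for illustration):

```python
import numpy as np

theta = np.array([-1.0, 2.0, 3.0])   # [theta_0, theta_1, theta_2], hypothetical values

# On the boundary, theta_0 + theta_1*x1 + theta_2*x2 = 0, so for theta_2 != 0:
x1 = np.linspace(-2.0, 2.0, 5)
x2 = -(theta[0] + theta[1] * x1) / theta[2]   # points on the line x^T theta = 0
print(np.column_stack([x1, x2]))
```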

Prediction and Decision Boundaries of Logistic Regression
▪ For binary classification, we can express this as
$$\hat{y}(\mathbf{x}^*) = \begin{cases} 1, & \text{if } g(\mathbf{x}^*) > r \\ -1, & \text{if } g(\mathbf{x}^*) \le r \end{cases}$$
with decision threshold $r = 0.5$
▪ Choosing 𝑟 = 0.5 minimises the so-called misclassification rate

▪ The decision boundary for logistic regression lies at 𝐱 𝑇 𝜽 = 0


⟹ The sign of the expression $\mathbf{x}^T \boldsymbol{\theta}$ determines whether we are predicting the positive (1) or the negative (−1) class

▪ Compactly, one can write the test output prediction for a test input $\mathbf{x}^*$ from a logistic regression as
$$\hat{y}(\mathbf{x}^*) = \operatorname{sign}\left(\mathbf{x}^{*T} \boldsymbol{\theta}\right)$$

Linear vs Non-linear classifiers
▪ A classifier whose decision boundaries are linear hyperplanes is a linear classifier
▪ Logistic regression is a linear classifier
▪ 𝑘NN and Decision Trees are non-linear classifiers

[Figure: example decision boundaries in the $(x_1, x_2)$ plane. Left: a linear classifier. Right: a non-linear classifier]

▪ Note that the term ‘linear’ has a different sense for linear regression and for linear classification
▪ Linear regression is a model that is linear in its parameters
▪ A linear classifier is a model whose decision boundaries are linear
Logistic Regression for more than two classes
▪ For the binary problem, we used the logistic function to design a model for 𝑔 𝐱
▪ $g(\mathbf{x})$ is a scalar-valued function representing $p(y = 1 \mid \mathbf{x})$

▪ For a multi-class problem ($M$ classes), the classifier should return a vector-valued function $\mathbf{g}(\mathbf{x})$, where
$$\begin{bmatrix} p(y = 1 \mid \mathbf{x}) \\ p(y = 2 \mid \mathbf{x}) \\ \vdots \\ p(y = M \mid \mathbf{x}) \end{bmatrix} \text{ is modelled by } \mathbf{g}(\mathbf{x}) = \begin{bmatrix} g_1(\mathbf{x}) \\ g_2(\mathbf{x}) \\ \vdots \\ g_M(\mathbf{x}) \end{bmatrix}$$
Since $\mathbf{g}(\mathbf{x})$ models a probability vector, each element $g_m(\mathbf{x}) \ge 0$ and $\sum_{m=1}^{M} g_m(\mathbf{x}) = 1$

▪ For this purpose, we define $M$ different logits, $z_m = \boldsymbol{\theta}_m^T \mathbf{x}$, $m = 1, 2, \dots, M$

▪ Then use the softmax function (a vector-valued generalization of the logistic function)

$$\operatorname{softmax}(\mathbf{z}) \triangleq \frac{1}{\sum_{m=1}^{M} e^{z_m}} \begin{bmatrix} e^{z_1} \\ e^{z_2} \\ \vdots \\ e^{z_M} \end{bmatrix}$$

• $\mathbf{z}$ is an $M$-dimensional vector
• $\operatorname{softmax}(\mathbf{z})$ also returns a vector of the same dimension
• By construction, the output vector always sums to 1, and each element is always $\ge 0$
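A minimal softmax sketch in NumPy (subtracting $\max(\mathbf{z})$ before exponentiating is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    """Map an M-dimensional vector of logits to a probability vector."""
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

z = np.array([2.0, 0.5, -1.0])
p = softmax(z)
print(p, p.sum())   # each element >= 0, and the elements sum to 1
```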
Multi-class Logistic Regression model
▪ We have now combined linear regression and softmax function to model multi-class probabilities
$$\mathbf{g}(\mathbf{z}) = \operatorname{softmax}(\mathbf{z}), \quad \text{where} \quad \mathbf{z} = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_M \end{bmatrix} = \begin{bmatrix} \boldsymbol{\theta}_1^T \mathbf{x} \\ \boldsymbol{\theta}_2^T \mathbf{x} \\ \vdots \\ \boldsymbol{\theta}_M^T \mathbf{x} \end{bmatrix}$$

▪ Equivalently, we can write out the individual class probabilities, that is, the elements $g_m(\mathbf{x})$ of the vector:
$$g_m(\mathbf{x}) = \frac{e^{\boldsymbol{\theta}_m^T \mathbf{x}}}{\sum_{j=1}^{M} e^{\boldsymbol{\theta}_j^T \mathbf{x}}}, \qquad m = 1, 2, \dots, M$$

▪ This is the multiclass logistic regression model

▪ Note that this construction uses 𝑀 parameter vectors 𝜽1 , … , 𝜽𝑀 (one for each class)
▪ Note the number of parameters to learn grows with 𝑀
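A minimal sketch of the full model (the parameter values are illustrative; the rows of `Theta` play the role of the $M$ vectors $\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_M$):

```python
import numpy as np

def multiclass_probs(x, Theta):
    """Class probabilities g_m(x), m = 1..M, for an (M, p) parameter matrix Theta."""
    z = Theta @ x                # M logits, z_m = theta_m^T x
    e = np.exp(z - np.max(z))    # numerically stable softmax
    return e / e.sum()

Theta = np.array([[1.0, 0.0],    # M = 3 classes, p = 2 features (made-up numbers)
                  [0.0, 1.0],
                  [-1.0, -1.0]])
x = np.array([0.5, 2.0])
print(multiclass_probs(x, Theta))   # a length-3 probability vector summing to 1
```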
