Lecture 6
by
Rajdip Nayek
Assistant Professor,
Applied Mechanics Department,
IIT Delhi
• Linear regression
• A loss-based perspective, using the least-squares error
• A statistical perspective based on maximum likelihood, where the log-likelihood function was used
• We will see that in logistic regression, we will not obtain a closed-form solution
How to handle categorical input variables?
▪ We had mentioned earlier that input variables $\mathbf{x}$ can be numerical, categorical, or mixed
▪ Assume that an input variable is categorical and takes only two classes, say A and B
▪ We can represent such an input variable $x$ using 1 and 0:
$$x = \begin{cases} 0, & \text{if A} \\ 1, & \text{if B} \end{cases}$$
▪ If the input is a categorical variable with more than two classes, let’s say A, B, C, and D, use one-hot encoding
$$\mathbf{x} = \begin{bmatrix}1\\0\\0\\0\end{bmatrix} \text{ if A}, \quad \mathbf{x} = \begin{bmatrix}0\\1\\0\\0\end{bmatrix} \text{ if B}, \quad \mathbf{x} = \begin{bmatrix}0\\0\\1\\0\end{bmatrix} \text{ if C}, \quad \mathbf{x} = \begin{bmatrix}0\\0\\0\\1\end{bmatrix} \text{ if D}$$
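To make the encoding concrete, here is a minimal NumPy sketch (my illustration, not part of the slides); the category list and the helper name `one_hot` are assumptions for the example.

```python
import numpy as np

def one_hot(label, categories=("A", "B", "C", "D")):
    # hypothetical helper: return a one-hot vector for `label`
    x = np.zeros(len(categories))
    x[categories.index(label)] = 1.0
    return x

print(one_hot("A"))   # [1. 0. 0. 0.]
print(one_hot("C"))   # [0. 0. 1. 0.]
```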
A statistical view of the Classification problem
▪ Classification → learn relationships between some input variables $\mathbf{x} = [x_1\; x_2\; \dots\; x_p]^T$ and a categorical output $y$
▪ The goal in classification is to take an input vector $\mathbf{x}$ and to assign it to one of $M$ discrete classes $1, 2, \dots, M$
▪ From a statistical perspective, classification amounts to predicting the conditional class probabilities
$$p(y = m \mid \mathbf{x}), \qquad m = 1, 2, \dots, M$$
▪ $p(y = m \mid \mathbf{x})$ describes the probability for class $m$ given that we know the input $\mathbf{x}$
▪ A probability over output 𝑦 implies the output label 𝑦 is a random variable (r.v.)
▪ We consider $y$ a r.v. because real-world data will always involve a certain amount of randomness (much like the output of linear regression, which was probabilistic due to the random error $\epsilon$)
▪ How to construct a classifier which can not only predict classes but also learn the class probabilities $p(y \mid \mathbf{x})$?
▪ We wish to learn a function $g(\mathbf{x})$ that approximates the conditional probability of the positive class, $p(y = 1 \mid \mathbf{x})$
Logistic Regression
▪ Idea of Logistic Regression: we start with the linear regression model, but without the noise term $\epsilon$
▪ Define the logit, $z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p = \mathbf{x}^T \boldsymbol{\theta}$
▪ The logit takes values on the entire real line, but we need a function that returns a value in the interval $(0, 1)$
▪ Squash the logit $z = \mathbf{x}^T \boldsymbol{\theta}$ into the interval $(0, 1)$ by using the logistic function
$$h(z) = \frac{e^z}{1 + e^z}$$
▪ The randomness in classification is statistically modelled by the class probability $p(y = m \mid \mathbf{x})$, instead of additive noise $\epsilon$
▪ Like linear regression, logistic regression is also a parametric model, and we learn the parameters 𝜽 from training data
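As a small illustration (not from the slides), the sketch below computes the logit $z = \mathbf{x}^T\boldsymbol{\theta}$ and squashes it with the logistic function; the parameter and input values are made up, and $h(z) = e^z/(1+e^z)$ is written in the equivalent form $1/(1+e^{-z})$.

```python
import numpy as np

def logistic(z):
    # h(z) = e^z / (1 + e^z), written equivalently as 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2, 2.0])   # assumed parameters [theta_0, theta_1, theta_2]
x = np.array([1.0, 0.3, -0.7])       # assumed input, with a leading 1 for the intercept
z = x @ theta                        # logit z = x^T theta
print(logistic(z))                   # a value in (0, 1), modelling p(y = 1 | x)
```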
Training binary classification model with Maximum Likelihood
▪ The logistic function is a nonlinear function of the parameters $\boldsymbol{\theta}$
▪ Therefore, a closed-form solution for logistic regression cannot be derived
▪ Similar to linear regression, we assume that the training data points are independent, and we consider the logarithm of
the likelihood function for numerical reasons
$$\widehat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \ln p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \ln p\big(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\big) = \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^{N} -\ln p\big(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\big)$$
$$-\ln p\big(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\big) = \begin{cases} -\ln g\big(\mathbf{x}^{(i)}; \boldsymbol{\theta}\big) & \text{if } y^{(i)} = 1 \\ -\ln\big(1 - g(\mathbf{x}^{(i)}; \boldsymbol{\theta})\big) & \text{if } y^{(i)} = -1 \end{cases}$$
▪ $p(y = 1 \mid \mathbf{x}; \boldsymbol{\theta})$ is modelled using $g(\mathbf{x}; \boldsymbol{\theta})$
▪ The cross-entropy loss can be used for any binary classifier that predicts class probabilities $g(\mathbf{x}; \boldsymbol{\theta})$, not just logistic regression
For $y^{(i)} = -1$:
$$1 - g\big(\mathbf{x}^{(i)}; \boldsymbol{\theta}\big) = \frac{1}{1 + e^{\mathbf{x}^{(i)T}\boldsymbol{\theta}}} = \frac{e^{-\mathbf{x}^{(i)T}\boldsymbol{\theta}}}{1 + e^{-\mathbf{x}^{(i)T}\boldsymbol{\theta}}} = \frac{e^{y^{(i)}\mathbf{x}^{(i)T}\boldsymbol{\theta}}}{1 + e^{y^{(i)}\mathbf{x}^{(i)T}\boldsymbol{\theta}}}$$
▪ Hence, we get the same expression in both cases and can write the cost function compactly as:
$$J(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} \begin{cases} -\ln g\big(\mathbf{x}^{(i)}; \boldsymbol{\theta}\big) & \text{if } y^{(i)} = 1 \\ -\ln\big(1 - g(\mathbf{x}^{(i)}; \boldsymbol{\theta})\big) & \text{if } y^{(i)} = -1 \end{cases}$$
$$= -\frac{1}{N}\sum_{i=1}^{N} \ln \frac{e^{y^{(i)}\mathbf{x}^{(i)T}\boldsymbol{\theta}}}{1 + e^{y^{(i)}\mathbf{x}^{(i)T}\boldsymbol{\theta}}} = -\frac{1}{N}\sum_{i=1}^{N} \ln \frac{1}{1 + e^{-y^{(i)}\mathbf{x}^{(i)T}\boldsymbol{\theta}}} = \frac{1}{N}\sum_{i=1}^{N} \ln\Big(1 + e^{-y^{(i)}\mathbf{x}^{(i)T}\boldsymbol{\theta}}\Big)$$
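A minimal sketch (assumed toy data, not from the lecture) of evaluating the compact cost $J(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N}\ln\big(1 + e^{-y^{(i)}\mathbf{x}^{(i)T}\boldsymbol{\theta}}\big)$ with labels in $\{-1, +1\}$:

```python
import numpy as np

def cost(theta, X, y):
    # J(theta) = (1/N) * sum_i ln(1 + exp(-y_i * x_i^T theta)), with y_i in {-1, +1}
    margins = y * (X @ theta)
    return np.mean(np.log1p(np.exp(-margins)))

# assumed toy data; each row of X has a leading 1 for the intercept
X = np.array([[1.0, 0.5],
              [1.0, -1.5],
              [1.0, 2.0]])
y = np.array([1, -1, 1])
theta = np.array([0.1, 0.8])
print(cost(theta, X, y))
```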
Training Logistic Regression model with Maximum Likelihood
▪ Cost function in logistic regression is given by:
$$J(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} \ln\Big(1 + e^{-y^{(i)}\mathbf{x}^{(i)T}\boldsymbol{\theta}}\Big)$$
▪ Learning a logistic regression model thus amounts to solving the optimization problem:
$$\widehat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \arg\min_{\boldsymbol{\theta}} \frac{1}{N}\sum_{i=1}^{N} \ln\Big(1 + e^{-y^{(i)}\mathbf{x}^{(i)T}\boldsymbol{\theta}}\Big)$$
▪ In contrast to linear regression with the squared-error loss, the above problem has no closed-form solution, so we have to use numerical optimization instead
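Since there is no closed-form solution, $\widehat{\boldsymbol{\theta}}$ is found numerically. Below is a plain gradient-descent sketch (my illustration; the lecture does not prescribe a specific optimizer), using the gradient $\nabla J(\boldsymbol{\theta}) = -\frac{1}{N}\sum_{i=1}^{N} \frac{y^{(i)}\mathbf{x}^{(i)}}{1 + e^{y^{(i)}\mathbf{x}^{(i)T}\boldsymbol{\theta}}}$; the step size and iteration count are assumed, untuned values.

```python
import numpy as np

def grad(theta, X, y):
    # gradient of J: -(1/N) * sum_i y_i * x_i / (1 + exp(y_i * x_i^T theta))
    margins = y * (X @ theta)
    weights = -y / (1.0 + np.exp(margins))      # one scalar weight per data point
    return (X * weights[:, None]).mean(axis=0)

def fit_logistic(X, y, lr=0.1, n_iters=5000):
    # plain gradient descent; lr and n_iters are assumed, untuned values
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= lr * grad(theta, X, y)
    return theta

# assumed toy data (leading column of ones for the intercept), labels in {-1, +1}
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.5]])
y = np.array([-1, -1, 1, 1])
theta_hat = fit_logistic(X, y)
print(theta_hat)
```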
Predictions using Logistic Regression
▪ Logistic regression predicts class probabilities for a test input $\mathbf{x}^*$
▪ by first learning $\boldsymbol{\theta}$ from training data, and
▪ then computing $g(\mathbf{x}^*)$, which is the model for $p(y^* = 1 \mid \mathbf{x}^*)$
▪ However, sometimes we want to make a “hard” prediction for the test input 𝐱 ∗
▪ E.g., is $\hat{y}(\mathbf{x}^*) = 1$ or $\hat{y}(\mathbf{x}^*) = -1$ in binary classification?
▪ Recall, in 𝑘NN and decision trees, we made “hard” predictions
▪ To make hard predictions with logistic regression model, we add a final step, in which the predicted probabilities are
turned into a class prediction
▪ The most common approach is to let 𝑦ො 𝐱 ∗ be the most probable class ← the class having the highest probability
▪ For binary classification, we can express this as:
$$\hat{y}(\mathbf{x}^*) = \begin{cases} 1 & \text{if } g(\mathbf{x}^*) > r \\ -1 & \text{if } g(\mathbf{x}^*) \le r \end{cases} \qquad \text{with decision threshold } r = 0.5 \text{ (why?)}$$
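A small sketch (assumed parameters and test inputs) of turning predicted probabilities into hard class labels with the threshold $r = 0.5$:

```python
import numpy as np

def predict_proba(theta, X):
    # g(x*) = p(y* = 1 | x*), the logistic function of the logit x*^T theta
    return 1.0 / (1.0 + np.exp(-(X @ theta)))

def predict(theta, X, r=0.5):
    # hard prediction: +1 if g(x*) > r, otherwise -1
    return np.where(predict_proba(theta, X) > r, 1, -1)

theta = np.array([0.2, 1.5])                     # assumed learned parameters
X_test = np.array([[1.0, -1.0], [1.0, 0.5]])     # assumed test inputs (with intercept column)
print(predict(theta, X_test))                    # e.g. [-1  1]
```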
Decision Boundaries of Logistic Regression
▪ Decision boundary ← The point(s) where the prediction changes from one class to another
▪ The decision boundary for binary classification can be computed by solving the equation
$$g(\mathbf{x}) = 1 - g(\mathbf{x}) \qquad \text{meaning} \qquad p(y = 1 \mid \mathbf{x}; \boldsymbol{\theta}) = p(y = -1 \mid \mathbf{x}; \boldsymbol{\theta})$$
▪ The solutions to this equation are points in the input space for which the two classes are predicted to be equally probable
▪ For binary logistic regression, it means
$$\frac{e^{\mathbf{x}^T\boldsymbol{\theta}}}{1 + e^{\mathbf{x}^T\boldsymbol{\theta}}} = \frac{1}{1 + e^{\mathbf{x}^T\boldsymbol{\theta}}} \;\Longleftrightarrow\; e^{\mathbf{x}^T\boldsymbol{\theta}} = 1 \;\Longleftrightarrow\; \mathbf{x}^T\boldsymbol{\theta} = 0$$
Prediction and Decision Boundaries of Logistic Regression
▪ Choosing 𝑟 = 0.5 minimises the so-called misclassification rate
▪ Compactly, one can write the test output prediction for a test input $\mathbf{x}^*$ from logistic regression as
$$\hat{y}(\mathbf{x}^*) = \operatorname{sign}\big(\mathbf{x}^{*T}\boldsymbol{\theta}\big)$$
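A quick numerical check (my illustration, with made-up numbers) that thresholding $g(\mathbf{x}^*)$ at $r = 0.5$ gives the same labels as $\operatorname{sign}(\mathbf{x}^{*T}\boldsymbol{\theta})$:

```python
import numpy as np

theta = np.array([0.3, -0.8, 1.1])                 # assumed parameters
X = np.array([[1.0, 0.2, 0.1],
              [1.0, 2.0, 0.5],
              [1.0, -1.0, 1.5]])                   # assumed test inputs

g = 1.0 / (1.0 + np.exp(-(X @ theta)))             # predicted p(y = 1 | x)
pred_threshold = np.where(g > 0.5, 1, -1)          # threshold g at r = 0.5
pred_sign = np.where(X @ theta > 0, 1, -1)         # sign of the logit x^T theta
print(np.array_equal(pred_threshold, pred_sign))   # True
```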
Linear vs Non-linear classifiers
▪ A classifier whose decision boundaries are linear hyperplanes is a linear classifier
▪ Logistic regression is a linear classifier
▪ 𝑘NN and Decision Trees are non-linear classifiers
[Figure: decision boundaries in the $(x_1, x_2)$ plane; left panel: linear classifier, right panel: non-linear classifier]
▪ Note that the term ‘linear’ has a different sense for linear regression and for linear classification
▪ Linear regression is a model that is linear in its parameters
▪ A linear classifier is a model whose decision boundaries are linear
Logistic Regression for more than two classes
▪ For the binary problem, we used the logistic function to design a model for $g(\mathbf{x})$
▪ $g(\mathbf{x})$ is a scalar-valued function representing $p(y = 1 \mid \mathbf{x})$
▪ For a multi-class problem ($M$ classes), the classifier should return a vector-valued function $\boldsymbol{g}(\mathbf{x})$, where
$$\begin{bmatrix} p(y = 1 \mid \mathbf{x}) \\ p(y = 2 \mid \mathbf{x}) \\ \vdots \\ p(y = M \mid \mathbf{x}) \end{bmatrix} \text{ is modelled by } \boldsymbol{g}(\mathbf{x}) = \begin{bmatrix} g_1(\mathbf{x}) \\ g_2(\mathbf{x}) \\ \vdots \\ g_M(\mathbf{x}) \end{bmatrix}$$
Since $\boldsymbol{g}(\mathbf{x})$ models a probability vector, each element $g_m(\mathbf{x}) \ge 0$ and $\sum_{m=1}^{M} g_m(\mathbf{x}) = 1$
$$\operatorname{softmax}(\mathbf{z}) \triangleq \frac{1}{\sum_{m=1}^{M} e^{z_m}} \begin{bmatrix} e^{z_1} \\ e^{z_2} \\ \vdots \\ e^{z_M} \end{bmatrix}$$
• $\mathbf{z}$ is an $M$-dimensional vector
• $\operatorname{softmax}(\mathbf{z})$ also returns a vector of the same dimension
• By construction, the output vector always sums to 1, and each element is always $\ge 0$
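A minimal softmax sketch (my illustration); subtracting $\max(\mathbf{z})$ before exponentiating is a standard numerical-stability trick that leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    # softmax(z)_m = exp(z_m) / sum_j exp(z_j); shifting by max(z) avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, -0.5])   # assumed logit vector with M = 3
g = softmax(z)
print(g, g.sum())                # non-negative entries that sum to 1
```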
Multi-class Logistic Regression model
▪ We have now combined linear regression and softmax function to model multi-class probabilities
$$\boldsymbol{g}(\mathbf{z}) = \operatorname{softmax}(\mathbf{z}), \quad \text{where} \quad \mathbf{z} = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_M \end{bmatrix} = \begin{bmatrix} \boldsymbol{\theta}_1^T \mathbf{x} \\ \boldsymbol{\theta}_2^T \mathbf{x} \\ \vdots \\ \boldsymbol{\theta}_M^T \mathbf{x} \end{bmatrix}$$
▪ Equivalently, we can write out the individual class probabilities, that is, the elements $g_m(\mathbf{x})$ of the vector $\boldsymbol{g}(\mathbf{x})$
$$g_m(\mathbf{x}) = \frac{e^{\boldsymbol{\theta}_m^T \mathbf{x}}}{\sum_{j=1}^{M} e^{\boldsymbol{\theta}_j^T \mathbf{x}}}, \qquad m = 1, 2, \dots, M$$
▪ Note that this construction uses $M$ parameter vectors $\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_M$ (one for each class)
▪ Note that the number of parameters to be learned grows with $M$
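A sketch of the full multi-class model (assumed toy parameter matrix and input): one parameter vector $\boldsymbol{\theta}_m$ per class stacked as rows, logits $z_m = \boldsymbol{\theta}_m^T\mathbf{x}$, and class probabilities via softmax:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# assumed toy model: M = 3 classes, inputs with a leading 1 for the intercept
Theta = np.array([[ 0.1,  0.5, -0.3],   # row m holds theta_m^T
                  [-0.2,  0.1,  0.8],
                  [ 0.4, -0.6,  0.2]])
x = np.array([1.0, 0.7, -1.2])

z = Theta @ x                    # z_m = theta_m^T x, one logit per class
g = softmax(z)                   # g_m(x) = p(y = m | x)
print(g, g.sum())                # probability vector summing to 1
```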