Lecture 6: Linear Models for Classification

CS771: Introduction to Machine Learning


Plan today
▪ Wrapping up linear models for regression
▪ Linear models for classification
▪ Logistic and softmax classification
Gradient Descent for Linear/Ridge Regression
▪ Just use the GD algorithm with the gradient expressions we derived
▪ Iterative updates for linear regression will be of the form

$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t\,\mathbf{g}^{(t)} = \mathbf{w}^{(t)} + 2\eta_t \sum_{n=1}^{N} \left(y_n - \mathbf{w}^{(t)\top}\mathbf{x}_n\right)\mathbf{x}_n$

(Usually we work with the average gradient, so the gradient term is divided by $N$.)
Note the form of each term in the update: the amount of the current $\mathbf{w}$'s error on the $n$-th training example, multiplied by the input $\mathbf{x}_n$.
Unlike the closed-form solution $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ of least squares regression, here we have iterative updates but do not require the expensive inversion of the $D \times D$ matrix $\mathbf{X}^\top\mathbf{X}$ (thus faster).
▪ Similar updates for ridge regression as well (with the gradient expression being slightly different; left as an exercise)
▪ More on iterative optimization methods later
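
As a rough illustration, here is a minimal NumPy sketch of these updates on a small synthetic dataset; the dataset, the learning rate `eta`, and the iteration count are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)   # noisy linear data (assumed setup)

w = np.zeros(D)          # w^(0)
eta = 0.1                # step size (illustrative choice)
for t in range(500):
    residual = y - X @ w                  # (y_n - w^T x_n) for all n
    grad = -(2.0 / N) * X.T @ residual    # average gradient of the squared loss
    w = w - eta * grad                    # w^(t+1) = w^(t) - eta_t * g^(t)

# iterative solution matches the closed-form least squares solution
print(np.allclose(w, np.linalg.solve(X.T @ X, X.T @ y), atol=1e-3))
```
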
Evaluation Measures for Regression Models
▪ Plotting the prediction $\hat{y}_n$ vs the truth $y_n$ for the validation/test set
▪ Mean Squared Error (MSE) and Mean Absolute Error (MAE) on the val./test set

$\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}(y_n - \hat{y}_n)^2 \qquad \mathrm{MAE} = \frac{1}{N}\sum_{n=1}^{N}|y_n - \hat{y}_n|$

▪ RMSE (Root Mean Squared Error) $\triangleq \sqrt{\mathrm{MSE}}$
▪ Coefficient of determination or $R^2$

$R^2 = 1 - \frac{\sum_{n=1}^{N}(y_n - \hat{y}_n)^2}{\sum_{n=1}^{N}(y_n - \bar{y})^2}$

This is a "relative" error w.r.t. a model that makes the constant prediction $\bar{y}$ for all inputs, where $\bar{y} = \frac{1}{N}\sum_{n=1}^{N} y_n$ is the empirical mean of the true responses. A "base" model that always predicts the mean $\bar{y}$ will have $R^2 = 0$ and a perfect model will have $R^2 = 1$; models worse than the base model can even have negative $R^2$.

[Figure: plots of true vs predicted outputs and $R^2$ for two regression models; pic from MLAPP (Murphy)]
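
A minimal sketch of these metrics in NumPy, assuming `y_true` and `y_pred` are 1-D arrays of true and predicted responses on the validation/test set.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, MAE, RMSE and R^2 on a validation/test set."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(mse)
    # R^2: relative error w.r.t. a model that always predicts the mean of y_true
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, rmse, r2
```
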
Linear Regression as Solving a System of Linear Equations
▪ The form of the linear regression model $\mathbf{y} \approx \mathbf{X}\mathbf{w}$ is akin to a system of linear equations
▪ Assuming $N$ training examples with $D$ features each, we have

First training example: $y_1 = x_{11}w_1 + x_{12}w_2 + \ldots + x_{1D}w_D$
Second training example: $y_2 = x_{21}w_1 + x_{22}w_2 + \ldots + x_{2D}w_D$
...
$N$-th training example: $y_N = x_{N1}w_1 + x_{N2}w_2 + \ldots + x_{ND}w_D$

(Here $x_{nd}$ denotes the $d$-th feature of the $n$-th training example; we have $N$ equations and $D$ unknowns $w_1, w_2, \ldots, w_D$.)
▪ Usually we will have either $N > D$ or $N < D$
▪ Thus we have an underdetermined ($N < D$) or overdetermined ($N > D$) system
▪ Methods to solve over/underdetermined systems can be used for linear regression as well
▪ Many of these methods don't require expensive matrix inversion: solving $\mathbf{w} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ can be posed as solving the system $\mathbf{A}\mathbf{w} = \mathbf{b}$, where $\mathbf{A} = \mathbf{X}^\top\mathbf{X}$ and $\mathbf{b} = \mathbf{X}^\top\mathbf{y}$ (a system of $D$ equations in $D$ unknowns)
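
For illustration, a sketch of fitting linear/ridge regression by solving the $D \times D$ system $\mathbf{A}\mathbf{w} = \mathbf{b}$ rather than forming an explicit inverse; the helper name `fit_linreg` and the `lam` parameter are assumptions made for this example.

```python
import numpy as np

def fit_linreg(X, y, lam=0.0):
    """Solve A w = b with A = X^T X (+ lam*I for ridge), b = X^T y."""
    D = X.shape[1]
    A = X.T @ X + lam * np.eye(D)   # lam > 0 gives the ridge regression solution
    b = X.T @ y
    return np.linalg.solve(A, b)    # no explicit matrix inverse needed

# Alternatively, np.linalg.lstsq solves the over/underdetermined system y = X w directly:
# w, *_ = np.linalg.lstsq(X, y, rcond=None)
```
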
Linear Models for Classification
▪ A linear model $y = \mathbf{w}^\top\mathbf{x}$ can also be used for classification
▪ For binary classification, we can treat $\mathbf{w}^\top\mathbf{x}_n$ as the "score" of input $\mathbf{x}_n$ and either
▪ Threshold the score to get a binary label: $y_n = \mathrm{sign}(\mathbf{w}^\top\mathbf{x}_n)$
▪ Convert the score into a probability:

$\mu_n = p(y_n = 1 \mid \mathbf{x}_n, \mathbf{w}) = \sigma(\mathbf{w}^\top\mathbf{x}_n) = \frac{1}{1 + \exp(-\mathbf{w}^\top\mathbf{x}_n)} = \frac{\exp(\mathbf{w}^\top\mathbf{x}_n)}{1 + \exp(\mathbf{w}^\top\mathbf{x}_n)}$

The "sigmoid" function $\sigma(z)$ squashes a real number to the range (0, 1), with $\sigma(0) = 0.5$. Note that $\log\frac{\mu_n}{1-\mu_n} = \mathbf{w}^\top\mathbf{x}_n$ (the score) is also called the log-odds ratio, and often also the logits. This model is popularly known as the "logistic regression" (LR) model (a misnomer: it is not a regression model but a classification model); it is a probabilistic model for binary classification.
▪ Note: In LR, if we assume the label $y_n$ to be $-1/+1$ (not 0/1), then we can write

$p(y_n \mid \mathbf{w}, \mathbf{x}_n) = \frac{1}{1 + \exp(-y_n\mathbf{w}^\top\mathbf{x}_n)} = \sigma(y_n\mathbf{w}^\top\mathbf{x}_n)$
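
A small NumPy sketch of this prediction rule; the helper names `sigmoid` and `lr_predict` are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(z):
    """Squashes a real number to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lr_predict(w, X, threshold=0.5):
    """Predicted probability of class 1 and the corresponding hard 0/1 label."""
    scores = X @ w              # w^T x_n, the logits / log-odds
    probs = sigmoid(scores)     # mu_n = p(y_n = 1 | x_n, w)
    labels = (probs >= threshold).astype(int)   # at 0.5 this matches sign(w^T x_n)
    return probs, labels
```
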
Linear Models: The Decision Boundary
▪ One view: the decision boundary is where the score $\mathbf{w}^\top\mathbf{x}_n$ changes its sign, i.e., $\mathbf{w}^\top\mathbf{x}_n = 0$ for points at the decision boundary ($\mathbf{w}^\top\mathbf{x}_n > 0$ on one side and $\mathbf{w}^\top\mathbf{x}_n < 0$ on the other)
▪ Another view: the decision boundary is where both classes have equal probability for the input $\mathbf{x}_n$
▪ For logistic regression, at the decision boundary

$p(y_n = 1 \mid \mathbf{w}, \mathbf{x}_n) = p(y_n = 0 \mid \mathbf{w}, \mathbf{x}_n)$
$\frac{\exp(\mathbf{w}^\top\mathbf{x}_n)}{1 + \exp(\mathbf{w}^\top\mathbf{x}_n)} = \frac{1}{1 + \exp(\mathbf{w}^\top\mathbf{x}_n)}$
$\exp(\mathbf{w}^\top\mathbf{x}_n) = 1 \;\Rightarrow\; \mathbf{w}^\top\mathbf{x}_n = 0$

▪ Therefore, both views are equivalent
Linear Models for (Multi-class) Classification
▪ If there are $C > 2$ classes, we use $C$ weight vectors $\{\mathbf{w}_i\}_{i=1}^{C}$ to define the model; they form a $D \times C$ weight matrix $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_C]$
▪ The prediction rule is as follows

$y_n = \arg\max_{i \in \{1,2,\ldots,C\}} \mathbf{w}_i^\top\mathbf{x}_n$

▪ Can think of $\mathbf{w}_i^\top\mathbf{x}_n$ as the score/similarity of the input w.r.t. the $i$-th class
▪ Can also use these scores to compute the probability of belonging to each class; this is "softmax" classification, the multi-class extension of logistic regression:

$\mu_{n,i} = p(y_n = i \mid \mathbf{W}, \mathbf{x}_n) = \frac{\exp(\mathbf{w}_i^\top\mathbf{x}_n)}{\sum_{j=1}^{C}\exp(\mathbf{w}_j^\top\mathbf{x}_n)}$

Just like in logistic regression, the scores $\mathbf{w}_i^\top\mathbf{x}_n$ are called logits ($C$ logits in this case). The vector $\boldsymbol{\mu}_n = [\mu_{n,1}, \mu_{n,2}, \ldots, \mu_{n,C}]$ gives the probabilities of $\mathbf{x}_n$ belonging to each of the $C$ classes; the probabilities must sum to 1, i.e., $\sum_{i=1}^{C}\mu_{n,i} = 1$, and the class $i$ with the largest $\mathbf{w}_i^\top\mathbf{x}_n$ has the largest probability.
Note: We actually need only $C - 1$ weight vectors in softmax classification. Think why?
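
A sketch of softmax classification in NumPy, assuming `W` is the $D \times C$ weight matrix described above; subtracting the per-row maximum logit is a standard numerical-stability trick, not something the slides prescribe.

```python
import numpy as np

def softmax_probs(W, X):
    """Class probabilities for each row of X; W is D x C, X is N x D."""
    logits = X @ W                                        # scores w_i^T x_n, shape N x C
    logits = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    exp_scores = np.exp(logits)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)  # each row sums to 1

def softmax_predict(W, X):
    # argmax of the scores equals argmax of the probabilities (softmax is monotone)
    return np.argmax(X @ W, axis=1)
```
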
Linear Classification: Interpreting the Weight Vectors
▪ Recall that the multi-class classification prediction rule is

$y_n = \arg\max_{i \in \{1,2,\ldots,C\}} \mathbf{w}_i^\top\mathbf{x}_n$

▪ Can think of $\mathbf{w}_i^\top\mathbf{x}_n$ as the score of the input for the $i$-th class (or the similarity of $\mathbf{x}_n$ with $\mathbf{w}_i$)
▪ Once learned (we will see the methods later), these $C$ weight vectors (one for each class) can sometimes have nice interpretations, especially when the inputs are images

[Figure: the learned weight vectors of 4 classes ($\mathbf{w}_{car}$, $\mathbf{w}_{frog}$, $\mathbf{w}_{horse}$, $\mathbf{w}_{cat}$) "unflattened" and visualized as images. They look somewhat like class prototypes, i.e., an "average" of what images from each class should look like, as if we were using LwP. The dot product of each weight vector with an image from the correct class is expected to be the largest, which is also why LwP (with Euclidean distances) acts like a linear model.]
Logistic and Softmax Classification: Pictorially
▪ Logistic regression is a linear model with a single weight vector, i.e., $D$ weights
[Figure: a single output $y_n$ connected to the inputs $x_{n,1}, x_{n,2}, \ldots, x_{n,D}$ through the weights $w_1, w_2, \ldots, w_D$]
▪ Softmax classification is a linear model with $C$ weight vectors, i.e., $D \times C$ weights in total
[Figure: $C$ outputs $y_{n,1}, y_{n,2}, \ldots, y_{n,C}$, each connected to all the inputs $x_{n,1}, x_{n,2}, \ldots, x_{n,D}$]
Loss Functions for Classification
▪ Assume the true label to be $y_n \in \{0,1\}$ and the score of a linear model to be $\mathbf{w}^\top\mathbf{x}_n$
▪ One possibility is to use the squared loss, just like we used in regression

$\ell(y_n, \mathbf{w}^\top\mathbf{x}_n) = (y_n - \mathbf{w}^\top\mathbf{x}_n)^2$

▪ Will be easy to optimize (same solution as the regression case)
▪ Can also consider other loss functions used in regression
▪ Basically, pretend that the binary label is actually a continuous value and treat the problem as regression where the output can only be one of two possible values
▪ However, regression loss functions aren't ideal since $y_n$ is discrete (binary/categorical)
▪ Using the score $\mathbf{w}^\top\mathbf{x}_n$ or the probability $\mu_n = \sigma(\mathbf{w}^\top\mathbf{x}_n)$ of belonging to the positive class, we have specialized loss functions for binary classification
Loss Functions for Classification: Cross-Entropy
▪ Binary cross-entropy (CE) is a popular loss function for binary classification; it is used in logistic regression
▪ Assuming the true $y_n \in \{0,1\}$ and $\mu_n = \sigma(\mathbf{w}^\top\mathbf{x}_n)$ as the predicted probability of $y_n = 1$, the CE loss is

$L(\mathbf{w}) = -\sum_{n=1}^{N}\left[y_n\log\mu_n + (1 - y_n)\log(1 - \mu_n)\right]$

The loss is very large if $y_n$ is 1 and $\mu_n$ is close to 0, or $y_n$ is 0 and $\mu_n$ is close to 1. This is precisely what we want from a good loss function for binary classification.
▪ For multi-class classification, the multi-class CE loss is defined as

$L(\mathbf{W}) = -\sum_{n=1}^{N}\sum_{i=1}^{C} y_{n,i}\log\mu_{n,i}$

where $\mu_{n,i}$ is the predicted probability of $\mathbf{x}_n$ belonging to class $i$, and $y_{n,i} = 1$ if the true label of $\mathbf{x}_n$ is class $i$ and 0 otherwise.
Note: Unlike the least squares loss for regression, for the cross-entropy loss we can't get a closed-form solution for $\mathbf{w}$ by applying first-order optimality (try this as an exercise for the binary CE loss). We can however optimize the CE loss using iterative optimization such as gradient descent. The CE loss is also convex in $\mathbf{w}$ (can be proved easily using the definition of convexity; we will see this later), therefore a unique solution is obtained when we minimize it.
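
A minimal NumPy sketch of both CE losses, assuming 0/1 labels (a one-hot matrix for the multi-class case); the `eps` clipping is an implementation detail to avoid log(0), not part of the definition.

```python
import numpy as np

def binary_ce_loss(w, X, y, eps=1e-12):
    """Binary cross-entropy loss; y is an array of 0/1 labels."""
    mu = 1.0 / (1.0 + np.exp(-(X @ w)))     # predicted probability of class 1
    mu = np.clip(mu, eps, 1 - eps)          # avoid log(0)
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def multiclass_ce_loss(W, X, Y_onehot, eps=1e-12):
    """Multi-class CE loss; Y_onehot has y_{n,i} = 1 for the true class of x_n."""
    logits = X @ W
    logits = logits - logits.max(axis=1, keepdims=True)        # stable softmax
    mu = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.sum(Y_onehot * np.log(np.clip(mu, eps, 1.0)))
```
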
Cross-Entropy Loss: The Gradient
▪ The expression for the gradient of the binary cross-entropy loss (note that $\mu_n$ is a function of $\mathbf{w}$):

$\mathbf{g} = \nabla_{\mathbf{w}} L(\mathbf{w}) = -\sum_{n=1}^{N}(y_n - \mu_n)\,\mathbf{x}_n$

Note the form of each term in the gradient expression: the amount of the current $\mathbf{w}$'s error in predicting the label of the $n$-th training example, multiplied by the input $\mathbf{x}_n$. Using this, we can do gradient descent to learn the optimal $\mathbf{w}$ for logistic regression: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t\,\mathbf{g}^{(t)}$.
▪ The expression for the gradient of the multi-class cross-entropy loss w.r.t. the weight vector of the $i$-th class (we need to compute this gradient for each of the $C$ weight vectors):

$\mathbf{g}_i = \nabla_{\mathbf{w}_i} L(\mathbf{W}) = -\sum_{n=1}^{N}(y_{n,i} - \mu_{n,i})\,\mathbf{x}_n$

Each term again has the same form: the amount of the current $\mathbf{W}$'s error in predicting the label of the $n$-th training example, multiplied by the input $\mathbf{x}_n$. Using these gradients, we can do gradient descent to learn the optimal $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_C]$ for the softmax classification model.
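
Putting the gradient and the update rule together, a sketch of training logistic regression by gradient descent; the averaged gradient, learning rate, and iteration count are illustrative choices.

```python
import numpy as np

def fit_logistic_gd(X, y, eta=0.1, iters=1000):
    """Gradient descent on the binary CE loss; y holds 0/1 labels."""
    N, D = X.shape
    w = np.zeros(D)
    for t in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ w)))   # mu_n depends on the current w
        g = -(X.T @ (y - mu)) / N             # average of -(y_n - mu_n) x_n
        w = w - eta * g                       # w^(t+1) = w^(t) - eta_t g^(t)
    return w
```
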
Some Other Loss Functions for Binary Classification
▪ Assume the true label as $y_n$ and the prediction as $\hat{y}_n = \mathrm{sign}(\mathbf{w}^\top\mathbf{x}_n)$
▪ The zero-one loss is the most natural loss function for classification

$\ell(y_n, \hat{y}_n) = \begin{cases} 1 & \text{if } y_n \neq \hat{y}_n \\ 0 & \text{if } y_n = \hat{y}_n \end{cases} \qquad \text{equivalently} \qquad \ell(y_n, \hat{y}_n) = \begin{cases} 1 & \text{if } y_n\mathbf{w}^\top\mathbf{x}_n < 0 \\ 0 & \text{if } y_n\mathbf{w}^\top\mathbf{x}_n \geq 0 \end{cases}$

It is non-convex, non-differentiable, and NP-hard to optimize (it also provides no useful gradient information for the most part).
[Figure: plot of the zero-one loss as a function of $y_n\mathbf{w}^\top\mathbf{x}_n$, a step from 1 down to 0 at the origin]
▪ Since the zero-one loss is hard to minimize, we use some surrogate loss function
▪ Popular examples: cross-entropy (also called logistic loss), hinge loss, etc.
▪ Note: Ideally, a surrogate loss (an approximation of the zero-one loss) should be an upper bound on it (i.e., larger than the 0-1 loss for all values of $y_n\mathbf{w}^\top\mathbf{x}_n$), since our goal is minimization
Some Other Loss Functions for Binary Classification (contd.)
▪ For an ideal loss function, assuming $y_n \in \{-1, +1\}$:
▪ Large positive $y_n\mathbf{w}^\top\mathbf{x}_n$ $\Rightarrow$ small/zero loss
▪ Large negative $y_n\mathbf{w}^\top\mathbf{x}_n$ $\Rightarrow$ large/non-zero loss
▪ Equivalently, small (large) loss if the predicted probability of the true label is large (small)
▪ "Perceptron" loss: convex and non-differentiable; zero loss on correctly classified points, but not an upper bound on the 0-1 loss
▪ Log(istic) loss: convex and differentiable, and an upper bound on the 0-1 loss; it is the same as the cross-entropy loss (logistic regression) if we assume the labels to be $-1/+1$ instead of 0/1
▪ Hinge loss: convex and non-differentiable, and an upper bound on the 0-1 loss; very popular like the cross-entropy loss, and used in SVM (Support Vector Machine) classification
[Figure: plots of the perceptron, logistic, and hinge losses as functions of $y_n\mathbf{w}^\top\mathbf{x}_n$; see the sketch after this list for the corresponding formulas]
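
A sketch of the losses named above, written as functions of the margin $y_n\mathbf{w}^\top\mathbf{x}_n$; the base-2 logarithm in the logistic loss is an assumption chosen so that the loss equals 1 at margin 0 and upper-bounds the 0-1 loss, matching the plot on the slide.

```python
import numpy as np

def zero_one_loss(margin):        # margin = y_n * w^T x_n, with y_n in {-1, +1}
    return (margin < 0).astype(float)

def perceptron_loss(margin):      # zero loss for correctly classified points
    return np.maximum(0.0, -margin)

def hinge_loss(margin):           # used in SVMs; upper-bounds the 0-1 loss
    return np.maximum(0.0, 1.0 - margin)

def log_loss(margin):             # logistic / cross-entropy loss with -1/+1 labels
    return np.log2(1.0 + np.exp(-margin))   # base-2 log (assumed) so that loss(0) = 1

margins = np.linspace(-2, 2, 5)
print(zero_one_loss(margins), hinge_loss(margins))
```
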
Nonlinear Classification using Linear Models?
▪ Yes: transform the original features and apply a logistic or softmax classification model on top
▪ The feature transformation can be pre-defined (e.g., using kernels) or learned (using neural nets)
[Figure: inputs $\mathbf{x}$ mapped via $\phi(\mathbf{x})$ to a transformed space where a linear boundary, defined by $\mathbf{w}$ (binary) or $\mathbf{w}_1, \mathbf{w}_2, \mathbf{w}_3$ (multi-class), separates the classes]
▪ This is similar to how we nonlinearize a linear model for regression
▪ Only the loss function $\ell(y_n, f(\mathbf{x}_n))$ changes:
▪ Binary CE loss if using logistic regression at the top
▪ Multi-class CE if using softmax classification at the top
▪ Or other classification loss functions if using other linear classifiers at the top
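
As one concrete (assumed) example of a pre-defined transformation, the sketch below builds quadratic features; feeding `quadratic_features(X)` to a linear classifier such as the earlier `fit_logistic_gd` sketch gives a nonlinear decision boundary in the original feature space.

```python
import numpy as np

def quadratic_features(X):
    """Simple pre-defined transform phi(x): original, squared, and pairwise terms."""
    N, D = X.shape
    squares = X ** 2
    if D > 1:
        cross = np.stack([X[:, i] * X[:, j]
                          for i in range(D) for j in range(i + 1, D)], axis=1)
    else:
        cross = np.empty((N, 0))
    return np.hstack([X, squares, cross])
```
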
Evaluation Measures for Binary Classification
▪ Average classification error or average accuracy (on val./test data)

$\mathrm{err}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[y_n \neq \hat{y}_n] \qquad \mathrm{acc}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[y_n = \hat{y}_n]$

▪ The cross-entropy loss itself (on val./test data)
▪ Precision, Recall, and F1 score (preferred if the labels are imbalanced)
▪ Precision (P): of the positive predictions made by the model, what fraction are true positives
▪ Recall (R): of all true positive examples, what fraction did the model predict as positive
▪ F1 score: the harmonic mean of P and R
▪ The confusion matrix is also a helpful measure; various other metrics such as error/accuracy, P, R, F1, etc. can be readily calculated from the confusion matrix
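
A sketch of these binary metrics computed directly from 0/1 labels and predictions (equivalently, from the confusion-matrix counts TP, FP, FN); the zero-division guards are an implementation choice.

```python
import numpy as np

def binary_classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from 0/1 labels and predictions."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = np.mean(y_true == y_pred)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)   # harmonic mean of P and R
    return acc, precision, recall, f1
```
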
Evaluation Measures for Multi-class Classification
▪ Average classification error or average accuracy (on val./test data)

$\mathrm{err}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[y_n \neq \hat{y}_n] \qquad \mathrm{acc}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[y_n = \hat{y}_n]$

▪ Top-k accuracy (based on the predicted probabilities/scores of the various classes)

$\text{Top-}k\ \text{accuracy} = \frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[y_n \in \hat{S}_n]$

where $y_n$ is the true label and $\hat{S}_n$ is the set of top-$k$ predicted classes for $\mathbf{x}_n$
▪ The multi-class cross-entropy loss itself (on val./test data)
▪ Class-wise Precision, Recall, and F1 score (preferred if the labels are imbalanced)
▪ The confusion matrix; various other metrics such as error/accuracy, P, R, F1, etc. can be readily calculated from it
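
A sketch of top-k accuracy, assuming `scores` is an N x C array of class scores or probabilities and `y_true` holds integer class labels.

```python
import numpy as np

def top_k_accuracy(scores, y_true, k=5):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]        # indices of the k largest scores per row
    hits = np.any(top_k == y_true[:, None], axis=1)   # is the true label in the top-k set?
    return np.mean(hits)
```
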

You might also like