Lecture 6
CS771: Intro to ML
Gradient Descent for Linear/Ridge Regression
▪ Just use the GD algorithm with the gradient expressions we derived
▪ Iterative updates for linear regression will be of the form (a short code sketch follows these bullets)
  $\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} - \eta_t \, \boldsymbol{g}^{(t)} = \boldsymbol{w}^{(t)} + \frac{2\eta_t}{N} \sum_{n=1}^{N} \left( y_n - \boldsymbol{w}^{(t)\top} \boldsymbol{x}_n \right) \boldsymbol{x}_n$
  Note the form of each term in the gradient expression: the amount of the current $\boldsymbol{w}$'s error on the $n$-th training example, multiplied by the input $\boldsymbol{x}_n$. Also, we usually work with the average gradient, so the gradient term is divided by $N$.
  Unlike the closed-form solution $(\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$ of least-squares regression, here we have iterative updates but do not require the expensive inversion of the $D \times D$ matrix $\boldsymbol{X}^\top \boldsymbol{X}$ (thus faster)
▪ Similar updates for ridge regression as well (with the gradient expression being
slightly different; left as an exercise)
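A minimal NumPy sketch of the update above, assuming the averaged squared-error gradient with its factor of 2; the function name, learning rate, and iteration count are illustrative choices, not from the slide:

```python
import numpy as np

def gd_linear_regression(X, y, eta=0.01, num_iters=1000):
    """Gradient descent for linear (least-squares) regression.

    X: (N, D) input matrix, y: (N,) targets.
    Uses the average gradient, so the sum over examples is divided by N.
    """
    N, D = X.shape
    w = np.zeros(D)
    for t in range(num_iters):
        errors = y - X @ w                # y_n - w^T x_n for every example
        g = -(2.0 / N) * (X.T @ errors)   # gradient of the mean squared error
        w = w - eta * g                   # w^{(t+1)} = w^{(t)} - eta_t * g^{(t)}
    return w
```

For ridge regression, the gradient would pick up an extra term from the regularizer (the exercise mentioned above), but the loop structure stays the same.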
$R^2 = 1 - \frac{\sum_{n=1}^{N} (y_n - \hat{y}_n)^2}{\sum_{n=1}^{N} (y_n - \bar{y})^2}$ measures the "relative" error w.r.t. a model that makes a constant prediction $\bar{y}$ for all inputs
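A small sketch of computing this measure from predictions (the function name is hypothetical):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2: 1 minus the model's squared error relative to a constant predictor y_bar."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # error of the model
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot
```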
Linear Models for Classification
▪ A linear model 𝑦 = 𝒘⊤ 𝒙 can also be used in classification
▪ For binary classification, can treat $\boldsymbol{w}^\top \boldsymbol{x}_n$ as the "score" of input $\boldsymbol{x}_n$ and either
  ▪ Threshold the score to get a binary label: $y_n = \text{sign}(\boldsymbol{w}^\top \boldsymbol{x}_n)$
    Note that $\log \frac{\mu_n}{1 - \mu_n} = \boldsymbol{w}^\top \boldsymbol{x}_n$ (the score) is also called the log-odds ratio, and often also the logits
  ▪ Convert the score into a probability
    $\mu_n = p(y_n = 1 \mid \boldsymbol{x}_n, \boldsymbol{w}) = \sigma(\boldsymbol{w}^\top \boldsymbol{x}_n) = \frac{1}{1 + \exp(-\boldsymbol{w}^\top \boldsymbol{x}_n)} = \frac{\exp(\boldsymbol{w}^\top \boldsymbol{x}_n)}{1 + \exp(\boldsymbol{w}^\top \boldsymbol{x}_n)}$
    The "sigmoid" function $\sigma(z)$ squashes a real number to the range 0-1. [Plot: $\sigma(z)$ rises from 0 to 1 and equals 0.5 at $z = 0$]
    This is popularly known as the "logistic regression" (LR) model (a misnomer: it is not a regression model but a classification model), a probabilistic model for binary classification
▪ Note: In LR, if we assume the label $y_n$ to be -1/+1 (not 0/1) then we can write
  $p(y_n \mid \boldsymbol{w}, \boldsymbol{x}_n) = \frac{1}{1 + \exp(-y_n \boldsymbol{w}^\top \boldsymbol{x}_n)} = \sigma(y_n \boldsymbol{w}^\top \boldsymbol{x}_n)$
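A minimal sketch of the sigmoid and of the LR predictive probability and hard label; the function names are hypothetical, and the 0.5 threshold on the probability corresponds to thresholding the score at zero:

```python
import numpy as np

def sigmoid(z):
    """Squashes a real number to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lr_predict_proba(w, x):
    """mu = p(y = 1 | x, w) = sigma(w^T x)."""
    return sigmoid(w @ x)

def lr_predict_label(w, x):
    """Hard label by thresholding the score (equivalently, mu >= 0.5)."""
    return 1 if w @ x >= 0 else 0
```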
Linear Models: The Decision Boundary
▪ Decision boundary is where the score $\boldsymbol{w}^\top \boldsymbol{x}_n$ changes its sign: $\boldsymbol{w}^\top \boldsymbol{x}_n = 0$ for points at the decision boundary, with $\boldsymbol{w}^\top \boldsymbol{x}_n > 0$ on one side and $\boldsymbol{w}^\top \boldsymbol{x}_n < 0$ on the other
▪ Equivalently, the decision boundary is where both classes have equal probability for the input $\boldsymbol{x}_n$
▪ For logistic regression, at the decision boundary
  $p(y_n = 1 \mid \boldsymbol{w}, \boldsymbol{x}_n) = p(y_n = 0 \mid \boldsymbol{w}, \boldsymbol{x}_n)$
  $\frac{\exp(\boldsymbol{w}^\top \boldsymbol{x}_n)}{1 + \exp(\boldsymbol{w}^\top \boldsymbol{x}_n)} = \frac{1}{1 + \exp(\boldsymbol{w}^\top \boldsymbol{x}_n)}$
  $\exp(\boldsymbol{w}^\top \boldsymbol{x}_n) = 1$
  $\boldsymbol{w}^\top \boldsymbol{x}_n = 0$
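A quick numeric check of this derivation, with made-up toy numbers: the predicted probability is exactly 0.5 when the score $\boldsymbol{w}^\top \boldsymbol{x}$ is zero, and moves above 0.5 on the positive side of the boundary:

```python
import numpy as np

w = np.array([2.0, -1.0])
x_on_boundary = np.array([1.0, 2.0])    # w @ x = 0  ->  p(y=1) = 0.5
x_positive_side = np.array([3.0, 1.0])  # w @ x = 5  ->  p(y=1) > 0.5

for x in (x_on_boundary, x_positive_side):
    score = w @ x
    prob = 1.0 / (1.0 + np.exp(-score))
    print(f"score = {score:+.1f}, p(y=1) = {prob:.3f}")
```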
$\boldsymbol{\mu}_n = [\mu_{n,1}, \mu_{n,2}, \ldots, \mu_{n,C}]$ is the vector of probabilities of $\boldsymbol{x}_n$ belonging to each of the $C$ classes ($\mu_{n,i}$ is the probability of belonging to class $i$), with $\sum_{i=1}^{C} \mu_{n,i} = 1$ since the probabilities must sum to 1. The class $i$ with the largest $\boldsymbol{w}_i^\top \boldsymbol{x}_n$ has the largest probability. Note: We actually need only $C - 1$ weight vectors in softmax classification. Think why?
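A minimal sketch of computing the probability vector $\boldsymbol{\mu}_n$ from the $C$ class scores via the softmax function; W is assumed to stack the class weight vectors as rows, and the max subtraction is a standard numerical-stability trick rather than something on the slide:

```python
import numpy as np

def softmax_probs(W, x):
    """mu_n: probabilities of x belonging to each of the C classes.

    W: (C, D) stacked class weight vectors, x: (D,) input.
    """
    scores = W @ x                        # the C scores w_i^T x
    scores = scores - np.max(scores)      # stability trick: softmax is shift-invariant
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # entries sum to 1
```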
Linear Classification: Interpreting weight vectors
▪ Recall that the multi-class classification prediction rule is
  $y_n = \arg\max_{i \in \{1, 2, \ldots, C\}} \boldsymbol{w}_i^\top \boldsymbol{x}_n$
▪ Can think of $\boldsymbol{w}_i^\top \boldsymbol{x}_n$ as the score of the input for the $i$-th class (or the similarity of $\boldsymbol{x}_n$ with $\boldsymbol{w}_i$)
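The prediction rule as a short sketch, again assuming W stacks the $C$ class weight vectors as rows:

```python
import numpy as np

def predict_class(W, x):
    """y = argmax_i w_i^T x: return the class whose weight vector gives the largest score."""
    return int(np.argmax(W @ x))
```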
▪ Once learned (we will see the methods later), these $C$ weight vectors (one for each class) can sometimes have nice interpretations, especially when the inputs are images
  [Figure: the learned weight vectors of the 4 classes, $\boldsymbol{w}_{car}$, $\boldsymbol{w}_{frog}$, $\boldsymbol{w}_{horse}$, $\boldsymbol{w}_{cat}$, "unflattened" and visualized as images; they kind of look like an "average" of what the images from that class should look like]
  That's why the dot product of each of these weight vectors with an image from the correct class will be expected to be the largest
  "These images sort of look like class prototypes if I were using LwP." ☺ "Yeah, 'sort of'. No wonder LwP (with Euclidean distances) acts like a linear model." ☺
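A hedged sketch of the "unflatten and visualize" idea; the 32x32x3 image shape, the min-max rescaling, and the matplotlib usage are illustrative assumptions rather than details from the slide:

```python
import numpy as np
import matplotlib.pyplot as plt

def show_class_weights(W, class_names, img_shape=(32, 32, 3)):
    """Reshape each learned weight vector back into image shape and display it."""
    fig, axes = plt.subplots(1, len(class_names), figsize=(3 * len(class_names), 3))
    for ax, w, name in zip(axes, W, class_names):
        img = w.reshape(img_shape)
        img = (img - img.min()) / (img.max() - img.min())  # rescale to [0, 1] for display
        ax.imshow(img)
        ax.set_title(name)
        ax.axis("off")
    plt.show()
```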
Logistic and Softmax classification: Pictorially
▪ Logistic regression is a linear model with a single weight vector of $D$ weights
  [Figure: a diagram with output $y_n$ connected to inputs $x_{n,1}, x_{n,2}, \ldots, x_{n,D-1}, x_{n,D}$ through weights $w_1, w_2, \ldots, w_{D-1}, w_D$]
Loss Functions for Classification
▪ Assume true label to be 𝑦𝑛 ∈ {0,1} and the score of a linear model to be 𝒘⊤ 𝒙𝑛
▪ Using the score $\boldsymbol{w}^\top \boldsymbol{x}_n$ or the probability $\mu_n = \sigma(\boldsymbol{w}^\top \boldsymbol{x}_n)$ of belonging to the positive class, we have specialized loss functions for binary classification
Loss Functions for Classification: Cross-Entropy
▪ Binary cross-entropy (CE) is a popular loss function for binary classification. Used in logistic regression. Its gradient is
  $\boldsymbol{g} = \nabla_{\boldsymbol{w}} L(\boldsymbol{w}) = -\sum_{n=1}^{N} (y_n - \mu_n) \, \boldsymbol{x}_n$
  Note the form of each term in the gradient expression: the amount of the current $\boldsymbol{w}$'s error in predicting the label of the $n$-th training example, multiplied by the input $\boldsymbol{x}_n$
▪ Using this, we can now do gradient descent to learn the optimal $\boldsymbol{w}$ for logistic regression:
  $\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} - \eta_t \, \boldsymbol{g}^{(t)}$
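A minimal sketch of this gradient-descent loop for logistic regression, assuming labels in {0, 1}, the averaged gradient, and a fixed learning rate (all illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_logistic_regression(X, y, eta=0.1, num_iters=1000):
    """Learn w for logistic regression by gradient descent on binary cross-entropy.

    X: (N, D) inputs, y: (N,) labels with entries in {0, 1}.
    """
    N, D = X.shape
    w = np.zeros(D)
    for t in range(num_iters):
        mu = sigmoid(X @ w)      # mu_n = sigma(w^T x_n)
        g = -(X.T @ (y - mu))    # g = -sum_n (y_n - mu_n) x_n
        w = w - eta * g / N      # averaged-gradient step (the 1/N is an assumption)
    return w
```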
▪ The expression for the gradient of the multi-class cross-entropy loss w.r.t. the weight vector of the $i$-th class (we need to calculate this gradient for each of the $K$ weight vectors):
  $\boldsymbol{g}_i = \nabla_{\boldsymbol{w}_i} L(\boldsymbol{W}) = -\sum_{n=1}^{N} (y_{n,i} - \mu_{n,i}) \, \boldsymbol{x}_n$
  Note the form of each term in the gradient expression: the amount of the current $\boldsymbol{W}$'s error in predicting the label of the $n$-th training example, multiplied by the input $\boldsymbol{x}_n$
▪ Using these gradients, we can now do gradient descent to learn the optimal $\boldsymbol{W} = [\boldsymbol{w}_1, \boldsymbol{w}_2, \ldots, \boldsymbol{w}_K]$ for the softmax classification model
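A corresponding sketch for the softmax model, computing all $K$ per-class gradients at once; the one-hot label matrix Y and the averaged, fixed-step update are assumptions for illustration:

```python
import numpy as np

def gd_softmax_classification(X, Y, eta=0.1, num_iters=1000):
    """Learn W = [w_1, ..., w_K] by gradient descent on multi-class cross-entropy.

    X: (N, D) inputs, Y: (N, K) one-hot labels.
    """
    N, D = X.shape
    K = Y.shape[1]
    W = np.zeros((K, D))
    for t in range(num_iters):
        scores = X @ W.T                               # (N, K) scores w_i^T x_n
        scores -= scores.max(axis=1, keepdims=True)    # numerical-stability trick
        exp_s = np.exp(scores)
        Mu = exp_s / exp_s.sum(axis=1, keepdims=True)  # (N, K) probabilities mu_{n,i}
        G = -((Y - Mu).T @ X)                          # row i is g_i = -sum_n (y_{n,i} - mu_{n,i}) x_n
        W = W - eta * G / N                            # averaged-gradient step (assumption)
    return W
```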
Some Other Loss Functions for Binary Classification
▪ Assume the true label to be $y_n$ and the prediction to be $\hat{y}_n = \text{sign}[\boldsymbol{w}^\top \boldsymbol{x}_n]$
▪ The zero-one loss is the most natural loss function for classification (a short code sketch follows these bullets)
  $\ell(y_n, \hat{y}_n) = \begin{cases} 1 & \text{if } y_n \neq \hat{y}_n \\ 0 & \text{if } y_n = \hat{y}_n \end{cases}$
  Equivalently, in terms of the margin $y_n \boldsymbol{w}^\top \boldsymbol{x}_n$:
  $\ell(y_n, \hat{y}_n) = \begin{cases} 1 & \text{if } y_n \boldsymbol{w}^\top \boldsymbol{x}_n < 0 \\ 0 & \text{if } y_n \boldsymbol{w}^\top \boldsymbol{x}_n \geq 0 \end{cases}$
  Non-convex, non-differentiable, and NP-hard to optimize (also no useful gradient info for the most part)
  [Plot: the 0-1 loss as a function of $y_n \boldsymbol{w}^\top \boldsymbol{x}_n$, a step that drops from 1 to 0 at the origin, with the points (0,1) and (0,0) marked]
▪ Since zero-one loss is hard to minimize, we use some surrogate loss function
▪ Popular examples: Cross-entropy (also called logistic loss), hinge loss, etc.
▪ Note: Ideally, surrogate loss (approximation of zero-one) must be an upper bound (must
be larger than the 0-1 loss for all values of 𝑦𝑛 𝒘⊤ 𝒙𝑛 ) since our goal is minimization
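A small sketch of the zero-one loss in its margin form, with labels assumed in {-1, +1}; being piecewise constant, it gives no useful gradient signal, which is why surrogates are used instead:

```python
import numpy as np

def zero_one_loss(w, X, y):
    """Average 0-1 loss; y has entries in {-1, +1}.

    The loss is piecewise constant in w (gradient is zero almost everywhere),
    which is why surrogate losses are minimized in practice.
    """
    margins = y * (X @ w)        # y_n * w^T x_n, negative iff misclassified
    return np.mean(margins < 0)  # fraction of misclassified examples
```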
Some Other Loss Functions for Binary Classification
▪ For an ideal loss function, assuming $y_n \in \{-1, +1\}$
  ▪ Large positive $y_n \boldsymbol{w}^\top \boldsymbol{x}_n$ ⇒ small/zero loss
  ▪ Large negative $y_n \boldsymbol{w}^\top \boldsymbol{x}_n$ ⇒ large/non-zero loss
▪ "Perceptron" loss: convex and non-differentiable; also, not an upper bound on the 0-1 loss
▪ Log(istic) loss: small (large) loss if the predicted probability of the true label is large (small); same as the cross-entropy loss (logistic regression) if we assume labels to be -1/+1 instead of 0/1
▪ Hinge loss: very popular, like the cross-entropy loss; used in SVM (Support Vector Machine) classification
  [Plot: the three losses as functions of $y_n \boldsymbol{w}^\top \boldsymbol{x}_n$, with the origin (0,0) marked]
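The slide shows these losses only as plots, so the expressions below are the standard textbook forms, written as functions of the margin $m = y_n \boldsymbol{w}^\top \boldsymbol{x}_n$ with $y_n \in \{-1, +1\}$; treat them as assumed definitions rather than something read off the slide:

```python
import numpy as np

# Surrogate losses as functions of the margin m = y_n * w^T x_n, with y_n in {-1, +1}.

def perceptron_loss(m):
    """Zero for correctly classified points, grows linearly otherwise; not an upper bound on 0-1 loss."""
    return np.maximum(0.0, -m)

def hinge_loss(m):
    """Used in SVMs; penalizes even correct predictions whose margin is below 1."""
    return np.maximum(0.0, 1.0 - m)

def logistic_loss(m):
    """Log(istic) loss; same as cross-entropy with -1/+1 labels."""
    return np.log(1.0 + np.exp(-m))
```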
Evaluation Measures for Binary Classification
▪ Average classification error or average accuracy (on val./test data)
  $\text{err}(\boldsymbol{w}) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{I}[y_n \neq \hat{y}_n] \qquad\qquad \text{acc}(\boldsymbol{w}) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{I}[y_n = \hat{y}_n]$
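A direct sketch of these two measures (function names are hypothetical):

```python
import numpy as np

def classification_error(y_true, y_pred):
    """err(w): fraction of examples where the prediction is wrong."""
    return np.mean(y_true != y_pred)

def classification_accuracy(y_true, y_pred):
    """acc(w): fraction of examples where the prediction is correct (= 1 - err)."""
    return np.mean(y_true == y_pred)
```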