Lecture 6: Linear Models for Classification

CS771: Introduction to Machine Learning


Plan today
▪ Wrapping up linear models for regression
▪ Linear models for classification
▪ Logistic and softmax classification
Gradient Descent for Linear/Ridge Regression
▪ Just use the GD algorithm with the gradient expressions we derived
▪ Iterative updates for linear regression will be of the form

$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t\,\mathbf{g}^{(t)} = \mathbf{w}^{(t)} + 2\eta_t \sum_{n=1}^{N} \left(y_n - \mathbf{w}^{(t)\top}\mathbf{x}_n\right)\mathbf{x}_n$

(Usually we work with the average gradient, so the gradient term is divided by $N$.)
Note the form of each term in the update: the amount of the current $\mathbf{w}$'s error on the $n$-th training example, multiplied by the input $\mathbf{x}_n$.
Unlike the closed-form solution $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ of least squares regression, here we have iterative updates but do not require the expensive inversion of the $D \times D$ matrix $\mathbf{X}^\top\mathbf{X}$ (thus faster).
▪ Similar updates for ridge regression as well (with the gradient expression being slightly different; left as an exercise)
▪ More on iterative optimization methods later
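
As a rough illustration, here is a minimal NumPy sketch of these updates on a small synthetic dataset; the dataset, the learning rate `eta`, and the iteration count are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)   # noisy linear data (assumed setup)

w = np.zeros(D)          # w^(0)
eta = 0.1                # step size (illustrative choice)
for t in range(500):
    residual = y - X @ w                  # (y_n - w^T x_n) for all n
    grad = -(2.0 / N) * X.T @ residual    # average gradient of the squared loss
    w = w - eta * grad                    # w^(t+1) = w^(t) - eta_t * g^(t)

# iterative solution matches the closed-form least squares solution
print(np.allclose(w, np.linalg.solve(X.T @ X, X.T @ y), atol=1e-3))
```
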
Evaluation Measures for Regression Models
▪ Plotting the prediction $\hat{y}_n$ vs the truth $y_n$ for the validation/test set
▪ Mean Squared Error (MSE) and Mean Absolute Error (MAE) on the val./test set

$\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}(y_n - \hat{y}_n)^2 \qquad \mathrm{MAE} = \frac{1}{N}\sum_{n=1}^{N}|y_n - \hat{y}_n|$

▪ RMSE (Root Mean Squared Error) $\triangleq \sqrt{\mathrm{MSE}}$
▪ Coefficient of determination or $R^2$

$R^2 = 1 - \frac{\sum_{n=1}^{N}(y_n - \hat{y}_n)^2}{\sum_{n=1}^{N}(y_n - \bar{y})^2}$

This is a "relative" error w.r.t. a model that makes the constant prediction $\bar{y}$ for all inputs, where $\bar{y} = \frac{1}{N}\sum_{n=1}^{N} y_n$ is the empirical mean of the true responses. A "base" model that always predicts the mean $\bar{y}$ will have $R^2 = 0$ and a perfect model will have $R^2 = 1$; models worse than the base model can even have negative $R^2$.

[Figure: plots of true vs predicted outputs and $R^2$ for two regression models; pic from MLAPP (Murphy)]
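
A minimal sketch of these metrics in NumPy, assuming `y_true` and `y_pred` are 1-D arrays of true and predicted responses on the validation/test set.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, MAE, RMSE and R^2 on a validation/test set."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(mse)
    # R^2: relative error w.r.t. a model that always predicts the mean of y_true
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, rmse, r2
```
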
Linear Regression as Solving a System of Linear Equations
▪ The form of the linear regression model $\mathbf{y} \approx \mathbf{X}\mathbf{w}$ is akin to a system of linear equations
▪ Assuming $N$ training examples with $D$ features each, we have

First training example: $y_1 = x_{11}w_1 + x_{12}w_2 + \ldots + x_{1D}w_D$
Second training example: $y_2 = x_{21}w_1 + x_{22}w_2 + \ldots + x_{2D}w_D$
...
$N$-th training example: $y_N = x_{N1}w_1 + x_{N2}w_2 + \ldots + x_{ND}w_D$

(Here $x_{nd}$ denotes the $d$-th feature of the $n$-th training example; we have $N$ equations and $D$ unknowns $w_1, w_2, \ldots, w_D$.)
▪ Usually we will have either $N > D$ or $N < D$
▪ Thus we have an underdetermined ($N < D$) or overdetermined ($N > D$) system
▪ Methods to solve over/underdetermined systems can be used for linear regression as well
▪ Many of these methods don't require expensive matrix inversion: solving $\mathbf{w} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ can be posed as solving the system $\mathbf{A}\mathbf{w} = \mathbf{b}$, where $\mathbf{A} = \mathbf{X}^\top\mathbf{X}$ and $\mathbf{b} = \mathbf{X}^\top\mathbf{y}$ (a system of $D$ equations in $D$ unknowns)
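
For illustration, a sketch of fitting linear/ridge regression by solving the $D \times D$ system $\mathbf{A}\mathbf{w} = \mathbf{b}$ rather than forming an explicit inverse; the helper name `fit_linreg` and the `lam` parameter are assumptions made for this example.

```python
import numpy as np

def fit_linreg(X, y, lam=0.0):
    """Solve A w = b with A = X^T X (+ lam*I for ridge), b = X^T y."""
    D = X.shape[1]
    A = X.T @ X + lam * np.eye(D)   # lam > 0 gives the ridge regression solution
    b = X.T @ y
    return np.linalg.solve(A, b)    # no explicit matrix inverse needed

# Alternatively, np.linalg.lstsq solves the over/underdetermined system y = X w directly:
# w, *_ = np.linalg.lstsq(X, y, rcond=None)
```
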
Linear Models for Classification
▪ A linear model $y = \mathbf{w}^\top\mathbf{x}$ can also be used for classification
▪ For binary classification, we can treat $\mathbf{w}^\top\mathbf{x}_n$ as the "score" of input $\mathbf{x}_n$ and either
▪ Threshold the score to get a binary label: $y_n = \mathrm{sign}(\mathbf{w}^\top\mathbf{x}_n)$
▪ Convert the score into a probability:

$\mu_n = p(y_n = 1 \mid \mathbf{x}_n, \mathbf{w}) = \sigma(\mathbf{w}^\top\mathbf{x}_n) = \frac{1}{1 + \exp(-\mathbf{w}^\top\mathbf{x}_n)} = \frac{\exp(\mathbf{w}^\top\mathbf{x}_n)}{1 + \exp(\mathbf{w}^\top\mathbf{x}_n)}$

The "sigmoid" function $\sigma(z)$ squashes a real number to the range (0, 1), with $\sigma(0) = 0.5$. Note that $\log\frac{\mu_n}{1-\mu_n} = \mathbf{w}^\top\mathbf{x}_n$ (the score) is also called the log-odds ratio, and often also the logits. This model is popularly known as the "logistic regression" (LR) model (a misnomer: it is not a regression model but a classification model); it is a probabilistic model for binary classification.
▪ Note: In LR, if we assume the label $y_n$ to be $-1/+1$ (not 0/1), then we can write

$p(y_n \mid \mathbf{w}, \mathbf{x}_n) = \frac{1}{1 + \exp(-y_n\mathbf{w}^\top\mathbf{x}_n)} = \sigma(y_n\mathbf{w}^\top\mathbf{x}_n)$
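
A small NumPy sketch of this prediction rule; the helper names `sigmoid` and `lr_predict` are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(z):
    """Squashes a real number to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lr_predict(w, X, threshold=0.5):
    """Predicted probability of class 1 and the corresponding hard 0/1 label."""
    scores = X @ w              # w^T x_n, the logits / log-odds
    probs = sigmoid(scores)     # mu_n = p(y_n = 1 | x_n, w)
    labels = (probs >= threshold).astype(int)   # at 0.5 this matches sign(w^T x_n)
    return probs, labels
```
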
Linear Models: The Decision Boundary
▪ One view: the decision boundary is where the score $\mathbf{w}^\top\mathbf{x}_n$ changes its sign, i.e., $\mathbf{w}^\top\mathbf{x}_n = 0$ for points at the decision boundary ($\mathbf{w}^\top\mathbf{x}_n > 0$ on one side and $\mathbf{w}^\top\mathbf{x}_n < 0$ on the other)
▪ Another view: the decision boundary is where both classes have equal probability for the input $\mathbf{x}_n$
▪ For logistic regression, at the decision boundary

$p(y_n = 1 \mid \mathbf{w}, \mathbf{x}_n) = p(y_n = 0 \mid \mathbf{w}, \mathbf{x}_n)$
$\frac{\exp(\mathbf{w}^\top\mathbf{x}_n)}{1 + \exp(\mathbf{w}^\top\mathbf{x}_n)} = \frac{1}{1 + \exp(\mathbf{w}^\top\mathbf{x}_n)}$
$\exp(\mathbf{w}^\top\mathbf{x}_n) = 1 \;\Rightarrow\; \mathbf{w}^\top\mathbf{x}_n = 0$

▪ Therefore, both views are equivalent
Linear Models for (Multi-class) Classification
▪ If there are $C > 2$ classes, we use $C$ weight vectors $\{\mathbf{w}_i\}_{i=1}^{C}$ to define the model; they form a $D \times C$ weight matrix $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_C]$
▪ The prediction rule is as follows

$y_n = \arg\max_{i \in \{1,2,\ldots,C\}} \mathbf{w}_i^\top\mathbf{x}_n$

▪ Can think of $\mathbf{w}_i^\top\mathbf{x}_n$ as the score/similarity of the input w.r.t. the $i$-th class
▪ Can also use these scores to compute the probability of belonging to each class; this is "softmax" classification, the multi-class extension of logistic regression:

$\mu_{n,i} = p(y_n = i \mid \mathbf{W}, \mathbf{x}_n) = \frac{\exp(\mathbf{w}_i^\top\mathbf{x}_n)}{\sum_{j=1}^{C}\exp(\mathbf{w}_j^\top\mathbf{x}_n)}$

Just like in logistic regression, the scores $\mathbf{w}_i^\top\mathbf{x}_n$ are called logits ($C$ logits in this case). The vector $\boldsymbol{\mu}_n = [\mu_{n,1}, \mu_{n,2}, \ldots, \mu_{n,C}]$ gives the probabilities of $\mathbf{x}_n$ belonging to each of the $C$ classes; the probabilities must sum to 1, i.e., $\sum_{i=1}^{C}\mu_{n,i} = 1$, and the class $i$ with the largest $\mathbf{w}_i^\top\mathbf{x}_n$ has the largest probability.
Note: We actually need only $C - 1$ weight vectors in softmax classification. Think why?
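
A sketch of softmax classification in NumPy, assuming `W` is the $D \times C$ weight matrix described above; subtracting the per-row maximum logit is a standard numerical-stability trick, not something the slides prescribe.

```python
import numpy as np

def softmax_probs(W, X):
    """Class probabilities for each row of X; W is D x C, X is N x D."""
    logits = X @ W                                        # scores w_i^T x_n, shape N x C
    logits = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    exp_scores = np.exp(logits)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)  # each row sums to 1

def softmax_predict(W, X):
    # argmax of the scores equals argmax of the probabilities (softmax is monotone)
    return np.argmax(X @ W, axis=1)
```
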
Linear Classification: Interpreting the Weight Vectors
▪ Recall that the multi-class classification prediction rule is

$y_n = \arg\max_{i \in \{1,2,\ldots,C\}} \mathbf{w}_i^\top\mathbf{x}_n$

▪ Can think of $\mathbf{w}_i^\top\mathbf{x}_n$ as the score of the input for the $i$-th class (or the similarity of $\mathbf{x}_n$ with $\mathbf{w}_i$)
▪ Once learned (we will see the methods later), these $C$ weight vectors (one for each class) can sometimes have nice interpretations, especially when the inputs are images

[Figure: the learned weight vectors of 4 classes ($\mathbf{w}_{car}$, $\mathbf{w}_{frog}$, $\mathbf{w}_{horse}$, $\mathbf{w}_{cat}$) "unflattened" and visualized as images. They look somewhat like class prototypes, i.e., an "average" of what images from each class should look like, as if we were using LwP. The dot product of each weight vector with an image from the correct class is expected to be the largest, which is also why LwP (with Euclidean distances) acts like a linear model.]
Logistic and Softmax Classification: Pictorially
▪ Logistic regression is a linear model with a single weight vector, i.e., $D$ weights
[Figure: a single output $y_n$ connected to the inputs $x_{n,1}, x_{n,2}, \ldots, x_{n,D}$ through the weights $w_1, w_2, \ldots, w_D$]
▪ Softmax classification is a linear model with $C$ weight vectors, i.e., $D \times C$ weights in total
[Figure: $C$ outputs $y_{n,1}, y_{n,2}, \ldots, y_{n,C}$, each connected to all the inputs $x_{n,1}, x_{n,2}, \ldots, x_{n,D}$]
Loss Functions for Classification
▪ Assume the true label to be $y_n \in \{0,1\}$ and the score of a linear model to be $\mathbf{w}^\top\mathbf{x}_n$
▪ One possibility is to use the squared loss, just like we used in regression

$\ell(y_n, \mathbf{w}^\top\mathbf{x}_n) = (y_n - \mathbf{w}^\top\mathbf{x}_n)^2$

▪ Will be easy to optimize (same solution as the regression case)
▪ Can also consider other loss functions used in regression
▪ Basically, pretend that the binary label is actually a continuous value and treat the problem as regression where the output can only be one of two possible values
▪ However, regression loss functions aren't ideal since $y_n$ is discrete (binary/categorical)
▪ Using the score $\mathbf{w}^\top\mathbf{x}_n$ or the probability $\mu_n = \sigma(\mathbf{w}^\top\mathbf{x}_n)$ of belonging to the positive class, we have specialized loss functions for binary classification
Loss Functions for Classification: Cross-Entropy
▪ Binary cross-entropy (CE) is a popular loss function for binary classification; it is used in logistic regression
▪ Assuming the true $y_n \in \{0,1\}$ and $\mu_n = \sigma(\mathbf{w}^\top\mathbf{x}_n)$ as the predicted probability of $y_n = 1$, the CE loss is

$L(\mathbf{w}) = -\sum_{n=1}^{N}\left[y_n\log\mu_n + (1 - y_n)\log(1 - \mu_n)\right]$

The loss is very large if $y_n$ is 1 and $\mu_n$ is close to 0, or $y_n$ is 0 and $\mu_n$ is close to 1. This is precisely what we want from a good loss function for binary classification.
▪ For multi-class classification, the multi-class CE loss is defined as

$L(\mathbf{W}) = -\sum_{n=1}^{N}\sum_{i=1}^{C} y_{n,i}\log\mu_{n,i}$

where $\mu_{n,i}$ is the predicted probability of $\mathbf{x}_n$ belonging to class $i$, and $y_{n,i} = 1$ if the true label of $\mathbf{x}_n$ is class $i$ and 0 otherwise.
Note: Unlike the least squares loss for regression, for the cross-entropy loss we can't get a closed-form solution for $\mathbf{w}$ by applying first-order optimality (try this as an exercise for the binary CE loss). We can however optimize the CE loss using iterative optimization such as gradient descent. The CE loss is also convex in $\mathbf{w}$ (can be proved easily using the definition of convexity; we will see this later), therefore a unique solution is obtained when we minimize it.
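
A minimal NumPy sketch of both CE losses, assuming 0/1 labels (a one-hot matrix for the multi-class case); the `eps` clipping is an implementation detail to avoid log(0), not part of the definition.

```python
import numpy as np

def binary_ce_loss(w, X, y, eps=1e-12):
    """Binary cross-entropy loss; y is an array of 0/1 labels."""
    mu = 1.0 / (1.0 + np.exp(-(X @ w)))     # predicted probability of class 1
    mu = np.clip(mu, eps, 1 - eps)          # avoid log(0)
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def multiclass_ce_loss(W, X, Y_onehot, eps=1e-12):
    """Multi-class CE loss; Y_onehot has y_{n,i} = 1 for the true class of x_n."""
    logits = X @ W
    logits = logits - logits.max(axis=1, keepdims=True)        # stable softmax
    mu = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.sum(Y_onehot * np.log(np.clip(mu, eps, 1.0)))
```
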
Cross-Entropy Loss: The Gradient
▪ The expression for the gradient of the binary cross-entropy loss (note that $\mu_n$ is a function of $\mathbf{w}$):

$\mathbf{g} = \nabla_{\mathbf{w}} L(\mathbf{w}) = -\sum_{n=1}^{N}(y_n - \mu_n)\,\mathbf{x}_n$

Note the form of each term in the gradient expression: the amount of the current $\mathbf{w}$'s error in predicting the label of the $n$-th training example, multiplied by the input $\mathbf{x}_n$. Using this, we can do gradient descent to learn the optimal $\mathbf{w}$ for logistic regression: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t\,\mathbf{g}^{(t)}$.
▪ The expression for the gradient of the multi-class cross-entropy loss w.r.t. the weight vector of the $i$-th class (we need to compute this gradient for each of the $C$ weight vectors):

$\mathbf{g}_i = \nabla_{\mathbf{w}_i} L(\mathbf{W}) = -\sum_{n=1}^{N}(y_{n,i} - \mu_{n,i})\,\mathbf{x}_n$

Each term again has the same form: the amount of the current $\mathbf{W}$'s error in predicting the label of the $n$-th training example, multiplied by the input $\mathbf{x}_n$. Using these gradients, we can do gradient descent to learn the optimal $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_C]$ for the softmax classification model.
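
Putting the gradient and the update rule together, a sketch of training logistic regression by gradient descent; the averaged gradient, learning rate, and iteration count are illustrative choices.

```python
import numpy as np

def fit_logistic_gd(X, y, eta=0.1, iters=1000):
    """Gradient descent on the binary CE loss; y holds 0/1 labels."""
    N, D = X.shape
    w = np.zeros(D)
    for t in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ w)))   # mu_n depends on the current w
        g = -(X.T @ (y - mu)) / N             # average of -(y_n - mu_n) x_n
        w = w - eta * g                       # w^(t+1) = w^(t) - eta_t g^(t)
    return w
```
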
Some Other Loss Functions for Binary Classification
▪ Assume the true label as $y_n$ and the prediction as $\hat{y}_n = \mathrm{sign}(\mathbf{w}^\top\mathbf{x}_n)$
▪ The zero-one loss is the most natural loss function for classification

$\ell(y_n, \hat{y}_n) = \begin{cases} 1 & \text{if } y_n \neq \hat{y}_n \\ 0 & \text{if } y_n = \hat{y}_n \end{cases} \qquad \text{equivalently} \qquad \ell(y_n, \hat{y}_n) = \begin{cases} 1 & \text{if } y_n\mathbf{w}^\top\mathbf{x}_n < 0 \\ 0 & \text{if } y_n\mathbf{w}^\top\mathbf{x}_n \geq 0 \end{cases}$

It is non-convex, non-differentiable, and NP-hard to optimize (it also provides no useful gradient information for the most part).
[Figure: plot of the zero-one loss as a function of $y_n\mathbf{w}^\top\mathbf{x}_n$, a step from 1 down to 0 at the origin]
▪ Since the zero-one loss is hard to minimize, we use some surrogate loss function
▪ Popular examples: cross-entropy (also called logistic loss), hinge loss, etc.
▪ Note: Ideally, a surrogate loss (an approximation of the zero-one loss) should be an upper bound on it (i.e., larger than the 0-1 loss for all values of $y_n\mathbf{w}^\top\mathbf{x}_n$), since our goal is minimization
Some Other Loss Functions for Binary Classification (contd.)
▪ For an ideal loss function, assuming $y_n \in \{-1, +1\}$:
▪ Large positive $y_n\mathbf{w}^\top\mathbf{x}_n$ $\Rightarrow$ small/zero loss
▪ Large negative $y_n\mathbf{w}^\top\mathbf{x}_n$ $\Rightarrow$ large/non-zero loss
▪ Equivalently, small (large) loss if the predicted probability of the true label is large (small)
▪ "Perceptron" loss: convex and non-differentiable; zero loss on correctly classified points, but not an upper bound on the 0-1 loss
▪ Log(istic) loss: convex and differentiable, and an upper bound on the 0-1 loss; it is the same as the cross-entropy loss (logistic regression) if we assume the labels to be $-1/+1$ instead of 0/1
▪ Hinge loss: convex and non-differentiable, and an upper bound on the 0-1 loss; very popular like the cross-entropy loss, and used in SVM (Support Vector Machine) classification
[Figure: plots of the perceptron, logistic, and hinge losses as functions of $y_n\mathbf{w}^\top\mathbf{x}_n$; see the sketch after this list for the corresponding formulas]
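
A sketch of the losses named above, written as functions of the margin $y_n\mathbf{w}^\top\mathbf{x}_n$; the base-2 logarithm in the logistic loss is an assumption chosen so that the loss equals 1 at margin 0 and upper-bounds the 0-1 loss, matching the plot on the slide.

```python
import numpy as np

def zero_one_loss(margin):        # margin = y_n * w^T x_n, with y_n in {-1, +1}
    return (margin < 0).astype(float)

def perceptron_loss(margin):      # zero loss for correctly classified points
    return np.maximum(0.0, -margin)

def hinge_loss(margin):           # used in SVMs; upper-bounds the 0-1 loss
    return np.maximum(0.0, 1.0 - margin)

def log_loss(margin):             # logistic / cross-entropy loss with -1/+1 labels
    return np.log2(1.0 + np.exp(-margin))   # base-2 log (assumed) so that loss(0) = 1

margins = np.linspace(-2, 2, 5)
print(zero_one_loss(margins), hinge_loss(margins))
```
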
Nonlinear Classification using Linear Models?
▪ Yes: transform the original features and apply a logistic or softmax classification model on top
▪ The feature transformation can be pre-defined (e.g., using kernels) or learned (using neural nets)
[Figure: inputs $\mathbf{x}$ mapped via $\phi(\mathbf{x})$ to a transformed space where a linear boundary, defined by $\mathbf{w}$ (binary) or $\mathbf{w}_1, \mathbf{w}_2, \mathbf{w}_3$ (multi-class), separates the classes]
▪ This is similar to how we nonlinearize a linear model for regression
▪ Only the loss function $\ell(y_n, f(\mathbf{x}_n))$ changes:
▪ Binary CE loss if using logistic regression at the top
▪ Multi-class CE if using softmax classification at the top
▪ Or other classification loss functions if using other linear classifiers at the top
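
As one concrete (assumed) example of a pre-defined transformation, the sketch below builds quadratic features; feeding `quadratic_features(X)` to a linear classifier such as the earlier `fit_logistic_gd` sketch gives a nonlinear decision boundary in the original feature space.

```python
import numpy as np

def quadratic_features(X):
    """Simple pre-defined transform phi(x): original, squared, and pairwise terms."""
    N, D = X.shape
    squares = X ** 2
    if D > 1:
        cross = np.stack([X[:, i] * X[:, j]
                          for i in range(D) for j in range(i + 1, D)], axis=1)
    else:
        cross = np.empty((N, 0))
    return np.hstack([X, squares, cross])
```
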
Evaluation Measures for Binary Classification
▪ Average classification error or average accuracy (on val./test data)

$\mathrm{err}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[y_n \neq \hat{y}_n] \qquad \mathrm{acc}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[y_n = \hat{y}_n]$

▪ The cross-entropy loss itself (on val./test data)
▪ Precision, Recall, and F1 score (preferred if the labels are imbalanced)
▪ Precision (P): of the positive predictions made by the model, what fraction are true positives
▪ Recall (R): of all true positive examples, what fraction did the model predict as positive
▪ F1 score: the harmonic mean of P and R
▪ The confusion matrix is also a helpful measure; various other metrics such as error/accuracy, P, R, F1, etc. can be readily calculated from the confusion matrix
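
A sketch of these binary metrics computed directly from 0/1 labels and predictions (equivalently, from the confusion-matrix counts TP, FP, FN); the zero-division guards are an implementation choice.

```python
import numpy as np

def binary_classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from 0/1 labels and predictions."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = np.mean(y_true == y_pred)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)   # harmonic mean of P and R
    return acc, precision, recall, f1
```
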
Evaluation Measures for Multi-class Classification
▪ Average classification error or average accuracy (on val./test data)

$\mathrm{err}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[y_n \neq \hat{y}_n] \qquad \mathrm{acc}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[y_n = \hat{y}_n]$

▪ Top-k accuracy (based on the predicted probabilities/scores of the various classes)

$\text{Top-}k\ \text{accuracy} = \frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[y_n \in \hat{S}_n]$

where $y_n$ is the true label and $\hat{S}_n$ is the set of top-$k$ predicted classes for $\mathbf{x}_n$
▪ The multi-class cross-entropy loss itself (on val./test data)
▪ Class-wise Precision, Recall, and F1 score (preferred if the labels are imbalanced)
▪ The confusion matrix; various other metrics such as error/accuracy, P, R, F1, etc. can be readily calculated from it
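
A sketch of top-k accuracy, assuming `scores` is an N x C array of class scores or probabilities and `y_true` holds integer class labels.

```python
import numpy as np

def top_k_accuracy(scores, y_true, k=5):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]        # indices of the k largest scores per row
    hits = np.any(top_k == y_true[:, None], axis=1)   # is the true label in the top-k set?
    return np.mean(hits)
```
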

You might also like