Unit 3

The document discusses the history and principles of Support Vector Machines (SVM) and linear classifiers, highlighting their effectiveness in tasks like handwritten digit recognition. It also covers instance-based learning methods such as k-Nearest Neighbors (k-NN), including their advantages, disadvantages, and considerations for distance metrics and neighbor selection. Additionally, it introduces probabilistic classification methods, including Naïve Bayes, and explains the importance of understanding prior, conditional, and joint probabilities.


Machine Learning

Classifiers-2

Carla P. Gomes
CS4700
History of SVM
• SVM is related to statistical learning theory [3]
• SVM was first introduced in 1992 [1]
• SVM became popular because of its success in handwritten digit recognition:
  a 1.1% test error rate for SVM, the same as the error rate of a carefully
  constructed neural network, LeNet 4 (see Section 5.11 in [2] or the discussion
  in [3] for details)
• SVM is now regarded as an important example of "kernel methods", one of the
  key areas in machine learning
  – Note: the meaning of "kernel" here is different from the "kernel" function
    used for Parzen windows

[1] B. E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 5:144-152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.
Linear Classifiers (Estimation)

  x → f(x,w,b) → y_est,  with  f(x,w,b) = sign(w·x − b)

  w: weight vector
  x: data vector
  (In the figure, one marker denotes the +1 class and the other denotes the −1 class.)

How would you classify this data?
(Several build slides follow, each drawing a different candidate linear decision boundary through the same data and asking the same question.)
Linear Classifiers

  f(x,w,b) = sign(w·x − b)

Any of these boundaries would be fine...
...but which is best?
Classifier Margin

  f(x,w,b) = sign(w·x − b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin

  f(x,w,b) = sign(w·x − b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM, called a Linear SVM (LSVM).
Maximum Margin

  f(x,w,b) = sign(w·x + b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

Support Vectors are those datapoints that the margin pushes up against.

This is the simplest kind of SVM, called a Linear SVM (LSVM).
Why Maximum Margin?

  f(x,w,b) = sign(w·x − b)

The maximum margin linear classifier is the linear classifier with the maximum margin; the Support Vectors are the datapoints that the margin pushes up against. This is the simplest kind of SVM (called an LSVM).
How to calculate the distance from a point to a line?

The separating line is wx + b = 0, where
  x – data vector
  w – normal vector
  b – scale value

See http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html

In our (two-dimensional) case w1*x1 + w2*x2 + b = 0, thus w = (w1, w2) and x = (x1, x2).
Estimate the Margin

With the separating line wx + b = 0 (x – data vector, w – normal vector, b – scale value):

What is the distance expression for a point x to the line wx + b = 0?

  d(x) = |x·w + b| / ||w||_2 = |x·w + b| / sqrt( Σ_{i=1}^{d} w_i² )
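To make the distance expression concrete, here is a minimal NumPy sketch (the line w = (1, 2), b = −4 and the test points are illustrative values, not from the slides) that evaluates d(x) = |x·w + b| / ||w||₂:

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane w.x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Hypothetical 2-D line x1 + 2*x2 - 4 = 0, i.e. w = (1, 2), b = -4
w = np.array([1.0, 2.0])
b = -4.0
for x in ([0.0, 0.0], [2.0, 1.0], [3.0, 3.0]):
    print(x, "->", distance_to_hyperplane(np.array(x), w, b))
# (2, 1) lies exactly on the line, so its distance is 0.
```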
Large-margin Decision Boundary

• The decision boundary should be as far away from the data of both classes as possible
• We should maximize the margin, m
• The distance between the origin and the line wᵀx = −b is |b| / ||w||

(Figure: Class 1 and Class 2 separated by the decision boundary with margin m.)
Instance-Based Learning

Idea:
– Similar examples have similar labels.
– Classify new examples like similar training examples.
Algorithm:
– Given some new example x for which we need to predict its class y
– Find most similar training examples
– Classify x “like” these most similar examples
Questions:
– How to determine similarity?
– How many similar training examples to consider?
– How to resolve inconsistencies among the training examples?

1-Nearest Neighbor

One of the simplest of all machine learning classifiers


Simple idea: label a new point the same as the closest known point

Label it red.

1-Nearest Neighbor

A type of instance-based learning


– Also known as “memory-based” learning
Forms a Voronoi tessellation of the instance space

Distance Metrics
Different metrics can change the decision surface

  Dist(a,b) = (a1 – b1)² + (a2 – b2)²        Dist(a,b) = (a1 – b1)² + (3a2 – 3b2)²

Standard Euclidean distance metric:
– Two-dimensional: Dist(a,b) = sqrt((a1 – b1)² + (a2 – b2)²)
– Multivariate: Dist(a,b) = sqrt(Σi (ai – bi)²)

Adapted from "Instance-Based Learning" lecture slides by Andrew Moore, CMU.
1-NN’s Aspects as an
Instance-Based Learner:

A distance metric
– Euclidean
– When different units are used for each dimension
 normalize each dimension by standard deviation
– For discrete data, can use Hamming distance
 D(x1,x2) =number of features on which x1 and x2 differ
– Others (e.g., normal, cosine)

How many nearby neighbors to look at?


– One
How to fit with the local points?
– Just predict the same output as the nearest neighbor.

Adapted from “Instance-Based Learning”


lecture slides by Andrew Moore, CMU.
k – Nearest Neighbor

Generalizes 1-NN to smooth away noise in the labels


A new point is now assigned the most frequent label of its k nearest neighbors

Label it red, when k = 3

Label it blue, when k = 7


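As a minimal sketch of the k-NN rule described above (Euclidean distance, majority vote; the toy data, labels and function name are made up for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Assign x_new the most frequent label among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: a "red" cluster near (1, 1) and a "blue" cluster near (5, 5)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array(["red", "red", "red", "blue", "blue", "blue"])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=1))  # 1-NN -> "red"
print(knn_predict(X_train, y_train, np.array([3.2, 3.1]), k=3))  # larger k smooths label noise
```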
KNN Example

     Food (3)   Chat (2)   Fast (2)   Price (3)   Bar (2)   BigTip
  1  great      yes        yes        normal      no        yes
  2  great      no         yes        normal      no        yes
  3  mediocre   yes        no         high        no        no
  4  great      yes        yes        normal      yes       yes

Similarity metric: number of matching attributes (k = 2)

New examples:
– Example 1: (great, no, no, normal, no) → Yes
  • Most similar: number 2 (1 mismatch, 4 matches) → yes
  • Second most similar: number 1 (2 mismatches, 3 matches) → yes
– Example 2: (mediocre, yes, no, normal, no) → Yes/No
  • Most similar: number 3 (1 mismatch, 4 matches) → no
  • Second most similar: number 1 (2 mismatches, 3 matches) → yes


Selecting the Number of Neighbors

Increase k:
– Makes KNN less sensitive to noise

Decrease k:
– Allows capturing finer structure of space

Pick k not too large, but not too small (depends on data)

Curse-of-Dimensionality

Prediction accuracy can quickly degrade when number of attributes grows.


– Irrelevant attributes easily “swamp” information from relevant attributes
– When many irrelevant attributes, similarity/distance measure becomes less reliable

Remedy
– Try to remove irrelevant attributes in pre-processing step
– Weight attributes differently
– Increase k (but not too much)

Advantages and Disadvantages of KNN

Disadvantages:
– Need a distance/similarity measure and attributes that "match" the target function.
– For large training sets, must make a pass through the entire dataset for each classification; this can be prohibitive for large data sets.
– Prediction accuracy can quickly degrade when the number of attributes grows.

Advantages:
– Simple to implement;
– Requires little tuning;
– Often performs quite well!
  (Try it first on a new learning problem.)
Background
• There are three methods to establish a classifier
a) Model a classification rule directly
Examples: k-NN, decision trees, perceptron, SVM
b) Model the probability of class memberships given input
data
Example: multi-layered perceptron with the cross-entropy cost
c) Make a probabilistic model of data within each class
Examples: naive Bayes, model based classifiers
• a) and b) are examples of discriminative
classification
• c) is an example of generative classification
• b) and c) are both examples of probabilistic classification
Probability Basics
• Prior, conditional and joint probability
  – Prior probability: P(X)
  – Conditional probability: P(X1|X2), P(X2|X1)
  – Joint probability: X = (X1, X2), P(X) = P(X1, X2)
  – Relationship: P(X1, X2) = P(X2|X1)P(X1) = P(X1|X2)P(X2)
  – Independence: P(X2|X1) = P(X2), P(X1|X2) = P(X1), P(X1, X2) = P(X1)P(X2)
• Bayesian Rule

    P(C|X) = P(X|C)P(C) / P(X)      (Posterior = Likelihood × Prior / Evidence)


Probabilistic Classification
• Establishing a probabilistic model for classification
  – Discriminative model: P(C|X),  C = c1,…,cL,  X = (X1,…,Xn)
  – Generative model:     P(X|C),  C = c1,…,cL,  X = (X1,…,Xn)
• MAP classification rule
  – MAP: Maximum A Posteriori
  – Assign x to c* if P(C = c*|X = x) > P(C = c|X = x) for all c ≠ c*, c = c1,…,cL
• Generative classification with the MAP rule
  – Apply the Bayesian rule to convert: P(C|X) = P(X|C)P(C) / P(X) ∝ P(X|C)P(C)
Feature Histograms
(Figure: class-conditional histograms P(x) for classes C1 and C2 along feature x. Slide by Stephen Marsland.)

Posterior Probability
(Figure: posterior probabilities P(C|x) as a function of x. Slide by Stephen Marsland.)
Naïve Bayes
• Bayes classification

    P(C|X) ∝ P(X|C)P(C) = P(X1,…,Xn|C)P(C)

  Difficulty: learning the joint probability P(X1,…,Xn|C)
• Naïve Bayes classification
  – Making the assumption that all input attributes are conditionally independent:

      P(X1,X2,…,Xn|C) = P(X1|X2,…,Xn;C) P(X2,…,Xn|C)
                      = P(X1|C) P(X2,…,Xn|C)
                      = P(X1|C) P(X2|C) ··· P(Xn|C)

  – MAP classification rule:

      [P(x1|c*) ··· P(xn|c*)] P(c*) > [P(x1|c) ··· P(xn|c)] P(c),  c ≠ c*,  c = c1,…,cL


Naïve Bayes
• Naïve Bayes Algorithm (for discrete input attributes)
  – Learning Phase: Given a training set S,
      For each target value ci (ci = c1,…,cL)
        P̂(C = ci) ← estimate P(C = ci) with examples in S;
      For every attribute value ajk of each attribute xj (j = 1,…,n; k = 1,…,Nj)
        P̂(Xj = ajk | C = ci) ← estimate P(Xj = ajk | C = ci) with examples in S;
      Output: conditional probability tables; for each xj, Nj × L elements
  – Test Phase: Given an unknown instance X' = (a'1,…,a'n),
      look up the tables to assign the label c* to X' if

      [P̂(a'1|c*) ··· P̂(a'n|c*)] P̂(c*) > [P̂(a'1|c) ··· P̂(a'n|c)] P̂(c),  c ≠ c*,  c = c1,…,cL
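A minimal sketch of the learning and test phases above for discrete attributes (plain maximum-likelihood counts, no smoothing; the data structures and function names are mine, not from the slides):

```python
from collections import Counter, defaultdict

def nb_train(examples):
    """Learning phase: estimate P(C=ci) and P(Xj=ajk | C=ci).
    examples: list of (attribute_tuple, class_label) pairs, e.g. (("sunny", "cool"), "no")."""
    n = len(examples)
    class_counts = Counter(label for _, label in examples)
    priors = {c: cnt / n for c, cnt in class_counts.items()}
    cond = defaultdict(Counter)                     # cond[(j, class)][value] = count
    for attrs, label in examples:
        for j, value in enumerate(attrs):
            cond[(j, label)][value] += 1
    cond_prob = {key: {v: cnt / sum(counter.values()) for v, cnt in counter.items()}
                 for key, counter in cond.items()}
    return priors, cond_prob

def nb_classify(priors, cond_prob, attrs):
    """Test phase: pick the class maximizing P(c) * prod_j P(a_j | c) (MAP rule)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for j, value in enumerate(attrs):
            score *= cond_prob.get((j, c), {}).get(value, 0.0)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```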


Example
• Example: Play Tennis
(The training data is the standard 14-example Play Tennis dataset with attributes Outlook, Temperature, Humidity and Wind; the class counts used below are 9 "Yes" and 5 "No".)


Example
• Learning Phase

  Outlook    Play=Yes  Play=No        Temperature  Play=Yes  Play=No
  Sunny        2/9       3/5          Hot            2/9       2/5
  Overcast     4/9       0/5          Mild           4/9       2/5
  Rain         3/9       2/5          Cool           3/9       1/5

  Humidity   Play=Yes  Play=No        Wind         Play=Yes  Play=No
  High         3/9       4/5          Strong         3/9       3/5
  Normal       6/9       1/5          Weak           6/9       2/5

  P(Play=Yes) = 9/14        P(Play=No) = 5/14


Example
• Test Phase
  – Given a new instance,
      x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
  – Look up the tables:
      P(Outlook=Sunny|Play=Yes) = 2/9          P(Outlook=Sunny|Play=No) = 3/5
      P(Temperature=Cool|Play=Yes) = 3/9       P(Temperature=Cool|Play=No) = 1/5
      P(Humidity=High|Play=Yes) = 3/9          P(Humidity=High|Play=No) = 4/5
      P(Wind=Strong|Play=Yes) = 3/9            P(Wind=Strong|Play=No) = 3/5
      P(Play=Yes) = 9/14                       P(Play=No) = 5/14
  – MAP rule
      P(Yes|x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
      P(No|x')  ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

  Given that P(Yes|x') < P(No|x'), we label x' as "No".
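The two scores can be reproduced with one line of arithmetic each (Python shown just to check the numbers):

```python
# MAP scores for x' = (Sunny, Cool, High, Strong), read off the tables above
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)
print(round(p_yes, 4), round(p_no, 4))   # 0.0053 0.0206 -> predict "No"
```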
Relevant Issues
• Violation of the Independence Assumption
  – For many real world tasks, P(X1,…,Xn|C) ≠ P(X1|C) ··· P(Xn|C)
  – Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero Conditional Probability Problem
  – If no training example contains the attribute value Xj = ajk, then P̂(Xj = ajk|C = ci) = 0
  – In this circumstance, P̂(x1|ci) ··· P̂(ajk|ci) ··· P̂(xn|ci) = 0 during test
  – As a remedy, conditional probabilities are estimated with

      P̂(Xj = ajk | C = ci) = (nc + m·p) / (n + m)

    nc: number of training examples for which Xj = ajk and C = ci
    n : number of training examples for which C = ci
    p : prior estimate (usually, p = 1/t for t possible values of Xj)
    m : weight given to the prior (number of "virtual" examples, m ≥ 1)
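A tiny sketch of the m-estimate remedy above (the function name and the example numbers are mine):

```python
def m_estimate(nc, n, t, m=1):
    """Smoothed estimate of P(Xj = ajk | C = ci): (nc + m*p) / (n + m), with p = 1/t."""
    p = 1.0 / t
    return (nc + m * p) / (n + m)

# An attribute value never seen with this class (nc = 0) no longer gets probability 0:
print(m_estimate(nc=0, n=9, t=3, m=1))   # 0.0333... instead of 0.0
```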
Relevant Issues
• Continuous-valued Input Attributes
  – Continuous (numberless) values for an attribute
  – Conditional probability modeled with the normal distribution:

      P̂(Xj|C = ci) = (1 / (√(2π) σji)) · exp( −(Xj − μji)² / (2σji²) )

    μji: mean (average) of the attribute values Xj of examples for which C = ci
    σji: standard deviation of the attribute values Xj of examples for which C = ci

    for X = (X1,…,Xn), C = c1,…,cL
  – Learning Phase: output n·L normal distributions and P(C = ci), i = 1,…,L
  – Test Phase: for X' = (X1,…,Xn),
    • Calculate conditional probabilities with all the normal distributions
    • Apply the MAP rule to make a decision
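A minimal sketch of the Gaussian version: fit a mean and standard deviation per (attribute, class) pair in the learning phase, then plug them into the normal density at test time (the temperature values are made up):

```python
import math

def fit_gaussian(values):
    """Learning phase for one (attribute, class) pair: estimate mean and standard deviation."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)

def gaussian_likelihood(x, mu, sigma):
    """P-hat(Xj = x | C = ci) under a normal distribution with mean mu and std dev sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical temperatures of the "Play=Yes" examples
mu, sigma = fit_gaussian([21.0, 24.0, 22.0, 20.0, 23.0])
print(gaussian_likelihood(22.5, mu, sigma))
```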
Conclusions
• Naïve Bayes is based on the independence assumption
  – Training is very easy and fast; it just requires considering each attribute in each class separately
  – Testing is straightforward; just look up tables or calculate conditional probabilities with normal distributions
• A popular generative model
  – Performance competitive with most state-of-the-art classifiers, even when the independence assumption is violated
  – Many successful applications, e.g., spam mail filtering
Goal of Supervised Learning?
• Minimize the probability of model prediction errors on
future data

• Two Competing Methodologies


– Build one really good model
• Traditional approach
– Build many models and average the results
• Ensemble learning (more recent)

The Single Model Philosophy
• Motivation: Occam’s Razor
– “one should not increase, beyond what is necessary, the number
of entities required to explain anything”
• Infinitely many models can explain any given dataset
– Might as well pick the smallest one…

Which Model is Smaller?

  ŷ = f1(x) = sin(x)
or
  ŷ = f2(x) = x − x³/3! + x⁵/5! − x⁷/7! + …

• In this case the two models are the same: f2 is just the Taylor series of sin(x)
• It's not always easy to define small!
Exact Occam’s Razor Models
• Exact approaches find optimal solutions
• Examples:
– Support Vector Machines
• Find a model structure that uses the smallest percentage of training data
(to explain the rest of it).
– Bayesian approaches
• Minimum description length

How Do Support Vector Machines Define Small?

• Minimize the number of Support Vectors!

(Figure: a maximum-margin boundary with its maximized margin.)
Approximate Occam’s Razor Models
• Approximate solutions use a greedy search approach which is not
optimal
• Examples
– Kernel Projection Pursuit algorithms
• Find a minimal set of kernel projections
– Relevance Vector Machines
• Approximate Bayesian approach
– Sparse Minimax Probability Machine Classification
• Find a minimum set of kernels and features

Other Single Models: Not
Necessarily Motivated by Occam’s
Razor
• Minimax Probability Machine (MPM)
• Trees
– Greedy approach to sparseness
• Neural Networks
• Nearest Neighbor
• Basis Function Models
– e.g. Kernel Ridge Regression
Ensemble Philosophy
• Build many models and combine them
• Only through averaging do we get at the truth!
• It’s too hard (impossible?) to build a single model that
works best
• Two types of approaches:
– Models that don’t use randomness
– Models that incorporate randomness

Ensemble Approaches
• Bagging
– Bootstrap aggregating

• Boosting

• Random Forests
– Bagging reborn

Bagging
• Main Assumption:
  – Combining many unstable predictors produces an ensemble (stable) predictor.
  – Unstable Predictor: small changes in the training data produce large changes in the model.
    • e.g. Neural Nets, trees
    • Stable: SVM (sometimes), Nearest Neighbor.
• Hypothesis Space
  – Variable size (nonparametric):
    • Can model any function if you use an appropriate predictor (e.g. trees)
The Bagging Algorithm

Given data: D = {(x1, y1), …, (xN, yN)}

For m = 1 : M
  • Obtain a bootstrap sample Dm from the training data D
  • Build a model Gm(x) from the bootstrap data Dm
The Bagging Model
• Regression:

    ŷ = (1/M) Σ_{m=1}^{M} Gm(x)

• Classification:
  – Vote over the classifier outputs G1(x), …, GM(x)
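A minimal sketch of bagging as described on the last two slides; build_model stands in for any unstable base learner (e.g. a decision-tree trainer returning a callable classifier) and is an assumption of this sketch, not something defined in the slides:

```python
import random
from collections import Counter

def bagging_train(D, build_model, M=30):
    """Fit M models, each on a bootstrap sample (N draws with replacement) of the data D."""
    N = len(D)
    return [build_model([random.choice(D) for _ in range(N)]) for _ in range(M)]

def bagging_classify(models, x):
    """Classification: vote over the model outputs G_1(x), ..., G_M(x)."""
    return Counter(G(x) for G in models).most_common(1)[0][0]

def bagging_regress(models, x):
    """Regression: average the model outputs, y_hat = (1/M) * sum_m G_m(x)."""
    return sum(G(x) for G in models) / len(models)
```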
Bagging Details
• Bootstrap sample of N instances is obtained by drawing N
examples at random, with replacement.
• On average each bootstrap sample contains about 63% of the distinct training instances
  – Encourages predictors to have uncorrelated errors
    • This is why it works
Bagging Details 2
• Usually set M =~ 30
– Or use validation data to pick M
• The models Gm (x) need to be unstable
– Usually full length (or slightly pruned) decision trees.

Boosting
– Main Assumption:
• Combining many weak predictors (e.g. tree stumps or 1-R predictors) to
produce an ensemble predictor
• The weak predictors or classifiers need to be stable
– Hypothesis Space
• Variable size (nonparametric):
– Can model any function if you use an appropriate predictor (e.g. trees)

Commonly Used Weak Predictor
(or classifier)
A Decision Tree Stump (1-R)

Boosting

• Each classifier Gm(x) is trained from a weighted sample of the training data
Boosting (Continued)
• Each predictor is created by using a biased sample of the
training data
– Instances (training examples) with high error are weighted higher
than those with lower error
• Difficult instances get more attention
– This is the motivation behind boosting

Background Notation
• The function I(s) is defined as:
    I(s) = 1 if s is true, 0 otherwise
• The function log(x) is the natural logarithm
The AdaBoost Algorithm
(Freund and Schapire, 1996)

Given data: D = {(x1, y1), …, (xN, yN)}

1. Initialize weights wi = 1/N, i = 1, …, N
2. For m = 1 : M
   a) Fit classifier Gm(x) ∈ {−1, 1} to the data using weights wi
   b) Compute
        errm = Σ_{i=1}^{N} wi I(yi ≠ Gm(xi)) / Σ_{i=1}^{N} wi
   c) Compute αm = log((1 − errm) / errm)
   d) Set wi ← wi exp[ αm I(yi ≠ Gm(xi)) ], i = 1, …, N
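A compact sketch of the algorithm above; fit_weak stands in for any weak learner (e.g. a decision-stump trainer) that accepts example weights and returns a classifier with outputs in {−1, +1}, which is an assumption of this sketch:

```python
import numpy as np

def adaboost(X, y, fit_weak, M=50):
    """AdaBoost (Freund and Schapire, 1996) for labels y in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # 1. initialize weights w_i = 1/N
    models, alphas = [], []
    for m in range(M):                           # 2. for m = 1..M
        G = fit_weak(X, y, w)                    #    a) fit weak classifier using weights w
        miss = (G(X) != y)                       #       indicator I(y_i != G_m(x_i))
        err = np.sum(w * miss) / np.sum(w)       #    b) weighted error err_m
        err = np.clip(err, 1e-10, 1 - 1e-10)     #       guard against err = 0 or 1 in this sketch
        alpha = np.log((1 - err) / err)          #    c) classifier weight alpha_m
        w = w * np.exp(alpha * miss)             #    d) up-weight the misclassified examples
        models.append(G)
        alphas.append(alpha)

    def predict(Xq):                             # final model: sign of the weighted vote
        return np.sign(sum(a * G(Xq) for a, G in zip(alphas, models)))
    return predict
```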
The AdaBoost Model

  ŷ = sgn( Σ_{m=1}^{M} αm Gm(x) )

AdaBoost is NOT used for Regression!
The Updates in Boosting

(Figures: the classifier weight αm = log((1 − errm)/errm) and the re-weighting factor w·exp(αm) plotted as functions of errm; αm is large and positive for small errm, zero at errm = 0.5, and negative beyond it.)
Boosting Characteristics
Simulated data: test error
rate for boosting with
stumps, as a function of
the number of iterations.
Also shown are the test
error rate for a single
stump, and a 400 node
tree.

Loss Functions for y ∈ {−1, +1}, f ∈ ℝ

• Misclassification:                  I(sgn(f) ≠ y)
• Exponential (Boosting):             exp(−y f)
• Binomial Deviance (Cross Entropy):  log(1 + exp(−2 y f))
• Squared Error:                      (y − f)²
• Support Vector (hinge):             (1 − y f) · I(y f < 1)

(Figure: the losses plotted against the margin y·f; negative margin = incorrect classification, positive margin = correct classification.)
Other Variations of Boosting
• Gradient Boosting
– Can use any cost function
• Stochastic (Gradient) Boosting
– Bootstrap Sample: Uniform random sampling (with replacement)
– Often outperforms the non-random version

Gradient Boosting
(Figure not reproduced here.)
Boosting Summary
• Good points
– Fast learning
– Capable of learning any function (given appropriate weak learner)
– Feature weighting
– Very little parameter tuning
• Bad points
– Can overfit data
– Only for binary classification
• Learning parameters (picked via cross validation)
– Size of tree
– When to stop
• Software
– http://www-stat.stanford.edu/~jhf/R-MART.html

◾ Idea: Optimize an Additive Model
   Additive prediction model:

      ŷᵢ = Σ_t f_t(xᵢ)

    Here each f_t can be multi-level (a full decision tree)!
   Objective (cost) function:

      Obj = Σᵢ l(yᵢ, ŷᵢ) + Σ_t ω(f_t)

    ω(f_t) is a regularization term that models the complexity of the tree.

(Slides adapted from Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu)


◾ Use the additive model to train sequentially:
   Start from a constant prediction, then add one new decision tree 𝒇𝒕 at each round:

      ŷᵢ(t) = ŷᵢ(t−1) + f_t(xᵢ)

    (the prediction at training round t keeps the predictions from the previous rounds and adds the new model f_t)
◾ XGBoost: eXtreme Gradient Boosting
   A highly scalable implementation of gradient boosted decision trees with regularization
   Widely used by data scientists; provides state-of-the-art results on many problems!

◾ System optimizations:
   Parallel tree construction using a column block structure
   Distributed computing for training very large models using a cluster of machines
   Out-of-core computing for very large datasets that don't fit into memory
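A minimal usage sketch with the xgboost Python package (assuming it is installed; the synthetic data and parameter values are illustrative, not tuned):

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic binary labels

model = XGBClassifier(
    n_estimators=100,     # number of additive trees f_t
    max_depth=3,          # each f_t can be multi-level
    learning_rate=0.1,
    reg_lambda=1.0,       # regularization on tree complexity (the omega(f_t) term)
)
model.fit(X, y)
print(model.predict_proba(X[:3]))
```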
Topics

● Why are metrics important?


● Binary classifiers
○ Rank view, Thresholding
● Metrics
○ Confusion Matrix
○ Point metrics: Accuracy, Precision, Recall / Sensitivity, Specificity, F-score
○ Summary metrics: AU-ROC, AU-PRC, Log-loss.
● Choosing Metrics
● Class Imbalance
○ Failure scenarios for each metric
● Multi-class
Why are metrics important?

- Training objective (cost function) is only a proxy for real world objectives.
- Metrics help capture a business goal into a quantitative target (not all errors are equal).
- Helps organize ML team effort towards that target.
- Generally in the form of improving that metric on the dev set.
- Useful to quantify the “gap” between:
- Desired performance and baseline (estimate effort initially).
- Desired performance and current performance.
- Measure progress over time.
- Useful for lower level tasks and debugging (e.g. diagnosing bias vs variance).
- Ideally training objective should be the metric, but not always possible. Still, metrics are useful and
important for evaluation.
Binary Classification

● x is input
● y is binary output (0/1)
● Model is ŷ = h(x)
● Two types of models
○ Models that output a categorical class directly (K-nearest neighbor, Decision tree)
○ Models that output a real valued score (SVM, Logistic Regression)
■ Score could be margin (SVM), probability (LR, NN)
■ Need to pick a threshold
■ We focus on this type (the other type can be interpreted as an instance of it)
Score based models

(Figure: positive and negative examples ordered by model score, from Score = 1 at the top down to Score = 0 at the bottom.)

Example of score: output of logistic regression.

For most metrics: only the ranking matters.
If too many examples: plot a class-wise histogram.

Prevalence = # positive examples / (# positive examples + # negative examples)
Threshold -> Classifier -> Point Metrics

(Figure: at threshold Th = 0.5, examples scored above the threshold are predicted positive and those below are predicted negative; each prediction is then compared with its true label, "label positive" or "label negative".)
Point metrics: Confusion Matrix  (Th = 0.5)

                     Label Positive   Label Negative
  Predict Positive         9                2
  Predict Negative         1                8

Properties:
- Total sum is fixed (population).
- Column sums are fixed (class-wise population).
- Quality of model & threshold decide how columns are split into rows.
- We want diagonals to be "heavy", off diagonals to be "light".
Point metrics: True Positives, True Negatives, False Positives, False Negatives

With Th = 0.5 and the confusion matrix above:

  Th    TP   TN   FP   FN
  0.5   9    8    2    1

TP = label positive and predicted positive; TN = label negative and predicted negative;
FP = label negative but predicted positive; FN = label positive but predicted negative.
FP and FN are also called Type-1 and Type-2 errors.


Point metrics: Accuracy

  Th    TP   TN   FP   FN   Acc
  0.5   9    8    2    1    .85

Accuracy = (TP + TN) / total = 17/20 = .85. Equivalent to 0-1 Loss!
Point metrics: Precision

  Th    TP   TN   FP   FN   Acc   Pr
  0.5   9    8    2    1    .85   .81

Precision = TP / (TP + FP) = 9/11 ≈ .81.
Point metrics: Positive Recall (Sensitivity)

  Th    TP   TN   FP   FN   Acc   Pr    Recall
  0.5   9    8    2    1    .85   .81   .9

Recall = TP / (TP + FN) = 9/10 = .9.

Trivial 100% recall = pull everybody above the threshold.
Trivial 100% precision = push everybody below the threshold except 1 green (positive) example on top. (Hopefully no gray (negative) example above it!)
Striving for good precision with 100% recall = pulling up the lowest green as high as possible in the ranking.
Striving for good recall with 100% precision = pushing down the top gray as low as possible in the ranking.
Point metrics: Negative Recall (Specificity)

  Th    TP   TN   FP   FN   Acc   Pr    Recall   Spec
  0.5   9    8    2    1    .85   .81   .9       .8

Specificity = TN / (TN + FP) = 8/10 = .8.
Point metrics: F1-score

  Th    TP   TN   FP   FN   Acc   Pr    Recall   Spec   F1
  0.5   9    8    2    1    .85   .81   .9       .8     .857

F1 = 2 · Precision · Recall / (Precision + Recall) ≈ .857.
Point metrics: Changing threshold

  Th    TP   TN   FP   FN   Acc   Pr    Recall   Spec   F1
  0.6   7    8    2    3    .75   .77   .7       .8     .733

# effective thresholds = # examples + 1
Threshold Scanning (from Threshold = 1.00 at Score = 1 down to Threshold = 0.00 at Score = 0)

  Threshold  TP  TN  FP  FN  Accuracy  Precision  Recall  Specificity  F1
  1.00        0  10   0  10    0.50      1          0        1         0
  0.95        1  10   0   9    0.55      1          0.1      1         0.182
  0.90        2  10   0   8    0.60      1          0.2      1         0.333
  0.85        2   9   1   8    0.55      0.667      0.2      0.9       0.308
  0.80        3   9   1   7    0.60      0.750      0.3      0.9       0.429
  0.75        4   9   1   6    0.65      0.800      0.4      0.9       0.533
  0.70        5   9   1   5    0.70      0.833      0.5      0.9       0.625
  0.65        5   8   2   5    0.65      0.714      0.5      0.8       0.588
  0.60        6   8   2   4    0.70      0.750      0.6      0.8       0.667
  0.55        7   8   2   3    0.75      0.778      0.7      0.8       0.737
  0.50        8   8   2   2    0.80      0.800      0.8      0.8       0.800
  0.45        9   8   2   1    0.85      0.818      0.9      0.8       0.857
  0.40        9   7   3   1    0.80      0.750      0.9      0.7       0.818
  0.35        9   6   4   1    0.75      0.692      0.9      0.6       0.783
  0.30        9   5   5   1    0.70      0.643      0.9      0.5       0.750
  0.25        9   4   6   1    0.65      0.600      0.9      0.4       0.720
  0.20        9   3   7   1    0.60      0.562      0.9      0.3       0.692
  0.15        9   2   8   1    0.55      0.529      0.9      0.2       0.667
  0.10        9   1   9   1    0.50      0.500      0.9      0.1       0.643
  0.05       10   1   9   0    0.55      0.526      1        0.1       0.690
  0.00       10   0  10   0    0.50      0.500      1        0         0.667
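The whole table above can be generated by sweeping a threshold over the scores. A minimal sketch (the ten labels and scores here are made-up toy values, not the 20-example data behind the table):

```python
def point_metrics(y_true, scores, th):
    """Confusion-matrix point metrics; predict positive when score >= th (1 = positive label)."""
    pred = [1 if s >= th else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, y_true))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, y_true))
    acc  = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec  = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    f1   = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return tp, tn, fp, fn, acc, prec, rec, spec, f1

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.75, 0.7, 0.65, 0.6, 0.4, 0.3, 0.55, 0.2]
for th in (0.7, 0.5, 0.3):
    print(th, point_metrics(y_true, scores, th))
```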
Summary metrics: Rotated ROC (Sensitivity vs. Specificity)

(Figure: positive and negative examples ranked by score from Score = 1 down to Score = 0; the curve plots Sensitivity = True Pos / Pos against Specificity = True Neg / Neg as the threshold sweeps, with the diagonal corresponding to random guessing.)

AUROC = Area Under the ROC
      = Prob[random Pos ranked higher than random Neg]

Agnostic to prevalence!
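AUROC can be computed directly from its probabilistic definition. A small sketch (pure Python; ties are counted as one half):

```python
def auroc(y_true, scores):
    """Prob[a random positive is ranked higher than a random negative]."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# The "fraudulent" example later in this deck: one positive among 100 examples,
# with a single negative scored above it and 98 below -> AUROC = 98/99.
y = [0, 1] + [0] * 98
s = list(range(100, 0, -1))   # strictly decreasing scores
print(auroc(y, s))            # 0.98989...
```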


Summary metrics: PRC (Recall vs. Precision)

(Figure: the curve plots Precision = True Pos / Predicted Pos against Recall = Sensitivity = True Pos / Pos as the threshold sweeps from Score = 1 to Score = 0.)

AUPRC = Area Under the PRC
      = Expected precision for a random threshold

Precision >= prevalence


Summary metrics:

(Figure: Model A and Model B each score the same data set, shown as two rankings from Score = 1 to Score = 0.)

Two models scoring the same data set. Is one of them better than the other?
Summary metrics: Log-Loss vs Brier Score

● Same ranking, and therefore the same AUROC, AUPRC, accuracy!
● Log-loss rewards confident correct answers and heavily penalizes confident wrong answers.
● One perfectly confident wrong prediction is fatal -> favors a well-calibrated model.
● Proper scoring rule: minimized when the predicted probabilities equal the true probabilities.
Calibration vs Discriminative Power

  Logistic (th=0.5):          SVC (th=0.5):
    Precision: 0.872            Precision: 0.872
    Recall:    0.851            Recall:    0.852
    F1:        0.862            F1:        0.862
    Brier:     0.099            Brier:     0.163

(Figure: calibration curves (fraction of positives vs. predicted value) and output histograms for the two models.)
Unsupervised Learning

● Log P(x) is a measure of fit in Probabilistic models (GMM, Factor Analysis)

○ High log P(x) on training set, but low log P(x) on test set is a measure of overfitting

○ Raw value of log P(x) hard to interpret in isolation

● K-means is trickier (because of fixed covariance assumption)


Class Imbalance

Symptom: Prevalence < 5% (no strict definition)

Metrics: May not be meaningful.

Learning: May not focus on minority class examples at all

(majority class can overwhelm logistic regression, to a lesser extent SVM)


What happens to the metrics under class imbalance?

Accuracy: Blindly predicts majority class -> prevalence is the baseline.

Log-Loss: Majority class can dominate the loss.

AUROC: Easy to keep AUC high by scoring most negatives very low.

AUPRC: Somewhat more robust than AUROC. But other challenges.

In general: Accuracy < AUROC < AUPRC


Rotated ROC (under class imbalance)

(Figure: a ranking with 1% "fraudulent" positives; a 1% slice of negatives is scored above the positives and the remaining 98% below them, yet AUC = 98/99. Axes: Specificity = True Neg / Neg, Sensitivity = True Pos / Pos.)


Multi-class

● Confusion matrix will be N * N (still want heavy diagonals, light off-diagonals)


● Most metrics (except accuracy) generally analyzed as multiple 1-vs-many
● Multiclass variants of AUROC and AUPRC (micro vs macro averaging)
● Class imbalance is common (both in absolute and relative sense)
● Cost sensitive learning techniques (also helps in binary Imbalance)
○ Assign weights for each block in the confusion matrix.
○ Incorporate weights into the loss function.
Choosing Metrics
Some common patterns:

- High precision is hard constraint, do best recall (search engine results, grammar correction): Intolerant to
FP

- Metric: Recall at Precision = XX %


- High recall is hard constraint, do best precision (medical diagnosis): Intolerant to FN

- Metric: Precision at Recall = 100 %


- Capacity constrained (by K)

- Metric: Precision in top-K.


- ……
