Unit 3
Classifiers-2
Carla P. Gomes
CS4700
History of SVM
SVM is related to statistical learning theory [3]
SVM was first introduced in 1992 [1]
SVM became popular because of its success in handwritten digit recognition [2]:
  – 1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet 4.
  – See Section 5.11 in [2] or the discussion in [3] for details.
SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning.
Note: the meaning of "kernel" here is different from the "kernel" function used for Parzen windows.
[1] B. E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144-152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82, 1994.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.
Linear Classifiers
Estimation: x → f(x, w, b) → y_est
f(x, w, b) = sign(w · x − b)
w: weight vector
x: data vector
(figure: training points in the plane, one class denoting +1 and the other denoting −1, with a candidate separating line)
Linear Classifiers
x → f(x, w, b) → y_est
f(x, w, b) = sign(w · x − b)
(figure: several different separating lines through the same data)
Any of these would be fine...
...but which is best?
Classifier Margin
x → f(x, w, b) → y_est
f(x, w, b) = sign(w · x − b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
x → f(x, w, b) → y_est
f(x, w, b) = sign(w · x − b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called a Linear SVM, or LSVM).
Maximum Margin
x → f(x, w, b) → y_est
f(x, w, b) = sign(w · x − b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
Support vectors are those datapoints that the margin pushes up against.
This is the simplest kind of SVM (called a Linear SVM, or LSVM).
Why Maximum Margin?
f(x, w, b) = sign(w · x − b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
Support vectors are those datapoints that the margin pushes up against.
This is the simplest kind of SVM (called an LSVM).
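As an illustration that is not part of the original slides, here is a minimal sketch of fitting a maximum-margin linear classifier with scikit-learn. The toy data, the large value of C, and all variable names are assumptions made for the example; note that scikit-learn uses the sign(w · x + b) convention.

```python
# Minimal sketch: fit a linear (maximum-margin) SVM on made-up 2-D data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 0.5], [3.0, 1.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0], [7.5, 8.5]])
y = np.array([-1, -1, -1, -1, +1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("decision function: sign(w . x + b) with w =", w, "b =", b)
print("support vectors (the points the margin pushes against):")
print(clf.support_vectors_)
print("geometric margin =", 2.0 / np.linalg.norm(w))
```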
How to Calculate the Distance from a Point to a Line?
Decision boundary: w · x + b = 0
x – data vector
w – normal vector to the line
b – scale (offset) value
The distance from a point x0 to the line w · x + b = 0 is |w · x0 + b| / ||w||.
http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
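A minimal sketch of this computation in Python (not from the slides; the example line and point are made up):

```python
# Sketch: distance from a point x0 to the hyperplane w . x + b = 0.
import numpy as np

def point_to_line_distance(x0, w, b):
    """Return |w . x0 + b| / ||w||."""
    w = np.asarray(w, dtype=float)
    return abs(np.dot(w, x0) + b) / np.linalg.norm(w)

# Example: line 3x + 4y - 10 = 0 and the origin -> distance 10 / 5 = 2.
print(point_to_line_distance([0.0, 0.0], [3.0, 4.0], -10.0))  # 2.0
```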
Large-margin Decision Boundary
The decision boundary should be as far away from the data of both classes as possible.
We should maximize the margin, m.
The distance between the origin and the line wᵀx = −b is |b| / ||w||.
(figure: Class 1 and Class 2 separated by the decision boundary, with margin m)
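For reference, the standard derivation of the margin this slide relies on, assuming the canonical scaling (not stated on the slide) in which the closest points of each class satisfy wᵀx + b = ±1:

```latex
% Margin between the two canonical hyperplanes w^T x + b = +1 and w^T x + b = -1.
\[
  m \;=\; \operatorname{dist}\!\big(\{w^{\top}x + b = 1\},\ \{w^{\top}x + b = -1\}\big)
    \;=\; \frac{|1-(-1)|}{\lVert w\rVert}
    \;=\; \frac{2}{\lVert w\rVert}.
\]
\[
  \text{Maximizing } m \text{ is therefore equivalent to }
  \min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
  \quad\text{s.t.}\quad y_i\,(w^{\top}x_i + b) \ge 1 \ \ \forall i .
\]
```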
Instance-Based Learning
Idea:
– Similar examples have similar labels.
– Classify new examples like similar training examples.
Algorithm:
– Given some new example x for which we need to predict its class y
– Find most similar training examples
– Classify x “like” these most similar examples
Questions:
– How to determine similarity?
– How many similar training examples to consider?
– How to resolve inconsistencies among the training examples?
1-Nearest Neighbor
(figure: a query point whose single nearest training example is red. Label it red.)
Distance Metrics
Different metrics can change the decision surface
1-NN's Aspects as an Instance-Based Learner:
A distance metric
– Euclidean
– When different units are used for each dimension: normalize each dimension by its standard deviation
– For discrete data, can use Hamming distance: D(x1, x2) = number of features on which x1 and x2 differ
– Others (e.g., normal, cosine)
Increase k:
– Makes kNN less sensitive to noise
Decrease k:
– Allows capturing finer structure of the space
Pick k not too large, but not too small (depends on the data); see the sketch below.
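A minimal k-NN sketch illustrating the points above (per-dimension normalization, Euclidean distance, majority vote over k neighbours); the data, labels and choice of k are invented for the example:

```python
# Minimal k-NN sketch: standardize each dimension, then vote over the
# k nearest training examples under Euclidean distance.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Normalize each dimension by its standard deviation (per the slide),
    # so that no single unit of measurement dominates the distance.
    std = X_train.std(axis=0)
    std[std == 0] = 1.0
    Xn, xq = X_train / std, x_query / std

    dists = np.linalg.norm(Xn - xq, axis=1)                 # Euclidean distance
    nearest = np.argsort(dists)[:k]                         # indices of the k nearest
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority vote

# Illustrative data: (height in cm, weight in kg) with made-up labels.
X = np.array([[170.0, 60.0], [180.0, 80.0], [160.0, 50.0], [175.0, 75.0]])
y = np.array(["A", "B", "A", "B"])
print(knn_predict(X, y, np.array([172.0, 65.0]), k=3))
```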
Curse-of-Dimensionality
Remedy
– Try to remove irrelevant attributes in a pre-processing step
– Weight attributes differently
– Increase k (but not too much)
Advantages and Disadvantages of KNN
(figure: class-conditional densities P(x) for classes C1 and C2 along an input dimension x; slide by Stephen Marsland)
Naïve Bayes
• Bayes classification:
  P(C | X) ∝ P(X | C) P(C) = P(X1, …, Xn | C) P(C)
  – MAP rule: assign the class with the largest value
• Example (PlayTennis, query x' = (Sunny, Cool, High, Strong)):
  P(Yes | x'): [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
  P(No | x'):  [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
  Given that P(Yes | x') < P(No | x'), we label x' to be "No".
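For concreteness, a tiny script reproducing the two numbers above. The individual conditional probabilities are the standard PlayTennis estimates (2/9, 3/9, ... for Yes and 3/5, 1/5, ... for No); they are assumed here because only the final products appear on the slide.

```python
# Sketch reproducing the slide's naive Bayes MAP computation for the
# PlayTennis query x' = (Sunny, Cool, High, Strong).
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # P(x'|Yes) * P(Yes)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # P(x'|No)  * P(No)
print(round(p_yes, 4), round(p_no, 4))           # 0.0053 0.0206
print("predict:", "Yes" if p_yes > p_no else "No")   # "No"
```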
Relevant Issues
• Violation of the Independence Assumption
  – For many real-world tasks, P(X1, …, Xn | C) ≠ P(X1 | C) ⋯ P(Xn | C)
  – Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero Conditional Probability Problem
  – If no training example contains the attribute value Xj = ajk, then P̂(Xj = ajk | C = ci) = 0
  – In this circumstance, during test, P̂(x1 | ci) ⋯ P̂(ajk | ci) ⋯ P̂(xn | ci) = 0
  – For a remedy, conditional probabilities are estimated with
      P̂(Xj = ajk | C = ci) = (nc + m·p) / (n + m)
    nc: number of training examples for which Xj = ajk and C = ci
    n:  number of training examples for which C = ci
    p:  prior estimate (usually, p = 1/t for t possible values of Xj)
    m:  weight given to the prior (number of "virtual" examples, m ≥ 1)
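A small sketch of the m-estimate remedy above; the function name and the example counts are illustrative:

```python
# Sketch of the m-estimate used to avoid zero conditional probabilities:
# P(Xj = ajk | C = ci) ~= (nc + m*p) / (n + m).
def m_estimate(nc, n, t, m=1.0):
    """nc: count of examples with Xj = ajk and C = ci
       n : count of examples with C = ci
       t : number of possible values of Xj (so the prior is p = 1/t)
       m : weight given to the prior ("virtual" examples)"""
    p = 1.0 / t
    return (nc + m * p) / (n + m)

# Even when the value was never seen with this class (nc = 0),
# the estimate stays strictly positive:
print(m_estimate(nc=0, n=9, t=3, m=1.0))   # ~0.033 instead of 0
```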
Relevant Issues
• Continuous-valued Input Attributes
  – Numberless (continuous) values for an attribute
  – Conditional probability modeled with the normal distribution:
      P̂(Xj | C = ci) = 1 / (√(2π) σji) · exp( −(Xj − μji)² / (2 σji²) )
    μji: mean (average) of the attribute values Xj of the examples for which C = ci
    σji: standard deviation of the attribute values Xj of the examples for which C = ci
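A one-function sketch of this Gaussian class-conditional density (the names and example values are illustrative):

```python
# Gaussian class-conditional density from the slide:
# P(Xj | C = ci) = 1 / (sqrt(2*pi) * sigma) * exp(-(Xj - mu)^2 / (2 * sigma^2)).
import math

def gaussian_likelihood(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(gaussian_likelihood(1.0, mu=0.0, sigma=1.0))  # ~0.2420
```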
The Single Model Philosophy
• Motivation: Occam’s Razor
– “one should not increase, beyond what is necessary, the number
of entities required to explain anything”
• Infinitely many models can explain any given dataset
– Might as well pick the smallest one…
Which Model is Smaller?
  ŷ = f1(x) = sin(x)
or
  ŷ = f2(x) = x − x³/3! + x⁵/5! − x⁷/7! + …
• In this case the two expressions describe the same function: f1 looks "smaller", even though its series expansion f2 has infinitely many terms.
How Do Support Vector Machines Define Small?
Maximized margin
(figure: the maximum-margin separating boundary)
Approximate Occam’s Razor Models
• Approximate solutions use a greedy search approach which is not
optimal
• Examples
– Kernel Projection Pursuit algorithms
• Find a minimal set of kernel projections
– Relevance Vector Machines
• Approximate Bayesian approach
– Sparse Minimax Probability Machine Classification
• Find a minimum set of kernels and features
Other Single Models: Not Necessarily Motivated by Occam's Razor
• Minimax Probability Machine (MPM)
• Trees
– Greedy approach to sparseness
• Neural Networks
• Nearest Neighbor
• Basis Function Models
– e.g. Kernel Ridge Regression
Ensemble Philosophy
• Build many models and combine them
• Only through averaging do we get at the truth!
• It’s too hard (impossible?) to build a single model that
works best
• Two types of approaches:
– Models that don’t use randomness
– Models that incorporate randomness
Ensemble Approaches
• Bagging
– Bootstrap aggregating
• Boosting
• Random Forests
– Bagging reborn
Bagging
• Main Assumption:
  – Combining many unstable predictors produces an ensemble (stable) predictor.
  – Unstable Predictor: small changes in the training data produce large changes in the model.
    • e.g. Neural Nets, trees
    • Stable: SVM (sometimes), Nearest Neighbor
• Hypothesis Space
  – Variable size (nonparametric):
    • Can model any function if you use an appropriate predictor (e.g. trees)
The Bagging Algorithm
For m = 1 : M
  • Obtain a bootstrap sample Dm from the training data D
  • Build a model Gm(x) from the bootstrap data Dm
The Bagging Model
• Regression: average the models
    ŷ = (1/M) Σ_{m=1}^{M} Gm(x)
• Classification:
  – Vote over the classifier outputs G1(x), …, GM(x)
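A minimal bagging sketch tying the algorithm and the voting rule together; the toy data, M = 30 and the use of scikit-learn decision trees as the unstable base predictor are choices made for the example, not prescribed by the slides:

```python
# Minimal bagging sketch: M bootstrap samples -> M trees -> majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # made-up, easily learnable labels

M, N = 30, len(X)
models = []
for m in range(M):
    idx = rng.integers(0, N, size=N)        # bootstrap sample D_m (with replacement)
    tree = DecisionTreeClassifier()         # full-depth tree = unstable predictor
    tree.fit(X[idx], y[idx])
    models.append(tree)

# Classification: vote over G_1(x), ..., G_M(x)
votes = np.stack([g.predict(X) for g in models])    # shape (M, N)
y_hat = (votes.mean(axis=0) >= 0.5).astype(int)     # majority vote
print("training accuracy of the bagged ensemble:", (y_hat == y).mean())
```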
Bagging Details
• A bootstrap sample of N instances is obtained by drawing N examples at random, with replacement.
• On average each bootstrap sample contains about 63% of the distinct instances
  (the chance an instance is never drawn is (1 − 1/N)^N ≈ e⁻¹ ≈ 0.37).
  – This encourages the predictors to have uncorrelated errors
  – This is why bagging works
Bagging Details 2
• Usually set M ≈ 30
  – Or use validation data to pick M
• The models Gm(x) need to be unstable
  – Usually fully grown (or only slightly pruned) decision trees
Boosting
• Main Assumption:
  – Combining many weak predictors (e.g. tree stumps or 1-R predictors) produces an ensemble predictor
  – The weak predictors or classifiers need to be stable
• Hypothesis Space
  – Variable size (nonparametric):
    • Can model any function if you use an appropriate predictor (e.g. trees)
Commonly Used Weak Predictor (or Classifier)
A Decision Tree Stump (1-R)
Boosting (Continued)
• Each predictor is created by using a biased sample of the
training data
– Instances (training examples) with high error are weighted higher
than those with lower error
• Difficult instances get more attention
– This is the motivation behind boosting
Background Notation
• The indicator function I(s) is defined as:
    I(s) = 1 if s is true, 0 otherwise
• The function log(x) is the natural logarithm
The AdaBoost Algorithm
(Freund and Schapire, 1996)
Given data: D = {(x1, y1), …, (xN, yN)} with yi ∈ {−1, 1}
1. Initialize the weights wi = 1/N, i = 1, …, N
2. For m = 1 : M
   a) Fit a classifier Gm(x) ∈ {−1, 1} to the data using the weights wi
   b) Compute the weighted error
        errm = Σi wi · I(yi ≠ Gm(xi)) / Σi wi
   c) Compute αm = log((1 − errm) / errm)
   d) Update the weights: wi ← wi · exp(αm · I(yi ≠ Gm(xi)))
3. Output the final classifier
        ŷ = sgn[ Σ_{m=1}^{M} αm Gm(x) ]
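A compact sketch of the loop above, using depth-1 trees (stumps) as Gm; the data, M, and the small clipping of errm are illustrative choices, not part of the slide:

```python
# Sketch of AdaBoost as listed above, with decision stumps as G_m.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)        # made-up labels in {-1, +1}

M, N = 50, len(X)
w = np.full(N, 1.0 / N)                           # 1. initialize weights
alphas, stumps = [], []
for m in range(M):                                # 2. for m = 1:M
    G = DecisionTreeClassifier(max_depth=1)
    G.fit(X, y, sample_weight=w)                  # a) fit using weights w_i
    miss = (G.predict(X) != y)
    err = np.sum(w * miss) / np.sum(w)            # b) weighted error err_m
    err = float(np.clip(err, 1e-10, 1 - 1e-10))   # guard against degenerate cases
    alpha = np.log((1 - err) / err)               # c) alpha_m
    w = w * np.exp(alpha * miss)                  # d) re-weight the misclassified points
    alphas.append(alpha)
    stumps.append(G)

# 3. final classifier: sign of the alpha-weighted vote
F = sum(a * G.predict(X) for a, G in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))
```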
The Updates in Boosting
(figures: left, "Alpha for Boosting": αm as a function of errm, falling from about +5 near errm = 0 to about −5 near errm = 1; right, "Re-weighting Factor for Boosting": w · exp(αm) as a function of errm, falling from about 100 toward 0)
Boosting Characteristics
(figure) Simulated data: test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rates for a single stump and for a 400-node tree.
Loss Functions for y ∈ {−1, +1}, f ∈ ℝ
• Misclassification: I(sgn(f) ≠ y)
• Exponential (Boosting): exp(−y·f)
• Binomial Deviance (Cross Entropy): log(1 + exp(−2·y·f))
• Squared Error: (y − f)²
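The same four losses written out as functions of y and f (a sketch; the sample values of f are arbitrary):

```python
# The four losses above, evaluated for a few margins y*f with y in {-1, +1}.
import numpy as np

def misclassification(y, f): return (np.sign(f) != y).astype(float)
def exponential(y, f):       return np.exp(-y * f)                    # boosting
def binomial_deviance(y, f): return np.log(1 + np.exp(-2 * y * f))    # cross entropy
def squared_error(y, f):     return (y - f) ** 2

y, f = 1, np.array([-1.0, 0.0, 1.0, 2.0])
for loss in (misclassification, exponential, binomial_deviance, squared_error):
    print(loss.__name__, loss(y, f))
```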
Gradient Boosting
Boosting Summary
• Good points
– Fast learning
– Capable of learning any function (given appropriate weak learner)
– Feature weighting
– Very little parameter tuning
• Bad points
– Can overfit data
– Only for binary classification
• Learning parameters (picked via cross validation)
– Size of tree
– When to stop
• Software
– http://www-stat.stanford.edu/~jhf/R-MART.html
◾ Idea: Optimize an additive model
  Additive prediction model: ŷi = Σ_{k=1}^{K} fk(xi)
◾ System optimizations:
  Parallel tree construction using a column block structure
  Distributed computing for training very large models using a cluster of machines
  Out-of-core computing for very large datasets that don't fit into memory
Slide credit: Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu
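These bullets describe the design of XGBoost. As an illustration that is not part of the slide, here is a minimal usage sketch with the xgboost Python package; the dataset and every hyperparameter value are placeholders, and option names can vary slightly across xgboost versions.

```python
# Minimal usage sketch of an additive tree model with xgboost.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=500) > 0).astype(int)

# The prediction is an additive model: y_hat_i = sum_k f_k(x_i) over K trees.
model = xgb.XGBClassifier(
    n_estimators=100,      # K, the number of additive trees f_k
    max_depth=4,
    learning_rate=0.1,
    tree_method="hist",    # histogram-based, column-block-friendly tree construction
    n_jobs=4,              # parallel tree construction
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```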
Topics
- The training objective (cost function) is only a proxy for real-world objectives.
- Metrics help capture a business goal as a quantitative target (not all errors are equal).
- They help organize the ML team's effort towards that target,
  generally in the form of improving that metric on the dev set.
- Useful to quantify the "gap" between:
  - Desired performance and baseline (to estimate effort initially).
  - Desired performance and current performance.
  - And to measure progress over time.
- Useful for lower-level tasks and debugging (e.g. diagnosing bias vs. variance).
- Ideally the training objective would be the metric, but that is not always possible. Still, metrics are useful and important for evaluation.
Binary Classification
● x is the input
● y is the binary output (0/1)
● The model is ŷ = h(x)
● Two types of models:
  ○ Models that output a categorical class directly (k-nearest neighbor, decision tree)
  ○ Models that output a real-valued score (SVM, logistic regression)
    ■ The score could be a margin (SVM) or a probability (LR, NN)
    ■ Need to pick a threshold
    ■ We focus on this type (the other type can be treated as a special case)
Score based models
(figure: examples sorted by model score, from Score = 1 at the top to Score = 0 at the bottom, with positive and negative examples interleaved)
Prevalence = (# positive examples) / (# positive examples + # negative examples)
Threshold -> Classifier -> Point Metrics
Picking a threshold (here Th = 0.5) turns the scores into a classifier: examples scoring above the threshold are predicted positive, the rest are predicted negative.
Point metrics: Confusion Matrix
At Th = 0.5 the confusion matrix is:
                      actual positive   actual negative
  predict positive           9                 2
  predict negative           1                 8
Properties:
- The total sum is fixed (the population size).

Point metrics: True Positives
At Th = 0.5: TP = 9
Point metrics: True Negatives
At Th = 0.5: TP = 9, TN = 8

Point metrics: False Positives
At Th = 0.5: TP = 9, TN = 8, FP = 2

Point metrics: False Negatives
At Th = 0.5: TP = 9, TN = 8, FP = 2, FN = 1
FP and FN are also called Type-1 and Type-2 errors.
Point metrics: Accuracy
At Th = 0.5: TP = 9, TN = 8, FP = 2, FN = 1
Accuracy = (TP + TN) / total = (9 + 8) / 20 = 0.85

Point metrics: Precision
Precision = TP / (TP + FP) = 9 / 11 ≈ 0.818

Point metrics: Positive Recall (Sensitivity)
Recall = TP / (TP + FN) = 9 / 10 = 0.9
Trivial 100% recall: pull everybody above the threshold.
Trivial 100% precision: push everybody below the threshold except the single top-scoring positive.

Point metrics: F1-score
F1 = 2 · Precision · Recall / (Precision + Recall) ≈ 0.857
Point metrics: Changing the threshold
Raising the threshold to Th = 0.6 changes the confusion matrix to:
                      actual positive   actual negative
  predict positive           7                 2
  predict negative           3                 8
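A sketch that computes all of the point metrics above from scores and a threshold; the labels and scores are made up:

```python
# Turn scores + a threshold into the point metrics discussed above.
import numpy as np

def point_metrics(y_true, scores, th):
    y_pred = (scores >= th).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    acc  = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec  = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    spec = tn / (tn + fp) if tn + fp else 0.0
    f1   = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return dict(TP=tp, TN=tn, FP=fp, FN=fn, accuracy=acc,
                precision=prec, recall=rec, specificity=spec, f1=f1)

y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
scores = np.array([0.95, 0.9, 0.8, 0.7, 0.65, 0.55, 0.45, 0.4, 0.3, 0.1])
print(point_metrics(y_true, scores, th=0.5))
```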
Threshold Scanning
Scanning the threshold from 1.00 (Score = 1) down to 0.00 (Score = 0):

Threshold  TP  TN  FP  FN  Accuracy  Precision  Recall  Specificity  F1
  1.00      0  10   0  10    0.50      1.000     0.0       1.0      0.000
  0.95      1  10   0   9    0.55      1.000     0.1       1.0      0.182
  0.90      2  10   0   8    0.60      1.000     0.2       1.0      0.333
  0.85      2   9   1   8    0.55      0.667     0.2       0.9      0.308
  0.80      3   9   1   7    0.60      0.750     0.3       0.9      0.429
  0.75      4   9   1   6    0.65      0.800     0.4       0.9      0.533
  0.70      5   9   1   5    0.70      0.833     0.5       0.9      0.625
  0.65      5   8   2   5    0.65      0.714     0.5       0.8      0.588
  0.60      6   8   2   4    0.70      0.750     0.6       0.8      0.667
  0.55      7   8   2   3    0.75      0.778     0.7       0.8      0.737
  0.50      8   8   2   2    0.80      0.800     0.8       0.8      0.800
  0.45      9   8   2   1    0.85      0.818     0.9       0.8      0.857
  0.40      9   7   3   1    0.80      0.750     0.9       0.7      0.818
  0.35      9   6   4   1    0.75      0.692     0.9       0.6      0.783
  0.30      9   5   5   1    0.70      0.643     0.9       0.5      0.750
  0.25      9   4   6   1    0.65      0.600     0.9       0.4      0.720
  0.20      9   3   7   1    0.60      0.562     0.9       0.3      0.692
  0.15      9   2   8   1    0.55      0.529     0.9       0.2      0.667
  0.10      9   1   9   1    0.50      0.500     0.9       0.1      0.643
  0.05     10   1   9   0    0.55      0.526     1.0       0.1      0.690
  0.00     10   0  10   0    0.50      0.500     1.0       0.0      0.667
Summary metrics: Rotated ROC (Sensitivity vs. Specificity)
(figure: positive and negative example score distributions, from Score = 1 down to Score = 0)
Agnostic to prevalence!
Summary metrics: AUPRC = Area Under the PRC
Precision = True Positives / Predicted Positives
AUPRC = the expected precision for a random threshold
(figure: precision-recall curve over scores from Score = 1 down to Score = 0)

(figure: Model A and Model B, each scoring the same examples from Score = 1 down to Score = 0)
Two models scoring the same data set. Is one of them better than the other?
Summary metrics: Log-Loss vs. Brier Score
(figure: Model A and Model B score distributions, from Score = 1 down to Score = 0)
● Same ranking, and therefore the same AUROC, AUPRC, and accuracy!
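A sketch comparing two models with the same ranking but different calibration, using standard scikit-learn metric functions; the labels and scores are invented, and model_b is just a monotone transform of model_a:

```python
# Summary metrics for two models scoring the same data set.
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             log_loss, brier_score_loss)

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
model_a = np.array([0.9, 0.2, 0.8, 0.7, 0.3, 0.4, 0.6, 0.1])
model_b = model_a ** 3        # same ranking as model_a, different calibration

for name, s in [("A", model_a), ("B", model_b)]:
    print(name,
          "AUROC",    round(roc_auc_score(y_true, s), 3),            # identical for A and B
          "AUPRC",    round(average_precision_score(y_true, s), 3),  # identical for A and B
          "log-loss", round(log_loss(y_true, s), 3),                 # differs
          "Brier",    round(brier_score_loss(y_true, s), 3))         # differs
```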
Calibration vs Discriminative Power
SVC (th = 0.5): Precision 0.872, Recall 0.852, F1 0.862, Brier 0.163
(figure: histogram of the model's output scores)
Unsupervised Learning
○ A high log P(x) on the training set but a low log P(x) on the test set is a sign of overfitting
AUROC caveat: it is easy to keep the AUC high by scoring most negatives very low.
(figure: example with 1% "fraudulent" positives; 98% of examples are scored near Score = 0 and AUC = 98/99)
Specificity = True Negatives / Negatives
- When high precision is a hard constraint, optimize recall subject to it (e.g. search-engine results, grammar correction): these settings are intolerant to FP.