Unit 3
Classifiers-2
Carla P. Gomes
CS4700
History of SVM
SVM is related to statistical learning theory [3]
SVM was first introduced in 1992 [1]
SVM became popular because of its success in handwritten digit recognition [2]:
  – 1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet 4.
  – See Section 5.11 in [2] or the discussion in [3] for details.
SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning.
Note: the meaning of "kernel" here is different from the "kernel" function used for Parzen windows.
[1] B. E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144-152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82, 1994.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.
Linear Classifiers
Estimation: x → f(x, w, b) → y_est
f(x, w, b) = sign(w · x − b)
w: weight vector
x: data vector
(figure: training points in the plane, one class denoting +1 and the other denoting −1, with a candidate separating line)
Linear Classifiers
x → f(x, w, b) → y_est
f(x, w, b) = sign(w · x − b)
(figure: several different separating lines through the same data)
Any of these would be fine...
...but which is best?
Classifier Margin
x → f(x, w, b) → y_est
f(x, w, b) = sign(w · x − b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
x → f(x, w, b) → y_est
f(x, w, b) = sign(w · x − b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called a Linear SVM, or LSVM).
Maximum Margin
x → f(x, w, b) → y_est
f(x, w, b) = sign(w · x − b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
Support vectors are those datapoints that the margin pushes up against.
This is the simplest kind of SVM (called a Linear SVM, or LSVM).
Why Maximum Margin?
f(x, w, b) = sign(w · x − b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
Support vectors are those datapoints that the margin pushes up against.
This is the simplest kind of SVM (called an LSVM).
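As an illustration that is not part of the original slides, here is a minimal sketch of fitting a maximum-margin linear classifier with scikit-learn. The toy data, the large value of C, and all variable names are assumptions made for the example; note that scikit-learn uses the sign(w · x + b) convention.

```python
# Minimal sketch: fit a linear (maximum-margin) SVM on made-up 2-D data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 0.5], [3.0, 1.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0], [7.5, 8.5]])
y = np.array([-1, -1, -1, -1, +1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("decision function: sign(w . x + b) with w =", w, "b =", b)
print("support vectors (the points the margin pushes against):")
print(clf.support_vectors_)
print("geometric margin =", 2.0 / np.linalg.norm(w))
```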
How to Calculate the Distance from a Point to a Line?
Decision boundary: w · x + b = 0
x – data vector
w – normal vector to the line
b – scale (offset) value
The distance from a point x0 to the line w · x + b = 0 is |w · x0 + b| / ||w||.
http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
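A minimal sketch of this computation in Python (not from the slides; the example line and point are made up):

```python
# Sketch: distance from a point x0 to the hyperplane w . x + b = 0.
import numpy as np

def point_to_line_distance(x0, w, b):
    """Return |w . x0 + b| / ||w||."""
    w = np.asarray(w, dtype=float)
    return abs(np.dot(w, x0) + b) / np.linalg.norm(w)

# Example: line 3x + 4y - 10 = 0 and the origin -> distance 10 / 5 = 2.
print(point_to_line_distance([0.0, 0.0], [3.0, 4.0], -10.0))  # 2.0
```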
Large-margin Decision Boundary
The decision boundary should be as far away from the data of both classes as possible.
We should maximize the margin, m.
The distance between the origin and the line wᵀx = −b is |b| / ||w||.
(figure: Class 1 and Class 2 separated by the decision boundary, with margin m)
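For reference, the standard derivation of the margin this slide relies on, assuming the canonical scaling (not stated on the slide) in which the closest points of each class satisfy wᵀx + b = ±1:

```latex
% Margin between the two canonical hyperplanes w^T x + b = +1 and w^T x + b = -1.
\[
  m \;=\; \operatorname{dist}\!\big(\{w^{\top}x + b = 1\},\ \{w^{\top}x + b = -1\}\big)
    \;=\; \frac{|1-(-1)|}{\lVert w\rVert}
    \;=\; \frac{2}{\lVert w\rVert}.
\]
\[
  \text{Maximizing } m \text{ is therefore equivalent to }
  \min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
  \quad\text{s.t.}\quad y_i\,(w^{\top}x_i + b) \ge 1 \ \ \forall i .
\]
```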
Instance-Based Learning
Idea:
– Similar examples have similar labels.
– Classify new examples like similar training examples.
Algorithm:
– Given some new example x for which we need to predict its class y
– Find most similar training examples
– Classify x “like” these most similar examples
Questions:
– How to determine similarity?
– How many similar training examples to consider?
– How to resolve inconsistencies among the training examples?
1-Nearest Neighbor
(figure: a query point whose single nearest training example is red. Label it red.)
Distance Metrics
Different metrics can change the decision surface
1-NN's Aspects as an Instance-Based Learner:
A distance metric
– Euclidean
– When different units are used for each dimension: normalize each dimension by its standard deviation
– For discrete data, can use Hamming distance: D(x1, x2) = number of features on which x1 and x2 differ
– Others (e.g., normal, cosine)
Increase k:
– Makes kNN less sensitive to noise
Decrease k:
– Allows capturing finer structure of the space
Pick k not too large, but not too small (depends on the data); see the sketch below.
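A minimal k-NN sketch illustrating the points above (per-dimension normalization, Euclidean distance, majority vote over k neighbours); the data, labels and choice of k are invented for the example:

```python
# Minimal k-NN sketch: standardize each dimension, then vote over the
# k nearest training examples under Euclidean distance.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Normalize each dimension by its standard deviation (per the slide),
    # so that no single unit of measurement dominates the distance.
    std = X_train.std(axis=0)
    std[std == 0] = 1.0
    Xn, xq = X_train / std, x_query / std

    dists = np.linalg.norm(Xn - xq, axis=1)                 # Euclidean distance
    nearest = np.argsort(dists)[:k]                         # indices of the k nearest
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority vote

# Illustrative data: (height in cm, weight in kg) with made-up labels.
X = np.array([[170.0, 60.0], [180.0, 80.0], [160.0, 50.0], [175.0, 75.0]])
y = np.array(["A", "B", "A", "B"])
print(knn_predict(X, y, np.array([172.0, 65.0]), k=3))
```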
Curse-of-Dimensionality
Remedy
– Try to remove irrelevant attributes in a pre-processing step
– Weight attributes differently
– Increase k (but not too much)
Advantages and Disadvantages of KNN
(figure: class-conditional densities P(x) for classes C1 and C2 along an input dimension x; slide by Stephen Marsland)
Naïve Bayes
• Bayes classification:
  P(C | X) ∝ P(X | C) P(C) = P(X1, …, Xn | C) P(C)
  – MAP rule: assign the class with the largest value
• Example (PlayTennis, query x' = (Sunny, Cool, High, Strong)):
  P(Yes | x'): [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
  P(No | x'):  [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
  Given that P(Yes | x') < P(No | x'), we label x' to be "No".
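For concreteness, a tiny script reproducing the two numbers above. The individual conditional probabilities are the standard PlayTennis estimates (2/9, 3/9, ... for Yes and 3/5, 1/5, ... for No); they are assumed here because only the final products appear on the slide.

```python
# Sketch reproducing the slide's naive Bayes MAP computation for the
# PlayTennis query x' = (Sunny, Cool, High, Strong).
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # P(x'|Yes) * P(Yes)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # P(x'|No)  * P(No)
print(round(p_yes, 4), round(p_no, 4))           # 0.0053 0.0206
print("predict:", "Yes" if p_yes > p_no else "No")   # "No"
```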
Relevant Issues
• Violation of the Independence Assumption
  – For many real-world tasks, P(X1, …, Xn | C) ≠ P(X1 | C) ⋯ P(Xn | C)
  – Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero Conditional Probability Problem
  – If no training example contains the attribute value Xj = ajk, then P̂(Xj = ajk | C = ci) = 0
  – In this circumstance, during test, P̂(x1 | ci) ⋯ P̂(ajk | ci) ⋯ P̂(xn | ci) = 0
  – For a remedy, conditional probabilities are estimated with
      P̂(Xj = ajk | C = ci) = (nc + m·p) / (n + m)
    nc: number of training examples for which Xj = ajk and C = ci
    n:  number of training examples for which C = ci
    p:  prior estimate (usually, p = 1/t for t possible values of Xj)
    m:  weight given to the prior (number of "virtual" examples, m ≥ 1)
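A small sketch of the m-estimate remedy above; the function name and the example counts are illustrative:

```python
# Sketch of the m-estimate used to avoid zero conditional probabilities:
# P(Xj = ajk | C = ci) ~= (nc + m*p) / (n + m).
def m_estimate(nc, n, t, m=1.0):
    """nc: count of examples with Xj = ajk and C = ci
       n : count of examples with C = ci
       t : number of possible values of Xj (so the prior is p = 1/t)
       m : weight given to the prior ("virtual" examples)"""
    p = 1.0 / t
    return (nc + m * p) / (n + m)

# Even when the value was never seen with this class (nc = 0),
# the estimate stays strictly positive:
print(m_estimate(nc=0, n=9, t=3, m=1.0))   # ~0.033 instead of 0
```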
Relevant Issues
• Continuous-valued Input Attributes
  – Numberless (continuous) values for an attribute
  – Conditional probability modeled with the normal distribution:
      P̂(Xj | C = ci) = 1 / (√(2π) σji) · exp( −(Xj − μji)² / (2 σji²) )
    μji: mean (average) of the attribute values Xj of the examples for which C = ci
    σji: standard deviation of the attribute values Xj of the examples for which C = ci
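A one-function sketch of this Gaussian class-conditional density (the names and example values are illustrative):

```python
# Gaussian class-conditional density from the slide:
# P(Xj | C = ci) = 1 / (sqrt(2*pi) * sigma) * exp(-(Xj - mu)^2 / (2 * sigma^2)).
import math

def gaussian_likelihood(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(gaussian_likelihood(1.0, mu=0.0, sigma=1.0))  # ~0.2420
```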
The Single Model Philosophy
• Motivation: Occam’s Razor
– “one should not increase, beyond what is necessary, the number
of entities required to explain anything”
• Infinitely many models can explain any given dataset
– Might as well pick the smallest one…
Which Model is Smaller?
  ŷ = f1(x) = sin(x)
or
  ŷ = f2(x) = x − x³/3! + x⁵/5! − x⁷/7! + …
• In this case the two expressions describe the same function: f1 looks "smaller", even though its series expansion f2 has infinitely many terms.
How Do Support Vector Machines Define Small?
Maximized margin
(figure: the maximum-margin separating boundary)
Approximate Occam’s Razor Models
• Approximate solutions use a greedy search approach which is not
optimal
• Examples
– Kernel Projection Pursuit algorithms
• Find a minimal set of kernel projections
– Relevance Vector Machines
• Approximate Bayesian approach
– Sparse Minimax Probability Machine Classification
• Find a minimum set of kernels and features
Other Single Models: Not Necessarily Motivated by Occam's Razor
• Minimax Probability Machine (MPM)
• Trees
– Greedy approach to sparseness
• Neural Networks
• Nearest Neighbor
• Basis Function Models
– e.g. Kernel Ridge Regression
Ensemble Philosophy
• Build many models and combine them
• Only through averaging do we get at the truth!
• It’s too hard (impossible?) to build a single model that
works best
• Two types of approaches:
– Models that don’t use randomness
– Models that incorporate randomness
Ensemble Approaches
• Bagging
– Bootstrap aggregating
• Boosting
• Random Forests
– Bagging reborn
Bagging
• Main Assumption:
  – Combining many unstable predictors produces an ensemble (stable) predictor.
  – Unstable Predictor: small changes in the training data produce large changes in the model.
    • e.g. Neural Nets, trees
    • Stable: SVM (sometimes), Nearest Neighbor
• Hypothesis Space
  – Variable size (nonparametric):
    • Can model any function if you use an appropriate predictor (e.g. trees)
The Bagging Algorithm
For m = 1 : M
  • Obtain a bootstrap sample Dm from the training data D
  • Build a model Gm(x) from the bootstrap data Dm
The Bagging Model
• Regression: average the models
    ŷ = (1/M) Σ_{m=1}^{M} Gm(x)
• Classification:
  – Vote over the classifier outputs G1(x), …, GM(x)
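A minimal bagging sketch tying the algorithm and the voting rule together; the toy data, M = 30 and the use of scikit-learn decision trees as the unstable base predictor are choices made for the example, not prescribed by the slides:

```python
# Minimal bagging sketch: M bootstrap samples -> M trees -> majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # made-up, easily learnable labels

M, N = 30, len(X)
models = []
for m in range(M):
    idx = rng.integers(0, N, size=N)        # bootstrap sample D_m (with replacement)
    tree = DecisionTreeClassifier()         # full-depth tree = unstable predictor
    tree.fit(X[idx], y[idx])
    models.append(tree)

# Classification: vote over G_1(x), ..., G_M(x)
votes = np.stack([g.predict(X) for g in models])    # shape (M, N)
y_hat = (votes.mean(axis=0) >= 0.5).astype(int)     # majority vote
print("training accuracy of the bagged ensemble:", (y_hat == y).mean())
```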
Bagging Details
• A bootstrap sample of N instances is obtained by drawing N examples at random, with replacement.
• On average each bootstrap sample contains about 63% of the distinct instances
  (the chance an instance is never drawn is (1 − 1/N)^N ≈ e⁻¹ ≈ 0.37).
  – This encourages the predictors to have uncorrelated errors
  – This is why bagging works
Bagging Details 2
• Usually set M ≈ 30
  – Or use validation data to pick M
• The models Gm(x) need to be unstable
  – Usually fully grown (or only slightly pruned) decision trees
Boosting
• Main Assumption:
  – Combining many weak predictors (e.g. tree stumps or 1-R predictors) produces an ensemble predictor
  – The weak predictors or classifiers need to be stable
• Hypothesis Space
  – Variable size (nonparametric):
    • Can model any function if you use an appropriate predictor (e.g. trees)
Commonly Used Weak Predictor (or Classifier)
A Decision Tree Stump (1-R)
Boosting (Continued)
• Each predictor is created by using a biased sample of the
training data
– Instances (training examples) with high error are weighted higher
than those with lower error
• Difficult instances get more attention
– This is the motivation behind boosting
Background Notation
• The indicator function I(s) is defined as:
    I(s) = 1 if s is true, 0 otherwise
• The function log(x) is the natural logarithm
The AdaBoost Algorithm
(Freund and Schapire, 1996)
Given data: D = {(x1, y1), …, (xN, yN)} with yi ∈ {−1, 1}
1. Initialize the weights wi = 1/N, i = 1, …, N
2. For m = 1 : M
   a) Fit a classifier Gm(x) ∈ {−1, 1} to the data using the weights wi
   b) Compute the weighted error
        errm = Σi wi · I(yi ≠ Gm(xi)) / Σi wi
   c) Compute αm = log((1 − errm) / errm)
   d) Update the weights: wi ← wi · exp(αm · I(yi ≠ Gm(xi)))
3. Output the final classifier
        ŷ = sgn[ Σ_{m=1}^{M} αm Gm(x) ]
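A compact sketch of the loop above, using depth-1 trees (stumps) as Gm; the data, M, and the small clipping of errm are illustrative choices, not part of the slide:

```python
# Sketch of AdaBoost as listed above, with decision stumps as G_m.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)        # made-up labels in {-1, +1}

M, N = 50, len(X)
w = np.full(N, 1.0 / N)                           # 1. initialize weights
alphas, stumps = [], []
for m in range(M):                                # 2. for m = 1:M
    G = DecisionTreeClassifier(max_depth=1)
    G.fit(X, y, sample_weight=w)                  # a) fit using weights w_i
    miss = (G.predict(X) != y)
    err = np.sum(w * miss) / np.sum(w)            # b) weighted error err_m
    err = float(np.clip(err, 1e-10, 1 - 1e-10))   # guard against degenerate cases
    alpha = np.log((1 - err) / err)               # c) alpha_m
    w = w * np.exp(alpha * miss)                  # d) re-weight the misclassified points
    alphas.append(alpha)
    stumps.append(G)

# 3. final classifier: sign of the alpha-weighted vote
F = sum(a * G.predict(X) for a, G in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))
```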
The Updates in Boosting
(figures: left, "Alpha for Boosting": αm as a function of errm, falling from about +5 near errm = 0 to about −5 near errm = 1; right, "Re-weighting Factor for Boosting": w · exp(αm) as a function of errm, falling from about 100 toward 0)
Boosting Characteristics
(figure) Simulated data: test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rates for a single stump and for a 400-node tree.
Loss Functions for y ∈ {−1, +1}, f ∈ ℝ
• Misclassification: I(sgn(f) ≠ y)
• Exponential (Boosting): exp(−y·f)
• Binomial Deviance (Cross Entropy): log(1 + exp(−2·y·f))
• Squared Error: (y − f)²
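The same four losses written out as functions of y and f (a sketch; the sample values of f are arbitrary):

```python
# The four losses above, evaluated for a few margins y*f with y in {-1, +1}.
import numpy as np

def misclassification(y, f): return (np.sign(f) != y).astype(float)
def exponential(y, f):       return np.exp(-y * f)                    # boosting
def binomial_deviance(y, f): return np.log(1 + np.exp(-2 * y * f))    # cross entropy
def squared_error(y, f):     return (y - f) ** 2

y, f = 1, np.array([-1.0, 0.0, 1.0, 2.0])
for loss in (misclassification, exponential, binomial_deviance, squared_error):
    print(loss.__name__, loss(y, f))
```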
Gradient Boosting
Boosting Summary
• Good points
– Fast learning
– Capable of learning any function (given appropriate weak learner)
– Feature weighting
– Very little parameter tuning
• Bad points
– Can overfit data
– Only for binary classification
• Learning parameters (picked via cross validation)
– Size of tree
– When to stop
• Software
– http://www-stat.stanford.edu/~jhf/R-MART.html
◾ Idea: Optimize an additive model
  Additive prediction model: ŷi = Σ_{k=1}^{K} fk(xi)
◾ System optimizations:
  Parallel tree construction using a column block structure
  Distributed computing for training very large models using a cluster of machines
  Out-of-core computing for very large datasets that don't fit into memory
Slide credit: Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu
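These bullets describe the design of XGBoost. As an illustration that is not part of the slide, here is a minimal usage sketch with the xgboost Python package; the dataset and every hyperparameter value are placeholders, and option names can vary slightly across xgboost versions.

```python
# Minimal usage sketch of an additive tree model with xgboost.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=500) > 0).astype(int)

# The prediction is an additive model: y_hat_i = sum_k f_k(x_i) over K trees.
model = xgb.XGBClassifier(
    n_estimators=100,      # K, the number of additive trees f_k
    max_depth=4,
    learning_rate=0.1,
    tree_method="hist",    # histogram-based, column-block-friendly tree construction
    n_jobs=4,              # parallel tree construction
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```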
Topics
- The training objective (cost function) is only a proxy for real-world objectives.
- Metrics help capture a business goal as a quantitative target (not all errors are equal).
- They help organize the ML team's effort towards that target,
  generally in the form of improving that metric on the dev set.
- Useful to quantify the "gap" between:
  - Desired performance and baseline (to estimate effort initially).
  - Desired performance and current performance.
  - And to measure progress over time.
- Useful for lower-level tasks and debugging (e.g. diagnosing bias vs. variance).
- Ideally the training objective would be the metric, but that is not always possible. Still, metrics are useful and important for evaluation.
Binary Classification
● x is the input
● y is the binary output (0/1)
● The model is ŷ = h(x)
● Two types of models:
  ○ Models that output a categorical class directly (k-nearest neighbor, decision tree)
  ○ Models that output a real-valued score (SVM, logistic regression)
    ■ The score could be a margin (SVM) or a probability (LR, NN)
    ■ Need to pick a threshold
    ■ We focus on this type (the other type can be treated as a special case)
Score based models
(figure: examples sorted by model score, from Score = 1 at the top to Score = 0 at the bottom, with positive and negative examples interleaved)
Prevalence = (# positive examples) / (# positive examples + # negative examples)
Threshold -> Classifier -> Point Metrics
Picking a threshold (here Th = 0.5) turns the scores into a classifier: examples scoring above the threshold are predicted positive, the rest are predicted negative.
Point metrics: Confusion Matrix
At Th = 0.5 the confusion matrix is:
                      actual positive   actual negative
  predict positive           9                 2
  predict negative           1                 8
Properties:
- The total sum is fixed (the population size).

Point metrics: True Positives
At Th = 0.5: TP = 9
Point metrics: True Negatives
At Th = 0.5: TP = 9, TN = 8

Point metrics: False Positives
At Th = 0.5: TP = 9, TN = 8, FP = 2

Point metrics: False Negatives
At Th = 0.5: TP = 9, TN = 8, FP = 2, FN = 1
FP and FN are also called Type-1 and Type-2 errors.
Point metrics: Accuracy
At Th = 0.5: TP = 9, TN = 8, FP = 2, FN = 1
Accuracy = (TP + TN) / total = (9 + 8) / 20 = 0.85

Point metrics: Precision
Precision = TP / (TP + FP) = 9 / 11 ≈ 0.818

Point metrics: Positive Recall (Sensitivity)
Recall = TP / (TP + FN) = 9 / 10 = 0.9
Trivial 100% recall: pull everybody above the threshold.
Trivial 100% precision: push everybody below the threshold except the single top-scoring positive.

Point metrics: F1-score
F1 = 2 · Precision · Recall / (Precision + Recall) ≈ 0.857
Point metrics: Changing the threshold
Raising the threshold to Th = 0.6 changes the confusion matrix to:
                      actual positive   actual negative
  predict positive           7                 2
  predict negative           3                 8
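A sketch that computes all of the point metrics above from scores and a threshold; the labels and scores are made up:

```python
# Turn scores + a threshold into the point metrics discussed above.
import numpy as np

def point_metrics(y_true, scores, th):
    y_pred = (scores >= th).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    acc  = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec  = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    spec = tn / (tn + fp) if tn + fp else 0.0
    f1   = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return dict(TP=tp, TN=tn, FP=fp, FN=fn, accuracy=acc,
                precision=prec, recall=rec, specificity=spec, f1=f1)

y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
scores = np.array([0.95, 0.9, 0.8, 0.7, 0.65, 0.55, 0.45, 0.4, 0.3, 0.1])
print(point_metrics(y_true, scores, th=0.5))
```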
Threshold Scanning
Scanning the threshold from 1.00 (Score = 1) down to 0.00 (Score = 0):

Threshold  TP  TN  FP  FN  Accuracy  Precision  Recall  Specificity  F1
  1.00      0  10   0  10    0.50      1.000     0.0       1.0      0.000
  0.95      1  10   0   9    0.55      1.000     0.1       1.0      0.182
  0.90      2  10   0   8    0.60      1.000     0.2       1.0      0.333
  0.85      2   9   1   8    0.55      0.667     0.2       0.9      0.308
  0.80      3   9   1   7    0.60      0.750     0.3       0.9      0.429
  0.75      4   9   1   6    0.65      0.800     0.4       0.9      0.533
  0.70      5   9   1   5    0.70      0.833     0.5       0.9      0.625
  0.65      5   8   2   5    0.65      0.714     0.5       0.8      0.588
  0.60      6   8   2   4    0.70      0.750     0.6       0.8      0.667
  0.55      7   8   2   3    0.75      0.778     0.7       0.8      0.737
  0.50      8   8   2   2    0.80      0.800     0.8       0.8      0.800
  0.45      9   8   2   1    0.85      0.818     0.9       0.8      0.857
  0.40      9   7   3   1    0.80      0.750     0.9       0.7      0.818
  0.35      9   6   4   1    0.75      0.692     0.9       0.6      0.783
  0.30      9   5   5   1    0.70      0.643     0.9       0.5      0.750
  0.25      9   4   6   1    0.65      0.600     0.9       0.4      0.720
  0.20      9   3   7   1    0.60      0.562     0.9       0.3      0.692
  0.15      9   2   8   1    0.55      0.529     0.9       0.2      0.667
  0.10      9   1   9   1    0.50      0.500     0.9       0.1      0.643
  0.05     10   1   9   0    0.55      0.526     1.0       0.1      0.690
  0.00     10   0  10   0    0.50      0.500     1.0       0.0      0.667
Summary metrics: Rotated ROC (Sensitivity vs. Specificity)
(figure: positive and negative example score distributions, from Score = 1 down to Score = 0)
Agnostic to prevalence!
Summary metrics: AUPRC = Area Under the PRC
Precision = True Positives / Predicted Positives
AUPRC = the expected precision for a random threshold
(figure: precision-recall curve over scores from Score = 1 down to Score = 0)

(figure: Model A and Model B, each scoring the same examples from Score = 1 down to Score = 0)
Two models scoring the same data set. Is one of them better than the other?
Summary metrics: Log-Loss vs. Brier Score
(figure: Model A and Model B score distributions, from Score = 1 down to Score = 0)
● Same ranking, and therefore the same AUROC, AUPRC, and accuracy!
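A sketch comparing two models with the same ranking but different calibration, using standard scikit-learn metric functions; the labels and scores are invented, and model_b is just a monotone transform of model_a:

```python
# Summary metrics for two models scoring the same data set.
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             log_loss, brier_score_loss)

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
model_a = np.array([0.9, 0.2, 0.8, 0.7, 0.3, 0.4, 0.6, 0.1])
model_b = model_a ** 3        # same ranking as model_a, different calibration

for name, s in [("A", model_a), ("B", model_b)]:
    print(name,
          "AUROC",    round(roc_auc_score(y_true, s), 3),            # identical for A and B
          "AUPRC",    round(average_precision_score(y_true, s), 3),  # identical for A and B
          "log-loss", round(log_loss(y_true, s), 3),                 # differs
          "Brier",    round(brier_score_loss(y_true, s), 3))         # differs
```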
Calibration vs Discriminative Power
SVC (th = 0.5): Precision 0.872, Recall 0.852, F1 0.862, Brier 0.163
(figure: histogram of the model's output scores)
Unsupervised Learning
○ A high log P(x) on the training set but a low log P(x) on the test set is a sign of overfitting
AUROC caveat: it is easy to keep the AUC high by scoring most negatives very low.
(figure: example with 1% "fraudulent" positives; 98% of examples are scored near Score = 0 and AUC = 98/99)
Specificity = True Negatives / Negatives
- When high precision is a hard constraint, optimize recall subject to it (e.g. search-engine results, grammar correction): these settings are intolerant to FP.