Generalization Performance
Classifier Performance
Intuitively, the performance of a classifier (learning algorithm) depends on
The complexity of the classifier (e.g., how many layers and how many neurons per layer)
The training samples (generally, more is better)
Generalization Performance
You can make a classifier perform very well on any training data set
Given enough structural complexity
Given enough training cycles
Generalization Performance (cont.)
First, trying to do better on unseen data by doing better on training data might not work
Because overfitting can be a problem
You can fit the training data arbitrarily well, but that says nothing about what the classifier will do on data it has not seen
Example: curve fitting
Intuitively
Meaningful associations usually imply
Simplicity (capacity)
The association function should be simple
More generally, capacity measures how much representational capability a classifier possesses
Repeatability (stability)
The association function should not change drastically when different training data sets are used to derive it (its expected change over different data sets is about zero)
Example: the average salary of a Ph.D. is higher than that of a high-school dropout, a simple and repeatable relation (not sensitive to the particular training data set)
Generalization Performance (cont.)
So does that mean we should always prefer
simplicity?
Occam’s Razor: nature prefers simplicity
Explanations should not be multiplied beyond
necessity
In practice, this amounts to a bias or preference over the forms and parameters of a classifier
No free lunch theorem
Under very general assumptions, one should not prefer one classifier (or learning algorithm) over another in terms of generalization performance
Why?
Because given certain training data, there is (in general) no telling how unseen data will behave
Example
[Figure: example target function]
Example (cont.)
However, in reality we also expect learning (training) to be effective
given a large number of representative samples and a target function that is smooth
How do we reconcile?
So the choice of a classifier or a training algorithm depends on our preconceived notion of the relevant target functions
In that sense, Occam's Razor and the no free lunch theorem can co-exist
Make it More Concrete
Let's assume that there are two classes {+1, -1} and n samples (x_1, y_1), ..., (x_n, y_n)
The classifier is f(x, α) -> y (α: tunable parameters of f, e.g., a hyperplane)
Loss (error) on a sample: $\frac{1}{2}|y_i - f(x_i, \alpha)|$
Empirical error: $e_{emp}(\alpha) = \frac{1}{2n}\sum_{i=1}^{n}|y_i - f(x_i, \alpha)|$
Expected error: $e(\alpha) = \int \frac{1}{2}|y - f(x, \alpha)|\, dP(x, y)$
Issues with the Two Risks
How do we estimate how good a classifier is based on a particular run of the experiment?
How best to compute e_emp from samples
How do we estimate e from e_emp?
Practically (with some particular data sets), and
Theoretically (with an upper bound)
Answers
Computing e_emp
There are statistical resampling techniques that give a better estimate of a classifier's performance
Estimating e from e_emp
Theoretically
There is an upper bound on e given e_emp
Practically
Given a particular set (or sets) of training data, a procedure exists to estimate e from e_emp
Practical Issues
OK, so philosophically nothing is better
than anything else
In reality, we have to choose classifiers and
learning algorithms
Run some classification experiments in
supervised mode (with labeled data)
What should we look for in a “good”
classifier or learning algorithm?
Assumption
Use the RBF (interpolation) interpretation
Interpolate a function from the given data points
True function F(x)
Interpolation function g(x; D)
Bias and Variance
$E_D[(g(\mathbf{x};D) - F(\mathbf{x}))^2] = (E_D[g(\mathbf{x};D) - F(\mathbf{x})])^2 + E_D[(g(\mathbf{x};D) - E_D[g(\mathbf{x};D)])^2]$
= bias^2 + variance
Derivation (writing g for g(x; D) and F for F(x)):
$E_D[(g - F)^2] = E_D[g^2] - 2E_D[g]F + F^2$
$= E_D[g^2] - (E_D[g])^2 + (E_D[g])^2 - 2E_D[g]F + F^2$
$= E_D[(g - E_D[g])^2] + (E_D[g] - F)^2$
Bias and Variance
$E_D[(g(\mathbf{x};D) - F(\mathbf{x}))^2] = (E_D[g(\mathbf{x};D) - F(\mathbf{x})])^2 + E_D[(g(\mathbf{x};D) - E_D[g(\mathbf{x};D)])^2]$
= bias^2 + variance
Bias measures accuracy
How well the classifier conforms to reality
High bias implies a poor match
Variance measures the precision (or specificity) of a match
How well classifiers trained on different data sets conform to one another
High variance implies a weak match
(A numerical sketch of this decomposition is given below.)
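To make the decomposition concrete, here is a minimal numerical sketch (the target function, noise level, and polynomial-fit setup are illustrative assumptions, not from the slides): fit polynomials of two different degrees to many independently sampled training sets D and estimate bias^2 and variance of g(x; D) at a single test point.

```python
import numpy as np

rng = np.random.default_rng(0)

def F(x):                            # assumed "true" target function, for illustration only
    return np.sin(2 * np.pi * x)

def sample_dataset(n=20, noise=0.2):
    x = rng.uniform(0, 1, n)
    y = F(x) + rng.normal(0, noise, n)
    return x, y

def fit_and_predict(degree, x_test, trials=500):
    """Fit a degree-d polynomial to many sampled data sets D; return g(x_test; D) per trial."""
    preds = np.empty(trials)
    for t in range(trials):
        x, y = sample_dataset()
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x_test)
    return preds

x0 = 0.3
for degree in (1, 9):
    g = fit_and_predict(degree, x0)
    bias2 = (g.mean() - F(x0)) ** 2      # (E_D[g] - F)^2
    var = g.var()                        # E_D[(g - E_D[g])^2]
    mse = np.mean((g - F(x0)) ** 2)      # E_D[(g - F)^2] ~= bias2 + var
    print(f"degree {degree}: bias^2={bias2:.4f}  var={var:.4f}  mse={mse:.4f}")
```

The low-degree fit tends to show larger bias^2 and smaller variance; the high-degree fit shows the opposite, matching the tradeoff discussed on the following slides.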
Graphic Interpretation
[Figure: graphical illustration of bias and variance]
Tradeoff
Intuitively, increasing the flexibility (more parameters)
Gives a better fit (low bias)
Produces a harder-to-control fit (high variance)
Bias and Variance in Fitting
With more parameters, a better fit (low bias) is possible, at the expense that the parameter values may vary widely across different sampled data sets (high variance)
Bias and Variance in Classifiers
Simple models
capture aggregate class properties, which are usually more stable (hence low variance)
However, they miss fine details and give a poor fit (hence high bias)
Complicated models
capture aggregate class properties as well as fine variations (hence low bias)
However, the fine details depend on the particular samples used (hence high variance)
Curse of Dimensionality
What happens to bias and variance when the dimensionality increases?
It depends
If F(X) depends on all dimensions of X
Bias is likely to go up for a nearest-neighbor classifier, because neighbors will be farther away from a data point, making faithful interpolation harder
If F(X) depends on only some dimensions of X
Variance is likely to go up for a nearest-neighbor classifier, because the spread along the used dimensions might go up
Practical Issues (cont.)
So you choose a labeled data set and test your classifier
But that is just one particular data set
How do you know how it will do on other labeled data sets?
How do you estimate bias and variance? (How do you know the particular relation is stable and repeatable?)
We do not really know F for other data sets
We have only one data set, not an ensemble
How do you extrapolate (from e_emp to e)?
How do you improve bias and variance?
Example: Estimation - Jackknife
Perform many “leave-one-out” estimations
E.g., to estimate mean and variance
Traditional estimates of the mean and variance:
$\hat{u} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{u})^2$
Leave-one-out estimates:
$u_{(i)} = \frac{1}{n-1}\sum_{j \ne i} x_j = \frac{n\hat{u} - x_i}{n-1}, \qquad u_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n} u_{(i)}$
Jackknife estimate of the variance of the mean:
$\mathrm{Var}_{jack}[\hat{u}] = \frac{n-1}{n}\sum_{i=1}^{n}(u_{(i)} - u_{(\cdot)})^2 = \frac{n-1}{n}\sum_{i=1}^{n}\left(\frac{n\hat{u} - x_i}{n-1} - \hat{u}\right)^2 = \frac{1}{n(n-1)}\sum_{i=1}^{n}(x_i - \hat{u})^2$
which is the familiar estimate of the variance of the sample mean.
General Jackknife Estimation
Similar to mean and variance estimation
Perform many leave-one-out estimations of the parameter θ:
$\hat{\theta}_{(i)} = \theta(x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$
$\hat{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\hat{\theta}_{(i)}$ (e.g., θ can be the hyperplane parameters)
$\mathrm{Var}_{jack}[\hat{\theta}] = \frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)}\right)^2$
Bias and Variance of
General Jackknife Estimation
$\mathrm{Bias}[\hat{\theta}] = E[\hat{\theta}] - \theta$
$\mathrm{bias}_{jack} = (n-1)(\hat{\theta}_{(\cdot)} - \hat{\theta})$
$\mathrm{Var}[\hat{\theta}] = E[(\hat{\theta} - E[\hat{\theta}])^2]$
$\mathrm{Var}_{jack}[\hat{\theta}] = \frac{n-1}{n}\sum_{i=1}^{n}(\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)})^2$
Example: Mode Estimate
n = 6
D = {0, 10, 10, 10, 20, 20}, so $\hat{\theta} = \mathrm{mode}(D) = 10$
$\hat{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\hat{\theta}_{(i)} = \frac{1}{6}\{10 + 15 + 15 + 15 + 10 + 10\} = 12.5$
$\mathrm{bias}_{jack} = (n-1)(\hat{\theta}_{(\cdot)} - \hat{\theta}) = 5(12.5 - 10) = 12.5$
$\mathrm{Var}_{jack}[\hat{\theta}] = \frac{n-1}{n}\sum_{i=1}^{n}(\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)})^2 = \frac{5}{6}\{3(10 - 12.5)^2 + 3(15 - 12.5)^2\} = 31.25$
Mode rule used: (1) the most common element; (2) if two values are equally common, the midpoint between the two peaks (see the sketch below)
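The jackknife recipe above can be reproduced in a few lines; the sketch below applies it to the mode example, implementing the midpoint convention stated on the slide (the code itself, including function names, is an illustrative assumption):

```python
import numpy as np
from collections import Counter

def mode(xs):
    """Most common element; if two values tie for the top count, return their midpoint."""
    counts = Counter(xs).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return (counts[0][0] + counts[1][0]) / 2
    return counts[0][0]

def jackknife(data, statistic):
    """Leave-one-out estimates plus jackknife bias and variance of the statistic (data: a list)."""
    n = len(data)
    theta_hat = statistic(data)
    theta_i = np.array([statistic(data[:i] + data[i+1:]) for i in range(n)])
    theta_dot = theta_i.mean()
    bias = (n - 1) * (theta_dot - theta_hat)
    var = (n - 1) / n * np.sum((theta_i - theta_dot) ** 2)
    return theta_hat, theta_dot, bias, var

D = [0, 10, 10, 10, 20, 20]
print(jackknife(D, mode))   # expected: theta_hat=10, theta_(.)=12.5, bias=12.5, var=31.25
```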
Estimation - Bootstrap
Perform many resampled estimations (n samples drawn out of n, with replacement)
There are n^n possible bootstrap samples
E.g., two samples (x1, x2) generate 2^2 = 4 bootstrap sets: (x1, x1), (x1, x2), (x2, x1), (x2, x2)
$\hat{\theta}^{*(\cdot)} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{*(b)}$
$\mathrm{bias}_{boot} = \hat{\theta}^{*(\cdot)} - \hat{\theta} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{*(b)} - \hat{\theta}$
$\mathrm{Var}_{boot}[\hat{\theta}] = \frac{1}{B}\sum_{b=1}^{B}\left[\hat{\theta}^{*(b)} - \hat{\theta}^{*(\cdot)}\right]^2$
(A code sketch is given below.)
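A corresponding bootstrap sketch (again illustrative, here applied to the sample mean of the same small data set):

```python
import numpy as np

def bootstrap(data, statistic, B=2000, seed=0):
    """B resamples of size n drawn with replacement; bootstrap bias and variance of the statistic."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    theta_hat = statistic(data)
    theta_b = np.array([statistic(rng.choice(data, size=n, replace=True)) for _ in range(B)])
    theta_dot = theta_b.mean()                      # theta^{*(.)}
    bias = theta_dot - theta_hat                    # bias_boot
    var = np.mean((theta_b - theta_dot) ** 2)       # var_boot
    return bias, var

print(bootstrap([0, 10, 10, 10, 20, 20], np.mean))
```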
The Question Remained
From data sets we can estimate e_emp
We desire e
They are related by
$e = e_{emp} \pm c\sqrt{\frac{e_{emp}(1 - e_{emp})}{n}}$
n: sample size; c: critical value for the chosen confidence level
Example: n = 40, with 28 correct and 12 incorrect classifications
e_emp = 12/40 = 0.3
With a 95% confidence interval, c = 1.96:
$e = 0.3 \pm 1.96\sqrt{\frac{0.3 \times 0.7}{40}} = 0.3 \pm 0.14$
(See the sketch below.)
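The arithmetic of this example as a tiny sketch (the function name is illustrative):

```python
import math

def error_interval(n_wrong, n, c=1.96):
    """Approximate confidence interval for the true error e given the empirical error."""
    e_emp = n_wrong / n
    half_width = c * math.sqrt(e_emp * (1 - e_emp) / n)
    return e_emp, half_width

e_emp, hw = error_interval(12, 40)        # 0.3 +/- 0.14 at ~95% confidence
print(f"e ~ {e_emp:.2f} +/- {hw:.2f}")
```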
What?
The formula is valid if
the hypothesis is discrete-valued
the samples are drawn independently
Comparison
Coin tossing:
Repeat an experiment many times
Each time, toss a coin n times to see if it lands on heads (h) or tails (t), h + t = n
A coin has an (unknown) probability p of landing on heads
Classification:
Use a classifier on many sets of data
For each data set, the classifier gets h wrong and t correct out of n samples, h + t = n
The classifier has an (unknown, but fixed) probability p of classifying a sample incorrectly
Comparison (cont.)
Coin tossing:
The result of each coin toss is a random variable
The number of heads in n tosses is also a random variable
Repeated experiments give outcomes following a binomial distribution
Classification:
The result of classifying each sample is a random variable
The number of incorrect labels in n samples is also a random variable
Repeated classification runs give outcomes following a binomial distribution
Binomial Distributions
$P(h) = \frac{n!}{h!(n-h)!} p^h (1-p)^{n-h}$
n: sample size; h: number of heads; p: head probability
[Figure: plot of P(h/n) versus h/n]
Binomial Distribution
Then it can be shown that
$P(h) = \frac{n!}{h!(n-h)!} p^h (1-p)^{n-h}$
$E(h) = np$
$\mathrm{Var}(h) = np(1-p)$
$\sigma_h = \sqrt{np(1-p)}$
Estimators
Coin tossing: an estimator for p is (number of heads)/n. This estimator is unbiased because
$E\left(\frac{\#\,\mathrm{heads}}{n}\right) = \frac{np}{n} = p$
Classification: an estimator for e (i.e., p) is e_emp. This estimator is unbiased because
$E\left(\frac{\#\,\mathrm{errors}}{n}\right) = \frac{np}{n} = p$
Confidence Interval
An N% confidence interval for some
parameter p is an interval that is expected
with probability N% to contain p
For a binomial distribution, this can be approximated using the normal distribution:
$e = e_{emp} \pm c\sqrt{\frac{e_{emp}(1 - e_{emp})}{n}}$
n: sample size; c: critical value for the N% confidence level
Capacity (Simplicity)
We have just discussed the repeatability
issue
The assumption is that the classification error is
the same for training and new data
The misclassification rate is drawn from the
same population
True for simple classifiers, but not for
complicated classifiers
The other issue is simplicity (or more
generally the capacity) of the classifier
General Theoretical Bound
The sample size
The larger the sample size, the more confident
we should be about “we have seen enough”
The complexity of the classifier
The simpler the classifier, the more confident
we should be about “we have observed enough”
Put differently, a complex classifier can do weird things when you are not looking
VC Dimension
Vapnik-Chervonenkis dimension
Defined for a class of functions f(α)
The maximum number of points that can be shattered by functions of the class
Shatter: given, say, n points, there are 2^n ways to label them {+1, -1}; the points are shattered if, for every labeling, an f(α) can be found that assigns those labels correctly
E.g., three points (in general position) can be shattered by a line, but four points cannot
A linear function in n-dimensional space has VC dimension n + 1
A measure of the "capacity" of a classifier
Generalization
It can be shown that, with probability 1 - η,
$e(\alpha) \le e_{emp}(\alpha) + \sqrt{\frac{h(\ln(2n/h) + 1) - \ln(\eta/4)}{n}}$
where h is the VC dimension, n the sample size, and η the confidence parameter.
Capacity Interpretation
$e(\alpha) \le e_{emp}(\alpha) + \sqrt{\frac{h(\ln(2n/h) + 1) - \ln(\eta/4)}{n}}$
A simple classifier
has a low VC dimension and a small second term
has a high empirical error and a large first term
A complicated classifier
has a high VC dimension and a large second term
has a low empirical error and a small first term
Generalization Performance
Hence, a classifier should be chosen to give
the lowest bound
However, the bound is often not tight; it can easily exceed 1, making it useless
It is only useful when the VC dimension is small relative to the sample size
Upper Bound
With a 95% confidence level and 10,000 samples, the capacity (second) term of the bound already exceeds 1 once h/n > 0.37, so the bound becomes vacuous (see the sketch below)
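A sketch of that calculation, assuming the bound from the previous slides with natural logarithms and η = 0.05 (the exact constants behind the original figure are not shown here, so treat the numbers as illustrative):

```python
import math

def vc_confidence_term(h, n, eta=0.05):
    """Second (capacity) term of the VC generalization bound."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

n = 10_000
for ratio in (0.05, 0.1, 0.2, 0.37, 0.5):
    h = int(ratio * n)
    print(f"h/n = {ratio:.2f}: capacity term = {vc_confidence_term(h, n):.2f}")
# The term already reaches 1 around h/n ~ 0.37, so e <= e_emp + term tells us nothing there.
```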
Ensemble Classifiers
Combine simple classifiers by majority vote
Famous examples: bagging and boosting
Why it works:
Might reduce error (bias)
Reduces variance
Reduce Bias
If each classifier makes an error with probability, say, 30%
How likely is it that a committee of n such classifiers makes a mistake under majority rule?
Answer: it follows from the binomial distribution of the number of wrong votes (see the sketch below)
Big IF: the classifiers must err independently
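A quick check of this claim under the stated independence assumption (a sketch; the error rate 0.3 and the committee sizes are illustrative):

```python
from math import comb

def majority_error(eps, n):
    """P(majority of n independent classifiers is wrong), each wrong with probability eps (n odd)."""
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k) for k in range((n // 2) + 1, n + 1))

for n in (1, 3, 5, 11, 21):
    print(n, round(majority_error(0.3, n), 4))
# 1 -> 0.3, 3 -> 0.216, 5 -> 0.1631, 11 -> 0.0782, 21 -> 0.0264
```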
Improvement - Averaging
Each machine reaches a (different) local minimum
Combine their outputs by majority vote
Improvement - Bagging
Bootstrap AGGregation
A simple "parallel processing" model using multiple component classifiers
Helps with the stability problem
Different (bootstrap-resampled) training data sets
Different starting points
Majority vote
(A sketch is given below.)
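A minimal bagging sketch (illustrative assumptions: decision stumps as component classifiers and a synthetic two-blob data set, neither of which the slide prescribes):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_stump(X, y):
    """Pick the (feature, threshold, sign) stump with the lowest training error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] >= thr, sign, -sign)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] >= thr, sign, -sign)

def bagging(X, y, n_classifiers=25):
    """Train each component on a bootstrap resample; predict by majority vote."""
    n = len(y)
    stumps = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, n, n)          # bootstrap sample (n out of n, with replacement)
        stumps.append(train_stump(X[idx], y[idx]))
    def predict(Xq):
        votes = np.sum([stump_predict(s, Xq) for s in stumps], axis=0)
        return np.where(votes >= 0, 1, -1)
    return predict

# Toy data: two Gaussian blobs labeled +1 / -1
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([1] * 100 + [-1] * 100)
predict = bagging(X, y)
print("training accuracy:", np.mean(predict(X) == y))
```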
Why Bagging Works?
It can reduce bias and variance
Bias: not necessarily improved
$E_D[(g(\mathbf{x};D) - F(\mathbf{x}))^2] = (E_D[g(\mathbf{x};D) - F(\mathbf{x})])^2 + E_D[(g(\mathbf{x};D) - E_D[g(\mathbf{x};D)])^2]$
With the bagged predictor $g = \frac{1}{n}\sum_i g_i$:
$E_D[g(\mathbf{x};D) - F(\mathbf{x})] = \frac{1}{n}\sum_i \left(E_D[g_i(\mathbf{x};D)] - F(\mathbf{x})\right) = \frac{1}{n}\sum_i \{\text{bias of constituent } i\}$
= the average bias of the constituents
Why Bagging Works?
It can reduce bias and variance
Variance: reduced to 1/n of a constituent's variance, IF all constituents are independent
$E_D[(g(\mathbf{x};D) - F(\mathbf{x}))^2] = (E_D[g(\mathbf{x};D) - F(\mathbf{x})])^2 + E_D[(g(\mathbf{x};D) - E_D[g(\mathbf{x};D)])^2]$
$E_D[(g(\mathbf{x};D) - E_D[g(\mathbf{x};D)])^2] = E_D\left[\left(\frac{1}{n}\sum_i g_i(\mathbf{x};D) - E_D\left[\frac{1}{n}\sum_i g_i(\mathbf{x};D)\right]\right)^2\right]$
$= \frac{1}{n^2} E_D\left[\left(\sum_i \left(g_i(\mathbf{x};D) - E_D[g_i(\mathbf{x};D)]\right)\right)^2\right]$
If the constituents are independent, the cross terms vanish and this becomes
$= \frac{1}{n^2}\sum_i E_D\left[(g_i(\mathbf{x};D) - E_D[g_i(\mathbf{x};D)])^2\right] = \frac{1}{n}\{\text{average variance of the constituents}\}$
Boosting by Filtering
Bagging is a competitive (parallel) model, while boosting is a collaborative model
Component classifiers
should be introduced when needed
should then be trained on the ambiguous samples
Boosting by Filtering (cont.)
[Figure: component classifiers c1, c2, c3 and their combination]
Boosting by Filtering (cont.)
D1:
a subset of D (drawn without replacement) used to train c1
D2:
Heads: a sample from D - D1 that c1 misclassifies
Tails: a sample from D - D1 that c1 classifies correctly
D2 is thus half correct / half wrong with respect to c1; c2 learns what c1 has difficulty with
D3:
samples from D - (D1 + D2) on which c1 and c2 disagree
c3 learns what the previous two cannot agree on
(A sketch of this filtering is given below.)
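A sketch of how D2 and D3 can be filtered from the remaining pool, assuming c1 and c2 are already trained and exposed as prediction functions (all names are illustrative, and drawing randomly from the wrong/right pools is a simplification of the slide's "next sample" filtering):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_d2(c1_predict, X_pool, y_pool, size):
    """Coin flip per needed sample: heads -> a sample c1 gets wrong, tails -> one it gets right."""
    wrong = np.flatnonzero(c1_predict(X_pool) != y_pool).tolist()
    right = np.flatnonzero(c1_predict(X_pool) == y_pool).tolist()
    chosen = []
    while len(chosen) < size and (wrong or right):
        want_wrong = rng.random() < 0.5
        pool = wrong if (want_wrong and wrong) or not right else right
        chosen.append(pool.pop(int(rng.integers(len(pool)))))
    return chosen            # indices into X_pool, roughly half misclassified by c1

def build_d3(c1_predict, c2_predict, X_pool, y_pool):
    """Keep only the remaining samples on which c1 and c2 disagree."""
    disagree = c1_predict(X_pool) != c2_predict(X_pool)
    return np.flatnonzero(disagree)
```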
Boosting by Filtering (cont.)
If each committee machine has an error rate of ε, then the combined (majority-vote) machine has an error rate of 3ε² - 2ε³
AdaBoost –Adaptive Boosting
The basic idea is very simple
Add more component classifiers if the error is higher than a preset threshold
Samples are weighted: if a sample is accurately classified by the combined component classifiers, its chance of being picked for training the new classifier is reduced
AdaBoost therefore focuses on difficult patterns
Adaboost Algorithm
Initialization
$D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, number of rounds $k_{max}$, weights $W_1(i) = \frac{1}{n}$, $i = 1, \ldots, n$
Procedure
for $k = 1, \ldots, k_{max}$:
train weak learner $C_k$ using $D$ sampled according to $W_k(i)$
$E_k = \sum_{i:\, h_k(x_i) \ne y_i} W_k(i)$ (weighted error rate of $C_k$)
$\alpha_k = \frac{1}{2}\ln\left[\frac{1 - E_k}{E_k}\right]$ (weighting of $C_k$)
$W_{k+1}(i) = \frac{W_k(i)}{Z_k} \times \begin{cases} e^{-\alpha_k} & \text{if } h_k(x_i) = y_i \\ e^{\alpha_k} & \text{if } h_k(x_i) \ne y_i \end{cases}$ ($Z_k$: normalization factor)
return $C_k$ and $\alpha_k$ for $k = 1, \ldots, k_{max}$
Final hypothesis
$H(x) = \mathrm{sign}\left(\sum_{k=1}^{k_{max}} \alpha_k h_k(x)\right)$
(A code sketch follows.)
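A compact sketch of the procedure above (illustrative assumptions: decision stumps as weak learners, and training on reweighted data rather than resampling according to W_k, a common variant the slide does not fix):

```python
import numpy as np

def train_weighted_stump(X, y, w):
    """Weak learner: the decision stump minimizing weighted training error (labels in {+1, -1})."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] >= thr, sign, -sign)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, (j, thr, sign))
    return best                                   # (weighted error E_k, stump parameters)

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] >= thr, sign, -sign)

def adaboost(X, y, k_max=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # W_1(i) = 1/n
    learners, alphas = [], []
    for _ in range(k_max):
        E_k, stump = train_weighted_stump(X, y, w)
        E_k = max(E_k, 1e-12)                     # guard against a perfect stump
        alpha = 0.5 * np.log((1 - E_k) / E_k)     # weighting of C_k
        pred = stump_predict(stump, X)
        w *= np.exp(-alpha * y * pred)            # e^{-alpha} if correct, e^{+alpha} if wrong
        w /= w.sum()                              # Z_k normalization
        learners.append(stump)
        alphas.append(alpha)
    def H(Xq):                                    # final hypothesis: sign of the weighted vote
        scores = sum(a * stump_predict(s, Xq) for a, s in zip(alphas, learners))
        return np.where(scores >= 0, 1, -1)
    return H

# Usage (toy): H = adaboost(X_train, y_train); accuracy = np.mean(H(X_test) == y_test)
```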
Comparison
Multilayer Perceptron:
Many neurons
Trained together
Hard to train
Same data set
Requires nonlinearity in the thresholding
Complicated decision boundary
Overfitting likely
Boosting:
Many weak classifiers
Trained individually
Easy to train
Fine-tuned data sets
Requires different data sets
Simple decision boundaries
Less susceptible to overfitting
Adaboost Algorithm (cont.)
As long as each individual weak learner has better-than-chance performance, AdaBoost can boost the combined performance arbitrarily well (Freund and Schapire, JCSS 1997)
Applications
AdaBoost has been used successfully in many applications
A famous one is its use in face detection from images
Viola and Jones
Fast Feature Computation
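The slide's figure is omitted here. The key device in Viola and Jones is the integral image, which lets the sum over any rectangle, and hence any rectangular feature, be computed with a handful of lookups; below is a minimal sketch (function names are illustrative):

```python
import numpy as np

def integral_image(img):
    """ii(r, c) = sum of all pixels above and to the left of (r, c), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] from four integral-image lookups."""
    total = ii[r1, c1]
    if r0 > 0:
        total -= ii[r0 - 1, c1]
    if c0 > 0:
        total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

img = np.arange(25, dtype=float).reshape(5, 5)
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 3, 3), img[1:4, 1:4].sum())   # the two values agree
```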
Classifier
AdaBoost idea: greedy feature (classifier) selection
Weak learner (h): selects a single rectangular feature and a threshold,
h(x) = 1 if p·f(x) < p·θ, and 0 otherwise
f: feature value
θ: threshold
p: polarity (sign of the inequality)
(A sketch is given below.)
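A sketch of such a weak learner on a single feature value, following the h(x) = 1 if p·f(x) < p·θ form above (the brute-force threshold search is an illustrative simplification):

```python
import numpy as np

def weak_classifier(f, theta, p):
    """h(x) = 1 if p * f(x) < p * theta else 0, for an array of rectangular-feature values f."""
    return (p * f < p * theta).astype(int)

def best_threshold(f_values, labels, weights):
    """Greedy search over candidate thresholds and both polarities for the lowest weighted error.
    labels are 0/1 (non-face/face); weights are the current AdaBoost sample weights."""
    best = (np.inf, None, None)
    for theta in np.unique(f_values):
        for p in (1, -1):
            pred = weak_classifier(f_values, theta, p)
            err = np.sum(weights[pred != labels])
            if err < best[0]:
                best = (err, theta, p)
    return best   # (weighted error, theta, p)
```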
Algorithm
Efficiency
Use an attentional cascade: early stages quickly reject most negative sub-windows, so the more complex later stages are evaluated on only a few candidates
Results
[Figures: face detection results omitted]
Learning with Queries
Given a weak classifier, or several weak component classifiers
Find out where the ambiguity is
where the weak classifier gives high readings for the top two discriminant functions (e.g., in a linear machine)
where the component classifiers yield the greatest disagreement
Train the classifiers with those ambiguous samples (see the sketch below)
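A sketch of query selection by committee disagreement (illustrative; the slide does not fix a particular scoring rule beyond "greatest disagreement"):

```python
import numpy as np

def query_by_disagreement(committee_predict, X_unlabeled, n_queries=10):
    """committee_predict: list of functions, each mapping X -> predicted labels.
    Return indices of the unlabeled samples on which the committee disagrees most."""
    votes = np.stack([predict(X_unlabeled) for predict in committee_predict])  # (members, samples)
    def disagreement(col):
        _, counts = np.unique(col, return_counts=True)
        return 1.0 - counts.max() / len(col)       # 0 = unanimous, larger = more disagreement
    scores = np.apply_along_axis(disagreement, 0, votes)
    return np.argsort(scores)[::-1][:n_queries]    # most ambiguous samples first

# The returned samples would then be labeled and used to retrain the classifiers.
```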