
Performance and Generalization
Classifier Performance
• Intuitively, the performance of classifiers (learning algorithms) depends on
  – Complexity of the classifier (e.g., how many layers and how many neurons per layer)
  – Training samples (generally, more is better)
  – Training procedure (e.g., how many searches/epochs are allowed)
  – Etc.

PR , ANN, & ML 2
Generalization Performance
• You can make a classifier perform very well on any training data set
  – Given enough structural complexity
  – Given enough training cycles
• But how does it do on a validation (unseen) data set?
  – In other words, what is its generalization performance?

PR , ANN, & ML 3
Generalization Performance (cont.)
• First, trying to do better on unseen data by doing better on training data might not work
  – Because overfitting can be a problem
  – You can fit the training data arbitrarily well, but that predicts nothing about what the classifier will do on data it has not seen
  – Example: curve fitting
• Using a large network or a complicated classifier does not necessarily lead to good generalization (it almost always leads to good training results)
PR , ANN, & ML 4
Generalization Performance (cont.)
• In fact, some relations must exist in a data set even when the data set is made of random numbers
  – Example: given n people, each
    – Has a credit card
    – Has a phone
  – The credit-card/phone-number association is captured exactly by a degree-(n−1) polynomial (sketched in code after this slide)
  – But can you extrapolate (predict the credit-card/phone-number association for other people)?
  – This is a problem of overfitting

PR , ANN, & ML 5
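The pitfall on the slide above can be made concrete in a few lines. This is a minimal sketch, not part of the original deck: the numeric encoding of the two attributes, the sample sizes, and the random seed are assumptions chosen for illustration. A degree-(n−1) polynomial reproduces n random training pairs essentially exactly, yet says nothing useful about new pairs.

```python
# Sketch only: "memorizing" random data with a degree-(n-1) polynomial.
import numpy as np

rng = np.random.default_rng(0)
n = 6
x = np.sort(rng.uniform(0, 1, n))       # pretend these encode phone numbers
y = rng.uniform(0, 1, n)                # pretend these encode credit-card numbers

coeffs = np.polyfit(x, y, deg=n - 1)    # a degree-(n-1) polynomial fits all n points
train_err = np.max(np.abs(np.polyval(coeffs, x) - y))

x_new = rng.uniform(0, 1, 5)            # five "unseen" people
y_new = rng.uniform(0, 1, 5)
test_err = np.max(np.abs(np.polyval(coeffs, x_new) - y_new))

print(f"training error ~ {train_err:.1e}, test error ~ {test_err:.2f}")
# Training error is numerically zero; test error is no better than guessing, often worse.
```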
Intuitively
• Meaningful associations usually imply
  – Simplicity (capacity)
    – The association function should be simple
    – More generally, we need to determine how much capacity a classifier possesses
  – Repeatability (stability)
    – The association function should not change drastically when different training data sets are used to derive it, i.e., its variation over different data sets averages out (E(Δf) ≈ 0)
    – Example: the average salary of a Ph.D. is higher than that of a high-school dropout – a simple and repeatable relation (not sensitive to the particular training data set)
PR , ANN, & ML 6
Generalization Performance (cont.)
• So does that mean we should always prefer simplicity?
  – Occam's Razor: nature prefers simplicity
    – Explanations should not be multiplied beyond necessity
• Sometimes it is simply a bias or preference over the forms and parameters of a classifier

PR , ANN, & ML 7
No free lunch theorem
• Under very general assumptions, one should not prefer one classifier (or learning algorithm) over another on the basis of generalization performance
• Why?
  – Because, given certain training data, there is no telling (in general) how unseen data will behave

PR , ANN, & ML 8
Example
• Training data might not provide any information about F(x) on unseen patterns
• There are multiple (2^5) target functions consistent with the n = 3 training patterns
• Each inversion of F on the unseen patterns makes one hypothesis look good and the other bad

               x     F    h1   h2
  training     000   1    1    1
               001  -1   -1   -1
               010   1    1    1
  unseen       011  -1    1   -1
  (2^5         100   1    1   -1
  combina-     101  -1    1   -1
  tions)       110   1    1   -1
               111   1    1   -1

PR , ANN, & ML 9
Example (cont.)
• However, in reality we also expect that learning (training) is effective
  – given a large number of representative samples, and
  – a target function that is smooth
• Similarly, the sampling theorem says that extrapolation and reconstruction are possible under certain conditions

PR , ANN, & ML 10
How do we reconcile?
• The choice of a classifier or a training algorithm depends on our preconceived notion of the relevant target functions
• In that sense, Occam's Razor and the no free lunch theorem can coexist

PR , ANN, & ML 11
Make it More Concrete
• Let's assume that there are two classes {+1, -1} and n samples (x1, y1), ..., (xn, yn)
• The classifier is f(x, α) -> y (α: tunable parameters of f, e.g., a hyperplane)
• The loss (error) on a sample is (1/2) |y − f(x, α)|
• Then there are two types of errors (risks)
  – Empirical error:  e_emp(α) = (1/(2n)) Σ_{i=1}^{n} |y_i − f(x_i, α)|
  – Expected error:   e(α) = ∫ (1/2) |y − f(x, α)| dP(x, y)

PR , ANN, & ML 12
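A small illustration of the empirical risk defined above. This is a sketch, not from the deck; the toy linear classifier and the data are assumptions for demonstration. The 0-1 loss (1/2)|y − f(x, α)| is simply averaged over the training samples.

```python
# Sketch only: empirical error e_emp for labels in {+1, -1}.
import numpy as np

def empirical_error(f, X, y, alpha):
    """e_emp(alpha) = (1/(2n)) * sum_i |y_i - f(x_i, alpha)|."""
    preds = np.array([f(x, alpha) for x in X])       # each prediction is +1 or -1
    return np.sum(np.abs(y - preds)) / (2 * len(y))  # |y - f| is 0 or 2 per sample

def linear_classifier(x, alpha):                     # hypothetical hyperplane classifier
    w, b = alpha
    return 1 if np.dot(w, x) + b >= 0 else -1

X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [-1.0, -1.0], [-0.5, 2.0]])
y = np.array([1, 1, 1, -1, -1])
print(empirical_error(linear_classifier, X, y, (np.array([1.0, 1.0]), 0.0)))  # 0.2
```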
Issues with the Two Risks
• How do we estimate how good a classifier is based on a particular run of the experiment?
  – How best to compute e_emp from samples?
• How do we estimate e from e_emp?
  – Practically (with some particular data sets), and
  – Theoretically (with an upper bound)

PR , ANN, & ML 13
Answers
• Computing e_emp
  – There are statistical resampling techniques that give better estimates of a classifier's performance
• Estimating e from e_emp
  – Theoretically
    – There is an upper bound on e given e_emp
  – Practically
    – Given a particular set (or sets) of training data, a procedure exists to estimate e from e_emp

PR , ANN, & ML 14
Practical Issues
• OK, so philosophically nothing is better than anything else
• In reality, we have to choose classifiers and learning algorithms
• Run some classification experiments in supervised mode (with labeled data)
• What should we look for in a "good" classifier or learning algorithm?

PR , ANN, & ML 15
Assumption
• Use the RBF (interpolation) interpretation
  – Interpolating a function from given data points
  – True function: F(x)
  – Interpolation function: g(x; D)
    – D is the given data set

PR , ANN, & ML 16
Bias and Variance
E_D[(g(x; D) − F(x))²] = (E_D[g(x; D) − F(x)])² + E_D[(g(x; D) − E_D[g(x; D)])²]
                       =          bias²          +            variance

Derivation (writing g for g(x; D) and F for F(x)):
E_D[(g − F)²] = E_D[g²] − 2 E_D[g] F + F²
              = E_D[g²] − 2 E_D[g] F + F² + (E_D[g])² − 2 (E_D[g])² + (E_D[g])²
              = ( (E_D[g])² − 2 E_D[g] F + F² ) + ( E_D[g²] − 2 E_D[g] E_D[g] + (E_D[g])² )
              = (E_D[g − F])² + E_D[(g − E_D[g])²]

PR , ANN, & ML 17
Bias and Variance
E_D[(g(x; D) − F(x))²] = (E_D[g(x; D) − F(x)])² + E_D[(g(x; D) − E_D[g(x; D)])²]
                       =          bias²          +            variance
• Bias – measures accuracy
  – How well the classifiers conform to reality
  – High bias implies a poor match
• Variance – measures the precision or specificity of a match
  – How well classifiers trained on different data sets conform to one another
  – High variance implies a weak match

PR , ANN, & ML 18
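The decomposition above can be estimated numerically by resampling training sets. The following is a sketch under assumed settings (a sine target, cubic polynomial fits, Gaussian noise, a single test point); it is illustrative and not part of the deck.

```python
# Sketch only: Monte Carlo estimate of bias^2 and variance of g(x; D) at one
# test point x0, by drawing many training sets D.
import numpy as np

rng = np.random.default_rng(1)
F = np.sin                                  # true function F(x)
x0, degree, n_train, n_sets = 1.0, 3, 20, 500

preds = []
for _ in range(n_sets):
    x = rng.uniform(0, np.pi, n_train)      # a fresh training set D
    y = F(x) + rng.normal(0, 0.1, n_train)  # noisy observations
    g = np.poly1d(np.polyfit(x, y, degree)) # learned model g(.; D)
    preds.append(g(x0))

preds = np.array(preds)
bias_sq = (preds.mean() - F(x0)) ** 2       # (E_D[g(x0;D) - F(x0)])^2
variance = preds.var()                      # E_D[(g(x0;D) - E_D[g(x0;D)])^2]
print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}")
```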
Graphic Interpretation

PR , ANN, & ML 19
Tradeoff
• Intuitively, increasing the flexibility (more parameters)
  – Gives a better fit (low bias)
  – Produces a fit that is harder to control (high variance)
• The bias-variance tradeoff is a lot like precision-recall: you cannot have the best of both

PR , ANN, & ML 20
Bias and Variance in Fitting
• With more parameters, a better fit (low bias) is possible, at the expense that parameter values may vary widely given different sampled data (high variance)

(Figure panels: a, b – fixed linear models; c – learned cubic model; d – learned linear model)

PR , ANN, & ML 21
Bias and Variance in Classifiers
• Simple models
  – Capture aggregate class properties, which are usually more stable (hence low variance)
  – However, they miss fine details and give a poor fit (hence high bias)
• Complicated models
  – Capture aggregate class properties and fine variations (hence low bias)
  – However, the fine details depend on the samples used (hence high variance)

PR , ANN, & ML 22
Curse of Dimensionality
• What happens to bias and variance when the dimensionality increases?
• It depends
  – If F(X) depends on all dimensions of X
    – Bias is likely to go up for a nearest-neighbor classifier, because neighbors will be too far away from a data point for faithful interpolation
  – If F(X) depends on only some dimensions of X
    – Variance is likely to go up for a nearest-neighbor classifier, because the spread along the used dimensions might go up

PR , ANN, & ML 23
Practical Issues (cont.)
• So you choose a labeled data set and test your classifier
• But that is just one particular data set
  – How do you know how the classifier will do on other labeled data sets? Or,
  – How do you estimate bias and variance? (How do you know the particular relation is stable and repeatable?)
    – We do not really know F for other data sets
    – We have only one data set, not an ensemble
  – How do you extrapolate (from e_emp to e)?
  – How do you improve bias and variance?

PR , ANN, & ML 24
Example: Estimation - Jackknife
• Perform many "leave-one-out" estimations
  – E.g., to estimate the mean and variance

Traditional:
  û = (1/n) Σ_{i=1}^{n} x_i
  σ̂² = (1/n) Σ_{i=1}^{n} (x_i − û)²
  (This does not generalize to other statistics, such as the median and mode.)

Leave-one-out:
  u_(i) = (1/(n−1)) Σ_{j≠i} x_j,    u_(.) = (1/n) Σ_{i=1}^{n} u_(i)
  Var[û] = ((n−1)/n) Σ_{i=1}^{n} (u_(i) − u_(.))²
  (This is applicable to other statistics, such as the median and mode.)
PR , ANN, & ML 25
Estimation – Jackknife (cont.)
• Jackknife estimation is defined as a function of the leave-one-out results
• It enables mean and variance computation from a single data set

u_(.) = (1/n) Σ_i u_(i) = (1/n) Σ_i (n û − x_i)/(n − 1) = û

Var[û] = ((n−1)/n) Σ_i (u_(i) − u_(.))²
       = ((n−1)/n) Σ_i ( (n û − x_i)/(n − 1) − û )²
       = (1/(n(n−1))) Σ_i (x_i − û)²
       = ( Σ_i x_i² − n û² ) / (n(n−1))

PR , ANN, & ML 26
General Jackknife Estimation
• Similar to the mean and variance estimation
• Perform many leave-one-out estimations of the parameter θ

θ_(i) = θ(x_1, x_2, ..., x_{i−1}, x_{i+1}, ..., x_n)
θ_(.) = (1/n) Σ_{i=1}^{n} θ_(i)          (e.g., θ can be the hyperplane equation)
var[θ] = ((n−1)/n) Σ_{i=1}^{n} (θ_(i) − θ_(.))²

PR , ANN, & ML 27
Bias and Variance of General Jackknife Estimation
bias_θ = θ − E(θ̂)
bias_jackknife = (n − 1)(θ_(.) − θ̂)

Var[θ̂] = E[(θ̂ − E(θ̂))²]
Var_jackknife[θ̂] = ((n−1)/n) Σ_{i=1}^{n} (θ_(i) − θ_(.))²

PR , ANN, & ML 28
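A sketch of the general jackknife formulas above for an arbitrary statistic; the sample data below are made up for illustration.

```python
# Sketch only: jackknife bias and variance of a statistic via leave-one-out.
import numpy as np

def jackknife(statistic, x):
    x = np.asarray(x)
    n = len(x)
    theta_hat = statistic(x)                                   # estimate from the full sample
    theta_i = np.array([statistic(np.delete(x, i)) for i in range(n)])  # leave-one-out estimates
    theta_dot = theta_i.mean()                                 # theta_(.)
    bias = (n - 1) * (theta_dot - theta_hat)                   # bias_jackknife
    var = (n - 1) / n * np.sum((theta_i - theta_dot) ** 2)     # Var_jackknife
    return theta_hat, bias, var

# For statistic = mean, this reproduces the earlier shortcut:
# Var_jackknife equals sum((x_i - mean)^2) / (n(n-1)).
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(jackknife(np.mean, x))
print(jackknife(np.median, x))   # also works for statistics like the median
```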
Example: Mode Estimate
• n = 6
• D = {0, 10, 10, 10, 20, 20}
• Mode rule: (1) the most common element; (2) if two elements are equally frequent (two equal peaks), the mode is the midpoint between them

θ̂ = mode(D) = 10
θ_(.) = (1/n) Σ_{i=1}^{n} θ_(i) = (1/6){10 + 15 + 15 + 15 + 10 + 10} = 12.5
bias_jackknife = (n − 1)(θ_(.) − θ̂) = 5 (12.5 − 10) = 12.5
Var_jackknife[θ̂] = ((n−1)/n) Σ_{i=1}^{n} (θ_(i) − θ_(.))² = (5/6){3 (10 − 12.5)² + 3 (15 − 12.5)²} = 31.25
PR , ANN, & ML 29
Estimation - Bootstrap
• Perform many subset estimations (n out of n) with replacement
  – There are n^n possible samples
  – E.g., two samples (x1, x2) generate 2² = 4 subsets: (x1, x1), (x1, x2), (x2, x1), (x2, x2)

θ*^(.) = (1/B) Σ_{b=1}^{B} θ*^(b)

bias_boot = (1/B) Σ_{b=1}^{B} θ*^(b) − θ̂ = θ*^(.) − θ̂

var_boot[θ̂] = (1/B) Σ_{b=1}^{B} [ θ*^(b) − θ*^(.) ]²

PR , ANN, & ML 30
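A sketch of the bootstrap estimates above, using B resamples of size n drawn with replacement; the data and the value of B are assumptions for illustration.

```python
# Sketch only: bootstrap bias and variance of a statistic.
import numpy as np

def bootstrap(statistic, x, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    theta_hat = statistic(x)                       # estimate on the original sample
    theta_b = np.array([statistic(rng.choice(x, size=n, replace=True))
                        for _ in range(B)])        # theta*(b), b = 1..B
    theta_dot = theta_b.mean()                     # theta*(.)
    bias = theta_dot - theta_hat                   # bias_boot
    var = np.mean((theta_b - theta_dot) ** 2)      # var_boot
    return theta_hat, bias, var

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(bootstrap(np.median, x))
```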
The Question Remained
• From data sets we can estimate e_emp
• We desire e
• They are related by

    e ≈ e_emp ± c √( e_emp (1 − e_emp) / n )

    n: sample size
    c: critical value for the desired confidence level

• Example: n = 40
  – 28 correct and 12 incorrect
  – e_emp = 12/40 = 0.3
  – With a 95% confidence level, c = 1.96

    e ≈ 0.3 ± 1.96 √( 0.3 × 0.7 / 40 ) ≈ 0.3 ± 0.14
PR , ANN, & ML 31
What?
• The formula is valid if
  – The hypothesis is discrete-valued
  – Samples are drawn independently
  – From a fixed probability distribution
• Then the experiment outcomes can be described by a binomial distribution

PR , ANN, & ML 32
Comparison
Coin tossing:
• Repeat an experiment many times
• Each time, toss a coin n times and count how often it lands on heads (h) or tails (t), h + t = n
• The coin has an (unknown) probability p of landing on heads

Classification:
• Use a classifier on many sets of data
• For each data set, the classifier gets h samples wrong and t correct out of n, h + t = n
• The classifier has an (unknown, but fixed) probability p of classifying a sample incorrectly

PR , ANN, & ML 33
Comparison (cont.)
Coin tossing:
• The result of each coin toss is a random variable
• The number of heads in n tosses is also a random variable
• Repeated experiments give outcomes that follow a binomial distribution

Classification:
• The result of classifying each sample is a random variable
• The number of incorrect labels in n samples is also a random variable
• Repeated classifications give outcomes that follow a binomial distribution

PR , ANN, & ML 34
Binomial Distributions

    P(h) = [ n! / (h! (n − h)!) ] p^h (1 − p)^(n − h)

    n: sample size
    h: number of heads
    p: head probability

(Plot: P(h) as a function of h/n)

PR , ANN, & ML 35
Binomial Distribution
• Then it can be shown that

    P(h) = [ n! / (h! (n − h)!) ] p^h (1 − p)^(n − h)
    E(h) = np
    Var(h) = np(1 − p)
    σ_h = √( np(1 − p) )

PR , ANN, & ML 36
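A quick sanity check of the formulas above (a sketch, not from the deck; n and p are chosen to match the earlier 40-sample example): computing the mean and variance directly from the definition of P(h) reproduces np and np(1 − p).

```python
# Sketch only: binomial mean and variance from the pmf.
from math import comb

n, p = 40, 0.3
P = [comb(n, h) * p**h * (1 - p)**(n - h) for h in range(n + 1)]
mean = sum(h * P[h] for h in range(n + 1))
var = sum((h - mean) ** 2 * P[h] for h in range(n + 1))
print(mean, n * p)              # both ~12.0   (E(h) = np)
print(var, n * p * (1 - p))     # both ~8.4    (Var(h) = np(1-p))
```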
Estimators
Coin tossing:
• An estimator for p is (number of heads)/n
• This estimator is unbiased, because E(#heads/n) = np/n = p
• The standard deviation of the estimator is σ(#heads/n) = √(np(1 − p))/n = √(p(1 − p)/n)

Classification:
• An estimator for e (i.e., p) is e_emp = (number of errors)/n
• This estimator is unbiased, because E(#errors/n) = np/n = p
• The standard deviation of the estimator is σ(#errors/n) = √(np(1 − p))/n = √(p(1 − p)/n)

PR , ANN, & ML 37
Confidence Interval
• An N% confidence interval for some parameter p is an interval that is expected, with probability N%, to contain p
• For the binomial distribution, this can be approximated using the normal distribution:

    e ≈ e_emp ± c √( e_emp (1 − e_emp) / n )

    n: sample size
    c: critical value for the N% confidence level (e.g., c = 1.96 for 95%)

PR , ANN, & ML 38
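A sketch that wraps the formula above and reproduces the n = 40 example from the earlier slide; the helper name is an assumption.

```python
# Sketch only: normal-approximation confidence interval for the true error e.
from math import sqrt

def error_confidence_interval(e_emp, n, c=1.96):    # c = 1.96 for 95% confidence
    half_width = c * sqrt(e_emp * (1 - e_emp) / n)
    return e_emp - half_width, e_emp + half_width

print(error_confidence_interval(12 / 40, 40))       # roughly (0.16, 0.44), i.e. 0.3 +/- 0.14
```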
Capacity (Simplicity)
• We have just discussed the repeatability issue
  – The assumption is that the classification error is the same for training and new data
  – That is, the misclassification rate is drawn from the same population
  – True for simple classifiers, but not for complicated classifiers
• The other issue is the simplicity (or, more generally, the capacity) of the classifier
PR , ANN, & ML 39
General Theoretical Bound
• The sample size
  – The larger the sample size, the more confident we should be that "we have seen enough"
• The complexity of the classifier
  – The simpler the classifier, the more confident we should be that "we have observed enough"
  – In other words, a complex classifier can do weird things when you are not looking

PR , ANN, & ML 40
VC Dimension
• Vapnik-Chervonenkis dimension
  – Defined for a class of functions f(α)
  – The maximum number of points that can be shattered by the function class
    – "Shatter" means that, given, say, n points, there are 2^n ways to label them with {+1, -1}; the points are shattered if, for every such labeling, an f(α) can be found that assigns the labels correctly
    – E.g., three points (in the plane) can be shattered by a line, but four points cannot
  – A linear function in n-dimensional space has VC dimension n + 1
• A measure of the "capacity" of a classifier

PR , ANN, & ML 41
Generalization
• It can be shown that, with probability 1 − η,

    e(α) ≤ e_emp(α) + √( [ h (log(2n/h) + 1) − log(η/4) ] / n )

• Basically, the expected error is bounded above; the bound depends on
  – The empirical error (e_emp)
  – The VC dimension (h)
  – n: the training sample size
  – The confidence parameter (η), say η = 0.05, so that the bound holds with probability 0.95

PR , ANN, & ML 42
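A sketch of the confidence term in the bound above, assuming natural logarithms. It also reproduces the later observation that, at a 95% confidence level with 10,000 samples, the term alone passes 1 once h/n is around 0.37, which is why the bound is informative only for small VC dimensions.

```python
# Sketch only: the VC confidence term sqrt((h(ln(2n/h)+1) - ln(eta/4)) / n).
from math import log, sqrt

def vc_confidence(h, n, eta=0.05):
    return sqrt((h * (log(2 * n / h) + 1) - log(eta / 4)) / n)

n = 10_000
for h in (10, 100, 1000, 3700):          # 3700 corresponds to h/n = 0.37
    print(h, round(vc_confidence(h, n), 3))
# Output climbs from ~0.10 to ~1.0: by h/n ~ 0.37 the bound is already vacuous.
```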
Capacity Interpretation

    e(α) ≤ e_emp(α) + √( [ h (log(2n/h) + 1) − log(η/4) ] / n )

• A simple classifier
  – Has a low VC dimension and a small second term
  – Has a high empirical error and a large first term
• A complicated classifier
  – Has a high VC dimension and a large second term
  – Has a low empirical error and a small first term
• Some trade-off is needed to achieve the lowest right-hand side

PR , ANN, & ML 43
Generalization Performance
• Hence, a classifier should be chosen to give the lowest bound
• However, the bound is often not tight; it can easily reach 1 and become useless
• It is only useful for small VC dimensions

PR , ANN, & ML 44
Upper Bound
• At a 95% confidence level
• With 10,000 samples
• The bound becomes vacuous once h/n > 0.37

PR , ANN, & ML 45
Ensemble Classifiers
• Combine simple classifiers by majority vote
• Famous examples: bagging and boosting
• Why it works:
  – It might reduce error (bias)
  – It reduces variance

PR , ANN, & ML 46
Reduce Bias
• Suppose each classifier makes an error, say, 30% of the time
• How likely is it for a committee of n classifiers to make a mistake under majority rule?
  – Answer: a binomial tail probability (the individual errors are independent Bernoulli trials); see the sketch after this slide
  – Big IF: the classifiers must err independently

PR , ANN, & ML 47
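A sketch of the binomial computation above for an odd-sized committee whose members err independently with probability eps; the numbers are illustrative only.

```python
# Sketch only: error probability of a majority-vote committee of k independent members.
from math import comb

def committee_error(eps, k):
    # The majority is wrong when more than half of the k members are wrong.
    return sum(comb(k, m) * eps**m * (1 - eps)**(k - m)
               for m in range((k // 2) + 1, k + 1))

for k in (1, 3, 11, 21):
    print(k, round(committee_error(0.3, k), 4))
# With eps = 0.3 the committee error drops as k grows -- provided the errors
# really are independent, which is the "big IF" above.
```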
Improvement - Averaging
• Each machine reaches a (different) local minimum
• Combine them by majority vote

(Diagram: same training data, different starting points, outputs combined by majority vote)
PR , ANN, & ML 48
Improvement - Bagging
• Bagging = Bootstrap AGGregation
• A simple "parallel processing" model using multiple component classifiers
• Helps with the stability problem

(Diagram: different (split) training data, different starting points, outputs combined by majority vote)
PR , ANN, & ML 49
Why Bagging Works?
• It can reduce both bias and variance
• Bias: not necessarily improved

E_D[(g(x; D) − F(x))²] = (E_D[g(x; D) − F(x)])² + E_D[(g(x; D) − E_D[g(x; D)])²]

With the bagged model g(x; D) = (1/n) Σ_i g_i(x; D):

E_D[g(x; D) − F(x)] = E_D[ (1/n) Σ_i g_i(x; D) − F(x) ]
                    = (1/n) Σ_i E_D[g_i(x; D)] − F(x)
                    = (1/n) Σ_i ( E_D[g_i(x; D)] − F(x) ) = (1/n) Σ_i {bias of constituent i}
                    = the average bias of the constituents

PR , ANN, & ML 50
Why Bagging Works?
• It can reduce both bias and variance
• Variance: reduced to 1/n of the constituents' average – IF all constituents are independent

E_D[(g(x; D) − F(x))²] = (E_D[g(x; D) − F(x)])² + E_D[(g(x; D) − E_D[g(x; D)])²]

E_D[(g(x; D) − E_D[g(x; D)])²]
  = E_D[ ( (1/n) Σ_i g_i(x; D) − E_D[ (1/n) Σ_i g_i(x; D) ] )² ]
  = (1/n²) E_D[ ( Σ_i ( g_i(x; D) − E_D[g_i(x; D)] ) )² ]
  = (1/n²) Σ_i E_D[ ( g_i(x; D) − E_D[g_i(x; D)] )² ]        (cross terms vanish under independence)
  = (1/n) × {the average variance of the constituents}

PR , ANN, & ML 51
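A simulation sketch of the variance reduction above, under assumed settings (polynomial constituents, a sine target, a single test point). Because bootstrap resamples overlap, the constituents are not truly independent, so the reduction is real but smaller than the ideal 1/n.

```python
# Sketch only: variance of a single model vs. a bagged ensemble at one test point.
import numpy as np

rng = np.random.default_rng(2)
F = np.sin
x0, n_train, n_bag, degree, trials = 1.0, 30, 25, 5, 300

single_preds, bagged_preds = [], []
for _ in range(trials):
    x = rng.uniform(0, np.pi, n_train)
    y = F(x) + rng.normal(0, 0.3, n_train)
    # One high-variance constituent trained on the full training set.
    single_preds.append(np.poly1d(np.polyfit(x, y, degree))(x0))
    # Bagging: average n_bag constituents, each trained on a bootstrap resample.
    preds = []
    for _ in range(n_bag):
        idx = rng.integers(0, n_train, n_train)
        preds.append(np.poly1d(np.polyfit(x[idx], y[idx], degree))(x0))
    bagged_preds.append(np.mean(preds))

print("variance, single model:", np.var(single_preds))
print("variance, bagged model:", np.var(bagged_preds))
```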
Boosting by Filtering
• Bagging is a competition model, while boosting is a collaborative model
• Component classifiers
  – should be introduced when needed
  – should then be trained on ambiguous samples
• Iterative refinement of the results to reduce error and ambiguity

PR , ANN, & ML 52
Boosting by Filtering (cont)

(Diagram: three component classifiers c1, c2, c3)

• If c1 and c2 agree, use their common answer
• Otherwise, use c3

PR , ANN, & ML 53
Boosting by Filtering (cont.)
• D1:
  – A subset of D (drawn without replacement), used to train c1
• D2:
  – Heads: samples from D − D1 that c1 gets wrong
  – Tails: samples from D − D1 that c1 gets right
  – Half correct and half wrong with respect to c1, so c2 learns what c1 has difficulty with
• D3:
  – Samples from D − (D1 + D2) on which c1 and c2 disagree
  – c3 learns what the previous two cannot agree on

PR , ANN, & ML 54
Boosting by Filtering (cont.)

PR , ANN, & ML 55
Boosting by Filtering (cont.)
• If each committee machine has an error rate of ε, then the combined machine has an error rate of 3ε² − 2ε³ (checked numerically after this slide)

PR , ANN, & ML 56
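A quick numerical check of the error rate above, assuming the three classifiers err independently with ε = 0.3.

```python
# Sketch only: verify 3*eps^2 - 2*eps^3 for the "use c1/c2 if they agree,
# otherwise defer to c3" rule, with independent errors.
eps = 0.3
closed_form = 3 * eps**2 - 2 * eps**3
# The committee errs if c1 and c2 agree on a wrong answer, or if they
# disagree and c3 is wrong.
rule_based = eps**2 + 2 * eps * (1 - eps) * eps
print(closed_form, rule_based)   # both print 0.216
```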
AdaBoost – Adaptive Boosting
• The basic idea is very simple
  – Add more component classifiers if the error is higher than a preset threshold
  – Samples are weighted: if a sample is accurately classified by the combined component classifiers, its chance of being picked for training the next classifier is reduced
  – AdaBoost therefore focuses on the difficult patterns

PR , ANN, & ML 57
AdaBoost Algorithm
Initialization
    D = {(x1, y1), ..., (xn, yn)},  k_max,  W_1(i) = 1/n for i = 1, ..., n

Procedure
    for k = 1 to k_max
        train weak learner C_k using D sampled according to W_k(i)
        E_k = Prob[ h_k(x_i) ≠ y_i ] = Σ_{i: h_k(x_i) ≠ y_i} W_k(i)        (weighted error of C_k)
        α_k = (1/2) ln[ (1 − E_k) / E_k ]                                  (weight of C_k)
        W_{k+1}(i) = ( W_k(i) / Z_k ) × { e^{−α_k}  if h_k(x_i) = y_i
                                          e^{+α_k}  if h_k(x_i) ≠ y_i }
    return C_k and α_k for k = 1, ..., k_max

Final hypothesis
    H(x) = sign( Σ_{k=1}^{k_max} α_k h_k(x) )

(Z_k is a normalization factor so that the weights W_{k+1}(i) sum to 1.)

PR , ANN, & ML 58
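A runnable sketch of the algorithm above, using decision stumps as the weak learners. One deliberate simplification: instead of resampling D according to W_k, each stump is trained directly on the weighted data (the common reweighting variant); the toy data set and stump design are assumptions for illustration.

```python
# Sketch only: AdaBoost with decision stumps on a toy 2-D problem.
import numpy as np

def train_stump(X, y, w):
    """Weak learner: the (feature, threshold, polarity) with lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for p in (1, -1):
                pred = np.where(p * (X[:, f] - thr) >= 0, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, f, thr, p)
    return best                                  # (weighted error, feature, threshold, polarity)

def stump_predict(X, f, thr, p):
    return np.where(p * (X[:, f] - thr) >= 0, 1, -1)

def adaboost(X, y, k_max=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # W_1(i) = 1/n
    stumps, alphas = [], []
    for _ in range(k_max):
        err, f, thr, p = train_stump(X, y, w)
        err = max(err, 1e-10)                    # guard against a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)    # alpha_k
        pred = stump_predict(X, f, thr, p)
        w *= np.exp(-alpha * y * pred)           # e^{-alpha} if correct, e^{+alpha} if wrong
        w /= w.sum()                             # divide by Z_k
        stumps.append((f, thr, p))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    score = sum(a * stump_predict(X, f, thr, p) for (f, thr, p), a in zip(stumps, alphas))
    return np.sign(score)                        # H(x) = sign(sum_k alpha_k h_k(x))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)       # diagonal boundary: no single stump suffices
stumps, alphas = adaboost(X, y)
print("training accuracy:", np.mean(predict(X, stumps, alphas) == y))
```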
PR , ANN, & ML 59
PR , ANN, & ML 60
Comparison
Multilayer Perceptron:
• Many neurons
• Trained together
• Hard to train
• Uses the same data set
• Requires nonlinearity in the thresholding
• Complicated decision boundary
• Overfitting is likely

Boosting:
• Many weak classifiers
• Trained individually
• Easy to train
• Uses fine-tuned data sets
• Requires different data sets
• Simple decision boundaries
• Less susceptible to overfitting

PR , ANN, & ML 61
AdaBoost Algorithm (cont.)
• As long as each individual weak learner has better-than-chance performance, AdaBoost can boost the combined performance arbitrarily well (Freund and Schapire, JCSS 1997)

PR , ANN, & ML 62
Applications
• AdaBoost has been used successfully in many applications
• One famous example is its use in face detection from images

PR , ANN, & ML 63
Viola and Jones

PR , ANN, & ML 64
Fast Feature Computation

PR , ANN, & ML 65
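The figure for this slide is not reproduced here. As a sketch of the underlying idea, the fast feature computation in Viola and Jones relies on the integral image: after one pass over the image, any rectangle sum costs only four array lookups. The function names below are assumptions.

```python
# Sketch only: integral image and constant-time rectangle sums.
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img[0:r, 0:c] (zero-padded on the top and left)."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, r, c, h, w):
    """Sum of the h-by-w rectangle whose top-left corner is (r, c)."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2), img[1:3, 1:3].sum())   # both 30
```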
Classifier
• AdaBoost idea – greedy feature (classifier) selection
• Weak learner (h) – thresholds a single rectangular feature (a sketch follows this slide)
  – f: feature value
  – θ: threshold
  – p: polarity
  – x: a 24×24-pixel sub-window

PR , ANN, & ML 66
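A sketch of the weak learner described above, h(x) = 1 if p·f(x) < p·θ and 0 otherwise. The specific feature function here is a hypothetical two-rectangle example; real detectors select from a large pool of such features.

```python
# Sketch only: a thresholded single-feature weak classifier on a 24x24 window.
import numpy as np

def two_rect_feature(window):
    """Hypothetical example feature: left-half sum minus right-half sum."""
    h, w = window.shape                      # expected 24 x 24
    return window[:, : w // 2].sum() - window[:, w // 2 :].sum()

def weak_classifier(window, theta, p, feature=two_rect_feature):
    f = feature(window)
    return 1 if p * f < p * theta else 0     # p in {+1, -1} flips the inequality

window = np.random.default_rng(4).integers(0, 256, (24, 24))
print(weak_classifier(window, theta=0.0, p=1))
```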
Algorithm

PR , ANN, & ML 67
Efficiency
• Use an attentional cascade

PR , ANN, & ML 68
Results

PR , ANN, & ML 69
Results

PR , ANN, & ML 70
Learning with Queries
• Given a weak classifier, or several weak component classifiers
• Find out where the ambiguity is
  – Where the weak classifier gives high readings for the top two discriminant functions (e.g., in a linear machine)
  – Where the component classifiers yield the greatest disagreement
• Train the classifiers with those ambiguous samples (a query-selection sketch follows this slide)

PR , ANN, & ML 71
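A sketch of the first criterion above for a linear machine: query the samples whose top two discriminant values are nearly tied. The discriminant weights and data below are assumptions for illustration.

```python
# Sketch only: pick the most ambiguous samples to query for labels.
import numpy as np

def most_ambiguous(discriminants, X, n_queries=5):
    """discriminants: array of shape (n_classes, d+1), rows g_c(x) = w_c.x + b_c."""
    Xh = np.hstack([X, np.ones((len(X), 1))])     # append a constant for the bias term
    scores = Xh @ discriminants.T                 # (n_samples, n_classes)
    top2 = np.sort(scores, axis=1)[:, -2:]        # two largest discriminant values per sample
    margin = top2[:, 1] - top2[:, 0]              # small margin = high ambiguity
    return np.argsort(margin)[:n_queries]         # indices of samples to query

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
W = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, -1.0, 0.0]])  # 3 linear discriminants
print(most_ambiguous(W, X))                       # samples closest to the decision boundaries
```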
Learning with Queries (cont.)
• Samples generated with queries

PR , ANN, & ML 72
