
Performance and Generalization
Classifier Performance
• Intuitively, the performance of classifiers (learning algorithms) depends on
  • Complexity of the classifier (e.g., how many layers and how many neurons per layer)
  • Training samples (generally, more is better)
  • Training procedures (e.g., how many searches/epochs are allowed)
  • Etc.
Generalization Performance
• You can make a classifier perform very well on any training data set
  • Given enough structural complexity
  • Given enough training cycles
• But how does it do on a validation (unseen) data set?
  • In other words, what is its generalization performance?
Generalization Performance (cont.)
• First, trying to do better on unseen data by doing better on training data might not work
  • Overfitting can be a problem
  • You can fit the training data arbitrarily well, but there is no prediction of what the classifier will do on data not seen
  • Example: curve fitting
• Using a large network or a complicated classifier does not necessarily lead to good generalization (though it almost always leads to good training results)
Generalization Performance (cont.)
• In fact, some relations must exist in a data set even when the data set is made of random numbers
  • Example: given n people, each
    • Has a credit card
    • Has a phone
  • The credit card and phone number association is captured by a degree-(n-1) polynomial
  • But can you extrapolate (predict other credit card/phone number associations)?
  • This is a problem of overfitting
Intuitively
• Meaningful associations usually imply
  • Simplicity (capacity)
    • The association function should be simple
    • More generally, this is about how much capacity a classifier possesses
  • Repeatability (stability)
    • The association function should not change drastically when different training data sets are used to derive it, i.e., the expected variation E(Δf) = 0 over different data sets
• Example: the average salary of a Ph.D. is higher than that of a high-school dropout – a simple and repeatable relation (not sensitive to the particular training data set)
Generalization Performance (cont.)
• So does that mean we should always prefer simplicity?
  • Occam's Razor: nature prefers simplicity
    • Explanations should not be multiplied beyond necessity
  • Sometimes it is a bias or preference over the forms and parameters of a classifier
No free lunch theorem
• Under very general assumptions, one should not prefer one classifier (or learning algorithm) over another for generalization performance
• Why?
  • Because given certain training data, there is no telling (in general) how unseen data will behave
Example
• Training data might not provide any information about F(x)
• There are multiple (2^5) target functions consistent with the n = 3 patterns in the training set
• Each inversion of F (i.e., using -F off the training set) will make one hypothesis good and the other bad

              x     F    h1   h2
  training    000    1    1    1
              001   -1   -1   -1
              010    1    1    1
  unseen      011   -1    1   -1
  (2^5        100    1    1   -1
  possible    101   -1    1   -1
  labelings)  110    1    1   -1
              111    1    1   -1
Example (cont.)
• However, in reality, we also expect that learning (training) is effective, given
  • a large number of representative samples, and
  • a target function that is smooth
• Similarly, the sampling theorem says that extrapolation and reconstruction are possible under certain conditions
How do we reconcile?
• The choice of a classifier or a training algorithm depends on our preconceived notion of the relevant target functions
• In that sense, Occam's Razor and the no free lunch theorem can co-exist
Make it More Concrete
• Let's assume that there are two classes {+1, -1} and n samples (x1, y1), ..., (xn, yn)
• The classifier is f(x, α) -> y (α: the tunable parameters of f, e.g., a hyperplane)
• The loss (error) on a sample is (1/2)|y_i - f(x_i, α)|
• Then there are two types of errors (risks)
  • Empirical error: e_emp(α) = (1/2n) Σ_{i=1}^{n} |y_i - f(x_i, α)|
  • Expected error: e(α) = (1/2) ∫ |y - f(x, α)| dP(x, y)
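The two risks are easy to state in code. Below is a small sketch (the helper name `empirical_error` and the toy linear classifier are illustrative, not from the slides) that evaluates e_emp for ±1 labels:

```python
import numpy as np

def empirical_error(f, theta, X, y):
    """e_emp(theta) = (1/2n) * sum_i |y_i - f(x_i, theta)|, for labels in {+1, -1}."""
    return np.mean(np.abs(y - f(X, theta))) / 2.0

# Illustrative linear classifier: f(x, theta) = sign(theta . x)
f = lambda X, theta: np.sign(X @ theta)
X = np.array([[1.0, 2.0], [-1.0, -1.0], [2.0, 0.5]])
y = np.array([1, -1, 1])
print(empirical_error(f, np.array([1.0, 0.0]), X, y))  # 0.0: all three correct
```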
Issues with the Two Risks
• How do we estimate how good a classifier is based on a particular run of the experiment?
  • How best to compute e_emp from samples?
• How do we estimate e from e_emp?
  • Practically (with some particular data sets), and
  • Theoretically (with an upper bound)
Answers
• Computing e_emp
  • There are statistical resampling techniques that give better estimates of a classifier's performance
• Estimating e from e_emp
  • Theoretically
    • There is an upper bound on e given e_emp
  • Practically
    • Given a particular set (or sets) of training data, a procedure exists to estimate e from e_emp
Practical Issues
• OK, so philosophically nothing is better than anything else
• In reality, we have to choose classifiers and learning algorithms
  • Run some classification experiments in supervised mode (with labeled data)
• What should we look for in a "good" classifier or learning algorithm?
Assumption
• Use the RBF interpretation
  • Interpolating a function with given data points
  • True function: F(x)
  • Interpolation function: g(x; D)
    • D is the given data set
Bias and Variance

E_D[(g(x;D) - F(x))^2] = (E_D[g(x;D) - F(x)])^2 + E_D[(g(x;D) - E_D[g(x;D)])^2]
                               bias^2                       variance

Derivation (arguments dropped; F is constant with respect to D):
E_D[(g - F)^2] = E_D[g^2] - 2 E_D[g] F + F^2
               = E_D[g^2] - 2 E_D[g] F + F^2 + ((E_D[g])^2 - 2(E_D[g])^2 + (E_D[g])^2)
               = ((E_D[g])^2 - 2 E_D[g] F + F^2) + (E_D[g^2] - 2 E_D[g] E_D[g] + (E_D[g])^2)
               = (E_D[g - F])^2 + E_D[(g - E_D[g])^2]
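The decomposition can be checked numerically. A minimal simulation sketch (the sine target, noise level, and linear fit are assumptions for illustration): averaging over many training sets D, the mean-squared error at a point splits exactly into bias² plus variance:

```python
import numpy as np

rng = np.random.default_rng(0)
F = lambda x: np.sin(np.pi * x)          # assumed true function F(x)
x0 = 0.5                                 # point at which to evaluate the decomposition

preds = []                               # g(x0; D) over many sampled data sets D
for _ in range(2000):
    xs = rng.uniform(-1, 1, 20)
    ys = F(xs) + rng.normal(0, 0.3, 20)  # noisy training samples
    preds.append(np.polyval(np.polyfit(xs, ys, 1), x0))  # g: a fitted line
preds = np.array(preds)

mse      = np.mean((preds - F(x0)) ** 2)         # E_D[(g - F)^2]
bias_sq  = (np.mean(preds) - F(x0)) ** 2         # (E_D[g - F])^2
variance = np.var(preds)                         # E_D[(g - E_D[g])^2]
print(mse, bias_sq + variance)                   # identical, as the algebra promises
```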
Bias and Variance (cont.)

E_D[(g(x;D) - F(x))^2] = (E_D[g(x;D) - F(x)])^2 + E_D[(g(x;D) - E_D[g(x;D)])^2]
                               bias^2                       variance

• Bias – measures the accuracy
  • How well the classifiers conform to reality
  • High bias implies a poor match
• Variance – measures the precision, or the specificity, of a match
  • How well the classifiers conform to one another
  • High variance implies a weak match
Graphic Interpretation

Tradeoff
• Intuitively, increasing the flexibility (more parameters)
  • Gives a better fit (low bias)
  • Produces a fit that is harder to control (high variance)
• The bias-variance tradeoff is a lot like precision-recall: you cannot have both
Bias and Variance in Fitting
• With more parameters, a better fit (low bias) is possible, at the expense that the parameter values may vary widely given different sampled data (high variance)

(figure panels – a, b: fixed linear models; c: learned cubic model; d: learned linear model)
Bias and Variance in Classifiers
• Simple models
  • Capture aggregate class properties, which are usually more stable (hence low variance)
  • However, they miss fine details and give a poor fit (hence high bias)
• Complicated models
  • Capture aggregate class properties and fine details (hence low bias)
  • However, the fine details depend on the samples used (hence high variance)
Curse of Dimensionality
• What happens to bias and variance when the dimensionality increases?
• It depends
  • If F(x) depends on all dimensions of x
    • Bias is likely to go up for a nearest-neighbor classifier, because neighbors will be too far away from a data point for faithful interpolation
  • If F(x) depends on only some dimensions of x
    • Variance is likely to go up for a nearest-neighbor classifier, because the spread along the used dimensions might go up
Practical Issues (cont.)
• So you choose a labeled data set and test your classifier
  • But that is just one particular data set
  • How do you know how the classifier will do on other labeled data sets?
• How do you estimate bias and variance? (Or, how do you know the particular relation is stable and repeatable?)
  • We do not really know F for other data sets
  • We have only one data set, not an ensemble
• How do you extrapolate (from e_emp to e)?
• How do you improve bias and variance?
Example: Estimation - Jackknife
• Perform many "leave-one-out" estimations
  • E.g., to estimate the mean and variance

Traditional:
  û = (1/n) Σ_{i=1}^{n} x_i
  σ̂² = (1/n) Σ_{i=1}^{n} (x_i - û)²
  (does not generalize to other statistics, such as the median and mode)

Leave-one-out:
  u_(i) = (1/(n-1)) Σ_{j=1, j≠i}^{n} x_j
  u_(.) = (1/n) Σ_{i=1}^{n} u_(i)
  Var[u] = ((n-1)/n) Σ_{i=1}^{n} (u_(i) - u_(.))²
  (applicable to other statistics, such as the median and mode)
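A sketch of the leave-one-out machinery (the function name `jackknife` is ours; it assumes NumPy). Because it only re-applies the statistic, it works for the median and mode as well as the mean:

```python
import numpy as np

def jackknife(stat, x):
    """Leave-one-out estimates u_(i), their mean u_(.), and the jackknife variance."""
    n = len(x)
    loo = np.array([stat(np.delete(x, i)) for i in range(n)])   # u_(i)
    u_dot = loo.mean()                                          # u_(.)
    var = (n - 1) / n * np.sum((loo - u_dot) ** 2)
    return u_dot, var

x = np.array([2.0, 4.0, 6.0, 8.0])
print(jackknife(np.mean, x))     # for the mean, u_(.) equals the plain sample mean
print(jackknife(np.median, x))   # also applicable to the median, as noted above
```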
Estimation – Jackknife (cont.)
• Jackknife estimation is defined as a function of the leave-one-out results
• It enables mean and variance computation from a single data set

For the mean, the leave-one-out average recovers the plain estimate:
  u_(.) = (1/n) Σ_i u_(i) = (1/n) Σ_i (nû - x_i)/(n-1)
        = (n²û - Σ_i x_i) / (n(n-1)) = (n²û - nû) / (n(n-1)) = û

And the jackknife variance reduces to the familiar form:
  Var[u] = ((n-1)/n) Σ_i (u_(i) - u_(.))²
         = ((n-1)/n) Σ_i ((nû - x_i)/(n-1) - û)²
         = ((n-1)/n) Σ_i ((û - x_i)/(n-1))²
         = (1/(n(n-1))) Σ_i (x_i - û)²
General Jackknife Estimation
• Similar to the mean and variance estimation: perform many leave-one-out estimations of the parameter θ

θ̂_(i) = θ̂(x_1, x_2, ..., x_{i-1}, x_{i+1}, ..., x_n)
θ̂_(.) = (1/n) Σ_{i=1}^{n} θ̂_(i)          (e.g., θ can be the hyperplane equation)
Var[θ̂] = ((n-1)/n) Σ_{i=1}^{n} (θ̂_(i) - θ̂_(.))²
Bias and Variance of General Jackknife Estimation

bias = θ - E[θ̂]
bias_jackknife = (n-1)(θ̂_(.) - θ̂)

Var[θ̂] = E[(θ̂ - E[θ̂])²]
Var_jackknife[θ̂] = ((n-1)/n) Σ_{i=1}^{n} (θ̂_(i) - θ̂_(.))²
Example: Mode Estimate
• n = 6
• D = {0, 10, 10, 10, 20, 20}, so the sample mode is θ̂ = 10

θ̂_(.) = (1/n) Σ_{i=1}^{n} θ̂_(i) = (1/6){10 + 15 + 15 + 15 + 10 + 10} = 12.5
bias_jackknife = (n-1)(θ̂_(.) - θ̂) = 5(12.5 - 10) = 12.5
Var_jackknife[θ̂] = ((n-1)/n) Σ_{i=1}^{n} (θ̂_(i) - θ̂_(.))² = (5/6){3(10 - 12.5)² + 3(15 - 12.5)²} = 31.25

Mode convention: (1) the most common element; (2) if two elements are equally common, the midpoint between the two peaks
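The slide's numbers can be reproduced directly. A sketch (the tie-breaking helper `mode_midpoint` encodes the slide's convention and assumes exactly two tied peaks):

```python
from collections import Counter

def mode_midpoint(x):
    """Mode per the slide: most common element; with two equal peaks, their midpoint."""
    counts = Counter(x)
    top = max(counts.values())
    peaks = sorted(v for v, c in counts.items() if c == top)
    return peaks[0] if len(peaks) == 1 else (peaks[0] + peaks[-1]) / 2

D = [0, 10, 10, 10, 20, 20]
n = len(D)
loo = [mode_midpoint(D[:i] + D[i+1:]) for i in range(n)]    # [10, 15, 15, 15, 10, 10]
theta_dot = sum(loo) / n                                    # 12.5
bias = (n - 1) * (theta_dot - mode_midpoint(D))             # 5 * (12.5 - 10) = 12.5
var = (n - 1) / n * sum((t - theta_dot) ** 2 for t in loo)  # 31.25
print(loo, theta_dot, bias, var)
```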
Estimation - Bootstrap
• Perform many subset estimations (n out of n) with replacement
  • There are n^n possible samples
  • E.g., two samples (x1, x2) generate 2² subsets: (x1, x1), (x1, x2), (x2, x1), (x2, x2)

θ̂*(.) = (1/B) Σ_{b=1}^{B} θ̂*(b)
bias_boot = (1/B) Σ_{b=1}^{B} θ̂*(b) - θ̂ = θ̂*(.) - θ̂
Var_boot[θ̂] = (1/B) Σ_{b=1}^{B} [θ̂*(b) - θ̂*(.)]²
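A matching bootstrap sketch (the function name and B = 1000 are our choices; in practice B replicates approximate the n^n possibilities):

```python
import numpy as np

def bootstrap(stat, x, B=1000, seed=0):
    """n-out-of-n resampling with replacement; returns (bias_boot, var_boot)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    reps = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    theta_star = reps.mean()                    # theta*(.)
    return theta_star - stat(x), np.mean((reps - theta_star) ** 2)

x = np.array([0.0, 10, 10, 10, 20, 20])
print(bootstrap(np.median, x))                  # bootstrap bias and variance of the median
```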
The Question Remained
• From data sets we can estimate e_emp, but we desire e
• They are related by

e = e_emp ± c · sqrt( e_emp(1 - e_emp) / n )
  n: sample size
  c: coefficient for the chosen confidence level

• Example: n = 40
  • 28 correct and 12 incorrect, so e_emp = 12/40 = 0.3
  • With a confidence level of 95%, c = 1.96

e = 0.3 ± 1.96 · sqrt(0.3 × 0.7 / 40) = 0.3 ± 0.14
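The arithmetic of the example, spelled out (standard library only):

```python
from math import sqrt

n, errors, c = 40, 12, 1.96                      # 95% confidence -> c = 1.96
e_emp = errors / n                               # 0.3
half_width = c * sqrt(e_emp * (1 - e_emp) / n)   # ~0.142
print(e_emp - half_width, e_emp + half_width)    # interval roughly (0.16, 0.44)
```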
What?
• The formula is valid if
  • The hypothesis is discrete-valued
  • Samples are drawn independently
    • With a fixed probability distribution
• Then the experiment outcomes can be described by a binomial distribution
Comparison
• Coin tossing: repeat an experiment many times
  Classification: use a classifier on many sets of data
• Coin tossing: each time, toss a coin n times to see if it lands on heads (h) or tails (t), with h + t = n
  Classification: on each data set, the classifier gets h wrong and t correct out of n samples, with h + t = n
• Coin tossing: the coin has an (unknown) probability p of landing on heads
  Classification: the classifier has an (unknown, but fixed) probability p of classifying a sample incorrectly
Comparison (cont.)
• Coin tossing: the result of each toss is a random variable
  Classification: the result of each sample classification is a random variable
• Coin tossing: the number of heads in n tosses is also a random variable
  Classification: the number of incorrect labels in n samples is also a random variable
• In both cases, repeated experiments give outcomes following a binomial distribution
Binomial Distributions

P(h) = [ n! / (h!(n-h)!) ] p^h (1-p)^(n-h)
  n: sample size
  h: number of heads
  p: head probability

(figure: P(h/n) plotted against h/n)
Binomial Distribution
• Then it can be shown that

P(h) = [ n! / (h!(n-h)!) ] p^h (1-p)^(n-h)
E(h) = np
Var(h) = np(1-p)
σ_h = sqrt( np(1-p) )
Estimators
• Coin tossing: an estimator for p is (number of heads)/n
  Classification: an estimator for e (i.e., p) is e_emp
• Both estimators are unbiased, because
  E(#heads / n) = np/n = p     and     E(#errors / n) = np/n = p
• The standard deviation of either estimator is
  σ(#heads / n) = sqrt(np(1-p))/n = sqrt( p(1-p)/n )
Confidence Interval
• An N% confidence interval for some parameter p is an interval that is expected with probability N% to contain p
• For the binomial distribution, this can be approximated by a normal distribution:

e = e_emp ± c · sqrt( e_emp(1 - e_emp) / n )
  n: sample size
  c: coefficient for the N% confidence level (e.g., c = 1.96 for N = 95)
Capacity (Simplicity)
• We have just discussed the repeatability issue
  • The assumption is that the classification error is the same for training and new data
    • The misclassification rate is drawn from the same population
  • True for simple classifiers, but not for complicated classifiers
• The other issue is simplicity (or, more generally, the capacity) of the classifier
General Theoretical Bound
• The sample size
  • The larger the sample size, the more confident we should be that "we have seen enough"
• The complexity of the classifier
  • The simpler the classifier, the more confident we should be that "we have observed enough"
  • Or: a complex classifier can do weird things when you are not looking
VC Dimension
• Vapnik-Chervonenkis dimension
  • Defined for a class of functions f(α)
  • The maximum number of points that can be shattered by the function class
    • Shatter means that, given, say, n points, there are 2^n ways to label them {+1, -1}; the points are shattered if, for each labeling, an f(α) can be found that correctly assigns those labels
    • E.g., three points can be shattered by a line, but four points cannot
  • A linear function in n-dimensional space has VC dimension n+1
• A measure of the "capacity" of a classifier
Generalization
• It can be shown that

e(α) ≤ e_emp(α) + sqrt( ( h(log(2n/h) + 1) - log(η/4) ) / n )

• Basically, the expected error is bounded above, and the bound depends on
  • The empirical error (e_emp)
  • The VC dimension (h)
  • n: training sample size
  • η: the confidence parameter – the bound holds with probability 1 - η (e.g., η = 0.05 for a probability of 0.95)
Capacity Interpretation

e(α) ≤ e_emp(α) + sqrt( ( h(log(2n/h) + 1) - log(η/4) ) / n )

• A simple classifier
  • Has a low VC dimension and a small second term
  • Has a high empirical error and a large first term
• A complicated classifier
  • Has a high VC dimension and a large second term
  • Has a low empirical error and a small first term
• Some trade-off is needed to achieve the lowest right-hand side
Generalization Performance
• Hence, a classifier should be chosen to give the lowest bound
• However, the bound is often not tight; it can easily reach 1 and become useless
  • It is only useful for small VC dimensions
Upper Bound
• At a 95% confidence level with 10,000 samples, the capacity term reaches 1 once h/n > 0.37, making the bound vacuous (see the check below)
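The 0.37 figure can be reproduced by evaluating the capacity term of the bound (a sketch assuming natural logarithms, as in Vapnik's formulation, and η = 0.05):

```python
from math import log, sqrt

def capacity_term(h, n, eta=0.05):
    """Second term of the VC bound: sqrt((h(ln(2n/h) + 1) - ln(eta/4)) / n)."""
    return sqrt((h * (log(2 * n / h) + 1) - log(eta / 4)) / n)

n = 10_000
for ratio in (0.1, 0.2, 0.3, 0.37, 0.5):
    print(ratio, round(capacity_term(int(ratio * n), n), 3))
# the term climbs past 1.0 right around h/n = 0.37, where the bound stops informing
```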
Ensemble Classifiers
• Combine simple classifiers by majority vote
• Famous ones: bagging and boosting
• Why it works:
  • Might reduce error (bias)
  • Reduces variance
Reduce Bias
• If each classifier makes errors, say 30% of the time
  • How likely is a committee of n classifiers to make a mistake under majority rule?
  • Answer: it follows a binomial distribution (a sum of independent Bernoulli trials); see the sketch below
  • Big IF: the classifiers must perform independently
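A sketch of that computation (assuming independence and an odd committee size to avoid ties):

```python
from math import comb

def majority_error(eps, n):
    """P(majority of n independent classifiers err), each with error rate eps."""
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 3, 11, 21):
    print(n, round(majority_error(0.3, n), 4))
# 1 -> 0.3, 3 -> 0.216, 11 -> 0.0782, 21 -> 0.0264: the committee error
# shrinks only because of the independence assumption
```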
Improvement - Averaging
• Each machine reaches a (different) local minimum
• Combine the machines by majority vote

(scheme: same training data; different starting points; majority vote)
Improvement - Bagging
• Bootstrap AGGregation
• A simple "parallel processing" model using multiple component classifiers
• Helps with the stability problem

(scheme: different (split) training data; different starting points; majority vote)
Why Bagging Works?
• It can reduce both bias and variance
• Bias: not necessarily improved

E_D[(g(x;D) - F(x))^2] = (E_D[g(x;D) - F(x)])^2 + E_D[(g(x;D) - E_D[g(x;D)])^2]

With g = (1/n) Σ_i g_i:
E_D[g(x;D) - F(x)] = E_D[(1/n) Σ_i g_i(x;D) - F(x)]
                   = (1/n) Σ_i E_D[g_i(x;D)] - (1/n) Σ_i F(x)
                   = (1/n) Σ_i (E_D[g_i(x;D)] - F(x)) = (1/n) Σ_i {bias of constituent i}
                   = average bias of the constituents
Why Bagging Works? (cont.)
• It can reduce both bias and variance
• Variance: reduced to 1/n of the average – IF all constituents are independent

E_D[(g(x;D) - F(x))^2] = (E_D[g(x;D) - F(x)])^2 + E_D[(g(x;D) - E_D[g(x;D)])^2]

With g = (1/n) Σ_i g_i and independent constituents:
E_D[(g(x;D) - E_D[g(x;D)])^2]
  = E_D[((1/n) Σ_i g_i(x;D) - E_D[(1/n) Σ_i g_i(x;D)])^2]
  = (1/n²) Σ_i E_D[(g_i(x;D) - E_D[g_i(x;D)])^2]      (cross terms vanish by independence)
  = (1/n) × {average variance of the constituents}
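A quick numeric check of the 1/n claim (assumed setup: n unbiased, independent constituents with unit variance):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 10, 5000
constituents = rng.normal(0.0, 1.0, size=(trials, n))  # each column: one g_i, variance 1
bagged = constituents.mean(axis=1)                     # g = (1/n) * sum_i g_i
print(constituents.var(), bagged.var())                # ~1.0 versus ~1/n = 0.1
```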
Boosting by Filtering
• Bagging is a competition model, while boosting is a collaborative model
• Component classifiers
  • should be introduced when needed
  • should then be trained on ambiguous samples
• Iterative refinement of results to reduce error and ambiguity
Boosting by Filtering (cont.)

(scheme: component classifiers c1, c2, c3)
• If c1 and c2 agree, use their answer
• Otherwise, use c3
Boosting by Filtering (cont.)
• D1:
  • A subset of D (drawn without replacement) used to train c1
• D2:
  • Heads: samples from D - D1 where c1 is wrong
  • Tails: samples from D - D1 where c1 is correct
  • Half correct / half wrong for c1, so c2 is learning what c1 has difficulty with
• D3:
  • Samples from D - (D1 + D2) where c1 and c2 disagree
  • c3 is learning what the previous two cannot agree on
Boosting by Filtering (cont.)

Boosting by Filtering (cont.)
• If each committee machine has an error rate of ε, then the combined machine has an error rate of 3ε² - 2ε³
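The 3ε² - 2ε³ rate follows from the agree/arbiter rule above; here is a brute-force check that enumerates the eight error patterns (assuming independent errors):

```python
from itertools import product

def combined_error(eps):
    """Enumerate (c1, c2, c3) error patterns: use c1/c2 when they agree, else c3."""
    total = 0.0
    for e1, e2, e3 in product([0, 1], repeat=3):   # 1 = that classifier errs
        p = 1.0
        for e in (e1, e2, e3):
            p *= eps if e else (1 - eps)
        total += p * (e1 if e1 == e2 else e3)      # combined machine's error indicator
    return total

eps = 0.1
print(combined_error(eps), 3 * eps**2 - 2 * eps**3)   # both 0.028
```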
AdaBoost – Adaptive Boosting
• The basic idea is very simple
  • Add more component classifiers if the error is higher than a preset threshold
  • Samples are weighted: if samples are accurately classified by the combined component classifiers, their chance of being picked for the new classifier is reduced
  • AdaBoost focuses on difficult patterns
Adaboost Algorithm

Initialization
  D = {(x1, y1), ..., (xn, yn)}, k_max, W_1(i) = 1/n, i = 1, ..., n

Procedure
  for (k = 1; k <= k_max; k++)
    train weak learner C_k using D sampled according to W_k(i)
    E_k = prob[h_k(x_i) ≠ y_i] = Σ_{i: h_k(x_i) ≠ y_i} W_k(i)     (weighted error rate of C_k)
    α_k = (1/2) ln[(1 - E_k) / E_k]                                (weighting of C_k)
    W_{k+1}(i) = (W_k(i) / Z_k) × { e^{-α_k}  if h_k(x_i) = y_i
                                    e^{+α_k}  if h_k(x_i) ≠ y_i }
  return C_k and α_k, k = 1, ..., k_max

Final hypothesis
  H(x) = sign( Σ_{k=1}^{k_max} α_k h_k(x) )
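A minimal runnable sketch of the algorithm above (our assumptions: the weak learner C_k is an exhaustively searched decision stump, and labels are in {+1, -1}; the loop breaks early if no stump beats chance):

```python
import numpy as np

def adaboost(X, y, k_max=50):
    """AdaBoost sketch following the slide's update rule, with decision stumps."""
    n = len(y)
    W = np.full(n, 1.0 / n)                      # W_1(i) = 1/n
    stumps, alphas = [], []
    for _ in range(k_max):
        best = None
        for j in range(X.shape[1]):              # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for p in (1, -1):
                    pred = p * np.sign(X[:, j] - thr + 1e-12)
                    err = W[pred != y].sum()     # E_k: weighted error
                    if best is None or err < best[0]:
                        best = (err, j, thr, p, pred)
        E, j, thr, p, pred = best
        if E >= 0.5:                             # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1 - E) / max(E, 1e-12))   # alpha_k
        W = W * np.exp(-alpha * y * pred)        # shrink correct, grow incorrect weights
        W = W / W.sum()                          # divide by Z_k
        stumps.append((j, thr, p))
        alphas.append(alpha)

    def H(Xq):                                   # H(x) = sign(sum_k alpha_k h_k(x))
        s = np.zeros(len(Xq))
        for a, (j, thr, p) in zip(alphas, stumps):
            s += a * p * np.sign(Xq[:, j] - thr + 1e-12)
        return np.sign(s)
    return H

# Toy check: an interval concept that no single stump can represent
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = np.where((X[:, 0] > 3) & (X[:, 0] < 7), 1, -1)
H = adaboost(X, y, k_max=30)
print((H(X) == y).mean())   # training accuracy (typically ~1.0; a lone stump tops out near 0.7)
```

The 1-D interval target is chosen because a single stump cannot represent it, while a weighted sum of stumps can.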
Comparison

Multilayer Perceptron                 Boosting
• Many neurons                        • Many weak classifiers
• Trained together                    • Trained individually
• Hard to train                       • Easy to train
• Same data set                       • Fine-tuned data sets
• Requires nonlinearity in            • Requires different data sets
  thresholding
• Complicated decision boundary       • Simple decision boundaries
• Overfitting likely                  • Less susceptible to overfitting
Adaboost Algorithm (cont.)
• As long as each individual weak learner performs better than chance, AdaBoost can boost the performance arbitrarily well (Freund and Schapire, JCSS 1997)
Applications
• AdaBoost has been used successfully in many applications
• A famous one is its use for face detection in images
Viola and Jones

Fast Feature Computation

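The fast computation rests on the integral image (summed-area table) from the Viola-Jones paper: once it is built, any rectangle sum costs four lookups. A minimal sketch:

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img[:r, :c] (exclusive indexing via zero padding)."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(0).cumsum(1)
    return ii

def rect_sum(ii, r, c, h, w):
    """Sum of the h x w rectangle with top-left corner (r, c), in O(1)."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2), img[1:3, 1:3].sum())   # both 30
```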
Classifier
• AdaBoost idea – greedy feature (classifier) selection
• Weak learner (h) – selects a single rectangular feature
  • f: feature value
  • θ: threshold
  • p: polarity
  • x: 24x24 pixel sub-window
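The weak learner itself is just a threshold test; per the Viola-Jones paper, h(x) = 1 if p·f(x) < p·θ, else 0. A tiny sketch (the feature value `f_val` below would come from rectangle sums like the ones above):

```python
def stump(f_val, theta, p):
    """Viola-Jones weak classifier: h(x) = 1 if p * f(x) < p * theta else 0."""
    return 1 if p * f_val < p * theta else 0

print(stump(f_val=42.0, theta=50.0, p=1))    # 1: feature value below threshold
print(stump(f_val=42.0, theta=50.0, p=-1))   # 0: polarity flips the inequality
```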
Algorithm

Efficiency
• Use an attention cascade: early stages are cheap and reject most negative sub-windows quickly, so the more expensive later stages only see the few remaining candidates
Results

(figures: face detection results)
Learning with Queries
• Given a weak classifier or several weak component classifiers
• Find out where the ambiguity is
  • Where the weak classifier gives high readings for the top two discriminant functions (e.g., in a linear machine)
  • Where the component classifiers yield the greatest disagreement
• Train the classifiers with those ambiguous samples
Learning with Queries (cont.)

(figure: samples generated with queries)