
Machine Learning

Chapter 5. Evaluating Hypotheses

Tom M. Mitchell
Evaluating Hypotheses
• Sample error, true error
• Confidence intervals for observed hypothesis error
• Estimators
• Binomial distribution, Normal distribution, Central Limit Theorem
• Paired t tests
• Comparing learning methods

2
Two Definitions of Error
• The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

  errorD(h) ≡ Pr_{x ∈ D}[f(x) ≠ h(x)]

• The sample error of h with respect to target function f and data sample S is the proportion of examples h misclassifies:

  errorS(h) ≡ (1/n) Σ_{x ∈ S} δ(f(x) ≠ h(x))

  where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.

• How well does errorS(h) estimate errorD(h)?
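A minimal sketch of the sample-error computation in Python (the names h, f, and S are illustrative; the hypothesis h and target function f are assumed to be callables, and S a list of instances):

```python
def sample_error(h, f, S):
    """Fraction of instances in sample S that hypothesis h misclassifies."""
    return sum(1 for x in S if h(x) != f(x)) / len(S)
```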
3
Problems Estimating Error
1. Bias: If S is the training set, errorS(h) is optimistically biased:

   bias ≡ E[errorS(h)] − errorD(h)

   For an unbiased estimate, h and S must be chosen independently.

2. Variance: Even with an unbiased S, errorS(h) may still vary from errorD(h).

4
Example
• Hypothesis h misclassifies 12 of the 40 examples in S:

  errorS(h) = 12/40 = .30

• What is errorD(h)?

5
Estimators
• Experiment:
  1. choose sample S of size n according to distribution D
  2. measure errorS(h)
• errorS(h) is a random variable (i.e., the result of an experiment)
• errorS(h) is an unbiased estimator for errorD(h)
• Given observed errorS(h), what can we conclude about errorD(h)?
6
Confidence Intervals
• If
  – S contains n examples, drawn independently of h and of each other
  – n ≥ 30
• Then, with approximately N% probability, errorD(h) lies in the interval

  errorS(h) ± zN √(errorS(h)(1 − errorS(h)) / n)

  where

  N%   50%    68%    80%    90%    95%    98%    99%
  zN   0.67   1.00   1.28   1.64   1.96   2.33   2.58
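A small sketch of this interval in code, applied to the earlier example (n = 40, errorS(h) = .30); zN = 1.96 gives the 95% interval:

```python
import math

def error_confidence_interval(error_s, n, z_n=1.96):
    """N% confidence interval for errorD(h) given observed errorS(h)."""
    half_width = z_n * math.sqrt(error_s * (1 - error_s) / n)
    return (error_s - half_width, error_s + half_width)

# 95% interval for the example: 12 of 40 examples misclassified
print(error_confidence_interval(0.30, 40))  # approx (0.158, 0.442)
```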
7
errorS(h) is a Random Variable
• Rerun the experiment with a different randomly drawn S (of size n)
• Probability of observing r misclassified examples:

  P(r) = (n! / (r!(n − r)!)) errorD(h)^r (1 − errorD(h))^(n−r)
8
Binomial Probability Distribution
• Probability P(r) of r heads in n coin flips, if p = Pr(heads):

  P(r) = (n! / (r!(n − r)!)) p^r (1 − p)^(n−r)

• Expected, or mean value of X, E[X], is E[X] = np
• Variance of X is Var(X) = np(1 − p)
• Standard deviation of X, σX, is σX = √(np(1 − p))

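A sketch of this distribution in code, using only the standard library (math.comb computes the binomial coefficient n!/(r!(n − r)!)):

```python
import math

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

# e.g., probability of observing exactly 12 misclassifications in 40 draws
# when the true error rate is 0.30
print(binomial_pmf(12, 40, 0.30))  # approx 0.137
```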
9
Normal Distribution Approximates Binomial
• errorS(h) follows a Binomial distribution, with
  – mean μ_errorS(h) = errorD(h)
  – standard deviation σ_errorS(h) = √(errorD(h)(1 − errorD(h)) / n)
• Approximate this by a Normal distribution with
  – mean μ_errorS(h) = errorD(h)
  – standard deviation σ_errorS(h) ≈ √(errorS(h)(1 − errorS(h)) / n)

10
Normal Probability Distribution (1/2)

  p(x) = (1 / √(2πσ²)) e^(−(x − μ)² / (2σ²))

• The probability that X will fall into the interval (a, b) is given by ∫ab p(x) dx
• Expected, or mean value of X, E[X], is E[X] = μ
• Variance of X is Var(X) = σ²
• Standard deviation of X, σX, is σX = σ

11
Normal Probability Distribution (2/2)
• 80% of area (probability) lies in μ ± 1.28σ
• N% of area (probability) lies in μ ± zN σ

  N%   50%    68%    80%    90%    95%    98%    99%
  zN   0.67   1.00   1.28   1.64   1.96   2.33   2.58
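The zN entries are two-sided quantiles of the standard Normal; a quick sketch reproducing them (assumes SciPy is installed):

```python
from scipy.stats import norm

# Two-sided z-value: N% of the probability mass lies within mu +/- z*sigma,
# so the upper tail above mu + z*sigma holds (100 - N)/2 percent.
for n_pct in (50, 68, 80, 90, 95, 98, 99):
    z = norm.ppf(0.5 + n_pct / 200)
    print(f"{n_pct}%: z = {z:.2f}")
# Note: the 68% row prints 0.99; the table's 1.00 corresponds to the
# exact 68.27% of mass that lies within one standard deviation.
```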
12
Confidence Intervals, More Correctly
• If
  – S contains n examples, drawn independently of h and of each other
  – n ≥ 30
• Then, with approximately 95% probability, errorS(h) lies in the interval

  errorD(h) ± 1.96 √(errorD(h)(1 − errorD(h)) / n)

  equivalently, errorD(h) lies in the interval

  errorS(h) ± 1.96 √(errorD(h)(1 − errorD(h)) / n)

  which is approximately

  errorS(h) ± 1.96 √(errorS(h)(1 − errorS(h)) / n)
13
Central Limit Theorem
• Consider a set of independent, identically distributed random variables Y1 . . . Yn, all governed by an arbitrary probability distribution with mean μ and finite variance σ². Define the sample mean

  Ȳ ≡ (1/n) Σ_{i=1}^{n} Yi

• Central Limit Theorem. As n → ∞, the distribution governing Ȳ approaches a Normal distribution, with mean μ and variance σ²/n.
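A small simulation sketch of the theorem: sample means of a decidedly non-Normal (uniform) distribution cluster around μ, with variance shrinking as σ²/n:

```python
import random
import statistics

n, trials = 50, 10_000
# Uniform(0, 1) has mean 0.5 and variance 1/12
sample_means = [statistics.mean(random.random() for _ in range(n))
                for _ in range(trials)]

print(statistics.mean(sample_means))      # approx 0.5 (= mu)
print(statistics.variance(sample_means))  # approx (1/12)/50 ~ 0.00167
```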

14
Calculating Confidence Intervals
1. Pick parameter p to estimate
   – errorD(h)
2. Choose an estimator
   – errorS(h)
3. Determine probability distribution that governs estimator
   – errorS(h) governed by Binomial distribution, approximated by Normal when n ≥ 30
4. Find interval (L, U) such that N% of probability mass falls in the interval
   – Use table of zN values

15
Difference Between Hypotheses
Test h1 on sample S1, test h2 on S2
1. Pick parameter to estimate

   d ≡ errorD(h1) − errorD(h2)

2. Choose an estimator

   d̂ ≡ errorS1(h1) − errorS2(h2)

3. Determine probability distribution that governs estimator

   σ_d̂ ≈ √(errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2)

4. Find interval (L, U) such that N% of probability mass falls in the interval:

   d̂ ± zN σ_d̂
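A sketch of this interval in code, where n1 and n2 are the sizes of S1 and S2 (variable names and the example figures are illustrative):

```python
import math

def error_difference_interval(e1, n1, e2, n2, z_n=1.96):
    """N% confidence interval for d = errorD(h1) - errorD(h2)."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return (d_hat - z_n * sigma, d_hat + z_n * sigma)

# e.g., h1 errs on 30% of 100 test examples, h2 on 20% of another 100
print(error_difference_interval(0.30, 100, 0.20, 100))  # approx (-0.02, 0.22)
```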

16
Paired t test to compare hA, hB
1. Partition data into k disjoint test sets T1, T2, . . ., Tk of equal size, where this size is at least 30.
2. For i from 1 to k, do
   δi ← errorTi(hA) − errorTi(hB)
3. Return the value δ̄, where

   δ̄ ≡ (1/k) Σ_{i=1}^{k} δi

N% confidence interval estimate for d:

   δ̄ ± t_{N,k−1} s_δ̄,   where   s_δ̄ ≡ √( (1/(k(k−1))) Σ_{i=1}^{k} (δi − δ̄)² )

Note δi approximately Normally distributed
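A sketch of the interval computation, given the per-fold differences δi as a list (assumes SciPy for the t quantile; the example δi values are made up):

```python
import math
import statistics
from scipy.stats import t

def paired_t_interval(deltas, confidence=0.95):
    """Confidence interval for the mean error difference d."""
    k = len(deltas)
    delta_bar = statistics.mean(deltas)
    # s_delta_bar: estimated standard deviation of the sample mean
    s = math.sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))
    t_val = t.ppf(0.5 + confidence / 2, df=k - 1)  # two-sided quantile
    return (delta_bar - t_val * s, delta_bar + t_val * s)

print(paired_t_interval([0.03, 0.05, -0.01, 0.02, 0.04]))
```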


17
Comparing learning algorithms LA and LB (1/3)
What we'd like to estimate:

  E_{S⊂D}[errorD(LA(S)) − errorD(LB(S))]

where L(S) is the hypothesis output by learner L using training set S,
i.e., the expected difference in true error between hypotheses output by
learners LA and LB, when trained using randomly selected training sets
S drawn according to distribution D.

But, given limited data D0, what is a good estimator?
– could partition D0 into training set S0 and test set T0, and measure

  errorT0(LA(S0)) − errorT0(LB(S0))

– even better, repeat this many times and average the results (next slide)
18
Comparing learning algorithms LA and LB (2/3)
1. Partition data D0 into k disjoint test sets T1, T2, . . ., Tk of equal size, where this size is at least 30.
2. For i from 1 to k, do
   use Ti for the test set, and the remaining data for training set Si
   – Si ← {D0 − Ti}
   – hA ← LA(Si)
   – hB ← LB(Si)
   – δi ← errorTi(hA) − errorTi(hB)
3. Return the value δ̄, where

   δ̄ ≡ (1/k) Σ_{i=1}^{k} δi

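A minimal sketch of this procedure, assuming learn_a and learn_b each take a training list and return a hypothesis callable, f is the target function, and sample_error is the function sketched earlier (all names illustrative):

```python
import statistics

def compare_learners(learn_a, learn_b, f, data, k):
    """Mean per-fold difference in test error between two learners."""
    fold = len(data) // k
    deltas = []
    for i in range(k):
        test = data[i * fold:(i + 1) * fold]             # Ti
        train = data[:i * fold] + data[(i + 1) * fold:]  # Si = D0 - Ti
        h_a, h_b = learn_a(train), learn_b(train)
        deltas.append(sample_error(h_a, f, test) - sample_error(h_b, f, test))
    return statistics.mean(deltas)  # delta-bar
```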
19
Comparing learning algorithms LA and LB (3/3)
Notice we'd like to use the paired t test on δ̄ to obtain a confidence interval
but this is not really correct, because the training sets in this algorithm are not independent (they overlap!)
more correct to view the algorithm as producing an estimate of

  E_{S⊂D0}[errorD(LA(S)) − errorD(LB(S))]

instead of

  E_{S⊂D}[errorD(LA(S)) − errorD(LB(S))]

but even this approximation is better than no comparison
20
