Lecture 7 Classification
Review/Practice:
What is the standard error of…?
And what shape is the sampling distribution?
A mean?
A difference in means?
A proportion?
A difference in proportions?
An odds ratio?
The ln(odds ratio)?
A beta coefficient from simple linear regression?
A beta coefficient from logistic regression?
Where do these formulas for standard error come from?
– Mathematical theory, such as the central limit theorem.
– Maximum likelihood estimation theory (the standard error is related to the second derivative of the likelihood; assumes a sufficiently large sample).
– In recent decades, computer simulation…
Computer simulation of the sampling distribution of the sample mean:
1. Pick any probability distribution and specify a mean and standard deviation.
2. Tell the computer to randomly generate 1000 observations from that probability distribution (e.g., the computer is more likely to spit out values with high probabilities).
3. Plot the “observed” values in a histogram.
4. Next, tell the computer to randomly generate 1000 averages-of-2 (randomly pick 2 and take their average) from that probability distribution. Plot the “observed” averages in a histogram.
5. Repeat for averages-of-5 and averages-of-100 (matching the figures below; a code sketch of this simulation follows this list).
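A minimal Python/NumPy sketch of steps 1–5 (the uniform distribution, seed, and histogram settings here are illustrative choices, not from the original slides):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)  # fixed seed so the plots are reproducible

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for ax, k in zip(axes, [1, 2, 5, 100]):
    # Steps 2/4/5: 1000 "observed" averages, each the mean of k uniform draws
    averages = rng.uniform(0, 1, size=(1000, k)).mean(axis=1)
    ax.hist(averages, bins=30)           # step 3: plot in a histogram
    ax.set_title(f"1000 averages of {k}")
plt.show()
```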
[Figures: histograms of 1000 simulated values from Uniform[0,1], ~Exp(1), and ~Bin(40, .05). For each distribution, panels show the original distribution (averages of 1) and 1000 averages of 2, of 5, and of 100; the histograms of averages grow narrower and more bell-shaped as n increases.]
The Central Limit Theorem:
If all possible random samples, each of size n, are taken from any population with mean $\mu$ and standard deviation $\sigma$, the sampling distribution of the sample means (averages) will:
1. have mean: $\mu_{\bar{x}} = \mu$
2. have standard deviation: $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
3. be approximately normally distributed regardless of the shape of the parent population (normality improves with larger n)
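As a quick numeric check of points 1 and 2 (an added sketch, not from the original slides): for Uniform[0,1], $\mu = 0.5$ and $\sigma = 1/\sqrt{12} \approx 0.289$, so averages of n = 100 should have standard deviation $\approx 0.029$.

```python
import numpy as np

rng = np.random.default_rng(0)
means = rng.uniform(0, 1, size=(10_000, 100)).mean(axis=1)  # 10,000 averages of n=100
print(means.mean())       # ~0.5   = mu
print(means.std(ddof=1))  # ~0.029 = sigma/sqrt(n) = (1/sqrt(12))/10
```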
Mathematical Proof
If X is a random variable from any distribution with known mean, E(x), and variance, Var(x), then the expected value and variance of the average of n observations of X are:

$$E(\bar{X}_n) = E\left(\frac{\sum_{i=1}^{n} x_i}{n}\right) = \frac{\sum_{i=1}^{n} E(x)}{n} = \frac{nE(x)}{n} = E(x)$$

$$Var(\bar{X}_n) = Var\left(\frac{\sum_{i=1}^{n} x_i}{n}\right) = \frac{\sum_{i=1}^{n} Var(x)}{n^2} = \frac{Var(x)}{n}$$

Standard deviation = $\sqrt{\dfrac{1}{a}+\dfrac{1}{b}+\dfrac{1}{c}+\dfrac{1}{d}}$ — this is the standard error of the ln(odds ratio) from a 2×2 table with cells a, b, c, d (one of the formulas reviewed above).
The Bootstrap standard error
Described by Bradley Efron (Stanford) in 1979.
Allows you to calculate standard errors when no formulas are available.
Allows you to calculate standard errors when assumptions are not met (e.g., large sample, normality).
Why Bootstrap?
The bootstrap uses computer simulation.
But, unlike the simulations I showed you
previously that drew observations from a
hypothetical world, the bootstrap:
– draws observations only from your own sample
(not a hypothetical world)
– makes no assumptions about the underlying
distribution in the population.
Bootstrap re-sampling…getting
something for nothing!
The standard error is the amount of
variability in the statistic if you could take
repeated samples of size n.
How do you take repeated samples of size n
from n observations??
Here’s the trick: sampling with replacement!
Sampling with replacement
Sampling with replacement means every
observation has an equal chance of being
selected (=1/n), and observations can be
selected more than once.
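In Python, sampling with replacement is a single standard-library call (a minimal sketch; the letters stand in for observations):

```python
import random

original = ["A", "B", "C", "D", "E", "F"]             # original sample, n = 6
resample = random.choices(original, k=len(original))  # with replacement: each draw picks any observation with prob 1/n
print(resample)  # e.g. ['C', 'A', 'F', 'C', 'E', 'B'] -- repeats allowed, some values left out
```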
Sampling with replacement
[Diagram: an original sample of n=6 observations (A, B, C, D, E, F) is re-sampled with replacement, producing possible new samples of size 6 in which some observations appear more than once and others not at all.]
**What’s the probability of each of these particular samples, discounting order?
Bootstrap Procedure
1. Number your observations 1,2,3,…n
2. Draw a random sample of size n WITH
REPLACEMENT.
3. Calculate your statistic (mean, beta coefficient, ratio,
etc.) with these data.
4. Repeat steps 1-3 many times (e.g., 500 times).
5. Calculate the variance of your statistic directly from
your sample of 500 statistics.
6. You can also calculate confidence intervals directly
from your sample of 500 statistics. Where do 95% of
statistics fall?
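A minimal NumPy sketch of steps 1–6, bootstrapping the mean (the exponential data and seed are placeholder choices; the statistic could be any function of the data):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=50)  # stand-in for your n observations

n_boot = 500                                # step 4: many repetitions
stats = np.empty(n_boot)
for b in range(n_boot):
    resample = rng.choice(data, size=len(data), replace=True)  # step 2: draw n WITH REPLACEMENT
    stats[b] = resample.mean()                                 # step 3: calculate the statistic

se = stats.std(ddof=1)                      # step 5: bootstrap standard error
lo, hi = np.percentile(stats, [2.5, 97.5])  # step 6: where do 95% of statistics fall?
print(f"bootstrap SE = {se:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```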
When is bootstrap used?
If you have a new-fangled statistic without a known
formula for standard error.
– e.g., a male:female ratio.
If you are not sure if large sample assumptions are
met.
– Maximum likelihood estimation assumes “large enough”
sample.
If you are not sure if normality assumptions are met.
– Bootstrap makes no assumptions about the distribution of
the variables in the underlying population.
Bootstrap example:
Hypothetical data from a case-control study…
            Case   Control
Exposed      17       2
Unexposed     7      22
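As an added worked illustration (not from the original slides), the formula-based approach applies the SE(ln OR) formula from earlier to this table; the bootstrap would instead resample the 48 subjects with replacement and recompute the OR each time.

```python
import math

a, b, c, d = 17, 2, 7, 22                 # cells of the 2x2 table above

or_hat = (a * d) / (b * c)                # odds ratio = ad/bc ~= 26.7
se_ln = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of ln(OR) ~= 0.864

lo = math.exp(math.log(or_hat) - 1.96 * se_ln)
hi = math.exp(math.log(or_hat) + 1.96 * se_ln)
print(f"OR = {or_hat:.1f}, 95% CI = ({lo:.1f}, {hi:.1f})")  # OR = 26.7, CI roughly (4.9, 145)
```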
Predictors                                          OR    95% CI
Smoking history*                                    7.9   2.6–23.6
Age, per 10-yr increment                            2.2   1.7–2.8
Nodule diameter, per 1-mm increment                 1.1   1.1–1.2
Time since quitting smoking, per 10-yr increment    0.6   0.4–0.7
* Ever vs. never.
AREA UNDER THE CURVE (AUC) is a measure of the accuracy of your model.
Results
The authors found an AUC of 0.79 (95%
CI: 0.74 to 0.84), which can be interpreted
as follows:
– If the model has no predictive power, you have
a 50-50 chance of correctly classifying a person
with SPN.
– Instead, here, the model has a 79% chance of
correct classification (quite an improvement
over 50%).
A role for 10-fold cross-validation
If we were to apply this logistic regression
model to a new dataset, the AUC will be
smaller, and may be considerably smaller
(because of over-fitting).
Since we don’t have extra data lying
around, we can use 10-fold cross-validation
to get a better estimate of the AUC…
10-fold cross-validation
1. Divide the 375 people randomly into 10 sets of 37 or 38.
2. Fit the logistic regression model to the remaining ~337 people (nine-tenths of the data).
3. Using the resulting model, calculate predicted probabilities for the held-out test set (n≈38). Save these predicted probabilities.
4. Repeat steps 2 and 3, holding out a different tenth of the data each time.
5. Build the ROC curve and calculate the AUC using the predicted probabilities generated in step 3.
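The lab implements this in SAS; for comparison, here is a minimal sketch of the same five steps in Python with scikit-learn (X and y below are random placeholders standing in for the study's predictors and case/control labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Placeholder data: X would be the predictor matrix for the 375 people,
# y the 0/1 case/control outcome.
rng = np.random.default_rng(2)
X = rng.normal(size=(375, 4))
y = rng.integers(0, 2, size=375)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # steps 1 & 4
# Steps 2-3: for each fold, fit on nine-tenths of the data and save
# predicted probabilities for the held-out tenth.
probs = cross_val_predict(LogisticRegression(), X, y, cv=cv,
                          method="predict_proba")[:, 1]
print("cross-validated AUC:", roc_auc_score(y, probs))  # step 5
```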
Results…
After cross-validation, the AUC was 0.78 (95% CI: 0.73 to 0.83).
The small drop from 0.79 shows that the model is robust (little over-fitting).
We will implement 10-fold cross-validation
in the lab on Wednesday (takes a little
programming in SAS)…