Lecture 7 Classification
Review/Practice:
What is the standard error of…?
And what shape is the sampling distribution?
A mean?
A difference in means?
A proportion?
A difference in proportions?
An odds ratio?
The ln(odds ratio)?
A beta coefficient from simple linear regression?
A beta coefficient from logistic regression?
Where do these formulas for standard error come from?
– Mathematical theory, such as the central limit theorem.
– Maximum likelihood estimation theory (the standard error is related to the second derivative of the likelihood; assumes a sufficiently large sample).
– In recent decades, computer simulation…
Computer simulation of the sampling distribution of the sample mean:
1. Pick any probability distribution and specify a mean and standard deviation.
2. Tell the computer to randomly generate 1000 observations from that probability distribution (e.g., the computer is more likely to spit out values with high probabilities).
3. Plot the “observed” values in a histogram.
4. Next, tell the computer to randomly generate 1000 averages-of-2 (randomly pick 2 and take their average) from that probability distribution. Plot the “observed” averages in a histogram.
5. Repeat for averages-of-5 and averages-of-100 (matching the figures below; a code sketch of this simulation follows this list).
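A minimal Python/NumPy sketch of steps 1–5 (the uniform distribution, seed, and histogram settings here are illustrative choices, not from the original slides):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)  # fixed seed so the plots are reproducible

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for ax, k in zip(axes, [1, 2, 5, 100]):
    # Steps 2/4/5: 1000 "observed" averages, each the mean of k uniform draws
    averages = rng.uniform(0, 1, size=(1000, k)).mean(axis=1)
    ax.hist(averages, bins=30)           # step 3: plot in a histogram
    ax.set_title(f"1000 averages of {k}")
plt.show()
```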
[Figures: histograms of 1000 simulated values from Uniform[0,1], ~Exp(1), and ~Bin(40, .05). For each distribution, panels show the original distribution (averages of 1) and 1000 averages of 2, of 5, and of 100; the histograms of averages grow narrower and more bell-shaped as n increases.]
The Central Limit Theorem:
If all possible random samples, each of size n, are taken from any population with mean $\mu$ and standard deviation $\sigma$, the sampling distribution of the sample means (averages) will:
1. have mean: $\mu_{\bar{x}} = \mu$
2. have standard deviation: $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
3. be approximately normally distributed regardless of the shape of the parent population (normality improves with larger n)
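As a quick numeric check of points 1 and 2 (an added sketch, not from the original slides): for Uniform[0,1], $\mu = 0.5$ and $\sigma = 1/\sqrt{12} \approx 0.289$, so averages of n = 100 should have standard deviation $\approx 0.029$.

```python
import numpy as np

rng = np.random.default_rng(0)
means = rng.uniform(0, 1, size=(10_000, 100)).mean(axis=1)  # 10,000 averages of n=100
print(means.mean())       # ~0.5   = mu
print(means.std(ddof=1))  # ~0.029 = sigma/sqrt(n) = (1/sqrt(12))/10
```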
Mathematical Proof
If X is a random variable from any distribution with known mean, E(x), and variance, Var(x), then the expected value and variance of the average of n observations of X are:

$$E(\bar{X}_n) = E\left(\frac{\sum_{i=1}^{n} x_i}{n}\right) = \frac{\sum_{i=1}^{n} E(x)}{n} = \frac{nE(x)}{n} = E(x)$$

$$Var(\bar{X}_n) = Var\left(\frac{\sum_{i=1}^{n} x_i}{n}\right) = \frac{\sum_{i=1}^{n} Var(x)}{n^2} = \frac{Var(x)}{n}$$

Standard deviation = $\sqrt{\dfrac{1}{a}+\dfrac{1}{b}+\dfrac{1}{c}+\dfrac{1}{d}}$ — this is the standard error of the ln(odds ratio) from a 2×2 table with cells a, b, c, d (one of the formulas reviewed above).
The Bootstrap standard error
Described by Bradley Efron (Stanford) in 1979.
Allows you to calculate standard errors when no formulas are available.
Allows you to calculate standard errors when assumptions are not met (e.g., large sample, normality).
Why Bootstrap?
The bootstrap uses computer simulation.
But, unlike the simulations I showed you
previously that drew observations from a
hypothetical world, the bootstrap:
– draws observations only from your own sample
(not a hypothetical world)
– makes no assumptions about the underlying
distribution in the population.
Bootstrap re-sampling…getting
something for nothing!
The standard error is the amount of
variability in the statistic if you could take
repeated samples of size n.
How do you take repeated samples of size n
from n observations??
Here’s the trick: sampling with replacement!
Sampling with replacement
Sampling with replacement means every
observation has an equal chance of being
selected (=1/n), and observations can be
selected more than once.
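In Python, sampling with replacement is a single standard-library call (a minimal sketch; the letters stand in for observations):

```python
import random

original = ["A", "B", "C", "D", "E", "F"]             # original sample, n = 6
resample = random.choices(original, k=len(original))  # with replacement: each draw picks any observation with prob 1/n
print(resample)  # e.g. ['C', 'A', 'F', 'C', 'E', 'B'] -- repeats allowed, some values left out
```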
Sampling with replacement
[Diagram: an original sample of n=6 observations (A, B, C, D, E, F) is re-sampled with replacement, producing possible new samples of size 6 in which some observations appear more than once and others not at all.]
**What’s the probability of each of these particular samples, discounting order?
Bootstrap Procedure
1. Number your observations 1,2,3,…n
2. Draw a random sample of size n WITH
REPLACEMENT.
3. Calculate your statistic (mean, beta coefficient, ratio,
etc.) with these data.
4. Repeat steps 1-3 many times (e.g., 500 times).
5. Calculate the variance of your statistic directly from
your sample of 500 statistics.
6. You can also calculate confidence intervals directly
from your sample of 500 statistics. Where do 95% of
statistics fall?
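A minimal NumPy sketch of steps 1–6, bootstrapping the mean (the exponential data and seed are placeholder choices; the statistic could be any function of the data):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=50)  # stand-in for your n observations

n_boot = 500                                # step 4: many repetitions
stats = np.empty(n_boot)
for b in range(n_boot):
    resample = rng.choice(data, size=len(data), replace=True)  # step 2: draw n WITH REPLACEMENT
    stats[b] = resample.mean()                                 # step 3: calculate the statistic

se = stats.std(ddof=1)                      # step 5: bootstrap standard error
lo, hi = np.percentile(stats, [2.5, 97.5])  # step 6: where do 95% of statistics fall?
print(f"bootstrap SE = {se:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```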
When is bootstrap used?
If you have a new-fangled statistic without a known
formula for standard error.
– e.g., a male:female ratio.
If you are not sure if large sample assumptions are
met.
– Maximum likelihood estimation assumes “large enough”
sample.
If you are not sure if normality assumptions are met.
– Bootstrap makes no assumptions about the distribution of
the variables in the underlying population.
Bootstrap example:
Hypothetical data from a case-control study…
            Case   Control
Exposed      17       2
Unexposed     7      22
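As an added worked illustration (not from the original slides), the formula-based approach applies the SE(ln OR) formula from earlier to this table; the bootstrap would instead resample the 48 subjects with replacement and recompute the OR each time.

```python
import math

a, b, c, d = 17, 2, 7, 22                 # cells of the 2x2 table above

or_hat = (a * d) / (b * c)                # odds ratio = ad/bc ~= 26.7
se_ln = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of ln(OR) ~= 0.864

lo = math.exp(math.log(or_hat) - 1.96 * se_ln)
hi = math.exp(math.log(or_hat) + 1.96 * se_ln)
print(f"OR = {or_hat:.1f}, 95% CI = ({lo:.1f}, {hi:.1f})")  # OR = 26.7, CI roughly (4.9, 145)
```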
Predictors                                          OR    95% CI
Smoking history*                                    7.9   2.6–23.6
Age, per 10-yr increment                            2.2   1.7–2.8
Nodule diameter, per 1-mm increment                 1.1   1.1–1.2
Time since quitting smoking, per 10-yr increment    0.6   0.4–0.7
* Ever vs. never.
AREA UNDER THE CURVE (AUC) is a measure of the accuracy of your model.
Results
The authors found an AUC of 0.79 (95%
CI: 0.74 to 0.84), which can be interpreted
as follows:
– If the model has no predictive power, you have
a 50-50 chance of correctly classifying a person
with SPN.
– Instead, here, the model has a 79% chance of
correct classification (quite an improvement
over 50%).
A role for 10-fold cross-validation
If we were to apply this logistic regression
model to a new dataset, the AUC will be
smaller, and may be considerably smaller
(because of over-fitting).
Since we don’t have extra data lying
around, we can use 10-fold cross-validation
to get a better estimate of the AUC…
10-fold cross-validation
1. Divide the 375 people randomly into 10 sets of 37 or 38.
2. Fit the logistic regression model to the remaining ~337 people (nine-tenths of the data).
3. Using the resulting model, calculate predicted probabilities for the held-out test set (n≈38). Save these predicted probabilities.
4. Repeat steps 2 and 3, holding out a different tenth of the data each time.
5. Build the ROC curve and calculate the AUC using the predicted probabilities generated in step 3.
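The lab implements this in SAS; for comparison, here is a minimal sketch of the same five steps in Python with scikit-learn (X and y below are random placeholders standing in for the study's predictors and case/control labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Placeholder data: X would be the predictor matrix for the 375 people,
# y the 0/1 case/control outcome.
rng = np.random.default_rng(2)
X = rng.normal(size=(375, 4))
y = rng.integers(0, 2, size=375)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # steps 1 & 4
# Steps 2-3: for each fold, fit on nine-tenths of the data and save
# predicted probabilities for the held-out tenth.
probs = cross_val_predict(LogisticRegression(), X, y, cv=cv,
                          method="predict_proba")[:, 1]
print("cross-validated AUC:", roc_auc_score(y, probs))  # step 5
```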
Results…
After cross-validation, the AUC was 0.78 (95% CI: 0.73 to 0.83).
The small drop from 0.79 shows that the model is robust (little over-fitting).
We will implement 10-fold cross-validation
in the lab on Wednesday (takes a little
programming in SAS)…