A Practical Guide To Bootstrap in R
It allows us to estimate the sampling distribution of a statistic even from a single sample.
What is bootstrap?
Bootstrap is a resampling method in which a large number of samples of the same size are repeatedly drawn,
with replacement, from a single original sample.
In plain English: normally, it is not possible to infer a population parameter exactly from a single sample,
or even a finite number of samples. The uncertainty originates from sampling variability: one sample yields
one value, while another sample yields a different value.
How can we quantify this variability and approximate the population parameter as closely as possible?
By repeatedly sampling with replacement, the bootstrap builds an empirical distribution of the statistic
that approximates its sampling distribution, which makes statistical inference (e.g., constructing a
confidence interval) possible.
These procedures may seem a little daunting, but fortunately we don't have to run the calculations by
hand. Modern programming languages (e.g., R or Python) handle the dirty work for us.
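To make the procedure concrete, here is a minimal hand-rolled sketch of bootstrapping a sample mean in base R (toy data, not from the article):

```r
# Bootstrap the mean of a single sample using only base R (toy data)
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)   # the one original sample we have

R <- 10000                          # number of bootstrap resamples
# Each resample draws n observations from x WITH replacement,
# and we record the mean of each resample.
boot_means <- replicate(R, mean(sample(x, size = length(x), replace = TRUE)))

mean(boot_means)                       # center of the bootstrap distribution
sd(boot_means)                         # bootstrap standard error of the mean
quantile(boot_means, c(0.025, 0.975))  # simple percentile 95% CI
```

The same pattern works for any statistic; the boot package used later in the article just automates this loop.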
Why bootstrap?
Before answering the why, let’s look at some common inference challenges that we face as data scientists.
1. After A/B tests, to what extent can we trust that a small sample (say, 100 observations) represents the
true population?
2. If we sample repeatedly, will the estimate of interest vary? If it does, what does its distribution look
like?
3. Is it possible to make valid inferences when the distribution of the population is too complicated or
unknown?
Making valid statistical inferences is a major part of a data scientist's daily routine. However, valid
inference rests on strict assumptions and prerequisites, which may not hold or may remain unknown.
Ideally, we would collect data on the entire population and spare ourselves the trouble of statistical
inference. Obviously, this is an expensive and not-so-recommended approach, considering the time and
monetary expense.
For example, it is not feasible to survey the entire American population for their political views, but
surveying a small portion of it, say 10,000 Americans, is doable.
The challenge with this approach is that we may get slightly different results each time we collect a sample.
Theoretically, the standard deviation of a point estimate could be considerably large across repeated
samplings from the population, which makes any single estimate unreliable.
As a non-parametric estimation method, the bootstrap comes in handy: it quantifies the uncertainty of an
estimate via its standard error.
When?
1. The distribution of the population is complicated or unknown.
Reason: bootstrap is a non-parametric approach and does not require a specific distribution.
2. Collecting additional samples is expensive or infeasible.
Reason: bootstrap is a resampling method with replacement and can re-create any number of
resamples if needed.
3. You need a pilot study to test the waters before pouring all of your resources in the wrong direction.
Reason: related to point #2, bootstrap can approximate the sampling distribution, so we can check
the variance of the estimate.
R Code
Since sports are back and DraftKings has been trending recently, I'll use the NBA post-season Game 6
between the Celtics and the Heat (GitHub).
We are interested in the correlation between two variables: the percentage that players have been
drafted (Drafted) and the Fantasy Points (FPTS) scored.
Calculating the correlation for a single game is simple but of limited practical value. We have a general
impression that top players (e.g., Jayson Tatum, Jimmy Butler) are among the most drafted and also score
the most points in a game.
What potentially could help pick players in the next game is:
If we draw repeated samples 10,000 times, what is the range of the correlation between these two
variables?
The above function has two arguments: data and i. The first argument, data, is the dataset to be used,
and i is a vector of indices indicating which rows of the dataset are picked to create a bootstrap sample.
Remember to set a seed to get reproducible results. The boot() function takes three main arguments: data
(the dataset), statistic (the function to apply), and R (the number of bootstrap samples).
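The article's own code listing is not shown here, but based on the description above, the statistic function and the boot() call might look like the following sketch (game6 is a placeholder data frame standing in for the Game 6 dataset on the author's GitHub):

```r
library(boot)  # provides boot() and boot.ci()

# Placeholder data standing in for the article's Game 6 dataset
# (the real Drafted/FPTS numbers are on the author's GitHub)
game6 <- data.frame(
  Drafted = c(0.96, 0.91, 0.83, 0.74, 0.62, 0.48, 0.35, 0.22, 0.14, 0.08),
  FPTS    = c(54.2, 49.8, 41.5, 38.0, 30.7, 26.4, 19.9, 15.2, 10.8, 6.3)
)

# Statistic function: boot() calls it with the data and a vector i of
# resampled row indices; we return the correlation on that resample.
correlation <- function(data, i) {
  cor(data$Drafted[i], data$FPTS[i])
}

set.seed(2020)  # for reproducible resamples
bootstrap_correlation <- boot(data = game6, statistic = correlation, R = 10000)
```

With this in place, printing bootstrap_correlation shows the original statistic, its bias, and its bootstrap standard error, as reported below.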
bootstrap_correlation
The original correlation between the two variables is 0.892, with a standard error of 0.043.
summary(bootstrap_correlation)
The returned value is an object of a class called 'boot,' which contains the following parts:
t0: the observed value of statistic applied to the original data.
t: a matrix with sum(R) rows, each of which is a bootstrap replicate of the result of calling statistic.
class(bootstrap_correlation)
[1] "boot"
range(bootstrap_correlation$t)
[1] 0.6839681 0.9929641
mean(bootstrap_correlation$t)
[1] 0.8955649
sd(bootstrap_correlation$t)
[1] 0.04318599
Some other common statistics of the bootstrap samples (range, mean, and standard deviation) are shown above.
boot.ci(boot.out = bootstrap_correlation, type = c("norm", "basic", "perc", "bca"))
This is how we calculate four types of confidence intervals (normal, basic, percentile, and BCa) for the bootstrapped samples.
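For intuition, the percentile interval is essentially the empirical quantiles of the bootstrap replicates stored in t. A self-contained sketch with toy data (the article's dataset lives on GitHub), showing that quantile() closely reproduces boot.ci()'s percentile interval:

```r
library(boot)

# Toy data standing in for the Game 6 sample
set.seed(2020)
toy <- data.frame(x = rnorm(30))
toy$y <- toy$x + rnorm(30, sd = 0.5)   # y is strongly correlated with x

correlation <- function(data, i) cor(data$x[i], data$y[i])
b <- boot(data = toy, statistic = correlation, R = 2000)

# boot.ci()'s percentile interval ...
ci <- boot.ci(boot.out = b, type = "perc")

# ... closely matches (up to quantile-interpolation details) the raw
# empirical quantiles of the bootstrap replicates in b$t
quantile(b$t, c(0.025, 0.975))
```

The normal, basic, and BCa intervals apply further corrections (symmetry, reflection, and bias/acceleration adjustments, respectively) on top of this same set of replicates.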
Conclusion
In the above example, we are interested in the correlation between two variables: Drafted and FPTS.
However, this is only one sample, and the finite sample size makes it difficult to generalize the finding at
the population level. In other words, can we extend the discovery from one sample to all other cases?
In Game 6, the correlation coefficient between the percentage of players drafted and the fantasy points
scored on DraftKings stands at 0.89. We bootstrap the sample 10,000 times and find the following
bootstrap distribution:
1. Range: 0.6839681 to 0.9929641
2. Mean: 0.8955649
3. Standard deviation: 0.04318599
As we can see, the range of the coefficient is quite wide, from 0.68 to 0.99, and the 95% CI is from 0.8 to
0.97. We may get a 0.8 correlation coefficient next time.
Both statistics suggest that these two variables are moderately to strongly correlated, and we may see a
weaker correlation between these two variables in the next game (e.g., because of regression to the
mean for top players). Practically, we should be especially careful when drafting the top-performing
players.