
A Practical Guide to Bootstrap in R

What, Why, When, and How


https://towardsdatascience.com/a-practical-guide-to-bootstrap-with-r-examples-bd975ec6dcea

• Bootstrap is a resampling method with replacement.

• It allows us to estimate the distribution of the population even from a single sample.

• In Machine Learning, bootstrap estimates the prediction performance when applied to unobserved data.

• For the dataset and R code, please check my Github (link).

What is bootstrap?

Bootstrap is a resampling method where large numbers of samples of the same size are repeatedly drawn,
with replacement, from a single original sample.

Here is the English translation. Normally, it is not possible to infer a population parameter exactly from a single sample, or even from a finite number of samples. The uncertainty originates from sampling variability: one sample gives one value of the statistic, while another sample would give a different value.

The next question is:

How can we account for this variability and approximate the population parameter as closely as possible?

By repeatedly sampling with replacement, the bootstrap produces a distribution of the sample statistic that is often approximately Gaussian, which makes statistical inference (e.g., constructing a confidence interval) possible.

Bootstrap breaks down into the following steps:

1. decide how many bootstrap samples to draw

2. decide the sample size (typically the same as the original sample)

3. for each bootstrap sample:

• draw a sample with replacement of the chosen size

• calculate the statistic of interest for that sample

4. calculate the mean of the calculated sample statistics

These procedures may seem a little daunting, but fortunately we don’t have to run the calculations by hand. Modern programming languages (e.g., R or Python) handle the dirty work for us.
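As a quick illustration, here is a minimal base-R sketch of these steps, using a hypothetical numeric vector x and the sample mean as the statistic of interest:

# Minimal sketch of the steps above (hypothetical data, statistic = mean)
set.seed(1)
x <- rnorm(30)       # a hypothetical original sample
R <- 1000            # step 1: number of bootstrap samples
n <- length(x)       # step 2: resample size = size of the original sample
boot_means <- replicate(R, mean(sample(x, size = n, replace = TRUE)))  # step 3
mean(boot_means)     # step 4: mean of the sample statistics
sd(boot_means)       # spread of the bootstrap distribution (standard error)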

Why bootstrap?

Before answering the why, let’s look at some common inference challenges that we face as data scientists.
After A/B tests, to what extent can we trust that a small sample (say, 100 observations) represents the true population?

If we sample repeatedly, will the estimate of interest vary? If it does, what does its distribution look like?

Is it possible to make valid inferences when the distribution of the population is too complicated or
unknown?

Making valid statistical inference is a major part of data scientists’ daily routine. However, there are strict assumptions and prerequisites for valid inference, which may not hold or may simply be unknown.

Ideally, we would collect data on the entire population and spare ourselves the trouble of statistical inference. Obviously, this is expensive and rarely practical, considering the time and monetary costs.

For example, it’s not feasible to survey the entire American population about their political views, but surveying a small portion of it, say 10k Americans, about their political preferences is doable.

The challenge with this approach is that we may get slightly different results each time we collect a sample. Theoretically, the standard deviation of a point estimate across repeated samplings from the population could be considerably large, which makes any single estimate unreliable.

Here is the punch line:

As a non-parametric estimation method, bootstrap comes in handy: it quantifies the uncertainty of an estimate through its standard deviation (the standard error).

When?

For the following scenarios, bootstrap is a desirable approach:

1. When the distribution of a statistic is unknown or complicated.

• Reason: bootstrap is a non-parametric approach and does not require a specific distribution.

2. When the sample size is too small to draw a valid inference.

• Reason: bootstrap is a resampling method with replacement and can generate as many resamples as needed.

3. When you need a pilot study to test the water before pouring all of your resources in the wrong direction.

• Reason: related to point #2, bootstrap can approximate the sampling distribution, and we can check its variance.

R Code

Since sports are back and DraftKings has been trending recently, I’ll use Game 6 of the NBA post-season series between the Celtics and the Heat (Github).

We are interested in the correlation between two variables: the percentage that players have been
drafted (Drafted) and the Fantasy Points (FPTS) scored.
Calculating the correlation for one game is simple but of limited practical value. We have a general impression that top players (e.g., Jayson Tatum, Jimmy Butler) are going to be the most drafted and also score the most points in a game.

What potentially could help pick players in the next game is:

If we draw repeated samples 10,000 times, what is the range of the correlation between these two
variables?

In R, there are two steps for bootstrapping.

Install the package boot if you haven’t.
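If it is not installed yet:

install.packages("boot")   # only required once
library(boot)              # load the package before calling boot()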

Step 1: Define a function that calculates the metric of interest


function_1 <- function(data, i) {
  d2 <- data[i, ]                      # rows selected for this bootstrap sample
  return(cor(d2$X.Drafted, d2$FPTS))   # correlation of the resampled rows
}

The above function has two arguments: data and i. The first argument, data, is the dataset to be used, and i is a vector of row indices indicating which rows of the dataset are picked to create a bootstrap sample.
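As a quick sanity check (assuming the dataset mydata contains the columns X.Drafted and FPTS used above), passing all row indices should reproduce the correlation on the original data:

# Passing every row index returns the full-sample correlation
function_1(mydata, 1:nrow(mydata))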

Step 2: Apply the boot() function


set.seed(1)                                                   # for reproducibility
bootstrap_correlation <- boot(mydata, function_1, R = 10000)  # 10,000 bootstrap samples

Remember to set a seed to get reproducible results. The boot() function takes three main arguments: the dataset (data), the function that computes the statistic (statistic), and the number of bootstrap replicates (R).
bootstrap_correlation

The original correlation between the two variables is 0.892, with a standard error of 0.043.
summary(bootstrap_correlation)

The returned value is an object of a class called ‘boot,’ which contains the following parts:

• t0: the value of our statistic on the original dataset

• t: a matrix with sum(R) rows, each of which is a bootstrap replicate of the result of calling the statistic
class(bootstrap_correlation)
[1] "boot"
range(bootstrap_correlation$t)
[1] 0.6839681 0.9929641
mean(bootstrap_correlation$t)
[1] 0.8955649
sd(bootstrap_correlation$t)
[1] 0.04318599
Some other common summary statistics of the bootstrap samples (range, mean, and standard deviation) are shown above.
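The replicates stored in bootstrap_correlation$t can also be inspected directly; for example, a quick histogram (a sketch using base R graphics) shows the shape of the bootstrap distribution:

# Inspect the bootstrap object directly
bootstrap_correlation$t0                  # statistic on the original data
head(bootstrap_correlation$t)             # first few bootstrap replicates
hist(bootstrap_correlation$t,
     breaks = 50,
     main = "Bootstrap distribution of the correlation",
     xlab = "Correlation between X.Drafted and FPTS")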
boot.ci(boot.out = bootstrap_correlation, type = c("norm", "basic", "perc", "bca"))
This is how we calculate four types of confidence intervals (normal, basic, percentile, and BCa) for the bootstrapped samples.
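To work with a single interval programmatically, one option (a sketch that stores the result and reads the percentile interval from the returned object) is:

# Store the result and extract, e.g., the percentile interval
ci <- boot.ci(boot.out = bootstrap_correlation, type = "perc")
ci$percent   # matrix: confidence level, order-statistic indices, lower and upper bounds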

Conclusion

In the above example, we are interested in the correlation between two variables: Drafted and FPTS.
However, this is only one sample, and the finite sample size makes it difficult to generalize the finding at
the population level. In other words, can we extend the discovery from one sample to all other cases?

In Game 6, the correlation coefficient between the percentage of players drafted and the fantasy points scored at DraftKings stands at 0.89. We bootstrap the sample 10,000 times and find the following sampling distribution:

1. Range of the correlation coefficient: [0.6839681, 0.9929641].

2. Mean: 0.8955649

3. Standard deviation: 0.04318599

4. 95% confidence interval: [0.8041,0.9734]

As we can see, the range of the coefficient is quite wide, from 0.68 to 0.99, and the 95% CI runs from 0.80 to 0.97. We may well get a correlation coefficient of 0.8 next time.

Both statistics suggest that these two variables are moderately to strongly correlated, and we may see a weaker relationship between them in future games (e.g., because of regression to the mean for top players). Practically, we should be especially careful when drafting the top-performing players.
