A Practical Guide To Bootstrap in R
It allows us to estimate the sampling distribution of a statistic even from a single sample.
What is bootstrap?
Bootstrap is a resampling method in which a large number of samples of the same size are repeatedly drawn,
with replacement, from a single original sample.
In plain English: normally, it is not possible to infer a population parameter exactly from a single sample,
or even a finite number of samples. The uncertainty originates from sampling variability: one sample yields
one value, while another sample yields a different value.
How can we quantify this variability and approximate the population parameter as closely as possible?
By repeatedly sampling with replacement, the bootstrap builds an empirical distribution of the statistic
that approximates its sampling distribution, which makes statistical inference (e.g., constructing a
confidence interval) possible.
These procedures may seem a little daunting, but fortunately we don't have to run the calculations by
hand. Modern programming languages (e.g., R or Python) handle the dirty work for us.
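To make the procedure concrete, here is a minimal hand-rolled sketch of bootstrapping a sample mean in base R (toy data, not from the article):

```r
# Bootstrap the mean of a single sample using only base R (toy data)
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)   # the one original sample we have

R <- 10000                          # number of bootstrap resamples
# Each resample draws n observations from x WITH replacement,
# and we record the mean of each resample.
boot_means <- replicate(R, mean(sample(x, size = length(x), replace = TRUE)))

mean(boot_means)                       # center of the bootstrap distribution
sd(boot_means)                         # bootstrap standard error of the mean
quantile(boot_means, c(0.025, 0.975))  # simple percentile 95% CI
```

The same pattern works for any statistic; the boot package used later in the article just automates this loop.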
Why bootstrap?
Before answering the why, let’s look at some common inference challenges that we face as data scientists.
1. After A/B tests, to what extent can we trust that a small sample (say, 100 observations) represents the
true population?
2. If we sample repeatedly, will the estimate of interest vary? If it does, what does its distribution look
like?
3. Is it possible to make valid inferences when the distribution of the population is too complicated or
unknown?
Making valid statistical inferences is a major part of a data scientist's daily routine. However, valid
inference rests on strict assumptions and prerequisites, which may not hold or may remain unknown.
Ideally, we would collect data on the entire population and spare ourselves the trouble of statistical
inference. Obviously, this is an expensive and not-so-recommended approach, considering the time and
monetary expense.
For example, it is not feasible to survey the entire American population for their political views, but
surveying a small portion of it, say 10,000 Americans, is doable.
The challenge with this approach is that we may get slightly different results each time we collect a sample.
Theoretically, the standard deviation of a point estimate could be considerably large across repeated
samplings from the population, which makes any single estimate unreliable.
As a non-parametric estimation method, the bootstrap comes in handy: it quantifies the uncertainty of an
estimate via its standard error.
When?
1. The distribution of the population is complicated or unknown.
Reason: bootstrap is a non-parametric approach and does not require a specific distribution.
2. Collecting additional samples is expensive or infeasible.
Reason: bootstrap is a resampling method with replacement and can re-create any number of
resamples if needed.
3. You need a pilot study to test the waters before pouring all of your resources in the wrong direction.
Reason: related to point #2, bootstrap can approximate the sampling distribution, so we can check
the variance of the estimate.
R Code
Since sports are back and DraftKings has been trending recently, I'll use the NBA post-season Game 6
between the Celtics and the Heat (GitHub).
We are interested in the correlation between two variables: the percentage that players have been
drafted (Drafted) and the Fantasy Points (FPTS) scored.
Calculating the correlation for a single game is simple but of limited practical value. We have a general
impression that top players (e.g., Jayson Tatum, Jimmy Butler) are among the most drafted and also score
the most points in a game.
What potentially could help pick players in the next game is:
If we draw repeated samples 10,000 times, what is the range of the correlation between these two
variables?
The above function has two arguments: data and i. The first argument, data, is the dataset to be used,
and i is a vector of indices indicating which rows of the dataset are picked to create a bootstrap sample.
Remember to set a seed to get reproducible results. The boot() function takes three main arguments: data
(the dataset), statistic (the function to apply), and R (the number of bootstrap samples).
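The article's own code listing is not shown here, but based on the description above, the statistic function and the boot() call might look like the following sketch (game6 is a placeholder data frame standing in for the Game 6 dataset on the author's GitHub):

```r
library(boot)  # provides boot() and boot.ci()

# Placeholder data standing in for the article's Game 6 dataset
# (the real Drafted/FPTS numbers are on the author's GitHub)
game6 <- data.frame(
  Drafted = c(0.96, 0.91, 0.83, 0.74, 0.62, 0.48, 0.35, 0.22, 0.14, 0.08),
  FPTS    = c(54.2, 49.8, 41.5, 38.0, 30.7, 26.4, 19.9, 15.2, 10.8, 6.3)
)

# Statistic function: boot() calls it with the data and a vector i of
# resampled row indices; we return the correlation on that resample.
correlation <- function(data, i) {
  cor(data$Drafted[i], data$FPTS[i])
}

set.seed(2020)  # for reproducible resamples
bootstrap_correlation <- boot(data = game6, statistic = correlation, R = 10000)
```

With this in place, printing bootstrap_correlation shows the original statistic, its bias, and its bootstrap standard error, as reported below.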
bootstrap_correlation
The original correlation between the two variables is 0.892, with a standard error of 0.043.
summary(bootstrap_correlation)
The returned value is an object of a class called 'boot,' which contains the following parts:
t0: the observed value of statistic applied to the original data.
t: a matrix with sum(R) rows, each of which is a bootstrap replicate of the result of calling statistic.
class(bootstrap_correlation)
[1] "boot"
range(bootstrap_correlation$t)
[1] 0.6839681 0.9929641
mean(bootstrap_correlation$t)
[1] 0.8955649
sd(bootstrap_correlation$t)
[1] 0.04318599
Some other common statistics of the bootstrap samples (range, mean, and standard deviation) are shown above.
boot.ci(boot.out = bootstrap_correlation, type = c("norm", "basic", "perc", "bca"))
This is how we calculate four types of confidence intervals (normal, basic, percentile, and BCa) for the bootstrapped samples.
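For intuition, the percentile interval is essentially the empirical quantiles of the bootstrap replicates stored in t. A self-contained sketch with toy data (the article's dataset lives on GitHub), showing that quantile() closely reproduces boot.ci()'s percentile interval:

```r
library(boot)

# Toy data standing in for the Game 6 sample
set.seed(2020)
toy <- data.frame(x = rnorm(30))
toy$y <- toy$x + rnorm(30, sd = 0.5)   # y is strongly correlated with x

correlation <- function(data, i) cor(data$x[i], data$y[i])
b <- boot(data = toy, statistic = correlation, R = 2000)

# boot.ci()'s percentile interval ...
ci <- boot.ci(boot.out = b, type = "perc")

# ... closely matches (up to quantile-interpolation details) the raw
# empirical quantiles of the bootstrap replicates in b$t
quantile(b$t, c(0.025, 0.975))
```

The normal, basic, and BCa intervals apply further corrections (symmetry, reflection, and bias/acceleration adjustments, respectively) on top of this same set of replicates.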
Conclusion
In the above example, we are interested in the correlation between two variables: Drafted and FPTS.
However, this is only one sample, and the finite sample size makes it difficult to generalize the finding at
the population level. In other words, can we extend the discovery from one sample to all other cases?
In Game 6, the correlation coefficient between the percentage of players drafted and the fantasy points
scored on DraftKings stands at 0.89. We bootstrap the sample 10,000 times and find the following
bootstrap distribution:
1. Range: 0.6839681 to 0.9929641
2. Mean: 0.8955649
3. Standard deviation: 0.04318599
As we can see, the range of the coefficient is quite wide, from 0.68 to 0.99, and the 95% CI is from 0.8 to
0.97. We may get a 0.8 correlation coefficient next time.
Both statistics suggest that these two variables are moderately to strongly correlated, and we may see a
weaker correlation between these two variables in the next game (e.g., because of regression to the
mean for top players). Practically, we should be especially careful when drafting the top-performing
players.