Macro 1 - Bootstrap


Bootstrapping Macro: Part 1 - Use Cases

The name bootstrapping refers to the idea of picking oneself up by one's own bootstraps - fitting, because the technique uses the sample data to gain more knowledge about the sample data. This macro implements the simplest form of bootstrapping, a Monte Carlo procedure for sampling with replacement:
Suppose you want to know the average price of a tent.
1) Collect some small sample data.
- For example, visit the websites of 16 retailers:
170, 300, 240, 202, 24, 230, 49, 109, 128, 239, 199, 370, 280, 154, 109, 259

2) Re-sample, with replacement, from the sample data.


- Sample 1: 24, 259, 300, 280, 24, 199, 154, 259, 259, 109, 109, 128, 370, 109, 280, 154

- Sample 2, Sample 3, ...

3) Calculate a test statistic of interest.


- For example, the sample mean:
Original Data: 191.4
Sample 1: 188.6
Sample 2: 226.9
Sample 3: 187.2
Sample 4: 243.2
Sample 5: 210.1
Sample 6: 196.8

4) Compare the test statistic of the original data to the test statistics of the re-sampled data.

Figure 1: Example distribution of sample means. Red line represents the original sample mean.
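
To make the walkthrough concrete, here is a minimal R sketch of steps 1 through 4 using the 16 tent prices above. The number of re-samples, the variable names, and the plotting calls are my own illustrative choices, not part of the macro:

# Step 1: the original sample of 16 tent prices
prices <- c(170, 300, 240, 202, 24, 230, 49, 109, 128, 239,
            199, 370, 280, 154, 109, 259)

# Steps 2 and 3: re-sample with replacement, then take each sample's mean
set.seed(42)                  # only for reproducibility of this sketch
m <- 1000                     # number of bootstrap samples (arbitrary choice)
boot_means <- replicate(m, mean(sample(prices, replace = TRUE)))

# Step 4: compare the original sample mean to the bootstrap distribution
original_mean <- mean(prices)          # about 191.4, matching the table above
hist(boot_means, main = "Distribution of sample means")
abline(v = original_mean, col = "red")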

The major benefit of the bootstrap technique is that in Step 4 we do not make any assumptions about the distribution of the data. Another benefit is the flexibility in the choice of the test statistic (we can compute p-values for quantities whose distribution would be hard to derive).

Additional Resources: https://en.wikipedia.org/wiki/Bootstrapping

Bootstrapping Macro: Part 2 - Implementation Details


The bootstrapping macro has two options: re-sampling and calculation can be performed either with Alteryx built-in tools or with the R tool. Why take the time to make two implementations? I did this as part of a challenge - when creating a new predictive tool I am quick to jump into R. However, the overhead of passing data between the Alteryx engine and the R engine can be large, so I think it is worthwhile, when possible, to try to implement a technique natively.
As a further challenge, alternatives for bootstrapping big data sets exist. Check out the work done by researchers at Berkeley - it would be interesting to supplement the bootstrapping macro with that implementation.

Let's take a deeper look at some parts of the macro.


1) Sampling with replacement in Alteryx:

Let n be the sample size of the original data, and let m be the number of bootstrap samples. We start by creating n*m random numbers which take on the integer values 1, 2, ..., n. These values represent our index. To create a random sample we perform a join* based on the record id of the original sample and our random index. For example, with n = 4 and m = 2:
[Table with columns: Record Id, Original Sample, Random Index, Sample Value]

The last step is to group our newly created data using the tile tool
(Sample 1 - Blue; Sample 2 - Green) and make it look pretty with
the cross tab tool.
*Alteryx automatically sorts the data by sample value during the
join, so to get back our random samples we un-sort by sorting
based on the original record id.
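
The same index-and-join idea can be sketched outside of Alteryx. Below is a rough R translation of this step, using the first four tent prices as the original sample; the variable names and the use of split() in place of the tile and cross tab tools are my own illustrative choices:

set.seed(7)
original <- c(170, 300, 240, 202)   # original sample, n = 4
n <- length(original)
m <- 2                              # number of bootstrap samples

# n*m random integers in 1..n act as the random index
random_index <- sample(1:n, size = n * m, replace = TRUE)

# the "join": look up each indexed record id in the original sample
sample_value <- original[random_index]

# group the rows into m samples, as the tile tool would
sample_id <- rep(1:m, each = n)
split(sample_value, sample_id)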
2) Calculating complicated test statistics with the R Tool:
An unintuitive feature of the macro is the option to specify a test statistic using the R tool. When this option is selected, the data is passed into R and run through the macro's R code.

At its heart, this code resamples the original data, evaluates a command to generate an object based on the re-sampled data, and then takes part of that object as the test statistic.
As a simple example, if we wanted to use the R Tool to calculate sample means we would specify the 'R command to run over samples' as mean and the 'Attribute to .' as [1].
The flexibility to bootstrap a model object was included so that the tool could be extended to bootstrapping the distributions of linear model coefficients, time series auto-correlation functions, etc. For example, the default configuration (as pictured) will treat each sample as a time series and will find a distribution for the autocorrelation of the first lag (the correlation between x(t) and x(t-1) across the time series).
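
The macro's exact R code is not reproduced here, but the mechanism described above can be sketched as follows. The command and attribute strings, the eval/parse plumbing, and the acf example are my assumptions about how such a configuration could work, not the macro's actual code:

prices <- c(170, 300, 240, 202, 24, 230, 49, 109, 128, 239,
            199, 370, 280, 154, 109, 259)

# user-supplied pieces, analogous to the macro's configuration fields
command   <- "acf(boot_sample, plot = FALSE)"  # R command to run over samples
attribute <- "$acf[2]"                         # part of the object to keep (lag-1 autocorrelation)

m <- 1000
boot_stats <- numeric(m)
for (i in seq_len(m)) {
  boot_sample <- sample(prices, replace = TRUE)                   # re-sample the original data
  obj <- eval(parse(text = command))                              # build an object from the re-sample
  boot_stats[i] <- eval(parse(text = paste0("obj", attribute)))   # extract the test statistic
}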

3) P-value: Why the flip-flop?

The bootstrap macro outputs a p-value. This p-value is for the null hypothesis that the observed test statistic is similar to the mean of the bootstrapped test statistics. I say 'similar' because the macro actually tests one of two hypotheses. If the observed test statistic is larger than the mean, it reports a p-value for the hypothesis that the test statistic is less-than-or-equal-to the bootstrapped data. If the observed test statistic is smaller than the mean, it reports a p-value for the hypothesis that the test statistic is greater-than-or-equal-to the bootstrapped data.
This is implemented in a formula tool inside the macro.
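
In R, the flip-flop described above amounts to something like the following; this is my reading of the logic, not the macro's actual formula expression:

prices     <- c(170, 300, 240, 202, 24, 230, 49, 109, 128, 239,
                199, 370, 280, 154, 109, 259)
boot_stats <- replicate(1000, mean(sample(prices, replace = TRUE)))

observed  <- mean(prices)       # test statistic of the original data
boot_mean <- mean(boot_stats)   # mean of the bootstrapped test statistics

# pick the side of the one-sided test based on where the observed
# statistic falls relative to the bootstrap mean
if (observed > boot_mean) {
  p_value <- mean(boot_stats >= observed)   # how often the bootstrap is at least as large
} else {
  p_value <- mean(boot_stats <= observed)   # how often the bootstrap is at least as small
}
p_value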

Bootstrapping Macro: Part 3 - Learning the CLT


The central limit theorem is foundational to statistics. The basic idea is that the distribution of sample means converges to a normal distribution, regardless of the distribution of the underlying data.

The bootstrapping macro makes it easy to quickly calculate a whole bunch of sample means, so I thought it would be fun to try and create an app to demonstrate the CLT.
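
As a rough stand-in for the app, a few lines of R show the same effect: even when the original sample is clearly skewed, the re-sampled means pile up into a roughly bell-shaped histogram. The exponential data and the number of re-samples below are arbitrary choices for illustration:

set.seed(1)
skewed <- rexp(200, rate = 1)   # a clearly non-normal original sample

# many re-sampled means, as the macro would produce
means <- replicate(2000, mean(sample(skewed, replace = TRUE)))

par(mfrow = c(1, 2))
hist(skewed, main = "Original data (skewed)")
hist(means,  main = "Re-sampled means (roughly normal)")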
Have fun!
