Lecture 9 PDF
Bootstrap
Introduction
The purpose of statistical inference is to extract and
summarize the key characteristics of a dataset.
To assess the accuracy of such summaries, explicit results
are available in the literature for certain statistics.
For more complex statistics, however, no formulas are
available. The bootstrap was invented to solve this issue.
Bootstrapping is a powerful yet simple technique for
obtaining, from a single sample of data, information that
would usually require analytic techniques. The methodology
revolves around sampling with replacement from the observed
data sample to create a large number of sets of
pseudo-data, which are consistent with the same underlying
distribution. Statistics of interest can then be computed
for each set of pseudo-data, and the distribution of those
statistics investigated to obtain further insight. For
example, the mean of each set of pseudo-data can be
calculated, and the standard deviation of the resulting set
of means used as an estimate of the standard error of the
mean.
The name bootstrap comes from Rudolph Erich Raspe's
tale in which Baron Munchausen, having fallen to the bottom
of a deep lake and about to give in to his fate, thought to
pull himself up by his own bootstraps.
Non-parametric bootstrap
Introduction
The non-parametric bootstrap tries to estimate the sampling
distribution of a statistic of interest s = s(x_1, ..., x_n)
given the observed sample x_1, ..., x_n, assuming the cdf is
unknown.
The bootstrap carries out this purpose by generating "new"
data from x_1, ..., x_n by sampling with replacement.
So we get b sets of data x*_1, ..., x*_n, from which
estimates s*_1, ..., s*_b can be made. Hence we are able to
study the sampling distribution of s.
The bootstrap method is structured as follows:
1. Sample randomly with replacement from the observed
sample x_1, ..., x_n: as output we get b bootstrap samples
x*_1, ..., x*_n.
2. Compute the statistic s on each bootstrap sample,
obtaining the replicates s*_1, ..., s*_b.
3. Use the bootstrap data s*_1, ..., s*_b to quantify the
accuracy of s (e.g. standard error, confidence set, ...).
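These three steps can be sketched in a few lines of R; this is a minimal sketch in which the statistic is the mean and the data are made up for illustration:

```r
# Hypothetical observed sample (numbers are illustrative)
set.seed(1)
x <- rnorm(30, mean = 5, sd = 2)
b <- 1000                                # number of bootstrap samples

# Step 1: draw b bootstrap samples, each with replacement from x
boot.samples <- lapply(1:b, function(i) sample(x, replace = TRUE))

# Step 2: compute the statistic s (here, the mean) on each bootstrap sample
s.star <- sapply(boot.samples, mean)

# Step 3: use the replicates to quantify accuracy, e.g. the standard error
se.boot <- sd(s.star)
se.analytic <- sd(x)/sqrt(length(x))     # analytic formula, for comparison
```

For the mean the analytic standard error is available, so the two numbers can be compared directly; for a statistic without a formula, only the bootstrap estimate would be available.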
Parametric bootstrap
Introduction
Although the non-parametric bootstrap is useful for
estimating the bias and standard error of sample
statistics, it is not very useful for constructing
hypothesis tests. This is because the non-parametric
bootstrap distribution is built from the properties of the
actual underlying population, as they appear in the
original sample. For a hypothesis test, what is needed is
the sampling distribution of the statistic s under the null
hypothesis, not under the actual population.
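To make the distinction concrete, here is a sketch of a parametric bootstrap test; the exponential model, the statistic and the null value rate = 1 are all illustrative assumptions, not part of the lecture's examples:

```r
# Sketch of a parametric bootstrap test of H0: rate = 1 for exponential data
set.seed(2)
x <- rexp(40, rate = 1.3)              # observed data (illustrative)
s.obs <- 1/mean(x)                     # test statistic: the rate MLE
b <- 2000

# Key point: simulate the statistic under the NULL model (rate = 1),
# not under the population suggested by the observed sample
s.null <- sapply(1:b, function(i) 1/mean(rexp(length(x), rate = 1)))

# Two-sided p-value: how extreme is s.obs relative to the null distribution?
p.value <- mean(abs(s.null - 1) >= abs(s.obs - 1))
```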
Method and concepts
In the parametric bootstrap, we supposed the observed
data
1
,...,
n
x x come from a parametric model with unknown
parameter . All the uncertainty relies on the fact we do
not know the value of parameter.
The difference with the non-parametric bootstrap is that
resampling are carried out with the parametric model once
3
is estimated. Then from the bootstrap samples
1
,...,
n
x x , we
get new estimation of statistic s .
The parametric bootstrap method is structured as follows:
1. Estimate the parameter θ of the parametric model from
the data x_1, ..., x_n; call the estimate θ̂.
2. Sample randomly from the parametric model with θ = θ̂ to
get b bootstrap samples x*_1, ..., x*_n.
3. Compute the statistic s on each bootstrap sample; the
bootstrap replicates are denoted s*_1, ..., s*_b.
4. Use the bootstrap data s*_1, ..., s*_b to quantify the
accuracy of s (e.g. standard error, confidence set, ...).
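A minimal sketch of these steps, assuming (for illustration only) an exponential model with the rate estimated by maximum likelihood:

```r
# Parametric-bootstrap sketch; the exponential model is an illustrative assumption
set.seed(3)
x <- rexp(100, rate = 2)               # observed data from Exp(lambda), lambda unknown
b <- 1000

# Step 1: estimate the parameter (MLE of the rate)
lambda.hat <- 1/mean(x)

# Step 2: draw b bootstrap data sets from the fitted parametric model
boot.samples <- lapply(1:b, function(i) rexp(length(x), rate = lambda.hat))

# Steps 3-4: compute the statistic on each bootstrap sample and use the
# replicates to quantify accuracy
s.star <- sapply(boot.samples, function(xs) 1/mean(xs))
se.boot <- sd(s.star)                  # parametric-bootstrap standard error
```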
The sample function
A major component of bootstrapping is being able to
resample a given data set and in R the function which
does this is the sample function.
sample(x, size, replace, prob)
The first argument is a vector containing the data set to
be resampled or the indices of the data to be resampled.
The size option specifies the sample size with the
default being the size of the population being resampled.
The replace option determines if the sample will be drawn
with or without replacement where the default value is
FALSE, i.e. without replacement. The prob option takes a
vector, of length equal to that of x, giving the
probability of selection for each element of x. The default
is a simple random sample in which each element has equal
probability of being sampled. In a typical bootstrapping
situation we want bootstrap samples of the same size as the
population being sampled, drawn with replacement.
#using sample to generate a permutation of the sequence
1:10
sample(10)
[1] 4 8 3 5 1 10 6 2 9 7
#bootstrap sample from the same sequence
sample(10, replace=T)
[1] 1 3 9 4 10 3 5 1 6 4
#bootstrap sample from the same sequence with
#probabilities that favor the numbers 1-5
prob1 <- c(rep(.15, 5), rep(.05, 5))
prob1
[1] 0.15 0.15 0.15 0.15 0.15 0.05 0.05 0.05 0.05 0.05
sample(10, replace=T, prob=prob1)
[1] 4 2 1 7 6 5 4 4 2 9
#sample of size 5 from elements of a matrix
#creating the data matrix
y1 <- matrix( round(rnorm(25,5)), ncol=5)
y1
[,1] [,2] [,3] [,4] [,5]
[1,] 6 4 6 4 5
[2,] 6 5 5 7 4
[3,] 5 4 5 7 6
[4,] 5 3 6 6 6
[5,] 3 4 4 5 5
#saving the sample of size 5 in the vector x1
x1 <- y1[sample(25, 5)]
x1
[1] 6 4 5 5 4
#sampling the rows of a matrix
#creating the data matrix
y2 <- matrix( round(rnorm(40, 5)), ncol=5)
y2
[,1] [,2] [,3] [,4] [,5]
[1,] 5 5 4 7 4
[2,] 5 6 4 6 4
[3,] 5 4 4 6 3
[4,] 5 6 5 6 6
[5,] 6 5 4 4 4
[6,] 5 5 5 4 5
[7,] 4 5 5 5 4
[8,] 5 5 4 6 6
#saving the sample of rows in the matrix x2
x2 <- y2[sample(8, 3), ]
x2
[,1] [,2] [,3] [,4] [,5]
[1,] 4 5 5 5 4
[2,] 5 6 5 6 6
[3,] 6 5 4 4 4
Applications of bootstrap
Bootstrap estimate of standard error
As usual, s stands for the statistic of interest based on
the observed sample. As output of the bootstrap method we
get bootstrap replicates s*_1, ..., s*_b. Let us denote the
standard error of s by σ_s. σ_s is approximated by the
empirical standard deviation of the replicates:

σ̂_s = sqrt( (1/(b−1)) Σ_{i=1}^{b} (s*_i − s̄*)² )

where s̄* is the mean of the bootstrap replicates
(i.e. s̄* = (1/b) Σ_{i=1}^{b} s*_i).
Bootstrap estimate of bias
Similarly, we have an analogous estimate of the bias of
statistics s . We recall that the bias is the difference
between the expectation of our statistic s and the
mathematical quantity S we want to estimate. For
instance, the bias of the mean is defined as
( ) ( )
n
n X
B E X E X .
We estimate the bias of the statistic s by
1
1
b
i
i
B s s
b
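Both estimates can be computed directly from the replicates; a sketch for the sample median (the data below are illustrative):

```r
# Bootstrap standard error and bias for the sample median
set.seed(4)
x <- rexp(50)                          # a skewed illustrative sample
s <- median(x)                         # observed statistic
b <- 2000
s.star <- sapply(1:b, function(i) median(sample(x, replace = TRUE)))

# Empirical standard deviation of the replicates (the formula above)
se.hat <- sqrt(sum((s.star - mean(s.star))^2)/(b - 1))

# Bias estimate: mean of the replicates minus the observed statistic
bias.hat <- mean(s.star) - s
```

Note that se.hat is exactly sd(s.star), which is why the examples below simply call sd() or sqrt(var()) on the vector of replicates.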
Obtaining a standard error for the estimate of the
median.
#calculating the standard error of the median
#creating the data set by taking 100 observations
#from a normal distribution with mean 5 and stdev 3
#we have rounded each observation to nearest integer
data <- round(rnorm(100, 5, 3))
data[1:10]
[1] 6 3 3 4 3 8 2 2 3 2
#obtaining 20 bootstrap samples
#display the first of the bootstrap samples
resamples <- lapply(1:20, function(i)
{sample(data, replace = T)})
resamples[1]
[[1]]
  [1]  5  1  7  6  5  2  2  6  9  5  4  6  6  3  5  4 10  7  8  1  8  0  5  2
 [25]  8  3  0  9  3  2  3 10  5  8  5  4  0  4  7  3  5  6  3  6  3  2  9  7
 [49]  2  4  9  6  6  0  7  5  9  3  0  6  8  5  2  3  3  3  4  3  2  9  3  3
 [73]  2  3  8  2  8  3  9  6  5  2  4  3  3  7  1  3  5  9  4  3  4  2  9  0
 [97]  3  6  9  7
#calculating the median for each bootstrap sample
r.median <- sapply(resamples, median)
r.median
 [1] 4.0 4.5 4.0 5.0 4.0 5.0 5.0 5.0 5.0 4.0 5.0 5.0 5.0 5.0 5.0 4.0 5.0 5.0
[19] 6.0 5.0
#calculating the standard deviation of the distribution
#of medians
sqrt(var(r.median))
[1] 0.5250313
We can put all these steps into a single function where all
we need to specify is which data set to use and how many
times we want to resample in order to obtain the estimated
standard error of the median.
b.median <- function(data, num) {
  # num: how many times we want to resample
  resamples <- lapply(1:num, function(i) sample(data, replace=TRUE))
  r.median <- sapply(resamples, median)
  std.err <- sqrt(var(r.median))
  return(list(std.err=std.err, resamples=resamples, medians=r.median))
}
#we can input the data directly into the function and
#display the standard error in one line of code
b.median(rnorm(100, 5, 2), 50)$std.err
Package "boot"
The boot package provides extensive facilities for
bootstrapping and related resampling methods. You can
bootstrap a single statistic or a vector. The main
bootstrapping function is boot().
Non-parametric bootstrap
We apply this procedure to the USArrests data with a
non-parametric bootstrap.
murder <- USArrests[,1]
murder
 [1] 13.2 10.0  8.1  8.8  9.0  7.9  3.3  5.9 15.4 17.4  5.3  2.6 10.4  7.2  2.2  6.0
[17]  9.7 15.4  2.1 11.3  4.4 12.1  2.7 16.1  9.0  6.0  4.3 12.2  2.1  7.4 11.4 11.1
[33] 13.0  0.8  7.3  6.6  4.9  6.3  3.4 14.4  3.8 13.2 12.7  3.2  2.2  8.5  4.0  5.7
[49]  2.6  6.8

Exercise (in class)
Generalize the function b.median to work for any summary
statistic.
bootReplMean <- boot(data=murder,
statistic=function(x,y) mean(x[y]), R=1000, stype = "i")
bootReplMean
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = murder, statistic = function(x, y)
mean(x[y]), R = 1000,
stype = "i")
Bootstrap Statistics :
original bias std. error
t1* 7.788 -0.025094 0.604386
# Function boot: boot() calls the statistic function R
times. Each time, it generates a set of random indices,
with replacement, from the integers 1:n, where n is the
number of observations. These indices are used within the
statistic function to select a sample, and the statistic is
calculated on that sample.
Arguments
data      The data as a vector, matrix or data frame. If it
          is a matrix or data frame then each row is
          considered as one multivariate observation.
statistic A function which, when applied to data, returns a
          vector containing the statistic(s) of interest.
          statistic must take at least two arguments. The
          first argument passed will always be the original
          data; the second will be a vector of indices,
          frequencies or weights which define the bootstrap
          sample.
R         The number of bootstrap replicates.
stype     A character string indicating what the second
          argument of statistic represents. Possible values
          of stype are "i" (indices, the default),
          "f" (frequencies), or "w" (weights). Not used for
          sim = "parametric".
bootReplMean$t0
[1] 7.788
# t0 - the observed value of the statistic(s) applied to
# the original data.
bootReplMean$t
# A matrix with sum(R) rows, each of which is a bootstrap
# replicate of the result of calling statistic.
plot() can be used to examine the results.
plot(bootReplMean)
You can use the boot.ci() function to obtain confidence
intervals for the statistic(s):
boot.ci(bootReplMean)
Examining the convergence of the bootstrap standard error
as the number of replicates grows:
BESEmean <- sapply(2:1000, function(n)
sd(bootReplMean$t[1:n]))
plot(2:1000, BESEmean, type ="l", xlab="b", ylab="BESE",
ylim= c(min(BESEmean), max(BESEmean)))
Parametric bootstrap
We apply the bootstrap standard error and bias to a
randomly generated exponential sample. To estimate the
parameter λ, we use the unbiased (bias-corrected) maximum
likelihood estimator

λ̂_n = (n − 1) / Σ_{i=1}^{n} x_i.

We are interested in the standard deviation of λ̂_n, for
which an explicit formula exists:

Var(λ̂_n) = λ² / (n − 2).

Hence we can compare the bootstrap standard error of λ̂_n
with the true value.

Exercise (in class)
1. Estimate the median standard deviation and confidence
interval for "murder", with R = 50000.
2. Estimate the number of resamples (R) necessary for
convergence of the standard deviation.
x <- rexp(1000, y <- rlnorm(1)) #generating the random
sample with true rate y
lambdaMLE <- (length(x)-1)/sum(x) #estimating the
parameter
para.rg <- function(data, mle){ rexp(n=length(data),
rate=mle) } #generating a random sample from the
parametric model with the estimated parameter
para.boot <- function(data) {(length(data)-1)/sum(data)}
# computing the statistic on a bootstrap sample
para.boot.MLE <- boot(x, para.boot, R=50000,
sim="parametric", ran.gen=para.rg, mle=lambdaMLE)$t
# generate 50000 bootstrap replicates
BESEmeanPara <- sapply(2:50000,
function(n){sd(para.boot.MLE[1:n])})
truevalue <- sqrt(y^2/(length(x)-2)) # sqrt(Var(λ̂_n)) = sqrt(λ²/(n−2))
ran.gen   This function is used only when
          sim = "parametric". It describes how random
          values are to be generated. It should be a
          function of two arguments: the first is the
          observed data and the second consists of any
          other information needed (here, the estimated
          parameter passed via mle).
plot(2: 50000, BESEmeanPara, type ="l", xlab="b",
ylab="BESE", col="blue",
ylim=c(min(truevalue,BESEmeanPara) , max(BESEmeanPara)))
lines(1:500*100, truevalue*rep(1,500))
legend(x=3000,y=0.05,c("true value of
s.e.","Bootstrap"),col=c("black","blue"),lty=1)
Generating the bootstrapped 95% confidence interval for
R-squared in the linear regression
The data source is mtcars. The bootstrapped confidence
interval is based on 1000 replications.
function to obtain R-Squared from the data:
rsq <- function(formula, data, indices) {
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(summary(fit)$r.square)
}
bootstrapping with 1000 replications:
results <- boot(data=mtcars, statistic=rsq,
R=1000, formula=mpg~wt+disp)
results
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = mtcars, statistic = rsq, R = 1000, formula =
mpg ~
wt + disp)
Bootstrap Statistics :
original bias std. error
t1* 0.7809306 0.01119669 0.04797569
plot(results)
boot.ci(results, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = results, type = "bca")
Intervals :
Level BCa
95% ( 0.6512, 0.8542 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
Get the 95% CI for the three model regression
coefficients (intercept, car weight, displacement).
In the example above, the function rsq returned a number
and boot.ci returned a single confidence interval. The
statistic function you provide can also return a vector.
In this case we add an index parameter to plot() and
boot.ci() to indicate which column of $t is to be analyzed.
function to obtain regression weights
bs <- function(formula, data, indices) {
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(coef(fit))
}
bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=bs,
R=1000, formula=mpg~wt+disp)
# view results
results
plot(results, index=1) # intercept
plot(results, index=2) # wt
plot(results, index=3) # disp
# get 95% confidence intervals
boot.ci(results, type="bca", index=1) # intercept
boot.ci(results, type="bca", index=2) # wt
boot.ci(results, type="bca", index=3) # disp
Bootstrap chain-ladder
The BootChainLadder procedure provides a predictive
distribution of reserves for a cumulative claims
development triangle
The BootChainLadder function uses a two-stage
bootstrapping/simulation approach. In the first stage an
ordinary chain-ladder method is applied to the cumulative
claims triangle. From this we calculate the scaled Pearson
residuals (the raw residuals divided by the standard
deviation of the data according to the model's mean-variance
relationship and estimated scale parameter), which we
bootstrap R times to forecast future incremental claims
payments via the standard chain-ladder method. In the
second stage we simulate the process error, with the
bootstrap value as the mean, using the assumed process
distribution.
The set of reserves obtained in this way forms the
predictive distribution, from which summary statistics
such as mean, prediction error or quantiles can be
derived.
The implementation of BootChainLadder follows closely the
discussion of the bootstrap model in section 8 and
appendix 3 of the paper by England and Verrall (2002)-
Stochastic Claims Reserving in General Insurance:
1. Obtain the standard chain-ladder development factors
from cumulative data.
2. Obtain cumulative fitted values for the past
triangle by backwards recursion, starting with the
observed cumulative paid to date in the latest
diagonal.
3. Obtain incremental fitted values for the past
triangle by differencing.
4. Calculate the unscaled Pearson residuals for the
past triangle using:
incremental claims
incremental fitted values
ij
ij
ij ij
ij
ij
C
m
C m
r
m
5. Calculate the scaled Pearson residuals.
6. Begin iterative loop, to be repeated N times:
i. Resample the residuals with replacement, creating a
new past triangle of residuals.
ii. For each cell in the past triangle, solve the equation
    r_ij = (C_ij − m_ij)/√(m_ij) for C_ij, giving a set of
    simulated incremental data for the past triangle.
iii. Create the associated set of simulated-cumulative
data.
iv. Fit the standard chain-ladder model to the
simulated-cumulative data.
v. Project to form a future triangle of cumulative
payments.
vi. Obtain the corresponding future triangle of
incremental payments by differencing, to be used as
the mean when simulating from the process
distribution.
vii. For each cell (i, j) in the future triangle, simulate
a payment from the process distribution, with mean obtained
at the previous step.
viii. Sum the simulated payments in the future triangle by
origin year and overall to give the origin year and
total reserve estimates respectively.
ix. Store the results, and return to start of iterative
loop.
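The first stage of this recipe (steps 1-4) can be sketched by hand on a toy cumulative triangle; the triangle values below are made up purely for illustration:

```r
# Toy cumulative claims triangle: 4 origin years x 4 development periods (NA = future)
C <- matrix(c(100, 150, 175, 180,
              110, 168, 196,  NA,
              120, 182,  NA,  NA,
              130,  NA,  NA,  NA), nrow = 4, byrow = TRUE)
n <- nrow(C)

# 1. Standard chain-ladder development factors from cumulative data
f <- sapply(1:(n - 1), function(j) {
  rows <- !is.na(C[, j + 1])
  sum(C[rows, j + 1]) / sum(C[rows, j])
})

# 2. Cumulative fitted values by backwards recursion from the latest diagonal
Chat <- matrix(NA, n, n)
for (i in 1:n) {
  last <- n - i + 1                  # latest observed development period in row i
  Chat[i, last] <- C[i, last]
  if (last >= 2) for (j in last:2) Chat[i, j - 1] <- Chat[i, j] / f[j - 1]
}

# 3. Incremental fitted and observed values by differencing
m   <- cbind(Chat[, 1], t(diff(t(Chat))))
inc <- cbind(C[, 1],    t(diff(t(C))))

# 4. Unscaled Pearson residuals r_ij = (C_ij - m_ij)/sqrt(m_ij)
r <- (inc - m) / sqrt(m)
```

BootChainLadder then scales these residuals, resamples them, and runs the simulation loop above; the sketch stops at the residuals.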
B <- BootChainLadder(RAA, R=999, process.distr="gamma")
B
BootChainLadder(Triangle = RAA, R = 999, process.distr =
"gamma")
Latest Mean Ultimate Mean IBNR SD IBNR IBNR 75% IBNR 95%
1981 18,834 18,834 0 0 0 0
1982 16,704 16,871 167 664 262 1,374
1983 23,466 24,158 692 1,310 1,185 3,026
1984 27,067 28,809 1,742 1,864 2,684 5,317
1985 26,180 29,103 2,923 2,423 4,231 7,170
1986 15,852 19,766 3,914 2,556 5,287 8,740
1987 12,314 17,913 5,599 3,184 7,398 11,733
1988 13,112 24,036 10,924 4,746 13,869 19,283
1989 5,395 16,443 11,048 5,918 14,472 21,805
1990 2,063 19,059 16,996 13,443 25,037 41,980
Totals
Latest: 160,987
Mean Ultimate: 214,994
Mean IBNR: 54,007
SD IBNR: 18,351
Total IBNR 75%: 65,175
Total IBNR 95%: 86,646
plot(B)
quantile(B, c(0.75,0.95,0.99, 0.995))
$ByOrigin
IBNR 75% IBNR 95% IBNR 99% IBNR 99.5%
1981 0.0000 0.000 0.000 0.000
1982 261.5228 1374.443 2156.636 2620.206
1983 1184.5912 3025.926 5087.740 5966.613
1984 2683.8241 5316.608 7023.275 7973.960
1985 4230.7258 7170.423 9822.821 10922.209
1986 5287.1116 8740.298 11670.006 13003.344
1987 7397.5856 11732.980 14314.841 15545.541
1988 13869.4386 19283.282 24803.602 27525.101
1989 14472.0201 21805.066 27781.963 30001.351
1990 25036.6265 41979.861 55194.203 56810.024
$Totals
Totals
IBNR 75%: 65174.99
IBNR 95%: 86645.83
IBNR 99%: 104323.44
IBNR 99.5%: 108224.30
Fit a log-normal distribution to the IBNR
plot(ecdf(B$IBNR.Totals))
library(MASS)
fit <- fitdistr(B$IBNR.Totals[B$IBNR.Totals>0],
"lognormal")
fit
meanlog sdlog
10.836817253 0.356902419
( 0.011291893) ( 0.007984574)
curve(plnorm(x,fit$estimate["meanlog"],
fit$estimate["sdlog"]),col="red", add=TRUE)
Exercise (in class)
Fit the bootstrap chain ladder to the triangle ABC.