Estimation Through Bootstrapping
T.A.S. Vijayaraghavan
Although confidence interval estimates have been widely used in
making inferences about population parameters, the estimating
procedure rests on assumptions that do not always hold. If the
population is substantially non-normal, particularly when the sample
size n is small, the confidence interval estimate for the mean may
not be reliable. We will consider an alternative estimation approach
called bootstrapping. Bradley Efron is usually credited not only with
the term "bootstrapping" in this context but also with the central
assertion that the relative frequency distribution of the repeated
sample statistics is an estimate of the sampling distribution. He
published his insights in 1979 in an article in The Annals of
Statistics. The word "bootstrap" suggests pulling oneself up by one's
bootstraps, which is almost equivalent to achieving the impossible.
Much of the rationale behind Efron's path-breaking approach is that
the distributional assumptions required for parametric inference (such
as multinormality) are not always met, nor do we always know the
sampling distribution of other statistics of interest (e.g. skewness,
kurtosis, difference between medians, eigenvalues). Under such
conditions parametric inferences are difficult to justify, and it is
sometimes better to draw inferences about population parameters
strictly from the sample at hand rather than make unrealistic
assumptions about the population of interest. The bootstrap concept
is simple. Statisticians would probably have come up with the idea
long ago, were it not for the extensive computation it requires.
Bootstrapping, the best known technique of this kind, involves
resampling the sample data with replacement many, many times in order
to generate an empirical estimate of the entire sampling distribution
of a statistic. In other words, bootstrapping treats the sample as if
it were the population and applies Monte Carlo sampling to generate an
empirical estimate of the statistic's sampling distribution.
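The idea of treating the sample as the population can be sketched in a
few lines of Python. This is only an illustration under stated
assumptions: the function name bootstrap_distribution, the seed and
the ten observations are hypothetical, and the median is chosen simply
as an example of a statistic whose sampling distribution is awkward to
derive analytically.

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, m=1000, seed=0):
    """Return m values of `statistic`, each computed on one resample."""
    rng = random.Random(seed)      # fixed seed so the sketch is repeatable
    n = len(sample)
    values = []
    for _ in range(m):
        # Treat the sample as the population: draw n values with replacement.
        resample = rng.choices(sample, k=n)
        values.append(statistic(resample))
    return values

# Hypothetical observations, used only to exercise the function.
sample = [12.1, 9.8, 15.3, 11.0, 30.2, 10.4, 13.7, 9.1, 14.9, 11.6]
medians = bootstrap_distribution(sample, statistics.median, m=500)

# The spread of the resampled medians serves as an estimate of the
# standard error of the sample median.
se_median = statistics.stdev(medians)
print(f"bootstrap standard error of the median: {se_median:.3f}")
```

The list returned by the function is the empirical estimate of the
sampling distribution; any summary of it (spread, percentiles) can then
be read off directly.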
The typical requirements for bootstrapping are considerable
computational power, enough to take 500 to 1000 resamples; appropriate
software or a computer program; and a sample of at least 30
observations.
Still, bootstrapping is not a panacea. It is good for confidence
intervals and bias estimation but not for point estimation. In
addition, the original sample must be a good representation of the
population of interest (covering the full range of population values)
and must have been drawn as a simple random sample (although work with
more complex samples is underway). If these conditions are not met,
the bootstrap cannot remove the resulting uncertainty.
Briefly, bootstrapping estimation procedures involve an initial sample
and then repeated resampling from the initial sample. What makes
bootstrapping estimation useful is that the procedures are based on
the initial sample and make no assumptions regarding the shape of the
underlying population. In addition, the procedures do not require
knowledge of any population parameters.
The steps in bootstrapping estimation of the mean are as follows:
1. Draw a random sample of size n without replacement from a
population of size N.
2. Resample the initial sample by selecting n observations with
replacement from the n observations in the initial sample.
3. Compute $\bar{X}$, the statistic of interest, from this resample.
4. Repeat steps 2 and 3 m different times (where m is typically
selected as a value between 100 and 1000, depending on the speed
of the computer being used).
5. Form the resampling distribution of the statistic of interest (i.e. the
distribution of the sample mean obtained from the m resamples) using a
stem-and-leaf display or an ordered array.
6. To form a 100(1-$\alpha$)% bootstrap confidence interval for the
population mean $\mu$, use the stem-and-leaf display or ordered array
for the resampling distribution and find the value that cuts off the
smallest ($\alpha$/2)100% and the value that cuts off the largest
($\alpha$/2)100% of the statistic. These values provide the lower and
upper limits for the bootstrap confidence interval estimate of the
unknown parameter.
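Steps 2 to 6 can be sketched as follows. This is a minimal
illustration, not a definitive implementation: the function name
bootstrap_mean_ci, the seed and the twelve observations are
hypothetical, and step 1 (drawing the initial sample) is assumed to
have been done already. The cut-off rule takes the k-th smallest and
k-th largest resample means, where k is ($\alpha$/2)m rounded to an
integer.

```python
import random
import statistics

def bootstrap_mean_ci(sample, m=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean, following steps 2 to 6."""
    rng = random.Random(seed)
    n = len(sample)
    # Steps 2-4: m resamples of size n, drawn with replacement;
    # keep the mean of each resample.
    means = [statistics.fmean(rng.choices(sample, k=n)) for _ in range(m)]
    # Step 5: the ordered array of the resampling distribution.
    means.sort()
    # Step 6: cut off the smallest and the largest (alpha/2)*100% of the means.
    k = max(1, round(m * alpha / 2))
    return means[k - 1], means[-k]     # k-th smallest, k-th largest

# Hypothetical initial sample, used only to exercise the function.
sample = [23.0, 19.5, 31.2, 27.8, 22.4, 35.1, 18.9, 26.3, 29.7, 24.6, 21.1, 33.4]
lower, upper = bootstrap_mean_ci(sample, m=200, alpha=0.05)
print(f"95% bootstrap interval for the mean: {lower:.2f} to {upper:.2f}")
```

With m = 200 and $\alpha$ = 0.05 the rule gives k = 5, so the limits
are the fifth smallest and the fifth largest resample means.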
Consider the following example of annual cooking oil consumption (in
gallons) for a sample of 35 single-family homes, where the marketing
manager is interested in estimating the average annual consumption.
1150.25  1352.67   983.45  1365.11   942.71  1577.77   330.00
 872.37  1126.57  1459.56  1252.01   941.96   767.37  1013.27
1402.59  1184.17   373.91  1598.57  1069.32  1046.35  1047.40
1598.66  1108.94  1110.50  1064.46  1343.29  1326.19  1050.86
1018.23  1617.73  1074.86   851.60   996.92  1300.76   975.86
For these data, a statistical software package gives the sample mean
$\bar{X}$ = 1,122.7 gallons and the sample standard deviation
s = 295.72 gallons.
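The two summary statistics can be checked with Python's standard
library in place of a statistical package. The variable names are
illustrative; the 35 values are the observations of the example.

```python
import statistics

# Annual consumption (gallons) of the 35 homes in the example.
consumption = [
    1150.25, 1352.67,  983.45, 1365.11,  942.71, 1577.77,  330.00,
     872.37, 1126.57, 1459.56, 1252.01,  941.96,  767.37, 1013.27,
    1402.59, 1184.17,  373.91, 1598.57, 1069.32, 1046.35, 1047.40,
    1598.66, 1108.94, 1110.50, 1064.46, 1343.29, 1326.19, 1050.86,
    1018.23, 1617.73, 1074.86,  851.60,  996.92, 1300.76,  975.86,
]

n = len(consumption)
x_bar = statistics.fmean(consumption)   # sample mean
s = statistics.stdev(consumption)       # sample standard deviation (divisor n - 1)
print(f"n = {n}, mean = {x_bar:.2f}, s = {s:.2f}")
```

The results agree with the values quoted above to within rounding.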
If the manager would like to have 95% confidence that the interval
obtained includes the population average amount of oil consumed per
year, then using $\bar{X}$ = 1122.7, s = 295.72, n = 35 and
$t_{34}$ = 2.0322 (treating n = 35 as a small sample and, of course,
assuming a normal population), we have

$$\bar{X} \pm t_{n-1}\,\frac{s}{\sqrt{n}} = 1122.7 \pm 2.0322\left(\frac{295.72}{\sqrt{35}}\right) = 1122.7 \pm 101.58$$

$$1021.12 \le \mu \le 1224.28$$
We would conclude with 95% confidence that the average amount of
cooking oil consumed per year is between 1021.12 and 1224.28
gallons. The 95% confidence interval states that we are 95% sure that
the sample we have selected is one for which the population mean lies
within the interval. More precisely, if all possible samples of size
35 were selected (something that would never be done in practice),
95% of the intervals developed would include the true population mean.
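The repeated-sampling interpretation can be illustrated by simulation.
The sketch below is built on assumptions that are not part of the
example: a hypothetical normal population with mean 1100 and standard
deviation 300, 4000 simulated samples instead of all possible samples,
and an arbitrary seed. Only n = 35 and the t value 2.0322 come from the
text.

```python
import math
import random
import statistics

def t_interval(sample, t_crit):
    """Interval: sample mean plus or minus t_crit * s / sqrt(n)."""
    n = len(sample)
    half_width = t_crit * statistics.stdev(sample) / math.sqrt(n)
    centre = statistics.fmean(sample)
    return centre - half_width, centre + half_width

rng = random.Random(1)
TRUE_MEAN, TRUE_SD = 1100.0, 300.0   # hypothetical normal population
N, T_34 = 35, 2.0322                 # sample size and t value from the text
TRIALS = 4000                        # stands in for "all possible samples"

hits = 0
for _ in range(TRIALS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    low, high = t_interval(sample, T_34)
    hits += low <= TRUE_MEAN <= high   # True counts as 1

coverage = hits / TRIALS
print(f"share of intervals that include the true mean: {coverage:.3f}")
```

The share printed should lie close to 0.95; any single interval either
contains the true mean or does not.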
If one wants to treat the case as a sufficiently large sample without
the normal population assumption, the Central Limit Theorem can be
invoked, and the interval is obtained as

$$\bar{X} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}} = 1122.7 \pm 1.96\left(\frac{295.72}{\sqrt{35}}\right) = 1122.7 \pm 97.97$$

$$1024.73 \le \mu \le 1220.67$$

(which is narrower than the interval obtained using the t-distribution).
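Both traditional intervals can be reproduced from the summary
statistics alone. In this sketch the t value is the one quoted in the
text, the z value is taken from NormalDist in Python's statistics
module, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

x_bar, s, n = 1122.7, 295.72, 35       # summary statistics from the text
std_error = s / math.sqrt(n)           # estimated standard error of the mean

t_crit = 2.0322                        # t value for 34 degrees of freedom, as quoted
z_crit = NormalDist().inv_cdf(0.975)   # upper 2.5% point of the standard normal

t_low, t_high = x_bar - t_crit * std_error, x_bar + t_crit * std_error
z_low, z_high = x_bar - z_crit * std_error, x_bar + z_crit * std_error

print(f"t interval: {t_low:.2f} to {t_high:.2f}")
print(f"z interval: {z_low:.2f} to {z_high:.2f}")
```

Because the z value is smaller than the t value, the z interval is
always the narrower of the two for the same data.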
Values of the first resample, drawn with replacement from the initial
sample:

1150.25  1013.27   767.37   373.91  1050.86   373.91  1110.50  1184.17
 996.92  1050.86  1074.86  1069.32  1013.27  1047.40  1459.56  1326.19
 330.00   330.00  1018.23   975.86  1126.57  1064.46   373.91  1110.50
1365.11  1402.59  1110.50  1343.29  1598.66  1018.23  1343.29   941.96
Notice that in this first resample some values from the original
sample, such as 330.00, are repeated and others, such as 1,352.67, do
not occur.
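Repeats and omissions of this kind are to be expected whenever n values
are drawn with replacement from n observations. A short calculation,
given here only as an illustration, shows how often a given observation
is left out of a resample of size n = 35.

```python
n = 35  # size of the initial sample and of each resample

# Each draw misses one particular observation with probability (n - 1) / n,
# so all n draws miss it with probability ((n - 1) / n) ** n.
p_absent = ((n - 1) / n) ** n

# Expected number of different original observations in one resample.
expected_distinct = n * (1 - p_absent)

print(f"P(a given value is absent) = {p_absent:.4f}")
print(f"expected number of distinct values = {expected_distinct:.1f}")
```

Roughly a third of the original observations are therefore missing
from a typical resample, and the gaps are filled by repeated values.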
If this resampling process is performed m = 200 times, the resampling
distribution containing 200 resample means can be developed.
Table 3 presents the ordered array of the resampling distribution. To
form a 95% bootstrap confidence interval estimate of $\mu$, the
smallest 2.5% and the largest 2.5% of the resample means need to be
identified (step 6).
Table 3. Ordered array of the 200 resample means (gallons)

 988.50   994.49  1003.26  1004.25  1006.53  1014.38  1018.62
1030.21  1030.73  1032.04  1032.77  1034.17  1035.21  1041.45
1042.23  1046.16  1050.01  1050.97  1052.73  1053.00  1054.58
1055.08  1055.39  1055.76  1059.57  1059.88  1060.51  1061.70
1061.79  1062.46  1062.86  1063.31  1066.91  1068.84  1070.53
1071.00  1071.38  1074.99  1077.18  1077.63  1078.62  1079.41
1079.46  1081.45  1083.40  1084.25  1085.21  1085.98  1086.22
1086.85  1089.45  1089.97  1092.33  1092.58  1094.45  1095.24
1095.35  1095.40  1095.88  1096.26  1097.59  1100.06  1100.19
1100.20  1101.59  1103.41  1104.07  1104.82  1105.19  1106.67
1106.84  1111.14  1112.46  1112.47  1112.53  1113.58  1115.09
1115.71  1116.46  1116.51  1117.84  1118.67  1119.17  1119.71
1119.98  1120.31  1121.01  1121.46  1121.68  1121.92  1122.40
1123.34  1123.73  1124.38  1124.98  1125.63  1126.01  1127.00
1127.15  1127.30  1127.55  1127.86  1128.06  1129.03  1129.66
1129.84  1132.30  1132.78  1133.39  1134.03  1134.43  1135.07
1136.67  1136.90  1138.24  1138.52  1139.53  1139.81  1140.60
1142.72  1142.73  1143.03  1143.32  1143.35  1143.70  1144.37
1145.72  1146.21  1149.77  1150.87  1152.88  1153.13  1153.25
1153.35  1156.35  1156.43  1158.49  1158.62  1159.14  1159.45
1159.55  1161.01  1161.16  1161.50  1162.12  1163.15  1163.24
1164.06  1164.39  1165.45  1167.87  1168.50  1170.06  1171.11
1171.21  1172.03  1172.19  1172.90  1173.00  1173.37  1174.25
1176.32  1176.73  1178.39  1178.63  1182.16  1182.37  1183.12
1183.99  1184.27  1184.68  1185.80  1185.97  1186.88  1187.01
1187.27  1188.03  1188.25  1188.53  1189.39  1190.20  1191.10
1191.28  1191.86  1192.90  1193.21  1196.66  1199.30  1206.23
1206.28  1209.61  1216.85  1225.20  1226.46  1226.67  1227.02
1230.11  1233.43  1233.97  1251.88
Since 2.5% of 200 is 5, the fifth smallest value cuts off the lowest
2.5% and the fifth largest value cuts off the highest 2.5%. From
Table 3, we obtain 1006.53 gallons as the fifth smallest value and
1227.02 gallons as the fifth largest. Therefore, the 95% bootstrap
confidence interval for the population average amount of cooking oil
consumed is 1006.53 to 1227.02 gallons. This estimate is fairly close
to the traditional confidence interval estimate of 1021.12 to 1224.28
obtained earlier. However, the bootstrap estimate requires less
stringent assumptions than does the traditional confidence interval
estimate.
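The reading of the limits from the ordered array can be checked
mechanically. In the sketch below the 200 values are transcribed from
Table 3; the variable names are illustrative, and the rule for k is the
one described in step 6 (k equals $\alpha$/2 times m).

```python
# The 200 resample means, transcribed from Table 3 (gallons).
resample_means = [
     988.50,  994.49, 1003.26, 1004.25, 1006.53, 1014.38, 1018.62,
    1030.21, 1030.73, 1032.04, 1032.77, 1034.17, 1035.21, 1041.45,
    1042.23, 1046.16, 1050.01, 1050.97, 1052.73, 1053.00, 1054.58,
    1055.08, 1055.39, 1055.76, 1059.57, 1059.88, 1060.51, 1061.70,
    1061.79, 1062.46, 1062.86, 1063.31, 1066.91, 1068.84, 1070.53,
    1071.00, 1071.38, 1074.99, 1077.18, 1077.63, 1078.62, 1079.41,
    1079.46, 1081.45, 1083.40, 1084.25, 1085.21, 1085.98, 1086.22,
    1086.85, 1089.45, 1089.97, 1092.33, 1092.58, 1094.45, 1095.24,
    1095.35, 1095.40, 1095.88, 1096.26, 1097.59, 1100.06, 1100.19,
    1100.20, 1101.59, 1103.41, 1104.07, 1104.82, 1105.19, 1106.67,
    1106.84, 1111.14, 1112.46, 1112.47, 1112.53, 1113.58, 1115.09,
    1115.71, 1116.46, 1116.51, 1117.84, 1118.67, 1119.17, 1119.71,
    1119.98, 1120.31, 1121.01, 1121.46, 1121.68, 1121.92, 1122.40,
    1123.34, 1123.73, 1124.38, 1124.98, 1125.63, 1126.01, 1127.00,
    1127.15, 1127.30, 1127.55, 1127.86, 1128.06, 1129.03, 1129.66,
    1129.84, 1132.30, 1132.78, 1133.39, 1134.03, 1134.43, 1135.07,
    1136.67, 1136.90, 1138.24, 1138.52, 1139.53, 1139.81, 1140.60,
    1142.72, 1142.73, 1143.03, 1143.32, 1143.35, 1143.70, 1144.37,
    1145.72, 1146.21, 1149.77, 1150.87, 1152.88, 1153.13, 1153.25,
    1153.35, 1156.35, 1156.43, 1158.49, 1158.62, 1159.14, 1159.45,
    1159.55, 1161.01, 1161.16, 1161.50, 1162.12, 1163.15, 1163.24,
    1164.06, 1164.39, 1165.45, 1167.87, 1168.50, 1170.06, 1171.11,
    1171.21, 1172.03, 1172.19, 1172.90, 1173.00, 1173.37, 1174.25,
    1176.32, 1176.73, 1178.39, 1178.63, 1182.16, 1182.37, 1183.12,
    1183.99, 1184.27, 1184.68, 1185.80, 1185.97, 1186.88, 1187.01,
    1187.27, 1188.03, 1188.25, 1188.53, 1189.39, 1190.20, 1191.10,
    1191.28, 1191.86, 1192.90, 1193.21, 1196.66, 1199.30, 1206.23,
    1206.28, 1209.61, 1216.85, 1225.20, 1226.46, 1226.67, 1227.02,
    1230.11, 1233.43, 1233.97, 1251.88,
]

m = len(resample_means)
alpha = 0.05
k = round(m * alpha / 2)            # number of means cut off in each tail

ordered = sorted(resample_means)    # the array is already ordered; sort to be safe
lower, upper = ordered[k - 1], ordered[-k]   # k-th smallest, k-th largest

print(f"m = {m}, k = {k}")
print(f"95% bootstrap interval: {lower:.2f} to {upper:.2f}")
```

Only the k values at each end of the array determine the limits; the
remaining values matter only through the count m.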