Figure 1: The distribution of 50 average precision scores for two example IR systems submitted to TREC 3. The mean average precision (MAP) of system A is 0.258 and the MAP of system B is 0.206.
2.2 Wilcoxon Signed Rank Test

The null hypothesis of the Wilcoxon signed rank test is the same as that of the randomization test, i.e. systems A and B have the same distribution [13]. Whereas the randomization test can use any test statistic, the Wilcoxon test uses as its test statistic the sum of the signed ranks of the per-topic differences.
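To make the test statistic concrete, the following C++ sketch (an illustration of the standard statistic, not the authors' program) computes W+, the sum of the ranks of the positive per-topic differences: zero differences are dropped, the remaining differences are ranked by absolute value with tied groups receiving their average rank, and the ranks of the positive differences are summed. An implementation such as R's wilcox.test then converts this statistic to a p-value. The differences in main are hypothetical.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // W+, the usual Wilcoxon signed rank statistic: drop zero differences,
    // rank the rest by absolute value (ties get their average rank), and
    // sum the ranks of the positive differences.
    double signed_rank_statistic(std::vector<double> d) {
        d.erase(std::remove(d.begin(), d.end(), 0.0), d.end());
        std::vector<size_t> idx(d.size());
        for (size_t i = 0; i < idx.size(); ++i) idx[i] = i;
        std::sort(idx.begin(), idx.end(), [&d](size_t a, size_t b) {
            return std::fabs(d[a]) < std::fabs(d[b]); });
        std::vector<double> rank(d.size());
        for (size_t i = 0; i < idx.size(); ) {
            size_t j = i;
            while (j < idx.size() &&
                   std::fabs(d[idx[j]]) == std::fabs(d[idx[i]]))
                ++j;
            double avg = (i + 1 + j) / 2.0;  // average of ranks i+1 .. j
            for (size_t k = i; k < j; ++k) rank[idx[k]] = avg;
            i = j;
        }
        double w_plus = 0.0;
        for (size_t i = 0; i < d.size(); ++i)
            if (d[i] > 0.0) w_plus += rank[i];
        return w_plus;
    }

    int main() {
        // Hypothetical per-topic differences in average precision (A - B).
        std::vector<double> d = {0.02, -0.01, 0.05, 0.00, 0.03, -0.04};
        std::printf("W+ = %.1f\n", signed_rank_statistic(d));
        return 0;
    }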
2.3 Sign Test

For our example, system A has a higher average precision than system B on 29 topics. The two-sided sign test with 29 successes out of 50 trials has a p-value of 0.3222. The sign minimum difference test with the minimum difference set to 0.01 has 25 successes out of 43 trials (seven "ties") and a p-value of 0.3604. Both of these p-values are much larger than the p-values for the randomization (0.0138) and Wilcoxon (0.0560) tests. An IR researcher using the sign test would definitely fail to reject the null hypothesis and would conclude that the difference between systems A and B was a chance occurrence.

While the choice of 0.01 for a minimum difference in average precision made sense to us, the sign minimum difference test is clearly sensitive to the choice of the minimum difference. If we increase the minimum difference to 0.05, the p-value drops to 0.0987.

The sign test's test statistic is one that few IR researchers will report, but if an IR researcher does want to report the number of successes, the sign test appears to be a good candidate for testing statistical significance.

The sign test, as with the Wilcoxon, is simply the randomization test with a specific test statistic [11]. This can be seen by realizing that the null distribution of successes (the binomial distribution) is obtained by counting the number of successes for the 2^N permutations of the scores for N trials, where for our IR experiments N is the number of topics.

As with the Wilcoxon, given the modern use of computers to compute statistical significance, there seem to be only disadvantages to the use of the sign test compared to the randomization test used with the same test statistic as we are using to measure the difference between two systems.
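The sign test p-values above are straightforward to reproduce. The sketch below (ours; the authors used R's binom.test) computes the exact two-sided binomial p-value with success probability 1/2. For the sign minimum difference test, topics whose absolute difference falls below the threshold are treated as ties and excluded from the trial count.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Exact two-sided sign test: P(X <= k) and P(X >= k) under
    // Binomial(trials, 0.5); the two-sided p-value doubles the smaller tail.
    double sign_test_p(int successes, int trials) {
        std::vector<double> pmf(trials + 1);
        pmf[0] = std::pow(0.5, trials);
        for (int i = 0; i < trials; ++i)
            pmf[i + 1] = pmf[i] * (trials - i) / (i + 1.0);
        double lower = 0.0, upper = 0.0;
        for (int i = 0; i <= successes; ++i) lower += pmf[i];
        for (int i = successes; i <= trials; ++i) upper += pmf[i];
        return std::min(1.0, 2.0 * std::min(lower, upper));
    }

    int main() {
        // System A beats system B on 29 of 50 topics: p = 0.3222.
        std::printf("sign test:               %.4f\n", sign_test_p(29, 50));
        // A minimum difference of 0.01 makes seven topics ties,
        // leaving 25 successes out of 43 trials: p = 0.3604.
        std::printf("sign minimum difference: %.4f\n", sign_test_p(25, 43));
        return 0;
    }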
2.4 Bootstrap Test

We determine the fraction of samples in the shifted distribution that have an absolute value as large or larger than our experiment's difference. This fraction is the p-value.

For our example of system A compared to system B, the bootstrap p-value is 0.0107. This is comparable to the randomization test's p-value of 0.0138.
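A minimal sketch of one common formulation of the bootstrap shift method follows (an illustration under our reading of the method, not the authors' C++ program): resample the per-topic differences with replacement, compute the mean of each resample, shift each replicate by the observed mean so the bootstrap distribution is centered at zero as the null hypothesis requires, and report the fraction of samples at least as extreme as the observed mean difference. The differences in main are hypothetical.

    #include <cmath>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Bootstrap shift method (one common formulation). d holds per-topic
    // differences in average precision; B is the number of bootstrap samples.
    double bootstrap_p(const std::vector<double>& d, int B, std::mt19937& rng) {
        const size_t n = d.size();
        double observed = 0.0;
        for (double x : d) observed += x;
        observed /= n;
        std::uniform_int_distribution<size_t> pick(0, n - 1);
        int extreme = 0;
        for (int b = 0; b < B; ++b) {
            double mean = 0.0;
            for (size_t i = 0; i < n; ++i) mean += d[pick(rng)];
            mean /= n;
            // Shift the replicate by the observed mean so the bootstrap
            // distribution is centered at zero under the null hypothesis.
            if (std::fabs(mean - observed) >= std::fabs(observed)) ++extreme;
        }
        return static_cast<double>(extreme) / B;
    }

    int main() {
        std::mt19937 rng(19937);  // Mersenne Twister, as in [12]
        std::vector<double> d = {0.05, -0.02, 0.07, 0.01, -0.03,
                                 0.04, 0.06, -0.01, 0.02, 0.03};  // hypothetical
        std::printf("bootstrap p = %.4f\n", bootstrap_p(d, 100000, rng));
        return 0;
    }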
2.5 Student's Paired t-test

In some ways the bootstrap bridges the divide between the randomization test and Student's t-test. The randomization test is distribution-free and is free of a random sampling assumption. The bootstrap is distribution-free but assumes random sampling from a population. The t-test's null hypothesis is that systems A and B are random samples from the same normal distribution [9].

The details of the paired t-test can be found in most statistics texts [1]. The two-sided p-value of the t-test is 0.0153, which is in agreement with both the randomization (0.0138) and bootstrap (0.0107) tests.
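For completeness, a sketch of the computation (the authors used R's t.test; here we assume Boost.Math is available for the t distribution's CDF): t = d̄ / (s/√n) with n − 1 degrees of freedom, where d̄ and s are the mean and standard deviation of the per-topic differences. The data in main are hypothetical.

    #include <cmath>
    #include <cstdio>
    #include <vector>
    #include <boost/math/distributions/students_t.hpp>

    // Two-sided paired t-test on per-topic differences.
    double paired_t_test_p(const std::vector<double>& d) {
        const double n = static_cast<double>(d.size());
        double mean = 0.0;
        for (double x : d) mean += x;
        mean /= n;
        double ss = 0.0;
        for (double x : d) ss += (x - mean) * (x - mean);
        const double se = std::sqrt(ss / (n - 1.0)) / std::sqrt(n);
        const double t = mean / se;
        boost::math::students_t dist(n - 1.0);
        // Two-sided p-value: both tails of the t distribution.
        return 2.0 * boost::math::cdf(boost::math::complement(dist, std::fabs(t)));
    }

    int main() {
        // Hypothetical per-topic differences in average precision (A - B).
        std::vector<double> d = {0.05, -0.02, 0.07, 0.01, -0.03,
                                 0.04, 0.06, -0.01, 0.02, 0.03};
        std::printf("paired t-test p = %.4f\n", paired_t_test_p(d));
        return 0;
    }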
In 1935 Fisher [9] presented the randomization test as an "independent check on the more expeditious methods in common use." The more expeditious methods that he refers to are the methods of Student's t-test. He was responding to criticisms of the t-test's use of the normal distribution in its null hypothesis. The randomization test provided a means to test "the wider hypothesis in which no normality of distribution is implied." His contention was that if the p-value produced by the t-test was close to the p-value produced by the randomization test, then the t-test could be trusted. In practice, the t-test has been found to be a good approximation to the randomization test [1].
We computed all runs' scores using trec_eval [3]. We specified the option -M1000 to limit the maximum number of documents per query to 1000.

We measured statistical significance using the Student's paired t-test, the Wilcoxon signed rank test, the sign test, the sign minimum difference test, the bootstrap shift method, and the randomization test. The minimum difference in average precision was 0.01 for the sign minimum difference test. All reported p-values are for two-sided tests.

We used the implementations of the t-test and Wilcoxon in R [15] (t.test and wilcox.test). We implemented the sign test in R using R's binomial test (binom.test), with ties reducing the number of trials.

We implemented the randomization and bootstrap tests ourselves in C++. Our program can input the relational output of trec_eval.

                rand.    t-test   boot.    Wilcx.   sign     sign d.
    rand.       -        0.007    0.011    0.153    0.256    0.240
    t-test      0.007    -        0.007    0.153    0.255    0.240
    boot.       0.011    0.007    -        0.153    0.258    0.243
    Wilcx.      0.153    0.153    0.153    -        0.191    0.165
    sign        0.256    0.255    0.258    0.191    -        0.131
    sign d.     0.240    0.240    0.243    0.165    0.131    -

Table 1: Root mean square errors among the randomization, t-test, bootstrap, Wilcoxon, sign, and sign minimum difference tests on 11986 pairs of TREC runs. This subset of the 18820 pairs eliminates pairs for which all tests agree the p-value was < 0.0001.
Since we cannot feasibly compute the 2^50 permutations required for an exact randomization test of a pair of TREC runs, each scored on 50 topics, we randomly sampled from the permutations. The coefficient of variation of the estimated p-value p̂, as shown by Efron and Tibshirani [8], is:

    cv_B(p̂) = ((1 − p) / (pB))^(1/2)

where B is the number of samples and p is the actual one-sided p-value. The coefficient of variation is the standard error of the estimated p-value divided by the mean. For example, to estimate a p-value of 0.05 with an error of 10% requires setting p = 0.05 and B = 1901 to produce a cv_B(p̂) = 0.1. To estimate the number of samples for a two-sided test, we divide p in half.
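Rearranging the formula gives the number of samples needed for a target coefficient of variation, B = (1 − p)/(p · cv²). A quick check (our arithmetic) reproduces the figure quoted above up to rounding:

    #include <cmath>
    #include <cstdio>

    // Samples needed so that the estimated one-sided p-value p has
    // coefficient of variation cv: B = (1 - p) / (p * cv^2).
    long samples_needed(double p, double cv) {
        return static_cast<long>(std::ceil((1.0 - p) / (p * cv * cv)));
    }

    int main() {
        // p = 0.05 at 10% error: prints 1900, matching the B = 1901
        // quoted in the text up to rounding.
        std::printf("B = %ld\n", samples_needed(0.05, 0.1));
        return 0;
    }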
For our comparative experiments, we used 100,000 samples. For a set of experiments in the discussion, we used 20 million samples to obtain a highly accurate p-value for the randomization test. The p-value for the randomization test with 100K samples differs little from the value from 20M samples.

With 100K samples, a two-sided 0.05 p-value is computed with an estimated error of 2% or ±0.001, and a 0.01 p-value has an error of 4.5% or ±0.00045. This level of accuracy is very good.

With B = 20 × 10^6, an estimated two-sided p-value of 0.001 should be accurate to within 1% of its value. As estimated p-values get larger, they become more accurate estimates. For example, a 0.1 p-value will be estimated to within 0.01% or ±0.0001 of its value. Thus, even with the small p-values that concern most researchers, we will have calculated them to an estimated accuracy that allows us to use them as a gold standard to judge other tests with the same null hypothesis.

On a modern microprocessor, for a pair of runs each with 50 topics, our program computes over 511,000 randomization test samples per second. Thus, we can compute a randomization test p-value for a pair of runs in 0.2 seconds using only 100K samples.
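A minimal sketch of such a sampled randomization test follows (an illustration consistent with the description above, not the authors' program): each sample flips the sign of each per-topic difference with probability 1/2, which corresponds to swapping the two systems' scores on that topic, and the p-value is the fraction of samples whose mean is at least as extreme as the observed difference in MAP. The differences in main are hypothetical.

    #include <cmath>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Sampled paired randomization test. The test statistic is the
    // difference in MAP, i.e. the mean of the per-topic differences.
    double randomization_p(const std::vector<double>& d, int B,
                           std::mt19937& rng) {
        double observed = 0.0;
        for (double x : d) observed += x;
        observed /= d.size();
        std::bernoulli_distribution flip(0.5);
        int extreme = 0;
        for (int b = 0; b < B; ++b) {
            double mean = 0.0;
            // Flipping a difference's sign swaps the two systems'
            // scores on that topic.
            for (double x : d) mean += flip(rng) ? -x : x;
            mean /= d.size();
            if (std::fabs(mean) >= std::fabs(observed)) ++extreme;
        }
        return static_cast<double>(extreme) / B;  // two-sided estimate
    }

    int main() {
        std::mt19937 rng(19937);  // Mersenne Twister, as in [12]
        std::vector<double> d = {0.05, -0.02, 0.07, 0.01, -0.03,
                                 0.04, 0.06, -0.01, 0.02, 0.03};  // hypothetical
        std::printf("randomization p = %.4f\n", randomization_p(d, 100000, rng));
        return 0;
    }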
We do not know how to estimate the accuracy of the bootstrap test's p-values given a number of samples, but 100K samples is 10 to 100 times more samples than most texts recommend. Wilbur [21] and Sakai [16] both used 1000 samples for their bootstrap experiments.
Selection of a random number generator (RNG) is important when producing large numbers of samples. A poor RNG will have a small period and begin returning the same sequence of random numbers. We used Matsumoto and Nishimura's Mersenne Twister RNG [12], which is well suited for Monte Carlo simulations given its period of 2^19937 − 1.

4. RESULTS

In this section, we report the amount of agreement among the p-values produced by the various significance tests. If the significance tests agree with each other, there is little practical difference among the tests.

We computed the root mean square error between each test and each other test. The root mean square error (RMSE) is:

    RMSE = ((1/N) Σ_{i=1}^{N} (E_i − O_i)²)^(1/2)

where E_i is the estimated p-value given by one test and O_i is the other test's p-value.
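The computation is a short loop; the sketch below (our illustration, on made-up p-values) shows the calculation behind Table 1 as we understand it.

    #include <cassert>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // RMSE between the p-values two tests assign to the same run pairs.
    double rmse(const std::vector<double>& e, const std::vector<double>& o) {
        assert(e.size() == o.size() && !e.empty());
        double ss = 0.0;
        for (size_t i = 0; i < e.size(); ++i)
            ss += (e[i] - o[i]) * (e[i] - o[i]);
        return std::sqrt(ss / e.size());
    }

    int main() {
        // Hypothetical p-values from two tests on four run pairs.
        std::vector<double> a = {0.012, 0.048, 0.230, 0.004};
        std::vector<double> b = {0.015, 0.046, 0.210, 0.006};
        std::printf("RMSE = %.4f\n", rmse(a, b));
        return 0;
    }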
Table 1 shows the RMSE for each of the tests on a subset of the TREC run pairs. We formed this subset by removing all pairs for which all tests agreed the p-value was < 0.0001. This eliminated 6834 pairs and reduced the number of pairs from 18820 to 11986. This subset eliminates pairs that are obviously very different from each other, so different that a statistical test would likely never be used.

The randomization, bootstrap, and t tests largely agree with each other. The RMSE between these three tests is approximately 0.01, which is an error of 20% for a p-value of 0.05. An IR researcher testing systems similar to the TREC ad-hoc runs would find no practical difference among these three tests. The Wilcoxon and sign tests do not agree with any of the other tests. The sign minimum difference test agrees better with the other tests but is still significantly different. Of note, the sign and sign minimum difference tests produce significantly different p-values from each other.

We also looked at a subset of 2868 TREC run pairs where either the randomization, bootstrap, or t test produced a p-value between 0.01 and 0.1. For this subset, the RMSE between the randomization, t-test, and bootstrap tests averaged 0.006, which is only a 12% error for a 0.05 p-value. For this subset of runs, the RMSE between these three tests and the Wilcoxon decreased but was still a large error of approximately 0.06. The sign and sign minimum difference tests showed little improvement.

Figure 4 shows the relationship between the randomization, bootstrap, and t tests' p-values. A point is drawn for each pair of runs.
Figure 4: The three-way agreement between the randomization test, the Student's t-test, and the bootstrap test. Plotted are the p-values for each test vs. each other test. The figure on the left shows the full range of p-values from 0 to 1, while the figure on the right shows a closer look at smaller p-values.
Both the t-test and bootstrap appear to have a tendency to be less confident in dissimilar pairs (small randomization test p-value) and produce larger p-values than the randomization test, but these tests find similar pairs to be more dissimilar than the randomization test does. While the RMSE values in Table 1 say that overall the t-test agrees equally with the randomization and bootstrap tests, Figure 4 shows that the t-test has fewer outliers with the bootstrap.

Of note are two pairs of runs for which the randomization test produces p-values of around 0.07, the bootstrap produces p-values of around 0.1, and the t-test produces much larger p-values (0.17 and 0.22). These two pairs may be rare examples of where the t-test's normality assumption leads to different p-values compared to the distribution-free randomization and bootstrap tests.

Looking at the view of smaller p-values on the right of Figure 4, we see that the behavior of the three tests remains the same except that there is a small but noticeable systematic bias towards smaller p-values for the bootstrap when compared to both the randomization and t tests. By adding 0.005 to the bootstrap p-values, we were able to reduce the overall RMSE between the bootstrap and the t-test from 0.007 to 0.005 and between the bootstrap and the randomization test from 0.011 to 0.009.

Figure 5 plots the randomization test's p-value versus the Wilcoxon and sign minimum difference tests' p-values. As these two tests are variants of the randomization test, we use the randomization test for comparison purposes. The different test statistics of the three tests lead to significantly different p-values. The bands for the sign test are a result of the limited number of p-values the test can produce. Compared to the randomization test, and thus to the t-test and bootstrap, the Wilcoxon and sign tests will result in failures to detect significance and false detections of significance.

5. DISCUSSION

To our understanding, the tests we evaluated are all valid tests. By valid, we mean that the test produces a p-value that is close to the theoretical p-value for the test statistic under the null hypothesis. Unless a researcher is inventing a new hypothesis test, an established test is not going to be wrong in and of itself.

A researcher may misapply a test by evaluating performance on one criterion and testing significance using a different criterion. For example, a researcher may decide to report a difference in the median average precision, but mistakenly test the significance of the difference in mean average precision. Or, the researcher may choose a test with an inappropriate null hypothesis.

The strong agreement among the randomization, t-test, and bootstrap shows that for the typical TREC style evaluation with 50 topics, there is no practical difference in the null hypotheses of these three tests.

Even though the Wilcoxon and sign tests have the same null hypothesis as the randomization test, these two tests utilize different criteria (test statistics) and produce very different p-values compared to all of the other tests.

The use of the sign and Wilcoxon tests should have ceased some time ago based simply on the fact that they test criteria that do not match the criteria of interest. The sign and Wilcoxon tests were appropriate before affordable computation, but are inappropriate today.
Figure 5: The relationship between the randomization test’s p-values and the Wilcoxon and sign minimum
difference tests’ p-values. The Wilcoxon test is on the left and the sign minimum difference test is on the
right. A point is drawn for each pair of runs. The x axis is the p-value produced by the randomization test
run with 100K samples, and the y axis is the p-value of the other test.
The sign test retains validity if the only thing one can measure is a preference for one system over another and this preference has no scale, but for the majority of IR experiments, this scenario is not the case.

A researcher wanting a distribution-free test with no assumptions of random sampling should use the randomization test with the test statistic of their choice and not the Wilcoxon or sign tests.

5.1 Wilcoxon and Sign Tests

The Wilcoxon and sign tests are simplified variants of the randomization test. Both of these tests gained popularity before computer power made the randomization test feasible. Here we look at the degree to which use of these simplified tests results in errors compared to the randomization test.

Common practice is to declare results significant when a p-value is less than or equal to some value α. Often α is set to 0.05 by researchers. It is somewhat misleading to turn the p-value into a binary decision. For example, there is little difference between a p-value of 0.049 and 0.051, but one is declared significant and the other not. Our preference is to report the p-value and flag results meeting the decision criteria.

Nevertheless, some decision must often be made between significant or not. Turning the p-value into a binary decision allows us to examine two questions about the comparative value of statistical tests:

1. What percent of significant results will a researcher mistakenly judge to be insignificant?

2. What percent of reported significant results will actually be insignificant?

We used a randomization test with 20 million samples to produce a highly accurate estimate of the p-value. Given its accuracy, we use it to judge which results are significant at various values of α for the null hypothesis of the randomization test. Recall that the null hypotheses of the Wilcoxon and sign tests are the same as the randomization test's. The only difference between the randomization, Wilcoxon, and sign tests is that they have different test statistics. The randomization test's statistic matches our statistic of interest: the difference in mean average precision.

For example, if the randomization test estimates the p-value to be 0.006 and we set α = 0.01, we will assume the result is significant. If another test estimates the p-value to be greater than α, that is a miss. If the other p-value is less than or equal to α, the other test scores a hit. When the randomization test finds the p-value to be greater than α, the other test can false alarm by returning a p-value less than α. Table 2 shows a contingency table summarizing hits, misses, and false alarms.

                           Randomization Test
    Other Test             Significant      Not Significant
    Significant            H = Hit          F = False Alarm
    Not Significant        M = Miss         Z

Table 2: The randomization test is used to determine significance against some α. If the other test returns a p-value on the same side of α, it scores a hit or a correct rejection of the null hypothesis (Z). If the other test returns a p-value on the opposite side of α, it scores a miss or a false alarm.
[Figure 6: Miss rate (left) and false alarm ratio (right) for the sign, sign minimum difference (Sign D.), and Wilcoxon tests, plotted against the relative percent increase in mean average precision.]
With these definitions of a hit, miss, and false alarm, we can define the miss rate and false alarm ratio as measures of questions 1 and 2 above:

    MissRate = M / (H + M)

where M is the number of misses and H is the number of hits, and

    FalseAlarmRatio = F / (H + F)

where F is the number of false alarms and H is the number of hits. The false alarm ratio is not the false alarm rate.
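The bookkeeping implied by Table 2 is a simple tally; below is a sketch (ours, over hypothetical p-values) of scoring another test against the randomization test at a significance level α.

    #include <cstdio>
    #include <vector>

    // Tally hits, misses, and false alarms (Table 2) for another test
    // judged against the randomization test at significance level alpha.
    struct Tally { int hit = 0, miss = 0, false_alarm = 0; };

    Tally tally(const std::vector<double>& rand_p,
                const std::vector<double>& other_p, double alpha) {
        Tally t;
        for (size_t i = 0; i < rand_p.size(); ++i) {
            bool sig_rand = rand_p[i] <= alpha;
            bool sig_other = other_p[i] <= alpha;
            if (sig_rand && sig_other) ++t.hit;
            else if (sig_rand && !sig_other) ++t.miss;
            else if (!sig_rand && sig_other) ++t.false_alarm;
            // Neither significant: a correct rejection (Z), not tallied.
        }
        return t;
    }

    int main() {
        // Hypothetical p-values for five run pairs.
        std::vector<double> rand_p  = {0.006, 0.030, 0.080, 0.200, 0.040};
        std::vector<double> other_p = {0.020, 0.060, 0.030, 0.300, 0.045};
        Tally t = tally(rand_p, other_p, 0.05);
        std::printf("miss rate = %.2f, false alarm ratio = %.2f\n",
                    static_cast<double>(t.miss) / (t.hit + t.miss),
                    static_cast<double>(t.false_alarm) / (t.hit + t.false_alarm));
        return 0;
    }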
Another way to understand the questions we are addressing is as follows. A researcher is given access to two statistical significance tests. The researcher is told that one is much more accurate in its p-values. To get an understanding of how poor the poorer test is, the researcher says "I consider differences with p-values less than α to be significant. I always have. If I had used the better test instead of the poorer test, what percentage of my previously reported significant results would I now consider to be insignificant? On the flip side, how many significant results did I fail to publish?"

The miss rate and false alarm ratio can be thought of as the rates at which the researcher would be changing decisions of significance if the researcher switched from the Wilcoxon or sign test to the randomization test.

As we stated in the introduction, the goal of the researcher is to make progress by finding new methods that are better than existing methods and to avoid the promotion of methods that are worse.

Figures 6 and 7 show the miss rate and false alarm ratio for the sign, sign minimum difference (sign d.), and Wilcoxon tests when α is set to 0.1 and 0.05. We show α = 0.1 both as an "easy" significance level and for the researcher who may be interested in the behavior of the tests when they produce one-sided p-values and α = 0.05. In all cases, all of our tests produced two-sided p-values.

Given the ad-hoc TREC run pairs, if a researcher reports significance for a small improvement using the Wilcoxon or sign test, we would have doubt in that result. Additionally, an IR researcher using the Wilcoxon or sign tests could fail to detect significant advances in IR techniques.

5.2 Randomization vs. Bootstrap vs. t-test

The randomization, bootstrap, and t tests all agreed with each other given the TREC runs. Which of these should one prefer to use over the others? One approach, recommended by Hull [10], is to compute the p-value for all tests of interest and, if they disagree, look further at the experiment and the tests' criteria and null hypotheses to decide which test is most appropriate.

We have seen with the Wilcoxon and sign tests the mistakes an IR researcher can make using a significance test that utilizes one criterion while judging and presenting results using another criterion. This issue with the choice of test statistic goes beyond the Wilcoxon and sign tests. We ran an additional set of experiments where we calculated the p-value for the randomization test using the difference in median average precision. The p-values for the median do not agree with the p-values for the difference in mean average precision.

The IR researcher should select a significance test that uses the same test statistic as the researcher is using to compare systems. As a result, Student's t-test can only be used for the difference between means and not for the median or other test statistics. Both the randomization test and the bootstrap can be used with any test statistic.

While our experiment found little practical difference among the different null hypotheses of the randomization, bootstrap, and t tests, this may not always be so.

Researchers have been quite concerned that the null hypothesis of the t-test is not applicable to IR [19, 18, 21]. On our experimental data, this concern does not appear to be justified, but all of our experiments used a sample size N of 50 topics. N = 50 is a large sample. At smaller sample sizes, violations of normality may result in errors in the t-test. Cohen [4] makes the strong point that the randomization test performs as well as the t-test when the normality assumption is met and that the randomization test outperforms the t-test when the normality assumption is unmet.
[Figure 7: Miss rate (left) and false alarm ratio (right) for the sign, sign minimum difference (Sign D.), and Wilcoxon tests, plotted against the relative percent increase in mean average precision.]
As such, the researcher is safe to use the randomization test in either case but must be wary of the t-test.

Between the randomization (permutation) test and the bootstrap, which is better? Efron invented the bootstrap in 1979. Efron and Tibshirani [8] write at the end of chapter 15:

    Permutation methods tend to apply to only a narrow range of problems. However when they apply, as in testing F = G in a two-sample problem, they give gratifyingly exact answers without parametric assumptions. The bootstrap distribution was originally called the "combination distribution." It was designed to extend the virtues of permutation testing to the great majority of statistical problems where there is nothing to permute. When there is something to permute, as in Figure 15.1, it is a good idea to do so, even if other methods like the bootstrap are also brought to bear.

The randomization method does apply to the typical IR experiment. Noreen [14] has reservations about the use of the bootstrap for hypothesis testing.

Our largest concern with the bootstrap is the systematic bias towards smaller p-values that we found in comparison to both the randomization and t tests. This bias may be an artifact of our implementation, but an issue with the bootstrap is the number of its possible variations and the need for expert guidance on its correct use. For example, a common technique is to Studentize the test statistic to improve the bootstrap's estimation of the p-value [8]. It is unclear when one needs to do this, and additionally such a process would seem to limit the set of applicable test statistics. Unlike the bootstrap, the randomization test is simple to understand and implement.

Another issue with both the bootstrap and the t-test is that both have as part of their null hypotheses that the scores from the two IR systems are random samples from a single population. In contrast, the randomization test only concerns itself with the other possible experimental outcomes given the experimental data. The randomization test does not rely on the often incorrect idea that the scores are random samples from a population.

The test topics used in TREC evaluations are not random samples from the population of topics. TREC topics are hand selected to meet various criteria such as the estimated number of relevant documents in the test collection [20]. Additionally, neither the assessors nor the document collection are random.

The randomization test looks only at the experiment and produces a probability that the experimental results could have occurred by chance, without any assumption of random sampling from a population.

An IR researcher may argue that the assumption of random samples from a population is required to draw an inference from the experiment to the larger world. This cannot be the case. IR researchers have long understood that inferences from their experiments must be carefully drawn given the construction of the test setup. Using a significance test based on the assumption of random sampling is not warranted for most IR research.

Given these fundamental differences between the randomization, bootstrap, and t tests, we recommend the randomization test be used when it is applicable. The randomization test is applicable to most IR experiments.

5.3 Other Metrics

Our results have focused on mean average precision (MAP). We also looked at how precision at 10 (P10), mean reciprocal rank (MRR), and R-precision affected the results. In general the tests behaved the same as for MAP. Of note, the Wilcoxon test showed less variation for MRR than for the other metrics.

6. RELATED WORK

Edgington's book [7] on randomization tests provides extensive coverage of the many aspects of the test and details how the test was created by Fisher in the 1930s and later developed by many other statisticians.
Box et al. provide an excellent explanation of the randomization test in chapter 4 of their classic text [1]. Efron and Tibshirani have a detailed chapter on the permutation (randomization) test in their book [8].

Kempthorne and Doerfler have shown that for a set of artificial distributions the randomization test is to be preferred to the Wilcoxon test, which is to be preferred to the sign test [11]. In contrast, our analysis is based on the actual score distributions of IR retrieval systems.

Hull reviewed Student's t-test, the Wilcoxon signed rank test, and the sign test and stressed the value of significance testing in IR [10]. Hull's suggestion to compare the output of the tests was part of the inspiration for our experimental methodology. Hull also made the point that the t-test tends to be robust to violations of its normality assumption.

Wilbur compared the randomization, bootstrap, Wilcoxon, and sign tests for IR evaluation but excluded the t-test based on its normality assumption [21]. Wilbur found the randomization test and the bootstrap test to perform well, but recommended the bootstrap over the other tests in part because of its greater generality.

Savoy advocated the use of the bootstrap hypothesis test as a solution to the problem that the normality assumption required of the t-test is clearly violated by the score distributions of IR experiments [18]. Sakai used bootstrap significance tests to evaluate evaluation metrics [16], while our emphasis was on the comparison of significance tests.

Box et al. stress that when comparative experiments properly use randomization of test subjects, the t-test is usually robust to violations of its assumptions and can be used as an approximation to the randomization test [1]. We have confirmed this to be the case for IR score distributions.

Both Sanderson and Zobel [17] and Cormack and Lynam [5] have found that the t-test should be preferred to both the Wilcoxon and sign tests. We have taken the additional step of comparing these tests to the randomization and bootstrap tests that have been proposed by others for significance testing in IR evaluation.

7. CONCLUSION

For a large collection of TREC ad-hoc retrieval system pairs, the randomization test, the bootstrap shift method test, and Student's t-test all produce comparable significance values (p-values). Given that an IR researcher will obtain a similar p-value from each of these tests, there is no practical difference between them.

On the same set of experimental data, the Wilcoxon signed rank test and the sign test both produced very different p-values. These two tests are variants of the randomization test with different test statistics. Before affordable computation existed, both of these tests provided easy-to-compute, approximate levels of significance. In comparison to the randomization test, both the Wilcoxon and sign tests can incorrectly predict significance and can fail to detect significant results. IR researchers should discontinue use of the Wilcoxon and sign tests.

The t-test is only applicable for measuring the significance of the difference between means. Both the randomization and bootstrap tests can use test statistics other than the mean, e.g. the median. For IR evaluation, we recommend the use of the randomization test with a test statistic that matches the test statistic used to measure the difference between two systems.

8. ACKNOWLEDGMENTS

We thank Trevor Strohman for his helpful discussions and feedback on an earlier draft of this paper. We also thank the anonymous reviewers for their helpful comments.

This work was supported in part by the Center for Intelligent Information Retrieval and in part by the Defense Advanced Research Projects Agency (DARPA) under contract number HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsor.

9. REFERENCES

[1] G. E. P. Box, W. G. Hunter, and J. S. Hunter. Statistics for Experimenters. John Wiley & Sons, 1978.
[2] J. V. Bradley. Distribution-Free Statistical Tests. Prentice-Hall, 1968.
[3] C. Buckley. trec_eval. http://trec.nist.gov/trec_eval/trec_eval.8.0.tar.gz.
[4] P. R. Cohen. Empirical Methods for Artificial Intelligence. MIT Press, 1995.
[5] G. V. Cormack and T. R. Lynam. Validity and power of t-test for comparing MAP and GMAP. In SIGIR '07. ACM Press, 2007.
[6] G. V. Cormack and T. R. Lynam. Statistical precision of information retrieval evaluation. In SIGIR '06, pages 533-540. ACM Press, 2006.
[7] E. S. Edgington. Randomization Tests. Marcel Dekker, 1995.
[8] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1998.
[9] R. A. Fisher. The Design of Experiments. Oliver and Boyd, first edition, 1935.
[10] D. Hull. Using statistical testing in the evaluation of retrieval experiments. In SIGIR '93, pages 329-338. ACM Press, 1993.
[11] O. Kempthorne and T. E. Doerfler. The behavior of some significance tests under experimental randomization. Biometrika, 56(2):231-248, August 1969.
[12] M. Matsumoto and T. Nishimura. Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation, 8(1):3-30, 1998.
[13] W. Mendenhall, D. D. Wackerly, and R. L. Scheaffer. Mathematical Statistics with Applications. PWS-KENT Publishing Company, 1990.
[14] E. W. Noreen. Computer Intensive Methods for Testing Hypotheses. John Wiley, 1989.
[15] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2004. ISBN 3-900051-07-0.
[16] T. Sakai. Evaluating evaluation metrics based on the bootstrap. In SIGIR '06, pages 525-532. ACM Press, 2006.
[17] M. Sanderson and J. Zobel. Information retrieval system evaluation: effort, sensitivity, and reliability. In SIGIR '05, pages 162-169. ACM Press, 2005.
[18] J. Savoy. Statistical inference in retrieval effectiveness evaluation. Information Processing & Management, 33(4):495-512, 1997.
[19] C. J. van Rijsbergen. Information Retrieval. Butterworths, second edition, 1979. http://www.dcs.gla.ac.uk/Keith/Preface.html.
[20] E. M. Voorhees and D. K. Harman, editors. TREC. MIT Press, 2005.
[21] W. J. Wilbur. Non-parametric significance tests of retrieval performance comparisons. Journal of Information Science, 20(4):270-284, 1994.
[22] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80-83, December 1945.