Figure 1: The distribution of 50 average precision scores for two example IR systems submitted to TREC 3. The mean average precision (MAP) of system A is 0.258 and the MAP of system B is 0.206.
2.2 Wilcoxon Signed Rank Test

The null hypothesis of the Wilcoxon signed rank test is the same as that of the randomization test, i.e. systems A and B have the same distribution [13]. Whereas the randomization test can use any test statistic, the Wilcoxon test uses as its test statistic the sum of the signed ranks of the per-topic differences.
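To make the test statistic concrete, the following C++ sketch (an illustration of the standard statistic, not the authors' program) computes W+, the sum of the ranks of the positive per-topic differences: zero differences are dropped, the remaining differences are ranked by absolute value with tied groups receiving their average rank, and the ranks of the positive differences are summed. An implementation such as R's wilcox.test then converts this statistic to a p-value. The differences in main are hypothetical.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // W+, the usual Wilcoxon signed rank statistic: drop zero differences,
    // rank the rest by absolute value (ties get their average rank), and
    // sum the ranks of the positive differences.
    double signed_rank_statistic(std::vector<double> d) {
        d.erase(std::remove(d.begin(), d.end(), 0.0), d.end());
        std::vector<size_t> idx(d.size());
        for (size_t i = 0; i < idx.size(); ++i) idx[i] = i;
        std::sort(idx.begin(), idx.end(), [&d](size_t a, size_t b) {
            return std::fabs(d[a]) < std::fabs(d[b]); });
        std::vector<double> rank(d.size());
        for (size_t i = 0; i < idx.size(); ) {
            size_t j = i;
            while (j < idx.size() &&
                   std::fabs(d[idx[j]]) == std::fabs(d[idx[i]]))
                ++j;
            double avg = (i + 1 + j) / 2.0;  // average of ranks i+1 .. j
            for (size_t k = i; k < j; ++k) rank[idx[k]] = avg;
            i = j;
        }
        double w_plus = 0.0;
        for (size_t i = 0; i < d.size(); ++i)
            if (d[i] > 0.0) w_plus += rank[i];
        return w_plus;
    }

    int main() {
        // Hypothetical per-topic differences in average precision (A - B).
        std::vector<double> d = {0.02, -0.01, 0.05, 0.00, 0.03, -0.04};
        std::printf("W+ = %.1f\n", signed_rank_statistic(d));
        return 0;
    }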
2.3 Sign Test

For our example, system A has a higher average precision than system B on 29 topics. The two-sided sign test with 29 successes out of 50 trials has a p-value of 0.3222. The sign minimum difference test with the minimum difference set to 0.01 has 25 successes out of 43 trials (seven "ties") and a p-value of 0.3604. Both of these p-values are much larger than the p-values for the randomization (0.0138) and Wilcoxon (0.0560) tests. An IR researcher using the sign test would definitely fail to reject the null hypothesis and would conclude that the difference between systems A and B was a chance occurrence.

While the choice of 0.01 for a minimum difference in average precision made sense to us, the sign minimum difference test is clearly sensitive to the choice of the minimum difference. If we increase the minimum difference to 0.05, the p-value drops to 0.0987.

The sign test's test statistic is one that few IR researchers will report, but if an IR researcher does want to report the number of successes, the sign test appears to be a good candidate for testing statistical significance.

The sign test, as with the Wilcoxon, is simply the randomization test with a specific test statistic [11]. This can be seen by realizing that the null distribution of successes (the binomial distribution) is obtained by counting the number of successes for the 2^N permutations of the scores for N trials, where for our IR experiments N is the number of topics.

As with the Wilcoxon, given the modern use of computers to compute statistical significance, there seem to be only disadvantages to the use of the sign test compared to the randomization test used with the same test statistic as we are using to measure the difference between two systems.
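The sign test p-values above are straightforward to reproduce. The sketch below (ours; the authors used R's binom.test) computes the exact two-sided binomial p-value with success probability 1/2. For the sign minimum difference test, topics whose absolute difference falls below the threshold are treated as ties and excluded from the trial count.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Exact two-sided sign test: P(X <= k) and P(X >= k) under
    // Binomial(trials, 0.5); the two-sided p-value doubles the smaller tail.
    double sign_test_p(int successes, int trials) {
        std::vector<double> pmf(trials + 1);
        pmf[0] = std::pow(0.5, trials);
        for (int i = 0; i < trials; ++i)
            pmf[i + 1] = pmf[i] * (trials - i) / (i + 1.0);
        double lower = 0.0, upper = 0.0;
        for (int i = 0; i <= successes; ++i) lower += pmf[i];
        for (int i = successes; i <= trials; ++i) upper += pmf[i];
        return std::min(1.0, 2.0 * std::min(lower, upper));
    }

    int main() {
        // System A beats system B on 29 of 50 topics: p = 0.3222.
        std::printf("sign test:               %.4f\n", sign_test_p(29, 50));
        // A minimum difference of 0.01 makes seven topics ties,
        // leaving 25 successes out of 43 trials: p = 0.3604.
        std::printf("sign minimum difference: %.4f\n", sign_test_p(25, 43));
        return 0;
    }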
2.4 Bootstrap Test

We determine the fraction of samples in the shifted distribution that have an absolute value as large or larger than our experiment's difference. This fraction is the p-value.

For our example of system A compared to system B, the bootstrap p-value is 0.0107. This is comparable to the randomization test's p-value of 0.0138.
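A minimal sketch of one common formulation of the bootstrap shift method follows (an illustration under our reading of the method, not the authors' C++ program): resample the per-topic differences with replacement, compute the mean of each resample, shift each replicate by the observed mean so the bootstrap distribution is centered at zero as the null hypothesis requires, and report the fraction of samples at least as extreme as the observed mean difference. The differences in main are hypothetical.

    #include <cmath>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Bootstrap shift method (one common formulation). d holds per-topic
    // differences in average precision; B is the number of bootstrap samples.
    double bootstrap_p(const std::vector<double>& d, int B, std::mt19937& rng) {
        const size_t n = d.size();
        double observed = 0.0;
        for (double x : d) observed += x;
        observed /= n;
        std::uniform_int_distribution<size_t> pick(0, n - 1);
        int extreme = 0;
        for (int b = 0; b < B; ++b) {
            double mean = 0.0;
            for (size_t i = 0; i < n; ++i) mean += d[pick(rng)];
            mean /= n;
            // Shift the replicate by the observed mean so the bootstrap
            // distribution is centered at zero under the null hypothesis.
            if (std::fabs(mean - observed) >= std::fabs(observed)) ++extreme;
        }
        return static_cast<double>(extreme) / B;
    }

    int main() {
        std::mt19937 rng(19937);  // Mersenne Twister, as in [12]
        std::vector<double> d = {0.05, -0.02, 0.07, 0.01, -0.03,
                                 0.04, 0.06, -0.01, 0.02, 0.03};  // hypothetical
        std::printf("bootstrap p = %.4f\n", bootstrap_p(d, 100000, rng));
        return 0;
    }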
2.5 Student's Paired t-test

In some ways the bootstrap bridges the divide between the randomization test and Student's t-test. The randomization test is distribution-free and is free of a random sampling assumption. The bootstrap is distribution-free but assumes random sampling from a population. The t-test's null hypothesis is that systems A and B are random samples from the same normal distribution [9].

The details of the paired t-test can be found in most statistics texts [1]. The two-sided p-value of the t-test is 0.0153, which is in agreement with both the randomization (0.0138) and bootstrap (0.0107) tests.
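For completeness, a sketch of the computation (the authors used R's t.test; here we assume Boost.Math is available for the t distribution's CDF): t = d̄ / (s/√n) with n − 1 degrees of freedom, where d̄ and s are the mean and standard deviation of the per-topic differences. The data in main are hypothetical.

    #include <cmath>
    #include <cstdio>
    #include <vector>
    #include <boost/math/distributions/students_t.hpp>

    // Two-sided paired t-test on per-topic differences.
    double paired_t_test_p(const std::vector<double>& d) {
        const double n = static_cast<double>(d.size());
        double mean = 0.0;
        for (double x : d) mean += x;
        mean /= n;
        double ss = 0.0;
        for (double x : d) ss += (x - mean) * (x - mean);
        const double se = std::sqrt(ss / (n - 1.0)) / std::sqrt(n);
        const double t = mean / se;
        boost::math::students_t dist(n - 1.0);
        // Two-sided p-value: both tails of the t distribution.
        return 2.0 * boost::math::cdf(boost::math::complement(dist, std::fabs(t)));
    }

    int main() {
        // Hypothetical per-topic differences in average precision (A - B).
        std::vector<double> d = {0.05, -0.02, 0.07, 0.01, -0.03,
                                 0.04, 0.06, -0.01, 0.02, 0.03};
        std::printf("paired t-test p = %.4f\n", paired_t_test_p(d));
        return 0;
    }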
In 1935 Fisher [9] presented the randomization test as an "independent check on the more expeditious methods in common use." The more expeditious methods that he refers to are the methods of Student's t-test. He was responding to criticisms of the t-test's use of the normal distribution in its null hypothesis. The randomization test provided a means to test "the wider hypothesis in which no normality of distribution is implied." His contention was that if the p-value produced by the t-test was close to the p-value produced by the randomization test, then the t-test could be trusted. In practice, the t-test has been found to be a good approximation to the randomization test [1].
We computed all runs' scores using trec_eval [3]. We specified the option -M1000 to limit the maximum number of documents per query to 1000.

We measured statistical significance using the Student's paired t-test, the Wilcoxon signed rank test, the sign test, the sign minimum difference test, the bootstrap shift method, and the randomization test. The minimum difference in average precision was 0.01 for the sign minimum difference test. All reported p-values are for two-sided tests.

We used the implementations of the t-test and Wilcoxon in R [15] (t.test and wilcox.test). We implemented the sign test in R using R's binomial test (binom.test), with ties reducing the number of trials.

We implemented the randomization and bootstrap tests ourselves in C++. Our program can input the relational output of trec_eval.

                rand.    t-test   boot.    Wilcx.   sign     sign d.
    rand.       -        0.007    0.011    0.153    0.256    0.240
    t-test      0.007    -        0.007    0.153    0.255    0.240
    boot.       0.011    0.007    -        0.153    0.258    0.243
    Wilcx.      0.153    0.153    0.153    -        0.191    0.165
    sign        0.256    0.255    0.258    0.191    -        0.131
    sign d.     0.240    0.240    0.243    0.165    0.131    -

Table 1: Root mean square errors among the randomization, t-test, bootstrap, Wilcoxon, sign, and sign minimum difference tests on 11986 pairs of TREC runs. This subset of the 18820 pairs eliminates pairs for which all tests agree the p-value was < 0.0001.
Since we cannot feasibly compute the 2^50 permutations required for an exact randomization test of a pair of TREC runs, each scored on 50 topics, we randomly sampled from the permutations. The coefficient of variation of the estimated p-value p̂, as shown by Efron and Tibshirani [8], is:

    cv_B(p̂) = ((1 − p) / (pB))^(1/2)

where B is the number of samples and p is the actual one-sided p-value. The coefficient of variation is the standard error of the estimated p-value divided by the mean. For example, to estimate a p-value of 0.05 with an error of 10% requires setting p = 0.05 and B = 1901 to produce a cv_B(p̂) = 0.1. To estimate the number of samples for a two-sided test, we divide p in half.
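Rearranging the formula gives the number of samples needed for a target coefficient of variation, B = (1 − p)/(p · cv²). A quick check (our arithmetic) reproduces the figure quoted above up to rounding:

    #include <cmath>
    #include <cstdio>

    // Samples needed so that the estimated one-sided p-value p has
    // coefficient of variation cv: B = (1 - p) / (p * cv^2).
    long samples_needed(double p, double cv) {
        return static_cast<long>(std::ceil((1.0 - p) / (p * cv * cv)));
    }

    int main() {
        // p = 0.05 at 10% error: prints 1900, matching the B = 1901
        // quoted in the text up to rounding.
        std::printf("B = %ld\n", samples_needed(0.05, 0.1));
        return 0;
    }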
For our comparative experiments, we used 100,000 samples. For a set of experiments in the discussion, we used 20 million samples to obtain a highly accurate p-value for the randomization test. The p-value for the randomization test with 100K samples differs little from the value from 20M samples.

With 100K samples, a two-sided 0.05 p-value is computed with an estimated error of 2% or ±0.001, and a 0.01 p-value has an error of 4.5% or ±0.00045. This level of accuracy is very good.

With B = 20 × 10^6, an estimated two-sided p-value of 0.001 should be accurate to within 1% of its value. As estimated p-values get larger, they become more accurate estimates. For example, a 0.1 p-value will be estimated to within 0.01% or ±0.0001 of its value. Thus, even with the small p-values that concern most researchers, we will have calculated them to an estimated accuracy that allows us to use them as a gold standard to judge other tests with the same null hypothesis.

On a modern microprocessor, for a pair of runs each with 50 topics, our program computes over 511,000 randomization test samples per second. Thus, we can compute a randomization test p-value for a pair of runs in 0.2 seconds using only 100K samples.
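A minimal sketch of such a sampled randomization test follows (an illustration consistent with the description above, not the authors' program): each sample flips the sign of each per-topic difference with probability 1/2, which corresponds to swapping the two systems' scores on that topic, and the p-value is the fraction of samples whose mean is at least as extreme as the observed difference in MAP. The differences in main are hypothetical.

    #include <cmath>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Sampled paired randomization test. The test statistic is the
    // difference in MAP, i.e. the mean of the per-topic differences.
    double randomization_p(const std::vector<double>& d, int B,
                           std::mt19937& rng) {
        double observed = 0.0;
        for (double x : d) observed += x;
        observed /= d.size();
        std::bernoulli_distribution flip(0.5);
        int extreme = 0;
        for (int b = 0; b < B; ++b) {
            double mean = 0.0;
            // Flipping a difference's sign swaps the two systems'
            // scores on that topic.
            for (double x : d) mean += flip(rng) ? -x : x;
            mean /= d.size();
            if (std::fabs(mean) >= std::fabs(observed)) ++extreme;
        }
        return static_cast<double>(extreme) / B;  // two-sided estimate
    }

    int main() {
        std::mt19937 rng(19937);  // Mersenne Twister, as in [12]
        std::vector<double> d = {0.05, -0.02, 0.07, 0.01, -0.03,
                                 0.04, 0.06, -0.01, 0.02, 0.03};  // hypothetical
        std::printf("randomization p = %.4f\n", randomization_p(d, 100000, rng));
        return 0;
    }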
We do not know how to estimate the accuracy of the bootstrap test's p-values given a number of samples, but 100K samples is 10 to 100 times more samples than most texts recommend. Wilbur [21] and Sakai [16] both used 1000 samples for their bootstrap experiments.
Selection of a random number generator (RNG) is important when producing large numbers of samples. A poor RNG will have a small period and begin returning the same sequence of random numbers. We used Matsumoto and Nishimura's Mersenne Twister RNG [12], which is well suited for Monte Carlo simulations given its period of 2^19937 − 1.

4. RESULTS

In this section, we report the amount of agreement among the p-values produced by the various significance tests. If the significance tests agree with each other, there is little practical difference among the tests.

We computed the root mean square error between each test and each other test. The root mean square error (RMSE) is:

    RMSE = ((1/N) Σ_{i=1}^{N} (E_i − O_i)²)^(1/2)

where E_i is the estimated p-value given by one test and O_i is the other test's p-value.
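The computation is a short loop; the sketch below (our illustration, on made-up p-values) shows the calculation behind Table 1 as we understand it.

    #include <cassert>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // RMSE between the p-values two tests assign to the same run pairs.
    double rmse(const std::vector<double>& e, const std::vector<double>& o) {
        assert(e.size() == o.size() && !e.empty());
        double ss = 0.0;
        for (size_t i = 0; i < e.size(); ++i)
            ss += (e[i] - o[i]) * (e[i] - o[i]);
        return std::sqrt(ss / e.size());
    }

    int main() {
        // Hypothetical p-values from two tests on four run pairs.
        std::vector<double> a = {0.012, 0.048, 0.230, 0.004};
        std::vector<double> b = {0.015, 0.046, 0.210, 0.006};
        std::printf("RMSE = %.4f\n", rmse(a, b));
        return 0;
    }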
Table 1 shows the RMSE for each of the tests on a subset of the TREC run pairs. We formed this subset by removing all pairs for which all tests agreed the p-value was < 0.0001. This eliminated 6834 pairs and reduced the number of pairs from 18820 to 11986. This subset eliminates pairs that are obviously very different from each other, so different that a statistical test would likely never be used.

The randomization, bootstrap, and t tests largely agree with each other. The RMSE between these three tests is approximately 0.01, which is an error of 20% for a p-value of 0.05. An IR researcher testing systems similar to the TREC ad-hoc runs would find no practical difference among these three tests. The Wilcoxon and sign tests do not agree with any of the other tests. The sign minimum difference test agrees better with the other tests but is still significantly different. Of note, the sign and sign minimum difference tests produce significantly different p-values from each other.

We also looked at a subset of 2868 TREC run pairs where either the randomization, bootstrap, or t test produced a p-value between 0.01 and 0.1. For this subset, the RMSE between the randomization, t-test, and bootstrap tests averaged 0.006, which is only a 12% error for a 0.05 p-value. For this subset of runs, the RMSE between these three tests and the Wilcoxon decreased but was still a large error of approximately 0.06. The sign and sign minimum difference tests showed little improvement.

Figure 4 shows the relationship between the randomization, bootstrap, and t tests' p-values. A point is drawn for each pair of runs.
Figure 4: The three-way agreement between the randomization test, the Student's t-test, and the bootstrap test. Plotted are the p-values for each test vs. each other test. The figure on the left shows the full range of p-values from 0 to 1, while the figure on the right shows a closer look at smaller p-values.
Both the t-test and bootstrap appear to have a tendency to be less confident in dissimilar pairs (small randomization test p-value) and produce larger p-values than the randomization test, but these tests find similar pairs to be more dissimilar than the randomization test does. While the RMSE values in Table 1 say that overall the t-test agrees equally with the randomization and bootstrap tests, Figure 4 shows that the t-test has fewer outliers with the bootstrap.

Of note are two pairs of runs for which the randomization test produces p-values of around 0.07, the bootstrap produces p-values of around 0.1, and the t-test produces much larger p-values (0.17 and 0.22). These two pairs may be rare examples of where the t-test's normality assumption leads to different p-values compared to the distribution-free randomization and bootstrap tests.

Looking at the view of smaller p-values on the right of Figure 4, we see that the behavior of the three tests remains the same except that there is a small but noticeable systematic bias towards smaller p-values for the bootstrap when compared to both the randomization and t tests. By adding 0.005 to the bootstrap p-values, we were able to reduce the overall RMSE between the bootstrap and the t-test from 0.007 to 0.005 and between the bootstrap and the randomization test from 0.011 to 0.009.

Figure 5 plots the randomization test's p-value versus the Wilcoxon and sign minimum difference tests' p-values. As these two tests are variants of the randomization test, we use the randomization test for comparison purposes. The different test statistics of the three tests lead to significantly different p-values. The bands for the sign test are a result of the limited number of p-values the test can produce. Compared to the randomization test, and thus to the t-test and bootstrap, the Wilcoxon and sign tests will result in failures to detect significance and false detections of significance.

5. DISCUSSION

To our understanding, the tests we evaluated are all valid tests. By valid, we mean that the test produces a p-value that is close to the theoretical p-value for the test statistic under the null hypothesis. Unless a researcher is inventing a new hypothesis test, an established test is not going to be wrong in and of itself.

A researcher may misapply a test by evaluating performance on one criterion and testing significance using a different criterion. For example, a researcher may decide to report a difference in the median average precision, but mistakenly test the significance of the difference in mean average precision. Or, the researcher may choose a test with an inappropriate null hypothesis.

The strong agreement among the randomization, t-test, and bootstrap shows that for the typical TREC style evaluation with 50 topics, there is no practical difference in the null hypotheses of these three tests.

Even though the Wilcoxon and sign tests have the same null hypothesis as the randomization test, these two tests utilize different criteria (test statistics) and produce very different p-values compared to all of the other tests.

The use of the sign and Wilcoxon tests should have ceased some time ago based simply on the fact that they test criteria that do not match the criteria of interest. The sign and Wilcoxon tests were appropriate before affordable computation, but are inappropriate today.
Figure 5: The relationship between the randomization test’s p-values and the Wilcoxon and sign minimum
difference tests’ p-values. The Wilcoxon test is on the left and the sign minimum difference test is on the
right. A point is drawn for each pair of runs. The x axis is the p-value produced by the randomization test
run with 100K samples, and the y axis is the p-value of the other test.
The sign test retains validity if the only thing one can measure is a preference for one system over another and this preference has no scale, but for the majority of IR experiments, this scenario is not the case.

A researcher wanting a distribution-free test with no assumptions of random sampling should use the randomization test with the test statistic of their choice and not the Wilcoxon or sign tests.

5.1 Wilcoxon and Sign Tests

The Wilcoxon and sign tests are simplified variants of the randomization test. Both of these tests gained popularity before computer power made the randomization test feasible. Here we look at the degree to which use of these simplified tests results in errors compared to the randomization test.

Common practice is to declare results significant when a p-value is less than or equal to some value α. Often α is set to 0.05 by researchers. It is somewhat misleading to turn the p-value into a binary decision. For example, there is little difference between a p-value of 0.049 and 0.051, but one is declared significant and the other not. Our preference is to report the p-value and flag results meeting the decision criteria.

Nevertheless, some decision must often be made between significant or not. Turning the p-value into a binary decision allows us to examine two questions about the comparative value of statistical tests:

1. What percent of significant results will a researcher mistakenly judge to be insignificant?

2. What percent of reported significant results will actually be insignificant?

We used a randomization test with 20 million samples to produce a highly accurate estimate of the p-value. Given its accuracy, we use it to judge which results are significant at various values of α for the null hypothesis of the randomization test. Recall that the null hypotheses of the Wilcoxon and sign tests are the same as the randomization test's. The only difference between the randomization, Wilcoxon, and sign tests is that they have different test statistics. The randomization test's statistic matches our statistic of interest: the difference in mean average precision.

For example, if the randomization test estimates the p-value to be 0.006 and we set α = 0.01, we will assume the result is significant. If another test estimates the p-value to be greater than α, that is a miss. If the other p-value is less than or equal to α, the other test scores a hit. When the randomization test finds the p-value to be greater than α, the other test can false alarm by returning a p-value less than α. Table 2 shows a contingency table summarizing hits, misses, and false alarms.

                           Randomization Test
    Other Test             Significant      Not Significant
    Significant            H = Hit          F = False Alarm
    Not Significant        M = Miss         Z

Table 2: The randomization test is used to determine significance against some α. If the other test returns a p-value on the same side of α, it scores a hit or a correct rejection of the null hypothesis (Z). If the other test returns a p-value on the opposite side of α, it scores a miss or a false alarm.
[Figure 6: Miss rate (left) and false alarm ratio (right) for the sign, sign minimum difference (Sign D.), and Wilcoxon tests, plotted against the relative percent increase in mean average precision.]
With these definitions of a hit, miss, and false alarm, we can define the miss rate and false alarm ratio as measures of questions 1 and 2 above:

    MissRate = M / (H + M)

where M is the number of misses and H is the number of hits, and

    FalseAlarmRatio = F / (H + F)

where F is the number of false alarms and H is the number of hits. The false alarm ratio is not the false alarm rate.
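The bookkeeping implied by Table 2 is a simple tally; below is a sketch (ours, over hypothetical p-values) of scoring another test against the randomization test at a significance level α.

    #include <cstdio>
    #include <vector>

    // Tally hits, misses, and false alarms (Table 2) for another test
    // judged against the randomization test at significance level alpha.
    struct Tally { int hit = 0, miss = 0, false_alarm = 0; };

    Tally tally(const std::vector<double>& rand_p,
                const std::vector<double>& other_p, double alpha) {
        Tally t;
        for (size_t i = 0; i < rand_p.size(); ++i) {
            bool sig_rand = rand_p[i] <= alpha;
            bool sig_other = other_p[i] <= alpha;
            if (sig_rand && sig_other) ++t.hit;
            else if (sig_rand && !sig_other) ++t.miss;
            else if (!sig_rand && sig_other) ++t.false_alarm;
            // Neither significant: a correct rejection (Z), not tallied.
        }
        return t;
    }

    int main() {
        // Hypothetical p-values for five run pairs.
        std::vector<double> rand_p  = {0.006, 0.030, 0.080, 0.200, 0.040};
        std::vector<double> other_p = {0.020, 0.060, 0.030, 0.300, 0.045};
        Tally t = tally(rand_p, other_p, 0.05);
        std::printf("miss rate = %.2f, false alarm ratio = %.2f\n",
                    static_cast<double>(t.miss) / (t.hit + t.miss),
                    static_cast<double>(t.false_alarm) / (t.hit + t.false_alarm));
        return 0;
    }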
Another way to understand the questions we are addressing is as follows. A researcher is given access to two statistical significance tests. The researcher is told that one is much more accurate in its p-values. To get an understanding of how poor the poorer test is, the researcher says "I consider differences with p-values less than α to be significant. I always have. If I had used the better test instead of the poorer test, what percentage of my previously reported significant results would I now consider to be insignificant? On the flip side, how many significant results did I fail to publish?"

The miss rate and false alarm ratio can be thought of as the rates at which the researcher would be changing decisions of significance if the researcher switched from the Wilcoxon or sign test to the randomization test.

As we stated in the introduction, the goal of the researcher is to make progress by finding new methods that are better than existing methods and to avoid the promotion of methods that are worse.

Figures 6 and 7 show the miss rate and false alarm ratio for the sign, sign minimum difference (sign d.), and Wilcoxon tests when α is set to 0.1 and 0.05. We show α = 0.1 both as an "easy" significance level and for the researcher who may be interested in the behavior of the tests when they produce one-sided p-values and α = 0.05. In all cases, all of our tests produced two-sided p-values.

Given the ad-hoc TREC run pairs, if a researcher reports significance for a small improvement using the Wilcoxon or sign test, we would have doubt in that result. Additionally, an IR researcher using the Wilcoxon or sign tests could fail to detect significant advances in IR techniques.

5.2 Randomization vs. Bootstrap vs. t-test

The randomization, bootstrap, and t tests all agreed with each other given the TREC runs. Which of these should one prefer to use over the others? One approach, recommended by Hull [10], is to compute the p-value for all tests of interest and, if they disagree, look further at the experiment and the tests' criteria and null hypotheses to decide which test is most appropriate.

We have seen with the Wilcoxon and sign tests the mistakes an IR researcher can make using a significance test that utilizes one criterion while judging and presenting results using another criterion. This issue with the choice of test statistic goes beyond the Wilcoxon and sign tests. We ran an additional set of experiments where we calculated the p-value for the randomization test using the difference in median average precision. The p-values for the median do not agree with the p-values for the difference in mean average precision.

The IR researcher should select a significance test that uses the same test statistic as the researcher is using to compare systems. As a result, Student's t-test can only be used for the difference between means and not for the median or other test statistics. Both the randomization test and the bootstrap can be used with any test statistic.

While our experiment found little practical difference among the different null hypotheses of the randomization, bootstrap, and t tests, this may not always be so.

Researchers have been quite concerned that the null hypothesis of the t-test is not applicable to IR [19, 18, 21]. On our experimental data, this concern does not appear to be justified, but all of our experiments used a sample size N of 50 topics. N = 50 is a large sample. At smaller sample sizes, violations of normality may result in errors in the t-test. Cohen [4] makes the strong point that the randomization test performs as well as the t-test when the normality assumption is met and that the randomization test outperforms the t-test when the normality assumption is unmet.
[Figure 7: Miss rate (left) and false alarm ratio (right) for the sign, sign minimum difference (Sign D.), and Wilcoxon tests, plotted against the relative percent increase in mean average precision.]
As such, the researcher is safe to use the randomization test in either case but must be wary of the t-test.

Between the randomization (permutation) test and the bootstrap, which is better? Efron invented the bootstrap in 1979. Efron and Tibshirani [8] write at the end of chapter 15:

    Permutation methods tend to apply to only a narrow range of problems. However when they apply, as in testing F = G in a two-sample problem, they give gratifyingly exact answers without parametric assumptions. The bootstrap distribution was originally called the "combination distribution." It was designed to extend the virtues of permutation testing to the great majority of statistical problems where there is nothing to permute. When there is something to permute, as in Figure 15.1, it is a good idea to do so, even if other methods like the bootstrap are also brought to bear.

The randomization method does apply to the typical IR experiment. Noreen [14] has reservations about the use of the bootstrap for hypothesis testing.

Our largest concern with the bootstrap is the systematic bias towards smaller p-values that we found in comparison to both the randomization and t tests. This bias may be an artifact of our implementation, but an issue with the bootstrap is the number of its possible variations and the need for expert guidance on its correct use. For example, a common technique is to Studentize the test statistic to improve the bootstrap's estimation of the p-value [8]. It is unclear when one needs to do this, and additionally such a process would seem to limit the set of applicable test statistics. Unlike the bootstrap, the randomization test is simple to understand and implement.

Another issue with both the bootstrap and the t-test is that both have as part of their null hypotheses that the scores from the two IR systems are random samples from a single population. In contrast, the randomization test only concerns itself with the other possible experimental outcomes given the experimental data. The randomization test does not rely on the often incorrect idea that the scores are random samples from a population.

The test topics used in TREC evaluations are not random samples from the population of topics. TREC topics are hand selected to meet various criteria such as the estimated number of relevant documents in the test collection [20]. Additionally, neither the assessors nor the document collection are random.

The randomization test looks only at the experiment and produces a probability that the experimental results could have occurred by chance, without any assumption of random sampling from a population.

An IR researcher may argue that the assumption of random samples from a population is required to draw an inference from the experiment to the larger world. This cannot be the case. IR researchers have long understood that inferences from their experiments must be carefully drawn given the construction of the test setup. Using a significance test based on the assumption of random sampling is not warranted for most IR research.

Given these fundamental differences between the randomization, bootstrap, and t tests, we recommend the randomization test be used when it is applicable. The randomization test is applicable to most IR experiments.

5.3 Other Metrics

Our results have focused on mean average precision (MAP). We also looked at how precision at 10 (P10), mean reciprocal rank (MRR), and R-precision affected the results. In general the tests behaved the same as for MAP. Of note, the Wilcoxon test showed less variation for MRR than for the other metrics.

6. RELATED WORK

Edgington's book [7] on randomization tests provides extensive coverage of the many aspects of the test and details how the test was created by Fisher in the 1930s and later developed by many other statisticians.
Box et al. provide an excellent explanation of the randomization test in chapter 4 of their classic text [1]. Efron and Tibshirani have a detailed chapter on the permutation (randomization) test in their book [8].

Kempthorne and Doerfler have shown that for a set of artificial distributions the randomization test is to be preferred to the Wilcoxon test, which is to be preferred to the sign test [11]. In contrast, our analysis is based on the actual score distributions of IR retrieval systems.

Hull reviewed Student's t-test, the Wilcoxon signed rank test, and the sign test and stressed the value of significance testing in IR [10]. Hull's suggestion to compare the output of the tests was part of the inspiration for our experimental methodology. Hull also made the point that the t-test tends to be robust to violations of its normality assumption.

Wilbur compared the randomization, bootstrap, Wilcoxon, and sign tests for IR evaluation but excluded the t-test based on its normality assumption [21]. Wilbur found the randomization test and the bootstrap test to perform well, but recommended the bootstrap over the other tests in part because of its greater generality.

Savoy advocated the use of the bootstrap hypothesis test as a solution to the problem that the normality assumption required of the t-test is clearly violated by the score distributions of IR experiments [18]. Sakai used bootstrap significance tests to evaluate evaluation metrics [16], while our emphasis was on the comparison of significance tests.

Box et al. stress that when comparative experiments properly use randomization of test subjects, the t-test is usually robust to violations of its assumptions and can be used as an approximation to the randomization test [1]. We have confirmed this to be the case for IR score distributions.

Both Sanderson and Zobel [17] and Cormack and Lynam [5] have found that the t-test should be preferred to both the Wilcoxon and sign tests. We have taken the additional step of comparing these tests to the randomization and bootstrap tests that have been proposed by others for significance testing in IR evaluation.

7. CONCLUSION

For a large collection of TREC ad-hoc retrieval system pairs, the randomization test, the bootstrap shift method test, and Student's t-test all produce comparable significance values (p-values). Given that an IR researcher will obtain a similar p-value from each of these tests, there is no practical difference between them.

On the same set of experimental data, the Wilcoxon signed rank test and the sign test both produced very different p-values. These two tests are variants of the randomization test with different test statistics. Before affordable computation existed, both of these tests provided easy-to-compute, approximate levels of significance. In comparison to the randomization test, both the Wilcoxon and sign tests can incorrectly predict significance and can fail to detect significant results. IR researchers should discontinue use of the Wilcoxon and sign tests.

The t-test is only applicable for measuring the significance of the difference between means. Both the randomization and bootstrap tests can use test statistics other than the mean, e.g. the median. For IR evaluation, we recommend the use of the randomization test with a test statistic that matches the test statistic used to measure the difference between two systems.

8. ACKNOWLEDGMENTS

We thank Trevor Strohman for his helpful discussions and feedback on an earlier draft of this paper. We also thank the anonymous reviewers for their helpful comments.

This work was supported in part by the Center for Intelligent Information Retrieval and in part by the Defense Advanced Research Projects Agency (DARPA) under contract number HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsor.

9. REFERENCES

[1] G. E. P. Box, W. G. Hunter, and J. S. Hunter. Statistics for Experimenters. John Wiley & Sons, 1978.
[2] J. V. Bradley. Distribution-Free Statistical Tests. Prentice-Hall, 1968.
[3] C. Buckley. trec_eval. http://trec.nist.gov/trec_eval/trec_eval.8.0.tar.gz.
[4] P. R. Cohen. Empirical Methods for Artificial Intelligence. MIT Press, 1995.
[5] G. V. Cormack and T. R. Lynam. Validity and power of t-test for comparing MAP and GMAP. In SIGIR '07. ACM Press, 2007.
[6] G. V. Cormack and T. R. Lynam. Statistical precision of information retrieval evaluation. In SIGIR '06, pages 533-540. ACM Press, 2006.
[7] E. S. Edgington. Randomization Tests. Marcel Dekker, 1995.
[8] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1998.
[9] R. A. Fisher. The Design of Experiments. Oliver and Boyd, first edition, 1935.
[10] D. Hull. Using statistical testing in the evaluation of retrieval experiments. In SIGIR '93, pages 329-338. ACM Press, 1993.
[11] O. Kempthorne and T. E. Doerfler. The behavior of some significance tests under experimental randomization. Biometrika, 56(2):231-248, August 1969.
[12] M. Matsumoto and T. Nishimura. Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation, 8(1):3-30, 1998.
[13] W. Mendenhall, D. D. Wackerly, and R. L. Scheaffer. Mathematical Statistics with Applications. PWS-KENT Publishing Company, 1990.
[14] E. W. Noreen. Computer Intensive Methods for Testing Hypotheses. John Wiley, 1989.
[15] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2004. ISBN 3-900051-07-0.
[16] T. Sakai. Evaluating evaluation metrics based on the bootstrap. In SIGIR '06, pages 525-532. ACM Press, 2006.
[17] M. Sanderson and J. Zobel. Information retrieval system evaluation: effort, sensitivity, and reliability. In SIGIR '05, pages 162-169. ACM Press, 2005.
[18] J. Savoy. Statistical inference in retrieval effectiveness evaluation. Information Processing & Management, 33(4):495-512, 1997.
[19] C. J. van Rijsbergen. Information Retrieval. Butterworths, second edition, 1979. http://www.dcs.gla.ac.uk/Keith/Preface.html.
[20] E. M. Voorhees and D. K. Harman, editors. TREC. MIT Press, 2005.
[21] W. J. Wilbur. Non-parametric significance tests of retrieval performance comparisons. Journal of Information Science, 20(4):270-284, 1994.
[22] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80-83, December 1945.