Agresti 2000
To cite this article: Alan Agresti & Brian Caffo (2000): Simple and Effective Confidence Intervals for Proportions and
Differences of Proportions Result from Adding Two Successes and Two Failures, The American Statistician, 54:4, 280-288
Teacher's Corner
Simple and Effective Confidence Intervals for Proportions
and Differences of Proportions Result from Adding Two
Successes and Two Failures
Alan AGRESTI and Brian CAFFO
The standard confidence intervals for proportions and their differences used in introductory statistics courses have poor performance, the actual coverage probability often being much lower than intended. However, simple adjustments of these intervals based on adding four pseudo observations, half of each type, perform surprisingly well even for small samples. To illustrate, for a broad variety of parameter settings with 10 observations in each sample, a nominal 95% interval for the difference of proportions has actual coverage probability below .93 in 88% of the cases with the standard interval but in only 1% with the adjusted interval; the mean distance between the nominal and actual coverage probabilities is .06 for the standard interval, but .01 for the adjusted one. In teaching with these adjusted intervals, one can bypass awkward sample size guidelines and use the same formulas with small and large samples.

KEY WORDS: Binomial distribution; Score test; Small sample; Wald test.

1. INTRODUCTION

Let $X$ denote a binomial variate for $n$ trials with parameter $p$, denoted bin($n, p$), and let $\hat{p} = X/n$ denote the sample proportion. For two independent samples, let $X_1$ be bin($n_1, p_1$), and let $X_2$ be bin($n_2, p_2$). Let $z_\alpha$ denote the $1 - \alpha$ quantile of the standard normal distribution. Nearly all elementary statistics textbooks present the following confidence intervals for $p$ and $p_1 - p_2$:

An approximate $100(1 - \alpha)\%$ confidence interval for $p$ is

$$\hat{p} \pm z_{\alpha/2} \sqrt{\hat{p}(1 - \hat{p})/n}. \tag{1}$$

An approximate $100(1 - \alpha)\%$ confidence interval for $p_1 - p_2$ is

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}. \tag{2}$$

These confidence intervals result from inverting large-sample Wald tests, which evaluate standard errors at the maximum likelihood estimates. For instance, the interval for $p$ is the set of $p_0$ values for which $|\hat{p} - p_0| / \sqrt{\hat{p}(1 - \hat{p})/n} < z_{\alpha/2}$; that is, the set of $p_0$ having P value exceeding $\alpha$ in testing $H_0: p = p_0$ against $H_a: p \ne p_0$ using the approximately normal test statistic. The intervals are sometimes called Wald intervals. Although these intervals are simple and natural for students who have previously seen analogous large-sample formulas for means, a considerable literature shows that they behave poorly (e.g., Ghosh 1979; Vollset 1993; Newcombe 1998a, 1998b). This can be true even when the sample size is very large (Brown, Cai, and DasGupta 1999). In this article, we describe simple adjustments of these intervals that perform much better but can be easily taught in the typical non-calculus-based statistics course.

These references showed that a much better confidence interval for a single proportion is based on inverting the test with standard error evaluated at the null hypothesis, which is the score test approach. This confidence interval, due to Wilson (1927), is the set of $p_0$ values for which $|\hat{p} - p_0| / \sqrt{p_0(1 - p_0)/n} < z_{\alpha/2}$, which is

$$\hat{p}\left(\frac{n}{n + z_{\alpha/2}^2}\right) + \frac{1}{2}\left(\frac{z_{\alpha/2}^2}{n + z_{\alpha/2}^2}\right) \pm z_{\alpha/2} \sqrt{\frac{1}{n + z_{\alpha/2}^2}\left[\hat{p}(1 - \hat{p})\left(\frac{n}{n + z_{\alpha/2}^2}\right) + \frac{1}{2}\cdot\frac{1}{2}\left(\frac{z_{\alpha/2}^2}{n + z_{\alpha/2}^2}\right)\right]}.$$
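As an illustration only, the Wald formulas (1) and (2) and the score (Wilson) interval can be sketched in Python. The function names are hypothetical (they are not from the article), and the Wilson limits are written in the closed form obtained by solving the quadratic inequality that defines the score interval. `statistics.NormalDist` supplies the normal quantile.

```python
from math import sqrt
from statistics import NormalDist


def z_quantile(alpha=0.05):
    """Return z_{alpha/2}, the 1 - alpha/2 quantile of the standard normal."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def wald_interval(x, n, alpha=0.05):
    """Formula (1): p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = z_quantile(alpha)
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wald_diff_interval(x1, n1, x2, n2, alpha=0.05):
    """Formula (2): Wald interval for p1 - p2 from two independent samples."""
    z = z_quantile(alpha)
    p1, p2 = x1 / n1, x2 / n2
    half = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - half, (p1 - p2) + half


def wilson_interval(x, n, alpha=0.05):
    """Score interval: all p0 with |p_hat - p0| / sqrt(p0 * (1 - p0) / n) < z."""
    z = z_quantile(alpha)
    p_hat = x / n
    w = n / (n + z * z)                      # weight given to p_hat
    mid = p_hat * w + 0.5 * (1 - w)          # weighted average of p_hat and 1/2
    var = (p_hat * (1 - p_hat) * w + 0.25 * (1 - w)) / (n + z * z)
    half = z * sqrt(var)
    return mid - half, mid + half


if __name__ == "__main__":
    print(wald_interval(0, 10))    # degenerate: both limits are 0
    print(wilson_interval(0, 10))  # lower limit 0, upper limit above 0
```

With no successes the Wald interval collapses to a single point, whereas the score interval still has positive width, which is the degenerate behavior the article attributes to the Wald interval near the boundary.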
The American Statistician, November 2000, Vol. 54, No. 4. © 2000 American Statistical Association
[Figure 1 (plots not reproduced): coverage probability as a function of $p$, with panels for nominal 95% and 99% intervals.]
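Coverage curves of the kind plotted here can be computed exactly, because for fixed $n$ and $p$ the coverage probability is just the sum of the binomial probabilities of the outcomes whose interval contains $p$. The following is a minimal sketch under that assumption; the function names are hypothetical, and `t = 0` gives the ordinary Wald interval while `t = 4` gives the "add two successes and two failures" interval.

```python
from math import comb, sqrt
from statistics import NormalDist


def interval(x, n, t=0, level=0.95):
    """Wald interval after adding t/2 successes and t/2 failures (t = 0: plain Wald)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    n_adj = n + t
    p_adj = (x + t / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half


def coverage(n, p, t=0, level=0.95):
    """Exact coverage: total bin(n, p) probability of outcomes whose interval contains p."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = interval(x, n, t, level)
        if lo <= p <= hi:
            total += comb(n, x) * p**x * (1 - p) ** (n - x)
    return total


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5):
        print(p, round(coverage(10, p, t=0), 3), round(coverage(10, p, t=4), 3))
```

Evaluating `coverage` on a fine grid of `p` values for `n = 5, 10, 20` is one way to trace out curves of this type.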
[...] after adding $z_{.025}^2 = 1.96^2 \approx 4$ pseudo observations, two of each type. That is, their adjusted “add two successes and two failures” interval has the simple form

$$\tilde{p} \pm z_{.025} \sqrt{\tilde{p}(1 - \tilde{p})/\tilde{n}}, \tag{3}$$

but with $\tilde{n} = (n + 4)$ trials and $\tilde{p} = (X + 2)/(n + 4)$. The midpoint equals that of the 95% score confidence interval (rounding $z_{.025}$ to 2.0 for that interval), but the coefficient of $z_{.025}$ uses the variance $\tilde{p}(1 - \tilde{p})/\tilde{n}$ at the weighted average of $\hat{p}$ and 1/2 rather than the weighted average of the variances; by Jensen’s inequality, the adjusted interval is wider than the score interval.

For small samples, the improvement in performance of the adjusted interval compared to the ordinary Wald interval is dramatic. To illustrate, Figure 1 shows the actual coverage probabilities for the nominal 95% Wald and adjusted intervals plotted as a function of $p$, for $n$ = 5, 10, and 20. For all $n$ great improvement occurs for $p$ near 0 or 1. For instance, Brown et al. (1999) stated that when $p = .01$, the size of $n$ required such that the actual coverage probability of a nominal 95% Wald interval is uniformly at least .94 for all $n$ above that value is $n = 7963$, whereas for the adjusted interval this is true for every $n$; when $p = .10$ the values are $n = 646$ for the Wald interval and $n = 11$ for the adjusted interval. The Wald interval behaves especially poorly with small $n$ for $p$ near the boundary, partly because of the nonnegligible probability of having $\hat{p}$ = 0 or 1 and thus the degenerate interval [0, 0] or [1, 1]. Agresti and Coull (1998) recommended the adjusted interval for use in elementary statistics courses, since the Wald interval behaves poorly yet the score interval is too complex for most students. Many students in non-calculus-based courses are mystified by quadratic equations (which are needed to solve for the score interval) and would have difficulty using the weighted average formula above. In such courses, it is often easier to show how to adapt a simple method so that it works well rather than to present a more complex method.

Figure 2. Boxplots of coverage probabilities for nominal 95% adjusted confidence intervals based on adding $t$ pseudo observations; distributions refer to 10,000 cases, with $n_1$ and $n_2$ each chosen uniformly between 10 and 30 and $p_1$ and $p_2$ chosen uniformly between 0 and 1. [Plot not reproduced; vertical axis: Coverage Probability; horizontal axis: $t$ Pseudo Observations, $t$ = 0, 2, 4, 6, 8.]

Let $I_t(n, x)$ denote the adjustment of the Wald interval that adds $t/2$ successes and $t/2$ failures. With confidence levels $(1 - \alpha)$ other than .95, the Agresti and Coull approximation of the score interval uses $I_t(n, x)$ with $t = z_{\alpha/2}^2$
instead of $t = 4$, for instance adding 2.7 pseudo observations for a 90% interval and 6.6 for a 99% interval. Many instructors in elementary courses will find it simpler to tell students to use the same constant for all cases. One will do reasonably well, especially at high nominal confidence levels, by the recipe of always using $t = 4$. The performance of the adjusted interval $I_4(n, x)$ is much better than the Wald interval (1) for the usual confidence levels. To illustrate, Figure 1 also shows coverage probabilities for nominal 99% intervals, when $n$ = 5, 10, 20. Since the .95 confidence level is the most common in practice and since this “add two successes and two failures” adjustment provides strong improvement over the Wald for other levels as well, it is simplest for elementary courses to recommend that adjustment uniformly. Of the elementary texts that recommend adjustment of the Wald interval by adding pseudo observations, some (e.g., McClave and Sincich 2000) direct students to use $I_4(n, x)$ regardless of the confidence coefficient whereas others (e.g., Samuels and Witmer 1999) recommend $t = z_{\alpha/2}^2$.

The purpose of this article is to show that a simple adjustment, adding two successes and two failures (total), also works quite well for two-sample comparisons of proportions. The simple Wald formula (2) improves substantially after adding a pseudo observation of each type to each sample, regarding sample $i$ as $(n_i + 2)$ trials with $\tilde{p}_i = (X_i + 1)/(n_i + 2)$. There is no reason to expect an optimal interval to result from this method, or in particular from adding the same number of pseudo observations to each sample or even the same number of cases of each type, but we restricted attention to this form because of the simplicity of explaining it in a classroom setting.

2. COMPARING PERFORMANCE OF WALD INTERVALS AND ADJUSTED INTERVALS

For the two-sample comparison of proportions, we now study the performance of the Wald confidence formula (2) after adding $t$ pseudo observations, $t/4$ of each type to each sample, truncating when the interval for $p_1 - p_2$ contains values < -1 or > 1. Denote this interval by $I_t(n_1, x_1; n_2, x_2)$, or $I_t$ for short, so $I_0$ denotes the ordinary Wald interval. Our discussion refers mainly to the .95 confidence coefficient, but our evaluations also studied .90 and .99 coefficients. Let $C_t(n_1, p_1; n_2, p_2)$, or $C_t$ for short, denote the true coverage probability of a nominal 95% confidence interval $I_t$. We investigated whether there is a $t$ value for which $|C_t(n_1, p_1; n_2, p_2) - .95|$ tends to be small for most [...]

Table 1. Summary of Performance of Nominal 95% Confidence Intervals for $p_1 - p_2$ Based on Adding $t$ Pseudo Observations, Averaging with Respect to a Uniform Distribution for $(p_1, p_2)$.
Cov. Prob. < .93    10    .880    .090    .010    .100    .235    .072    .046
NOTE: Table reports mean of coverage probabilities $C_t(n, p_1; n, p_2)$, mean of distances $|C_t(n, p_1; n, p_2) - .95|$ from nominal level, mean of expected interval lengths, and proportion of cases with $C_t(n, p_1; n, p_2) < .93$.
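The adjusted intervals $I_t(n, x)$ and $I_t(n_1, x_1; n_2, x_2)$ described in the text are simple enough to sketch in a few lines. The Python below is illustrative only: the function names are hypothetical, the two-sample interval is truncated to [-1, 1] as Section 2 describes, and `score_matched_t` implements the alternative rule $t = z_{\alpha/2}^2$ for confidence levels other than .95.

```python
from math import sqrt
from statistics import NormalDist


def z_quantile(level=0.95):
    """Return z_{alpha/2} for confidence level 1 - alpha."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)


def score_matched_t(level):
    """Number of pseudo observations t = z_{alpha/2}^2 that matches the score midpoint."""
    return z_quantile(level) ** 2


def adjusted_interval(x, n, t=4.0, level=0.95):
    """I_t(n, x): Wald formula (1) after adding t/2 successes and t/2 failures."""
    z = z_quantile(level)
    n_adj = n + t
    p_adj = (x + t / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half


def adjusted_diff_interval(x1, n1, x2, n2, t=4.0, level=0.95):
    """I_t(n1, x1; n2, x2): Wald formula (2) after adding t/4 successes and
    t/4 failures to each sample, truncated to [-1, 1]."""
    z = z_quantile(level)
    m1, m2 = n1 + t / 2, n2 + t / 2
    q1, q2 = (x1 + t / 4) / m1, (x2 + t / 4) / m2
    half = z * sqrt(q1 * (1 - q1) / m1 + q2 * (1 - q2) / m2)
    diff = q1 - q2
    return max(-1.0, diff - half), min(1.0, diff + half)


if __name__ == "__main__":
    for level in (0.90, 0.95, 0.99):
        print(level, round(score_matched_t(level), 1))
    print(adjusted_interval(2, 10))
    print(adjusted_diff_interval(7, 10, 3, 10))
```

The default `t = 4` follows the article's recommendation to use the same constant at every confidence level; passing `t=score_matched_t(level)` gives the level-dependent variant instead.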
Figure 3. Proportion of $(p_1, p_2)$ cases, with $p_1$ and $p_2$ chosen uniformly between 0 and 1, for which nominal 95% adjusted confidence intervals based on adding $t$ pseudo observations have actual coverage probabilities below .93, for $n_1 = n_2 = 10$ and $n_1 = 30$, $n_2 = 10$. [Plots not reproduced; panels: $n_1 = n_2 = 10$ and $n_1 = 30$, $n_2 = 10$; horizontal axis: $t$ = 0, 2, 4, 6, 8.]

Figure 4. Coverage probabilities for nominal 95% Wald and adjusted confidence intervals (adding $t = 4$ pseudo observations) as a function of $p_1$ when $p_2$ = .1, .3, .5, with $n_1 = n_2 = 20$. [Plots not reproduced; panels: $p_2 = .1$, $p_2 = .3$, $p_2 = .5$; horizontal axis: $p_1$.]

Figure 5. Coverage probabilities for nominal 95% Wald and adjusted confidence intervals (adding $t = 4$ pseudo observations) as a function of $p_1$ when $p_2 = .3$, for $n_1 = n_2 = 10$; $n_1 = 20$, $n_2 = 10$; and $n_1 = 40$, $n_2 = 10$. [Plots not reproduced; one panel per sample-size pair; curves: Adjusted and Wald; horizontal axis: $p_1$.]
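The coverage probability $C_t(n_1, p_1; n_2, p_2)$ shown in these figures can be evaluated exactly by summing the joint binomial probabilities of all outcomes $(x_1, x_2)$ whose interval contains $p_1 - p_2$. The sketch below assumes that approach; the function names are hypothetical, and `share_below` approximates the uniform averaging over the unit square with a coarse midpoint grid, which is a simplification and not necessarily how the article's summaries were produced.

```python
from math import comb, sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # z_{.025} for nominal 95% intervals


def binom_pmf(n, p):
    """Probabilities of x = 0, ..., n successes for bin(n, p)."""
    return [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]


def diff_interval(x1, n1, x2, n2, t):
    """I_t: formula (2) after adding t/4 of each type to each sample, truncated to [-1, 1]."""
    m1, m2 = n1 + t / 2, n2 + t / 2
    q1, q2 = (x1 + t / 4) / m1, (x2 + t / 4) / m2
    half = Z95 * sqrt(q1 * (1 - q1) / m1 + q2 * (1 - q2) / m2)
    return max(-1.0, q1 - q2 - half), min(1.0, q1 - q2 + half)


def coverage(n1, p1, n2, p2, t):
    """C_t(n1, p1; n2, p2): exact probability that I_t contains p1 - p2."""
    f1, f2 = binom_pmf(n1, p1), binom_pmf(n2, p2)
    delta = p1 - p2
    total = 0.0
    for x1 in range(n1 + 1):
        for x2 in range(n2 + 1):
            lo, hi = diff_interval(x1, n1, x2, n2, t)
            if lo <= delta <= hi:
                total += f1[x1] * f2[x2]
    return total


def share_below(n1, n2, t, cutoff=0.93, grid=20):
    """Share of a midpoint grid of (p1, p2) values on the unit square with C_t < cutoff."""
    ps = [(i + 0.5) / grid for i in range(grid)]
    low = sum(coverage(n1, a, n2, b, t) < cutoff for a in ps for b in ps)
    return low / grid**2


if __name__ == "__main__":
    for t in (0, 2, 4, 6, 8):
        print(t, share_below(10, 10, t))
```

A finer grid, or random draws of $(p_1, p_2)$, would give summaries closer in spirit to those the article reports.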
[...] actual coverage probability below .88 were (.623, .045, .016, .131, .255). The pattern exhibited here is illustrative of a variety of results from analyzing $C_t$ more closely, as we now discuss.

We analyzed the performance of the $I_t$ interval for various fixed $(n_1, n_2)$ combinations. Table 1 summarizes some characteristics, in an average sense based on taking $(p_1, p_2)$ uniform from the unit square, for $(n_1, n_2)$ = (10, 10), (20, 20), (30, 30), (30, 10). Although the adjusted interval $I_4$ tends to be conservative, it compares well to other cases in the mean of the distances $|C_t - .95|$ and especially the proportion of cases for which $C_t < .93$. For $n_i = 10$, for instance, the actual coverage probability is below .93 for 88% of such cases with the Wald interval, but for only 1% of them with $I_4$. Figure 3 shows the proportions of coverage probabilities that are below .93 as a function of $t$, for $(n_1, n_2)$ = (10, 10) and (30, 10). The improvement over the ordinary Wald interval from adding $t = 4$ pseudo observations is substantial. Remaining figures concentrate on this particular adjustment, which fared well in a variety of evaluations we conducted.

Averaging performance over the unit square for $(p_1, p_2)$ can mask poor behavior in certain regions, and in practice certain pairings (e.g., $|p_1 - p_2|$ small) are often more common or more important than others. Thus, besides studying these summary expectations, we plotted $C_t$ as a function of $p_1$ for various fixed values of $p_2$, $p_1 - p_2$, and $p_1/p_2$. To illustrate, Figure 4 plots the Wald coverage $C_0$ and the coverage $C_4$ for the adjusted interval, fixing $p_2$ at .1, .3, and .5, for $n_1 = n_2 = 20$. The poor coverage spikes for the Wald interval disappear with $I_4$, but this adjustment is quite conservative when $p_1$ and $p_2$ are both close to 0 or both close to 1. The adjustment $I_4$ performs reasonably well, and much better than the Wald interval, even with very small or unbalanced sample sizes. Figure 5 illustrates, plotting $C_0$ and $C_4$ as a function of $p_1$ with $p_2$ fixed at .3, for $(n_1, n_2)$ = (10, 10), (20, 10), and (40, 10). Figure 6 shows $C_0$ and $C_4$ as a function of $p_1$ when $p_1 - p_2$ = 0 or .2 and when the relative risk $p_1/p_2$ = 2.0 or 4.0, when $n_1 = n_2 = 10$. In Figures 4-6, only rarely does the adjusted interval have coverage significantly below the nominal level. On the other hand, Figures 4 and 6 show that it can be very conservative when $p_1$ and $p_2$ are both close to 0 or 1, say with $(p_1 + p_2)/2$ below about .2 or above about .8 for the small sample sizes studied here. This is preferred, however, to the very low coverages of the Wald interval in these cases. Figures 7 and 8 illustrate their behavior, showing surface plots of $C_0$ and $C_4$ over the unit square when $n_1 = n_2 = 10$. The spikes at values of $p_1$ in Figures 4 and 5 become ridges at values of $p_1 - p_2$ in these figures.

The poor performance of the Wald interval does not occur because it is too short. In fact, for moderate-sized $p_i$ it tends to be too long. For instance, when $n_1 = n_2 = 10$, $I_0$ has greater expected length than $I_4$ for $p_2$ between .11 and .89 when $p_1 = .5$ and for $p_2$ between .18 and .82 when $p_1 = .3$. When $n_1 = n_2 = n$ and when $\hat{p}_1 = \hat{p}_2 = \hat{p}$, $I_0$ has greater length than $I_t$ when $\hat{p}$ falls within $\sqrt{.25 - n(4n + t)/[24n^2 + 12nt + 2t^2]}$ of .5. For all $t > 0$, this interval around .5 shrinks monotonically as $n$ increases to $.50 \pm .50/\sqrt{3}$, or (.21, .79), which applies also to the Agresti and Coull (1998) adjusted interval in the single-sample case. As in the single-proportion case, the Wald interval suffers from having the maximum likelihood estimate exactly in the middle of the interval.

There is nothing unique about $t = 4$ pseudo observations in providing good performance of adjusted intervals in the one- and two-sample problems. For instance, Figure 3 and Table 1 show that other adjustments often work well. A region of $t$ values provides substantial improvement over the Wald interval, with values near $t = 2$ being less conservative than $t = 4$. We emphasized the case $t = 4$ earlier for the two-sample case because it rarely has poor coverage. We believe it is worth permitting some conservativeness to
ensure that the coverage probability rarely falls much below the nominal level. In the one-sample case the adjusted interval $I_2(n, x)$ is better than $I_4(n, x)$ in approximating the score interval with small confidence levels, such as 90%. An advantage of the interval $I_2(n, x)$ for $p$ is consistency between the single-sample case and our recommended adjustment $I_4(n_1, x_1; n_2, x_2)$ for two samples. For instance, as $n_2 \to \infty$ and the second sample yields a perfect estimate, the resulting “add two successes and two failures” two-sample interval uses the first sample in the same way as does the “add one success and one failure” single-sample interval. However, for the single-sample problem we prefer the $I_4(n, x)$ interval, since .95 is by far the most common confidence level in practice and this interval works somewhat better than $I_2(n, x)$ in that case.

Figure 6. Coverage probabilities for nominal 95% Wald and adjusted confidence intervals (adding $t = 4$ pseudo observations) as a function of $p_1$ when $p_1 - p_2$ = 0 or .2 and when $p_1/p_2$ = 2 or 4, for $n_1 = n_2 = 10$. [Plots not reproduced; one panel per condition; vertical axis: Coverage Probability; horizontal axis: $p_1$.]

3. COMPARING THE ADJUSTED INTERVAL WITH OTHER GOOD INTERVALS

Many methods have been proposed for improving on the ordinary Wald confidence interval for $p_1 - p_2$. Since this article discusses methods appropriate in elementary statistics courses, it focuses on the simple $I_t$ adjustment rather than methods that may be suggested by statistical principles. To find a good method more generally, one approach is to invert a test of $H_0: p_1 - p_2 = \Delta$ that has good properties, such as using the large-sample score test (Mee 1984) or profile likelihood methods (Newcombe 1998b). The score test of $p_1 - p_2 = 0$ is the familiar Pearson chi-squared test, so this approach has the advantage that the confidence interval is consistent with the most commonly taught test of the same nominal level. The method of obtaining the confidence interval is too complex for elementary courses, however, partly because the test of $p_1 - p_2 = \Delta$ requires finding the maximum likelihood estimates of $(p_1, p_2)$ for the standard error subject to the constraint $\hat{p}_1 - \hat{p}_2 = \Delta$.

Newcombe (1998b) evaluated various confidence interval methods for $p_1 - p_2$. He proposed a method that performs substantially better than the Wald interval and similar to the score interval, while being computationally simpler (although too complex for most elementary statistics courses). His method is a hybrid of results from the single-sample score intervals for $p_1$ and $p_2$. Specifically, let $(l_i, u_i)$ be the roots for $p_i$ in $z_{\alpha/2} = |\hat{p}_i - p_i| / \sqrt{p_i(1 - p_i)/n_i}$. Newcombe’s
hybrid score interval is [...]

Compared to the adjusted interval $I_4$, the hybrid score interval also is conservative when $p_1$ and $p_2$ are both close to 0 or 1; overall, it is less conservative, however, with mean coverage probability closer to the nominal level (see Table 1). Likewise, it tends to be a bit shorter. It has a somewhat higher proportion of cases with coverage probability being too small, mainly for values of $|p_1 - p_2|$ near 1; for the 10,000 randomly selected cases with $n_i$ also random between 10 and 30, the minimum coverage probability was .92 for the 95% adjusted interval and .86 for the 95% hybrid score interval.

The adjusted interval $I_4$ and the hybrid score interval both have a greater tendency for distal non-coverage than mesial non-coverage. For instance, for the 10,000 randomly selected cases, the mean probability for which the lower limit exceeds $p_1 - p_2$ when $p_1 - p_2 > 0$ or the upper limit is less than $p_1 - p_2$ when $p_1 - p_2 < 0$ was .030 for $I_4$ and .033 for the 95% hybrid score interval, whereas the mean probability for which the upper limit is less than $p_1 - p_2$ when $p_1 - p_2 > 0$ or the lower limit exceeds $p_1 - p_2$ when $p_1 - p_2 < 0$ was .013 for $I_4$ and .014 for the 95% hybrid score. As $t$ increases for $I_t$, the ratio of incidence of distal non-coverage to mesial non-coverage increases; for these randomly selected cases, for $t$ = (0, 2, 4, 6, 8) it equals (.7, 1.2, 2.2, 4.3, 8.1). Unlike the adjusted interval and the Wald interval, the hybrid score interval cannot produce overshoot, [...]
[Two plots not reproduced; vertical axis: Coverage Probability.]
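For comparison with the adjusted interval, the hybrid score method of Section 3 can be sketched as follows. This is a hedged illustration, not the article's code: it assumes the form in which Newcombe's (1998b) hybrid interval is commonly stated, with lower limit $(\hat{p}_1 - \hat{p}_2) - z_{\alpha/2}\sqrt{l_1(1 - l_1)/n_1 + u_2(1 - u_2)/n_2}$ and upper limit $(\hat{p}_1 - \hat{p}_2) + z_{\alpha/2}\sqrt{u_1(1 - u_1)/n_1 + l_2(1 - l_2)/n_2}$, where $(l_i, u_i)$ are the single-sample score limits. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_limits(x, n, z):
    """Roots (l, u) of z = |p_hat - p| / sqrt(p * (1 - p) / n), clamped to [0, 1]."""
    p_hat = x / n
    mid = (x + z * z / 2) / (n + z * z)
    half = z * sqrt(p_hat * (1 - p_hat) * n + z * z / 4) / (n + z * z)
    # Clamping guards against rounding error pushing a limit just outside [0, 1].
    return max(0.0, mid - half), min(1.0, mid + half)


def hybrid_score_interval(x1, n1, x2, n2, level=0.95):
    """Hybrid score interval for p1 - p2, in the assumed form given in the lead-in."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    l1, u1 = wilson_limits(x1, n1, z)
    l2, u2 = wilson_limits(x2, n2, z)
    diff = x1 / n1 - x2 / n2
    lower = diff - z * sqrt(l1 * (1 - l1) / n1 + u2 * (1 - u2) / n2)
    upper = diff + z * sqrt(u1 * (1 - u1) / n1 + l2 * (1 - l2) / n2)
    return lower, upper


if __name__ == "__main__":
    print(hybrid_score_interval(7, 10, 3, 10))
    print(hybrid_score_interval(10, 10, 0, 10))  # upper limit stays at 1
```

Because each limit is built from single-sample score limits that lie in [0, 1], the result stays within [-1, 1] without truncation, consistent with the remark that this interval cannot produce overshoot.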
[...] what to do when the guidelines are violated, other than perhaps to consult a statistician. The results in this article suggest that for the “add two successes and two failures” adjusted confidence intervals, one might simply bypass sample size rules. The adjusted intervals have safe operating characteristics for practical application with almost all sample sizes. In fact, we note in closing (and with tongue in cheek) that the adjusted intervals $I_4(n, x)$ and $I_4(n_1, x_1; n_2, x_2)$ have the advantage that, as with Bayesian methods, one can do an analysis without having any data. In the single-sample case the adjusted sample then has $\tilde{p} = 2/4$, and the 95% confidence interval is $.5 \pm 2\sqrt{(.5)(.5)/4}$, or [0, 1]. In the two-sample case the adjusted samples have $\tilde{p}_1 = 1/2$ and $\tilde{p}_2 = 1/2$, and the 95% confidence interval is $(.5 - .5) \pm 2\sqrt{[(.5)(.5)/2] + [(.5)(.5)/2]}$, or [-1, +1]. Both analyses are uninformative, as one would hope from a frequentist approach with no data. No one will get into too much trouble using them!

REFERENCES

[...] Association, 74, 894-900.
Leemis, L. M., and Trivedi, K. S. (1996), “A Comparison of Approximate Interval Estimators for the Bernoulli Parameter,” The American Statistician, 50, 63-68.
McClave, J. T., and Sincich, T. (2000), Statistics (8th ed.), Englewood Cliffs, NJ: Prentice Hall.
Mee, R. W. (1984), “Confidence Bounds for the Difference Between Two Probabilities,” Biometrics, 40, 1175-1176.
Newcombe, R. (1998a), “Two-sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods,” Statistics in Medicine, 17, 857-872.
Newcombe, R. (1998b), “Interval Estimation for the Difference Between Independent Proportions: Comparison of Eleven Methods,” Statistics in Medicine, 17, 873-890.
Samuels, M. L., and Witmer, J. W. (1999), Statistics for the Life Sciences (2nd ed.), Englewood Cliffs, NJ: Prentice Hall.
Vollset, S. E. (1993), “Confidence Intervals for a Binomial Proportion,” Statistics in Medicine, 12, 809-824.
Wilson, E. B. (1927), “Probable Inference, the Law of Succession, and Statistical Inference,” Journal of the American Statistical Association, 22, 209-212.