Variations Boxplots
The American Statistician, Vol. 32, No. 1. (Feb., 1978), pp. 12-16.
Stable URL: http://links.jstor.org/sici?sici=0003-1305%28197802%2932%3A1%3C12%3AVOBP%3E2.0.CO%3B2-9
Variations of Box Plots
Box plots display batches of data. Five values from a set of data
are conventionally used: the extremes, the upper and lower hinges
(quartiles), and the median. Such plots are becoming a widely used
tool in exploratory data analysis and in preparing visual summaries
for statisticians and nonstatisticians alike. Three variants of the
basic display, devised by the authors, are described. The first
visually incorporates a measure of group size; the second incorporates
an indication of rough significance of differences between
medians; the third combines the features of the first two. These
techniques are displayed by examples.
KEY WORDS: Box Plots; Exploratory data analysis; Graphical
techniques.
1. Introduction
[Figure D. Variable Width Box Plot. Telephone Bill vs Years Lived in
Chicago. x-axis: Years Lived in Chicago (under 1, 1 to 2, 3 to 5,
6 to 10, 11 to 15, over 15). Width of box proportional to root group
size.]

[Figure E. Notched Box Plot. Telephone Bill vs Years Lived in Chicago.
x-axis: Years Lived in Chicago (under 1, 1 to 2, 3 to 5, 6 to 10,
11 to 15, over 15). Non-overlapping of notches indicates significant
difference at rough 95% level.]
each group were 1,000 times larger, boxes of identical
width would be produced.) Hence the viewer can
determine confidence intervals around the medians
only in a relative manner. Figure E, a notched box
plot of the data, shows these explicitly. The notches
surrounding the medians provide a measure of the
rough significance of differences between the values.
Specifically, if the notches about two medians do not
overlap in this display, the medians are, roughly,
significantly different at about a 95% confidence
level (see footnote 4). Now the seemingly impressive difference
previously noted is brought clearly into proper perspective:
the difference is, in fact, not significant by our
test. Indeed, none of the differences seen in the first
five boxes is significant.
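
The non-overlap reading of the notches can be sketched in code. The following is a minimal illustration rather than the authors' program. It assumes the notch rule described in Section 7, M ± Cs with C = 1.7, and it assumes the usual Gaussian factors 1.25 (about sqrt(pi/2)) and 1.35 (interquartile range in units of sigma) for s. The function names are hypothetical, and `statistics.quantiles` is used as a stand-in for the hinges, which it only approximates.

```python
import math
from statistics import median, quantiles

# C = 1.7 is the value Section 7 reports as empirically selected.
# The factors 1.25 and 1.35 are assumptions: the usual Gaussian values
# sqrt(pi/2) ~ 1.25 and IQR ~ 1.35 * sigma.
C = 1.7
MEDIAN_SE_FACTOR = 1.25
IQR_PER_SIGMA = 1.35


def notch(values, c=C):
    """Return the (lower, upper) notch limits M -/+ c*s for one group."""
    n = len(values)
    m = median(values)
    # Quartiles stand in for the hinges here (an approximation).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    r = q3 - q1  # midspread (interquartile range) R
    s = MEDIAN_SE_FACTOR * r / (IQR_PER_SIGMA * math.sqrt(n))
    return m - c * s, m + c * s


def notches_overlap(group_a, group_b):
    """True if the two notch intervals share at least one point."""
    lo_a, hi_a = notch(group_a)
    lo_b, hi_b = notch(group_b)
    return lo_a <= hi_b and lo_b <= hi_a
```

Under these assumptions, two groups whose notches do not overlap would be read as having medians that differ at roughly the 95% level; overlapping notches give no such indication.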
It should be noted that the convention has been
adopted that, should the notch lie outside either
hinge, an unnotched box, plotted with dashed lines, is
displayed for that group indicating low confidence in
it. Experience has shown that few cases exist where
this is not the appropriate strategy. These few, however,
normally occur in cases where all notches
protrude beyond a hinge, and/or one or more of the
medians lies very near a hinge.
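
The drawing convention just described amounts to a small decision rule, sketched below. The function name, the style labels, and the `allow_protruding` switch are hypothetical. The switch represents the analyst's choice to plot protruding notches anyway in the exceptional cases mentioned above.

```python
def box_style(lower_hinge, upper_hinge, notch_low, notch_high,
              allow_protruding=False):
    """Choose how to draw one group's box under the stated convention."""
    # A notch "protrudes" when either limit lies outside the hinges.
    protrudes = notch_low < lower_hinge or notch_high > upper_hinge
    if not protrudes:
        return "notched"
    if allow_protruding:
        # Exceptional case: the analyst plots the protruding notch.
        return "protruding-notched"
    # Default convention: unnotched box drawn with dashed lines.
    return "dashed-unnotched"
```

For example, `box_style(10, 20, 8, 18)` falls back to the dashed, unnotched box because the lower notch limit lies below the lower hinge.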
Figure F, using the warp data in Figure B, is such a
case. All boxes would be dashed, and the median of
type al is very near the upper hinge. Hence protruding
notches have been plotted. Even so, little is
gained except, perhaps, some slight confirmation of
features already obvious.

Footnote 4: Section 7 contains a description of the method used in
determining the notch widths and suggests possible alternatives.

[Figure F. Tippett's Warp Break Data, Notched. Number of Warp Breaks
During a Fixed Amount of Weaving for 6 Types of Warp. x-axis: Type of
Warp (al, am, ah, bl, bm, bh).]
be based in part on group size. While this fact is
admitted, the question of whether such redundancy is
necessarily undesirable may well be debated. In the
final analysis, the user's personal preference is often

[Figure H. Variable Width Notched Box Plot. NOTE: Y axis scale is
logarithmic. Non-overlapping of notches indicates significant
difference at rough 95% level. Width of box proportional to root group
size.]
[Figure G. Variable Width Notched Box Plot*. Telephone Bill vs Years
Lived in Chicago. x-axis: Years Lived in Chicago (under 1, 1 to 2,
3 to 5, 6 to 10, 11 to 15, over 15). Significant difference at rough
95% level. Width of box proportional to root group size.
* Plot truncated at $50; extremes printed at top.]

[Figure I. Variable Width Notched Box Plot*. Telephone Bill vs Years
Lived in Chicago. x-axis: Years Lived in Chicago. Significant
difference at rough 95% level. Width of box proportional to root group
size.
* The first five groups are combined.]
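
The width rule noted under the variable width figures (width of box proportional to root group size) can be sketched as follows. This is an illustration only. The function name, the `max_width` scaling, and the `rule` switch are hypothetical, and the "linear" option corresponds to widths directly proportional to group size.

```python
import math


def box_widths(group_sizes, max_width=0.8, rule="sqrt"):
    """Scale box widths so that the largest group gets max_width."""
    if rule == "sqrt":
        raw = [math.sqrt(n) for n in group_sizes]
    elif rule == "linear":
        raw = [float(n) for n in group_sizes]
    else:
        raise ValueError("rule must be 'sqrt' or 'linear'")
    biggest = max(raw)
    return [max_width * w / biggest for w in raw]
```

Because only relative sizes enter, multiplying every group size by the same factor leaves the widths unchanged. With sizes 25 and 100, the square-root rule gives widths in the ratio 1:2, while the linear rule gives 1:4.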
6. Choice of Box Widths
In the examples shown in Figures D, G, H, and I,
widths of the boxes have been made proportional to
the square root of the group size. This choice was
based on the fact that many variability measures,
such as standard error, are proportional to the root of
the group size. It is plausible to take this as the
standard to be used unless other methods offer significant
improvement.
One example would be when the actual intent is to
display strata fractions. Since the use of a logit scale
is sometimes preferred in these displays, box widths
based on such a scale might be used here. Again, if
the intent is to emphasize differences in group size,
the use of square roots will clearly minimize the
visual impact of differences, particularly in cases
where the sizes are relatively similar. Here one might
opt for using widths directly proportional to size.
(However, limited experience in this area tends to
indicate that this may overemphasize differences.)
Again it must be stressed that whatever is done
should be clearly indicated.

7. Choice of Notch Size
In notched box plots, one is, of course, faced with
the question of how best to determine the widths of
the notches. Many methods, both classical and nonparametric,
might be considered. None will likely be
best in all cases.
In Figures E, G, H, and I, the widths were computed
from the midspread or interquartile range (R) of
the data (a robust measure of spread), and the number
of observations (N) for each group. The Gaussian-based
asymptotic approximation (Kendall and
Stuart 1967) of the standard deviation s of the median
(M) is given by

    $s = 1.25 R / (1.35 \sqrt{N})$,    (7.1)

and can be shown to be reasonably broadly applicable
to other distributions. An appreciation of why this is
so can be obtained as follows. The asymptotic formula
for the standard deviation of M is $1/(2 f_0 \sqrt{N})$,
where $f_0$ is the density at the population median.
Also, R is a consistent estimate of the population
interquartile range $\phi$, and $1/(2\phi)$ is the average
density between the population quartiles. Thus for any
distribution for which the middle portion is shaped
approximately like a Gaussian, the ratio $f_0 / (1/(2\phi))$
will be close to the Gaussian value of 1.08. (This does
not explain the rather remarkable result that for the
(very skewed) one-sided exponential $2 \phi f_0 = 1.10$.)
The notch around each median may then be calculated as

    $M \pm Cs$,    (7.2)

where C is a constant. Should one desire a notch
indicating a 95 percent confidence interval about each
median, C = 1.96 would be used. However, since a
form of "gap gauge" which would indicate significant
differences at the 95 percent level was desired, this
was not done. It can be shown that C = 1.96 would
only be appropriate if the standard deviations of the
two groups were vastly different. If they were nearly
equal, C = 1.386 would be the appropriate value,
with 1.96 resulting in far too stringent a test (far
beyond 99 percent). (Non-overlap of two notches requires
$|M_1 - M_2| > C (s_1 + s_2)$, whereas a 95 percent test
requires $|M_1 - M_2| > 1.96 \sqrt{s_1^2 + s_2^2}$; the two
agree when $C = 1.96 \sqrt{s_1^2 + s_2^2} / (s_1 + s_2)$, which
equals $1.96/\sqrt{2} = 1.386$ for $s_1 = s_2$ and approaches
1.96 as one standard deviation comes to dominate the other.)
A value between these limits, C = 1.7, was empirically
selected as preferable. Thus the notches used were
computed as

    $M \pm 1.7 \, (1.25 R / (1.35 \sqrt{N}))$.    (7.3)

Clearly, a variety of other choices, such as a single
less conservative value (<1.7) or one dependent upon
the data (chosen to compromise over the range of the
ratios of the spreads involved), are possible and may
be preferable in certain cases.

8. Display of Outliers
In many applications, and particularly for all but
the least sophisticated viewers, special attention to
outlying values may be desirable. This can be simply
done by changing from box plots to schematic plots
(Tukey 1977). Since the central box is the same on
both, all the preceding discussion will apply.

9. Conclusion
Box plots have proven to be a most valuable tool in
data analysis. The variants described (variable width
boxes, notched boxes, and a combination of the
two) provide additional information on the display.
Hopefully, these not only facilitate interpretation and
provide additional insight into the data but also lessen
the possibility of misinterpretations due to unwarranted
assumptions.
[Received January 26, 1977. Revised September 13, 1977.]

References
Cleveland, W. S., Dunn, D. M., and Terpenning, I. (1976),
"SABL: A Resistant Seasonal Adjustment Procedure with Graphical
Methods for Interpretation and Diagnosis," Proceedings of
the NBER-CENSUS Conference on Seasonal Analysis of Economic
Time Series, U.S. Bureau of the Census: Washington, D.C.
Cleveland, W. S., Graedel, T. E., and Kleiner, B. (1977), "Urban
Formaldehyde: Observed Correlation with Source Emissions and
Photochemistry," Atmospheric Environment, 11, pp. 357-360.
Cohen, A., Gnanadesikan, R., Kettenring, J. R., and Landwehr, J. M.
(1977), "Methodological Developments in Some Applications
of Clustering," Proceedings of the Symposium on Applications
of Statistics, ed. P. R. Krishnaiah, North-Holland Publishing Co.
Kendall, M. G., and Stuart, A. (1967), The Advanced Theory of
Statistics, Vol. 1, 2nd ed., Ch. 14, New York: Hafner Publishing Co.
Kettenring, J. R., Rogers, W. H., Smith, M. E., and Warner, J. L.
(1976), "Cluster Analysis Applied to the Validation of Course
Objectives," Journal of Educational Statistics, 1, No. 1, 39-57.
Tippett, L. H. C. (1950), Technological Applications of Statistics,
New York: John Wiley and Sons, p. 106.
Tukey, J. W. (1970), Exploratory Data Analysis (Limited Preliminary
Edition), Vol. 1, Ch. 5, Reading, Mass: Addison-Wesley
Publishing Co.
Tukey, J. W. (1977), Exploratory Data Analysis (First Edition), Reading,
Mass: Addison-Wesley Publishing Co.