Variations Boxplots

The document describes variations of box plots that were developed to address weaknesses in interpreting the basic box plot display. It introduces a variable-width box plot that incorporates group size information and a notched box plot that provides an indication of differences between median values.


Variations of Box Plots

Robert McGill; John W. Tukey; Wayne A. Larsen

The American Statistician, Vol. 32, No. 1. (Feb., 1978), pp. 12-16.

Stable URL:
http://links.jstor.org/sici?sici=0003-1305%28197802%2932%3A1%3C12%3AVOBP%3E2.0.CO%3B2-9

The American Statistician is currently published by American Statistical Association.

Variations of Box Plots

ROBERT McGILL, JOHN W. TUKEY, AND WAYNE A. LARSEN*

Box plots display batches of data. Five values from a set of data are conventionally used; the extremes, the upper and lower hinges (quartiles), and the median. Such plots are becoming a widely used tool in exploratory data analysis and in preparing visual summaries for statisticians and nonstatisticians alike. Three variants of the basic display, devised by the authors, are described. The first visually incorporates a measure of group size; the second incorporates an indication of rough significance of differences between medians; the third combines the features of the first two. These techniques are displayed by examples.

KEY WORDS: Box plots; Exploratory data analysis; Graphical techniques.

1. Introduction

Box plots display batches of data (Tukey 1970, 1977). Five values from a set of data are conventionally used; the extremes, the upper and lower hinges¹ (quartiles), and the median. The basic configuration of the display is shown in Figure A. The technique has been used with considerable success in a diverse range of projects (cf. Cleveland, Dunn, and Terpenning 1976; Cleveland, Graedel, and Kleiner 1977; Cohen, Gnanadesikan, and Landwehr 1977; Kettenring et al. 1976). Inevitably, certain weaknesses came to light in particular cases; most frequently these were the result of inappropriate interpretation of the results rather than problems with the technique itself. In almost all cases, inclusion of additional available information in the display would have prevented the misinterpretation.

In an attempt to improve the basic display, three modified forms of box plots have been devised. The original version and these new variants are described by example in the following sections.

[Figure A. Configuration of a Box Plot]

2. Basic Box Plot

Beginning in a positive vein, we first consider a case where the original method serves well and examine why this is the case. Figure B displays Tippett's (1950) warp break data for six types of weaving warps. The characteristics of the data are easily seen, with type al appearing rather different from the other types.

[Figure B. Tippett's Warp Break Data. Number of warp breaks during a fixed amount of weaving for 6 types of warp; horizontal axis: type of warp (al, am, ah, bl, bm, bh).]

* Robert McGill is with Statistics and Data Analysis Research Department, Bell Laboratories, Murray Hill, NJ 07974. John W. Tukey is with Research-Communications Principles Division, Bell Laboratories, Murray Hill, NJ 07974, and Department of Statistics, Princeton University, Princeton, NJ 08540. Wayne A. Larsen is with Eyring Research Institute, Provo, UT 84601. This material was prepared in part in connection with research at Princeton University and United States Energy Research and Development Administration.

¹ The lower hinge is actually defined, for a sample of size n, as the $([(n + 1)/2] + 1)/2$th order statistic; [ . . . ] indicating the integer portion of the quotient. If the result is not an integer, the mean of the adjacent order statistics is used. The upper hinge is defined analogously.

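The hinge rule in footnote 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function name `hinges` is hypothetical, and the depth is read as ([(n + 1)/2] + 1)/2 with [ ] taken as the integer part, counted up from the smallest value for the lower hinge and down from the largest value for the upper hinge.

```python
from math import floor


def hinges(values):
    """Return (lower_hinge, upper_hinge) using the footnote's depth rule."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one observation")
    # Hinge depth: ([(n + 1)/2] + 1)/2, with [.] the integer part.
    depth = (floor((n + 1) / 2) + 1) / 2
    lo_idx = floor(depth)  # 1-based order statistic at or below the depth
    hi_idx = lo_idx if depth == lo_idx else lo_idx + 1
    # Average the adjacent order statistics when the depth is fractional.
    lower = (xs[lo_idx - 1] + xs[hi_idx - 1]) / 2
    # The upper hinge uses the same depth, counted down from the largest value.
    upper = (xs[n - lo_idx] + xs[n - hi_idx]) / 2
    return lower, upper
```

For nine observations (the group size in the warp break example) the depth is 3, so the hinges are the third values from each end; for eight observations the depth is 2.5 and the two adjacent order statistics are averaged.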
Frequently, misinterpretation results because the viewer, particularly the nonstatistician, attempts to gain more information from the display than it contains. One might, for example, conclude that the overall median for all the types combined is about 26 or 27. While this conclusion is not justified based on the information contained, it is one that is often made. In this instance, it happens to also be correct: the actual overall median is 26.

In this example, two factors worked to the viewer's advantage. First, each group contained the same number of observations (specifically, nine). Second, the variance of each group, with the possible exception of the first, is moderately constant. In the absence of information to the contrary, the viewer will likely assume these facts (and, in this case, be correct). Next we examine a case where this is not so.
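The caution about reading an overall median off the group medians can be illustrated with a small sketch. All numbers below are made up for illustration and are not taken from the paper's data; the helper name `pooled_vs_median_of_medians` is likewise hypothetical.

```python
from statistics import median


def pooled_vs_median_of_medians(groups):
    """Return (median of all values pooled, median of the group medians)."""
    pooled = median([x for g in groups for x in g])
    of_medians = median([median(g) for g in groups])
    return pooled, of_medians


# Made-up data: equal group sizes and similar spreads.
balanced = [[24, 26, 28], [25, 27, 29], [23, 26, 30]]

# Made-up data: one group is far larger than the others.
unbalanced = [[20, 21, 22], [19, 21, 23], [13] * 30]
```

With equal group sizes the two summaries agree here; once one group dominates, the pooled median follows that group while the median of the group medians does not.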

3. Variable-Width Box Plot

Figure C contains a regular box plot for another set of data. Here a single month's telephone bills for a group of Chicago residence customers is displayed. The data is subdivided by the number of years lived in the city. It should be emphasized that the plot correctly portrays the characteristics of the data displayed.² However, not everything known is shown.

[Figure C. Regular Box Plot. Telephone Bill vs Years Lived in Chicago; horizontal axis: years lived in Chicago (under 1, 1 to 2, 3 to 5, 6 to 10, 11 to 15, over 15).]

On initial examination, perhaps the most striking feature is the pronounced drop seen in the last (over 15 year) group: the median is about $13 while the medians of other groups are about $20 or more. Returning to our rather naive user, he might conclude that the overall median for all groups combined is about $21. This time the conclusion is not only unjustified, it is also grossly in error. The actual overall median is about $14. What has gone wrong in this case?

The information available but not displayed is shown in the following tabulation. The number of customers in the various groups differs widely. In fact the ratio between the largest and the smallest is over 33:1.

Years          Customers
less than 1    11
1 to 2         17
3 to 5         26
6 to 10        35
11 to 15       29
over 15        368

This large variation was obviously not deliberately introduced. Rather, it was the result of an unfortunate choice of group boundaries, made for the collection of the data, which caused over 75 percent of the customers to fall within one group.

Figure D shows a means of displaying this additional information: the variable-width box plot. Here the width of each box has been made proportional to the square root of the number of customers in the corresponding group.³ The viewer's attention is immediately drawn to the size differences. (Obviously, a title clearly setting forth what has been done is definitely in order.) Since the additional information has been clearly displayed, a better appraisal of the data can be made and misinterpretations avoided.

[Figure D. Variable Width Box Plot. Telephone Bill vs Years Lived in Chicago. Width of box proportional to root group size.]

² It will readily be admitted that plotting the data on a transformed scale (e.g., logs) might be preferable, although nontechnical viewers are often confused by this technique. Truncating the upper extremes on the display (not in computation) is another possible improvement. Both are illustrated later.

³ The use of square root will be both intuitive and pleasing to the statistically inclined. However, as discussed in Section 6, other functions of group size may sometimes be more appropriate.

4. Notched Box Plot

Returning to Figure C, the viewer might notice the surprisingly large (over $5) difference in the medians of the first two boxes, classes which, intuitively at least, might be assumed to be rather similar. Were the variable-width plot examined, one might be led to doubt the significance of this difference. While it is evident that the number of customers in these groups is smaller than in other groups, actual size is not indicated. (Note that if the number of customers in each group were 1,000 times larger, boxes of identical width would be produced.) Hence the viewer can determine confidence intervals around the medians only in a relative manner. Figure E, a notched box plot of the data, shows these explicitly. The notches surrounding the medians provide a measure of the rough significance of differences between the values. Specifically, if the notches about two medians do not overlap in this display, the medians are, roughly, significantly different at about a 95% confidence level.⁴ Now the seemingly impressive difference previously noted is brought clearly into proper perspective: the difference is, in fact, not significant by our test. In fact, none of the differences seen in the first five boxes are significant.

[Figure E. Notched Box Plot. Telephone Bill vs Years Lived in Chicago. Non-overlapping of notches indicates significant difference at rough 95% level.]
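The parenthetical remark in this section, that multiplying every group size by 1,000 would produce boxes of identical width, can be checked with the customer counts tabulated in Section 3. The sketch below assumes the widths are normalized so that the largest group has width 1; that normalization and the name `relative_widths` are illustrative assumptions, since the paper only states that width is proportional to the square root of group size.

```python
from math import sqrt, isclose

# Customer counts from the tabulation in Section 3.
counts = {
    "less than 1": 11,
    "1 to 2": 17,
    "3 to 5": 26,
    "6 to 10": 35,
    "11 to 15": 29,
    "over 15": 368,
}


def relative_widths(sizes, max_width=1.0):
    """Box widths proportional to sqrt(group size); widest box gets max_width."""
    roots = {label: sqrt(n) for label, n in sizes.items()}
    biggest = max(roots.values())
    return {label: max_width * r / biggest for label, r in roots.items()}


widths = relative_widths(counts)

# Scaling every group by the same factor leaves the relative widths unchanged.
scaled = relative_widths({label: 1000 * n for label, n in counts.items()})
```

Because only ratios of square roots enter, the 33:1 ratio in group sizes becomes a width ratio of roughly 5.8:1, and the display by itself carries no information about absolute sample size.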
It should be noted that the convention has been adopted that, should the notch lie outside either hinge, an unnotched box, plotted with dashed lines, is displayed for that group indicating low confidence in it. Experience has shown that few cases exist where this is not the appropriate strategy. These few, however, normally occur in cases where all notches protrude beyond a hinge, and/or one or more of the medians lies very near a hinge.

Figure F, using the warp data in Figure B, is such a case. All boxes would be dashed, and the median of type al is very near the upper hinge. Hence protruding notches have been plotted. Even so, little is gained except, perhaps, some slight confirmation of features already obvious.

[Figure F. Tippett's Warp Break Data, Notched. Number of warp breaks during a fixed amount of weaving for 6 types of warp; horizontal axis: type of warp (al, am, ah, bl, bm, bh).]

⁴ Section 7 contains a description of the method used in determining the notch widths and suggests possible alternatives.
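The dashed-box convention described in this section can be sketched as follows, using the notch limits that Section 7 of the published paper arrives at, M ± 1.7(1.25R/1.35√N). The function names `notch` and `box_style` are hypothetical, the spread R is taken here as the distance between the hinges (an assumption), and the input values in the usage note are invented for illustration.

```python
from math import sqrt


def notch(median, spread, n, c=1.7):
    """Notch limits M +/- C * 1.25 * R / (1.35 * sqrt(N))."""
    half_width = c * 1.25 * spread / (1.35 * sqrt(n))
    return median - half_width, median + half_width


def box_style(lower_hinge, median, upper_hinge, n, c=1.7):
    """Return 'dashed' if the notch lies outside either hinge, else 'notched'."""
    # Assumption: R is measured as the hinge-to-hinge spread.
    low, high = notch(median, upper_hinge - lower_hinge, n, c)
    if low < lower_hinge or high > upper_hinge:
        return "dashed"  # notch would protrude beyond a hinge
    return "notched"
```

With invented hinges of 20 and 30, a median of 25, and nine observations, the notch half-width is about 5.2, which exceeds the distance to either hinge, so `box_style(20, 25, 30, 9)` returns `"dashed"`. With several hundred observations the notch shrinks well inside the box.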



5. Variable-Width Notched Box Plot

In certain (perhaps many) cases, advantage will be gained by combining both the techniques described. Figures G through I contain such displays on the same telephone-bill data. In addition, the upper extremes have been truncated for plotting purposes only, and labeled appropriately in Figure G, and a log axis used in Figure H. Now both group size and confidence intervals can be seen simultaneously. As might be expected, the combination adds little in cases where a single technique (variable width or notches) would suffice. Here, due to the nature of the data, the combination does seem to provide an improvement.

Figure I contains the result of the next reasonable step in the analysis of this data. The first five groups have been combined, and boxes for under 15 years and over 15 years displayed. These two groups do appear significantly different, by at least about $4, the distance between the nearest edges of the notches.
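Reading a difference off "the distance between the nearest edges of the notches" amounts to a simple interval computation. A minimal sketch, assuming each notch is already available as a (low, high) pair; the function name `notch_gap` and the numbers in the usage note are hypothetical and are not read from Figure I.

```python
def notch_gap(notch_a, notch_b):
    """Distance between the nearest edges of two notches.

    Each notch is a (low, high) pair. A positive result means the notches
    do not overlap; zero or a negative result means they touch or overlap.
    """
    (low_a, high_a), (low_b, high_b) = notch_a, notch_b
    return max(low_a, low_b) - min(high_a, high_b)
```

For example, invented notches of (12.0, 14.0) and (18.0, 22.0) give a gap of 4.0, which under the paper's rule would indicate a rough 95 percent significant difference, while (12.0, 19.0) and (18.0, 22.0) give a negative value and no such indication.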
Some might argue that the combination will normally contain considerable redundancy, since almost all reasonable measures of significant difference will be based in part on group size. While this fact is admitted, the question of whether such redundancy is necessarily undesirable may well be debated. In the final analysis, the user's personal preference is often the best criterion.

[Figure G. Variable Width Notched Box Plot. Telephone Bill vs Years Lived in Chicago. Non-overlapping of notches indicates significant difference at rough 95% level; width of box proportional to root group size. Plot truncated at $50; extremes printed at top.]

[Figure H. Variable Width Notched Box Plot. Telephone Bill vs Years Lived in Chicago. Non-overlapping of notches indicates significant difference at rough 95% level; width of box proportional to root group size. NOTE: Y axis scale is logarithmic.]

[Figure I. Variable Width Notched Box Plot. Telephone Bill vs Years Lived in Chicago. Non-overlapping of notches indicates significant difference at rough 95% level; width of box proportional to root group size. The first five groups are combined.]


6. Choice of Box Widths

In the examples shown in Figures D, G, H, and I, widths of the boxes have been made proportional to the square root of the group size. This choice was based on the fact that many variability measures, such as standard error, are inversely proportional to the root of the group size. It is plausible to take this as the standard to be used unless other methods offer significant improvement.

One example would be when the actual intent is to display strata fractions. Since the use of a logit scale is sometimes preferred in these displays, box widths based on such a scale might be used here. Again, if the intent is to emphasize differences in group size, the use of square roots will clearly minimize the visual impact of differences, particularly in cases where the sizes are relatively similar. Here one might opt for using widths directly proportional to size. (However, limited experience in this area tends to indicate that this may overemphasize differences.) Again it must be stressed that whatever is done should be clearly indicated.

7. Choice of Notch Size

In notched box plots, one is, of course, faced with the question of how best to determine the widths of the notches. Many methods, both classical and nonparametric, might be considered. None will likely be best in all cases.

In Figures E, G, H, and I, the widths were computed from the midspread or interquartile range (R) of the data (a robust measure of spread), and the number of observations (N) for each group. The Gaussian-based asymptotic approximation (Kendall and Stuart 1967) of the standard deviation s of the median (M) is given by

$$s = \frac{1.25\,R}{1.35\,\sqrt{N}}, \qquad (7.1)$$

and can be shown to be reasonably broadly applicable to other distributions. An appreciation of why this is so can be obtained as follows. The asymptotic formula for the standard deviation of M is $1/(2 f_0 \sqrt{N})$, where $f_0$ is the density at the population median. Also, R is a consistent estimate of the population interquartile range $\Delta$, and $1/(2\Delta)$ is the average density between the population quartiles. Thus for any distribution for which the middle portion is shaped approximately like a Gaussian, the ratio $f_0/(1/(2\Delta))$ will be close to the Gaussian value of 1.08. (This does not explain the rather remarkable result that for the (very skewed) one-sided exponential $2\Delta f_0 = 1.10$.)

The notch around each median may then be calculated as

$$M \pm C s, \qquad (7.2)$$

where C is a constant. Should one desire a notch indicating a 95 percent confidence interval about each median, C = 1.96 would be used. However, since a form of "gap gauge" which would indicate significant differences at the 95 percent level was desired, this was not done. It can be shown that C = 1.96 would only be appropriate if the standard deviations of the two groups were vastly different. If they were nearly equal, C = 1.386 would be the appropriate value, with 1.96 resulting in far too stringent a test (far beyond 99 percent). A value between these limits, C = 1.7, was empirically selected as preferable. Thus the notches used were computed as

$$M \pm 1.7\,\frac{1.25\,R}{1.35\,\sqrt{N}}. \qquad (7.3)$$

Clearly, a variety of other choices, such as a single less conservative value (<1.7) or one dependent upon the data (chosen to compromise over the range of the ratios of the spreads involved), are possible and may be preferable in certain cases.

8. Display of Outliers

In many applications, and particularly for all but the least sophisticated viewers, special attention to outlying values may be desirable. This can be simply done by changing from box plots to schematic plots (Tukey 1977). Since the central box is the same on both, all the preceding discussion will apply.

9. Conclusion

Box plots have proven to be a most valuable tool in data analysis. The variants described (variable width boxes, notched boxes, and a combination of the two) provide additional information on the display. Hopefully, these not only facilitate interpretation and provide additional insight into the data but also lessen the possibility of misinterpretations due to unwarranted assumptions.

[Received January 26, 1977. Revised September 13, 1977.]

References

Cleveland, W. S., Dunn, D. M., and Terpenning, I. (1976), "SABL: A Resistant Seasonal Adjustment Procedure with Graphical Methods for Interpretation and Diagnosis," Proceedings of the NBER-CENSUS Conference on Seasonal Analysis of Economic Time Series, U.S. Bureau of the Census: Washington, D.C.
Cleveland, W. S., Graedel, T. E., and Kleiner, B. (1977), "Urban Formaldehyde: Observed Correlation with Source Emissions and Photochemistry," Atmospheric Environment, 11, pp. 357-360.
Cohen, A., Gnanadesikan, R., Kettenring, J. R., and Landwehr, J. M. (1977), "Methodological Developments in Some Applications of Clustering," Proceedings of the Symposium on Applications of Statistics, ed. P. R. Krishnaiah, North-Holland Publishing Co.
Kendall, M. G., and Stuart, A. (1967), The Advanced Theory of Statistics, Vol. 1, 2nd ed., Ch. 14, New York: Hafner Publishing Co.
Kettenring, J. R., Rogers, W. H., Smith, M. E., and Warner, J. L. (1976), "Cluster Analysis Applied to the Validation of Course Objectives," Journal of Educational Statistics, 1, No. 1, 39-57.
Tippett, L. H. C. (1950), Technological Applications of Statistics, New York: John Wiley and Sons, p. 106.
Tukey, J. W. (1970), Exploratory Data Analysis (Limited Preliminary Edition), Vol. 1, Ch. 5, Reading, Mass: Addison-Wesley Publishing Co.
Tukey, J. W. (1977), Exploratory Data Analysis (First Edition), Reading, Mass: Addison-Wesley Publishing Co.
