Box Plot
History
The range-bar method was first introduced by Mary Eleanor Spear in her book "Charting Statistics" in
1952[4] and again in her book "Practical Charting Techniques" in 1969.[5] The box-and-whisker plot was
first introduced in 1970 by John Tukey, who later published on the subject in his book "Exploratory Data
Analysis" in 1977.[6]
Elements
A boxplot is a standardized way of displaying the dataset based on the five-number summary: the
minimum, the maximum, the sample median, and the first and third quartiles.
Minimum (Q0 or 0th percentile): the lowest data point in the data set excluding any outliers
Maximum (Q4 or 100th percentile): the highest data point in the data set excluding any
outliers
Median (Q2 or 50th percentile): the middle value in the data set
First quartile (Q1 or 25th percentile): also known as the lower quartile qn(0.25), it is the
median of the lower half of the dataset.
Third quartile (Q3 or 75th percentile): also known as
the upper quartile qn(0.75), it is the median of the upper
half of the dataset.[7]
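The five values listed above can be computed with a few lines of Python. The following is a minimal sketch, assuming the "median of the lower/upper half" rule for the quartiles described in the list; the function name and the sample numbers are made up for illustration, and the minimum and maximum here are the raw extremes (no outlier exclusion is applied).

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + n % 2:]    # values above it (middle value skipped when n is odd)
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical sample, not taken from the article.
print(five_number_summary([1, 3, 4, 7, 8, 10, 12, 15]))  # (1, 3.5, 7.5, 11.0, 15)
```

Other quartile conventions exist and can give slightly different Q1 and Q3 values for the same data.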
Whiskers
In the most straightforward method, the boundary of the lower whisker is the minimum value of the data
set, and the boundary of the upper whisker is the maximum value of the data set.
Figure 3. Same box-plot with whiskers drawn within the 1.5 IQR value
There are other representations in which the whiskers can stand for several other things, such as:
The minimum and the maximum value of the data set (as shown in Figure 2)
One standard deviation above and below the mean of the data set
The 9th percentile and the 91st percentile of the data set
The 2nd percentile and the 98th percentile of the data set
Rarely, a box plot is drawn without whiskers. This can be appropriate for sensitive information, so that
the whiskers (and outliers) do not disclose the actual values observed.[9]
Some box plots include an additional character to represent the mean of the data.[10][11]
The unusual percentiles 2%, 9%, 91%, 98% are sometimes used for whisker cross-hatches and whisker
ends to depict the seven-number summary. If the data are normally distributed, the locations of the seven
marks on the box plot will be approximately equally spaced. On some box plots, a cross-hatch is placed
before the end of each whisker.
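The spacing claim for the seven-number summary can be checked numerically. The sketch below uses Python's standard `statistics.NormalDist` to locate the seven percentiles on a standard normal distribution; the variable names are illustrative only.

```python
from statistics import NormalDist

# Percentiles of the seven-number summary: 2%, 9%, 25%, 50%, 75%, 91%, 98%.
percentiles = [0.02, 0.09, 0.25, 0.50, 0.75, 0.91, 0.98]

std_normal = NormalDist()  # mean 0, standard deviation 1
marks = [std_normal.inv_cdf(p) for p in percentiles]

# Distance between each pair of neighbouring marks, in standard deviations.
gaps = [b - a for a, b in zip(marks, marks[1:])]

for p, z in zip(percentiles, marks):
    print(f"{p:>4.0%}  z = {z:+.3f}")
print("gaps:", [round(g, 3) for g in gaps])
```

The gaps come out close to one another (roughly 0.67 to 0.71 standard deviations), which is why these otherwise odd-looking percentiles are chosen.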
Because of this variability, it is appropriate to describe the convention that is being used for the whiskers
and outliers in the caption of the box-plot.
Variations
Since the mathematician John W. Tukey first popularized this type
of visual data display in 1969, several variations on the classical
box plot have been developed, and the two most commonly found
variations are the variable width box plots and the notched box
plots shown in Figure 4.
Variable width box plots illustrate the size of each group whose
data is being plotted by making the width of the box proportional to
the size of the group. A popular convention is to make the box
width proportional to the square root of the size of the group.[12]
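As a small illustration of the square-root convention, the sketch below scales each box width relative to the largest group. The function name and the group sizes are hypothetical.

```python
from math import sqrt

def relative_box_widths(group_sizes):
    """Box widths proportional to sqrt(group size), scaled so the largest group has width 1."""
    largest = max(group_sizes)
    return [sqrt(n / largest) for n in group_sizes]

# Hypothetical group sizes: a 16-fold difference in size becomes a 4-fold difference in width.
print(relative_box_widths([25, 100, 400]))  # [0.25, 0.5, 1.0]
```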
Notched box plots narrow the box around the median. One convention for obtaining the boundaries of
these notches is to use a distance of $\pm 1.58 \cdot \mathrm{IQR} / \sqrt{n}$ around the median.[13]
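A minimal sketch of the notch computation follows, assuming the convention documented for R's `boxplot.stats` (median ± 1.58 · IQR / √n). The function name is hypothetical; the inputs are the median (70 °F), IQR (9 °F) and sample size (24) of the temperature example used later in this article.

```python
from math import sqrt

def notch_bounds(med, iqr, n, factor=1.58):
    """Return (lower, upper) notch boundaries: med -/+ factor * iqr / sqrt(n)."""
    half_width = factor * iqr / sqrt(n)
    return med - half_width, med + half_width

lower, upper = notch_bounds(med=70, iqr=9, n=24)
print(f"{lower:.2f} to {upper:.2f}")  # 67.10 to 72.90
```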
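Adjusted box plots are intended to describe skew distributions, and they rely on the medcouple statistic of
skewness.[14] For a medcouple value of MC, the lengths of the upper and lower whiskers on the box-plot
are respectively defined to be:
$1.5\,\mathrm{IQR} \cdot e^{3\,\mathrm{MC}}$ and $1.5\,\mathrm{IQR} \cdot e^{-4\,\mathrm{MC}}$ if $\mathrm{MC} \geq 0$,
$1.5\,\mathrm{IQR} \cdot e^{4\,\mathrm{MC}}$ and $1.5\,\mathrm{IQR} \cdot e^{-3\,\mathrm{MC}}$ if $\mathrm{MC} < 0$.
For a symmetrical data distribution, the medcouple will be zero, and this reduces the adjusted box-plot to
Tukey's box-plot with equal whisker lengths of $1.5\,\mathrm{IQR}$ for both whiskers.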
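The whisker lengths of the adjusted box plot can be sketched as below. The exponents (3 and −4 for MC ≥ 0, 4 and −3 for MC < 0) follow the Hubert and Vandervieren rule as commonly stated and should be treated as an assumption here; computing the medcouple itself is not shown, and the function name is hypothetical.

```python
from math import exp

def adjusted_whisker_lengths(iqr, mc):
    """Return (upper_length, lower_length) of the adjusted box plot whiskers.

    Assumed rule: 1.5*IQR*e^(3*MC) and 1.5*IQR*e^(-4*MC) for MC >= 0,
                  1.5*IQR*e^(4*MC) and 1.5*IQR*e^(-3*MC) for MC < 0.
    """
    if mc >= 0:
        return 1.5 * iqr * exp(3 * mc), 1.5 * iqr * exp(-4 * mc)
    return 1.5 * iqr * exp(4 * mc), 1.5 * iqr * exp(-3 * mc)

# With MC = 0 both whiskers fall back to the usual 1.5 IQR.
print(adjusted_whisker_lengths(iqr=9, mc=0))    # (13.5, 13.5)
# A positive medcouple (right skew) lengthens the upper whisker and shortens the lower one.
print(adjusted_whisker_lengths(iqr=9, mc=0.2))
```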
Other kinds of plots, such as violin plots and bean plots, can show the difference between unimodal and
multimodal distributions, which cannot be observed from the original classical box-plot.[6]
Examples
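The first example uses a series of hourly temperatures recorded over one day. The ordered set of recorded
values is (°F): 57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81.
A box plot of the data set can be generated by first calculating five relevant values of this data set:
minimum, maximum, median (Q2), first quartile (Q1), and third quartile (Q3).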
The minimum is the smallest number of the data set. In this case, the minimum recorded day temperature
is 57 °F.
Figure 5. The generated boxplot figure of the example on the left with no outliers.
The maximum is the largest number of the data set. In this case, the
maximum recorded day temperature is 81 °F.
The median is the "middle" number of the ordered data set. This means that about 50% of the elements are
less than or equal to the median and about 50% are greater than or equal to it. The median of this
ordered data set is 70 °F.
The first quartile value (Q1 or 25th percentile) is the number that marks one quarter of the ordered data
set. In other words, about 25% of the elements are less than the first quartile and about 75% of the
elements are greater than it. The first quartile value can be determined by finding the "middle" number
of the values between the minimum and the median, that is, the median of the lower half of the data.
For the hourly temperatures, the "middle" number found between 57 °F and 70 °F is 66 °F.
The third quartile value (Q3 or 75th percentile) is the number that marks three quarters of the ordered data
set. In other words, about 75% of the elements are less than the third quartile and about 25% of the
elements are greater than it. The third quartile value can be obtained by finding the "middle" number
of the values between the median and the maximum, that is, the median of the upper half of the data.
For the hourly temperatures, the "middle" number between 70 °F and 81 °F is 75 °F.
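The interquartile range, or IQR, can be calculated by subtracting the first quartile value (Q1) from the third
quartile value (Q3):
$\mathrm{IQR} = Q_3 - Q_1 = 75\,^\circ\mathrm{F} - 66\,^\circ\mathrm{F} = 9\,^\circ\mathrm{F}$
Hence,
$1.5\,\mathrm{IQR} = 1.5 \cdot 9\,^\circ\mathrm{F} = 13.5\,^\circ\mathrm{F}$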
The upper whisker boundary of the box-plot is the largest data value that is within 1.5 IQR above the third
quartile. Here, 1.5 IQR above the third quartile is 88.5 °F and the maximum is 81 °F. Therefore, the upper
whisker is drawn at the value of the maximum, which is 81 °F.
Similarly, the lower whisker boundary of the box plot is the smallest data value that is within 1.5 IQR
below the first quartile. Here, 1.5 IQR below the first quartile is 52.5 °F and the minimum is 57 °F.
Therefore, the lower whisker is drawn at the value of the minimum, which is 57 °F.
The ordered set for the recorded temperatures is (°F): 52, 57, 57,
58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76,
78, 79, 89.
In this example, only the first and the last number are changed. The
median, third quartile, and first quartile remain the same.
Figure 6. The generated boxplot of the example on the left with outliers.
In this case, the maximum value in this data set is 89 °F, and 1.5 IQR above the third quartile is 88.5 °F.
The maximum is greater than 1.5 IQR plus the third quartile, so the maximum is an outlier.
Therefore, the upper whisker is drawn at the greatest value smaller than 1.5 IQR above the third quartile,
which is 79 °F.
Similarly, the minimum value in this data set is 52 °F, and 1.5 IQR below the first quartile is 52.5 °F. The
minimum is smaller than the first quartile minus 1.5 IQR, so the minimum is also an outlier. Therefore, the
lower whisker is drawn at the smallest value greater than 1.5 IQR below the first quartile, which is 57 °F.
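Both temperature examples can be reproduced with a short script. This is a sketch under stated assumptions: the quartiles are taken as the medians of the lower and upper halves, the function name is hypothetical, and the first data set is reconstructed from the statement that only its first and last values (57 and 81) differ from the second one.

```python
from statistics import median

def tukey_whiskers(values, k=1.5):
    """Return (lower whisker end, upper whisker end, outliers) for the k * IQR rule."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])          # median of the lower half
    q3 = median(data[(n + 1) // 2 :])    # median of the upper half
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    # Whiskers end at the most extreme data values still inside the fences.
    return inside[0], inside[-1], outliers

# The 22 values shared by both examples (°F).
core = [57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
        70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79]
without_outliers = [57] + core + [81]
with_outliers = [52] + core + [89]

print(tukey_whiskers(without_outliers))  # (57, 81, [])
print(tukey_whiskers(with_outliers))     # (57, 79, [52, 89])
```

In the second call the fences are 52.5 °F and 88.5 °F, so 52 and 89 are reported as outliers and the whiskers stop at 57 °F and 79 °F, matching the text.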
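An additional example for obtaining a box-plot from a data set containing a large number of data points
uses the general equation for empirical quantiles:
$q_n(p) = x_{(k)} + \alpha\,(x_{(k+1)} - x_{(k)})$, with $k = \lfloor p(n+1) \rfloor$ and $\alpha = p(n+1) - k$
Here $x_{(k)}$ stands for the general ordering of the data points (i.e. if $i < k$, then $x_{(i)} \leq x_{(k)}$).
Using the above example that has 24 data points (n = 24), one can calculate the median, first and third
quartile either mathematically or visually.
Median: $q_n(0.5) = x_{(12)} + (0.5 \cdot 25 - 12)(x_{(13)} - x_{(12)}) = 70 + 0.5 \cdot (70 - 70) = 70\,^\circ\mathrm{F}$
First quartile: $q_n(0.25) = x_{(6)} + (0.25 \cdot 25 - 6)(x_{(7)} - x_{(6)}) = 66 + 0.25 \cdot (66 - 66) = 66\,^\circ\mathrm{F}$
Third quartile: $q_n(0.75) = x_{(18)} + (0.75 \cdot 25 - 18)(x_{(19)} - x_{(18)}) = 75 + 0.75 \cdot (75 - 75) = 75\,^\circ\mathrm{F}$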
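The quantile calculation for the 24 temperatures can be sketched in code. The interpolation rule used here, q_n(p) = x_(k) + α (x_(k+1) − x_(k)) with k = ⌊p(n+1)⌋ and α = p(n+1) − k, is an assumption: it is one common textbook definition of the empirical quantile and is consistent with the quartile values quoted in this article. The function name and the clamping at the ends of the data are choices made for this sketch.

```python
from math import floor

def empirical_quantile(sorted_data, p):
    """Empirical quantile q_n(p) by linear interpolation between order statistics."""
    n = len(sorted_data)
    position = p * (n + 1)
    k = floor(position)          # 1-based index of the lower order statistic
    alpha = position - k         # fractional part used for interpolation
    if k < 1:                    # assumption of this sketch: clamp below the first value
        return sorted_data[0]
    if k >= n:                   # assumption of this sketch: clamp above the last value
        return sorted_data[-1]
    x_k, x_next = sorted_data[k - 1], sorted_data[k]
    return x_k + alpha * (x_next - x_k)

temperatures = [52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
                70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89]

print(empirical_quantile(temperatures, 0.50))  # 70.0
print(empirical_quantile(temperatures, 0.25))  # 66.0
print(empirical_quantile(temperatures, 0.75))  # 75.0
```

Because the neighbouring order statistics are equal in all three cases (70 and 70, 66 and 66, 75 and 75), the interpolation term vanishes and the result does not depend on α.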
Visualization
Although box plots may seem more primitive than histograms or kernel density estimates, they do have
a number of advantages. First, the box plot enables statisticians to do a quick graphical examination of
one or more data sets. Box plots also take up less space and are therefore particularly useful for
comparing distributions between several groups or sets of data in parallel (see Figure 1 for an example).
Lastly, the overall structure of histograms and kernel density estimates can be strongly influenced by the
choice of the number and width of bins and the choice of bandwidth, respectively.
See also
Bagplot
Candlestick chart
Data and information visualization
Exploratory data analysis
Fan chart
Five-number summary
Functional boxplot
Seven-number summary
Violin plot
References
1. C., Dutoit, S. H. (2012). Graphical exploratory data analysis
(http://worldcat.org/oclc/1019645745). Springer. ISBN 978-1-4612-9371-2. OCLC 1019645745
(https://www.worldcat.org/oclc/1019645745).
2. Grubbs, Frank E. (February 1969). "Procedures for Detecting Outlying Observations in
Samples" (https://dx.doi.org/10.1080/00401706.1969.10490657). Technometrics. 11 (1): 1–21.
doi:10.1080/00401706.1969.10490657 (https://doi.org/10.1080%2F00401706.1969.10490657).
ISSN 0040-1706 (https://www.worldcat.org/issn/0040-1706).
3. Richard., Boddy (2009). Statistical Methods in Practice : for Scientists and Technologists
(http://worldcat.org/oclc/940679163). John Wiley & Sons. ISBN 978-0-470-74664-6.
OCLC 940679163 (https://www.worldcat.org/oclc/940679163).
4. Spear, Mary Eleanor (1952). Charting Statistics. McGraw Hill. p. 166.
5. Spear, Mary Eleanor (1969). Practical charting techniques. New York: McGraw-Hill.
ISBN 0070600104. OCLC 924909765 (https://www.worldcat.org/oclc/924909765).
6. Wickham, Hadley; Stryjewski, Lisa. "40 years of boxplots"
(https://vita.had.co.nz/papers/boxplots.pdf) (PDF). Retrieved December 24, 2020.
7. Holmes, Alexander; Illowsky, Barbara; Dean, Susan (31 March 2015). "Introductory
Business Statistics"
(https://opentextbc.ca/introbusinessstatopenstax/chapter/measures-of-the-location-of-the-data/).
OpenStax.
8. Dekking, F.M. (2005). A Modern Introduction to Probability and Statistics
(https://archive.org/details/modernintroducti00dekk_722). Springer. pp. 234
(https://archive.org/details/modernintroducti00dekk_722/page/n240)–238. ISBN 1-85233-896-2.
9. Derrick, Ben; Green, Elizabeth; Ritchie, Felix; White, Paul (September 2022). "The Risk of
Disclosure When Reporting Commonly Used Univariate Statistics". Privacy in Statistical
Databases. 13463: 119–129. doi:10.1007/978-3-031-13945-1_9
(https://doi.org/10.1007%2F978-3-031-13945-1_9).
10. Frigge, Michael; Hoaglin, David C.; Iglewicz, Boris (February 1989). "Some Implementations
of the Boxplot". The American Statistician. 43 (1): 50–54. doi:10.2307/2685173
(https://doi.org/10.2307%2F2685173). JSTOR 2685173 (https://www.jstor.org/stable/2685173).
11. Marmolejo-Ramos, F.; Tian, S. (2010). "The shifting boxplot. A boxplot based on essential
summary statistics around the mean" (https://doi.org/10.21500%2F20112084.823).
International Journal of Psychological Research. 3 (1): 37–46. doi:10.21500/20112084.823
(https://doi.org/10.21500%2F20112084.823).
12. McGill, Robert; Tukey, John W.; Larsen, Wayne A. (February 1978). "Variations of Box Plots".
The American Statistician. 32 (1): 12–16. doi:10.2307/2683468
(https://doi.org/10.2307%2F2683468). JSTOR 2683468 (https://www.jstor.org/stable/2683468).
13. "R: Box Plot Statistics"
(http://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/boxplot.stats.html).
R manual. Retrieved 26 June 2011.
14. Hubert, M.; Vandervieren, E. (2008). "An adjusted boxplot for skewed distribution".
Computational Statistics and Data Analysis. 52 (12): 5186–5201. CiteSeerX 10.1.1.90.9812
(https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.90.9812).
doi:10.1016/j.csda.2007.11.008 (https://doi.org/10.1016%2Fj.csda.2007.11.008).
Further reading
Tukey, John W. (1977). Exploratory Data Analysis
(https://archive.org/details/exploratorydataa00tuke_0). Addison-Wesley. ISBN 9780201076165.
Benjamini, Y. (1988). "Opening the Box of a Boxplot". The American Statistician. 42 (4):
257–262. doi:10.2307/2685133 (https://doi.org/10.2307%2F2685133). JSTOR 2685133
(https://www.jstor.org/stable/2685133).
Rousseeuw, P. J.; Ruts, I.; Tukey, J. W. (1999). "The Bagplot: A Bivariate Boxplot". The
American Statistician. 53 (4): 382–387. doi:10.2307/2686061
(https://doi.org/10.2307%2F2686061). JSTOR 2686061 (https://www.jstor.org/stable/2686061).
External links
Beeswarm Boxplot (http://www.r-statistics.com/2011/03/beeswarm-boxplot-and-plotting-it-with-r/) -
superimposing a frequency-jittered stripchart on top of a box plot