Boxplot
https://www.statology.org/boxplots/
A boxplot, sometimes called a box-and-whisker plot, is a plot that visualizes the five-number summary of a
dataset. The box is constructed from the quartiles, and the whiskers are the lines that extend from the
quartiles to the minimum and maximum values, excluding any outliers. The top whisker ends at the maximum,
the top of the box represents the 3rd quartile, the middle line in the box represents the median, the small
“x” in the box represents the mean, the bottom of the box represents the 1st quartile, and the bottom whisker
ends at the minimum value.
One of the easiest ways to visualize a five-number summary is by creating a boxplot, sometimes called a
box-and-whisker plot, which uses a box with a line in the middle along with “whiskers” that extend from each end.
A box plot provides a pictorial representation of the following statistics: maximum, 75th percentile, median
(50th percentile), mean, 25th percentile and minimum. Box plots are especially useful when comparing samples
and testing whether data is distributed symmetrically.
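As a rough illustration of the statistics listed above, the following Python sketch computes them for one sample using only the standard library. The function name box_plot_summary and the sample values are made up for this example, and the inclusive quartile convention is an assumption; other conventions give slightly different quartiles.

```python
import statistics

def box_plot_summary(values):
    """Return the statistics a box plot displays for one sample."""
    data = sorted(values)
    # Assumption: inclusive quartiles; other conventions shift Q1 and Q3 slightly.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "mean": statistics.mean(data),
    }

# Hypothetical sample, made up for illustration only.
scores = [2, 4, 4, 5, 6, 7, 9, 11]
summary = box_plot_summary(scores)
print(summary)
```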
Using Boxplots to Graph the Interquartile Range
Boxplots are a great way to visualize interquartile ranges and their relation to the median and the overall
distribution. These graphs display ranges of values based on quartiles and show asterisks for outliers that fall
outside the whiskers. Boxplots work by splitting your data into quarters.
The box in the boxplot is the interquartile range. It contains the middle 50% of the data. By comparing the size of
these boxes, you can understand your data’s variability. More dispersed distributions have wider boxes.
Additionally, find where the median line falls within each interquartile box. If the median is closer to one side or
the other of the box, it’s a skewed distribution. When the median is near the center of the interquartile range,
the distribution is symmetric.
For example, in the boxplot below, method 3 has the highest variability in scores and is left-skewed. Conversely,
method 2 has a tighter distribution that is symmetrical, although it also has an outlier—read the next section for
more about that!
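The two readings described above (box size for variability, median position for skew) can be sketched numerically. This is a minimal Python sketch, not part of the original posts; box_shape and both samples are hypothetical, and inclusive quartiles are assumed.

```python
import statistics

def box_shape(values):
    """Return (IQR, relative position of the median inside the box)."""
    q1, median, q3 = statistics.quantiles(sorted(values), n=4, method="inclusive")
    iqr = q3 - q1
    # 0.5 means the median sits in the middle of the box;
    # values near 0 or 1 mean it sits close to Q1 or Q3, which suggests skew.
    position = (median - q1) / iqr if iqr else 0.5
    return iqr, position

# Hypothetical samples, made up for illustration only.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
skewed = [1, 2, 3, 4, 5, 10, 20, 30, 40]

sym_iqr, sym_pos = box_shape(symmetric)
skw_iqr, skw_pos = box_shape(skewed)
print(sym_iqr, sym_pos)
print(skw_iqr, skw_pos)
```

In the second sample the box is much wider and the median sits near the bottom of the box, so the longer tail is on the high side.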
Example 1
A market research company asks 30 people to evaluate three brands of tablet computers using a questionnaire.
The 30 people are divided at random into 3 groups of 10 people each, where the first group evaluates Brand A,
the second evaluates Brand B and the third evaluates Brand C. Figure 1 summarizes the questionnaire scores
from these groups.
To generate the box plots for these three groups, press Ctrl-m and select the Descriptive Statistics and
Normality data analysis tool. A dialog box will now appear as shown in Figure 4 of Descriptive
Statistics Tools. Select the Box Plot option and insert A3:C13 in the Input Range. Check Headings
included with the data and uncheck Use exclusive version of quartile.
The resulting chart is shown in Figure 2.
Note too that the data analysis tool also generates a table, which may be
located behind the chart. For those who are interested, this table
contains the information in Figure 3, as explained further in Special
Charting Capabilities.
For each sample, the box plot consists of a rectangular box with one line
extending upward and another extending downward (usually called
whiskers). The box itself is divided into two parts. In particular, the
meaning of each element in the box plot is described in Figure 3.
Element                        Meaning
Top of upper whisker           Maximum value of the sample
Top of box                     75th percentile of the sample
Line through the box           Median of the sample
Bottom of the box              25th percentile of the sample
Bottom of the lower whisker    Minimum of the sample
× markers                      Mean of the sample
Figure 3 – Box Plot elements
There are two versions of this table, depending on whether you check or
uncheck the Use exclusive version of quartile field. If checked, then
the QUARTILE.EXC version of the 25th and 75th percentiles is used (or
QUARTILE_EXC for Excel 2007 users), while if this field is unchecked
then the QUARTILE.INC (or equivalently the QUARTILE) version is used.
See Ranking Functions in Excel for more details about the difference
between these two versions.
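To get a feel for how the two quartile versions differ, here is a small Python sketch. It assumes that the "inclusive" and "exclusive" methods of Python's statistics.quantiles follow the same interpolation rules as QUARTILE.INC and QUARTILE.EXC respectively; the data values are made up for illustration.

```python
import statistics

# Hypothetical sample, made up for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Assumed counterpart of QUARTILE.INC: sample treated as including its min and max.
inclusive = statistics.quantiles(data, n=4, method="inclusive")

# Assumed counterpart of QUARTILE.EXC: quartiles pushed further toward the extremes.
exclusive = statistics.quantiles(data, n=4, method="exclusive")

print(inclusive)  # [Q1, median, Q3] under the inclusive rule
print(exclusive)  # [Q1, median, Q3] under the exclusive rule
```

The median is the same under both rules, while the exclusive version gives a lower 1st quartile and a higher 3rd quartile, and so a taller box.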
From the box plot in Figure 2, we can see that the scores for Brand C
tend to be higher than for the other brands and those for Brand B tend to
be lower. We also see that the distribution of Brand A is fairly
symmetric, at least in the range between the 1st and 3rd quartiles, although
there is some asymmetry for higher values (or potentially there is an
outlier). Brands B and C look less symmetric. Because of the long upper
whisker (especially with respect to the box), Brand B may have an outlier
(see Outliers and Robustness for a discussion of outliers).
Alternative Representation
When a data set has one or more negative values, the y-axis will be
shifted upward by the amount of -MIN(R1). Here, R1 is the data range
containing the data. Thus if R1 ranges from -10 to 20, the range in the
chart will range from 0 to 30.
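The shift described above amounts to adding -MIN(R1) to every value when the minimum is negative. A minimal Python sketch, with the hypothetical helper name shift_for_chart and made-up data that spans -10 to 20 as in the text:

```python
def shift_for_chart(values):
    """Shift values upward so that the smallest plotted value is 0.

    The shift is -min(values) when the minimum is negative, otherwise 0.
    """
    shift = max(0, -min(values))
    return shift, [v + shift for v in values]

# Hypothetical data range R1 running from -10 to 20.
r1 = [-10, 0, 5, 20]
shift, plotted = shift_for_chart(r1)
print(shift, plotted)

# Data without negative values is left unchanged.
no_shift, unchanged = shift_for_chart([3, 7, 9])
```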
Example 2: Create the box plot for the data in Figure 5.9.1 where cell
B11 is changed to -300, using the exclusive version of the quartile function.
The procedure is the same as for Example 1, except that this time we
check the Use exclusive version of quartile option. The output is
shown in Figure 5.
The key difference is that since the smallest data value is -300 (the value
in cell F13), all the box plot values are shifted up by 300. This is evident
by noting that the lower tail for Brand B is at 0 instead of -300 (and that
cell G6 contains 0 instead of -300).
Note that two y-axes are displayed. The one on the left is based on the
displacement of 300 units, while the one on the right shows the correct
units.
You can remove the y-axis on the left with the following steps:
1. Select the y-axis on the left and then right-click.
2. Choose the Format Axis… option from the menu that
appears.
3. When the menu of options appears as shown in Figure 6,
change the Label Position option from Next to
Axis to None.
Note that if you change any of the data elements, the box chart will still
be correct, although the right y-axis will not change and will still reflect
the original data, and so you will need to rely on the left y-axis (you can
remove the right y-axis as described above for the left y-axis).
See Box Plots with Outliers to see how to generate box plots in Excel
which also explicitly show outliers. The following two versions are
described:
An Excel charting capability that is available for versions of
Excel starting with Excel 2016
An extended version of the Real Statistics data analysis tool
described above. This tool is available even for versions of
Excel prior to Excel 2016.
See Special Charting Capabilities for how to create a box plot manually,
using only Excel charting capabilities.
Starting with Excel 2016 Microsoft added a Box and Whiskers chart capability. To create a box plot, highlight the
data range A2:C11 and select Insert > Insert Statistic Chart > Box and
Whisker. The boxplot will appear:
You can add a legend as well as chart and axis titles. The box part of the chart is as described above. The mean
is shown as an ×. The upper whisker extends from the top of the box to the largest value that is less than or equal
to the 3rd quartile plus 1.5 times the interquartile range (IQR), and the lower whisker extends from the bottom of
the box to the smallest value that is greater than or equal to the 1st quartile minus 1.5 times the IQR. Values
outside this range are considered to be outliers and are represented by dots.
The boundaries of the box and whiskers can be calculated using the formulas below. The only outlier is 1850 for
Brand B, which is higher than the upper whisker, and so is shown as a dot.
The whisker ends can be calculated with the array formulas
=MAX(IF(C2:C11<=H7,C2:C11,MIN(C2:C11)))
=MIN(IF(C2:C11>=H8,C2:C11,MAX(C2:C11)))
the second of which calculates a value for cell H10. The first formula returns
the largest value in C2:C11 that does not exceed H7, and the second returns
the smallest value that is not below H8. In versions of Excel that provide
MAXIFS and MINIFS, we can also use the non-array formulas
=MAXIFS(C2:C11,"<="&H7) and =MINIFS(C2:C11,">="&H8).
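The whisker and outlier rule can also be sketched outside Excel. The Python sketch below uses the usual 1.5 × IQR fences; the function name, the sample values and the choice of inclusive quartiles are assumptions made for illustration, and Excel's chart may use a different quartile convention by default.

```python
import statistics

def whiskers_and_outliers(values):
    """Return (lower whisker, upper whisker, outliers) using 1.5 * IQR fences."""
    data = sorted(values)
    # Assumption: inclusive quartiles.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data values still inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return inside[0], inside[-1], outliers

# Hypothetical sample with one unusually large value.
sample = [10, 12, 13, 14, 15, 16, 18, 40]
low, high, outliers = whiskers_and_outliers(sample)
print(low, high, outliers)
```

Here the fences are 7.125 and 22.125, so the whiskers end at 10 and 18 and the value 40 would be drawn as a dot.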
The Real Statistics Resource Pack also provides a way of generating box
plots with outliers. To produce such a box plot, proceed as in Example 1
of Creating Box Plots in Excel, except that this time you should select
the Box Plots with Outliers option of the Descriptive Statistics
and Normality data analysis tool. The output for Example 1 of Creating
Box Plots in Excel is shown in Figure 3.
Figure 3 – Output from Box Plots with Outliers tool
As you can see, the output is similar to that shown in Figure 1, except
that this version is also available in releases of Excel prior to Excel
2016. Also, the Outlier Multiplier is not fixed at 1.5 but can be set to
another value by the user (in the dialog box for the Descriptive
Statistics and Normality data analysis tool).
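The lower and upper whisker ends are calculated by the array formulas
below. Judging from the formulas themselves, cells F13 and F15 hold the
1st and 3rd quartiles and cell F2 holds the Outlier Multiplier.
=MIN(IF(ISBLANK(A4:A13),"",IF(A4:A13>=F13-$F2*(F15-F13),A4:A13,"")))
=MAX(IF(ISBLANK(A4:A13),"",IF(A4:A13<=F15+$F2*(F15-F13),A4:A13,"")))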
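To see how the Outlier Multiplier changes the whiskers, the sketch below mirrors the two formulas in Python: the quartiles are passed in (like the quartile cells in the worksheet), blank cells are represented by None and skipped, and the multiplier is a parameter. The name whisker_ends, the data and the quartile values are hypothetical.

```python
def whisker_ends(cells, q1, q3, multiplier=1.5):
    """Return (lower whisker, upper whisker) for given quartiles and multiplier."""
    values = [c for c in cells if c is not None]  # skip blank cells
    iqr = q3 - q1
    # Smallest value not below Q1 - multiplier * IQR.
    lower = min(v for v in values if v >= q1 - multiplier * iqr)
    # Largest value not above Q3 + multiplier * IQR.
    upper = max(v for v in values if v <= q3 + multiplier * iqr)
    return lower, upper

# Hypothetical column with one blank cell; quartiles assumed to be 12.75 and 16.5.
column = [10, None, 12, 13, 14, 15, 16, 18, 40]
default_ends = whisker_ends(column, 12.75, 16.5)
wide_ends = whisker_ends(column, 12.75, 16.5, multiplier=7)
narrow_ends = whisker_ends(column, 12.75, 16.5, multiplier=0.5)
print(default_ends, wide_ends, narrow_ends)
```

A larger multiplier pulls more points inside the whiskers, while a smaller one flags more points as outliers.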
Negative numbers are handled in a manner similar to that for Box Plots
without Outliers (often using a second y-axis). Keep in mind, though,
that a second y-axis is only employed when the lower whisker of at least
one of the box plots is negative. If some outlier is negative but none of
the lower whiskers are negative, then a second y-axis is not needed.
See Creating Box Plots with Outliers in Excel for how to create a box plot
with outliers manually, using only Excel charting capabilities. Issues that
arise when some of the data is negative are also explored in a little more
depth there.