Boxplots in data science
Boxplots provide a visual representation of a data set that can be used to determine
whether the data set is symmetric or skewed. Constructing a boxplot requires calculating
the “5 number summary” and the interquartile range (IQR), and identifying any outliers.
5 Number Summary – The 5 number summary for a data set includes the following values, listed in
order from smallest to largest: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
IQR - The Interquartile Range (IQR = Q3 − Q1) is a measure of spread used to calculate the lower and
upper outlier boundaries. These boundaries are then used to determine whether a data set has any outliers.
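As a minimal sketch of how the boundaries might be computed, the Python function below assumes the common 1.5 × IQR rule. The handout does not state the multiplier, but 1.5 reproduces the boundaries quoted in both of its examples. The function name `outlier_boundaries` and the parameter `k` are illustrative.

```python
def outlier_boundaries(q1, q3, k=1.5):
    """Return (IQR, lower boundary, upper boundary).

    Assumption: the boundaries sit k * IQR beyond the quartiles, with k = 1.5.
    """
    iqr = q3 - q1
    return iqr, q1 - k * iqr, q3 + k * iqr


# Quartiles from the MLB payroll example: Q1 = 75, Q3 = 118
print(outlier_boundaries(75, 118))  # (43, 10.5, 182.5)
```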
Outliers - Outliers are data points that are considerably smaller or larger than most of the other values
in a data set. Data values that are smaller than the lower outlier boundary or larger than the upper outlier
boundary are outliers. Some data sets do not have any outliers. Outliers that are determined to be the
result of an error should be removed from the data set.
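The comparison described above can be sketched in a few lines of Python. The helper name `split_outliers` is hypothetical, and the list holds only the payroll values quoted in this handout, not the full data set. The upper boundary of 182.5 comes from the handout. The lower boundary of 10.5 is an assumption derived from the 1.5 × IQR rule.

```python
def split_outliers(values, lower, upper):
    """Return (low outliers, high outliers) for the given boundaries."""
    low = [v for v in values if v < lower]
    high = [v for v in values if v > upper]
    return low, high


# Only the payroll values quoted in this handout, not the full data set.
quoted_values = [55, 75, 118, 198]
# 182.5 is quoted in the handout; 10.5 assumes the 1.5 * IQR rule.
print(split_outliers(quoted_values, lower=10.5, upper=182.5))  # ([], [198])
```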
Example – For the following data set (2012 data for MLB team payrolls in millions), find a) the 5 number
summary, b) the IQR, c) the upper and lower outlier boundaries, and d) any outliers. Note – data should be
sorted from lowest to highest if it is not provided that way. This allows the easy identification of the min,
max, median, and individual data positions within the set.
There are 30 data points in the set.
Minimum = 55
Position of Q1 (represents the 25th percentile): $\frac{25}{100}(30) = 7.5$, so Q1 is the value in the 8th position: $Q_1 = 75$
Median (average of the 2 middle data points in the set): $\frac{83 + 88}{2} = 85.5$
Position of Q3 (represents the 75th percentile): $\frac{75}{100}(30) = 22.5$, so Q3 is the value in the 23rd position: $Q_3 = 118$
Maximum = 198
If the “position” calculation results in a decimal, round up to the next whole number to determine the position.
If the calculation results in a whole number, average the data value at that position with the next data value.
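The position rule can be sketched in Python as follows. This is one possible reading of the rule, not a standard library routine. The names `value_at_percentile` and `five_number_summary` are illustrative, and the sample list is hypothetical data rather than the payroll data set. The position is computed with integer arithmetic so that the whole-number test is exact.

```python
def value_at_percentile(sorted_data, p):
    """Apply the position rule to data already sorted lowest to highest.

    p is a whole-number percentile strictly between 0 and 100.
    """
    n = len(sorted_data)
    if n == 0 or not 0 < p < 100:
        raise ValueError("need non-empty data and 0 < p < 100")
    # position = (p / 100) * n, kept in integer arithmetic
    whole, remainder = divmod(p * n, 100)
    if remainder:
        # decimal position: round up to 1-based position whole + 1
        return sorted_data[whole]
    # whole-number position: average that value with the next one
    return (sorted_data[whole - 1] + sorted_data[whole]) / 2


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the position rule."""
    ordered = sorted(data)
    return (
        ordered[0],
        value_at_percentile(ordered, 25),
        value_at_percentile(ordered, 50),
        value_at_percentile(ordered, 75),
        ordered[-1],
    )


# Hypothetical data, not the payroll data set
sample = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(five_number_summary(sample))  # (10, 30, 55.0, 80, 100)
```

With 30 data points the same rule picks the 8th position for Q1, the 23rd position for Q3, and the average of the 15th and 16th values for the median, which agrees with the positions in the payroll example.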
Upper Outliers: 198 (Yankees). This data value is larger than the upper boundary of 182.5.
Constructing a Box Plot – Construct a Boxplot for the data set in the previous example. Determine
whether the data set is symmetric or skewed.
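[Figure: boxplot of the payroll data drawn above a number line with tick marks every 10 units from 55 to 195. The values 85.5 (median) and 118 (Q3) are labeled on the box, and the outlier is marked with an x.]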
Lower whisker: draw the whisker out to the smallest data value that is larger than the lower boundary.
Upper whisker: draw the whisker out to the largest data value that is smaller than the upper boundary.
This data set is skewed right.
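The two drawing rules, plus one way to judge skew, can be sketched in Python. The whisker function follows the rule stated above. The skew check is an assumption: the handout gives the conclusion but not a rule, so the sketch uses a common heuristic that compares the two halves of the box. Both function names are illustrative, and the data list is hypothetical.

```python
import math


def whisker_ends(data, lower, upper):
    """Return the data values where the lower and upper whiskers stop.

    Assumes at least one value lies strictly between the boundaries.
    """
    inside = [v for v in data if lower < v < upper]
    return min(inside), max(inside)


def skew_direction(q1, median, q3):
    """Heuristic (an assumption, not from the handout): compare box halves."""
    left, right = median - q1, q3 - median
    if math.isclose(left, right):
        return "symmetric"
    return "right" if right > left else "left"


# Hypothetical data with boundaries 5 and 45
print(whisker_ends([2, 18, 20, 25, 30, 33, 60], 5, 45))  # (18, 33)

# Quartiles from the payroll example
print(skew_direction(75, 85.5, 118))  # right
```

For the payroll example the upper half of the box (118 − 85.5 = 32.5) is wider than the lower half (85.5 − 75 = 10.5), which is consistent with the handout's conclusion that the data set is skewed right.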
Try this on your own - Construct a Boxplot for the following data set by finding the 5 number summary,
the IQR, the outlier boundaries, and any outliers (if they exist).
Data Set
Answers:
5 Number Summary
Min = 8.2
Q1 = 8.9
Median = 10.05
Q3 = 12.2
Max = 16.1
IQR = 3.3
Lower Outlier Boundary = 3.95
Upper Outlier Boundary = 17.15
Outliers = None
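As a check, these answers can be reproduced from the quartiles, assuming the 1.5 × IQR rule. The handout does not state the multiplier, but it matches the listed boundaries.

```latex
\begin{align*}
\text{IQR} &= Q_3 - Q_1 = 12.2 - 8.9 = 3.3 \\
\text{Lower boundary} &= Q_1 - 1.5 \times \text{IQR} = 8.9 - 4.95 = 3.95 \\
\text{Upper boundary} &= Q_3 + 1.5 \times \text{IQR} = 12.2 + 4.95 = 17.15
\end{align*}
```

The minimum (8.2) is above 3.95 and the maximum (16.1) is below 17.15, so no data value falls outside the boundaries.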
Box Plot:
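[Figure: boxplot drawn above a number line with tick marks from 8 to 17, labeled at 8.2 (minimum), 8.9 (Q1), 10.05 (median), 12.2 (Q3), and 16.1 (maximum).]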