
Boxplots in data science

The document explains boxplots, the five-number summary, interquartile range (IQR), and outliers in data sets. It details how to construct a boxplot using the five-number summary, calculate the IQR, and identify outliers based on specified boundaries. An example with MLB team payrolls illustrates these concepts, along with a practice data set for further application.

Uploaded by

Samay Rajput
Copyright
© © All Rights Reserved

Boxplots, Interquartile Range, and Outliers

Boxplots provide a visual representation of a data set that can be used to determine
whether the data set is symmetric or skewed. Constructing a boxplot requires calculating
the “5 number summary” and the interquartile range (IQR), and checking for any outliers.

5 Number Summary – The 5 number summary for a data set consists of the following five values, listed in
order from smallest to largest –

1. Minimum – The smallest value in the data set.
2. First Quartile – Separates the lowest 25% of the data in a set from the highest 75%. It is
typically denoted as Q1, where (25/100) ∙ (# points in data set) = position of Q1 in set.
3. Median – The middle value in a sorted (smallest to largest) data set. If there is an even
number of values, it is calculated by averaging the two middle values. The Median is also
referred to as the Second Quartile (Q2) because it separates the lower 50% of data in a set
from the upper 50%.
4. Third Quartile – Separates the lowest 75% of the data in a set from the highest 25%. It is
typically denoted as Q3, where (75/100) ∙ (# points in data set) = position of Q3 in set.
5. Maximum – The largest value in the data set.
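
The median rule in item 3 can be sketched in a few lines of Python. This is only an illustration: the function name `median` and the sample values are made up for the example and are not part of the handout.

```python
def median(values):
    """Middle value of the sorted data; average the two middle values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one middle value.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))     # odd count: the middle value, 5
print(median([7, 1, 5, 3]))  # even count: (3 + 5) / 2 = 4.0
```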

IQR - The Interquartile Range is a measure of spread used to calculate the lower and upper outlier
boundaries. These boundaries are then used to determine whether a data set has any actual outliers.

Interquartile Range (IQR) = Q3 − Q1

Lower Outlier Boundary = Q1 − 1.5 IQR

Upper Outlier Boundary = Q3 + 1.5 IQR
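
As a rough sketch, the three formulas above translate directly into Python. The function name `outlier_boundaries` and the quartile values 20 and 30 are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def outlier_boundaries(q1, q3):
    """Return (IQR, lower outlier boundary, upper outlier boundary)."""
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return iqr, lower, upper


# Hypothetical quartiles: Q1 = 20, Q3 = 30.
iqr, lower, upper = outlier_boundaries(20, 30)
print(iqr, lower, upper)  # 10 5.0 45.0
```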

Outliers - Outliers are data points that are considerably smaller or larger than most of the other values
in a data set. Data values that are smaller than the lower outlier boundary or larger than the upper outlier
boundary are outliers. Some data sets do not have any outliers. Outliers that are determined to be the
result of an error should be removed from the data set.
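
One way to apply this definition in code is to filter the data against the two boundaries. The sketch below assumes the boundaries have already been calculated; the function name `find_outliers`, the sample data, and the boundary values are all made up for illustration.

```python
def find_outliers(values, lower_boundary, upper_boundary):
    """Split off the values strictly below the lower or strictly above the upper boundary."""
    low = [v for v in values if v < lower_boundary]
    high = [v for v in values if v > upper_boundary]
    return low, high


# Hypothetical data set with hypothetical boundaries of 5.0 and 27.0.
sample = [2, 12, 14, 15, 17, 18, 40]
print(find_outliers(sample, 5.0, 27.0))  # ([2], [40])
```

A value that sits exactly on a boundary is not flagged, since the definition asks for values smaller than the lower boundary or larger than the upper one.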

Example – For the following data set (2012 data for MLB team payrolls in millions of dollars), find a) the 5 number
summary, b) the IQR, c) the upper and lower outlier boundaries, and d) any outliers. Note – data should be
sorted from lowest to highest if it is not provided that way. Sorting allows easy identification of the min,
max, median, and individual data positions within the set.

 #  Team       Payroll    #  Team       Payroll    #  Team       Payroll    #  Team       Payroll
 1  Padres          55    9  Rockies         78   17  Mets            93   25  Rangers        121
 2  Athletics       55   10  Indians         78   18  Twins           94   26  Tigers         132
 3  Astros          61   11  Nationals       81   19  Dodgers         95   27  Angels         154
 4  Royals          61   12  Orioles         81   20  W Sox           97   28  Red Sox        173
 5  Pirates         63   13  Mariners        82   21  Brewers         98   29  Phillies       175
 6  Rays            64   14  Reds            82   22  Cardinals      110   30  Yankees        198
 7  D Backs         74   15  Braves          83   23  Giants         118
 8  Blue Jays       75   16  Cubs            88   24  Marlins        118

a) 5 Number Summary – These values can be calculated by hand (shown below) OR they can be found
using the “1-Var Stats” button from the Stat Menu on a TI-83 or TI-84 calculator.

Minimum = 55
Position of Q1 = (25/100)(30) = 7.5 → 8th position, so Q1 = 75 (25 represents the 25th percentile; 30 is the # of data points in the set)
Median = (83 + 88)/2 = 85.5 (the average of the 2 middle data points in the set)
Position of Q3 = (75/100)(30) = 22.5 → 23rd position, so Q3 = 118 (75 represents the 75th percentile; 30 is the # of data points in the set)
Maximum = 198

If the “position” calculation results in a decimal, round up to the next whole number to determine the position.
If the calculation results in a whole number, average that position’s data value with the next data value to obtain the quartile.
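
The position rule can be sketched in Python as follows. This is a minimal illustration of the by-hand method described here, not the algorithm used by any calculator or library; the names `value_at_percent` and `five_number_summary` are hypothetical. Integer arithmetic (`divmod`) is used so that the decimal test does not depend on floating-point rounding.

```python
def value_at_percent(ordered, percent):
    """Apply the position rule to data that is already sorted (percent is 25, 50, or 75)."""
    whole, remainder = divmod(percent * len(ordered), 100)
    if remainder:
        # Decimal position: round up. Rounded-up position whole + 1 is index whole (0-based).
        return ordered[whole]
    # Whole-number position: average that position's value with the next value.
    return (ordered[whole - 1] + ordered[whole]) / 2


def five_number_summary(values):
    ordered = sorted(values)
    return (
        ordered[0],
        value_at_percent(ordered, 25),
        value_at_percent(ordered, 50),
        value_at_percent(ordered, 75),
        ordered[-1],
    )


# 2012 MLB team payrolls (in millions) from the table above.
payrolls = [55, 55, 61, 61, 63, 64, 74, 75, 78, 78, 81, 81, 82, 82, 83,
            88, 93, 94, 95, 97, 98, 110, 118, 118, 121, 132, 154, 173, 175, 198]

print(five_number_summary(payrolls))  # (55, 75, 85.5, 118, 198)
```

Applying the same rule at 50% reproduces the median: with an even count the position is a whole number, so the two middle values are averaged, and with an odd count the position ends in .5 and rounds up to the middle value.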

b) IQR – IQR = Q3 − Q1 = 118 − 75 = 43


c) Upper and Lower Outlier Boundaries –
Lower Outlier Boundary = Q1 − 1.5 IQR = 75 − 1.5 (43) = 10.5
Upper Outlier Boundary = Q3 + 1.5 IQR = 118 + 1.5 (43) = 182.5
d) Outliers – Lower Outliers: None (There are no individual data points smaller than the lower
boundary of 10.5.)

Upper Outliers: 198 (Yankees) (This data value is larger than the upper
boundary of 182.5.)

Constructing a Box Plot – Construct a Boxplot for the data set in the previous example. Determine
whether the data set is symmetric or skewed.

[Figure: boxplot titled “MLB Team Payrolls (in millions)” on an axis running from 55 to 195. The box spans Q1 = 75 to Q3 = 118, with the median marked at 85.5. The outlier, 198, is marked with an “x”.]

Draw the lower whisker out to the smallest data value that is larger than the lower boundary (55). Draw the upper whisker out to the largest data value that is smaller than the upper boundary (175). Mark outliers with an “x”.

This data set is skewed RIGHT.

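
The whisker endpoints for the payroll example can be checked with a short Python sketch. The quartiles and boundaries are copied from parts a) and c) of the example. The `skew_direction` helper is an assumption added for illustration: it uses a common rule of thumb (compare the two halves of the box) that the handout itself does not spell out.

```python
payrolls = [55, 55, 61, 61, 63, 64, 74, 75, 78, 78, 81, 81, 82, 82, 83,
            88, 93, 94, 95, 97, 98, 110, 118, 118, 121, 132, 154, 173, 175, 198]

q1, med, q3 = 75, 85.5, 118                   # from part a)
lower_boundary, upper_boundary = 10.5, 182.5  # from part c)

# Values inside the boundaries; a value exactly on a boundary is treated
# as a non-outlier here (an assumption for that edge case).
inside = [v for v in payrolls if lower_boundary <= v <= upper_boundary]
outliers = [v for v in payrolls if v < lower_boundary or v > upper_boundary]

# Each whisker stops at the most extreme value that is not an outlier.
whisker_low, whisker_high = min(inside), max(inside)


def skew_direction(q1, med, q3):
    """Rule of thumb (an assumption): the longer half of the box points toward the skew."""
    if q3 - med > med - q1:
        return "right"
    if q3 - med < med - q1:
        return "left"
    return "roughly symmetric"


print(whisker_low, whisker_high)     # 55 175
print(outliers)                      # [198]
print(skew_direction(q1, med, q3))   # right
```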
Try this on your own – Construct a boxplot for the following data set by finding the 5 number summary,
the IQR, the outlier boundaries, and any outliers (if they exist).

Data Set (sorted from smallest to largest, reading down each column)

 8.2    8.8    9.2   10.6   12.7
 8.4    9.0    9.7   11.6   14.0
 8.5    9.2   10.4   11.8   15.9
 8.8    9.2   10.5   12.6   16.1

Answers:
5 Number Summary –
Min = 8.2
Q1 = 8.9
Median = 10.05
Q3 = 12.2
Max = 16.1

IQR = 3.3
Lower Outlier Boundary = 3.95
Upper Outlier Boundary = 17.15
Outliers = None

Box Plot:
[Figure: boxplot on an axis running from 8 to 17. The whiskers end at 8.2 and 16.1, the box spans 8.9 to 12.2, and the median is marked at 10.05.]
