
Math 133 - Unit 10 Summary Statistics

Uploaded by

Astro Hajeer

Math 133 – Engineering Mathematics 1

Unit 10 – Summary Statistics

10.1 Frequency Distribution Table and its Histogram

Let us consider a set of grades from a sample of 150 Math 133 students:

MATH 133 Grades (Raw Data)

What do we see from the above result? All we can tell is that there are students who have grades of 83.8,
some scored 52.3, some 62.6, etc. The above presentation is just a clutter of numerical grades.

Just as our bedrooms can be filled with nothing but clutter, we need to organise this clutter so that we
can make some sense of the data we just collected. Therefore, we need to organise our data, just like we
need to organise the things in our bedrooms.

Organising Data into a Distribution

A distribution summarises the organisation of raw data into


• What values each group or class takes, and
• How often each class occurs (i.e. the frequency of each class).

We can summarise the results of the grades of students into a Frequency Distribution as follows. First of
all, we need to know the following:

How to draw a frequency distribution for Quantitative Data: Continuous Case


• This distribution is defined as a Grouped Frequency Distribution.
• Categories are defined as intervals of numbers called classes or groups.
• Classes (or groups) must not overlap.
• Classes (or groups) of equal width are preferred.
Step 1: Determine the minimum (Min) and the maximum (Max) values in the data set.

It is always a good idea to look through the grades and see what the minimum observation is and what
the maximum observation is in the data set.

In the above data set, we see that the minimum grade is

𝑀𝑖𝑛 = 34.6
and the maximum grade is
𝑀𝑎𝑥 = 97.5

Step 2: One good way to define the classes would be as follows:


• 30 to <40 – Grades that are from 30 to less than 40 are included in this class.
• 40 to <50 – Grades that are from 40 to less than 50 are included in this class.
• 50 to <60 – Grades that are from 50 to less than 60 are included in this class.
• 60 to <70 – Grades that are from 60 to less than 70 are included in this class.
• 70 to <80 – Grades that are from 70 to less than 80 are included in this class.
• 80 to <90 – Grades that are from 80 to less than 90 are included in this class.
• 90 to 100 – Grades that are from 90 to 100 are included in this class.
Each class has a width of 10 grades.
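
As a minimal sketch of how the classes in Step 2 can be applied, the following Python function assigns a grade to its class. The function name `grade_class` is only an illustration and is not part of the course material; it assumes grades lie between 30 and 100.

```python
def grade_class(grade: float) -> str:
    """Return the label of the 10-wide class that contains `grade`."""
    if not 30 <= grade <= 100:
        raise ValueError("grade lies outside the classes 30 to 100")
    # Lower limit of the class: 30, 40, ..., 90. A grade of exactly 100
    # is capped at 90 so that it falls in the closed last class.
    lower = min(int(grade // 10) * 10, 90)
    if lower == 90:
        return "90 to 100"
    return f"{lower} to <{lower + 10}"


# A few of the grades mentioned in the text, plus the two boundary cases.
for g in (34.6, 52.3, 62.6, 83.8, 97.5, 40, 100):
    print(g, "->", grade_class(g))
```

Because every class except the last excludes its upper limit, a grade of exactly 40 belongs to the class 40 to <50, so the classes do not overlap.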

Step 3: We generate the Frequency Distribution Table.

Class Limits   Tally                                                 Frequency   Relative Frequency   Cumulative Relative Frequency
30 – <40       /                                                     1           0.0067               0.0067
40 – <50       ///// ////                                            9           0.0600               0.0667
50 – <60       ///// ///// ///// ///// ///// ///// ////              34          0.2267               0.2933
60 – <70       ///// ///// ///// ///// ///// ////                    29          0.1933               0.4867
70 – <80       ///// ///// ///// ///// ///// ///// ///// ///// //    42          0.2800               0.7667
80 – <90       ///// ///// ///// ///// ///// ///                     28          0.1867               0.9533
90 – 100       ///// //                                              7           0.0467               1
Total                                                                150         1.0001               100%

Note: The total for the Relative Frequency should be exactly 1. Sometimes, we may be off slightly because
of rounding errors.
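
The relative and cumulative relative frequencies can be checked with a short Python sketch. It starts from the class frequencies in the table (the raw grades are not reproduced here), and the variable names are illustrative only.

```python
from itertools import accumulate

classes = ["30 - <40", "40 - <50", "50 - <60", "60 - <70",
           "70 - <80", "80 - <90", "90 - 100"]
frequencies = [1, 9, 34, 29, 42, 28, 7]

total = sum(frequencies)  # number of students

# Relative frequency = class frequency / total, rounded to 4 decimals.
relative = [round(f / total, 4) for f in frequencies]

# Cumulative relative frequency = running total of frequencies / total.
cumulative = [round(c / total, 4) for c in accumulate(frequencies)]

# Adding the rounded relative frequencies shows the rounding error.
rounded_total = round(sum(relative), 4)

for row in zip(classes, frequencies, relative, cumulative):
    print(*row, sep="\t")
print("Total", total, rounded_total, sep="\t")
```

The rounded relative frequencies add up to 1.0001 rather than 1, which is the rounding effect described in the note; the cumulative column is computed from the unrounded counts and therefore ends at exactly 1.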
Step 4: We draw the Histogram.

[Figure: histogram titled "Math 133 - Frequency vs Grades"; horizontal axis: Grades, with classes 30 - <40, 40 - <50, 50 - <60, 60 - <70, 70 - <80, 80 - <90 and 90 - 100; vertical axis: Frequency, from 0 to 45]

This is my favourite way of depicting the data. Its advantage is that the class intervals can be read directly from the graph, with no time wasted figuring out any “hidden” clues.

10.2 Shapes of Distributions:

Four basic distribution shapes:


1. Symmetric – Three types of symmetric shapes:
i. Bell-shaped (or Mound-shaped)
ii. U-shaped
iii. Uniform
2. Skewed Right
3. Skewed Left
4. Irregular
Examples:

• Symmetric bell-shaped
  [Figure: histogram; horizontal axis: Value of Variable (1 to 9); vertical axis: Frequency]

• Symmetric U-shaped
  [Figure: histogram; horizontal axis: Value of Variable (1 to 9); vertical axis: Frequency]

• Symmetric Uniform
  [Figure: histogram; horizontal axis: Value of the Variable (1 to 12); vertical axis: Frequency]

• Skewed Right – Tail to the right.
  [Figure: histogram; horizontal axis: Value of Variable (1 to 11); vertical axis: Frequency]

• Skewed Left – Tail to the left.
  [Figure: histogram; horizontal axis: Value of Variable (1 to 11); vertical axis: Frequency]

10.3 Outlier

• An outlier is an observation that lies outside the overall pattern of a distribution.


• A large gap in the distribution is typically a sign of an outlier.

Example:

Question: Are Alaska and Florida outliers because of human error during the collection of data?
Thought: Maybe not.
• Alaska is too cold so not many seniors may want to live there.
• Florida is warm so many seniors may want to live there.

What happens when you get outliers?


• If you have the resources to collect or verify the data again, do so. It is frowned upon to reject outliers
without verification.
• If not, use robust statistical methods, whose results are not affected much by outliers.

So if possible, go to Alaska and Florida and verify the percentage of seniors in these two states.
10.4 Dot Plots

A dot plot is a graph in which each observation is represented by a dot placed over its numerical value; a
value that is observed several times gets one dot for each occurrence.

Example: Thirty-nine participants took part in a fishing competition. Each participant is represented by a
dot placed over the number of fish caught by that participant.

[Figure: dot plot; horizontal axis: Number of fish caught]

We can see that the above distribution can be considered as right skewed since the tail is somewhat to
the right.

10.5 Describing Data with Numerical Measures

Symbols: Size of Population, N


Size of Sample, n

10.5.1 The Mean (or Average)

Symbols: Population mean, 𝜇


Sample mean, 𝑥̅

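Formula: $\text{Mean} = \dfrac{\text{Sum of observations}}{\text{Number of observations}}$

Let $x_i$ denote the $i$th observation in the data set. Thus,

• Population mean, $\mu = \dfrac{\sum x_i}{N} = \dfrac{x_1 + x_2 + x_3 + \cdots + x_N}{N}$

• Sample mean, $\bar{x} = \dfrac{\sum x_i}{n} = \dfrac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$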

Note: Sometimes, we just simplify the notation ∑ 𝑥𝑖 to simply ∑ 𝑥.


Example: We are given the heights in inches of a simple random sample of 25 women:
58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9, 63.1, 63.9, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7,
66.2, 66.7, 67.1, 67.8, 68.9 and 69.6

∴ The sample mean height of the 25 women is

$\bar{x} = \dfrac{\sum x}{n} = \dfrac{58.2 + 59.5 + 60.7 + \cdots + 69.6}{25} = \dfrac{1598.3}{25} = 63.932$ in
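
The same calculation can be sketched in Python as a quick check of the arithmetic. The variable names are illustrative only.

```python
# Heights in inches of the 25 women, as listed in the example.
heights = [58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
           63.1, 63.9, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
           66.7, 67.1, 67.8, 68.9, 69.6]

n = len(heights)           # number of observations
total = sum(heights)       # sum of observations
mean_height = total / n    # sample mean

print(n, round(total, 1), round(mean_height, 3))
```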

10.5.2 The Median

Symbol: M

Note: The first step in determining the median is to rearrange the data in ascending order, i.e. from
smallest to largest. The median is the middle-most value when the data values are arranged in ascending
order.

Location Formula: $\text{Position of median} = \dfrac{n+1}{2}$

Note: The above formula is NOT for the value of the median. It is for the position of the median.

Example 1: Let us consider the heights in inches of the 25 women in the previous example.
58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9, 63.1, 63.9, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7,
66.2, 66.7, 67.1, 67.8, 68.9 and 69.6
What is the median of this data set?

Solution:
To find the median, the first step is to rearrange the data in ascending order, i.e. from smallest to largest.

Position 1 2 3 4 5 6 7 8 9 10 11 12 13
Height 58.2 59.5 60.7 60.9 61.9 61.9 62.2 62.2 62.4 62.9 63.1 63.9 63.9

Position 14 15 16 17 18 19 20 21 22 23 24 25
Height 64.0 64.1 64.5 64.8 65.2 65.7 66.2 66.7 67.1 67.8 68.9 69.6

The sample size is 𝑛 = 25 observations.

∴ The median is located at the $\dfrac{n+1}{2} = \dfrac{25+1}{2} = \dfrac{26}{2} = 13$th position.

And the value at the 13th position, i.e. the median is 𝑀 = 63.9 inches.
Example 2: A student reported her 10 grades,
77, 86, 58, 67, 75, 77, 71, 65, 77 and 92
What is the median of this data set?

Solution:
The first step is to rearrange the data in ascending order:

Position 1 2 3 4 5 6 7 8 9 10
Grade 58 65 67 71 75 77 77 77 86 92

(The 5th and 6th positions hold the values 75 and 77.)

The sample size is 𝑛 = 10 observations.

∴ The median is located at the $\dfrac{n+1}{2} = \dfrac{10+1}{2} = \dfrac{11}{2} = 5\tfrac{1}{2}$th position.

Therefore, we take the value at the $5\tfrac{1}{2}$th position to be exactly halfway between the values of 75 and 77,
or the midpoint value of 75 and 77.

Thus, the median of the above data set is $M = \dfrac{75+77}{2} = \dfrac{152}{2} = 76$
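
The two cases above (a whole-number position and a half position) can be sketched in Python as follows. The function name `median_by_position` is illustrative; Python's standard library also offers `statistics.median`, which gives the same results for these data.

```python
def median_by_position(values):
    """Median via the location formula (n + 1) / 2 on the sorted data."""
    ordered = sorted(values)          # step 1: ascending order
    n = len(ordered)
    position = (n + 1) / 2            # 1-based position of the median
    lower = int(position)             # whole part of the position
    if position == lower:
        # n is odd: the median is the value at that position.
        return ordered[lower - 1]
    # n is even: the position ends in 1/2, so take the midpoint of the
    # two neighbouring values.
    return (ordered[lower - 1] + ordered[lower]) / 2


heights = [58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
           63.1, 63.9, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
           66.7, 67.1, 67.8, 68.9, 69.6]
grades = [77, 86, 58, 67, 75, 77, 71, 65, 77, 92]

print(median_by_position(heights))
print(median_by_position(grades))
```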

10.5.3 The Mode

Symbol: m

The mode of a variable is the value that occurs most often in the data set.
It is a good idea to rearrange the data in order so that we can look for values that occur most often.

Example 1: Let us look at the 10 grades shown in the previous example:


77, 86, 58, 67, 75, 77, 71, 65, 77 and 92
What is the mode of the data set?

Solution:
The first step is to rearrange the data in order:

Position 1 2 3 4 5 6 7 8 9 10
Grade 58 65 67 71 75 77 77 77 86 92

We can see that the value that occurs most often is 77, which appears three times (positions 6, 7 and 8).

Hence, the mode of the data set is 𝑚 = 77


It is possible to have more than one mode in some data sets.

Example 2: Let us consider the heights in inches of the 25 women in a previous example.
58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9, 63.1, 63.9, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7,
66.2, 66.7, 67.1, 67.8, 68.9 and 69.6
What is the mode of this data set?

Solution: The first step is to rearrange the data set in order:

Position 1 2 3 4 5 6 7 8 9 10 11 12 13
Height 58.2 59.5 60.7 60.9 61.9 61.9 62.2 62.2 62.4 62.9 63.1 63.9 63.9

Position 14 15 16 17 18 19 20 21 22 23 24 25
Height 64.0 64.1 64.5 64.8 65.2 65.7 66.2 66.7 67.1 67.8 68.9 69.6

We can see that three values, 61.9, 62.2 and 63.9, each occur twice, which is the highest number of
occurrences in this data set.

Therefore, there are three modes in this data set,


𝑚1 = 61.9, 𝑚2 = 62.2 and 𝑚3 = 63.9

We call these:
• Data with one mode = Unimodal data
• Data with two modes = Bimodal data
• Data with three modes = Trimodal data

If a data set has more than three values tied for the highest number of occurrences, the mode is no longer
a practical summary of the data, so it is not useful to report its modes.
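
Finding all modes amounts to counting occurrences and keeping the values with the highest count. The sketch below uses `collections.Counter` from the Python standard library; the function name `modes` is illustrative only.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the highest number of occurrences."""
    counts = Counter(values)              # value -> number of occurrences
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


grades = [77, 86, 58, 67, 75, 77, 71, 65, 77, 92]
heights = [58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
           63.1, 63.9, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
           66.7, 67.1, 67.8, 68.9, 69.6]

print(modes(grades))    # unimodal data
print(modes(heights))   # trimodal data
```

Note that if every value occurs only once, this function returns all of the values, which is the situation where reporting modes stops being useful.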

10.5.3.1 Impact of skewness on Unimodal data

1. Symmetric bell-shaped (mound-shaped) data


2. Right-skewed data

3. Left-skewed data

Notice from the above three diagrams that:


• The mean is most affected by skewness and outliers, i.e. the mean is pulled in the direction of the
tail or the direction of the skewness.
• The median is somewhat affected by skewness and outliers.
• The mode is unaffected by skewness and outliers.
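
A small numerical illustration of these three points, using the student's ten grades from the earlier examples. The extra value of 500 is a made-up extreme observation added purely for illustration; it is not part of the original data.

```python
import statistics

grades = [58, 65, 67, 71, 75, 77, 77, 77, 86, 92]
with_outlier = grades + [500]   # hypothetical extreme value, for illustration

for label, data in (("original", grades), ("with outlier", with_outlier)):
    print(label,
          "mean =", round(statistics.mean(data), 2),
          "median =", statistics.median(data),
          "mode =", statistics.mode(data))
```

In this example the mean is pulled strongly towards the extreme value, the median moves only from 76 to 77, and the mode stays at 77.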
10.6 Measures of Spread or Dispersion

Given a data set, we usually want to know its mean and a measure of the spread.

[Figure: two distributions, each with its mean marked]

• First distribution: data is not so spread out; it is quite concentrated near the mean, i.e. variability is small.
• Second distribution: data is quite spread out; it is not so concentrated near the mean, i.e. variability is large.

There are a few ways to define the spread of which perhaps the most popular definition is the Standard
Deviation. Before we can define the Standard Deviation, we need to calculate the Variance.

10.6.1 The Variance

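Symbol: For population, $\sigma^2$

For sample, $s^2$

Definition: For population variance, $\sigma^2 = \dfrac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$

For sample variance, $s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Formulas: For population variance, $\sigma^2 = \dfrac{1}{N} \left( \sum x_i^2 - \dfrac{1}{N} \left( \sum x_i \right)^2 \right)$

For sample variance, $s^2 = \dfrac{1}{n-1} \left( \sum x_i^2 - \dfrac{1}{n} \left( \sum x_i \right)^2 \right)$

What is the difference between $\sum_{i=1}^{n} x_i^2$ and $\left( \sum_{i=1}^{n} x_i \right)^2$?

• $\sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + x_3^2 + \cdots + x_n^2$. We square each value before adding the results.

• $\left( \sum_{i=1}^{n} x_i \right)^2 = (x_1 + x_2 + x_3 + \cdots + x_n)^2$. We add the values, then we square the result.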
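
The notes state the definition and the computational formula without connecting them. The short derivation below, for the sample case, is one way to see that they agree; it uses only $\bar{x} = \frac{1}{n}\sum x_i$, and the population case follows in the same way with $N$ and $\mu$.

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - \frac{2}{n} \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2
     + \frac{1}{n} \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2 \\
  &= \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2
\end{align*}
```

Dividing both sides by $n-1$ gives the computational formula for $s^2$.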


Example: Let us look at the sample of the 10 grades reported by the student in a previous
example:
58, 65, 67, 71, 75, 77, 77, 77, 86 and 92
What is the variance of this data set?

Solution: As before, it is always a good idea to do our calculation in a table.

Since we also require the sum $\sum_{i=1}^{n} x_i^2$, we need the square of each $x_i$ value. So we generate another
row, $x_i^2$.

i        1     2     3     4     5     6     7     8     9     10    Sum, ∑
x_i      58    65    67    71    75    77    77    77    86    92    745
x_i^2    3364  4225  4489  5041  5625  5929  5929  5929  7396  8464  56391

Thus, the variance of the sample of the student’s ten grades is

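$s^2 = \dfrac{1}{n-1} \left( \sum x_i^2 - \dfrac{1}{n} \left( \sum x_i \right)^2 \right)$

$= \dfrac{1}{10-1} \left( 56391 - \dfrac{1}{10} (745)^2 \right)$

$= \dfrac{1}{9} \left( 56391 - \dfrac{1}{10} (555025) \right)$

$= \dfrac{1}{9} (56391 - 55502.5) = \dfrac{888.5}{9} \approx 98.722222$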
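
As a check, the sketch below computes the sample variance in Python in two ways: with the computational formula used above and directly from the definition. The variable names are illustrative; `statistics.variance` from the standard library also computes the sample variance.

```python
import statistics

grades = [58, 65, 67, 71, 75, 77, 77, 77, 86, 92]

n = len(grades)
sum_x = sum(grades)                    # sum of the values
sum_x2 = sum(x * x for x in grades)    # sum of the squared values

# Computational formula: (sum of squares - (sum)^2 / n) / (n - 1)
s2_formula = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Definition: sum of squared deviations from the mean, divided by n - 1
mean = sum_x / n
s2_definition = sum((x - mean) ** 2 for x in grades) / (n - 1)

print(sum_x, sum_x2)
print(round(s2_formula, 6), round(s2_definition, 6))
print(round(statistics.variance(grades), 6))
```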

10.6.2 The Standard Deviation

Symbol: For population, 𝜎


For sample, 𝑠

Definition: The standard deviation is the square root of the variance.

Formulas: For population std. dev., 𝜎 = √𝜎 2


For sample std. dev., 𝑠 = √𝑠 2

Example: Let us look at the sample of the 10 grades reported by the student:
58, 65, 67, 71, 75, 77, 77, 77, 86 and 92
What is the standard deviation of this data set?

Solution: From the previous example, we calculated the variance of the data set to be

$s^2 = \dfrac{888.5}{9} \approx 98.722222$

∴ The std. dev. of the data set is $s = \sqrt{s^2} = \sqrt{\dfrac{888.5}{9}} \approx 9.935906$
10.6.3 Large Populations or Large Samples

For large populations or large samples where values are usually repeated, it is often necessary to tabulate
the values and their corresponding frequencies instead of writing a long list of values. Suppose 𝑥1 appears
𝑓1 number of times, 𝑥2 appears 𝑓2 number of times, 𝑥3 appears 𝑓3 number of times, … and so on as in the
following table,

Value, 𝑥𝑖 𝑥1 𝑥2 𝑥3 𝑥4 ⋯ ⋯ Sum, Σ

Frequency, 𝑓𝑖 𝑓1 𝑓2 𝑓3 𝑓4 ⋯ ⋯ ∑ 𝑓𝑖

If the data is from a population, we have

Population size, 𝑁 = ∑ 𝑓𝑖

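Mean, $\mu = \dfrac{\sum x}{N} = \dfrac{\sum (x_i f_i)}{\sum f_i} = \dfrac{\sum (x_i f_i)}{N}$

Variance, $\sigma^2 = \dfrac{\sum x^2 - \dfrac{(\sum x)^2}{N}}{N} = \dfrac{\sum (x_i^2 f_i) - \dfrac{\left( \sum (x_i f_i) \right)^2}{\sum f_i}}{\sum f_i} = \dfrac{\sum (x_i^2 f_i) - \dfrac{\left( \sum (x_i f_i) \right)^2}{N}}{N}$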

If the data is from a sample, we have

Sample size, 𝑛 = ∑ 𝑓𝑖

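Mean, $\bar{x} = \dfrac{\sum x}{n} = \dfrac{\sum (x_i f_i)}{\sum f_i} = \dfrac{\sum (x_i f_i)}{n}$

Variance, $s^2 = \dfrac{\sum x^2 - \dfrac{(\sum x)^2}{n}}{n-1} = \dfrac{\sum (x_i^2 f_i) - \dfrac{\left( \sum (x_i f_i) \right)^2}{\sum f_i}}{\sum f_i - 1} = \dfrac{\sum (x_i^2 f_i) - \dfrac{\left( \sum (x_i f_i) \right)^2}{n}}{n-1}$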

Therefore, we need to expand the table as follows:

Value, 𝑥𝑖 𝑥1 𝑥2 𝑥3 𝑥4 ⋯ ⋯ Sum, Σ

Frequency, 𝑓𝑖 𝑓1 𝑓2 𝑓3 𝑓4 ⋯ ⋯ ∑ 𝑓𝑖

𝑥𝑖 𝑓𝑖 𝑥1 𝑓1 𝑥2 𝑓2 𝑥3 𝑓3 𝑥4 𝑓4 ⋯ ⋯ ∑(𝑥𝑖 𝑓𝑖 )

𝑥𝑖 2 𝑓𝑖 𝑥1 2 𝑓1 𝑥2 2 𝑓2 𝑥3 2 𝑓3 𝑥4 2 𝑓4 ⋯ ⋯ ∑(𝑥𝑖 2 𝑓𝑖 )
As usual, the standard deviation is the square root of the variance,

Population standard deviation, 𝜎 = √𝜎 2


and
Sample standard deviation, 𝑠 = √𝑠 2

Example: The owner of a DVD rental shop meticulously kept a record of how many DVDs were
rented by customers who entered his shop from the day he opened his business until the day he closed.
The following table shows the number of DVDs, 𝑥𝑖 , rented by the corresponding number of customers, 𝑓𝑖 ,
out of every 100 customers who entered his shop.

No. of DVDs rented, 𝑥𝑖 0 1 2 3 4 5


No. of customers, 𝑓𝑖 6 58 22 10 3 1

Find the average number of DVDs rented by each customer and the standard deviation of the number of
DVDs rented by his customers.

Solution: We expand the above table as follows:

Value, 𝑥𝑖 0 1 2 3 4 5 Sum, Σ

Frequency, 𝑓𝑖 6 58 22 10 3 1 100

𝑥𝑖 𝑓𝑖 0 58 44 30 12 5 149

𝑥𝑖 2 𝑓𝑖 0 58 88 90 48 25 309
The size of this population being studied is
𝑁 = ∑ 𝑓𝑖 = 100
The total number of DVDs rented is
∑ 𝑥 = ∑ 𝑥𝑖 𝑓𝑖 = 149

∴ The average number of DVDs rented per customer is

$\mu = \dfrac{\sum x_i f_i}{N} = \dfrac{149}{100} = 1.49$

The variance of the number of DVDs rented by his customers is

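$\sigma^2 = \dfrac{\sum (x_i^2 f_i) - \dfrac{\left( \sum (x_i f_i) \right)^2}{N}}{N} = \dfrac{309 - \dfrac{149^2}{100}}{100} = \dfrac{309 - \dfrac{22201}{100}}{100} = \dfrac{309 - 222.01}{100} = \dfrac{86.99}{100} = 0.8699$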

∴ The std dev of the number of DVDs rented by his customers is

𝜎 = √𝜎 2 = √0.8699 ≈ 0.932684
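
The expanded table and the population formulas can be sketched in Python as follows. The variable names are illustrative, and the data are treated as a population, as in the worked solution.

```python
import math

dvds = [0, 1, 2, 3, 4, 5]           # values x_i
customers = [6, 58, 22, 10, 3, 1]   # frequencies f_i

N = sum(customers)                                          # sum of f_i
sum_xf = sum(x * f for x, f in zip(dvds, customers))        # sum of x_i * f_i
sum_x2f = sum(x * x * f for x, f in zip(dvds, customers))   # sum of x_i^2 * f_i

mu = sum_xf / N                               # population mean
sigma2 = (sum_x2f - sum_xf ** 2 / N) / N      # population variance
sigma = math.sqrt(sigma2)                     # population standard deviation

print(N, sum_xf, sum_x2f)
print(round(mu, 2), round(sigma2, 4), round(sigma, 6))
```

For sample data, the only change would be to divide by `N - 1` instead of `N` in the variance line.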
10.7 Percentiles
- Measures of relative standing.

The 𝒑𝒕𝒉 percentile of a set of data is a value 𝒙𝒑 with 𝒑% of the data values less than it and (𝟏𝟎𝟎 − 𝒑)%
of the data values greater than it.

Example: If 80% of students have marks lower than yours and 20% have marks higher than yours,
then your mark is at the 80th percentile.

10.7.1 Quartiles

Quartiles divide a data set, ordered from smallest to largest, into four groups, each containing as close to
25% of the data as possible.

Graphical example:

25% 25% 25% 25%

𝑄1 𝑄2 = 𝑀 𝑄3

We can now define the following quartiles:

• 𝑄1 = 1st (Lower) Quartile = 25𝑡ℎ percentile = 𝑥25

• 𝑄2 = 2nd Quartile = 50𝑡ℎ percentile = 𝑥50 . This is also the Median, M.

• 𝑄3 = 3rd (Upper) Quartile = 75𝑡ℎ percentile = 𝑥75


Location Formulas: Recall that n is the number of observations in the data set.

• $\text{Position of } Q_1 = \dfrac{n+1}{4}$

• $\text{Position of } Q_2 \text{ (or } M) = \dfrac{2(n+1)}{4} = \dfrac{n+1}{2}$

• $\text{Position of } Q_3 = \dfrac{3(n+1)}{4}$

Example: Let us consider the following data set of the number of years that 24 HIV patients lived
before succumbing to AIDS: 0.6, 1.6, 2.1, 2.3, 2.9, 3.4, 3.7, 4.1, 4.5, 5.3, 1.2, 1.9, 2.3, 2.5, 3.3, 3.6, 3.8, 4.2,
4.7, 7.4, 1.5, 2.8, 3.9 and 4.9.
What are the quartiles of the above data set?

Solution: The data set must be rearranged in order from smallest to largest before we proceed to
determine the positions of the quartiles:

Position 1 2 3 4 5 6 7 8 9 10 11 12
Years 0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3

Position 13 14 15 16 17 18 19 20 21 22 23 24
Years 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 7.4

The $6\tfrac{1}{4}$th position lies between the 6th and 7th positions, and the $18\tfrac{3}{4}$th position lies between the
18th and 19th positions.

There are altogether 𝑛 = 24 observations.

• The location of $Q_1$ is at the $\dfrac{n+1}{4} = \dfrac{24+1}{4} = \dfrac{25}{4} = 6\tfrac{1}{4}$th position.

Since $Q_1$ is the value at the $6\tfrac{1}{4}$th position, we can calculate that

$Q_1 = 2.1 + \dfrac{1}{4}(2.3 - 2.1) = 2.1 + \dfrac{1}{4}(0.2) = 2.1 + 0.05 = 2.15$

• The location of $Q_2$ or $M$ is at the $\dfrac{n+1}{2} = \dfrac{24+1}{2} = \dfrac{25}{2} = 12\tfrac{1}{2}$th position.

Since $Q_2$ or $M$ is the value at the $12\tfrac{1}{2}$th position, we can calculate that

$Q_2 = M = 3.3 + \dfrac{1}{2}(3.4 - 3.3) = 3.3 + \dfrac{1}{2}(0.1) = 3.3 + 0.05 = 3.35$

• The location of $Q_3$ is at the $\dfrac{3(n+1)}{4} = \dfrac{3(24+1)}{4} = \dfrac{(3)(25)}{4} = 18\tfrac{3}{4}$th position.

Since $Q_3$ is the value at the $18\tfrac{3}{4}$th position, we can calculate that

$Q_3 = 4.1 + \dfrac{3}{4}(4.2 - 4.1) = 4.1 + \dfrac{3}{4}(0.1) = 4.1 + 0.075 = 4.175$
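
The location formulas and the interpolation step can be sketched in Python as below. The function names are illustrative only, and the sketch assumes the data set has at least three observations so that every position falls inside the data. Other textbooks and software packages may use slightly different quartile conventions and therefore give slightly different values.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    whole = int(position)            # e.g. 6 for position 6.25
    fraction = position - whole      # e.g. 0.25 for position 6.25
    if fraction == 0:
        return ordered[whole - 1]
    below, above = ordered[whole - 1], ordered[whole]
    # Move the given fraction of the way from the lower to the upper value.
    return below + fraction * (above - below)


def quartiles(values):
    """Return (Q1, Q2, Q3) using the positions k(n + 1)/4 for k = 1, 2, 3."""
    ordered = sorted(values)
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


years = [0.6, 1.6, 2.1, 2.3, 2.9, 3.4, 3.7, 4.1, 4.5, 5.3, 1.2, 1.9,
         2.3, 2.5, 3.3, 3.6, 3.8, 4.2, 4.7, 7.4, 1.5, 2.8, 3.9, 4.9]

q1, q2, q3 = quartiles(years)
print(round(q1, 3), round(q2, 3), round(q3, 3))
```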

10.7.2 Five-Number Summary

The 5-number summary of a data set consists of

𝑀𝑖𝑛 𝑄1 𝑀 𝑄3 𝑀𝑎𝑥

Where 𝑀𝑖𝑛 = the smallest observation in the data set,


𝑄1 = the 1st quartile of the data set,
𝑀 = the median of the data set,
𝑄3 = the 3rd quartile of the data set, and
𝑀𝑎𝑥 = the largest observation in the data set.

10.7.2.1 Interquartile Range

Symbol: 𝐼𝑄𝑅

Formula: 𝐼𝑄𝑅 = 𝑄3 − 𝑄1
10.7.2.2 Lower Fence and Upper Fence

Symbols: Lower Fence, 𝐿𝐹


Upper Fence, 𝑈𝐹

Formulas: 𝐿𝐹 = 𝑄1 − 1.5 × 𝐼𝑄𝑅

𝑈𝐹 = 𝑄3 + 1.5 × 𝐼𝑄𝑅

Example: Let us consider again the data in the previous example:

Position 1 2 3 4 5 6 7 8 9 10 11 12
Years 0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3

Position 13 14 15 16 17 18 19 20 21 22 23 24
Years 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 7.4

We have already calculated that 𝑄1 = 2.15, 𝑀 = 3.35 and 𝑄3 = 4.175,


and we note that the smallest observation is 0.6 years and the largest observation is 7.4 years.

Therefore, we can write the 5-number summary for the above data set as

𝑀𝑖𝑛 = 0.6 𝑄1 = 2.15 𝑀 = 3.35 𝑄3 = 4.175 𝑀𝑎𝑥 = 7.4

The Interquartile Range is

𝐼𝑄𝑅 = 𝑄3 − 𝑄1 = 4.175 − 2.15 = 2.025

The Lower Fence is

𝐿𝐹 = 𝑄1 − 1.5 × 𝐼𝑄𝑅 = 2.15 − 1.5 × 2.025 = −0.8875

The Upper Fence is

𝑈𝐹 = 𝑄3 + 1.5 × 𝐼𝑄𝑅 = 4.175 + 1.5 × 2.025 = 7.2125
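
The interquartile range and the two fences follow directly from the quartiles. A minimal Python sketch, starting from the quartile values already calculated above (the function name `fences` is illustrative only):

```python
def fences(q1, q3):
    """Return (IQR, lower fence, upper fence) for the given quartiles."""
    iqr = q3 - q1
    return iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr


iqr, lower_fence, upper_fence = fences(2.15, 4.175)
print(round(iqr, 4), round(lower_fence, 4), round(upper_fence, 4))
```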


10.7.3 Boxplot (or Box-and-Whiskers Plot)

This is a graphical presentation of the 5-number summary, and it identifies outliers if any exist in the data
set. The following are the steps to construct a boxplot:
1. Obtain the 5-Number Summary, the Interquartile Range, the Lower Fence and the Upper
Fence.
2. Draw and label a horizontal line (axis) to represent the scale of measurement, and locate Q1, M
and Q3 on it.
3. Draw a box between Q1 and Q3 above the axis and a line within the box indicating M.
4. Draw a dotted line to indicate the Lower Fence (LF) and another dotted line to indicate the Upper
Fence (UF).
5. Draw whiskers from the left side of the box to the first observation above the LF and from the right side
of the box to the first observation below the UF, i.e. to the smallest and the largest observations that lie
within the fences.
6. Observations to the left of the LF and to the right of the UF are outliers and are marked with
asterisks.

[Figure: generic boxplot above a labelled axis x, marking from left to right: outliers (∗) including Min, LF, first data above LF, 𝑄1, M, 𝑄3, first data below UF, UF, and outliers (∗) including Max]

Example: Let us draw a boxplot based on the data in the previous example:

Position 1 2 3 4 5 6 7 8 9 10 11 12
Years 0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3

Position 13 14 15 16 17 18 19 20 21 22 23 24
Years 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 7.4

We have already obtained the 5-number summary, i.e.

𝑀𝑖𝑛 = 0.6 𝑄1 = 2.15 𝑀 = 3.35 𝑄3 = 4.175 𝑀𝑎𝑥 = 7.4

As well as the Lower Fence and the Upper Fence, i.e.

𝐿𝐹 = −0.8875 and 𝑈𝐹 = 7.2125


The boxplot based on the given data set is as follows:

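[Figure: boxplot above an axis t running from −2 to 8, labelled "No. of years a HIV patient survives", marking 𝐿𝐹 = −0.8875, 𝑀𝑖𝑛 = 0.6, 𝑄1 = 2.15, 𝑀 = 3.35, 𝑄3 = 4.175, 𝑈𝐹 = 7.2125 and 𝑀𝑎𝑥 = 7.4. The first data below 𝑈𝐹 is 5.3, and 𝑀𝑎𝑥 = 7.4 is marked with ∗ as an outlier.]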
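
The whisker ends and the outliers for this boxplot can be found with a short Python sketch, using the fences calculated earlier. The variable names are illustrative only, and no plotting is done here.

```python
years = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3,
         3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 7.4]

lower_fence, upper_fence = -0.8875, 7.2125

# Observations within the fences determine where the whiskers end.
inside = [y for y in years if lower_fence <= y <= upper_fence]
left_whisker_end = min(inside)     # first data above LF
right_whisker_end = max(inside)    # first data below UF

# Observations beyond either fence are outliers.
outliers = [y for y in years if y < lower_fence or y > upper_fence]

print(left_whisker_end, right_whisker_end, outliers)
```

Here the left whisker ends at the minimum, 0.6, because no observation lies below the Lower Fence, while the right whisker ends at 5.3 and the single value 7.4 is flagged as an outlier.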
