Math 133 - Unit 10 Summary Statistics
Let us consider a set of grades from a sample of 150 Math 133 students:
What do we see from the above result? All we can tell is that some students scored 83.8, some scored
52.3, some 62.6, etc. Presented this way, the data is just a clutter of numerical grades.
Just as we need to organise the things that clutter our bedrooms, we need to organise this data so that
we can make some sense of it.
We can summarise the results of the grades of students into a Frequency Distribution as follows. First of
all, we need to know the following:
It is always a good idea to look through the grades and see what the minimum observation is and what
the maximum observation is in the data set.
The minimum grade is
𝑀𝑖𝑛 = 34.6
and the maximum grade is
𝑀𝑎𝑥 = 97.5
Class Limits  Tally                                               Frequency  Relative Frequency  Cumulative Relative Frequency
30 – <40      /                                                   1          0.0067              0.0067
40 – <50      ///// ////                                          9          0.0600              0.0667
50 – <60      ///// ///// ///// ///// ///// ///// ////            34         0.2267              0.2933
60 – <70      ///// ///// ///// ///// ///// ////                  29         0.1933              0.4867
70 – <80      ///// ///// ///// ///// ///// ///// ///// ///// //  42         0.2800              0.7667
80 – <90      ///// ///// ///// ///// ///// ///                   28         0.1867              0.9533
90 – 100      ///// //                                            7          0.0467              1
Total                                                             150        1.0001              100%
Note: The total for the Relative Frequency should be exactly 1. Sometimes, we may be off slightly because
of rounding errors.
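The relative and cumulative relative frequency columns can be checked with a short calculation. The following is a minimal Python sketch that assumes only the class frequencies from the table (the 150 raw grades are not reproduced here); the variable names are illustrative.

```python
# Class labels and frequencies copied from the frequency distribution table.
classes = ["30 - <40", "40 - <50", "50 - <60", "60 - <70",
           "70 - <80", "80 - <90", "90 - 100"]
frequencies = [1, 9, 34, 29, 42, 28, 7]

total = sum(frequencies)

# Relative frequency = class frequency / total number of observations.
relative = [f / total for f in frequencies]

# Cumulative relative frequency = running count of observations / total.
cumulative = []
running = 0
for f in frequencies:
    running += f
    cumulative.append(running / total)

for label, f, r, c in zip(classes, frequencies, relative, cumulative):
    print(f"{label:<10}{f:>5}{r:>10.4f}{c:>10.4f}")

# Summing the unrounded values avoids the drift caused by rounding each row.
print("Sum of relative frequencies:", round(sum(relative), 4))
```

Working with the unrounded values and rounding only when printing is one way to keep the total at exactly 1; adding the seven rounded values instead reproduces the 1.0001 seen in the Total row.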
Step 4: We draw the Histogram.
[Figure: histogram of the grades. Horizontal axis: Grades, with classes 30 – <40, 40 – <50, 50 – <60, 60 – <70, 70 – <80, 80 – <90 and 90 – 100.]
This is my favourite way of depicting the data. Its advantages are that the class intervals can be read
directly from the graph and that no time is wasted working out any “hidden” clues.
• Symmetric bell-shaped
[Figure: histogram of Frequency against Value of Variable]
• Symmetric U-shaped
[Figure: histogram of Frequency against Value of Variable]
• Symmetric Uniform
[Figure: histogram with Frequency on the vertical axis]
[Figures: two further histograms of Frequency against Value of Variable]
10.3 Outlier
Example:
Question: Are Alaska and Florida outliers because of human error during the collection of data?
Thought: Maybe not.
• Alaska is too cold so not many seniors may want to live there.
• Florida is warm so many seniors may want to live there.
So if possible, go to Alaska and Florida and verify the percentage of seniors in these two states.
10.4 Dot Plots
A dot plot is a graph in which each observation is represented by a dot placed over its numerical value,
with one dot for each time that value is observed.
Example: Thirty-nine participants took part in a fishing competition. Each participant is represented by a
dot placed over the number of fish caught by that participant.
We can see that the above distribution can be considered right-skewed since its tail extends somewhat to
the right.
Formula: $\text{Mean} = \dfrac{\text{Sum of observations}}{\text{Number of observations}}$
$\bar{x} = \dfrac{\sum x}{n} = \dfrac{58.2 + 59.5 + 60.7 + \cdots + 69.6}{25} = \dfrac{1598.3}{25} = 63.932 \text{ in}$
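As a quick check of the arithmetic, the mean can be computed directly. This is a minimal Python sketch using the 25 heights listed in the median example of these notes; the variable names are illustrative.

```python
# Heights in inches of the 25 women, as listed in the notes.
heights = [58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
           63.1, 63.9, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
           66.7, 67.1, 67.8, 68.9, 69.6]

n = len(heights)        # number of observations
total = sum(heights)    # sum of observations
mean = total / n        # mean = sum of observations / number of observations

print(f"n = {n}, sum = {total:.1f}, mean = {mean:.3f} in")
```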
Symbol: M
Note: The first step in determining the median is to rearrange the data in ascending order, i.e. from
smallest to largest. The median is the middle-most value when the data values are arranged in ascending
order.
Location Formula: $\text{Position of median} = \dfrac{n+1}{2}$
Note: The above formula is NOT for the value of the median. It is for the position of the median.
Example 1: Let us consider the heights in inches of the 25 women in the previous example.
58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9, 63.1, 63.9, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7,
66.2, 66.7, 67.1, 67.8, 68.9 and 69.6
What is the median of this data set?
Solution:
To find the median, the first step is to rearrange the data in ascending order, i.e. from smallest to largest.
Position 1 2 3 4 5 6 7 8 9 10 11 12 13
Height 58.2 59.5 60.7 60.9 61.9 61.9 62.2 62.2 62.4 62.9 63.1 63.9 63.9
Position 14 15 16 17 18 19 20 21 22 23 24 25
Height 64.0 64.1 64.5 64.8 65.2 65.7 66.2 66.7 67.1 67.8 68.9 69.6
∴ The median is located at the $\dfrac{n+1}{2} = \dfrac{25+1}{2} = \dfrac{26}{2} = 13$th position.
And the value at the 13th position, i.e. the median is 𝑀 = 63.9 inches.
Example 2: A student reported her 10 grades,
77, 86, 58, 67, 75, 77, 71, 65, 77 and 92
What is the median of this data set?
Solution:
The first step is to rearrange the data in ascending order:
Position 1 2 3 4 5 6 7 8 9 10
Grade 58 65 67 71 75 77 77 77 86 92
∴ The median is located at the $\dfrac{n+1}{2} = \dfrac{10+1}{2} = \dfrac{11}{2} = 5\tfrac{1}{2}$th position.
Therefore, we take the value at the $5\tfrac{1}{2}$th position to be exactly halfway between the values of 75 and 77,
or the midpoint value of 75 and 77.
Thus, the median of the above data set is $M = \dfrac{75+77}{2} = \dfrac{152}{2} = 76$
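The position rule used in both examples can be sketched in Python. The function name `median_by_position` is illustrative, not part of the notes; it sorts the data, finds position (n + 1)/2, and averages the two neighbouring values when that position falls halfway between two observations.

```python
def median_by_position(values):
    """Median via the position formula (n + 1) / 2 on the sorted data."""
    ordered = sorted(values)            # step 1: ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    position = (n + 1) / 2              # 1-based position, may end in .5
    whole = int(position)
    lower = ordered[whole - 1]          # value at the whole-numbered position
    if position == whole:
        return lower                    # odd n: a single middle value
    upper = ordered[whole]              # even n: next value up
    return (lower + upper) / 2          # midpoint of the two middle values


heights = [58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
           63.1, 63.9, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
           66.7, 67.1, 67.8, 68.9, 69.6]
grades = [77, 86, 58, 67, 75, 77, 71, 65, 77, 92]

print(median_by_position(heights))  # value at the 13th position
print(median_by_position(grades))   # midpoint of the 5th and 6th values
```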
Symbol: m
The mode of a variable is the value that occurs most often in the data set.
It is a good idea to rearrange the data in order so that we can look for values that occur most often.
Solution:
The first step is to rearrange the data in order:
Position 1 2 3 4 5 6 7 8 9 10
Grade 58 65 67 71 75 77 77 77 86 92
We can see that the value that occurs most often is 77.
Example 2: Let us consider the heights in inches of the 25 women in a previous example.
58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9, 63.1, 63.9, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7,
66.2, 66.7, 67.1, 67.8, 68.9 and 69.6
What is the mode of this data set?
Position 1 2 3 4 5 6 7 8 9 10 11 12 13
Height 58.2 59.5 60.7 60.9 61.9 61.9 62.2 62.2 62.4 62.9 63.1 63.9 63.9
We can see that there are three values that occur twice (the highest number of occurrences in this
data set), i.e. 61.9, 62.2 and 63.9.
We call these:
• Data with one mode = Unimodal data
• Data with two modes = Bimodal data
• Data with three modes = Trimodal data
If a data set has more than three values that occur most often, the definition of a mode becomes
impractical. Therefore, it is no longer useful to report its modes.
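Finding the mode or modes amounts to counting how often each value occurs. Below is a minimal Python sketch using `collections.Counter` from the standard library; the function name `modes` is illustrative.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the highest number of occurrences."""
    counts = Counter(values)            # value -> number of occurrences
    highest = max(counts.values())      # raises ValueError for empty data
    return sorted(v for v, c in counts.items() if c == highest)


grades = [77, 86, 58, 67, 75, 77, 71, 65, 77, 92]
heights = [58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
           63.1, 63.9, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
           66.7, 67.1, 67.8, 68.9, 69.6]

print(modes(grades))    # one mode: unimodal data
print(modes(heights))   # three modes: trimodal data
```

Returning a list rather than a single value keeps the unimodal, bimodal and trimodal cases in one function; the length of the list tells us which case applies.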
3. Left-skewed data
Given a data set, we usually want to know its mean and a measure of the spread.
[Figure, left: distribution with the Mean marked] Data is not so spread out, it is quite concentrated near the mean, i.e. variability is small.
[Figure, right: distribution with the Mean marked] Data is quite spread out, it is not so concentrated near the mean, i.e. variability is large.
There are a few ways to define the spread, of which perhaps the most popular is the Standard
Deviation. Before we can define the Standard Deviation, we need to calculate the Variance.
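Definition: For population variance, $\sigma^2 = \dfrac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
For sample variance, $s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Formulas: For population variance, $\sigma^2 = \dfrac{1}{N} \left( \sum x_i^2 - \dfrac{1}{N} \left( \sum x_i \right)^2 \right)$
For sample variance, $s^2 = \dfrac{1}{n-1} \left( \sum x_i^2 - \dfrac{1}{n} \left( \sum x_i \right)^2 \right)$
Since we also require the sum $\sum_{i=1}^{n} x_i^2$, we need the square of each $x_i$ value. So we generate another
row, $x_i^2$.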
𝒊 1 2 3 4 5 6 7 8 9 10 Sum, ∑
𝒙𝒊 58 65 67 71 75 77 77 77 86 92 745
𝒙𝒊 𝟐 3364 4225 4489 5041 5625 5929 5929 5929 7396 8464 56391
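$s^2 = \dfrac{1}{n-1} \left( \sum x_i^2 - \dfrac{1}{n} \left( \sum x_i \right)^2 \right)$
$= \dfrac{1}{10-1} \left( 56391 - \dfrac{1}{10} (745)^2 \right)$
$= \dfrac{1}{9} \left( 56391 - \dfrac{1}{10} (555025) \right)$
$= \dfrac{1}{9} (56391 - 55502.5) = \dfrac{888.5}{9}$
$\approx 98.722222$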
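The same calculation can be sketched in Python. The snippet below evaluates the sample variance for the 10 grades twice, once with the computational formula used above and once with the definition, as a check that the two agree; the variable names are illustrative.

```python
grades = [58, 65, 67, 71, 75, 77, 77, 77, 86, 92]

n = len(grades)
sum_x = sum(grades)                     # sum of the x_i row
sum_x2 = sum(x * x for x in grades)     # sum of the x_i squared row

# Computational formula: s^2 = (sum x^2 - (sum x)^2 / n) / (n - 1)
s2_formula = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Definition: s^2 = sum of squared deviations from the mean / (n - 1)
mean = sum_x / n
s2_definition = sum((x - mean) ** 2 for x in grades) / (n - 1)

print(sum_x, sum_x2)
print(f"{s2_formula:.6f} {s2_definition:.6f}")
```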
Example: Let us look at the sample of 10 grades reported by the student:
58, 65, 67, 71, 75, 77, 77, 77, 86 and 92
What is the standard deviation of this data set?
Solution: From the previous example, we calculated the variance of the data set to be
$s^2 = \dfrac{888.5}{9} \approx 98.722222$
∴ The std. dev. of the data set is $s = \sqrt{s^2} = \sqrt{\dfrac{888.5}{9}} \approx 9.935906$
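One way to double-check this result is with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n − 1) versions. This is a minimal sketch; taking the square root by hand and calling `stdev` should give the same number.

```python
import math
import statistics

grades = [58, 65, 67, 71, 75, 77, 77, 77, 86, 92]

s2 = statistics.variance(grades)   # sample variance, divides by n - 1
s = math.sqrt(s2)                  # standard deviation = square root of variance

print(f"s^2 = {s2:.6f}")
print(f"s   = {s:.6f}")
print(f"statistics.stdev gives {statistics.stdev(grades):.6f}")
```

For a whole population rather than a sample, `statistics.pvariance` and `statistics.pstdev` divide by N instead.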
10.6.3 Large Populations or Large Samples
For large populations or large samples where values are usually repeated, it is often necessary to tabulate
the values and their corresponding frequencies instead of writing a long list of values. Suppose 𝑥1 appears
𝑓1 number of times, 𝑥2 appears 𝑓2 number of times, 𝑥3 appears 𝑓3 number of times, … and so on as in the
following table,
Value, 𝑥𝑖 𝑥1 𝑥2 𝑥3 𝑥4 ⋯ ⋯ Sum, Σ
Frequency, 𝑓𝑖 𝑓1 𝑓2 𝑓3 𝑓4 ⋯ ⋯ ∑ 𝑓𝑖
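Population size, $N = \sum f_i$
Mean, $\mu = \dfrac{\sum x}{N} = \dfrac{\sum (x_i f_i)}{\sum f_i} = \dfrac{\sum (x_i f_i)}{N}$
Variance, $\sigma^2 = \dfrac{\sum x^2 - \frac{(\sum x)^2}{N}}{N} = \dfrac{\sum (x_i^2 f_i) - \frac{\left(\sum (x_i f_i)\right)^2}{\sum f_i}}{\sum f_i} = \dfrac{\sum (x_i^2 f_i) - \frac{\left(\sum (x_i f_i)\right)^2}{N}}{N}$
Sample size, $n = \sum f_i$
Mean, $\bar{x} = \dfrac{\sum x}{n} = \dfrac{\sum (x_i f_i)}{\sum f_i} = \dfrac{\sum (x_i f_i)}{n}$
Variance, $s^2 = \dfrac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1} = \dfrac{\sum (x_i^2 f_i) - \frac{\left(\sum (x_i f_i)\right)^2}{\sum f_i}}{\sum f_i - 1} = \dfrac{\sum (x_i^2 f_i) - \frac{\left(\sum (x_i f_i)\right)^2}{n}}{n-1}$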
Value, $x_i$        $x_1$        $x_2$        $x_3$        $x_4$        ⋯  ⋯  Sum, Σ
Frequency, $f_i$    $f_1$        $f_2$        $f_3$        $f_4$        ⋯  ⋯  $\sum f_i$
$x_i f_i$           $x_1 f_1$    $x_2 f_2$    $x_3 f_3$    $x_4 f_4$    ⋯  ⋯  $\sum (x_i f_i)$
$x_i^2 f_i$         $x_1^2 f_1$  $x_2^2 f_2$  $x_3^2 f_3$  $x_4^2 f_4$  ⋯  ⋯  $\sum (x_i^2 f_i)$
As usual, the standard deviation is the square root of the variance, i.e. $\sigma = \sqrt{\sigma^2}$ for a population
and $s = \sqrt{s^2}$ for a sample.
Example: The owner of a DVD rental shop meticulously kept a record of how many DVDs were
rented by customers who entered his shop from the day he opened his business until the day he closed.
The following table shows the number of DVDs, 𝑥𝑖 , rented by the corresponding number of customers, 𝑓𝑖 ,
out of every 100 customers who entered his shop.
Find the average number of DVDs rented by each customer and the standard deviation of the number of
DVDs rented by his customers.
Value, $x_i$        0   1    2    3    4    5    Sum, Σ
Frequency, $f_i$    6   58   22   10   3    1    100
$x_i f_i$           0   58   44   30   12   5    149
$x_i^2 f_i$         0   58   88   90   48   25   309
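The size of this population being studied is
$N = \sum f_i = 100$
The total number of DVDs rented is
$\sum x = \sum x_i f_i = 149$
The average number of DVDs rented by each customer is
$\mu = \dfrac{\sum x_i f_i}{N} = \dfrac{149}{100} = 1.49$
The variance is
$\sigma^2 = \dfrac{\sum (x_i^2 f_i) - \frac{\left(\sum (x_i f_i)\right)^2}{N}}{N} = \dfrac{309 - \frac{149^2}{100}}{100} = \dfrac{309 - 222.01}{100} = 0.8699$
and the standard deviation is
$\sigma = \sqrt{\sigma^2} = \sqrt{0.8699} \approx 0.932684$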
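The frequency-table calculation for the DVD example can be sketched in Python as follows. It treats the 100 customers as a population, as the example does, and the variable names are illustrative.

```python
import math

# Number of DVDs rented (x_i) and number of customers renting that many (f_i).
values = [0, 1, 2, 3, 4, 5]
freqs = [6, 58, 22, 10, 3, 1]

N = sum(freqs)                                            # population size
sum_xf = sum(x * f for x, f in zip(values, freqs))        # sum of x_i * f_i
sum_x2f = sum(x * x * f for x, f in zip(values, freqs))   # sum of x_i^2 * f_i

mu = sum_xf / N                                  # population mean
variance = (sum_x2f - sum_xf ** 2 / N) / N       # population variance
sigma = math.sqrt(variance)                      # population standard deviation

print(N, sum_xf, sum_x2f)
print(f"mu = {mu:.2f}, variance = {variance:.4f}, sigma = {sigma:.6f}")
```

For a sample instead of a population, the only change under the formulas above would be dividing by N − 1 in the variance line.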
10.7 Percentiles
- Measures of relative standing.
The $p$th percentile of a set of data is a value $x_p$ with $p\%$ of the data values less than it and $(100 - p)\%$
of the data values greater than it.
Example: If 80% of students have marks lower than yours and 20% have marks higher than yours,
then your mark is at the 80th percentile.
10.7.1 Quartiles
Quartiles divide a data set, ordered from smallest to largest, into four groups each containing as close to
25% of the data as possible.
Graphical example:
𝑄1 𝑄2 = 𝑀 𝑄3
• $\text{Position of } Q_1 = \dfrac{n+1}{4}$
• $\text{Position of } Q_2 \text{ (or } M) = \dfrac{2(n+1)}{4} = \dfrac{n+1}{2}$
• $\text{Position of } Q_3 = \dfrac{3(n+1)}{4}$
Example: Let us consider the following data set of the number of years that 24 HIV patients lived
before succumbing to AIDS: 0.6, 1.6, 2.1, 2.3, 2.9, 3.4, 3.7, 4.1, 4.5, 5.3, 1.2, 1.9, 2.3, 2.5, 3.3, 3.6, 3.8, 4.2,
4.7, 7.4, 1.5, 2.8, 3.9 and 4.9.
What are the quartiles of the above data set?
Solution: The data set must be rearranged in order from smallest to largest before we proceed to
determine the positions of the quartiles:
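Position 1 2 3 4 5 6 7 8 9 10 11 12
Years 0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3
Position 13 14 15 16 17 18 19 20 21 22 23 24
Years 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 7.4
• The location of $Q_1$ is at the $\dfrac{n+1}{4} = \dfrac{24+1}{4} = \dfrac{25}{4} = 6\tfrac{1}{4}$th position
Since $Q_1$ is the value at the $6\tfrac{1}{4}$th position, i.e. a quarter of the way from the 6th value (2.1) to the 7th value (2.3), we can calculate that
$Q_1 = 2.1 + \tfrac{1}{4}(2.3 - 2.1) = 2.1 + \tfrac{1}{4}(0.2) = 2.1 + 0.05 = 2.15$
• The location of $Q_2$ or M is at the $\dfrac{n+1}{2} = \dfrac{24+1}{2} = \dfrac{25}{2} = 12\tfrac{1}{2}$th position
Since $Q_2$ or M is the value at the $12\tfrac{1}{2}$th position, i.e. halfway from the 12th value (3.3) to the 13th value (3.4), we can calculate that
$Q_2 = M = 3.3 + \tfrac{1}{2}(3.4 - 3.3) = 3.3 + \tfrac{1}{2}(0.1) = 3.3 + 0.05 = 3.35$
• The location of $Q_3$ is at the $\dfrac{3(n+1)}{4} = \dfrac{3(24+1)}{4} = \dfrac{75}{4} = 18\tfrac{3}{4}$th position
Since $Q_3$ is the value at the $18\tfrac{3}{4}$th position, i.e. three quarters of the way from the 18th value (4.1) to the 19th value (4.2), we can calculate that
$Q_3 = 4.1 + \tfrac{3}{4}(4.2 - 4.1) = 4.1 + \tfrac{3}{4}(0.1) = 4.1 + 0.075 = 4.175$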
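The quartile calculation can be sketched in Python. The helper names `value_at_position` and `quartiles` are illustrative; the sketch assumes the position rule k(n + 1)/4 from the notes and interpolates between neighbouring values when the position is fractional.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    n = len(ordered)
    if position < 1 or position > n:
        raise ValueError("position falls outside the data set")
    whole = int(position)               # whole part of the position
    fraction = position - whole         # fractional part, e.g. 0.25
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower
    upper = ordered[whole]
    # Move the given fraction of the way from the lower to the upper value.
    return lower + fraction * (upper - lower)


def quartiles(values):
    """Return (Q1, Q2, Q3) using positions k * (n + 1) / 4 for k = 1, 2, 3."""
    ordered = sorted(values)
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


years = [0.6, 1.6, 2.1, 2.3, 2.9, 3.4, 3.7, 4.1, 4.5, 5.3, 1.2, 1.9,
         2.3, 2.5, 3.3, 3.6, 3.8, 4.2, 4.7, 7.4, 1.5, 2.8, 3.9, 4.9]

q1, q2, q3 = quartiles(years)
print(f"Q1 = {q1:.3f}, Q2 = M = {q2:.3f}, Q3 = {q3:.3f}")
```

Other textbooks and software packages use slightly different position rules, so quartiles computed elsewhere may differ a little from these values.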
𝑀𝑖𝑛 𝑄1 𝑀 𝑄3 𝑀𝑎𝑥
Symbol: 𝐼𝑄𝑅
Formula: 𝐼𝑄𝑅 = 𝑄3 − 𝑄1
10.7.2.2 Lower Fence and Upper Fence
𝐿𝐹 = 𝑄1 − 1.5 × 𝐼𝑄𝑅
𝑈𝐹 = 𝑄3 + 1.5 × 𝐼𝑄𝑅
Position 1 2 3 4 5 6 7 8 9 10 11 12
Years 0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3
Position 13 14 15 16 17 18 19 20 21 22 23 24
Years 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 7.4
Therefore, we can write the 5-number summary for the above data set as
𝑀𝑖𝑛 = 0.6, 𝑄1 = 2.15, 𝑀 = 3.35, 𝑄3 = 4.175, 𝑀𝑎𝑥 = 7.4
This is a graphical presentation of the 5-number summary and it defines outliers if any exist in the data
set. The following are steps to construct a boxplot:
1. Obtain the 5-Number Summary, the Interquartile Range, and the Lower Fence and the Upper
Fence.
2. Draw and label a horizontal line (axis) to represent the scale of measurement, and locate Q1, M
and Q3 on it.
3. Draw a box between Q1 and Q3 above the axis and a line within the box indicating M.
4. Draw a dotted line to indicate the Lower Fence (LF) and another dotted line to indicate the Upper
Fence (UF).
5. Draw whiskers from the left side of the box to the first data above the LF and from the right side
of the box to the first data below the UF.
6. Observations to the left of the LF and to the right of the UF are outliers and are marked with
asterisks.
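The numerical part of these steps (everything except the drawing itself) can be sketched in Python for the HIV survival data. The sketch takes Q1 = 2.15 and Q3 = 4.175 from the quartile example and assumes the lower fence is LF = Q1 − 1.5 × IQR, mirroring the upper fence formula; this reproduces the fence values used in the worked example. Variable names are illustrative.

```python
years = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3,
         3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 7.4]

q1, q3 = 2.15, 4.175            # quartiles from the worked example

iqr = q3 - q1                   # interquartile range
lf = q1 - 1.5 * iqr             # lower fence (assumed formula, see lead-in)
uf = q3 + 1.5 * iqr             # upper fence

# Observations beyond the fences are outliers (marked with asterisks).
outliers = [x for x in years if x < lf or x > uf]

# Whiskers stop at the most extreme observations still inside the fences.
inside = [x for x in years if lf <= x <= uf]
whisker_low, whisker_high = min(inside), max(inside)

print(f"IQR = {iqr:.4f}, LF = {lf:.4f}, UF = {uf:.4f}")
print("Outliers:", outliers)
print("Whiskers end at", whisker_low, "and", whisker_high)
```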
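[Figure: generic boxplot drawn above a labelled axis, with the following marked from left to right: outliers (∗), Min, LF, first data above LF, 𝑄1, M, 𝑄3, first data below UF, UF, Max and outliers (∗).]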
Example: Let us draw a boxplot based on the data in the previous example:
Position 1 2 3 4 5 6 7 8 9 10 11 12
Years 0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3
Position 13 14 15 16 17 18 19 20 21 22 23 24
Years 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 7.4
[Figure: boxplot on an axis from −2 to 8 labelled "No. of years an HIV patient survives", with 𝐿𝐹 = −0.8875, 𝑀𝑖𝑛 = 0.6, 𝑄1 = 2.15, 𝑀 = 3.35, 𝑄3 = 4.175, 𝑈𝐹 = 7.2125 and 𝑀𝑎𝑥 = 7.4. The first data below 𝑈𝐹 is 5.3, and the observation 7.4 is marked ∗ as an outlier.]