MetNum1 2023 1 Week 10
Semester 2023/1
Week 10
1
Topics covered in Part 2
Week Title
Week 9 Introduction to statistics and probability
Week 10 Descriptive statistics and sampling techniques
Week 11 Probability theory
Week 12 Discrete and continuous probability distributions
Week 13 Variance, co-variance, and correlation
Week 14 Statistical inference methods
Week 15 Statistical analysis using Octave/MATLAB
Week 16 UAS
2
Outline
1. Measures of location and center
2. Measures of variability
3. Measures of shape
4. Data presentation methods
5. Sampling methods
3
Cartoon of the week
4
What is descriptive statistics?
5
Summary of measures
Measures of Location
• Percentile
• Quartile
Measures of Variability
• Range
• Interquartile range
• Variance
• Standard deviation
6
Topic 1: Measures of location and center
7
Sorting data
• In ordering, the data are normally arranged from the smallest value to the largest (a.k.a. ascending order).
8
Sorting data
We can put the data in ascending order, as follows:

i    xi
1    5
2    5
3    5
4    5
5    6
6    6
7    7
8    8

i - observation number
xi - data value

We can group the data, as follows:

i    fi    xi
1    4     5
2    2     6
3    1     7
4    1     8

i - group number
fi - frequency
xi - data value
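The ordering and grouping steps can be sketched in Python as follows. This is a minimal illustration using the eight values from the table; the helper name `group_data` is hypothetical and not part of the course material.

```python
from collections import Counter

def group_data(values):
    """Return (data value, frequency) pairs in ascending order of value."""
    counts = Counter(values)
    return sorted(counts.items())

data = [5, 5, 5, 5, 6, 6, 7, 8]   # values from the table
ordered = sorted(data)            # ascending order
grouped = group_data(data)

for i, (x, f) in enumerate(grouped, start=1):
    print(i, f, x)                # group number, frequency, data value
```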
9
Percentiles
• Percentiles are a measure of location, whereby the data are partitioned into 100 segments.
• The Pth percentile in the ordered set is that value below which P% of the observations in the set lie.
• For n observations, the position of the Pth percentile in the ordered set is (n + 1) × P/100. If the position is not a whole number, linear interpolation is used to find the correct percentile value.
10
Worked Example 1: percentiles
The marks scored by a group of students on their final exams are shown below. Find the 50th, 80th, and 90th percentiles of the given data set.
Marks: 33, 26, 24, 21, 19, 20, 18, 18, 52, 56, 27, 22, 18, 49, 22, 20, 23, 32, 20, 18
11
Worked Example 1: percentiles
Solution:
The first step is to sort the data in ascending order and write the corresponding index number (𝑖) for each observation.

i     Marks (sorted)
1     18
2     18
3     18
4     18
5     19
6     20
7     20
8     20
9     21
10    22
11    22
12    23
13    24
14    26
15    27
16    32
17    33
18    49
19    52
20    56
12
Worked Example 1: percentiles
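[Figure: marks plotted against observation number, for observations 8 to 12]
• The position of the 50th percentile is (20 + 1) × 50/100 = 10.5, i.e. halfway between the 10th and 11th observations. In this case, since both the 10th and 11th observations have the same value, the 50th percentile is 22.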
13
Worked Example 1: percentiles
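• The position of the 80th percentile is (20 + 1) × 80/100 = 16.8. In this case, since the 16th and 17th observations have different values (32 and 33), linear interpolation gives 32 + 0.8 × (33 − 32) = 32.8, so the 80th percentile is 32.8.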
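The percentile procedure used in this worked example can be sketched in Python. This is an illustrative sketch only: the function name `percentile` is hypothetical, and clamping the position to the range 1 to n is an assumption made here so that extreme values of P stay inside the data.

```python
def percentile(sorted_data, p):
    """Return the p-th percentile using the (n + 1) * p / 100 position rule.

    Positions are 1-based. A fractional position is resolved by linear
    interpolation between the two neighbouring observations.
    """
    n = len(sorted_data)
    position = (n + 1) * p / 100
    position = min(max(position, 1), n)   # assumption: clamp to [1, n]
    lower = int(position)                 # 1-based index of the observation below
    fraction = position - lower
    if lower == n:
        return sorted_data[-1]
    below = sorted_data[lower - 1]
    above = sorted_data[lower]
    return below + fraction * (above - below)

marks = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
         27, 22, 18, 49, 22, 20, 23, 32, 20, 18]
ordered = sorted(marks)

for p in (50, 80, 90):
    print(p, round(percentile(ordered, p), 2))
```

The 50th and 80th percentiles reproduce the values 22 and 32.8 found above. Applying the same rule to the 90th percentile (position 18.9, between the observations 49 and 52) gives 51.7.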
14
Worked Example 1: percentiles
15
Quartiles
• The 25th percentile is known as the first (or lower) quartile (Q1).
• The 50th percentile is the second quartile (Q2), which is the median.
• The 75th percentile is known as the third (or upper) quartile (Q3).
16
Arithmetic mean or average
• The arithmetic mean or average of a set of measurements is a measure of center. It is equal to the sum of the measurements divided by the total number of measurements, n.
• For a sample, the mean is normally assigned the symbol 𝑥̅ (pronounced as ‘x bar’).
Ungrouped data: $\bar{x} = \frac{\sum x_i}{n}$
Grouped data: $\bar{x} = \frac{\sum f_i x_i}{n}$
Note: If we were able to enumerate the whole population, the population
mean would be assigned the symbol μ.
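The two formulas can be checked against each other with a short Python sketch. It reuses the small data set from the "Sorting data" slide; the function names are hypothetical.

```python
def mean_ungrouped(values):
    """Sum of the measurements divided by the number of measurements."""
    return sum(values) / len(values)

def mean_grouped(groups):
    """groups is a list of (frequency, data value) pairs."""
    n = sum(f for f, _ in groups)
    return sum(f * x for f, x in groups) / n

ungrouped = [5, 5, 5, 5, 6, 6, 7, 8]
grouped = [(4, 5), (2, 6), (1, 7), (1, 8)]

print(mean_ungrouped(ungrouped))
print(mean_grouped(grouped))
```

Both calls return the same value, because the grouped form only collects repeated values before summing.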
17
Calculating the sample mean
$\bar{x} = \frac{\sum_{i=1}^{8} x_i}{8} = \frac{12.6 + 12.9 + \dots + 13.1}{8} = \frac{104}{8} = 13.0$ N

i    xi
1    12.6
2    12.9
3    13.4
4    12.3
5    13.6
6    13.5
7    12.6
8    13.1
18
Median
• The median of a set of measurements is the middle measurement when
the measurements are put in ascending order.
19
Mode
The mode is the measurement that occurs most frequently in the data set. For example:
1. For the set {2, 4, 9, 8, 8, 5, 3}, the mode is 8, which occurs twice.
2. For the set {2, 2, 9, 8, 8, 5, 3}, there are two modes: 8 and 2 (this
is called a bimodal set).
3. For the set {2, 4, 9, 8, 5, 3}, there is no mode (each value is
unique).
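The three cases above can be reproduced with a small Python sketch. The helper `modes` is a hypothetical name and assumes a non-empty data set; `statistics.median` from the standard library is included for comparison with the median slide.

```python
from collections import Counter
from statistics import median

def modes(values):
    """Return the most frequent values, or [] if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # each value is unique: no mode
    return sorted(x for x, c in counts.items() if c == highest)

print(modes([2, 4, 9, 8, 8, 5, 3]))    # one mode
print(modes([2, 2, 9, 8, 8, 5, 3]))    # bimodal
print(modes([2, 4, 9, 8, 5, 3]))       # no mode
print(median([2, 4, 9, 8, 8, 5, 3]))   # middle value of the sorted set
```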
20
Mean vs median vs mode
Mean: $\bar{x} = \frac{\sum x_i}{n} = \frac{55}{25} = 2.2$
21
Class Exercise 1: mean, median, mode
22
Topic 2: measures of variability
23
Variability (dispersion)
Measures of variability tell us how spread out the data are (in other words, the
degree to which the different data points deviate from the central value).
Low variability
Medium variability
High variability
24
Measures of variability
25
Range
If the n observations in a sample are denoted by x1, x2, …, xn, the sample range
is given by the largest observation in the sample minus the smallest observation,
i.e.:
r = max(xi) – min(xi)
26
Sample range
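For the marks data of Worked Example 1 (n = 20):
First quartile (Q1): position = (20 + 1) × 25/100 = 5.25; interpolation: 19 + (0.25)(1) = 19.25
Third quartile (Q3): position = (20 + 1) × 75/100 = 15.75; interpolation: 27 + (0.75)(5) = 30.75
Interquartile range = Q3 − Q1 = 30.75 − 19.25 = 11.5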
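As a rough check, the range and interquartile range of the marks data from Worked Example 1 can be computed in Python. The helper `percentile` is a hypothetical name; it applies the (n + 1) × P/100 position rule and assumes the position falls strictly between 1 and n.

```python
def percentile(sorted_data, p):
    """p-th percentile via the (n + 1) * p / 100 rule with interpolation."""
    position = (len(sorted_data) + 1) * p / 100
    lower = int(position)                  # 1-based index of the lower neighbour
    fraction = position - lower
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + fraction * (above - below)

marks = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
         27, 22, 18, 49, 22, 20, 23, 32, 20, 18]
ordered = sorted(marks)

sample_range = ordered[-1] - ordered[0]    # largest minus smallest
q1 = percentile(ordered, 25)
q3 = percentile(ordered, 75)
iqr = q3 - q1

print(sample_range, q1, q3, iqr)
```

The range depends only on the two extreme observations, whereas the interquartile range ignores the lowest and highest quarters of the data, which makes it less sensitive to extreme values.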
27
Variance
Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Q: Why do we use a squared term? Why not just $\sum (x_i - \bar{x})$?
28
Bessel’s correction
You may have noticed that the sample variance is divided by (n – 1), whereas the
population variance uses the total population size N.
The use of (n – 1) is known as Bessel’s correction and reduces the bias (i.e. the systematic discrepancy between the estimated value and the ‘true’ value) due to estimating the population variance using a sample.
The smaller the sample, the greater the bias, as the sample mean is less representative
of the population mean for smaller samples.
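The effect of Bessel's correction can be illustrated exactly on a tiny example. The sketch below assumes a hypothetical toy population {1, 2, 3} and lists every possible sample of size 2 drawn with replacement, then averages the two candidate variance estimates over all samples. The helper name `var` is also hypothetical.

```python
from itertools import product

def var(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / divisor

population = [1, 2, 3]                           # hypothetical toy population
sigma2 = var(population, len(population))        # population variance (divide by N)

n = 2
samples = list(product(population, repeat=n))    # all 9 equally likely samples

avg_divide_by_n = sum(var(s, n) for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(var(s, n - 1) for s in samples) / len(samples)

print(sigma2, avg_divide_by_n, avg_divide_by_n_minus_1)
```

Averaged over all samples, dividing by n − 1 recovers the population variance, while dividing by n underestimates it (here by a factor of one half, because the sample size is only 2).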
29
Standard deviation
The variance is a measure of how spread out the dataset is. However, because it
uses a squared term, it does not give us a measure of how far the data is from
the mean in terms of the same units as the mean.
Therefore, to obtain a more direct measure of the variation of the data points
relative to the mean, we take the square root of the variance, which is known as
the standard deviation.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
30
Calculating the sample variance
Recall the pull-off force data in Slide 18. The mean was calculated as 𝑥̅ = 13.0 N. The table below displays the quantities needed to calculate the sample variance and sample standard deviation:

i    xi      xi − 𝑥̅    (xi − 𝑥̅)²
1    12.6    -0.4       0.16
2    12.9    -0.1       0.01
3    13.4    0.4        0.16
4    12.3    -0.7       0.49
5    13.6    0.6        0.36
6    13.5    0.5        0.25
7    12.6    -0.4       0.16
8    13.1    0.1        0.01
31
Shortcut formula
Using the definition of the mean, it is possible to rewrite the variance formula from the previous
slide,
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
into the following form (try it yourself before watching the derivation!):
$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$
This is known as the shortcut formula, as it allows us to calculate the variance using values of xi
directly, without having to subtract each value from the mean.
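For reference once you have tried it yourself, one possible route to the shortcut formula is sketched below. It expands the square in the numerator and uses only the definition of the mean, $\sum x_i = n\bar{x}$.

```latex
\begin{aligned}
\sum (x_i - \bar{x})^2
  &= \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 \\
  &= \sum x_i^2 - 2n\bar{x}^2 + n\bar{x}^2
     && \text{since } \textstyle\sum x_i = n\bar{x} \\
  &= \sum x_i^2 - n\bar{x}^2 \\
  &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}
     && \text{since } \bar{x} = \tfrac{1}{n}\textstyle\sum x_i
\end{aligned}
```

Dividing both sides by n − 1 gives the shortcut form of the sample variance.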
32
Using the shortcut formula
$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} = \frac{1,353.60 - \frac{(104.0)^2}{8}}{7} = \frac{1.60}{7} = 0.2286\ \text{N}^2 \approx 0.23\ \text{N}^2$
$s = \sqrt{0.2286} = 0.48$ N
When calculating s, don’t forget to use the unrounded value of s²!

i       xi       xi²
1       12.6     158.76
2       12.9     166.41
3       13.4     179.56
4       12.3     151.29
5       13.6     184.96
6       13.5     182.25
7       12.6     158.76
8       13.1     171.61
sums    104.0    1,353.60
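The calculation can be cross-checked in Python by computing the sample variance of the pull-off force data both from the definition and from the shortcut formula. This is a minimal sketch; the variable names are illustrative.

```python
import math

force = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]   # pull-off force, N
n = len(force)
mean = sum(force) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - mean) ** 2 for x in force) / (n - 1)

# Shortcut: needs only the sum of x and the sum of x squared.
sum_x = sum(force)
sum_x2 = sum(x * x for x in force)
s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

s = math.sqrt(s2_definition)   # root of the unrounded variance

print(round(s2_definition, 4), round(s2_shortcut, 4), round(s, 2))
```

Both routes agree for this data set. In floating-point arithmetic the shortcut subtracts two large, nearly equal numbers, so it can lose precision when the mean is large compared with the spread; the definition is the safer choice in code.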
33
Class Exercise 2: mean and std dev
(a) Suppose the mean score on a national test is 400 with a standard deviation of 50. If each score is increased by 25, what are the new mean and standard deviation?
(b) Suppose the mean score on a national test is 400 with a standard deviation of 50. If each score is increased by 25%, what are the new mean and standard deviation?
34
Topic 3: measures of shape
35
Skewness and kurtosis
38
Topic 4: data presentation methods
39
Data presentation
40
Methods for displaying data
41
Pie charts and bar charts
Student Cola Preference From Survey

Colas           Frequency (Count)
Pepsi           63
Bloxy Cola      112
Mecca Cola      47
RC Cola         13
Corsica Cola    27
Zam Zam Cola    47

[Figure: bar chart and pie chart of the survey frequencies for Coca Cola, Pepsi, Bloxy Cola, Mecca Cola, RC Cola, Corsica Cola, and Zam Zam Cola]
42
Time series plots
Time series plots show the data (or statistic) value on the vertical axis
and the time on the horizontal axis.
Such plots reveal trends, cycles or other time-oriented behavior that
could not otherwise be seen in the data.
43
Box and whisker plots
Box and whisker plots (also just called box plots) illustrate the data in a graphical display that simultaneously describes several important features of a data set, such as:
• Center
• Spread
• Symmetry
• Identification of outliers
44
Box plot without outliers
• The figure below shows the construction of a box plot without outliers:
45
Box plot with outliers
To illustrate the construction of a box plot with outliers, consider the alloy compressive strength
data listed in Table 1.
46
Box plot with outliers
Step 1: Find the values of Q1, Q2, and Q3 and calculate the IQR.
Step 4: Find the closest data points within the upper and lower limits and draw the ‘whiskers’
extending to these points.
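The numerical side of a box plot with outliers can be sketched in Python. Because the alloy data table is not reproduced here, the sketch uses the marks data from Worked Example 1 instead. It assumes the common convention that the upper and lower limits lie 1.5 × IQR beyond the quartiles; the factor 1.5 and the helper name `percentile` are assumptions, not taken from the slides.

```python
def percentile(sorted_data, p):
    """p-th percentile via the (n + 1) * p / 100 rule with interpolation."""
    position = (len(sorted_data) + 1) * p / 100
    lower = int(position)
    fraction = position - lower
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + fraction * (above - below)

marks = sorted([33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
                27, 22, 18, 49, 22, 20, 23, 32, 20, 18])

q1, q2, q3 = (percentile(marks, p) for p in (25, 50, 75))
iqr = q3 - q1

# Assumed convention: limits sit 1.5 * IQR beyond the quartiles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [x for x in marks if x < lower_limit or x > upper_limit]
inside = [x for x in marks if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = inside[0], inside[-1]   # closest points within limits

print(q1, q2, q3, iqr)
print(lower_limit, upper_limit, whisker_low, whisker_high, outliers)
```

Under this assumption the whiskers stop at the most extreme observations that still lie inside the limits, and any observation beyond a limit is drawn as an individual outlier point.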
47
Box plot (with outliers)
The figure shows the resulting box plot (with outliers) of the compressive strength data of 80
aluminum-lithium alloy specimens. We can see that the dataset contains three outliers, at
245, 87, and 76.
[Figure: box plot of the compressive strength data, with the IQR marked]
48
Histograms
• A histogram is a way of representing grouped data. It looks like a bar chart, but with data values on the x-axis and frequency (or relative frequency, which is normalized relative to the sample size, n) on the y-axis.
49
Grouping data into intervals
• Intervals should be:
  o Mutually exclusive: every observation is assigned to only one group, without any overlap
  o Exhaustive: every observation is assigned to a group
In addition, the intervals are normally of equal width, although the first or last group may be open-ended.
50
Frequency distributions
The data is gathered into bins or cells, which are defined by the boundaries of
the intervals.
The number of classes multiplied by the interval width should exceed the range
of the data (meaning that all the data fit within the defined intervals).
51
Constructing a frequency distribution
Step 1. Decide on the number of classes (k) to include in the frequency distribution. We can do this using Sturges’ rule and then rounding up.
Sturges’ rule: k = 1 + 3.22 log10 n, where n = sample size. (The coefficient is more commonly quoted as 3.322, because the rule is k = 1 + log2 n.)
Step 2. Find the class interval (i) as follows: determine the range (r) of the data, divide the range by the number of classes, and round up to the next convenient number.
$i = \frac{r}{k}$
Step 3. Find the class limits. You can use the minimum data value as the lower limit of the first
class.
Step 4. Make a tally mark for each data entry in the row of the appropriate class. Count the tally
marks to find the frequency (f) for each class.
52
Worked Example 2: frequency distribution
The following sample data set lists the number of minutes 50 students spent on
social media during a given day. Construct a frequency distribution of the data.
50 40 41 17 11 7 22 44 28 21 19 23 37 51 54 42 88
41 78 56 72 56 17 7 69 30 80 56 29 33 46 31 39 20
18 29 34 59 73 77 36 39 30 62 54 67 39 31 53 44
53
Worked Example 2: frequency distribution
Solution:
1. $k = 1 + 3.22 \log_{10} 50 = 6.47$ → Round up to 7
2. $i = \frac{88 - 7}{7} = 11.57$ → Round up to 12

Class      Tally             Frequency, f    Relative frequency    Cumulative frequency
7 – 18     |||| |            6               0.12                  6
19 – 30    |||| ||||         10              0.20                  16
31 – 42    |||| |||| |||     13              0.26                  29
43 – 54    |||| |||          8               0.16                  37
55 – 66    ||||              5               0.10                  42
67 – 78    |||| |            6               0.12                  48
79 – 90    ||                2               0.04                  50
                             Σ f = 50        1.00
54
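The four steps can be followed in Python to reproduce the table. This is a sketch under two assumptions: "round up to the next convenient number" is taken to mean rounding up to the next integer, and the upper class limit is written as the next lower limit minus 1, which only suits integer data.

```python
import math
from itertools import accumulate

minutes = [50, 40, 41, 17, 11, 7, 22, 44, 28, 21, 19, 23, 37, 51, 54, 42, 88,
           41, 78, 56, 72, 56, 17, 7, 69, 30, 80, 56, 29, 33, 46, 31, 39, 20,
           18, 29, 34, 59, 73, 77, 36, 39, 30, 62, 54, 67, 39, 31, 53, 44]

n = len(minutes)
k = math.ceil(1 + 3.22 * math.log10(n))        # Step 1: number of classes
data_range = max(minutes) - min(minutes)
width = math.ceil(data_range / k)              # Step 2: class interval

# Step 3: class limits, starting from the minimum data value.
start = min(minutes)
classes = [(start + j * width, start + (j + 1) * width - 1) for j in range(k)]

# Step 4: count the observations that fall inside each class.
frequencies = [sum(lo <= x <= hi for x in minutes) for lo, hi in classes]
relative = [f / n for f in frequencies]
cumulative = list(accumulate(frequencies))

for (lo, hi), f, rf, cf in zip(classes, frequencies, relative, cumulative):
    print(f"{lo:>2} - {hi:<2}  {f:>2}  {rf:.2f}  {cf:>2}")
```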
Histogram of a frequency
distribution
Histograms are commonly used to display frequency distributions. However, since the histogram
is on a continuous scale, the class boundaries for discrete data will need to be adjusted so that
there are no gaps in between.
For the data from Worked Example 2, the distance from the upper limit of the first class to the lower limit
of the second class is 19 – 18 = 1. Half this distance is 0.5. Hence, on the histogram, the upper
boundary used for the first class (which is also the lower boundary for the second class) is 18.5.
Similarly, the second class is adjusted to be between 18.5 and 30.5, and so on.
Note that for the first class, the starting value is normally adjusted down to maintain an equal
width to the other intervals. In this case, the starting value would be 6.5.
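The boundary adjustment can be written out in a few lines of Python. The sketch below starts from the class limits of Worked Example 2 and also computes the class midpoints; the variable names are illustrative.

```python
class_limits = [(7, 18), (19, 30), (31, 42), (43, 54),
                (55, 66), (67, 78), (79, 90)]

# Half the gap between one class's upper limit and the next class's lower limit.
half_gap = (class_limits[1][0] - class_limits[0][1]) / 2    # (19 - 18) / 2

# Widen every class by half the gap on each side so that adjacent bars touch.
boundaries = [(lo - half_gap, hi + half_gap) for lo, hi in class_limits]
midpoints = [(lo + hi) / 2 for lo, hi in class_limits]

print(boundaries)
print(midpoints)
```

Every adjusted class has the same width of 12, and the midpoints are unchanged by the adjustment because each class is widened symmetrically.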
55
Histogram of a frequency distribution
The resulting histogram is constructed as follows, with the x-axis labelled with either the
class midpoints (left) or class boundaries (right):
56
Frequency polygon
A frequency polygon is like a histogram, but instead of bars, the class midpoints are
connected using straight lines.
57
Relative frequency histogram
58
Ogive
Cumulative frequency refers to the total frequency value at each upper class boundary (as shown
in the table). A plot of cumulative frequency is known as an ogive (see the figure).
Notice that the graph starts at 6.5, where the cumulative frequency is 0, and ends at 90.5, where
the cumulative frequency is 50.
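The points that make up the ogive can be generated from the class frequencies with a short Python sketch. It assumes the class boundaries and frequencies of Worked Example 2 and produces only the coordinates, not the plot.

```python
from itertools import accumulate

upper_boundaries = [18.5, 30.5, 42.5, 54.5, 66.5, 78.5, 90.5]
frequencies = [6, 10, 13, 8, 5, 6, 2]

cumulative = list(accumulate(frequencies))

# The ogive starts at the lower boundary of the first class with a count of 0.
ogive_points = [(6.5, 0)] + list(zip(upper_boundaries, cumulative))

for x, cf in ogive_points:
    print(x, cf)
```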
59
Class Exercise 3: histograms
60
Topic 5: sampling techniques
61
What is sampling?
For example, different sampling methods are widely used by researchers in the
field of market research so that they do not need to study the entire population to
collect actionable insights.
62
Types of sampling
63
Probability sampling
64
Simple random sampling
65
Systematic sampling
66
Stratified sampling
67
Cluster sampling
The population is divided into subgroups that should all have similar
characteristics to the whole group. The subgroups are then randomly
selected.
If subgroups are very large, these can be further sampled (known as
multistage sampling).
Can lead to errors if the subgroups are not truly representative.
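Cluster sampling and its multistage variant can be sketched in Python with the standard `random` module. Everything in this example is hypothetical: the function name `cluster_sample`, its parameters, and the population of four classes with ten students each.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Randomly select whole clusters; optionally subsample inside each one.

    clusters    : dict mapping a cluster name to a list of its members
    n_clusters  : number of clusters to select at random
    per_cluster : if given, number of members drawn from each selected
                  cluster (multistage sampling); otherwise keep everyone
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for name in chosen:
        members = clusters[name]
        if per_cluster is not None and per_cluster < len(members):
            members = rng.sample(members, per_cluster)
        sample[name] = list(members)
    return sample

# Hypothetical population: four classes of ten students each.
classes = {f"class_{c}": [f"{c}{i}" for i in range(1, 11)] for c in "ABCD"}

one_stage = cluster_sample(classes, n_clusters=2, seed=1)
two_stage = cluster_sample(classes, n_clusters=2, per_cluster=3, seed=1)

print(sorted(one_stage))
print(two_stage)
```

In the one-stage case every member of a selected cluster is included; in the two-stage case only a random subset of each selected cluster is kept, which is useful when the clusters are very large.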
68
Non-probability sampling
Non-probability sampling is mainly used in qualitative or exploratory research,
where the goal is to gain an initial understanding of a small or under-researched
population.
Non-probability sampling can be cheaper and easier to implement than probability sampling, but it is also more prone to sampling bias, which can cause errors.
69
Convenience sampling
Members of the population are selected based on ease of access (e.g. living in the same city or
country as the researcher).
Cheap and easy to implement, but there is no way to tell if the sample is representative, so
results may not be generalizable.
70
Voluntary response sampling
The sample consists of people who willingly participate (volunteer) in the study. Hence, it is easy
for the researcher to implement.
Likely to result in a degree of sample bias as some people may be inherently more likely than
others to volunteer.
71
Purposive sampling
Researcher uses their prior knowledge or expertise to select the most suitable sample (also
known as judgement sampling).
Cheap and easy to implement, but there is no way to tell if the sample is representative, so
results may not be generalizable.
72
Snowball sampling
If the population is hard to access, snowball sampling can be used to recruit participants via other
participants.
The number of people you have access to grows rapidly (or “snowballs”) as you contact more
people.
73
Class Activity: sampling
Collect the heights of all students in a spreadsheet. Calculate the mean height of
the class.
74
Problem Set 2
75
Question 1
7.15, 7.20, 7.18, 7.19, 7.21, 7.20, 7.16, 7.18, 7.20, 7.17
76
Question 2
From the following data,
1.09 1.92 2.31 1.79 2.28 1.74 1.47 1.97
0.85 1.24 1.58 2.03 1.7 2.17 2.55 2.11
1.86 1.9 1.68 1.51 1.64 0.72 1.69 1.85
1.82 1.79 2.46 1.88 2.08 1.67 1.37 1.93
1.4 1.64 2.09 1.75 1.63 2.37 1.75 1.69
77
Question 3
Fifty students were asked how much sleep they get per school night, rounded to the nearest hour. The results are shown in Table 1. Based on these data:
(a) Calculate the 28th and 80th percentiles
(b) Construct a relative frequency histogram
(c) Construct an ogive

Table 1. Sleep data for students.
Amount of sleep per school night (hours)    Frequency
4                                           2
5                                           5
6                                           7
7                                           12
8                                           14
9                                           7
10                                          3
78
Question 4
79