2023 Topic 1 Investigating and Comparing Data Distributions
Topic Timeline
Study Design Content
Contents
Topic Timeline
Study Design Content
1.1 Introduction to data distributions
Types of data
Measures of centre and spread
1.2 Tables and Charts
Frequency tables
Grouped frequency tables
Bar charts
1.3 Histograms
Histograms and grouped frequency tables
Centre and spread of histograms
Shapes of histograms
1.4 Boxplots
Five-number summary
IQR, outliers and fences
Boxplots
Comparing boxplots and histograms
1.5 Dot Plots and Stem-and-Leaf Plots
Dot plots
Stem-and-leaf plot
1.6 Back-to-back stem plots and parallel boxplots
Back-to-back stem plots
Parallel boxplots
Which display do we use?
1.7 Mean and Standard Deviation
Measures of centre
Measures of spread
Standard deviations away from the mean
Topic 1 Review
1.1 Introduction to data distributions
Types of data
▪ We have two main types of data: numerical (quantitative, values are numbers) and
categorical (qualitative, values can’t be quantified).
▪ Categorical data involves either numbers where adding makes no sense or categories that
don’t involve any numbers.
o Ordinal data has a natural order. E.g. star ratings from 1-5 on Uber, house numbers,
letter grades.
o Nominal data has no natural order. E.g. eye colour.
▪ Numerical data can be divided into several types, depending on context.
o Continuous data can be measured to as many decimal places as you can physically
manage. E.g. handspan, rainfall, temperature.
o Discrete data can only take specific, separate values, with nothing in between.
E.g. number of pets, shoe size (6, 6 ½, 7 but not in between these).
o Ratio data has a true zero that means 'none'. E.g. handspan (ratio and continuous)
and number of pets (ratio and discrete) both start at zero, but temperature (in °C)
does not.
o Interval data has no true zero. E.g. temperature (interval and continuous) and
calendar years (interval and discrete).
Example 1
We do…
Classify the following sets of data as categorical or numerical, and then nominal/ordinal or
discrete/continuous/interval/ratio.
c) Hair colour
f) Temperature at Vladivostok
Range: 5 − 0 = 5
TO DO:
Nelson Ex 1.1 p. 9
q’s 2 (every 2nd), 3bcd, 4-7, 11, 13
1.2 Tables and Charts
Example 1
Construct frequency tables for the following sets of data:
I do…
The different car colours along a quiet road are counted:
Grouped frequency tables
▪ Numerical data, usually continuous data
▪ When it’s impractical to list each individual value
▪ Gives the frequency of certain ranges
▪ Intervals must all be the same size
Example 2
We do…
Create a grouped frequency table for the following data, and find the modal interval:
45, 78, 80, 67, 43, 59, 32, 12, 100, 45, 58, 56, 69, 16
Total 100%
Modal interval:
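One way to check a grouped frequency table for the Example 2 data is sketched below. The interval width of 20 (starting at 0) is an assumption made for this sketch, not a choice stated in these notes, and `grouped_frequency` is a hypothetical helper name. Each interval includes its lower bound and excludes its upper bound.

```python
from collections import Counter

def grouped_frequency(data, width, start=0):
    """Count how many values fall in each interval [lower, lower + width)."""
    counts = Counter((x - start) // width for x in data)
    total = len(data)
    rows = []
    for k in range(max(counts) + 1):
        lower = start + k * width
        freq = counts.get(k, 0)
        # Percentages are rounded, so they may not add to exactly 100.
        rows.append((lower, lower + width, freq, round(100 * freq / total, 1)))
    return rows

data = [45, 78, 80, 67, 43, 59, 32, 12, 100, 45, 58, 56, 69, 16]
table = grouped_frequency(data, width=20)  # width of 20 is an assumption

# The modal interval is the interval with the highest frequency.
modal_interval = max(table, key=lambda row: row[2])[:2]

for lower, upper, freq, percent in table:
    print(f"{lower}-<{upper}: {freq} ({percent}%)")
print("Modal interval:", modal_interval)
```

A different interval width would give a different table, and possibly a different modal interval, so compare against the width chosen in class.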
Bar charts
▪ Bar charts provide a visual display for categorical data sets.
▪ The bars are drawn with gaps between them to show that the values are separate categories.
▪ You’ve done plenty of these.
TO DO:
Nelson Ex 1.2 p. 15
q’s 3, 4ac, 5, 8-15
1.3 Histograms
Siblings Frequency
Shapes of histograms
Symmetrical
The mean, median, and mode are all
approximately in the middle
𝑚𝑒𝑎𝑛 ≈ 𝑚𝑒𝑑𝑖𝑎𝑛 ≈ 𝑚𝑜𝑑𝑒
Multimodal
More than one mode
Negatively skewed
The mean is ‘below’ the mode
𝑚𝑒𝑎𝑛 < 𝑚𝑒𝑑𝑖𝑎𝑛 < 𝑚𝑜𝑑𝑒
Positively skewed
The mean is ‘above’ the mode
𝑚𝑒𝑎𝑛 > 𝑚𝑒𝑑𝑖𝑎𝑛 > 𝑚𝑜𝑑𝑒
▪ We can also identify outliers. An outlier is an extremely high or low value compared with the rest of the data.
Example 2
You do…
TO DO:
Nelson Ex 1.3 p. 24
q’s 3-6, 9-13
1.4 Boxplots
By hand:
Put in order and split in halves:
(4, 7, 11, 12, 13, 13), 14, (14, 15, 15, 16, 16, 25)
Minimum = 4
Q1 = $\frac{11 + 12}{2}$ = 11.5
Median = 14
Q3 = $\frac{15 + 16}{2}$ = 15.5
Maximum = 25
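The by-hand method above can be sketched in Python as a check. This is a minimal sketch that assumes the same convention as the working shown: when the number of values is odd, the median is left out of both halves. `five_number_summary` is a hypothetical helper name.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the split-in-halves method."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower_half = values[:half]
    # For an odd count, skip the middle value so it is in neither half.
    upper_half = values[half + 1:] if n % 2 else values[half:]
    return (values[0], median(lower_half), median(values),
            median(upper_half), values[-1])

data = [4, 7, 11, 12, 13, 13, 14, 14, 15, 15, 16, 16, 25]
summary = five_number_summary(data)
print(summary)
```

CAS and other software may use slightly different quartile conventions, so small differences from a by-hand answer are possible on other data sets.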
On CAS:
IQR, outliers and fences
▪ The Interquartile Range (IQR) is the difference between the lower and upper quartiles. This
gives us the range of the middle 50% of the data.
o 𝐼𝑄𝑅 = 𝑄3 − 𝑄1
▪ The IQR is used in a calculation to identify possible outliers.
▪ A data value is a possible outlier if it is:
o Lower than the lower fence: 𝑄1 − 1.5 × 𝐼𝑄𝑅
o Higher than the upper fence: 𝑄3 + 1.5 × 𝐼𝑄𝑅
▪ We calculate both fences and then compare our possible outliers to these numbers.
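As a rough check of the fence calculation, the sketch below reuses the data set from the by-hand five-number summary earlier in this section, where Q1 = 11.5 and Q3 = 15.5. The function name `fences` is hypothetical.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the two quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [4, 7, 11, 12, 13, 13, 14, 14, 15, 15, 16, 16, 25]
lower_fence, upper_fence = fences(q1=11.5, q3=15.5)

# A value is a possible outlier if it lies outside either fence.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Lower fence:", lower_fence)
print("Upper fence:", upper_fence)
print("Possible outliers:", outliers)
```

Here IQR = 15.5 − 11.5 = 4, so the fences sit 1.5 × 4 = 6 units outside the quartiles.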
Example 2
I do…
Identify any outliers from the following data set, using calculations.
50, 57, 62, 64, 65, 65, 65, 68, 70, 71, 72, 72, 73, 74, 77, 79, 79
You do…
Identify any outliers from the following data set, using calculations.
3, 7, 20, 22, 22, 22, 25, 25, 28, 31, 34, 34, 39
Using CAS:
Boxplots
▪ AKA ‘box and whisker plots’
▪ A box is used to represent the middle 50% of the data points
▪ The median is shown by a vertical line drawn within the box
▪ Whiskers extend out from Q1 and Q3 to the minimum and maximum, respectively (or to the most extreme values inside the fences if there are outliers)
▪ Any outliers are drawn as dots beyond the fences: below the lower fence or above the upper fence
Example 3
We do:
Calculate and construct the:
a) five-number summary
b) lower and upper fences, and identify outliers
c) boxplot (first by hand then using CAS)
15, 2, 24, 30, 25, 19, 24, 33, 41, 60, 42, 35, 35
28, 28, 19, 19, 28, 25, 20, 36, 38, 43, 45, 39
Comparing boxplots and histograms
TO DO:
Nelson Ex 1.4 p. 34
q’s 3-15
1.5 Dot Plots and Stem-and-Leaf Plots
Revision: Ex 1.5 p. 42 q’s 1-2
Dot plots
▪ Like a bar chart, but each data point is marked by a dot
Stem-and-leaf plot
▪ Presents the numerical data in a different format that is easier for the reader to interpret
▪ The stem is the ‘tens’ column of the numbers, and the leaf is the ‘ones’
▪ Make sure that the data points are ordered from low to high!
▪ When describing stem-and-leaf plots, you turn them on their side in your head and do the
same as histograms:
Symmetrical Negatively skewed Positively skewed
Example 1
Draw a stem-and-leaf plot of the following data sets:
I do… You do…
The following is a set of marks obtained by a
group of students on a test:
15, 2, 24, 30, 25, 19, 24, 33, 41, 60, 42, 35, 35
28, 28, 19, 19, 28, 25, 20, 36, 38, 43, 45, 39
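One way to check a hand-drawn stem-and-leaf plot for the marks above is sketched below. It assumes whole-number data with tens as stems and ones as leaves, as described in the notes; `stem_and_leaf` is a hypothetical helper name. Stems with no data are still listed so the shape of the distribution is not distorted.

```python
def stem_and_leaf(data):
    """Return the plot as a list of text lines, one per stem."""
    stems = {}
    for x in sorted(data):  # sorting keeps the leaves ordered low to high
        stems.setdefault(x // 10, []).append(x % 10)
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

marks = [15, 2, 24, 30, 25, 19, 24, 33, 41, 60, 42, 35, 35,
         28, 28, 19, 19, 28, 25, 20, 36, 38, 43, 45, 39]
plot = stem_and_leaf(marks)

print("Key: 1 | 5 = 15")
for line in plot:
    print(line)
```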
Example 3
You do…
Calculate the:
a) i. mode
ii. range
iii. five-number summary
b) Identify any outliers, justifying your response with calculations.
TO DO:
Nelson Ex 1.5 p. 42
q’s 3-7, 11-13, 14b, 15abc
1.6 Back-to-back stem plots and parallel boxplots
Revision: Ex 1.6 p. 52 q’s 1-2
We can directly compare the distribution of data for two groups using stem plots or boxplots.
Back-to-back stem plots
▪ Allows for easy comparison of two small sets of discrete data
Example 1
I do…
Use the following data sets to
a) construct a back-to-back stem-and-leaf plot
b) compare the median, range, and IQR for both sets
a)
You do…
Construct a back-to-back stem plot for the following data.
Parallel boxplots
▪ Two or more boxplots drawn on the same scale.
▪ Allows for easy comparison of two or more sets of data.
▪ Address median, IQR, and possible outliers.
Example 2
Use the following data sets to construct parallel boxplots, then compare the distributions.
I do…
TO DO:
Nelson Ex 1.6 p. 52
q’s 4-14
1.7 Mean and Standard Deviation
Revision: Ex 1.7 p. 64 q’s 1, 2
We summarise data using summary statistics, which are certain calculations using the data. These
typically summarise the centre or spread of the data.
Measures of centre
▪ The mean ($\bar{x}$) is the 'average'
o For ungrouped data:
$\bar{x} = \frac{\text{sum of data values}}{\text{number of data values}}$
o For a grouped frequency table:
$\bar{x} = \frac{\text{sum of (each value} \times \text{frequency)}}{\text{sum of frequencies}}$
▪ Remember:
o Median = ‘middle’ data point
o Mode = highest frequency (most common)
o If 𝑚𝑒𝑎𝑛 − 𝑚𝑒𝑑𝑖𝑎𝑛 > 0 (i.e. positive), the data is likely positively skewed; if it is negative, the data is likely negatively skewed
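The frequency-table formula for the mean can be sketched in Python using the Score/Frequency table that appears in the examples in this section (scores 37 to 41). This is only a check on by-hand or CAS working; `frequency_table_mean` is a hypothetical helper name.

```python
from statistics import median, mode

def frequency_table_mean(table):
    """Mean = sum of (value x frequency) / sum of frequencies."""
    return sum(value * f for value, f in table.items()) / sum(table.values())

scores = {37: 2, 38: 4, 39: 7, 40: 4, 41: 1}  # score: frequency

# Expand the table into a plain list to find the median and mode.
values = [value for value, f in scores.items() for _ in range(f)]

mean_score = frequency_table_mean(scores)
median_score = median(values)
mode_score = mode(values)

print("Mean:", round(mean_score, 2))
print("Median:", median_score)
print("Mode:", mode_score)
print("Mean - median:", round(mean_score - median_score, 2))
```

When the mean and median are this close together, the difference on its own says little about skew, so it is worth looking at the shape of the display as well.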
Example 1
Calculate the mean, median and mode of the following data sets, then comment on the shape of
the distribution:
I do… You do…
Score Frequency
37 2
38 4
39 7
40 4
41 1
𝑚𝑒𝑎𝑛 = $\frac{672}{25}$ = 26.88
Median = middle number
Median = 21
Mode = number with highest frequency
Mode = 21 (happens 4 times)
𝑚𝑒𝑎𝑛 − 𝑚𝑒𝑑𝑖𝑎𝑛 = 26.88 − 21 = +5.88
This dataset is positively skewed.
Measures of spread
▪ The standard deviation measures the spread of the data distribution about the mean
o $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$ (always do on CAS!)
▪ Remember:
o The range is the difference between the smallest and largest data points
▪ 𝑟𝑎𝑛𝑔𝑒 = 𝑚𝑎𝑥 − 𝑚𝑖𝑛
o The interquartile range (IQR) gives the spread of the middle 50% of data values
▪ 𝐼𝑄𝑅 = 𝑄3 − 𝑄1
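The three measures of spread can be sketched in Python using the Score/Frequency table from the examples in this section. Two assumptions are worth flagging: the quartiles use the split-in-halves method from 1.4, and both versions of the standard deviation are shown because calculators commonly report both (one dividing by n, one dividing by n − 1). Check which one your CAS output and your course expect.

```python
from statistics import median, pstdev, stdev

scores = {37: 2, 38: 4, 39: 7, 40: 4, 41: 1}  # score: frequency
values = sorted(value for value, f in scores.items() for _ in range(f))

# Range = max - min
data_range = max(values) - min(values)

# IQR = Q3 - Q1, with quartiles as the medians of the lower and upper halves
half = len(values) // 2
lower_half = values[:half]
upper_half = values[half + 1:] if len(values) % 2 else values[half:]
iqr = median(upper_half) - median(lower_half)

# Standard deviation: pstdev divides by n, stdev divides by n - 1
sd_n = pstdev(values)
sd_n_minus_1 = stdev(values)

print("Range:", data_range)
print("IQR:", iqr)
print("SD (divide by n):", round(sd_n, 3))
print("SD (divide by n - 1):", round(sd_n_minus_1, 3))
```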
Example 2:
Calculate the range, IQR, and standard deviation of the following data sets:
I do… You do…
Score Frequency
37 2
38 4
39 7
40 4
41 1
Range = 𝑚𝑎𝑥 − 𝑚𝑖𝑛
Range = 49 − 16 = 33
Q1 = middle number of lower half.
Q3 = middle number of upper half
Count in from both ends until you meet
in middle:
Standard deviations away from the mean
▪ You can calculate how many standard deviations a data point is from the mean by finding
the ‘z score’:
$z = \frac{x - \bar{x}}{s}$
(i.e. the data point minus the mean, divided by the standard deviation)
▪ A lot of data collected in real life has histograms that (when smoothed out) look like the
‘normal distribution’:
▪ The mean, median, and mode are all pretty much in the ‘middle’ of the data, and very few
data points exist towards the ‘left’ and ‘right’ sides.
▪ If the histogram resembles a bell curve, we say that the data is 'approximately normally
distributed'.
▪ If the data is normally distributed, then we know that the following percentages of the
data are within a given number of standard deviations from the mean:
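These percentages are commonly quoted as roughly 68%, 95% and 99.7% within 1, 2 and 3 standard deviations of the mean. The sketch below recomputes them with Python's built-in `statistics.NormalDist`, and also shows the z-score formula on the 24.5kg bag from Example 3 below (mean 25.5kg, standard deviation 0.5kg). The helper names `z_score` and `share_within` are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

standard = NormalDist()  # standard normal: mean 0, standard deviation 1

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"Within {k} SD of the mean: {100 * share_within(k):.1f}%")

z = z_score(24.5, mean=25.5, sd=0.5)
print("z-score for a 24.5kg bag:", z)
```

A negative z-score means the data point is below the mean; a positive one means it is above.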
Example 3:
The weights of bags of red gravel may be modelled by a normal distribution with mean 25.5kg
and standard deviation 0.5kg.
Determine the probability that a randomly selected bag of red gravel will weigh:
I do…
a) Less than 24.5kg
c) Less than 23.5kg or more than 26.5kg
You do…
a) More than 26kg
c) Less than 24.5kg or more than 26kg
TO DO:
Nelson Ex 1.7 p. 64
q’s 4-12, 14
Topic 1 Review
▪ As you complete the questions, write one double-sided A4 sheet of notes. These notes
should include any information, theories, formulas, and examples that you used to help
you answer the questions. You may use the Chapter Summary on page 67 as a guide.
▪ These pages will become useful summaries to keep at the very start of your bound
reference.
▪ Use the following excerpt from the Study Design to help structure your notes:
TO DO:
Nelson page 71-75
all questions