Lecture 1 - Introduction
Adina Amanbekkyzy
Lecture overview:
1. Introduction
• Statistics – the art of learning from data. It’s
concerned with the collection of data, its
subsequent description, and its analysis, which
leads to the drawing of conclusions.
Flowchart. The modelling process
1. Recognising a real-world problem
Key terms
• Population – a total collection of elements. (For
instance, all residents of a given state, or all the
television sets produced in the last year by a
particular manufacturer.)
Types of variables
• Qualitative variables (categorical) - variables
that take non-numerical values (i.e. they're
not numbers)
Example
An employer collects information about the computers in
his office. He gathers observations of the 5 variables
shown in the table.
1. Manufacturer               Bell   Banana   Deucer   Deucer
2. Processor speed (in GHz)   2.6    2.1      1.8      2.2
3. Year of purchase           2009   2010     2011     2009
4. Memory (in MB)             2      3        3.1      4.8
5. Colour                     Grey   Grey     Grey     Black
Self-study
A mechanic collects the following information about
cars he services:
Make, Mileage, Colour, Number of doors, Cost of
service
Write down all the variables from this list that are:
a) qualitative
b) quantitative
Types of variables
• Qualitative variables
• Quantitative variables
• Discrete
• Continuous
Types of variables
There are then two different types of quantitative
variables.
• A discrete variable can only take certain values
within a particular range (e.g. shoe sizes) - this
means there are 'gaps' between possible values (you
can't take size 9.664 shoes, for example)
Example
The variables below are all quantitative.
(i) length, (ii) weight, (iii) number of brothers, (iv)
time, (v) total value of 6 coins from down in the back
of my sofa
Answers
a) 'Length', 'weight' and 'time' can all take any value
in a range. So the continuous variables are:
'length', 'weight' and 'time'.
Types of data sets
Data is often shown in the form of a table.
1. Frequency tables show the number of
observations of various values.
Types of measurement scales
$$\text{frequency density} = \frac{\text{frequency}}{\text{class width}}$$
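As a rough illustration of the frequency density formula, the sketch below divides each class frequency by its class width. The class boundaries and frequencies are made-up values chosen for the example, not data from the lecture.

```python
# Hypothetical grouped data: (lower boundary, upper boundary, frequency).
classes = [(130, 155, 10), (155, 160, 6), (160, 170, 8)]

def frequency_density(lower, upper, frequency):
    """frequency density = frequency / class width"""
    return frequency / (upper - lower)

densities = [frequency_density(lo, hi, f) for lo, hi, f in classes]
for (lo, hi, f), d in zip(classes, densities):
    print(f"{lo}-{hi}: frequency {f}, frequency density {d:.2f}")
```

The narrow 155-160 class ends up with the tallest bar even though it does not have the largest frequency, which is why a histogram plots frequency density rather than frequency.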
Histograms
Bar charts
Histogram vs. Bar chart
Example
Here’s some data showing the heights of 24 people:
[Figure: a correct histogram and an incorrect histogram of the height data]
Draw and interpret histograms
(i) The vertical axis shows frequency density.
(ii) The horizontal axis has a continuous scale like an
ordinary graph and there are no gaps between the
columns.
(iii) A bar's left-hand edge corresponds to the lower class
boundary. A bar's right-hand edge corresponds to the
upper class boundary.
IMPORTANT:
1) Simply put, we can scale in both directions.
2) k1 and k2 are reciprocals, so k1 = 1/k2.
Solution continued
3. Now you need to find the area of the bar from 130
cm to 155 cm:
Self-study
For the same histogram
Solution
Stem and leaf diagram
1) What are Stem and leaf diagrams?
Class width and midpoint
$$\text{mid-point} = \frac{\text{lower class boundary} + \text{upper class boundary}}{2}$$
Example
This grouped frequency table shows the lengths (to
the nearest cm) of the same 50 potatoes.
Length of potato (l, in cm)   Frequency
4–5                           5
6–7                           11
8–9                           15
10–11                         16
12–13                         3

The shortest potato that could go in the 6-7 class would actually have a
length of 5.5 cm (since 5.5 cm is rounded up to 6 cm when measuring
to the nearest cm). So the lower class boundary of the 6-7 class is 5.5 cm.
The upper class boundary of the 6-7 class is the same as the
lower class boundary of the 8-9 class - this is 7.5 cm. This
means there are never any gaps between classes.
Example
This table shows the data for all groups:
Length of potato (l, in cm)   Frequency   Lower class boundary   Upper class boundary   Class width   Midpoint
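The rows of this table can be recomputed from the potato frequency table. The sketch below assumes, as in the 6-7 class discussion, that lengths are rounded to the nearest cm, so each boundary lies half a unit beyond the stated class limit. The function name `class_summary` and the `precision` parameter are introduced here for illustration only.

```python
# Classes recorded to the nearest cm: (lower limit, upper limit, frequency).
classes = [(4, 5, 5), (6, 7, 11), (8, 9, 15), (10, 11, 16), (12, 13, 3)]

def class_summary(low, high, precision=1):
    """Return (lower boundary, upper boundary, class width, midpoint).

    Assumes values were rounded to the nearest `precision` unit, so each
    boundary lies half a unit beyond the stated class limit.
    """
    lower = low - precision / 2
    upper = high + precision / 2
    return lower, upper, upper - lower, (lower + upper) / 2

rows = [(f"{low}-{high}", freq, *class_summary(low, high))
        for low, high, freq in classes]
for row in rows:
    print(row)
```

For the 6-7 class this gives boundaries 5.5 and 7.5, a class width of 2 and a midpoint of 6.5.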
Example
The heights of a number of trees were recorded. The
data collected is shown in this table.
Number of trees 26 17 11 6
Solution
1) It’s best to make another table. Include extra
rows showing:
Solution
Height of trees to nearest dm   0 - 5.5    5.5 - 10.5    10.5 - 15.5    15.5 - 20.5    Total
Class interval                  [0; 5.5)   [5.5; 10.5)   [10.5; 15.5)   [15.5; 20.5)
Midpoint                        2.75       8             13             18
Number of trees, f              26         17            11             6              60
Median. Example
To find an estimate for the median, use linear
interpolation. The table below shows the ‘tree data’
from the last example with the cumulative
frequency. Estimate the median height (to 2 d.p.).
Number of trees        26   17   11   6
Cumulative frequency   26   43   54   60
Median. Solution
1. First, find which class the median is in. Since $\frac{n}{2} = \frac{60}{2} = 30$, there
are 30 values less than or equal to the median. This means the
median must be in the 5.5-10.5 class.
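The remaining steps of this solution did not survive in these notes. One way to finish, sketched here on the assumption that the values are spread evenly within the class, is to match the proportion of the class width with the proportion of the class frequency. Here 26 and 43 are the cumulative frequencies at the boundaries 5.5 and 10.5, and $m$ denotes the median.

```latex
\frac{m - 5.5}{10.5 - 5.5} = \frac{30 - 26}{43 - 26}
\quad\Longrightarrow\quad
m = 5.5 + 5 \times \frac{4}{17} \approx 6.68
```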
Solution
Median
Median. Shorter equation
OR
$$m = \text{lower class boundary} + \text{class width} \times \frac{a_M}{b_M}$$
where $a_M$, $b_M$ are the proper differences between
cumulative frequencies.
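A minimal sketch of the shorter equation, applied to the tree data. It assumes that the numerator is n/2 minus the cumulative frequency at the lower class boundary, and that the denominator is the difference between the cumulative frequencies at the two boundaries of the median class. The function and parameter names are illustrative only.

```python
def interpolated_median(lower_boundary, class_width, cum_before, cum_after, n):
    """Estimate the median of grouped data by linear interpolation.

    cum_before / cum_after are the cumulative frequencies at the lower and
    upper boundaries of the median class; n is the total frequency.
    """
    a = n / 2 - cum_before      # how far into the class the median position lies
    b = cum_after - cum_before  # frequency of the median class
    return lower_boundary + class_width * a / b

# Tree data: median class 5.5-10.5, cumulative frequencies 26 and 43, n = 60.
m = interpolated_median(5.5, 5.0, 26, 43, 60)
print(round(m, 2))
```

With these inputs the estimate is 5.5 + 5 × 4/17, which rounds to 6.68.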
Example
Estimate the median length of the
newts described in this table (round
to 1 d.p.)
Solution
Exercise
Solution
a) 0-2 letters. All the classes are the same width, so you
can use the frequency to find the modal class (instead of
the frequency density).
b) First we add some extra columns to the table:
Number of
letters
Comparing measures of central tendency
1) What type of data are we dealing with?
2) Do we have outliers (very large or very small
values that differ dramatically from the
main data set)?
If yes, then:
Mean - quantitative data (not suitable)
Median - quantitative data (suitable)
Mean
Advantage:
• The mean is a good average because you use
all your data in working it out.
Disadvantage:
• It can be heavily affected by extreme
values/outliers and by distributions of data
values that are not symmetric.
• And it can only be used with quantitative data
(i.e. numbers).
Median
Advantage:
• The median is not affected by extreme values
- so this is a good average to use when you have
outliers.
• This also makes it a good average to use when the
data set is not symmetric.
Disadvantage:
• It can only be used with quantitative data (i.e. numbers).
Mode
Advantage:
• The mode can be used with qualitative (non-
numerical) data.
Disadvantage:
• Some data sets can have more than one
mode (and if every value in a data set occurs
just once, then the mode isn't helpful at all).
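The effect of an outlier on the three averages can be checked with a short sketch using Python's standard `statistics` module. The numerical values are hypothetical and chosen only for illustration; the colour list reuses the colours from the office computers example.

```python
from statistics import mean, median, multimode

# Hypothetical data: the same values with and without one outlier.
values = [21, 22, 22, 23, 24, 25, 26]
with_outlier = values + [95]

for label, data in (("no outlier", values), ("with outlier", with_outlier)):
    print(label, "| mean:", round(mean(data), 2),
          "| median:", median(data), "| mode:", multimode(data))

# The mode also works for qualitative (non-numerical) data.
colours = ["Grey", "Grey", "Grey", "Black"]
print(multimode(colours))
```

Adding the single value 95 moves the mean from about 23.3 to 32.25, while the median only moves from 23 to 23.5 and the mode stays at 22.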
Quick question!
Solution
Median? Mean?
Measures of dispersion
• Quartiles Q1, Q2, Q3 split the data into four parts
where Q1 is the lower quartile, Q2 is the median, Q3 is the
upper quartile.
1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 15
Range = 15 – 1 =14
Exercise
The data set -5, 17, 33, 22, 0.27, p
is known to have a range of 43.25. What
are the two possible values of p?
Possible values of p, shown with the data in order:
-10.25 -5 0.27 17 22 33 38.25
If p is the largest value: Range = 38.25 - (-5) = 43.25, so p = 38.25
If p is the smallest value: Range = 33 - (-10.25) = 43.25, so p = -10.25
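The two cases can be checked with a short sketch: p either lies above the current minimum by the full range, or below the current maximum by the full range. The variable names are illustrative only.

```python
known = [-5, 17, 33, 22, 0.27]
target_range = 43.25

# p is either the new maximum (min + range) or the new minimum (max - range).
candidates = [min(known) + target_range, max(known) - target_range]

# Keep only the candidates that really give the required range.
possible_p = [p for p in candidates
              if max(known + [p]) - min(known + [p]) == target_range]
print(sorted(possible_p))
```

Both candidates pass the check, giving p = -10.25 or p = 38.25.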
However, we have a better tool to measure dispersion: the INTERQUARTILE RANGE!
Interquartile range (IQR)
A more useful way to measure dispersion is to use the
interquartile range – but first we need to find the
quartiles.
Quartiles
Quartiles Q1, Q2, Q3 split the data into four parts
Calculation of the Quartiles
(Discrete data)
• To calculate the lower quartile, Q1, divide n by 4. When n/4 is a whole
number, find the midpoint between the corresponding term and the
term above. When n/4 is not a whole number, round the number up and
pick the corresponding term.
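The rule for Q1 can be sketched as a small function. Applying the same rule at position 3n/4 to get Q3 is an assumption of this sketch, since only the Q1 rule is stated here. The data are the eleven values used in the range example earlier in the lecture.

```python
def quartile(values, k):
    """Return Q1 (k=1) or Q3 (k=3) of discrete data.

    Q1 follows the n/4 rule; using 3n/4 for Q3 in the same way is an
    assumption of this sketch.
    """
    data = sorted(values)
    whole, remainder = divmod(k * len(data), 4)
    if remainder == 0:
        # Position is a whole number: midpoint of that term and the term above.
        return (data[whole - 1] + data[whole]) / 2
    # Otherwise round the position up and pick that term (1-based -> index `whole`).
    return data[whole]

data = [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 15]
q1, q3 = quartile(data, 1), quartile(data, 3)
print(q1, q3, q3 - q1)
```

Here n/4 = 2.75 rounds up to the 3rd term (3) and 3n/4 = 8.25 rounds up to the 9th term (10), so the interquartile range is 7.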
IQR=2
Exercise
IQR=10-3=7
Calculation of the Quartiles
(Continuous and grouped data)
Example
The length of time spent on the internet each evening by
a group of students is shown in the table below.
Find Q1 and Q3 and then the IQR for the data below.
For grouped data we need to use the interpolation method.
Lower Quartile Q1
1st Step:
$Q_1$: $\frac{n}{4} = \frac{70}{4} = 17.5$th value
2nd step:
Example
3rd step:
Example
Upper Quartile Q3
1st Step:
$Q_3$: $\frac{3n}{4} = \frac{3 \times 70}{4} = 52.5$th value
2nd step:
3rd step:
Exercise
The weights of 31 Jersey cows were recorded to the
nearest kilogram. The weights are shown in the table.
a. Complete the cumulative frequency column in the table.
b. Find Q1.
c. Find Q3.

Weight of cattle (kg)   Frequency
300-349                 3
350-399                 6
400-449                 10
450-499                 7
500-549                 5
Exercise
$Q_1$: $\frac{31}{4} = 7.75$th value, so $Q_1$ is in the class 350-399.
$$\frac{Q_1 - 349.5}{399.5 - 349.5} = \frac{7.75 - 3}{9 - 3}$$
$$Q_1 = 39.58 + 349.5 \approx 389.1$$
Exercise
c. Find Q3.
$Q_3$: $3 \times \frac{31}{4} = 23.25$th value
$Q_3$ is in the class 450-499.
$$\frac{Q_3 - 449.5}{499.5 - 449.5} = \frac{23.25 - 19}{26 - 19}$$
$$Q_3 = 30.36 + 449.5 \approx 479.9$$
d. Work out the interquartile range.
IQR = Q3 - Q1 = 479.9 - 389.1 = 90.8
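The interpolation used for Q1 and Q3 can be checked with a short sketch on the cow weights. Class boundaries are taken half a kilogram beyond the stated limits because weights were recorded to the nearest kilogram. The function name `grouped_quantile` is illustrative only.

```python
# Class boundaries and frequencies for the cow weights (nearest kg).
classes = [(299.5, 349.5, 3), (349.5, 399.5, 6), (399.5, 449.5, 10),
           (449.5, 499.5, 7), (499.5, 549.5, 5)]

def grouped_quantile(classes, position):
    """Linear interpolation for the value at a given cumulative position."""
    cumulative = 0
    for lower, upper, freq in classes:
        if cumulative + freq >= position:
            # Fraction of the way through this class, scaled by its width.
            return lower + (upper - lower) * (position - cumulative) / freq
        cumulative += freq
    raise ValueError("position exceeds total frequency")

n = sum(freq for _, _, freq in classes)    # total frequency, 31
q1 = grouped_quantile(classes, n / 4)      # 7.75th value
q3 = grouped_quantile(classes, 3 * n / 4)  # 23.25th value
print(round(q1, 1), round(q3, 1), round(q3 - q1, 1))
```

Rounded to 1 d.p., this reproduces Q1 ≈ 389.1, Q3 ≈ 479.9 and IQR ≈ 90.8.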
Calculation of the Interpercentile range
Percentiles
Quartiles divide the data into four parts, where each part
contains the same number of data values. Percentiles are
similar, but they divide the data into 100 parts.
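By analogy with the quartile positions $\frac{n}{4}$ and $\frac{3n}{4}$ used in this lecture, the percentile positions might be sketched as follows. The notation $P_k$ for the $k$-th percentile is introduced here for illustration, and the 10% to 90% range is shown as one example of an interpercentile range.

```latex
\text{position of } P_k = \frac{k}{100} \times n,
\qquad
\text{10\% to 90\% interpercentile range} = P_{90} - P_{10}
```

For example, with $n = 70$ values, $P_{90}$ is the $\frac{90}{100} \times 70 = 63$rd value and $P_{10}$ is the $\frac{10}{100} \times 70 = 7$th value. For grouped data, each is then located by interpolation within its class.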
Interpercentile range
Exercise
The heights, in cm, of 70 seventeen-year-old boys were
recorded. Calculate:
a) The 90th percentile. b) The 10th percentile.
c) The 10% to 90% interpercentile range.
3rd step:
Exercise
b. The 10th percentile (try yourself!)
1st step:
3rd step:
Determine if a value is an outlier
An OUTLIER is an extreme value that is a long way
from the majority of the readings in the data set.
Example
Exercise
Find variance and standard deviation of the following
data set:
2, 3, 4, 4, 6, 11, 12
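x       2   3   4    4    6    11    12
$x^2$   4   9   16   16   36   121   144
$\sum x = 2 + 3 + 4 + 4 + 6 + 11 + 12 = 42$, so the mean is $\frac{42}{7} = 6$
$\sum x^2 = 4 + 9 + 16 + 16 + 36 + 121 + 144 = 346$
Variance, $\sigma^2 = \frac{346}{7} - 6^2 \approx 13.4$
Standard deviation, $\sigma = \sqrt{13.43} \approx 3.7$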
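The calculation can be verified with a short sketch that applies the same formula (mean of the squares minus the square of the mean) and cross-checks it against the population variance from Python's standard `statistics` module.

```python
from statistics import pvariance, pstdev

data = [2, 3, 4, 4, 6, 11, 12]
n = len(data)

# Formula used in the exercise: variance = (sum of x^2) / n - mean^2
mean = sum(data) / n
variance = sum(x * x for x in data) / n - mean ** 2
std_dev = variance ** 0.5

print(round(variance, 2), round(std_dev, 1))
# Cross-check against the standard library's population variance.
print(round(pvariance(data), 2), round(pstdev(data), 1))
```

Both approaches give a variance of about 13.43 and a standard deviation of about 3.7.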