Edexcel GCSE (9 – 1)
Statistics
Mr M Dominguez
Chapter 4 Scatter Diagrams and Correlation
Lesson 1: 4.1 to 4.5 Print out Q1 worksheet
Lesson 2: 4.6 to 4.8
Lesson 3: 4.8 to 4.9
§ 4.1 Scatter Diagrams
The most important graphical summary of bivariate data is the scatter
diagram. This is simply a plot of the points (Xi, Yi) in the plane. The following
figures show scatter diagrams of June maximum temperatures against January
maximum temperatures, and of January maximum temperatures against
latitude.
A key feature in a scatter diagram is the correlation, or trend, between X and
Y. Higher January temperatures tend to be paired with higher June
temperatures, so these two values have a positive correlation. Higher
latitudes tend to be paired with lower January temperatures, so
these values have a negative correlation. If higher X values are paired with
low or with high Y values equally often, there is no correlation.
For a scatter diagram we plot the explanatory (independent)
variable on the x-axis and the response (dependent) variable on
the y-axis.
Sometimes it is hard to identify which variable is which.
The most obvious way to identify the variables is to look at which
comes first in the table of values.
Scatter diagrams are used to represent bivariate data.
A common misconception is that the data must be continuous.
You do not need to start your axes from 0. Graphs should
use a suitable scale.
1) The table below shows the shoe size and mass of 8 men.
(a) Plot a scatter graph for this data and draw a line of best fit.
(b) Why is a scatter diagram suitable for this data?
Size 5 12 7 10 9 11 6 8
Mass 65 97 68 78 79 88 74 80
[Figure: scatter grid of Mass (kg), 60 to 100, against Shoe Size, 4 to 13]
§ 4.2 Correlation
There are many different types of correlation, not just positive or negative.
Scatter graphs are used to show whether there is a relationship between two sets
of data. The relationship between the data can be described as either:
1. A positive correlation. As one quantity increases so does the other.
2. A negative correlation. As one quantity increases the other decreases.
3. No linear correlation. Both quantities vary with no clear relationship.
[Figures: three example scatter diagrams. Height against Shoe Size (positive correlation); Soup Sales against Temperature (negative correlation); Shoe Size against Annual Income (no correlation).]
A positive or negative correlation is characterised by a straight line with a
positive or negative gradient. The strength of the correlation depends on
the spread of points around the imagined line.
[Figures: six example scatter diagrams showing strong, moderate and weak positive correlation, and strong, moderate and weak negative correlation.]
Describing / interpreting correlation in context
Two types of questions:
Describe
• What correlation does the scatter diagram suggest?
• Describe the correlation between height and weight.
Interpret
• Interpret, in context, the type of correlation.
• What conclusions can you draw about the correlation between
height and weight?
Example: the scatter diagram shows the heights and weights of different students.
• Describe: (strong) positive correlation.
• Interpret, in context: as height increases, weight increases.
1) The table below shows the shoe size and mass of 8 men.
(a) Plot a scatter graph for this data and draw a line of best fit.
Size 5 12 7 10 9 11 6 8
Mass 65 97 68 78 79 88 74 80
[Figure: scatter grid of Mass (kg), 60 to 100, against Shoe Size, 4 to 13]
(c) What is the correlation between shoe size and mass?
Positive correlation.
(d) Describe / interpret the correlation in context.
As shoe size increases, mass increases. (Shoe size must come first. Why?)
§ 4.3 Causal Relationships
Do not draw causal implications from statements about associations, unless
your data come from a randomized experiment. Just because January and
June temperatures increase together does not mean that January
temperatures cause June temperatures to increase (or vice versa). The only
certain way to sort out causality is to move beyond statistical analysis and talk
about mechanisms.
In general, if X and Y have an association, then
(i) X could cause Y to change (a causal relationship)
(ii) Y could cause X to change (a causal relationship)
(iii) a third unmeasured (perhaps unknown) variable Z could
cause both X and Y to change.
Unless your data come from a randomized experiment, statistical analysis
alone is not capable of answering questions about causality.
Page 215 Q 1,2, and 6
For the association between January and June temperatures, we can try to
propose some simple mechanisms, one for each case above:
i. Warmer or cooler air masses in January persist in the atmosphere until
June, causing similar effects on the June temperature.
ii. None: it is impossible for one event to cause another event that
preceded it in time.
iii. If Z is latitude, then latitude influences temperature because it
determines the amount of atmosphere that solar energy must traverse to
reach a particular point on the Earth’s surface.
§ 4.4 Line of best fit
The line of best fit should:
• Pass through the mean point (mean of x, mean of y).
• Have roughly the same number of points above and below the line.
1) The table below shows the shoe size and mass of 10 men.
(e) Find the mean shoe size and the mean mass
Size 5 12 7 10 10 9 8 11 6 8
Mass 65 97 68 92 78 78 76 88 74 80
1) The table below shows the shoe size and mass of 8 men.
Size 5 12 7 10 9 11 6 8
Mass 65 97 68 78 79 88 74 80
(f) Draw a line of best fit.
The mean point should be plotted on your graph (a cross with a circle round it). The line of
best fit must always pass through this point: (mean of data 1, mean of data 2).
In this case: (8.5, 78.625)
[Figure: scatter grid of Mass (kg), 60 to 100, against Shoe Size, 4 to 13]
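The mean point quoted for the 8 men can be checked with a short calculation. This is a minimal sketch in Python; the names `shoe_size`, `mass_kg` and `mean_point` are illustrative and not part of the course material.

```python
from statistics import mean

# Data copied from the shoe size / mass table for 8 men.
shoe_size = [5, 12, 7, 10, 9, 11, 6, 8]
mass_kg = [65, 97, 68, 78, 79, 88, 74, 80]


def mean_point(xs, ys):
    """Return (mean of x, mean of y), the point a line of best fit should pass through."""
    return mean(xs), mean(ys)


print(mean_point(shoe_size, mass_kg))  # (8.5, 78.625)
```

The same function can be applied to any other pair of data sets of equal length.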
2) The table below shows the number of people who visited a museum over a 10 day
period last summer together with the daily sunshine totals.
(a) Plot a scatter graph for this data and draw a line of best fit.
Hours Sunshine 6 0.5 8.5 3 8 10 7 5 3 2
Visitors 300 475 100 390 200 50 175 220 350 320
(b) Draw a line of best fit and comment on the correlation.
If you have a calculator you can find the mean of each set of data and plot this point to help
you draw the line of best fit. Ideally all lines of best fit should pass through the co-ordinates
(mean of data 1, mean of data 2). In this case: (5.3, 258)
[Figure: scatter grid of Number of Visitors, 100 to 500, against Hours of Sunshine, 0 to 10]
§ 4.5 Interpolation and extrapolation
Using our line of best fit we can estimate the
value of one variable when given the other.
If the value we are estimating is within our range
of values we call it interpolation.
If the value we are estimating is outside our
range of values we call it extrapolation.
Interpolation estimates are generally more
accurate than extrapolation estimates.
Furthermore, the further you extrapolate, the more
inaccurate your estimate is likely to be.
For GCSE maths you may describe an estimate as inaccurate
because it is outside the collected range of values.
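The distinction can be expressed as a simple range check. The sketch below is illustrative only: `estimate_type` is a made-up helper name, and it assumes the "range of values" means the smallest to the largest observed value of the variable you are given.

```python
def estimate_type(value, observed_values):
    """Classify an estimate made at `value` as interpolation or extrapolation."""
    if min(observed_values) <= value <= max(observed_values):
        return "interpolation"
    return "extrapolation"


# Data copied from the shoe size / mass table for 8 men.
shoe_size = [5, 12, 7, 10, 9, 11, 6, 8]
mass_kg = [65, 97, 68, 78, 79, 88, 74, 80]

# Shoe size 10.5 lies between the smallest (5) and largest (12) sizes.
print(estimate_type(10.5, shoe_size))  # interpolation

# A mass of 62 kg is below the smallest recorded mass (65 kg).
print(estimate_type(62, mass_kg))  # extrapolation
```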
1) The table below shows the shoe size and mass of 8 men.
Size 5 12 7 10 9 11 6 8
Mass 65 97 68 78 79 88 74 80
(g) Use your line of best fit to estimate:
(i) The mass of a man with shoe size 10½.
87 kg
(ii) The shoe size of a man with a mass of 62 kg.
Size 4.2
(iii) Which estimation will be more accurate and why?
Part (ii) is less accurate as it is extrapolation, or part (i) is more accurate as it is interpolation.
[Figure: scatter grid of Mass (kg), 60 to 100, against Shoe Size, 4 to 13]
2) The table below shows the number of people who visited a museum over a 10 day
period last summer together with the daily sunshine totals.
Hours Sunshine 6 0.5 8.5 3 8 10 7 5 3 2
Visitors 300 475 100 390 200 50 175 220 350 320
Use your line of best fit to estimate:
(i) The number of visitors for 4 hours of sunshine.
310
(ii) The hours of sunshine when 250 people visit.
5½
[Figure: scatter grid of Number of Visitors, 100 to 500, against Hours of Sunshine, 0 to 10]
§ 4.6 The equation of a line of best fit
To find the equation of the line of best fit you must find
the gradient. You must also know a point on the line:
either the y-intercept or the mean point. You can then use one
of the two general equations for a straight line.
Using the line you can estimate values, but most
importantly you must be able to describe the
significance of m and c in the equation within the
context of the question.
It is not enough to describe m as the gradient and c as the y-intercept.
The descriptions must be in context.
Maths vs English Test Scores
How can we come up with an equation that could estimate a Maths score (y) from an English score (x)?
[Figure: scatter diagram of Maths Score, 0 to 100, against English Score, 0 to 100, with a line of best fit]
We can find the gradient by picking two random points on the line suitably far apart: (0, 39) and (80, 82).
Change in y is 43. Change in x is 80.
$m = \frac{\Delta y}{\Delta x} = \frac{43}{80} = 0.54$
The y-intercept seems to be about 39.
y = 0.54x + 39
Interpret the values of m and c in the equation of the line:
For every extra mark in English the maths mark increases by 0.54.
The maths mark is approximately 39 when a student scores 0 in the English test.
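The two-point method above can be written as a small calculation. This is a sketch only: `line_through` is an illustrative name, and the points (0, 39) and (80, 82) are the ones read from the line of best fit in the example.

```python
def line_through(p1, p2):
    """Return (m, c) for the straight line y = mx + c through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)  # gradient = change in y / change in x
    c = y1 - m * x1            # rearrange y1 = m * x1 + c to find c
    return m, c


m, c = line_through((0, 39), (80, 82))
print(round(m, 2), c)  # 0.54 39.0
```

Picking points that are far apart keeps small reading errors from having a large effect on the gradient.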
We can actually use our calculator to input data and find a line of best fit.
Distance from Kingston (x) 0.2km 2.5km 3.6km 0.8km
House Price (y) £560,000 £470,000 £365,000 £580,000
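A calculator's regression mode is usually based on the least squares method; that is an assumption here, since the slides do not say which method the calculator uses. The sketch below applies the standard least squares formulas to the table above, with illustrative variable names.

```python
from statistics import mean

# Data copied from the table: distance from Kingston (km) and house price (£).
distance_km = [0.2, 2.5, 3.6, 0.8]
price_gbp = [560_000, 470_000, 365_000, 580_000]


def least_squares_line(xs, ys):
    """Return (m, c) for the least squares line y = mx + c."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    c = y_bar - m * x_bar  # forces the line through the mean point
    return m, c


m, c = least_squares_line(distance_km, price_gbp)
print(f"y = {m:.0f}x + {c:.0f}")
```

The gradient comes out negative, which matches the pattern in the table: prices tend to fall as the distance from Kingston increases.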
1) The table below shows the shoe size and mass of 8 men.
Size 5 12 7 10 9 11 6 8
Mass 65 97 68 78 79 88 74 80
(h) Calculate the gradient of the line of best fit.
(i) Find the equation of the line of best fit.
This time we can’t find the y-intercept from the graph.
To find c, substitute in a known point, (8.5, 78.625).
[Figure: scatter grid of Mass (kg), 60 to 100, against Shoe Size, 4 to 13]
1) The table below shows the shoe size and mass of 8 men.
Size 5 12 7 10 9 11 6 8
Mass 65 97 68 78 79 88 74 80
Interpret the values of the gradient and the y-intercept.
Gradient: As shoe size increases by 1, the mass increases by 4.3 kg.
y-intercept: A man with a shoe size of 0 has an estimated mass of 42.2 kg.
(Why is this value not very accurate? Does this make any sense?)
[Figure: scatter grid of Mass (kg), 60 to 100, against Shoe Size, 4 to 13]
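When the y-intercept cannot be read from the graph, substituting the mean point into y = mx + c gives c. The sketch below assumes the gradient of 4.3 quoted in the interpretation; `intercept_through` is an illustrative name. With these rounded inputs the result is about 42.1, close to the 42.2 quoted on the slide, which presumably comes from a gradient read less coarsely from the graph.

```python
def intercept_through(m, point):
    """Return c so that the line y = mx + c passes through `point`."""
    x0, y0 = point
    return y0 - m * x0  # rearranged from y0 = m * x0 + c


# Assumed gradient 4.3 and the mean point (8.5, 78.625) of the 8 men.
c = intercept_through(4.3, (8.5, 78.625))
print(round(c, 1))  # 42.1
```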
[Figure: scatter diagram of Weekly time on internet (hours), 0 to 25, against Age, 0 to 90, with line of best fit y = -0.18x + 17]
If someone’s age is 50, how many hours would we therefore expect them to be on the internet?
(-0.18 × 50) + 17 = 8
[Figure: scatter diagram of Earnings, £0 to £80000, against Age, 0 to 90]
In general, we should be wary of making estimates using values outside the range of our data.
Estimating for an age above the range of the data is bad because: the person may have retired.
Estimating for an age below the range of the data is bad because: children don’t have full-time jobs.
When we use our line of best fit to estimate a value inside the range of our data, this is known as: interpolation
When we use our line of best fit to estimate a value outside the range of our data, this is known as: extrapolation
Key Question
The scatter diagram shows information about 10 apartments in a city.
The graph shows the distance from the city centre and the monthly rent
of each apartment.
a) Draw a line of best fit. (2)
b) Describe and interpret the correlation shown in the scatter diagram. (2)
Description: (Strong) negative correlation.
Interpretation: As the distance from the city centre increases, the monthly rent decreases.
The independent variable (x-axis) should always come first.
Key Question
c) Calculate the gradient of the line of best fit. (2)
Using the points (1.3, 400) and (3.8, 100) on the line:
$\text{gradient} = \frac{\Delta y}{\Delta x} = \frac{100 - 400}{3.8 - 1.3} = -120$
d) Write the equation of the line of best fit in the form y = ax + b. (2)
y = -120x + 560, where 560 is the y-intercept.
e) Interpret the values of a and b. (2)
a: As the distance from the city centre increases by 1 km, the monthly rent decreases by £120.
b: Apartments in the centre of the city have a monthly rent of £560. When the distance from the city centre is 0, the monthly rent of an apartment is £560.
The independent variable always comes first. Make sure to include units.
Key Question
f) An apartment which is 5 km from the city centre has a monthly rent of £100. Explain why using the line of best fit to predict the monthly rent may not be reliable. (1)
Extrapolation: 5 km is not in the data range.
g) Why is a scatter diagram a suitable diagram to represent this data? (1)
The data is bivariate.
The scatter diagram shows information for some weather stations. It shows the height of
each weather station above sea level (m) and the mean July midday temperature (°C) for
that weather station.
Find the equation of the line of best fit, given that the mean point is (14, 1450).
What does the gradient mean in the context of the question?
What does the y-intercept mean in the context of the question?
Is it sensible to extend the graph to x = 0?
Use your equation to predict the height of a station which records a temperature of 15 °C.
§ 4.7 Spearman’s rank correlation coefficient
Shows “agreement” as opposed to correlation.
Close to +1: more agreement between ranks.
Close to -1: more disagreement between ranks.
Close to zero: the ranks neither agree nor disagree.
Useful when comparing two different relationships or sets of data.
E.g. is there more agreement between height and weight, or between height and arm length?
§ 4.8 Calculating Spearman’s rank correlation coefficient
Does being good at maths make you better at biology?
Student Maths exam score Biology exam score
Anand 57 83
Bernard 45 37
Charlotte 72 41
Demi 78 86
Eustace 53 56
Ferdinand 63 85
Gemma 86 77
Hector 98 87
Ivor 59 70
Jasmine 71 59
Is there a statistically significant correlation between these two sets of results?
Step 1: Rank each set of data (lowest to highest)
Student     Maths score  Maths rank  Biology score  Biology rank
Anand       57           3           83             7
Bernard     45           1           37             1
Charlotte   72           7           41             2
Demi        78           8           86             9
Eustace     53           2           56             3
Ferdinand   63           5           85             8
Gemma       86           9           77             6
Hector      98           10          87             10
Ivor        59           4           70             5
Jasmine     71           6           59             4
Step 2: Work out the differences in ranks (maths – biology)
Student     Maths score  Maths rank  Biology score  Biology rank  d   d²
Anand       57           3           83             7             4   16
Bernard     45           1           37             1             0   0
Charlotte   72           7           41             2             5   25
Demi        78           8           86             9             1   1
Eustace     53           2           56             3             1   1
Ferdinand   63           5           85             8             3   9
Gemma       86           9           77             6             3   9
Hector      98           10          87             10            0   0
Ivor        59           4           70             5             1   1
Jasmine     71           6           59             4             2   4
∑d² = 66
Step 3: Work out the square of the differences
Step 4: Work out the sum of the square of the differences
Step 5: Work out the value of the coefficient, rs
n = 10
∑d² = 66
$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(66)}{10(10^2 - 1)} = 1 - \frac{6(66)}{10 \times 99} = 1 - 0.4 = 0.6$
Step 6: Comment on the value of rs
rs = 0.6 suggests moderate agreement (a moderate positive
correlation) between Maths and Biology scores.
As Maths scores increase, Biology scores also tend to increase.
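Steps 2 to 5 can be checked with a few lines of code once the ranks are known. This is a minimal sketch; `spearman_from_ranks` is an illustrative name, and the rank lists are copied from the Step 1 table in the order Anand to Jasmine.

```python
def spearman_from_ranks(ranks_a, ranks_b):
    """Spearman's rank correlation coefficient from two lists of ranks."""
    n = len(ranks_a)
    # Squaring each difference removes its sign, so the order of subtraction is irrelevant.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


maths_rank = [3, 1, 7, 8, 2, 5, 9, 10, 4, 6]
biology_rank = [7, 1, 2, 9, 3, 8, 6, 10, 5, 4]

print(round(spearman_from_ranks(maths_rank, biology_rank), 3))  # 0.6
```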
Step 1: Rank each set of data
Step 2: Work out the differences in ranks (Why doesn’t it matter
what order we subtract in?)
Step 3: Work out the square of the differences
Step 4: Work out the sum of the square of the differences
Step 5: Work out the value of the coefficient, rs
Step 6: comment on the value of rs
1) The table below shows the shoe size and mass of 8 men.
Size     5  12 7  10 9  11 6  8
Mass     65 97 68 78 79 88 74 80
Rank(S)  1  8  3  6  5  7  2  4
Rank(M)  1  8  2  4  5  7  3  6
d        0  0  1  2  0  0  1  2
d²       0  0  1  4  0  0  1  4
Calculate Spearman’s rank correlation coefficient (3 d.p.) and
interpret your answer.
∑d² = 10
$r_s = 1 - \frac{6(10)}{8(8^2 - 1)} = 1 - \frac{60}{504} = 0.881$
The shoe size and mass are in (strong) agreement / positive correlation.
Calculate Spearman's rank correlation
coefficient and interpret your answer.
         Mock % (a)  GCSE % (b)  Rank-a  Rank-b  d  d²
Adnan    78          85          6       5       1  1
Ben      83          93          7       8       1  1
Carl     54          76          2       2       0  0
Dan      77          86          4       6       2  4
Edward   45          78          1       3       2  4
Fred     95          97          9       9       0  0
George   89          91          8       7       1  1
Harry    77          75          4       1       3  9
Ivor     77          84          4       4       0  0
Total                                               20
$r_s = 1 - \frac{6(20)}{9(9^2 - 1)} = 1 - \frac{120}{720} = 0.833$
Strong agreement between mock and GCSE % (positive correlation).
The higher a student's mark in the Mock, the higher their mark in the GCSE.
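Three students scored 77 in the mock, so they share the mean of the positions they occupy (3, 4 and 5), giving each a rank of 4. The sketch below shows one way to assign such tied ranks and then reuse the same formula; `average_ranks` is an illustrative helper, and applying the simple formula when ranks are tied is an approximation.

```python
def average_ranks(values):
    """Rank lowest to highest; tied values share the mean of the positions they occupy."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1           # first position taken by v (1-based)
        count = ordered.count(v)               # number of values tied with v
        ranks.append(first + (count - 1) / 2)  # mean of positions first..first+count-1
    return ranks


# Data copied from the table, in the order Adnan to Ivor.
mock = [78, 83, 54, 77, 45, 95, 89, 77, 77]
gcse = [85, 93, 76, 86, 78, 97, 91, 75, 84]

rank_a = average_ranks(mock)
rank_b = average_ranks(gcse)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
n = len(mock)
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(rank_a)         # [6.0, 7.0, 2.0, 4.0, 1.0, 9.0, 8.0, 4.0, 4.0]
print(sum_d2)         # 20.0
print(round(rs, 3))   # 0.833
```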
§ 4.9 PMCC
If both variables X and Y are random samples from normal distributions (the data is
symmetrical about the mean and the sample is chosen using a random sampling
method), then the Product Moment Correlation Coefficient (PMCC) can be calculated to
give an estimate of the correlation.
Why use PMCC?
•Gives a value between -1 and 1.
•The closer the PMCC is to -1 or 1, the stronger the correlation.
•A negative value implies negative correlation, and a positive value implies positive correlation.
•A value close to 0 does not imply no correlation. It only shows there is no linear correlation.
•(You do not need to know how to calculate it.)
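The point about values close to 0 can be illustrated with made-up data. The calculation is not required for the course; the sketch below uses the standard PMCC formula on an invented data set where y is exactly x squared, so the relationship is perfect but not linear.

```python
from math import sqrt
from statistics import mean


def pmcc(xs, ys):
    """Product moment correlation coefficient of two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented data: a perfect quadratic relationship, symmetric about x = 0.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

print(pmcc(xs, ys))  # 0.0
```

Every y value is completely determined by x here, yet the PMCC is 0 because no straight line describes the pattern.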
However, if the variables X and Y are not random samples from a normal distribution, we
cannot use PMCC.
For example, IRG attainment based on a class test would be normally distributed, but
teachers' opinions on effort would not be.
For each of the questions below identify the most appropriate value for
Spearman's rank correlation coefficient and Pearson's product moment correlation
coefficient, from the list. Then explain your reasoning.
-0.95 -0.60 0.60 0.95
[Figures: three scatter diagrams, one for each answer below]
Spearman's = 0.95, Pearson's = 0.95. The model is linear, hence Spearman's and Pearson's would both give strong positive correlation.
Spearman's = -0.95, Pearson's = -0.6. Pearson's will be closer to 0 as the relationship is non-linear.
Spearman's = 0, Pearson's = 0. There is no agreement or linear correlation, hence both values will be close to 0.