Covariance and Correlation
Interpretation of regression
Interpreting correlation
●● How does the way data is represented affect our ability to make
predictions?
You can use a GDC, graphing software or a spreadsheet to plot scatter
diagrams, add a line of best fit and produce an equation for the line that
models the relationship between the variables.
Technology uses linear regression to find the line of best fit. This method
finds the vertical displacements of the points from the line and then uses
a process called ‘least squares’ to find a line that will predict y for a given
value of the independent variable x.
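[Figure: scatter diagram of six points, P1 to P6, plotted on x- and y-axes with a line of best fit and a square drawn on each point's vertical distance to the line.]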
Each square in the diagram is the square of the vertical distance from a point
to the line. The least squares method finds the line that gives the minimum
value for the sum of these squares. With some GDCs or graphing software
you can try to find the line of best fit yourself by plotting a line through the
mean point (x̄, ȳ), showing the squares and minimizing the sum of their areas.
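The least squares idea can be sketched in a few lines of Python. This is a minimal illustration only, not a description of what any particular GDC does internally. The six points are invented for the example (they are not read from the diagram), and the function names `sum_of_squares` and `least_squares_line` are hypothetical.

```python
# Hypothetical data: six (x, y) points invented for illustration.
points = [(0.5, -0.2), (1.0, 0.4), (1.5, 0.3), (2.0, 1.1), (2.5, 1.0), (3.0, 1.8)]


def sum_of_squares(points, m, c):
    """Sum of the squared vertical distances from each point to y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in points)


def least_squares_line(points):
    """Return (m, c) for the line that minimizes sum_of_squares."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    s_xx = sum((x - x_mean) ** 2 for x, _ in points)
    m = s_xy / s_xx
    c = y_mean - m * x_mean  # this choice makes the line pass through the mean point
    return m, c


m, c = least_squares_line(points)
best = sum_of_squares(points, m, c)

# Nudging the gradient or the intercept makes the total area of the squares larger.
for dm, dc in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]:
    print(dm, dc, sum_of_squares(points, m + dm, c + dc) > best)
```

Each printed comparison should be `True`: any line other than the least squares line gives a larger sum of squares, which is what you see when you adjust the line by hand on a GDC.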
REASONING WITH DATA
Example 1
The table gives the weight and height of 10 students.
a Use technology to draw a scatter diagram and find the line of best fit.
b Comment on the form, direction and strength of the correlation.
c Use the graph to estimate the weight of a 162 cm tall student.
Name     Height (cm)   Weight (kg)
Alex     168           77
Bea      149           45
Chip     173           65
Dave     166           59
Erin     185           79
Frida    159           87
George   195           94
Harry    144           53
Inge     176           65
Jepe     180           87
a [GDC screen: scatter diagram with Weight on the vertical axis and the line of best fit drawn through the points.]
y = 0.7445x − 55.09
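As a check on the technology output, the line of best fit for the Example 1 table can be recomputed directly. The sketch below is one possible way to do this in Python, not the method the GDC uses. The variable names and the helper `predict_weight` are hypothetical, and the correlation coefficient `r` is not mentioned in the example; it is included only as one common way to put a number on the strength of a linear correlation.

```python
# Heights (cm) and weights (kg) of the 10 students in Example 1.
heights = [168, 149, 173, 166, 185, 159, 195, 144, 176, 180]
weights = [77, 45, 65, 59, 79, 87, 94, 53, 65, 87]

n = len(heights)
x_mean = sum(heights) / n
y_mean = sum(weights) / n

s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(heights, weights))
s_xx = sum((x - x_mean) ** 2 for x in heights)
s_yy = sum((y - y_mean) ** 2 for y in weights)

gradient = s_xy / s_xx
intercept = y_mean - gradient * x_mean
r = s_xy / (s_xx * s_yy) ** 0.5  # assumption: r used as a measure of strength


def predict_weight(height_cm):
    """Estimate weight (kg) from height (cm) using the fitted line."""
    return gradient * height_cm + intercept


print(f"y = {gradient:.4f}x + ({intercept:.2f})")
print(f"r = {r:.2f}")
print(f"Estimated weight at 162 cm: {predict_weight(162):.1f} kg")
```

The gradient and intercept should agree with the equation shown on the GDC screen. Because 162 cm lies inside the range of heights in the table (144 cm to 195 cm), the estimate for part c is an interpolation.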
Practice 1
1 The table gives the height and weight of 14 adult mountain zebras.
Height, x (m) 1.37 1.17 1.2 1.34 1.42 1.42 1.37 1.48 1.51 1.23 1.57 1.29 1.30 1.44
Weight, y (kg) 257 171 185 214 315 271 242 329 314 185 356 228 230 285
2 There are twelve 15-year-old girls in an athletics club. The table shows their
leg length and time to run 1000 m.
Leg length, x (cm) 82 91 85 93 86 95 70 87 90 75 91 80
Time to run, y (s) 245 210 243 220 240 225 250 243 230 248 220 240
Size, x (cm) 33 36 41 48 52 55 62 66
Cost, y ($) 230 350 400 550 550 600 620 700
Making predictions
●● Does correlation indicate causation?
●● Do I want to be like everybody else?
An outlier can arise in a variety of ways.
It could be due to human error, categorized by:
●● Measurement error (wrong units)
●● Clerical error (writing the results incorrectly or misunderstanding the
question)
●● Instrumental error (faulty instruments)
However, an outlier could also be a legitimate data point that was measured and recorded correctly.
In chapter 6 you used the IQR to determine the outliers; now it is your
decision whether or not to include the outliers in the data set.
If you think the outlier is due to human error, remove the data point; if you
think it is a legitimate point, it must be kept in.
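A minimal sketch of flagging outliers with the IQR is given below. It rests on several assumptions: the rule used is the common convention that a value is an outlier if it lies more than 1.5 × IQR below the lower quartile or above the upper quartile; the quartile method (`method="inclusive"`) may differ from the one used in chapter 6 or on your GDC; and the readings and the function name `iqr_outliers` are invented for illustration.

```python
import statistics


def iqr_outliers(values):
    """Return the values lying more than 1.5 x IQR beyond the quartiles."""
    # Quartile conventions differ between textbooks and calculators;
    # method="inclusive" is an assumption, not necessarily the chapter 6 method.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements, invented for illustration.
readings = [10, 12, 13, 14, 15, 16, 40]
print(iqr_outliers(readings))  # prints [40]
```

The code only identifies candidate outliers. Whether to remove a flagged point is still your decision, based on whether you think it is an error or a legitimate value.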
Exploration 1
The Olympic Games are held every four years. The table shows the gold
medal distance for the men’s long jump event over a period of 28 years.
Exploration 2
The table shows the number of visitors to the Bahamas for a seven-year
period and the revenue from registered nail salons in the US during the
same time period.
Practice 2
1 The table gives the average price of a liter of milk in the UK each January
for six consecutive years.
Month      Pence per liter
Jan 2009   23.73
Jan 2010   24.67
Jan 2011   27.36
Jan 2012   28.08
Jan 2013   31.64
Jan 2014   31.52
a Use technology to draw a scatter diagram and line of best fit.
Put the independent variable (time in months) on the x-axis, with
January 2009 at zero. Then January 2010 is 12 months, January 2011 is
24 months, etc.
b Use the line of best fit to estimate the average price of milk in:
i July 2009
ii July 2012
iii January 2015
c In January 2015, the average price of milk was 24.46 pence per liter.
Compare this to your prediction. Explain any differences.
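Part c turns on the difference between interpolating inside the range of the data and extrapolating beyond it. The sketch below shows one way to set up the comparison in Python, assuming time is measured in months from January 2009 as the question suggests. The helper name `predict_price` is hypothetical, and the code is an illustration rather than a model answer.

```python
# Months since January 2009 and average price (pence per liter), from the table.
months = [0, 12, 24, 36, 48, 60]
prices = [23.73, 24.67, 27.36, 28.08, 31.64, 31.52]

n = len(months)
x_mean = sum(months) / n
y_mean = sum(prices) / n
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(months, prices))
s_xx = sum((x - x_mean) ** 2 for x in months)

gradient = s_xy / s_xx
intercept = y_mean - gradient * x_mean


def predict_price(month):
    """Price (pence per liter) predicted by the line of best fit."""
    return gradient * month + intercept


jan_2015 = 72  # outside the data range of 0 to 60 months, so an extrapolation
predicted = predict_price(jan_2015)
actual = 24.46  # value stated in part c

print(f"Predicted for January 2015: {predicted:.2f} pence")
print(f"Actual for January 2015: {actual:.2f} pence")
print(f"Difference: {predicted - actual:.2f} pence")
```

The line continues the upward trend of the six data points, so the prediction for month 72 is well above the actual price. The model has no way of knowing about anything that happened after the last data point.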
2 Discuss what the correlation analyses below show. Can you draw any
logical conclusions?
a Over a period of 10 years, the correlation between the annual number of
human births and the population of labradoodles in Denmark was strong
and positive.
b The correlation between the amount of time a person spent looking at a
computer screen each day and the quality of sleep that they reported that
night was medium and negative.
c Between 2005 and 2015 there was a strong, positive linear
correlation between the number of TV licenses bought in the UK and
the number of young offenders sent to prison.
3 Do you think that tall people have longer arms than short people?
The table here lists height and arm span for 10 students.
Arm span (cm) 155 156 160 165 170 177 177 188 196 200
Height (cm) 182 162 180 166 167 173 176 182 184 186
a Use technology to draw a scatter diagram with arm span on the x-axis
and height on the y-axis. Find the line of best fit.
b Use the line of best fit to predict (interpolate) the height of a student
whose arm span is 171 cm.
c Are there any outliers? If so, remove them from the list and find the new
line of best fit. Use this new line to predict the height of a student whose
arm span is 171 cm.
d Discuss the differences between b and c.
Problem solving
4 Students weigh sets of drawing pins and record the number of pins and their
total weight in grams. Here is their data:
Number of pins 7 13 17 20 24 24 28 29 31 33
Weight (g) 5 7 11 10 14 15 16 9 11 20
The students work in two groups to calculate the mean weight of a drawing pin.
a Group 1 finds the total weight of the drawing pins and divides by the
number of drawing pins. Use this method to calculate the mean weight.
b Group 2 draws a scatter diagram. Use technology to draw a scatter
diagram with number of pins on the x-axis and weight on the y-axis.
Find the line of best fit.
c Group 2 decide that two of the weights are unreliable. Remove the
unreliable data and find a new line of best fit. Use it to find the mean
weight of a drawing pin.
d Determine which group’s method gives the more reliable estimate of the
mean weight of a drawing pin.