Covariance and Correlation


E9.R1 Regress or progress?

Interpretation of regression

Interpreting correlation
● How does the way data is represented affect our ability to make
predictions?
ATL
You can use a GDC, graphing software or a spreadsheet to plot scatter
diagrams, add a line of best fit and produce an equation for the line that
models the relationship between the variables.
Technology uses linear regression to find the line of best fit. This method
finds the vertical displacements of the points from the line and then uses
a process called ‘least squares’ to find a line that will predict y for a given
value of the independent variable x.

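[Figure: points P1 to P6 plotted on x-y axes with a line of best fit; a square is drawn on the vertical distance from each point to the line.]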

Each square in the diagram is the square of the vertical distance from a point
to the line. The least squares method finds the line that gives the minimum
value for the sum of these squares. With some GDCs or graphing software
you can try to find the line of best fit yourself by plotting a line through the mean point (x̄, ȳ),
showing the squares and minimizing the sum of their areas.
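
The least squares idea can be checked numerically. The sketch below is a minimal illustration and not the routine a GDC actually runs: it uses the standard closed-form formulas for the slope and intercept, which the text does not state. The four points are made up for this illustration, and `least_squares_line` and `sum_of_squares` are hypothetical helper names.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x  # forces the line through the mean point
    return slope, intercept


def sum_of_squares(points, slope, intercept):
    """Total area of the squares drawn on the vertical distances to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


# Hypothetical data, invented for this illustration only.
points = [(1, 1), (2, 3), (3, 2), (4, 5)]

slope, intercept = least_squares_line(points)
best = sum_of_squares(points, slope, intercept)

# Tilting or shifting the line slightly makes the total area larger.
tilted = sum_of_squares(points, slope + 0.2, intercept)
shifted = sum_of_squares(points, slope, intercept + 0.5)

print(f"slope = {slope:.2f}, sum of squares = {best:.2f}")
print(f"tilted: {tilted:.2f}, shifted: {shifted:.2f}")
```

Any other line through these points gives a larger sum of squares than the least squares line, which is what "minimizing the sum of their areas" means.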

REASONING WITH DATA

Example 1
The table gives the weight and height of 10 students.
a Use technology to draw a scatter diagram and find the line of best fit.
b Comment on the form, direction and strength of the correlation.
c Use the graph to estimate the weight of a 162 cm tall student.

Name     Height (cm)   Weight (kg)
Alex     168           77
Bea      149           45
Chip     173           65
Dave     166           59
Erin     185           79
Frida    159           87
George   195           94
Harry    144           53
Inge     176           65
Jepe     180           87

a Use your GDC to create a line of best fit from the data.
[Figure: GDC screen showing a scatter diagram of Weight against Height with the line of best fit.]
Line of best fit: y = 0.7445x − 55.09
b There is moderate positive linear correlation between the heights and weights.
c Using the graph, a height of 162 cm would correspond to a weight of
approximately 65 kg.
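
As a cross-check on Example 1, the same line can be reproduced without a GDC. This is a sketch using the standard least squares formulas on the table above. The correlation coefficient `r` is an addition that the text does not use; it is included only to put a number on "moderate positive".

```python
# Heights (cm) and weights (kg) from the table in Example 1.
heights = [168, 149, 173, 166, 185, 159, 195, 144, 176, 180]
weights = [77, 45, 65, 59, 79, 87, 94, 53, 65, 87]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

s_hw = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
s_hh = sum((h - mean_h) ** 2 for h in heights)
s_ww = sum((w - mean_w) ** 2 for w in weights)

slope = s_hw / s_hh
intercept = mean_w - slope * mean_h
r = s_hw / (s_hh * s_ww) ** 0.5  # Pearson correlation coefficient (assumed measure)

predicted = slope * 162 + intercept  # part c: a student 162 cm tall

print(f"line of best fit: y = {slope:.4f}x + ({intercept:.2f})")
print(f"r = {r:.2f}")
print(f"predicted weight at 162 cm: {predicted:.1f} kg")
```

The slope and intercept agree with the GDC equation in part a, and the prediction is close to the value read from the graph in part c.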

Practice 1
ATL

1 The table gives the height and weight of 14 adult mountain zebras.

Height, x (m)    1.37  1.17  1.20  1.34  1.42  1.42  1.37  1.48  1.51  1.23  1.57  1.29  1.30  1.44
Weight, y (kg)   257   171   185   214   315   271   242   329   314   185   356   228   230   285

a Use technology to draw a scatter diagram.


b Find the equation of the line of best fit.
c Comment on the correlation between the height and weight of the zebras.
d Use the line of best fit to predict the weight of an adult mountain zebra
with height 1.38 m.

2 There are twelve 15-year-old girls in an athletics club. The table shows their
leg length and time to run 1000 m.

Leg length, x (cm)    82   91   85   93   86   95   70   87   90   75   91   80
Time to run, y (s)    245  210  243  220  240  225  250  243  230  248  220  240

a Using technology, draw a scatter diagram.


b Find the equation of the line of best fit.
c Comment on the correlation between leg length and the time taken to
run 1000 m.
d Use the line of best fit to predict the time to run 1000 m for a student
with leg measurement 84 cm.
3 The table shows the diameter of the cylinder in a washing machine and the
cost of the machine when purchased new at the same shop.

Size, x (cm) 33 36 41 48 52 55 62 66
Cost, y ($) 230 350 400 550 550 600 620 700

a Plot the points on a scatter diagram.


b Find the equation of the line of best fit.
c Use your line of best fit to estimate the cost of a washing machine with
a cylinder diameter of 50 cm.

Making predictions
● Does correlation indicate causation?
● Do I want to be like everybody else?
An outlier can arise in a variety of ways.
It could be due to human error, categorized as:
● Measurement error (wrong units)
● Clerical error (writing the results incorrectly or misunderstanding the
question)
● Instrumental error (faulty instruments)
However, an outlier could also be a legitimate data point.
In chapter 6 you used the IQR to determine the outliers; now it is your
decision whether or not to include the outliers in the data set.
If you think the outlier is due to human error, remove the data point; if you
think it is a legitimate point, it must be kept in.

An outlier is an anomaly in the data. To identify outliers in bivariate data,
look at the distance of the data point from the line of best fit.
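
The "distance from the line of best fit" can be computed as a residual: the actual y value minus the value the line predicts. The sketch below applies this to the long jump data from Exploration 1, which follows. The cut-off of 2 standard deviations is an assumed rule of thumb, not something the text prescribes; deciding whether a flagged point really is an outlier is still a judgment call.

```python
# Gold medal distances from the table in Exploration 1.
years = [1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984]
jumps = [7.83, 8.12, 8.07, 8.90, 8.25, 8.35, 8.54, 8.54]

xs = [year - 1956 for year in years]  # years after 1956: 0, 4, 8, ...
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(jumps) / n

slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, jumps))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Vertical distance of each point from the line (positive = above the line).
residuals = [y - (slope * x + intercept) for x, y in zip(xs, jumps)]

# Assumed rule of thumb, not from the text: flag a point if it lies more than
# 2 standard deviations of the residuals away from the line.
spread = (sum(res ** 2 for res in residuals) / n) ** 0.5
flagged = [year for year, res in zip(years, residuals) if abs(res) > 2 * spread]

for year, res in zip(years, residuals):
    print(year, f"{res:+.2f} m")
print("possible outliers:", flagged)
```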


Exploration 1
The Olympic Games are held every four years. The table shows the gold
medal distance for the men’s long jump event over a period of 28 years.

Year            1956  1960  1964  1968  1972  1976  1980  1984
Long jump (m)   7.83  8.12  8.07  8.90  8.25  8.35  8.54  8.54

1 Use a GDC or graphing software to draw a scatter diagram and find the
line of best fit for the data. Use ‘number of years after 1956’ on the
x-axis (0, 4, 8, …).
2 Describe the relationship that the graph shows.
3 One jump stands out from the others. Is it much further from the line of
best fit than the other points? Is it an outlier?
4 How does this outlier affect the line of best fit and any predictions that
are based on it?
5 Find the line of best fit without the anomalous jump. Use this line to
estimate the gold medal long jump distance for that year.

Discover more about this jump by researching the Olympic long jump gold
medal for this year, in particular search for a video called Bob Beamon’s
Long Olympic Shadow by The New York Times.

A scatter diagram may show a relationship between two variables
(correlation), but it does not show that a change in one variable causes a
change in the other (causation). For example, the number of pirates and
global average temperature measured over a period of years have a negative
correlation. But it is unlikely that the fall in the number of pirates has caused
higher temperatures. Demonstrating cause and effect is much harder than
showing a strong linear correlation.

Reflect and discuss 1
● In Exploration 1, you used the line of best fit to predict a value
within the range of data you were given. Was it appropriate to
do this?
● The modern Olympics began in 1896. Could you use your line of
best fit to predict the winning jump in this year? What about the
distance jumped in the original Greek Olympic Games 1500 years
earlier, or Olympic Games 100 years in the future?

You cannot know that the relationship you can see between two data sets
will still hold for higher or lower data values than those given.

Predicting values inside the range of the given data is called
interpolation. Predicting values outside the range of the given data
is called extrapolation.
In general, interpolation is more likely to give an accurate prediction than
extrapolation.
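
The distinction can be made concrete with the line of best fit from Example 1. The sketch below takes that equation as given and labels each prediction by checking whether the input height lies inside the range of the heights in the table. `predict_weight` is a hypothetical helper name, and the test heights of 210 cm and 60 cm are chosen only for illustration.

```python
# Heights (cm) from Example 1 and the line of best fit given there.
heights = [168, 149, 173, 166, 185, 159, 195, 144, 176, 180]
SLOPE, INTERCEPT = 0.7445, -55.09  # y = 0.7445x - 55.09


def predict_weight(height_cm):
    """Predict weight (kg) and say which kind of prediction it is."""
    weight = SLOPE * height_cm + INTERCEPT
    if min(heights) <= height_cm <= max(heights):
        kind = "interpolation"
    else:
        kind = "extrapolation"
    return weight, kind


for h in (162, 210, 60):
    weight, kind = predict_weight(h)
    print(f"{h} cm -> {weight:.1f} kg ({kind})")
```

For a height of 60 cm the line gives a negative weight, which shows how far a prediction can drift once it leaves the range of the data.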

Exploration 2
The table shows the number of visitors to the Bahamas for a seven-year
period and the revenue from registered nail salons in the US during the
same time period.

Year                                                 2008  2009  2010  2011  2012  2013  2014
Visitors to the Bahamas (millions)                   4.6   4.6   5.3   5.6   5.9   6.2   6.3
Revenue from nail salons (billions of US dollars)    6.3   6.0   6.2   7.3   7.5   8.2   8.5

1 Draw a scatter diagram to illustrate these results, with visitors to the
Bahamas on the x-axis and revenue from nail salons on the y-axis.
2 Describe the correlation between the two variables.
3 Draw a line of best fit and use it to predict the revenue from nail salons
when there are 5 million visitors to the Bahamas.
4 Does it make sense to make predictions from the line of best fit?

Exploration 2 shows that a correlation can appear between variables that
seem to have nothing to do with each other. There is clearly an association
between the number of visitors to the Bahamas and the revenue of nail salons,
but an increase in one does not cause an increase in the other. Both depend
on another factor, time.
You cannot use correlation alone to show that a change in one variable causes
a change in the other.
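
The role of time as the hidden factor can be seen by measuring how strongly each variable is related to the year. The sketch below uses the Pearson correlation coefficient, a measure the text does not introduce and which is assumed here for illustration; `correlation` is a hypothetical helper name.

```python
def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5


# Data from the table in Exploration 2.
years = [2008, 2009, 2010, 2011, 2012, 2013, 2014]
visitors = [4.6, 4.6, 5.3, 5.6, 5.9, 6.2, 6.3]  # millions
revenue = [6.3, 6.0, 6.2, 7.3, 7.5, 8.2, 8.5]   # billions of US dollars

r_visitors_revenue = correlation(visitors, revenue)
r_year_visitors = correlation(years, visitors)
r_year_revenue = correlation(years, revenue)

print(f"visitors vs revenue: r = {r_visitors_revenue:.2f}")
print(f"year vs visitors:    r = {r_year_visitors:.2f}")
print(f"year vs revenue:     r = {r_year_revenue:.2f}")
```

All three values are strongly positive: both quantities rose over the same years, and that alone is enough to produce the correlation between them.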

Practice 2
ATL

1 The average prices of a liter of milk in January in the UK tabulated for six
consecutive years are given here.

Month      Pence per liter
Jan 2009   23.73
Jan 2010   24.67
Jan 2011   27.36
Jan 2012   28.08
Jan 2013   31.64
Jan 2014   31.52

a Use technology to draw a scatter diagram and line of best fit.
Put the independent variable (time in months) on the x-axis, with January 2009
at zero. Then January 2010 is 12 months, January 2011 is
24 months, etc.
b Use the line of best fit to estimate the average price of milk in:
i July 2009
ii July 2012
iii January 2015
c In January 2015, the average price of milk was 24.46 pence per liter.
Compare this to your prediction. Explain any differences.


2 Discuss what the correlation analyses below show. Can you draw any
logical conclusions?
a Over a period of 10 years, the correlation between the annual number of
human births and the population of labradoodles in Denmark was strong
and positive.
b The correlation between the amount of time a person spent looking at a
computer screen each day and the quality of sleep that they reported that
night was medium and negative.
c Between the years 2005–2015 there was a strong, positive linear
correlation between the number of TV licenses bought in the UK and
the number of young offenders sent to prison.
3 Do you think that tall people have longer arms than short people?
The table here lists height and arm span for 10 students.

Arm span (cm) 155 156 160 165 170 177 177 188 196 200
Height (cm) 182 162 180 166 167 173 176 182 184 186

a Use technology to draw a scatter diagram with arm span on the x-axis
and height on the y-axis. Find the line of best fit.
b Use the line of best fit to predict (interpolate) the height of a student
whose arm span is 171 cm.
c Are there any outliers? If so, remove them from the list and find the new
line of best fit. Use this new line to predict the height of a student whose
arm span is 171 cm.
d Discuss the differences between b and c.

Problem solving
4 Students weigh sets of drawing pins and record the number of pins and their
total weight in grams. Here is their data:

Number of pins 7 13 17 20 24 24 28 29 31 33
Weight (g) 5 7 11 10 14 15 16 9 11 20

The students work in two groups to calculate the mean weight of a drawing pin.
a Group 1 finds the total weight of the drawing pins and divides by the
number of drawing pins. Use this method to calculate the mean weight.
b Group 2 draws a scatter diagram. Use technology to draw a scatter
diagram with number of pins on the x-axis and weight on the y-axis.
Find the line of best fit.
c Group 2 decide that two of the weights are unreliable. Remove the
unreliable data and find a new line of best fit. Use it to find the mean
weight of a drawing pin.
d Determine which group’s method gives the most reliable estimate of the
mean weight of a drawing pin.
