1 Correlation 529
p-value
Using techniques developed in Section 7.2, the best that we can say is that the
p-value is between 0.2 and 0.5. A calculator or computer would tell us that the
actual p-value of this test is 0.2115. ■
Microsoft Excel can help us to create a scatterplot for a set of paired data and also cal-
culate the correlation coefficient. We will rework the example involving the giant
sequoia trees, using Excel to construct a scatterplot and calculate the correlation coef-
ficient r.
EXAMPLE 10.7 Here are the heights and circumferences, both in feet, of 12 giant
sequoia trees. Construct a scatterplot for these two variables,
treating height as x. Also, calculate the correlation coefficient r.
Tree  Height (ft)  Circumference (ft)
1 274.9 102.6
2 246.1 101.1
3 267.4 107.6
4 240.9 93.0
5 255.8 98.3
6 243.0 109.0
7 257.5 85.3
8 268.8 113.0
9 223.8 94.8
10 270.3 104.2
11 247.8 91.3
12 254.7 88.3
In a new Excel worksheet, type the heights in column A, from cell A1 down to A12.
In the next column, B, type the circumferences in cells B1 through B12. When cre-
ating a scatterplot, it is crucial that you type the values of x to the left of the values
of y. Also be sure that the data are paired after you type them in; in other words,
each x value should be next to its corresponding y value.
Highlight the data by clicking on cell A1 and dragging the mouse to cell B12 before
releasing the button. Now, from the Insert menu select Chart. This starts the
Chart Wizard. In Step 1, select XY (Scatter) under Chart Type. Click on Next to
advance to Step 2. There is nothing that we need to do in Step 2, so click on Next
to advance to Step 3.
In Step 3, we can change the appearance of our scatterplot. Click on the Titles
tab to add a title to our graph, as well as labeling the x-axis (Height) and y-axis
(Circumference). If you click on the Legend tab you can get rid of the box to the
right of the graph. Click on the box labeled Show Legend so the check disap-
pears. Click on Next to advance to Step 4. In Step 4 you can decide whether you
want your scatterplot to appear on the same worksheet or have its own page.
After you make this decision, click on Finish. Here is an example of what you
should see.
530 CHAPTER 10 Linear Correlation and Regression
[Figure: scatterplot titled "Sequoia Trees", with Height on the x-axis (0 to 300) and Circumference on the y-axis (0 to 120).]
If you are unhappy with all of the points being in the upper right-hand corner, you
can fix that by adjusting the x- and/or y-axis. Note that all of the heights are
between 200 and 300 feet. Right-click with your mouse on the values on the x-axis,
and select Format Axis. Click on the Scale tab, and change the minimum value
to 200 instead of 0. You can adjust the maximum value also if you wish to. Click OK
for the changes to take effect. Here is an example of what you should see after
changing the scale.
[Figure: scatterplot titled "Sequoia Trees", with Height on the x-axis (200 to 300) and Circumference on the y-axis (0 to 120).]
Note that all of the points are at the top of the graph, with circumferences between
80 feet and 120 feet. We can change the scale on the y-axis in the same fashion.
Here is what you should see after changing the minimum value to 80 instead of 0.
[Figure: scatterplot titled "Sequoia Trees", with Height on the x-axis (200 to 300) and Circumference on the y-axis (80 to 120).]
The strength of the correlation seems to change as we change the scale. The scatterplot
helps us to get an idea of what type of correlation we have, but we must
calculate the correlation coefficient r to be sure. Excel has a built-in function, CORREL, to
calculate the correlation coefficient. Type the following formula into any empty cell.
=CORREL(A1:A12,B1:B12)
The result that Excel gives us is approximately 0.389. ■
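As a cross-check on the spreadsheet result, here is a minimal Python sketch of the same calculation. The function name pearson_r and the variable names are illustrative choices, not part of the text; the data are the twelve sequoia measurements from the example.

```python
import math

# Heights (x) and circumferences (y), in feet, of the 12 giant sequoias.
heights = [274.9, 246.1, 267.4, 240.9, 255.8, 243.0,
           257.5, 268.8, 223.8, 270.3, 247.8, 254.7]
circumferences = [102.6, 101.1, 107.6, 93.0, 98.3, 109.0,
                  85.3, 113.0, 94.8, 104.2, 91.3, 88.3]


def pearson_r(xs, ys):
    """Return the sample correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(heights, circumferences)
print(round(r, 3))
```

The value printed should agree with the CORREL result to within rounding. Changing the axis scale of a scatterplot has no effect on this number, because r depends only on the paired data values.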
TI-83 Correlation
The TI-83 can help us to create a scatterplot for a set of paired data. With a minor
adjustment, it can also help us calculate the correlation coefficient through one of its
built-in functions. We will rework the example involving the giant sequoia trees, using
the TI-83 to construct a scatterplot and calculate the correlation coefficient r.
Tree  Height (ft)  Circumference (ft)
1 274.9 102.6
2 246.1 101.1
3 267.4 107.6
4 240.9 93.0
5 255.8 98.3
6 243.0 109.0
7 257.5 85.3
8 268.8 113.0
9 223.8 94.8
10 270.3 104.2
11 247.8 91.3
12 254.7 88.3
In list L1 enter the heights of the trees and in list L2 enter the circumferences. Be
sure that the data are paired after you type them in; in other words, each x value
should be next to its corresponding y value. To construct the scatterplot, press
2nd Y= to access the Stat Plot menu, which looks like the following.
Highlight number 1 and press the ENTER key. Be sure that On is highlighted on
the first line of the display. Next to Type, highlight the first option, which
constructs a scatterplot. Next to Xlist enter L1, and enter L2 next
to Ylist. Next to Mark, you have the choice of three different ways to display the
points on the scatterplot. Pick the one you want by highlighting it and press ENTER.
Now press the GRAPH key and your scatterplot should appear looking like this.
Your calculator may need a minor adjustment in order to calculate the correlation
coefficient r. Press the 2nd key, followed by 0 to access the Catalog menu.
Scroll down until you find the choice DiagnosticOn. When the cursor is next to it,
press the ENTER key twice: once to paste the command onto the home screen and once more to run it.
Now to calculate the correlation coefficient, we will use a tool that will be explained
in full in the next section. Press the STAT key, move to the Calc menu, and
select option 8: LinReg (a+bx). When you are brought back to the main screen,
press ( 2nd 1 , 2nd 2 ) . The screen should look like this.
Press the ENTER key, and you will see r, as well as r². Ignore the other values
until the next section.
EXERCISES 10.1
For Exercises 1–4, create a scatterplot for the given data. Use your scatterplot to determine
whether there is a positive correlation, negative correlation, or no correlation between these
two variables. (Do not calculate r.) If there is a correlation, do you feel that it is weak or
strong? Explain your responses in your own words.
1. Here are the shoe sizes of 10 randomly selected men, along with their heights in
inches. (Treat shoe size as the independent variable x.)
Shoe Size 4 7 7.5 9 9.5
Height (in.) 62 64 66 68 68
Shoe Size 10.5 9 10 11 11.5
Height (in.) 69 69 70 71 72
2. Here are the waist and hip measurements, in inches, of twelve 4-year-old dance
students. (Treat the waist measurements as the independent variable x.)
Waist 24 26 22 23 23 25
Hips 29 31 28 28 28 31
Waist 24 27 23 23 25 31
Hips 30 34 28 28 28 37
(Source: Dancer’s Edge dance studio.)
3. Here are the midterm exam scores of 20 algebra students, along with their scores
on the final exam. (Treat the midterm scores as the independent variable x.)
Student  Midterm  Final  Student  Midterm  Final
1 65 85 11 38 56
2 50 77 12 92 95
3 75 90 13 59 60
4 70 84 14 66 87
5 68 61 15 61 83
6 58 70 16 71 77
7 49 76 17 68 79
8 92 78 18 82 74
9 68 80 19 85 94
10 93 92 20 67 80
4. Here are the scores of 16 randomly selected statistics students on a test on Unit 3
and on the final exam. (Treat the Unit 3 scores as the independent variable x.)
Student  Unit 3  Final  Student  Unit 3  Final
1 68 59 9 84 81
2 93 91 10 74 85
3 90 75 11 90 84
4 97 100 12 54 78
5 97 98 13 78 95
6 48 59 14 86 94
7 89 89 15 56 86
8 96 97 16 79 57
For Exercises 5–8, calculate the correlation coefficient r for the given data and compare
your result to your responses from Exercises 1–4.
5. Here are the shoe sizes of 10 randomly selected men, along with their heights in
inches. (Treat shoe size as the independent variable x.)
Shoe Size 4 7 7.5 9 9.5
Height (in.) 62 64 66 68 68
Shoe Size 10.5 9 10 11 11.5
Height (in.) 69 69 70 71 72
6. Here are the waist and hip measurements, in inches, of twelve 4-year-old dance
students. (Treat the waist measurements as the independent variable x.)
Waist 24 26 22 23 23 25
Hips 29 31 28 28 28 31
Waist 24 27 23 23 25 31
Hips 30 34 28 28 28 37
(Source: Dancer’s Edge dance studio.)
7. Here are the midterm exam scores of 20 algebra students, along with their scores
on the final exam. (Treat the midterm scores as the independent variable x.)
Student  Midterm  Final  Student  Midterm  Final
1 65 85 11 38 56
2 50 77 12 92 95
3 75 90 13 59 60
4 70 84 14 66 87
5 68 61 15 61 83
6 58 70 16 71 77
7 49 76 17 68 79
8 92 78 18 82 74
9 68 80 19 85 94
10 93 92 20 67 80
8. Here are the scores of 16 randomly selected statistics students on a test on Unit 3
and on the final exam. (Treat the Unit 3 scores as the independent variable x.)
Student  Unit 3  Final  Student  Unit 3  Final
1 68 59 9 84 81
2 93 91 10 74 85
3 90 75 11 90 84
4 97 100 12 54 78
5 97 98 13 78 95
6 48 59 14 86 94
7 89 89 15 56 86
8 96 97 16 79 57
9. Here are the gross values, in millions of dollars, of milk produced and cattle raised
in Tulare County, California, for the ten years from 1989 through 1998.
Treating the value of milk production as the independent variable, calculate the
correlation coefficient r and the coefficient of determination r². Explain what the
coefficient of determination tells us for this problem.
10. A tutorial lab on campus offers free tutoring for any student on campus. Here are
the GPAs for 12 randomly selected students, and the number of tutoring appoint-
ments that those students missed.
GPA                  2.66 2.05 2.07 2.62 1.30 3.00 3.25 2.58 2.36 2.81 3.11 2.56
Missed Appointments  3    1    2    0    7    0    2    0    3    1    1    2
Is there a relation between a student’s GPA and the number of tutoring appointments
that the student missed? Treating the student GPAs as the independent variable,
calculate the correlation coefficient r and the coefficient of determination r².
Explain what the coefficient of determination tells us for this problem.
11. Wilt Chamberlain played 14 NBA seasons in Philadelphia, San Francisco, and Los
Angeles. Here are his point totals for those 14 seasons.
(a) Construct a scatterplot for these data, and calculate the correlation coefficient.
(Since the data represent a population, this is ρ.)
(b) Note that the point on the scatterplot associated with the 1969–70 season is
far removed from the rest of the points. (Wilt was injured for most of the sea-
son.) If we disregard that point, we will be able to adjust the scale on our axes
to get a better view of the other points and how they are related. Construct a
scatterplot with this point removed.
(c) Calculate the correlation coefficient without including data from the 1969–70
season. Does this radically change the coefficient that was calculated in part (a)?
12. It seems that there should be a strong relation between one season’s NBA ticket
prices and the previous season’s prices. Here are the average ticket prices for NBA
arenas for the 1998–99 and 1999–2000 seasons. Calculate the correlation coeffi-
cient for these two variables. (Since the data represent a population, this is ρ.)
Team 1998–99 1999–2000 Team 1998–99 1999–2000
13. Barry Sanders was an NFL running back with the Detroit Lions for 10 NFL seasons.
Here are his yearly statistics for the 1989 through 1998 seasons. Included are the
number of rushing attempts, number of rushing touchdowns, and the number of
rushing yards and receiving yards. Calculate the correlation coefficient between
the number of rushing attempts and the number of rushing yards.
Year Attempts Rush TD Rushing Yards Receiving Yards
1985 49 927 3
1986 86 1570 15
1987 65 1078 22
1988 64 1306 9
1989 82 1483 17
1990 100 1502 13
1991 80 1206 14
1992 84 1201 10
1993 98 1503 15
1994 112 1499 13
1995 122 1848 15
1996 108 1254 8
15. Is there a relation between family income in a city and the price of homes in that
city? Here are the median family incomes for 16 California cities, and the median
sales price for homes in those cities. Calculate the correlation coefficient r. Then,
at the 0.05 level of significance, test the claim that there is a significant correla-
tion between these two variables.
City  Median Family Income ($ thousands)  Median Home Price ($ thousands)
Bakersfield 38.7 94
Riverside 47.2 133
Modesto 43.1 128
Visalia 34.3 98
Sacramento 51.9 158
Redding 37.5 113
Fresno 37.2 109
Merced 36.9 122
Stockton 44.3 154
Ventura 65.3 235
Los Angeles 51.3 189
Santa Barbara 52.1 210
San Diego 52.5 208
San Luis Obispo 48.0 192
San Jose 82.6 355
San Francisco 72.4 407
(Source: National Association of Home Builders, Fresno Bee.)
16. For a random sample of 20 dairy cows, here are the number of pounds of milk pro-
duced during their first lactation (after their first calf) and their second lactation
(after their second calf). Is there a relation between these two variables? Calculate
the correlation coefficient r for these data. Then, at the 0.05 level of significance,
test the claim that there is a significant correlation between the amount of milk
produced during the first lactation and the second lactation.
17. If a company has a quick Web server, does that mean that it is reliable as well?
Here is a listing of 10 major electronic commerce Web sites. For each site, the aver-
age length of time for the site to come up on a user’s computer and the percent-
age of time the site is available are shown. Calculate the correlation coefficient r.
Then, at the 0.05 level of significance, test the claim that there is a significant cor-
relation between these two variables.
Site Seconds Availability (%)
18. Many politicians and citizen groups complain about the cost of prescription med-
ications in the United States. Here are the prices of a dose of 10 medications in
Canada and the United States (all in U.S. dollars).
(a) Calculate the correlation coefficient r. Then, at the 0.01 level of significance, test
the claim that there is a significant correlation between these two variables.
(b) Construct a scatterplot for these data.
(c) Note that the point on the scatterplot associated with Epogen is far removed
from the rest of the points. If we disregard that point, we will be able to adjust
the scale on our axes to get a better view of the other points and how they are
related. Construct a scatterplot with this point removed.
(d) Calculate the correlation coefficient without including the data for Epogen.
Does this radically change the coefficient that was calculated in part (a)?
19. For eight NHL goalies in the season’s second month, here are the number of min-
utes played by the goalie, the number of goals given up by the goalie, and the
number of shots attempted against the goalie.
Minutes 788 661 783 831 476 643 767 608
Goals 22 20 28 34 24 30 39 32
Shots 335 258 355 413 243 313 339 295
(a) Calculate the correlation coefficient between the number of minutes played
and the number of goals given up.
(b) Calculate the correlation coefficient between the number of shots attempted
and the number of goals given up.
(c) Based on your results, would the number of minutes played or the number of
shots attempted be a better predictor of the number of goals given up?
20. Cal Ripken set a major league record by playing in 2424 consecutive games for the
Baltimore Orioles between 1982 and 1998. Here are his at-bats (AB), runs (R), hits
(H), home runs (HR), runs batted in (RBI), walks (BB), and strike-outs (SO).
Year AB R H HR RBI BB SO
(a) Calculate the correlation coefficient between the number of at-bats and the
number of runs batted in.
(b) Calculate the correlation coefficient between the number of hits and the num-
ber of runs batted in.
(c) Calculate the correlation coefficient between the number of home runs and
the number of runs batted in.
(d) Based on your results, would the number of at-bats, the number of hits, or the
number of home runs be a better predictor of the number of runs batted in?
Exercises 21–25 use the following data. For the 50 states and Washington, D.C., here are
the 1999
State ACT ACT % SAT V SAT M SAT %
21. Calculate the correlation coefficient for SAT verbal scores and SAT math scores.
22. Calculate the correlation coefficient for ACT scores and the composite SAT scores
(SAT verbal score + SAT math score).
23. A state superintendent of schools, when asked about her state’s low scores, claims
that the low scores are due to the high percentage of high school graduates that
take the test. “There is a significant negative correlation between scores and the
percentage of high school graduates that take the SAT.”
(a) Calculate the correlation coefficient for the percentage of graduates who took
the SAT and the composite SAT scores (SAT verbal score + SAT math score).
(b) Based on your results, does the superintendent’s statement appear to be valid?
24. Calculate the correlation coefficient for the percentage of graduates who took the
ACT and the ACT scores. How does this coefficient compare to the comparable
coefficient for the SAT calculated in the previous exercise?
25. Calculate the correlation coefficient for the percentage of graduates who took the
ACT and the percentage of graduates who took the SAT. Does this correlation
make sense? Explain in your own words, and include a scatterplot to support
your argument.
26. The results of 47 horse races at Santa Anita Park were selected at random. Here are
the number of horses in the race, and the price that the winning horse in that race
paid to win.
Race  Horses  Win Price  Race  Horses  Win Price
1 7 9.20 25 7 5.80
2 9 26.60 26 10 8.80
3 4 4.40 27 10 64.00
4 9 19.20 28 7 8.40
5 10 23.40 29 6 6.20
6 8 27.40 30 12 14.80
7 8 14.20 31 12 4.00
8 9 15.20 32 7 8.40
9 6 14.80 33 6 6.20
10 9 5.60 34 12 14.80
11 8 7.80 35 12 4.00
12 9 17.00 36 6 3.00
13 7 8.80 37 10 8.80
14 10 15.40 38 7 5.80
15 10 17.40 39 10 64.00
16 11 22.80 40 7 8.80
17 11 8.00 41 7 10.80
18 12 21.00 42 9 8.60
19 5 21.80 43 8 7.00
20 11 7.40 44 11 12.80
21 12 33.20 45 9 8.20
22 10 5.00 46 11 62.80
23 10 8.80 47 12 10.80
24 6 3.00
(a) Calculate the correlation coefficient for the number of horses in the race and
the winning price.
(b) At the 0.05 level of significance, test the claim that there is a significant cor-
relation between these two variables.
27. For 100 randomly selected community college students, here are the number of
units they are enrolled in and the number of hours that they study per week.
Units  Study Hours  Units  Study Hours  Units  Study Hours  Units  Study Hours
12 4 15 9 6 9 16 5
9 3 18 7 15 7 6 2
16 6 12 3 16 9 18 7
12 4 21 9 21 11 16 5
21 8 12 3 6 3 12 3
15 3 9 3 9 3 21 9
12 0 16 10 12 6 4 1
12 2 9 3 12 4 9 2
15 4 12 5 15 6 12 5
12 5 6 1 12 4 12 6
16 7 15 4 21 9 15 4
9 3 16 4 16 9 12 3
16 5 16 3 12 6 14 7
15 6 12 4 9 2 9 2
21 9 21 9 12 3 16 5
18 9 12 8 16 6 12 4
16 7 15 8 9 2 9 3
12 3 14 7 12 4 12 4
12 5 9 2 12 9 16 5
15 4 16 5 12 4 12 3
16 5 12 4 15 5 16 5
12 4 15 7 16 6 9 2
9 2 9 3 12 5 15 7
6 2 12 3 12 3 12 3
12 4 12 5 9 2 12 6
(a) Calculate the correlation coefficient for the number of units and the number
of study hours per week, using the data from only the first 25 students (first
column).
(b) Use the result of part (a) to test the claim that there is a significant correlation
between these two variables at the 0.01 level of significance.
(c) Calculate the correlation coefficient between the number of units and the
number of study hours per week, using the data from the complete sample of
100 students.
(d) Use the result of part (c) to test the claim that there is a significant correlation
between these two variables at the 0.01 level of significance.
(e) Note that the test statistic is higher for the sample of 100 students than it is for
the 25 students, even though the correlation coefficient was actually greater
for the 25 students. Explain why this happened in your own words.
28. For 71 randomly selected 1999 National League baseball games, here are the
times required to complete the games (in minutes), the total number of hits in the
game, and the combined number of runs in the game.
(a) Calculate the correlation coefficient between the number of hits and the time
required to complete the game.
(b) At the 0.01 level of significance, test the claim that there is a significant corre-
lation between the number of hits and the time required to complete the game.
(c) Calculate the correlation coefficient between the number of runs and the time
required to complete the game.
(d) At the 0.01 level of significance, test the claim that there is a significant corre-
lation between the number of runs and the time required to complete the game.
(e) Based on your results, would the number of hits or the number of runs be bet-
ter for predicting the time required to complete a game?
SECTION 10.2
Linear Regression
If we can determine that there is a linear correlation between two variables, then the
behavior of those two variables can be described graphically by a line. In this section
we learn how to find the equation of the line that best fits a set of data. We will go on
to use that equation to predict the value of one of the variables for a particular value
of the other variable.
A regression line is a line that best fits a set of data. First, we will learn how to
find the equation of this line, and then later we will see what it means to best fit a set
of data. The general formula of a regression line is ŷ = a + bx. In this equation, ŷ,
which is read “y-hat,” is the predicted value of y for a given value of x. The slope of
the line is b, and we calculate it first. We then use the value of b to help calculate a,
which is the y-intercept of the line. Here are the formulas to calculate b and a.
\[
b = \frac{n \cdot \sum xy - (\sum x) \cdot (\sum y)}{n \sum x^2 - (\sum x)^2}
\]
\[
a = \bar{y} - b\bar{x}
\]
You may notice that the calculation of the slope b involves many of the same pieces
as the formula for calculating the correlation coefficient r. Again, it would be best to
use technology to find b and a, but we will use the formula for demonstration pur-
poses in this section.
EXAMPLE 10.9 Here are the scores of five randomly selected students on test 1
and test 2 in a math class. Find the equation of the regression line,
treating the score on test 1 as x and the score on test 2 as y.
Student  Test 1  Test 2
1 83 82
2 86 84
3 76 63
4 92 83
5 71 55
We begin by creating a scatterplot, to make sure that the data seem to have a
linear relationship.
[Figure: scatterplot titled "Test Scores", with Test 1 on the x-axis (60 to 100) and Test 2 on the y-axis (50 to 100).]
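x       y    xy      x²
83      82   6,806   6,889
86      84   7,224   7,396
76      63   4,788   5,776
92      83   7,636   8,464
71      55   3,905   5,041
Totals  408  367  30,359  33,566
\[
b = \frac{n \cdot \sum xy - (\sum x) \cdot (\sum y)}{n \sum x^2 - (\sum x)^2}
  = \frac{5 \cdot 30{,}359 - 408 \cdot 367}{5 \cdot 33{,}566 - 408^2}
  = \frac{151{,}795 - 149{,}736}{167{,}830 - 166{,}464}
  = \frac{2059}{1366}
  = 1.5073
\]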
The slope of this regression line is 1.5073. This tells us that for each additional point
on test 1, the predicted score on test 2 increases by 1.5073 points. Now we calculate the
y-intercept a.
\[
\bar{x} = \frac{\sum x}{n} = \frac{408}{5} = 81.6
\qquad
\bar{y} = \frac{\sum y}{n} = \frac{367}{5} = 73.4
\]
\[
a = \bar{y} - b\bar{x} = 73.4 - (1.5073)(81.6) = -49.5957
\]
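The slope and intercept formulas translate directly into a few lines of code. The following Python sketch applies them to the five pairs of test scores; the function name regression_line is an illustrative choice rather than anything defined in the text.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope first, then the intercept from the two means.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return a, b


test1 = [83, 86, 76, 92, 71]   # x: scores on test 1
test2 = [82, 84, 63, 83, 55]   # y: scores on test 2

a, b = regression_line(test1, test2)
print(round(b, 4))  # 1.5073
print(round(a, 3))
```

Because the code carries the slope at full precision instead of rounding it to 1.5073 before computing a, the intercept it prints is about -49.597 and differs from the hand calculation in the third decimal place. The difference is a rounding effect only.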
Now let’s examine why this line best fits the set of data. Here is the graph of the
regression line on the scatterplot of the data.
[Figure: scatterplot titled "Test Scores" (Test 1 on the x-axis, Test 2 on the y-axis) with the regression line drawn through the data.]
Note that the line does not go through any of the data points. The vertical distances
between each point and the regression line are called the residuals.
[Figure: the same "Test Scores" scatterplot and regression line, with the vertical distance from one point to the line labeled "residual".]
The equations for the slope and y-intercept are designed to minimize the sum of the
squares of the residuals. In other words, if we took each residual and squared it, and
then we totaled the results, this line would have the lowest possible total for this set of
data. For this reason, the regression line is often called the least-squares line.
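To see the least-squares property numerically, the sketch below totals the squared residuals for the regression line from the example and for a few nearby lines. The four alternative lines are arbitrary choices made only for illustration.

```python
test1 = [83, 86, 76, 92, 71]   # x: scores on test 1
test2 = [82, 84, 63, 83, 55]   # y: scores on test 2


def sum_squared_residuals(a, b, xs, ys):
    """Total of (y - y-hat)^2 over all points for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


a, b = -49.5957, 1.5073   # regression line from the example
best = sum_squared_residuals(a, b, test1, test2)

# Alternative lines: nudge the intercept or the slope slightly.
alternatives = [(a + 2, b), (a - 2, b), (a, b + 0.05), (a, b - 0.05)]
others = [sum_squared_residuals(a2, b2, test1, test2)
          for a2, b2 in alternatives]

print(round(best, 2))
print([round(s, 2) for s in others])
```

Each of the nudged lines produces a larger total than the regression line. Trying a handful of alternatives does not prove the minimum, but it illustrates what the formulas for b and a are designed to achieve.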
We can use the regression line to predict a y-value for a given x-value. Recall that
the equation of the regression line for the previous example was ŷ = –49.5957
+ 1.5073x, where x represents a student’s score on test 1 and ŷ is the predicted score
for that student on test 2. Suppose that a student got a score of 95 on test 1. What score
can we expect from that student on test 2? All we need to do is plug in 95 for x in the
regression equation.
ŷ = –49.5957 + 1.5073x
= –49.5957 + 1.5073(95)
= –49.5957 + 143.1935
= 93.5978
The predicted score from the regression equation is 93.5978. We could round this to
produce a predicted score of 94. This score seems to fit the pattern of the data.
How about predicting the score of a student who scored 50 on test 1? Plug in 50
for x in the regression equation.
ŷ = – 49.5957 + 1.5073x
= –49.5957 + 1.5073(50)
= – 49.5957 + 75.365
= 25.7693
The predicted score from the regression equation is 25.7693, which rounds to a score
of 26. Does this value seem reasonable? It is hard to say, because we do not have a
student in our sample data with a score on test 1 that is close to 50. When we plug in
a value for x that is not close to the values of x on which the regression equation is
based, this is called extrapolation. When we predict values using extrapolation, the
results are not reliable. We should use a regression equation for values of the inde-
pendent variable that are close to the values on which the equation is based.
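A simple way to keep extrapolation in view is to report, alongside each prediction, how far the x-value lies outside the data used to build the line. The sketch below does this for the test-score regression equation; the helper name distance_outside_data is hypothetical, and deciding how far is too far remains a judgment call.

```python
test1_scores = [83, 86, 76, 92, 71]   # x-values the line was fitted to
a, b = -49.5957, 1.5073               # regression line from the text


def predict(x):
    """Predicted test 2 score for a given test 1 score."""
    return a + b * x


def distance_outside_data(x, xs):
    """How far x lies beyond the observed x-values (0 if within their range)."""
    if x < min(xs):
        return min(xs) - x
    if x > max(xs):
        return x - max(xs)
    return 0


for score in (80, 95, 50):
    print(score, round(predict(score), 4),
          distance_outside_data(score, test1_scores))
```

A score of 95 sits only 3 points above the highest observed test 1 score, while a score of 50 sits 21 points below the lowest, which is why the second prediction deserves much less trust than the first.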
As a further example of extrapolation, consider the y-intercept of the regression
equation ŷ = – 49.5957 + 1.5073x. The y-intercept of an equation is the y-value that
corresponds to an x-value of 0. So, if a student had a score of 0 on test 1 we would pre-
dict a score of –49.5957 (or approximately –50) on test 2. Of course, this value makes
no sense. Note that none of the test 1 scores on which the regression equation is based
were close to 0. In fact, none of the scores were below 71. The y-intercept of some
regression equations can be meaningful if there are x-values close to 0, representing
some sort of initial condition.
Let’s revisit an example from the previous section.
EXAMPLE 10.10 Here are the number of hours that 10 students spent studying for a
final exam, and their scores on that exam. Find the equation of the
regression line that best fits the data, and use it to predict the
score of students who study for 0 hours, 5 hours, 10 hours, 15
hours, 20 hours, and 40 hours.
Hours 7 8 4 9 13
Score 70 76 57 77 91
Hours 5 9 6 16 3
Score 66 82 64 96 50
[Figure: scatterplot of Score against Hours, with Hours on the x-axis (0 to 20).]
x       y    xy    x²
7       70   490   49
8       76   608   64
4       57   228   16
9       77   693   81
13      91   1183  169
5       66   330   25
9       82   738   81
6       64   384   36
16      96   1536  256
3       50   150   9
Totals  80   729   6340  786
\[
b = \frac{n \cdot \sum xy - (\sum x) \cdot (\sum y)}{n \sum x^2 - (\sum x)^2}
  = \frac{10 \cdot 6340 - 80 \cdot 729}{10 \cdot 786 - 80^2}
  = \frac{63{,}400 - 58{,}320}{7860 - 6400}
  = \frac{5080}{1460}
  = 3.4795
\]
The slope of this regression line is 3.4795. This tells us that for each additional hour
that a student studies, his or her predicted score on the exam increases by 3.4795 points.
Now we calculate the y-intercept a.
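\[
\bar{x} = \frac{\sum x}{n} = \frac{80}{10} = 8
\qquad
\bar{y} = \frac{\sum y}{n} = \frac{729}{10} = 72.9
\]
\[
a = \bar{y} - b\bar{x} = 72.9 - (3.4795)(8) = 45.064
\]
The regression equation is ŷ = 45.064 + 3.4795x.
Here is the graph of the regression line. Note how
well it fits. This is due to the fact that there is a strong
linear correlation between the number of hours studied and the exam score.
[Figure: scatterplot titled "Hours Studied/Score on Exam" with the regression line drawn through the data; Score on the y-axis.]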
x = 0:   ŷ = 45.064 + 3.4795(0) = 45.064
x = 5:   ŷ = 45.064 + 3.4795(5) = 62.4615
x = 10:  ŷ = 45.064 + 3.4795(10) = 79.859
x = 15:  ŷ = 45.064 + 3.4795(15) = 97.2565
x = 20:  ŷ = 45.064 + 3.4795(20) = 114.654
x = 40:  ŷ = 45.064 + 3.4795(40) = 184.244
Here’s a summary of these results in a table, with the scores rounded to the near-
est whole number.
Hours 0 5 10 15 20 40
Score 45 62 80 97 115 184
Let’s begin by investigating the y-intercept. The lowest number of hours studied was
3 hours by a student who had a score of 50, so a score of 45 for a student who did
not study seems to fit. We can think of the equation as telling us that each student
“begins” with a score of 45.064, and then gains 3.4795 points for each hour studied.
The scores for 5 hours, 10 hours, and 15 hours seem to fit as well. There is a prob-
lem with the prediction for 20 hours of study time. The highest number of hours
studied in the sample data was 16 hours, the next highest was 3 hours lower at 13
hours, and the rest were below 10 hours. So this extrapolation has some problems,
especially considering that the highest score possible is 100. One problem with
regression lines is that in any small “window” a set of data may appear to be linear,
but if we expand outward the data may actually follow some sort of curve. Finally,
the extrapolation for 40 hours of study time is obviously no good. ■
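The predictions in this example, together with the two warning signs discussed above, can be reproduced with a short Python sketch. The constant MAX_SCORE and the tuple layout are illustrative choices; the maximum of 100 is the one mentioned in the discussion.

```python
hours = [7, 8, 4, 9, 13, 5, 9, 6, 16, 3]
scores = [70, 76, 57, 77, 91, 66, 82, 64, 96, 50]

n = len(hours)
sum_x, sum_y = sum(hours), sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)

# Slope and intercept from the formulas for b and a.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)

MAX_SCORE = 100  # highest possible exam score

rows = []
for h in (0, 5, 10, 15, 20, 40):
    predicted = round(a + b * h)
    in_range = min(hours) <= h <= max(hours)
    rows.append((h, predicted, in_range, predicted <= MAX_SCORE))

for h, predicted, in_range, possible in rows:
    print(h, predicted,
          "within data" if in_range else "extrapolation",
          "" if possible else "(exceeds maximum score)")
```

The strict range check also marks 0 hours as an extrapolation, even though it is only 3 hours below the smallest observed value and the prediction there seems reasonable. A mechanical check like this is a prompt for judgment, not a replacement for it.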
EXAMPLE 10.11 Here are the gross values, in millions of dollars, of nectarines and
peaches grown in Tulare County, California, for the ten years from
1989 through 1998. First, find the correlation coefficient for these
two variables. Then, find the equation of the regression line for
these two variables. Also, predict the value of peaches produced
for a year in which the value of nectarines produced is $60 million.
Year  Nectarines  Peaches
1989 47 32
1990 53 47
1991 52 57
1992 68 43
1993 51 53
1994 90 63
1995 74 76
1996 89 64
1997 83 66
1998 56 56
(Source: Visalia Times Delta.)
It is hard to determine which of these variables affects the other, but since we will
be trying to predict the value of peaches for a certain value of nectarines, we will let
x be the gross value of nectarines and y be the gross value of peaches.
[Figure: scatterplot of Peach Value ($ millions) on the y-axis against Nectarine Value ($ millions) on the x-axis.]
There seems to be a general positive trend. Since the correlation coefficient and
regression equation involve many of the same sums, we will calculate them all first.
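x       y    xy      x²      y²
47      32   1,504   2,209   1,024
53      47   2,491   2,809   2,209
52      57   2,964   2,704   3,249
68      43   2,924   4,624   1,849
51      53   2,703   2,601   2,809
90      63   5,670   8,100   3,969
74      76   5,624   5,476   5,776
89      64   5,696   7,921   4,096
83      66   5,478   6,889   4,356
56      56   3,136   3,136   3,136
Totals  663  557  38,190  46,469  32,473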
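\[
r = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}
         {\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}
\]
\[
= \frac{38{,}190 - \dfrac{663 \cdot 557}{10}}
       {\sqrt{\left(46{,}469 - \dfrac{663^2}{10}\right)\left(32{,}473 - \dfrac{557^2}{10}\right)}}
\]
\[
= \frac{38{,}190 - \dfrac{369{,}291}{10}}
       {\sqrt{\left(46{,}469 - \dfrac{439{,}569}{10}\right)\left(32{,}473 - \dfrac{310{,}249}{10}\right)}}
= 0.661
\]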
The correlation coefficient is 0.661, so we do have positive correlation. Now for the
calculation of the regression equation.
\[
b = \frac{n \cdot \sum xy - (\sum x) \cdot (\sum y)}{n \sum x^2 - (\sum x)^2}
  = \frac{10 \cdot 38{,}190 - 663 \cdot 557}{10 \cdot 46{,}469 - 663^2}
  = \frac{381{,}900 - 369{,}291}{464{,}690 - 439{,}569}
  = \frac{12{,}609}{25{,}121}
  = 0.5019
\]
The slope of this regression line is 0.5019. This tells us that for each additional million
dollars in nectarine production, the predicted value of peaches produced increases by
0.5019 million dollars, or $501,900. Now we calculate the y-intercept a.
\[
\bar{x} = \frac{\sum x}{n} = \frac{663}{10} = 66.3
\qquad
\bar{y} = \frac{\sum y}{n} = \frac{557}{10} = 55.7
\]
\[
a = \bar{y} - b\bar{x} = 55.7 - (0.5019)(66.3) = 22.4240
\]
The regression equation is ŷ = 22.4240 + 0.5019x. Here is the graph of the regres-
sion line.
[Figure: scatterplot titled "Nectarine/Peach Value in Tulare Co." with the regression line; Nectarine Value ($ millions) on the x-axis and Peach Value ($ millions) on the y-axis.]
What will be the value of peaches produced in a year in which the value of nec-
tarines produced is $60 million? Simply plug in 60 for x in the regression equation.
ŷ = 22.4240 + 0.5019x
= 22.4240 + 0.5019(60)
= 52.538
We predict the production of $52.538 million of peaches. ■
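Since the correlation coefficient and the regression equation share the same five sums, all of the hand calculations in this example can be checked at once. The following Python sketch is one way to do so; the variable names are illustrative.

```python
import math

nectarines = [47, 53, 52, 68, 51, 90, 74, 89, 83, 56]   # x, $ millions
peaches = [32, 47, 57, 43, 53, 63, 76, 64, 66, 56]      # y, $ millions

n = len(nectarines)
sum_x, sum_y = sum(nectarines), sum(peaches)
sum_xy = sum(x * y for x, y in zip(nectarines, peaches))
sum_x2 = sum(x * x for x in nectarines)
sum_y2 = sum(y * y for y in peaches)

# Correlation coefficient from the five sums.
r = (sum_xy - sum_x * sum_y / n) / math.sqrt(
    (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))

# Slope and intercept of the regression line.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)

# Predicted peach value when nectarines are worth $60 million.
prediction = a + b * 60

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(r, 3), round(b, 4), round(a, 3), round(prediction, 2))
```

The intercept printed here is about 22.422, slightly different from 22.4240, because the code does not round the slope to 0.5019 before using it. The prediction still agrees with $52.538 million to within about a thousand dollars.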
We can use Microsoft Excel to quickly find the regression equation, and even to draw
the graph of our regression line over our scatterplot. We begin with the regression
equation, reworking the example involving nectarines and peaches.
EXAMPLE 10.12 Here are the gross values, in millions of dollars, of nectarines and
peaches grown in Tulare County, California, for the ten years from
1989 through 1998. Find the equation of the regression line for
these two variables, letting x represent the value of nectarines
produced and y the value of peaches produced.
Year  Nectarines  Peaches
1989 47 32
1990 53 47
1991 52 57
1992 68 43
1993 51 53
1994 90 63
1995 74 76
1996 89 64
1997 83 66
1998 56 56
(Source: Visalia Times Delta.)
In a new Excel worksheet, type the values of the nectarines in column A, from cell
A1 through A10. In the next column, type the values of the peaches beginning in
cell B1 and continuing through cell B10.
We need to use Excel’s Analysis ToolPak, so we must be sure that it has
been added to the Tools menu. Click on the Tools menu; if you see Data Analysis,
then you may skip to the next paragraph. If you do not see Data Analysis, then
click on Add-Ins. When a dialog box opens, check the box next to Analysis
ToolPak, and then click OK.
From the Tools menu, select Data Analysis. When the dialog box appears, scroll
down to select Regression and click OK. When the dialog box appears, type
B1:B10 next to Input Y Range. Next to Input X Range, type A1:A10. Click on OK,
and Excel will give you a great deal of information in a new worksheet. Most of this
information will not be used until the next section. While all of the regression infor-
mation is still highlighted, from the Format menu select Column. Then choose
AutoFit Selection. This will make the information easier to see.
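Several of the entries in Excel's regression output come from a handful of sums of squares. The Python sketch below shows one way those summary quantities can be computed for the nectarine and peach data. It is not a description of how Excel works internally, and the variable names are illustrative.

```python
import math

nectarines = [47, 53, 52, 68, 51, 90, 74, 89, 83, 56]   # x, $ millions
peaches = [32, 47, 57, 43, 53, 63, 76, 64, 66, 56]      # y, $ millions

n = len(nectarines)
mean_x = sum(nectarines) / n
mean_y = sum(peaches) / n

# Deviation form of the slope; algebraically the same as the sums formula.
sxx = sum((x - mean_x) ** 2 for x in nectarines)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(nectarines, peaches))
b = sxy / sxx
a = mean_y - b * mean_x

# Sums of squares: total, residual, and regression.
ss_total = sum((y - mean_y) ** 2 for y in peaches)
ss_residual = sum((y - (a + b * x)) ** 2
                  for x, y in zip(nectarines, peaches))
ss_regression = ss_total - ss_residual

r_square = ss_regression / ss_total
multiple_r = math.sqrt(r_square)
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)
standard_error = math.sqrt(ss_residual / (n - 2))
f_statistic = (ss_regression / 1) / (ss_residual / (n - 2))

print(round(multiple_r, 4), round(r_square, 4), round(adjusted_r_square, 4))
print(round(standard_error, 4), round(f_statistic, 4))
```

With one x variable, Multiple R is the absolute value of the correlation coefficient r, and R Square is the coefficient of determination r².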
Look down the first column of regression information until you find Intercept and
X Variable 1. Under the column labeled Coefficients you will find the value of
a directly next to Intercept, and you will find the value of b directly next to
X Variable 1. Here is a picture of what the output looks like.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.661093529
R Square 0.437044654
Adjusted R Square 0.366675236
Standard Error 10.0946498
Observations 10
ANOVA
df SS MS F Significance F
Regression 1 632.8843637 632.8843637 6.210718593 0.037397804
Residual 8 815.2156363 101.9019545
Total 9 1448.1