Rasyid

The document discusses using Microsoft Excel and a TI-83 calculator to analyze bivariate data and calculate the correlation coefficient. It provides step-by-step instructions for creating a scatterplot of tree height and circumference data in Excel and on the TI-83, and calculating the correlation coefficient using functions in each.

10.1 Correlation

p-value
Using techniques developed in Section 7.2, the best that we can say is that the
p-value is between 0.2 and 0.5. A calculator or computer would tell us that the
actual p-value of this test is 0.2115. ■

MICROSOFT EXCEL Correlation

Microsoft Excel can help us to create a scatterplot for a set of paired data and also
calculate the correlation coefficient. We will rework the example involving the giant
sequoia trees, using Excel to construct a scatterplot and calculate the correlation
coefficient r.

EXAMPLE 10.7 Here are the heights and circumferences, both in feet, of 12 giant
sequoia trees. Construct a scatterplot for these two variables,
treating height as x. Also, calculate the correlation coefficient r.

Tree Height (ft.) Circumference (ft.)

1 274.9 102.6
2 246.1 101.1
3 267.4 107.6
4 240.9 93.0
5 255.8 98.3
6 243.0 109.0
7 257.5 85.3
8 268.8 113.0
9 223.8 94.8
10 270.3 104.2
11 247.8 91.3
12 254.7 88.3

In a new Excel worksheet, type the heights in column A, from cell A1 down to A12.
In the next column, B, type the circumferences in cells B1 through B12. When
creating a scatterplot, it is crucial that you type the values of x to the left of the
values of y. Also be sure that the data are paired after you type them in; in other
words, each x value should be next to its corresponding y value.

Highlight the data by clicking on cell A1 and dragging the mouse to cell B12 before
releasing the button. Now, from the Insert menu select Chart. This starts the
Chart Wizard. In Step 1, select XY (Scatter) under Chart Type. Click on Next to
advance to Step 2. There is nothing that we need to do in Step 2, so click on Next
to advance to Step 3.

In Step 3, we can change the appearance of our scatterplot. Click on the Titles
tab to add a title to our graph, as well as labeling the x-axis (Height) and y-axis
(Circumference). If you click on the Legend tab you can get rid of the box to the
right of the graph. Click on the box labeled Show Legend so the check disappears.
Click on Next to advance to Step 4. In Step 4 you can decide whether you
want your scatterplot to appear on the same worksheet or have its own page.
After you make this decision, click on Finish. Here is an example of what you
should see.
530 CHAPTER 10 Linear Correlation and Regression

[Figure: scatterplot titled "Sequoia Trees"; x-axis Height (0 to 300), y-axis Circumference (0 to 120)]

If you are unhappy with all of the points being in the upper right-hand corner, you
can fix that by adjusting the x- and/or y-axis. Note that all of the heights are
between 200 and 300 feet. Right-click with your mouse on the values on the x-axis,
and select Format Axis. Click on the Scale tab, and change the minimum value
to 200 instead of 0. You can adjust the maximum value also if you wish to. Click OK
for the changes to take effect. Here is an example of what you should see after
changing the scale.

[Figure: scatterplot titled "Sequoia Trees"; x-axis Height (200 to 300), y-axis Circumference (0 to 120)]

Note that all of the points are at the top of the graph, with circumferences between
80 feet and 120 feet. We can change the scale on the y-axis in the same fashion.
Here is what you should see after changing the minimum value to 80 instead of 0.

[Figure: scatterplot titled "Sequoia Trees"; x-axis Height (200 to 300), y-axis Circumference (80 to 120)]

The strength of the correlation seems to change as we change the scale, even though
the data themselves have not changed. The scatterplot helps us to get an idea of what
type of correlation we have, but we must calculate the correlation coefficient r to be
sure. Excel has a built-in function to calculate the correlation coefficient.

=CORREL(x cell range, y cell range)

For our example, type the following in any cell.

=CORREL(A1:A12,B1:B12)

The result that Excel gives us is approximately 0.389. ■
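
As a cross-check outside Excel, the same calculation can be sketched in Python. This is a minimal illustration, not part of the Excel procedure: the helper name pearson_r is hypothetical, and it uses the deviation-from-the-mean form of the correlation formula. The last lines also show that changing the units of x (feet to inches), which stretches the axis much like rescaling the chart, leaves r unchanged.

```python
import math

# Heights (x) and circumferences (y), in feet, of the 12 giant sequoias above.
heights = [274.9, 246.1, 267.4, 240.9, 255.8, 243.0,
           257.5, 268.8, 223.8, 270.3, 247.8, 254.7]
circumferences = [102.6, 101.1, 107.6, 93.0, 98.3, 109.0,
                  85.3, 113.0, 94.8, 104.2, 91.3, 88.3]


def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data (hypothetical helper)."""
    if len(xs) != len(ys):
        raise ValueError("x and y must be paired")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(heights, circumferences)
print(round(r, 2))

# Converting heights from feet to inches rescales the axis but not r.
r_inches = pearson_r([12 * h for h in heights], circumferences)
print(abs(r - r_inches) < 1e-9)
```

The value printed should agree with the CORREL result to the displayed precision.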

TI-83 Correlation
The TI-83 can help us to create a scatterplot for a set of paired data. With a minor
adjustment, it can also help us calculate the correlation coefficient through one of its
built-in functions. We will rework the example involving the giant sequoia trees, using
the TI-83 to construct a scatterplot and calculate the correlation coefficient r.

EXAMPLE 10.8 Here are the heights and circumferences, both in feet, of 12
giant sequoia trees. Construct a scatterplot for these two variables,
treating height as x. Also, calculate the correlation coefficient r.

Tree Height (ft.) Circumference (ft.)

1 274.9 102.6
2 246.1 101.1
3 267.4 107.6
4 240.9 93.0
5 255.8 98.3
6 243.0 109.0
7 257.5 85.3
8 268.8 113.0
9 223.8 94.8
10 270.3 104.2
11 247.8 91.3
12 254.7 88.3

In list L1 enter the heights of the trees and in list L2 enter the circumferences. Be
sure that the data are paired after you type them in; in other words, each x value
should be next to its corresponding y value. To construct the scatterplot, press
2nd Y= to access the Stat Plot menu, which looks like the following.

Highlight number 1 and press the ENTER key. Be sure that On is highlighted on
the first line of the display. Next to Type, highlight the first option, the scatterplot
icon, to construct a scatterplot. Next to Xlist enter L1, and enter L2 next
to Ylist. Next to Mark, you have the choice of three different ways to display the
points on the scatterplot. Pick the one you want by highlighting it and pressing ENTER.

Now press the GRAPH key and your scatterplot should appear looking like this.

Your calculator may need a minor adjustment in order to calculate the correlation
coefficient r. Press the 2nd key, followed by 0 to access the Catalog menu.
Scroll down until you find the choice DiagnosticOn. When the cursor is next to it, as
shown here,

press the ENTER key twice.

Now to calculate the correlation coefficient, we will use a tool that will be explained
in full in the next section. Press the STAT key, move to the Calc menu, and
select option 8: LinReg (a+bx). When you are brought back to the main screen,
press ( 2nd 1 , 2nd 2 ) . The screen should look like this.

Press the ENTER key, and you will see r, as well as r². Ignore the other values
until the next section.

EXERCISES 10.1
For Exercises 1–4, create a scatterplot for the given data. Use your scatterplot to determine
whether there is a positive correlation, negative correlation, or no correlation between these
two variables. (Do not calculate r.) If there is a correlation, do you feel that it is weak or
strong? Explain your responses in your own words.
1. Here are the shoe sizes of 10 randomly selected men, along with their heights in
inches. (Treat shoe size as the independent variable x.)
Shoe Size 4 7 7.5 9 9.5
Height (in.) 62 64 66 68 68
Shoe Size 10.5 9 10 11 11.5
Height (in.) 69 69 70 71 72
2. Here are the waist and hip measurements, in inches, of twelve 4-year-old dance
students. (Treat the waist measurements as the independent variable x.)
Waist 24 26 22 23 23 25
Hips 29 31 28 28 28 31
Waist 24 27 23 23 25 31
Hips 30 34 28 28 28 37
(Source: Dancer’s Edge dance studio.)
3. Here are the midterm exam scores of 20 algebra students, along with their scores
on the final exam. (Treat the midterm scores as the independent variable x.)

Student Midterm Final Student Midterm Final

1 65 85 11 38 56
2 50 77 12 92 95
3 75 90 13 59 60
4 70 84 14 66 87
5 68 61 15 61 83
6 58 70 16 71 77
7 49 76 17 68 79
8 92 78 18 82 74
9 68 80 19 85 94
10 93 92 20 67 80

4. Here are the scores of 16 randomly selected statistics students on a test on Unit 3
and on the final exam. (Treat the Unit 3 scores as the independent variable x.)

Student Unit 3 Final Student Unit 3 Final

1 68 59 9 84 81
2 93 91 10 74 85
3 90 75 11 90 84
4 97 100 12 54 78
5 97 98 13 78 95
6 48 59 14 86 94
7 89 89 15 56 86
8 96 97 16 79 57

For Exercises 5–8, calculate the correlation coefficient r for the given data and compare
your result to your responses from Exercises 1–4.
5. Here are the shoe sizes of 10 randomly selected men, along with their heights in
inches. (Treat shoe size as the independent variable x.)
Shoe Size 4 7 7.5 9 9.5
Height (in.) 62 64 66 68 68
Shoe Size 10.5 9 10 11 11.5
Height (in.) 69 69 70 71 72
6. Here are the waist and hip measurements, in inches, of twelve 4-year-old dance
students. (Treat the waist measurements as the independent variable x.)
Waist 24 26 22 23 23 25
Hips 29 31 28 28 28 31
Waist 24 27 23 23 25 31
Hips 30 34 28 28 28 37
(Source: Dancer’s Edge dance studio.)
7. Here are the midterm exam scores of 20 algebra students, along with their scores
on the final exam. (Treat the midterm scores as the independent variable x.)

Student Midterm Final Student Midterm Final

1 65 85 11 38 56
2 50 77 12 92 95
3 75 90 13 59 60
4 70 84 14 66 87
5 68 61 15 61 83
6 58 70 16 71 77
7 49 76 17 68 79
8 92 78 18 82 74
9 68 80 19 85 94
10 93 92 20 67 80

8. Here are the scores of 16 randomly selected statistics students on a test on Unit 3
and on the final exam. (Treat the Unit 3 scores as the independent variable x.)

Student Unit 3 Final Student Unit 3 Final

1 68 59 9 84 81
2 93 91 10 74 85
3 90 75 11 90 84
4 97 100 12 54 78
5 97 98 13 78 95
6 48 59 14 86 94
7 89 89 15 56 86
8 96 97 16 79 57

9. Here are the gross values, in millions of dollars, of milk produced and cattle raised
in Tulare County, California, for the ten years from 1989 through 1998.

Year Milk ($ millions) Cattle ($ millions)

1989 287 177


1990 363 214
1991 413 212
1992 411 237
1993 455 238
1994 477 223
1995 547 223
1996 569 229
1997 712 252
1998 718 271
(Source: Visalia Times Delta.)

Treating the value of milk production as the independent variable, calculate the
correlation coefficient r and the coefficient of determination r². Explain what the
coefficient of determination tells us for this problem.
10. A tutorial lab on campus offers free tutoring for any student on campus. Here are
the GPAs for 12 randomly selected students, and the number of tutoring appoint-
ments that those students missed.
GPA 2.66 2.05 2.07 2.62 1.30 3.00 3.25 2.58 2.36 2.81 3.11 2.56
Missed 3 1 2 0 7 0 2 0 3 1 1 2
Appointments
Is there a relation between a student’s GPA and the number of tutoring appointments
that the student missed? Treating the student GPAs as the independent variable,
calculate the correlation coefficient r and the coefficient of determination r².
Explain what the coefficient of determination tells us for this problem.
11. Wilt Chamberlain played 14 NBA seasons in Philadelphia, San Francisco, and Los
Angeles. Here are his point and rebound totals for those 14 seasons.

Season Team Points Rebounds

1959–60 Philadelphia 2707 1941


1960–61 Philadelphia 3033 2149
1961–62 Philadelphia 4029 2052
1962–63 San Francisco 3586 1946
1963–64 San Francisco 2948 1787
1964–65 San Francisco 2534 1673
1965–66 Philadelphia 2649 1943
1966–67 Philadelphia 1956 1957
1967–68 Philadelphia 1992 1952
1968–69 Los Angeles 1664 1712
1969–70 Los Angeles 328 221
1970–71 Los Angeles 1696 1493
1971–72 Los Angeles 1213 1572
1972–73 Los Angeles 1084 1526

(a) Construct a scatterplot for these data, and calculate the correlation coefficient.
(Since the data represent a population, this is ρ.)
(b) Note that the point on the scatterplot associated with the 1969–70 season is
far removed from the rest of the points. (Wilt was injured for most of the sea-
son.) If we disregard that point, we will be able to adjust the scale on our axes
to get a better view of the other points and how they are related. Construct a
scatterplot with this point removed.
(c) Calculate the correlation coefficient without including data from the 1969–70
season. Does this radically change the coefficient that was calculated in part (a)?

12. It seems that there should be a strong relation between one season’s NBA ticket
prices and the previous season’s prices. Here are the average ticket prices for NBA
arenas for the 1998–99 and 1999–2000 seasons. Calculate the correlation coeffi-
cient for these two variables. (Since the data represent a population, this is ρ.)
Team 1998–99 1999–2000 Team 1998–99 1999–2000

New York 79.34 86.82 Sacramento 34.11 44.68


L.A. Lakers 51.11 81.89 Philadelphia 41.96 44.26
Seattle 63.47 64.60 Orlando 44.46 44.18
Houston 58.18 62.63 L.A. Clippers 31.75 43.89
Washington 61.40 59.65 Toronto 26.17 42.76
New Jersey 49.24 59.22 Dallas 34.84 40.76
Utah 43.47 54.60 Detroit 33.32 40.04
Chicago 53.17 52.84 Cleveland 39.75 39.75
Portland 52.28 52.28 Minnesota 38.61 39.08
Boston 49.79 49.79 San Antonio 38.01 38.92
Indiana 43.36 48.39 Denver 30.53 38.34
Golden State 36.79 48.10 Vancouver 31.90 34.71
Miami 36.55 46.57 Charlotte 28.12 32.04
Atlanta 36.79 45.75 Milwaukee 29.06 30.83
Phoenix 48.84 45.39
(Source: Team Marketing Report, USA Today.)

13. Barry Sanders was an NFL running back with the Detroit Lions for 10 NFL seasons.
Here are his yearly statistics for the 1989 through 1998 seasons. Included are the
number of rushing attempts, number of rushing touchdowns, and the number of
rushing yards and receiving yards. Calculate the correlation coefficient between
the number of rushing attempts and the number of rushing yards.
Year Attempts Rush TD Rushing Yards Receiving Yards

1989 280 14 1470 282


1990 255 13 1304 480
1991 342 16 1548 307
1992 312 9 1352 225
1993 243 3 1115 205
1994 331 7 1883 283
1995 314 11 1500 398
1996 307 11 1553 147
1997 335 11 2053 305
1998 343 4 1491 289
14. Jerry Rice began his NFL career as a wide receiver with the San Francisco Forty-
Niners in 1985. Here are the number of receptions, receiving yards, and touch-
down receptions for the 1985–1996 seasons. Calculate the correlation coefficient
between the number of receptions and receiving yards.
Year Receptions Yards Touchdowns

1985 49 927 3
1986 86 1570 15
1987 65 1078 22
1988 64 1306 9
1989 82 1483 17
1990 100 1502 13
1991 80 1206 14
1992 84 1201 10
1993 98 1503 15
1994 112 1499 13
1995 122 1848 15
1996 108 1254 8

15. Is there a relation between family income in a city and the price of homes in that
city? Here are the median family incomes for 16 California cities, and the median
sales price for homes in those cities. Calculate the correlation coefficient r. Then,
at the 0.05 level of significance, test the claim that there is a significant correla-
tion between these two variables.

City    Median Income ($ thousands)    Median Sales Price ($ thousands)

Bakersfield 38.7 94
Riverside 47.2 133
Modesto 43.1 128
Visalia 34.3 98
Sacramento 51.9 158
Redding 37.5 113
Fresno 37.2 109
Merced 36.9 122
Stockton 44.3 154
Ventura 65.3 235
Los Angeles 51.3 189
Santa Barbara 52.1 210
San Diego 52.5 208
San Luis Obispo 48.0 192
San Jose 82.6 355
San Francisco 72.4 407
(Source: National Association of Home Builders, Fresno Bee.)

16. For a random sample of 20 dairy cows, here are the number of pounds of milk pro-
duced during their first lactation (after their first calf) and their second lactation
(after their second calf). Is there a relation between these two variables? Calculate
the correlation coefficient r for these data. Then, at the 0.05 level of significance,
test the claim that there is a significant correlation between the amount of milk
produced during the first lactation and the second lactation.

Cow    First Lactation    Second Lactation    Cow    First Lactation    Second Lactation

1 22,792 29,655 11 15,058 8,013
2 31,693 23,817 12 16,681 23,301
3 18,367 35,360 13 21,002 32,529
4 17,440 21,848 14 17,987 23,620
5 29,798 29,828 15 15,968 26,437
6 28,540 33,898 16 23,580 26,227
7 23,661 22,444 17 14,334 25,529
8 39,242 23,648 18 24,030 27,368
9 21,574 25,303 19 20,392 29,527
10 34,542 34,812 20 24,347 25,592

17. If a company has a quick Web server, does that mean that it is reliable as well?
Here is a listing of 10 major electronic commerce Web sites. For each site, the aver-
age length of time for the site to come up on a user’s computer and the percent-
age of time the site is available are shown. Calculate the correlation coefficient r.
Then, at the 0.05 level of significance, test the claim that there is a significant cor-
relation between these two variables.

Site Seconds Availability (%)

Amazon.com 17.66 90.1


Barnesandnoble.com 15.85 94.2
CDnow 15.52 89.2
eBay 17.25 78.4
eToys.com 20.11 84.0
Gateway 19.01 86.1
Landsend.com 13.49 94.2
Macys.com 31.42 93.0
Wal Mart Online 22.24 92.0
Wine.com 20.58 93.6
(Source: Keynote Systems.)

18. Many politicians and citizen groups complain about the cost of prescription med-
ications in the United States. Here are the prices of a dose of 10 medications in
Canada and the United States (all in U.S. dollars).

Drug Canada United States

Prilosec 1.47 3.31


Prozac 1.07 2.27
Lipitor 1.34 2.54
Prevacid 1.34 3.13
Epogen 21.44 23.40
Zocor 1.47 3.16
Zoloft 1.07 1.98
Zyprexa 3.39 5.27
Claritin 1.11 1.96
Paxil 1.13 2.22
(Source: USA Today.)

(a) Calculate the correlation coefficient r. Then, at the 0.01 level of significance, test
the claim that there is a significant correlation between these two variables.
(b) Construct a scatterplot for these data.
(c) Note that the point on the scatterplot associated with Epogen is far removed
from the rest of the points. If we disregard that point, we will be able to adjust
the scale on our axes to get a better view of the other points and how they are
related. Construct a scatterplot with this point removed.
(d) Calculate the correlation coefficient without including the data for Epogen.
Does this radically change the coefficient that was calculated in part (a)?
19. For eight NHL goalies in the season’s second month, here are the number of min-
utes played by the goalie, the number of goals given up by the goalie, and the
number of shots attempted against the goalie.
Minutes 788 661 783 831 476 643 767 608
Goals 22 20 28 34 24 30 39 32
Shots 335 258 355 413 243 313 339 295
(a) Calculate the correlation coefficient between the number of minutes played
and the number of goals given up.
(b) Calculate the correlation coefficient between the number of shots attempted
and the number of goals given up.
(c) Based on your results, would the number of minutes played or the number of
shots attempted be a better predictor of the number of goals given up?
20. Cal Ripken set a major league record by playing in 2632 consecutive games for the
Baltimore Orioles between 1982 and 1998. Here are his at-bats (AB), runs (R), hits
(H), home runs (HR), runs batted in (RBI), walks (BB), and strike-outs (SO).

Year AB R H HR RBI BB SO

1982 598 90 158 28 93 46 95


1983 663 121 211 27 102 58 97
1984 641 103 195 27 86 71 89
1985 642 116 181 26 110 67 68
1986 627 98 177 25 81 70 60
1987 624 97 157 27 98 81 77
1988 575 87 152 23 81 102 69
1989 646 80 166 21 93 57 72
1990 600 78 150 21 84 82 66
1991 650 99 210 34 114 53 46
1992 637 73 160 14 72 64 50
1993 641 87 165 24 90 65 58
1994 444 71 140 13 75 32 41
1995 550 71 144 17 88 52 59
1996 640 94 178 26 102 59 78
1997 615 79 166 17 84 56 73
1998 601 65 163 14 61 51 68

(a) Calculate the correlation coefficient between the number of at-bats and the
number of runs batted in.
(b) Calculate the correlation coefficient between the number of hits and the num-
ber of runs batted in.
(c) Calculate the correlation coefficient between the number of home runs and
the number of runs batted in.
(d) Based on your results, would the number of at-bats, the number of hits, or the
number of home runs be a better predictor of the number of runs batted in?
Exercises 21–25 use the following data. For the 50 states and Washington, D.C., here are
the 1999

• average ACT composite scores (ACT)


• percentage of high school graduates that took the ACT (ACT %)
• average SAT verbal score (SAT V)
• average SAT math score (SAT M)
• percentage of high school graduates that took the SAT (SAT %)
State ACT ACT % SAT V SAT M SAT %

AL 20.2 65 561 555 9


AK 21.1 35 516 514 50
AZ 21.4 28 524 524 34
AR 20.3 69 563 556 6
CA 21.3 12 497 514 49
CO 21.5 62 536 540 32
CT 21.6 3 510 509 80
DE 20.5 3 503 497 67
DC 18.6 13 494 478 77
FL 20.6 39 499 498 53
GA 20.0 16 487 482 63
HI 21.6 18 482 513 52
ID 21.4 60 542 540 16
IL 21.4 67 569 585 12
IN 21.2 19 496 498 60
IA 22.0 66 594 598 5
KS 21.5 75 578 576 9
KY 20.1 68 547 547 12
LA 19.6 76 561 558 8
ME 22.1 4 507 503 68
MD 20.9 10 507 507 65
MA 22.0 6 511 511 78
MI 21.3 69 557 565 11
MN 22.1 64 586 598 9
MS 18.7 82 563 548 4
MO 21.6 67 572 572 8
MT 21.8 54 545 546 21
NE 21.7 41 568 571 8
NV 21.5 5 512 517 34
NH 22.2 5 520 518 72
NJ 20.7 4 498 510 80
NM 20.1 64 549 542 12
NY 22.0 14 495 502 76
NC 19.4 12 493 493 61
ND 21.4 79 594 605 5
OH 21.4 59 534 538 25
OK 20.6 69 567 560 8
OR 22.6 11 525 525 53
PA 21.4 7 498 495 70
RI 22.7 3 504 499 70
SC 19.1 18 479 475 61
SD 21.2 70 585 588 4
TN 19.9 77 559 553 13
TX 20.3 31 494 499 50
UT 21.4 68 570 565 5
VT 21.9 9 514 506 70
VA 20.6 7 508 499 65
WA 22.6 18 525 526 52
WV 20.2 58 527 512 18
WI 22.3 67 584 595 7
WY 21.4 66 546 551 10

(Source: College Entrance Examination Board, American College Testing Program.)

21. Calculate the correlation coefficient for SAT verbal scores and SAT math scores.
22. Calculate the correlation coefficient for ACT scores and the composite SAT scores
(SAT verbal score + SAT math score).
23. A state superintendent of schools, when asked about her state’s low scores, claims
that the low scores are due to the high percentage of high school graduates that
take the test. “There is a significant negative correlation between scores and the
percentage of high school graduates that take the SAT.”
(a) Calculate the correlation coefficient for the percentage of graduates who took
the SAT and the composite SAT scores (SAT verbal score + SAT math score).
(b) Based on your results, does the superintendent’s statement appear to be valid?
24. Calculate the correlation coefficient for the percentage of graduates who took the
ACT and the ACT scores. How does this coefficient compare to the comparable
coefficient for the SAT calculated in the previous exercise?
25. Calculate the correlation coefficient for the percentage of graduates who took the
ACT and the percentage of graduates who took the SAT. Does this correlation
make sense? Explain in your own words, and include a scatterplot to support
your argument.
26. The results of 47 horse races at Santa Anita Park were selected at random. Here are
the number of horses in the race, and the price that the winning horse in that race
paid to win.

Race    Number of Horses    Winning Price    Race    Number of Horses    Winning Price

1 7 9.20 25 7 5.80
2 9 26.60 26 10 8.80
3 4 4.40 27 10 64.00
4 9 19.20 28 7 8.40
5 10 23.40 29 6 6.20
6 8 27.40 30 12 14.80
7 8 14.20 31 12 4.00
8 9 15.20 32 7 8.40
9 6 14.80 33 6 6.20
10 9 5.60 34 12 14.80
11 8 7.80 35 12 4.00
12 9 17.00 36 6 3.00
13 7 8.80 37 10 8.80
14 10 15.40 38 7 5.80
15 10 17.40 39 10 64.00
16 11 22.80 40 7 8.80
17 11 8.00 41 7 10.80
18 12 21.00 42 9 8.60
19 5 21.80 43 8 7.00
20 11 7.40 44 11 12.80
21 12 33.20 45 9 8.20
22 10 5.00 46 11 62.80
23 10 8.80 47 12 10.80
24 6 3.00

(a) Calculate the correlation coefficient for the number of horses in the race and
the winning price.
(b) At the 0.05 level of significance, test the claim that there is a significant cor-
relation between these two variables.
27. For 100 randomly selected community college students, here are the number of
units they are enrolled in and the number of hours that they study per week.

Units    Study Hours    Units    Study Hours    Units    Study Hours    Units    Study Hours

12 4 15 9 6 9 16 5
9 3 18 7 15 7 6 2
16 6 12 3 16 9 18 7
12 4 21 9 21 11 16 5
21 8 12 3 6 3 12 3
15 3 9 3 9 3 21 9
12 0 16 10 12 6 4 1
12 2 9 3 12 4 9 2
15 4 12 5 15 6 12 5
12 5 6 1 12 4 12 6
16 7 15 4 21 9 15 4
9 3 16 4 16 9 12 3
16 5 16 3 12 6 14 7
15 6 12 4 9 2 9 2
21 9 21 9 12 3 16 5
18 9 12 8 16 6 12 4
16 7 15 8 9 2 9 3
12 3 14 7 12 4 12 4
12 5 9 2 12 9 16 5
15 4 16 5 12 4 12 3
16 5 12 4 15 5 16 5
12 4 15 7 16 6 9 2
9 2 9 3 12 5 15 7
6 2 12 3 12 3 12 3
12 4 12 5 9 2 12 6

(a) Calculate the correlation coefficient for the number of units and the number
of study hours per week, using the data from only the first 25 students (first
column).
(b) Use the result of part (a) to test the claim that there is a significant correlation
between these two variables at the 0.01 level of significance.
(c) Calculate the correlation coefficient between the number of units and the
number of study hours per week, using the data from the complete sample of
100 students.
(d) Use the result of part (c) to test the claim that there is a significant correlation
between these two variables at the 0.01 level of significance.
(e) Note that the test statistic is higher for the sample of 100 students than it is for
the 25 students, even though the correlation coefficient was actually greater
for the 25 students. Explain why this happened in your own words.
28. For 71 randomly selected 1999 National League baseball games, here are the
times required to complete the games (in minutes), the total number of hits in the
game, and the combined number of runs in the game.

Time Hits Runs Time Hits Runs Time Hits Runs

170 19 11 152 20 14 160 14 3


187 20 11 190 23 9 177 20 7
169 14 6 213 30 22 171 16 9
129 12 1 158 19 11 198 26 22
159 25 15 176 20 9 159 21 10
209 27 19 146 13 6 160 9 3
181 16 9 172 9 4 146 16 10
183 17 11 137 13 6 188 14 6
197 23 12 200 16 9 149 16 6
170 19 13 158 19 13 151 13 6
157 20 6 181 25 13 172 18 7
214 32 27 202 21 11 143 12 6
154 15 6 152 14 3 189 19 11
166 23 12 128 11 8 204 27 19
191 21 11 161 18 12 215 24 13
178 17 6 153 8 3 181 16 10
198 23 14 164 19 13 141 14 5
178 17 13 169 19 9 152 18 4
161 13 6 194 24 14 176 20 10
198 20 6 108 10 5 180 19 14
179 17 10 146 19 5 158 12 4
150 21 13 157 12 6
265 20 9 167 16 13
136 18 16 169 16 7
143 15 3 190 22 12

(a) Calculate the correlation coefficient between the number of hits and the time
required to complete the game.
(b) At the 0.01 level of significance, test the claim that there is a significant corre-
lation between the number of hits and the time required to complete the game.

(c) Calculate the correlation coefficient between the number of runs and the time
required to complete the game.
(d) At the 0.01 level of significance, test the claim that there is a significant corre-
lation between the number of runs and the time required to complete the game.
(e) Based on your results, would the number of hits or the number of runs be bet-
ter for predicting the time required to complete a game?

SECTION 10.2
Linear Regression
If we can determine that there is a linear correlation between two variables, then the
behavior of those two variables can be described graphically by a line. In this section
we learn how to find the equation of the line that best fits a set of data. We will go on
to use that equation to predict the value of one of the variables for a particular value
of the other variable.
A regression line is a line that best fits a set of data. First, we will learn how to
find the equation of this line, and then later we will see what it means to best fit a set
of data. The general formula of a regression line is ŷ = a + bx. In this equation, ŷ,
which is read “y-hat,” is the predicted value of y for a given value of x. The slope of
the line is b, and we calculate it first. We then use the value of b to help calculate a,
which is the y-intercept of the line. Here are the formulas to calculate b and a.

\[
b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}
\]

\[
a = \bar{y} - b\bar{x}
\]

You may notice that the calculation of the slope b involves many of the same pieces
as the formula for calculating the correlation coefficient r. Again, it would be best to
use technology to find b and a, but we will use the formula for demonstration
purposes in this section.
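
To make the shared pieces explicit, here is the slope formula next to the standard computational formula for r. The form of r shown is the usual textbook one and is assumed here to match the formula given in Section 10.1.

```latex
\[
b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2},
\qquad
r = \frac{n\sum xy - (\sum x)(\sum y)}
         {\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}
\]
```

The numerators are identical and both denominators are positive, so b and r always have the same sign.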

EXAMPLE 10.9 Here are the scores of five randomly selected students on test 1
and test 2 in a math class. Find the equation of the regression line,
treating the score on test 1 as x and the score on test 2 as y.

Student Test 1 Score Test 2 Score

1 83 82
2 86 84
3 76 63
4 92 83
5 71 55

We begin by creating a scatterplot, to make sure that the data seem to have a
linear relationship.

[Figure: scatterplot titled "Test Scores"; x-axis Test 1 (60 to 100), y-axis Test 2 (50 to 100)]

There seems to be a linear association, so we continue with the calculation of b,
the slope of the regression line.

x        y        xy        x²

83       82       6,806     6,889
86       84       7,224     7,396
76       63       4,788     5,776
92       83       7,636     8,464
71       55       3,905     5,041

Total:
408      367      30,359    33,566

So ∑x = 408, ∑y = 367, ∑xy = 30,359, and ∑x² = 33,566. In the formula, n is the
number of pairs, which is 5.

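\[
\begin{aligned}
b &= \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \\
  &= \frac{5 \cdot 30{,}359 - 408 \cdot 367}{5 \cdot 33{,}566 - 408^2} \\
  &= \frac{151{,}795 - 149{,}736}{167{,}830 - 166{,}464} \\
  &= \frac{2059}{1366} \\
  &\approx 1.5073
\end{aligned}
\]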

The slope of this regression line is 1.5073. This tells us that for each one-point
increase on test 1, the predicted score on test 2 increases by 1.5073 points. Now we
calculate the y-intercept a.

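\[
\bar{x} = \frac{\sum x}{n} = \frac{408}{5} = 81.6,
\qquad
\bar{y} = \frac{\sum y}{n} = \frac{367}{5} = 73.4
\]

\[
\begin{aligned}
a &= \bar{y} - b\bar{x} \\
  &= 73.4 - (1.5073)(81.6) \\
  &= -49.5957
\end{aligned}
\]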

The regression equation is ŷ = –49.5957 + 1.5073x. ■
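
The hand calculation above can be checked with a short Python sketch. The function name regression_line is hypothetical; it simply applies the two formulas for b and a to the sums of the data.

```python
# Test 1 (x) and test 2 (y) scores from the example above.
test1 = [83, 86, 76, 92, 71]
test2 = [82, 84, 63, 83, 55]


def regression_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x using the sums formulas (hypothetical helper)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope first, then the intercept from the two means.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return a, b


a, b = regression_line(test1, test2)
print(round(b, 4), round(a, 4))
```

Because the code carries b at full precision instead of rounding it to 1.5073 before computing a, the intercept it reports differs from –49.5957 in the third decimal place. The difference is a rounding effect, not a different line.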



Now let’s examine why this line best fits the set of data. Here is the graph of the
regression line on the scatterplot of the data.

[Figure: scatterplot titled "Test Scores" with the regression line drawn; x-axis Test 1 (60 to 100), y-axis Test 2 (50 to 100)]

Note that the line does not go through any of the data points. The vertical distances
between each point and the regression line are called the residuals.

[Figure: scatterplot titled "Test Scores" with the regression line and one residual marked; x-axis Test 1 (60 to 100), y-axis Test 2 (50 to 100)]

The equations for the slope and y-intercept are designed to minimize the sum of the
squares of the residuals. In other words, if we took each residual and squared it, and
then we totaled the results, this line would have the lowest possible total for this set of
data. For this reason, the regression line is often called the least-squares line.
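
The least-squares property can be illustrated numerically. The sketch below is an illustration and not a proof: it computes the sum of squared residuals for the fitted line from the test score example, then for a handful of nearby lines with slightly different slopes and intercepts. The size of the nudges is an arbitrary choice.

```python
test1 = [83, 86, 76, 92, 71]
test2 = [82, 84, 63, 83, 55]


def sum_squared_residuals(a, b, xs, ys):
    """Total of (y - y_hat)^2 over all points for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Least-squares coefficients at full precision (b = 2059/1366, a = y-bar - b * x-bar).
b = 2059 / 1366
a = 73.4 - b * 81.6

best = sum_squared_residuals(a, b, test1, test2)

# Nudge the intercept and/or slope in either direction; the total should never drop.
nearby = [
    sum_squared_residuals(a + da, b + db, test1, test2)
    for da in (-1.0, 0.0, 1.0)
    for db in (-0.05, 0.0, 0.05)
    if (da, db) != (0.0, 0.0)
]
print(all(total > best for total in nearby))
```

Every nearby line tried here gives a larger total than the regression line, which is what "least squares" means.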
We can use the regression line to predict a y-value for a given x-value. Recall that
the equation of the regression line for the previous example was ŷ = –49.5957
+ 1.5073x, where x represents a student’s score on test 1 and ŷ is the predicted score
for that student on test 2. Suppose that a student got a score of 95 on test 1. What score
can we expect from that student on test 2? All we need to do is plug in 95 for x in the
regression equation.

ŷ = –49.5957 + 1.5073x
= –49.5957 + 1.5073(95)
= –49.5957 + 143.1935
= 93.5978

The predicted score from the regression equation is 93.5978. We could round this to
produce a predicted score of 94. This score seems to fit the pattern of the data.
How about predicting the score of a student who scored 50 on test 1? Plug in 50
for x in the regression equation.
546 CHAPTER 10 Linear Correlation and Regression

ŷ = –49.5957 + 1.5073x
= –49.5957 + 1.5073(50)
= –49.5957 + 75.365
= 25.7693

The predicted score from the regression equation is 25.7693, which rounds to a score
of 26. Does this value seem reasonable? It is hard to say, because we do not have a
student in our sample data with a score on test 1 that is close to 50. Plugging in
a value for x that is not close to the values of x on which the regression equation
is based is called extrapolation. Predictions made by extrapolation are not
reliable. We should use a regression equation only for values of the independent
variable that are close to the values on which the equation is based.
As a further example of extrapolation, consider the y-intercept of the regression
equation ŷ = –49.5957 + 1.5073x. The y-intercept of an equation is the y-value that
corresponds to an x-value of 0. So, if a student had a score of 0 on test 1 we would
predict a score of –49.5957 (or approximately –50) on test 2. Of course, this value
makes no sense. Note that none of the test 1 scores on which the regression equation
is based was close to 0. In fact, none of the test 1 scores was below 71. The
y-intercept of some regression equations can be meaningful if there are x-values
close to 0, representing some sort of initial condition.
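One way to make the warning about extrapolation concrete is to check each x-value against the range of the sample data before trusting the prediction. The Python sketch below is an illustration only. The text says predictions should stay "close to" the observed x-values but gives no numeric rule, so the 5-point `MARGIN` is an assumption, as are the function names.

```python
test1_scores = [83, 86, 76, 92, 71]  # x-values behind the regression equation
A, B = -49.5957, 1.5073              # intercept and slope from the example
MARGIN = 5                           # assumption: "close" = within 5 points


def predict(x):
    """Predicted test 2 score for a test 1 score of x."""
    return A + B * x


def is_extrapolation(x, xs=test1_scores, margin=MARGIN):
    """True when x lies more than `margin` outside the observed x-values."""
    return x < min(xs) - margin or x > max(xs) + margin


for score in (95, 50, 0):
    note = "extrapolation" if is_extrapolation(score) else "near the data"
    print(f"test 1 = {score}: predicted test 2 = {predict(score):.4f} ({note})")
```

With this margin the check flags the predictions for 50 and 0 but not the one for 95. A different margin would draw the line elsewhere.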
Let’s revisit an example from the previous section.

EXAMPLE 10.10  Here are the numbers of hours that 10 students spent studying for a
final exam, and their scores on that exam. Find the equation of the
regression line that best fits the data, and use it to predict the
scores of students who study for 0 hours, 5 hours, 10 hours, 15
hours, 20 hours, and 40 hours.

Hours 7 8 4 9 13
Score 70 76 57 77 91

Hours 5 9 6 16 3
Score 66 82 64 96 50

We begin with a scatterplot.

[Figure: "Hours Studied/Score on Exam" scatterplot. Horizontal axis: Hours (0 to 20); vertical axis: Score (0 to 120).]

There appears to be a linear relationship in the data, so we continue with
the calculations.

          x      y      xy     x²

          7     70     490     49
          8     76     608     64
          4     57     228     16
          9     77     693     81
         13     91    1183    169
          5     66     330     25
          9     82     738     81
          6     64     384     36
         16     96    1536    256
          3     50     150      9

Total    80    729    6340    786

Thus, ∑x = 80, ∑y = 729, ∑xy = 6340, ∑x² = 786, and n = 10.

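$$
\begin{aligned}
b &= \frac{n \cdot \Sigma xy - (\Sigma x) \cdot (\Sigma y)}{n \Sigma x^2 - (\Sigma x)^2} \\
  &= \frac{10 \cdot 6340 - 80 \cdot 729}{10 \cdot 786 - 80^2} \\
  &= \frac{63{,}400 - 58{,}320}{7860 - 6400} \\
  &= \frac{5080}{1460} \\
  &= 3.4795
\end{aligned}
$$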

The slope of this regression line is 3.4795. This tells us that for each additional hour
that a student studies, the predicted score on the exam increases by 3.4795 points.
Now we calculate the y-intercept a.

$$
\bar{x} = \frac{\Sigma x}{n} = \frac{80}{10} = 8
\qquad
\bar{y} = \frac{\Sigma y}{n} = \frac{729}{10} = 72.9
$$

$$
\begin{aligned}
a &= \bar{y} - b\bar{x} \\
  &= 72.9 - (3.4795) \cdot 8 \\
  &= 45.064
\end{aligned}
$$

The regression equation is ŷ = 45.064 + 3.4795x.

Here is the graph of the regression line. Note how well it fits. This is due to the
fact that there is a strong correlation between the two variables. Recall from the
last section that for these two variables r = 0.969, which is nearly perfect
correlation.

[Figure: "Hours Studied/Score on Exam" scatterplot with the regression line. Horizontal axis: Hours (0 to 20); vertical axis: Score (0 to 120).]

Now, for the predictions.

x = 0:    ŷ = 45.064 + 3.4795(0)  = 45.064
x = 5:    ŷ = 45.064 + 3.4795(5)  = 62.4615
x = 10:   ŷ = 45.064 + 3.4795(10) = 79.859
x = 15:   ŷ = 45.064 + 3.4795(15) = 97.2565
x = 20:   ŷ = 45.064 + 3.4795(20) = 114.654
x = 40:   ŷ = 45.064 + 3.4795(40) = 184.244

Here’s a summary of these results in a table, with the scores rounded to the
nearest whole number.

Hours 0 5 10 15 20 40
Score 45 62 80 97 115 184

Let’s begin by investigating the y-intercept. The lowest number of hours studied was
3 hours, by a student who had a score of 50, so a score of 45 for a student who did
not study seems to fit. We can think of the equation as telling us that each student
“begins” with a score of 45.064, and then gains 3.4795 points for each hour studied.
The scores for 5 hours, 10 hours, and 15 hours seem to fit as well. There is a
problem with the prediction for 20 hours of study time. The highest number of hours
studied in the sample data was 16 hours, the next highest was 3 hours lower at 13
hours, and the rest were below 10 hours. So this extrapolation has some problems,
especially considering that the predicted score of 115 exceeds the highest possible
score of 100. One problem with regression lines is that in any small “window” a set
of data may appear to be linear, but if we expand outward the data may actually
follow some sort of curve. Finally, the extrapolation for 40 hours of study time,
a predicted score of 184, is obviously no good. ■
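The checks discussed in this example can be tabulated in code. The sketch below is an illustration, not part of the original text: for each number of hours it reports the rounded prediction, whether the x-value lies inside the range of hours in the sample, and whether the prediction stays at or below the maximum score of 100. The variable names are assumptions. The code uses the unrounded slope, so unrounded predictions can differ slightly from the hand calculations, but the rounded scores agree with the summary table.

```python
hours = [7, 8, 4, 9, 13, 5, 9, 6, 16, 3]           # x: hours studied
scores = [70, 76, 57, 77, 91, 66, 82, 64, 96, 50]  # y: exam scores
MAX_SCORE = 100  # highest possible exam score, as stated in the example

n = len(hours)
sum_x, sum_y = sum(hours), sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
a = sum_y / n - b * sum_x / n                                 # intercept

rows = []
for h in (0, 5, 10, 15, 20, 40):
    y_hat = a + b * h
    inside_sample = min(hours) <= h <= max(hours)  # observed range is 3 to 16
    possible = y_hat <= MAX_SCORE
    rows.append((h, round(y_hat), inside_sample, possible))

print("hours  score  inside sample range  score possible")
for h, score, inside_sample, possible in rows:
    print(f"{h:5d}  {score:5d}  {str(inside_sample):19s}  {possible}")
```

A strict range check also marks x = 0 as outside the sample, even though the example judges that prediction to be plausible. The range check is only a warning sign, and judgment about the context is still needed.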

EXAMPLE 10.11  Here are the gross values, in millions of dollars, of nectarines and
peaches grown in Tulare County, California, for the ten years from
1989 through 1998. First, find the correlation coefficient for these
two variables. Then, find the equation of the regression line for
these two variables. Also, predict the value of peaches produced
for a year in which the value of nectarines produced is $60 million.

Year Nectarines ($ millions) Peaches ($ millions)

1989 47 32
1990 53 47
1991 52 57
1992 68 43
1993 51 53
1994 90 63
1995 74 76
1996 89 64
1997 83 66
1998 56 56
(Source: Visalia Times Delta.)

It is hard to determine which of these variables affects the other, but since we will
be trying to predict the value of peaches for a certain value of nectarines, we will let
x be the gross value of nectarines and y be the gross value of peaches.

[Figure: "Nectarine/Peach Value in Tulare Co." scatterplot. Horizontal axis: Nectarine Value ($ millions), 40 to 100; vertical axis: Peach Value ($ millions), 20 to 80.]

There seems to be a general positive trend. Since the correlation coefficient and
regression equation involve many of the same sums, we will calculate them all first.

          x      y       xy       x²       y²

         47     32    1,504    2,209    1,024
         53     47    2,491    2,809    2,209
         52     57    2,964    2,704    3,249
         68     43    2,924    4,624    1,849
         51     53    2,703    2,601    2,809
         90     63    5,670    8,100    3,969
         74     76    5,624    5,476    5,776
         89     64    5,696    7,921    4,096
         83     66    5,478    6,889    4,356
         56     56    3,136    3,136    3,136

Total   663    557   38,190   46,469   32,473

For our formula, ∑x = 663, ∑y = 557, ∑xy = 38,190, ∑x² = 46,469,
∑y² = 32,473, and n = 10. Now we can plug into the formula to calculate r.

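$$
\begin{aligned}
r &= \frac{\Sigma xy - \dfrac{(\Sigma x)(\Sigma y)}{n}}
          {\sqrt{\left(\Sigma x^2 - \dfrac{(\Sigma x)^2}{n}\right)
                 \left(\Sigma y^2 - \dfrac{(\Sigma y)^2}{n}\right)}} \\
  &= \frac{38{,}190 - \dfrac{663 \cdot 557}{10}}
          {\sqrt{\left(46{,}469 - \dfrac{663^2}{10}\right)
                 \left(32{,}473 - \dfrac{557^2}{10}\right)}} \\
  &= \frac{38{,}190 - \dfrac{369{,}291}{10}}
          {\sqrt{\left(46{,}469 - \dfrac{439{,}569}{10}\right)
                 \left(32{,}473 - \dfrac{310{,}249}{10}\right)}} \\
  &= 0.661
\end{aligned}
$$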

The correlation coefficient is 0.661, so we do have positive correlation. Now for the
calculation of the regression equation.

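$$
\begin{aligned}
b &= \frac{n \cdot \Sigma xy - (\Sigma x) \cdot (\Sigma y)}{n \Sigma x^2 - (\Sigma x)^2} \\
  &= \frac{10 \cdot 38{,}190 - 663 \cdot 557}{10 \cdot 46{,}469 - 663^2} \\
  &= \frac{381{,}900 - 369{,}291}{464{,}690 - 439{,}569} \\
  &= \frac{12{,}609}{25{,}121} \\
  &= 0.5019
\end{aligned}
$$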

The slope of this regression line is 0.5019. This tells us that for each additional
million dollars in nectarine production, the predicted value of peaches produced
increases by 0.5019 million dollars, or $501,900. Now we calculate the y-intercept a.
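
$$
\bar{x} = \frac{\Sigma x}{n} = \frac{663}{10} = 66.3
\qquad
\bar{y} = \frac{\Sigma y}{n} = \frac{557}{10} = 55.7
$$

$$
\begin{aligned}
a &= \bar{y} - b\bar{x} \\
  &= 55.7 - (0.5019) \cdot 66.3 \\
  &= 22.4240
\end{aligned}
$$
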
The regression equation is ŷ = 22.4240 + 0.5019x. Here is the graph of the
regression line.

[Figure: "Nectarine/Peach Value in Tulare Co." scatterplot with the regression line. Horizontal axis: Nectarine Value ($ millions), 40 to 100; vertical axis: Peach Value ($ millions), 20 to 80.]

What will be the value of peaches produced in a year in which the value of
nectarines produced is $60 million? Simply plug in 60 for x in the regression equation.

ŷ = 22.4240 + 0.5019x
= 22.4240 + 0.5019(60)
= 52.538

We predict peach production worth $52.538 million. ■
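The correlation coefficient and the regression equation share most of their sums, so both can be computed in one pass. The Python sketch below is an illustration, not part of the original text, and its variable names are assumptions. It uses only the standard library and follows the sum formulas from this example. The intercept is computed from the unrounded slope, so it differs from the hand-calculated 22.4240 in the third decimal place, while the prediction agrees to three decimals.

```python
import math

nectarines = [47, 53, 52, 68, 51, 90, 74, 89, 83, 56]  # x, $ millions
peaches = [32, 47, 57, 43, 53, 63, 76, 64, 66, 56]     # y, $ millions

n = len(nectarines)
sum_x, sum_y = sum(nectarines), sum(peaches)
sum_xy = sum(x * y for x, y in zip(nectarines, peaches))
sum_x2 = sum(x * x for x in nectarines)
sum_y2 = sum(y * y for y in peaches)

# Correlation coefficient; note the square root over the denominator product.
numerator = sum_xy - sum_x * sum_y / n
denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = numerator / denominator

# Regression line y-hat = a + b*x.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

prediction = a + b * 60  # peach value when nectarine value is $60 million

print(f"r = {r:.3f}")                                 # 0.661
print(f"b = {b:.4f}")                                 # 0.5019
print(f"a = {a:.4f}")                                 # 22.4220
print(f"predicted peaches = ${prediction:.3f} million")  # 52.538
```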

MICROSOFT EXCEL Linear Regression

We can use Microsoft Excel to quickly find the regression equation, and even to draw
the graph of our regression line over our scatterplot. We begin with the regression
equation, reworking the example involving nectarines and peaches.

EXAMPLE 10.12  Here are the gross values, in millions of dollars, of nectarines and
peaches grown in Tulare County, California, for the ten years from
1989 through 1998. Find the equation of the regression line for
these two variables, letting x represent the value of nectarines
produced and y the value of peaches produced.

Year Nectarines ($ millions) Peaches ($ millions)

1989 47 32
1990 53 47
1991 52 57
1992 68 43
1993 51 53
1994 90 63
1995 74 76
1996 89 64
1997 83 66
1998 56 56
(Source: Visalia Times Delta.)

In a new Excel worksheet, type the values of the nectarines in column A, from cell
A1 through A10. In the next column, type the values of the peaches beginning in
cell B1 and continuing through cell B10.

We need to use Excel’s Analysis ToolPak, so we must be sure that it has
been added to the Tools menu. Click on the Tools menu; if you see Data Analysis,
you may skip to the next paragraph. If you do not see Data Analysis,
click on Add-Ins. When a dialog box opens, check the box next to Analysis
ToolPak, and then click OK.

From the Tools menu, select Data Analysis. When the dialog box appears, scroll
down to select Regression and click OK. In the Regression dialog box, type
B1:B10 next to Input Y Range. Next to Input X Range, type A1:A10. Click on OK,
and Excel will give you a great deal of information in a new worksheet. Most of this
information will not be used until the next section. While all of the regression
information is still highlighted, from the Format menu select Column. Then choose
AutoFit Selection. This will make the information easier to see.

Look down the first column of regression information until you find Intercept and
X Variable 1. Under the column labeled Coefficients you will find the value of
a directly next to Intercept, and you will find the value of b directly next to
X Variable 1. Here is a picture of what the output looks like.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.661093529
R Square             0.437044654
Adjusted R Square    0.366675236
Standard Error       10.0946498
Observations         10

ANOVA
              df    SS             MS             F              Significance F
Regression    1     632.8843637    632.8843637    6.210718593    0.037397804
Residual      8     815.2156363    101.9019545
Total         9     1448.1
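Most of the figures in this output can be reproduced from the data with a short calculation, which is a useful check on what each label means. The Python sketch below is an illustration only and is not how Excel computes its output; the variable names are assumptions. It uses the standard definitions of the sums of squares for a regression with one x-variable. The Significance F value is omitted because it requires the F distribution, which the Python standard library does not provide.

```python
import math

nectarines = [47, 53, 52, 68, 51, 90, 74, 89, 83, 56]  # x, $ millions
peaches = [32, 47, 57, 43, 53, 63, 76, 64, 66, 56]     # y, $ millions

n = len(nectarines)
mean_x = sum(nectarines) / n
mean_y = sum(peaches) / n

# Slope and intercept (deviation form, equivalent to the sum formulas).
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(nectarines, peaches))
sxx = sum((x - mean_x) ** 2 for x in nectarines)
b = sxy / sxx
a = mean_y - b * mean_x
fitted = [a + b * x for x in nectarines]

# Sums of squares from the ANOVA table.
ss_total = sum((y - mean_y) ** 2 for y in peaches)
ss_regression = sum((f - mean_y) ** 2 for f in fitted)
ss_residual = sum((y - f) ** 2 for y, f in zip(peaches, fitted))

# Regression statistics (one x-variable, so residual df = n - 2).
r_square = ss_regression / ss_total
multiple_r = math.sqrt(r_square)
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)
standard_error = math.sqrt(ss_residual / (n - 2))
f_statistic = (ss_regression / 1) / (ss_residual / (n - 2))

print(f"Multiple R        {multiple_r:.9f}")
print(f"R Square          {r_square:.9f}")
print(f"Adjusted R Square {adjusted_r_square:.9f}")
print(f"Standard Error    {standard_error:.7f}")
print(f"SS Regression     {ss_regression:.7f}")
print(f"SS Residual       {ss_residual:.7f}")
print(f"SS Total          {ss_total:.1f}")
print(f"F                 {f_statistic:.9f}")
```

Multiple R here equals the correlation coefficient r = 0.661 found by hand in the previous example, because the regression has a single x-variable with a positive slope.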
