Chapter 3
3.1
Overview
In Chapter 2 we defined the regression function μ_Y(x1, . . . ,
xk) of a response variable Y on k predictor variables X1, . . . , Xk,
and introduced many of the basic concepts underlying
regression. In particular, we learned that the best function
for predicting the Y value of an item using the values of
X1, . . . , Xk is the regression function μ_Y(x1, . . . , xk). In this
chapter we focus on the simple but important special case of
straight line regression. Accordingly, throughout this
chapter, we assume that there is only one predictor variable
X and that the graph of the regression function of Y on X is
a straight line, i.e.,
μ_Y(x) = β0 + β1x    (3.1.1)
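As a small sketch of what (3.1.1) says, the function below evaluates the regression line μ_Y(x) = β0 + β1x. The coefficient values used as defaults (β0 = 45, β1 = 4) are not stated in this section; they are the values consistent with the subpopulation means of the illustrative population in Section 3.2, which rise from 45.0 at x = 0 in steps of 4.0 per hour.

```python
# Sketch of the straight-line regression function mu_Y(x) = beta0 + beta1 * x.
# The default coefficients (beta0 = 45, beta1 = 4) match the subpopulation
# means of the illustrative population in Section 3.2; they are not universal.

def mu_y(x, beta0=45.0, beta1=4.0):
    """Mean of the subpopulation of Y values determined by X = x."""
    return beta0 + beta1 * x
```

For example, mu_y(10) gives the mean score of students who studied 10 hours under this line.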
The quantity β1 is the slope and β0 is the intercept of the
regression line. Thus the mean of the Y values in the
subpopulation determined by X = x is given by μ_Y(x) = β0
+ β1x. Recall that σ_Y(x) denotes the standard deviation of
this subpopulation. If the entire population data are
available, then we can calculate exactly the values of β0,
β1, and σ_Y(x) for every allowable x, but since the entire
population is almost never available in a real problem, we
cannot know the values of β0, β1, and σ_Y(x) exactly, so we
must rely on sample data to estimate these and other
unknown quantities (parameters). In this chapter we
consider point and confidence interval estimation of
various quantities of interest, and we also discuss statistical
tests. Section 3.3 introduces two sets of assumptions
under which the theory of linear regression has been well
developed. Section 3.4 discusses point estimation of
parameters of interest. Methods for examining the validity
of regression assumptions are discussed in Section 3.5.
Confidence interval procedures and statistical tests are
described in Sections 3.6 and 3.7, respectively. Section 3.8
introduces the analysis of variance. The coefficient of
correlation and the coefficient of determination are
described in Section 3.9. The effect of measurement errors
on inferences about various model parameters is explained
in Section 3.10. Section 3.11 considers the special case of the
straight line regression model, where the regression line is
known to pass through the origin. Chapter exercises appear in Section 3.12. Laboratory
assignments describing the use of a statistical computing package (MINITAB or SAS)
for straight line regression are in Chapter 3 of the laboratory manual.
Before proceeding further, we present a detailed illustrative example where the
entire population of numbers is assumed to be available, even though it never is in
a real problem, so that you can get a better grasp of the concepts. This example
will also point out how various questions of interest, arising in real problems, can
be answered exactly when the entire population of numbers is available. Statistical
inference procedures, discussed in this chapter and throughout this book, attempt
to provide answers to such questions when only sample data, and not the entire
population, are available.
3.2
An Example of Straight Line Regression
Table D-3 in Appendix D contains a set of data consisting of 2,600 pairs of numbers
(Y, X), where Y is the score (in percent) obtained by a student on a standardized
calculus test administered at a certain university, and X is the number of hours
(recorded to the nearest hour) that the student spent studying for this test. These data
are also stored in the file grades.dat on the data disk. For purposes of illustration,
we suppose that these data form a bivariate population {(Y, X )}. The size of the
population is thus 2,600. An examination of these data shows that there are 13
distinct values of X in the population, and they are 0, 1, 2, ... , 12. The number of
observations, the means, and the standard deviations for each of the corresponding
13 subpopulations of Y values are exhibited in Table 3.2.1. A plot of the subpopulation means, together with the population data values, appears in Figure 3.2.1.
TABLE 3.2.1
Subpopulation Counts, Means, and Standard Deviations for Population Data in Table D-3
Hours    Number      Subpopulation    Subpopulation
  X      of Items    Mean             Standard Deviation
  0        200         45.0             2.881
  1        200         49.0             2.881
  2        200         53.0             2.881
  3        200         57.0             2.881
  4        200         61.0             2.881
  5        200         65.0             2.881
  6        200         69.0             2.881
  7        200         73.0             2.881
  8        200         77.0             2.881
  9        200         81.0             2.881
 10        200         85.0             2.881
 11        200         89.0             2.881
 12        200         93.0             2.881
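Because the subpopulation means in Table 3.2.1 increase by exactly 4.0 for each additional hour of study, the population regression line can be recovered directly from them. The sketch below does this with numpy (a stand-in for whichever package one prefers); since the means are exactly collinear, the fit returns the population intercept and slope rather than estimates.

```python
import numpy as np

# Subpopulation means from Table 3.2.1, for X = 0, 1, ..., 12.
x = np.arange(13)
means = np.array([45.0, 49.0, 53.0, 57.0, 61.0, 65.0, 69.0,
                  73.0, 77.0, 81.0, 85.0, 89.0, 93.0])

# Fit a first-degree polynomial through the means. Because the means lie
# exactly on a line, this recovers beta0 ~ 45.0 and beta1 ~ 4.0.
beta1, beta0 = np.polyfit(x, means, 1)
```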
FIGURE 3.2.1
Scatter plot of the population data values (•) and the subpopulation means (♦) against Hours (X = 0 to 12).
Note These data are specifically concocted for the purpose of illustration so that
the population regression function of Y on X will be exactly a straight line and,
in addition, the subpopulation standard deviations will all be the same. In most real
problems, we cannot expect the population regression function to conform exactly to
a straight line model, and the subpopulation standard deviations cannot be expected
to all be exactly the same. But in many situations, these idealized conditions may
be met approximately. You should also be aware that in actual investigations the
number of subpopulations of Y values, determined by X, can be quite large, and
the sizes of the subpopulations need not all be the same. In this particular example,
however, we have deliberately kept the number of subpopulations rather small (13
to be precise) and the sizes of the subpopulations all equal (200 observations in each
subpopulation) for ease of discussion.
Thus, because we know the entire population {(Y, X)} in this example, we are
able to determine exactly the values of β0, β1, and the subpopulation standard devi-
ations σ_Y(x). Any other population summary quantity (parameter) can be calculated
exactly as well.
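This computation can be sketched in Python (a substitute for the MINITAB or SAS sessions in the laboratory manual). The population of Table D-3 is not reproduced here, so the sketch builds a simulated stand-in with the same structure: 200 Y values for each X = 0, . . . , 12, scattered about the line 45 + 4x with common standard deviation 2.881.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the population of Table D-3 (the actual data are in
# grades.dat): 200 Y values for each X = 0, 1, ..., 12, centered on the line
# 45 + 4x with common standard deviation 2.881, as in Table 3.2.1.
population = {x: 45.0 + 4.0 * x + 2.881 * rng.standard_normal(200)
              for x in range(13)}

# With the entire population in hand, the subpopulation means and standard
# deviations are computed exactly -- no estimation is involved.
sub_means = {x: ys.mean() for x, ys in population.items()}
sub_sds = {x: ys.std() for x, ys in population.items()}  # divide by N, not N - 1
```

Note that the simulated subpopulation means and standard deviations will be close to, but not exactly equal to, 45 + 4x and 2.881, since the stand-in values are random draws rather than the actual Table D-3 data.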
Clearly, we are able to obtain exact answers to the questions in (3.2.1) when the
entire population {(Y, X)} is available to us. In many situations, we can answer
various questions concerning the population even if we do not know the entire
population but know only certain important summary quantities (parameters) of the
population. To demonstrate this in the present example, we begin by examining the
histogram of the subpopulation of Y values determined by X = 10. The histogram is
in Figure 3.2.2, which suggests that this subpopulation is approximately Gaussian.
In fact, we should examine the subpopulation of Y values for each distinct X value
to determine if each is approximately Gaussian.
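One simple, package-free way to check a subpopulation for approximate Gaussianity, beyond inspecting its histogram, is to compare the fraction of values falling within one and two standard deviations of the subpopulation mean against the Gaussian benchmarks of roughly 68% and 95%. The sketch below applies this to simulated data standing in for the X = 10 subpopulation (mean 85.0, standard deviation 2.881, per Table 3.2.1), since the actual data are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in for the X = 10 subpopulation: 200 scores with
# mean 85.0 and standard deviation 2.881 (the values in Table 3.2.1).
ys = 85.0 + 2.881 * rng.standard_normal(200)

mean, sd = ys.mean(), ys.std()
within1 = np.mean(np.abs(ys - mean) <= 1 * sd)  # Gaussian benchmark ~0.683
within2 = np.mean(np.abs(ys - mean) <= 2 * sd)  # Gaussian benchmark ~0.954
```

Large departures from these benchmarks would suggest that the subpopulation is not approximately Gaussian; in practice one would run this check for every distinct X value.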
FIGURE 3.2.2
Histogram of the subpopulation of Y values (scores) determined by X = 10.
Now suppose that we do not have the entire population {(Y, X)} available to