
3

Straight Line Regression

3.1 Overview
In Chapter 2 we defined the regression function μY(x1, ..., xk) of a response
variable Y on k predictor variables X1, ..., Xk and introduced many of the basic
concepts underlying regression. In particular, we learned that the best function
for predicting the Y value of an item using the values of X1, ..., Xk is the
regression function μY(x1, ..., xk). In this chapter we focus on the simple but
important special case of straight line regression. Accordingly, throughout this
chapter, we assume that there is only one predictor variable X and that the graph
of the regression function of Y on X is a straight line, i.e.,

    μY(x) = β0 + β1x                                (3.1.1)

The quantity β1 is the slope and β0 is the intercept of the regression line. Thus
the mean of the Y values in the subpopulation determined by X = x is given by
μY(x) = β0 + β1x. Recall that σY(x) denotes the standard deviation of this
subpopulation. If the entire population data are available, then we can calculate
exactly the values of β0, β1, and σY(x) for every allowable x; but since the
entire population is almost never available in a real problem, we cannot know the
values of β0, β1, and σY(x) exactly, so we must rely on sample data to estimate
these and other unknown quantities (parameters). In this chapter we consider
point and confidence interval estimation of various quantities of interest, and
we also discuss statistical tests. Section 3.3 introduces two sets of assumptions
under which the theory of linear regression has been well developed. Section 3.4
discusses point estimation of parameters of interest. Methods for examining the
validity of regression assumptions are discussed in Section 3.5. Confidence
interval procedures and statistical tests are described in Sections 3.6 and 3.7,
respectively. Section 3.8 introduces the analysis of variance. The coefficient of
correlation and the coefficient of determination are described in Section 3.9.
The effect of measurement errors on inferences about various model parameters is
explained in Section 3.10. Section 3.11 considers the special case of the
straight line regression model where the regression line is known to pass through
the origin. Chapter exercises appear in Section 3.12. Laboratory assignments
describing the use of a statistical computing package (MINITAB or SAS) for
straight line regression are in Chapter 3 of the laboratory manual.
Before proceeding further, we present a detailed illustrative example where the
entire population of numbers is assumed to be available, even though it never is in
a real problem, so that you can get a better grasp of the concepts. This example
will also point out how various questions of interest, arising in real problems, can
be answered exactly when the entire population of numbers is available. Statistical
inference procedures, discussed in this chapter and throughout this book, attempt
to provide answers to such questions when only sample data, and not the entire
population, are available.

3.2 An Example of Straight Line Regression
Table D-3 in Appendix D contains a set of data consisting of 2,600 pairs of numbers
(Y, X), where Y is the score (in percent) obtained by a student on a standardized
calculus test administered at a certain university, and X is the number of hours
(recorded to the nearest hour) that the student spent studying for this test. These data
are also stored in the file grades.dat on the data disk. For purposes of illustration,
we suppose that these data form a bivariate population {(Y, X )}. The size of the
population is thus 2,600. An examination of these data shows that there are 13
distinct values of X in the population, and they are 0, 1, 2, ... , 12. The number of
observations, the means, and the standard deviations for each of the corresponding
13 subpopulations of Y values are exhibited in Table 3.2.1. A plot of the means of the

TABLE 3.2.1
Subpopulation Counts, Means, and Standard Deviations for Population Data in Table D-3

Hours    Number      Subpopulation    Subpopulation
X        of Items    Mean             Standard Deviation
0        200         45.0             2.881
1        200         49.0             2.881
2        200         53.0             2.881
3        200         57.0             2.881
4        200         61.0             2.881
5        200         65.0             2.881
6        200         69.0             2.881
7        200         73.0             2.881
8        200         77.0             2.881
9        200         81.0             2.881
10       200         85.0             2.881
11       200         89.0             2.881
12       200         93.0             2.881

Y values of these 13 subpopulations against the corresponding X values (i.e., a
plot of μY(x) against x for all allowable x values) is shown in Figure 3.2.1.
This plot clearly shows that the regression function of Y on X is of the form
μY(x) = β0 + β1x; i.e., the subpopulation means for Y lie on a straight line
when plotted against the corresponding values of X. Furthermore, we can calculate
the values of β0 and β1 explicitly. In fact, the value of β0 is 45.0 because the
mean value of Y corresponding to X = 0 is 45.0 (see Table 3.2.1). Also the value
of β1 is 4.0 because the increase in the mean value of Y for a unit increase in X
is easily seen to be 4.0%. Hence the population regression function is

    μY(x) = 45.0 + 4.0x

Observe also that the subpopulation standard deviations are all equal to 2.881.
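As an illustrative aside (not part of the original text), the way β0 and β1 are
read off Table 3.2.1 can be mirrored in a short script; the means below are
copied from the table:

```python
# Sketch: recovering the intercept and slope of the population regression
# line from the 13 subpopulation means in Table 3.2.1.

xs = list(range(13))  # hours studied: 0, 1, ..., 12
means = [45.0, 49.0, 53.0, 57.0, 61.0, 65.0,
         69.0, 73.0, 77.0, 81.0, 85.0, 89.0, 93.0]

# The intercept beta0 is the mean response at x = 0.
beta0 = means[0]

# The slope beta1 is the common increase in the mean per unit increase in x.
increases = [means[i + 1] - means[i] for i in range(len(means) - 1)]
beta1 = increases[0]

# Check that the means really do lie on a straight line: every unit
# increase in x raises the mean by the same amount.
assert all(abs(d - beta1) < 1e-9 for d in increases)

print(f"mu_Y(x) = {beta0} + {beta1} x")  # -> mu_Y(x) = 45.0 + 4.0 x
```

Here the straight-line form is verified directly because every consecutive
difference of means is the same; with sample data this would instead be an
estimation problem, which is the subject of the rest of the chapter.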

FIGURE 3.2.1
[Scatter plot of the population data values, score against hours studied
(Hours, 0 to 12), with the 13 subpopulation means marked.]

Note These data are specifically concocted for the purpose of illustration so that
the population regression function of Y on X will be exactly a straight line and,
in addition, the subpopulation standard deviations will all be the same. In most real
problems, we cannot expect the population regression function to conform exactly to
a straight line model, and the subpopulation standard deviations cannot be expected
to all be exactly the same. But in many situations, these idealized conditions may
be met approximately. You should also be aware that in actual investigations the
number of subpopulations of Y values, determined by X, can be quite large, and
the sizes of the subpopulations need not all be the same. In this particular example,
however, we have deliberately kept the number of subpopulations rather small (13
to be precise) and the sizes of the subpopulations all equal (200 observations in each
subpopulation) for ease of discussion.

Thus, because we know the entire population {(Y, X)} in this example, we are
able to determine exactly the values of β0, β1, and the subpopulation standard
deviations σY(x). Any other population summary quantity (parameter) can be
calculated exactly as well.

Some Questions of Interest


A student who is considering taking this calculus test may be interested in knowing
the answers to the following questions:
1 What is the average increase in score per additional hour of studying time?
2 What is the average score of students who did not study at all for the test?
3 What is the best predicted value of the score of a student who spent 10 hours
studying for this test?
4 Of all the students in the population who spent 10 hours studying for the test,
what proportion obtained a score of 90% or above?
(3.2.1)
We give answers to these four questions by three methods.
a Answers based on the entire population data
b Answers based on only population parameters
c Answers based on only a random sample from the population
Of course, in any real problem we can use only method (c) to obtain answers, but
we give the answers to questions (1)-(4) of (3.2.1) by all three methods to help
you understand that samples really can help answer questions about the population.

a Answers Based on the Entire Population Data


Answers to the preceding questions based on the entire population data are as
follows:
1 The increase in the average score for each additional hour of studying time is
equal to β1, the slope of the regression line of Y on X, which has the value 4.0.
2 The average score of students who did not study at all for the test (i.e.,
X = 0) is μY(0), which is equal to the intercept β0 of the regression line, which
has the value 45.0.
3 The best predicted value of the score of a student in this population who spent
10 hours studying for this test is μY(10) = 45.0 + 4.0(10) = 85.0.
4 In Table D-3 in Appendix D, an examination of the subpopulation of Y values
corresponding to X = 10 shows that 11 out of the 200 students in this
subpopulation obtained a score of 90% or above. Thus the required proportion is
11/200 = 0.055.
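The four answers above can be sketched in code. Since the population data of
Table D-3 are not reproduced here, the X = 10 subpopulation below is a simulated
stand-in (200 Gaussian scores with mean 85 and standard deviation 2.881), so its
count of scores at or above 90% only approximates the exact answer:

```python
# Sketch of answering questions (1)-(4) of (3.2.1) from the population
# parameters plus a simulated stand-in for the X = 10 subpopulation.
import random

beta0, beta1 = 45.0, 4.0  # intercept and slope of mu_Y(x) = 45.0 + 4.0x

# (1) Average increase in score per additional hour: the slope.
answer1 = beta1                      # 4.0

# (2) Average score with no studying: mu_Y(0), the intercept.
answer2 = beta0 + beta1 * 0          # 45.0

# (3) Best prediction for a student who studied 10 hours: mu_Y(10).
answer3 = beta0 + beta1 * 10         # 85.0

# (4) Proportion scoring 90% or above among X = 10. With the real data of
# Table D-3 this is the exact count 11/200 = 0.055; here it is only
# approximated from simulated scores.
random.seed(1)
scores = [random.gauss(85.0, 2.881) for _ in range(200)]
answer4 = sum(s >= 90.0 for s in scores) / 200

print(answer1, answer2, answer3, round(answer4, 3))
```

Questions (1)-(3) need only the regression line itself; question (4) needs the
actual subpopulation of scores (or, as Section (b) shows, a distributional
assumption about it).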

b Answers Based on Only Population Parameters

Clearly, we are able to obtain exact answers to the questions in (3.2.1) when the
entire population {(Y, X)} is available to us. In many situations, we can answer
various questions concerning the population even if we do not know the entire
population but know only certain important summary quantities (parameters) of the
population. To demonstrate this in the present example, we begin by examining the
histogram of the subpopulation of Y values determined by X = 10. The histogram is
in Figure 3.2.2, which suggests that this subpopulation is approximately Gaussian.
In fact, we should examine the subpopulation of Y values for each distinct X value
to determine if each is approximately Gaussian.

FIGURE 3.2.2
[Histogram of the subpopulation of Y values (scores) for X = 10.]

Now suppose that we do not have the entire population {(Y, X)} available to us,
but suppose we do know that the regression function of Y on X is given by
μY(x) = 45.0 + 4.0x and that each subpopulation of Y values has a standard
deviation equal to 2.881. Thus we know the values of β0 and β1, which are 45.0
and 4.0, respectively, and we also know that σY(x) = 2.881 for each allowable x.
Furthermore, by plotting the histogram of Y for each distinct value of X, we can
demonstrate that each subpopulation of Y values is (approximately) Gaussian.
With this information we can answer questions (1)-(4) in (3.2.1). Questions
(1)-(3) can be answered knowing only that the regression function of Y on X is
μY(x) = 45.0 + 4.0x. To answer question (4) we first observe that the mean Y
value for the subpopulation corresponding to an X value of 10 is equal to
45.0 + 4.0(10) = 85 and that its standard deviation is 2.881. We now use the
fact that this subpopulation of Y values is approximately Gaussian. The
proportion of values in a Gaussian population, with
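The calculation being set up here, the proportion of a Gaussian population with
mean 85 and standard deviation 2.881 lying at or above 90, can be sketched with
the standard normal upper-tail probability (an illustration, not the book's code):

```python
# Sketch: proportion of a Gaussian subpopulation (mean 85, sd 2.881)
# scoring 90 or above, via the standard normal upper-tail probability.
import math

mu, sigma = 85.0, 2.881
z = (90.0 - mu) / sigma  # standardized cutoff, about 1.74

# Upper-tail probability 1 - Phi(z), written with the complementary
# error function: P(Y >= 90) = 0.5 * erfc(z / sqrt(2)).
p = 0.5 * math.erfc(z / math.sqrt(2.0))

print(round(p, 3))  # roughly 0.041; compare the exact count 11/200 = 0.055
```

The Gaussian approximation gives a proportion near 0.041, in the same ballpark
as the exact population count of 0.055 obtained in method (a).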
