$y = \beta_0 + \beta_1 x + u$.   (2.1)
Equation (2.1), which is assumed to hold in the population of interest, defines the simple
linear regression model. It is also called the two-variable linear regression model
or bivariate linear regression model because it relates the two variables x and y. We
now discuss the meaning of each of the quantities in (2.1). (Incidentally, the term
regression has origins that are not especially important for most modern econometric
applications, so we will not explain it here. See Stigler [1986] for an engaging history
of regression analysis.)
TABLE 2.1
Terminology for Simple Regression

  y                      x
  Dependent Variable     Independent Variable
  Explained Variable     Explanatory Variable
  Response Variable      Control Variable
  Predicted Variable     Predictor Variable
  Regressand             Regressor
When related by (2.1), the variables y and x have several different names used
interchangeably, as follows: y is called the dependent variable, the explained vari-
able, the response variable, the predicted variable, or the regressand; x is called
the independent variable, the explanatory variable, the control variable, the pre-
dictor variable, or the regressor. (The term covariate is also used for x.) The terms
dependent variable and independent variable are frequently used in econometrics.
But be aware that the label independent here does not refer to the statistical notion
of independence between random variables (see Appendix B).
The terms explained and explanatory variables are probably the most descrip-
tive. Response and control are used mostly in the experimental sciences, where the
variable x is under the experimenters control. We will not use the terms predicted vari-
able and predictor, although you sometimes see these in applications that are purely
about prediction and not causality. Our terminology for simple regression is summarized
in Table 2.1.
The variable u, called the error term or disturbance in the relationship, represents
factors other than x that affect y. A simple regression analysis effectively treats all factors
affecting y other than x as being unobserved. You can usefully think of u as standing for
unobserved.
Equation (2.1) also addresses the issue of the functional relationship between y and x.
If the other factors in u are held fixed, so that the change in u is zero, $\Delta u = 0$, then x has a linear effect on y:

$\Delta y = \beta_1 \Delta x$  if  $\Delta u = 0$.   (2.2)

Thus, the change in y is simply $\beta_1$ multiplied by the change in x. This means that $\beta_1$ is the slope parameter in the relationship between y and x, holding the other factors in u fixed; it is of primary interest in applied economics. The intercept parameter $\beta_0$, sometimes called the constant term, also has its uses, although it is rarely central to an analysis.
EXAMPLE 2.1
(Soybean Yield and Fertilizer)

Suppose that soybean yield is determined by the model

$yield = \beta_0 + \beta_1 fertilizer + u$,   (2.3)

so that y = yield and x = fertilizer. The agricultural researcher is interested in the effect of fertilizer on yield, holding other factors fixed. This effect is given by $\beta_1$. The error term u contains factors such as land quality, rainfall, and so on. The coefficient $\beta_1$ measures the effect of fertilizer on yield, holding other factors fixed: $\Delta yield = \beta_1 \Delta fertilizer$.
EXAMPLE 2.2
(A Simple Wage Equation)

A model relating a person's wage to observed education and other unobserved factors is

$wage = \beta_0 + \beta_1 educ + u$.   (2.4)

If wage is measured in dollars per hour and educ is years of education, then $\beta_1$ measures the change in hourly wage given another year of education, holding all other factors fixed. Some of those factors include labor force experience, innate ability, tenure with current employer, work ethic, and innumerable other things.
The linearity of (2.1) implies that a one-unit change in x has the same effect on y,
regardless of the initial value of x. This is unrealistic for many economic applications. For
example, in the wage-education example, we might want to allow for increasing returns:
the next year of education has a larger effect on wages than did the previous year. We will
see how to allow for such possibilities in Section 2.4.
The most difficult issue to address is whether model (2.1) really allows us to draw ceteris
paribus conclusions about how x affects y. We just saw in equation (2.2) that $\beta_1$ does measure the effect of x on y, holding all other factors (in u) fixed. Is this the end of the causality issue? Unfortunately, no. How can we hope to learn in general about the ceteris paribus effect of x on y, holding other factors fixed, when we are ignoring all those other factors?
Section 2.5 will show that we are only able to get reliable estimators of $\beta_0$ and $\beta_1$ from a random sample of data when we make an assumption restricting how the unobservable u is related to the explanatory variable x. Without such a restriction, we will not be able to estimate the ceteris paribus effect, $\beta_1$. Because u and x are random variables, we need a concept grounded in probability.
Before we state the key assumption about how x and u are related, we can always make
one assumption about u. As long as the intercept $\beta_0$ is included in the equation, nothing is lost by assuming that the average value of u in the population is zero. Mathematically,

$E(u) = 0$.   (2.5)
Assumption (2.5) says nothing about the relationship between u and x, but simply makes
a statement about the distribution of the unobservables in the population. Using the pre-
vious examples for illustration, we can see that assumption (2.5) is not very restrictive. In
Example 2.1, we lose nothing by normalizing the unobserved factors affecting soybean
yield, such as land quality, to have an average of zero in the population of all cultivated
plots. The same is true of the unobserved factors in Example 2.2. Without loss of gener-
ality, we can assume that things such as average ability are zero in the population of all
working people. If you are not convinced, you should work through Problem 2.2 to see
that we can always redefine the intercept in equation (2.1) to make (2.5) true.
We now turn to the crucial assumption regarding how u and x are related. A natural
measure of the association between two random variables is the correlation coefficient.
(See Appendix B for definition and properties.) If u and x are uncorrelated, then, as ran-
dom variables, they are not linearly related. Assuming that u and x are uncorrelated goes
a long way toward defining the sense in which u and x should be unrelated in equation
(2.1). But it does not go far enough, because correlation measures only linear dependence
between u and x. Correlation has a somewhat counterintuitive feature: it is possible for u
to be uncorrelated with x while being correlated with functions of x, such as $x^2$. (See
Section B.4 for further discussion.) This possibility is not acceptable for most regression
purposes, as it causes problems for interpreting the model and for deriving statistical prop-
erties. A better assumption involves the expected value of u given x.
Because u and x are random variables, we can define the conditional distribution of u
given any value of x. In particular, for any x, we can obtain the expected (or average) value
of u for that slice of the population described by the value of x. The crucial assumption is
that the average value of u does not depend on the value of x. We can write this as
$E(u|x) = E(u) = 0$,   (2.6)
where the second equality follows from (2.5). The first equality in equation (2.6) is the
new assumption. It says that, for any given value of x, the average of the unobservables
is the same and therefore must equal the average value of u in the population. When we
combine the first equality in equation (2.6) with assumption (2.5), we obtain the zero
conditional mean assumption.
Let us see what (2.6) entails in the wage example. To simplify the discussion, assume
that u is the same as innate ability. Then (2.6) requires that the average level of ability
is the same regardless of years of education. For example, if $E(abil|8)$ denotes the average ability for the group of all people with eight years of education, and $E(abil|16)$
denotes the average ability among people in the population with sixteen years of edu-
cation, then (2.6) implies that these must be the same. In fact, the average ability level
must be the same for all education levels. If, for example, we think that average ability
increases with years of education, then (2.6) is false. (This would happen if, on aver-
age, people with more ability choose to become more educated.) As we cannot observe
innate ability, we have no way of knowing whether or not average ability is the same
for all education levels. But this is an issue that we must address before relying on sim-
ple regression analysis.
In the fertilizer example, if fertilizer amounts are chosen independently of other features
of the plots, then (2.6) will hold: the aver-
age land quality will not depend on the
amount of fertilizer. However, if more fer-
tilizer is put on the higher-quality plots of
land, then the expected value of u changes
with the level of fertilizer, and (2.6) fails.
Assumption (2.6) gives $\beta_1$ another interpretation that is often useful. Taking the expected value of (2.1) conditional on x and using $E(u|x) = 0$ gives

$E(y|x) = \beta_0 + \beta_1 x$.   (2.8)

QUESTION 2.1
Suppose that a score on a final exam, score, depends on classes attended (attend) and unobserved factors that affect exam performance (such as student ability). Then

$score = \beta_0 + \beta_1 attend + u$.   (2.7)

When would you expect this model to satisfy (2.6)?
FIGURE 2.1
$E(y|x)$ as a linear function of x. [The figure shows the line $E(y|x) = \beta_0 + \beta_1 x$; for each value of x, such as $x_1$, $x_2$, and $x_3$, the distribution of y is centered about the line.]
Equation (2.8) shows that the population regression function (PRF), $E(y|x)$, is a linear function of x. The linearity means that a one-unit increase in x changes the expected value of y by the amount $\beta_1$. For any given value of x, the distribution of y is centered about $E(y|x)$, as illustrated in Figure 2.1.
When (2.6) is true, it is useful to break y into two components. The piece $\beta_0 + \beta_1 x$ is sometimes called the systematic part of y, that is, the part of y explained by x, and u is called the unsystematic part, or the part of y not explained by x. We will use assumption (2.6) in the next section for motivating estimates of $\beta_0$ and $\beta_1$. This assumption is also crucial for the statistical analysis in Section 2.5.
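The zero conditional mean assumption can be made concrete with a short simulation. The sketch below is not part of the original text; the population parameters and distributions are arbitrary choices. It draws data from a model satisfying (2.6) and checks that the average of u is close to zero both overall and within narrow slices of x.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population parameters (illustrative only)
beta0, beta1 = 1.0, 0.5

x = rng.uniform(0, 10, size=n)
u = rng.normal(0, 2, size=n)        # drawn independently of x, so E(u|x) = 0 holds
y = beta0 + beta1 * x + u

# Check E(u) is near 0 overall and within slices of x
print("overall mean of u:", round(u.mean(), 3))
for lo in range(0, 10, 2):
    mask = (x >= lo) & (x < lo + 2)
    print(f"mean of u for x in [{lo},{lo+2}): {u[mask].mean():.3f}")
```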
2.2 Deriving the Ordinary Least Squares Estimates
Now that we have discussed the basic ingredients of the simple regression model, we will
address the important issue of how to estimate the parameters $\beta_0$ and $\beta_1$ in equation (2.1).
To do this, we need a sample from the population. Let $\{(x_i, y_i): i = 1, \ldots, n\}$ denote a random sample of size n from the population. Because these data come from (2.1), we can write

$y_i = \beta_0 + \beta_1 x_i + u_i$   (2.9)

for each i. Here, $u_i$ is the error term for observation i because it contains all factors affecting $y_i$ other than $x_i$.
As an example, $x_i$ might be the annual income and $y_i$ the annual savings for family i during a particular year. If we have collected data on fifteen families, then n = 15. A scat-
terplot of such a data set is given in Figure 2.2, along with the (necessarily fictitious)
population regression function.
We must decide how to use these data to obtain estimates of the intercept and slope
in the population regression of savings on income.
There are several ways to motivate the following estimation procedure. We will use
(2.5) and an important implication of assumption (2.6): in the population, u is uncorre-
lated with x. Therefore, we see that u has zero expected value and that the covariance
between x and u is zero:
$E(u) = 0$   (2.10)

and

$\mathrm{Cov}(x, u) = E(xu) = 0$,   (2.11)

where the first equality in (2.11) follows from (2.10). (See Section B.4 for the definition and properties of covariance.) In terms of the observable variables x and y and the unknown parameters $\beta_0$ and $\beta_1$, equations (2.10) and (2.11) can be written as

$E(y - \beta_0 - \beta_1 x) = 0$   (2.12)

and

$E[x(y - \beta_0 - \beta_1 x)] = 0$,   (2.13)
respectively. Equations (2.12) and (2.13) imply two restrictions on the joint probability distribution of (x, y) in the population. Since there are two unknown parameters to estimate, we might hope that equations (2.12) and (2.13) can be used to obtain good estimators of $\beta_0$ and $\beta_1$. In fact, they can be. Given a sample of data, we choose estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ to solve the sample counterparts of (2.12) and (2.13):

$n^{-1} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$   (2.14)

and

$n^{-1} \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$.   (2.15)
FIGURE 2.2
Scatterplot of savings and income for 15 families, and the population regression $E(savings|income) = \beta_0 + \beta_1 income$.
This is an example of the method of moments approach to estimation. (See Section C.4 for a
discussion of different estimation approaches.) These equations can be solved for $\hat{\beta}_0$ and $\hat{\beta}_1$.
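Because the sample moment conditions (2.14) and (2.15) are linear in the two unknowns, they can be solved directly as a 2-by-2 linear system. The sketch below (illustrative data, not from the text) sets up and solves that system with NumPy.

```python
import numpy as np

# Illustrative sample data (hypothetical income and savings, arbitrary units)
x = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0])
y = np.array([ 1.2,  2.0,  1.8,  3.1,  3.5,  4.0,  4.6])
n = len(x)

# The sample counterparts of (2.14) and (2.15) are linear in (beta0_hat, beta1_hat):
#   sum(y_i)     = n*beta0_hat        + beta1_hat*sum(x_i)
#   sum(x_i*y_i) = beta0_hat*sum(x_i) + beta1_hat*sum(x_i**2)
A = np.array([[n,       x.sum()],
              [x.sum(), (x**2).sum()]])
b = np.array([y.sum(), (x * y).sum()])

beta0_hat, beta1_hat = np.linalg.solve(A, b)
print(beta0_hat, beta1_hat)
```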
Using the basic properties of the summation operator from Appendix A, equation
(2.14) can be rewritten as
$\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}$,   (2.16)

where $\bar{y} = n^{-1} \sum_{i=1}^{n} y_i$ is the sample average of the $y_i$, and likewise for $\bar{x}$. This equation allows us to write $\hat{\beta}_0$ in terms of $\hat{\beta}_1$, $\bar{y}$, and $\bar{x}$:

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.   (2.17)

Therefore, once we have the slope estimate $\hat{\beta}_1$, it is straightforward to obtain the intercept estimate $\hat{\beta}_0$, given $\bar{y}$ and $\bar{x}$.
Dropping the $n^{-1}$ in (2.15) (since it does not affect the solution) and plugging (2.17) into (2.15) yields

$\sum_{i=1}^{n} x_i [y_i - (\bar{y} - \hat{\beta}_1 \bar{x}) - \hat{\beta}_1 x_i] = 0$,

which, upon rearrangement, gives

$\sum_{i=1}^{n} x_i (y_i - \bar{y}) = \hat{\beta}_1 \sum_{i=1}^{n} x_i (x_i - \bar{x})$.

From basic properties of the summation operator [see (A.7) and (A.8)],

$\sum_{i=1}^{n} x_i (x_i - \bar{x}) = \sum_{i=1}^{n} (x_i - \bar{x})^2$  and  $\sum_{i=1}^{n} x_i (y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$.

Therefore, provided that

$\sum_{i=1}^{n} (x_i - \bar{x})^2 > 0$,   (2.18)

the estimated slope is

$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$.   (2.19)

Equation (2.19) is simply the sample covariance between x and y divided by the sample variance of x. (See Appendix C. Dividing both the numerator and the denominator by $n - 1$ changes nothing.) This makes sense because $\beta_1$ equals the population covariance divided by the variance of x when $E(u) = 0$ and $\mathrm{Cov}(x, u) = 0$. An immediate
implication is that if x and y are positively correlated in the sample, then $\hat{\beta}_1$ is positive; if x and y are negatively correlated, then $\hat{\beta}_1$ is negative.
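For a concrete sense of the formulas, the following sketch computes (2.19) and (2.17) directly and cross-checks the result against a library routine. The data values are made up for illustration.

```python
import numpy as np

# Hypothetical data: x could be income, y savings (illustrative values only)
x = np.array([20.0, 28.0, 35.0, 41.0, 47.0, 55.0, 60.0, 72.0])
y = np.array([ 1.5,  2.2,  2.0,  3.4,  3.1,  4.8,  4.2,  6.0])

# Equation (2.19): sample covariance of (x, y) over sample variance of x
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
# Equation (2.17): intercept from the sample means and the slope
beta0_hat = y.mean() - beta1_hat * x.mean()

print("beta0_hat:", beta0_hat, "beta1_hat:", beta1_hat)

# Cross-check against a library routine (polyfit returns highest degree first)
slope, intercept = np.polyfit(x, y, deg=1)
print("polyfit:  ", intercept, slope)
```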
Although the method for obtaining (2.17) and (2.19) is motivated by (2.6), the only
assumption needed to compute the estimates for a particular sample is (2.18). This is
hardly an assumption at all: (2.18) is true provided the $x_i$ in the sample are not all equal to the same value. If (2.18) fails, then we have either been unlucky in obtaining our sample from the population or we have not specified an interesting problem (x does not vary in the population). For example, if y = wage and x = educ, then (2.18) fails only if
everyone in the sample has the same amount of education (for example, if everyone is
a high school graduate; see Figure 2.3). If just one person has a different amount of edu-
cation, then (2.18) holds, and the estimates can be computed.
The estimates given in (2.17) and (2.19) are called the ordinary least squares (OLS) estimates of $\beta_0$ and $\beta_1$. To justify this name, for any $\hat{\beta}_0$ and $\hat{\beta}_1$ define a fitted value for y when $x = x_i$ as

$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.   (2.20)
FIGURE 2.3
A scatterplot of wage against education when $educ_i = 12$ for all i.
This is the value we predict for y when $x = x_i$ for the given intercept and slope. There is a fitted value for each observation in the sample. The residual for observation i is the difference between the actual $y_i$ and its fitted value:

$\hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$.   (2.21)
Again, there are n such residuals. [These are not the same as the errors in (2.9), a point
we return to in Section 2.5.] The fitted values and residuals are indicated in Figure 2.4.
Now, suppose we choose $\hat{\beta}_0$ and $\hat{\beta}_1$ to make the sum of squared residuals,

$\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$,   (2.22)

as small as possible. The appendix to this chapter shows that the conditions necessary for $(\hat{\beta}_0, \hat{\beta}_1)$ to minimize (2.22) are given exactly by equations (2.14) and (2.15), without $n^{-1}$.
Equations (2.14) and (2.15) are often called the first order conditions for the OLS esti-
mates, a term that comes from optimization using calculus (see Appendix A). From our
previous calculations, we know that the solutions to the OLS first order conditions are
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$,   (2.23)

where it is understood that $\hat{\beta}_0$ and $\hat{\beta}_1$ have been obtained using equations (2.17) and (2.19).

FIGURE 2.4
Fitted values and residuals. [The figure shows the OLS regression line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$; for a data point $(x_i, y_i)$, the fitted value $\hat{y}_i$ lies on the line and the residual $\hat{u}_i$ is the vertical distance between $y_i$ and $\hat{y}_i$.]
The notation $\hat{y}$, read as "y hat," emphasizes that the predicted values from equation (2.23) are estimates. The intercept, $\hat{\beta}_0$, is the predicted value of y when x = 0, although in some cases it will not make sense to set x = 0. In those situations, $\hat{\beta}_0$ is not, in itself, very interesting. When using (2.23) to compute predicted values of y for various values of x, we must account for the intercept in the calculations. Equation (2.23) is also called the sample regression function (SRF) because it is the estimated version of the population regression function $E(y|x) = \beta_0 + \beta_1 x$. It is important to remember that the PRF is something fixed, but unknown, in the population. Because the SRF is obtained for a given sample of data, a new sample will generate a different slope and intercept in equation (2.23).
In most cases, the slope estimate, which we can write as

$\hat{\beta}_1 = \Delta\hat{y}/\Delta x$,   (2.24)

is of primary interest. It tells us the amount by which $\hat{y}$ changes when x increases by one unit. Equivalently,

$\Delta\hat{y} = \hat{\beta}_1 \Delta x$,   (2.25)

so that given any change in x (whether positive or negative), we can compute the predicted change in y.
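The least squares idea itself is easy to check numerically. In the sketch below (made-up data, not from the text), the sum of squared residuals (2.22) is evaluated at the OLS slope and at nearby candidate slopes; the OLS value is the smallest.

```python
import numpy as np

# Made-up sample (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0_hat = y.mean() - beta1_hat * x.mean()

def ssr(b0, b1):
    """Sum of squared residuals (2.22) for candidate intercept b0 and slope b1."""
    return np.sum((y - b0 - b1 * x)**2)

print("SSR at the OLS estimates:", ssr(beta0_hat, beta1_hat))
for delta in (-0.2, -0.1, 0.1, 0.2):
    b1 = beta1_hat + delta               # perturb the slope
    b0 = y.mean() - b1 * x.mean()        # re-solve the intercept for that slope
    print(f"SSR with slope {b1:.3f}:", ssr(b0, b1))
```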
We now present several examples of simple regression obtained by using real data. In
other words, we find the intercept and slope estimates with equations (2.17) and (2.19). Since
these examples involve many observations, the calculations were done using an economet-
rics software package. At this point, you should be careful not to read too much into these
regressions; they are not necessarily uncovering a causal relationship. We have said nothing
so far about the statistical properties of OLS. In Section 2.5, we consider statistical proper-
ties after we explicitly impose assumptions on the population model equation (2.1).
EXAMPLE 2.3
(CEO Salary and Return on Equity)

For the population of chief executive officers, let y be annual salary (salary) in thousands of dollars. Thus, y = 856.3 indicates an annual salary of $856,300, and y = 1452.6 indicates a salary of $1,452,600. Let x be the average return on equity (roe) for the CEO's firm for the previous three years. (Return on equity is defined in terms of net income as a percentage of common equity.) For example, if roe = 10, then average return on equity is 10 percent.

To study the relationship between this measure of firm performance and CEO compensation, we postulate the simple model

$salary = \beta_0 + \beta_1 roe + u$.

The slope parameter $\beta_1$ measures the change in annual salary, in thousands of dollars, when return on equity increases by one percentage point. Because a higher roe is good for the company, we think $\beta_1 > 0$.

The data set CEOSAL1.RAW contains information on 209 CEOs for the year 1990; these data were obtained from Business Week (5/6/91). In this sample, the average annual salary is $1,281,120, with the smallest and largest being $223,000 and $14,822,000, respectively. The average return on equity for the years 1988, 1989, and 1990 is 17.18 percent, with the smallest and largest values being 0.5 and 56.3 percent, respectively.

Using the data in CEOSAL1.RAW, the OLS regression line relating salary to roe is

$\widehat{salary} = 963.191 + 18.501\ roe$,   (2.26)

where the intercept and slope estimates have been rounded to three decimal places; we use "salary hat" to indicate that this is an estimated equation. How do we interpret the equation? First, if the return on equity is zero, roe = 0, then the predicted salary is the intercept, 963.191, which equals $963,191 since salary is measured in thousands. Next, we can write the predicted change in salary as a function of the change in roe: $\Delta\widehat{salary} = 18.501\,\Delta roe$. This means that if the return on equity increases by one percentage point, $\Delta roe = 1$, then salary is predicted to change by about 18.5, or $18,500. Because (2.26) is a linear equation, this is the estimated change regardless of the initial salary.

We can easily use (2.26) to compare predicted salaries at different values of roe. Suppose roe = 30. Then $\widehat{salary} = 963.191 + 18.501(30) = 1518.221$, which is just over $1.5 million. However, this does not mean that a particular CEO whose firm had an roe = 30 earns $1,518,221. Many other factors affect salary. This is just our prediction from the OLS regression line (2.26). The estimated line is graphed in Figure 2.5, along with the population regression function $E(salary|roe)$. We will never know the PRF, so we cannot tell how close the SRF is to the PRF. Another sample of data will give a different regression line, which may or may not be closer to the population regression line.
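For readers who want to reproduce a line like (2.26) with software, the sketch below uses statsmodels. It assumes the CEOSAL1 data have been exported to a CSV file named ceosal1.csv with columns salary and roe; the file name and its availability are assumptions, not part of the original example.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumes the CEOSAL1 data are available locally as a CSV with columns
# 'salary' (thousands of dollars) and 'roe' (percent); the file name is hypothetical.
df = pd.read_csv("ceosal1.csv")

model = smf.ols("salary ~ roe", data=df).fit()
print(model.params)        # intercept and slope, comparable to equation (2.26)

# Predicted salary (in thousands) at roe = 30, as in the text's calculation
b0, b1 = model.params["Intercept"], model.params["roe"]
print("predicted salary at roe = 30:", b0 + b1 * 30)
```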
FIGURE 2.5
The OLS regression line $\widehat{salary} = 963.191 + 18.501\ roe$ and the (unknown) population regression function $E(salary|roe) = \beta_0 + \beta_1 roe$.
EXAMPLE 2.4
(Wage and Education)

For the population of people in the workforce in 1976, let y = wage, where wage is measured in dollars per hour. Thus, for a particular person, if wage = 6.75, the hourly wage is $6.75. Let x = educ denote years of schooling; for example, educ = 12 corresponds to a complete high school education. Since the average wage in the sample is $5.90, the Consumer Price Index indicates that this amount is equivalent to $19.06 in 2003 dollars.

Using the data in WAGE1.RAW where n = 526 individuals, we obtain the following OLS regression line (or sample regression function):

$\widehat{wage} = -0.90 + 0.54\ educ$.   (2.27)
We must interpret this equation with caution. The intercept of -0.90 literally means that a person with no education has a predicted hourly wage of -90 cents an hour. This, of course, is silly. It turns out that only 18 people in the sample of 526 have less than eight years of education. Consequently, it is not surprising that the regression line does poorly at very low levels of education. For a person with eight years of education, the predicted wage is $\widehat{wage} = -0.90 + 0.54(8) = 3.42$, or $3.42 per hour (in 1976 dollars).
The slope estimate in (2.27) implies that one more year of education increases hourly wage
by 54 cents an hour. Therefore, four more years of education increase the predicted wage by
4(0.54) 2.16, or $2.16 per hour. These are fairly large effects. Because of the linear nature
of (2.27), another year of education increases the wage by the same amount, regardless of
the initial level of education. In Section 2.4, we discuss some methods that allow for non-
constant marginal effects of our explanatory variables.
EXAMPLE 2.5
(Voting Outcomes and Campaign Expenditures)

The file VOTE1.RAW contains data on election outcomes and campaign expenditures for 173 two-party races for the U.S. House of Representatives in 1988. There are two candidates in each race, A and B. Let voteA be the percentage of the vote received by Candidate A and shareA be the percentage of total campaign expenditures accounted for by Candidate A. Many factors other than shareA affect the election outcome (including the quality of the candidates and possibly the dollar amounts spent by A and B). Nevertheless, we can estimate a simple regression model to find out whether spending more relative to one's challenger implies a higher percentage of the vote.

The estimated equation using the 173 observations is

$\widehat{voteA} = 26.81 + 0.464\ shareA$.   (2.28)

This means that if the share of Candidate A's spending increases by one percentage point, Candidate A receives almost one-half a percentage point (0.464) more of the total vote. Whether or not this is a causal effect is unclear, but it is not unbelievable. If shareA = 50, voteA is predicted to be about 50, or half the vote.
In some cases, regression analysis is not used to determine causality but to simply look at whether two variables are positively or negatively related, much like a standard correlation analysis. An example of this occurs in Exercise C2.3, where you are asked to use data from Biddle and Hamermesh (1990) on time spent sleeping and working to investigate the tradeoff between these two factors.

QUESTION 2.2
The estimated wage from (2.27), when educ = 8, is $3.42 in 1976 dollars. What is this value in 2003 dollars? (Hint: You have enough information in Example 2.4 to answer this question.)
A Note on Terminology
In most cases, we will indicate the estimation of a relationship through OLS by writing
an equation such as (2.26), (2.27), or (2.28). Sometimes, for the sake of brevity, it is useful
to indicate that an OLS regression has been run without actually writing out the equation.
We will often indicate that equation (2.23) has been obtained by OLS in saying that we
run the regression of
y on x, (2.29)
or simply that we regress y on x. The positions of y and x in (2.29) indicate which is the
dependent variable and which is the independent variable: we always regress the depen-
dent variable on the independent variable. For specific applications, we replace y and x
with their names. Thus, to obtain (2.26), we regress salary on roe, or to obtain (2.28), we
regress voteA on shareA.
When we use such terminology in (2.29), we will always mean that we plan to estimate
the intercept,
0
, along with the slope,
1
. This case is appropriate for the vast majority of
applications. Occasionally, we may want to estimate the relationship between y and x
assuming that the intercept is zero (so that x 0 implies that y 0); we cover this case
briefly in Section 2.6. Unless explicitly stated otherwise, we always estimate an intercept
along with a slope.
2.3 Properties of OLS on Any
Sample of Data
In the previous section, we went through the algebra of deriving the formulas for the OLS
intercept and slope estimates. In this section, we cover some further algebraic properties
of the fitted OLS regression line. The best way to think about these properties is to remem-
ber that they hold, by construction, for any sample of data. The harder task, considering the properties of OLS across all possible random samples of data, is postponed until Section 2.5.
Several of the algebraic properties we are going to derive will appear mundane.
Nevertheless, having a grasp of these properties helps us to figure out what happens to
the OLS estimates and related statistics when the data are manipulated in certain ways,
such as when the measurement units of the dependent and independent variables
change.
QUESTION 2.3
In Example 2.5, what is the predicted vote for Candidate A if shareA = 60 (which means 60 percent)? Does this answer seem reasonable?
TABLE 2.2
Fitted Values and Residuals for the First 15 CEOs

obsno   roe    salary   salaryhat    uhat
  1     14.1    1095    1224.058    -129.0581
  2     10.9    1001    1164.854    -163.8542
  3     23.5    1122    1397.969    -275.9692
  4      5.9     578    1072.348    -494.3484
  5     13.8    1368    1218.508     149.4923
  6     20.0    1145    1333.215    -188.2151
  7     16.4    1078    1266.611    -188.6108
  8     16.3    1094    1264.761    -170.7606
  9     10.5    1237    1157.454      79.54626
Fitted Values and Residuals

We assume that the intercept and slope estimates, $\hat{\beta}_0$ and $\hat{\beta}_1$, have been obtained for the given sample of data. Given $\hat{\beta}_0$ and $\hat{\beta}_1$, we can obtain the fitted value $\hat{y}_i$ for each observation. [This is given by equation (2.20).] By definition, each fitted value $\hat{y}_i$ is on the OLS regression line. The OLS residual associated with observation i, $\hat{u}_i$, is the difference between $y_i$ and its fitted value, as given in equation (2.21). If $\hat{u}_i$ is positive, the line underpredicts $y_i$; if $\hat{u}_i$ is negative, the line overpredicts $y_i$. The ideal case for observation i is when $\hat{u}_i = 0$, but in most cases, every residual is not equal to zero. In other words, none of the data points need actually lie on the OLS line.
EXAMPLE 2.6
(CEO Salary and Return on Equity)

Table 2.2 contains a listing of the first 15 observations in the CEO data set, along with the fitted values, called salaryhat, and the residuals, called uhat.

The first four CEOs have lower salaries than what we predicted from the OLS regression line (2.26); in other words, given only the firm's roe, these CEOs make less than what we predicted. As can be seen from the positive uhat, the fifth CEO makes more than predicted from the OLS regression line.
TABLE 2.2
Fitted Values and Residuals for the First 15 CEOs (Continued)

obsno   roe    salary   salaryhat    uhat
 10     26.3     833    1449.773    -616.7726
 11     25.9     567    1442.372    -875.3721
 12     26.8     933    1459.023    -526.0231
 13     14.8    1339    1237.009     101.9911
 14     22.3     937    1375.768    -438.7678
 15     56.3    2011    2004.808       6.191895
Algebraic Properties of OLS Statistics
There are several useful algebraic properties of OLS estimates and their associated statis-
tics. We now cover the three most important of these.
(1) The sum, and therefore the sample average of the OLS residuals, is zero.
Mathematically,
$\sum_{i=1}^{n} \hat{u}_i = 0$.   (2.30)

This property needs no proof; it follows immediately from the OLS first order condition (2.14), when we remember that the residuals are defined by $\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$. In other words, the OLS estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to make the residuals add up to zero (for any data set). This says nothing about the residual for any particular observation i.

(2) The sample covariance between the regressors and the OLS residuals is zero. This follows from the first order condition (2.15), which can be written in terms of the residuals as

$\sum_{i=1}^{n} x_i \hat{u}_i = 0$.   (2.31)

The sample average of the OLS residuals is zero, so the left-hand side of (2.31) is proportional to the sample covariance between $x_i$ and $\hat{u}_i$.

(3) The point $(\bar{x}, \bar{y})$ is always on the OLS regression line. In other words, if we take equation (2.23) and plug in $\bar{x}$ for x, then the predicted value is $\bar{y}$. This is exactly what equation (2.16) showed us.
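These three properties are mechanical consequences of the first order conditions, so they can be verified on any data set. A minimal check with made-up numbers (not from the text):

```python
import numpy as np

# Illustrative data (made up for this check)
x = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y = np.array([4.0, 7.5, 6.0, 10.0, 11.5, 12.0])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0 = y.mean() - beta1 * x.mean()
resid = y - beta0 - beta1 * x

print(np.isclose(resid.sum(), 0))                       # property (1): residuals sum to zero
print(np.isclose(np.sum(x * resid), 0))                 # property (2): sum of x_i * uhat_i is zero
print(np.isclose(y.mean(), beta0 + beta1 * x.mean()))   # property (3): (xbar, ybar) is on the line
```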
EXAMPLE 2.7
(Wage and Education)

For the data in WAGE1.RAW, the average hourly wage in the sample is 5.90, rounded to two decimal places, and the average education is 12.56. If we plug educ = 12.56 into the OLS regression line (2.27), we get $\widehat{wage} = -0.90 + 0.54(12.56) = 5.8824$, which equals 5.9 when
rounded to the first decimal place. These figures do not exactly agree because we have
rounded the average wage and education, as well as the intercept and slope estimates. If we
did not initially round any of the values, we would get the answers to agree more closely, but
to little useful effect.
Writing each $y_i$ as its fitted value, plus its residual, provides another way to interpret an OLS regression. For each i, write

$y_i = \hat{y}_i + \hat{u}_i$.   (2.32)

From property (1), the average of the residuals is zero; equivalently, the sample average of the fitted values, $\hat{y}_i$, is the same as the sample average of the $y_i$, or $\bar{\hat{y}} = \bar{y}$. Further, properties (1) and (2) can be used to show that the sample covariance between $\hat{y}_i$ and $\hat{u}_i$ is zero. Thus, we can view OLS as decomposing each $y_i$ into two parts, a fitted value and a residual. The fitted values and residuals are uncorrelated in the sample.
Define the total sum of squares (SST), the explained sum of squares (SSE), and
the residual sum of squares (SSR) (also known as the sum of squared residuals), as
follows:
$\mathrm{SST} \equiv \sum_{i=1}^{n} (y_i - \bar{y})^2$.   (2.33)

$\mathrm{SSE} \equiv \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$.   (2.34)

$\mathrm{SSR} \equiv \sum_{i=1}^{n} \hat{u}_i^2$.   (2.35)
SST is a measure of the total sample variation in the $y_i$; that is, it measures how spread out the $y_i$ are in the sample. If we divide SST by $n - 1$, we obtain the sample variance of y, as discussed in Appendix C. Similarly, SSE measures the sample variation in the $\hat{y}_i$ (where we use the fact that $\bar{\hat{y}} = \bar{y}$), and SSR measures the sample variation in the $\hat{u}_i$. The total variation in y can always be expressed as the sum of the explained variation and the unexplained variation:

$\mathrm{SST} = \mathrm{SSE} + \mathrm{SSR}$.   (2.36)

Proving (2.36) is not difficult, but it requires us to use the properties of the summation operator covered in Appendix A. Write

$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} [(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})]^2 = \sum_{i=1}^{n} [\hat{u}_i + (\hat{y}_i - \bar{y})]^2$
$= \sum_{i=1}^{n} \hat{u}_i^2 + 2\sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \mathrm{SSR} + 2\sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) + \mathrm{SSE}$.
Now, (2.36) holds if we show that

$\sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) = 0$.   (2.37)

But we have already claimed that the sample covariance between the residuals and the fitted values is zero, and this covariance is just (2.37) divided by $n - 1$. Thus, we have established (2.36).
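The decomposition (2.36) can likewise be confirmed numerically for any sample. A short check with made-up data (not from the text):

```python
import numpy as np

# Made-up sample (illustrative only)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0])
y = np.array([3.1, 4.5, 4.0, 6.8, 7.2, 9.5, 9.9])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x
u_hat = y - y_hat

sst = np.sum((y - y.mean())**2)
sse = np.sum((y_hat - y.mean())**2)
ssr = np.sum(u_hat**2)

print(sst, sse + ssr)                 # the two totals should match
print(np.isclose(sst, sse + ssr))     # decomposition (2.36)
```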
Some words of caution about SST, SSE, and SSR are in order. There is no uniform
agreement on the names or abbreviations for the three quantities defined in equations
(2.33), (2.34), and (2.35). The total sum of squares is called either SST or TSS, so there
is little confusion here. Unfortunately, the explained sum of squares is sometimes called
the regression sum of squares. If this term is given its natural abbreviation, it can easily
be confused with the term residual sum of squares. Some regression packages refer to
the explained sum of squares as the model sum of squares.
To make matters even worse, the residual sum of squares is often called the error
sum of squares. This is especially unfortunate because, as we will see in Section 2.5,
the errors and the residuals are different quantities. Thus, we will always call (2.35)
the residual sum of squares or the sum of squared residuals. We prefer to use the abbre-
viation SSR to denote the sum of squared residuals, because it is more common in
econometric packages.
Goodness-of-Fit
So far, we have no way of measuring how well the explanatory or independent vari-
able, x, explains the dependent variable, y. It is often useful to compute a number
that summarizes how well the OLS regression line fits the data. In the following
discussion, be sure to remember that we assume that an intercept is estimated along
with the slope.
Assuming that the total sum of squares, SST, is not equal to zero (which is true except in the very unlikely event that all the $y_i$ equal the same value), we can divide (2.36) by SST to get $1 = \mathrm{SSE}/\mathrm{SST} + \mathrm{SSR}/\mathrm{SST}$. The R-squared of the regression, sometimes called the coefficient of determination, is defined as

$R^2 \equiv \mathrm{SSE}/\mathrm{SST} = 1 - \mathrm{SSR}/\mathrm{SST}$.   (2.38)

$R^2$ is the ratio of the explained variation to the total variation; thus, it is interpreted as the fraction of the sample variation in y that is explained by x. The second equality in (2.38) provides another way of computing $R^2$.
From (2.36), the value of $R^2$ is always between zero and one, because SSE can be no greater than SST. When interpreting $R^2$, we usually multiply it by 100 to change it into a percent: $100 \cdot R^2$ is the percentage of the sample variation in y that is explained by x.
If the data points all lie on the same line, OLS provides a perfect fit to the data. In this case, $R^2 = 1$. A value of $R^2$ that is nearly equal to zero indicates a poor fit of the OLS line: very little of the variation in the $y_i$ is captured by the variation in the $\hat{y}_i$ (which all lie on the OLS regression line). In fact, it can be shown that $R^2$ is equal to the square of the sample correlation coefficient between $y_i$ and $\hat{y}_i$. This is where the term "R-squared" came from. (The letter R was traditionally used to denote an estimate of a population correlation coefficient, and its usage has survived in regression analysis.)
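The following sketch (made-up data, not from the text) computes R-squared using both forms in (2.38) and as the squared sample correlation between y and the fitted values, illustrating that the three agree.

```python
import numpy as np

# Made-up sample (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.0, 2.8, 4.1, 3.9, 5.5, 6.1, 6.0, 7.8])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

sst = np.sum((y - y.mean())**2)
sse = np.sum((y_hat - y.mean())**2)
ssr = np.sum((y - y_hat)**2)

r2_a = sse / sst                        # first form in (2.38)
r2_b = 1 - ssr / sst                    # second form in (2.38)
r2_c = np.corrcoef(y, y_hat)[0, 1]**2   # squared sample correlation of y and y_hat
print(r2_a, r2_b, r2_c)                 # all three agree
```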
EXAMPLE 2.8
(CEO Salary and Return on Equity)

In the CEO salary regression, we obtain the following:

$\widehat{salary} = 963.191 + 18.501\ roe$   (2.39)
$n = 209,\ R^2 = 0.0132$.

We have reproduced the OLS regression line and the number of observations for clarity. Using the R-squared (rounded to four decimal places) reported for this equation, we can see how much of the variation in salary is actually explained by the return on equity. The answer is: not much. The firm's return on equity explains only about 1.3 percent of the variation in salaries for this sample of 209 CEOs. That means that 98.7 percent of the salary variation for these CEOs is left unexplained! This lack of explanatory power may not be too surprising because many other characteristics of both the firm and the individual CEO should influence salary; these factors are necessarily included in the errors in a simple regression analysis.
In the social sciences, low R-squareds in regression equations are not uncommon,
especially for cross-sectional analysis. We will discuss this issue more generally under
multiple regression analysis, but it is worth emphasizing now that a seemingly low
R-squared does not necessarily mean that an OLS regression equation is useless. It is still
possible that (2.39) is a good estimate of the ceteris paribus relationship between salary
and roe; whether or not this is true does not depend directly on the size of R-squared.
Students who are first learning econometrics tend to put too much weight on the size of
the R-squared in evaluating regression equations. For now, be aware that using R-squared
as the main gauge of success for an econometric analysis can lead to trouble.
Sometimes, the explanatory variable explains a substantial part of the sample varia-
tion in the dependent variable.
EXAMPLE 2.9
(Voting Outcomes and Campaign Expenditures)

In the voting outcome equation in (2.28), $R^2 = 0.856$. Thus, the share of campaign expenditures explains over 85 percent of the variation in the election outcomes for this sample. This is a sizable portion.
2.4 Units of Measurement
and Functional Form
Two important issues in applied economics are (1) understanding how changing the units
of measurement of the dependent and/or independent variables affects OLS estimates and
(2) knowing how to incorporate popular functional forms used in economics into regres-
sion analysis. The mathematics needed for a full understanding of functional form issues
is reviewed in Appendix A.
The Effects of Changing Units
of Measurement on OLS Statistics
In Example 2.3, we chose to measure annual salary in thousands of dollars, and the return
on equity was measured as a percentage (rather than as a decimal). It is crucial to know
how salary and roe are measured in this example in order to make sense of the estimates
in equation (2.39).
We must also know that OLS estimates change in entirely expected ways when the
units of measurement of the dependent and independent variables change. In Example 2.3,
suppose that, rather than measuring salary in thousands of dollars, we measure it in dollars.
Let salardol be salary in dollars (salardol = 845,761 would be interpreted as $845,761). Of course, salardol has a simple relationship to the salary measured in thousands of dollars: salardol = 1,000·salary. We do not need to actually run the regression of salardol on roe to know that the estimated equation is

$\widehat{salardol} = 963{,}191 + 18{,}501\ roe$.   (2.40)

We obtain the intercept and slope in (2.40) simply by multiplying the intercept and the slope in (2.39) by 1,000. This gives equations (2.39) and (2.40) the same interpretation. Looking at (2.40), if roe = 0, then $\widehat{salardol} = 963{,}191$, so the predicted salary is $963,191 [the same value we obtained from equation (2.39)]. Furthermore, if roe increases by one, then the predicted salary increases by $18,501; again, this is what we concluded from our earlier analysis of equation (2.39).
Generally, it is easy to figure out what happens to the intercept and slope estimates
when the dependent variable changes units of measurement. If the dependent variable
is multiplied by the constant c (which means each value in the sample is multiplied by c), then the OLS intercept and slope estimates are also multiplied by c. (This assumes nothing has changed about the independent variable.) In the CEO salary example, c = 1,000 in moving from salary to salardol.
We can also use the CEO salary example to see what happens when we change the units of measurement of the independent variable. Define roedec = roe/100 to be the decimal equivalent of roe; thus, roedec = 0.23 means a return on equity of 23 percent. To focus on changing the units of measurement of the independent variable, we return to our original dependent variable, salary, which is measured in thousands of dollars. When we regress salary on roedec, we obtain

$\widehat{salary} = 963.191 + 1{,}850.1\ roedec$.   (2.41)

The coefficient on roedec is 100 times the coefficient on roe in (2.39). This is as it should be. Changing roe by one percentage point is equivalent to $\Delta roedec = 0.01$. From (2.41), if $\Delta roedec = 0.01$, then $\Delta\widehat{salary} = 1{,}850.1(0.01) = 18.501$, which is what is obtained by using (2.39). Note that, in moving from (2.39) to (2.41), the independent variable was divided by 100, and so the OLS slope estimate was multiplied by 100, preserving the interpretation of the equation. Generally, if the independent variable is divided or multiplied by some nonzero constant, c, then the OLS slope coefficient is multiplied or divided by c, respectively.
The intercept has not changed in (2.41) because roedec = 0 still corresponds to a zero return on equity. In general, changing the units of measurement of only the independent variable does not affect the intercept.
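These rescaling rules are easy to verify by direct computation. The sketch below uses simulated data with arbitrary parameter values (not the CEOSAL1 data); it rescales the dependent and then the independent variable and compares the resulting OLS estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated stand-ins for salary (thousands) and roe (percent); values are arbitrary
roe = rng.uniform(0, 60, size=200)
salary = 960 + 18.5 * roe + rng.normal(0, 500, size=200)

def ols(x, y):
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    return y.mean() - b1 * x.mean(), b1

b0, b1 = ols(roe, salary)
b0_dol, b1_dol = ols(roe, 1000 * salary)   # rescale the dependent variable
b0_dec, b1_dec = ols(roe / 100, salary)    # rescale the independent variable

print(b0, b1)
print(b0_dol / 1000, b1_dol / 1000)        # both estimates scale up by 1,000
print(b0_dec, b1_dec / 100)                # slope scales by 100, intercept unchanged
```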
In the previous section, we defined R-squared as a goodness-of-fit measure for OLS
regression. We can also ask what happens to $R^2$ when the unit of measurement of either
the independent or the dependent variable changes. Without doing any algebra, we should
know the result: the goodness-of-fit of the model should not depend on the units of
measurement of our variables. For example, the amount of variation in salary explained by the return on equity should not depend on whether salary is measured in dollars or in thousands of dollars, or on whether return on equity is a percentage or a decimal. This intuition can be verified mathematically: using the definition of $R^2$, it can be shown that $R^2$ is, in fact, invariant to changes in the units of y or x.

QUESTION 2.4
Suppose that salary is measured in hundreds of dollars, rather than in thousands of dollars, say, salarhun. What will be the OLS intercept and slope estimates in the regression of salarhun on roe?
Incorporating Nonlinearities in Simple Regression
So far, we have focused on linear relationships between the dependent and independent
variables. As we mentioned in Chapter 1, linear relationships are not nearly general enough
for all economic applications. Fortunately, it is rather easy to incorporate many nonlinear-
ities into simple regression analysis by appropriately defining the dependent and indepen-
dent variables. Here, we will cover two possibilities that often appear in applied work.
In reading applied work in the social sciences, you will often encounter regression
equations where the dependent variable appears in logarithmic form. Why is this done?
Recall the wage-education example, where we regressed hourly wage on years of educa-
tion. We obtained a slope estimate of 0.54 [see equation (2.27)], which means that each
additional year of education is predicted to increase hourly wage by 54 cents. Because of
the linear nature of (2.27), 54 cents is the increase for either the first year of education or
the twentieth year; this may not be reasonable.
Suppose, instead, that the percentage increase in wage is the same, given one more
year of education. Model (2.27) does not imply a constant percentage increase: the
percentage increase depends on the initial wage. A model that gives (approximately) a
constant percentage effect is
$\log(wage) = \beta_0 + \beta_1 educ + u$,   (2.42)

where log(·) denotes the natural logarithm. (See Appendix A for a review of logarithms.) In particular, if $\Delta u = 0$, then

$\%\Delta wage \approx (100 \cdot \beta_1)\,\Delta educ$.   (2.43)

Notice how we multiply $\beta_1$ by 100 to get the percentage change in wage given one additional year of education. Since the percentage change in wage is the same for each additional year of education, the change in wage for an extra year of education increases as education increases; in other words, (2.42) implies an increasing return to education. By exponentiating (2.42), we can write $wage = \exp(\beta_0 + \beta_1 educ + u)$. This equation is graphed in Figure 2.6, with u = 0.
Estimating a model such as (2.42) is straightforward when using simple regression. Just define the dependent variable, y, to be y = log(wage). The independent variable is represented by x = educ. The mechanics of OLS are the same as before: the intercept and slope estimates are given by the formulas (2.17) and (2.19). In other words, we obtain $\hat{\beta}_0$ and $\hat{\beta}_1$ from the OLS regression of log(wage) on educ.
EXAMPLE 2.10
(A Log Wage Equation)

Using the same data as in Example 2.4, but using log(wage) as the dependent variable, we obtain the following relationship:

$\widehat{\log(wage)} = 0.584 + 0.083\ educ$   (2.44)
$n = 526,\ R^2 = 0.186$.

The coefficient on educ has a percentage interpretation when it is multiplied by 100: wage increases by 8.3 percent for every additional year of education. This is what economists mean when they refer to the "return to another year of education."
It is important to remember that the main reason for using the log of wage in (2.42) is to impose a constant percentage effect of education on wage. Once equation (2.42) is obtained, the natural log of wage is rarely mentioned. In particular, it is not correct to say that another year of education increases log(wage) by 8.3 percent.
The intercept in (2.42) is not very meaningful, as it gives the predicted log(wage) when educ = 0. The R-squared shows that educ explains about 18.6 percent of the variation in log(wage) (not wage). Finally, equation (2.44) might not capture all of the nonlinearity in the relationship between wage and schooling. If there are "diploma effects," then the twelfth year of education (graduation from high school) could be worth much more than the eleventh year. We will learn how to allow for this kind of nonlinearity in Chapter 7.

FIGURE 2.6
$wage = \exp(\beta_0 + \beta_1 educ)$, with $\beta_1 > 0$.
Another important use of the natural log is in obtaining a constant elasticity model.
EXAMPLE 2.11
(CEO Salary and Firm Sales)

We can estimate a constant elasticity model relating CEO salary to firm sales. The data set is the same one used in Example 2.3, except we now relate salary to sales. Let sales be annual firm sales, measured in millions of dollars. A constant elasticity model is

$\log(salary) = \beta_0 + \beta_1 \log(sales) + u$,   (2.45)

where $\beta_1$ is the elasticity of salary with respect to sales. This model falls under the simple regression model by defining the dependent variable to be y = log(salary) and the independent variable to be x = log(sales). Estimating this equation by OLS gives

$\widehat{\log(salary)} = 4.822 + 0.257\ \log(sales)$   (2.46)
$n = 209,\ R^2 = 0.211$.

The coefficient of log(sales) is the estimated elasticity of salary with respect to sales. It implies that a 1 percent increase in firm sales increases CEO salary by about 0.257 percent, the usual interpretation of an elasticity.
The two functional forms covered in this section will often arise in the remainder of
this text. We have covered models containing natural logarithms here because they appear
so frequently in applied work. The interpretation of such models will not be much differ-
ent in the multiple regression case.
It is also useful to note what happens to the intercept and slope estimates if we change
the units of measurement of the dependent variable when it appears in logarithmic form.
Because the change to logarithmic form approximates a proportionate change, it makes
sense that nothing happens to the slope. We can see this by writing the rescaled vari-
able as $c_1 y_i$ for each observation i. The original equation is $\log(y_i) = \beta_0 + \beta_1 x_i + u_i$. If we add $\log(c_1)$ to both sides, we get $\log(c_1) + \log(y_i) = [\log(c_1) + \beta_0] + \beta_1 x_i + u_i$, or $\log(c_1 y_i) = [\log(c_1) + \beta_0] + \beta_1 x_i + u_i$. (Remember that the sum of the logs is equal to the log of their product, as shown in Appendix A.) Therefore, the slope is still $\beta_1$, but the intercept is now $\log(c_1) + \beta_0$. Similarly, if the independent variable is log(x), and we change the units of measurement of x before taking the log, the slope remains the same, but the intercept changes. You will be asked to verify these claims in Problem 2.9.

TABLE 2.3
Summary of Functional Forms Involving Logarithms

Model         Dependent Variable   Independent Variable   Interpretation of $\beta_1$
level-level   y                    x                       $\Delta y = \beta_1 \Delta x$
level-log     y                    log(x)                  $\Delta y = (\beta_1 / 100)\,\%\Delta x$
log-level     log(y)               x                       $\%\Delta y = (100\,\beta_1)\,\Delta x$
log-log       log(y)               log(x)                  $\%\Delta y = \beta_1\,\%\Delta x$
We end this subsection by summarizing four combinations of functional forms avail-
able from using either the original variable or its natural log. In Table 2.3, x and y stand
for the variables in their original form. The model with y as the dependent variable and x
as the independent variable is called the level-level model because each variable appears
in its level form. The model with log(y) as the dependent variable and x as the indepen-
dent variable is called the log-level model. We will not explicitly discuss the level-log
model here, because it arises less often in practice. In any case, we will see examples of
this model in later chapters.
The last column in Table 2.3 gives the interpretation of $\beta_1$. In the log-level model, $100 \cdot \beta_1$ is sometimes called the semi-elasticity of y with respect to x. As we mentioned in Example 2.11, in the log-log model, $\beta_1$ is the elasticity of y with respect to x. Table 2.3
warrants careful study, as we will refer to it often in the remainder of the text.
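The claim that rescaling y only shifts the intercept when y appears in logarithmic form can also be checked numerically. The sketch below (simulated data with illustrative parameter values, not from the text) regresses log(y) and log(c1*y) on the same x and compares the estimates.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0, 20, size=n)
u = rng.normal(0, 0.3, size=n)
log_y = 1.0 + 0.08 * x + u          # log-level model with illustrative parameters
c1 = 100.0                          # rescaling constant for y

def ols(xv, yv):
    b1 = np.sum((xv - xv.mean()) * (yv - yv.mean())) / np.sum((xv - xv.mean())**2)
    return yv.mean() - b1 * xv.mean(), b1

b0, b1 = ols(x, log_y)
b0_c, b1_c = ols(x, np.log(c1) + log_y)   # log(c1 * y) = log(c1) + log(y)

print(b1, b1_c)                     # slopes identical
print(b0_c - b0, np.log(c1))        # intercept shifts by log(c1)
```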
The Meaning of Linear Regression
The simple regression model that we have studied in this chapter is also called the simple
linear regression model. Yet, as we have just seen, the general model also allows for cer-
tain nonlinear relationships. So what does "linear" mean here? You can see by looking at equation (2.1) that $y = \beta_0 + \beta_1 x + u$. The key is that this equation is linear in the parameters $\beta_0$ and $\beta_1$. There are no restrictions on how y and x relate to the original explained and explanatory variables of interest. As we saw in Examples 2.10 and 2.11, y and x can be natural logs of variables, and this is quite common in applications. But we need not stop there. For example, nothing prevents us from using simple regression to estimate a model such as $cons = \beta_0 + \beta_1 \sqrt{inc} + u$, where cons is annual consumption and inc is annual income. By contrast, a model that is not linear in the parameters, such as $cons = 1/(\beta_0 + \beta_1 inc) + u$, cannot be estimated this way. Estimation of such mod-
of this text. For most applications, choosing a model that can be put into the linear regres-
sion framework is sufficient.
2.5 Expected Values and Variances
of the OLS Estimators
In Section 2.1, we defined the population model $y = \beta_0 + \beta_1 x + u$, and we claimed that the key assumption for simple regression analysis to be useful is that the expected value of u given any value of x is zero. In Sections 2.2, 2.3, and 2.4, we discussed the algebraic properties of OLS estimation. We now return to the population model and study the statistical properties of OLS. In other words, we now view $\hat{\beta}_0$ and $\hat{\beta}_1$ as estimators for the parameters $\beta_0$ and $\beta_1$ that appear in the population model. This means that we will study properties of the distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ over different random samples from the population. (Appendix C contains definitions of estimators and reviews some of their important properties.)
Unbiasedness of OLS
We begin by establishing the unbiasedness of OLS under a simple set of assumptions. For
future reference, it is useful to number these assumptions using the prefix SLR for sim-
ple linear regression. The first assumption defines the population model.
Assumption SLR.1 (Linear in Parameters)
In the population model, the dependent variable, y, is related to the independent variable, x,
and the error (or disturbance), u, as
$y = \beta_0 + \beta_1 x + u$,   (2.47)

where $\beta_0$ and $\beta_1$ are the population intercept and slope parameters, respectively.
To be realistic, y, x, and u are all viewed as random variables in stating the population
model. We discussed the interpretation of this model at some length in Section 2.1 and
gave several examples. In the previous section, we learned that equation (2.47) is not as
restrictive as it initially seems; by choosing y and x appropriately, we can obtain interesting nonlinear relationships (such as constant elasticity models).
We are interested in using data on y and x to estimate the parameters $\beta_0$ and, especially, $\beta_1$. We assume that our data were obtained as a random sample. (See Appendix C for a review of random sampling.)
Assumption SLR.2 (Random Sampling)
We have a random sample of size n, $\{(x_i, y_i): i = 1, 2, \ldots, n\}$, following the population model in equation (2.47).
We will have to address failure of the random sampling assumption in later chapters that
deal with time series analysis and sample selection problems. Not all cross-sectional sam-
ples can be viewed as outcomes of random samples, but many can be.
We can write (2.47) in terms of the random sample as

$y_i = \beta_0 + \beta_1 x_i + u_i, \quad i = 1, 2, \ldots, n$,   (2.48)

where $u_i$ is the error or disturbance for observation i (for example, person i, firm i, city i, and so on). Thus, $u_i$ contains the unobservables for observation i that affect $y_i$. The $u_i$ should not be confused with the residuals, $\hat{u}_i$, that we defined in Section 2.3. Later on, we will explore the relationship between the errors and the residuals. For interpreting $\beta_0$ and $\beta_1$ in a particular application, (2.47) is most informative, but (2.48) is also needed for some of the statistical derivations.
The relationship (2.48) can be plotted for a particular outcome of data as shown in
Figure 2.7.
As we already saw in Section 2.2, the OLS slope and intercept estimates are not defined unless we have some sample variation in the explanatory variable. We now add variation in the $x_i$ to our list of assumptions.

Assumption SLR.3 (Sample Variation in the Explanatory Variable)
The sample outcomes on x, namely, $\{x_i, i = 1, \ldots, n\}$, are not all the same value.

This is a very weak assumption, certainly not worth emphasizing, but needed nevertheless. If x varies in the population, random samples on x will typically contain variation, unless the population variation is minimal or the sample size is small. Simple inspection of summary statistics on $x_i$ reveals whether Assumption SLR.3 fails: if the sample standard deviation of $x_i$ is zero, then Assumption SLR.3 fails; otherwise, it holds.
Finally, in order to obtain unbiased estimators of $\beta_0$ and $\beta_1$, we need to impose the zero conditional mean assumption that we discussed in some detail in Section 2.1. We now explicitly add it to our list of assumptions.
Assumption SLR.4 (Zero Conditional Mean)
The error u has an expected value of zero given any value of the explanatory variable. In other words,

$E(u|x) = 0$.

For a random sample, this assumption implies that $E(u_i|x_i) = 0$, for all $i = 1, 2, \ldots, n$.
In addition to restricting the relationship between u and x in the population, the zero conditional mean assumption, coupled with the random sampling assumption, allows for a convenient technical simplification. In particular, we can derive the statistical properties of the OLS estimators as conditional on the values of the $x_i$ in our sample. Technically, in statistical derivations, conditioning on the sample values of the independent variable is the same as treating the $x_i$ as fixed in repeated samples, which we think of as follows. We first choose n sample values for $x_1, x_2, \ldots, x_n$. (These can be repeated.) Given these values, we then obtain a sample on y (effectively by obtaining a random sample of the $u_i$). Next, another sample of y is obtained, using the same values for $x_1, x_2, \ldots, x_n$. Then another sample of y is obtained, again using the same $x_1, x_2, \ldots, x_n$. And so on.
[FIGURE 2.7: Graph of $y_i = \beta_0 + \beta_1 x_i + u_i$. Observations such as $(x_1, y_1)$ and $(x_i, y_i)$ are plotted with their errors $u_1$ and $u_i$ measured as deviations from the population regression function (PRF), $E(y|x) = \beta_0 + \beta_1 x$.]
The "fixed in repeated samples" scenario is not very realistic in nonexperimental contexts. For instance, in sampling individuals for the wage-education example, it makes little sense to think of choosing the values of educ ahead of time and then sampling individuals with those particular levels of education. Random sampling, where individuals are chosen randomly and their wage and education are both recorded, is representative of how most data sets are obtained for empirical analysis in the social sciences. Once we assume that $E(u|x) = 0$, and we have random sampling, nothing is lost in derivations by treating the $x_i$ as nonrandom. The danger is that the fixed in repeated samples assumption always implies that $u_i$ and $x_i$ are independent. In deciding when simple regression analysis is going to produce unbiased estimators, it is critical to think in terms of Assumption SLR.4.
Now, we are ready to show that the OLS estimators are unbiased. To this end, we use the fact that $\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n (x_i - \bar{x})y_i$ (see Appendix A) to write the OLS slope estimator in equation (2.19) as

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})y_i}{\sum_{i=1}^n (x_i - \bar{x})^2}. \qquad (2.49) $$

Because we are now interested in the behavior of $\hat{\beta}_1$ across all possible samples, $\hat{\beta}_1$ is properly viewed as a random variable.

We can write $\hat{\beta}_1$ in terms of the population coefficients and errors by substituting the right-hand side of (2.48) into (2.49). We have

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})y_i}{\text{SST}_x} = \frac{\sum_{i=1}^n (x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i)}{\text{SST}_x}, \qquad (2.50) $$

where we have defined the total variation in the $x_i$ as $\text{SST}_x = \sum_{i=1}^n (x_i - \bar{x})^2$ in order to simplify the notation. (This is not quite the sample variance of the $x_i$ because we do not divide by $n - 1$.) Using the algebra of the summation operator, write the numerator of $\hat{\beta}_1$ as

$$ \sum_{i=1}^n (x_i - \bar{x})\beta_0 + \sum_{i=1}^n (x_i - \bar{x})\beta_1 x_i + \sum_{i=1}^n (x_i - \bar{x})u_i \qquad (2.51) $$
$$ = \beta_0 \sum_{i=1}^n (x_i - \bar{x}) + \beta_1 \sum_{i=1}^n (x_i - \bar{x})x_i + \sum_{i=1}^n (x_i - \bar{x})u_i. $$

As shown in Appendix A, $\sum_{i=1}^n (x_i - \bar{x}) = 0$ and $\sum_{i=1}^n (x_i - \bar{x})x_i = \sum_{i=1}^n (x_i - \bar{x})^2 = \text{SST}_x$. Therefore, we can write the numerator of $\hat{\beta}_1$ as $\beta_1 \text{SST}_x + \sum_{i=1}^n (x_i - \bar{x})u_i$. Putting this over the denominator gives

$$ \hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^n (x_i - \bar{x})u_i}{\text{SST}_x} = \beta_1 + (1/\text{SST}_x)\sum_{i=1}^n d_i u_i, \qquad (2.52) $$

where $d_i = x_i - \bar{x}$. We now see that the estimator $\hat{\beta}_1$ equals the population slope, $\beta_1$, plus a term that is a linear combination in the errors $\{u_1, u_2, \ldots, u_n\}$. Conditional on the values of the $x_i$, the randomness in $\hat{\beta}_1$ is due entirely to the errors in the sample. The fact that these errors are generally different from zero is what causes $\hat{\beta}_1$ to differ from $\beta_1$.
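The decomposition in (2.52) is easy to verify numerically. The following sketch is not part of the original text: it simulates one sample from the population model with arbitrary parameter values and checks that the OLS slope computed from (2.49) equals $\beta_1$ plus the error term in (2.52).

```python
import numpy as np

# A minimal sketch (illustrative values only): compute the OLS slope by (2.49)
# on simulated data and verify the decomposition in (2.52).
rng = np.random.default_rng(0)
n, beta0, beta1 = 200, 1.0, 0.5

x = rng.normal(loc=5.0, scale=2.0, size=n)      # explanatory variable
u = rng.normal(loc=0.0, scale=1.0, size=n)      # error, mean zero by construction
y = beta0 + beta1 * x + u                       # population model (2.48)

sst_x = np.sum((x - x.mean()) ** 2)             # total variation in x
b1_hat = np.sum((x - x.mean()) * y) / sst_x     # equation (2.49)
b0_hat = y.mean() - b1_hat * x.mean()

# Decomposition (2.52): b1_hat = beta1 + (1/SST_x) * sum(d_i * u_i)
d = x - x.mean()
print(b1_hat, beta1 + np.sum(d * u) / sst_x)    # the two numbers agree
```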
Using the representation in (2.52), we can prove the first important statistical property of OLS.

Theorem 2.1 (Unbiasedness of OLS)
Using Assumptions SLR.1 through SLR.4,

$$ E(\hat{\beta}_0) = \beta_0, \quad \text{and} \quad E(\hat{\beta}_1) = \beta_1, \qquad (2.53) $$

for any values of $\beta_0$ and $\beta_1$. In other words, $\hat{\beta}_0$ is unbiased for $\beta_0$, and $\hat{\beta}_1$ is unbiased for $\beta_1$.

PROOF: In this proof, the expected values are conditional on the sample values of the independent variable. Because $\text{SST}_x$ and $d_i$ are functions only of the $x_i$, they are nonrandom in the conditioning. Therefore, from (2.52), and keeping the conditioning on $\{x_1, x_2, \ldots, x_n\}$ implicit, we have

$$ E(\hat{\beta}_1) = \beta_1 + E\Big[(1/\text{SST}_x)\sum_{i=1}^n d_i u_i\Big] = \beta_1 + (1/\text{SST}_x)\sum_{i=1}^n E(d_i u_i) $$
$$ = \beta_1 + (1/\text{SST}_x)\sum_{i=1}^n d_i E(u_i) = \beta_1 + (1/\text{SST}_x)\sum_{i=1}^n d_i \cdot 0 = \beta_1, $$

where we have used the fact that the expected value of each $u_i$ (conditional on $\{x_1, x_2, \ldots, x_n\}$) is zero under Assumptions SLR.2 and SLR.4. Since unbiasedness holds for any outcome on $\{x_1, x_2, \ldots, x_n\}$, unbiasedness also holds without conditioning on $\{x_1, x_2, \ldots, x_n\}$.

The proof for $\hat{\beta}_0$ is now straightforward. Average (2.48) across i to get $\bar{y} = \beta_0 + \beta_1 \bar{x} + \bar{u}$, and plug this into the formula for $\hat{\beta}_0$:

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \beta_0 + \beta_1 \bar{x} + \bar{u} - \hat{\beta}_1 \bar{x} = \beta_0 + (\beta_1 - \hat{\beta}_1)\bar{x} + \bar{u}. $$

Then, conditional on the values of the $x_i$,

$$ E(\hat{\beta}_0) = \beta_0 + E[(\beta_1 - \hat{\beta}_1)\bar{x}] + E(\bar{u}) = \beta_0 + E[(\beta_1 - \hat{\beta}_1)]\bar{x}, $$

since $E(\bar{u}) = 0$ by Assumptions SLR.2 and SLR.4. But, we showed that $E(\hat{\beta}_1) = \beta_1$, which implies that $E[(\beta_1 - \hat{\beta}_1)] = 0$. Thus, $E(\hat{\beta}_0) = \beta_0$. Both of these arguments are valid for any values of $\beta_0$ and $\beta_1$, and so we have established unbiasedness.
Remember that unbiasedness is a feature of the sampling distributions of $\hat{\beta}_1$ and $\hat{\beta}_0$, which says nothing about the estimate that we obtain for a given sample. We hope that, if the sample we obtain is somehow typical, then our estimate should be near the population value. Unfortunately, it is always possible that we could obtain an unlucky sample that would give us a point estimate far from $\beta_1$, and we can never know for sure whether this is the case. You may want to review the material on unbiased estimators in Appendix C, especially the simulation exercise in Table C.1 that illustrates the concept of unbiasedness.
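A small simulation in the spirit of Table C.1 makes the idea concrete. The sketch below is only illustrative (the sample size, parameter values, and error distribution are assumptions, not taken from the text): it holds the $x_i$ fixed, redraws the errors many times, and checks that the slope estimates average out to the true $\beta_1$.

```python
import numpy as np

# A simulation sketch of the "fixed in repeated samples" thought experiment:
# hold the x_i fixed, redraw the errors, and average the slope estimates.
rng = np.random.default_rng(1)
n, beta0, beta1, reps = 100, 3.0, 2.0, 10_000

x = rng.uniform(0.0, 10.0, size=n)              # x_i held fixed across replications
sst_x = np.sum((x - x.mean()) ** 2)

slopes = np.empty(reps)
for r in range(reps):
    u = rng.normal(0.0, 5.0, size=n)            # new errors each replication
    y = beta0 + beta1 * x + u
    slopes[r] = np.sum((x - x.mean()) * y) / sst_x

print(slopes.mean())   # close to beta1 = 2.0, illustrating unbiasedness
print(slopes.std())    # sampling spread (see the variance formulas later in the section)
```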
Unbiasedness generally fails if any of our four assumptions fail. This means that it is
important to think about the veracity of each assumption for a particular application.
Assumption SLR.1 requires that y and x be linearly related, with an additive disturbance.
This can certainly fail. But we also know that y and x can be chosen to yield interesting
nonlinear relationships. Dealing with the failure of (2.47) requires more advanced meth-
ods that are beyond the scope of this text.
Later, we will have to relax Assumption SLR.2, the random sampling assumption, for
time series analysis. But what about using it for cross-sectional analysis? Random
sampling can fail in a cross section when samples are not representative of the underly-
ing population; in fact, some data sets are constructed by intentionally oversampling
different parts of the population. We will discuss problems of nonrandom sampling in
Chapters 9 and 17.
As we have already discussed, Assumption SLR.3 almost always holds in interesting
regression applications. Without it, we cannot even obtain the OLS estimates.
The assumption we should concentrate on for now is SLR.4. If SLR.4 holds, the OLS
estimators are unbiased. Likewise, if SLR.4 fails, the OLS estimators generally will be
biased. There are ways to determine the likely direction and size of the bias, which we
will study in Chapter 3.
The possibility that x is correlated with u is almost always a concern in simple
regression analysis with nonexperimental data, as we indicated with several examples in
Section 2.1. Using simple regression when u contains factors affecting y that are also cor-
related with x can result in spurious correlation: that is, we find a relationship between
y and x that is really due to other unobserved factors that affect y and also happen to be
correlated with x.
E X A M P L E   2 . 1 2
(Student Math Performance and the School Lunch Program)

Let math10 denote the percentage of tenth graders at a high school receiving a passing score on a standardized mathematics exam. Suppose we wish to estimate the effect of the federally funded school lunch program on student performance. If anything, we expect the lunch program to have a positive ceteris paribus effect on performance: all other factors being equal, if a student who is too poor to eat regular meals becomes eligible for the school lunch program, his or her performance should improve. Let lnchprg denote the percentage of students who are eligible for the lunch program. Then, a simple regression model is

$$ math10 = \beta_0 + \beta_1\, lnchprg + u, \qquad (2.54) $$

where u contains school and student characteristics that affect overall school performance. Using the data in MEAP93.RAW on 408 Michigan high schools for the 1992-1993 school year, we obtain

$$ \widehat{math10} = 32.14 - 0.319\, lnchprg $$
$$ n = 408, \quad R^2 = 0.171. $$

This equation predicts that if student eligibility in the lunch program increases by 10 percentage points, the percentage of students passing the math exam falls by about 3.2 percentage points. Do we really believe that higher participation in the lunch program actually causes worse performance? Almost certainly not. A better explanation is that the error term u in equation (2.54) is correlated with lnchprg. In fact, u contains factors such as the poverty rate of children attending school, which affects student performance and is highly correlated with eligibility in the lunch program. Variables such as school quality and resources are also contained in u, and these are likely correlated with lnchprg. It is important to remember that the estimate $-0.319$ is only for this particular sample, but its sign and magnitude make us suspect that u and x are correlated, so that simple regression is biased.
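For readers who want to reproduce a regression like (2.54) in software, a sketch follows. It assumes the MEAP93 data have been exported to a CSV file (the file name here is hypothetical) and uses the statsmodels formula interface; any OLS routine would do.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Sketch of Example 2.12. Assumption: the MEAP93 data set is available as
# "meap93.csv" with columns math10 and lnchprg (file name is hypothetical).
df = pd.read_csv("meap93.csv")

# OLS of math10 on lnchprg with an intercept, as in equation (2.54)
res = smf.ols("math10 ~ lnchprg", data=df).fit()
print(res.params)     # intercept and slope (about 32.14 and -0.319 in the text)
print(res.rsquared)   # about 0.171 in the text
print(res.nobs)       # 408 schools
```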
In addition to omitted variables, there are other reasons for x to be correlated with u
in the simple regression model. Because the same issues arise in multiple regression analy-
sis, we will postpone a systematic treatment of the problem until then.
Variances of the OLS Estimators

In addition to knowing that the sampling distribution of $\hat{\beta}_1$ is centered about $\beta_1$ ($\hat{\beta}_1$ is unbiased), it is important to know how far we can expect $\hat{\beta}_1$ to be away from $\beta_1$ on average. Among other things, this allows us to choose the best estimator among all, or at least a broad class of, unbiased estimators. The measure of spread in the distribution of $\hat{\beta}_1$ (and $\hat{\beta}_0$) that is easiest to work with is the variance or its square root, the standard deviation. (See Appendix C for a more detailed discussion.)
It turns out that the variance of the OLS estimators can be computed under
Assumptions SLR.1 through SLR.4. However, these expressions would be somewhat
complicated. Instead, we add an assumption that is traditional for cross-sectional analy-
sis. This assumption states that the variance of the unobservable, u, conditional on x, is
constant. This is known as the homoskedasticity or constant variance assumption.
Assumption SLR.5 (Homoskedasticity)
The error u has the same variance given any value of the explanatory variable. In other words,

$$ \mathrm{Var}(u|x) = \sigma^2. $$
We must emphasize that the homoskedasticity assumption is quite distinct from the zero conditional mean assumption, $E(u|x) = 0$. Assumption SLR.4 involves the expected value of u, while Assumption SLR.5 concerns the variance of u (both conditional on x). Recall that we established the unbiasedness of OLS without Assumption SLR.5: the homoskedasticity assumption plays no role in showing that $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased. We add Assumption SLR.5 because it simplifies the variance calculations for $\hat{\beta}_0$ and $\hat{\beta}_1$ and because it implies that ordinary least squares has certain efficiency properties, which we will see in Chapter 3. If we were to assume that u and x are independent, then the distribution of u given x does not depend on x, and so $E(u|x) = E(u) = 0$ and $\mathrm{Var}(u|x) = \sigma^2$. But independence is sometimes too strong of an assumption.

Because $\mathrm{Var}(u|x) = E(u^2|x) - [E(u|x)]^2$ and $E(u|x) = 0$, $\sigma^2 = E(u^2|x)$, which means $\sigma^2$ is also the unconditional expectation of $u^2$. Therefore, $\sigma^2 = E(u^2) = \mathrm{Var}(u)$, because $E(u) = 0$. In other words, $\sigma^2$ is the unconditional variance of u, and so $\sigma^2$ is often called the error variance or disturbance variance. The square root of $\sigma^2$, $\sigma$, is the standard deviation of the error. A larger $\sigma$ means that the distribution of the unobservables affecting y is more spread out.
It is often useful to write Assumptions SLR.4 and SLR.5 in terms of the conditional mean and conditional variance of y:

$$ E(y|x) = \beta_0 + \beta_1 x. \qquad (2.55) $$
$$ \mathrm{Var}(y|x) = \sigma^2. \qquad (2.56) $$

In other words, the conditional expectation of y given x is linear in x, but the variance of y given x is constant. This situation is graphed in Figure 2.8 where $\beta_0 > 0$ and $\beta_1 > 0$.

[FIGURE 2.8: The simple regression model under homoskedasticity. The conditional densities $f(y|x)$ at $x_1$, $x_2$, and $x_3$ are centered about $E(y|x) = \beta_0 + \beta_1 x$ and have the same spread.]

When $\mathrm{Var}(u|x)$ depends on x, the error term is said to exhibit heteroskedasticity (or nonconstant variance). Because $\mathrm{Var}(u|x) = \mathrm{Var}(y|x)$, heteroskedasticity is present whenever $\mathrm{Var}(y|x)$ is a function of x.
E X A M P L E   2 . 1 3
(Heteroskedasticity in a Wage Equation)

In order to get an unbiased estimator of the ceteris paribus effect of educ on wage, we must assume that $E(u|educ) = 0$, and this implies $E(wage|educ) = \beta_0 + \beta_1 educ$. If we also make the homoskedasticity assumption, then $\mathrm{Var}(u|educ) = \sigma^2$ does not depend on the level of education, which is the same as assuming $\mathrm{Var}(wage|educ) = \sigma^2$. Thus, while average wage is allowed to increase with education level (it is this rate of increase that we are interested in estimating), the variability in wage about its mean is assumed to be constant across all education levels. This may not be realistic. It is likely that people with more education have a wider variety of interests and job opportunities, which could lead to more wage variability at higher levels of education. People with very low levels of education have fewer opportunities and often must work at the minimum wage; this serves to reduce wage variability at low education levels. This situation is shown in Figure 2.9. Ultimately, whether Assumption SLR.5 holds is an empirical issue, and in Chapter 8 we will show how to test Assumption SLR.5.

[FIGURE 2.9: Var(wage|educ) increasing with educ. The conditional densities $f(wage|educ)$ at educ = 8, 12, and 16 are centered about $E(wage|educ) = \beta_0 + \beta_1 educ$ but become more spread out at higher education levels.]
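The pattern in Figure 2.9 is easy to mimic with simulated data. In the sketch below, the numbers and the assumption that the error standard deviation is proportional to educ are purely illustrative; the point is only that the conditional variance of wage then grows with education.

```python
import numpy as np

# Illustrative sketch of heteroskedasticity: Var(wage|educ) increases with educ.
rng = np.random.default_rng(2)
n = 5_000
educ = rng.integers(8, 21, size=n).astype(float)

# Assumption for illustration: sd of the error is proportional to educ.
u = rng.normal(0.0, 0.5 * educ)
wage = -3.0 + 1.2 * educ + u        # hypothetical beta0, beta1

# Compare the sample variance of wage at low vs. high education levels
print(wage[educ <= 10].var())       # smaller
print(wage[educ >= 18].var())       # larger: Var(wage|educ) grows with educ
```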
With the homoskedasticity assumption in place, we are ready to prove the following:
Theorem 2.2 (Sampling Variances of the OLS Estimators)
Under Assumptions SLR.1 through SLR.5,

$$ \mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2} = \sigma^2/\text{SST}_x, \qquad (2.57) $$

and

$$ \mathrm{Var}(\hat{\beta}_0) = \frac{\sigma^2\, n^{-1}\sum_{i=1}^n x_i^2}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad (2.58) $$
where these are conditional on the sample values $\{x_1, \ldots, x_n\}$.

PROOF: We derive the formula for $\mathrm{Var}(\hat{\beta}_1)$, leaving the other derivation as Problem 2.10. The starting point is equation (2.52): $\hat{\beta}_1 = \beta_1 + (1/\text{SST}_x)\sum_{i=1}^n d_i u_i$. Because $\beta_1$ is just a constant, and we are conditioning on the $x_i$, $\text{SST}_x$ and $d_i = x_i - \bar{x}$ are also nonrandom. Furthermore, because the $u_i$ are independent random variables across i (by random sampling), the variance of the sum is the sum of the variances. Using these facts, we have

$$ \mathrm{Var}(\hat{\beta}_1) = (1/\text{SST}_x)^2\, \mathrm{Var}\Big(\sum_{i=1}^n d_i u_i\Big) = (1/\text{SST}_x)^2 \sum_{i=1}^n d_i^2\, \mathrm{Var}(u_i) $$
$$ = (1/\text{SST}_x)^2 \sum_{i=1}^n d_i^2\, \sigma^2 \quad [\text{since } \mathrm{Var}(u_i) = \sigma^2 \text{ for all } i] $$
$$ = \sigma^2 (1/\text{SST}_x)^2 \sum_{i=1}^n d_i^2 = \sigma^2 (1/\text{SST}_x)^2\, \text{SST}_x = \sigma^2/\text{SST}_x, $$

which is what we wanted to show.
Equations (2.57) and (2.58) are the standard formulas for simple regression analysis, which are invalid in the presence of heteroskedasticity. This will be important when we turn to confidence intervals and hypothesis testing in multiple regression analysis.

For most purposes, we are interested in $\mathrm{Var}(\hat{\beta}_1)$. It is easy to summarize how this variance depends on the error variance, $\sigma^2$, and the total variation in $\{x_1, x_2, \ldots, x_n\}$, $\text{SST}_x$. First, the larger the error variance, the larger is $\mathrm{Var}(\hat{\beta}_1)$. This makes sense since more variation in the unobservables affecting y makes it more difficult to precisely estimate $\beta_1$. On the other hand, more variability in the independent variable is preferred: as the variability in the $x_i$ increases, the variance of $\hat{\beta}_1$ decreases. This also makes intuitive sense since the more spread out is the sample of independent variables, the easier it is to trace out the relationship between $E(y|x)$ and x. That is, the easier it is to estimate $\beta_1$. If there is little variation in the $x_i$, then it can be hard to pinpoint how $E(y|x)$ varies with x. As the sample size increases, so does the total variation in the $x_i$. Therefore, a larger sample size results in a smaller variance for $\hat{\beta}_1$.
This analysis shows that, if we are interested in $\beta_1$, and we have a choice, then we should choose the $x_i$ to be as spread out as possible. This is sometimes possible with experimental data, but rarely do we have this luxury in the social sciences: usually, we must take the $x_i$ that we obtain via random sampling. Sometimes, we have an opportunity to obtain larger sample sizes, although this can be costly.
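The comparative statics of (2.57) can be illustrated directly, since the formula is just $\sigma^2$ divided by $\text{SST}_x$. The values below are arbitrary; the sketch simply evaluates the variance for designs with different spread in x, different error variances, and different sample sizes.

```python
import numpy as np

# A quick numerical sketch of how Var(b1_hat) = sigma^2 / SST_x in (2.57)
# responds to the error variance and to the spread of the x_i.
def var_slope(x, sigma2):
    """Exact conditional variance of the OLS slope under SLR.1-SLR.5."""
    sst_x = np.sum((x - x.mean()) ** 2)
    return sigma2 / sst_x

x_narrow = np.linspace(4.0, 6.0, 50)     # little variation in x
x_wide = np.linspace(0.0, 10.0, 50)      # same n, more spread in x
x_large_n = np.linspace(0.0, 10.0, 500)  # more observations

print(var_slope(x_narrow, sigma2=4.0))   # largest variance
print(var_slope(x_wide, sigma2=4.0))     # smaller: more variation in x helps
print(var_slope(x_wide, sigma2=1.0))     # smaller still: smaller error variance
print(var_slope(x_large_n, sigma2=4.0))  # larger n raises SST_x, lowering the variance
```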
For the purposes of constructing confidence intervals and deriving test statistics, we will need to work with the standard deviations of $\hat{\beta}_1$ and $\hat{\beta}_0$, $\mathrm{sd}(\hat{\beta}_1)$ and $\mathrm{sd}(\hat{\beta}_0)$. Recall that these are obtained by taking the square roots of the variances in (2.57) and (2.58). In particular, $\mathrm{sd}(\hat{\beta}_1) = \sigma/\sqrt{\text{SST}_x}$, where $\sigma$ is the square root of $\sigma^2$, and $\sqrt{\text{SST}_x}$ is the square root of $\text{SST}_x$.
Estimating the Error Variance

The formulas in (2.57) and (2.58) allow us to isolate the factors that contribute to $\mathrm{Var}(\hat{\beta}_1)$ and $\mathrm{Var}(\hat{\beta}_0)$. But these formulas are unknown, except in the extremely rare case that $\sigma^2$ is known. Nevertheless, we can use the data to estimate $\sigma^2$, which then allows us to estimate $\mathrm{Var}(\hat{\beta}_1)$ and $\mathrm{Var}(\hat{\beta}_0)$.

Q U E S T I O N   2 . 5
Show that, when estimating $\beta_0$, it is best to have $\bar{x} = 0$. What is $\mathrm{Var}(\hat{\beta}_0)$ in this case? [Hint: For any sample of numbers, $\sum_{i=1}^n x_i^2 \geq \sum_{i=1}^n (x_i - \bar{x})^2$, with equality only if $\bar{x} = 0$.]

This is a good place to emphasize the difference between the errors (or disturbances) and the residuals, since this distinction is crucial for constructing an estimator of $\sigma^2$. Equation (2.48) shows how to write the population model in terms of a randomly sampled observation as $y_i = \beta_0 + \beta_1 x_i + u_i$, where $u_i$ is the error for observation i. We can also express $y_i$ in terms of its fitted value and residual as in equation (2.32): $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{u}_i$. Comparing these two equations, we see that the error shows up in the equation containing the population parameters, $\beta_0$ and $\beta_1$. On the other hand, the residuals show up in the estimated equation with $\hat{\beta}_0$ and $\hat{\beta}_1$. The errors are never observable, while the residuals are computed from the data.
We can use equations (2.32) and (2.48) to write the residuals as a function of the errors:

$$ \hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i = (\beta_0 + \beta_1 x_i + u_i) - \hat{\beta}_0 - \hat{\beta}_1 x_i, $$

or

$$ \hat{u}_i = u_i - (\hat{\beta}_0 - \beta_0) - (\hat{\beta}_1 - \beta_1)x_i. \qquad (2.59) $$

Although the expected value of $\hat{\beta}_0$ equals $\beta_0$, and similarly for $\hat{\beta}_1$, $\hat{u}_i$ is not the same as $u_i$.
The difference between them does have an expected value of zero.
Now that we understand the difference between the errors and the residuals, we can return to estimating $\sigma^2$. First, $\sigma^2 = E(u^2)$, so an unbiased "estimator" of $\sigma^2$ is $n^{-1}\sum_{i=1}^n u_i^2$. Unfortunately, this is not a true estimator, because we do not observe the errors $u_i$. But, we do have estimates of the $u_i$, namely, the OLS residuals $\hat{u}_i$. If we replace the errors with the OLS residuals, we have $n^{-1}\sum_{i=1}^n \hat{u}_i^2 = \text{SSR}/n$. This is a true estimator, because it gives a computable rule for any sample of data on x and y. One slight drawback to this estimator is that it turns out to be biased (although for large n the bias is small). Because it is easy to compute an unbiased estimator, we use that instead.

The estimator SSR/n is biased essentially because it does not account for two restrictions that must be satisfied by the OLS residuals. These restrictions are given by the two OLS first order conditions:

$$ \sum_{i=1}^n \hat{u}_i = 0, \quad \sum_{i=1}^n x_i \hat{u}_i = 0. \qquad (2.60) $$

One way to view these restrictions is this: if we know $n - 2$ of the residuals, we can always get the other two residuals by using the restrictions implied by the first order conditions in (2.60). Thus, there are only $n - 2$ degrees of freedom in the OLS residuals, as opposed to n degrees of freedom in the errors. If we replace $\hat{u}_i$ with $u_i$ in (2.60), the restrictions would no longer hold. The unbiased estimator of $\sigma^2$ that we will use makes a degrees of freedom adjustment:

$$ \hat{\sigma}^2 = \frac{1}{(n-2)}\sum_{i=1}^n \hat{u}_i^2 = \text{SSR}/(n-2). \qquad (2.61) $$

(This estimator is sometimes denoted as $s^2$, but we continue to use the convention of putting "hats" over estimators.)
Theorem 2.3 (Unbiased Estimation of $\sigma^2$)
Under Assumptions SLR.1 through SLR.5,

$$ E(\hat{\sigma}^2) = \sigma^2. $$

PROOF: If we average equation (2.59) across all i and use the fact that the OLS residuals average out to zero, we have $0 = \bar{u} - (\hat{\beta}_0 - \beta_0) - (\hat{\beta}_1 - \beta_1)\bar{x}$; subtracting this from (2.59) gives $\hat{u}_i = (u_i - \bar{u}) - (\hat{\beta}_1 - \beta_1)(x_i - \bar{x})$. Therefore, $\hat{u}_i^2 = (u_i - \bar{u})^2 + (\hat{\beta}_1 - \beta_1)^2 (x_i - \bar{x})^2 - 2(u_i - \bar{u})(\hat{\beta}_1 - \beta_1)(x_i - \bar{x})$. Summing across all i gives

$$ \sum_{i=1}^n \hat{u}_i^2 = \sum_{i=1}^n (u_i - \bar{u})^2 + (\hat{\beta}_1 - \beta_1)^2 \sum_{i=1}^n (x_i - \bar{x})^2 - 2(\hat{\beta}_1 - \beta_1)\sum_{i=1}^n u_i (x_i - \bar{x}). $$

Now, the expected value of the first term is $(n-1)\sigma^2$, something that is shown in Appendix C. The expected value of the second term is simply $\sigma^2$ because $E[(\hat{\beta}_1 - \beta_1)^2] = \mathrm{Var}(\hat{\beta}_1) = \sigma^2/\text{SST}_x$. Finally, the third term can be written as $-2(\hat{\beta}_1 - \beta_1)^2\, \text{SST}_x$; taking expectations gives $-2\sigma^2$. Putting these three terms together gives $E\big(\sum_{i=1}^n \hat{u}_i^2\big) = (n-1)\sigma^2 + \sigma^2 - 2\sigma^2 = (n-2)\sigma^2$, so that $E[\text{SSR}/(n-2)] = \sigma^2$.
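A quick simulation (with made-up parameter values) illustrates Theorem 2.3: averaging SSR/n over many samples falls short of $\sigma^2$, while SSR/(n-2) centers on it.

```python
import numpy as np

# Simulation sketch of Theorem 2.3: SSR/(n-2) is unbiased for sigma^2,
# while SSR/n is biased downward (noticeably so in small samples).
rng = np.random.default_rng(3)
n, beta0, beta1, sigma = 30, 1.0, 2.0, 3.0
x = rng.uniform(0.0, 10.0, size=n)      # x_i held fixed across replications

biased, unbiased = [], []
for _ in range(20_000):
    u = rng.normal(0.0, sigma, size=n)
    y = beta0 + beta1 * x + u
    b1 = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    ssr = np.sum((y - b0 - b1 * x) ** 2)   # sum of squared residuals
    biased.append(ssr / n)
    unbiased.append(ssr / (n - 2))

print(sigma ** 2)           # 9.0
print(np.mean(biased))      # below 9.0
print(np.mean(unbiased))    # close to 9.0, as Theorem 2.3 predicts
```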
If $\hat{\sigma}^2$ is plugged into the variance formulas (2.57) and (2.58), then we have unbiased estimators of $\mathrm{Var}(\hat{\beta}_1)$ and $\mathrm{Var}(\hat{\beta}_0)$. Later on, we will need estimators of the standard deviations of $\hat{\beta}_1$ and $\hat{\beta}_0$, and this requires estimating $\sigma$. The natural estimator of $\sigma$ is

$$ \hat{\sigma} = \sqrt{\hat{\sigma}^2} \qquad (2.62) $$

and is called the standard error of the regression (SER). (Other names for $\hat{\sigma}$ are the standard error of the estimate and the root mean squared error, but we will not use these.) Although $\hat{\sigma}$ is not an unbiased estimator of $\sigma$, we can show that it is a consistent estimator of $\sigma$ (see Appendix C), and it will serve our purposes well.

The estimate $\hat{\sigma}$ is interesting because it is an estimate of the standard deviation in the unobservables affecting y; equivalently, it estimates the standard deviation in y after the effect of x has been taken out. Most regression packages report the value of $\hat{\sigma}$ along with the R-squared, intercept, slope, and other OLS statistics (under one of the several names listed above). For now, our primary interest is in using $\hat{\sigma}$ to estimate the standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$. Since $\mathrm{sd}(\hat{\beta}_1) = \sigma/\sqrt{\text{SST}_x}$, the natural estimator of $\mathrm{sd}(\hat{\beta}_1)$ is

$$ \mathrm{se}(\hat{\beta}_1) = \hat{\sigma}/\sqrt{\text{SST}_x} = \hat{\sigma}\Big/\Big(\sum_{i=1}^n (x_i - \bar{x})^2\Big)^{1/2}; $$

this is called the standard error of $\hat{\beta}_1$. Note that $\mathrm{se}(\hat{\beta}_1)$ is viewed as a random variable when we think of running OLS over different samples of y; this is true because $\hat{\sigma}$ varies with different samples. For a given sample, $\mathrm{se}(\hat{\beta}_1)$ is a number, just as $\hat{\beta}_1$ is simply a number when we compute it from the given data.

Similarly, $\mathrm{se}(\hat{\beta}_0)$ is obtained from $\mathrm{sd}(\hat{\beta}_0)$ by replacing $\sigma$ with $\hat{\sigma}$. The standard error of any estimate gives us an idea of how precise the estimator is. Standard errors play a central role throughout this text; we will use them to construct test statistics and confidence intervals for every econometric procedure we cover, starting in Chapter 4.
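Putting the pieces of this subsection together, the sketch below computes $\hat{\sigma}$, $\mathrm{se}(\hat{\beta}_1)$, and $\mathrm{se}(\hat{\beta}_0)$ by hand on simulated data; the data-generating values are assumptions made only for illustration.

```python
import numpy as np

# Sketch: the SER and the standard errors of the OLS estimates, computed by hand.
rng = np.random.default_rng(4)
n = 120
x = rng.normal(10.0, 3.0, size=n)
y = 2.0 + 0.7 * x + rng.normal(0.0, 2.0, size=n)

sst_x = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * y) / sst_x
b0 = y.mean() - b1 * x.mean()
resid = y - b0 - b1 * x

sigma2_hat = np.sum(resid ** 2) / (n - 2)             # equation (2.61)
sigma_hat = np.sqrt(sigma2_hat)                       # standard error of the regression
se_b1 = sigma_hat / np.sqrt(sst_x)                    # standard error of the slope
se_b0 = sigma_hat * np.sqrt(np.mean(x ** 2) / sst_x)  # from (2.58), with sigma_hat plugged in

print(sigma_hat, se_b1, se_b0)
```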
2.6 Regression through the Origin
In rare cases, we wish to impose the restriction that, when x = 0, the expected value of y is zero. There are certain relationships for which this is reasonable. For example, if income (x) is zero, then income tax revenues (y) must also be zero. In addition, there are settings where a model that originally has a nonzero intercept is transformed into a model without an intercept.

Formally, we now choose a slope estimator, which we call $\tilde{\beta}_1$, and a line of the form

$$ \tilde{y} = \tilde{\beta}_1 x, \qquad (2.63) $$

where the tildes over $\tilde{\beta}_1$ and $\tilde{y}$ are used to distinguish this problem from the much more common problem of estimating an intercept along with a slope. Obtaining (2.63) is called regression through the origin because the line (2.63) passes through the point $x = 0$, $\tilde{y} = 0$. To obtain the slope estimate in (2.63), we still rely on the method of ordinary least squares, which in this case minimizes the sum of squared residuals:

$$ \sum_{i=1}^n (y_i - \tilde{\beta}_1 x_i)^2. \qquad (2.64) $$

Using one-variable calculus, it can be shown that $\tilde{\beta}_1$ must solve the first order condition:

$$ \sum_{i=1}^n x_i (y_i - \tilde{\beta}_1 x_i) = 0. \qquad (2.65) $$

From this, we can solve for $\tilde{\beta}_1$:

$$ \tilde{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}, \qquad (2.66) $$

provided that not all the $x_i$ are zero, a case we rule out.

Note how $\tilde{\beta}_1$ compares with the slope estimate when we also estimate the intercept (rather than set it equal to zero). These two estimates are the same if, and only if, $\bar{x} = 0$. [See equation (2.49) for $\hat{\beta}_1$.] Obtaining an estimate of $\beta_1$ using regression through the origin is not done very often in applied work, and for good reason: if the intercept $\beta_0 \neq 0$, then $\tilde{\beta}_1$ is a biased estimator of $\beta_1$. You will be asked to prove this in Problem 2.8.
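The bias from imposing a zero intercept when $\beta_0 \neq 0$ shows up immediately in simulated data. In this sketch the true intercept and slope are arbitrary choices; it compares the through-the-origin estimate (2.66) with the usual OLS slope (2.49).

```python
import numpy as np

# Sketch contrasting the through-the-origin slope (2.66) with the usual OLS
# slope (2.49) when the true intercept is not zero.
rng = np.random.default_rng(5)
n, beta0, beta1 = 500, 4.0, 1.5
x = rng.uniform(1.0, 9.0, size=n)
y = beta0 + beta1 * x + rng.normal(0.0, 1.0, size=n)

b1_origin = np.sum(x * y) / np.sum(x ** 2)                         # equation (2.66)
b1_ols = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)  # equation (2.49)

print(b1_ols)      # close to beta1 = 1.5
print(b1_origin)   # pulled away from 1.5 because beta0 = 4 is ignored (biased)
```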
S U M M A R Y
We have introduced the simple linear regression model in this chapter, and we have
covered its basic properties. Given a random sample, the method of ordinary least
squares is used to estimate the slope and intercept parameters in the population model.
We have demonstrated the algebra of the OLS regression line, including computation of
fitted values and residuals, and the obtaining of predicted changes in the dependent vari-
able for a given change in the independent variable. In Section 2.4, we discussed two
issues of practical importance: (1) the behavior of the OLS estimates when we change
the units of measurement of the dependent variable or the independent variable and (2)
the use of the natural log to allow for constant elasticity and constant semi-elasticity
models.
In Section 2.5, we showed that, under the four Assumptions SLR.1 through SLR.4, the
OLS estimators are unbiased. The key assumption is that the error term u has zero mean
given any value of the independent variable x. Unfortunately, there are reasons to think
this is false in many social science applications of simple regression, where the omitted
factors in u are often correlated with x. When we add the assumption that the variance of
the error given x is constant, we get simple formulas for the sampling variances of the OLS
estimators. As we saw, the variance of the slope estimator $\hat{\beta}_1$ increases as the error variance increases, and it decreases when there is more sample variation in the independent variable. We also derived an unbiased estimator for $\sigma^2 = \mathrm{Var}(u)$.
In Section 2.6, we briefly discussed regression through the origin, where the slope
estimator is obtained under the assumption that the intercept is zero. Sometimes, this is
useful, but it appears infrequently in applied work.
Much work is left to be done. For example, we still do not know how to test hypothe-
ses about the population parameters, $\beta_0$ and $\beta_1$. Thus, although we know that OLS is unbi-
ased for the population parameters under Assumptions SLR.1 through SLR.4, we have no
way of drawing inference about the population. Other topics, such as the efficiency of OLS
relative to other possible procedures, have also been omitted.
The issues of confidence intervals, hypothesis testing, and efficiency are central to mul-
tiple regression analysis as well. Since the way we construct confidence intervals and test
statistics is very similar for multiple regressionand because simple regression is a special
case of multiple regressionour time is better spent moving on to multiple regression,
which is much more widely applicable than simple regression. Our purpose in Chapter 2
was to get you thinking about the issues that arise in econometric analysis in a fairly sim-
ple setting.
The Gauss-Markov Assumptions
for Simple Regression
For convenience, we summarize the Gauss-Markov assumptions that we used in this
chapter. It is important to remember that only SLR.1 through SLR.4 are needed to show $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased. We added the homoskedasticity assumption, SLR.5, in order to obtain the usual OLS variance formulas (2.57) and (2.58).
64 Part 1 Regression Analysis with Cross-Sectional Data
89782_02_c02_p023-072.qxd 5/25/05 11:46 AM Page 64
Assumption SLR.1 (Linear in Parameters)
In the population model, the dependent variable, y, is related to the independent variable, x, and the error (or disturbance), u, as

$$ y = \beta_0 + \beta_1 x + u, $$

where $\beta_0$ and $\beta_1$ are the population intercept and slope parameters, respectively.
Assumption SLR.2 (Random Sampling)
We have a random sample of size n, $\{(x_i, y_i): i = 1, 2, \ldots, n\}$, following the population model in Assumption SLR.1.
Assumption SLR.3 (Sample Variation in the Explanatory Variable)
The sample outcomes on x, namely, $\{x_i, i = 1, \ldots, n\}$, are not all the same value.
Assumption SLR.4 (Zero Conditional Mean)
The error u has an expected value of zero given any value of the explanatory variable. In other words,

$$ E(u|x) = 0. $$
Assumption SLR.5 (Homoskedasticity)
The error u has the same variance given any value of the explanatory variable. In other words,

$$ \mathrm{Var}(u|x) = \sigma^2. $$
K E Y T E R M S
Coefficient of Determination
Constant Elasticity Model
Control Variable
Covariate
Degrees of Freedom
Dependent Variable
Elasticity
Error Term (Disturbance)
Error Variance
Explained Sum of Squares (SSE)
Explained Variable
Explanatory Variable
First Order Conditions
Fitted Value
Gauss-Markov Assumptions
Heteroskedasticity
Homoskedasticity
Independent Variable
Intercept Parameter
OLS Regression Line
Ordinary Least Squares (OLS)
Population Regression Function (PRF)
Predicted Variable
Predictor Variable
R-squared
Regressand
Regression through the Origin
Regressor
Residual
Residual Sum of Squares (SSR)
Response Variable
Sample Regression Function (SRF)
Semi-elasticity
Simple Linear Regression Model
Slope Parameter
Standard Error of $\hat{\beta}_1$
Standard Error of the Regression (SER)
Sum of Squared Residuals (SSR)
Total Sum of Squares (SST)
Zero Conditional Mean Assumption
P R O B L E M S
2.1 Let kids denote the number of children ever born to a woman, and let educ denote years
of education for the woman. A simple model relating fertility to years of education is
$$ kids = \beta_0 + \beta_1 educ + u, $$
where u is the unobserved error.
(i) What kinds of factors are contained in u? Are these likely to be correlated
with level of education?
(ii) Will a simple regression analysis uncover the ceteris paribus effect of edu-
cation on fertility? Explain.
2.2 In the simple linear regression model $y = \beta_0 + \beta_1 x + u$, suppose that $E(u) \neq 0$. Letting $\alpha_0 = E(u)$, show that the model can always be rewritten with the same slope, but a new intercept and error, where the new error has a zero expected value.
2.3 The following table contains the ACT scores and the GPA (grade point average) for
eight college students. Grade point average is based on a four-point scale and has been
rounded to one digit after the decimal.
Student GPA ACT
1 2.8 21
2 3.4 24
3 3.0 26
4 3.5 27
5 3.6 29
6 3.0 25
7 2.7 25
8 3.7 30
(i) Estimate the relationship between GPA and ACT using OLS; that is, obtain
the intercept and slope estimates in the equation
$$ \widehat{GPA} = \hat{\beta}_0 + \hat{\beta}_1\, ACT. $$
Comment on the direction of the relationship. Does the intercept have a
useful interpretation here? Explain. How much higher is the GPA predicted
to be if the ACT score is increased by five points?
(ii) Compute the fitted values and residuals for each observation, and verify
that the residuals (approximately) sum to zero.
(iii) What is the predicted value of GPA when ACT = 20?
(iv) How much of the variation in GPA for these eight students is explained by
ACT? Explain.
2.4 The data set BWGHT.RAW contains data on births to women in the United States.
Two variables of interest are the dependent variable, infant birth weight in ounces (bwght),
and an explanatory variable, average number of cigarettes the mother smoked per day
during pregnancy (cigs). The following simple regression was estimated using data on
n = 1388 births:

$$ \widehat{bwght} = 119.77 - 0.514\, cigs $$

(i) What is the predicted birth weight when cigs = 0? What about when cigs = 20 (one pack per day)? Comment on the difference.
(ii) Does this simple regression necessarily capture a causal relationship
between the childs birth weight and the mothers smoking habits?
Explain.
(iii) To predict a birth weight of 125 ounces, what would cigs have to be?
Comment.
(iv) The proportion of women in the sample who do not smoke while pregnant
is about .85. Does this help reconcile your finding from part (iii)?
2.5 In the linear consumption function

$$ \widehat{cons} = \hat{\beta}_0 + \hat{\beta}_1\, inc, $$

the (estimated) marginal propensity to consume (MPC) out of income is simply the slope, $\hat{\beta}_1$, while the average propensity to consume (APC) is $\widehat{cons}/inc = \hat{\beta}_0/inc + \hat{\beta}_1$. Using observations for 100 families on annual income and consumption (both measured in dollars), the following equation is obtained:

$$ \widehat{cons} = -124.84 + 0.853\, inc $$
$$ n = 100, \quad R^2 = 0.692. $$
(i) Interpret the intercept in this equation, and comment on its sign and
magnitude.
(ii) What is the predicted consumption when family income is $30,000?
(iii) With inc on the x-axis, draw a graph of the estimated MPC and APC.
2.6 Using data from 1988 for houses sold in Andover, Massachusetts, from Kiel and
McClain (1995), the following equation relates housing price (price) to the distance from
a recently built garbage incinerator (dist):
$$ \widehat{\log(price)} = 9.40 + 0.312\, \log(dist) $$
$$ n = 135, \quad R^2 = 0.162. $$
(i) Interpret the coefficient on log(dist). Is the sign of this estimate what you
expect it to be?
(ii) Do you think simple regression provides an unbiased estimator of the
ceteris paribus elasticity of price with respect to dist? (Think about the
citys decision on where to put the incinerator.)
(iii) What other factors about a house affect its price? Might these be corre-
lated with distance from the incinerator?
2.7 Consider the savings function

$$ sav = \beta_0 + \beta_1 inc + u, \quad u = \sqrt{inc}\cdot e, $$

where e is a random variable with $E(e) = 0$ and $\mathrm{Var}(e) = \sigma_e^2$. Assume that e is independent of inc.
(i) Show that $E(u|inc) = 0$, so that the key zero conditional mean assumption (Assumption SLR.4) is satisfied. [Hint: If e is independent of inc, then $E(e|inc) = E(e)$.]
(ii) Show that $\mathrm{Var}(u|inc) = \sigma_e^2\, inc$, so that the homoskedasticity Assumption SLR.5 is violated. In particular, the variance of sav increases with inc. [Hint: $\mathrm{Var}(e|inc) = \mathrm{Var}(e)$, if e and inc are independent.]
(iii) Provide a discussion that supports the assumption that the variance of sav-
ings increases with family income.
2.8 Consider the standard simple regression model $y = \beta_0 + \beta_1 x + u$ under the Gauss-Markov Assumptions SLR.1 through SLR.5. The usual OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased for their respective population parameters. Let $\tilde{\beta}_1$ be the estimator of $\beta_1$ obtained by assuming the intercept is zero (see Section 2.6).
(i) Find $E(\tilde{\beta}_1)$ in terms of the $x_i$, $\beta_0$, and $\beta_1$. Verify that $\tilde{\beta}_1$ is unbiased for $\beta_1$ when the population intercept ($\beta_0$) is zero. Are there other cases where $\tilde{\beta}_1$ is unbiased?
(ii) Find the variance of $\tilde{\beta}_1$. (Hint: The variance does not depend on $\beta_0$.)
(iii) Show that $\mathrm{Var}(\tilde{\beta}_1) \leq \mathrm{Var}(\hat{\beta}_1)$. [Hint: For any sample of data, $\sum_{i=1}^n x_i^2 \geq \sum_{i=1}^n (x_i - \bar{x})^2$, with strict inequality unless $\bar{x} = 0$.]
(iv) Comment on the tradeoff between bias and variance when choosing between $\hat{\beta}_1$ and $\tilde{\beta}_1$.
2.9 (i) Let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the intercept and slope from the regression of $y_i$ on $x_i$, using n observations. Let $c_1$ and $c_2$, with $c_2 \neq 0$, be constants. Let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be the intercept and slope from the regression of $c_1 y_i$ on $c_2 x_i$. Show that $\tilde{\beta}_1 = (c_1/c_2)\hat{\beta}_1$ and $\tilde{\beta}_0 = c_1\hat{\beta}_0$, thereby verifying the claims on units of measurement in Section 2.4. [Hint: To obtain $\tilde{\beta}_1$, plug the scaled versions of x and y into (2.19). Then, use (2.17) for $\tilde{\beta}_0$, being sure to plug in the scaled x and y and the correct slope.]
(ii) Now, let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be from the regression of $(c_1 + y_i)$ on $(c_2 + x_i)$ (with no restriction on $c_1$ or $c_2$). Show that $\tilde{\beta}_1 = \hat{\beta}_1$ and $\tilde{\beta}_0 = \hat{\beta}_0 + c_1 - c_2\hat{\beta}_1$.
(iii) Now, let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the OLS estimates from the regression of $\log(y_i)$ on $x_i$, where we must assume $y_i > 0$ for all i. For $c_1 > 0$, let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be the intercept and slope from the regression of $\log(c_1 y_i)$ on $x_i$. Show that $\tilde{\beta}_1 = \hat{\beta}_1$ and $\tilde{\beta}_0 = \log(c_1) + \hat{\beta}_0$.
(iv) Now, assuming that $x_i > 0$ for all i, let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be the intercept and slope from the regression of $y_i$ on $\log(c_2 x_i)$. How do $\tilde{\beta}_0$ and $\tilde{\beta}_1$ compare with the intercept and slope from the regression of $y_i$ on $\log(x_i)$?
2.10 Let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the OLS intercept and slope estimators, respectively, and let $\bar{u}$ be the sample average of the errors (not the residuals!).
(i) Show that $\hat{\beta}_1$ can be written as $\hat{\beta}_1 = \beta_1 + \sum_{i=1}^n w_i u_i$, where $w_i = d_i/\text{SST}_x$ and $d_i = x_i - \bar{x}$.
(ii) Use part (i), along with $\sum_{i=1}^n w_i = 0$, to show that $\hat{\beta}_1$ and $\bar{u}$ are uncorrelated. [Hint: You are being asked to show that $E[(\hat{\beta}_1 - \beta_1)\cdot\bar{u}] = 0$.]
(iii) Show that $\hat{\beta}_0$ can be written as $\hat{\beta}_0 = \beta_0 + \bar{u} - (\hat{\beta}_1 - \beta_1)\bar{x}$.
(iv) Use parts (ii) and (iii) to show that $\mathrm{Var}(\hat{\beta}_0) = \sigma^2/n + \sigma^2 (\bar{x})^2/\text{SST}_x$.
(v) Do the algebra to simplify the expression in part (iv) to equation (2.58). [Hint: $\text{SST}_x/n = n^{-1}\sum_{i=1}^n x_i^2 - (\bar{x})^2$.]
2.11 Suppose you are interested in estimating the effect of hours spent in an SAT prepa-
ration course (hours) on total SAT score (sat). The population is all college-bound high
school seniors for a particular year.
(i) Suppose you are given a grant to run a controlled experiment. Explain how
you would structure the experiment in order to estimate the causal effect
of hours on sat.
(ii) Consider the more realistic case where students choose how much time to
spend in a preparation course, and you can only randomly sample sat and
hours from the population. Write the population model as

$$ sat = \beta_0 + \beta_1 hours + u, $$

where, as usual in a model with an intercept, we can assume $E(u) = 0$. List at least two factors contained in u. Are these likely to have positive or negative correlation with hours?
(iii) In the equation from part (ii), what should be the sign of $\beta_1$ if the preparation course is effective?
(iv) In the equation from part (ii), what is the interpretation of $\beta_0$?
C O M P U T E R E X E R C I S E S
C2.1 The data in 401K.RAW are a subset of data analyzed by Papke (1995) to study the
relationship between participation in a 401(k) pension plan and the generosity of the plan.
The variable prate is the percentage of eligible workers with an active account; this is the
variable we would like to explain. The measure of generosity is the plan match rate, mrate.
This variable gives the average amount the firm contributes to each workers plan for each
$1 contribution by the worker. For example, if mrate = 0.50, then a $1 contribution by the worker is matched by a 50¢ contribution by the firm.
(i) Find the average participation rate and the average match rate in the
sample of plans.
(ii) Now, estimate the simple regression equation

$$ \widehat{prate} = \hat{\beta}_0 + \hat{\beta}_1\, mrate, $$

and report the results along with the sample size and R-squared.
(iii) Interpret the intercept in your equation. Interpret the coefficient on mrate.
(iv) Find the predicted prate when mrate = 3.5. Is this a reasonable predic-
tion? Explain what is happening here.
(v) How much of the variation in prate is explained by mrate? Is this a lot
in your opinion?
C2.2 The data set in CEOSAL2.RAW contains information on chief executive officers
for U.S. corporations. The variable salary is annual compensation, in thousands of dol-
lars, and ceoten is prior number of years as company CEO.
(i) Find the average salary and the average tenure in the sample.
(ii) How many CEOs are in their first year as CEO (that is, ceoten = 0)? What is the longest tenure as a CEO?
(iii) Estimate the simple regression model

$$ \log(salary) = \beta_0 + \beta_1 ceoten + u, $$
and report your results in the usual form. What is the (approximate)
predicted percentage increase in salary given one more year as a CEO?
C2.3 Use the data in SLEEP75.RAW from Biddle and Hamermesh (1990) to study whether
there is a tradeoff between the time spent sleeping per week and the time spent in paid work.
We could use either variable as the dependent variable. For concreteness, estimate the model
$$ sleep = \beta_0 + \beta_1 totwrk + u, $$
where sleep is minutes spent sleeping at night per week and totwrk is total minutes worked
during the week.
(i) Report your results in equation form along with the number of observations and $R^2$. What does the intercept in this equation mean?
(ii) If totwrk increases by 2 hours, by how much is sleep estimated to fall?
Do you find this to be a large effect?
C2.4 Use the data in WAGE2.RAW to estimate a simple regression explaining monthly
salary (wage) in terms of IQ score (IQ).
(i) Find the average salary and average IQ in the sample. What is the sam-
ple standard deviation of IQ? (IQ scores are standardized so that the
average in the population is 100 with a standard deviation equal to 15.)
(ii) Estimate a simple regression model where a one-point increase in IQ
changes wage by a constant dollar amount. Use this model to find the
predicted increase in wage for an increase in IQ of 15 points. Does IQ
explain most of the variation in wage?
(iii) Now, estimate a model where each one-point increase in IQ has the same
percentage effect on wage. If IQ increases by 15 points, what is the
approximate percentage increase in predicted wage?
C2.5 For the population of firms in the chemical industry, let rd denote annual expendi-
tures on research and development, and let sales denote annual sales (both are in millions
of dollars).
(i) Write down a model (not an estimated equation) that implies a constant
elasticity between rd and sales. Which parameter is the elasticity?
(ii) Now, estimate the model using the data in RDCHEM.RAW. Write out the
estimated equation in the usual form. What is the estimated elasticity of
rd with respect to sales? Explain in words what this elasticity means.
C2.6 We used the data in MEAP93.RAW for Example 2.12. Now we want to explore
the relationship between the math pass rate (math10) and spending per student (expend).
(i) Do you think each additional dollar spent has the same effect on the pass
rate, or does a diminishing effect seem more appropriate? Explain.
(ii) In the population model

$$ math10 = \beta_0 + \beta_1 \log(expend) + u, $$

argue that $\beta_1/10$ is the percentage point change in math10 given a 10 percent increase in expend.
(iii) Use the data in MEAP93.RAW to estimate the model from part (ii).
Report the estimated equation in the usual way, including the sample size
and R-squared.
(iv) How big is the estimated spending effect? Namely, if spending increases
by 10 percent, what is the estimated percentage point increase in
math10?
(v) One might worry that regression analysis can produce fitted values for
math10 that are greater than 100. Why is this not much of a worry in
this data set?
A P P E N D I X 2 A
Minimizing the Sum of Squared Residuals
We show that the OLS estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ do minimize the sum of squared residuals, as asserted in Section 2.2. Formally, the problem is to characterize the solutions $\hat{\beta}_0$ and $\hat{\beta}_1$ to the minimization problem

$$ \min_{b_0, b_1} \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2, $$

where $b_0$ and $b_1$ are the dummy arguments for the optimization problem; for simplicity, call this function $Q(b_0, b_1)$. By a fundamental result from multivariable calculus (see Appendix A), a necessary condition for $\hat{\beta}_0$ and $\hat{\beta}_1$ to solve the minimization problem is that the partial derivatives of $Q(b_0, b_1)$ with respect to $b_0$ and $b_1$ must be zero when evaluated at $\hat{\beta}_0$, $\hat{\beta}_1$: $\partial Q(\hat{\beta}_0, \hat{\beta}_1)/\partial b_0 = 0$ and $\partial Q(\hat{\beta}_0, \hat{\beta}_1)/\partial b_1 = 0$. Using the chain rule from calculus, these two equations become

$$ -2\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0. $$
$$ -2\sum_{i=1}^n x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0. $$

These two equations are just (2.14) and (2.15) multiplied by $-2n$ and, therefore, are solved by the same $\hat{\beta}_0$ and $\hat{\beta}_1$.

How do we know that we have actually minimized the sum of squared residuals? The first order conditions are necessary but not sufficient conditions. One way to verify that we have minimized the sum of squared residuals is to write, for any $b_0$ and $b_1$,

$$ Q(b_0, b_1) = \sum_{i=1}^n [y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i + (\hat{\beta}_0 - b_0) + (\hat{\beta}_1 - b_1)x_i]^2 $$
$$ = \sum_{i=1}^n [\hat{u}_i + (\hat{\beta}_0 - b_0) + (\hat{\beta}_1 - b_1)x_i]^2 $$
$$ = \sum_{i=1}^n \hat{u}_i^2 + n(\hat{\beta}_0 - b_0)^2 + (\hat{\beta}_1 - b_1)^2 \sum_{i=1}^n x_i^2 + 2(\hat{\beta}_0 - b_0)(\hat{\beta}_1 - b_1)\sum_{i=1}^n x_i, $$

where we have used equations (2.30) and (2.31). The first term does not depend on $b_0$ or $b_1$, while the sum of the last three terms can be written as

$$ \sum_{i=1}^n [(\hat{\beta}_0 - b_0) + (\hat{\beta}_1 - b_1)x_i]^2, $$

as can be verified by straightforward algebra. Because this is a sum of squared terms, the smallest it can be is zero. Therefore, it is smallest when $b_0 = \hat{\beta}_0$ and $b_1 = \hat{\beta}_1$.
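As a numerical complement to the algebraic argument above, the following sketch (simulated data, arbitrary parameter values) checks that perturbing the OLS estimates in any direction can only increase the sum of squared residuals.

```python
import numpy as np

# Numerical check that the closed-form OLS estimates minimize Q(b0, b1).
rng = np.random.default_rng(6)
n = 80
x = rng.normal(0.0, 2.0, size=n)
y = 1.0 - 0.8 * x + rng.normal(0.0, 1.5, size=n)

def q(b0, b1):
    """Sum of squared residuals for candidate intercept b0 and slope b1."""
    return np.sum((y - b0 - b1 * x) ** 2)

b1_hat = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

# Perturbing the coefficients in any direction can only increase Q
print(q(b0_hat, b1_hat))
for db0, db1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.1, -0.05)]:
    print(q(b0_hat + db0, b1_hat + db1) >= q(b0_hat, b1_hat))   # always True
```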