Chapter 5: Ordinary Least Squares
The linear algebra that we covered in Chapter 1 will now be put to use in explaining the variance
among observations on a dependent variable, placed in the vector y. For each of these
observations yi, we posit the following model:
$$y_i = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ik^*}\beta_{k^*} + e_i. \qquad (5.1)$$
Economists have traditionally referred to Equation (5.1) as ordinary least squares, while other
fields sometimes use the expression regression, or least squares regression. Whatever we choose
to call it, putting this equation in matrix terms, we have
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} =
\begin{bmatrix} 1 & x_{11} & \cdots & x_{1k^*} \\ 1 & x_{21} & \cdots & x_{2k^*} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk^*} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{k^*} \end{bmatrix} +
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}, \qquad (5.2)$$
that is,
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}.$$
The number of columns of the X matrix is k = k* + 1. If you wish, you can think of X as
containing k* “real” independent variables, plus one additional independent variable that
is nothing more than a column of 1's.
The expected (predicted) value of each observation is
$$E(y_i) \equiv \hat{y}_i = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ik^*}\beta_{k^*},$$
or, in matrix terms,
$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}. \qquad (5.3)$$
Each ŷᵢ is formed as the linear combination x′ᵢ·β, with the dot defined as in Equation (1.2).
The difference between ŷ and y is the error, that is, e = y − ŷ, so that y = ŷ + e. The error vector is a
key input in ordinary least squares. Assumptions about the nature of the error are largely
responsible for our ability to make inferences from and about the model. To start, we assume that
E(e) = 0 where both e and 0 are n by 1 columns. Note that this is an assumption that does not
restrict us in any way. If E(e) ≠ 0, the difference would simply be absorbed in the y-intercept, β0.
One of the most important themes in this book is the notion of estimation. In our model, the
values in the y vector and the X matrix are known. They are data. The values in the β vector, on
the other hand, have a different status. These are unknown and hence reflect ignorance about the
theoretical situation at hand. These must be estimated in some way from the sample. How do we
go about doing this? In Section 5.4 we cover the maximum likelihood approach to estimating
regression parameters. Maximum likelihood is also discussed in Section 3.10. For now, we will
be using the least squares principle. This is the idea that the sum of the squared errors of
prediction of the model, the ei, should be as small as possible. We can think about this as a loss
function. As values of yi and ŷ i increasingly diverge, the square of their difference explodes and
observation i figures more and more in the solution for the unknown parameters.
The loss function f is minimized over all possible (combinations of) values in the β vector:
$$\min_{\boldsymbol{\beta}} f$$
where f is defined as
$$f = \mathbf{e}'\mathbf{e} = \sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2 = (\mathbf{y} - \hat{\mathbf{y}})'(\mathbf{y} - \hat{\mathbf{y}}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}).$$
Multiplying out the last version gives
$$f = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}.$$
Note that f is a scalar and so are all four components of the last equation above. Components 2
and 3 are actually identical. (Can you explain why? Hint: Look at Equation (1.5) and the
discussion thereof.) We can simplify by combining those two pieces as below:
$$f = \mathbf{y}'\mathbf{y} - 2\mathbf{y}'\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}. \qquad (5.4)$$
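As a concrete illustration, here is a minimal numerical sketch of the loss function f: it simulates a small data set (the names X, y, beta_true, and the function sse are illustrative assumptions, not from the text) and evaluates e′e for two candidate β vectors.

```python
# A minimal sketch (assumed, simulated data) of the least squares loss f = (y - Xb)'(y - Xb).
import numpy as np

rng = np.random.default_rng(0)
n, k_star = 50, 2                      # n observations, k* "real" regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k_star))])  # prepend the column of 1's
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

def sse(beta, X, y):
    """Sum of squared errors f = e'e for a candidate beta vector."""
    e = y - X @ beta
    return e @ e

print(sse(beta_true, X, y))            # small: beta_true is near the minimizer
print(sse(np.zeros(3), X, y))          # much larger for an arbitrary candidate
```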
The minimum possible value of f occurs where ∂f/∂β = 0, that is to say, when the partial derivatives
of f with respect to each of the elements in β are all zero. In this case, the null vector on the right
hand side is k by 1, that is, it has k elements, all zeroes. As we learned in Equation (3.12), the
derivative of a sum is equal to the sum of the derivatives, so we can analyze our f function one
piece at a time. The value of ∂y′y/∂β is just a k by 1 null vector since y′y is a constant with
respect to β. The derivative ∂[−2y′Xβ]/∂β can be determined from two rules for derivatives
covered in Chapter 3, namely the derivative of a linear combination,
$$\frac{\partial\, \mathbf{a}'\mathbf{x}}{\partial \mathbf{x}'} = \mathbf{a}',$$
and the transposition rule
$$\frac{\partial f}{\partial \mathbf{x}} = \left[ \frac{\partial f}{\partial \mathbf{x}'} \right]',$$
which together give
$$\frac{\partial}{\partial \boldsymbol{\beta}}\,[-2\mathbf{y}'\mathbf{X}\boldsymbol{\beta}] = -2\mathbf{X}'\mathbf{y}.$$
As for piece number 3, β′X′Xβ is a quadratic form and we have seen a derivative rule for that also,
in Equation (3.18). Using that rule we would have
$$\frac{\partial\, \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}}{\partial \boldsymbol{\beta}} = 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta}.$$
Putting the pieces together,
$$\frac{\partial f}{\partial \boldsymbol{\beta}} = 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta} - 2\mathbf{X}'\mathbf{y} = \mathbf{0}. \qquad (5.5)$$
As at any extreme point, the derivative is zero here. Setting it to zero at the minimum, we then have
$$2\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = 2\mathbf{X}'\mathbf{y}, \qquad (5.6)$$
$$\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}, \qquad (5.7)$$
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \qquad (5.8)$$
The k equations described in Equation (5.7) are sometimes called the normal equations. The last
line gives us what we need, a statistical formula we can use to estimate the unknown parameters.
It has to be admitted at this point that a hat somehow snuck onto the β vector just in time to show
up in the last equation above, Equation (5.8). That is a philosophical matter that has to do with the
fact that up to this point, we have had only a theory about how we might go about estimating the
parameter vector β in our model. The last equation above, however, gives us a formula we can
actually use with a sample of data. Unlike β, β̂ can actually be held in one's hand. It is one of an
infinite number of possible ways we could estimate β. The hat tells us that it is just one statistic
from a sample that might be proposed to estimate the unknown population parameter.
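The formula in Equation (5.8) can be checked numerically. The sketch below, again with simulated data (all names are illustrative assumptions), solves the normal equations X′Xβ̂ = X′y directly and compares the answer to numpy's least squares routine.

```python
# A sketch: beta_hat from the normal equations versus np.linalg.lstsq on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # solves X'X b = X'y, i.e., Equation (5.7)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))          # True
```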
Is the formula any good? We know that it minimizes f. That means that there is no other formula
that could give us a smaller sum of squared errors for our model. Perhaps some idea of the
efficacy of this formula can be had by thinking about its expectation. So what is the expected value of β̂?
$$\begin{aligned} E(\hat{\boldsymbol{\beta}}) &= E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}] \\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,E(\mathbf{y}) \\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\mathbf{X}\boldsymbol{\beta}] \\ &= \boldsymbol{\beta}. \end{aligned} \qquad (5.9)$$
Here we have relied on the identity ŷ ≡ E(y) = Xβ in going from the second to the third line above. Also,
we passed (X′X)⁻¹X′ through the expectation operator, something that is certainly legal and in fact
was talked about in Equation (4.5). However, applying Theorem (4.5) in that way means that we
are treating the X matrix as constant. Strictly speaking, the fact that X is fixed implies we cannot
generalize beyond the values in X that we have observed. The good news in the last line above is
that the expectation of β̂ is β, which certainly appears to be a good sign. However, it actually turns
out that this is not strictly necessary. There are other properties that are more important. We turn
now to those.
A good estimator, like our vector β̂, should have four properties. We have already talked about
one of them, unbiasedness:
$$\text{Unbiased:} \quad E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}. \qquad (5.10)$$
A second property is consistency, the requirement that β̂ converge on β as the sample grows:
$$\text{Consistent:} \quad \lim_{n \to \infty} \Pr\big(|\hat{\beta}_i - \beta_i| < \delta\big) = 1 \ \text{for any } \delta > 0. \qquad (5.11)$$
The above expression is sometimes written using the notation Plim, which stands for probability
limit. In that case, Equation (5.11) boils down to
$$\text{Plim}\; \hat{\beta}_i = \beta_i.$$
In effect, what is going on with consistency is that as n → ∞, β̂ → β. Unbiasedness turns out not to
be as important as consistency. Even if the average estimator is not equal to the parameter, if we
can show that it gets closer and closer as the sample size increases, this is fine. Conversely, if the
average estimator is equal to the parameter, but increasing the sample size doesn't get you any
closer to that truth, that would not be good. Now, another characteristic of a good estimator is that
it is efficient:
$$\text{Efficient:} \quad V(\hat{\boldsymbol{\beta}}) \equiv E[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})'] \ \text{is smaller than that of other estimators.} \qquad (5.13)$$
To show that a statistic is efficient, you need to derive its variance, and the variance is invariably
needed for hypothesis testing and confidence intervals. If this variance is large, you will not be
able to reject even really bad hypotheses.
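A small Monte Carlo sketch can make these properties concrete: over repeated samples the average of β̂ sits near β (unbiasedness) and its sampling variance shrinks as n grows (consistency, and the variance that matters for efficiency). The simulation design and names below are assumptions chosen only for illustration.

```python
# Monte Carlo sketch: sampling distribution of beta_hat for increasing n (simulated data).
import numpy as np

rng = np.random.default_rng(1)
beta_true = np.array([1.0, 2.0])

def simulate(n, reps=2000):
    estimates = np.empty((reps, 2))
    for r in range(reps):
        X = np.column_stack([np.ones(n), rng.normal(size=n)])
        y = X @ beta_true + rng.normal(scale=1.0, size=n)
        estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)
    return estimates

for n in (20, 200, 2000):
    est = simulate(n)
    print(n, est.mean(axis=0), est.var(axis=0))   # mean stays near beta; variance falls with n
```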
As we saw above in Equation (5.9), unbiasedness can be demonstrated without any distributional
assumptions about the data. You will note that not a word has been said up to this point
as to whether anything here is normally distributed or not. Some of these other properties require
distributional assumptions to prove. In our model, y = Xβ + e, the e vector will play an important
role in these assumptions. Both X and β contain fixed values: the former is simply data, and the
latter is, by assumption, a set of constant values true of the population as a whole. The only input
that varies randomly is e. From this point forward in this chapter we will assume that
$${}_{n}\mathbf{e}_{1} \sim N({}_{n}\mathbf{0}_{1},\; {}_{n}\boldsymbol{\Sigma}_{n}). \qquad (5.14)$$
This notation (see Section 4.2 for a review) tells us that the n by 1 error vector e is normally
distributed with a mean equal to the null vector, and with a variance matrix Σ. Since e is n by 1, its
mean must be n by 1, and the variances and covariances among the n elements of e can be arrayed
in an n by n symmetric matrix.
Given the assumption above, and our model, we can deduce [from Equations (4.4) and (4.8)]
about the y vector that
$${}_{n}\mathbf{y}_{1} \sim N({}_{n}\mathbf{X}\boldsymbol{\beta}_{1},\; {}_{n}\boldsymbol{\Sigma}_{n}). \qquad (5.15)$$
Now we are ready to add an important set of assumptions, often called the Gauss-Markov
assumptions. These deal with the form of the n · n error variance-covariance matrix, Σ. We
assume that
$$\boldsymbol{\Sigma} = \sigma^2 \mathbf{I} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}, \qquad (5.16)$$
which is really two assumptions. For one, each ei value has the same variance, namely σ2. For
another, each pair of errors, ei and ej (for which i ≠ j), is independent. In other words, all of the
covariances are zero. Since e is normal, this series of assumptions is often called NIID, that is to
say we are asserting that e is normally, identically and independently distributed.
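As an illustrative check of what Equation (5.16) asserts, the sketch below draws many error vectors with σ² = 4 and shows that their sample variance matrix is approximately σ²I; the setup and numbers are hypothetical.

```python
# Sketch of the NIID assumption: errors drawn as N(0, sigma^2 I) have roughly equal
# variances and zero covariances in a large simulated sample of error vectors.
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, reps = 5, 4.0, 200000
E = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))   # each row is one draw of the n by 1 e
Sigma_hat = np.cov(E, rowvar=False)                     # n by n sample variance matrix
print(np.round(Sigma_hat, 2))                           # close to 4 on the diagonal, 0 elsewhere
```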
Let's review for a moment the linear model y = Xβ + e with y ~ N(Xβ, σ²I). Maximum likelihood
(ML) estimation begins by looking at the probability of observing a particular observation, yi. The
formula for the normal density function, given in Equation (4.11), tells us that
$$\Pr(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[ -\,(y_i - \mathbf{x}_{i\cdot}'\boldsymbol{\beta})^2 / 2\sigma^2 \right] \qquad (5.17)$$
where x′ᵢ· is the ith row of X, i.e., the row needed to calculate ŷᵢ as below,
$$\hat{y}_i = \begin{bmatrix} 1 & x_{i1} & \cdots & x_{ik^*} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{k^*} \end{bmatrix}.$$
The part of the normal density that appears as an exponent (to e) is basically the negative one half
of a z-score squared, that is, $-\tfrac{1}{2}z^2$. The role of "μ" in $z = \dfrac{y - \mu}{\sigma}$ is being played by
$E(y_i) \equiv \hat{y}_i = \mathbf{x}_{i\cdot}'\boldsymbol{\beta}$.
Now that we have figured out the probability of an individual observation, the next step in the
reasoning behind ML is to calculate the probability of the whole sample. Since we assume that we
have independent observations, that means we can simply multiply out the probabilities of all of
the individual observations as is done below,
$$\begin{aligned} l = \Pr(\mathbf{y}) &= \prod_i \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left[ -\,(y_i - \mathbf{x}_{i\cdot}'\boldsymbol{\beta})^2 / 2\sigma^2 \right] \\ &= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left[ -\sum_i (y_i - \mathbf{x}_{i\cdot}'\boldsymbol{\beta})^2 / 2\sigma^2 \right] \qquad (5.18) \\ &= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left[ -\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) / 2\sigma^2 \right]. \end{aligned}$$
How did we get to the last step? Here are some reminders from Section 3.1. First recall that
$\exp[a] = e^a$. Next, you need to remember that we can write $a^{1/2} = \sqrt{a}$. It is also true
that $\prod_i \exp[f_i] = \exp\!\big[\sum_i f_i\big]$ because $e^a e^b = e^{a+b}$, that multiplying a constant n times gives
$\prod_i^n a = a \cdot a \cdots a = a^n$, and finally that $\sum_i (a_i - b_i)^2 = (\mathbf{a} - \mathbf{b})'(\mathbf{a} - \mathbf{b})$.
In Section 5.2 we chose a formula, β̂, based on the idea of minimizing the sum of squared errors
of prediction. But the least squares principle is just one way to choose a formula. The maximum
likelihood principle gives us an alternative logical path to follow in coming up with parameter
estimates. The probability that our model is true is proportional to the likelihood of the sample,
called l or more specifically Pr(y). Therefore, it makes sense to pick β̂ such that l is as large as
possible.
It actually turns out to be simpler to maximize the log of the likelihood of the sample. The
maximum point of l is the same as the maximum point of L = ln(l), so this does not impact anything.
Taking the log of Equation (5.18) gives
$$L = \ln(l) = -\frac{n}{2}\ln(2\pi\sigma^2) - (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})/2\sigma^2, \qquad (5.19)$$
with derivative
$$\frac{\partial L}{\partial \boldsymbol{\beta}} = \frac{1}{2\sigma^2}\,(2\mathbf{X}'\mathbf{y} - 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta}). \qquad (5.20)$$
If we take ∂L/∂β = 0, multiply both sides by 2σ², and solve for β, we end up with the same formula
that we came up with using the least squares principle, namely (X′X)⁻¹X′y. Thus β̂ is both the least
squares and the maximum likelihood estimator. Things don’t always work out this way;
sometimes least squares and ML estimators may be different and therefore in competition with
each other. ML always has much to recommend it though. Whenever ML estimators exist, they
can be shown to be efficient [see Equation (5.13)].
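The equivalence can be verified numerically. The sketch below minimizes the negative log-likelihood over β, holding σ² fixed (the maximizer over β does not depend on σ²), and compares the result with the closed-form least squares answer; the data and names are simulated assumptions.

```python
# Sketch: numerically maximizing the likelihood over beta reproduces the OLS estimate.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

def neg_log_lik(beta, sigma2=1.0):
    # negative of Equation (5.19), with sigma^2 held fixed
    e = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + (e @ e) / (2 * sigma2)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_ml = minimize(neg_log_lik, x0=np.zeros(3), method="BFGS").x
print(np.allclose(beta_ols, beta_ml, atol=1e-4))   # True: same estimator
```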
Now that we have a formula β̂ for β, we can go back to our original objective function, f = e′e.
We frequently call this scalar the sum of squares error, written alternatively as SSError or SSE.
Now
$$\mathbf{e}'\mathbf{e} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y},$$
so that therefore
$$\mathbf{y}'\mathbf{y} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} + \mathbf{e}'\mathbf{e}.$$
The error sum of squares can be seen as a remainder from the total raw sum of squares of the
dependent variable, after the predictable part has been subtracted. Or, to put this another way,
the SSTotal can be seen as the sum of SSError + SSPredictable, where SSPredictable can be written in
several equivalent ways:
$$SS_{Predictable} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$
In order to prove to yourself that these are all equivalent, substitute the formula for β̂ into each of
the alternative versions of the formula above and then simplify by canceling any product of the
form X′X(X′X)⁻¹.
Taking the last version of the SSPredictable on the right, note that
$$\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{X}\hat{\boldsymbol{\beta}}) = \hat{\mathbf{y}}'\hat{\mathbf{y}} = \sum_i \hat{y}_i^2.$$
Thus SSPredictable is the sum of the squares of the predictions of the model, the ŷ i . Another way to
write the SSError is as
$$\mathbf{e}'\mathbf{e} = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y}'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{y}'\mathbf{e}.$$
Similarly,
$$\hat{\mathbf{y}}'\mathbf{e} = (\mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = 0.$$
Note that the last line above involves two equivalent versions of SSPredictable, which, being
equivalent, have a difference of 0. The upshot is that the predicted scores, ŷ, and the errors, e, are
orthogonal vectors [Equation (1.17)] with a correlation of 0.
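A quick numerical sketch of these identities, on simulated data: the total raw sum of squares splits into SSPredictable plus SSError, and ŷ′e is zero up to rounding.

```python
# Sketch verifying SS_Total = SS_Predictable + SS_Error and the orthogonality y_hat'e = 0.
import numpy as np

rng = np.random.default_rng(4)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
e = y - y_hat

ss_total = y @ y                 # raw total sum of squares
ss_pred = y_hat @ y_hat          # = beta_hat' X'y
ss_error = e @ e
print(np.isclose(ss_total, ss_pred + ss_error))   # True
print(np.isclose(y_hat @ e, 0.0))                 # True: predictions and errors are orthogonal
```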
The least squares solution can also be expressed in terms of the sample variances and covariances of the variables. Suppose we collect these in a partitioned k by k matrix,
$${}_{k}\mathbf{S}_{k} = \begin{bmatrix} s_{yy} & \mathbf{s}_{xy}' \\ \mathbf{s}_{xy} & \mathbf{S}_{xx} \end{bmatrix}. \qquad (5.24)$$
The scalar syy represents the variance of the y variable, Sxx is the covariance matrix for the
independent variables, and sxy = s′yx is the vector of covariances between the dependent variable
and each of the independent variables. There is no information about the levels of the y or x
variables in S, since variances and covariances are computed from deviation scores. The slope estimates are then
$$\hat{\boldsymbol{\beta}} = \mathbf{S}_{xx}^{-1}\mathbf{s}_{xy}, \qquad (5.25)$$
and the intercept can be recovered from the means,
$$\hat{\beta}_0 = \bar{x}_y - \hat{\boldsymbol{\beta}}'\bar{\mathbf{x}}_x,$$
where $\bar{x}_y$ is the mean of the dependent variable and the column vector $\bar{\mathbf{x}}_x$ contains the means of
each of the independent variables.
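A sketch of Equation (5.25) on simulated data: the slopes come from the partitioned covariance matrix and the intercept from the variable means; the names follow the text where possible and are otherwise assumptions.

```python
# Sketch: slopes from S_xx^{-1} s_xy and the intercept from the means, on simulated data.
import numpy as np

rng = np.random.default_rng(5)
n = 500
Xraw = rng.normal(size=(n, 2)) + np.array([10.0, -3.0])   # two regressors with nonzero means
y = 4.0 + Xraw @ np.array([1.5, -2.0]) + rng.normal(size=n)

S = np.cov(np.column_stack([y, Xraw]), rowvar=False)  # partitioned as in Equation (5.24)
s_yy, s_xy, S_xx = S[0, 0], S[1:, 0], S[1:, 1:]

slopes = np.linalg.solve(S_xx, s_xy)                  # beta_hat = S_xx^{-1} s_xy
intercept = y.mean() - slopes @ Xraw.mean(axis=0)     # beta_0 = mean of y minus beta_hat' means of x
print(slopes, intercept)                              # close to (1.5, -2.0) and 4.0
```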
Instead of just using deviation scores and eliminating β0, as was done in the previous section, we
can also create a version of the β vector, β* say, based on standardized versions of the variables
and which therefore does not carry any information about the metric of the independent and
dependent variables. This can sometimes be useful for comparing particular values in the β vector
and other purposes.
$$\hat{\boldsymbol{\beta}}^{*} = (\mathbf{Z}_x'\mathbf{Z}_x)^{-1}\mathbf{Z}_x'\mathbf{z}_y \qquad (5.26)$$
$$\hat{\boldsymbol{\beta}}^{*} = \mathbf{R}_{xx}^{-1}\mathbf{r}_{xy} \qquad (5.27)$$
where Zx represents the matrix of observations on the independent variables, after having been
converted to Z-scores, and zy is defined analogously for the y vector. The second way that we
have written this, in Equation (5.27), is by using the partitioned correlation matrix, just as we did
with the variance matrix above in Equation (5.24). Here the correlations among the independent
variables are in the matrix Rxx, and those between the independent variables and the dependent
variable are in the vector rxy. The partitioned matrix is shown below:
$${}_{k}\mathbf{R}_{k} = \begin{bmatrix} 1 & \mathbf{r}_{xy}' \\ \mathbf{r}_{xy} & \mathbf{R}_{xx} \end{bmatrix}, \qquad (5.28)$$
where
$$\mathbf{R}_{xx} = \begin{bmatrix} 1 & r_{x_1,x_2} & \cdots & r_{x_1,x_{k^*}} \\ r_{x_2,x_1} & 1 & \cdots & r_{x_2,x_{k^*}} \\ \vdots & \vdots & \ddots & \vdots \\ r_{x_{k^*},x_1} & r_{x_{k^*},x_2} & \cdots & 1 \end{bmatrix}$$
is the matrix of correlations among the k* independent variables, and is therefore k* by k*, the
same as Sxx, and
$$\mathbf{r}_{xy}' = \mathbf{r}_{yx} = \begin{bmatrix} r_{y,x_1} & r_{y,x_2} & \cdots & r_{y,x_{k^*}} \end{bmatrix}$$
is the vector of correlations between the dependent variable and each of the k* independent
variables.
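The two routes to β̂* in Equations (5.26) and (5.27) can be checked against each other on simulated data, as in the sketch below (variable names are illustrative assumptions).

```python
# Sketch: standardized coefficients from z-scores and from the correlation matrix agree.
import numpy as np

rng = np.random.default_rng(6)
n = 400
Xraw = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=n)
y = Xraw @ np.array([2.0, -1.0]) + rng.normal(size=n)

def zscore(a):
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

Zx, zy = zscore(Xraw), zscore(y)
beta_star_z = np.linalg.solve(Zx.T @ Zx, Zx.T @ zy)     # (Z_x'Z_x)^{-1} Z_x' z_y
R = np.corrcoef(np.column_stack([y, Xraw]), rowvar=False)
r_xy, R_xx = R[1:, 0], R[1:, 1:]
beta_star_r = np.linalg.solve(R_xx, r_xy)               # R_xx^{-1} r_xy
print(np.allclose(beta_star_z, beta_star_r))            # True
```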
It is interesting to note that in the calculation of β̂ as well as the standardized β̂*, the correlations
among all the independent variables figure into the calculation of each β̂ᵢ. Of course, if Rxx = I,
this would simplify things quite a bit. In this case, each independent variable would be orthogonal
to all the others, and the calculation of each β̂ᵢ could be done sequentially in any order, instead
of simultaneously as we have done above. We can also see here why our regression model is
unprotected from misspecification in the form of missing independent variables. If there is some
other independent variable of which we are not aware, or at least that we did not measure, our
calculations are obviously not taking it into account, even though its presence could easily modify
the values of all the other β's. The only time we can be protected from the threat of unmeasured
independent variables is when we can be totally sure that all unmeasured variables would be
orthogonal to the independent variables that we did measure. How can we ever be sure of this?
We are protected from unmeasured independent variables when we have a designed experiment
that lets us control the assignment of subjects (or in general "experimental units", whatever they
might be) to the values of the independent variables.
Let's assume we have two different sets of independent variables in the matrices X1 and X2. Each
of these has n observations, so they both have n rows, but there are differing numbers of columns
in X1 and X2. Our model is still yˆ = Xβ but
$$\mathbf{X} = [\mathbf{X}_1 \;\; \mathbf{X}_2] \quad \text{and} \quad \boldsymbol{\beta} = \begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix},$$
where β1 is the vector with as many elements as there are columns in X1 while β2 is the vector
corresponding to each of the independent variables in X2. Note that in this case β1 and β2 are
vectors, not individual beta values. The reason we are doing this is so that we can look at the
regression model in more detail, tracking the relationship between two different sets of
independent variables. Now we can rewrite yˆ = Xβ as
$$\hat{\mathbf{y}} = [\mathbf{X}_1 \;\; \mathbf{X}_2] \begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2.$$
The normal equations (5.7) then become
$$\begin{bmatrix} \mathbf{X}_1' \\ \mathbf{X}_2' \end{bmatrix} [\mathbf{X}_1 \;\; \mathbf{X}_2] \begin{bmatrix} \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1' \\ \mathbf{X}_2' \end{bmatrix} \mathbf{y},$$
and multiplying out the first block row and solving for β̂₁ gives
$$\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y} - (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\hat{\boldsymbol{\beta}}_2. \qquad (5.31)$$
The first component of the right hand side of Equation (5.31) is just the usual least squares
formula that we would see if there was only set X1 of the independent variables and X2 was not
part of the model. Instead, something is being subtracted away from the usual formula. To shed
more light on this, we can factor the premultiplying matrix (X₁′X₁)⁻¹X₁′ to get
$$\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\,[\mathbf{y} - \mathbf{X}_2\hat{\boldsymbol{\beta}}_2].$$
What is the term in brackets? None other than the error for the regression equation if there was
only X2 and X1 was not part of the model. In other words, β̂1 is being calculated not using y, but
using the error from the regression of y on X2. The variance that is at all attributable to X2 has
been swept out of the dependent variable y before β̂1 gets calculated, and vice versa.
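The partitioned result in Equation (5.31) can be illustrated numerically: the sketch below fits the full model and then recovers the same β̂₁ from (X₁′X₁)⁻¹X₁′(y − X₂β̂₂). The data and names are simulated assumptions.

```python
# Sketch: beta_1_hat from the full regression equals (X1'X1)^{-1} X1'(y - X2 beta_2_hat).
import numpy as np

rng = np.random.default_rng(7)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])       # first set (with the intercept)
X2 = rng.normal(size=(n, 2))                                 # second set
X = np.column_stack([X1, X2])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

beta_full = np.linalg.solve(X.T @ X, X.T @ y)
b1_full, b2_full = beta_full[:2], beta_full[2:]

b1_partitioned = np.linalg.solve(X1.T @ X1, X1.T @ (y - X2 @ b2_full))
print(np.allclose(b1_full, b1_partitioned))                  # True
```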
Define
$$\mathbf{P} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \qquad (5.32)$$
and define
$$\mathbf{M} = \mathbf{I} - \mathbf{P}, \qquad (5.33)$$
i.e., M = I − X(X′X)⁻¹X′. Keeping these definitions in mind, let us now consider the simplest of
all possible regression models, namely, a model with only an intercept term,
$$\begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \hat{\beta}_0 = {}_{n}\mathbf{1}_{1}\,\hat{\beta}_0.$$
In this case, the β̂ vector is just the scalar β̂₀ and so its formula becomes
$$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'\mathbf{y} = n^{-1}\mathbf{1}'\mathbf{y} = [\,n^{-1} \;\; n^{-1} \;\; \cdots \;\; n^{-1}\,]\,\mathbf{y} = \frac{1}{n}\sum_i y_i = \bar{y},$$
so that ${}_{n}\hat{\mathbf{y}}_{1} = {}_{n}\mathbf{1}_{1}\,\bar{y}$, that is,
$$\begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix} = \begin{bmatrix} \bar{y} \\ \bar{y} \\ \vdots \\ \bar{y} \end{bmatrix}.$$
$$\mathbf{P} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = {}_{n}\mathbf{1}_{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}' = \begin{bmatrix} \tfrac{1}{n} & \tfrac{1}{n} & \cdots & \tfrac{1}{n} \\ \tfrac{1}{n} & \tfrac{1}{n} & \cdots & \tfrac{1}{n} \\ \vdots & \vdots & \ddots & \vdots \\ \tfrac{1}{n} & \tfrac{1}{n} & \cdots & \tfrac{1}{n} \end{bmatrix}$$
and
$$\hat{\mathbf{y}} = \mathbf{P}\mathbf{y} = \begin{bmatrix} \bar{y} \\ \bar{y} \\ \vdots \\ \bar{y} \end{bmatrix}.$$
$$\mathbf{M} = \mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{I} - \mathbf{P} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} - \begin{bmatrix} \tfrac{1}{n} & \tfrac{1}{n} & \cdots & \tfrac{1}{n} \\ \tfrac{1}{n} & \tfrac{1}{n} & \cdots & \tfrac{1}{n} \\ \vdots & \vdots & \ddots & \vdots \\ \tfrac{1}{n} & \tfrac{1}{n} & \cdots & \tfrac{1}{n} \end{bmatrix}.$$
The M matrix transforms the observations in y into error, but in this case the “error” is equivalent
to deviations from the mean (in other words di values):
$$\mathbf{e} = \mathbf{M}\mathbf{y} = \begin{bmatrix} y_1 - \bar{y} \\ y_2 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{bmatrix}.$$
$$SS_{Error} = \mathbf{y}'\mathbf{M}\mathbf{y} = \sum_i y_i (y_i - \bar{y}) = \mathbf{y}'(\mathbf{y} - \mathbf{1}\bar{y}) = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{1}\bar{y} = \sum_i y_i^2 - \bar{y}\sum_i y_i = \sum_i y_i^2 - \frac{\big(\sum_i y_i\big)^2}{n},$$
which the reader will recognize as the corrected sum of squares from Equation (2.11).
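A small sketch of P and M for the intercept-only model, using simulated y: P returns the mean, M returns deviation scores, and y′My reproduces the corrected sum of squares.

```python
# Sketch: projection matrices for the intercept-only model.
import numpy as np

rng = np.random.default_rng(8)
n = 6
y = rng.normal(loc=5.0, size=n)
ones = np.ones((n, 1))

P = ones @ np.linalg.inv(ones.T @ ones) @ ones.T   # every entry equals 1/n
M = np.eye(n) - P

print(np.allclose(P @ y, y.mean()))                          # predictions are all y-bar
print(np.allclose(M @ y, y - y.mean()))                      # errors are deviation scores
print(np.isclose(y @ M @ y, (y**2).sum() - y.sum()**2 / n))  # corrected sum of squares
```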
While it is known as the linear model, one can fit more complicated curves than lines or planes. It
is relatively straightforward to include quadratic or higher order polynomials in a regression
model, merely by squaring or cubing one of the independent variables (it is wise to mean center
first). For example, consider the model
$$\hat{y}_i = \beta_0 + x_{i1}\beta_1 + x_{i1}^2\beta_2 + x_{i2}\beta_3 + x_{i2}^2\beta_4.$$
The second and fourth independent variables are squared versions of the first and third. In order to
demonstrate the wide variety of shapes we can model using polynomial equations, consider the
figure below where β2 and β4 are either 0 or 1:
[Figure: four response surfaces plotted against x1 and x2, one for each combination of β2 = 0 or 1 (columns) and β4 = 0 or 1 (rows).]
Or consider the following diagram, in which the signs of β2 and β4 are either positive or negative:
[Figure: four response surfaces plotted against x1 and x2, one for each combination of β2 = +1 or −1 (columns) and β4 = +1 or −1 (rows).]
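A sketch of fitting such a polynomial model on simulated data, constructing the squared columns after mean centering; the coefficient values are arbitrary assumptions chosen for illustration.

```python
# Sketch: a quadratic response surface fit by adding squared (mean-centered) columns.
import numpy as np

rng = np.random.default_rng(9)
n = 500
x1 = rng.uniform(-2, 2, size=n)
x2 = rng.uniform(-2, 2, size=n)
x1c, x2c = x1 - x1.mean(), x2 - x2.mean()          # mean center before squaring

X = np.column_stack([np.ones(n), x1c, x1c**2, x2c, x2c**2])
beta_true = np.array([1.0, 0.5, 1.0, -0.5, -1.0])  # beta_2 and beta_4 control the curvature
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 2))                       # recovers the curvature coefficients
```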