Chapter 5: Ordinary Least Squares
The linear algebra that we covered in Chapter 1 will now be put to use in explaining the variance
among observations on a dependent variable, placed in the vector y. For each of these
observations yi, we posit the following model:
$$y_i = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ik^*}\beta_{k^*} + e_i. \qquad (5.1)$$
Economists have traditionally referred to Equation (5.1) as ordinary least squares, while other
fields sometimes use the expression regression, or least squares regression. Whatever we choose
to call it, putting this equation in matrix terms, we have
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} =
\begin{bmatrix} 1 & x_{11} & \cdots & x_{1k^*} \\ 1 & x_{21} & \cdots & x_{2k^*} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk^*} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{k^*} \end{bmatrix} +
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}, \qquad (5.2)$$
that is,
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}.$$
The number of columns of the X matrix is k = k* + 1. If you wish, you can think of X as
containing k* “real” independent variables, plus one additional independent variable that
is nothing more than a column of 1's.
The expected (predicted) value of each observation is
$$E(y_i) \equiv \hat{y}_i = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ik^*}\beta_{k^*},$$
or, in matrix terms,
$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}. \qquad (5.3)$$
Each ŷᵢ is formed as the linear combination x′ᵢ·β, with the dot defined as in Equation (1.2).
The difference between ŷ and y is the error, that is, e = y − ŷ, so that y = ŷ + e. The error vector is a
key input in ordinary least squares. Assumptions about the nature of the error are largely
responsible for our ability to make inferences from and about the model. To start, we assume that
E(e) = 0 where both e and 0 are n by 1 columns. Note that this is an assumption that does not
restrict us in any way. If E(e) ≠ 0, the difference would simply be absorbed in the y-intercept, β0.
One of the most important themes in this book is the notion of estimation. In our model, the
values in the y vector and the X matrix are known. They are data. The values in the β vector, on
the other hand, have a different status. These are unknown and hence reflect ignorance about the
theoretical situation at hand. These must be estimated in some way from the sample. How do we
go about doing this? In Section 5.4 we cover the maximum likelihood approach to estimating
regression parameters. Maximum likelihood is also discussed in Section 3.10. For now, we will
be using the least squares principle. This is the idea that the sum of the squared errors of
prediction of the model, the ei, should be as small as possible. We can think about this as a loss
function. As values of yi and ŷ i increasingly diverge, the square of their difference explodes and
observation i figures more and more in the solution for the unknown parameters.
The loss function f is minimized over all possible (combinations of) values in the β vector:
$$\min_{\boldsymbol{\beta}} f$$
where f is defined as
$$f = \mathbf{e}'\mathbf{e} = \sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2 = (\mathbf{y} - \hat{\mathbf{y}})'(\mathbf{y} - \hat{\mathbf{y}}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}).$$
Multiplying out the last version gives
$$f = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}.$$
Note that f is a scalar and so are all four components of the last equation above. Components 2
and 3 are actually identical. (Can you explain why? Hint: Look at Equation (1.5) and the
discussion thereof.) We can simplify by combining those two pieces as below:
$$f = \mathbf{y}'\mathbf{y} - 2\mathbf{y}'\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}. \qquad (5.4)$$
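As a concrete illustration, here is a minimal numerical sketch of the loss function f: it simulates a small data set (the names X, y, beta_true, and the function sse are illustrative assumptions, not from the text) and evaluates e′e for two candidate β vectors.

```python
# A minimal sketch (assumed, simulated data) of the least squares loss f = (y - Xb)'(y - Xb).
import numpy as np

rng = np.random.default_rng(0)
n, k_star = 50, 2                      # n observations, k* "real" regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k_star))])  # prepend the column of 1's
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

def sse(beta, X, y):
    """Sum of squared errors f = e'e for a candidate beta vector."""
    e = y - X @ beta
    return e @ e

print(sse(beta_true, X, y))            # small: beta_true is near the minimizer
print(sse(np.zeros(3), X, y))          # much larger for an arbitrary candidate
```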
The minimum possible value of f occurs where ∂f/∂β = 0, that is to say, when the partial derivatives
of f with respect to each of the elements in β are all zero. In this case, the null vector on the right
hand side is k by 1, that is, it has k elements, all zeroes. As we learned in Equation (3.12), the
derivative of a sum is equal to the sum of the derivatives, so we can analyze our f function one
piece at a time. The value of ∂y′y/∂β is just a k by 1 null vector since y′y is a constant with
respect to β. The derivative ∂[−2y′Xβ]/∂β can be determined from two rules for derivatives
covered in Chapter 3, namely the derivative of a linear combination,
$$\frac{\partial\, \mathbf{a}'\mathbf{x}}{\partial \mathbf{x}'} = \mathbf{a}',$$
and the transposition rule
$$\frac{\partial f}{\partial \mathbf{x}} = \left[ \frac{\partial f}{\partial \mathbf{x}'} \right]',$$
which together give
$$\frac{\partial}{\partial \boldsymbol{\beta}}\,[-2\mathbf{y}'\mathbf{X}\boldsymbol{\beta}] = -2\mathbf{X}'\mathbf{y}.$$
As for piece number 3, β′X′Xβ is a quadratic form and we have seen a derivative rule for that also,
in Equation (3.18). Using that rule we would have
$$\frac{\partial\, \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}}{\partial \boldsymbol{\beta}} = 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta}.$$
Putting the pieces together,
$$\frac{\partial f}{\partial \boldsymbol{\beta}} = 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta} - 2\mathbf{X}'\mathbf{y} = \mathbf{0}. \qquad (5.5)$$
As at any extreme point, the derivative is zero here. Setting it to zero at the minimum, we then have
$$2\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = 2\mathbf{X}'\mathbf{y}, \qquad (5.6)$$
$$\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}, \qquad (5.7)$$
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \qquad (5.8)$$
The k equations described in Equation (5.7) are sometimes called the normal equations. The last
line gives us what we need, a statistical formula we can use to estimate the unknown parameters.
It has to be admitted at this point that a hat somehow snuck onto the β vector just in time to show
up in the last equation above, Equation (5.8). That is a philosophical matter that has to do with the
fact that up to this point, we have had only a theory about how we might go about estimating the
parameter vector β in our model. The last equation above, however, gives us a formula we can
actually use with a sample of data. Unlike β, β̂ can actually be held in one's hand. It is one of an
infinite number of possible ways we could estimate β. The hat tells us that it is just one statistic
from a sample that might be proposed to estimate the unknown population parameter.
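The formula in Equation (5.8) can be checked numerically. The sketch below, again with simulated data (all names are illustrative assumptions), solves the normal equations X′Xβ̂ = X′y directly and compares the answer to numpy's least squares routine.

```python
# A sketch: beta_hat from the normal equations versus np.linalg.lstsq on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # solves X'X b = X'y, i.e., Equation (5.7)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))          # True
```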
Is the formula any good? We know that it minimizes f. That means that there is no other formula
that could give us a smaller sum of squared errors for our model. Perhaps some idea of the
efficacy of this formula can be had by thinking about its expectation. So what is the expected value of β̂?
$$\begin{aligned} E(\hat{\boldsymbol{\beta}}) &= E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}] \\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,E(\mathbf{y}) \\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\mathbf{X}\boldsymbol{\beta}] \\ &= \boldsymbol{\beta}. \end{aligned} \qquad (5.9)$$
Here we have relied on the identity ŷ ≡ E(y) = Xβ in going from the second to the third line above. Also,
we passed (X′X)⁻¹X′ through the expectation operator, something that is certainly legal and in fact
was talked about in Equation (4.5). However, applying Theorem (4.5) in that way means that we
are treating the X matrix as constant. Strictly speaking, the fact that X is fixed implies we cannot
generalize beyond the values in X that we have observed. The good news in the last line above is
that the expectation of β̂ is β, which certainly appears to be a good sign. However, it actually turns
out that this is not strictly necessary. There are other properties that are more important. We turn
now to those.
A good estimator, like our vector β̂, should have four properties. We have already talked about
one of them, unbiasedness:
$$\text{Unbiased:} \quad E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}. \qquad (5.10)$$
A second property is consistency, the requirement that β̂ converge on β as the sample grows:
$$\text{Consistent:} \quad \lim_{n \to \infty} \Pr\big(|\hat{\beta}_i - \beta_i| < \delta\big) = 1 \ \text{for any } \delta > 0. \qquad (5.11)$$
The above expression is sometimes written using the notation Plim, which stands for probability
limit. In that case, Equation (5.11) boils down to
$$\text{Plim}\; \hat{\beta}_i = \beta_i.$$
In effect, what is going on with consistency is that as n → ∞, β̂ → β. Unbiasedness turns out not to
be as important as consistency. Even if the average estimator is not equal to the parameter, if we
can show that it gets closer and closer as the sample size increases, this is fine. Conversely, if the
average estimator is equal to the parameter, but increasing the sample size doesn't get you any
closer to that truth, that would not be good. Now, another characteristic of a good estimator is that
it is efficient:
$$\text{Efficient:} \quad V(\hat{\boldsymbol{\beta}}) \equiv E[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})'] \ \text{is smaller than that of other estimators.} \qquad (5.13)$$
To show that a statistic is efficient, you need to derive its variance, and the variance is invariably
needed for hypothesis testing and confidence intervals. If this variance is large, you will not be
able to reject even really bad hypotheses.
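A small Monte Carlo sketch can make these properties concrete: over repeated samples the average of β̂ sits near β (unbiasedness) and its sampling variance shrinks as n grows (consistency, and the variance that matters for efficiency). The simulation design and names below are assumptions chosen only for illustration.

```python
# Monte Carlo sketch: sampling distribution of beta_hat for increasing n (simulated data).
import numpy as np

rng = np.random.default_rng(1)
beta_true = np.array([1.0, 2.0])

def simulate(n, reps=2000):
    estimates = np.empty((reps, 2))
    for r in range(reps):
        X = np.column_stack([np.ones(n), rng.normal(size=n)])
        y = X @ beta_true + rng.normal(scale=1.0, size=n)
        estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)
    return estimates

for n in (20, 200, 2000):
    est = simulate(n)
    print(n, est.mean(axis=0), est.var(axis=0))   # mean stays near beta; variance falls with n
```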
As we saw above in Equation (5.9), unbiasedness can be demonstrated without any distributional
assumptions about the data. You will note that not a word has been said up to this point
as to whether anything here is normally distributed or not. Some of these other properties require
distributional assumptions to prove. In our model, y = Xβ + e, the e vector will play an important
role in these assumptions. Both X and β contain fixed values: the former is simply data, and the
latter is, by assumption, a set of constant values true of the population as a whole. The only input
that varies randomly is e. From this point forward in this chapter we will assume that
$${}_{n}\mathbf{e}_{1} \sim N({}_{n}\mathbf{0}_{1},\; {}_{n}\boldsymbol{\Sigma}_{n}). \qquad (5.14)$$
This notation (see Section 4.2 for a review) tells us that the n by 1 error vector e is normally
distributed with a mean equal to the null vector, and with a variance matrix Σ. Since e is n by 1, its
mean must be n by 1, and the variances and covariances among the n elements of e can be arrayed
in an n by n symmetric matrix.
Given the assumption above, and our model, we can deduce [from Equations (4.4) and (4.8)]
about the y vector that
$${}_{n}\mathbf{y}_{1} \sim N({}_{n}\mathbf{X}\boldsymbol{\beta}_{1},\; {}_{n}\boldsymbol{\Sigma}_{n}). \qquad (5.15)$$
Now we are ready to add an important set of assumptions, often called the Gauss-Markov
assumptions. These deal with the form of the n · n error variance-covariance matrix, Σ. We
assume that
$$\boldsymbol{\Sigma} = \sigma^2 \mathbf{I} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}, \qquad (5.16)$$
which is really two assumptions. For one, each ei value has the same variance, namely σ2. For
another, each pair of errors, ei and ej (for which i ≠ j), is independent. In other words, all of the
covariances are zero. Since e is normal, this series of assumptions is often called NIID, that is to
say we are asserting that e is normally, identically and independently distributed.
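As an illustrative check of what Equation (5.16) asserts, the sketch below draws many error vectors with σ² = 4 and shows that their sample variance matrix is approximately σ²I; the setup and numbers are hypothetical.

```python
# Sketch of the NIID assumption: errors drawn as N(0, sigma^2 I) have roughly equal
# variances and zero covariances in a large simulated sample of error vectors.
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, reps = 5, 4.0, 200000
E = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))   # each row is one draw of the n by 1 e
Sigma_hat = np.cov(E, rowvar=False)                     # n by n sample variance matrix
print(np.round(Sigma_hat, 2))                           # close to 4 on the diagonal, 0 elsewhere
```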
Let's review for a moment the linear model y = Xβ + e with y ~ N(Xβ, σ²I). Maximum likelihood
(ML) estimation begins by looking at the probability of observing a particular observation, yi. The
formula for the normal density function, given in Equation (4.11), tells us that
$$\Pr(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[ -\,(y_i - \mathbf{x}_{i\cdot}'\boldsymbol{\beta})^2 / 2\sigma^2 \right] \qquad (5.17)$$
where x′ᵢ· is the ith row of X, i.e., the row needed to calculate ŷᵢ as below,
$$\hat{y}_i = \begin{bmatrix} 1 & x_{i1} & \cdots & x_{ik^*} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{k^*} \end{bmatrix}.$$
The part of the normal density that appears as an exponent (to e) is basically the negative one half
of a z-score squared, that is, $-\tfrac{1}{2}z^2$. The role of "μ" in $z = \dfrac{y - \mu}{\sigma}$ is being played by
$E(y_i) \equiv \hat{y}_i = \mathbf{x}_{i\cdot}'\boldsymbol{\beta}$.
Now that we have figured out the probability of an individual observation, the next step in the
reasoning behind ML is to calculate the probability of the whole sample. Since we assume that we
have independent observations, that means we can simply multiply out the probabilities of all of
the individual observations as is done below,
$$\begin{aligned} l = \Pr(\mathbf{y}) &= \prod_i \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left[ -\,(y_i - \mathbf{x}_{i\cdot}'\boldsymbol{\beta})^2 / 2\sigma^2 \right] \\ &= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left[ -\sum_i (y_i - \mathbf{x}_{i\cdot}'\boldsymbol{\beta})^2 / 2\sigma^2 \right] \qquad (5.18) \\ &= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left[ -\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) / 2\sigma^2 \right]. \end{aligned}$$
How did we get to the last step? Here are some reminders from Section 3.1. First recall that
$\exp[a] = e^a$. Next, you need to remember that we can write $a^{1/2} = \sqrt{a}$. It is also true
that $\prod_i \exp[f_i] = \exp\!\big[\sum_i f_i\big]$ because $e^a e^b = e^{a+b}$, that multiplying a constant n times gives
$\prod_i^n a = a \cdot a \cdots a = a^n$, and finally that $\sum_i (a_i - b_i)^2 = (\mathbf{a} - \mathbf{b})'(\mathbf{a} - \mathbf{b})$.
In Section 5.2 we chose a formula, β̂, based on the idea of minimizing the sum of squared errors
of prediction. But the least squares principle is just one way to choose a formula. The maximum
likelihood principle gives us an alternative logical path to follow in coming up with parameter
estimates. The probability that our model is true is proportional to the likelihood of the sample,
called l or more specifically Pr(y). Therefore, it makes sense to pick β̂ such that l is as large as
possible.
It actually turns out to be simpler to maximize the log of the likelihood of the sample. The
maximum point of l is the same as the maximum point of L = ln(l), so this does not impact anything.
Taking the log of Equation (5.18) gives
$$L = \ln(l) = -\frac{n}{2}\ln(2\pi\sigma^2) - (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})/2\sigma^2, \qquad (5.19)$$
with derivative
$$\frac{\partial L}{\partial \boldsymbol{\beta}} = \frac{1}{2\sigma^2}\,(2\mathbf{X}'\mathbf{y} - 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta}). \qquad (5.20)$$
If we take ∂L/∂β = 0, multiply both sides by 2σ², and solve for β, we end up with the same formula
that we came up with using the least squares principle, namely (X′X)⁻¹X′y. Thus β̂ is both the least
squares and the maximum likelihood estimator. Things don’t always work out this way;
sometimes least squares and ML estimators may be different and therefore in competition with
each other. ML always has much to recommend it though. Whenever ML estimators exist, they
can be shown to be efficient [see Equation (5.13)].
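The equivalence can be verified numerically. The sketch below minimizes the negative log-likelihood over β, holding σ² fixed (the maximizer over β does not depend on σ²), and compares the result with the closed-form least squares answer; the data and names are simulated assumptions.

```python
# Sketch: numerically maximizing the likelihood over beta reproduces the OLS estimate.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

def neg_log_lik(beta, sigma2=1.0):
    # negative of Equation (5.19), with sigma^2 held fixed
    e = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + (e @ e) / (2 * sigma2)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_ml = minimize(neg_log_lik, x0=np.zeros(3), method="BFGS").x
print(np.allclose(beta_ols, beta_ml, atol=1e-4))   # True: same estimator
```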
Now that we have a formula β̂ for β, we can go back to our original objective function, f = e′e.
We frequently call this scalar the sum of squares error, written alternatively as SSError or SSE.
Now
$$\mathbf{e}'\mathbf{e} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y},$$
so that therefore
$$\mathbf{y}'\mathbf{y} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} + \mathbf{e}'\mathbf{e}.$$
The error sum of squares can be seen as a remainder from the total raw sum of squares of the
dependent variable, after the predictable part has been subtracted. Or, to put this another way,
the SSTotal can be seen as the sum of SSError + SSPredictable, where SSPredictable can be written in
several equivalent ways:
$$SS_{Predictable} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$
In order to prove to yourself that these are all equivalent, substitute the formula for β̂ into each of
the alternative versions of the formula above and then simplify by canceling any product of the
form X′X(X′X)⁻¹.
Taking the last version of the SSPredictable on the right, note that
$$\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{X}\hat{\boldsymbol{\beta}}) = \hat{\mathbf{y}}'\hat{\mathbf{y}} = \sum_i \hat{y}_i^2.$$
Thus SSPredictable is the sum of the squares of the predictions of the model, the ŷ i . Another way to
write the SSError is as
$$\mathbf{e}'\mathbf{e} = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y}'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{y}'\mathbf{e}.$$
Similarly,
$$\hat{\mathbf{y}}'\mathbf{e} = (\mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = 0.$$
Note that the last line above involves two equivalent versions of SSPredictable, which, being
equivalent, have a difference of 0. The upshot is that the predicted scores, ŷ, and the errors, e, are
orthogonal vectors [Equation (1.17)] with a correlation of 0.
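A quick numerical sketch of these identities, on simulated data: the total raw sum of squares splits into SSPredictable plus SSError, and ŷ′e is zero up to rounding.

```python
# Sketch verifying SS_Total = SS_Predictable + SS_Error and the orthogonality y_hat'e = 0.
import numpy as np

rng = np.random.default_rng(4)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
e = y - y_hat

ss_total = y @ y                 # raw total sum of squares
ss_pred = y_hat @ y_hat          # = beta_hat' X'y
ss_error = e @ e
print(np.isclose(ss_total, ss_pred + ss_error))   # True
print(np.isclose(y_hat @ e, 0.0))                 # True: predictions and errors are orthogonal
```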
The least squares solution can also be expressed in terms of the sample variances and covariances of the variables. Suppose we collect these in a partitioned k by k matrix,
$${}_{k}\mathbf{S}_{k} = \begin{bmatrix} s_{yy} & \mathbf{s}_{xy}' \\ \mathbf{s}_{xy} & \mathbf{S}_{xx} \end{bmatrix}. \qquad (5.24)$$
The scalar syy represents the variance of the y variable, Sxx is the covariance matrix for the
independent variables, and sxy = s′yx is the vector of covariances between the dependent variable
and each of the independent variables. There is no information about the levels of the y or x
variables in S, since variances and covariances are computed from deviation scores. The slope estimates are then
$$\hat{\boldsymbol{\beta}} = \mathbf{S}_{xx}^{-1}\mathbf{s}_{xy}, \qquad (5.25)$$
and the intercept can be recovered from the means,
$$\hat{\beta}_0 = \bar{x}_y - \hat{\boldsymbol{\beta}}'\bar{\mathbf{x}}_x,$$
where $\bar{x}_y$ is the mean of the dependent variable and the column vector $\bar{\mathbf{x}}_x$ contains the means of
each of the independent variables.
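A sketch of Equation (5.25) on simulated data: the slopes come from the partitioned covariance matrix and the intercept from the variable means; the names follow the text where possible and are otherwise assumptions.

```python
# Sketch: slopes from S_xx^{-1} s_xy and the intercept from the means, on simulated data.
import numpy as np

rng = np.random.default_rng(5)
n = 500
Xraw = rng.normal(size=(n, 2)) + np.array([10.0, -3.0])   # two regressors with nonzero means
y = 4.0 + Xraw @ np.array([1.5, -2.0]) + rng.normal(size=n)

S = np.cov(np.column_stack([y, Xraw]), rowvar=False)  # partitioned as in Equation (5.24)
s_yy, s_xy, S_xx = S[0, 0], S[1:, 0], S[1:, 1:]

slopes = np.linalg.solve(S_xx, s_xy)                  # beta_hat = S_xx^{-1} s_xy
intercept = y.mean() - slopes @ Xraw.mean(axis=0)     # beta_0 = mean of y minus beta_hat' means of x
print(slopes, intercept)                              # close to (1.5, -2.0) and 4.0
```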
Instead of just using deviation scores and eliminating β0, as was done in the previous section, we
can also create a version of the β vector, β* say, based on standardized versions of the variables
and which therefore does not carry any information about the metric of the independent and
dependent variables. This can sometimes be useful for comparing particular values in the β vector
and other purposes.
$$\hat{\boldsymbol{\beta}}^{*} = (\mathbf{Z}_x'\mathbf{Z}_x)^{-1}\mathbf{Z}_x'\mathbf{z}_y \qquad (5.26)$$
$$\hat{\boldsymbol{\beta}}^{*} = \mathbf{R}_{xx}^{-1}\mathbf{r}_{xy} \qquad (5.27)$$
where Zx represents the matrix of observations on the independent variables, after having been
converted to Z-scores, and zy is defined analogously for the y vector. The second way that we
have written this, in Equation (5.27), is by using the partitioned correlation matrix, just as we did
with the variance matrix above in Equation (5.24). Here the correlations among the independent
variables are in the matrix Rxx, and those between the independent variables and the dependent
variable are in the vector rxy. The partitioned matrix is shown below:
$${}_{k}\mathbf{R}_{k} = \begin{bmatrix} 1 & \mathbf{r}_{xy}' \\ \mathbf{r}_{xy} & \mathbf{R}_{xx} \end{bmatrix}, \qquad (5.28)$$
where
$$\mathbf{R}_{xx} = \begin{bmatrix} 1 & r_{x_1,x_2} & \cdots & r_{x_1,x_{k^*}} \\ r_{x_2,x_1} & 1 & \cdots & r_{x_2,x_{k^*}} \\ \vdots & \vdots & \ddots & \vdots \\ r_{x_{k^*},x_1} & r_{x_{k^*},x_2} & \cdots & 1 \end{bmatrix}$$
is the matrix of correlations among the k* independent variables, and is therefore k* by k*, the
same as Sxx, and
$$\mathbf{r}_{xy}' = \mathbf{r}_{yx} = \begin{bmatrix} r_{y,x_1} & r_{y,x_2} & \cdots & r_{y,x_{k^*}} \end{bmatrix}$$
is the vector of correlations between the dependent variable and each of the k* independent
variables.
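The two routes to β̂* in Equations (5.26) and (5.27) can be checked against each other on simulated data, as in the sketch below (variable names are illustrative assumptions).

```python
# Sketch: standardized coefficients from z-scores and from the correlation matrix agree.
import numpy as np

rng = np.random.default_rng(6)
n = 400
Xraw = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=n)
y = Xraw @ np.array([2.0, -1.0]) + rng.normal(size=n)

def zscore(a):
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

Zx, zy = zscore(Xraw), zscore(y)
beta_star_z = np.linalg.solve(Zx.T @ Zx, Zx.T @ zy)     # (Z_x'Z_x)^{-1} Z_x' z_y
R = np.corrcoef(np.column_stack([y, Xraw]), rowvar=False)
r_xy, R_xx = R[1:, 0], R[1:, 1:]
beta_star_r = np.linalg.solve(R_xx, r_xy)               # R_xx^{-1} r_xy
print(np.allclose(beta_star_z, beta_star_r))            # True
```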
It is interesting to note that in the calculation of β̂ as well as the standardized β̂*, the correlations
among all the independent variables figure into the calculation of each β̂ᵢ. Of course, if Rxx = I,
this would simplify things quite a bit. In this case, each independent variable would be orthogonal
to all the others, and the calculation of each β̂ᵢ could be done sequentially in any order, instead
of simultaneously as we have done above. We can also see here why our regression model is
unprotected from misspecification in the form of missing independent variables. If there is some
other independent variable of which we are not aware, or at least that we did not measure, our
calculations are obviously not taking it into account, even though its presence could easily modify
the values of all the other β's. The only time we can be protected from the threat of unmeasured
independent variables is when we can be totally sure that all unmeasured variables would be
orthogonal to the independent variables that we did measure. How can we ever be sure of this?
We are protected from unmeasured independent variables when we have a designed experiment
that lets us control the assignment of subjects (or in general "experimental units", whatever they
might be) to the values of the independent variables.
Let's assume we have two different sets of independent variables in the matrices X1 and X2. Each
of these has n observations, so they both have n rows, but there are differing numbers of columns
in X1 and X2. Our model is still yˆ = Xβ but
$$\mathbf{X} = [\mathbf{X}_1 \;\; \mathbf{X}_2] \quad \text{and} \quad \boldsymbol{\beta} = \begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix},$$
where β1 is the vector with as many elements as there are columns in X1 while β2 is the vector
corresponding to each of the independent variables in X2. Note that in this case β1 and β2 are
vectors, not individual beta values. The reason we are doing this is so that we can look at the
regression model in more detail, tracking the relationship between two different sets of
independent variables. Now we can rewrite yˆ = Xβ as
$$\hat{\mathbf{y}} = [\mathbf{X}_1 \;\; \mathbf{X}_2] \begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2.$$
The normal equations (5.7) then become
$$\begin{bmatrix} \mathbf{X}_1' \\ \mathbf{X}_2' \end{bmatrix} [\mathbf{X}_1 \;\; \mathbf{X}_2] \begin{bmatrix} \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1' \\ \mathbf{X}_2' \end{bmatrix} \mathbf{y},$$
and multiplying out the first block row and solving for β̂₁ gives
$$\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y} - (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\hat{\boldsymbol{\beta}}_2. \qquad (5.31)$$
The first component of the right hand side of Equation (5.31) is just the usual least squares
formula that we would see if there was only set X1 of the independent variables and X2 was not
part of the model. Instead, something is being subtracted away from the usual formula. To shed
more light on this, we can factor the premultiplying matrix (X₁′X₁)⁻¹X₁′ to get
$$\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\,[\mathbf{y} - \mathbf{X}_2\hat{\boldsymbol{\beta}}_2].$$
What is the term in brackets? None other than the error for the regression equation if there was
only X2 and X1 was not part of the model. In other words, β̂1 is being calculated not using y, but
using the error from the regression of y on X2. The variance that is at all attributable to X2 has
been swept out of the dependent variable y before β̂1 gets calculated, and vice versa.
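The partitioned result in Equation (5.31) can be illustrated numerically: the sketch below fits the full model and then recovers the same β̂₁ from (X₁′X₁)⁻¹X₁′(y − X₂β̂₂). The data and names are simulated assumptions.

```python
# Sketch: beta_1_hat from the full regression equals (X1'X1)^{-1} X1'(y - X2 beta_2_hat).
import numpy as np

rng = np.random.default_rng(7)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])       # first set (with the intercept)
X2 = rng.normal(size=(n, 2))                                 # second set
X = np.column_stack([X1, X2])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

beta_full = np.linalg.solve(X.T @ X, X.T @ y)
b1_full, b2_full = beta_full[:2], beta_full[2:]

b1_partitioned = np.linalg.solve(X1.T @ X1, X1.T @ (y - X2 @ b2_full))
print(np.allclose(b1_full, b1_partitioned))                  # True
```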
Define
$$\mathbf{P} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \qquad (5.32)$$
and define
$$\mathbf{M} = \mathbf{I} - \mathbf{P}, \qquad (5.33)$$
i.e., M = I − X(X′X)⁻¹X′. Keeping these definitions in mind, let us now consider the simplest of
all possible regression models, namely, a model with only an intercept term,
$$\begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \hat{\beta}_0 = {}_{n}\mathbf{1}_{1}\,\hat{\beta}_0.$$
In this case, the β̂ vector is just the scalar β̂₀ and so its formula becomes
$$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'\mathbf{y} = n^{-1}\mathbf{1}'\mathbf{y} = [\,n^{-1} \;\; n^{-1} \;\; \cdots \;\; n^{-1}\,]\,\mathbf{y} = \frac{1}{n}\sum_i y_i = \bar{y},$$
so that ${}_{n}\hat{\mathbf{y}}_{1} = {}_{n}\mathbf{1}_{1}\,\bar{y}$, that is,
$$\begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix} = \begin{bmatrix} \bar{y} \\ \bar{y} \\ \vdots \\ \bar{y} \end{bmatrix}.$$
$$\mathbf{P} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = {}_{n}\mathbf{1}_{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}' = \begin{bmatrix} \tfrac{1}{n} & \tfrac{1}{n} & \cdots & \tfrac{1}{n} \\ \tfrac{1}{n} & \tfrac{1}{n} & \cdots & \tfrac{1}{n} \\ \vdots & \vdots & \ddots & \vdots \\ \tfrac{1}{n} & \tfrac{1}{n} & \cdots & \tfrac{1}{n} \end{bmatrix}$$
and
$$\hat{\mathbf{y}} = \mathbf{P}\mathbf{y} = \begin{bmatrix} \bar{y} \\ \bar{y} \\ \vdots \\ \bar{y} \end{bmatrix}.$$
$$\mathbf{M} = \mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{I} - \mathbf{P} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} - \begin{bmatrix} \tfrac{1}{n} & \tfrac{1}{n} & \cdots & \tfrac{1}{n} \\ \tfrac{1}{n} & \tfrac{1}{n} & \cdots & \tfrac{1}{n} \\ \vdots & \vdots & \ddots & \vdots \\ \tfrac{1}{n} & \tfrac{1}{n} & \cdots & \tfrac{1}{n} \end{bmatrix}.$$
The M matrix transforms the observations in y into error, but in this case the “error” is equivalent
to deviations from the mean (in other words di values):
$$\mathbf{e} = \mathbf{M}\mathbf{y} = \begin{bmatrix} y_1 - \bar{y} \\ y_2 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{bmatrix}.$$
$$SS_{Error} = \mathbf{y}'\mathbf{M}\mathbf{y} = \sum_i y_i (y_i - \bar{y}) = \mathbf{y}'(\mathbf{y} - \mathbf{1}\bar{y}) = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{1}\bar{y} = \sum_i y_i^2 - \bar{y}\sum_i y_i = \sum_i y_i^2 - \frac{\big(\sum_i y_i\big)^2}{n},$$
which the reader will recognize as the corrected sum of squares from Equation (2.11).
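A small sketch of P and M for the intercept-only model, using simulated y: P returns the mean, M returns deviation scores, and y′My reproduces the corrected sum of squares.

```python
# Sketch: projection matrices for the intercept-only model.
import numpy as np

rng = np.random.default_rng(8)
n = 6
y = rng.normal(loc=5.0, size=n)
ones = np.ones((n, 1))

P = ones @ np.linalg.inv(ones.T @ ones) @ ones.T   # every entry equals 1/n
M = np.eye(n) - P

print(np.allclose(P @ y, y.mean()))                          # predictions are all y-bar
print(np.allclose(M @ y, y - y.mean()))                      # errors are deviation scores
print(np.isclose(y @ M @ y, (y**2).sum() - y.sum()**2 / n))  # corrected sum of squares
```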
While it is known as the linear model, one can fit more complicated curves than lines or planes. It
is relatively straightforward to include quadratic or higher order polynomials in a regression
model, merely by squaring or cubing one of the independent variables (it is wise to mean center
first). For example, consider the model
$$\hat{y}_i = \beta_0 + x_{i1}\beta_1 + x_{i1}^2\beta_2 + x_{i2}\beta_3 + x_{i2}^2\beta_4.$$
The second and fourth independent variables are squared versions of the first and third. In order to
demonstrate the wide variety of shapes we can model using polynomial equations, consider the
figure below where β2 and β4 are either 0 or 1:
[Figure: four response surfaces plotted against x1 and x2, one for each combination of β2 = 0 or 1 (columns) and β4 = 0 or 1 (rows).]
Or consider the following diagram, in which the signs of β2 and β4 are either positive or negative:
[Figure: four response surfaces plotted against x1 and x2, one for each combination of β2 = +1 or −1 (columns) and β4 = +1 or −1 (rows).]
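A sketch of fitting such a polynomial model on simulated data, constructing the squared columns after mean centering; the coefficient values are arbitrary assumptions chosen for illustration.

```python
# Sketch: a quadratic response surface fit by adding squared (mean-centered) columns.
import numpy as np

rng = np.random.default_rng(9)
n = 500
x1 = rng.uniform(-2, 2, size=n)
x2 = rng.uniform(-2, 2, size=n)
x1c, x2c = x1 - x1.mean(), x2 - x2.mean()          # mean center before squaring

X = np.column_stack([np.ones(n), x1c, x1c**2, x2c, x2c**2])
beta_true = np.array([1.0, 0.5, 1.0, -0.5, -1.0])  # beta_2 and beta_4 control the curvature
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 2))                       # recovers the curvature coefficients
```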