CLRM Assumptions
Abstract
Summary of statistical tests for the Classical Linear Regression Model (CLRM), based on Brooks [1],
Greene [5] [6], Pedace [8], and Zeileis [10].
1 The Classical Linear Regression Model (CLRM)
Let the column vector xk be the T observations on variable xk, k = 1, ⋯, K, and assemble these data in a T × K data matrix X. In most contexts, the first column of X is assumed to be a column of 1s:
x1 = (1, 1, ⋯, 1)′   (a T × 1 column vector)
so that β1 is the constant term in the model. Let y be the T observations y1 , · · · , yT , and let ε be the
column vector containing the T disturbances. The Classical Linear Regression Model (CLRM) can be
written as
y = x1 β1 + ⋯ + xK βK + ε,    xi = (xi1, xi2, ⋯, xiT)′   (T × 1),
or in matrix form
y_{T×1} = X_{T×K} β_{K×1} + ε_{T×1}.
Assumptions of the CLRM (Brooks [1, page 44], Greene [6, pages 16-24]):
(1) Linearity: The model specifies a linear relationship between y and x1 , · · · , xK .
y = Xβ + ε
(2) Full rank: There is no exact linear relationship among any of the independent variables in the model.
This assumption will be necessary for estimation of the parameters of the model (see formula (1)).
(3) Exogeneity of the independent variables: E[εi |xj1 , xj2 , · · · , xjK ] = 0. This states that the
expected value of the disturbance at observation i in the sample is not a function of the independent
variables observed at any observation, including this one. This means that the independent variables will
not carry useful information for prediction of εi .
E[ε|X] = 0.
(4) Homoscedasticity and nonautocorrelation: Each disturbance εi has the same finite variance σ², and is uncorrelated with every other disturbance εj.
E[εε′|X] = σ²I.
(5) Data generation: The data in (xj1 , xj2 , · · · , xjK ) may be any mixture of constants and random
variables.
X may be fixed or random.
(6) Normal distribution: The disturbances are normally distributed.
In order to obtain estimates of the parameters β1, β2, ⋯, βK, the residual sum of squares (RSS)
RSS = ε̂′ε̂ = ∑_{t=1}^{T} ε̂t² = ∑_{t=1}^{T} ( yt − ∑_{i=1}^{K} xit βi )²
is minimised so that the coefficient estimates will be given by the ordinary least squares (OLS) estimator
β̂ = (β̂1, β̂2, ⋯, β̂K)′ = (X′X)⁻¹X′y.    (1)
In order to calculate the standard errors of the coefficient estimates, the variance of the errors, σ², is estimated by the estimator
s² = RSS/(T − K) = ∑_{t=1}^{T} ε̂t² / (T − K)    (2)
where we recall K is the number of regressors including a constant. In this case, K observations are “lost”
as K parameters are estimated, leaving T − K degrees of freedom.
Then the parameter variance-covariance matrix is given by
Var(β̂) = s²(X′X)⁻¹.    (3)
And the coefficient standard errors are simply given by taking the square roots of each of the terms on the leading diagonal. In summary, we have (Brooks [1, pages 91-92])
β̂ = (X′X)⁻¹X′y = β + (X′X)⁻¹X′ε
s² = ∑_{t=1}^{T} ε̂t² / (T − K)    (4)
Var(β̂) = s²(X′X)⁻¹.
The OLS estimator is the best linear unbiased estimator (BLUE), consistent and asymptotically normally
distributed (CAN), and if the disturbances are normally distributed, asymptotically efficient among all CAN
estimators.
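As a small numerical illustration of formulas (1)-(4), the following NumPy sketch computes β̂, s², and the coefficient standard errors on simulated data (the data-generating values below are arbitrary and only serve the example).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: T observations, K = 3 regressors including the constant.
T, K = 200, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
beta_true = np.array([1.0, 0.5, -2.0])
y = X @ beta_true + rng.normal(scale=0.7, size=T)

# OLS estimator (1): beta_hat = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residual variance estimator (2): s^2 = RSS / (T - K)
resid = y - X @ beta_hat
s2 = resid @ resid / (T - K)

# Variance-covariance matrix (3)-(4) and coefficient standard errors
cov_beta = s2 * XtX_inv
se_beta = np.sqrt(np.diag(cov_beta))
print(beta_hat, se_beta)
```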
The unrestricted regression is the one that minimises the residual sum of squares, with no constraints imposed. Now if, after imposing constraints on the model, a residual sum of squares results that is not much higher than the unconstrained model's residual sum of squares, it
would be concluded that the restrictions were supported by the data. On the other hand, if the residual
sum of squares increased considerably after the restrictions were imposed, it would be concluded that the
restrictions were not supported by the data and therefore that the hypothesis should be rejected.
It can be further stated that RRSS ≥ URSS. Only under a particular set of very extreme circumstances
will the residual sums of squares for the restricted and unrestricted models be exactly equal. This would be
the case when the restriction was already present in the data, so that it is not really a restriction at all.
Finally, we note any hypothesis that could be tested with a t-test could also have been tested using an
F -test, since
t²(T − K) ∼ F(1, T − K).
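This equivalence can be checked numerically with a minimal SciPy sketch (the degrees-of-freedom value below is arbitrary): the square of the two-sided 5% critical value of t(T − K) equals the 5% critical value of F(1, T − K).

```python
from scipy import stats

df = 60                              # an illustrative value of T - K
t_crit = stats.t.ppf(0.975, df)      # two-sided 5% critical value of t(T - K)
f_crit = stats.f.ppf(0.95, 1, df)    # 5% critical value of F(1, T - K)
print(t_crit**2, f_crit)             # essentially identical
```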
As a rule of thumb, VIFs greater than 10 signal a highly likely multicollinearity problem, and VIFs
between 5 and 10 signal a somewhat likely multicollinearity issue. Remember to check also other evidence
of multicollinearity (insignificant t-statistics, sensitive or nonsensical coefficient estimates, and nonsensical
coefficient signs and values). A high VIF is only an indicator of potential multicollinearity, but it may not
result in a large variance for the estimator if the variance of the independent variable is also large.
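The standard definition is VIF_k = 1/(1 − R_k²), where R_k² comes from regressing x_k on the other independent variables. The sketch below computes VIFs with statsmodels on deliberately collinear simulated data (the variable names and simulated design are illustrative only).

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
T = 200
x2 = rng.normal(size=T)
x3 = 0.9 * x2 + 0.1 * rng.normal(size=T)        # deliberately collinear with x2
X = sm.add_constant(np.column_stack([x2, x3]))  # T x K design matrix with constant

# VIF_k = 1 / (1 - R_k^2); values above 10 suggest a serious multicollinearity problem.
vifs = [variance_inflation_factor(X, k) for k in range(1, X.shape[1])]
print(vifs)
```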
4 Violation of Assumptions: Heteroscedasticity
4.1 Detection of heteroscedasticity
This is the situation where E[εi²|X] is not a finite constant.
A simple test is the Goldfeld-Quandt (GQ) test, which splits the total sample of length T into two sub-samples of lengths T1 and T2, runs the regression separately on each, and forms the ratio of the two residual variances
GQ = s1² / s2²
with s1² > s2². The test statistic is distributed as an F(T1 − K, T2 − K) under the null hypothesis, and the null of a constant variance is rejected if the test statistic exceeds the critical value.
The GQ test is simple to construct but its conclusions may be contingent upon a particular, and probably
arbitrary, choice of where to split the sample. An alternative method that is sometimes used to sharpen the
inferences from the test and to increase its power is to omit some of the observations from the centre of the
sample so as to introduce a degree of separation between the two sub-samples.
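A minimal sketch of the GQ test, under the assumption that the rows are already sorted by the variable suspected of driving the error variance (the helper goldfeld_quandt and the simulated data are illustrative only; statsmodels also ships a packaged version, het_goldfeldquandt):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

def goldfeld_quandt(y, X, drop_frac=0.2):
    """GQ sketch: split the (ordered) sample, omit the central observations,
    and compare the residual variances of the two sub-sample regressions."""
    T, K = X.shape
    d = int(T * drop_frac)                     # central observations to omit
    T1 = (T - d) // 2                          # size of each sub-sample
    res1 = sm.OLS(y[:T1], X[:T1]).fit()
    res2 = sm.OLS(y[-T1:], X[-T1:]).fit()
    s2_1 = res1.ssr / (T1 - K)
    s2_2 = res2.ssr / (T1 - K)
    gq = max(s2_1, s2_2) / min(s2_1, s2_2)     # s1^2 / s2^2 with s1^2 > s2^2
    return gq, 1 - stats.f.cdf(gq, T1 - K, T1 - K)

# Example: error variance grows with x, and the rows are sorted by x.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(1, 5, size=200))
y = 1 + 0.5 * x + rng.normal(scale=x, size=200)
print(goldfeld_quandt(y, sm.add_constant(x)))
```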
A second approach is White's (1980) general test for heteroscedasticity [9]. Suppose the model is
yt = β1 + β2 x2t + β3 x3t + εt.
(1) To test Var(εt) = σ², estimate the model above, obtaining the residuals ε̂t.
(2) Run the auxiliary regression
ε̂t² = α1 + α2 x2t + α3 x3t + α4 x2t² + α5 x3t² + α6 x2t x3t + νt.
The squared residuals are the quantity of interest since Var(εt) = E[εt²] under the assumption that E[εt] = 0.
The reason that the auxiliary regression takes this form is that it is desirable to investigate whether the
variance of the residuals varies systematically with any known variables relevant to the model. Note also
that this regression should include a constant term, even if the original regression did not, because ε̂t² will always have a non-zero mean.
(3) Given the auxiliary regression, the test can be conducted using two different approaches.
(i) First it is possible to use the F -test framework. This would involve estimating the auxiliary
regression as the unrestricted regression and then running a restricted regression of ε̂t² on a constant only.
The RSS from each specification would then be used as inputs to the standard F -test formula.
(ii) An alternative approach, called Lagrange Multiplier (LM) test, centres around the value of R2
for the auxiliary regression and does not require the estimation of a second (restricted) regression. If one or
more coefficients in the auxiliary regression is statistically significant, the value of R² for that equation will be relatively high, while if none of the variables is significant, R² will be relatively low. The LM test would thus operate by obtaining R² from the auxiliary regression and multiplying it by the number of observations, T. It can be shown that
T R² ∼ χ²(m)
where m is the number of regressors in the auxiliary regression (excluding the constant term), equivalent to
the number of restrictions that would have to be placed under the F -test approach.
(4) The test is of the joint null hypothesis that α2 = α3 = α4 = α5 = α6 = 0. For the LM test, if the χ²-test statistic from step (3) is greater than the corresponding value from the statistical table, then reject the null hypothesis that the errors are homoscedastic.
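The following sketch runs the LM (T·R²) version of the test on simulated heteroscedastic data; the auxiliary regression uses the levels, squares and cross-product as above, and the simulated data are illustrative only. (statsmodels also provides a packaged version, statsmodels.stats.diagnostic.het_white.)

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(3)
T = 300
x2, x3 = rng.normal(size=T), rng.normal(size=T)
y = 1 + 0.5 * x2 - 0.3 * x3 + rng.normal(scale=0.5 + np.abs(x2), size=T)  # heteroscedastic errors

# Step (1): original regression and residuals.
X = sm.add_constant(np.column_stack([x2, x3]))
resid = sm.OLS(y, X).fit().resid

# Step (2): auxiliary regression of squared residuals on levels, squares, cross-product.
Z = sm.add_constant(np.column_stack([x2, x3, x2**2, x3**2, x2 * x3]))
aux = sm.OLS(resid**2, Z).fit()

# Step (3)(ii): LM statistic T R^2 ~ chi^2(m), with m = 5 here.
lm_stat = T * aux.rsquared
pval = 1 - stats.chi2.cdf(lm_stat, df=Z.shape[1] - 1)
print(lm_stat, pval)      # small p-value => reject homoscedasticity
```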
4.3.2 Transformation
A second “solution” for heteroscedasticity is transforming the variables into logs or reducing by some other
measure of “size”. This has the effect of re-scaling the data to “pull in” extreme observations.
4.3.3 The White-corrected standard errors
A third “solution” for heteroscedasticity is to use robust standard errors (White-corrected or heteroscedasticity-corrected standard errors), following White [9]. In a model with one independent variable, the robust standard error is
se(β̂1)HC = √[ ∑_{t=1}^{T} (xt − x̄)² ε̂t² / ( ∑_{t=1}^{T} (xt − x̄)² )² ].
Generalizing this result to a multiple regression model, the robust standard error is
se(β̂k)HC = √[ ∑_{t=1}^{T} ω̂tk² ε̂t² / ( ∑_{t=1}^{T} ω̂tk² )² ]
where the ω̂tk²'s are the squared residuals obtained from the auxiliary regression of xk on all the other
independent variables. Here’s how to calculate robust standard errors:
(1) Estimate your original multivariate model, yt = β1 + β2 x2t + ⋯ + βK xKt + εt, and obtain the squared residuals, ε̂t².
(2) Estimate K − 1 auxiliary regressions of each independent variable on all the other independent variables and retain all T × (K − 1) squared residuals (ω̂tk²).
(3) For any independent variable, calculate the robust standard errors:
se(β̂k)HC = √[ ∑_{t=1}^{T} ω̂tk² ε̂t² / ( ∑_{t=1}^{T} ω̂tk² )² ].
The effect of using the correction is that, if the variance of the errors is positively related to the square of
an explanatory variable, the standard errors for the slope coefficients are increased relative to the usual OLS
standard errors, which would make hypothesis testing more “conservative”, so that more evidence would be
required against the null hypothesis before it would be rejected.
The results of Fabozzi and Francis [4] strongly suggest the presence of heteroscedasticity in the context
of the single index market model. Numerous versions of robust standard errors exist for the purpose of
improving the statistical properties of the heteroskedasticity correction; no form of robust standard error is
preferred above all others.
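In practice these corrections are rarely computed by hand; the hedged sketch below uses statsmodels' cov_type option, which exposes several HC variants (HC0-HC3). The simulated data are illustrative only.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
T = 300
x2 = rng.normal(size=T)
y = 1 + 0.5 * x2 + rng.normal(scale=0.2 + np.abs(x2), size=T)   # heteroscedastic errors
X = sm.add_constant(x2)

ols_fit = sm.OLS(y, X).fit()                    # conventional OLS standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC1")   # White-corrected (robust) standard errors

print(ols_fit.bse)
print(robust_fit.bse)   # typically larger for the slope when Var(eps) rises with x2^2
```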
5.1.2 The run test (the Geary test)
You want to use the run test if you're uncertain about the nature of the autocorrelation.
A run is defined as a sequence of positive or negative residuals. The hypothesis of no autocorrelation
isn’t sustainable if the residuals have too many or too few runs.
The most common version of the test assumes that runs are distributed normally. If the assumption of
no autocorrelation is sustainable, with 95% confidence, the number of runs should be between
µr ± 1.96σr
where µr is the expected number of runs and σr is the standard deviation. These values are calculated by
μr = 2T1T2/(T1 + T2) + 1,    σr = √[ 2T1T2(2T1T2 − T1 − T2) / ((T1 + T2)²(T1 + T2 − 1)) ]
where r is the number of observed runs, T1 is the number of positive residuals, T2 is the number of negative
residuals, and T is the total number of observations.
If the number of observed runs is below the expected interval, it’s evidence of positive autocorrelation;
if the number of runs exceeds the upper bound of the expected interval, it provides evidence of negative
autocorrelation.
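A minimal sketch of the run test using the normal approximation above (the helper runs_test and the autocorrelated example series are illustrative only):

```python
import numpy as np

def runs_test(resid):
    """Count runs of same-signed residuals and return the observed number of runs
    together with the 95% interval mu_r +/- 1.96 sigma_r."""
    signs = resid > 0
    r = 1 + int(np.sum(signs[1:] != signs[:-1]))     # number of runs
    T1, T2 = int(signs.sum()), int((~signs).sum())   # positive / negative residuals
    mu_r = 2 * T1 * T2 / (T1 + T2) + 1
    sigma_r = np.sqrt(2 * T1 * T2 * (2 * T1 * T2 - T1 - T2)
                      / ((T1 + T2) ** 2 * (T1 + T2 - 1)))
    return r, mu_r - 1.96 * sigma_r, mu_r + 1.96 * sigma_r

# Example: a strongly positively autocorrelated series has too few runs.
rng = np.random.default_rng(5)
resid = np.cumsum(rng.normal(size=200))
print(runs_test(resid))
```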
The Durbin-Watson (DW) test is a test for first-order autocorrelation, i.e. it examines only the relationship between an error and its immediately preceding value,3
εt = ρεt−1 + νt    (6)
H0: ρ = 0,    H1: ρ ≠ 0.
It is not necessary to run the regression given by (6) since the test statistic can be calculated using
quantities that are already available after the first regression has been run
DW = ∑_{t=2}^{T} (ε̂t − ε̂t−1)² / ∑_{t=2}^{T} ε̂t² ≈ 2(1 − ρ̂)    (7)
where ρ̂ is the estimated correlation coefficient that would have been obtained from an estimation of (6).
The intuition of the DW statistic is that the numerator “compares” the values of the error at times t − 1 and
t. If there is positive autocorrelation in the errors, this difference in the numerator will be relatively small,
while if there is negative autocorrelation, with the sign of the error changing very frequently, the numerator
will be relatively large. No autocorrelation would result in a value for the numerator between small and
large.
In order for the DW test to be valid for application, three conditions must be fulfilled:
(i) There must be a constant term in the regression.
(ii) The regressors must be non-stochastic.
(iii) There must be no lags of the dependent variable in the regression.4
The DW test does not follow a standard statistical distribution. It has two critical values: an upper
critical value dU and a lower critical value dL . The rejection and non-rejection regions for the DW test are
illustrated in Figure 1.
3 More generally, the AR(1) processes in time series analysis.
4 If the test were used in the presence of lags of the dependent variable or otherwise stochastic regressors, the test statistic would be biased towards 2, suggesting that in some instances the null hypothesis of no autocorrelation would not be rejected when it should be.
Figure 1: Rejection and non-rejection regions for DW test
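The ratio in (7) is simple to compute directly; statsmodels also provides durbin_watson, which evaluates exactly this ratio (the critical bounds dL and dU still come from tables). The AR(1) simulation below is illustrative only.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(6)
T = 250
x = rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):                    # AR(1) errors with rho = 0.6
    eps[t] = 0.6 * eps[t - 1] + rng.normal()
y = 1 + 0.5 * x + eps

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
print(durbin_watson(resid))              # well below 2, pointing to positive autocorrelation
```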
The Breusch-Godfrey test is a more general test for autocorrelation up to rth order. (1) Estimate the original regression and obtain the residuals ε̂t. (2) Regress ε̂t on all of the original regressors plus r lags of the residuals, ε̂t−1, ⋯, ε̂t−r, and obtain R² from this auxiliary (test) regression. (3) Letting T denote the number of observations, the test statistic is given by
(T − r)R² ∼ χ²(r).
Note that (T − r) pre-multiplies R² in the test for autocorrelation rather than T. This arises because the first r observations will effectively have been lost from the sample in order to obtain the r lags used in the test regression, leaving (T − r) observations from which to estimate the auxiliary regression.
One potential difficulty with Breusch-Godfrey is in determining an appropriate value of r. There is no
obvious answer to this, so it is typical to experiment with a range of values, and also to use the frequency of
the data to decide. For example, if the data is monthly or quarterly, set r equal to 12 or 4, respectively.
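A hedged sketch using statsmodels' implementation of the Breusch-Godfrey test with r = 4 lags (the simulated AR(1) errors are illustrative only):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(7)
T = 250
x = rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = 0.5 * eps[t - 1] + rng.normal()
y = 1 + 0.5 * x + eps

res = sm.OLS(y, sm.add_constant(x)).fit()
lm_stat, lm_pval, f_stat, f_pval = acorr_breusch_godfrey(res, nlags=4)
print(lm_stat, lm_pval)   # small p-value => reject the null of no autocorrelation
```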
Now suppose the disturbances follow the AR(1) process
εt = ρεt−1 + νt    (8)
where −1 < ρ < 1 and νt is a random variable that satisfies the CLRM assumptions; namely E[νt|εt−1] = 0, Var(νt|εt−1) = σν², and Cov(νt, νs) = 0 for all t ≠ s. By repeated substitution, we obtain
εt = νt + ρνt−1 + ρ²νt−2 + ρ³νt−3 + ⋯ .
Therefore,
E[εt] = 0,    Var(εt) = σν² + ρ²σν² + ⋯ = σν² / (1 − ρ²).
The stationarity assumption (|ρ| < 1) is necessary to keep this variance finite. OLS assumes no autocorrelation; that is, ρ = 0 in the expression σε² = σν²/(1 − ρ²). Consequently, in the presence of autocorrelation, the estimated variances and standard errors from OLS are underestimated.
The Cochrane-Orcutt procedure works as follows.
(1) Consider the model
yt = β1 + ∑_{i=2}^{K} βi xit + εt,    εt = ρεt−1 + νt.    (9)
Estimate the equation using OLS, ignoring the residual autocorrelation.
(2) Obtain the residuals, and run the regression
ε̂t = ρε̂t−1 + νt .
(3) Obtain ρ̂ and construct yt* = yt − ρ̂yt−1, β1* = (1 − ρ̂)β1, x2t* = x2t − ρ̂x2(t−1), etc., so that the original model can be written as
yt* = β1* + ∑_{i=2}^{K} βi xit* + νt.
(4) Run OLS on this transformed model; the slope estimates carry over directly to the original model, while the intercept is recovered as β̂1 = β̂1*/(1 − ρ̂).
Cochrane and Orcutt [2] argue that better estimates can be obtained by repeating steps (2)-(4) until the
change in ρ̂ between one iteration and the next is less than some fixed amount (e.g. 0.01). In practice, a
small number of iterations (no more than 5) will usually suffice. We also note that assumptions like (9) should be tested before the Cochrane-Orcutt or similar procedure is implemented.
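The steps above translate directly into code. The sketch below is a minimal hand-rolled version of the iteration (the helper cochrane_orcutt and the simulated data are illustrative; statsmodels' GLSAR class offers a comparable iterative fit):

```python
import numpy as np
import statsmodels.api as sm

def cochrane_orcutt(y, X, tol=0.01, max_iter=5):
    """Iterative Cochrane-Orcutt sketch following steps (1)-(4); X is assumed to
    contain a leading column of ones."""
    beta = sm.OLS(y, X).fit().params                             # step (1)
    rho = 0.0
    for _ in range(max_iter):
        resid = y - X @ beta
        rho_new = sm.OLS(resid[1:], resid[:-1]).fit().params[0]  # step (2)
        y_star = y[1:] - rho_new * y[:-1]                        # step (3): quasi-differencing
        X_star = X[1:] - rho_new * X[:-1]                        # constant column becomes (1 - rho),
        beta = sm.OLS(y_star, X_star).fit().params               # step (4): coefficients stay on the original scale
        if abs(rho_new - rho) < tol:
            break
        rho = rho_new
    return beta, rho_new

# Example with AR(1) errors (illustrative data):
rng = np.random.default_rng(8)
T = 300
x = rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = 0.7 * eps[t - 1] + rng.normal()
y = 1 + 0.5 * x + eps
print(cochrane_orcutt(y, sm.add_constant(x)))
```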
An alternative to adjusting the estimation procedure is to retain the OLS estimates but compute serial correlation robust (HAC) standard errors, as follows.
(1) Estimate your original model yt = β1 + ∑_{i=2}^{K} βi xit + εt and obtain the residuals ε̂t.
(2) Estimate the auxiliary regression x2t = α1 + ∑_{i=3}^{K} αi xit + rt and retain the residuals r̂t.
(3) Find the intermediate adjustment factor, α̂t = r̂t ε̂t, and decide how much serial correlation (the number of lags) you're going to allow. A Breusch-Godfrey test can be useful in making this determination, while EViews uses INTEGER[4(T/100)^{2/9}].
(4) Obtain the error variance adjustment factor
v̂ = ∑_{t=1}^{T} α̂t² + 2 ∑_{h=1}^{g} [1 − h/(g + 1)] ( ∑_{t=h+1}^{T} α̂t α̂t−h ),
where g represents the number of lags determined in Step (3).
(5) Calculate the serial correlation robust standard error. For variable x2,
se(β̂2)HAC = ( se(β̂2) / σ̂ε )² √v̂.
(6) Repeat Steps (2) through (5) for independent variables x3 through xK .
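These steps implement a Newey-West-type HAC correction, which statsmodels exposes via cov_type='HAC'. A hedged sketch on simulated AR(1) errors, using the lag rule quoted in Step (3):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
T = 300
x = rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = 0.6 * eps[t - 1] + rng.normal()
y = 1 + 0.5 * x + eps
X = sm.add_constant(x)

g = int(4 * (T / 100) ** (2 / 9))           # the EViews-style lag choice quoted above
hac_fit = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": g})
print(hac_fit.bse)                          # serial-correlation robust (HAC) standard errors
```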
6 Violation of Assumptions: Non-Stochastic Regressors
The OLS estimator is consistent and unbiased in the presence of stochastic regressors, provided that the
regressors are not correlated with the error term of the estimated equation. However, if one or more of the
explanatory variables is contemporaneously correlated with the disturbance term, the OLS estimator will
not even be consistent. This results from the estimator assigning explanatory power to the variables when in reality it arises from the correlation between the error term and yt.
b1 = E[ε³] / (σ²)^{3/2}   (the coefficient of skewness).
8.3 Functional form: Ramsey’s RESET
Ramsey's regression specification error test (RESET) is conducted by adding a quartic function of the fitted values of the dependent variable (ŷt², ŷt³, and ŷt⁴) to the original regression and then testing the joint significance of the coefficients for the added variables.
The logic of using a quartic of your fitted values is that they serve as proxies for variables that may have
been omitted – higher order powers of the fitted values of y can capture a variety of non-linear relationships,
since they embody higher order powers and cross-products of the original explanatory variables.
The test consists of the following steps:
1. Estimate the model you want to test for specification error. E.g. yt = β1 + β2 x1t + · · · + βK xKt + εt .
2. Obtain the fitted values ŷt after estimating your model and estimate
yt = α1 + α2 ŷt² + ⋯ + αp ŷt^p + ∑_{i=1}^{K} βi xit + νt.    (10)
3. Test the joint significance of the coefficients on the fitted values of yt terms using an F-statistic, or using the test statistic T R², which is distributed asymptotically as χ²(p − 1) (the value of R² is obtained
from the regression (10)). If the value of the test statistic is greater than the critical value, reject the null
hypothesis that the functional form was correct.
A RESET allows you to identify whether misspecification is a serious problem with your model, but it
doesn’t allow you to determine the source.
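A minimal RESET sketch with powers up to p = 4, using the F-test route of step 3; the quadratic data-generating process below is illustrative and deliberately misspecified by the linear model.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
T = 300
x = rng.normal(size=T)
y = 1 + 0.5 * x + 0.8 * x**2 + rng.normal(size=T)     # true relationship is non-linear

X = sm.add_constant(x)
restricted = sm.OLS(y, X).fit()                       # the (misspecified) linear model

yhat = restricted.fittedvalues
X_aug = np.column_stack([X, yhat**2, yhat**3, yhat**4])
unrestricted = sm.OLS(y, X_aug).fit()                 # regression (10)

f_stat, p_value, df_diff = unrestricted.compare_f_test(restricted)
print(f_stat, p_value)   # small p-value => reject the null of correct functional form
```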
The null hypothesis for the Chow test is structural stability. The test splits the data into two sub-periods, estimates the regression over the whole sample (giving RSS) and over each sub-sample (giving RSS1 and RSS2), and forms the statistic
F = [ (RSS − (RSS1 + RSS2)) / K ] / [ (RSS1 + RSS2) / (T − 2K) ].
The larger the F-statistic, the more evidence you have against structural stability and the more likely the coefficients are to vary from group to group. If the value of the test statistic is greater than the critical value from the F-distribution, which is an F(K, T − 2K), then reject the null hypothesis that the parameters are stable over time.
Note the result of the F -statistic for the Chow test assumes homoskedasticity. A large F -statistic only
informs you that the parameters vary between the groups, but it doesn’t tell you which specific parameter(s)
is (are) the source(s) of the structural break.
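A minimal sketch of the Chow statistic under the stated homoskedasticity assumption (the helper chow_test and the break in the simulated slope are illustrative only):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

def chow_test(y, X, split):
    """F = [(RSS - (RSS1 + RSS2)) / K] / [(RSS1 + RSS2) / (T - 2K)] ~ F(K, T - 2K)."""
    T, K = X.shape
    rss = sm.OLS(y, X).fit().ssr
    rss1 = sm.OLS(y[:split], X[:split]).fit().ssr
    rss2 = sm.OLS(y[split:], X[split:]).fit().ssr
    f = ((rss - (rss1 + rss2)) / K) / ((rss1 + rss2) / (T - 2 * K))
    return f, 1 - stats.f.cdf(f, K, T - 2 * K)

# Example: the slope shifts half-way through the sample.
rng = np.random.default_rng(11)
T = 200
x = rng.normal(size=T)
slope = np.where(np.arange(T) < T // 2, 0.5, 1.5)
y = 1 + slope * x + rng.normal(size=T)
print(chow_test(y, sm.add_constant(x), split=T // 2))
```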
The predictive failure test estimates the regression over one part of the sample and then uses those coefficient estimates for predicting values of y for the other period. These predictions for y are then implicitly compared with the actual values. The null hypothesis for this test is that the prediction errors for all of the forecasted observations are zero.
To calculate the test:
1. Run the regression for the whole period (the restricted regression) and obtain the RSS.
2. Run the regression for the “large” sub-period and obtain the RSS (called RSS1 ). Note the number
of observations for the long estimation sub-period will be denoted by T1 . The test statistic is given by
(RSS − RSS1)/RSS1 × (T1 − K)/T2
where T2 is the number of observations that the model is attempting to “predict”. The test statistic will follow an F(T2, T1 − K) distribution.
Forward predictive failure tests are where the last few observations are kept back for forecast testing.
Backward predictive failure tests attempt to “back-cast” the first few observations. Both types of test offer
further evidence on the stability of the regression relationship over the whole sample period.
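A sketch of a forward predictive failure test that holds back the last T2 observations (the helper predictive_failure_test follows the statistic above; the simulated data are illustrative only):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

def predictive_failure_test(y, X, T2):
    """Forward predictive failure test: hold back the last T2 observations."""
    T, K = X.shape
    T1 = T - T2
    rss = sm.OLS(y, X).fit().ssr                 # whole-sample (restricted) regression
    rss1 = sm.OLS(y[:T1], X[:T1]).fit().ssr      # "large" sub-period regression
    f = (rss - rss1) / rss1 * (T1 - K) / T2
    return f, 1 - stats.f.cdf(f, T2, T1 - K)

rng = np.random.default_rng(12)
T = 120
x = rng.normal(size=T)
y = 1 + 0.5 * x + rng.normal(size=T)
print(predictive_failure_test(y, sm.add_constant(x), T2=10))
```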
9 The Generalized Linear Regression Model (GLRM)
The generalized linear regression model is
y = Xβ + ε
E[ε|X] = 0    (11)
E[εε′|X] = σ²Ω = Σ,
where Ω is a positive definite matrix. Recall from Section 1 that under the CLRM assumptions the OLS estimator
β̂ = (X′X)⁻¹X′y = β + (X′X)⁻¹X′ε
is the best linear unbiased estimator (BLUE), consistent and asymptotically normally distributed (CAN), and, if the disturbances are normally distributed, asymptotically efficient among all CAN estimators.
In the GLRM, the OLS estimator remains unbiased, consistent, and asymptotically normally distributed.
It will, however, no longer be efficient and the usual inference procedures based on the F and t distributions
are no longer appropriate.
Theorem 1 (Finite Sample Properties of β̂ in the GLRM). If the regressors and disturbances are
uncorrelated, then the least squares estimator is unbiased in the generalized linear regression model. With
non-stochastic regressors, or conditional on X, the sampling variance of the least squares estimator is
Var[β̂|X] = (σ²/n) (X′X/n)⁻¹ (X′ΩX/n) (X′X/n)⁻¹.
If Var[β̂|X] converges to zero, then β̂ is mean square consistent. With well-behaved regressors, (X′X/n)⁻¹ will converge to a constant matrix. But (σ²/n)(X′ΩX/n) need not converge at all.
Theorem 2 (Consistency of OLS in the GLRM). If Q = p lim(X ′ X/n) and p lim(X ′ ΩX/n) are both
finite positive definite matrices, then β̂ is consistent for β. Under the assumed conditions,
p lim β̂ = β.
The conditions in the above theorem depend on both X and Ω. An alternative formula that separates
the two components can be found in Greene [5, page 194-195].
Theorem 3 (Asymptotic Distribution of β̂ in the GLRM). If the regressors are sufficiently well
behaved and the off-diagonal terms in Ω diminish sufficiently rapidly, then the least squares estimator is
asymptotically normally distributed with covariance matrix
Asy.Var[β̂] = (σ²/n) Q⁻¹ p lim( X′ΩX/n ) Q⁻¹.
The matrices of sums of squares and cross products in the left and right matrices are sample data that are
readily estimable, and the problem is the center matrix that involves the unknown σ²Ω = E[εε′|X]. For estimation purposes, we will assume that tr(Ω) = n, as it is when σ²Ω = σ²I in the CLRM.
Let Σ = (σij)i,j = σ²Ω = σ²(ωij)i,j. What is required is an estimator of the K(K + 1)/2 unknown elements in the matrix
Q* = (1/n) X′ΣX = (1/n) ∑_{i,j=1}^{n} σij x̃i x̃j′,
where x̃i is the column vector formed by the transpose of row i of X (see Greene [5, page 805]). To verify
this formula of Q∗ , recall we have the convention
X = [x1, ⋯, xK],    xi = (xi1, xi2, ⋯, xin)′,    i = 1, ⋯, K.
Consequently, X has rows x̃1′, x̃2′, ⋯, x̃n′, so that
X′ = [x̃1, x̃2, ⋯, x̃n]    and    X′ΣX = ∑_{i,j=1}^{n} σij x̃i x̃j′.
The least squares estimator β̂ is a consistent estimator of β, which implies that the least squares residuals
ε̂i are “pointwise” consistent estimators of their population counterparts εi . The general approach, then,
will be to use X and ε̂ to devise an estimator of Q∗ .
9.2.1 HC estimator
Consider the heteroscedasticity case first. White [9] has shown that under very general conditions, the
estimator
S0 = (1/n) ∑_{i=1}^{n} ε̂i² x̃i x̃i′
has
p lim S0 = p lim Q*.
Therefore, the White heteroscedasticity consistent (HC) estimator is
Est.Asy.Var[β̂] = (1/n) (X′X/n)⁻¹ [ (1/n) ∑_{i=1}^{n} ε̂i² x̃i x̃i′ ] (X′X/n)⁻¹ = n (X′X)⁻¹ S0 (X′X)⁻¹.
Turning to the autocorrelation case, the natural counterpart for estimating Q* would be
Q̂* = (1/n) ∑_{i,j=1}^{n} ε̂i ε̂j x̃i x̃j′.
But there are two problems with this estimator. The first one is that it is difficult to conclude yet that
Q̂∗ will converge to anything at all, since the matrix is 1/n times a sum of n2 terms. We can achieve the
convergence of Q̂∗ by assuming that the rows of X are well behaved and that the correlations diminish with
increasing separation in time.
The second problem is a practical one: Q̂* need not be positive definite. Newey and West [7] have
devised an estimator, the Newey–West autocorrelation consistent (AC) covariance estimator, that
overcomes this difficulty:
Q̂* = S0 + (1/n) ∑_{l=1}^{L} ∑_{t=l+1}^{n} wl ε̂t ε̂t−l (x̃t x̃t−l′ + x̃t−l x̃t′),    wl = 1 − l/(L + 1).
It must be determined in advance how large L is to be. In general, there is little theoretical guidance. Current
practice specifies L ≈ T^{1/4}. Unfortunately, the result is not quite as crisp as that for the heteroscedasticity
consistent estimator.
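The HC and AC formulas above are straightforward to code directly. Below is a minimal NumPy sketch of the Newey-West covariance estimator (the function name and simulated inputs are illustrative; in applied work one would typically rely on a packaged implementation such as statsmodels' HAC option).

```python
import numpy as np

def newey_west_cov(X, resid, L):
    """(1/n) (X'X/n)^{-1} Q*_hat (X'X/n)^{-1}, where Q*_hat adds Bartlett-weighted
    autocovariance terms (weights 1 - l/(L+1)) to White's S0."""
    n, _ = X.shape
    S = (X * resid[:, None] ** 2).T @ X / n                          # S0
    for l in range(1, L + 1):
        w = 1 - l / (L + 1)
        gamma = (X[l:] * (resid[l:] * resid[:-l])[:, None]).T @ X[:-l] / n
        S += w * (gamma + gamma.T)
    XtX_inv = np.linalg.inv(X.T @ X / n)
    return XtX_inv @ S @ XtX_inv / n

# Illustrative use with AR(1) errors and L close to T^{1/4}:
rng = np.random.default_rng(13)
n = 300
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.5 * eps[t - 1] + rng.normal()
y = 1 + 0.5 * x + eps
resid = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
print(np.sqrt(np.diag(newey_west_cov(X, resid, L=int(n ** 0.25)))))
```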
References
[1] Brooks, C. (2008). Introductory Econometrics for Finance, 2nd ed. New York: Cambridge University Press.
[2] Cochrane, D. and Orcutt, G. H. (1949). "Application of Least Squares Regression to Relationships Containing Autocorrelated Error Terms", Journal of the American Statistical Association 44, 32–61.
[3] Dougherty, C. (2007). Introduction to Econometrics, 3rd ed. Oxford University Press.
[4] Fabozzi, F. J. and Francis, J. C. (1980). "Heteroscedasticity in the Single Index Model", Journal of Economics and Business 32, 243–248.
[5] Greene, W. H. (2002). Econometric Analysis, 5th ed. Prentice Hall.
[6] Greene, W. H. (2012). Econometric Analysis, 7th ed. Prentice Hall.
[7] Newey, W. K. and West, K. D. (1987). "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix", Econometrica 55, 703–708.
[8] Pedace, R. (2013). Econometrics for Dummies. Hoboken: John Wiley & Sons.
[9] White, H. (1980). "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity", Econometrica 48, 817–838.
[10] Zeileis, A. (2004). "Econometric Computing with HC and HAC Covariance Matrix Estimators", Journal of Statistical Software 11(10).
[11] Zeng, Y. (2016). "Book Summary: Econometrics for Dummies", version 1.0.5. Unpublished manuscript.