Introduction to Econometrics
It may be pointed out that econometric methods can also be used in other areas such as the engineering sciences, biological sciences, medical sciences, geosciences, agricultural sciences etc. In simple words, whenever there is a need to express a stochastic relationship in mathematical form, the econometric methods and tools help. The econometric tools are helpful in explaining the relationships among variables.
Econometric Models:
A model is a simplified representation of a real-world process. It should be representative in the sense that it
should contain the salient features of the phenomena under study. In general, one of the objectives in
modeling is to have a simple model to explain a complex phenomenon. Such an objective may sometimes
lead to an oversimplified model, and sometimes the assumptions made are unrealistic. In practice, generally, all the variables which the experimenter thinks are relevant to explain the phenomenon are included in the model. The rest of the variables are dumped in a basket called “disturbances”, where the disturbances are random
variables. This is the main difference between economic modeling and econometric modeling. This is also
the main difference between mathematical modeling and statistical modeling. The mathematical modeling is
exact in nature, whereas the statistical modeling contains a stochastic term also.
An economic model is a set of assumptions that describes the behaviour of an economy, or more generally, a
phenomenon.
Aims of econometrics:
The three main aims of econometrics are as follows:
3. Use of models:
The obtained models are used for forecasting and policy formulation, which is an essential part in any policy
decision. Such forecasts help the policymakers to judge the goodness of the fitted model and take necessary
measures in order to re-adjust the relevant economic variables.
Econometrics uses statistical methods after adapting them to the problems of economic life. These adapted statistical methods are usually termed econometric methods. Such methods are adjusted so that they become appropriate for the measurement of stochastic relationships. These adjustments basically attempt to specify the stochastic element which operates in the real-world data and enters into the determination of the observed data. This enables the data to be treated as a random sample, which is needed for the application of statistical tools.
The theoretical econometrics includes the development of appropriate methods for the measurement of
economic relationships which are not meant for controlled experiments conducted inside the laboratories.
The econometric methods are generally developed for the analysis of non-experimental data.
The applied econometrics includes the application of econometric methods to specific branches of
econometric theory and problems like demand, supply, production, investment, consumption etc. The applied
econometrics involves the application of the tools of econometric theory for the analysis of the economic
phenomenon and forecasting economic behaviour.
Types of data
Various types of data are used in the estimation of the model.
1. Time series data
Time series data give information about the numerical values of variables from period to period and are collected over time. For example, the data during the years 1990-2010 for monthly income constitute a time series.
2. Cross-section data
Cross-section data give information on the variables concerning individual agents (e.g., consumers or producers) at a given point of time. For example, a cross-section of a sample of consumers is a sample of family budgets showing expenditures on various commodities by each family, as well as information on family income, family composition and other demographic, social or financial characteristics.
Aggregation problem:
The aggregation problems arise when aggregative variables are used in functions. Such aggregative variables may involve:
1. Aggregation over individuals:
For example, the total income may comprise the sum of individual incomes.
4. Spatial aggregation:
Sometimes the aggregation is related to spatial issues, e.g., the population of towns, countries, or the production in a city or region etc.
Such sources of aggregation introduce “aggregation bias” in the estimates of the coefficients. It is important
to examine the possibility of such errors before estimating the model.
Consider the model
y = f(X1, X2,…, Xk, β1, β2,…, βk) + ε
where f is some well-defined function, β1, β2,…, βk are the parameters which characterize the role and contribution of X1, X2,…, Xk, respectively, and the term ε reflects the stochastic nature of the relationship between y and X1, X2,…, Xk and indicates that such a relationship is not exact in nature. When ε = 0, the relationship is called a mathematical model; otherwise it is called a statistical model. The term “model” is broadly used to represent any phenomenon in a mathematical framework.
A model or relationship is termed linear if it is linear in the parameters and non-linear if it is not linear in the parameters. In other words, if all the partial derivatives of y with respect to each of the parameters β1, β2,…, βk are independent of the parameters, then the model is called a linear model. If any of the partial derivatives of y with respect to any of β1, β2,…, βk is not independent of the parameters, the model is called non-linear. Note that the linearity or non-linearity of the model is not described by the linearity or non-linearity of the explanatory variables in the model.
For example,
y = β1 X1² + β2 X2 + β3 log X3
is a linear model because ∂y/∂βi (i = 1, 2, 3) are independent of the parameters βi (i = 1, 2, 3). On the other hand,
y = β1² X1 + β2 X2 + β3 log X3
is a non-linear model because ∂y/∂β1 = 2β1 X1 is not independent of β1, even though ∂y/∂β2 and ∂y/∂β3 are independent of any of the β1, β2 or β3.
When the function f is linear in the parameters, then y = f(X1, X2,…, Xk, β1, β2,…, βk) + ε is called a linear model, and when the function f is non-linear in the parameters, then it is called a non-linear model. In general, the function f is chosen as
f(X1, X2,…, Xk, β1, β2,…, βk) = β1 X1 + β2 X2 + … + βk Xk
to describe a linear model. Since X1, X2,…, Xk are pre-determined variables and y is the outcome, both are known. Thus the knowledge of the model depends on the knowledge of the parameters β1, β2,…, βk.
The statistical linear modeling essentially consists of developing approaches and tools to determine β1, β2,…, βk in the linear model
y = β1 X1 + β2 X2 + … + βk Xk + ε.
Different statistical estimation procedures, e.g., the method of maximum likelihood, the principle of least squares, the method of moments etc., can be employed to estimate the parameters of the model. The method of maximum likelihood needs further knowledge of the distribution of y, whereas the method of moments and the principle of least squares do not need any knowledge about the distribution of y.
The regression analysis is a tool to determine the values of the parameters given the data on y and
X 1, X 2 ,..., X k . The literal meaning of regression is “to move in the backward direction”. Before discussing
and understanding the meaning of “backward direction”, let us find which of the following statements is
correct:
S1 : model generates data or
S 2 : data generates the model.
Obviously, S1 is correct. It can be broadly thought that the model exists in nature but is unknown to the
experimenter. When some values to the explanatory variables are provided, then the values for the output or
study variable are generated accordingly, depending on the form of the function f and the nature of the
phenomenon. So ideally, the pre-existing model gives rise to the data. Our objective is to determine the unknown model from the observed data.
Consider a simple example to understand the meaning of “regression”. Suppose the yield of the crop ( y )
depends linearly on two explanatory variables, viz., the quantity of fertilizer ( X 1 ) and level of irrigation
( X 2 ) as
y = β1 X1 + β2 X2.
There exist true values of β1 and β2 in nature, but they are unknown to the experimenter. Some values of y are recorded by providing different values of X1 and X2. There exists some relationship between y and X1, X2 which gives rise to systematically behaved data on y, X1 and X2. Such a relationship is unknown to the experimenter. To determine the model, we move in the backward direction in the sense that the collected data is used to determine the unknown parameters β1 and β2 of the model. In this sense, such an approach is termed regression.
The theory and fundamentals of linear models lay the foundation for developing the tools for regression
analysis that are based on valid statistical theory and concepts.
Generally, the data is collected on n subjects. Then y denotes the response or study variable and y1, y2,…, yn are its n observed values. If there are k explanatory variables X1, X2,…, Xk, then xij denotes the ith value of the jth variable, i = 1, 2,…, n; j = 1, 2,…, k. The observations can be presented in the following table:

Notation for the data used in regression analysis
Observation number | Response y | Explanatory variables X1, X2, …, Xk
1                  | y1         | x11, x12, …, x1k
2                  | y2         | x21, x22, …, x2k
⋮                  | ⋮          | ⋮
n                  | yn         | xn1, xn2, …, xnk
4. Specification of the model:
The experimenter or the person working in the subject usually helps in determining the form of the model. Only the form of the tentative model can be ascertained, and it will depend on some unknown parameters. For example, a general form will be like
y = f(X1, X2,…, Xk; β1, β2,…, βk) + ε
where ε is the random error reflecting mainly the difference between the observed value of y and the value of y obtained through the model. The form of f(X1, X2,…, Xk; β1, β2,…, βk) can be linear as well as non-linear depending on the form of the parameters β1, β2,…, βk. A model is said to be linear if it is linear in the parameters.
For example,
y = β1 X1 + β2 X1² + β3 X2 + ε
y = β1 + β2 ln X2 + ε
are linear models, whereas
y = β1 X1 + β2² X2 + β3 X2 + ε
y = ln β1 X1 + β2 X2 + ε
are non-linear models. Many times, the non-linear models can be converted into linear models through some
transformations. So the class of linear models is wider than what it appears initially.
6. Fitting of the model:
The estimation of the unknown parameters using an appropriate method provides the values of the parameters. Substituting these values in the equation gives us a usable model. This is termed model fitting. The estimates of the parameters β1, β2,…, βk in the model
y = f(X1, X2,…, Xk, β1, β2,…, βk) + ε
are denoted by β̂1, β̂2,…, β̂k, which gives the fitted model as
y = f(X1, X2,…, Xk, β̂1, β̂2,…, β̂k).
When the value of y is obtained for the given values of X 1 , X 2 ,..., X k , it is denoted as ŷ and called as fitted
value.
The fitted equation is used for prediction. In this case, ŷ is termed as the predicted value. Note that the
fitted value is where the values used for explanatory variables correspond to one of the n observations in the
data, whereas predicted value is the one obtained for any set of values of explanatory variables. It is not
generally recommended to predict the y -values for the set of those values of explanatory variables which lie
outside the range of data. When the values of explanatory variables are the future values of explanatory
variables, the predicted values are called forecasted values.
The validation of the assumptions must be made before drawing any statistical conclusion. Any departure
from the validity of assumptions will be reflected in the statistical inferences. In fact, the regression analysis
is an iterative process where the outputs are used to diagnose, validate, criticize and modify the inputs. The
iterative process is illustrated in the following figure.
Figure: Inputs → Outputs (the iterative process of regression analysis)
Consider the simple linear regression model
y = β0 + β1 X + ε
where y is termed as the dependent or study variable and X is termed as the independent or explanatory variable. The terms β0 and β1 are the parameters of the model. The parameter β0 is termed as the intercept term, and the parameter β1 is termed as the slope parameter. These parameters are usually called regression coefficients. The unobservable error component ε accounts for the failure of the data to lie on a straight line and represents the difference between the true and observed realizations of y. There can be several reasons for such a difference, e.g., the effect of all deleted variables in the model, variables may be qualitative, inherent randomness in the observations etc. We assume that ε is an independent and identically distributed random variable with mean zero and constant variance σ². Later, we will additionally assume that ε is normally distributed.
The independent variable is viewed as controlled by the experimenter, so it is considered as non-stochastic, whereas y is viewed as a random variable with
E(y) = β0 + β1 X
and
Var(y) = σ².
Sometimes X can also be a random variable. In such a case, instead of the sample mean and sample variance of y, we consider the conditional mean of y given X = x,
E(y | x) = β0 + β1 x,
and the conditional variance
Var(y | x) = σ².
When the values of β0, β1 and σ² are known, the model is completely described. The parameters β0, β1 and σ² are generally unknown in practice and ε is unobserved. The determination of the statistical model y = β0 + β1 X + ε depends on the determination (i.e., estimation) of β0, β1 and σ². In order to know the values of these parameters, n pairs of observations (xi, yi) (i = 1,…, n) on (X, y) are observed/collected.
Various methods of estimation can be used to determine the estimates of the parameters. Among them, the
methods of least squares and maximum likelihood are the popular methods of estimation.
The observations are assumed to satisfy the simple linear regression model, and so we can write
yi = β0 + β1 xi + εi,  i = 1, 2,…, n.
The principle of least squares estimates the parameters β0 and β1 by minimizing the sum of squares of the differences between the observations and the line in the scatter diagram. Such an idea can be viewed from different perspectives. When the vertical differences between the observations and the line in the scatter diagram are considered, and their sum of squares is minimized to obtain the estimates of β0 and β1, the method is known as direct regression.

Figure: Direct regression method
Alternatively, the sum of squares of the differences between the observations and the line in the horizontal direction in the scatter diagram can be minimized to obtain the estimates of β0 and β1. This is known as the reverse (or inverse) regression method.

Figure: Reverse regression method
Instead of horizontal or vertical errors, if the sum of squares of the perpendicular distances between the observations and the line in the scatter diagram is minimized to obtain the estimates of β0 and β1, the method is known as orthogonal regression or major axis regression.

Figure: Major axis regression method
Instead of minimizing the distance, the area can also be minimized. The reduced major axis regression
method minimizes the sum of the areas of rectangles defined between the observed data points and the
nearest point on the line in the scatter diagram to obtain the estimates of regression coefficients. This is
shown in the following figure:
Figure: Reduced major axis regression method
The method of least absolute deviation regression considers the sum of the absolute deviations of the observations from the line in the vertical direction in the scatter diagram, as in the case of direct regression, to obtain the estimates of β0 and β1.
No assumption is required about the form of the probability distribution of εi in deriving the least squares estimates. For the purpose of deriving the statistical inferences only, we assume that the εi's are random variables with E(εi) = 0, Var(εi) = σ² and Cov(εi, εj) = 0 for all i ≠ j (i, j = 1, 2,…, n). This assumption is needed to find the mean, variance and other properties of the least-squares estimates. The assumption that the εi's are normally distributed is utilized while constructing the tests of hypotheses and confidence intervals for the parameters.
Based on these approaches, different estimates of β0 and β1 are obtained which have different statistical properties. Among them, the direct regression approach is the most popular. Generally, the direct regression estimates are referred to as the least-squares estimates or ordinary least squares estimates.
Direct regression method
This method is also known as ordinary least squares estimation. Assume that a set of n paired observations (xi, yi), i = 1, 2,…, n, is available which satisfies the linear regression model y = β0 + β1 X + ε, so that yi = β0 + β1 xi + εi. The direct regression approach minimizes the sum of squares
S(β0, β1) = Σ_{i=1}^{n} εi² = Σ_{i=1}^{n} (yi − β0 − β1 xi)²
with respect to β0 and β1. The partial derivatives are
∂S(β0, β1)/∂β0 = −2 Σ_{i=1}^{n} (yi − β0 − β1 xi)
∂S(β0, β1)/∂β1 = −2 Σ_{i=1}^{n} (yi − β0 − β1 xi) xi.
The estimates of β0 and β1 are obtained by setting
∂S(β0, β1)/∂β0 = 0
∂S(β0, β1)/∂β1 = 0.
The solutions of these two equations are called the direct regression estimators, or usually the ordinary least squares (OLS) estimators, of β0 and β1. They are given by
b0 = ȳ − b1 x̄
b1 = sxy / sxx
where
sxy = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ),  sxx = Σ_{i=1}^{n} (xi − x̄)²,  x̄ = (1/n) Σ_{i=1}^{n} xi,  ȳ = (1/n) Σ_{i=1}^{n} yi.
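As a quick numerical illustration of these formulas, the following Python sketch computes b0 and b1 from a small set of hypothetical observations (the data values and variable names are made up for illustration; numpy is assumed to be available).

```python
import numpy as np

# Hypothetical data: explanatory variable x and study variable y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.1, 5.9, 7.2])

n = len(x)
xbar, ybar = x.mean(), y.mean()

sxy = np.sum((x - xbar) * (y - ybar))   # s_xy = sum (xi - xbar)(yi - ybar)
sxx = np.sum((x - xbar) ** 2)           # s_xx = sum (xi - xbar)^2

b1 = sxy / sxx          # OLS slope estimate
b0 = ybar - b1 * xbar   # OLS intercept estimate
print(b0, b1)
```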
Further, the second-order partial derivatives are
∂²S(β0, β1)/∂β0² = −2 Σ_{i=1}^{n} (−1) = 2n,
∂²S(β0, β1)/∂β1² = 2 Σ_{i=1}^{n} xi²,
∂²S(β0, β1)/∂β0∂β1 = 2 Σ_{i=1}^{n} xi = 2n x̄.
The Hessian matrix, which is the matrix of second-order partial derivatives, in this case is given as
H* = [ ∂²S(β0, β1)/∂β0²     ∂²S(β0, β1)/∂β0∂β1 ]
     [ ∂²S(β0, β1)/∂β0∂β1   ∂²S(β0, β1)/∂β1²   ]
   = 2 [ n     n x̄              ]
       [ n x̄   Σ_{i=1}^{n} xi²  ]
   = 2 [ ℓ' ] (ℓ, x)
       [ x' ]
where ℓ = (1, 1,…, 1)' is an n-vector with elements unity and x = (x1,…, xn)' is the n-vector of observations on X. The matrix H* is positive definite if its determinant and the element in the first row and first column of H* are positive. The determinant of H* is given by
|H*| = 4 [ n Σ_{i=1}^{n} xi² − n² x̄² ] = 4 n Σ_{i=1}^{n} (xi − x̄)² ≥ 0.
The case when Σ_{i=1}^{n} (xi − x̄)² = 0 is not interesting because then all the observations are identical, i.e. xi = c (some constant). In such a case, there is no relationship between x and y in the context of regression analysis. Since Σ_{i=1}^{n} (xi − x̄)² > 0, therefore |H*| > 0. So H* is positive definite for any (β0, β1); therefore, S(β0, β1) attains its minimum at (b0, b1).
The difference between the observed value yi and the fitted (or predicted) value yˆ i is called a residual. The
i th residual is defined as
ei = yi − ŷi = yi − (b0 + b1 xi),  i = 1, 2,…, n.
Unbiased property:
Note that b1 = sxy/sxx and b0 = ȳ − b1 x̄ are linear combinations of yi (i = 1,…, n). Therefore
b1 = Σ_{i=1}^{n} ki yi
where ki = (xi − x̄)/sxx. Note that Σ_{i=1}^{n} ki = 0 and Σ_{i=1}^{n} ki xi = 1, so
E(b1) = Σ_{i=1}^{n} ki E(yi) = Σ_{i=1}^{n} ki (β0 + β1 xi) = β1.
Similarly,
E(b0) = E(ȳ − b1 x̄) = E(β0 + β1 x̄ + ε̄ − b1 x̄) = β0 + β1 x̄ − β1 x̄ = β0.
Thus b0 and b1 are unbiased estimators of β0 and β1.
Variance:
The variance of b1 is
Var(b1) = Σ_{i=1}^{n} ki² Var(yi)   (Cov(yi, yj) = 0 as y1,…, yn are independent)
        = σ² Σ_{i=1}^{n} (xi − x̄)² / sxx²
        = σ² sxx / sxx²
        = σ² / sxx.
The variance of b0 is
Var(b0) = σ² (1/n + x̄²/sxx).
Covariance:
The covariance between b0 and b1 is
Cov(b0, b1) = −(x̄/sxx) σ².
It can further be shown that the ordinary least squares estimators b0 and b1 possess the minimum variance
in the class of linear and unbiased estimators. So they are termed as the Best Linear Unbiased Estimators
(BLUE). Such a property is known as the Gauss-Markov theorem, which is discussed later in multiple
linear regression model.
Residual sum of squares:
The residual sum of squares is given as
SSres = Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi − ŷi)²
      = Σ_{i=1}^{n} (yi − b0 − b1 xi)²
      = Σ_{i=1}^{n} (yi − ȳ + b1 x̄ − b1 xi)²
      = Σ_{i=1}^{n} [(yi − ȳ) − b1 (xi − x̄)]²
      = Σ_{i=1}^{n} (yi − ȳ)² + b1² Σ_{i=1}^{n} (xi − x̄)² − 2 b1 Σ_{i=1}^{n} (xi − x̄)(yi − ȳ)
      = syy + b1² sxx − 2 b1² sxx
      = syy − b1² sxx
      = syy − (sxy/sxx)² sxx
      = syy − sxy²/sxx
      = syy − b1 sxy
where syy = Σ_{i=1}^{n} (yi − ȳ)² and ȳ = (1/n) Σ_{i=1}^{n} yi.
Estimation of σ²
The estimator of σ² is obtained from the residual sum of squares as follows. Assuming that yi is normally distributed, it follows that
SSres/σ² ~ χ²(n − 2).
Thus, using the result about the expectation of a chi-square random variable, we have
E(SSres) = (n − 2) σ².
Thus an unbiased estimator of σ² is
s² = SSres/(n − 2).
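Continuing the earlier Python sketch (same hypothetical data and the estimates b0 and b1 computed there), SSres and s² can be obtained as follows.

```python
# Residuals and unbiased estimate of sigma^2 (continuing the earlier sketch)
y_hat = b0 + b1 * x
e = y - y_hat                   # residuals e_i = y_i - yhat_i
SS_res = np.sum(e ** 2)         # residual sum of squares
s2 = SS_res / (n - 2)           # unbiased estimator of sigma^2
print(SS_res, s2)
```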
Note that SSres has only (n − 2) degrees of freedom. The two degrees of freedom are lost due to the estimation of b0 and b1. Replacing σ² by s², the estimated variances of b0 and b1 are
s² (1/n + x̄²/sxx)
and
s²/sxx,
respectively.
It is observed that since Σ_{i=1}^{n} (yi − ŷi) = 0, we have Σ_{i=1}^{n} ei = 0. In the light of this property, ei can be regarded as an estimate of the unknown εi (i = 1,…, n). This helps in verifying the different model assumptions on the basis of the observed residuals.
Centered Model:
Sometimes it is useful to measure the independent variable around its mean. In such a case, the model yi = β0 + β1 Xi + εi has a centered version as follows:
yi = β0 + β1 (xi − x̄) + β1 x̄ + εi,  i = 1, 2,…, n
   = β0* + β1 (xi − x̄) + εi
where β0* = β0 + β1 x̄. The sum of squares due to error is given by
S(β0*, β1) = Σ_{i=1}^{n} εi² = Σ_{i=1}^{n} [yi − β0* − β1 (xi − x̄)]².
Now solving
∂S(β0*, β1)/∂β0* = 0
∂S(β0*, β1)/∂β1 = 0,
we get the direct regression least squares estimates of β0* and β1 as
b0* = ȳ
and
b1 = sxy/sxx,
respectively.
Thus the form of the estimate of slope parameter 1 remains the same in the usual and centered model
whereas the form of the estimate of intercept term changes in the usual and centered models.
Further, the Hessian matrix of the second-order partial derivatives of S(β0*, β1) with respect to β0* and β1 is positive definite at β0* = b0* and β1 = b1, which ensures that S(β0*, β1) is minimized at β0* = b0* and β1 = b1.
Under the assumptions that E(εi) = 0, Var(εi) = σ² and Cov(εi, εj) = 0 for all i ≠ j = 1, 2,…, n, it follows that
E(b0*) = β0*,  E(b1) = β1,
Var(b0*) = σ²/n,  Var(b1) = σ²/sxx.
Thus the fitted line in the centered model is ŷ = ȳ + b1 (x − x̄).
No-intercept model (regression through the origin):
Sometimes it is known from theoretical considerations that the intercept term is zero, and the model y = β1 X + ε is appropriate. For example, in analyzing the relationship between the velocity (y) of a car and its acceleration (X), the velocity is zero when the acceleration is zero.
Using the data (xi, yi), i = 1, 2,…, n, the direct regression least-squares estimate of β1 in this model is obtained by minimizing
S(β1) = Σ_{i=1}^{n} εi² = Σ_{i=1}^{n} (yi − β1 xi)²
and solving
∂S(β1)/∂β1 = 0,
which gives
b1* = Σ_{i=1}^{n} xi yi / Σ_{i=1}^{n} xi².
The second-order partial derivative of S(β1) with respect to β1 at β1 = b1* is positive, which ensures that b1* minimizes S(β1).
Using the assumptions that E(εi) = 0, Var(εi) = σ² and Cov(εi, εj) = 0 for all i ≠ j = 1, 2,…, n, the properties of b1* can be derived as follows:
E(b1*) = Σ_{i=1}^{n} xi E(yi) / Σ_{i=1}^{n} xi² = β1 Σ_{i=1}^{n} xi² / Σ_{i=1}^{n} xi² = β1,
so b1* is an unbiased estimator of β1, and
Var(b1*) = Σ_{i=1}^{n} xi² Var(yi) / ( Σ_{i=1}^{n} xi² )² = σ² Σ_{i=1}^{n} xi² / ( Σ_{i=1}^{n} xi² )² = σ² / Σ_{i=1}^{n} xi².
The estimate of σ² in this model is
σ̂² = ( Σ_{i=1}^{n} yi² − b1* Σ_{i=1}^{n} yi xi ) / (n − 1).
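A minimal sketch of the no-intercept (regression through the origin) estimate, assuming the same numpy arrays x and y used in the earlier snippets:

```python
# Regression through the origin: b1* = sum(x_i * y_i) / sum(x_i^2)
b1_star = np.sum(x * y) / np.sum(x ** 2)

# Estimate of sigma^2 in the no-intercept model ((n - 1) degrees of freedom)
sigma2_hat_origin = (np.sum(y ** 2) - b1_star * np.sum(x * y)) / (n - 1)
print(b1_star, sigma2_hat_origin)
```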
Assume that the errors εi (i = 1, 2,…, n) are independent and identically distributed following a normal distribution N(0, σ²). Now we use the method of maximum likelihood to estimate the parameters of the linear regression model
yi = β0 + β1 xi + εi,  i = 1, 2,…, n.
Under this assumption, the observations yi (i = 1, 2,…, n) are independently distributed as N(β0 + β1 xi, σ²) for all i = 1, 2,…, n. The likelihood function of the given observations (xi, yi) and the unknown parameters β0, β1 and σ² is
L(xi, yi; β0, β1, σ²) = Π_{i=1}^{n} (1/(2πσ²))^{1/2} exp[ −(1/(2σ²)) (yi − β0 − β1 xi)² ].
The maximum likelihood estimates of β0, β1 and σ² can be obtained by maximizing L(xi, yi; β0, β1, σ²) or equivalently ln L(xi, yi; β0, β1, σ²), where
ln L(xi, yi; β0, β1, σ²) = −(n/2) ln 2π − (n/2) ln σ² − (1/(2σ²)) Σ_{i=1}^{n} (yi − β0 − β1 xi)².
The normal equations are obtained by setting the first-order partial derivatives equal to zero:
∂ln L(xi, yi; β0, β1, σ²)/∂β0 = (1/σ²) Σ_{i=1}^{n} (yi − β0 − β1 xi) = 0
∂ln L(xi, yi; β0, β1, σ²)/∂β1 = (1/σ²) Σ_{i=1}^{n} (yi − β0 − β1 xi) xi = 0
and
∂ln L(xi, yi; β0, β1, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^{n} (yi − β0 − β1 xi)² = 0.
The solution of these normal equations gives the maximum likelihood estimates of β0, β1 and σ² as
b0 = ȳ − b1 x̄,
b1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)² = sxy/sxx
and
σ̃² = (1/n) Σ_{i=1}^{n} (yi − b0 − b1 xi)²,
respectively.
It can be verified that the Hessian matrix of the second-order partial derivatives of ln L with respect to β0, β1 and σ² is negative definite at β0 = b0, β1 = b1 and σ² = σ̃², which ensures that the likelihood function is maximized at these values.
Note that the least-squares and maximum likelihood estimates of β0 and β1 are identical. The least-squares and maximum likelihood estimates of σ² are different. In fact, the (unbiased) least-squares estimate of σ² is
s² = (1/(n − 2)) Σ_{i=1}^{n} (yi − ŷi)².
First, we develop a test for the null hypothesis related to the slope parameter,
H0: β1 = β10,
assuming σ² to be known. We know that E(b1) = β1, Var(b1) = σ²/sxx and b1 is a linear combination of normally distributed random variables, so
b1 ~ N(β1, σ²/sxx)
and the following statistic can be constructed:
Z1 = (b1 − β10) / √(σ²/sxx).
The decision rule is: reject H0 if |Z1| > z_{α/2}. Similarly, the decision rule for a one-sided alternative hypothesis can also be framed.
The 100(1 − α)% confidence interval for β1 can be obtained from
P[ −z_{α/2} ≤ Z1 ≤ z_{α/2} ] = 1 − α
P[ −z_{α/2} ≤ (b1 − β1)/√(σ²/sxx) ≤ z_{α/2} ] = 1 − α
P[ b1 − z_{α/2} √(σ²/sxx) ≤ β1 ≤ b1 + z_{α/2} √(σ²/sxx) ] = 1 − α.
So the 100(1 − α)% confidence interval for β1 is
[ b1 − z_{α/2} √(σ²/sxx),  b1 + z_{α/2} √(σ²/sxx) ]
where z_{α/2} is the α/2 percentage point of the N(0, 1) distribution.
When σ² is unknown, it is replaced by its estimate σ̂² = s² = SSres/(n − 2). It can be shown that SSres/σ² ~ χ²(n − 2) and that SSres is distributed independently of b1; this is proved in the module on multiple linear regression. This result also follows from the result that, under the normal distribution, the maximum likelihood estimates, viz., the sample mean (estimator of the population mean) and the sample variance (estimator of the population variance), are independently distributed, so b1 and s² are also independently distributed.
Thus the following statistic can be constructed:
t0 = (b1 − β1) / √(σ̂²/sxx)
   = (b1 − β1) / √( SSres / ((n − 2) sxx) ),
which follows a t-distribution with (n − 2) degrees of freedom, denoted as t_{n−2}, when H0 is true.
The decision rule is to reject H0 if |t0| > t_{n−2, α/2},
where t_{n−2, α/2} is the α/2 percentage point of the t-distribution with (n − 2) degrees of freedom. Similarly, the decision rule for a one-sided alternative hypothesis can also be framed.
The 100(1 − α)% confidence interval for β1 can be obtained using the t0 statistic as follows. Consider
P[ −t_{α/2} ≤ t0 ≤ t_{α/2} ] = 1 − α
P[ −t_{α/2} ≤ (b1 − β1)/√(σ̂²/sxx) ≤ t_{α/2} ] = 1 − α
P[ b1 − t_{α/2} √(σ̂²/sxx) ≤ β1 ≤ b1 + t_{α/2} √(σ̂²/sxx) ] = 1 − α.
So the 100(1 − α)% confidence interval for β1 is
[ b1 − t_{n−2, α/2} √( SSres / ((n − 2) sxx) ),  b1 + t_{n−2, α/2} √( SSres / ((n − 2) sxx) ) ].
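The t-statistic and the corresponding confidence interval for β1 can be computed with scipy.stats; the following is a sketch that reuses the quantities (b1, s2, sxx, n) from the earlier snippets and tests the hypothetical null H0: β1 = 0.

```python
from scipy import stats

alpha = 0.05
se_b1 = np.sqrt(s2 / sxx)                 # estimated standard error of b1
t0 = b1 / se_b1                           # test statistic for H0: beta1 = 0
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)

reject = abs(t0) > t_crit                 # two-sided decision rule
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # 100(1-alpha)% CI for beta1
print(t0, reject, ci_b1)
```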
Next, consider the test of H0: β0 = β00 when σ² is known. Using E(b0) = β0, Var(b0) = σ²(1/n + x̄²/sxx) and the fact that b0 is a linear combination of normally distributed random variables, the following statistic can be constructed:
Z0 = (b0 − β00) / √( σ² (1/n + x̄²/sxx) ).
The decision rule is: reject H0 if |Z0| > z_{α/2}, where z_{α/2} is the α/2 percentage point of the N(0, 1) distribution. Similarly, the decision rule for a one-sided alternative hypothesis can also be framed. The corresponding 100(1 − α)% confidence interval for β0 is
[ b0 − z_{α/2} √( σ² (1/n + x̄²/sxx) ),  b0 + z_{α/2} √( σ² (1/n + x̄²/sxx) ) ].
When σ² is unknown, it is replaced by σ̂² = SSres/(n − 2), and the statistic
t0 = (b0 − β00) / √( (SSres/(n − 2)) (1/n + x̄²/sxx) )
follows a t-distribution with (n − 2) degrees of freedom when H0 is true. The decision rule is to reject H0 if |t0| > t_{n−2, α/2},
where t_{n−2, α/2} is the α/2 percentage point of the t-distribution with (n − 2) degrees of freedom. Similarly, the decision rule for a one-sided alternative hypothesis can also be framed.
The 100(1 − α)% confidence interval for β0 is obtained as follows. Consider
P[ −t_{n−2, α/2} ≤ t0 ≤ t_{n−2, α/2} ] = 1 − α
P[ −t_{n−2, α/2} ≤ (b0 − β0)/√( (SSres/(n−2))(1/n + x̄²/sxx) ) ≤ t_{n−2, α/2} ] = 1 − α
P[ b0 − t_{n−2, α/2} √( (SSres/(n−2))(1/n + x̄²/sxx) ) ≤ β0 ≤ b0 + t_{n−2, α/2} √( (SSres/(n−2))(1/n + x̄²/sxx) ) ] = 1 − α.
So the 100(1 − α)% confidence interval for β0 is
[ b0 − t_{n−2, α/2} √( (SSres/(n−2))(1/n + x̄²/sxx) ),  b0 + t_{n−2, α/2} √( (SSres/(n−2))(1/n + x̄²/sxx) ) ].
A confidence interval for σ² can also be derived. Since SSres/σ² ~ χ²(n − 2), we have
P[ χ²_{n−2, α/2} ≤ SSres/σ² ≤ χ²_{n−2, 1−α/2} ] = 1 − α
P[ SSres/χ²_{n−2, 1−α/2} ≤ σ² ≤ SSres/χ²_{n−2, α/2} ] = 1 − α.
The corresponding 100(1 − α)% confidence interval for σ² is
[ SSres/χ²_{n−2, 1−α/2},  SSres/χ²_{n−2, α/2} ].
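A sketch of the chi-square based confidence interval for σ², reusing SS_res, n and the scipy import from the earlier snippets:

```python
alpha = 0.05
chi2_lo = stats.chi2.ppf(alpha / 2, n - 2)       # lower chi-square percentage point
chi2_hi = stats.chi2.ppf(1 - alpha / 2, n - 2)   # upper chi-square percentage point

# 100(1 - alpha)% confidence interval for sigma^2
ci_sigma2 = (SS_res / chi2_hi, SS_res / chi2_lo)
print(ci_sigma2)
```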
A joint confidence region for β0 and β1 can also be obtained, i.e., a region that provides 100(1 − α)% confidence that both the estimates of β0 and β1 are correct. Consider the centered version of the linear regression model,
yi = β0* + β1 (xi − x̄) + εi,
where β0* = β0 + β1 x̄. The OLS estimators of β0* and β1 are
b0* = ȳ and b1 = sxy/sxx,
respectively.
Using the results that
E(b0*) = β0*, E(b1) = β1, Var(b0*) = σ²/n, Var(b1) = σ²/sxx,
and that b0* and b1 are normally distributed, it follows that
n(b0* − β0*)²/σ² ~ χ²(1) and sxx(b1 − β1)²/σ² ~ χ²(1)
are also independently distributed because b0* and b1 are independently distributed. Consequently, the sum of these two,
n(b0* − β0*)²/σ² + sxx(b1 − β1)²/σ² ~ χ²(2).
Since
SSres/σ² ~ χ²(n − 2)
and SSres is distributed independently of b0* and b1, the ratio
[ (n(b0* − β0*)² + sxx(b1 − β1)²)/2 ] / [ SSres/(n − 2) ] = ((n − 2)/2) (Qf/SSres) ~ F(2, n − 2)
where
Qf = n(b0 − β0)² + 2 n x̄ (b0 − β0)(b1 − β1) + Σ_{i=1}^{n} xi² (b1 − β1)².
Since
P[ ((n − 2)/2) (Qf/SSres) ≤ F_{2, n−2; 1−α} ] = 1 − α
holds true for all values of β0 and β1, the 100(1 − α)% confidence region for β0 and β1 is
((n − 2)/2) (Qf/SSres) ≤ F_{2, n−2; 1−α}.
This confidence region is an ellipse which gives the 100(1 − α)% probability that β0 and β1 are contained within this region.
A test statistic for testing H0: β1 = 0 can also be formulated using the analysis of variance technique as follows. Note that
Σ_{i=1}^{n} (yi − ȳ)(ŷi − ȳ) = Σ_{i=1}^{n} (yi − ȳ) b1 (xi − x̄) = b1² Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} (ŷi − ȳ)².
Thus we have
Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (yi − ŷi)² + Σ_{i=1}^{n} (ŷi − ȳ)².
The term Σ_{i=1}^{n} (yi − ȳ)² is called the sum of squares about the mean, or the corrected sum of squares of y (i.e., SScorrected = syy). The term Σ_{i=1}^{n} (yi − ŷi)² is the residual sum of squares, i.e., SSres = Σ_{i=1}^{n} (yi − ŷi)², whereas the term Σ_{i=1}^{n} (ŷi − ȳ)² describes the proportion of variability explained by the regression,
SSreg = Σ_{i=1}^{n} (ŷi − ȳ)².
If all observations yi are located on a straight line, then Σ_{i=1}^{n} (yi − ŷi)² = 0 and thus SScorrected = SSreg.
Note that SSreg is completely determined by b1 and so has only one degree of freedom. The total sum of squares syy = Σ_{i=1}^{n} (yi − ȳ)² has (n − 1) degrees of freedom due to the constraint Σ_{i=1}^{n} (yi − ȳ) = 0, and SSres has (n − 2) degrees of freedom since b0 and b1 are estimated from the data. All sums of squares are mutually independent and distributed as σ²χ²_{df} with df degrees of freedom if the errors are normally distributed. The mean square due to regression is MSreg = SSreg/1, the mean square due to error is MSE = SSres/(n − 2), and the test statistic is
F0 = MSreg/MSE.
If H0: β1 = 0 is true, then MSreg and MSE are independently distributed and thus
F0 ~ F(1, n − 2).
The decision rule is to reject H0 if
F0 > F_{1, n−2; 1−α}
at the α level of significance. The test procedure can be described in an analysis of variance table:

Source of variation | Sum of squares | Degrees of freedom | Mean square | F
Regression          | SSreg          | 1                  | MSreg       | MSreg/MSE
Residual            | SSres          | n − 2              | MSE         |
Total               | syy            | n − 1              |             |

Moreover, we have
b1 = sxy/sxx = rxy √(syy/sxx)
and
SSreg = syy − SSres = sxy²/sxx = b1² sxx = b1 sxy.
The residuals reflect the failure of the fitted model to explain the data, so a measure of the quality of the fitted model can be based on SSres. When the intercept term is present in the model, the goodness of fit is measured by
R² = SSreg/syy = 1 − SSres/syy.
This is known as the coefficient of determination. This measure is based on the idea of how much of the variation in y, stated by syy, is explained by SSreg and how much of the unexplained part is contained in SSres. The ratio SSreg/syy describes the proportion of variability that is explained by the regression in relation to the total variability of y. The ratio SSres/syy describes the proportion of variability that is not explained by the regression.
It can be seen that
R² = rxy²
where rxy is the simple correlation coefficient between x and y. Clearly 0 ≤ R² ≤ 1, so a value of R² closer to one indicates a better fit and a value of R² closer to zero indicates a poor fit.
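The coefficient of determination can be computed directly from the sums of squares of the earlier sketch (same hypothetical data and variables):

```python
syy = np.sum((y - ybar) ** 2)   # total (corrected) sum of squares
SS_reg = syy - SS_res           # sum of squares due to regression
R2 = SS_reg / syy               # coefficient of determination
r_xy = sxy / np.sqrt(sxx * syy) # sample correlation coefficient
print(R2, r_xy ** 2)            # R^2 equals r_xy^2 in simple linear regression
```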
Suppose we want to predict the value of E(y) for a given value of x = x0. Then the predictor is given by
Ê(y | x0) = μ̂_{y|x0} = b0 + b1 x0.
Predictive bias:
The prediction error is given as
μ̂_{y|x0} − E(y) = b0 + b1 x0 − E(β0 + β1 x0 + ε) = b0 + b1 x0 − (β0 + β1 x0) = (b0 − β0) + (b1 − β1) x0.
Then
E[ μ̂_{y|x0} − E(y) ] = E(b0 − β0) + E(b1 − β1) x0 = 0 + 0 = 0.
Thus the predictor μ̂_{y|x0} is an unbiased predictor of E(y).
Predictive variance:
The predictive variance of μ̂_{y|x0} is
PV(μ̂_{y|x0}) = Var(b0 + b1 x0)
            = Var( ȳ + b1 (x0 − x̄) )
            = Var(ȳ) + (x0 − x̄)² Var(b1) + 2 (x0 − x̄) Cov(ȳ, b1)
            = σ²/n + σ² (x0 − x̄)²/sxx + 0
            = σ² [ 1/n + (x0 − x̄)²/sxx ].
An estimate of the predictive variance is obtained by replacing σ² by σ̂² = MSE:
P̂V(μ̂_{y|x0}) = σ̂² [ 1/n + (x0 − x̄)²/sxx ] = MSE [ 1/n + (x0 − x̄)²/sxx ].
The predictor μ̂_{y|x0} is a linear combination of normally distributed random variables, so it is also normally distributed as
μ̂_{y|x0} ~ N( β0 + β1 x0, PV(μ̂_{y|x0}) ).
So if σ² is known, then the distribution of
( μ̂_{y|x0} − E(y | x0) ) / √( PV(μ̂_{y|x0}) )
is N(0, 1). Consequently,
P[ −z_{α/2} ≤ ( μ̂_{y|x0} − E(y | x0) ) / √( PV(μ̂_{y|x0}) ) ≤ z_{α/2} ] = 1 − α,
which gives the 100(1 − α)% confidence interval for E(y | x0) as
[ μ̂_{y|x0} − z_{α/2} √( σ² (1/n + (x0 − x̄)²/sxx) ),  μ̂_{y|x0} + z_{α/2} √( σ² (1/n + (x0 − x̄)²/sxx) ) ].
When σ² is unknown, it is replaced by σ̂² = MSE, and in this case the sampling distribution of
( μ̂_{y|x0} − E(y | x0) ) / √( MSE (1/n + (x0 − x̄)²/sxx) )
is the t-distribution with (n − 2) degrees of freedom, so the 100(1 − α)% confidence interval for E(y | x0) becomes
[ μ̂_{y|x0} − t_{n−2, α/2} √( MSE (1/n + (x0 − x̄)²/sxx) ),  μ̂_{y|x0} + t_{n−2, α/2} √( MSE (1/n + (x0 − x̄)²/sxx) ) ].
Note that the width of this interval for E(y | x0) is a function of x0. The interval width is minimum for x0 = x̄ and widens as |x0 − x̄| increases. This is also expected, as the best estimates of y are made at x-values lying near the centre of the data, and the precision of estimation deteriorates as we move towards the boundary of the x-space.
Suppose instead that we want to predict the actual value of y for a given value of x = x0, i.e., a single future observation. The predictor is
ŷ0 = b0 + b1 x0.
The true value of y in the prediction period is given by y0 = β0 + β1 x0 + ε0, where ε0 indicates the value that would be drawn from the distribution of the random error in the prediction period. Note that the form of the predictor is the same as that of the average-value predictor, but its predictive error and other properties are different. This is the dual nature of the predictor.
Predictive bias:
The predictive error of ŷ0 is given by
ŷ0 − y0 = b0 + b1 x0 − (β0 + β1 x0 + ε0) = (b0 − β0) + (b1 − β1) x0 − ε0.
Thus, we find that
E(ŷ0 − y0) = E(b0 − β0) + E(b1 − β1) x0 − E(ε0) = 0 + 0 + 0 = 0,
so ŷ0 is an unbiased predictor of y0.
Predictive variance:
The predictive variance of ŷ0 is
PV(ŷ0) = E(ŷ0 − y0)²
       = E[ (b0 − β0) + (x0 − x̄)(b1 − β1) + x̄(b1 − β1) − ε0 ]²
       = Var(b0) + (x0 − x̄)² Var(b1) + x̄² Var(b1) + Var(ε0) + 2(x0 − x̄) Cov(b0, b1) + 2 x̄ Cov(b0, b1) + 2(x0 − x̄) x̄ Var(b1)
         [the remaining terms are zero assuming the independence of ε0 from ε1, ε2,…, εn]
       = Var(b0) + [ (x0 − x̄)² + x̄² + 2 x̄ (x0 − x̄) ] Var(b1) + Var(ε0) + 2 [ (x0 − x̄) + x̄ ] Cov(b0, b1)
       = Var(b0) + x0² Var(b1) + Var(ε0) + 2 x0 Cov(b0, b1)
       = σ² (1/n + x̄²/sxx) + x0² σ²/sxx + σ² − 2 x0 x̄ σ²/sxx
       = σ² [ 1 + 1/n + (x0 − x̄)²/sxx ].
Prediction interval:
If σ² is known, then the distribution of
(ŷ0 − y0) / √( PV(ŷ0) )
is N(0, 1). Therefore
P[ −z_{α/2} ≤ (ŷ0 − y0)/√( PV(ŷ0) ) ≤ z_{α/2} ] = 1 − α,
which gives the 100(1 − α)% prediction interval
[ ŷ0 − z_{α/2} √( σ² (1 + 1/n + (x0 − x̄)²/sxx) ),  ŷ0 + z_{α/2} √( σ² (1 + 1/n + (x0 − x̄)²/sxx) ) ].
When σ² is unknown, it is replaced by σ̂² = MSE, and
(ŷ0 − y0) / √( MSE (1 + 1/n + (x0 − x̄)²/sxx) )
follows a t-distribution with (n − 2) degrees of freedom. The 100(1 − α)% prediction interval for y0 in this case is obtained from
P[ −t_{α/2, n−2} ≤ (ŷ0 − y0)/√( P̂V(ŷ0) ) ≤ t_{α/2, n−2} ] = 1 − α,
which gives the prediction interval
[ ŷ0 − t_{α/2, n−2} √( MSE (1 + 1/n + (x0 − x̄)²/sxx) ),  ŷ0 + t_{α/2, n−2} √( MSE (1 + 1/n + (x0 − x̄)²/sxx) ) ].
This prediction interval for y0 is wider than the confidence interval for E(y | x0) because the prediction interval depends on both the error from the fitted model as well as the error associated with the future observation.
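The two intervals can be computed for a hypothetical new point x0 as follows, reusing b0, b1, s2 (= MSE), n, xbar, sxx and the scipy import from the earlier snippets:

```python
x0 = 4.5   # hypothetical new value of the explanatory variable
y0_hat = b0 + b1 * x0

# Standard errors for the mean response and for a new observation at x0
se_mean = np.sqrt(s2 * (1.0 / n + (x0 - xbar) ** 2 / sxx))
se_pred = np.sqrt(s2 * (1.0 + 1.0 / n + (x0 - xbar) ** 2 / sxx))

t_crit = stats.t.ppf(0.975, n - 2)
ci_mean = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)   # CI for E(y|x0)
pi_new = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)    # PI for a new y0
print(ci_mean, pi_new)
```

As expected, the prediction interval for the new observation is wider than the confidence interval for the mean response.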
Figure: Reverse regression method
The reverse regression has been advocated in the analysis of gender (or race) discrimination in salaries. For
example, if y denotes salary and x denotes qualifications, and we are interested in determining if there is
gender discrimination in salaries, we can ask:
“Whether men and women with the same qualifications (value of x) are getting the same salaries (value of y)?” This question is answered by the direct regression. Alternatively, we can ask whether men and women with the same salaries (value of y) have the same qualifications (value of x); this question is answered by the reverse regression, in which the roles of x and y are interchanged, i.e., x is regressed on y,
xi = β0* + β1* yi + δi,  i = 1, 2,…, n,
where the δi's are the associated random error components and satisfy the assumptions as in the case of the usual simple linear regression model. The reverse regression estimates β̂0R of β0* and β̂1R of β1* for this model are obtained by interchanging the x and y in the direct regression estimators of β0 and β1, i.e.,
β̂0R = x̄ − β̂1R ȳ and β̂1R = sxy/syy.
The residual sum of squares in this case is
SSres* = sxx − sxy²/syy.
Note that
β̂1R b1 = sxy²/(sxx syy) = rxy²
where b1 is the direct regression estimator of the slope parameter and rxy is the correlation coefficient between x and y. Hence if rxy² is close to 1, the two regression lines will be close to each other.
An important application of the reverse regression method is in solving the calibration problem.
Orthogonal regression method (major axis regression method):

Figure: Orthogonal or major axis regression method

In this method it is expected that the true observations (Xi, Yi), i = 1, 2,…, n, lie on the line Y = β0 + β1 X, while the observed points (xi, yi), i = 1, 2,…, n, deviate from the line. In such a case, the squared perpendicular distance of the ith observed point from the line is
di² = (Xi − xi)² + (Yi − yi)²
where (Xi, Yi) denotes the ith pair of observation without any error which lies on the line.
The objective is to minimize the sum of these squared perpendicular distances to obtain the estimates of β0 and β1. The observations (xi, yi) (i = 1, 2,…, n) are expected to lie on the line
Yi = β0 + β1 Xi,
so let
Ei = Yi − β0 − β1 Xi = 0.
The regression coefficients are obtained by minimizing Σ_{i=1}^{n} di² under the constraints Ei = 0 (i = 1, 2,…, n) using the Lagrangian multiplier method. The Lagrangian function is
L0 = Σ_{i=1}^{n} di² − 2 Σ_{i=1}^{n} λi Ei
where λ1,…, λn are the Lagrangian multipliers. The set of equations is obtained by setting
∂L0/∂Xi = 0, ∂L0/∂Yi = 0, ∂L0/∂β0 = 0 and ∂L0/∂β1 = 0  (i = 1, 2,…, n).
Thus we find
∂L0/∂Xi = (Xi − xi) + λi β1 = 0
∂L0/∂Yi = (Yi − yi) − λi = 0
∂L0/∂β0 = Σ_{i=1}^{n} λi = 0
∂L0/∂β1 = Σ_{i=1}^{n} λi Xi = 0.
Since
Xi = xi − λi β1 and Yi = yi + λi,
substituting these in Ei = 0 gives
Ei = (yi + λi) − β0 − β1 (xi − λi β1) = 0
⇒ λi = (β0 + β1 xi − yi)/(1 + β1²).
Using (Xi − xi) + λi β1 = 0 together with Σ_{i=1}^{n} λi Xi = 0, we get
Σ_{i=1}^{n} λi (xi − λi β1) = 0,
and substituting λi in this equation gives
Σ_{i=1}^{n} (β0 xi + β1 xi² − yi xi)/(1 + β1²) − β1 Σ_{i=1}^{n} (β0 + β1 xi − yi)²/(1 + β1²)² = 0.   (1)
Using λi in Σ_{i=1}^{n} λi = 0, we solve
Σ_{i=1}^{n} (β0 + β1 xi − yi)/(1 + β1²) = 0,
which gives
β̂0OR = ȳ − β̂1OR x̄
where β̂1OR is the orthogonal regression estimate of β1. Now, substituting β0 = ȳ − β1 x̄ and λi in equation (1), we get
(1 + β1²) Σ_{i=1}^{n} ( ȳ xi − β1 x̄ xi + β1 xi² − xi yi ) − β1 Σ_{i=1}^{n} ( ȳ − β1 x̄ + β1 xi − yi )² = 0
or
(1 + β1²) Σ_{i=1}^{n} (ui + x̄)(vi − β1 ui) + β1 Σ_{i=1}^{n} (−vi + β1 ui)² = 0
where
ui = xi − x̄, vi = yi − ȳ.
Expanding this expression and using Σ_{i=1}^{n} ui = Σ_{i=1}^{n} vi = 0, we obtain
β1² Σ_{i=1}^{n} ui vi + β1 Σ_{i=1}^{n} (ui² − vi²) − Σ_{i=1}^{n} ui vi = 0
or
β1² sxy + β1 (sxx − syy) − sxy = 0.
Solving this quadratic equation in β1 gives
β̂1OR = [ (syy − sxx) ± √( (syy − sxx)² + 4 sxy² ) ] / (2 sxy)
where sign(sxy) denotes the sign of sxy, which can be positive or negative, i.e.,
sign(sxy) = 1 if sxy > 0 and −1 if sxy < 0.
Notice that this gives two solutions for β̂1OR. We choose the solution which minimizes Σ_{i=1}^{n} di². The other solution maximizes Σ_{i=1}^{n} di² and is in the direction perpendicular to the optimal solution. The optimal solution is the root whose sign agrees with sign(sxy), i.e., the root obtained by taking the positive square root in the expression above.
Reduced major axis regression method:
Suppose the regression line is Yi = β0 + β1 Xi, on which all the observed points are expected to lie. Suppose the points (xi, yi), i = 1, 2,…, n, are observed which lie away from the line. The area of the rectangle extended between the ith observed point and the line is
Ai = (Xi ~ xi)(Yi ~ yi),  i = 1, 2,…, n,
where (Xi, Yi) denotes the ith pair of observation without any error which lies on the line. The total area extended by the n data points is Σ_{i=1}^{n} Ai. All the observed points are expected to lie on the line
Yi = β0 + β1 Xi
and let
Ei* = Yi − β0 − β1 Xi = 0.
So now the objective is to minimize the sum of the areas under the constraints Ei* = 0 to obtain the reduced major axis estimates of the regression coefficients. Using the Lagrangian multiplier method, the Lagrangian function is
LR = Σ_{i=1}^{n} Ai − Σ_{i=1}^{n} λi Ei*
   = Σ_{i=1}^{n} (Xi − xi)(Yi − yi) − Σ_{i=1}^{n} λi Ei*
where λ1,…, λn are the Lagrangian multipliers. The set of equations is obtained by setting
∂LR/∂Xi = 0, ∂LR/∂Yi = 0, ∂LR/∂β0 = 0, ∂LR/∂β1 = 0  (i = 1, 2,…, n).
Thus
∂LR/∂Xi = (Yi − yi) + λi β1 = 0
∂LR/∂Yi = (Xi − xi) − λi = 0
∂LR/∂β0 = Σ_{i=1}^{n} λi = 0
∂LR/∂β1 = Σ_{i=1}^{n} λi Xi = 0.
Now
Xi = xi + λi
Yi = yi − λi β1
and substituting these in β0 + β1 Xi = Yi gives
β0 + β1 (xi + λi) = yi − λi β1
⇒ λi = (yi − β0 − β1 xi)/(2β1).
Using Σ_{i=1}^{n} λi = 0, we get
Σ_{i=1}^{n} (yi − β0 − β1 xi)/(2β1) = 0,
which gives
β̂0RM = ȳ − β̂1RM x̄
where β̂1RM is the reduced major axis regression estimate of β1. Using Xi = xi + λi, λi and β̂0RM in Σ_{i=1}^{n} λi Xi = 0, we get
Σ_{i=1}^{n} [ (yi − ȳ + β1 x̄ − β1 xi)/(2β1) ] [ xi + (yi − ȳ + β1 x̄ − β1 xi)/(2β1) ] = 0.
Let ui = xi − x̄ and vi = yi − ȳ; then this equation can be re-expressed as
Σ_{i=1}^{n} (vi − β1 ui)(vi + β1 ui + 2 β1 x̄) = 0.
Using Σ_{i=1}^{n} ui = Σ_{i=1}^{n} vi = 0, we get
Σ_{i=1}^{n} vi² − β1² Σ_{i=1}^{n} ui² = 0, i.e., syy − β1² sxx = 0.
Solving this equation, the reduced major axis regression estimate of β1 is obtained as
β̂1RM = sign(sxy) √(syy/sxx)
where sign(sxy) = 1 if sxy > 0 and −1 if sxy < 0. We choose the regression estimate which has the same sign as that of sxy.
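Both the orthogonal and the reduced major axis slope estimates follow directly from sxx, syy and sxy; a minimal numerical sketch reusing the quantities from the earlier snippets is given below.

```python
# Orthogonal (major axis) regression slope: root of  sxy*b^2 + (sxx - syy)*b - sxy = 0
# chosen with the positive square root, so that the slope has the same sign as sxy
# (this root minimizes the sum of squared perpendicular distances).
b1_OR = ((syy - sxx) + np.sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
b0_OR = ybar - b1_OR * xbar

# Reduced major axis regression slope: sign(sxy) * sqrt(syy / sxx)
b1_RM = np.sign(sxy) * np.sqrt(syy / sxx)
b0_RM = ybar - b1_RM * xbar
print(b1_OR, b1_RM)
```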
In the method of least absolute deviation (LAD) regression, the parameters β0 and β1 are estimated such that the sum of the absolute deviations Σ_{i=1}^{n} |εi| is minimum. It minimizes the absolute vertical sum of errors, as in the following scatter diagram:

Figure: Least absolute deviation regression method

The LAD estimates β̂0L and β̂1L are the estimates of β0 and β1, respectively, which minimize
LAD(β0, β1) = Σ_{i=1}^{n} |yi − β0 − β1 xi|
for the given observations (xi, yi) (i = 1, 2,…, n).
Conceptually, the LAD procedure is more straightforward than the OLS procedure because |e| (absolute residuals) is a more straightforward measure of the size of the residual than e² (squared residuals). The LAD regression estimates of β0 and β1 are not available in closed form. Instead, they can be obtained numerically using iterative algorithms. Moreover, this creates the problems of non-uniqueness and degeneracy in the estimates. Non-uniqueness means that more than one best line can pass through a data point. Degeneracy means that the best line through a data point also passes through more than one other data point. The non-uniqueness and degeneracy concepts are used in algorithms to judge the quality of the estimates.
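Since the LAD estimates have no closed form, they are typically obtained numerically. The following is a sketch using scipy.optimize.minimize (a generic optimizer, not a specialised LAD algorithm), started from the OLS estimates of the earlier snippets.

```python
from scipy.optimize import minimize

def lad_objective(params, x, y):
    """Sum of absolute deviations |y_i - beta0 - beta1 * x_i|."""
    beta0, beta1 = params
    return np.sum(np.abs(y - beta0 - beta1 * x))

# Numerical minimization started at the OLS estimates (Nelder-Mead needs no gradient)
res = minimize(lad_objective, x0=np.array([b0, b1]), args=(x, y), method="Nelder-Mead")
b0_LAD, b1_LAD = res.x
print(b0_LAD, b1_LAD)
```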
Suppose both the dependent and independent variables are stochastic in the simple linear regression model
y = β0 + β1 X + ε,
where ε is the associated random error component. The observations (xi, yi), i = 1, 2,…, n, are assumed to be jointly distributed, and the statistical inferences in such cases are drawn conditionally on X.
Assume that (X, y) jointly follow a bivariate normal distribution, where μx and μy are the means of X and y; σx² and σy² are the variances of X and y; and ρ is the correlation coefficient between X and y. Then the conditional distribution of y given X = x is the univariate normal distribution with conditional mean
E(y | X = x) = μ_{y|x} = β0 + β1 x
and conditional variance
Var(y | X = x) = σ²_{y|x} = σy² (1 − ρ²)
where
β0 = μy − μx β1
and
β1 = ρ σy/σx.
When both X and y are stochastic, the problem of estimation of the parameters can be reformulated as follows. Consider a conditional random variable y | X = x having a normal distribution with mean equal to the conditional mean μ_{y|x} and variance equal to the conditional variance Var(y | X = x) = σ²_{y|x}. Obtain n independently distributed observations yi | xi, i = 1, 2,…, n, from N(μ_{y|x}, σ²_{y|x}) with non-stochastic X. Now the method of maximum likelihood can be used to estimate the parameters; it yields the same estimates of β0 and β1 as before, viz., b0 = ȳ − b1 x̄ and b1 = sxy/sxx. In addition, the maximum likelihood estimate of the correlation coefficient ρ is the sample correlation coefficient
ρ̂ = Σ_{i=1}^{n} (yi − ȳ)(xi − x̄) / √( Σ_{i=1}^{n} (xi − x̄)² Σ_{i=1}^{n} (yi − ȳ)² )
  = sxy / √(sxx syy)
  = b1 √(sxx/syy).
Thus
ρ̂² = b1² sxx/syy = b1 sxy/syy = (syy − SSres)/syy = R²,
which is the same as the coefficient of determination. Thus R² has the same expression as in the case when X is fixed, and R² again measures the goodness of the fitted model even when X is stochastic.
Consider the model
y = X1 β1 + X2 β2 + … + Xk βk + ε.
This is called the multiple linear regression model. The parameters β1, β2,…, βk are the regression coefficients associated with X1, X2,…, Xk, respectively, and ε is the random error component reflecting the difference between the observed and fitted linear relationship. There can be various reasons for such a difference, e.g., the joint effect of those variables not included in the model, random factors which cannot be accounted for in the model etc.
Note that the jth regression coefficient βj represents the expected change in y per unit change in the jth explanatory variable Xj when all other variables are held fixed, i.e.,
βj = ∂E(y)/∂Xj.
Linear model:
A model is said to be linear when it is linear in the parameters. In such a case ∂y/∂βj (or equivalently ∂E(y)/∂βj) should not depend on any β's. For example,
i) y = β0 + β1 X is a linear model as it is linear in the parameters.
ii) … a linear model.
iii) y = β0 + β1 X + β2 X² is linear in the parameters β0, β1 and β2 but it is nonlinear in the variable X. So it is a linear model.
iv) y = β0 + β1/(X − β2) is nonlinear in the parameters and variables both. So it is a nonlinear model.
v) y = β0 + β1 X^{β2} is nonlinear in the parameters and variables both. So it is a nonlinear model.
vi) y = β0 + β1 X + β2 X² + β3 X³ is a cubic polynomial model which can be written as
y = β0 + β1 X1 + β2 X2 + β3 X3,
which is linear in the parameters β0, β1, β2, β3 and linear in the variables X1 = X, X2 = X², X3 = X³. So it is a linear model.
Example:
The income and education of a person are related. It is expected that, on average, a higher level of education
provides a higher income. So a simple linear regression model can be expressed as
income = β0 + β1 education + ε.
Note that β1 reflects the change in income with respect to a per unit change in education, and β0 reflects the income when education is zero, as it is expected that even an illiterate person can also have some income.
Further, this model neglects that most people have a higher income when they are older than when they are young, regardless of education. So β1 will over-state the marginal impact of education. If age and education are positively correlated, then the regression model will associate all the observed increase in income with an increase in education. So a better model is
income = β0 + β1 education + β2 age + ε.
Often it is observed that income tends to rise less rapidly in the later earning years than in the early years. To accommodate such a possibility, we might extend the model to
income = β0 + β1 education + β2 age + β3 age² + ε.
This is how we proceed for regression modeling in real-life situation. One needs to consider the experimental
condition and the phenomenon before making the decision on how many, why and how to choose the
dependent and independent variables.
Model set up:
Let an experiment be conducted n times, and the data be obtained as follows:

Observation number | Response y | Explanatory variables X1, X2, …, Xk
1                  | y1         | x11, x12, …, x1k
2                  | y2         | x21, x22, …, x2k
⋮                  | ⋮          | ⋮
n                  | yn         | xn1, xn2, …, xnk

Assuming that the model is
y = β1 X1 + β2 X2 + … + βk Xk + ε,
the n observations can be written as
yi = β1 xi1 + β2 xi2 + … + βk xik + εi,  i = 1, 2,…, n,
or y = Xβ + ε.
In general, the model with k explanatory variables can be expressed as
y = Xβ + ε
where y = (y1, y2,…, yn)' is an n×1 vector of n observations on the study variable, X is the n×k matrix of observations on the k explanatory variables, β = (β1, β2,…, βk)' is a k×1 vector of regression coefficients and ε = (ε1, ε2,…, εn)' is an n×1 vector of random error components. The following assumptions are made:
(i) E(ε) = 0
(ii) E(εε') = σ² In
(iii) Rank(X) = k
(iv) X is a non-stochastic matrix
(v) ε ~ N(0, σ² In).
These assumptions are used to study the statistical properties of the estimators of the regression coefficients. The following assumption is required to study, particularly, the large sample properties of the estimators:
(vi) lim_{n→∞} (X'X/n) exists and is a non-stochastic and nonsingular matrix (with finite elements).
The explanatory variables can also be stochastic in some cases. We assume that X is non-stochastic unless stated separately.
We consider the problems of estimation and testing of hypotheses on the regression coefficient vector under the stated assumptions.
Estimation of parameters:
A general procedure for the estimation of the regression coefficient vector is to minimize
Σ_{i=1}^{n} M(εi) = Σ_{i=1}^{n} M(yi − β1 xi1 − β2 xi2 − … − βk xik)
for a suitably chosen function M, e.g., M(x) = |x|^p, in general. We consider the principle of least squares, which is related to M(x) = x², and the method of maximum likelihood estimation for the estimation of the parameters.
Principle of ordinary least squares (OLS)
Let B be the set of all possible vectors β. If there is no further information, then B is the k-dimensional real Euclidean space. The objective is to find a vector b' = (b1, b2,…, bk) from B that minimizes the sum of squared deviations of the εi's, i.e.,
S(β) = Σ_{i=1}^{n} εi² = ε'ε = (y − Xβ)'(y − Xβ)
for given y and X. A minimum will always exist as S(β) is a real-valued, convex and differentiable function. Write
S(β) = y'y + β'X'Xβ − 2β'X'y.
Differentiating S(β) with respect to β,
∂S(β)/∂β = 2X'Xβ − 2X'y
∂²S(β)/∂β∂β' = 2X'X  (at least non-negative definite).
The normal equation is obtained by setting
∂S(β)/∂β = 0
⇒ X'Xb = X'y,
where the following result is used:
Result: If f(z) = Z'AZ is a quadratic form, Z is an m×1 vector and A is any m×m symmetric matrix, then ∂f(z)/∂z = 2Az.
Since it is assumed that rank(X) = k (full rank), X'X is positive definite and the unique solution of the normal equation is
b = (X'X)⁻¹X'y,
which is termed the ordinary least squares estimator (OLSE) of β. Since ∂²S(β)/∂β∂β' is at least non-negative definite, b minimizes S(β).
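A compact numpy sketch of the OLSE for the multiple linear regression model, with a small hypothetical design matrix (values chosen only for illustration):

```python
import numpy as np

# Hypothetical data: n = 6 observations, k = 3 columns (intercept plus two regressors)
X = np.array([[1.0, 2.0, 1.5],
              [1.0, 3.0, 2.0],
              [1.0, 4.0, 2.2],
              [1.0, 5.0, 3.1],
              [1.0, 6.0, 3.0],
              [1.0, 7.0, 4.1]])
y = np.array([3.2, 4.1, 5.0, 6.3, 6.8, 8.1])
n, k = X.shape

# OLSE: b = (X'X)^{-1} X'y, computed via a linear solve for numerical stability
b = np.linalg.solve(X.T @ X, X.T @ y)

# Unbiased estimate of sigma^2: s^2 = e'e / (n - k)
e = y - X @ b
s2 = (e @ e) / (n - k)
print(b, s2)
```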
(i) If X'X is not of full rank, a solution of the normal equation X'Xb = X'y can be written as
b = (X'X)⁻X'y + [I − (X'X)⁻X'X] ω,
where (X'X)⁻ is the generalized inverse of X'X and ω is an arbitrary vector. The fitted values are then
ŷ = Xb = X(X'X)⁻X'y + X[I − (X'X)⁻X'X] ω = X(X'X)⁻X'y,
which is independent of ω. This implies that ŷ has the same value for all solutions b of X'Xb = X'y.
(ii) Note that for any β,
S(β) = [y − Xb + X(b − β)]' [y − Xb + X(b − β)]
     = (y − Xb)'(y − Xb) + (b − β)'X'X(b − β) + 2(b − β)'X'(y − Xb)
     = (y − Xb)'(y − Xb) + (b − β)'X'X(b − β)   (using X'Xb = X'y)
     ≥ (y − Xb)'(y − Xb) = S(b)
     = y'y − 2y'Xb + b'X'Xb
     = y'y − b'X'Xb
     = y'y − ŷ'ŷ.
In the case of β̂ = b, the fitted values are
ŷ = Xb = X(X'X)⁻¹X'y = Hy
where H = X(X'X)⁻¹X' is termed as the hat matrix.
Residuals
The difference between the observed and the fitted values of the study variable is called a residual. It is denoted as
e = y ~ ŷ
  = y − ŷ
  = y − Xb
  = y − Hy
  = (I − H) y
  = H̄ y
where H̄ = I − H.
Note that
(i) H̄ is a symmetric matrix,
(ii) H̄ is an idempotent matrix, i.e., H̄H̄ = (I − H)(I − H) = (I − H) = H̄, and
(iii) tr H̄ = tr In − tr H = n − k.
(ii) Bias
Since X is assumed to be non-stochastic and E(ε) = 0,
E(b − β) = (X'X)⁻¹X'E(ε) = 0.
Thus OLSE is an unbiased estimator of β.
(iv) Variance
The variance of b can be obtained as the sum of the variances of b1, b2,…, bk, which is the trace of the covariance matrix of b. Thus
Var(b) = tr[V(b)] = Σ_{i=1}^{k} E(bi − βi)² = Σ_{i=1}^{k} Var(bi).
The residual sum of squares is
SSres = e'e = (y − Xb)'(y − Xb) = y'(I − H)'(I − H)y = y'(I − H)y = y'H̄y.
Also,
SSres = (y − Xb)'(y − Xb) = y'y − 2b'X'y + b'X'Xb = y'y − b'X'y   (using X'Xb = X'y).
Further,
SSres = y'H̄y = (Xβ + ε)'H̄(Xβ + ε) = ε'H̄ε   (using H̄X = 0),
so E(SSres) = E(ε'H̄ε) = σ² tr H̄ = (n − k)σ².
Thus
E[ y'H̄y/(n − k) ] = σ²,
or
E[MSres] = σ²
where MSres = SSres/(n − k) is the mean sum of squares due to residuals.
Thus an unbiased estimator of σ² is
σ̂² = MSres = s² (say),
which is a model-dependent estimator.
Gauss-Markov Theorem:
The ordinary least squares estimator (OLSE) is the best linear unbiased estimator (BLUE) of β.
Proof: The OLSE of β is
b = (X'X)⁻¹X'y,
which is a linear function of y. Consider an arbitrary linear estimator
b* = a'y
of the linear parametric function ℓ'β, where the elements of a are arbitrary constants. If b* is unbiased for ℓ'β, then
E(b*) = E(a'y) = a'Xβ = ℓ'β, i.e., a'X = ℓ'.
Further,
Var(a'y) = a'Var(y)a = σ² a'a
and
Var(ℓ'b) = ℓ'Var(b)ℓ = σ² ℓ'(X'X)⁻¹ℓ = σ² a'X(X'X)⁻¹X'a.
Consider
Var(a'y) − Var(ℓ'b) = σ² [ a'a − a'X(X'X)⁻¹X'a ] = σ² a'[ I − X(X'X)⁻¹X' ] a = σ² a'(I − H)a ≥ 0,
since (I − H) is symmetric and idempotent and hence positive semidefinite. This reveals that if b* is any linear unbiased estimator, then its variance must be no smaller than that of ℓ'b. Consequently, the OLSE is the best linear unbiased estimator, where ‘best’ refers to the fact that b is efficient within the class of linear and unbiased estimators.
Maximum likelihood estimation:
Under the assumption ε ~ N(0, σ² In), the likelihood function of the model y = Xβ + ε is
L(β, σ²) = (1/(2πσ²)^{n/2}) exp[ −(1/(2σ²)) Σ_{i=1}^{n} εi² ]
         = (1/(2πσ²)^{n/2}) exp[ −(1/(2σ²)) ε'ε ]
         = (1/(2πσ²)^{n/2}) exp[ −(1/(2σ²)) (y − Xβ)'(y − Xβ) ].
Since the log transformation is monotonic, we maximize ln L(β, σ²) instead of L(β, σ²):
ln L(β, σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) (y − Xβ)'(y − Xβ).
The maximum likelihood estimators (m.l.e.) of β and σ² are obtained by equating the first-order derivatives of ln L(β, σ²) with respect to β and σ² to zero:
∂ln L(β, σ²)/∂β = (1/(2σ²)) 2X'(y − Xβ) = 0
∂ln L(β, σ²)/∂σ² = −n/(2σ²) + (1/(2(σ²)²)) (y − Xβ)'(y − Xβ) = 0.
The solutions of these likelihood equations are given by
β̃ = (X'X)⁻¹X'y
σ̃² = (1/n) (y − Xβ̃)'(y − Xβ̃).
Further, to verify that these values maximize the likelihood function, we find
∂²ln L(β, σ²)/∂β∂β' = −(1/σ²) X'X
∂²ln L(β, σ²)/∂(σ²)² = n/(2σ⁴) − (1/σ⁶) (y − Xβ)'(y − Xβ)
∂²ln L(β, σ²)/∂β∂σ² = −(1/σ⁴) X'(y − Xβ).
Thus the Hessian matrix of the second-order partial derivatives of ln L(β, σ²) with respect to β and σ²,
[ ∂²ln L/∂β∂β'    ∂²ln L/∂β∂σ²  ]
[ ∂²ln L/∂σ²∂β'   ∂²ln L/∂(σ²)² ],
is negative definite at β = β̃ and σ² = σ̃². This ensures that the likelihood function is maximized at these values.
Consistency of the estimators:
(i) Consistency of b
Under assumption (vi), the covariance matrix of b satisfies
V(b) = σ² (X'X)⁻¹ = (σ²/n) (X'X/n)⁻¹ → 0 as n → ∞.
This implies that OLSE converges to β in quadratic mean. Thus OLSE is a consistent estimator of β. This holds true for the maximum likelihood estimator also.
The same conclusion can also be proved using the concept of convergence in probability. An estimator θ̂n converges to θ in probability if
lim_{n→∞} P[ |θ̂n − θ| ≥ δ ] = 0 for every δ > 0.
The consistency of OLSE can be obtained under the weaker assumption that
plim (X'X/n) = Δ*
exists and is a nonsingular and non-stochastic matrix, and that
plim (X'ε/n) = 0.
Since
b − β = (X'X)⁻¹X'ε = (X'X/n)⁻¹ (X'ε/n),
we have
plim(b − β) = plim (X'X/n)⁻¹ · plim (X'ε/n) = Δ*⁻¹ · 0 = 0.
Thus b is a consistent estimator of β. The same is true for the m.l.e. also.
(ii) Consistency of s²
Now we look at the consistency of s² as an estimate of σ². We have
s² = e'e/(n − k)
   = ε'H̄ε/(n − k)
   = (1 − k/n)⁻¹ (1/n) [ ε'ε − ε'X(X'X)⁻¹X'ε ]
   = (1 − k/n)⁻¹ [ ε'ε/n − (ε'X/n)(X'X/n)⁻¹(X'ε/n) ].
Note that ε'ε/n = (1/n) Σ_{i=1}^{n} εi², and {εi², i = 1, 2,…, n} is a sequence of independently and identically distributed random variables with mean σ². Using the law of large numbers,
plim (ε'ε/n) = σ²
plim [ (ε'X/n)(X'X/n)⁻¹(X'ε/n) ] = plim (ε'X/n) · [plim (X'X/n)]⁻¹ · plim (X'ε/n) = 0 · Δ*⁻¹ · 0 = 0,
so
plim(s²) = (1 − 0)⁻¹ (σ² − 0) = σ².
Thus s² is a consistent estimator of σ². The same holds true for the m.l.e. also.
Cramer-Rao lower bound:
The Cramer-Rao lower bound for the covariance matrix of unbiased estimators of (β, σ²) is
[ σ²(X'X)⁻¹   0      ]
[ 0           2σ⁴/n  ],
whereas the covariance matrix of the OLS estimators (b, s²) is
[ σ²(X'X)⁻¹   0            ]
[ 0           2σ⁴/(n − k)  ],
which means that the Cramer-Rao lower bound is attained for the covariance of b but not for s².
Standardized regression coefficients:
The magnitude of a regression coefficient depends on the units of measurement of the jth explanatory variable Xj. For example, in the following fitted regression model
ŷ = 5 X1 + 1000 X2,
y is measured in litres, X1 in litres and X2 in millilitres. Although β̂2 >> β̂1, the effect of both explanatory variables is identical: a one-litre change in either X1 or X2, when the other variable is held fixed, produces the same change in ŷ.
Sometimes it is helpful to work with scaled explanatory variables and a scaled study variable that produce dimensionless regression coefficients. These dimensionless regression coefficients are called standardized regression coefficients. There are two popular approaches for scaling which give standardized regression coefficients. We discuss them as follows.
1. Unit normal scaling:
Define
zij = (xij − x̄j)/sj,  i = 1, 2,…, n; j = 1, 2,…, k,
yi* = (yi − ȳ)/sy,
where sj² and sy² are the sample variances of the jth explanatory variable and of y, respectively. All the scaled explanatory variables and the scaled study variable have mean zero and sample variance unity. Using these new variables, the regression model becomes
yi* = γ1 zi1 + γ2 zi2 + … + γk zik + εi,  i = 1, 2,…, n.
Such centering removes the intercept term from the model. The least-squares estimate of γ = (γ1, γ2,…, γk)' is
γ̂ = (Z'Z)⁻¹Z'y*.
This scaling has a similarity to standardizing a normal random variable, i.e., subtracting its mean and dividing by its standard deviation. So it is called unit normal scaling.
2. Unit length scaling:
In unit length scaling, define
wij = (xij − x̄j)/Sjj^{1/2},  i = 1, 2,…, n; j = 1, 2,…, k,
yi0 = (yi − ȳ)/SST^{1/2},
where Sjj = Σ_{i=1}^{n} (xij − x̄j)² is the corrected sum of squares for the jth explanatory variable Xj and SST = Σ_{i=1}^{n} (yi − ȳ)² is the total sum of squares. In this scaling, each new explanatory variable Wj has mean w̄j = (1/n) Σ_{i=1}^{n} wij = 0 and length √( Σ_{i=1}^{n} (wij − w̄j)² ) = 1.
In this scaling, the matrix W'W takes the form of a correlation matrix: its (i, j)th element is
rij = Σ_{u=1}^{n} (xui − x̄i)(xuj − x̄j) / (Sii Sjj)^{1/2} = Sij/(Sii Sjj)^{1/2},
which is the simple correlation coefficient between the explanatory variables Xi and Xj. Similarly, the jth element of W'y0 is
rjy = Σ_{u=1}^{n} (xuj − x̄j)(yu − ȳ) / (Sjj SST)^{1/2} = Sjy/(Sjj SST)^{1/2},
which is the simple correlation coefficient between the jth explanatory variable Xj and the study variable y.
The estimates of the regression coefficients obtained under unit normal scaling (i.e., γ̂) and under unit length scaling are identical. The regression coefficients obtained after such scaling are usually called standardized regression coefficients.
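A short sketch of unit length scaling and the resulting standardized regression coefficients, assuming the X and y arrays from the multiple regression sketch above (the first column of X is the intercept and is dropped before scaling):

```python
# Unit length scaling of the non-constant columns of X and of y
Xs = X[:, 1:]                                    # drop the intercept column
W = (Xs - Xs.mean(axis=0)) / np.sqrt(((Xs - Xs.mean(axis=0)) ** 2).sum(axis=0))
y0 = (y - y.mean()) / np.sqrt(((y - y.mean()) ** 2).sum())

# W'W is the correlation matrix of the regressors, W'y0 the correlations with y
corr_XX = W.T @ W
corr_Xy = W.T @ y0

# Standardized (dimensionless) regression coefficients
b_std = np.linalg.solve(corr_XX, corr_Xy)
print(b_std)
```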
Here b0 is the OLSE of the intercept term and the bj are the OLSEs of the slope parameters.
Let A = In − (1/n) ℓℓ', where ℓ = (1, 1,…, 1)' is the n×1 vector of ones. Pre-multiplication of any column vector by A produces that vector in deviation form. Note that
Aℓ = ℓ − (1/n) ℓ ℓ'ℓ = ℓ − ℓ = 0,
and A is a symmetric and idempotent matrix.
In the model
y = Xβ + ε,
the OLSE of β is b = (X'X)⁻¹X'y with residual vector e = y − Xb. Note that Ae = e.
Suppose the model contains an intercept term and k − 1 explanatory variables X2, X3,…, Xk, so that X is partitioned as X = [X1, X2*], where X1 = ℓ is the column corresponding to the intercept term and X2* contains the observations on X2, X3,…, Xk. The OLSE b = (b1, b2*')' is suitably partitioned, with b1 the OLSE of the intercept term and b2* the OLSEs of the slope coefficients. Then
y = X1 b1 + X2* b2* + e.
Premultiplying by A,
Ay = AX1 b1 + AX2* b2* + Ae = AX2* b2* + e,
since AX1 = Aℓ = 0 and Ae = e. Premultiplying by X2*' gives
X2*'Ay = X2*'AX2* b2* + X2*'e = X2*'AX2* b2*   (since X'e = 0).
This equation can be compared with the normal equations X'y = X'Xb in the model y = Xβ + ε. Such a comparison yields the following conclusion: b2* is the subvector of the OLSE, obtained from the data in deviation form. This is the normal equation in terms of deviations, and its solution gives the OLSE of the slope coefficients as
b2* = (X2*'AX2*)⁻¹ X2*'Ay.
The estimate of the intercept term is obtained in the second step as follows: premultiplying y = Xb + e by (1/n)ℓ' gives
(1/n)ℓ'y = b1 + (1/n)ℓ'X2* b2* + (1/n)ℓ'e,
i.e., b1 = ȳ − (1/n)ℓ'X2* b2*   (since ℓ'e = 0).
The expression for the total sum of squares (TSS) remains the same as earlier and is given by
TSS = y'Ay.
Since
Ay = AX2* b2* + e,
we have
y'Ay = y'AX2* b2* + y'e = (Xb + e)'AX2* b2* + y'e.
Testing of hypothesis:
There are several important questions which can be answered through the test of hypothesis concerning the
regression coefficients. For example
1. What is the overall adequacy of the model?
2. Which specific explanatory variables seem to be important?
etc.
In order to answer such questions, we first develop the test of hypothesis for a general framework, viz., the general linear hypothesis. Then several tests of hypotheses can be derived as its special cases. So first, we discuss the test of a general linear hypothesis.
Test of hypothesis for H0: Rβ = r
We consider a general linear hypothesis that the parameters in β are contained in a subspace of the parameter space for which Rβ = r, where R is a (J × k) matrix of known elements and r is a (J × 1) vector of known elements.
We assume that rank(R) = J, i.e., full rank, so that there is no linear dependence in the hypothesis.
Some special cases and interesting examples of H0: Rβ = r are as follows:
(i) H0: βi = 0
Choose J = 1, r = 0, R = [0, 0,…, 0, 1, 0,…, 0], where 1 occurs at the ith position in R.
This particular hypothesis explains whether Xi has any effect on the linear model or not.
(ii) H0: β3 = β4 (i.e., β3 − β4 = 0), or H0: β3 = β4 = β5 (i.e., β3 − β4 = 0, β3 − β5 = 0)
For the latter case, choose J = 2, r = (0, 0)',
R = [ 0 0 1 −1  0 0 … 0 ]
    [ 0 0 1  0 −1 0 … 0 ].
(iv) H0: β3 + 5β4 = 2
(v) H0: β2 = β3 = … = βk = 0
Choose
J = k − 1,
r = (0, 0,…, 0)',
R = [ 0 1 0 … 0 ]
    [ 0 0 1 … 0 ]
    [ ⋮         ]
    [ 0 0 0 … 1 ]  = [0, I_{k−1}],  a (k − 1)×k matrix.
This particular hypothesis explains the goodness of fit. It tells whether the βi's have a linear effect or not and whether they are of any importance. It also tests whether X2, X3,…, Xk have any influence in the determination of y. Here β1 = 0 is excluded because this would involve the additional implication that the mean level of y is zero. Our main concern is to know whether the explanatory variables help to explain the variation in y around its mean value or not.
We develop the likelihood ratio test for H0: Rβ = r. The likelihood ratio statistic is
λ = max L(β, σ² | y, X) / max L(β, σ² | y, X, Rβ = r) = L̂(Ω̂)/L̂(ω̂)
where Ω is the whole parametric space and ω is the subspace of Ω restricted by H0. If both likelihoods are maximized, one constrained and the other unconstrained, then the value of the unconstrained maximum will not be smaller than the value of the constrained maximum. Hence λ ≥ 1.
First, we discuss the likelihood ratio test for the simpler case when R = Ik and r = β0, i.e., H0: β = β0, where β0 is specified by the investigator. This will give us a better and detailed understanding of the main steps. The elements of β0 can take on any value, including zero. The concerned alternative hypothesis is
H1: β ≠ β0.
In this case,
Ω = { (β, σ²): −∞ < βi < ∞, σ² > 0, i = 1, 2,…, k }
ω = { (β, σ²): β = β0, σ² > 0 }.
The unconstrained likelihood under Ω is
L(β, σ² | y, X) = (2πσ²)^(−n/2) exp[−(1/(2σ²))(y − Xβ)'(y − Xβ)],
and maximizing it over Ω gives the maximum likelihood estimates β̃ = (X'X)⁻¹X'y and σ̃² = (1/n)(y − Xβ̃)'(y − Xβ̃).
Since β₀ is known, the constrained likelihood function has the optimum variance estimator
σ̃²_ω = (1/n)(y − Xβ₀)'(y − Xβ₀),
so that
L̂(ω) = n^(n/2) exp(−n/2) / [(2π)^(n/2) {(y − Xβ₀)'(y − Xβ₀)}^(n/2)].
Also note that
(y − Xβ̃)'(y − Xβ̃) = e'e = y'[I − X(X'X)⁻¹X']y = y'Hy = (Xβ + ε)'H(Xβ + ε) = ε'Hε   (using HX = 0)
= (n − k)σ̂²,
where σ̂² = e'e/(n − k) is the unbiased estimator of σ².
Result: If Z is an (n × 1) random vector distributed as N(0, σ²Iₙ) and A is an n × n symmetric idempotent matrix of rank p, then Z'AZ/σ² ~ χ²(p). If B is another n × n symmetric idempotent matrix of rank q, then Z'BZ/σ² ~ χ²(q). If AB = 0, then Z'AZ is distributed independently of Z'BZ.
Further, if H₀ is true, then β = β₀ and the numerator of the statistic involves (β̃ − β₀). Rewriting the numerator in general, we have
(β̃ − β)'X'X(β̃ − β) = ε'X(X'X)⁻¹X'X(X'X)⁻¹X'ε = ε'X(X'X)⁻¹X'ε = ε'H̄ε,
where H̄ = X(X'X)⁻¹X' is an idempotent matrix with rank k. Thus, using the above result,
ε'H̄ε/σ² = ε'X(X'X)⁻¹X'ε/σ² ~ χ²(k).
Furthermore, the product of the quadratic-form matrices in the numerator (ε'H̄ε) and the denominator (ε'Hε) is
[I − X(X'X)⁻¹X'] X(X'X)⁻¹X' = X(X'X)⁻¹X' − X(X'X)⁻¹X'X(X'X)⁻¹X' = 0,
and hence the χ² random variables in the numerator and denominator are independent. Dividing each by its degrees of freedom gives, under H₀,
λ₁ = [(β̃ − β₀)'X'X(β̃ − β₀)/(kσ²)] / [(n − k)σ̂²/((n − k)σ²)]
   = (β̃ − β₀)'X'X(β̃ − β₀)/(kσ̂²)
   = [(y − Xβ₀)'(y − Xβ₀) − (y − Xβ̃)'(y − Xβ̃)]/(kσ̂²)
   ~ F(k, n − k) under H₀.
Numerator in λ₁: the difference between the restricted and unrestricted error sums of squares.
The decision rule is to reject H₀ whenever
λ₁ ≥ F_{1−α}(k, n − k),
where F_{1−α}(k, n − k) is the upper α per cent critical point of the central F-distribution with k and (n − k) degrees of freedom.
The same approach applies to the general hypothesis H₀ : Rβ = r. Now the relevant parameter spaces are
Ω = {(β, σ²) : −∞ < βᵢ < ∞, σ² > 0, i = 1, 2, ..., k}
ω = {(β, σ²) : −∞ < βᵢ < ∞, Rβ = r, σ² > 0}.
Since β̃ ~ N(β, σ²(X'X)⁻¹),
so Rβ̃ ~ N(Rβ, σ²R(X'X)⁻¹R')
and Rβ̃ − r = Rβ̃ − Rβ = R(β̃ − β) ~ N(0, σ²R(X'X)⁻¹R') under H₀.
There exists a matrix Q such that R(X'X)⁻¹R' = QQ', which allows the corresponding quadratic form to be written as a sum of squares of independent normal variables, so that
(Rβ̃ − r)'[R(X'X)⁻¹R']⁻¹(Rβ̃ − r)/σ² ~ χ²(J),
independently of (n − k)σ̂²/σ² ~ χ²(n − k). Hence
λ₁ = (Rβ̃ − r)'[R(X'X)⁻¹R']⁻¹(Rβ̃ − r)/(Jσ̂²) ~ F(J, n − k) under H₀.
So the decision rule is to reject H₀ whenever
λ₁ ≥ F_{1−α}(J, n − k),
where F_{1−α}(J, n − k) is the upper α per cent critical point of the central F-distribution with J and (n − k) degrees of freedom.
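The following short Python sketch (illustrative data only, not from the notes) computes this statistic for a single restriction, assuming numpy and scipy are available; the choice of R and r below is hypothetical.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k = 60, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, 0.5, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
sigma2_hat = (y - X @ b) @ (y - X @ b) / (n - k)   # unbiased estimator of sigma^2

R = np.array([[0.0, 1.0, -1.0, 0.0]])              # H0: beta_2 = beta_3 (J = 1)
r = np.array([0.0])
J = R.shape[0]

diff = R @ b - r
F = diff @ np.linalg.inv(R @ XtX_inv @ R.T) @ diff / (J * sigma2_hat)
p_value = 1 - stats.f.cdf(F, J, n - k)
print(F, p_value, F >= stats.f.ppf(0.95, J, n - k))  # reject H0 at 5% level if True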
Consider the null hypothesis
H₀ : β₂ = β₃ = ... = β_k = 0
against the alternative hypothesis
H₁ : βⱼ ≠ 0 for at least one j = 2, 3, ..., k.
This hypothesis determines if there is a linear relationship between y and any of the explanatory variables X₂, X₃, ..., X_k. Notice that X₁ corresponds to the intercept term in the model, and hence xᵢ₁ = 1 for all i = 1, 2, ..., n.
This is an overall or global test of model adequacy. Rejection of the null hypothesis indicates that at least one of the explanatory variables among X₂, X₃, ..., X_k contributes significantly to the model. The associated technique is called the analysis of variance.
Since ε ~ N(0, σ²I),
so y ~ N(Xβ, σ²I)
and b = (X'X)⁻¹X'y ~ N(β, σ²(X'X)⁻¹).
Also
σ̂² = SS_res/(n − k) = (y − ŷ)'(y − ŷ)/(n − k) = y'[I − X(X'X)⁻¹X']y/(n − k) = y'Hy/(n − k) = (y'y − b'X'y)/(n − k).
Since (X'X)⁻¹X'H = 0, b and σ̂² are independently distributed, and
SS_res/σ² ~ χ²(n − k).
Partition X = [X₁, X₂*] and β = [β₁, β₂*']', where the subvector β₂* contains the regression coefficients β₂, β₃, ..., β_k. The total sum of squares decomposes as SST = SS_reg + SS_res, where SS_reg = b₂*'X₂*'AX₂*b₂* is the sum of squares due to regression and the sum of squares due to residuals is
SS_res = (y − Xb)'(y − Xb) = y'Hy = SST − SS_reg.
Further,
SS_reg/σ² ~ χ²_{k−1}(λ), a non-central χ² distribution with non-centrality parameter λ = β₂*'X₂*'AX₂*β₂*/(2σ²).
Since X₂*'H = 0, SS_reg and SS_res are independently distributed. The mean square due to regression is
MS_reg = SS_reg/(k − 1)
and the mean square due to error is
MS_res = SS_res/(n − k).
Then
MS_reg/MS_res ~ F_{k−1, n−k}(λ),
which is a non-central F-distribution with (k − 1, n − k) degrees of freedom and noncentrality parameter λ = β₂*'X₂*'AX₂*β₂*/(2σ²).
Under H₀ : β₂ = β₃ = ... = β_k = 0 the noncentrality parameter is zero, so
F = MS_reg/MS_res ~ F_{k−1, n−k}.
The calculations are summarized in the analysis of variance table:
Source of variation | Sum of squares | Degrees of freedom | Mean square | F
Regression | SS_reg | k − 1 | MS_reg | MS_reg/MS_res
Error | SS_res | n − k | MS_res |
Total | SST | n − 1 | |
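A minimal numerical sketch of this decomposition and of the overall F statistic is given below (simulated data, not from the notes); numpy and scipy are assumed.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.8, -0.4]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ b
SS_res = np.sum((y - y_hat) ** 2)
SST = np.sum((y - y.mean()) ** 2)
SS_reg = SST - SS_res

MS_reg, MS_res = SS_reg / (k - 1), SS_res / (n - k)
F = MS_reg / MS_res
print(F, 1 - stats.f.cdf(F, k - 1, n - k))   # overall F statistic and its p-value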
Adding such explanatory variables also increases the variance of fitted values ŷ , so one needs to be careful
that only those regressors are added that are of real value in explaining the response. Adding unimportant
explanatory variables may increase the residual mean square, which may decrease the usefulness of the
model.
The test of hypothesis on an individual regression coefficient, H₀ : βⱼ = 0, has already been discussed in the case of the simple linear regression model. In the present case, if H₀ is accepted, it implies that the explanatory variable Xⱼ can be deleted from the model. The corresponding test statistic is
t = bⱼ/se(bⱼ) ~ t(n − k) under H₀,
where the standard error of the OLSE bⱼ of βⱼ is
se(bⱼ) = √(σ̂²Cⱼⱼ), with Cⱼⱼ the j-th diagonal element of (X'X)⁻¹ corresponding to bⱼ.
The decision rule is to reject H₀ at level α whenever
|t| ≥ t_{α/2, n−k}.
Note that this is only a partial or marginal test because bⱼ depends on all the other explanatory variables Xᵢ (i ≠ j) that are in the model. It is a test of the contribution of Xⱼ given the other explanatory variables in the model.
Since y ~ N(Xβ, σ²I), we have b ~ N(β, σ²(X'X)⁻¹). Thus the marginal distribution of any regression coefficient estimate is
bⱼ ~ N(βⱼ, σ²Cⱼⱼ),
so that
tⱼ = (bⱼ − βⱼ)/√(σ̂²Cⱼⱼ) ~ t(n − k), j = 1, 2, ..., k,
where σ̂² = SS_res/(n − k) = (y'y − b'X'y)/(n − k).
So the 100(1 − α)% confidence interval for βⱼ (j = 1, 2, ..., k) is obtained as follows:
P[−t_{α/2, n−k} ≤ (bⱼ − βⱼ)/√(σ̂²Cⱼⱼ) ≤ t_{α/2, n−k}] = 1 − α,
i.e.,
P[bⱼ − t_{α/2, n−k}√(σ̂²Cⱼⱼ) ≤ βⱼ ≤ bⱼ + t_{α/2, n−k}√(σ̂²Cⱼⱼ)] = 1 − α.
Thus the confidence interval is
(bⱼ − t_{α/2, n−k}√(σ̂²Cⱼⱼ), bⱼ + t_{α/2, n−k}√(σ̂²Cⱼⱼ)).
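The marginal t statistics and these confidence intervals can be computed as in the following sketch (illustrative data only; numpy and scipy assumed).

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, k, alpha = 50, 3, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.5, 1.0, 0.0]) + rng.normal(size=n)

C = np.linalg.inv(X.T @ X)                 # C = (X'X)^{-1}
b = C @ X.T @ y
sigma2_hat = (y - X @ b) @ (y - X @ b) / (n - k)

se = np.sqrt(sigma2_hat * np.diag(C))      # standard errors of b_j
t_stats = b / se                           # t statistics for H0: beta_j = 0
t_crit = stats.t.ppf(1 - alpha / 2, n - k)
ci = np.column_stack([b - t_crit * se, b + t_crit * se])
print(t_stats)
print(ci)                                  # row j: confidence interval for beta_j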
Simultaneous confidence intervals on regression coefficients:
A set of confidence intervals that hold simultaneously with probability (1 − α) are called simultaneous or joint confidence intervals.
It is relatively easy to define a joint confidence region for β in the multiple regression model.
Since
(b − β)'X'X(b − β)/(k MS_res) ~ F_{k, n−k},
we have
P[(b − β)'X'X(b − β)/(k MS_res) ≤ F_α(k, n − k)] = 1 − α.
So a 100(1 − α)% joint confidence region for all of the parameters in β is
(b − β)'X'X(b − β)/(k MS_res) ≤ F_α(k, n − k),
which describes an elliptically shaped region.
The coefficient of determination is
R² = 1 − SS_res/SST = SS_reg/SST,
where SS_res is the sum of squares due to residuals and SST = Σᵢ₌₁ⁿ(yᵢ − ȳ)² is the total sum of squares.
Since
e'e = y'[I − X(X'X)⁻¹X']y = y'Hy
and
Σᵢ₌₁ⁿ(yᵢ − ȳ)² = Σᵢ₌₁ⁿ yᵢ² − nȳ², where ȳ = (1/n)Σᵢ₌₁ⁿ yᵢ = (1/n)l'y with l = (1, 1, ..., 1)' and y = (y₁, y₂, ..., yₙ)',
we have
Σᵢ₌₁ⁿ(yᵢ − ȳ)² = y'y − (1/n)y'll'y = y'[I − (1/n)ll']y = y'Ay.
So
R² = 1 − y'Hy/y'Ay.
Similarly, any other value of R 2 between 0 and 1 indicates the adequacy of the fitted model.
To correct this overly optimistic picture, the adjusted R², denoted as R̄² or adj R², is used, which is defined as
R̄² = 1 − [SS_res/(n − k)]/[SST/(n − 1)] = 1 − [(n − 1)/(n − k)](1 − R²).
We will see later that (n − k) and (n − 1) are the degrees of freedom associated with the distributions of SS_res and SST. Moreover, the quantities SS_res/(n − k) and SST/(n − 1) are based on the unbiased estimators of the respective variances of e and y in the context of analysis of variance.
The adjusted R² will decline if the addition of an extra variable produces too small a reduction in (1 − R²) to compensate for the increase in (n − 1)/(n − k).
Another limitation of the adjusted R² is that it can also be negative. For example, if k = 3, n = 10 and R² = 0.16, then
R̄² = 1 − (9/7)(1 − 0.16) = 1 − 1.08 = −0.08 < 0,
which has no interpretation.
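A short numeric check of these two properties is sketched below (simulated data, not from the notes): R² never decreases when a regressor is added, while the adjusted R² can decrease and can even be negative; numpy assumed.

import numpy as np

def r2_and_adj(X, y):
    n, k = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    ss_res = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / sst
    return r2, 1 - (n - 1) / (n - k) * (1 - r2)

rng = np.random.default_rng(4)
n = 10
y = rng.normal(size=n)                               # y unrelated to the regressors
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, 1))])
X2 = np.column_stack([X1, rng.normal(size=(n, 1))])  # add an irrelevant variable

print(r2_and_adj(X1, y))   # R^2 and adjusted R^2 with one regressor
print(r2_and_adj(X2, y))   # R^2 cannot decrease; adjusted R^2 typically falls, possibly below 0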
Reason why R² is valid only in linear models with an intercept term:
In the model y = Xβ + ε, the ordinary least squares estimator of β is b = (X'X)⁻¹X'y. Consider the fitted model as
y = Xb + (y − Xb) = Xb + e,
where e is the residual. Note that
y − lȳ = Xb + e − lȳ = ŷ + e − lȳ,
where ŷ = Xb is the fitted value and l = (1, 1, ..., 1)' is an n × 1 vector of elements unity. The total sum of squares TSS = Σᵢ₌₁ⁿ(yᵢ − ȳ)² = (y − lȳ)'(y − lȳ) is then obtained by expanding this expression.
The Fisher–Cochran theorem requires TSS = SS_reg + SS_res to hold true in the context of analysis of variance, and further to define R². In order that TSS = SS_reg + SS_res holds true, the cross-product term must vanish, i.e., we need l'e = l'(y − ŷ) = 0, which is possible only when there is an intercept term in the model. We show this claim as follows:
First, we consider a no-intercept simple linear regression model yᵢ = β₁xᵢ + εᵢ (i = 1, 2, ..., n), where the parameter β₁ is estimated as
b₁* = Σᵢ₌₁ⁿ xᵢyᵢ / Σᵢ₌₁ⁿ xᵢ².
Then
l'e = Σᵢ₌₁ⁿ eᵢ = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ) = Σᵢ₌₁ⁿ (yᵢ − b₁*xᵢ) ≠ 0, in general.
In a multiple linear regression model with an intercept term, y = β₀l + Xβ + ε, the parameters β₀ and β are estimated as β̂₀ = ȳ − b'x̄ and b = (X'X)⁻¹X'y respectively, where x̄ is the vector of sample means of the explanatory variables. We find that
l'e = l'(y − ŷ)
    = l'(y − β̂₀l − Xb)
    = l'(y − lȳ − (X − lx̄')b)
    = l'(y − lȳ) − l'(X − lx̄')b
    = 0,
since l'(y − lȳ) = Σᵢ(yᵢ − ȳ) = 0 and each column of (X − lx̄') sums to zero.
Thus we conclude that, for the Fisher–Cochran theorem to hold true in the sense that the total sum of squares can be divided into two orthogonal components, viz., the sum of squares due to regression and the sum of squares due to errors, it is necessary that l'e = l'(y − ŷ) = 0, which is possible only when the intercept term is present in the model.
3. R 2 always increases with an increase in the number of explanatory variables in the model. The main
drawback of this property is that even when the irrelevant explanatory variables are added in the
model, R 2 still increases. This indicates that the model is getting better, which is not really correct.
Consider fitting one model to y and another to log y. The corresponding coefficients of determination are
R₁² = 1 − Σᵢ₌₁ⁿ(yᵢ − ŷᵢ)² / Σᵢ₌₁ⁿ(yᵢ − ȳ)²
R₂² = 1 − Σᵢ₌₁ⁿ(log yᵢ − log ŷᵢ)² / Σᵢ₌₁ⁿ(log yᵢ − (1/n)Σⱼ₌₁ⁿ log yⱼ)².
As such, R₁² and R₂² are not comparable. If the two models still need to be compared, a better measure is
R₃² = 1 − Σᵢ₌₁ⁿ(yᵢ − antilog ŷᵢ*)² / Σᵢ₌₁ⁿ(yᵢ − ȳ)²,
where ŷᵢ* denotes the fitted value of log yᵢ from the log-model. Now R₁² and R₃², on comparison, may give an idea about the adequacy of the two fitted models.
The overall F statistic can also be written in terms of R² as
F = [(n − k)/(k − 1)] · R²/(1 − R²).
In the limit, when R² → 1, F → ∞. So both F and R² vary directly: a larger R² implies a greater F value. That is why the F test under the analysis of variance is termed as the measure of the overall significance of the estimated regression. It is also a test of the significance of R². If F is highly significant, it implies that we can reject H₀, i.e., y is linearly related to the X's.
Its variance is
Var(p) = E[{p − E(y|x₀)}²] = σ²x₀'(X'X)⁻¹x₀.
Then, writing ŷ₀ = x₀'b,
E(ŷ₀) = x₀'β = E(y|x₀),
Var(ŷ₀) = σ²x₀'(X'X)⁻¹x₀.
The confidence interval on the mean response at a particular point, such as x₀ = (x₀₁, x₀₂, ..., x₀k)', can be found as follows. The 100(1 − α)% confidence interval on the mean response at the point x₀, i.e., on E(y|x₀), is
[ŷ₀ − t_{α/2, n−k}√(σ̂²x₀'(X'X)⁻¹x₀), ŷ₀ + t_{α/2, n−k}√(σ̂²x₀'(X'X)⁻¹x₀)],
while the corresponding 100(1 − α)% interval for a future (actual) observation at x₀ is
[ŷ₀ − t_{α/2, n−k}√(σ̂²[1 + x₀'(X'X)⁻¹x₀]), ŷ₀ + t_{α/2, n−k}√(σ̂²[1 + x₀'(X'X)⁻¹x₀])].
For the simple linear regression model y = β₀ + β₁x + ε, with random errors identically and independently distributed following N(0, σ²), the parameters β₀ and β₁ are estimated using ordinary least squares as
b₀ = ȳ − b₁x̄, b₁ = s_xy/s_xx,
where
s_xy = Σᵢ₌₁ⁿ(xᵢ − x̄)(yᵢ − ȳ), s_xx = Σᵢ₌₁ⁿ(xᵢ − x̄)², x̄ = (1/n)Σᵢ₌₁ⁿ xᵢ, ȳ = (1/n)Σᵢ₌₁ⁿ yᵢ.
The prediction of the average value of y at x = x₀ is obtained by the predictor
p_m = b₀ + b₁x₀.
Its predictive bias is
E(p_m) − E(y) = E(b₀) + E(b₁)x₀ − (β₀ + β₁x₀) = 0 + 0 = 0.
Thus the predictor p_m is an unbiased predictor of E(y).
Predictive variance:
The predictive variance of p_m is
PV(p_m) = Var(b₀ + b₁x₀)
        = Var(ȳ + b₁(x₀ − x̄))
        = Var(ȳ) + (x₀ − x̄)²Var(b₁) + 2(x₀ − x̄)Cov(ȳ, b₁)
        = σ²/n + σ²(x₀ − x̄)²/s_xx + 0
        = σ²[1/n + (x₀ − x̄)²/s_xx].
An estimate of the predictive variance is obtained by replacing σ² by σ̂² = MSE:
P̂V(p_m) = σ̂²[1/n + (x₀ − x̄)²/s_xx] = MSE[1/n + (x₀ − x̄)²/s_xx].
Prediction interval:
The 100(1 − α)% prediction interval for E(y) is obtained as follows. The predictor p_m is a linear combination of normally distributed random variables, so it is also normally distributed:
p_m ~ N(β₀ + β₁x₀, PV(p_m)).
So, if σ² is known,
P[−z_{α/2} ≤ (p_m − E(y))/√PV(p_m) ≤ z_{α/2}] = 1 − α,
which gives the prediction interval for E(y) as
[p_m − z_{α/2}√(σ²[1/n + (x₀ − x̄)²/s_xx]), p_m + z_{α/2}√(σ²[1/n + (x₀ − x̄)²/s_xx])].
When σ² is unknown, it is replaced by σ̂² = MSE, and in this case the sampling distribution of
(p_m − E(y))/√(MSE[1/n + (x₀ − x̄)²/s_xx])
is the t-distribution with (n − 2) degrees of freedom, so z_{α/2} is replaced by t_{α/2, n−2} in the interval.
Note that the width of the interval for E(y) is a function of x₀. The interval width is minimum for x₀ = x̄ and widens as |x₀ − x̄| increases. This is also expected, as the best estimates of y are made at x-values near the center of the data, and the precision of estimation deteriorates as we move toward the boundary of the x-space.
The predictor of the actual value of y at x = x₀ is
p_a = b₀ + b₁x₀.
Here 'a' means "actual". The true value of y in the prediction period is given by y₀ = β₀ + β₁x₀ + ε₀, where ε₀ indicates the value that would be drawn from the distribution of random error in the prediction period. Note that the form of the predictor is the same as that of the average-value predictor, but its predictive error and other properties are different. This is the dual nature of the predictor.
Predictive bias:
The predictive error of p_a is
p_a − y₀ = b₀ + b₁x₀ − (β₀ + β₁x₀ + ε₀) = (b₀ − β₀) + (b₁ − β₁)x₀ − ε₀.
Thus we find that
E(p_a − y₀) = E(b₀ − β₀) + E(b₁ − β₁)x₀ − E(ε₀) = 0 + 0 − 0 = 0,
so p_a is an unbiased predictor of y₀.
Predictive variance:
Because the future observation y₀ is independent of p_a, the predictive variance of p_a is
PV(p_a) = E(p_a − y₀)²
        = E[(b₀ − β₀) + (x₀ − x̄)(b₁ − β₁) + x̄(b₁ − β₁) − ε₀]²
        = Var(b₀) + [(x₀ − x̄)² + x̄² + 2(x₀ − x̄)x̄]Var(b₁) + Var(ε₀) + 2[(x₀ − x̄) + x̄]Cov(b₀, b₁)
          [the remaining cross terms are 0 by the independence of ε₀ from ε₁, ε₂, ..., εₙ]
        = Var(b₀) + x₀²Var(b₁) + Var(ε₀) + 2x₀Cov(b₀, b₁)
        = σ²(1/n + x̄²/s_xx) + σ²x₀²/s_xx + σ² − 2x₀x̄σ²/s_xx
        = σ²[1 + 1/n + (x₀ − x̄)²/s_xx].
An estimate of the predictive variance is obtained by replacing σ² by σ̂² = MSE:
P̂V(p_a) = σ̂²[1 + 1/n + (x₀ − x̄)²/s_xx] = MSE[1 + 1/n + (x₀ − x̄)²/s_xx].
Prediction interval:
If σ² is known, then the distribution of
(p_a − y₀)/√PV(p_a)
is N(0, 1), so
P[−z_{α/2} ≤ (p_a − y₀)/√PV(p_a) ≤ z_{α/2}] = 1 − α,
which gives the prediction interval for y₀ as
[p_a − z_{α/2}√(σ²[1 + 1/n + (x₀ − x̄)²/s_xx]), p_a + z_{α/2}√(σ²[1 + 1/n + (x₀ − x̄)²/s_xx])].
When σ² is unknown, it is replaced by σ̂² = MSE, and
(p_a − y₀)/√(MSE[1 + 1/n + (x₀ − x̄)²/s_xx])
follows a t-distribution with (n − 2) degrees of freedom. The 100(1 − α)% prediction interval for y₀ in this case is
[p_a − t_{α/2, n−2}√(MSE[1 + 1/n + (x₀ − x̄)²/s_xx]), p_a + t_{α/2, n−2}√(MSE[1 + 1/n + (x₀ − x̄)²/s_xx])].
The prediction interval is of minimum width at x₀ = x̄ and widens as |x₀ − x̄| increases.
The prediction interval for p_a is wider than the prediction interval for p_m because the interval for p_a depends on both the error from the fitted model as well as the error associated with the future observation.
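A minimal sketch of both interval formulas (mean response and actual value, with σ² replaced by MSE and t(n − 2) critical points) is given below; the data are simulated for illustration, and numpy/scipy are assumed.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, alpha, x0 = 30, 0.05, 1.5
x = rng.uniform(0, 3, size=n)
y = 2.0 + 1.2 * x + rng.normal(scale=0.4, size=n)

sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = sxy / sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)

p = b0 + b1 * x0                          # p_m and p_a have the same point value
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
half_mean = t_crit * np.sqrt(mse * (1 / n + (x0 - x.mean()) ** 2 / sxx))
half_actual = t_crit * np.sqrt(mse * (1 + 1 / n + (x0 - x.mean()) ** 2 / sxx))
print((p - half_mean, p + half_mean))       # interval for E(y) at x0
print((p - half_actual, p + half_actual))   # wider interval for the actual y0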
Within sample prediction in multiple linear regression model
Consider the multiple regression model with k explanatory variables
y = Xβ + ε,
where y = (y₁, y₂, ..., yₙ)' is an n × 1 vector of n observations on the study variable, X is an n × k matrix of observations on the k explanatory variables, β is a k × 1 vector of regression coefficients and ε = (ε₁, ε₂, ..., εₙ)' is an n × 1 vector of random error components or disturbance terms following N(0, σ²Iₙ). If the intercept term is present, take the first column of X to be (1, 1, ..., 1)'.
Let the parameter β be estimated by its ordinary least squares estimator b = (X'X)⁻¹X'y. Then the predictor is p = Xb, which can be used for predicting both the actual and the average values of the study variable. This is the dual nature of the predictor.
which proves that the predictor p = Xb provides an unbiased prediction for the average value E(y) = Xβ.
The predictive variance of p is
PV_m(p) = E[{p − E(y)}'{p − E(y)}] = σ² tr[X(X'X)⁻¹X'] = kσ².
If σ² is known, then
P[−z_{α/2} ≤ (p − E(y))/√PV_m(p) ≤ z_{α/2}] = 1 − α,
which gives the prediction interval for E(y).
When σ² is unknown, it is replaced by σ̂² = MSE, and in this case the sampling distribution of
(p − E(y))/√P̂V_m(p)
is the t-distribution with (n − k) degrees of freedom, so that
P[−t_{α/2, n−k} ≤ (p − E(y))/√P̂V_m(p) ≤ t_{α/2, n−k}] = 1 − α,
which gives the prediction interval for E(y) as
[p − t_{α/2, n−k}√P̂V_m(p), p + t_{α/2, n−k}√P̂V_m(p)].
Comparing the performances of p in predicting the actual and the average values, we find that p is a better predictor of the average value than of the actual value when
PV_m(p) < PV_a(p), i.e., kσ² < (n − k)σ²,
or k < n − k, or 2k < n,
i.e., when the total number of observations is more than twice the number of explanatory variables.
Similarly, for predicting the actual value y, if σ² is known,
P[−z_{α/2} ≤ (p − y)/√PV_a(p) ≤ z_{α/2}] = 1 − α,
which gives the prediction interval for y. When σ² is unknown, it is replaced by σ̂² = MSE, and in this case the sampling distribution of
(p − y)/√P̂V_a(p)
follows a t-distribution with (n − k) degrees of freedom, and the interval uses t_{α/2, n−k} in place of z_{α/2}.
Further, suppose a set of n_f observations on the same k explanatory variables is also available, but the corresponding n_f observations on the study variable are not available. Assume that these observations follow the same model,
y_f = X_f β + ε_f,
where X_f is the n_f × k matrix of observations on the explanatory variables and ε_f is an n_f × 1 vector of disturbances following N(0, σ²I_{n_f}). It is also assumed that ε and ε_f are mutually independent.
We now consider the prediction of the y_f values for given X_f from model (2). This can be done by estimating the regression coefficients from model (1), based on the n observations, and using them in formulating the predictor in model (2). If ordinary least squares estimation is used to estimate β in model (1) as
b = (X'X)⁻¹X'y,
then the corresponding predictor is
p_f = X_f b = X_f(X'X)⁻¹X'y.
For the average value,
p_f − E(y_f) = X_f b − X_f β = X_f(b − β) = X_f(X'X)⁻¹X'ε.
Then
E[p_f − E(y_f)] = X_f(X'X)⁻¹X'E(ε) = 0.
Thus p_f provides an unbiased prediction for the average value.
The predictive covariance matrix for the average value is
Cov_m(p_f) = E[{p_f − E(y_f)}{p_f − E(y_f)}'] = σ²X_f(X'X)⁻¹X_f'.
If σ² is unknown, then replace σ² by σ̂² = MSE in the expressions of the predictive covariance matrix and predictive variance; their estimates are
Ĉov_m(p_f) = σ̂²X_f(X'X)⁻¹X_f',  P̂V_m(p_f) = σ̂² tr[(X'X)⁻¹X_f'X_f].
If σ² is known, the interval is
[p_f − z_{α/2}√PV_m(p_f), p_f + z_{α/2}√PV_m(p_f)].
When σ² is unknown, it is replaced by σ̂² = MSE, and in this case the sampling distribution of
(p_f − E(y_f))/√P̂V_m(p_f)
is the t-distribution with (n − k) degrees of freedom, so that
P[−t_{α/2, n−k} ≤ (p_f − E(y_f))/√P̂V_m(p_f) ≤ t_{α/2, n−k}] = 1 − α,
which gives the prediction interval for E(y_f) as
[p_f − t_{α/2, n−k}√P̂V_m(p_f), p_f + t_{α/2, n−k}√P̂V_m(p_f)].
For the actual value,
p_f − y_f = X_f b − X_f β − ε_f = X_f(b − β) − ε_f.
Then
E(p_f − y_f) = X_f E(b − β) − E(ε_f) = 0,
so p_f is also an unbiased predictor of y_f. The predictive covariance matrix and predictive variance are
Cov_a(p_f) = E[{p_f − y_f}{p_f − y_f}'] = σ²[X_f(X'X)⁻¹X_f' + I_{n_f}],
PV_a(p_f) = tr Cov_a(p_f) = σ²{tr[(X'X)⁻¹X_f'X_f] + n_f}.
The estimates of the covariance matrix and predictive variance can be obtained by replacing σ² by σ̂² = MSE as
Ĉov_a(p_f) = σ̂²[X_f(X'X)⁻¹X_f' + I_{n_f}],  P̂V_a(p_f) = σ̂²{tr[(X'X)⁻¹X_f'X_f] + n_f}.
If σ² is known,
P[−z_{α/2} ≤ (p_f − y_f)/√PV_a(p_f) ≤ z_{α/2}] = 1 − α,
which gives the prediction interval
[p_f − z_{α/2}√PV_a(p_f), p_f + z_{α/2}√PV_a(p_f)].
When σ² is unknown, it is replaced by σ̂² = MSE, and in this case the sampling distribution of
(p_f − y_f)/√P̂V_a(p_f)
follows a t-distribution with (n − k) degrees of freedom, so that
P[−t_{α/2, n−k} ≤ (p_f − y_f)/√P̂V_a(p_f) ≤ t_{α/2, n−k}] = 1 − α,
which gives the prediction interval for y_f as
[p_f − t_{α/2, n−k}√P̂V_a(p_f), p_f + t_{α/2, n−k}√P̂V_a(p_f)].
Simultaneous prediction of average and actual values of the study variable
The predictions are generally obtained either for the average values of the study variable or actual values
of the study variable. In many applications, it may not be appropriate to confine attention to only one of the two. It may be more appropriate in some situations to predict both values simultaneously, i.e., to consider the prediction of the actual and the average values of the study variable together. For example, suppose a firm deals with the sale of fertilizer to users. The interest of the company would be in predicting the average value of yield, which the company would like to use in showing that the average yield of the crop increases by using its fertilizer. On the other side, the user would not be interested in the average value; the user would like to know the actual increase in yield from using the fertilizer. Suppose both the seller and the user go for prediction through regression modelling. Using the classical tools, the statistician can predict either the actual value or the average value, which safeguards the interest of either the user or the seller, but not both. Instead, it is desirable to safeguard the interests of both by striking a balance between the objectives of the seller and the user. This can be achieved by combining the predictions of the actual and average values, which is done by formulating an objective or target function. Such a target function has to be flexible: it should allow assigning different weights to the two kinds of predictions depending upon their importance in a given application, and it should reduce to the individual cases of actual-value and average-value prediction.
Now we consider the simultaneous prediction in within and outside sample cases.
Consider the target function T = λy + (1 − λ)E(y) with weight 0 ≤ λ ≤ 1 and the predictor p = Xb. The variance is
Var(p) = E[{p − T}'{p − T}]
       = E[{X(b − β) − λε}'{X(b − β) − λε}]
       = E[ε'X(X'X)⁻¹X'ε − 2λε'X(X'X)⁻¹X'ε + λ²ε'ε]
       = E[(1 − 2λ)ε'X(X'X)⁻¹X'ε + λ²ε'ε]
       = σ²(1 − 2λ) tr[(X'X)⁻¹X'X] + λ²σ² tr Iₙ
       = σ²[(1 − 2λ)k + λ²n].
For the outside-sample case, consider
y_f = X_f β + ε_f;  E(ε_f) = 0, V(ε_f) = σ²I_{n_f},
where y_f is n_f × 1, X_f is n_f × k, β is k × 1 and ε_f is n_f × 1. With
b = (X'X)⁻¹X'y, p_f = X_f b,
the variance of p_f about the corresponding target T_f = λy_f + (1 − λ)E(y_f) is
Var(p_f) = E[{p_f − T_f}'{p_f − T_f}]
         = E[{X_f(b − β) − λε_f}'{X_f(b − β) − λε_f}]
         = E[ε'X(X'X)⁻¹X_f'X_f(X'X)⁻¹X'ε + λ²ε_f'ε_f − 2λε_f'X_f(X'X)⁻¹X'ε]
         = σ²{tr[X(X'X)⁻¹X_f'X_f(X'X)⁻¹X'] + λ²n_f}.
The usual linear regression model assumes that all the random error components are identically and
independently distributed with constant variance. When this assumption is violated, then ordinary least
squares estimator of the regression coefficient loses its property of minimum variance in the class of linear
and unbiased estimators. The violation of such an assumption can arise in any one of the following situations:
1. The variance of random error components is not constant.
2. The random error components are not independent.
3. The random error components do not have constant variance as well as they are not independent.
In such cases, the covariance matrix of random error components does not remain in the form of an identity
matrix but can be considered as any positive definite matrix. Under such assumption, the OLSE does not
remain efficient as in the case of an identity covariance matrix. The generalized or weighted least squares
method is used in such situations to estimate the parameters of the model.
In this method, the deviation between the observed and expected values of yᵢ is multiplied by a weight ωᵢ, chosen inversely proportional to the variance of yᵢ.
For a simple linear regression model, the weighted least squares function is
S(β₀, β₁) = Σᵢ₌₁ⁿ ωᵢ(yᵢ − β₀ − β₁xᵢ)².
The least-squares normal equations are obtained by differentiating S(β₀, β₁) with respect to β₀ and β₁ and equating them to zero. The solution of these two normal equations gives the weighted least squares estimates of β₀ and β₁.
Generalized least squares estimation
Suppose that in the usual multiple regression model
y = Xβ + ε with E(ε) = 0, V(ε) = σ²I,
the assumption V(ε) = σ²I is violated and instead
V(ε) = σ²Ω,
where Ω is a known n × n nonsingular, positive definite and symmetric matrix.
The OLSE of β is
b = (X'X)⁻¹X'y.
In such cases, the OLSE remains unbiased but has more variability, since
E(b) = (X'X)⁻¹X'E(y) = (X'X)⁻¹X'Xβ = β,
V(b) = (X'X)⁻¹X'V(y)X(X'X)⁻¹ = σ²(X'X)⁻¹X'ΩX(X'X)⁻¹.
Now we attempt to find a better estimator as follows:
Since Ω is positive definite and symmetric, there exists a nonsingular matrix K such that
KK' = Ω.
Then, premultiplying the model y = Xβ + ε by K⁻¹ gives the transformed model
z = Bβ + g, with z = K⁻¹y, B = K⁻¹X, g = K⁻¹ε.
Here
E(g) = K⁻¹E(ε) = 0
and
V(g) = E[{g − E(g)}{g − E(g)}']
     = E(gg')
     = E[K⁻¹εε'K'⁻¹]
     = K⁻¹E(εε')K'⁻¹
     = σ²K⁻¹ΩK'⁻¹
     = σ²K⁻¹KK'K'⁻¹
     = σ²I.
Thus the elements of g have mean 0, constant variance and are uncorrelated.
So either minimize
S(β) = g'g = ε'Ω⁻¹ε = (y − Xβ)'Ω⁻¹(y − Xβ)
and get the normal equations
(X'Ω⁻¹X)β̂ = X'Ω⁻¹y, i.e., β̂ = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y,
or apply ordinary least squares to the transformed model to obtain
β̂ = (B'B)⁻¹B'z = (X'K'⁻¹K⁻¹X)⁻¹X'K'⁻¹K⁻¹y = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y.
This is termed as the generalized least squares estimator (GLSE) of β.
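The following minimal sketch (illustrative heteroskedastic data, not from the notes) computes the GLSE both directly and through the transformed model z = K⁻¹y, B = K⁻¹X with KK' = Ω, and checks that the two routes agree; numpy is assumed.

import numpy as np

rng = np.random.default_rng(6)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
omega = rng.uniform(0.5, 4.0, size=n)          # known error variances up to sigma^2
Omega = np.diag(omega)
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=np.sqrt(omega))

Omega_inv = np.diag(1.0 / omega)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

K = np.diag(np.sqrt(omega))                    # KK' = Omega
B, z = np.linalg.inv(K) @ X, np.linalg.inv(K) @ y
beta_via_transform = np.linalg.lstsq(B, z, rcond=None)[0]

print(np.allclose(beta_gls, beta_via_transform))   # True: both give the GLSE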
In the transformed model,
β̂ = (B'B)⁻¹B'z = (B'B)⁻¹B'(Bβ + g) = β + (B'B)⁻¹B'g,
or β̂ − β = (B'B)⁻¹B'g.
Then
E(β̂ − β) = (B'B)⁻¹B'E(g) = 0,
which shows that the GLSE is an unbiased estimator of β. The covariance matrix of the GLSE is given by
V(β̂) = E[{β̂ − E(β̂)}{β̂ − E(β̂)}']
      = E[(B'B)⁻¹B'gg'B(B'B)⁻¹]
      = (B'B)⁻¹B'E(gg')B(B'B)⁻¹
      = σ²(B'B)⁻¹ = σ²(X'Ω⁻¹X)⁻¹.
Moreover, the GLSE β̂ minimizes the variance of any linear combination ℓ'β̂ of the estimated coefficients among linear unbiased estimators. We note the following:
Let β̃ be another unbiased estimator of β that is a linear combination of the data. Our goal, then, is to show that Var(ℓ'β̃) ≥ σ²ℓ'(X'Ω⁻¹X)⁻¹ℓ for every ℓ, with strict inequality for at least one ℓ.
We first note that we can write any other estimator of β that is a linear combination of the data as
β̃ = [(X'Ω⁻¹X)⁻¹X'Ω⁻¹ + C]y + c₀,
where C is a k × n matrix and c₀ is a k × 1 vector of constants that appropriately adjust the GLS estimator. Then
E(β̃) = [(X'Ω⁻¹X)⁻¹X'Ω⁻¹ + C]E(y) + c₀
      = [(X'Ω⁻¹X)⁻¹X'Ω⁻¹ + C]Xβ + c₀
      = β + CXβ + c₀.
Consequently, β̃ is unbiased if and only if both c₀ = 0 and CX = 0. The covariance matrix of β̃ is then
V(β̃) = Var([(X'Ω⁻¹X)⁻¹X'Ω⁻¹ + C]y)
      = [(X'Ω⁻¹X)⁻¹X'Ω⁻¹ + C] V(y) [(X'Ω⁻¹X)⁻¹X'Ω⁻¹ + C]'
      = σ²[(X'Ω⁻¹X)⁻¹X'Ω⁻¹ + C] Ω [(X'Ω⁻¹X)⁻¹X'Ω⁻¹ + C]'
      = σ²[(X'Ω⁻¹X)⁻¹ + CΩC']   (using CX = 0),
so that
Var(ℓ'β̃) = σ²ℓ'(X'Ω⁻¹X)⁻¹ℓ + σ²ℓ'CΩC'ℓ,
and the second term is strictly greater than 0 for some ℓ ≠ 0 unless C = 0. Thus the GLS estimator of β is the best linear unbiased estimator.
Weighted least squares estimation
When the εᵢ's are uncorrelated but have unequal variances, the covariance matrix of ε is
V(ε) = σ²Ω = σ² diag(1/ω₁, 1/ω₂, ..., 1/ωₙ),
a diagonal matrix. The estimation procedure in this case is usually called weighted least squares.
Let W = Ω⁻¹ = diag(ω₁, ω₂, ..., ωₙ). Then the weighted least squares estimator of β is obtained by solving the normal equations
(X'WX)β̂ = X'Wy, i.e., β̂ = (X'WX)⁻¹X'Wy.
The observations with large variances get smaller weights than the observations with small variances.
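A short sketch of weighted least squares for the simple model y = β₀ + β₁x + ε, with weights wᵢ proportional to 1/Var(εᵢ), is given below (illustrative data; numpy assumed). This is just the GLSE with Ω⁻¹ = W = diag(wᵢ).

import numpy as np

rng = np.random.default_rng(7)
n = 50
x = rng.uniform(0, 5, size=n)
var_i = 0.2 + 0.5 * x ** 2                  # non-constant error variance
y = 1.0 + 0.8 * x + rng.normal(scale=np.sqrt(var_i))

w = 1.0 / var_i                             # large-variance observations get small weight
X = np.column_stack([np.ones(n), x])
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # solves (X'WX) beta = X'Wy
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_wls, beta_ols)                   # both unbiased; WLS has smaller variance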
Chapter 6
Regression Analysis Under Linear Restrictions and Preliminary Test Estimation
One of the basic objectives in any statistical modeling is to find good estimators of the parameters. In the
context of the multiple linear regression model y = Xβ + ε, the ordinary least squares estimator
b = (X'X)⁻¹X'y is the best linear unbiased estimator of β. Several approaches have been attempted in the
literature to improve further the OLSE. One approach to improve the estimators is the use of extraneous
information or prior information. In applied work, such prior information may be available about the
regression coefficients. For example, in economics, the constant returns to scale imply that the exponents
in a Cobb-Douglas production function should sum to unity. In another example, absence of money illusion
on the part of consumers implies that the sum of money income and price elasticities in a demand function
should be zero. These types of constraints or the prior information may be available from
(i) some theoretical considerations.
(ii) past experience of the experimenter.
(iii) empirical investigations.
(iv) some extraneous sources etc.
To utilize such information in improving the estimation of regression coefficients, it can be expressed in the
form of
(i) exact linear restrictions
(ii) stochastic linear restrictions
(iii) inequality restrictions.
We consider the use of prior information in the form of exact and stochastic linear restrictions in the model
We consider the use of prior information in the form of exact and stochastic linear restrictions in the model
y = Xβ + ε, where y is an (n × 1) vector of observations on the study variable, X is an (n × k) matrix of
observations on the explanatory variables X₁, X₂, ..., X_k, β is a (k × 1) vector of regression coefficients and ε is an (n × 1) vector of disturbance terms.
Exact linear restrictions:
Suppose the prior information binding the regression coefficients is available from some extraneous sources
which can be expressed in the form of exact linear restrictions as
r = Rβ,
where r is a (q × 1) vector and R is a (q × k) matrix with rank(R) = q (q < k). The elements in r and R are known.
(i) Suppose the two restrictions are β₂ − β₄ = 0 and β₃ + 2β₄ + β₅ = 1; then
r = (0, 1)', R = [0 1 0 −1 0 0 0; 0 0 1 2 1 0 0].
(ii) If there is a single restriction β₂ = 3 in a model with k = 3, then
r = [3], R = [0 1 0].
(iii) If k = 3 and suppose β₁ : β₂ : β₃ :: ab : b : 1, then
r = (0, 0, 0)', R = [1 −a 0; 0 1 −b; 1 0 −ab].
The ordinary least squares estimator b = (X'X)⁻¹X'y does not use the prior information. It does not obey the restrictions in the sense that, in general, r ≠ Rb. So the issue is how to use the sample information and the prior information together in finding an improved estimator of β.
Restricted least squares estimation
The restricted least squares estimation method enables the use of sample information and prior information
simultaneously. In this method, choose β such that the error sum of squares is minimized subject to linear
restrictions r = Rβ . This can be achieved using the Lagrangian multiplier technique. Define the Lagrangian
function
S(β, λ) = (y − Xβ)'(y − Xβ) − 2λ'(Rβ − r),
where λ is a (q × 1) vector of Lagrangian multipliers.
Using the result that if a and b are vectors and A is a suitably defined matrix, then
∂(a'Aa)/∂a = (A + A')a,  ∂(a'b)/∂a = b,
we have
∂S(β, λ)/∂β = 2X'Xβ − 2X'y − 2R'λ = 0   (*)
∂S(β, λ)/∂λ = Rβ − r = 0.
Pre-multiplying equation (*) by R(X'X)⁻¹, and using Rβ = r and b = (X'X)⁻¹X'y, we obtain
λ = [R(X'X)⁻¹R']⁻¹(r − Rb).
Substituting this back into (*) gives
β̂_R = (X'X)⁻¹X'y + (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹(r − Rb)
    = b − (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹(Rb − r).
This estimator is termed as the restricted regression estimator of β.
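A minimal numerical sketch of this estimator (illustrative data and restriction, not from the notes) is given below, together with a check of the defining property Rβ̂_R = r; numpy is assumed.

import numpy as np

rng = np.random.default_rng(8)
n, k = 60, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.7, 0.7, -0.2]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y

R = np.array([[0.0, 1.0, -1.0, 0.0]])      # restriction beta_2 = beta_3
r = np.array([0.0])

adjust = XtX_inv @ R.T @ np.linalg.inv(R @ XtX_inv @ R.T) @ (R @ b - r)
beta_R = b - adjust                        # restricted regression estimator
print(np.allclose(R @ beta_R, r))          # True: the restriction is satisfied exactly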
Properties of the restricted regression estimator
1. The restricted regression estimator β̂_R obeys the exact restrictions, i.e., r = Rβ̂_R. To verify this, consider
Rβ̂_R = R[b + (X'X)⁻¹R'{R(X'X)⁻¹R'}⁻¹(r − Rb)] = Rb + (r − Rb) = r.
2. Unbiasedness
The estimation error of β̂_R is
β̂_R − β = D(b − β),
where
D = I − (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R.
Thus
E(β̂_R − β) = D E(b − β) = 0,
so β̂_R is an unbiased estimator of β.
3. Covariance matrix
The covariance matrix of β̂_R is
V(β̂_R) = E[(β̂_R − β)(β̂_R − β)']
        = D E[(b − β)(b − β)'] D'
        = D V(b) D'
        = σ²D(X'X)⁻¹D'
        = σ²[(X'X)⁻¹ − (X'X)⁻¹R'{R(X'X)⁻¹R'}⁻¹R(X'X)⁻¹],
which can be obtained as follows:
Consider
D(X'X)⁻¹ = (X'X)⁻¹ − (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹R(X'X)⁻¹,
so that
D(X'X)⁻¹D' = [(X'X)⁻¹ − (X'X)⁻¹R'{R(X'X)⁻¹R'}⁻¹R(X'X)⁻¹][I − R'{R(X'X)⁻¹R'}⁻¹R(X'X)⁻¹]
= (X'X)⁻¹ − (X'X)⁻¹R'{R(X'X)⁻¹R'}⁻¹R(X'X)⁻¹ − (X'X)⁻¹R'{R(X'X)⁻¹R'}⁻¹R(X'X)⁻¹
  + (X'X)⁻¹R'{R(X'X)⁻¹R'}⁻¹R(X'X)⁻¹R'{R(X'X)⁻¹R'}⁻¹R(X'X)⁻¹
= (X'X)⁻¹ − (X'X)⁻¹R'{R(X'X)⁻¹R'}⁻¹R(X'X)⁻¹.
where λ is a (q × 1) vector of Lagrangian multipliers. The normal equations are obtained by partially differentiating the log-likelihood function with respect to β, σ² and λ and equating them to zero:
∂ln L(β, σ², λ)/∂β = (1/σ²)(X'y − X'Xβ) + 2R'λ = 0   (1)
∂ln L(β, σ², λ)/∂λ = 2(Rβ − r) = 0   (2)
∂ln L(β, σ², λ)/∂σ² = −n/σ² + (y − Xβ)'(y − Xβ)/σ⁴ = 0.   (3)
Let β̃_R, σ̃²_R and λ̃ denote the maximum likelihood estimators of β, σ² and λ respectively, which are obtained by solving equations (1), (2) and (3) as follows:
λ̃ = [R(X'X)⁻¹R']⁻¹(r − Rβ̃)/σ̃²_R,
β̃_R = β̃ + (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹(r − Rβ̃),
where β̃ = (X'X)⁻¹X'y is the maximum likelihood estimator of β without restrictions. From equation (3), we get
σ̃²_R = (y − Xβ̃_R)'(y − Xβ̃_R)/n.
The Hessian matrix of second-order partial derivatives with respect to β and σ² is positive definite at β = β̃_R and σ² = σ̃²_R.
The restricted least squares and restricted maximum likelihood estimators of β are the same, whereas they differ for σ².
Test of hypothesis
It is important to test the hypothesis
H 0 : r = Rβ
H1 : r ≠ R β
before using it in the estimation procedure.
The construction of the test statistic for this hypothesis is detailed in the module on multiple linear regression
model. The resulting test statistic is
F = {(r − Rb)'[R(X'X)⁻¹R']⁻¹(r − Rb)/q} / {(y − Xb)'(y − Xb)/(n − k)},
which follows an F-distribution with q and (n − k) degrees of freedom under H₀. The decision rule is to reject H₀ whenever
F ≥ F_{1−α}(q, n − k).
Stochastic linear restrictions:
The exact linear restrictions assume that there is no randomness involved in the auxiliary or prior
information. This assumption may not hold true in many practical situations and some randomness may be
present. The prior information in such cases can be formulated as
r = Rβ + V,
where r is a (q × 1) vector, R is a (q × k) matrix and V is a (q × 1) vector of random errors. The elements in r and R are known. The term V reflects the randomness involved in the prior information r = Rβ.
Assume
E(V) = 0, E(VV') = ψ, E(εV') = 0,
where ψ is a known (q × q) positive definite matrix and ε is the disturbance term in the multiple regression model y = Xβ + ε.
Note that E(r) = Rβ.
The possible reasons for such stochastic linear restrictions are as follows:
(i) The prior information may be available from earlier studies in the form of an unbiased estimate together with its standard error. For example, in repetitive studies where surveys are conducted every year, suppose the regression coefficient β₁ remains stable for several years and its estimate is reported along with its standard error, say a value around 0.5 with standard error 2. This information can be expressed as
r = β₁ + V₁,
where r = 0.5, E(V₁) = 0, E(V₁²) = 2² = 4.
Now ψ can be formulated with this data. It is not necessary that we have such information for all the regression coefficients; it may be available for only some of them.
(ii) Sometimes the restrictions are in the form of inequality. Such restrictions may arise from
theoretical considerations. For example, the value of a regression coefficient may lie between 3
and 5, i.e., 3 ≤ β1 ≤ 5, say. In another example, consider a simple linear regression model
y =β 0 + β1 x + ε
where y denotes the consumption expenditure on food and x denotes the income. Then the marginal propensity (tendency) to consume is
dy/dx = β₁,
i.e., if the salary increases by one rupee, then one is expected to spend β₁ of that rupee on food and save (1 − β₁) of it. We may put a bound on β₁ reflecting that one can neither spend more than the whole rupee nor spend nothing of it, so 0 < β₁ < 1. This is a natural restriction arising from theoretical considerations.
These bounds can be treated as p-sigma limits, say 2-sigma limits or confidence limits. Thus
μ − 2σ = 0, μ + 2σ = 1 ⟹ μ = 1/2, σ = 1/4.
These values can be interpreted as
1/2 = β₁ + V₁, E(V₁) = 0, E(V₁²) = 1/16.
(iii) Sometimes the truthfulness of exact linear restriction r = Rβ can be suspected and accordingly
an element of uncertainty can be introduced. For example, one may say that 95% of the
restrictions hold. So some element of uncertainty prevails.
The ordinary least squares estimator of β is b = (X'X)⁻¹X'y, which is termed here the pure estimator. The pure estimator b does not satisfy the restriction r = Rβ + V. So the objective is to obtain an estimate of β by utilizing the stochastic restrictions such that the resulting estimator
satisfies the stochastic restrictions also. In order to avoid the conflict between prior information and sample
information, we can combine them as follows:
Write
y = Xβ + ε,  E(ε) = 0, E(εε') = σ²Iₙ,
r = Rβ + V,  E(V) = 0, E(VV') = ψ, E(εV') = 0,
jointly as
[y; r] = [X; R]β + [ε; V],
or
a = Aβ + w,
where a = (y', r')', A = (X', R')', w = (ε', V')'.
Note that
E(w) = (E(ε)', E(V)')' = 0,
Ω = E(ww') = E[εε', εV'; Vε', VV'] = [σ²Iₙ, 0; 0, ψ].
This shows that the disturbances w are non-spherical, i.e., heteroskedastic. So the application of generalized least squares estimation will yield a more efficient estimator than ordinary least squares estimation. Applying generalized least squares to the model
a = Aβ + w,  E(w) = 0, V(w) = Ω,
the generalized least squares estimator of β is given by
β̂_M = (A'Ω⁻¹A)⁻¹A'Ω⁻¹a.
Here
A'Ω⁻¹a = [X'  R'] [(1/σ²)Iₙ, 0; 0, Ψ⁻¹] [y; r] = (1/σ²)X'y + R'Ψ⁻¹r,
A'Ω⁻¹A = [X'  R'] [(1/σ²)Iₙ, 0; 0, Ψ⁻¹] [X; R] = (1/σ²)X'X + R'Ψ⁻¹R.
Thus
β̂_M = [(1/σ²)X'X + R'Ψ⁻¹R]⁻¹[(1/σ²)X'y + R'Ψ⁻¹r],
which is termed as the mixed regression estimator of β (σ² is assumed known at this stage).
(i) Unbiasedness
β̂_M − β = (A'Ω⁻¹A)⁻¹A'Ω⁻¹a − β = (A'Ω⁻¹A)⁻¹A'Ω⁻¹(Aβ + w) − β = (A'Ω⁻¹A)⁻¹A'Ω⁻¹w,
so
E(β̂_M − β) = (A'Ω⁻¹A)⁻¹A'Ω⁻¹E(w) = 0.
So the mixed regression estimator provides an unbiased estimator of β. Note that the pure regression estimator b is also an unbiased estimator of β.
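A short sketch of the mixed regression estimator is given below, treating σ² as known for illustration (its feasible version replaces σ² by s²); the data, R, r and Ψ are hypothetical, and numpy is assumed.

import numpy as np

rng = np.random.default_rng(9)
n, k, sigma2 = 50, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 0.5, 0.5])
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

# stochastic prior information r = R beta + V with E(VV') = Psi
R = np.array([[0.0, 1.0, 0.0]])
Psi = np.array([[0.04]])                   # prior "standard error" 0.2 for beta_2
r = R @ beta + rng.normal(scale=0.2, size=1)

A_left = X.T @ X / sigma2 + R.T @ np.linalg.inv(Psi) @ R
A_right = X.T @ y / sigma2 + R.T @ np.linalg.inv(Psi) @ r
beta_M = np.linalg.solve(A_left, A_right)  # mixed regression estimator
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_M, b_ols)                       # beta_M is pulled towards the prior information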
(ii) Covariance matrix
The covariance matrix of β̂_M is
V(β̂_M) = E[(β̂_M − β)(β̂_M − β)']
        = (A'Ω⁻¹A)⁻¹
        = [(1/σ²)X'X + R'Ψ⁻¹R]⁻¹.
(iii) The estimator β̂_M satisfies the stochastic linear restrictions in the sense that, writing r = Rβ̂_M + V,
E(r) = R E(β̂_M) + E(V) = Rβ + 0 = Rβ.
Result: The difference of matrices (A₁ − A₂) is positive definite if (A₂⁻¹ − A₁⁻¹) is positive definite.
Let
A₁ ≡ V(b) = σ²(X'X)⁻¹,
A₂ ≡ V(β̂_M) = [(1/σ²)X'X + R'Ψ⁻¹R]⁻¹.
Then
A₂⁻¹ − A₁⁻¹ = [(1/σ²)X'X + R'Ψ⁻¹R] − (1/σ²)X'X = R'Ψ⁻¹R,
which is a positive definite matrix when rank(R) = q. This implies that
A₁ − A₂ = V(b) − V(β̂_M)
is positive definite. Thus β̂_M is more efficient than b under the criterion of covariance matrices (equivalently, mean squared error matrices, as both estimators are unbiased),
assuming σ² is known and b = (X'X)⁻¹X'y. The compatibility of the prior and sample information can be examined through a statistic based on (r − Rb); this statistic follows a χ²-distribution with q degrees of freedom.
If Ψ = 0, then the distribution is degenerate and r becomes a fixed quantity. For the feasible version of the mixed regression estimator,
β̂_f = [(1/s²)X'X + R'Ψ⁻¹R]⁻¹[(1/s²)X'y + R'Ψ⁻¹r],
the optimal properties of the mixed regression estimator, like linearity, unbiasedness and/or minimum variance, do not remain valid. So there can be situations when the incorporation of prior information may lead to a loss in efficiency. This is not a favourable situation. Under such situations, the pure regression estimator is better to use. In order to know whether the use of prior information will lead to a better estimator or not, the null hypothesis H₀ : E(r) = Rβ can be tested.
The test statistic is
F = {(r − Rb)'[R(X'X)⁻¹R' + Ψ]⁻¹(r − Rb)/q} / s²,
where s² = (y − Xb)'(y − Xb)/(n − k), and F follows an F-distribution with q and (n − k) degrees of freedom under H₀.
Inequality Restrictions
Sometimes the restriction on the regression parameters or equivalently the prior information about the
regression parameters is available in the form of inequalities, for example lower and upper bounds on individual coefficients. Suppose such information is expressible in the form of linear inequality constraints; we then want to estimate the regression coefficients in the model subject to these constraints.
Another option to obtain an estimator of β subject to inequality constraints is to convert the inequality constraints into stochastic linear restrictions (e.g., via p-sigma limits, as above) and use the framework of mixed regression estimation.
The minimax estimation can also be used to obtain an estimator of β under inequality constraints. The minimax estimation is based on the idea that the quadratic risk function for the estimate is not minimized over the entire parameter space but only over an area that is restricted by the prior knowledge or restrictions in relation to the estimate.
If all the restrictions define a convex area, this area can be enclosed in an ellipsoid of the form
B(β) = {β : β'Tβ ≤ k}
with the origin as center point, or in
B(β, β₀) = {β : (β − β₀)'T(β − β₀) ≤ k}
with center-point vector β₀, where k is a given constant and T is a known matrix which is assumed to be positive definite. Here B defines a concentration ellipsoid.
First we consider an example to understand how the inequality constraints are framed. Suppose it is known a
priori that
aᵢ ≤ βᵢ ≤ bᵢ (i = 1, 2, ..., n),
where the aᵢ and bᵢ are known and may include −∞ and ∞. These restrictions can be written as
|βᵢ − (aᵢ + bᵢ)/2| / [½(bᵢ − aᵢ)] ≤ 1, i = 1, 2, ..., n.
The enclosing ellipsoid is chosen so that (iii) the corner points of the cuboid lie on the surface of the ellipsoid, which means we have
Σᵢ₌₁ᵖ ((aᵢ − bᵢ)/2)² tᵢ = 1.
We now include the linear restriction (iii) for the tᵢ by means of a Lagrangian multiplier λ and solve (with tᵢ > 0)
min over {tᵢ} of V = ∏ᵢ₌₁ᵖ tᵢ⁻¹ − λ[Σᵢ₌₁ᵖ ((aᵢ − bᵢ)/2)² tᵢ − 1].
The first-order conditions are
∂V/∂tⱼ = −tⱼ⁻² ∏_{i≠j} tᵢ⁻¹ − λ((aⱼ − bⱼ)/2)² = 0,
∂V/∂λ = Σⱼ ((aⱼ − bⱼ)/2)² tⱼ − 1 = 0.
From ∂V/∂tⱼ = 0, we get
λ = −tⱼ⁻² ∏_{i≠j} tᵢ⁻¹ (2/(aⱼ − bⱼ))²   (for all j = 1, 2, ..., p)
  = −tⱼ⁻¹ ∏ᵢ₌₁ᵖ tᵢ⁻¹ (2/(aⱼ − bⱼ))²,
and for any two indices i and j we obtain
((aᵢ − bᵢ)/2)² tᵢ = ((aⱼ − bⱼ)/2)² tⱼ.
Hence, after summation according to ∂V/∂λ = 0,
Σᵢ₌₁ᵖ ((aᵢ − bᵢ)/2)² tᵢ = p ((aⱼ − bⱼ)/2)² tⱼ = 1.
This leads to the required diagonal elements of T:
tⱼ = (4/p)(aⱼ − bⱼ)⁻²   (j = 1, 2, ..., p).
Hence, the optimal ellipsoid (β − β₀)'T(β − β₀) = 1, which contains the cuboid, has the center-point vector
β₀' = ½(a₁ + b₁, ..., a_p + b_p)
and the following matrix, which is positive definite for finite limits aᵢ, bᵢ:
T = (4/p) diag((b₁ − a₁)⁻², ..., (b_p − a_p)⁻²).
Interpretation: The ellipsoid has a larger volume than the cuboid. Hence, the transition to an ellipsoid as a
priori information represents a weakening, but comes with an easier mathematical handling.
Example (two real regressors): The center-point equation of the ellipsoid is
x²/a² + y²/b² = 1,
or
(x, y) diag(1/a², 1/b²) (x, y)' = 1,
with T = diag(1/a², 1/b²) = diag(t₁, t₂). Let B(β) = {β : β'Tβ ≤ k}
be a convex region of a priori restrictions for β. The criterion of the minimax estimator leads to the following.
Definition: An estimator b* is called a minimax estimator of β if
min over β̂ of sup_{β∈B} R(β̂, β, A) = sup_{β∈B} R(b*, β, A).
An explicit solution can be achieved if the weight matrix is of the form A = aa ' of rank 1.
Result: In the model y = Xβ + ε, with the restriction β'Tβ ≤ k, k > 0, and the risk function R(β̂, β, a), the linear minimax estimator is of the following form:
b* = D*⁻¹X'y, with D* = S + k⁻¹σ²T and S = X'X,
with the bias vector and covariance matrix given by
Bias(b*, β) = −k⁻¹σ²D*⁻¹Tβ,
V(b*) = σ²D*⁻¹SD*⁻¹.
Result: If the restrictions are (β − β₀)'T(β − β₀) ≤ k with center point β₀ ≠ 0, the linear minimax estimator is of the following form:
b*(β₀) = β₀ + D*⁻¹X'(y − Xβ₀).
Interpretation: A change of the center point of the a priori ellipsoid has an influence only on the estimator itself and its bias. The minimax estimator is not operational because σ² is unknown. The smaller the value of k, the stricter is the a priori restriction for fixed T. Analogously, the larger the value of k, the smaller is the influence of the restriction on the minimax estimator. For the borderline case we have
B(β) = {β : β'Tβ ≤ k} → the whole parameter space as k → ∞,
and
lim_{k→∞} b* → b = (X'X)⁻¹X'y.
Comparison of b* and b:
Minimax risk: Since the OLS estimator is unbiased, its minimax risk is
sup_{β'Tβ ≤ k} R(b, ·, a) = σ²a'S⁻¹a.
The linear minimax estimator b* has a smaller minimax risk than the OLS estimator, and
R(b, ·, a) − sup_{β'Tβ ≤ k} R(b*, β, a) = σ²a'[S⁻¹ − (k⁻¹σ²T + S)⁻¹]a ≥ 0,
since S⁻¹ − (k⁻¹σ²T + S)⁻¹ ≥ 0.
MSE matrix: The mean squared error matrix of b* is
M(b*, β) = V(b*) + Bias(b*, β)Bias(b*, β)' = σ²D*⁻¹(S + k⁻²σ²Tββ'T')D*⁻¹,
and the comparison with the OLS estimator involves the condition
k⁻¹ ≤ 2/(β'β).
the information is incorrect, i.e., r ≠ Rβ, then the OLSE is better than the RRE. The truthfulness of the prior information, i.e., whether r = Rβ or r ≠ Rβ, is tested through the null hypothesis H₀ : r = Rβ using the F-statistic.
• If H₀ is accepted at the α level of significance, then we conclude that r = Rβ, and in such a situation the RRE is better than the OLSE.
• On the other hand, if H₀ is rejected at the α level of significance, then we conclude that r ≠ Rβ, and the OLSE is better than the RRE under such situations.
So when the exact content of the true sampling model is unknown, the statistical model to be used is determined by a preliminary test of hypothesis using the available sample data. Such procedures are completed in two stages and are based on a test of hypothesis which provides a rule for choosing between the estimator based on the sample data alone and the estimator that is consistent with the hypothesis. This requires a test of the compatibility of the OLSE (or maximum likelihood estimator), based on sample information only, with the RRE based on the linear hypothesis. One can then make a choice of estimator depending upon the outcome: consequently, one chooses either the OLSE or the RRE. Note that under normality of the random errors, the equivalent choice is made between the maximum likelihood estimator of β and the restricted maximum likelihood estimator of β, which have the same forms as the OLSE and the RRE, respectively. So essentially a pre-test of the hypothesis H₀ : r = Rβ is done and, based on that, a suitable estimator is chosen. This is called the pre-test procedure, which generates the pre-test estimator; it provides a rule to choose between the restricted and unrestricted estimators.
One can also understand the philosophy behind preliminary test estimation as follows. Consider the problem of an investigator who has a single data set and wants to estimate the parameters of a linear model that are known to lie in a high-dimensional parametric space. However, prior information about the parameters is available, and it suggests that the relationship may be characterized by a lower-dimensional parametric space. Under such uncertainty, if the high-dimensional parametric space is used and estimated by OLSE, the result from the over-specified model will be unbiased but will have a larger variance. Alternatively, the lower-dimensional parametric space may incorrectly specify the statistical model and, if estimated by OLSE, will give biased estimates. The bias may or may not outweigh the reduction in variance. If such uncertainty is represented in the form of a general linear hypothesis, this leads to pre-test estimators.
Let us consider the conventional pre-test estimator under the model y = Xβ + ε with the usual assumptions and the general linear hypothesis H₀ : Rβ = r, which can be tested by the F statistic. The null hypothesis is rejected at the α level of significance when
λ = F_calculated ≥ F_{α, p, n−p} = c,
where the critical value c is determined for a given level α of the test by
P[F_{p, n−p} ≥ c] = ∫_c^∞ dF_{p, n−p} = α.
• If H₀ is true, meaning thereby that the prior information is correct, then use RRE to estimate β.
• If H₀ is false, meaning thereby that the prior information is incorrect, then use OLSE to estimate β.
Thus the estimator to be used depends on the preliminary test of significance and is of the form
β̂_PT = I_(0,c)(λ) β̂_R + I_[c,∞)(λ) b,
where I_A(λ) denotes the indicator function of the set A. Note that the choices c = 0 and c = ∞ correspond to always rejecting and never rejecting H₀, i.e., the probability of type-I error (rejecting H₀ when it is true) is 1 and 0 respectively, so that the entire area under the sampling distribution becomes either the area of rejection or the area of acceptance of the null hypothesis. Thus the choice of c (equivalently of α) has a crucial role to play in determining the sampling
performance of the pre-test estimators. Therefore in a repeated sampling context, the data, the linear
hypothesis, and the level of significance all determine the combination of the two estimators that are chosen
on the average. The level of significance has an impact on the outcome of pretest estimator in the sense of
determining the proportion of the time each estimator is used and in determining the sampling performance
of pretest estimator.
We use the following results to derive the bias and risk of the pretest estimator.
Result 1: If the random vector Z is distributed as a multivariate normal random vector with mean δ and covariance matrix σ²I_K, and is independent of a χ²(n − K) variable, then
E[ I_(0,c)( (Z'Z/K)/(σ²χ²(n − K)/(n − K)) ) · Z ] = δ · P[ χ²(K + 2, λ)/χ²(n − K) ≤ c* ],
where λ = δ'δ/(2σ²) and c* is the correspondingly rescaled critical value.
Result 2: If the random vector Z is distributed as a multivariate normal random vector with mean δ and covariance matrix σ²I_K, a corresponding expression holds for the quadratic form Z'Z.
Using these results, we can find the bias and risk of β̂_PT as follows.
Bias:
E(β̂_PT) = E(b) − E[ I_(0,c)(u)(b − r) ]
         = β − δ P[ (χ²(p + 2, δ'δ/2σ²)/p) / (χ²(n − p)/(n − p)) ≤ c ]
         = β − δ P[ F(p + 2, n − p, δ'δ/2σ²) ≤ c ],
where F(p + 2, n − p, δ'δ/2σ²) denotes the non-central F-distribution with noncentrality parameter δ'δ/2σ². Thus, if δ = 0, the pretest estimator is unbiased. Note that the size of the bias is affected by the probability of a random variable with a non-central F-distribution being less than a constant that is determined by the level of the test, the number of hypotheses and the degree of hypothesis error δ'δ. Since this probability is always less than or equal to one, the bias remains bounded.
Risk:
The risk of the pretest estimator is obtained as
ρ(β, β̂_PT) = E[(β̂_PT − β)'(β̂_PT − β)]
            = E[{(b − β) − I_(0,c)(u)(b − r)}'{(b − β) − I_(0,c)(u)(b − r)}]
            = E[(b − β)'(b − β)] − E[I_(0,c)(u)(b − β)'(b − β)] + E[I_(0,c)(u)]δ'δ
            = σ²p + (2δ'δ − σ²p) P[ χ²(p + 2, δ'δ/2σ²)/χ²(n − p) ≤ cp/(n − p) ]
              − δ'δ P[ χ²(p + 4, δ'δ/2σ²)/χ²(n − p) ≤ cp/(n − p) ],
or compactly,
ρ(β, β̂_PT) = σ²p + (2δ'δ − σ²p) l(2) − δ'δ l(4),
where
l(i) = P[ χ²(p + i, δ'δ/2σ²)/χ²(n − p) ≤ cp/(n − p) ], i = 2, 4.
3. As the hypothesis error δ'δ grows, the risk of the pretest estimator increases, attains a maximum after crossing the risk of the least squares estimator, and then monotonically decreases to approach the risk of the OLSE.
4. The pretest-estimator risk function, defined on the hypothesis-error parameter space, crosses the risk function of the least squares estimator within certain bounds of δ'δ/(2σ²).
The sampling characteristics of the preliminary test estimator are summarized in Figure 1.
From these results, we see that the pretest estimator does well relative to the OLSE if the hypothesis is correctly specified. However, over the space representing the range of hypothesis errors, the pretest estimator is inferior to the least squares estimator over an infinite range of the parameter space. In Figures 1 and 2, there is a range of the parameter space in which the pretest estimator has risk that is inferior to (greater than) that of both the unrestricted and restricted least squares estimators. No single estimator depicted in Figure 1 dominates the other competitors. In addition, in applied problems the hypothesis errors, and thus the correct point in the specification-error parameter space, are seldom known. Consequently, the choice of the estimator is unresolved.
the risk of the pretest estimator approaches that of the least squares estimator b. The choice of α, which has a crucial impact on the performance of the pretest estimator, is portrayed in Figure 3.
Since the investigator is usually unsure of the degree of hypothesis specification error, and thus is unsure of the appropriate point in the specification-error space for evaluating the risk, the best of worlds would be to have a rule that mixes the unrestricted and restricted estimators so as to minimize risk regardless of the relevant specification error. Thus the risk function traced out by the cross-hatched area in Figure 2 is relevant. Unfortunately, the risk of the pretest estimator, regardless of the choice of α, is always equal to or greater than the minimum risk function over some range of the parameter space. Given this result, one criterion that has been proposed for choosing the level α is to choose the critical value that would minimize the maximum regret of not being on the minimum risk function, reflected by the boundary of the shaded area. Another criterion that has been proposed for choosing α is to minimize the average regret over the whole specification-error space. Each of these criteria leads to different conclusions or rules for choice, and the question concerning the optimal level of the test is still open. One thing that is apparent is that the conventional choices of α = 0.05 and 0.01 may have rather severe statistical consequences.
Chapter 7
Multicollinearity
A basic assumption in the multiple linear regression model is that the rank of the matrix of observations on the explanatory variables is the same as the number of explanatory variables. In other words, such a matrix is of full column rank. This, in turn, implies that all the explanatory variables are linearly independent, i.e., there is no linear relationship among the explanatory variables. In this case the explanatory variables are termed orthogonal.
In many situations in practice, the explanatory variables may not remain independent due to various
reasons. The situation where the explanatory variables are highly intercorrelated is referred to as
multicollinearity.
Assume that the observations on all the Xᵢ's and yᵢ's are centered and scaled to unit length. Then
- X'X becomes a k × k matrix of correlation coefficients between the explanatory variables, and
- X'y becomes a k × 1 vector of correlation coefficients between the explanatory variables and the study variable.
The column vectors X₁, X₂, ..., X_k are linearly dependent if there exists a set of constants c₁, c₂, ..., c_k, not all zero, such that
Σⱼ₌₁ᵏ cⱼXⱼ = 0.
If this holds exactly for a subset of X₁, X₂, ..., X_k, then rank(X'X) < k and consequently (X'X)⁻¹ does not exist. If the condition Σⱼ₌₁ᵏ cⱼXⱼ = 0 is approximately true for some subset of X₁, X₂, ..., X_k, then there will be a near-linear dependency in X'X. In such a case, the multicollinearity problem exists. It is also said that X'X becomes ill-conditioned.
5. An over-determined model
Sometimes, due to over-enthusiasm, a large number of variables are included in the model to make it more
realistic. Consequently, the number of observations (n ) becomes smaller than the number of explanatory
variables (k ) . Such a situation can arise in medical research where the number of patients may be small,
but the information is collected on a large number of variables. In another example, if there are time-series data for 50 years on the consumption pattern, then it is expected that the consumption pattern does not remain the same for 50 years, so one uses data from a smaller number of years; this again can result in n < k.
Consider the model with two standardized explanatory variables. The normal equations (X'X)b = X'y are
[1  r; r  1][b₁; b₂] = [r_{1y}; r_{2y}],
where r is the correlation coefficient between x₁ and x₂ and r_{jy} is the correlation coefficient between xⱼ and y. Then
(X'X)⁻¹ = (1/(1 − r²)) [1  −r; −r  1],
b₁ = (r_{1y} − r r_{2y})/(1 − r²),  b₂ = (r_{2y} − r r_{1y})/(1 − r²).
So the covariance matrix is V(b) = σ²(X'X)⁻¹, with
Var(b₁) = Var(b₂) = σ²/(1 − r²),  Cov(b₁, b₂) = −rσ²/(1 − r²).
If x₁ and x₂ are uncorrelated, then r = 0,
X'X = [1  0; 0  1],
and rank(X'X) = 2. If x₁ and x₂ are perfectly correlated, then r = ±1 and rank(X'X) = 1.
So, as the variables approach perfect collinearity, the variances of the OLSEs become very large. This indicates highly unreliable estimates, which is an inadmissible situation. The standard errors of b₁ and b₂ rise sharply as |r| → 1 and break down at |r| = 1 because X'X becomes singular.
There is no clear cut boundary to distinguish between the harmful and non-harmful multicollinearity.
Generally, if r is low, the multicollinearity is considered as non-harmful, and if r is high, the
multicollinearity is regarded as harmful.
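The blow-up of Var(b₁) = Var(b₂) = σ²/(1 − r²) as |r| → 1 can be illustrated numerically with the short sketch below (hypothetical values; numpy assumed).

import numpy as np

sigma2 = 1.0
for r in [0.0, 0.5, 0.9, 0.99, 0.999]:
    XtX = np.array([[1.0, r], [r, 1.0]])          # correlation form of X'X
    V = sigma2 * np.linalg.inv(XtX)               # covariance matrix of (b1, b2)
    print(r, V[0, 0], V[0, 1])                    # variance and covariance inflate as r -> 1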
In the case of near or high multicollinearity, the following consequences are encountered.
1. The OLSE remains an unbiased estimator of β, but its sampling variance becomes very large. So the OLSE becomes imprecise: although it is still the best linear unbiased estimator, the attainable minimum variance itself is so large that the estimates are of little practical value.
2. Due to the large standard errors, the regression coefficients may not appear significant. Consequently, essential variables may be dropped. For example, to test H₀ : β₁ = 0, we use the t-ratio
t₀ = b₁/√(V̂ar(b₁)),
and a large V̂ar(b₁) drives t₀ towards zero, so H₀ may not be rejected even when β₁ ≠ 0.
When the number of explanatory variables is more than two, say X₁, X₂, ..., X_k, the variance of the j-th OLSE is
Var(bⱼ) = σ²Cⱼⱼ = σ²/(1 − Rⱼ²),
where Rⱼ² is the multiple correlation coefficient, i.e., the coefficient of determination from the regression of Xⱼ on the remaining (k − 1) explanatory variables. If Xⱼ is highly correlated with any subset of the other (k − 1) explanatory variables, then Rⱼ² is high and close to 1, and consequently Var(bⱼ) becomes very high. The covariance between bᵢ and bⱼ will also be large if Xᵢ and Xⱼ are involved in the linear relationship leading to multicollinearity.
The least-squares estimates bⱼ also become too large in absolute value in the presence of multicollinearity. For example, consider the squared distance between b and β,
L² = (b − β)'(b − β).
Then
E(L²) = Σⱼ₌₁ᵏ E(bⱼ − βⱼ)² = Σⱼ₌₁ᵏ Var(bⱼ) = σ² tr(X'X)⁻¹.
If λ₁, λ₂, ..., λ_k are the eigenvalues of (X'X), then 1/λ₁, 1/λ₂, ..., 1/λ_k are the eigenvalues of (X'X)⁻¹ and hence
E(L²) = σ² Σⱼ₌₁ᵏ 1/λⱼ,  λⱼ > 0,
which is large when some λⱼ are near zero, i.e., when X'X is ill-conditioned. Also,
E(L²) = E[(b − β)'(b − β)] = E(b'b − 2b'β + β'β),
so
E(b'b) = β'β + σ² tr(X'X)⁻¹,
i.e., b is generally longer than β: on average, the OLSE is too large in absolute value.
Although least squares may thus produce poor estimates of the individual parameters in the presence of multicollinearity, this does not imply that the fitted model provides poor predictions as well. If the predictions are confined to the region of the x-space with non-harmful multicollinearity, then the predictions are satisfactory.
Multicollinearity diagnostics
An important question arises about how to diagnose the presence of multicollinearity in the data on the
basis of given sample information. Several diagnostic measures are available, and each of them is based on
a particular approach. It is difficult to say that which of the diagnostic is best or ultimate. Some of the
popular and important diagnostics are described further. The detection of multicollinearity involves 3
aspects:
(i) Determining its presence.
(ii) Determining its severity.
(iii) Determining its form or location.
multicollinearity increases.
If Rank ( X ' X ) k then X ' X will be singular and so X ' X 0. So, as X ' X 0 , the degree of
multicollinearity increases and it becomes exact or perfect at X ' X 0. Thus X ' X serves as a measure
x12i
i 1
x
i 1
x
1i 2 i
X 'X n n
x
i 1
x
2 i 1i x
i 1
2
2i
n n
x12i x22i 1 r122
i 1 i 1
where r12 is the correlation coefficient between x1 and x2 . So X ' X depends on the
correlation coefficient and variability of the explanatory variable. If explanatory variables have
very low variability, then X ' X may tend to zero, which will indicate the presence of
(iii) It gives no idea about the relative effects on individual coefficients. If multicollinearity is
present, then it will not indicate that which variable in X ' X is causing multicollinearity and
is hard to determine.
If X i and X j are nearly linearly dependent, then rij will be close to 1. Note that the observations in X
are standardized in the sense that each observation is subtracted from the mean of that variable and divided
by the square root of the corrected sum of squares of that variable.
When more than two explanatory variables are considered, and if they are involved in near-linear
dependency, then it is not necessary that any of the rij will be large. Generally, a pairwise inspection of
Limitation
It gives no information about the number of linear dependencies among explanatory variables.
x
i 1
2
1i x
i 1
x
1i 2 i
X 'X n n
(1 r122 ).
x
i 1
x
1i 2 i x
i 1
2
2i
variable is dropped, i 1, 2,..., k , and RL2 Max( R12 , R22 ,..., Rk2 ).
Procedure:
(i) Drop one of the explanatory variables among k variables, say X 1 .
high. Higher the degree of multicollinearity, higher the value of RL2 . So in the presence of
Limitations:
(i) It gives no information about the underlying relations about explanatory variables, i.e., how
many relationships are present or how many explanatory variables are responsible for the
multicollinearity.
(ii) A small value of ( R 2 RL2 ) may occur because of poor specification of the model also and it
determination obtained when X j is regressed on the remaining (k 1) variables excluding X j , then the
j th diagonal element of C is
1
C jj .
1 R 2j
close to 1.
Var (b j ) 2C jj
dependent. Based on this concept, the variance inflation factor for the j th explanatory variable is defined
as
1
VIFj .
1 R 2`j
This is the factor which is responsible for inflating the sampling variance. The combined effect of
dependencies among the explanatory variables on the variance of a term is measured by the VIF of that
term in the model.
One or more large VIFs indicate the presence of multicollinearity in the data.
In practice, usually, a VIF 5 or 10 indicates that the associated regression coefficients are poorly
estimated because of multicollinearity. If regression coefficients are estimated by OLSE and its variance
is 2 ( X ' X ) 1. So VIF indicates that a part of this variance is given by VIFj.
Limitations:
(i) It sheds no light on the number of dependencies among the explanatory variables.
(ii) The rule of VIF > 5 or 10 is a rule of thumb which may differ from one situation to another
situation.
b ˆ C jj t ,n k 1 .
2
2
L j 2 ˆ 2C jj t .
, n k 1
2
1 n
the same as earlier and the same root mean squares ( xij x j ) 2 , then the length of confidence
n i 1
interval becomes
L* 2ˆ t .
, n k 1
2
The condition number is based only or two eigenvalues: min and max . Another measures are condition
The number of condition indices that are large, say more than 1000, indicate the number of near-linear
dependencies in X ' X .
matrix constructed by the eigenvectors of X ' X . Obviously, V is an orthogonal matrix. Then X ' X can
be decomposed as X ' X V V ' . Let V1 , V2 ,..., Vk be the column of V . If there is a near-linear
dependency in the data, then j is close to zero and the nature of linear dependency is described by the
(ii) (a) Identify those i ' s for which C j is greater than the danger level 1000.
(iii) For such ' s with condition index above the danger level, choose one such eigenvalue, say
j.
(vij2 / j ) vij2 / j
pij k
.
(v
VIFj 2
ij / j )
j 1
v2
Note that ij can be found from the expression
j
vi21 vi22 vik2
Var (bi ) 2
...
1 2 k
If pij 0.5, it indicates that bi is adversely affected by the multicollinearity, i.e., an estimate of i is
It is a good diagnostic tool in the sense that it tells about the presence of harmful multicollinearity as well
as also indicates the number of linear dependencies responsible for multicollinearity. This diagnostic is
better than other diagnostics.
eigenvectors corresponding to eigenvalues of X ' X and U is a matrix whose columns are the
eigenvectors associated with the k nonzero eigenvalues of X ' X .
so 2j j , j 1, 2,..., k .
k v 2ji
Var (b j ) 2
i 1 i2
k v 2ji
VIFj
i 1 i2
(vij2 / i2 )
pij .
VIF j
The ill-conditioning in X is reflected in the size of singular values. There will be one small singular value
for each non-linear dependency. The extent of ill-conditioning is described by how small is j relative to
max .
It is suggested that the explanatory variables should be scaled to unit length but should not be centered
when computing pij . This will helps in diagnosing the role of intercept term in near-linear dependence.
No unique guidance is available in the literature on the issue of centering the explanatory variables. The
centering makes the intercept orthogonal to explanatory variables. So this may remove the ill-conditioning
due to intercept term in the model.
Econometrics | Chapter 7 | Multicollinearity | Shalabh, IIT Kanpur
14
Remedies for multicollinearity:
Various techniques have been proposed to deal with the problems resulting from the presence of
multicollinearity in the data.
close to zero. Additional data may help in reducing the sampling variance of the estimates. The data need
to be collected such that it helps in breaking up the multicollinearity in the data.
It is always not possible to collect additional data for various reasons as follows.
The experiment and process have finished and no longer available.
The economic constraints may also not allow collecting the additional data.
The additional data may not match with the earlier collected data and may be unusual.
If the data is in time series, then longer time series may force to take ignore data that is too far in
the past.
If multicollinearity is due to any identity or exact relationship, then increasing the sample size will
not help.
Sometimes, it is not advisable to use the data even if it is available. For example, if the data on
consumption pattern is available for the years 1950-2010, then one may not like to use it as the
consumption pattern usually does not remains the same for such a long period.
If some variables are eliminated, then this may reduce the predictive power of the model. Sometimes there
is no assurance of how the model will exhibit less multicollinearity.
Econometrics | Chapter 7 | Multicollinearity | Shalabh, IIT Kanpur
15
3. Use some relevant prior information:
One may search for some relevant prior information about the regression coefficients. This may lead to
the specification of estimates of some coefficients. The more general situation includes the specification of
some exact linear restrictions and stochastic linear restrictions. The procedures like restricted regression
and mixed regression can be used for this purpose. The relevance and correctness of information play an
important role in such analysis, but it is challenging to ensure it in practice. For example, the estimates
derived in the U.K. may not be valid in India.
Suppose there are k explanatory variables X 1 , X 2 ,..., X k . Consider the linear function of X 1 , X 2 ,.., X k
like
k
Z1 ai X i
i 1
k
Z 2 bi X i etc.
i 1
The constants a1 , a2 ,..., ak are determined such that the variance of Z1 is maximized subject to the
component.
We continue with such process and obtain k such linear combinations such that they are orthogonal to
their preceding linear combinations and satisfy the normality condition. Then we obtain their variances.
Suppose such linear combinations are Z1 , Z 2 ,.., Z k and for them, Var ( Z1 ) Var ( Z 2 ) ... Var ( Z k ). The
linear combination having the largest variance is the first principal component. The linear combination
having the second largest variance is the second-largest principal component and so on. These principal
k k
components have the property that Var (Zi ) Var ( X i ). Also, the X1 , X 2 ,..., X k are correlated but
i 1 i 1
Z1 , Z 2 ,.., Z k are orthogonal or uncorrelated. So there will be zero multicollinearity among Z1 , Z 2 ,.., Z k .
The problem of multicollinearity arises because X 1 , X 2 ,..., X k are not independent. Since the principal
components based on X 1 , X 2 ,..., X k are mutually independent, so they can be used as explanatory
orthogonal matrix whose columns are the eigenvectors associated with 1 , 2 ,..., k . Consider the
Columns of Z Z1 , Z 2 ,..., Z k define a new set of explanatory variables which are called as principal
components.
The OLSE of is
ˆ ( Z ' Z ) 1 Z ' y
1Z ' y
and its covariance matrix is
V (ˆ ) 2 ( Z ' Z ) 1
2 1
1 1 1
2 diag , ,...,
1 2 k
k k
Note that j is the variance of j th principal component and Z ' Z Z i Z j . A small eigenvalue
i 1 j 1
of X ' X means that the linear relationship between the original explanatory variable exists and the
variance of the corresponding orthogonal regression coefficient is large, which indicates that the
multicollinearity exists. If one or more j is small, then it indicates that multicollinearity is present.
information as the original data in X in the sense that the total variability in X and Z is the same. The
difference between them is that the original data are arranged into a set of new variables which are
uncorrelated with each other and can be ranked with respect to the magnitude of their eigenvalues. The j th
column vector Z j corresponding to the largest j accounts for the largest proportion of the variation in
the original data. Thus the Z j ’s are indexed so that 1 2 ... k 0 and j is the variance of Z j .
A strategy of elimination of principal components is to begin by discarding the component associated with
the smallest eigenvalue. The idea behind to do so is that the principal component with the smallest
eigenvalue is contributing least variance and so is least informative.
Using this procedure, principal components are eliminated until the remaining components explain some
preselected variance is terms of percentage of the total variance. For example, if 90% of the total variance
is needed, and suppose r principal components are eliminated which means that (k r ) principal
components contribute 90% variation, then r is selected to satisfy
k r
i
i 1
k
0.90.
i 1
i
Various strategies to choose the required number of principal components are also available in the
literature.
Suppose after using such a rule, the r principal components are eliminated. Now only (k r )
components will be used for regression. So Z matrix is partitioned as
Z Zr Z k r X (Vr Vk r )
where Z r submatrix is of order n r and contains the principal components to be eliminated. The
The reduced model obtained after the elimination of r principal components can be expressed as
y Z k r k r *.
Z k r Z1 , Z 2 ,..., Z k r
k r 1 , 2 ,..., k r
Vk r V1 , V2 ,..., Vk r .
Using OLS on the model with retained principal components, the OLSE of k r is
ˆ k r ( Z k' r Z k r ) 1 Z k' r y .
Now it is transformed back to original explanatory variables as follows:
V '
k r Vk' r
ˆ pc Vk rˆ k r
is the principal component regression estimator of .
This method improves the efficiency as well as multicollinearity.
6. Ridge regression
The OLSE is the best linear unbiased estimator of regression coefficient in the sense that it has minimum
variance in the class of linear and unbiased estimators. However, if the condition of unbiased can be
relaxed, then it is possible to find a biased estimator of regression coefficient say ̂ that has smaller
variance them the unbiased OLSE b . The mean squared error (MSE) of ̂ is
MSE ( ˆ ) E ( ˆ ) 2
2
E ˆ E ( ˆ ) E ( ˆ )
2
Var ( ˆ ) E ( ˆ )
2
Var ( ˆ ) Bias ( ˆ ) .
Thus MSE ( ˆ ) can be made smaller than Var ( ˆ ) by introducing small bias is ̂ . One of the approach to
do so is ridge regression. The ridge regression estimator is obtained by solving the normal equations of
least squares estimation. The normal equations are modified as
is the ridge regression estimator of and 0 is any characterizing scalar termed as biasing
parameter.
So larger the value of , larger shrinkage towards zero. Note that the OLSE in inappropriate to use in the
sense that it has very high variance when multicollinearity is present in the data. On the other hand, a very
small value of ̂ may tend to accept the null hypothesis H 0 : 0 indicating that the corresponding
variables are not relevant. The value of the biasing parameter controls the amount of shrinkage in the
estimates.
Covariance matrix:
The covariance matrix of ˆridge is defined as
V ( ˆridge ) E ˆridge E ( ˆridge ) ˆridge E ( ˆridge ) .
'
Since
Thus as increases, the bias in ˆridge increases but its variance decreases. Thus the trade-off between bias
and variance hinges upon the value of . It can be shown that there exists a value of such that
MSE ( ˆridge ) Var (b)
Choice of :
The estimation of ridge regression estimator depends upon the value of . Various approaches have been
suggested in the literature to determine the value of . The value of can be chosen on the bias of
criteria like
- the stability of estimators with respect to .
- reasonable signs.
- the magnitude of residual sum of squares etc.
We consider here the determination of by the inspection of ridge trace.
Ridge trace:
Ridge trace is the graphical display of ridge regression estimator versus .
If multicollinearity is present and is severe, then the instability of regression coefficients is reflected in the
ridge trace. As increases, some of the ridge estimates vary dramatically, and they stabilize at some
value of . The objective in ridge trace is to inspect the trace (curve) and find the reasonable small value
of at which the ridge regression estimators are stable. The ridge regression estimator with such a choice
of will have smaller MSE than the variance of OLSE.
Econometrics | Chapter 7 | Multicollinearity | Shalabh, IIT Kanpur
22
An example of ridge trace is as follows for a model with 6 parameters. In this ridge trace, the ˆridge is
evaluated for various choices of and the corresponding values of all regression coefficients ˆ j ( ridge ) ’s,
j=1,2,…,6 are plotted versus . These values are denoted by different symbols and are joined by a smooth
curve. This produces a ridge trace for the respective parameter. Now choose the value of where all the
curves stabilize and become nearly parallel. For example, the curves in the following figure become
almost parallel, starting from 4 or so. Thus one possible choice of is 4 and parameters can
The figure drastically exposes the presence of multicollinearity in the data. The behaviour of ˆi ( ridge ) at
0 0 is very different than at other values of . For small values of , the estimates change rapidly.
The estimates stabilize gradually as increases. The value of at which all the estimates stabilize gives
the desired value of because moving away from such will not bring any appreciable reduction in the
residual sum of squares. It multicollinearity is present, then the variation in ridge regression estimators is
rapid around 0. The optimal is chosen such that after that value of , almost all traces stabilize.
exhibit stability for different , and it may often be hard to strike a compromise. In such a situation,
generalized ridge regression estimators are used.
5. There is no guidance available regarding the testing of hypothesis and for confidence interval
estimation.
then
ˆridge ( I 1 ) 1 b
where b is the OLSE of given by
constant. So minimize
( ) ( y X ) '( y X ) ( ' C )
where is the Lagrangian multiplier. Differentiating S ( ) with respect to , the normal equations are
obtained as
S ( )
0 2 X ' y 2 X ' X 2 0
ˆ ( X ' X I ) 1 X ' y.
ridge
Note that if C is very small, it may indicate that most of the regression coefficients are close to zero and if
C is large, then it may indicate that the regression coefficients are away from zero. So C puts a sort of
penalty on the regression coefficients to enable its estimation.
In this case, the diagonal elements of the covariance matrix of are the same indicating that the variance of
each i is same and off-diagonal elements of the covariance matrix of are zero indicating that all
disturbances are pairwise uncorrelated. This property of constancy of variance is termed as homoskedasticity
and disturbances are called as homoskedastic disturbances.
In many situations, this assumption may not be plausible, and the variances may not remain the same. The
disturbances whose variances are not constant across the observations are called heteroskedastic disturbance,
and this property is termed as heteroskedasticity. In this case
Var ( i ) i2 , i 1, 2,..., n
and disturbances are pairwise uncorrelated.
Homoskedasticity
Examples: Suppose in a simple linear regression model, x denote the income and y denotes the expenditure
on food. It is observed that as the income increases, the expenditure on food increases because of the choice
and varieties in food increase, in general, up to a certain extent. So the variance of observations on y will not
remain constant as income changes. The assumption of homoscedasticity implies that the consumption pattern
of food will remain the same irrespective of the income of the person. This may not generally be a correct
assumption in real situations. Instead, the consumption pattern changes and hence the variance of y and so the
variances of disturbances will not remain constant. In general, it and will be increasing as income increases.
2. Sometimes the observations are in the form of averages, and this introduces the heteroskedasticity in the
model. For example, it is easier to collect data on the expenditure on clothes for the whole family rather
than on a particular family member. Suppose in a simple linear regression model
yij 0 1 xij ij , i 1, 2,..., n, j 1, 2,..., mi
yij denotes the expenditure on cloth for the j th family having m j members and xij denotes the age of
the i th person in the j th family. It is difficult to record data for an individual family member, but it is
easier to get data for the whole family. So yij ' s are known collectively.
Then instead of per member expenditure, we find the data on average spending for each family member
as
mj
1
yi
mj
y
j 1
ij
E ( i ) 0
2
Var ( i )
mj
3. Sometimes the theoretical considerations introduce the heteroskedasticity in the data. For example,
suppose in the simple linear model
yi 0 1 xi i , i 1, 2,..., n ,
yi denotes the yield of rice and xi denotes the quantity of fertilizer in an agricultural experiment. It is
observed that when the quantity of fertilizer increases, then yield increases. In fact, initially, the yield
increases when the quantity of fertilizer increases. Gradually, the rate of increase slows down, and if
fertilizer is increased further, the crop burns. So notice that 1 changes with different levels of fertilizer.
In such cases, when 1 changes, a possible way is to express it as a random variable with constant
1i 1 vi , i 1, 2,..., n
with
E (vi ) 0, Var (vi ) 2 , E ( i vi ) 0.
So the complete model becomes
yi 0 1 xi i
i 1 vi
yi 0 xi ( i xi vi )
0 xi wi
E ( wi ) 0
Var ( wi ) E ( wi2 )
E ( i2 ) xi2 E (vi2 ) 2 xi E ( i vi )
2 xi2 2 0
2 xi2 2 .
So variance depends on i , and thus heteroskedasticity is introduced in the model. Note that we assume
homoskedastic disturbances for the model
yi 0 1 xi i , 1i 1 vi
but finally ends up with heteroskedastic disturbances. This is due to theoretical considerations.
Econometrics | Chapter 8 | Heteroskedasticity | Shalabh, IIT Kanpur
4
4. The skewness in the distribution of one or more explanatory variables in the model also causes
heteroskedasticity in the model.
5. The incorrect data transformations and wrong functional form of the model can also give rise to the
heteroskedasticity problem.
1. Bartlett’s test
It is a test for testing the null hypothesis
H 0 : 12 22 ... i2 ... n2
This hypothesis is termed as the hypothesis of homoskedasticity. This test can be used only when replicated
data is available.
only one observation yi is available to find i2 , so the usual tests can not be applied. This problem can be
n n n
where y * is a vector of order mi 1, X is mi k matrix, is k 1 vector and * is mi 1
i 1 i 1 i 1
vector. Apply OLS to this model yields
ˆ ( X ' X ) 1 X ' y *
and obtain the residual vector
ei* yi* X i ˆ .
Based on this, obtain
1
si2 ei* ' ei*
mi k
n
(m k ) s
i
2
i
s2 i 1
n
.
(m k )
i 1
i
C i 1 si
1 n
1 1
C 1 .
3(n 1) i 1 mi k n
(mi k )
i 1
where
1 ni
y yi , i 1, 2,..., m; j 1, 2,..., ni
2
si2 ij
ni j 1
1 m
s2
n i 1
ni si2
m
n ni .
i 1
1 m 1 1
1
3(m 1) i 1 ni 1 n m
1 ni
ˆ i2
n 1 j 1
( yij yi ) 2
1 m
ˆ 2
n m i 1
(ni 1)ˆ i2 .
In experimental sciences, it is easier to get replicated data, and this test can be easily applied. In real-life
applications, it is challenging to get replicated data, and this test may not be applied. This difficulty is overcome
in Breusch Pagan test.
is the vector of observable explanatory variables with first element unity and ( 1 , i* ) ( 1 , 2 ,..., p ) is a
vector of unknown coefficients related to with the first element being the intercept term. The heterogeneity is
defined by these p variables. These Z i ' s may also include some X ' s also.
If H 0 is accepted , it implies that 2 Z i 2 , 3 Z i 3 ,..., p Z ip do not have any effect on i2 and we get i2 1 .
explains the heteroskedasticity in the model. Let j th explanatory variable explains the heteroskedasticity, so
i2 X ij
or i2 2 X ij .
The test procedure is as follows:
2. Split the observations into two equal parts leaving c observations in the middle.
nc nc
So each part contains observations provided k.
2 2
3. Run two separate regression in the two parts using OLS and obtain the residual sum of squares SSres1
and SS res 2 .
On the other hand, if a smaller value of c is chosen, then the test may fail to reveal the heteroskedasticity. The
basic objective of the ordering of observations and deletion of observations in the middle part may not reveal
the heteroskedasticity effect. Since the first and last values of i2 gives the maximum discretion, so removal of
smaller value may not give the proper idea of heteroskedasticity. Considering these two points, the working
n
choice of c is suggested as c .
3
Moreover, the choice of X ij is also difficult. Since i2 X ij , so if all important variables are included in the
model, then it may be difficult to decide that which of the variable is influencing the heteroskedasticity.
4. Glesjer test:
This test is based on the assumption that i2 is influenced by one variable Z , i.e., there is only one variable
which is influencing the heteroskedasticity. This variable could be either one of the explanatory variable or it
can be chosen from some extraneous sources also.
1
4. Conduct the test for h 1, . So the test procedure is repeated four times.
2
In practice, one can choose any value of h . For simplicity, we choose h 1 .
The test has only asymptotic justification and the four choices of h give generally satisfactory results.
This test sheds light on the nature of heteroskedasticity.
phenomenon and n is the number of objects or phenomenon ranked, then the Spearman’s rank correlation
coefficient is defined as
n 2
di
r 1 6 i 12 ; 1 r 1.
n(n 1)
This can be used for testing the hypothesis about the heteroskedasticity.
Consider the model
yi 0 1 X i i .
2. Consider ei .
5. Assuming that the population rank correlation coefficient is zero and n 8, use the test statistic
r n2
t0
1 r2
which follows a t -distribution with (n 2) degrees of freedom.
6. The decision rule is to reject the null hypothesis of heteroskedasticity whenever t0 t1 (n 2).
If there are more than one explanatory variables, then rank correlation coefficient can be computed
between ei and each of the explanatory variables separately and can be tested using t0 .
e1
e
ei 0, 0,..., 0,1, 0,...0 2
en
i 'e i ' H
where i is a n 1 vector with all elements zero except the i th element which is unity and
Thus E (ei2 ) i2 and so ei2 becomes a biased estimator of i2 in the presence of heteroskedasticity.
In the presence of heteroskedasticity, use the generalized least squares estimation. The generalized least
squares estimator (GLSE) of is
(x x )
i
2
i
2
n
i2
Var (b) i 1
and Var ( ˆ ) respectively.
n 2
2
( xi x ) 2
( xi x )
i 1
i 1
Econometrics | Chapter 8 | Heteroskedasticity | Shalabh, IIT Kanpur
13
Consider
2
n
Var ( ˆ ) ( xi x ) 2
i 1
Var (b)
( x x ) 2 2 ( x x ) 2 1
n n
i i
i 1
i
i2
i 1
x x
Square of the correlation coefficient betweene i ( xi x ) and i
i
1
Var ( ˆ ) Var (b).
( xi x )
So efficient of OLSE and GLSE depends upon the correlation coefficient between ( xi x ) i and .
i
The generalized least squares estimation assumes that is known, i.e., the nature of heteroskedasticity is
completely specified. Based on this assumption, the possibilities of following two cases arise:
is completely specified or
is not completely specified.
yi 1 2 X i 2 ... k X ik i .
yi 1 X i2 X ik i
1 2 ... k .
i i i i i
i 2
Let i* , then E ( i* ) 0, Var ( i* ) i2 1. Now OLS can be applied to this model and usual tools for
i i
drawing statistical inferences can be used.
Note that when the model is deflated, the intercept term is lost as 1 / i is itself a variable. This point has to be
i2 X ij2
or i2 2 X ij2
yi X X
1 2 i2 ... k ik i .
X ij X ij X ij X ij X ij
Now apply OLS to this transformed model and use the usual statistical tools for drawing inferences.
A caution is to be kept is mind while doing so. This is illustrated in the following example with one
explanatory variable model.
Deflate it by xi , so we get
yi 0
1 i .
xi xi xi
Note that the roles of 0 and 1 in original and deflated models are interchanged. In the original model, 0 is
the intercept term and 1 is the slope parameter whereas in the deflated model, 1 becomes the intercept term
and 0 becomes the slope parameter. So essentially, one can use OLS but need to be careful in identifying the
u2 if s 0
E (ut , ut s )
0 if s 0
i.e., the correlation between the successive disturbances is zero.
In this assumption, when E (ut , ut s ) u2 , s 0 is violated, i.e., the variance of disturbance term does not
remain constant, then the problem of heteroskedasticity arises. When E (ut , ut s ) 0, s 0 is violated, i.e.,
the variance of disturbance term remains constant though the successive disturbance terms are correlated,
then such problem is termed as the problem of autocorrelation.
When autocorrelation is present, some or all off-diagonal elements in E (uu ') are nonzero.
Sometimes the study and explanatory variables have a natural sequence order over time, i.e., the data is
collected with respect to time. Such data is termed as time-series data. The disturbance terms in time series
data are serially correlated.
Assume s and s are symmetrical in s , i.e., these coefficients are constant over time and depend only on
the length of lag s. The autocorrelation between the successive terms (u2 and u1 )
(u3 and u2 ),..., (un and un 1 ) gives the autocorrelation of order one, i.e., 1 . Similarly, the autocorrelation
between the successive terms (u3 and u1 ), (u4 and u2 )...(un and un 2 ) gives the autocorrelation of order two,
i.e., 2 .
2. Another source of autocorrelation is the effect of deletion of some variables. In regression modeling,
it is not possible to include all the variables in the model. There can be various reasons for this, e.g.,
some variable may be qualitative, sometimes direct observations may not be available on the variable
etc. The joint effect of such deleted variables gives rise to autocorrelation in the data.
3. The misspecification of the form of relationship can also introduce autocorrelation in the data. It is
assumed that the form of relationship between study and explanatory variables is linear. If there are
log or exponential terms present in the model so that the linearity of the model is questionable, then
this also gives rise to autocorrelation in the data.
4. The difference between the observed and true values of the variable is called measurement error or
errors–in-variable. The presence of measurement errors on the dependent variable may also introduce
the autocorrelation in the data.
Observe that now there are (n k ) parameters- 1 , 2 ,..., k , u2 , 1 , 2 ,..., n 1. These (n k ) parameters are
to be estimated on the basis of available n observations. Since the number of parameters are more than the
number of observations, so the situation is not good from the statistical point of view. In order to handle the
situation, some special form and the structure of the disturbance term is needed to be assumed so that the
number of parameters in the covariance matrix of disturbance term can be reduced.
i.e., the current disturbance term depends on the q lagged disturbances and 1 , 2 ,..., k are the parameters
i.e., the present disturbance term ut depends on the p lagged values. The coefficients 1 , 2 ,..., p are the
parameters and are associated with t 1 , t 2 ,..., t p , respectively. This process is termed as MA p process.
The method of correlogram is used to check that the data is following which of the processes. The
correlogram is a two dimensional graph between the lag s and autocorrelation coefficient s which is
In MA(1) process
ut t 1 t 1
1
for s 1
s 1 12
0 for s 2
0 1
1 0
i 0 i 2,3,...
So there is no autocorrelation between the disturbances that are more than one period apart.
The results of any lower order of process are not applicable in higher-order schemes. As the order of the
process increases, the difficulty in handling them mathematically also increases.
ut ut 1 t
where 1, E ( t ) 0,
2 if s 0
E ( t , t s )
0 if s 0
for all t 1, 2,..., n where is the first-order autocorrelation between ut and ut 1 , t 1, 2,..., n. Now
ut ut 1 t
(ut 2 t 1 ) t
t t 1 2 t 2 ...
r t r
r 0
E t 1 t 2 ...
2
u .
2
Similarly,
E (ut ut 2 ) 2 u2 .
In general,
E (ut ut s ) s u2
1 2 n 1
1 n2
.
E (uu ') u 2
2
1 n 3
n 1 n2
n 3 1
Note that the disturbance terms are no more independent and E (uu ') 2 I . The disturbances are
nonspherical.
ut ut 1 t , t 1, 2,..., n
with assumptions
E (u ) 0, E (uu ')
2 if s 0
E ( t ) 0, E ( t t s )
0 if s 0
where is a positive definite matrix.
b ( X ' X ) 1 X ' y
( X ' X ) 1 X '( X u )
b ( X ' X ) 1 X ' u
E (b ) 0.
So OLSE remains unbiased under autocorrelated disturbances.
Application of OLS fails in case of autocorrelation in the data and leads to serious consequences as
overly optimistic view from R 2 .
narrow confidence interval.
usual t -ratio and F ratio tests provide misleading results.
prediction may have large variances.
Since disturbances are nonspherical, so generalized least squares estimate of yields more efficient
estimates than OLSE.
ˆ ( X ' 1 X ) 1 X ' 1 y
E ( ˆ )
V ( ˆ ) 2 ( X ' 1 X ) 1.
u
e y Xb Hy
e e
2
t t 1
d t 2
n
e
t 1
2
t
n n n
et2 et 1 e e t t 1
t 2
n
t 2
n
2 t 2n .
e
t 1
2
t e
t 1
2
t e
t 1
2
t
For large n,
d 1 1 2r
d 2(1 r )
where r is the sample autocorrelation coefficient from residuals based on OLSE and can be regarded as the
regression coefficient of et on et 1 . Here
negative autocorrelation of et ’s d 2
zero autocorrelation of et ’s d 2
As 1 r 1, so
if 1 r 0, then 2 d 4 and
if 0 r 1, then 0 d 2.
So d lies between 0 and 4.
Since e depends on X , so for different data sets, different values of d are obtained. So the sampling
distribution of d depends on X . Consequently, exact critical values of d cannot be tabulated owing to their
dependence on X . Durbin and Watson, therefore, obtained two statistics d and d such that
d d d
and their sampling distributions do not depend upon X .
Considering the distribution of d and d , they tabulated the critical values as d L and dU respectively. They
prepared the tables of critical values for 15 n 100 and k 5. Now tables are available for 6 n 200 and
k 10.
Accept H 0 when d dU .
This test gives a satisfactory solution when values of xi ’s change slowly, e.g., price, expenditure
etc.
2. The D-W test is not applicable when the intercept term is absent in the model. In such a case, one can
use another critical value, say d M in place of d L . The tables for critical values d M are available.
3. The test is not valid when lagged dependent variables appear as explanatory variables. For example,
yt 1 yt 1 2 yt 2 .... r yt r r 1 xt1 ... k xt ,k r ut ,
ut ut 1 t .
Durbin’s h-test
Apply OLS to
yt 1 yt 1 2 yt 2 .... r yt r r 1 xt1 ... k xt ,k r ut ,
ut ut 1 t
(b ). Then the Dubin’s h -
and find OLSE b1 of 1. Let its variance be Var (b1 ) and its estimator is Var 1
statistic is
n
hr
1 n Var (b1 )
e e t t 1
r t 2
n
.
e
t 2
2
t
et t 1 yt .
Now apply OLS to this model and test H 0 A : 0 versus H1 A : 0 using t -test. It H 0 A is accepted then
accept H 0 : 0.
4. If H 0 : 0 is rejected by D-W test, it does not necessarily mean the presence of first-order
autocorrelation in the disturbances. It could happen because of other reasons also, e.g.,
distribution may follows higher-order AR process.
some important variables are omitted.
dynamics of the model is misspecified.
functional term of the model is incorrect.
The OLSE of is unbiased but not, in general, efficient, and the estimate of 2 is biased. So we use
generalized least squares estimation procedure, and GLSE of is
ˆ ( X ' 1 X ) 1 X ' 1 y
where
1 2 0 0 0 0
1 0 0 0
0 1 0 0
P .
0 0 0 1 0
0 0 0 1
Note that the first observation is treated differently than other observations. For the first observation,
1 2 y1
1 2 x1'
1 2 u1
where xt' is a row vector of X . Also, 1 2 u1 and (u1 u0 ) have the same properties. So we
1 2.
V ( ˆ ) 2 ( X *' X *) 1
2 ( X ' 1 X ) 1
and its estimator is
Vˆ ( ˆ ) ˆ 2 ( X ' 1 X ) 1
where
( y X ˆ ) ' 1 ( y X ˆ )
ˆ 2 .
nk
ˆF ( X '
ˆ 1 X ) 1 X '
ˆ 1 y
e e t t 1
r t 2
n
e
t 2
2
t
2. Durbin procedure:
In Durbin procedure, the model
yt yt 1 0 (1 ) ( xt xt 1 ) t , t 2,3,..., n
is expressed as
yt 0 (1 ) yt 1 x1 xt 1 t
0* yt 1 xt * xt 1 t , t 2,3,..., n (*)
where 0* 0 (1 ), * .
Now run a regression using OLS to model (*) and estimate r * as the estimated coefficient of yt 1.
Another possibility is that since (1,1) , so search for a suitable which has smaller error sum of
squares.
3. Cochrane-Orcutt procedure:
This procedure utilizes P matrix defined while estimating when is known. It has following steps:
(i) Apply OLS to yt 0 1 xt ut and obtain the residual vector e .
n
e e t t 1
(ii) Estimate by r t 2
n
.
e
t 2
2
t 1
This is Cochrane-Orcutt procedure. Since two successive applications of OLS are involved, so it is also
called as two-step procedure.
Econometrics | Chapter 9 | Autocorrelation | Shalabh, IIT Kanpur
14
This application can be repeated in the procedure as follows:
(I) Put ˆ0* and ˆ in the original model.
e e t t 1
(III) Calculate by r t 2
n
and substitute it in the model
e
t 2
2
t 1
yt yt 1 0 (1 ) ( xt xt 1 ) t
This procedure is repeated until convergence is achieved, i.e., iterate the process till the two successive
estimates are nearly same so that stability of estimator is achieved.
This is an iterative procedure and is numerically convergent procedure. Such estimates are asymptotically
efficient and there is a loss of one observation.
Suppose we get 0.4. Now choose a finer grid. For example, choose such that 0.3 0.5 and
consider 0.31, 0.32,..., 0.49 and pick up that with the smallest residual sum of squares. Such iteration
can be repeated until a suitable value of corresponding to minimum residual sum of squares is obtained.
The selected final value of can be used and for transforming the model as in the case of Cochrane-Orcutt
procedure. The estimators obtained with this procedure are as efficient as obtained by Cochrane-Orcutt
procedure and there is a loss of one observation.
e e t t 1
(i) Estimate by ˆ t 2
n
where et ’s are residuals based on OLSE.
et21
t 3
1 ˆ 2 y1
1 ˆ 2 0
1 ˆ 2 xt
1 ˆ 2 ut
yt ˆ yt 1 (1 ˆ ) 0 ( xt ˆ xt 1 ) (ut ˆ ut 1 ), t 2,3,..., n.
1 1
L exp 2 ( y X ) ' 1 ( y X ) .
2
n
2
n
2 2
2
1
Ignoring the constant and using , the log-likelihood is
1 2
n 1 1
ln L ln L( , 2 , ) ln 2 ln(1 2 ) 2 ( y X ) ' 1 ( y X ) .
2 2 2
The maximum likelihood estimators of , and 2 can be obtained by solving the normal equations
ln L ln L ln L
0, 0, 0.
2
These normal equations turn out to be nonlinear in parameters and can not be easily solved.
One solution is to
- first derive the maximum likelihood estimator of 2 .
- Substitute it back into the likelihood function and obtain the likelihood function as the function of
and .
- Maximize this likelihood function with respect to and .
n 1 1 n
ln L* ln L *( , ) ln ( y X ) ' 1 ( y X ) ln(1 2 )
2 n 2 2
n 1
ln ( y X ) ' 1 ( y X ) ln(1 2 ) k
2 n
n ( y X ) ' 1 ( y X )
k ln
2 1
(1 2 ) n
n n
where k ln n .
2 2
( y X ) ' 1 ( y X )
1
.
(1 )
2 n
Using the optimization techniques of non-linear regression, this function can be minimized and estimates of
and can be obtained.
If n is large and is not too close to one, then the term (1 2 ) 1/ n is negligible and the estimates of
In general, the explanatory variables in any regression analysis are assumed to be quantitative in nature. For
example, the variables like temperature, distance, age etc. are quantitative in the sense that they are recorded
on a well-defined scale.
In many applications, the variables can not be defined on a well-defined scale, and they are qualitative in
nature.
For example, the variables like sex (male or female), colour (black, white), nationality, employment status
(employed, unemployed) are defined on a nominal scale. Such variables do not have any natural scale of
measurement. Such variables usually indicate the presence or absence of a “quality” or an attribute like
employed or unemployed, graduate or non-graduate, smokers or non- smokers, yes or no, acceptance or
rejection, so they are defined on a nominal scale. Such variables can be quantified by artificially constructing
the variables that take the values, e.g., 1 and 0 where “1” usually indicates the presence of attribute and “0”
usually indicates the absence of the attribute. For example, “1” indicator that the person is male and “0”
indicates that the person is female. Similarly, “1” may indicate that the person is employed and then “0”
indicates that the person is unemployed.
Such variables classify the data into mutually exclusive categories. These variables are called indicator
variable or dummy variables.
Usually, the indicator variables take on the values 0 and 1 to identify the mutually exclusive classes of the
explanatory variables. For example,
1 if person is male
D
0 if person is female,
1 if person is employed
D
0 if person is unemployed.
Here we use the notation D in place of X to denote the dummy variable. The choice of 1 and 0 to identify
a category is arbitrary. For example, one can also define the dummy variable in the above examples as
In a given regression model, the qualitative and quantitative can also occur together, i.e., some variables are
qualitative, and others are quantitative.
Such models can be dealt with within the framework of regression analysis. The usual tools of regression
analysis can be used in the case of dummy variables.
Example:
Consider the following model with x1 as quantitative and D2 as an indicator variable
y 0 1 x1 2 D2 , E ( ) 0, Var ( ) 2
0 if an observation belongs to group A
D2
1 if an observation belongs to group B.
The interpretation of the result is essential. We proceed as follows:
If D2 0, then
y 0 1 x1 2 .0
0 1 x1
E ( y / D2 0) 0 1 x1
y 0 1 x1 2 .1
( 0 2 ) 1 x1
E ( y / D2 1) ( 0 2 ) 1 x1
The quantities E ( y / D2 0) and E ( y / D2 1) are the average responses when an observation belongs to
2 E ( y / D2 1) E ( y / D2 0)
which has an interpretation as the difference between the average values of y with D2 0 and D2 1 .
Graphically, it looks like as in the following figure. It describes two parallel regression lines with the same
variances 2 .
y
E ( y / D2 1) ( 0 2 ) 1 x1
0 2 1
2
E ( y / D2 0) 0 1 x1
1
0
If there are three explanatory variables in the model with two indicator variables D2 , and D3 then they
will describe three levels, e.g., groups A, B and C. The levels of indicator variables are as follows:
1. D2 0, D3 0 if the observation is from group A
Consider the following examples to understand how to define such indicator variables and how they can be
handled.
Example:
Suppose y denotes the monthly salary of a person and D denotes whether the person is graduate or non-
graduate. The model is
y 0 1 D , E ( ) 0, var( ) 2 .
- 1 measures the difference in the mean salaries of a graduate and a non-graduate person.
Now consider the same model with two indicator variables defined in the following way:
1 if person is graduate
Di1
0 if person is nongraduate,
1 if person is nongraduate
Di 2
0 if person is graduate.
The model with n observations is
yi 0 1 Di1 2 Di 2 i , E ( i ) 0, Var ( i ) 2 , i 1, 2,..., n.
Then we have
1. E yi / Di1 0, Di 2 1 0 2 : Average salary of a non-graduate
So multicollinearity is present in such cases. Hence the rank of the matrix of explanatory variables falls
short by 1. So 0 , 1 and 2 are indeterminate, and least-squares method breaks down. So the proposition
of introducing two indicator variables is useful, but they lead to serious consequences. This is known as the
dummy variable trap.
So when intercept term is dropped, then 1 and 2 have proper interpretations as the average salaries of a
Now the parameters can be estimated using ordinary least squares principle, and standard procedures for
drawing inferences can be used.
Rule: When the explanatory variable leads to m mutually exclusive categories classification, then use
(m 1) indicator variables for its representation. Alternatively, use m indicator variables but drop the
intercept term.
yi Salary of i th person.
Then
E yi / Di 2 0 0 1 xi1 2 .0 3 xi1.0
0 1 xi1.
E yi / Di 2 1 0 1 xi1 2 .1 3 xi1.1
( 0 2 ) ( 1 3 ) xi1.
The model
E ( yi ) 0 1 xi1 2 Di 2 3 xi1 Di 2
Thus
2 reflects the change in intercept term associated with the change in the group of person i.e., when the
group changes from A to B.
3 reflects the change in slope associated with the change in the group of person, i.e., when group changes
from A to B.
yi 0 1 xi1 2 .1 3 xi1.1 i
yi ( 0 2 ) ( 1 3 ) xi1 Di 2 i
and
yi 0 1 xi1 2 .0 3 xi1.0 i
yi 0 1 xi1 i
respectively.
The test of hypothesis becomes convenient by using an indicator variable. For example, if we want to test
whether the two regression models are identical, the test of hypothesis involves testing
H 0 : 2 3 0
H1 : 2 0 and/or 3 0.
Acceptance of H 0 indicates that only a single model is necessary to explain the relationship.
In another example, if the objective is to test that the two models differ with respect to intercepts only and
they have the same slopes, then the test of hypothesis involves testing
H 0 : 3 0
H1 : 3 0.
Then these values become 0, 0, 0, 1, 1, 1. Now looking at the value 1, one can not determine if it
corresponds to age 5, 6 or 7 years.
Moreover, if a quantitative explanatory variable is grouped into m categories, then (m 1) parameters are
required whereas if the original variable is used as such, then only one parameter is required.
Treating a quantitative variable as a qualitative variable increases the complexity of the model. The degrees
of freedom for error is also reduced. This can affect the inferences if the data set is small. In large data sets,
such an effect may be small.
The use of indicator variables does not require any assumption about the functional form of the relationship
between study and explanatory variables.
The specification of a linear regression model consists of a formulation of the regression relationships and of
statements or assumptions concerning the explanatory variables and disturbances. If any of these is violated,
e.g., incorrect functional form, the improper introduction of disturbance term in the model, etc., then
specification error occurs. In a narrower sense, the specification error refers to explanatory variables.
The complete regression analysis depends on the explanatory variables present in the model. It is understood
in the regression analysis that only correct and important explanatory variables appear in the model. In
practice, after ensuring the correct functional form of the model, the analyst usually has a pool of explanatory
variables which possibly influence the process or experiment. Generally, all such candidate variables are not
used in the regression modeling, but a subset of explanatory variables is chosen from this pool.
While choosing a subset of explanatory variables, there are two possible options:
1. In order to make the model as realistic as possible, the analyst may include as many as
possible explanatory variables.
2. In order to make the model as simple as possible, one may include only fewer number of
explanatory variables.
X X1 X 2 and 1 2 .
nk
nr n( k r ) r1 ( k r )1)
The model y X , E ( ) 0, V ( ) 2 I can be expressed as
y X 11 X 2 2
After dropping the r explanatory variable in the model, the new model is
y X 11
Thus
E (b1F 1 ) ( X 1' X 1 ) 1 E ( )
which is a linear function of 2 , i.e., the coefficients of excluded variables. So b1F is biased, in general. The
yH1 y ( X 11 X 2 2 ) H1 ( X 2 2 )
( 2' X 2 H1' H1 X 2 2 2' X 2' H1 2' X 2 H1' X 2 2 1' X 1' H1 ' H1' X 2 2 ' H1 ).
1
E (s 2 ) E ( 2' X 2' H1 X 2 2 ) 0 0 E ( ' H )
nr
1
2' X 2' H1 X 2 2 ) (n r ) 2
nr
1
2 2' X 2' H1 X 2 2 .
nr
Thus s 2 is a biased estimator of 2 and s 2 provides an overestimate of 2 . Note that even if X 1' X 2 0,
then also s 2 gives an overestimate of 2 . So the statistical inferences based on this will be faulty. The t -test
and confidence region will be invalid in this case.
If the response is to be predicted at x ' ( x1' , x2' ), then using the full model, the predicted value is
Thus ŷ1 is a biased predictor of y . It is unbiased when X 1' X 2 0. The MSE of predictor is
Also
Var ( yˆ ) MSE ( yˆ1 )
X ' Z ( Z ' Z ) 1 Z ' XbF X ' Z ( Z ' Z ) 1 Z ' ZcF X ' Z ( Z ' Z ) 1 Z ' y. (3)
bF ( X ' H Z X ) 1 X ' H Z y
( X ' H Z X ) 1 X ' H Z ( X )
( X ' H Z X ) 1 X ' H Z .
Thus
E (bF ) ( X ' H Z X ) 1 X ' H Z E ( ) 0
so bF is unbiased even when some irrelevant variables are added to the model.
V (bF ) E bF bF
1
2 X ' H Z X X ' H Z IH Z X X ' H Z X
1 1
2 X ' HZ X .
1
with E (bT )
Result: If A and B are two positive definite matrices then A B is at least positive semi-definite if
Let
A ( X ' H Z X ) 1
B ( X ' X ) 1
B 1 A1 X ' X X ' H Z X
X ' X X ' X X ' Z ( Z ' Z ) 1 Z ' X
X ' Z ( Z ' Z ) 1 Z ' X
which is at least a positive semi-definite matrix. This implies that the efficiency declines unless X ' Z 0. If
X ' Z 0, i.e., X and Z are orthogonal, then both are equally efficient.
The residual sum of squares under the false model is
SSres eF' eF
where
eF y XbF ZCF
bF ( X ' H Z X ) 1 X ' H Z y
cF ( Z ' Z ) 1 Z ' y ( Z ' Z ) 1 Z ' XbF
( Z ' Z ) 1 Z '( y XbF )
( Z ' Z ) 1 Z ' I X ( X ' H Z X ) 1 X ' H z y
( Z ' Z ) 1 Z ' H XZ y
H Z I Z ( Z ' Z ) 1 Z '
H Zx I X ( X ' H Z X ) 1 X ' H Z
2
H ZX H ZX : idempotent.
A fundamental assumption in regression modeling is that the pattern of data on dependent and independent
variables remains the same throughout the period over which the data is collected. Under such an
assumption, a single linear regression model is fitted over the entire data set. The regression model is
estimated and used for prediction assuming that the parameters remain same over the entire time period of
estimation and prediction. When it is suspected that there exists a change in the pattern of data, then the
fitting of single linear regression model may not be appropriate, and more than one regression models may
be required to be fitted. Before taking such a decision to fit a single or more than one regression models, a
question arises how to test and decide if there is a change in the structure or pattern of data. Such changes
can be characterized by the change in the parameters of the model and are termed as structural change.
Now we consider some examples to understand the problem of structural change in the data. Suppose the
data on the consumption pattern is available for several years and suppose there was a war in between the
years over which the consumption data is available. Obviously, the consumption pattern before and after the
war does not remain the same as the economy of the country gets disturbed. So if a model
yi 0 1 X i1 ... k X ik i , i 1, 2,..., n
is fitted then the regression coefficients before and after the war period will change. Such a change is
referred to as a structural break or structural change in the data. A better option, in this case, would be to fit
two different linear regression models- one for the data before the war and another for the data after the war.
In another example, suppose the study variable is the salary of a person, and the explanatory variable is the
number of years of schooling. Suppose the objective is to find if there is any discrimination in the salaries of
males and females. To know this, two different regression models can be fitted-one for male employees and
another for females employees. By calculating and comparing the regression coefficients of both the models,
one can check the presence of sex discrimination in the salaries of male and female employees.
Consider another example of structural change. Suppose an experiment is conducted to study certain
objectives and data is collected in the USA and India. Then a question arises whether the data sets from both
the countries can be pooled together or not. The data sets can be pooled if they originate from the same
model in the sense that there is no structural change present in the data. In such case, the presence of
Econometrics | Chapter 12 | Tests for Structural Change and Stability | Shalabh, IIT Kanpur
1
structural change in the data can be tested and if there is no change, then both the data sets can be merged
and single regression model can be fitted. If structural change is present, then two models are needed to be
fitted.
The objective is now how to test for the presence of a structural change in the data and stability of regression
coefficients. In other words, we want to test the hypothesis that some of or all the regression coefficients
differ in different subsets of data.
Analysis
We consider here a situation where only one structural change is present in the data. The data, in this case, be
divided into two parts. Suppose we have a data set of n observations which is divided into two parts
consisting of n1 and n2 observations such that
n1 n2 n.
where is a n 1 vector with all elements unity, is a scalar denoting the intercept term, X is a n k
X
1 , X 1 , 1
2 X2 2
where the orders of 1 is n1 1 , 2 is n2 1 , X 1 is n1 k , X 2 is n2 k , 1 is n1 1 and
2 is n2 1 .
Based on this partitions, the two models corresponding to two subgroups are
y1 1 X 1 1
y2 2 X 2 2 .
Econometrics | Chapter 12 | Tests for Structural Change and Stability | Shalabh, IIT Kanpur
2
In matrix notations, we can write
y X 1 1
Model (1) : 1 1
y2 2 X 2 2
In this case, the intercept terms and regression coefficients remain the same for both the submodels. So there
is no structural change in this situation.
The problem of structural change can be characterized if intercept terms and/or regression coefficients in the
submodels are different.
If the structural change is caused due to change in the intercept terms only then the situation is characterized
by the following model:
y1 11 X 1 1
y2 2 2 X 2 2
or
y 0 X 1 1 1
Model (2) : 1 1 .
y2 0 2 X 2 2 2
If the structural change is due to different intercept terms as well as different regression coefficients, then the
model is
y1 11 X 11 1
y2 2 2 X 2 2 2
or
1
y1 1 0 X 1 0 2 1
Model (3) : .
y2 0 2 0 X 2 1 2
2
The test of hypothesis related to the test of structural change is conducted by testing anyone of the null
hypothesis depending upon the situation
(I) H 0 : 1 2
(II) H 0 : 1 2
(III) H 0 : 1 2 , 1 2 .
Econometrics | Chapter 12 | Tests for Structural Change and Stability | Shalabh, IIT Kanpur
3
To construct the test statistic, apply ordinary least squares estimation to models (1), (2) and (3) and obtain
the residual sum of squares as RSS1 , RSS2 and RSS3 respectively.
The null hypothesis H 0 : 1 2 i.e., different intercept terms is tested by the statistic
F
RSS1 RSS2 /1
RSS2 /(n k 2)
which follows F 1, n k 2 under H 0 . This statistic tests 1 2 for model (2) using model (1), i.e.,
F
RSS2 RSS3 / k
RSS3 /(n 2k 2)
which follows F k , n 2k 2 under H 0 . This statistic tests 1 2 from the model (3) using the model
The test of the null hypothesis H 0 : 1 2 , 1 2 , i.e., different intercepts and different slope parameters
F
RSS1 RSS3 / k 1
RSS3 /(n 2k 2)
which follows F k 1, n 2k 2 under H 0 . This statistic tests jointly 1 2 and 1 2 for model (3)
using model (1), i.e., model (1) contrasted with model (3). This test is known as Chow test. It requires
n1 k and n2 k for the stability of regression coefficients in the two models. The development of this test
Econometrics | Chapter 12 | Tests for Structural Change and Stability | Shalabh, IIT Kanpur
4
Development of Chow test:
Consider the models
y X 1 1 1
1 (i )
n1 1 n1 p p1 n1 1
y X 2 2 2
2 (ii )
n2 1 n2 p p1 n2 1
y X (iii )
n1 n p p1 n1
n n1 n2
where p k 1 which includes k explanatory variables and an intercept term.
Define
H1 I1 X 1 X 1' X 1 X 1'
1
H 2 I 2 X 2 X 2' X 2 X 2'
1
H I X X 'X X '
1
where I1 and I 2 are the identity matrices of the orders n1 n1 and n2 n2 . The residual sums of squares
Note that H1* H 2* 0 which implies that RSS1 and RSS2 are independently distributed.
Econometrics | Chapter 12 | Tests for Structural Change and Stability | Shalabh, IIT Kanpur
5
We can write
H I X X 'X X '
1
I 0 X1
X ' X X1 X 2
1
1 ' '
0 I2 X 2
H H12
11
H 21 H 22
H 21 X 2 X ' X X 1'
1
H 22 I 2 X 2 X ' X X 2' .
1
Define
H * H1* H 2*
so that
RSS3 ' H *
RSS1 ' H .
Note that H H * and H * are idempotent matrices. Also H H * H * 0. First, we see how this result
holds.
Consider
H H1 H12 H1 0
H H H*
1
*
1 11
H 22 0 0
H 21
( H H1 ) H1 0
11 .
H 21 H1 0
H 21 H1 0
H11 H1 0.
Econometrics | Chapter 12 | Tests for Structural Change and Stability | Shalabh, IIT Kanpur
6
Also, since H1 is idempotent, it follows that
H H H
11 1 1 0.
Thus H H H *
1
*
1 0
or HH1* H1* .
H H H *
2
*
2 0
or HH 2* H 2* .
Thus this implies that
H H H H
*
1
*
2
*
1 H 2* 0
or H H H 0.
* *
Also, we have
tr H n p
tr H * tr H1* tr H 2*
n1 p n2 p
n2p
n 2k 2.
RSS1 RSS3
~ p2 ,
2
RSS3
~ (2n 2 p ) .
2
Econometrics | Chapter 12 | Tests for Structural Change and Stability | Shalabh, IIT Kanpur
7
Limitations of these tests
1. All tests are based under the assumption that 2 remains the same. So first the stability of 2 should
be checked and then these tests can be used.
2. It is assumed in these tests that the point of change is exactly known. In practice, it is difficult to find
such a point at which the change occurs. It is more difficult to know such point when the change
occurs slowly. These tests are not applicable when the point of change is unknown. An ad-hoc
technique when the point of change is unknown is to delete the data of transition period.
3. When there are more than one points of structural change, then the analysis becomes difficult.
Econometrics | Chapter 12 | Tests for Structural Change and Stability | Shalabh, IIT Kanpur
8
Chapter 13
Asymptotic Theory and Stochastic Regressors
The nature of explanatory variable is assumed to be non-stochastic or fixed in repeated samples in any
regression analysis. Such an assumption is appropriate for those experiments which are conducted inside the
laboratories where the experimenter can control the values of explanatory variables. Then the repeated
observations on study variable can be obtained for fixed values of explanatory variables. In practice, such
an assumption may not always be satisfied. Sometimes, the explanatory variables in a given model are the
study variable in another model. Thus the study variable depends on the explanatory variables that are
stochastic in nature. Under such situations, the statistical inferences drawn from the linear regression model
based on the assumption of fixed explanatory variables may not remain valid.
We assume now that the explanatory variables are stochastic but uncorrelated with the disturbance term. In
case, they are correlated then the issue is addressed through instrumental variable estimation. Such a
situation arises in the case of measurement error models.
regression coefficients and ε is the ( n ×1) vector of disturbances. Under the assumption
V ( ε ) σ 2 I , the distribution of ε i , conditional on xi' , satisfy these properties for all all values of
E (ε ) 0,=
=
Let p ( ε i | xi' ) be the conditional probability density function of ε i given xi' and p ( ε i ) is the unconditional
E ( ε i | xi' ) = ∫ ε i p ( ε i | xi' ) d ε i
= ∫ ε i p (ε i ) dε i
= E (ε i )
=0
Econometrics | Chapter 13 | Asymptotic Theory and Stochastic Regressors | Shalabh, IIT Kanpur
1
E ( ε i2 | xi' ) = ∫ ε i2 p ( ε i | xi' ) d ε i
= ∫ ε i2 p ( ε i ) d ε i
= E ( ε i2 )
= σ 2.
with respect β as
joint probability density function ε and X can be derived from the joint probability density function of y
and X as follows:
(
= ∏ f ( yi | xi' ) f ( xi' ) )
n
i =1
= ∏ f ( yi , xi' )
n
i =1
Econometrics | Chapter 13 | Asymptotic Theory and Stochastic Regressors | Shalabh, IIT Kanpur
2
This implies that the maximum likelihood estimators of β and σ 2 will be based on
∏ f ( y | x ) = ∏ f (ε )
n n
'
i i i
=i 1 =i 1
so they will be same as based on the assumption that ε i ' s, i = 1, 2,..., n are distributed as N ( 0, σ 2 ) . So the
maximum likelihood estimators of β and σ 2 when the explanatory variables are stochastic are obtained as
β = ( X ' X ) X ' y
−1
σ 2 =( y − X β )′ ( y − X β ) .
1
n
Note: Note that the vector x ' is represented by an underscore in this section to denote that it ‘s order is
Let xi' , i = 1, 2,..., n are from a multivariate normal distribution with mean vector µ x and covariance matrix
y µ y σ yy Σ yx
' ~ N µ , .
xi x Σ xy Σ xx
where xi' is a 1× ( k − 1) vector of observation of random vector x, β 0 is the intercept term and β1 is the
( k − 1) × 1 vector of regression coefficients. Further ε i is disturbance term with ε i ~ N ( 0, σ 2 ) and is
independent of x ' .
Econometrics | Chapter 13 | Asymptotic Theory and Stochastic Regressors | Shalabh, IIT Kanpur
3
Suppose
y µ y σ yy Σ yx
~ N µ , Σ .
Σ xx
x x xy
1 y − µ y ' y − µ y
1 −1
f ( y, x=
') exp − Σ .
k 1
x − µ x − µ
( 2π ) 2 Σ2
2 x x
−11 1 −Σ yx Σ −xx1
Σ = 2 −1 ,
σ −Σ xx Σ xy σ Σ xx + Σ xx Σ xy Σ yx Σ xx
2 −1 −1 −1
where
σ=
2
σ yy − Σ yx Σ −xx1Σ xy .
Then
f ( y, x ')
=
( 2π )
1
k
2
1
Σ2
exp
1
−
2σ 2 y − µ y {
− ( x − µ x ) ' Σ −1
xx Σ xy
2 −1
} .
+ σ ( x − µ x ) ' Σ xx ( x − µ x )
2
The marginal distribution of x ' is obtained by integrating f ( y, x ') over y and the resulting distribution is
1
exp − ( x − µ x ) ' Σ −xx1 ( x − µ x ) .
1
=g ( x ') k −1
2
1
( 2π ) 2 Σ xx 2
Econometrics | Chapter 13 | Asymptotic Theory and Stochastic Regressors | Shalabh, IIT Kanpur
4
The conditional probability density function of y given x ' is
f ( y, x ')
f ( y | x ') =
g ( x ')
1
{( y − µ ) − ( x − µ ) Σ }
2
1 ' −1
= exp − 2 Σ xy
2σ
y x xx
2πσ 2
which is the probability density function of normal distribution with
• conditional mean
E ( y | x ')= µ y + ( x − µ x ) ' Σ −xx1Σ xy and
• conditional variance
| x ') σ yy (1 − ρ 2 )
Var ( y=
where
Σ yx Σ −xx1Σ xy
ρ2 =
σ yy
is the population multiple correlation coefficient.
In the model
y=β 0 + x ' β1 + ε ,
the conditional mean is
E ( yi | xi' ) =β 0 + x ' β1 + E ( ε | x )
= β 0 + x ' β1.
Comparing this conditional mean with the conditional mean of normal distribution, we obtain the
relationship with β 0 and β1 as follows:
β1 =Σ −xx1Σ xy
β=
0 µ y − µ x' β1.
n 1 y − µ ' y − µ y
1 −1 i
= exp ∑ −
Σ .
i y
L nk n
x − µ x − µ x
( 2π ) 2 Σ2 i =1 i
2 x i
Maximizing the log likelihood function with respect to µ y , µ x , Σ xx and Σ xy , the maximum likelihood
Econometrics | Chapter 13 | Asymptotic Theory and Stochastic Regressors | Shalabh, IIT Kanpur
5
1 n
µ y= y= ∑ yi
n i =1
1 n
µ x= x= ∑ xi= ( x2 , x3 ,..., xk )
n i =1
1 n
Σ xx = S xx = ∑
n i =1
xi xi' − nx x '
1 n
Σ xy = S xy = ∑ xi yi − nx
n i =1
y
1
where xi' ( xi 2 , xi 3 ,..., xik ), S xx is [(k -1) × (k -1)] matrix with elements ∑ ( xti − xi )( xtj − x j ) and S xy is
n t
1
[(k -1) ×1] vector with elements ∑ ( xti − xi )( yi − y ).
n t
Based on these estimates, the maximum likelihood estimators of β1 and β 0 are obtained as
β1 = S xx−1S xy
β0= y − x ' β1
β0
β = (X 'X )
−1
= X ' y.
β1
(X 'X ) X 'y−β
−1
b−β
=
= ( X ' X ) X '( X β + ε ) − β
−1
= ( X ' X ) X 'ε .
−1
E ( X ' X ) X ' ε
−1
E (b − β ) =
{
= E E ( X ' X ) X 'ε X
−1
}
= E ( X ' X ) X ' E ( ε )
−1
=0
Econometrics | Chapter 13 | Asymptotic Theory and Stochastic Regressors | Shalabh, IIT Kanpur
6
The covariance matrix of b is obtained as
V ( b ) =E ( b − β )( b − β ) '
= E ( X ' X ) X ' εε ' X ( X ' X )
−1 −1
{
= E E ( X ' X ) X ' εε ' X ( X ' X ) X
−1 −1
}
= E ( X ' X ) X ' E ( εε ') X ( X ' X ) X
−1 −1
= E ( X ' X ) X ' σ 2 X ( X ' X )
−1 −1
= σ 2 E ( X ' X ) .
−1
Thus the covariance matrix involves a mathematical expectation. The unknown σ 2 can be estimated by
e 'e
σˆ 2 =
n−k
=
( y − Xb ) ' ( y − Xb )
n−k
where e= y − Xb is the residual and
E (σˆ 2 ) = E E (σˆ 2 X )
e 'e
= E E X
n−k
= E (σ 2 )
= σ 2.
Note that the OLSE b = ( X ' X ) X ' y involves the stochastic matrix X and stochastic vector y , so b is
−1
not a linear estimator. It is also no more the best linear unbiased estimator of β as in the case when X is
Asymptotic theory:
The asymptotic properties of an estimator concerns the properties of the estimator when sample size n
grows large.
For the need and understanding of asymptotic theory, we consider an example. Consider the simple linear
regression model with one explanatory variable and n observations as
yi =β 0 + β1 xi + ε i , E ( ε i ) =0, Var ( ε i ) =σ 2 , i =1, 2,..., n.
Econometrics | Chapter 13 | Asymptotic Theory and Stochastic Regressors | Shalabh, IIT Kanpur
7
The OLSE of β1 is
n
∑ ( x − x )( y − y )
i i
b1 = i =1
n
∑(x − x )
2
i
i =1
and its variance is
σ2
Var ( b1 ) = .
n
If the sample size grows large, then the variance of b1 gets smaller. The shrinkage in variance implies that
as sample size n increases, the probability density of OLSE b collapses around its mean because Var (b)
becomes zero.
Let there are three OLSEs b1 , b2 and b3 which are based on sample sizes n1 , n2 and n3 respectively such that
n1 < n2 < n3 , say. If c and δ are some arbitrarily chosen positive constants, then the probability that the
value of b lies within the interval β ± c can be made to be greater than (1 − δ ) for a large value of n. This
property is the consistency of b which ensure that even if the sample is very large, then we can be
confident with high probability that b will yield an estimate that is close to β .
Probability in limit
Let βˆn be an estimator of β based on a sample of size n . Let γ be any small positive constant. Then
for large n , the requirement that bn takes values with probability almost one in an arbitrary small
and it is said that βˆn converges to β in probability. The estimator βˆn is said to be a consistent estimator of
β.
A sufficient but not necessary condition for βˆn to be a consistent estimator of β is that
lim E βˆn = β
n →∞
Econometrics | Chapter 13 | Asymptotic Theory and Stochastic Regressors | Shalabh, IIT Kanpur
8
Consistency of estimators
Now we look at the consistency of the estimators of β and σ 2 .
(i) Consistency of b
X 'X
Under the assumption that lim = ∆ exists as a nonstochastic and nonsingular matrix (with finite
n →∞
n
elements), we have
−1
1 X 'X
lim V (b) = σ lim
2
n →∞ n →∞ n
n
1
= σ 2 lim ∆ −1
n →∞ n
= 0.
This implies that OLSE converges to β in quadratic mean. Thus OLSE is a consistent estimator of β .
This also holds true for maximum likelihood estimators also.
Same conclusion can also be proved using the concept of convergence in probability.
The consistency of OLSE can be obtained under the weaker assumption that
X 'X
plim = ∆* .
n
exists and is a nonsingular and nonstochastic matrix and
X 'ε
plim = 0.
n
Since
b−β =
( X ' X ) −1 X ' ε
−1
X ' X X 'ε
= .
n n
So
−1
X 'X X 'ε
plim(b − β ) = plim plim
n n
= ∆*−1.0
= 0.
Thus b is a consistent estimator of β . The same is true for maximum likelihood estimators also.
Econometrics | Chapter 13 | Asymptotic Theory and Stochastic Regressors | Shalabh, IIT Kanpur
9
(ii) Consistency of s2
Now we look at the consistency of s 2 as an estimate of σ 2 . We have
1
s2 = e 'e
n−k
1
= ε ' Hε
n−k
−1
1 k
1 − ε ' ε − ε ' X ( X ' X ) X ' ε
−1
=
n n
k ε 'ε ε ' X X ' X X 'ε
−1 −1
=
1 − − .
n n n n n
ε 'ε 1 n 2
Note that
n
consists of ∑ ε i and {ε i2 , i = 1, 2,..., n} is a sequence of independently and identically
n i =1
distributed random variables with mean σ 2 . Using the law of large numbers
ε 'ε
=σ
2
plim
n
ε ' X X ' X −1 X ' ε ε 'X X ' X
−1
X 'ε
plim = plim plim plim
n n n n n n
= 0.∆*−1.0
=0
(1 − 0) −1 σ 2 − 0
⇒ plim( s 2 ) =
= σ 2.
Thus s 2 is a consistent estimator of σ 2 . The same holds true for maximum likelihood estimates also.
Asymptotic distributions:
Suppose we have a sequence of random variables {α n } with a corresponding sequence of cumulative
density functions { Fn } for a random variable α with cumulative density function F . Then α n converges
in distribution to α if Fn converges to F point wise. In this case, F is called the asymptotic distribution of
αn.
10
Note that
E (α ) : Mean of asymptotic distribution
2
lim E α n − lim E (α n ) : Asymptotic variance.
n →∞ n →∞
which is constant. Thus the asymptotic distribution of Yn is the distribution of a constant. This is not a
regular distribution as all the probability mass is concentrated at one point. Thus as sample size increases,
the distribution of Yn collapses.
Suppose consider only the one third observations in the sample and find sample mean as
n
3
3
Yn* = ∑ Yi .
n i =1
Then E (Yn* ) = Y
n
Econometrics | Chapter 13 | Asymptotic Theory and Stochastic Regressors | Shalabh, IIT Kanpur
11
Thus plim Yn* = Y and Yn* has the same degenerate distribution as Yn . Since Var (Yn* ) > Var (Yn ) , so Yn*
is preferred over Yn .
Now we observe the asymptotic behaviour of Yn and Yn* . Consider a sequence of random variables {α n }.
αn
= n (Yn − Y )
α n*
= n (Yn* − Y )
E (=
αn ) n E (Yn =
−Y ) 0
E (=
α n* ) n E (Yn* =
−Y ) 0
σ2
Var (α n ) = nE (Yn − Y ) = n
2
= σ2
n
3σ 2
Var (α n =
) nE (Yn − Y ) = n n = 3σ 2 .
* * 2
• Yn* is N ( 0,3σ 2 ) .
So now Yn is preferable over Yn* . The central limit theorem can be used to show that α n will have an
asymptotically normal distribution even if the population is not normally distributed.
Also, since
n (Yn − Y ) ~ N ( 0, σ 2 )
n (Yn − Y )
⇒Z = ~ N ( 0,1)
σ
and this statement holds true in finite sample as well as asymptotic distributions.
Econometrics | Chapter 13 | Asymptotic Theory and Stochastic Regressors | Shalabh, IIT Kanpur
12
X 'X
The asymptotic covariance matrix of b under the assumption that lim = Σ xx exists and is nonsingular.
n →∞ n
It is given by
−1
1 X 'X
σ lim ( X ' X ) = σ lim lim
2 2
n →∞
n →∞ n n →∞
n
= σ 2 .0.Σ −xx1
=0
which is a null matrix.
Consider the asymptotic distribution of n ( b − β ) . Then even if ε is not necessarily normally distributed,
then asymptotically
n ( b − β ) ~ N ( 0, σ 2 Σ −xx1 )
n ( b − β ) ' Σ xx ( b − β )
~ χ k2 .
σ2
X 'X
If is considered as an estimator of Σ xx , then
n
X 'X
n (b − β ) ' (b − β ) (b − β ) ' X ' X (b − β )
n =
σ2 σ2
(
is the usual test statistic as is in the case of finite samples with b ~ N β , σ 2 ( X ' X )
−1
).
Econometrics | Chapter 13 | Asymptotic Theory and Stochastic Regressors | Shalabh, IIT Kanpur
13
Chapter 14
Stein-Rule Estimation
The ordinary least squares estimation of regression coefficients in linear regression model provides the
estimators having minimum variance in the class of linear and unbiased estimators. The criterion of
linearity is desirable because such estimators involve less mathematical complexity, they are easy to
compute, and it is easier to investigate their statistical properties. The criterion of unbiasedness is
attractive because it is intuitively desirable to have an estimator whose expected value, i.e., the mean of
the estimator should be the same as the parameter being estimated. Considerations of linearity and
unbiased estimators sometimes may lead to an unacceptably high price to be paid in terms of the
variability around the true parameter. It is possible to have a nonlinear estimator with better properties. It
is to be noted that one of the main objectives of estimation is to find an estimator whose values have high
concentration around the true parameter. Sometimes it is possible to have a nonlinear and biased estimator
that has smaller variability than the variability of a best linear unbiased estimator of the parameter under
some mild restrictions.
the ordinary least squares estimator (OLSE) of is b X ' X X ' y which is the best linear unbiased
1
estimator of in the sense that it is linear in y, E b and b has smallest variance among all linear
V b E b b ' 2 X ' X .
1
E ˆ 'W ˆ wij E ˆi i
i j
ˆ j j
where W is k k fixed positive definite matrix of weights wij . The two popular choices of weight matrix
W are
(i)
W is an identity matrix, i.e. W I then E ˆ ' ˆ is called as the total mean squared
error (MSE) of ̂ .
E ˆ ' X ' X ˆ E X ˆ X ' X ˆ X
is called as the predictive mean squared error of ̂ . Note that E y X ˆ is the predictor of
average value E y X and X ˆ X is the corresponding prediction error.
There can be other choices of W and it depends entirely on the analyst how to define the loss function so
that the variability is minimum.
If a random vector with k elements k 2 is normally distributed as N , I , being the mean vector,
then Stein established that if the linearity and unbiasedness are dropped, then it is possible to improve
upon the maximum likelihood estimator of under the criterion of total MSE. Later, this result was
generalized by James and Stein for linear regression model. They demonstrated that if the criteria of
linearity and unbiasedness of the estimators are dropped, then a nonlinear estimator can be obtained which
has better performance than the best linear unbiased estimator under the criterion of predictive MSE. In
other words, James and Stein established that OLSE is inadmissible for k 2 under predictive MSE
E ˆ ' X ' X ˆ E b ' X ' X b
for all values of with strict inequality holding for some values of . For k 2, no such estimator
exists and we say that " b can be beaten” in this sense. Thus it is possible to find estimators which will
beat b in this sense. So a nonlinear and biased estimator can be defined which has better performance
than OLSE. Such an estimator is Stein-rule estimator given by
2
ˆ 1 c b when 2 is known
b ' X ' Xb
and
e 'e
ˆ 1 c b when 2 is unknown.
b ' X ' Xb
Here c is a fixed positive characterizing scalar, e ' e is the residuum sum of squares based on OLSE and
e y Xb is the residual. By assuming different values to c , we can generate different estimators. So a
class of estimators characterized by c can be defined. This is called as a family of Stein-rule estimators.
b1 , b2 ,..., bk , respectively. So in order to increase the efficiency, the OLSE is multiplied by a constant
. Thus is called the shrinkage factor. As Stein-rule estimators attempt to shrink the components of
b towards zero, so these estimators are known as shrinkage estimators.
First, we discuss a result which is used to prove the dominance of Stein-rule estimator over OLSE.
Result: Suppose a random vector Z of order k 1 is normally distributed as N , I where is the
E ˆ E b c 2 E
b
b ' X ' Xb
(In general, a non-zero quantity)
0,
in general.
PR ˆ E ˆ ' X ' X ˆ .
The Stein-rule estimator ˆ is better them OLSE b under the criterion of predictive risk if
PR ˆ PR b .
b X ' X X '
1
2 trI k
2k.
'
c 2 c 2
PR ˆ E b
b X ' X b
b ' X ' Xb b ' X ' Xb
c 2
E b ' X ' X b E b ' X ' X b b ' X ' Xb
b ' X ' Xb
c 2 4
E b ' X ' Xb
b ' X ' Xb
2
c 2 b ' X ' Xb c 2 4
2k 2E E b ' X ' Xb .
b ' X ' Xb
1
X 'X
1/2
or X ' X
1/ 2
and Z ~ N , I , i.e., Z1 , Z 2 ,..., Z k are independent. Substituting these values in the expressions for
PR ˆ , we get
2 2 Z 'Z c 2 2
ˆ
PR k 2 E c
2
2Z ' Z
E 2 Z ' Z
2 Z ' Z c 2 2
k 2 E c
2
E
Z 'Z Z 'Z
Z ' Z 2 2 1
2 k 2c 2 E c E
Z 'Z Z 'Z
1
2 k c 2 2 k 2 c E using the result
Z 'Z
1
PR b c 2 2 k 2 c E .
Z 'Z
Thus
PR ˆ PR b
if and only if
1
c 2 2 k 2 c E 0.
Z 'Z
1
E 0
Z 'Z
c 2 k 2 c 0.
or 0 c 2 k 2 provided k 2.
So as long as 0 c 2 k 2 is satisfied, the Stein-rule estimator will have smaller predictive risk them
To find the value of c for which PR ˆ is minimum, we differentiate
1
PR ˆ PR b c 2 2 k 2 c E
Z 'Z
with respect to c and it gives as follows:
d PR b E
d PR ˆ
2 1 d 2 k 2 c c
2
0
dc dc Z 'Z dc
2 k 2 2c 0.
or c k 2 .
Further,
d 2 PR ˆ
2 0.
dc 2
ck 2
The largest gains efficiency arises when c k 2. So if the number of explanatory variables are more
than two, then it is always possible to construct an estimator which is better than OLSE.
The optimum Stein-rule estimator or James-Stein rule estimator of in this case is given by
( p 2) 2
ˆ 1 b when 2 is known.
b ' X ' Xb
( p 2) 2 ( p 2) 2
1 when 0 1
b ' X ' Xb
b
ˆ b ' X ' Xb
( p 2) 2
0 when 1.
b ' X ' Xb
A basic assumption in analyzing the performance of estimators in multiple regression is that the explanatory
variables and disturbance terms are independently distributed. The violation of such assumption disturbs the
optimal properties of the estimators. The instrumental variable estimation method helps in estimating the
regression coefficients in the multiple linear regression model when such violation occurs.
Suppose one or more explanatory variables is correlated with the disturbances in the limit, then we can write
1
plim X ' 0.
n
The consequences of such an assumption on ordinary least squares estimator are as follows:
X ' X X ' X
1
b X ' X X '
1
1
X ' X X '
n n
1
X 'X X '
plim b plim plim
n n
0
X 'X
assuming plim XX exists and is nonsingular. Consequently plim b and thus the OLSE
n
becomes an inconsistent estimator of .
To overcome this problem and to obtain a consistent estimator of , the instrumental variable estimation can
be used.
Z'X
(i) plim ZX is a finite and nonsingular matrix of full rank. This interprets that the variables in
n
Z are correlated with those in X , in the limit.
Z '
(ii) plim 0,
n
i.e., the variables in Z are uncorrelated with , in the limit.
Z 'Z
(iii) plim ZZ exists.
n
If some of X variables are likely to be uncorrelated with , then these can be used to form some of the
columns of Z and extraneous variables are found only for the remaining columns.
First, we understand the role of the term X ' in the OLS estimation. The OLSE b of is derived by
solving the equation
y X ' y X
0
or X ' y X ' Xb
or X ' y Xb 0.
Let
X 'X
plim XX
n
X 'y
plim Xy
n
where population cross moments XX and Xy are finite, XX is finite and nonsingular.
X '
plim 0,
n
then
XX1 Xy .
X 'X X 'y
If XX is estimated by sample cross moment and Xy is estimated by sample cross moment ,
n n
then the OLS estimator of is obtained as
1
X 'X X 'y
b
n n
X ' X X ' y.
1
Such an analysis suggests to use Z to pre-multiply the multiple regression model as follows:
which is termed as an instrumental variable estimator of and this method is called an instrumental
variable method.
Since
Z ' X Z '
1
plim ˆIV plim Z ' X Z '
1
Z ' X 1 Z '
plim
n n
1
Z'X Z '
plim plim
n n
ZX1 .0
plimˆIV .
Thus the instrumental variable estimator is consistent. Note that the variables Z1 , Z 2 ,..., Z k in Z are chosen
such that they are uncorrelated with and correlated with X , at least asymptotically, so that the second
order moment matrix ZX exists and is nonsingular.
Econometrics | Chapter 15 | Instrumental Variables Estimation | Shalabh, IIT Kanpur
4
Asymptotic distribution:
The asymptotic distribution of
1
1
n ˆIV Z ' X
n
1
n
Z '
is normal with mean vector 0 and the asymptotic covariance matrix is given by
Z ' X 1 1 X 'Z
1
ˆ
AsyVar IV plim Z ' E ' Z
n n n
Z ' X 1 Z ' Z X ' Z 1
2 plim
n n n
1 1
Z'X Z 'Z X 'Z
2 plim plim plim
n n n
2 XZ1 ZZ ZX1 .
For a large sample,
1
2
V ˆIV XZ ZZ ZX1
n
which can be estimated by
ˆ ˆ 1 ˆ ˆ 1
2
Vˆ ˆIV XZ ZZ ZX .
n
s2
Z ' X Z 'Z X 'Z
1 1
n
where
1
s2 y Xb '( y Xb),
nk
b X ' X X ' y.
1
The variance of ˆIV is not necessarily a minimum asymptotic variance because there can be more than one
sets of instrumental variables that fulfil the requirement of being uncorrelated with and correlated with
stochastic regressors.
A fundamental assumption in all the statistical analysis is that all the observations are correctly measured. In
the context of multiple regression model, it is assumed that the observations on the study and explanatory
variables are observed without any error. In many situations, this basic assumption is violated. There can be
several reasons for such a violation.
For example, the variables may not be measurable, e.g., taste, climatic conditions, intelligence,
education, ability etc. In such cases, the dummy variables are used, and the observations can be
recorded in terms of values of dummy variables.
Sometimes the variables are clearly defined, but it is hard to take correct observations. For example,
the age is generally reported in complete years or in multiple of five.
Sometimes the variable is conceptually well defined, but it is not possible to take a correct
observation on it. Instead, the observations are obtained on closely related proxy variables, e.g., the
level of education is measured by the number of years of schooling.
Sometimes the variable is well understood, but it is qualitative in nature. For example, intelligence is
measured by intelligence quotient (IQ) scores.
In all such cases, the true value of the variable can not be recorded. Instead, it is observed with some error.
The difference between the observed and true values of the variable is called as measurement error or
errors-in-variables.
where y is a n 1 vector of true observation on study variable, X is a n k matrix of true observations
on explanatory variables and is a k 1 vector of regression coefficients. The value y and X are not
observable due to the presence of measurement errors. Instead, the values of y and X are observed with
additive measurement errors as
y y u
X X V
where y is a n 1 vector of observed values of study variables which are observed with (n 1)
which are observed with n k matrix V of measurement errors in X . In such a case, the usual disturbance
term can be assumed to be subsumed in u without loss of generality. Since our aim is to see the impact of
measurement errors, so it is not considered separately in the present case.
We assume that
E (u ) 0, E (uu ') 2 I
E (V ) 0, E (V 'V ) , E (V ' u ) 0.
Suppose we ignore the measurement errors and obtain the OLSE. Note that ignoring the measurement errors
in the data does not mean that they are not present. We now observe the properties of such an OLSE under
the setup of measurement error model.
b X ' X X ' X
1
X ' X X '
1
E b E X ' X X '
1
X ' X X ' E ( )
1
0
as X is a random matrix which is correlated with . So b becomes a biased estimator of .
1 1 1
X ' X ' V '
n n n
1 1
X '(u V ) V '(u V )
n n
0.
Thus b is an inconsistent estimator of . Such inconsistency arises essentially due to correlation between
X and .
Note: It should not be misunderstood that the OLSE b X ' X X ' y is obtained by minimizing
1
S ' y X ' y X in the model y X . In fact ' cannot be minimized as in the case of
To see the nature of consistency, consider the simple linear regression model with measurement error as
yi 0 1 xi , i 1, 2,..., n
yi yi ui
xi xi vi .
Now
1 x1 1 x1 0 v1
1 x2 1 x2 0 v2
X , X , V
1 xn 1 xn 0 vn
and assuming that
1 n
plim xi
n i 1
1 n
plim ( xi ) 2 x2 ,
n i 1
we have
Also,
1
vv plim V 'V
n
0 0
2
.
0 v
Now
plim b xx vv vv
1
1
b 0 1 0 0 0
plim 0 2 2
b1 1 x v 0 v 1
2 2
1 x2 2 v2 0
2
x v2 2
1 v
v2
2 1
v x2
.
2
2 v 2 1
x v
Thus we find that the OLSEs of 0 and 1 are biased and inconsistent. So if a variable is subjected to
measurement errors, it not only affects its own parameter estimate but also affect other estimator of
parameter that are associated with those variable which are measured without any error. So the presence of
measurement errors in even a single variable not only makes the OLSE of its own parameter inconsistent but
also makes the estimates of other regression coefficients inconsistent which are measured without any error.
1. Functional form: When the xi ' s are unknown constants (fixed), then the measurement error model is
2. Structural form: When the xi ' s are identically and independently distributed random variables, say, with
mean and variance 2 2 0 , the measurement error model is said to be in the structural form.
3. Ultrastructural form: When the xi ' s are independently distributed random variables with different
means, say i and variance 2 2 0 , then the model is said to be in the ultrastructural form. This form is
a synthesis of function and structural forms in the sense that both the forms are particular cases of
ultrastructural form.
Z ' X Z ' X
1
1
plim ˆIV 1 1
plim Z ' X plim Z '
n n
1
ZX .0
0.
Any instrument that fulfils the requirement of being uncorrelated with the composite disturbance term and
correlated with explanatory variables will result in a consistent estimate of parameter. However, there can be
various sets of variables which satisfy these conditions to become instrumental variables. Different choices
of instruments give different consistent estimators. It is difficult to assert that which choice of instruments
Econometrics | Chapter 16 | Measurement Error Models | Shalabh, IIT Kanpur
8
will give an instrumental variable estimator having minimum asymptotic variance. Moreover, it is also
difficult to decide that which choice of the instrumental variable is better and more appropriate in
comparison to other. An additional difficulty is to check whether the chosen instruments are indeed
uncorrelated with the disturbance term or not.
Choice of instrument:
We discuss some popular choices of instruments in a univariate measurement error model. Consider the
model
yi 0 1 xi i , i ui 1vi , i 1, 2,..., n.
A variable that is likely to satisfy the two requirements of an instrumental variable is the discrete grouping
variable. The Wald’s, Bartlett’s and Durbin’s methods are based on different choices of discrete grouping
variables.
1. Wald’s method
Find the median of the given observations x1 , x2 ,..., xn . Now classify the observations by defining an
Another group with those xi ' s above the median of x1 , x2 ,..., xn . Find the means of yi ' s and
n
n xi n nx
Z'X n
i 1
n
n
0 x2 x1
Z i Z i xi 2
i 1 i 1
n
yi ny
Z'y n
i 1
n
y2 y1
Z i yi 2
i 1
1
ˆ0 IV n nx ny
n
ˆ 0 x2 x1 n y2 y1
1IV 2 2
x2 x1 y
2 2 x
x2 x1 0 y2 y1
1 2
y2 y1
y x
x2 x1
y2 y1
x2 x1
y y
ˆ1IV 2 1
x2 x1
y2 y1
ˆ0 IV y ˆ
x y 1IV x .
x2 x
If n is odd, then the middle observations can be deleted. Under fairly general conditions, the estimators are
consistent but are likely to have large sampling variance. This is the limitation of this method.
order. Now three groups can be formed, each containing n / 3 observations. Define the instrumental variable
as
1 if observation is in the top group
Z i 0 if observation is in the middle group
1 if observation is in the bottom group.
Now discard the observations in the middle group and compute the means of yi ' s and xi ' s in
Substituting the values of X and Z in ˆIV Z ' X Z ' y and on solving, we get
1
y3 y1
ˆ1IV
x3 x1
ˆ0 IV y ˆ1IV x .
These estimators are consistent. No conclusive pieces of evidence are available to compare Bartlett’s method
and Wald’s method but three grouping method generally provides more efficient estimates than two
grouping method is many cases.
3. Durbin’s method
Let x1 , x2 ,..., xn be the observations. Arrange these observations in an ascending order. Define the
instrumental variable Z i as the rank of xi . Then substituting the suitable values of Z and X in
Z y y
i i
ˆ 1IV i 1
n
.
Z x x
i 1
i i
Since the estimator uses more information, it is believed to be superior in efficiency to other grouping
methods. However, nothing definite is known about the efficiency of this method.
In general, the instrumental variable estimators may have fairly large standard errors in comparison to
ordinary least square estimators which is the price paid for inconsistency. However, inconsistent estimators
have little appeal.
Assume
2 if i j
E ui 0, E ui u j u
0 if i j ,
v2 if i j
E vi 0, E vi v j
0 if i j,
E uV
i j 0 for all i 1, 2,..., n; j 1, 2,..., n.
For the application of the method of maximum likelihood, we assume the normal distribution for ui and vi .
We consider the estimation of parameters in the structural form of the model in which xi ' s are stochastic. So
assume
xi ~ N , 2
E xi vi
2
2 v2
E yi 0 1 E xi
0 1 .
Var yi E yi E ( yi )
2
E 0 1 xi ui 0 1
2
12 2 u2
1 2 0 0 0
1 2 .
So
1
n /2
i i x
x yi 0 1 xi
2 2
exp i 1 exp i 1 .
2 u v 2 v2
2
2 u
Econometrics | Chapter 16 | Measurement Error Models | Shalabh, IIT Kanpur
13
The log-likelihood is
n n
xi xi y 0 1 xi
2 2
i
L* ln L constant
n
2
ln u2 ln v2 i 1
2 u2
i 1
2 v2
.
The normal equations are obtained by equating the partial differentiations equals to zero as
L * 1 n
(1) yi 0 1 xi 0
0 v2 i 1
L * 1 n
(2) xi yi 0 1 xi 0
1 v2 i 1
L * 1
(3) 2 xi xi 2 yi 0 1 xi 0, i 1, 2,..., n
xi u v
L * n 1 n
4 i
x xi
2
(4)
u
2
2 u 2 v i 1
2
L * n 1 n
4 i
y 0 1 xi .
2
(5)
v
2
2 v 2 v i 1
2
These are n 4 equations in n 4 parameters but summing equation (3) over i 1, 2,..., n and using
These equations can be used to estimate the two means and 0 1 , two variances and one covariance.
The six parameters , 0 , 1 , u2 , v2 and 2 can be estimated from the following five structural relations
derived from these normal equations
(i ) x
(ii ) y 0 1
(iii) mxx 2 v2
(iv) m yy 12 2 u2
(v ) mxy 1 2
1 n 1 n 1 n 1 n 1 n
i i xx n i ( xi x )( yi y ).
2
where x x , y y , m xi x , m yy ( y y ) 2
and m xy
n i 1 n i 1 i 1 n i 1 n i 1
E ( x)
E ( y ) 0 1
Var ( x) 2 v2
Var ( y ) 12 2 u2
Cov( x, y ) 1 2 .
We observe that there are six parameters 0 , 1 , , 2 , u2 and v2 to be estimated based on five structural
equations (i)-(v). So no unique solution exists. Only can be uniquely determined while remaining
parameters can not be uniquely determined. So only is identifiable and remaining parameters are
unidentifiable. This is called the problem of identification. One relation is short to obtain a unique solution,
so additional a priori restrictions relating any of the six parameters is required.
Note: The same equations (i)-(v) can also be derived using the method of moments. The structural
equations are derived by equating the sample and population moments. The assumption of normal
distribution for ui , vi and xi is not needed in case of method of moments.
ˆ0 y ˆ1 x
is estimated if ̂1 is uniquely determined. So we consider the estimation of 1 , 2 , u2 and v2 only. Some
additional information is required for the unique determination of these parameters. We consider now
various type of additional information which are used for estimating the parameters uniquely.
Suppose v2 is known a priori. Now the remaining parameters can be estimated as follows:
mxx 2 v2 ˆ 2 mxx v2
mxy
mxy 1 2 ˆ1
mxx v2
m yy 1 2 u2 ˆ u2 myy ˆ12ˆ 2
mxy2
myy .
mxx v2
Note that ˆ 2 mxx v2 can be negative because v2 is known and mxx is based upon sample. So we assume
Similarly, ˆ u2 is also assumed to be positive under suitable condition. All the estimators ˆ1 , ˆ 2 and ˆ u2 are
the consistent estimators of , 2 and u2 respectively. Note that ˆ1 looks like as if the direct regression
estimator of 1 has been adjusted by v2 for its inconsistency. So it is termed as adjusted estimator also.
2. u2 is known
Suppose u2 is known a priori. Then using mxy 1 2 , we can rewrite
m yy 12 2 u2
mxy 1 u2
m yy u2
ˆ1 ; myy u2
mxy
mxy
ˆ 2
ˆ1
ˆ mxx ˆ 2 .
2
v
The estimators ˆ1 , ˆ 2 and ˆ v2 are the consistent estimators of 1 , 2 and v2 respectively. Note that ̂1 looks
like as if the reverse regression estimator of 1 is adjusted by u2 for its inconsistency. So it is termed as
Consider
m yy 12 2 u2
1 mxy v2 (using (iv))
1 mxy mxx 2 (using (iii))
m
1 mxy mxx xy (using iv)
1
ˆ1
yy yy
2mxy
U
, say.
2mxy
2mxy2
0
U
since mxy2 0, so U must be nonnegative.
ˆ1
yy yy
.
2mxy
Econometrics | Chapter 16 | Measurement Error Models | Shalabh, IIT Kanpur
17
Other estimates are
myy 2ˆ1mxy ˆ12 sxx
ˆ v
2
ˆ 2 1
mxy
ˆ 2 .
ˆ1
Note that the same estimator ˆ1 of 1 can be obtained by orthogonal regression. This amounts to transform
xi by xi / u and yi by yi / v and use the orthogonal regression estimation with transformed variables.
explanatory variable and K x 0, means 2 0 which means the explanatory variable is fixed. A higher
value of K x is obtained when v2 is small, i.e., the impact of measurement errors is small. The reliability
mxx 2 v2
mxy 1 2
mxy 1 2
mxx 2 v2
1 K x
mxy
ˆ1
K x mxx
m
2 xy
1
ˆ 2 K x mxx
mxx 2 v2
ˆ v2 1 K x mxx .
Note that ˆ K 1b
1 x
mxy
where b is the ordinary least squares estimator b .
mxx
Econometrics | Chapter 16 | Measurement Error Models | Shalabh, IIT Kanpur
18
5. 0 is known
Suppose 0 is known a priori and E ( x) 0. Then
y 0 1
y 0
ˆ1
ˆ
y 0
x
m
ˆ 2 xy
ˆ 1
ˆ myy ˆ1mxy
2
u
mxy
ˆ v2 mxx .
ˆ 1
Note: In each of the cases 1-6, note that the form of the estimate depends on the type of available
information which is needed for the consistent estimator of the parameters. Such information can be
available from various sources, e.g., long association of the experimenter with the experiment, similar type
of studies conducted in the part, some external source etc.
unrealistic in the sense that when xi ' s are unobservable and unknown, it is difficult to know if they are fixed
or not. This can not be ensured even in repeated sampling that the same value is repeated. All that can be said
in this case is that the information, in this case, is conditional upon xi ' s . So assume that xi ' s are
1 n
2
yi 0 1 xi
1
n
2 xi xi 2
L 2
exp i 1
2
exp .
2 u 2 u2 2 v 2 v2
The log-likelihood is
n
y 1 xi
2
0
n i
n 1 n
x x .
2
L* ln L constant ln u2 i 1
ln v2 2
2 2
2 v
i i
2 u 2 i 1
The normal equations are obtained by partially differentiating L * and equating to zero as
L * 1 n
(I )
0
0 2
u
y
i 1
i 0 1 xi 0
L * 1 n
( II )
1
0 2
v
y
i 1
i 0 1 xi xi 0
L * n 1 n
y 1 xi 0
2
( III ) 0 2 4
u 2 u 2 u
2 i 0
i 1
L * n 1 n
x x
2
( IV ) 0 2 4 0
v
2
2 v 2 v
i i
i 1
L * 1
(V ) 0 2 yi 0 1 xi 2 ( xi xi ) 0.
xi u v
u i v i
Using the left-hand side of equation (III) and right-hand side of equation (IV), we get
n12 n
2
u v2
u
1
v
Econometrics | Chapter 16 | Measurement Error Models | Shalabh, IIT Kanpur
20
which is unacceptable because can be negative also. In the present case, as u 0 and v 0 , so will
always be positive. Thus the maximum likelihood breaks down because of insufficient information in the
model. Increasing the sample size n does not solve the purpose. If the restrictions like u2 known, v2
u2
known or known are incorporated, then the maximum likelihood estimation is similar to as in the case of
v2
structural form and the similar estimates may be obtained. For example, if u2 / v2 is known, then
substitute it in the likelihood function and maximize it. The same solution as in the case of structural form
are obtained.
Similar to the classification of variables as explanatory variable and study variable in linear regression
model, the variables in simultaneous equation models are classified as endogenous variables and exogenous
variables.
Exogenous variables help is explaining the variations in endogenous variables. It is customary to include past
values of endogenous variables in the predetermined group. Since exogenous variables are predetermined, so
they are independent of disturbance term in the model. They satisfy those assumptions which explanatory
variables satisfy in the usual regression model. Exogenous variables influence the endogenous variables but
Note that in the linear regression model, the explanatory variables influence the study variable but not vice
versa. So relationship is one sided.
The classification of variables as endogenous and exogenous is important because a necessary condition for
uniquely estimating all the parameters is that the number of endogenous variables is equal to the number of
independent equations in the system. Moreover, the main distinction of predetermined variable in estimation
of parameters is that they are uncorrelated with disturbance term in the equations in which they appear.
Example 1:
Now we consider the following example in detail and introduce various concepts and terminologies used in
describing the simultaneous equations models.
Consider a situation of an ideal market where transaction of only one commodity, say wheat, takes place.
Assume that the number of buyers and sellers is large so that the market is a perfectly competitive market. It
is also assumed that the amount of wheat that comes into the market in a day is completely sold out on the
same day. No seller takes it back. Now we develop a model for such mechanism.
Let
dt denotes the demand of the commodity, say wheat, at time t ,
By economic theory about the ideal market, we have the following condition:
d t st , t 1, 2,..., n .
- rainfall ( rt ) at time t .
Our aim is to study the behaviour of st , pt and rt which are determined by the simultaneous equation
model.
Since endogenous variables are influenced by exogenous variables but not vice versa, so
st , pt and rt are endogenous variables.
Now consider an additional variable for the model as lagged value of price pt , denoted as pt 1 . In a market,
generally the price of the commodity depends on the price of the commodity on previous day. If the price of
commodity today is less than the previous day, then buyer would like to buy more. For seller also, today’s
price of commodity depends on previous day’s price and based on which he decides the quantity of
commodity (wheat) to be brought in the market.
Note that the lagged variables are considered as exogenous variable. The updated list of endogenous and
exogenous variables is as follows:
Endogenous variables: pt , dt , st
Exogenous variables: pt 1 , it , rt .
The mechanism of the market is now described by the following set of equations.
demand d t 1 1 pt 1t
supply st 2 2 pt 2t
equilibrium condition d t st qt
where ' s denote the intercept terms, ' s denote the regression coefficients and ' s denote the
disturbance terms.
These equations are called structural equations. The error terms 1 and 2 are called structural
qt 1 1 pt 1t (I)
qt 2 2 pt 2t (II)
So there are only two structural relationships. The price is determined by the mechanism of market and not
by the buyer or supplier. Thus qt and pt are the endogenous variables. Without loss of generality, we can
assume that the variables associated with 1 and 2 are X 1 and X 2 respectively such that
variables.
Each endogenous variable is expressed as a function of the exogenous variable. Note that the exogenous
variable 1 (from X 1 1, or X 2 1) is not clearly identifiable.
The equations (III) and (IV) are called the reduced form relationships and in general, called the reduced
form of the model.
The coefficients 11 and 21 are called reduced form coefficients and errors v1t and v2t are called the
reduced form disturbances. The reduced from essentially express every endogenous variable as a function
of exogenous variable. This presents a clear relationship between reduced form coefficients and structural
coefficients as well as between structural disturbances and reduced form disturbances. The reduced form is
ready for the application of OLS technique. The reduced form of the model satisfies all the assumptions
needed for the application of OLS.
Suppose we apply OLS technique to equations (III) and (IV) and obtained the OLS estimates of 11 and 12
1 2
ˆ11
2 1
ˆ 21 2 1 2 1 .
2 1
Econometrics | Chapter 17 | Simultaneous Equations Models | Shalabh, IIT Kanpur
5
Note that ˆ11 and ˆ 21 are the numerical values of the estimates. So now there are two equations and four
unknown parameters 1 , 2 , 1 and 2 . So it is not possible to derive the unique estimates of parameters of
the model by applying OLS technique to reduced form. This is known as problem of identifications.
dt st qt .
So we have two structural equations model in two endogenous variables qt and pt and one exogenous
variable (value is 1 given by X 1 1, X 2 1) . The set of three equations is reduced to a set of two equations as
follows:
Demand : qt 1 1 pt 1t (1)
Supply: qt 2 2 pt 2t (2)
Before analysis, we would like to check whether it is possible to estimate the parameters 1 , 2 , 1 and 2 or
not.
where 1 (1 ) 2 , 1 (1 ) 2 , t 1t (1 ) 2t and is any scalar lying between 0 and
1.
Comparing equation (3) with equations (1) and (2), we notice that they have same form. So it is difficult to
say that which is supply equation and which is demand equation. To find this, let equation (3) be demand
equation. Then there is no way to identify the true demand equation (1) and pretended demand equation (3).
A similar exercise can be done for the supply equation, and we find that there is no way to identify the true
supply equation (2) and pretended supply equation (3).
Suppose we apply OLS technique to these models. Applying OLS to equation (1) yields
n
p p q q t t
ˆ1 t 1
n
0.6, say
p p
2
t
t 1
1 n 1 n
where p t
n t 1
p , q qt .
n t 1
p p q q
t t
ˆ t 1
n
0.6.
p p
2
t
t 1
Note that ˆ1 and ˆ have same analytical expressions, so they will also have same numerical values, say 0.6.
Looking at the value 0.6, it is difficult to say that the value 0.6 determines equation (1) or (3).
Applying OLS to equation (3) fields
n
p p q q t t
ˆ2 t 1
n
0.6
p p
2
t
t 1
Thus it is difficult to decide and identify whether ˆ1 is determined by the value 0.6 or ˆ2 is determined by
the value 0.6. Increasing the number of observations also does not helps in the identification of these
equations. So we are not able to identify the parameters. So we take the help of economic theory to identify
the parameters.
The economic theory suggests that when price increases then supply increases but demand decreases. So the
plot will look like
and this implies 1 0 and 2 0. Thus since 0.6 0, so we can say that the value 0.6 represents ˆ2 0
and so ˆ2 0.6. But one can always choose a value of such that pretended equation does not violate the
sign of coefficients, say 0. So it again becomes difficult to see whether equation (3) represents supply
equation (2) or not. So none of the parameters is identifiable.
1 2
ˆ11
2 1
2 1
ˆ 21 1 2 .
2 1
There are two equations and four unknowns. So the unique estimates of the parameters 1 , 2 , 1 and 2
cannot be obtained. Thus the equations (1) and (2) can not be identified. Thus the model is not identifiable
and estimation of parameters is not possible.
Suppose a new exogenous variable income it is introduced in the model which represents the income of
buyer at time t . Note that the demand of commodity is influenced by income. On the other hand, the supply
of commodity is not influenced by the income, so this variable is introduced only in the demand equation.
The structural equations (1) and (2) now become
Demand : qt 1 1 pt 1it 1t (6)
Supply: qt 2 2 pt 2t (7)
where 1 is the structural coefficient associate with income. The pretended equation is obtained by
multiplying the equations (6) and (7) by and (1 ) , respectively, and then adding them together. This is
obtained as follows:
qt (1 )qt 1 (1 ) 2 1 (1 ) 2 pt 1it 1t (1 ) 2t
or qt pt it t (8)
Suppose now if we claim that equation (8) is true demand equation because it contains pt and it which
influence the demand. But we note that it is difficult to decide that between the two equations (6) or (8),
which one is the true demand equation.
Suppose now if we claim that equation (8) is the true supply equation. This claim is wrong because income
does not affect the supply. So equation (6) is the supply equation.
Thus the supply equation is now identifiable but demand equation is not identifiable. Such situation is termed
as partial identification.
Econometrics | Chapter 17 | Simultaneous Equations Models | Shalabh, IIT Kanpur
9
Now we find the reduced form of structural equations (6) and (7). This is achieved by first solving equation
(6) for pt and substituting it in equation (7) to obtain an equation in qt . Such an exercise yields the reduced
Applying OLS to equations (9) and (10), we get OLSEs ˆ11 , ˆ 22 , ˆ12 , ˆ 21 . Now we have four equations
((6),(7),(9),(10)) and there are five unknowns (1 , 2 , 1 , 2 , 1 ). So the parameters are not determined
However, here
22
2
11
11
2 21 22 .
22
If ˆ11 , ˆ 22 , ˆ12 and ˆ 21 are available, then ˆ 2 and ˆ2 can be obtained by substituting ˆ ' s in place of ' s. So
2 and 2 are determined uniquely which are the parameters of supply equation. So supply equation is
identified but demand equation is still not identifiable.
Now, as done earlier, we introduce another exogenous variable- rainfall, denoted as rt which denotes the
amount of rainfall at time t . The rainfall influences the supply because better rainfall produces better yield of
wheat. On the other hand, the demand of wheat is not influenced by the rainfall. So the updated set of
structural equations is
Demand : qt 1 1 pt 1it 1t (11)
Supply: qt 2 2 pt 2 rt 2t . (12)
The pretended equation is obtained by adding together the equations obtained after multiplying equation (11)
by and equation (12) by (1 ) as follows:
scalar.
Econometrics | Chapter 17 | Simultaneous Equations Models | Shalabh, IIT Kanpur
10
Now we claim that equation (13) is a demand equation. The demand does not depend on rainfall. So unless
1 so that rt is absent in the model, the equation (13) cannot be a demand equation. Thus equation (11) is
a demand equation. So demand equation is identified.
Now we claim that equation (13) is the supply equation. The supply is not influenced by the income of the
buyer, so (13) cannot be a supply equation. Thus equation (12) is the supply equation. So now the supply
equation is also identified.
The reduced form model from structural equations (11) and (12) can be obtained which have the following
forms:
pt 11 12it 13 rt v1t (14)
qt 21 22it 23 rt v2t . (15)
Application of OLS technique to equations (14) and (15) yields the OLSEs ˆ11 , ˆ12 , ˆ13 , ˆ 21 , ˆ 22 and ˆ 23 . So
now there are six such equations and six unknowns 1 , 2 , 1 , 2 , 1 and 2 . So all the estimates are uniquely
determined. Thus the equations (11) and (12) are exactly identifiable.
Finally, we introduce a lagged endogenous variable pt 1 which denotes the price of the commodity on the
previous day. Since only the supply of wheat is affected by the price on the previous day, so it is introduced
in the supply equation only as
Demand : qt 1 1 pt 1it 1t (16)
Supply: qt 2 2 pt 2 rt 2 pt 1 2t (17)
The pretended equation is obtained by first multiplying equations (16) by and (17) by (1 ) and then
adding them together as follows:
qt (1 )qt pt it rt (1 ) 2 pt 1 t
or qt pt it rt pt 1 t . (18)
Now we claim that equation (18) represents the demand equation. Since rainfall and lagged price do not
affect the demand, so equation (18) cannot be demand equation. Thus equation (16) is a demand equation
and the demand equation is identified.
The reduced form equations from equations (16) and (17) can be obtained as of the following form:
pt 11 12it 13 rt 14 pt 1 v1t (19)
qt 21 22it 23 rt r24 pt 1 v2t . (20)
Applying the OLS technique to equations (19) and (20) gives the OLSEs as
ˆ11 , ˆ12 , ˆ13 , ˆ14 , ˆ 21 , ˆ 22 , ˆ 23 and ˆ 24 . So there are eight equations in seven parameters
1 , 2 , 1 , 2 , 1 , 2 and 2 . So unique estimates of all the parameters are not available. In fact, in this case,
the supply equation (17) is identifiable and demand equation (16) is overly identified (in terms of multiple
solutions).
The whole analysis in this example can be classified into three categories –
(exogenous) variables x1 , x2 ,..., xK . Let there are n observations available on each of the variable and there
are G structural equations connecting both the variables which describe the complete model as follows:
11 y1t 12 y2t ... 1G yGt 11 x1t 12 x2t ... 1K xKt 1t
21 y1t 22 y2t ... 2G yGt 21 x2t 22 x2t ... 2 k xK 2 2t
G1 y1t G 2 y2t ... GG yGt G1 x1t G 2 x2t ... Gk xKt Gt .
This is the reduced form equation of the model where B 1 is the matrix of reduced form coefficients
and vt B 1 t is the reduced form disturbance vectors.
If B is singular, then one or more structural relations would be a linear combination of other structural
relations. If B is non-singular, such identities are eliminated.
Assume that t ' s are identically and independently distributed following N (0, ) and Vt ' s are identically
p ( yt | xt ) p(vt )
t
p ( t )
vt
p t det( B )
t
where is the related Jacobian of transformation and det( B) is the absolute value of the determinant of
vt
B.
Applying a nonsingular linear transformation on structural equation S with a nonsingular matrix D, we get
DByt Dxt D t
or S *: B * yt * xt t* , t 1, 2,..., n
where B* DB, * D, t* D t and structural model S * describes the functioning of this model at time
Also
t* D t
t*
D.
t
Thus
t
p yt xt p t det( B*)
t*
p t det( D 1 ) det( DB)
p t det( D 1 ) det( D) det( B)
p t det( B) .
p
n
L* det( B*)
n *
t
t 1
n
t
det( D) det( B ) p
n n
t *
t 1 t
n
det( D) det( B ) p det( D 1
n n
t )
t 1
L.
Thus both the structural forms S and S * have the same likelihood functions. Since the likelihood functions
form the basis of statistical analysis, so both S and S * have same implications. Moreover, it is difficult to
identify whether the likelihood function corresponds to S and S * . Any attempt to estimate the parameters
will result into failure in the sense that we cannot know whether we are estimating S and S * . Thus S and
S * are observationally equivalent. So the model is not identifiable.
A parameter is said to be identifiable within the model if the parameter has the same value for all equivalent
structures contained in the model.
Given a structure, we can thus find many observationally equivalent structures by non-singular linear
transformation.
The apriori restrictions on B and may help in the identification of parameters. The derived structures may
not satisfy these restrictions and may therefore not be admissible.
The presence and/or absence of certain variables helps in the identifiability. So we use and apply some
apriori restrictions. These apriori restrictions may arise from various sources like economic theory, e.g. it is
known that the price and income affect the demand of wheat but rainfall does not affect it. Similarly, supply
of wheat depends on income, price and rainfall. There are many types of restrictions available which can
solve the problem of identification. We consider zero-one type restrictions.
When the zero-one type restrictions are incorporated in the model, suppose there are G jointly dependent
and K* predetermined variables in S having nonzero coefficients. Rest G G jointly dependent and
Without loss of generality, let and * be the row vectors formed by the nonzero elements in the first row
of B and respectively. Thus the first row of B can be expressed as 0 . So B has G coefficients
Similarly, the first row of can be written as * 0 . So in , there are K* elements present ( those take
or 0 yt * 0 xt 1t , t 1, 2,..., n.
Assume every equation describes the behaviour of a particular variable, so that we can take 11 1.
If 11 1, then divide the whole equation by 11 so that the coefficient of y1t is one.
Partition
**
*
* **
where the orders of * is G K* , ** is G K** , * is G K* and ** is G K**
We can re-express
0 * 0
0 * 0**
**
or 0 * * 0**
* **
* * (i )
** 0**. (ii )
identifiability of S lies upon the unique determination of . Out of G elements in , one has coefficient
Note that
1, 12 ,..., 1G .
As
** 0**
or 1 ** 0** .
rank ** G 1.
rank ** G 1.
This is known as a rank condition for the identifiability of parameters in S . This condition is necessary
and sufficient.
Another condition, known as order condition is only necessary and not sufficient. The order condition is
derived as follows:
We now use the result that for any matrix A of order m n , the rank ( A) Min(m, n). For identifiability, if
rank ( A) m then obviously n m.
Since ** 0 and has only (G 1) elements which are identifiable when
rank ** G 1
K K* G 1.
This is known as the order condition for identifiability.
There are various ways in which these conditions can be represented to have meaningful interpretations. We
discuss them as follows:
Here
G G : Number of jointly dependent variables left out in the equation
y1t 12 y2t ... 1G yG 11 x1t ... 1K xKt 1t
y1t 12 y2t ... 1G yG 11 x1t ... 1K xKt 1t
So left-hand side of the condition denotes the total number of variables excluded is this equation.
Thus if the total number of variables excluded in this equation exceeds (G 1), then the model is
identifiable.
2. K K* G -1
Here
K* : The number of predetermined variables present in the equation.
3. Define L K K* (G 1)
Suppose
0 * 0**
B , ,
B B * **
where , * , 0 and 0** are row vectors consisting of G , K* , G and K** elements respectively in them.
((G 1) K* ) and ** is ((G 1) K** ). Note that B and ** are the matrices of structural coefficients for
the variable omitted from the i th equation (i = 1,2,…,G) but included in other structural equations.
then clearly the rank of * is same as the rank of since the rank of a matrix is not affected by enlarging
0 ,**
rank ( B 1* ) rank
I , ,**
rank ( I , ) rank ( ,** )
G G G 1
G 1
rank B 1* rank * rank B ** .
So
rank ( B ** ) G 1
Note that B ** is a matrix constructed from the coefficients of variables excluded from that particular
equation but included in other equations of the model. If rank ** G 1, then the equation is
identifiable and this is a necessary and sufficient condition. An advantage of this term is that it avoids the
inversion of matrix. A working rule is proposed like following.
Working rule:
1. Write the model in tabular form by putting ‘X’ if the variable is present and ‘0’ if the variable is
absent.
2. For the equation under study, mark the 0’s (zeros) and pick up the corresponding columns
suppressing that row.
3. If we can choose G 1 rows and (G 1) columns that are not all zero, then it can be identified.
G Number of equations = 3.
`X’ denotes the presence and '0 ' denotes the absence of variables in an equation.
G Numbers of ‘X’ in y1 y2 y3 .
K* = Number of ‘X’ in ( x1 x2 x3 ) .
Equation (2) 0 0
Equation (3) X X
Check if any row/column is present with all elements ‘0’. If we can pick up a block of order (G 1)
is which all rows/columns are all ‘0’ , then the equation is identifiable.
0 0
Here G 1 3 1 2, B , ** . So we need 2 2 matrix in which no row and no
X X
column has all elements ‘0’ . In case of equation (1), the first row has all ‘0’, so it is not identifiable.
Notice that from order condition, we have L 0 which indicates that the equation is identified and
this conclusion is misleading. This is happening because order condition is just necessary but not
sufficient.
Equation (1) 0X 0
Equation (3) XX X
0 X 0
B , ** .
X X X
G 1 3 1 2.
We see that there is atleast one block in which no row is ‘0’ and no column is ‘0’. So the equation (2)
is identified.
Also, L 1 0 Equation (2) is over identified.
Equation (1) X X
Equation (2) X X
X X
So B , ** .
X X
We see that there is no ‘0’ present in the block. So equation (3) is identified.
Also, L 0 Equation (3) is exactly identified.
y1t 12 y2t ... 1G yG 11 x1t 12 x2t ... 1K xK*t 1t , t 1, 2,.., n.
Writing this equation in vector and matrix notations by collecting all n observations, we can write
y1 Y1 X 1 1
where
variables where G denote the number of jointly dependent variables present on the right hand side of the
observations on each of the K* predetermined variables and is a n 1 vector of structural disturbances.
1
Assume E 0, E ' where is positive definite symmetric matrix.
n
The reduced form of the model is obtained from the structural equation model by post multiplying by B 1 as
YBB 1 X B 1 B 1
Y X V
where B 1 and V B 1 are the matrices of reduced-form coefficients and reduced form disturbances
respectively.
where A Y1 , X 1 , ' and 1 . This model looks like a multiple linear regression model.
X 1 XJ1
where J1 is called as a select matrix and consists of two types of elements, viz., 0 and 1. If the
Step 1: Apply ordinary least squares to each of the reduced form equations and estimate the reduced form
coefficient matrix.
Step 2: Find algebraic relations between structural and reduced-form coefficients. Then find the structural
coefficients.
where yt y1t , y2t ,..., yGt ', xt x1t , x2t ,..., xKt '.
ˆ X ' X X ' Y .
1
This is the first step of ILS procedure and yields the set of estimated reduced form coefficients.
Suppose we want to estimate the coefficients of the first structural equation, where $Y_1$ contains the observations on the jointly dependent variables included in the equation, $X_1$ is the $n \times K^*$ matrix of observations on the $K^*$ predetermined (exogenous) variables in the equation and $\varepsilon$ is an $n \times 1$ vector of structural disturbances. Write this model as

$$y_1 = Y_1\beta + X_1\gamma + \varepsilon$$

or, more generally,

$$y_1 = \begin{pmatrix} Y_1 & Y_2 & X_1 & X_2 \end{pmatrix}\begin{pmatrix} \beta \\ 0 \\ \gamma \\ 0 \end{pmatrix} + \varepsilon,$$

where $Y_2$ and $X_2$ are the matrices of observations on the $(G - G_\Delta)$ endogenous and $(K - K^*)$ predetermined variables which are excluded from the equation due to zero-one type restrictions.
Write

$$\Pi B = -\Gamma$$

or, for the first equation,

$$\Pi \begin{pmatrix} 1 \\ -\beta \\ 0 \end{pmatrix} = \begin{pmatrix} \gamma \\ 0 \end{pmatrix}.$$

Substitute $\Pi$ by its estimate $\hat{\Pi} = (X'X)^{-1}X'Y$ and solve for $\beta$ and $\gamma$. This gives the indirect least squares estimators of $\beta$ and $\gamma$. Partitioning the rows of $\hat{\Pi}$ conformably with the included and excluded predetermined variables gives two sets of relations, say (i) and (ii). These equations (i) and (ii) are $K$ equations in $(G_\Delta + K^* - 1)$ unknowns; solving them gives a unique solution, and hence the ILS estimators, only when the equation is exactly identified.
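As an illustration of the two ILS steps, here is a minimal numerical sketch on a hypothetical two-equation demand and supply system in which the supply equation excludes the exogenous income variable and is therefore exactly identified. All parameter values and names are assumptions for the simulation, not from the notes.

```python
# Minimal ILS sketch: Step 1 estimates the reduced form by OLS,
# Step 2 solves for the structural (supply) coefficients.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                      # exogenous (predetermined) income variable
e_d, e_s = rng.normal(size=(2, n)) * 0.5    # structural disturbances
# Hypothetical structural parameters (for the simulation only)
a0, a1, a2 = 10.0, -1.0, 1.5                # demand: q = a0 + a1*p + a2*x + e_d
b0, b1 = 2.0, 1.0                           # supply: q = b0 + b1*p + e_s
# Solve the two structural equations for the reduced form of (p, q)
p = (a0 - b0 + a2 * x + e_d - e_s) / (b1 - a1)
q = b0 + b1 * p + e_s

# Step 1: OLS on each reduced-form equation, regressors (1, x)
X = np.column_stack([np.ones(n), x])
Pi = np.linalg.solve(X.T @ X, X.T @ np.column_stack([p, q]))  # columns: (p, q)

# Step 2: recover the structural supply coefficients from the reduced form
b1_ils = Pi[1, 1] / Pi[1, 0]                # coefficient of x in q over that in p
b0_ils = Pi[0, 1] - b1_ils * Pi[0, 0]
print("ILS estimates of the supply equation:", b0_ils, b1_ils)
```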
2. Two stage least squares (2SLS) or generalized classical linear (GCL) method:
This is the more widely used estimation procedure as it is applicable to exactly identified as well as overidentified equations. The least squares method is applied in two stages in this method.
Consider the equation $y_1 = Y_1\beta + X_1\gamma + \varepsilon$.
Stage 1: Apply the least squares method to estimate the reduced-form parameters in the reduced-form model

$$Y_1 = X\Pi_1 + V_1,$$

which gives

$$\hat{\Pi}_1 = (X'X)^{-1}X'Y_1, \qquad \hat{Y}_1 = X\hat{\Pi}_1.$$

Stage 2: Replace $Y_1$ in the structural equation $y_1 = Y_1\beta + X_1\gamma + \varepsilon$ by $\hat{Y}_1$ and apply OLS to the structural equation thus obtained:

$$y_1 = \hat{Y}_1\beta + X_1\gamma + \varepsilon.$$
Here

$$\hat{A} = (\hat{Y}_1, X_1) = \left(X(X'X)^{-1}X'Y_1,\; X_1\right) = \left(X(X'X)^{-1}X'\hat{Y}_1,\; X_1\right),$$

and applying OLS to $y_1$ on $\hat{A}$ gives the two stage least squares estimator

$$\hat{\delta} = \begin{pmatrix} \hat{\beta} \\ \hat{\gamma} \end{pmatrix} = (\hat{A}'\hat{A})^{-1}\hat{A}'y_1,$$

or explicitly, with $H = X(X'X)^{-1}X'$,

$$\begin{pmatrix} \hat{\beta} \\ \hat{\gamma} \end{pmatrix} = \begin{pmatrix} Y_1'HY_1 & Y_1'X_1 \\ X_1'Y_1 & X_1'X_1 \end{pmatrix}^{-1}\begin{pmatrix} \hat{Y}_1'y_1 \\ X_1'y_1 \end{pmatrix} = \begin{pmatrix} Y_1'Y_1 - \hat{V}_1'\hat{V}_1 & Y_1'X_1 \\ X_1'Y_1 & X_1'X_1 \end{pmatrix}^{-1}\begin{pmatrix} (Y_1 - \hat{V}_1)'y_1 \\ X_1'y_1 \end{pmatrix},$$

where $\hat{V}_1 = Y_1 - \hat{Y}_1$. Solving these equations, we get $\hat{\beta}$ and $\hat{\gamma}$, which are the two stage least squares estimators of $\beta$ and $\gamma$ respectively.
To examine consistency, substitute $y_1 = A\delta + \varepsilon$ with $A = (Y_1, X_1)$ and use $\hat{A}'A = \hat{A}'\hat{A}$:

$$\hat{\delta} = (\hat{A}'\hat{A})^{-1}\hat{A}'y_1 = \delta + (\hat{A}'\hat{A})^{-1}\hat{A}'\varepsilon,$$

so that

$$\hat{\delta} - \delta = \left(\frac{1}{n}\hat{A}'\hat{A}\right)^{-1}\frac{1}{n}\hat{A}'\varepsilon$$

and

$$\operatorname{plim}\left(\hat{\delta} - \delta\right) = \left[\operatorname{plim}\frac{1}{n}\hat{A}'\hat{A}\right]^{-1}\operatorname{plim}\left(\frac{1}{n}\hat{A}'\varepsilon\right).$$

Since $\operatorname{plim}\frac{1}{n}\hat{A}'\hat{A}$ is a finite nonsingular matrix and the predetermined variables are asymptotically uncorrelated with the structural disturbances, $\operatorname{plim}\frac{1}{n}\hat{A}'\varepsilon = 0$. Thus $\operatorname{plim}(\hat{\delta} - \delta) = 0$ and so the 2SLS estimators are consistent.
$$\text{Asy Var}(\hat{\delta}) = n^{-1}\operatorname{plim}\left[n\left(\hat{\delta} - \delta\right)\left(\hat{\delta} - \delta\right)'\right]$$
$$= n^{-1}\operatorname{plim}\left[n(\hat{A}'\hat{A})^{-1}\hat{A}'\varepsilon\varepsilon'\hat{A}(\hat{A}'\hat{A})^{-1}\right]$$
$$= n^{-1}\operatorname{plim}\left[\left(\frac{1}{n}\hat{A}'\hat{A}\right)^{-1}\left(\frac{1}{n}\hat{A}'\varepsilon\varepsilon'\hat{A}\right)\left(\frac{1}{n}\hat{A}'\hat{A}\right)^{-1}\right]$$
$$= n^{-1}\sigma^2\left[\operatorname{plim}\frac{1}{n}\hat{A}'\hat{A}\right]^{-1},$$

where $\operatorname{Var}(\varepsilon_t) = \sigma^2$.
An estimator of the asymptotic covariance matrix is obtained by replacing $\sigma^2$ with

$$s^2 = \frac{\left(y_1 - Y_1\hat{\beta} - X_1\hat{\gamma}\right)'\left(y_1 - Y_1\hat{\beta} - X_1\hat{\gamma}\right)}{n - G_\Delta - K^*}.$$
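A compact sketch of the two stages for a single structural equation, written in terms of the matrices used above ($y_1$, $Y_1$, $X_1$ and the full matrix $X$ of predetermined variables). The function name and the simple variance estimator (dividing by $n$ with no degrees-of-freedom correction) are our own choices.

```python
# Minimal 2SLS sketch: Stage 1 builds Y1_hat = X(X'X)^{-1}X'Y1,
# Stage 2 runs OLS of y1 on (Y1_hat, X1).
import numpy as np

def two_stage_least_squares(y1, Y1, X1, X):
    """y1: (n,) vector; Y1: (n, g) included endogenous variables; X1: (n, k1)
    included predetermined variables; X: (n, K) all predetermined variables."""
    n = len(y1)
    # Stage 1: reduced-form OLS and fitted values of the included endogenous variables
    Pi1_hat = np.linalg.solve(X.T @ X, X.T @ Y1)
    Y1_hat = X @ Pi1_hat
    # Stage 2: OLS of y1 on (Y1_hat, X1)
    A_hat = np.column_stack([Y1_hat, X1])
    delta_hat = np.linalg.solve(A_hat.T @ A_hat, A_hat.T @ y1)
    g = Y1.shape[1]
    beta_hat, gamma_hat = delta_hat[:g], delta_hat[g:]
    # Residuals are computed with the observed Y1, not Y1_hat
    resid = y1 - Y1 @ beta_hat - X1 @ gamma_hat
    s2 = resid @ resid / n                          # simple variance estimate (assumption)
    cov_hat = s2 * np.linalg.inv(A_hat.T @ A_hat)   # estimated Asy Var of (beta, gamma)
    return beta_hat, gamma_hat, cov_hat
```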
A basic nature of the multiple regression model is that it describes the behaviour of a particular study
variable based on a set of explanatory variables. When the objective is to explain the whole system, there may be more than one multiple regression equation. For example, in a set of individual linear multiple regression equations, each equation may explain some economic phenomenon. One approach to handle such a set of equations is to consider the set-up of a simultaneous equations model in which one or more of the explanatory variables in one or more equations is itself the dependent (endogenous) variable associated with another equation in the full system. On the other hand, suppose that none of the variables in the system is simultaneously both explanatory and dependent in nature. There may still be interactions between the
individual equations if the random error components associated with at least some of the different equations
are correlated with each other. This means that the equations may be linked statistically, even though not
structurally – through the jointness of the distribution of the error terms and through the non-diagonal
covariance matrix. Such behaviour is reflected in the seemingly unrelated regression equations (SURE)
model in which the individual equations are in fact related to one another, even though superficially they
may not seem to be.
The basic philosophy of the SURE model is as follows. The jointness of the equations is explained by the
structure of the SURE model and the covariance matrix of the associated disturbances. Such jointness
introduces additional information which is over and above the information available when the individual
equations are considered separately. So it is desired to consider all the separate relationships collectively to
draw the statistical inferences about the model parameters.
Example:
Suppose a country has 20 states and the objective is to study the consumption pattern of the country. There is
one consumption equation for each state. So all together there are 20 equations which describe 20
consumption functions. It may also not necessary that the same variables are present in all the models.
Different equations may contain different variables. It may be noted that the consumption pattern of the
neighbouring states may have characteristics in common. Apparently, the equations may look distinct
individually, but there may be some kind of relationship existing among the equations. Such equations can be used to examine the jointness of the distribution of the disturbances. It seems reasonable to assume that the disturbances associated with the different equations are correlated, and hence to estimate all the equations jointly rather than one at a time.
Model:
We consider here a model comprising $M$ multiple regression equations of the form

$$y_{ti} = \sum_{j=1}^{k_i} \beta_{ij}x_{tij} + \epsilon_{ti}, \qquad t = 1,2,\ldots,T;\; i = 1,2,\ldots,M;\; j = 1,2,\ldots,k_i,$$

where $y_{ti}$ is the $t$-th observation on the $i$-th dependent variable which is to be explained by the $i$-th regression equation, $x_{tij}$ is the $t$-th observation on the $j$-th explanatory variable appearing in the $i$-th equation, $\beta_{ij}$ is the coefficient associated with $x_{tij}$ at each observation and $\epsilon_{ti}$ is the $t$-th value of the random error component associated with the $i$-th equation. These $M$ equations can be compactly written as

$$y_i = X_i\beta_i + \epsilon_i, \qquad i = 1,2,\ldots,M,$$

where $y_i$ is a $T \times 1$ vector with elements $y_{ti}$; $X_i$ is a $T \times k_i$ matrix whose columns represent the $T$ observations on the explanatory variables in the $i$-th equation; $\beta_i$ is a $k_i \times 1$ vector with elements $\beta_{ij}$; and $\epsilon_i$ is a $T \times 1$ vector of disturbances. Writing all the $M$ equations together,
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{pmatrix} = \begin{pmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_M \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_M \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_M \end{pmatrix}$$

or

$$y = X\beta + \epsilon,$$

where the orders of $y$, $X$, $\beta$ and $\epsilon$ are $TM \times 1$, $TM \times k^*$, $k^* \times 1$ and $TM \times 1$ respectively, with $k^* = \sum_i k_i$.
Assume

$$E(\epsilon_i) = 0, \qquad E(\epsilon_i\epsilon_j') = \sigma_{ij}I_T, \qquad i, j = 1,2,\ldots,M,$$

or compactly $E(\epsilon) = 0$ and $E(\epsilon\epsilon') = \Sigma \otimes I_T \equiv \Psi$, where $\Sigma = ((\sigma_{ij}))$. Further assume $\lim_{T\to\infty}\frac{1}{T}X_i'X_j = Q_{ij}$, where $Q_{ij}$ is a non-singular matrix with fixed and finite elements and $\sigma_{ij}$ is the covariance between the disturbances of the $i$-th and $j$-th equations for each observation.
It is clear that the $M$ equations may appear not to be related, in the sense that there is no simultaneity between the variables in the system and each equation has its own explanatory variables to explain its study variable. The equations are related stochastically through the disturbances, which are contemporaneously correlated across the equations of the model. That is why this system is referred to as the SURE model.
The SURE model is a particular case of the simultaneous equations model involving $M$ structural equations with $M$ jointly dependent variables and $k$ ($\geq k_i$ for all $i$) distinct exogenous variables, and in which neither current nor lagged endogenous variables appear as explanatory variables in any of the structural equations.
The SURE model differs from the multivariate regression model only in the sense that it takes account of
prior information concerning the absence of certain explanatory variables from certain equations of the
model. Such exclusions are highly realistic in many economic situations.
The ordinary least squares estimator of $\beta$ is

$$b_0 = (X'X)^{-1}X'y.$$

Further,

$$E(b_0) = \beta, \qquad V(b_0) = E(b_0 - \beta)(b_0 - \beta)' = (X'X)^{-1}X'\Psi X(X'X)^{-1}.$$
The generalized least squares estimator of $\beta$, which takes the covariance structure $\Psi = \Sigma \otimes I_T$ into account, is

$$\hat{\beta} = \left[X'(\Sigma^{-1}\otimes I_T)X\right]^{-1}X'(\Sigma^{-1}\otimes I_T)y$$

with

$$E(\hat{\beta}) = \beta, \qquad V(\hat{\beta}) = E(\hat{\beta} - \beta)(\hat{\beta} - \beta)' = \left(X'\Psi^{-1}X\right)^{-1} = \left[X'(\Sigma^{-1}\otimes I_T)X\right]^{-1}.$$
Define

$$G = (X'X)^{-1}X' - \left(X'\Psi^{-1}X\right)^{-1}X'\Psi^{-1},$$

so that

$$V(b_0) - V(\hat{\beta}) = G\Psi G'.$$

Since $\Psi$ is positive definite, $G\Psi G'$ is at least positive semidefinite, and so the GLSE is, in general, more efficient than the OLSE for estimating $\beta$. In fact, using the result that the GLSE is the best linear unbiased estimator of $\beta$, we can conclude that $\hat{\beta}$ is the best linear unbiased estimator in this case also.
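As a short check of the covariance difference above (not part of the original derivation, but immediate from the definitions of $V(b_0)$, $V(\hat{\beta})$ and $G$): expanding $G\Psi G'$ term by term and using $\Psi\Psi^{-1} = I$ and $(X'X)^{-1}X'X = I$,

$$G\Psi G' = (X'X)^{-1}X'\Psi X(X'X)^{-1} - \left(X'\Psi^{-1}X\right)^{-1} - \left(X'\Psi^{-1}X\right)^{-1} + \left(X'\Psi^{-1}X\right)^{-1} = V(b_0) - V(\hat{\beta}).$$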
In practice, $\Sigma$ is usually unknown, so it is replaced by a consistent estimator, say the matrix $S = ((s_{ij}))$. With such a replacement, we obtain a feasible generalized least squares (FGLS) estimator of $\beta$ as

$$\hat{\beta}_F = \left[X'(S^{-1}\otimes I_T)X\right]^{-1}X'(S^{-1}\otimes I_T)y.$$
Estimation of $\Sigma$:
There are two possible ways to estimate the $\sigma_{ij}$'s.
1. Use of unrestricted residuals: Each equation of the system is of the form $y_i = X_i\beta_i + \epsilon_i$ with $E(\epsilon_i) = 0$ and $V(\epsilon_i) = \sigma_{ii}I_T$. Let $Z$ denote the $T \times k$ observation matrix of all the $k$ distinct explanatory variables appearing anywhere in the system. Regress each of the $M$ study variables on the columns of $Z$ and obtain the $T \times 1$ residual vectors

$$\hat{\epsilon}_i = H_Z y_i, \qquad H_Z = I_T - Z(Z'Z)^{-1}Z', \qquad i = 1,2,\ldots,M.$$

Then obtain

$$s_{ij} = \frac{1}{T}\hat{\epsilon}_i'\hat{\epsilon}_j = \frac{1}{T}y_i'H_Z y_j.$$
Since $X_i = ZJ_i$ for a suitable select matrix $J_i$, we have

$$H_Z X_i = X_i - Z(Z'Z)^{-1}Z'X_i = X_i - Z(Z'Z)^{-1}Z'ZJ_i = X_i - ZJ_i = 0,$$

and thus

$$y_i'H_Z y_j = \left(\beta_i'X_i' + \epsilon_i'\right)H_Z\left(X_j\beta_j + \epsilon_j\right) = \epsilon_i'H_Z\epsilon_j.$$
2. Use of restricted residuals: Here the residuals obtained after incorporating the restrictions on the coefficients which distinguish the SURE model from the multivariate regression model are used as follows. Regress $y_i$ on $X_i$, i.e., regress each equation, $i = 1,2,\ldots,M$, by OLS and obtain the residual vector

$$u_i = H_{X_i}y_i.$$
A consistent estimator of $\sigma_{ij}$ is obtained as

$$s_{ij}^* = \frac{1}{T}u_i'u_j = \frac{1}{T}y_i'H_{X_i}H_{X_j}y_j,$$
where

$$H_{X_i} = I - X_i\left(X_i'X_i\right)^{-1}X_i', \qquad H_{X_j} = I - X_j\left(X_j'X_j\right)^{-1}X_j'.$$
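The residual-based estimates of $\sigma_{ij}$ feed directly into the FGLS estimator given earlier. Below is a minimal sketch using the restricted residuals $u_i = H_{X_i}y_i$ (the second route) and the divisor $T$; the function name and the use of scipy's block_diag are our own choices, not part of the notes.

```python
# Minimal FGLS sketch for a SURE system: equation-wise OLS residuals give
# S = ((s*_ij)), then beta is estimated by GLS with Sigma replaced by S.
import numpy as np
from scipy.linalg import block_diag

def sure_fgls(y_list, X_list):
    """y_list: list of (T,) response vectors; X_list: list of (T, k_i) design matrices."""
    M, T = len(y_list), len(y_list[0])
    # Equation-by-equation OLS residuals u_i = H_{X_i} y_i
    resids = []
    for y_i, X_i in zip(y_list, X_list):
        b_i = np.linalg.solve(X_i.T @ X_i, X_i.T @ y_i)
        resids.append(y_i - X_i @ b_i)
    U = np.column_stack(resids)                        # T x M matrix of residuals
    S = (U.T @ U) / T                                  # s*_ij = u_i'u_j / T
    # Stacked system y = X beta + eps with block-diagonal X
    y = np.concatenate(y_list)
    X = block_diag(*X_list)
    Omega_inv = np.kron(np.linalg.inv(S), np.eye(T))   # (S^{-1} kron I_T)
    XtOi = X.T @ Omega_inv
    beta_fgls = np.linalg.solve(XtOi @ X, XtOi @ y)
    cov_fgls = np.linalg.inv(XtOi @ X)                 # estimated covariance of beta_fgls
    return beta_fgls, cov_fgls
```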
If $T$ in $s_{ij}^*$ is replaced by

$$\operatorname{tr}\left(H_{X_i}H_{X_j}\right) = T - k_i - k_j + \operatorname{tr}\left[\left(X_i'X_i\right)^{-1}X_i'X_j\left(X_j'X_j\right)^{-1}X_j'X_i\right]$$