
Chapter 1

Introduction to Econometrics

Econometrics deals with the measurement of economic relationships. It is an integration of economics, mathematical economics and statistics with the objective of providing numerical values for the parameters of economic relationships. The relationships of economic theories are usually expressed in mathematical form and combined with empirical economics. Econometric methods are used to obtain the values of the parameters, which are essentially the coefficients of the mathematical form of the economic relationships. The statistical methods which help in explaining the economic phenomenon are adapted as econometric methods. Econometric relationships depict the random behaviour of economic relationships, which is generally not considered in economic and mathematical formulations.

It may be pointed out that econometric methods can also be used in other areas, such as the engineering sciences, biological sciences, medical sciences, geosciences, agricultural sciences, etc. In simple words, whenever there is a need to express a stochastic relationship in mathematical form, the econometric methods and tools help. The econometric tools are helpful in explaining the relationships among variables.

Econometric Models:
A model is a simplified representation of a real-world process. It should be representative in the sense that it should contain the salient features of the phenomenon under study. In general, one of the objectives in modeling is to have a simple model to explain a complex phenomenon. Such an objective may sometimes lead to an oversimplified model, and sometimes the assumptions made are unrealistic. In practice, generally, all the variables which the experimenter thinks are relevant to explain the phenomenon are included in the model. The rest of the variables are dumped into a basket called “disturbances”, where the disturbances are random variables. This is the main difference between economic modeling and econometric modeling. This is also the main difference between mathematical modeling and statistical modeling. Mathematical modeling is exact in nature, whereas statistical modeling contains a stochastic term as well.

An economic model is a set of assumptions that describes the behaviour of an economy, or more generally, a phenomenon.

An econometric model consists of
- a set of equations describing the behaviour. These equations are derived from the economic model
and have two parts – observed variables and disturbances.
- a statement about the errors in the observed values of variables.
- a specification of the probability distribution of disturbances.

Aims of econometrics:
The three main aims of econometrics are as follows:

1. Formulation and specification of econometric models:
The economic models are formulated in an empirically testable form. Several econometric models can be derived from a single economic model. Such models differ due to different choices of functional form, specification of the stochastic structure of the variables, etc.

2. Estimation and testing of models:
The models are estimated on the basis of the observed set of data and are tested for their suitability. This is the statistical inference part of the modelling. Various estimation procedures are used to obtain the numerical values of the unknown parameters of the model. Based on various formulations of statistical models, a suitable and appropriate model is selected.

3. Use of models:
The obtained models are used for forecasting and policy formulation, which is an essential part in any policy
decision. Such forecasts help the policymakers to judge the goodness of the fitted model and take necessary
measures in order to re-adjust the relevant economic variables.

Econometrics and statistics:
Econometrics differs both from mathematical statistics and economic statistics. In economic statistics, the empirical data is collected, recorded, tabulated and used in describing the pattern in its development over time. Economic statistics is a descriptive aspect of economics. It provides neither explanations of the development of the various variables nor measurement of the parameters of the relationships.

Statistical methods describe the methods of measurement which are developed on the basis of controlled experiments. Such methods may not be suitable for economic phenomena as they do not fit into the framework of controlled experiments. For example, in real-world experiments, the variables usually change continuously and simultaneously, and so the setup of controlled experiments is not suitable.

Econometrics uses statistical methods after adapting them to the problems of economic life. These adapted statistical methods are usually termed econometric methods. Such methods are adjusted so that they become appropriate for the measurement of stochastic relationships. These adjustments basically attempt to specify the stochastic element which operates in real-world data and enters into the determination of the observed values. This enables the data to be treated as a random sample, which is needed for the application of statistical tools.

Theoretical econometrics includes the development of appropriate methods for the measurement of economic relationships which are not meant for controlled experiments conducted inside laboratories. The econometric methods are generally developed for the analysis of non-experimental data.

Applied econometrics includes the application of econometric methods to specific branches of economic theory and problems like demand, supply, production, investment, consumption, etc. Applied econometrics involves the application of the tools of econometric theory to the analysis of economic phenomena and forecasting economic behaviour.

Types of data
Various types of data are used in the estimation of the model.
1. Time series data
Time series data give information about the numerical values of variables from period to period and are collected over time. For example, the data on monthly income during the years 1990-2010 constitute a time series.

2. Cross-section data
The cross-section data give information on the variables concerning individual agents (e.g., consumers or producers) at a given point of time. For example, a cross-section of a sample of consumers is a sample of family budgets showing expenditures on various commodities by each family, as well as information on family income, family composition and other demographic, social or financial characteristics.

3. Panel data:
The panel data are the data from a repeated survey of a single (cross-section) sample in different periods of
time.

4. Dummy variable data
When the variables are qualitative in nature, the data is recorded in the form of an indicator function. The values of the variables do not reflect the magnitude of the data; they reflect only the presence or absence of a characteristic. For example, variables like religion, sex, taste, etc. are qualitative variables. The variable ‘sex’ takes two values – male or female; the variable ‘taste’ takes values like ‘like’ or ‘dislike’, etc. Such values are denoted by a dummy variable. For example, ‘1’ can represent male and ‘0’ female; similarly, ‘1’ can represent liking of a taste and ‘0’ disliking of a taste.
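As a small illustration (a minimal sketch, not part of the original notes; the data below are made up), qualitative observations can be converted into dummy variables in a few lines of Python:

```python
# Hedged illustration: coding qualitative variables as 0/1 dummy variables.
sex = ["male", "female", "female", "male"]          # hypothetical qualitative data
taste = ["like", "dislike", "like", "like"]

sex_dummy = [1 if s == "male" else 0 for s in sex]      # 1 = male, 0 = female
taste_dummy = [1 if t == "like" else 0 for t in taste]  # 1 = like, 0 = dislike

print(sex_dummy)    # [1, 0, 0, 1]
print(taste_dummy)  # [1, 0, 1, 1]
```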

Aggregation problem:
Aggregation problems arise when aggregative variables are used in functions. Such aggregative variables may involve:
1. Aggregation over individuals:
For example, the total income may comprise the sum of individual incomes.

2. Aggregation over commodities:
The quantities of various commodities may be aggregated over, e.g., prices or groups of commodities. This is done by using a suitable index.

3. Aggregation over time periods
Sometimes the data is available for shorter or longer time periods than required for the functional form of the economic relationship. In such cases, the data needs to be aggregated over the time period. For example, the production of most manufacturing commodities is completed in a period shorter than a year. If annual figures are to be used in the model, then there may be some error in the production function.

4. Spatial aggregation:
Sometimes the aggregation is related to spatial issues, for example, the population of towns or countries, or the production in a city or region, etc.

Such sources of aggregation introduce “aggregation bias” in the estimates of the coefficients. It is important to examine the possibility of such errors before estimating the model.

Econometrics and regression analysis:
One of the very important roles of econometrics is to provide the tools for modeling on the basis of given
data. The regression modeling technique helps a lot in this task. The regression models can be either linear or
non-linear based on which we have linear regression analysis and non-linear regression analysis. We will
consider only the tools of linear regression analysis and our main interest will be the fitting of the linear
regression model to a given set of data.

Linear regression model

Suppose the outcome of a process is denoted by a random variable $y$, called the dependent (or study) variable, which depends on $k$ independent (or explanatory) variables denoted by $X_1, X_2, \ldots, X_k$. Suppose the behaviour of $y$ can be explained by a relationship given by
$$ y = f(X_1, X_2, \ldots, X_k, \beta_1, \beta_2, \ldots, \beta_k) + \varepsilon $$
where $f$ is some well-defined function and $\beta_1, \beta_2, \ldots, \beta_k$ are the parameters which characterize the role and contribution of $X_1, X_2, \ldots, X_k$, respectively. The term $\varepsilon$ reflects the stochastic nature of the relationship between $y$ and $X_1, X_2, \ldots, X_k$ and indicates that such a relationship is not exact in nature. When $\varepsilon = 0$, the relationship is called a mathematical model; otherwise, it is a statistical model. The term “model” is broadly used to represent any phenomenon in a mathematical framework.

A model or relationship is termed linear if it is linear in the parameters and non-linear if it is not linear in the parameters. In other words, if all the partial derivatives of $y$ with respect to each of the parameters $\beta_1, \beta_2, \ldots, \beta_k$ are independent of the parameters, then the model is called a linear model. If any of the partial derivatives of $y$ with respect to any of $\beta_1, \beta_2, \ldots, \beta_k$ is not independent of the parameters, the model is called non-linear. Note that the linearity or non-linearity of the model is not determined by the linearity or non-linearity of the explanatory variables in the model.

For example,
$$ y = \beta_1 X_1^2 + \beta_2 X_2 + \beta_3 \log X_3 + \varepsilon $$
is a linear model because $\partial y/\partial \beta_i$ $(i = 1, 2, 3)$ are independent of the parameters $\beta_i$ $(i = 1, 2, 3)$. On the other hand,
$$ y = \beta_1^2 X_1 + \beta_2 X_2 + \beta_3 \log X + \varepsilon $$

is a non-linear model because $\partial y/\partial \beta_1 = 2\beta_1 X_1$ depends on $\beta_1$, although $\partial y/\partial \beta_2$ and $\partial y/\partial \beta_3$ are independent of $\beta_1$, $\beta_2$ and $\beta_3$.

When the function $f$ is linear in the parameters, then $y = f(X_1, X_2, \ldots, X_k, \beta_1, \beta_2, \ldots, \beta_k) + \varepsilon$ is called a linear model, and when the function $f$ is non-linear in the parameters, then it is called a non-linear model. In general, the function $f$ is chosen as
$$ f(X_1, X_2, \ldots, X_k, \beta_1, \beta_2, \ldots, \beta_k) = \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k $$
to describe a linear model. Since $X_1, X_2, \ldots, X_k$ are pre-determined variables and $y$ is the outcome, both are known. Thus the knowledge of the model depends on the knowledge of the parameters $\beta_1, \beta_2, \ldots, \beta_k$.

The statistical linear modeling essentially consists of developing approaches and tools to determine $\beta_1, \beta_2, \ldots, \beta_k$ in the linear model
$$ y = \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + \varepsilon $$
given the observations on $y$ and $X_1, X_2, \ldots, X_k$.

Different statistical estimation procedures, e.g., the method of maximum likelihood, the principle of least squares, the method of moments, etc., can be employed to estimate the parameters of the model. The method of maximum likelihood needs further knowledge of the distribution of $y$, whereas the method of moments and the principle of least squares do not need any knowledge about the distribution of $y$.

Regression analysis is a tool to determine the values of the parameters given the data on $y$ and $X_1, X_2, \ldots, X_k$. The literal meaning of regression is “to move in the backward direction”. Before discussing and understanding the meaning of “backward direction”, let us find which of the following statements is correct:
$S_1$: the model generates the data, or
$S_2$: the data generate the model.
Obviously, $S_1$ is correct. It can be broadly thought that the model exists in nature but is unknown to the experimenter. When some values of the explanatory variables are provided, the values of the output or study variable are generated accordingly, depending on the form of the function $f$ and the nature of the phenomenon. So ideally, the pre-existing model gives rise to the data. Our objective is to determine the

functional form of this model. Now we move in the backward direction. We propose to first collect the data on the study and explanatory variables. Then we employ some statistical techniques and use this data to determine the form of the function $f$. Equivalently, the data from the model are recorded first and then used to determine the parameters of the model. Regression analysis is a technique which helps in determining the statistical model by using the data on the study and explanatory variables. The classification into linear and non-linear regression analysis is based on the determination of linear and non-linear models, respectively.

Consider a simple example to understand the meaning of “regression”. Suppose the yield of a crop ($y$) depends linearly on two explanatory variables, viz., the quantity of fertilizer ($X_1$) and the level of irrigation ($X_2$), as
$$ y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon. $$
The true values of $\beta_1$ and $\beta_2$ exist in nature but are unknown to the experimenter. Some values of $y$ are recorded by providing different values of $X_1$ and $X_2$. There exists some relationship between $y$ and $X_1, X_2$ which gives rise to systematically behaved data on $y$, $X_1$ and $X_2$. Such a relationship is unknown to the experimenter. To determine the model, we move in the backward direction in the sense that the collected data is used to determine the unknown parameters $\beta_1$ and $\beta_2$ of the model. In this sense, such an approach is termed regression analysis.

The theory and fundamentals of linear models lay the foundation for developing the tools for regression
analysis that are based on valid statistical theory and concepts.

Steps in regression analysis

Regression analysis includes the following steps:
- Statement of the problem under consideration
- Choice of relevant variables
- Collection of data on relevant variables
- Specification of model
- Choice of method for fitting the data
- Fitting of model
- Model validation and criticism
- Using the chosen model(s) for the solution of the posed problem.
These steps are examined below.

1. Statement of the problem under consideration:
The first important step in conducting any regression analysis is to specify the problem and the objectives to be addressed by the regression analysis. A wrong formulation or a wrong understanding of the problem will give wrong statistical inferences. The choice of variables depends upon the objectives of the study and the understanding of the problem. For example, the height and weight of children are related. Now there can be two issues to be addressed:
(i) determination of height for a given weight, or
(ii) determination of weight for a given height.
In case (i), the height is the response variable, whereas in case (ii) the weight is the response variable. The role of the explanatory variables is also interchanged in the two cases.

2. Choice of relevant variables:
Once the problem is carefully formulated and the objectives have been decided, the next question is to choose the relevant variables. It has to be kept in mind that the correctness of the statistical inferences depends on the correct choice of variables. For example, in any agricultural experiment, the yield depends on explanatory variables like the quantity of fertilizer, rainfall, irrigation, temperature, etc. These variables are denoted by $X_1, X_2, \ldots, X_k$ as a set of $k$ explanatory variables.

3. Collection of data on relevant variables:
Once the objective of the study is clearly stated and the variables are chosen, the next question is how to collect data on such relevant variables. The data is essentially the measurement of these variables. For example, suppose we want to collect data on age. For this, it is important to know how to record the data on age. Either the date of birth can be recorded, which will provide the exact age on any specific date, or the age in terms of completed years as on a specific date can be recorded. Moreover, it is also important to decide whether the data has to be collected on the variables as quantitative variables or qualitative variables. For example, if the ages (in years) are 15, 17, 19, 21, 23, then these are quantitative values. If the ages are defined by a variable that takes value 1 if the age is less than 18 years and 0 if the age is more than 18 years, then the earlier recorded data is converted to 1, 1, 0, 0, 0. Note that there is a loss of information in converting quantitative data into qualitative data. The methods and approaches for qualitative and quantitative data are also different. If the study variable is binary, then logistic and probit regressions, etc. are used. If all

explanatory variables are qualitative, then analysis of variance technique is used. If some explanatory
variables are qualitative and others are quantitative, then analysis of covariance technique is used. The
techniques of analysis of variance and analysis of covariance are the special cases of regression analysis.

Generally, the data are collected on $n$ subjects. Then $y$ denotes the response or study variable and $y_1, y_2, \ldots, y_n$ are its $n$ values. If there are $k$ explanatory variables $X_1, X_2, \ldots, X_k$, then $x_{ij}$ denotes the $i$-th value of the $j$-th variable, $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, k$. The observations can be presented in the following table:

Notation for the data used in regression analysis

Observation number   Response $y$   Explanatory variables $X_1 \; X_2 \; \cdots \; X_k$
1                    $y_1$          $x_{11} \; x_{12} \; \cdots \; x_{1k}$
2                    $y_2$          $x_{21} \; x_{22} \; \cdots \; x_{2k}$
$\vdots$             $\vdots$       $\vdots$
$n$                  $y_n$          $x_{n1} \; x_{n2} \; \cdots \; x_{nk}$

4. Specification of model:
The experimenter or the person working in the subject usually helps in determining the form of the model. Only the form of the tentative model can be ascertained, and it will depend on some unknown parameters. For example, a general form will be
$$ y = f(X_1, X_2, \ldots, X_k; \beta_1, \beta_2, \ldots, \beta_k) + \varepsilon $$
where $\varepsilon$ is the random error reflecting mainly the difference between the observed value of $y$ and the value of $y$ obtained through the model. The form of $f(X_1, X_2, \ldots, X_k; \beta_1, \beta_2, \ldots, \beta_k)$ can be linear as well as non-linear depending on how the parameters $\beta_1, \beta_2, \ldots, \beta_k$ enter it. A model is said to be linear if it is linear in the parameters. For example,
$$ y = \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \varepsilon $$
$$ y = \beta_1 + \beta_2 \ln X_2 + \varepsilon $$
are linear models, whereas
$$ y = \beta_1 X_1 + \beta_2^2 X_2 + \beta_3 X_2 + \varepsilon $$
$$ y = (\ln \beta_1) X_1 + \beta_2 X_2 + \varepsilon $$
are non-linear models. Many times, non-linear models can be converted into linear models through suitable transformations, so the class of linear models is wider than it appears initially.
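The linearity criterion can be checked mechanically by differentiating with respect to each parameter and seeing whether any parameter survives in the derivative. The following sketch (not part of the original notes; the model expressions are the examples above) does this symbolically with SymPy.

```python
# Hedged illustration: check linearity in parameters by symbolic differentiation.
import sympy as sp

X1, X2, b1, b2, b3 = sp.symbols("X1 X2 beta1 beta2 beta3", positive=True)

linear_model = b1*X1 + b2*X1**2 + b3*X2       # linear in the beta's
nonlinear_model = b1*X1 + b2**2*X2 + b3*X2    # beta2 enters as a square

for name, model in [("linear", linear_model), ("non-linear", nonlinear_model)]:
    derivs = [sp.diff(model, b) for b in (b1, b2, b3)]
    # Linear in parameters iff no derivative still contains any beta.
    is_linear = all(not (d.free_symbols & {b1, b2, b3}) for d in derivs)
    print(name, "-> partial derivatives:", derivs, "-> linear in parameters:", is_linear)
```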

If a model contains only one explanatory variable, then it is called a simple regression model. When there is more than one explanatory variable, it is called a multiple regression model. When there is only one study variable, the regression is termed univariate regression. When there is more than one study variable, the regression is termed multivariate regression. Note that simple and multiple regressions are not the same as univariate and multivariate regressions. Simple and multiple regressions are determined by the number of explanatory variables, whereas univariate and multivariate regressions are determined by the number of study variables.

5. Choice of method for fitting the data:


After the model has been defined, and the data have been collected, the next task is to estimate the
parameters of the model based on the collected data. This is also referred to as parameter estimation or
model fitting. The most commonly used method of estimation is the least-squares method. Under certain
assumptions, the least-squares method produces estimators with desirable properties. The other estimation
methods are the maximum likelihood method, ridge method, principal components method etc.

6. Fitting of model:
The estimation of the unknown parameters using an appropriate method provides the values of the parameters. Substituting these values in the equation gives us a usable model. This is termed model fitting. The estimates of the parameters $\beta_1, \beta_2, \ldots, \beta_k$ in the model
$$ y = f(X_1, X_2, \ldots, X_k, \beta_1, \beta_2, \ldots, \beta_k) + \varepsilon $$
are denoted by $\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_k$, which gives the fitted model as
$$ y = f(X_1, X_2, \ldots, X_k, \hat\beta_1, \hat\beta_2, \ldots, \hat\beta_k). $$
When the value of $y$ is obtained for given values of $X_1, X_2, \ldots, X_k$, it is denoted as $\hat y$ and called the fitted value.

The fitted equation is also used for prediction; in this case, $\hat y$ is termed the predicted value. Note that a fitted value is one where the values used for the explanatory variables correspond to one of the $n$ observations in the data, whereas a predicted value is obtained for any set of values of the explanatory variables. It is generally not recommended to predict $y$-values for sets of values of the explanatory variables which lie outside the range of the data. When the values of the explanatory variables are future values, the predicted values are called forecasted values.

7. Model criticism and selection
The validity of the statistical methods to be used for regression analysis depends on various assumptions. These assumptions essentially become the assumptions for the model and the data. The quality of the statistical inferences heavily depends on whether these assumptions are satisfied or not. To make these assumptions valid, care is needed from the beginning of the experiment. One has to be careful in choosing the required assumptions and in determining whether the assumptions hold for the given experimental conditions. It is also important to identify the situations in which the assumptions may not be met.

The validation of the assumptions must be made before drawing any statistical conclusion. Any departure
from the validity of assumptions will be reflected in the statistical inferences. In fact, the regression analysis
is an iterative process where the outputs are used to diagnose, validate, criticize and modify the inputs. The
iterative process is illustrated in the following figure.

[Figure: the iterative regression process. Inputs (theories, model, assumptions, data, statistical methods) are used for estimation; the outputs (estimates of parameters, confidence regions, tests of hypotheses, graphical displays) are subjected to diagnosis, validation and criticism, which in turn modify the inputs.]

8. Objectives of regression analysis

The determination of the explicit form of the regression equation is the ultimate objective of regression analysis. The aim is finally a good and valid relationship between the study variable and the explanatory variables. The regression equation helps in understanding the interrelationships among the variables. Such a regression equation can be used for several purposes, for example, to determine the role of any explanatory variable in the joint relationship for policy formulation, or to forecast the values of the response variable for a given set of values of the explanatory variables.

Chapter 2
Simple Linear Regression Analysis

The simple linear regression model

We consider the modelling between the dependent variable and one independent variable. When there is only one independent variable in the linear regression model, the model is generally termed a simple linear regression model. When there is more than one independent variable in the model, the linear model is termed a multiple linear regression model.

The linear model

Consider the simple linear regression model
$$ y = \beta_0 + \beta_1 X + \varepsilon $$
where $y$ is termed the dependent or study variable and $X$ is termed the independent or explanatory variable. The terms $\beta_0$ and $\beta_1$ are the parameters of the model. The parameter $\beta_0$ is termed the intercept term, and the parameter $\beta_1$ is termed the slope parameter. These parameters are usually called regression coefficients. The unobservable error component $\varepsilon$ accounts for the failure of the data to lie on a straight line and represents the difference between the true and observed realizations of $y$. There can be several reasons for such a difference, e.g., the effect of all the variables omitted from the model, variables may be qualitative, inherent randomness in the observations, etc. We assume that $\varepsilon$ is an independent and identically distributed random variable with mean zero and constant variance $\sigma^2$. Later, we will additionally assume that $\varepsilon$ is normally distributed.

The independent variable is viewed as controlled by the experimenter, so it is considered non-stochastic, whereas $y$ is viewed as a random variable with
$$ E(y) = \beta_0 + \beta_1 X $$
and
$$ \operatorname{Var}(y) = \sigma^2. $$
Sometimes $X$ can also be a random variable. In such a case, instead of the unconditional mean and variance of $y$, we consider the conditional mean of $y$ given $X = x$ as
$$ E(y \mid x) = \beta_0 + \beta_1 x $$

and the conditional variance of $y$ given $X = x$ as
$$ \operatorname{Var}(y \mid x) = \sigma^2. $$

When the values of $\beta_0$, $\beta_1$ and $\sigma^2$ are known, the model is completely described. The parameters $\beta_0$, $\beta_1$ and $\sigma^2$ are generally unknown in practice and $\varepsilon$ is unobserved. The determination of the statistical model $y = \beta_0 + \beta_1 X + \varepsilon$ depends on the determination (i.e., estimation) of $\beta_0$, $\beta_1$ and $\sigma^2$. In order to know the values of these parameters, $n$ pairs of observations $(x_i, y_i)$, $i = 1, \ldots, n$, on $(X, y)$ are observed/collected and are used to determine these unknown parameters.

Various methods of estimation can be used to determine the estimates of the parameters. Among them, the methods of least squares and maximum likelihood are the most popular.

Least squares estimation

Suppose a sample of $n$ paired observations $(x_i, y_i)$, $i = 1, 2, \ldots, n$, is available. These observations are assumed to satisfy the simple linear regression model, and so we can write
$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \quad (i = 1, 2, \ldots, n). $$

The principle of least squares estimates the parameters $\beta_0$ and $\beta_1$ by minimizing the sum of squares of the differences between the observations and the line in the scatter diagram. Such an idea can be viewed from different perspectives. When the vertical differences between the observations and the line in the scatter diagram are considered, and their sum of squares is minimized to obtain the estimates of $\beta_0$ and $\beta_1$, the method is known

as direct regression.

[Figure: Direct regression method – the vertical differences between the observed points $(x_i, y_i)$ and the line $Y = \beta_0 + \beta_1 X$ are minimized.]
Alternatively, the sum of squares of the differences between the observations and the line in the horizontal direction in the scatter diagram can be minimized to obtain the estimates of $\beta_0$ and $\beta_1$. This is known as the reverse (or inverse) regression method.

[Figure: Reverse regression method – the horizontal differences between the observed points $(x_i, y_i)$ and the line $Y = \beta_0 + \beta_1 X$ are minimized.]

Instead of horizontal or vertical errors, if the sum of squares of the perpendicular distances between the observations and the line in the scatter diagram is minimized to obtain the estimates of $\beta_0$ and $\beta_1$, the method is known as orthogonal regression or the major axis regression method.

[Figure: Major axis regression method – the perpendicular distances between the observed points $(x_i, y_i)$ and the line $Y = \beta_0 + \beta_1 X$ are minimized.]
Instead of minimizing the distance, the area can also be minimized. The reduced major axis regression
method minimizes the sum of the areas of rectangles defined between the observed data points and the
nearest point on the line in the scatter diagram to obtain the estimates of regression coefficients. This is
shown in the following figure:

[Figure: Reduced major axis regression method – the areas of the rectangles defined between the observed points $(x_i, y_i)$ and the line $Y = \beta_0 + \beta_1 X$ are minimized.]

The method of least absolute deviation regression considers the sum of the absolute deviations of the observations from the line in the vertical direction in the scatter diagram, as in the case of direct regression, to obtain the estimates of $\beta_0$ and $\beta_1$.

No assumption is required about the form of the probability distribution of $\varepsilon_i$ in deriving the least squares estimates. For the purpose of deriving statistical inferences only, we assume that the $\varepsilon_i$'s are random variables with $E(\varepsilon_i) = 0$, $\operatorname{Var}(\varepsilon_i) = \sigma^2$ and $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$ $(i, j = 1, 2, \ldots, n)$. This assumption is needed to find the mean, variance and other properties of the least-squares estimates. The assumption that the $\varepsilon_i$'s are normally distributed is utilized while constructing the tests of hypotheses and confidence intervals for the parameters.

Based on these approaches, different estimates of $\beta_0$ and $\beta_1$ are obtained which have different statistical properties. Among them, the direct regression approach is the most popular. Generally, the direct regression estimates are referred to as the least-squares estimates or ordinary least squares estimates.
Direct regression method
This method is also known as ordinary least squares estimation. Assume that a set of $n$ paired observations $(x_i, y_i)$, $i = 1, 2, \ldots, n$, is available which satisfies the linear regression model $y = \beta_0 + \beta_1 X + \varepsilon$. So we can write the model for each observation as $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ $(i = 1, 2, \ldots, n)$.

The direct regression approach minimizes the sum of squares
$$ S(\beta_0, \beta_1) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 $$
with respect to $\beta_0$ and $\beta_1$.

The partial derivative of $S(\beta_0, \beta_1)$ with respect to $\beta_0$ is
$$ \frac{\partial S(\beta_0, \beta_1)}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) $$
and the partial derivative of $S(\beta_0, \beta_1)$ with respect to $\beta_1$ is
$$ \frac{\partial S(\beta_0, \beta_1)}{\partial \beta_1} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) x_i. $$

The solutions for $\beta_0$ and $\beta_1$ are obtained by setting
$$ \frac{\partial S(\beta_0, \beta_1)}{\partial \beta_0} = 0, \qquad \frac{\partial S(\beta_0, \beta_1)}{\partial \beta_1} = 0. $$
The solutions of these two equations are called the direct regression estimators, or usually the ordinary least squares (OLS) estimators, of $\beta_0$ and $\beta_1$.

This gives the ordinary least squares estimates $b_0$ of $\beta_0$ and $b_1$ of $\beta_1$ as
$$ b_0 = \bar{y} - b_1 \bar{x}, \qquad b_1 = \frac{s_{xy}}{s_{xx}} $$
where
$$ s_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \quad s_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i. $$

Further, we have
$$ \frac{\partial^2 S(\beta_0, \beta_1)}{\partial \beta_0^2} = 2n, \qquad \frac{\partial^2 S(\beta_0, \beta_1)}{\partial \beta_1^2} = 2\sum_{i=1}^{n} x_i^2, \qquad \frac{\partial^2 S(\beta_0, \beta_1)}{\partial \beta_0 \, \partial \beta_1} = 2\sum_{i=1}^{n} x_i = 2 n \bar{x}. $$

The Hessian matrix, which is the matrix of second-order partial derivatives, in this case is given by
$$ H^* = \begin{pmatrix} \dfrac{\partial^2 S(\beta_0, \beta_1)}{\partial \beta_0^2} & \dfrac{\partial^2 S(\beta_0, \beta_1)}{\partial \beta_0 \, \partial \beta_1} \\[2ex] \dfrac{\partial^2 S(\beta_0, \beta_1)}{\partial \beta_0 \, \partial \beta_1} & \dfrac{\partial^2 S(\beta_0, \beta_1)}{\partial \beta_1^2} \end{pmatrix} = 2 \begin{pmatrix} n & n\bar{x} \\ n\bar{x} & \sum_{i=1}^{n} x_i^2 \end{pmatrix} = 2 \begin{pmatrix} \ell' \\ x' \end{pmatrix} (\ell, x) $$
where $\ell = (1, 1, \ldots, 1)'$ is an $n$-vector with all elements unity and $x = (x_1, \ldots, x_n)'$ is the $n$-vector of observations on $X$.

The matrix $H^*$ is positive definite if its determinant and the element in the first row and first column of $H^*$ are positive. The determinant of $H^*$ is given by
$$ |H^*| = 4\left( n \sum_{i=1}^{n} x_i^2 - n^2 \bar{x}^2 \right) = 4 n \sum_{i=1}^{n} (x_i - \bar{x})^2 \geq 0. $$
The case $\sum_{i=1}^{n} (x_i - \bar{x})^2 = 0$ is not interesting because then all the observations are identical, i.e., $x_i = c$ (some constant). In such a case, there is no relationship between $x$ and $y$ in the context of regression analysis. Since $\sum_{i=1}^{n} (x_i - \bar{x})^2 > 0$, therefore $|H^*| > 0$. So $H^*$ is positive definite for any $(\beta_0, \beta_1)$; therefore, $S(\beta_0, \beta_1)$ has a global minimum at $(b_0, b_1)$.

The fitted line or the fitted linear regression model is
$$ y = b_0 + b_1 x. $$

The predicted values are
$$ \hat{y}_i = b_0 + b_1 x_i \quad (i = 1, 2, \ldots, n). $$

The difference between the observed value $y_i$ and the fitted (or predicted) value $\hat{y}_i$ is called a residual. The $i$-th residual is defined as
$$ e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i) \quad (i = 1, 2, \ldots, n). $$
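The OLS formulas translate directly into a few lines of code. The following sketch (a minimal illustration, not taken from the notes; the data are made up) computes $b_0$, $b_1$, the fitted values and the residuals with NumPy.

```python
# Hedged illustration of the direct regression (OLS) estimates b0, b1.
import numpy as np

# Hypothetical data (x_i, y_i), i = 1, ..., n
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xbar, ybar = x.mean(), y.mean()
sxx = np.sum((x - xbar) ** 2)           # s_xx
sxy = np.sum((x - xbar) * (y - ybar))   # s_xy

b1 = sxy / sxx           # slope estimate
b0 = ybar - b1 * xbar    # intercept estimate

y_hat = b0 + b1 * x      # fitted values
e = y - y_hat            # residuals (they sum to zero up to rounding)

print("b0 =", b0, "b1 =", b1)
print("residual sum:", e.sum())
```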

Properties of the direct regression estimators:

Unbiased property:
Note that $b_1 = \dfrac{s_{xy}}{s_{xx}}$ and $b_0 = \bar{y} - b_1 \bar{x}$ are linear combinations of $y_i$ $(i = 1, \ldots, n)$. Therefore
$$ b_1 = \sum_{i=1}^{n} k_i y_i $$
where $k_i = (x_i - \bar{x})/s_{xx}$. Note that $\sum_{i=1}^{n} k_i = 0$ and $\sum_{i=1}^{n} k_i x_i = 1$, so
$$ E(b_1) = \sum_{i=1}^{n} k_i E(y_i) = \sum_{i=1}^{n} k_i (\beta_0 + \beta_1 x_i) = \beta_1. $$
Thus $b_1$ is an unbiased estimator of $\beta_1$. Next,
$$ E(b_0) = E[\bar{y} - b_1 \bar{x}] = E[\beta_0 + \beta_1 \bar{x} + \bar{\varepsilon} - b_1 \bar{x}] = \beta_0 + \beta_1 \bar{x} - \beta_1 \bar{x} = \beta_0. $$
Thus $b_0$ is an unbiased estimator of $\beta_0$.
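As a quick sanity check of these unbiasedness results (a simulation sketch, not part of the notes; the true parameter values are assumed), one can repeatedly generate data from a known model and average the OLS estimates; the averages should be close to the true $\beta_0$ and $\beta_1$.

```python
# Hedged Monte Carlo check: averages of b0, b1 over many samples approximate beta0, beta1.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma, n, reps = 2.0, 0.5, 1.0, 30, 5000   # assumed true values
x = np.linspace(0, 10, n)                                # fixed (non-stochastic) regressor

b0s, b1s = [], []
for _ in range(reps):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    b0s.append(b0)
    b1s.append(b1)

print("mean of b0 estimates:", np.mean(b0s))   # close to 2.0
print("mean of b1 estimates:", np.mean(b1s))   # close to 0.5
```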

Variances:
Using the assumption that the $y_i$'s are independently distributed, the variance of $b_1$ is
$$ \operatorname{Var}(b_1) = \sum_{i=1}^{n} k_i^2 \operatorname{Var}(y_i) + \sum_{i \neq j} k_i k_j \operatorname{Cov}(y_i, y_j) = \sigma^2 \frac{\sum_i (x_i - \bar{x})^2}{s_{xx}^2} = \frac{\sigma^2 s_{xx}}{s_{xx}^2} = \frac{\sigma^2}{s_{xx}}, $$
since $\operatorname{Cov}(y_i, y_j) = 0$ as $y_1, \ldots, y_n$ are independent.

The variance of $b_0$ is
$$ \operatorname{Var}(b_0) = \operatorname{Var}(\bar{y}) + \bar{x}^2 \operatorname{Var}(b_1) - 2\bar{x}\operatorname{Cov}(\bar{y}, b_1). $$

First, we find that
$$ \operatorname{Cov}(\bar{y}, b_1) = E\big[\{\bar{y} - E(\bar{y})\}\{b_1 - E(b_1)\}\big] = E\Big[\bar{\varepsilon}\Big(\sum_i k_i y_i - \beta_1\Big)\Big] = \frac{1}{n} E\Big[\Big(\sum_i \varepsilon_i\Big)\Big(\beta_0 \sum_i k_i + \beta_1 \sum_i k_i x_i + \sum_i k_i \varepsilon_i - \beta_1\Big)\Big] = 0, $$
using $\sum_i k_i = 0$, $\sum_i k_i x_i = 1$ and $E(\varepsilon_i \varepsilon_j) = 0$ for $i \neq j$. So
$$ \operatorname{Var}(b_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{s_{xx}} \right). $$

Covariance:
The covariance between $b_0$ and $b_1$ is
$$ \operatorname{Cov}(b_0, b_1) = \operatorname{Cov}(\bar{y}, b_1) - \bar{x}\operatorname{Var}(b_1) = -\frac{\bar{x}}{s_{xx}}\, \sigma^2. $$

It can further be shown that the ordinary least squares estimators $b_0$ and $b_1$ possess the minimum variance in the class of linear and unbiased estimators, so they are termed the Best Linear Unbiased Estimators (BLUE). This property is known as the Gauss-Markov theorem, which is discussed later in the multiple linear regression model.
Residual sum of squares:
The residual sum of squares is given by
$$ SS_{res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 $$
$$ = \sum_{i=1}^{n} \big[ y_i - \bar{y} + b_1 \bar{x} - b_1 x_i \big]^2 = \sum_{i=1}^{n} \big[ (y_i - \bar{y}) - b_1 (x_i - \bar{x}) \big]^2 $$
$$ = \sum_{i=1}^{n} (y_i - \bar{y})^2 + b_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 - 2 b_1 \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) $$
$$ = s_{yy} + b_1^2 s_{xx} - 2 b_1^2 s_{xx} = s_{yy} - b_1^2 s_{xx} = s_{yy} - \left( \frac{s_{xy}}{s_{xx}} \right)^2 s_{xx} = s_{yy} - \frac{s_{xy}^2}{s_{xx}} = s_{yy} - b_1 s_{xy}, $$
where $s_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.

Estimation of $\sigma^2$
The estimator of $\sigma^2$ is obtained from the residual sum of squares as follows. Assuming that $y_i$ is normally distributed, it follows that $SS_{res}/\sigma^2$ has a $\chi^2$ distribution with $(n-2)$ degrees of freedom, i.e.,
$$ \frac{SS_{res}}{\sigma^2} \sim \chi^2(n-2). $$
Thus, using the result about the expectation of a chi-square random variable, we have
$$ E(SS_{res}) = (n-2)\sigma^2. $$
Thus an unbiased estimator of $\sigma^2$ is
$$ s^2 = \frac{SS_{res}}{n-2}. $$
Note that $SS_{res}$ has only $(n-2)$ degrees of freedom. The two degrees of freedom are lost due to the estimation of $b_0$ and $b_1$. Since $s^2$ depends on the estimates $b_0$ and $b_1$, it is a model-dependent estimate of $\sigma^2$.
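Continuing the earlier NumPy sketch (still an illustration with made-up data), the residual sum of squares and the unbiased estimate of $\sigma^2$ follow directly from the formulas above.

```python
# Hedged illustration: residual sum of squares and the unbiased estimate of sigma^2.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
syy = np.sum((y - y.mean()) ** 2)

b1 = sxy / sxx
SS_res = syy - b1 * sxy          # SS_res = s_yy - b1 * s_xy
s2 = SS_res / (n - 2)            # unbiased estimator of sigma^2

print("SS_res =", SS_res, " s^2 =", s2)
```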

Estimates of the variances of $b_0$ and $b_1$:
The estimators of the variances of $b_0$ and $b_1$ are obtained by replacing $\sigma^2$ by its estimate $\hat{\sigma}^2 = s^2$ as follows:
$$ \widehat{\operatorname{Var}}(b_0) = s^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{s_{xx}} \right) $$
and
$$ \widehat{\operatorname{Var}}(b_1) = \frac{s^2}{s_{xx}}. $$

It is observed that, since $\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0$, we have $\sum_{i=1}^{n} e_i = 0$. In the light of this property, $e_i$ can be regarded as an estimate of the unknown $\varepsilon_i$ $(i = 1, \ldots, n)$. This helps in verifying the various model assumptions on the basis of the given sample $(x_i, y_i)$, $i = 1, 2, \ldots, n$.

Further, note that
(i) $\sum_{i=1}^{n} x_i e_i = 0$,
(ii) $\sum_{i=1}^{n} \hat{y}_i e_i = 0$,
(iii) $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{y}_i$, and
(iv) the fitted line always passes through $(\bar{x}, \bar{y})$.

Centered Model:
Sometimes it is useful to measure the independent variable around its mean. In such a case, the model $y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ has a centered version as follows:
$$ y_i = \beta_0 + \beta_1 (x_i - \bar{x}) + \beta_1 \bar{x} + \varepsilon_i \quad (i = 1, 2, \ldots, n) $$
$$ = \beta_0^* + \beta_1 (x_i - \bar{x}) + \varepsilon_i $$
where $\beta_0^* = \beta_0 + \beta_1 \bar{x}$. The sum of squares due to error is given by
$$ S(\beta_0^*, \beta_1) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \big[ y_i - \beta_0^* - \beta_1 (x_i - \bar{x}) \big]^2. $$
Now solving
$$ \frac{\partial S(\beta_0^*, \beta_1)}{\partial \beta_0^*} = 0, \qquad \frac{\partial S(\beta_0^*, \beta_1)}{\partial \beta_1} = 0, $$

we get the direct regression least squares estimates of $\beta_0^*$ and $\beta_1$ as
$$ b_0^* = \bar{y} $$
and
$$ b_1 = \frac{s_{xy}}{s_{xx}}, $$
respectively.

Thus the form of the estimate of the slope parameter $\beta_1$ remains the same in the usual and centered models, whereas the form of the estimate of the intercept term changes between the usual and centered models.

Further, the Hessian matrix of the second-order partial derivatives of $S(\beta_0^*, \beta_1)$ with respect to $\beta_0^*$ and $\beta_1$ is positive definite at $\beta_0^* = b_0^*$ and $\beta_1 = b_1$, which ensures that $S(\beta_0^*, \beta_1)$ is minimized at $\beta_0^* = b_0^*$ and $\beta_1 = b_1$.

Under the assumptions that $E(\varepsilon_i) = 0$, $\operatorname{Var}(\varepsilon_i) = \sigma^2$ and $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j = 1, 2, \ldots, n$, it follows that
$$ E(b_0^*) = \beta_0^*, \qquad E(b_1) = \beta_1, $$
$$ \operatorname{Var}(b_0^*) = \frac{\sigma^2}{n}, \qquad \operatorname{Var}(b_1) = \frac{\sigma^2}{s_{xx}}. $$

In this case, the fitted model of $y_i = \beta_0^* + \beta_1 (x_i - \bar{x}) + \varepsilon_i$ is
$$ y = \bar{y} + b_1 (x - \bar{x}), $$
and the predicted values are
$$ \hat{y}_i = \bar{y} + b_1 (x_i - \bar{x}) \quad (i = 1, \ldots, n). $$

Note that in the centered model
$$ \operatorname{Cov}(b_0^*, b_1) = 0. $$

No intercept term model:
Sometimes in practice, a model without an intercept term is used in those situations when $x_i = 0 \Rightarrow y_i = 0$ for all $i = 1, 2, \ldots, n$. A no-intercept model is
$$ y_i = \beta_1 x_i + \varepsilon_i \quad (i = 1, 2, \ldots, n). $$
For example, in analyzing the relationship between the velocity ($y$) of a car and its acceleration ($X$), the velocity is zero when the acceleration is zero.

Using the data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, the direct regression least-squares estimate of $\beta_1$ is obtained by minimizing
$$ S(\beta_1) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_1 x_i)^2, $$
and solving
$$ \frac{\partial S(\beta_1)}{\partial \beta_1} = 0 $$
gives the estimator of $\beta_1$ as
$$ b_1^* = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}. $$
The second-order partial derivative of $S(\beta_1)$ with respect to $\beta_1$ at $\beta_1 = b_1^*$ is positive, which ensures that $b_1^*$ minimizes $S(\beta_1)$.

Using the assumptions that $E(\varepsilon_i) = 0$, $\operatorname{Var}(\varepsilon_i) = \sigma^2$ and $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j = 1, 2, \ldots, n$, the properties of $b_1^*$ can be derived as follows:
$$ E(b_1^*) = \frac{\sum_{i=1}^{n} x_i E(y_i)}{\sum_{i=1}^{n} x_i^2} = \frac{\sum_{i=1}^{n} x_i^2 \, \beta_1}{\sum_{i=1}^{n} x_i^2} = \beta_1. $$
Thus $b_1^*$ is an unbiased estimator of $\beta_1$. The variance of $b_1^*$ is obtained as follows:

$$ \operatorname{Var}(b_1^*) = \frac{\sum_{i=1}^{n} x_i^2 \operatorname{Var}(y_i)}{\left( \sum_{i=1}^{n} x_i^2 \right)^2} = \sigma^2 \frac{\sum_{i=1}^{n} x_i^2}{\left( \sum_{i=1}^{n} x_i^2 \right)^2} = \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2}, $$
and an unbiased estimator of $\sigma^2$ is obtained as
$$ \frac{\sum_{i=1}^{n} y_i^2 - b_1^* \sum_{i=1}^{n} x_i y_i}{n-1}. $$
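A short sketch of the no-intercept estimator (illustrative only, with hypothetical data) follows the formulas above.

```python
# Hedged illustration: least squares through the origin.
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
n = len(x)

b1_star = np.sum(x * y) / np.sum(x ** 2)                            # slope through the origin
sigma2_hat = (np.sum(y ** 2) - b1_star * np.sum(x * y)) / (n - 1)   # unbiased sigma^2 estimate

print("b1* =", b1_star, " sigma^2 estimate =", sigma2_hat)
```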

Maximum likelihood estimation

We assume that the $\varepsilon_i$'s $(i = 1, 2, \ldots, n)$ are independent and identically distributed following a normal distribution $N(0, \sigma^2)$. Now we use the method of maximum likelihood to estimate the parameters of the linear regression model
$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \quad (i = 1, 2, \ldots, n); $$
the observations $y_i$ $(i = 1, 2, \ldots, n)$ are then independently distributed as $N(\beta_0 + \beta_1 x_i, \sigma^2)$ for all $i = 1, 2, \ldots, n$.

The likelihood function of the given observations $(x_i, y_i)$ and the unknown parameters $\beta_0$, $\beta_1$ and $\sigma^2$ is
$$ L(x_i, y_i; \beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \left( \frac{1}{2\pi\sigma^2} \right)^{1/2} \exp\left[ -\frac{1}{2\sigma^2} (y_i - \beta_0 - \beta_1 x_i)^2 \right]. $$
The maximum likelihood estimates of $\beta_0$, $\beta_1$ and $\sigma^2$ can be obtained by maximizing $L(x_i, y_i; \beta_0, \beta_1, \sigma^2)$ or, equivalently, $\ln L(x_i, y_i; \beta_0, \beta_1, \sigma^2)$, where
$$ \ln L(x_i, y_i; \beta_0, \beta_1, \sigma^2) = -\frac{n}{2} \ln 2\pi - \frac{n}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2. $$

The normal equations are obtained by partially differentiating the log-likelihood with respect to $\beta_0$, $\beta_1$ and $\sigma^2$ and equating the derivatives to zero:
$$ \frac{\partial \ln L(x_i, y_i; \beta_0, \beta_1, \sigma^2)}{\partial \beta_0} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0, $$
$$ \frac{\partial \ln L(x_i, y_i; \beta_0, \beta_1, \sigma^2)}{\partial \beta_1} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)\, x_i = 0, $$
and
$$ \frac{\partial \ln L(x_i, y_i; \beta_0, \beta_1, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = 0. $$
The solution of these normal equations gives the maximum likelihood estimates of $\beta_0$, $\beta_1$ and $\sigma^2$ as
$$ b_0 = \bar{y} - b_1 \bar{x}, \qquad b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{s_{xy}}{s_{xx}}, \qquad \tilde{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2}{n}, $$
respectively.

It can be verified that the Hessian matrix of second-order partial derivatives of $\ln L$ with respect to $\beta_0$, $\beta_1$ and $\sigma^2$ is negative definite at $\beta_0 = b_0$, $\beta_1 = b_1$ and $\sigma^2 = \tilde{\sigma}^2$, which ensures that the likelihood function is maximized at these values.

Note that the least-squares and maximum likelihood estimates of $\beta_0$ and $\beta_1$ are identical. The least-squares and maximum likelihood estimates of $\sigma^2$ are different. In fact, the least-squares (unbiased) estimate of $\sigma^2$ is
$$ s^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, $$
so that it is related to the maximum likelihood estimate by
$$ \tilde{\sigma}^2 = \frac{n-2}{n}\, s^2. $$
Thus $b_0$ and $b_1$ are unbiased estimators of $\beta_0$ and $\beta_1$, whereas $\tilde{\sigma}^2$ is a biased estimate of $\sigma^2$, although it is asymptotically unbiased. The variances of the maximum likelihood estimates of $\beta_0$ and $\beta_1$ are the same as those of $b_0$ and $b_1$ respectively, but $\operatorname{Var}(\tilde{\sigma}^2) \neq \operatorname{Var}(s^2)$.
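The relationship $\tilde{\sigma}^2 = \frac{n-2}{n}\, s^2$ between the two variance estimates can be verified numerically; the sketch below (illustrative data, not from the notes) computes both.

```python
# Hedged illustration: least-squares (unbiased) vs maximum likelihood estimates of sigma^2.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.8, 4.2, 5.9, 8.3, 9.7, 12.1])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
SS_res = np.sum((y - b0 - b1 * x) ** 2)

s2_ls = SS_res / (n - 2)   # unbiased (least-squares based) estimate
s2_ml = SS_res / n         # maximum likelihood estimate (biased)

print(s2_ml, (n - 2) / n * s2_ls)   # the two numbers agree
```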

Testing of hypotheses and confidence interval estimation for the slope parameter:
Now we consider the tests of hypotheses and confidence interval estimation for the slope parameter of the model under two cases, viz., when $\sigma^2$ is known and when $\sigma^2$ is unknown.

Case 1: When $\sigma^2$ is known:

Consider the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ $(i = 1, 2, \ldots, n)$. It is assumed that the $\varepsilon_i$'s are independent and identically distributed and follow $N(0, \sigma^2)$.

First, we develop a test for the null hypothesis related to the slope parameter
$$ H_0 : \beta_1 = \beta_{10} $$
where $\beta_{10}$ is some given constant.

Assuming $\sigma^2$ to be known, we know that $E(b_1) = \beta_1$, $\operatorname{Var}(b_1) = \dfrac{\sigma^2}{s_{xx}}$, and $b_1$ is a linear combination of the normally distributed $y_i$'s. So
$$ b_1 \sim N\left( \beta_1, \frac{\sigma^2}{s_{xx}} \right) $$
and the following statistic can be constructed:
$$ Z_1 = \frac{b_1 - \beta_{10}}{\sqrt{\dfrac{\sigma^2}{s_{xx}}}} $$
which is distributed as $N(0, 1)$ when $H_0$ is true.

A decision rule to test $H_1 : \beta_1 \neq \beta_{10}$ can be framed as follows:
Reject $H_0$ if $|Z_1| \geq z_{\alpha/2}$
where $z_{\alpha/2}$ is the $\alpha/2$ percentage point of the normal distribution.

Similarly, the decision rule for a one-sided alternative hypothesis can also be framed.

The $100(1-\alpha)\%$ confidence interval for $\beta_1$ can be obtained using the $Z_1$ statistic as follows:
$$ P\left[ -z_{\alpha/2} \leq Z_1 \leq z_{\alpha/2} \right] = 1 - \alpha $$
$$ P\left[ -z_{\alpha/2} \leq \frac{b_1 - \beta_1}{\sqrt{\dfrac{\sigma^2}{s_{xx}}}} \leq z_{\alpha/2} \right] = 1 - \alpha $$
$$ P\left[ b_1 - z_{\alpha/2}\sqrt{\frac{\sigma^2}{s_{xx}}} \leq \beta_1 \leq b_1 + z_{\alpha/2}\sqrt{\frac{\sigma^2}{s_{xx}}} \right] = 1 - \alpha. $$
So the $100(1-\alpha)\%$ confidence interval for $\beta_1$ is
$$ \left[ b_1 - z_{\alpha/2}\sqrt{\frac{\sigma^2}{s_{xx}}},\; b_1 + z_{\alpha/2}\sqrt{\frac{\sigma^2}{s_{xx}}} \right] $$
where $z_{\alpha/2}$ is the $\alpha/2$ percentage point of the $N(0,1)$ distribution.

Case 2: When $\sigma^2$ is unknown:

When $\sigma^2$ is unknown, we proceed as follows. We know that
$$ \frac{SS_{res}}{\sigma^2} \sim \chi^2(n-2) $$
and
$$ E\left( \frac{SS_{res}}{n-2} \right) = \sigma^2. $$
Further, $SS_{res}/\sigma^2$ and $b_1$ are independently distributed. This result will be proved formally later in the module on multiple linear regression. It also follows from the result that, under the normal distribution, the maximum likelihood estimates, viz., the sample mean (estimator of the population mean) and the sample variance (estimator of the population variance), are independently distributed; so $b_1$ and $s^2$ are also independently distributed.
Thus the following statistic can be constructed:
$$ t_0 = \frac{b_1 - \beta_1}{\sqrt{\dfrac{\hat{\sigma}^2}{s_{xx}}}} = \frac{b_1 - \beta_1}{\sqrt{\dfrac{SS_{res}}{(n-2)\, s_{xx}}}} $$
which follows a $t$-distribution with $(n-2)$ degrees of freedom, denoted $t_{n-2}$, when $H_0$ is true.

A decision rule to test $H_1 : \beta_1 \neq \beta_{10}$ is to
reject $H_0$ if $|t_0| \geq t_{n-2, \alpha/2}$
where $t_{n-2, \alpha/2}$ is the $\alpha/2$ percentage point of the $t$-distribution with $(n-2)$ degrees of freedom. Similarly, the decision rule for a one-sided alternative hypothesis can also be framed.

The $100(1-\alpha)\%$ confidence interval for $\beta_1$ can be obtained using the $t_0$ statistic as follows. Consider
$$ P\left[ -t_{\alpha/2} \leq t_0 \leq t_{\alpha/2} \right] = 1 - \alpha $$
$$ P\left[ -t_{\alpha/2} \leq \frac{b_1 - \beta_1}{\sqrt{\dfrac{\hat{\sigma}^2}{s_{xx}}}} \leq t_{\alpha/2} \right] = 1 - \alpha $$
$$ P\left[ b_1 - t_{\alpha/2} \sqrt{\frac{\hat{\sigma}^2}{s_{xx}}} \leq \beta_1 \leq b_1 + t_{\alpha/2} \sqrt{\frac{\hat{\sigma}^2}{s_{xx}}} \right] = 1 - \alpha. $$
So the $100(1-\alpha)\%$ confidence interval for $\beta_1$ is
$$ \left[ b_1 - t_{n-2, \alpha/2} \sqrt{\frac{SS_{res}}{(n-2)\, s_{xx}}},\; b_1 + t_{n-2, \alpha/2} \sqrt{\frac{SS_{res}}{(n-2)\, s_{xx}}} \right]. $$
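For the unknown-$\sigma^2$ case, the same calculation with the $t$ quantile looks as follows (again an illustrative sketch with hypothetical numbers).

```python
# Hedged illustration: t test and confidence interval for the slope when sigma^2 is unknown.
import numpy as np
from scipy import stats

b1, beta10 = 1.95, 2.0            # hypothetical estimate and hypothesized value
SS_res, n, sxx, alpha = 3.1, 12, 40.0, 0.05

se = np.sqrt(SS_res / ((n - 2) * sxx))
t0 = (b1 - beta10) / se
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)

print("reject H0:", abs(t0) >= t_crit)
print("CI:", (b1 - t_crit * se, b1 + t_crit * se))
```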

Testing of hypotheses and confidence interval estimation for the intercept term:

Now we consider the tests of hypotheses and confidence interval estimation for the intercept term under two cases, viz., when $\sigma^2$ is known and when $\sigma^2$ is unknown.

Case 1: When $\sigma^2$ is known:

Suppose the null hypothesis under consideration is
$$ H_0 : \beta_0 = \beta_{00}, $$
where $\sigma^2$ is known. Then, using the results that $E(b_0) = \beta_0$, $\operatorname{Var}(b_0) = \sigma^2\left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{s_{xx}} \right)$ and $b_0$ is a linear combination of normally distributed random variables, the following statistic
$$ Z_0 = \frac{b_0 - \beta_{00}}{\sqrt{\sigma^2\left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{s_{xx}} \right)}} $$
has a $N(0, 1)$ distribution when $H_0$ is true.

A decision rule to test $H_1 : \beta_0 \neq \beta_{00}$ can be framed as follows:
Reject $H_0$ if $|Z_0| \geq z_{\alpha/2}$
where $z_{\alpha/2}$ is the $\alpha/2$ percentage point of the normal distribution. Similarly, the decision rule for a one-sided alternative hypothesis can also be framed.

The $100(1-\alpha)\%$ confidence interval for $\beta_0$ when $\sigma^2$ is known can be derived using the $Z_0$ statistic as follows:
$$ P\left[ -z_{\alpha/2} \leq Z_0 \leq z_{\alpha/2} \right] = 1 - \alpha $$
$$ P\left[ -z_{\alpha/2} \leq \frac{b_0 - \beta_0}{\sqrt{\sigma^2\left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{s_{xx}} \right)}} \leq z_{\alpha/2} \right] = 1 - \alpha $$
$$ P\left[ b_0 - z_{\alpha/2}\sqrt{\sigma^2\left( \frac{1}{n} + \frac{\bar{x}^2}{s_{xx}} \right)} \leq \beta_0 \leq b_0 + z_{\alpha/2}\sqrt{\sigma^2\left( \frac{1}{n} + \frac{\bar{x}^2}{s_{xx}} \right)} \right] = 1 - \alpha. $$
So the $100(1-\alpha)\%$ confidence interval for $\beta_0$ is
$$ \left[ b_0 - z_{\alpha/2}\sqrt{\sigma^2\left( \frac{1}{n} + \frac{\bar{x}^2}{s_{xx}} \right)},\; b_0 + z_{\alpha/2}\sqrt{\sigma^2\left( \frac{1}{n} + \frac{\bar{x}^2}{s_{xx}} \right)} \right]. $$

Case 2: When $\sigma^2$ is unknown:

When $\sigma^2$ is unknown, the following statistic is constructed:
$$ t_0 = \frac{b_0 - \beta_{00}}{\sqrt{\dfrac{SS_{res}}{n-2}\left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{s_{xx}} \right)}} $$
which follows a $t$-distribution with $(n-2)$ degrees of freedom, i.e., $t_{n-2}$, when $H_0$ is true.

A decision rule to test $H_1 : \beta_0 \neq \beta_{00}$ is as follows:
Reject $H_0$ whenever $|t_0| \geq t_{n-2, \alpha/2}$
where $t_{n-2, \alpha/2}$ is the $\alpha/2$ percentage point of the $t$-distribution with $(n-2)$ degrees of freedom. Similarly, the decision rule for a one-sided alternative hypothesis can also be framed.

The $100(1-\alpha)\%$ confidence interval for $\beta_0$ can be obtained as follows. Consider
$$ P\left[ -t_{n-2,\alpha/2} \leq t_0 \leq t_{n-2,\alpha/2} \right] = 1 - \alpha $$
$$ P\left[ -t_{n-2,\alpha/2} \leq \frac{b_0 - \beta_0}{\sqrt{\dfrac{SS_{res}}{n-2}\left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{s_{xx}} \right)}} \leq t_{n-2,\alpha/2} \right] = 1 - \alpha $$
$$ P\left[ b_0 - t_{n-2,\alpha/2}\sqrt{\frac{SS_{res}}{n-2}\left( \frac{1}{n} + \frac{\bar{x}^2}{s_{xx}} \right)} \leq \beta_0 \leq b_0 + t_{n-2,\alpha/2}\sqrt{\frac{SS_{res}}{n-2}\left( \frac{1}{n} + \frac{\bar{x}^2}{s_{xx}} \right)} \right] = 1 - \alpha. $$
So the $100(1-\alpha)\%$ confidence interval for $\beta_0$ is
$$ \left[ b_0 - t_{n-2,\alpha/2}\sqrt{\frac{SS_{res}}{n-2}\left( \frac{1}{n} + \frac{\bar{x}^2}{s_{xx}} \right)},\; b_0 + t_{n-2,\alpha/2}\sqrt{\frac{SS_{res}}{n-2}\left( \frac{1}{n} + \frac{\bar{x}^2}{s_{xx}} \right)} \right]. $$

Test of hypothesis for $\sigma^2$

We have considered two types of test statistics for testing hypotheses about the intercept term and the slope parameter – when $\sigma^2$ is known and when $\sigma^2$ is unknown. While dealing with the case of known $\sigma^2$, the value of $\sigma^2$ is known from some external source like past experience, long association of the experimenter with the experiment, past studies, etc. In such situations, the experimenter may like to test a hypothesis like $H_0 : \sigma^2 = \sigma_0^2$ against $H_1 : \sigma^2 \neq \sigma_0^2$, where $\sigma_0^2$ is specified. The test statistic is based on the result $SS_{res}/\sigma^2 \sim \chi^2_{n-2}$. So the test statistic is
$$ C_0 = \frac{SS_{res}}{\sigma_0^2} \sim \chi^2_{n-2} \text{ under } H_0. $$
The decision rule is to reject $H_0$ if $C_0 \leq \chi^2_{n-2, \alpha/2}$ or $C_0 \geq \chi^2_{n-2, 1-\alpha/2}$.

Confidence interval for $\sigma^2$
A confidence interval for $\sigma^2$ can also be derived as follows. Since $SS_{res}/\sigma^2 \sim \chi^2_{n-2}$, consider
$$ P\left[ \chi^2_{n-2, \alpha/2} \leq \frac{SS_{res}}{\sigma^2} \leq \chi^2_{n-2, 1-\alpha/2} \right] = 1 - \alpha $$
$$ P\left[ \frac{SS_{res}}{\chi^2_{n-2, 1-\alpha/2}} \leq \sigma^2 \leq \frac{SS_{res}}{\chi^2_{n-2, \alpha/2}} \right] = 1 - \alpha. $$
The corresponding $100(1-\alpha)\%$ confidence interval for $\sigma^2$ is
$$ \left[ \frac{SS_{res}}{\chi^2_{n-2, 1-\alpha/2}},\; \frac{SS_{res}}{\chi^2_{n-2, \alpha/2}} \right]. $$

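A numerical sketch of the $\chi^2$ test and interval for $\sigma^2$ (hypothetical inputs; SciPy supplies the quantiles) is given below; `chi2.ppf(alpha/2, df)` is the lower quantile used here.

```python
# Hedged illustration: chi-square test and confidence interval for sigma^2.
from scipy import stats

SS_res, n, sigma0_sq, alpha = 6.4, 15, 0.4, 0.05
df = n - 2

C0 = SS_res / sigma0_sq
lower_q = stats.chi2.ppf(alpha / 2, df)       # chi^2_{n-2, alpha/2}
upper_q = stats.chi2.ppf(1 - alpha / 2, df)   # chi^2_{n-2, 1-alpha/2}

print("reject H0:", C0 <= lower_q or C0 >= upper_q)
print("CI for sigma^2:", (SS_res / upper_q, SS_res / lower_q))
```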
Joint confidence region for $\beta_0$ and $\beta_1$:

A joint confidence region for $\beta_0$ and $\beta_1$ can also be found. Such a region provides $100(1-\alpha)\%$ confidence that the estimates of $\beta_0$ and $\beta_1$ are simultaneously correct. Consider the centered version of the linear regression model
$$ y_i = \beta_0^* + \beta_1 (x_i - \bar{x}) + \varepsilon_i $$
where $\beta_0^* = \beta_0 + \beta_1 \bar{x}$. The least squares estimators of $\beta_0^*$ and $\beta_1$ are
$$ b_0^* = \bar{y} \quad \text{and} \quad b_1 = \frac{s_{xy}}{s_{xx}}, $$
respectively.
We use the results that
$$ E(b_0^*) = \beta_0^*, \quad E(b_1) = \beta_1, \quad \operatorname{Var}(b_0^*) = \frac{\sigma^2}{n}, \quad \operatorname{Var}(b_1) = \frac{\sigma^2}{s_{xx}}. $$

When $\sigma^2$ is known, the statistics
$$ \frac{b_0^* - \beta_0^*}{\sqrt{\sigma^2/n}} \sim N(0, 1) \quad \text{and} \quad \frac{b_1 - \beta_1}{\sqrt{\sigma^2/s_{xx}}} \sim N(0, 1). $$

Moreover, both statistics are independently distributed. Thus
$$ \left( \frac{b_0^* - \beta_0^*}{\sqrt{\sigma^2/n}} \right)^2 \sim \chi^2_1 \quad \text{and} \quad \left( \frac{b_1 - \beta_1}{\sqrt{\sigma^2/s_{xx}}} \right)^2 \sim \chi^2_1 $$
are also independently distributed, because $b_0^*$ and $b_1$ are independently distributed. Consequently, their sum
$$ \frac{n(b_0^* - \beta_0^*)^2}{\sigma^2} + \frac{s_{xx}(b_1 - \beta_1)^2}{\sigma^2} \sim \chi^2_2. $$
Since
$$ \frac{SS_{res}}{\sigma^2} \sim \chi^2_{n-2} $$
and $SS_{res}$ is distributed independently of $b_0^*$ and $b_1$, the ratio
$$ \frac{\left[ \dfrac{n(b_0^* - \beta_0^*)^2}{\sigma^2} + \dfrac{s_{xx}(b_1 - \beta_1)^2}{\sigma^2} \right] \Big/ \, 2}{\dfrac{SS_{res}}{\sigma^2} \Big/ (n-2)} \sim F_{2, n-2}. $$
Substituting $b_0^* = b_0 + b_1 \bar{x}$ and $\beta_0^* = \beta_0 + \beta_1 \bar{x}$, this statistic can be written as
$$ \left( \frac{n-2}{2} \right) \frac{Q_f}{SS_{res}} $$
where
$$ Q_f = n(b_0 - \beta_0)^2 + 2 n \bar{x}\,(b_0 - \beta_0)(b_1 - \beta_1) + \sum_{i=1}^{n} x_i^2 \,(b_1 - \beta_1)^2. $$
Since
$$ P\left[ \left( \frac{n-2}{2} \right) \frac{Q_f}{SS_{res}} \leq F_{2, n-2; 1-\alpha} \right] = 1 - \alpha $$
holds true for all values of $\beta_0$ and $\beta_1$, the $100(1-\alpha)\%$ confidence region for $\beta_0$ and $\beta_1$ is
$$ \left( \frac{n-2}{2} \right) \frac{Q_f}{SS_{res}} \leq F_{2, n-2; 1-\alpha}. $$
This confidence region is an ellipse which gives $100(1-\alpha)\%$ probability that $\beta_0$ and $\beta_1$ are contained simultaneously in this ellipse.

Analysis of variance:
The technique of analysis of variance is usually used for testing hypotheses related to the equality of more than one parameter, like population means or slope parameters. It is more meaningful in the case of the multiple regression model, where there is more than one slope parameter. The technique is discussed and illustrated here to explain the related basic concepts and fundamentals, which will be used in developing the analysis of variance in the next module on the multiple linear regression model, where the explanatory variables are more than two.

A test statistic for testing $H_0 : \beta_1 = 0$ can also be formulated using the analysis of variance technique as follows.

On the basis of the identity
$$ y_i - \hat{y}_i = (y_i - \bar{y}) - (\hat{y}_i - \bar{y}), $$
the sum of squared residuals is
$$ S(b) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 - 2\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{y}). $$
Further, consider
$$ \sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{y}) = \sum_{i=1}^{n} (y_i - \bar{y})\, b_1 (x_i - \bar{x}) = b_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2. $$
Thus we have
$$ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2. $$

The term $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is called the sum of squares about the mean, the corrected sum of squares of $y$ (i.e., $SS_{corrected}$), the total sum of squares, or $s_{yy}$.

The term $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ describes the deviation of the observations from the predicted values, viz., the residual sum of squares,
$$ SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, $$
whereas the term $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ describes the proportion of variability explained by the regression,
$$ SS_{reg} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2. $$
If all the observations $y_i$ are located on a straight line, then $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 0$ and thus $SS_{corrected} = SS_{reg}$.

Note that $SS_{reg}$ is completely determined by $b_1$ and so has only one degree of freedom. The total sum of squares $s_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ has $(n-1)$ degrees of freedom due to the constraint $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$, and $SS_{res}$ has $(n-2)$ degrees of freedom as it depends on the determination of $b_0$ and $b_1$.

If the errors are normally distributed, these sums of squares, divided by $\sigma^2$, follow $\chi^2_{df}$ distributions with the respective degrees of freedom, and $SS_{reg}$ and $SS_{res}$ are mutually independent.

The mean square due to regression is
$$ MS_{reg} = \frac{SS_{reg}}{1} $$
and the mean square due to residuals is
$$ MSE = \frac{SS_{res}}{n-2}. $$
The test statistic for testing $H_0 : \beta_1 = 0$ is
$$ F_0 = \frac{MS_{reg}}{MSE}. $$
If $H_0 : \beta_1 = 0$ is true, then $MS_{reg}$ and $MSE$ are independently distributed and thus
$$ F_0 \sim F_{1, n-2}. $$

The decision rule for H1 : 1  0 is to reject H 0 if

F0  F1,n − 2;1−

at  level of significance. The test procedure can be described in an Analysis of variance table.

Analysis of variance for testing H_0: \beta_1 = 0

Source of variation   Sum of squares   Degrees of freedom   Mean square   F
Regression            SS_{reg}         1                    MS_{reg}      MS_{reg}/MSE
Residual              SS_{res}         n - 2                MSE
Total                 s_{yy}           n - 1
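The entries of this table are easy to compute directly. The following is a minimal Python sketch (the data, variable names and the use of scipy.stats.f are illustrative assumptions, not part of the notes) that builds SS_{reg}, SS_{res}, the F statistic and its p-value for a simple linear regression.

```python
import numpy as np
from scipy import stats

# illustrative paired observations (x_i, y_i)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.8])
n = len(y)

sxx = np.sum((x - x.mean())**2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
syy = np.sum((y - y.mean())**2)          # total (corrected) sum of squares

b1 = sxy / sxx                            # OLS slope
b0 = y.mean() - b1 * x.mean()             # OLS intercept
y_hat = b0 + b1 * x

ss_reg = np.sum((y_hat - y.mean())**2)    # 1 degree of freedom
ss_res = np.sum((y - y_hat)**2)           # n - 2 degrees of freedom

ms_reg = ss_reg / 1
mse = ss_res / (n - 2)
F0 = ms_reg / mse
p_value = stats.f.sf(F0, 1, n - 2)        # reject H0: beta1 = 0 when p is small
print(F0, p_value)
```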

Some other forms of SS_{reg}, SS_{res} and s_{yy} can be derived as follows.

The sample correlation coefficient may be written as

r_{xy} = \frac{s_{xy}}{\sqrt{s_{xx} s_{yy}}}.

Moreover, we have

b_1 = \frac{s_{xy}}{s_{xx}} = r_{xy} \sqrt{\frac{s_{yy}}{s_{xx}}}.

The estimator of \sigma^2 in this case may be expressed as

s^2 = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2 = \frac{1}{n-2} SS_{res}.

Various alternative formulations of SS_{res} are in use as well:

SS_{res} = \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_i)]^2
         = \sum_{i=1}^{n} [(y_i - \bar{y}) - b_1 (x_i - \bar{x})]^2
         = s_{yy} + b_1^2 s_{xx} - 2 b_1 s_{xy}
         = s_{yy} - b_1^2 s_{xx}
         = s_{yy} - \frac{s_{xy}^2}{s_{xx}}.
Using this result, we find that

SS_{corrected} = s_{yy}

and

SS_{reg} = s_{yy} - SS_{res} = \frac{s_{xy}^2}{s_{xx}} = b_1^2 s_{xx} = b_1 s_{xy}.

Goodness of fit of regression


It can be noted that a fitted model can be said to be good when the residuals are small. Since SS_{res} is based on the residuals, a measure of the quality of a fitted model can be based on SS_{res}. When the intercept term is present in the model, a measure of goodness of fit of the model is given by

R^2 = 1 - \frac{SS_{res}}{s_{yy}} = \frac{SS_{reg}}{s_{yy}}.

This is known as the coefficient of determination. It expresses how much of the variation in y, as measured by s_{yy}, is explained by SS_{reg} and how much remains unexplained in SS_{res}. The ratio SS_{reg}/s_{yy} describes the proportion of variability that is explained by the regression in relation to the total variability of y, while the ratio SS_{res}/s_{yy} describes the proportion of variability that is not explained by the regression.

It can be seen that

R^2 = r_{xy}^2

where r_{xy} is the simple correlation coefficient between x and y. Clearly 0 \le R^2 \le 1, so a value of R^2 closer to one indicates a better fit and a value of R^2 closer to zero indicates a poor fit.
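As a quick numerical check of the identity R^2 = r_{xy}^2, one can continue the kind of sketch shown above (the data are again an illustrative assumption):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.8])

sxx = np.sum((x - x.mean())**2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
syy = np.sum((y - y.mean())**2)

b1 = sxy / sxx
b0 = y.mean() - b1 * x.mean()
ss_res = np.sum((y - (b0 + b1 * x))**2)

R2 = 1.0 - ss_res / syy                # coefficient of determination
r_xy = sxy / np.sqrt(sxx * syy)        # sample correlation coefficient
print(R2, r_xy**2)                     # identical up to floating-point rounding
```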

Prediction of values of study variable


An important use of linear regression modeling is to predict the average and actual values of the study
variable. The term prediction of the value of study variable corresponds to knowing the value of E ( y ) (in
case of average value) and value of y (in case of actual value) for a given value of the explanatory variable.
We consider both cases.

Case 1: Prediction of average value
Under the linear regression model y = \beta_0 + \beta_1 x + \varepsilon, the fitted model is \hat{y} = b_0 + b_1 x, where b_0 and b_1 are the OLS estimators of \beta_0 and \beta_1 respectively.

Suppose we want to predict the value of E(y) for a given value of x = x_0. Then the predictor is given by

\hat{\mu}_{y|x_0} = b_0 + b_1 x_0 .

Predictive bias
The prediction error is given as

\hat{\mu}_{y|x_0} - E(y) = b_0 + b_1 x_0 - E(\beta_0 + \beta_1 x_0 + \varepsilon) = (b_0 - \beta_0) + (b_1 - \beta_1) x_0 .

Then

E[\hat{\mu}_{y|x_0} - E(y)] = E(b_0 - \beta_0) + E(b_1 - \beta_1) x_0 = 0 + 0 = 0.

Thus the predictor \hat{\mu}_{y|x_0} is an unbiased predictor of E(y).

Predictive variance:
The predictive variance of \hat{\mu}_{y|x_0} is

PV(\hat{\mu}_{y|x_0}) = Var(b_0 + b_1 x_0)
                      = Var[\bar{y} + b_1 (x_0 - \bar{x})]
                      = Var(\bar{y}) + (x_0 - \bar{x})^2 Var(b_1) + 2(x_0 - \bar{x}) Cov(\bar{y}, b_1)
                      = \frac{\sigma^2}{n} + \frac{\sigma^2 (x_0 - \bar{x})^2}{s_{xx}} + 0
                      = \sigma^2 \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}} \right].

Estimate of predictive variance
The predictive variance can be estimated by substituting \sigma^2 by \hat{\sigma}^2 = MSE as

\widehat{PV}(\hat{\mu}_{y|x_0}) = \hat{\sigma}^2 \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}} \right] = MSE \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}} \right].

Prediction interval estimation:
The 100(1-\alpha)% prediction interval for E(y|x_0) is obtained as follows.

The predictor \hat{\mu}_{y|x_0} is a linear combination of normally distributed random variables, so it is also normally distributed:

\hat{\mu}_{y|x_0} \sim N\big(\beta_0 + \beta_1 x_0, \; PV(\hat{\mu}_{y|x_0})\big).

So if \sigma^2 is known, then the distribution of

\frac{\hat{\mu}_{y|x_0} - E(y|x_0)}{\sqrt{PV(\hat{\mu}_{y|x_0})}}

is N(0,1), and the 100(1-\alpha)% prediction interval is obtained from

P\left[ -z_{\alpha/2} \le \frac{\hat{\mu}_{y|x_0} - E(y|x_0)}{\sqrt{PV(\hat{\mu}_{y|x_0})}} \le z_{\alpha/2} \right] = 1-\alpha,

which gives the prediction interval for E(y|x_0) as

\left[ \hat{\mu}_{y|x_0} - z_{\alpha/2}\sqrt{\sigma^2\left(\frac{1}{n} + \frac{(x_0-\bar{x})^2}{s_{xx}}\right)}, \;\; \hat{\mu}_{y|x_0} + z_{\alpha/2}\sqrt{\sigma^2\left(\frac{1}{n} + \frac{(x_0-\bar{x})^2}{s_{xx}}\right)} \right].

When \sigma^2 is unknown, it is replaced by \hat{\sigma}^2 = MSE, and in this case the sampling distribution of

\frac{\hat{\mu}_{y|x_0} - E(y|x_0)}{\sqrt{MSE\left(\frac{1}{n} + \frac{(x_0-\bar{x})^2}{s_{xx}}\right)}}

is the t-distribution with (n-2) degrees of freedom, i.e., t_{n-2}. The 100(1-\alpha)% prediction interval in this case follows from

P\left[ -t_{\alpha/2, n-2} \le \frac{\hat{\mu}_{y|x_0} - E(y|x_0)}{\sqrt{MSE\left(\frac{1}{n} + \frac{(x_0-\bar{x})^2}{s_{xx}}\right)}} \le t_{\alpha/2, n-2} \right] = 1-\alpha,

which gives the prediction interval as

\left[ \hat{\mu}_{y|x_0} - t_{\alpha/2, n-2}\sqrt{MSE\left(\frac{1}{n} + \frac{(x_0-\bar{x})^2}{s_{xx}}\right)}, \;\; \hat{\mu}_{y|x_0} + t_{\alpha/2, n-2}\sqrt{MSE\left(\frac{1}{n} + \frac{(x_0-\bar{x})^2}{s_{xx}}\right)} \right].

Note that the width of the prediction interval for E(y|x_0) is a function of x_0. The interval width is minimum for x_0 = \bar{x} and widens as |x_0 - \bar{x}| increases. This is expected, since the best estimates of y are made at x-values near the centre of the data, and the precision of estimation deteriorates as we move towards the boundary of the x-space.
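A minimal numerical sketch of this interval (the data, the point x0 and the level alpha = 0.05 are illustrative assumptions), using scipy's t quantiles:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.8])
n = len(y)
x0 = 2.5                                   # point at which E(y | x0) is predicted
alpha = 0.05

sxx = np.sum((x - x.mean())**2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x))**2) / (n - 2)

mu_hat = b0 + b1 * x0                      # predictor of the average value E(y | x0)
se = np.sqrt(mse * (1.0 / n + (x0 - x.mean())**2 / sxx))
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
print(mu_hat - t_crit * se, mu_hat + t_crit * se)   # 95% interval for E(y | x0)
```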

Case 2: Prediction of actual value


If x_0 is the value of the explanatory variable, then the actual value predictor for y is

\hat{y}_0 = b_0 + b_1 x_0 .

The true value of y in the prediction period is given by y_0 = \beta_0 + \beta_1 x_0 + \varepsilon_0, where \varepsilon_0 indicates the value that would be drawn from the distribution of the random error in the prediction period. Note that the form of the predictor is the same as that of the average value predictor, but its predictive error and other properties are different. This is the dual nature of the predictor.

Predictive bias:
The predictive error of \hat{y}_0 is given by

\hat{y}_0 - y_0 = b_0 + b_1 x_0 - (\beta_0 + \beta_1 x_0 + \varepsilon_0) = (b_0 - \beta_0) + (b_1 - \beta_1) x_0 - \varepsilon_0 .

Thus, we find that

E(\hat{y}_0 - y_0) = E(b_0 - \beta_0) + E(b_1 - \beta_1) x_0 - E(\varepsilon_0) = 0 + 0 + 0 = 0,

which implies that \hat{y}_0 is an unbiased predictor of y_0.

Predictive variance
Because the future observation y_0 is independent of \hat{y}_0, the predictive variance of \hat{y}_0 is

PV(\hat{y}_0) = E(\hat{y}_0 - y_0)^2
             = E[(b_0 - \beta_0) + (b_1 - \beta_1) x_0 - \varepsilon_0]^2
             = Var(b_0) + x_0^2 Var(b_1) + Var(\varepsilon_0) + 2 x_0 Cov(b_0, b_1)

(the remaining cross terms are 0 because \varepsilon_0 is independent of \varepsilon_1, \varepsilon_2, ..., \varepsilon_n)

             = \sigma^2\left( \frac{1}{n} + \frac{\bar{x}^2}{s_{xx}} \right) + \frac{\sigma^2 x_0^2}{s_{xx}} + \sigma^2 - \frac{2\sigma^2 x_0 \bar{x}}{s_{xx}}
             = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}} \right].

Estimate of predictive variance
The estimate of the predictive variance can be obtained by replacing \sigma^2 by its estimate \hat{\sigma}^2 = MSE as

\widehat{PV}(\hat{y}_0) = \hat{\sigma}^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}} \right] = MSE \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}} \right].

Prediction interval:
If \sigma^2 is known, then the distribution of

\frac{\hat{y}_0 - y_0}{\sqrt{PV(\hat{y}_0)}}

is N(0,1), so the 100(1-\alpha)% prediction interval is obtained from

P\left[ -z_{\alpha/2} \le \frac{\hat{y}_0 - y_0}{\sqrt{PV(\hat{y}_0)}} \le z_{\alpha/2} \right] = 1-\alpha,

which gives the prediction interval for y_0 as

\left[ \hat{y}_0 - z_{\alpha/2}\sqrt{\sigma^2\left(1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{s_{xx}}\right)}, \;\; \hat{y}_0 + z_{\alpha/2}\sqrt{\sigma^2\left(1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{s_{xx}}\right)} \right].

When \sigma^2 is unknown, then

\frac{\hat{y}_0 - y_0}{\sqrt{\widehat{PV}(\hat{y}_0)}}

follows a t-distribution with (n-2) degrees of freedom. The 100(1-\alpha)% prediction interval in this case is obtained from

P\left[ -t_{\alpha/2, n-2} \le \frac{\hat{y}_0 - y_0}{\sqrt{\widehat{PV}(\hat{y}_0)}} \le t_{\alpha/2, n-2} \right] = 1-\alpha,

which gives the prediction interval

\left[ \hat{y}_0 - t_{\alpha/2, n-2}\sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{s_{xx}}\right)}, \;\; \hat{y}_0 + t_{\alpha/2, n-2}\sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{s_{xx}}\right)} \right].

The prediction interval is of minimum width at x_0 = \bar{x} and widens as |x_0 - \bar{x}| increases.

The prediction interval for \hat{y}_0 is wider than the prediction interval for \hat{\mu}_{y|x_0} because the prediction interval for \hat{y}_0 depends on both the error from the fitted model and the error associated with the future observation.
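For comparison with the interval for the mean response, here is a sketch of the wider interval for a new observation y_0 (same illustrative assumptions about the data and the level):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.8])
n, x0, alpha = len(y), 2.5, 0.05

sxx = np.sum((x - x.mean())**2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x))**2) / (n - 2)

y0_hat = b0 + b1 * x0
# the extra "1 +" term reflects the error epsilon_0 carried by the new observation
se_new = np.sqrt(mse * (1.0 + 1.0 / n + (x0 - x.mean())**2 / sxx))
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
print(y0_hat - t_crit * se_new, y0_hat + t_crit * se_new)
```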

Reverse regression method


The reverse (or inverse) regression approach minimizes the sum of squares of horizontal distances between
the observed data points and the line in the following scatter diagram to obtain the estimates of regression
parameters.
[Figure: Reverse regression method. The horizontal distances between the observed points (x_i, y_i) and the line Y = \beta_0 + \beta_1 X are minimized.]
The reverse regression has been advocated in the analysis of gender (or race) discrimination in salaries. For
example, if y denotes salary and x denotes qualifications, and we are interested in determining if there is
gender discrimination in salaries, we can ask:
“Whether men and women with the same qualifications (value of x) are getting the same salaries
(value of y). This question is answered by the direct regression.”

Alternatively, we can ask:


“Whether men and women with the same salaries (value of y) have the same qualifications (value of
x). This question is answered by the reverse regression, i.e., regression of x on y.”

The regression equation in the case of reverse regression can be written as

x_i = \beta_0^* + \beta_1^* y_i + \delta_i \quad (i = 1, 2, ..., n)

where the \delta_i's are the associated random error components and satisfy the assumptions as in the case of the usual simple linear regression model.

The reverse regression estimates \hat{\beta}_{0R} of \beta_0^* and \hat{\beta}_{1R} of \beta_1^* for this model are obtained by interchanging x and y in the direct regression estimators of \beta_0 and \beta_1. The estimates are obtained as

\hat{\beta}_{0R} = \bar{x} - \hat{\beta}_{1R}\,\bar{y}

and

\hat{\beta}_{1R} = \frac{s_{xy}}{s_{yy}}

for \beta_0^* and \beta_1^* respectively. The residual sum of squares in this case is

SS_{res}^* = s_{xx} - \frac{s_{xy}^2}{s_{yy}}.

Note that

\hat{\beta}_{1R}\, b_1 = \frac{s_{xy}^2}{s_{xx} s_{yy}} = r_{xy}^2

where b_1 is the direct regression estimator of the slope parameter and r_{xy} is the correlation coefficient between x and y. Hence if r_{xy}^2 is close to 1, the two regression lines will be close to each other.

An important application of the reverse regression method is in solving the calibration problem.
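A small sketch contrasting the direct and reverse fits on illustrative data (it fits x on y and verifies that the product of the two slopes equals r_{xy}^2):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.8])

sxx = np.sum((x - x.mean())**2)
syy = np.sum((y - y.mean())**2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))

b1_direct = sxy / sxx            # direct regression: y on x
b1_reverse = sxy / syy           # reverse regression: x on y
r2 = sxy**2 / (sxx * syy)

print(b1_direct * b1_reverse, r2)   # the two quantities coincide
```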

Orthogonal regression method (or major axis regression method)
The direct and reverse regression methods of estimation assume that the errors in the observations are either
in x -direction or y -direction. In other words, the errors can be either in the dependent variable or
independent variable. There can be situations when uncertainties are involved in dependent and independent
variables both. In such situations, the orthogonal regression is more appropriate. In order to take care of
errors in both the directions, the least-squares principle in orthogonal regression minimizes the squared
perpendicular distance between the observed data points and the line in the following scatter diagram to
obtain the estimates of regression coefficients. This is also known as the major axis regression method.
The estimates obtained are called orthogonal regression estimates or major axis regression estimates of
regression coefficients.

[Figure: Orthogonal (major axis) regression method. The perpendicular distances between the observed points (x_i, y_i) and the line Y = \beta_0 + \beta_1 X are minimized.]
If we assume that the regression line to be fitted is Y_i = \beta_0 + \beta_1 X_i, then it is expected that all the observations (x_i, y_i), i = 1, 2, ..., n lie on this line. But these points deviate from the line, and in such a case the squared perpendicular distance of the observed data point (x_i, y_i) (i = 1, 2, ..., n) from the line is given by

d_i^2 = (X_i - x_i)^2 + (Y_i - y_i)^2

where (X_i, Y_i) denotes the i-th pair of observations without any error which lies on the line.

The objective is to minimize the sum of squared perpendicular distances \sum_{i=1}^{n} d_i^2 to obtain the estimates of \beta_0 and \beta_1. The observations (x_i, y_i) (i = 1, 2, ..., n) are expected to lie on the line Y_i = \beta_0 + \beta_1 X_i, so let

E_i = Y_i - \beta_0 - \beta_1 X_i = 0.

The regression coefficients are obtained by minimizing \sum_{i=1}^{n} d_i^2 under the constraints E_i using the Lagrangian multiplier method. The Lagrangian function is

L_0 = \sum_{i=1}^{n} d_i^2 - 2\sum_{i=1}^{n} \lambda_i E_i

where \lambda_1, ..., \lambda_n are the Lagrangian multipliers. The set of equations is obtained by setting

\frac{\partial L_0}{\partial X_i} = 0, \; \frac{\partial L_0}{\partial Y_i} = 0, \; \frac{\partial L_0}{\partial \beta_0} = 0 \; and \; \frac{\partial L_0}{\partial \beta_1} = 0 \quad (i = 1, 2, ..., n).

Thus we find

\frac{\partial L_0}{\partial X_i} = (X_i - x_i) + \lambda_i \beta_1 = 0
\frac{\partial L_0}{\partial Y_i} = (Y_i - y_i) - \lambda_i = 0
\frac{\partial L_0}{\partial \beta_0} = \sum_{i=1}^{n} \lambda_i = 0
\frac{\partial L_0}{\partial \beta_1} = \sum_{i=1}^{n} \lambda_i X_i = 0.

Since

X_i = x_i - \lambda_i \beta_1, \quad Y_i = y_i + \lambda_i ,

substituting these values in E_i, we obtain

E_i = (y_i + \lambda_i) - \beta_0 - \beta_1(x_i - \lambda_i \beta_1) = 0
\;\Rightarrow\; \lambda_i = \frac{\beta_0 + \beta_1 x_i - y_i}{1 + \beta_1^2}.

Also, using this \lambda_i in the equation \sum_{i=1}^{n} \lambda_i = 0, we get

\sum_{i=1}^{n} \frac{\beta_0 + \beta_1 x_i - y_i}{1 + \beta_1^2} = 0,

and using (X_i - x_i) + \lambda_i \beta_1 = 0 and \sum_{i=1}^{n} \lambda_i X_i = 0, we get

\sum_{i=1}^{n} \lambda_i (x_i - \lambda_i \beta_1) = 0.

Substituting \lambda_i in this equation, we get

\sum_{i=1}^{n} \frac{\beta_0 x_i + \beta_1 x_i^2 - y_i x_i}{1 + \beta_1^2} - \beta_1 \sum_{i=1}^{n} \frac{(\beta_0 + \beta_1 x_i - y_i)^2}{(1 + \beta_1^2)^2} = 0.   (1)

Solving the equation \sum_{i=1}^{n} \frac{\beta_0 + \beta_1 x_i - y_i}{1 + \beta_1^2} = 0 provides the orthogonal regression estimate of \beta_0 as

\hat{\beta}_{0OR} = \bar{y} - \hat{\beta}_{1OR}\,\bar{x}

where \hat{\beta}_{1OR} is the orthogonal regression estimate of \beta_1.

Now, substituting \hat{\beta}_{0OR} in equation (1), we get

(1 + \beta_1^2)\sum_{i=1}^{n}\left( \bar{y} x_i - \beta_1 \bar{x} x_i + \beta_1 x_i^2 - x_i y_i \right) - \beta_1 \sum_{i=1}^{n}\left( \bar{y} - \beta_1 \bar{x} + \beta_1 x_i - y_i \right)^2 = 0

or

(1 + \beta_1^2)\sum_{i=1}^{n} x_i \left[ y_i - \bar{y} - \beta_1(x_i - \bar{x}) \right] + \beta_1 \sum_{i=1}^{n}\left[ -(y_i - \bar{y}) + \beta_1(x_i - \bar{x}) \right]^2 = 0

or

(1 + \beta_1^2)\sum_{i=1}^{n} (u_i + \bar{x})(v_i - \beta_1 u_i) + \beta_1 \sum_{i=1}^{n} (-v_i + \beta_1 u_i)^2 = 0

where

u_i = x_i - \bar{x}, \quad v_i = y_i - \bar{y}.

Since \sum_{i=1}^{n} u_i = \sum_{i=1}^{n} v_i = 0, we obtain

\sum_{i=1}^{n}\left[ \beta_1^2 u_i v_i + \beta_1(u_i^2 - v_i^2) - u_i v_i \right] = 0

or

\beta_1^2 s_{xy} + \beta_1(s_{xx} - s_{yy}) - s_{xy} = 0.

Solving this quadratic equation provides the orthogonal regression estimate of \beta_1 as

\hat{\beta}_{1OR} = \frac{(s_{yy} - s_{xx}) + sign(s_{xy})\sqrt{(s_{xx} - s_{yy})^2 + 4 s_{xy}^2}}{2 s_{xy}}

where sign(s_{xy}) denotes the sign of s_{xy}, which can be positive or negative, i.e.,

sign(s_{xy}) = +1 if s_{xy} > 0, and -1 if s_{xy} < 0.

Notice that the quadratic equation gives two solutions for \hat{\beta}_{1OR}. We choose the solution which minimizes \sum_{i=1}^{n} d_i^2. The other solution maximizes \sum_{i=1}^{n} d_i^2 and is in the direction perpendicular to the optimal solution. The optimal solution can be chosen with the sign of s_{xy}.
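A sketch of this closed-form orthogonal (major axis) slope on illustrative data (the helper function name and the data are assumptions):

```python
import numpy as np

def major_axis_slope(x, y):
    """Orthogonal (major axis) regression slope from the quadratic in beta1."""
    sxx = np.sum((x - x.mean())**2)
    syy = np.sum((y - y.mean())**2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    sgn = 1.0 if sxy > 0 else -1.0
    return ((syy - sxx) + sgn * np.sqrt((sxx - syy)**2 + 4 * sxy**2)) / (2 * sxy)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.8])

b1_or = major_axis_slope(x, y)
b0_or = y.mean() - b1_or * x.mean()
print(b0_or, b1_or)
```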

Reduced major axis regression method:
The direct, reverse and orthogonal methods of estimation minimize the errors in a particular direction, which is usually a distance between the observed data points and the line in the scatter diagram. Alternatively, instead of distances, one can minimize the total area of the rectangles formed between each observed data point and the nearest point on the line in the following scatter diagram. Such an approach is more appropriate when uncertainties are present in both the study and the explanatory variables. This approach is termed reduced major axis regression.
[Figure: Reduced major axis method. The areas of the rectangles between the observed points (x_i, y_i) and the line Y = \beta_0 + \beta_1 X are minimized.]

Suppose the regression line is Y_i = \beta_0 + \beta_1 X_i, on which all the observed points are expected to lie, and suppose the points (x_i, y_i), i = 1, 2, ..., n are observed, which lie away from the line. The area of the rectangle extended between the i-th observed data point and the line is

A_i = (X_i \sim x_i)(Y_i \sim y_i) \quad (i = 1, 2, ..., n)

where (X_i, Y_i) denotes the i-th pair of observations without any error which lies on the line, and \sim denotes the positive (absolute) difference.

The total area extended by the n data points is

\sum_{i=1}^{n} A_i = \sum_{i=1}^{n} (X_i \sim x_i)(Y_i \sim y_i).

All the observed data points (x_i, y_i) (i = 1, 2, ..., n) are expected to lie on the line

Y_i = \beta_0 + \beta_1 X_i ,

so let

E_i^* = Y_i - \beta_0 - \beta_1 X_i = 0.

Now the objective is to minimize the sum of areas under the constraints E_i^* to obtain the reduced major axis estimates of the regression coefficients. Using the Lagrangian multiplier method, the Lagrangian function is

L_R = \sum_{i=1}^{n} A_i - \sum_{i=1}^{n} \mu_i E_i^* = \sum_{i=1}^{n} (X_i - x_i)(Y_i - y_i) - \sum_{i=1}^{n} \mu_i E_i^*

where \mu_1, ..., \mu_n are the Lagrangian multipliers. The set of equations is obtained by setting

\frac{\partial L_R}{\partial X_i} = 0, \; \frac{\partial L_R}{\partial Y_i} = 0, \; \frac{\partial L_R}{\partial \beta_0} = 0, \; \frac{\partial L_R}{\partial \beta_1} = 0 \quad (i = 1, 2, ..., n).

Thus

\frac{\partial L_R}{\partial X_i} = (Y_i - y_i) + \mu_i \beta_1 = 0
\frac{\partial L_R}{\partial Y_i} = (X_i - x_i) - \mu_i = 0
\frac{\partial L_R}{\partial \beta_0} = \sum_{i=1}^{n} \mu_i = 0
\frac{\partial L_R}{\partial \beta_1} = \sum_{i=1}^{n} \mu_i X_i = 0.

Now

X_i = x_i + \mu_i
Y_i = y_i - \beta_1 \mu_i
\beta_0 + \beta_1 X_i = y_i - \beta_1 \mu_i
\beta_0 + \beta_1(x_i + \mu_i) = y_i - \beta_1 \mu_i
\;\Rightarrow\; \mu_i = \frac{y_i - \beta_0 - \beta_1 x_i}{2\beta_1}.

Substituting \mu_i in \sum_{i=1}^{n} \mu_i = 0, the reduced major axis regression estimate of \beta_0 is obtained as

\hat{\beta}_{0RM} = \bar{y} - \hat{\beta}_{1RM}\,\bar{x}

where \hat{\beta}_{1RM} is the reduced major axis regression estimate of \beta_1. Using X_i = x_i + \mu_i, \mu_i and \hat{\beta}_{0RM} in \sum_{i=1}^{n} \mu_i X_i = 0, we get

\sum_{i=1}^{n}\left( \frac{y_i - \bar{y} + \beta_1 \bar{x} - \beta_1 x_i}{2\beta_1} \right)\left( x_i + \frac{y_i - \bar{y} + \beta_1 \bar{x} - \beta_1 x_i}{2\beta_1} \right) = 0.

Let u_i = x_i - \bar{x} and v_i = y_i - \bar{y}; then this equation can be re-expressed as

\sum_{i=1}^{n} (v_i - \beta_1 u_i)(v_i + \beta_1 u_i + 2\beta_1 \bar{x}) = 0.

Using \sum_{i=1}^{n} u_i = \sum_{i=1}^{n} v_i = 0, we get

\sum_{i=1}^{n} v_i^2 - \beta_1^2 \sum_{i=1}^{n} u_i^2 = 0.

Solving this equation, the reduced major axis regression estimate of \beta_1 is obtained as

\hat{\beta}_{1RM} = sign(s_{xy})\sqrt{\frac{s_{yy}}{s_{xx}}}

where sign(s_{xy}) = +1 if s_{xy} > 0 and -1 if s_{xy} < 0. We choose the estimate whose sign is the same as the sign of s_{xy}.
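A brief sketch of the reduced major axis slope on illustrative data (the sign convention follows s_{xy}):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.8])

sxx = np.sum((x - x.mean())**2)
syy = np.sum((y - y.mean())**2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))

b1_rm = np.sign(sxy) * np.sqrt(syy / sxx)   # reduced major axis slope
b0_rm = y.mean() - b1_rm * x.mean()
print(b0_rm, b1_rm)
```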

Least absolute deviation regression method


The least-squares principle advocates the minimization of the sum of squared errors. Squaring the errors is useful because random errors can be positive as well as negative, so their simple sum can be close to zero even when the individual errors are large, which would misleadingly suggest that there is no error in the model. Instead of squaring, the sum of the absolute values of the random errors can be considered, which also avoids the cancellation of positive and negative random errors.

In the method of least squares, the estimates of the parameters \beta_0 and \beta_1 in the model y_i = \beta_0 + \beta_1 x_i + \varepsilon_i (i = 1, 2, ..., n) are chosen such that the sum of squared deviations \sum_{i=1}^{n} \varepsilon_i^2 is minimum. In the method of least absolute deviation (LAD) regression, the parameters \beta_0 and \beta_1 are estimated such that the sum of absolute deviations \sum_{i=1}^{n} |\varepsilon_i| is minimum. It minimizes the sum of absolute vertical distances of the errors, as in the following scatter diagram:

[Figure: Least absolute deviation regression method. The absolute vertical distances between the observed points (x_i, y_i) and the line Y = \beta_0 + \beta_1 X are minimized.]
The LAD estimates \hat{\beta}_{0L} and \hat{\beta}_{1L} are the estimates of \beta_0 and \beta_1, respectively, which minimize

LAD(\beta_0, \beta_1) = \sum_{i=1}^{n} | y_i - \beta_0 - \beta_1 x_i |

for the given observations (x_i, y_i) (i = 1, 2, ..., n).

Conceptually, the LAD procedure is simpler than the OLS procedure because the absolute residual |e| is a more direct measure of the size of a residual than the squared residual e^2. The LAD regression estimates of \beta_0 and \beta_1 are not available in closed form; instead, they are obtained numerically using algorithms. Moreover, this creates the problems of non-uniqueness and degeneracy in the estimates. Non-uniqueness means that more than one best line passes through a data point; degeneracy means that the best line through a data point also passes through more than one other data point. The non-uniqueness and degeneracy concepts are used in algorithms to judge the quality of the estimates. The algorithm for finding the estimates generally proceeds in steps. At each step, the best line is found that passes through a given data point. The best line always passes through another data point, and this data point is used in the next step. When there is non-uniqueness, there is more than one best line; when there is degeneracy, the best line passes through more than one other data point. When either problem is present, there is more than one choice for the data point to be used in the next step, and the algorithm may cycle or make a wrong choice of the LAD regression line. Exact tests of hypothesis and confidence intervals for the LAD regression estimates cannot be derived analytically; instead, they are constructed by analogy with the tests of hypothesis and confidence intervals related to the ordinary least squares estimates.
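Since the LAD estimates have no closed form, they are computed numerically. A minimal sketch follows; it is an assumption that a general-purpose optimizer (scipy.optimize.minimize with the Nelder-Mead method) is acceptable here, whereas dedicated linear-programming or quantile-regression formulations are more robust in practice. The data, with a deliberate outlier, are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 9.0])    # last point is an outlier

def lad_objective(params):
    b0, b1 = params
    return np.sum(np.abs(y - b0 - b1 * x))       # sum of absolute deviations

# start from the OLS solution and minimize the absolute-deviation criterion
b1_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0_ols = y.mean() - b1_ols * x.mean()
res = minimize(lad_objective, x0=[b0_ols, b1_ols], method="Nelder-Mead")
print(res.x)   # LAD estimates; less affected by the outlier than OLS
```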

Estimation of parameters when X is stochastic


In the usual linear regression model, the study variable is supposed to be random and the explanatory variables are assumed to be fixed. In practice, there may be situations in which the explanatory variable is also random.

Suppose both the dependent and independent variables are stochastic in the simple linear regression model

y = \beta_0 + \beta_1 X + \varepsilon

where \varepsilon is the associated random error component. The observations (x_i, y_i), i = 1, 2, ..., n are assumed to be jointly distributed. Then the statistical inferences can be drawn in such cases conditionally on X.

Assume the joint distribution of X and y to be bivariate normal N(\mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho), where \mu_x and \mu_y are the means of X and y, \sigma_x^2 and \sigma_y^2 are the variances of X and y, and \rho is the correlation coefficient between X and y. Then the conditional distribution of y given X = x is univariate normal with conditional mean

E(y | X = x) = \mu_{y|x} = \beta_0 + \beta_1 x

and conditional variance

Var(y | X = x) = \sigma_{y|x}^2 = \sigma_y^2(1 - \rho^2)

where

\beta_0 = \mu_y - \mu_x \beta_1 \quad and \quad \beta_1 = \rho\,\frac{\sigma_y}{\sigma_x}.
When both X and y are stochastic, the problem of estimation of parameters can be reformulated as follows. Consider a conditional random variable y | X = x having a normal distribution with mean equal to the conditional mean \mu_{y|x} and variance equal to the conditional variance Var(y | X = x) = \sigma_{y|x}^2. Obtain n independently distributed observations y_i | x_i, i = 1, 2, ..., n from N(\mu_{y|x}, \sigma_{y|x}^2) with nonstochastic X. Now the method of maximum likelihood can be used to estimate the parameters, which yields the estimates of \beta_0 and \beta_1 as in the earlier case of nonstochastic X, namely

b_0 = \bar{y} - b_1\bar{x}

and

b_1 = \frac{s_{xy}}{s_{xx}},

respectively.

Moreover, the correlation coefficient

\rho = \frac{E(y - \mu_y)(X - \mu_x)}{\sigma_y \sigma_x}

can be estimated by the sample correlation coefficient

\hat{\rho} = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{s_{xy}}{\sqrt{s_{xx}s_{yy}}} = b_1\sqrt{\frac{s_{xx}}{s_{yy}}}.

Thus

\hat{\rho}^2 = b_1^2\frac{s_{xx}}{s_{yy}} = b_1\frac{s_{xy}}{s_{yy}} = \frac{s_{yy} - \sum_{i=1}^{n}\hat{\varepsilon}_i^2}{s_{yy}} = R^2,

which is the same as the coefficient of determination. Thus R^2 has the same expression as in the case when X is fixed, and so R^2 again measures the goodness of fit of the model even when X is stochastic.

Chapter 3
Multiple Linear Regression Model
We consider the problem of regression when the study variable depends on more than one explanatory or
independent variables, called a multiple linear regression model. This model generalizes the simple linear
regression in two ways. It allows the mean function E ( y ) to depend on more than one explanatory variables
and to have shapes other than straight lines, although it does not allow for arbitrary shapes.

The linear model:


Let y denote the dependent (or study) variable that is linearly related to k independent (or explanatory) variables X_1, X_2, ..., X_k through the parameters \beta_1, \beta_2, ..., \beta_k, and we write

y = X_1\beta_1 + X_2\beta_2 + ... + X_k\beta_k + \varepsilon.

This is called the multiple linear regression model. The parameters \beta_1, \beta_2, ..., \beta_k are the regression coefficients associated with X_1, X_2, ..., X_k respectively, and \varepsilon is the random error component reflecting the difference between the observed and fitted linear relationship. There can be various reasons for such a difference, e.g., the joint effect of variables not included in the model, random factors which cannot be accounted for in the model, etc.

Note that the j-th regression coefficient \beta_j represents the expected change in y per unit change in the j-th independent variable X_j. Assuming E(\varepsilon) = 0,

\beta_j = \frac{\partial E(y)}{\partial X_j}.

Linear model:
A model is said to be linear when it is linear in the parameters. In such a case \frac{\partial y}{\partial \beta_j} (or equivalently \frac{\partial E(y)}{\partial \beta_j}) should not depend on any \beta's. For example,

i) y = \beta_0 + \beta_1 X is a linear model, as it is linear in the parameters.

ii) y = \beta_0 X^{\beta_1} can be written as

log y = log \beta_0 + \beta_1 log X, i.e., y^* = \beta_0^* + \beta_1 x^*,

which is linear in the parameters \beta_0^* and \beta_1 but nonlinear in the variables y^* = log y, x^* = log x. So it is a linear model.

iii) y = \beta_0 + \beta_1 X + \beta_2 X^2 is linear in the parameters \beta_0, \beta_1 and \beta_2 but nonlinear in the variable X. So it is a linear model.

iv) y = \beta_0 + \frac{\beta_1}{X + \beta_2} is nonlinear in the parameters and variables both. So it is a nonlinear model.

v) y = \beta_0 + \beta_1 X^{\beta_2} is nonlinear in the parameters and variables both. So it is a nonlinear model.

vi) y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 is a cubic polynomial model which can be written as

y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3,

which is linear in the parameters \beta_0, \beta_1, \beta_2, \beta_3 and linear in the variables X_1 = X, X_2 = X^2, X_3 = X^3. So it is a linear model.

Example:
The income and education of a person are related. It is expected that, on average, a higher level of education
provides higher income. So a simple linear regression model can be expressed as
income = \beta_0 + \beta_1\,education + \varepsilon.

Note that \beta_1 reflects the change in income per unit change in education, and \beta_0 reflects the income when education is zero, as it is expected that even an illiterate person can have some income.

Further, this model neglects the fact that most people have higher income when they are older than when they are young, regardless of education. So \beta_1 will overstate the marginal impact of education. If age and education are positively correlated, then the regression model will associate all the observed increase in income with an increase in education. So a better model is

income = \beta_0 + \beta_1\,education + \beta_2\,age + \varepsilon.

Often it is observed that income tends to rise less rapidly in the later earning years than in the early years. To accommodate such a possibility, we might extend the model to

income = \beta_0 + \beta_1\,education + \beta_2\,age + \beta_3\,age^2 + \varepsilon.
This is how we proceed for regression modeling in real-life situation. One needs to consider the experimental
condition and the phenomenon before making the decision on how many, why and how to choose the
dependent and independent variables.
Model set up:
Let an experiment be conducted n times, and the data be obtained as follows:

Observation number   Response y   Explanatory variables X_1, X_2, ..., X_k
1                    y_1          x_{11}, x_{12}, ..., x_{1k}
2                    y_2          x_{21}, x_{22}, ..., x_{2k}
...                  ...          ...
n                    y_n          x_{n1}, x_{n2}, ..., x_{nk}

Assuming that the model is

y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k + \varepsilon,

the n-tuples of observations are also assumed to follow the same model. Thus they satisfy

y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + ... + \beta_k x_{1k} + \varepsilon_1
y_2 = \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + ... + \beta_k x_{2k} + \varepsilon_2
\vdots
y_n = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + ... + \beta_k x_{nk} + \varepsilon_n.

These n equations can be written in matrix form as

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}

or y = X\beta + \varepsilon.

In general, the model with k explanatory variables can be expressed as

y = X\beta + \varepsilon

where y = (y_1, y_2, ..., y_n)' is an n \times 1 vector of n observations on the study variable,

X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}

is an n \times k matrix of n observations on each of the k explanatory variables, \beta = (\beta_1, \beta_2, ..., \beta_k)' is a k \times 1 vector of regression coefficients, and \varepsilon = (\varepsilon_1, \varepsilon_2, ..., \varepsilon_n)' is an n \times 1 vector of random error components or disturbance terms.

If an intercept term is present, take the first column of X to be (1, 1, ..., 1)'.


Assumptions in multiple linear regression model
Some assumptions are needed in the model y = X\beta + \varepsilon for drawing statistical inferences. The following assumptions are made:

(i) E(\varepsilon) = 0
(ii) E(\varepsilon\varepsilon') = \sigma^2 I_n
(iii) Rank(X) = k
(iv) X is a non-stochastic matrix
(v) \varepsilon \sim N(0, \sigma^2 I_n).

These assumptions are used to study the statistical properties of the estimator of the regression coefficients. The following assumption is required to study, in particular, the large-sample properties of the estimators:

(vi) \lim_{n \to \infty} \frac{X'X}{n} = \Delta exists and is a non-stochastic and nonsingular matrix (with finite elements).

The explanatory variables can also be stochastic in some cases. We assume that X is non-stochastic unless
stated separately.

We consider the problems of estimation and testing of hypothesis on regression coefficient vector under the
stated assumption.

Estimation of parameters:
A general procedure for the estimation of the regression coefficient vector is to minimize

\sum_{i=1}^{n} M(\varepsilon_i) = \sum_{i=1}^{n} M(y_i - x_{i1}\beta_1 - x_{i2}\beta_2 - ... - x_{ik}\beta_k)

for a suitably chosen function M.

Some examples of the choice of M are

M(x) = |x|
M(x) = x^2
M(x) = |x|^p , in general.

We consider the principle of least squares, which corresponds to M(x) = x^2, and the method of maximum likelihood estimation for the estimation of the parameters.
Principle of ordinary least squares (OLS)
Let B be the set of all possible vectors \beta. If there is no further information, then B is the k-dimensional real Euclidean space. The objective is to find a vector b' = (b_1, b_2, ..., b_k) from B that minimizes the sum of squared deviations of the \varepsilon_i's, i.e.,

S(\beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon'\varepsilon = (y - X\beta)'(y - X\beta)

for given y and X. A minimum will always exist, as S(\beta) is a real-valued, convex and differentiable function. Write

S(\beta) = y'y + \beta'X'X\beta - 2\beta'X'y.

Differentiating S(\beta) with respect to \beta,

\frac{\partial S(\beta)}{\partial \beta} = 2X'X\beta - 2X'y

\frac{\partial^2 S(\beta)}{\partial \beta \, \partial \beta'} = 2X'X (at least non-negative definite).

The normal equation is

\frac{\partial S(\beta)}{\partial \beta} = 0 \;\Rightarrow\; X'Xb = X'y,

where the following result is used:
Result: If f(z) = Z'AZ is a quadratic form, Z is an m \times 1 vector and A is any m \times m symmetric matrix, then \frac{\partial f(z)}{\partial z} = 2AZ.

Since it is assumed that rank(X) = k (full rank), X'X is positive definite and the unique solution of the normal equation is

b = (X'X)^{-1}X'y,

which is termed the ordinary least squares estimator (OLSE) of \beta.

Since \frac{\partial^2 S(\beta)}{\partial \beta \, \partial \beta'} is at least non-negative definite, b minimizes S(\beta).

In case X is not of full rank, then

b = (X'X)^{-}X'y + \left[ I - (X'X)^{-}X'X \right]\omega

where (X'X)^{-} is the generalized inverse of X'X and \omega is an arbitrary vector. The generalized inverse (X'X)^{-} of X'X satisfies

X'X(X'X)^{-}X'X = X'X
X(X'X)^{-}X'X = X
X'X(X'X)^{-}X' = X'.
Theorem:
(i) Let ŷ  Xb be the empirical predictor of y . Then ŷ has the same value for all solutions b of
X ' Xb  X ' y.
(ii) S (  ) attains the minimum for any solution of X ' Xb  X ' y.
Proof:
(i) Let b be any member in
b  ( X ' X )  X ' y   I  ( X ' X )  X ' X   .

Since X ( X ' X )  X ' X  X , so then

Xb  X ( X ' X )  X ' y  X  I  ( X ' X )  X ' X  

= X ( X ' X ) X ' y
which is independent of  . This implies that ŷ has the same value for all solution b of X ' Xb  X ' y.
(ii) Note that for any  ,

S (  )   y  Xb  X (b   )   y  Xb  X (b   ) 
 ( y  Xb)( y  Xb)  (b   ) X ' X (b   )  2(b   ) X ( y  Xb)
 ( y  Xb)( y  Xb)  (b   ) X ' X (b   ) (Using X ' Xb  X ' y )
 ( y  Xb)( y  Xb)  S (b)
 y ' y  2 y ' Xb  b ' X ' Xb
 y ' y  b ' X ' Xb
 y ' y  yˆ ' yˆ .

Fitted values:
If \hat{\beta} is any estimator of \beta for the model y = X\beta + \varepsilon, then the fitted values are defined as \hat{y} = X\hat{\beta}.

In the case of \hat{\beta} = b,

\hat{y} = Xb = X(X'X)^{-1}X'y = Hy

where H = X(X'X)^{-1}X' is termed the hat matrix, which is

(i) symmetric,
(ii) idempotent (i.e., HH = H), and
(iii) tr H = tr[X(X'X)^{-1}X'] = tr[X'X(X'X)^{-1}] = tr I_k = k.

Residuals
The difference between the observed and fitted values of the study variable is called the residual. It is denoted as

e = y - \hat{y} = y - Xb = y - Hy = (I - H)y = \bar{H}y

where \bar{H} = I - H.

Note that
(i) \bar{H} is a symmetric matrix,
(ii) \bar{H} is an idempotent matrix, i.e., \bar{H}\bar{H} = (I - H)(I - H) = (I - H) = \bar{H}, and
(iii) tr \bar{H} = tr I_n - tr H = n - k.
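A compact numerical sketch of these quantities (the simulated data are an illustrative assumption; a column of ones is included in X for the intercept):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 3                                  # n observations, k columns (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)         # OLSE: solves the normal equations X'Xb = X'y
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
y_hat = H @ y                                 # fitted values
e = y - y_hat                                 # residuals, e = (I - H) y

print(b)
print(np.trace(H))                            # equals k
print(np.allclose(X.T @ e, np.zeros(k)))      # residuals are orthogonal to the columns of X
```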

Properties of OLSE

(i) Estimation error:
The estimation error of b is

b - \beta = (X'X)^{-1}X'y - \beta = (X'X)^{-1}X'(X\beta + \varepsilon) - \beta = (X'X)^{-1}X'\varepsilon .

(ii) Bias:
Since X is assumed to be nonstochastic and E(\varepsilon) = 0,

E(b - \beta) = (X'X)^{-1}X'E(\varepsilon) = 0.

Thus the OLSE is an unbiased estimator of \beta.

(iii) Covariance matrix:
The covariance matrix of b is

V(b) = E(b - \beta)(b - \beta)'
     = E\left[ (X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1} \right]
     = (X'X)^{-1}X'E(\varepsilon\varepsilon')X(X'X)^{-1}
     = \sigma^2(X'X)^{-1}X'IX(X'X)^{-1}
     = \sigma^2(X'X)^{-1}.

(iv) Variance:
The variance of b can be obtained as the sum of the variances of b_1, b_2, ..., b_k, which is the trace of the covariance matrix of b. Thus

Var(b) = tr[V(b)] = \sum_{i=1}^{k} E(b_i - \beta_i)^2 = \sum_{i=1}^{k} Var(b_i).

Estimation of \sigma^2
The least-squares criterion cannot be used to estimate \sigma^2 because \sigma^2 does not appear in S(\beta). Since E(\varepsilon_i^2) = \sigma^2, we use the residuals e_i to estimate \sigma^2 as follows:

e = y - \hat{y} = y - X(X'X)^{-1}X'y = [I - X(X'X)^{-1}X']y = \bar{H}y.

Consider the residual sum of squares

SS_{res} = \sum_{i=1}^{n} e_i^2 = e'e = (y - Xb)'(y - Xb) = y'(I - H)(I - H)y = y'(I - H)y = y'\bar{H}y.

Also

SS_{res} = (y - Xb)'(y - Xb) = y'y - 2b'X'y + b'X'Xb = y'y - b'X'y   (using X'Xb = X'y),

and

SS_{res} = y'\bar{H}y = (X\beta + \varepsilon)'\bar{H}(X\beta + \varepsilon) = \varepsilon'\bar{H}\varepsilon   (using \bar{H}X = 0).

Since \varepsilon \sim N(0, \sigma^2 I), we have y \sim N(X\beta, \sigma^2 I). Hence y'\bar{H}y / \sigma^2 \sim \chi^2(n - k). Thus

E[y'\bar{H}y] = (n - k)\sigma^2, \qquad E\left[\frac{y'\bar{H}y}{n - k}\right] = \sigma^2, \qquad E[MS_{res}] = \sigma^2,

where MS_{res} = \frac{SS_{res}}{n - k} is the mean sum of squares due to residuals. Thus an unbiased estimator of \sigma^2 is

\hat{\sigma}^2 = MS_{res} = s^2 (say),

which is a model-dependent estimator.

Variance of \hat{y}
The variance of \hat{y} is

V(\hat{y}) = V(Xb) = X V(b) X' = \sigma^2 X(X'X)^{-1}X' = \sigma^2 H.

Gauss-Markov Theorem:
The ordinary least squares estimator (OLSE) is the best linear unbiased estimator (BLUE) of \beta.

Proof: The OLSE of \beta is

b = (X'X)^{-1}X'y,

which is a linear function of y. Consider an arbitrary linear estimator

b^* = a'y

of the linear parametric function \ell'\beta, where the elements of a are arbitrary constants. Then for b^*,

E(b^*) = E(a'y) = a'X\beta,

and so b^* is an unbiased estimator of \ell'\beta when

E(b^*) = a'X\beta = \ell'\beta \;\Rightarrow\; a'X = \ell'.

Since we wish to consider only estimators that are linear and unbiased, we restrict ourselves to those estimators for which a'X = \ell'.

Further,

Var(a'y) = a'Var(y)a = \sigma^2 a'a
Var(\ell'b) = \ell'Var(b)\ell = \sigma^2 a'X(X'X)^{-1}X'a.

Consider

Var(a'y) - Var(\ell'b) = \sigma^2\left[ a'a - a'X(X'X)^{-1}X'a \right] = \sigma^2 a'\left[ I - X(X'X)^{-1}X' \right]a = \sigma^2 a'(I - H)a.

Since (I - H) is a positive semi-definite matrix,

Var(a'y) - Var(\ell'b) \ge 0.

This reveals that if b^* is any linear unbiased estimator, then its variance must be no smaller than that of b. Consequently, b is the best linear unbiased estimator, where 'best' refers to the fact that b is efficient within the class of linear and unbiased estimators.

Maximum likelihood estimation:


In the model, y  X    , it is assumed that the errors are normally and independently distributed with

constant variance  2 or  ~ N (0,  2 I ).


The normal density function for the errors is
1  1 
f ( i )  exp   2  i2  i  1, 2,..., n. .
 2  2 
The likelihood function is the joint density of 1 ,  2 ,...,  n given as
n
L(  ,  2 )   f ( i )
i 1

1  1 n 2

(2 2 ) n /2
exp   2 2   i 
 i 1 
1  1 
 exp   2  '  
(2 )
2 n /2
 2 
 1
1 
 exp   2 ( y  X  ) '( y  X  )  .
(2 )  2
2 n /2

Since the log transformation is monotonic, so we maximize ln L(  ,  2 ) instead of L(  ,  2 ) .
n 1
ln L(  ,  2 )   ln(2 2 )  2 ( y  X  ) '( y  X  ) .
2 2
The maximum likelihood estimators (m.l.e.) of  and  2 are obtained by equating the first-order

derivatives of ln L(  ,  2 ) with respect to  and  2 to zero as follows:

 ln L(  ,  2 ) 1
 2 X '( y  X  )  0
 2 2
 ln L(  ,  2 ) n 1
 2  ( y  X  ) '( y  X  ).
 2
2 2( 2 ) 2
The likelihood equations are given by

X 'X  X 'y
1
 2  ( y  X  ) '( y  X  ).
n
Since rank( X )  k , so that the unique m.l.e. of  and  2 are obtained as

  ( X ' X ) 1 X ' y
1
 2  ( y  X  ) '( y  X  ).
n

Further to verify that these values maximize the likelihood function, we find
 2 ln L(  ,  2 ) 1
 2 X 'X
 2

 2 ln L(  ,  2 ) n 1
  6 ( y  X  ) '( y  X  )
 ( )
2 2 2
2 4

 2 ln L(  ,  2 ) 1
  4 X '( y  X  ).
 2

Thus the Hessian matrix of second-order partial derivatives of ln L(  ,  2 ) with respect to  and  2 is

  2 ln L(  ,  2 )  2 ln L(  ,  2 ) 
 
  2  2 
  2 ln L(  ,  2 )  ln L(  ,  ) 
2 2

 
  2   2 ( 2 ) 2 

which is negative definite at    and  2   2 . This ensures that the likelihood function is maximized at
these values.

Comparing with OLSEs, we find that


(i) OLSE and m.l.e. of  are same. So m.l.e. of  is also an unbiased estimator of  .
nk 2
(ii) OLSE of  2 is s 2 which is related to m.l.e. of  2 as  2  s . So m.l.e. of  2 is a
n
biased estimator of  2 .

Consistency of estimators
(i) Consistency of b :
 X 'X 
Under the assumption that lim     exists as a nonstochastic and nonsingular matrix (with finite
n 
 n 
elements), we have
1
1 X 'X 
lim V (b)   lim 
2

n  n  n
 n 
1
  2 lim  1
n  n

 0.
This implies that OLSE converges to  in quadratic mean. Thus OLSE is a consistent estimator of  . This
holds true for maximum likelihood estimators also.

The same conclusion can also be proved using the concept of convergence in probability.
An estimator ˆn converges to  in probability if

lim P  ˆn       0 for any   0


n   

and is denoted as plim(ˆn )   .

The consistency of OLSE can be obtained under the weaker assumption that
 X 'X 
plim    * .
 n 
exists and is a nonsingular and nonstochastic matrix such that
 X ' 
plim    0.
 n 
Since
b    ( X ' X ) 1 X ' 
1
 X ' X  X '
  .
 n  n
So
1
 X 'X   X ' 
plim(b   )  plim   plim  
 n   n 
 *1.0
 0.
Thus b is a consistent estimator of  . Same is true for m.l.e. also.
(ii) Consistency of s 2
Now we look at the consistency of s 2 as an estimate of  2 as
1
s2  e 'e
nk
1
  ' H
nk
1
1 k 
1    '    ' X ( X ' X ) X '  
1

n n
 k    '  ' X  X ' X  X ' 
1 1

 1       .
 n   n n  n  n 

 ' 1 n 2
Note that
n
consists of 
n i 1
 i and { i2 , i  1, 2,..., n} is a sequence of independently and identically

distributed random variables with mean  2 . Using the law of large numbers
  ' 
 
2
plim 
 n 
  ' X  X ' X  1 X '     'X    X ' X  
1
X ' 
plim    
  plim  plim     plim 
 n  n  n   n    n    n 
 0.*1.0
0
 plim( s )  (1  0)   0 
2 1 2

  2.

Thus s 2 is a consistent estimator of  2 . The same holds true for m.l.e. also.

Cramer-Rao lower bound
Let   (  ,  2 ) ' . Assume that both  and  2 are unknown. If E (ˆ)   , then the Cramer-Rao lower

bound for ˆ is grater than or equal to the matrix inverse of


  2 ln L( ) 
I ( )   E  
  ' 
   ln L(  ,  2 )    ln L(  ,  2 )  
 E   E 
   2    
2


   ln L(  ,  2 )    ln L(  ,  )  
2
 E   E 
      
2 2 2 2
 ( )  
  X 'X   X '( y  X  )  
 E   2  E
 4  
   
 
 (y  X )' X   n ( y  X  ) '( y  X  )  
 E   E 4   
  4  2 6 
X 'X 
 2 0 
 .
 0 n 
 2 4 
Then
 2 ( X ' X ) 1 0 
 I ( )   
1
2 4 
0
 n 
is the Cramer-Rao lower bound matrix of  and  2 .

The covariance matrix of OLSEs of  and  2 is

 2 ( X ' X ) 1 0 
 OLS   0 2 4 

 n  k 
which means that the Cramer-Rao have bound is attained for the covariance of b but not for s 2 .

Standardized regression coefficients:
Usually it is difficult to compare the regression coefficients because the magnitude of \hat{\beta}_j reflects the units of measurement of the j-th explanatory variable X_j. For example, in the following fitted regression model

\hat{y} = 5 + X_1 + 1000 X_2 ,

y is measured in litres, X_1 in litres and X_2 in millilitres. Although \hat{\beta}_2 > \hat{\beta}_1, the effect of both explanatory variables is identical: a one-litre change in either X_1 or X_2, when the other variable is held fixed, produces the same change in \hat{y}.

Sometimes it is helpful to work with scaled explanatory variables and a scaled study variable that produce dimensionless regression coefficients. These dimensionless regression coefficients are called standardized regression coefficients.

There are two popular approaches for scaling which give standardized regression coefficients. We discuss them as follows.

1. Unit normal scaling:
Employ unit normal scaling on each explanatory variable and on the study variable. So define

z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad i = 1, 2, ..., n, \; j = 1, 2, ..., k
y_i^* = \frac{y_i - \bar{y}}{s_y}

where s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2 and s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 are the sample variances of the j-th explanatory variable and of the study variable, respectively.

All the scaled explanatory variables and the scaled study variable have mean zero and sample variance unity. In terms of these new variables, the regression model becomes

y_i^* = \gamma_1 z_{i1} + \gamma_2 z_{i2} + ... + \gamma_k z_{ik} + \varepsilon_i, \quad i = 1, 2, ..., n.

Such centering removes the intercept term from the model. The least-squares estimate of \gamma = (\gamma_1, \gamma_2, ..., \gamma_k)' is

\hat{\gamma} = (Z'Z)^{-1}Z'y^* .

This scaling has a similarity to standardizing a normal random variable, i.e., subtracting the mean and dividing by the standard deviation. So it is called unit normal scaling.
2. Unit length scaling:
In unit length scaling, define

\omega_{ij} = \frac{x_{ij} - \bar{x}_j}{S_{jj}^{1/2}}, \quad i = 1, 2, ..., n; \; j = 1, 2, ..., k
y_i^0 = \frac{y_i - \bar{y}}{SS_T^{1/2}}

where S_{jj} = \sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2 is the corrected sum of squares for the j-th explanatory variable X_j and SS_T = \sum_{i=1}^{n}(y_i - \bar{y})^2 is the total sum of squares. In this scaling, each new explanatory variable W_j has mean \bar{\omega}_j = \frac{1}{n}\sum_{i=1}^{n}\omega_{ij} = 0 and length \sqrt{\sum_{i=1}^{n}(\omega_{ij} - \bar{\omega}_j)^2} = 1.

In terms of these variables, the regression model is

y_i^0 = \delta_1\omega_{i1} + \delta_2\omega_{i2} + ... + \delta_k\omega_{ik} + \varepsilon_i, \quad i = 1, 2, ..., n.

The least-squares estimate of the regression coefficient vector \delta = (\delta_1, \delta_2, ..., \delta_k)' is

\hat{\delta} = (W'W)^{-1}W'y^0 .

In such a case, the matrix W'W is in the form of a correlation matrix, i.e.,

W'W = \begin{pmatrix} 1 & r_{12} & r_{13} & \cdots & r_{1k} \\ r_{12} & 1 & r_{23} & \cdots & r_{2k} \\ r_{13} & r_{23} & 1 & \cdots & r_{3k} \\ \vdots & \vdots & \vdots & & \vdots \\ r_{1k} & r_{2k} & r_{3k} & \cdots & 1 \end{pmatrix}

where

r_{ij} = \frac{\sum_{u=1}^{n}(x_{ui} - \bar{x}_i)(x_{uj} - \bar{x}_j)}{(S_{ii}S_{jj})^{1/2}} = \frac{S_{ij}}{(S_{ii}S_{jj})^{1/2}}

is the simple correlation coefficient between the explanatory variables X_i and X_j. Similarly,

W'y^0 = (r_{1y}, r_{2y}, ..., r_{ky})'

where

r_{jy} = \frac{\sum_{u=1}^{n}(x_{uj} - \bar{x}_j)(y_u - \bar{y})}{(S_{jj}SS_T)^{1/2}} = \frac{S_{jy}}{(S_{jj}SS_T)^{1/2}}

is the simple correlation coefficient between the j-th explanatory variable X_j and the study variable y.

Note that it is customary to refer to r_{ij} and r_{jy} as correlation coefficients even though the X_i's are not random variables.

If unit normal scaling is used, then

Z'Z = (n - 1)W'W.

So the estimates of the regression coefficients in unit normal scaling (i.e., \hat{\gamma}) and unit length scaling (i.e., \hat{\delta}) are identical. So it does not matter which scaling is used, and \hat{\gamma} = \hat{\delta}.

The regression coefficients obtained after such scaling, viz., \hat{\gamma} or \hat{\delta}, are usually called standardized regression coefficients.

The relationship between the original and standardized regression coefficients is

b_j = \hat{\delta}_j\left( \frac{SS_T}{S_{jj}} \right)^{1/2}, \quad j = 1, 2, ..., k

and

b_0 = \bar{y} - \sum_{j=1}^{k} b_j\bar{x}_j

where b_0 is the OLSE of the intercept term and the b_j are the OLSEs of the slope parameters.
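A short sketch of unit length scaling and the back-transformation to the original coefficients (the data and variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 2
X = rng.normal(size=(n, k)) * np.array([1.0, 100.0])      # columns on very different scales
y = 3.0 + 0.5 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(size=n)

Sjj = np.sum((X - X.mean(axis=0))**2, axis=0)              # corrected sums of squares
SST = np.sum((y - y.mean())**2)

W = (X - X.mean(axis=0)) / np.sqrt(Sjj)                    # unit length scaled regressors
y0 = (y - y.mean()) / np.sqrt(SST)

delta_hat = np.linalg.solve(W.T @ W, W.T @ y0)             # standardized coefficients
b_slopes = delta_hat * np.sqrt(SST / Sjj)                  # back to the original slopes
b0 = y.mean() - b_slopes @ X.mean(axis=0)                  # intercept
print(delta_hat, b_slopes, b0)
```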

The model in deviation form

The multiple linear regression model can also be expressed in deviation form. First, all the data are expressed in terms of deviations from the sample means. The estimation of the regression parameters is then performed in two steps:

- First step: estimate the slope parameters.
- Second step: estimate the intercept term.

The multiple linear regression model in deviation form is expressed as follows. Let

A = I - \frac{1}{n}\ell\ell'

where \ell = (1, 1, ..., 1)' is an n \times 1 vector with each element unity. So

A = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} - \frac{1}{n}\begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & & & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix}.

Then

\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\ell'y
Ay = y - \bar{y}\ell = (y_1 - \bar{y}, y_2 - \bar{y}, ..., y_n - \bar{y})'.

Thus pre-multiplication of any column vector by A produces the vector of those observations in deviation form.

Note that

A\ell = \ell - \frac{1}{n}\ell(\ell'\ell) = \ell - \frac{1}{n}\ell\,n = 0

and A is a symmetric and idempotent matrix.

In the model

y = X\beta + \varepsilon,

the OLSE of \beta is

b = (X'X)^{-1}X'y

and the residual vector is

e = y - Xb.

Note that Ae = e.

If the n \times k matrix X is partitioned as

X = [X_1 \;\; X_2^*]
where X_1 = (1, 1, ..., 1)' is an n \times 1 vector with all elements unity, X_2^* is an n \times (k-1) matrix of observations on the (k-1) explanatory variables X_2, X_3, ..., X_k, and the OLSE b = (b_1, b_2^{*\prime})' is suitably partitioned with b_1 as the OLSE of the intercept term \beta_1 and b_2^* as the (k-1) \times 1 vector of OLSEs associated with \beta_2, \beta_3, ..., \beta_k.

Then

y = X_1 b_1 + X_2^* b_2^* + e.

Premultiplying by A,

Ay = AX_1 b_1 + AX_2^* b_2^* + Ae = AX_2^* b_2^* + e.

Premultiplying by X_2^{*\prime} gives

X_2^{*\prime}Ay = X_2^{*\prime}AX_2^* b_2^* + X_2^{*\prime}e = X_2^{*\prime}AX_2^* b_2^* .

Since A is symmetric and idempotent,

(AX_2^*)'(Ay) = (AX_2^*)'(AX_2^*)\,b_2^* .

This equation can be compared with the normal equations X'y = X'Xb in the model y = X\beta + \varepsilon. Such a comparison yields the following conclusions (the solution for the slopes is given after the list):

- b_2^* is the sub-vector of the OLSE.
- Ay is the study variable vector in deviation form.
- AX_2^* is the explanatory variable matrix in deviation form.
- This is the normal equation in terms of deviations. Its solution gives the OLSEs of the slope coefficients as

b_2^* = \left[ (AX_2^*)'(AX_2^*) \right]^{-1}(AX_2^*)'(Ay).

The estimate of the intercept term is obtained in the second step as follows. Premultiplying y = Xb + e by \frac{1}{n}\ell' gives

\frac{1}{n}\ell'y = \frac{1}{n}\ell'Xb + \frac{1}{n}\ell'e
\bar{y} = b_1 + b_2\bar{X}_2 + b_3\bar{X}_3 + ... + b_k\bar{X}_k
\;\Rightarrow\; b_1 = \bar{y} - b_2\bar{X}_2 - b_3\bar{X}_3 - ... - b_k\bar{X}_k .

Now we explain the various sums of squares in terms of this model.

The expression for the total sum of squares (TSS) remains the same as earlier and is given by

TSS = y'Ay.

Since

Ay = AX_2^* b_2^* + e,

we have

y'Ay = y'AX_2^* b_2^* + y'e
     = (Xb + e)'AX_2^* b_2^* + y'e
     = (X_1 b_1 + X_2^* b_2^* + e)'AX_2^* b_2^* + (X_1 b_1 + X_2^* b_2^* + e)'e
     = b_2^{*\prime}X_2^{*\prime}AX_2^* b_2^* + e'e

TSS = SS_{reg} + SS_{res}

where the sum of squares due to regression is

SS_{reg} = b_2^{*\prime}X_2^{*\prime}AX_2^* b_2^*

and the sum of squares due to residuals is

SS_{res} = e'e.
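A numerical sketch of this two-step estimation (centering via the matrix A, then recovering the intercept; the simulated data are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 2                                   # p = k - 1 slope variables
X2 = rng.normal(size=(n, p))
y = 1.5 + X2 @ np.array([2.0, -1.0]) + rng.normal(scale=0.2, size=n)

A = np.eye(n) - np.ones((n, n)) / n            # centering matrix A = I - (1/n) ll'
AX2, Ay = A @ X2, A @ y                        # data in deviation form

# Step 1: slopes from the normal equations in deviations
b2 = np.linalg.solve(AX2.T @ AX2, AX2.T @ Ay)
# Step 2: intercept
b1 = y.mean() - X2.mean(axis=0) @ b2
print(b1, b2)

ss_reg = b2 @ (AX2.T @ AX2) @ b2               # SS_reg = b2*' X2*' A X2* b2*
ss_res = np.sum((y - b1 - X2 @ b2)**2)
print(np.isclose(y @ A @ y, ss_reg + ss_res))  # TSS = SS_reg + SS_res
```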

Testing of hypothesis:
There are several important questions which can be answered through the test of hypothesis concerning the
regression coefficients. For example
1. What is the overall adequacy of the model?
2. Which specific explanatory variables seem to be important?
etc.

In order to answer such questions, we first develop the test of hypothesis for a general framework, viz., the general linear hypothesis. Several tests of hypothesis can then be derived as its special cases. So first, we discuss the test of a general linear hypothesis.
Test of hypothesis for H_0: R\beta = r
We consider a general linear hypothesis that the parameters in \beta are contained in a subspace of the parameter space for which R\beta = r, where R is a (J \times k) matrix of known elements and r is a (J \times 1) vector of known elements.

In general, the null hypothesis

H_0: R\beta = r

is termed the general linear hypothesis and

H_1: R\beta \neq r

is the alternative hypothesis. We assume that rank(R) = J, i.e., R has full row rank, so that there is no linear dependence in the hypothesis.

Some special cases and interesting examples of H_0: R\beta = r are as follows.

(i) H_0: \beta_i = 0
Choose J = 1, r = 0, R = [0, 0, ..., 0, 1, 0, ..., 0], where the 1 occurs at the i-th position of R. This particular hypothesis tests whether X_i has any effect in the linear model or not.

(ii) H_0: \beta_3 = \beta_4, or H_0: \beta_3 - \beta_4 = 0
Choose J = 1, r = 0, R = [0, 0, 1, -1, 0, ..., 0].

(iii) H_0: \beta_3 = \beta_4 = \beta_5, or H_0: \beta_3 - \beta_4 = 0, \beta_3 - \beta_5 = 0
Choose J = 2, r = (0, 0)', R = \begin{pmatrix} 0 & 0 & 1 & -1 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 1 & 0 & -1 & 0 & \cdots & 0 \end{pmatrix}.

(iv) H_0: \beta_3 + 5\beta_4 = 2
Choose J = 1, r = 2, R = [0, 0, 1, 5, 0, ..., 0].

(v) H_0: \beta_2 = \beta_3 = ... = \beta_k = 0
Choose J = k - 1, r = (0, 0, ..., 0)',

R = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}_{(k-1) \times k} = [0 \;\; I_{k-1}].

This particular hypothesis relates to the goodness of fit. It tells whether the \beta_i's have a linear effect and whether they are of any importance. It also tests that X_2, X_3, ..., X_k have no influence in the determination of y. Here \beta_1 = 0 is excluded because that would carry the additional implication that the mean level of y is zero; our main concern is to know whether the explanatory variables help to explain the variation in y around its mean value or not.

We develop the likelihood ratio test for H_0: R\beta = r.

Likelihood ratio test:


The likelihood ratio test statistic is

max L(  ,  2 | y, X ) Lˆ ()
 
max L(  ,  2 | y, X , R   r ) Lˆ ( )
where  is the whole parametric space and  is the sample space.

If both the likelihoods are maximized, one constrained, and the other unconstrained, then the value of the
unconstrained will not be smaller than the value of the constrained. Hence   1.

First, we discuss the likelihood ratio test for a more straightforward case when
R  I k and r   0 , i.e.,    0 . This will give us a better and detailed understanding of the minor details,

and then we generalize it for R  r , in general.

Likelihood ratio test for H 0 :    0


Let the null hypothesis related to k  1 vector  is
H 0 :   0

where  0 is specified by the investigator. The elements of  0 can take on any value, including zero. The
concerned alternative hypothesis is
H1 :    0 .

Since  ~ N (0,  2 I ) in y  X    , so y ~ N ( X  ,  2 I ). Thus the whole parametric space and sample


space are  and  respectively given by

 : (  ,  2 ) :     i  ,  2  0, i  1, 2,..., k
 : (  ,  2 ) :    0 ,  2  0 .
The unconstrained likelihood under  .
1  1 
L(  ,  2 | y, X )  exp   2 ( y  X  ) '( y  X  )  .
(2 )2 n /2
 2 

This is maximized over  when


  ( X ' X )1 X ' y
1
 2  ( y  X  ) '( y  X  ).
n

where  and  2 are the maximum likelihood estimates of  and  2 which are the values maximizing the
likelihood function.
Lˆ ()  max L   ,  2 | y, X ) 
 
   
1  ( y  X  ) '( y  X  ) 
 exp 
n
  2( y  X  ) '( y  X  )  
 2    2
  
 n ( y  X  ) '( y  X  )    n  
 n
n n /2 exp   
  2 .
n
 
(2 ) ( y  X  ) '( y  X  ) 
n /2 2

The constrained likelihood under  is


1  1 
Lˆ ( )  max L(  ,  2 | y, X ,    0 )  exp   2 ( y  X  0 ) '( y  X  0 )  .
(2 ) 2 n /2
 2 

Since  0 is known, so the constrained likelihood function has an optimum variance estimator

1
2  ( y  X  0 ) '( y  X  0 )
n
 n
n n /2 exp   
Lˆ ( )   2 .
n /2
(2 ) ( y  X  0 ) '( y  X  0) 
n /2

The likelihood ratio is
 
 n n /2 exp(n / 2) 
 (2 ) ( y  X  ) '( y  X  )  
n /2
n /2
Lˆ () 

Lˆ ( )  
 n n /2 exp(n / 2) 
 (2 ) n /2 ( y  X  ) '( y  X  )  n /2 
  0 0  
n /2
 ( y  X  0 ) '( y  X  0 ) 

  
 ( y  X  ) '( y  X  ) 
n/ 2
  2 
  
n /2
  2 
  
( y  X  0 ) '( y  X  0 )
where   is the ratio of the quadratic forms.
( y  X  ) '( y  X  )
Now we simplify the numerator in  as follows:

( y  X  0 ) '( y  X  0 )  ( y  X  )  X (    0 )  ( y  X  )  X (    0 ) 
 ( y  X  ) '( y  X  )  2 y '  I  X ( X ' X ) 1 X ' X (    0 )  (    0 ) ' X ' X (    0 )
 ( y  X  ) '( y  X  )  (    0 ) ' X ' X (    0 ).
Thus
( y  X  ) '( y  X  )  (    0 ) ' X ' X (    0 )

( y  X  ) '( y  X  )
(    0 ) ' X ' X (    0 )
 1
( y  X  ) '( y  X  )
(    0 ) ' X ' X (    0 )
or   1  0 
( y  X  ) '( y  X  )
where 0  0  .

Distribution of ratio of quadratic forms


Now we find the distribution of the quadratic forms involved is 0 to find the distribution of 0 as follows:

( y  X  ) '( y  X  )  e ' e
 y '  I  X ( X ' X ) 1 X ' y
 y ' Hy
 (X    )'H (X    )
  ' H (using HX  0)
 (n  k )ˆ 2

Result: If Z is a n  1 random vector that is distributed as N (0,  2 I n ) and A is any symmetric idempotent

Z ' AZ
n  n matrix of rank, p then ~  2 ( p). If B is another n  n symmetric idempotent matrix of rank
 2

Z ' BZ
q , then ~  2 (q) . If AB  0 then Z ' AZ is distributed independently of Z ' BZ .
 2

So using this result, we have


y ' Hy (n  k )ˆ 2
 ~  2 (n  k ).
 2
 2

Further, if H 0 is true, then    0 and we have the numerator in 0 . Rewriting the numerator in 0 , in
general, we have
(    ) ' X ' X (    )   ' X ( X ' X ) 1 X ' X ( X ' X ) 1 X ' 
  ' X ( X ' X ) 1 X ' 
  ' H
where H is an idempotent matrix with rank k . Thus using this result, we have
 ' H   ' X '( X ' X ) 1 X ' 
 ~  2 (k ).
2 2
Furthermore, the product of the quadratic form matrices in the numerator $(\varepsilon'\bar{H}\varepsilon)$ and denominator $(\varepsilon'H\varepsilon)$ of $\lambda_0$ is
$$\left[I - X(X'X)^{-1}X'\right]X(X'X)^{-1}X' = X(X'X)^{-1}X' - X(X'X)^{-1}X'X(X'X)^{-1}X' = 0,$$
and hence the $\chi^2$ random variables in the numerator and denominator of $\lambda_0$ are independent. Dividing each of the $\chi^2$ random variables by its respective degrees of freedom gives
$$\lambda_1 = \frac{\left[\dfrac{(\tilde{\beta} - \beta_0)'X'X(\tilde{\beta} - \beta_0)}{\sigma^2}\right]\Big/\,k}{\left[\dfrac{(n-k)\hat{\sigma}^2}{\sigma^2}\right]\Big/\,(n-k)} = \frac{(\tilde{\beta} - \beta_0)'X'X(\tilde{\beta} - \beta_0)}{k\hat{\sigma}^2} = \frac{(y - X\beta_0)'(y - X\beta_0) - (y - X\tilde{\beta})'(y - X\tilde{\beta})}{k\hat{\sigma}^2} \sim F(k, n-k) \text{ under } H_0.$$
Note that
$(y - X\beta_0)'(y - X\beta_0)$: restricted error sum of squares,
$(y - X\tilde{\beta})'(y - X\tilde{\beta})$: unrestricted error sum of squares,
so the numerator of $\lambda_1$ is the difference between the restricted and unrestricted error sums of squares.

The decision rule is to reject $H_0: \beta = \beta_0$ at $\alpha$ level of significance whenever
$$\lambda_1 \geq F_{1-\alpha}(k, n-k),$$
where $F_{1-\alpha}(k, n-k)$ is the upper $\alpha$ critical point of the central $F$-distribution with $k$ and $n-k$ degrees of freedom.

Likelihood ratio test for $H_0: R\beta = r$

The same logic and reasoning used in the development of the likelihood ratio test for $H_0: \beta = \beta_0$ can be extended to develop the likelihood ratio test for $H_0: R\beta = r$ as follows:
$$\Omega = \{(\beta, \sigma^2): -\infty < \beta_i < \infty,\ \sigma^2 > 0,\ i = 1, 2, \ldots, k\}$$
$$\omega = \{(\beta, \sigma^2): -\infty < \beta_i < \infty,\ R\beta = r,\ \sigma^2 > 0\}.$$

Let $b = (X'X)^{-1}X'y$. Then
$$E(Rb) = R\beta, \qquad V(Rb) = E\left[R(b - \beta)(b - \beta)'R'\right] = R\,V(b)\,R' = \sigma^2 R(X'X)^{-1}R'.$$
Since $b \sim N\left(\beta, \sigma^2 (X'X)^{-1}\right)$, so $Rb \sim N\left(R\beta, \sigma^2 R(X'X)^{-1}R'\right)$, and under $H_0$,
$$Rb - r = Rb - R\beta = R(b - \beta) \sim N\left(0, \sigma^2 R(X'X)^{-1}R'\right).$$
There exists a nonsingular matrix $Q$ such that $\left[R(X'X)^{-1}R'\right]^{-1} = QQ'$, and then
$$\xi = Q'R(b - \beta) \sim N(0, \sigma^2 I_J),$$
where $J$ is the number of restrictions. Therefore, under $H_0: R\beta - r = 0$,

 ' ( R  r ) ' QQ '( R  r )

2 2
1
( R  r ) '  R ( X ' X ) 1 R ' ( R   r )
=
2
1
(    ) ' R '  R ( X ' X ) 1 R ' R(    )

2
1
 ' X ( X ' X ) 1 R '  R( X ' X ) 1 R ' R( X ' X ) 1 X ' 

2
~  2 ( J ).
1
which is obtained as X ( X ' X ) 1 R '  R ( X ' X ) 1 R ' R( X ' X ) 1 X ' is an idempotent matrix, and its trace is J

which is the associated degrees of freedom.

Also, irrespective of whether $H_0$ is true or not,
$$\frac{e'e}{\sigma^2} = \frac{(y - Xb)'(y - Xb)}{\sigma^2} = \frac{y'Hy}{\sigma^2} = \frac{(n-k)\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-k).$$

Moreover, the product of the quadratic form matrices of $e'e$ and $(b - \beta)'R'\left[R(X'X)^{-1}R'\right]^{-1}R(b - \beta)$ is zero, implying that the two quadratic forms are independent. So, in terms of the likelihood ratio test statistic,
$$\lambda_1 = \frac{\left[\dfrac{(Rb - r)'\left[R(X'X)^{-1}R'\right]^{-1}(Rb - r)}{\sigma^2}\right]\Big/\,J}{\left[\dfrac{(n-k)\hat{\sigma}^2}{\sigma^2}\right]\Big/\,(n-k)} = \frac{(Rb - r)'\left[R(X'X)^{-1}R'\right]^{-1}(Rb - r)}{J\hat{\sigma}^2} \sim F(J, n-k) \text{ under } H_0.$$
So the decision rule is to reject $H_0$ whenever
$$\lambda_1 \geq F_{1-\alpha}(J, n-k),$$
where $F_{1-\alpha}(J, n-k)$ is the upper $\alpha$ critical point of the central $F$-distribution with $J$ and $(n-k)$ degrees of freedom.
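A small numerical sketch (not part of the original notes) may help fix the computation of this test. The Python code below, under simulated data and an illustrative restriction of my own choosing, evaluates $\lambda_1$ and the corresponding p-value; all variable names are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # includes intercept
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)          # OLSE b = (X'X)^{-1} X'y
sigma2_hat = np.sum((y - X @ b) ** 2) / (n - k)  # unbiased estimator of sigma^2

# Hypothetical restriction H0: beta_3 = 0, i.e., R beta = r with J = 1
R = np.array([[0.0, 0.0, 1.0]])
r = np.array([0.0])
J = R.shape[0]

XtX_inv = np.linalg.inv(X.T @ X)
diff = R @ b - r
lambda1 = (diff @ np.linalg.solve(R @ XtX_inv @ R.T, diff)) / (J * sigma2_hat)

p_value = stats.f.sf(lambda1, J, n - k)        # P(F_{J, n-k} >= lambda1)
print(lambda1, p_value, stats.f.ppf(0.95, J, n - k))
```

Setting $R = I_k$ and $r = \beta_0$ in the same code reproduces the earlier test of $H_0: \beta = \beta_0$.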

Test of significance of regression (Analysis of variance)
If we set $R = [0 \ \ I_{k-1}]$, $r = 0$, then the hypothesis $H_0: R\beta = r$ reduces to the following null hypothesis:
$$H_0: \beta_2 = \beta_3 = \ldots = \beta_k = 0$$
against the alternative hypothesis
$$H_1: \beta_j \neq 0 \ \text{ for at least one } j = 2, 3, \ldots, k.$$
This hypothesis determines whether there is a linear relationship between $y$ and any set of the explanatory variables $X_2, X_3, \ldots, X_k$. Notice that $X_1$ corresponds to the intercept term in the model, and hence $x_{i1} = 1$ for all $i = 1, 2, \ldots, n$.

This is an overall or global test of model adequacy. Rejection of the null hypothesis indicates that at least one of the explanatory variables among $X_2, X_3, \ldots, X_k$ contributes significantly to the model. This is called analysis of variance.
Since $\varepsilon \sim N(0, \sigma^2 I)$,
$$y \sim N(X\beta, \sigma^2 I), \qquad b = (X'X)^{-1}X'y \sim N\left(\beta, \sigma^2 (X'X)^{-1}\right).$$
Also
$$\hat{\sigma}^2 = \frac{SS_{res}}{n-k} = \frac{(y - \hat{y})'(y - \hat{y})}{n-k} = \frac{y'\left[I - X(X'X)^{-1}X'\right]y}{n-k} = \frac{y'Hy}{n-k} = \frac{y'y - b'X'y}{n-k}.$$
Since $(X'X)^{-1}X'H = 0$, $b$ and $\hat{\sigma}^2$ are independently distributed. Since $y'Hy = \varepsilon'H\varepsilon$ and $H$ is an idempotent matrix,
$$\frac{SS_{res}}{\sigma^2} \sim \chi^2_{(n-k)},$$
i.e., a central $\chi^2$ distribution with $(n-k)$ degrees of freedom.

Partition $X = [X_1, X_2^*]$, where the submatrix $X_2^*$ contains the explanatory variables $X_2, X_3, \ldots, X_k$, and partition $\beta = (\beta_1, \beta_2^{*\prime})'$, where the subvector $\beta_2^*$ contains the regression coefficients $\beta_2, \beta_3, \ldots, \beta_k$.

Now partition the total sum of squares due to the $y$'s as
$$SS_T = y'Ay = SS_{reg} + SS_{res}$$
(with $A = I - \frac{1}{n}\ell\ell'$ and $\ell = (1, 1, \ldots, 1)'$), where $SS_{reg} = b_2^{*\prime}X_2^{*\prime}AX_2^*b_2^*$ is the sum of squares due to regression and the sum of squares due to residuals is
$$SS_{res} = (y - Xb)'(y - Xb) = y'Hy = SS_T - SS_{reg}.$$
Further,
$$\frac{SS_{reg}}{\sigma^2} \sim \chi^2_{k-1}\left(\frac{\beta_2^{*\prime}X_2^{*\prime}AX_2^*\beta_2^*}{2\sigma^2}\right), \text{ i.e., a non-central } \chi^2 \text{ distribution with non-centrality parameter } \frac{\beta_2^{*\prime}X_2^{*\prime}AX_2^*\beta_2^*}{2\sigma^2},$$
$$\frac{SS_T}{\sigma^2} \sim \chi^2_{n-1}\left(\frac{\beta_2^{*\prime}X_2^{*\prime}AX_2^*\beta_2^*}{2\sigma^2}\right), \text{ i.e., a non-central } \chi^2 \text{ distribution with the same non-centrality parameter.}$$

Since $X_2^{*\prime}H = 0$, $SS_{reg}$ and $SS_{res}$ are independently distributed. The mean square due to regression is
$$MS_{reg} = \frac{SS_{reg}}{k-1}$$
and the mean square due to error is
$$MS_{res} = \frac{SS_{res}}{n-k}.$$
Then
$$\frac{MS_{reg}}{MS_{res}} \sim F_{k-1,\,n-k}\left(\frac{\beta_2^{*\prime}X_2^{*\prime}AX_2^*\beta_2^*}{2\sigma^2}\right),$$
which is a non-central $F$-distribution with $(k-1, n-k)$ degrees of freedom and non-centrality parameter $\dfrac{\beta_2^{*\prime}X_2^{*\prime}AX_2^*\beta_2^*}{2\sigma^2}$.

Under $H_0: \beta_2 = \beta_3 = \ldots = \beta_k = 0$,
$$F = \frac{MS_{reg}}{MS_{res}} \sim F_{k-1,\,n-k}.$$

The decision rule is to reject $H_0$ at $\alpha$ level of significance whenever
$$F \geq F_{1-\alpha}(k-1, n-k).$$
The calculation of the $F$-statistic can be summarized in the form of an analysis of variance (ANOVA) table as follows:

Source of variation | Sum of squares | Degrees of freedom | Mean square                     | F
Regression          | $SS_{reg}$     | $k-1$              | $MS_{reg} = SS_{reg}/(k-1)$     | $F = MS_{reg}/MS_{res}$
Error               | $SS_{res}$     | $n-k$              | $MS_{res} = SS_{res}/(n-k)$     |
Total               | $SS_T$         | $n-1$              |                                 |

Rejection of $H_0$ indicates that it is likely that at least one $\beta_i \neq 0$ $(i = 2, 3, \ldots, k)$.
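As a sketch (under assumed simulated data, not an example from the notes), the following Python code assembles the ANOVA quantities and the overall $F$ test; $SS_T$ is taken as the corrected total sum of squares, consistent with an intercept being present.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k = 40, 4                                   # k coefficients including the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.5, 1.0, -1.0, 0.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
SS_res = np.sum((y - X @ b) ** 2)
SS_T = np.sum((y - y.mean()) ** 2)             # corrected total sum of squares
SS_reg = SS_T - SS_res

MS_reg = SS_reg / (k - 1)
MS_res = SS_res / (n - k)
F = MS_reg / MS_res
print("F =", F,
      "critical value =", stats.f.ppf(0.95, k - 1, n - k),
      "p-value =", stats.f.sf(F, k - 1, n - k))
```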

Test of hypothesis on individual regression coefficients


If the analysis of variance test rejects the null hypothesis, a further question arises: which of the regression coefficients is/are responsible for the rejection? The explanatory variables corresponding to such regression coefficients are important for the model.

Adding such explanatory variables also increases the variance of fitted values ŷ , so one needs to be careful
that only those regressors are added that are of real value in explaining the response. Adding unimportant
explanatory variables may increase the residual mean square, which may decrease the usefulness of the
model.

Testing the null hypothesis
$$H_0: \beta_j = 0$$
versus the alternative hypothesis
$$H_1: \beta_j \neq 0$$
has already been discussed in the case of the simple linear regression model. In the present case, if $H_0$ is accepted, it implies that the explanatory variable $X_j$ can be deleted from the model. The corresponding test statistic is
$$t = \frac{b_j}{se(b_j)} \sim t(n-k) \text{ under } H_0,$$
where the standard error of the OLSE $b_j$ of $\beta_j$ is $se(b_j) = \sqrt{\hat{\sigma}^2 C_{jj}}$, with $C_{jj}$ denoting the $j$th diagonal element of $(X'X)^{-1}$ corresponding to $b_j$.

The decision rule is to reject $H_0$ at $\alpha$ level of significance if
$$|t| > t_{\alpha/2,\, n-k}.$$
Note that this is only a partial or marginal test because $\hat{\beta}_j$ depends on all the other explanatory variables $X_i$ $(i \neq j)$ that are in the model. This is a test of the contribution of $X_j$ given the other explanatory variables in the model.
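A brief illustrative sketch (simulated data assumed, not taken from the notes) of these marginal $t$ tests in Python:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ b) ** 2) / (n - k)

se = np.sqrt(sigma2_hat * np.diag(XtX_inv))    # se(b_j) = sqrt(sigma2_hat * C_jj)
t_stats = b / se
p_values = 2 * stats.t.sf(np.abs(t_stats), n - k)
for j, (tj, pj) in enumerate(zip(t_stats, p_values), start=1):
    print(f"b_{j}: t = {tj:.3f}, p = {pj:.4f}")
```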

Confidence interval estimation


The confidence intervals in a multiple regression model can be constructed for individual regression
coefficients as well as jointly. We consider both of them as follows:

Confidence interval on the individual regression coefficient:


Assuming the $\varepsilon_i$'s are identically and independently distributed following $N(0, \sigma^2)$ in $y = X\beta + \varepsilon$, we have
$$y \sim N(X\beta, \sigma^2 I), \qquad b \sim N\left(\beta, \sigma^2(X'X)^{-1}\right).$$
Thus the marginal distribution of any regression coefficient estimate is
$$b_j \sim N(\beta_j, \sigma^2 C_{jj}),$$
where $C_{jj}$ is the $j$th diagonal element of $(X'X)^{-1}$. Thus
$$t_j = \frac{b_j - \beta_j}{\sqrt{\hat{\sigma}^2 C_{jj}}} \sim t(n-k), \quad j = 1, 2, \ldots, k,$$
where $\hat{\sigma}^2 = \dfrac{SS_{res}}{n-k} = \dfrac{y'y - b'X'y}{n-k}$.
So the $100(1-\alpha)\%$ confidence interval for $\beta_j$ $(j = 1, 2, \ldots, k)$ is obtained as follows:
$$P\left[-t_{\frac{\alpha}{2}, n-k} \leq \frac{b_j - \beta_j}{\sqrt{\hat{\sigma}^2 C_{jj}}} \leq t_{\frac{\alpha}{2}, n-k}\right] = 1 - \alpha$$
$$P\left[b_j - t_{\frac{\alpha}{2}, n-k}\sqrt{\hat{\sigma}^2 C_{jj}} \leq \beta_j \leq b_j + t_{\frac{\alpha}{2}, n-k}\sqrt{\hat{\sigma}^2 C_{jj}}\right] = 1 - \alpha.$$
Thus the confidence interval is
$$\left(b_j - t_{\frac{\alpha}{2}, n-k}\sqrt{\hat{\sigma}^2 C_{jj}},\ \ b_j + t_{\frac{\alpha}{2}, n-k}\sqrt{\hat{\sigma}^2 C_{jj}}\right).$$
Simultaneous confidence intervals on regression coefficients:
A set of confidence intervals that are true simultaneously with probability $(1-\alpha)$ are called simultaneous or joint confidence intervals.

It is relatively easy to define a joint confidence region for $\beta$ in the multiple regression model. Since
$$\frac{(b - \beta)'X'X(b - \beta)}{k\,MS_{res}} \sim F_{k,\,n-k},$$
so
$$P\left[\frac{(b - \beta)'X'X(b - \beta)}{k\,MS_{res}} \leq F_{1-\alpha}(k, n-k)\right] = 1 - \alpha.$$
So a $100(1-\alpha)\%$ joint confidence region for all of the parameters in $\beta$ is
$$\frac{(b - \beta)'X'X(b - \beta)}{k\,MS_{res}} \leq F_{1-\alpha}(k, n-k),$$
which describes an elliptically shaped region.
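One simple use of this region is to check whether a candidate parameter vector lies inside it. The sketch below (assumed data, Python) implements exactly that membership check:

```python
import numpy as np
from scipy import stats

def in_joint_confidence_region(X, y, beta0, alpha=0.05):
    """True if beta0 satisfies (b-beta0)' X'X (b-beta0) / (k*MS_res) <= F_{1-alpha}(k, n-k)."""
    n, k = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    MS_res = np.sum((y - X @ b) ** 2) / (n - k)
    stat = (b - beta0) @ (X.T @ X) @ (b - beta0) / (k * MS_res)
    return stat <= stats.f.ppf(1 - alpha, k, n - k)

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(25), rng.normal(size=(25, 1))])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=25)
print(in_joint_confidence_region(X, y, np.array([1.0, 2.0])))
```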

Coefficient of determination ($R^2$) and adjusted $R^2$

Let $R$ be the multiple correlation coefficient between $y$ and $X_1, X_2, \ldots, X_k$. Then the square of the multiple correlation coefficient, $R^2$, is called the coefficient of determination. The value of $R^2$ commonly describes how well the sample regression line fits the observed data. This is also treated as a measure of goodness of fit of the model.

Assuming that the intercept term is present in the model as
$$y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + \ldots + \beta_k X_{ik} + u_i, \quad i = 1, 2, \ldots, n,$$
then
$$R^2 = 1 - \frac{e'e}{\sum_{i=1}^n (y_i - \bar{y})^2} = 1 - \frac{SS_{res}}{SS_T} = \frac{SS_{reg}}{SS_T},$$
where
$SS_{res}$: sum of squares due to residuals,
$SS_T$: total sum of squares,
$SS_{reg}$: sum of squares due to regression.

$R^2$ measures the explanatory power of the model, which in turn reflects the goodness of fit of the model. It reflects the model adequacy in the sense of how much explanatory power the explanatory variables have.

Since $e'e = y'\left[I - X(X'X)^{-1}X'\right]y = y'Hy$ and
$$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n y_i^2 - n\bar{y}^2,$$
where $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i = \frac{1}{n}\ell'y$ with $\ell = (1, 1, \ldots, 1)'$ and $y = (y_1, y_2, \ldots, y_n)'$, we have
$$\sum_{i=1}^n (y_i - \bar{y})^2 = y'y - n\cdot\frac{1}{n^2}\,\ell'yy'\ell = y'y - \frac{1}{n}y'\ell\ell'y = y'y - y'\ell(\ell'\ell)^{-1}\ell'y = y'\left[I - \ell(\ell'\ell)^{-1}\ell'\right]y = y'Ay,$$
where $A = I - \ell(\ell'\ell)^{-1}\ell'$. So
$$R^2 = 1 - \frac{y'Hy}{y'Ay}.$$

The limits of $R^2$ are 0 and 1, i.e.,
$$0 \leq R^2 \leq 1.$$
$R^2 = 0$ indicates the poorest fit of the model.
$R^2 = 1$ indicates the best fit of the model.
$R^2 = 0.95$ indicates that 95% of the variation in $y$ is explained by the model; in simple words, the model is 95% good.
Similarly, any other value of $R^2$ between 0 and 1 indicates the adequacy of the fitted model.

Adjusted $R^2$
If more explanatory variables are added to the model, then $R^2$ increases. If the added variables are irrelevant, $R^2$ will still increase and give an overly optimistic picture.

With a view to correcting this overly optimistic picture, the adjusted $R^2$, denoted as $\bar{R}^2$ or adj $R^2$, is used, which is defined as
$$\bar{R}^2 = 1 - \frac{SS_{res}/(n-k)}{SS_T/(n-1)} = 1 - \left(\frac{n-1}{n-k}\right)(1 - R^2).$$
We will see later that $(n-k)$ and $(n-1)$ are the degrees of freedom associated with the distributions of $SS_{res}$ and $SS_T$. Moreover, the quantities $\dfrac{SS_{res}}{n-k}$ and $\dfrac{SS_T}{n-1}$ are based on the unbiased estimators of the respective variances of $e$ and $y$ in the context of analysis of variance.

The adjusted $R^2$ will decline if the addition of an extra variable produces too small a reduction in $(1 - R^2)$ to compensate for the increase in $\left(\dfrac{n-1}{n-k}\right)$.

Another limitation of the adjusted $R^2$ is that it can be negative. For example, if $k = 3$, $n = 10$, $R^2 = 0.16$, then
$$\bar{R}^2 = 1 - \frac{9}{7} \times 0.84 \approx -0.08 < 0,$$
which has no interpretation.
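A short Python sketch (simulated data of my own, purely illustrative) that computes both quantities and shows how a weak fit can push the adjusted $R^2$ below zero:

```python
import numpy as np

def r2_and_adjusted_r2(X, y):
    """R^2 = 1 - SS_res/SS_T and adjusted R^2 = 1 - ((n-1)/(n-k)) * (1 - R^2)."""
    n, k = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    SS_res = np.sum((y - X @ b) ** 2)
    SS_T = np.sum((y - y.mean()) ** 2)
    R2 = 1 - SS_res / SS_T
    adj_R2 = 1 - (n - 1) / (n - k) * (1 - R2)
    return R2, adj_R2

n, k = 10, 3
rng = np.random.default_rng(5)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)                          # y unrelated to the regressors
print(r2_and_adjusted_r2(X, y))                 # adjusted R^2 may well be negative here
```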

Limitations
1. If the constant term is absent in the model, then $R^2$ cannot be defined. In such cases, $R^2$ can be negative. Some ad hoc measures based on $R^2$ for a regression line through the origin have been proposed in the literature.

Reason why $R^2$ is valid only in linear models with an intercept term:
In the model $y = X\beta + \varepsilon$, the ordinary least squares estimator of $\beta$ is $b = (X'X)^{-1}X'y$. Consider the fitted model as
$$y = Xb + (y - Xb) = Xb + e,$$
where $e$ is the residual. Note that
$$y - \ell\bar{y} = Xb + e - \ell\bar{y} = \hat{y} - \ell\bar{y} + e,$$
where $\hat{y} = Xb$ is the fitted value and $\ell = (1, 1, \ldots, 1)'$ is an $n \times 1$ vector of elements unity. The total sum of squares $TSS = \sum_{i=1}^n (y_i - \bar{y})^2$ is then obtained as
$$TSS = (y - \ell\bar{y})'(y - \ell\bar{y}) = \left[(\hat{y} - \ell\bar{y}) + e\right]'\left[(\hat{y} - \ell\bar{y}) + e\right] = (\hat{y} - \ell\bar{y})'(\hat{y} - \ell\bar{y}) + e'e + 2(\hat{y} - \ell\bar{y})'e$$
$$= SS_{reg} + SS_{res} + 2(Xb - \ell\bar{y})'e \quad (\text{because } \hat{y} = Xb)$$
$$= SS_{reg} + SS_{res} - 2\bar{y}\,\ell'e \quad (\text{because } X'e = 0).$$
The Fisher–Cochran theorem requires $TSS = SS_{reg} + SS_{res}$ to hold true in the context of analysis of variance and further to define $R^2$. In order that $TSS = SS_{reg} + SS_{res}$ holds true, we need $\ell'e$ to be zero, i.e., $\ell'e = \ell'(y - \hat{y}) = 0$, which is possible only when there is an intercept term in the model. We show this claim as follows:

First, we consider a no-intercept simple linear regression model $y_i = \beta_1 x_i + \varepsilon_i$ $(i = 1, 2, \ldots, n)$, where the parameter $\beta_1$ is estimated as $b_1^* = \dfrac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$. Then
$$\ell'e = \sum_{i=1}^n e_i = \sum_{i=1}^n (y_i - \hat{y}_i) = \sum_{i=1}^n (y_i - b_1^* x_i) \neq 0, \text{ in general.}$$
Similarly, in a no-intercept multiple linear regression model $y = X\beta + \varepsilon$, we find that
$$\ell'e = \ell'(y - \hat{y}) = \ell'(X\beta + \varepsilon - Xb) = -\ell'X(b - \beta) + \ell'\varepsilon \neq 0, \text{ in general.}$$
Next, we consider a simple linear regression model with intercept term $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ $(i = 1, 2, \ldots, n)$, where the parameters $\beta_0$ and $\beta_1$ are estimated as $b_0 = \bar{y} - b_1\bar{x}$ and $b_1 = \dfrac{s_{xy}}{s_{xx}}$ respectively, with $s_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$, $s_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$, $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$, $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$. We find that
$$\ell'e = \sum_{i=1}^n e_i = \sum_{i=1}^n (y_i - \hat{y}_i) = \sum_{i=1}^n (y_i - b_0 - b_1 x_i) = \sum_{i=1}^n (y_i - \bar{y} + b_1\bar{x} - b_1 x_i) = \sum_{i=1}^n \left[(y_i - \bar{y}) - b_1(x_i - \bar{x})\right] = \sum_{i=1}^n (y_i - \bar{y}) - b_1\sum_{i=1}^n (x_i - \bar{x}) = 0.$$
In a multiple linear regression model with an intercept term, $y = \beta_0\ell + X\beta + \varepsilon$, the parameters $\beta_0$ and $\beta$ are estimated as $\hat{\beta}_0 = \bar{y} - \bar{x}'b$ and $b = (X'X)^{-1}X'y$ respectively, where $\bar{x}$ is the vector of column means of $X$ and $\bar{X} = \ell\bar{x}'$ denotes the matrix whose rows each contain these means. We find that
$$\ell'e = \ell'(y - \hat{y}) = \ell'(y - \hat{\beta}_0\ell - Xb) = \ell'\left[(y - \bar{y}\ell) - (X - \bar{X})b\right] = \ell'(y - \bar{y}\ell) - \ell'(X - \bar{X})b = 0,$$
since $\ell'(y - \bar{y}\ell) = \sum_{i=1}^n (y_i - \bar{y}) = 0$ and each element of $\ell'(X - \bar{X})$ is a sum of deviations from the corresponding column mean, which is also zero.

Thus we conclude that, for the Fisher–Cochran theorem to hold true in the sense that the total sum of squares can be divided into two orthogonal components, viz., the sum of squares due to regression and the sum of squares due to errors, it is necessary that $\ell'e = \ell'(y - \hat{y}) = 0$ holds, which is possible only when the intercept term is present in the model.

2. R 2 is sensitive to extreme values, so R 2 lacks robustness.

3. R 2 always increases with an increase in the number of explanatory variables in the model. The main
drawback of this property is that even when the irrelevant explanatory variables are added in the
model, R 2 still increases. This indicates that the model is getting better, which is not really correct.

4. Consider a situation where we have the following two models:
$$y_i = \beta_1 + \beta_2 X_{i2} + \ldots + \beta_k X_{ik} + u_i, \quad i = 1, 2, \ldots, n$$
$$\log y_i = \beta_1 + \beta_2 X_{i2} + \ldots + \beta_k X_{ik} + v_i.$$
The question now is which model is better?

For the first model,
$$R_1^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2},$$
and for the second model, an option is to define $R^2$ as
$$R_2^2 = 1 - \frac{\sum_{i=1}^n (\log y_i - \widehat{\log y}_i)^2}{\sum_{i=1}^n (\log y_i - \overline{\log y})^2}.$$
As such, $R_1^2$ and $R_2^2$ are not comparable. If the two models still need to be compared, a better proposition is to define $R^2$ as
$$R_3^2 = 1 - \frac{\sum_{i=1}^n (y_i - \operatorname{antilog}\,\hat{y}_i^*)^2}{\sum_{i=1}^n (y_i - \bar{y})^2},$$
where $y_i^* = \log y_i$ and $\hat{y}_i^*$ is the fitted value of $y_i^*$ from the second model. Now $R_1^2$ and $R_3^2$, on comparison, may give an idea about the adequacy of the two models.

Relationship of analysis of variance test and coefficient of determination

Assuming $\beta_1$ to be the intercept term, then for $H_0: \beta_2 = \beta_3 = \ldots = \beta_k = 0$, the $F$-statistic in the analysis of variance test is
$$F = \frac{MS_{reg}}{MS_{res}} = \frac{(n-k)\,SS_{reg}}{(k-1)\,SS_{res}} = \left(\frac{n-k}{k-1}\right)\frac{SS_{reg}}{SS_T - SS_{reg}} = \left(\frac{n-k}{k-1}\right)\frac{SS_{reg}/SS_T}{1 - SS_{reg}/SS_T} = \left(\frac{n-k}{k-1}\right)\frac{R^2}{1 - R^2},$$

where $R^2$ is the coefficient of determination. So $F$ and $R^2$ are closely related. When $R^2 = 0$, then $F = 0$. In the limit, when $R^2 \to 1$, $F \to \infty$. So both $F$ and $R^2$ vary directly: a larger $R^2$ implies a greater $F$ value. That is why the $F$ test under the analysis of variance is termed the measure of overall significance of the estimated regression. It is also a test of the significance of $R^2$. If $F$ is highly significant, it implies that we can reject $H_0$, i.e., $y$ is linearly related to the $X$'s.
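The algebraic identity between the ANOVA $F$ and $R^2$ can be verified numerically; a minimal sketch with assumed simulated data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 35, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.5, 0.2]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
SS_res = np.sum((y - X @ b) ** 2)
SS_T = np.sum((y - y.mean()) ** 2)
R2 = 1 - SS_res / SS_T

F_anova = ((SS_T - SS_res) / (k - 1)) / (SS_res / (n - k))
F_from_R2 = (n - k) / (k - 1) * R2 / (1 - R2)
print(np.isclose(F_anova, F_from_R2))           # True: both expressions coincide
```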

Prediction of values of study variable

The prediction in the multiple regression model has two aspects:
1. Prediction of the average value of the study variable, i.e., the mean response.
2. Prediction of the actual value of the study variable.

1. Prediction of average value of $y$

We need to predict $E(y)$ at a given $x_0 = (x_{01}, x_{02}, \ldots, x_{0k})'$. The predictor as a point estimate is
$$p = x_0'b = x_0'(X'X)^{-1}X'y, \qquad E(p) = x_0'\beta.$$
So $p$ is an unbiased predictor of $E(y)$. Its variance is
$$Var(p) = E\left[p - E(y)\right]'\left[p - E(y)\right] = \sigma^2 x_0'(X'X)^{-1}x_0.$$
Then
$$E(\hat{y}_0) = x_0'\beta = E(y \mid x_0), \qquad Var(\hat{y}_0) = \sigma^2 x_0'(X'X)^{-1}x_0.$$
The confidence interval on the mean response at a particular point such as $x_{01}, x_{02}, \ldots, x_{0k}$ can be found as follows. Define $x_0 = (x_{01}, x_{02}, \ldots, x_{0k})'$. The fitted value at $x_0$ is $\hat{y}_0 = x_0'b$.

Then
$$P\left[-t_{\frac{\alpha}{2}, n-k} \leq \frac{\hat{y}_0 - E(y \mid x_0)}{\sqrt{\hat{\sigma}^2 x_0'(X'X)^{-1}x_0}} \leq t_{\frac{\alpha}{2}, n-k}\right] = 1 - \alpha$$
$$P\left[\hat{y}_0 - t_{\frac{\alpha}{2}, n-k}\sqrt{\hat{\sigma}^2 x_0'(X'X)^{-1}x_0} \leq E(y \mid x_0) \leq \hat{y}_0 + t_{\frac{\alpha}{2}, n-k}\sqrt{\hat{\sigma}^2 x_0'(X'X)^{-1}x_0}\right] = 1 - \alpha.$$
The $100(1-\alpha)\%$ confidence interval on the mean response at the point $x_{01}, x_{02}, \ldots, x_{0k}$, i.e., on $E(y \mid x_0)$, is
$$\left(\hat{y}_0 - t_{\frac{\alpha}{2}, n-k}\sqrt{\hat{\sigma}^2 x_0'(X'X)^{-1}x_0},\ \ \hat{y}_0 + t_{\frac{\alpha}{2}, n-k}\sqrt{\hat{\sigma}^2 x_0'(X'X)^{-1}x_0}\right).$$

2. Prediction of actual value of $y$

We need to predict $y$ at a given $x_0 = (x_{01}, x_{02}, \ldots, x_{0k})'$. The predictor as a point estimate is
$$p_f = x_0'b, \qquad E(p_f) = x_0'\beta,$$
so $p_f$ is an unbiased predictor of $y$. Its variance is
$$Var(p_f) = E\left[(p_f - y)(p_f - y)'\right] = \sigma^2\left[1 + x_0'(X'X)^{-1}x_0\right].$$
The $100(1-\alpha)\%$ prediction interval for this future observation is
$$\left(p_f - t_{\frac{\alpha}{2}, n-k}\sqrt{\hat{\sigma}^2\left[1 + x_0'(X'X)^{-1}x_0\right]},\ \ p_f + t_{\frac{\alpha}{2}, n-k}\sqrt{\hat{\sigma}^2\left[1 + x_0'(X'X)^{-1}x_0\right]}\right).$$

Chapter 4
Predictions In Linear Regression Model

Prediction of values of study variable


An important use of linear regression modeling is to predict the average and actual values of the study
variable. The term prediction of the value of study variable corresponds to knowing the value of E ( y ) (in
case of average value) and value of y (in case of actual value) for a given value of the explanatory
variable. We consider both cases. The prediction of values consists of two steps. In the first step, the
regression coefficients are estimated on the basis of given observations. In the second step, these
estimators are then used to construct the predictor which provides the prediction of actual or average
values of study variables. Based on this approach of construction of predictors, there are two situations in
which the actual and average values of the study variable can be predicted- within sample prediction and
outside sample prediction. We describe the prediction in both situations.

Within sample prediction in simple linear regression model

Consider the linear regression model $y = \beta_0 + \beta_1 x + \varepsilon$, based on a sample of $n$ pairs of observations $(x_i, y_i)$ $(i = 1, 2, \ldots, n)$ following $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the $\varepsilon_i$'s are identically and independently distributed following $N(0, \sigma^2)$. The parameters $\beta_0$ and $\beta_1$ are estimated by ordinary least squares as $b_0$ for $\beta_0$ and $b_1$ for $\beta_1$, given by
$$b_0 = \bar{y} - b_1\bar{x}, \qquad b_1 = \frac{s_{xy}}{s_{xx}},$$
where
$$s_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}), \quad s_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2, \quad \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i.$$
The fitted model is $\hat{y} = b_0 + b_1 x$.

Case 1: Prediction of average value of $y$

Suppose we want to predict the value of $E(y)$ for a given value of $x = x_0$. Then the predictor is given by
$$p_m = b_0 + b_1 x_0.$$
Here $m$ stands for mean value.

Predictive bias
The prediction error is given as
$$p_m - E(y) = b_0 + b_1 x_0 - E(\beta_0 + \beta_1 x_0 + \varepsilon) = b_0 + b_1 x_0 - (\beta_0 + \beta_1 x_0) = (b_0 - \beta_0) + (b_1 - \beta_1)x_0.$$
Then the prediction bias is
$$E\left[p_m - E(y)\right] = E(b_0 - \beta_0) + E(b_1 - \beta_1)x_0 = 0 + 0 = 0.$$
Thus the predictor $p_m$ is an unbiased predictor of $E(y)$.

Predictive variance:
The predictive variance of $p_m$ is
$$PV(p_m) = Var(b_0 + b_1 x_0) = Var\left[\bar{y} + b_1(x_0 - \bar{x})\right] = Var(\bar{y}) + (x_0 - \bar{x})^2 Var(b_1) + 2(x_0 - \bar{x})Cov(\bar{y}, b_1) = \frac{\sigma^2}{n} + \frac{\sigma^2(x_0 - \bar{x})^2}{s_{xx}} + 0 = \sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right].$$

Estimate of predictive variance
The predictive variance can be estimated by substituting $\sigma^2$ by $\hat{\sigma}^2 = MSE$ as
$$\widehat{PV}(p_m) = \hat{\sigma}^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right] = MSE\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right].$$

Prediction interval:
The $100(1-\alpha)\%$ prediction interval for $E(y)$ is obtained as follows. The predictor $p_m$ is a linear combination of normally distributed random variables, so it is also normally distributed as
$$p_m \sim N\left(\beta_0 + \beta_1 x_0,\ PV(p_m)\right).$$

So if $\sigma^2$ is known, then the distribution of
$$\frac{p_m - E(y)}{\sqrt{PV(p_m)}}$$
is $N(0, 1)$. So the $100(1-\alpha)\%$ prediction interval is obtained from
$$P\left[-z_{\alpha/2} \leq \frac{p_m - E(y)}{\sqrt{PV(p_m)}} \leq z_{\alpha/2}\right] = 1 - \alpha,$$
which gives the prediction interval for $E(y)$ as
$$\left(p_m - z_{\alpha/2}\sqrt{\sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right]},\ \ p_m + z_{\alpha/2}\sqrt{\sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right]}\right).$$
When $\sigma^2$ is unknown, it is replaced by $\hat{\sigma}^2 = MSE$, and in this case the sampling distribution of
$$\frac{p_m - E(y)}{\sqrt{MSE\left[\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{s_{xx}}\right]}}$$
is the $t$-distribution with $(n-2)$ degrees of freedom, i.e., $t_{n-2}$. The $100(1-\alpha)\%$ prediction interval in this case follows from
$$P\left[-t_{\alpha/2, n-2} \leq \frac{p_m - E(y)}{\sqrt{MSE\left[\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{s_{xx}}\right]}} \leq t_{\alpha/2, n-2}\right] = 1 - \alpha,$$
which gives the prediction interval as
$$\left(p_m - t_{\alpha/2, n-2}\sqrt{MSE\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right]},\ \ p_m + t_{\alpha/2, n-2}\sqrt{MSE\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right]}\right).$$
Note that the width of the prediction interval for $E(y)$ is a function of $x_0$. The interval width is minimum for $x_0 = \bar{x}$ and widens as $|x_0 - \bar{x}|$ increases. This is also expected, as the best estimates of $y$ are made at $x$-values near the center of the data, and the precision of estimation deteriorates as we move toward the boundary of the $x$-space.

Case 2: Prediction of actual value
If $x_0$ is the value of the explanatory variable, then the actual value predictor for $y$ is
$$p_a = b_0 + b_1 x_0.$$
Here $a$ stands for "actual". The true value of $y$ in the prediction period is given by $y_0 = \beta_0 + \beta_1 x_0 + \varepsilon_0$, where $\varepsilon_0$ indicates the value that would be drawn from the distribution of the random error in the prediction period. Note that the form of this predictor is the same as that of the average value predictor, but its predictive error and other properties are different. This is the dual nature of the predictor.

Predictive bias:
The predictive error of $p_a$ is given by
$$p_a - y_0 = b_0 + b_1 x_0 - (\beta_0 + \beta_1 x_0 + \varepsilon_0) = (b_0 - \beta_0) + (b_1 - \beta_1)x_0 - \varepsilon_0.$$
Thus, we find that
$$E(p_a - y_0) = E(b_0 - \beta_0) + E(b_1 - \beta_1)x_0 - E(\varepsilon_0) = 0 + 0 - 0 = 0,$$
which implies that $p_a$ is an unbiased predictor of $y_0$.

Predictive variance
Because the future observation $y_0$ is independent of $p_a$, the predictive variance of $p_a$ is
$$PV(p_a) = E(p_a - y_0)^2 = E\left[(b_0 - \beta_0) + (x_0 - \bar{x})(b_1 - \beta_1) + \bar{x}(b_1 - \beta_1) - \varepsilon_0\right]^2$$
$$= Var(b_0) + (x_0 - \bar{x})^2 Var(b_1) + \bar{x}^2 Var(b_1) + Var(\varepsilon_0) + 2(x_0 - \bar{x})Cov(b_0, b_1) + 2\bar{x}Cov(b_0, b_1) + 2\bar{x}(x_0 - \bar{x})Var(b_1)$$
[the remaining terms are 0, assuming the independence of $\varepsilon_0$ with $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$]
$$= Var(b_0) + \left[(x_0 - \bar{x})^2 + \bar{x}^2 + 2\bar{x}(x_0 - \bar{x})\right]Var(b_1) + Var(\varepsilon_0) + 2\left[(x_0 - \bar{x}) + \bar{x}\right]Cov(b_0, b_1)$$
$$= Var(b_0) + x_0^2 Var(b_1) + Var(\varepsilon_0) + 2x_0 Cov(b_0, b_1)$$
$$= \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{s_{xx}}\right) + x_0^2\frac{\sigma^2}{s_{xx}} + \sigma^2 - 2x_0\frac{\bar{x}\sigma^2}{s_{xx}}$$
$$= \sigma^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right].$$

Estimate of predictive variance
The estimate of the predictive variance can be obtained by replacing $\sigma^2$ by its estimate $\hat{\sigma}^2 = MSE$ as
$$\widehat{PV}(p_a) = \hat{\sigma}^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right] = MSE\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right].$$

Prediction interval:
If $\sigma^2$ is known, then the distribution of
$$\frac{p_a - y_0}{\sqrt{PV(p_a)}}$$
is $N(0, 1)$. So the $100(1-\alpha)\%$ prediction interval for $y_0$ is obtained from
$$P\left[-z_{\alpha/2} \leq \frac{p_a - y_0}{\sqrt{PV(p_a)}} \leq z_{\alpha/2}\right] = 1 - \alpha,$$
which gives the prediction interval for $y_0$ as
$$\left(p_a - z_{\alpha/2}\sqrt{\sigma^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right]},\ \ p_a + z_{\alpha/2}\sqrt{\sigma^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right]}\right).$$
When $\sigma^2$ is unknown, then
$$\frac{p_a - y_0}{\sqrt{\widehat{PV}(p_a)}}$$
follows a $t$-distribution with $(n-2)$ degrees of freedom. The $100(1-\alpha)\%$ prediction interval for $y_0$ in this case is obtained from
$$P\left[-t_{\alpha/2, n-2} \leq \frac{p_a - y_0}{\sqrt{\widehat{PV}(p_a)}} \leq t_{\alpha/2, n-2}\right] = 1 - \alpha,$$
which gives the prediction interval for $y_0$ as
$$\left(p_a - t_{\alpha/2, n-2}\sqrt{MSE\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right]},\ \ p_a + t_{\alpha/2, n-2}\sqrt{MSE\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{s_{xx}}\right]}\right).$$
The prediction interval is of minimum width at $x_0 = \bar{x}$ and widens as $|x_0 - \bar{x}|$ increases.

The prediction interval for $p_a$ is wider than the prediction interval for $p_m$ because the prediction interval for $p_a$ depends on both the error from the fitted model and the error associated with the future observations.
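A compact Python sketch (data simulated for illustration; not from the notes) that returns both within-sample intervals for the simple linear regression model:

```python
import numpy as np
from scipy import stats

def simple_regression_intervals(x, y, x0, alpha=0.05):
    """Prediction intervals for E(y) (p_m) and for the actual value y_0 (p_a) at x = x0."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    s_xx = np.sum((x - x_bar) ** 2)
    b1 = np.sum((x - x_bar) * (y - y_bar)) / s_xx
    b0 = y_bar - b1 * x_bar
    MSE = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)
    p = b0 + b1 * x0
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    var_mean = MSE * (1 / n + (x0 - x_bar) ** 2 / s_xx)
    var_actual = MSE * (1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    return ((p - t_crit * np.sqrt(var_mean),   p + t_crit * np.sqrt(var_mean)),
            (p - t_crit * np.sqrt(var_actual), p + t_crit * np.sqrt(var_actual)))

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, size=30)
y = 2.0 + 0.7 * x + rng.normal(size=30)
print(simple_regression_intervals(x, y, x0=5.0))
```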
Within sample prediction in multiple linear regression model
Consider the multiple regression model with $k$ explanatory variables
$$y = X\beta + \varepsilon,$$
where $y = (y_1, y_2, \ldots, y_n)'$ is an $n \times 1$ vector of $n$ observations on the study variable,
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}$$
is an $n \times k$ matrix of $n$ observations on each of the $k$ explanatory variables, $\beta = (\beta_1, \beta_2, \ldots, \beta_k)'$ is a $k \times 1$ vector of regression coefficients and $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)'$ is an $n \times 1$ vector of random error components or disturbance terms following $N(0, \sigma^2 I_n)$. If the intercept term is present, take the first column of $X$ to be $(1, 1, \ldots, 1)'$.

Let the parameter $\beta$ be estimated by its ordinary least squares estimator $b = (X'X)^{-1}X'y$. Then the predictor is $p = Xb$, which can be used for predicting the actual and the average values of the study variable. This is the dual nature of the predictor.

Case 1: Prediction of average value of $y$

When the objective is to predict the average value of $y$, i.e., $E(y)$, then the estimation error is given by
$$p - E(y) = Xb - X\beta = X(b - \beta) = X(X'X)^{-1}X'\varepsilon = H\varepsilon,$$
where $H = X(X'X)^{-1}X'$. Then
$$E\left[p - E(y)\right] = H\,E(\varepsilon) = 0,$$
which proves that the predictor $p = Xb$ provides unbiased predictions for the average value. The predictive variance of $p$ is
$$PV_m(p) = E\left[\{p - E(y)\}'\{p - E(y)\}\right] = E\left[\varepsilon'H H\varepsilon\right] = E\left[\varepsilon'H\varepsilon\right] = \sigma^2\,\mathrm{tr}\,H = \sigma^2 k.$$
The predictive variance can be estimated by $\widehat{PV}_m(p) = \hat{\sigma}^2 k$, where $\hat{\sigma}^2 = MSE$ is obtained from the analysis of variance based on the OLSE.
When $\sigma^2$ is known, then the distribution of
$$\frac{p - E(y)}{\sqrt{PV_m(p)}}$$
is $N(0, 1)$. So the $100(1-\alpha)\%$ prediction interval for $E(y)$ is obtained from
$$P\left[-z_{\alpha/2} \leq \frac{p - E(y)}{\sqrt{PV_m(p)}} \leq z_{\alpha/2}\right] = 1 - \alpha,$$
which gives the prediction interval for $E(y)$ as
$$\left(p - z_{\alpha/2}\sqrt{PV_m(p)},\ \ p + z_{\alpha/2}\sqrt{PV_m(p)}\right).$$
When $\sigma^2$ is unknown, it is replaced by $\hat{\sigma}^2 = MSE$, and in this case the sampling distribution of
$$\frac{p - E(y)}{\sqrt{\widehat{PV}_m(p)}}$$
is the $t$-distribution with $(n-k)$ degrees of freedom, i.e., $t_{n-k}$. The $100(1-\alpha)\%$ prediction interval for $E(y)$ in this case follows from
$$P\left[-t_{\alpha/2, n-k} \leq \frac{p - E(y)}{\sqrt{\widehat{PV}_m(p)}} \leq t_{\alpha/2, n-k}\right] = 1 - \alpha,$$
which gives the prediction interval for $E(y)$ as
$$\left(p - t_{\alpha/2, n-k}\sqrt{\widehat{PV}_m(p)},\ \ p + t_{\alpha/2, n-k}\sqrt{\widehat{PV}_m(p)}\right).$$

Case 2: Prediction of actual value of $y$

When the predictor $p = Xb$ is used for predicting the actual value of the study variable $y$, its prediction error is given by
$$p - y = Xb - X\beta - \varepsilon = X(b - \beta) - \varepsilon = X(X'X)^{-1}X'\varepsilon - \varepsilon = -\left[I - X(X'X)^{-1}X'\right]\varepsilon = -(I - H)\varepsilon.$$
Thus
$$E(p - y) = 0,$$
which shows that $p$ provides unbiased predictions for the actual values of the study variable.

The predictive variance in this case is
$$PV_a(p) = E\left[(p - y)'(p - y)\right] = E\left[\varepsilon'(I - H)(I - H)\varepsilon\right] = E\left[\varepsilon'(I - H)\varepsilon\right] = \sigma^2\,\mathrm{tr}(I - H) = \sigma^2(n - k).$$
The predictive variance can be estimated by
$$\widehat{PV}_a(p) = \hat{\sigma}^2(n - k),$$
where $\hat{\sigma}^2 = MSE$ is obtained from the analysis of variance based on the OLSE.

Comparing the performance of $p$ in predicting the actual and the average values, we find that $p$ is a better predictor of the average value than of the actual value when
$$PV_m(p) < PV_a(p), \quad \text{i.e., } k < n - k, \quad \text{i.e., } 2k < n,$$
that is, when the total number of observations is more than twice the number of explanatory variables.

Now we obtain the confidence interval for $y$. When $\sigma^2$ is known, then the distribution of
$$\frac{p - y}{\sqrt{PV_a(p)}}$$
is $N(0, 1)$. So the $100(1-\alpha)\%$ prediction interval for $y$ is obtained from
$$P\left[-z_{\alpha/2} \leq \frac{p - y}{\sqrt{PV_a(p)}} \leq z_{\alpha/2}\right] = 1 - \alpha,$$
which gives the prediction interval for $y$ as
$$\left(p - z_{\alpha/2}\sqrt{PV_a(p)},\ \ p + z_{\alpha/2}\sqrt{PV_a(p)}\right).$$
When $\sigma^2$ is unknown, it is replaced by $\hat{\sigma}^2 = MSE$, and in this case the sampling distribution of
$$\frac{p - y}{\sqrt{\widehat{PV}_a(p)}}$$
is the $t$-distribution with $(n-k)$ degrees of freedom, i.e., $t_{n-k}$.

The $100(1-\alpha)\%$ prediction interval for $y$ in this case is obtained from
$$P\left[-t_{\alpha/2, n-k} \leq \frac{p - y}{\sqrt{\widehat{PV}_a(p)}} \leq t_{\alpha/2, n-k}\right] = 1 - \alpha,$$
which gives the prediction interval for $y$ as
$$\left(p - t_{\alpha/2, n-k}\sqrt{\widehat{PV}_a(p)},\ \ p + t_{\alpha/2, n-k}\sqrt{\widehat{PV}_a(p)}\right).$$

Outside sample prediction in multiple linear regression model

Consider the model
$$y = X\beta + \varepsilon \qquad (1)$$
where $y$ is an $n \times 1$ vector of $n$ observations on the study variable, $X$ is an $n \times k$ matrix of explanatory variables and $\varepsilon$ is an $n \times 1$ vector of disturbances following $N(0, \sigma^2 I_n)$.

Further, suppose a set of $n_f$ observations on the same set of $k$ explanatory variables is also available, but the corresponding $n_f$ observations on the study variable are not available. Assuming that this set of observations also follows the same model, we can write
$$y_f = X_f\beta + \varepsilon_f \qquad (2)$$
where $y_f$ is an $n_f \times 1$ vector of future values, $X_f$ is an $n_f \times k$ matrix of known values of the explanatory variables and $\varepsilon_f$ is an $n_f \times 1$ vector of disturbances following $N(0, \sigma^2 I_{n_f})$. It is also assumed that the elements of $\varepsilon$ and $\varepsilon_f$ are independently distributed.

We now consider the prediction of the $y_f$ values for given $X_f$ in model (2). This can be done by estimating the regression coefficients from model (1) based on the $n$ observations and using them to formulate the predictor in model (2). If ordinary least squares estimation is used to estimate $\beta$ in model (1) as
$$b = (X'X)^{-1}X'y,$$
then the corresponding predictor is
$$p_f = X_f b = X_f(X'X)^{-1}X'y.$$

Case 1: Prediction of average value of study variable
When the aim is to predict the average value $E(y_f)$, then the prediction error is
$$p_f - E(y_f) = X_f b - X_f\beta = X_f(b - \beta) = X_f(X'X)^{-1}X'\varepsilon.$$
Then
$$E\left[p_f - E(y_f)\right] = X_f(X'X)^{-1}X'E(\varepsilon) = 0.$$
Thus $p_f$ provides an unbiased prediction for the average value.

The predictive covariance matrix of $p_f$ is
$$Cov_m(p_f) = E\left[\{p_f - E(y_f)\}\{p_f - E(y_f)\}'\right] = E\left[X_f(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}X_f'\right] = \sigma^2 X_f(X'X)^{-1}X_f'.$$
The predictive variance of $p_f$ is
$$PV_m(p_f) = E\left[\{p_f - E(y_f)\}'\{p_f - E(y_f)\}\right] = \mathrm{tr}\left[Cov_m(p_f)\right] = \sigma^2\,\mathrm{tr}\left[(X'X)^{-1}X_f'X_f\right].$$
If $\sigma^2$ is unknown, then replace $\sigma^2$ by $\hat{\sigma}^2 = MSE$ in the expressions of the predictive covariance matrix and the predictive variance; their estimates are
$$\widehat{Cov}_m(p_f) = \hat{\sigma}^2 X_f(X'X)^{-1}X_f', \qquad \widehat{PV}_m(p_f) = \hat{\sigma}^2\,\mathrm{tr}\left[(X'X)^{-1}X_f'X_f\right].$$

Now we obtain the confidence interval for $E(y_f)$. When $\sigma^2$ is known, then the distribution of
$$\frac{p_f - E(y_f)}{\sqrt{PV_m(p_f)}}$$
is $N(0, 1)$. So the $100(1-\alpha)\%$ prediction interval for $E(y_f)$ is obtained from
$$P\left[-z_{\alpha/2} \leq \frac{p_f - E(y_f)}{\sqrt{PV_m(p_f)}} \leq z_{\alpha/2}\right] = 1 - \alpha,$$
which gives the prediction interval for $E(y_f)$ as
$$\left(p_f - z_{\alpha/2}\sqrt{PV_m(p_f)},\ \ p_f + z_{\alpha/2}\sqrt{PV_m(p_f)}\right).$$
When $\sigma^2$ is unknown, it is replaced by $\hat{\sigma}^2 = MSE$, and in this case the sampling distribution of
$$\frac{p_f - E(y_f)}{\sqrt{\widehat{PV}_m(p_f)}}$$
is the $t$-distribution with $(n-k)$ degrees of freedom, i.e., $t_{n-k}$. The $100(1-\alpha)\%$ prediction interval for $E(y_f)$ in this case follows from
$$P\left[-t_{\alpha/2, n-k} \leq \frac{p_f - E(y_f)}{\sqrt{\widehat{PV}_m(p_f)}} \leq t_{\alpha/2, n-k}\right] = 1 - \alpha,$$
which gives the prediction interval for $E(y_f)$ as
$$\left(p_f - t_{\alpha/2, n-k}\sqrt{\widehat{PV}_m(p_f)},\ \ p_f + t_{\alpha/2, n-k}\sqrt{\widehat{PV}_m(p_f)}\right).$$

Case 2: Prediction of actual value of study variable

When $p_f$ is used to predict the actual value $y_f$, then the prediction error is
$$p_f - y_f = X_f b - X_f\beta - \varepsilon_f = X_f(b - \beta) - \varepsilon_f.$$
Then
$$E(p_f - y_f) = X_f E(b - \beta) - E(\varepsilon_f) = 0.$$
Thus $p_f$ provides an unbiased prediction for the actual values.

The predictive covariance matrix of $p_f$ in this case is
$$Cov_a(p_f) = E\left[(p_f - y_f)(p_f - y_f)'\right] = E\left[\{X_f(b - \beta) - \varepsilon_f\}\{(b - \beta)'X_f' - \varepsilon_f'\}\right] = X_f V(b)X_f' + E(\varepsilon_f\varepsilon_f') \quad \left(\text{using } (b - \beta) = (X'X)^{-1}X'\varepsilon\right)$$
$$= \sigma^2\left[X_f(X'X)^{-1}X_f' + I_{n_f}\right].$$

The predictive variance of $p_f$ is
$$PV_a(p_f) = E\left[(p_f - y_f)'(p_f - y_f)\right] = \mathrm{tr}\left[Cov_a(p_f)\right] = \sigma^2\left[\mathrm{tr}\left\{(X'X)^{-1}X_f'X_f\right\} + n_f\right].$$
The estimates of the covariance matrix and the predictive variance can be obtained by replacing $\sigma^2$ by $\hat{\sigma}^2 = MSE$ as
$$\widehat{Cov}_a(p_f) = \hat{\sigma}^2\left[X_f(X'X)^{-1}X_f' + I_{n_f}\right], \qquad \widehat{PV}_a(p_f) = \hat{\sigma}^2\left[\mathrm{tr}\left\{(X'X)^{-1}X_f'X_f\right\} + n_f\right].$$

Now we obtain the confidence interval for $y_f$. When $\sigma^2$ is known, then the distribution of
$$\frac{p_f - y_f}{\sqrt{PV_a(p_f)}}$$
is $N(0, 1)$. So the $100(1-\alpha)\%$ prediction interval is obtained from
$$P\left[-z_{\alpha/2} \leq \frac{p_f - y_f}{\sqrt{PV_a(p_f)}} \leq z_{\alpha/2}\right] = 1 - \alpha,$$
which gives the prediction interval for $y_f$ as
$$\left(p_f - z_{\alpha/2}\sqrt{PV_a(p_f)},\ \ p_f + z_{\alpha/2}\sqrt{PV_a(p_f)}\right).$$
When $\sigma^2$ is unknown, it is replaced by $\hat{\sigma}^2 = MSE$, and in this case the sampling distribution of
$$\frac{p_f - y_f}{\sqrt{\widehat{PV}_a(p_f)}}$$
is the $t$-distribution with $(n-k)$ degrees of freedom, i.e., $t_{n-k}$. The $100(1-\alpha)\%$ prediction interval for $y_f$ in this case follows from
$$P\left[-t_{\alpha/2, n-k} \leq \frac{p_f - y_f}{\sqrt{\widehat{PV}_a(p_f)}} \leq t_{\alpha/2, n-k}\right] = 1 - \alpha,$$
which gives the prediction interval for $y_f$ as
$$\left(p_f - t_{\alpha/2, n-k}\sqrt{\widehat{PV}_a(p_f)},\ \ p_f + t_{\alpha/2, n-k}\sqrt{\widehat{PV}_a(p_f)}\right).$$
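A short Python sketch of the outside-sample quantities, under assumed simulated data for both the estimation sample and the future design matrix (all names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, n_f, k = 50, 5, 3
X  = np.column_stack([np.ones(n),   rng.normal(size=(n,   k - 1))])
Xf = np.column_stack([np.ones(n_f), rng.normal(size=(n_f, k - 1))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
MSE = np.sum((y - X @ b) ** 2) / (n - k)
p_f = Xf @ b                                              # outside-sample predictor

PV_m_hat = MSE * np.trace(XtX_inv @ Xf.T @ Xf)            # estimate of PV_m(p_f)
PV_a_hat = MSE * (np.trace(XtX_inv @ Xf.T @ Xf) + n_f)    # estimate of PV_a(p_f)

t_crit = stats.t.ppf(0.975, n - k)                        # for 95% intervals
print("p_f =", p_f)
print("PV_m_hat =", PV_m_hat, " PV_a_hat =", PV_a_hat)
print("half-widths:", t_crit * np.sqrt(PV_m_hat), t_crit * np.sqrt(PV_a_hat))
```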

Simultaneous prediction of average and actual values of the study variable
The predictions are generally obtained either for the average values of the study variable or for its actual values. In many applications it may not be appropriate to confine attention to only one of the two; in some situations it is more appropriate to predict both simultaneously, i.e., to consider the prediction of the actual and average values of the study variable together. For example, suppose a firm sells fertilizer to users. The interest of the company lies in predicting the average value of yield, which it would like to use to show that the average yield of the crop increases by using its fertilizer. On the other side, the user is not interested in the average value; the user would like to know the actual increase in yield obtained by using the fertilizer. Suppose both the seller and the user go for prediction through regression modelling. Using the classical tools, the statistician can predict either the actual value or the average value, which safeguards the interest of either the user or the seller. Instead, it is desirable to safeguard the interest of both by striking a balance between the objectives of the seller and the user. This can be achieved by combining the predictions of actual and average values, which is done by formulating an objective or target function. Such a target function has to be flexible: it should allow assigning different weights to the two kinds of predictions depending upon their importance in any given application, and it should reduce to the individual predictions leading to actual and average value prediction.

Now we consider the simultaneous prediction in within and outside sample cases.

Simultaneous prediction in within sample prediction

Define a target function
$$\tau = \lambda y + (1 - \lambda)E(y); \quad 0 \leq \lambda \leq 1,$$
which is a convex combination of the actual value $y$ and the average value $E(y)$. The weight $\lambda$ is a constant lying between zero and one whose value reflects the importance being assigned to actual value prediction. Moreover, $\lambda = 0$ gives the average value prediction and $\lambda = 1$ gives the actual value prediction. For example, the value of $\lambda$ in the fertilizer example depends on the rules and regulations of the market, norms of society and other considerations. The value of $\lambda$ is the choice of the practitioner.

Consider the multiple regression model
$$y = X\beta + \varepsilon, \quad E(\varepsilon) = 0, \quad E(\varepsilon\varepsilon') = \sigma^2 I_n.$$
Estimate $\beta$ by ordinary least squares and construct the predictor
$$p = Xb.$$
Now employ this predictor for predicting the actual and average values simultaneously through the target function.
The prediction error is
$$p - \tau = Xb - \lambda y - (1 - \lambda)E(y) = Xb - \lambda(X\beta + \varepsilon) - (1 - \lambda)X\beta = X(b - \beta) - \lambda\varepsilon.$$
Thus
$$E(p - \tau) = XE(b - \beta) - \lambda E(\varepsilon) = 0.$$
So $p$ provides an unbiased prediction for $\tau$.

The variance is
$$Var(p) = E\left[(p - \tau)'(p - \tau)\right] = E\left[\{(b - \beta)'X' - \lambda\varepsilon'\}\{X(b - \beta) - \lambda\varepsilon\}\right]$$
$$= E\left[\varepsilon'X(X'X)^{-1}X'X(X'X)^{-1}X'\varepsilon + \lambda^2\varepsilon'\varepsilon - \lambda(b - \beta)'X'\varepsilon - \lambda\varepsilon'X(b - \beta)\right]$$
$$= E\left[(1 - 2\lambda)\varepsilon'X(X'X)^{-1}X'\varepsilon + \lambda^2\varepsilon'\varepsilon\right]$$
$$= \sigma^2\left[(1 - 2\lambda)\,\mathrm{tr}\left\{(X'X)^{-1}X'X\right\} + \lambda^2\,\mathrm{tr}\,I_n\right]$$
$$= \sigma^2\left[(1 - 2\lambda)k + \lambda^2 n\right].$$
The estimate of the predictive variance can be obtained by replacing $\sigma^2$ by $\hat{\sigma}^2 = MSE$ as
$$\widehat{Var}(p) = \hat{\sigma}^2\left[(1 - 2\lambda)k + \lambda^2 n\right].$$

Simultaneous prediction in outside sample prediction:

Consider the model described earlier under outside sample prediction as
$$y = X\beta + \varepsilon, \quad E(\varepsilon) = 0, \quad V(\varepsilon) = \sigma^2 I_n,$$
$$y_f = X_f\beta + \varepsilon_f, \quad E(\varepsilon_f) = 0, \quad V(\varepsilon_f) = \sigma^2 I_{n_f},$$
where $y$ is $n \times 1$, $X$ is $n \times k$, $\beta$ is $k \times 1$, $y_f$ is $n_f \times 1$ and $X_f$ is $n_f \times k$.

The target function in this case is defined as
$$\tau_f = \lambda y_f + (1 - \lambda)E(y_f); \quad 0 \leq \lambda \leq 1.$$
The predictor based on the OLSE of $\beta$ is
$$p_f = X_f b; \qquad b = (X'X)^{-1}X'y.$$
The predictive error of $p_f$ is
$$p_f - \tau_f = X_f b - \lambda y_f - (1 - \lambda)E(y_f) = X_f b - \lambda(X_f\beta + \varepsilon_f) - (1 - \lambda)X_f\beta = X_f(b - \beta) - \lambda\varepsilon_f.$$

So
$$E(p_f - \tau_f) = X_f E(b - \beta) - \lambda E(\varepsilon_f) = 0.$$
Thus $p_f$ provides an unbiased prediction for $\tau_f$.

The variance of $p_f$ is
$$Var(p_f) = E\left[(p_f - \tau_f)'(p_f - \tau_f)\right] = E\left[\varepsilon'X(X'X)^{-1}X_f'X_f(X'X)^{-1}X'\varepsilon + \lambda^2\varepsilon_f'\varepsilon_f - 2\lambda\varepsilon_f'X_f(X'X)^{-1}X'\varepsilon\right]$$
$$= \sigma^2\left[\mathrm{tr}\left\{X(X'X)^{-1}X_f'X_f(X'X)^{-1}X'\right\} + \lambda^2 n_f\right],$$
assuming that the elements in $\varepsilon$ and $\varepsilon_f$ are mutually independent.

The estimate of the predictive variance can be obtained by replacing $\sigma^2$ by $\hat{\sigma}^2 = MSE$ as
$$\widehat{Var}(p_f) = \hat{\sigma}^2\left[\mathrm{tr}\left\{X(X'X)^{-1}X_f'X_f(X'X)^{-1}X'\right\} + \lambda^2 n_f\right].$$
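A minimal sketch of this estimated variance in Python (data, dimensions and names are illustrative assumptions only). Setting $\lambda = 0$ or $\lambda = 1$ recovers the average-value and actual-value predictive variances respectively:

```python
import numpy as np

def simultaneous_outside_variance(X, y, Xf, lam):
    """MSE * [ tr{ X (X'X)^{-1} Xf' Xf (X'X)^{-1} X' } + lam**2 * n_f ] for
    the target tau_f = lam*y_f + (1 - lam)*E(y_f) predicted by p_f = Xf b."""
    n, k = X.shape
    n_f = Xf.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    MSE = np.sum((y - X @ b) ** 2) / (n - k)
    core = np.trace(X @ XtX_inv @ Xf.T @ Xf @ XtX_inv @ X.T)
    return MSE * (core + lam ** 2 * n_f)

rng = np.random.default_rng(10)
X  = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
Xf = np.column_stack([np.ones(4),  rng.normal(size=(4, 2))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=40)
for lam in (0.0, 0.5, 1.0):        # lam = 0: average value; lam = 1: actual value
    print(lam, simultaneous_outside_variance(X, y, Xf, lam))
```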

Chapter 5
Generalized and Weighted Least Squares Estimation

The usual linear regression model assumes that all the random error components are identically and
independently distributed with constant variance. When this assumption is violated, then ordinary least
squares estimator of the regression coefficient loses its property of minimum variance in the class of linear
and unbiased estimators. The violation of such an assumption can arise in any one of the following situations:
1. The variance of random error components is not constant.
2. The random error components are not independent.
3. The random error components do not have constant variance as well as they are not independent.

In such cases, the covariance matrix of random error components does not remain in the form of an identity
matrix but can be considered as any positive definite matrix. Under such assumption, the OLSE does not
remain efficient as in the case of an identity covariance matrix. The generalized or weighted least squares
method is used in such situations to estimate the parameters of the model.

In this method, the deviation between the observed and expected values of $y_i$ is multiplied by a weight $\omega_i$, where $\omega_i$ is chosen to be inversely proportional to the variance of $y_i$.

For a simple linear regression model, the weighted least squares function is
$$S(\beta_0, \beta_1) = \sum_{i=1}^n \omega_i\left(y_i - \beta_0 - \beta_1 x_i\right)^2.$$
The least squares normal equations are obtained by differentiating $S(\beta_0, \beta_1)$ with respect to $\beta_0$ and $\beta_1$ and equating them to zero:
$$\hat{\beta}_0\sum_{i=1}^n \omega_i + \hat{\beta}_1\sum_{i=1}^n \omega_i x_i = \sum_{i=1}^n \omega_i y_i$$
$$\hat{\beta}_0\sum_{i=1}^n \omega_i x_i + \hat{\beta}_1\sum_{i=1}^n \omega_i x_i^2 = \sum_{i=1}^n \omega_i x_i y_i.$$
The solution of these two normal equations gives the weighted least squares estimates of $\beta_0$ and $\beta_1$.

Generalized least squares estimation
Suppose that in the usual multiple regression model
$$y = X\beta + \varepsilon, \quad E(\varepsilon) = 0, \quad V(\varepsilon) = \sigma^2 I,$$
the assumption $V(\varepsilon) = \sigma^2 I$ is violated and becomes
$$V(\varepsilon) = \sigma^2\Omega,$$
where $\Omega$ is a known $n \times n$ nonsingular, positive definite and symmetric matrix.

This structure of $\Omega$ incorporates both the cases:
- when $\Omega$ is diagonal but with unequal variances, and
- when $\Omega$ is not necessarily diagonal; due to the presence of correlated errors, some of the off-diagonal elements are nonzero.

The OLSE of $\beta$ is
$$b = (X'X)^{-1}X'y.$$
In such cases, the OLSE gives an unbiased estimate but has more variability, as
$$E(b) = (X'X)^{-1}X'E(y) = (X'X)^{-1}X'X\beta = \beta,$$
$$V(b) = (X'X)^{-1}X'V(y)X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1}.$$
Now we attempt to find a better estimator as follows.

Since $\Omega$ is positive definite and symmetric, there exists a nonsingular matrix $K$ such that
$$KK' = \Omega.$$
Then, in the model
$$y = X\beta + \varepsilon,$$
premultiplying by $K^{-1}$ gives
$$K^{-1}y = K^{-1}X\beta + K^{-1}\varepsilon \quad \text{or} \quad z = B\beta + g,$$
where $z = K^{-1}y$, $B = K^{-1}X$, $g = K^{-1}\varepsilon$. Now observe that
$$E(g) = K^{-1}E(\varepsilon) = 0$$
and

$$V(g) = E\left[\{g - E(g)\}\{g - E(g)\}'\right] = E(gg') = E\left[K^{-1}\varepsilon\varepsilon'K'^{-1}\right] = K^{-1}E(\varepsilon\varepsilon')K'^{-1} = \sigma^2 K^{-1}\Omega K'^{-1} = \sigma^2 K^{-1}KK'K'^{-1} = \sigma^2 I.$$
Thus the elements of $g$ have mean 0 and are uncorrelated.

So either minimize
$$S(\beta) = g'g = \varepsilon'\Omega^{-1}\varepsilon = (y - X\beta)'\Omega^{-1}(y - X\beta)$$
and get the normal equations
$$(X'\Omega^{-1}X)\hat{\beta} = X'\Omega^{-1}y \quad \text{or} \quad \hat{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y,$$
or, alternatively, apply OLS to the transformed model and obtain the OLSE of $\beta$ as
$$\hat{\beta} = (B'B)^{-1}B'z = (X'K'^{-1}K^{-1}X)^{-1}X'K'^{-1}K^{-1}y = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y.$$
This is termed the generalized least squares estimator (GLSE) of $\beta$.

The estimation error of the GLSE is
$$\hat{\beta} = (B'B)^{-1}B'(B\beta + g) = \beta + (B'B)^{-1}B'g \quad \text{or} \quad \hat{\beta} - \beta = (B'B)^{-1}B'g.$$
Then
$$E(\hat{\beta} - \beta) = (B'B)^{-1}B'E(g) = 0,$$
which shows that the GLSE is an unbiased estimator of $\beta$. The covariance matrix of the GLSE is given by
$$V(\hat{\beta}) = E\left[\{\hat{\beta} - E(\hat{\beta})\}\{\hat{\beta} - E(\hat{\beta})\}'\right] = E\left[(B'B)^{-1}B'gg'B(B'B)^{-1}\right] = (B'B)^{-1}B'E(gg')B(B'B)^{-1}.$$


Since
$$E(gg') = K^{-1}E(\varepsilon\varepsilon')K'^{-1} = \sigma^2 K^{-1}\Omega K'^{-1} = \sigma^2 K^{-1}KK'K'^{-1} = \sigma^2 I,$$
so
$$V(\hat{\beta}) = \sigma^2(B'B)^{-1}B'B(B'B)^{-1} = \sigma^2(B'B)^{-1} = \sigma^2(X'K'^{-1}K^{-1}X)^{-1} = \sigma^2(X'\Omega^{-1}X)^{-1}.$$
Now we prove that the GLSE is the best linear unbiased estimator of $\beta$.

The Gauss–Markov theorem for the case $Var(\varepsilon) = \Omega$

The Gauss–Markov theorem establishes that the generalized least squares (GLS) estimator of $\beta$, given by $\hat{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$, is BLUE (best linear unbiased estimator). By "best", we mean that $\hat{\beta}$ minimizes the variance of any linear combination of the estimated coefficients, $\ell'\hat{\beta}$. We note that
$$E(\hat{\beta}) = E\left[(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y\right] = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}E(y) = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}X\beta = \beta.$$
Thus $\hat{\beta}$ is an unbiased estimator of $\beta$.

The covariance matrix of $\hat{\beta}$ is given by
$$V(\hat{\beta}) = \left[(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\right]V(y)\left[(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\right]' = \left[(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\right]\Omega\left[\Omega^{-1}X(X'\Omega^{-1}X)^{-1}\right] = (X'\Omega^{-1}X)^{-1}.$$
Thus,
$$Var(\ell'\hat{\beta}) = \ell'\,Var(\hat{\beta})\,\ell = \ell'(X'\Omega^{-1}X)^{-1}\ell.$$

Let $\tilde{\beta}$ be another unbiased estimator of $\beta$ that is a linear combination of the data. Our goal, then, is to show that $Var(\ell'\tilde{\beta}) \geq \ell'(X'\Omega^{-1}X)^{-1}\ell$, with at least one $\ell$ such that $Var(\ell'\tilde{\beta}) > \ell'(X'\Omega^{-1}X)^{-1}\ell$. We first note that any other estimator of $\beta$ that is a linear combination of the data can be written as
$$\tilde{\beta} = \left[(X'\Omega^{-1}X)^{-1}X'\Omega^{-1} + B\right]y + b_0^*,$$
where $B$ is a $p \times n$ matrix and $b_0^*$ is a $p \times 1$ vector of constants that appropriately adjust the GLS estimator to form the alternative estimator. Then
$$E(\tilde{\beta}) = E\left[\{(X'\Omega^{-1}X)^{-1}X'\Omega^{-1} + B\}y + b_0^*\right] = \left[(X'\Omega^{-1}X)^{-1}X'\Omega^{-1} + B\right]X\beta + b_0^* = \beta + BX\beta + b_0^*.$$
Consequently, $\tilde{\beta}$ is unbiased if and only if both $b_0^* = 0$ and $BX = 0$. The covariance matrix of $\tilde{\beta}$ is
$$V(\tilde{\beta}) = Var\left[\{(X'\Omega^{-1}X)^{-1}X'\Omega^{-1} + B\}y\right] = \left[(X'\Omega^{-1}X)^{-1}X'\Omega^{-1} + B\right]\Omega\left[(X'\Omega^{-1}X)^{-1}X'\Omega^{-1} + B\right]' = (X'\Omega^{-1}X)^{-1} + B\Omega B',$$
because $BX = 0$, which implies that $(BX)' = X'B' = 0$. Then
$$Var(\ell'\tilde{\beta}) = \ell'V(\tilde{\beta})\ell = \ell'\left[(X'\Omega^{-1}X)^{-1} + B\Omega B'\right]\ell = \ell'(X'\Omega^{-1}X)^{-1}\ell + \ell'B\Omega B'\ell = Var(\ell'\hat{\beta}) + \ell'B\Omega B'\ell.$$
We note that $\Omega$ is a positive definite matrix. Consequently, there exists some nonsingular matrix $K$ such that $\Omega = K'K$. As a result, $B\Omega B' = BK'KB'$ is at least a positive semidefinite matrix; hence, $\ell'B\Omega B'\ell \geq 0$. Next note that we can define $\ell^* = KB'\ell$. As a result,
$$\ell'B\Omega B'\ell = \ell^{*\prime}\ell^* = \sum_i \ell_i^{*2},$$
which must be strictly greater than 0 for some $\ell \neq 0$ unless $B = 0$. Thus, the GLS estimator of $\beta$ is the best linear unbiased estimator.
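A minimal sketch (simulated heteroscedastic data of my own choosing) verifying numerically that the direct GLS formula and the transformed-model OLS route coincide:

```python
import numpy as np

rng = np.random.default_rng(11)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -1.0])

# A hypothetical Omega: heteroscedastic, uncorrelated errors
omega_diag = rng.uniform(0.5, 4.0, size=n)
Omega = np.diag(omega_diag)
y = X @ beta + rng.normal(size=n) * np.sqrt(omega_diag)

# Direct GLS formula: beta_hat = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
Omega_inv = np.diag(1.0 / omega_diag)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# Equivalent route: transform with K^{-1} where Omega = K K', then run OLS on (z, B)
K = np.linalg.cholesky(Omega)                  # Omega = K K'
K_inv = np.linalg.inv(K)
z, B = K_inv @ y, K_inv @ X
beta_ols_transformed = np.linalg.solve(B.T @ B, B.T @ z)

print(np.allclose(beta_gls, beta_ols_transformed))  # True: the two routes coincide
```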
Weighted least squares estimation
When the $\varepsilon$'s are uncorrelated and have unequal variances, then
$$V(\varepsilon) = \sigma^2\Omega = \sigma^2\begin{pmatrix} \dfrac{1}{\omega_1} & 0 & \cdots & 0 \\ 0 & \dfrac{1}{\omega_2} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \dfrac{1}{\omega_n} \end{pmatrix}.$$
The estimation procedure is usually called weighted least squares.

Let $W = \Omega^{-1}$; then the weighted least squares estimator of $\beta$ is obtained by solving the normal equations
$$(X'WX)\hat{\beta} = X'Wy,$$
which gives
$$\hat{\beta} = (X'WX)^{-1}X'Wy,$$
where $\omega_1, \omega_2, \ldots, \omega_n$ are called the weights.

The observations with large variances usually have smaller weights than the observations with small variances.
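A brief Python sketch of weighted least squares under an assumed heteroscedastic setup (the data-generating choice of $\sigma_i \propto x_i$ is purely illustrative):

```python
import numpy as np

def weighted_least_squares(X, y, weights):
    """WLS estimator beta_hat = (X' W X)^{-1} X' W y with W = diag(weights)."""
    W = np.diag(weights)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

rng = np.random.default_rng(12)
n = 50
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])
sigma_i = 0.5 * x                               # error standard deviation grows with x
y = 1.0 + 2.0 * x + rng.normal(size=n) * sigma_i

weights = 1.0 / sigma_i ** 2                    # weights inversely proportional to Var(y_i)
print(weighted_least_squares(X, y, weights))
```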

Chapter 6
Regression Analysis Under Linear Restrictions and Preliminary Test Estimation

One of the basic objectives in any statistical modeling is to find good estimators of the parameters. In the
context of the multiple linear regression model $y = X\beta + \varepsilon$, the ordinary least squares estimator $b = (X'X)^{-1}X'y$ is the best linear unbiased estimator of $\beta$. Several approaches have been attempted in the

literature to improve further the OLSE. One approach to improve the estimators is the use of extraneous
information or prior information. In applied work, such prior information may be available about the
regression coefficients. For example, in economics, the constant returns to scale imply that the exponents
in a Cobb-Douglas production function should sum to unity. In another example, absence of money illusion
on the part of consumers implies that the sum of money income and price elasticities in a demand function
should be zero. These types of constraints or the prior information may be available from
(i) some theoretical considerations.
(ii) past experience of the experimenter.
(iii) empirical investigations.
(iv) some extraneous sources etc.

To utilize such information in improving the estimation of regression coefficients, it can be expressed in the
form of
(i) exact linear restrictions
(ii) stochastic linear restrictions
(iii) inequality restrictions.

We consider the use of prior information in the form of exact and stochastic linear restrictions in the model
$y = X\beta + \varepsilon$, where $y$ is an $(n \times 1)$ vector of observations on the study variable, $X$ is an $(n \times k)$ matrix of observations on the explanatory variables $X_1, X_2, \ldots, X_k$, $\beta$ is a $(k \times 1)$ vector of regression coefficients and $\varepsilon$ is an $(n \times 1)$ vector of disturbance terms.

Exact linear restrictions:
Suppose the prior information binding the regression coefficients is available from some extraneous sources and can be expressed in the form of exact linear restrictions as
$$r = R\beta,$$
where $r$ is a $(q \times 1)$ vector and $R$ is a $(q \times k)$ matrix with $\mathrm{rank}(R) = q$ $(q < k)$. The elements in $r$ and $R$ are known.

Some examples of exact linear restrictions $r = R\beta$ are as follows:

(i) If there are two restrictions with $k = 6$, like
$$\beta_2 = \beta_4, \qquad \beta_3 + 2\beta_4 + \beta_5 = 1,$$
then
$$r = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \qquad R = \begin{pmatrix} 0 & 1 & 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 2 & 1 & 0 \end{pmatrix}.$$

(ii) If $k = 3$ and suppose $\beta_2 = 3$, then
$$r = [3], \qquad R = [0 \ \ 1 \ \ 0].$$

(iii) If $k = 3$ and suppose $\beta_1 : \beta_2 : \beta_3 :: ab : b : 1$, then
$$r = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \qquad R = \begin{pmatrix} 1 & -a & 0 \\ 0 & 1 & -b \\ 1 & 0 & -ab \end{pmatrix}.$$

The ordinary least squares estimator $b = (X'X)^{-1}X'y$ does not use the prior information. It does not obey the restrictions in the sense that $r \neq Rb$. So the issue is how to use the sample information and the prior information together to find an improved estimator of $\beta$.

Restricted least squares estimation
The restricted least squares estimation method enables the use of sample information and prior information simultaneously. In this method, choose $\beta$ such that the error sum of squares is minimized subject to the linear restrictions $r = R\beta$. This can be achieved using the Lagrangian multiplier technique. Define the Lagrangian function
$$S(\beta, \lambda) = (y - X\beta)'(y - X\beta) - 2\lambda'(R\beta - r),$$
where $\lambda$ is a $(q \times 1)$ vector of Lagrangian multipliers.

Using the result that if $a$ and $b$ are vectors and $A$ is a suitably defined matrix, then
$$\frac{\partial\, a'Aa}{\partial a} = (A + A')a, \qquad \frac{\partial\, a'b}{\partial a} = b,$$
we have
$$\frac{\partial S(\beta, \lambda)}{\partial \beta} = 2X'X\beta - 2X'y - 2R'\lambda = 0 \qquad (*)$$
$$\frac{\partial S(\beta, \lambda)}{\partial \lambda} = R\beta - r = 0.$$
Pre-multiplying equation (*) by $R(X'X)^{-1}$, we have
$$2R\beta - 2R(X'X)^{-1}X'y - 2R(X'X)^{-1}R'\lambda = 0$$
or
$$R\beta - Rb - R(X'X)^{-1}R'\lambda = 0$$
$$\Rightarrow \lambda = -\left[R(X'X)^{-1}R'\right]^{-1}(Rb - r),$$
using $R\beta = r$ and the fact that $R(X'X)^{-1}R' > 0$.

Substituting $\lambda$ in equation (*), we get
$$2X'X\beta - 2X'y + 2R'\left[R(X'X)^{-1}R'\right]^{-1}(Rb - r) = 0$$
or
$$X'X\beta = X'y - R'\left[R(X'X)^{-1}R'\right]^{-1}(Rb - r).$$
Pre-multiplying by $(X'X)^{-1}$ yields
$$\hat{\beta}_R = (X'X)^{-1}X'y + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(r - Rb) = b - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(Rb - r).$$
This estimator is termed the restricted regression estimator of $\beta$.
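A short Python sketch of the restricted estimator (the restriction used here, $\beta_2 + \beta_3 = 1$, and the simulated data are illustrative assumptions, echoing the constant-returns-to-scale type of prior information mentioned above):

```python
import numpy as np

def restricted_ols(X, y, R, r):
    """beta_R = b - (X'X)^{-1} R' [R (X'X)^{-1} R']^{-1} (R b - r)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    A = R @ XtX_inv @ R.T
    return b - XtX_inv @ R.T @ np.linalg.solve(A, R @ b - r)

rng = np.random.default_rng(13)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.6, 0.4]) + rng.normal(size=n)

# Hypothetical exact restriction: beta_2 + beta_3 = 1
R = np.array([[0.0, 1.0, 1.0]])
r = np.array([1.0])
beta_R = restricted_ols(X, y, R, r)
print(beta_R, R @ beta_R)                       # the restriction holds exactly
```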
Properties of restricted regression estimator
1. The restricted regression estimator $\hat{\beta}_R$ obeys the exact restrictions, i.e., $r = R\hat{\beta}_R$. To verify this, consider
$$R\hat{\beta}_R = R\left[b + (X'X)^{-1}R'\left\{R(X'X)^{-1}R'\right\}^{-1}(r - Rb)\right] = Rb + r - Rb = r.$$

2. Unbiasedness
The estimation error of $\hat{\beta}_R$ is
$$\hat{\beta}_R - \beta = (b - \beta) + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(R\beta - Rb) = \left[I - (X'X)^{-1}R'\left\{R(X'X)^{-1}R'\right\}^{-1}R\right](b - \beta) = D(b - \beta),$$
where
$$D = I - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R.$$
Thus
$$E(\hat{\beta}_R - \beta) = D\,E(b - \beta) = 0,$$
implying that $\hat{\beta}_R$ is an unbiased estimator of $\beta$.

3. Covariance matrix
The covariance matrix of $\hat{\beta}_R$ is
$$V(\hat{\beta}_R) = E\left[(\hat{\beta}_R - \beta)(\hat{\beta}_R - \beta)'\right] = D\,E\left[(b - \beta)(b - \beta)'\right]D' = D\,V(b)\,D' = \sigma^2 D(X'X)^{-1}D'$$
$$= \sigma^2(X'X)^{-1} - \sigma^2(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1},$$
which can be obtained as follows:

Consider
$$D(X'X)^{-1} = (X'X)^{-1} - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1},$$
so that
$$D(X'X)^{-1}D' = \left[(X'X)^{-1} - (X'X)^{-1}R'\left\{R(X'X)^{-1}R'\right\}^{-1}R(X'X)^{-1}\right]\left[I - R'\left\{R(X'X)^{-1}R'\right\}^{-1}R(X'X)^{-1}\right]$$
$$= (X'X)^{-1} - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1} - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}$$
$$\quad + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}$$
$$= (X'X)^{-1} - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}.$$

Maximum likelihood estimation under exact restrictions:


Assuming ε ~ N (0, σ 2 I ) , the maximum likelihood estimator of β and σ 2 can also be derived such that it
follows r = Rβ . The Lagrangian function as per the maximum likelihood procedure can be written as
L(β, σ², λ) = (1/(2πσ²))^{n/2} exp[−(y − Xβ)'(y − Xβ)/(2σ²)] − λ'(Rβ − r)

where λ is a (q × 1) vector of Lagrangian multipliers. The normal equations are obtained by partially differentiating the log-likelihood function with respect to β, σ² and λ and equating them to zero:

∂ln L(β, σ², λ)/∂β = (1/σ²)(X'y − X'Xβ) + R'λ = 0                        (1)
∂ln L(β, σ², λ)/∂λ = Rβ − r = 0                                          (2)
∂ln L(β, σ², λ)/∂σ² = −n/(2σ²) + (y − Xβ)'(y − Xβ)/(2σ⁴) = 0.            (3)
Let βR , σ R2 and λ denote the maximum likelihood estimators of β , σ 2 and λ respectively which are
obtained by solving equations (1), (2) and (3) as follows:

From equation (1), we get the optimal λ as

λ̃ = [R(X'X)^{-1}R']^{-1}(r − Rβ̃)/σ̃²_R.

Substituting λ̃ in equation (1) gives

β̃_R = β̃ + (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(r − Rβ̃)
where β = ( X ' X ) X ' y is the maximum likelihood estimator of β without restrictions. From equation (3),
−1

we get

σ 2
=
( y − X β ) ' ( y − X β )
.
R
n
The Hessian matrix of second order partial derivatives of β and σ 2 is positive definite at

=β β=
R and σ
2
σ R2 .
The restricted least squares and restricted maximum likelihood estimators of β are the same, whereas they are different for σ².

Test of hypothesis
It is important to test the hypothesis
H 0 : r = Rβ
H1 : r ≠ R β
before using it in the estimation procedure.

The construction of the test statistic for this hypothesis is detailed in the module on the multiple linear regression model. The resulting test statistic is

F = [ (r − Rb)'[R(X'X)^{-1}R']^{-1}(r − Rb) / q ] / [ (y − Xb)'(y − Xb) / (n − k) ]

which follows an F-distribution with q and (n − k) degrees of freedom under H0. The decision rule is to reject H0 at the α level of significance whenever

F ≥ F_{1−α}(q, n − k).
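A minimal sketch of this test in Python, assuming the design matrix X, response y, restriction matrix R and vector r are supplied by the user (all names here are hypothetical), and relying on scipy.stats for the critical value:

```python
import numpy as np
from scipy import stats

def f_test_linear_restriction(X, y, R, r, alpha=0.05):
    """F test of H0: R beta = r in y = X beta + eps (sketch)."""
    n, k = X.shape
    q = R.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    diff = r - R @ b
    num = diff @ np.linalg.solve(R @ XtX_inv @ R.T, diff) / q
    resid = y - X @ b
    den = resid @ resid / (n - k)
    F = num / den
    c = stats.f.ppf(1 - alpha, q, n - k)   # F_{1-alpha}(q, n-k)
    return F, c, F >= c                    # reject H0 when F >= c
```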

Stochastic linear restrictions:
The exact linear restrictions assume that there is no randomness involved in the auxiliary or prior
information. This assumption may not hold true in many practical situations and some randomness may be
present. The prior information in such cases can be formulated as
r = Rβ + V

where r is a ( q ×1) vector, R is a ( q × k ) matrix and V is a ( q ×1) vector of random errors. The elements

in r and R are known. The term V reflects the randomness involved in the prior information r = Rβ .
Assume
E (V ) = 0
E (VV ') = ψ
E ( ε V ') = 0.

where ψ is a known (q × q) positive definite matrix and ε is the disturbance term in the multiple regression model y = Xβ + ε. Note that E(r) = Rβ.

The possible reasons for such stochastic linear restriction are as follows:
(i) Stochastic linear restrictions arise when an estimate of a coefficient is available from earlier studies together with its standard error, which reflects the uncertainty in that estimate. For example, in repetitive studies, the surveys are conducted every year. Suppose the regression coefficient β1 remains stable for several years and its estimate is provided along with its standard error, say the value remains stable around 0.5 with standard error 2. This information can be expressed as

r = β1 + V1,

where r = 0.5, E(V1) = 0, E(V1²) = 2².

Now ψ can be formulated with this data. It is not necessary to have such information for all the regression coefficients; we may have information on only some of them.

(ii) Sometimes the restrictions are in the form of inequality. Such restrictions may arise from
theoretical considerations. For example, the value of a regression coefficient may lie between 3
and 5, i.e., 3 ≤ β1 ≤ 5, say. In another example, consider a simple linear regression model

y =β 0 + β1 x + ε
where y denotes the consumption expenditure on food and x denotes the income. Then the marginal propensity (tendency) to consume is

dy/dx = β1,

i.e., if the salary increases by one rupee, then one is expected to spend β1 of that rupee on food and save the remaining (1 − β1). A natural bound on β1 is that one can neither spend more than the whole rupee nor less than nothing out of it, so 0 < β1 < 1. This is a natural restriction arising from theoretical considerations.

These bounds can be treated as p-sigma limits, say 2-sigma limits or confidence limits. Thus

μ − 2σ = 0
μ + 2σ = 1
⇒ μ = 1/2,  σ = 1/4.

These values can be interpreted as

1/2 = β1 + V1,  E(V1) = 0,  E(V1²) = 1/16.
(iii) Sometimes the truthfulness of exact linear restriction r = Rβ can be suspected and accordingly
an element of uncertainty can be introduced. For example, one may say that 95% of the
restrictions hold. So some element of uncertainty prevails.

Pure and mixed regression estimation:


Consider the multiple regression model

y = Xβ + ε

with n observations and k explanatory variables X1, X2, ..., Xk. The ordinary least squares estimator of β is

b = (X'X)^{-1}X'y

which is termed as the pure estimator. The pure estimator b does not satisfy the restriction r = Rβ + V. So the objective is to obtain an estimate of β by utilizing the stochastic restrictions such that the resulting estimator

satisfies the stochastic restrictions also. In order to avoid the conflict between prior information and sample
information, we can combine them as follows:

Write

y = Xβ + ε,   E(ε) = 0,  E(εε') = σ²I_n
r = Rβ + V,   E(V) = 0,  E(VV') = ψ,  E(εV') = 0

jointly as

(y)   (X)       (ε)
(r) = (R) β  +  (V)

or  a = Aβ + w

where a = (y', r')', A = (X', R')', w = (ε', V')'.

Note that

E(w) = (E(ε)', E(V)')' = 0

Ω = E(ww') = ( E(εε')  E(εV') )   ( σ²I_n   0 )
             ( E(Vε')  E(VV') ) = (   0     ψ ).

This shows that the disturbances w are non-spherical or heteroskedastic. So the application of generalized least squares estimation will yield a more efficient estimator than ordinary least squares estimation. Applying generalized least squares to the model

a = Aβ + w,   E(w) = 0,  V(w) = Ω,

the generalized least squares estimator of β is given by

β̂_M = (A'Ω^{-1}A)^{-1}A'Ω^{-1}a.

The explicit form of this estimator is obtained as follows:

Since Ω^{-1} = diag((1/σ²)I_n, Ψ^{-1}),

A'Ω^{-1}a = (1/σ²)X'y + R'Ψ^{-1}r
A'Ω^{-1}A = (1/σ²)X'X + R'Ψ^{-1}R.

Thus

β̂_M = [(1/σ²)X'X + R'Ψ^{-1}R]^{-1}[(1/σ²)X'y + R'Ψ^{-1}r]

assuming σ² to be known. This is termed as the mixed regression estimator.


If σ² is unknown, then σ² can be replaced by its estimator σ̂² = s² = (y − Xb)'(y − Xb)/(n − k), and the feasible mixed regression estimator of β is obtained as

β̂_f = [(1/s²)X'X + R'Ψ^{-1}R]^{-1}[(1/s²)X'y + R'Ψ^{-1}r].

This is also termed as the estimated or operationalized generalized least squares estimator.
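The feasible mixed regression estimator translates almost directly into code. A minimal Python sketch, assuming the user supplies X, y, the stochastic restrictions (R, r) and the prior covariance Psi (all hypothetical inputs):

```python
import numpy as np

def mixed_regression(X, y, R, r, Psi):
    """Feasible mixed regression estimator (sketch).

    Combines sample information (X, y) with stochastic prior information
    r = R beta + V, E(VV') = Psi, replacing sigma^2 by s^2.
    """
    n, k = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)        # pure (OLS) estimator
    s2 = (y - X @ b) @ (y - X @ b) / (n - k)     # s^2 = RSS / (n - k)
    Psi_inv = np.linalg.inv(Psi)
    lhs = X.T @ X / s2 + R.T @ Psi_inv @ R
    rhs = X.T @ y / s2 + R.T @ Psi_inv @ r
    return np.linalg.solve(lhs, rhs)
```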

Properties of mixed regression estimator:


(i) Unbiasedness:
The estimation error of β̂_M is

β̂_M − β = (A'Ω^{-1}A)^{-1}A'Ω^{-1}a − β
        = (A'Ω^{-1}A)^{-1}A'Ω^{-1}(Aβ + w) − β
        = (A'Ω^{-1}A)^{-1}A'Ω^{-1}w.

Hence

E(β̂_M − β) = (A'Ω^{-1}A)^{-1}A'Ω^{-1}E(w) = 0.

So the mixed regression estimator provides an unbiased estimator of β. Note that the pure regression estimator b = (X'X)^{-1}X'y is also an unbiased estimator of β.

(ii) Covariance matrix
The covariance matrix of β̂_M is

V(β̂_M) = E[(β̂_M − β)(β̂_M − β)']
        = (A'Ω^{-1}A)^{-1}A'Ω^{-1}E(ww')Ω^{-1}A(A'Ω^{-1}A)^{-1}
        = (A'Ω^{-1}A)^{-1}
        = [(1/σ²)X'X + R'Ψ^{-1}R]^{-1}.

(iii) The estimator β̂_M satisfies the stochastic linear restrictions in the sense that, writing r = Rβ̂_M + V,

E(r) = R E(β̂_M) + E(V)
     = Rβ + 0
     = Rβ.

(iv) Comparison with OLSE

We first state a result that is used further to establish the dominance of β̂_M over b.

Result: For positive definite matrices A1 and A2, the difference (A1 − A2) is positive definite if (A2^{-1} − A1^{-1}) is positive definite.

Let

A1 ≡ V(b) = σ²(X'X)^{-1}
A2 ≡ V(β̂_M) = [(1/σ²)X'X + R'Ψ^{-1}R]^{-1};

then

A2^{-1} − A1^{-1} = (1/σ²)X'X + R'Ψ^{-1}R − (1/σ²)X'X = R'Ψ^{-1}R

which is a positive definite matrix. This implies that

A1 − A2 = V(b) − V(β̂_M)

is a positive definite matrix. Thus β̂_M is more efficient than b under the criterion of covariance matrices, or Loewner ordering, provided σ² is known.


Testing of hypothesis:
In the prior information specified by the stochastic restriction r = Rβ + V, we want to test whether there is a close relation between the sample information and the prior information. The compatibility of sample and prior information is tested by the χ²-test statistic given by

χ² = (1/σ²)(r − Rb)'[R(X'X)^{-1}R' + Ψ]^{-1}(r − Rb)

assuming σ² is known and b = (X'X)^{-1}X'y. This follows a χ²-distribution with q degrees of freedom.

If Ψ = 0, then the distribution is degenerate and hence r becomes a fixed quantity. For the feasible version of the mixed regression estimator

β̂_f = [(1/s²)X'X + R'Ψ^{-1}R]^{-1}[(1/s²)X'y + R'Ψ^{-1}r],

the optimal properties of the mixed regression estimator like linearity, unbiasedness and/or minimum variance do not remain valid. So there can be situations where the incorporation of prior information leads to a loss in efficiency, which is not a favourable situation. Under such situations, the pure regression estimator is better to use. In order to know whether the use of prior information will lead to a better estimator or not, the null hypothesis H0: E(r) = Rβ can be tested.

For testing the null hypothesis

H0: E(r) = Rβ

when σ² is unknown, we use the F-statistic given by

F = [ (r − Rb)'{R(X'X)^{-1}R' + Ψ}^{-1}(r − Rb) / q ] / s²

where s² = (y − Xb)'(y − Xb)/(n − k), and F follows an F-distribution with q and (n − k) degrees of freedom under H0.

Inequality Restrictions
Sometimes the restriction on the regression parameters, or equivalently the prior information about the regression parameters, is available in the form of inequalities, for example, βi ≥ 0 or ai ≤ βi ≤ bi, etc. Suppose such information is expressible in the form of inequality constraints on β in the model y = Xβ + ε. We want to estimate the regression coefficient β subject to these constraints.

One can minimize the error sum of squares (y − Xβ)'(y − Xβ) subject to the inequality constraints to obtain an estimator of β. This can be formulated as a quadratic programming problem and solved using an appropriate algorithm, e.g., a simplex-type algorithm, to obtain a numerical solution; a sketch is given below. The advantage of this procedure is that a solution is found that fulfills the constraints. The disadvantage is that the statistical properties of the estimates are not easily determined and no general conclusions about superiority can be made.
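As a hedged illustration of this constrained-optimization route, the sketch below uses scipy.optimize.lsq_linear, which handles the common special case of box constraints a_i ≤ βi ≤ b_i; general linear inequality constraints would require a full quadratic programming solver. The data and bounds are hypothetical.

```python
import numpy as np
from scipy.optimize import lsq_linear

# Hypothetical data with box constraints 0 <= beta_i <= 1
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = X @ np.array([0.3, 0.8]) + rng.normal(scale=0.2, size=40)

# Minimizes ||X beta - y||^2 subject to 0 <= beta <= 1
res = lsq_linear(X, y, bounds=(0.0, 1.0))
print("constrained estimate:", res.x)
```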

Another option to obtain an estimator of β subject to inequality constraints is to convert the inequality constraints into stochastic linear restrictions, e.g., by treating them as 2-sigma limits as illustrated earlier, and then use the framework of mixed regression estimation.

The minimax estimation can also be used to obtain an estimator of β under inequality constraints. The minimax estimation is based on the idea that the quadratic risk function for the estimate is minimized not over the entire parameter space but only over an area that is restricted by the prior knowledge or restrictions on the estimate.

If all the restrictions define a convex area, this area can be enclosed in an ellipsoid of the form

B(β) = {β : β'Tβ ≤ k}

with the origin as centre point, or in

B(β, β0) = {β : (β − β0)'T(β − β0) ≤ k}

with the centre point vector β0, where k is a given constant and T is a known matrix which is assumed to be positive definite. Here B defines a concentration ellipsoid.

First we consider an example to understand how the inequality constraints are framed. Suppose it is known a priori that

ai ≤ βi ≤ bi   (i = 1, 2, ..., p)

where ai and bi are known and may include −∞ and +∞ respectively. These restrictions can be written as

| βi − (ai + bi)/2 | / [ (bi − ai)/2 ] ≤ 1,   i = 1, 2, ..., p.

Now we want to construct a concentration ellipsoid (β − β0)'T(β − β0) = 1 which encloses the cuboid and fulfills the following conditions:

(i) The ellipsoid and the cuboid have the same centre point, β0 = ½(a1 + b1, ..., ap + bp)'.

(ii) The axes of the ellipsoid are parallel to the coordinate axes, that is, T = diag(t1, ..., tp).

(iii) The corner points of the cuboid are on the surface of the ellipsoid, which means

Σ_{i=1}^{p} ((ai − bi)/2)² ti = 1.

(iv) The ellipsoid has minimal volume:

V = c_p Π_{i=1}^{p} ti^{-1/2},

with c_p being a constant depending on the dimension p.

We now include the linear restriction (iii) for the ti by means of a Lagrangian multiplier and solve (dropping the constant c_p and using Π ti^{-1} in place of Π ti^{-1/2}, which does not change the minimizing ti)

min_{ti} V = min_{ti} { Π_{i=1}^{p} ti^{-1} − λ [ Σ_{i=1}^{p} ((ai − bi)/2)² ti − 1 ] }.

The normal equations are then obtained as

∂V/∂tj = −tj^{-2} Π_{i≠j} ti^{-1} − λ ((aj − bj)/2)² = 0

and

∂V/∂λ = Σ_{i=1}^{p} ((ai − bi)/2)² ti − 1 = 0.

From ∂V/∂tj = 0, we get

λ = −tj^{-2} Π_{i≠j} ti^{-1} (2/(aj − bj))²      (for all j = 1, 2, ..., p)
  = −tj^{-1} Π_{i=1}^{p} ti^{-1} (2/(aj − bj))²,

and for any two indices i and j we obtain

ti ((ai − bi)/2)² = tj ((aj − bj)/2)²,

and hence summation according to ∂V/∂λ = 0 gives

Σ_{i=1}^{p} ((ai − bi)/2)² ti = p tj ((aj − bj)/2)² = 1.

This leads to the required diagonal elements of T:

tj = (4/p)(aj − bj)^{-2}   (j = 1, 2, ..., p).
Hence, the optimal ellipsoid (β − β0)'T(β − β0) = 1, which contains the cuboid, has the centre point vector

β0' = ½(a1 + b1, ..., ap + bp)

and the following matrix, which is positive definite for finite limits ai, bi:

T = (4/p) diag((b1 − a1)^{-2}, ..., (bp − ap)^{-2}).

Interpretation: The ellipsoid has a larger volume than the cuboid. Hence, the transition to an ellipsoid as a priori information represents a weakening, but comes with an easier mathematical handling.
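The small Python sketch below builds β0 and T from hypothetical bounds ai, bi using the formulas just derived and verifies that a corner of the cuboid lies on the ellipsoid.

```python
import numpy as np

a = np.array([0.0, -1.0, 2.0])      # hypothetical lower limits
b = np.array([1.0,  1.0, 5.0])      # hypothetical upper limits
p = len(a)

beta0 = (a + b) / 2                            # common centre of cuboid and ellipsoid
T = np.diag(4.0 / (p * (b - a) ** 2))          # T = (4/p) diag((b_i - a_i)^{-2})

corner = b                                     # any corner of the cuboid
val = (corner - beta0) @ T @ (corner - beta0)
print(val)                                     # equals 1: the corner is on the ellipsoid
```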

Example (Two real regressors): The centre-point equation of the ellipsoid is

x²/a² + y²/b² = 1,

or

(x, y) diag(1/a², 1/b²) (x, y)' = 1,

with T = diag(1/a², 1/b²) = diag(t1, t2)

and the area of the ellipse equal to πab.

The Minimax Principle:



Consider the quadratic risk R(β̂, β, A) = tr[A E(β̂ − β)(β̂ − β)'] and a class {β̂} of estimators. Let B be a convex region of a priori restrictions for β. The criterion of the minimax estimator leads to the following.

Definition: An estimator b* is called a minimax estimator of β if

min_{β̂} sup_{β ∈ B} R(β̂, β, A) = sup_{β ∈ B} R(b*, β, A).

An explicit solution can be achieved if the weight matrix is of the form A = aa' of rank 1.

Using the abbreviation D* = (S + k^{-1}σ²T), where S = X'X, we have the following result:

Result: In the model y = Xβ + ε, with the restriction β'Tβ ≤ k (k > 0), and the risk function R(β̂, β, a), the linear minimax estimator is of the following form:

b* = (X'X + k^{-1}σ²T)^{-1} X'y
   = D*^{-1} X'y

with the bias vector and covariance matrix as

Bias(b*, β) = −k^{-1}σ² D*^{-1} T β,
V(b*) = σ² D*^{-1} S D*^{-1}

and the minimax risk is

sup_{β'Tβ ≤ k} R(b*, β, a) = σ² a'D*^{-1}a.
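A minimal sketch of the linear minimax estimator follows, assuming σ², T and k are supplied (in practice σ² is unknown, which is why the estimator is later called non-operational). Setting β0 = 0 reproduces the first result; a nonzero β0 corresponds to the second result below.

```python
import numpy as np

def minimax_estimator(X, y, T, k, sigma2, beta0=None):
    """Linear minimax estimator b* = beta0 + D*^{-1} X'(y - X beta0),
    with D* = X'X + k^{-1} sigma^2 T (sketch; sigma2, T, k assumed known)."""
    p = X.shape[1]
    if beta0 is None:
        beta0 = np.zeros(p)
    D_star = X.T @ X + (sigma2 / k) * T
    return beta0 + np.linalg.solve(D_star, X.T @ (y - X @ beta0))
```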

Result: If the restrictions are (β − β0)'T(β − β0) ≤ k with centre point β0 ≠ 0, the linear minimax estimator is of the following form:

b*(β0) = β0 + D*^{-1} X'(y − Xβ0)

with bias vector and covariance matrix as

Bias(b*(β0), β) = −k^{-1}σ² D*^{-1} T (β − β0),
V(b*(β0)) = V(b*),

and the minimax risk is

sup_{(β − β0)'T(β − β0) ≤ k} R(b*(β0), β, a) = σ² a'D*^{-1}a.

Interpretation: A change of the centre point of the a priori ellipsoid has an influence only on the estimator itself and its bias. The minimax estimator is not operational because σ² is unknown. The smaller the value of k, the stricter is the a priori restriction for fixed T. Analogously, the larger the value of k, the smaller is the influence of the restriction on the minimax estimator. For the borderline case we have

B(β) = {β : β'Tβ ≤ k} → R^K  as k → ∞

and lim_{k→∞} b* = b = (X'X)^{-1}X'y.

Comparison of b* and b:
Minimax risk: Since the OLS estimator b is unbiased, its minimax risk is

sup_{β'Tβ ≤ k} R(b, ·, a) = σ² a'S^{-1}a.

The linear minimax estimator b* has a smaller minimax risk than the OLS estimator, and

sup_{β'Tβ ≤ k} R(b, ·, a) − sup_{β'Tβ ≤ k} R(b*, β, a) = σ² a'(S^{-1} − (k^{-1}σ²T + S)^{-1})a ≥ 0,

since S^{-1} − (k^{-1}σ²T + S)^{-1} ≥ 0.

Considering the superiority in terms of MSE matrices, we get

M(b*, β) = V(b*) + Bias(b*, β) Bias(b*, β)'
         = σ² D*^{-1} (S + k^{-2}σ² Tββ'T) D*^{-1}.

Hence, b* is superior to b under the criterion of Loewner ordering when

Δ(b, b*) = V(b) − M(b*, β) = σ² D*^{-1}[D* S^{-1} D* − S − k^{-2}σ² Tββ'T] D*^{-1} ≥ 0,

which is possible if and only if

B = D* S^{-1} D* − S − k^{-2}σ² Tββ'T
  = k^{-2}σ⁴ T {S^{-1} + 2kσ^{-2}T^{-1} − σ^{-2}ββ'} T ≥ 0
  = k^{-2}σ⁴ T C^{1/2} [I − σ^{-2} C^{-1/2} ββ' C^{-1/2}] C^{1/2} T ≥ 0

with C = S^{-1} + 2kσ^{-2}T^{-1}. This is equivalent to

σ^{-2} β'(S^{-1} + 2kσ^{-2}T^{-1})^{-1} β ≤ 1.

Since (2kσ^{-2}T^{-1})^{-1} − (S^{-1} + 2kσ^{-2}T^{-1})^{-1} ≥ 0, a sufficient condition is

σ^{-2} β'(2kσ^{-2}T^{-1})^{-1} β ≤ 1, i.e., β'Tβ ≤ 2k,

which holds for all β satisfying the a priori restriction β'Tβ ≤ k.

Preliminary Test Estimation:


The statistical modeling of the data is usually done assuming that the model is correctly specified and the correct estimators are used for the purpose of estimation and drawing statistical inferences from a sample of data. Sometimes the prior information or constraints are available from outside the sample as non-sample information. The incorporation and use of such prior information along with the sample information leads to more efficient estimators, provided it is correct. So the suitability of the estimator rests on the correctness of the prior information. One possible statistical approach to check the correctness of prior information is through the framework of test of hypothesis. For example, if prior information is available in the form of exact linear restrictions r = Rβ, there are two possibilities: either it is correct or incorrect. If the information is correct, then r = Rβ holds true in the model, and then the restricted regression estimator (RRE) β̂_R of β is used, which is more efficient than the OLSE b of β. Moreover, RRE satisfies the restrictions, i.e., r = Rβ̂_R. On the other hand, when

the information is incorrect, i.e., r ≠ Rβ, then OLSE is better than RRE. The truthfulness of the prior information, i.e., whether r = Rβ or r ≠ Rβ, is tested through the null hypothesis H0: r = Rβ using the F-statistic.
• If H0 is accepted at the α level of significance, then we conclude that r = Rβ, and in such a situation, RRE is better than OLSE.
• On the other hand, if H0 is rejected at the α level of significance, then we conclude that r ≠ Rβ, and OLSE is better than RRE under such situations.

So when the exact content of the true sampling model is unknown, the statistical model to be used is determined by a preliminary test of hypothesis using the available sample data. Such procedures are completed in two stages and are based on a test of hypothesis which provides a rule for choosing between the estimator based on the sample data alone and the estimator that is consistent with the hypothesis. This requires a test of the compatibility of the OLSE (or maximum likelihood estimator), based on sample information only, with the linear hypothesis; then one can make a choice of estimator depending upon the outcome. Consequently, one chooses either OLSE or RRE. Note that under the normality of random errors, the equivalent choice is made between the maximum likelihood estimator of β and the restricted maximum likelihood estimator of β, which have the same forms as OLSE and RRE, respectively. So essentially a pre-test of the hypothesis r = Rβ is done and, based on that, a suitable estimator is chosen. This is called the pre-test procedure, which generates the pre-test estimator that, in turn, provides a rule to choose between the restricted and unrestricted estimators.

One can also understand the philosophy behind the preliminary test estimation as follows. Consider the problem of an investigator who has a single data set and wants to estimate the parameters of a linear model that are known to lie in a high-dimensional parametric space. However, prior information about the parameters is available, and it suggests that the relationship may be characterized by a lower-dimensional parametric space. Under such uncertainty, if the high-dimensional (unrestricted) specification is estimated by OLSE, the result from the over-specified model will be unbiased but will have a larger variance. Alternatively, the lower-dimensional parametric space may incorrectly specify the statistical model and, if estimated by OLSE, the result will be biased. The bias may or may not outweigh the reduction in variance. If such uncertainty is represented in the form of a general linear hypothesis, this leads to pre-test estimators.

Let us consider the conventional pre-test estimator under the model y = Xβ + ε with the usual assumptions and the general linear hypothesis H0: r = Rβ, which can be tested by using the F-statistic. The null hypothesis is rejected at the α level of significance when

u = F_calculated ≥ F_{α; p, n−p} = c

where the critical value c is determined for a given level α of the test by

∫_c^∞ dF_{p, n−p} = P[F_{p, n−p} ≥ c] = α.

• If H0 is true, meaning thereby that the prior information is correct, then use the RRE β̂_R to estimate β.

• If H0 is false, meaning thereby that the prior information is incorrect, then use the OLSE b to estimate β.

Thus the estimator to be used depends on the preliminary test of significance and is of the form

β̂_PT = β̂_R   if u < c
β̂_PT = b      if u ≥ c.

This estimator is called the preliminary test or pre-test estimator of β. Alternatively,

β̂_PT = β̂_R · I_(0,c)(u) + b · I_[c,∞)(u)
      = β̂_R · I_(0,c)(u) + [1 − I_(0,c)(u)] · b
      = b − (b − β̂_R) · I_(0,c)(u)
      = b − (b − r) · I_(0,c)(u)

where the indicator functions are defined as

I_(0,c)(u) = 1 when 0 < u < c, and 0 otherwise
I_[c,∞)(u) = 1 when u ≥ c, and 0 otherwise.

• If c = ∞, then β̂_PT = β̂_R · I_(0,∞)(u) + b · I_{∞}(u) = β̂_R.

• If c = 0, then β̂_PT = β̂_R · I_{0}(u) + b · I_[0,∞)(u) = b.

Note that c = ∞ and c = 0 indicate that the probability of type 1 error (i.e., rejecting H0 when it is true) is 0 and 1 respectively, so that the entire area under the sampling distribution is the area of acceptance or the area of rejection of the null hypothesis. Thus the choice of c has a crucial role to play in determining the sampling

performance of the pre-test estimator. Therefore, in a repeated sampling context, the data, the linear hypothesis, and the level of significance all determine the combination of the two estimators that is chosen on the average. The level of significance affects the outcome of the pretest estimator in the sense of determining the proportion of the time each estimator is used and in determining the sampling performance of the pretest estimator.
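The following Python sketch puts the two-stage rule into code: it computes the F statistic for H0: Rβ = r and returns the restricted regression estimator when the test accepts and the OLSE otherwise. All inputs (X, y, R, r, alpha) are hypothetical and supplied by the user.

```python
import numpy as np
from scipy import stats

def pretest_estimator(X, y, R, r, alpha=0.05):
    """Preliminary-test estimator: beta_R if F < c, OLSE b otherwise (sketch)."""
    n, k = X.shape
    q = R.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    diff = r - R @ b
    A = R @ XtX_inv @ R.T
    resid = y - X @ b
    s2 = resid @ resid / (n - k)
    u = (diff @ np.linalg.solve(A, diff) / q) / s2   # F statistic
    c = stats.f.ppf(1 - alpha, q, n - k)             # critical value
    beta_R = b - XtX_inv @ R.T @ np.linalg.solve(A, R @ b - r)
    return beta_R if u < c else b
```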
We use the following results to derive the bias and risk of the pretest estimator.

Result 1: If the random vector Z is distributed as a multivariate normal random vector with mean δ and covariance matrix σ²I_K, and is independent of χ²_(n−K), then

E[ I_(0,c)( [Z'Z/(σ²K)] · [(n − K)/χ²_(n−K)] ) · (Z/σ) ] = (δ/σ) P[ χ²_(K+2, λ)/χ²_(n−K) ≤ c* ],

where λ = δ'δ/(2σ²) and c* = cK/(n − K).

Result 2: If the random vector Z is distributed as a multivariate normal random vector with mean δ and covariance matrix σ²I_K, and is independent of χ²_(n−K), then

E[ I_(0,c*)( Z'Z/(σ²χ²_(n−K)) ) · (Z'Z/σ²) ]
  = K E[ I_(0,c*)( χ²_(K+2, δ'δ/2σ²)/χ²_(n−K) ) ] + (δ'δ/σ²) E[ I_(0,c*)( χ²_(K+4, δ'δ/2σ²)/χ²_(n−K) ) ]
  = K P[ χ²_(K+2, δ'δ/2σ²)/χ²_(n−K) ≤ cK/(n − K) ] + (δ'δ/σ²) P[ χ²_(K+4, δ'δ/2σ²)/χ²_(n−K) ≤ cK/(n − K) ],

where c* = cK/(n − K).

Using these results, we can find the bias and risk of β̂_PT as follows.

Bias:

E(β̂_PT) = E(b) − E[ I_(0,c)(u) · (b − r) ]
         = β − δ P[ χ²_(p+2, δ'δ/2σ²)/χ²_(n−p) ≤ cp/(n − p) ]
         = β − δ P[ F_(p+2, n−p, δ'δ/2σ²) ≤ cp/(p + 2) ]
where δ = E(b − r), χ²_(p+2, δ'δ/2σ²) denotes the non-central χ² distribution with noncentrality parameter δ'δ/2σ², and F_(p+2, n−p, δ'δ/2σ²) denotes the non-central F distribution with noncentrality parameter δ'δ/2σ².

Thus, if δ = 0, the pretest estimator is unbiased. Note that the size of the bias is affected by the probability of a random variable with a non-central F-distribution being less than a constant that is determined by the level of the test, the number of hypotheses, and the degree of hypothesis error δ'δ. Since this probability is always less than or equal to one, the magnitude of the bias of the pretest estimator can never exceed that of δ.
Risk:
The risk of the pretest estimator is obtained as

ρ(β, β̂_PT) = E[ (β̂_PT − β)'(β̂_PT − β) ]
           = E[ (b − β − I_(0,c)(u)(b − r))'(b − β − I_(0,c)(u)(b − r)) ]
           = E[ (b − β)'(b − β) ] − E[ I_(0,c)(u)(b − β)'(b − β) ] + E[ I_(0,c)(u) ] δ'δ
           = σ²p + (2δ'δ − σ²p) P[ χ²_(p+2, δ'δ/2σ²)/χ²_(n−p) ≤ cp/(n − p) ] − δ'δ P[ χ²_(p+4, δ'δ/2σ²)/χ²_(n−p) ≤ cp/(n − p) ]

or, compactly,

ρ(β, β̂_PT) = σ²p + (2δ'δ − σ²p) l(2) − δ'δ l(4)

where

l(2) = P[ χ²_(p+2, δ'δ/2σ²)/χ²_(n−p) ≤ cp/(n − p) ],
l(4) = P[ χ²_(p+4, δ'δ/2σ²)/χ²_(n−p) ≤ cp/(n − p) ],
0 < l(4) < l(2) < 1.

The risk function implies the following results:


1. If the restrictions are correct and δ = 0, the risk of the pretest estimator is ρ(β, β̂_PT) = σ²p[1 − l(2)] < σ²p, where l(2) now involves central χ² distributions. Therefore, the pretest estimator has risk less than that of the least squares estimator at the origin, and the decrease in risk depends on the level of significance α and, correspondingly, on the critical value c of the test.
2. As the hypothesis error δ'δ, and thus the noncentrality parameter δ'δ/2σ², increases and approaches infinity, l(2) and l(4) approach zero. The risk of the pretest estimator therefore approaches σ²p, the risk of the unrestricted least squares estimator.

3. As the hypothesis error grows, the risk of the pretest estimator increases, attains a maximum after crossing the risk of the least squares estimator, and then monotonically decreases to approach the risk of the OLSE.
4. The pretest estimator risk function, defined on the specification-error parameter space, crosses the risk function of the least squares estimator within certain bounds of the noncentrality parameter.

The sampling characteristics of the preliminary test estimator are summarized in Figure 1.

From these results, we see that the pretest estimator does well relative to OLSE if the hypothesis is correctly specified. However, in the space representing the range of hypothesis errors, the pretest estimator is inferior to the least squares estimator over an infinite range of the parameter space. In Figures 1 and 2, there is a range of the parameter space in which the pretest estimator has risk that is inferior to (greater than) that of both the unrestricted and restricted least squares estimators. No one estimator depicted in Figure 1 dominates the other competitors. In addition, in applied problems the hypothesis errors, and thus the correct point in the specification-error parameter space, are seldom known. Consequently, the choice of the estimator is unresolved.

The Optimal Level of Significance


The form of the pretest estimator involves, for evaluation purposes, the probabilities of ratios of random variables being less than a constant that depends on the critical value c of the test, or equivalently on the level of statistical significance α. Thus, as c → ∞ (i.e., α → 0), the probabilities l(2) and l(4) approach one, and the risk of the pretest estimator approaches that of the restricted regression estimator β̂_R. In contrast, as c approaches zero (i.e., α → 1), the risk of the pretest estimator approaches that of the least squares estimator b. The choice of α, which has a crucial impact on the performance of the pretest estimator, is portrayed in Figure 3.

Since the investigator is usually unsure of the degree of hypothesis specification error, and thus is unsure of the appropriate point in the parameter space for evaluating the risk, the best of all worlds would be to have a rule that mixes the unrestricted and restricted estimators so as to minimize risk regardless of the relevant specification error. Thus the risk function traced out by the cross-hatched area in Figure 2 is relevant. Unfortunately, the risk of the pretest estimator, regardless of the choice of α, is always equal to or greater than this minimum risk function for some range of the parameter space. Given this result, one criterion that has been proposed for choosing the level α is to choose the critical value c that minimizes the maximum regret of not being on the minimum risk function, reflected by the boundary of the shaded area. Another criterion that has been proposed for choosing α is to minimize the average regret over the whole parameter space. Each of these criteria leads to different conclusions or rules for choice, and the question concerning the optimal level of the test is still open. One thing that is apparent is that the conventional choices of α = 0.05 and 0.01 may have rather severe statistical consequences.

Chapter 7
Multicollinearity
A basic assumption in the multiple linear regression model is that the rank of the matrix of observations on the explanatory variables is the same as the number of explanatory variables. In other words, such a matrix is of full column rank. This, in turn, implies that all the explanatory variables are independent, i.e., there is no linear relationship among the explanatory variables; the explanatory variables are then said to be orthogonal.

In many situations in practice, the explanatory variables may not remain independent due to various
reasons. The situation where the explanatory variables are highly intercorrelated is referred to as
multicollinearity.

Consider the multiple regression model

y = Xβ + ε,   ε ~ N(0, σ²I)

where y is (n × 1), X is (n × k) and β is (k × 1), with k explanatory variables X1, X2, ..., Xk and the usual assumptions including Rank(X) = k.

Assume the observations on all the Xi's and yi's are centered and scaled to unit length. So

- X'X becomes a k × k matrix of correlation coefficients between the explanatory variables, and
- X'y becomes a k × 1 vector of correlation coefficients between the explanatory and study variables.

Let X = (X1, X2, ..., Xk), where Xj is the jth column of X denoting the n observations on Xj. The column vectors X1, X2, ..., Xk are linearly dependent if there exists a set of constants ℓ1, ℓ2, ..., ℓk, not all zero, such that

Σ_{j=1}^{k} ℓj Xj = 0.

If this holds exactly for a subset of X1, X2, ..., Xk, then rank(X'X) < k. Consequently (X'X)^{-1} does not exist. If the condition Σ_{j=1}^{k} ℓj Xj = 0 is approximately true for some subset of X1, X2, ..., Xk, then there will be a near-linear dependency in X'X. In such a case, the multicollinearity problem exists. It is also said that X'X becomes ill-conditioned.

Source of multicollinearity:
1. Method of data collection:
It is expected that the data is collected over the whole cross-section of variables. It may happen that the
data is collected over a subspace of the explanatory variables where the variables are linearly dependent.
For example, sampling is done only over a limited range of explanatory variables in the population.

2. Model and population constraints


There may exist some constraints on the model or on the population from where the sample is drawn. The
sample may be generated from that part of the population having linear combinations.

3. Existence of identities or definitional relationships:


There may exist some relationships among the variables which may be due to the definition of variables or
any identity relation among them. For example, if data is collected on the variables like income, saving
and expenditure, then income = saving + expenditure. Such a relationship will not change even when the
sample size increases.

4. Imprecise formulation of model


The formulation of the model may unnecessarily be complicated. For example, quadratic (or polynomial) terms or cross-product terms may appear as explanatory variables. For example, let there be 3 variables X1, X2 and X3, so k = 3. Suppose their cross-product terms X1X2, X2X3 and X1X3 are also added. Then k rises to 6.

5. An over-determined model
Sometimes, due to over-enthusiasm, a large number of variables are included in the model to make it more realistic. Consequently, the number of observations (n) becomes smaller than the number of explanatory variables (k). Such a situation can arise in medical research where the number of patients may be small, but information is collected on a large number of variables. In another example, if there is time-series data for 50 years on consumption pattern, then it is expected that the consumption pattern does not remain the same for 50 years. So the better option is to use data from a smaller number of years, and hence it results in n < k.

Consequences of multicollinearity
To illustrate the consequences of the presence of multicollinearity, consider the model

y = β1x1 + β2x2 + ε,   E(ε) = 0,  V(ε) = σ²I

where x1, x2 and y are scaled to length unity.

The normal equations (X'X)b = X'y in this model become

( 1  r ) ( b1 )   ( r1y )
( r  1 ) ( b2 ) = ( r2y )

where r is the correlation coefficient between x1 and x2, rjy is the correlation coefficient between xj and y (j = 1, 2), and b = (b1, b2)' is the OLSE of β. Then

(X'X)^{-1} = 1/(1 − r²) (  1  −r )
                         ( −r   1 )

b1 = (r1y − r r2y)/(1 − r²)
b2 = (r2y − r r1y)/(1 − r²).

So the covariance matrix is V(b) = σ²(X'X)^{-1}, i.e.,

Var(b1) = Var(b2) = σ²/(1 − r²)
Cov(b1, b2) = −rσ²/(1 − r²).

If x1 and x2 are uncorrelated, then r = 0, X'X = I and rank(X'X) = 2. If x1 and x2 are perfectly correlated, then r = ±1 and rank(X'X) = 1.

If r → ±1, then Var(b1) = Var(b2) → ∞.

So if the variables are perfectly collinear, the variance of the OLSEs becomes large. This indicates highly unreliable estimates, and this is an inadmissible situation.

Consider the following values:

r                     0.99     0.9     0.1      0
Var(b1) = Var(b2)     50σ²     5σ²     1.01σ²   σ²

The standard errors of b1 and b2 rise sharply as r → ±1 and they break down at r = ±1 because X'X becomes singular.

• If r is close to 0, then the multicollinearity does not harm, and it is termed as non-harmful multicollinearity.
• If r is close to +1 or −1, then the multicollinearity inflates the variance terribly. This is termed as harmful multicollinearity.

There is no clear-cut boundary to distinguish between harmful and non-harmful multicollinearity. Generally, if r is low, the multicollinearity is considered non-harmful, and if r is high, the multicollinearity is regarded as harmful.
multicollinearity is regarded as harmful.

In the case of near or high multicollinearity, the following possible consequences are encountered.
1. The OLSE remains an unbiased estimator of β, but its sampling variance becomes very large. So OLSE becomes imprecise, and the property of BLUE does not hold anymore.
2. Due to large standard errors, the regression coefficients may not appear significant. Consequently, essential variables may be dropped. For example, to test H0: β1 = 0, we use the t-ratio

t0 = b1 / sqrt(Var̂(b1)).

Since Var̂(b1) is large, t0 is small, and consequently H0 is more often accepted. Thus harmful multicollinearity tends to cause important variables to be deleted.

3. Due to large standard errors, a large confidence region may arise. For example, the confidence interval is given by [ b1 ± t_{α/2, n−1} sqrt(Var̂(b1)) ]. When Var̂(b1) becomes large, the confidence interval becomes wider.

4. The OLSE may be sensitive to small changes in the values of explanatory variables. If some
observations are added or dropped, OLSE may change considerably in magnitude as well as in
sign. Ideally, OLSE should not change with the inclusion or deletion of variables. Thus OLSE loses
stability and robustness.

When the number of explanatory variables is more than two, say X1, X2, ..., Xk, the jth diagonal element of C = (X'X)^{-1} is

C_jj = 1/(1 − R_j²)

where R_j² is the multiple correlation coefficient, or the coefficient of determination, from the regression of Xj on the remaining (k − 1) explanatory variables.

If Xj is highly correlated with any subset of the other (k − 1) explanatory variables, then R_j² is high and close to 1. Consequently, the variance of the jth OLSE, Var(bj) = C_jj σ² = σ²/(1 − R_j²), becomes very high. The covariance between bi and bj will also be large if Xi and Xj are involved in the linear relationship leading to multicollinearity.

The least-squares estimates bj become too large in absolute value in the presence of multicollinearity. For example, consider the squared distance between b and β,

L² = (b − β)'(b − β).

Then

E(L²) = Σ_{j=1}^{k} E(bj − βj)² = Σ_{j=1}^{k} Var(bj) = σ² tr(X'X)^{-1}.

The trace of a matrix is the same as the sum of its eigenvalues. If λ1, λ2, ..., λk are the eigenvalues of X'X, then 1/λ1, 1/λ2, ..., 1/λk are the eigenvalues of (X'X)^{-1}, and hence

E(L²) = σ² Σ_{j=1}^{k} 1/λj,   λj > 0.

If X'X is ill-conditioned due to the presence of multicollinearity, then at least one of the eigenvalues will be small. So the distance between b and β may also be substantial. Thus

E(L²) = E(b − β)'(b − β) = σ² tr(X'X)^{-1}
      = E(b'b − 2b'β + β'β)
⇒ E(b'b) = σ² tr(X'X)^{-1} + β'β
⇒ b is generally longer than β
⇒ OLSE is too large in absolute value.

The least-squares produces wrong estimates of parameters in the presence of multicollinearity. This
does not imply that the fitted model provides wrong predictions also. If the predictions are confined to
x-space with non-harmful multicollinearity, then predictions are satisfactory.

Multicollinearity diagnostics
An important question arises about how to diagnose the presence of multicollinearity in the data on the basis of the given sample information. Several diagnostic measures are available, and each of them is based on a particular approach. It is difficult to say which of the diagnostics is the best or ultimate one. Some of the popular and important diagnostics are described further. The detection of multicollinearity involves 3 aspects:
(i) Determining its presence.
(ii) Determining its severity.
(iii) Determining its form or location.

1. Determinant of X'X, i.e., |X'X|

This measure is based on the fact that the matrix X'X becomes ill-conditioned in the presence of multicollinearity. The value of the determinant |X'X| declines as the degree of multicollinearity increases.

If Rank(X'X) < k, then X'X will be singular and so |X'X| = 0. So, as |X'X| → 0, the degree of multicollinearity increases, and it becomes exact or perfect at |X'X| = 0. Thus |X'X| serves as a measure of multicollinearity, and |X'X| = 0 indicates that perfect multicollinearity exists.

Limitations:
This measure has the following limitations:
(i) It is not bounded, as 0 ≤ |X'X| < ∞.
(ii) It is affected by the dispersion of the explanatory variables. For example, if k = 2, then

|X'X| = | Σ_{i=1}^{n} x1i²       Σ_{i=1}^{n} x1i x2i |
        | Σ_{i=1}^{n} x1i x2i    Σ_{i=1}^{n} x2i²    |
      = (Σ_{i=1}^{n} x1i²)(Σ_{i=1}^{n} x2i²)(1 − r12²)

where r12 is the correlation coefficient between x1 and x2. So |X'X| depends on the correlation coefficient and the variability of the explanatory variables. If the explanatory variables have very low variability, then |X'X| may tend to zero, which will indicate the presence of multicollinearity even when that is not actually the case.

(iii) It gives no idea about the relative effects on individual coefficients. If multicollinearity is present, it does not indicate which variables are causing it, and this remains hard to determine.

2. Inspection of correlation matrix


The inspection of the off-diagonal elements rij of X'X gives an idea about the presence of multicollinearity. If Xi and Xj are nearly linearly dependent, then |rij| will be close to 1. Note that the observations in X are standardized in the sense that the mean of each variable is subtracted from its observations, which are then divided by the square root of the corrected sum of squares of that variable.

When more than two explanatory variables are considered, and if they are involved in a near-linear dependency, then it is not necessary that any of the rij will be large. Generally, a pairwise inspection of correlation coefficients is not sufficient for detecting multicollinearity in the data.

3. Determinant of correlation matrix
Let D be the determinant of the correlation matrix; then 0 ≤ D ≤ 1.
If D = 0, then it indicates the existence of an exact linear dependence among the explanatory variables.
If D = 1, then the columns of the X matrix are orthonormal.
Thus a value close to 0 is an indication of a high degree of multicollinearity. Any value of D between 0 and 1 gives an idea of the degree of multicollinearity.

Limitation
It gives no information about the number of linear dependencies among the explanatory variables.

Advantages over |X'X|
(i) It is a bounded measure, 0 ≤ D ≤ 1.
(ii) It is not affected by the dispersion of the explanatory variables. For example, when k = 2 and the variables are scaled to unit length,

D = | 1    r12 |
    | r12   1  |  =  1 − r12²,

which does not depend on the variability of x1 and x2.

4. Measure based on partial regression:


A measure of multicollinearity can be obtained on the basis of coefficients of determination from partial regressions. Let R² be the coefficient of determination in the full model, i.e., based on all explanatory variables, and Ri² be the coefficient of determination in the model when the ith explanatory variable is dropped, i = 1, 2, ..., k. Let R_L² = max(R1², R2², ..., Rk²).

Procedure:
(i) Drop one of the explanatory variables among the k variables, say X1.
(ii) Run the regression of y on the rest of the (k − 1) variables X2, X3, ..., Xk.
(iii) Calculate R1².
(iv) Similarly, calculate R2², R3², ..., Rk².
(v) Find R_L² = max(R1², R2², ..., Rk²).
(vi) Determine R² − R_L².

The quantity (R² − R_L²) provides a measure of multicollinearity. If multicollinearity is present, R_L² will be high. The higher the degree of multicollinearity, the higher the value of R_L². So in the presence of multicollinearity, (R² − R_L²) will be low.

Thus if (R² − R_L²) is close to 0, it indicates a high degree of multicollinearity.

Limitations:
(i) It gives no information about the underlying relations among the explanatory variables, i.e., how many relationships are present or how many explanatory variables are responsible for the multicollinearity.
(ii) A small value of (R² − R_L²) may also occur because of a poor specification of the model, and it may then be wrongly inferred that multicollinearity is present.

5. Variance inflation factors (VIF):


The matrix X'X becomes ill-conditioned in the presence of multicollinearity in the data. So the diagonal elements of C = (X'X)^{-1} help in the detection of multicollinearity. If R_j² denotes the coefficient of determination obtained when Xj is regressed on the remaining (k − 1) variables, then the jth diagonal element of C is

C_jj = 1/(1 − R_j²).

If Xj is nearly orthogonal to the remaining explanatory variables, then R_j² is small and consequently C_jj is close to 1.

If Xj is nearly linearly dependent on a subset of the remaining explanatory variables, then R_j² is close to 1 and consequently C_jj is large.

The variance of the jth OLSE of βj is

Var(bj) = σ² C_jj.
So C_jj is the factor by which the variance of bj increases when the explanatory variables are near-linearly dependent. Based on this concept, the variance inflation factor for the jth explanatory variable is defined as

VIF_j = 1/(1 − R_j²).

This is the factor which is responsible for inflating the sampling variance. The combined effect of dependencies among the explanatory variables on the variance of a term is measured by the VIF of that term in the model.

One or more large VIFs indicate the presence of multicollinearity in the data.

In practice, usually, a VIF > 5 or 10 indicates that the associated regression coefficients are poorly estimated because of multicollinearity. The covariance matrix of the OLSE is σ²(X'X)^{-1}, and VIF_j is the part of this variance attributable to the jth coefficient, i.e., the factor by which Var(bj) is inflated relative to the orthogonal case.
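A short Python sketch of the VIF computation, assuming the columns of X contain the explanatory variables (an intercept is added internally to each auxiliary regression):

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j
    on the remaining explanatory variables (sketch)."""
    n, k = X.shape
    vif = np.empty(k)
    for j in range(k):
        xj = X[:, j]
        others = np.delete(X, j, axis=1)
        Z = np.column_stack([np.ones(n), others])     # intercept + other variables
        coef, *_ = np.linalg.lstsq(Z, xj, rcond=None)
        resid = xj - Z @ coef
        r2 = 1.0 - resid @ resid / ((xj - xj.mean()) @ (xj - xj.mean()))
        vif[j] = 1.0 / (1.0 - r2)
    return vif
```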

Limitations:
(i) It sheds no light on the number of dependencies among the explanatory variables.
(ii) The rule of VIF > 5 or 10 is a rule of thumb which may differ from one situation to another
situation.

Another interpretation of VIFj


The VIFs can also be viewed as follows. The confidence interval for the jth OLSE of βj is given by

[ bj ± sqrt(σ̂² C_jj) t_{α/2, n−k−1} ].

The length of this confidence interval is

L_j = 2 sqrt(σ̂² C_jj) t_{α/2, n−k−1}.

Now consider a situation where X is an orthogonal matrix, i.e., X'X = I, so that C_jj = 1, with the same sample size and the same root mean squares [ (1/n) Σ_{i=1}^{n} (x_ij − x̄_j)² ]^{1/2} as before. Then the length of the confidence interval becomes

L* = 2 σ̂ t_{α/2, n−k−1}.

Consider the ratio

L_j / L* = sqrt(C_jj).

Thus sqrt(VIF_j) indicates the increase in the length of the confidence interval of the jth regression coefficient due to the presence of multicollinearity.

6. Condition number and condition index:


Let 1 , 2 ,..., k be the eigenvalues (or characteristic roots) of X ' X . Let

max  Max(1 , 2 ,..., k )


min  Min(1 , 2 ,..., k ).
The condition number (CN ) is defined as
max
CN  , 0  CN   .
min
The small values of characteristic roots indicate the presence of near-linear dependency in the data. The
CN provides a measure of spread in the spectrum of characteristic roots of X ' X .

The condition number provides a measure of multicollinearity.


 If CN  100, then it is considered as non-harmful multicollinearity.
 If 100  CN 1000, then it indicates that the multicollinearity is moderate to severe (or strong).
This range is referred to as danger level.
 If CN 1000, then it indicates a severe (or strong) multicollinearity.

The condition number is based only or two eigenvalues: min and max . Another measures are condition

indices which use the information on other eigenvalues.

The condition indices of X'X are defined as

C_j = λ_max / λ_j,   j = 1, 2, ..., k.

In fact, the largest C_j = CN.

The number of condition indices that are large, say more than 1000, indicates the number of near-linear dependencies in X'X.

A limitation of CN and C_j is that they are unbounded measures, as 0 ≤ CN < ∞ and 0 ≤ C_j < ∞.
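The condition number and condition indices can be computed directly from the eigenvalues of X'X, as in the sketch below (X assumed centered and scaled as described above):

```python
import numpy as np

def condition_indices(X):
    """Condition number and condition indices lambda_max / lambda_j of X'X (sketch)."""
    eigvals = np.linalg.eigvalsh(X.T @ X)     # eigenvalues of the symmetric matrix X'X
    lam_max, lam_min = eigvals.max(), eigvals.min()
    CN = lam_max / lam_min
    C = lam_max / eigvals                     # one condition index per eigenvalue
    return CN, C
```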

7. Measure based on characteristic roots and proportion of variances:


Let 1 , 2 ,.., k be the eigenvalues of X ' X ,   diag (1 , 2 ,..., k ) is a k  k matrix and V is a k  k

matrix constructed by the eigenvectors of X ' X . Obviously, V is an orthogonal matrix. Then X ' X can
be decomposed as X ' X  V V ' . Let V1 , V2 ,..., Vk be the column of V . If there is a near-linear

dependency in the data, then  j is close to zero and the nature of linear dependency is described by the

elements of the associated eigenvector V j .

The covariance matrix of OLSE is


V (b)   2 ( X ' X ) 1
  2 (V V ') 1
  2V  1V '
 v2 v2 v2 
 Var (bi )   2  i1  i 2  ...  ik 
 1 2 k 
where vi1 , vi 2 ,..., vik are the elements in V .

The condition indices are


max
Cj  , j  1, 2,..., k .
j

Procedure:
(i) Find the condition indices C1, C2, ..., Ck.
(ii) (a) Identify those λj's for which Cj is greater than the danger level of 1000.
    (b) This gives the number of linear dependencies.
    (c) Do not consider those Cj's which are below the danger level.
(iii) For those λ's with condition index above the danger level, choose one such eigenvalue, say λj.
(iv) Find the proportion of the variance corresponding to λj in Var(b1), Var(b2), ..., Var(bk) as

p_ij = (v_ij²/λj) / Σ_{j=1}^{k} (v_ij²/λj) = (v_ij²/λj) / VIF_i.

Note that (v_ij²/λj) can be found from the expression

Var(bi) = σ² ( v_i1²/λ1 + v_i2²/λ2 + ... + v_ik²/λk ),

i.e., the term corresponding to the jth factor.

The proportion of variance p_ij provides a measure of multicollinearity.

If p_ij > 0.5, it indicates that bi is adversely affected by the multicollinearity, i.e., the estimate of βi is influenced by the presence of multicollinearity.

It is a good diagnostic tool in the sense that it tells about the presence of harmful multicollinearity and also indicates the number of linear dependencies responsible for the multicollinearity. This diagnostic is better than the other diagnostics.
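A sketch of the variance-decomposition proportions p_ij computed from the eigen-decomposition of X'X follows; rows index the coefficients and columns index the eigenvalues (factors). X is again assumed centered and scaled.

```python
import numpy as np

def variance_proportions(X):
    """Variance-decomposition proportions p[i, j]: share of Var(b_i)
    associated with eigenvalue lambda_j of X'X (sketch)."""
    lam, V = np.linalg.eigh(X.T @ X)          # X'X = V diag(lam) V'
    phi = V ** 2 / lam                        # phi[i, j] = v_ij^2 / lambda_j
    return phi / phi.sum(axis=1, keepdims=True)
```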

The condition indices can also be defined through the singular value decomposition of the X matrix as follows:

X = U D V'

where U is an n × k matrix, V is a k × k matrix, U'U = I, V'V = I, D is a k × k matrix with D = diag(μ1, μ2, ..., μk), and μ1, μ2, ..., μk are the singular values of X. V is the matrix whose columns are the eigenvectors corresponding to the eigenvalues of X'X, and U is the matrix whose columns are the eigenvectors associated with the k nonzero eigenvalues of XX'.

The condition indices of the X matrix are defined as

μ_max / μ_j,   j = 1, 2, ..., k

where μ_max = max(μ1, μ2, ..., μk).

If λ1, λ2, ..., λk are the eigenvalues of X'X, then

X'X = (UDV')'(UDV') = V D² V' = V Λ V',

so μj² = λj, j = 1, 2, ..., k.

Note that with μj² = λj,

Var(bj) = σ² Σ_{i=1}^{k} v_ji²/μ_i²
VIF_j = Σ_{i=1}^{k} v_ji²/μ_i²
p_ij = (v_ji²/μ_i²) / VIF_j.

The ill-conditioning in X is reflected in the size of the singular values. There will be one small singular value for each near-linear dependency. The extent of ill-conditioning is described by how small μ_j is relative to μ_max.

It is suggested that the explanatory variables should be scaled to unit length but should not be centered when computing p_ij. This helps in diagnosing the role of the intercept term in near-linear dependence. No unique guidance is available in the literature on the issue of centering the explanatory variables. Centering makes the intercept orthogonal to the explanatory variables, so it may remove the ill-conditioning due to the intercept term in the model.
Remedies for multicollinearity:
Various techniques have been proposed to deal with the problems resulting from the presence of
multicollinearity in the data.

1. Obtain more data


The harmful multicollinearity arises essentially because the rank of X'X falls below k and |X'X| is close to zero. Additional data may help in reducing the sampling variance of the estimates. The data need to be collected such that it helps in breaking up the multicollinearity in the data.

It is not always possible to collect additional data, for various reasons:
• The experiment and process have finished and are no longer available.
• Economic constraints may not allow collecting additional data.
• The additional data may not match the earlier collected data and may be unusual.
• If the data is a time series, then a longer series may force one to ignore data that is too far in the past.
• If multicollinearity is due to any identity or exact relationship, then increasing the sample size will not help.
• Sometimes it is not advisable to use the data even if it is available. For example, if data on the consumption pattern is available for the years 1950-2010, then one may not like to use it, as the consumption pattern usually does not remain the same for such a long period.

2. Drop some variables that are collinear:


If possible, identify the variables which seem to cause multicollinearity. These collinear variables can be dropped so as to satisfy the condition of full rank of the X matrix. The process of omitting the variables may be carried out on the basis of some kind of ordering of the explanatory variables, e.g., those variables can be deleted first which have a smaller value of the t-ratio. In another example, suppose the experimenter is not interested in all the parameters. In such cases, one can obtain estimators of the parameters of interest which have smaller mean squared errors than the variance of the OLSE by dropping some variables.

If some variables are eliminated, then this may reduce the predictive power of the model. Moreover, there is no assurance that the reduced model will exhibit less multicollinearity.
3. Use some relevant prior information:
One may search for some relevant prior information about the regression coefficients. This may lead to
the specification of estimates of some coefficients. The more general situation includes the specification of
some exact linear restrictions and stochastic linear restrictions. The procedures like restricted regression
and mixed regression can be used for this purpose. The relevance and correctness of information play an
important role in such analysis, but it is challenging to ensure it in practice. For example, the estimates
derived in the U.K. may not be valid in India.

4. Employ generalized inverse


If rank(X'X) < k, then a generalized inverse can be used in place of the inverse of X'X. Then β can be estimated by β̂ = (X'X)⁻ X'y.

In such a case, the estimates will not be unique except when the Moore-Penrose inverse of (X'X) is used. Different methods of finding a generalized inverse may give different results, so applied workers will get different results. Moreover, it is also not known which method of finding the generalized inverse is optimum.

5. Use of principal component regression


The principal component regression is based on the technique of principal component analysis. The k
explanatory variables are transformed into a new set of orthogonal variables called principal components.
Usually, this technique is used for reducing the dimensionality of data by retaining some levels of
variability of explanatory variables which is expressed by the variability in the study variable. The
principal components involve the determination of a set of linear combinations of explanatory variables
such that they retain the total variability of the system, and these linear combinations are mutually
independent of each other. Such obtained principal components are ranked in the order of their
importance. The importance being judged in terms of variability explained by a principal component
relative to the total variability in the system. The procedure then involves eliminating some of the
principal components which contribute to explaining relatively less variation. After elimination of the
least important principal components, the set up of multiple regression is used by replacing the
explanatory variables with principal components. Then study variable is regressed against the set of
selected principal components using the ordinary least squares method. Since all the principal
components are orthogonal, they are mutually independent, and so OLS is used without any problem.
Once the estimates of regression coefficients for the reduced set of orthogonal variables (principal
components) have been obtained, they are mathematically transformed into a new set of estimated
regression coefficients that correspond to the original correlated set of variables. These new estimated
coefficients are the principal components estimators of regression coefficients.

Suppose there are $k$ explanatory variables $X_1, X_2, \ldots, X_k$. Consider linear functions of $X_1, X_2, \ldots, X_k$
like
$$Z_1 = \sum_{i=1}^{k} a_i X_i, \qquad Z_2 = \sum_{i=1}^{k} b_i X_i, \quad \text{etc.}$$

The constants $a_1, a_2, \ldots, a_k$ are determined such that the variance of $Z_1$ is maximized subject to the
normalizing condition $\sum_{i=1}^{k} a_i^2 = 1$. The constants $b_1, b_2, \ldots, b_k$ are determined such that the variance of $Z_2$
is maximized subject to the normalizing condition $\sum_{i=1}^{k} b_i^2 = 1$ and such that $Z_2$ is independent of the first principal
component.

We continue with this process and obtain $k$ such linear combinations such that each is orthogonal to
its preceding linear combinations and satisfies the normalizing condition. We then obtain their variances.
Suppose such linear combinations are $Z_1, Z_2, \ldots, Z_k$ with $Var(Z_1) \geq Var(Z_2) \geq \ldots \geq Var(Z_k)$. The
linear combination having the largest variance is the first principal component, the linear combination
having the second largest variance is the second principal component, and so on. These principal
components have the property that $\sum_{i=1}^{k} Var(Z_i) = \sum_{i=1}^{k} Var(X_i)$. Also, $X_1, X_2, \ldots, X_k$ are correlated, but
$Z_1, Z_2, \ldots, Z_k$ are orthogonal or uncorrelated. So there will be zero multicollinearity among $Z_1, Z_2, \ldots, Z_k$.

The problem of multicollinearity arises because $X_1, X_2, \ldots, X_k$ are not independent. Since the principal
components based on $X_1, X_2, \ldots, X_k$ are mutually independent, they can be used as explanatory
variables, and such a regression will combat the multicollinearity.

Let 1 , 2 ,..., k be the eigenvalues of X ' X ,   diag (1 , 2 ,..., k ) is k  k diagonal matrix, V is a k  k

orthogonal matrix whose columns are the eigenvectors associated with 1 , 2 ,..., k . Consider the

canonical form of the linear model


y  X 
 XVV '   
 Z  
where Z  XV ,   V '  , V ' X ' XV  Z ' Z   .

Columns of Z   Z1 , Z 2 ,..., Z k  define a new set of explanatory variables which are called as principal

components.

The OLSE of  is
ˆ  ( Z ' Z ) 1 Z ' y
  1Z ' y
and its covariance matrix is
V (ˆ )   2 ( Z ' Z ) 1
  2  1
1 1 1 
  2 diag  , ,..., 
 1 2 k 
k k
Note that  j is the variance of j th principal component and Z ' Z   Z i Z j   . A small eigenvalue
i 1 j 1

of X ' X means that the linear relationship between the original explanatory variable exists and the
variance of the corresponding orthogonal regression coefficient is large, which indicates that the
multicollinearity exists. If one or more  j is small, then it indicates that multicollinearity is present.

Retention of principal components:

The new set of variables, i.e., the principal components, are orthogonal, and they retain the same total
variance as the original set. If multicollinearity is severe, then there will be at least one small eigenvalue.
The elimination of one or more principal components associated with the smallest eigenvalues
will reduce the total variance in the model only slightly. Moreover, the principal components responsible for creating
the multicollinearity will be removed, and the resulting model will be appreciably improved.

The principal component matrix $Z = (Z_1, Z_2, \ldots, Z_k)$ contains precisely the same
information as the original data in $X$ in the sense that the total variability in $X$ and $Z$ is the same. The
difference between them is that the original data are arranged into a set of new variables which are
uncorrelated with each other and can be ranked with respect to the magnitude of their eigenvalues. The $j$th
column vector $Z_j$ corresponding to the largest $\lambda_j$ accounts for the largest proportion of the variation in
the original data. Thus the $Z_j$'s are indexed so that $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_k \geq 0$, and $\lambda_j$ is the variance of $Z_j$.

A strategy for the elimination of principal components is to begin by discarding the component associated with
the smallest eigenvalue. The idea behind doing so is that the principal component with the smallest
eigenvalue contributes the least variance and so is the least informative.

Using this procedure, principal components are eliminated until the remaining components explain some
preselected proportion of the total variance. For example, if 90% of the total variance is needed, and suppose
$r$ principal components are eliminated, which means that the $(k-r)$ retained principal components contribute
90% of the variation, then $r$ is selected to satisfy
$$\frac{\sum_{i=1}^{k-r} \lambda_i}{\sum_{i=1}^{k} \lambda_i} > 0.90.$$

Various strategies to choose the required number of principal components are also available in the
literature.

Suppose that after using such a rule, $r$ principal components are eliminated. Now only $(k-r)$
components will be used for regression. So the $Z$ matrix is partitioned as
$$Z = (Z_r \;\; Z_{k-r}) = X(V_r \;\; V_{k-r})$$
where the submatrix $Z_r$ is of order $n \times r$ and contains the principal components to be eliminated. The
submatrix $Z_{k-r}$ is of order $n \times (k-r)$ and contains the principal components to be retained.

The reduced model obtained after the elimination of the $r$ principal components can be expressed as
$$y = Z_{k-r}\, \alpha_{k-r} + \varepsilon^{*}.$$

The random error component is represented as $\varepsilon^{*}$ just to distinguish it from $\varepsilon$. The reduced model
contains the coefficients associated with the retained $Z_j$'s. So
$$Z_{k-r} = (Z_1, Z_2, \ldots, Z_{k-r}), \qquad \alpha_{k-r} = (\alpha_1, \alpha_2, \ldots, \alpha_{k-r}), \qquad V_{k-r} = (V_1, V_2, \ldots, V_{k-r}).$$

Using OLS on the model with the retained principal components, the OLSE of $\alpha_{k-r}$ is
$$\hat{\alpha}_{k-r} = (Z_{k-r}' Z_{k-r})^{-1} Z_{k-r}' y.$$
It is then transformed back to the original explanatory variables as follows: since $\alpha = V'\beta$, we have $\alpha_{k-r} = V_{k-r}'\beta$, and so
$$\hat{\beta}_{pc} = V_{k-r}\, \hat{\alpha}_{k-r}$$
is the principal component regression estimator of $\beta$.

This method improves the efficiency of estimation and combats the multicollinearity.
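As a minimal illustrative sketch (not from the original notes), the principal component regression estimator can be computed directly from the eigen-decomposition of $X'X$; the function name, the variable names and the 90% variance retention rule are assumptions made only for illustration.

```python
import numpy as np

def pcr_estimator(X, y, var_explained=0.90):
    """Principal component regression estimator (illustrative sketch).

    X : (n, k) matrix of explanatory variables (centred/scaled as needed)
    y : (n,) response vector
    var_explained : proportion of total variance the retained components must explain
    """
    lam, V = np.linalg.eigh(X.T @ X)            # eigenvalues/eigenvectors of X'X
    order = np.argsort(lam)[::-1]               # sort so that lambda_1 >= ... >= lambda_k
    lam, V = lam[order], V[:, order]

    keep = np.searchsorted(np.cumsum(lam) / lam.sum(), var_explained) + 1
    V_kr = V[:, :keep]                          # eigenvectors of the retained components
    Z_kr = X @ V_kr                             # retained principal components

    alpha_hat = np.linalg.solve(Z_kr.T @ Z_kr, Z_kr.T @ y)   # OLS on the components
    beta_pc = V_kr @ alpha_hat                  # transform back to the original variables
    return beta_pc
```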

6. Ridge regression
The OLSE is the best linear unbiased estimator of the regression coefficient in the sense that it has minimum
variance in the class of linear and unbiased estimators. However, if the condition of unbiasedness can be
relaxed, then it is possible to find a biased estimator of the regression coefficient, say $\hat{\beta}$, that has smaller
variance than the unbiased OLSE $b$. The mean squared error (MSE) of $\hat{\beta}$ is
$$\begin{aligned}
MSE(\hat{\beta}) &= E(\hat{\beta} - \beta)^2 \\
&= E\left[ \left\{ \hat{\beta} - E(\hat{\beta}) \right\} + \left\{ E(\hat{\beta}) - \beta \right\} \right]^2 \\
&= Var(\hat{\beta}) + \left[ E(\hat{\beta}) - \beta \right]^2 \\
&= Var(\hat{\beta}) + \left[ Bias(\hat{\beta}) \right]^2.
\end{aligned}$$
Thus $MSE(\hat{\beta})$ can be made smaller than $Var(\hat{\beta})$ by introducing a small bias in $\hat{\beta}$. One approach for
doing so is ridge regression. The ridge regression estimator is obtained by solving a modified form of the normal
equations of least squares estimation. The normal equations are modified as
$$(X'X + \delta I)\, \hat{\beta}_{ridge} = X'y$$
$$\Rightarrow \hat{\beta}_{ridge} = (X'X + \delta I)^{-1} X'y$$
which is the ridge regression estimator of $\beta$, and $\delta \geq 0$ is any characterizing scalar termed the biasing
parameter.

As $\delta \to 0$, $\hat{\beta}_{ridge} \to b$ (OLSE), and as $\delta \to \infty$, $\hat{\beta}_{ridge} \to 0$.

So the larger the value of $\delta$, the larger is the shrinkage towards zero. Note that the OLSE is inappropriate to use in the
sense that it has very high variance when multicollinearity is present in the data. On the other hand, a very
small value of $\hat{\beta}$ may tend to favour acceptance of the null hypothesis $H_0: \beta = 0$, indicating that the corresponding
variables are not relevant. The value of the biasing parameter controls the amount of shrinkage in the
estimates.

Bias of ridge regression estimator:

The bias of $\hat{\beta}_{ridge}$ is
$$\begin{aligned}
Bias(\hat{\beta}_{ridge}) &= E(\hat{\beta}_{ridge}) - \beta \\
&= (X'X + \delta I)^{-1} X' E(y) - \beta \\
&= \left[ (X'X + \delta I)^{-1} X'X - I \right] \beta \\
&= (X'X + \delta I)^{-1} \left[ X'X - X'X - \delta I \right] \beta \\
&= -\delta (X'X + \delta I)^{-1} \beta.
\end{aligned}$$
Thus the ridge regression estimator is a biased estimator of $\beta$.

Covariance matrix:
The covariance matrix of $\hat{\beta}_{ridge}$ is defined as
$$V(\hat{\beta}_{ridge}) = E\left[ \left\{ \hat{\beta}_{ridge} - E(\hat{\beta}_{ridge}) \right\} \left\{ \hat{\beta}_{ridge} - E(\hat{\beta}_{ridge}) \right\}' \right].$$
Since
$$\begin{aligned}
\hat{\beta}_{ridge} - E(\hat{\beta}_{ridge}) &= (X'X + \delta I)^{-1} X'y - (X'X + \delta I)^{-1} X'X\beta \\
&= (X'X + \delta I)^{-1} X'(y - X\beta) \\
&= (X'X + \delta I)^{-1} X'\varepsilon,
\end{aligned}$$
so
$$\begin{aligned}
V(\hat{\beta}_{ridge}) &= (X'X + \delta I)^{-1} X' V(\varepsilon) X (X'X + \delta I)^{-1} \\
&= \sigma^2 (X'X + \delta I)^{-1} X'X (X'X + \delta I)^{-1}.
\end{aligned}$$

Mean squared error:
The mean squared error of $\hat{\beta}_{ridge}$ is
$$\begin{aligned}
MSE(\hat{\beta}_{ridge}) &= Var(\hat{\beta}_{ridge}) + \left[ bias(\hat{\beta}_{ridge}) \right]^2 \\
&= tr\left[ V(\hat{\beta}_{ridge}) \right] + \left[ bias(\hat{\beta}_{ridge}) \right]^2 \\
&= \sigma^2 tr\left[ (X'X + \delta I)^{-1} X'X (X'X + \delta I)^{-1} \right] + \delta^2 \beta'(X'X + \delta I)^{-2}\beta \\
&= \sigma^2 \sum_{j=1}^{k} \frac{\lambda_j}{(\lambda_j + \delta)^2} + \delta^2 \beta'(X'X + \delta I)^{-2}\beta
\end{aligned}$$
where $\lambda_1, \lambda_2, \ldots, \lambda_k$ are the eigenvalues of $X'X$.

Thus as $\delta$ increases, the bias in $\hat{\beta}_{ridge}$ increases but its variance decreases. The trade-off between bias
and variance hinges upon the value of $\delta$. It can be shown that there exists a value of $\delta$ such that
$$MSE(\hat{\beta}_{ridge}) < Var(b)$$
provided $\beta'\beta$ is bounded.

Choice of $\delta$:
The value of the ridge regression estimator depends upon the value of $\delta$. Various approaches have been
suggested in the literature to determine the value of $\delta$. The value of $\delta$ can be chosen on the basis of criteria
like
- the stability of the estimators with respect to $\delta$,
- reasonable signs of the estimated coefficients,
- the magnitude of the residual sum of squares, etc.
We consider here the determination of $\delta$ by inspection of the ridge trace.

Ridge trace:
The ridge trace is the graphical display of the ridge regression estimator versus $\delta$.

If multicollinearity is present and is severe, then the instability of the regression coefficients is reflected in the
ridge trace. As $\delta$ increases, some of the ridge estimates vary dramatically, and they stabilize at some
value of $\delta$. The objective in the ridge trace is to inspect the trace (curve) and find the reasonably small value
of $\delta$ at which the ridge regression estimators are stable. The ridge regression estimator with such a choice
of $\delta$ will have smaller MSE than the variance of OLSE.
An example of a ridge trace for a model with 6 parameters is as follows. In this ridge trace, $\hat{\beta}_{ridge}$ is
evaluated for various choices of $\delta$, and the corresponding values of all the regression coefficients $\hat{\beta}_{j(ridge)}$,
$j = 1, 2, \ldots, 6$, are plotted versus $\delta$. These values are denoted by different symbols and are joined by a smooth
curve. This produces a ridge trace for the respective parameter. Now choose the value of $\delta$ where all the
curves stabilize and become nearly parallel. For example, the curves in the figure become
almost parallel starting from $\delta = \delta_4$ or so. Thus one possible choice of $\delta$ is $\delta = \delta_4$, and the parameters can
be estimated as $\hat{\beta}_{ridge} = (X'X + \delta_4 I)^{-1} X'y$.

The figure clearly exposes the presence of multicollinearity in the data. The behaviour of $\hat{\beta}_{i(ridge)}$ at
$\delta_0 = 0$ is very different than at other values of $\delta$. For small values of $\delta$, the estimates change rapidly.
The estimates stabilize gradually as $\delta$ increases. The value of $\delta$ at which all the estimates stabilize gives
the desired value of $\delta$, because moving away from such $\delta$ will not bring any appreciable reduction in the
residual sum of squares. If multicollinearity is present, then the variation in the ridge regression estimators is
rapid around $\delta = 0$. The optimal $\delta$ is chosen such that, beyond that value of $\delta$, almost all traces stabilize.
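A short sketch of how such a ridge trace could be produced (not part of the original notes; the grid of $\delta$ values, the simulated data and the variable names are assumptions made for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

def ridge_path(X, y, deltas):
    """Compute the ridge regression estimator for a grid of biasing parameters."""
    k = X.shape[1]
    return np.array([np.linalg.solve(X.T @ X + d * np.eye(k), X.T @ y) for d in deltas])

# assumed example data with an induced near-collinearity
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
X[:, 5] = X[:, 4] + 0.01 * rng.normal(size=50)
y = X @ np.ones(6) + rng.normal(size=50)

deltas = np.linspace(0.0, 2.0, 100)
coefs = ridge_path(X, y, deltas)        # shape (100, 6): one curve per coefficient

plt.plot(deltas, coefs)
plt.xlabel("biasing parameter $\\delta$")
plt.ylabel("ridge estimates")
plt.title("Ridge trace")
plt.show()
```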

Limitations:
1. The choice of $\delta$ is data-dependent and therefore is a random variable. Using it as a random variable
violates the assumption that $\delta$ is a constant. This will disturb the optimal properties derived under the
assumption of constancy of $\delta$.
2. The value of $\delta$ lies in the interval $(0, \infty)$, so a large number of values may be required for exploration.
This results in a waste of time. This is, however, not a big issue when working with software.
3. The choice of $\delta$ from the graphical display may not be unique. Different people may choose different $\delta$,
and consequently the values of the ridge regression estimators will differ. Another choice of $\delta$ is
$$\delta = \frac{k\hat{\sigma}^2}{b'b}$$
where $b$ and $\hat{\sigma}^2$ are obtained from least-squares estimation.
4. The stability of the numerical estimates of the $\hat{\beta}_i$'s is a rough way to determine $\delta$. Different estimates may
exhibit stability for different $\delta$, and it may often be hard to strike a compromise. In such a situation,
generalized ridge regression estimators are used.
5. There is no guidance available regarding the testing of hypotheses and for confidence interval
estimation.

Idea behind the ridge regression estimator:

The problem of multicollinearity arises because some of the eigenvalues (characteristic roots) of $X'X$ are close
to zero or are zero. So if $\lambda_1, \lambda_2, \ldots, \lambda_k$ are the characteristic roots, and if
$$X'X = \Lambda = diag(\lambda_1, \lambda_2, \ldots, \lambda_k),$$
then
$$\hat{\beta}_{ridge} = (I + \delta\Lambda^{-1})^{-1} b$$
where $b$ is the OLSE of $\beta$ given by
$$b = (X'X)^{-1}X'y = \Lambda^{-1}X'y.$$

Thus a particular element will be of the form
$$\hat{\beta}_{i(ridge)} = \frac{\lambda_i}{\lambda_i + \delta}\, b_i.$$
So a small quantity $\delta$ is added to $\lambda_i$ so that even if $\lambda_i \to 0$, the quantity $\frac{\lambda_i}{\lambda_i + \delta}$ remains meaningful.

Another interpretation of the ridge regression estimator:

In the model $y = X\beta + \varepsilon$, obtain the least squares estimator of $\beta$ subject to $\sum_{i=1}^{k} \beta_i^2 = C$, where $C$ is some
constant. So minimize
$$S(\beta) = (y - X\beta)'(y - X\beta) + \delta(\beta'\beta - C)$$
where $\delta$ is the Lagrangian multiplier. Differentiating $S(\beta)$ with respect to $\beta$, the normal equations are
obtained as
$$\frac{\partial S(\beta)}{\partial \beta} = 0 \;\Rightarrow\; -2X'y + 2X'X\beta + 2\delta\beta = 0$$
$$\Rightarrow \hat{\beta}_{ridge} = (X'X + \delta I)^{-1}X'y.$$

Note that if $C$ is very small, it may indicate that most of the regression coefficients are close to zero, and if
$C$ is large, then it may indicate that the regression coefficients are away from zero. So $C$ puts a sort of
penalty on the regression coefficients to enable their estimation.

Chapter 8
Heteroskedasticity

In the multiple regression model
$$y = X\beta + \varepsilon,$$
it is assumed that
$$V(\varepsilon) = \sigma^2 I,$$
i.e.,
$$Var(\varepsilon_i) = \sigma^2, \qquad Cov(\varepsilon_i, \varepsilon_j) = 0, \quad i \neq j = 1, 2, \ldots, n.$$

In this case, the diagonal elements of the covariance matrix of $\varepsilon$ are the same, indicating that the variance of
each $\varepsilon_i$ is the same, and the off-diagonal elements of the covariance matrix of $\varepsilon$ are zero, indicating that all
disturbances are pairwise uncorrelated. This property of constancy of variance is termed homoskedasticity,
and the disturbances are called homoskedastic disturbances.

In many situations, this assumption may not be plausible, and the variances may not remain the same. The
disturbances whose variances are not constant across the observations are called heteroskedastic disturbances,
and this property is termed heteroskedasticity. In this case
$$Var(\varepsilon_i) = \sigma_i^2, \quad i = 1, 2, \ldots, n,$$
and the disturbances are pairwise uncorrelated.

The covariance matrix of the disturbances is
$$V(\varepsilon) = diag(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2) = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix}.$$

Graphically, homoskedasticity corresponds to a constant spread of $y$ about the regression line, whereas under
heteroskedasticity the spread of $y$ either increases or decreases with $x$.
[Figure: Homoskedasticity; Heteroskedasticity (Var(y) increases with x); Heteroskedasticity (Var(y) decreases with x)]

Examples: Suppose in a simple linear regression model, $x$ denotes the income and $y$ denotes the expenditure
on food. It is observed that as income increases, the expenditure on food increases because the choice
and variety in food increase, in general, up to a certain extent. So the variance of the observations on $y$ will not
remain constant as income changes. The assumption of homoskedasticity implies that the consumption pattern
of food will remain the same irrespective of the income of the person. This may not generally be a correct
assumption in real situations. Instead, the consumption pattern changes, and hence the variance of $y$, and so the
variance of the disturbances, will not remain constant. In general, it will increase as income increases.

In another example, suppose in a simple linear regression model, x denotes the number of hours of practice for
typing and y denotes the number of typing errors per page. It is expected that the number of typing mistakes
per page decreases as the person practices more. The homoskedastic disturbances assumption implies that the
number of errors per page will remain the same irrespective of the number of hours of typing practice which
may not be true in practice.

Possible reasons for heteroskedasticity:


There are various reasons due to which the heteroskedasticity is introduced in the data. Some of them are as
follows:
1. The nature of the phenomenon under study may have an increasing or decreasing trend. For example,
the variation in consumption pattern on food increases as income increases. Similarly, the number of
typing mistakes decreases as the number of hours of typing practise increases.

2. Sometimes the observations are in the form of averages, and this introduces heteroskedasticity in the
model. For example, it is easier to collect data on the expenditure on clothes for the whole family rather
than on a particular family member. Suppose in a simple linear regression model
$$y_{ij} = \beta_0 + \beta_1 x_{ij} + \varepsilon_{ij}, \quad i = 1, 2, \ldots, n, \; j = 1, 2, \ldots, m_i,$$
$y_{ij}$ denotes the expenditure on clothes of the $j$th member of the $i$th family having $m_i$ members, and $x_{ij}$ denotes the
age of the $j$th person in the $i$th family. It is difficult to record data for an individual family member, but it is
easier to get data for the whole family. So the $y_{ij}$'s are known only collectively.

Then instead of the per-member expenditure, we find the data on the average expenditure per family member as
$$\bar{y}_i = \frac{1}{m_i} \sum_{j=1}^{m_i} y_{ij}, \qquad \bar{x}_i = \frac{1}{m_i} \sum_{j=1}^{m_i} x_{ij}$$
and the model becomes
$$\bar{y}_i = \beta_0 + \beta_1 \bar{x}_i + \bar{\varepsilon}_i.$$

If we assume $E(\varepsilon_{ij}) = 0$, $Var(\varepsilon_{ij}) = \sigma^2$, then
$$E(\bar{\varepsilon}_i) = 0, \qquad Var(\bar{\varepsilon}_i) = \frac{\sigma^2}{m_i}$$
which indicates that the resulting variance of the disturbances does not remain constant but depends on the
number of members in a family, $m_i$. So heteroskedasticity enters the data. The variance will remain
constant only when all the $m_i$'s are the same.

3. Sometimes theoretical considerations introduce heteroskedasticity in the data. For example,
suppose in the simple linear model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n,$$
$y_i$ denotes the yield of rice and $x_i$ denotes the quantity of fertilizer in an agricultural experiment. It is
observed that when the quantity of fertilizer increases, then the yield increases. In fact, initially the yield
increases as the quantity of fertilizer increases. Gradually, the rate of increase slows down, and if
fertilizer is increased further, the crop burns. So notice that $\beta_1$ changes with different levels of fertilizer.

In such cases, when $\beta_1$ changes, a possible way is to express it as a random variable with constant
mean $\beta_1$ and constant variance $\sigma_v^2$, like
$$\beta_{1i} = \beta_1 + v_i, \quad i = 1, 2, \ldots, n$$
with
$$E(v_i) = 0, \quad Var(v_i) = \sigma_v^2, \quad E(\varepsilon_i v_i) = 0.$$
So the complete model becomes
$$\begin{aligned}
y_i &= \beta_0 + \beta_{1i} x_i + \varepsilon_i, \qquad \beta_{1i} = \beta_1 + v_i \\
\Rightarrow y_i &= \beta_0 + \beta_1 x_i + (\varepsilon_i + x_i v_i) \\
&= \beta_0 + \beta_1 x_i + w_i
\end{aligned}$$
where $w_i = \varepsilon_i + x_i v_i$ is like a new random error component. So
$$\begin{aligned}
E(w_i) &= 0 \\
Var(w_i) &= E(w_i^2) = E(\varepsilon_i^2) + x_i^2 E(v_i^2) + 2x_i E(\varepsilon_i v_i) \\
&= \sigma^2 + x_i^2 \sigma_v^2 + 0 = \sigma^2 + x_i^2 \sigma_v^2.
\end{aligned}$$
So the variance depends on $i$, and thus heteroskedasticity is introduced in the model. Note that we assumed
homoskedastic disturbances for the model
$$y_i = \beta_0 + \beta_{1i} x_i + \varepsilon_i, \quad \beta_{1i} = \beta_1 + v_i$$
but finally end up with heteroskedastic disturbances. This is due to theoretical considerations.
4. The skewness in the distribution of one or more explanatory variables in the model also causes
heteroskedasticity in the model.
5. The incorrect data transformations and wrong functional form of the model can also give rise to the
heteroskedasticity problem.

Tests for heteroskedasticity


The presence of heteroskedasticity affects the estimation and test of hypothesis. The heteroskedasticity can
enter into the data due to various reasons. The tests for heteroskedasticity assume a specific nature of
heteroskedasticity. Various tests are available in the literature, e.g.,
1. Bartlett test
2. Breusch Pagan test
3. Goldfeld Quandt test
4. Glejser test
5. Test based on Spearman’s rank correlation coefficient
6. White test
7. Ramsey test
8. Harvey Phillips test
9. Szroeter test
10. Peak test (nonparametric)
We discuss the first five tests.

1. Bartlett’s test
It is a test for testing the null hypothesis
$$H_0: \sigma_1^2 = \sigma_2^2 = \ldots = \sigma_i^2 = \ldots = \sigma_n^2.$$
This hypothesis is termed the hypothesis of homoskedasticity. This test can be used only when replicated
data is available.

Since in the model
$$y_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_k X_{ik} + \varepsilon_i, \quad E(\varepsilon_i) = 0, \; Var(\varepsilon_i) = \sigma_i^2, \quad i = 1, 2, \ldots, n,$$
only one observation $y_i$ is available to find $\sigma_i^2$, the usual tests cannot be applied. This problem can be
overcome if replicated data is available. So consider a model of the form
$$y_i^* = X_i \beta + \varepsilon_i^*$$
where $y_i^*$ is an $m_i \times 1$ vector, $X_i$ is an $m_i \times k$ matrix, $\beta$ is a $k \times 1$ vector and $\varepsilon_i^*$ is an $m_i \times 1$ vector. So replicated data
is now available for every $y_i^*$ in the following way:
$$\begin{aligned}
y_1^* &= X_1\beta + \varepsilon_1^* \quad \text{consists of } m_1 \text{ observations,} \\
y_2^* &= X_2\beta + \varepsilon_2^* \quad \text{consists of } m_2 \text{ observations,} \\
&\;\;\vdots \\
y_n^* &= X_n\beta + \varepsilon_n^* \quad \text{consists of } m_n \text{ observations.}
\end{aligned}$$
All the individual models can be written together as
$$\begin{pmatrix} y_1^* \\ y_2^* \\ \vdots \\ y_n^* \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}\beta + \begin{pmatrix} \varepsilon_1^* \\ \varepsilon_2^* \\ \vdots \\ \varepsilon_n^* \end{pmatrix}
\qquad \text{or} \qquad y^* = X\beta + \varepsilon^*$$
where $y^*$ is a vector of order $\left( \sum_{i=1}^{n} m_i \right) \times 1$, $X$ is a $\left( \sum_{i=1}^{n} m_i \right) \times k$ matrix, $\beta$ is a $k \times 1$ vector and $\varepsilon^*$ is a $\left( \sum_{i=1}^{n} m_i \right) \times 1$
vector. Applying OLS to this model yields
$$\hat{\beta} = (X'X)^{-1}X'y^*$$
and the residual vectors
$$e_i^* = y_i^* - X_i\hat{\beta}.$$
Based on this, obtain
$$s_i^2 = \frac{1}{m_i - k}\, e_i^{*'} e_i^*, \qquad s^2 = \frac{\sum_{i=1}^{n} (m_i - k)\, s_i^2}{\sum_{i=1}^{n} (m_i - k)}.$$

Now apply Bartlett's test as
$$\chi^2 = \frac{1}{C} \sum_{i=1}^{n} (m_i - k) \log\frac{s^2}{s_i^2}$$
which has an asymptotic $\chi^2$ distribution with $(n-1)$ degrees of freedom, where
$$C = 1 + \frac{1}{3(n-1)} \left[ \sum_{i=1}^{n} \frac{1}{m_i - k} - \frac{1}{\sum_{i=1}^{n} (m_i - k)} \right].$$

Another variant of Bartlett’s test
Another variant of Bartlett’s test is based on the likelihood ratio test statistic
$$u = \prod_{i=1}^{m} \left( \frac{s_i^2}{s^2} \right)^{n_i/2}$$
where
$$s_i^2 = \frac{1}{n_i} \sum_{j=1}^{n_i} \left( y_{ij} - \bar{y}_i \right)^2, \quad i = 1, 2, \ldots, m; \; j = 1, 2, \ldots, n_i,$$
$$s^2 = \frac{1}{n} \sum_{i=1}^{m} n_i s_i^2, \qquad n = \sum_{i=1}^{m} n_i.$$

To obtain an unbiased test and a modification of $-2\ln u$ which is a closer approximation to
$\chi^2_{m-1}$ under $H_0$, Bartlett's test replaces $n_i$ by $(n_i - 1)$ and divides by a scalar constant. This leads to the statistic
$$M = \frac{(n-m)\log\hat{\sigma}^2 - \sum_{i=1}^{m} (n_i - 1)\log\hat{\sigma}_i^2}{1 + \frac{1}{3(m-1)} \left[ \sum_{i=1}^{m} \frac{1}{n_i - 1} - \frac{1}{n-m} \right]}$$
which has a $\chi^2$ distribution with $(m-1)$ degrees of freedom under $H_0$, where
$$\hat{\sigma}_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2, \qquad \hat{\sigma}^2 = \frac{1}{n-m} \sum_{i=1}^{m} (n_i - 1)\hat{\sigma}_i^2.$$

In experimental sciences, it is easier to get replicated data, and this test can be easily applied. In real-life
applications, it is challenging to get replicated data, and this test may not be applied. This difficulty is overcome
in Breusch Pagan test.
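As an illustration (not from the notes), the classical Bartlett statistic for equality of group variances can be computed with scipy when replicated data are available. Note that scipy's version does not adjust the degrees of freedom for the $k$ regression parameters as the formula above does, and the groups below are assumed example data.

```python
import numpy as np
from scipy import stats

# assumed replicated data: observations (or residuals) grouped by replicate
rng = np.random.default_rng(1)
group1 = rng.normal(scale=1.0, size=12)
group2 = rng.normal(scale=1.5, size=15)
group3 = rng.normal(scale=2.5, size=10)

stat, p_value = stats.bartlett(group1, group2, group3)
print(f"Bartlett statistic = {stat:.3f}, p-value = {p_value:.4f}")
# A small p-value leads to rejection of the homoskedasticity hypothesis.
```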

2. Breusch Pagan test
This test can be applied when the replicated data is not available, but only single observations are available.
When it is suspected that the variance is some function (but not necessarily multiplicative) of more than one
explanatory variable, then Breusch Pagan test can be used.

Assuming that under the alternative hypothesis $\sigma_i^2$ is expressible as
$$\sigma_i^2 = h(Z_i'\gamma) = h(\gamma_1 + Z_i^{*'}\gamma^*)$$
where $h$ is some unspecified function and is independent of $i$,
$$Z_i' = (1, Z_i^{*'}) = (1, Z_{i2}, Z_{i3}, \ldots, Z_{ip})$$
is the vector of observable explanatory variables with first element unity, and $\gamma' = (\gamma_1, \gamma^{*'}) = (\gamma_1, \gamma_2, \ldots, \gamma_p)$ is a
vector of unknown coefficients with the first element being the intercept term. The heterogeneity is
defined by these $p$ variables. These $Z_i$'s may also include some of the $X$'s.

Specifically, assume that
$$\sigma_i^2 = \gamma_1 + \gamma_2 Z_{i2} + \ldots + \gamma_p Z_{ip}.$$
The null hypothesis
$$H_0: \sigma_1^2 = \sigma_2^2 = \ldots = \sigma_n^2$$
can then be expressed as
$$H_0: \gamma_2 = \gamma_3 = \ldots = \gamma_p = 0.$$

If $H_0$ is accepted, it implies that $\gamma_2 Z_{i2}, \gamma_3 Z_{i3}, \ldots, \gamma_p Z_{ip}$ do not have any effect on $\sigma_i^2$, and we get $\sigma_i^2 = \gamma_1$.

The test procedure is as follows:

1. Ignoring the heterogeneity, apply OLS to
$$y_i = \beta_1 + \beta_2 X_{i2} + \ldots + \beta_k X_{ik} + \varepsilon_i$$
and obtain the residuals
$$e = y - Xb, \qquad b = (X'X)^{-1}X'y.$$
2. Construct the variables
$$g_i = \frac{e_i^2}{\left( \sum_{i=1}^{n} e_i^2 / n \right)} = \frac{n e_i^2}{SS_{res}}$$
where $SS_{res}$ is the residual sum of squares based on the $e_i$'s.
3. Run the regression of $g$ on $Z_1, Z_2, \ldots, Z_p$ and get the residual sum of squares $SS_{res}^*$.

4. For testing, calculate the test statistic
$$Q = \frac{1}{2}\left[ \sum_{i=1}^{n} g_i^2 - SS_{res}^* \right]$$
which is asymptotically distributed as $\chi^2$ with $(p-1)$ degrees of freedom.

5. The decision rule is to reject $H_0$ if $Q > \chi^2_{1-\alpha}(p-1)$.

- This test is very simple to perform.
- A fairly general form is assumed for the heterogeneity, so it is a very general test.
- This is an asymptotic test.
- This test is quite powerful in the presence of heteroskedasticity.
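For practical work, a Breusch-Pagan type test is available in statsmodels. The sketch below is illustrative only: the data are simulated, and the statsmodels implementation uses the standard LM form of the statistic, which may differ in detail from the $Q$ given above.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# assumed example data with variance growing in x
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(1, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)

X = sm.add_constant(x)                        # regressors for the mean model
ols_res = sm.OLS(y, X).fit()

# the Z variables (here simply the regressors, including the constant column)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_res.resid, X)
print(f"LM statistic = {lm_stat:.3f}, p-value = {lm_pvalue:.4f}")
```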

3. Goldfeld Quandt test


This test is based on the assumption that $\sigma_i^2$ is positively related to $X_{ij}$, i.e., one of the explanatory variables
explains the heteroskedasticity in the model. Let the $j$th explanatory variable explain the heteroskedasticity, so
$$\sigma_i^2 \propto X_{ij} \quad \text{or} \quad \sigma_i^2 = \sigma^2 X_{ij}.$$
The test procedure is as follows:

1. Rank the observations according to the decreasing order of $X_j$.
2. Split the observations into two equal parts, leaving $c$ observations in the middle, so that each part
contains $\frac{n-c}{2}$ observations, provided $\frac{n-c}{2} > k$.
3. Run two separate regressions on the two parts using OLS and obtain the residual sums of squares $SS_{res1}$
and $SS_{res2}$.
4. The test statistic is
$$F_0 = \frac{SS_{res2}}{SS_{res1}}$$
which follows an $F$-distribution, i.e., $F\left( \frac{n-c}{2} - k, \frac{n-c}{2} - k \right)$, when $H_0$ is true.
5. The decision rule is to reject $H_0$ whenever $F_0 > F_{1-\alpha}\left( \frac{n-c}{2} - k, \frac{n-c}{2} - k \right)$.
- This test is a simple test, but it is based on the assumption that one of the explanatory variables helps in
determining the heteroskedasticity.
- The test is an exact, finite-sample test.
- The only difficulty in this test is that the choice of $c$ is not obvious. If a large value of $c$ is chosen, then
it reduces the degrees of freedom $\frac{n-c}{2} - k$, and the condition $\frac{n-c}{2} > k$ may be violated.

On the other hand, if a smaller value of $c$ is chosen, then the test may fail to reveal the heteroskedasticity, since
the objective of ordering the observations and deleting the middle observations is to sharpen the contrast between
the two groups. As the smallest and largest values of $\sigma_i^2$ give the maximum discrimination, removing too few
observations in the middle may not give a proper idea of the heteroskedasticity. Considering these two points, the
working choice of $c$ is suggested as $c = \frac{n}{3}$.

Moreover, the choice of $X_{ij}$ is also difficult. Since $\sigma_i^2 \propto X_{ij}$, if all important variables are included in the
model, then it may be difficult to decide which of the variables is influencing the heteroskedasticity.
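A minimal numpy sketch of the Goldfeld-Quandt computation (illustrative only; this sketch sorts in ascending order of the chosen regressor so that $SS_{res2}$ corresponds to its larger values, and the default $c = n/3$ follows the suggestion above):

```python
import numpy as np

def goldfeld_quandt(y, X, sort_col, c=None):
    """Goldfeld-Quandt statistic F0 = SSres2 / SSres1 after sorting on one regressor."""
    n, k = X.shape
    if c is None:
        c = n // 3                         # working choice c = n/3 suggested in the text
    order = np.argsort(X[:, sort_col])     # ascending order of the chosen regressor
    y, X = y[order], X[order]

    m = (n - c) // 2                       # size of each part

    def ss_res(yp, Xp):
        b = np.linalg.lstsq(Xp, yp, rcond=None)[0]
        r = yp - Xp @ b
        return r @ r

    ss1 = ss_res(y[:m], X[:m])             # part with smaller values of the regressor
    ss2 = ss_res(y[n - m:], X[n - m:])     # part with larger values of the regressor
    return ss2 / ss1, m - k                # statistic and the degrees of freedom of each part

# usage: F0, df = goldfeld_quandt(y, X, sort_col=1); compare F0 with F_{1-alpha}(df, df)
```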

4. Glejser test:
This test is based on the assumption that $\sigma_i^2$ is influenced by a single variable $Z$, i.e., there is only one variable
which is influencing the heteroskedasticity. This variable could be either one of the explanatory variables or it
can be chosen from some extraneous source.

The test procedure is as follows:

1. Use OLS and obtain the residual vector $e$ on the basis of the available study and explanatory variables.
2. Choose $Z$ and apply OLS to
$$|e_i| = \delta_0 + \delta_1 Z_i^h + v_i$$
where $v_i$ is the associated disturbance term.
3. Test $H_0: \delta_1 = 0$ using the $t$-ratio test statistic.
4. Conduct the test for $h = \pm 1, \pm \frac{1}{2}$. So the test procedure is repeated four times.

In practice, one can choose any value of $h$. For simplicity, we choose $h = 1$.
- The test has only asymptotic justification, but the four choices of $h$ generally give satisfactory results.
- This test sheds light on the nature of the heteroskedasticity.
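A short sketch of the Glejser-type auxiliary regression for $h = 1$, using assumed example data (statsmodels' OLS is used only as a convenience):

```python
import numpy as np
import statsmodels.api as sm

# assumed example data whose error variance is influenced by z
rng = np.random.default_rng(3)
n = 150
z = rng.uniform(1, 5, size=n)
y = 1.0 + 2.0 * z + rng.normal(scale=0.5 * z)

X = sm.add_constant(z)
resid = sm.OLS(y, X).fit().resid                         # step 1: OLS residuals

aux = sm.OLS(np.abs(resid), sm.add_constant(z)).fit()    # step 2: |e_i| on Z (h = 1)
print(aux.tvalues[1], aux.pvalues[1])                    # step 3: t-ratio for H0: delta_1 = 0
```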

5. Spearman’s rank correlation test
If $d_i$ denotes the difference in the ranks assigned to two different characteristics of the $i$th object or
phenomenon and $n$ is the number of objects or phenomena ranked, then Spearman’s rank correlation
coefficient is defined as
$$r = 1 - 6\left[ \frac{\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \right]; \quad -1 \leq r \leq 1.$$
This can be used for testing the hypothesis about heteroskedasticity.

Consider the model
$$y_i = \beta_0 + \beta_1 X_i + \varepsilon_i.$$

1. Run the regression of $y$ on $X$ and obtain the residuals $e$.
2. Consider $|e_i|$.
3. Rank both $|e_i|$ and $X_i$ (or $\hat{y}_i$) in an ascending (or descending) order.
4. Compute the rank correlation coefficient $r$ based on $|e_i|$ and $X_i$ (or $\hat{y}_i$).
5. Assuming that the population rank correlation coefficient is zero and $n > 8$, use the test statistic
$$t_0 = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}$$
which follows a $t$-distribution with $(n-2)$ degrees of freedom.
6. The decision rule is to reject the null hypothesis of homoskedasticity whenever $t_0 \geq t_{1-\alpha}(n-2)$.

If there is more than one explanatory variable, then the rank correlation coefficient can be computed
between $|e_i|$ and each of the explanatory variables separately and can be tested using $t_0$.
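A small sketch of this rank-correlation test using assumed example data; scipy's spearmanr supplies the rank correlation $r$ used in the $t$ statistic above.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

# assumed example data
rng = np.random.default_rng(4)
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 + 1.2 * x + rng.normal(scale=0.4 * x)

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
r, _ = stats.spearmanr(np.abs(resid), x)            # rank correlation of |e_i| with x_i

t0 = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)         # test statistic with (n-2) d.f.
crit = stats.t.ppf(0.95, df=n - 2)                  # one-sided 5% critical value
print(f"r = {r:.3f}, t0 = {t0:.3f}, reject H0: {t0 >= crit}")
```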

Estimation under heteroskedasticity


Consider the model
$$y = X\beta + \varepsilon$$
with $k$ explanatory variables, and assume that
$$E(\varepsilon) = 0, \qquad V(\varepsilon) = \Omega = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix}.$$

The OLSE is
$$b = (X'X)^{-1}X'y.$$
Its estimation error is
$$b - \beta = (X'X)^{-1}X'\varepsilon$$
and
$$E(b - \beta) = (X'X)^{-1}X'E(\varepsilon) = 0.$$
Thus OLSE remains unbiased even under heteroskedasticity.

The covariance matrix of $b$ is
$$\begin{aligned}
V(b) &= E(b-\beta)(b-\beta)' \\
&= (X'X)^{-1}X'E(\varepsilon\varepsilon')X(X'X)^{-1} \\
&= (X'X)^{-1}X'\Omega X(X'X)^{-1}
\end{aligned}$$
which is not the same as the conventional expression. So OLSE is not efficient under heteroskedasticity as
compared with its performance under homoskedasticity.

Now we check whether $E(e_i^2) = \sigma_i^2$ or not, where $e_i$ is the $i$th residual.

The residual vector is
$$e = y - Xb = H\varepsilon$$
and the $i$th residual is
$$e_i = \ell_i' e = \ell_i' H\varepsilon$$
where $\ell_i$ is an $n \times 1$ vector with all elements zero except the $i$th element, which is unity, and
$H = I - X(X'X)^{-1}X'$. Then
$$e_i^2 = \ell_i'\, e e'\, \ell_i = \ell_i' H\varepsilon\varepsilon' H\ell_i$$
$$E(e_i^2) = \ell_i' H E(\varepsilon\varepsilon') H\ell_i = \ell_i' H\Omega H\ell_i.$$
Now $H\ell_i = (h_{1i}, h_{2i}, \ldots, h_{ni})'$, the $i$th column of $H$, so
$$E(e_i^2) = (h_{1i}, h_{2i}, \ldots, h_{ni}) \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix} \begin{pmatrix} h_{1i} \\ h_{2i} \\ \vdots \\ h_{ni} \end{pmatrix} = \sum_{j=1}^{n} h_{ji}^2\, \sigma_j^2.$$

Thus $E(e_i^2) \neq \sigma_i^2$, and so $e_i^2$ becomes a biased estimator of $\sigma_i^2$ in the presence of heteroskedasticity.

In the presence of heteroskedasticity, use generalized least squares estimation. The generalized least
squares estimator (GLSE) of $\beta$ is
$$\hat{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y.$$

Its estimation error is obtained as
$$\hat{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}(X\beta + \varepsilon)$$
$$\hat{\beta} - \beta = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\varepsilon.$$
Thus
$$\begin{aligned}
E(\hat{\beta} - \beta) &= (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}E(\varepsilon) = 0 \\
V(\hat{\beta}) &= E(\hat{\beta} - \beta)(\hat{\beta} - \beta)' \\
&= (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}E(\varepsilon\varepsilon')\Omega^{-1}X(X'\Omega^{-1}X)^{-1} \\
&= (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\Omega\,\Omega^{-1}X(X'\Omega^{-1}X)^{-1} \\
&= (X'\Omega^{-1}X)^{-1}.
\end{aligned}$$
Example: Consider a simple linear regression model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n.$$
The variances of the OLSE and GLSE of the slope parameter are
$$Var(b) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sigma_i^2}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^2} \qquad \text{and} \qquad Var(\hat{\beta}) = \frac{1}{\sum_{i=1}^{n} \dfrac{(x_i - \bar{x})^2}{\sigma_i^2}}$$
respectively. Consider
$$\frac{Var(\hat{\beta})}{Var(b)} = \frac{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^2}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \sigma_i^2 \right] \left[ \sum_{i=1}^{n} \dfrac{(x_i - \bar{x})^2}{\sigma_i^2} \right]}$$
which is the square of the correlation coefficient between $\sigma_i(x_i - \bar{x})$ and $\dfrac{x_i - \bar{x}}{\sigma_i}$ and so is at most one. Hence
$$Var(\hat{\beta}) \leq Var(b).$$
So the relative efficiency of the OLSE and GLSE depends upon the correlation coefficient between $(x_i - \bar{x})\sigma_i$ and
$\dfrac{x_i - \bar{x}}{\sigma_i}$.
The generalized least squares estimation assumes that $\Omega$ is known, i.e., the nature of the heteroskedasticity is
completely specified. Based on this assumption, the following two cases arise:
- $\Omega$ is completely specified, or
- $\Omega$ is not completely specified.
We consider both cases as follows:

Case 1: $\sigma_i^2$'s are prespecified:

Suppose $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$ are completely known in the model
$$y_i = \beta_1 + \beta_2 X_{i2} + \ldots + \beta_k X_{ik} + \varepsilon_i.$$
Now deflate the model by $\sigma_i$, i.e.,
$$\frac{y_i}{\sigma_i} = \frac{\beta_1}{\sigma_i} + \beta_2\frac{X_{i2}}{\sigma_i} + \ldots + \beta_k\frac{X_{ik}}{\sigma_i} + \frac{\varepsilon_i}{\sigma_i}.$$
Let $\varepsilon_i^* = \dfrac{\varepsilon_i}{\sigma_i}$; then $E(\varepsilon_i^*) = 0$ and $Var(\varepsilon_i^*) = \dfrac{\sigma_i^2}{\sigma_i^2} = 1$. Now OLS can be applied to this model, and the usual
tools for drawing statistical inferences can be used.

Note that when the model is deflated, the intercept term is lost, as $1/\sigma_i$ is itself a variable. This point has to be
taken care of in the software output.
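A brief sketch of this deflation (weighted least squares) step when the $\sigma_i$ are known (illustrative; here $X$ is assumed to already include the column of ones, which becomes $1/\sigma_i$ after deflation, as cautioned above):

```python
import numpy as np

def deflated_ols(y, X, sigma):
    """OLS applied to the model deflated by known standard deviations sigma_i."""
    w = 1.0 / sigma
    y_star = y * w                       # y_i / sigma_i
    X_star = X * w[:, None]              # each regressor divided by sigma_i
    beta_hat = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)
    return beta_hat
```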

Case 2: $\Omega$ may not be completely specified

Let $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$ be partially known and suppose
$$\sigma_i^2 \propto X_{ij}^2 \quad \text{or} \quad \sigma_i^2 = \sigma^2 X_{ij}^2$$
but $\sigma^2$ is not available. Consider the model
$$y_i = \beta_1 + \beta_2 X_{i2} + \ldots + \beta_k X_{ik} + \varepsilon_i$$
and deflate it by $X_{ij}$ as
$$\frac{y_i}{X_{ij}} = \frac{\beta_1}{X_{ij}} + \beta_2\frac{X_{i2}}{X_{ij}} + \ldots + \beta_k\frac{X_{ik}}{X_{ij}} + \frac{\varepsilon_i}{X_{ij}}.$$
Now apply OLS to this transformed model and use the usual statistical tools for drawing inferences.

A caution is to be kept in mind while doing so. This is illustrated in the following example with a one
explanatory variable model.

Consider the model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.$$
Deflating it by $x_i$, we get
$$\frac{y_i}{x_i} = \frac{\beta_0}{x_i} + \beta_1 + \frac{\varepsilon_i}{x_i}.$$
Note that the roles of $\beta_0$ and $\beta_1$ in the original and deflated models are interchanged. In the original model, $\beta_0$ is
the intercept term and $\beta_1$ is the slope parameter, whereas in the deflated model, $\beta_1$ becomes the intercept term
and $\beta_0$ becomes the slope parameter. So essentially one can use OLS, but one needs to be careful in identifying the
intercept term and the slope parameter, particularly in the software output.

Chapter 9
Autocorrelation
One of the basic assumptions in the linear regression model is that the random error components or
disturbances are identically and independently distributed. So in the model $y = X\beta + u$, it is assumed that
$$E(u_t\, u_{t-s}) = \begin{cases} \sigma_u^2 & \text{if } s = 0 \\ 0 & \text{if } s \neq 0, \end{cases}$$
i.e., the correlation between successive disturbances is zero.

In this set-up, when $E(u_t\, u_{t-s}) = \sigma_u^2,\; s = 0$ is violated, i.e., the variance of the disturbance term does not
remain constant, then the problem of heteroskedasticity arises. When $E(u_t\, u_{t-s}) = 0,\; s \neq 0$ is violated, i.e.,
the variance of the disturbance term remains constant but the successive disturbance terms are correlated,
then such a problem is termed the problem of autocorrelation.

When autocorrelation is present, some or all off-diagonal elements of $E(uu')$ are nonzero.

Sometimes the study and explanatory variables have a natural sequence order over time, i.e., the data is
collected with respect to time. Such data is termed as time-series data. The disturbance terms in time series
data are serially correlated.

The autocovariance at lag $s$ is defined as
$$\gamma_s = E(u_t\, u_{t-s}); \quad s = 0, \pm 1, \pm 2, \ldots .$$
At zero lag, we have the constant variance, i.e.,
$$\gamma_0 = E(u_t^2) = \sigma_u^2.$$
The autocorrelation coefficient at lag $s$ is defined as
$$\rho_s = \frac{E(u_t\, u_{t-s})}{\sqrt{Var(u_t)\,Var(u_{t-s})}} = \frac{\gamma_s}{\gamma_0}; \quad s = 0, \pm 1, \pm 2, \ldots$$

Assume $\gamma_s$ and $\rho_s$ are symmetrical in $s$, i.e., these coefficients are constant over time and depend only on
the length of the lag $s$. The autocorrelation between the successive terms $(u_2 \text{ and } u_1)$,
$(u_3 \text{ and } u_2), \ldots, (u_n \text{ and } u_{n-1})$ gives the autocorrelation of order one, i.e., $\rho_1$. Similarly, the autocorrelation
between the terms $(u_3 \text{ and } u_1), (u_4 \text{ and } u_2), \ldots, (u_n \text{ and } u_{n-2})$ gives the autocorrelation of order two,
i.e., $\rho_2$.

Source of autocorrelation
Some of the possible reasons for the introduction of autocorrelation in the data are as follows:
1. Carryover of effect, at least in part, is an important source of autocorrelation. For example, the
monthly data on expenditure on household is influenced by the expenditure of preceding month. The
autocorrelation is present in cross-section data as well as time-series data. In the cross-section data,
the neighbouring units tend to be similar with respect to the characteristic under study. In time-series
data, time is the factor that produces autocorrelation. Whenever some ordering of sampling units is
present, the autocorrelation may arise.

2. Another source of autocorrelation is the effect of deletion of some variables. In regression modeling,
it is not possible to include all the variables in the model. There can be various reasons for this, e.g.,
some variable may be qualitative, sometimes direct observations may not be available on the variable
etc. The joint effect of such deleted variables gives rise to autocorrelation in the data.

3. The misspecification of the form of relationship can also introduce autocorrelation in the data. It is
assumed that the form of relationship between study and explanatory variables is linear. If there are
log or exponential terms present in the model so that the linearity of the model is questionable, then
this also gives rise to autocorrelation in the data.

4. The difference between the observed and true values of the variable is called measurement error or
errors–in-variable. The presence of measurement errors on the dependent variable may also introduce
the autocorrelation in the data.

Structure of disturbance term:
Consider the situation where the disturbances are autocorrelated:
$$E(\varepsilon\varepsilon') = \begin{pmatrix} \gamma_0 & \gamma_1 & \cdots & \gamma_{n-1} \\ \gamma_1 & \gamma_0 & \cdots & \gamma_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_{n-1} & \gamma_{n-2} & \cdots & \gamma_0 \end{pmatrix}
= \gamma_0 \begin{pmatrix} 1 & \rho_1 & \cdots & \rho_{n-1} \\ \rho_1 & 1 & \cdots & \rho_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n-1} & \rho_{n-2} & \cdots & 1 \end{pmatrix}
= \sigma_u^2 \begin{pmatrix} 1 & \rho_1 & \cdots & \rho_{n-1} \\ \rho_1 & 1 & \cdots & \rho_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n-1} & \rho_{n-2} & \cdots & 1 \end{pmatrix}.$$

Observe that there are now $(n + k)$ parameters: $\beta_1, \beta_2, \ldots, \beta_k,\; \sigma_u^2,\; \rho_1, \rho_2, \ldots, \rho_{n-1}$. These $(n + k)$ parameters are
to be estimated on the basis of the available $n$ observations. Since the number of parameters is more than the
number of observations, the situation is not good from the statistical point of view. In order to handle the
situation, some special form and structure of the disturbance term needs to be assumed so that the
number of parameters in the covariance matrix of the disturbance term can be reduced.

The following structures are popular in autocorrelation:


1. Autoregressive (AR) process.
2. Moving average (MA) process.
3. Joint autoregression moving average (ARMA) process.

1. Autoregressive (AR) process

The structure of the disturbance term in the autoregressive (AR) process is assumed as
$$u_t = \phi_1 u_{t-1} + \phi_2 u_{t-2} + \ldots + \phi_q u_{t-q} + \varepsilon_t,$$
i.e., the current disturbance term depends on the $q$ lagged disturbances, and $\phi_1, \phi_2, \ldots, \phi_q$ are the parameters
(coefficients) associated with $u_{t-1}, u_{t-2}, \ldots, u_{t-q}$ respectively. An additional disturbance term $\varepsilon_t$ is introduced in
$u_t$ which is assumed to satisfy the following conditions:
$$E(\varepsilon_t) = 0, \qquad E(\varepsilon_t\, \varepsilon_{t+s}) = \begin{cases} \sigma_\varepsilon^2 & \text{if } s = 0 \\ 0 & \text{if } s \neq 0. \end{cases}$$
This process is termed an AR$(q)$ process. In practice, the AR$(1)$ process is the most popular.

2. Moving average (MA) process:

The structure of the disturbance term in the moving average (MA) process is
$$u_t = \varepsilon_t + \theta_1\varepsilon_{t-1} + \ldots + \theta_p\varepsilon_{t-p},$$
i.e., the present disturbance term $u_t$ depends on the $p$ lagged values of $\varepsilon$. The coefficients $\theta_1, \theta_2, \ldots, \theta_p$ are the
parameters and are associated with $\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots, \varepsilon_{t-p}$, respectively. This process is termed an MA$(p)$ process.

3. Joint autoregressive moving average (ARMA) process:

The structure of the disturbance term in the joint autoregressive moving average (ARMA) process is
$$u_t = \phi_1 u_{t-1} + \ldots + \phi_q u_{t-q} + \varepsilon_t + \theta_1\varepsilon_{t-1} + \ldots + \theta_p\varepsilon_{t-p}.$$
This is termed an ARMA$(q, p)$ process.

The method of the correlogram is used to check which of these processes the data follows. The
correlogram is a two-dimensional graph between the lag $s$ and the autocorrelation coefficient $\rho_s$, which is
plotted with the lag $s$ on the $X$-axis and $\rho_s$ on the $Y$-axis.

In the MA(1) process
$$u_t = \varepsilon_t + \theta_1\varepsilon_{t-1}$$
$$\rho_s = \begin{cases} \dfrac{\theta_1}{1 + \theta_1^2} & \text{for } s = 1 \\ 0 & \text{for } s \geq 2 \end{cases}$$
$$\rho_0 = 1, \quad \rho_1 \neq 0, \quad \rho_i = 0 \;\; \text{for } i = 2, 3, \ldots$$
So there is no autocorrelation between the disturbances that are more than one period apart.

In the ARMA(1,1) process
$$u_t = \phi_1 u_{t-1} + \varepsilon_t + \theta_1\varepsilon_{t-1}$$
$$\rho_s = \begin{cases} \dfrac{(1 + \phi_1\theta_1)(\phi_1 + \theta_1)}{1 + \theta_1^2 + 2\phi_1\theta_1} & \text{for } s = 1 \\[1ex] \phi_1\,\rho_{s-1} & \text{for } s \geq 2 \end{cases}$$
$$\sigma_u^2 = \left( \frac{1 + \theta_1^2 + 2\phi_1\theta_1}{1 - \phi_1^2} \right)\sigma_\varepsilon^2.$$
The autocorrelation function begins at some point determined by both the AR and MA components but
thereafter declines geometrically at a rate determined by the AR component.

In general, the autocorrelation function


- is nonzero but is geometrically damped for AR process.
- becomes zero after a finite number of periods for MA process.
The ARMA process combines both these features.

The results of any lower order of process are not applicable in higher-order schemes. As the order of the
process increases, the difficulty in handling them mathematically also increases.

Estimation under the first-order autoregressive process:

Consider a simple linear regression model
$$y_t = \beta_0 + \beta_1 X_t + u_t, \quad t = 1, 2, \ldots, n.$$
Assume the $u_t$'s follow a first-order autoregressive scheme defined as
$$u_t = \rho u_{t-1} + \varepsilon_t$$
where $|\rho| < 1$, $E(\varepsilon_t) = 0$,
$$E(\varepsilon_t\, \varepsilon_{t+s}) = \begin{cases} \sigma_\varepsilon^2 & \text{if } s = 0 \\ 0 & \text{if } s \neq 0 \end{cases}$$
for all $t = 1, 2, \ldots, n$, where $\rho$ is the first-order autocorrelation between $u_t$ and $u_{t-1}$, $t = 1, 2, \ldots, n$. Now
$$\begin{aligned}
u_t &= \rho u_{t-1} + \varepsilon_t \\
&= \rho(\rho u_{t-2} + \varepsilon_{t-1}) + \varepsilon_t \\
&\;\;\vdots \\
&= \varepsilon_t + \rho\varepsilon_{t-1} + \rho^2\varepsilon_{t-2} + \ldots \\
&= \sum_{r=0}^{\infty} \rho^r \varepsilon_{t-r}
\end{aligned}$$

$$E(u_t) = 0$$
$$\begin{aligned}
E(u_t^2) &= E(\varepsilon_t^2) + \rho^2 E(\varepsilon_{t-1}^2) + \rho^4 E(\varepsilon_{t-2}^2) + \ldots \\
&= (1 + \rho^2 + \rho^4 + \ldots)\sigma_\varepsilon^2 \qquad (\varepsilon_t\text{'s are serially independent}) \\
E(u_t^2) &= \sigma_u^2 = \frac{\sigma_\varepsilon^2}{1 - \rho^2} \quad \text{for all } t.
\end{aligned}$$
$$\begin{aligned}
E(u_t u_{t-1}) &= E\left[ \left( \varepsilon_t + \rho\varepsilon_{t-1} + \rho^2\varepsilon_{t-2} + \ldots \right)\left( \varepsilon_{t-1} + \rho\varepsilon_{t-2} + \rho^2\varepsilon_{t-3} + \ldots \right) \right] \\
&= \rho\, E\left( \varepsilon_{t-1} + \rho\varepsilon_{t-2} + \ldots \right)^2 \\
&= \rho\,\sigma_u^2.
\end{aligned}$$
Similarly,
$$E(u_t u_{t-2}) = \rho^2\sigma_u^2.$$
In general,
$$E(u_t u_{t-s}) = \rho^s\sigma_u^2$$
$$E(uu') = \Omega = \sigma_u^2 \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{n-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1 \end{pmatrix}.$$

Note that the disturbance terms are no longer independent and $E(uu') \neq \sigma^2 I$. The disturbances are
nonspherical.

Consequences of autocorrelated disturbances:

Consider the model with first-order autoregressive disturbances
$$\underset{n\times 1}{y} = \underset{n\times k}{X}\,\underset{k\times 1}{\beta} + \underset{n\times 1}{u},$$
$$u_t = \rho u_{t-1} + \varepsilon_t, \quad t = 1, 2, \ldots, n$$
with the assumptions
$$E(u) = 0, \quad E(uu') = \Omega,$$
$$E(\varepsilon_t) = 0, \quad E(\varepsilon_t\,\varepsilon_{t+s}) = \begin{cases} \sigma_\varepsilon^2 & \text{if } s = 0 \\ 0 & \text{if } s \neq 0 \end{cases}$$
where $\Omega$ is a positive definite matrix.
The ordinary least squares estimator of $\beta$ is
$$\begin{aligned}
b &= (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + u) \\
b - \beta &= (X'X)^{-1}X'u \\
E(b - \beta) &= 0.
\end{aligned}$$
So OLSE remains unbiased under autocorrelated disturbances.

The covariance matrix of $b$ is
$$\begin{aligned}
V(b) &= E(b - \beta)(b - \beta)' \\
&= (X'X)^{-1}X'E(uu')X(X'X)^{-1} \\
&= (X'X)^{-1}X'\Omega X(X'X)^{-1} \\
&\neq \sigma_u^2(X'X)^{-1}.
\end{aligned}$$

The residual vector is
$$e = y - Xb = Hy = Hu$$
$$e'e = y'Hy = u'Hu$$
$$E(e'e) = E(u'u) - E\left[ u'X(X'X)^{-1}X'u \right] = n\sigma_u^2 - tr\left[ (X'X)^{-1}X'\Omega X \right].$$
Since $s^2 = \dfrac{e'e}{n-1}$, we have
$$E(s^2) = \frac{n\sigma_u^2}{n-1} - \frac{1}{n-1}\, tr\left[ (X'X)^{-1}X'\Omega X \right],$$
so $s^2$ is a biased estimator of $\sigma^2$. In fact, $s^2$ has a downward bias.

The application of OLS fails in the case of autocorrelation in the data and leads to serious consequences:
- an overly optimistic view from $R^2$,
- narrow confidence intervals,
- the usual $t$-ratio and $F$-ratio tests provide misleading results,
- predictions may have large variances.

Since the disturbances are nonspherical, the generalized least squares estimator of $\beta$ yields more efficient
estimates than OLSE.

The GLSE of $\beta$ is
$$\hat{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$$
$$E(\hat{\beta}) = \beta$$
$$V(\hat{\beta}) = (X'\Omega^{-1}X)^{-1}.$$

The GLSE is the best linear unbiased estimator of $\beta$.

Tests for autocorrelation:

Durbin-Watson test:
The Durbin-Watson (D-W) test is used for testing the hypothesis of a lack of first-order autocorrelation in the
disturbance term. The null hypothesis is
$$H_0: \rho = 0.$$

Use OLS to estimate $\beta$ in $y = X\beta + u$ and obtain the residual vector
$$e = y - Xb = Hy$$
where $b = (X'X)^{-1}X'y$, $H = I - X(X'X)^{-1}X'$.

The D-W test statistic is
$$d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}
= \frac{\sum_{t=2}^{n} e_t^2}{\sum_{t=1}^{n} e_t^2} + \frac{\sum_{t=2}^{n} e_{t-1}^2}{\sum_{t=1}^{n} e_t^2} - 2\,\frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=1}^{n} e_t^2}.$$

For large $n$,
$$d \approx 1 + 1 - 2r = 2(1 - r)$$
where $r$ is the sample autocorrelation coefficient from the residuals based on OLSE and can be regarded as the
regression coefficient of $e_t$ on $e_{t-1}$. Here
positive autocorrelation of the $e_t$'s $\Rightarrow d < 2$,
negative autocorrelation of the $e_t$'s $\Rightarrow d > 2$,
zero autocorrelation of the $e_t$'s $\Rightarrow d \approx 2$.

As $-1 < r < 1$,
if $-1 < r < 0$, then $2 < d < 4$, and
if $0 < r < 1$, then $0 < d < 2$.
So $d$ lies between 0 and 4.

Since $e$ depends on $X$, different data sets give different values of $d$, so the sampling
distribution of $d$ depends on $X$. Consequently, exact critical values of $d$ cannot be tabulated owing to their
dependence on $X$. Durbin and Watson therefore obtained two statistics $\underline{d}$ and $\bar{d}$ such that
$$\underline{d} < d < \bar{d}$$
and their sampling distributions do not depend upon $X$.

Considering the distributions of $\underline{d}$ and $\bar{d}$, they tabulated the critical values as $d_L$ and $d_U$ respectively. They
prepared the tables of critical values for $15 \leq n \leq 100$ and $k \leq 5$. Now tables are available for $6 \leq n \leq 200$ and
$k \leq 10$.

The test procedure is as follows ($H_0: \rho = 0$):

Nature of $H_1$        Reject $H_0$ when                  Retain $H_0$ when            The test is inconclusive when
$H_1: \rho > 0$        $d < d_L$                          $d > d_U$                    $d_L < d < d_U$
$H_1: \rho < 0$        $d > 4 - d_L$                      $d < 4 - d_U$                $4 - d_U < d < 4 - d_L$
$H_1: \rho \neq 0$     $d < d_L$ or $d > 4 - d_L$         $d_U < d < 4 - d_U$          $d_L < d < d_U$ or $4 - d_U < d < 4 - d_L$

Values of $d_L$ and $d_U$ are obtained from the tables.
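A small sketch of computing $d$ from OLS residuals (the residual series is simulated for illustration; statsmodels' durbin_watson computes the same quantity):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

def dw_statistic(e):
    """Durbin-Watson statistic d = sum (e_t - e_{t-1})^2 / sum e_t^2."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# assumed example residuals with positive autocorrelation
rng = np.random.default_rng(5)
e = np.zeros(100)
for t in range(1, 100):
    e[t] = 0.7 * e[t - 1] + rng.normal()

print(dw_statistic(e))          # manual computation, value well below 2
print(durbin_watson(e))         # same quantity from statsmodels
```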

Limitations of the D-W test
1. If $d$ falls in the inconclusive zone, then no conclusive inference can be drawn. This zone becomes
fairly large for low degrees of freedom. One solution is to reject $H_0$ if the test is inconclusive. A
better solution is to modify the test as:
- Reject $H_0$ when $d < d_U$.
- Accept $H_0$ when $d \geq d_U$.
This test gives a satisfactory solution when the values of the $x_i$'s change slowly, e.g., price, expenditure,
etc.

2. The D-W test is not applicable when the intercept term is absent in the model. In such a case, one can
use another critical value, say $d_M$, in place of $d_L$. Tables for the critical values $d_M$ are available.

3. The test is not valid when lagged dependent variables appear as explanatory variables. For example,
$$y_t = \beta_1 y_{t-1} + \beta_2 y_{t-2} + \ldots + \beta_r y_{t-r} + \beta_{r+1} x_{t1} + \ldots + \beta_k x_{t,k-r} + u_t,$$
$$u_t = \rho u_{t-1} + \varepsilon_t.$$
In such a case, Durbin's $h$-test is used, which is given as follows.

Durbin’s h-test
Apply OLS to
$$y_t = \beta_1 y_{t-1} + \beta_2 y_{t-2} + \ldots + \beta_r y_{t-r} + \beta_{r+1} x_{t1} + \ldots + \beta_k x_{t,k-r} + u_t,$$
$$u_t = \rho u_{t-1} + \varepsilon_t$$
and find the OLSE $b_1$ of $\beta_1$. Let its variance be $Var(b_1)$ and its estimator be $\widehat{Var}(b_1)$. Then Durbin's $h$-statistic is
$$h = r\sqrt{\frac{n}{1 - n\,\widehat{Var}(b_1)}}$$
which is asymptotically distributed as $N(0,1)$, where
$$r = \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=2}^{n} e_t^2}.$$

This test is applicable when $n$ is large. When $1 - n\,\widehat{Var}(b_1) < 0$, the test breaks down. In such cases, the
following test procedure can be adopted.

Introduce the lagged residual $e_{t-1}$ as an additional explanatory variable, i.e., apply OLS to the regression of $e_t$ on
$e_{t-1}$ and the original explanatory variables (including the lagged $y$'s), with $\delta$ denoting the coefficient of $e_{t-1}$.
Now test $H_{0A}: \delta = 0$ versus $H_{1A}: \delta \neq 0$ using the $t$-test. If $H_{0A}$ is accepted, then accept $H_0: \rho = 0$.

If $H_{0A}: \delta = 0$ is rejected, then reject $H_0: \rho = 0$.

4. If $H_0: \rho = 0$ is rejected by the D-W test, it does not necessarily mean the presence of first-order
autocorrelation in the disturbances. It could happen because of other reasons also, e.g.,
- the disturbances may follow a higher-order AR process,
- some important variables are omitted,
- the dynamics of the model is misspecified,
- the functional form of the model is incorrect.

Estimation procedures with autocorrelated errors when the autocorrelation coefficient is known

Consider the estimation of the regression coefficients under first-order autoregressive disturbances when the
autocorrelation coefficient is known. The model is
$$y = X\beta + u, \qquad u_t = \rho u_{t-1} + \varepsilon_t$$
and assume that $E(u) = 0$, $E(uu') = \Omega \neq \sigma^2 I$, $E(\varepsilon) = 0$, $E(\varepsilon\varepsilon') = \sigma_\varepsilon^2 I$.

The OLSE of $\beta$ is unbiased but not, in general, efficient, and the estimate of $\sigma^2$ is biased. So we use the
generalized least squares estimation procedure, and the GLSE of $\beta$ is
$$\hat{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$$
where
$$\Omega^{-1} = \begin{pmatrix}
1 & -\rho & 0 & \cdots & 0 & 0 \\
-\rho & 1+\rho^2 & -\rho & \cdots & 0 & 0 \\
0 & -\rho & 1+\rho^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1+\rho^2 & -\rho \\
0 & 0 & 0 & \cdots & -\rho & 1
\end{pmatrix}.$$

To employ this, we proceed as follows:

1. Find a matrix $P$ such that $P'P = \Omega^{-1}$. In this case
$$P = \begin{pmatrix}
\sqrt{1-\rho^2} & 0 & 0 & \cdots & 0 & 0 \\
-\rho & 1 & 0 & \cdots & 0 & 0 \\
0 & -\rho & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0 \\
0 & 0 & 0 & \cdots & -\rho & 1
\end{pmatrix}.$$

2. Transform the variables as
$$y^* = Py, \quad X^* = PX, \quad \varepsilon^* = P\varepsilon.$$
Such a transformation yields
$$y^* = \begin{pmatrix} \sqrt{1-\rho^2}\, y_1 \\ y_2 - \rho y_1 \\ y_3 - \rho y_2 \\ \vdots \\ y_n - \rho y_{n-1} \end{pmatrix}, \qquad
X^* = \begin{pmatrix} \sqrt{1-\rho^2} & \sqrt{1-\rho^2}\, x_{12} & \cdots & \sqrt{1-\rho^2}\, x_{1k} \\ 1-\rho & x_{22} - \rho x_{12} & \cdots & x_{2k} - \rho x_{1k} \\ 1-\rho & x_{32} - \rho x_{22} & \cdots & x_{3k} - \rho x_{2k} \\ \vdots & \vdots & & \vdots \\ 1-\rho & x_{n2} - \rho x_{n-1,2} & \cdots & x_{nk} - \rho x_{n-1,k} \end{pmatrix}.$$

Note that the first observation is treated differently from the other observations. For the first observation,
$$\sqrt{1-\rho^2}\, y_1 = \sqrt{1-\rho^2}\, x_1'\beta + \sqrt{1-\rho^2}\, u_1,$$
whereas for the other observations
$$y_t - \rho y_{t-1} = (x_t - \rho x_{t-1})'\beta + (u_t - \rho u_{t-1}), \quad t = 2, 3, \ldots, n,$$
where $x_t'$ is a row vector of $X$. Also, $\sqrt{1-\rho^2}\, u_1$ and $(u_t - \rho u_{t-1})$ have the same properties. So we
expect these two errors to be uncorrelated and homoskedastic.

If the first column of $X$ is a vector of ones, then the first column of $X^*$ is not constant. Its first element is
$\sqrt{1-\rho^2}$.

Now employ OLS with the observations $y^*$ and $X^*$; then the OLSE of $\beta$ is
$$\beta^* = (X^{*'}X^*)^{-1}X^{*'}y^*,$$
its covariance matrix is
$$V(\hat{\beta}) = \sigma^2(X^{*'}X^*)^{-1} = \sigma^2(X'\Omega^{-1}X)^{-1}$$
and its estimator is
$$\hat{V}(\hat{\beta}) = \hat{\sigma}^2(X'\Omega^{-1}X)^{-1}$$
where
$$\hat{\sigma}^2 = \frac{(y - X\hat{\beta})'\Omega^{-1}(y - X\hat{\beta})}{n - k}.$$

Estimation procedures with autocorrelated errors when the autocorrelation coefficient is unknown

Several procedures have been suggested for estimating the regression coefficients when the autocorrelation
coefficient is unknown. The feasible GLSE of $\beta$ is
$$\hat{\beta}_F = (X'\hat{\Omega}^{-1}X)^{-1}X'\hat{\Omega}^{-1}y$$
where $\hat{\Omega}^{-1}$ is the $\Omega^{-1}$ matrix with $\rho$ replaced by its estimator $\hat{\rho}$.

1. Use of the sample correlation coefficient

The most common method is to use the sample correlation coefficient $r$ between successive residuals as the
natural estimator of $\rho$. The sample correlation can be estimated using the residuals in place of the disturbances
as
$$r = \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=2}^{n} e_t^2}$$
where $e_t = y_t - x_t'b$, $t = 1, 2, \ldots, n$, and $b$ is the OLSE of $\beta$.

Two modifications are suggested for $r$, and either of them can be used in place of $r$.
nk 
1. r*    r is the Theil’s estimator.
 n 1 
d
2. r **  1  for large n where d is the Durbin Watson statistic for H 0 :   0 .
2

2. Durbin procedure:
In the Durbin procedure, the model
$$y_t - \rho y_{t-1} = \beta_0(1-\rho) + \beta_1(x_t - \rho x_{t-1}) + \varepsilon_t, \quad t = 2, 3, \ldots, n,$$
is expressed as
$$\begin{aligned}
y_t &= \beta_0(1-\rho) + \rho y_{t-1} + \beta_1 x_t - \beta_1\rho\, x_{t-1} + \varepsilon_t \\
&= \beta_0^* + \rho y_{t-1} + \beta_1 x_t + \beta^* x_{t-1} + \varepsilon_t, \quad t = 2, 3, \ldots, n \qquad (*)
\end{aligned}$$
where $\beta_0^* = \beta_0(1-\rho)$ and $\beta^* = -\beta_1\rho$.

Now run a regression using OLS on model (*) and estimate $\rho$ by the estimated coefficient of $y_{t-1}$.

Another possibility is that, since $\rho \in (-1, 1)$, one can search for a suitable $\rho$ which gives a smaller error sum of
squares.
3. Cochrane-Orcutt procedure:
This procedure utilizes the $P$ matrix defined while estimating $\beta$ when $\rho$ is known. It has the following steps:

(i) Apply OLS to $y_t = \beta_0 + \beta_1 x_t + u_t$ and obtain the residual vector $e$.

(ii) Estimate $\rho$ by
$$r = \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=2}^{n} e_{t-1}^2}.$$
Note that $r$ is a consistent estimator of $\rho$.

(iii) Replace $\rho$ by $r$ in
$$y_t - \rho y_{t-1} = \beta_0(1-\rho) + \beta_1(x_t - \rho x_{t-1}) + \varepsilon_t,$$
apply OLS to the transformed model
$$y_t - r y_{t-1} = \beta_0^* + \beta_1(x_t - r x_{t-1}) + \text{disturbance term}$$
and obtain the estimators of $\beta_0^*$ and $\beta_1$ as $\hat{\beta}_0^*$ and $\hat{\beta}_1$ respectively.

This is the Cochrane-Orcutt procedure. Since two successive applications of OLS are involved, it is also
called a two-step procedure.
This application can be repeated in the procedure as follows:
(I) Put $\hat{\beta}_0^*$ and $\hat{\beta}_1$ in the original model.
(II) Calculate the residual sum of squares.
(III) Calculate $\rho$ by
$$r = \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=2}^{n} e_{t-1}^2}$$
and substitute it in the model
$$y_t - \rho y_{t-1} = \beta_0(1-\rho) + \beta_1(x_t - \rho x_{t-1}) + \varepsilon_t$$
and again obtain the transformed model.
(IV) Apply OLS to this model and calculate the regression coefficients.

This procedure is repeated until convergence is achieved, i.e., the process is iterated until two successive
estimates are nearly the same, so that stability of the estimators is achieved.
This is an iterative and numerically convergent procedure. Such estimates are asymptotically
efficient, and there is a loss of one observation.
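An illustrative sketch of the iterative Cochrane-Orcutt procedure for the simple model $y_t = \beta_0 + \beta_1 x_t + u_t$ (the function and variable names are assumptions, not from the notes):

```python
import numpy as np

def cochrane_orcutt(y, x, n_iter=20, tol=1e-6):
    """Iterative Cochrane-Orcutt estimation with AR(1) errors.

    Returns the estimates of (b0, b1) and rho; one observation is lost in the
    quasi-differencing, as noted in the text.
    """
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]                  # step (i): OLS
    rho = 0.0
    for _ in range(n_iter):
        e = y - X @ beta
        rho_new = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)   # step (ii)
        # step (iii): quasi-difference and re-estimate by OLS
        y_star = y[1:] - rho_new * y[:-1]
        x_star = x[1:] - rho_new * x[:-1]
        Xs = np.column_stack([np.ones(n - 1), x_star])
        b_star = np.linalg.lstsq(Xs, y_star, rcond=None)[0]
        beta = np.array([b_star[0] / (1 - rho_new), b_star[1]])  # recover b0 from b0* = b0(1 - rho)
        if abs(rho_new - rho) < tol:
            rho = rho_new
            break
        rho = rho_new
    return beta, rho
```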

4. Hildreth-Lu procedure or grid-search procedure:

The Hildreth-Lu procedure has the following steps:
(i) Apply OLS to
$$(y_t - \rho y_{t-1}) = \beta_0(1-\rho) + \beta_1(x_t - \rho x_{t-1}) + \varepsilon_t, \quad t = 2, 3, \ldots, n,$$
using different values of $\rho$ $(-1 \leq \rho \leq 1)$, such as $\rho = 0.1, 0.2, \ldots$ .
(ii) Calculate the residual sum of squares in each case.
(iii) Select that value of $\rho$ for which the residual sum of squares is smallest.

Suppose we get $\rho = 0.4$. Now choose a finer grid. For example, choose $\rho$ such that $0.3 < \rho < 0.5$, consider
$\rho = 0.31, 0.32, \ldots, 0.49$, and pick the $\rho$ with the smallest residual sum of squares. Such iteration
can be repeated until a suitable value of $\rho$ corresponding to the minimum residual sum of squares is obtained.
The selected final value of $\rho$ can then be used for transforming the model as in the Cochrane-Orcutt
procedure. The estimators obtained with this procedure are as efficient as those obtained by the Cochrane-Orcutt
procedure, and there is a loss of one observation.
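A minimal sketch of the grid search (illustrative; the coarse grid and the refinement rule are assumptions in the spirit of the steps above):

```python
import numpy as np

def hildreth_lu(y, x, grid=None):
    """Grid search over rho; returns the rho with the smallest residual sum of squares."""
    if grid is None:
        grid = np.arange(-0.9, 0.91, 0.1)    # coarse grid; refine around the best value afterwards
    best_rho, best_ss = None, np.inf
    for rho in grid:
        y_star = y[1:] - rho * y[:-1]
        x_star = x[1:] - rho * x[:-1]
        Xs = np.column_stack([np.ones(len(y_star)), x_star])
        b = np.linalg.lstsq(Xs, y_star, rcond=None)[0]
        ss = np.sum((y_star - Xs @ b) ** 2)
        if ss < best_ss:
            best_rho, best_ss = rho, ss
    return best_rho, best_ss

# a finer grid around the first-stage winner, e.g. np.arange(best_rho - 0.09, best_rho + 0.10, 0.01),
# can then be searched in the same way.
```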

5. Prais-Winsten procedure
This is also an iterative procedure based on a two-step transformation.
(i) Estimate ρ by
ρ̂ = Σ_{t=2}^n e_t e_{t−1} / Σ_{t=2}^n e_{t−1}²
where the e_t's are residuals based on the OLSE.

(ii) Replace ρ by ρ̂ in the model as in the Cochrane-Orcutt procedure, but retain the first observation through the transformation
√(1 − ρ̂²) y₁ = √(1 − ρ̂²) β₀ + β √(1 − ρ̂²) x₁ + √(1 − ρ̂²) u₁
y_t − ρ̂y_{t−1} = (1 − ρ̂)β₀ + β(x_t − ρ̂x_{t−1}) + (u_t − ρ̂u_{t−1}),  t = 2, 3, ..., n.

(iii) Use OLS for estimating the parameters.


The estimators obtained with this procedure are asymptotically as efficient as the best linear unbiased
estimators. There is no loss of any observation.
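A sketch of the Prais-Winsten transformation in NumPy for a single regressor, assuming ρ̂ has already been computed (the helper name is hypothetical):

```python
import numpy as np

def prais_winsten_transform(y, x, rho):
    """Build the transformed response/design, keeping the first observation with weight sqrt(1 - rho**2)."""
    w = np.sqrt(1.0 - rho ** 2)
    y_star = np.concatenate([[w * y[0]], y[1:] - rho * y[:-1]])
    const  = np.concatenate([[w],        np.full(len(y) - 1, 1.0 - rho)])   # column multiplying beta0
    x_star = np.concatenate([[w * x[0]], x[1:] - rho * x[:-1]])
    return y_star, np.column_stack([const, x_star])

# y_star, X_star = prais_winsten_transform(y, x, rho_hat)
# beta0_hat, beta_hat = np.linalg.lstsq(X_star, y_star, rcond=None)[0]   # no observation is lost
```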

6. Maximum likelihood procedure

Assuming that y ~ N(Xβ, σ²Ω), the likelihood function for β, ρ and σ² is

L = (2πσ²)^{−n/2} |Ω|^{−1/2} exp[ −(1/(2σ²)) (y − Xβ)'Ω⁻¹(y − Xβ) ].

Ignoring the constant and using |Ω| = 1/(1 − ρ²), the log-likelihood is

ln L = ln L(β, σ², ρ) = −(n/2) ln σ² + (1/2) ln(1 − ρ²) − (1/(2σ²)) (y − Xβ)'Ω⁻¹(y − Xβ).

The maximum likelihood estimators of β, ρ and σ² can be obtained by solving the normal equations

∂ln L/∂β = 0,  ∂ln L/∂ρ = 0,  ∂ln L/∂σ² = 0.
These normal equations turn out to be nonlinear in the parameters and cannot be easily solved.
One solution is to
- first derive the maximum likelihood estimator of σ²,
- substitute it back into the likelihood function and obtain the likelihood function as a function of β and ρ,
- maximize this likelihood function with respect to β and ρ.

Thus
∂ln L/∂σ² = 0  ⇒  −n/(2σ²) + (1/(2σ⁴)) (y − Xβ)'Ω⁻¹(y − Xβ) = 0
⇒  σ̂² = (1/n) (y − Xβ)'Ω⁻¹(y − Xβ)
is the estimator of σ².

Substituting σ̂² in place of σ² in the log-likelihood function yields

ln L* = ln L*(β, ρ) = −(n/2) ln[(1/n)(y − Xβ)'Ω⁻¹(y − Xβ)] + (1/2) ln(1 − ρ²) − n/2
      = −(n/2) [ ln{(y − Xβ)'Ω⁻¹(y − Xβ)} − (1/n) ln(1 − ρ²) ] + k
      = k − (n/2) ln[ (y − Xβ)'Ω⁻¹(y − Xβ) / (1 − ρ²)^{1/n} ]

where k = (n/2) ln n − n/2.

Maximization of ln L* is equivalent to minimizing the function

(y − Xβ)'Ω⁻¹(y − Xβ) / (1 − ρ²)^{1/n}.

Using the optimization techniques of nonlinear regression, this function can be minimized and estimates of β and ρ can be obtained.

If n is large and ρ is not too close to one, then the factor (1 − ρ²)^{1/n} is close to one and can be ignored, and the estimates of β will be the same as those obtained by nonlinear least-squares estimation.
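A sketch of minimizing the concentrated objective above numerically with SciPy. The tridiagonal form of Ω⁻¹ for AR(1) disturbances and the search bounds are assumptions made for the illustration and are not quoted from the notes:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def concentrated_objective(rho, y, X):
    """(y - Xb)' Omega^{-1} (y - Xb) / (1 - rho^2)^(1/n), with b the GLS estimate for this rho."""
    n = len(y)
    # assumed Omega^{-1} for AR(1) disturbances: tridiagonal with 1 at the corners,
    # 1 + rho^2 on the interior diagonal and -rho on the off-diagonals
    Oinv = np.eye(n) * (1 + rho ** 2)
    Oinv[0, 0] = Oinv[-1, -1] = 1.0
    idx = np.arange(n - 1)
    Oinv[idx, idx + 1] = Oinv[idx + 1, idx] = -rho
    b = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)   # GLS coefficients for the given rho
    r = y - X @ b
    return (r @ Oinv @ r) / (1 - rho ** 2) ** (1.0 / n)

# res = minimize_scalar(concentrated_objective, bounds=(-0.99, 0.99), args=(y, X), method="bounded")
# rho_mle = res.x
```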

Chapter 10
Dummy Variable Models

In general, the explanatory variables in any regression analysis are assumed to be quantitative in nature. For
example, the variables like temperature, distance, age etc. are quantitative in the sense that they are recorded
on a well-defined scale.

In many applications, the variables can not be defined on a well-defined scale, and they are qualitative in
nature.

For example, the variables like sex (male or female), colour (black, white), nationality, employment status
(employed, unemployed) are defined on a nominal scale. Such variables do not have any natural scale of
measurement. Such variables usually indicate the presence or absence of a “quality” or an attribute like
employed or unemployed, graduate or non-graduate, smokers or non- smokers, yes or no, acceptance or
rejection, so they are defined on a nominal scale. Such variables can be quantified by artificially constructing
the variables that take the values, e.g., 1 and 0, where “1” usually indicates the presence of the attribute and “0” usually indicates its absence. For example, “1” indicates that the person is male and “0” indicates that the person is female. Similarly, “1” may indicate that the person is employed and “0” that the person is unemployed.

Such variables classify the data into mutually exclusive categories. These variables are called indicator
variable or dummy variables.

Usually, the indicator variables take on the values 0 and 1 to identify the mutually exclusive classes of the
explanatory variables. For example,
D = 1 if person is male
  = 0 if person is female,

D = 1 if person is employed
  = 0 if person is unemployed.

Here we use the notation D in place of X to denote the dummy variable. The choice of 1 and 0 to identify
a category is arbitrary. For example, one can also define the dummy variable in the above examples as

D = 1 if person is female
  = 0 if person is male,

D = 1 if person is unemployed
  = 0 if person is employed.
It is also not necessary to choose only 1 and 0 to denote the category. In fact, any distinct value of D will
serve the purpose. The choices of 1 and 0 are preferred as they make the calculations simple, help in the easy
interpretation of the values and usually turn out to be a satisfactory choice.

In a given regression model, qualitative and quantitative variables can also occur together, i.e., some variables may be qualitative and others quantitative.

When all explanatory variables are


- quantitative, then the model is called a regression model,
- qualitative, then the model is called an analysis of variance model and
- quantitative and qualitative both, then the model is called an analysis of covariance model.

Such models can be dealt with within the framework of regression analysis. The usual tools of regression
analysis can be used in the case of dummy variables.

Example:
Consider the following model with x₁ as a quantitative variable and D₂ as an indicator variable:

y = β₀ + β₁x₁ + β₂D₂ + ε,  E(ε) = 0, Var(ε) = σ²
D₂ = 0 if an observation belongs to group A
   = 1 if an observation belongs to group B.
The interpretation of the result is essential. We proceed as follows:
If D₂ = 0, then

y = β₀ + β₁x₁ + β₂·0 + ε
  = β₀ + β₁x₁ + ε
E(y/D₂ = 0) = β₀ + β₁x₁

which is a straight-line relationship with intercept β₀ and slope β₁.

If D₂ = 1, then

y = β₀ + β₁x₁ + β₂·1 + ε
  = (β₀ + β₂) + β₁x₁ + ε
E(y/D₂ = 1) = (β₀ + β₂) + β₁x₁

which is a straight-line relationship with intercept (β₀ + β₂) and slope β₁.

The quantities E(y/D₂ = 0) and E(y/D₂ = 1) are the average responses when an observation belongs to group A and group B, respectively. Thus

β₂ = E(y/D₂ = 1) − E(y/D₂ = 0)

which has an interpretation as the difference between the average values of y with D₂ = 0 and D₂ = 1.

Graphically, the situation looks as in the following figure, which shows two parallel regression lines with the same variance σ².

[Figure: two parallel lines plotted against x₁, E(y/D₂ = 1) = (β₀ + β₂) + β₁x₁ above E(y/D₂ = 0) = β₀ + β₁x₁, with common slope β₁ and intercepts (β₀ + β₂) and β₀ separated by β₂.]
If there are three explanatory variables in the model with two indicator variables D₂ and D₃, then they will describe three levels, e.g., groups A, B and C. The levels of the indicator variables are as follows:
1. D₂ = 0, D₃ = 0 if the observation is from group A
2. D₂ = 1, D₃ = 0 if the observation is from group B
3. D₂ = 0, D₃ = 1 if the observation is from group C

The concerned regression model is

y = β₀ + β₁x₁ + β₂D₂ + β₃D₃ + ε,  E(ε) = 0, Var(ε) = σ².
In general, if a qualitative variable has m levels, then (m  1) indicator variables are required, and each of
them takes value 0 and 1.

Consider the following examples to understand how to define such indicator variables and how they can be
handled.

Example:
Suppose y denotes the monthly salary of a person and D denotes whether the person is graduate or non-
graduate. The model is
y = β₀ + β₁D + ε,  E(ε) = 0, Var(ε) = σ².

With n observations, the model is


yᵢ = β₀ + β₁Dᵢ + εᵢ,  i = 1, 2, ..., n
E(yᵢ/Dᵢ = 0) = β₀
E(yᵢ/Dᵢ = 1) = β₀ + β₁
β₁ = E(yᵢ/Dᵢ = 1) − E(yᵢ/Dᵢ = 0).
Thus
- β₀ measures the mean salary of a non-graduate.
- β₁ measures the difference in the mean salaries of a graduate and a non-graduate person.

Now consider the same model with two indicator variables defined in the following way:
D_{i1} = 1 if person is graduate
       = 0 if person is non-graduate,
D_{i2} = 1 if person is non-graduate
       = 0 if person is graduate.
The model with n observations is
yᵢ = β₀ + β₁D_{i1} + β₂D_{i2} + εᵢ,  E(εᵢ) = 0, Var(εᵢ) = σ², i = 1, 2, ..., n.
Then we have
1. E(yᵢ/D_{i1} = 0, D_{i2} = 1) = β₀ + β₂ : average salary of a non-graduate
2. E(yᵢ/D_{i1} = 1, D_{i2} = 0) = β₀ + β₁ : average salary of a graduate
3. E(yᵢ/D_{i1} = 0, D_{i2} = 0) = β₀ : cannot exist
4. E(yᵢ/D_{i1} = 1, D_{i2} = 1) = β₀ + β₁ + β₂ : cannot exist.


Notice that in this case

D_{i1} + D_{i2} = 1  for all i,

which is an exact linear constraint: it holds identically whether the person is a graduate or a non-graduate, so the two dummies together simply reproduce the intercept column.

Hence perfect multicollinearity is present in such cases, the rank of the matrix of explanatory variables falls short by 1, β₀, β₁ and β₂ are indeterminate, and the least-squares method breaks down. So the proposition of introducing two indicator variables looks useful, but it leads to serious consequences. This is known as the dummy variable trap.

If the intercept term is ignored, then the model becomes

yᵢ = β₁D_{i1} + β₂D_{i2} + εᵢ,  E(εᵢ) = 0, Var(εᵢ) = σ², i = 1, 2, ..., n
then
E(yᵢ/D_{i1} = 1, D_{i2} = 0) = β₁ : average salary of a graduate
E(yᵢ/D_{i1} = 0, D_{i2} = 1) = β₂ : average salary of a non-graduate.

So when the intercept term is dropped, β₁ and β₂ have proper interpretations as the average salaries of graduate and non-graduate persons, respectively.

Now the parameters can be estimated using ordinary least squares principle, and standard procedures for
drawing inferences can be used.

Rule: When the explanatory variable leads to m mutually exclusive categories classification, then use
(m  1) indicator variables for its representation. Alternatively, use m indicator variables but drop the
intercept term.
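As an illustration of this rule, here is a small NumPy sketch contrasting the two valid codings with the dummy variable trap (the salary figures are invented for the example):

```python
import numpy as np

# hypothetical salaries for 3 graduates followed by 3 non-graduates
y = np.array([50.0, 55.0, 60.0, 30.0, 35.0, 40.0])
grad = np.array([1, 1, 1, 0, 0, 0])

# (a) m = 2 categories, (m - 1) = 1 dummy plus intercept: full rank, slope = mean difference
Xa = np.column_stack([np.ones(6), grad])
print(np.linalg.matrix_rank(Xa), np.linalg.lstsq(Xa, y, rcond=None)[0])

# (b) m = 2 dummies without intercept: full rank, coefficients are the two group means
Xb = np.column_stack([grad, 1 - grad])
print(np.linalg.matrix_rank(Xb), np.linalg.lstsq(Xb, y, rcond=None)[0])

# (c) dummy variable trap: intercept plus both dummies is rank deficient (rank 2, not 3)
Xc = np.column_stack([np.ones(6), grad, 1 - grad])
print(np.linalg.matrix_rank(Xc))
```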

Interaction term:
Suppose a model has two explanatory variables – one quantitative variable and the other an indicator variable. Suppose they interact, and an explanatory variable formed as their interaction is added to the model:
yᵢ = β₀ + β₁x_{i1} + β₂D_{i2} + β₃x_{i1}D_{i2} + εᵢ,  E(εᵢ) = 0, Var(εᵢ) = σ², i = 1, 2, ..., n.

To interpret the model parameters, we proceed as follows:

Suppose the indicator variable is given by
D_{i2} = 1 if the i-th person belongs to group A
       = 0 if the i-th person belongs to group B

yᵢ = salary of the i-th person.
Then
E(yᵢ/D_{i2} = 0) = β₀ + β₁x_{i1} + β₂·0 + β₃x_{i1}·0
                 = β₀ + β₁x_{i1}.

This is a straight line with intercept β₀ and slope β₁. Next

E(yᵢ/D_{i2} = 1) = β₀ + β₁x_{i1} + β₂·1 + β₃x_{i1}·1
                 = (β₀ + β₂) + (β₁ + β₃)x_{i1}.

This is a straight line with intercept term (β₀ + β₂) and slope (β₁ + β₃).

The model
E(yᵢ) = β₀ + β₁x_{i1} + β₂D_{i2} + β₃x_{i1}D_{i2}

thus allows different slopes and different intercept terms for the two groups.

Thus
β₂ reflects the change in the intercept term associated with the change in the group of the person, i.e., when the group changes from A to B.
β₃ reflects the change in the slope associated with the change in the group of the person, i.e., when the group changes from A to B.

Fitting the model
yᵢ = β₀ + β₁x_{i1} + β₂D_{i2} + β₃x_{i1}D_{i2} + εᵢ

is equivalent to fitting two separate regression models corresponding to D_{i2} = 1 and D_{i2} = 0, i.e.,

yᵢ = β₀ + β₁x_{i1} + β₂·1 + β₃x_{i1}·1 + εᵢ
   = (β₀ + β₂) + (β₁ + β₃)x_{i1} + εᵢ
and
yᵢ = β₀ + β₁x_{i1} + β₂·0 + β₃x_{i1}·0 + εᵢ
   = β₀ + β₁x_{i1} + εᵢ
respectively.
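A short NumPy sketch of this equivalence: fitting the interaction model on the pooled data and comparing it with two separate group-wise fits (all numbers are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, n)
D = (rng.uniform(size=n) < 0.5).astype(float)          # 1 = group A, 0 = group B
y = 5 + 1.2 * x + 3 * D + 0.8 * x * D + rng.normal(0, 1, n)

# pooled fit with interaction term
X = np.column_stack([np.ones(n), x, D, x * D])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

# separate fits for D = 1 and D = 0
bA = np.linalg.lstsq(np.column_stack([np.ones((D == 1).sum()), x[D == 1]]), y[D == 1], rcond=None)[0]
bB = np.linalg.lstsq(np.column_stack([np.ones((D == 0).sum()), x[D == 0]]), y[D == 0], rcond=None)[0]

print("group A line from pooled fit:", b0 + b2, b1 + b3, " separate fit:", bA)
print("group B line from pooled fit:", b0, b1, "          separate fit:", bB)
```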

The test of hypothesis becomes convenient by using an indicator variable. For example, if we want to test whether the two regression models are identical, the test of hypothesis involves testing
H₀: β₂ = β₃ = 0
H₁: β₂ ≠ 0 and/or β₃ ≠ 0.

Acceptance of H₀ indicates that only a single model is necessary to explain the relationship.

In another example, if the objective is to test that the two models differ with respect to intercepts only and that they have the same slopes, then the test of hypothesis involves testing
H₀: β₃ = 0
H₁: β₃ ≠ 0.

Indicator variables versus quantitative explanatory variable


The quantitative explanatory variables can be converted into indicator variables. For example, if the ages of
persons are grouped as follows:
Group 1: 1 day to 3 years
Group 2: 3 years to 8 years
Group 3: 8 years to 12 years
Group 4: 12 years to 17 years
Group 5: 17 years to 25 years
then the variable “age” can be represented by four different indicator variables.

Since it is difficult to collect the data on individual ages, so this will help in an easy collection of data. A
disadvantage is that some loss of information occurs. For example, if the ages in years are 2, 3, 4, 5, 6, 7 and
suppose the indicator variable is defined as
Dᵢ = 1 if age of the i-th person is ≥ 5 years
   = 0 if age of the i-th person is < 5 years.

Then these values become 0, 0, 0, 1, 1, 1. Now looking at the value 1, one can not determine if it
corresponds to age 5, 6 or 7 years.

Moreover, if a quantitative explanatory variable is grouped into m categories, then (m  1) parameters are
required whereas if the original variable is used as such, then only one parameter is required.

Treating a quantitative variable as a qualitative variable increases the complexity of the model. The degrees
of freedom for error is also reduced. This can affect the inferences if the data set is small. In large data sets,
such an effect may be small.

The use of indicator variables does not require any assumption about the functional form of the relationship
between study and explanatory variables.

Chapter 11
Specification Error Analysis

The specification of a linear regression model consists of a formulation of the regression relationships and of
statements or assumptions concerning the explanatory variables and disturbances. If any of these is violated,
e.g., incorrect functional form, the improper introduction of disturbance term in the model, etc., then
specification error occurs. In a narrower sense, the specification error refers to explanatory variables.

The complete regression analysis depends on the explanatory variables present in the model. It is understood
in the regression analysis that only correct and important explanatory variables appear in the model. In
practice, after ensuring the correct functional form of the model, the analyst usually has a pool of explanatory
variables which possibly influence the process or experiment. Generally, all such candidate variables are not
used in the regression modeling, but a subset of explanatory variables is chosen from this pool.

While choosing a subset of explanatory variables, there are two possible options:
1. In order to make the model as realistic as possible, the analyst may include as many as
possible explanatory variables.
2. In order to make the model as simple as possible, one may include only a smaller number of explanatory variables.

In such selections, there can be two types of incorrect model specifications.


1. Omission/exclusion of relevant variables.
2. Inclusion of irrelevant variables.

Now we discuss the statistical consequences arising from both situations.

1. Exclusion of relevant variables:


In order to keep the model simple, the analyst may delete some of the explanatory variables which may be of
importance from the point of view of theoretical considerations. There can be several reasons behind such
decisions, e.g., it may be hard to quantify the variables like the taste, intelligence etc. Sometimes it may be
difficult to take correct observations on the variables like income etc.

Let there be k candidate explanatory variables, out of which suppose r variables are included and (k − r) variables are to be deleted from the model. So partition X and β as

X = [X₁  X₂]  and  β = (β₁', β₂')',

where X₁ is of order n×r, X₂ is of order n×(k − r), β₁ is r×1 and β₂ is (k − r)×1.
The model y = Xβ + ε, E(ε) = 0, V(ε) = σ²I can then be expressed as
y = X₁β₁ + X₂β₂ + ε

which is called the full model or true model.

After dropping the (k − r) explanatory variables from the model, the new model is
y = X₁β₁ + δ

which is called the misspecified model or false model.

Applying OLS to the false model, the OLSE of β₁ is

b₁F = (X₁'X₁)⁻¹X₁'y.

The estimation error is obtained as follows:

b₁F = (X₁'X₁)⁻¹X₁'(X₁β₁ + X₂β₂ + ε)
    = β₁ + (X₁'X₁)⁻¹X₁'X₂β₂ + (X₁'X₁)⁻¹X₁'ε
b₁F − β₁ = θ + (X₁'X₁)⁻¹X₁'ε

where θ = (X₁'X₁)⁻¹X₁'X₂β₂.

Thus
E(b₁F − β₁) = θ + (X₁'X₁)⁻¹X₁'E(ε) = θ

which is a linear function of β₂, i.e., of the coefficients of the excluded variables. So b₁F is biased, in general. The bias vanishes if X₁'X₂ = 0, i.e., if X₁ and X₂ are orthogonal or uncorrelated.

The mean squared error matrix of b₁F is

MSE(b₁F) = E(b₁F − β₁)(b₁F − β₁)'
         = E[θθ' + θε'X₁(X₁'X₁)⁻¹ + (X₁'X₁)⁻¹X₁'εθ' + (X₁'X₁)⁻¹X₁'εε'X₁(X₁'X₁)⁻¹]
         = θθ' + 0 + 0 + σ²(X₁'X₁)⁻¹X₁'IX₁(X₁'X₁)⁻¹
         = θθ' + σ²(X₁'X₁)⁻¹.

So efficiency generally declines. Note that the second term is the conventional form of the MSE.
The residual sum of squares gives the estimator

s² = SS_res/(n − r) = e'e/(n − r)

where e = y − X₁b₁F = H₁y and H₁ = I − X₁(X₁'X₁)⁻¹X₁'.
Thus
H₁y = H₁(X₁β₁ + X₂β₂ + ε)
    = 0 + H₁(X₂β₂ + ε)
    = H₁(X₂β₂ + ε).

y'H₁y = (X₁β₁ + X₂β₂ + ε)'H₁(X₂β₂ + ε)
      = β₂'X₂'H₁X₂β₂ + β₂'X₂'H₁ε + ε'H₁X₂β₂ + ε'H₁ε.

E(s²) = (1/(n − r)) [ E(β₂'X₂'H₁X₂β₂) + 0 + 0 + E(ε'H₁ε) ]
      = (1/(n − r)) [ β₂'X₂'H₁X₂β₂ + (n − r)σ² ]
      = σ² + (1/(n − r)) β₂'X₂'H₁X₂β₂.

Thus s² is a biased estimator of σ², and s² provides an overestimate of σ². Note that even if X₁'X₂ = 0, s² still gives an overestimate of σ². So the statistical inferences based on this will be faulty. The t-test and confidence region will be invalid in this case.
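A small simulation sketch of these consequences (bias of b₁F and overestimation of σ² when a relevant, correlated variable is omitted); all numbers are invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 500, 1.0
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)      # correlated with x1, so X1'X2 != 0
beta1, beta2 = 2.0, 1.5
y = beta1 * x1 + beta2 * x2 + rng.normal(0, sigma, n)

# false model: regress y on x1 only (intercept omitted to keep the sketch short)
X1 = x1.reshape(-1, 1)
b1F = np.linalg.lstsq(X1, y, rcond=None)[0][0]
e = y - x1 * b1F
s2 = e @ e / (n - 1)

print("b1F =", round(b1F, 3), " (true beta1 = 2.0; the gap is the omitted-variable bias)")
print("s^2 =", round(s2, 3), " (overestimates sigma^2 = 1.0)")
```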

If the response is to be predicted at x' = (x₁', x₂'), then using the full model, the predicted value is

ŷ = x'b = x'(X'X)⁻¹X'y

with
E(ŷ) = x'β
Var(ŷ) = σ²[1 + x'(X'X)⁻¹x].

When the subset model is used, then the predictor is

ŷ₁ = x₁'b₁F
and then

E(ŷ₁) = x₁'(X₁'X₁)⁻¹X₁'E(y)
      = x₁'(X₁'X₁)⁻¹X₁'E(X₁β₁ + X₂β₂ + ε)
      = x₁'(X₁'X₁)⁻¹X₁'(X₁β₁ + X₂β₂)
      = x₁'β₁ + x₁'(X₁'X₁)⁻¹X₁'X₂β₂
      = x₁'β₁ + x₁'θ.

Thus ŷ₁ is a biased predictor of y. It is unbiased when X₁'X₂ = 0. The MSE of the predictor is

MSE(ŷ₁) = σ²[1 + x₁'(X₁'X₁)⁻¹x₁] + (x₁'θ − x₂'β₂)².

Also
Var(ŷ) ≥ MSE(ŷ₁)

provided V(β̂₂) − β₂β₂' is positive semidefinite.

2. Inclusion of irrelevant variables


Sometimes due to enthusiasm and to make the model more realistic, the analyst may include some
explanatory variables that are not very relevant to the model. Such variables may contribute very little to the
explanatory power of the model. This may tend to reduce the degrees of freedom (n  k ) , and consequently,
the validity of inference drawn may be questionable. For example, the value of the coefficient of
determination will increase, indicating that the model is getting better, which may not really be true.

Let the true model be

y = Xβ + ε,  E(ε) = 0, V(ε) = σ²I,
which comprises k explanatory variables. Suppose now r additional explanatory variables are added to the model, and the resulting model becomes
y = Xβ + Zγ + δ
where Z is an n×r matrix of n observations on each of the r additional explanatory variables, γ is an r×1 vector of regression coefficients associated with Z, and δ is the disturbance term. This model is termed the false model.

Applying OLS to the false model, the normal equations for the OLSEs bF and cF of β and γ are

X'X bF + X'Z cF = X'y        (1)
Z'X bF + Z'Z cF = Z'y        (2)

Premultiplying equation (2) by X'Z(Z'Z)⁻¹ gives

X'Z(Z'Z)⁻¹Z'X bF + X'Z(Z'Z)⁻¹Z'Z cF = X'Z(Z'Z)⁻¹Z'y.        (3)

Subtracting equation (3) from equation (1), we get

[X'X − X'Z(Z'Z)⁻¹Z'X] bF = X'y − X'Z(Z'Z)⁻¹Z'y
X'[I − Z(Z'Z)⁻¹Z']X bF = X'[I − Z(Z'Z)⁻¹Z']y
⇒ bF = (X'H_Z X)⁻¹X'H_Z y
where H_Z = I − Z(Z'Z)⁻¹Z'.

The estimation error of bF is

bF − β = (X'H_Z X)⁻¹X'H_Z y − β
       = (X'H_Z X)⁻¹X'H_Z(Xβ + ε) − β
       = (X'H_Z X)⁻¹X'H_Z ε.
Thus
E(bF − β) = (X'H_Z X)⁻¹X'H_Z E(ε) = 0,

so bF is unbiased even when some irrelevant variables are added to the model.

The covariance matrix is

V(bF) = E(bF − β)(bF − β)'
      = E[(X'H_Z X)⁻¹X'H_Z εε'H_Z X(X'H_Z X)⁻¹]
      = σ²(X'H_Z X)⁻¹X'H_Z I H_Z X(X'H_Z X)⁻¹
      = σ²(X'H_Z X)⁻¹.

If OLS is applied to the true model, then
bT = (X'X)⁻¹X'y

with E(bT) = β and

V(bT) = σ²(X'X)⁻¹.

To compare bF and bT, we use the following result.

Result: If A and B are two positive definite matrices, then A − B is at least positive semi-definite if B⁻¹ − A⁻¹ is also at least positive semi-definite.

Let
A = (X'H_Z X)⁻¹
B = (X'X)⁻¹
B⁻¹ − A⁻¹ = X'X − X'H_Z X
          = X'X − X'X + X'Z(Z'Z)⁻¹Z'X
          = X'Z(Z'Z)⁻¹Z'X
which is at least a positive semi-definite matrix. This implies that the efficiency generally declines unless X'Z = 0. If X'Z = 0, i.e., X and Z are orthogonal, then both are equally efficient.
The residual sum of squares under the false model is
SS_res = eF'eF
where
eF = y − XbF − ZcF
bF = (X'H_Z X)⁻¹X'H_Z y
cF = (Z'Z)⁻¹Z'y − (Z'Z)⁻¹Z'XbF
   = (Z'Z)⁻¹Z'(y − XbF)
   = (Z'Z)⁻¹Z'[I − X(X'H_Z X)⁻¹X'H_Z]y
   = (Z'Z)⁻¹Z'H_{ZX} y
with
H_Z = I − Z(Z'Z)⁻¹Z'
H_{ZX} = I − X(X'H_Z X)⁻¹X'H_Z
H_{ZX}² = H_{ZX} (idempotent).

So
eF = y − X(X'H_Z X)⁻¹X'H_Z y − Z(Z'Z)⁻¹Z'H_{ZX} y
   = [I − X(X'H_Z X)⁻¹X'H_Z − Z(Z'Z)⁻¹Z'H_{ZX}]y
   = [H_{ZX} − (I − H_Z)H_{ZX}]y
   = H_Z H_{ZX} y
   = H*_{ZX} y,  where H*_{ZX} = H_Z H_{ZX}.
Thus
SS_res = eF'eF
       = y'H_{ZX}'H_Z H_Z H_{ZX} y
       = y'H_Z H_{ZX} y
       = y'H*_{ZX} y
E(SS_res) = σ² tr(H*_{ZX})
          = σ²(n − k − r)
E[SS_res/(n − k − r)] = σ².
So SS_res/(n − k − r) is an unbiased estimator of σ².

A comparison of exclusion and inclusion of variables is as follows:

                                               Exclusion type                   Inclusion type
Estimation of coefficients                     Biased                           Unbiased
Efficiency                                     Generally declines               Declines
Estimation of the disturbance term             Over-estimate                    Unbiased
Conventional test of hypothesis and            Invalid and faulty inferences    Valid though erroneous
confidence region

Chapter 12
Tests for Structural Change and Stability

A fundamental assumption in regression modeling is that the pattern of data on dependent and independent
variables remains the same throughout the period over which the data is collected. Under such an
assumption, a single linear regression model is fitted over the entire data set. The regression model is
estimated and used for prediction assuming that the parameters remain same over the entire time period of
estimation and prediction. When it is suspected that there exists a change in the pattern of data, then the
fitting of single linear regression model may not be appropriate, and more than one regression models may
be required to be fitted. Before taking such a decision to fit a single or more than one regression models, a
question arises how to test and decide if there is a change in the structure or pattern of data. Such changes
can be characterized by the change in the parameters of the model and are termed as structural change.

Now we consider some examples to understand the problem of structural change in the data. Suppose the
data on the consumption pattern is available for several years and suppose there was a war in between the
years over which the consumption data is available. Obviously, the consumption pattern before and after the
war does not remain the same as the economy of the country gets disturbed. So if a model
yᵢ = β₀ + β₁Xᵢ₁ + ... + βₖXᵢₖ + εᵢ,  i = 1, 2, ..., n

is fitted then the regression coefficients before and after the war period will change. Such a change is
referred to as a structural break or structural change in the data. A better option, in this case, would be to fit
two different linear regression models- one for the data before the war and another for the data after the war.

In another example, suppose the study variable is the salary of a person, and the explanatory variable is the
number of years of schooling. Suppose the objective is to find if there is any discrimination in the salaries of
males and females. To know this, two different regression models can be fitted – one for male employees and another for female employees. By calculating and comparing the regression coefficients of both the models,
one can check the presence of sex discrimination in the salaries of male and female employees.

Consider another example of structural change. Suppose an experiment is conducted to study certain
objectives and data is collected in the USA and India. Then a question arises whether the data sets from both
the countries can be pooled together or not. The data sets can be pooled if they originate from the same
model in the sense that there is no structural change present in the data. In such case, the presence of

structural change in the data can be tested and if there is no change, then both the data sets can be merged
and single regression model can be fitted. If structural change is present, then two models are needed to be
fitted.

The objective is now how to test for the presence of a structural change in the data and stability of regression
coefficients. In other words, we want to test the hypothesis that some of or all the regression coefficients
differ in different subsets of data.

Analysis
We consider here a situation where only one structural change is present in the data. The data, in this case, can be divided into two parts. Suppose we have a data set of n observations which is divided into two parts consisting of n₁ and n₂ observations such that

n₁ + n₂ = n.

Consider the model

y = αℓ + Xβ + ε

where ℓ is an (n×1) vector with all elements unity, α is a scalar denoting the intercept term, X is an (n×k) matrix of observations on k explanatory variables, β is a (k×1) vector of regression coefficients and ε is an (n×1) vector of disturbances.

Now partition ℓ, X and ε into two subgroups based on the n₁ and n₂ observations as

ℓ = (ℓ₁', ℓ₂')',  X = (X₁', X₂')',  ε = (ε₁', ε₂')'

where the orders of ℓ₁ and ℓ₂ are (n₁×1) and (n₂×1), of X₁ and X₂ are (n₁×k) and (n₂×k), and of ε₁ and ε₂ are (n₁×1) and (n₂×1).

Based on these partitions, the two models corresponding to the two subgroups are
y₁ = αℓ₁ + X₁β + ε₁
y₂ = αℓ₂ + X₂β + ε₂.

In matrix notation, we can write

Model (1):   ( y₁ )   ( ℓ₁  X₁ ) ( α )   ( ε₁ )
             ( y₂ ) = ( ℓ₂  X₂ ) ( β ) + ( ε₂ )
and term it as Model (1).

In this case, the intercept terms and regression coefficients remain the same for both the submodels. So there
is no structural change in this situation.

The problem of structural change can be characterized if intercept terms and/or regression coefficients in the
submodels are different.

If the structural change is caused due to a change in the intercept terms only, then the situation is characterized by the following model:
y₁ = α₁ℓ₁ + X₁β + ε₁
y₂ = α₂ℓ₂ + X₂β + ε₂
or
Model (2):   ( y₁ )   ( ℓ₁  0   X₁ ) ( α₁ )   ( ε₁ )
             ( y₂ ) = ( 0   ℓ₂  X₂ ) ( α₂ ) + ( ε₂ )
                                     ( β  )

If the structural change is due to different intercept terms as well as different regression coefficients, then the model is
y₁ = α₁ℓ₁ + X₁β₁ + ε₁
y₂ = α₂ℓ₂ + X₂β₂ + ε₂
or
Model (3):   ( y₁ )   ( ℓ₁  0   X₁  0  ) ( α₁ )   ( ε₁ )
             ( y₂ ) = ( 0   ℓ₂  0   X₂ ) ( α₂ ) + ( ε₂ )
                                         ( β₁ )
                                         ( β₂ )
The test of hypothesis related to the test of structural change is conducted by testing any one of the following null hypotheses, depending upon the situation:
(I)   H₀: α₁ = α₂
(II)  H₀: β₁ = β₂
(III) H₀: α₁ = α₂, β₁ = β₂.

To construct the test statistic, apply ordinary least squares estimation to models (1), (2) and (3) and obtain
the residual sum of squares as RSS1 , RSS2 and RSS3 respectively.

Note that the degrees of freedom associated with

- RSS₁ from model (1) is n − (k + 1),
- RSS₂ from model (2) is n − (k + 1) − 1 = n − (k + 2),
- RSS₃ from model (3) is n − (k + 1) − (k + 1) = n − 2(k + 1).

The null hypothesis H₀: α₁ = α₂, i.e., different intercept terms, is tested by the statistic

F = [(RSS₁ − RSS₂)/1] / [RSS₂/(n − k − 2)]

which follows F(1, n − k − 2) under H₀. This statistic tests α₁ = α₂ in model (2) using model (1), i.e., model (1) contrasted with model (2).

The null hypothesis H₀: β₁ = β₂, i.e., different regression coefficients, is tested by

F = [(RSS₂ − RSS₃)/k] / [RSS₃/(n − 2k − 2)]

which follows F(k, n − 2k − 2) under H₀. This statistic tests β₁ = β₂ in model (3) using model (2), i.e., model (2) contrasted with model (3).

The null hypothesis H₀: α₁ = α₂, β₁ = β₂, i.e., different intercepts and different slope parameters, can be jointly tested by the test statistic

F = [(RSS₁ − RSS₃)/(k + 1)] / [RSS₃/(n − 2k − 2)]

which follows F(k + 1, n − 2k − 2) under H₀. This statistic jointly tests α₁ = α₂ and β₁ = β₂ in model (3) using model (1), i.e., model (1) contrasted with model (3). This test is known as the Chow test. It requires n₁ > k and n₂ > k for the stability of the regression coefficients in the two models. The development of this test, which is based on the set-up of the analysis of variance test, is as follows.
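A compact sketch of computing the residual sums of squares and the joint Chow F statistic with NumPy (the split point and the data are assumed to be given; SciPy is used only for the p-value):

```python
import numpy as np
from scipy import stats

def rss(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

def chow_test(y1, X1, y2, X2):
    """Joint test of equal intercepts and slopes in the two subsamples (model (1) vs model (3))."""
    n1, k = X1.shape
    n2 = X2.shape[0]
    add_const = lambda X: np.column_stack([np.ones(len(X)), X])
    # RSS1: single pooled regression; RSS3: separate regressions in each subsample
    rss1 = rss(add_const(np.vstack([X1, X2])), np.concatenate([y1, y2]))
    rss3 = rss(add_const(X1), y1) + rss(add_const(X2), y2)
    df1, df2 = k + 1, n1 + n2 - 2 * (k + 1)
    F = ((rss1 - rss3) / df1) / (rss3 / df2)
    return F, 1 - stats.f.cdf(F, df1, df2)

# F, p = chow_test(y[:n1], X[:n1], y[n1:], X[n1:])
```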

Development of Chow test:
Consider the models

y₁ = X₁β₁ + ε₁        (i)
y₂ = X₂β₂ + ε₂        (ii)
y  = Xβ  + ε          (iii)

where y₁, X₁, β₁, ε₁ are of orders (n₁×1), (n₁×p), (p×1), (n₁×1); y₂, X₂, β₂, ε₂ are of orders (n₂×1), (n₂×p), (p×1), (n₂×1); y, X, β, ε are of orders (n×1), (n×p), (p×1), (n×1); n = n₁ + n₂; and p = k + 1, which includes k explanatory variables and an intercept term.

Define

H₁ = I₁ − X₁(X₁'X₁)⁻¹X₁'
H₂ = I₂ − X₂(X₂'X₂)⁻¹X₂'
H  = I − X(X'X)⁻¹X'
where I₁ and I₂ are the identity matrices of orders (n₁×n₁) and (n₂×n₂). The residual sums of squares based on models (i), (ii) and (iii) are obtained as

RSS₁ = ε₁'H₁ε₁
RSS₂ = ε₂'H₂ε₂
RSS  = ε'Hε.
Then define
H₁* = ( H₁  0 ),   H₂* = ( 0  0  ).
      ( 0   0 )          ( 0  H₂ )
Both H₁* and H₂* are (n×n) matrices. Now RSS₁ and RSS₂ can be re-expressed as

RSS₁ = ε'H₁*ε
RSS₂ = ε'H₂*ε

where ε is the disturbance term related to model (iii), based on the (n₁ + n₂) observations.

Note that H₁*H₂* = 0, which implies that RSS₁ and RSS₂ are independently distributed.

We can write

H = I − X(X'X)⁻¹X'
  = ( I₁  0  )   ( X₁ )
    ( 0   I₂ ) − ( X₂ ) (X'X)⁻¹ ( X₁'  X₂' )
  = ( H₁₁  H₁₂ )
    ( H₂₁  H₂₂ )

where H₁₁ = I₁ − X₁(X'X)⁻¹X₁'
      H₁₂ = −X₁(X'X)⁻¹X₂'
      H₂₁ = −X₂(X'X)⁻¹X₁'
      H₂₂ = I₂ − X₂(X'X)⁻¹X₂'.

Define
H* = H₁* + H₂*

so that
RSS₃ = ε'H*ε
RSS₁ = ε'Hε.

Note that (H − H*) and H* are idempotent matrices. Also (H − H*)H* = 0. First, we see how this result holds.

Consider
(H − H₁*)H₁* = ( H₁₁ − H₁   H₁₂ ) ( H₁  0 )
               ( H₂₁         H₂₂ ) ( 0   0 )
             = ( (H₁₁ − H₁)H₁   0 )
               ( H₂₁H₁           0 ).

Since X₁'H₁ = 0, we have
H₂₁H₁ = 0
H₁₁H₁ = H₁.

Also, since H₁ is idempotent, it follows that

(H₁₁ − H₁)H₁ = H₁ − H₁ = 0.
Thus (H − H₁*)H₁* = 0
or HH₁* = H₁*.

Similarly, it can be shown that

(H − H₂*)H₂* = 0
or HH₂* = H₂*.
This implies that

(H − H₁* − H₂*)(H₁* + H₂*) = 0
or (H − H*)H* = 0.

Also, we have
tr H = n − p
tr H* = tr H₁* + tr H₂*
      = (n₁ − p) + (n₂ − p)
      = n − 2p
      = n − 2k − 2.

Hence RSS₁ − RSS₃ and RSS₃ are independently distributed. Further,

(RSS₁ − RSS₃)/σ² ~ χ²_p,

RSS₃/σ² ~ χ²_{n−2p}.

Hence, under the null hypothesis,

[(RSS₁ − RSS₃)/p] / [RSS₃/(n − 2p)] ~ F(p, n − 2p)

or  [(RSS₁ − RSS₃)/(k + 1)] / [RSS₃/(n − 2k − 2)] ~ F(k + 1, n − 2k − 2).

Limitations of these tests
1. All tests are based on the assumption that σ² remains the same. So first the stability of σ² should be checked, and then these tests can be used.
2. It is assumed in these tests that the point of change is exactly known. In practice, it is difficult to find
such a point at which the change occurs. It is more difficult to know such point when the change
occurs slowly. These tests are not applicable when the point of change is unknown. An ad-hoc
technique when the point of change is unknown is to delete the data of transition period.
3. When there are more than one points of structural change, then the analysis becomes difficult.

Chapter 13
Asymptotic Theory and Stochastic Regressors
The nature of explanatory variable is assumed to be non-stochastic or fixed in repeated samples in any
regression analysis. Such an assumption is appropriate for those experiments which are conducted inside the
laboratories where the experimenter can control the values of explanatory variables. Then the repeated
observations on study variable can be obtained for fixed values of explanatory variables. In practice, such
an assumption may not always be satisfied. Sometimes, the explanatory variables in a given model are the
study variable in another model. Thus the study variable depends on the explanatory variables that are
stochastic in nature. Under such situations, the statistical inferences drawn from the linear regression model
based on the assumption of fixed explanatory variables may not remain valid.

We assume now that the explanatory variables are stochastic but uncorrelated with the disturbance term. In
case, they are correlated then the issue is addressed through instrumental variable estimation. Such a
situation arises in the case of measurement error models.

Stochastic regressors model


Consider the linear regression model

y = Xβ + ε

where X is an (n×k) matrix of n observations on k explanatory variables X₁, X₂, ..., Xₖ which are stochastic in nature, y is an (n×1) vector of n observations on the study variable, β is a (k×1) vector of regression coefficients and ε is the (n×1) vector of disturbances. Under the assumptions E(ε) = 0 and V(ε) = σ²I, the distribution of εᵢ conditional on xᵢ' satisfies these properties for all values of X, where xᵢ' denotes the i-th row of X. This is demonstrated as follows:

Let p ( ε i | xi' ) be the conditional probability density function of ε i given xi' and p ( ε i ) is the unconditional

probability density function of ε i . Then

E ( ε i | xi' ) = ∫ ε i p ( ε i | xi' ) d ε i
= ∫ ε i p (ε i ) dε i
= E (ε i )
=0
E ( ε i2 | xi' ) = ∫ ε i2 p ( ε i | xi' ) d ε i
= ∫ ε i2 p ( ε i ) d ε i
= E ( ε i2 )
= σ 2.

In case, ε i and xi' are independent, then p ( ε i | xi' ) = p ( ε i ) .

Least squares estimation of parameters


The additional assumption that the explanatory variables are stochastic poses no problem in the ordinary
least squares estimation of β and σ². The OLSE of β is obtained by minimizing (y − Xβ)'(y − Xβ) with respect to β as

b = (X'X)⁻¹X'y

and the estimator of σ² is obtained as

s² = (1/(n − k)) (y − Xb)'(y − Xb).

Maximum likelihood estimation of parameters:


Assuming ε ~ N(0, σ²I) in the model y = Xβ + ε, along with X being stochastic and independent of ε, the joint probability density function of ε and X can be derived from the joint probability density function of y and X as follows:

f(ε, X) = f(ε₁, ε₂, ..., εₙ, x₁', x₂', ..., xₙ')
        = [∏_{i=1}^n f(εᵢ)] [∏_{i=1}^n f(xᵢ')]
        = [∏_{i=1}^n f(yᵢ | xᵢ')] [∏_{i=1}^n f(xᵢ')]
        = ∏_{i=1}^n f(yᵢ | xᵢ') f(xᵢ')
        = ∏_{i=1}^n f(yᵢ, xᵢ')
        = f(y₁, y₂, ..., yₙ, x₁', x₂', ..., xₙ')
        = f(y, X).

This implies that the maximum likelihood estimators of β and σ² will be based on

∏_{i=1}^n f(yᵢ | xᵢ') = ∏_{i=1}^n f(εᵢ)

so they will be the same as those based on the assumption that the εᵢ's, i = 1, 2, ..., n, are distributed as N(0, σ²). So the maximum likelihood estimators of β and σ² when the explanatory variables are stochastic are obtained as

β̃ = (X'X)⁻¹X'y
σ̃² = (1/n)(y − Xβ̃)'(y − Xβ̃).

Alternative approach for deriving the maximum likelihood estimates


Alternatively, the maximum likelihood estimators of β and σ 2 can also be derived using the joint
probability density function of y and X .

Note: The vector x' is represented by an underscore in this section to denote that its order is [1×(k − 1)], which excludes the intercept term.

Let xᵢ', i = 1, 2, ..., n, be from a multivariate normal distribution with mean vector μₓ and covariance matrix Σₓₓ, i.e., xᵢ' ~ N(μₓ, Σₓₓ), and let the joint distribution of y and xᵢ' be

( y  )       ( ( μ_y )   ( σ_yy  Σ_yx ) )
( xᵢ' )  ~ N ( ( μₓ  ) ,  ( Σ_xy  Σ_xx ) ).

Let the linear regression model be

yᵢ = β₀ + xᵢ'β₁ + εᵢ,  i = 1, 2, ..., n

where xᵢ' is a [1×(k − 1)] vector of observations on the random vector x, β₀ is the intercept term and β₁ is the [(k − 1)×1] vector of regression coefficients. Further, εᵢ is the disturbance term with εᵢ ~ N(0, σ²) and is independent of x'.

Suppose

( y )       ( ( μ_y )   ( σ_yy  Σ_yx ) )
( x )  ~ N  ( ( μₓ  ) ,  ( Σ_xy  Σ_xx ) ).

The joint probability density function of (y, x') for a single observation is

f(y, x') = (2π)^{−k/2} |Σ|^{−1/2} exp[ −(1/2) ( y − μ_y , x − μₓ )' Σ⁻¹ ( y − μ_y , x − μₓ ) ].

Now using the following result, we find Σ⁻¹.

Result: Let A be a nonsingular matrix which is partitioned suitably as
A = ( B  C )
    ( D  E ),
where E and F = B − CE⁻¹D are nonsingular matrices; then
A⁻¹ = (  F⁻¹             −F⁻¹CE⁻¹              )
      ( −E⁻¹DF⁻¹          E⁻¹ + E⁻¹DF⁻¹CE⁻¹    ).
Note that AA⁻¹ = A⁻¹A = I.
Thus

Σ⁻¹ = (1/σ²) (  1             −Σ_yx Σ_xx⁻¹                        )
             ( −Σ_xx⁻¹Σ_xy     σ²Σ_xx⁻¹ + Σ_xx⁻¹Σ_xy Σ_yx Σ_xx⁻¹ )

where
σ² = σ_yy − Σ_yx Σ_xx⁻¹ Σ_xy.
Then

f(y, x') = (2π)^{−k/2} |Σ|^{−1/2} exp[ −(1/(2σ²)) { ((y − μ_y) − (x − μₓ)'Σ_xx⁻¹Σ_xy)² + σ²(x − μₓ)'Σ_xx⁻¹(x − μₓ) } ].

The marginal distribution of x' is obtained by integrating f(y, x') over y, and the resulting distribution is a (k − 1)-variate multivariate normal distribution:

g(x') = (2π)^{−(k−1)/2} |Σ_xx|^{−1/2} exp[ −(1/2)(x − μₓ)'Σ_xx⁻¹(x − μₓ) ].

The conditional probability density function of y given x' is

f(y | x') = f(y, x') / g(x')
          = (1/√(2πσ²)) exp[ −(1/(2σ²)) {(y − μ_y) − (x − μₓ)'Σ_xx⁻¹Σ_xy}² ]

which is the probability density function of a normal distribution with
• conditional mean
  E(y | x') = μ_y + (x − μₓ)'Σ_xx⁻¹Σ_xy and
• conditional variance
  Var(y | x') = σ_yy(1 − ρ²)
where
ρ² = Σ_yx Σ_xx⁻¹ Σ_xy / σ_yy
is the population multiple correlation coefficient.
In the model
y = β₀ + x'β₁ + ε,
the conditional mean is
E(yᵢ | xᵢ') = β₀ + x'β₁ + E(ε | x)
            = β₀ + x'β₁.

Comparing this conditional mean with the conditional mean of the normal distribution, we obtain the relationship of β₀ and β₁ with the distribution parameters as follows:

β₁ = Σ_xx⁻¹Σ_xy
β₀ = μ_y − μₓ'β₁.

The likelihood function of (y, x') based on a sample of size n is

L = (2π)^{−nk/2} |Σ|^{−n/2} exp[ −(1/2) Σ_{i=1}^n ( yᵢ − μ_y , xᵢ − μₓ )' Σ⁻¹ ( yᵢ − μ_y , xᵢ − μₓ ) ].

Maximizing the log-likelihood function with respect to μ_y, μₓ, Σ_xx and Σ_xy, the maximum likelihood estimates of the respective parameters are obtained as

μ̃_y = ȳ = (1/n) Σ_{i=1}^n yᵢ
μ̃ₓ = x̄ = (1/n) Σ_{i=1}^n xᵢ = (x̄₂, x̄₃, ..., x̄ₖ)
Σ̃_xx = S_xx = (1/n) [ Σ_{i=1}^n xᵢxᵢ' − n x̄ x̄' ]
Σ̃_xy = S_xy = (1/n) [ Σ_{i=1}^n xᵢyᵢ − n x̄ ȳ ]

where xᵢ' = (x_{i2}, x_{i3}, ..., x_{ik}), S_xx is a [(k−1)×(k−1)] matrix with elements (1/n) Σ_t (x_{ti} − x̄ᵢ)(x_{tj} − x̄ⱼ), and S_xy is a [(k−1)×1] vector with elements (1/n) Σ_t (x_{ti} − x̄ᵢ)(y_t − ȳ).

Based on these estimates, the maximum likelihood estimators of β₁ and β₀ are obtained as

β̃₁ = S_xx⁻¹ S_xy
β̃₀ = ȳ − x̄'β̃₁

so that

β̃ = (β̃₀, β̃₁')' = (X'X)⁻¹X'y.

Properties of the least squares estimator:


The estimation error of OLSE b = ( X ' X ) X ' y of β is
−1

(X 'X ) X 'y−β
−1
b−β
=
= ( X ' X ) X '( X β + ε ) − β
−1

= ( X ' X ) X 'ε .
−1

Then assuming that E ( X ' X ) X ' exists, we have


−1
 

E ( X ' X ) X ' ε 
−1
E (b − β ) =
 

 {
= E  E ( X ' X ) X 'ε X 
−1

 }
= E ( X ' X ) X ' E ( ε )
−1
 
=0

because ( X ' X ) X ' and ε are independent. So b is an unbiased estimator of β .


−1

The covariance matrix of b is obtained as
V ( b ) =E ( b − β )( b − β ) '
= E ( X ' X ) X ' εε ' X ( X ' X ) 
−1 −1
 

 {
= E  E ( X ' X ) X ' εε ' X ( X ' X ) X 
−1 −1

 }
= E ( X ' X ) X ' E ( εε ') X ( X ' X ) X 
−1 −1
 
= E ( X ' X ) X ' σ 2 X ( X ' X ) 
−1 −1
 
= σ 2 E ( X ' X )  .
−1
 
Thus the covariance matrix involves a mathematical expectation. The unknown σ 2 can be estimated by
e 'e
σˆ 2 =
n−k

=
( y − Xb ) ' ( y − Xb )
n−k
where e= y − Xb is the residual and

E (σˆ 2 ) = E  E (σˆ 2 X ) 
  e 'e  
= E E   X
 n−k  
= E (σ 2 )
= σ 2.

Note that the OLSE b = ( X ' X ) X ' y involves the stochastic matrix X and stochastic vector y , so b is
−1

not a linear estimator. It is also no more the best linear unbiased estimator of β as in the case when X is

nonstochastic. The estimator of σ 2 as being conditional on given X is an efficient estimator.

Asymptotic theory:
The asymptotic properties of an estimator concerns the properties of the estimator when sample size n
grows large.

For the need and understanding of asymptotic theory, we consider an example. Consider the simple linear
regression model with one explanatory variable and n observations as
yi =β 0 + β1 xi + ε i , E ( ε i ) =0, Var ( ε i ) =σ 2 , i =1, 2,..., n.

The OLSE of β₁ is

b₁ = Σ_{i=1}^n (xᵢ − x̄)(yᵢ − ȳ) / Σ_{i=1}^n (xᵢ − x̄)²

and its variance is

Var(b₁) = σ² / Σ_{i=1}^n (xᵢ − x̄)².

As the sample size grows large, Σ_{i=1}^n (xᵢ − x̄)² grows and the variance of b₁ gets smaller. The shrinkage in variance implies that, as the sample size n increases, the probability density of the OLSE b collapses around its mean because Var(b) tends to zero.

Let there be three OLSEs b₁, b₂ and b₃ based on sample sizes n₁, n₂ and n₃ respectively, such that n₁ < n₂ < n₃, say. If c and δ are some arbitrarily chosen positive constants, then the probability that the value of b lies within the interval β ± c can be made greater than (1 − δ) for a large value of n. This property is the consistency of b, which ensures that when the sample is very large, we can be confident with high probability that b yields an estimate close to β.
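A small Monte Carlo sketch of this behaviour (the sample sizes, coefficient and tolerance are arbitrary choices made for the illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
beta = 2.0
for n in (50, 500, 5000):
    draws = []
    for _ in range(1000):
        x = rng.normal(size=n)
        y = beta * x + rng.normal(size=n)
        draws.append((x @ y) / (x @ x))          # OLS slope (no intercept in this toy model)
    draws = np.array(draws)
    # the fraction of estimates within beta +/- 0.05 grows towards one as n increases
    print(n, "P(|b - beta| < 0.05) ≈", np.mean(np.abs(draws - beta) < 0.05).round(3))
```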

Probability in limit
Let β̂ₙ be an estimator of β based on a sample of size n. Let γ be any small positive constant. Then, for large n, the requirement that β̂ₙ takes values with probability almost one in an arbitrarily small neighborhood of the true parameter value β is

lim_{n→∞} P[ |β̂ₙ − β| < γ ] = 1

which is denoted as

plim β̂ₙ = β

and it is said that β̂ₙ converges to β in probability. The estimator β̂ₙ is then said to be a consistent estimator of β.

A sufficient but not necessary condition for β̂ₙ to be a consistent estimator of β is that

lim_{n→∞} E[β̂ₙ] = β  and  lim_{n→∞} Var[β̂ₙ] = 0.

Consistency of estimators
Now we look at the consistency of the estimators of β and σ².

(i) Consistency of b
Under the assumption that lim_{n→∞} (X'X/n) = Δ exists as a nonstochastic and nonsingular matrix (with finite elements), we have

lim_{n→∞} V(b) = σ² lim_{n→∞} (1/n) (X'X/n)⁻¹
               = σ² lim_{n→∞} (1/n) Δ⁻¹
               = 0.

This implies that the OLSE converges to β in quadratic mean. Thus the OLSE is a consistent estimator of β. This also holds true for the maximum likelihood estimator.

The same conclusion can also be proved using the concept of convergence in probability.

The consistency of the OLSE can be obtained under the weaker assumption that

plim (X'X/n) = Δ*

exists and is a nonsingular and nonstochastic matrix, and that

plim (X'ε/n) = 0.

Since
b − β = (X'X)⁻¹X'ε = (X'X/n)⁻¹ (X'ε/n),
we have
plim(b − β) = plim (X'X/n)⁻¹ · plim (X'ε/n)
            = Δ*⁻¹ · 0
            = 0.
Thus b is a consistent estimator of β. The same is true for the maximum likelihood estimator.

(ii) Consistency of s²
Now we look at the consistency of s² as an estimator of σ². We have

s² = (1/(n − k)) e'e
   = (1/(n − k)) ε'Hε
   = (1 − k/n)⁻¹ (1/n) [ ε'ε − ε'X(X'X)⁻¹X'ε ]
   = (1 − k/n)⁻¹ [ ε'ε/n − (ε'X/n)(X'X/n)⁻¹(X'ε/n) ].

Note that ε'ε/n = (1/n) Σ_{i=1}^n εᵢ², and {εᵢ², i = 1, 2, ..., n} is a sequence of independently and identically distributed random variables with mean σ². Using the law of large numbers,

plim (ε'ε/n) = σ²
plim [ (ε'X/n)(X'X/n)⁻¹(X'ε/n) ] = plim(ε'X/n) · [plim(X'X/n)]⁻¹ · plim(X'ε/n)
                                 = 0 · Δ*⁻¹ · 0
                                 = 0
⇒ plim(s²) = (1 − 0)⁻¹ [σ² − 0] = σ².
Thus s² is a consistent estimator of σ². The same holds true for the maximum likelihood estimate also.

Asymptotic distributions:
Suppose we have a sequence of random variables {α n } with a corresponding sequence of cumulative

density functions { Fn } for a random variable α with cumulative density function F . Then α n converges

in distribution to α if Fn converges to F point wise. In this case, F is called the asymptotic distribution of

αn.

Note that since convergence in probability implies the convergence in distribution, so


plim α=
n α ⇒ α n 
D
→ α ( α n tend to α in distribution), i.e., the asymptotic distribution of α n is F
which is the distribution of α .
Note that
E (α ) : Mean of asymptotic distribution

Var (α ) : Variance of asymptotic distribution


lim E (α n ) : Asymptotic mean
n →∞

2
lim E α n − lim E (α n )  : Asymptotic variance.
n →∞  n →∞ 

Asymptotic distribution of sample mean and least squares estimation


1 n
Let α=
n Y=
n ∑ Yi be the sample mean based on a sample of size n . Since sample mean is a consistent
n i =1

estimator of population mean Y , so


plim Yn = Y

which is constant. Thus the asymptotic distribution of Yn is the distribution of a constant. This is not a
regular distribution as all the probability mass is concentrated at one point. Thus as sample size increases,
the distribution of Yn collapses.

Suppose consider only the one third observations in the sample and find sample mean as
n
3
3
Yn* = ∑ Yi .
n i =1

Then E (Yn* ) = Y
n

and Var (Yn* ) = 2 ∑ Var (Yi )


9 3
n i =1
9 n
= 2 σ2
n 3
3
= σ2
n
→ 0 as n → ∞.

Thus plim Yn* = Y and Yn* has the same degenerate distribution as Yn . Since Var (Yn* ) > Var (Yn ) , so Yn*

is preferred over Yn .

Now we observe the asymptotic behaviour of Yn and Yn* . Consider a sequence of random variables {α n }.

Thus for all n , we have

αn
= n (Yn − Y )
α n*
= n (Yn* − Y )
E (=
αn ) n E (Yn =
−Y ) 0
E (=
α n* ) n E (Yn* =
−Y ) 0
σ2
Var (α n ) = nE (Yn − Y ) = n
2
= σ2
n
3σ 2
Var (α n =
) nE (Yn − Y ) = n n = 3σ 2 .
* * 2

Assuming the population to be normal, the asymptotic distribution of


• Yn is N ( 0, σ 2 )

• Yn* is N ( 0,3σ 2 ) .

So now Yn is preferable over Yn* . The central limit theorem can be used to show that α n will have an
asymptotically normal distribution even if the population is not normally distributed.

Also, since

n (Yn − Y ) ~ N ( 0, σ 2 )
n (Yn − Y )
⇒Z = ~ N ( 0,1)
σ
and this statement holds true in finite sample as well as asymptotic distributions.

Consider the ordinary least squares estimate b = (X'X)⁻¹X'y of β in the linear regression model y = Xβ + ε. If X is nonstochastic, then the finite-sample covariance matrix of b is

V(b) = σ²(X'X)⁻¹.

X 'X
The asymptotic covariance matrix of b under the assumption that lim = Σ xx exists and is nonsingular.
n →∞ n
It is given by
−1
1  X 'X 
σ lim ( X ' X ) = σ lim   lim 
2 2

n →∞
 
n →∞ n n →∞
 n 
= σ 2 .0.Σ −xx1
=0
which is a null matrix.

Consider the asymptotic distribution of n ( b − β ) . Then even if ε is not necessarily normally distributed,

then asymptotically
n ( b − β ) ~ N ( 0, σ 2 Σ −xx1 )
n ( b − β ) ' Σ xx ( b − β )
~ χ k2 .
σ2

X 'X
If is considered as an estimator of Σ xx , then
n
X 'X
n (b − β ) ' (b − β ) (b − β ) ' X ' X (b − β )
n =
σ2 σ2

(
is the usual test statistic as is in the case of finite samples with b ~ N β , σ 2 ( X ' X )
−1
).

Chapter 14
Stein-Rule Estimation

The ordinary least squares estimation of regression coefficients in linear regression model provides the
estimators having minimum variance in the class of linear and unbiased estimators. The criterion of
linearity is desirable because such estimators involve less mathematical complexity, they are easy to
compute, and it is easier to investigate their statistical properties. The criterion of unbiasedness is
attractive because it is intuitively desirable to have an estimator whose expected value, i.e., the mean of
the estimator should be the same as the parameter being estimated. Considerations of linearity and
unbiased estimators sometimes may lead to an unacceptably high price to be paid in terms of the
variability around the true parameter. It is possible to have a nonlinear estimator with better properties. It
is to be noted that one of the main objectives of estimation is to find an estimator whose values have high
concentration around the true parameter. Sometimes it is possible to have a nonlinear and biased estimator
that has smaller variability than the variability of a best linear unbiased estimator of the parameter under
some mild restrictions.

In the multiple regression model

y = Xβ + ε,  E(ε) = 0, V(ε) = σ²I,

where y is (n×1), X is (n×k) and β is (k×1), the ordinary least squares estimator (OLSE) of β is b = (X'X)⁻¹X'y, which is the best linear unbiased estimator of β in the sense that it is linear in y, E(b) = β, and b has the smallest variance among all linear and unbiased estimators of β. Its covariance matrix is

V(b) = E(b − β)(b − β)' = σ²(X'X)⁻¹.

The weighted mean squared error of an estimator β̂ is defined as

E(β̂ − β)'W(β̂ − β) = Σᵢ Σⱼ w_{ij} E(β̂ᵢ − βᵢ)(β̂ⱼ − βⱼ)

where W is a k×k fixed positive definite matrix of weights w_{ij}. The two popular choices of the weight matrix W are:

(i) W is an identity matrix, i.e., W = I; then E(β̂ − β)'(β̂ − β) is called the total mean squared error (MSE) of β̂.

(ii) W = X'X; then

E(β̂ − β)'X'X(β̂ − β) = E(Xβ̂ − Xβ)'(Xβ̂ − Xβ)

is called the predictive mean squared error of β̂. Note that Xβ̂ is the predictor of the average value E(y) = Xβ, and Xβ̂ − Xβ is the corresponding prediction error.

There can be other choices of W and it depends entirely on the analyst how to define the loss function so
that the variability is minimum.

If a random vector with k elements (k > 2) is normally distributed as N(θ, I), θ being the mean vector, then Stein established that, if linearity and unbiasedness are dropped, it is possible to improve upon the maximum likelihood estimator of θ under the criterion of total MSE. Later, this result was generalized by James and Stein for the linear regression model. They demonstrated that if the criteria of linearity and unbiasedness of the estimators are dropped, then a nonlinear estimator can be obtained which has better performance than the best linear unbiased estimator under the criterion of predictive MSE. In other words, James and Stein established that the OLSE is inadmissible for k > 2 under the predictive MSE criterion, i.e., for k > 2 there exists an estimator β̂ such that

E(β̂ − β)'X'X(β̂ − β) ≤ E(b − β)'X'X(b − β)

for all values of β, with strict inequality holding for some values of β. For k ≤ 2, no such estimator exists and we say that "b cannot be beaten" in this sense; for k > 2 it is possible to find estimators which will beat b in this sense. So a nonlinear and biased estimator can be defined which has better performance than the OLSE. Such an estimator is the Stein-rule estimator given by

β̂ = [1 − c σ²/(b'X'Xb)] b   when σ² is known

and

β̂ = [1 − c e'e/(b'X'Xb)] b   when σ² is unknown.

Here c is a fixed positive characterizing scalar, e'e is the residual sum of squares based on the OLSE, and e = y − Xb is the residual. By assigning different values to c, we can generate different estimators. So a class of estimators characterized by c can be defined. This is called a family of Stein-rule estimators.

Let
δ = 1 − c σ²/(b'X'Xb)

be a scalar quantity. Then
β̂ = δ b.

So essentially we say that instead of estimating β₁, β₂, ..., βₖ by b₁, b₂, ..., bₖ, we estimate them by δb₁, δb₂, ..., δbₖ, respectively. So in order to increase the efficiency, the OLSE is multiplied by a constant δ. Thus δ is called the shrinkage factor. As Stein-rule estimators attempt to shrink the components of b towards zero, these estimators are known as shrinkage estimators.
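A small simulation sketch comparing the predictive risk of the OLSE with a Stein-rule estimator using c = k − 2 (the design, σ² and β values are invented for the illustration; σ² is treated as known):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, sigma2 = 30, 6, 1.0
X = rng.normal(size=(n, k))
beta = np.full(k, 0.3)                       # a "small" true beta, where shrinkage helps most
c = k - 2

risk_ols, risk_stein = 0.0, 0.0
reps = 5000
for _ in range(reps):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    shrink = 1.0 - c * sigma2 / (b @ X.T @ X @ b)   # Stein-rule shrinkage factor
    b_stein = shrink * b
    risk_ols += (b - beta) @ X.T @ X @ (b - beta)
    risk_stein += (b_stein - beta) @ X.T @ X @ (b_stein - beta)

print("predictive risk, OLSE      :", round(risk_ols / reps, 3))   # approx sigma2 * k
print("predictive risk, Stein-rule:", round(risk_stein / reps, 3)) # smaller, since k > 2
```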

First, we discuss a result which is used to prove the dominance of the Stein-rule estimator over the OLSE.

Result: Suppose a random vector Z of order (k×1) is normally distributed as N(θ, I), where θ is the mean vector and I is the covariance matrix. Then

E[ Z'(Z − θ) / (Z'Z) ] = (k − 2) E[ 1 / (Z'Z) ].

An important point to be noted in this result is that the left-hand side depends on θ, but the right-hand side is independent of θ.
side is independent of  .

Now we consider the Stein-rule estimator β̂ when σ² is known. Note that

E(β̂) = E(b) − c σ² E[ b / (b'X'Xb) ]
     ≠ β,  in general (the second term is, in general, a non-zero quantity).

Thus the Stein-rule estimator is biased, while the OLSE b is unbiased for β.

The predictive risks of $b$ and $\hat{\beta}$ are

$PR(b) = E\left[(b-\beta)'X'X(b-\beta)\right]$

$PR(\hat{\beta}) = E\left[(\hat{\beta}-\beta)'X'X(\hat{\beta}-\beta)\right].$

The Stein-rule estimator $\hat{\beta}$ is better than OLSE $b$ under the criterion of predictive risk if

$PR(\hat{\beta}) < PR(b).$

Solving the expressions, we get

$b - \beta = (X'X)^{-1}X'\varepsilon$

$\begin{aligned}
PR(b) &= E\left[\varepsilon'X(X'X)^{-1}X'X(X'X)^{-1}X'\varepsilon\right]\\
&= E\left[\varepsilon'X(X'X)^{-1}X'\varepsilon\right]\\
&= E\left[\mathrm{tr}\left\{(X'X)^{-1}X'\varepsilon\varepsilon'X\right\}\right]\\
&= \mathrm{tr}\left\{(X'X)^{-1}X'E(\varepsilon\varepsilon')X\right\}\\
&= \sigma^{2}\,\mathrm{tr}\left\{(X'X)^{-1}X'X\right\}\\
&= \sigma^{2}\,\mathrm{tr}\,I_{k}\\
&= \sigma^{2}k.
\end{aligned}$

$\begin{aligned}
PR(\hat{\beta}) &= E\left[\left(b-\beta-\frac{c\sigma^{2}}{b'X'Xb}\,b\right)'X'X\left(b-\beta-\frac{c\sigma^{2}}{b'X'Xb}\,b\right)\right]\\
&= E\left[(b-\beta)'X'X(b-\beta)\right]-E\left[\frac{c\sigma^{2}}{b'X'Xb}\left\{b'X'X(b-\beta)+(b-\beta)'X'Xb\right\}\right]+E\left[\frac{c^{2}\sigma^{4}}{(b'X'Xb)^{2}}\,b'X'Xb\right]\\
&= \sigma^{2}k - 2E\left[\frac{c\sigma^{2}\,(b-\beta)'X'Xb}{b'X'Xb}\right] + E\left[\frac{c^{2}\sigma^{4}}{b'X'Xb}\right].
\end{aligned}$

Suppose

$Z = \frac{1}{\sigma}(X'X)^{1/2}\,b \quad\text{or}\quad b = \sigma (X'X)^{-1/2}Z$

$\theta = \frac{1}{\sigma}(X'X)^{1/2}\,\beta \quad\text{or}\quad \beta = \sigma (X'X)^{-1/2}\theta$

and $Z \sim N(\theta, I)$, i.e., $Z_1, Z_2, \ldots, Z_k$ are independent. Substituting these values in the expression for $PR(\hat{\beta})$, we get

$\begin{aligned}
PR(\hat{\beta}) &= \sigma^{2}k - 2E\left[\frac{c\sigma^{2}\,\sigma^{2}Z'(Z-\theta)}{\sigma^{2}Z'Z}\right] + E\left[\frac{c^{2}\sigma^{4}}{\sigma^{2}Z'Z}\right]\\
&= \sigma^{2}k - 2c\sigma^{2}E\left[\frac{Z'(Z-\theta)}{Z'Z}\right] + c^{2}\sigma^{2}E\left[\frac{1}{Z'Z}\right]\\
&= \sigma^{2}k - 2c\sigma^{2}(k-2)E\left[\frac{1}{Z'Z}\right] + c^{2}\sigma^{2}E\left[\frac{1}{Z'Z}\right] \qquad (\text{using the result})\\
&= \sigma^{2}\left[k - c\left\{2(k-2)-c\right\}E\left(\frac{1}{Z'Z}\right)\right]\\
&= PR(b) - c\sigma^{2}\left\{2(k-2)-c\right\}E\left[\frac{1}{Z'Z}\right].
\end{aligned}$
Thus

$PR(\hat{\beta}) < PR(b)$

if and only if

$c\,\sigma^{2}\left\{2(k-2)-c\right\}E\left[\frac{1}{Z'Z}\right] > 0.$

Since $Z \sim N(\theta, I)$ and $\sigma^{2} > 0$, $Z'Z$ has a non-central chi-square distribution. Thus

$E\left[\frac{1}{Z'Z}\right] > 0 \;\Rightarrow\; c\left\{2(k-2)-c\right\} > 0.$

Since $c>0$ is assumed, this inequality holds true when

$2(k-2)-c > 0 \quad\text{or}\quad 0 < c < 2(k-2), \text{ provided } k > 2.$

So as long as $0 < c < 2(k-2)$ is satisfied, the Stein-rule estimator will have smaller predictive risk than OLSE. This inequality cannot be satisfied for $k = 1$ and $k = 2$.
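The dominance can also be checked numerically. The following is a minimal Monte Carlo sketch (Python/NumPy assumed; all numerical settings are illustrative assumptions) comparing the empirical predictive risks of $b$ and of the Stein-rule estimator with $0 < c < 2(k-2)$ and known $\sigma^{2}$.

```python
import numpy as np

# Monte Carlo sketch comparing predictive risks of OLS and the Stein-rule estimator.
rng = np.random.default_rng(1)
n, k, sigma2, reps = 50, 6, 1.0, 5000
c = k - 2                          # any value in (0, 2(k-2)) should show dominance
X = rng.normal(size=(n, k))
beta = np.full(k, 0.2)             # illustrative true coefficient vector
XtX = X.T @ X

risk_ols, risk_stein = 0.0, 0.0
for _ in range(reps):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    b = np.linalg.solve(XtX, X.T @ y)
    shrink = 1 - c * sigma2 / (b @ XtX @ b)
    bs = shrink * b
    risk_ols += (b - beta) @ XtX @ (b - beta)
    risk_stein += (bs - beta) @ XtX @ (bs - beta)

print("PR(b)      ~", risk_ols / reps)     # close to sigma2 * k
print("PR(beta_S) ~", risk_stein / reps)   # smaller, since k > 2
```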

 
To find the value of $c$ for which $PR(\hat{\beta})$ is minimum, we differentiate

$PR(\hat{\beta}) = PR(b) - c\,\sigma^{2}\left\{2(k-2)-c\right\}E\left[\frac{1}{Z'Z}\right]$

with respect to $c$, which gives

$\frac{d\,PR(\hat{\beta})}{dc} = \frac{d\,PR(b)}{dc} - \sigma^{2}E\left[\frac{1}{Z'Z}\right]\frac{d}{dc}\left\{2(k-2)c - c^{2}\right\} = 0$

$\Rightarrow\; 2(k-2) - 2c = 0 \quad\text{or}\quad c = k-2.$

Further,

$\left.\frac{d^{2}PR(\hat{\beta})}{dc^{2}}\right|_{c=k-2} = 2\sigma^{2}E\left[\frac{1}{Z'Z}\right] > 0.$

The largest gain in efficiency arises when $c = k-2$. So if the number of explanatory variables is more than two, then it is always possible to construct an estimator which is better than OLSE.

The optimum Stein-rule estimator, or James-Stein estimator, of $\beta$ in this case is given by

$\hat{\beta} = \left[1 - \frac{(k-2)\,\sigma^{2}}{b'X'Xb}\right]b \quad\text{when } \sigma^{2} \text{ is known.}$

To avoid the change of sign in this estimator, the "positive part" version of this estimator, called the positive-part Stein-rule estimator, is given by

$\hat{\beta}_{+} = \begin{cases} \left[1 - \dfrac{(k-2)\sigma^{2}}{b'X'Xb}\right]b & \text{when } 0 \le \dfrac{(k-2)\sigma^{2}}{b'X'Xb} < 1 \\[2ex] 0 & \text{when } \dfrac{(k-2)\sigma^{2}}{b'X'Xb} \ge 1. \end{cases}$

When $\sigma^{2}$ is unknown, it can be shown that the Stein-rule estimator

$\hat{\beta} = \left[1 - \frac{c\, e'e}{b'X'Xb}\right]b$

is better than OLSE $b$ if and only if

$0 < c < \frac{2(k-2)}{n-k+2}; \qquad k > 2.$

The optimum choice of $c$, giving the largest gain in efficiency, is

$c = \frac{k-2}{n-k+2}.$

Chapter 15
Instrumental Variables Estimation

A basic assumption in analyzing the performance of estimators in multiple regression is that the explanatory
variables and disturbance terms are independently distributed. The violation of such assumption disturbs the
optimal properties of the estimators. The instrumental variable estimation method helps in estimating the
regression coefficients in the multiple linear regression model when such violation occurs.

Consider the multiple linear regression model


$y = X\beta + \varepsilon$

where $y$ is an $(n\times 1)$ vector of observations on the study variable, $X$ is an $(n\times k)$ matrix of observations on $X_1, X_2, \ldots, X_k$, $\beta$ is a $(k\times 1)$ vector of regression coefficients and $\varepsilon$ is an $(n\times 1)$ vector of disturbances.

Suppose one or more of the explanatory variables are correlated with the disturbances in the limit, so that

$\text{plim}\left(\frac{1}{n}X'\varepsilon\right) \neq 0.$
The consequences of such an assumption on the ordinary least squares estimator are as follows:

$\begin{aligned}
b &= (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon)\\
b - \beta &= (X'X)^{-1}X'\varepsilon = \left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'\varepsilon}{n}\right)\\
\text{plim}(b-\beta) &= \text{plim}\left(\frac{X'X}{n}\right)^{-1}\text{plim}\left(\frac{X'\varepsilon}{n}\right) \neq 0
\end{aligned}$

assuming $\text{plim}\left(\dfrac{X'X}{n}\right) = \Sigma_{XX}$ exists and is nonsingular. Consequently, $\text{plim}\, b \neq \beta$ and thus the OLSE becomes an inconsistent estimator of $\beta$.

To overcome this problem and to obtain a consistent estimator of  , the instrumental variable estimation can
be used.

Consider the model

$y = X\beta + \varepsilon \quad\text{with}\quad \text{plim}\left(\frac{1}{n}X'\varepsilon\right) \neq 0.$

Suppose that it is possible to find a data matrix $Z$ of order $(n\times k)$ with the following properties:

(i) $\text{plim}\left(\dfrac{Z'X}{n}\right) = \Sigma_{ZX}$ is a finite and nonsingular matrix of full rank. This means that the variables in $Z$ are correlated with those in $X$, in the limit.

(ii) $\text{plim}\left(\dfrac{Z'\varepsilon}{n}\right) = 0$, i.e., the variables in $Z$ are uncorrelated with $\varepsilon$, in the limit.

(iii) $\text{plim}\left(\dfrac{Z'Z}{n}\right) = \Sigma_{ZZ}$ exists.

Thus Z  variables are postulated to be


 uncorrelated with  , in the limit and
 to have a nonzero cross product with X .
Such variables are called instrumental variables.

If some of X variables are likely to be uncorrelated with  , then these can be used to form some of the
columns of Z and extraneous variables are found only for the remaining columns.

First, we understand the role of the term $X'\varepsilon$ in the OLS estimation. The OLSE $b$ of $\beta$ is derived by solving the equation

$\frac{\partial}{\partial\beta}\left[(y - X\beta)'(y - X\beta)\right] = 0$

or $X'y = X'Xb$

or $X'(y - Xb) = 0.$

Now we look at this normal equation as if it were obtained by pre-multiplying $y = X\beta + \varepsilon$ by $X'$ as

$X'y = X'X\beta + X'\varepsilon$

where the term $X'\varepsilon$ is dropped and $\beta$ is replaced by $b$. The disappearance of $X'\varepsilon$ can be explained, when $X$ and $\varepsilon$ are uncorrelated, as follows:

$X'y = X'X\beta + X'\varepsilon$

$\frac{X'y}{n} = \frac{X'X}{n}\beta + \frac{X'\varepsilon}{n}$

$\text{plim}\left(\frac{X'y}{n}\right) = \text{plim}\left(\frac{X'X}{n}\right)\beta + \text{plim}\left(\frac{X'\varepsilon}{n}\right)$

$\beta = \left[\text{plim}\left(\frac{X'X}{n}\right)\right]^{-1}\left[\text{plim}\left(\frac{X'y}{n}\right) - \text{plim}\left(\frac{X'\varepsilon}{n}\right)\right].$

Let

$\text{plim}\left(\frac{X'X}{n}\right) = \Sigma_{XX}, \qquad \text{plim}\left(\frac{X'y}{n}\right) = \Sigma_{Xy}$

where the population cross moments $\Sigma_{XX}$ and $\Sigma_{Xy}$ are finite, and $\Sigma_{XX}$ is nonsingular.

If $X$ and $\varepsilon$ are uncorrelated, so that

$\text{plim}\left(\frac{X'\varepsilon}{n}\right) = 0,$

then

$\beta = \Sigma_{XX}^{-1}\Sigma_{Xy}.$

If $\Sigma_{XX}$ is estimated by the sample cross moment $\dfrac{X'X}{n}$ and $\Sigma_{Xy}$ is estimated by the sample cross moment $\dfrac{X'y}{n}$, then the OLS estimator of $\beta$ is obtained as

$b = \left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'y}{n}\right) = (X'X)^{-1}X'y.$

Such an analysis suggests using $Z$ to pre-multiply the multiple regression model as follows:

$Z'y = Z'X\beta + Z'\varepsilon$

$\frac{Z'y}{n} = \left(\frac{Z'X}{n}\right)\beta + \left(\frac{Z'\varepsilon}{n}\right)$

$\text{plim}\left(\frac{Z'y}{n}\right) = \text{plim}\left(\frac{Z'X}{n}\right)\beta + \text{plim}\left(\frac{Z'\varepsilon}{n}\right)$

$\beta = \left[\text{plim}\left(\frac{Z'X}{n}\right)\right]^{-1}\left[\text{plim}\left(\frac{Z'y}{n}\right) - \text{plim}\left(\frac{Z'\varepsilon}{n}\right)\right] = \Sigma_{ZX}^{-1}\Sigma_{Zy}.$

Substituting the sample cross moment $\dfrac{Z'X}{n}$ for $\Sigma_{ZX}$ and $\dfrac{Z'y}{n}$ for $\Sigma_{Zy}$, the following estimator of $\beta$ is obtained:

$\hat{\beta}_{IV} = (Z'X)^{-1}Z'y$

which is termed the instrumental variable estimator of $\beta$, and this method is called the instrumental variable method.

Since

$\begin{aligned}
\hat{\beta}_{IV} - \beta &= (Z'X)^{-1}Z'(X\beta + \varepsilon) - \beta\\
&= (Z'X)^{-1}Z'\varepsilon\\
\text{plim}(\hat{\beta}_{IV} - \beta) &= \text{plim}\left[(Z'X)^{-1}Z'\varepsilon\right]\\
&= \text{plim}\left[\left(\frac{Z'X}{n}\right)^{-1}\frac{1}{n}Z'\varepsilon\right]\\
&= \text{plim}\left(\frac{Z'X}{n}\right)^{-1}\text{plim}\left(\frac{Z'\varepsilon}{n}\right)\\
&= \Sigma_{ZX}^{-1}\cdot 0 = 0\\
\Rightarrow \text{plim}\,\hat{\beta}_{IV} &= \beta.
\end{aligned}$

Thus the instrumental variable estimator is consistent. Note that the variables $Z_1, Z_2, \ldots, Z_k$ in $Z$ are chosen such that they are uncorrelated with $\varepsilon$ and correlated with $X$, at least asymptotically, so that the second order moment matrix $\Sigma_{ZX}$ exists and is nonsingular.
Asymptotic distribution:
The asymptotic distribution of

$\sqrt{n}\left(\hat{\beta}_{IV} - \beta\right) = \left(\frac{Z'X}{n}\right)^{-1}\frac{1}{\sqrt{n}}Z'\varepsilon$

is normal with mean vector $0$, and the asymptotic covariance matrix is given by

$\begin{aligned}
\text{AsyVar}(\hat{\beta}_{IV}) &= \text{plim}\left[\left(\frac{Z'X}{n}\right)^{-1}\frac{1}{n}Z'E(\varepsilon\varepsilon')Z\left(\frac{X'Z}{n}\right)^{-1}\right]\\
&= \sigma^{2}\,\text{plim}\left[\left(\frac{Z'X}{n}\right)^{-1}\left(\frac{Z'Z}{n}\right)\left(\frac{X'Z}{n}\right)^{-1}\right]\\
&= \sigma^{2}\,\text{plim}\left(\frac{Z'X}{n}\right)^{-1}\text{plim}\left(\frac{Z'Z}{n}\right)\text{plim}\left(\frac{X'Z}{n}\right)^{-1}\\
&= \sigma^{2}\,\Sigma_{ZX}^{-1}\Sigma_{ZZ}\Sigma_{XZ}^{-1}.
\end{aligned}$

For a large sample,

$V(\hat{\beta}_{IV}) = \frac{\sigma^{2}}{n}\,\Sigma_{ZX}^{-1}\Sigma_{ZZ}\Sigma_{XZ}^{-1}$

which can be estimated by

$\hat{V}(\hat{\beta}_{IV}) = \frac{s^{2}}{n}\,\hat{\Sigma}_{ZX}^{-1}\hat{\Sigma}_{ZZ}\hat{\Sigma}_{XZ}^{-1} = s^{2}\,(Z'X)^{-1}Z'Z(X'Z)^{-1}$

where

$s^{2} = \frac{1}{n-k}(y - Xb)'(y - Xb), \qquad b = (X'X)^{-1}X'y.$

The variance of $\hat{\beta}_{IV}$ is not necessarily a minimum asymptotic variance because there can be more than one set of instrumental variables that fulfil the requirement of being uncorrelated with $\varepsilon$ and correlated with the stochastic regressors.
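As a concrete illustration, the following is a minimal sketch (Python/NumPy assumed; the simulated design and instrument are illustrative assumptions) that computes $\hat{\beta}_{IV}=(Z'X)^{-1}Z'y$ and the estimated asymptotic covariance $s^{2}(Z'X)^{-1}Z'Z(X'Z)^{-1}$ for a single endogenous regressor.

```python
import numpy as np

# Sketch of instrumental variable estimation: x is correlated with the disturbance,
# z is correlated with x but not with the disturbance.
rng = np.random.default_rng(3)
n = 500
z = rng.normal(size=n)                           # instrument
eps = rng.normal(size=n)                         # structural disturbance
x = 0.8 * z + 0.5 * eps + rng.normal(size=n)     # regressor correlated with eps
y = 1.0 + 2.0 * x + eps                          # true beta = (1, 2)

X = np.column_stack([np.ones(n), x])             # include intercept
Z = np.column_stack([np.ones(n), z])             # intercept acts as its own instrument

b_ols = np.linalg.solve(X.T @ X, X.T @ y)        # inconsistent here
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)         # beta_IV = (Z'X)^{-1} Z'y

e = y - X @ b_ols
s2 = e @ e / (n - X.shape[1])
V_iv = s2 * np.linalg.inv(Z.T @ X) @ (Z.T @ Z) @ np.linalg.inv(X.T @ Z)

print("OLS:", b_ols, "  IV:", b_iv)
print("Estimated asymptotic covariance of IV:\n", V_iv)
```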

Chapter 16
Measurement Error Models

A fundamental assumption in all the statistical analysis is that all the observations are correctly measured. In
the context of multiple regression model, it is assumed that the observations on the study and explanatory
variables are observed without any error. In many situations, this basic assumption is violated. There can be
several reasons for such a violation.
 For example, the variables may not be measurable, e.g., taste, climatic conditions, intelligence,
education, ability etc. In such cases, the dummy variables are used, and the observations can be
recorded in terms of values of dummy variables.
 Sometimes the variables are clearly defined, but it is hard to take correct observations. For example,
the age is generally reported in complete years or in multiple of five.
 Sometimes the variable is conceptually well defined, but it is not possible to take a correct
observation on it. Instead, the observations are obtained on closely related proxy variables, e.g., the
level of education is measured by the number of years of schooling.
 Sometimes the variable is well understood, but it is qualitative in nature. For example, intelligence is
measured by intelligence quotient (IQ) scores.

In all such cases, the true value of the variable cannot be recorded. Instead, it is observed with some error. The difference between the observed and true values of the variable is called measurement error or errors-in-variables.

Difference between disturbances and measurement errors:


The disturbances in the linear regression model arise due to factors like the unpredictable element of
randomness, lack of deterministic relationship, measurement error in study variable etc. The disturbance
term is generally thought of as representing the influence of various explanatory variables that have not
actually been included in the relation. The measurement errors arise due to the use of an imperfect measure
of true variables.

Large and small measurement errors
If the magnitude of measurement errors is small, then they can be assumed to be merged in the disturbance
term, and they will not affect the statistical inferences much. On the other hand, if they are large in
magnitude, then they will lead to incorrect and invalid statistical inferences. For example, in the context of
linear regression model, the ordinary least squares estimator (OLSE) is the best linear unbiased estimator of
the regression coefficient when measurement errors are absent. When the measurement errors are present in
the data, the same OLSE becomes a biased as well as an inconsistent estimator of the regression coefficients.

Consequences of measurement errors:


We first describe the measurement error model. Let the true relationship between the correctly observed study and explanatory variables be

$\tilde{y} = \tilde{X}\beta$

where $\tilde{y}$ is an $(n\times 1)$ vector of true observations on the study variable, $\tilde{X}$ is an $(n\times k)$ matrix of true observations on the explanatory variables and $\beta$ is a $(k\times 1)$ vector of regression coefficients. The values $\tilde{y}$ and $\tilde{X}$ are not observable due to the presence of measurement errors. Instead, the values of $y$ and $X$ are observed with additive measurement errors as

$y = \tilde{y} + u$
$X = \tilde{X} + V$

where $y$ is an $(n\times 1)$ vector of observed values of the study variable which are observed with the $(n\times 1)$ measurement error vector $u$. Similarly, $X$ is an $(n\times k)$ matrix of observed values of the explanatory variables which are observed with the $(n\times k)$ matrix $V$ of measurement errors in $\tilde{X}$. In such a case, the usual disturbance term can be assumed to be subsumed in $u$ without loss of generality. Since our aim is to see the impact of measurement errors, it is not considered separately in the present case.

Alternatively, the same setup can be expressed as

$y = \tilde{X}\beta + u$
$X = \tilde{X} + V$

where it can be assumed that only $\tilde{X}$ is measured with measurement errors $V$, and $u$ can be considered as the usual disturbance term in the model.

In case some of the explanatory variables are measured without any measurement error, the corresponding values in $V$ will be set to zero.

We assume that

$E(u) = 0, \quad E(uu') = \sigma^{2}I$
$E(V) = 0, \quad E(V'V) = \Omega, \quad E(V'u) = 0.$

The following set of equations describes the measurement error model:

$\tilde{y} = \tilde{X}\beta$
$y = \tilde{y} + u$
$X = \tilde{X} + V$

which can be re-expressed as

$\begin{aligned}
y &= \tilde{y} + u\\
  &= \tilde{X}\beta + u\\
  &= (X - V)\beta + u\\
  &= X\beta + (u - V\beta)\\
  &= X\beta + \omega
\end{aligned}$

where $\omega = u - V\beta$ is called the composite disturbance term. This model resembles a usual linear regression model. A basic assumption in the linear regression model is that the explanatory variables and the disturbances are uncorrelated. Let us verify this assumption in the model $y = X\beta + \omega$ as follows:

$E\left[\{X - E(X)\}'\omega\right] = E\left[V'(u - V\beta)\right] = E(V'u) - E(V'V)\beta = 0 - \Omega\beta \neq 0.$

Thus $X$ and $\omega$ are correlated, so OLS will not provide valid results.

Suppose we ignore the measurement errors and obtain the OLSE. Note that ignoring the measurement errors
in the data does not mean that they are not present. We now observe the properties of such an OLSE under
the setup of measurement error model.

The OLSE is

$b = (X'X)^{-1}X'y$

$\begin{aligned}
b - \beta &= (X'X)^{-1}X'(X\beta + \omega) - \beta = (X'X)^{-1}X'\omega\\
E(b - \beta) &= E\left[(X'X)^{-1}X'\omega\right] \neq (X'X)^{-1}X'E(\omega) = 0
\end{aligned}$

as $X$ is a random matrix which is correlated with $\omega$, so the expectation cannot be taken inside term by term. So $b$ becomes a biased estimator of $\beta$.

Now we check the consistency property of the OLSE. Assume

$\text{plim}\left(\frac{1}{n}\tilde{X}'\tilde{X}\right) = \Sigma_{xx}, \quad \text{plim}\left(\frac{1}{n}V'V\right) = \Sigma_{vv}, \quad \text{plim}\left(\frac{1}{n}\tilde{X}'V\right) = 0, \quad \text{plim}\left(\frac{1}{n}V'u\right) = 0.$

Then

$\text{plim}(b-\beta) = \text{plim}\left(\frac{X'X}{n}\right)^{-1}\text{plim}\left(\frac{X'\omega}{n}\right).$

Now

$\frac{1}{n}X'X = \frac{1}{n}(\tilde{X}+V)'(\tilde{X}+V) = \frac{1}{n}\tilde{X}'\tilde{X} + \frac{1}{n}\tilde{X}'V + \frac{1}{n}V'\tilde{X} + \frac{1}{n}V'V$

$\text{plim}\left(\frac{1}{n}X'X\right) = \Sigma_{xx} + 0 + 0 + \Sigma_{vv} = \Sigma_{xx} + \Sigma_{vv}$

and

$\frac{1}{n}X'\omega = \frac{1}{n}\tilde{X}'\omega + \frac{1}{n}V'\omega = \frac{1}{n}\tilde{X}'(u - V\beta) + \frac{1}{n}V'(u - V\beta)$

$\text{plim}\left(\frac{1}{n}X'\omega\right) = 0 - 0 + 0 - \Sigma_{vv}\beta = -\Sigma_{vv}\beta.$

Therefore

$\text{plim}(b-\beta) = -\left(\Sigma_{xx} + \Sigma_{vv}\right)^{-1}\Sigma_{vv}\beta \neq 0.$

Thus $b$ is an inconsistent estimator of $\beta$. Such inconsistency arises essentially due to the correlation between $X$ and $\omega$.

Note: It should not be misunderstood that the OLSE $b = (X'X)^{-1}X'y$ is obtained by minimizing $S = \omega'\omega = (y - X\beta)'(y - X\beta)$ in the model $y = X\beta + \omega$. In fact, $\omega'\omega$ cannot be minimized as in the case of the usual linear regression, because the composite error $\omega = u - V\beta$ is itself a function of $\beta$.

To see the nature of the inconsistency, consider the simple linear regression model with measurement error as

$\tilde{y}_i = \beta_0 + \beta_1\tilde{x}_i, \quad i = 1,2,\ldots,n$
$y_i = \tilde{y}_i + u_i$
$x_i = \tilde{x}_i + v_i.$

Now

$\tilde{X} = \begin{bmatrix}1 & \tilde{x}_1\\ 1 & \tilde{x}_2\\ \vdots & \vdots\\ 1 & \tilde{x}_n\end{bmatrix}, \quad X = \begin{bmatrix}1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n\end{bmatrix}, \quad V = \begin{bmatrix}0 & v_1\\ 0 & v_2\\ \vdots & \vdots\\ 0 & v_n\end{bmatrix}$

and assuming that

$\text{plim}\left(\frac{1}{n}\sum_{i=1}^{n}\tilde{x}_i\right) = \mu, \qquad \text{plim}\left(\frac{1}{n}\sum_{i=1}^{n}(\tilde{x}_i - \mu)^2\right) = \sigma_{x}^{2},$

we have

1 
 xx  plim  X ' X 
n 
 1 n 
 1  xi
n i 1 
 plim  n
1 1 n 2
  xi  xi 
 n i 1 n i 1 
1  
 2
.
 x   
2

Also,
1 
 vv  plim  V 'V 
n 
0 0 
 2
.
0 v 
Now

$\text{plim}(b - \beta) = -\left(\Sigma_{xx}+\Sigma_{vv}\right)^{-1}\Sigma_{vv}\beta$

$\text{plim}\begin{bmatrix}b_0-\beta_0\\ b_1-\beta_1\end{bmatrix} = -\begin{bmatrix}1 & \mu\\ \mu & \sigma_x^{2}+\mu^{2}+\sigma_v^{2}\end{bmatrix}^{-1}\begin{bmatrix}0 & 0\\ 0 & \sigma_v^{2}\end{bmatrix}\begin{bmatrix}\beta_0\\ \beta_1\end{bmatrix} = -\frac{1}{\sigma_x^{2}+\sigma_v^{2}}\begin{bmatrix}\sigma_x^{2}+\mu^{2}+\sigma_v^{2} & -\mu\\ -\mu & 1\end{bmatrix}\begin{bmatrix}0\\ \sigma_v^{2}\beta_1\end{bmatrix} = \begin{bmatrix}\dfrac{\mu\,\sigma_v^{2}\beta_1}{\sigma_x^{2}+\sigma_v^{2}}\\[2ex] -\dfrac{\sigma_v^{2}\beta_1}{\sigma_x^{2}+\sigma_v^{2}}\end{bmatrix}.$

Thus we find that the OLSEs of $\beta_0$ and $\beta_1$ are biased and inconsistent. So if a variable is subject to measurement error, this not only affects the estimate of its own parameter but also affects the estimators of the parameters associated with variables that are measured without any error. In other words, the presence of measurement error in even a single variable makes the OLSEs of all the regression coefficients inconsistent, including those of the variables measured without error.
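The inconsistency derived above can be seen numerically. The following is a minimal simulation sketch (Python/NumPy assumed; all numerical settings are illustrative assumptions) in which the OLS slope converges to $\beta_1\sigma_x^{2}/(\sigma_x^{2}+\sigma_v^{2})$ rather than to $\beta_1$.

```python
import numpy as np

# Sketch: attenuation of the OLS slope under measurement error in x.
rng = np.random.default_rng(4)
n = 200_000
beta0, beta1 = 1.0, 2.0
sigma_x2, sigma_v2, sigma_u2 = 1.0, 0.5, 0.3

x_true = rng.normal(loc=3.0, scale=np.sqrt(sigma_x2), size=n)
x_obs = x_true + rng.normal(scale=np.sqrt(sigma_v2), size=n)               # x = x_true + v
y_obs = beta0 + beta1 * x_true + rng.normal(scale=np.sqrt(sigma_u2), size=n)  # y = y_true + u

mxy = np.cov(x_obs, y_obs, bias=True)[0, 1]
mxx = np.var(x_obs)
b1 = mxy / mxx                                     # OLS slope on the observed data
print("OLS slope:", b1)
print("Theoretical plim:", beta1 * sigma_x2 / (sigma_x2 + sigma_v2))
```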

Forms of measurement error model:
Based on the assumption about the true values of the explanatory variable, there are three forms of the measurement error model.

Consider the model

$\tilde{y}_i = \beta_0 + \beta_1\tilde{x}_i, \quad i = 1,2,\ldots,n$
$y_i = \tilde{y}_i + u_i$
$x_i = \tilde{x}_i + v_i.$

1. Functional form: When the $\tilde{x}_i$'s are unknown constants (fixed), the measurement error model is said to be in its functional form.

2. Structural form: When the $\tilde{x}_i$'s are identically and independently distributed random variables, say with mean $\mu$ and variance $\sigma^{2}$ $(\sigma^{2}>0)$, the measurement error model is said to be in the structural form. Note that in the case of the functional form, $\sigma^{2}=0$.

3. Ultrastructural form: When the $\tilde{x}_i$'s are independently distributed random variables with different means, say $\mu_i$, and variance $\sigma^{2}$ $(\sigma^{2}>0)$, the model is said to be in the ultrastructural form. This form is a synthesis of the functional and structural forms in the sense that both forms are particular cases of the ultrastructural form.

Methods for consistent estimation of  :


The OLSE of $\beta$, which is the best linear unbiased estimator in the absence of measurement errors, becomes biased and inconsistent when measurement errors are present. An important objective in measurement error models is how to obtain consistent estimators of the regression coefficients. The instrumental variable estimation and the method of maximum likelihood (or method of moments) are utilized to obtain consistent estimates of the parameters.

Instrumental variable estimation:
The instrumental variable method provides the consistent estimate of regression coefficients in linear
regression model when the explanatory variables and disturbance terms are correlated. Since in measurement
error model, the explanatory variables and disturbance are correlated, so this method helps. The instrumental
variable method consists of finding a set of variables which are correlated with the explanatory variables in
the model but uncorrelated with the composite disturbances, at least asymptotically, to ensure consistency.

Let $Z_1, Z_2, \ldots, Z_k$ be the $k$ instrumental variables. In the context of the model

$y = X\beta + \omega, \quad \omega = u - V\beta,$

let $Z$ be the $(n\times k)$ matrix of $k$ instrumental variables $Z_1, Z_2, \ldots, Z_k$, each having $n$ observations, such that

 $Z$ and $X$ are correlated, at least asymptotically, and
 $Z$ and $\omega$ are uncorrelated, at least asymptotically.

So we have

$\text{plim}\left(\frac{1}{n}Z'X\right) = \Sigma_{ZX}, \qquad \text{plim}\left(\frac{1}{n}Z'\omega\right) = 0.$

The instrumental variable estimator of $\beta$ is given by

$\begin{aligned}
\hat{\beta}_{IV} &= (Z'X)^{-1}Z'y = (Z'X)^{-1}Z'(X\beta + \omega)\\
\hat{\beta}_{IV} - \beta &= (Z'X)^{-1}Z'\omega\\
\text{plim}(\hat{\beta}_{IV} - \beta) &= \left[\text{plim}\left(\frac{1}{n}Z'X\right)\right]^{-1}\text{plim}\left(\frac{1}{n}Z'\omega\right)\\
&= \Sigma_{ZX}^{-1}\cdot 0 = 0.
\end{aligned}$

So $\hat{\beta}_{IV}$ is a consistent estimator of $\beta$.

Any instrument that fulfils the requirement of being uncorrelated with the composite disturbance term and correlated with the explanatory variables will result in a consistent estimate of the parameters. However, there can be various sets of variables which satisfy these conditions to become instrumental variables. Different choices of instruments give different consistent estimators. It is difficult to assert which choice of instruments will give an instrumental variable estimator having minimum asymptotic variance. Moreover, it is also difficult to decide which choice of instrumental variables is better and more appropriate in comparison to the others. An additional difficulty is to check whether the chosen instruments are indeed uncorrelated with the disturbance term or not.

Choice of instrument:
We discuss some popular choices of instruments in a univariate measurement error model. Consider the model

$y_i = \beta_0 + \beta_1 x_i + \omega_i, \quad \omega_i = u_i - \beta_1 v_i, \quad i = 1,2,\ldots,n.$

A variable that is likely to satisfy the two requirements of an instrumental variable is a discrete grouping variable. Wald's, Bartlett's and Durbin's methods are based on different choices of discrete grouping variables.

1. Wald’s method
Find the median of the given observations $x_1, x_2, \ldots, x_n$. Now classify the observations by defining an instrumental variable $Z$ such that

$Z_i = \begin{cases} 1 & \text{if } x_i > \text{median}(x_1, x_2, \ldots, x_n)\\ -1 & \text{if } x_i < \text{median}(x_1, x_2, \ldots, x_n). \end{cases}$

In this case,

$Z = \begin{bmatrix}1 & Z_1\\ 1 & Z_2\\ \vdots & \vdots\\ 1 & Z_n\end{bmatrix}, \qquad X = \begin{bmatrix}1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n\end{bmatrix}.$

Now form two groups of observations as follows.

 One group with those $x_i$'s below the median of $x_1, x_2, \ldots, x_n$. Find the means of the $y_i$'s and $x_i$'s, say $\bar{y}_1$ and $\bar{x}_1$, respectively, in this group.
 Another group with those $x_i$'s above the median of $x_1, x_2, \ldots, x_n$. Find the means of the $y_i$'s and $x_i$'s, say $\bar{y}_2$ and $\bar{x}_2$, respectively, in this group.

Now we find the instrumental variable estimator under this setup as follows. Let $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i$. Then

$\hat{\beta}_{IV} = (Z'X)^{-1}Z'y$

$Z'X = \begin{bmatrix} n & \sum_{i=1}^{n}x_i\\[1ex] \sum_{i=1}^{n}Z_i & \sum_{i=1}^{n}Z_i x_i \end{bmatrix} = \begin{bmatrix} n & n\bar{x}\\[1ex] 0 & \dfrac{n}{2}\left(\bar{x}_2 - \bar{x}_1\right) \end{bmatrix}$

$Z'y = \begin{bmatrix} \sum_{i=1}^{n}y_i\\[1ex] \sum_{i=1}^{n}Z_i y_i \end{bmatrix} = \begin{bmatrix} n\bar{y}\\[1ex] \dfrac{n}{2}\left(\bar{y}_2 - \bar{y}_1\right) \end{bmatrix}$

$\begin{bmatrix}\hat{\beta}_{0\,IV}\\ \hat{\beta}_{1\,IV}\end{bmatrix} = \begin{bmatrix} n & n\bar{x}\\[1ex] 0 & \dfrac{n}{2}\left(\bar{x}_2 - \bar{x}_1\right)\end{bmatrix}^{-1}\begin{bmatrix} n\bar{y}\\[1ex] \dfrac{n}{2}\left(\bar{y}_2 - \bar{y}_1\right)\end{bmatrix} = \begin{bmatrix} \bar{y} - \dfrac{\bar{y}_2 - \bar{y}_1}{\bar{x}_2 - \bar{x}_1}\,\bar{x}\\[2ex] \dfrac{\bar{y}_2 - \bar{y}_1}{\bar{x}_2 - \bar{x}_1}\end{bmatrix}$

$\Rightarrow\quad \hat{\beta}_{1\,IV} = \frac{\bar{y}_2 - \bar{y}_1}{\bar{x}_2 - \bar{x}_1}, \qquad \hat{\beta}_{0\,IV} = \bar{y} - \hat{\beta}_{1\,IV}\,\bar{x}.$

If $n$ is odd, the middle observation can be deleted. Under fairly general conditions, the estimators are consistent but are likely to have large sampling variance. This is the limitation of this method.
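A minimal sketch of Wald's grouping estimator follows (Python/NumPy assumed; the simulated data are illustrative assumptions). It splits the sample at the median of $x$ and forms the slope from the group means, as derived above.

```python
import numpy as np

# Sketch of Wald's grouping estimator for a simple measurement error model.
def wald_estimator(x, y):
    med = np.median(x)
    low, high = x < med, x > med              # two groups split at the median
    x1, y1 = x[low].mean(), y[low].mean()
    x2, y2 = x[high].mean(), y[high].mean()
    b1 = (y2 - y1) / (x2 - x1)                # slope from the group means
    b0 = y.mean() - b1 * x.mean()             # intercept
    return b0, b1

rng = np.random.default_rng(5)
n = 1000
x_true = rng.normal(2.0, 1.0, n)
x = x_true + rng.normal(0, 0.7, n)            # measurement error in x
y = 1.0 + 2.0 * x_true + rng.normal(0, 0.5, n)
print(wald_estimator(x, y))                   # roughly recovers (1, 2), unlike OLS
```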

2. Bartlett’s method:
Let $x_1, x_2, \ldots, x_n$ be the $n$ observations. Rank these observations and order them in increasing or decreasing order. Now three groups can be formed, each containing $n/3$ observations. Define the instrumental variable as

$Z_i = \begin{cases} 1 & \text{if the observation is in the top group}\\ 0 & \text{if the observation is in the middle group}\\ -1 & \text{if the observation is in the bottom group.} \end{cases}$

Now discard the observations in the middle group and compute the means of the $y_i$'s and $x_i$'s in the

- bottom group, say $\bar{y}_1$ and $\bar{x}_1$, and
- top group, say $\bar{y}_3$ and $\bar{x}_3$.

Substituting the values of $X$ and $Z$ in $\hat{\beta}_{IV} = (Z'X)^{-1}Z'y$ and solving, we get

$\hat{\beta}_{1\,IV} = \frac{\bar{y}_3 - \bar{y}_1}{\bar{x}_3 - \bar{x}_1}, \qquad \hat{\beta}_{0\,IV} = \bar{y} - \hat{\beta}_{1\,IV}\,\bar{x}.$

These estimators are consistent. No conclusive evidence is available to compare Bartlett's method and Wald's method, but the three-group method generally provides more efficient estimates than the two-group method in many cases.

3. Durbin’s method
Let $x_1, x_2, \ldots, x_n$ be the observations. Arrange these observations in ascending order. Define the instrumental variable $Z_i$ as the rank of $x_i$. Then, substituting the suitable values of $Z$ and $X$ in $\hat{\beta}_{IV} = (Z'X)^{-1}Z'y$, we get the instrumental variable estimator

$\hat{\beta}_{1\,IV} = \frac{\sum_{i=1}^{n}Z_i\,(y_i - \bar{y})}{\sum_{i=1}^{n}Z_i\,(x_i - \bar{x})}.$
When there is more than one explanatory variable, one may choose as the instrument the rank of that particular variable.

Since the estimator uses more information, it is believed to be superior in efficiency to other grouping
methods. However, nothing definite is known about the efficiency of this method.
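The following is a minimal sketch of Durbin's ranking instrument (Python/NumPy assumed; the simulated data are illustrative assumptions, not from the notes).

```python
import numpy as np

# Sketch of Durbin's method: the instrument Z_i is the rank of x_i among x_1, ..., x_n.
def durbin_estimator(x, y):
    z = np.argsort(np.argsort(x)) + 1.0            # ranks 1, ..., n
    b1 = np.sum(z * (y - y.mean())) / np.sum(z * (x - x.mean()))
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

rng = np.random.default_rng(6)
x_true = rng.normal(2.0, 1.0, 1000)
x = x_true + rng.normal(0, 0.7, 1000)              # measurement error in x
y = 1.0 + 2.0 * x_true + rng.normal(0, 0.5, 1000)
print(durbin_estimator(x, y))
```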

In general, the instrumental variable estimators may have fairly large standard errors in comparison to the ordinary least squares estimators, which is the price paid for obtaining consistent estimates. However, inconsistent estimators have little appeal.

Maximum likelihood estimation in structural form

Consider the maximum likelihood estimation of parameters in the simple measurement error model given by

$\tilde{y}_i = \beta_0 + \beta_1\tilde{x}_i, \quad i = 1,2,\ldots,n$
$y_i = \tilde{y}_i + u_i$
$x_i = \tilde{x}_i + v_i.$

Here $(\tilde{x}_i, \tilde{y}_i)$ are unobservable and $(x_i, y_i)$ are observable.

Assume

$E(u_i) = 0, \quad E(u_i u_j) = \begin{cases}\sigma_u^{2} & \text{if } i = j\\ 0 & \text{if } i \neq j,\end{cases}$

$E(v_i) = 0, \quad E(v_i v_j) = \begin{cases}\sigma_v^{2} & \text{if } i = j\\ 0 & \text{if } i \neq j,\end{cases}$

$E(u_i v_j) = 0 \text{ for all } i = 1,2,\ldots,n;\; j = 1,2,\ldots,n.$

For the application of the method of maximum likelihood, we assume a normal distribution for $u_i$ and $v_i$. We consider the estimation of parameters in the structural form of the model in which the $\tilde{x}_i$'s are stochastic. So assume

$\tilde{x}_i \sim N(\mu, \sigma^{2})$

and that the $\tilde{x}_i$'s are independent of $u_i$ and $v_i$.

Thus

$E(\tilde{x}_i) = \mu, \qquad Var(\tilde{x}_i) = \sigma^{2}, \qquad E(x_i) = \mu$

$\begin{aligned}
Var(x_i) &= E\left[x_i - E(x_i)\right]^{2} = E\left[\tilde{x}_i + v_i - \mu\right]^{2}\\
&= E(\tilde{x}_i - \mu)^{2} + E(v_i^{2}) + 2E\left[(\tilde{x}_i - \mu)v_i\right]\\
&= \sigma^{2} + \sigma_v^{2}
\end{aligned}$

$E(y_i) = \beta_0 + \beta_1 E(\tilde{x}_i) = \beta_0 + \beta_1\mu$

$\begin{aligned}
Var(y_i) &= E\left[y_i - E(y_i)\right]^{2} = E\left[\beta_0 + \beta_1\tilde{x}_i + u_i - \beta_0 - \beta_1\mu\right]^{2}\\
&= \beta_1^{2}E(\tilde{x}_i - \mu)^{2} + E(u_i^{2}) + 2\beta_1 E\left[(\tilde{x}_i - \mu)u_i\right]\\
&= \beta_1^{2}\sigma^{2} + \sigma_u^{2}
\end{aligned}$

$\begin{aligned}
Cov(x_i, y_i) &= E\left[\{x_i - E(x_i)\}\{y_i - E(y_i)\}\right]\\
&= E\left[(\tilde{x}_i + v_i - \mu)(\beta_0 + \beta_1\tilde{x}_i + u_i - \beta_0 - \beta_1\mu)\right]\\
&= \beta_1 E(\tilde{x}_i - \mu)^{2} + E\left[(\tilde{x}_i - \mu)u_i\right] + \beta_1 E\left[(\tilde{x}_i - \mu)v_i\right] + E(u_i v_i)\\
&= \beta_1\sigma^{2} + 0 + 0 + 0 = \beta_1\sigma^{2}.
\end{aligned}$

So

$\begin{bmatrix}y_i\\ x_i\end{bmatrix} \sim N\left(\begin{bmatrix}\beta_0 + \beta_1\mu\\ \mu\end{bmatrix}, \begin{bmatrix}\beta_1^{2}\sigma^{2} + \sigma_u^{2} & \beta_1\sigma^{2}\\ \beta_1\sigma^{2} & \sigma^{2} + \sigma_v^{2}\end{bmatrix}\right).$
The likelihood function is the joint probability density function of $u_i$ and $v_i$, $i = 1,2,\ldots,n$, as

$\begin{aligned}
L &= f(u_1, u_2, \ldots, u_n, v_1, v_2, \ldots, v_n)\\
&= \left(\frac{1}{2\pi\sigma_u^{2}}\right)^{n/2}\exp\left[-\frac{\sum_{i=1}^{n}u_i^{2}}{2\sigma_u^{2}}\right]\left(\frac{1}{2\pi\sigma_v^{2}}\right)^{n/2}\exp\left[-\frac{\sum_{i=1}^{n}v_i^{2}}{2\sigma_v^{2}}\right]\\
&= \left(\frac{1}{2\pi\sigma_u^{2}}\right)^{n/2}\exp\left[-\frac{\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1\tilde{x}_i)^{2}}{2\sigma_u^{2}}\right]\left(\frac{1}{2\pi\sigma_v^{2}}\right)^{n/2}\exp\left[-\frac{\sum_{i=1}^{n}(x_i - \tilde{x}_i)^{2}}{2\sigma_v^{2}}\right].
\end{aligned}$
The log-likelihood is

$L^{*} = \ln L = \text{constant} - \frac{n}{2}\ln\sigma_u^{2} - \frac{n}{2}\ln\sigma_v^{2} - \frac{\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1\tilde{x}_i)^{2}}{2\sigma_u^{2}} - \frac{\sum_{i=1}^{n}(x_i - \tilde{x}_i)^{2}}{2\sigma_v^{2}}.$

The normal equations are obtained by equating the partial derivatives to zero:

$(1)\quad \frac{\partial L^{*}}{\partial\beta_0} = \frac{1}{\sigma_u^{2}}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1\tilde{x}_i) = 0$

$(2)\quad \frac{\partial L^{*}}{\partial\beta_1} = \frac{1}{\sigma_u^{2}}\sum_{i=1}^{n}\tilde{x}_i(y_i - \beta_0 - \beta_1\tilde{x}_i) = 0$

$(3)\quad \frac{\partial L^{*}}{\partial\tilde{x}_i} = \frac{1}{\sigma_v^{2}}(x_i - \tilde{x}_i) + \frac{\beta_1}{\sigma_u^{2}}(y_i - \beta_0 - \beta_1\tilde{x}_i) = 0, \quad i = 1,2,\ldots,n$

$(4)\quad \frac{\partial L^{*}}{\partial\sigma_u^{2}} = -\frac{n}{2\sigma_u^{2}} + \frac{1}{2\sigma_u^{4}}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1\tilde{x}_i)^{2} = 0$

$(5)\quad \frac{\partial L^{*}}{\partial\sigma_v^{2}} = -\frac{n}{2\sigma_v^{2}} + \frac{1}{2\sigma_v^{4}}\sum_{i=1}^{n}(x_i - \tilde{x}_i)^{2} = 0.$

These are $(n+4)$ equations in $(n+4)$ parameters, but squaring and summing equation (3) over $i = 1,2,\ldots,n$ and using equations (4) and (5), we get

$\sigma_u^{2} = \beta_1^{2}\sigma_v^{2}$

which is an undesirable constraint on the parameters.

These equations can be used to estimate the two means $(\mu$ and $\beta_0 + \beta_1\mu)$, two variances and one covariance. The six parameters $\mu, \beta_0, \beta_1, \sigma_u^{2}, \sigma_v^{2}$ and $\sigma^{2}$ can be estimated from the following five structural relations derived from these normal equations:

$(i)\;\; \bar{x} = \mu$
$(ii)\;\; \bar{y} = \beta_0 + \beta_1\mu$
$(iii)\;\; m_{xx} = \sigma^{2} + \sigma_v^{2}$
$(iv)\;\; m_{yy} = \beta_1^{2}\sigma^{2} + \sigma_u^{2}$
$(v)\;\; m_{xy} = \beta_1\sigma^{2}$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i$, $m_{xx} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}$, $m_{yy} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^{2}$ and $m_{xy} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$.

These equations can also be derived directly using the sufficiency property of the parameters in the bivariate normal distribution, using the definition of the structural relationship, as

$E(x) = \mu$
$E(y) = \beta_0 + \beta_1\mu$
$Var(x) = \sigma^{2} + \sigma_v^{2}$
$Var(y) = \beta_1^{2}\sigma^{2} + \sigma_u^{2}$
$Cov(x, y) = \beta_1\sigma^{2}.$

We observe that there are six parameters $\beta_0, \beta_1, \mu, \sigma^{2}, \sigma_u^{2}$ and $\sigma_v^{2}$ to be estimated on the basis of the five structural equations (i)-(v). So no unique solution exists. Only $\mu$ can be uniquely determined, while the remaining parameters cannot be uniquely determined. So only $\mu$ is identifiable and the remaining parameters are unidentifiable. This is called the problem of identification. One relation falls short of what is needed to obtain a unique solution, so an additional a priori restriction relating any of the six parameters is required.

Note: The same equations (i)-(v) can also be derived using the method of moments. The structural equations are derived by equating the sample and population moments. The assumption of a normal distribution for $u_i$, $v_i$ and $\tilde{x}_i$ is not needed in the case of the method of moments.

Additional information for the consistent estimation of parameters:

The parameters in the model can be consistently estimated only when some additional information about the model is available.

From equations (i) and (ii), we have

$\hat{\mu} = \bar{x}$

and so $\mu$ is clearly estimated. Further,

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$

is estimated once $\hat{\beta}_1$ is uniquely determined. So we consider the estimation of $\beta_1, \sigma^{2}, \sigma_u^{2}$ and $\sigma_v^{2}$ only. Some additional information is required for the unique determination of these parameters. We now consider various types of additional information which are used for estimating the parameters uniquely.

1. $\sigma_v^{2}$ is known:

Suppose $\sigma_v^{2}$ is known a priori. Now the remaining parameters can be estimated as follows:

$m_{xx} = \sigma^{2} + \sigma_v^{2} \;\Rightarrow\; \hat{\sigma}^{2} = m_{xx} - \sigma_v^{2}$

$m_{xy} = \beta_1\sigma^{2} \;\Rightarrow\; \hat{\beta}_1 = \frac{m_{xy}}{m_{xx} - \sigma_v^{2}}$

$m_{yy} = \beta_1^{2}\sigma^{2} + \sigma_u^{2} \;\Rightarrow\; \hat{\sigma}_u^{2} = m_{yy} - \hat{\beta}_1^{2}\hat{\sigma}^{2} = m_{yy} - \frac{m_{xy}^{2}}{m_{xx} - \sigma_v^{2}}.$

Note that $\hat{\sigma}^{2} = m_{xx} - \sigma_v^{2}$ can be negative, because $\sigma_v^{2}$ is known and $m_{xx}$ is based upon the sample. So we assume that $\hat{\sigma}^{2} > 0$ and redefine

$\hat{\beta}_1 = \frac{m_{xy}}{m_{xx} - \sigma_v^{2}}; \qquad m_{xx} > \sigma_v^{2}.$

Similarly, $\hat{\sigma}_u^{2}$ is also assumed to be positive under a suitable condition. All the estimators $\hat{\beta}_1$, $\hat{\sigma}^{2}$ and $\hat{\sigma}_u^{2}$ are consistent estimators of $\beta_1$, $\sigma^{2}$ and $\sigma_u^{2}$, respectively. Note that $\hat{\beta}_1$ looks as if the direct regression estimator of $\beta_1$ has been adjusted by $\sigma_v^{2}$ for its inconsistency, so it is also termed an adjusted estimator.
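A minimal sketch of this adjusted estimator follows (Python/NumPy assumed; the simulated data and the value of $\sigma_v^{2}$ are illustrative assumptions).

```python
import numpy as np

# Adjusted estimator for the measurement error model when sigma_v^2 is known a priori.
def adjusted_estimator(x, y, sigma_v2):
    mxx = np.mean((x - x.mean()) ** 2)
    myy = np.mean((y - y.mean()) ** 2)
    mxy = np.mean((x - x.mean()) * (y - y.mean()))
    if mxx <= sigma_v2:
        raise ValueError("m_xx must exceed sigma_v^2 for a valid estimate")
    b1 = mxy / (mxx - sigma_v2)               # beta_1 hat
    b0 = y.mean() - b1 * x.mean()             # beta_0 hat
    sigma2 = mxx - sigma_v2                   # sigma^2 hat
    sigma_u2 = myy - b1 ** 2 * sigma2         # sigma_u^2 hat
    return b0, b1, sigma2, sigma_u2

rng = np.random.default_rng(7)
x_true = rng.normal(2.0, 1.0, 5000)
x = x_true + rng.normal(0, np.sqrt(0.5), 5000)
y = 1.0 + 2.0 * x_true + rng.normal(0, np.sqrt(0.3), 5000)
print(adjusted_estimator(x, y, sigma_v2=0.5))
```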

2. $\sigma_u^{2}$ is known:
Suppose $\sigma_u^{2}$ is known a priori. Then, using $m_{xy} = \beta_1\sigma^{2}$, we can rewrite

$m_{yy} = \beta_1^{2}\sigma^{2} + \sigma_u^{2} = m_{xy}\beta_1 + \sigma_u^{2}$

$\Rightarrow\; \hat{\beta}_1 = \frac{m_{yy} - \sigma_u^{2}}{m_{xy}}; \qquad m_{yy} > \sigma_u^{2}$

$\hat{\sigma}^{2} = \frac{m_{xy}}{\hat{\beta}_1}$

$\hat{\sigma}_v^{2} = m_{xx} - \hat{\sigma}^{2}.$

The estimators $\hat{\beta}_1$, $\hat{\sigma}^{2}$ and $\hat{\sigma}_v^{2}$ are consistent estimators of $\beta_1$, $\sigma^{2}$ and $\sigma_v^{2}$, respectively. Note that $\hat{\beta}_1$ looks as if the reverse regression estimator of $\beta_1$ has been adjusted by $\sigma_u^{2}$ for its inconsistency, so it is also termed an adjusted estimator.


3. $\lambda = \dfrac{\sigma_u^{2}}{\sigma_v^{2}}$ is known:

Suppose the ratio of the measurement error variances,

$\lambda = \frac{\sigma_u^{2}}{\sigma_v^{2}},$

is known.

Consider

$\begin{aligned}
m_{yy} &= \beta_1^{2}\sigma^{2} + \sigma_u^{2}\\
&= \beta_1 m_{xy} + \lambda\sigma_v^{2} \qquad (\text{using (v)})\\
&= \beta_1 m_{xy} + \lambda\left(m_{xx} - \sigma^{2}\right) \qquad (\text{using (iii)})\\
&= \beta_1 m_{xy} + \lambda\left(m_{xx} - \frac{m_{xy}}{\beta_1}\right) \qquad (\text{using (v)})
\end{aligned}$

$\Rightarrow\; \beta_1^{2}m_{xy} + \beta_1\left(\lambda m_{xx} - m_{yy}\right) - \lambda m_{xy} = 0 \qquad (\beta_1 \neq 0).$

Solving this quadratic equation,

$\hat{\beta}_1 = \frac{\left(m_{yy} - \lambda m_{xx}\right) \pm \sqrt{\left(m_{yy} - \lambda m_{xx}\right)^{2} + 4\lambda m_{xy}^{2}}}{2m_{xy}} = \frac{U}{2m_{xy}}, \text{ say.}$

Since $m_{xy} = \beta_1\sigma^{2}$ and

$\hat{\sigma}^{2} > 0 \;\Rightarrow\; \frac{m_{xy}}{\hat{\beta}_1} > 0 \;\Rightarrow\; \frac{2m_{xy}^{2}}{U} > 0,$

and since $m_{xy}^{2} > 0$, $U$ must be nonnegative. This implies that the positive sign of the square root in $U$ has to be taken, and so

$\hat{\beta}_1 = \frac{\left(m_{yy} - \lambda m_{xx}\right) + \sqrt{\left(m_{yy} - \lambda m_{xx}\right)^{2} + 4\lambda m_{xy}^{2}}}{2m_{xy}}.$
Other estimates are

$\hat{\sigma}_v^{2} = \frac{m_{yy} - 2\hat{\beta}_1 m_{xy} + \hat{\beta}_1^{2}m_{xx}}{\lambda + \hat{\beta}_1^{2}}, \qquad \hat{\sigma}^{2} = \frac{m_{xy}}{\hat{\beta}_1}.$

Note that the same estimator $\hat{\beta}_1$ of $\beta_1$ can be obtained by orthogonal regression. This amounts to transforming $x_i$ to $x_i/\sigma_v$ and $y_i$ to $y_i/\sigma_u$ and using the orthogonal regression estimation with the transformed variables.
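A minimal sketch of this estimator for a known ratio $\lambda$ follows (Python/NumPy assumed; the simulated values are illustrative assumptions).

```python
import numpy as np

# Estimator of beta_1 when lambda = sigma_u^2 / sigma_v^2 is known (positive root).
def beta1_known_ratio(x, y, lam):
    mxx = np.mean((x - x.mean()) ** 2)
    myy = np.mean((y - y.mean()) ** 2)
    mxy = np.mean((x - x.mean()) * (y - y.mean()))
    disc = np.sqrt((myy - lam * mxx) ** 2 + 4 * lam * mxy ** 2)
    return ((myy - lam * mxx) + disc) / (2 * mxy)

rng = np.random.default_rng(8)
n = 5000
x_true = rng.normal(2.0, 1.0, n)
sigma_u2, sigma_v2 = 0.3, 0.5
x = x_true + rng.normal(0, np.sqrt(sigma_v2), n)
y = 1.0 + 2.0 * x_true + rng.normal(0, np.sqrt(sigma_u2), n)
print(beta1_known_ratio(x, y, lam=sigma_u2 / sigma_v2))   # close to 2
```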

4. Reliability ratio is known:

The reliability ratio associated with the explanatory variable is defined as the ratio of the variances of the true and observed values of the explanatory variable, so

$K_x = \frac{Var(\tilde{x})}{Var(x)} = \frac{\sigma^{2}}{\sigma^{2} + \sigma_v^{2}}; \qquad 0 \le K_x \le 1$

is the reliability ratio. Note that $K_x = 1$ when $\sigma_v^{2} = 0$, which means that there is no measurement error in the explanatory variable, while $K_x = 0$ means $\sigma^{2} = 0$, i.e., the explanatory variable is fixed. A higher value of $K_x$ is obtained when $\sigma_v^{2}$ is small, i.e., when the impact of measurement errors is small. The reliability ratio is a popular measure in psychometrics.

Let $K_x$ be known a priori. Then

$m_{xx} = \sigma^{2} + \sigma_v^{2}, \qquad m_{xy} = \beta_1\sigma^{2}$

$\Rightarrow\; \frac{m_{xy}}{m_{xx}} = \frac{\beta_1\sigma^{2}}{\sigma^{2} + \sigma_v^{2}} = \beta_1 K_x \;\Rightarrow\; \hat{\beta}_1 = \frac{m_{xy}}{K_x m_{xx}}$

$\sigma^{2} = \frac{m_{xy}}{\beta_1} \;\Rightarrow\; \hat{\sigma}^{2} = K_x m_{xx}$

$m_{xx} = \sigma^{2} + \sigma_v^{2} \;\Rightarrow\; \hat{\sigma}_v^{2} = (1 - K_x)\,m_{xx}.$

Note that $\hat{\beta}_1 = K_x^{-1}b$, where $b$ is the ordinary least squares estimator $b = \dfrac{m_{xy}}{m_{xx}}$.
5. $\beta_0$ is known:
Suppose $\beta_0$ is known a priori and $E(\tilde{x}) = \mu \neq 0$. Then

$\bar{y} = \beta_0 + \beta_1\mu \;\Rightarrow\; \hat{\beta}_1 = \frac{\bar{y} - \beta_0}{\hat{\mu}} = \frac{\bar{y} - \beta_0}{\bar{x}}$

$\hat{\sigma}^{2} = \frac{m_{xy}}{\hat{\beta}_1}$

$\hat{\sigma}_u^{2} = m_{yy} - \hat{\beta}_1 m_{xy}$

$\hat{\sigma}_v^{2} = m_{xx} - \frac{m_{xy}}{\hat{\beta}_1}.$

6. Both $\sigma_u^{2}$ and $\sigma_v^{2}$ are known:

This case leads to over-identification in the sense that the number of parameters to be estimated is smaller than the number of structural relationships binding them. So no unique solutions are obtained in this case.

Note: In each of the cases 1-6, note that the form of the estimate depends on the type of available information which is needed for the consistent estimation of the parameters. Such information can be available from various sources, e.g., long association of the experimenter with the experiment, similar types of studies conducted in the past, some external source, etc.

Estimation of parameters in the functional form:

In the functional form of the measurement error model, the $\tilde{x}_i$'s are assumed to be fixed. This assumption is unrealistic in the sense that when the $\tilde{x}_i$'s are unobservable and unknown, it is difficult to know whether they are fixed or not. It cannot even be ensured in repeated sampling that the same value is repeated. All that can be said in this case is that the inference is conditional upon the $\tilde{x}_i$'s. So assume that the $\tilde{x}_i$'s are conditionally known. So the model is

$\tilde{y}_i = \beta_0 + \beta_1\tilde{x}_i$
$x_i = \tilde{x}_i + v_i$
$y_i = \tilde{y}_i + u_i$

and then

$\begin{bmatrix}y_i\\ x_i\end{bmatrix} \sim N\left(\begin{bmatrix}\beta_0 + \beta_1\tilde{x}_i\\ \tilde{x}_i\end{bmatrix}, \begin{bmatrix}\sigma_u^{2} & 0\\ 0 & \sigma_v^{2}\end{bmatrix}\right).$

The likelihood function is

$L = \left(\frac{1}{2\pi\sigma_u^{2}}\right)^{n/2}\exp\left[-\frac{\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1\tilde{x}_i)^{2}}{2\sigma_u^{2}}\right]\left(\frac{1}{2\pi\sigma_v^{2}}\right)^{n/2}\exp\left[-\frac{\sum_{i=1}^{n}(x_i - \tilde{x}_i)^{2}}{2\sigma_v^{2}}\right].$

The log-likelihood is

$L^{*} = \ln L = \text{constant} - \frac{n}{2}\ln\sigma_u^{2} - \frac{n}{2}\ln\sigma_v^{2} - \frac{\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1\tilde{x}_i)^{2}}{2\sigma_u^{2}} - \frac{\sum_{i=1}^{n}(x_i - \tilde{x}_i)^{2}}{2\sigma_v^{2}}.$

The normal equations are obtained by partially differentiating $L^{*}$ and equating to zero:

$(I)\quad \frac{\partial L^{*}}{\partial\beta_0} = \frac{1}{\sigma_u^{2}}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1\tilde{x}_i) = 0$

$(II)\quad \frac{\partial L^{*}}{\partial\beta_1} = \frac{1}{\sigma_u^{2}}\sum_{i=1}^{n}\tilde{x}_i\,(y_i - \beta_0 - \beta_1\tilde{x}_i) = 0$

$(III)\quad \frac{\partial L^{*}}{\partial\sigma_u^{2}} = -\frac{n}{2\sigma_u^{2}} + \frac{1}{2\sigma_u^{4}}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1\tilde{x}_i)^{2} = 0$

$(IV)\quad \frac{\partial L^{*}}{\partial\sigma_v^{2}} = -\frac{n}{2\sigma_v^{2}} + \frac{1}{2\sigma_v^{4}}\sum_{i=1}^{n}(x_i - \tilde{x}_i)^{2} = 0$

$(V)\quad \frac{\partial L^{*}}{\partial\tilde{x}_i} = \frac{\beta_1}{\sigma_u^{2}}(y_i - \beta_0 - \beta_1\tilde{x}_i) + \frac{1}{\sigma_v^{2}}(x_i - \tilde{x}_i) = 0.$

Squaring and summing equation (V), we get

$\sum_{i}\left[\frac{\beta_1}{\sigma_u^{2}}(y_i - \beta_0 - \beta_1\tilde{x}_i)\right]^{2} = \sum_{i}\left[\frac{1}{\sigma_v^{2}}(x_i - \tilde{x}_i)\right]^{2}$

or

$\frac{\beta_1^{2}}{\sigma_u^{4}}\sum_{i}(y_i - \beta_0 - \beta_1\tilde{x}_i)^{2} = \frac{1}{\sigma_v^{4}}\sum_{i}(x_i - \tilde{x}_i)^{2}.$

Using the left-hand side of equation (III) and the right-hand side of equation (IV), we get

$\frac{n\beta_1^{2}}{\sigma_u^{2}} = \frac{n}{\sigma_v^{2}} \quad\Rightarrow\quad \beta_1 = \frac{\sigma_u}{\sigma_v}$
which is unacceptable because $\beta_1$ can be negative also; in the present case, as $\sigma_u > 0$ and $\sigma_v > 0$, $\beta_1$ would always be forced to be positive. Thus the maximum likelihood method breaks down because of insufficient information in the model, and increasing the sample size $n$ does not solve the problem. If restrictions like $\sigma_u^{2}$ known, $\sigma_v^{2}$ known or $\dfrac{\sigma_u^{2}}{\sigma_v^{2}}$ known are incorporated, then the maximum likelihood estimation proceeds as in the case of the structural form and similar estimates are obtained. For example, if $\lambda = \sigma_u^{2}/\sigma_v^{2}$ is known, then substitute it in the likelihood function and maximize it; the same solution as in the case of the structural form is obtained.

Chapter 17
Simultaneous Equations Models

In any regression modeling, generally, an equation is considered to represent a relationship describing a


phenomenon. Many situations involve a set of relationships which explain the behaviour of certain
variables. For example, in analyzing the market conditions for a particular commodity, there can be a
demand equation and a supply equation which explain the price and quantity of commodity exchanged in the
market at market equilibrium. So there are two equations to explain the whole phenomenon - one for demand
and another for supply. In such cases, it is not necessary that all the variables should appear in all the
equations. So estimation of parameters under this type of situation has those features that are not present
when a model involves only a single relationship. In particular, when a relationship is a part of a system, then
some explanatory variables are stochastic and are correlated with the disturbances. So the basic assumption
of a linear regression model that the explanatory variable and disturbance are uncorrelated or explanatory
variables are fixed is violated and consequently ordinary least squares estimator becomes inconsistent.

Similar to the classification of variables as explanatory variable and study variable in linear regression
model, the variables in simultaneous equation models are classified as endogenous variables and exogenous
variables.

Endogenous variables (Jointly determined variables)


The variables which are explained by the functioning of the system, and whose values are determined by the simultaneous interaction of the relations in the model, are called endogenous variables or jointly determined variables.

Exogenous variables (Predetermined variables)


The variables that contribute to provide explanations for the endogenous variables and values of which are
determined from outside the model are exogenous variables or predetermined variables.

Exogenous variables help in explaining the variations in the endogenous variables. It is customary to include past values of endogenous variables in the predetermined group. Since exogenous variables are predetermined, so
they are independent of disturbance term in the model. They satisfy those assumptions which explanatory
variables satisfy in the usual regression model. Exogenous variables influence the endogenous variables but

are not themselves influenced by them. One variable which is endogenous for one model can be exogenous
variable for the other model.

Note that in the linear regression model, the explanatory variables influence the study variable but not vice
versa. So relationship is one sided.

The classification of variables as endogenous and exogenous is important because a necessary condition for
uniquely estimating all the parameters is that the number of endogenous variables is equal to the number of
independent equations in the system. Moreover, the main distinguishing feature of predetermined variables in the estimation of parameters is that they are uncorrelated with the disturbance term in the equations in which they appear.

Simultaneous equation systems:


A model constitutes a system of simultaneous equations if all the relationships involved are needed for
determining the value of at least one of the endogenous variables included in the model. This implies that at
least one of the relationships includes more than one endogenous variable.

Example 1:
Now we consider the following example in detail and introduce various concepts and terminologies used in
describing the simultaneous equations models.

Consider a situation of an ideal market where transaction of only one commodity, say wheat, takes place.
Assume that the number of buyers and sellers is large so that the market is a perfectly competitive market. It
is also assumed that the amount of wheat that comes into the market in a day is completely sold out on the
same day. No seller takes it back. Now we develop a model for such mechanism.

Let
dt denotes the demand of the commodity, say wheat, at time t ,

st denotes the supply of the commodity, say wheat, at time t , and

qt denotes the quantity of the commodity, say wheat, transacted at time t.

By economic theory about the ideal market, we have the following condition:
$d_t = s_t, \quad t = 1,2,\ldots,n.$

Observe that
 the demand of wheat depends on
- price of wheat ( pt ) at time t.

- income of buyer (it ) at time t.

 the supply of wheat depends on


- price of wheat ( pt ) at time t.

- rainfall ( rt ) at time t .

From market conditions, we have


$q_t = d_t = s_t.$

Demand, supply and price are determined from each other.


Note that
 income can influence demand and supply, but demand and supply cannot influence the income.
 supply is influenced by rainfall, but rainfall is not influenced by the supply of wheat.

Our aim is to study the behaviour of $d_t$, $s_t$ and $p_t$, which are determined by the simultaneous equations model.

Since endogenous variables are influenced by exogenous variables but not vice versa,
 $d_t$, $s_t$ and $p_t$ are endogenous variables, and
 $i_t$ and $r_t$ are exogenous variables.

Now consider an additional variable for the model, the lagged value of price $p_t$, denoted as $p_{t-1}$. In a market,

generally the price of the commodity depends on the price of the commodity on previous day. If the price of
commodity today is less than the previous day, then buyer would like to buy more. For seller also, today’s
price of commodity depends on previous day’s price and based on which he decides the quantity of
commodity (wheat) to be brought in the market.

So the lagged price affects both the demand and the supply equations. Updating both models, we can now write that
 demand depends on $p_t$, $i_t$ and $p_{t-1}$.
 supply depends on $p_t$, $r_t$ and $p_{t-1}$.

Note that the lagged variables are considered as exogenous variables. The updated list of endogenous and exogenous variables is as follows:
 Endogenous variables: $p_t, d_t, s_t$
 Exogenous variables: $p_{t-1}, i_t, r_t$.

The mechanism of the market is now described by the following set of equations:
 demand: $d_t = \alpha_1 + \beta_1 p_t + \varepsilon_{1t}$
 supply: $s_t = \alpha_2 + \beta_2 p_t + \varepsilon_{2t}$
 equilibrium condition: $d_t = s_t = q_t$

where the $\alpha$'s denote the intercept terms, the $\beta$'s denote the regression coefficients and the $\varepsilon$'s denote the disturbance terms.

These equations are called structural equations. The error terms $\varepsilon_{1t}$ and $\varepsilon_{2t}$ are called structural disturbances. The coefficients $\alpha_1, \alpha_2, \beta_1$ and $\beta_2$ are called the structural coefficients.

The system of equations is called the structural form of the model.

Since $q_t = d_t = s_t$, the demand and supply equations can be expressed as

$q_t = \alpha_1 + \beta_1 p_t + \varepsilon_{1t}$ (I)
$q_t = \alpha_2 + \beta_2 p_t + \varepsilon_{2t}$ (II)

So there are only two structural relationships. The price is determined by the mechanism of the market and not by the buyer or the supplier. Thus $q_t$ and $p_t$ are the endogenous variables. Without loss of generality, we can assume that the variables associated with $\alpha_1$ and $\alpha_2$ are $X_1$ and $X_2$, respectively, such that $X_1 = 1$ and $X_2 = 1$. So $X_1 = 1$ and $X_2 = 1$ are predetermined and hence can be regarded as exogenous variables.

From the statistical point of view, we would like to write the model in such a form that OLS can be applied directly. So, equating (I) and (II),

$\alpha_1 + \beta_1 p_t + \varepsilon_{1t} = \alpha_2 + \beta_2 p_t + \varepsilon_{2t}$

or

$p_t = \frac{\alpha_1 - \alpha_2}{\beta_2 - \beta_1} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \beta_1} = \pi_{11} + v_{1t}$ (III)

$q_t = \frac{\alpha_1\beta_2 - \alpha_2\beta_1}{\beta_2 - \beta_1} + \frac{\beta_2\varepsilon_{1t} - \beta_1\varepsilon_{2t}}{\beta_2 - \beta_1} = \pi_{21} + v_{2t}$ (IV)

where

$\pi_{11} = \frac{\alpha_1 - \alpha_2}{\beta_2 - \beta_1}, \qquad \pi_{21} = \frac{\alpha_1\beta_2 - \alpha_2\beta_1}{\beta_2 - \beta_1}$

$v_{1t} = \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \beta_1}, \qquad v_{2t} = \frac{\beta_2\varepsilon_{1t} - \beta_1\varepsilon_{2t}}{\beta_2 - \beta_1}.$

Each endogenous variable is expressed as a function of the exogenous variable. Note that the exogenous variable 1 (from $X_1 = 1$ or $X_2 = 1$) does not appear explicitly.

The equations (III) and (IV) are called the reduced form relationships and in general, called the reduced
form of the model.

The coefficients $\pi_{11}$ and $\pi_{21}$ are called reduced form coefficients, and the errors $v_{1t}$ and $v_{2t}$ are called the reduced form disturbances. The reduced form essentially expresses every endogenous variable as a function of the exogenous variables. This presents a clear relationship between the reduced form coefficients and the structural coefficients, as well as between the structural disturbances and the reduced form disturbances. The reduced form is ready for the application of the OLS technique, and it satisfies all the assumptions needed for the application of OLS.

Suppose we apply the OLS technique to equations (III) and (IV) and obtain the OLS estimates of $\pi_{11}$ and $\pi_{21}$ as $\hat{\pi}_{11}$ and $\hat{\pi}_{21}$, respectively, which satisfy

$\hat{\pi}_{11} = \frac{\alpha_1 - \alpha_2}{\beta_2 - \beta_1}, \qquad \hat{\pi}_{21} = \frac{\alpha_1\beta_2 - \alpha_2\beta_1}{\beta_2 - \beta_1}.$
Note that $\hat{\pi}_{11}$ and $\hat{\pi}_{21}$ are the numerical values of the estimates. So now there are two equations and four unknown parameters $\alpha_1, \alpha_2, \beta_1$ and $\beta_2$. So it is not possible to derive unique estimates of the parameters of the model by applying the OLS technique to the reduced form. This is known as the problem of identification.

By this example, the following have been described upto now:


 Structural form relationship.
 Reduced form relationship.
 Need for reducing the structural form into reduced form.
 Reason for the problem of identification.

Now we describe the problem of identification in more detail.

The identification problem:


Consider the model in earlier Example 1, which describes the behaviour of a perfectly competitive market in
which only one commodity, say wheat, is transacted. The models describing the behaviour of consumer and
supplier are prescribed by demand and supply conditions given as
Demand: $d_t = \alpha_1 + \beta_1 p_t + \varepsilon_{1t}, \quad t = 1,2,\ldots,n$
Supply: $s_t = \alpha_2 + \beta_2 p_t + \varepsilon_{2t}$
Equilibrium condition: $d_t = s_t$.

If quantity $q_t$ is transacted at time $t$, then

$d_t = s_t = q_t.$

So we have a two-structural-equation model in two endogenous variables $(q_t$ and $p_t)$ and one exogenous variable (with value 1, given by $X_1 = 1, X_2 = 1$). The set of three equations is reduced to a set of two equations as follows:

Demand: $q_t = \alpha_1 + \beta_1 p_t + \varepsilon_{1t}$ (1)
Supply: $q_t = \alpha_2 + \beta_2 p_t + \varepsilon_{2t}$ (2)

Before analysis, we would like to check whether or not it is possible to estimate the parameters $\alpha_1, \alpha_2, \beta_1$ and $\beta_2$.

Multiplying equation (1) by $\lambda$ and equation (2) by $(1-\lambda)$ and then adding them together gives

$\lambda q_t + (1-\lambda)q_t = \left[\lambda\alpha_1 + (1-\lambda)\alpha_2\right] + \left[\lambda\beta_1 + (1-\lambda)\beta_2\right]p_t + \left[\lambda\varepsilon_{1t} + (1-\lambda)\varepsilon_{2t}\right]$

or $q_t = \alpha + \beta p_t + \varepsilon_t$ (3)

where $\alpha = \lambda\alpha_1 + (1-\lambda)\alpha_2$, $\beta = \lambda\beta_1 + (1-\lambda)\beta_2$, $\varepsilon_t = \lambda\varepsilon_{1t} + (1-\lambda)\varepsilon_{2t}$ and $\lambda$ is any scalar lying between 0 and 1.

Comparing equation (3) with equations (1) and (2), we notice that they have the same form. So it is difficult to say which is the supply equation and which is the demand equation. To see this, let equation (3) be the demand equation. Then there is no way to distinguish between the true demand equation (1) and the pretended demand equation (3).

A similar exercise can be done for the supply equation, and we find that there is no way to distinguish between the true supply equation (2) and the pretended supply equation (3).

Suppose we apply the OLS technique to these models. Applying OLS to equation (1) yields

$\hat{\beta}_1 = \frac{\sum_{t=1}^{n}(p_t - \bar{p})(q_t - \bar{q})}{\sum_{t=1}^{n}(p_t - \bar{p})^{2}} = 0.6, \text{ say,}$

where $\bar{p} = \frac{1}{n}\sum_{t=1}^{n}p_t$ and $\bar{q} = \frac{1}{n}\sum_{t=1}^{n}q_t$.

Applying OLS to equation (3) yields

$\hat{\beta} = \frac{\sum_{t=1}^{n}(p_t - \bar{p})(q_t - \bar{q})}{\sum_{t=1}^{n}(p_t - \bar{p})^{2}} = 0.6.$

Note that $\hat{\beta}_1$ and $\hat{\beta}$ have the same analytical expression, so they will also have the same numerical value, say 0.6. Looking at the value 0.6, it is difficult to say whether it determines equation (1) or equation (3).

Applying OLS to equation (2) yields

$\hat{\beta}_2 = \frac{\sum_{t=1}^{n}(p_t - \bar{p})(q_t - \bar{q})}{\sum_{t=1}^{n}(p_t - \bar{p})^{2}} = 0.6$
because $\hat{\beta}_2$ has the same analytical expression as $\hat{\beta}_1$ and $\hat{\beta}$, so

$\hat{\beta}_1 = \hat{\beta}_2 = \hat{\beta} = 0.6.$

Thus it is difficult to decide and identify whether $\hat{\beta}_1$ is determined by the value 0.6 or $\hat{\beta}_2$ is determined by the value 0.6. Increasing the number of observations also does not help in the identification of these equations. So we are not able to identify the parameters, and we take the help of economic theory to identify them.

Economic theory suggests that when the price increases, the supply increases but the demand decreases (in a plot of quantity against price, the demand curve slopes downward and the supply curve slopes upward), and this implies $\beta_1 < 0$ and $\beta_2 > 0$. Thus, since $0.6 > 0$, we can say that the value 0.6 represents $\hat{\beta}_2 > 0$ and so $\hat{\beta}_2 = 0.6$. But one can always choose a value of $\lambda$ such that the pretended equation does not violate the signs of the coefficients, say $\beta > 0$. So it again becomes difficult to see whether equation (3) represents the supply equation (2) or not. So none of the parameters is identifiable.

Now we obtain the reduced form of the model as

$p_t = \frac{\alpha_1 - \alpha_2}{\beta_2 - \beta_1} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \beta_1}$
or $p_t = \pi_{11} + v_{1t}$ (4)

$q_t = \frac{\alpha_1\beta_2 - \alpha_2\beta_1}{\beta_2 - \beta_1} + \frac{\beta_2\varepsilon_{1t} - \beta_1\varepsilon_{2t}}{\beta_2 - \beta_1}$
or $q_t = \pi_{21} + v_{2t}$. (5)
Applying OLS to equations (4) and (5) gives the OLSEs $\hat{\pi}_{11}$ and $\hat{\pi}_{21}$, which satisfy

$\hat{\pi}_{11} = \frac{\alpha_1 - \alpha_2}{\beta_2 - \beta_1}, \qquad \hat{\pi}_{21} = \frac{\alpha_1\beta_2 - \alpha_2\beta_1}{\beta_2 - \beta_1}.$

There are two equations and four unknowns. So unique estimates of the parameters $\alpha_1, \alpha_2, \beta_1$ and $\beta_2$ cannot be obtained. Thus equations (1) and (2) cannot be identified, the model is not identifiable, and estimation of the parameters is not possible.

Suppose a new exogenous variable, income $i_t$, is introduced in the model, which represents the income of the buyer at time $t$. Note that the demand of the commodity is influenced by income. On the other hand, the supply of the commodity is not influenced by the income, so this variable is introduced only in the demand equation. The structural equations (1) and (2) now become

Demand: $q_t = \alpha_1 + \beta_1 p_t + \gamma_1 i_t + \varepsilon_{1t}$ (6)
Supply: $q_t = \alpha_2 + \beta_2 p_t + \varepsilon_{2t}$ (7)

where $\gamma_1$ is the structural coefficient associated with income. The pretended equation is obtained by multiplying equations (6) and (7) by $\lambda$ and $(1-\lambda)$, respectively, and then adding them together. This is obtained as follows:

$\lambda q_t + (1-\lambda)q_t = \left[\lambda\alpha_1 + (1-\lambda)\alpha_2\right] + \left[\lambda\beta_1 + (1-\lambda)\beta_2\right]p_t + \lambda\gamma_1 i_t + \left[\lambda\varepsilon_{1t} + (1-\lambda)\varepsilon_{2t}\right]$

or $q_t = \alpha + \beta p_t + \gamma i_t + \varepsilon_t$ (8)

where $\alpha = \lambda\alpha_1 + (1-\lambda)\alpha_2$, $\beta = \lambda\beta_1 + (1-\lambda)\beta_2$, $\gamma = \lambda\gamma_1$, $\varepsilon_t = \lambda\varepsilon_{1t} + (1-\lambda)\varepsilon_{2t}$, and $0 < \lambda < 1$ is a scalar.

Suppose now we claim that equation (8) is the true demand equation because it contains $p_t$ and $i_t$, which influence the demand. But we note that it is difficult to decide, between the two equations (6) and (8), which one is the true demand equation.

Suppose now we claim that equation (8) is the true supply equation. This claim is wrong because income does not affect the supply. So equation (7) is the supply equation.

Thus the supply equation is now identifiable, but the demand equation is not identifiable. Such a situation is termed partial identification.
Now we find the reduced form of the structural equations (6) and (7). This is achieved by first solving for $p_t$ and then substituting to obtain an equation in $q_t$. Such an exercise yields reduced form equations of the following form:

$p_t = \pi_{11} + \pi_{12}i_t + v_{1t}$ (9)
$q_t = \pi_{21} + \pi_{22}i_t + v_{2t}$. (10)

Applying OLS to equations (9) and (10), we get the OLSEs $\hat{\pi}_{11}, \hat{\pi}_{12}, \hat{\pi}_{21}, \hat{\pi}_{22}$. Now we have four such equations and five unknowns $(\alpha_1, \alpha_2, \beta_1, \beta_2, \gamma_1)$. So the parameters are not determined uniquely, and thus the model as a whole is not identifiable.

However, here

$\beta_2 = \frac{\pi_{22}}{\pi_{12}}, \qquad \alpha_2 = \pi_{21} - \frac{\pi_{22}}{\pi_{12}}\,\pi_{11} = \pi_{21} - \beta_2\pi_{11}.$

If $\hat{\pi}_{11}, \hat{\pi}_{12}, \hat{\pi}_{21}$ and $\hat{\pi}_{22}$ are available, then $\hat{\alpha}_2$ and $\hat{\beta}_2$ can be obtained by substituting the $\hat{\pi}$'s in place of the $\pi$'s. So $\alpha_2$ and $\beta_2$, which are the parameters of the supply equation, are determined uniquely. So the supply equation is identified but the demand equation is still not identifiable.
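The recovery of the supply-equation parameters from the reduced-form OLS estimates (indirect least squares) can be sketched as follows. This is a minimal illustration in Python/NumPy (assumed), with a simulated market whose numerical values are purely illustrative assumptions.

```python
import numpy as np

# Indirect least squares sketch: estimate the reduced form (9)-(10) by OLS and recover
# the supply parameters via beta_2 = pi_22 / pi_12 and alpha_2 = pi_21 - beta_2 * pi_11.
rng = np.random.default_rng(9)
T = 5000
a1, b1, g1 = 10.0, -1.0, 0.5            # demand: q = a1 + b1*p + g1*i + e1
a2, b2 = 2.0, 1.5                       # supply: q = a2 + b2*p + e2
inc = rng.normal(20, 3, T)              # exogenous income
e1, e2 = rng.normal(0, 1, T), rng.normal(0, 1, T)

p = (a1 - a2 + g1 * inc + e1 - e2) / (b2 - b1)   # equilibrium price (reduced form)
q = a2 + b2 * p + e2                             # equilibrium quantity

W = np.column_stack([np.ones(T), inc])           # reduced-form regressors (1, i_t)
pi_p = np.linalg.lstsq(W, p, rcond=None)[0]      # (pi_11, pi_12)
pi_q = np.linalg.lstsq(W, q, rcond=None)[0]      # (pi_21, pi_22)

b2_hat = pi_q[1] / pi_p[1]                       # beta_2 hat
a2_hat = pi_q[0] - b2_hat * pi_p[0]              # alpha_2 hat
print("supply estimates:", a2_hat, b2_hat)       # close to (2.0, 1.5)
```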

Now, as done earlier, we introduce another exogenous variable, rainfall, denoted as $r_t$, which denotes the amount of rainfall at time $t$. The rainfall influences the supply because better rainfall produces a better yield of wheat. On the other hand, the demand for wheat is not influenced by the rainfall. So the updated set of structural equations is

Demand: $q_t = \alpha_1 + \beta_1 p_t + \gamma_1 i_t + \varepsilon_{1t}$ (11)
Supply: $q_t = \alpha_2 + \beta_2 p_t + \gamma_2 r_t + \varepsilon_{2t}$. (12)

The pretended equation is obtained by adding together the equations obtained after multiplying equation (11) by $\lambda$ and equation (12) by $(1-\lambda)$ as follows:

$\lambda q_t + (1-\lambda)q_t = \left[\lambda\alpha_1 + (1-\lambda)\alpha_2\right] + \left[\lambda\beta_1 + (1-\lambda)\beta_2\right]p_t + \lambda\gamma_1 i_t + (1-\lambda)\gamma_2 r_t + \left[\lambda\varepsilon_{1t} + (1-\lambda)\varepsilon_{2t}\right]$

or $q_t = \alpha + \beta p_t + \gamma i_t + \delta r_t + \varepsilon_t$ (13)

where $\alpha = \lambda\alpha_1 + (1-\lambda)\alpha_2$, $\beta = \lambda\beta_1 + (1-\lambda)\beta_2$, $\gamma = \lambda\gamma_1$, $\delta = (1-\lambda)\gamma_2$, $\varepsilon_t = \lambda\varepsilon_{1t} + (1-\lambda)\varepsilon_{2t}$, and $0 < \lambda < 1$ is a scalar.
Now we claim that equation (13) is a demand equation. The demand does not depend on rainfall. So unless $\lambda = 1$, so that $r_t$ is absent from the model, equation (13) cannot be a demand equation. Thus equation (11) is the demand equation, and the demand equation is identified.

Now we claim that equation (13) is the supply equation. The supply is not influenced by the income of the buyer, so equation (13) cannot be a supply equation. Thus equation (12) is the supply equation. So now the supply equation is also identified.

The reduced form model from the structural equations (11) and (12) can be obtained in the following form:

$p_t = \pi_{11} + \pi_{12}i_t + \pi_{13}r_t + v_{1t}$ (14)
$q_t = \pi_{21} + \pi_{22}i_t + \pi_{23}r_t + v_{2t}$. (15)

Application of the OLS technique to equations (14) and (15) yields the OLSEs $\hat{\pi}_{11}, \hat{\pi}_{12}, \hat{\pi}_{13}, \hat{\pi}_{21}, \hat{\pi}_{22}$ and $\hat{\pi}_{23}$. So now there are six such equations and six unknowns $\alpha_1, \alpha_2, \beta_1, \beta_2, \gamma_1$ and $\gamma_2$. So all the estimates are uniquely determined. Thus equations (11) and (12) are exactly identified.

Finally, we introduce a lagged endogenous variable p_{t−1} which denotes the price of the commodity on the
previous day. Since only the supply of wheat is affected by the price on the previous day, it is introduced
in the supply equation only as

Demand:  q_t = α_1 + β_1 p_t + γ_1 i_t + ε_1t        (16)
Supply:  q_t = α_2 + β_2 p_t + γ_2 r_t + δ_2 p_{t−1} + ε_2t        (17)

where δ_2 is the structural coefficient associated with p_{t−1}.

The pretended equation is obtained by first multiplying equation (16) by λ and (17) by (1 − λ) and then
adding them together as follows:

λ q_t + (1 − λ) q_t = α + β p_t + γ i_t + δ r_t + (1 − λ) δ_2 p_{t−1} + ε_t
or  q_t = α + β p_t + γ i_t + δ r_t + θ p_{t−1} + ε_t,   where θ = (1 − λ) δ_2.        (18)
Now we claim that equation (18) represents the demand equation. Since rainfall and lagged price do not
affect the demand, so equation (18) cannot be demand equation. Thus equation (16) is a demand equation
and the demand equation is identified.

Now finally, we claim that equation (18) is the supply equation. Since income does not affect supply, so
equation (18) cannot be a supply equation. Thus equation (17) is supply equation and the supply equation is
identified.

The reduced form equations from equations (16) and (17) can be obtained as of the following form:
p_t = π_11 + π_12 i_t + π_13 r_t + π_14 p_{t−1} + v_1t        (19)
q_t = π_21 + π_22 i_t + π_23 r_t + π_24 p_{t−1} + v_2t.       (20)

Applying the OLS technique to equations (19) and (20) gives the OLSEs
π̂_11, π̂_12, π̂_13, π̂_14, π̂_21, π̂_22, π̂_23 and π̂_24. So there are eight equations in seven parameters
α_1, α_2, β_1, β_2, γ_1, γ_2 and δ_2. So unique estimates of all the parameters are not available. In fact, in this case,
the supply equation (17) is identifiable and the demand equation (16) is over-identified (in terms of multiple
solutions).

The whole analysis in this example can be classified into three categories –

(1) Under-identifiable case:


The estimation of parameters is not at all possible in this case. Not enough estimates are available for the
structural parameters.

(2) Exactly identifiable case :


The estimation of parameters is possible in this case. The OLSE of reduced form coefficients leads to unique
estimates of structural coefficients.

(3) Over identifiable case :


The estimation of parameters, in this case, is possible. The OLSE of reduced form coefficients leads to
multiple estimates of structural coefficients.

Analysis:
Suppose there are G jointly dependent (endogenous) variables y_1, y_2, ..., y_G and K predetermined
(exogenous) variables x_1, x_2, ..., x_K. Let there be n observations available on each of the variables and
G structural equations connecting the variables, which describe the complete model as follows:

β_11 y_1t + β_12 y_2t + ... + β_1G y_Gt + γ_11 x_1t + γ_12 x_2t + ... + γ_1K x_Kt = ε_1t
β_21 y_1t + β_22 y_2t + ... + β_2G y_Gt + γ_21 x_1t + γ_22 x_2t + ... + γ_2K x_Kt = ε_2t
⋮
β_G1 y_1t + β_G2 y_2t + ... + β_GG y_Gt + γ_G1 x_1t + γ_G2 x_2t + ... + γ_GK x_Kt = ε_Gt.

These equations can be expressed in matrix form as

[ β_11  β_12  ...  β_1G ] [ y_1t ]   [ γ_11  γ_12  ...  γ_1K ] [ x_1t ]   [ ε_1t ]
[ β_21  β_22  ...  β_2G ] [ y_2t ] + [ γ_21  γ_22  ...  γ_2K ] [ x_2t ] = [ ε_2t ]
[  ⋮     ⋮          ⋮   ] [  ⋮   ]   [  ⋮     ⋮          ⋮   ] [  ⋮   ]   [  ⋮   ]
[ β_G1  β_G2  ...  β_GG ] [ y_Gt ]   [ γ_G1  γ_G2  ...  γ_GK ] [ x_Kt ]   [ ε_Gt ]

or
S:  B y_t + Γ x_t = ε_t,   t = 1, 2, ..., n

where B is a (G × G) matrix of unknown coefficients of the jointly dependent variables, y_t is a (G × 1) vector of
observations on the G jointly dependent variables, Γ is a (G × K) matrix of structural coefficients of the
predetermined variables, x_t is a (K × 1) vector of observations on the K predetermined variables and ε_t is a
(G × 1) vector of structural disturbances.

The structural form S describes the functioning of the model at time t.

Assuming B is nonsingular, premultiplying the structural form equations by B^{-1}, we get

B^{-1} B y_t + B^{-1} Γ x_t = B^{-1} ε_t
or  y_t = Π x_t + v_t,   t = 1, 2, ..., n.

This is the reduced form equation of the model, where Π = −B^{-1} Γ is the matrix of reduced form coefficients
and v_t = B^{-1} ε_t is the reduced form disturbance vector.

If B is singular, then one or more structural relations would be a linear combination of other structural
relations. If B is non-singular, such identities are eliminated.

The structural form relationship describes the interaction taking place inside the model. In reduced form
relationship, the jointly dependent (endogenous) variables are expressed as linear combination of
predetermined (exogenous) variables. This is the difference between structural and reduced form
relationships.

Assume that the ε_t's are identically and independently distributed following N(0, Σ) and the v_t's are identically
and independently distributed following N(0, Ω), where Ω = B^{-1} Σ (B')^{-1}, with

E(ε_t) = 0,  E(ε_t ε_t') = Σ,  E(ε_t ε_{t*}') = 0 for all t ≠ t*,
E(v_t) = 0,  E(v_t v_t') = Ω,  E(v_t v_{t*}') = 0 for all t ≠ t*.

The joint probability density function of y_t given x_t is

p(y_t | x_t) = p(v_t)
            = p(ε_t) |∂ε_t / ∂v_t|
            = p(ε_t) |det(B)|

where ∂ε_t/∂v_t is the Jacobian of the transformation and |det(B)| is the absolute value of the determinant of B.

The likelihood function is

L = p(y_1, y_2, ..., y_n | x_1, x_2, ..., x_n)
  = ∏_{t=1}^{n} p(y_t | x_t)
  = |det(B)|^n ∏_{t=1}^{n} p(ε_t).

Applying a nonsingular linear transformation to the structural equation S with a nonsingular matrix D, we get

D B y_t + D Γ x_t = D ε_t
or  S*:  B* y_t + Γ* x_t = ε_t*,   t = 1, 2, ..., n

where B* = DB, Γ* = DΓ, ε_t* = D ε_t and the structural model S* describes the functioning of this model at time
t. Now find p(y_t | x_t) with S* as follows:

p(y_t | x_t) = p(ε_t*) |∂ε_t*/∂v_t|
            = p(ε_t*) |det(B*)|.

Also
ε_t* = D ε_t
∂ε_t*/∂ε_t = D.

Thus
p(y_t | x_t) = p(ε_t) |∂ε_t/∂ε_t*| |det(B*)|
            = p(ε_t) |det(D^{-1})| |det(DB)|
            = p(ε_t) |det(D^{-1})| |det(D)| |det(B)|
            = p(ε_t) |det(B)|.

The likelihood function corresponding to S* is

L* = |det(B*)|^n ∏_{t=1}^{n} p(ε_t*)
   = |det(D)|^n |det(B)|^n ∏_{t=1}^{n} p(ε_t) |∂ε_t/∂ε_t*|
   = |det(D)|^n |det(B)|^n ∏_{t=1}^{n} p(ε_t) |det(D^{-1})|
   = L.

Thus both the structural forms S and S* have the same likelihood function. Since the likelihood function
forms the basis of statistical analysis, both S and S* have the same implications. Moreover, it is not possible
to identify whether the likelihood function corresponds to S or to S*. Any attempt to estimate the parameters
will result in failure in the sense that we cannot know whether we are estimating S or S*. Thus S and
S* are observationally equivalent. So the model is not identifiable.

A parameter is said to be identifiable within the model if the parameter has the same value for all equivalent
structures contained in the model.

If all the parameters in the structural equation are identifiable, then the structural equation is said to be
identifiable.

Given a structure, we can thus find many observationally equivalent structures by non-singular linear
transformation.

The a priori restrictions on B and Γ may help in the identification of parameters. The derived structures may
not satisfy these restrictions and may therefore not be admissible.

The presence and/or absence of certain variables helps in identifiability. So we use and apply some
a priori restrictions. These a priori restrictions may arise from various sources such as economic theory, e.g., it is
known that price and income affect the demand for wheat but rainfall does not affect it, whereas the supply
of wheat depends on price and rainfall but not on income. There are many types of restrictions available which can
solve the problem of identification. We consider zero-one type restrictions.

Zero-one type restrictions:


Suppose the a priori restrictions are of zero-one type, i.e., some of the coefficients are one and others are zero.
Without loss of generality, consider S as

B y_t + Γ x_t = ε_t,   t = 1, 2, ..., n.

When the zero-one type restrictions are incorporated in the model, suppose there are G_Δ jointly dependent
and K_* predetermined variables in the first equation of S having nonzero coefficients. The remaining (G − G_Δ)
jointly dependent and (K − K_*) predetermined variables have coefficients zero.

Without loss of generality, let β_Δ and γ_* be the row vectors formed by the nonzero elements in the first row
of B and Γ respectively. Thus the first row of B can be expressed as (β_Δ  0). So in the first row of B, G_Δ
coefficients are present and (G − G_Δ) coefficients are zero.

Similarly, the first row of Γ can be written as (γ_*  0). So in it there are K_* elements present (those take
value one in the presence/absence sense) and (K − K_*) elements absent (those take value zero).

The first equation of the model can be rewritten as

β_11 y_1t + β_12 y_2t + ... + β_{1G_Δ} y_{G_Δ t} + γ_11 x_1t + γ_12 x_2t + ... + γ_{1K_*} x_{K_* t} = ε_1t

or  (β_Δ  0) y_t + (γ_*  0) x_t = ε_1t,   t = 1, 2, ..., n.

Assume every equation describes the behaviour of a particular variable, so that we can take β_11 = 1.

If β_11 ≠ 1, then divide the whole equation by β_11 so that the coefficient of y_1t is one.

So the first equation of the model becomes

y_1t + β_12 y_2t + ... + β_{1G_Δ} y_{G_Δ t} + γ_11 x_1t + γ_12 x_2t + ... + γ_{1K_*} x_{K_* t} = ε_1t.

Now the reduced form coefficient relationship is

Π = −B^{-1} Γ
or  B Π = −Γ
or  (β_Δ  0) Π = −(γ_*  0_**)

where β_Δ has G_Δ elements, the adjoining 0 has (G − G_Δ) elements, γ_* has K_* elements and 0_** has (K − K_*) elements.

Partition

Π = [ Π_{Δ*}    Π_{Δ**}  ]
    [ Π_{ΔΔ*}   Π_{ΔΔ**} ]

where the orders of Π_{Δ*}, Π_{Δ**}, Π_{ΔΔ*} and Π_{ΔΔ**} are (G_Δ × K_*), (G_Δ × K_**), (G_ΔΔ × K_*) and (G_ΔΔ × K_**)
respectively, with G_ΔΔ = G − G_Δ and K_** = K − K_*.

We can re-express

(β_Δ  0) Π = −(γ_*  0_**)

or  (β_Δ  0) [ Π_{Δ*}    Π_{Δ**}  ]  =  −(γ_*  0_**)
             [ Π_{ΔΔ*}   Π_{ΔΔ**} ]

⟹  β_Δ Π_{Δ*} = −γ_*        (i)
    β_Δ Π_{Δ**} = 0_**.      (ii)

Assume Π is known. Then (i) gives a unique solution for γ_* if β_Δ is uniquely found from (ii). Thus the
identifiability of S rests upon the unique determination of β_Δ. Out of the G_Δ elements in β_Δ, one has coefficient
1, so there are (G_Δ − 1) elements in β_Δ that are unknown.

Note that
β_Δ = (1, β_12, ..., β_{1G_Δ}).

As
β_Δ Π_{Δ**} = 0_**
or  (1, β_12, ..., β_{1G_Δ}) Π_{Δ**} = 0_**,

the (G_Δ − 1) unknown elements of β_Δ are uniquely determined as a non-trivial solution when

rank(Π_{Δ**}) = G_Δ − 1.

Thus the first equation is identifiable when

rank(Π_{Δ**}) = G_Δ − 1.

This is known as the rank condition for the identifiability of parameters in S. This condition is necessary
and sufficient.

Another condition, known as the order condition, is only necessary and not sufficient. The order condition is
derived as follows:

We use the result that for any matrix A of order (m × n), rank(A) ≤ min(m, n); so if rank(A) = m is required,
then obviously n ≥ m. Since β_Δ Π_{Δ**} = 0_** and β_Δ has only (G_Δ − 1) unknown elements, which are identifiable when

rank(Π_{Δ**}) = G_Δ − 1,

it follows that
K − K_* ≥ G_Δ − 1.

This is known as the order condition for identifiability.

There are various ways in which these conditions can be represented to have meaningful interpretations. We
discuss them as follows:

1.  (G − G_Δ) + (K − K_*) ≥ (G − G_Δ) + (G_Δ − 1)
    or  (G − G_Δ) + (K − K_*) ≥ G − 1.

Here
G − G_Δ: number of jointly dependent variables left out of the equation

y_1t + β_12 y_2t + ... + β_{1G_Δ} y_{G_Δ t} + γ_11 x_1t + ... + γ_{1K_*} x_{K_* t} = ε_1t,

K − K_*: number of predetermined variables left out of the same equation,

G − 1: total number of equations minus 1.

So the left-hand side of the condition denotes the total number of variables excluded from this equation.

Thus if the total number of variables excluded from this equation is at least (G − 1), then the order condition
for identifiability is satisfied.

2.  K − K_* ≥ G_Δ − 1.
Here
K_*: the number of predetermined variables present in the equation.

G_Δ: the number of jointly dependent variables present in the equation.

3.  Define L = (K − K_*) − (G_Δ − 1).

L measures the degree of overidentification.

If L = 0, then the equation is said to be exactly identified.

If L > 0, then the equation is said to be over-identified.

L < 0  ⟹  the parameters cannot be estimated.
L = 0  ⟹  unique estimates of the parameters are obtained.
L > 0  ⟹  the parameters are estimated with multiple estimators.

So by looking at the value of L, we can check the identifiability of the equation.


Rank condition tells whether the equation is identified or not. If identified, then the order condition tells
whether the equation is exactly identified or over identified.
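
As a tiny illustrative sketch (the helper function and its name are ours, not part of the notes), the order-condition classification by L can be coded directly; applied to the demand equation (16) of the wheat example (K = 3, K_* = 1, G_Δ = 2), it reproduces the over-identified verdict stated earlier:

def order_condition(K, K_star, G_delta):
    # L = (K - K_star) - (G_delta - 1), the degree of overidentification
    L = K - K_star - (G_delta - 1)
    if L < 0:
        return L, "parameters cannot be estimated"
    return L, "exactly identified" if L == 0 else "over-identified"

print(order_condition(K=3, K_star=1, G_delta=2))   # (1, 'over-identified')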
Note
We have illustrated the various aspects of estimation and conditions for identification in the simultaneous
equation system using the first equation of the system. It may be noted that this can be done for any equation
of the system, and it is not necessary to consider only the first equation. From now onwards, we do not restrict
ourselves to the first equation but consider any equation, say the i-th equation (i = 1, 2, ..., G).

Working rule for rank condition


The checking of the rank condition sometimes, in practice, can be a difficult task. An equivalent form of rank
condition based on the partitioning of structural coefficients is as follows.

Suppose

B = [ β_Δ    0    ],    Γ = [ γ_*    0_**  ]
    [ B_Δ    B_ΔΔ ]         [ Γ_*    Γ_**  ]

where β_Δ, γ_*, 0 and 0_** are row vectors consisting of G_Δ, K_*, G_ΔΔ and K_** elements respectively.

Similarly, the orders of B_Δ, B_ΔΔ, Γ_* and Γ_** are ((G − 1) × G_Δ), ((G − 1) × G_ΔΔ), ((G − 1) × K_*) and
((G − 1) × K_**) respectively. Note that B_ΔΔ and Γ_** are the matrices of structural coefficients of the
variables omitted from the i-th equation (i = 1, 2, ..., G) but included in the other structural equations.

Form the matrix

(B  Γ) = [ β_Δ    0      γ_*    0_**  ]
         [ B_Δ    B_ΔΔ   Γ_*    Γ_**  ]

so that

B^{-1}(B  Γ) = (B^{-1}B   B^{-1}Γ) = (I   −Π)
             = [ I_Δ    0      −Π_{Δ,*}    −Π_{Δ,**}   ]
               [ 0      I_ΔΔ   −Π_{ΔΔ,*}   −Π_{ΔΔ,**}  ].

If

Δ* = [ 0      0_**  ]
     [ B_ΔΔ   Γ_**  ]

then clearly the rank of Δ* is the same as the rank of (B_ΔΔ  Γ_**), since the rank of a matrix is not affected by
enlarging the matrix by a row of zeros or by switching any columns.

Now using the result that if a matrix A is multiplied by a nonsingular matrix, then the product has the same
rank as A, we can write

rank(Δ*) = rank(B^{-1} Δ*).

Since the columns of B^{-1} Δ* are the corresponding columns of (I  −Π),

rank(B^{-1} Δ*) = rank [ 0      −Π_{Δ,**}   ]
                       [ I_ΔΔ   −Π_{ΔΔ,**}  ]
               = rank(I_ΔΔ) + rank(Π_{Δ,**})
               = (G − G_Δ) + (G_Δ − 1)
               = G − 1

and
rank(B^{-1} Δ*) = rank(Δ*) = rank(B_ΔΔ  Γ_**).

So
rank(B_ΔΔ  Γ_**) = G − 1

and then the equation is identifiable.

Note that (B_ΔΔ  Γ_**) is a matrix constructed from the coefficients of the variables excluded from that particular
equation but included in the other equations of the model. If rank(B_ΔΔ  Γ_**) = G − 1, then the equation is
identifiable, and this is a necessary and sufficient condition. An advantage of this form is that it avoids the
inversion of a matrix. A working rule is proposed as follows.
Working rule:
1. Write the model in tabular form by putting 'X' if the variable is present and '0' if the variable is
absent.
2. For the equation under study, mark the 0's (zeros) and pick up the corresponding columns,
suppressing that row.
3. If we can choose (G − 1) rows and (G − 1) columns such that none of them is entirely zero, then the equation can be identified.

Example 2:
Now we discuss the identifiability of the following simultaneous equations model with three structural
equations:

(1)  y_1 + β_13 y_3 + γ_11 x_1 + γ_13 x_3 = ε_1
(2)  y_1 + γ_21 x_1 + γ_23 x_3 = ε_2
(3)  y_2 + β_33 y_3 + γ_31 x_1 + γ_32 x_2 = ε_3.
First, we represent equations (1)-(3) in tabular form as follows:

Equation number   y_1  y_2  y_3   x_1  x_2  x_3   G_Δ − 1     K_*       L
      1            X    0    X     X    0    X    2 − 1 = 1    2    3 − 3 = 0
      2            X    0    0     X    0    X    1 − 1 = 0    2    3 − 2 = 1
      3            0    X    X     X    X    0    2 − 1 = 1    2    3 − 3 = 0

• G = number of equations = 3.
• 'X' denotes the presence and '0' denotes the absence of a variable in an equation.
• G_Δ = number of 'X' in (y_1  y_2  y_3).
• K_* = number of 'X' in (x_1  x_2  x_3).

Consider equation (1):

• Write the columns corresponding to '0', which are the columns corresponding to y_2 and x_2, as follows:

                 y_2   x_2
  Equation (2)    0     0
  Equation (3)    X     X

• Check whether any row/column has all elements '0'. If we can pick up a block of order (G − 1) in which no
  row and no column has all elements '0', then the equation is identifiable.

  Here G − 1 = 3 − 1 = 2,  B_ΔΔ = [ 0 ],  Γ_** = [ 0 ].
                                  [ X ]          [ X ]

  So we need a (2 × 2) block in which no row and no column has all elements '0'. In the case of equation (1),
  the first row is all '0', so it is not identifiable.

  Notice that from the order condition we have L = 0, which indicates that the equation is exactly identified;
  this conclusion is misleading. This happens because the order condition is only necessary but not sufficient.

Consider equation (2):

• Identify the '0's and then

                 y_2   y_3   x_2
  Equation (1)    0     X     0
  Equation (3)    X     X     X

• B_ΔΔ = [ 0  X ],   Γ_** = [ 0 ].
         [ X  X ]           [ X ]

• G − 1 = 3 − 1 = 2.
• We see that there is at least one (2 × 2) block in which no row and no column is all '0'. So equation (2)
  is identified.
• Also, L = 1 > 0  ⟹  equation (2) is over-identified.

Consider equation (3):

• Identify the '0's:

                 y_1   x_3
  Equation (1)    X     X
  Equation (2)    X     X

• So B_ΔΔ = [ X ],   Γ_** = [ X ].
            [ X ]           [ X ]

• We see that there is no '0' present in the block. So equation (3) is identified.
• Also, L = 0  ⟹  equation (3) is exactly identified.

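A hedged numerical sketch of the working rule (the helper name and the trick of filling the 'X' positions with arbitrary nonzero values to mimic generic structural coefficients are ours, not part of the notes): the rank of the block (B_ΔΔ  Γ_**) can then be checked with numpy, and for Example 2 it reproduces the three conclusions obtained above.

import numpy as np

def identified_by_rank(table, eq, rng=np.random.default_rng(1)):
    # table: G x (G+K) presence matrix (1 = 'X', 0 = absent), one row per equation
    tab = np.asarray(table, dtype=float)
    G = tab.shape[0]
    coeff = tab * rng.uniform(1.0, 2.0, size=tab.shape)   # arbitrary nonzero values at the 'X' positions
    excluded = np.where(tab[eq] == 0)[0]                  # columns marked '0' in the studied equation
    block = np.delete(coeff[:, excluded], eq, axis=0)     # (B_dd  Gamma_**): drop that equation's own row
    return np.linalg.matrix_rank(block) == G - 1          # rank condition

# Example 2, columns ordered (y1, y2, y3, x1, x2, x3)
tab = [[1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 0, 1],
       [0, 1, 1, 1, 1, 0]]
print([identified_by_rank(tab, e) for e in range(3)])     # [False, True, True]
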
Estimation of parameters
To estimate the parameters of the structural equation, we assume that the equations are identifiable.

Consider the first equation of the model


β_11 y_1t + β_12 y_2t + ... + β_1G y_Gt + γ_11 x_1t + γ_12 x_2t + ... + γ_1K x_Kt = ε_1t,   t = 1, 2, ..., n.

Assume β_11 = 1 and incorporate the zero-one type restrictions. Then we have

y_1t + β_12 y_2t + ... + β_{1G_Δ} y_{G_Δ t} + γ_11 x_1t + γ_12 x_2t + ... + γ_{1K_*} x_{K_* t} = ε_1t,   t = 1, 2, ..., n.

Writing this equation in vector and matrix notation by collecting all n observations, we can write

y_1 = Y_1 β + X_1 γ + ε_1

where

y_1 = [ y_11 ] ,   Y_1 = [ y_21  y_31  ...  y_{G_Δ 1} ]
      [ y_12 ]           [ y_22  y_32  ...  y_{G_Δ 2} ]
      [  ⋮   ]           [  ⋮     ⋮            ⋮     ]
      [ y_1n ]           [ y_2n  y_3n  ...  y_{G_Δ n} ]

X_1 = [ x_11  x_21  ...  x_{K_* 1} ] ,   β = [ −β_12    ] ,   γ = [ −γ_11     ] .
      [ x_12  x_22  ...  x_{K_* 2} ]         [ −β_13    ]         [ −γ_12     ]
      [  ⋮     ⋮            ⋮     ]         [    ⋮     ]         [    ⋮      ]
      [ x_1n  x_2n  ...  x_{K_* n} ]         [ −β_{1G_Δ} ]         [ −γ_{1K_*} ]

The order of y_1 is (n × 1), Y_1 is (n × (G_Δ − 1)), X_1 is (n × K_*), β is ((G_Δ − 1) × 1) and γ is (K_* × 1).

This describes one equation of the structural model.


Now we describe the general notations in this model.

Consider the model with the zero-one type restrictions incorporated as

y_1 = Y_1 β + X_1 γ + ε

where y_1 is an (n × 1) vector of observations on the jointly dependent variable being explained, Y_1 is an
(n × (G_Δ − 1)) matrix of observations on the jointly dependent variables appearing on the right-hand side of the
equation, G_Δ denotes the number of jointly dependent variables present in the equation, β is a ((G_Δ − 1) × 1)
vector of associated structural coefficients, X_1 is an (n × K_*) matrix of n observations on each of the K_*
predetermined variables and ε is an (n × 1) vector of structural disturbances.

This equation is one of the equations of the complete simultaneous equations model.

Stacking all the observations according to the variables, rather than time, the complete model consisting of G
equations describing the structural equation model can be written as

Y B + X Γ = Φ

where Y is an (n × G) matrix of observations on the jointly dependent variables, B is a (G × G) matrix of
associated structural coefficients, X is an (n × K) matrix of observations on the predetermined variables, Γ is a
(K × G) matrix of associated structural coefficients and Φ is an (n × G) matrix of structural disturbances.

Assume E(Φ) = 0 and (1/n) E(Φ'Φ) = Σ, where Σ is a positive definite symmetric matrix.
n
The reduced form of the model is obtained from the structural equation model by post-multiplying by B^{-1} as

Y B B^{-1} + X Γ B^{-1} = Φ B^{-1}
Y = X Π + V

where Π = −Γ B^{-1} and V = Φ B^{-1} are the matrices of reduced form coefficients and reduced form disturbances
respectively.

The structural equation is expressed as

y = Y_1 β + X_1 γ + ε_1
  = (Y_1   X_1) (β', γ')' + ε_1
  = A δ + ε

where A = (Y_1, X_1), δ = (β', γ')' and ε = ε_1. This model looks like a multiple linear regression model.

Since X_1 is a submatrix of X, it can be expressed as

X_1 = X J_1

where J_1 is called a select matrix and consists of two types of elements, viz., 0 and 1: if the
corresponding variable is present, its value is 1 and if absent, its value is 0.

Method of consistent estimation of parameters
1. Indirect least squares (ILS) method
This method for the consistent estimation of parameters is available only for exactly identified equations.

If the equations are exactly identifiable, then

K = G_Δ + K_* − 1.

Step 1: Apply ordinary least squares to each of the reduced form equations and estimate the reduced form
coefficient matrix.

Step 2: Find algebraic relations between structural and reduced-form coefficients. Then find the structural
coefficients.

The structural model at time t is

B y_t + Γ x_t = ε_t,   t = 1, 2, ..., n

where y_t = (y_1t, y_2t, ..., y_Gt)' and x_t = (x_1t, x_2t, ..., x_Kt)'.

Stacking all n such equations, the structural model is obtained as

B Y' + Γ X' = Φ'

where Y is an (n × G) matrix, X is an (n × K) matrix and Φ is an (n × G) matrix of disturbances.

The reduced form equation is obtained by premultiplication by B^{-1} as

B^{-1} B Y' + B^{-1} Γ X' = B^{-1} Φ'
or  Y' = Π' X' + V',  i.e.,  Y = X Π + V,

where Π' = −B^{-1} Γ and V' = B^{-1} Φ'.

Applying OLS to the reduced form equation yields the OLSE of Π as

Π̂ = (X'X)^{-1} X'Y.

This is the first step of ILS procedure and yields the set of estimated reduced form coefficients.
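
As a minimal sketch (array names assumed: X holds the n observations on the K predetermined variables and Y those on the G jointly dependent variables), Step 1 is nothing more than equation-by-equation OLS on the reduced form:

import numpy as np

def reduced_form_ols(X, Y):
    # Pi_hat = (X'X)^{-1} X'Y, computed column by column via least squares
    Pi_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Pi_hat            # shape (K, G)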

Suppose we are interested in the estimation of the following structural equation

y = Y_1 β + X_1 γ + ε

where y is an (n × 1) vector of n observations on the dependent (endogenous) variable, Y_1 is an (n × (G_Δ − 1))
matrix of observations on the (G_Δ − 1) current endogenous variables, X_1 is an (n × K_*) matrix of observations on
the K_* predetermined (exogenous) variables in the equation and ε is an (n × 1) vector of structural disturbances.
Write this model as

(y   Y_1   X_1) (1, −β', −γ')' = ε

or, more generally,

(y   Y_1   Y_2   X_1   X_2) (1, −β', 0', −γ', 0')' = ε

where Y_2 and X_2 are the matrices of observations on the (G − G_Δ) endogenous and (K − K_*) predetermined
variables which are excluded from the equation due to the zero-one type restrictions.

The reduced form coefficients are connected to the structural coefficients of this equation through

Π (1, −β', 0')' = (γ', 0')'.

Substitute Π by Π̂ = (X'X)^{-1} X'Y and solve for β and γ. This gives the indirect least squares estimators b
and c of β and γ respectively by solving

Π̂ (1, −b', 0')' = (c', 0')'

or  (X'X)^{-1} X'Y (1, −b', 0')' = (c', 0')'

or  (X'X)^{-1} X' (y   Y_1   Y_2) (1, −b', 0')' = (c', 0')'

or  X'y − X'Y_1 b = X'X (c', 0')'.

Since X = (X_1   X_2), this gives

(X_1'Y_1) b + (X_1'X_1) c = X_1'y        (i)
(X_2'Y_1) b + (X_2'X_1) c = X_2'y.       (ii)

These equations (i) and (ii) constitute K equations in the (G_Δ + K_* − 1) unknowns (b, c). Solving equations (i) and (ii)
gives the ILS estimators of β and γ.
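
A hedged sketch of this step (variable names assumed; it presumes the exactly identified case so that the stacked system is square): equations (i) and (ii) together give K linear equations that can be solved directly for (b, c).

import numpy as np

def ils_estimator(y1, Y1, X1, X2):
    X = np.hstack([X1, X2])                     # all K predetermined variables
    lhs = np.hstack([X.T @ Y1, X.T @ X1])       # coefficients of (b, c) in (i)-(ii)
    rhs = X.T @ y1
    sol = np.linalg.solve(lhs, rhs)             # K equations in K unknowns
    return sol[:Y1.shape[1]], sol[Y1.shape[1]:] # (b, c)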

2. Two stage least squares (2SLS) or generalized classical linear (GCL) method:
This is a more widely used estimation procedure as it is applicable to exactly identified as well as
over-identified equations. The least squares method is applied in two stages in this method.

Consider the equation y_1 = Y_1 β + X_1 γ + ε.

Stage 1: Apply the least squares method to estimate the reduced form parameters in the reduced form model

Y_1 = X Π_1 + V_1
⟹  Π̂_1 = (X'X)^{-1} X'Y_1
⟹  Ŷ_1 = X Π̂_1.

Stage 2: Replace Y_1 in the structural equation y_1 = Y_1 β + X_1 γ + ε by Ŷ_1 and apply OLS to the structural
equation thus obtained:

y_1 = Ŷ_1 β + X_1 γ + ε
    = X (X'X)^{-1} X'Y_1 β + X_1 γ + ε
    = (Ŷ_1   X_1) (β', γ')' + ε
    = Â δ + ε

where Â = (X(X'X)^{-1}X'Y_1   X_1) = (Ŷ_1   X_1) = HA, A = (Y_1   X_1), H = X(X'X)^{-1}X' is idempotent and
δ = (β', γ')'.

Applying OLS to y_1 = Â δ + ε gives the OLSE of δ as

δ̂ = (Â'Â)^{-1} Â'y_1
  = (A'HA)^{-1} A'H y_1

or
( β̂ )  =  [ Y_1'HY_1   Y_1'X_1 ]^{-1} [ Y_1'H y_1 ]
( γ̂ )     [ X_1'Y_1    X_1'X_1 ]      [ X_1' y_1  ]

        =  [ Y_1'Y_1 − V̂_1'V̂_1   Y_1'X_1 ]^{-1} [ Y_1' − V̂_1' ] y_1
           [ X_1'Y_1              X_1'X_1 ]      [ X_1'        ]

where V_1 = Y_1 − X Π_1 is estimated by V̂_1 = Y_1 − X Π̂_1 = (I − H) Y_1 = H̄ Y_1 with H̄ = I − H. Solving these two
equations, we get β̂ and γ̂, which are the two stage least squares estimators of β and γ respectively.
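
A compact sketch of the two stages (array names assumed, not from the notes): stage 1 projects Y_1 on all predetermined variables X, and stage 2 regresses y_1 on (Ŷ_1, X_1).

import numpy as np

def two_sls(y1, Y1, X1, X):
    H = X @ np.linalg.solve(X.T @ X, X.T)                  # H = X(X'X)^{-1}X'
    Y1_hat = H @ Y1                                        # stage 1: fitted endogenous regressors
    A_hat = np.hstack([Y1_hat, X1])                        # A_hat = (Y1_hat, X1)
    delta = np.linalg.solve(A_hat.T @ A_hat, A_hat.T @ y1) # stage 2: OLS of y1 on A_hat
    return delta[:Y1.shape[1]], delta[Y1.shape[1]:]        # (beta_hat, gamma_hat)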

Now we see the consistency of δ̂. We have

δ̂ = (Â'Â)^{-1} Â'y_1
  = (Â'Â)^{-1} Â'(A δ + ε)
  = δ + (Â'Â)^{-1} Â'ε        (since Â'A = A'HA = Â'Â)

so that
δ̂ − δ = (Â'Â)^{-1} Â'ε

and
plim(δ̂ − δ) = [plim (1/n) Â'A]^{-1} [plim (1/n) Â'ε].
The 2SLS estimator δ̂ is consistent if

plim (1/n) Â'ε = [ plim (1/n) Ŷ_1'ε ]  =  0.
                 [ plim (1/n) X_1'ε ]

Since by assumption the X variables are uncorrelated with ε in the limit,

plim (1/n) X_1'ε = 0.

For plim (1/n) Ŷ_1'ε, we observe that

plim (1/n) Ŷ_1'ε = plim (1/n) Π̂_1' X'ε
                = (plim Π̂_1') (plim (1/n) X'ε)
                = Π_1' · 0
                = 0.

Thus plim(δ̂ − δ) = 0 and so the 2SLS estimators are consistent.

The asymptotic covariance matrix of δ̂ is

Asy Var(δ̂) = n^{-1} plim[ n (δ̂ − δ)(δ̂ − δ)' ]
           = n^{-1} plim[ n (Â'Â)^{-1} Â'ε ε'Â (Â'Â)^{-1} ]
           = n^{-1} [plim (1/n) Â'Â]^{-1} [plim (1/n) Â'ε ε'Â] [plim (1/n) Â'Â]^{-1}
           = n^{-1} σ² [plim (1/n) Â'Â]^{-1}

where Var(ε) = σ².

The asymptotic covariance matrix is estimated by

s² (Â'Â)^{-1} = s² [ Y_1'X(X'X)^{-1}X'Y_1   Y_1'X_1 ]^{-1}
                   [ X_1'Y_1               X_1'X_1 ]

where

s² = (y_1 − Y_1 β̂ − X_1 γ̂)' (y_1 − Y_1 β̂ − X_1 γ̂) / (n − G_Δ − K_*).
Chapter 18
Seemingly Unrelated Regression Equations Models

A basic nature of the multiple regression model is that it describes the behaviour of a particular study
variable based on a set of explanatory variables. When the objective is to explain the whole system, there
may be more than one multiple regression equations. For example, in a set of individual linear multiple
regression equations, each equation may explain some economic phenomenon. One approach to handle such
a set of equations is to consider the set up of simultaneous equations model is which one or more of the
explanatory variables in one or more equations are itself the dependent (endogenous) variable associated
with another equation in the full system. On the other hand, suppose that none of the variables is the system
are simultaneously both explanatory and dependent in nature. There may still be interactions between the
individual equations if the random error components associated with at least some of the different equations
are correlated with each other. This means that the equations may be linked statistically, even though not
structurally – through the jointness of the distribution of the error terms and through the non-diagonal
covariance matrix. Such behaviour is reflected in the seemingly unrelated regression equations (SURE)
model in which the individual equations are in fact related to one another, even though superficially they
may not seem to be.

The basic philosophy of the SURE model is as follows. The jointness of the equations is explained by the
structure of the SURE model and the covariance matrix of the associated disturbances. Such jointness
introduces additional information which is over and above the information available when the individual
equations are considered separately. So it is desired to consider all the separate relationships collectively to
draw the statistical inferences about the model parameters.

Example:
Suppose a country has 20 states and the objective is to study the consumption pattern of the country. There is
one consumption equation for each state. So all together there are 20 equations which describe 20
consumption functions. It may also not necessary that the same variables are present in all the models.
Different equations may contain different variables. It may be noted that the consumption pattern of the
neighbouring states may have characteristics in common. Apparently, the equations may look distinct
individually but there may be some kind of relationship that may be existing among the equations. Such
equations can be used to examine the jointness of the distribution of disturbances. It seems reasonable to

assume that the error terms associated with the equations may be contemporaneously correlated. The
equations are apparently or “seemingly” unrelated regressions rather than independent relationships.

Model:
We consider here a model comprising M multiple regression equations of the form

y_ti = Σ_{j=1}^{k_i} β_ij x_tij + ε_ti,   t = 1, 2, ..., T;  i = 1, 2, ..., M;  j = 1, 2, ..., k_i

where y_ti is the t-th observation on the i-th dependent variable which is to be explained by the i-th regression
equation, x_tij is the t-th observation on the j-th explanatory variable appearing in the i-th equation, β_ij is the
coefficient associated with x_tij at each observation and ε_ti is the t-th value of the random error component
associated with the i-th equation of the model.

These M equations can be compactly expressed as

y_i = X_i β_i + ε_i,   i = 1, 2, ..., M

where y_i is a (T × 1) vector with elements y_ti; X_i is a (T × k_i) matrix whose columns represent the T
observations on the explanatory variables in the i-th equation; β_i is a (k_i × 1) vector with elements β_ij; and ε_i
is a (T × 1) vector of disturbances. These M equations can be further expressed as

[ y_1 ]   [ X_1   0    ...   0  ] [ β_1 ]   [ ε_1 ]
[ y_2 ] = [  0   X_2   ...   0  ] [ β_2 ] + [ ε_2 ]
[  ⋮  ]   [  ⋮    ⋮           ⋮ ] [  ⋮  ]   [  ⋮  ]
[ y_M ]   [  0    0    ...  X_M ] [ β_M ]   [ ε_M ]

or  y = X β + ε

where the orders of y, X, β and ε are (TM × 1), (TM × k*), (k* × 1) and (TM × 1) respectively, with k* = Σ_i k_i.

Treat each of the M equations as a classical regression model and make the conventional assumptions for
i = 1, 2, ..., M:

• X_i is fixed.
• rank(X_i) = k_i.
• lim_{T→∞} (1/T)(X_i'X_i) = Q_ii, where Q_ii is nonsingular with fixed and finite elements.
• E(ε_i) = 0.
• E(ε_i ε_i') = σ_ii I_T, where σ_ii is the variance of the disturbances in the i-th equation for each observation in
  the sample.

Considering the interactions between the M equations of the model, we assume

• lim_{T→∞} (1/T) X_i'X_j = Q_ij
• E(ε_i ε_j') = σ_ij I_T;  i, j = 1, 2, ..., M

where Q_ij is a nonsingular matrix with fixed and finite elements and σ_ij is the covariance between the
disturbances of the i-th and j-th equations for each observation in the sample.

Compactly, we can write

E(ε) = 0

E(εε') = [ σ_11 I_T   σ_12 I_T   ...   σ_1M I_T ]
         [ σ_21 I_T   σ_22 I_T   ...   σ_2M I_T ]  =  Σ ⊗ I_T  ≡  Ψ
         [    ⋮           ⋮                ⋮    ]
         [ σ_M1 I_T   σ_M2 I_T   ...   σ_MM I_T ]

where ⊗ denotes the Kronecker product operator, Ψ is an (MT × MT) matrix and Σ = (σ_ij) is an (M × M)
positive definite symmetric matrix. The positive definiteness of Σ avoids the possibility of linear dependencies
among the contemporaneous disturbances in the M equations of the model.

The structure E(εε') = Σ ⊗ I_T implies that

• the variance of ε_ti is constant for all t,
• the contemporaneous covariance between ε_ti and ε_tj is constant for all t,
• the intertemporal covariances between ε_ti and ε_t*j (t ≠ t*) are zero for all i and j.

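As a tiny sketch (the numerical values are made up for illustration), numpy's Kronecker product reproduces the block structure of Ψ = Σ ⊗ I_T displayed above, here with M = 2 equations and T = 3 observations:

import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])          # contemporaneous (M x M) covariance matrix
Psi = np.kron(Sigma, np.eye(3))         # (MT x MT) covariance of the stacked disturbance vector
print(Psi)
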
By using the terminologies “contemporaneous” and “intertemporal” covariance, we are implicitly assuming
that the data are available in time series form but this is not restrictive. The results can be used for cross-
section data also. The constancy of the contemporaneous covariances across sample points is a natural
generalization of homoskedastic disturbances in a single equation model.

It is clear that the M equations may appear to be unrelated in the sense that there is no simultaneity
between the variables in the system and each equation has its own explanatory variables to explain its study
variable. The equations are related stochastically through the disturbances, which are contemporaneously
correlated across the equations of the model. That is why this system is referred to as the SURE model.

The SURE model is a particular case of the simultaneous equations model involving M structural equations
with M jointly dependent variables and k (≥ k_i for all i) distinct exogenous variables, in which neither
current nor lagged endogenous variables appear as explanatory variables in any of the structural equations.
current nor logged endogenous variables appear as explanatory variables in any of the structural equations.

The SURE model differs from the multivariate regression model only in the sense that it takes account of
prior information concerning the absence of certain explanatory variables from certain equations of the
model. Such exclusions are highly realistic in many economic situations.

OLS and GLS estimation:


The SURE model is

y = X β + ε,   E(ε) = 0,   V(ε) = Σ ⊗ I_T = Ψ.

Assume that Σ is known.

The OLS estimator of β is

b_0 = (X'X)^{-1} X'y.

Further,
E(b_0) = β
V(b_0) = E(b_0 − β)(b_0 − β)'
       = (X'X)^{-1} X'ΨX (X'X)^{-1}.

The generalized least squares (GLS) estimator of β is

β̂ = (X'Ψ^{-1}X)^{-1} X'Ψ^{-1} y
  = [X'(Σ^{-1} ⊗ I_T) X]^{-1} X'(Σ^{-1} ⊗ I_T) y

with
E(β̂) = β
V(β̂) = E(β̂ − β)(β̂ − β)'
      = (X'Ψ^{-1}X)^{-1}
      = [X'(Σ^{-1} ⊗ I_T) X]^{-1}.

Define

G = (X'X)^{-1} X' − (X'Ψ^{-1}X)^{-1} X'Ψ^{-1};

then GX = 0 and we find that

V(b_0) − V(β̂) = G Ψ G'.

Since Ψ is positive definite, G Ψ G' is at least positive semidefinite and so the GLSE is, in general, more
efficient than the OLSE for estimating β. In fact, using the result that the GLSE is the best linear unbiased
estimator of β, we can conclude that β̂ is the best linear unbiased estimator in this case also.

Feasible generalized least squares estimation:


When Σ is unknown, the GLSE of β cannot be used. Then Σ can be estimated and replaced by an (M × M)
matrix S. With such a replacement, we obtain a feasible generalized least squares (FGLS) estimator of β as

β̂_F = [X'(S^{-1} ⊗ I_T) X]^{-1} X'(S^{-1} ⊗ I_T) y.

Assume that S = (s_ij) is a nonsingular matrix, where s_ij is some estimator of σ_ij.

Estimation of Σ
There are two possible ways to estimate the σ_ij's.

1. Use of unrestricted residuals
Let K be the total number of distinct explanatory variables out of the k_1, k_2, ..., k_M variables in the full model

y = X β + ε,   E(ε) = 0,   V(ε) = Σ ⊗ I_T

and let Z be the (T × K) observation matrix of these variables.

Regress each of the M study variables on the columns of Z and obtain the (T × 1) residual vectors

ε̂_i = y_i − Z (Z'Z)^{-1} Z'y_i,   i = 1, 2, ..., M
    = H_Z y_i

where H_Z = I_T − Z (Z'Z)^{-1} Z'.

Then obtain

s_ij = (1/T) ε̂_i' ε̂_j
     = (1/T) y_i' H_Z y_j

and construct the matrix S = (s_ij) accordingly.

Since X_i is a submatrix of Z, we can write

X_i = Z J_i

where J_i is a (K × k_i) selection matrix. Then

H_Z X_i = X_i − Z (Z'Z)^{-1} Z'X_i
        = X_i − Z J_i
        = 0

and thus

y_i' H_Z y_j = (β_i' X_i' + ε_i') H_Z (X_j β_j + ε_j)
             = ε_i' H_Z ε_j.

Hence

E(s_ij) = (1/T) E(ε_i' H_Z ε_j)
        = (1/T) σ_ij tr(H_Z)
        = (1 − K/T) σ_ij

so that
E[ (T/(T − K)) s_ij ] = σ_ij.

Thus an unbiased estimator of σ_ij is given by (T/(T − K)) s_ij.
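
A short sketch of this estimator (argument names assumed: ys is a list of the M dependent-variable vectors and Z the T × K matrix of all distinct regressors):

import numpy as np

def sigma_hat_unrestricted(ys, Z, unbiased=True):
    T, K = Z.shape
    H_Z = np.eye(T) - Z @ np.linalg.solve(Z.T @ Z, Z.T)    # H_Z = I - Z(Z'Z)^{-1}Z'
    E = np.column_stack([H_Z @ y for y in ys])             # T x M matrix of residuals
    return E.T @ E / ((T - K) if unbiased else T)          # s_ij = e_i'e_j / (T - K) or / T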

2. Use of restricted residuals


In this approach to finding an estimator of σ_ij, the residuals obtained by taking into account the restrictions on
the coefficients, which distinguish the SURE model from the multivariate regression model, are used as
follows.

Regress y_i on X_i, i.e., estimate each equation, i = 1, 2, ..., M, by OLS and obtain the residual vector

u_i = [I − X_i (X_i'X_i)^{-1} X_i'] y_i
    = H_{X_i} y_i.
A consistent estimator of σ_ij is obtained as

s*_ij = (1/T) u_i' u_j
      = (1/T) y_i' H_{X_i} H_{X_j} y_j

where

H_{X_i} = I − X_i (X_i'X_i)^{-1} X_i'
H_{X_j} = I − X_j (X_j'X_j)^{-1} X_j'.

Using the s*_ij, a consistent estimator S of Σ can be constructed.

If T in s*_ij is replaced by

tr(H_{X_i} H_{X_j}) = T − k_i − k_j + tr[ (X_i'X_i)^{-1} X_i'X_j (X_j'X_j)^{-1} X_j'X_i ],

then s*_ij is an unbiased estimator of σ_ij.
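
Putting the pieces together, a hedged sketch of FGLS based on the restricted residuals (argument names assumed; the simple divisor T is used, giving the consistent rather than the unbiased version of s*_ij):

import numpy as np

def fgls_sure(ys, Xs):
    # ys, Xs: lists of the M dependent vectors and the M design matrices
    T, M = ys[0].shape[0], len(ys)
    resid = [y - X @ np.linalg.solve(X.T @ X, X.T @ y) for y, X in zip(ys, Xs)]
    S = np.array([[resid[i] @ resid[j] / T for j in range(M)] for i in range(M)])  # s*_ij
    y = np.concatenate(ys)
    X = np.zeros((M * T, sum(Xi.shape[1] for Xi in Xs)))
    col = 0
    for i, Xi in enumerate(Xs):                            # block-diagonal stacked regressor matrix
        X[i * T:(i + 1) * T, col:col + Xi.shape[1]] = Xi
        col += Xi.shape[1]
    W = np.kron(np.linalg.inv(S), np.eye(T))               # S^{-1} (Kronecker product) I_T
    beta_F = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)     # FGLS estimator of beta
    return S, beta_F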

