CONTENTS
1. INTRODUCTION
3. ECONOMIC DATA
8. SUMMARY
1. INTRODUCTION
The term "regression" has its historical origin in the work of Francis Galton. While studying the heights of parents and children, Galton found the following result: tall parents usually have tall children and short parents usually have short children, but the average or mean height of children of both tall and short parents tends to move towards the average or mean height of the population as a whole. In other words, the height of children of both tall and short parents tends to move towards the average height of the population. Karl Pearson confirmed Galton's findings by analysing a thousand records of the heights of parents and children.
Regression analysis is a statistical method that is very useful for studying the relationship between two (or more) variables. For example, in economics we have Keynes's fundamental psychological law of consumption. It states that when income increases there is an associated increase in consumption; however, the marginal propensity to consume lies between 0 and 1. There are many other such examples of relationships between variables in the social sciences. In such cases, regression analysis can be employed to build a model to predict the value of one variable (the dependent variable) on the basis of other given variables (the independent variables). We briefly explain the notation and terminology commonly used in regression analysis.
Notation:
We denote the dependent variable by Y and the independent (explanatory) variable by X.
Terminology:
The dependent variable is also known by various names: explained variable, predictand, regressand, response, endogenous variable, outcome, or controlled variable.
The independent variables are likewise known by various names: predictor, regressor, stimulus, exogenous variable, covariate, or control variable.
Though the choice of terminology is a personal one, we will simply use the terms dependent and independent/explanatory variables in this chapter.
In regression analysis we look for statistical relationships (and not deterministic relationships) between variables. In a statistical relationship we deal with stochastic or random variables that possess probability distributions. For instance, crop yield depends on temperature, rainfall and fertilizer, but agronomists cannot predict crop yield exactly because there are errors involved in measuring these explanatory variables and there are other factors affecting crop yield. Therefore, in a statistical relationship the dependent variable and the explanatory variables do not have an exact relationship.
In a deterministic relationship, the relationship between variables can be determined exactly. One example is Newton's law of gravity: the force of attraction between any two particles is directly proportional to the product of their masses and inversely proportional to the square of the distance between them. Other examples include Ohm's law and Boyle's gas law.
A statistical relationship between two (or more) variables does not in any manner imply causation. For instance, we know that crop yield depends on rainfall. Statistically speaking, there is no reason why we could not instead assume that rainfall depends on crop yield. However, simple common sense suggests that this is not the case, as we cannot change rainfall by increasing or decreasing crop yield. Therefore crop yield is treated as the dependent variable and rainfall as the explanatory variable.
To determine causality between variables we must take into account a priori or theoretical considerations. A statistical relationship in itself is not sufficient to imply causation between variables.
Both correlation and regression are closely related but conceptually different. In correlation analysis our main
objective is to measure the strength or degree of linear association between variables. For example we are
interested in measuring the correlation (coefficient) between smoking and lung cancer.
In regression analysis we are interested in predicting the average or mean value of one variable (the dependent variable) on the basis of the given values of other variables (the independent variables); for instance, we may want to predict the incidence of AIDS among drug users. Secondly, in correlation analysis both variables are treated as random, whereas in regression analysis the dependent variable is a random variable while the explanatory variables are treated as fixed and given.
3. ECONOMIC DATA
The success of any regression analysis will depend on the availability of high quality data necessary for
economic research. So we briefly look at the types of data and issue of data accuracy.
Types of Data
There are three types of data available for empirical analysis: time series, cross-section and pooled data (i.e.,
combination of time series and cross section).
Time series:
In a time series, we measure a set of observations at different points in time. We can have daily data (e.g., stock prices, weather reports), weekly data (e.g., money supply, weekly sales), monthly data (e.g., the consumer price index,
the unemployment rate), quarterly data (e.g., GDP, industrial production), annual data (e.g., government budgets), quinquennial data (e.g., the census of manufacturing) and decennial data (e.g., the census of population). The problem with time series econometrics is that we need to assume that the underlying time series is stationary. Simply speaking, a time series is said to be stationary if its mean and variance remain constant over time.
Cross-Section Data:
In cross-section data we measure one or more variables at the same point in time. Examples are the census of population taken every 10 years, consumer expenditure surveys, and so on. The problem with cross-sectional data relates to the issue of heterogeneity: when we include heterogeneous units in a statistical analysis, size and scale effects need to be taken into account.
Pooled Data
In pooled data we combine both time series and cross-sectional data. For instance, we can have data on the price and output of different varieties of cloth over a period of, say, 10 years.
If the quality of the data is poor, the results may well be unsatisfactory for the economic researcher.
Regression Model
(1) Simple regression model: this is further divided into the simple linear regression model and the simple non-linear regression model.
(2) Multiple regression model: this is further divided into the multiple linear regression model and the multiple non-linear regression model.
In a simple regression model the number of independent variables is one, while in a multiple regression model it is more than one. The term linearity can refer to linearity in the parameters or linearity in the variables. In regression analysis, a linear regression model means a model that is linear in the parameters, and a non-linear regression model means one that is not linear in the parameters. By linearity in the parameters we mean that each parameter, say β₁, enters with power 1 and is not multiplied or divided by any other parameter (terms such as β₁β₂ or β₂/β₁ are excluded).
Examples:
In regression analysis the main aim is to estimate and/or predict the (population) average or mean value of the dependent variable based on the known explanatory variable(s). We consider in Table 1 an illustrative example of the weekly income and weekly consumption expenditure of a hypothetical community. The total population comprises 36 families. These families are divided into 6 income groups, from Rs 1000 to Rs 3500. Thus we have 6 fixed values of X (income) and the corresponding Y values (consumption expenditure) shown in the table, i.e. 6 subpopulations. The bottom row shows the conditional expected values of Y for the 6 subpopulations.
Corresponding to Table 1, we can also find the conditional probabilities for the population of 36 observations. This can be seen in Table 2.
Table 2: Conditional probabilities p(Y | Xᵢ)

X:          1000   1500   2000   2500   3000   3500
p(Y | Xᵢ):  1/5    1/6    1/7    1/5    1/6    1/7
            1/5    1/6    1/7    1/5    1/6    1/7
            1/5    1/6    1/7    1/5    1/6    1/7
            1/5    1/6    1/7    1/5    1/6    1/7
            1/5    1/6    1/7    1/5    1/6    1/7
            -      1/6    1/7    -      1/6    1/7
            -      -      1/7    -      -      1/7
We plot the graph of the relationship between expenditure on consumption and income as shown in fig 1. We can
clearly see that there is enough variation in expenditure on consumption in each income group. However the
average expenditure on consumption rises as income rises. This phenomenon can be seen in table 1 itself. For
instance, when the income level is Rs 1000, the corresponding average expenditure on consumption is Rs 800.
When the income is increased to Rs 1500 the average expenditure on consumption rises to Rs 1100. Thus we
clearly see a positive relationship between income and expenditure on consumption. These mean/average values
of dependent variable (here consumption expenditure) are known as conditional expected values as they are
conditioned or depend on the given explanatory variable (here income group).
Apart from conditional expected values, we can also find the unconditional expected value of expenditure on consumption. We add the consumption expenditure of all the families across income groups and divide it by the total number of families, which gives Rs 1583.33. It is unconditional because it includes the consumption expenditure of all income groups. The conditional expected values are different from the unconditional expected value. The
unconditional expected value gives the average expenditure on consumption of all income groups while the
conditional expected value gives the average expenditure on consumption of a particular income group. For
instance the conditional expected value of expenditure on consumption for people with income group of Rs 1000
is Rs 800 while for another income group Rs 1500 it is Rs 1100. The concept of conditional expected value may
help us to predict the average or mean value of dependent variable at different values of independent variable
which is in fact the essence of regression analysis.
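The distinction between conditional and unconditional expected values can be illustrated with a few lines of code. The sketch below (Python with NumPy; the income-consumption pairs are made-up illustrative values, not the 36 families of Table 1) computes E(Y | X) for each income level and the unconditional mean E(Y).

```python
import numpy as np

# Hypothetical (income, consumption) pairs for illustration only;
# these are NOT the 36 families of Table 1.
income      = np.array([1000, 1000, 1000, 1500, 1500, 1500, 2000, 2000, 2000])
consumption = np.array([ 700,  800,  900, 1000, 1100, 1200, 1300, 1400, 1500])

# Conditional expected value E(Y | X = x): mean consumption within each income group
for x in np.unique(income):
    print(x, consumption[income == x].mean())

# Unconditional expected value E(Y): mean consumption over all families
print("unconditional mean:", consumption.mean())
```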
The next objective is to obtain the population regression line. The population regression line is obtained by joining the conditional mean values of Y for the various levels of X. Geometrically, the population regression line is the locus of the conditional means of the dependent variable for the given values of the explanatory variable. The regression curve thus passes through these (conditional) mean values. We assume for simplicity's sake that the Y values are symmetrically distributed around their respective (conditional) mean values. This can be seen in Figure 1 below:
Continuing with the above illustrative example, we have already seen that each conditional mean of consumption expenditure depends on the particular income level. Therefore E(Y | Xᵢ) is a function of Xᵢ, where Xᵢ is a given value of X. Mathematically,

E(Y | Xᵢ) = f(Xᵢ)    ... (1)

Equation (1) is known as the Conditional Expectation Function (CEF) or the Population Regression Function (PRF). It tells us how the expected value of the distribution of Y is related functionally to Xᵢ.
The next question is what functional form f(Xᵢ) should take. The functional form of the Population Regression Function is both an empirical and a theoretical question. For instance, economists assume that consumption expenditure and income are linearly related. For simplicity's sake, and as an initial working hypothesis, we assume that the Population Regression Function E(Y | Xᵢ) is a linear function of Xᵢ:

E(Y | Xᵢ) = β₀ + β₁Xᵢ    ... (2)

where β₀ and β₁ are known as the regression coefficients. The regression coefficients are fixed but unknown; β₀ is the intercept or constant term and β₁ is the slope coefficient.
Equation (2) is known as the linear population regression function. Our main objectives are to find (i) estimates of the true values of β₀ and β₁, and (ii) the standard errors of those estimates.
Coming back to our illustrative example, we find that the average expenditure on consumption increases as income increases. However, for a particular family this need not necessarily be true. For instance, there is a family with an income of Rs 1500 whose consumption expenditure is Rs 850, which is below the expenditure of one family with an income of Rs 1000. Therefore there are families whose consumption expenditure deviates from the average expenditure.
The deviation of an individual Yᵢ from its expected (mean) value E(Y | Xᵢ) can be expressed as follows:

uᵢ = Yᵢ − E(Y | Xᵢ)
Yᵢ = E(Y | Xᵢ) + uᵢ
Yᵢ = β₀ + β₁Xᵢ + uᵢ    ... (3)

The deviation of Yᵢ from its expected value is denoted by uᵢ. uᵢ is an unobservable random variable that can take either positive or negative values. It is popularly known as the stochastic disturbance term.
The consumption expenditure of an individual family for a given income level can thus be expressed as the sum of two components: (i) the systematic or deterministic component E(Y | Xᵢ) = β₀ + β₁Xᵢ, and (ii) the random or stochastic component uᵢ.
To understand the issues more clearly, we write the hypothetical example of Table 1 in the form of equation (3). The individual consumption expenditures for the first income level X = X₁ can be written as:

Y₁ = β₀ + β₁X₁ + u₁
Y₂ = β₀ + β₁X₁ + u₂
Y₃ = β₀ + β₁X₁ + u₃
Y₄ = β₀ + β₁X₁ + u₄
Y₅ = β₀ + β₁X₁ + u₅

Taking the conditional expectation of equation (3),

E(Yᵢ | Xᵢ) = E(Y | Xᵢ) + E(uᵢ | Xᵢ)
E(uᵢ | Xᵢ) = 0

Thus the conditional mean value of uᵢ is zero. This is because we assume that the regression line passes through the conditional means of Y.
The stochastic specification in equation (3) clearly shows that there are other variable(s) apart from income that affect consumption expenditure; income alone cannot explain an individual family's consumption expenditure. The disturbance term uᵢ captures all omitted variables that collectively affect Y but are not included in the model. The question is why it is not possible to introduce all the variables that affect the dependent variable explicitly into the model. There are numerous reasons for this:
1. There is always some element of intrinsic randomness in human responses. This arises from the unpredictability of human choices, errors in making decisions, and so on.
2. The effect of a large number of omitted variables is contained in uᵢ. Due to incompleteness of theory or data unavailability, many explanatory variables are excluded from the model.
3. There may be errors in measuring the dependent variable.
4. The functional form of the relationship is not known. In reality it is very difficult to know the exact functional form of the relationship between the dependent variable and the independent variables.
Our objective is to estimate the Population Regression Function (PRF). In reality we cannot observe the population relationship between the dependent variable Y and the explanatory variable X, so we use sample information to estimate the population values. Consider two random samples drawn from Table 1.
The Sample Regression Function (SRF) is the sample counterpart of the PRF and can be expressed as:

Ŷᵢ = b₀ + b₁Xᵢ

where Ŷᵢ is an estimator of E(Y | Xᵢ), b₀ is an estimator of β₀, and b₁ is an estimator of β₁.
We draw sample regression curve/line based on table 3. This can be seen from figure 2 which is given below:
Recall that the stochastic Population Regression Function can be written as:

Yᵢ = β₀ + β₁Xᵢ + uᵢ
The main objective of regression analysis is to estimate the Population regression Function with the help of the
Sample Regression Function. The sample estimates will be used as an approximate of the population parameters.
These sample estimates will also vary from one sample to another. So our task is to estimate the SRF in a way that makes this approximation as close as possible.
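The idea that SRF estimates vary from sample to sample while the PRF stays fixed can be seen in a small simulation. The sketch below (Python with NumPy) assumes a hypothetical PRF with β₀ = 300 and β₁ = 0.5 and fits the SRF to three different random samples drawn from it; the estimates b₀ and b₁ differ across samples but stay close to the population values.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 300.0, 0.5                 # hypothetical population parameters (PRF)
X = np.repeat([1000, 1500, 2000, 2500, 3000, 3500], 6)   # fixed X values

for s in range(3):                        # three independent random samples
    u = rng.normal(0, 100, size=X.size)   # stochastic disturbance term
    Y = beta0 + beta1 * X + u             # stochastic PRF generates the sample
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
    b0 = Y.mean() - b1 * X.mean()
    print(f"sample {s}: b0 = {b0:6.1f}, b1 = {b1:.3f}")   # SRF estimates vary by sample
```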
8. SUMMARY
The main idea behind any regression analysis is to study the statistical dependence of a dependent variable on one or more explanatory variables.
The objective of any regression analysis is to estimate and/or predict the mean value of the dependent variable on the basis of the known values of the explanatory variables.
The success of any regression analysis depends upon the availability of high-quality data.
Regression models can be of two types: simple regression models and multiple regression models. In both cases, we can have linear and non-linear regression models.
We study population regression functions that are linear in the parameters only. They may be non-linear in the explanatory variables.
The population regression function (PRF), or conditional expectation function (CEF), is the key concept behind regression analysis. We study how the average value of the dependent variable changes with the given values of the explanatory variables.
We study the stochastic PRF because it is useful for empirical analysis. The stochastic disturbance term uᵢ plays an important role in estimating the PRF.
The stochastic disturbance term uᵢ captures all the factors that influence the dependent variable but are not explicitly incorporated in the model.
In reality one rarely has access to the entire population of interest, so we use the stochastic sample regression function to estimate the PRF.
TABLE OF CONTENTS
1. POPULATIONS AND SAMPLE REGRESSION FUNCTION
2. METHODS FOR ESTIMATING REGRESSION MODEL
3. THE LEAST SQUARES METHOD
3.1. NECESSARY REQUIREMENT FOR OLS ESTIMATES
3.2. INTERPRETATION OF THE COEFFICIENTS
4. VARIANCE AND STANDARD ERROR OF THE OLS ESTIMATES
4.1. THE VARIANCE AND STANDARD ERROR OF THE OLS ESTIMATES
4.2. VARIANCE ESTIMATES OF DISTURBANCE TERM
5. NUMERICAL PROPERTIES OF OLS
6. SUMMARY
The right-hand side of the Population Regression Function can be divided into two parts:
(1) The systematic component: β₀ + β₁Xᵢ = E(Y | Xᵢ)
(2) The disturbance term: uᵢ
We need to estimate the population parameters β₀ and β₁ from a given sample, since it is not possible to observe the whole population. The sample counterpart of the PRF is known as the Sample Regression Function (SRF), which can be expressed as follows:

Yᵢ = b₀ + b₁Xᵢ + eᵢ,  where i = 1, 2, ..., n
Yᵢ = Ŷᵢ + eᵢ
The sample counterparts of the various parameters and terms of the PRF are as follows:

Population      Sample
β₀              b₀
β₁              b₁
E(Y | Xᵢ)       Ŷᵢ
uᵢ              eᵢ

The sample estimates b₀ and b₁ can be calculated for a given sample, but these estimates will change from sample to sample. The population parameters β₀ and β₁, however, are fixed but remain unknown. The relationship between the sample and population regression lines can be seen in Figure 1 below:
According to this criterion, the values of b₀ and b₁ would be chosen in such a way that the sum of all the residuals is (near) zero. This can be achieved by minimizing the following function:

min |Σeᵢ|
Although this criterion is intuitively appealing, it has a serious problem. Residuals with positive signs can be compensated by residuals with negative signs, so there could be an infinite number of lines with the same sum of residuals Σeᵢ equal to zero, no matter what their slopes or intercepts are.

min Σ|eᵢ|

This approach is also known as the "minimum absolute distance" (MAD) estimator because it minimizes the distance between Yᵢ and Ŷᵢ. It avoids the possibility of positive errors being compensated by negative errors. In this approach all the deviations are given equal weight, and it is more resistant to the influence of outliers. However, it is not very popular because its calculation is complicated and involves linear programming or iterative methods.
According to this criterion, the values of b₀ and b₁ would be chosen in such a way that the sum of the squared deviations is minimized. This can be achieved by minimizing the following function:

min Σeᵢ²

This approach is known as the least squares estimation method. It avoids the problem of compensating residuals because the residuals are squared. It puts more weight on observations with large deviations and less weight on observations with small deviations. It is also easy to compute the least squares estimates, and they have some very useful properties under relatively general conditions.
We want to minimize the sum of squared residuals, i.e. the residual sum of squares (RSS): min Σeᵢ². This is shown graphically in Figure 2 below:

RSS = Σeᵢ²
    = Σ(Yᵢ − Ŷᵢ)²
    = Σ(Yᵢ − b₀ − b₁Xᵢ)²
    = Σ(Yᵢ² − 2b₀Yᵢ − 2b₁XᵢYᵢ + b₀² + 2b₀b₁Xᵢ + b₁²Xᵢ²)
The estimates of b₀ and b₁ are obtained by partially differentiating the above expression with respect to b₀ and b₁:

∂RSS/∂b₀ = Σ 2(Yᵢ − b₀ − b₁Xᵢ)(−1) = −2 Σ(Yᵢ − b₀ − b₁Xᵢ) = −2 Σeᵢ

∂RSS/∂b₁ = Σ 2(Yᵢ − b₀ − b₁Xᵢ)(−Xᵢ) = −2 Σ(Yᵢ − b₀ − b₁Xᵢ)Xᵢ = −2 ΣeᵢXᵢ

We set the above two derivatives equal to zero:

∂RSS/∂b₀ = −2 Σ(Yᵢ − b₀ − b₁Xᵢ) = 0
∂RSS/∂b₁ = −2 Σ(Yᵢ − b₀ − b₁Xᵢ)Xᵢ = 0
We then obtain

ΣYᵢ = n b₀ + b₁ ΣXᵢ
ΣXᵢYᵢ = b₀ ΣXᵢ + b₁ ΣXᵢ²

These two equations are known as the Ordinary Least Squares normal equations. They are two equations in two unknowns. Solving them gives b₀ and b₁:

b₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²

b₀ = Ȳ − b₁X̄
We can always compute the OLS estimates for a particular sample as long as Σ(Xᵢ − X̄)² > 0. In other words, the Xᵢ should not all be equal; there must be some variation in Xᵢ.
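The two formulas above translate directly into code. The sketch below (Python with NumPy; the function name ols_simple is ours, not a library routine) computes b₁ and b₀ from the normal-equation solutions and refuses to proceed when there is no variation in X.

```python
import numpy as np

def ols_simple(x, y):
    """Closed-form OLS estimates for the two-variable model y = b0 + b1*x + e."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sxx = np.sum((x - x.mean())**2)
    if sxx <= 0:
        raise ValueError("all X values are equal; the OLS slope is undefined")
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx   # slope
    b0 = y.mean() - b1 * x.mean()                        # intercept
    return b0, b1
```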
Interpretation of b₀: geometrically, b₀ represents the intercept, the point where the regression line cuts the Y-axis. Econometrically, it represents the average value of Y when X = 0. It may or may not have a substantive meaning depending on the problem.
We know that the variance of a random variable measures the dispersion of that variable around its mean. If the variance is small, the individual values are close to the mean. A random variable with a smaller variance will also have a narrower confidence interval for the parameter. Therefore the precision of an estimator is captured by its variance, and it is worth computing the variances of the ordinary least squares estimates.
It should be noted that the OLS estimates b₀ and b₁ depend on the dependent variable Yᵢ. The Yᵢ in turn depend on the disturbance terms u₁, u₂, ..., uₙ. Therefore the OLS estimates are random variables with associated distributions.
The variances and standard errors of the ordinary least squares estimates are given below:

Var(b₀) = σ² ΣXᵢ² / [n Σ(Xᵢ − X̄)²]

se(b₀) = σ √( ΣXᵢ² / [n Σ(Xᵢ − X̄)²] )

Var(b₁) = σ² / Σ(Xᵢ − X̄)²

se(b₁) = σ / √Σ(Xᵢ − X̄)²

The covariance between the OLS estimates b₀ and b₁ is

Cov(b₀, b₁) = −X̄ Var(b₁) = −X̄ σ² / Σ(Xᵢ − X̄)²
From the above formulae it is clear that when there is large variation in Xᵢ and the sample size is large, the variances and corresponding standard errors of the estimates are smaller. Smaller variances improve the precision of the ordinary least squares estimates.
The problem with the above expressions is that the population variances are unknown, precisely because σ² is unknown.
4.2 Variance estimates of disturbance term
The population variance σ² can be estimated from the sample. Consider the following equation:

Ŷᵢ = b₀ + b₁Xᵢ

The above equation is a straight line, so the residual can be obtained as eᵢ = Yᵢ − b₀ − b₁Xᵢ, where eᵢ is the estimated residual. An estimator of σ² is found by estimating the variance of the error term and correcting for the degrees of freedom lost in calculating b₀ and b₁. An unbiased estimator of σ² is therefore

σ̂² = Σeᵢ² / (n − 2)

where σ̂² is the OLS estimator of the true but unknown σ², and (n − 2) is the degrees of freedom. The standard error of the regression is found by taking the square root of σ̂². It is also known as the root mean square error or the standard error of the disturbance term:

σ̂ = √( Σeᵢ² / (n − 2) )

Since σ̂² is an estimate of the variance of the error term uᵢ, it is also an estimate of the variance of Yᵢ conditional on Xᵢ.
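Given any pair of estimates b₀ and b₁, the estimator σ̂² and the standard errors follow mechanically from the formulas above. A minimal sketch (Python with NumPy; ols_standard_errors is a hypothetical helper name of ours):

```python
import numpy as np

def ols_standard_errors(x, y, b0, b1):
    """sigma^2-hat = RSS/(n-2) and the standard errors of b0 and b1."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = x.size
    e = y - b0 - b1 * x                      # residuals
    sigma2_hat = np.sum(e**2) / (n - 2)      # unbiased estimator of sigma^2
    sxx = np.sum((x - x.mean())**2)
    se_b1 = np.sqrt(sigma2_hat / sxx)
    se_b0 = np.sqrt(sigma2_hat * np.sum(x**2) / (n * sxx))
    return sigma2_hat, se_b0, se_b1
```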
Numerical Example 1
Solution:
Here weekly consumption expenditure is the dependent variable and income is the explanatory variable. Let Y represent the dependent variable and X the explanatory variable. In the simple linear regression model the relationship between weekly consumption expenditure and weekly income can be written as follows:

Yᵢ = β₀ + β₁Xᵢ + uᵢ
The calculations for the slope and intercept coefficients are given in the table below:

Y       X       Y − Ȳ      X − X̄     (X − X̄)²    (X − X̄)(Y − Ȳ)
800     1000    −708.33    −1250     1562500      885412.5
1200    1500    −308.33    −750      562500       231247.5
1450    2000    −58.33     −250      62500        14582.5
1500    2500    −8.33      250       62500        −2082.5
1850    3000    341.67     750       562500       256252.5
2250    3500    741.67     1250      1562500      927087.5

ΣY = 9050    ΣX = 13500    Σ(X − X̄)² = 4375000    Σ(X − X̄)(Y − Ȳ) = 2312500
The mean value of weekly consumption expenditure can be calculated as follows:

Ȳ = ΣY / n = 9050 / 6 = 1508.33

Again, the mean value of weekly income can be calculated as follows:

X̄ = ΣX / n = 13500 / 6 = 2250

Thus the average weekly income of a family is Rs 2250.

b₁ = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = 2312500 / 4375000 = 0.52857

b₀ = Ȳ − b₁X̄ = 1508.33 − 0.52857 × 2250 = 319.0476
(a) The slope coefficient is 0.52857 and the intercept coefficient is 319.0476.
(b) The estimated regression equation for weekly consumption expenditure and weekly income can be written as follows:

Ŷᵢ = 319.0476 + 0.52857 Xᵢ

(c) The value b₁ = 0.52857 measures the slope of the regression line. It shows that, for weekly family incomes between Rs 1000 and Rs 3500, as income increases by Re 1 the average increase in estimated weekly consumption expenditure is Re 0.52857. So if weekly income increases by Rs 1000, then on average weekly consumption expenditure rises by approximately Rs 529.
Therefore it is best to interpret the intercept term as the average or mean effect on the dependent variable of all the variables omitted from the regression model.
(d) The predicted value of weekly consumption expenditure when weekly income is Rs 3300 can be obtained as follows:

Ŷ = 319.0476 + 0.52857 × 3300 = 2063.33

(e) The variance and standard error of the disturbance term can be calculated as follows:

σ̂² = RSS / (n − 2)

where RSS = ΣYᵢ² − b₀ ΣYᵢ − b₁ ΣXᵢYᵢ.

Here ΣYᵢ² = 14917500; ΣYᵢ = 9050; ΣXᵢYᵢ = 22675000.

So RSS = 14917500 − 319.0476 × 9050 − 0.52857 × 22675000 = 44794.47

Therefore σ̂² = 44794.47 / 4 = 11198.617

and σ̂ = √11198.617 = 105.8235

The variance and standard error of the disturbance term are 11198.617 and 105.8235 respectively.
The interpretation of the standard error is as follows: the value 105.8235 is the magnitude of a typical deviation of an observation from the estimated regression line; some points lie closer to the regression line and others farther away from it.
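The arithmetic of this example can be reproduced directly. A short check (Python with NumPy), using the six observations of the table above:

```python
import numpy as np

Y = np.array([ 800, 1200, 1450, 1500, 1850, 2250], dtype=float)
X = np.array([1000, 1500, 2000, 2500, 3000, 3500], dtype=float)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()
print(b0, b1)                      # about 319.05 and 0.52857

print(b0 + b1 * 3300)              # predicted expenditure at income 3300, about 2063.3

e = Y - (b0 + b1 * X)              # residuals
sigma2_hat = np.sum(e**2) / (len(Y) - 2)
print(sigma2_hat, np.sqrt(sigma2_hat))   # about 11198.6 and 105.82
```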
Numerical Example 2
It is widely believed in labour economics that wage earnings depend on the level of education. Suppose the estimated relationship between monthly earnings and the number of years of education is as follows:
̂� = . + . �
= .
= .
= .
= .
�= .
(a) Interpret the slope and intercept coefficients of the above wage-education regression model.
(b) What does the standard deviation of the disturbance term indicate?
Answer:
(a) There is a positive relationship between the level of education and monthly wage earnings. Each additional year of schooling raises monthly wage earnings by approximately 42 per cent. The intercept term is positive; however, there is no economic meaning attached to it.
(b) The standard deviation of the disturbance term is small, indicating that the individual values in the sample do not deviate far from the regression line.
Consider a second, more realistic example. Suppose we run a regression model in which the number of crimes in a city is taken as the dependent variable and the number of policemen is taken as the explanatory variable, and suppose we obtain a positive slope coefficient. In such a situation, can we say that a larger number of policemen in a city leads to more crime? Common sense and prior reasoning suggest the opposite direction of causation: cities with more crime tend to employ more policemen. The choice of dependent and explanatory variables must therefore be guided by theory and prior information.
1. The sample regression line always passes through the sample means of X and Y:

Ȳ = b₀ + b₁X̄

This property holds when the sample regression model has an intercept term b₀. To derive it, recall that b₀ = Ȳ − b₁X̄, which can be rewritten as Ȳ = b₀ + b₁X̄. So the predicted value of the dependent variable is Ȳ when the explanatory variable equals X̄.
2. The sum and the average value of the residuals eᵢ are zero:

Σeᵢ = 0 and ē = 0

To prove this property, recall from the derivation of the least squares estimates that

∂RSS/∂b₀ = −2 Σ(Yᵢ − b₀ − b₁Xᵢ) = 0

or −2 Σeᵢ = 0. So Σeᵢ = 0, and hence ē = 0.
3. The mean value of the predicted values Ŷᵢ is equal to the mean value of the actual Yᵢ:

Ŷᵢ = b₀ + b₁Xᵢ = (Ȳ − b₁X̄) + b₁Xᵢ = Ȳ + b₁(Xᵢ − X̄)

Summing both sides over the sample values and dividing by the sample size n, the last term vanishes and we get the mean of Ŷᵢ equal to Ȳ.
4. The regressors and the residuals are uncorrelated, i.e. the covariance between the regressors and the residuals is zero:

Cov(Xᵢ, eᵢ) = 0, or equivalently ΣXᵢeᵢ = 0
5. The predicted values Ŷᵢ and the residuals are uncorrelated, i.e. the covariance between the predicted values and the residuals is zero:

Cov(Ŷᵢ, eᵢ) = 0, or ΣŶᵢeᵢ = 0

Proof:

ΣŶᵢeᵢ = Σ(b₀ + b₁Xᵢ)eᵢ = b₀ Σeᵢ + b₁ ΣXᵢeᵢ = 0
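These properties are easy to verify numerically. A sketch (Python with NumPy) using the data of Numerical Example 1 from the previous section:

```python
import numpy as np

Y = np.array([ 800, 1200, 1450, 1500, 1850, 2250], dtype=float)
X = np.array([1000, 1500, 2000, 2500, 3000, 3500], dtype=float)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X
e = Y - Y_hat

print(np.isclose(b0 + b1 * X.mean(), Y.mean()))  # line passes through (X-bar, Y-bar)
print(np.isclose(e.sum(), 0))                    # residuals sum to zero
print(np.isclose(Y_hat.mean(), Y.mean()))        # mean of fitted values equals Y-bar
print(np.isclose(np.sum(X * e), 0))              # regressor uncorrelated with residuals
print(np.isclose(np.sum(Y_hat * e), 0))          # fitted values uncorrelated with residuals
```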
7. SUMMARY
1. The sample regression function is used to estimate the population regression function because it is not possible to observe the population parameters.
3. The least squares method is chosen as the best way to estimate the sample regression function. The least squares estimates possess interesting statistical properties.
4. The intercept coefficient of ordinary least squares represents the average value of the dependent variable when the explanatory variable is zero. The slope coefficient captures the change in the dependent variable when the explanatory variable changes by one unit.
5. The standard errors measure the precision of the ordinary least squares estimates. If the standard errors are small then the coefficients are precisely estimated.
6. In order to avoid running a spurious regression model, one must be careful to choose the dependent and independent variables using economic theory and other prior information.
7. The ordinary least squares estimates have many interesting numerical properties. They are: (a) the sample regression line always passes through the sample means of X and Y; (b) the sum and the average value of the residuals eᵢ are zero; (c) the mean value of the predicted values Ŷᵢ is equal to the mean value of the actual Yᵢ; (d) the regressors and the residuals are uncorrelated; (e) the predicted values Ŷᵢ and the residuals are uncorrelated.
TABLE OF CONTENTS
1. INTRODUCTION
2. ASSUMPTIONS OF GAUSS MARKOV THEOREM
3. GAUSS MARKOV THEOREM AND PROOF
3.1. PROOF THAT OLS ESTIMATORS ARE LINEAR AND UNBIASED
3.2. PROOF THAT OLS ESTIMATOR IS EFFICIENT
3.3. PROOF THAT OLS ESTIMATOR IS CONSISTENT
4. GOODNESS OF FIT
4.1. MEASURES OF VARIATION
4.2. COEFFICIENT OF DETERMINATION
4.3. COEFFICIENT OF CORRELATION
5. SUMMARY
1. INTRODUCTION
Using OLS we estimate the parameters b₀ and b₁ from the sample regression function. However, these estimates come from the sample regression function. So we need to make some assumptions about the population regression function so that the sample estimates b₀ and b₁ can be used to make inferences about the population parameters β₀ and β₁. This set of assumptions is known as the Classical Linear Regression Model (CLRM) assumptions.
Under these assumptions the OLS estimators have very good statistical properties, so the assumptions are also known as the Gauss Markov assumptions. We now look at the Gauss Markov assumptions for the Classical Linear Regression Model (CLRM).

Yᵢ = β₀ + β₁Xᵢ + uᵢ

E(uᵢ | Xᵢ) = 0

This assumption also implies that information which is not captured by the explanatory variable(s) and falls into the error term is not related to the explanatory variable(s) and hence does not systematically affect the dependent variable.
Var(uᵢ | Xᵢ) = σ²

By definition,

Var(uᵢ | Xᵢ) = E[uᵢ − E(uᵢ | Xᵢ)]² = E(uᵢ² | Xᵢ) = σ²
Assumption 5 (No autocorrelation): the correlation between any two disturbance terms uᵢ and uⱼ (i ≠ j), given any two values Xᵢ and Xⱼ, is zero:

Cov(uᵢ, uⱼ | Xᵢ, Xⱼ) = E{[uᵢ − E(uᵢ)] | Xᵢ}{[uⱼ − E(uⱼ)] | Xⱼ} = E(uᵢ | Xᵢ) E(uⱼ | Xⱼ) = 0

Assumption 6 (Zero covariance between the disturbance term and the explanatory variable):

Cov(Xᵢ, uᵢ) = E[Xᵢ − E(Xᵢ)][uᵢ − E(uᵢ)]
            = E[uᵢ(Xᵢ − E(Xᵢ))]            (since E(uᵢ) = 0)
            = E(Xᵢuᵢ) − E(Xᵢ)E(uᵢ)          (since E(uᵢ) = 0)
            = E(Xᵢuᵢ) = 0
This basically says that the explanatory variables are uncorrelated with the disturbance term, so the values of the explanatory variables convey no information about the disturbance term.
Assumption 7: (Identification):
To find unique estimates of the normal equations, the number of observations must be
greater than the number of parameters to be estimated. Otherwise it would not be possible
to find unique OLS estimates of the parameters.
The X values in a given sample must not all be the same:

0 < Σ(Xᵢ − X̄)² / (n − 1) < ∞

If all the values of Xᵢ are the same, we have Σ(Xᵢ − X̄)² = 0, and it will not be possible to obtain the OLS estimates.
Further, the disturbance terms are assumed to be normally and independently distributed:

uᵢ ~ NID(0, σ²),  i = 1, 2, ..., n

where NID stands for normally and independently distributed. The normality assumption for the disturbance term implies that Yᵢ is also normally distributed. This assumption is necessary for constructing confidence intervals for β₀ and β₁ and hence for conducting hypothesis tests.
The functional form of the regression model needs to be correctly specified; otherwise there will be specification bias or specification error in the estimation of the regression model.
Assumption 11 (No multicollinearity): when the regression model has more than one explanatory variable, there should not be any perfect linear relationship between any of these variables.
The above assumptions about the regression model relate to the population regression function. Since we can only observe the sample regression function, and not the population regression function, we cannot really know whether the above assumptions are actually valid.
The Gauss Markov Theorem basically states that under the assumptions of the Classical
Linear Regression Model (assumptions 1-8), the least squares estimators are the
minimum variance estimators among the class of unbiased linear estimators; that is, they
are BLUE.
The OLS estimator b₁ is unbiased if its expected value is equal to the population parameter β₁. The estimator b₁ is a random variable and takes on different values from sample to sample. The unbiasedness property implies that on average the value of b₁ is equal to the population parameter β₁.
b₁ = Σkᵢ Yᵢ

where

kᵢ = (Xᵢ − X̄) / Σ(Xᵢ − X̄)²

The kᵢ's have the following properties:

Σkᵢ = 0

Σkᵢ² = Σ(Xᵢ − X̄)² / [Σ(Xᵢ − X̄)²]² = 1 / Σ(Xᵢ − X̄)²

ΣkᵢXᵢ = Σ(Xᵢ − X̄)Xᵢ / Σ(Xᵢ − X̄)² = 1

To prove the unbiasedness of the OLS estimator we rewrite the estimator in terms of the population parameters:

b₁ = Σkᵢ Yᵢ
b₁ = Σkᵢ(β₀ + β₁Xᵢ + uᵢ)
   = β₀ Σkᵢ + β₁ ΣkᵢXᵢ + Σkᵢuᵢ
   = β₁ + Σkᵢuᵢ

The OLS estimator b₁ is thus a linear function of Yᵢ. The explanatory variable(s) are assumed to be non-stochastic, so the kᵢ are non-stochastic as well. Taking expectations on both sides, we have

E(b₁) = β₁ + Σkᵢ E(uᵢ) = β₁

Therefore the OLS estimator b₁ is an unbiased linear estimator of β₁.
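Unbiasedness is a statement about averages over repeated samples, which a small Monte Carlo experiment can illustrate. The sketch below (Python with NumPy; the population values β₀ = 2, β₁ = 0.7 and σ = 1 are arbitrary choices of ours) draws many samples from a fixed PRF and averages the OLS slope estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 2.0, 0.7, 1.0        # hypothetical population values
X = np.linspace(1, 10, 30)                 # fixed (non-stochastic) regressor

slopes = []
for _ in range(10_000):                    # repeated sampling from the same PRF
    Y = beta0 + beta1 * X + rng.normal(0, sigma, X.size)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
    slopes.append(b1)

print(np.mean(slopes))   # close to beta1 = 0.7, illustrating E(b1) = beta1
```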
The OLS estimator has a second desirable property: it is an efficient estimator. This efficiency property relates to the variance of the estimator. We have to prove that the OLS estimator has the smallest variance among all linear unbiased estimators. To prove this, we first define an arbitrary estimator β̃₁ which is linear in Yᵢ. Second, we impose the restrictions implied by unbiasedness. Lastly, we show that the variance of the arbitrary estimator β̃₁ is larger than (or at least equal to) the variance of the OLS estimator b₁.
β̃₁ = Σwᵢ Yᵢ

Next we substitute the Population Regression Function into β̃₁:

β̃₁ = Σwᵢ Yᵢ = Σwᵢ(β₀ + β₁Xᵢ + uᵢ)
β̃₁ = β₀ Σwᵢ + β₁ ΣwᵢXᵢ + Σwᵢuᵢ

For β̃₁ to be an unbiased estimator of β₁ we require

Σwᵢ = 0 and ΣwᵢXᵢ = 1

so that

β̃₁ = β₁ + Σwᵢuᵢ

Var(β̃₁) = E[β̃₁ − β₁]² = E[Σwᵢuᵢ]² = σ² Σwᵢ²

Now write wᵢ = (wᵢ − kᵢ) + kᵢ, where kᵢ = (Xᵢ − X̄)/Σ(Xᵢ − X̄)². Then

Var(β̃₁) = σ² Σ[(wᵢ − kᵢ) + kᵢ]²
        = σ² Σ(wᵢ − kᵢ)² + σ² Σkᵢ² + 2σ² Σ(wᵢ − kᵢ)kᵢ

It can be shown that the last term in the above equation is zero:

σ² Σ(wᵢ − kᵢ)kᵢ = σ² Σwᵢkᵢ − σ² Σkᵢ²
              = σ² / Σ(Xᵢ − X̄)² − σ² / Σ(Xᵢ − X̄)²   (using ΣwᵢXᵢ = 1 and Σwᵢ = 0)
              = 0

Therefore

Var(β̃₁) = σ² Σ(wᵢ − kᵢ)² + Var(b₁)

The first term on the right-hand side is always positive except when wᵢ = kᵢ for all values of i. So

Var(β̃₁) ≥ Var(b₁)

with equality only when β̃₁ coincides with the OLS estimator b₁.
This property of invariance does not hold for the expectation operator E. For instance, if β̂ is an unbiased estimator of β, i.e. E(β̂) = β, this does not mean that 1/β̂ is an unbiased estimator of 1/β, i.e. E(1/β̂) ≠ 1/E(β̂) ≠ 1/β. This is because the expectation operator applies only to linear functions of random variables, while the plim operator is valid for any continuous function.
We know that

b₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²
   = [Σ(Xᵢ − X̄)Yᵢ − Ȳ Σ(Xᵢ − X̄)] / Σ(Xᵢ − X̄)²
   = Σ(Xᵢ − X̄)Yᵢ / Σ(Xᵢ − X̄)²

Substituting Yᵢ = β₀ + β₁Xᵢ + uᵢ,

b₁ = Σ(Xᵢ − X̄)(β₀ + β₁Xᵢ + uᵢ) / Σ(Xᵢ − X̄)²
   = β₀ Σ(Xᵢ − X̄)/Σ(Xᵢ − X̄)² + β₁ Σ(Xᵢ − X̄)Xᵢ/Σ(Xᵢ − X̄)² + Σ(Xᵢ − X̄)uᵢ/Σ(Xᵢ − X̄)²
   = β₁ + Σ(Xᵢ − X̄)uᵢ / Σ(Xᵢ − X̄)²
Taking probability limits as n → ∞,

plim(b₁) = β₁ + plim[(1/n) Σ(Xᵢ − X̄)uᵢ] / plim[(1/n) Σ(Xᵢ − X̄)²]

We divide both the numerator and the denominator of the second term by n so that the sums do not go to infinity as n → ∞. We then apply the law of large numbers to both the numerator and the denominator. According to the law of large numbers, under general conditions the sample moments converge to their corresponding population moments. Hence,

plim(b₁) = β₁ + Cov(X, u) / Var(X) = β₁

provided Var(X) ≠ 0. Note that Cov(X, u) = E[(Xᵢ − X̄)uᵢ] = (Xᵢ − X̄)E[uᵢ] = 0.
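Consistency can likewise be illustrated by letting the sample size grow. In the sketch below (Python with NumPy; β₀ = 1 and β₁ = 0.5 are arbitrary hypothetical values) the OLS slope collapses towards the population value as n increases.

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1 = 1.0, 0.5                    # hypothetical population values

for n in (20, 200, 2000, 20000):
    X = rng.uniform(0, 10, n)
    Y = beta0 + beta1 * X + rng.normal(0, 1, n)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
    print(n, round(b1, 4))                 # estimates concentrate around 0.5 as n grows
```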
4. GOODNESS OF FIT
We have estimated our model parameters using OLS and have seen that they have various desirable statistical properties under certain assumptions. But we are still not sure whether the estimated model fits the data well. If all the observations of the sample lie on the regression line, we say that the regression model fits the data perfectly. Usually we will have some negative and some positive residuals, and we want these residuals around the regression line to be as small as possible. The coefficient of determination provides a summary measure of how well the sample regression line fits the data.
Recall that Yᵢ = Ŷᵢ + eᵢ. Summing both sides and dividing by the sample size gives Ȳ equal to the mean of Ŷᵢ (since Σeᵢ = 0). Writing the identity in deviation form, squaring both sides and summing over the sample, we have
Σ(Yᵢ − Ȳ)² = Σ(Ŷᵢ − Ȳ)² + Σeᵢ² + 2Σ(Ŷᵢ − Ȳ)eᵢ

The last term is zero because the covariance between the fitted values and the residuals is zero. Hence

Σ(Yᵢ − Ȳ)² = Σ(Ŷᵢ − Ȳ)² + Σeᵢ²

TSS = ESS + RSS

or, Total Sum of Squares (TSS) = Explained Sum of Squares (ESS) + Residual Sum of Squares (RSS),

where TSS = Σ(Yᵢ − Ȳ)² is the total variation of the actual Y values about their sample mean, ESS = Σ(Ŷᵢ − Ȳ)² is the variation of the fitted values about their mean, and RSS = Σeᵢ² is the residual variation.
Therefore the total variation in Y can be decomposed into two parts: (1) ESS, which is the part accounted for by X, and (2) RSS, which is the unexplained and unaccounted part. RSS is known as the unexplained part of the variation because the residual term captures the effect of variables, other than the explanatory variable, that are not included in the regression model.
We define R² as follows:

R² = ESS / TSS = Σ(Ŷᵢ − Ȳ)² / Σ(Yᵢ − Ȳ)² = 1 − RSS/TSS
So R² equals one minus the proportion of the total sum of squares that is not explained by the regression model (the residual sum of squares). When the observed points are close to the estimated regression line, we say that the model fits the data well; in that case ESS is high and RSS is small. We want R², which is a measure of goodness of fit, to be high. When R² is low, there is a lot of variation in Y that cannot be explained by X.
There are other interpretations of R². It also equals the squared coefficient of correlation between the observed values Yᵢ and the predicted values Ŷᵢ:

R² = [Cov(Yᵢ, Ŷᵢ)]² / [Var(Yᵢ) Var(Ŷᵢ)]
A question which commonly arises relates to the value of the goodness of fit. There is no rule which suggests what value of R² is considered high and what is considered low. For time series data the value of R² is usually high, often above 0.9; for cross-sectional data, a value of 0.6 or 0.7 may be considered good. We should be cautious not to depend too much on the value of R²; it is simply one measure of model adequacy. We should be more concerned about the signs of the regression coefficients and whether they conform to economic theory and prior information.
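Computing R² requires nothing beyond the sums of squares defined above. A minimal sketch (Python with NumPy; r_squared is our own helper name):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: R^2 = ESS/TSS = 1 - RSS/TSS."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    tss = np.sum((y - y.mean())**2)        # total sum of squares
    rss = np.sum((y - y_hat)**2)           # residual sum of squares
    return 1.0 - rss / tss

# Equivalently, R^2 equals the squared correlation between observed and fitted values:
# np.corrcoef(y, y_hat)[0, 1] ** 2
```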
Properties of R²
1. R² is a non-negative number.
2. It is unit free, as both the numerator and the denominator have the same units.
3. The following relationship holds for the coefficient of determination: 0 ≤ R² ≤ 1.
The concept of the coefficient of correlation is quite different from that of goodness of fit, although the two are closely connected. The coefficient of correlation measures the degree of association between two variables. The sample correlation coefficient can be obtained as follows:

r = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / √[Σ(Xᵢ − X̄)² Σ(Yᵢ − Ȳ)²]
4. A change in the origin and scale of measurement does not affect the coefficient of correlation. Suppose Xᵢ* = aXᵢ + c and Yᵢ* = bYᵢ + d, where a, b, c, d are constants (with a, b > 0). Then the correlation coefficient between Xᵢ and Yᵢ and the correlation coefficient between Xᵢ* and Yᵢ* are the same.
5. SUMMARY
1. The Classical Linear Regression Model is based on a set of assumptions known as the Gauss Markov assumptions.
3. The assumptions of the Classical Linear Regression Model are necessary to prove the Gauss Markov Theorem. The theorem basically states that under these assumptions the least squares estimators are the minimum variance estimators among the class of linear unbiased estimators; that is, they are BLUE (Best Linear Unbiased Estimators).
4. The OLS estimator b₁ is unbiased if its expected value is equal to the population parameter β₁. The property of unbiasedness implies that on average the value of b₁ is equal to the population parameter β₁.
5. The efficiency property relates to the variance of an estimator: among all linear unbiased estimators, the OLS estimator has the smallest variance.
6. The property of consistency is a large sample property which basically means that as
the sample size tends to infinity, the density function of the estimator collapses to the
parameter value.
7. The Total Variation in (TSS) is a sum of two parts (1) Explained Sum of Squares
(ESS) which is the part accounted for by and (2) Residual Sum of Squares (RSS) which
is the unexplained and unaccounted part.
8. The coefficient of determination measures the overall goodness of fit of the regression
model. It tells what proportion of the variation in the dependent variable is explained by
the explanatory variable.
9. The coefficient of determination lies between 0 and 1. The closer it is to 1, the better the overall goodness of fit of the model. There is no rule that says what level of the coefficient of determination is high and what level is low. The sign of the regression coefficient is very important.
10. The Coefficient of Correlation measures the degree of association between two variables. It lies between −1 and +1. Statistical independence of two variables implies a zero correlation coefficient, but not necessarily vice versa.
11. The Coefficient of Determination and the Correlation Coefficient are related as follows: r = ±√R².
Subject: Business Economics
Module No and Title: 4, Further Aspects of the Two-Variable Linear Regression Model
Module Tag: BSE_P8_M4
1. Introduction
2. Regression through the origin
3. R² for the regression model through the origin
4. Change of Scale and Origin of Measurement
4.1 Changing the scale of measurement
4.2 Changing the origin
5. Regression on Standardized Variables
6. Functional Forms
6.1. Various Functional Forms
6.2. Choosing the Best Functional Forms
7. Summary
In this chapter we discuss further aspects of linear regression analysis. First we study the case of regression through the origin, that is, a situation where the intercept term β₀ is absent from the model. Second, we study the units of measurement of X and Y and how changes in these units affect the regression results. Finally, we study various functional forms of the linear regression model, i.e. models that are linear in the parameters but non-linear in the variables.
In regression through the origin, the PRF is Yᵢ = β₁Xᵢ + uᵢ. The sample counterpart of this PRF is the sample regression function (SRF):

Yᵢ = b₁Xᵢ + eᵢ

We apply the OLS method to obtain the formula for calculating b₁ and its variance.
The SRF is

Yᵢ = b₁Xᵢ + eᵢ

We want to minimize

Σeᵢ² = Σ(Yᵢ − b₁Xᵢ)²

Differentiating with respect to b₁ and setting the derivative equal to zero,

∂Σeᵢ²/∂b₁ = 2Σ(Yᵢ − b₁Xᵢ)(−Xᵢ) = 0

or ΣXᵢYᵢ − b₁ΣXᵢ² = 0

so that

b₁ = ΣXᵢYᵢ / ΣXᵢ²

Substituting Yᵢ = β₁Xᵢ + uᵢ,

b₁ − β₁ = ΣXᵢuᵢ / ΣXᵢ²

E(b₁) − β₁ = E[ΣXᵢuᵢ / ΣXᵢ²] = 0

Var(b₁) = E[(b₁ − β₁)²] = σ² / ΣXᵢ²

where σ² is estimated by

σ̂² = Σeᵢ² / (n − 1)
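The through-the-origin estimator and its standard error can be coded in a few lines. A sketch (Python with NumPy; ols_through_origin is a hypothetical helper name):

```python
import numpy as np

def ols_through_origin(x, y):
    """OLS slope for the no-intercept model y = b1*x + e, with its standard error."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b1 = np.sum(x * y) / np.sum(x**2)            # b1 = sum(XY) / sum(X^2)
    e = y - b1 * x
    sigma2_hat = np.sum(e**2) / (x.size - 1)     # only one parameter is estimated
    se_b1 = np.sqrt(sigma2_hat / np.sum(x**2))
    return b1, se_b1
```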
Let us briefly compare the formulas in regression models with and without an intercept term:

                      Without intercept                  With intercept
Model                 Yᵢ = β₁Xᵢ + uᵢ                     Yᵢ = β₀ + β₁Xᵢ + uᵢ
Slope estimator       b₁ = ΣXᵢYᵢ / ΣXᵢ²                  b₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²
Variance of slope     Var(b₁) = σ² / ΣXᵢ²                Var(b₁) = σ² / Σ(Xᵢ − X̄)²
An illustration:
Consider the Capital Asset Pricing Model (CAPM) of modern portfolio theory, which states that if the capital market works efficiently, then security i's expected risk premium (ERᵢ − r_f) is equal to that security's β coefficient times the expected market risk premium (ER_m − r_f). The CAPM postulate can be empirically expressed as follows:

Rᵢ − r_f = βᵢ(R_m − r_f) + uᵢ

where Rᵢ is the return on security i,
r_f is the risk-free return, say the return on treasury bills,
R_m is the return on the market portfolio,
βᵢ is a measure of systematic risk, i.e. risk which cannot be eliminated through diversification, and
uᵢ represents the disturbance term.
The postulate of CAPM has been shown diagrammatically in figure 1 given above.
The raw R² satisfies the relation 0 ≤ raw R² ≤ 1, but it is not directly comparable to the conventional R². It is best to stick to the intercept model unless there is a very strong a priori expectation that the intercept is absent. This is because we will be committing a specification error if there really is an intercept in the model and we insist on fitting a regression through the origin. Second, if the intercept term is truly absent but is nevertheless included in the model, it will turn out to be statistically insignificant.
4.1 Changing the scale of measurement

Recall the OLS estimates and their variances in the original units of measurement:

b₀ = Ȳ − b₁X̄
b₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²
Var(b₀) = σ² ΣXᵢ² / [n Σ(Xᵢ − X̄)²]
Var(b₁) = σ² / Σ(Xᵢ − X̄)²

Suppose we rescale the variables as Yᵢ* = w₁Yᵢ and Xᵢ* = w₂Xᵢ, where w₁ and w₂ are constants, and run the regression of Y* on X*:

Yᵢ* = b₀* + b₁*Xᵢ* + eᵢ*

Applying the same OLS formulas to the starred variables gives

b₁* = Σ(Xᵢ* − X̄*)(Yᵢ* − Ȳ*) / Σ(Xᵢ* − X̄*)² = (w₁/w₂) b₁
b₀* = Ȳ* − b₁*X̄* = w₁ b₀
σ̂*² = Σeᵢ*² / (n − 2) = w₁² σ̂²
Var(b₀*) = w₁² Var(b₀)
Var(b₁*) = (w₁/w₂)² Var(b₁)

Two special cases are worth noting. If only the scale of Y is changed (w₂ = 1), the slope and intercept coefficients, along with their respective standard errors, are both multiplied by w₁. If only the scale of X is changed (w₁ = 1), the slope and its standard error are multiplied by 1/w₂, but the intercept and its standard error are unaffected.
4.2 Changing the origin

A change of origin affects the intercept of the regression, but the slope coefficient is unaffected. The origin can be changed by adding or subtracting a constant to X and/or Y. Suppose

Xᵢ* = Xᵢ + c and Yᵢ* = Yᵢ + d

and we run the regression Yᵢ* = b₀* + b₁*Xᵢ* + eᵢ*. Then

b₁* = Σ(Xᵢ* − X̄*)(Yᵢ* − Ȳ*) / Σ(Xᵢ* − X̄*)²
    = Σ(Xᵢ + c − X̄ − c)(Yᵢ + d − Ȳ − d) / Σ(Xᵢ + c − X̄ − c)²
    = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² = b₁

while the intercept becomes b₀* = Ȳ* − b₁X̄* = b₀ + d − b₁c.
There are two things to be noted. First, the intercept term is not applicable in the standardized regression, since it is a regression through the origin. Second, there is a relationship between the regression coefficient of the standardized model and that of the traditional model. For the bivariate case the relationship is:

b₁* = b₁ (S_X / S_Y)

where S_X represents the standard deviation of the regressor and S_Y represents the standard deviation of the regressand.
6. Functional Forms
In many economic applications, the relationships between variables are non-linear in nature. So we extend linear regression models to incorporate non-linearities (in the variables) by appropriately redefining the dependent and independent variables.
Both of the above models are linear in the parameters although they are not linear in the variable Y (in the first model) or the variable X (in the second model). Since both models are linear in the parameters, the OLS method can be used to estimate the parameters. However, if a model is not linear in the parameters, iterative methods must be used in the estimation.
This is a non-linearizable model, as we cannot transform it into a linear model.
Before we proceed to the various functional forms, we look at a few definitions which will help us later:
(1) The proportional change in X between periods t−1 and t is given by

ΔX/X = (Xₜ − Xₜ₋₁) / Xₜ₋₁

To obtain the proportional change in per cent, we multiply the proportional change by 100: (ΔX/X) × 100 %.
(2) The logarithmic change in X between periods t−1 and t is given by

Δln X = ln Xₜ − ln Xₜ₋₁
Concept of Elasticity
Elasticity measures the ratio of the relative changes of two variables. In terms of proportional changes, the elasticity of Y with respect to X can be defined as

E_{Y/X} = (ΔY/Y) / (ΔX/X)

where ΔY = Yₜ − Yₜ₋₁ and ΔX = Xₜ − Xₜ₋₁.
We study some regression models that may be non linear in variables but are linear in
parameters or can be linearized using suitable transformation. They are:
1. The log-log model
2. Semilog models
3. Reciprocal models
4. The logarithmic reciprocal model
a. Log-Log Model
The log-log model is

ln Yᵢ = β₀ + β₁ ln Xᵢ + uᵢ

It can be estimated by the ordinary least squares method by letting

Yᵢ* = β₀ + β₁Xᵢ* + uᵢ

where Yᵢ* = ln Yᵢ and Xᵢ* = ln Xᵢ. The OLS estimators are unbiased estimators.
The popularity of the log-log model lies in the fact that the slope coefficient β₁ measures the elasticity of Y with respect to X. For instance, if Y represents the quantity demanded of a commodity and X represents its price, then β₁ measures the price elasticity of demand.
The log-log model has two special features. First, the elasticity coefficient between Y and X remains constant throughout, so no matter at which point of Xᵢ you measure the elasticity, it will be the same. Second, the estimator of the original intercept, obtained as the antilog of b₀, is a biased estimator even though b₀ itself is unbiased. However, the intercept term is often not very important in empirical studies.
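Since the log-log model is linear in the parameters after the transformation, ordinary OLS on the logged variables recovers the (constant) elasticity. A sketch (Python with NumPy; the demand data are simulated with a true price elasticity of −1.2, an assumption of ours):

```python
import numpy as np

rng = np.random.default_rng(4)
price = rng.uniform(1, 20, 200)
quantity = np.exp(5.0 - 1.2 * np.log(price) + rng.normal(0, 0.1, 200))  # simulated demand

x, y = np.log(price), np.log(quantity)     # transform, then apply ordinary OLS
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
print(b1)                                  # estimated elasticity, close to -1.2
```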
Illustrative example
Consider the following results of the estimated regression model between consumption
expenditure and income which is below:
ln ̂ � = − . + . ln � �
The elasticity of consumption expenditure with respect to income is about 1.5, indicating that if income increases by 1 per cent, then consumption expenditure increases by about 1.5 per cent.
b. Lin-Log Model
The lin-log model helps us find the absolute change in the dependent variable Yᵢ for a percentage change in the independent variable Xᵢ. A linear-log model is given by

Yᵢ = β₀ + β₁ ln Xᵢ + uᵢ

The slope is β₁ = ΔYᵢ / (ΔXᵢ/Xᵢ), where Δ denotes a small change. Alternatively, the above relation can be written as

ΔYᵢ = β₁ (ΔXᵢ / Xᵢ)

This states that the absolute change in Yᵢ (= ΔYᵢ) is equal to the slope times the relative change in Xᵢ. If we multiply the latter by 100, the equation gives the absolute change in Yᵢ for a percentage change in Xᵢ.
The most practical application of the lin-log model is found in Engel expenditure models. According to Engel, expenditure on food increases in arithmetic progression while total expenditure increases in geometric progression.
c. Log-Linear Model
We are often interested in calculating the growth rate of important economic variables such as GDP, money supply, population, employment, etc. Suppose we are interested in finding the growth rate of a variable Y using the compound interest formula

Yₜ = Y₀ (1 + r)ᵗ

where r is the compound growth rate of Y. Taking logarithms on both sides, we get

ln Yₜ = ln Y₀ + t ln(1 + r)

Writing β₀ = ln Y₀ and β₁ = ln(1 + r), and adding a disturbance term, we get

ln Yₜ = β₀ + β₁ t + uₜ

The above model is known as a semilog model because only one variable is in logarithmic form. Here the explanatory variable (time) enters linearly while the dependent variable is in logarithmic form, so it is called a log-lin model.
In the log-lin model, the slope coefficient measures the relative change in the dependent variable for a given absolute change in the value of the independent variable.
The coefficient β₁ (multiplied by 100) gives the instantaneous growth rate of the dependent variable Y. However, this is not the compound rate of growth over a period of time. The compound rate of growth can be found by taking the antilog of the estimated β₁, subtracting 1 from it, and multiplying the difference by 100.
Here we regress Y on time, unlike in the log-lin model where we regress ln Y on time. The time variable is known as a trend variable. There is a downward trend in Y when the slope coefficient is negative and an upward trend in Y when the slope coefficient is positive.
The choice between the growth rate model and the linear trend model depends on whether we are interested in the relative or the absolute change in the variable of interest.
Illustrative example
Consider the following regression result for the growth of expenditure on private consumption over the period 1990 to 2010:

ln Ŷₜ = … + 0.00743 t

Find (1) the instantaneous growth rate of expenditure on private consumption over the period 1990 to 2010, and (2) the compound growth rate of expenditure over the same period.
Solution: (1) The instantaneous growth rate is 0.00743 × 100 = 0.743 per cent.
(2) The compound growth rate is antilog(0.00743) − 1 = 0.00746, or 0.746 per cent.
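The conversion from the estimated slope to the two growth rates is a one-line calculation. A quick check (Python) using the slope 0.00743 quoted in the example:

```python
import numpy as np

b1 = 0.00743                               # estimated slope from the log-lin model
instantaneous = b1 * 100                   # 0.743 per cent per period
compound = (np.exp(b1) - 1) * 100          # antilog(b1) - 1, about 0.746 per cent
print(instantaneous, compound)
```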
d.Reciprocal Model
A reciprocal model is given by

Yᵢ = β₀ + β₁(1/Xᵢ) + uᵢ

The above model is non-linear in the variable X, as X enters reciprocally, but it is linear in the parameters β₀ and β₁. As the independent variable X increases indefinitely, the term β₁(1/Xᵢ) tends to zero and the dependent variable Y approaches its limiting value β₀.
Illustrative example
Consider the modern version of the Phillips curve, which can be expressed as:

πₜ − πₜᵉ = β₁(UNₜ − Uᴺ) + uₜ

where πₜ is the actual inflation rate at time t,
πₜᵉ is the expected inflation rate at time t (it is usually replaced by πₜ₋₁, since expected inflation is unobserved),
UNₜ is the actual unemployment rate at time t,
Uᴺ is the natural rate of unemployment, and
uₜ is the disturbance term.
In empirical studies, the relationship between inflation and unemployment can be modelled either with a linear regression model or with a reciprocal model.
Linear model:

πₜ − πₜ₋₁ = … − 0.5789 UNₜ

Reciprocal model:

πₜ − πₜ₋₁ = −2.4 + … (1/UNₜ)

The linear model shows that if the unemployment rate goes down by 1 percentage point, then on average the rate of inflation goes up by 0.5789 percentage points, and vice versa. The reciprocal model shows that even if the rate of unemployment increases indefinitely, the rate of inflation will fall by at most 2.4 percentage points. The natural rate of unemployment can be calculated from the linear model by setting πₜ = πₜ₋₁ and solving for the unemployment rate, which gives the estimated intercept divided by 0.5789, i.e. about 5.989.
Thus the natural rate of unemployment is approximately 5.989%.
e. Log Reciprocal Model
The log reciprocal model is given by

ln Yᵢ = β₀ + β₁(1/Xᵢ) + uᵢ

Such a model can be used to capture the production function in the short run. Our knowledge of microeconomics tells us that if labour and capital are the inputs of a production function and we keep capital constant while increasing the labour input, the short-run output-labour relationship will be captured by the log reciprocal model.
The choice of the most appropriate functional form for empirical analysis requires considerable skill and knowledge of economic theory and other information. The choice becomes more difficult as the number of explanatory variables in the model increases. Some practical guidelines for choosing the most appropriate form are:
1. Economic theory may help us in selecting a particular functional form (e.g., the modified Phillips curve).
2. It is useful to calculate the slope coefficient of the regression model and the elasticity of the dependent variable with respect to the explanatory variable. The following table may be of help:
Model        Equation               Slope (dY/dX)    Elasticity ((dY/dX)·(X/Y))
Linear       Y = β₀ + β₁X           β₁               β₁(X/Y)
Log-Log      ln Y = β₀ + β₁ ln X    β₁(Y/X)          β₁
Log-Linear   ln Y = β₀ + β₁X        β₁Y              β₁X
3. The regression coefficients of the chosen model should be in conformity with a priori expectations based on economic theory. For instance, if we consider the demand function of a commodity with respect to its price, then the slope coefficient should be negative.
4. It is possible that more than one functional form fits the data well. For instance, we can fit both a linear and a reciprocal model to the modern version of the Phillips curve.
7. Summary
1. If a regression model does not contain an explicit intercept term, it is known as regression through the origin, and one should be cautious when estimating such models. In such models the sum of the residuals Σeᵢ is non-zero, and the conventional R² may not be meaningful. It is better to introduce the intercept into the model explicitly unless there is a strong theoretical reason not to.
2. The interpretation of the regression coefficients depends on the units and scale in which the dependent and independent variables are expressed.
3. We discussed some important functional forms in this chapter. They are (a) the log-linear or constant elasticity model, (b) semilog regression models, and (c) reciprocal models.
5. In the semilog model, where the dependent variable is logarithmic and the independent
variable is time, the estimated slope coefficient (multiplied by 100) measures the
(instantaneous) rate of growth of dependent variable. In semilog model if the independent
variable is logarithmic, its coefficient measures the absolute rate of change in the
dependent variable for a given percent change in the value of the independent variable.
6. In the reciprocal models either the dependent variable or the independent variable is
expressed in reciprocal or inverse form to capture nonlinear relationships between
economic variables.
7. The choice of the most appropriate functional form requires experience and knowledge of economic theory. Instead of giving too much importance to the value of R², we should look at the signs of the regression coefficients.
TABLE OF CONTENTS
1. Introduction
2. The Probability distribution of disturbances 𝑼𝒊
3. The normality assumption for 𝑼𝒊
4. Why do we need normality assumption?
5. Properties of OLS estimators under normality assumptions
6. Students’ t-statistics
7. Few important probability distribution
7.1. Normal Distribution
7.2. The Chi-Square Distribution
7.3. Student’s t distribution
7.4. The F-Distribution
7.5 Bernoulli Distribution
7.6. The Binomial Distribution
7.7. The Poisson Distribution
8. Summary
Learning Outcomes
After reading this chapter, the reader will be able to understand the following concepts:
1. Introduction
The classical theory of statistical inference consists of two branches, namely estimation and hypothesis testing. Using the method of OLS we were able to estimate the parameters of the (two-variable) regression model. It was also shown that these estimators possess several desirable statistical properties such as unbiasedness and minimum variance (i.e., the BLUE property). However, these estimators change from sample to sample, and we are not sure whether they represent the population parameters.
Hypothesis testing deals with the issue of drawing inferences about the population
regression function from the sample regression functions. To relate the estimated
parameters (𝑏𝑜, 𝑏1, 𝜎̂ 2 ) to their true value (𝛽𝑜, 𝛽1 , 𝜎 2 ) we need to know the probability
distributions of 𝑏𝑜 ,𝑏1, 𝜎̂ 2
We know that b₁ = Σ cᵢ yᵢ = Σ cᵢ Yᵢ (since Σ cᵢ = 0), where cᵢ = xᵢ / Σ xᵢ² and xᵢ = Xᵢ − X̄. Since the x's are assumed fixed, b₁ is a linear function of Yᵢ. Substituting the population regression function for Yᵢ,
b₁ = Σ cᵢ (β₀ + β₁ Xᵢ + Uᵢ)
so b₁ is also a linear function of the disturbances Uᵢ.
The classical normal linear regression model assumes that each Uᵢ is distributed normally with
Mean: E(Uᵢ) = 0
Variance: E[Uᵢ − E(Uᵢ)]² = E(Uᵢ²) = σ²
Covariance: Cov(Uᵢ, Uₛ) = E{[Uᵢ − E(Uᵢ)][Uₛ − E(Uₛ)]} = E(UᵢUₛ) = 0 for i ≠ s
These assumptions can be written compactly as
Uᵢ ~ N(0, σ²)
where the symbol ~ means 'distributed as' and N stands for the 'normal distribution'. The two terms inside the brackets represent the mean and the variance respectively. For two normally distributed variables, zero covariance means that the two variables are independent of each other; so Uᵢ and Uₛ are not only uncorrelated but also independently distributed. Therefore we can rewrite the above equation as
𝑈𝑖 ~𝑁𝐼𝐷 (0, 𝜎 2 )
4. Why do we need the normality assumption?
1. The disturbance term Uᵢ represents a large number of explanatory variables that are not introduced in the model but have an influence on the dependent variable. We hope that the influence of these neglected variables is small and at best random. The central limit theorem states that if there are a large number of independent and identically distributed random variables, the distribution of their sum will
tend toward a normal distribution as the number of such variables tends to infinity. The CLT thus provides the theoretical justification for the assumption of normality of Uᵢ.
2. A less restrictive version of the CLT states that even if the number of variables is not very large, or the variables are not strictly independent, their sum may still be normally distributed.
3. The probability distribution of OLS estimators can be easily derived with the
normality assumption of disturbance term 𝑈𝑖 because any linear function of
normally distributed variables is itself normally distributed. So since OLS
estimators 𝑏𝑜 and 𝑏1 are linear functions of 𝑈𝑖 and 𝑈𝑖 is normally distributed; the
two OLS estimators 𝑏0 and 𝑏1 are also distributed normally.
4. The normal distribution involves only two parameters (mean and variance).So it is
comparatively a simple distribution. It is also well known and its properties well-
studied.
5. For small samples of fewer than about 100 observations the normality assumption is important: it not only helps us derive the exact probability distributions of the OLS estimators but also enables us to use the t, F and χ² tests for regression models.
6. In large sample size, t and F test can still be applied even if we don’t assume that
the disturbance term is normally distributed.
5. Properties of OLS estimators under normality assumptions
1. They are unbiased.
2. They have minimum variance among the class of linear estimators, which, combined with the unbiasedness property, means that they are efficient.
Mean: E(b₀) = β₀
Variance: Var(b₀) = σ²_b0 = σ² ΣXᵢ² / [n Σ(Xᵢ − X̄)²]
so that
b₀ ~ N(β₀, σ²_b0)
This implies that the variable Z defined as Z = (b₀ − β₀)/σ_b0 follows the standard normal distribution with zero mean and unit (= 1) variance, i.e. Z ~ N(0, 1).
Mean: E(b₁) = β₁
Variance: Var(b₁) = σ²_b1 = σ² / Σ(Xᵢ − X̄)²
so that b₁ ~ N(β₁, σ²_b1). Then the variable Z defined as Z = (b₁ − β₁)/σ_b1 also follows the standard normal distribution.
Moreover,
(n − 2)(σ̂²/σ²) = (n − 2)s²/σ² ~ χ²_{n−2}
where σ̂² = s² is the estimated value and σ² is the true value. This knowledge will help us to draw inferences about the true σ² from the estimated σ̂² = s².
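A small simulation can illustrate this result. The sketch below assumes arbitrary values for β₀, β₁, σ and n (they are illustrative, not from the module) and checks that (n − 2)σ̂²/σ² behaves like a chi-square variable with n − 2 degrees of freedom:

import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma, n, reps = 1.0, 0.5, 2.0, 20, 5000   # assumed values
X = np.linspace(1, 10, n)
stats = []
for _ in range(reps):
    U = rng.normal(0, sigma, n)
    Y = beta0 + beta1 * X + U
    # OLS slope and intercept
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()
    resid = Y - b0 - b1 * X
    s2 = np.sum(resid ** 2) / (n - 2)          # sigma_hat^2
    stats.append((n - 2) * s2 / sigma ** 2)

print(np.mean(stats))   # close to n - 2 = 18, the mean of a chi-square(18)
print(np.var(stats))    # close to 2(n - 2) = 36, its variance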
8. The OLS estimators (b₀, b₁) under the assumption of normality have minimum variance in the entire class of unbiased estimators, whether linear or not. This result, due to Rao, is more powerful than the Gauss-Markov theorem, which is restricted to the class of linear estimators. Thus the OLS estimators are not only BLUE but also best unbiased estimators (BUE): they have minimum variance in the entire class of unbiased estimators.
When we assume that Uᵢ ~ N(0, σ²), then Yᵢ, being a linear function of Uᵢ, is itself normally distributed with mean and variance given by
E(Yᵢ) = β₀ + β₁Xᵢ
Var(Yᵢ) = σ²
6. Students’ t-statistics
We convert the normally distributed OLS estimators into Z variables, which are distributed as N(0, 1):
Z_b0 = (b₀ − β₀)/σ_b0 = (b₀ − β₀) / √[σ² ΣXᵢ² / (n Σ(Xᵢ − X̄)²)]
which can be rewritten as
Z_b0 = (b₀ − β₀) √[n Σ(Xᵢ − X̄)² / ΣXᵢ²] / σ ~ N(0, 1)
Z_b1 = (b₁ − β₁)/σ_b1 = (b₁ − β₁) / √[σ² / Σ(Xᵢ − X̄)²]
which can be rewritten as
Z_b1 = (b₁ − β₁) √[Σ(Xᵢ − X̄)²] / σ ~ N(0, 1)
When the unknown σ is replaced by its estimate s, we obtain instead the t statistics:
T_b0 = (b₀ − β₀)/s_b0 = (b₀ − β₀) / √[s² ΣXᵢ² / (n Σ(Xᵢ − X̄)²)]
that is,
T_b0 = (b₀ − β₀) √[n Σ(Xᵢ − X̄)² / ΣXᵢ²] / s ~ t_{n−2}
T_b1 = (b₁ − β₁)/s_b1 = (b₁ − β₁) / √[s² / Σ(Xᵢ − X̄)²]
that is,
T_b1 = (b₁ − β₁) √[Σ(Xᵢ − X̄)²] / s ~ t_{n−2}
The degrees of freedom here are n − 2 because we are dealing with the simple linear regression model, in which two parameters are estimated. We get the t distribution because the numerator is a standard normal variable and the denominator involves an independent chi-square variable divided by its degrees of freedom. Recall that s² = Σeᵢ²/(n − 2) and that (n − 2)s²/σ² follows the chi-square distribution with n − 2 degrees of freedom.
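The construction of the t statistics can be reproduced numerically. The sketch below uses hypothetical X and Y arrays (they are not data from the module) and computes T_b0 and T_b1 exactly as defined above:

import numpy as np

X = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 11.0, 13.0])   # hypothetical
Y = np.array([3.1, 4.0, 5.2, 6.1, 7.4, 8.2, 9.6, 10.5])     # hypothetical
n = len(X)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
resid = Y - b0 - b1 * X
s2 = np.sum(resid ** 2) / (n - 2)                 # estimate of sigma^2

se_b1 = np.sqrt(s2 / np.sum((X - X.mean()) ** 2))
se_b0 = np.sqrt(s2 * np.sum(X ** 2) / (n * np.sum((X - X.mean()) ** 2)))

# t statistics for H0: beta0 = 0 and H0: beta1 = 0, each with n - 2 df
T_b0, T_b1 = b0 / se_b0, b1 / se_b1
print(T_b0, T_b1)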
7. Few important probability distributions
(a) Normal Distribution
Normal distribution is the most popular probability distribution with bell-shaped figure.
A normally distributed random variable X has the following Probability Density
Function:
f(x) = [1/(σ√(2π))] exp[ −(x − μ)²/(2σ²) ],   −∞ < x < ∞
Where 𝜇 and 𝜎 2 are the parameters of the distribution representing mean and variance
respectively. The normal distribution is shown in figure 2.
The standardized variable Z = (X − μ)/σ follows the standard normal distribution, with density
f(Z) = [1/√(2π)] exp(−Z²/2)
Let X₁ ~ N(μ₁, σ₁²) and X₂ ~ N(μ₂, σ₂²) be two independent variables. A linear combination of the two variables can be written as
Y = aX₁ + bX₂, where a and b are constants.
We can then show that Y is also normally distributed:
Y ~ N(aμ₁ + bμ₂, a²σ₁² + b²σ₂²)
Similarly, for a random sample x₁, x₂, …, xₙ from a population with mean μ and variance σ², the sample mean satisfies
x̄ ~ N(μ, σ²/n) as n → ∞
so that the standardized variable
z = (x̄ − μ)/(σ/√n) = √n(x̄ − μ)/σ
will tend toward the standard normal. So
z ~ N(0, 1)
6. Test of Normality
A normal distribution will have Skewness =0 and Kurtosis =3. In other words, a normal
distribution is symmetric and mesokurtic. So to find out if a distribution departs from
normal distribution we need to check if computed values of skewness and kurtosis are
different from 0 and 3 respectively. The formal test can be done by the Jarque-Bera test
of normality which is as follows:
JB = n [ S²/6 + (K − 3)²/24 ]
Where S means Skewness and K means Kurtosis. Under the null hypothesis, JB is
distributed as a Chi-square statistic with 2 df.
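A quick way to compute the Jarque-Bera statistic is sketched below; the residual series is simulated purely for illustration, and scipy.stats also provides a built-in jarque_bera function that gives the same statistic:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
e = rng.normal(size=200)                      # hypothetical residuals

n = len(e)
S = stats.skew(e)                             # skewness
K = stats.kurtosis(e, fisher=False)           # kurtosis (normal = 3)
JB = n * (S ** 2 / 6 + (K - 3) ** 2 / 24)

p_value = stats.chi2.sf(JB, df=2)             # JB ~ chi-square(2) under H0
print(JB, p_value)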
7. The mean and variance of a random variable which follows normal distribution
are independent of each other
Example 1
Suppose that 𝑋 follows a normal distribution with mean of 6 and variance of 9. i.e
𝑋~𝑁(6,9)
Find the probability that 𝑋 will take a value between 𝑋1 = 3 and 𝑋2 = 9.
Solution:
We first calculate the values that Z takes:
Z₁ = (X₁ − μ)/σ = (3 − 6)/3 = −1
Z₂ = (X₂ − μ)/σ = (9 − 6)/3 = 1
We then use the standard normal table to find the probabilities associated with these two values of Z. So Pr(0 ≤ Z ≤ 1) = 0.3413, and by symmetry Pr(−1 ≤ Z ≤ 0) = 0.3413. Therefore, the probability that X will lie between X₁ = 3 and X₂ = 9 is 0.3413 + 0.3413 = 0.6826.
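The same probability can be checked with scipy, which avoids the table look-up (the small difference from 0.6826 is only table rounding):

from scipy.stats import norm

p = norm.cdf(9, loc=6, scale=3) - norm.cdf(3, loc=6, scale=3)
print(round(p, 4))   # about 0.6827, matching 0.3413 + 0.3413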
(b) The Chi-Square Distribution
1. The chi-square distribution is skewed, and the degree of skewness depends on the degrees of freedom. When the degrees of freedom are small, the distribution is highly skewed to the right. The distribution becomes more symmetrical as the degrees of freedom increase, and when the degrees of freedom exceed 100 the variable
√(2χ²) − √(2k − 1)
becomes a standardized normal variable, where k is the degrees of freedom.
2. The Chi-Squared distribution has a mean 𝑘 and variance 2𝑘 where 𝑘 is the
degrees of freedom.
3. For two independent Chi-Squared variables 𝑍1 and 𝑍2 with 𝑘1 and 𝑘2 degrees of
freedom, their sum 𝑍1 + 𝑍2 is also a Chi-Square variable with degrees of freedom
equal to 𝑘1 + 𝑘2
Example 2
Find the probability of getting a Chi-square value of 32 or greater, given that the degree
of freedom is 14.
Solution:
Using the table from Chi-Square distribution we find that the probability of getting a Chi-
Square value of 31.3193 or greater for a degree of freedom of 14 is 0.005. So, the
probability of getting a chi-square value of 32 or greater is less than 0.005 which is a
small probability indeed.
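The exact tail probability can be confirmed with scipy:

from scipy.stats import chi2

print(chi2.sf(32, df=14))   # about 0.004, i.e. smaller than 0.005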
(c) Student's t Distribution
If Z₁ is a standard normal variable and Z₂ is an independent chi-square variable with k degrees of freedom, then the variable
t = Z₁ / √(Z₂/k) = Z₁√k / √Z₂
follows Student's t distribution with k degrees of freedom. It is usually denoted by t_k, where k denotes the degrees of freedom. The t-distribution is shown in figure 4.
Properties of t-distribution
Example 3
When the degrees of freedom is 21, what is the probability of getting a 𝑡 value of (a)
about 2 or greater (b) about -2 or smaller (c) of |𝑡| of about 2 or greater where |𝑡| denotes
the absolute value.
Solution:
(a) The probability of getting a t value of about 2 or greater when the degrees of freedom are 21 is about 0.025 (see the t-distribution table).
(b) Since the t distribution is symmetric, the probability of getting a t value of about −2 or smaller when the degrees of freedom are 21 is also about 0.025.
(c) Combining (a) and (b), the probability of getting a |t| value of about 2 or greater is about 0.025 + 0.025 = 0.05.
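These probabilities can be confirmed with scipy; the exact values differ slightly from the table approximation because 2 is not exactly the 2.5% critical value for 21 degrees of freedom:

from scipy.stats import t

print(t.sf(2, df=21))        # P(t >= 2)   ~ 0.029
print(t.cdf(-2, df=21))      # P(t <= -2)  ~ 0.029, by symmetry
print(2 * t.sf(2, df=21))    # P(|t| >= 2) ~ 0.059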
(d) The F-Distribution
If Z₁ and Z₂ are independent chi-square variables with k₁ and k₂ degrees of freedom respectively, then the variable
F = (Z₁/k₁) / (Z₂/k₂)
follows the F distribution with (k₁, k₂) degrees of freedom.
Properties of the F distribution
When the denominator degrees of freedom k₂ are large, k₁F ~ χ²_{k₁}.
Note that all the three distributions namely 𝑡, 𝐹 and Chi-square distributions are
related to normal distribution. They all tend to approach the normal distribution
when the degrees of freedom get larger and larger.
Example 4
1. Find the probability of getting a 𝐹 value of about 1.6 or greater when 𝑘1 = 10 and
𝑘2 = 8 (Answer: From 𝐹 table, the probability is 0.25)
2. Give a numerical example to show that 𝑘1 𝐹 ~ 𝜒𝑘21 when 𝑘2 is large.
Answer: Suppose k₁ = 20 and k₂ = 200. The critical value of F at the 10% level is 1.46, so k₁F = 20 × 1.46 = 29.2. From the chi-square table, the critical chi-square value at the 10% level for 20 degrees of freedom is χ²₂₀(10%) = 28.412, which is close to 29.2.
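Both parts of the example can be checked with scipy:

from scipy.stats import f, chi2

print(f.sf(1.6, dfn=10, dfd=8))        # about 0.25

F_crit = f.ppf(0.90, dfn=20, dfd=200)  # 10% upper-tail critical value, ~1.46
print(20 * F_crit)                     # ~29.2
print(chi2.ppf(0.90, df=20))           # ~28.41, close to 20*F_crit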
(e) Bernoulli Distribution
A random variable X that takes the value 1 ('success') with probability p and the value 0 ('failure') with probability 1 − p follows the Bernoulli distribution:
P(X = 0) = 1 − p
P(X = 1) = p
with E(X) = p and Var(X) = p(1 − p) = pq.
(f) The Binomial Distribution
Let n represent the number of independent trials, each of which results in a 'success' with probability p and a 'failure' with probability q = 1 − p. The number of successes in the n trials is represented by X. Then X follows the binomial distribution, whose probability function is
f(x) = C(n, x) p^x (1 − p)^(n−x),   x = 0, 1, …, n
where
C(n, x) = n! / [x!(n − x)!]
The binomial distribution has two parameters, namely n and p. In the binomial distribution we have
E(X) = np and Var(X) = np(1 − p)
Example 5
A new treatment for swine flu has a 20 percent probability of complete cure. On a trial basis, 35 patients were given the treatment. What is the probability that at least 10 patients will be cured completely?
Solution.
Let X = the number of successes in 35 trials. We need to find P(X ≥ 10). From the binomial distribution table we find that P(X ≥ 10) = 0.1457.
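The binomial tail probability can be confirmed with scipy, where P(X ≥ 10) is computed as P(X > 9):

from scipy.stats import binom

print(binom.sf(9, n=35, p=0.2))   # about 0.1457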
8. Summary
1. We study the classical normal linear regression model (CNLRM) in this chapter.
2. The CNLRM is different from the classical linear regression model (CLRM). Under the CNLRM, the disturbance term Uᵢ entering the regression model is assumed to be normally distributed.
3. The CLRM, by contrast, does not require any assumption about the probability distribution of Uᵢ; it only requires that the mean value of Uᵢ is zero and that its variance is a finite constant.
4. The central limit theorem provides the theoretical justification for the normality assumption. According to the central limit theorem, if x₁, x₂, …, xₙ is a random sample with E(xᵢ) = μ and var(xᵢ) = σ², then the distribution of the random variable Zₙ = √n(x̄ − μ)/σ converges to the standard normal N(0, 1) as n increases to infinity.
5. When we assume that disturbance term is normally distributed, the OLS estimators are
best unbiased estimators (BUE).
6. The OLS estimators under the normality assumption follow well-known probability distributions. The OLS estimators of the intercept and the slope are themselves normally distributed, and the OLS estimator of the variance of Uᵢ is related to the chi-square distribution.
7. The normal distribution is the most popular probability distribution and has a bell-
shaped picture. Other important distributions include Chi-square distribution, Student’s t
distribution, F-distribution, the Bernoulli distribution, binomial distribution and Poisson
distribution.
Module Tag
BSE_P8_M6
TABLE OF CONTENTS
1. Learning Outcomes
2. Introduction
3. What does this hypothesis mean and imply?
4. One sided tests
5. Summary
1. Learning Outcomes
After reading this module, the learning outcomes are such that the students will be able to:
2. Introduction
Hypothesis Testing
Fitting the regression line is only the first and a very small step in econometric analysis. In
applied economics, we are generally interested in testing the hypothesis about some hypothesized
value of the population parameter. Let’s say we have a random sample 𝑥1 , 𝑥2 , … … , 𝑥𝑛 of a
random variable X with PDF 𝑓(𝑥; 𝜇 ). Regression analysis helps us to obtain the estimate of 𝜇,
say 𝜇̂ . Hypothesis testing implies deciding whether 𝜇̂ is compatible with some hypothesized
value𝜇0 . To set up a hypothesis test, we formally state the hypothesis as:
𝐻0 : 𝜇 = 𝜇0
𝐻1 : 𝜇 ≠ 𝜇0
Where 𝐻0 is called the null hypothesis and 𝐻1 is called the alternate hypothesis.
Let us start with a random variable X i.e. 𝑋~𝑁(𝜇, 𝜎 2 ) Suppose we define our null hypothesis to
be
𝐻0 : 𝜇 = 𝜇0
𝐻1 : 𝜇 ≠ 𝜇0
To proceed, the first step is to draw a sample from X and calculate its mean, X̄. If the null hypothesis is true, the value of X̄ obtained from repeated samples will be normally distributed with mean μ₀ and variance σ²/n. Such a distribution, with the null being true, is shown below:
If μ̂ is a good estimator of μ, then it will take a value close to μ, i.e. μ̂ − μ will be small. If the null hypothesis is true, then μ̂ − μ₀ = (μ̂ − μ) + (μ − μ₀) should be small, as μ̂ − μ is small and the second term is zero. If the alternate hypothesis is true, then μ̂ − μ₀ = (μ̂ − μ) + (μ − μ₀) should be large, as μ̂ − μ is small but μ − μ₀ ≠ 0. What counts as small or large depends upon the distribution of the estimator.
In the model set above, we don’t expect 𝑋̅ to be exactly equal to 𝜇0 . There is no reason to deny
the possibility but the chances are rare. If the mean is far off from𝜇0 , there are two possibilities;
either reject the null or do not reject the null. Either way, the decision will contain some element
of error i.e. rejecting a true null hypothesis (Type 1) or accepting a false null hypothesis (Type 2).
There is no fool proof way of deciding.
If the probability of obtaining a mean so far from μ₀ is less than the level of significance, we conclude that the null hypothesis should be rejected. For example, suppose that probability is less than 5%, i.e. X̄ lies in the upper or lower 2.5% tail (as shown in the figure). The probability of the mean being more than 1.96 standard deviations away from μ₀ is then 5%.
Source: Dougherty
The figure suggests that the null would be rejected if X̄ lies in the shaded area, i.e. if
X̄ > μ₀ + 1.96 sd(X̄), or
X̄ < μ₀ − 1.96 sd(X̄)
Equivalently,
(X̄ − μ₀)/sd(X̄) > 1.96, or
(X̄ − μ₀)/sd(X̄) < −1.96
Admittedly, we can term the LHS as z statistic. Therefore, we reject the null hypothesis when
|𝑧|>1.96.
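The decision rule can be written out directly. The sketch below assumes a hypothetical sample and a known population standard deviation, neither of which comes from the module:

import numpy as np
from scipy.stats import norm

x = np.array([12.1, 11.4, 13.2, 12.8, 11.9, 12.5, 13.0, 12.2])  # hypothetical sample
mu0, sigma = 12.0, 0.6                                          # assumed mu0 and known sd

z = (x.mean() - mu0) / (sigma / np.sqrt(len(x)))
critical = norm.ppf(0.975)          # 1.96 for a 5% two-sided test
print(z, abs(z) > critical)         # reject H0 if |z| > 1.96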
As mentioned before, the decision rule established above is not flawless. Let us go back to our
initial condition of 𝐻𝑜 being true. There is 5% probability that 𝑋̅ will lie far away from 𝜇𝑜 in the
rejection region. It also means that there is 5% probability of a type 1 error i.e. probability of
rejecting a true null hypothesis is 5%.[1] The researcher has the liberty to set the acceptable risk of a type 1 error and accordingly look up the critical value.
To sum up, there are three possible outcomes:
• Correct decision
• Rejecting a true hypothesis– Type I error.
• Accepting a false hypothesis– Type II error.
To be able to perform the test, we need additional knowledge of the sampling distributions of the estimators. It would be impossible to perform hypothesis testing without this knowledge, a prerequisite for which is the assumption that the error term is normally distributed.
Given the following model,
Yᵢ = α + βXᵢ + Uᵢ
we assume that Uᵢ follows a normal distribution with mean zero and variance σ², i.e. N(0, σ²). The principal reason behind this assumption lies in the Central Limit Theorem (CLT).[2]
Referring back to the basics, the error term is defined as all those factors that affect Y but are not included in the model, due to non-availability of data, omission, etc. Supposing these factors are random, U represents the sum of many random variables. Hence, by applying the CLT, we can reasonably impose the normality assumption on the error term. Thus, U ~ N(0, σ²).
From this, we derive the probability distributions of the estimators of α and β. The estimators are in fact linear functions of U.[3] Applying the property of the normal distribution that any linear function of a normally distributed variable is also normally distributed, it can be deduced that
𝛼̂~𝑁(𝛼, 𝜎𝛼̂ 2)
𝛽̂ ~𝑁(𝛽, 𝜎𝛽̂ 2 )
With this background, we now set the foundation for hypothesis testing. Suppose we wish to test something about β; as already mentioned, β̂ follows a normal distribution. By standardizing it, we get
Z = (β̂ − β)/σ_β̂ ~ N(0, 1)
Since σ² is not known, replacing it by its sample estimator, we get a new variable,
[1] A similar claim can be made for the 1% level of significance. If a coefficient is significant at 1%, it will be significant at 5% as well; however, the converse may not be true.
[2] Gujarati and Porter (2010) define the CLT in the following way: "If there is a large number of independent and identically distributed random variables, then, with few exceptions, the distribution of their sum tends to be a normal distribution as the number of such variables increases indefinitely."
[3] See Know More section 1.
(β̂ − β)/se(β̂) ~ t_{n−2}
Hence, we use the t distribution to test the null hypothesis.[4] The calculated t has a different sampling distribution under the null and under the alternate hypothesis, being larger (in absolute value) under the latter. A high t value is therefore consistent with the alternate hypothesis and would lead to rejection of the null hypothesis.
Example 1: Consider a regression of consumption expenditure on income based on 10 observations. The OLS calculations give
β̂ = Σ(x_t − x̄)(y_t − ȳ) / Σ(x_t − x̄)² = 0.455
α̂ = ȳ − β̂x̄ = 35.72
σ̂² = Σê_t² / (n − 2) = 98.98
var(β̂) = σ̂² / Σ(x_t − x̄)² = 0.0030
[4] If the difference between β̂ and β is small, the t value will be small. If β̂ = β, t will be zero, implying that the null hypothesis will not be rejected. Thus, as the difference between β̂ and β increases, the likelihood of rejecting the null hypothesis increases.
After running through the necessary calculations, the next step is to test whether income and expenditure are related, i.e. the null hypothesis is as follows:
H0: β = 0
Ha: β ≠ 0
Test statistic: t = (β̂ − β)/se(β̂) = (β̂ − 0)/se(β̂) = 0.455/0.0548 = 8.30
If the level of significance is 5%, the next step is to look up the critical value at that level from the t table, i.e. t_critical at 8 degrees of freedom = 2.306. The absolute value of t exceeds the critical value, which implies that we can reject the null hypothesis, i.e. β is significantly different from zero. In this case the test is statistically significant, hence we can reject the null. It simply means that the probability that the difference between β̂ and zero is due to mere chance is less than the level of significance, and thus we can reject the null hypothesis. On the contrary, if the test is statistically insignificant, the probability that the difference between the estimator and the hypothesized value is due to chance is more than the level of significance, and thus we cannot reject the null (Gujarati and Porter, 2010).
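The arithmetic of this test is easy to reproduce. The sketch below re-uses only the numbers quoted in the text (slope 0.455, variance 0.0030, 8 degrees of freedom):

import numpy as np
from scipy.stats import t

beta_hat, var_beta, df = 0.455, 0.0030, 8
t_stat = (beta_hat - 0) / np.sqrt(var_beta)
t_crit = t.ppf(0.975, df)            # two-sided 5% critical value, 2.306

print(round(t_stat, 2), round(t_crit, 3))   # about 8.3 and 2.306
print(abs(t_stat) > t_crit)                 # True: reject H0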
Let us look at another example[5] wherein hourly earnings are a function of years of schooling.
Example 2:
In a similar manner, let us test if the level of schooling affects earnings. Hence, the null
hypothesis is set as
𝐻0 : 𝛽 = 0
𝐻𝑎 : 𝛽 ≠ 0
[5] Example taken from Christopher Dougherty, Introduction to Econometrics, 4th edition.
The intention is to reject the null hypothesis, as only then can a relation between earnings and schooling be established. The t statistic can easily be calculated from the regression output:[6]
t = 2.45/0.23 = 10.65
t_cal > t_critical (1.96)
The result decides in favour of rejecting the null hypothesis and concluding that earnings are significantly affected by the level of schooling. The column next to the t statistic in the regression output above is also useful for testing the significance of the coefficients. This p value, or probability value, is the exact probability of committing a type 1 error under the null. The level of significance[7] is the largest probability of making a type 1 error that we are willing to tolerate, whereas the p value is the smallest level of significance at which the null can be rejected, given the t statistic. If this p value happens to be smaller than the level of significance, then, because the probability of making a type 1 error is very small, we can safely reject the null hypothesis.[8] This method has an edge over the earlier one as it allows us to know the exact probability of making a type 1 error.
In our example, the p value for the schooling coefficient is zero to the reported number of decimals, i.e. the exact probability of making a type 1 error here is practically zero, and thus the coefficient is significant at all conventional levels.
Example 3: Suppose in our example 1 we want to test whether the marginal propensity to consume is equal to 1. Accordingly, the hypothesis is set as
H0: β = 1
Ha: β < 1
t = (β̂ − 1)/√var(β̂) = (0.455 − 1)/√0.0030 = −9.96
t_critical at 8 degrees of freedom (one-sided, 5%) = −1.860
Since t_cal < t_critical, i.e. |t_cal| > |t_critical|, we reject the null in favour of the alternate hypothesis.
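The one-sided version can be reproduced in the same way, again using only the numbers quoted in the text:

import numpy as np
from scipy.stats import t

beta_hat, var_beta, df = 0.455, 0.0030, 8
t_stat = (beta_hat - 1) / np.sqrt(var_beta)
t_crit = t.ppf(0.05, df)              # lower-tail 5% critical value, about -1.860

print(round(t_stat, 2), round(t_crit, 3))   # about -9.95 and -1.860
print(t_stat < t_crit)                      # True: reject H0 in favour of beta < 1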
[6] The t ratio can also be read directly from the t-stat column.
[7] It is the area to the right of the critical value.
[8] Please refer to Know More section 2 for further understanding of Type 1 and Type 2 errors.
4. One sided tests
So far the alternate hypothesis was set as μ ≠ μ0. Suppose instead that, if the null is false, μ must be equal to some specific value μ1. For simplicity, let μ1 > μ0.
Therefore,
𝐻0 : 𝜇 = 𝜇0
𝐻1 : 𝜇 = 𝜇1
At 5% level of significance, if 𝑋̅ lies in the upper or lower 2.5% tail, we reject the null. Given the
assumption, 𝑋̅ should lie in the upper tail. This would also be compatible with the alternate
hypothesis, if it is true. On the contrary, if X̄ lies in the lower-tail rejection region, the test would suggest rejecting the null, although the probability of X̄ lying there should be zero (given the assumption). Thus, in such a case it would be logical to reject H0 only when X̄ lies in the upper tail, as shown in figure 3 below:
We could still perform a 5% test by simply enlarging the upper-tail rejection region to that extent. Note that the alternate hypothesis could also take the form μ > μ0 or μ < μ0. These are clearly one-sided tests, and it would be appropriate to consider the right and left rejection regions respectively in drawing any conclusion. Thus, the principal reason for applying a one-sided test should depend solely on the relevant question, theory and economic sense, as in example 3.
4. If the computed t exceeds the critical value t*, reject the null hypothesis. This implies that μ is significantly different from μ0.
5. The null would also be rejected if the p value < α.
Let us conclude this module by looking at an example[9] using both two-sided and one-sided tests. The rate of price inflation is regressed on the rate of wage inflation using 20 observations (18 degrees of freedom), giving a slope estimate β̂ = 0.90 with standard errors se(α̂) = 0.10 and se(β̂) = 0.05. Suppose we want to test whether the rate of price inflation is equal to the rate of wage inflation, that is, whether the slope is one. Thus, we set the hypothesis as H0: β = 1 and H1: β ≠ 1. This clearly is a two-sided test. Calculating the t statistic,
t = (0.90 − 1)/0.05 = −2
The critical value at 5% level of significance with 18 degrees of freedom, 𝑡𝑐 = 2.1. Therefore,
|𝑡| < 𝑡𝑐 , according to the rule, we do not reject the null hypothesis. The estimated parameter has
a coefficient less than the hypothesized value but according to our test conclusion, the difference
is not significant. A two-sided test does not reject the null. Nevertheless, there could be a
possibility that the rate of price inflation is less than the rate of wage inflation. One plausible reason for this to happen could be an increase in productivity, which would allow wages to rise faster than prices, i.e. make the rate of price inflation less than the rate of wage inflation.
To reiterate the difference between the two tests, let us perform the one-sided test on the above
example. Our new hypothesis becomes, 𝐻0 : 𝛽 = 1 and 𝐻1 : 𝛽 < 1. With no other changes in the
estimated model, t statistic remains at -2. The critical value at 5% level of significance with 18
degrees of freedom now becomes, 𝑡𝑐 = 1.73. As |𝑡| > 𝑡𝑐 , we can reject the null hypothesis and
conclude that coefficient of wage inflation is less than one i.e. rate of price inflation is less than
rate of wage inflation. Therefore, the above result can influence us to say that one-sided test has
an edge over two-sided. However, one should refrain from making such statements as even
though the possibility of 𝛽 > 1 can be excluded, the possibility of 𝛽 = 1 cannot be excluded.
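The contrast between the two critical values is easy to verify with scipy:

from scipy.stats import t

t_stat = (0.90 - 1) / 0.05                  # -2.0
two_sided_crit = t.ppf(0.975, 18)           # about 2.101
one_sided_crit = t.ppf(0.05, 18)            # about -1.734

print(abs(t_stat) > two_sided_crit)         # False: two-sided test does not reject
print(t_stat < one_sided_crit)              # True: one-sided test rejects H0: beta = 1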
The statistical significance of a parameter depends fully on the t statistic, and the economic
significance depends entirely on the magnitude of the regression coefficient and its sign. While
concluding about the estimated model, one should be careful not to emphasize too much on
[9] The example has been extracted from Christopher Dougherty, Introduction to Econometrics, 4th edition, chapter 2, p. 135.
either. A variable can be economically significant but not statistically. There could also be a case
where a variable could be statistically significant but not much relevant to the estimated model.
The researcher has to be careful to find a middle ground between the two before presenting her concluding remarks.
5. Summary
Hypothesis testing is a vital part of econometric analysis.
The tests explained in this module are associated with the simple linear regression model and form the foundation for the more general multiple linear regression model.
A significant result does not necessarily mean an important one; it refers only to statistical significance.
Apart from having a sound knowledge of terminology such as type 1 error, type 2 error, and one-sided and two-sided tests, it is also important to know how to present the results.
6. Appendix
1. β̂ can be decomposed into two components, random and non-random, i.e. it can be expressed as a linear function of U. Following is the mathematical proof:
β̂ = [β Σᵢ(Xᵢ − X̄)² + Σᵢ(Xᵢ − X̄)(uᵢ − ū)] / Σᵢ(Xᵢ − X̄)²
  = β + Σᵢ(Xᵢ − X̄)uᵢ / Σᵢ(Xᵢ − X̄)²
since Σᵢ(Xᵢ − X̄)ū = 0.
2. If we test at the 5% level of significance (two-sided test), the risk of a type 1 error is 5%. Now suppose that the null is false and the alternate hypothesis is true. In figure 2 above, if X̄ lies in the acceptance region, we do not reject the null; hence we commit a type 2 error, i.e. we accept a false null hypothesis. The total probability of committing a type 2 error is marked below:
Ideally, the power of test should be high. Power of test is defined as the probability of
rejecting the null hypothesis when it is false. It is also defined as 1-Probability of type 2
error. Now, instead of 5%, what if we apply a 1% test. As discussed before, reducing the
level of significance is equivalent to reducing the probability of type 1 error.
Consequently, the rejection region would shrink.
As we can see from figure 5, reducing the probability of a type 1 error increases the probability of a type 2 error. Thus a trade-off exists, which makes clear that the decision cannot be foolproof.
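The trade-off can be made concrete with a small calculation of the power of a two-sided z test under an assumed alternative; all the numbers below are illustrative assumptions, not values from the module:

import numpy as np
from scipy.stats import norm

mu0, mu1, sigma, n = 0.0, 0.5, 1.0, 25    # assumed null mean, true mean, sd, sample size
se = sigma / np.sqrt(n)

for alpha in (0.05, 0.01):
    c = norm.ppf(1 - alpha / 2)                          # two-sided critical value
    # probability that the test statistic falls in the rejection region when mu = mu1
    power = norm.sf(c - (mu1 - mu0) / se) + norm.cdf(-c - (mu1 - mu0) / se)
    print(alpha, round(power, 3), round(1 - power, 3))   # power and P(type 2 error)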
Module Tag
BSE_P8_M7
TABLE OF CONTENTS
1. Learning Outcomes
2. Introduction
3. Estimation of Regression Coefficient
4. Interpretation of Estimated Parameters
5. Classical Linear Assumptions Revisited
6. Variance of Estimators
7. Goodness of fit
8. Summary
9. Appendix
1. Learning Outcomes
After reading this module, the students will be able to:
Understand the concept of multiple linear regression model and differentiate it from the
simple linear regression model
Estimate the parameters using ordinary least squares with a different formula
Interpret the regression coefficients
Understand the difference between unadjusted and adjusted 𝑅 2
2. Introduction
Until now we have been dealing with simple linear regression, where the dependent variable (Y) is related to only one explanatory variable (X). However, it is seldom the case that the variation in Y can be explained by a single explanatory variable alone; certainly there will be other variables affecting Y. In the earnings-schooling example from the previous module, we saw that the level of schooling positively affects hourly earnings. Nevertheless, there are variables like parental education, work experience and age that also affect earnings. The multiple regression model is therefore generally preferred to the simple regression model. However, the analysis has to be carried out carefully, as there are dangers in including an irrelevant variable or excluding a relevant one; either would lead to specification bias (the topic is dealt with later in the course).
Therefore, a model where a dependent variable depends on several independent variables is known as a multiple regression model, i.e. a model with k + 1 parameters[1] written as below, where i indexes the observations:
Yᵢ = β₁ + β₂Xᵢ₂ + β₃Xᵢ₃ + ⋯ + Uᵢ
Here β₁ is the intercept and the remaining β's are the slope parameters.
Certainly, there is no upper limit on the number of independent variables (although, as mentioned above, there is a cost). There is a lower limit, though: n ≥ k, i.e. the number of observations must be at least as large as the number of parameters. It is now time to derive the regression coefficients; we shall deal with their interpretation afterwards.
[1] So, a total of k + 1 parameters: the k slope coefficients plus the intercept.
As in the simple regression model, the next step is to find the best linear estimators. Clearly, there
are three unknown parameters, 𝛽1 , 𝛽2 𝑎𝑛𝑑 𝛽3 . Applying OLS, the procedure begins with
minimizing the residual sum of squares,
eᵢ = Yᵢ − Ŷᵢ ……(4)
RSS = Σᵢ eᵢ² = Σᵢ (Yᵢ − β̂₁ − β̂₂Xᵢ₂ − β̂₃Xᵢ₃)² ……(5)
The first-order conditions are
dRSS/dβ̂₁ = −2 Σᵢ (Yᵢ − β̂₁ − β̂₂Xᵢ₂ − β̂₃Xᵢ₃) = 0 ……(6)
dRSS/dβ̂₂ = −2 Σᵢ Xᵢ₂ (Yᵢ − β̂₁ − β̂₂Xᵢ₂ − β̂₃Xᵢ₃) = 0 ……(7)
dRSS/dβ̂₃ = −2 Σᵢ Xᵢ₃ (Yᵢ − β̂₁ − β̂₂Xᵢ₂ − β̂₃Xᵢ₃) = 0 ……(8)
There are three normal equations and three unknowns; therefore we can easily derive the estimators. Equation 6 can be simplified as
ΣYᵢ = nβ̂₁ + β̂₂ΣXᵢ₂ + β̂₃ΣXᵢ₃, so that β̂₁ = Ȳ − β̂₂X̄₂ − β̂₃X̄₃
Equations 7 and 8 can likewise be written as
ΣXᵢ₂Yᵢ = β̂₁ΣXᵢ₂ + β̂₂ΣXᵢ₂² + β̂₃ΣXᵢ₂Xᵢ₃
ΣXᵢ₃Yᵢ = β̂₁ΣXᵢ₃ + β̂₂ΣXᵢ₂Xᵢ₃ + β̂₃ΣXᵢ₃²
These are the two normal equations, and solving them directly could be a tedious job. For simplicity, we can transform the model into deviations from the means and express the variables in lower-case letters. Taking the mean of equation 2 we get
Ȳ = β₁ + β₂X̄₂ + β₃X̄₃ + Ū ……(13)
Subtracting (13) from (2) gives us the model in deviation form,
yᵢ = β₂xᵢ₂ + β₃xᵢ₃ + (Uᵢ − Ū)
where lower-case letters denote deviations from means. Writing s₂₂ = Σxᵢ₂², s₃₃ = Σxᵢ₃², s₂₃ = Σxᵢ₂xᵢ₃, s_y2 = Σyᵢxᵢ₂ and s_y3 = Σyᵢxᵢ₃, the normal equations in deviation form become
s_y2 = β̂₂s₂₂ + β̂₃s₂₃ and s_y3 = β̂₂s₂₃ + β̂₃s₃₃
From the first of these,
β̂₂ = (s_y2 − β̂₃s₂₃)/s₂₂ ……(21)
Substituting this into the second normal equation,
[(s_y2 − β̂₃s₂₃)/s₂₂] s₂₃ + β̂₃s₃₃ = s_y3
β̂₃ [s₃₃ − s₂₃²/s₂₂] = s_y3 − s_y2 s₂₃/s₂₂
In estimating the parameters, the procedure is the same as in the simple linear regression model. However, the derivation of the slope parameters is not as simple, and it becomes increasingly complicated in the general k-variable case. The analysis is then usually carried out using matrix algebra.[2]
4. Interpretation of Estimated Parameters
Suppose the estimated regression is
Yᵢ = 10 + 6Xᵢ₂ + 2Xᵢ₃
Holding Xᵢ₃ constant at, say, 5, we get
Yᵢ = 10 + 6Xᵢ₂ + 2(5)
= 10 + 6Xᵢ₂ + 10
[2] Please refer to the Know More section for the matrix estimation of the slope parameters.
= 20 + 6𝑋𝑖2
β₂ = 6 implies that the average value of Y increases by 6 with a unit increase in X₂, holding X₃ constant. The interpretation remains the same irrespective of the value at which X₃ is held constant; that value does not affect the slope parameter, as it is absorbed into the intercept. Now, holding X₂ constant at 2, we get
Yᵢ = 10 + 6(2) + 2Xᵢ₃
= 10 + 12 + 2Xᵢ₃ = 22 + 2Xᵢ₃
β₃ = 2 implies that the average value of Y increases by 2 with a unit increase in X₃, holding X₂ constant. Therefore, multiple regression allows us to isolate the partial effect of each explanatory variable on the dependent variable. Let us continue with example 2 of the previous module, except that we now include an extra explanatory variable, experience. The estimated model[3] becomes
Example 1:
Fitted EARNINGS = −26.49 + 2.68 S + 0.56 EXP
The intercept is negative, implying that when the level of education/schooling and experience are both zero, the average hourly earnings of an individual would be negative. This does not make economic sense and can safely be ignored.[4] The coefficient of schooling suggests that with an extra year of schooling, the average hourly wage increases by $2.68, keeping experience constant. Similarly, the coefficient of experience suggests that the average hourly wage increases by $0.56 with an additional year of experience, keeping schooling constant.
Example 2:[5]
Let us look at another estimated example, where we want to understand what influences property prices (in dollars). We have three explanatory variables: SFT (square feet), BEDR (bedrooms) and BATHR (bathrooms). Before attempting to interpret the results, one should look at the signs of the coefficients. Notice that the signs of both BEDR and BATHR are negative, contrary to our expectations. Taken at face value, this means that with an additional bedroom/bathroom the property price would go down by $20/$14. One would expect the opposite: an extra room should increase the price. This emphasizes the importance of holding the other variables constant, or other things being equal. The interpretation of the coefficients is meaningful only when we include 'other things being equal'. Therefore, keeping other things equal, an extra bathroom leads to a fall in the average price of the property by $14. This is a reasonable argument as, by keeping SFT and BEDR the
[3] Example from Christopher Dougherty.
[4] Sometimes a meaningless intercept is suggestive of model misspecification.
[5] Example from Ramu Ramanathan, chapter 4, p. 146.
same, an extra bathroom means splitting up the given area to make room for it, hence smaller rooms commanding a lower price.
5. Classical Linear Assumptions Revisited
Assumption 1: Random sampling
This is the same as in simple linear regression analysis. In the above model we assume that the sample contains n random observations. It is important to assume that none of the explanatory variables is correlated with the error term, especially if the Xs are random; the Xs can also be treated as given, i.e. fixed in repeated samples. If such a correlation is present, we encounter a problem known as the endogeneity problem, wherein the estimates of the regression coefficients are biased. This leads us to our next assumption.
Assumption 2: C(X, U) = 0
This holds in the case where the Xs are non-random. A simple proof can help comprehend it:
Cov(X, U) = E(XU) − E(X)E(U) = XE(U) − XE(U) = 0[6]
Assumption 3: E(u|X) = 0
Even though the above assumption states that the expectation of the error term given X is zero, the law of iterated expectations[7] implies that the same is true of the unconditional mean as well, i.e. E(u) = 0. This assumption helps in interpreting the deterministic part of the model.
Assumption 4: No perfect multicollinearity
Multicollinearity[8] is defined as a linear relation between any two explanatory variables. This assumption is the new addition compared with simple linear regression. For an estimator to be BLUE, it is
[6] The expectation of a constant is the constant itself, E(a) = a.
[7] Wooldridge, "Introductory Econometrics: A Modern Approach", 4th edition, Glossary, p. 841.
required that there exists no perfect relationship between any two explanatory variables. However, if at all there is one such relationship, then it becomes difficult to estimate the population parameters and to differentiate the partial effect of an explanatory variable on the dependent variable.
Assumptions 1-4 are sufficient conditions for the estimated parameters to be unbiased. By now we should know that the regression coefficient can be decomposed into two components,
β̂ = β + Σᵢ aᵢuᵢ, where aᵢ = (Xᵢ − X̄) / Σᵢ(Xᵢ − X̄)² ……(24)
This is the case with only one explanatory variable. The second term in equation (24) becomes more complicated in the multiple linear model, i.e. with more than one explanatory variable. However, matrix algebra helps us to solve for and prove the properties of the estimated parameters.[9] The property of unbiasedness is not affected even in the presence of multicollinearity; any linear relationship among the explanatory variables only affects the precision of the estimates.
Assumption 5: Homoscedasticity
This is to facilitate hypothesis testing. We shall revisit this in the next module.
Assumptions 1-6 ensure that estimators from OLS are unbiased, efficient and consistent i.e. they
are best linear unbiased estimators or BLUE. Gauss Markov theorem states that among all linear
unbiased estimates, OLS estimators are the most efficient i.e. have minimum variance. Before
proving the theorem, let us find out the variance and standard error of the parameters.
To find out the variance we need the homoscedasticity assumption as it simplifies the derivation.
However, we do not show the proof here and simply state the formula as in Gujarati & Porter.
Refer to the Know More section for its derivation using matrix algebra. The formula is as below:
[8] This topic is dealt with in detail in later modules.
[9] For proofs, please refer to the Know More section.
var(β̂₁) = [ 1/n + (X̄₂² Σxᵢ₃² + X̄₃² Σxᵢ₂² − 2X̄₂X̄₃ Σxᵢ₂xᵢ₃) / (Σxᵢ₂² Σxᵢ₃² − (Σxᵢ₂xᵢ₃)²) ] σᵤ²
var(β̂₂) = σᵤ² Σxᵢ₃² / [Σxᵢ₂² Σxᵢ₃² − (Σxᵢ₂xᵢ₃)²]
var(β̂₃) = σᵤ² Σxᵢ₂² / [Σxᵢ₂² Σxᵢ₃² − (Σxᵢ₂xᵢ₃)²]
To compute these variances we need the variance of the error term, σᵤ². Its estimate is
σ̂ᵤ² = Σeᵢ² / (n − k)
where eᵢ is the residual, n is the number of observations and k is the number of parameters (3 in this case).
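A short numerical sketch of these formulas is given below. The three data arrays are hypothetical; the point is only to show how the deviation sums map onto the variance expressions above:

import numpy as np

X2 = np.array([2.0, 3.0, 5.0, 6.0, 8.0, 9.0, 11.0, 12.0])    # hypothetical
X3 = np.array([1.0, 2.0, 2.0, 4.0, 5.0, 5.0, 7.0, 8.0])      # hypothetical
Y  = np.array([4.1, 5.0, 7.2, 8.3, 10.1, 11.0, 13.4, 14.2])  # hypothetical
n, k = len(Y), 3

x2, x3, y = X2 - X2.mean(), X3 - X3.mean(), Y - Y.mean()
D = np.sum(x2**2) * np.sum(x3**2) - np.sum(x2 * x3)**2       # common denominator

b2 = (np.sum(y * x2) * np.sum(x3**2) - np.sum(y * x3) * np.sum(x2 * x3)) / D
b3 = (np.sum(y * x3) * np.sum(x2**2) - np.sum(y * x2) * np.sum(x2 * x3)) / D
b1 = Y.mean() - b2 * X2.mean() - b3 * X3.mean()

e = Y - b1 - b2 * X2 - b3 * X3
sigma2_hat = np.sum(e**2) / (n - k)                          # error variance estimate

var_b2 = sigma2_hat * np.sum(x3**2) / D
var_b3 = sigma2_hat * np.sum(x2**2) / D
print(var_b2, var_b3)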
7. Goodness of Fit
In the simple regression model, R² was defined as the proportion of the variation in the dependent variable explained by the variation in the independent variable, i.e. R² = ESS/TSS, where ESS = explained sum of squares and TSS = total sum of squares. In the multiple-regression case, ESS is the variation explained by all the independent variables taken together. Thus we can easily extend the concept to multiple linear regression; it is then called the multiple coefficient of determination.
Yᵢ = β₁ + β₂Xᵢ₂ + β₃Xᵢ₃ + Uᵢ
yᵢ = β̂₂xᵢ₂ + β̂₃xᵢ₃ + eᵢ
eᵢ = yᵢ − β̂₂xᵢ₂ − β̂₃xᵢ₃
Σeᵢ² = Σeᵢeᵢ
= Σeᵢ(yᵢ − β̂₂xᵢ₂ − β̂₃xᵢ₃)
= Σeᵢyᵢ − β̂₂Σeᵢxᵢ₂ − β̂₃Σeᵢxᵢ₃
= Σeᵢyᵢ (since Σeᵢxᵢ₂ = Σeᵢxᵢ₃ = 0)[10]
[10] The derivation is taken from Gujarati & Porter, Essentials of Econometrics, chapter 4, p. 129.
Expanding Σeᵢyᵢ in the same way gives Σeᵢyᵢ = Σyᵢ² − β̂₂Σyᵢxᵢ₂ − β̂₃Σyᵢxᵢ₃ = TSS − ESS. Therefore,
RSS = TSS − ESS
TSS = ESS + RSS, and R² = ESS/TSS = 1 − RSS/TSS.
Let us illustrate the above with an example using the Gretl software.[11] We start with a simple regression model in which education is assumed to depend only on aptitude. Below is the regression output.
The results show that a one-point increase in aptitude increases the education level by 0.14 years. The p value of the coefficient is very small, implying that the coefficient is statistically significant. R² is 0.33, which means that 33% of the variation in education is explained by a student's aptitude. This number is quite low, which is indicative of the fact that some important variables, both observable and unobservable, are missing from the model. Now let us add another variable that could explain the dependent variable.
[11] Gretl is freely downloadable software.
Mother’s education also has a role to play in influencing the level of an individual’s education.
This is clearly seen in the results above. It is also seen that adding another variable has made a
difference to the overall fit. 𝑅 2has increased to 0.35 i.e. 35% of the variation in the education
variable is explained jointly by both aptitude and mother’s education. Adding another variable
would further influence the overall fit. In model 3, we add another variable, father’s education.
Although there is not a significant increase in𝑅 2 , it is enough to claim that as we keep on adding
more explanatory variables, the overall fit improves. Addition of any variable, whether related or
not related to the model, will always increase𝑅 2.
Although an increased R² may be indicative of a better fit, chasing R² in this way is not desirable. To discourage such a practice, a different measure is used in multiple linear regression for analyzing goodness of fit: the adjusted R². This gives a more accurate picture of the variability explained by the model, because it adjusts for the loss in degrees of freedom that occurs when an extra variable is included. The formula for calculating it is as follows:
R̄² = 1 − [RSS/(n − k)] / [TSS/(n − 1)]
   = 1 − (1 − R²)(n − 1)/(n − k)
Note that:
1. R̄² ≤ R². When R² = 1, R̄² = 1; this follows directly from the formula. When R² < 1, since k[12] > 1 we have (n − k) < (n − 1), so R̄² < R². Therefore R̄² ≤ R² always.
2. Even though R² can never be negative, R̄² can fall below zero.
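The relation between the two measures is easy to verify numerically; the sums of squares below are hypothetical figures, not the Gretl output discussed above:

# Hypothetical sample size, number of parameters and sums of squares
n, k = 30, 3            # observations; parameters (intercept + 2 slopes)
TSS, RSS = 120.0, 78.0

R2 = 1 - RSS / TSS
R2_adj = 1 - (1 - R2) * (n - 1) / (n - k)
print(round(R2, 3), round(R2_adj, 3))   # the adjusted value is always <= R2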
High 𝑅̅ 2in no way implies statistical significance of the coefficients. Hypothesis testing of the
regression coefficients is probably the most important part of the econometric analysis, without
which the results will be ineffectual. We shall deal with this in detail in our next module.
8. Summary
The module is an extension of simple linear regression model with more than one
explanatory variable.
It shows how regression coefficients are estimated in a general case and how the
interpretation differs from simple linear model.
This module concludes with the concept of goodness of fit and adjusted 𝑅 2.
Multiple linear model best explains the reasons and consequences of violations of
classical linear assumptions.
9. Appendix
Estimating slope parameters becomes easy with matrix algebra. The k variable model can
be written as,
[y₁]   [1  x₁₂ …  x₁ₖ] [β₁]   [u₁]
[y₂] = [1  x₂₂ …  x₂ₖ] [β₂] + [u₂]
[ ⋮ ]   [⋮    ⋮   ⋱   ⋮ ] [ ⋮ ]   [ ⋮ ]
[yₙ]   [1  xₙ₂ …  xₙₖ] [βₖ]   [uₙ]
[12] Not including the intercept.
That is,
𝑦 = 𝑥𝛽 + 𝑢
OLS method would give us those parameters, which would minimize residual sum of
squares,
RSS = Σᵢ eᵢ², where e = y − Xβ̂
In matrix algebra,
RSS = e′e
= (y − Xβ̂)′(y − Xβ̂)
= (y′ − β̂′X′)(y − Xβ̂)
= y′y − y′Xβ̂ − β̂′X′y + β̂′X′Xβ̂
Therefore, dRSS/dβ̂ = −2X′y + 2X′Xβ̂ = 0
X′Xβ̂ = X′y
β̂ = (X′X)⁻¹X′y
Therefore, 𝑦̂ = 𝑥𝛽̂ is the fitted value.
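The matrix formula can be sketched directly in Python with simulated data (the coefficients and sample size below are arbitrary assumptions); in practice np.linalg.lstsq is numerically preferable to forming (X′X)⁻¹ explicitly:

import numpy as np

rng = np.random.default_rng(2)
n = 50
# design matrix with a column of ones, and two hypothetical regressors
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])           # assumed true parameters
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y      # (X'X)^{-1} X'y
y_fitted = X @ beta_hat                          # fitted values
print(beta_hat)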
1. Properties of the estimated parameter
b) Variance
c) σ̂²
The residuals can be written as e = y − Xβ̂ = Mu, where M = I − X(X′X)⁻¹X′ is a symmetric idempotent matrix.[13] Now,
e′e = (Mu)′Mu
= u′M′Mu
= u′Mu, since M is symmetric and idempotent
Using the trace together with its cyclical property,[14]
Tr(u′Mu) = Tr(Muu′)
Taking expectations, E[Tr(Muu′)] = Tr(M E(uu′)) = Tr(Mσ²I) = σ²Tr(M)
[13] An idempotent matrix is a square matrix which satisfies the condition M·M = M.
[14] The trace of a matrix product is invariant under cyclic permutations, e.g. Tr(XYZ) = Tr(YZX); however, Tr(XYZ) ≠ Tr(YXZ) in general.
We say that 𝛽̂ is BLUE when among the class of unbiased estimators, which are linear
functions of y, OLS estimator 𝛽̂ has the minimum variance. Here, we intend to prove this
using matrix algebra.
Let there be another estimator 𝛽̌ = KY, which is a linear function of Y. Let us assume K
to be fixed. So, 𝐸(𝛽̌ ) = 𝐸(𝐾𝑌)
= 𝐾𝐸(𝑌)
= 𝐾𝐸(𝑋𝛽 + 𝑈)
= 𝐾𝑋𝛽 + 𝐾𝐸(𝑈)
= 𝐾𝑋𝛽
For 𝛽̌to be unbiased, we need KX=I. A restriction, we need to impose. Supposing that it
is unbiased, to be the best linear estimator, the coefficient needs to have the lowest
variance.
Var(β̌) = E[(β̌ − β)(β̌ − β)′]
= E[KUU′K′]
= K E[UU′] K′
= σ²KK′
since β̌ = K(Xβ + U) = KXβ + KU = β + KU, and hence β̌ − β = KU.
By comparing Var(β̂) with Var(β̌) we can show that β̂ is indeed the best linear unbiased estimator. Proving the Gauss-Markov theorem requires only some simple manipulations. Write K as
K = D + (X′X)⁻¹X′, where D = K − (X′X)⁻¹X′
Then
DX = KX − (X′X)⁻¹X′X = I − I = 0
Now,
σ²KK′ = σ²[(D + (X′X)⁻¹X′)(D + (X′X)⁻¹X′)′]
= σ²[(D + (X′X)⁻¹X′)(D′ + X(X′X)⁻¹)]
= σ²[DD′ + DX(X′X)⁻¹ + (X′X)⁻¹X′D′ + (X′X)⁻¹X′X(X′X)⁻¹]
The second term in the above expression is zero because DX = 0, and the third term is zero because X′D′ = (DX)′. Thus the variance reduces to
Var(β̌) = σ²KK′ = σ²DD′ + σ²(X′X)⁻¹
Since Var(β̂) = σ²(X′X)⁻¹ and DD′ is positive semidefinite, clearly Var(β̌) ≥ Var(β̂). This proves our theorem.
TABLE OF CONTENTS
1. Learning outcomes
2. Introduction
3. Hypothesis testing
4. Hypothesis testing: Individual regression model
5. Testing overall significance of sample regression
6. Summary
1. Learning Outcomes
After studying this module, you shall be able to
2. Introduction
Hypothesis testing allows us to carry out inferences about population parameters using
data from a sample. In order to test a hypothesis in statistics, we must perform three
steps: 1) Formulate a null hypothesis and an alternative hypothesis on population
parameters. 2) Build a statistic to test the hypothesis made. 3) Define a decision rule to
reject or not to reject the null hypothesis.
3. Hypothesis testing
Before establishing how to formulate the null and alternative hypothesis, let us make the
distinction between simple hypotheses and composite hypotheses. The hypotheses that
are made through one or more equalities are called simple hypotheses. The hypotheses
are called composite when they are formulated using the operators "inequality", "greater
than" and "smaller than". It is very important to remark that hypothesis testing is always
about population parameters. Hypothesis testing implies making a decision, on the basis
of sample data, on whether to reject that certain restrictions are satisfied by the basic
assumed model. The restrictions we are going to test are known as the null hypothesis,
denoted by H0. Thus, null hypothesis is a statement on population parameters. An
alternative hypothesis, denoted by H1, will be our conclusion if the experimental test
indicates that H0 is false.
If we invoke the assumption that ui ∼ N (0, σ2), then we can use the t test to test a
hypothesis about any individual partial regression coefficient. To illustrate the mechanics,
consider the child mortality regression. Let us postulate that
H0: β3 = 0 and H1: β3 ≠ 0
The null hypothesis states that, with X2 (the female literacy rate) held constant, X3 (PGNP) has no (linear) influence on Y (child mortality). To test the null hypothesis, we use the t test. If the computed t value exceeds the critical t value at the chosen level of significance, we may reject the null hypothesis; otherwise, we may not reject it.
One does not have to assume a particular value of α to conduct hypothesis testing; one can simply use the p value. The interpretation of the p value (i.e., the exact level of significance) is that, if the null hypothesis were true, it gives the probability of obtaining a t value as large as or larger (in absolute terms) than the computed one. If this probability is very small, much smaller than the conventionally adopted α = 5%, the null hypothesis can be rejected.
This example provides us an opportunity to decide whether we want to use a one-tail or a two-tail t test. Since, a priori, child mortality and per capita GNP are expected to be negatively related, we should use a one-tail test. That is, our null and alternative hypotheses should be:
H0: β3 ≥ 0 and H1: β3 < 0
We can reject the null hypothesis on the basis of the one-tail t test in the present instance.
There is intimate connection between hypothesis testing and confidence interval
estimation. For our example, the 95% confidence interval for β3 is:
that is, the interval includes the true β3 coefficient with 95% confidence coefficient. Thus,
if 100 samples of size 64 are selected and 100 confidence intervals are constructed, we
expect 95 of them to contain the true population parameter β3. Since the interval does not
include the null-hypothesized value of zero, we can reject the null hypothesis that the true
β3 is zero with 95% confidence.
Thus, whether we use the t test of significance or confidence interval estimation, we reach the same conclusion. This should not be surprising in view of the close connection between confidence interval estimation and hypothesis testing. Following the procedure just described, we can test hypotheses about the other coefficients of the model.
Before moving on, remember that the t-testing procedure is based on the assumption that
the error term ui follows the normal distribution. Although we cannot directly observe ui,
we can observe their proxy, the ûi , that is, the residuals. The claim for normality is
usually made on the basis of the Central Limit Theorem (CLT), but this is restrictive in
some cases. That is to say, normality cannot always be assumed. In any application,
whether normality of u can be assumed is really an empirical matter. It is often the case
that using a transformation, i.e. taking logs, yields a distribution that is closer to
normality, which is easy to handle from a mathematical point of view. Large samples will
allow us to drop normality without affecting the results too much.
5. Testing overall significance of sample regression
Now if H0: β1 = β2 = ⋯ = βk = 0 is true, SSR/σ² is a chi-square random variable with k degrees of freedom. Note that the number of degrees of freedom for this chi-square random variable equals the number of regressor variables in the model. We can also show that SSE/σ² is a chi-square random variable with n − p degrees of freedom, and that SSE and SSR are independent. The test statistic for H0: β1 = β2 = ⋯ = βk = 0 is
F0 = (SSR/k) / (SSE/(n − p))
We should reject H0 if the computed value of the test statistic, F0, is greater than f(α, k, n − p). The procedure is usually summarized in an analysis of variance table as below.
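As a sketch of the mechanics (not of the example discussed later), the F statistic and its critical value can be computed as follows, with hypothetical sums of squares:

from scipy.stats import f

n, k = 64, 2                 # hypothetical sample size and number of regressors
p = k + 1                    # number of estimated parameters
SSR, SSE = 250.0, 180.0      # hypothetical regression and error sums of squares

F0 = (SSR / k) / (SSE / (n - p))
F_crit = f.ppf(0.95, dfn=k, dfd=n - p)
print(round(F0, 2), round(F_crit, 2), F0 > F_crit)   # reject H0 if F0 > F_crit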
So far we have seen how to use the F statistic to test several restrictions in the model, but
it can be used to test a single restriction. In this case, we can choose between using the F
statistic or the t statistic to carry out a two-tail test. The conclusions would nevertheless
be exactly the same. But what is the relationship between an F with one degree of freedom in the numerator (to test a single restriction) and a t? It can be shown that
F(1, df) = t²(df)
i.e. an F statistic with one numerator degree of freedom equals the square of the corresponding t statistic. This fact is illustrated in the figure below, where we observe that the tail of the F splits into the two tails of the t. Hence, the two approaches lead to exactly the same outcome,
provided that the alternative hypothesis is two sided. However, the t statistic is more
flexible for testing a single hypothesis, because it can be used to test H0 against one-tail
alternatives.
Moreover, since the t statistics are also easier to obtain than the F statistics, there is no
good reason for using an F statistic to test a hypothesis with a unique restriction.
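The equivalence can be verified numerically for any degrees of freedom:

from scipy.stats import t, f

df = 20
t_crit = t.ppf(0.975, df)            # two-sided 5% t critical value
F_crit = f.ppf(0.95, dfn=1, dfd=df)  # 5% F critical value with (1, df) df
print(t_crit ** 2, F_crit)           # both are about 4.35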
The normality of the OLS estimators depends crucially on the normality assumption of the disturbances. What happens if the disturbances do not have a normal distribution? We have seen that, under the Gauss-Markov assumptions, the OLS estimators are asymptotically normally distributed, i.e. approximately normally distributed in large samples. If the disturbances are not normal, the t statistic will only have an approximate t distribution rather than an exact one. As can be seen in the Student's t table, for a sample size of 60 observations the critical points are already practically equal to those of the standard normal distribution. Similarly, if the disturbances are not normal, the F statistic
will only have an approximate F distribution rather than an exact one, when the sample
size is large enough and the Gauss-Markov assumptions are fulfilled. Therefore, we can
use the F statistic to test linear restrictions in linear models as an approximate test. There
are other asymptotic tests (the likelihood ratio, Lagrange multiplier and Wald tests) based
on the likelihood functions that can be used in testing linear restriction if the disturbances
are non-normally distributed. These three can also be applied when a) the restrictions are
nonlinear; and b) the model is nonlinear in the parameters. For non-linear restrictions, in
linear and non-linear models, the most widely used test is the Wald test. For testing the
assumptions of the model (for example, homoscedasticity and no autocorrelation) the
Lagrange multiplier (LM) test is usually applied. In the application of the LM test, an
auxiliary regression is often run. The term auxiliary regression indicates that the coefficients themselves are not of direct interest: only the R2 is retained. In an auxiliary regression
the regressand is usually the residuals (or functions of the residuals), obtained in the OLS
estimation of the original model, while the regressors are often the regressors (and/or
functions of them) of the original model.
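As an illustration of these mechanics, here is a minimal sketch of an LM-type test based on an auxiliary regression, in the spirit of a Breusch-Pagan check for homoscedasticity. The data and setup are assumptions for illustration only; as described above, only the R2 of the auxiliary regression is used.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1, 10, n)
y = 2.0 + 0.8 * x + rng.normal(scale=0.5 * x, size=n)   # error variance grows with x

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
u = y - X @ b                                           # OLS residuals of the original model

# Auxiliary regression: squared residuals on the original regressors
z = u**2
g = np.linalg.lstsq(X, z, rcond=None)[0]
zhat = X @ g
r2_aux = 1 - np.sum((z - zhat)**2) / np.sum((z - z.mean())**2)

LM = n * r2_aux                                         # LM statistic = n * R^2 of the auxiliary regression
p_value = 1 - stats.chi2.cdf(LM, df=1)                  # df = number of auxiliary regressors (excluding the constant)
print(LM, p_value)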
Example:
The Bush Company is engaged in the sale and distribution of gifts imported from the
Near East. The most popular item in the catalog is the Guantanamo bracelet, which has
some relaxing properties. The sales agents receive a commission of 30% of total sales
amount. In order to increase sales without expanding the sales network, the company
established special incentives for those agents who exceeded a sales target during the last
year. Advertising spots were radio broadcasted in different areas to strengthen the
promotion of sales. In those spots special emphasis was placed on highlighting the well-
being of wearing a Guantanamo bracelet. The manager of the Bush Company wonders
whether a dollar spent on special incentives has a higher incidence on sales than a dollar
spent on advertising. To answer that question, the company's econometrician suggests the
following model to explain sales:

sales = β1 + β2 advert + β3 incent + u

where incent denotes the incentives to the salesmen and advert denotes expenditure on advertising.
The variables sales, incent and advert are expressed in thousands of dollars. Using a sample of 18 sale areas (work file advance), we have obtained the regression output and the covariance matrix of the coefficients that appear in the table below.
Covariance matrix
In this model, the coefficient β2 indicates the increase in sales produced by a dollar increase in spending on advertising, while β3 indicates the increase caused by a dollar increase in the special incentives, holding the other regressor fixed in both cases. To answer the question posed in this example, the null and the alternative hypotheses are the following:

H0: β3 = β2    H1: β3 > β2
The t statistic is built using information from the covariance matrix of the estimators:

t = (β̂3 − β̂2) / √[var(β̂3) + var(β̂2) − 2 cov(β̂2, β̂3)]

For α = 0.10, we find that t(15, 0.10) = 1.341. As t < 1.341, we do not reject H0 for α = 0.10, nor for α = 0.05 or α = 0.01. Therefore, there is no empirical evidence that a dollar spent on special incentives has a higher incidence on sales than a dollar spent on advertising.
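A sketch of how such a comparison of two coefficients is computed from the covariance matrix of the estimators is given below. Since the regression output and covariance matrix of the example are not reproduced here, the numerical values in the sketch are hypothetical placeholders.

import numpy as np
from scipy import stats

# Hypothetical values standing in for the regression output and the covariance
# matrix of the estimators (the actual table is not reproduced in the text).
b_advert, b_incent = 0.613, 0.555          # estimated beta2 and beta3
var_advert, var_incent = 0.0150, 0.0210    # var(beta2_hat), var(beta3_hat)
cov_ai = 0.0015                            # cov(beta2_hat, beta3_hat)

# H0: beta3 = beta2  vs  H1: beta3 > beta2
se_diff = np.sqrt(var_incent + var_advert - 2 * cov_ai)
t = (b_incent - b_advert) / se_diff

df = 18 - 3                                # n = 18 sale areas, 3 estimated parameters
t_crit = stats.t.ppf(0.90, df)             # one-tail critical value at alpha = 0.10
print(t, t_crit, t > t_crit)               # do not reject H0 if t < t_crit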
6. Summary
This chapter extended and refined the ideas of interval estimation and hypothesis
testing first introduced in Chapter 5 in the context of the two-variable linear
regression model.
In a multiple regression, testing the individual significance of a partial regression
coefficient (using the t test) and testing the overall significance of the regression
(i.e. H0: all partial slope coefficients are zero or R2 = 0) are not the same thing.
In particular, the finding that one or more partial regression coefficients are
statistically insignificant on the basis of the individual t test does not mean that all
partial regression coefficients are also (collectively) statistically insignificant. The
latter hypothesis can be tested only by the F test.
The F test is versatile in that it can test a variety of hypotheses, such as whether
(1) an individual regression coefficient is statistically significant, (2) all partial
slope coefficients are zero, (3) two or more coefficients are statistically equal, (4)
the coefficients satisfy some linear restrictions, and (5) there is structural stability
of the regression model.
As in the two-variable case, the multiple regression model can be used for the
purpose of mean and/or individual prediction.
TABLE OF CONTENTS
1. Learning outcomes
2. Introduction
3. Theory of functional forms
4. The Log-linear model
5. Semi-log models
5.1 Lin-log Model
5.2 Log-lin Model
6. Reciprocal Model
7. Illustrative Examples
8. Choice of functional form
9. Summary
1. Learning Outcomes
After studying this module, you shall be able to
2. Introduction
Through the use of multiplicative terms in interaction models, we can assess how slopes vary,
conditional on the value of some other covariate. In this sense, we can estimate more
sophisticated linear relationships among our variables. However, one thing that we haven’t
considered to this point is how to incorporate nonlinear relationships inside of a linear
regression model. Why might we want to think about doing this? The reasons are simple. First,
given the desirable properties of the OLS model, it would be nice to use our data and stay
within the framework of the linear model. Second, if a relationship is nonlinear, then
transformation on X may produce a better fitting regression model than one where the
relationship between X and Y is constant (and linear). Despite its name, the classical linear
regression model is not limited to a linear relationship between the dependent and the
independent variables. In this module we are concerned primarily with models that are linear in
the parameters; they may or may not be linear in the variables.
The latter is the usual multiple linear regression model with L + 1 regressors as long as all necessary assumptions about the error term and the transformed independent variables z_i = (1, z_i1, z_i2, ..., z_iL) are satisfied. All properties of OLS are therefore preserved. The use of the vector approach is to show that we can extend the one-equation analysis to many equations in n
dimensional space and can find the solutions to the beta coefficients. Coming back to the one
regression equation model; we consider below some of the other functional forms.
In the sections that follow we consider some commonly used regression models that may be
nonlinear in the variables but are linear in the parameters or that can be made so by suitable
transformations of the variables. In particular, we discuss the following regression models:
1. The log-linear model
2. Semi-log models
3. Reciprocal models
4. The logarithmic reciprocal model
Consider the model Y_i = β1 X_i^β2 e^u_i. Taking logarithms on both sides, we can write this model as ln Y_i = α + β2 ln X_i + u_i, where α = ln β1. The model is linear in the parameters α and β2, linear in the logarithms of the variables Y and X, and can be estimated by OLS regression. Because of this linearity, these models are called log-log, double-log, or log-linear models. If the assumptions of the classical linear regression model are fulfilled, the parameters can be estimated by the OLS method by letting Y*_i = ln Y_i and X*_i = ln X_i. The OLS estimators α̂ and β̂2 so obtained will be best linear unbiased estimators of α and β2, respectively.
The slope coefficient of the log-log model, β2, measures the elasticity of Y with respect to X,
that is, the percentage change in Y for a given (small) percentage change in X. Thus, if Y
represents the quantity of a commodity demanded and X its unit price, β2 measures the price
elasticity of demand, a parameter of considerable economic interest.
Two special features of the log-linear model are worth noting. First, the model assumes that the elasticity coefficient between Y and X, β2, remains constant throughout. Second, although α̂ and β̂2 are unbiased estimates of α and β2, β1 (the parameter entering the original model), when estimated as β̂1 = antilog(α̂), is itself a biased estimator.
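The following sketch illustrates estimation of a log-log model on simulated data (the data-generating values are assumptions for illustration): the slope of ln Y on ln X estimates the constant elasticity, and the antilog of the intercept gives the (biased) estimate of β1.

import numpy as np

# Hypothetical price-quantity data; the slope of ln(q) on ln(p) is the
# (constant) price elasticity of demand implied by the log-log model.
rng = np.random.default_rng(2)
p = rng.uniform(2, 20, 100)
q = 50 * p**(-1.2) * np.exp(rng.normal(scale=0.1, size=100))   # true elasticity is -1.2

X = np.column_stack([np.ones_like(p), np.log(p)])              # transform: X* = ln p
alpha_hat, beta2_hat = np.linalg.lstsq(X, np.log(q), rcond=None)[0]

beta1_hat = np.exp(alpha_hat)   # antilog of alpha_hat; this estimator of beta1 is biased
print(beta2_hat, beta1_hat)     # beta2_hat estimates the elasticity (about -1.2 here)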
This model is like any other linear regression model in that the parameters β1 and β2 are linear.
The only difference is that the regressand is the logarithm of Y and the regressor is “time,”
which will take values of 1, 2, 3, etc.
Models like these are called semi-log models because only one variable (in this case the
regressand) appears in the logarithmic form. A model in which the regressand is logarithmic
will be called a log-lin model. Later we will consider a model in which the regressand is linear
but the regressor(s) are logarithmic and call it a lin-log model. Before we present the regression
results, let us examine its properties: the slope coefficient measures the constant proportional or relative change in Y for a given absolute change in the value of the regressor (in this case the variable t). To see this, suppose Y grows at a compound rate r per period, so that

Y_t = Y_0 (1 + r)^t
ln Y_t = ln Y_0 + t ln(1 + r)

Letting β1 = ln Y_0 and β2 = ln(1 + r), we can write

ln Y_t = β1 + β2 t + u_t
β2 = relative change in regressand / absolute change in regressor
If we multiply the relative change in Y by 100, it will then give the percentage change, or the
growth rate, in Y for an absolute change in X, the regressor. That is, 100 times β2 gives the
growth rate in Y; 100 times β2 is known in the literature as the semi-elasticity of Y with respect
to X.
Instantaneous versus compound rate of growth: the coefficient of the trend variable in the growth model, β2, gives the instantaneous (at a point in time) rate of growth and not the compound (over a period of time) rate of growth. But the latter can easily be found by taking the antilog of the estimated β2, subtracting 1 from it, and multiplying the difference by 100. Thus, for an illustrative example, suppose the estimated slope coefficient is 0.00743. Then [antilog(0.00743) − 1] = 0.00746, or 0.746 percent. Thus, in the illustrative example,
the compound rate of growth was about 0.746 percent per quarter, which is slightly higher than
the instantaneous growth rate of 0.743 percent.
This is of course due to the compounding effect.
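The arithmetic behind the instantaneous versus compound growth rates quoted above can be reproduced as follows (using the illustrative slope estimate 0.00743).

import numpy as np

beta2_hat = 0.00743                        # slope from the log-lin (growth) regression

instantaneous = 100 * beta2_hat            # 0.743 percent per quarter
compound = 100 * (np.exp(beta2_hat) - 1)   # antilog(beta2) - 1, about 0.746 percent

print(instantaneous, compound)             # the compound rate slightly exceeds the instantaneous rate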
Yt = β1 + β2t + ut
That is, instead of regressing the log of Y on time, researchers sometimes regress Y on time, where Y is the
variable t is known as the trend variable. If the slope coefficient in is positive, there is an
upward trend in Y, whereas if it is negative, there is a downward trend in Y. Unlike the growth
model just discussed, in which we were interested in finding the percent growth in Y for an
absolute change in X, suppose we now want to find the absolute change in Y for a percent
change in X. A model that can accomplish this purpose can be written as:
Yi = β1 + β2 ln Xi + ui
For descriptive purposes we call such a model a lin–log model. Let us interpret the slope
coefficient β2. As usual,
β2 = change in Y/ change in ln X
= change in Y / relative change in X
If X changes by 1 percent (a relative change of 0.01), the absolute change in Y is 0.01(β2); if in an application one finds that β2 = 500, the absolute change in Y is (0.01)(500) = 5.0. Therefore, when the regression is estimated by OLS, do not forget to multiply the value of the estimated slope coefficient by 0.01 or, what amounts to the same thing, divide it by 100. If you do not keep this in mind, your interpretation in an application will be highly misleading. The practical question
is: When is a lin–log model useful? An interesting application has been found in so-called Engel expenditure curves. Engel found that food expenditure is an increasing function of income, but that the food budget share decreases with income, which explains the nonlinearity. Goods with an income elasticity below 1 are necessities and goods with an income elasticity above 1 are luxuries; Engel found that food is a necessity. Example: an estimated coefficient of 0.37 from a regression of log food expenditure on log income suggests that a 1% rise in income generates a 0.37% rise in food expenditure, and a 10% rise in income generates a 3.7% rise in food expenditure, so the share of the household budget spent on food falls as income rises. The food income elasticity is indeed between 0 and 1, so food is a necessity.
6. Reciprocal Models
So far we have considered models written in the linear form
Y = b0 + b1X + u
which implies a straight-line relationship between Y and X. Sometimes economic theory and/or observation of the data will suggest that the relationship between variables is not linear. One way to model a non-linear relationship is the equation
Y = a + b/X + e
(where the line asymptotes to the value a as X increases to infinity).
However, Y = a + b/X + e does not trace out a straight line between Y and X, and OLS (which fits a straight line to minimize the RSS) can only be applied if we can somehow make the relationship linear. In this case we do so by creating a variable equal to the reciprocal of X, 1/X, so that the relationship between Y and 1/X is linear (i.e. a straight line).
Models of the following type are known as reciprocal models. Although this model is nonlinear
in the variable X because it enters inversely or reciprocally, the model is linear in β1 and β2 and
is therefore a linear regression model. This model has these features: As X increases
indefinitely, the term β2(l/X) approaches zero (note: β2 is a constant) and Y approaches the
limiting or asymptotic value β1. Therefore, models like above have built in them an asymptote
or limit value that the dependent variable will take when the value of the X variable increases
indefinitely.
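A minimal sketch of estimating a reciprocal model is given below, assuming simulated data with a true asymptote of 10: the model is made linear by regressing Y on the constructed variable 1/X.

import numpy as np

# Hypothetical data following Y = a + b/X + e, with asymptote a as X grows.
rng = np.random.default_rng(3)
x = rng.uniform(1, 50, 120)
y = 10 + 30 / x + rng.normal(scale=0.5, size=120)

Z = np.column_stack([np.ones_like(x), 1.0 / x])   # linearise: regress Y on 1/X
b1_hat, b2_hat = np.linalg.lstsq(Z, y, rcond=None)[0]

print(b1_hat)   # estimate of the asymptote (limit of Y as X -> infinity), about 10
print(b2_hat)   # coefficient on 1/X, about 30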
7. Illustrative Examples
Example1
If you're trying to estimate the amount of time it takes a person to complete a complex physical task as a function of the air temperature (holding humidity constant), you might suspect that time would be minimized at about 70°F and higher when the air is either colder or warmer. The relationship should be U-shaped.
You might choose to model this as:
Ti = β0 + β1 Fi + β2 Fi² + εi
You would expect the estimated coefficient on Fi to be negative and the estimated coefficient on Fi² to be positive. This will produce a curve with the desired U-shape.
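A brief sketch of this quadratic specification on simulated data (the temperatures and coefficients are assumptions for illustration) shows the expected signs and recovers the temperature at which the predicted completion time is minimised.

import numpy as np

# Hypothetical task-time data that is lowest near 70 F and higher at the extremes.
rng = np.random.default_rng(4)
F = rng.uniform(30, 110, 150)
T = 40 + 0.01 * (F - 70)**2 + rng.normal(scale=0.5, size=150)

X = np.column_stack([np.ones_like(F), F, F**2])
b0, b1, b2 = np.linalg.lstsq(X, T, rcond=None)[0]

print(b1 < 0, b2 > 0)          # expected signs for a U-shaped curve
print(-b1 / (2 * b2))          # temperature at which predicted time is minimised (near 70)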
Example 2
One part of my dissertation looked at the cost of operating scrubbers on coal burning power
plants. The dependent variable was the annual operating and maintenance costs (Yi) and the
explanatory variables were such things as the level of efficiency of the scrubbers (E i), the total
amount of coal burned (Ci), the sulfur content of the coal (Si) and the percentage of hours
during which the scrubber was operated (Pi). If any of these explanatory variables took the
value 0, then there would be no scrubbing done and the costs should be zero. Further, if any of
these were to double, I would expect that the total cost would rise by some percentage, perhaps
doubling. As a result, the model I used was:
Yi = a Ei^b Ci^c Si^d Pi^f εi
Example 3
In talking about demand relationships, one of the most important considerations is the price
elasticity of demand. Among other things, this reveals the effect that a price change will have
on revenue. If we estimate the equation:
Qt = β0 + β1Pt + εi
We get the slope of the demand curve, which isn't terribly interesting. However, if we estimate:
Qt = a Pt^β εt
the coefficient β is the price elasticity of demand. To do this estimation, we need to rewrite the equation as:
ln Qt = ln a + β ln Pt + ln εt
or
ln Qt = β0 + β1 ln Pt + ut
The estimate of β1 is the estimate of the price elasticity of demand. To be more clear about
doing this, if you have data on prices and quantities at different points in time, all you need to
do is generate new variables equal to the log of the price and the log of the quantity and then do
a linear regression using these new variables. The resulting coefficient on price will be the
estimate of elasticity. In general, anytime you estimate a model of this form, the slope
coefficients are elasticities. That is, they represent the relationship between a percentage change in the explanatory variable and the resulting percentage change in the dependent variable. For
example, looking through a catalog for bicycle parts, you would notice that lighter parts carry
higher prices. If you regressed ln(price) on ln(weight) you would calculate the weight elasticity
of price, or the percentage increase in price resulting from a 1% decrease in weight.
If we want to compare the goodness of fit of models in which the dependent variable is in logs versus levels, we cannot use the R2. The TSS in Y is not the same as the TSS in lnY,
so comparing R2 is not valid. Instead the basic idea behind testing for the appropriate functional
form of the dependent variable is to transform the data so as to make the RSS comparable.
9. Summary
A functional form refers to the algebraic form of a relationship between a dependent variable
and regressors or explanatory variables.
The simplest functional form is the linear functional form, where the relationship between the
dependent variable and an independent variable is graphically represented by a straight line.
(1) Linear: y = a + b x + e
In this functional form b represents the change in y (in units of y) that will occur as x changes
one unit.
(2) Log-lin (semi-log): ln y = a + b x + e
In this functional form b is interpreted as follows. A one-unit change in x will cause a (b × 100)% change in y, e.g., if the estimated coefficient is 0.05, that means a one-unit increase in x will generate a 5% increase in y.
(3) Log-log (double-log): ln y = a + b ln x + e
In this functional form b is the elasticity coefficient. A one percent change in x will cause a b% change in y, e.g., if the estimated coefficient is −2, that means a 1% increase in x will generate a 2% decrease in y.
Sometimes a regression model may not contain an explicit intercept term; such a model is known as regression through the origin. In such models the sum of the residuals ûi is nonzero and the conventionally computed r2 may not be meaningful. Unless there is a strong theoretical reason,
it is better to introduce the intercept in the model explicitly. The units and scale in which the
regressand and the regressor(s) are expressed are very important because the interpretation of
regression coefficients critically depends on them. In the log-linear model both the regressand
and the regressor(s) are expressed in the logarithmic form. Models like reciprocal models have
built in them an asymptote or limit value that the dependent variable will take when the value of
the X variable increases indefinitely.
TABLE OF CONTENTS
1. Learning outcomes
2. Introduction
3. Point Estimate Vs Interval Estimate
3.1 Confidence Intervals
3.2 Confidence Levels
3.3 Margin of Errors
4. Confidence Intervals for Regression Coefficients β1 and β2
5. Confidence interval for 𝝈𝟐
6. Confidence Sets for Multiple Coefficients
7. Application of Regression Analysis: The Problem of Prediction
7.1 Mean Prediction
7.2 Individual Prediction
8. Summary
1. Learning Outcomes
After studying this module, you shall be able to
2. Introduction
Estimation theory is a branch of statistics that deals with estimating the values of parameters based
on measured/empirical data that has a random component. The parameters describe an underlying
physical setting in such a way that their value affects the distribution of the measured data. In
statistics, estimation refers to the process by which one makes inferences about a population, based
on information obtained from a sample.
Statisticians use sample statistics to estimate population parameters. For example, sample means
are used to estimate population means; sample proportions, to estimate population proportions.
An interval estimate, or confidence interval, has three parts:
A confidence level.
A statistic.
A margin of error.
The confidence level describes the uncertainty of a sampling method. The statistic and the margin
of error define an interval estimate that describes the precision of the method. The interval estimate of a confidence interval is defined by the sample statistic ± margin of error.
Confidence intervals are preferred to point estimates, because confidence intervals indicate (a) the
precision of the estimate and (b) the uncertainty of the estimate.
The probability part of a confidence interval is called a confidence level. The confidence level
describes the likelihood that a particular sampling method will produce a confidence interval that
includes the true population parameter.
Here is how to interpret a confidence level. Suppose we collected all possible samples from a given
population, and computed confidence intervals for each sample. Some confidence intervals would
include the true population parameter; others would not. A 95% confidence level means that 95%
of the intervals contain the true population parameter; a 90% confidence level means that 90% of
the intervals contain the population parameter; and so on.
In a confidence interval, the range of values above and below the sample statistic is called
the margin of error. For example, suppose the local newspaper conducts an election survey and
reports that the independent candidate will receive 30% of the vote. The newspaper states that the
survey had a 5% margin of error and a confidence level of 95%. These findings result in the
following confidence interval: We are 95% confident that the independent candidate will receive
between 25% and 35% of the vote.
Pr (β̂2− δ ≤ β2 ≤ β̂2+ δ) = 1 − α
Recall that in our regression model, we are stating that E (y|x) = β0 + β1x. In this model, β1 represents
the change in the mean of our response variable y, as the predictor variable x increases by 1 unit.
Note that if β1 = 0, we have that E (y|x) = β0 + β1x = β0 + 0x = β0, which implies the mean of our
response variable is the same at all values of x.
In the context of the coffee sales example (below), this would imply that mean sales are the same,
regardless of the amount of shelf space, so a marketer has no reason to purchase extra shelf space.
This is like saying that knowing the level of the predictor variable does not help us predict the
response variable. Under the assumptions stated previously, namely that y ∼ N (β0 + β1x, σ2), our
estimator b1 has a sampling distribution that is normal with mean β1 (the true value of the
parameter), and standard error.
σ / √(∑ᵢ₌₁ⁿ (xᵢ − x̄)²). That is:

b1 ~ N(β1, σ²/SSxx)

where SSxx = ∑ᵢ₌₁ⁿ (xᵢ − x̄)².
First, we obtain the estimated standard error of b1 (this is the standard deviation of its sampling distribution):

SE(b1) = se / √SSxx

Note that this is the estimated standard error of b1 since we use se = √MSE to estimate σ. Also, we have n − 2 degrees of freedom instead of n − 1, since the estimate se has 2 estimated parameters used in it.
Examples:
Suppose,
b1 = 2.0055, SSxx = 7738.94, se = 6.24, n = 48.
For the hosiery mill cost function analysis, we obtain a 95% confidence interval for average unit variable costs (β1). Note that t.025,48−2 = t.025,46 ≈ 2.015, since t.025,40 = 2.021 and t.025,60 = 2.000 (we could approximate this with z.025 = 1.96 as well). The interval is

b1 ± t.025,46 · se/√SSxx = 2.0055 ± 2.015 (6.24/√7738.94) = 2.0055 ± 0.143

that is, roughly (1.86, 2.15). We are 95% confident that the true average unit variable costs are between $1.86 and $2.15 (this is the incremental cost of increasing production by one unit, assuming that the production process is in place).
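The interval quoted above can be reproduced from the stated figures; the sketch below uses the exact t critical value from scipy rather than the 2.015 approximation.

import numpy as np
from scipy import stats

# Figures quoted in the hosiery mill example
b1, ss_xx, s_e, n = 2.0055, 7738.94, 6.24, 48

se_b1 = s_e / np.sqrt(ss_xx)                 # estimated standard error of b1
t_crit = stats.t.ppf(0.975, df=n - 2)        # exact t value (the text approximates with 2.015)

lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(round(lower, 4), round(upper, 4))      # roughly (1.86, 2.15)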
The quantity (n − 2)σ̂²/σ² follows the χ2 distribution with n − 2 df. Therefore, we can use the χ2 distribution to establish a confidence interval for σ2:

Pr(χ²(1−α/2) ≤ (n − 2)σ̂²/σ² ≤ χ²(α/2)) = 1 − α

where the χ2 value in the middle of this double inequality is as given by equation (1) and where χ²(1−α/2) and χ²(α/2) are two values of χ2 (the critical χ2 values) obtained from the chi-square table for n − 2 df in such a manner that they cut off 100(α/2) percent tail areas of the χ2 distribution. Substituting χ2 from (1) into (2) and rearranging the terms, we obtain

(n − 2)σ̂²/χ²(α/2) ≤ σ² ≤ (n − 2)σ̂²/χ²(1−α/2)

which gives the 100(1 − α)% confidence interval for σ2. The interpretation of this interval is: if
we establish 95% confidence limits on σ2 and if we maintain a priori that these limits will include
true σ2, we shall be right in the long run 95 percent of the time.
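A short sketch of this χ2-based interval, using scipy for the critical values; the value of σ̂² is taken, purely for illustration, from the consumption-income example used later in this module (n = 10).

import numpy as np
from scipy import stats

# Illustrative values: sigma2_hat is the OLS estimate of the error variance
n, sigma2_hat, alpha = 10, 42.159, 0.05
df = n - 2

chi2_upper = stats.chi2.ppf(1 - alpha / 2, df)   # cuts off alpha/2 in the upper tail
chi2_lower = stats.chi2.ppf(alpha / 2, df)       # cuts off alpha/2 in the lower tail

ci = (df * sigma2_hat / chi2_upper, df * sigma2_hat / chi2_lower)
print(ci)   # 95% confidence interval for sigma^2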
For the above regression model, a 95% joint confidence set for β1 and β2 is a set-valued function of the data that contains the true parameters in 95% of hypothetical repeated samples. It is the set of parameter values that cannot be rejected at the 5% significance level. We
can find a 95% confidence set as the set of (β1, β2) that cannot be rejected at the 5% level using an
F-test.
Let F(β1,0, β2,0) be the (heteroskedasticity-robust) F-statistic testing the hypothesis that β1 = β1,0 and β2 = β2,0. The 95% confidence set is {(β1,0, β2,0): F(β1,0, β2,0) < 3.00}. Here 3.00 is the 5% critical value of the F(2, ∞) distribution. This set has a coverage rate of 95% because the test on which
it is based (the test it “inverts”) has size of 5%. That is 5% of the time, the test incorrectly rejects
the null when the null is true, so 95% of the time it does not; therefore the confidence set constructed
as the non-rejected values contains the true value 95% of the time (in 95% of all samples).
In terms of the t statistics t1 = (β̂1 − β1,0)/SE(β̂1) and t2 = (β̂2 − β2,0)/SE(β̂2), the F-statistic can be written as

F = (1/2) × [t1² + t2² − 2 ρ̂(t1,t2) t1 t2] / (1 − ρ̂²(t1,t2))

where ρ̂(t1,t2) is the estimated correlation between the two t statistics.
This is a quadratic form in β1,0 and β2,0 – thus the boundary of the set F = 3.00 is an ellipse.
By replacing the unknown σ2 by its unbiased estimator σ 2̂ , we see that the variable follows the t
distribution with n−2 df. The t distribution can therefore be used to derive confidence intervals for
the true E (Y0|X0) and test hypotheses about it in the usual manner, namely,
var(Ŷ0) = 42.159 [1/10 + (100 − 170)²/33,000] = 10.4759
and
𝑠𝑒(𝑌̂0 ) = 3.2366
Therefore, the 95% confidence interval for true E(Y | X0) = β1 + β2X0 is given by
75.3645 − 2.306(3.2366) ≤ E(Y0 | X = 100) ≤ 75.3645 + 2.306(3.2366), that is,
67.9010 ≤ E(Y | X = 100) ≤ 82.8281
Thus, given X0 = 100, in repeated sampling, 95 out of 100 intervals will include the true mean
value; the single best estimate of the true mean value is of course the point estimate 75.3645.
var(Y0 − Ŷ0) = E[(Y0 − Ŷ0)²] = σ² [1 + 1/n + (X0 − X̄)²/∑xᵢ²]
It can be shown further that Y0 also follows the normal distribution with mean and variance given
by above formulas, respectively. Substituting σ̂ 2 for the unknown σ2, it follows that
t = (Y0 − Ŷ0) / se(Y0 − Ŷ0)
also follows the t distribution. Therefore, the t distribution can be used to draw inferences about the
true Y0. Continuing with our consumption-income example, we see that the point prediction of Y0 is 75.3645, the same as that of Ŷ0, and its variance is 52.6349 (so its standard error is 7.2550). Therefore, the 95% confidence interval for Y0 corresponding to X0 = 100 is seen to be

75.3645 − 2.306(7.2550) ≤ Y0 | X = 100 ≤ 75.3645 + 2.306(7.2550), that is, 58.6345 ≤ Y0 ≤ 92.0945.
8. Summary
Estimation and hypothesis testing constitute the two main branches of classical statistics. Hypothesis testing can be approached in two complementary ways: the confidence interval and the test of significance. Underlying the confidence-interval approach is the concept of interval estimation. An interval estimator is an interval or range constructed in
such a manner that it has a specified probability of including within its limits the true value of the
unknown parameter. The interval thus constructed is known as a confidence interval, which is often
stated in percent form, such as 90 or 95%. The confidence interval provides a set of plausible
hypotheses about the value of the unknown parameter. If the null-hypothesized value lies in the
confidence interval, the hypothesis is not rejected, whereas if it lies outside this interval, the null
hypothesis can be rejected.
TABLE OF CONTENTS
1. Learning Outcome
2. Introduction
3. Mean Prediction
3.1 Variance of Mean Prediction
3.2 Factors affecting the variance of 𝒀̂𝒐
3.3 Confidence Interval for Mean Predictor
3.4 Interpretation of a Confidence Interval
4. Individual Prediction
4.1 Prediction Interval for Individual Predictor
4.2 Interpretation of a Prediction Interval
4.3 Difference between Confidence Interval and Prediction Interval
4.4 Numerical example
5. Forecasting Methodologies
5.1 Econometric Forecasting
5.2 Time series forecasting
5.3 Fitted values, In-sample and Out-of-Sample Forecasts
5.4 Conditional and Unconditional Forecasts
6. Evaluation of forecasts
7. Summary
1. Learning Outcomes
After reading this module, the reader will be able to understand:
Mean prediction
Individual prediction
Forecasting methodologies
Evaluation of forecast
2. Introduction
Regression analysis is used to estimate economic relationships among variables. The regression results can be used to predict the value of the dependent variable when the
independent variables are given or specified. This is known as prediction or forecasting. Typically
we have two types of predictions in the literature namely Mean Prediction and Individual
Prediction. The Mean Prediction basically relates to finding the Mean value of dependent variable
for a specified value of explanatory variable while the Individual Prediction relates to finding the
Individual value of dependent variable for a specified value of explanatory variable.
3. Mean Prediction
The Mean Prediction calculates the average or expected value of the dependent variable (say Y) when the independent variable (say X) is given. To illustrate the concept of mean prediction, we take
the following simple regression model:
𝑌 = 𝛽𝑜 + 𝛽1 𝑋 + 𝑈
The fitted value for the regression equation given above is:
𝑌̂ = 𝛽̂𝑜 + 𝛽̂1 𝑋
The predicted mean value of 𝑌 can be different from its true value. The prediction or forecast error
can be calculated as the difference between these two values. If we want to assess how good this prediction is, we have to find out the variability of this prediction. So we calculate the Variance of
Mean Prediction.
Please note that β̂0 and β̂1 are unbiased estimators of β0 and β1 respectively. So

E(Ŷ) = E(β̂0 + β̂1X) = β0 + β1X = E(Y | X)

Therefore the expected value of the predicted value of Y and the expected value of Y conditioned on the given values of X are equal.
Recall the following property of variance for any two random variables A and B: Var(aA + bB) = a²Var(A) + b²Var(B) + 2ab Cov(A, B). Applying this to Ŷ0 = β̂0 + β̂1X0 gives

Var(Ŷ0) = Var(β̂0) + X0² Var(β̂1) + 2X0 Cov(β̂0, β̂1)
        = σ² ∑Xᵢ² / (n ∑(Xᵢ − X̄)²) + [σ²/∑(Xᵢ − X̄)²] X0² + 2 [−X̄σ²/∑(Xᵢ − X̄)²] X0
We can show that Ŷ0 is normally distributed, as Ŷ0 is a linear function of random variables which are normally distributed:

Ŷ0 ~ N(β0 + β1X0, σ²[1/n + (X0 − X̄)²/∑xᵢ²])

where β0 + β1X0 is the mean of the prediction and σ²[1/n + (X0 − X̄)²/∑xᵢ²] is the variance of the prediction.
3.2 What factors affect the variance of Ŷ0

Var(Ŷ0) = σ²[1/n + (X0 − X̄)²/∑xᵢ²]
1. An increase in the sample size n decreases the variance of Ŷ0. So the variability of the mean prediction decreases when the sample size increases.
2. An increase in the variability of X decreases the variance of Ŷ0. So the variability of the mean prediction decreases when the variability of X increases.
3. The variance of Ŷ0 increases when X0 is far from X̄, so the variability of the mean prediction increases. Therefore we should be careful about the predicted value of Y when X0 is very far from X̄.
The mean predictor’s standard error can be obtained by taking the square root of 𝑉𝑎𝑟(𝑌̂𝑜 ).
sŶ0 = √(σ²[1/n + (X0 − X̄)²/∑xᵢ²])
The variance of the error (𝜎 2 ) is not known. So we have to replace it with its unbiased estimator𝑆 2 .
Accordingly, we can use the t- distribution to make inferences about the predictions.
A confidence interval estimates the mean or average value of the dependent variable Y for a specified value of the independent variable X. Equivalently, it gives us the range of mean or average values of the dependent variable, for a specified central percentage of the population, for a given value of the independent variable X.
We can discuss the results of a regression model with the help of predicted values of the dependent variable Y and confidence intervals. For example, we can make a statement like: the predicted value of Y is 230 when X = 20. It is useful to incorporate the notion of a confidence interval to assess the accuracy of our statement. This will help us to understand the variability around those predictions. A better statement would now be: the predicted value of Y ranges from 190 to 255 with a 95% confidence interval when the value of X = 20. Similarly we could make a statement like 'the predicted value of Y increases from 120 to 150 when the value of X decreases from 12 to 9'. A better statement would be: when the value of X decreases from 12 to 9, the 95% confidence interval of the predicted value of Y increases from (90; 135) to (143; 167). We need to report the confidence interval around the predicted value to assess the accuracy of any prediction. One could also plot the predicted values and the confidence intervals over some range of X.
4. Individual Prediction
For Individual Prediction, we are interested in predicting an individual value of the dependent
variable 𝑌 when the independent variable𝑋 = 𝑋𝑜 . Accordingly we obtain;
𝑌𝑜 = 𝛽𝑜 + 𝛽1 𝑋𝑜 + 𝑢𝑜
The prediction error is Y0 − Ŷ0 = (β0 − β̂0) + (β1 − β̂1)X0 + u0. Taking expectations on both sides we get E(Y0 − Ŷ0) = 0, so the individual prediction is unbiased. We used the fact that β̂0 and β̂1 are unbiased estimators of β0 and β1 respectively, X0 is a fixed number and E(u0) is zero by assumption.
We obtained the variance of the prediction error by squaring the equation for prediction error and
taking expectations on both sides which gives us:
Using the formulae for variance and covariance of 𝛽̂𝑜 𝑎𝑛𝑑 𝛽̂1 and given that 𝑣𝑎𝑟 (𝑢𝑜 ) = 𝜎 2 we get
var(Y0 − Ŷ0) = σ²[1 + 1/n + (X0 − X̄)²/∑xᵢ²]
Comparing the variance of the individual prediction with the variance of the mean prediction, we find that the former is larger than the latter by σ². This is because in the mean prediction we average out the effect of the disturbance term, but not so in an individual prediction.
We can show that the prediction error Y0 − Ŷ0 is normally distributed with mean 0 and variance σ²[1 + 1/n + (X0 − X̄)²/∑xᵢ²], i.e.

(Y0 − Ŷ0) ~ N(0, σ²[1 + 1/n + (X0 − X̄)²/∑xᵢ²])
The individual predictor's standard error can be obtained by taking the square root of Var(Y0 − Ŷ0):

s(Y0−Ŷ0) = √(σ²[1 + 1/n + (X0 − X̄)²/∑xᵢ²])
The variance of the error (𝜎 2 ) is not known, so we need to replace it with its unbiased estimator𝑆 2 .
We can then use the t- distribution to make inferences about the predictions. The variable
t = (Y0 − Ŷ0) / s(Y0−Ŷ0)
follows the t distribution with n − 2 degrees of freedom. The 100(1 − α)% prediction interval for the individual value Y0, given X = X0, is:

Pr[β̂0 + β̂1X0 − t(n−2, α/2) · s(Y0−Ŷ0) ≤ Y0 | X0 ≤ β̂0 + β̂1X0 + t(n−2, α/2) · s(Y0−Ŷ0)] = 1 − α
A prediction interval gives us the value of the dependent variable Y for a specified value of the independent variable X. In other words, it gives us the range of values of the dependent variable, for a specified central percentage of the population, at a specified value of the independent variable.
The prediction interval predicts one particular value of the dependent variable Y for a given value of X. The prediction interval is given below:

ŷ ± t(n−2, α/2) · s(Y0−Ŷ0)

The confidence interval predicts the mean value of the dependent variable Y for a given value of X. The confidence interval is given below:

ŷ ± t(n−2, α/2) · sŶ0
We observe that the width of these intervals is smallest when X0 = X̄; as X0 moves farther from X̄, the width of the interval increases rapidly. This suggests that predictive accuracy decreases as X0 moves too far from X̄. We can also see that the prediction interval is wider than the confidence interval. This is because there is less error in predicting the mean value than in predicting individual values of the dependent variable.
The data on weekly expenditure on consumption and weekly income of a particular hypothetical
community are given below:
Weekly Consumption Expenditure (in Rupees)   Weekly Income (in Rupees)
 80    90
 86   102
 90   115
 94   123
 99   131
105   145
109   156
115   167
121   175
128   188

1. Find the estimated regression line.
2. Calculate the marginal propensity to consume of the above hypothetical community. Does it conform to Keynes's Psychological Law of Consumption?
3. Calculate the mean prediction of weekly consumption expenditure when the weekly income is Rs 120.
4. Calculate the 95% confidence interval and the 95% prediction interval and give their interpretation.
Solutions:
  Y     X    y = Y − Ȳ   x = X − X̄    x²        x·y        Ŷ          Û = Y − Ŷ    Û²
 80    90    −22.7       −49.2       2420.64   1116.84    78.97374    1.02626     1.05321
 86   102    −16.7       −37.2       1383.84    621.24    84.76064    1.239356    1.536003
 90   115    −12.7       −24.2        585.64    307.34    91.02979   −1.02979     1.060467
 94   123     −8.7       −16.2        262.44    140.94    94.88773   −0.88773     0.788057
 99   131     −3.7        −8.2         67.24     30.34    98.74566    0.254338    0.064688
105   145      2.3         5.8         33.64     13.34   105.4971    −0.49705     0.247059
109   156      6.3        16.8        282.24    105.84   110.8017    −1.80171     3.246166
115   167     12.3        27.8        772.84    341.94   116.1064    −1.10637     1.224063
121   175     18.3        35.8       1281.64    655.14   119.9643     1.03569     1.072654
128   188     25.3        48.8       2381.44   1234.64   126.2335     1.766544    3.120678
Sum      1027   1392   0   0   9471.6   4567.6   1027    0   13.41305
Average  102.7  139.2  0   0    947.16   456.76   102.7  0
Slope coefficient:
β̂1 = ∑xy / ∑x² = 4567.6 / 9471.6 = 0.482242
Intercept coefficient:
β̂0 = Ȳ − β̂1X̄ = 102.7 − 0.482242 × 139.2 = 35.57196
σ̂² = ∑Ûᵢ² / (n − 2) = 13.41305 / 8 = 1.676631
The estimated regression line is Ŷ = 35.57196 + 0.482242X. The marginal propensity to consume is β̂1 = 0.48, which lies between 0 and 1 and therefore conforms to Keynes's Psychological Law of Consumption. The mean prediction at X0 = 120 is ŷ = 35.57196 + 0.482242 × 120 = 93.441.
The 95% confidence interval for the mean is ŷ ± t(n−2, α/2) · sŶ0, where
sŶ0 = σ̂ √[1/n + (X0 − X̄)²/∑xᵢ²] = 0.482616
The 95% prediction interval is ŷ ± t(n−2, α/2) · s(Y0−Ŷ0), where
s(Y0−Ŷ0) = σ̂ √[1 + 1/n + (X0 − X̄)²/∑xᵢ²] = 1.381865
The Confidence interval with 95% level of confidence is (92.328088, 94.553912) which means that
when Income is Rs 120, the mean consumption expenditure will lie between Rs 92.328088 and Rs
94.553912 with a probability of 95%.
The Prediction Interval with 95% level of confidence is (90.25442, 96.62758) which means that
when income is Rs 120, the individual consumption expenditure will lie between Rs 90.25442 and
Rs 96.62758 with a probability of 95%.
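The whole numerical example can be reproduced with a few lines of code; the sketch below recomputes the OLS estimates, the mean prediction at an income of Rs 120, and the 95% confidence and prediction intervals reported above.

import numpy as np
from scipy import stats

Y = np.array([80, 86, 90, 94, 99, 105, 109, 115, 121, 128], dtype=float)
X = np.array([90, 102, 115, 123, 131, 145, 156, 167, 175, 188], dtype=float)
n, X0 = len(Y), 120.0

x, y = X - X.mean(), Y - Y.mean()
b1 = (x @ y) / (x @ x)                     # 0.482242  (marginal propensity to consume)
b0 = Y.mean() - b1 * X.mean()              # 35.57196
u = Y - (b0 + b1 * X)
sigma = np.sqrt((u @ u) / (n - 2))         # sqrt(1.676631)

y0 = b0 + b1 * X0                          # mean prediction at income 120 (about 93.44)
s_mean = sigma * np.sqrt(1 / n + (X0 - X.mean())**2 / (x @ x))       # 0.4826
s_pred = sigma * np.sqrt(1 + 1 / n + (X0 - X.mean())**2 / (x @ x))   # 1.3819

t_crit = stats.t.ppf(0.975, n - 2)         # 2.306
print((y0 - t_crit * s_mean, y0 + t_crit * s_mean))   # 95% confidence interval
print((y0 - t_crit * s_pred, y0 + t_crit * s_pred))   # 95% prediction interval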
5. Forecasting Methodologies
The subject of econometric forecasting is quite vast and it is not possible to discuss all the
issues involved in it. The term forecasting or prediction usually connotes an attempt to predict the future. However, it can also be used when predicting cross-section variables. There are two main forecasting methodologies. The first one is
known as Econometric Forecasting and the second methodology is known as Time Series
Forecasting.
Econometric forecasting deals with a regression model where we are trying to predict the unknown value of the dependent variable with the help of explanatory variables. This type of forecasting has been used for policy making, as such models can explain changes in economic and other behavioural variables. For instance, we can predict the housing demand of a particular city based on the city's population, household income and some other variables. In time series forecasting,
our aim is to predict the future value of a variable based on its past values.
The simplest case for econometric forecasting is when the model contains no lagged dependent variables and its errors are serially uncorrelated. It is also possible to have a case where the model contains lagged dependent variables and its errors are serially correlated. When the errors are serially correlated, we model them with an autoregressive process to obtain a more efficient forecast. We illustrate below the most general econometric formulation of a single dependent variable with both lagged dependent variables and autocorrelated errors.
A one-step-ahead forecast for given values of X(n+1,1), X(n+1,2), ..., X(n+1,k) is given below:
𝑌̂𝑛+1 = 𝛼̂1 𝑌𝑛 + 𝛼̂2 𝑌𝑛−1 + ⋯ … … … … … . + 𝛼̂𝑝 𝑌𝑛+1−𝑝 + 𝛽̂1 𝑋𝑛+1,1 + 𝛽̂2 𝑋𝑛+1,2
+ 𝛽̂3 𝑋𝑛+1,3 + ⋯ … … … … … . +𝛽̂𝑘 𝑋𝑛+1,𝑘 + 𝑢̂𝑛+1
Where
In time series forecasting, attempt is made to forecast the future value of a variable by using
the past value of that variable. A time series follows purely autoregressive (AR model)
structure if the dependent variable Y is related to its past values with white-noise errors. It is a pure moving average (MA model) if the dependent variable Y is related to a linear combination of white-noise error terms. A more general model is an ARMA model, which
combines both the AR and MA model into a single model. The general form of a
𝐴𝑅𝑀𝐴 (𝑝, 𝑞) model is given below:
𝑌̂𝑡+1 = 𝛼̂1 𝑌𝑡 + 𝛼̂2 𝑌𝑡−1 + … … . + 𝛼̂𝑝 𝑌𝑡+1−𝑝 − 𝛽̂1 𝑣̂𝑡 − 𝛽̂2 𝑣̂𝑡−1 − … … − 𝛽̂𝑞 𝑣̂𝑡+1−𝑞
A time series possesses the property of stationarity if the series has a time-invariant mean and constant variance and covariances. A non-stationary series can be converted into a stationary series by differencing. There are three steps involved in estimating a time series model: identification, estimation and diagnostic checking. The first step, identification, specifies the order of the ARMA model using correlograms and partial correlograms. Diagnostic tests such as Box-Pierce and Ljung-Box are conducted to see if the model fits the data adequately. Only when the model passes these adequacy tests should it be used to generate forecasts.
It is usually observed that in the short term, time series forecasts offer better results, while over the longer term econometric forecasts are preferable. Models that combine these two methods can give better short- and long-term forecasts. However, it is difficult to draw a sharp line between the two methodologies.
Forecasts generated for the sample period used to estimate the model are fitted values, or in-sample forecasts. We next generate out-of-sample forecasts for time periods n2 + 1 onward: forecasts generated for a time period outside the data used to estimate the model are known as out-of-sample forecasts. These two forecasting periods are illustrated below:
A conditional forecast implies the prediction of the dependent variable when the values of the independent variables are known or given. Suppose we try to predict housing demand in a particular city, and we hypothesize that housing demand depends on the city's population. If the population of the city is known and is assumed to remain constant in the future, then we can forecast the demand for housing based on the city's population.
In unconditional forecasts, the explanatory variables are unknown but can be generated from the model or from an auxiliary model. For instance, if we consider the example given above and the population of the city is unknown, it can be obtained by using an auxiliary model of birth rates, death rates and population migration. Accordingly, the demand for housing can be forecasted.
6. Evaluation of forecasts
It is often the case that more than one model is used to generate forecasts. It is necessary to determine which of the models provides the most accurate forecast, as different models generate different forecasts. There is a huge literature on the evaluation of forecasts and, historically, the most preferred measure has been the mean squared error (MSE). MSE provides a quadratic loss function and is useful in situations where large forecast errors are more serious than smaller ones. This may, however, also be viewed as a disadvantage if large errors are not disproportionately more serious. Other measures of forecast accuracy are: Mean Absolute Error (MAE), Mean Absolute Percent Error (MAPE) and Mean Square Percentage Error (MSPE). The definitions of these measures of forecast evaluation follow.
1. Mean Square Error:
MSE = ∑(Yₜᶠ − Yₜ)² / (n − 2)
2. Mean Absolute Error:
MAE = ∑|Yₜᶠ − Yₜ| / (n − 2)
3. Mean Absolute Percent Error:
MAPE = (1/n) ∑ 100 |Yₜᶠ − Yₜ| / Yₜ
where Yₜᶠ denotes the forecast of the dependent variable for observation t, Yₜ denotes the actual value and n denotes the sample size.
If two or more models are used to predict the dependent variable then the model that has
lower values for these measures would be considered a better model for forecasting
purposes. We have other ways of evaluating the forecasting accuracy of a model. This can
be done by carrying out a post sample forecast. We will not use the last few observations
in estimating the model, but instead use the parameter estimates from the first set of
observations to predict the dependent variable for the reserved sample. We then calculate
the MSE and other measures of forecast accuracy and choose the model that gives us the lowest values of these measures for forecasting purposes.
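A minimal sketch of such a post-sample evaluation on hypothetical data: the first 16 observations are used to estimate a simple trend model, the last 4 are reserved, and the forecast accuracy measures defined above (using the module's n − 2 divisor for MSE and MAE) are computed on the reserved sample.

import numpy as np

def mse(f, a, k=2):  return np.sum((f - a)**2) / (len(a) - k)
def mae(f, a, k=2):  return np.sum(np.abs(f - a)) / (len(a) - k)
def mape(f, a):      return 100 * np.mean(np.abs(f - a) / a)

# Hypothetical series; the last 4 observations are reserved for post-sample evaluation.
y = np.array([10.2, 10.8, 11.1, 11.9, 12.4, 12.8, 13.5, 13.9, 14.6, 15.0,
              15.7, 16.1, 16.9, 17.2, 17.8, 18.5, 19.1, 19.6, 20.2, 20.9])
t = np.arange(1, len(y) + 1, dtype=float)
n1 = 16

T = np.column_stack([np.ones(n1), t[:n1]])
b = np.linalg.lstsq(T, y[:n1], rcond=None)[0]          # fit a linear trend on the estimation sample
forecast = b[0] + b[1] * t[n1:]                        # out-of-sample forecasts

actual = y[n1:]
print(mse(forecast, actual), mae(forecast, actual), mape(forecast, actual))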
7. Summary
Prediction is about finding the unknown dependent variable when the independent
variables are given. We have two types of predictions namely Mean Prediction and
Individual Prediction.
The Mean Prediction finds the average or mean value of the dependent variable for a specified value of the explanatory variable, while the Individual Prediction finds the individual value of the dependent variable for a specified value of the explanatory variable.
The Variance of Prediction assesses the variability of a prediction and tells us how good the prediction is.
A confidence interval estimates the range of mean values of the dependent variable, for a specified central percentage of the population, for a given value of the independent variable X.
A prediction interval estimates the range of values of the dependent variable, for a specified central percentage of the population, at a specified value of the independent variable X.
The two methodologies of forecasting are Econometric Forecasting and Time Series
Forecasting. In the first method, we predict the value of dependent variable with the
help of explanatory variables. In time series forecasting, we forecast the future value
of a variable by using the past value of that variable.
Some measures of forecast accuracy are: Mean Square Error, Mean Absolute Error,
Mean Absolute Percent Error and Mean Square Percentage Error.
TABLE OF CONTENTS
1. Learning Outcomes
2. Introduction
3. Types of Errors
6. Errors of measurement
10. Summary
1. Learning Outcomes
After studying this module, you shall be able to
Know the different types of errors
Understand the consequences of specification errors
Detect these errors using certain tests
Apply remedies to solve the errors
Establish comparisons between different models.
2. Introduction
The classical linear regression model assumes that the regression model considered is "correctly specified". If this assumption is not satisfied, there will be a problem of model specification bias or model specification error. It is important to search for the correct model for estimation, otherwise the whole implication would be spurious. In this module, we explore in considerable detail the different types of errors and their consequences for regression models. The following sections of the module discuss the tests to detect these errors and the different approaches to solve them. Towards the end, model selection criteria are discussed, to finally search for the correct model so as to render the estimation of an empirical model meaningful.
3. Types of Errors
To start with, let us consider a model based on a production function. Let Yt be the output produced, Xt the amount of labour used (given the technology) and Zt the amount of capital used. So the model becomes:
Yt = α + β Xt + Ɣ Zt+ ut (3.1)
There are different types of errors that can occur in estimation of this model. These can be
attributed to the following reasons:
1. specification error with respect to choice of variables
2. functional form of the model
3. error structure of the model
The specification error with respect to the choice of variables occurs when either a variable from the true model is omitted or an irrelevant variable has been included in the model. Suppose for some reason we forgot to include the variable Zt in the model, and therefore our estimated model becomes:
Yt = α + β Xt + ut (3.2)
Suppose the estimator believes that a variable Qt, denoting the hours labour spends talking, should be included in estimating the production function, and hence the estimated model now becomes:
Yt = α + β Xt + Ɣ Zt + δ Qt + ut (3.3)
By estimating 3.3, the estimator commits the error of including an irrelevant variable.
3.3.2 Violating the assumptions imposed on the error term by classical linear
regression model
Suppose the researcher estimates the following model:
Yt = Xt βut (3.6)
In equation 3.6, the stochastic error term ut enters the model multiplicatively. This might result in another specification error, one arising from an incorrect specification of ut. To estimate this model, if a log transformation is applied to equation 3.6, the error term becomes ln ut:
ln Yt = β ln Xt + ln ut (3.7)
In estimation of 3.7, the entire model gets transformed and may or may not satisfy the basic
assumptions of classical linear regression model.
To sum up, there are different types of errors that can be encountered in the empirical investigation of a hypothesis.
2. The estimated constant term is generally biased, implying that forecasts made from the equation will also be biased. This is true even when the excluded variable is uncorrelated with every included variable.
4. The estimated variances of the regression coefficients of the included variables will generally be biased. This implies that the usual confidence intervals and tests of hypothesis are invalid.
If cov(Xt, Zt) ≠ 0, then
E(β̂2) = α2 + α3 b32
where b32 is the slope in the regression of the excluded variable Zt on the included variable Xt. Hence, the coefficient of Xt will be biased and inconsistent unless Xt is uncorrelated with Zt or α3 is zero.
The extent of the bias will depend on the magnitude of the term α3 b32, and the direction of the bias will depend on the sign of α3 and the sign of b32.
Proof of bias in the constant term:
β̂1 = Ȳ − β̂2 X̄
E(β̂1) = E(Ȳ − β̂2 X̄)
Ȳ = α1 + α2 X̄ + α3 Z̄ + ū
E(Ȳ) = α1 + α2 X̄ + α3 Z̄
E(β̂1) = E(Ȳ) − E(β̂2) X̄
      = α1 + α2 X̄ + α3 Z̄ − [α2 + α3 b32] X̄
      = α1 + α2 X̄ + α3 Z̄ − α2 X̄ − α3 b32 X̄
      = α1 + α3 (Z̄ − b32 X̄)
For the constant term to be unbiased, we would need Z̄ − b32 X̄ = 0. Thus, even if there is no correlation between the omitted variable and the included variables, the constant term generally remains biased. This also shows one of the reasons why a constant term should always be included in the model.
As we can see, the consequences of omitting a relevant variable can be really serious. Hence, if
the hypothetical theory suggests for inclusion of a variable, it should never be dropped from the
empirical estimation of the model.
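The formula E(β̂2) = α2 + α3 b32 can be illustrated with a small Monte Carlo sketch (all parameter values are assumptions for illustration): the average estimated slope from the mis-specified model centres on α2 + α3 b32 rather than on α2.

import numpy as np

# Monte Carlo sketch: true model Y = a1 + a2*X + a3*Z + u, but Z is omitted.
rng = np.random.default_rng(5)
a1, a2, a3 = 1.0, 2.0, 1.5
n, reps = 200, 2000
slopes = []

for _ in range(reps):
    X = rng.normal(size=n)
    Z = 0.6 * X + rng.normal(size=n)      # Z is correlated with X (b32 = 0.6)
    Y = a1 + a2 * X + a3 * Z + rng.normal(size=n)
    W = np.column_stack([np.ones(n), X])  # mis-specified model omitting Z
    slopes.append(np.linalg.lstsq(W, Y, rcond=None)[0][1])

print(np.mean(slopes))        # centres near a2 + a3*b32 = 2 + 1.5*0.6 = 2.9, not 2.0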
1. If an independent variable whose true regression coefficient is zero is included in the model,
the estimated values of other regression coefficients will be unbiased and consistent.
2. The estimated variances of the regression coefficients will also be unbiased and hence, the
tests of hypothesis will be valid.
3. The error variance is correctly estimated.
4. The variance of the regression coefficients will be higher than without the irrelevant variable; hence, the coefficients become inefficient, with larger variances.
Caution: Comparing the consequences of the two types of errors with respect to the choice of variables, it appears that including an irrelevant variable has less serious consequences. Hence, it might become tempting to include more variables in the model to avoid omitting any variable. This may not be the right conclusion, as including more variables leads to a loss in efficiency of the estimates and might also result in the problem of multicollinearity, along with lost degrees of freedom.
Ramsey has proposed a general test of specification error called RESET (Regression Specification Error Test). The intuition behind the test is that, if the model is misspecified, the effect of omitted variables will show up in the fitted values of the dependent variable. If functions of the fitted values are found to be related to the residuals obtained from the model, and their inclusion increases the R2 of the model, it is logical to include the fitted values of the dependent variable in the model in some form.
Suppose we consider the model:
Yt = α + β Xt + ut (5.1)
The steps involved in the Ramsey RESET test are as follows:
1. Estimate the model given by equation 5.1 and obtain the fitted values of the dependent variable, Ŷt.
2. Regress Yt on Xt, with Ŷt² and Ŷt³ as additional regressors:
Yt = β1 + β2 Xt + β3 Ŷt² + β4 Ŷt³ + ut (5.2)
3. Save the R2 obtained from both models and name them R2old and R2new, respectively.
4. Perform an F-test using the formula
F = [(R2new − R2old)/number of new regressors] / [(1 − R2new)/(n − number of parameters in the new model)]
One advantage of the RESET test is that it is easy to apply, but at the same time it only tells us that the model is not correctly specified; it does not help in choosing a better alternative.
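A sketch of the four RESET steps on simulated data is given below; the true relationship is quadratic while a straight line is fitted, so the test should signal misspecification. The data and parameter values are assumptions for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 100
x = rng.uniform(0, 10, n)
y = 1 + 0.5 * x + 0.3 * x**2 + rng.normal(size=n)   # true relation is quadratic

def r2(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return 1 - (e @ e) / np.sum((y - y.mean())**2)

X_old = np.column_stack([np.ones(n), x])
yhat = X_old @ np.linalg.lstsq(X_old, y, rcond=None)[0]       # step 1: fitted values
X_new = np.column_stack([X_old, yhat**2, yhat**3])            # step 2: add yhat^2 and yhat^3

r2_old, r2_new = r2(X_old, y), r2(X_new, y)                   # step 3
q, k_new = 2, X_new.shape[1]
F = ((r2_new - r2_old) / q) / ((1 - r2_new) / (n - k_new))    # step 4
p_value = 1 - stats.f.cdf(F, q, n - k_new)
print(F, p_value)   # a large F / small p-value suggests the model is misspecified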
6. Errors of Measurement
The theoretical argument helps us in building an empirical model. In the empirical model, given
the set of dependent and independent variables, the dataset for these variables has to be collected
from the real world. We assume that the dataset available is accurate. However, this may not be
true always. The data may suffer from various kinds of errors, including non-response errors, computing errors, reporting errors, etc., and hence result in errors of measurement. The consequences of such errors are discussed in this section.
where vt = ( ut+ et) is a composite error term, containing the population disturbance term (ut) and
measurement error term (et).
Assume that E(ut) = E(et) = 0, cov(Xt, ut) = 0 and, for simplicity, cov(Xt, et) = 0, i.e. the errors of measurement in Y*t are uncorrelated with Xt, and cov(ut, et) = 0, i.e. the errors of measurement and the population disturbance term are uncorrelated. Given these assumptions, the β estimated from equation 6.3 will be an unbiased estimator of the true β. Hence, errors of measurement in the dependent variable do not harm the unbiasedness property of the OLS estimators. However, the variances and standard errors estimated from this equation will differ from those of the error-free model.
From estimating equation 6.1: Var(β̂) = σ²u / ∑xt²
From estimating equation 6.3: Var(β̂) = σ²v / ∑xt² = (σ²u + σ²e) / ∑xt²
As can be seen, the estimated variance of β̂ is larger in the second case than in the first. This is one of the consequences of errors of measurement in the dependent variable.
where zt = ( ut- βwt) is a composite error term, containing the population disturbance term (ut) and
measurement error term (wt).
Making the standard assumptions for the error terms, namely E(wt) = 0, E(ut) = 0, cov(Xt, ut) = 0, cov(wt, wt−1) = 0 and cov(ut, wt) = 0, can we also assume that the composite error term zt is independent of the observed Xt (assuming E(zt) = 0)? The answer is no: since the observed Xt contains the measurement error wt, cov(Xt, zt) = cov(Xt, ut − βwt) = −β σ²w, which is nonzero.
In conclusion, the measurement errors can lead to serious problems when they are present in the
explanatory variable because they lead to inconsistent estimation. However, if the errors are
present in the dependent variable, it does not create a problem, the estimators remain unbiased
and consistent.
For a solution to problem of measurement error in explanatory variable, we can find a proxy
variable which is highly correlated with X’s variable but uncorrelated with both the error terms.
E(α̂) = β e^(σ²/2) ∑Xt² / ∑Xt² = β e^(σ²/2)
(because the X's are non-stochastic and e^(ut) has expected value e^(σ²/2)).
Hence, α̂ is a biased estimator of β.
In these models, Model A and Model B are nested models: Model A reduces to Model B when β5 is zero (equivalently, Model A is obtained by adding one more variable to Model B). We can use a t test to test the hypothesis that the coefficient of X5 is zero.
In this sense, specification error tests are essentially tests of nested hypothesis.
Also, Model C and Model D are non-nested because one cannot be derived as a special case of the other, as X and Z are different variables. Even if variable X3 is added to Model D and Z3 is included in Model C, these will still be non-nested models, because Model C does not contain Z2 and Model D does not contain X2.
Also, Model E and Model D are non-nested, as one cannot be derived from the other.
Given Model C and Model D, how can one decide how to choose between these two models? Consider another Model F.
Next, if we begin with Model D and repeat the same procedure, we may find Model D to be the correct model. Hence the choice of the reference hypothesis could determine the outcome of the model choice, making the entire empirical investigation spurious.
Yt = α1 + α2 X2t + α3 X3t + α4 ŶtD + ut
and test for α4 = 0, where ŶtD denotes the fitted values from Model D. If this hypothesis is not rejected, we choose Model C over Model D. If this hypothesis is rejected, Model D is chosen.
Problems:
It is possible to arrive at a conclusion where both models are accepted or both models are rejected, in which case it won't be possible to make a decision about the choice of model.
9.1 R2 criterion
Although the R2 criterion does not impose any penalty for adding more variables, the adjusted R2
does so. It is defined as:
R̄² = 1 − [RSS/(n−k)] / [TSS/(n−1)] = 1 − (1 − R²)·(n−1)/(n−k)
From the definition itself, R̄² ≤ R², which shows that the adjusted R² penalises adding more
variables into the model. The adjusted R² will increase only if the absolute t value of the added
regressor is greater than 1.
Hence, R̄² is a better measure than R². But, like R², it requires the dependent variable to be
the same across models to enable comparison between them.
or in log form:
ln SIC = (k/n) ln n + ln(RSS/n)
where [(k/n) ln n] is the penalty factor which is even harsher than AIC. The lower the value of
SIC, the better the model.
SIC also incorporates the same advantages as AIC along with imposing a harsher penalty.
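A small sketch of the model selection criteria discussed above, computed from the residual sum of squares of an OLS fit. The data and variable names are hypothetical; adjusted R² and ln SIC follow the definitions in the text, while ln AIC is taken in the analogous Gujarati-style log form, which the text mentions but does not write out.

```python
# Model selection criteria from an OLS fit (hypothetical example)
import numpy as np

def selection_criteria(y, X):
    """Return R2, adjusted R2, ln AIC and ln SIC for an OLS fit of y on X."""
    n, k = X.shape                                   # k includes the intercept column
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # OLS estimates
    resid = y - X @ beta
    rss = resid @ resid
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)        # adjusted R2 as defined above
    ln_aic = 2 * k / n + np.log(rss / n)             # assumed log-AIC form
    ln_sic = (k / n) * np.log(n) + np.log(rss / n)   # log-SIC as defined above
    return r2, adj_r2, ln_aic, ln_sic

# Compare a model with one regressor against one with an extra, irrelevant regressor
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=100), rng.normal(size=100)
y = 1 + 0.8 * x1 + rng.normal(size=100)
X_small = np.column_stack([np.ones(100), x1])
X_large = np.column_stack([np.ones(100), x1, x2])
print(selection_criteria(y, X_small))
print(selection_criteria(y, X_large))   # adj R2 barely moves; SIC penalises the extra term
```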
10. Summary
• The classical linear regression model assumes the model is correctly specified. This assumption
may not always be true and there may exist errors in the model. These may be specification
errors or mis-specification errors.
• If specification errors exist, they may lead to serious consequences in the model. Different
types of specification errors include omitting a variable, including an irrelevant variable, and
misspecifying the error structure of the model.
• It is important to know the consequences of these specification errors.
• Having understood the consequences, these errors, if present, need to be detected with the help
of various tests, namely Ramsey's RESET test and the LM test.
• If errors are found in the model, approaches should be used to solve these errors to arrive at the
correct model for estimation.
• Evaluation of the model should be done on model selection criteria like R2, adjusted R2, AIC
and SIC.
TABLE OF CONTENTS
1. Learning Outcomes
2. Introduction
3. What is Multicollinearity
4. Consequences of Multicollinearity
5. Detection of Multicollinearity
6. Remedies of Multicollinearity
6.1 Ignore
6.2 Drop Variables
6.3 Obtain Extraneous Data
6.4 Increase Sample Size
6.5 Transform the variables
6.6 Principal Component Analysis
7. Summary
1. Learning Outcomes
After studying this module you shall be able to
Understand what happens to regression analysis when multicollinearity is present in the
regression model.
Understand what multicollinearity is and what its consequences on regression estimates
are.
Learn about tests for detecting the presence of multicollinearity.
Learn about remedies for carrying out regression analysis in the presence of
multicollinearity.
2. Introduction
Suppose, a researcher is trying to explain the consumption expenditure (y) in a group of
individuals and uses both the current income (x1) and accumulated wealth (x2) of these
individuals as explanatory variables, it is possible that while running a regression analysis, the
regression coefficient for income may turn out to be negative! Or, t−tests for the regression
coefficients of income and wealth might suggest that none of the two predictors are significantly
associated with consumption expenditure, while the F−test indicates that the model is useful for
predicting consumption expenditure.
Or for instance, in a model that is built for analyzing wage differentials between male and female
workers, if dummy variables for both males and females are included in the regression analysis,
the computer software may not be able to estimate the regression coefficients.
These types of problems arise due to presence of multicollinearity.
3. What is Multicollinearity
Multicollinearity is a statistical phenomenon and is said to be present when an independent
variable is a linear combination of some or all of the other independent variables in the model. An
example of multicollinear variables could be GDP, money supply and prices. Another example
could be body mass index and heights of children in a given age group.
If two or more explanatory variables have an exact linear relationship between them, then exact
or perfect multicollinearity is said to be present. For a k-variable regression model involving X1,
X2,….,Xk regressors (where, X1 =1 to allow for intercept term), an exact linear relationship is said
to exist if the following condition is satisfied:
a1X1 + a2X2 + … + akXk = 0     (1)
where a1, a2, …, ak are constants that are not all zero simultaneously.
The example above, pertaining to dummy variables, suffers from exact multicollinearity. Let Mi be
the dummy variable for males such that Mi = 1 when the worker is male and Mi = 0 when the
worker is female. Similarly, let Fi be the dummy variable for females such that Fi = 1 when the
worker is female and Fi = 0, when the worker is male. Obviously, Mi + Fi =1. So estimating a
regression model that contains both the dummy variables, Mi and Fi as explanatory variables will
run into the problem of exact multicollinearity.
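A small numerical sketch of the dummy variable trap just described: with an intercept plus dummies for both males and females, the regressor matrix is perfectly collinear, so X'X is singular and the OLS estimates cannot be computed. The data below are hypothetical.

```python
# The dummy variable trap: intercept + Mi + Fi gives a rank-deficient X matrix
import numpy as np

male = np.array([1, 0, 1, 1, 0, 0])
female = 1 - male                                   # Mi + Fi = 1 for every worker
X = np.column_stack([np.ones(6), male, female])

print("rank of X:", np.linalg.matrix_rank(X))       # 2, not 3
xtx = X.T @ X
print("det(X'X):", np.linalg.det(xtx))              # 0 (up to rounding): singular
# np.linalg.inv(xtx) would raise LinAlgError (or be numerically meaningless),
# so the OLS estimates of all three coefficients cannot be obtained.
```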
When the linear relationships among explanatory variables are approximate, then near
multicollinearity is said to be present in the model. This can be represented as:
a1X1 + a2X2 + … + akXk + vi = 0     (2)
The example on consumption function model suffers from near multicollinearity where current
income and wealth measure similar attributes about individuals’ ability to undertake consumption
expenditure.
One of the assumptions of a classical linear regression model is that there are no exact linear
relationships between two or more explanatory variables and that the number of observations is at
least as many as the number of independent variables. Presence of multicollinearity leads to a
violation of this assumption and has consequences related to uniqueness, statistical significance
and precision of the OLS estimators. The presence of perfect multicollinearity makes it
impossible to compute the OLS estimates of the regression model.
Note that multicollinearity does not depend on any theoretical linear relationship between the
explanatory variables. It depends on the presence of linear relationships between the regressors in
the data set. So, multicollinearity is a sample phenomenon.
There may be several reasons for the presence of multicollinearity in the data set, for example, the
explanatory variables may all have a common underlying time trend, one of the regressors may be
a lagged value of another regressor, or the data may not have been collected from a wide base so
that the independent variables may tend to vary together.
4. Consequences of Multicollinearity
When two or more regressors in a regression model are exactly linearly related, then the
regression coefficients cannot be estimated.
Using β̂2 = [(∑yi x2i)(∑x3i²) − (∑yi x3i)(∑x2i x3i)] / [(∑x2i²)(∑x3i²) − (∑x2i x3i)²]
where yi, x2i and x3i are the deviations of Yi, X2i and X3i from their respective mean values.
Since X3i = 2X2i, we can easily see that the value of β̂2 = 0/0, which is indeterminate.
When multicollinearity is not perfect, the model can be estimated with some consequences.
The numerical values of the standard error of the regression coefficients get amplified, making
the values of t ratios of one or more coefficients small and thus possibly making them statistically
insignificant. Though the hypotheses tests are still valid.
In a regression model Yi = β1 + β2X2i + β3X3i + ui     (3)
the variance of the regression coefficient β̂2 can be given as in equation (4) below:
var(β̂2) = σ² / [∑x2i² (1 − r23²)]     (4)
where x2i = X2i − X̄2 and r23 is the coefficient of correlation between X2 and X3.
When X2 and X3 are uncorrelated, r23² = 0, but when multicollinearity is present r23² becomes high
and as a result the variance of the regression parameter as given in equation (4) becomes very
high.
Consequently,
The t ratios of the coefficient becomes small and confidence interval becomes large,
tending to make the coefficient statistically insignificant and less precise.
Thus the regression coefficients and their standard errors can be very sensitive to small
changes in data in the presence of multicollinearity.
Since the OLS estimates are unbiased and consistent, the forecasts based on them are
unbiased and confidence intervals are also valid.
The covariance between the regression coefficients will be very high in the presence of
multicollinearity.
In the regression model given by equation (3), the covariance between β̂2 and β̂3 is given by
Cov(β̂2, β̂3) = −r23 σ² / [(1 − r23²) √(∑x2i² ∑x3i²)]     (5)
Consequently, the interpretation of regression coefficients becomes quite difficult. β̂2, the
regression coefficient of X2, is supposed to measure the change in the dependent variable Y due
to a change in X2, controlling for all other variables. However, since X2 and X3 are correlated, any
change in X2 is also likely to bring a change in X3, and so β̂2 will not only capture the change in Y
due to X2 but will also contain some effect of the change in X3. This implies that if one estimated
parameter β̂2 overestimates the true parameter β2, then the other estimated parameter β̂3 is likely
to underestimate the true parameter β3, and vice versa.
As a result,
• It becomes difficult to interpret the partial effect of individual regressors on the
dependent variable.
• This at times, makes model selection difficult.
The tests of hypotheses are still valid but since the t ratios of the regression coefficients are small,
the confidence intervals tend to be much wider, making the test less powerful. There is a greater
chance of the regression coefficient being statistically insignificant.
Although the F statistic may still be significant.
5. Detection of Multicollinearity
Multicollinearity is a sample phenomenon and not a problem with the model, so there are some
informal tests rather than formal tests for detecting multicollinearity.
If a pair of regressors Xk and Xj are correlated with each other, the covariance between the
estimated regression coefficients as given in equation (5) is going to be high.
When we run the OLS regression, if multicollinearity is present, the results would throw up
individual regression coefficients as statistically insignificant although the value of R2 would be
very high and the Wald F-statistic for joint significance of regression coefficients would also be
highly significant.
In the presence of multicollinearity the pair-wise correlation coefficients between the regressors are
high. However, while carrying out empirical exercises it should be kept in mind that in models
involving only two independent variables the pair-wise correlation coefficient is sufficient to test
for multicollinearity, but in models involving more than two regressors this is not a very
satisfactory test of multicollinearity because the explanatory variables can be redefined in a
number of ways and we may get drastically different measures of correlation.
If multicollinearity is suspected, then in order to find out which of the regressors are linearly
related, one may run auxiliary regressions and perform F tests.
The degrees of freedom of the F statistic are k-2 and n-k+1, where n is the number of
observations and k is the number of explanatory of variables including the intercept.
• Step 4 If the computed value of F exceeds the critical value of F at the selected level of
significance, then it indicates that Xk is collinear with other X’s
Instead of carrying out the above mentioned F test one can use Klein's Rule of Thumb, according
to which, if the R² from the auxiliary regression is greater than the overall R², then
multicollinearity is a serious problem.
Variance Inflation Factor (VIF), defined below can also be used as an indicator of the degree of
multicollinearity
VIF = 1 / (1 − Rj²)
where Rj² is the coefficient of determination from the auxiliary regression of the jth regressor on
the remaining regressors. The larger the value of VIF, the greater the degree of multicollinearity.
As a rule of thumb, if the VIF of a variable exceeds 10, it is said to be highly collinear with the
other variables.
Tolerance, defined as the reciprocal of VIF (i.e. 1 − Rj²), can also be used; a tolerance close to zero
indicates a high degree of multicollinearity.
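A sketch of the VIF calculation described above: regress each explanatory variable on the remaining ones and compute 1/(1 − Rj²). The variable names and data below are hypothetical; the helper uses plain numpy OLS.

```python
# Variance Inflation Factors via auxiliary regressions (hypothetical data)
import numpy as np

def vif(X):
    """VIF for each column of X (X should not contain the intercept column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y_j = X[:, j]
        X_j = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b = np.linalg.lstsq(X_j, y_j, rcond=None)[0]
        resid = y_j - X_j @ b
        r2_j = 1 - resid @ resid / np.sum((y_j - y_j.mean()) ** 2)
        out.append(1.0 / (1.0 - r2_j))
    return np.array(out)

rng = np.random.default_rng(2)
income = rng.normal(50, 10, 200)
wealth = 5 * income + rng.normal(0, 5, 200)      # nearly collinear with income
other = rng.normal(size=200)
X = np.column_stack([income, wealth, other])
print(vif(X))     # VIFs for income and wealth are large; tolerance = 1/VIF is near zero
```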
6.1 Ignore
1. If an analyst is less interested in interpreting individual coefficients and wants to forecast the
value of the dependent variable then the problem of multicollinearity may be ignored.
2. It is also worthwhile to ignore the problem if the regression coefficients are significant and
have meaningful signs and magnitudes.
3. Multicollinearity may also be ignored if the R² from the regression exceeds any of the R² values
obtained from the auxiliary regressions.
Multicollinearity may be eliminated by dropping one of the collinearly related variables from the
model.
But it should be kept in mind that this involves some risk of underspecifying the model, if the
variable truly belongs to the model.
Information from other sources about the variables or values of some regression coefficients from
some related studies may be incorporated in the model.
Since multicollinearity is essentially a sample phenomenon, using additional data can often solve
the problem.
The variables may be transformed. If the data are time series data, then one may reformulate the
model in first difference form. So, instead of running the regression model stated in
equation 9.3, the regression that may be estimated is
Yt − Yt−1 = β2(X2t − X2t−1) + β3(X3t − X3t−1) + vt
where vt = ut − ut−1.
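A minimal sketch of the first-difference remedy just shown, for regressors that share a common time trend. The data are hypothetical; np.diff produces Yt − Yt−1 and so on.

```python
# First differencing to reduce multicollinearity from a shared trend (hypothetical data)
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(100)
x2 = 10 + 0.5 * t + rng.normal(0, 1, 100)        # two regressors sharing a trend
x3 = 5 + 0.5 * t + rng.normal(0, 1, 100)
y = 2 + 1.0 * x2 + 0.5 * x3 + rng.normal(0, 1, 100)

dy, dx2, dx3 = np.diff(y), np.diff(x2), np.diff(x3)
D = np.column_stack([dx2, dx3])                  # no intercept in the difference form
beta = np.linalg.lstsq(D, dy, rcond=None)[0]

print("correlation of levels:", np.corrcoef(x2, x3)[0, 1])         # close to 1
print("correlation of differences:", np.corrcoef(dx2, dx3)[0, 1])  # much smaller
print("slope estimates from differences:", beta)
```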
PCA is a method by which a number of collinear variables can be transformed into a set of
uncorrelated or orthogonal variables, called principal components (PCs). These PCs are artificial
variables that are constructed in a manner such that they account for most of the variance that is
caused in the observed set of correlated variables due to the unobserved common factors between
them.
In fact this common factor is what is called a principal component and there are as many principal
components as the number of common factors.
Once the PCs have been extracted, the regression may be run on these orthogonal variables in
place of the original collinear variables.
The characteristic of these PCs is that the first component that is extracted accounts for a maximal
amount of total variance in the observed variables. Similarly, the remaining components that are
extracted in the analysis account for a maximal amount of variance in the observed variables that
was not accounted for by the preceding components, and is uncorrelated with all of the preceding
components.
It may be pointed out that in order to extract the PCs, the observed variables are standardized to
have a mean of zero and a variance of one and therefore the total variance in the data set which is
simply the sum of the variances of these observed variables, will always be equal to the number
of observed variables. This total variance is partitioned among the principal components that are
extracted.
However, one drawback of the PCA analysis is that generally it is not possible to interpret these
principal components in a meaningful way. So in order to interpret the regression coefficients in a
meaningful way, the regression coefficients of PCs are transformed back into the regression
coefficients of original variables after running the regression.
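The following is a sketch of principal components regression as described above: standardize the collinear regressors, extract orthogonal PCs, regress Y on the PCs, and map the coefficients back to the original (standardized) variables. Data are hypothetical and scikit-learn's PCA is assumed to be available.

```python
# Principal components regression on collinear regressors (hypothetical data)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(0, 0.1, 300)               # nearly collinear with x1
x3 = rng.normal(size=300)
y = 1 + 2 * x1 + 2 * x2 + 0.5 * x3 + rng.normal(size=300)

X = np.column_stack([x1, x2, x3])
Z = (X - X.mean(axis=0)) / X.std(axis=0)        # standardized regressors
pca = PCA()
PCs = pca.fit_transform(Z)                      # orthogonal principal components

# Regress y on an intercept and the PCs
A = np.column_stack([np.ones(len(y)), PCs])
gamma = np.linalg.lstsq(A, y, rcond=None)[0]

# Transform the PC coefficients back into coefficients on the standardized X's
beta_std = pca.components_.T @ gamma[1:]
print("coefficients on standardized regressors:", beta_std)
```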
7. Summary
TABLE OF CONTENTS
1. Learning Outcomes
2. Introduction
4.2.3 Proof to show biasedness and inconsistency of the variance of β̂
5. Summary
1. Learning Outcomes
After studying this module, you shall
2. Introduction
A Beginner’s perspective
Suppose a student becomes interested in gender inequality and its negative effects for
development of capabilities of women, their freedom of choice and human development, in
general. She collects data on the so called, Gender Inequality index [GII] from UNDP’s Human
Development reports for 152 countries for the year 2013. GII is a composite measure reflecting
inequality in achievement between women and men in three dimensions: reproductive health,
empowerment and economic status. It can take values between 0 and 1, lower numbers reflecting
greater gender equality.
From her data, the student finds that the mean GII is 0.3754, median 0.3850 and standard
deviation 0.1899. The student interprets the distribution of GII to be fairly symmetric since mean
is almost equal to median; with wide variation since standard deviation is almost half as large as
the mean. She notices that countries in her sample are very heterogeneous with stark variations in
socio-economic dimensions.
If this student wishes to develop an econometric model in order to determine the factors
influencing GII, she is most likely to encounter the issue of heteroscedasticity- one of the most
common violations of the classical linear regression model in cross sectional data consisting of
heterogeneous units in the sample.
Yi = β1 + β2Xi + ui,     i = 1, 2, …, n     [1]
Var(ui | Xi) = E(ui² | Xi) = σ²,     i = 1, 2, …, n     [2]
Equation [2] reflects the dispersion of the error terms around their mean zero. It is also a measure
of dispersion of the data values of the dependent variable around the linear regression line β1 +
β2Xi. The constant value σ² implies that this dispersion is the same for all observations on X.
Var(ui | Xi) = E(ui² | Xi) = σ(Xi)² = σi²,     i = 1, 2, …, n     [3]
Heteroscedasticity is a result of a data generating process that draws disturbances, for each value
of the independent variable, from distributions that have different variances. It also implies that
dispersion of the dependent variable around the regression line is not constant.
Heteroscedasticity usually arises in cross sectional data where the scale of the dependent variable
tends to vary across observations, and in highly volatile time series data. It is less common in
other time series data where values of explanatory and dependent variables are of similar order of
magnitude at all points of time.
A standard manifestation of heteroscedasticity is that the spread of actual Y values around the
linear regression line will not be constant. Also, the plot of regression residuals against, say, the
predicted Y values will exhibit some characteristic pattern instead of being random. Let us
illustrate with the following example.
Example 1- Annual salaries of professors and years since they completed their Ph.D.
Figure 1 shows the scatter plot of annual salaries against years since Ph.D. of 222 professors from
7 U.S. universities [UC Berkeley, UCLA, UC San Diego, Illinois, Michigan, Stanford and
Virginia] for the year 1995 [Ramanathan, 2002]. It can be seen that the spread of Y values around
the average straight line relation is not uniform- the variance initially increases, then decreases.
As Ramanathan points out, the salaries of recent Ph.D.’s are very competitive in the job market
and hence salary differentials are not expected to be high. Salary of tenured professors might vary
depending on their productivity and reputation. After a number of years, salary increases tend to
stabilize and hence the variance is likely to reduce. Scatter plots like these, point towards an
underlying heteroscedastic regression model.
[Figure 1: Scatter plot of annual salaries (log scale) against years since Ph.D.]
A log quadratic regression [in which log of annual salaries is regressed against a constant, years
since Ph.D. and squares of years since Ph.D.] for 222 observations yield squared residuals
presented in Figure 2. We can see that the squared residuals are first increasing then decreasing- a
pattern consistent with the behavior of annual salaries depicted in Figure 1. This residual plot is
characteristic of underlying heteroscedastic population disturbances.
The plan of the module is as follows. In the next section, we present a discussion on some of the
common reasons why heteroscedasticity is likely to be present in regression models. Section 4
explains in detail the structure of heteroscedastic variances commonly used in the literature.
Section 5 highlights the serious consequences of using ordinary least squares for estimation and
statistical inference in heteroscedastic regression models. The module summary is presented in
the last section.
[Figure 2: Squared residuals from the log quadratic regression plotted against years since Ph.D.]
If the observations in a cross section are related to heterogeneous units of different sizes/scales,
the assumption of a common disturbance variance for all observations is often violated. Consider
the following examples:
i. Suppose that the profit of a firm at a given time depends on research and development
expenditures. Since large firms engage in higher research and development expenditures
which are associated with greater risk, one can expect variation of profits around the
mean for large firms to be greater than the corresponding variance for smaller firms.
Therefore, the variances of the error terms in such regression models (even after
accounting for variation in firm sizes) are expected to be greater for large firms than for
small firms.
ii. Similarly, consider regression models for household expenditures. Households vary in
size as measured by the number of family members and the level of income. Since
households with large income have more discretion and choice with respect to
consumption, households with higher income will have greater dispersion of expenditure
around the mean as compared to lower income households. The residuals in the regression
will then be systematically related to income.
iii. Some cross section studies are based on replication of the dependent variable for given
values of the explanatory variable. Suppose we are interested in studying the effect of
fertilizer use on crop yields. For each dosage of fertilizer, a group of n plots is chosen.
As the dosage of fertilizer increases, the variance of the error terms is likely to increase,
although the error variance within a group may be constant.
All the above examples suggest that although we are running a single regression through different
heterogeneous units in the cross section of the form-
Yi = β1 + β2i Xi + ui,     i = 1, 2, …
- a regression in which the slope coefficient β2i varies across observations. [In the household
expenditures example, it means that the effect of a given change in income on household
expenditures will be different for low income and high income families.] To illustrate the concept
of heteroscedasticity, let us assume that the slope coefficients vary randomly around some fixed
value β2*. Then we can say,
β2i = β2* + εi
so, Yi = β1 + (β2* + εi)Xi + ui
      = β1 + β2*Xi + (εi Xi + ui)
      = β1 + β2*Xi + vi,     where vi = εi Xi + ui     [4]
Equation [4] shows that the error term vi will vary with Xi and will have unequal variance.
i. Often we have monthly or quarterly data on an economic variable but need to work with
yearly data. Averaging is a common method of aggregation, and heteroscedasticity can be
generated as a consequence of such data aggregation.
ii. Choice of a wrong scale: Sometimes heteroscedasticity arises because variables are
measured on a wrong scale. Measuring the variables in logs, as percentages or as ratios
reduces the scales in which the variables are measured. A homoscedastic disturbance
term in a logarithmic regression, responsible for proportional changes in the dependent
variable, may appear to be heteroscedastic in a linear regression because the absolute
changes in the dependent variable will be proportional to its size.
iii. Presence of outliers: Outliers or extreme observations in the sample make the
homoscedastic assumption difficult to maintain.
iv. Explanatory variables with a large range or skewness in their distribution: If the data on
an explanatory variable has a large range of values, it may lead to large variance of error
terms for larger values of the explanatory variables. Similarly, skewed distributions tend
to generate heteroscedasticity.
i. Omitted variables: If a researcher omits a relevant variable from the regression model, the
effect of this variable will be captured by the error term. It will appear that there is a
systematic relation between the disturbances and some exogenous variable and the
variance of the error terms will not be constant across observations.
ii. Non Linearities: The same problem occurs if a researcher estimates a linear relation
instead of a quadratic or any nonlinear model and the error term of the linear model
captures the effect of non linearities.
iii. Wrong functional form: If a linear function is fitted to an underlying quadratic or
logarithmic true model, or in general, a wrong functional form is chosen, then the
disturbances appear heteroscedastic.
E(β̂k) = βk for all k; that is, the OLS estimators remain unbiased.
ii. The ordinary least squares estimates will be inefficient i.e. they will no longer have
the minimum variance in a class of unbiased estimators and hence are not BLUE,
i.e. Var(β̂) ≥ Var(β̃),
where β̂ is the OLS estimator that ignores heteroscedasticity and β̃ is another
linear unbiased estimator that explicitly takes heteroscedasticity into account.
iii. The conventional estimator of the variance of the error term σ² is biased, i.e.
E(σ̂²) ≠ σ², where σ̂² = ∑ûi² / (n − k)
iv. The conventional formula for the OLS estimators of the variance of regression
coefficients is wrong.
v. The OLS estimators of the variances and covariances of the regression coefficients
are biased.
vi. The conventionally constructed confidence intervals are no longer valid.
vii. The t and F statistics based on the OLS regression do not follow the t and F
distribution respectively and hence standard hypotheses tests are invalid
4.1.3Effects on forecasts
Forecasts based on the ordinary least square estimates will be unbiased and
consistent but inefficient.
In this section, we will prove some of the properties of OLS estimators discussed in the previous
section. The proof of others is beyond the scope of the module.
Yi = βXi + ui,     i = 1, 2, …     [ ]
Var(ui | Xi) = σi² = σ²Wi²,     i = 1, 2, …     [ ]
The model is taken without an intercept; and the dependent and independent variables are
measured as deviations from their respective means. It is assumed that the variance of the error
term is proportional to the square of a variable W whose values are known. It is also assumed that
the model satisfies all other assumptions of the classical linear regression model.
[Ramanathan(2002)]
Note that if W takes a value 1 for all observations, we will have a case of homoscedasticity. The
OLS procedure indeed assumes that Wi = 1 for all observations and yields the estimates
β̂ = ∑XiYi / ∑Xi²,     σ̂² = ∑ûi² / (n − 1)     [ ]
Substituting for Yi from equation [11] into the formula for β̂, we get
β̂ = ∑XiYi / ∑Xi² = ∑Xi(βXi + ui) / ∑Xi² = β + ∑Xiui / ∑Xi²     [ ]
Since the Xi's are given and the error terms are assumed to have a zero conditional mean, we can
show that E(β̂) = β. Hence the OLS estimator is still unbiased. This proof requires only two of
the assumptions of the classical linear regression model, namely that the error terms have zero
conditional mean and that they are uncorrelated with the explanatory variables. Therefore, even if
heteroscedasticity is ignored, the OLS estimator is still unbiased.
Since E(Xiui) = 0, by the law of large numbers we can say that the probability limit of β̂ will be
equal to its expected value as n tends to infinity, i.e.
plim(β̂) = E(β̂) = β, so β̂ is a consistent estimator of β.
Hence, OLS estimator is consistent. Note again that this proof did not require any assumption on
the variance of the error term, so the presence of heteroscedasticity will not alter the consistency
property of OLS estimators.
5.2.3 Proof to show biasedness and inconsistency of the variance of β̂
Var(β̂) = Var(∑Xiui / ∑Xi²) = ∑Xi²σi² / (∑Xi²)² = σ²∑Xi²Wi² / (∑Xi²)²     [ ]
The conventional OLS formula, however, estimates this variance as var(β̂) = σ̂² / ∑Xi², and
E(σ̂² / ∑Xi²) ≠ Var(β̂)
The OLS estimator of variance of regression coefficient is thus biased and inconsistent.
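A Monte Carlo sketch of the properties proved above: under the variance structure Var(ui) = σ²Wi², the OLS slope stays unbiased but the conventional OLS variance formula misstates its true sampling variance. The parameter values and the choice W = X below are hypothetical.

```python
# OLS under heteroscedasticity: unbiased slope, wrong conventional variance
import numpy as np

rng = np.random.default_rng(5)
beta, n, reps = 1.5, 100, 5000
x = rng.uniform(1, 10, n)
w = x                                        # error s.d. proportional to x
slopes, conventional_var = [], []

for _ in range(reps):
    u = rng.normal(0, 1, n) * w
    y = beta * x + u                         # model without intercept, as in the text
    b = np.sum(x * y) / np.sum(x ** 2)
    resid = y - b * x
    s2 = np.sum(resid ** 2) / (n - 1)
    slopes.append(b)
    conventional_var.append(s2 / np.sum(x ** 2))

print("mean of OLS slope:", np.mean(slopes))                          # close to 1.5: unbiased
print("true sampling variance of the slope:", np.var(slopes))
print("mean conventional OLS variance:", np.mean(conventional_var))   # differs from the above
```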
5. Summary
In this module, we have discussed the nature of heteroscedasticity and its consequences for
ordinary least squares estimation.A brief summary is presented below-
TABLE OF CONTENTS
1. Learning Outcomes
2. Introduction
6. Summary
1.Learning Outcomes
2. Introduction
So far in the previous module we have seen that heteroscedasticity is a violation of one of the
assumptions of the classical linear regression model. It occurs when the disturbance term ui of all
the observations have non-constant conditional variance. It may be a result of specification errors
or data issues. For instance, in regression models with cross sectional data, especially where
the scale of the dependent variable varies across observations, heteroscedasticity is more likely to
occur. It is also observed in highly volatile time series data. In the presence of heteroscedasticity,
ordinary least squares estimators and forecasts based on them are still unbiased and consistent but
they are no longer BLUE. The estimated variances and covariances of the estimators are biased
and inconsistent; hypothesis testing procedures and statistical inference are not valid anymore.
Therefore, if we continue to use the OLS method to estimate parameters and to test hypotheses for
data suffering from heteroscedasticity, then we are likely to reach misleading conclusions. This
makes it necessary to find some diagnostic tools to check for the presence of the
heteroscedasticity problem.
The detection methods can be grouped as follows:
Informal methods: checking the nature of the problem; graphical inspection of residuals.
Formal methods: Park test; Glejser test; Spearman's rank correlation test; Goldfeld-Quandt test;
Breusch-Pagan test; White's test.
Nature of the problem is one of the simplest methods to detect the presence of heteroscedasticity.
For instance, if we take a cross sectional data on household’s consumption patterns and income
level in a locality then we find that residual variance changes for every observation. This is
because cross sectional data pools small income, medium income and large income households
together for the study. Thus, the possibility of heteroscedasticity is higher in case of cross
sectional data.
Figure 1 shows different patterns of squared residuals, ûi², plotted against the explanatory variable X.
Fig (a) shows no systematic pattern between the squared residuals and the explanatory variable;
thus there is no heteroscedasticity problem. However, figs (b) to (e) show a systematic pattern
between the squared residuals and the explanatory variable. For instance, fig (b) shows a linear
relationship between the squared residuals and the explanatory variable, while figs (d) and (e) show a
quadratic relationship between the two. Thus, figs (b) to (e) depict the possibility of
heteroscedasticity.
In case we have a multiple regression model with more than one explanatory variable, then instead
of plotting the squared residuals against each explanatory variable we can plot them simply against Ŷi,
the estimated Y. Since Ŷi is a linear combination of all the explanatory variables, we get the
same kind of graphs as above when we plot the squared residuals against Ŷi. This is shown in figure 2.
However, these graphical plots are just an indication of the problem of
heteroscedasticity. We need some formal methods to confirm the presence of the heteroscedasticity
problem.
Figure1
Figure 2
The Park method is based on the assumption that the heteroscedastic variance, σi², is some function of the
explanatory variable Xi. Therefore, to regress σi² on Xi, the following functional form is adopted:
ln σi² = α1 + α2 ln Xi + vi,1     i = 1, 2, …, n     [1]
Here, σi² is the population error variance.
However, we cannot run this regression because the population error variance σi² is unknown;
thus we use ûi² as a proxy for σi², obtaining ûi² by the steps mentioned below:
EXAMPLE 1
The following hypothetical example will enable further understanding of this test:
Let us take data on wages (per hour, in rupees), education (years of schooling) and experience
(years on the job) for 523 workers in an industry. Then, regressing wages, the dependent variable,
on education and experience as explanatory variables, we get the following regression
results:
Wagei = -4.524472 + 0.913018 Edui + 0.096810 Expi
se= (1.239348) (0.082190) (0.017719)
t= (-3.650687) (11.10868) (5.463513)
p= (0.003) (0.0000) (0.0000)
r2= .194953
The above result shows that there exists a positive relationship between wages and education and
also between wages and experience. The estimated coefficients of education and experience are also
significant, as captured by their t values of about 11 and 5 respectively.
1. The functional form adopted here is just for simplicity. One can choose some other functional form as well and the
results will be different. For instance, if the values of the explanatory variable Xi are negative then one should
not take the log of Xi, but simply regress ln ûi² on Xi.
2. Another method is to regress ln ûi² on Ŷi (the estimated Y).
However, the real problem arises because of the fact that it is a cross sectional data i.e. a sample
of 523 workers with diverse backgrounds is taken together at a given point of time. Thus, the
possibility of heteroscedasticity is higher here. When we plot squared residuals on each of the
explanatory variable i.e. education and experience or on the estimated value of wage, we get
considerable variability in the plot as depicted in earlier figures 1 and 2 [fig (b) to (e)].
Now a formal check for heteroscedasticity can be done by using Park test. Here, we regress
squared residuals on estimated value of wage and get the following results:
ln ûi² = -10.35965 + 3.467020 ln Ŵagei
se = (11.79490) (1.255228)
t = (-0.878316) (2.762063)
p = (0.3802) (0.0059)
r² = 0.014432
Now we can see that the coefficient of estimated value of wage is statistically significant as it has
very small p value. Thus, the Park test suggests the presence of heteroscedasticity.
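A sketch of the Park test steps used in the example above: fit the main regression, then regress the log of the squared residuals on the log of the fitted values and inspect the slope's t statistic. It uses statsmodels OLS; the data set below is simulated, not the 523-worker wage data from the text.

```python
# Park test on simulated wage-style data (hypothetical values)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
edu = rng.uniform(5, 20, 300)
exp_ = rng.uniform(0, 30, 300)
wage = 5 + 0.9 * edu + 0.1 * exp_ + rng.normal(0, 0.3 * edu, 300)   # heteroscedastic errors

X = sm.add_constant(np.column_stack([edu, exp_]))
main_fit = sm.OLS(wage, X).fit()

# Park auxiliary regression: ln(u_hat^2) on ln(Y_hat)
lnu2 = np.log(main_fit.resid ** 2)
Z = sm.add_constant(np.log(main_fit.fittedvalues))
park_fit = sm.OLS(lnu2, Z).fit()
print(park_fit.params, park_fit.tvalues, park_fit.pvalues)
# A significant slope suggests heteroscedasticity, as in the wage example above.
```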
NOTE OF CAUTION: The error term vi in equation [2] may itself not be homoscedastic. Thus,
we are back to the same problem again.
This test suggests that instead of taking the square of the residuals, we take the absolute value of the
estimated residuals, |ûi|, and regress it on the explanatory variable X. Glejser proposed the following
functional forms for this regression:
|ûi| = β1 + β2 Xi + vi
|ûi| = β1 + β2 √Xi + vi
|ûi| = β1 + β2 (1/Xi) + vi
|ûi| = β1 + β2 (1/√Xi) + vi
|ûi| = √(β1 + β2 Xi) + vi
|ûi| = √(β1 + β2 Xi²) + vi
For each functional form, if β2 turns out to be significant then we have a problem of
heteroscedasticity and we need to correct it. However, if β2 turns out to be insignificant then we
can interpret β1 as the homoscedastic variance, σ².
Let us continue with the same hypothetical example to get a further understanding of this test.
We now regress |ûi| on education and get the following results:
|ûi| = -3.1905 + 1.8623 √Edui
t = (-2.5068) (5.1764)
r² = 0.0489
The above result suggests that the data suffer from heteroscedasticity, as the coefficient of
education is statistically significant. Similarly, we can regress |ûi| on experience and on the estimated
value of wage.
NOTE OF CAUTION: The error term vi in the above functional forms may itself not be
homoscedastic. Also, the two functional forms mentioned below have parameters that enter
non-linearly and thus cannot be estimated by the usual OLS method:
|ûi| = √(β1 + β2 Xi) + vi
|ûi| = √(β1 + β2 Xi²) + vi
Moreover, Glejser himself found that his first four functional forms are quite satisfactory in detecting
heteroscedasticity only in large samples. In our example of 523 workers, therefore, the Glejser test
strengthens the result of the Park test.
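A sketch of the Glejser test just described: regress the absolute residuals on a chosen function of the explanatory variable (here √X, the form used in the wage example) and test the slope. The data below are simulated placeholders.

```python
# Glejser test: |u_hat| regressed on sqrt(X) (hypothetical data)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(1, 16, 400)
y = 2 + 3 * x + rng.normal(0, np.sqrt(x), 400)      # error spread grows with x

fit = sm.OLS(y, sm.add_constant(x)).fit()
abs_resid = np.abs(fit.resid)

glejser = sm.OLS(abs_resid, sm.add_constant(np.sqrt(x))).fit()
print(glejser.params, glejser.tvalues)
# A significant slope coefficient points to heteroscedasticity; an insignificant one
# lets the intercept be read as reflecting a homoscedastic spread.
```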
ûi² = α1 + α2X2i + α3X3i + α4X2i² + α5X3i² + α6X2iX3i + vi,     i = 1, 2, …, n
Here, the squared residuals are regressed on all the explanatory variables, the squares of the
explanatory variables and their cross products. vi is the residual term of this
auxiliary regression.
3. Get the R² value from the above regression and multiply it by the sample size (n). This product
asymptotically follows the chi-square distribution with degrees of freedom equal to the number of
regressors (excluding the intercept) in the auxiliary regression:
n·R² ~asy χ² with k − 1 degrees of freedom     [4]
In this model, the d.f. are 5.
4. Test the null hypothesis that all the slope coefficients are zero.
5. If the value obtained from equation [4] exceeds the critical χ² value at the chosen level of
significance, or if the p value is very low, then we reject the null hypothesis that all the slope
coefficients are zero; in other words, we have a problem of heteroscedasticity.
And if the value obtained from equation [4] is less than the critical χ² value at the chosen level of
significance, or if the p value is fairly large, then we do not reject the null hypothesis that
all the slope coefficients are zero.
Now we continue with example 1 and see how the White’s General Heteroscedasticity Test is
used.
From the above table, we find that n. R2 is 11.23102 and this is significant at 5% level. Therefore,
we reject the null hypothesis of no heteroscedasticity.
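A sketch of White's general test as described above, using the ready-made het_white routine from statsmodels (the auxiliary regression with levels, squares and cross products is built internally). The data below are simulated, not the wage data from the example.

```python
# White's general heteroscedasticity test via statsmodels (hypothetical data)
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

rng = np.random.default_rng(8)
x2 = rng.uniform(0, 10, 500)
x3 = rng.uniform(0, 10, 500)
y = 1 + 2 * x2 + 3 * x3 + rng.normal(0, 1 + 0.5 * x2, 500)

X = sm.add_constant(np.column_stack([x2, x3]))
fit = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(fit.resid, X)
print("n*R2 statistic:", lm_stat, "p value:", lm_pvalue)
# A small p value rejects the null of homoscedasticity, as in the wage example.
```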
This test is based on Spearman’s Rank Correlation coefficient. To understand this test, we assume
the following model:
Yi = β1 + β2Xi + ui,     i = 1, 2, …, n
rs = 1 − 6[∑di² / (n(n² − 1))]
here di is the difference in the ranks allocated to two different attributes of the ith
individual and n is the total number of observations ranked.
3. Now use the t test to check the null hypothesis of no heteroscedasticity. It is assumed that
the population rank correlation coefficient ρs is zero and n > 8, with df = n − 2:
t = rs√(n − 2) / √(1 − rs²)     [5]
4. If the computed t value > critical t value then we reject the null hypothesis of no
heteroscedasticity and if the computed t value < critical t value then we do not reject the
null hypothesis and there is no heteroscedasticity problem.
Note: in the case of a multiple regression model, i.e. with more than one explanatory variable,
rs between |ûi| and Xi can be calculated separately for each Xi, and testing for statistical
significance can be done by using the t test in [5].
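A sketch of the Spearman rank correlation test described above: compute rs between |ûi| and X, then use the t statistic in [5] with n − 2 degrees of freedom. scipy's spearmanr is used for the rank correlation; the data below are simulated.

```python
# Spearman rank correlation test for heteroscedasticity (hypothetical data)
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(9)
x = rng.uniform(1, 20, 200)
y = 4 + 1.5 * x + rng.normal(0, 0.4 * x, 200)

fit = sm.OLS(y, sm.add_constant(x)).fit()
r_s, _ = stats.spearmanr(np.abs(fit.resid), x)

n = len(x)
t_stat = r_s * np.sqrt(n - 2) / np.sqrt(1 - r_s ** 2)
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 2))
print("r_s:", r_s, "t:", t_stat, "p:", p_value)
# A large t (small p) rejects the null hypothesis of no heteroscedasticity.
```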
This test is based on the assumption that the heteroscedastic variance, σi², is positively related to one
of the explanatory variables, Xi. To understand this test, consider the following model:
Yi = β1 + β2Xi + ui,     i = 1, 2, …, n
As per the assumption, let us say σi² = σ²Xi², where σ² is a constant. This states that σi² is proportional to
the square of the variable X; in other words, the larger the value of Xi, the larger σi² would be. Hence,
this relation makes the presence of heteroscedasticity more likely in the model.
Now to have a better idea of this method, follow the below mentioned steps:
1. Arrange the values of Xi from lowest to highest.
2. Remove c central observations, where c is specified a priori, and then divide the remaining
(n − c) observations into two sets, each with (n − c)/2 observations.
3. Fit two separate regressions for the two sets divided above and obtain their respective
residual sums of squares RSS1 and RSS2. Here, RSS1 represents residual sums of squares
for the first set with smaller Xivalues (the small variance group) and RSS2 represents
residual sums of squares for the second set with larger Xivalues (the large variance
group).
The degrees of freedom of these residual sums of squares (RSS) is (n− c) / 2 − k or (n− c
− 2k) / 2. Here, k is the number of parameters (including intercept) to be estimated.
4. Next, find the ratio λ = (RSS2/df) / (RSS1/df)     [6]
5. Now, λ of [6] will follow the F distribution with degrees of freedom of both numerator and
denominator equal to (n − c − 2k)/2, if ui is normally distributed and the variance is
homoscedastic.
6. So, for a particular regression result, if the computed λ (= F) exceeds the critical F at the chosen
level of significance, then we reject the null hypothesis of no heteroscedasticity.
NOTE: The aim of removing the c central observations is to sharpen the distinction between the small
variance group and the large variance group. The results of this test depend on how the value of c is
chosen, which is a major limitation of this test. Also, in the case of a multiple regression model with
more than one X variable, the rank ordering of the observations can be done with respect to any one of
the explanatory variables. If it is unclear which Xi to choose, the test can be run on each Xi in turn and
the results compared.
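A sketch of the Goldfeld-Quandt steps listed above, using the ready-made het_goldfeldquandt routine from statsmodels (observations are ordered by the chosen X, c central observations are dropped, and the two residual sums of squares are compared by an F ratio). The data and the choice c = 40 are hypothetical.

```python
# Goldfeld-Quandt test via statsmodels (hypothetical data, c = 40 of 200 dropped)
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

rng = np.random.default_rng(10)
x = rng.uniform(1, 30, 200)
y = 2 + 0.8 * x + rng.normal(0, 0.3 * x, 200)

X = sm.add_constant(x)
order = np.argsort(x)                               # arrange observations by X
f_stat, p_value, _ = het_goldfeldquandt(y[order], X[order], drop=40 / 200)
print("lambda (F):", f_stat, "p value:", p_value)
# F larger than the critical value (small p) rejects the null of homoscedasticity.
```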
As we have seen in the previous test, Goldfeld–Quandt test, its major drawbacks were finding
correct X, explanatory variable, about which arrangement of the values was done and also
deciding which c is to be removed. These limitations can be overcome by using this test. To
demonstrate this test, first consider the following linear regression model:
Yi = β1 + β2X2i + ··· + βkXki + ui,     i = 1, 2, …, n     [7]
Also assume that the error variance σi² is a linear function of some non-stochastic variables Z's, in
the form σi² = f(α1 + α2Z2i + ··· + αmZmi).
Now, if α2 = α3 = ··· = αm = 0, then σi² = α1, which is a constant. Therefore, the null hypothesis of no
heteroscedasticity is α2 = α3 = ··· = αm = 0.
Now to have a better idea of this method, follow the below mentioned steps:
1. Run the regression in equation [7] and obtain the residuals û1, û2, …, ûn.
2. Get the maximum likelihood (ML) estimator of σ², which is σ̃² = ∑ûi²/n.
3. Define a variable pi = ûi² / σ̃².
4. Now run the following regression: pi = α1 + α2Z2i + ··· + αmZmi + vi,     i = 1, 2, …, n     [8]
here vi is the residual term.
5. Obtain the ESS (explained sum of squares) from [8] and define Θ as half of the ESS.
6. It can be shown that if there is no heteroscedasticity, and if the sample size n increases
indefinitely (under the assumption that the ui are normally distributed), then
Θ ~asy χ² with m − 1 degrees of freedom.
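The following sketch follows the Breusch-Pagan steps listed above directly: compute the ML estimate of σ², construct pi, regress pi on the Z's, and take Θ as half the explained sum of squares. Here the Z's are simply the regressors themselves; the data are simulated placeholders.

```python
# Breusch-Pagan test computed step by step (hypothetical data)
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(11)
x2 = rng.uniform(0, 10, 300)
x3 = rng.uniform(0, 10, 300)
y = 1 + 2 * x2 + 0.5 * x3 + rng.normal(0, 1 + x2, 300)

X = sm.add_constant(np.column_stack([x2, x3]))
fit = sm.OLS(y, X).fit()

sigma2_ml = np.sum(fit.resid ** 2) / len(y)      # ML estimator of sigma^2
p = fit.resid ** 2 / sigma2_ml                   # the p_i variable
Z = X                                            # Z's taken as the regressors here
aux = sm.OLS(p, Z).fit()
theta = aux.ess / 2                              # half the explained sum of squares
df = Z.shape[1] - 1                              # m - 1 degrees of freedom
p_value = 1 - stats.chi2.cdf(theta, df)
print("theta:", theta, "p value:", p_value)      # small p rejects homoscedasticity
```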
6. Summary
In this module, we have discussed different diagnostic tools to detect the presence of
heteroscedasticity. We have categorized these tools into two groups: informal methods and
formal methods. Informal methods include detection of nature of the problem and graphical
inspection of residuals. These informal methods just give a clue to the presence of
heteroscedasticity, which can then be confirmed by using formal methods such as the Park
test, the Glejser test, White's test, Spearman's rank correlation test, the Goldfeld-Quandt test and
the Breusch-Pagan test. All these methods, with their different steps, have some advantages and
disadvantages relative to one another.
2.Introduction
What is Autocorrelation?
One of the assumptions of the classical linear regression model is that conditional on X,
the successive error terms are independently distributed.
If the regression model is Yi = β1 + β2X2i + … + βkXki + ui,
then the above assumption implies Cov(ui, uj) = E(ui uj) = 0, for all i ≠ j.
This property of the error term is known as serial independence or no autocorrelation.
3.Meaning of Autocorrelation
In an intuitive sense,when no autocorrelation is present, the magnitude and sign of the
error term for some observation should not influence the sign and magnitude of the error
term in the adjacent observation(s). This implicitly implies that the different observations
in a data set are unrelated to each other. So for example, if we have data on income and
consumption expenditure of individuals, then the consumption expenditure of the ith
individual has no influence on the consumption expenditure of the (i+1)st individual.
Presence of autocorrelation, on the other hand implies that the various observations are
correlated to each other. So the observations in the data set may not only be treated as
data points but also as predictors for the subsequent observations.
For example, in a study of output, an unexpected event like breakdown of machinery will
decrease the output for a particular observation. But this negative disturbance for one
observation should be no reason to have a negative disturbance for the neighbouring
observations. So the influence of a random disturbance should not persist over time. But if a
hangover of the disturbance term is observed from one observation to the neighbouring
observations, then autocorrelation is said to be present.
We know that when autocorrelation is present, E(ut ut−s) = γs, where s = 0, ±1, ±2, …, is the
length of the lag. At lag 0, E(ut²) = γ0 is the constant variance term σ². Defining the autocorrelation
coefficient ρs = γs/γ0,
∴ Cov(ut, ut−s) = γs = σ²ρs
The symbol “±” before the number of lags indicates that the γs and ρs are symmetrical in s
and it does not matter whether the lag is t+s or t-s. So these coefficients are constant over
time and depend only on the length of the lag.
For a sample with n observations, the variance- covariance matrix of the disturbance term
may be written as follows:
∴ Var(u) =
[ γ0      γ1      …    γn−1 ]
[ γ1      γ0      …    γn−2 ]
[ …       …       ⋱    …    ]
[ γn−1    γn−2    …    γ0   ]
Estimation of this matrix is not possible since the number of unknowns is greater than the
number of observations. In the literature, a structure is imposed on the disturbance terms.
ut = φ1ut−1 + εt     (1)
where φ1 is some parameter such that |φ1| < 1, and εt is a white noise disturbance term
with the usual properties, i.e.
Usually, an AR (4) process is apt for describing the disturbance term in seasonally non
adjusted quarterly data. Similarly, AR(12) may be quite suitable for describing the
disturbance term in seasonally non adjusted monthly data
A moving average process is one where u is a function of ε. A first order moving average
process, MA(1), is given by
ut = θ1εt−1 + εt
where εt ∼ N(0, σε²) and cov(εt, εs) = 0 for t ≠ s.
For an AR(1) process as defined in (1), the autocorrelation function (ACF) is given by
ρs = φ1^s,     s = 0, 1, 2, …
Stationary conditions ensure that the ACF is a decaying function of s. A plot of the ACF
against s is known as a correlogram.
[Figure 1: Plots of the disturbance term ut against time t, illustrating positive autocorrelation, no
autocorrelation and negative autocorrelation.]
Although, autocorrelation can be present in both time series and cross section data,it is
found more often in time series data. The reason for this is that when the dependent and
independent variables are observed as a sequence of observations over time then they
tend to be correlated with each other over time or some time related phenomenon may be
influencing both the variables over time causing them to co-vary.
There are several factors causing autocorrelation. According to Greene WH (2000), some
of these factors are due to a flawed model specification and have been explained in
Reasons 4.1, 4.2 & 4.3 below. There are some other factors which give rise to serial
correlation as an anticipated part of the model; as listed at 4.4, 4.5 &4.6 below.
When some relevant explanatory variables have been omitted while specifying the model, their
effect is captured by the residual term. As a result, the error term, instead of being random, will
exhibit a systematic pattern generated by the combined effect of the omitted variables. This creates
spurious autocorrelation. Thus a model specification bias is likely to generate an autocorrelated
error term.
Suppose, the dependent variable in a given time period is a function of its own value in
the previous time period (causing the dependent variable to be a function of a lagged
value of its own self), for example,
Consumptiont = β1 + β2Incomet + β3Consumptiont−1 + ut
If the lagged value is erroneously dropped from the model, then the error term would also
include the impact of lagged consumption and insteadof being random will exhibit a
systematic pattern. So, autocorrelation is likely to be present in autoregressive models.
(iv) Inertia
When a variable responds to another variable with a time lag, it reflects what is known as
the Cobweb phenomenon. For example, when the supply of agricultural products reacts
to price changes with a one period time lag,
Supplyt = β1 + β2Pricet−1 + ut
Then the disturbance term, ui, is not expected to be random. In fact, the error term
displays a pattern of being positive in one period (when the farmers under produce)
followed by a negative value in the successive year (when the farmers overproduce),
followed by a positive value and so on.
Sometimes the empirical relationship being studied is such that autocorrelation gets built
into the system. For example, when the expectations-augmented Phillips curve with
adaptive expectationsis used to explain the phenomenon of inflation, then the expected
inflation in the current year is forecast as the value of inflation in the previous year.
Naturally, the residual in any one time period becomes dependent on the value of the
residual in the previous time period.
Another such example is the returns from hedge funds. The returns from hedge funds are
strongly correlated with their values in the past because these securities are not very
actively traded in the stock market and their market prices are not always readily
available so their reported returns are calculated on the basis of their past returns, leading
to correlated observations across time.
(vii)Data Transformation
Sometimes the given observations are transformed to generate a data set which is more
suitable for the desired empirical analysis.
Using averages
A quarterly series may be produced from a monthly series using the process of
averaging.
Difference equations
If a model is given as Yt = β1 + β2Xt + ut     (2)
then it must also be the case that
Yt−1 = β1 + β2Xt−1 + ut−1     (3)
Equations 2 & 3 are known as level equations. A difference equation may be obtained by
subtracting equation 3 from equation 2:
Yt − Yt−1 = β2(Xt − Xt−1) + (ut − ut−1)     (4)
It can be shown that even if the error term does not exhibit serial correlation in equations 2 & 3,
the error term in equation 4 will be serially correlated.
All these data transformation activities may cause the error terms to be correlated with
each other.
If either the dependent variable or the independent variable or both are non-
stationary2then the error term in the regression model will exhibit serial correlation.
Sometimes the variables in the data are measured inaccurately so that the reported values
are different from the true values.If the dependent variable is suffering from measurement
error, then this effect willget incorporated into the disturbance term. So, each successive
value of the disturbance term will exhibit some systematic relationship with the
contiguous error term, causing the model to suffer from autocorrelation.
If Yi* = Yi + vi
6. Nature of Autocorrelation
Autocorrelation may be classified as positive or negative, depending on the relationship
between successive disturbance terms.
If the covariance between ut and ut-1 is positive, then autocorrelation is said to be positive.
This would also mean that a positive disturbance term is followed by positive disturbance
term and after a prolonged time (or sequence of observations), the disturbance term
becomes negative and again there is a lingering effect so that for a long sequence of
observations,a negative disturbance term is followed by another negative disturbance.
If the covariance between ut and ut-1 is negative then a positive disturbance term is
followed by negative disturbance term and the negative disturbance term is followed by a
positive disturbance. In this case the autocorrelation is said to be negative.
7. Order of Autocorrelation
The order of autocorrelation is determined on the basis of the relationship between the
successive error terms. The order of autocorrelation depends upon the number of lags
with which the disturbance in a give time is a function of the disturbance in the previous
time period(s).
When the disturbance in any one time period is a function of the disturbance with one
time period lagged, it is called first order autocorrelation. The relationship between a pair
of successive disturbance terms is given as
ut = ρut−1 + εt
where ρ is called the coefficient of autocorrelation, such that |ρ| < 1, and εt is identically and
independently normally distributed with mean 0 and variance σε², for all t. This process of
generating the disturbance term as a function of its own one period lagged value is also
called the first order autoregressive process, AR(1). The series εt is known as a white noise
series with zero mean.
It is necessary for the absolute value of ρ, the coefficient of autocorrelation, to be less
than one for the stability of the autoregressive process. If the absolute value of ρ is
greater than one, then the series would be explosive. The magnitude of ρ indicates the
strength of autocorrelation: the closer the magnitude of ρ is to 1, the stronger the
autocorrelation. Further, if ρ > 0, autocorrelation is positive and if ρ < 0, autocorrelation
is negative.
The longer the time period of the relationship between the disturbance term in the current
time period and the disturbance term in the previous time periods, the higher is the order
of autocorrelation. The order of autocorrelation would depend on the nature of data.
Generally, for example, in case of quarterly data, a fourth order autocorrelation is more
likely to occur and in case of monthly data a 12th order autocorrelation is more likely.
In general the autocorrelation may be of any order, and a pth order autoregressive process
AR(p) may be written as:
ut = φ1ut−1 + φ2ut−2 + … + φput−p + εt
The assumptions about the distribution of εt and the magnitude of the φi remain as in the AR(1)
process.
The presence of autocorrelation has some consequences on the estimates of the regression
coefficients. These may be stated as follows:
We know that any observed value of Y is equal to the sum of the expected value of Y and
the error term. In the presence of positive autocorrelation, the positive error term (ut) in
each observation will be followed by a positive ut, so once an observed value of Y is
found to be above the true value, it will continue to remain above the true value. So, the
observed values of Y will continue to get overestimated. Similarly when ut is negative, it
will continue to be negative, so if an observed value of Y lies below the true value, it will
continue to stay there, making the observed value of Y to be continually underestimated.
This would imply that, although the estimated values of Y are not likely to be biased
(overestimates and underestimates are likely to cancel each other on the balance). The
estimates will be characterized by large variances.
2. The variances of the regression coefficients are also biased and inconsistent.
We know that β̂OLS = ∑xiyi / ∑xi²
∴ Var(β̂OLS) = σ² / ∑xi²
So, when no autocorrelation is present, ρ = 0, Var(𝛽̂ ) is unbiased.
If ρ > 0, i.e. when positive autocorrelation is present,then OLS will underestimate the
true variance.
If negative autocorrelation is present, i.e. ρ < 0, then OLS will overestimate the true
variance.
When the true variance of the regression coefficient is underestimated, the t values of the OLS
estimates are larger than they should be. As a result, one might conclude that the variables are
statistically significant when they are not (type I error). On the other hand, when the true variance
of the regression coefficient is overestimated, the t values of the OLS estimates are smaller than
they should be. As a result, one might conclude that the variables are statistically insignificant
when actually they are significant (type II error). So, the t tests are not reliable.
4. The estimate of the variance of the error term is also biased. This can be shown in
the following manner. For a simple linear regression model, if variance of each ui
is given as 𝜎2𝑢 , then, its estimator, 𝜎̂𝑢2 , is given as
σ̂u² = ∑ei² / (n − 2)
Since E(∑ei²) = (n − 2)σu², σ̂u² becomes an unbiased estimator of σu².
But if autocorrelation is present, then for a large n,
E(∑ei²) = σu²[n − (1 + 2ρ·∑xtxt−1/∑xt² + 2ρ²·∑xtxt−2/∑xt² + … + 2ρ^(n−1)·∑xtxn/∑xt²)]
and if xt itself follows a first order autoregressive scheme with parameter λ, this reduces to
E(∑ei²) = σu²(n − (1 + ρλ)/(1 − ρλ))
in which case the estimator σ̂u² becomes a biased estimator of σu².4
9. Summary
2. Detection of Autocorrelation
The presence of autocorrelation can be detected by looking at the residual plot. The graph
of estimated residuals may be plotted in several ways.
We can plot the residuals against time
We can plot the standardized residuals against time. The residuals are standardized by
dividing them by the standard error of the regression:
estd,i = ei / √σ̂u²
These standardized residuals have mean zero and approximately unit variance. For large
n, they are also approximately normally distributed.1
We can also plot the residuals against their lagged values. So, for an AR (1) scheme, et
may be plotted against et-1.
1
Gujarati D & Sangeetha, Basic Econometrics, Tata McGraw Hill India, 4th ed., 2007, p. 474.
Figure 1
Figure 2
Figure 3
It is a non-parametric test because it does not invoke any assumptions about the underlying
distribution of the disturbance term. This test is also called Geary test after, R C Geary who
first proposed it in 19702. A run may be defined as the continuous sequence of a “+” or “–
” attribute in the residuals. The length of a run may be defined as number of elements in it.
For example, we may observe residuals with the following pattern of signs:
(- - - - - - - - )(+ + + + + + + + + + + + +)( - - - - - - - - - )
Here, we observe 8 negative residuals, followed by 13 positive residuals, followed by 9 negative residuals, i.e. 3 runs in a total of 30 observations. When positive residuals tend to be followed by positive residuals and negative residuals by negative residuals, the number of runs will be small, suggesting positive autocorrelation; if the residuals change sign very frequently, the number of runs will be large, suggesting negative autocorrelation.
² Geary, R. C., "Relative efficiency of count sign changes for assessing residual autoregression in least squares regression", Biometrika, Vol. 57, 1970, pp. 123–127.
Let N = N₁ + N₂ denote the total number of observations, where N₁ is the number of "+" residuals and N₂ the number of "–" residuals, and let R be the number of runs. Under the null hypothesis that the residuals are random, R is approximately normally distributed with
$$\text{Mean: } E(R) = \frac{2N_1 N_2}{N} + 1$$
$$\text{Variance: } \sigma_R^2 = \frac{2N_1 N_2 (2N_1 N_2 - N)}{N^2 (N-1)}$$
The null hypothesis to be tested is that the residuals are random. If the observed R falls outside, say, the 95% confidence interval E(R) ± 1.96σ_R, the hypothesis of randomness is rejected. A table of critical values for the number of runs is also available for cases where N₁ or N₂ is less than 20.
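A minimal sketch of the runs test based on the E(R) and σ_R² formulas above; the function name runs_test is hypothetical, and the example applies it to the sign pattern given in the text.

```python
import numpy as np
from scipy.stats import norm

def runs_test(resid):
    """Geary runs test on the signs of the residuals; returns (R, E(R), sigma_R, z, p)."""
    signs = np.sign(resid)
    signs = signs[signs != 0]                      # drop exact zeros, if any
    n1 = np.sum(signs > 0)                         # number of "+" residuals
    n2 = np.sum(signs < 0)                         # number of "-" residuals
    big_n = n1 + n2
    runs = 1 + np.sum(signs[1:] != signs[:-1])     # number of runs = sign changes + 1
    mean_r = 2 * n1 * n2 / big_n + 1
    var_r = 2 * n1 * n2 * (2 * n1 * n2 - big_n) / (big_n ** 2 * (big_n - 1))
    z = (runs - mean_r) / np.sqrt(var_r)
    p = 2 * (1 - norm.cdf(abs(z)))                 # two-sided p value
    return runs, mean_r, np.sqrt(var_r), z, p

# Example with the sign pattern given in the text: 8 "-", 13 "+", 9 "-"
pattern = np.r_[-np.ones(8), np.ones(13), -np.ones(9)]
print(runs_test(pattern))   # only 3 runs, far below E(R), so randomness is rejected
```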
Durbin & Watson have defined the d statistic, based on the estimated residuals of the
regression
$$d = \frac{\sum_{t=2}^{n}(\hat u_t - \hat u_{t-1})^2}{\sum_{t=1}^{n}\hat u_t^2} \qquad (1)$$
Assumptions underlying the d statistic
1. The regression model includes an intercept term.
2. The explanatory variables are non-stochastic.
3. The disturbance term u_t is generated by a first order autoregressive scheme, u_t = ρu_{t−1} + ε_t.
4. The disturbance term u_t is normally distributed.
5. The regression model is not autoregressive, i.e., it does not include lagged values of the dependent variable among the explanatory variables.
6. There are no missing observations in the data.
Since ∑û_t² and ∑û_{t−1}² differ in only one observation, they are approximately equal, and the d statistic in (1) may be written as
$$d \approx 2\left(1 - \frac{\sum \hat u_t \hat u_{t-1}}{\sum \hat u_t^2}\right) \qquad (2)$$
Defining the estimated first order autocorrelation coefficient as
$$\hat\rho = \frac{\sum \hat u_t \hat u_{t-1}}{\sum \hat u_t^2} \qquad (3)$$
we obtain
$$d \approx 2(1 - \hat\rho) \qquad (4)$$
Since −1 ≤ ρ̂ ≤ 1, it follows that 0 ≤ d ≤ 4.  (5)
To summarize, if no autocorrelation is present in the model, the d statistic will take a value close to 2. If positive autocorrelation is present, the d statistic will take a value close to 0. If negative autocorrelation is present, the d statistic will take a value close to 4.
However, it is important to note that the sampling distribution of d under the null hypothesis of no autocorrelation depends upon the values of the explanatory variables, so the critical values of d also depend on the values of the explanatory variables. To overcome this problem, Durbin and Watson derived lower and upper bounds, dL and dU, whose distributions do not depend on the values of the explanatory variables and which bracket the true critical value of d, i.e., dL < d* < dU. Using this property, the D-W d test can be carried out to test for autocorrelation.
The values for the lower and upper bounds of d are contained in the Durbin Watson
statistical tables.
Steps in D-W d Test
Estimate the model by OLS, compute the d statistic from the residuals, and look up the bounds dL and dU in the Durbin Watson tables for the given sample size and number of explanatory variables. Then apply the following decision rules:
Hypothesis (i): H0: no positive autocorrelation
   Reject H0 if 0 < d < dL
   No decision if dL ≤ d ≤ dU
Hypothesis (ii): H0: no negative autocorrelation
   Reject H0 if 4 − dL < d < 4
   No decision if 4 − dU ≤ d ≤ 4 − dL
Hypothesis (iii): H0: no positive or negative autocorrelation
   Do not reject H0 if dU < d < 4 − dU
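A minimal sketch of the d test follows. The bounds dL = 1.50 and dU = 1.59 are illustrative values only and would in practice be read from the Durbin Watson tables for the given sample size and number of regressors; the data are simulated to keep the example self-contained.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Simulated data with AR(1) disturbances, only to make the sketch self-contained
rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u

res = sm.OLS(y, sm.add_constant(x)).fit()
d = durbin_watson(res.resid)    # same as sum((e_t - e_{t-1})^2) / sum(e_t^2)

dL, dU = 1.50, 1.59             # illustrative bounds; read the real ones from the D-W table
if d < dL:
    print(f"d = {d:.3f}: reject H0 of no positive autocorrelation")
elif d < dU:
    print(f"d = {d:.3f}: test inconclusive")
elif d < 4 - dU:
    print(f"d = {d:.3f}: do not reject H0 of no autocorrelation")
elif d < 4 - dL:
    print(f"d = {d:.3f}: test inconclusive")
else:
    print(f"d = {d:.3f}: reject H0 of no negative autocorrelation")
```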
When the regression model contains lagged values of the dependent variable among the regressors, the D-W d test is not applicable. For such models Durbin proposed the h statistic,
$$h = \left(1 - \frac{d}{2}\right)\sqrt{\frac{n}{1 - n\,\widehat{\mathrm{var}}(\hat\beta_1)}} \;\sim\; \text{approx. } N(0, 1)$$
where n is the sample size and var(β̂₁) is the estimated variance of the coefficient of the lagged dependent variable. The h test cannot be used when n·var(β̂₁) ≥ 1.
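A minimal sketch of the h test for an autoregressive model of the assumed form Y_t = α + βY_{t−1} + γX_t + u_t; the data and variable names are simulated/hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from scipy.stats import norm

# Simulated autoregressive model, only to make the sketch self-contained
rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 1.0 + 0.5 * y[t - 1] + 2.0 * x[t] + rng.normal()

Y, Ylag, X = y[1:], y[:-1], x[1:]
exog = sm.add_constant(np.column_stack([Ylag, X]))
res = sm.OLS(Y, exog).fit()

d = durbin_watson(res.resid)
var_b1 = res.bse[1] ** 2        # estimated variance of the coefficient on Y_{t-1}
m = len(Y)
if m * var_b1 < 1:              # h is undefined otherwise
    h = (1 - d / 2) * np.sqrt(m / (1 - m * var_b1))
    p = 2 * (1 - norm.cdf(abs(h)))
    print(f"h = {h:.3f}, two-sided p value = {p:.3f}")
else:
    print("n*var(beta1) >= 1: the h test cannot be used; use Durbin's m test instead")
```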
Durbin (1970) suggested the m test to overcome this limitation of the h test. In the m test, the estimated residuals û_t are regressed on û_{t−1} together with the explanatory variables of the original model, and the t test for the coefficient of û_{t−1} is used to carry out the following hypothesis tests.
(i) Test for positive autocorrelation
H0: ρ = 0 versus H1: ρ > 0
Null hypothesis states that there is no positive autocorrelation
(ii) Test for negative autocorrelation
H0: ρ = 0 versus H1: ρ < 0
Null hypothesis states that there is no negative autocorrelation
Wallis proposed a test for fourth order autocorrelation, which is relevant, for example, for quarterly data. If the disturbance term is characterized by fourth order autocorrelation, then
u_t = φ₄u_{t−4} + ε_t
The Breusch-Godfrey LM test overcomes two main limitations of the D-W test, namely:
(i) it allows the disturbance term to follow either an autoregressive or a moving average process, and
(ii) it allows testing for higher order autocorrelation.
Hypothesis in LM Tests
The null hypothesis is that there is no autocorrelation of any order
Steps in LM test
1. Estimate the model by the OLS method and obtain the estimated residuals û_t.
2. Run the auxiliary regression
   û_t = α₁ + α₂X_t + ρ̂₁û_{t−1} + ρ̂₂û_{t−2} + … + ρ̂_p û_{t−p} + ε_t³   (8)
3. Obtain R² from this auxiliary regression.
4. The test variate is (n − p)R².
5. For a large value of n, (n − p)R² ∼ χ²_p.
6. The hypothesis test is a test of the joint significance of the first p autocorrelations of the disturbance term:
   H0: ρ₁ = ρ₂ = … = ρ_p = 0 (no autocorrelation)
   H1: at least one ρ ≠ 0 (autocorrelation present)
The same test statistic is used whether the alternative is an autoregressive or a moving average error process.
³ This auxiliary regression uses only n − p observations, since the first p lagged residuals are not available.
Advantage of the LM test
The test can be used whether the autocorrelated disturbance term in the model follows an autoregressive or a moving average process.
Limitation of the LM test
A limitation of the test is that the length of the lag p cannot be specified a priori. The lag length has to be determined by inspecting the t statistics on each lagged residual in the auxiliary regression.
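A minimal sketch of the LM test using statsmodels; the lag length p = 4 is an arbitrary illustrative choice, and acorr_breusch_godfrey reports an nR²-type statistic that is asymptotically equivalent to the (n − p)R² variate described above.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# Simulated data with AR(1) disturbances, only to make the sketch self-contained
rng = np.random.default_rng(4)
n = 150
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u

res = sm.OLS(y, sm.add_constant(x)).fit()
lm_stat, lm_pval, f_stat, f_pval = acorr_breusch_godfrey(res, nlags=4)
print(f"LM statistic = {lm_stat:.3f}, p value = {lm_pval:.4f}")
# A small p value leads to rejection of H0: rho_1 = ... = rho_p = 0
```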
6. Portmanteau Q Tests
The Box-Pierce Q statistic is based on the first h sample autocorrelation coefficients ρ̂_k of the residuals:
$$Q = n\sum_{k=1}^{h}\hat\rho_k^2$$
Under the null hypothesis of no autocorrelation up to lag h, Q is asymptotically distributed as χ² with h degrees of freedom.
Limitations
It has been found that the Q statistic in the Box-Pierce test performs very well with large samples but not with small samples.
The Q statistic of the Box-Pierce test has therefore been modified so that it performs well in small samples. The modified statistic is known as the Ljung Box Q statistic:
$$Q_{LB} = n(n+2)\sum_{k=1}^{h}\frac{\hat\rho_k^2}{n-k}$$
If the residuals are white noise, the Ljung Box Q statistic also follows a χ² distribution with h degrees of freedom, just like the Box-Pierce Q statistic.
The Q statistic can be applied to any type of ARIMA⁴ model. But when it is applied to the residuals of an estimated model, the degrees of freedom of the test statistic have to be adjusted by subtracting the number of estimated parameters from the number of lags h.
Hypothesis in Q Tests
The null and alternative hypotheses are the same as those in the Breusch-Godfrey LM test.
Steps in Q test
1. Estimate the model by OLS method and obtain the estimated residuals, 𝑢̂𝑡
2. Estimate the autocorrelation function 𝜌̂
3. Compute the test variate Q
4. Q ∼ 𝜒ℎ2
5. The hypothesis test is a test of joint significance of first p autocorrelations of these
disturbance terms. H0: ρ1 = ρ2 =. . . . ρn = 0 (No autocorrelation)
H1: At least one of the ρ ≠ 0 (Autocorrelation present)
6. If the test statistic exceeds the critical value of χ2 at the chosen level of significance, then
the null hypothesis is rejected.
⁴ The Autoregressive Integrated Moving Average (ARIMA) model is a generalization of the ARMA (Autoregressive Moving Average) model.
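A minimal sketch of the Ljung Box (and Box-Pierce) Q test on OLS residuals using statsmodels; h = 8 lags is an arbitrary illustrative choice, and the data are simulated to keep the example self-contained.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox

# Simulated data with AR(1) disturbances, only to make the sketch self-contained
rng = np.random.default_rng(5)
n = 150
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u

res = sm.OLS(y, sm.add_constant(x)).fit()
# Ljung-Box and Box-Pierce Q statistics for the first 8 residual autocorrelations
print(acorr_ljungbox(res.resid, lags=[8], boxpierce=True))
# Small p values lead to rejection of H0 of no autocorrelation up to lag 8
```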
There are several methods for detecting the presence of autocorrelation. The graphical
method and runs method are non-parametric tests. Durbin Watson test and Durbin's m and
h test are parametric tests that can be used for detection of first order autocorrelation. Wallis
test, LM test and Portmanteau tests are used for detection of higher order autocorrelation.
The Durbin Watson test is one of the most widely known tests for autocorrelation, but it suffers from some limitations: in some cases the test is inconclusive; it can only be used when the regression model has an intercept term and does not include lagged dependent variables; and it can only be used when the error term follows a first order autoregressive scheme. Durbin's h test helps in testing for autocorrelation in the presence of lagged dependent variables in the regression model. The LM test can be used when the error term is generated by an autoregressive or a moving average process.
TABLE OF CONTENTS
1. Learning Outcomes
2. Remedies for Autocorrelation
2.1 Model Specification
2.2 GLS Method
2.3 FGLS Method
2.3.1 First Difference Method
2.3.2 Durbin Watson Method
2.3.3 Cochrane Orcutt Iterative Method
2.3.4 Hildreth Lu Grid Search Method
2.4 Newey West method of Correcting Standard Error
3. Summary
1. Learning Outcomes
After studying this module, you shall be able to
¹ Since this hypothesis is nonlinear, a Wald, LR or LM test has to be used to test it.
2.2 GLS Method
If autocorrelation is present despite correct specification of the model, then the Generalised Least Squares (GLS) regression method should be used. GLS is a very useful regression method in which the model is transformed so that it satisfies the assumptions of the classical linear regression model (CLRM), after which the OLS procedure may be used to estimate it. The GLS procedure is applicable only when ρ is known.
Suppose the regression model is
   Y_t = β₁ + β₂X_t + u_t   (1)
such that
   u_t = ρu_{t−1} + ε_t,  −1 < ρ < 1   (2)
This method exploits the fact that the covariance matrix of the error term u_t can be expressed in terms of the autocorrelation coefficient ρ and the variance of ε_t.
Limitations
1. This method is suitable only when the disturbance follows an AR(1) scheme; otherwise it compounds the errors of estimation.
2. The method can be applied only if X is not endogenous, i.e., the model does not include lagged dependent variables.
In carrying out the above differencing (the generalized difference transformation Y*_t = Y_t − ρY_{t−1}, X*_t = X_t − ρX_{t−1} of eq 6), the first observation is lost. When the number of observations is large, this does not matter. But if the number of observations is small, the OLS estimators in eq (6) are not BLUE. The problem is rectified when the first observations on X and Y are transformed as Y₁√(1 − ρ²) and X₁√(1 − ρ²). After including the transformed first observation, the OLS estimators obtained are BLUE.
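A minimal sketch of this transformation for a known ρ; the function name gls_ar1_known_rho is hypothetical and the data are simulated only to keep the example self-contained. The first observation is rescaled by √(1 − ρ²), and the intercept column is transformed along with the regressor so that the original β₁ and β₂ are recovered directly.

```python
import numpy as np
import statsmodels.api as sm

def gls_ar1_known_rho(y, x, rho):
    """Generalized differencing with the first observation retained via the sqrt(1-rho^2) rescaling."""
    y_star = np.empty_like(y)
    x_star = np.empty_like(x)
    y_star[0] = y[0] * np.sqrt(1 - rho ** 2)
    x_star[0] = x[0] * np.sqrt(1 - rho ** 2)
    y_star[1:] = y[1:] - rho * y[:-1]
    x_star[1:] = x[1:] - rho * x[:-1]
    # The intercept column must be transformed in the same way as the other regressors
    const = np.empty_like(y)
    const[0] = np.sqrt(1 - rho ** 2)
    const[1:] = 1 - rho
    res = sm.OLS(y_star, np.column_stack([const, x_star])).fit()
    return res.params           # [b1, b2] of the original model

# Simulated example with a known rho, only to make the sketch self-contained
rng = np.random.default_rng(9)
n, rho = 200, 0.7
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u
print(gls_ar1_known_rho(y, x, rho))   # roughly [1, 2]
```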
2.3.1 First Difference Method
When one suspects perfect positive autocorrelation, ρ = +1, in which case the difference eq 5 becomes
   Y_t − Y_{t−1} = β₂(X_t − X_{t−1}) + (u_t − u_{t−1})
or
   ΔY_t = β₂ΔX_t + ε_t   (7)
where Δ is the first difference operator. This equation may be estimated by the usual OLS procedure, since the disturbance term now satisfies the CLRM assumptions. So, this method may be used when one suspects a very high value of ρ, or when the value of the D-W d statistic is very low.
Some observations about the first difference model
1. This is a regression without an intercept.
2. If ρ = +1, the underlying series is non-stationary but the first-differenced series is stationary.
3. This method is suitable only if ρ is very high or close to +1.
To test for ρ = +1, the Berenblutt-Webb test may be used. The test statistic is
$$g = \frac{\sum_{t=2}^{n}\hat\varepsilon_t^2}{\sum_{t=1}^{n}e_t^2}$$
where e_t is the estimated residual of u_t from the original (levels) model and ε̂_t is the estimated residual of ε_t from the first difference model.
H0: ρ = +1
H1: ρ = 0
If the calculated value of g lies below the lower limit d_L of the Durbin Watson tables, we do not reject H0, i.e., ρ = +1.
2.3.2 Durbin Watson Method
Since we know that
$$\hat\rho \approx 1 - \frac{d}{2}$$
for large samples we can use this estimated value of ρ to transform the data and run the regression on the generalized difference equation 6.
2.3.3 Cochrane Orcutt Iterative Method
The iterative method entails starting with an initial value of ρ̂ and refining it through successive approximations. The steps involved in the procedure are:
Step 1: Estimate the regression by OLS and obtain the residuals.
Step 2: Estimate ρ̂ by regressing the residuals on their lagged values.
(Note: This regression does not have an intercept term because the residuals sum to zero, making the intercept equal to zero.)
Step 3: Transform the original variables using the ρ̂ obtained from step 2, as follows:
   Y*_t = Y_t − ρ̂Y_{t−1}
   X*_t = X_t − ρ̂X_{t−1}
   β*₁ = β₁(1 − ρ̂)
   β*₂ = β₂
   ε_t = u_t − ρ̂u_{t−1}
Step 4: Run the regression again with the transformed variables and obtain the residuals:
   Y*_t = β*₁ + β*₂X*_t + ε_t
Step 5 onwards: Continue repeating steps 2 to 4 until the iterations converge, i.e., until the estimates of the autocorrelation coefficient ρ̂ from two successive iterations differ by no more than some preselected small value, such as 0.001.
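A minimal sketch of the Cochrane Orcutt iteration following the steps above; the function name cochrane_orcutt is hypothetical, the convergence tolerance 0.001 is the value suggested in the text, and the data are simulated only to keep the example self-contained.

```python
import numpy as np
import statsmodels.api as sm

def cochrane_orcutt(y, x, tol=1e-3, max_iter=100):
    """Cochrane-Orcutt iterative estimation for Y_t = b1 + b2*X_t + u_t with AR(1) errors."""
    # Step 1: OLS on the original model
    b1, b2 = sm.OLS(y, sm.add_constant(x)).fit().params
    rho = 0.0
    for _ in range(max_iter):
        # Step 2: estimate rho by regressing the residuals on their lagged values (no intercept)
        e = y - b1 - b2 * x
        rho_new = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])
        # Step 3: transform the variables
        y_star = y[1:] - rho_new * y[:-1]
        x_star = x[1:] - rho_new * x[:-1]
        # Step 4: re-run the regression on the transformed variables
        b1_star, b2 = sm.OLS(y_star, sm.add_constant(x_star)).fit().params
        b1 = b1_star / (1 - rho_new)        # since b1* = b1(1 - rho)
        # Step 5: repeat until successive estimates of rho differ by less than tol
        if abs(rho_new - rho) < tol:
            return rho_new, b1, b2
        rho = rho_new
    return rho, b1, b2

# Simulated example, only to make the sketch self-contained
rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u
print(cochrane_orcutt(y, x))   # estimated (rho, b1, b2), roughly (0.7, 1, 2)
```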
2.3.4 Hildreth Lu Grid Search Method
This method entails finding the value of ρ̂ using a grid search procedure. The steps involved are:
Step 1: Calculate the transformed values Y*_t and X*_t, defined in eq 6, for different values of ρ at intervals of 0.1 in the range −1 ≤ ρ ≤ 1.
Step 2: Run the regression Y*_t = β*₁ + β*₂X*_t + ε_t for each value of ρ.
Step 3: Calculate the RSS of each regression and choose the value of ρ for which the RSS is minimum.
Step 4: Repeat the procedure over smaller intervals of ρ around the value obtained in step 3.
Step 5: Choose the value of ρ for which (n/2) log RSS(ρ) − (1/2) log(1 − ρ²) is minimum.
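A minimal sketch of the grid search using the RSS criterion of step 3, with a coarse grid of 0.1 followed by a finer grid of 0.01 around the best coarse value; the function names are hypothetical and the data are simulated only to keep the example self-contained.

```python
import numpy as np
import statsmodels.api as sm

def rss_for_rho(y, x, rho):
    """RSS of the generalized-difference regression for a given rho."""
    y_star = y[1:] - rho * y[:-1]
    x_star = x[1:] - rho * x[:-1]
    return sm.OLS(y_star, sm.add_constant(x_star)).fit().ssr

def hildreth_lu(y, x):
    # Steps 1-3: coarse grid at intervals of 0.1 over -1 < rho < 1
    coarse = np.arange(-0.9, 1.0, 0.1)
    rho0 = min(coarse, key=lambda r: rss_for_rho(y, x, r))
    # Step 4: finer grid around the best coarse value
    fine = np.arange(rho0 - 0.09, rho0 + 0.10, 0.01)
    return min(fine, key=lambda r: rss_for_rho(y, x, r))

# Simulated example, only to make the sketch self-contained
rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u
print("Hildreth-Lu estimate of rho:", round(float(hildreth_lu(y, x)), 2))
```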
It may be pointed out, though, that the FGLS estimators are not unbiased, but they are asymptotically more efficient than the OLS estimators for an AR(1) model. However, if the regression model includes lagged dependent variables or exhibits higher order autocorrelation, then the FGLS estimates are neither unbiased nor consistent.
2.4 Newey West Method of Correcting Standard Errors
When the exact form of autocorrelation is not known, or when the independent variables in the regression model are not exogenous, it is better to estimate the model by the usual OLS procedure but to correct the standard errors of the estimates for different forms of autocorrelation.
In the absence of autocorrelation, we know that the OLS estimate of the variance of a slope coefficient is
$$\operatorname{Var}(\hat\beta) = \frac{\sigma_u^2}{N\cdot\operatorname{Var}(X)}$$
In the presence of autocorrelation, the Newey-West (HAC) variance rescales this conventional estimate as
$$\operatorname{Var}(\beta^*) = \left[\frac{\operatorname{Var}(\hat\beta)}{\hat\sigma_u^2}\right]^{2}\hat v$$
where
$$\hat v = \sum_{t=1}^{n}\hat a_t^2 + 2\sum_{h=1}^{g}\left[1-\frac{h}{g+1}\right]\left(\sum_{t=h+1}^{n}\hat a_t\hat a_{t-h}\right)$$
$$\hat a_t = \hat r_t\hat u_t,\qquad t = 1, 2, \ldots, n$$
g is the number of lags (the lag truncation parameter), and r̂_t is the residual obtained by running the auxiliary regression of x_{t1} on x_{t2}, x_{t3}, …, x_{tk} with an intercept.
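A minimal sketch of OLS with Newey-West (HAC) standard errors in statsmodels; the number of lags g = 4 is an arbitrary illustrative choice and the data are simulated only to keep the example self-contained.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data with AR(1) disturbances, only to make the sketch self-contained
rng = np.random.default_rng(8)
n = 200
x = rng.normal(size=n).cumsum()
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()                                          # conventional OLS standard errors
hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})   # Newey-West standard errors

print("OLS se of slope:       ", ols.bse[1])
print("Newey-West se of slope:", hac.bse[1])   # usually larger under positive autocorrelation
```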
Advantages
1. This technique may be used even when the model includes lagged dependent variables as regressors.
Limitations
1. The method is valid only for large samples.
2. The method corrects only the standard errors; the OLS point estimates themselves are unchanged and remain inefficient.
3. Summary
The presence of autocorrelation in the regression model can be dealt with in several ways.
In case autocorrelation has arisen due to incorrect model specification then the most
appropriate method is to specify the model correctly. This can be done by introducing the
omitted variables or lagged dependent variables. If autocorrelation is present despite
correct specification of the model then the Generalised Least Squares (GLS) regression
method should be used. But the GLS procedure is applicable only when the autocorrelation coefficient ρ is known. When ρ is not known, the Feasible Generalised Least Squares (FGLS) regression method should be used, and the estimated value of ρ may be used in place of its true value. There are several ways of estimating ρ, namely the D-W method, the Cochrane Orcutt iterative method and the Hildreth Lu grid search method. When the exact form of
autocorrelation is not known, or when the independent variables in the regression model
are not exogenous then it is better to estimate the model by the usual OLS procedure but
correct the standard errors of the estimates for different forms of autocorrelation, by using
the Newey West method of standard error correction.