Econometrics II
MODULE
December, 2022/23
CHAPTER ONE
REGRESSION ANALYSIS WITH QUALITATIVE INFORMATION
Introduction
Both the dependent and explanatory variables that we have encountered in the preceding chapters were
essentially quantitative/continuous. Regression models can also handle qualitative variables as dependent and
explanatory variables. In this chapter, we shall consider models that may involve not only quantitative variables
but also qualitative variables. Such variables are also known as indicator variables, categorical variables, nominal scale variables or dummy variables.
To be specific, in this chapter we shall discuss two main topics: (1) regression with qualitative explanatory
variables and (2) regression with qualitative dependent variables.
In regression analysis the dependent variable, or regressand, is frequently influenced not only by ratio scale
variables (e.g. income, output, prices, costs, height, temperature) but also by variables that are essentially
qualitative, or nominal scale, in nature. Examples are sex, race, color, religion, nationality, geographical region
and political party membership. The importance of such variables is evidenced by several studies. For instance,
studies found that female workers earn less than male workers, holding other factors constant. Also, in
multicultural contexts, black workers are found to earn less than whites. This clearly shows that (whatever the reasons) qualitative variables such as sex and race can influence the dependent variable and should explicitly be included among the explanatory variables.
In a given regression model, the qualitative and quantitative variables may also occur together, i.e., some
variables may be qualitative and others are quantitative. When all explanatory variables are
- quantitative, then the model is called a regression model,
- qualitative, then the model is called an analysis of variance model (ANOVA) and
- quantitative and qualitative both, then the model is called analysis of covariance model (ANCOVA).
Such models can be dealt with within the framework of regression analysis. The usual tools of regression analysis can be used in the case of dummy variables.
Note that equation 1.1 has only one dummy variable since the grouping/qualitative variable in equation 1.1 has
two categories. Similarly, if a qualitative variable has three categories, we introduce only two dummy variables.
For instance, suppose that the investigator we considered above wants to see whether wheat farm productivity depends on the educational attainment of the household head, where educational attainment is divided into three mutually exclusive categories: illiterate, elementary school, and high school and above. The model can be stated as follows:

Yi = β0 + β1D1i + β2D2i + ui                                   (1.2)

Where, Yi = wheat farm productivity of household i,
D1i = 1 if the household head has elementary education, 0 otherwise,
D2i = 1 if the household head has high school education or above, 0 otherwise.
What does the model (1.2) tell us? Assuming that the error term satisfies the usual assumptions of OLS, taking expectations on both sides of (1.2), we obtain:

Mean wheat farm productivity of households with elementary education:
E(Yi | D1i = 1, D2i = 0) = β0 + β1

Mean wheat farm productivity of households with high school and above schooling:
E(Yi | D1i = 0, D2i = 1) = β0 + β2

Mean wheat farm productivity of illiterate households:
E(Yi | D1i = 0, D2i = 0) = β0
In other words, in the multiple regression equation (1.2), the mean wheat farm productivity of illiterate households is given by the intercept, β0, and the slope coefficients β1 and β2 tell by how much the mean wheat farm productivity of households with elementary education and with high school and above schooling differ from the mean wheat farm productivity of illiterate households. But how do we know that these differences are statistically significant? To address this question, let's consider a numerical example.
Numerical Example-1:
Given data of the following form on wheat farm productivity (yield) and educational attainment of household heads based on a sample of 120 farm households, we can estimate yield as a function of educational attainment, which is given in three categories: illiterate, elementary, and high school & above.
Based on the above data, if we estimate equation 1.2, we obtain the following regression result:
Since we are treating illiterate household heads as the benchmark, the coefficients attached to the various dummies are differential intercepts, showing by how much the average value of yield in the category that receives the value 1 differs from that of the benchmark category. That is, as the regression result shows, the average yield of households with elementary education and of households with high school and above schooling is larger than the average yield of households with no schooling by the estimated differential intercepts (in quintals), respectively. Therefore, the actual mean yield of households with elementary education and with high school and above schooling can be obtained by adding these differential yields to the benchmark mean. Doing so, we obtain that the mean yield of households with elementary education is 12.66 quintals and that of households with high school and above is 14 quintals.
But how do we know that these mean yields are statistically different from the mean yield of illiterate households, the benchmark category? This is straightforward. All we have to do is find out whether each of the slope coefficients in the above equation is statistically significant. As can be seen from the regression result, the estimated slope coefficients for both households with elementary education and those with high school and above are statistically significant, as implied by the P-values of the t-statistics.
1.1.3 Regression with a mixture of Qualitative and Quantitative Explanatory Variables: ANCOVA
Often an investigator deals with a regression in which the quantitative dependent variable is related to a mixture of quantitative and qualitative explanatory variables. In such cases one often uses dummy variable models with some implicit assumptions about their use. The usual implicit assumption is that the regression lines for the different groups differ only in the intercept term but have the same slope coefficients.
To exemplify the application of dummy variables under the above assumption, let us suppose that the investigator we considered above wants to analyze the relationship between wheat farm productivity and chemical fertilizer application for two groups: male- and female-headed households. The relationship can be represented by the following equation:

Yi = β0 + β1Di + β2Fi + ui                                   (1.4)

Where, Yi = wheat farm productivity, Fi = amount of chemical fertilizer applied, and Di = 1 if the household is male headed, 0 if it is female headed.
Figure 1.1: Regression lines with a common slope and different intercept
It is evident from equation 1.4 that the coefficient of the dummy variable, β1, measures the difference between the two intercept terms. Hence, it is called the differential intercept coefficient because it tells by how much the intercept of the group that receives the value of 1 differs from the intercept of the benchmark category.
Note that, as we did above, if there is a constant term in the regression equation, the number of dummies defined should be one less than the number of categories of the variable. This is because the constant term is the intercept for the base group. As can be seen from equation 1.4, the constant term, β0, measures the intercept for female headed households. Furthermore, the constant term plus the coefficient of Di (i.e., β0 + β1) measures the intercept for male headed households. This interpretation holds as long as the base group is female headed households. But this should not connote that the base group is always female headed households: any one group may be chosen as the base group, depending on the preference of the investigator.
On the other hand, if we do not introduce a constant term in the regression equation, we can define a dummy variable for each group, and in this case the coefficients of the dummy variables measure the intercepts of the respective groups. However, if we include both the constant term and dummies for all categories of a variable, we will encounter the problem of perfect multicollinearity, and the regression program either will not run or will omit one of the dummies automatically.
In general, when we introduce dummy variables, we have to follow the following rule: if the qualitative variable has m categories and the regression equation has a constant intercept, introduce only m − 1 dummy variables; otherwise we will fall into what is called the dummy variable trap, that is, the problem of perfect multicollinearity.
As you might have noted, so far we have been dealing with regression analysis where there is only one
qualitative variable. In practice, however, we may have more than one qualitative variable affecting the
dependent variable. Then, how will you introduce more than one qualitative variable in your regression
analysis?
The introduction of dummy variables for more than one qualitative variable in a regression analysis is straightforward. That is, for each qualitative variable we introduce one or more dummies following the rule mentioned above. For example, suppose that we want to analyze the determinants of household consumption (C), and hence we have data on:
1. Y: income of households
2. S: the sex of the head of the household
3. A: age of the head of the household, which is given in three categories: < 25 years, 25 to 50 years, and > 50 years
4. E: education of the head of household, also in three categories: < high school, high school but < college degree, and college degree and above
Following the rule of dummy variables, we can include these qualitative variables in the form of dummy variables as follows:

Ci = α + βYi + γ1Si + γ2A1i + γ3A2i + γ4E1i + γ5E2i + ui                (1.5)

Where, Si = 1 if the household head is male, 0 if female;
A1i = 1 if the head is aged below 25 years, 0 otherwise;
A2i = 1 if the head is aged 25 to 50 years, 0 otherwise (so heads above 50 years form the base);
E1i = 1 if the head has less than high school education, 0 otherwise;
E2i = 1 if the head has high school but less than a college degree, 0 otherwise (so college degree and above is the base).
For each category the number of dummy variables is one less than the number of classifications.
The assumption made in the dummy-variable method is that it is only the intercept that changes for each group but not the slope coefficient (i.e., the coefficient of Y). The intercept term for each individual is obtained by substituting the appropriate values of the dummy variables. For instance, for a male head aged below 25 with a college degree, we have Si = 1, A1i = 1, A2i = 0, E1i = 0 and E2i = 0, and hence the intercept is α + γ1 + γ2. For a female head aged above 50 years with a college degree, all the dummies are zero, and hence the intercept term is just α.
Furthermore, the coefficients of the dummies are interpreted as the differences between the average consumption of the omitted (base) category and the category represented by the dummy under consideration, keeping other things constant. For instance, γ1 in equation 1.5 above is interpreted as the amount by which, keeping other things constant, the average consumption expenditure of male headed households is greater (or less, if γ1 is negative) than that of their female headed counterparts.
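A minimal Stata sketch of such a regression, assuming (hypothetical) variable names consum, income, sex, agecat and educat for the data listed above:

. regress consum income i.sex i.agecat i.educat

The factor-variable prefix i. makes Stata generate one dummy fewer than the number of categories of each qualitative variable, so the fitted equation has a common slope on income and a set of differential intercepts, exactly as in equation 1.5.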
The other most important question, perhaps, is how to check the statistical significance of the differences among
groups (if any). For this purpose we simply test whether the coefficients of the dummies are statistically
significant or not by using the usual procedures of hypothesis testing, i.e., the standard error test or the Student's t-test.
Seasonality is another qualitative feature that can be handled with dummy variables. Often it is desirable to remove the seasonal factor from a time series so that one can concentrate on the other components, such as the trend. The process of removing the seasonal component from a time series is known as deseasonalization or seasonal adjustment, and the time series thus obtained is called the deseasonalized, or seasonally adjusted, time series. Important economic series, such as the unemployment rate, the consumer price index (CPI), the producer price index (PPI), and the industrial production index, are usually published in their seasonally adjusted form.
One of the methods used to deseasonalize a time series is the method of dummy variables. Regression on seasonal dummy variables, such as quarterly or monthly dummies, is a simple way to estimate seasonal effects in time series data sets.
Example:
To illustrate the dummy variables technique in seasonal analysis, suppose that we have quarterly data on sales
of refrigerators over the years 1978 through 1985 given in the following table.
Table-1.2: Quarterly Data on Sales of refrigerators (in thousands) (1978-I to 1985-IV)
Year    I       II      III     IV
1978    1317    1615    1662    1295
1979    1271    1555    1639    1238
1980    1277    1258    1417    1185
1981    1196    1410    1417    919
1982    943     1175    1269    973
1983    1102    1344    1641    1225
1984    1429    1699    1749    1117
1985    1242    1684    1764    1328
But first let us look at the data, which is shown in Figure 1.2 below.
Figure 1.2: Quarterly Data on Sales of refrigerators (in thousands) (1978-I to 1985-IV)

This figure suggests that perhaps there is a seasonal pattern in the data associated with the various quarters. How can we measure such seasonality in refrigerator sales? We can use a seasonal dummy regression to estimate the seasonal effect. To do this, let us treat the first quarter as the reference quarter and assign dummies to the second, third, and fourth quarters. That is, we estimate the following model:

FRIGt = β0 + β1D2t + β2D3t + β3D4t + ut

Where, FRIGt = sales of refrigerators (in thousands) in quarter t, and D2, D3 and D4 are seasonal dummies defined as: D2t = 1 if the observation falls in the second quarter, 0 otherwise; D3t = 1 if it falls in the third quarter, 0 otherwise; and D4t = 1 if it falls in the fourth quarter, 0 otherwise.
From the data on refrigerator sales given in Table-1.2, we obtain the following regression results, where the values in brackets are Student's t-statistics for the respective coefficients:

Predicted FRIGt = 1222.125 + 245.375 D2t + 347.625 D3t - 62.125 D4t,   R² ≈ 0.53
                  (20.37)     (2.89)        (4.10)       (-0.73)
Interpretation:
Since we are treating the first quarter as the benchmark, the coefficients attached to the various dummies are differential intercepts, showing by how much the average value of FRIG in the quarter that receives a dummy value of 1 differs from that of the benchmark quarter. Put differently, the coefficients on the seasonal dummies give the seasonal increase or decrease in the average value of Y relative to the base season. For instance, the coefficient of the quarter-2 dummy tells us that the average volume of refrigerator sales in the second quarter is larger than that in the first quarter by about 245.375 (thousand) units, and the difference is statistically significant at the 1% level of significance. If you add the various differential intercept values to the benchmark average value of 1222.125, you will get the average sales for the various quarters.
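A minimal Stata sketch of this seasonal dummy regression, assuming the data have been declared quarterly with tsset and that the quarterly date variable is called qdate (a hypothetical name):

. gen qtr = quarter(dofq(qdate))   // extract the quarter: 1, 2, 3 or 4
. regress frig i.qtr               // quarter I is the base season
. margins qtr                      // fitted average sales for each quarter

The coefficients on 2.qtr, 3.qtr and 4.qtr are the differential intercepts discussed above, and margins reproduces the quarterly averages obtained by adding them to the benchmark value.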
So far in this chapter, we have been considering models where the explanatory variables are qualitative or
dummy variables.
In this section, we shall look at models where the dependent variable, Y, is itself a qualitative variable. Such models are called limited dependent variable models, or qualitative or categorical variable models. Such a dummy dependent variable can take on two or more values; however, we shall concentrate on the binary case where Y can take only two values. For example, suppose we want to study the determinants of house ownership by households in Arba Minch town. Since a household may either own a house or not, house ownership is a Yes or No decision. Hence, the response variable, or regressand, can take only two values, say, 1 if the household owns a house and 0 if it does not. A second example would be a model of women's labor force participation (LFP). The dependent variable in this case is labor force participation, which takes the value of one if the woman participates in the labor force and zero if she does not. A third example would be the poverty status of households in rural Ethiopia. The dependent variable in this case is the poverty status of the household, which takes the value of one if the household is poor and zero if the household is non-poor. In all of the above instances, the dependent variable is a dichotomous/dummy/binary/qualitative/limited variable.
To explain models of the above nature, various types of explanatory variables could be included: both continuous variables such as age and income and dichotomous variables such as sex and race, or exclusively continuous, or exclusively qualitative variables.
Binary choice models are commonly used for micro analysis in the social sciences and in medical research. There are three types of limited dependent variable models:
A. The Linear Probability Models
B. The Logit Models
C. The Probit Models
The linear probability model (LPM) applies the usual linear regression model, Yi = β'Xi + ui, to the case where Y is a dichotomous variable taking the value zero or one, where we require the assumption that E(ui) = 0.
The conditional expectation of the i-th observation is given by:

E(Yi | Xi) = β'Xi

The conditional expectation of the dependent variable is equal to the probability of the event happening, Pi = P(Yi = 1 | Xi). The conditional expectation in this model therefore has to be interpreted as the probability that Yi = 1 given the particular value of Xi. In practice, however, there is nothing in this model to ensure that these probabilities will lie in the admissible range (0, 1). Since Yi can only take the two values of 0 or 1, it follows that the error term ui can only take the two values 1 − β'Xi or −β'Xi.
What do you think is the difference between a regression model where the regressand Y is
quantitative and a model where it is qualitative?
In a model where Y is quantitative, our objective is to predict the average value of the dependent variable from the given values of the explanatory variables; that is, in such models the objective of regression analysis is to estimate E(Yi | Xi). On the other hand, in models where Y is qualitative, our objective is to predict the probability of something happening, such as owning a smartphone, owning a house, or being non-poor. In other words, in binary regression models,

E(Yi | Xi) = P(Yi = 1 | Xi) = β'Xi

and hence the conditional mean is itself a probability.
That is why qualitative response regression models are often calledprobability models.
From the assumption that E(ui) = 0, it follows that the probabilities of the two outcomes are β'Xi and 1 − β'Xi, respectively. Thus, the probability distribution of ui can be given by the following table.

Yi      Value of ui        Probability
1       1 − β'Xi           Pi = β'Xi
0       −β'Xi              1 − Pi = 1 − β'Xi
Total                      1
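To see how the zero-mean assumption ties the probabilities to the regression function, a short derivation using the notation above:

E(ui) = (1 − β'Xi)·Pi + (−β'Xi)·(1 − Pi) = Pi − β'Xi,

so setting E(ui) = 0 gives Pi = β'Xi = E(Yi | Xi). Similarly,

var(ui) = E(ui²) = (1 − β'Xi)²·Pi + (β'Xi)²·(1 − Pi) = β'Xi(1 − β'Xi) = Pi(1 − Pi),

which is the Bernoulli variance used in the next paragraph.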
Both the dependent variable (Yi) and the error term (ui) assume only two values, each with some probability. A variable that assumes two values with given probabilities is called a Bernoulli variable, and its distribution is the Bernoulli distribution. Therefore, the dependent variable does not follow a normal distribution but the Bernoulli distribution, with mean Pi = β'Xi and variance Pi(1 − Pi) = β'Xi(1 − β'Xi), while the disturbance term follows the Bernoulli distribution with mean zero and the same variance β'Xi(1 − β'Xi). As can be seen, the variance of the disturbance term (ui) is not constant: it is a function of the explanatory variables (Xi). Thus, the disturbance term in the LPM is heteroscedastic, and therefore the OLS estimators of the LPM are not efficient.
The basic problem with the linear probability model (LPM) is that it models the probability of Y = 1 as a linear function of Xi: P(Yi = 1 | Xi) = β'Xi. If we were to fit an OLS regression line, we would get a straight line; at high values of X we may get fitted values of Y above 1, and for low values of X we may get fitted values below 0. But we cannot have probabilities that fall below 0 or above 1.
If the values of the dependent variable (Yi) are limited or discrete, we cannot rely on the ordinary least squares method (OLS). This is because, unless we also restrict the values of the explanatory variable (Xi), we may get negative fitted values of the dependent variable for small values of the explanatory variable and values greater than one for large values of the explanatory variable; that is, the estimated probability may lie outside the limits zero and one.
The following are the main problems with linear probability models:
A. The dependent variable (Yi) and the disturbance term (ui) are not normally distributed
One of the assumptions of the classical linear regression model is that the disturbance term (ui), the dependent variable (Yi) and the OLS estimates are normally distributed. But, as we have noted above, both the dependent variable and the disturbance term follow the Bernoulli distribution. The error term (ui) follows the Bernoulli distribution with a mean value of zero and variance β'Xi(1 − β'Xi). Similarly, the dependent variable follows the Bernoulli distribution with mean β'Xi and variance β'Xi(1 − β'Xi):

var(ui) = β'Xi(1 − β'Xi)
var(Yi) = β'Xi(1 − β'Xi)
The variances of ui and Yi are thus the same and are a function of the explanatory variable (Xi); that is, the variance of the error term, var(ui), is not constant but varies with the values of the explanatory variable. Moreover, since the error term (ui) is not normally distributed, neither is the distribution of our estimates. If the estimates are not normally distributed, we cannot use the standard procedures for hypothesis testing and inference. However, this limitation of the LPM is not a serious problem because, as the sample size increases, the distribution of the error term (ui) approaches the normal distribution.
B. The disturbance term is heteroscedastic
The variance of the error term is a function of the explanatory variable (Xi), which shows that the variance of ui is heteroscedastic. That is, the variance of the error term varies with the explanatory variable, Xi.
C. Violation of one of the axioms of probability (the estimated probability may lie outside the interval [0, 1])
Since Pi in the linear probability model measures the conditional probability of the event Y = 1 occurring given X, it must necessarily lie between 0 and 1. Although this is true a priori, there is no guarantee that P̂i, the estimators of E(Yi | Xi), will necessarily fulfill this restriction, and this is the real problem with the estimation of the LPM by OLS. There are two ways to get around this problem. One is to estimate the LPM by the usual OLS method and find out whether the estimated P̂i lie between 0 and 1: if some are less than 0 (that is, negative), P̂i is set equal to zero, and if some are greater than 1, they are set equal to one. The second solution is to devise an estimating technique that guarantees that the estimated conditional probabilities P̂i will lie between 0 and 1. In this respect, the Logit and Probit models guarantee that the estimated probabilities indeed lie between the logical limits 0 and 1.
Figure 1.3: Scatter plot of the actual Y and the LPM-predicted probabilities against the explanatory variable (income)
As the above scatter plot shows (it is drawn from the data on home ownership status given in Table-1.4 below), the predicted probability (P̂i) lies outside the limiting values 0 and 1 for some values of Xi. This violates one of the axioms of probability, which states that a probability lies only between 0 and 1.
Numerical Example
To illustrate the linear probability model and the points raised above, let's assume that we want to study the home ownership status of households in a given town. Assume that we have data for 40 households on home ownership status and income (in thousands of Birr), as given below.
Table-1.4: Hypothetical data on home ownership status (Y = 1 if the household owns a home, 0 otherwise) and income, X (thousands of Birr)
Household Y X Household Y X
1 0 8 21 1 24
2 1 20 22 0 16
3 1 18 23 0 12
4 0 11 24 0 11
5 0 12 25 0 16
6 1 19 26 0 11
7 1 20 27 1 20
8 0 13 28 0 18
9 1 20 29 0 11
10 0 10 30 0 10
11 1 17 31 1 17
12 1 18 32 0 13
13 0 14 33 1 21
14 0 25 34 1 20
15 1 6 35 0 11
16 1 19 36 0 8
17 1 16 37 1 17
18 0 10 38 1 16
19 0 8 39 0 7
20 1 18 40 1 17
Table-1.5 below shows the estimated probabilities, P̂i, for the various income levels. The most noticeable feature of this table is that two estimated values are negative and two values are in excess of 1, demonstrating clearly the point made earlier that, although E(Yi | Xi) is positive and less than 1, its estimators, P̂i, need not necessarily be positive or less than 1. This is one reason why the LPM is not the recommended model when the dependent variable is dichotomous.
Table-1.5: Actual home ownership status (Y) and LPM-estimated probabilities (P̂i), excerpt

Y    P̂i        Y    P̂i        Y    P̂i        Y    P̂i
0    0.348      0    0.152      0    0.674      1    0.543
1    0.804      0    0.022      0    0.217      0    -0.043
0    0.152      1    0.674      0    0.152      1    0.609
Even if the estimated probabilities were all positive and less than 1, the LPM would still suffer from the problem of heteroscedasticity, which can be seen readily from (1.9). As a consequence, we cannot trust the estimated standard errors reported in (1.10) above.
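A minimal Stata sketch of the LPM for these data (assuming the hypothetical variable names own for Y and income for X):

. regress own income, vce(robust)   // LPM with heteroskedasticity-robust standard errors
. predict phat                      // fitted probabilities
. count if phat < 0 | phat > 1      // how many predictions fall outside [0, 1]

The last command makes the problem shown in Table-1.5 explicit: any nonzero count means the fitted "probabilities" violate the 0-1 bound.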
But the more serious limitation of the LPM is that the predicted probability can lie outside the natural limits 0 and 1, because the LPM assumes a linear relationship between the predicted probability and the level of the explanatory variable (Xi). In other words, the problem with the LPM is that it assumes that the probability of success, or of something happening (Pi), is a linear function of the explanatory variable(s) (Xi). That is, in our house ownership example, the probability of owning a house (Pi) changes by the same amount at both low and high levels of income.
Therefore, the fundamental problem with the LPM is that Pi = E(Yi = 1 | Xi) increases linearly with Xi; that is, the marginal or incremental effect of Xi remains constant at all levels of income. Thus, in our home ownership example, as X increases by one unit (Birr 1000), the probability of owning a house increases by the same constant amount, whether the income level is Birr 8,000, Birr 10,000, Birr 18,000, or Birr 24,000. As can be seen from Figure-1.3 above, the predicted probability (P̂i) is a linear function of the explanatory variable (Xi), which seems patently unrealistic.
In reality one would expect that Pi is nonlinearly related to Xi: at very low income a household will not own a house, but at a sufficiently high level of income, say X*, it most likely will own a house. Any increase in income beyond X* will have little effect on the probability of owning a house. Thus, at both ends of the income distribution, the probability of owning a house will be virtually unaffected by a small increase in X.
Therefore, what we need is a (probability) model that has these two features: (1) as Xi increases, Pi = E(Y = 1 | X) also increases but never steps outside the 0–1 interval, and (2) the relationship between Pi and Xi is non-linear, that is, "one which approaches zero at slower and slower rates as Xi gets small and approaches one at slower and slower rates as Xi gets very large."
In this respect, two alternative nonlinear models (1) The Logit Model & (2) The Probit Model were proposed.
The very objective of the Logit and Probit models is to ensure that the predicted probability of the event occurring, given the value of the explanatory variable, remains within the natural [0, 1] bound; that means

0 ≤ P(Y = 1 | X) ≤ 1 for all X.

This requires a nonlinear functional form for the probability, which is possible if we assume that the probability is given by some cumulative distribution function (CDF). The two important nonlinear functions proposed for this purpose are the logistic CDF and the normal CDF. In general, we write

Pi = P(Yi = 1 | Xi) = G(β'Xi)

Where, G is a function taking on values strictly between 0 and 1, that is, 0 < G(z) < 1 for all real numbers z. This ensures that the predicted probability (Pi) strictly lies between 0 and 1.
The logistic distribution function and the Cumulative Normal Distribution Function can be represented
graphically as follows.
Logistic CDF:          G(z) = e^z / (1 + e^z) = 1 / (1 + e^(−z))
Standard normal CDF:   G(z) = Φ(z) = ∫ from −∞ to z of (1/√(2π)) e^(−t²/2) dt

Figure 1.4: Comparison of the two distribution functions commonly used in binary regression analysis
Therefore, under the logistic CDF the Logit model is written as

Pi = P(Yi = 1 | Xi) = e^(Zi) / (1 + e^(Zi)) = 1 / (1 + e^(−Zi)),

Where, Zi = β1 + β2Xi.

As Zi ranges from −∞ to +∞, the predicted probability of the event occurring (Pi) ranges from 0 to 1. In other words, as Xi ranges from −∞ to +∞, the predicted probability of the event occurring (Pi) ranges between 0 and 1. Moreover, the predicted probability (Pi) is non-linearly related to the explanatory variable (Xi). Thus, the Logit model satisfies the two conditions, namely (1) 0 ≤ Pi ≤ 1 and (2) Pi is non-linearly related to Xi.
The following scatter plot shows the relationship between the predicted probability of owning a house (P̂i) and the level of the explanatory variable (income). As can be seen from the figure, the predicted probability of owning a house lies only within the natural limits, 0 ≤ P̂i ≤ 1. The relationship between the predicted probability of an event occurring (P̂i) and the explanatory variable is also nonlinear; that is, at lower and higher income levels, the change in the predicted probability of owning a house is small for a given change in income.
Figure 1.5: The scatterplot of the actual Y and predicted probabilities against the explanatory variable
But, while we have ensured that the predicted probability of the event occurring (Pi) lies in the natural interval [0, 1], we have created an estimation problem because Pi is nonlinear in the parameters as well as in the explanatory variables, so we cannot apply OLS directly. However, we can linearize the Logit model as follows.
The probability of not owning a house is

1 − Pi = 1 / (1 + e^(Zi)).

Taking the ratio of the probability of the event happening (Pi) to the probability of the event not happening (1 − Pi) gives what is called the odds ratio:

Pi / (1 − Pi) = e^(Zi).

Taking the natural log of the above odds ratio gives what is called the logit:

Li = ln[Pi / (1 − Pi)] = Zi = β1 + β2Xi.

Note the following features of the logit model:
a. As Pi goes from 0 to 1 (that is, as Zi varies from −∞ to +∞), the logit Li goes from −∞ to +∞; that is, although the probabilities lie between 0 and 1, the logits are not so bounded.
b. Even if the logit (L) is linear in X, the probabilities themselves are not. This property is in contrast with the LPM, where the probabilities increase linearly with X.
c. Although we have included only a single regressor in the preceding model, one can add as many regressors as may be dictated by the underlying theory.
d. If the logit (L) is positive, it means that when the value of the regressor increases, the odds that the regressand equals 1 (meaning some event of interest happens) increase. If the logit (L) is negative, the odds that the regressand equals 1 decrease as the value of X increases. To put it differently, the logit becomes negative and increasingly large in magnitude as the odds ratio decreases from 1 to 0, and becomes increasingly large and positive as the odds ratio increases from 1 to infinity.
e. More formally, the interpretation of the logit model given above is as follows: β2, the slope, measures the change in L for a unit change in X; that is, it tells how the log-odds in favor of owning a house change as income changes by one unit (Birr 1000) in our example. The intercept β1 is the value of the log-odds in favor of owning a house if income is zero.
f. Given a certain level of income, say X*, if we actually want to estimate not the odds in favor of owning a house but the probability of owning a house itself, this can be done directly from Pi = 1 / (1 + e^(−(β1 + β2X*))), once the estimates of β1 and β2 are available.
g. Whereas the LPM assumes that Pi is linearly related to Xi, the Logit model assumes that the log of the odds ratio is linearly related to Xi.
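To make the odds and logit interpretations in points (d) and (e) concrete, here is a small arithmetic illustration (the probabilities are illustrative, not estimates):

If Pi = 0.8, the odds are 0.8/0.2 = 4 and the logit is Li = ln(4) ≈ 1.39.
If Pi = 0.5, the odds are 1 and Li = ln(1) = 0.
If Pi = 0.2, the odds are 0.2/0.8 = 0.25 and Li = ln(0.25) ≈ −1.39.

Equal movements of the probability away from 0.5 thus translate into logits that are symmetric around zero.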
To estimate the logit model, we use the method of maximum likelihood (ML). Suppose we have a random sample of n observations. Letting Pi denote the probability that Yi = 1 (so that 1 − Pi is the probability that Yi = 0), the joint probability of observing the n Y values, i.e., f(Y1, Y2, ..., Yn), is given as:

f(Y1, Y2, ..., Yn) = ∏ Pi^Yi (1 − Pi)^(1 − Yi)                           (1.13)

Where, ∏ denotes the product over i = 1, ..., n.
NB: The likelihood of a Bernoulli variable is the probability of success raised to the power of Yi times the probability of failure raised to the power of (1 − Yi).
The joint probability given in Eq. (1.13) is known as the likelihood function (LF). If we take the natural logarithm of Eq. (1.13), we obtain what is called the log likelihood function (LLF):

ln f(Y1, ..., Yn) = Σ [Yi ln Pi + (1 − Yi) ln(1 − Pi)]
                  = Σ [Yi ln Pi − Yi ln(1 − Pi) + ln(1 − Pi)]
                  = Σ [Yi ln(Pi / (1 − Pi))] + Σ ln(1 − Pi)

Since ln[Pi / (1 − Pi)] = β1 + β2Xi and 1 − Pi = 1 / (1 + e^(β1 + β2Xi)), we can rewrite the above LLF as:

ln f(Y1, ..., Yn) = Σ Yi (β1 + β2Xi) − Σ ln(1 + e^(β1 + β2Xi))              (1.14)
As you can see from (1.14), the log likelihood function is a function of the parameters β1 and β2, since the Xi and Yi are known. In ML our objective is to maximize the LF (or LLF), that is, to obtain the values of the unknown parameters in such a manner that the probability of observing the given Y's is as high (maximum) as possible. For this purpose, we differentiate (1.14) partially with respect to each parameter, set the resulting expressions to zero and solve them. But the resulting expressions are highly nonlinear in the parameters and no explicit solutions can be obtained. However, the estimates of the parameters can easily be computed with the aid of software packages such as EViews, Stata or any other software package. Thus, here under we shall estimate Logit and Probit models using Stata.
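A minimal Stata sketch for the grade example discussed below (assuming the variables are named grade, gpa, ase and pc):

. logit grade gpa ase pc        // coefficients reported as changes in the log-odds
. logit grade gpa ase pc, or    // the same model with odds ratios reported
. margins, dydx(*)              // average marginal effects on Pr(grade = 1)

These three commands correspond, respectively, to the three ways of interpreting a logit model listed below.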
Equally important is the interpretation of the results. Therefore, below we shall discuss how to interpret the regression results of binary Logit models with the aid of numerical examples (estimated through the method of maximum likelihood (ML) by software packages). There are three ways of interpreting the regression results of a Logit model:
A. Logit interpretation
B. Odds ratio interpretation
C. Probability interpretation (Marginal Effect Interpretation)
Example-1
Suppose that a Mathematical Economics instructor gave routine weekly exercises to her section "A" students in the previous semester. She wants to study the effect of performance in these routine exercises on the final grade of students in the course, Mathematical Economics. At the end of the semester, she computed the average score in the exercises (ASE) for each student.
Table-1.6: Cross-sectional Data on GPA, ASE, PC and GRADE for 32 Students

Student   GPA    ASE    PC    GRADE (A = 1, rest = 0)
1 2.66 20 0 0
2 2.89 22 0 0
3 3.28 24 0 0
4 2.92 12 0 0
5 4 21 0 1
6 2.86 17 0 0
7 2.76 17 0 0
8 2.87 21 0 0
9 3.03 25 0 0
10 3.92 29 0 1
11 2.63 20 0 0
12 3.32 23 0 0
13 3.57 23 0 0
14 3.26 25 0 1
15 3.53 26 0 0
16 2.74 19 0 0
17 2.75 25 0 0
18 2.83 19 0 0
19 3.12 23 1 0
20 3.16 25 1 1
21 2.06 22 1 0
22 3.62 28 1 1
23 2.89 14 1 0
24 3.51 26 1 0
25 3.54 24 1 1
26 2.83 27 1 1
27 3.39 17 1 1
28 2.67 24 1 0
29 3.65 21 1 1
30 4 23 1 1
31 3.1 21 1 0
32 2.39 19 1 1
The dependent variable is the final grade of the student in Mathematical Economics (GRADE = 1 if the student scores an A and GRADE = 0 otherwise). The other explanatory variables are the Grade Point Average (GPA) of the student and personal computer ownership (PC = 1 if the student owns a PC and PC = 0 if not).
Logistic regression Number of obs = 32
LR chi2(3) = 15.40
Prob > chi2 = 0.0015
Log likelihood = -12.889633 Pseudo R2 = 0.3740
We interpret the coefficients of the regressors in the logit regression results as changes in the log of the odds ratio. The coefficients of GPA and ASE in the table above measure the change in the estimated logit for a unit change in the value of the regressor (holding the other regressors constant). Thus:
The GPA coefficient of 2.826 means that, with other variables held constant, if GPA increases by one unit, the estimated logit increases on average by about 2.826 units, suggesting a positive relationship between the two. In other words, holding other factors constant, as GPA increases by one point, the log of the odds in favor of scoring an A grade increases by about 2.83, and the effect is statistically significant, as implied by the P-value of the Z-test.
Holding other factors constant, as ASE increases by one point, the log of the odds ratio increases by 0.095, but this effect is statistically insignificant.
For students who own a PC, compared to those who do not, the log of the odds in favor of scoring an "A" grade is higher by about 2.38, and the effect is statistically significant, as implied by the P-value of the Z-test.
As you can see, all the regressors have a positive effect on the logit, although statistically the effect of ASE is
not significant. However, taken together all the regressors have a significant effect on scoring an "A" grade, as the LR statistic (the analogue of the F-test in linear regression) is 15.40, whose P-value is about 0.0015, which is very small.
Note, however, that a more meaningful interpretation is in terms of odds, which are obtained by taking the antilog of the various coefficients of the logit model. Thus, if you take the antilog of the GPA coefficient of 2.826 you will get about 16.88 (= e^2.826). That is, as the GPA of a student rises by one unit, the odds of getting an "A" increase by a factor of about 16.88, other things remaining the same. Using Stata, we can easily estimate the odds ratios for all regressors.
Therefore, the marginal effect/probability interpretation of the estimated logit model given above is as follows:
As GPA increases by one point, the probability of scoring grade "A" for an average student increases by about 53 percentage points, with the other factors held at their average values.
As ASE increases by one point, the probability of scoring grade "A" for an average student increases by about 1.8 percentage points, with the other factors held at their average values.
For a student who owns a PC, the probability of scoring grade "A" is higher by about 45.6 percentage points compared with non-PC owners, with the other factors held at their average values.
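These "at the average" statements correspond to marginal effects evaluated at the sample means of the regressors. In current Stata syntax a sketch of how they could be obtained is:

. margins, dydx(*) atmeans      // marginal effects at the means of the regressors

(The older mfx command, which appears later in this chapter, reports the same kind of at-the-means marginal effects by default.)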
Logit coefficients are larger in absolute value than Probit coefficients because the variance of the logistic distribution, π²/3, is greater than the variance of the standard normal distribution, which is 1. The difference between the coefficients of the Logit model and those of the Probit model is thus accounted for by the difference in the variances of the two distributions.
In the previous section, we considered the Logit model. In this section, we present the Probit model. We will develop the Probit model based on utility theory, or the rational choice perspective on behavior, as developed by McFadden.
Assume that in our house ownership example the decision to own a house or not depends on an unobservable utility index Ii (also known as a latent variable) that is determined by one or more explanatory variables, say income Xi, in such a way that the larger the value of the index, the greater the probability of the household owning a house. We express the index as

Ii = β1 + β2Xi

How is the (unobservable) index related to the actual decision to own a house? As before, let Yi = 1 if the household owns a house and Yi = 0 if it does not. Now, it is reasonable to assume that there is a critical or threshold level of the index, call it Ii*, such that if Ii exceeds Ii*, the household will own a house, otherwise it will not. The threshold Ii*, like Ii, is not observable, but if we assume that it is normally distributed with the same mean and variance, it is possible not only to estimate the parameters of the index given above, but also to get some information about the unobservable index itself.
Given the normality assumption, the probability that Ii* is less than or equal to Ii can be computed from the standard normal CDF as

Pi = P(Yi = 1 | Xi) = P(Ii* ≤ Ii) = P(Zi ≤ β1 + β2Xi) = F(β1 + β2Xi)

Where, Pi is the probability that a household owns a house, Zi is the standard normal variable, i.e., Zi ~ N(0, 1), and F is the standard normal CDF, which can be written explicitly as:

F(Ii) = (1/√(2π)) ∫ from −∞ to Ii of e^(−z²/2) dz
      = (1/√(2π)) ∫ from −∞ to (β1 + β2Xi) of e^(−z²/2) dz

The area under the standard normal curve from −∞ up to Ii measures the probability of owning a house.
In general, the Probit model can be specified as follows:

Yi* = β'Xi + ui,    ui ~ N(0, 1)
Yi = 1 if Yi* > 0, and Yi = 0 otherwise

P(Yi = 1 | Xi) = Φ(β'Xi) = ∫ from −∞ to β'Xi of (1/√(2π)) e^(−t²/2) dt
This model is called the Probit model and can be estimated using the maximum likelihood method (MLM). The coefficients of the Probit model are difficult to interpret directly because they measure the change in the unobservable latent index Yi* associated with a change in one of the explanatory variables. A more useful measure is what we call the marginal effects.
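A compact way to see why marginal effects are needed: under the specification above, the probability and its derivative with respect to the k-th regressor are

P(Yi = 1 | Xi) = Φ(β'Xi)   and   ∂P(Yi = 1 | Xi)/∂Xik = φ(β'Xi)·βk,

where Φ and φ are the standard normal CDF and density. The coefficient βk is therefore scaled by φ(β'Xi), which depends on where it is evaluated; it is this scaled quantity, not βk itself, that is reported as the marginal effect.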
Example
Let's revisit our grade example, whose data are given in Table-1.6 above: 32 students with their final grade in the Mathematical Economics examination in relation to the variables GPA, ASE and PC. Let us see what the Probit results look like. The regression results, estimated by the method of maximum likelihood using Stata 14, are given in the table below.
Table-1.10: Probit Regression Result
Probit regression Number of obs = 32
LR chi2(3) = 15.55
Prob > chi2 = 0.0014
Log likelihood = -12.818803 Pseudo R2 = 0.3775
In short, students with a higher GPA and those who own a PC are more likely to score an "A" than their counterparts.
In general, "qualitatively", the results of the Probit model are comparable with those obtained from the Logit model in that GPA and PC are individually statistically significant while the variable ASE is statistically insignificant in both models. Collectively, all the coefficients are statistically significant, since the value of the LR statistic is 15.55 with a p-value of 0.0014.
Marginal effects after probit
y = Pr(grade) (predict)
= .26580809
The marginal effect results and their interpretation in a Probit model are similar to those of the Logit model. Therefore, the marginal effect/probability interpretation of the estimated Probit model presented above is as follows:
As GPA increases by one unit, the probability of scoring grade "A" for an average student increases by about 53 percentage points, holding other factors constant.
As ASE increases by one point, the probability of scoring grade "A" for an average student increases by about 1.7 percentage points, holding other factors constant.
For a student who owns a PC, the probability of scoring grade "A" is higher by about 45.6 percentage points compared with non-PC owners, holding other factors constant.
The likelihood ratio (LR) statistic for the overall significance of the model is

LR = 2(lnL_UR − lnL_R)

where lnL_UR and lnL_R are the log likelihoods of the unrestricted and restricted (intercept-only) models. Given the null hypothesis that all slope coefficients are zero, the LR statistic follows the χ² distribution asymptotically, with degrees of freedom equal to the number of explanatory variables.
A commonly reported goodness-of-fit measure is the Pseudo R²:

Pseudo R² = 1 − (lnL_fit / lnL_0)

Where, as before, lnL_fit is the log likelihood value you obtain from the unrestricted (fitted) model, and lnL_0 is that generated by a regression (either Probit or Logit; note that this goodness-of-fit measure is also used and reported by statistical packages for any regression that involves MLE) with only the intercept.
The Pseudo R² measures the fit using the likelihood function: it measures the improvement in the value of the log likelihood relative to a model with no explanatory variables (lnL_0). For instance, in our estimated Logit and Probit models above, the Pseudo R² is about 0.37. This suggests that the log-likelihood value improves by about 37% with the introduction of the set of regressors in the models.
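As a check on these figures, using only the numbers reported in the output above, the intercept-only log likelihood implied by the logit results is

lnL_0 = lnL_fit / (1 − Pseudo R²) = −12.8896 / (1 − 0.3740) ≈ −20.59,

and the likelihood ratio statistic is then LR = 2(lnL_fit − lnL_0) = 2(−12.8896 + 20.59) ≈ 15.40, which matches the LR chi2(3) value reported in the logit output.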
Similar to the R² of the linear regression model, it holds that 0 ≤ Pseudo R² ≤ 1. An increasing Pseudo R² may indicate a better fit of the model, but no simple interpretation like that of the R² of the linear regression model is possible.
Remarks:
Between Logit and Probit, which model is preferable? In most applications the models are quite similar, the main difference being that the logistic distribution has slightly fatter tails; that is to say, the conditional probability Pi approaches zero or one at a slower rate in Logit than in Probit. Therefore, there is no compelling reason to choose one over the other. In practice many researchers choose the Logit model because of its comparative mathematical simplicity.
Though the models are similar, one has to be careful in interpreting the coefficients estimated by the two models. For example, for our grade example, the GPA coefficient of 1.6258 in the Probit model and 2.826 in the Logit model are not directly comparable. The reason is that, although the standard logistic distribution (the basis of the logit) and the standard normal distribution (the basis of the probit) both have a mean value of zero, their variances are different: 1 for the standard normal and π²/3 for the logistic distribution. Therefore, if you multiply the Probit coefficient by about 1.81 (which is approximately π/√3), you will get approximately the Logit coefficient. For our example, the Probit coefficient of GPA is 1.6258; multiplying this by 1.81, we obtain about 2.94, which is close to the Logit coefficient of 2.826. Alternatively, if you multiply a Logit coefficient by about 0.55 (= 1/1.81) you will get the Probit coefficient. Amemiya, however, suggests multiplying a Logit estimate by 0.625 to get a better estimate of the corresponding Probit estimate. Conversely, multiplying a Probit coefficient by 1.6 (= 1/0.625) gives the corresponding Logit coefficient.
In general, in practical applications we follow Amemiya's suggestion that Logit coefficients are about 1.6 times the corresponding Probit coefficients, and Probit coefficients are about 0.625 times the corresponding Logit coefficients. Similarly, the slope coefficients of the LPM are about 0.25 times the corresponding Logit coefficients, except for the intercept: the intercept of the LPM is about 0.25 times the intercept of the Logit model plus 0.5. Note, however, that the models give similar marginal effects and the same signs for the coefficients of the independent variables.
Exercise
Assume that we want to study the impact of the family size (members below 18 years) of a household and the average annual hours of work by the household on the poverty status of the household in a given village. Suppose that we take a sample of 20 households from a particular village and obtain the following data on family size (FS), hours of work (HRS) and poverty status (Yi) of the households (Yi = 0 if the household is above the poverty line and Yi = 1 if the household is below the poverty line). We use a Probit regression to see the effect of family size and hours of work on the poverty status of a household.
Household   FS   HRS   Yi
1   4   200   0
2   1   300   0
3 2 600 0
4 1 1000 0
5 8 400 1
6 9 0 1
7 3 900 0
8 2 0 0
9 1 0 1
10 2 0 0
11 1 1000 0
12 4 2000 0
13 3 1000 0
14 6 300 1
15 3 1000 0
16 4 1000 1
17 5 200 0
18 8 100 1
19 9 0 1
20 10 0 1
. probit Y FS HRS
. mfx
Since the Probit coefficients themselves are difficult to interpret, we interpret the marginal effects instead.
Regression results from Logit Model for the above Poverty data
. logit Y FS HRS, nolog
. mfx
Assignments:
CHAPTER TWO
INTRODUCTION TO BASIC TIME SERIES DATA ANALYSIS
Introduction
Time series data is one of the important types of data used in empirical analysis. Time series analysis comprises
methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the
data. It is an attempt to understand the nature of the time series and to develop appropriate models useful for forecasting. This chapter is devoted to the discussion of introductory concepts in time series analysis for two
major reasons:
1. Time series data is frequently used in practice.
2. Time series data analysis poses several challenges to econometricians and practitioners.
Note that almost all of these challenges arise mainly due to the non-stationarity of time series data sets. Therefore, our major emphasis in this chapter is on the discussion of the nature and tests of stationarity, and the remedial measures for non-stationary data sets.
In Econometrics I, we studied the statistical properties of the OLS estimators based on the notion that samples
were randomly drawn from the appropriate population. Understanding why cross-sectional data should be
viewed as random outcomes is fairly straightforward: a different sample drawn from the population will
generally yield different values of the variables. Therefore, the OLS estimates computed from different random
samples will generally differ, and this is why we consider the OLS estimators to be random variables.
How should we think about randomness in time series data? Certainly, economic time series satisfy the
intuitive requirements for being outcomes of random variables. For example, today we do not know what the
trade balance of Ethiopia will be at the end of this year. We do not know what the annual growth in output will
be in Ethiopia during the coming year. Since the outcomes of these variables are not foreknown, they should
clearly be viewed as random variables.
Table 2.1.1
Time series Data on some macroeconomic variables in Ethiopia
Year TB MS2 GDP EXR G
1963 0.69 629.6 9,400 2.4000 303.99
1964 0.69 658.3 9,873 2.3000 316.19
1965 1.04 808 9,892 2.1900 348.14
1966 1.15 1066.6 10,353 2.0700 368.57
1967 0.71 1139.4 11,412 2.0700 442.89
1968 0.79 1421.8 11,145 2.0700 551.99
1969 0.86 1467.9 11,916 2.0700 654.73
1970 0.84 1682.2 13,221 2.0700 752.34
1971 0.61 1848 13,890 2.0700 942.56
1972 0.65 2053.2 15,143 2.0700 993.59
1973 0.62 2377.6 16,135 2.0700 1,017.56
1974 0.47 2643.7 16,530 2.0700 1,090.74
1975 0.29 3040.5 17,498 2.0700 1,244.54
1976 0.45 3383.7 19,655 2.0700 1,506.13
1977 0.42 3849 17,865 2.0700 1,454.12
1978 0.43 4448.2 21,517 2.0700 1,524.52
1979 0.36 4808.7 22,367 2.0700 1,636.04
1980 0.34 5238.7 23,679 2.0700 1,726.70
1981 0.43 5705 24,260 2.0700 2,066.50
1982 0.40 6708.2 25,413 2.0700 2,336.99
1983 0.29 7959.2 27,323 2.0700 2,467.53
1984 0.18 9010.7 31,362 2.0700 2,416.75
1985 0.26 10136.7 34,621 4.2700 2,819.51
1986 0.30 11598.7 43,171 5.7700 3,770.58
1987 0.43 14408.5 43,849 6.2500 4,220.57
1988 0.35 15654.8 55,536 6.3200 5,378.98
1989 0.46 16550.6 62,268 6.5000 5,671.26
1990 0.44 18585.3 64,501 6.8800 5,984.25
1991 0.31 19399.0 62,028 7.5100 7,069.36
1992 0.35 22177.8 66,648 8.1400 11,921.90
1993 0.31 24516.2 68,027 8.3300 9,963.85
1994 0.27 27322.0 66,557 8.5400 9,873.38
1995 0.26 30469.6 73,432 8.5809 9,849.58
1996 0.23 34662.5 86,661 8.6197 11,315.21
1997 0.23 40211.7 106,473 8.6518 13,203.04
1998 0.22 46377.4 131,641 8.6810 16,080.46
1999 0.23 56651.9 171,989 8.7943 18,071.82
2000 0.22 68182.1 248,303 9.2441 24,364.45
2001 0.19 82509.8 335,392 10.4205 27,592.06
2002 0.24 104432.4 382,939 12.8909 37,527.99
2003 0.33 145377.0 511,157 16.1178 50,093.38
Formally, a sequence of random variables indexed by time is called a stochastic process or a time series process ("stochastic" is a synonym for random). When we collect a time series data set, we obtain one possible outcome, or realization, of the stochastic process. We can only see a single realization, because we cannot go back in time and start the process over again (this is analogous to cross-sectional analysis, where we can collect only one random sample). However, if certain conditions in history had been different, we would generally obtain a different realization for the stochastic process, and this is why we think of time series data as the outcome of random variables. The set of all possible realizations of a time series process plays the role of the population in cross-sectional analysis.
A random or stochastic process is a collection of random variables ordered in time. We let Y denote a random variable and use the notation Yt to express the value of Y in time t. In what sense can we regard GDP as a stochastic process? Consider, for instance, the GDP of 9.4 billion Birr of Ethiopia for the year 1963. In theory, the GDP figure for 1963 could have been any number, depending on the economic and political climate then prevailing. The figure of 9.4 billion Birr is a particular realization of all such possibilities. Therefore, we can say that GDP is a stochastic process and the actual values we observed for the period 1963-2003 are a particular realization of that process (i.e. the sample). The distinction between the stochastic process and its realization is akin to the distinction between population and sample in cross-sectional data. Just as we use sample data to draw inferences about a population, in time series we use the realization to draw inferences about the underlying stochastic process.
In practice, therefore, we work with the weaker notion of covariance (weak) stationarity, which is a necessary condition for building a time series model that is useful for future forecasting. To explain weak stationarity, let Yt be a stochastic time series with these properties:

Mean:         E(Yt) = μ                                          (1)
Variance:     var(Yt) = E(Yt − μ)² = σ²                          (2)
Covariance:   γk = E[(Yt − μ)(Yt+k − μ)]                         (3)

Where, γk, the covariance (or autocovariance) at lag k, is the covariance between the values of Yt and Yt+k, that is, between two Y values k periods apart. If k = 0, we obtain γ0, which is simply the variance of Y (= σ²); if k = 1, γ1 is the covariance between two adjacent values of Y.
In short, if a time series is stationary, its mean, variance, and autocovariance (at various lags) remain the same no matter at what point we measure them; that is, they are time invariant. Such a time series will tend to return to its mean (called mean reversion) and fluctuations around this mean (measured by its variance) will have broadly constant amplitude. If a time series is not stationary in the sense just defined, it is called a non-stationary time series. In other words, a non-stationary time series will have a time-varying mean or a time-varying variance, or both.
Why are stationary time series so important? There are two major reasons.
1. If a time series is non-stationary, we can study its behavior only for the time period under consideration.
Each set of time series data will therefore be for a particular episode. As a result, it is not possible to
generalize it to other time periods. Therefore, for the purpose of forecasting or policy analysis, such
(non-stationary) time series may be of little practical value.
2. If we have two or more non-stationary time series, regression analysis involving such time series may
lead to the phenomenon of spurious or non-sense regression.
How do we know that a particular time series is stationary? In particular, are the time series shown in Figure 2.2.1 stationary? If we rely on common sense, it would seem that many of the time series depicted in the figure are non-stationary, at least in their mean values, because some of them are trending upward while others are trending downward.
Exercise: for the graphs in Figure 2.2.1, indicate if the time series are stationary.
Figure 2.2.1
Nine examples of time series data; (a) Google stock price for 200 consecutive days; (b) Daily change in the Google stock
price for 200 consecutive days; (c) Annual number of strikes in the US; (d) Monthly sales of new one-family house (e)
Annual price of a dozen eggs in the US (constant dollars); (f) Monthly total of pigs slaughtered in Victoria, Australia; (g)
Annual total of lynx trapped in the McKenzie River district of north-west Canada; (h) Monthly Australian beer
production; (i) Monthly Australian electricity production. (Hyndman & Athanasopoulos, 2018)
Answer: b and g are the only stationary time series.
Although our interest is in stationary time series, we often encounter non-stationary time series, the classic
example being the random walk model (RWM). It is often said that asset prices, such as stock prices follow a
random walk; that is, they are non-stationary. We distinguish two types of random walks: (1) random walk
without drift (i.e., no constant or intercept term) and (2) random walk with drift (i.e., a constant term is present).
Random walk without drift: suppose ut is a white noise error term with mean 0 and constant variance σ². Then the series Yt is said to be a random walk without drift if

Yt = Yt−1 + ut                                                   (4)

From (4) we can write
Y1 = Y0 + u1
Y2 = Y1 + u2 = Y0 + u1 + u2
In general, if the process started at some time 0 with a value of Y0, we have

Yt = Y0 + Σ ut                                                   (5)

Therefore, E(Yt) = Y0 and var(Yt) = tσ².
As the preceding expressions show, the mean of Yt is equal to its initial, or starting, value, which is constant, but as t increases, its variance increases indefinitely, thus violating a condition of stationarity. In short, the RWM without drift is a non-stationary stochastic process. In practice Y0 is often set at zero, in which case E(Yt) = 0. An interesting feature of the RWM is the persistence of random shocks (random errors), which is clear from (5): Yt is the sum of the initial Y0 plus the sum of random shocks. As a result, the impact of a particular shock does not die away. For example, if u2 = 2 rather than u2 = 0, then all Yt's from Y2 onward will be 2 units higher and the effect of this shock never dies out. That is why the random walk is said to have an infinite memory.
It is easy to show that, while Yt is non-stationary, its first difference is stationary. In other words, the first differences of a random walk time series are stationary. But we will have more to say about this later.

Random walk with drift: the random walk with drift is written as

Yt = δ + Yt−1 + ut

where δ is known as the drift parameter. Following the same steps as for the RWM without drift, one can show that

E(Yt) = Y0 + tδ                                                  (11)
var(Yt) = tσ²                                                    (12)
As can be seen from the above, for the RWM with drift the mean as well as the variance increases over time, again violating the conditions of (weak) stationarity. In short, the RWM, with or without drift, is a non-stationary stochastic process.
Remark
The random walk model is an example of what is known in the literature as a unit root process. Since this term has gained tremendous currency in the time series literature, we need to note what a unit root process is.
Let's rewrite the RWM (4) as:

Yt = ρYt−1 + ut,    −1 ≤ ρ ≤ 1                                   (13)

If ρ = 1, (13) becomes an RWM (without drift). If ρ is in fact 1, we face what is known as the unit root problem, that is, a situation of non-stationarity; we already know that in this case the variance of Yt is not stationary. The name unit root is due to the fact that ρ = 1. As noted above, the first differences of a random walk time series (a unit root process) are stationary. Thus, the terms non-stationarity, random walk, and unit root can be treated as synonymous.
Consider first the model

Yt = Yt−1 + ut                                                   (14)

which is nothing but an RWM without drift and is therefore non-stationary. But note that, if we write (14) as

ΔYt = Yt − Yt−1 = ut                                             (15)

it becomes stationary, as noted before. Hence, an RWM without drift is a difference stationary process (DSP).
Next consider the model

Yt = δ + Yt−1 + ut                                               (16)

which is a random walk with drift and is therefore non-stationary. If we write it as

ΔYt = Yt − Yt−1 = δ + ut

this means Yt will exhibit a positive (if δ > 0) or negative (if δ < 0) trend. Such a trend is called a stochastic trend. Equation (16) is also a difference stationary process (DSP), because the non-stationarity in Yt can be eliminated by taking first differences of the time series.
Deterministic trend: a trend stationary process (TSP) has a data generating process of the form

Yt = β1 + β2t + ut

Although the mean of Yt, β1 + β2t, is not constant, its variance (= σ²) is. Once the values of β1 and β2 are known, the mean can be forecast perfectly. Therefore, if we subtract the mean of Yt from Yt, the resulting series will be stationary, hence the name trend stationary. This procedure of removing the (deterministic) trend is called detrending.
To see the difference between stochastic and deterministic trends, consider Figure 2.3.1 (next page). The series named "stochastic" in Figure 2.3.1 is generated by an RWM, where 500 values of ut were generated from a standard normal distribution and where the initial value of Y was set at 1. The series named "deterministic" is generated as a linear function of time plus a random error, Yt = β2t + ut, where the ut's were generated as above and where t is time measured chronologically.
As you can see in Figure 2.3.1, in the case of the deterministic trend, the deviations from the trend line (which represents the nonstationary mean) are purely random and they die out quickly; they do not contribute to the long-run development of the time series, which is determined by the trend component β2t. In the case of the stochastic trend, on the other hand, the random component ut affects the long-run course of the series Yt.
Figure 2.3.1
Deterministic Versus Stochastic Trend
Summarizing, a deterministic trend is a nonrandom function of time, whereas a stochastic trend is random and
varies over time. The simplest model of a variable with a stochastic trend is the random walk. According to
Stock and Watson (2007), it is more appropriate to model economic time series as having stochastic rather than
deterministic trends. Therefore, our treatment of trends in economic time series focuses mainly on stochastic
rather than deterministic trends, and when we refer to “trends” in time series data, we mean stochastic trends.
Most economic time series are generally I(1); that is, they generally become stationary only after taking their
first differences. In our example above, the trade balance of Ethiopia is integrated of order one, I(1), but the data
on government expenditure (G) and the exchange rate (EXR) of Ethiopia are integrated of order two, I(2).
2.5 Testing the Stationarity of Time Series Data
In many time series analyses, one of the most important preliminary steps is to uncover the characteristics of the
data used in the analysis. The main goal of a stationarity test is to establish whether a variable has a constant mean,
a constant variance and time-invariant covariances, i.e. whether it is second-order (covariance) stationary.
Following the stationarity test, if the variables are non-stationary, we cannot use the data for forecasting purposes
unless the series is transformed into a stationary one.
As noted earlier, before one pursues formal tests, it is always advisable to plot the time series under study, as we
have done above for the data given in Table-2.1.1. Such a plot gives an initial clue about the likely nature of the
time series. Take, for instance, the trade balance time series shown in Figure 2.5.1. You will see that over the
period of study the trade balance has been declining, that is, showing a downward trend, suggesting perhaps that the
mean of TB has been changing. This suggests that the TB series is not stationary. Such an intuitive
feel is the starting point of more formal tests of stationarity.
Figure 2.5.1   The Trade Balance (TB) of Ethiopia through time
Figure 2.5.2   The trend of Money Supply (MS2) of Ethiopia through time
Figure 2.5.3   The Exchange Rate (ETH/$US) of Ethiopia through time
Figure 2.5.4   The Government Expenditure of Ethiopia through time
Some of the above time series graphs show an upward trend (MS2, EX and G), which may be an indication of
non-stationarity in these data sets. That is, the mean or the variance or both may be increasing with the passage of
time. The graph for the Trade Balance (TB) of Ethiopia, on the other hand, shows a downward trend, which may
also be an indication of non-stationarity of the TB series.
Since both the covariance and the variance are measured in the same units of measurement, ρ_k is a unitless, or pure,
number. It lies between −1 and +1, as any correlation coefficient does. If we plot ρ_k against k, the graph we
obtain is known as the population correlogram. Since in practice we only have a realization (i.e., a sample) of
a stochastic process, we can only compute the sample autocorrelation function (SACF), ρ̂_k. To compute this,
we must first compute the sample covariance at lag k, γ̂_k, and the sample variance, γ̂_0, which are defined as
γ̂_k = Σ (Y_t − Ȳ)(Y_{t+k} − Ȳ) / n
γ̂_0 = Σ (Y_t − Ȳ)² / n
Therefore, the sample autocorrelation function at lag k is simply the ratio of the sample covariance (at lag k) to the sample variance:
ρ̂_k = γ̂_k / γ̂_0
A plot of ρ̂_k against k is known as the sample correlogram.
How does a sample correlogram enable us to find out whether a particular time series is stationary? For this
purpose, let us first present the sample correlogram of our exchange rate data set given in Table 2.1.1. The
correlogram of the exchange rate data is presented for 20 lags in Figure 2.5.5.
Figure 2.5.5
Correlogram of Ethiopian Exchange Rate (EX), 1963-2003
. corrgram ex, lags(20)
LAG   AC   PAC   Q   Prob>Q   [Autocorrelation]   [Partial Autocor]
(the rows of estimated autocorrelations are not reproduced here)
Now, look at the column labeled AC, which is the sample autocorrelation function, and the first diagram on
the left, labeled autocorrelation. The solid vertical line in this diagram represents the zero axis; observations
above the line are positive values and those below the line are negative values. Given a correlogram, if the
values of the ACF are close to zero (from above or below), the data are said to be stationary. In other words, for a
stationary time series the autocorrelations (i.e., the ACF) between Y_t and Y_{t+k} for any lag k are close to zero
(i.e., the autocorrelation coefficients are statistically insignificant). That is, successive values of the time
series (such as Y_t and Y_{t+1}) are not related to each other. Moreover, the bars on the right-hand side of the
correlogram stay short, close to the vertical zero line, for a stationary time series.
On the other hand, if a series has a (stochastic) trend, i.e., nonstationary, successive observations are highly
correlated, and the autocorrelation coefficients are typically significantly different from zero for the first
several time lags and then gradually drop toward zero as the number of lags increases. The autocorrelation
coefficient for time lag 1 is often very large (close to 1).
Thus, looking at Figure 2.5.5, we can see that the autocorrelation coefficients for the exchange rate series are high
(close to +1) at short lags and decline only slowly as the lag length increases. Figure 2.5.5 is an example of a
correlogram of a nonstationary time series.
For further illustration, the correlograms of the trade balance and money supply time series are presented for
10 lags in Figures 2.5.6 and 2.5.7 below.
In both correlograms, if we look at the column labeled AC, all the values are close to +1. Moreover, the bars on
the right-hand side of the correlograms are long, extending far to the right of the vertical zero line; that is, the
autocorrelations are close to +1. Thus, both the money supply and the trade balance time series are non-
stationary.
Figure 2.5.6
Correlogram of Ethiopian Trade Balance, 1963-2003
. corrgram tb
LAG   AC   PAC   Q   Prob>Q   [Autocorrelation]   [Partial Autocor]
(the rows of estimated autocorrelations are not reproduced here)
Figure 2.5.7
Correlogram of Ethiopian Money Supply (MS2), 1963-2003
. corrgram ms2
LAG   AC   PAC   Q   Prob>Q   [Autocorrelation]   [Partial Autocor]
(the rows of estimated autocorrelations are not reproduced here)
A test of stationarity (or non-stationarity) that has become widely popular over the past several years is the
unit root test. In this section, therefore, we shall address the unit root test using the Augmented Dickey-
Fuller (ADF) test of stationarity of the time series. Dickey and Fuller (1979, 1981) devised a procedure to
formally test for non-stationarity. The ADF test is a modified version of the original Dickey–Fuller (DF)
test. The modification is the inclusion of extra lagged terms of the dependent variable to eliminate
autocorrelation in the test equation.
The key insight of their test is that testing for non-stationarity is equivalent to testing for the existence of a unit
root. Thus, the starting point is the unit root (stochastic) process given by:
Y_t = ρY_{t-1} + u_t,   −1 ≤ ρ ≤ 1      (18)
We know that if ρ = 1, that is, in the case of a unit root, (18) becomes a random walk model without drift,
which we know is a non-stationary stochastic process. Therefore, why not simply regress Y_t on its one-
period lagged value Y_{t-1} and find out whether the estimated ρ is statistically equal to 1? If it is, then Y_t is non-
stationary. This is the general idea behind the unit root test of stationarity.
In a nutshell, what we need to examine here is whether ρ = 1 (unity, and hence a "unit root"). Obviously, the null
hypothesis is H0: ρ = 1, and the alternative hypothesis is H1: ρ < 1.
We obtain a more convenient version of the test by subtracting Y_{t-1} from both sides of (18):
Y_t − Y_{t-1} = ρY_{t-1} − Y_{t-1} + u_t
            = (ρ − 1)Y_{t-1} + u_t
ΔY_t = δY_{t-1} + u_t      (19)
where δ = (ρ − 1) and Δ is the first-difference operator. In practice, therefore, instead of estimating (18), we
estimate (19) and test the null hypothesis H0: δ = 0 against the alternative hypothesis H1: δ < 0. In this
case, if δ = 0, then ρ = 1, i.e. we have a unit root and Y_t follows a pure random walk (and, of course,
is nonstationary).
Now let us turn to the estimation of (19). This is simple enough: all we have to do is take the first
differences of Y_t, regress them on Y_{t-1}, and see whether the estimated slope coefficient in this regression (δ̂)
is zero or not. If it is zero, we conclude that Y_t is non-stationary; but if it is negative, we conclude that Y_t is
stationary.
The modified version of the test equation given by (19), i.e., the test equation with extra lagged terms of the
dependent variable, is specified as:
ΔY_t = δY_{t-1} + Σ(i=1..p) α_i ΔY_{t-i} + u_t      (20)
Dickey and Fuller (1981) also proposed two alternative regression equations that can be used for testing for
the presence of a unit root. The first contains a constant in the random walk process, and the second
contains both a constant and a non-stochastic time trend, as in the following equations, respectively:
ΔY_t = β1 + δY_{t-1} + Σ(i=1..p) α_i ΔY_{t-i} + u_t      (21)
ΔY_t = β1 + β2·t + δY_{t-1} + Σ(i=1..p) α_i ΔY_{t-i} + u_t      (22)
The difference between the three regressions concerns the presence of the deterministic elements β1 and β2·t.
Note that in all three test equations p is the lag length, i.e. the number of lagged dependent-variable terms
to be included. That is, we need to choose a lag length p to run the ADF test so that
the residuals are not serially correlated. To determine the number of lags, p, we can use one of the
following procedures.
a. General-to-specific testing: Start with Pmax and drop lags until the last lag is statistically
significant, i.e., delete insignificant lags and include the significant ones.
b. Use information criteria such as the Schwarz information criterion (SIC), Akaike's information criterion
(AIC), the Final Prediction Error (FPE), or the Hannan-Quinn criterion (HQIC).
In practice, we just click the „automatic selection‟ on the „lag length‟ dialog box in EViews.
Now, the only question is which test we use to find out whether the estimated coefficient of Y_{t-1} in (20), (21) and (22)
is statistically zero or not. The ADF test for stationarity is simply the normal 't' test on the coefficient of the
lagged dependent variable Y_{t-1} from one of the three models (20)-(22). This test does not, however, have
a conventional 't' distribution, so we must use special critical values which were originally calculated by
Dickey and Fuller and are known as the Dickey-Fuller tau statistic1. However, most modern statistical
packages such as Stata and EViews routinely produce the critical values for Dickey-Fuller tests at the 1%, 5%,
and 10% significance levels.
In all three test equations, the ADF test concerns whether δ = 0. The ADF test statistic is the
't' statistic on the lagged dependent variable Y_{t-1}. The ADF statistic is a negative number, and the more negative
it is, the stronger the rejection of the hypothesis that there is a unit root.
Null hypothesis (H0): δ = 0. If this hypothesis is not rejected, the time series has a unit root, meaning it is
non-stationary; it has some time-dependent structure.
Alternative hypothesis (H1): δ < 0. If the null hypothesis is rejected, the
time series does not have a unit root, meaning it is stationary.
Equivalently, if
p-value > 0.05: fail to reject H0; the data have a unit root and are non-stationary
p-value ≤ 0.05: reject H0; the data do not have a unit root and are stationary
In short, if the ADF test statistic is more negative (larger in absolute value) than the critical value(s), or equivalently if
the p-value for the ADF test statistic is below the chosen significance level, we reject the null hypothesis of a unit root and
conclude that Y_t is a stationary process.
Note that the choice among the three possible forms of the ADF test equation depends on the knowledge of
the econometrician about the nature of his/her data. Plotting the data2 and observing the graph is sometimes
very useful because it can clearly indicate the presence or absence of deterministic regressors. However, if the
form of the data-generating process is unknown, it is suggested to estimate the most general model given by (22),
then answer a set of questions regarding the appropriateness of each model and move to the next model
(i.e., a general-to-specific procedure).
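In Stata, for instance, the different forms of the ADF test are available through the dfuller command. A sketch using the variable names from the illustration below (the time variable year and the single lagged difference are assumptions):
* declare the time variable and run the ADF test in its different forms
tsset year
dfuller tb, lags(1)            // constant only
dfuller tb, lags(1) trend      // constant and deterministic trend
dfuller ex, lags(1) trend
dfuller ms2, lags(1) trend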
Illustration: Unit Root Test of Some Macroeconomic Variables of Ethiopia using EViews.
As noted above, the ADF test is based on the null hypothesis that a unit root exists in the time series. Using
the ADF test, some of the variables (TB, MS2, and EX) from our time series data set are examined for unit
roots, as shown in Tables 2.5.1-2.5.3.
1 The Dickey-Fuller tau statistic can be found in the appendix of most econometrics textbooks (e.g., in Gujarati).
2 Sometimes, if the data are exponentially trending, you may need to take the log of the data before differencing.
In that case the ADF unit root tests should be applied to the differences of the log of the series rather than to the
differences of the series itself.
Table 2.5.1
Augmented Dickey-Fuller Unit Root Test on Trade Balance
Table 2.5.2
Augmented Dickey-Fuller Unit Root Test on Exchange Rate
Table 2.5.3
Augmented Dickey-Fuller Unit Root Test on Log of Money Supply
All the above test results show that the data sets are non-stationary: the test statistics are less negative (smaller in
absolute value) than the critical values at the 1%, 5% and 10% levels of significance for all three data sets. This means
we fail to reject the null hypothesis that there is a unit root in each data set.
A. Difference-Stationary Processes
If a time series has a unit root, the first differences of such a time series (i.e., a series with a stochastic trend) are
stationary. Therefore, the solution here is to take the first differences of the time series.
Returning to our Ethiopian Trade Balance (TB) time series, we have already seen that it has a unit root. Let
us now see what happens if we take the first differences of the TB series. Will the first difference still have a
unit root? Let's perform the ADF (Dickey-Fuller) test on the first-differenced series.
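A sketch of the corresponding Stata commands, using D., Stata's first-difference operator (the lag length is again an assumption):
* ADF test on the first difference of the trade balance
dfuller D.tb, lags(1)
* the exchange rate needs a second difference before it becomes stationary (see below)
dfuller D2.ex, lags(1)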
The 1 percent critical ADF τ value is about −3.5073, as given in Appendix D, Table 7, of Gujarati. Since
the computed τ (= t) is more negative than the critical value, we conclude that the first-differenced TB is
stationary; that is, it is I(0). The overall unit root test result for the TB series at first
difference is given in Table 2.5.4 (the test result from EViews).
Table 2.5.4
The output of the unit root test (ADF test) for the variable TB
Figure 2.5.8
First differenced trade balance time series data
If you compare the above Figure 2.5.8 with Figure 2.5.1 (i.e. the Trade Balance figure at level), you will see
the obvious difference between the two.
Table 2.5.5
Unit Root Test of the Log of MS2 data set at first difference
Note that sometimes a series with a stochastic trend may need to be differenced twice to become stationary.
Such a series is said to be integrated of order two, I(2). For instance, the Exchange Rate data set became
stationary only after second differencing.
Table 2.5.6
Unit Root test of the Exchange Rate data set at second difference
In general, the test results in Tables 2.5.4-2.5.6 reveal that the trade balance (TB) and the log of
money supply (LMS2) became stationary at first difference, while the exchange rate (EX) became
stationary at the second difference. In all these test results, the null hypothesis of a unit root is rejected.
CASE 3: The series are of different orders of integration.
Researchers are more likely to be confronted with this situation. For instance, some of the variables are I(0)
while others are I(1).
Like case 2, the cointegration test is also required under this scenario.
Similar to case 2, if series are not cointegrated, we are expected to estimate only the short run.
However, both the long run model and short run model are valid if there is cointegration.
The appropriate test to use is the Bounds cointegration test. This test, however, is beyond the scope
of this course. It may be covered in post graduate econometrics courses.
Consider two time series generated as pure random walks,
Y_t = Y_{t-1} + u_t
X_t = X_{t-1} + v_t
Assume that u_t and v_t are serially uncorrelated as well as mutually uncorrelated. As you know by now, both
these time series are non-stationary; that is, they are integrated of order one, or exhibit stochastic trends.
Suppose we regress Y_t on X_t. Since Y_t and X_t are uncorrelated processes, the R² from the regression of Y
on X should tend to zero; that is, there should not be any relationship between the two variables. But the
regression results are as follows:
Table 2.6.1
Output of the linear regression Y_t = α + βX_t + u_t
                 Coefficients     t-Statistics     P-Value
X_t              0.54             3.7431           0.0012
Constant         110              2.0832           0.0232
R² = 0.712       d-statistic = 0.241
As you can see, the coefficient of X_t is highly statistically significant, and the R² value is also high and
greater than the Durbin-Watson d-statistic. From these results, you may be tempted to conclude that there is
a significant statistical relationship between Y and X, whereas a priori there should be none. This is, in a
nutshell, the phenomenon of spurious or nonsense regression, first discovered by Yule. Yule showed that
(spurious) correlation could persist in non-stationary time series even if the sample is very large. That there
is something wrong in the preceding regression is suggested by the extremely low Durbin–Watson d value,
which suggests very strong first-order autocorrelation. According to Granger and Newbold, when R² > d,
it is a good rule of thumb to suspect that the estimated regression is spurious, as in the example above.
That the regression results presented above are meaningless can easily be seen by regressing the first
differences of Y_t (= ΔY_t) on the first differences of X_t (= ΔX_t); remember that although Y_t and X_t are non-
stationary, their first differences are stationary. In such a regression you will find that R² is practically zero,
as it should be, and the Durbin–Watson d is about 2. Although dramatic, this example is a strong reminder
that one should be extremely wary of conducting regression analysis based on time series that exhibit
stochastic trends, and one should therefore be very cautious about reading too much into regression
results based on I(1) variables.
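This is easy to verify by simulation. A Stata sketch (the seed and sample size are arbitrary choices):
* two independent random walks: the levels regression looks 'significant', the differenced one does not
clear
set obs 500
set seed 2023
gen t = _n
tsset t
gen x = sum(rnormal())        // pure random walk
gen y = sum(rnormal())        // another, independent random walk
regress y x                   // typically a 'significant' slope and R-squared > d
estat dwatson                 // very low Durbin-Watson d
regress D.y D.x               // in first differences the apparent relationship disappears
estat dwatson                 // d now close to 2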
One solution to this spurious regression problem exists if the time series under investigation are
cointegrated. A cointegration test is the appropriate method for detecting the existence of a long-run
relationship. Engle and Granger (1987) argue that, even though a set of economic series is not stationary,
there may exist some linear combination of the variables that is stationary, if the variables are really
related. If the separate series are stationary only after differencing but a linear combination of their
levels is stationary, the series are cointegrated. Cointegration is thus an overriding requirement for any
economic model using nonstationary time series data: if the variables do not cointegrate, we face
the problem of spurious regression and the econometric work becomes almost meaningless.
If there really is a genuine long-run relationship between two variables Y_t and X_t, then
although both variables may rise or decline over time (because they are trended), there will be a common trend
that links them together. For an equilibrium, or long-run, relationship to exist, what we require is a
linear combination of Y_t and X_t that is a stationary variable [an I(0) variable].
Assume that two variables Y_t and X_t are individually nonstationary time series. A linear combination of Y_t
and X_t can be obtained directly by estimating the following regression and taking the residuals:
Y_t = β1 + β2 X_t + u_t,   û_t = Y_t − β̂1 − β̂2 X_t
We now subject û_t to a unit root test. If we find that û_t ~ I(0), then the variables Y_t and X_t are
said to be co-integrated. This is an interesting situation, for although Y_t and X_t are individually I(1), that
is, they have stochastic trends, their linear combination û_t is I(0). So to speak, the
linear combination cancels out the stochastic trends in the two series. If you take consumption and income as
two I(1) variables, savings, defined as (income − consumption), could be I(0). As a result, a regression of
consumption on income would be meaningful (i.e., not spurious). In this case we say that the two variables are
co-integrated. Economically speaking, two variables will be co-integrated if they have a long-term, or
equilibrium, relationship between them.
In short, provided we check that the residuals from regressions like Y_t = β1 + β2 X_t + u_t are I(0), i.e.
stationary, the traditional regression methodology (including the t and F tests) that we have considered
extensively is applicable to data involving (nonstationary) time series. The valuable contribution of the
concepts of unit root, co-integration, etc., is to force us to find out whether the regression residuals are stationary.
As Granger notes, "A test for co-integration can be thought of as a pre-test to avoid spurious regression
situations".
In this chapter we will only use the Engle-Granger (1987) two-step co-integration test procedure. Other
procedures are discussed in post-graduate econometrics courses.
According to the Engle-Granger two-step method, we first estimate the static (long run) regression equation
to get the long run multipliers. In the second step an error correction model is formulated and estimated using the
residuals from the first step as the equilibrium error correction term.
In other words, all we have to do is estimate a regression like Y_t = β1 + β2 X_t + u_t, obtain the residuals,
and apply the ADF unit root test procedure to them. There is one precaution to exercise, however. Since the estimated
residuals are based on the estimated co-integrating parameter β̂2, the standard Dickey-Fuller critical
values are not quite appropriate. However, Engle and Granger have calculated the appropriate critical values,
which can be found in standard econometrics references. Therefore, the ADF test in the present
context is known as the Engle–Granger (EG) test. Several software packages now present these
critical values along with their other output.
Suppose we want to estimate the following long run model that relates the log of the trade balance (LTB) to the
logs of money supply, GDP, the exchange rate and government expenditure:
LTB_t = β0 + β1 LMS2_t + β2 LGDP_t + β3 LEXR_t + β4 LG_t + u_t
1. Estimate the above model using least squares and save the residuals for the stationarity test.
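A sketch of these two Engle-Granger steps in Stata, assuming the variables are named ltb, lms2, lgdp, lexr and lg and the data have been tsset (strictly, the Engle-Granger rather than the standard Dickey-Fuller critical values should be used for the residual test):
* Step 1: static long run (cointegrating) regression
regress ltb lms2 lgdp lexr lg
predict ehat, residuals            // save the residuals
* Step 2: unit root test on the residuals (no constant, since they have zero mean)
dfuller ehat, noconstant lags(1)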
Table 2.6.2
Dependent Variable: LTB;Method: Least Squares; Included observations: 40 after adjustments
Variables Coefficients Standard Error t-Statistics P-value
LMS2 -0.734042 0.256962 -2.856615 0.0071
LGDP 0.219502 0.174400 1.258610 0.2163
LEXR 0.332259 0.137795 2.411259 0.0211
LG 0.175736 0.298237 0.589248 0.5594
Constant 1.500707 0.951265 1.577591 0.1234
Durbin-Watson d statistic = 1.18
a. Interpret the above regression results
b. Which variables significantly affect the trade balance of Ethiopia in the long run?
2. Test of the Stationarity of the residuals obtained from the long run model
Table 2.6.3
Augmented Dickey-Fuller Test Equation; Dependent Variable: D(Residuals); Method: Least Squares;
Included observations: 40 after adjustments
Variables         Coefficients   Standard Error   t-Statistics   P-value
Residuals(-1)     -0.650962      0.150385         -4.328652      0.0001
Constant           0.014651      0.031967          0.458332      0.6493
R²                 0.330246      F-statistic       18.73722
Adjusted R²        0.312621      Prob(F-statistic)  0.000105
Durbin-Watson stat 1.812471
The above regression result shows that the coefficient on the lagged residuals is negative and highly significant, so we
reject the null hypothesis of a unit root in the residuals. That means the residuals from the long run model are
stationary, which implies that the regression of LTB on LMS2, LGDP, LEXR and LG is meaningful (i.e., not spurious).
Thus, the variables in the model have a long run relationship, or equilibrium.
The coefficient of the lagged residuals is expected to be negative and less than one in absolute value. It is less
than 1 because the adjustment towards equilibrium may not be 100%. If the coefficient is zero, there is no
adjustment toward the long-run equilibrium. Note that if the coefficient is negative and significant, the adjustment
is towards the equilibrium, but if it is positive and significant, the adjustment is away from the equilibrium.
Illustration: Error Correction Model for the Trade Balance of Ethiopia
The Error Correction Model (ECM) for the Trade Balance of Ethiopia is specified as:
ΔLTB_t = α0 + Σ(i=1..p) α_i ΔLTB_{t-i} + Σ(i=0..q) β_i ΔLEXR_{t-i} + Σ(i=0..q) γ_i ΔLMS2_{t-i} + Σ(i=0..q) θ_i ΔLGDP_{t-i} + Σ(i=0..q) λ_i ΔLG_{t-i} + δ·ECT_{t-1} + ε_t
Where p and q are the optimal lag lengths (which can be determined by minimizing the model selection
information criteria) for the dependent and explanatory variables, respectively, while ECT_{t-1} is the
one-period lagged value of the residuals obtained from the cointegrating regression, i.e., the long
run/static model.
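A sketch of how this ECM could be estimated in Stata with one lag of each differenced variable (ehat is the residual series saved from the long run regression; the variable names and the single lag are assumptions):
* error correction model: first differences plus the lagged equilibrium error
regress D.ltb LD.ltb LD.lexr LD.lms2 LD.lgdp LD.lg L.ehat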
Table 2.6.4
Short Run Determinants of the Trade Balance of Ethiopia (Output from EViews);Dependent Variable:
DLTB
Method: Least Squares; Included observations: 39 after adjustments
Variables Coefficients Standard Error t-Statistics P-value
D(LTB(-1)) 0.238526 0.162666 1.466352 0.1523
D(LEXR(-1)) -0.079162 0.274364 -0.288531 0.7748
D(LMS2(-1)) -0.540010 0.580852 -0.929685 0.3595
D(LGDP(-1)) -0.160748 0.351836 -0.456882 0.6508
D(LG(-1)) 0.401352 0.277733 1.445103 0.1582
ECT(-1) -0.888009 0.189013 -4.698145 0.0000
The coefficient of the error-correction term, about −0.89, suggests that about 89% of the discrepancy
between the long-run and the short-run trade balance is corrected within a year (yearly data), suggesting a high
rate of adjustment to equilibrium.
In summary, after testing the residuals from the long run model:
4. If the null hypothesis of a unit root in the residuals is not rejected, the regression is spurious. We do not have a
long run model, and we should estimate only the short run model in first differences, disregarding the long run model.
(NB: one can also add the lagged values of the dependent and explanatory variables.)
5. If the null hypothesis is rejected, the regression is meaningful (non-spurious). We have both the long run model
and the short run dynamics, and we must include the lagged value of the residuals from the long run model in the
short run model as one of the explanatory variables.
Thus far, this chapter has focused mainly on identifying problems related to time series data, and to some extent on
solving these problems. What is still lacking are some concrete examples of time series models that can be run for the
bachelor thesis next year. Therefore, this section discusses two models that are commonly used in time series
analysis:
Autoregressive distributed lag (ARDL) models contain the lagged values of the dependent variable
and the current and lagged values of the regressors. The model is used for forecasting and for policy analysis.
For example, an ARDL model can show the effects of changes in government expenditure or taxation
on unemployment and inflation (fiscal policy). Usually the effect of a change in a policy variable (e.g. public
spending) does not immediately and fully pass through to other economic variables (e.g. GDP); the effect works out
over a longer time period (various time lags) for psychological, technical and institutional reasons.
Mathematically, an ARDL(p, q) model can be expressed as:
y_t = α + θ_1 y_{t-1} + ... + θ_p y_{t-p} + δ_0 x_t + δ_1 x_{t-1} + ... + δ_q x_{t-q} + v_t
That means y depends on its own lagged values and on the current and lagged values of the explanatory variable x. If
the variables in the model are stationary, and the usual least squares assumptions are valid, we can estimate
ARDL models using least squares. It is also possible to use non-stationary data for this model, but only if
you add an error correction mechanism (as we discussed in the section above). To simplify the
introduction of the ARDL model, we will work with stationary data only in this section 3. A challenge is to choose the
optimal lag lengths p and q: too many lags could increase the error in the forecasts, too few
could leave out relevant information.
3
This is done in most econometric textbooks, for example in Hill, Griffiths & Lim (2011)
Including too many lags may lead to insignificant parameter estimates. It is wise to exclude the longest
(final) lag if it is insignificant, then re-run the model and perform a hypothesis test on the new
final lag. A drawback of this approach is that it can produce a model that is too large.
An alternative way to determine the lag lengths is to minimize the BIC (Bayes' information criterion,
also called the Schwarz information criterion) or the AIC (Akaike's information criterion). These criteria weigh
the benefit of a larger model (more lagged values, so more explanatory power) against its
disadvantage (fewer degrees of freedom). When running the model, EViews asks
which information criterion you want to use; the Schwarz information criterion is a sensible default.
The output will reveal the optimal lag length, both for the dependent (p) and the independent (q)
variable.
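With stationary data, an ARDL(p, q) model can also be estimated directly by least squares in Stata. A sketch for p = q = 2 (the variable names y and x and the lag orders are assumptions; estat ic reports the AIC and BIC for comparing lag lengths):
* ARDL(2,2): y on its own lags and on current plus lagged x
regress y L(1/2).y L(0/2).x
estat ic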
Figure 2.7.1   Correlogram of the residuals of the estimated function
Table 2.7.3    Output for the estimated function
In the models discussed above, we have clearly identified one dependent and several independent variables.
According to Sims (1980), this is often inappropriate, because economic variables mostly do not influence each
other in only one direction. For example, inflation affects unemployment, but unemployment also affects
inflation. Therefore, Sims developed the Vector Autoregression (VAR) model. The VAR model is a system
of equations, which means that several functions, and thus several dependent variables, are estimated
simultaneously. We will discuss simultaneous equation modeling more in-depth for cross-sectional data in
chapter 3. This section gives a relatively simple example of a bivariate VAR model, because it is a popular
model that is also frequently used by undergraduate thesis students.
In its bivariate form, the VAR can be written as
y_t = α_0 + Σ(i=1..p) α_i y_{t-i} + Σ(i=1..p) β_i x_{t-i} + u_t      (31)
x_t = γ_0 + Σ(i=1..p) γ_i x_{t-i} + Σ(i=1..p) δ_i y_{t-i} + v_t      (32)
We assume that both y and x are stationary, and that u_t and v_t are uncorrelated white noise error terms. Because both
equations are linear in the parameters, OLS is used for the estimation, equation by equation. For example, it is assumed
that the prices of different food commodities are interdependent. To this end, the FAO deflated food price index data
are used to specify the VAR model above with
y_t = DPI_t = the deflated dairy price index in period t (base year: 2002-2004), and
x_t = SPI_t = the deflated sugar price index in period t (base year: 2002-2004).
Table 2.7.5   The output of the VAR model
Vector Autoregression Estimates
Date: 03/26/20   Time: 14:55
Sample (adjusted): 2000M03 2020M02
Included observations: 240 after adjustments
Standard errors in ( ) & t-statistics in [ ]; highlighted = significantly different from 0 with α = 0.05
                 DPI            SPI
C                4.776801       7.101472
                 (1.84161)      (3.70547)
                 [ 2.59381]     [ 1.91648]
(the coefficient rows for the lagged DPI and SPI terms are not reproduced here)
Log likelihood                  -1722.424
Akaike information criterion     14.43687
Schwarz criterion                14.58189
The number of lags (2 for each variable) is determined by minimizing the AIC and BIC. The
VAR output from EViews is displayed in Table 2.7.5. Based on the R² and the F-statistic of the model, we can
judge that the model performs well. Unfortunately, no p-values are given for the t-tests of the individual
parameters, but given the t-statistics (the values in the square brackets [ ]) and given that the sample is large
(there are 240 observations), we know that the critical t-value for a 95% confidence level is about 1.96. So, all
parameter estimates with a t-value > 1.96 (or < -1.96) are considered significant. For the DPI equation these are
the first and second lag of DPI and the second lag of SPI. For the SPI equation the DPI lags are insignificant, but
the SPI lags are significantly different from 0.
Table 2.7.6   Output of the Granger causality test
Though estimation of the VAR model is relatively simple, it is problematic to interpret the VAR parameter
estimates, because it seems that "everything causes everything". It is not possible to measure the effect of
DPI on SPI holding all other factors constant. Luckily, it is possible to test for the direction of causality.
Causality here refers to the ability of one variable to predict the other variable. If there are only two variables, x and
y, there are four possibilities:
- y causes x
- x causes y
- x causes y and y causes x (bi-directional feedback)
- x and y are independent
Granger (1969) developed a test that defines causality as follows: a variable x Granger-causes y if y can be
predicted with greater accuracy by using past values of x than by not using such past values.
To test whether x Granger-causes y, the Granger causality test uses an F-test on the β parameters of equation (31).
Similarly, to see whether y Granger-causes x, an F-test must be conducted on the δ parameters of equation (32).
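In Stata the VAR estimation and the Granger causality tests can be sketched as follows (dpi and spi are the variable names used above; two lags, as in the illustration):
* bivariate VAR(2), Granger causality tests and impulse responses
var dpi spi, lags(1/2)
vargranger                               // Wald tests of Granger causality per equation
irf create food, set(foodirf, replace) step(10)
irf graph irf                            // impulse response graphs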
See Table 2.7.6 for the output of the Granger causality test for the sugar and dairy prices. Recall that DPI was
estimated according to equation (31), with DPI as y and SPI as x.
The first part of Table 2.7.6, Dependent variable: DPI, tests the H0 that β1 = β2 = 0. This H0 is rejected
(Prob < 0.05), so the lagged values of SPI are able to explain the variation in DPI.
The second part of Table 2.7.6, Dependent variable: SPI, tests the H0 that δ1 = δ2 = 0; in other words, that the
lagged values of DPI cannot explain the variation in SPI. This H0 is not rejected (Prob > 0.05).
Concluding, this output reveals that SPI Granger-causes DPI, but DPI does not Granger-cause SPI.
Figure 2.7.2   Response graphs
Next to knowing the parameter estimates and knowing which variable Granger-causes the other variable(s),
it may be interesting to create impulse response graphs. In a response graph you see what a shock to one variable,
for example SPI, does through the fitted VAR model to the other variable, in this case DPI. Figure 2.7.2
shows the response graphs from EViews for the VAR presented above. The right upper graph displays the
impact of a change in DPI on SPI. The value is around 0, which means that there is no big impact. This is in
line with the Granger causality test, which could not reject the H0 that δ1 = δ2 = 0. The lower left graph shows
that a change in SPI positively affects DPI for some lags. This is in line with the conclusion of the Granger
causality test, which rejected the H0 that β1 = β2 = 0.
2.7.3 Closing comments
In this chapter, much attention was paid to the characteristics of time series data. These characteristics may be
troublesome when using time series data in regression. We have also seen some solutions which allow us to
use time series data in regression. The ones we have discussed are the error correction model (ECM), the
autoregressive distributed lag (ARDL) model and the vector autoregression (VAR) model. There are many more models
that deal with time series data, which go beyond the scope of this course. It may be that you need to use
these models for your thesis next year. If this is the case, we advise you to read the reference materials; the
titles of these works can be found in the course outline.
CHAPTER THREE
INTRODUCTION TO SIMULTANEOUS
EQUATION MODELS
Recall that one of the crucial assumptions of the method of OLS is that the explanatory X variables are
either nonstochastic or, if stochastic (random), distributed independently of the stochastic disturbance
term. If this is not the case, as the examples below show, application of the method of OLS is
inappropriate.
Consider the simple demand-and-supply model
Demand function: Q_t = α0 + α1 P_t + u_{1t}      (1)
Supply function: Q_t = β0 + β1 P_t + u_{2t}      (2)
Equilibrium condition: quantity demanded = quantity supplied
Now it is easy to see that P and Q are jointly dependent (interdependent) variables. If, for example, u_{1t} in (1)
changes because of changes in other variables affecting demand (such as income, consumer confidence and
tastes), the demand curve will shift upward if u_{1t} is positive and downward if u_{1t} is negative. These shifts are
shown in Figure 3.1.
As the figure shows, a shift in the demand curve changes both P and Q. Similarly, a change in u_{2t} (because
of weather conditions, import or export restrictions, etc.) will shift the supply curve, again affecting both P
and Q. Because of this simultaneous dependence between Q and P, u_{1t} and P_t in (1) and u_{2t} and P_t in (2)
cannot be independent. Therefore, a regression of Q on P as in (1) would violate an important assumption of
the classical linear regression model, namely, the assumption of no correlation between the explanatory
variable(s) and the disturbance term.
From the consumption function, it is clear that C and Y are interdependent and that Y_t in (5) is not expected to
be independent of the disturbance term, because when u_t shifts (for example, because people consume more
around Addis Amed), the consumption function also shifts, which, in turn, affects Y_t. Therefore, once
again, the classical least-squares method is inapplicable to (5).
In general, when a relationship is part of a system, some explanatory variables are stochastic and
correlated with the disturbances. Thus the basic assumption of the linear regression model that the explanatory
variables and the disturbance are uncorrelated, or that the explanatory variables are fixed, is violated.
Endogenous variables are variables which are influenced by one or more variables in the model. They are
explained by the functioning of the system. Their values are determined by the simultaneous interaction of
the relations in the model. We call themendogenous variables, interdependent variables or jointly
determined variables.
In such models, the jointly determined variables, such as P and Q (or C and Y) above, are endogenous, while
variables such as consumers' income or investment, which are determined outside the system, are exogenous or
predetermined variables.
The problem explained above, namely that there is correlation between the explanatory variable(s) and the error
term, is at the core of simultaneous equation modeling. For example, in a wage-price model, does the wage
affect prices or do prices affect the wage? If OLS is used to estimate the equations individually, the problem
of simultaneity bias occurs. This means that the least-squares estimators are biased (in small/finite
samples) and inconsistent: even as the sample size increases indefinitely, the
estimators do not converge to their true (population) values. If you want to see the proof of this, it is
recommended to read the appendix on pages 138-139 in Gujarati (Econometrics by Example, 2011).
In estimating simultaneous equation models, it is important to check whether the model is identified. An equation
in a simultaneous equation model can be overidentified, underidentified or exactly identified. If an equation is
exactly identified, unique numerical values of its parameters can be obtained. If it is underidentified, its
parameters cannot be obtained. If it is overidentified, more than one numerical
value can be obtained for some of its parameters.
Order condition of identification
There are several ways to check if a system of equations is identified. The most common one is the order
condition.
To understand the order condition, we shall make use of the followingnotations:
M = number of endogenous variables in the model
m = number of endogenous variables in a given equation
K = number of predetermined variables in the model
k = number of predetermined variables in a given equation
Let‟s consider some examples for illustration of the order condition of identification.
Example 1: Consider our previous example of the demand function and the supply function:
Demand function: Q_t = α0 + α1 P_t + u_{1t}
Supply function: Q_t = β0 + β1 P_t + u_{2t}
This model has two endogenous variables, P and Q, and no predetermined variables. To be identified, each of
these equations must exclude at least M − 1 = 1 variable. Since this is not the case, neither equation is
identified.
Example 2: Suppose the demand function is extended to include consumers' income, X_t:
Demand function: Q_t = α0 + α1 P_t + α2 X_t + u_{1t}
Supply function: Q_t = β0 + β1 P_t + u_{2t}
In this model, Q and P are endogenous and X (consumers' income) is exogenous. Applying the order
condition, we see that the demand function is not identified. On the other hand, the supply function is
just identified because it excludes exactly M − 1 = 1 variable, X_t.
Notice at this juncture that as the previous examples show, identification of an equation in a model of
simultaneous equations is possible if that equation excludes one or more variables that are present
elsewhere in the system. This situation is known as the exclusion (of variables) criterion, or zero
restrictions criterion(the coefficients of variables not appearing in an equation are assumed to have zero
values). This criterion is by far the most commonly used method of securing or determining identification of
an equation. In using this method, the researcher should always consider economic theory and judge on
whether it is correct that the variable(s) are excluded from or included in the equation.
The order condition is necessary for identification, but unfortunately there are some special cases in which
the order condition is insufficient. A more complex, but both necessary and sufficient method is the rank
condition, which will be shortly discussed below.
As this is quite theoretical, and your matrix algebra insight might need refreshment, please check the
example below.
Following the above procedure, let‟s find out whether equation (13) is identified.
Step 1: Table 3.1 displays the system of equations in tabular form.
Step 2:The coefficients for row (13) have been crossed out, because (13) is the equation under
consideration.
Step 3: The coefficients for columns 1, Y1, Y2, Y3 and X1 have been crossed out, because they appear in
(13).
Table 3.1 The tabular form of the systems of equations (Step 1), equation (13) erased (Step 2) and the
coefficients of the variables included in (13) erased (Step 3)
Coefficients of the variables
Equation No. 1
4Theterm rank refers to the rank of a matrix and is given by the largest-order square matrix (contained in the
given matrix) whose determinant is nonzero. Alternatively, the rank of a matrix is the largest number of
linearly independent rows or columns of that matrix.
13 1 0 0 0
14 0 1 0 0
15 0 1 0 0
16 0 1 0 0
Step 4: Matrix A in (17) is created from the remaining coefficients in Table 3.1. For this equation to be
identified, we must obtain at least one nonzero determinant of order 3 × 3 from the coefficients of the
variables excluded from this equation but included in the other equations.
Since the determinant is zero, the rank of matrix A in (17) is less than 3 (i.e., less than M − 1). Therefore, Eq. (13) does
not satisfy the rank condition and hence is not identified.
Therefore, although the order condition shows that Eq. (13) is identified, the rank condition shows that it is
not. Apparently, the columns or rows of the matrix A given in (17) are not (linearly) independent, meaning
that there is some linear relationship among the variables concerned. As a result, we may not have enough
information to estimate the parameters of equation (13).
Our discussion of the order and rank conditions of identification leads to the following general principles of
identifiability of a structural equation in a system of M simultaneous equations:
1. If K − k > m − 1 and the rank of the matrix A is M − 1, the equation is overidentified.
2. If K − k = m − 1 and the rank of the matrix A is M − 1, the equation is exactly identified.
3. If K − k ≥ m − 1 and the rank of the matrix A is less than M − 1, the equation is not identified.
4. If K − k < m − 1, the structural equation is not identified. The rank of the A matrix in this case is bound to be less than
M − 1.
Which condition should one use in practice: Order or rank? For large simultaneous-equation models,
applying the rank condition is a formidable task. Therefore, as Harvey notes: “Fortunately, the order
condition is usually sufficient to ensure identifiability, and although it is important to be aware of the rank
condition, a failure to verify it will rarely result in disaster”.
If an equation is identified, we can estimate it by using Indirect Least Squares (ILS) or Two-Stage Least
Squares (2SLS).
Indirect Least Squares (ILS)
ILS can be used for an exactly identified structural equation. This section explains the steps involved in ILS.
Step 1: Specify the structural model. Consider, for example, a demand-supply model in which demand depends on
price and consumers' income, and supply depends on price and the one-period lagged price:
Demand function: Q_t = α0 + α1 P_t + α2 X_t + u_{1t}
Supply function: Q_t = β0 + β1 P_t + β2 P_{t-1} + u_{2t}
Where, Q, P, X and P_{t-1} are quantity, price, consumers' income and lagged price, respectively.
Step 2: Find the reduced-form equations.
A reduced-form equation is one that expresses an endogenous variable solely in terms of the
predetermined variables and the stochastic disturbances. The reduced form expresses every endogenous
variable as a function of the exogenous variable(s).
Based on the equilibrium condition, quantity demanded equals quantity supplied:
α0 + α1 P_t + α2 X_t + u_{1t} = β0 + β1 P_t + β2 P_{t-1} + u_{2t}
Solving this equation for P_t, we obtain the following equilibrium price:
P_t = Π0 + Π1 X_t + Π2 P_{t-1} + v_t      (20)
Equation (20) is the first reduced-form equation in this model. Because the model has another endogenous
variable, Q_t, another reduced-form equation must be derived. Substituting the equilibrium price into either the
demand or the supply equation, we obtain the following equilibrium quantity:
Q_t = Π3 + Π4 X_t + Π5 P_{t-1} + w_t      (21)
Where the reduced-form coefficients (the Π's) are nonlinear combinations of the structural coefficients, and v_t and
w_t are linear combinations of the structural disturbances u_{1t} and u_{2t}.
Step 3: Estimate each of the reduced-form equations individually by OLS. This operation is permissible
since the explanatory variables in these equations are predetermined and hence uncorrelated with the
stochastic disturbances. The estimates obtained are thus consistent.
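A sketch of this step in Stata for the demand-supply example above, assuming the variables are named q, p and x (consumers' income) and the data have been tsset so that L.p gives the lagged price:
* estimate the two reduced-form equations by OLS
regress p x L.p        // reduced-form equation for price,    (20)
regress q x L.p        // reduced-form equation for quantity, (21)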
Suppose that the estimation results of the reduced form models are given as follows:
Step 4: Determine the coefficients of the structural model (i.e., α0, α1, α2, β0, β1 and β2) from the estimated
reduced-form coefficients. As noted before, if an equation is exactly identified, there is a one-to-one
correspondence between the structural and reduced-form coefficients; that is, one can derive unique
estimates of the former from the latter. Each reduced-form coefficient (Π) is a known, generally nonlinear,
combination of the structural coefficients. Equating the estimated Π's to these expressions gives a system of
equations in the structural parameters; solving this system, we obtain unique estimates of the structural
coefficients.
Applying the order condition of identification, we can see that the income equation is not identified whereas
the money supply equation is overidentified. To estimate the overidentified money supply model, one can
use the method of 2SLS. As its name indicates, the method of 2SLS involves two successive applications of
OLS.
Note that (25) is nothing but a reduced-form regression, because only the exogenous or predetermined
variables appear on the right-hand side. Equation (25) can now be expressed as
Y_t = Ŷ_t + û_t      (26)
which shows that the stochastic Y_t consists of two parts: Ŷ_t, which is a linear combination of the
nonstochastic X's, and a random component û_t. Following OLS theory, Ŷ_t and û_t are uncorrelated.
(Why?)
Stage 2: The overidentified money supply equation can now be written as:
M_t = β0 + β1(Ŷ_t + û_t) + u_t
    = β0 + β1 Ŷ_t + (u_t + β1 û_t)
    = β0 + β1 Ŷ_t + u*_t      (27)
where u*_t = u_t + β1 û_t.
Comparing equation (27) with the original money supply model, we see that they are very similar in
appearance, the only difference being that Y_t is replaced by Ŷ_t. What is the advantage of (27)? It can be
shown that although Y_t in the original money supply equation is correlated, or likely to be correlated, with the
disturbance term (hence making OLS inappropriate), Ŷ_t in (27) is uncorrelated with u*_t. Therefore, OLS
can be applied to (27), which will give consistent estimates of the parameters of the money supply function.
As this two-stage procedure indicates, the basic idea behind 2SLS is to "purify" the stochastic explanatory
variable Y_t of the influence of the stochastic disturbance u_t. This goal is accomplished by performing the
reduced-form regression of Y_t on all the predetermined variables in the system (Stage 1). These
predetermined variables are called the instrumental variables. We then obtain the fitted values Ŷ_t based on the
instrumental variables, replace Y_t in the original equation by Ŷ_t, and apply OLS
to the equation thus transformed (Stage 2). The estimators thus obtained are consistent; that is, they
converge to their true values as the sample size increases indefinitely.
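In practice the two stages need not be run by hand. For the money supply equation above, Stata's ivregress performs 2SLS directly; a sketch, assuming the variables are named ms, gdp, inv and gov:
* 2SLS: ms on gdp, with gdp instrumented by the predetermined variables inv and gov
ivregress 2sls ms (gdp = inv gov), first    // 'first' also reports the first-stage regression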
As we showed earlier, the simultaneity problem arises because some of the regressors are endogenous and
are, therefore, likely to be correlated with the disturbance term. A test of simultaneity is therefore
essentially a test of whether an (endogenous) regressor is correlated with the error term. If a simultaneity
problem exists, alternatives to OLS must be found; if it does not, we can use OLS. To find out which is the case
in a concrete situation, we can use Hausman's specification test.
For illustration, consider the following national income and money supply model:
National Income Function:  Y_t = α0 + α1 M_t + α2 I_t + α3 G_t + u_{1t}
Money Supply Function:     M_t = β0 + β1 Y_t + u_{2t}
Where, Y = Real Gross Domestic Product
M = Money Supply
I = Investment spending
G = Government expenditure
The reduced form of the national income function expresses Y in terms of the predetermined variables:
Y_t = Π0 + Π1 I_t + Π2 G_t + v_t      (30)
Note that, applying the order condition, the national income function is underidentified while the money
supply function is overidentified (and can be estimated by 2SLS).
Numerical Example:
Suppose that data are available on GDP, money supply, private investment and government spending. We are
interested in estimating the money supply model by 2SLS (since that equation is overidentified). However, we
first need to make sure that there is indeed a simultaneity problem that makes OLS inappropriate; otherwise it
makes no sense to use the 2SLS method. To this end, we apply Hausman's specification error
test and obtain the following results.
First, we estimate the reduced-form regression given in equation (30). From this regression we obtain the
estimated GDP, Ŷ_t, and the residuals v̂_t.
Second, we regress the money supply on the estimated GDP and v̂_t (the numerical estimates are not reproduced here).
Since the t value of the coefficient of v̂_t is statistically significant at the 5% significance level (as indicated by the
**), we conclude that there is simultaneity between money supply and GDP. 2SLS is the appropriate
procedure to be used.
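A sketch of this regression-based version of the Hausman test in Stata, using the same assumed variable names (ms, gdp, inv, gov):
* Stage 1: reduced-form regression of GDP on the predetermined variables
regress gdp inv gov
predict gdphat              // fitted GDP
predict vhat, residuals     // reduced-form residuals
* Stage 2: a significant coefficient on vhat signals simultaneity
regress ms gdphat vhat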
Worksheet
Exam training true/false
If two variables in a simultaneous-equation model are mutually dependent, they are endogenous
variables.
In SEM, a predetermined variable is the same as an exogenous variable.
An exogenous variable can be mutually dependent with an endogenous variable.
The method of OLS is not applicable to estimate a structural equation in a simultaneous-equation model.
The reduced form equation satisfies all the assumptions needed for the application of OLS.
If a model is under-identified, the ILS procedure will give more than one parameter estimate for the
same parameter.
Even though a model is not identified, it is possible that one of the model‟s equations is identified.
In case an equation is not identified, 2SLS is not applicable.
If an equation is exactly identified, ILS and 2SLS give identical results.
Simultaneity bias
Simultaneity bias means that the independent variable influences the dependent variable, but also vice versa.
If an external shock influences the dependent variable, this has effect on the independent variable, and
therefore you can observe a correlation between the error term and the explanatory variable. In this simple
model, the demand for eggs is estimated based on the price of eggs.
A. Knowing economic theory, you can expect simultaneity bias in this model. Why?
B. Observe the following scatterplot, displaying the relation between P and U. Is P an exogenous or an
endogenous variable in this model? Explain your answer.
C. Alternative to running an OLS on this simple equation, it may be better to build a system of equations.
Build a system of equations for this case.
Use the order condition of identification to see if the following systems of equations are over-, under- or
exactly identified.
Where Y= income (GDP), R=interest rate and M=money supply.
Where Y= income (GDP), R=interest rate, M=money supply and I=gross private domestic investment.
Model 3: Demand and supply for loans (source: Gujarati & Porter)
Demand:
Supply:
Where Q = total commercial bank loans ($ billion), R = average prime rate, RS = 3-month Treasury bill rate,
RD =AAAcorporate bond rate, IPI = Index of Industrial Production and TBD = total bank deposits.
Where I=inflation rate, IMP=imports as % of GDP (measure of openness), INC=GDP per capita,
LAND=log of land area in square miles.
Demand Function:  QA_t = α0 + α1 PA_t + α2 PR_t + u_{1t}
Supply Function:  QA_t = β0 + β1 PA_t + β2 PA_{t-1} + u_{2t}
Equilibrium condition: quantity of Arabica demanded = quantity of Arabica supplied
Where QAt = quantity of Arabica beans demanded and supplied (in tonnes), PAt = price of Arabica coffee beans,
PRt = price of Robusta coffee beans, PAt-1 = lagged price of Arabica coffee beans.
A. Do you expect a positive or a negative value for α1, α2, β1 and β2? Base your answer on economic
theory.
B. Why does individual estimation of the equations lead to inconsistent parameters?
C. Is the model identified, according to the order condition? Explain your answer.
D. Rewrite the system of above structural equations in the reduced form. In order to do so, follow these
steps:
I. Equalize QA demanded and QA supplied.
II. Rewrite, so that the equilibrium price, PAt, is the only dependent variable, with PAt-1 and PRtas
explanatory variables. (solve for PAt)
III. Substitute this PAt in the demand or the supply function, so that the equilibrium quantity (QAt) is
the only dependent variable, with PAt-1 and PRtas explanatory variables.
E. Suppose these are the outcomes of the OLS regression on the reduced form equations:
Reduced-form parameter                                   Parameter estimate from OLS
Intercept of the PAt equation                            8
Coefficient on PRt in the PAt equation                   0.2
Coefficient on PAt-1 in the PAt equation                 0.4
Intercept of the QAt equation                            44
Coefficient on PRt in the QAt equation                   0.6
Coefficient on PAt-1 in the QAt equation                 -0.8
Use the above parameter estimates of the reduced model, to identify the structural parameters.
I. Solve for α1 & β1 (tip: use the estimated reduced-form coefficients on PRt and PAt-1 in the two
reduced-form equations).
II. Solve for α2 & β2.
III. Solve for α0 & β0.
IV. Write the structural equations including the ILS parameter estimates5
F. For each parameter estimate, check if the sign (- or +) makes sense from an economic point of view (law
of demand, law of supply, cross-elasticity of demand).
2SLS
a. Describe the steps of the 2SLS procedure.
b. Explain how the 2SLS procedure solves the simultaneity bias problem.
c. For a system of two equations (one predicting the price (P) and one predicting the wages (W)), the
following structural parameter estimates were obtained from running OLS and 2SLS:
Wt , Pt , Mt , and Xt are percentage changes in earnings, prices, import prices, and labor productivity
(all percentage changes are over the previous year), respectively, and where Vt represents unfilled
job vacancies (percentage of total number of employees).
As can be observed, the differences between the OLS and the 2SLS outcomes are very small.
Indicate for the following statements if they are true or false:
I. Since the OLS and 2SLS results are practically identical, the 2SLS results are meaningless.
II. We will always see similar outcomes for OLS and 2SLS if the equation is exactly identified.
III. Since the OLS and 2SLS results are practically identical, the correlation between W and u and
between P and u was insignificant.
IV. There is little/no simultaneity bias in this system of equations.
5
If you did well, you’ve obtained the following structural parameter estimates:α0=60;α1=-2; α2=1; ß0=20;ß1=3;ß2=-2
Lab class
Open the datafile wages.dta in Stata.
The model consists of two equations, (1) and (2), involving the following variables:
Nominal Wage Rate (in Birr)
Unemployment rate (in %)
Consumer price index
Import price index
time
Stochastic disturbances
A person without knowledge of SEM decides to run just two simple OLS models to estimate the parameters.
Please do so (use the command regress)
Let‟s investigate the world cotton market. The following variables are available in the datafile cotton.dta:
T=time
QC= quantity of cotton supplied and demanded in million tons
PC= Cotton, CIF Liverpool, US cents per pound
PW=Wool, coarse, 23-micron, Australian Wool Exchange spot quote, US cents per kilogram
PH= Hides, wholesale dealer's price, US, Chicago, fob Shipping Point, US cents per pound
4. INTRODUCTION TO PANEL DATA REGRESSION MODELS
4.1. Introduction
Up until now, we have covered cross-sectional and time series regression analysis using pure cross-sectional
or pure time series data.As you know, in cross-sectional data, values of one or more variables are collected
for several sample units, or entities, at the same point in time (e.g. Crime rates in Arba Minch for a given
year). In time series data we observe the values of one or more variables over a period of time (e.g. GDP for
several quarters or years). While these two cases arise often in applications, data sets that have both cross-
sectional and time series dimensions, commonly called panel data, are being used more and more often in
empirical research as such data can often shed light on important policy questions. In panel data, the same
cross-sectional unit (for example: individuals, families, firms, cities, states, etc) is surveyed over time. That
is, panel data have space as well as time dimensions.
For example, a panel data set on individual farmer‟s (household heads) farm productivity (quintals per
hectare), fertilizer applied (DAP in Kg), labour days (in hours), whether the farmer takes farm related
training is collected by randomly selecting farm households from a population at a given point in time
(2005). Then, these same people are re-interviewed at several subsequent points in time (say 2010 and
2015). This gives us data on productivity, fertilizer, labor, and training, for the same group of farm
households in different years(see Table-4.1).
There are other names for panel data, such as pooled data (pooling of time series and cross-sectional
observations), combination of time series and cross-section data, micro panel data, longitudinal data (a study
over time of a variable or group of subjects), event history analysis (e.g. studying the migration over time of
subjects through rural/urban/international areas) and cohort analysis (e.g. following the career path of 1965
graduates of AMU). Although there are subtle variations, all these are about a movement over time of cross-
sectional units.
3. By studying the repeated cross section of observations, panel data are better suited to study the
dynamics of change. Spells of unemployment, job turnover, and labor mobility are better studied
with panel data.
4. Panel data can better detect and measure effects that simply cannot be observed in pure cross-section
or pure time series data. For example, the effects of minimum wage laws on employment and
earnings can be better studied if we include successive waves of minimum wage increases in the
federal and/or state minimum wages.
5. Panel data enable us to study more complicated behavioral models. For example, phenomena such as
economies of scale and technological change can be better handled by panel data than by pure cross-
section or pure time series data.
6. By making data available for several thousand units, panel data can minimize the bias that might
result if we aggregate individuals or firms into broad aggregates.
If each cross-sectional unit has the same number of time series observations, then such a panel (data set) is
called a balanced panel or complete panel (see for instance Table 4.1). If the number of observations
differs among panel members, so that some cross-sectional elements have fewer observations, we call such a
panel an unbalanced panel. The same estimation techniques are used in both cases; however, we will be
concerned with a balanced panel in this chapter. Initially, we assume that the X's are nonstochastic and that
the error term follows the classical assumptions, namely $u_{it} \sim N(0, \sigma^2)$.
4.2. Estimation of Panel Data Regression Model: The Fixed Effects Approach
The Fixed Effects Model (FEM) accounts for the "individuality" of each cross-sectional unit by letting the
intercept, $\beta_{1i}$, vary across cross-sectional units, while assuming that the slope coefficients are constant
across cross-sectional units. Assuming that the intercept term varies for cross-sectional units, we can specify
the model as:
\[
Y_{it} = \beta_{1i} + \beta_2 X_{2it} + \beta_3 X_{3it} + u_{it}, \qquad i = 1, 2, \ldots, N; \; t = 1, 2, \ldots, T \qquad (4.1)
\]
Notice that we have put the subscript i on the intercept term to suggest that the intercepts of the
cross-sectional units (perhaps firms) may be different; the differences may be due to special features
of each firm.
o Note that if we were to write the intercept as $\beta_{1it}$, it would suggest that the intercept of each
cross-sectional unit or firm is time variant.
Note also that the slope coefficients ($\beta_2$, $\beta_3$) of equation (4.1) are the same for all cross-sectional units and for all
time periods. Only $\beta_{1i}$, the intercept, varies per cross-sectional unit (i).
Equation (4.1) is known as the Fixed Effects Model (FEM). The term "Fixed Effects" is due to the fact that
each cross-sectional unit's intercept, although different from the intercepts of other cross-sectional units,
does not change over time. The intercept for each cross-sectional unit is time-invariant.
How do we actually allow for the (fixed effect) intercept to vary between cross-sectional units? We can
easily do that with the dummy variable technique that we learned in Chapter 1, in particular the differential
intercept dummies. Remember from Chapter 1 that we have to avoid the dummy variable trap. This
means that we must include M-1 dummies. If the number of individuals is 50, we need 50-1=49 dummy
variables to represent the 50 different individuals in the datafile. This approach to the fixed effects model is
called the least-squares dummy variable (LSDV) model.
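As an illustration, a minimal LSDV sketch in Stata might look as follows. The variable names y, x and id are hypothetical placeholders for the dependent variable, a regressor and the cross-sectional identifier; the factor-variable notation i.id lets Stata create the individual dummies (dropping one base category) automatically.
* LSDV: include N-1 individual dummies next to the regressor(s)
* (y, x and id are hypothetical variable names)
regress y x i.id
* equivalently, create the dummies by hand and omit one of them to
* avoid the dummy variable trap (assuming id takes the values 1 to 50,
* as in the example above: 50 individuals, 49 included dummies)
tabulate id, generate(d)
regress y x d2-d50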
A disadvantage of the dummy variable approach is that the R2 is usually very high, because all the dummy
variables are explanatory variables and the R2 always increases if more (dummy) explanatory variables are
introduced in the model. It also requires a lot of computer memory, which is not always available. An
alternative approach to estimating the fixed effects model is the within estimation method. This method is
used by Stata. In the within estimation method the variables (Y and the X's) are all demeaned, i.e. the time-
averaged values are subtracted from each value:
\[
Y_{it} - \bar{Y}_i = (\beta_{1i} - \bar{\beta}_{1i}) + \beta_2 (X_{it} - \bar{X}_i) + (u_{it} - \bar{u}_i)
\]
Note that $\beta_{1i}$ is time invariant, so $\beta_{1i} - \bar{\beta}_{1i} = 0$. Therefore the within estimation for a model with one
explanatory variable (X) can be specified as:
\[
Y_{it} - \bar{Y}_i = \beta_2 (X_{it} - \bar{X}_i) + (u_{it} - \bar{u}_i) \qquad (4.2)
\]
Where:
$\beta_{1i}$ is the individual effect of individual i (its time-average $\bar{\beta}_{1i}$ equals $\beta_{1i}$ itself),
$\bar{Y}_i$ is the time-average Y value of individual i,
$\bar{X}_i$ is the time-average X value of individual i,
$\bar{u}_i$ is the time-average error term for individual i.
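A minimal sketch of this demeaning in Stata, under the assumption that the panel is identified by hypothetical variables id (individual) and t (time) and contains variables y and x:
* declare the panel structure
xtset id t
* compute the time-averages per individual
bysort id: egen ybar = mean(y)
bysort id: egen xbar = mean(x)
* demean the variables
generate y_dm = y - ybar
generate x_dm = x - xbar
* OLS on the demeaned data gives the within (fixed effects) slope;
* the reported standard errors differ slightly because of degrees of freedom
regress y_dm x_dm
* in practice Stata performs the demeaning internally:
xtreg y x, fe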
Example FEM
In order to estimate the effect of diet on study results, the university collected data from 50 students that eat
university food for 6 semesters. For all 50 students all information is available for the 6 semesters, so we
work with a balanced panel. The expectation is that eating proteins (present in meat and eggs) has a positive
effect on brain capacity, and thus improves the study results. The variables in the model include:
Yit  the average study results of student i in semester t
Pit  how many meals with egg and meat student i ate during semester t
Gi   a dummy variable equal to 1 if the student is female, and 0 if the student is male
Ait  the age in years of student i at the end of semester t
Si   the average state exam results of student i (before joining the university)
Hit  the average weekly study hours (class attendance) of student i during semester t
\[
Y_{it} - \bar{Y}_i = \beta_2 (P_{it} - \bar{P}_i) + \beta_3 (G_i - \bar{G}_i) + \beta_4 (A_{it} - \bar{A}_i) + \beta_5 (S_i - \bar{S}_i) + \beta_6 (H_{it} - \bar{H}_i) + (u_{it} - \bar{u}_i)
\]
Note that the time-invariant variables (in this case $G_i$ and $S_i$) will be omitted, because their mean values ($\bar{G}_i$
and $\bar{S}_i$) are the same as the values for each t; there is no change over time. Therefore, the FEM does not provide
parameter estimates for time-invariant variables.
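The commands that would produce such a fixed effects output could look roughly as follows; the dataset and the lowercase variable names student, semester, y, p, g, a, s and h are assumptions for illustration, not the actual file used.
* declare the panel: 50 students observed over 6 semesters
xtset student semester
* one-way fixed effects (within) estimation; the time-invariant
* variables g and s will be dropped automatically
xtreg y p g a s h, fe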
Indeed, the individual differential intercept coefficients are not presented in the output, but they are taken
into account in estimating the model (as this is a FEM). This output shows the following:
- The F-test that tests the joint significance of all the coefficients in the model is highly significant.
This means that the model is able to explain the variance in Yit.
- The parameter coefficients and their t-tests reveal that, when controlled for individual differences, the
study hours (Hit) have a significant effect on study results (Yit), because P>|t| < 0.05. The effects of
protein intake (Pit) and age (Ait) on study results (Yit) are insignificant, because P>|t| > 0.05. No estimates
for the effect of gender (Gi) and state exam result (Si) are given, because the fixed effects model can only
estimate the effect of time-variant variables, and Gi and Si do not change over time.
- Sigma_u (5.14) is the standard deviation of the time-invariant error term (ui), which is available for
every individual (i). In this example, it expresses the variation in the time-constant error term across the 50
students. If sigma_u is large, it means that students are very different with regard to Y when controlled
for the X's. Sigma_e (9.04) is the standard deviation of the idiosyncratic residuals (eit). Rho
(0.244) is the share of the estimated variance of the overall error accounted for by the individual effect,
u_i (see the check after this list). A high rho indicates that the fixed effects model is much better than the pooled regression, because
much of the error term can be explained by differences between individuals.
- The F-test that all ui's are 0 (F=1.68) reveals that there are significant differences between students
(Prob>F=0.0058). This supports the use of the Fixed Effects Model.
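The reported rho follows directly from sigma_u and sigma_e; a quick check of the arithmetic:
\[
\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} = \frac{5.14^2}{5.14^2 + 9.04^2} = \frac{26.42}{26.42 + 81.72} \approx 0.244
\]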
Types of FEM
The above-discussed model is known as a one-way fixed effects model, because we have allowed the
intercept to differ per individual (i.e. per cross-sectional unit). We can also allow the time periods to have
different intercepts, by adding dummies for the different semesters. If we consider fixed
effects both for individuals and time periods, the model is called a two-way fixed effects model. In that
case, the within estimation method can be used for the individual effect and time dummies must be
introduced to capture the deterministic trend (chapter 2); a sketch follows below. However, allowing for individual fixed effects and
time fixed effects results in an enormous loss of degrees of freedom. Degrees of freedom are the total
number of observations in the sample less the number of independent (linear) constraints or restrictions put
on them. In the above model the number of observations is 300 (50 students times 6 semesters). The linear constraints consist of 4
parameters (the intercept and the slope coefficients of Pit, Ait and Hit) and 49 (N-1) individual dummies. That means 300-4-49=247 degrees of freedom are left. If
differential intercepts related to time (T-1=5 semester dummies) are added, the total degrees of freedom will be 300-4-49-5=242. Losing
too many degrees of freedom usually results in poorer statistical analysis. This is a big disadvantage of the
FEM. More disadvantages are discussed below.
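Using the same hypothetical variable names as in the sketch above, a two-way fixed effects specification simply adds the T-1 semester dummies to the within estimation:
* two-way FEM sketch: individual fixed effects via the within estimator,
* time fixed effects via T-1 semester dummies (factor notation)
xtreg y p a h i.semester, fe
* joint significance test of the semester dummies
testparm i.semester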
Disadvantages of FEM
The disadvantages of the fixed effects model (FEM) are as follows:
- Per dummy variable added, a degree of freedom is lost. Therefore, if the sample is not very large,
introducing too many dummies will leave few observations to do meaningful statistical analysis.
- It is impossible to obtain parameter estimates for time-invariant variables like gender and ethnicity.
- The model results are valid under the assumption that $u_{it} \sim N(0, \sigma^2)$. As $u_{it}$ has a cross-sectional
component (i) and a time component (t), this assumption may not be realistic.
o Regarding the cross-sectional component: if there is heteroskedasticity, we can obtain
heteroscedasticity-corrected standard errors.
o Regarding the time component: if there is autocorrelation we can adjust the model and assume
autocorrelation. This is, however, beyond the scope of this course.
4.3. Estimation of Panel Data Regression Model: The Random Effects Approach
An alternative to the Fixed Effects Model (FEM) is the Random Effects Model (REM), also known as the
Error Components Model (ECM), and this model is able to deal with some of the problems listed above.
Instead of assuming that all the cross-sectional units are purposively selected, the ECM assumes that the
cross-sectional units are randomly selected from a pool of possible cross-sectional units. Let us consider the
example of panel data from 30 local restaurants with monthly data from 2016-2017. For each restaurant, the
monthly profit (Y) and the average number of customers on weekday evenings (X) were recorded. The model is:
\[
Y_{it} = \beta_{1i} + \beta_2 X_{it} + \varepsilon_{it} \qquad (4.4)
\]
The ECM assumes that the 30 restaurants included are a random draw from a much larger universe of all the local
restaurants in the study area. All these restaurants (the whole population of restaurants) have a common
mean value for the intercept ($\beta_1$), and the individual differences in the intercept value of each local restaurant
are reflected in the error term $u_i$. Therefore,
\[
\beta_{1i} = \beta_1 + u_i, \qquad i = 1, 2, \ldots, 30 \qquad (4.5)
\]
Substituting (4.5) in (4.4), we get
\[
Y_{it} = \beta_1 + \beta_2 X_{it} + u_i + \varepsilon_{it} = \beta_1 + \beta_2 X_{it} + w_{it} \qquad (4.6)
\]
Where,
\[
w_{it} = u_i + \varepsilon_{it}
\]
The error term, $w_{it}$, consists of two components: $u_i$, which is the individual-specific error (in this case, the
restaurant-specific error), and $\varepsilon_{it}$, which is the combined time series and cross-section error component (the
error that cannot be explained by the independent variables (the X's) and the individual characteristics ($u_i$)).
For the error terms, the ECM makes the following assumptions (the meaning of each is given below it):
$u_i \sim N(0, \sigma_u^2)$
    the individual error term is normally distributed with variance $\sigma_u^2$
$\varepsilon_{it} \sim N(0, \sigma_\varepsilon^2)$
    the combined time series and cross-section error component is normally distributed with variance $\sigma_\varepsilon^2$
$E(u_i \varepsilon_{it}) = 0$
    the error components $u_i$ and $\varepsilon_{it}$ are not correlated with each other
$E(u_i u_j) = 0 \; (i \neq j)$
    the individual error components are not correlated with each other
$E(\varepsilon_{it}\varepsilon_{is}) = E(\varepsilon_{it}\varepsilon_{jt}) = E(\varepsilon_{it}\varepsilon_{js}) = 0 \; (i \neq j; \; t \neq s)$
    the combined time series and cross-section error component is not correlated across time or across individuals.
Notice the difference between the FEM and this ECM. In the FEM each individual has its own fixed
intercept value. In the ECM, on the other hand, the intercept $\beta_1$ represents the mean value of all these
individual intercepts and the error component, $u_i$, represents the random deviation from this mean. We can
state that $E(w_{it}) = 0$ and $\mathrm{var}(w_{it}) = \sigma_u^2 + \sigma_\varepsilon^2$. If $\sigma_u^2 = 0$, there are no differences in the individual intercepts, so
we can pool all the data and run a pooled OLS regression. But if $\sigma_u^2 > 0$, it is useful to run a panel data model.
In the ECM, the disturbance terms $w_{it}$ for one specific individual (cross-sectional unit) are correlated over time,
because $w_{it}$ contains $u_i$, which is constant for every individual. Therefore, using OLS to
estimate the model will lead to inefficient outcomes. The most appropriate method to estimate an ECM is
Generalized Least Squares (GLS). It goes beyond the scope of this course to explain GLS
mathematically, but we can obtain the estimates from Stata or EViews without working through the
mathematics.
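The serial correlation that makes OLS inefficient can be written out explicitly. For two different periods t ≠ s of the same individual i, the ECM assumptions above imply
\[
\operatorname{corr}(w_{it}, w_{is}) = \frac{\operatorname{cov}(u_i + \varepsilon_{it},\; u_i + \varepsilon_{is})}{\operatorname{var}(w_{it})} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2}, \qquad t \neq s,
\]
which is constant over time and corresponds to the rho reported by Stata.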
Example of an ECM
Small local coffee shops in Arba Minch offer tea, coffee, soft drinks and sometimes snacks like bananas,
bombolinos or samosas. Furthermore, some of them offer the opportunity to play checkers (dam). An owner
of a small coffee shop wonders if offering checkers is good for the turnover, or bad. On the one hand,
customers may choose the shop with a checkers playing board, because they like to play while drinking
coffee. On the other hand, customers who play checkers use the seats for a longer time, not leaving space for
others who also want to drink coffee. To find an answer to this question, the shop owner collected data from
15 randomly selected small coffee shops (N=15) over a period of 10 weeks (T=10). For each shop and
each week, she collected the following data:
yit the revenue of the ith shop in week t
dit if the ith shop offered a dam (checkers) board (1) or not (0) during week t
sit the number of seats available for guests in the ith shop in week t
ci if the ith shop is located in a city center (1) or not (0) (city center: Sikela or Secha)
The ECM model that she ran is the following:
\[
y_{it} = \beta_1 + \beta_2 d_{it} + \beta_3 s_{it} + \beta_4 c_i + w_{it}, \qquad w_{it} = u_i + \varepsilon_{it}
\]
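A sketch of how this model could be estimated in Stata; the dataset and the variable names shop, week, y, d, s and c are assumptions for illustration.
* declare the panel: 15 shops observed for 10 weeks
xtset shop week
* random effects (error components) estimation via GLS
xtreg y d s c, re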
The output gives parameter estimates both for the time variant (d and s) and the time-invariant (c) variable.
Based on this output, the following conclusions can be drawn:
- The Wald Chi2 test that tests the joint significance of all the coefficients in the model is highly
significant (Chi2(2)=60.10, p=0.000). This means that the model is able to explain the variance in yit.
- The parameter coefficients and their t-tests reveal that the availability of a dam board (dit) and the number
of seats (sit) both have a significant effect on the revenue (yit), because P>|t| < 0.05. Presence in the city
center (ci) has no significant effect on the revenue, because P>|t| > 0.05. Also, the constant term, the intercept,
is not significantly different from zero.
- Sigma_u (915.78) is the standard deviation of the individual error term (ui) and sigma_e (846.23) is the
standard deviation of the combined time series and cross-section error component (εit). Rho (0.539) is the
share of the estimated variance of the overall error accounted for by the individual effect, u_i. It would not
be good to use pooled OLS regression in this case, because pooled OLS would not consider the individual
effect and would therefore not be able to explain much of the variance in y.
- To answer the question of the owner of the small coffee shop: we can conclude from this model that
offering a checkers board has a significant positive effect on the revenue of a local coffee shop. When
controlled for the other independent variables, shops that offer dam make, on average, 417 birr extra per
week.
ECM vs FEM
Thus far we have seen three possibilities to analyze panel data in a model: pooled OLS regression, the Fixed
Effects Model (FEM) and the Random Effects Model, also known as the Error Components Model (ECM). Which of the three
is best? That depends on how the error term behaves.
- Use pooled OLS regression if the individuals (cross-sectional units) are homogeneous and if the time
periods are homogeneous. We can check this as follows:
o Individual homogeneity: if, in the output of the FEM and the ECM, sigma_u (σu) is close to 0, it
means that the individuals are homogeneous (there is no individual heterogeneity). If this is the case,
pooled OLS regression can be used.
o Time homogeneity: Check if the time-variant variables are stationary (chapter 2). If they are
stationary, you can use pooled OLS regression. If this is not the case, pooled OLS regression is
not optimal and you could use the FEM model with time trend dummies (if the dependent
variable has a deterministic trend). Alternatively, you can run time series econometrics models,
which are outside the scope of this course.
If there is heterogeneity of the cross-sectional units (or over time), it is better to use the Fixed Effects
Model or the Error Components Model. How to choose between these two? The decision depends on
several factors:
- If T (the number of time periods) is large and N (the number of cross-sectional units/individuals) is small,
there is likely to be little difference between the parameter estimates of the FEM and the ECM. Because the FEM is
easier, it is then preferred.
- When N is large and T is small, the estimates obtained by the two methods can differ significantly,
because there are many different ui's. If the assumptions underlying the ECM hold, ECM estimators are
more efficient than FEM estimators. Furthermore, the ECM can also give parameter estimates for time-
invariant variables.
How do we know whether the assumptions underlying the ECM hold?
a. One assumption underlying the ECM is that the individuals (cross-sectional components) are randomly
selected from the research population. Because the ECM assumes random selection, statistical inference
can only be done if this is the case. If the selection of i's is not random, the FEM is appropriate.
b. Next to the randomness of the selected sample units, it is important to see if the individual error component,
ui, and one or more regressors (X's) are correlated. If the individual error term is not related to any of the
regressors, it is likely that the sample is randomly selected, and that β1 is indeed the common intercept
(with a random deviation ui for each i). If, however, the X's and the ui are correlated, the assumption that
the i's are randomly selected is invalid, so the ECM is inappropriate and the FEM will give less biased
parameter estimates.
A test devised by Hausman, which is also incorporated in Stata and EViews, can be used to check whether the
ECM is preferred over the FEM.
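In Stata the comparison is typically run as a three-step routine, sketched here with hypothetical variable names y and x and hypothetical panel identifiers id and t:
xtset id t
* fixed effects estimates, stored for the comparison
xtreg y x, fe
estimates store fixed
* random effects (error components) estimates
xtreg y x, re
estimates store random
* Hausman test; H0: the RE (ECM) estimator is consistent and preferred
hausman fixed random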
In the empirical example discussed below, dummy variables for the years in the sample (2009-2013) were also included, because of the deterministic trend.
The results of the Fixed Effects regression of Parlasca et al. (2018) are displayed in Table 4.2. As the results reveal, mobile
ownership especially increased diversity in food consumption from non-self-produced sources. Income
positively contributed to the diversity in food consumption, both from self-production and from other
sources. The herd size is only (weakly) significant for the diversity of consumption of food groups from
self-production.

Table 4.2: Results of the regression of Parlasca et al. (2018)

Independent variables                    Food groups from       Food groups from
                                         self-production        other sources
Mobile phone ownership & use             -0.011 (0.050)         0.238** (0.097)
Income                                   0.557* (0.261)         1.812*** (0.378)
Herd size                                0.002* (0.001)         0.002 (0.001)
Land farmed                              -0.001 (0.010)         0.012 (0.010)
Radio ownership                          0.016 (0.078)          -0.221 (0.235)
Cooking source                           -0.068 (0.130)         0.186 (0.259)
Household size                           0.028*** (0.009)       0.054** (0.024)
Education household head                 0.075** (0.026)        -0.121* (0.064)
Gender household head (1=female)         0.045 (0.089)          0.098 (0.179)
Age household head                       0.002 (0.002)          -0.003 (0.003)

From Table 4.2 we find that the Hausman test result is χ²=96.37 for the first dependent variable (about
food groups from self-production). This is significant at the 0.01 level (visible from ***), so H0
must be rejected. Furthermore, Table 4.2 reveals that the Hausman test result is χ²=149.74 for the second
dependent variable (about food groups from other sources). This is significant at the 0.01 level
(visible from ***), so H0 must be rejected.
Concluding, for both dependent variables the FEM estimates differ significantly from the REM
estimates. Therefore, the parameter estimates in the REM approach (the ECM) are biased, due to
correlation between the individual error, ui, and (at least one of) the independent variable(s). The
farmers were, apparently, not randomly selected from the total population of farmers.
Therefore, the FEM is the optimal way to analyze this panel data set. The authors, Parlasca et al.,
correctly decided to use the FEM.
Concluding note: we have seen that the ECM can provide parameter estimates for time-invariant
variables, such as gender and ethnicity. The FEM does not provide parameter estimates for these
variables, but it does control for them by giving each individual its own intercept. In this way,
the FEM controls for all possible time-invariant variables, whereas the ECM only considers
the time-invariant variables that are explicitly included in the model.
References
Baltagi, B. (2008). Econometric Analysis of Panel Data. John Wiley & Sons.
Gujarati, D. (2003). Panel Data Regression Models. In: Basic Econometrics, Fourth Edition. Singapore: McGraw-Hill, Chapter 16.
Parlasca, M. C., Mußhoff, O., & Qaim, M. (2018). How mobile phones can improve nutrition among pastoral communities: Panel data evidence from Northern Kenya (No. 123). Global Food Discussion Papers.
Wooldridge, J. M. (2013). Advanced Panel Data Methods. In: Introductory Econometrics: A Modern Approach. South-Western, Mason, OH, Chapters 14.1-14.2.
Worksheet
4.1. Introduction
1. Indicate for the following data if it is cross-sectional data, time series data or panel data:
a. Yearly GDP, unemployment and inflation figures of Ethiopia from 2000-2020.
b. Monthly income and savings figures of 3000 households from 2010-2020.
c. Average academic performance of section A, B, C and D in semester 2 of the academic year
2019-2020.
d. Yearly maize production of, and technologies used by, 100 farmers from 1995-2015.
e. Monthly unemployment and economic performance statistics for all Ethiopian provinces.
2. Describe the advantages of panel data over cross-section or time series data in Amharic (do not copy-
paste from Baltagi, but translate to Amharic to get a deep understanding).
3. A balanced panel is preferred over an unbalanced panel. Why, then, do many analyses still involve
unbalanced panel data?
4.2. Estimation of Panel Data Regression Model: The Fixed Effects Approach
Gujarati (….) collected data on the civilian unemployment rate Y (%) and manufacturing wage
index (1992 = 100) for Canada, the United Kingdom, and the United States for the period 1980–
1999. Consider the model:
1. From an economic point of view, what is the expected relationship between Y and X? Why?
2. Explain: how can this data be analyzed by using pooled regression?
The pooled regression gives the following results:
3. Interpret the outcome of the model: does the manufacturing wage index have a significant effect on
the unemployment rate?
4. What is the disadvantage of using pooled regression?
5. How does the Fixed Effect Model (FEM) solve for this disadvantage?
Because the data are panel data, a Fixed Effects Model (FEM) was also run:
6. Report the results of the F-test to indicate if the model, as a whole, is able to explain the variance in
unemployment (Y).
7. Report the results of the t-test to indicate if the manufacturing wage index is a significant predictor for
the unemployment.
8. Is the standard deviation of the residual within the countries higher than the standard deviation of the
overall error term? Explain your answer by referring to sigma_u and sigma_e.
9. Rho is 0.375. What does that mean?
***
Though the Fixed Effects Model solves the main problem of a pooled OLS regression (ignoring the
heterogeneity between the groups), it also has some disadvantages. The following questions are related
to the disadvantages of the Fixed Effects Model:
1. An economist wants to predict the effect of corona cases on the import of disinfectant soap. To this
end, he collects panel data from December 2019- June 2020 (7 months) for all 54 African countries.
For each month and each country, he collects the average active corona cases (X) and the monetary
value of imported disinfectant soap (Y). The pooled regression model that he estimates is
\[
Y_{it} = \beta_1 + \beta_2 X_{it} + u_{it}
\]
a. How many degrees of freedom are lost in this simple pooled regression model? (NB: the degrees of
freedom equal the number of observations minus the number of parameters you need to calculate
during an analysis.)
b. How many degrees of freedom are lost if he uses a one-way FEM (i.e. adding differential
intercepts for the countries)?
c. How many degrees of freedom are lost if he uses a two-way FEM? Check page 4 of the lecture
notes, chapter 4, to see what a two-way FEM is.
d. Why is losing degrees of freedom a disadvantage of the FEM?
2. Why can time-invariant variables, such as gender and ethnicity, not be estimated in a FEM? Check
page 3 of the lecture notes.
4.3. Estimation of Panel Data Regression Model: The Random Effects Approach
There are many banana farmers around Arba Minch. Each harvesting season they face a loss of crop,
possibly due to fruit flies, too much or too little water availability, etc. You are asked to design a study in
which panel data will be collected and an ECM will be run. To this end, do the following:
1. Identify possible determinants of crop loss (in the case of banana).
2. Indicate how these determinants will be measured (Which question will be asked in the
questionnaire? Which variables will be in your econometric model?).
3. Indicate how often the data will be collected over time, and why.
4. Mathematically describe the ECM, including the variables selected above.
5. Explain why the ECM is the appropriate method for analyzing the data.
***
In 2019, Thanh Phan et al. published a study titled: Does microcredit increase household food
consumption? A study of rural Vietnam7. Data were collected in 2008, 2010, 2012, 2014 and 2016. In addition to
C (food consumption, the dependent variable) and K (microcredit borrowing, the explanatory variable of
interest), several other variables were considered:
In this article, a panel data analysis of an unbalanced panel was presented, by using multiple instrumental
variables (remember: chapter 3) and by including several lagged variables (remember: chapter 2). To
prepare yourself for thesis or post-graduate work, it is very useful to read the full article. For this exercise,
however, the following simplifications are made:
7 Phan, C. T., Sun, S., Zhou, Z. Y., & Beg, R. (2019). Does microcredit increase household food consumption? A study of rural Vietnam. Journal of Asian Economics, 62, 39-51.
Variable (dependent variable: C)    Coefficients    Robust SE
logK 0.461*** 0.178
logY -0.0532* 0.029
EduH -0.0875** 0.037
HSize -0.302** 0.132
AgeH -0.0106 0.013
Nfarming -0.796*** 0.231
Shock -0.367*** 0.116
Note: ***, **, and * indicate significance at 1, 5, and 10%, respectively.
4. Summarize the outcome of this model.
5. Write a conclusion to the above-stated research question (Does microcredit increase household food
consumption?) based on the model presented above.
Fixed effects or Random effects?
Kumari and Sharma (2017)8 published an article in which they presented their panel data model with the
following (fragment of the) abstract:
Purpose
The purpose of this paper is to identify key determinants of foreign direct investment (FDI) inflows in developing
countries by using unbalanced panel data set pertaining to the years 1990-2012. This study considers 20 developing
countries from the whole of South, East and South-East Asia.
Design/methodology/approach
Using seven explanatory variables (market size, trade openness, infrastructure, inflation, interest rate, research and
development and human capital), the authors have tried to find the best fit model from the two models considered
(fixed effect model and random effect model) with the help of Hausman test.
Findings
Fixed effect estimation indicates that market size, trade openness, interest rate and human capital yield significant
coefficients in relation to FDI inflow for the panel of developing countries under study. The findings reveal that
market size is the most significant determinant of FDI inflow.
The dependent variable is FDI inflow, and the explanatory variables are market size, trade openness,
infrastructure, inflation rate, interest rate (IR), research and development (R&D) and human capital (HC).
The cross-sectional units are 18 countries in South, East and South-East Asia with different levels of
development. The authors have estimated the Fixed Effects and the Random Effects Model.
8 Kumari, R., & Sharma, A. K. (2017). Determinants of foreign direct investment in developing countries: a panel data study. International Journal of Emerging Markets.
1. Which approach, fixed effects or random effects, do you think is most suitable for this case? Base
your answer on the case description and on lecture notes p. 8-11.
2. The results of both approaches are presented in the table below (Table V). Compare the results: what
are the biggest differences?
Independent variables Fixed effects (within) regression Random effects GLS regression
Constant -7.84 (-3.58) 5.05 (5.96)
Market size 1.63 (7.39)** 0.33 (3.85)**
Trade openness 0.63 (1.17) -1.108 (-2.19)
Infrastructure -0.00012 (-2.22)** 0.000072 (1.48)**
Inflation -0.00055(-3.67)** 0.0044 (4.66)
Interest rate -0.0084 (-3.67)** -0.0096 (-3.91)**
Research and development -1.09e-06 (-1.27) 5.47e-07 (0.64)
Human capital 0.0037 (2.82)** 0.0046 (3.33)
No of countries 18 18
No of observations (unbalanced sample)    304                               304
Overall R squared                         0.54                              0.48
Overall significance                      F-statistic: 46.88 (p: 0.000)     Wald chi-square: 262.88 (p: 0.000)
Hausman p-value 0.000
Notes: FDI is measured as net inflows (BOP, current USD). Market size is measured as the GDP (current
USD). GDP indicates economic growth and standards of living. Trade openness (measured as imports
plus exports divided by GDP per capita (current USD)) describes the openness of the economy.
Infrastructure is measured by the proxy of electric power consumption (kWh per capita). Interest rate is
measured by the real interest rate (percent). Research and development is measured as the patent
applications residents registered every year. Human capital is measured as secondary school enrollment
(percent gross). 1990-2012. t-statistics are in parentheses. ** significant at the 5% level.
3. Based on the information in the above table, would you still make the same choice between the Fixed
Effects Approach and the Random Effects Approach? Refer, at least, to the Hausman test in your
answer.
Lab class
A classical key to economic growth is investment. Next to attracting foreign investment, household
savings can also be used to fund investment. In order to save, a household needs surplus income. Many
panel data studies have been done into the relation between household income and household savings. In
this lab class you will analyze (fake) income and savings data of 221 rich households from Hawassa.
Sit= Savings of household i in period t in ETB
Iit= Income of household i in period t in ETB
Fit= Family size of household i in period t
Ci= a dummy variable equal to 1 if the household owned a car or another expensive asset at the start of the study, and 0 otherwise
First consider the following saving function:
3) Does this model allow for heterogeneity among the households? Explain your answer.
4) Use the command xtset householdID time to tell Stata that you are using panel data.
D. Run the fixed effects model, by using the command xtreg savings income fam_size
Car_or_other_asset, fe
5) Why is the variable Car_or_other_asset omitted?
6) Describe the effect of income and family size on savings.
7) Is the model able to explain the variation in household saving? Refer to the F-test
8) List the values for sigma_u, sigma_e and rho, and explain what they mean.
9) Is it better to use the Fixed Effects Model or Pooled OLS regression for this dataset? Give at
least one reason for your answer.
E. Save the estimates of this fixed effects regression model, so we can use them later in the Hausman
test. Use the command est store fixed.
F. Now run the Random Effects Model, by using the command xtreg savings income fam_size
Car_or_other_asset, re
10) Describe the difference between the fixed effect model and the random effects model
11) Describe the effect of income, family size and car (or other asset) ownership on savings.
12) Is the model able to explain the variation in household saving? Refer to the Wald Chi-squared
test.
13) List the values for sigma_u, sigma_e and rho, and explain what they mean.
G. Save the estimates of this random effects model, by using the command est store random.
H. Run the Hausman test which compares the estimated parameters of the FEM and the REM. Use the
command hausman fixed random.
14) What is the null hypothesis of the Hausman test in this case?
15) Report the chi-square test statistic of the Hausman test, and its p-value.
16) Which model is better to be used? REM or FEM? Base your answer on the outcome of the
Hausman test.