Understanding Econometric Modeling With Simple
Author: Zahid Asghar
Published: April 22, 2024
Time series models are those where the data occur in sequential time periods, so that there is a notion of the progression of time. In contrast, cross-section models use data from the same time period taken from different units or places. Within time series models, the simplest type is the static model. In static models, all the action takes place within one time period, and there is no effect of activity in one time period on the next. The simplest consumption function is the Keynesian consumption function, which is a static function. It says that current consumption depends only on current income, so that $C_t = f(Y_t)$. Assuming that the function is linear, and that there can be a random error, leads to the simplest Keynesian consumption function:

$$C_t = \alpha + \beta Y_t + \epsilon_t$$
Load data
Annual data on income and consumption for Pakistan, with some approximations due to changing definitions of consumption expenditure over time (main source: WDI), are used for this lecture. Figure 1 indicates that both consumption and income are trending variables.
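A minimal sketch of how the data might be loaded and plotted, assuming a CSV file pak_consumption.csv (file name hypothetical) with columns year, C (per capita consumption) and Y (per capita income):

library(tidyverse)

# Read the annual Pakistan data (file name is an assumption)
pak <- read_csv("pak_consumption.csv")

# Plot consumption and income per capita over time (Figure 1)
pak %>%
  pivot_longer(c(C, Y), names_to = "series", values_to = "value") %>%
  ggplot(aes(x = year, y = value, colour = series)) +
  geom_line() +
  labs(x = "Year", y = "Per capita", colour = NULL)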
Figure 1: Consumption and Income per capita over time
The Keynesian consumption function is one of the most widely accepted and estimated regression models. The causal hypothesis is that income (GDP) determines consumption (Con): GDP ⇒ Con. The simplest regression model that embodies this relationship is $C = \alpha + \beta Y$. Running this regression on the data, Table 1 shows the first five observations of the regression results.
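A sketch of the corresponding estimation in R, assuming the data frame is called pak (name assumed) with columns C and Y as above:

# Static Keynesian consumption function: C_t = alpha + beta * Y_t + e_t
static_model <- lm(C ~ Y, data = pak)

# Fitted values and residuals for the first five observations (cf. Table 1)
pak %>%
  mutate(ID = row_number(),
         C_hat = fitted(static_model),
         residual = residuals(static_model)) %>%
  select(ID, C, Y, C_hat, residual) %>%
  head(5)

# Regression table (cf. Table 2)
modelsummary::modelsummary(static_model)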
Table 1: Consumption (C), income (Y), fitted consumption (C_hat) and residuals for the first five observations
# A tibble: 5 × 5
ID C Y C_hat residual
<int> <dbl> <dbl> <dbl> <dbl>
1 1 327. 404. 313. 14.7
2 2 326. 422. 327. -0.527
3 3 348. 433. 336. 12.8
4 4 380. 469. 364. 15.8
5 5 365. 459. 356. 9.18
Table 2: Static regression of C on Y

                 (1)
(Intercept)   −4.418 (15.424)
Y              0.785*** (0.020)
Num.Obs.      52
R2             0.970
R2 Adj.        0.969
AIC            509.1
BIC            515.0
Log.Lik.      −251.556
F              1589.518
RMSE           30.53
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
The regression has an $R^2$ of 0.97, which is interpreted to mean that 97% of the variation in Pakistani consumption can be explained by Pakistani GDP. The t-statistic, $t = 0.785/0.020 \approx 39$, shows that the coefficient on income is highly significant. The p-value of 0.000 means we can reject the null hypothesis that the true coefficient is zero, which corresponds to the idea that Y has no influence on C. The validity of regression results depends on a large number of assumptions, which are discussed in econometrics textbooks.
Here is the plot of the residuals from regressing per capita consumption in Pakistan on per capita GDP from 1967 to 2018.
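The residual plot can be reproduced along these lines (again assuming the pak data frame and the static_model fit sketched above):

pak %>%
  mutate(residual = residuals(static_model)) %>%
  ggplot(aes(x = year, y = residual)) +
  geom_line() +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Year", y = "Residual")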
Figure 2: Residuals from the static regression of per capita consumption on per capita GDP, 1967–2018
One of the central assumptions is that the regression residuals should be random and should come from a common distribution. Figure 2 indicates high correlation among the residuals: they do not follow a random pattern. This plot shows serious problems, since the residuals display systematic behaviour; for example, they are all negative and small early on. To see how these patterns differ from independent random variables, we provide a graph of independent random variables with mean 0 and standard error 4.579, matching the estimated regression model standard error; a sketch of how such a series can be generated is given below.
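A series of independent random draws with mean 0 and the stated standard error can be simulated as follows (a sketch; the seed is arbitrary):

set.seed(123)
random_resid <- tibble(t = 1:52,
                       e = rnorm(52, mean = 0, sd = 4.579))

# Plot the simulated independent "residuals" for comparison with Figure 2
ggplot(random_resid, aes(x = t, y = e)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Time", y = "Simulated residual")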
To verify further, we carry out some more analysis of these residuals to assess normality, heteroscedasticity and independence, given in Figure 3.

Figure 3: Further diagnostics of the static regression residuals
Random residuals frequently switch signs and do not display any patterns in their sequencing. The patterned residuals in the consumption function show that the regression is not valid. In such situations, econometricians typically assume that the problem is due to missing regressors or a wrong functional form. By adding suitable additional regressors and modifying the functional form, one can generally ensure that the residuals appear to satisfy the assumptions made about them.
A model of the type $C_t = \alpha + \beta Y_t + e_t$ is called a static model. Events in time period $t$ are isolated from those of periods $t-1$ and $t+1$. As opposed to this, if data from period $t-1$ enter into the equation for period $t$, then we have a dynamic model.
Suppose that we conduct tests and learn that $e_t$ is correlated with $e_{t-1}$. Note that $e_{t-1} = C_{t-1} - \alpha - \beta Y_{t-1}$. It follows that $C_{t-1}$ and $Y_{t-1}$ have an effect on current consumption. The main message of correlated residuals is that variables from one period ago have an effect on the current period.
Thus, in order to fix the problem of correlated residuals, we need to build a dynamic model. The simplest dynamic extension of the Keynesian model is to put in all the information we have about all the variables in the model at period $t-1$:

$$C_t = \alpha + \beta_0 Y_t + \beta_1 Y_{t-1} + \gamma C_{t-1} + e_t$$

Similarly, we could go back to period $t-2$.
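With the lagged variables created explicitly, this dynamic extension can be estimated as follows (a sketch under the same naming assumptions):

pak_dyn <- pak %>%
  mutate(C_lag1 = dplyr::lag(C, 1),
         Y_lag1 = dplyr::lag(Y, 1))

# Dynamic consumption function with one lag of C and Y
dyn_model <- lm(C ~ Y + Y_lag1 + C_lag1, data = pak_dyn)
summary(dyn_model)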
The alternative to this, which has recently been proposed and shown to have superior properties, is general-to-simple modelling. In this method, we note that the errors seem to be correlated up to the fourth order, so we start with a fourth-order dynamic model. In general, the goal is to start with the biggest model you might possibly need and simplify it down to a simpler model. This is called the top-down or general-to-simple strategy. We have illustrated this on the data set as well. The final model which emerges from our analysis is $C_t = \alpha + \beta Y_t + \gamma C_{t-1} + e_t$. However, this model suffers from both structural change and heteroskedasticity. This suggests that we should try a log transformation to get rid of the heteroskedasticity; perhaps this will eliminate or reduce the structural change problem as well.
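The data set already contains log consumption and log income (the lnc and lny columns shown later), so one possible log specification is the following sketch; the exact specification used in the lecture is not shown, so this is only illustrative:

pak_log <- pak %>%
  mutate(lnc_lag1 = dplyr::lag(lnc, 1))

# Log-linear dynamic consumption function
log_model <- lm(lnc ~ lny + lnc_lag1, data = pak_log)
summary(log_model)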
How should we proceed? Experience with consumption functions shows that they very often have dynamic properties; this is indicated by the autocorrelation of the errors as well. Thus we try the simplest dynamic extension.
Dynamic Model
We start with the general-to-specific approach and include up to three lags of both consumption and income. There is no hard and fast rule, but it is suggested that the number of regressors should not exceed the number of observations, and some studies suggest that the number of regressors should not exceed one third of the total number of observations. We have 52 observations in total, so we can select 3 or 4 lags of each variable (had it been quarterly data, we would have selected 4 lags of each).
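A sketch of the general model with up to three lags of each variable, using explicitly created lag columns (names assumed):

pak_g2s <- pak %>%
  mutate(C_lag1 = dplyr::lag(C, 1), C_lag2 = dplyr::lag(C, 2), C_lag3 = dplyr::lag(C, 3),
         Y_lag1 = dplyr::lag(Y, 1), Y_lag2 = dplyr::lag(Y, 2), Y_lag3 = dplyr::lag(Y, 3))

# General dynamic model: current Y plus three lags of C and Y
general_model <- lm(C ~ Y + Y_lag1 + Y_lag2 + Y_lag3 + C_lag1 + C_lag2 + C_lag3,
                    data = pak_g2s)
modelsummary::modelsummary(general_model)   # cf. Table 3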
Table 3: General dynamic model, C regressed on current Y and lags 1–3 of C and Y

                 (1)
(Intercept)    (8.590)
Y              0.618*** (0.169)
Num.Obs.      49
R2             0.993
R2 Adj.        0.992
AIC            417.0
BIC            434.0
Log.Lik.      −199.494
F              817.725
RMSE           14.19
Results in Table 3 indicate that Y and the first lag of C are highly significant, and the second lag of Y is significant at the 10% level. So, by the standard statistical procedure, one should drop all the insignificant variables and re-run the regression with the significant variables, as follows:
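Dropping the insignificant terms and keeping Y, the first lag of C and the second lag of Y gives the restricted model (a sketch, reusing the pak_g2s data frame from above):

# Restricted model suggested by the general-to-specific step
specific_model <- lm(C ~ Y + C_lag1 + Y_lag2, data = pak_g2s)
modelsummary::modelsummary(specific_model)   # cf. Table 4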
At this stage the data set contains the following variables:

Rows: 50
Columns: 11
$ year        <date> 1975-05-24, 1975-05-25, 1975-05-26, 1975-05-27, 1975-05-2…
$ C           <dbl> 348.4950, 380.0568, 365.3227, 356.0221, 367.2798, 396.0909…
$ Y           <dbl> 433.0744, 469.4377, 459.0634, 450.3759, 469.0675, 472.1552…
$ lnc         <dbl> 5.853624, 5.940320, 5.900781, 5.874993, 5.906124, 5.981644…
$ lny         <dbl> 6.070910, 6.151536, 6.129189, 6.110083, 6.150747, 6.157308…
$ c_y         <dbl> 0.9642087, 0.9656647, 0.9627345, 0.9615243, 0.9602288, 0.9…
$ cdividy     <dbl> 0.8047000, 0.8096000, 0.7958000, 0.7905000, 0.7830001, 0.8…
$ Trend       <int> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1…
$ resid       <dbl> 12.763683, 15.764683, 9.178845, 6.701742, 3.278557, 29.664…
$ trend_resid <dbl> 30.063569, 50.242900, 24.126271, 3.443237, 3.318460, 20.74…
$ g2s1res     <dbl> 12.222517, 12.470682, -20.045428, 4.372311, 9.566801, 22.8…
Table 4: Restricted model, C regressed on Y, lag(C,1) and lag(Y,2)

                 (1)
(Intercept)    (8.024)
Y              0.527*** (0.101)
Num.Obs.      50
R2             0.993
R2 Adj.        0.992
AIC            419.7
BIC            429.3
Log.Lik.      −204.854
F              2082.114
RMSE           14.56
Great, now all of the variables in Table 4 are highly significant, and if the residuals satisfy all the standard assumptions, the fit will be very good. So let us have a look at the behaviour of the residuals. From Figure 5, it seems the residuals are behaving randomly. Let us apply some more tests before finalizing the model.
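Some standard residual diagnostics that could be applied at this stage (a sketch using the lmtest package; the particular tests chosen here are illustrative):

library(lmtest)

bgtest(specific_model, order = 4)        # Breusch-Godfrey test for serial correlation up to order 4
bptest(specific_model)                   # Breusch-Pagan test for heteroskedasticity
shapiro.test(residuals(specific_model))  # Shapiro-Wilk test for normality of the residuals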
This seems quite a good job, but what about the theoretical interpretation? If we interpret and explain our model, it is fine to say that consumption is a function of current income, but it is difficult to justify an effect of income lagged two years on consumption when income lagged one year has no effect. The first two coefficients make sense, and we shall learn more about them later on. But the presence of the second lag of income in the model, instead of the first lag, either needs a very solid justification or means that the econometric model needs to be explored a little more.
In time series data, variables are typically very highly correlated with their own lagged values. So let us have a look at the correlation matrix and observe how highly the variables are correlated with each other. The correlation plot indicates very high correlation between the variables and their lags, which implies severe multicollinearity: the variables are almost perfect substitutes for one another. Economic rationale suggests that the first lag of income should be more relevant than the second lag in explaining current consumption. The correlation between lag(Y,1) and lag(Y,2) is 0.998; therefore, let us use the first lag of Y in the model.
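The correlations among income and its lags can be checked directly, and the model re-estimated with the first lag of income (a sketch, again reusing pak_g2s):

# Correlation of Y with its own lags
pak_g2s %>%
  select(Y, Y_lag1, Y_lag2) %>%
  cor(use = "complete.obs")

# Final model: current income, lagged consumption and lagged income
final_model <- lm(C ~ Y + C_lag1 + Y_lag1, data = pak_g2s)
modelsummary::modelsummary(final_model)   # cf. Table 5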
Table 5: Final model, C regressed on Y, lag(C,1) and lag(Y,1)

                 (1)
(Intercept)    (7.985)
Y              0.727*** (0.156)
Num.Obs.      51
R2             0.992
R2 Adj.        0.992
AIC            430.8
BIC            440.4
Log.Lik.      −210.392
F              2067.854
RMSE           14.98
Now that makes sense: current consumption is explained by current income, past-year consumption and past-year income, and all other variables are insignificant. This model passes all the diagnostic tests as well.
Summary and conclusion