Econometrics - Functional Forms
Objective: To understand the different functional forms of regression models, and the techniques
used to determine a suitable functional form for a given theoretical argument.
Theoretical argument: The relation between education and income is ambiguous, and the existing
literature does not settle it. The common ground, however, is that education expenditure
increases as income increases.
Data: I use data on income and education expenditure for the 50 states of the USA for the year
1979.
Regression Analysis: The linear functional form was estimated first, and the following results
were obtained:
Residuals:
     Min       1Q   Median       3Q      Max
-112.390  -42.146   -6.162   30.630  224.210
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   -151.27      64.12  -2.359   0.0224 *
Income         689.39      83.50   8.256 9.05e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpretation: A one-unit increase in income is associated with an increase of 689.39 units in
average education expenditure. The slope is statistically significant, as its p-value is less than
0.05.
The above graph shows how well the fitted line of the linear regression model matches the
scatterplot of income against education expenditure.
Here we relate the absolute change in education expenditure to the absolute change in income.
The points that lie far from the regression line are highlighted: Washington D.C., Alaska and
Mississippi.
On the basis of the above graph, the regression line fits the data well except for these three
observations.
The above graph, the residuals vs. fitted values plot, is a scatter plot of the residuals on the
Y-axis against the fitted values on the X-axis.
Interpretation
The points scatter randomly around the zero line without forming a specific pattern;
hence, the relation appears to be approximately linear.
Outliers are present, namely Nevada, Utah and Alaska.
The QQ plot shows the standardized residuals on the Y-axis against the theoretical quantiles on
the X-axis, i.e., the values the residuals would take if they came from a normal distribution.
The points lie roughly on a straight line, which indicates that the residuals are approximately
normally distributed. Note that this supports the normality assumption for the errors rather than
the linearity of the relationship among the variables.
Washington D.C., Nevada and Alaska are the outliers.
The scale-location plot is similar to the residuals vs. fitted plot. However, it plots the square
root of the absolute standardized residuals instead of the residuals themselves.
INTERPRETATION
The red line is not horizontal.
The spread around the red line varies with the fitted values.
Hence, on the basis of this graph, the variance of the residuals does not appear to be constant,
and the linear functional form may not be appropriate for this model.
The residuals vs. leverage plot helps to identify influential points in the data set.
In this graph, Alaska stands out because it lies beyond the dashed line of Cook’s
distance.
The plot is used to check whether outliers have a significant impact on the regression
results. In this case, only Alaska crosses the dashed lines of Cook’s distance; hence, the
values of the variables for Alaska have a significant influence on the results.
The Cook’s distance vs. leverage plot is also used to detect influential values in the
dataset.
A Cook’s distance of 1 is a widely adopted benchmark.
It exceeds 1 for Alaska; hence, Alaska can be considered an influential observation that
affects the results when the linear regression model is used.
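The influence diagnostics read off the plots above can also be computed numerically. The following is a minimal sketch in base R on simulated stand-in data (not the 1979 state data); the variable names income, educ_exp and fit are assumptions made for illustration.

```r
# Simulated stand-in data (NOT the 1979 state data): 50 points on a line,
# with the last observation moved far from the other x values and off the line.
set.seed(1)
income   <- runif(50, 0.5, 1.0)
educ_exp <- -150 + 690 * income + rnorm(50, sd = 40)
income[50]   <- 1.6   # far from the other incomes (high leverage)
educ_exp[50] <- 400   # well below the line at that income

fit <- lm(educ_exp ~ income)

# Numerical counterparts of the diagnostic plots
cd  <- cooks.distance(fit)  # influence of each observation
lev <- hatvalues(fit)       # leverage of each observation
rst <- rstandard(fit)       # standardized residuals

# Observations above the benchmark of 1 used in the text
influential <- which(cd > 1)
print(influential)

# The six standard diagnostic plots for an lm object can be drawn with:
# par(mfrow = c(2, 3)); plot(fit, which = 1:6)
```

Listing the observations with a Cook's distance above 1 gives the same information as checking which points cross the dashed lines in the plot.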
THIS VISUAL ANALYSIS SUGGESTS THAT, THOUGH THE LINEAR FUNCTIONAL FORM FITS THE
DATA WELL, SOME OBSERVATIONS ARE NOT WELL EXPLAINED BY THIS FORM.
Next, we check this formally with statistical tests.
RESULTS:
RESET test
Data: ps_lm
RESET = 6.5658, df1 = 2, df2 = 46, p-value = 0.003102
At the 5% level of significance, the null hypothesis of a correctly specified functional form is
rejected (p-value = 0.003). The linear functional form is therefore misspecified, and a non-linear
functional form may explain the data better.
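The degrees of freedom reported above (df1 = 2, df2 = 46) are what one obtains by adding the squared and cubed fitted values to the linear model and testing them jointly with an F-test. A minimal sketch of that construction in base R follows, on simulated data with a deliberately quadratic relation; the names x, y, restricted and augmented are assumptions for illustration, and the `resettest` function of the `lmtest` package is the usual packaged way to run this test.

```r
# Simulated data (NOT the state data): the true relation is quadratic,
# so a linear fit should be flagged as misspecified.
set.seed(42)
n <- 50
x <- runif(n, 1, 5)
y <- 2 + 0.5 * x^2 + rnorm(n, sd = 0.5)

restricted <- lm(y ~ x)        # linear form under test
yhat <- fitted(restricted)

# Augment the model with powers 2 and 3 of the fitted values
augmented <- lm(y ~ x + I(yhat^2) + I(yhat^3))

# Joint F-test of the two added terms
reset   <- anova(restricted, augmented)
reset_F <- reset[2, "F"]
reset_p <- reset[2, "Pr(>F)"]
df1     <- reset[2, "Df"]      # number of added terms
df2     <- reset[2, "Res.Df"]  # residual df of the augmented model

print(c(F = reset_F, df1 = df1, df2 = df2, p = reset_p))
```

A small p-value means the added powers carry explanatory power that the linear form misses.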
RAINBOW TEST
This test is based on the rationale that a misspecified model may fit well in the centre of the
data but poorly in the tails (as can be seen in the graphs above).
Under this test, the model is fitted to a sub-sample and the fit is compared with the full-sample fit.
In R, the data are by default assumed to be already ordered, and the middle 50% of the
observations form the sub-sample.
RESULTS:
Rainbow test
Data: ps_lm
Rain = 2.1611, df1 = 25, df2 = 23, p-value = 0.03368
On the basis of these results, the null hypothesis of no difference between the sub-sample fit and
the full-sample fit is rejected at the 5% level of significance.
The linear functional form therefore does not fit the tails well, which the graphical analysis
also supports.
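The comparison of sub-sample and full-sample fits can be sketched by hand. The block below is a minimal base R illustration on simulated data, assuming the data are ordered by the regressor and taking the middle 25 of 50 observations as the sub-sample; the names full, fit_mid and mid are hypothetical. With 50 observations and 2 coefficients it reproduces the degrees of freedom shown above (25 and 23). To my understanding this mirrors the statistic reported by `raintest` in the `lmtest` package, but the exact choice of sub-sample there may differ.

```r
# Simulated data (NOT the state data): the true relation is linear.
set.seed(7)
n <- 50
x <- sort(runif(n, 1, 5))          # data ordered by the regressor
y <- 1 + 2 * x + rnorm(n)

full    <- lm(y ~ x)               # fit on the entire sample
mid     <- 14:38                   # middle 25 of the 50 ordered observations
fit_mid <- lm(y ~ x, subset = mid) # fit on the central sub-sample

k  <- length(coef(full))           # number of coefficients
n1 <- length(mid)                  # sub-sample size

rss_full <- sum(resid(full)^2)
rss_mid  <- sum(resid(fit_mid)^2)

# F-type statistic: extra residual variation outside the centre,
# relative to the residual variance within the centre
rain    <- ((rss_full - rss_mid) / (n - n1)) / (rss_mid / (n1 - k))
p_value <- pf(rain, n - n1, n1 - k, lower.tail = FALSE)

print(c(Rain = rain, df1 = n - n1, df2 = n1 - k, p = p_value))
```

A large statistic indicates that the model fitted to the centre of the data describes the tails poorly.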
RESULTS:
Harvey-Collier test
Data: ps_lm
HC = 1.111, DF = 47, p-value = 0.2722
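Since the p-value (0.2722) is greater than 0.05, the null hypothesis cannot be rejected at the 5%
level of significance: the mean of the recursive residuals does not differ significantly from 0.
Unlike the RESET and Rainbow tests, this test does not indicate misspecification of the linear form.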
LOG-LOG FORM
Here the underlying argument is that a percentage change in income is associated with a
percentage change in education expenditure in the same direction: an increase in income raises
expenditure, and a decrease lowers it.
REGRESSION RESULTS:
Residuals:
     Min       1Q   Median       3Q      Max
-0.24954 -0.11179 -0.01208  0.09235  0.35579
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.25186    0.04812 129.917  < 2e-16 ***
log(Income)  1.25962    0.15405   8.177 1.19e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpretation: A 1% increase in income is associated with a 1.26% increase in average education
expenditure. The coefficient is statistically significant, as its p-value is less than 0.05.
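The reason the slope of a log-log model reads as an elasticity can be shown in one line. The symbols EducExp, Income, β1, β2 and e below are notation assumed for illustration; only the estimate 1.25962 comes from the output above.

$$\ln(EducExp) = \beta_1 + \beta_2 \ln(Income) + e, \qquad \beta_2 = \frac{d\ln(EducExp)}{d\ln(Income)} = \frac{dEducExp / EducExp}{dIncome / Income}$$

$$\%\Delta EducExp \approx \beta_2 \cdot \%\Delta Income = 1.25962 \times 1\% \approx 1.26\%$$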
Regression Diagnostics
The residuals here follow the normal distribution more closely, which suggests that this
functional form is better.
The outliers are still present, but the residuals corresponding to them lie closer to the
theoretical quantiles than before.
So, this functional form should explain the relation better.
The outliers are still present, but they now lie within the dashed lines of Cook’s distance; hence,
this functional form is somewhat better than the previous linear form.
For a more formal analysis, the Rainbow test was conducted to check whether the model fits well
only on the sub-sample or on the entire sample.
Results:
Rainbow test
Data: ps_lm
Rain = 0.82764, df1 = 25, df2 = 23, p-value = 0.6787
Hence, the null hypothesis can’t be rejected at the 5% level of significance, and there is no
significant difference between the fit on the sub-sample and the fit on the entire sample.
Comparison of coefficient of determination and standard error: The coefficients of determination
of the log-log and linear models are the same, but the standard error is lower for the log-log
model. So, the log-log model is better. (The two standard errors are measured on different scales,
levels versus logs, so this comparison is only indicative.)
The test results and the graphical analysis also indicate that the log-log functional form is
better than the linear form in this situation.
LOG-LINEAR FORM
Case 2: Here the yield equation is considered, which models the growth of wheat yield over time.
Regression Results:
Residuals:
     Min       1Q   Median       3Q      Max
-0.73450 -0.09020  0.03399  0.14879  0.29124
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.343366   0.058404  -5.879 4.39e-07 ***
time         0.017844   0.002075   8.599 3.93e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpretation:
Average wheat production grows at a rate of about 1.78% per year.
With the given functional form, the independent variable explains 60.8% of the variation
in the dependent variable.
(Note: I have mainly used adjusted R squared for interpretation because it adjusts for the
number of regressors relative to the sample size.)
The results are statistically significant.
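The growth-rate reading of the slope follows from the log-linear form. In the worked step below, the symbols Yield, β1, β2 and e are notation assumed for illustration; the estimate 0.017844 is taken from the output above.

$$\ln(Yield_t) = \beta_1 + \beta_2 \, time_t + e_t \quad\Rightarrow\quad \frac{Yield_{t+1}}{Yield_t} \approx e^{\beta_2}$$

$$\text{growth rate} = 100\,(e^{\beta_2} - 1) = 100\,(e^{0.017844} - 1) \approx 1.80\% \quad \text{per period, close to the approximation } 100\,\beta_2 = 1.78\%$$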
Interpretation:
The residuals of this regression model appear to be normally and randomly
distributed.
The existing functional form of the regression model explains the relation well.
Interpretation:
Here, none of the standardized residuals cross the dashed line for Cook’s distance;
hence, no data point unduly influences the regression results.
The given functional form therefore explains the relation among the variables well.
RAINBOW TEST
Results:
Rainbow test
Data: Yield_mod
Rain = 0.45837, df1 = 24, df2 = 22, p-value = 0.9673
Interpretation:
The null hypothesis can’t be rejected at the 5% level of significance.
There is no statistically significant difference between the fit of this functional form on
the sub-sample and on the entire sample.
LINEAR-LOG FORM
Case 3:
Here we try to understand the relation between income and food expenditure.
The theoretical argument suggests a linear-log relation between the two.
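For reference, a linear-log model and the reading of its slope can be written as follows. The symbols FoodExp, Income, β1, β2 and e are notation assumed here for illustration.

$$FoodExp = \beta_1 + \beta_2 \ln(Income) + e, \qquad \frac{d\,FoodExp}{d\,Income} = \frac{\beta_2}{Income}$$

$$\Delta FoodExp \approx \frac{\beta_2}{100} \cdot \%\Delta Income$$

The marginal effect of income therefore shrinks as income rises, and a 1% change in income shifts food expenditure by roughly β2/100 units.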
REGRESSION RESULTS
Residuals:
     Min       1Q   Median       3Q      Max
-215.427  -51.666    2.186   47.819  241.548
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -97.19      84.24  -1.154    0.256
log(income)   132.17      28.80   4.588 4.76e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
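Interpretation: A 1% increase in income is associated with an increase of about 1.32 units
(132.17/100) in food expenditure. The slope is statistically significant (p-value less than 0.05);
the intercept is not.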
Interpretation:
The quantiles of the residuals are approximately equal to their theoretical
counterparts; in other words, the residuals are roughly normally distributed.
The current functional form therefore explains the relation well.
Interpretation:
None of the data points lie beyond the dashed lines for Cook’s distance; hence, there
are no influential observations.
The given functional form accommodates all the observations.
RAINBOW TEST
Rainbow test
Data: mod2
Rain = 1.7386, df1 = 20, df2 = 18, p-value = 0.12
Interpretation: The null hypothesis can’t be rejected at the 5% level of significance; there is no
statistically significant difference between the fit of the functional form on the sub-sample and
on the full sample.
The yield equation can also be explained well by the polynomial functional form:
$$Yield_t = \beta_1 + \beta_2 \, time_t^3 + e_t$$
Regression Results:
Residuals:
     Min       1Q   Median       3Q      Max
-0.49533 -0.15511  0.02108  0.12969  0.58799
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.637778   0.064131   9.945 4.85e-13 ***
Time        0.021032   0.002279   9.230 4.88e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpretation:
A one-unit increase in t^3 is associated with an increase of about 0.021 units in average wheat
production; since the dependent variable is in levels, this is a change in units rather than a
percentage. The result is statistically significant.
The independent variable explains 64.18% of the variation in the dependent variable.
Graphical Analysis:
Interpretation:
The quantiles of the residuals are approximately equal to their theoretical
counterparts; in other words, the residuals are roughly normally distributed.
The current functional form therefore explains the relation well.
Interpretation:
None of the data points lie beyond the dashed lines for Cook’s distance; hence, there
are no influential observations.
The given functional form accommodates all the observations.
RAINBOW TEST
Results:
Rainbow test
Data: mod1
Rain = 1.1204, df1 = 24, df2 = 22, p-value = 0.3963
The null hypothesis can’t be rejected at the 5% level of significance; there is no statistically
significant difference between the fit of the functional form on the sub-sample and on the full
sample.
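Result: The polynomial functional form has a higher coefficient of determination than the
log-linear model, with a slightly higher standard error; hence, it is taken as the better
functional form for the relation between yield and time. (Because the two models have different
dependent variables, yield versus log yield, these measures are not strictly comparable.)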
Here we combine an interaction term with different functional forms to explain the impact of
education and experience on wages.
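Regression equation:
$$\ln(wage) = \beta_1 + \beta_2 \, educ + \beta_3 \, exper + \beta_4 \, exper^2 + \beta_5 \, (educ \times exper) + e$$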
Regression Results:
Call:
lm(formula = log(wage) ~ educ * exper + I(exper^2), data = cps4_small)
Residuals:
     Min       1Q   Median       3Q      Max
-2.28227 -0.32856 -0.02725  0.33751  1.47088
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.297e-01  2.267e-01   2.336  0.01969 *
educ         1.272e-01  1.472e-02   8.642  < 2e-16 ***
exper        6.298e-02  9.536e-03   6.604 6.48e-11 ***
I(exper^2)  -7.139e-04  8.804e-05  -8.109 1.49e-15 ***
educ:exper  -1.322e-03  4.949e-04  -2.672  0.00766 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
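With an interaction and a squared term, no single coefficient gives the effect of education or experience on its own; the marginal effects depend on the levels of the variables. A sketch of the two marginal effects, using the estimates from the output above, is given below. The evaluation point exper = 10 is chosen only for illustration.

$$\frac{\partial \ln(wage)}{\partial educ} = 0.1272 - 0.001322 \, exper$$

$$\frac{\partial \ln(wage)}{\partial exper} = 0.06298 - 2\,(0.0007139)\, exper - 0.001322 \, educ$$

$$\text{At } exper = 10: \quad \frac{\partial \ln(wage)}{\partial educ} = 0.1272 - 0.01322 \approx 0.114$$

That is, for a worker with 10 years of experience, one more year of education is associated with wages that are approximately 11.4% higher, and this return declines as experience increases because the interaction coefficient is negative.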
GRAPHICAL ANALYSIS:
Interpretation:
The quantiles of the residuals are approximately equal to their theoretical
counterparts; in other words, the residuals are roughly normally distributed.
The current functional form therefore explains the relation well.
Interpretation:
None of the data points lie beyond the dashed lines for Cook’s distance; hence, there
are no influential observations.
The given functional form accommodates all the observations.
RAINBOW TEST
Rainbow test
Data: mod4
Rain = 0.81354, df1 = 500, df2 = 495, p-value = 0.9892
Interpretation: The null hypothesis can’t be rejected at the 5% level of significance; there is no
statistically significant difference between the fit of the functional form on the sub-sample and
on the full sample.
Conclusion:
The theoretical logic behind the model should be properly understood before the
functional form is formulated.
Statistical and graphical analyses are equally important for detecting anomalies in the
existing functional form and for rectifying them.