Econometrics - Functional Forms

The document discusses determining the relationship between income and education expenditure in the US using linear regression models. Several linear regression models were analyzed but diagnostic tests showed the linear functional form was misspecified. A log-log model was then applied which showed a 1.25962% change in average education expenditure for a 1% change in income, a relationship that was statistically significant. Diagnostic tests of the log-log model showed an improvement over the linear models in addressing outliers and distribution of residuals.

Uploaded by

Nidhi Kaushik
Copyright
© All Rights Reserved

ASSIGNMENT

FUNCTIONAL FORMS OF LINEAR REGRESSION MODELS

Objective: The overall objective is to understand the different functional forms of regression
models, and the techniques and methodologies used to determine a suitable functional form for a
given theoretical argument.

CASE 1: Determination of relation between income and education expenditure in USA

Theoretical argument: The relation between education and income is ambiguous, and the existing
literature does not settle the relation between these two variables. The common ground, however,
is that as income increases, education expenditure also increases.

Data: I have considered data on income and education expenditure for 50 states of the USA for the
year 1979.

Regression Analysis: The linear functional form was considered first, and the following results
were obtained:

lm(formula = (Expenditure) ~ (Income), data = ps)

Residuals:
     Min       1Q   Median       3Q      Max
-112.390  -42.146   -6.162   30.630  224.210

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -151.27      64.12  -2.359   0.0224 *
Income        689.39      83.50   8.256 9.05e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 61.41 on 48 degrees of freedom
Multiple R-squared: 0.5868, Adjusted R-squared: 0.5782
F-statistic: 68.16 on 1 and 48 DF, p-value: 9.055e-11

Interpretation: A one-unit increase in average income is associated with an increase of 689.39
units in average education expenditure. The estimate is statistically significant, as its p-value
is less than 0.05.
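
As a quick consistency check on the output above, the t value of each coefficient is its estimate divided by its standard error, and in a regression with a single regressor the F-statistic equals the square of the slope's t value. A minimal Python sketch using only the figures reported above (the variable names are illustrative):

```python
# Figures copied from the linear-model summary above.
estimates = {"(Intercept)": -151.27, "Income": 689.39}
std_errors = {"(Intercept)": 64.12, "Income": 83.50}

# t value = estimate / standard error
t_values = {name: estimates[name] / std_errors[name] for name in estimates}

# With one regressor, the overall F-statistic is the squared slope t value.
f_stat = t_values["Income"] ** 2

for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
print(f"F = {f_stat:.2f}")
```

The values agree with the reported t values (-2.359 and 8.256) and the reported F-statistic (68.16) up to rounding.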
The above graph shows how well the regression line of the linear regression model fits the
scatterplot of the data on income and education expenditure.

Here, we are considering the absolute change in education expenditure corresponding to an absolute
change in income.

The points which lie far from the regression line have been highlighted. The data points for
Washington D.C., Alaska and Mississippi fall in this category.

On the basis of the above graph, it can be said that, with the exception of these three
observations, the regression line fits the data well.
• The above graph, the residuals vs fitted values plot, is a scatter plot of the residuals on
the Y-axis against the fitted values on the X-axis.
• Interpretation
  • The points bounce around the 0-line randomly and do not form a specific pattern; hence, the
relation can be said to be almost linear.
  • Outliers are present, namely Nevada, Utah and Alaska.
• The QQ plot plots the standardized residuals on the Y-axis against the theoretical quantiles on
the X-axis, i.e., the values the residuals would take if they came from a normal distribution.
• The points lie roughly on a straight line, which suggests that the residuals are approximately
normally distributed. This supports the normality assumption, but it does not by itself establish
that the relationship among the variables is linear.
• Washington DC, Nevada and Alaska are the outliers.
• The scale-location plot is similar to the residuals vs fitted plot. However, it plots (the square
root of the absolute value of) the standardized residuals instead of the residuals themselves.
• INTERPRETATION
  • The red line is not horizontal.
  • The spread around the red line varies with the fitted values.
  • Hence, on the basis of this graph, the linear functional form may not be appropriate for this
model.
The residuals vs leverage plot helps to identify influential points in the data set.
As per this graph, Alaska is the main influential point because it lies beyond the dashed line of
Cook's distance.

This plot is used to judge whether outliers have a significant impact on the regression results.
In this case, only Alaska crosses the dashed lines of Cook's distance; hence, the values of the
variables for Alaska have a significant influence on the results.
The Cook's distance vs leverage plot is also used to detect influential values in the dataset.
A Cook's distance of 1 is the widely adopted benchmark.
It is more than 1 for Alaska; hence, Alaska can be considered an influential observation which
will affect the results if the linear regression model is used.
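
For reference, Cook's distance for observation i can be computed from its standardized residual r_i, its leverage h_i and the number of estimated coefficients p as D_i = (r_i^2 / p) * h_i / (1 - h_i). The Python sketch below applies this formula together with the benchmark of 1. The residual and leverage values are hypothetical and are not taken from the data set.

```python
def cooks_distance(std_resid, leverage, n_coef):
    """Cook's distance from a standardized residual and a leverage value."""
    return (std_resid ** 2 / n_coef) * leverage / (1.0 - leverage)

BENCHMARK = 1.0  # rule-of-thumb cut-off mentioned in the text

# Hypothetical observations: (label, standardized residual, leverage)
observations = [("A", 3.0, 0.5), ("B", 2.0, 0.1)]

for label, r, h in observations:
    d = cooks_distance(r, h, n_coef=2)  # two coefficients: intercept and slope
    flag = "influential" if d > BENCHMARK else "not influential"
    print(f"{label}: D = {d:.3f} ({flag})")
```

An observation needs both a large residual and high leverage to reach a large Cook's distance, which is why the plot places residuals against leverage.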

THE VISUAL ANALYSIS SUGGESTS THAT, ALTHOUGH THE LINEAR FUNCTIONAL FORM FITS THE DATA
REASONABLY WELL, SOME OBSERVATIONS ARE NOT WELL EXPLAINED BY THIS FORM.
Next, we need to check this formally through some statistical tests.

Ramsey’s RESET (Regression Specification Error Test)

• It is a general specification test for the linear regression model.
• It tests whether non-linear functions of the fitted values (or of the regressors) help to
explain the dependent variable; if they have additional explanatory power, the current model
is misspecified.
• Null hypothesis: The current specification is correct.
• It constructs auxiliary variables and tests their joint significance using an F test.
• The auxiliary variables so constructed are powers of one of the following:
  • the fitted values ŷ
  • the original regressors
  • the first principal component of X.
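
The steps of the fitted-values variant can be sketched in Python as follows. This is an illustrative implementation on synthetic data (a curved relation generated inside the block), not the data used in this assignment; the helper names `ols` and `reset_f` are hypothetical, and only the standard library is used.

```python
import math

def ols(X, y):
    """Least squares via the normal equations; returns (rss, fitted values)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for col in range(k):  # Gaussian elimination with partial pivoting
        p = max(range(col, k), key=lambda row: abs(A[row][col]))
        A[col], A[p] = A[p], A[col]
        c[col], c[p] = c[p], c[col]
        for row in range(col + 1, k):
            f = A[row][col] / A[col][col]
            A[row] = [a - f * b for a, b in zip(A[row], A[col])]
            c[row] -= f * c[col]
    beta = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        beta[i] = (c[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return rss, fitted

def reset_f(x, y, powers=(2, 3)):
    """RESET F-statistic: add powers of the fitted values and test them jointly."""
    n = len(x)
    X0 = [[1.0, xi] for xi in x]                      # restricted (linear) model
    rss0, yhat = ols(X0, y)
    X1 = [row + [f ** p for p in powers] for row, f in zip(X0, yhat)]
    rss1, _ = ols(X1, y)                              # model with auxiliary terms
    df1, df2 = len(powers), n - len(X1[0])
    return ((rss0 - rss1) / df1) / (rss1 / df2), df1, df2

# Synthetic example: a quadratic relation that a straight line cannot capture.
n = 50
x = [i / n for i in range(1, n + 1)]
y = [xi ** 2 + 0.01 * math.sin(7 * i) for i, xi in enumerate(x, start=1)]
F, df1, df2 = reset_f(x, y)
print(f"RESET F = {F:.1f} on {df1} and {df2} df")
```

With 50 observations and powers 2 and 3 of the fitted values, the degrees of freedom are 2 and 46, the same as in the test reported for this data set.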

RESULTS:

RESET test

Data: ps_lm
RESET = 6.5658, df1 = 2, df2 = 46, p-value = 0.003102

The p-value is below 0.05, so the null hypothesis is rejected at the 5% level of significance. It
can be stated that the current functional form is misspecified and that a non-linear functional
form may be able to explain the data in a better way.

RAINBOW TEST

• This test is based on the rationale that a misspecified model may fit well in the centre of the
data but lack fit in the tails. (This can be seen in the graphs above.)
• Under this test, the model is fitted to a sub-sample and compared with the full-sample fit.
• In R, it is assumed by default that the data are already ordered, and the middle 50% of the
observations are used as the sub-sample.
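
The comparison of the central fit with the full-sample fit can be sketched in Python as below. This is a simplified illustration for a single regressor on synthetic, already ordered data; the helper names are hypothetical, and the choice of the central block is an assumption that may differ in detail from the R routine.

```python
import math

def simple_ols_rss(x, y):
    """Residual sum of squares of a straight-line fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

def rainbow_f(x, y, fraction=0.5, n_coef=2):
    """Rainbow F-statistic: full-sample fit against the fit on the central block."""
    n = len(x)
    n_sub = int(fraction * n)
    start = (n - n_sub) // 2                 # assumed: central block of the ordered data
    xs, ys = x[start:start + n_sub], y[start:start + n_sub]
    rss_full = simple_ols_rss(x, y)
    rss_sub = simple_ols_rss(xs, ys)
    df1, df2 = n - n_sub, n_sub - n_coef
    return ((rss_full - rss_sub) / df1) / (rss_sub / df2), df1, df2

# Synthetic, ordered data with curvature: the line fits the centre better than the tails.
n = 50
x = [i / n for i in range(1, n + 1)]
y = [xi ** 2 + 0.01 * math.sin(7 * i) for i, xi in enumerate(x, start=1)]
F, df1, df2 = rainbow_f(x, y)
print(f"Rain = {F:.2f}, df1 = {df1}, df2 = {df2}")
```

For 50 observations and two coefficients this gives df1 = 25 and df2 = 23, matching the degrees of freedom in the results reported for this data set.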

RESULTS:

Rainbow test

Data: ps_lm
Rain = 2.1611, df1 = 25, df2 = 23, p-value = 0.03368

On the basis of the results, the null hypothesis of no difference between the fit on the
sub-sample and the fit on the full sample can be rejected at the 5% level of significance.

It can therefore be said that the linear functional form is not a good fit for the tails, which is
also supported by the graphical analysis.

THE HARVEY COLLIER TEST

• The Harvey-Collier test is based on the calculation of recursive residuals.
• If the true relation is linear, the recursive residuals have mean 0; a mean which differs
significantly from 0 indicates that the true relation is non-linear.
• A simple t-test on the mean of the recursive residuals is used for testing.

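RESULTS:

Harvey-Collier test

Data: ps_lm
HC = 1.111, df = 47, p-value = 0.2722

Since the p-value (0.2722) is greater than 0.05, the null hypothesis cannot be rejected at the 5%
level of significance: the mean value of the recursive residuals does not differ significantly
from 0.

Hence, this test on its own does not indicate non-linearity, although the RESET and Rainbow tests
above both reject the linear functional form.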
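
The decision for each of the three tests follows from comparing its reported p-value with the 5% significance level. A minimal Python sketch using the p-values reported above (the names are illustrative):

```python
ALPHA = 0.05  # 5% level of significance

# p-values as reported in the test outputs for the linear model
p_values = {
    "RESET": 0.003102,
    "Rainbow": 0.03368,
    "Harvey-Collier": 0.2722,
}

# Reject the null hypothesis only when the p-value is below the chosen level.
decisions = {
    test: "reject H0" if p < ALPHA else "fail to reject H0"
    for test, p in p_values.items()
}

for test, decision in decisions.items():
    print(f"{test}: p = {p_values[test]} -> {decision}")
```

Two of the three tests reject the linear specification, while the Harvey-Collier test does not.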


Taken together with the graphical analysis, these results suggest that the relation between
average education expenditure and average income is not well described in terms of absolute changes.

We can try fitting a log-log model.

Here, the underlying argument is that a percentage change in income is associated with a
percentage change in education expenditure in the same direction: with an increase in income there
is an increase in expenditure, and with a decrease there is a decrease.

REGRESSION RESULTS:

lm(formula = log(Expenditure) ~ log(Income), data = ps)

Residuals:
     Min       1Q   Median       3Q      Max
-0.24954 -0.11179 -0.01208  0.09235  0.35579

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.25186    0.04812 129.917  < 2e-16 ***
log(Income)  1.25962    0.15405   8.177 1.19e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1456 on 48 degrees of freedom
Multiple R-squared: 0.5821, Adjusted R-squared: 0.5734
F-statistic: 66.86 on 1 and 48 DF, p-value: 1.193e-10

Interpretation: A 1% change in income is associated with a 1.25962% change in average education
expenditure (the slope of a log-log model is an elasticity). The estimate is statistically
significant, as the p-value is less than 0.05.
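
The elasticity reading is an approximation that is very accurate for small changes. The Python sketch below computes the exact percentage change in expenditure implied by the log-log slope for a given percentage change in income, holding everything else fixed; the function name is illustrative.

```python
ELASTICITY = 1.25962  # slope on log(Income) from the output above

def pct_change_in_expenditure(pct_change_in_income, elasticity=ELASTICITY):
    """Exact % change in expenditure implied by a log-log model."""
    ratio = 1 + pct_change_in_income / 100
    return (ratio ** elasticity - 1) * 100

small = pct_change_in_expenditure(1)    # close to the elasticity itself
large = pct_change_in_expenditure(10)   # approximation 12.6% is slightly too low

print(f"1% more income  -> {small:.3f}% more expenditure")
print(f"10% more income -> {large:.3f}% more expenditure")
```

Because the elasticity is greater than 1, education expenditure rises more than proportionately with income.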

Next, we need to check the appropriateness of this model.

Regression Diagnostics
• The residuals here appear to follow a normal distribution; hence, it can be said that this
functional form is better.
• Although the outliers are still present, the distance of the corresponding residuals from the
theoretical quantiles is comparatively smaller.
• So, this functional form should be able to explain the relation in a better way.
The outliers are still present, but they now lie within the dashed lines of Cook's distance; hence,
it can be said that this functional form is somewhat better than the previous linear form.

For a more formal analysis, the Rainbow test was conducted to check whether the model fits well
only on the sub-sample or on the entire sample.

Results:

Rainbow test

Data: ps_lm
Rain = 0.82764, df1 = 25, df2 = 23, p-value = 0.6787

Hence, the null hypothesis cannot be rejected at the 5% level of significance, and it can be said
that there is no significant difference between the model fit on the sub-sample and on the entire
sample.

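Comparison of coefficient of determination and standard error: The coefficients of determination
for the linear and log-log models are nearly the same (0.5868 and 0.5821). The residual standard
error is numerically smaller for the log-log model (0.1456 against 61.41), but the two are measured
in different units (log expenditure against expenditure), so they cannot be compared directly. The
preference for the log-log model therefore rests mainly on the diagnostics.

The test results and the graphical analysis both indicate that the log-log functional form is
better in this situation than the linear form.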

LOG-LINEAR FORM

Case 2: Here, the yield equation is considered, which models the growth of wheat yield over time.

Regression Results:

lm(formula = log(greenough) ~ time, data = wa_wheat)

Residuals:
     Min       1Q   Median       3Q      Max
-0.73450 -0.09020  0.03399  0.14879  0.29124

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.343366   0.058404  -5.879 4.39e-07 ***
time         0.017844   0.002075   8.599 3.93e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1992 on 46 degrees of freedom
Multiple R-squared: 0.6165, Adjusted R-squared: 0.6082
F-statistic: 73.94 on 1 and 46 DF, p-value: 3.932e-11

Interpretation:
• Average wheat yield grows by approximately 1.78% each year.
• With the given functional form, the independent variable is able to explain 60.8% of the
variation in the dependent variable.
(Note: I have mainly used adjusted R-squared for interpretation because it adjusts for the number
of regressors relative to the sample size.)
• The results are statistically significant.
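
Both figures in this interpretation can be reproduced from the output. The Python sketch below converts the slope of the log-linear model into a growth rate (approximate and exact) and recomputes adjusted R-squared from R-squared. The sample size of 48 is inferred from the 46 residual degrees of freedom plus two estimated coefficients; the names are illustrative.

```python
import math

B_TIME = 0.017844  # slope on time from the log-linear output above

approx_growth_pct = 100 * B_TIME                  # usual approximation: 100 * b
exact_growth_pct = 100 * (math.exp(B_TIME) - 1)   # exact implied growth rate

def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

# n_obs = 48 is inferred: 46 residual degrees of freedom + 2 coefficients.
adj = adjusted_r2(0.6165, n_obs=48, n_regressors=1)

print(f"approximate growth: {approx_growth_pct:.2f}% per year")
print(f"exact growth:       {exact_growth_pct:.2f}% per year")
print(f"adjusted R-squared: {adj:.4f}")
```

The approximate and exact growth rates differ only in the second decimal place because the slope is small.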

Checking for the appropriateness of the functional form of the model:

Interpretation:

• The residuals for this regression model appear to be normally and randomly distributed.
• The existing functional form of the regression model is able to explain the relation well.
Interpretation:

• The standardized residuals do not cross the dashed lines for Cook's distance; hence, there are
no data points which unduly influence the regression results.
• Hence, it can be said that the given functional form is able to explain the relation among the
variables well.

RAINBOW TEST

Results:

Rainbow test

Data: Yield_mod
Rain = 0.45837, df1 = 24, df2 = 22, p-value = 0.9673

Interpretation:

• The results show that the null hypothesis cannot be rejected at the 5% level of significance.
• There is no statistically significant difference between the fit of this functional form on the
sub-sample and on the entire sample.
LINEAR-LOG FORM

CASE 3
• Here, we are trying to understand the relation between income and food expenditure.
• The theoretical argument suggests that there is a linear-log relation between the two.

REGRESSION RESULTS

lm(formula = food_exp ~ log(income), data = food)

Residuals:
     Min       1Q   Median       3Q      Max
-215.427  -51.666    2.186   47.819  241.548

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -97.19      84.24  -1.154    0.256
log(income)   132.17      28.80   4.588 4.76e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 91.57 on 38 degrees of freedom
Multiple R-squared: 0.3565, Adjusted R-squared: 0.3396
F-statistic: 21.05 on 1 and 38 DF, p-value: 4.76e-05

Interpretation:

• A 1% increase in income leads to an increase of about 1.32 units (132.17/100) in average food
expenditure, because in a linear-log model the coefficient measures the change in the dependent
variable for a unit change in log(income).
• The independent variable is able to explain 33.96% of the variation in the dependent
variable for the given functional form.
• The results are statistically significant.
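
The size of the effect in a linear-log model can be checked with a short calculation. The Python sketch below compares the usual approximation (coefficient divided by 100) with the exact change implied by a given percentage increase in income; the function name is illustrative.

```python
import math

B_LOG_INCOME = 132.17  # slope on log(income) from the output above

def change_in_food_exp(pct_change_in_income, slope=B_LOG_INCOME):
    """Exact change in food expenditure implied by a linear-log model."""
    return slope * math.log(1 + pct_change_in_income / 100)

approx_1pct = B_LOG_INCOME / 100        # rule of thumb for a 1% change
exact_1pct = change_in_food_exp(1)      # slope * ln(1.01)

print(f"approximate effect of +1% income: {approx_1pct:.4f} units")
print(f"exact effect of +1% income:       {exact_1pct:.4f} units")
```

Both values are close to 1.3 units, far below the coefficient of 132.17 itself, which corresponds to a one-unit change in log(income).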
GRAPHICAL ANALYSIS

Interpretation:

• The quantiles of the residuals are approximately equal to their theoretical counterparts; in
other words, the residuals are roughly normally distributed.
• It can also be said that the current functional form is able to explain the relation well.
Interpretation:

• None of the data points lies beyond the dashed lines for Cook's distance; hence, there are no
influential observations.
• The given functional form is able to accommodate all the observations.

RAINBOW TEST

Rainbow test

Data: mod2
Rain = 1.7386, df1 = 20, df2 = 18, p-value = 0.12

Interpretation: The test results indicate that the null hypothesis cannot be rejected at the 5%
level of significance, and there is no statistically significant difference between the fit of the
functional form on the sub-sample and on the full sample.

Hence, the linear-log form is appropriate.

POLYNOMIAL FUNCTIONAL FORM

The yield equation can also be well explained by the polynomial functional form.
Yield_t = β1 + β2 time_t^3 + e_t

Regression Results:

lm(formula = greenough ~ time, data = wa_wheat)

Residuals:
     Min       1Q   Median       3Q      Max
-0.49533 -0.15511  0.02108  0.12969  0.58799

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.637778   0.064131   9.945 4.85e-13 ***
time        0.021032   0.002279   9.230 4.88e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2187 on 46 degrees of freedom
Multiple R-squared: 0.6494, Adjusted R-squared: 0.6418
F-statistic: 85.2 on 1 and 46 DF, p-value: 4.876e-12

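Interpretation:

• With a unit increase in the regressor, average wheat yield increases by about 0.021 units. The
dependent variable is in levels, so this is an absolute change rather than a percentage change.
The results are statistically significant. Note that the call shown above lists time, rather
than time^3, as the regressor.
• The independent variable is able to explain 64.18% of the variation in the dependent variable.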

Graphical Analysis:

Interpretation:
• The quantiles of the residuals are approximately equal to their theoretical counterparts; in
other words, the residuals are roughly normally distributed.
• It can also be said that the current functional form is able to explain the relation well.

Interpretation:

• None of the data points lies beyond the dashed lines for Cook's distance; hence, there are no
influential observations.
• The given functional form is able to accommodate all the observations.

RAINBOW TEST

RESULTS:

Rainbow test

Data: mod1
Rain = 1.1204, df1 = 24, df2 = 22, p-value = 0.3963

The test results indicate that the null hypothesis cannot be rejected at the 5% level of
significance, and there is no statistically significant difference between the fit of the
functional form on the sub-sample and on the full sample.

Hence, the polynomial functional form is appropriate.

Result: The polynomial functional form has a higher coefficient of determination than the
log-linear model (0.6494 against 0.6165), with a slightly higher residual standard error (0.2187
against 0.1992). Because the dependent variable is in levels in one model and in logs in the other,
these figures are only roughly comparable; on this basis, the polynomial form is taken to be the
better functional form for the relation between yield and time.

Interaction and mixed Functional Form

Here, we consider an interaction term together with different functional forms to explain the
impact of education and experience on wages.

Regression equation: log(wage) = β1 + β2 educ + β3 exper + β4 exper^2 + β5 (educ × exper) + e

Regression Results:

Call:
lm(formula = log(wage) ~ educ * exper + I(exper^2), data = cps4_small)

Residuals:
     Min       1Q   Median       3Q      Max
-2.28227 -0.32856 -0.02725  0.33751  1.47088

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.297e-01  2.267e-01   2.336  0.01969 *
educ         1.272e-01  1.472e-02   8.642  < 2e-16 ***
exper        6.298e-02  9.536e-03   6.604 6.48e-11 ***
I(exper^2)  -7.139e-04  8.804e-05  -8.109 1.49e-15 ***
educ:exper  -1.322e-03  4.949e-04  -2.672  0.00766 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5057 on 995 degrees of freedom
Multiple R-squared: 0.2445, Adjusted R-squared: 0.2415
F-statistic: 80.52 on 4 and 995 DF, p-value: < 2.2e-16
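Interpretation: Because the model includes exper^2 and the interaction educ:exper, the effect of
experience is not constant; it depends on the levels of education and experience. The expected
wage increases by about 0.68829 (in percent, since the dependent variable is log(wage)) as
experience increases by 1 year; the values of education and experience at which this effect was
evaluated are not stated.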
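
The marginal effect of experience in this model is the derivative of log(wage) with respect to exper, that is, b_exper + 2 * b_exper2 * exper + b_interaction * educ, multiplied by 100 to read it as an approximate percentage change in wage. The Python sketch below evaluates it with the coefficients from the output above. The education and experience values are hypothetical and chosen only for illustration; they are not the values used for the figure reported in the text.

```python
# Coefficients copied from the regression output above.
B_EXPER = 6.298e-02
B_EXPER_SQ = -7.139e-04
B_EDUC_EXPER = -1.322e-03

def pct_effect_of_experience(educ, exper):
    """Approximate % change in wage from one more year of experience."""
    return 100 * (B_EXPER + 2 * B_EXPER_SQ * exper + B_EDUC_EXPER * educ)

# Hypothetical workers: 16 years of education, 10 or 30 years of experience.
early = pct_effect_of_experience(educ=16, exper=10)
late = pct_effect_of_experience(educ=16, exper=30)

print(f"educ = 16, exper = 10: {early:.3f}% per extra year")
print(f"educ = 16, exper = 30: {late:.3f}% per extra year")
```

The negative coefficients on exper^2 and educ:exper mean that the return to an extra year of experience falls as experience and education rise.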

GRAPHICAL ANALYSIS:
Interpretation:

• The quantiles of the residuals are approximately equal to their theoretical counterparts; in
other words, the residuals are roughly normally distributed.
• It can also be said that the current functional form is able to explain the relation well.
Interpretation:

• None of the data points lies beyond the dashed lines for Cook's distance; hence, there are no
influential observations.
• The given functional form is able to accommodate all the observations.

RAINBOW TEST

Rainbow test
Data: mod4
Rain = 0.81354, df1 = 500, df2 = 495, p-value = 0.9892

Interpretation: The test results indicate that the null hypothesis cannot be rejected at the 5%
level of significance, and there is no statistically significant difference between the fit of the
functional form on the sub-sample and on the full sample.

Hence, the given functional form is appropriate for this model.

Conclusion:

• The theoretical logic behind the model should be properly understood before the functional
form is formulated.
• Statistical and graphical analyses are equally important for identifying anomalies in the
existing functional form and for rectifying them.
