Detecting and Resolving Model Specification Errors in STATA
Interpretation: The data are somewhat positively skewed, as the right tail is longer.
*You can also check the data normality with the following STATA command
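A minimal sketch of such a check, assuming the wage-equation variables used later in this handout (educ, exper, tenure, IQ) and the user-written jb command from SSC:

```stata
* Estimate the baseline wage model and test its residuals for normality
reg wage educ exper tenure IQ
predict res, residuals

* Built-in skewness/kurtosis normality test
sktest res

* Jarque-Bera test (user-written; install once with: ssc install jb)
jb res
```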
Interpretation: Since the JB test statistic is highly significant (its p-value < 0.05), the data are non-normal and contain outliers as well.
Resolving Data Non-Normality Issue
i) Application of log-lin or double-log model: In this approach, you run a log model, which smooths and normalizes the data to some extent.
*Run the following STATA command of log-lin model of wage to address data normality
issue
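A sketch of the log-lin specification, assuming the same regressors as the wage equations later in this handout:

```stata
* Log-lin model: log of wage regressed on the levels of the regressors
g lwage = ln(wage)
reg lwage educ exper tenure IQ
```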
*Now predict the residuals of this log-lin model and apply JB test again to check data
normality.
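For example (residual name assumed):

```stata
* Residuals of the log-lin model, tested again for normality
predict res_log, residuals
jb res_log        // user-written command; sktest res_log also works
```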
Interpretation: Compared with the previous results, the JB test statistic has fallen sharply from 738.4 to 42.89, so the data have been normalized to a great extent, though they are still not perfectly normal.
ii) Winsorization of Data: This statistical method is used to replace the outliers with the nearest
values of quartiles or percentiles.
*First install winsor2 command in STATA as follows:
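One way this step might look; the 1st/99th-percentile cut-offs are an assumed (and common) choice:

```stata
* Install the user-written winsor2 command from SSC
ssc install winsor2

* Winsorize at the 1st and 99th percentiles (cut-offs assumed here);
* suffix(_w) creates new variables such as lwage_w
winsor2 lwage educ exper tenure IQ, cuts(1 99) suffix(_w)
```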
*Now re-run the log-lin model with the winsorized variables, predict its residuals, and apply the JB test again to check data normality.
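A sketch using the _w-suffixed variable names that winsor2 produces by default:

```stata
* Re-estimate the log-lin model on the winsorized variables
reg lwage_w educ_w exper_w tenure_w IQ_w
predict res_w, residuals
jb res_w
```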
Interpretation: Bravo! The data have been completely normalized and the outliers have been removed as well. Our hypothesis testing will now be valid, since the t and F tests require the normality condition.
2) Model Specification Tests (Detecting Omitted Variable Bias)
i) Ramsey’s RESET Test: Ramsey’s RESET test is commonly used to detect model specification error by including the quadratic and cubic powers of the fitted values of the Y variable (in this case, wage).
*Run the following STATA command using our final model with winsorized variables.
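Assuming the _w-suffixed variables created by winsor2, the final model might be:

```stata
* Final model with winsorized variables
reg lwage_w educ_w exper_w tenure_w IQ_w
```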
* After running the above regression, now predict the fitted values of lwage_w, and generate
(g) the quadratic and cubic values of lwage_w as follows:
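For example (the fitted-value name lwage_w_hat and the power names yhat2 and yhat3 are assumptions):

```stata
* Fitted values of the dependent variable, then their powers
predict lwage_w_hat
g yhat2 = lwage_w_hat^2
g yhat3 = lwage_w_hat^3
```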
* Now run Ramsey’s RESET test by regressing lwage_w on the regressors plus the quadratic and cubic terms of its fitted values, with the following STATA command.
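A sketch, reusing the fitted-value powers generated above (names assumed):

```stata
* RESET regression: original regressors plus powers of the fitted values
reg lwage_w educ_w exper_w tenure_w IQ_w yhat2 yhat3
test yhat2 yhat3      // joint F-test on the added terms
```

Stata's built-in estat ovtest, run immediately after the original regression, performs the same RESET test in one step.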
Interpretation: Since the F-value of the unrestricted model is highly significant (its p-value < 0.05), we conclude that the model is misspecified.
ii) Lagrange Multiplier (LM) Test: In this test, we regress the residuals of our model on the quadratic and cubic terms of the fitted (estimated) values of Y.
*Run the following STATA command
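A sketch of the auxiliary regression, reusing the residuals and fitted-value powers created in the steps above (names assumed):

```stata
* LM auxiliary regression: residuals on the fitted-value powers
reg res_w yhat2 yhat3
```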
*The following STATA command generates the value of the LM test by multiplying the number of observations (e(N)) by the r-squared (e(r2)).
. scalar nR2=e(N)*e(r2)
*The following command generates the 5% critical value of χ2 distribution.
. scalar chi2critical=invchi2tail(e(df_m), 0.05)
*The following command generates the p-value of χ2 distribution.
. scalar p_value=chi2tail(e(df_m), nR2)
* Now list all the scalar values generated previously with the following command.
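The scalars defined above can be displayed with:

```stata
scalar list nR2 chi2critical p_value
```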
Interpretation: Since calculated value of LM test (586.65) is much greater than χ2 critical value
(5.99), we reject the null hypothesis of no specification errors.
3) Detecting the Right Functional Form: If we want to compare two competing models with the same dependent variable, we run the regressions in STATA and choose the model with the highest Adjusted R2 and the lowest AIC or BIC values. A problem arises, however, when the models have different DVs. For instance, suppose you want to compare two models in which the DV is wage in one model and lwage in the other.
wage = ß1 + ß2educ + ß3exper + ß4tenure + ß5IQ (1)
lwage = ß1 + ß2leduc + ß3lexper + ß4tenure + ß5lIQ (2)
In this case, we use the following Box-Cox transformation procedure to choose the right functional
form.
*Step 1. Find out the geometric mean of wage variable with the following STATA command:
*Step 2. Now divide wage variable by its geometric mean to create new variable ‘wagestar’
with the following STATA command
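Steps 1 and 2 might be implemented as follows; ameans reports the arithmetic, geometric, and harmonic means, and (on the assumption used here) stores the geometric mean in r(mean_g):

```stata
* Step 1: geometric mean of wage
ameans wage

* Step 2: scale wage by its geometric mean
g wagestar = wage/r(mean_g)
```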
*Step 3. Now regress both models 1 and 2 with newly created common variable ‘wagestar’
with the following STATA commands:
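A sketch, assuming the log regressors (leduc, lexper, lIQ) already exist; e(rss) stores each model's residual sum of squares for the Box-Cox statistic in the next step:

```stata
* Model 1: linear form with the scaled DV
reg wagestar educ exper tenure IQ
scalar RSS1 = e(rss)

* Model 2: log form with the scaled DV
g lwagestar = ln(wagestar)
reg lwagestar leduc lexper tenure lIQ
scalar RSS2 = e(rss)
```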
*Step 4. Now calculate the Box-Cox statistic as follows:
B-Cox stat = (n/2) ln(RSS2/RSS1)
where RSS1 and RSS2 represent the residual sums of squares of the two models.
Note: keep the higher RSS in the numerator, which is RSS2 of the log model in this case. Moreover, the B-Cox stat follows the chi-square distribution with k-1 degrees of freedom, where k is the number of coefficients.
B-Cox Stat = 0.5 * 935 * ln(152.02/150.92) = 3.395
*Now calculate the p-value of the Box-Cox statistic with k-1 = 5-1 = 4 degrees of freedom; there are four IVs plus the intercept in our model.
*Now list the calculated p-value with the following STATA command
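For example (the scalar name pBC is an assumption):

```stata
* p-value of the Box-Cox statistic with k-1 = 4 degrees of freedom
scalar pBC = chi2tail(4, 3.395)
scalar list pBC
```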
Interpretation: Since the test statistic is insignificant (its p-value is greater than 0.05), we cannot conclude that the log model is superior to the linear model.