Chapter 8. More on Specification and Data Issues

M. Ryan Sanjaya
Departemen Ilmu Ekonomi, Fakultas Ekonomika dan Bisnis
Universitas Gadjah Mada
March 2023

Outline: Misspecification, Proxy Variables, Random Slopes, Measurement Error, Missing Data, Nonrandom Samples, Outliers

Specification and data issues

How do we know that our econometric model is correctly specified?
- There is no exact way to know whether a model is correctly specified.
- Nonetheless, there are selection criteria for judging whether an econometric model is good or not.

What is a Good Model?

A good model for empirical analysis should (Hendry & Richard, 1983):
- Be data admissible; that is, predictions made from the model must be logically possible.
- Be consistent with theory; that is, it must make good economic sense.
- Have weakly exogenous regressors; that is, the explanatory variables must be uncorrelated with the error term (no omitted variable bias).
- Exhibit parameter constancy; that is, the values of the parameters should be stable.
- Exhibit data coherency; that is, the residuals estimated from the model must be purely random (technically, white noise).
- Be encompassing; that is, other models cannot be an improvement over the chosen model.

Specification Errors

Let's say the true model is
  y = β0 + β1 x1 + β2 x2 + u.
We make a specification error if we
- omit a relevant variable → underfitting the model
  y = β0 + β1 x1 + u
- include an irrelevant variable → overfitting the model
  y = β0 + β1 x1 + β2 x2 + β3 x3 + u
- estimate the wrong functional form
  ln y = β0 + β1 x1 + β2 x2 + u
- use a proxy, e.g., x2∗, that may contain measurement error
  y = β0 + β1 x1 + β2 x2∗ + u
- incorrectly specify the stochastic error term.
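
The difference between underfitting and overfitting can be illustrated with a small simulation. This is a minimal Python/NumPy sketch, not part of the original slides: the coefficient values, the 0.6 correlation between x1 and x2, and the helper name `ols_coefs` are all assumptions chosen for illustration.

```python
import numpy as np

def ols_coefs(y, *regressors):
    """OLS coefficients (intercept first) of y on the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Simulated data (assumed values): true model y = 1 + 2*x1 + 3*x2 + u
rng = np.random.default_rng(11)
n = 50_000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)   # x2 is correlated with x1
x3 = rng.normal(size=n)              # irrelevant variable
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)

b_under = ols_coefs(y, x1)           # omits x2: slope on x1 picks up 3 * 0.6
b_over = ols_coefs(y, x1, x2, x3)    # adds x3: slopes on x1, x2 stay near the truth

print("underfit slope on x1:", b_under[1])
print("overfit slopes on x1, x2, x3:", b_over[1:])
```

In this setup the underfitted slope on x1 should settle near 2 + 3 × 0.6 = 3.8 rather than 2, while adding the irrelevant x3 leaves the other slopes roughly unchanged.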

Detecting Model Misspecification

- Look at the pattern of the residuals.
  - Plot the residuals on the vertical axis against an explanatory variable on the horizontal axis.
  - No pattern = good.
- RESET (regression specification error test).
- Davidson-MacKinnon J test.

RESET

1 Obtain the fitted values ŷ and R²old of the original (but not necessarily true) model
  y = β0 + β1 x1 + β2 x2 + u.
2 Estimate the expanded model by adding ŷ² and ŷ³, and get R²new:
  y = β0 + β1 x1 + β2 x2 + δ1 ŷ² + δ2 ŷ³ + error.
3 Calculate the F statistic
  F = [(R²new − R²old) / 2] / [(1 − R²new) / (n − k − 3)]
  under the null hypothesis H0: δ1 = 0 and δ2 = 0, where k is the number of regressors in the original model.
- Under the Gauss-Markov assumptions, the distribution of the F statistic is approximately F(2, n − k − 3) in large samples.
- If H0 is rejected, then we have a functional form problem.
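
The three RESET steps can be sketched by hand as follows. This is a minimal Python/NumPy illustration, not the Stata routine: the simulated data, the helper names `ols` and `reset_f`, and the choice of a quadratic term as the misspecification are assumptions made for the example.

```python
import numpy as np

def ols(X, y):
    """Return fitted values and R-squared of an OLS fit (X includes a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    ssr = np.sum((y - fitted) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return fitted, 1.0 - ssr / sst

def reset_f(X, y):
    """RESET F statistic using yhat^2 and yhat^3; X = constant plus k regressors."""
    n, k = X.shape[0], X.shape[1] - 1
    yhat, r2_old = ols(X, y)                              # step 1
    X_new = np.column_stack([X, yhat ** 2, yhat ** 3])    # step 2
    _, r2_new = ols(X_new, y)
    return ((r2_new - r2_old) / 2) / ((1 - r2_new) / (n - k - 3))  # step 3

# Simulated data (assumed): one outcome is truly linear, the other has an x1^2 term.
rng = np.random.default_rng(0)
n = 500
x1, x2, u = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y_linear = 1 + 0.5 * x1 - 0.3 * x2 + u
y_curved = 1 + 0.5 * x1 - 0.3 * x2 + x1 ** 2 + u

f_linear = reset_f(X, y_linear)
f_curved = reset_f(X, y_curved)
print("F, correct linear model:", f_linear)
print("F, neglected quadratic: ", f_curved)
```

Each statistic would be compared with an F(2, n − k − 3) critical value (roughly 3.0 at the 5% level for a sample this large); the neglected quadratic should produce a far larger statistic than the correctly specified model.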

RESET (continued)

- An LM version is also available (the chi-square distribution will have two df).
- The test can also be made robust to heteroskedasticity using the methods discussed in the previous chapter.
- Drawback of RESET: it provides no real direction on how to proceed if the model is rejected → what is the best model then?
- RESET has no power for detecting omitted variables whenever their expectations are linear in the included explanatory variables.
- If the functional form is properly specified, RESET has no power for detecting heteroskedasticity.
- In Stata:
  * Fit the original model by OLS, then run the RESET test on it
  reg y x1 x2
  estat ovtest

Davidson-MacKinnon test

Two nonnested models:
  y = β0 + β1 x1 + β2 x2 + u          (1)
vs
  y = β0 + β1 ln x1 + β2 ln x2 + u.   (2)
1 Estimate model 2 and obtain the fitted values y̌.
2 Use y̌ as an additional regressor in model 1.
  You can also do the opposite: estimate model 1 and use its fitted values as a regressor in model 2.
3 Use a t-test: if the estimated parameter for y̌ is significant, then model 1 is rejected.
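
The procedure can be sketched in both directions as follows. This is a minimal Python/NumPy illustration under stated assumptions: the data are simulated so that model 2 (the log specification) is the true one, the t statistics use the usual homoskedastic standard errors, and the helper name `ols_t` is hypothetical.

```python
import numpy as np

def ols_t(X, y):
    """OLS coefficients and conventional (homoskedastic) t statistics."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, beta / np.sqrt(np.diag(cov))

# Simulated data (assumed): the log model (2) generates y.
rng = np.random.default_rng(1)
n = 400
x1 = rng.uniform(1, 10, size=n)
x2 = rng.uniform(1, 10, size=n)
y = 2 + 1.5 * np.log(x1) - 1.0 * np.log(x2) + rng.normal(scale=0.3, size=n)

X_lin = np.column_stack([np.ones(n), x1, x2])                   # model 1
X_log = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])   # model 2

# Steps 1-3: fitted values from model 2 added to model 1, then a t test.
b_log, _ = ols_t(X_log, y)
_, t_aug_lin = ols_t(np.column_stack([X_lin, X_log @ b_log]), y)
t_reject_lin = t_aug_lin[-1]

# The opposite direction: fitted values from model 1 added to model 2.
b_lin, _ = ols_t(X_lin, y)
_, t_aug_log = ols_t(np.column_stack([X_log, X_lin @ b_lin]), y)
t_reject_log = t_aug_log[-1]

print("t on model-2 fitted values inside model 1:", t_reject_lin)
print("t on model-1 fitted values inside model 2:", t_reject_log)
```

With data generated from model 2, the first t statistic should be large in absolute value (model 1 rejected), while the second is typically small.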

Drawbacks of the Davidson-MacKinnon test

- A clear winner need not emerge.
  - Both models could be rejected, or neither model could be rejected.
  - If neither is rejected, use the adjusted R² to choose between them.
  - If both models are rejected, more work needs to be done.
- Rejection of one model does not mean that the other model is the correct one.
- We cannot use the test to compare models with different dependent variables.

Proxy Variables

- When a relevant variable is unobserved, use a proxy variable.
- If ability is unobserved, use IQ score.
- Are IQ and ability the same? Is there measurement error?

What is a Good Proxy?

Suppose the model is
  y = β0 + β1 x1 + β2 x2 + β3 x3∗ + u
where x3∗ is unobserved and is proxied by x3 in the plug-in regression of
  y on x1, x2, x3.
The variable x3 is a good proxy for x3∗ if
- x3∗ is closely correlated with x3, that is, x3∗ = δ0 + δ3 x3 + v3,
- the error term u is uncorrelated with x1, x2, and x3∗,
- u is uncorrelated with x3,
- v3 is uncorrelated with x1, x2, and x3.

Lagged Dependent Variable as Proxy

- We can use a lagged dependent variable as a proxy in a cross-sectional regression.
- For example:
  crime = β0 + β1 officer + β2 unem + β3 crime−1 + u,
  where crime−1 is the crime rate in a previous period.
- Some cities had a high crime rate in the past and still have one today.
- If crime−1 is not included, we might suffer from reverse causality: a city with a high crime rate → high unemployment and many police officers.
- If crime−1 is included, we can do this experiment: if two cities have the same previous crime rate and the same current unemployment rate, then β1 measures the effect of another police officer on the crime rate.

Models with Random Slopes

- A model with random slopes is given by
  yi = ai + bi xi.
- The slope coefficient bi is a random draw from the population.
  - That is, the slope on x varies by individual.
- We cannot estimate the slope for each observation, but we can estimate the average slope across the population → average partial effect (APE) or average marginal effect (AME).
- The assumption is that the slopes are independent of the explanatory variables.
- In Stata:
  mixed
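
The claim that the average slope can be recovered when the individual slopes are independent of x can be checked with a small simulation. This is a minimal Python/NumPy sketch using plain OLS, not the maximum-likelihood estimator behind Stata's mixed; the distributions and parameter values are assumptions chosen for illustration.

```python
import numpy as np

# Simulated data (assumed): each unit has its own intercept a_i and slope b_i,
# both drawn independently of x_i.
rng = np.random.default_rng(5)
n = 100_000
x = rng.normal(size=n)
a = 1.0 + rng.normal(scale=0.5, size=n)   # random intercepts, mean 1
b = 2.0 + rng.normal(scale=1.0, size=n)   # random slopes, mean 2 (the APE)
y = a + b * x

# Pooled OLS of y on a constant and x.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
slope_ols = beta[1]

print("average individual slope:", b.mean())
print("pooled OLS slope:        ", slope_ols)
```

If the slopes were instead correlated with x, the pooled OLS slope would generally no longer match the average slope.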

Example

Determinants of language score.
- Dataset: snijders.dta
- Cross section of 2287 8th grade students from 131 schools in the Netherlands.
- Variables of interest: langpost (language score), iqvc (average verbal IQ score), and schoolnr (identity code for each school).
- In Stata:
  ** Random intercepts only
  mixed langpost iqvc || schoolnr: , mle
  ** Random intercepts and slopes
  mixed langpost iqvc || schoolnr: iqvc, mle covariance(indep)

Example Result: Random Intercepts

- The expected language score for a child with average verbal IQ averages 40.61 across all schools, but with substantial variation (variance = 9.50).
- The common slope is estimated as a gain of 2.49 points in language score per point of verbal IQ.

Example Result: Random Intercepts and Slopes

- The expected language score for a child with average IQ now averages 40.64 across schools, with about the same variance of 9.54.
- The expected gain in language score per point of IQ averages 2.52, a bit higher than in the random-intercepts model.

Measurement Error

- The measurement error is defined as the difference between the observed value (y, x1) and the actual value in the population (y∗, x1∗):
  e0 = y − y∗
  e1 = x1 − x1∗
- If e0 and e1 are uncorrelated with the explanatory variables → good.
- If e0 and e1 are correlated with the error term u → bias → need to collect new data with a better data-collection technique.

Classical errors-in-variables

- Classical errors-in-variables (CEV) assumption: the measurement error is uncorrelated with the unobserved explanatory variable,
  Cov(x1∗, e1) = 0.
- Under CEV, the observed x1 = x1∗ + e1 is necessarily correlated with e1, so OLS is biased and inconsistent → attenuation bias (in the simple regression model, the estimated slope is biased toward zero: attenuated/weaker/underestimated in magnitude).
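
Attenuation bias can be seen in a small simulation. This is a minimal Python/NumPy sketch with assumed values: a true slope of 2 and equal variances for x1∗ and e1, so the probability limit of the OLS slope is 2 × Var(x1∗) / (Var(x1∗) + Var(e1)) = 1.

```python
import numpy as np

def simple_slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Simulated data (assumed values): CEV holds because e1 is independent of x_star.
rng = np.random.default_rng(42)
n = 100_000
beta1 = 2.0
x_star = rng.normal(scale=1.0, size=n)    # true regressor, variance 1
e1 = rng.normal(scale=1.0, size=n)        # measurement error, variance 1
y = 1.0 + beta1 * x_star + rng.normal(size=n)
x_obs = x_star + e1                       # what the researcher observes

slope_true = simple_slope(x_star, y)      # uses the unobservable x_star
slope_obs = simple_slope(x_obs, y)        # uses the mismeasured regressor

print("slope with true regressor:       ", slope_true)
print("slope with mismeasured regressor:", slope_obs)
```

The larger the variance of the measurement error relative to that of x1∗, the closer the estimated slope is pushed toward zero.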

Missing Data

- If the data are missing completely at random (MCAR), then missing data cause no statistical problems.
- Complete cases estimator: use only observations with complete data in the regression.
- Multiple-imputation method. In Stata:
  mi estimate

Missing Indicator Method

Procedure.
1 Create Zik = xik when it is observed, 0 otherwise.
2 Create a missing data indicator mik = 1 when xik is missing, 0 otherwise.
3 Estimate yi on xi1, ..., xi,k−1, Zik, mik for i = 1, ..., n.

Drawbacks of MIM.
- Requires strong assumptions, such as xk being uncorrelated with x1, x2, ..., xk−1.
- It is less robust than the complete cases estimator.
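
The three steps of the procedure can be sketched as follows. This is a minimal Python/NumPy illustration with simulated data: one fully observed regressor x1, one regressor xk with about 30% of values missing completely at random, and xk generated independently of x1 so that the strong assumption holds. All names and values are assumptions.

```python
import numpy as np

# Simulated data (assumed): y = 1 + 0.5*x1 + 2*xk + u, with xk independent of x1.
rng = np.random.default_rng(7)
n = 1000
x1 = rng.normal(size=n)
xk = rng.normal(size=n)
y = 1 + 0.5 * x1 + 2 * xk + rng.normal(size=n)

# Knock out about 30% of xk completely at random; NaN marks a missing value.
missing = rng.random(n) < 0.3
xk_obs = np.where(missing, np.nan, xk)

# Step 1: Z equals xk when observed, 0 otherwise.
Z = np.where(np.isnan(xk_obs), 0.0, xk_obs)
# Step 2: m equals 1 when xk is missing, 0 otherwise.
m = np.isnan(xk_obs).astype(float)
# Step 3: regress y on x1, Z and m using all n observations.
X = np.column_stack([np.ones(n), x1, Z, m])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print("coefficients (const, x1, Z, m):", beta)
```

Here the coefficient on Z should be close to 2 because xk was generated independently of x1; when that assumption fails, the estimates can be biased, which is the drawback noted above.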

Nonrandom Samples

- Exogenous sample selection.
  - Selection based on the independent variables (sometimes called missing at random, MAR).
  - E.g., regressing y on x1 and age, but the survey covers only those with age > 40; a nonrandom sample of adults.
  - Does not cause bias.
- Endogenous sample selection.
  - Selection based on the dependent variable.
  - E.g., regressing wealth on x1, x2, but only those with wealth < 250,000 are in the sample.
  - Produces biased and inconsistent estimates.
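
The contrast between the two kinds of selection can be illustrated with a simulation. This is a minimal Python/NumPy sketch with assumed values: a true slope of 2, a sample kept only when x > 0 (exogenous selection), and a sample kept only when y < 1 (endogenous selection). The cutoffs are arbitrary choices for illustration.

```python
import numpy as np

def slope(x, y):
    """OLS slope from a regression of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Simulated population (assumed): y = 1 + 2*x + u.
rng = np.random.default_rng(3)
n = 200_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

keep_exog = x > 0      # selection on the explanatory variable
keep_endog = y < 1.0   # selection on the dependent variable

slope_full = slope(x, y)
slope_exog = slope(x[keep_exog], y[keep_exog])
slope_endog = slope(x[keep_endog], y[keep_endog])

print("full sample:        ", slope_full)
print("exogenous selection:", slope_exog)
print("endogenous selection:", slope_endog)
```

Selecting on x leaves the slope close to 2, while selecting on y pulls it well below 2.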

Outliers

- Least absolute deviations (LAD) estimation can be used to minimise the impact of outliers in a regression.
  - It minimises the sum of the absolute residuals.
- LAD is designed to estimate the parameters of the conditional median of y given the xs.
- LAD is a special case of robust regression and quantile regression.
- In Stata:
  qreg y x1 x2
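
The robustness of LAD to outliers can be sketched numerically. This is a minimal Python/NumPy illustration that approximates the LAD fit by iteratively reweighted least squares; it is not the algorithm used by qreg, and the simulated data, the outlier contamination, and the helper name `lad_irls` are assumptions.

```python
import numpy as np

def lad_irls(X, y, n_iter=200, eps=1e-6):
    """Approximate LAD coefficients by iteratively reweighted least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # start from the OLS fit
    for _ in range(n_iter):
        resid = y - X @ beta
        # Weight 1/|r| makes the weighted squared residual behave like |r|.
        sw = np.sqrt(1.0 / np.maximum(np.abs(resid), eps))
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Simulated data (assumed): y = 1 + 2*x + u, then the 10 observations with the
# largest x are shifted down by 100 to act as gross outliers.
rng = np.random.default_rng(9)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1 + 2 * x + rng.normal(size=n)
y[np.argsort(x)[-10:]] -= 100

X = np.column_stack([np.ones(n), x])
ols_beta, *_ = np.linalg.lstsq(X, y, rcond=None)
lad_beta = lad_irls(X, y)

print("OLS slope:", ols_beta[1])
print("LAD slope:", lad_beta[1])
```

In this setup the OLS slope is dragged far from 2 by the contaminated observations, while the approximate LAD slope stays close to it.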
