Basic Econometrics Revision - Econometric Modelling
Uma Kollamparambil
Today's Agenda
A quick round-up of basic econometrics
Econometric modelling
Regression analysis: theory specifies the functional relationship; measurement of the relationship uses regression analysis to arrive at values of a and b.
Y = a + bX + e
Components: dependent and independent variables, intercept (a), coefficients, and the error term. Regression may be simple or multivariate according to the number of independent variables.
Requirements
Model specification: the relationship between dependent and independent variables; a scatter plot helps specify the function that best fits the scatter
Inference: t-statistic, R-square (coefficient of determination), F-statistic
Estimation: OLS
[Scatter plot of the sample data, X on the horizontal axis (0-50) and Y on the vertical axis (0-18)]
How to Estimate a and b in the linear equation? The OLS estimator solves:
$$\min_{a,b} \sum_i (Y_i - a - bX_i)^2$$
This minimization problem can be solved using calculus; the result is the OLS estimators of a and b:
OLS estimator of b:
$$\hat{b} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}$$
OLS estimator of a:
$$\hat{a} = \bar{Y} - \hat{b}\bar{X}$$
Standard error of b:
$$S_b = \sqrt{\frac{\sum_i (Y_i - \hat{Y}_i)^2 / (n - k)}{\sum_i (X_i - \bar{X})^2}}$$
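A minimal Python sketch applying these formulas directly; the data below are made up purely for illustration:

```python
import numpy as np

# Hypothetical sample (X, Y); any small data set works here
X = np.array([10.0, 15, 20, 25, 30, 35, 40])
Y = np.array([200.0, 270, 330, 390, 450, 520, 580])

n, k = len(Y), 2                                  # k = 2 estimated parameters (a, b)
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()                       # a = Ybar - b*Xbar

Y_hat = a + b * X
S_b = np.sqrt(np.sum((Y - Y_hat) ** 2) / ((n - k) * np.sum((X - X.mean()) ** 2)))
print(a, b, S_b)
```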
Here, the R-squared is a measure of the goodness of fit of our model, while the standard error of b gives us a measure of confidence for our estimate of b.
            Coefficients   Standard Error   t Stat     P-value
Intercept   87.10614525    17.92601129      4.859204   0.004636
Q (X)       12.2122905     1.19773201       10.19618   0.000156

These are the estimates reported by Excel for our example data.
Hypothesis testing
Hypothesis formulation and test:
Confidence interval method: construct an interval around the estimated b at the desired level of confidence using the SE of b, and check whether the hypothesized value falls within it. If it does, accept the null hypothesis.
Test of significance method: estimate the t-value of b and compare it with the table t-value. If the former is less than the latter, accept the null hypothesis.
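Both methods can be sketched in a few lines of Python; the numbers are taken from the Excel output that follows (b = 12.2123, SE = 1.1977, n = 7, k = 2), and scipy is an assumption of this sketch:

```python
from scipy import stats

b, S_b, n, k = 12.2123, 1.1977, 7, 2
t_crit = stats.t.ppf(0.975, df=n - k)      # 5% two-sided critical value

# Confidence-interval method: reject H0 (b = 0) if 0 lies outside the interval
ci = (b - t_crit * S_b, b + t_crit * S_b)

# Test-of-significance method: reject H0 if |t| exceeds the table value
t_stat = b / S_b
print(ci, t_stat, t_crit, abs(t_stat) > t_crit)
```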
Hypothesis testing (cont'd)
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.976786811
R Square             0.954112475
Adjusted R Square    0.94493497
Standard Error       27.08645377
Observations         7

ANOVA
            df   SS            MS         F          Significance F
Regression  1    76274.47725   76274.48   103.9621   0.000155729
Residual    5    3668.379888   733.676
Total       6    79942.85714

            Coefficients   Standard Error   t Stat     P-value
Intercept   87.10614525    17.92601129      4.859204   0.004636
Q (X)       12.2122905     1.19773201       10.19618   0.000156
$$t = \frac{\hat{b}}{S_b}$$
is the t-ratio. Combined with critical values from a Student-t distribution, this ratio tells us how confident we are that a value is significantly different from zero.
$$F = \frac{R^2 / (k - 1)}{(1 - R^2) / (n - k)}$$
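A quick check of this formula against the ANOVA table above (R² = 0.954112, n = 7, k = 2), sketched with scipy:

```python
from scipy import stats

R2, n, k = 0.954112, 7, 2
F = (R2 / (k - 1)) / ((1 - R2) / (n - k))
p_value = stats.f.sf(F, k - 1, n - k)       # upper-tail p-value
print(F, p_value)                           # ~103.96 and ~0.000156, as reported
```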
Multivariate regression
In matrix notation the model is $y = X\beta + u$:
$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & X_{21} & \cdots & X_{k1} \\ 1 & X_{22} & \cdots & X_{k2} \\ \vdots & \vdots & & \vdots \\ 1 & X_{2n} & \cdots & X_{kn} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}$$
The OLS estimator is
$$\hat{\beta} = (X'X)^{-1}X'Y$$
Model Specification
Sources of misspecification
Omission of a relevant variable
Inclusion of unnecessary variables
Wrong functional form
Errors of measurement
Incorrect specification of the stochastic error term
Model Specification Errors: Omitting Relevant Variables and Including Irrelevant Variables
To properly estimate a regression model, we need to have specified the correct model. A typical specification error occurs when the estimated model does not include the correct set of explanatory variables. This specification error takes two forms:
omitting one or more relevant explanatory variables
including one or more irrelevant explanatory variables
Either form of specification error results in problems with OLS estimates.
Suppose the true model is
$$r_t = \beta_0 + \beta_1 GDP_t + \beta_2 INF_t + \epsilon_t$$
but we estimate
$$r_t = \beta_0 + \beta_1 GDP_t + v_t$$
Thus, the error term of this model is actually equal to $v_t = \beta_2 INF_t + \epsilon_t$. If there is any correlation between the omitted variable (INF) and the explanatory variable (GDP), then there is a violation of the classical assumption Cov(u_i, X_i) = 0.
When Cov(X1, X2) ≠ 0, estimates of both the constant and the slope are biased, and the bias persists even in larger samples. When Cov(X1, X2) = 0, the constant is biased but the slope is unbiased. The variance of the error term is incorrectly estimated, and consequently the variance of the slope is biased. This leads to misleading conclusions from confidence intervals and hypothesis tests regarding the statistical significance of the estimated parameters. Forecasts based on a mis-specified model will therefore be unreliable.
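A small Monte Carlo sketch of the first point, with an entirely hypothetical data-generating process: when the omitted X2 is correlated with X1, the short-regression slope is biased, and the bias does not shrink as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, b0, b1, b2 = 100_000, 1.0, 2.0, 3.0        # true parameters (made up)

x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)            # Cov(X1, X2) != 0 by design
y = b0 + b1 * x1 + b2 * x2 + rng.normal(size=n)

# Short regression of y on x1 alone: the slope absorbs b2 * Cov(x1,x2)/Var(x1)
slope = np.cov(x1, y, ddof=1)[0, 1] / np.var(x1, ddof=1)
print(slope)      # ~= b1 + b2 * 0.8 = 4.4, not the true b1 = 2.0
```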
To avoid omitted variable bias, a simple solution is to add the omitted variable back to the model; the difficulty with this solution is detecting which variable has been omitted. Omitted variable bias is hard to detect, but there can be obvious indications of this specification error. The best way to detect omitted variable bias is to rely on the theoretical arguments behind the model: which variables does the theory suggest should be included?
Note, though, that a significant coefficient with an unexpected sign can also occur due to a small sample size. However, most data sets used in empirical finance are large enough that this is most likely not the cause of the specification bias.
When an irrelevant variable is included, the estimated coefficients (both constant and slope) are unbiased, and the variance of the error term is estimated accurately; the cost is a loss of efficiency.
$$\frac{\operatorname{var}(\hat{\beta}_{new})}{\operatorname{var}(\hat{\beta}_{old})} = \frac{1}{1 - r_{23}^2} \geq 1$$
Example: A well-known model of nominal exchange rate determination is the Purchasing Power Parity (PPP) model, s = P/P*, where s = nominal exchange rate (e.g. rand/$), P = price level in SA, and P* = price level in the US.
Taking natural logs, we can estimate the following model:
$$\ln(s_i) = \beta_0 + \beta_1 \ln(P_i) + \beta_2 \ln(P^*_i) + \epsilon_i$$
Property of the double-log model: estimated coefficients are elasticities of the dependent variable with respect to the explanatory variables.
Example: a 1% change in P results in a β₁% change in the nominal exchange rate (s); under strict PPP, β₁ = 1.
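A hedged sketch of estimating this double-log model with statsmodels; the series below are simulated, not real SA/US data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
P = np.exp(rng.normal(4.0, 0.2, 100))                 # hypothetical SA price level
P_star = np.exp(rng.normal(3.5, 0.2, 100))            # hypothetical US price level
s = (P / P_star) * np.exp(rng.normal(0, 0.05, 100))   # PPP plus noise

X = sm.add_constant(np.column_stack([np.log(P), np.log(P_star)]))
res = sm.OLS(np.log(s), X).fit()
print(res.params)    # expect beta1 ~ +1 and beta2 ~ -1 under PPP
```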
How do we know if we've gotten the right functional form for our model? Check expected coefficient signs, R², t-statistics, and the DW d-statistic.
Ramsey's RESET
Regression Specification Error Test: estimate the assumed model and derive the fitted values ŷ. Then estimate
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \delta_1 \hat{y}^2 + \delta_2 \hat{y}^3 + error$$
and test $H_0: \delta_1 = 0, \delta_2 = 0$.
$$F = \frac{(R^2_{new} - R^2_{old}) / \text{no. of new parameters}}{(1 - R^2_{new}) / (n - k_{new})}$$
If H0 is rejected, the model is mis-specified. Advantage: in RESET you don't have to specify the correct alternative model. Disadvantage: it doesn't help in attaining the right model.
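A manual sketch of the RESET steps on simulated data (statsmodels also ships a ready-made RESET test, but writing the steps out keeps them visible):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = 2 + 0.5 * x**2 + rng.normal(size=200)    # true relation is nonlinear in x

restricted = sm.OLS(y, sm.add_constant(x)).fit()      # assumed (linear) model
yhat = restricted.fittedvalues

X_aug = sm.add_constant(np.column_stack([x, yhat**2, yhat**3]))
unrestricted = sm.OLS(y, X_aug).fit()

# F-test of H0: both added powers are zero; a small p-value flags misspecification
print(unrestricted.compare_f_test(restricted))        # (F, p-value, df_diff)
```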
Lagrange Multiplier (LM) test: obtain the residuals from Eq. 1 and regress them on all the X variables in Eq. 2, including those in Eq. 1:
$$\hat{u}_i = a_0 + a_1 X_1 + a_2 X_2 + a_3 X_3$$
Then
$$nR^2 \sim \chi^2_{\text{no. of restrictions}}$$
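A sketch of this LM procedure on simulated data (all names and numbers are ours):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
x1, x2 = rng.normal(size=300), rng.normal(size=300)
y = 1 + 2 * x1 + 1.5 * x2 + rng.normal(size=300)

u = sm.OLS(y, sm.add_constant(x1)).fit().resid        # residuals from Eq. 1 (omits x2)
aux = sm.OLS(u, sm.add_constant(np.column_stack([x1, x2]))).fit()

LM = len(y) * aux.rsquared                            # n * R^2
print(LM, stats.chi2.sf(LM, df=1))                    # df = 1 added restriction
```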
Adjusted R-square for comparing models:
$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k}$$
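As a check, plugging in the Excel output above (R² = 0.954112, n = 7, k = 2):
$$\bar{R}^2 = 1 - (1 - 0.954112)\,\frac{7 - 1}{7 - 2} \approx 0.9449$$
which matches the reported Adjusted R Square of 0.94493497.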
2) Discerning approach: make use of information provided by other models as well, along with the initial model.
Davidson-MacKinnon J test
An alternative, the Davidson-MacKinnon J test, uses the fitted values ŷ from one model as a regressor in the second model and tests for significance:
Model A: Y = a + b1X1 + b2X2
Model B: Y = c0 + c1Z1 + c2Z2
Estimate B and obtain Ŷ_B, then estimate
Y = a + b1X1 + b2X2 + b3Ŷ_B
Davidson-MacKinnon J test (cont'd)
Use a t-test on b3: if b3 = 0 is not rejected, we accept model A. Reverse the models and repeat the steps. The test is more difficult if one model uses y and the other uses ln(y); one can follow the same basic logic and transform the predicted ln(y) to obtain ŷ for the second step. In any case, the Davidson-MacKinnon test may reject neither or both models rather than clearly preferring one specification.
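A sketch of the J test on simulated data (model and variable names are illustrative only):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=200)
z = rng.normal(size=200)
y = 1 + 2 * x + rng.normal(size=200)      # data generated by "model A"

yhat_B = sm.OLS(y, sm.add_constant(z)).fit().fittedvalues   # fit model B first
res = sm.OLS(y, sm.add_constant(np.column_stack([x, yhat_B]))).fit()

# t-test on b3 (the coefficient on yhat_B); insignificant => keep model A
print(res.tvalues[-1], res.pvalues[-1])
```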
Measurement Error
Sometimes we have the variable we want, but we think it is measured with error. Examples: a survey asks how many hours you worked over the last year, or how many weeks you used child care when your child was young. The consequences of measurement error in y are different from those of measurement error in x.
The effect of measurement error on OLS estimates depends on our assumption about the correlation between e1 and x1
If Cov(x1, e1) ≠ 0, OLS estimates are biased and variances are larger. Use proxy or IV variables.
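A small simulation sketch of the classical errors-in-variables case (all numbers hypothetical): the observed regressor is correlated with the measurement error e1, and the slope is attenuated toward zero.

```python
import numpy as np

rng = np.random.default_rng(5)
n, beta = 100_000, 2.0
x_true = rng.normal(size=n)
y = beta * x_true + rng.normal(size=n)
x_obs = x_true + rng.normal(size=n)        # measurement error e1 in x

slope = np.cov(x_obs, y, ddof=1)[0, 1] / np.var(x_obs, ddof=1)
print(slope)    # ~= beta * Var(x)/(Var(x)+Var(e1)) = 1.0, not the true 2.0
```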
Proxy Variables
What if the model is mis-specified because no data are available on an important x variable? It may be possible to avoid omitted variable bias by using a proxy variable. A proxy variable must be related to the unobservable variable, but must be uncorrelated with the error term (Sargan test).
Non-random Samples
If the sample is chosen on the basis of an x variable, then estimates are unbiased. If the sample is chosen on the basis of the y variable, then we have sample selection bias. Sample selection can be more subtle: say we are looking at wages for workers; since people choose to work, this isn't the same as wage offers.
Outliers
Sometimes an individual observation can be very different from the others and can have a large effect on the outcome. Sometimes this outlier will simply be due to errors in data entry, which is one reason why looking at summary statistics is important. Sometimes the observation will just truly be very different from the others.
Outliers (cont'd)
Not unreasonable to fix observations where it's clear there was just an extra zero entered or left off, etc. Not unreasonable to drop observations that appear to be extreme outliers, although readers may prefer to see estimates with and without the outliers. Stata can be used to investigate outliers.
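Stata is one option; the same summary-statistics screen can be sketched in a few lines of Python (data invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"wage": [12.0, 14.5, 13.2, 11.8, 140.0, 12.9]})  # hypothetical
print(df.describe())                       # the max of 140 stands out immediately

z = (df["wage"] - df["wage"].mean()) / df["wage"].std()
print(df[np.abs(z) > 2])                   # flag candidates for closer inspection
```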
Multivariate regression (cont'd)
In matrix form, y = Xβ + u, where y is n×1, X is n×(k+1), β is (k+1)×1 and u is n×1. The OLS estimator is
$$\hat{\beta} = (X'X)^{-1}X'Y$$
Assumptions
E(u) = 0, where u and 0 are n×1 column vectors, 0 being a null vector.
E(uu') = σ²I (homoscedasticity, no autocorrelation).
The rank of X is ρ(X) = k, where k is the number of columns in X and k is less than the number of observations, n (no multicollinearity); i.e., there is no non-zero λ such that λx = 0, where λ is a 1×k row vector and x is a k×1 column vector.
u is normally distributed, i.e. u ~ N(0, σ²I).
Goodness of fit:
$$R^2 = \frac{\hat{\beta}'X'y - n\bar{Y}^2}{y'y - n\bar{Y}^2}$$
Variance-covariance matrix of the OLS estimator:
$$\operatorname{var-cov}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}$$
where the error variance is estimated by
$$\hat{\sigma}^2 = \frac{\hat{u}'\hat{u}}{n - k} = \frac{\sum_i \hat{u}_i^2}{n - k}$$
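These matrix formulas map almost line-for-line onto numpy; a minimal sketch on simulated data (all names and numbers are ours, with k counting the constant column):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)     # (X'X)^{-1} X'y
u = y - X @ beta
sigma2 = (u @ u) / (n - k)                   # u'u / (n - k)
varcov = sigma2 * np.linalg.inv(X.T @ X)     # var-cov(beta-hat)
R2 = (beta @ X.T @ y - n * y.mean()**2) / (y @ y - n * y.mean()**2)
print(beta, sigma2, R2)
```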