

Econometrics

Michael Creel
Department of Economics and Economic History
Universitat Autònoma de Barcelona
February 2014
Contents
1 About this document
1.1 Prerequisites
1.2 Contents
1.3 Licenses
1.4 Obtaining the materials
1.5 An easy way to run the examples
2 Introduction: Economic and econometric models
3 Ordinary Least Squares
3.1 The Linear Model
3.2 Estimation by least squares
3.3 Geometric interpretation of least squares estimation
3.4 Influential observations and outliers
3.5 Goodness of fit
3.6 The classical linear regression model
3.7 Small sample statistical properties of the least squares estimator
3.8 Example: The Nerlove model
3.9 Exercises
4 Asymptotic properties of the least squares estimator
4.1 Consistency
4.2 Asymptotic normality
4.3 Asymptotic efficiency
4.4 Exercises
5 Restrictions and hypothesis tests
5.1 Exact linear restrictions
5.2 Testing
5.3 The asymptotic equivalence of the LR, Wald and score tests
5.4 Interpretation of test statistics
5.5 Confidence intervals
5.6 Bootstrapping
5.7 Wald test for nonlinear restrictions: the delta method
5.8 Example: the Nerlove data
5.9 Exercises
6 Stochastic regressors
6.1 Case 1
6.2 Case 2
6.3 Case 3
6.4 When are the assumptions reasonable?
6.5 Exercises
7 Data problems
7.1 Collinearity
7.2 Measurement error
7.3 Missing observations
7.4 Missing regressors
7.5 Exercises
8 Functional form and nonnested tests
8.1 Flexible functional forms
8.2 Testing nonnested hypotheses
9 Generalized least squares
9.1 Effects of nonspherical disturbances on the OLS estimator
9.2 The GLS estimator
9.3 Feasible GLS
9.4 Heteroscedasticity
9.5 Autocorrelation
9.6 Exercises
10 Endogeneity and simultaneity
10.1 Simultaneous equations
10.2 Reduced form
10.3 Estimation of the reduced form equations
10.4 Bias and inconsistency of OLS estimation of a structural equation
10.5 Note about the rest of this chapter
10.6 Identification by exclusion restrictions
10.7 2SLS
10.8 Testing the overidentifying restrictions
10.9 System methods of estimation
10.10 Example: Klein's Model 1
11 Numeric optimization methods
11.1 Search
11.2 Derivative-based methods
11.3 Simulated Annealing
11.4 A practical example: Maximum likelihood estimation using count data: The MEPS data and the Poisson model
11.5 Numeric optimization: pitfalls
11.6 Exercises
12 Asymptotic properties of extremum estimators
12.1 Extremum estimators
12.2 Existence
12.3 Consistency
12.4 Example: Consistency of Least Squares
12.5 Example: Inconsistency of Misspecified Least Squares
12.6 Example: Linearization of a nonlinear model
12.7 Asymptotic Normality
12.8 Example: Classical linear model
12.9 Exercises
13 Maximum likelihood estimation
13.1 The likelihood function
13.2 Consistency of MLE
13.3 The score function
13.4 Asymptotic normality of MLE
13.5 The information matrix equality
13.6 The Cramér-Rao lower bound
13.7 Likelihood ratio-type tests
13.8 Examples
13.9 Exercises
14 Generalized method of moments
14.1 Motivation
14.2 Definition of GMM estimator
14.3 Consistency
14.4 Asymptotic normality
14.5 Choosing the weighting matrix
14.6 Estimation of the variance-covariance matrix
14.7 Estimation using conditional moments
14.8 Estimation using dynamic moment conditions
14.9 A specification test
14.10 Example: Generalized instrumental variables estimator
14.11 Nonlinear simultaneous equations
14.12 Maximum likelihood
14.13 Example: OLS as a GMM estimator - the Nerlove model again
14.14 Example: The MEPS data
14.15 Example: The Hausman Test
14.16 Application: Nonlinear rational expectations
14.17 Empirical example: a portfolio model
14.18 Exercises
15 Models for time series data
15.1 ARMA models
15.2 VAR models
15.3 ARCH, GARCH and Stochastic volatility
15.4 State space models
15.5 Nonstationarity and cointegration
15.6 Exercises
16 Bayesian methods
16.1 Definitions
16.2 Philosophy, etc.
16.3 Example
16.4 Theory
16.5 Computational methods
16.6 Examples
16.7 Exercises
17 Introduction to panel data
17.1 Generalities
17.2 Static models and correlations between variables
17.3 Estimation of the simple linear panel model
17.4 Dynamic panel data
17.5 Exercises
18 Quasi-ML
18.1 Consistent Estimation of Variance Components
18.2 Example: the MEPS Data
18.3 Exercises
19 Nonlinear least squares (NLS)
19.1 Introduction and definition
19.2 Identification
19.3 Consistency
19.4 Asymptotic normality
19.5 Example: The Poisson model for count data
19.6 The Gauss-Newton algorithm
19.7 Application: Limited dependent variables and sample selection
20 Nonparametric inference
20.1 Possible pitfalls of parametric inference: estimation
20.2 Possible pitfalls of parametric inference: hypothesis testing
20.3 Estimation of regression functions
20.4 Density function estimation
20.5 Examples
20.6 Exercises
21 Quantile regression
22 Simulation-based methods for estimation and inference
22.1 Motivation
22.2 Simulated maximum likelihood (SML)
22.3 Method of simulated moments (MSM)
22.4 Efficient method of moments (EMM)
22.5 Indirect likelihood inference
22.6 Examples
22.7 Exercises
23 Parallel programming for econometrics
23.1 Example problems
24 Introduction to Octave
24.1 Getting started
24.2 A short introduction
24.3 If you're running a Linux installation...
25 Notation and Review
25.1 Notation for differentiation of vectors and matrices
25.2 Convergence modes
25.3 Rates of convergence and asymptotic equality
26 Licenses
26.1 The GPL
26.2 Creative Commons
27 The attic
27.1 Hurdle models
List of Figures
1.1 Octave
1.2 LyX
3.1 Typical data, Classical Model
3.2 Example OLS Fit
3.3 The fit in observation space
3.4 Detection of influential observations
3.5 Uncentered R^2
3.6 Unbiasedness of OLS under classical assumptions
3.7 Biasedness of OLS when an assumption fails
3.8 Gauss-Markov Result: The OLS estimator
3.9 Gauss-Markov Result: The split sample estimator
5.1 Joint and Individual Confidence Regions
5.2 RTS as a function of firm size
7.1 s() when there is no collinearity
7.2 s() when there is collinearity
7.3 Collinearity: Monte Carlo results
7.4 OLS and Ridge regression
7.5 with and without measurement error
7.6 Sample selection bias
9.1 Rejection frequency of 10% t-test, H0 is true.
9.2 Motivation for GLS correction when there is HET
9.3 Residuals, Nerlove model, sorted by firm size
9.4 Residuals from time trend for CO2 data
9.5 Autocorrelation induced by misspecification
9.6 Efficiency of OLS and FGLS, AR1 errors
9.7 Durbin-Watson critical values
9.8 Dynamic model with MA(1) errors
9.9 Residuals of simple Nerlove model
9.10 OLS residuals, Klein consumption equation
10.1 Exogeneity and Endogeneity (adapted from Cameron and Trivedi)
11.1 Search method
11.2 Increasing directions of search
11.3 Newton iteration
11.4 Using Sage to get analytic derivatives
11.5 Mountains with low fog
11.6 A foggy mountain
13.1 Dwarf mongooses
13.2 Life expectancy of mongooses, Weibull model
13.3 Life expectancy of mongooses, mixed Weibull model
14.1 Method of Moments
14.2 Asymptotic Normality of GMM estimator, \chi^2 example
14.3 Inefficient and Efficient GMM estimators, \chi^2 data
14.4 GIV estimation results for , dynamic model with measurement error
14.5 OLS
14.6 IV
14.7 Incorrect rank and the Hausman test
15.1 NYSE weekly close price, 100 log differences
16.1 Bayesian estimation, exponential likelihood, lognormal prior
16.2 Chernozhukov and Hong, Theorem 2
16.3 Metropolis-Hastings MCMC, exponential likelihood, lognormal prior
16.4 Data from RBC model
16.5 BVAR residuals, with separation
20.1 True and simple approximating functions
20.2 True and approximating elasticities
20.3 True function and more flexible approximation
20.4 True elasticity and more flexible approximation
20.5 Negative binomial raw moments
20.6 Kernel fitted OBDV usage versus AGE
20.7 Dollar-Euro
20.8 Dollar-Yen
20.9 Kernel regression fitted conditional second moments, Yen/Dollar and Euro/Dollar
21.1 Inverse CDF for N(0,1)
21.2 Quantile regression results
23.1 Speedups from parallelization
24.1 Running an Octave program
List of Tables
17.1 Dynamic panel data model. Bias. Source for ML and II is Gouriéroux, Phillips and
Yu, 2010, Table 2. SBIL, SMIL and II are exactly identified, using the ML auxiliary
statistic. SBIL(OI) and SMIL(OI) are overidentified, using both the naive and ML
auxiliary statistics.
17.2 Dynamic panel data model. RMSE. Source for ML and II is Gouriéroux, Phillips and
Yu, 2010, Table 2. SBIL, SMIL and II are exactly identified, using the ML auxiliary
statistic. SBIL(OI) and SMIL(OI) are overidentified, using both the naive and ML
auxiliary statistics.
18.1 Marginal Variances, Sample and Estimated (Poisson)
18.2 Marginal Variances, Sample and Estimated (NB-II)
18.3 Information Criteria, OBDV
22.1 True parameter values and bound of priors
22.2 Monte Carlo results, bias corrected estimators
27.1 Actual and Poisson fitted frequencies
27.2 Actual and Hurdle Poisson fitted frequencies
Chapter 1
About this document
1.1 Prerequisites
These notes have been prepared under the assumption that the reader understands basic statistics,
linear algebra, and mathematical optimization. There are many sources for this material; one is
the set of appendices to Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge. It is
the student's responsibility to get up to speed on this material; it will not be covered in class.
This document integrates lecture notes for a one-year graduate-level course with computer programs
that illustrate and apply the methods that are studied. The immediate availability of executable (and
modifiable) example programs when using the PDF version of the document is a distinguishing feature
of these notes. If printed, the document is a somewhat terse approximation to a textbook. These notes
are not intended to be a perfect substitute for a printed textbook. If you are a student of mine, please
note that last sentence carefully. There are many good textbooks available. Students taking my courses
should read the appropriate sections from at least one of the following books (or other textbooks with
similar level and content):
Cameron, A.C. and P.K. Trivedi, Microeconometrics - Methods and Applications
Davidson, R. and J.G. MacKinnon, Econometric Theory and Methods
Gallant, A.R., An Introduction to Econometric Theory
Hamilton, J.D., Time Series Analysis
Hayashi, F., Econometrics
A more introductory-level reference is Introductory Econometrics: A Modern Approach by Jeffrey
Wooldridge.
1.2 Contents
With respect to contents, the emphasis is on estimation and inference within the world of stationary
data. If you take a moment to read the licensing information in the next section, you'll see that you
are free to copy and modify the document. If anyone would like to contribute material that expands
the contents, it would be very welcome. Error corrections and other additions are also welcome.
The integrated examples (they are on-line here and the support files are here) are an important
part of these notes. GNU Octave (www.octave.org) has been used for most of the example programs,
which are scattered throughout the document. This choice is motivated by several factors. The first is
the high quality of the Octave environment for doing applied econometrics. Octave is similar to the
commercial package Matlab®, and will run scripts for that language without modification[1]. The
fundamental tools (manipulation of matrices, statistical functions, minimization, etc.) exist and are
implemented in a way that makes extending them fairly easy. Second, an advantage of free software is
that you don't have to pay for it. This can be an important consideration if you are at a university
with a tight budget or if you need to run many copies, as can be the case if you do parallel computing
(discussed in Chapter 23). Third, Octave runs on GNU/Linux, Windows and MacOS. Figure 1.1
shows a sample GNU/Linux work environment, with an Octave script being edited and the results
visible in an embedded shell window. As of 2011, some examples are being added using Gretl, the
Gnu Regression, Econometrics, and Time-Series Library. This is an easy-to-use program, available in
a number of languages, and it comes with a lot of data ready to use. It runs on the major operating
systems. As of 2012, I am increasingly trying to make examples run on Matlab, though the need for
add-on toolboxes for tasks as simple as generating random numbers limits what can be done.
The main document was prepared using LyX (www.lyx.org). LyX is a free[2] "what you see is what
you mean" word processor, basically working as a graphical frontend to LaTeX. It (with help from
other applications) can export your work in LaTeX, HTML, PDF and several other forms. It will run
on Linux, Windows, and MacOS systems. Figure 1.2 shows LyX editing this document.
[1] Matlab® is a trademark of The Mathworks, Inc. Octave will run pure Matlab scripts. If a Matlab script calls an extension, such as a
toolbox function, then it is necessary to make a similar extension available to Octave. The examples discussed in this document call a number
of functions, such as a BFGS minimizer, a program for ML estimation, etc. All of this code is provided with the examples, as well as on the
PelicanHPC live CD image.
[2] "Free" is used in the sense of freedom, but LyX is also free of charge (free as in "free beer").
Figure 1.1: Octave
Figure 1.2: LyX
1.3 Licenses
All materials are copyrighted by Michael Creel with the date that appears above. They are provided
under the terms of the GNU General Public License, ver. 2, which forms Section 26.1 of the notes, or,
at your option, under the Creative Commons Attribution-Share Alike 2.5 license, which forms Section
26.2 of the notes. The main thing you need to know is that you are free to modify and distribute these
materials in any way you like, as long as you share your contributions in the same way the materials
are made available to you. In particular, you must make available the source files, in editable form,
for your modified version of the materials.
1.4 Obtaining the materials
The materials are available on my web page. In addition to the final product, which you're probably
looking at in some form now, you can obtain the editable LyX sources, which will allow you to create
your own version, if you like, or send error corrections and contributions.
1.5 An easy way to run the examples
Octave is available from the Octave home page, www.octave.org. Also, some updated links to packages
for Windows and MacOS are at http://www.dynare.org/download/octave. The example programs are
available as links to files on my web page in the PDF version, and here. Support files needed to run
these are available here. The files won't run properly from your browser, since there are dependencies
between files; they are only illustrative when browsing. To see how to use these files (edit and run
them), you should go to the home page of this document, since you will probably want to download the
PDF version together with all the support files and examples. Then set the base URL of the PDF file
to point to wherever the Octave files are installed. Then you need to install Octave and the support
files. All of this may sound a bit complicated, because it is. An easier solution is available:
The Linux OS image file econometrics.iso is an ISO image file that may be copied to USB or burnt
to CD-ROM. It contains a bootable-from-CD or USB GNU/Linux system. These notes, in source form
and as a PDF, together with all of the examples and the software needed to run them, are available on
econometrics.iso. I recommend starting off by using virtualization, to run the Linux system with all of
the materials inside of a virtual computer, while still running your normal operating system. Various
virtualization platforms are available. I recommend Virtualbox[3], which runs on Windows, Linux, and
Mac OS.
[3] Virtualbox is free software (GPL v2). That, and the fact that it works very well, is the reason it is recommended here. There are a number
of similar products available. It is possible to run PelicanHPC as a virtual machine, and to communicate with the installed operating system
using a private network. Learning how to do this is not too difficult, and it is very convenient.
Chapter 2
Introduction: Economic and econometric models
Here's some data: 100 observations on 3 economic variables. Let's do some exploratory analysis using
Gretl:
histograms
correlations
x-y scatterplots
So, what can we say? Correlations? Yes. Causality? Who knows? This is economic data, generated by
economic agents, following their own beliefs, technologies and preferences. It is not experimental data
generated under controlled conditions. How can we determine causality if we dont have experimental
data?
Without a model, we can't distinguish correlation from causality. It turns out that the variables
we're looking at are QUANTITY (q), PRICE (p), and INCOME (m). Economic theory tells us that
the quantity of a good that consumers will purchase (the demand function) is something like:
\[ q = f(p, m, z) \]
• q is the quantity demanded
• p is the price of the good
• m is income
• z is a vector of other variables that may affect demand
The supply of the good to the market is the aggregation of the firms' supply functions. The market
supply function is something like
\[ q = g(p, z) \]
Suppose we have a sample consisting of a number of observations on q, p and m at different time
periods t = 1, 2, ..., n. Supply and demand in each period is
\[ q_t = f(p_t, m_t, z_t) \]
\[ q_t = g(p_t, z_t) \]
(draw some graphs showing roles of m and z)
This is the basic economic model of supply and demand: q and p are determined in the market
equilibrium, given by the intersection of the two curves. These two variables are determined jointly by
the model, and are the endogenous variables. Income (m) is not determined by this model; its value is
determined independently of q and p by some other process. m is an exogenous variable. So, m causes
q, through the demand function. Because q and p are jointly determined, m also causes p. p and q do
not cause m, according to this theoretical model. q and p have a joint causal relationship.
• Economic theory can help us to determine the causality relationships between correlated variables.
• If we had experimental data, we could control certain variables and observe the outcomes for
other variables. If we see that variable x changes as the controlled value of variable y is changed,
then we know that y causes x. With economic data, we are unable to control the values of
the variables: for example in supply and demand, if price changes, then quantity changes, but
quantity also affects price. We can't control the market price, because the market price changes as
quantity adjusts. This is the reason we need a theoretical model to help us distinguish correlation
and causality.
The model is essentially a theoretical construct up to now:
• We don't know the forms of the functions f and g.
• Some components of $z_t$ may not be observable. For example, people don't eat the same lunch
every day, and you can't tell what they will order just by looking at them. There are unobservable
components to supply and demand, and we can model them as random variables. Suppose we
can break $z_t$ into two unobservable components $\varepsilon_{t1}$ and $\varepsilon_{t2}$.
An econometric model attempts to quantify the relationship more precisely. A step toward an estimable
econometric model is to suppose that the model may be written as
\[ q_t = \alpha_1 + \alpha_2 p_t + \alpha_3 m_t + \varepsilon_{t1} \]
\[ q_t = \beta_1 + \beta_2 p_t + \varepsilon_{t2} \]
We have imposed a number of restrictions on the theoretical model:
• The functions f and g have been specified to be linear functions
• The parameters ($\alpha_1$, $\beta_2$, etc.) are constant over time.
• There is a single unobservable component in each equation, and we assume it is additive.
If we assume nothing about the error terms $\varepsilon_{t1}$ and $\varepsilon_{t2}$, we can always write the last two equations,
as the errors simply make up the difference between the true demand and supply functions and the
assumed forms. But in order for the coefficients to exist in a sense that has economic meaning, and
in order to be able to use sample data to make reliable inferences about their values, we need to make
additional assumptions. Such assumptions might be something like:
• $E(\varepsilon_{tj}) = 0$, j = 1, 2
• $E(p_t \varepsilon_{tj}) = 0$, j = 1, 2
• $E(m_t \varepsilon_{tj}) = 0$, j = 1, 2
These are assertions that the errors are uncorrelated with the variables, and such assertions may or
may not be reasonable. Later we will see how such assumptions may be used and/or tested.
All of the last six bulleted points have no theoretical basis, in that the theory of supply and
demand doesn't imply these conditions. The validity of any results we obtain using this model will
be contingent on these additional restrictions being at least approximately correct. For this reason,
specification testing will be needed, to check that the model seems to be reasonable. Only when we
are convinced that the model is at least approximately correct should we use it for economic analysis.
When testing a hypothesis using an econometric model, at least three factors can cause a statistical
test to reject the null hypothesis:
1. the hypothesis is false
2. a type I error has occurred
3. the econometric model is not correctly specified, and thus the test does not have the assumed
distribution
To be able to make scientific progress, we would like to ensure that the third reason is not contributing
in a major way to rejections, so that rejection will be most likely due to either the first or second
reasons. Hopefully the above example makes it clear that econometric models are necessarily more
detailed than what we can obtain from economic theory, and that this additional detail introduces
many possible sources of misspecification of econometric models. In the next few sections we will
obtain results supposing that the econometric model is entirely correctly specified. Later we will
examine the consequences of misspecification and see some methods for determining if a model is
correctly specified. Later on, econometric methods that seek to minimize maintained assumptions are
introduced.
Chapter 3
Ordinary Least Squares
3.1 The Linear Model
Consider approximating a variable y using the variables $x_1, x_2, ..., x_k$. We can consider a model that is
a linear approximation:
• Linearity: the model is a linear function of the parameter vector $\beta^0$:
\[ y = \beta^0_1 x_1 + \beta^0_2 x_2 + ... + \beta^0_k x_k + \varepsilon \]
or, using vector notation:
\[ y = x'\beta^0 + \varepsilon \]
The dependent variable y is a scalar random variable, $x = (x_1 \; x_2 \; \cdots \; x_k)'$ is a k-vector of explanatory
variables, and $\beta^0 = (\beta^0_1 \; \beta^0_2 \; \cdots \; \beta^0_k)'$. The superscript "0" in $\beta^0$ means this is the "true value"
of the unknown parameter. It will be defined more precisely later, and usually suppressed when it's
not necessary for clarity.
Suppose that we want to use data to try to determine the best linear approximation to y using the
variables x. The data $(y_t, x_t)$, t = 1, 2, ..., n are obtained by some form of sampling[1]. An individual
observation is
\[ y_t = x_t'\beta + \varepsilon_t \]
The n observations can be written in matrix form as
\[ y = X\beta + \varepsilon, \tag{3.1} \]
where $y = (y_1 \; y_2 \; \cdots \; y_n)'$ is $n \times 1$ and $X = (x_1 \; x_2 \; \cdots \; x_n)'$.
[1] For example, cross-sectional data may be obtained by random sampling. Time series data accumulate historically.
Linear models are more general than they might first appear, since one can employ nonlinear
transformations of the variables:
\[ \varphi_0(z) = \left[ \varphi_1(w) \;\; \varphi_2(w) \;\; \cdots \;\; \varphi_p(w) \right] \beta + \varepsilon \]
where the $\varphi_i()$ are known functions. Defining $y = \varphi_0(z)$, $x_1 = \varphi_1(w)$, etc. leads to a model in the form
of equation 3.4. For example, the Cobb-Douglas model
\[ z = A w_2^{\beta_2} w_3^{\beta_3} \exp(\varepsilon) \]
can be transformed logarithmically to obtain
\[ \ln z = \ln A + \beta_2 \ln w_2 + \beta_3 \ln w_3 + \varepsilon. \]
If we define $y = \ln z$, $\beta_1 = \ln A$, etc., we can put the model in the form needed. The approximation is
linear in the parameters, but not necessarily linear in the variables.
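A quick numerical check of the logarithmic transformation may help here. The sketch below is in plain Python (not one of the Octave programs that accompany these notes), and the values of A, $\beta_2$ and $\beta_3$ are made up purely for illustration. It evaluates the Cobb-Douglas form and the log-linear form at a few points and reports the largest difference.

```python
import math

# Hypothetical parameter values, chosen only for illustration.
A, beta2, beta3 = 1.5, 0.6, 0.3

def cobb_douglas(w2, w3, eps=0.0):
    """z = A * w2^beta2 * w3^beta3 * exp(eps)."""
    return A * w2 ** beta2 * w3 ** beta3 * math.exp(eps)

def log_linear(w2, w3, eps=0.0):
    """ln z written so that it is linear in the parameters (beta1 = ln A)."""
    beta1 = math.log(A)
    return beta1 + beta2 * math.log(w2) + beta3 * math.log(w3) + eps

# A few arbitrary (w2, w3, eps) points.
points = [(1.0, 2.0, 0.1), (3.5, 0.7, -0.2), (10.0, 4.0, 0.0)]
max_gap = max(abs(math.log(cobb_douglas(*p)) - log_linear(*p)) for p in points)
print("largest gap between ln z and the linear form:", max_gap)
```

The gap should be zero up to floating point rounding, which is all that "linear in the parameters" requires: the regressors are $\ln w_2$ and $\ln w_3$, not $w_2$ and $w_3$.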
3.2 Estimation by least squares
Figure 3.1, obtained by running TypicalData.m, shows some data that follows the linear model
$y_t = \beta_1 + \beta_2 x_{t2} + \varepsilon_t$. The green line is the "true" regression line $\beta_1 + \beta_2 x_{t2}$, and the red crosses are the data
points $(x_{t2}, y_t)$, where $\varepsilon_t$ is a random error that has mean zero and is independent of $x_{t2}$. Exactly how
the green line is defined will become clear later. In practice, we only have the data, and we don't know
where the green line lies. We need to gain information about the straight line that best fits the data
points.
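The ordinary least squares (OLS) estimator is defined as the value that minimizes the sum of the
squared errors:
\[ \hat{\beta} = \arg\min s(\beta) \]
where
\[
\begin{aligned}
s(\beta) &= \sum_{t=1}^{n} (y_t - x_t'\beta)^2 \\
&= (y - X\beta)'(y - X\beta) \\
&= y'y - 2y'X\beta + \beta'X'X\beta \\
&= \| y - X\beta \|^2
\end{aligned}
\tag{3.2}
\]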
Figure 3.1: Typical data, Classical Model (horizontal axis: X; legend: data, true regression line)
This last expression makes it clear how the OLS estimator is defined: it minimizes the Euclidean distance
between y and $X\beta$. The fitted OLS coefficients are those that give the best linear approximation
to y using x as basis functions, where "best" means minimum Euclidean distance. One could think
of other estimators based upon other metrics. For example, the minimum absolute distance (MAD)
estimator minimizes $\sum_{t=1}^{n} |y_t - x_t'\beta|$. Later, we will see that which estimator is best in terms of their statistical
properties, rather than in terms of the metrics that define them, depends upon the properties of $\varepsilon$,
about which we have as yet made no assumptions.
• To minimize the criterion $s(\beta)$, find the derivative with respect to $\beta$:
\[ D_\beta s(\beta) = -2X'y + 2X'X\beta \]
Then setting it to zeros gives
\[ D_\beta s(\hat{\beta}) = -2X'y + 2X'X\hat{\beta} \equiv 0 \]
so
\[ \hat{\beta} = (X'X)^{-1}X'y. \]
• To verify that this is a minimum, check the second order sufficient condition:
\[ D^2_\beta s(\hat{\beta}) = 2X'X \]
Since $\rho(X) = K$, this matrix is positive definite, since it's a quadratic form in a p.d. matrix
(identity matrix of order n), so $\hat{\beta}$ is in fact a minimizer.
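• The fitted values are the vector $\hat{y} = X\hat{\beta}$.
• The residuals are the vector $\hat{\varepsilon} = y - X\hat{\beta}$
• Note that
\[
\begin{aligned}
y &= X\beta + \varepsilon \\
&= X\hat{\beta} + \hat{\varepsilon}
\end{aligned}
\]
• Also, the first order conditions can be written as
\[
\begin{aligned}
X'y - X'X\hat{\beta} &= 0 \\
X'(y - X\hat{\beta}) &= 0 \\
X'\hat{\varepsilon} &= 0
\end{aligned}
\]
which is to say, the OLS residuals are orthogonal to X. Let's look at this more carefully.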
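The normal equations and the orthogonality of the residuals can be checked by hand on a tiny data set. The following is a minimal sketch in plain Python, assuming a model with a constant and a single regressor so that $X'X$ is 2x2 and can be solved by Cramer's rule. The data are invented for the illustration.

```python
# Illustrative data (made up for this sketch): a constant and one regressor.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(x)

# Elements of X'X and X'y when X = [1, x].
sx = sum(x)
sxx = sum(xi * xi for xi in x)
sy = sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))

# Solve the 2x2 normal equations X'X b = X'y by Cramer's rule.
det = n * sxx - sx * sx
b1 = (sxx * sy - sx * sxy) / det
b2 = (n * sxy - sx * sy) / det

fitted = [b1 + b2 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

# X'e_hat has one entry per column of X: the constant, then x.
orth_const = sum(resid)
orth_x = sum(xi * ei for xi, ei in zip(x, resid))
print("beta_hat:", b1, b2)
print("X'e_hat:", orth_const, orth_x)
```

Both entries of $X'\hat{\varepsilon}$ should be zero up to rounding error, whatever data are used, because this is just a restatement of the first order conditions.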
3.3 Geometric interpretation of least squares estimation
In X, Y Space
Figure 3.2 shows a typical fit to data, along with the true regression line. Note that the true line and
the estimated line are different. This figure was created by running the Octave program OlsFit.m .
You can experiment with changing the parameter values to see how this affects the fit, and to see how
the fitted line will sometimes be close to the true line, and sometimes rather far away.
Figure 3.2: Example OLS Fit (horizontal axis: X; legend: data points, fitted line, true line)
In Observation Space
If we want to plot in observation space, we'll need to use only two or three observations, or we'll
encounter some limitations of the blackboard. If we try to use 3, we'll encounter the limits of my
artistic ability, so let's use two. With only two observations, we can't have K > 1.
Figure 3.3: The fit in observation space (axes: Observation 1, Observation 2; labels: x, y, S(x), x*beta=P_xY, e = M_xY)
• We can decompose y into two components: the orthogonal projection onto the K-dimensional
space spanned by X, $X\hat{\beta}$, and the component that is the orthogonal projection onto the $n - K$ dimensional
subspace that is orthogonal to the span of X, $\hat{\varepsilon}$.
• Since $\hat{\beta}$ is chosen to make $\hat{\varepsilon}$ as short as possible, $\hat{\varepsilon}$ will be orthogonal to the space spanned by
X. Since X is in this space, $X'\hat{\varepsilon} = 0$. Note that the f.o.c. that define the least squares estimator
imply that this is so.
Projection Matrices
$X\hat{\beta}$ is the projection of y onto the span of X, or
\[ X\hat{\beta} = X(X'X)^{-1}X'y \]
Therefore, the matrix that projects y onto the span of X is
\[ P_X = X(X'X)^{-1}X' \]
since
\[ X\hat{\beta} = P_X y. \]
$\hat{\varepsilon}$ is the projection of y onto the $n - K$ dimensional space that is orthogonal to the span of X. We
have that
\[
\begin{aligned}
\hat{\varepsilon} &= y - X\hat{\beta} \\
&= y - X(X'X)^{-1}X'y \\
&= \left[ I_n - X(X'X)^{-1}X' \right] y.
\end{aligned}
\]
So the matrix that projects y onto the space orthogonal to the span of X is
\[
\begin{aligned}
M_X &= I_n - X(X'X)^{-1}X' \\
&= I_n - P_X.
\end{aligned}
\]
We have
\[ \hat{\varepsilon} = M_X y. \]
Therefore
\[
\begin{aligned}
y &= P_X y + M_X y \\
&= X\hat{\beta} + \hat{\varepsilon}.
\end{aligned}
\]
These two projection matrices decompose the n dimensional vector y into two orthogonal components
- the portion that lies in the K dimensional space defined by X, and the portion that lies in the
orthogonal $n - K$ dimensional space.
• Note that both $P_X$ and $M_X$ are symmetric and idempotent.
  - A symmetric matrix A is one such that $A = A'$.
  - An idempotent matrix A is one such that $A = AA$.
  - The only nonsingular idempotent matrix is the identity matrix.
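The properties of $P_X$ and $M_X$ listed above are easy to confirm numerically. Here is a small sketch in plain Python with hand-written matrix helpers, assuming an invented X with n = 4 and K = 2 so that $X'X$ can be inverted with the 2x2 formula. The names `matmul`, `transpose` and `max_abs_diff` are helpers defined here, not library functions.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def max_abs_diff(A, B):
    return max(abs(p - q) for ra, rb in zip(A, B) for p, q in zip(ra, rb))

# Illustrative X with a constant and one regressor (n = 4, K = 2), and some y.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 7.0]]
y = [[1.0], [3.0], [2.0], [6.0]]
n = len(X)

Xt = transpose(X)
(a, b), (c, d) = matmul(Xt, X)           # the 2x2 matrix X'X
det = a * d - b * c
XtX_inv = [[d / det, -b / det], [-c / det, a / det]]

P = matmul(matmul(X, XtX_inv), Xt)        # P_X = X (X'X)^{-1} X'
M = [[(1.0 if i == j else 0.0) - P[i][j] for j in range(n)] for i in range(n)]  # M_X = I - P_X

sym_gap = max_abs_diff(P, transpose(P))   # symmetry of P_X
idem_gap_P = max_abs_diff(matmul(P, P), P)
idem_gap_M = max_abs_diff(matmul(M, M), M)
trace_P = sum(P[i][i] for i in range(n))  # equals K
trace_M = sum(M[i][i] for i in range(n))  # equals n - K

# y = P_X y + M_X y, and the two pieces are orthogonal.
Py = matmul(P, y)
My = matmul(M, y)
inner = sum(p[0] * m[0] for p, m in zip(Py, My))
print(sym_gap, idem_gap_P, idem_gap_M, trace_P, trace_M, inner)
```

The traces are worth noticing: $Tr\,P_X = K$ and $Tr\,M_X = n - K$, which is the dimension of the space each matrix projects onto.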
3.4 Influential observations and outliers
The OLS estimator of the $i^{th}$ element of the vector $\beta^0$ is simply
\[
\begin{aligned}
\hat{\beta}_i &= \left[ (X'X)^{-1}X' \right]_{i\cdot} y \\
&= c_i' y
\end{aligned}
\]
This is how we define a linear estimator - it's a linear function of the dependent variable. Since it's
a linear combination of the observations on the dependent variable, where the weights are determined
by the observations on the regressors, some observations may have more influence than others.
To investigate this, let $e_t$ be an n-vector of zeros with a 1 in the $t^{th}$ position, i.e., it's the
$t^{th}$ column of the matrix $I_n$. Define
\[
\begin{aligned}
h_t &= (P_X)_{tt} \\
&= e_t' P_X e_t
\end{aligned}
\]
so $h_t$ is the $t^{th}$ element on the main diagonal of $P_X$. Note that
\[ h_t = \| P_X e_t \|^2 \]
so
\[ h_t \le \| e_t \|^2 = 1 \]
So $0 < h_t < 1$. Also,
\[ Tr\,P_X = K \Rightarrow \bar{h} = K/n. \]
So the average of the $h_t$ is K/n. The value $h_t$ is referred to as the leverage of the observation. If
the leverage is much higher than average, the observation has the potential to affect the OLS fit
importantly. However, an observation may also be influential due to the value of $y_t$, rather than the
weight it is multiplied by, which only depends on the $x_t$'s.
To account for this, consider estimation of $\beta$ without using the $t^{th}$ observation (designate this
estimator as $\hat{\beta}^{(t)}$). One can show (see Davidson and MacKinnon, pp. 32-5 for proof) that
\[ \hat{\beta}^{(t)} = \hat{\beta} - \left( \frac{1}{1 - h_t} \right) (X'X)^{-1} X_t' \hat{\varepsilon}_t \]
so the change in the $t^{th}$ observation's fitted value is
\[ x_t'\hat{\beta} - x_t'\hat{\beta}^{(t)} = \left( \frac{h_t}{1 - h_t} \right) \hat{\varepsilon}_t \]
While an observation may be influential if it doesn't affect its own fitted value, it certainly is influential
if it does. A fast means of identifying influential observations is to plot $\left( \frac{h_t}{1 - h_t} \right) \hat{\varepsilon}_t$ (which I will refer to
as the own influence of the observation) as a function of t. Figure 3.4 gives an example plot of data,
fit, leverage and influence. The Octave program is InfluentialObservation.m. (note to self when
lecturing: load the data ../OLS/influencedata into Gretl and reproduce this). If you re-run
the program you will see that the leverage of the last observation (an outlying value of x) is always
high, and the influence is sometimes high.
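To make leverage and own influence concrete, here is a small sketch in plain Python. It is not the Octave program referred to above. It assumes a constant plus one regressor and uses invented data whose last observation has an outlying x. It computes $h_t = x_t'(X'X)^{-1}x_t$, the own influence, and then re-estimates without the last observation to compare against the drop-one formula.

```python
def xtx_inverse(xs):
    """(X'X)^{-1} for X = [1, x], as a 2x2 nested list."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(v * v for v in xs)
    det = m * sxx - sx * sx
    return [[sxx / det, -sx / det], [-sx / det, m / det]]

def ols(xs, ys):
    """OLS intercept and slope, beta_hat = (X'X)^{-1} X'y."""
    inv = xtx_inverse(xs)
    sy = sum(ys)
    sxy = sum(a * b for a, b in zip(xs, ys))
    return inv[0][0] * sy + inv[0][1] * sxy, inv[1][0] * sy + inv[1][1] * sxy

# Illustrative data: the last observation has an outlying x value.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 9.0]
n, K = len(x), 2

b1, b2 = ols(x, y)
resid = [yi - b1 - b2 * xi for xi, yi in zip(x, y)]

# Leverage h_t = x_t'(X'X)^{-1} x_t with x_t = (1, x_t)'.
inv = xtx_inverse(x)
h = [inv[0][0] + 2 * inv[0][1] * xi + inv[1][1] * xi * xi for xi in x]
own_influence = [ht / (1 - ht) * et for ht, et in zip(h, resid)]

# Check the drop-one formula for the last observation by brute force.
t = n - 1
b1_t, b2_t = ols(x[:t], y[:t])
change = (b1 + b2 * x[t]) - (b1_t + b2_t * x[t])
print("leverage:", h)
print("own influence:", own_influence)
print("change in last fitted value:", change)
```

The leverages should sum to K, the outlying observation should have by far the largest $h_t$, and the brute-force change in its fitted value should agree with $\frac{h_t}{1 - h_t}\hat{\varepsilon}_t$.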
After influential observations are detected, one needs to determine why they are influential. Possible
causes include:
• data entry error, which can easily be corrected once detected. Data entry errors are very common.
• special economic factors that affect some observations. These would need to be identified and
incorporated in the model. This is the idea behind structural change: the parameters may not
be constant across all observations.
• pure randomness may have caused us to sample a low-probability observation.
There exist robust estimation methods that downweight outliers.
Figure 3.4: Detection of influential observations (legend: Data points, fitted, Leverage, Influence)
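3.5 Goodness of fit
The fitted model is
\[ y = X\hat{\beta} + \hat{\varepsilon} \]
Take the inner product:
\[ y'y = \hat{\beta}'X'X\hat{\beta} + 2\hat{\beta}'X'\hat{\varepsilon} + \hat{\varepsilon}'\hat{\varepsilon} \]
But the middle term of the RHS is zero since $X'\hat{\varepsilon} = 0$, so
\[ y'y = \hat{\beta}'X'X\hat{\beta} + \hat{\varepsilon}'\hat{\varepsilon} \tag{3.3} \]
The uncentered $R^2_u$ is defined as
\[
\begin{aligned}
R^2_u &= 1 - \frac{\hat{\varepsilon}'\hat{\varepsilon}}{y'y} \\
&= \frac{\hat{\beta}'X'X\hat{\beta}}{y'y} \\
&= \frac{\| P_X y \|^2}{\| y \|^2} \\
&= \cos^2(\phi),
\end{aligned}
\]
where $\phi$ is the angle between y and the span of X.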
• The uncentered $R^2$ changes if we add a constant to y, since this changes $\phi$ (see Figure 3.5, the
yellow vector is a constant, since it's on the 45 degree line in observation space). Another, more
common definition measures the contribution of the variables, other than the constant term, to
explaining the variation in y. Thus it measures the ability of the model to explain the variation
of y about its unconditional sample mean.
Let $\iota = (1, 1, ..., 1)'$, an n-vector. So
\[
\begin{aligned}
M_\iota &= I_n - \iota(\iota'\iota)^{-1}\iota' \\
&= I_n - \iota\iota'/n
\end{aligned}
\]
$M_\iota y$ just returns the vector of deviations from the mean. In terms of deviations from the mean,
equation 3.3 becomes
\[ y'M_\iota y = \hat{\beta}'X'M_\iota X\hat{\beta} + \hat{\varepsilon}'M_\iota\hat{\varepsilon} \]
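Figure 3.5: Uncentered $R^2$
The centered $R^2_c$ is defined as
\[ R^2_c = 1 - \frac{\hat{\varepsilon}'\hat{\varepsilon}}{y'M_\iota y} = 1 - \frac{ESS}{TSS} \]
where $ESS = \hat{\varepsilon}'\hat{\varepsilon}$ and $TSS = y'M_\iota y = \sum_{t=1}^{n} (y_t - \bar{y})^2$.
Supposing that X contains a column of ones (i.e., there is a constant term),
\[ X'\hat{\varepsilon} = 0 \Rightarrow \sum_t \hat{\varepsilon}_t = 0 \]
so $M_\iota\hat{\varepsilon} = \hat{\varepsilon}$. In this case
\[ y'M_\iota y = \hat{\beta}'X'M_\iota X\hat{\beta} + \hat{\varepsilon}'\hat{\varepsilon} \]
So
\[ R^2_c = \frac{RSS}{TSS} \]
where $RSS = \hat{\beta}'X'M_\iota X\hat{\beta}$
• Supposing that a column of ones is in the space spanned by X ($P_X\iota = \iota$), then one can show
that $0 \le R^2_c \le 1$.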
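The difference between the two $R^2$ measures shows up clearly if a constant is added to y. The sketch below, in plain Python with invented data and a constant-plus-one-regressor model, computes both measures before and after shifting y by 100 (an arbitrary amount chosen for the illustration).

```python
def ols(xs, ys):
    """Intercept and slope from the normal equations for X = [1, x]."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(v * v for v in xs)
    sxy = sum(a * b for a, b in zip(xs, ys))
    det = m * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (m * sxy - sx * sy) / det

def r_squared(xs, ys):
    """Return (uncentered R^2, centered R^2) using the definitions in the text."""
    b1, b2 = ols(xs, ys)
    ess = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(xs, ys))  # e_hat'e_hat
    ybar = sum(ys) / len(ys)
    r2_u = 1 - ess / sum(v * v for v in ys)             # denominator y'y
    r2_c = 1 - ess / sum((v - ybar) ** 2 for v in ys)   # denominator y'M_iota y
    return r2_u, r2_c

# Illustrative data, made up for this sketch.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.5, 3.1, 6.9, 7.2, 10.8, 11.0]

r2_u, r2_c = r_squared(x, y)
y_shift = [v + 100.0 for v in y]
r2_u_shift, r2_c_shift = r_squared(x, y_shift)
print("uncentered:", r2_u, "->", r2_u_shift)
print("centered:  ", r2_c, "->", r2_c_shift)
```

Because the regression includes a constant, the residuals are unchanged by the shift, so the centered measure stays put, while the uncentered measure moves toward 1 simply because $y'y$ has become larger.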
3.6 The classical linear regression model
Up to this point the model is empty of content beyond the definition of a best linear approximation
to y and some geometrical properties. There is no economic content to the model, and the regression
parameters have no economic interpretation. For example, what is the partial derivative of y with
respect to $x_j$? The linear approximation is
\[ y = \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + \varepsilon \]
The partial derivative is
\[ \frac{\partial y}{\partial x_j} = \beta_j + \frac{\partial \varepsilon}{\partial x_j} \]
Up to now, there's no guarantee that $\frac{\partial \varepsilon}{\partial x_j} = 0$. For the $\beta$ to have an economic meaning, we need to
make additional assumptions. The assumptions that are appropriate to make depend on the data
under consideration. We'll start with the classical linear regression model, which incorporates some
assumptions that are clearly not realistic for economic data. This is to be able to explain some concepts
with a minimum of confusion and notational clutter. Later we'll adapt the results to what we can get
with more realistic assumptions.
Linearity: the model is a linear function of the parameter vector $\beta^0$:
\[ y = \beta^0_1 x_1 + \beta^0_2 x_2 + ... + \beta^0_k x_k + \varepsilon \tag{3.4} \]
or, using vector notation:
\[ y = x'\beta^0 + \varepsilon \]
Nonstochastic linearly independent regressors: X is a fixed matrix of constants, it has rank
K equal to its number of columns, and
\[ \lim \frac{1}{n} X'X = Q_X \tag{3.5} \]
where $Q_X$ is a finite positive definite matrix. This is needed to be able to identify the individual effects
of the explanatory variables.
Independently and identically distributed errors:
\[ \varepsilon \sim IID(0, \sigma^2 I_n) \tag{3.6} \]
$\varepsilon$ is jointly distributed IID. This implies the following two properties:
Homoscedastic errors:
\[ V(\varepsilon_t) = \sigma_0^2, \; \forall t \tag{3.7} \]
Nonautocorrelated errors:
\[ E(\varepsilon_t \varepsilon_s) = 0, \; \forall t \ne s \tag{3.8} \]
Optionally, we will sometimes assume that the errors are normally distributed.
Normally distributed errors:
\[ \varepsilon \sim N(0, \sigma^2 I_n) \tag{3.9} \]
3.7 Small sample statistical properties of the least squares estimator
Up to now, we have only examined numeric properties of the OLS estimator, that always hold. Now
we will examine statistical properties. The statistical properties depend upon the assumptions we
make.
Unbiasedness
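We have $\hat{\beta} = (X'X)^{-1}X'y$. By linearity,
\[
\begin{aligned}
\hat{\beta} &= (X'X)^{-1}X'(X\beta + \varepsilon) \\
&= \beta + (X'X)^{-1}X'\varepsilon
\end{aligned}
\]
By 3.5 and 3.6
\[
\begin{aligned}
E\left[ (X'X)^{-1}X'\varepsilon \right] &= (X'X)^{-1}X'E(\varepsilon) \\
&= 0
\end{aligned}
\]
so the OLS estimator is unbiased under the assumptions of the classical model.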
Figure 3.6 shows the results of a small Monte Carlo experiment where the OLS estimator was
calculated for 10000 samples from the classical model with $y = 1 + 2x + \varepsilon$, where n = 20, $\sigma^2_\varepsilon = 9$, and
x is fixed across samples. We can see that $\beta_2$ appears to be estimated without bias. The program
that generates the plot is Unbiased.m , if you would like to experiment with this.
With time series data, the OLS estimator will often be biased. Figure 3.7 shows the results of
a small Monte Carlo experiment where the OLS estimator was calculated for 1000 samples from the
AR(1) model with $y_t = 0 + 0.9 y_{t-1} + \varepsilon_t$, where n = 20 and $\sigma^2_\varepsilon = 1$. In this case, assumption 3.5 does
not hold: the regressors are stochastic. We can see that the bias in the estimation of $\beta_2$ is about -0.2.
The program that generates the plot is Biased.m , if you would like to experiment with this.
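For readers without Octave at hand, the two experiments can be imitated in a few lines of plain Python. This is only a sketch in the spirit of Unbiased.m and Biased.m, not a copy of them: the design x = 1, ..., 20, the starting value $y_0 = 0$, the seed and the number of replications are all assumptions made here.

```python
import random

def ols(xs, ys):
    """Intercept and slope from the normal equations for X = [1, x]."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(v * v for v in xs)
    sxy = sum(a * b for a, b in zip(xs, ys))
    det = m * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (m * sxy - sx * sy) / det

rng = random.Random(12345)   # fixed seed so the run is reproducible
n, reps = 20, 2000

# Experiment 1: classical model y = 1 + 2x + e, sigma^2 = 9, x fixed across samples.
x = [float(t) for t in range(1, n + 1)]   # assumed design
slopes = []
for _ in range(reps):
    y = [1.0 + 2.0 * xi + rng.gauss(0.0, 3.0) for xi in x]
    slopes.append(ols(x, y)[1])
mean_slope = sum(slopes) / reps

# Experiment 2: AR(1) model y_t = 0.9 y_{t-1} + e_t, sigma^2 = 1.
# Regress y_t on a constant and y_{t-1}: the regressor is now stochastic.
ar_slopes = []
for _ in range(reps):
    y = [0.0]                              # assumed starting value
    for _ in range(n):
        y.append(0.9 * y[-1] + rng.gauss(0.0, 1.0))
    ar_slopes.append(ols(y[:-1], y[1:])[1])
mean_ar_slope = sum(ar_slopes) / reps

print("classical model, mean slope estimate (true value 2):", mean_slope)
print("AR(1) model, mean slope estimate (true value 0.9):", mean_ar_slope)
```

The first average should come out very close to 2, while the second should sit noticeably below 0.9, in line with the bias visible in Figure 3.7.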
Figure 3.6: Unbiasedness of OLS under classical assumptions
Figure 3.7: Biasedness of OLS when an assumption fails
Normality
With the linearity assumption, we have $\hat{\beta} = \beta + (X'X)^{-1}X'\varepsilon$. This is a linear function of $\varepsilon$. Adding
the assumption of normality (3.9, which implies strong exogeneity), then
\[ \hat{\beta} \sim N\left( \beta, (X'X)^{-1}\sigma_0^2 \right) \]
since a linear function of a normal random vector is also normally distributed. In Figure 3.6 you can
see that the estimator appears to be normally distributed. It in fact is normally distributed, since
the DGP (see the Octave program) has normal errors. Even when the data may be taken to be IID,
the assumption of normality is often questionable or simply untenable. For example, if the dependent
variable is the number of automobile trips per week, it is a count variable with a discrete distribution,
and is thus not normally distributed. Many variables in economics can take on only nonnegative
values, which, strictly speaking, rules out normality.[2]
[2] Normality may be a good model nonetheless, as long as the probability of a negative value occurring is negligible under the model. This
depends upon the mean being large enough in relation to the variance.
The variance of the OLS estimator and the Gauss-Markov theorem
Now let's make all the classical assumptions except the assumption of normality. We have
$\hat{\beta} = \beta + (X'X)^{-1}X'\varepsilon$ and we know that $E(\hat{\beta}) = \beta$. So
\[
\begin{aligned}
Var(\hat{\beta}) &= E\left[ (\hat{\beta} - \beta)(\hat{\beta} - \beta)' \right] \\
&= E\left[ (X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1} \right] \\
&= (X'X)^{-1}\sigma_0^2
\end{aligned}
\]
The OLS estimator is a linear estimator, which means that it is a linear function of the dependent
variable, y.
\[
\begin{aligned}
\hat{\beta} &= \left[ (X'X)^{-1}X' \right] y \\
&= Cy
\end{aligned}
\]
where C is a function of the explanatory variables only, not the dependent variable. It is also unbiased
under the present assumptions, as we proved above. One could consider other weights W that are a
function of X that define some other linear estimator. We'll still insist upon unbiasedness. Consider
$\tilde{\beta} = Wy$, where $W = W(X)$ is some $k \times n$ matrix function of X. Note that since W is a function of
X, it is nonstochastic, too. If the estimator is unbiased, then we must have $WX = I_K$:
\[
\begin{aligned}
E(Wy) &= E(WX\beta^0 + W\varepsilon) \\
&= WX\beta^0 \\
&= \beta^0 \\
&\Rightarrow WX = I_K
\end{aligned}
\]
The variance of $\tilde{\beta}$ is
\[ V(\tilde{\beta}) = WW'\sigma_0^2. \]
Define
\[ D = W - (X'X)^{-1}X' \]
so
\[ W = D + (X'X)^{-1}X' \]
Since $WX = I_K$, $DX = 0$, so
\[
\begin{aligned}
V(\tilde{\beta}) &= \left[ D + (X'X)^{-1}X' \right] \left[ D + (X'X)^{-1}X' \right]' \sigma_0^2 \\
&= \left[ DD' + (X'X)^{-1} \right] \sigma_0^2
\end{aligned}
\]
So
\[ V(\tilde{\beta}) \ge V(\hat{\beta}) \]
The inequality is a shorthand means of expressing, more formally, that $V(\tilde{\beta}) - V(\hat{\beta})$ is a positive
semi-definite matrix. This is a proof of the Gauss-Markov Theorem. The OLS estimator is the "best
linear unbiased estimator" (BLUE).
• It is worth emphasizing again that we have not used the normality assumption in any way to
prove the Gauss-Markov theorem, so it is valid if the errors are not normally distributed, as long
as the other assumptions hold.
To illustrate the Gauss-Markov result, consider the estimator that results from splitting the sample
into p equally-sized parts, estimating using each part of the data separately by OLS, then averaging
the p resulting estimators. You should be able to show that this estimator is unbiased, but inefficient
with respect to the OLS estimator. The program Efficiency.m illustrates this using a small Monte
Carlo experiment, which compares the OLS estimator and a 3-way split sample estimator. The data
generating process follows the classical model, with n = 21. The true parameter value is $\beta = 2$. In
Figures 3.8 and 3.9 we can see that the OLS estimator is more efficient, since the tails of its histogram
are more narrow.
Figure 3.8: Gauss-Markov Result: The OLS estimator
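The comparison can be imitated with a short Monte Carlo in plain Python. This is a sketch under my own assumptions and not the Efficiency.m program: the design is x = 1, ..., 21, the sample is split into three consecutive blocks of 7, the intercept is 1, the error variance is 1, and the slope (the parameter of interest) is 2.

```python
import random

def ols(xs, ys):
    """Intercept and slope from the normal equations for X = [1, x]."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(v * v for v in xs)
    sxy = sum(a * b for a, b in zip(xs, ys))
    det = m * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (m * sxy - sx * sy) / det

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

rng = random.Random(2024)          # fixed seed for reproducibility
n, parts, reps = 21, 3, 2000
size = n // parts
x = [float(t) for t in range(1, n + 1)]   # assumed design, fixed across samples

ols_est, split_est = [], []
for _ in range(reps):
    y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
    ols_est.append(ols(x, y)[1])
    # OLS slope on each block of 7 observations, then average the three slopes.
    pieces = [ols(x[i:i + size], y[i:i + size])[1] for i in range(0, n, size)]
    split_est.append(mean(pieces))

print("mean of OLS slopes:  ", mean(ols_est), " variance:", variance(ols_est))
print("mean of split slopes:", mean(split_est), " variance:", variance(split_est))
```

Both averages should be close to 2, which is the unbiasedness part, while the split sample estimator should show a clearly larger variance. With this particular design the gap is large because each block contains much less variation in x than the full sample does.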
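Figure 3.9: Gauss-Markov Result: The split sample estimator
We have that $E(\hat{\beta}) = \beta$ and $Var(\hat{\beta}) = (X'X)^{-1}\sigma_0^2$, but we still need to estimate the variance of $\varepsilon$,
$\sigma_0^2$, in order to have an idea of the precision of the estimates of $\beta$. A commonly used estimator of $\sigma_0^2$ is
\[ \hat{\sigma}_0^2 = \frac{1}{n - K} \hat{\varepsilon}'\hat{\varepsilon} \]
This estimator is unbiased:
\[
\begin{aligned}
\hat{\sigma}_0^2 &= \frac{1}{n - K} \hat{\varepsilon}'\hat{\varepsilon} \\
&= \frac{1}{n - K} \varepsilon'M\varepsilon \\
E(\hat{\sigma}_0^2) &= \frac{1}{n - K} E(Tr\,\varepsilon'M\varepsilon) \\
&= \frac{1}{n - K} E(Tr\,M\varepsilon\varepsilon') \\
&= \frac{1}{n - K} Tr\,E(M\varepsilon\varepsilon') \\
&= \frac{1}{n - K} \sigma_0^2\, Tr\,M \\
&= \frac{1}{n - K} \sigma_0^2 (n - K) \\
&= \sigma_0^2
\end{aligned}
\]
where we use the fact that $Tr(AB) = Tr(BA)$ when both products are conformable. Thus, this
estimator is also unbiased under these assumptions.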
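The role of the divisor $n - K$ can be seen in a small simulation. The sketch below is plain Python under assumptions of my own (design x = 1, ..., 20, true model y = 1 + 2x + e with $\sigma_0^2 = 9$, 2000 replications). It compares the average of $\hat{\varepsilon}'\hat{\varepsilon}/(n - K)$ with the average of $\hat{\varepsilon}'\hat{\varepsilon}/n$.

```python
import random

def ols(xs, ys):
    """Intercept and slope from the normal equations for X = [1, x]."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(v * v for v in xs)
    sxy = sum(a * b for a, b in zip(xs, ys))
    det = m * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (m * sxy - sx * sy) / det

rng = random.Random(7)             # fixed seed for reproducibility
n, K, sigma2, reps = 20, 2, 9.0, 2000
x = [float(t) for t in range(1, n + 1)]   # assumed design

s2_unbiased, s2_naive = [], []
for _ in range(reps):
    y = [1.0 + 2.0 * xi + rng.gauss(0.0, sigma2 ** 0.5) for xi in x]
    b1, b2 = ols(x, y)
    ssr = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))  # e_hat'e_hat
    s2_unbiased.append(ssr / (n - K))
    s2_naive.append(ssr / n)

mean_unbiased = sum(s2_unbiased) / reps
mean_naive = sum(s2_naive) / reps
print("average of e'e/(n-K):", mean_unbiased, "(true sigma^2 = 9)")
print("average of e'e/n:    ", mean_naive)
```

The first average should be close to 9. The second is smaller by the factor $(n - K)/n$, which is the downward bias that dividing by $n - K$ removes.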
3.8 Example: The Nerlove model
Theoretical background
For a firm that takes input prices w and the output level q as given, the cost minimization problem is
to choose the quantities of inputs x to solve the problem
\[ \min_x w'x \]
subject to the restriction
\[ f(x) = q. \]
The solution is the vector of factor demands x(w, q). The cost function is obtained by substituting
the factor demands into the criterion function:
\[ C(w, q) = w'x(w, q). \]
• Monotonicity Increasing factor prices cannot decrease cost, so
\[ \frac{\partial C(w, q)}{\partial w} \ge 0 \]
Remember that these derivatives give the conditional factor demands (Shephard's Lemma).
• Homogeneity The cost function is homogeneous of degree 1 in input prices: $C(tw, q) = tC(w, q)$
where t is a scalar constant. This is because the factor demands are homogeneous of degree zero
in factor prices - they only depend upon relative prices.
• Returns to scale The returns to scale parameter $\gamma$ is defined as the inverse of the elasticity of
cost with respect to output:
\[ \gamma = \left( \frac{\partial C(w, q)}{\partial q} \frac{q}{C(w, q)} \right)^{-1} \]
Constant returns to scale is the case where increasing production q implies that cost increases
in the proportion 1:1. If this is the case, then $\gamma = 1$.
Cobb-Douglas functional form
The Cobb-Douglas functional form is linear in the logarithms of the regressors and the dependent
variable. For a cost function, if there are g factors, the Cobb-Douglas cost function has the form
\[ C = A w_1^{\beta_1} ... w_g^{\beta_g} q^{\beta_q} e^{\varepsilon} \]
What is the elasticity of C with respect to $w_j$?
\[
\begin{aligned}
e^C_{w_j} &= \left( \frac{\partial C}{\partial w_j} \right) \left( \frac{w_j}{C} \right) \\
&= \beta_j A w_1^{\beta_1} ... w_j^{\beta_j - 1} ... w_g^{\beta_g} q^{\beta_q} e^{\varepsilon} \, \frac{w_j}{A w_1^{\beta_1} ... w_g^{\beta_g} q^{\beta_q} e^{\varepsilon}} \\
&= \beta_j
\end{aligned}
\]
This is one of the reasons the Cobb-Douglas form is popular - the coefficients are easy to interpret,
since they are the elasticities of the dependent variable with respect to the explanatory variable. Note
that in this case,
\[
\begin{aligned}
e^C_{w_j} &= \left( \frac{\partial C}{\partial w_j} \right) \left( \frac{w_j}{C} \right) \\
&= x_j(w, q) \frac{w_j}{C} \\
&\equiv s_j(w, q)
\end{aligned}
\]
the cost share of the $j^{th}$ input. So with a Cobb-Douglas cost function, $\beta_j = s_j(w, q)$. The cost shares
are constants.
Note that after a logarithmic transformation we obtain
\[ \ln C = \alpha + \beta_1 \ln w_1 + ... + \beta_g \ln w_g + \beta_q \ln q + \varepsilon \]
where $\alpha = \ln A$. So we see that the transformed model is linear in the logs of the data.
One can verify that the property of HOD1 implies that
\[ \sum_{i=1}^{g} \beta_i = 1 \]
In other words, the cost shares add up to 1.
The hypothesis that the technology exhibits CRTS implies that
\[ \gamma = \frac{1}{\beta_q} = 1 \]
so $\beta_q = 1$. Likewise, monotonicity implies that the coefficients $\beta_i \ge 0$, i = 1, ..., g.
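These properties of the Cobb-Douglas cost function can be verified numerically. The following plain Python sketch uses hypothetical parameter values (two inputs with exponents 0.3 and 0.7, and $\beta_q = 0.8$) and approximates the elasticities by central differences in logs. None of the numbers come from the Nerlove data.

```python
import math

# Hypothetical parameters: two inputs whose exponents sum to 1, and beta_q = 0.8.
A, beta, beta_q = 2.0, [0.3, 0.7], 0.8

def cost(w, q, eps=0.0):
    """Cobb-Douglas cost C = A * prod_j w_j^beta_j * q^beta_q * exp(eps)."""
    c = A * q ** beta_q * math.exp(eps)
    for wj, bj in zip(w, beta):
        c *= wj ** bj
    return c

def price_elasticity(w, q, j, step=1e-6):
    """Numerical elasticity of cost with respect to w_j (central difference in logs)."""
    up = list(w)
    dn = list(w)
    up[j] *= math.exp(step)
    dn[j] *= math.exp(-step)
    return (math.log(cost(up, q)) - math.log(cost(dn, q))) / (2 * step)

def output_elasticity(w, q, step=1e-6):
    """Numerical elasticity of cost with respect to output q."""
    hi = math.log(cost(w, q * math.exp(step)))
    lo = math.log(cost(w, q * math.exp(-step)))
    return (hi - lo) / (2 * step)

w, q = [4.0, 9.0], 50.0
elasticities = [price_elasticity(w, q, j) for j in range(len(w))]

# Homogeneity of degree 1 in prices: C(tw, q) = t C(w, q) when the betas sum to 1.
t = 3.0
hod1_gap = abs(cost([t * wj for wj in w], q) - t * cost(w, q))

# Returns to scale: gamma = 1 / (elasticity of cost with respect to q) = 1 / beta_q.
rts = 1.0 / output_elasticity(w, q)

print("price elasticities:", elasticities)
print("HOD1 gap:", hod1_gap)
print("returns to scale:", rts)
```

The price elasticities should reproduce the exponents 0.3 and 0.7, the homogeneity gap should be zero up to rounding, and the returns to scale parameter should equal $1/\beta_q = 1.25$ for these made-up values.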
The Nerlove data and OLS
The file nerlove.data contains data on 145 electric utility companies' cost of production, output and
input prices. The data are for the U.S., and were collected by M. Nerlove. The observations are by
row, and the columns are COMPANY, COST (C), OUTPUT (Q), PRICE OF LABOR ($P_L$),
PRICE OF FUEL ($P_F$) and PRICE OF CAPITAL ($P_K$). Note that the data are sorted by output
level (the third column).
We will estimate the Cobb-Douglas model
\[ \ln C = \beta_1 + \beta_2 \ln Q + \beta_3 \ln P_L + \beta_4 \ln P_F + \beta_5 \ln P_K + \varepsilon \tag{3.10} \]
using OLS. To do this yourself, you need the data file mentioned above, as well as Nerlove.m (the
estimation program), and the library of Octave functions mentioned in the introduction to Octave
that forms section 24 of this document.[3]
The results are
*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943
Results (Ordinary var-cov estimator)
estimate st.err. t-stat. p-value
constant -3.527 1.774 -1.987 0.049
output 0.720 0.017 41.244 0.000
labor 0.436 0.291 1.499 0.136
fuel 0.427 0.100 4.249 0.000
capital -0.220 0.339 -0.648 0.518
*********************************************************
[3] If you are running the bootable CD, you have all of this installed and ready to run.
• Do the theoretical restrictions hold?
• Does the model fit well?
• What do you think about RTS?
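One rough way to start on these questions is to do a little arithmetic with the printed estimates. The plain Python sketch below uses only the estimates and standard errors copied from the OLS output above, so everything is subject to the rounding in that table. It is a back-of-the-envelope calculation, not a formal test of the restrictions.

```python
# (estimate, standard error) pairs copied from the OLS output above.
estimates = {
    "constant": (-3.527, 1.774),
    "output": (0.720, 0.017),
    "labor": (0.436, 0.291),
    "fuel": (0.427, 0.100),
    "capital": (-0.220, 0.339),
}

# Returns to scale: gamma = 1 / beta_q.
rts = 1.0 / estimates["output"][0]

# t-ratio for H0: beta_q = 1 (constant returns to scale).
t_crts = (estimates["output"][0] - 1.0) / estimates["output"][1]

# Sum of the input price coefficients; HOD1 would require this to equal 1.
price_sum = sum(estimates[name][0] for name in ("labor", "fuel", "capital"))

# Monotonicity requires nonnegative price coefficients.
negative_prices = [name for name in ("labor", "fuel", "capital") if estimates[name][0] < 0]

print("estimated returns to scale:", round(rts, 3))
print("t-ratio for beta_q = 1:", round(t_crts, 2))
print("sum of price coefficients:", round(price_sum, 3))
print("price coefficients with the wrong sign:", negative_prices)
```

Judging whether the sum of the price coefficients differs significantly from 1 needs the covariances between the estimates, which are not printed in the table, so the sum by itself is only suggestive.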
While we will most often use Octave programs as examples in this document, since following the
programming statements is a useful way of learning how theory is put into practice, you may be
interested in a more "user-friendly" environment for doing econometrics. I heartily recommend Gretl,
the Gnu Regression, Econometrics, and Time-Series Library. This is an easy to use program, available
in English, French, and Spanish, and it comes with a lot of data ready to use. It even has an option
to save output as LaTeX fragments, so that I can just include the results into this document, no muss,
no fuss. Here is the Nerlove data in the form of a GRETL data set: nerlove.gdt . Here are the results of
the Nerlove model from GRETL:
Model 2: OLS estimates using the 145 observations 1-145
Dependent variable: l_cost
Variable    Coefficient   Std. Error   t-statistic   p-value
const       -3.5265       1.77437      -1.9875       0.0488
l_output     0.720394     0.0174664    41.2445       0.0000
l_labor      0.436341     0.291048      1.4992       0.1361
l_fuel       0.426517     0.100369      4.2495       0.0000
l_capita    -0.219888     0.339429     -0.6478       0.5182
Mean of dependent variable                1.72466
S.D. of dependent variable                1.42172
Sum of squared residuals                  21.5520
Standard error of residuals ($\hat{\sigma}$)   0.392356
Unadjusted $R^2$                          0.925955
Adjusted $\bar{R}^2$                      0.923840
F(4, 140)                                 437.686
Akaike information criterion              145.084
Schwarz Bayesian criterion                159.967
Fortunately, Gretl and my OLS program agree upon the results. Gretl is included in the bootable
CD mentioned in the introduction. I recommend using GRETL to repeat the examples that are done
using Octave.
The previous properties hold for finite sample sizes. Before considering the asymptotic properties
of the OLS estimator it is useful to review the MLE estimator, since under the assumption of normal
errors the two estimators coincide.
3.9 Exercises
1. Prove that the split sample estimator used to generate figure 3.9 is unbiased.
2. Calculate the OLS estimates of the Nerlove model using Octave and GRETL, and provide printouts
of the results. Interpret the results.
3. Do an analysis of whether or not there are influential observations for OLS estimation of the
Nerlove model. Discuss.
4. Using GRETL, examine the residuals after OLS estimation and tell me whether or not you
believe that the assumption of independent identically distributed normal errors is warranted.
No need to do formal tests, just look at the plots. Print out any that you think are relevant, and
interpret them.
5. For a random vector $X \sim N(\mu_x, \Sigma)$, what is the distribution of $AX + b$, where A and b are
conformable matrices of constants?
6. Using Octave, write a little program that verifies that $Tr(AB) = Tr(BA)$ for A and B 4x4
matrices of random numbers. Note: there is an Octave function trace.
7. For the model with a constant and a single regressor, $y_t = \beta_1 + \beta_2 x_t + \varepsilon_t$, which satisfies the
classical assumptions, prove that the variance of the OLS estimator declines to zero as the sample
size increases.
Chapter 4
Asymptotic properties of the least squares estimator
The OLS estimator under the classical assumptions is BLUE[1], for all sample sizes. Now let's see what
happens when the sample size tends to infinity.
[1] BLUE $\equiv$ best linear unbiased estimator if I haven't defined it before
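4.1 Consistency
\[
\begin{aligned}
\hat{\beta} &= (X'X)^{-1}X'y \\
&= (X'X)^{-1}X'(X\beta^0 + \varepsilon) \\
&= \beta^0 + (X'X)^{-1}X'\varepsilon \\
&= \beta^0 + \left( \frac{X'X}{n} \right)^{-1} \frac{X'\varepsilon}{n}
\end{aligned}
\]
Consider the last two terms. By assumption $\lim_{n \to \infty} \left( \frac{X'X}{n} \right) = Q_X \Rightarrow \lim_{n \to \infty} \left( \frac{X'X}{n} \right)^{-1} = Q_X^{-1}$, since the
inverse of a nonsingular matrix is a continuous function of the elements of the matrix. Considering
$\frac{X'\varepsilon}{n}$,
\[ \frac{X'\varepsilon}{n} = \frac{1}{n} \sum_{t=1}^{n} x_t \varepsilon_t \]
Each $x_t \varepsilon_t$ has expectation zero, so
\[ E\left( \frac{X'\varepsilon}{n} \right) = 0 \]
The variance of each term is
\[ V(x_t \varepsilon_t) = x_t x_t' \sigma^2. \]
As long as these are finite, and given a technical condition[2], the Kolmogorov SLLN applies, so
\[ \frac{1}{n} \sum_{t=1}^{n} x_t \varepsilon_t \overset{a.s.}{\to} 0. \]
This implies that
\[ \hat{\beta} \overset{a.s.}{\to} \beta^0. \]
This is the property of strong consistency: the estimator converges almost surely to the true value.
• The consistency proof does not use the normality assumption.
• Remember that almost sure convergence implies convergence in probability.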
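Consistency can be illustrated by watching the sampling variability of the estimator shrink as n grows. The plain Python sketch below makes its own assumptions: a bounded regressor that cycles through 0, ..., 9 so that $X'X/n$ settles down, uniform (hence non-normal) errors, true slope 2, and 300 replications at each sample size.

```python
import random

def ols(xs, ys):
    """Intercept and slope from the normal equations for X = [1, x]."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(v * v for v in xs)
    sxy = sum(a * b for a, b in zip(xs, ys))
    det = m * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (m * sxy - sx * sy) / det

rng = random.Random(99)            # fixed seed for reproducibility

def slope_spread(n, reps=300):
    """Monte Carlo variance of the OLS slope estimate for sample size n."""
    x = [float(t % 10) for t in range(n)]   # assumed design: bounded, cycles 0..9
    est = []
    for _ in range(reps):
        # Uniform errors: mean zero, finite variance, not normal.
        y = [1.0 + 2.0 * xi + rng.uniform(-3.0, 3.0) for xi in x]
        est.append(ols(x, y)[1])
    m = sum(est) / reps
    return sum((b - m) ** 2 for b in est) / (reps - 1)

spreads = {n: slope_spread(n) for n in (20, 200, 2000)}
for n, v in spreads.items():
    print("n =", n, " Monte Carlo variance of slope estimate:", v)
```

Each tenfold increase in n should cut the variance by roughly a factor of ten, which is the behaviour the limit $\hat{\beta} \to \beta^0$ leads one to expect, and normality of the errors plays no part in it.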
4.2 Asymptotic normality
We've seen that the OLS estimator is normally distributed under the assumption of normal errors. If the error distribution is unknown, we of course don't know the distribution of the estimator. However, we can get asymptotic results. Assuming the distribution of $\varepsilon$ is unknown, but the other classical assumptions hold:
\[
\begin{aligned}
\hat{\beta} &= \beta_0 + (X'X)^{-1}X'\varepsilon \\
\hat{\beta} - \beta_0 &= (X'X)^{-1}X'\varepsilon \\
\sqrt{n}\left(\hat{\beta} - \beta_0\right) &= \left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{\sqrt{n}}
\end{aligned}
\]
$^2$ For application of LLNs and CLTs, of which there are very many to choose from, I'm going to avoid the technicalities. Basically, as long as terms that make up an average have finite variances and are not too strongly dependent, one will be able to find a LLN or CLT to apply. Which one it is doesn't matter, we only need the result.
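• Now as before, $\left(\frac{X'X}{n}\right)^{-1} \to Q_X^{-1}$.
• Considering $\frac{X'\varepsilon}{\sqrt{n}}$, the limit of the variance is
\[
\lim_{n\to\infty} V\left(\frac{X'\varepsilon}{\sqrt{n}}\right) = \lim_{n\to\infty} E\left(\frac{X'\varepsilon\varepsilon'X}{n}\right) = \sigma_0^2 Q_X
\]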
The mean is of course zero. To get asymptotic normality, we need to apply a CLT. We assume
one (for instance, the Lindeberg-Feller CLT) holds, so
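\[
\frac{X'\varepsilon}{\sqrt{n}} \overset{d}{\to} N\left(0, \sigma_0^2 Q_X\right)
\]
Therefore,
\[
\sqrt{n}\left(\hat{\beta} - \beta_0\right) \overset{d}{\to} N\left(0, \sigma_0^2 Q_X^{-1}\right) \tag{4.1}
\]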
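The step from the CLT to (4.1) can be spelled out as follows. This is a sketch that uses only the rule $Var(Ax) = A\,Var(x)A'$ for a constant matrix $A$ and the symmetry of $Q_X$:
\[
\sqrt{n}\left(\hat{\beta} - \beta_0\right) = \left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{\sqrt{n}}
\overset{d}{\to} Q_X^{-1}\, N\left(0, \sigma_0^2 Q_X\right)
= N\left(0, Q_X^{-1}\,\sigma_0^2 Q_X\, Q_X^{-1}\right)
= N\left(0, \sigma_0^2 Q_X^{-1}\right).
\]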
• In summary, the OLS estimator is normally distributed in small and large samples if $\varepsilon$ is normally distributed. If $\varepsilon$ is not normally distributed, $\hat{\beta}$ is asymptotically normally distributed when a CLT can be applied.
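4.3 Asymptotic efficiency
The least squares objective function is
\[
s(\beta) = \sum_{t=1}^{n}\left(y_t - x_t'\beta\right)^2
\]
Supposing that $\varepsilon$ is normally distributed, the model is
\[
y = X\beta_0 + \varepsilon, \qquad \varepsilon \sim N(0, \sigma_0^2 I_n),
\]
so
\[
f(\varepsilon) = \prod_{t=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\varepsilon_t^2}{2\sigma^2}\right)
\]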
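The joint density for $y$ can be constructed using a change of variables. We have $\varepsilon = y - X\beta$, so $\frac{\partial\varepsilon}{\partial y'} = I_n$ and $\left|\frac{\partial\varepsilon}{\partial y'}\right| = 1$, so
\[
f(y) = \prod_{t=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_t - x_t'\beta)^2}{2\sigma^2}\right).
\]
Taking logs,
\[
\ln L(\beta, \sigma) = -n\ln\sqrt{2\pi} - n\ln\sigma - \sum_{t=1}^{n}\frac{(y_t - x_t'\beta)^2}{2\sigma^2}.
\]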
Maximizing this function with respect to $\beta$ and $\sigma$ gives what is known as the maximum likelihood (ML) estimator. It turns out that ML estimators are asymptotically efficient, a concept that will be explained in detail later. It's clear that the first order conditions for the MLE of $\beta_0$ are the same as the first order conditions that define the OLS estimator (up to multiplication by a constant), so the OLS estimator of $\beta$ is also the ML estimator. The estimators are the same, under the present assumptions. Therefore, their properties are the same. In particular, under the classical assumptions with normality, the OLS estimator $\hat{\beta}$ is asymptotically efficient. Note that one needs to make an assumption about the distribution of the errors to compute the ML estimator. If the errors had a distribution other than the normal, then the OLS estimator and the ML estimator would not coincide.
As we'll see later, it will be possible to use (iterated) linear estimation methods and still achieve asymptotic efficiency even if $Var(\varepsilon) \neq \sigma^2 I_n$, as long as $\varepsilon$ is still normally distributed. This is not the case if $\varepsilon$ is nonnormal. In general with nonnormal errors it will be necessary to use nonlinear estimation methods to achieve asymptotically efficient estimation.
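To make the claim about the first order conditions concrete, here is a short worked comparison. It assumes only the normal log-likelihood and the least squares objective $s(\beta)$ given earlier in this section:
\[
\frac{\partial \ln L(\beta,\sigma)}{\partial \beta} = \frac{1}{\sigma^2}\sum_{t=1}^{n} x_t\left(y_t - x_t'\beta\right) = \frac{1}{\sigma^2}X'(y - X\beta) = 0,
\qquad
\frac{\partial s(\beta)}{\partial \beta} = -2X'(y - X\beta) = 0.
\]
The two conditions differ only by the nonzero constants $1/\sigma^2$ and $-2$, so both are solved by $\hat{\beta} = (X'X)^{-1}X'y$.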
4.4 Exercises
1. Write an Octave program that generates a histogram for $R$ Monte Carlo replications of $\sqrt{n}\left(\hat{\beta}_j - \beta_j\right)$, where $\hat{\beta}$ is the OLS estimator and $\beta_j$ is one of the $k$ slope parameters. $R$ should be a large number, at least 1000. The model used to generate data should follow the classical assumptions, except that the errors should not be normally distributed (try $U(-a, a)$, $t(p)$, $\chi^2(p) - p$, etc). Generate histograms for $n \in \{20, 50, 100, 1000\}$. Do you observe evidence of asymptotic normality? Comment.
Chapter 5
Restrictions and hypothesis tests
5.1 Exact linear restrictions
In many cases, economic theory suggests restrictions on the parameters of a model. For example, a
demand function is supposed to be homogeneous of degree zero in prices and income. If we have a
Cobb-Douglas (log-linear) model,
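\[
\ln q = \beta_0 + \beta_1\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m + \varepsilon,
\]
then we need that
\[
k^0\ln q = \beta_0 + \beta_1\ln kp_1 + \beta_2\ln kp_2 + \beta_3\ln km + \varepsilon,
\]
so
\[
\begin{aligned}
\beta_1\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m &= \beta_1\ln kp_1 + \beta_2\ln kp_2 + \beta_3\ln km \\
&= (\ln k)\left(\beta_1 + \beta_2 + \beta_3\right) + \beta_1\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m.
\end{aligned}
\]
The only way to guarantee this for arbitrary $k$ is to set
\[
\beta_1 + \beta_2 + \beta_3 = 0,
\]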
which is a parameter restriction. In particular, this is a linear equality restriction, which is probably
the most commonly encountered case.
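A quick numerical check of the homogeneity argument may help. The sketch below uses made-up coefficient values (they are not estimates from any data set) and a hypothetical helper ln_q; it scales prices and income by a factor k and compares log demand before and after.

```python
import math

def ln_q(p1, p2, m, b):
    """Log demand from the log-linear model, with the error term set to zero."""
    return b[0] + b[1] * math.log(p1) + b[2] * math.log(p2) + b[3] * math.log(m)

k = 7.0
b_restricted = [0.5, -0.8, 0.3, 0.5]    # slopes sum to zero (hypothetical values)
b_unrestricted = [0.5, -0.8, 0.3, 0.9]  # slopes sum to 0.4 (hypothetical values)

# Change in log demand when all prices and income are multiplied by k.
diff_restricted = ln_q(2 * k, 3 * k, 10 * k, b_restricted) - ln_q(2, 3, 10, b_restricted)
diff_unrestricted = ln_q(2 * k, 3 * k, 10 * k, b_unrestricted) - ln_q(2, 3, 10, b_unrestricted)
print(diff_restricted, diff_unrestricted)
```

The first difference is zero up to rounding, while the second equals $(\ln k)(\beta_1 + \beta_2 + \beta_3)$, as in the derivation above.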
Imposition
The general formulation of linear equality restrictions is the model
\[
\begin{aligned}
y &= X\beta + \varepsilon \\
R\beta &= r
\end{aligned}
\]
where $R$ is a $Q \times K$ matrix, $Q < K$, and $r$ is a $Q \times 1$ vector of constants.
• We assume $R$ is of rank $Q$, so that there are no redundant restrictions.
• We also assume that $\exists\beta$ that satisfies the restrictions: they aren't infeasible.
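Let's consider how to estimate $\beta$ subject to the restrictions $R\beta = r$. The most obvious approach is to set up the Lagrangean
\[
\min_{\beta} s(\beta) = \frac{1}{n}(y - X\beta)'(y - X\beta) + 2\lambda'(R\beta - r).
\]
The Lagrange multipliers are scaled by 2, which makes things less messy. The fonc (after multiplying through by $n$ and absorbing that factor into the multipliers) are
\[
\begin{aligned}
D_{\beta}s(\hat{\beta}, \hat{\lambda}) &= -2X'y + 2X'X\hat{\beta}_R + 2R'\hat{\lambda} \equiv 0 \\
D_{\lambda}s(\hat{\beta}, \hat{\lambda}) &= R\hat{\beta}_R - r \equiv 0,
\end{aligned}
\]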
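which can be written as
\[
\begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}
\begin{bmatrix} \hat{\beta}_R \\ \hat{\lambda} \end{bmatrix}
=
\begin{bmatrix} X'y \\ r \end{bmatrix}.
\]
We get
\[
\begin{bmatrix} \hat{\beta}_R \\ \hat{\lambda} \end{bmatrix}
=
\begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}^{-1}
\begin{bmatrix} X'y \\ r \end{bmatrix}.
\]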
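Maybe you're curious about how to invert a partitioned matrix? I can help you with that:
Note that
\[
\begin{aligned}
\begin{bmatrix} (X'X)^{-1} & 0 \\ -R(X'X)^{-1} & I_Q \end{bmatrix}
\begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}
&\equiv AB \\
&= \begin{bmatrix} I_K & (X'X)^{-1}R' \\ 0 & -R(X'X)^{-1}R' \end{bmatrix} \\
&\equiv \begin{bmatrix} I_K & (X'X)^{-1}R' \\ 0 & -P \end{bmatrix} \\
&\equiv C,
\end{aligned}
\]
and
\[
\begin{bmatrix} I_K & (X'X)^{-1}R'P^{-1} \\ 0 & -P^{-1} \end{bmatrix}
\begin{bmatrix} I_K & (X'X)^{-1}R' \\ 0 & -P \end{bmatrix}
\equiv DC = I_{K+Q},
\]
so
\[
\begin{aligned}
DAB &= I_{K+Q} \\
DA &= B^{-1} \\
B^{-1} &= \begin{bmatrix} I_K & (X'X)^{-1}R'P^{-1} \\ 0 & -P^{-1} \end{bmatrix}
\begin{bmatrix} (X'X)^{-1} & 0 \\ -R(X'X)^{-1} & I_Q \end{bmatrix} \\
&= \begin{bmatrix} (X'X)^{-1} - (X'X)^{-1}R'P^{-1}R(X'X)^{-1} & (X'X)^{-1}R'P^{-1} \\ P^{-1}R(X'X)^{-1} & -P^{-1} \end{bmatrix},
\end{aligned}
\]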
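If you weren't curious about that, please start paying attention again. Also, note that we have made the definition $P = R(X'X)^{-1}R'$. Then
\[
\begin{aligned}
\begin{bmatrix} \hat{\beta}_R \\ \hat{\lambda} \end{bmatrix}
&= \begin{bmatrix} (X'X)^{-1} - (X'X)^{-1}R'P^{-1}R(X'X)^{-1} & (X'X)^{-1}R'P^{-1} \\ P^{-1}R(X'X)^{-1} & -P^{-1} \end{bmatrix}
\begin{bmatrix} X'y \\ r \end{bmatrix} \\
&= \begin{bmatrix} \hat{\beta} - (X'X)^{-1}R'P^{-1}\left(R\hat{\beta} - r\right) \\ P^{-1}\left(R\hat{\beta} - r\right) \end{bmatrix} \\
&= \begin{bmatrix} I_K - (X'X)^{-1}R'P^{-1}R \\ P^{-1}R \end{bmatrix}\hat{\beta}
+ \begin{bmatrix} (X'X)^{-1}R'P^{-1}r \\ -P^{-1}r \end{bmatrix}
\end{aligned}
\]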
The fact that $\hat{\beta}_R$ and $\hat{\lambda}$ are linear functions of $\hat{\beta}$ makes it easy to determine their distributions, since the distribution of $\hat{\beta}$ is already known. Recall that for $x$ a random vector, and for $A$ and $b$ a matrix and vector of constants, respectively, $Var(Ax + b) = A\,Var(x)A'$.
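The closed form $\hat{\beta}_R = \hat{\beta} - (X'X)^{-1}R'P^{-1}(R\hat{\beta} - r)$, with $P = R(X'X)^{-1}R'$ and $\hat{\lambda} = P^{-1}(R\hat{\beta} - r)$, can be checked numerically. The following is a minimal sketch in plain Python with hypothetical data (six made-up observations and the single restriction $\beta_2 + \beta_3 = 2$); the helpers solve, matmul and transpose are written here for illustration and are not part of the course software.

```python
def solve(A, B):
    """Solve A Z = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + list(map(float, B[i])) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda row: abs(M[row][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for row in range(n):
            if row != c:
                f = M[row][c]
                M[row] = [vr - f * vc for vr, vc in zip(M[row], M[c])]
    return [row[n:] for row in M]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Hypothetical data: constant plus two regressors.
X = [[1, 0.5, 1.2], [1, 1.0, 0.7], [1, 1.5, 2.1],
     [1, 2.0, 1.1], [1, 2.5, 3.0], [1, 3.0, 2.2]]
y = [[2.1], [2.4], [4.0], [3.9], [6.2], [5.8]]
R = [[0, 1, 1]]          # restriction: beta_2 + beta_3 = 2
r = [[2.0]]

Xt = transpose(X)
XtX = matmul(Xt, X)
Xty = matmul(Xt, y)
beta_hat = solve(XtX, Xty)                      # unrestricted OLS
W = solve(XtX, transpose(R))                    # (X'X)^{-1} R'
P = matmul(R, W)                                # P = R (X'X)^{-1} R'
gap = [[a[0] - b[0]] for a, b in zip(matmul(R, beta_hat), r)]   # R beta_hat - r
lam = solve(P, gap)                             # lambda_hat
adj = matmul(W, lam)
beta_R = [[b[0] - a[0]] for b, a in zip(beta_hat, adj)]          # restricted estimator
print(beta_R, matmul(R, beta_R))
```

The restricted estimate satisfies the restriction exactly, and together with $\hat{\lambda}$ it satisfies the first order condition $X'X\hat{\beta}_R + R'\hat{\lambda} = X'y$.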
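Though this is the obvious way to go about finding the restricted estimator, an easier way, if the number of restrictions is small, is to impose them by substitution. Write
\[
\begin{aligned}
y &= X_1\beta_1 + X_2\beta_2 + \varepsilon \\
\begin{bmatrix} R_1 & R_2 \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} &= r
\end{aligned}
\]
where $R_1$ is $Q \times Q$ nonsingular. Supposing the $Q$ restrictions are linearly independent, one can always make $R_1$ nonsingular by reorganizing the columns of $X$. Then
\[
\beta_1 = R_1^{-1}r - R_1^{-1}R_2\beta_2.
\]
Substitute this into the model
\[
\begin{aligned}
y &= X_1R_1^{-1}r - X_1R_1^{-1}R_2\beta_2 + X_2\beta_2 + \varepsilon \\
y - X_1R_1^{-1}r &= \left[X_2 - X_1R_1^{-1}R_2\right]\beta_2 + \varepsilon
\end{aligned}
\]
or with the appropriate definitions,
\[
y_R = X_R\beta_2 + \varepsilon.
\]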
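This model satisfies the classical assumptions, supposing the restriction is true. One can estimate by OLS. The variance of $\hat{\beta}_2$ is as before
\[
V(\hat{\beta}_2) = \left(X_R'X_R\right)^{-1}\sigma_0^2
\]
and the estimator is
\[
\hat{V}(\hat{\beta}_2) = \left(X_R'X_R\right)^{-1}\hat{\sigma}^2
\]
where one estimates $\sigma_0^2$ in the normal way, using the restricted model, i.e.,
\[
\hat{\sigma}_0^2 = \frac{\left(y_R - X_R\hat{\beta}_2\right)'\left(y_R - X_R\hat{\beta}_2\right)}{n - (K - Q)}
\]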
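To recover $\hat{\beta}_1$, use the restriction. To find the variance of $\hat{\beta}_1$, use the fact that it is a linear function of $\hat{\beta}_2$, so
\[
\begin{aligned}
V(\hat{\beta}_1) &= R_1^{-1}R_2 V(\hat{\beta}_2)R_2'\left(R_1^{-1}\right)' \\
&= R_1^{-1}R_2\left(X_R'X_R\right)^{-1}R_2'\left(R_1^{-1}\right)'\sigma_0^2
\end{aligned}
\]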
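As a worked illustration, the substitution method can be applied to the homogeneity restriction $\beta_1 + \beta_2 + \beta_3 = 0$ from the demand example at the start of this chapter. Taking $\beta_1$ as the coefficient to eliminate (so $R_1 = 1$ and $r = 0$, a choice made here only for illustration), the restriction gives $\beta_1 = -\beta_2 - \beta_3$, and the model becomes
\[
\begin{aligned}
\ln q &= \beta_0 + (-\beta_2 - \beta_3)\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m + \varepsilon \\
&= \beta_0 + \beta_2\left(\ln p_2 - \ln p_1\right) + \beta_3\left(\ln m - \ln p_1\right) + \varepsilon.
\end{aligned}
\]
OLS on the transformed regressors $\ln(p_2/p_1)$ and $\ln(m/p_1)$ gives $\hat{\beta}_0$, $\hat{\beta}_2$ and $\hat{\beta}_3$, and $\hat{\beta}_1$ is recovered as $-\hat{\beta}_2 - \hat{\beta}_3$.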
Properties of the restricted estimator
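We have that
\[
\begin{aligned}
\hat{\beta}_R &= \hat{\beta} - (X'X)^{-1}R'P^{-1}\left(R\hat{\beta} - r\right) \\
&= \hat{\beta} + (X'X)^{-1}R'P^{-1}r - (X'X)^{-1}R'P^{-1}R(X'X)^{-1}X'y \\
&= \beta + (X'X)^{-1}X'\varepsilon + (X'X)^{-1}R'P^{-1}\left[r - R\beta\right] - (X'X)^{-1}R'P^{-1}R(X'X)^{-1}X'\varepsilon \\
\hat{\beta}_R - \beta &= (X'X)^{-1}X'\varepsilon \\
&\quad + (X'X)^{-1}R'P^{-1}\left[r - R\beta\right] \\
&\quad - (X'X)^{-1}R'P^{-1}R(X'X)^{-1}X'\varepsilon
\end{aligned}
\]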
Mean squared error is
\[
MSE(\hat{\beta}_R) = E\left(\hat{\beta}_R - \beta\right)\left(\hat{\beta}_R - \beta\right)'
\]
Noting that the crosses between the second term and the other terms expect to zero, and that the cross of the first and third has a cancellation with the square of the third, we obtain
\[
\begin{aligned}
MSE(\hat{\beta}_R) &= (X'X)^{-1}\sigma^2 \\
&\quad + (X'X)^{-1}R'P^{-1}\left[r - R\beta\right]\left[r - R\beta\right]'P^{-1}R(X'X)^{-1} \\
&\quad - (X'X)^{-1}R'P^{-1}R(X'X)^{-1}\sigma^2
\end{aligned}
\]
So, the first term is the OLS covariance. The second term is PSD, and the third term is NSD.
• If the restriction is true, the second term is 0, so we are better off. True restrictions improve efficiency of estimation.
• If the restriction is false, we may be better or worse off, in terms of MSE, depending on the magnitudes of $r - R\beta$ and $\sigma^2$.
5.2 Testing
In many cases, one wishes to test economic theories. If theory suggests parameter restrictions, as in the above homogeneity example, one can test theory by testing parameter restrictions. A number of tests are available. The first two (t and F) have known small sample distributions when the errors are normally distributed. The third and fourth (Wald and score) do not require normality of the errors, but their distributions are known only approximately, so that they are not exactly valid with finite samples.
t-test
Suppose one has the model
\[
y = X\beta + \varepsilon
\]
and one wishes to test the single restriction $H_0: R\beta = r$ vs. $H_A: R\beta \neq r$. Under $H_0$, with normality of the errors,
\[
R\hat{\beta} - r \sim N\left(0, R(X'X)^{-1}R'\sigma_0^2\right)
\]
so
\[
\frac{R\hat{\beta} - r}{\sqrt{R(X'X)^{-1}R'\sigma_0^2}} = \frac{R\hat{\beta} - r}{\sigma_0\sqrt{R(X'X)^{-1}R'}} \sim N(0, 1).
\]
The problem is that $\sigma_0^2$ is unknown. One could use the consistent estimator $\hat{\sigma}_0^2$ in place of $\sigma_0^2$, but the test would only be valid asymptotically in this case.
Proposition 1.
\[
\frac{N(0,1)}{\sqrt{\chi^2(q)/q}} \sim t(q)
\]
as long as the $N(0, 1)$ and the $\chi^2(q)$ are independent.
We need a few results on the $\chi^2$ distribution.
Proposition 2. If $x \sim N(\mu, I_n)$ is a vector of $n$ independent r.v.'s, then $x'x \sim \chi^2(n, \lambda)$ where $\lambda = \sum_i \mu_i^2 = \mu'\mu$ is the noncentrality parameter.
When a $\chi^2$ r.v. has the noncentrality parameter equal to zero, it is referred to as a central $\chi^2$ r.v., and its distribution is written as $\chi^2(n)$, suppressing the noncentrality parameter.
Proposition 3. If the $n$ dimensional random vector $x \sim N(0, V)$, then $x'V^{-1}x \sim \chi^2(n)$.
We'll prove this one as an indication of how the following unproven propositions could be proved.
Proof: Factor $V^{-1}$ as $P'P$ (this is the Cholesky factorization, where $P$ is defined to be upper triangular). Then consider $y = Px$. We have
\[
y \sim N(0, PVP')
\]
but
\[
\begin{aligned}
VP'P &= I_n \\
PVP'P &= P
\end{aligned}
\]
so $PVP' = I_n$ and thus $y \sim N(0, I_n)$. Thus $y'y \sim \chi^2(n)$ but
\[
y'y = x'P'Px = x'V^{-1}x
\]
and we get the result we wanted.
A more general proposition which implies this result is
Proposition 4. If the $n$ dimensional random vector $x \sim N(0, V)$, then $x'Bx \sim \chi^2(\rho(B))$ if and only if $BV$ is idempotent.
An immediate consequence is
Proposition 5. If the random vector (of dimension $n$) $x \sim N(0, I)$, and $B$ is idempotent with rank $r$, then $x'Bx \sim \chi^2(r)$.
Consider the random variable
\[
\frac{\hat{\varepsilon}'\hat{\varepsilon}}{\sigma_0^2} = \frac{\varepsilon'M_X\varepsilon}{\sigma_0^2}
= \left(\frac{\varepsilon}{\sigma_0}\right)'M_X\left(\frac{\varepsilon}{\sigma_0}\right) \sim \chi^2(n - K)
\]
Proposition 6. If the random vector (of dimension $n$) $x \sim N(0, I)$, then $Ax$ and $x'Bx$ are independent if $AB = 0$.
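Now consider (remember that we have only one restriction in this case)
\[
\frac{\dfrac{R\hat{\beta} - r}{\sigma_0\sqrt{R(X'X)^{-1}R'}}}{\sqrt{\dfrac{\hat{\varepsilon}'\hat{\varepsilon}}{(n-K)\sigma_0^2}}}
= \frac{R\hat{\beta} - r}{\hat{\sigma}_0\sqrt{R(X'X)^{-1}R'}}
\]
This will have the $t(n - K)$ distribution if $\hat{\beta}$ and $\hat{\varepsilon}'\hat{\varepsilon}$ are independent. But $\hat{\beta} = \beta + (X'X)^{-1}X'\varepsilon$ and
\[
(X'X)^{-1}X'M_X = 0,
\]
so
\[
\frac{R\hat{\beta} - r}{\hat{\sigma}_0\sqrt{R(X'X)^{-1}R'}} = \frac{R\hat{\beta} - r}{\hat{\sigma}_{R\hat{\beta}}} \sim t(n - K)
\]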
In particular, for the commonly encountered test of significance of an individual coefficient, for which $H_0: \beta_i = 0$ vs. $H_A: \beta_i \neq 0$, the test statistic is
\[
\frac{\hat{\beta}_i}{\hat{\sigma}_{\hat{\beta}_i}} \sim t(n - K)
\]
• Note: the t test is strictly valid only if the errors are actually normally distributed. If one has nonnormal errors, one could use the above asymptotic result to justify taking critical values from the $N(0, 1)$ distribution, since $t(n - K) \overset{d}{\to} N(0, 1)$ as $n \to \infty$. In practice, a conservative procedure is to take critical values from the t distribution if nonnormality is suspected. This will reject $H_0$ less often since the t distribution is fatter-tailed than is the normal.
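As a small numerical illustration, the sketch below recomputes t statistics from the rounded estimates and standard errors of the Nerlove OLS output reported in section 5.8 (labor: 0.436 and 0.291; output: 0.720 and 0.017). The p-value uses the N(0, 1) approximation rather than the exact t(n - K) distribution, which is an assumption made here so that only the Python standard library is needed; the function names are hypothetical.

```python
from statistics import NormalDist

def t_stat(estimate, std_err, hypothesized=0.0):
    """t statistic for H0: coefficient = hypothesized."""
    return (estimate - hypothesized) / std_err

def two_sided_normal_pvalue(t):
    """Two-sided p-value using the N(0, 1) approximation to t(n - K)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

t_labor = t_stat(0.436, 0.291)                          # H0: beta_L = 0
t_output_crts = t_stat(0.720, 0.017, hypothesized=1.0)  # H0: beta_Q = 1
p_labor = two_sided_normal_pvalue(t_labor)
print(t_labor, p_labor, t_output_crts)
```

Because the inputs are rounded to three decimals and the normal approximation is used, the results match the reported t statistic and p-value for labor only approximately.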
F test
The F test allows testing multiple restrictions jointly.
Proposition 7. If $x \sim \chi^2(r)$ and $y \sim \chi^2(s)$, then $\frac{x/r}{y/s} \sim F(r, s)$, provided that $x$ and $y$ are independent.
Proposition 8. If the random vector (of dimension $n$) $x \sim N(0, I)$, then $x'Ax$ and $x'Bx$ are independent if $AB = 0$.
Using these results, and previous results on the $\chi^2$ distribution, it is simple to show that the following statistic has the F distribution:
\[
F = \frac{\left(R\hat{\beta} - r\right)'\left(R(X'X)^{-1}R'\right)^{-1}\left(R\hat{\beta} - r\right)}{q\hat{\sigma}^2} \sim F(q, n - K).
\]
A numerically equivalent expression is
\[
\frac{\left(ESS_R - ESS_U\right)/q}{ESS_U/(n - K)} \sim F(q, n - K).
\]
• Note: The F test is strictly valid only if the errors are truly normally distributed. The following tests will be appropriate when one cannot assume normally distributed errors.
Wald-type tests
The t and F tests require normality of the errors. The Wald test does not, but it is an asymptotic test: it is only approximately valid in finite samples.
The Wald principle is based on the idea that if a restriction is true, the unrestricted model should approximately satisfy the restriction. Given that the least squares estimator is asymptotically normally distributed:
\[
\sqrt{n}\left(\hat{\beta} - \beta_0\right) \overset{d}{\to} N\left(0, \sigma_0^2 Q_X^{-1}\right)
\]
then under $H_0: R\beta_0 = r$, we have
\[
\sqrt{n}\left(R\hat{\beta} - r\right) \overset{d}{\to} N\left(0, \sigma_0^2 RQ_X^{-1}R'\right)
\]
so by Proposition 3
\[
n\left(R\hat{\beta} - r\right)'\left(\sigma_0^2 RQ_X^{-1}R'\right)^{-1}\left(R\hat{\beta} - r\right) \overset{d}{\to} \chi^2(q)
\]
Note that $Q_X^{-1}$ and $\sigma_0^2$ are not observable. The test statistic we use substitutes the consistent estimators. Use $(X'X/n)^{-1}$ as the consistent estimator of $Q_X^{-1}$. With this, there is a cancellation of $n$'s, and the statistic to use is
\[
\left(R\hat{\beta} - r\right)'\left(\hat{\sigma}_0^2 R(X'X)^{-1}R'\right)^{-1}\left(R\hat{\beta} - r\right) \overset{d}{\to} \chi^2(q)
\]
• The Wald test is a simple way to test restrictions without having to estimate the restricted model.
• Note that this formula is similar to one of the formulae provided for the F test.
Score-type tests (Rao tests, Lagrange multiplier tests)
The score test is another asymptotically valid test that does not require normality of the errors.
In some cases, an unrestricted model may be nonlinear in the parameters, but the model is linear in the parameters under the null hypothesis. For example, the model
\[
y = (X\beta)^{\gamma} + \varepsilon
\]
is nonlinear in $\beta$ and $\gamma$, but is linear in $\beta$ under $H_0: \gamma = 1$. Estimation of nonlinear models is a bit more complicated, so one might prefer to have a test based upon the restricted, linear model. The score test is useful in this situation.
• Score-type tests are based upon the general principle that the gradient vector of the unrestricted model, evaluated at the restricted estimate, should be asymptotically normally distributed with mean zero, if the restrictions are true. The original development was for ML estimation, but the principle is valid for a wide variety of estimation methods.
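We have seen that
\[
\begin{aligned}
\hat{\lambda} &= \left(R(X'X)^{-1}R'\right)^{-1}\left(R\hat{\beta} - r\right) \\
&= P^{-1}\left(R\hat{\beta} - r\right)
\end{aligned}
\]
so
\[
\sqrt{n}P\hat{\lambda} = \sqrt{n}\left(R\hat{\beta} - r\right)
\]
Given that
\[
\sqrt{n}\left(R\hat{\beta} - r\right) \overset{d}{\to} N\left(0, \sigma_0^2 RQ_X^{-1}R'\right)
\]
under the null hypothesis, we obtain
\[
\sqrt{n}P\hat{\lambda} \overset{d}{\to} N\left(0, \sigma_0^2 RQ_X^{-1}R'\right)
\]
So
\[
\left(\sqrt{n}P\hat{\lambda}\right)'\left(\sigma_0^2 RQ_X^{-1}R'\right)^{-1}\left(\sqrt{n}P\hat{\lambda}\right) \overset{d}{\to} \chi^2(q)
\]
Noting that $\lim nP = RQ_X^{-1}R'$, we obtain
\[
\hat{\lambda}'\left(\frac{R(X'X)^{-1}R'}{\sigma_0^2}\right)\hat{\lambda} \overset{d}{\to} \chi^2(q)
\]
since the powers of $n$ cancel. To get a usable test statistic substitute a consistent estimator of $\sigma_0^2$.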
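• This makes it clear why the test is sometimes referred to as a Lagrange multiplier test. It may seem that one needs the actual Lagrange multipliers to calculate this. If we impose the restrictions by substitution, these are not available. Note that the test can be written as
\[
\frac{\left(R'\hat{\lambda}\right)'(X'X)^{-1}R'\hat{\lambda}}{\sigma_0^2} \overset{d}{\to} \chi^2(q)
\]
However, we can use the fonc for the restricted estimator:
\[
-X'y + X'X\hat{\beta}_R + R'\hat{\lambda} = 0
\]
to get that
\[
\begin{aligned}
R'\hat{\lambda} &= X'\left(y - X\hat{\beta}_R\right) \\
&= X'\hat{\varepsilon}_R
\end{aligned}
\]
Substituting this into the above, we get
\[
\frac{\hat{\varepsilon}_R'X(X'X)^{-1}X'\hat{\varepsilon}_R}{\sigma_0^2} \overset{d}{\to} \chi^2(q)
\]
but this is simply
\[
\hat{\varepsilon}_R'\frac{P_X}{\sigma_0^2}\hat{\varepsilon}_R \overset{d}{\to} \chi^2(q).
\]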
To see why the test is also known as a score test, note that the fonc for restricted least squares
\[
-X'y + X'X\hat{\beta}_R + R'\hat{\lambda} = 0
\]
give us
\[
R'\hat{\lambda} = X'y - X'X\hat{\beta}_R
\]
and the rhs is simply the gradient (score) of the unrestricted model, evaluated at the restricted estimator. The scores evaluated at the unrestricted estimate are identically zero. The logic behind the score test is that the scores evaluated at the restricted estimate should be approximately zero, if the restriction is true. The test is also known as a Rao test, since C. R. Rao first proposed it in 1948.
5.3 The asymptotic equivalence of the LR, Wald and score
tests
Note: the discussion of the LR test has been moved forward in these notes. I no longer teach the material in this section, but I'm leaving it here for reference.
We have seen that the three tests all converge to $\chi^2$ random variables. In fact, they all converge to the same $\chi^2$ rv, under the null hypothesis. We'll show that the Wald and LR tests are asymptotically equivalent. We have seen that the Wald test is asymptotically equivalent to
\[
W \overset{a}{=} n\left(R\hat{\beta} - r\right)'\left(\sigma_0^2 RQ_X^{-1}R'\right)^{-1}\left(R\hat{\beta} - r\right) \overset{d}{\to} \chi^2(q) \tag{5.1}
\]
Using
\[
\hat{\beta} - \beta_0 = (X'X)^{-1}X'\varepsilon
\]
and
\[
R\hat{\beta} - r = R(\hat{\beta} - \beta_0)
\]
we get
\[
\begin{aligned}
\sqrt{n}R(\hat{\beta} - \beta_0) &= \sqrt{n}R(X'X)^{-1}X'\varepsilon \\
&= R\left(\frac{X'X}{n}\right)^{-1}n^{-1/2}X'\varepsilon
\end{aligned}
\]
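Substitute this into [5.1] to get
\[
\begin{aligned}
W &\overset{a}{=} n^{-1}\varepsilon'XQ_X^{-1}R'\left(\sigma_0^2 RQ_X^{-1}R'\right)^{-1}RQ_X^{-1}X'\varepsilon \\
&\overset{a}{=} \varepsilon'X(X'X)^{-1}R'\left(\sigma_0^2 R(X'X)^{-1}R'\right)^{-1}R(X'X)^{-1}X'\varepsilon \\
&\overset{a}{=} \frac{\varepsilon'A(A'A)^{-1}A'\varepsilon}{\sigma_0^2} \\
&\overset{a}{=} \frac{\varepsilon'P_R\varepsilon}{\sigma_0^2}
\end{aligned}
\]
where $P_R$ is the projection matrix formed by the matrix $A = X(X'X)^{-1}R'$.
• Note that this matrix is idempotent and has $q$ columns, so the projection matrix has rank $q$.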
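Now consider the likelihood ratio statistic
\[
LR \overset{a}{=} n^{1/2}g(\theta_0)'\mathcal{J}(\theta_0)^{-1}R'\left(R\mathcal{J}(\theta_0)^{-1}R'\right)^{-1}R\mathcal{J}(\theta_0)^{-1}n^{1/2}g(\theta_0) \tag{5.2}
\]
Under normality, we have seen that the likelihood function is
\[
\ln L(\beta, \sigma) = -n\ln\sqrt{2\pi} - n\ln\sigma - \frac{1}{2}\frac{(y - X\beta)'(y - X\beta)}{\sigma^2}.
\]
Using this,
\[
\begin{aligned}
g(\beta_0) &\equiv D_{\beta}\frac{1}{n}\ln L(\beta, \sigma) \\
&= \frac{X'(y - X\beta_0)}{n\sigma^2} \\
&= \frac{X'\varepsilon}{n\sigma^2}
\end{aligned}
\]
Also, by the information matrix equality:
\[
\begin{aligned}
\mathcal{J}(\theta_0) &= -\mathcal{H}_{\infty}(\theta_0) \\
&= \lim -D_{\beta'}g(\beta_0) \\
&= \lim -D_{\beta'}\frac{X'(y - X\beta_0)}{n\sigma^2} \\
&= \lim\frac{X'X}{n\sigma^2} \\
&= \frac{Q_X}{\sigma^2}
\end{aligned}
\]
so
\[
\mathcal{J}(\theta_0)^{-1} = \sigma^2 Q_X^{-1}
\]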
Substituting these last expressions into [5.2], we get
\[
\begin{aligned}
LR &\overset{a}{=} \varepsilon'X(X'X)^{-1}R'\left(\sigma_0^2 R(X'X)^{-1}R'\right)^{-1}R(X'X)^{-1}X'\varepsilon \\
&\overset{a}{=} \frac{\varepsilon'P_R\varepsilon}{\sigma_0^2} \\
&\overset{a}{=} W
\end{aligned}
\]
This completes the proof that the Wald and LR tests are asymptotically equivalent. Similarly, one can show that, under the null hypothesis,
\[
qF \overset{a}{=} W \overset{a}{=} LM \overset{a}{=} LR
\]
• The proof for the statistics except for LR does not depend upon normality of the errors, as can be verified by examining the expressions for the statistics.
• The LR statistic is based upon distributional assumptions, since one can't write the likelihood function without them.
• However, due to the close relationship between the statistics $qF$ and $LR$, supposing normality, the $qF$ statistic can be thought of as a pseudo-LR statistic, in that it's like a LR statistic in that it uses the value of the objective functions of the restricted and unrestricted models, but it doesn't require distributional assumptions.
• The presentation of the score and Wald tests has been done in the context of the linear model. This is readily generalizable to nonlinear models and/or other estimation methods.
Though the four statistics are asymptotically equivalent, they are numerically different in small samples. The numeric values of the tests also depend upon how $\sigma^2$ is estimated, and we've already seen that there are several ways to do this. For example all of the following are consistent for $\sigma^2$ under $H_0$
\[
\frac{\hat{\varepsilon}'\hat{\varepsilon}}{n - k}, \qquad
\frac{\hat{\varepsilon}_R'\hat{\varepsilon}_R}{n - k + q}, \qquad
\frac{\hat{\varepsilon}_R'\hat{\varepsilon}_R}{n}
\]
and in general the denominator can be replaced with any quantity $a$ such that $\lim a/n = 1$.
It can be shown, for linear regression models subject to linear restrictions, and if $\frac{\hat{\varepsilon}'\hat{\varepsilon}}{n}$ is used to calculate the Wald test and $\frac{\hat{\varepsilon}_R'\hat{\varepsilon}_R}{n}$ is used for the score test, that
\[
W > LR > LM.
\]
For this reason, the Wald test will always reject if the LR test rejects, and in turn the LR test rejects if the LM test rejects. This is a bit problematic: there is the possibility that by careful choice of the statistic used, one can manipulate reported results to favor or disfavor a hypothesis. A conservative/honest approach would be to report all three test statistics when they are available. In the case of linear models with normal errors the F test is to be preferred, since asymptotic approximations are not an issue.
The small sample behavior of the tests can be quite different. The true size (probability of rejection of the null when the null is true) of the Wald test is often dramatically higher than the nominal size associated with the asymptotic distribution. Likewise, the true size of the score test is often smaller than the nominal size.
5.4 Interpretation of test statistics
Now that we have a menu of test statistics, we need to know how to use them.
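5.5 Confidence intervals
Confidence intervals for single coefficients are generated in the normal manner. Given the t statistic
\[
t(\beta) = \frac{\hat{\beta} - \beta}{\hat{\sigma}_{\hat{\beta}}}
\]
a $100(1 - \alpha)\%$ confidence interval for $\beta_0$ is defined by the bounds of the set of $\beta$ such that $t(\beta)$ does not reject $H_0: \beta_0 = \beta$, using a $\alpha$ significance level:
\[
C(\alpha) = \left\{\beta : -c_{\alpha/2} < \frac{\hat{\beta} - \beta}{\hat{\sigma}_{\hat{\beta}}} < c_{\alpha/2}\right\}
\]
The set of such $\beta$ is the interval
\[
\hat{\beta} \pm \hat{\sigma}_{\hat{\beta}}c_{\alpha/2}
\]
A confidence ellipse for two coefficients jointly would be, analogously, the set of $\{\beta_1, \beta_2\}$ such that the F (or some other test statistic) doesn't reject at the specified critical value. This generates an ellipse, if the estimators are correlated.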
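The interval for a single coefficient is easy to compute. The sketch below is an illustration only: it takes the critical value from the N(0, 1) distribution instead of t(n - K) (an assumption that keeps the example within the Python standard library), and uses the rounded output coefficient and standard error from the Nerlove OLS results reported in section 5.8 (0.720 and 0.017). The function name is hypothetical.

```python
from statistics import NormalDist

def confidence_interval(estimate, std_err, alpha=0.05):
    """Interval estimate +/- c_{alpha/2} * std_err with a normal critical value."""
    c = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return estimate - c * std_err, estimate + c * std_err

lower, upper = confidence_interval(0.720, 0.017)
print(lower, upper)
```

With these inputs the 95% interval is roughly (0.687, 0.753), which does not contain 1.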
Figure 5.1: Joint and Individual Confidence Regions
• The region is an ellipse, since the CI for an individual coefficient defines a (infinitely long) rectangle with total prob. mass $1 - \alpha$, since the other coefficient is marginalized (e.g., can take on any value). Since the ellipse is bounded in both dimensions but also contains mass $1 - \alpha$, it must extend beyond the bounds of the individual CI.
• From the picture we can see that:
  - Rejection of hypotheses individually does not imply that the joint test will reject.
  - Joint rejection does not imply individual tests will reject.
5.6 Bootstrapping
When we rely on asymptotic theory to use the normal distribution-based tests and confidence intervals, we're often at serious risk of making important errors. If the sample size is small and errors are highly nonnormal, the small sample distribution of $\sqrt{n}\left(\hat{\beta} - \beta_0\right)$ may be very different than its large sample distribution. Also, the distributions of test statistics may not resemble their limiting distributions at all. A means of trying to gain information on the small sample distribution of test statistics and estimators is the bootstrap. We'll consider a simple example, just to get the main idea.
Suppose that
\[
\begin{aligned}
y &= X\beta_0 + \varepsilon \\
\varepsilon &\sim IID(0, \sigma_0^2) \\
X &\text{ is nonstochastic}
\end{aligned}
\]
Given that the distribution of $\varepsilon$ is unknown, the distribution of $\hat{\beta}$ will be unknown in small samples. However, since we have random sampling, we could generate artificial data. The steps are:
1. Draw $n$ observations from $\hat{\varepsilon}$ with replacement. Call this vector $\tilde{\varepsilon}^j$ (it's $n \times 1$).
2. Then generate the data by $\tilde{y}^j = X\hat{\beta} + \tilde{\varepsilon}^j$.
3. Now take this and estimate
\[
\tilde{\beta}^j = (X'X)^{-1}X'\tilde{y}^j.
\]
4. Save $\tilde{\beta}^j$.
5. Repeat steps 1-4, until we have a large number, $J$, of $\tilde{\beta}^j$.
With this, we can use the replications to calculate the empirical distribution of $\tilde{\beta}^j$. One way to form a $100(1 - \alpha)\%$ confidence interval for $\beta_0$ would be to order the $\tilde{\beta}^j$ from smallest to largest, and drop the first and last $J\alpha/2$ of the replications, and use the remaining endpoints as the limits of the CI. Note that this will not give the shortest CI if the empirical distribution is skewed.
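The five steps and the percentile interval can be sketched as follows. Everything specific in this example is an assumption made for illustration: a single regressor plus a constant, n = 30, J = 2000, true coefficients (1, 2), and skewed mean-zero errors (exponential minus one). The helper ols is hypothetical and handles only the simple regression case.

```python
import random

def ols(x, y):
    """Intercept and slope of a simple regression of y on a constant and x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b2 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum((a - xbar) ** 2 for a in x)
    return ybar - b2 * xbar, b2

rng = random.Random(2014)
n, J, alpha = 30, 2000, 0.05
x = [rng.random() for _ in range(n)]
y = [1.0 + 2.0 * xi + (rng.expovariate(1.0) - 1.0) for xi in x]   # skewed, mean-zero errors

b1_hat, b2_hat = ols(x, y)
resid = [yi - b1_hat - b2_hat * xi for xi, yi in zip(x, y)]       # fitted residuals

draws = []
for _ in range(J):
    e_star = [rng.choice(resid) for _ in range(n)]                     # step 1: resample residuals
    y_star = [b1_hat + b2_hat * xi + e for xi, e in zip(x, e_star)]    # step 2: artificial data
    draws.append(ols(x, y_star)[1])                                    # steps 3-4: estimate and save slope

draws.sort()
k = int(J * alpha / 2)               # number of replications dropped in each tail
ci = (draws[k], draws[J - k - 1])    # percentile confidence interval for the slope
print(b2_hat, ci)
```

Increasing J and rerunning with a different seed is a simple way to check whether the interval is stable.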
• Suppose one was interested in the distribution of some function of $\hat{\beta}$, for example a test statistic. Simple: just calculate the transformation for each $j$, and work with the empirical distribution of the transformation.
• If the assumption of iid errors is too strong (for example if there is heteroscedasticity or autocorrelation, see below) one can work with a bootstrap defined by sampling from $(y, x)$ with replacement.
• How to choose $J$: $J$ should be large enough that the results don't change with repetition of the entire bootstrap. This is easy to check. If you find the results change a lot, increase $J$ and try again.
• The bootstrap is based fundamentally on the idea that the empirical distribution of the sample data converges to the actual sampling distribution as $n$ becomes large, so statistics based on sampling from the empirical distribution should converge in distribution to statistics based on sampling from the actual sampling distribution.
• In finite samples, this doesn't hold. At a minimum, the bootstrap is a good way to check if asymptotic theory results offer a decent approximation to the small sample distribution.
• Bootstrapping can be used to test hypotheses. Basically, use the bootstrap to get an approximation to the empirical distribution of the test statistic under the alternative hypothesis, and use this to get critical values. Compare the test statistic calculated using the real data, under the null, to the bootstrap critical values. There are many variations on this theme, which we won't go into here.
5.7 Wald test for nonlinear restrictions: the delta method
Testing nonlinear restrictions of a linear model is not much more difficult, at least when the model is linear. Since estimation subject to nonlinear restrictions requires nonlinear estimation methods, which are beyond the scope of this course, we'll just consider the Wald test for nonlinear restrictions on a linear model.
Consider the $q$ nonlinear restrictions
\[
r(\beta_0) = 0,
\]
where $r(\cdot)$ is a $q$-vector valued function. Write the derivative of the restriction evaluated at $\beta$ as
\[
D_{\beta'}r(\beta)\big|_{\beta} = R(\beta)
\]
We suppose that the restrictions are not redundant in a neighborhood of $\beta_0$, so that
\[
\rho(R(\beta)) = q
\]
in a neighborhood of $\beta_0$. Take a first order Taylor's series expansion of $r(\hat{\beta})$ about $\beta_0$:
\[
r(\hat{\beta}) = r(\beta_0) + R(\beta^*)(\hat{\beta} - \beta_0)
\]
where $\beta^*$ is a convex combination of $\hat{\beta}$ and $\beta_0$. Under the null hypothesis we have
\[
r(\hat{\beta}) = R(\beta^*)(\hat{\beta} - \beta_0)
\]

we can replace

by
0
, asymptotically, so

nr(

)
a
=

nR(
0
)(


0
)
Weve already seen the distribution of

n(


0
). Using this we get

nr(

)
d
N
_
0, R(
0
)Q
1
X
R(
0
)
/

2
0
_
.
Considering the quadratic form
nr(

)
/
_
R(
0
)Q
1
X
R(
0
)
/
_
1
r(

2
0
d

2
(q)
under the null hypothesis. Substituting consistent estimators for
0,
Q
X
and
2
0
, the resulting statistic
is
r(

)
/
_
R(

)(X
/
X)
1
R(

)
/
_
1
r(

2
d

2
(q)
under the null hypothesis.
This is known in the literature as the delta method, or as Kleins approximation.
Since this is a Wald test, it will tend to over-reject in nite samples. The score and LR tests are
also possibilities, but they require estimation methods for nonlinear models, which arent in the
scope of this course.
Note that this also gives a convenient way to estimate nonlinear functions and associated asymptotic confidence intervals. If the nonlinear function $r(\beta_0)$ is not hypothesized to be zero, we just have
\[
\sqrt{n}\left(r(\hat{\beta}) - r(\beta_0)\right) \overset{d}{\to} N\left(0, R(\beta_0)Q_X^{-1}R(\beta_0)'\sigma_0^2\right)
\]
so an approximation to the distribution of the function of the estimator is
\[
r(\hat{\beta}) \approx N\left(r(\beta_0), R(\beta_0)(X'X)^{-1}R(\beta_0)'\sigma_0^2\right)
\]
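For example, the vector of elasticities of a function $f(x)$ is
\[
\eta(x) = \frac{\partial f(x)}{\partial x} \odot \frac{x}{f(x)}
\]
where $\odot$ means element-by-element multiplication. Suppose we estimate a linear function
\[
y = x'\beta + \varepsilon.
\]
The elasticities of $y$ w.r.t. $x$ are
\[
\eta(x) = \frac{\beta}{x'\beta} \odot x
\]
(note that this is the entire vector of elasticities). The estimated elasticities are
\[
\hat{\eta}(x) = \frac{\hat{\beta}}{x'\hat{\beta}} \odot x
\]
To calculate the estimated standard errors of all of the elasticities, use
\[
R(\beta) = \frac{\partial\eta(x)}{\partial\beta'}
= \frac{\begin{bmatrix} x_1 & 0 & \cdots & 0 \\ 0 & x_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & x_k \end{bmatrix}x'\beta - (\beta \odot x)x'}{(x'\beta)^2},
\]
where the matrix $(\beta \odot x)x'$ has $(i, j)$ element $\beta_i x_i x_j$, so its diagonal elements are $\beta_1 x_1^2, \beta_2 x_2^2, \ldots, \beta_k x_k^2$. The off-diagonal elements are nonzero because every $\beta_j$ enters the common denominator $x'\beta$.
To get a consistent estimator just substitute in $\hat{\beta}$. Note that the elasticity and the standard error are functions of $x$. The program ExampleDeltaMethod.m shows how this can be done.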
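The calculation can be sketched in a few lines. This is not the ExampleDeltaMethod.m program; it is a stand-alone illustration with made-up values for $\hat{\beta}$, $x$ and the estimated covariance matrix $V$ of $\hat{\beta}$. The Jacobian used is the one obtained by differentiating $\eta_i = \beta_i x_i/(x'\beta)$ with respect to $\beta_j$, namely $1\{i=j\}x_i/(x'\beta) - \beta_i x_i x_j/(x'\beta)^2$, and the function name is hypothetical.

```python
import math

def elasticities_with_se(beta, x, V):
    """Elasticities eta_i = beta_i * x_i / (x'beta) and delta-method standard errors."""
    k = len(beta)
    xb = sum(b * xi for b, xi in zip(beta, x))            # x'beta
    eta = [b * xi / xb for b, xi in zip(beta, x)]
    # Jacobian: d eta_i / d beta_j = 1{i=j} x_i/(x'beta) - beta_i x_i x_j/(x'beta)^2
    R = [[(x[i] / xb if i == j else 0.0) - beta[i] * x[i] * x[j] / xb ** 2
          for j in range(k)] for i in range(k)]
    # Standard errors: square roots of the diagonal of R V R'
    se = [math.sqrt(sum(R[i][a] * V[a][b] * R[i][b] for a in range(k) for b in range(k)))
          for i in range(k)]
    return eta, R, se

# Hypothetical inputs: three coefficients (the first multiplies a constant).
beta = [1.0, 0.5, -0.2]
x = [1.0, 2.0, 3.0]
V = [[0.04, 0.0, 0.0],
     [0.0, 0.01, 0.002],
     [0.0, 0.002, 0.02]]

eta, R, se = elasticities_with_se(beta, x, V)
print(eta, se)
```

A useful sanity check is that the elasticities of a linear function sum to one when all regressors (including the constant) are included, so each column of the Jacobian sums to zero.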
In many cases, nonlinear restrictions can also involve the data, not just the parameters. For example, consider a model of expenditure shares. Let $x(p, m)$ be a demand function, where $p$ is prices and $m$ is income. An expenditure share system for $G$ goods is
\[
s_i(p, m) = \frac{p_i x_i(p, m)}{m}, \quad i = 1, 2, \ldots, G.
\]
Now demand must be positive, and we assume that expenditures sum to income, so we have the restrictions
\[
\begin{aligned}
0 \leq s_i(p, m) &\leq 1, \quad \forall i \\
\sum_{i=1}^{G} s_i(p, m) &= 1
\end{aligned}
\]
Suppose we postulate a linear model for the expenditure shares:
\[
s_i(p, m) = \beta_1^i + p'\beta_p^i + m\beta_m^i + \varepsilon^i
\]
It is fairly easy to write restrictions such that the shares sum to one, but the restriction that the shares lie in the $[0, 1]$ interval depends on both parameters and the values of $p$ and $m$. It is impossible to impose the restriction that $0 \leq s_i(p, m) \leq 1$ for all possible $p$ and $m$. In such cases, one might consider whether or not a linear model is a reasonable specification.
5.8 Example: the Nerlove data
Remember that we saw in a previous example (section 3.8) that the OLS results for the Nerlove model are
*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943
Results (Ordinary var-cov estimator)
            estimate    st.err.    t-stat.    p-value
constant      -3.527      1.774     -1.987      0.049
output         0.720      0.017     41.244      0.000
labor          0.436      0.291      1.499      0.136
fuel           0.427      0.100      4.249      0.000
capital       -0.220      0.339     -0.648      0.518
*********************************************************
Note that $s_K = \beta_K < 0$, and that $\beta_L + \beta_F + \beta_K \neq 1$.
Remember that if we have constant returns to scale, then $\beta_Q = 1$, and if there is homogeneity of degree 1 then $\beta_L + \beta_F + \beta_K = 1$. We can test these hypotheses either separately or jointly. NerloveRestrictions.m imposes and tests CRTS and then HOD1. From it we obtain the results that follow:
Imposing and testing HOD1
*******************************************************
Restricted LS estimation results
Observations 145
R-squared 0.925652
Sigma-squared 0.155686
            estimate    st.err.    t-stat.    p-value
constant      -4.691      0.891     -5.263      0.000
output         0.721      0.018     41.040      0.000
labor          0.593      0.206      2.878      0.005
fuel           0.414      0.100      4.159      0.000
capital       -0.007      0.192     -0.038      0.969
*******************************************************
            Value      p-value
F           0.574        0.450
Wald        0.594        0.441
LR          0.593        0.441
Score       0.592        0.442
Imposing and testing CRTS
*******************************************************
Restricted LS estimation results
Observations 145
R-squared 0.790420
Sigma-squared 0.438861
            estimate    st.err.    t-stat.    p-value
constant      -7.530      2.966     -2.539      0.012
output         1.000      0.000        Inf      0.000
labor          0.020      0.489      0.040      0.968
fuel           0.715      0.167      4.289      0.000
capital        0.076      0.572      0.132      0.895
*******************************************************
            Value      p-value
F         256.262        0.000
Wald      265.414        0.000
LR        150.863        0.000
Score      93.771        0.000
Notice that the input price coefficients in fact sum to 1 when HOD1 is imposed. HOD1 is not rejected at usual significance levels (e.g., $\alpha = 0.10$). Also, $R^2$ does not drop much when the restriction is imposed, compared to the unrestricted results. For CRTS, you should note that $\beta_Q = 1$, so the restriction is satisfied. Also note that the hypothesis that $\beta_Q = 1$ is rejected by the test statistics at all reasonable significance levels. Note that $R^2$ drops quite a bit when imposing CRTS. If you look at the unrestricted estimation results, you can see that a t-test for $\beta_Q = 1$ also rejects, and that a confidence interval for $\beta_Q$ does not overlap 1.
From the point of view of neoclassical economic theory, these results are not anomalous: HOD1 is an implication of the theory, but CRTS is not.
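The four statistics reported above are tightly linked in the linear model. The sketch below assumes the standard relations for linear restrictions: with $x = (ESS_R - ESS_U)/ESS_U = qF/(n - K)$, the Wald statistic computed with $\hat{\varepsilon}'\hat{\varepsilon}/n$ is $nx$, the LR statistic under normality is $n\ln(1 + x)$, and the score statistic computed with $\hat{\varepsilon}_R'\hat{\varepsilon}_R/n$ is $nx/(1 + x)$. These formulas are stated here as assumptions rather than taken from NerloveRestrictions.m, and the function name is hypothetical; they are checked only against the rounded values printed above.

```python
import math

def stats_from_F(F, q, n, K):
    """Wald, LR and score statistics implied by an F statistic in the linear model."""
    x = q * F / (n - K)              # (ESS_R - ESS_U) / ESS_U
    wald = n * x                     # uses sigma^2 estimate ESS_U / n
    lr = n * math.log(1.0 + x)       # normal likelihood ratio
    score = n * x / (1.0 + x)        # uses sigma^2 estimate ESS_R / n
    return wald, lr, score

crts = stats_from_F(256.262, q=1, n=145, K=5)
hod1 = stats_from_F(0.574, q=1, n=145, K=5)
print(crts)
print(hod1)
```

Since $x \geq \ln(1 + x) \geq x/(1 + x)$ for $x \geq 0$, these relations also make the ordering of the Wald, LR and score statistics visible directly.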
Exercise 9. Modify the NerloveRestrictions.m program to impose and test the restrictions jointly.
The Chow test
Since CRTS is rejected, let's examine the possibilities more carefully. Recall that the data is sorted by output (the third column). Define 5 subsamples of firms, with the first group being the 29 firms with the lowest output levels, then the next 29 firms, etc. The five subsamples can be indexed by $j = 1, 2, \ldots, 5$, where $j = 1$ for $t = 1, 2, \ldots, 29$, $j = 2$ for $t = 30, 31, \ldots, 58$, etc. Define dummy variables $D_1, D_2, \ldots, D_5$ where
\[
\begin{aligned}
D_1 &= \begin{cases} 1 & t \in \{1, 2, \ldots, 29\} \\ 0 & t \notin \{1, 2, \ldots, 29\} \end{cases} \\
D_2 &= \begin{cases} 1 & t \in \{30, 31, \ldots, 58\} \\ 0 & t \notin \{30, 31, \ldots, 58\} \end{cases} \\
&\;\;\vdots \\
D_5 &= \begin{cases} 1 & t \in \{117, 118, \ldots, 145\} \\ 0 & t \notin \{117, 118, \ldots, 145\} \end{cases}
\end{aligned}
\]
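Define the model
\[
\ln C_t = \sum_{j=1}^{5}\alpha_j D_j + \sum_{j=1}^{5}\gamma_j D_j\ln Q_t + \sum_{j=1}^{5}\beta_{Lj}D_j\ln P_{Lt} + \sum_{j=1}^{5}\beta_{Fj}D_j\ln P_{Ft} + \sum_{j=1}^{5}\beta_{Kj}D_j\ln P_{Kt} + \varepsilon_t \tag{5.3}
\]
Note that the first column of nerlove.data indicates this way of breaking up the sample, and provides an easy way of defining the dummy variables. The new model may be written as
\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_5 \end{bmatrix}
=
\begin{bmatrix}
X_1 & 0 & \cdots & & 0 \\
0 & X_2 & & & \\
\vdots & & X_3 & & \\
 & & & X_4 & 0 \\
0 & & & 0 & X_5
\end{bmatrix}
\begin{bmatrix} \beta^1 \\ \beta^2 \\ \vdots \\ \beta^5 \end{bmatrix}
+
\begin{bmatrix} \varepsilon^1 \\ \varepsilon^2 \\ \vdots \\ \varepsilon^5 \end{bmatrix}
\tag{5.4}
\]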
where $y_1$ is $29 \times 1$, $X_1$ is $29 \times 5$, $\beta^j$ is the $5 \times 1$ vector of coefficients for the $j$th subsample (e.g., $\beta^1 = (\alpha_1, \gamma_1, \beta_{L1}, \beta_{F1}, \beta_{K1})'$), and $\varepsilon^j$ is the $29 \times 1$ vector of errors for the $j$th subsample.
The Octave program Restrictions/ChowTest.m estimates the above model. It also tests the hypothesis that the five subsamples share the same parameter vector, or in other words, that there is coefficient stability across the five subsamples. The null to test is that the parameter vectors for the separate groups are all the same, that is,
\[
\beta^1 = \beta^2 = \beta^3 = \beta^4 = \beta^5
\]
This type of test, that parameters are constant across different sets of data, is sometimes referred to as a Chow test.
• There are 20 restrictions. If that's not clear to you, look at the Octave program.
• The restrictions are rejected at all conventional significance levels.
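The grouping and the count of restrictions can be made explicit with a short sketch. It does not reproduce Restrictions/ChowTest.m and does not read nerlove.data; it only builds the group index and the dummy variables for 145 observations sorted by output, under the assumption of five equal groups of 29, and counts the equality restrictions in the null.

```python
n, groups, k = 145, 5, 5             # observations, subsamples, coefficients per subsample
size = n // groups                   # 29 firms per group

# Group index j = 1..5 for observations t = 1..145 (data assumed sorted by output).
group = [(t - 1) // size + 1 for t in range(1, n + 1)]

# D[j-1][t-1] = 1 if observation t belongs to group j, else 0.
D = [[1 if group[t] == j else 0 for t in range(n)] for j in range(1, groups + 1)]

# beta^1 = beta^2 = ... = beta^5 is (groups - 1) vector equalities of k elements each.
n_restrictions = (groups - 1) * k
print(size, [sum(d) for d in D], n_restrictions)
```

Each of the four equalities $\beta^1 = \beta^2$, $\beta^2 = \beta^3$, $\beta^3 = \beta^4$, $\beta^4 = \beta^5$ involves five coefficients, which gives the 20 restrictions mentioned above.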
Since the restrictions are rejected, we should probably use the unrestricted model for analysis. What is the pattern of RTS as a function of the output group (small to large)? Figure 5.2 plots RTS. We can see that there is increasing RTS for small firms, but that RTS is approximately constant for large firms.
Figure 5.2: RTS as a function of firm size
5.9 Exercises
1. Using the Chow test on the Nerlove model, we reject that there is coefficient stability across the 5 groups. But perhaps we could restrict the input price coefficients to be the same but let the constant and output coefficients vary by group size. This new model is
\[
\ln C = \sum_{j=1}^{5} \alpha_j D_j + \sum_{j=1}^{5} \gamma_j D_j \ln Q + \beta_L \ln P_L + \beta_F \ln P_F + \beta_K \ln P_K + \epsilon \tag{5.5}
\]
(a) Estimate this model by OLS, giving $R^2$, estimated standard errors for coefficients, t-statistics for tests of significance, and the associated p-values. Interpret the results in detail.
(b) Test the restrictions implied by this model (relative to the model that lets all coefficients vary across groups) using the F, qF, Wald, score and likelihood ratio tests. Comment on the results.
(c) Estimate this model but imposing the HOD1 restriction, using an OLS estimation program. Don't use mc_olsr or any other restricted OLS estimation program. Give estimated standard errors for all coefficients.
(d) Plot the estimated RTS parameters as a function of firm size. Compare the plot to that given in the notes for the unrestricted model. Comment on the results.
2. For the model of the above question, compute 95% confidence intervals for RTS for each of the 5 groups of firms, using the delta method to compute standard errors. Comment on the results.
3. Perform a Monte Carlo study that generates data from the model
\[
y = -2 + 1 x_2 + 1 x_3 + \epsilon
\]
where the sample size is 30, $x_2$ and $x_3$ are independently uniformly distributed on $[0, 1]$ and $\epsilon \sim IIN(0, 1)$.
(a) Compare the means and standard errors of the estimated coefficients using OLS and restricted OLS, imposing the restriction that $\beta_2 + \beta_3 = 2$.
(b) Compare the means and standard errors of the estimated coefficients using OLS and restricted OLS, imposing the restriction that $\beta_2 + \beta_3 = 1$.
(c) Discuss the results.
Chapter 6
Stochastic regressors
Up to now we have treated the regressors as fixed, which is clearly unrealistic. Now we will assume they are random. There are several ways to think of the problem. First, if we are interested in an analysis conditional on the explanatory variables, then it is irrelevant if they are stochastic or not, since conditional on the values the regressors take on, they are nonstochastic, which is the case already considered.
In cross-sectional analysis it is usually reasonable to make the analysis conditional on the regressors.
In dynamic models, where $y_t$ may depend on $y_{t-1}$, a conditional analysis is not sufficiently general, since we may want to predict into the future many periods out, so we need to consider the behavior of $\hat{\beta}$ and the relevant test statistics unconditional on $X$.
The model we'll deal with will involve a combination of the following assumptions.
Assumption 10. Linearity: the model is a linear function of the parameter vector $\beta_0$:
\[
y_t = x_t'\beta_0 + \epsilon_t,
\]
or in matrix form,
\[
y = X\beta_0 + \epsilon,
\]
where $y$ is $n \times 1$, $X = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix}'$, where $x_t$ is $K \times 1$, and $\beta_0$ and $\epsilon$ are conformable.
Assumption 11. Stochastic, linearly independent regressors
$X$ has rank $K$ with probability 1
$X$ is stochastic
$\lim_{n \to \infty} \Pr\left( \frac{1}{n} X'X = Q_X \right) = 1$, where $Q_X$ is a finite positive definite matrix.
Assumption 12. Central limit theorem
\[
n^{-1/2} X'\epsilon \overset{d}{\to} N(0, Q_X \sigma_0^2)
\]
Assumption 13. Normality (Optional): $\epsilon | X \sim N(0, \sigma^2 I_n)$: $\epsilon$ is normally distributed
Assumption 14. Strongly exogenous regressors. The regressors $X$ are strongly exogenous if
\[
E(\epsilon_t | X) = 0, \ \forall t \tag{6.1}
\]
Assumption 15. Weakly exogenous regressors: The regressors are weakly exogenous if
\[
E(\epsilon_t | x_t) = 0, \ \forall t
\]
In both cases, $x_t'\beta$ is the conditional mean of $y_t$ given $x_t$: $E(y_t | x_t) = x_t'\beta$
6.1 Case 1
Normality of $\epsilon$, strongly exogenous regressors
In this case,
\[
\hat{\beta} = \beta_0 + (X'X)^{-1} X'\epsilon
\]
\[
E(\hat{\beta} | X) = \beta_0 + (X'X)^{-1} X' E(\epsilon | X) = \beta_0
\]
and since this holds for all $X$, $E(\hat{\beta}) = \beta_0$, unconditional on $X$. Likewise,
\[
\hat{\beta} | X \sim N\left( \beta_0, (X'X)^{-1} \sigma_0^2 \right)
\]
If the density of $X$ is $d\mu(X)$, the marginal density of $\hat{\beta}$ is obtained by multiplying the conditional density by $d\mu(X)$ and integrating over $X$. Doing this leads to a nonnormal density for $\hat{\beta}$, in small samples.
However, conditional on $X$, the usual test statistics have the $t$, $F$ and $\chi^2$ distributions. Importantly, these distributions don't depend on $X$, so when marginalizing to obtain the unconditional distribution, nothing changes. The tests are valid in small samples.
Summary: When $X$ is stochastic but strongly exogenous and $\epsilon$ is normally distributed:
1. $\hat{\beta}$ is unbiased
2. $\hat{\beta}$ is nonnormally distributed
3. The usual test statistics have the same distribution as with nonstochastic $X$.
4. The Gauss-Markov theorem still holds, since it holds conditionally on $X$, and this is true for all $X$.
5. Asymptotic properties are treated in the next section.
6.2 Case 2
$\epsilon$ nonnormally distributed, strongly exogenous regressors
The unbiasedness of $\hat{\beta}$ carries through as before. However, the argument regarding test statistics doesn't hold, due to nonnormality of $\epsilon$. Still, we have
\[
\hat{\beta} = \beta_0 + (X'X)^{-1} X'\epsilon = \beta_0 + \left( \frac{X'X}{n} \right)^{-1} \frac{X'\epsilon}{n}
\]
Now
\[
\left( \frac{X'X}{n} \right)^{-1} \overset{p}{\to} Q_X^{-1}
\]
by assumption, and
\[
\frac{X'\epsilon}{n} = \frac{n^{-1/2} X'\epsilon}{\sqrt{n}} \overset{p}{\to} 0
\]
since the numerator converges to a $N(0, Q_X \sigma^2)$ r.v. and the denominator still goes to infinity. We have unbiasedness and the variance disappearing, so the estimator is consistent:
\[
\hat{\beta} \overset{p}{\to} \beta_0.
\]
Considering the asymptotic distribution
\[
\sqrt{n} \left( \hat{\beta} - \beta_0 \right) = \sqrt{n} \left( \frac{X'X}{n} \right)^{-1} \frac{X'\epsilon}{n} = \left( \frac{X'X}{n} \right)^{-1} n^{-1/2} X'\epsilon
\]
so
\[
\sqrt{n} \left( \hat{\beta} - \beta_0 \right) \overset{d}{\to} N(0, Q_X^{-1} \sigma_0^2)
\]
directly following the assumptions. Asymptotic normality of the estimator still holds. Since the asymptotic results on all test statistics only require this, all the previous asymptotic results on test statistics are also valid in this case.
Summary: Under strongly exogenous regressors, with $\epsilon$ normal or nonnormal, $\hat{\beta}$ has the properties:
1. Unbiasedness
2. Consistency
3. Gauss-Markov theorem holds, since it holds in the previous case and doesn't depend on normality.
4. Asymptotic normality
5. Tests are asymptotically valid
6. Tests are not valid in small samples if the error is not normally distributed
6.3 Case 3
Weakly exogenous regressors
An important class of models are dynamic models, where lagged dependent variables have an impact on the current value. A simple version of these models that captures the important points is
\[
y_t = z_t'\alpha + \sum_{s=1}^{p} \gamma_s y_{t-s} + \epsilon_t = x_t'\beta + \epsilon_t
\]
where now $x_t$ contains lagged dependent variables. Clearly, even with $E(\epsilon_t | x_t) = 0$, $X$ and $\epsilon$ are not uncorrelated, so one can't show unbiasedness. For example,
\[
E(\epsilon_{t-1} x_t) \neq 0
\]
since $x_t$ contains $y_{t-1}$ (which is a function of $\epsilon_{t-1}$) as an element.
This fact implies that all of the small sample properties such as unbiasedness, Gauss-Markov theorem, and small sample validity of test statistics do not hold in this case. Recall Figure 3.7. This is a case of weakly exogenous regressors, and we see that the OLS estimator is biased in this case.
Nevertheless, under the above assumptions, all asymptotic properties continue to hold, using the same arguments as before.
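A small simulation can illustrate the small sample bias and its disappearance as n grows. The sketch below is in Python using only the standard library, and is not one of the Octave programs that accompany these notes; the AR(1) design with coefficient 0.9, the function names, the number of replications and the seed are all assumptions made for illustration.

```python
import random

def ols_slope(x, y):
    """Slope from an OLS regression of y on a constant and x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_ar1(n, rho, rng, burn=50):
    """Draw n observations of y_t = rho * y_{t-1} + e_t, with e_t ~ N(0, 1)."""
    y, path = 0.0, []
    for t in range(n + burn):
        y = rho * y + rng.gauss(0.0, 1.0)
        if t >= burn:                      # discard start-up values
            path.append(y)
    return path

def mean_rho_hat(n, rho=0.9, reps=2000, seed=0):
    """Average OLS estimate of rho over Monte Carlo replications."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        y = simulate_ar1(n + 1, rho, rng)
        total += ols_slope(y[:-1], y[1:])  # regress y_t on y_{t-1}
    return total / reps

small = mean_rho_hat(n=30)
large = mean_rho_hat(n=1000, reps=200)
print("mean rho_hat, n = 30:  ", round(small, 3))
print("mean rho_hat, n = 1000:", round(large, 3))
```

With n = 30 the average estimate should lie noticeably below 0.9, while with n = 1000 it should be close to 0.9, in line with biasedness but consistency under weak exogeneity.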
6.4 When are the assumptions reasonable?
The two assumptions we've added are
1. $\lim_{n \to \infty} \Pr\left( \frac{1}{n} X'X = Q_X \right) = 1$, $Q_X$ a finite positive definite matrix.
2. $n^{-1/2} X'\epsilon \overset{d}{\to} N(0, Q_X \sigma_0^2)$
The most complicated case is that of dynamic models, since the other cases can be treated as nested in this case. There exist a number of central limit theorems for dependent processes, many of which are fairly technical. We won't enter into details (see Hamilton, Chapter 7 if you're interested). A main requirement for use of standard asymptotics for a dependent sequence
\[
\{s_n\} = \left\{ \frac{1}{n} \sum_{t=1}^{n} z_t \right\}
\]
to converge in probability to a finite limit is that $\{z_t\}$ be stationary, in some sense.
Strong stationarity requires that the joint distribution of the set
\[
\{z_t, z_{t+s}, z_{t-q}, ...\}
\]
not depend on $t$.
Covariance (weak) stationarity requires that the first and second moments of this set not depend on $t$.
An example of a sequence that doesn't satisfy this is an AR(1) process with a unit root (a random walk):
\[
x_t = x_{t-1} + \epsilon_t, \qquad \epsilon_t \sim IIN(0, \sigma^2)
\]
One can show that the variance of $x_t$ depends upon $t$ in this case, so it's not weakly stationary.
The series $\sin t + \epsilon_t$ has a first moment that depends upon $t$, so it's not weakly stationary either.
Stationarity prevents the process from trending off to plus or minus infinity, and prevents cyclical behavior which would allow correlations between far removed $z_t$ and $z_s$ to be high. Draw a picture here.
In summary, the assumptions are reasonable when the stochastic conditioning variables have variances that are finite, and are not too strongly dependent. The AR(1) model with unit root is an example of a case where the dependence is too strong for standard asymptotics to apply.
The study of nonstationary processes is an important part of econometrics, but it isn't in the scope of this course.
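The statement that the variance of the random walk depends on $t$ can be checked with a short derivation. The fixed starting value $x_0 = 0$ is an assumption made here for illustration; it is not stated in the text.

```latex
\begin{aligned}
x_t &= x_{t-1} + \epsilon_t
     = x_0 + \sum_{s=1}^{t} \epsilon_s
     = \sum_{s=1}^{t} \epsilon_s
     \qquad (x_0 = 0 \text{ assumed}) \\
E(x_t) &= 0 \\
V(x_t) &= \sum_{s=1}^{t} V(\epsilon_s) = t\sigma^2
\end{aligned}
```

The variance grows linearly in $t$ (the cross terms vanish because the $\epsilon_s$ are independent), so the second moment is not constant over time.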
6.5 Exercises
1. Show that for two random variables $A$ and $B$, if $E(A|B) = 0$, then $E(A f(B)) = 0$. How is this used in the proof of the Gauss-Markov theorem?
2. Is it possible for an AR(1) model for time series data, e.g., $y_t = 0 + 0.9 y_{t-1} + \epsilon_t$, to satisfy weak exogeneity? Strong exogeneity? Discuss.
Chapter 7
Data problems
In this section we'll consider problems associated with the regressor matrix: collinearity, missing observations and measurement error.
7.1 Collinearity
Motivation: Data on Mortality and Related Factors
The data set mortality.data contains annual data from 1947 - 1980 on death rates in the U.S., along with data on factors like smoking and consumption of alcohol. The data description is:
DATA4-7: Death rates in the U.S. due to coronary heart disease and their determinants. Data compiled by Jennifer Whisenand
chd = death rate per 100,000 population (Range 321.2 - 375.4)
cal = Per capita consumption of calcium per day in grams (Range 0.9 - 1.06)
unemp = Percent of civilian labor force unemployed in 1,000 of persons 16 years and older (Range 2.9 - 8.5)
cig = Per capita consumption of cigarettes in pounds of tobacco by persons 18 years and older, approx. 339 cigarettes per pound of tobacco (Range 6.75 - 10.46)
edfat = Per capita intake of edible fats and oil in pounds, includes lard, margarine and butter (Range 42 - 56.5)
meat = Per capita intake of meat in pounds, includes beef, veal, pork, lamb and mutton (Range 138 - 194.8)
spirits = Per capita consumption of distilled spirits in taxed gallons for individuals 18 and older (Range 1 - 2.9)
beer = Per capita consumption of malted liquor in taxed gallons for individuals 18 and older (Range 15.04 - 34.9)
wine = Per capita consumption of wine measured in taxed gallons for individuals 18 and older (Range 0.77 - 2.65)
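Consider estimation results for several models:
\[
\widehat{\mathrm{chd}} = \underset{(58.939)}{334.914} + \underset{(5.156)}{5.41216}\,\mathrm{cig} + \underset{(7.373)}{36.8783}\,\mathrm{spirits} - \underset{(1.2513)}{5.10365}\,\mathrm{beer} + \underset{(12.735)}{13.9764}\,\mathrm{wine}
\]
$T = 34$, $\bar{R}^2 = 0.5528$, $F(4, 29) = 11.2$, $\hat{\sigma} = 9.9945$
(standard errors in parentheses)
\[
\widehat{\mathrm{chd}} = \underset{(56.624)}{353.581} + \underset{(4.7523)}{3.17560}\,\mathrm{cig} + \underset{(7.275)}{38.3481}\,\mathrm{spirits} - \underset{(1.0102)}{4.28816}\,\mathrm{beer}
\]
$T = 34$, $\bar{R}^2 = 0.5498$, $F(3, 30) = 14.433$, $\hat{\sigma} = 10.028$
(standard errors in parentheses)
\[
\widehat{\mathrm{chd}} = \underset{(67.21)}{243.310} + \underset{(6.1508)}{10.7535}\,\mathrm{cig} + \underset{(8.0359)}{22.8012}\,\mathrm{spirits} - \underset{(12.638)}{16.8689}\,\mathrm{wine}
\]
$T = 34$, $\bar{R}^2 = 0.3198$, $F(3, 30) = 6.1709$, $\hat{\sigma} = 12.327$
(standard errors in parentheses)
\[
\widehat{\mathrm{chd}} = \underset{(49.119)}{181.219} + \underset{(4.4371)}{16.5146}\,\mathrm{cig} + \underset{(6.2079)}{15.8672}\,\mathrm{spirits}
\]
$T = 34$, $\bar{R}^2 = 0.3026$, $F(2, 31) = 8.1598$, $\hat{\sigma} = 12.481$
(standard errors in parentheses)
Note how the signs of the coefficients change depending on the model, and that the magnitudes of the parameter estimates vary a lot, too. The parameter estimates are highly sensitive to the particular model we estimate. Why? We'll see that the problem is that the data exhibit collinearity.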
Collinearity: denition
Collinearity is the existence of linear relationships amongst the regressors. We can always write
\[
\lambda_1 x_1 + \lambda_2 x_2 + \cdots + \lambda_K x_K + v = 0
\]
where $x_i$ is the $i^{th}$ column of the regressor matrix $X$, and $v$ is an $n \times 1$ vector. In the case that there exists collinearity, the variation in $v$ is relatively small, so that there is an approximately exact linear relation between the regressors.
"Relative" and "approximate" are imprecise, so it's difficult to define when collinearity exists.
In the extreme, if there are exact linear relationships (every element of $v$ equal) then $\rho(X) < K$, so $\rho(X'X) < K$, so $X'X$ is not invertible and the OLS estimator is not uniquely defined. For example, if the model is
\[
\begin{aligned}
y_t &= \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \epsilon_t \\
x_{2t} &= \alpha_1 + \alpha_2 x_{3t}
\end{aligned}
\]
then we can write
\[
\begin{aligned}
y_t &= \beta_1 + \beta_2 (\alpha_1 + \alpha_2 x_{3t}) + \beta_3 x_{3t} + \epsilon_t \\
&= \beta_1 + \beta_2 \alpha_1 + \beta_2 \alpha_2 x_{3t} + \beta_3 x_{3t} + \epsilon_t \\
&= (\beta_1 + \beta_2 \alpha_1) + (\beta_2 \alpha_2 + \beta_3) x_{3t} + \epsilon_t \\
&= \gamma_1 + \gamma_2 x_{3t} + \epsilon_t
\end{aligned}
\]
The $\gamma$'s can be consistently estimated, but since the $\gamma$'s define two equations in three $\beta$'s, the $\beta$'s can't be consistently estimated (there are multiple values of $\beta$ that solve the first order conditions). The $\beta$'s are unidentified in the case of perfect collinearity.
Perfect collinearity is unusual, except in the case of an error in construction of the regressor matrix, such as including the same regressor twice.
Another case where perfect collinearity may be encountered is with models with dummy variables, if one is not careful. Consider a model of rental price ($y_i$) of an apartment. This could depend on factors such as size, quality etc., collected in $x_i$, as well as on the location of the apartment. Let $B_i = 1$ if the $i^{th}$ apartment is in Barcelona, $B_i = 0$ otherwise. Similarly, define $G_i$, $T_i$ and $L_i$ for Girona, Tarragona and Lleida. One could use a model such as
\[
y_i = \beta_1 + \beta_2 B_i + \beta_3 G_i + \beta_4 T_i + \beta_5 L_i + x_i'\gamma + \epsilon_i
\]
In this model, $B_i + G_i + T_i + L_i = 1$, $\forall i$, so there is an exact relationship between these variables and the column of ones corresponding to the constant. One must either drop the constant, or one of the qualitative variables.
A brief aside on dummy variables
Dummy variable: A dummy variable is a binary-valued variable that indicates whether or not some condition is true. It is customary to assign the value 1 if the condition is true, and 0 if the condition is false.
Dummy variables are used essentially like any other regressor. Use $d$ to indicate that a variable is a dummy, so that variables like $d_t$ and $d_{t2}$ are understood to be dummy variables. Variables like $x_t$ and $x_{t3}$ are ordinary continuous regressors. You know how to interpret the following models:
\[
y_t = \beta_1 + \beta_2 d_t + \epsilon_t
\]
\[
y_t = \beta_1 d_t + \beta_2 (1 - d_t) + \epsilon_t
\]
\[
y_t = \beta_1 + \beta_2 d_t + \beta_3 x_t + \epsilon_t
\]
Interaction terms: an interaction term is the product of two variables, so that the effect of one variable on the dependent variable depends on the value of the other. The following model has an interaction term. Note that $\frac{\partial E(y|x)}{\partial x} = \beta_3 + \beta_4 d_t$. The slope depends on the value of $d_t$.
\[
y_t = \beta_1 + \beta_2 d_t + \beta_3 x_t + \beta_4 d_t x_t + \epsilon_t
\]
Multiple dummy variables: we can use more than one dummy variable in a model. We will study models of the form
\[
y_t = \beta_1 + \beta_2 d_{t1} + \beta_3 d_{t2} + \beta_4 x_t + \epsilon_t
\]
\[
y_t = \beta_1 + \beta_2 d_{t1} + \beta_3 d_{t2} + \beta_4 d_{t1} d_{t2} + \beta_5 x_t + \epsilon_t
\]
Incorrect usage: You should understand why the following models are not correct usages of dummy variables:
1. overparameterization:
\[
y_t = \beta_1 + \beta_2 d_t + \beta_3 (1 - d_t) + \epsilon_t
\]
2. multiple values assigned to multiple categories. Suppose that we have a condition that defines 4 possible categories, and we create a variable $d = 1$ if the observation is in the first category, $d = 2$ if in the second, etc. (This is not strictly speaking a dummy variable, according to our definition). Why is the following model not a good one?
\[
y_t = \beta_1 + \beta_2 d + \epsilon
\]
What is the correct way to deal with this situation?
Multiple parameterizations. To formulate a model that conditions on a given set of categorical information, there are multiple ways to use dummy variables. For example, the two models
\[
y_t = \alpha_1 d_t + \alpha_2 (1 - d_t) + \alpha_3 x_t + \alpha_4 d_t x_t + \epsilon_t
\]
and
\[
y_t = \beta_1 + \beta_2 d_t + \beta_3 x_t d_t + \beta_4 x_t (1 - d_t) + \epsilon_t
\]
are equivalent. You should know what are the 4 equations that relate the $\beta_j$ parameters to the $\alpha_j$ parameters, $j = 1, 2, 3, 4$. You should know how to interpret the parameters of both models.
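One way to work out the relationship between the two parameterizations is to compare the intercept and the slope that each model implies for $d_t = 1$ and for $d_t = 0$. The following worked equations are a sketch of that comparison, using only the two models written above.

```latex
\begin{aligned}
d_t = 1:&\quad \alpha_1 + (\alpha_3 + \alpha_4)\,x_t = (\beta_1 + \beta_2) + \beta_3\,x_t \\
d_t = 0:&\quad \alpha_2 + \alpha_3\,x_t = \beta_1 + \beta_4\,x_t \\[4pt]
\Rightarrow\quad
\alpha_1 &= \beta_1 + \beta_2, \qquad
\alpha_2 = \beta_1, \qquad
\alpha_3 = \beta_4, \qquad
\alpha_4 = \beta_3 - \beta_4
\end{aligned}
```

Both models therefore describe the same two regression lines, one for each value of the dummy; only the labeling of intercepts and slopes differs.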
Back to collinearity
The more common case, if one doesn't make mistakes such as these, is the existence of inexact linear relationships, i.e., correlations between the regressors that are less than one in absolute value, but not zero. The basic problem is that when two (or more) variables move together, it is difficult to determine their separate influences.
Example 16. Two children are in a room, along with a broken lamp. Both say "I didn't do it!". How can we tell who broke the lamp?
Lack of knowledge about the separate influences of variables is reflected in imprecise estimates, i.e., estimates with high variances. With economic data, collinearity is commonly encountered, and is often a severe problem.
Figure 7.1: $s(\beta)$ when there is no collinearity
When there is collinearity, the minimizing point of the objective function that defines the OLS estimator ($s(\beta)$, the sum of squared errors) is relatively poorly defined. This is seen in Figures 7.1 and 7.2.
Figure 7.2: $s(\beta)$ when there is collinearity
To see the effect of collinearity on variances, partition the regressor matrix as
\[
X = \begin{bmatrix} x & W \end{bmatrix}
\]
where $x$ is the first column of $X$ (note: we can interchange the columns of $X$ if we like, so there's no loss of generality in considering the first column). Now, the variance of $\hat{\beta}$, under the classical assumptions, is
\[
V(\hat{\beta}) = (X'X)^{-1} \sigma^2
\]
Using the partition,
\[
X'X = \begin{bmatrix} x'x & x'W \\ W'x & W'W \end{bmatrix}
\]
and following a rule for partitioned inversion,
\[
\begin{aligned}
(X'X)^{-1}_{1,1} &= \left( x'x - x'W (W'W)^{-1} W'x \right)^{-1} \\
&= \left( x' \left( I_n - W (W'W)^{-1} W' \right) x \right)^{-1} \\
&= \left( ESS_{x|W} \right)^{-1}
\end{aligned}
\]
where by $ESS_{x|W}$ we mean the error sum of squares obtained from the regression
\[
x = W\lambda + v.
\]
Since
\[
R^2 = 1 - ESS/TSS,
\]
we have
\[
ESS = TSS(1 - R^2)
\]
so the variance of the coefficient corresponding to $x$ is
\[
V(\hat{\beta}_x) = \frac{\sigma^2}{TSS_x (1 - R^2_{x|W})} \tag{7.1}
\]
We see three factors influence the variance of this coefficient. It will be high if
1. $\sigma^2$ is large
2. There is little variation in $x$. Draw a picture here.
3. There is a strong linear relationship between $x$ and the other regressors, so that $W$ can explain the movement in $x$ well. In this case, $R^2_{x|W}$ will be close to 1. As $R^2_{x|W} \to 1$, $V(\hat{\beta}_x) \to \infty$.
The last of these cases is collinearity.
Intuitively, when there are strong linear relations between the regressors, it is difficult to determine the separate influence of the regressors on the dependent variable. This can be seen by comparing the OLS objective function in the case of no correlation between regressors with the objective function with correlation between the regressors. See the figures nocollin.ps (no correlation) and collin.ps (correlation), available on the web site.
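Equation 7.1 can be checked numerically in the simplest case of one other regressor. The sketch below is an illustration in Python (standard library only), not one of the Octave programs for these notes; it assumes two demeaned regressors and no constant, and all names, the sample size and the seed are made up for the example.

```python
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def demean(a):
    m = sum(a) / len(a)
    return [u - m for u in a]

def variance_via_formula(x, w, sigma2=1.0):
    """Equation (7.1) when W is a single demeaned regressor w."""
    tss = dot(x, x)
    r2 = dot(x, w) ** 2 / (dot(x, x) * dot(w, w))   # R^2 from regressing x on w
    return sigma2 / (tss * (1.0 - r2))

def variance_via_inverse(x, w, sigma2=1.0):
    """(1,1) element of sigma^2 (X'X)^{-1} for X = [x w], using the 2x2 inverse."""
    det = dot(x, x) * dot(w, w) - dot(x, w) ** 2
    return sigma2 * dot(w, w) / det

def make_regressors(n, corr, rng):
    """Two regressors with population correlation corr, then demeaned."""
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [corr * wi + (1.0 - corr ** 2) ** 0.5 * rng.gauss(0.0, 1.0) for wi in w]
    return demean(x), demean(w)

rng = random.Random(0)
results = {}
for corr in (0.0, 0.9):
    x, w = make_regressors(200, corr, rng)
    results[corr] = (variance_via_formula(x, w), variance_via_inverse(x, w))
    print(corr, results[corr])
```

The two columns should agree up to rounding, and the variance with correlation 0.9 should be several times the variance with correlation 0, which is the third factor in the list above.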
Example 17. The Octave script DataProblems/collinearity.m performs a Monte Carlo study with correlated regressors. The model is $y = 1 + x_2 + x_3 + \epsilon$, where the correlation between $x_2$ and $x_3$ can be set. Three estimators are used: OLS, OLS dropping $x_3$ (a false restriction), and restricted LS using $\beta_2 = \beta_3$ (a true restriction). The output when the correlation between the two regressors is 0.9 is
octave:1> collinearity
Contribution received from node 0. Received so far: 500
Contribution received from node 0. Received so far: 1000
correlation between x2 and x3: 0.900000
descriptive statistics for 1000 OLS replications
mean st. dev. min max
0.996 0.182 0.395 1.574
0.996 0.444 -0.463 2.517
1.008 0.436 -0.342 2.301
descriptive statistics for 1000 OLS replications, dropping x3
mean st. dev. min max
0.999 0.198 0.330 1.696
1.905 0.207 1.202 2.651
descriptive statistics for 1000 Restricted OLS replications, b2=b3
mean st. dev. min max
0.998 0.179 0.433 1.574
1.002 0.096 0.663 1.339
1.002 0.096 0.663 1.339
octave:2>
Figure 7.3 shows histograms for the estimated $\beta_2$, for each of the three estimators.
Repeat the experiment with a lower value of rho, and note how the standard errors of the OLS estimator change.
Figure 7.3: Collinearity: Monte Carlo results. Panels: (a) OLS, $\hat{\beta}_2$; (b) OLS, $\hat{\beta}_2$, dropping $x_3$; (c) Restricted LS, $\hat{\beta}_2$, with true restriction $\beta_2 = \beta_3$
Detection of collinearity
The best way is simply to regress each explanatory variable in turn on the remaining regressors. If any of these auxiliary regressions has a high $R^2$, there is a problem of collinearity. Furthermore, this procedure identifies which parameters are affected.
Sometimes, we're only interested in certain parameters. Collinearity isn't a problem if it doesn't affect what we're interested in estimating.
An alternative is to examine the matrix of correlations between the regressors. High correlations are sufficient but not necessary for severe collinearity.
Also indicative of collinearity is that the model fits well (high $R^2$), but none of the variables is significantly different from zero (i.e., their separate influences aren't well determined).
In summary, the artificial regressions are the best approach if one wants to be careful.
Example 18. Nerlove data and collinearity. The simple Nerlove model is
\[
\ln C = \beta_1 + \beta_2 \ln Q + \beta_3 \ln P_L + \beta_4 \ln P_F + \beta_5 \ln P_K + \epsilon
\]
When this model is estimated by OLS, some coefficients are not significant (see subsection 3.8). This may be due to collinearity. The Octave script DataProblems/NerloveCollinearity.m checks the regressors for collinearity. If you run this, you will see that collinearity is not a problem with this data. Why is the coefficient of $\ln P_K$ not significantly different from zero?
Dealing with collinearity
More information
Collinearity is a problem of an uninformative sample. The first question is: is all the available information being used? Is more data available? Are there coefficient restrictions that have been neglected? Picture illustrating how a restriction can solve problem of perfect collinearity.
Stochastic restrictions and ridge regression
Supposing that there is no more data or neglected restrictions, one possibility is to change perspectives, to Bayesian econometrics. One can express prior beliefs regarding the coefficients using stochastic restrictions. A stochastic linear restriction would be something of the form
\[
R\beta = r + v
\]
where $R$ and $r$ are as in the case of exact linear restrictions, but $v$ is a random vector. For example, the model could be
\[
\begin{aligned}
y &= X\beta + \epsilon \\
R\beta &= r + v \\
\begin{pmatrix} \epsilon \\ v \end{pmatrix} &\sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2_\epsilon I_n & 0_{n \times q} \\ 0_{q \times n} & \sigma^2_v I_q \end{pmatrix} \right)
\end{aligned}
\]
This sort of model isn't in line with the classical interpretation of parameters as constants: according to this interpretation the left hand side of $R\beta = r + v$ is constant but the right is random. This model does fit the Bayesian perspective: we combine information coming from the model and the data, summarized in
\[
\begin{aligned}
y &= X\beta + \epsilon \\
\epsilon &\sim N(0, \sigma^2_\epsilon I_n)
\end{aligned}
\]
with prior beliefs regarding the distribution of the parameter, summarized in
\[
R\beta \sim N(r, \sigma^2_v I_q)
\]
Since the sample is random it is reasonable to suppose that $E(\epsilon v') = 0$, which is the last piece of information in the specification. How can you estimate using this model? The solution is to treat the restrictions as artificial data. Write
\[
\begin{bmatrix} y \\ r \end{bmatrix} = \begin{bmatrix} X \\ R \end{bmatrix} \beta + \begin{bmatrix} \epsilon \\ v \end{bmatrix}
\]
This model is heteroscedastic, since $\sigma^2_\epsilon \neq \sigma^2_v$. Define the prior precision $k = \sigma_\epsilon / \sigma_v$. This expresses the degree of belief in the restriction relative to the variability of the data. Supposing that we specify $k$, then the model
\[
\begin{bmatrix} y \\ kr \end{bmatrix} = \begin{bmatrix} X \\ kR \end{bmatrix} \beta + \begin{bmatrix} \epsilon \\ kv \end{bmatrix}
\]
is homoscedastic and can be estimated by OLS. Note that this estimator is biased. It is consistent, however, given that $k$ is a fixed constant, even if the restriction is false (this is in contrast to the case of false exact restrictions). To see this, note that there are $Q$ restrictions, where $Q$ is the number of rows of $R$. As $n \to \infty$, these $Q$ artificial observations have no weight in the objective function, so the estimator has the same limiting objective function as the OLS estimator, and is therefore consistent.
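To motivate the use of stochastic restrictions, consider the expectation of the squared length of $\hat{\beta}$:
\[
\begin{aligned}
E(\hat{\beta}'\hat{\beta}) &= E\left\{ \left( \beta + (X'X)^{-1} X'\epsilon \right)' \left( \beta + (X'X)^{-1} X'\epsilon \right) \right\} \\
&= \beta'\beta + E\left( \epsilon' X (X'X)^{-1} (X'X)^{-1} X'\epsilon \right) \\
&= \beta'\beta + Tr\,(X'X)^{-1} \sigma^2 \\
&= \beta'\beta + \sigma^2 \sum_{i=1}^{K} \lambda_i \quad \text{(the trace is the sum of eigenvalues)} \\
&> \beta'\beta + \lambda_{\max\left((X'X)^{-1}\right)} \sigma^2 \quad \text{(the eigenvalues are all positive, since } X'X \text{ is p.d.)}
\end{aligned}
\]
so
\[
E(\hat{\beta}'\hat{\beta}) > \beta'\beta + \frac{\sigma^2}{\lambda_{\min(X'X)}}
\]
where $\lambda_{\min(X'X)}$ is the minimum eigenvalue of $X'X$ (which is the inverse of the maximum eigenvalue of $(X'X)^{-1}$). As collinearity becomes worse and worse, $X'X$ becomes more nearly singular, so $\lambda_{\min(X'X)}$ tends to zero (recall that the determinant is the product of the eigenvalues) and $E(\hat{\beta}'\hat{\beta})$ tends to infinity. On the other hand, $\beta'\beta$ is finite.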
Now consider the restriction $I_K \beta = 0 + v$. With this restriction the model becomes
\[
\begin{bmatrix} y \\ 0 \end{bmatrix} = \begin{bmatrix} X \\ k I_K \end{bmatrix} \beta + \begin{bmatrix} \epsilon \\ kv \end{bmatrix}
\]
and the estimator is
\[
\begin{aligned}
\hat{\beta}_{ridge} &= \left( \begin{bmatrix} X' & k I_K \end{bmatrix} \begin{bmatrix} X \\ k I_K \end{bmatrix} \right)^{-1} \begin{bmatrix} X' & k I_K \end{bmatrix} \begin{bmatrix} y \\ 0 \end{bmatrix} \\
&= \left( X'X + k^2 I_K \right)^{-1} X'y
\end{aligned}
\]
This is the ordinary ridge regression estimator. The ridge regression estimator can be seen to add $k^2 I_K$, which is nonsingular, to $X'X$, which is more and more nearly singular as collinearity becomes worse and worse. As $k \to \infty$, the restrictions tend to $\beta = 0$, that is, the coefficients are shrunken toward zero. Also, the estimator tends to
\[
\hat{\beta}_{ridge} = \left( X'X + k^2 I_K \right)^{-1} X'y \to \left( k^2 I_K \right)^{-1} X'y = \frac{X'y}{k^2} \to 0
\]
so $\hat{\beta}_{ridge}'\hat{\beta}_{ridge} \to 0$. This is clearly a false restriction in the limit, if our original model is at all sensible.
There should be some amount of shrinkage that is in fact a true restriction. The problem is to determine the $k$ such that the restriction is correct. The interest in ridge regression centers on the fact that it can be shown that there exists a $k$ such that $MSE(\hat{\beta}_{ridge}) < MSE(\hat{\beta}_{OLS})$. The problem is that this $k$ depends on $\beta$ and $\sigma^2$, which are unknown.
The ridge trace method plots $\hat{\beta}_{ridge}'\hat{\beta}_{ridge}$ as a function of $k$, and chooses the value of $k$ that "artistically" seems appropriate (e.g., where the effect of increasing $k$ dies off). Draw picture here. This means of choosing $k$ is obviously subjective. This is not a problem from the Bayesian perspective: the choice of $k$ reflects prior beliefs about the length of $\beta$.
In summary, the ridge estimator offers some hope, but it is impossible to guarantee that it will outperform the OLS estimator. Collinearity is a fact of life in econometrics, and there is no clear solution to the problem.
The Octave script DataProblems/RidgeRegression.m does a Monte Carlo study that shows that ridge regression can help to deal with collinearity. This script generates the panels of Figure 7.4, which show the Monte Carlo sampling frequency of the OLS and ridge estimators, after subtracting the true parameter values. You can see that the ridge estimator has much lower RMSE.
Figure 7.4: OLS and Ridge regression. Panels: (a) OLS; (b) Ridge
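The ridge formula $(X'X + k^2 I_K)^{-1} X'y$ is easy to try out with two nearly collinear regressors. The following is a minimal sketch in Python (standard library only) and is not the DataProblems/RidgeRegression.m script; the data generating process, the value $k = 1$, the function names and the seed are all assumptions chosen for illustration. The design is favorable to ridge because the true coefficients are equal, so the result should not be read as a general guarantee.

```python
import math
import random

def ridge_2d(x1, x2, y, k):
    """Solve (X'X + k^2 I) b = X'y for X = [x1 x2] by Cramer's rule.
    Setting k = 0 gives the OLS estimator (no constant in the model)."""
    a11 = sum(u * u for u in x1) + k ** 2
    a22 = sum(u * u for u in x2) + k ** 2
    a12 = sum(u * v for u, v in zip(x1, x2))
    b1 = sum(u * v for u, v in zip(x1, y))
    b2 = sum(u * v for u, v in zip(x2, y))
    det = a11 * a22 - a12 ** 2
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

def simulate(n, rng):
    """y = 1*x1 + 1*x2 + e, where x2 is x1 plus a little noise (assumed DGP)."""
    x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x2 = [a + 0.1 * rng.gauss(0.0, 1.0) for a in x1]
    y = [a + b + rng.gauss(0.0, 1.0) for a, b in zip(x1, x2)]
    return x1, x2, y

def rmse(k, n=30, reps=1000, seed=1):
    """Root mean squared error of the coefficient vector over replications."""
    rng = random.Random(seed)
    sq = 0.0
    for _ in range(reps):
        x1, x2, y = simulate(n, rng)
        b1, b2 = ridge_2d(x1, x2, y, k)
        sq += (b1 - 1.0) ** 2 + (b2 - 1.0) ** 2
    return math.sqrt(sq / reps)

rmse_ols = rmse(0.0)
rmse_ridge = rmse(1.0)
print("RMSE, OLS:  ", round(rmse_ols, 3))
print("RMSE, ridge:", round(rmse_ridge, 3))
```

Because both calls use the same seed, OLS and ridge are compared on identical simulated samples. Trying a very large $k$ shows the shrinkage toward zero discussed above.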
7.2 Measurement error
Measurement error is exactly what it says: either the dependent variable or the regressors are measured with error. Thinking about the way economic data are reported, measurement error is probably quite prevalent. For example, estimates of growth of GDP, inflation, etc. are commonly revised several times. Why should the last revision necessarily be correct?
Error of measurement of the dependent variable
Measurement errors in the dependent variable and the regressors have important differences. First consider error in measurement of the dependent variable. The data generating process is presumed to be
\[
\begin{aligned}
y^* &= X\beta + \epsilon \\
y &= y^* + v \\
v_t &\sim iid(0, \sigma^2_v)
\end{aligned}
\]
where $y^* = y - v$ is the unobservable true dependent variable, and $y$ is what is observed. We assume that $\epsilon$ and $v$ are independent and that $y^* = X\beta + \epsilon$ satisfies the classical assumptions. Given this, we have
\[
y - v = X\beta + \epsilon
\]
so
\[
\begin{aligned}
y &= X\beta + \epsilon + v \\
&= X\beta + \omega \\
\omega_t &\sim iid(0, \sigma^2_\epsilon + \sigma^2_v)
\end{aligned}
\]
As long as $v$ is uncorrelated with $X$, this model satisfies the classical assumptions and can be estimated by OLS. This type of measurement error isn't a problem, then, except in that the increased variability of the error term causes an increase in the variance of the OLS estimator (see equation 7.1).
Error of measurement of the regressors
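The situation isn't so good in this case. The DGP is
\[
\begin{aligned}
y_t &= x_t^{*\prime}\beta + \epsilon_t \\
x_t &= x_t^* + v_t \\
v_t &\sim iid(0, \Sigma_v)
\end{aligned}
\]
where $\Sigma_v$ is a $K \times K$ matrix. Now $X^*$ contains the true, unobserved regressors, and $X$ is what is observed. Again assume that $v$ is independent of $\epsilon$, and that the model $y = X^*\beta + \epsilon$ satisfies the classical assumptions. Now we have
\[
\begin{aligned}
y_t &= (x_t - v_t)'\beta + \epsilon_t \\
&= x_t'\beta - v_t'\beta + \epsilon_t \\
&= x_t'\beta + \omega_t
\end{aligned}
\]
The problem is that now there is a correlation between $x_t$ and $\omega_t$, since
\[
\begin{aligned}
E(x_t \omega_t) &= E\left( (x_t^* + v_t)(-v_t'\beta + \epsilon_t) \right) \\
&= -\Sigma_v \beta
\end{aligned}
\]
where
\[
\Sigma_v = E(v_t v_t').
\]
Because of this correlation, the OLS estimator is biased and inconsistent, just as in the case of autocorrelated errors with lagged dependent variables. In matrix notation, write the estimated model as
\[
y = X\beta + \omega
\]
We have that
\[
\hat{\beta} = \left( \frac{X'X}{n} \right)^{-1} \left( \frac{X'y}{n} \right)
\]
and
\[
\begin{aligned}
plim \left( \frac{X'X}{n} \right)^{-1} &= plim \left( \frac{(X^{*\prime} + V')(X^* + V)}{n} \right)^{-1} \\
&= (Q_{X^*} + \Sigma_v)^{-1}
\end{aligned}
\]
since $X^*$ and $V$ are independent, and
\[
plim \frac{V'V}{n} = \lim E \frac{1}{n} \sum_{t=1}^{n} v_t v_t' = \Sigma_v
\]
Likewise,
\[
\begin{aligned}
plim \left( \frac{X'y}{n} \right) &= plim \frac{(X^{*\prime} + V')(X^*\beta + \epsilon)}{n} \\
&= Q_{X^*} \beta
\end{aligned}
\]
so
\[
plim\, \hat{\beta} = (Q_{X^*} + \Sigma_v)^{-1} Q_{X^*} \beta
\]
So we see that the least squares estimator is inconsistent when the regressors are measured with error.
A potential solution to this problem is the instrumental variables (IV) estimator, which we'll discuss shortly.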
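In the scalar case the probability limit above reduces to $\beta\, Q_{X^*} / (Q_{X^*} + \sigma^2_v)$, so the OLS coefficient is pulled toward zero. The sketch below illustrates this in Python with the standard library only; the single regressor without a constant, the unit variances, $\beta = 2$, the sample size and the seed are assumptions made for the example, and the function names are hypothetical.

```python
import random

def ols_no_constant(x, y):
    """OLS slope for y = b*x + error, with no constant term."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def attenuation_demo(n=20000, beta=2.0, sigma_v=1.0, seed=2):
    """Return the OLS estimate using the mismeasured regressor and its plim."""
    rng = random.Random(seed)
    x_star = [rng.gauss(0.0, 1.0) for _ in range(n)]          # true regressor, Q = 1
    y = [beta * xs + rng.gauss(0.0, 1.0) for xs in x_star]    # y depends on x*
    x = [xs + sigma_v * rng.gauss(0.0, 1.0) for xs in x_star] # observed x = x* + v
    b_hat = ols_no_constant(x, y)
    plim = beta * 1.0 / (1.0 + sigma_v ** 2)                  # (Q + s_v^2)^{-1} Q beta
    return b_hat, plim

b_hat, plim = attenuation_demo()
print("OLS estimate with measurement error:", round(b_hat, 3))
print("probability limit from the formula: ", plim)
```

With these settings the formula gives a limit of 1 rather than the true value 2, and the simulated estimate should be close to that limit. Setting `sigma_v=0.0` removes the measurement error and the estimate should then be close to 2.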
Example 19. Measurement error in a dynamic model. Consider the model
\[
\begin{aligned}
y_t^* &= \alpha + \rho y_{t-1}^* + \beta x_t + \epsilon_t \\
y_t &= y_t^* + \upsilon_t
\end{aligned}
\]
where $\epsilon_t$ and $\upsilon_t$ are independent Gaussian white noise errors. Suppose that $y_t^*$ is not observed, and instead we observe $y_t$. What are the properties of the OLS regression on the equation
\[
y_t = \alpha + \rho y_{t-1} + \beta x_t + \nu_t
\]
? The error is
\[
\begin{aligned}
\nu_t &= y_t - \alpha - \rho y_{t-1} - \beta x_t \\
&= y_t^* + \upsilon_t - \alpha - \rho y_{t-1}^* - \rho \upsilon_{t-1} - \beta x_t \\
&= \alpha + \rho y_{t-1}^* + \beta x_t + \epsilon_t + \upsilon_t - \alpha - \rho y_{t-1}^* - \rho \upsilon_{t-1} - \beta x_t \\
&= \epsilon_t + \upsilon_t - \rho \upsilon_{t-1}
\end{aligned}
\]
So the error term is autocorrelated. Note that $y_{t-1} = \alpha + \rho y_{t-2} + \beta x_{t-1} + \nu_{t-1}$, so the error $\nu_t$ and the regressor $y_{t-1}$ are correlated, because they share the common term $\upsilon_{t-1}$. This means that the equation
\[
y_t = \alpha + \rho y_{t-1} + \beta x_t + \nu_t
\]
does not satisfy weak exogeneity, and the OLS estimator will be biased and inconsistent.
The Octave script DataProblems/MeasurementError.m does a Monte Carlo study. The sample size is $n = 100$. Figure 7.5 gives the results. The first panel shows a histogram for 1000 replications of $\hat{\rho} - \rho$, when $\sigma_\upsilon = 1$, so that there is significant measurement error. The second panel repeats this with $\sigma_\upsilon = 0$, so that there is not measurement error. Note that there is much more bias with measurement error. There is also bias without measurement error. This is due to the same reason that we saw bias in Figure 3.7: one of the classical assumptions (nonstochastic regressors) that guarantees unbiasedness of OLS does not hold for this model. Without measurement error, the OLS estimator is consistent. By re-running the script with larger $n$, you can verify that the bias disappears when $\sigma_\upsilon = 0$, but not when $\sigma_\upsilon > 0$.
Figure 7.5: $\hat{\rho} - \rho$ with and without measurement error. Panels: (a) with measurement error: $\sigma_\upsilon = 1$; (b) without measurement error: $\sigma_\upsilon = 0$
7.3 Missing observations
Missing observations occur quite frequently: time series data may not be gathered in a certain year, or respondents to a survey may not answer all questions. We'll consider two cases: missing observations on the dependent variable and missing observations on the regressors.
Missing observations on the dependent variable
In this case, we have
\[
y = X\beta + \epsilon
\]
or
\[
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \beta + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \end{bmatrix}
\]
where $y_2$ is not observed. Otherwise, we assume the classical assumptions hold.
A clear alternative is to simply estimate using the complete observations
\[
y_1 = X_1 \beta + \epsilon_1
\]
Since these observations satisfy the classical assumptions, one could estimate by OLS.
The question remains whether or not one could somehow replace the unobserved y
2
by a predictor,
and improve over OLS in some sense. Let y
2
be the predictor of y
2
. Now

=
_

_
_

_
X
1
X
2
_

_
/
_

_
X
1
X
2
_

_
_

_
1
_

_
X
1
X
2
_

_
/
_

_
y
1
y
2
_

_
= [X
/
1
X
1
+ X
/
2
X
2
]
1
[X
/
1
y
1
+ X
/
2
y
2
]
Recall that the OLS fonc are
X
/
X

= X
/
y
so if we regressed using only the rst (complete) observations, we would have
X
/
1
X
1

1
= X
/
1
y
1.
Likewise, an OLS regression using only the second (lled in) observations would give
X
/
2
X
2

2
= X
/
2
y
2
.
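Substituting these into the equation for the overall combined estimator gives
$$\hat{\beta} = \left[ X_1'X_1 + X_2'X_2 \right]^{-1} \left[ X_1'X_1\hat{\beta}_1 + X_2'X_2\hat{\beta}_2 \right]$$
$$= \left[ X_1'X_1 + X_2'X_2 \right]^{-1} X_1'X_1\hat{\beta}_1 + \left[ X_1'X_1 + X_2'X_2 \right]^{-1} X_2'X_2\hat{\beta}_2$$
$$\equiv A\hat{\beta}_1 + (I_K - A)\hat{\beta}_2$$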
where
$$A \equiv \left[ X_1'X_1 + X_2'X_2 \right]^{-1} X_1'X_1$$
and we use
$$\left[ X_1'X_1 + X_2'X_2 \right]^{-1} X_2'X_2 = \left[ X_1'X_1 + X_2'X_2 \right]^{-1} \left[ (X_1'X_1 + X_2'X_2) - X_1'X_1 \right]$$
$$= I_K - \left[ X_1'X_1 + X_2'X_2 \right]^{-1} X_1'X_1$$
$$= I_K - A.$$
Now,
$$E(\hat{\beta}) = A\beta + (I_K - A)E(\hat{\beta}_2)$$
and this will be unbiased only if $E(\hat{\beta}_2) = \beta$.
The conclusion is that the filled in observations alone would need to define an unbiased estimator.
This will be the case only if
$$\hat{y}_2 = X_2\beta + \hat{\epsilon}_2$$
where $\hat{\epsilon}_2$ has mean zero. Clearly, it is difficult to satisfy this condition without knowledge of $\beta$.
Note that putting $\hat{y}_2 = \bar{y}_1$ does not satisfy the condition and therefore leads to a biased estimator.
Exercise 20. Formally prove this last statement.
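The decomposition of the combined estimator can be checked numerically. The following is a minimal sketch in Python with NumPy (not one of the Octave scripts that accompany these notes); the sample sizes, the parameter values and the helper names `make_X` and `ols` are assumptions made only for the illustration. It fills in the missing $y_2$ with the mean of $y_1$ and verifies that the OLS estimator on the filled-in sample equals $A\hat{\beta}_1 + (I_K - A)\hat{\beta}_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, K = 60, 40, 3                     # illustrative sizes (assumed)
beta = np.array([1.0, 2.0, -1.0])         # illustrative true parameters (assumed)

def make_X(n):
    # regressor matrix with a constant and K-1 random columns
    return np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])

def ols(X, y):
    # solve the normal equations X'X b = X'y
    return np.linalg.solve(X.T @ X, X.T @ y)

X1, X2 = make_X(n1), make_X(n2)
y1 = X1 @ beta + rng.normal(size=n1)
y2_hat = np.full(n2, y1.mean())           # ad hoc fill-in: y2_hat = mean of y1

b1 = ols(X1, y1)                          # complete observations only
b2 = ols(X2, y2_hat)                      # filled-in observations only
b_all = ols(np.vstack([X1, X2]), np.concatenate([y1, y2_hat]))

S = X1.T @ X1 + X2.T @ X2
A = np.linalg.solve(S, X1.T @ X1)         # A = [X1'X1 + X2'X2]^{-1} X1'X1
b_combo = A @ b1 + (np.eye(K) - A) @ b2

print(b_all)
print(b_combo)
print(b2)
```

In this sketch `b2` is the vector with the mean of $y_1$ in the first position and zeros elsewhere, whatever the true $\beta$ is, which is the source of the bias that Exercise 20 asks you to establish formally.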
The sample selection problem
In the above discussion we assumed that the missing observations are random. The sample selection
problem is a case where the missing observations are not random. Consider the model
$$y_t^* = x_t'\beta + \epsilon_t$$
which is assumed to satisfy the classical assumptions. However, $y_t^*$ is not always observed. What is
observed is $y_t$, defined as
$$y_t = y_t^* \text{ if } y_t^* \geq 0$$
Or, in other words, $y_t^*$ is missing when it is less than zero.
The difference in this case is that the missing values are not random: they are correlated with the
$x_t$. Consider the case
$$y^* = x + \epsilon$$
with $V(\epsilon) = 25$, but using only the observations for which $y^* > 0$ to estimate. Figure 7.6 illustrates
the bias. The Octave program is sampsel.m
Figure 7.6: Sample selection bias (plot of the data, the true line and the fitted line)
There are means of dealing with sample selection bias, but we will not go into it here. One should
at least be aware that nonrandom selection of the sample will normally lead to bias and inconsistency
if the problem is not taken into account.
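A minimal Python/NumPy sketch of the experiment just described follows. It is not the sampsel.m script itself; the sample size and the uniform distribution of x on [0, 10] are assumptions chosen to resemble the figure. It fits OLS using only the observations with $y^* > 0$ and, for comparison, using the full sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000                                        # assumed sample size
x = rng.uniform(0.0, 10.0, size=n)              # assumed regressor distribution
y_star = x + rng.normal(scale=5.0, size=n)      # y* = x + eps, V(eps) = 25

# Selected sample: only observations with y* > 0 are used
keep = y_star > 0
X_sel = np.column_stack([np.ones(keep.sum()), x[keep]])
b_sel = np.linalg.lstsq(X_sel, y_star[keep], rcond=None)[0]

# Full sample (infeasible when y* is missing), for comparison
X_full = np.column_stack([np.ones(n), x])
b_full = np.linalg.lstsq(X_full, y_star, rcond=None)[0]

print("selected sample (intercept, slope):", b_sel)
print("full sample     (intercept, slope):", b_full)
```

The true intercept is 0 and the true slope is 1. With the selected sample the fitted intercept is pushed up and the fitted slope is pushed down, because low values of x are only observed together with large positive errors.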
Missing observations on the regressors
Again the model is
$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \beta + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \end{bmatrix}$$
but we assume now that each row of $X_2$ has an unobserved component(s). Again, one could just
estimate using the complete observations, but it may seem frustrating to have to drop observations
simply because of a single missing variable. In general, if the unobserved $X_2$ is replaced by some
prediction, $X_2^*$, then we are in the case of errors of observation. As before, this means that the OLS
estimator is biased when $X_2^*$ is used instead of $X_2$. Consistency is salvaged, however, as long as the
number of missing observations doesn't increase with n.
Including observations that have missing values replaced by ad hoc values can be interpreted
as introducing false stochastic restrictions. In general, this introduces bias. It is difficult to
determine whether MSE increases or decreases. Monte Carlo studies suggest that it is dangerous
to simply substitute the mean, for example.
In the case that there is only one regressor other than the constant, substitution of $\bar{x}$ for the
missing $x_t$ does not lead to bias. This is a special case that doesn't hold for K > 2.
Exercise 21. Prove this last statement.
In summary, if one is strongly concerned with bias, it is best to drop observations that have
missing components. There is potential for reduction of MSE through filling in missing elements
with intelligent guesses, but this could also increase MSE.
7.4 Missing regressors
Suppose that the model $y = X\beta + W\gamma + \epsilon$ satisfies the classical assumptions, so OLS would be a
consistent estimator. However, let's suppose that the regressors W are not available in the sample.
What are the properties of the OLS estimator of the model $y = X\beta + \omega$? We can think of this as
a case of imposing false restrictions: $\gamma = 0$ when in fact $\gamma \neq 0$. We know that the restricted least
squares estimator is biased and inconsistent, in general, when we impose false restrictions. Another
way of thinking of this is to look to see if X and $\omega$ are correlated. We have
$$E(X_t\omega_t) = E\left( X_t(W_t'\gamma + \epsilon_t) \right)$$
$$= E(X_tW_t'\gamma) + E(X_t\epsilon_t)$$
$$= E(X_tW_t'\gamma)$$
where the last line follows because $E(X_t\epsilon_t) = 0$ by assumption. So, there will be correlation between
the error and the regressors if there is collinearity between the included regressors $X_t$ and the missing
regressors $W_t$. If there is not, the OLS estimator will be consistent. Because the normal thing is to have
collinearity between regressors, we expect that missing regressors will lead to bias and inconsistency
of the OLS estimator.
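The argument can be illustrated with a small simulation. This is a hedged sketch in Python with NumPy, with one included regressor x and one missing regressor w; the parameter values, the sample size and the function name `short_regression_slope` are assumptions made for the example. When x and w are correlated the slope of the short regression settles near $\beta + \gamma\rho$ instead of $\beta$; when they are uncorrelated it settles near $\beta$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
beta, gamma = 1.0, 2.0                    # assumed true coefficients

def short_regression_slope(rho):
    # x and w are standard normal with correlation rho
    x = rng.normal(size=n)
    w = rho * x + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    y = beta * x + gamma * w + rng.normal(size=n)
    # regress y on x only: w is the "missing regressor"
    return (x @ y) / (x @ x)

b_corr = short_regression_slope(0.5)      # probability limit: beta + gamma * 0.5 = 2
b_uncorr = short_regression_slope(0.0)    # probability limit: beta = 1

print(b_corr, b_uncorr)
```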
7.5 Exercises
1. Consider the simple Nerlove model
$$\ln C = \beta_1 + \beta_2 \ln Q + \beta_3 \ln P_L + \beta_4 \ln P_F + \beta_5 \ln P_K + \epsilon$$
When this model is estimated by OLS, some coefficients are not significant. We have seen that
collinearity is not an important problem. Why is $\beta_5$ not significantly different from zero? Give
an economic explanation.
2. For the model $y = \beta_1x_1 + \beta_2x_2 + \epsilon$,
(a) verify that the level sets of the OLS criterion function (defined in equation 3.2) are straight
lines when there is perfect collinearity
(b) For this model with perfect collinearity, the OLS estimator does not exist. Depict what this
statement means using a drawing.
(c) Show how a restriction $R_1\beta_1 + R_2\beta_2 = r$ causes the restricted least squares estimator to
exist, using a drawing.
Chapter 8
Functional form and nonnested
tests
Though theory often suggests which conditioning variables should be included, and suggests the signs
of certain derivatives, it is usually silent regarding the functional form of the relationship between the
dependent variable and the regressors. For example, considering a cost function, one could have a
Cobb-Douglas model
$$c = Aw_1^{\beta_1}w_2^{\beta_2}q^{\beta_q}e^{\epsilon}$$
This model, after taking logarithms, gives
$$\ln c = \beta_0 + \beta_1 \ln w_1 + \beta_2 \ln w_2 + \beta_q \ln q + \epsilon$$
where $\beta_0 = \ln A$. Theory suggests that $A > 0$, $\beta_1 > 0$, $\beta_2 > 0$, $\beta_q > 0$. This model isn't compatible with
a fixed cost of production since c = 0 when q = 0. Homogeneity of degree one in input prices suggests
that $\beta_1 + \beta_2 = 1$, while constant returns to scale implies $\beta_q = 1$.
While this model may be reasonable in some cases, an alternative
$$\sqrt{c} = \beta_0 + \beta_1\sqrt{w_1} + \beta_2\sqrt{w_2} + \beta_q\sqrt{q} + \epsilon$$
may be just as plausible. Note that $\sqrt{x}$ and $\ln(x)$ look quite alike, for certain values of the regressors,
and up to a linear transformation, so it may be difficult to choose between these models.
The basic point is that many functional forms are compatible with the linear-in-parameters model,
since this model can incorporate a wide variety of nonlinear transformations of the dependent variable
and the regressors. For example, suppose that $g(\cdot)$ is a real valued function and that $x(\cdot)$ is a K-
vector-valued function. The following model is linear in the parameters but nonlinear in the variables:
$$x_t = x(z_t)$$
$$y_t = x_t'\beta + \epsilon_t$$
There may be P fundamental conditioning variables $z_t$, but there may be K regressors, where K may
be smaller than, equal to or larger than P. For example, $x_t$ could include squares and cross products
of the conditioning variables in $z_t$.
8.1 Flexible functional forms
Given that the functional form of the relationship between the dependent variable and the regressors is
in general unknown, one might wonder if there exist parametric models that can closely approximate
a wide variety of functional relationships. A Diewert-Flexible functional form is defined as one such
that the function, the vector of first derivatives and the matrix of second derivatives can take on an
arbitrary value at a single data point. Flexibility in this sense clearly requires that there be at least
$$K = 1 + P + \left( P^2 - P \right)/2 + P$$
free parameters: one for each independent effect that we wish to model.
Suppose that the model is
$$y = g(x) + \epsilon$$
A second-order Taylor's series expansion (with remainder term) of the function g(x) about the point
x = 0 is
$$g(x) = g(0) + x'D_xg(0) + \frac{x'D_x^2g(0)x}{2} + R$$
Use the approximation, which simply drops the remainder term, as an approximation to g(x):
$$g(x) \simeq g_K(x) = g(0) + x'D_xg(0) + \frac{x'D_x^2g(0)x}{2}$$
As $x \to 0$, the approximation becomes more and more exact, in the sense that $g_K(x) \to g(x)$,
$D_xg_K(x) \to D_xg(x)$ and $D_x^2g_K(x) \to D_x^2g(x)$. For x = 0, the approximation is exact, up to the
second order. The idea behind many flexible functional forms is to note that $g(0)$, $D_xg(0)$ and $D_x^2g(0)$
are all constants. If we treat them as parameters, the approximation will have exactly enough free
parameters to approximate the function g(x), which is of unknown form, exactly, up to second order,
at the point x = 0. The model is
$$g_K(x) = \alpha + x'\beta + 1/2\,x'\Gamma x$$
so the regression model to fit is
$$y = \alpha + x'\beta + 1/2\,x'\Gamma x + \epsilon$$
While the regression model has enough free parameters to be Diewert-flexible, the question
remains: is plim $\hat{\alpha} = g(0)$? Is plim $\hat{\beta} = D_xg(0)$? Is plim $\hat{\Gamma} = D_x^2g(0)$?
The answer is no, in general. The reason is that if we treat the true values of the parameters as
these derivatives, then $\epsilon$ is forced to play the part of the remainder term, which is a function of
x, so that x and $\epsilon$ are correlated in this case. As before, the estimator is biased in this case.
A simpler example would be to consider a first-order T.S. approximation to a quadratic function.
Draw picture.
The conclusion is that flexible functional forms aren't really flexible in a useful statistical
sense, in that neither the function itself nor its derivatives are consistently estimated, unless the
function belongs to the parametric family of the specified functional form. In order to lead to
consistent inferences, the regression model must be correctly specified.
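The simpler example mentioned above (a first-order approximation to a quadratic function) can be sketched numerically. The following Python/NumPy snippet is only an illustration under assumed values: the true function is taken to be $g(x) = x^2$, x is drawn uniformly on [0, 2], and a linear model is fit by OLS. At the expansion point, $g(0) = 0$ and $D_xg(0) = 0$, but the OLS intercept and slope settle near $-2/3$ and $2$.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 20000
x = rng.uniform(0.0, 2.0, size=n)               # assumed regressor distribution
y = x**2 + rng.normal(scale=0.1, size=n)        # g(x) = x^2, so g(0) = 0 and g'(0) = 0

# First-order "flexible" form: y = alpha + beta * x + error
X = np.column_stack([np.ones(n), x])
alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print("alpha_hat:", alpha_hat, " g(0)  = 0")
print("beta_hat: ", beta_hat, " g'(0) = 0")
```

The limits $-2/3$ and $2$ follow from the moments of the uniform distribution on [0, 2]: the slope converges to $Cov(x, x^2)/V(x) = (2/3)/(1/3)$ and the intercept to $E(x^2) - 2E(x) = 4/3 - 2$.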
The translog form
In spite of the fact that FFFs aren't really flexible for the purposes of econometric estimation and
inference, they are useful, and they are certainly subject to less bias due to misspecification of the
functional form than are many popular forms, such as the Cobb-Douglas or the simple linear in the
variables model. The translog model is probably the most widely used FFF. This model is as above,
except that the variables are subjected to a logarithmic transformation. Also, the expansion point is
usually taken to be the sample mean of the data, after the logarithmic transformation. The model is
defined by
$$y = \ln(c)$$
$$x = \ln\left( \frac{z}{\bar{z}} \right) = \ln(z) - \ln(\bar{z})$$
$$y = \alpha + x'\beta + 1/2\,x'\Gamma x + \epsilon$$
In this presentation, the t subscript that distinguishes observations is suppressed for simplicity. Note
that
$$\frac{\partial y}{\partial x} = \beta + \Gamma x$$
$$= \frac{\partial \ln(c)}{\partial \ln(z)} \text{ (the other part of x is constant)}$$
$$= \frac{\partial c}{\partial z}\frac{z}{c}$$
which is the elasticity of c with respect to z. This is a convenient feature of the translog model. Note
that at the means of the conditioning variables, $\bar{z}$, x = 0, so
$$\left. \frac{\partial y}{\partial x} \right|_{z = \bar{z}} = \beta$$
so the $\beta$ are the first-order elasticities, at the means of the data.
To illustrate, consider that y is cost of production:
$$y = c(w, q)$$
where w is a vector of input prices and q is output. We could add other variables by extending q in
the obvious manner, but this is suppressed for simplicity. By Shephard's lemma, the conditional factor
demands are
$$x = \frac{\partial c(w, q)}{\partial w}$$
and the cost shares of the factors are therefore
$$s = \frac{wx}{c} = \frac{\partial c(w, q)}{\partial w}\frac{w}{c}$$
which is simply the vector of elasticities of cost with respect to input prices. If the cost function is
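modeled using a translog function, we have
$$\ln(c) = \alpha + x'\beta + z'\delta + 1/2 \begin{bmatrix} x' & z \end{bmatrix} \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{12}' & \Gamma_{22} \end{bmatrix} \begin{bmatrix} x \\ z \end{bmatrix}$$
$$= \alpha + x'\beta + z'\delta + 1/2\,x'\Gamma_{11}x + x'\Gamma_{12}z + 1/2\,z^2\Gamma_{22}$$
where $x = \ln(w/\bar{w})$ (element-by-element division) and $z = \ln(q/\bar{q})$, and
$$\Gamma_{11} = \begin{bmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{12} & \gamma_{22} \end{bmatrix}, \quad \Gamma_{12} = \begin{bmatrix} \gamma_{13} \\ \gamma_{23} \end{bmatrix}, \quad \Gamma_{22} = \gamma_{33}.$$
Note that symmetry of the second derivatives has been imposed.
Then the share equations are just
$$s = \beta + \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \end{bmatrix} \begin{bmatrix} x \\ z \end{bmatrix}$$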
Therefore, the share equations and the cost equation have parameters in common. By pooling the
equations together and imposing the (true) restriction that the parameters of the equations be the
same, we can gain efficiency.
To illustrate in more detail, consider the case of two inputs, so
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.$$
In this case the translog model of the logarithmic cost function is
$$\ln c = \alpha + \beta_1x_1 + \beta_2x_2 + \delta z + \frac{\gamma_{11}}{2}x_1^2 + \frac{\gamma_{22}}{2}x_2^2 + \frac{\gamma_{33}}{2}z^2 + \gamma_{12}x_1x_2 + \gamma_{13}x_1z + \gamma_{23}x_2z$$
The two cost shares of the inputs are the derivatives of ln c with respect to $x_1$ and $x_2$:
$$s_1 = \beta_1 + \gamma_{11}x_1 + \gamma_{12}x_2 + \gamma_{13}z$$
$$s_2 = \beta_2 + \gamma_{12}x_1 + \gamma_{22}x_2 + \gamma_{23}z$$
Note that the share equations and the cost equation have parameters in common. One can do
a pooled estimation of the three equations at once, imposing that the parameters are the same. In
this way we're using more observations and therefore more information, which will lead to improved
efficiency. Note that this does assume that the cost equation is correctly specified (i.e., not an
approximation), since otherwise the derivatives would not be the true derivatives of the log cost function,
and would then be misspecified for the shares. To pool the equations, write the model in matrix form
(adding in error terms)
$$\begin{bmatrix} \ln c \\ s_1 \\ s_2 \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_2 & z & \frac{x_1^2}{2} & \frac{x_2^2}{2} & \frac{z^2}{2} & x_1x_2 & x_1z & x_2z \\ 0 & 1 & 0 & 0 & x_1 & 0 & 0 & x_2 & z & 0 \\ 0 & 0 & 1 & 0 & 0 & x_2 & 0 & x_1 & 0 & z \end{bmatrix} \begin{bmatrix} \alpha \\ \beta_1 \\ \beta_2 \\ \delta \\ \gamma_{11} \\ \gamma_{22} \\ \gamma_{33} \\ \gamma_{12} \\ \gamma_{13} \\ \gamma_{23} \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix}$$
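This is one observation on the three equations. With the appropriate notation, a single observation
can be written as
$$y_t = X_t\theta + \epsilon_t$$
The overall model would stack n observations on the three equations for a total of 3n observations:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix} \theta + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$
Next we need to consider the errors. For observation t the errors can be placed in a vector
$$\epsilon_t = \begin{bmatrix} \epsilon_{1t} \\ \epsilon_{2t} \\ \epsilon_{3t} \end{bmatrix}$$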
First consider the covariance matrix of this vector: the shares are certainly correlated since they
must sum to one. (In fact, with 2 shares the variances are equal and the covariance is -1 times the
variance. General notation is used to allow easy extension to the case of more than 2 inputs). Also,
it's likely that the shares and the cost equation have different variances. Supposing that the model is
covariance stationary, the variance of $\epsilon_t$ won't depend upon t:
$$Var\,\epsilon_t = \Sigma_0 = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \cdot & \sigma_{22} & \sigma_{23} \\ \cdot & \cdot & \sigma_{33} \end{bmatrix}$$
Note that this matrix is singular, since the shares sum to 1. Assuming that there is no autocorrelation,
the overall covariance matrix has the seemingly unrelated regressions (SUR) structure.
$$Var \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} = \Sigma = \begin{bmatrix} \Sigma_0 & 0 & \cdots & 0 \\ 0 & \Sigma_0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & \Sigma_0 \end{bmatrix} = I_n \otimes \Sigma_0$$
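where the symbol $\otimes$ indicates the Kronecker product. The Kronecker product of two matrices A ($p \times q$) and
B is
$$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1q}B \\ a_{21}B & \ddots & & \vdots \\ \vdots & & & \\ a_{p1}B & \cdots & & a_{pq}B \end{bmatrix}.$$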
FGLS estimation of a translog model
So, this model has heteroscedasticity and autocorrelation, so OLS won't be efficient. The next question
is: how do we estimate efficiently using FGLS? FGLS is based upon inverting the estimated error
covariance $\hat{\Sigma}$. So we need to estimate $\Sigma$.
An asymptotically efficient procedure is (supposing normality of the errors)
1. Estimate each equation by OLS
2. Estimate $\Sigma_0$ using
$$\hat{\Sigma}_0 = \frac{1}{n}\sum_{t=1}^{n}\hat{\epsilon}_t\hat{\epsilon}_t'$$
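3. Next we need to account for the singularity of $\Sigma_0$. It can be shown that $\hat{\Sigma}_0$ will be singular when
the shares sum to one, so FGLS won't work. The solution is to drop one of the share equations,
for example the second. The model becomes
$$\begin{bmatrix} \ln c \\ s_1 \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_2 & z & \frac{x_1^2}{2} & \frac{x_2^2}{2} & \frac{z^2}{2} & x_1x_2 & x_1z & x_2z \\ 0 & 1 & 0 & 0 & x_1 & 0 & 0 & x_2 & z & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta_1 \\ \beta_2 \\ \delta \\ \gamma_{11} \\ \gamma_{22} \\ \gamma_{33} \\ \gamma_{12} \\ \gamma_{13} \\ \gamma_{23} \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \end{bmatrix}$$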
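or in matrix notation for the observation:
$$y_t^* = X_t^*\theta + \epsilon_t^*$$
and in stacked notation for all observations we have the 2n observations:
$$\begin{bmatrix} y_1^* \\ y_2^* \\ \vdots \\ y_n^* \end{bmatrix} = \begin{bmatrix} X_1^* \\ X_2^* \\ \vdots \\ X_n^* \end{bmatrix} \theta + \begin{bmatrix} \epsilon_1^* \\ \epsilon_2^* \\ \vdots \\ \epsilon_n^* \end{bmatrix}$$
or, finally in matrix notation for all observations:
$$y^* = X^*\theta + \epsilon^*$$
Considering the error covariance, we can define
$$\Sigma_0^* = Var \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \end{bmatrix}$$
$$\Sigma^* = I_n \otimes \Sigma_0^*$$
Define $\hat{\Sigma}_0^*$ as the leading 2 × 2 block of $\hat{\Sigma}_0$, and form
$$\hat{\Sigma}^* = I_n \otimes \hat{\Sigma}_0^*.$$
This is a consistent estimator, following the consistency of OLS and applying a LLN.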
4. Next compute the Cholesky factorization
$$\hat{P}_0 = Chol\left( (\hat{\Sigma}_0^*)^{-1} \right)$$
(I am assuming this is defined as an upper triangular matrix, which is consistent with the way
Octave does it) and the Cholesky factorization of the overall covariance matrix of the 2 equation
model, which can be calculated as
$$\hat{P} = Chol\left( (\hat{\Sigma}^*)^{-1} \right) = I_n \otimes \hat{P}_0$$
5. Finally the FGLS estimator can be calculated by applying OLS to the transformed model
$$\hat{P}y^* = \hat{P}X^*\theta + \hat{P}\epsilon^*$$
or by directly using the GLS formula
$$\hat{\theta}_{FGLS} = \left( X^{*\prime}(\hat{\Sigma}^*)^{-1}X^* \right)^{-1} X^{*\prime}(\hat{\Sigma}^*)^{-1}y^*$$
It is equivalent to transform each observation individually:
$$\hat{P}_0y_t^* = \hat{P}_0X_t^*\theta + \hat{P}_0\epsilon_t^*$$
and then apply OLS. This is probably the simplest approach.
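The Kronecker structure used in steps 4 and 5 can be checked with a few lines of Python/NumPy. This is a sketch under assumed values, not the notes' own code: the 2 × 2 covariance `Sigma0` is hypothetical, and `chol_upper` is a helper introduced here because NumPy returns a lower triangular factor, while the text assumes an upper triangular one with $P'P = \Sigma^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4                                          # number of observations (assumed)
Sigma0 = np.array([[2.0, 0.6],
                   [0.6, 1.0]])                # hypothetical 2 x 2 error covariance

def chol_upper(M):
    # numpy gives lower L with L L' = M; its transpose R is upper with R'R = M
    return np.linalg.cholesky(M).T

P0 = chol_upper(np.linalg.inv(Sigma0))         # P0'P0 = Sigma0^{-1}
P = np.kron(np.eye(n), P0)                     # I_n (x) P0
Sigma = np.kron(np.eye(n), Sigma0)             # I_n (x) Sigma0

# Stacked data: n blocks of 2 rows each (random numbers stand in for real data)
y = rng.normal(size=2 * n)

# Transforming the stacked vector equals transforming each observation by P0
y_stacked = P @ y
y_blockwise = np.concatenate([P0 @ y[2 * t: 2 * t + 2] for t in range(n)])

print(np.allclose(P, chol_upper(np.linalg.inv(Sigma))))
print(np.allclose(P @ Sigma @ P.T, np.eye(2 * n)))
print(np.allclose(y_stacked, y_blockwise))
```

Working observation by observation avoids forming the $2n \times 2n$ matrix at all, which is why it is the simplest approach in practice.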


A few last comments.
1. We have assumed no autocorrelation across time. This is clearly restrictive. It is relatively simple
to relax this, but we won't go into it here.
2. Also, we have only imposed symmetry of the second derivatives. Another restriction that the
model should satisfy is that the estimated shares should sum to 1. This can be accomplished by
imposing
$$\beta_1 + \beta_2 = 1$$
$$\sum_{i=1}^{3}\gamma_{ij} = 0, \quad j = 1, 2, 3.$$
These are linear parameter restrictions, so they are easy to impose and will improve efficiency if
they are true.
3. The estimation procedure outlined above can be iterated. That is, estimate $\hat{\theta}_{FGLS}$ as above, then
re-estimate $\Sigma_0^*$ using errors calculated as
$$\hat{\epsilon} = y - X\hat{\theta}_{FGLS}$$
These might be expected to lead to a better estimate than the estimator based on $\hat{\theta}_{OLS}$, since
FGLS is asymptotically more efficient. Then re-estimate $\theta$ using the new estimated error
covariance. It can be shown that if this is repeated until the estimates don't change (i.e., iterated to
convergence) then the resulting estimator is the MLE. At any rate, the asymptotic properties
of the iterated and uniterated estimators are the same, since both are based upon a consistent
estimator of the error covariance.
8.2 Testing nonnested hypotheses
Given that the choice of functional form isn't perfectly clear, in that many possibilities exist, how
can one choose between forms? When one form is a parametric restriction of another, the previously
studied tests such as Wald, LR, score or qF are all possibilities. For example, the Cobb-Douglas model
is a parametric restriction of the translog: The translog is
$$y_t = \alpha + x_t'\beta + 1/2\,x_t'\Gamma x_t + \epsilon$$
where the variables are in logarithms, while the Cobb-Douglas is
$$y_t = \alpha + x_t'\beta + \epsilon$$
so a test of the Cobb-Douglas versus the translog is simply a test that $\Gamma = 0$.
The situation is more complicated when we want to test non-nested hypotheses. If the two functional
forms are linear in the parameters, and use the same transformation of the dependent variable, then
they may be written as
$$M_1: y = X\beta + \epsilon, \quad \epsilon_t \sim iid(0, \sigma_\epsilon^2)$$
$$M_2: y = Z\gamma + \eta, \quad \eta_t \sim iid(0, \sigma_\eta^2)$$
We wish to test hypotheses of the form: $H_0$: $M_i$ is correctly specified versus $H_A$: $M_i$ is misspecified,
for i = 1, 2.
One could account for non-iid errors, but we'll suppress this for simplicity.
There are a number of ways to proceed. We'll consider the J test, proposed by Davidson and
MacKinnon, Econometrica (1981). The idea is to artificially nest the two models, e.g.,
$$y = (1 - \alpha)X\beta + \alpha(Z\gamma) + \omega$$
If the first model is correctly specified, then the true value of $\alpha$ is zero. On the other hand, if
the second model is correctly specified then $\alpha = 1$.
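The problem is that this model is not identified in general. For example, if the models share
some regressors, as in
$$M_1: y_t = \beta_1 + \beta_2x_{2t} + \beta_3x_{3t} + \epsilon_t$$
$$M_2: y_t = \gamma_1 + \gamma_2x_{2t} + \gamma_3x_{4t} + \eta_t$$
then the composite model is
$$y_t = (1 - \alpha)\beta_1 + (1 - \alpha)\beta_2x_{2t} + (1 - \alpha)\beta_3x_{3t} + \alpha\gamma_1 + \alpha\gamma_2x_{2t} + \alpha\gamma_3x_{4t} + \omega_t$$
Combining terms we get
$$y_t = \left( (1 - \alpha)\beta_1 + \alpha\gamma_1 \right) + \left( (1 - \alpha)\beta_2 + \alpha\gamma_2 \right)x_{2t} + (1 - \alpha)\beta_3x_{3t} + \alpha\gamma_3x_{4t} + \omega_t$$
$$= \delta_1 + \delta_2x_{2t} + \delta_3x_{3t} + \delta_4x_{4t} + \omega_t$$
The four $\delta$'s are consistently estimable, but $\alpha$ is not, since we have four equations in 7 unknowns, so
one can't test the hypothesis that $\alpha = 0$.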
The idea of the J test is to substitute $\hat{\gamma}$ in place of $\gamma$. This is a consistent estimator supposing
that the second model is correctly specified. It will tend to a finite probability limit even if the second
model is misspecified. Then estimate the model
$$y = (1 - \alpha)X\beta + \alpha(Z\hat{\gamma}) + \omega$$
$$= X\theta + \alpha\hat{y} + \omega$$
where $\hat{y} = Z(Z'Z)^{-1}Z'y = P_Zy$. In this model, $\alpha$ is consistently estimable, and one can show that,
under the hypothesis that the first model is correct, $\hat{\alpha} \overset{p}{\to} 0$ and that the ordinary t-statistic for $\alpha = 0$
is asymptotically normal:
$$t = \frac{\hat{\alpha}}{\hat{\sigma}_{\hat{\alpha}}} \overset{a}{\sim} N(0, 1)$$
If the second model is correctly specified, then $t \overset{p}{\to} \infty$, since $\hat{\alpha}$ tends in probability to 1, while
its estimated standard error tends to zero. Thus the test will always reject the false null model,
asymptotically, since the statistic will eventually exceed any critical value with probability one.
We can reverse the roles of the models, testing the second against the first.
It may be the case that neither model is correctly specified. In this case, the test will still reject
the null hypothesis, asymptotically, if we use critical values from the N(0, 1) distribution, since
as long as $\hat{\alpha}$ tends to something different from zero, $|t| \overset{p}{\to} \infty$. Of course, when we switch the
roles of the models the other will also be rejected asymptotically.
In summary, there are 4 possible outcomes when we test two models, each against the other.
Both may be rejected, neither may be rejected, or one of the two may be rejected.
There are other tests available for non-nested models. The J test is simple to apply when
both models are linear in the parameters. The P-test is similar, but easier to apply when $M_1$ is
nonlinear.
The above presentation assumes that the same transformation of the dependent variable is used
by both models. MacKinnon, White and Davidson, Journal of Econometrics, (1983) shows how
to deal with the case of different transformations.
Monte-Carlo evidence shows that these tests often over-reject a correctly specified model. One can
use bootstrap critical values to get better-performing tests.
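The mechanics of the J test can be sketched in Python with NumPy. The data generating process, the sample size and the function name `j_test_t` below are assumptions made for the illustration: the data come from $M_1$ (regressors 1, $x_2$, $x_3$), and $M_2$ replaces $x_3$ with an irrelevant $x_4$. The function adds the fitted values of the alternative model to the null model and returns the ordinary t-statistic of their coefficient.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x2, x3, x4 = rng.normal(size=(3, n))
# Data generated by M1: y = 1 + x2 + x3 + eps (x4 plays no role)
y = 1.0 + x2 + x3 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x2, x3])    # regressors of M1
Z = np.column_stack([np.ones(n), x2, x4])    # regressors of M2

def j_test_t(y, X_null, Z_alt):
    """t-statistic for alpha = 0 in y = X_null*theta + alpha*yhat_alt + omega."""
    yhat_alt = Z_alt @ np.linalg.lstsq(Z_alt, y, rcond=None)[0]   # P_Z y
    W = np.column_stack([X_null, yhat_alt])
    coef = np.linalg.lstsq(W, y, rcond=None)[0]
    resid = y - W @ coef
    s2 = resid @ resid / (len(y) - W.shape[1])
    cov = s2 * np.linalg.inv(W.T @ W)        # ordinary OLS covariance estimator
    return coef[-1] / np.sqrt(cov[-1, -1])

t_m1_null = j_test_t(y, X, Z)    # null: M1 is correct (true in this design)
t_m2_null = j_test_t(y, Z, X)    # null: M2 is correct (false in this design)

print("t-stat, H0: M1 correct:", t_m1_null)
print("t-stat, H0: M2 correct:", t_m2_null)
```

In this design the statistic for the false null ($M_2$) is far beyond any conventional critical value, while the statistic for the true null ($M_1$) is of ordinary size.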
Chapter 9
Generalized least squares
Recall the assumptions of the classical linear regression model, in Section 3.6. One of the assumptions
we've made up to now is that
$$\epsilon_t \sim IID(0, \sigma^2)$$
or occasionally
$$\epsilon_t \sim IIN(0, \sigma^2).$$
Now we'll investigate the consequences of nonidentically and/or dependently distributed errors. We'll
assume fixed regressors for now, to keep the presentation simple, and later we'll look at the
consequences of relaxing this admittedly unrealistic assumption. The model is
$$y = X\beta + \epsilon$$
$$E(\epsilon) = 0$$
$$V(\epsilon) = \Sigma$$
where $\Sigma$ is a general symmetric positive definite matrix (we'll write $\beta$ in place of $\beta_0$ to simplify the
typing of these notes).
The case where $\Sigma$ is a diagonal matrix gives uncorrelated, nonidentically distributed errors. This
is known as heteroscedasticity: $\exists\, i, j$ s.t. $V(\epsilon_i) \neq V(\epsilon_j)$
The case where $\Sigma$ has the same number on the main diagonal but nonzero elements off the main
diagonal gives identically (assuming higher moments are also the same) dependently distributed
errors. This is known as autocorrelation: $\exists\, i \neq j$ s.t. $E(\epsilon_i\epsilon_j) \neq 0$
The general case combines heteroscedasticity and autocorrelation. This is known as nonspherical
disturbances, though why this term is used, I have no idea. Perhaps it's because under the
classical assumptions, a joint confidence region for $\epsilon$ would be an n dimensional hypersphere.
9.1 Effects of nonspherical disturbances on the OLS estimator
The least squares estimator is
$$\hat{\beta} = (X'X)^{-1}X'y$$
$$= \beta + (X'X)^{-1}X'\epsilon$$
We have unbiasedness, as before.
The variance of $\hat{\beta}$ is
$$E\left[ (\hat{\beta} - \beta)(\hat{\beta} - \beta)' \right] = E\left[ (X'X)^{-1}X'\epsilon\epsilon'X(X'X)^{-1} \right]$$
$$= (X'X)^{-1}X'\Sigma X(X'X)^{-1} \quad (9.1)$$
Due to this, any test statistic that is based upon an estimator of $\sigma^2$ is invalid, since there isn't
any $\sigma^2$, it doesn't exist as a feature of the true d.g.p. In particular, the formulas for the t, F, $\chi^2$
based tests given above do not lead to statistics with these distributions.
$\hat{\beta}$ is still consistent, following exactly the same argument given before.
If $\epsilon$ is normally distributed, then
$$\hat{\beta} \sim N\left( \beta, (X'X)^{-1}X'\Sigma X(X'X)^{-1} \right)$$
The problem is that $\Sigma$ is unknown in general, so this distribution won't be useful for testing
hypotheses.
Without normality, and with stochastic X (e.g., weakly exogenous regressors) we still have
$$\sqrt{n}\left( \hat{\beta} - \beta \right) = \sqrt{n}(X'X)^{-1}X'\epsilon$$
$$= \left( \frac{X'X}{n} \right)^{-1} n^{-1/2}X'\epsilon$$
Define the limiting variance of $n^{-1/2}X'\epsilon$ (supposing a CLT applies) as
$$\lim_{n \to \infty} E\left( \frac{X'\epsilon\epsilon'X}{n} \right) = \Omega, \text{ a.s.}$$
so we obtain $\sqrt{n}\left( \hat{\beta} - \beta \right) \overset{d}{\to} N\left( 0, Q_X^{-1}\Omega Q_X^{-1} \right)$. Note that the true asymptotic distribution of the
OLS has changed with respect to the results under the classical assumptions. If we neglect to
take this into account, the Wald and score tests will not be asymptotically valid. So we need to
figure out how to take it into account.
To see the invalidity of test procedures that are correct under the classical assumptions, when we
have nonspherical errors, consider the Octave script GLS/EffectsOLS.m. This script does a Monte
Carlo study, generating data that are either heteroscedastic or homoscedastic, and then computes the
empirical rejection frequency of a nominally 10% t-test. When the data are heteroscedastic, we obtain
something like what we see in Figure 9.1. This sort of heteroscedasticity causes us to reject a true null
hypothesis regarding the slope parameter much too often. You can experiment with the script to look
at the effects of other sorts of heteroscedasticity (HET), and to vary the sample size.
Figure 9.1: Rejection frequency of 10% t-test, H0 is true.
Summary: OLS with heteroscedasticity and/or autocorrelation is:
unbiased with fixed or strongly exogenous regressors
biased with weakly exogenous regressors
has a different variance than before, so the previous test statistics aren't valid
is consistent
is asymptotically normally distributed, but with a different limiting covariance matrix. Previous
test statistics aren't valid in this case for this reason.
is inefficient, as is shown below.
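A Monte Carlo experiment in the spirit of the one described above can be sketched in Python with NumPy. This is not the GLS/EffectsOLS.m script; the design is an assumption chosen for the illustration: the regressor is standard normal, the true coefficients are zero, and under heteroscedasticity the error standard deviation is $|x_t|$. The function name `rejection_frequency` and the use of the normal critical value 1.645 for the nominal 10% test are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def rejection_frequency(heteroscedastic, n=100, reps=2000, crit=1.645):
    """Share of replications in which the usual t-test rejects a true H0: slope = 0."""
    rejections = 0
    for _ in range(reps):
        x = rng.normal(size=n)
        scale = np.abs(x) if heteroscedastic else np.ones(n)
        y = scale * rng.normal(size=n)            # true intercept and slope are zero
        X = np.column_stack([np.ones(n), x])
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ X.T @ y
        e = y - X @ b
        s2 = e @ e / (n - 2)                      # usual estimator of sigma^2
        t = b[1] / np.sqrt(s2 * XtX_inv[1, 1])    # usual t-statistic for the slope
        rejections += abs(t) > crit
    return rejections / reps

freq_hom = rejection_frequency(False)
freq_het = rejection_frequency(True)
print("homoscedastic:  ", freq_hom)
print("heteroscedastic:", freq_het)
```

Under homoscedasticity the rejection frequency is close to the nominal 10%; under this form of heteroscedasticity it is much higher.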
9.2 The GLS estimator
Suppose $\Sigma$ were known. Then one could form the Cholesky decomposition
$$P'P = \Sigma^{-1}$$
Here, P is an upper triangular matrix. We have
$$P'P\Sigma = I_n$$
so
$$P'P\Sigma P' = P',$$
which implies that
$$P\Sigma P' = I_n$$
Let's take some time to play with the Cholesky decomposition. Try out the GLS/cholesky.m Octave
script to see that the above claims are true, and also to see how one can generate data from a N(0, V)
distribution.
Consider the model
$$Py = PX\beta + P\epsilon,$$
or, making the obvious definitions,
$$y^* = X^*\beta + \epsilon^*.$$
The variance of $\epsilon^* = P\epsilon$ is
$$E(P\epsilon\epsilon'P') = P\Sigma P' = I_n$$
Therefore, the model
$$y^* = X^*\beta + \epsilon^*$$
$$E(\epsilon^*) = 0$$
$$V(\epsilon^*) = I_n$$
satisfies the classical assumptions. The GLS estimator is simply OLS applied to the transformed
model:
$$\hat{\beta}_{GLS} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^*$$
$$= (X'P'PX)^{-1}X'P'Py$$
$$= (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y$$
The GLS estimator is unbiased in the same circumstances under which the OLS estimator is
unbiased. For example, assuming X is nonstochastic
$$E(\hat{\beta}_{GLS}) = E\left[ (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y \right]$$
$$= E\left[ (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}(X\beta + \epsilon) \right]$$
$$= \beta.$$
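To get the variance of the estimator, we have
$$\hat{\beta}_{GLS} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^*$$
$$= (X^{*\prime}X^*)^{-1}X^{*\prime}(X^*\beta + \epsilon^*)$$
$$= \beta + (X^{*\prime}X^*)^{-1}X^{*\prime}\epsilon^*$$
so
$$E\left[ \left( \hat{\beta}_{GLS} - \beta \right)\left( \hat{\beta}_{GLS} - \beta \right)' \right] = E\left[ (X^{*\prime}X^*)^{-1}X^{*\prime}\epsilon^*\epsilon^{*\prime}X^*(X^{*\prime}X^*)^{-1} \right]$$
$$= (X^{*\prime}X^*)^{-1}X^{*\prime}X^*(X^{*\prime}X^*)^{-1}$$
$$= (X^{*\prime}X^*)^{-1}$$
$$= (X'\Sigma^{-1}X)^{-1}$$
Either of these last formulas can be used.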
All the previous results regarding the desirable properties of the least squares estimator hold,
when dealing with the transformed model, since the transformed model satisfies the classical
assumptions.
Tests are valid, using the previous formulas, as long as we substitute $X^*$ in place of X.
Furthermore, any test that involves $\sigma^2$ can set it to 1. This is preferable to re-deriving the appropriate
formulas.
The GLS estimator is more efficient than the OLS estimator. This is a consequence of the
Gauss-Markov theorem, since the GLS estimator is based on a model that satisfies the classical
assumptions but the OLS estimator is not. To see this directly, note that
$$Var(\hat{\beta}) - Var(\hat{\beta}_{GLS}) = (X'X)^{-1}X'\Sigma X(X'X)^{-1} - (X'\Sigma^{-1}X)^{-1}$$
$$= A\Sigma A'$$
where $A = \left[ (X'X)^{-1}X' - (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1} \right]$. This may not seem obvious, but it is true, as you
can verify for yourself. Then noting that $A\Sigma A'$ is a quadratic form in a positive definite matrix,
we conclude that $A\Sigma A'$ is positive semi-definite, and that GLS is efficient relative to OLS.
As one can verify by calculating first order conditions, the GLS estimator is the solution to the
minimization problem
$$\hat{\beta}_{GLS} = \arg\min (y - X\beta)'\Sigma^{-1}(y - X\beta)$$
so the metric $\Sigma^{-1}$ is used to weight the residuals.
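The claims of this section can be verified numerically. The sketch below uses Python with NumPy rather than the GLS/cholesky.m script; the covariance matrix (unequal variances combined with AR(1)-type correlation), the sample size and the parameter values are all assumptions made for the example. It checks that $P\Sigma P' = I_n$, that OLS on the transformed model equals the direct GLS formula, and that $Var(\hat{\beta}) - Var(\hat{\beta}_{GLS}) = A\Sigma A'$ is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(6)
n, K = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, -2.0, 0.5])                 # assumed true parameters

# Hypothetical known Sigma: unequal variances and correlation 0.6^|i-j|
sd = np.linspace(0.5, 2.0, n)
idx = np.arange(n)
corr = 0.6 ** np.abs(np.subtract.outer(idx, idx))
Sigma = corr * np.outer(sd, sd)

# Errors with covariance Sigma: eps = L u with L L' = Sigma and u ~ N(0, I)
L = np.linalg.cholesky(Sigma)
y = X @ beta + L @ rng.normal(size=n)

Sigma_inv = np.linalg.inv(Sigma)
P = np.linalg.cholesky(Sigma_inv).T               # upper triangular, P'P = Sigma^{-1}

# GLS two ways: OLS on the transformed model, and the direct formula
b_transformed = np.linalg.lstsq(P @ X, P @ y, rcond=None)[0]
b_formula = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

# Variance comparison
XtX_inv = np.linalg.inv(X.T @ X)
V_ols = XtX_inv @ X.T @ Sigma @ X @ XtX_inv
V_gls = np.linalg.inv(X.T @ Sigma_inv @ X)
A = XtX_inv @ X.T - V_gls @ X.T @ Sigma_inv
diff_eigs = np.linalg.eigvalsh(V_ols - V_gls)

print(np.allclose(P @ Sigma @ P.T, np.eye(n)))
print(np.allclose(b_transformed, b_formula))
print(np.allclose(V_ols - V_gls, A @ Sigma @ A.T))
print(diff_eigs)
```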
9.3 Feasible GLS
The problem is that ordinarily $\Sigma$ isn't known, so this estimator isn't available.
Consider the dimension of $\Sigma$: it's an n × n matrix with $(n^2 - n)/2 + n = (n^2 + n)/2$ unique
elements (remember - it is symmetric, because it's a covariance matrix).
The number of parameters to estimate is larger than n and increases faster than n. There's no
way to devise an estimator that satisfies a LLN without adding restrictions.
The feasible GLS estimator is based upon making sufficient assumptions regarding the form of
$\Sigma$ so that a consistent estimator can be devised.
Suppose that we parameterize $\Sigma$ as a function of X and $\theta$, where $\theta$ may include $\beta$ as well as other
parameters, so that
$$\Sigma = \Sigma(X, \theta)$$
where $\theta$ is of fixed dimension. If we can consistently estimate $\theta$, we can consistently estimate $\Sigma$, as
long as the elements of $\Sigma(X, \theta)$ are continuous functions of $\theta$ (by the Slutsky theorem). In this case,
$$\hat{\Sigma} = \Sigma(X, \hat{\theta}) \overset{p}{\to} \Sigma(X, \theta)$$
If we replace $\Sigma$ in the formulas for the GLS estimator with $\hat{\Sigma}$, we obtain the FGLS estimator. The
FGLS estimator shares the same asymptotic properties as GLS. These are
1. Consistency
2. Asymptotic normality
3. Asymptotic efficiency if the errors are normally distributed. (Cramer-Rao).
4. Test procedures are asymptotically valid.
In practice, the usual way to proceed is
1. Define a consistent estimator of $\theta$. This is a case-by-case proposition, depending on the
parameterization $\Sigma(\theta)$. We'll see examples below.
2. Form $\hat{\Sigma} = \Sigma(X, \hat{\theta})$
3. Calculate the Cholesky factorization $\hat{P} = Chol(\hat{\Sigma}^{-1})$.
4. Transform the model using
$$\hat{P}y = \hat{P}X\beta + \hat{P}\epsilon$$
5. Estimate using OLS on the transformed model.
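The five steps can be sketched for one concrete parameterization. The Python/NumPy example below assumes, purely for illustration, a diagonal $\Sigma$ with $V(\epsilon_t) = \exp(\theta_0 + \theta_1x_t)$; this variance model, the parameter values and the way $\theta$ is estimated (regressing the log of the squared OLS residuals on a constant and $x_t$) are choices made for the example, not something prescribed by the text. With this estimator the intercept of the variance equation is not consistently estimated, but that only rescales $\hat{\Sigma}$ by a constant, which leaves the FGLS estimate unchanged.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
x = rng.uniform(0.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 2.0])                        # assumed true parameters

# Assumed variance model: Var(eps_t) = exp(theta0 + theta1 * x_t)
theta0, theta1 = -1.0, 1.5
sigma_t = np.exp(0.5 * (theta0 + theta1 * x))
y = X @ beta + sigma_t * rng.normal(size=n)

# Step 1: estimate theta from the OLS residuals
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b_ols
theta_hat = np.linalg.lstsq(X, np.log(e**2), rcond=None)[0]

# Step 2: Sigma_hat is diagonal with elements exp(x_t' theta_hat)
var_hat = np.exp(X @ theta_hat)

# Step 3: for a diagonal Sigma_hat, Chol(Sigma_hat^{-1}) is diagonal with 1/sd_t
p = 1.0 / np.sqrt(var_hat)

# Steps 4 and 5: transform the model and apply OLS
b_fgls = np.linalg.lstsq(X * p[:, None], y * p, rcond=None)[0]

# Same thing from the GLS formula with Sigma_hat in place of Sigma
b_formula = np.linalg.solve(X.T @ (X / var_hat[:, None]), X.T @ (y / var_hat))

print("OLS: ", b_ols)
print("FGLS:", b_fgls)
print("theta1_hat:", theta_hat[1])
```

Because $\hat{\Sigma}$ is diagonal here, the transformation is applied as a vector of weights and the $n \times n$ matrix is never formed.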
9.4 Heteroscedasticity
Heteroscedasticity is the case where
$$E(\epsilon\epsilon') = \Sigma$$
is a diagonal matrix, so that the errors are uncorrelated, but have different variances. Heteroscedasticity
is usually thought of as associated with cross sectional data, though there is absolutely no reason
why time series data cannot also be heteroscedastic. Actually, the popular ARCH (autoregressive
conditionally heteroscedastic) models explicitly assume that a time series is heteroscedastic.
Consider a supply function
$$q_i = \beta_1 + \beta_pP_i + \beta_sS_i + \epsilon_i$$
where $P_i$ is price and $S_i$ is some measure of size of the $i$th firm. One might suppose that unobservable
factors (e.g., talent of managers, degree of coordination between production units, etc.) account for
the error term $\epsilon_i$. If there is more variability in these factors for large firms than for small firms, then
$\epsilon_i$ may have a higher variance when $S_i$ is high than when it is low.
Another example, individual demand.
$$q_i = \beta_1 + \beta_pP_i + \beta_mM_i + \epsilon_i$$
where P is price and M is income. In this case, $\epsilon_i$ can reflect variations in preferences. There are
more possibilities for expression of preferences when one is rich, so it is possible that the variance of
$\epsilon_i$ could be higher when M is high.
Add example of group means.
OLS with heteroscedastic consistent varcov estimation
Eicker (1967) and White (1980) showed how to modify test statistics to account for heteroscedasticity
of unknown form. The OLS estimator has asymptotic distribution
$$\sqrt{n}\left( \hat{\beta} - \beta \right) \overset{d}{\to} N\left( 0, Q_X^{-1}\Omega Q_X^{-1} \right)$$
as we've already seen. Recall that we defined
$$\lim_{n \to \infty} E\left( \frac{X'\epsilon\epsilon'X}{n} \right) = \Omega$$
This matrix has dimension K × K and can be consistently estimated, even if we can't estimate $\Sigma$
consistently. The consistent estimator, under heteroscedasticity but no autocorrelation is
$$\hat{\Omega} = \frac{1}{n}\sum_{t=1}^{n}x_tx_t'\hat{\epsilon}_t^2$$
One can then modify the previous test statistics to obtain tests that are valid when there is
heteroscedasticity of unknown form. For example, the Wald test for $H_0: R\beta - r = 0$ would be
$$n\left( R\hat{\beta} - r \right)' \left[ R\left( \frac{X'X}{n} \right)^{-1} \hat{\Omega} \left( \frac{X'X}{n} \right)^{-1} R' \right]^{-1} \left( R\hat{\beta} - r \right) \overset{a}{\sim} \chi^2(q)$$
To see the effects of ignoring HET when doing OLS, and the good effect of using a HET consistent
covariance estimator, consider the script bootstrap_example1.m. This script generates data from a
linear model with HET, then computes standard errors using the ordinary OLS formula, the Eicker-
White formula, and also bootstrap standard errors. Note that Eicker-White and bootstrap pretty
much agree, while the OLS formula gives standard errors that are quite different. Typical output of
this script follows:
octave:1> bootstrap_example1
Bootstrap standard errors
0.083376 0.090719 0.143284
*********************************************************
OLS estimation results
Observations 100
R-squared 0.014674
Sigma-squared 0.695267
Results (Ordinary var-cov estimator)
estimate st.err. t-stat. p-value
1 -0.115 0.084 -1.369 0.174
2 -0.016 0.083 -0.197 0.845
3 -0.105 0.088 -1.189 0.237
*********************************************************
OLS estimation results
Observations 100
R-squared 0.014674
Sigma-squared 0.695267
Results (Het. consistent var-cov estimator)
estimate st.err. t-stat. p-value
1 -0.115 0.084 -1.381 0.170
2 -0.016 0.090 -0.182 0.856
3 -0.105 0.140 -0.751 0.454
If you run this several times, you will notice that the OLS standard error for the last parameter appears to be biased downward, at least compared to the other two methods, which are asymptotically valid.
The true coefficients are zero. With a standard error biased downward, the t-test for lack of significance will reject more often than it should (the variables really are not significant, but we will find that they seem to be more often than is due to Type-I error).
For example, you should see that the p-value for the last coefficient is smaller than 0.10 more than 10% of the time. Run the script 20 times and you'll see.
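The comparison between the ordinary and the Eicker-White standard errors can be sketched in a few lines. The following Python/NumPy sketch is not the bootstrap_example1.m script: the data generating process (true coefficients zero, error standard deviation proportional to |x|), the sample size and the function name `ols_with_white` are assumptions made only for illustration.

```python
import numpy as np

def ols_with_white(X, y):
    """Return OLS estimates, ordinary s.e. and Eicker-White s.e."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    # Ordinary formula: s^2 (X'X)^{-1}
    s2 = e @ e / (n - k)
    se_ols = np.sqrt(np.diag(s2 * XtX_inv))
    # Eicker-White: (X'X)^{-1} [sum_t x_t x_t' e_t^2] (X'X)^{-1}
    meat = (X * e[:, None] ** 2).T @ X
    se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_ols, se_white

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# Assumed DGP: true coefficients are zero, error s.d. is |x|
y = np.abs(x) * rng.normal(size=n)

beta, se_ols, se_white = ols_with_white(X, y)
print("OLS s.e.:  ", se_ols)
print("White s.e.:", se_white)
```

Under this assumed design the ordinary formula should understate the standard error of the slope, which is the same qualitative pattern as in the output above.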
Detection
There exist many tests for the presence of heteroscedasticity. We'll discuss three methods.
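Goldfeld-Quandt The sample is divided into three parts, with $n_1$, $n_2$ and $n_3$ observations, where $n_1 + n_2 + n_3 = n$. The model is estimated using the first and third parts of the sample, separately, so that $\hat{\beta}^1$ and $\hat{\beta}^3$ will be independent. Then we have
$$\frac{\hat{\varepsilon}^{1\prime}\hat{\varepsilon}^{1}}{\sigma^2} = \frac{\varepsilon^{1\prime}M^1\varepsilon^{1}}{\sigma^2} \overset{d}{\to} \chi^2(n_1 - K)$$
and
$$\frac{\hat{\varepsilon}^{3\prime}\hat{\varepsilon}^{3}}{\sigma^2} = \frac{\varepsilon^{3\prime}M^3\varepsilon^{3}}{\sigma^2} \overset{d}{\to} \chi^2(n_3 - K)$$
so
$$\frac{\hat{\varepsilon}^{1\prime}\hat{\varepsilon}^{1}/(n_1 - K)}{\hat{\varepsilon}^{3\prime}\hat{\varepsilon}^{3}/(n_3 - K)} \overset{d}{\to} F(n_1 - K, n_3 - K).$$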
The distributional result is exact if the errors are normally distributed. This test is a two-tailed test. Alternatively, and probably more conventionally, if one has prior ideas about the possible magnitudes of the variances of the observations, one could order the observations accordingly, from largest to smallest. In this case, one would use a conventional one-tailed F-test. Draw picture.
Ordering the observations is an important step if the test is to have any power.
The motive for dropping the middle observations is to increase the difference between the average variance in the subsamples, supposing that there exists heteroscedasticity. This can increase the power of the test. On the other hand, dropping too many observations will substantially increase the variance of the statistics $\hat{\varepsilon}^{1\prime}\hat{\varepsilon}^{1}$ and $\hat{\varepsilon}^{3\prime}\hat{\varepsilon}^{3}$. A rule of thumb, based on Monte Carlo experiments, is to drop around 25% of the observations.
If one doesn't have any ideas about the form of the heteroscedasticity, the test will probably have low power since a sensible data ordering isn't available.
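The mechanics of the Goldfeld-Quandt statistic can be sketched as follows. This is a Python/NumPy sketch under assumptions, not the author's GLS/HetTests.m: the function names, the simulated data (error standard deviation proportional to x) and the even split of the retained observations are all illustrative choices. Only the statistic and its degrees of freedom are computed; the p-value would come from the F distribution.

```python
import numpy as np

def ess(X, y):
    """Sum of squared OLS residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return e @ e

def goldfeld_quandt(X, y, order_by, drop_frac=0.25):
    """GQ statistic, observations sorted from largest to smallest suspected variance."""
    n, k = X.shape
    idx = np.argsort(-order_by)          # descending order
    X, y = X[idx], y[idx]
    n_drop = int(round(drop_frac * n))   # middle observations to drop
    n1 = (n - n_drop) // 2
    n3 = n - n_drop - n1
    ess1 = ess(X[:n1], y[:n1])
    ess3 = ess(X[n - n3:], y[n - n3:])
    stat = (ess1 / (n1 - k)) / (ess3 / (n3 - k))
    return stat, n1 - k, n3 - k

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + x + x * rng.normal(size=n)     # assumed DGP: error s.d. equals x

stat, df1, df2 = goldfeld_quandt(X, y, order_by=x)
print(stat, df1, df2)
```

With the data ordered from largest to smallest suspected variance, a value well above 1 is evidence against homoscedasticity in the one-tailed version of the test.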
White's test When one has little idea if there exists heteroscedasticity, and no idea of its potential form, the White test is a possibility. The idea is that if there is homoscedasticity, then
$$\mathcal{E}(\varepsilon_t^2 | x_t) = \sigma^2, \forall t$$
so that $x_t$ or functions of $x_t$ shouldn't help to explain $\mathcal{E}(\varepsilon_t^2)$. The test works as follows:
1. Since $\varepsilon_t$ isn't available, use the consistent estimator $\hat{\varepsilon}_t$ instead.
2. Regress
$$\hat{\varepsilon}_t^2 = \sigma^2 + z_t'\gamma + v_t$$
where $z_t$ is a $P$-vector. $z_t$ may include some or all of the variables in $x_t$, as well as other variables. White's original suggestion was to use $x_t$, plus the set of all unique squares and cross products of variables in $x_t$.
3. Test the hypothesis that $\gamma = 0$. The $qF$ statistic in this case is
$$qF = \frac{P\left(ESS_R - ESS_U\right)/P}{ESS_U/(n - P - 1)}$$
Note that $ESS_R = TSS_U$, so dividing both numerator and denominator by this we get
$$qF = (n - P - 1)\frac{R^2}{1 - R^2}$$
Note that this is the $R^2$ of the artificial regression used to test for heteroscedasticity, not the $R^2$ of the original model.
An asymptotically equivalent statistic, under the null of no heteroscedasticity (so that $R^2$ should tend to zero), is
$$nR^2 \overset{a}{\sim} \chi^2(P).$$
This doesn't require normality of the errors, though it does assume that the fourth moment of $\varepsilon_t$ is constant, under the null. Question: why is this necessary?
The White test has the disadvantage that it may not be very powerful unless the $z_t$ vector is chosen well, and this is hard to do without knowledge of the form of heteroscedasticity.
It also has the problem that specification errors other than heteroscedasticity may lead to rejection.
Note: the null hypothesis of this test may be interpreted as $\theta = 0$ for the variance model $V(\varepsilon_t^2) = h(\alpha + z_t'\theta)$, where $h(\cdot)$ is an arbitrary function of unknown form. The test is more general than it may appear from the regression that is used.
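The three steps of the White test can be sketched as below. This Python/NumPy sketch assumes White's original choice of $z_t$ (the regressors plus all unique squares and cross products) and an illustrative data generating process; the function name `white_test_stat` is hypothetical and this is not the code in GLS/HetTests.m.

```python
import numpy as np

def white_test_stat(X, y):
    """n*R^2 of the artificial regression. X must have a constant in column 0."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e2 = (y - X @ beta) ** 2                       # squared OLS residuals
    # z_t: non-constant regressors plus unique squares and cross products
    cols = [X[:, j] for j in range(1, k)]
    cols += [X[:, i] * X[:, j] for i in range(1, k) for j in range(i, k)]
    Z = np.column_stack([np.ones(n)] + cols)
    g, *_ = np.linalg.lstsq(Z, e2, rcond=None)
    resid = e2 - Z @ g
    r2 = 1.0 - resid @ resid / np.sum((e2 - e2.mean()) ** 2)
    return n * r2, Z.shape[1] - 1                  # statistic and P

rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
# Assumed DGP: error s.d. is |x1|, so the error variance is x1^2
y = 1.0 + x1 + x2 + np.abs(x1) * rng.normal(size=n)

stat, P = white_test_stat(X, y)
print(stat, P)
```

The statistic is compared with a $\chi^2(P)$ critical value; with two non-constant regressors $P = 5$, and the 5% critical value is about 11.07.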
Plotting the residuals A very simple method is to simply plot the residuals (or their squares).
Draw pictures here. Like the Goldfeld-Quandt test, this will be more informative if the observations
are ordered according to the suspected form of the heteroscedasticity.
Correction
Correcting for heteroscedasticity requires that a parametric form for $\Sigma(\theta)$ be supplied, and that a means for estimating $\theta$ consistently be determined. The estimation method will be specific to the form supplied for $\Sigma(\theta)$. We'll consider two examples. Before this, let's consider the general nature of GLS when there is heteroscedasticity.
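When we have HET but no AUT, $\Sigma$ is a diagonal matrix:
$$\Sigma = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \sigma_n^2 \end{bmatrix}$$
Likewise, $\Sigma^{-1}$ is diagonal
$$\Sigma^{-1} = \begin{bmatrix} \frac{1}{\sigma_1^2} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2^2} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \frac{1}{\sigma_n^2} \end{bmatrix}$$
and so is the Cholesky decomposition $P = \mathrm{chol}(\Sigma^{-1})$
$$P = \begin{bmatrix} \frac{1}{\sigma_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \frac{1}{\sigma_n} \end{bmatrix}$$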
We need to transform the model, just as before, in the general case:
$$Py = PX\beta + P\varepsilon,$$
or, making the obvious definitions,
$$y^* = X^*\beta + \varepsilon^*.$$
Note that multiplying by $P$ just divides the data for each observation $(y_i, x_i)$ by the corresponding standard error of the error term, $\sigma_i$. That is, $y_i^* = y_i/\sigma_i$ and $x_i^* = x_i/\sigma_i$ (note that $x_i$ is a $K$-vector: we divided each element, including the 1 corresponding to the constant).
This makes sense. Consider Figure 9.2, which shows a true regression line with heteroscedastic errors. Which sample is more informative about the location of the line? The ones with observations with smaller variances.
Figure 9.2: Motivation for GLS correction when there is HET
So, the GLS solution is equivalent to OLS on the transformed data. The transformed data is the original data, weighted by the inverse of the standard error of the observation's error term. When the standard error is small, the weight is high, and vice versa. The GLS correction for the case of HET is also known as weighted least squares, for this reason.
Multiplicative heteroscedasticity
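Suppose the model is
$$y_t = x_t'\beta + \varepsilon_t$$
$$\sigma_t^2 = \mathcal{E}(\varepsilon_t^2) = \left(z_t'\gamma\right)^\delta$$
but the other classical assumptions hold. In this case
$$\varepsilon_t^2 = \left(z_t'\gamma\right)^\delta + v_t$$
and $v_t$ has mean zero. Nonlinear least squares could be used to estimate $\gamma$ and $\delta$ consistently, were $\varepsilon_t$ observable. The solution is to substitute the squared OLS residuals $\hat{\varepsilon}_t^2$ in place of $\varepsilon_t^2$, since it is consistent by the Slutsky theorem. Once we have $\hat{\gamma}$ and $\hat{\delta}$, we can estimate $\sigma_t^2$ consistently using
$$\hat{\sigma}_t^2 = \left(z_t'\hat{\gamma}\right)^{\hat{\delta}} \overset{p}{\to} \sigma_t^2.$$
In the second step, we transform the model by dividing by the standard deviation:
$$\frac{y_t}{\hat{\sigma}_t} = \frac{x_t'\beta}{\hat{\sigma}_t} + \frac{\varepsilon_t}{\hat{\sigma}_t}$$
or
$$y_t^* = x_t^{*\prime}\beta + \varepsilon_t^*.$$
Asymptotically, this model satisfies the classical assumptions.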
This model is a bit complex in that NLS is required to estimate the model of the variance. A simpler version would be
$$y_t = x_t'\beta + \varepsilon_t$$
$$\sigma_t^2 = \mathcal{E}(\varepsilon_t^2) = \sigma^2 z_t^\delta$$
where $z_t$ is a single variable. There are still two parameters to be estimated, and the model of the variance is still nonlinear in the parameters. However, the search method can be used in this case to reduce the estimation problem to repeated applications of OLS.
First, we define an interval of reasonable values for $\delta$, e.g., $\delta \in [0, 3]$.
Partition this interval into $M$ equally spaced values, e.g., $\{0, .1, .2, ..., 2.9, 3\}$.
For each of these values, calculate the variable $z_t^{\delta_m}$.
The regression
$$\hat{\varepsilon}_t^2 = \sigma^2 z_t^{\delta_m} + v_t$$
is linear in the parameters, conditional on $\delta_m$, so one can estimate $\sigma^2$ by OLS.
Save the pairs $(\sigma_m^2, \delta_m)$, and the corresponding $ESS_m$. Choose the pair with the minimum $ESS_m$ as the estimate.
Next, divide the model by the estimated standard deviations.
Can refine. Draw picture.
Works well when the parameter to be searched over is low dimensional, as in this case.
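The search method described above can be sketched as follows. This is an illustrative Python/NumPy sketch, not code from the accompanying scripts: the true values $\sigma^2 = 0.5$ and $\delta = 2$, the distribution of $z_t$ and the function name `search_delta` are assumptions.

```python
import numpy as np

def search_delta(e2, z, grid):
    """For each delta, regress e2 on z**delta (no constant); keep the minimum-ESS pair."""
    best = None
    for d in grid:
        w = z ** d
        sigma2 = (w @ e2) / (w @ w)            # OLS with a single regressor
        ess = np.sum((e2 - sigma2 * w) ** 2)
        if best is None or ess < best[2]:
            best = (sigma2, d, ess)
    return best

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
z = rng.uniform(1.0, 3.0, size=n)
sigma2_true, delta_true = 0.5, 2.0             # assumed true values
eps = np.sqrt(sigma2_true * z ** delta_true) * rng.normal(size=n)
y = 1.0 + x + eps

# First step: OLS residuals
X = np.column_stack([np.ones(n), x])
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e2 = (y - X @ b_ols) ** 2

# Grid of 31 equally spaced values on [0, 3]
grid = np.linspace(0.0, 3.0, 31)
sigma2_hat, delta_hat, _ = search_delta(e2, z, grid)

# Second step: divide by the estimated standard deviations, then OLS
sd_hat = np.sqrt(sigma2_hat * z ** delta_hat)
b_fgls, *_ = np.linalg.lstsq(X / sd_hat[:, None], y / sd_hat, rcond=None)
print(sigma2_hat, delta_hat, b_fgls)
```

A finer grid around the first minimizer is one way to refine the estimate, at the cost of more OLS fits.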
Groupwise heteroscedasticity
A common case is where we have repeated observations on each of a number of economic agents: e.g., 10 years of macroeconomic data on each of a set of countries or regions, or daily observations of transactions of 200 banks. This sort of data is a pooled cross-section time-series model. It may be reasonable to presume that the variance is constant over time within the cross-sectional units, but that it differs across them (e.g., firms or countries of different sizes...). The model is
$$y_{it} = x_{it}'\beta + \varepsilon_{it}$$
$$\mathcal{E}(\varepsilon_{it}^2) = \sigma_i^2, \forall t$$
where $i = 1, 2, ..., G$ are the agents, and $t = 1, 2, ..., n$ are the observations on each agent.
The other classical assumptions are presumed to hold.
In this case, the variance $\sigma_i^2$ is specific to each agent, but constant over the $n$ observations for that agent.
In this model, we assume that $\mathcal{E}(\varepsilon_{it}\varepsilon_{is}) = 0$. This is a strong assumption that we'll relax later.
To correct for heteroscedasticity, just estimate each $\sigma_i^2$ using the natural estimator:
$$\hat{\sigma}_i^2 = \frac{1}{n}\sum_{t=1}^{n}\hat{\varepsilon}_{it}^2$$
Note that we use $1/n$ here since it's possible that there are more than $n$ regressors, so $n - K$ could be negative. Asymptotically the difference is unimportant.
With each of these, transform the model as usual:
$$\frac{y_{it}}{\hat{\sigma}_i} = \frac{x_{it}'\beta}{\hat{\sigma}_i} + \frac{\varepsilon_{it}}{\hat{\sigma}_i}$$
Do this for each cross-sectional group. This transformed model satisfies the classical assumptions, asymptotically.
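The groupwise correction can be sketched as below. This Python/NumPy sketch uses simulated data with four groups whose error standard deviations are assumed to be 0.5, 1, 2 and 4; the function name `groupwise_fgls` is hypothetical and the sketch is not the GLS/NerloveGLS.m script.

```python
import numpy as np

def groupwise_fgls(X, y, groups):
    """FGLS under groupwise heteroscedasticity: scale each group by its residual s.d."""
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b_ols
    sd = np.empty(len(y))
    for g in np.unique(groups):
        mask = groups == g
        sd[mask] = np.sqrt(np.mean(e[mask] ** 2))   # uses 1/n, as in the text
    b_fgls, *_ = np.linalg.lstsq(X / sd[:, None], y / sd, rcond=None)
    return b_fgls, sd

rng = np.random.default_rng(0)
G, n_per = 4, 50
groups = np.repeat(np.arange(G), n_per)
sd_true = np.array([0.5, 1.0, 2.0, 4.0])[groups]   # assumed group s.d.'s
x = rng.normal(size=G * n_per)
X = np.column_stack([np.ones(G * n_per), x])
y = 1.0 + x + sd_true * rng.normal(size=G * n_per)

b_fgls, sd_hat = groupwise_fgls(X, y, groups)
print(b_fgls)
print(sd_hat[::n_per])     # one estimated s.d. per group
```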
Example: the Nerlove model (again!)
Remember the Nerlove data - see sections 3.8 and 5.8. Let's check the Nerlove data for evidence of heteroscedasticity. In what follows, we're going to use the model with the constant and output coefficient varying across 5 groups, but with the input price coefficients fixed (see Equation 5.5 for the rationale behind this). Figure 9.3, which is generated by the Octave program GLS/NerloveResiduals.m, plots the residuals. We can see pretty clearly that the error variance is larger for small firms than for larger firms.
Now let's try out some tests to formally check for heteroscedasticity. The Octave program GLS/HetTests.m performs the White and Goldfeld-Quandt tests, using the above model. The results are
                Value     p-value
White's test    61.903    0.000
                Value     p-value
GQ test         10.886    0.000
Figure 9.3: Residuals, Nerlove model, sorted by firm size (plot titled "Regression residuals"; series: Residuals; horizontal axis 0 to 160, vertical axis -1.5 to 1.5)
All in all, it is very clear that the data are heteroscedastic. That means that OLS estimation is not efficient, and tests of restrictions that ignore heteroscedasticity are not valid. The previous tests (CRTS, HOD1 and the Chow test) were calculated assuming homoscedasticity. The Octave program GLS/NerloveRestrictions-Het.m uses the Wald test to check for CRTS and HOD1, but using a heteroscedastic-consistent covariance estimator.[1] The results are
Testing HOD1
Value p-value
Wald test 6.161 0.013
Testing CRTS
Value p-value
Wald test 20.169 0.001
We see that the previous conclusions are altered - both CRTS and HOD1 are rejected at the 5% level. Maybe the rejection of HOD1 is due to the Wald test's tendency to over-reject?
[1] By the way, notice that GLS/NerloveResiduals.m and GLS/HetTests.m use the restricted LS estimator directly to restrict the fully general model with all coefficients varying to the model with only the constant and the output coefficient varying. But GLS/NerloveRestrictions-Het.m estimates the model by substituting the restrictions into the model. The methods are equivalent, but the second is more convenient and easier to understand.
From the previous plot, it seems that the variance of $\varepsilon$ is a decreasing function of output. Suppose that the 5 size groups have different error variances (heteroscedasticity by groups):
$$Var(\varepsilon_i) = \sigma_j^2,$$
where $j = 1$ if $i = 1, 2, ..., 29$, etc., as before. The Octave script GLS/NerloveGLS.m estimates the model using GLS (through a transformation of the model so that OLS can be applied). The estimation results are
*********************************************************
OLS estimation results
Observations 145
R-squared 0.958822
Sigma-squared 0.090800
Results (Het. consistent var-cov estimator)
estimate st.err. t-stat. p-value
constant1 -1.046 1.276 -0.820 0.414
constant2 -1.977 1.364 -1.450 0.149
constant3 -3.616 1.656 -2.184 0.031
constant4 -4.052 1.462 -2.771 0.006
constant5 -5.308 1.586 -3.346 0.001
output1 0.391 0.090 4.363 0.000
output2 0.649 0.090 7.184 0.000
output3 0.897 0.134 6.688 0.000
output4 0.962 0.112 8.612 0.000
output5 1.101 0.090 12.237 0.000
labor 0.007 0.208 0.032 0.975
fuel 0.498 0.081 6.149 0.000
capital -0.460 0.253 -1.818 0.071
*********************************************************
*********************************************************
OLS estimation results
Observations 145
R-squared 0.987429
Sigma-squared 1.092393
Results (Het. consistent var-cov estimator)
estimate st.err. t-stat. p-value
constant1 -1.580 0.917 -1.723 0.087
constant2 -2.497 0.988 -2.528 0.013
constant3 -4.108 1.327 -3.097 0.002
constant4 -4.494 1.180 -3.808 0.000
constant5 -5.765 1.274 -4.525 0.000
output1 0.392 0.090 4.346 0.000
output2 0.648 0.094 6.917 0.000
output3 0.892 0.138 6.474 0.000
output4 0.951 0.109 8.755 0.000
output5 1.093 0.086 12.684 0.000
labor 0.103 0.141 0.733 0.465
fuel 0.492 0.044 11.294 0.000
capital -0.366 0.165 -2.217 0.028
*********************************************************
Testing HOD1
Value p-value
Wald test 9.312 0.002
The first panel of output are the OLS estimation results, which are used to consistently estimate the $\sigma_j^2$. The second panel of results are the GLS estimation results. Some comments:
The $R^2$ measures are not comparable - the dependent variables are not the same. The measure for the GLS results uses the transformed dependent variable. One could calculate a comparable $R^2$ measure, but I have not done so.
The differences in estimated standard errors (smaller in general for GLS) can be interpreted as evidence of improved efficiency of GLS, since the OLS standard errors are calculated using the Huber-White estimator. They would not be comparable if the ordinary (inconsistent) estimator had been used.
Note that the previously noted pattern in the output coefficients persists. The nonconstant CRTS result is robust.
The coefficient on capital is now negative and significant at the 3% level. That seems to indicate some kind of problem with the model or the data, or economic theory.
Note that HOD1 is now rejected. Problem of Wald test over-rejecting? Specification error in model?
9.5 Autocorrelation
Autocorrelation, which is the serial correlation of the error term, is a problem that is usually associated with time series data, but also can affect cross-sectional data. For example, a shock to oil prices will simultaneously affect all countries, so one could expect contemporaneous correlation of macroeconomic variables across countries.
Example
Consider the Keeling-Whorf data on atmospheric CO2 concentrations at Mauna Loa, Hawaii (see http://en.wikipedia.org/wiki/Keeling_Curve and http://cdiac.ornl.gov/ftp/ndp001/maunaloa.txt).
From the file maunaloa.txt: THE DATA FILE PRESENTED IN THIS SUBDIRECTORY CONTAINS MONTHLY AND ANNUAL ATMOSPHERIC CO2 CONCENTRATIONS DERIVED FROM THE SCRIPPS INSTITUTION OF OCEANOGRAPHY'S (SIO's) CONTINUOUS MONITORING PROGRAM AT MAUNA LOA OBSERVATORY, HAWAII. THIS RECORD CONSTITUTES THE LONGEST CONTINUOUS RECORD OF ATMOSPHERIC CO2 CONCENTRATIONS AVAILABLE IN THE WORLD. MONTHLY AND ANNUAL AVERAGE MOLE FRACTIONS OF CO2 IN WATER-VAPOR-FREE AIR ARE GIVEN FROM MARCH 1958 THROUGH DECEMBER 2003, EXCEPT FOR A FEW INTERRUPTIONS.
The data is available in Octave format at CO2.data .
If we fit the model $CO2_t = \beta_1 + \beta_2 t + \varepsilon_t$, we get the results
octave:8> CO2Example
warning: load: file found in load path
*********************************************************
OLS estimation results
Observations 468
R-squared 0.979239
Sigma-squared 5.696791
Results (Het. consistent var-cov estimator)
estimate st.err. t-stat. p-value
1 316.918 0.227 1394.406 0.000
2 0.121 0.001 141.521 0.000
*********************************************************
It seems pretty clear that CO2 concentrations have been going up in the last 50 years, surprise, surprise. Let's look at a residual plot for the last 3 years of the data, see Figure 9.4. Note that there is a very predictable pattern. This is pretty strong evidence that the errors of the model are not independent of one another, which means there seems to be autocorrelation.
Figure 9.4: Residuals from time trend for CO2 data
Causes
Autocorrelation is the existence of correlation across the error term:
$$\mathcal{E}(\varepsilon_t\varepsilon_s) \neq 0, t \neq s.$$
Why might this occur? Plausible explanations include
1. Lags in adjustment to shocks. In a model such as
$$y_t = x_t'\beta + \varepsilon_t,$$
one could interpret $x_t'\beta$ as the equilibrium value. Suppose $x_t$ is constant over a number of observations. One can interpret $\varepsilon_t$ as a shock that moves the system away from equilibrium. If the time needed to return to equilibrium is long with respect to the observation frequency, one could expect $\varepsilon_{t+1}$ to be positive, conditional on $\varepsilon_t$ positive, which induces a correlation.
2. Unobserved factors that are correlated over time. The error term is often assumed to correspond to unobservable factors. If these factors are correlated, there will be autocorrelation.
3. Misspecification of the model. Suppose that the DGP is
$$y_t = \beta_0 + \beta_1 x_t + \beta_2 x_t^2 + \varepsilon_t$$
but we estimate
$$y_t = \beta_0 + \beta_1 x_t + \varepsilon_t$$
The effects are illustrated in Figure 9.5.
Figure 9.5: Autocorrelation induced by misspecification
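The third cause, misspecification, is easy to reproduce numerically. In the Python/NumPy sketch below the DGP is quadratic with independent errors, but a linear model is estimated; the coefficient values, the range of $x$ and the ordering of the observations by $x$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = np.sort(rng.uniform(0.0, 10.0, size=n))      # observations ordered by x
# Assumed DGP: quadratic in x, with independent errors
y = 1.0 + 0.5 * x + 0.2 * x ** 2 + rng.normal(size=n)

# Estimate the misspecified linear model by OLS
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b

# First-order sample autocorrelation of the residuals
r1 = (e[1:] @ e[:-1]) / (e @ e)
print(r1)
```

Although the true errors are independent, the residuals of the linear fit absorb the omitted quadratic term, so neighbouring residuals tend to have the same sign and the sample autocorrelation is clearly positive.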
Effects on the OLS estimator
The variance of the OLS estimator is the same as in the case of heteroscedasticity - the standard formula does not apply. The correct formula is given in equation 9.1. Next we discuss two GLS corrections for OLS. These will potentially induce inconsistency when the regressors are stochastic (see Chapter 6) and should either not be used in that case (which is usually the relevant case) or used with caution. The more recommended procedure is discussed in section 9.5.
AR(1)
There are many types of autocorrelation. We'll consider two examples. The first is the most commonly encountered case: autoregressive order 1 (AR(1)) errors. The model is
$$y_t = x_t'\beta + \varepsilon_t$$
$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t$$
$$u_t \sim iid(0, \sigma_u^2)$$
$$\mathcal{E}(\varepsilon_t u_s) = 0, t < s$$
We assume that the model satisfies the other classical assumptions.
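We need a stationarity assumption: $|\rho| < 1$. Otherwise the variance of $\varepsilon_t$ explodes as $t$ increases, so standard asymptotics will not apply.
By recursive substitution we obtain
$$\begin{aligned} \varepsilon_t &= \rho\varepsilon_{t-1} + u_t \\ &= \rho\left(\rho\varepsilon_{t-2} + u_{t-1}\right) + u_t \\ &= \rho^2\varepsilon_{t-2} + \rho u_{t-1} + u_t \\ &= \rho^2\left(\rho\varepsilon_{t-3} + u_{t-2}\right) + \rho u_{t-1} + u_t \end{aligned}$$
In the limit the lagged $\varepsilon$ drops out, since $\rho^m \to 0$ as $m \to \infty$, so we obtain
$$\varepsilon_t = \sum_{m=0}^{\infty}\rho^m u_{t-m}$$
With this, the variance of $\varepsilon_t$ is found as
$$\mathcal{E}(\varepsilon_t^2) = \sigma_u^2\sum_{m=0}^{\infty}\rho^{2m} = \frac{\sigma_u^2}{1 - \rho^2}$$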
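If we had directly assumed that $\varepsilon_t$ were covariance stationary, we could obtain this using
$$\begin{aligned} V(\varepsilon_t) &= \rho^2\mathcal{E}(\varepsilon_{t-1}^2) + 2\rho\mathcal{E}(\varepsilon_{t-1}u_t) + \mathcal{E}(u_t^2) \\ &= \rho^2 V(\varepsilon_t) + \sigma_u^2, \end{aligned}$$
so
$$V(\varepsilon_t) = \frac{\sigma_u^2}{1 - \rho^2}$$
The variance is the $0^{th}$ order autocovariance: $\gamma_0 = V(\varepsilon_t)$
Note that the variance does not depend on $t$.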
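Likewise, the first order autocovariance $\gamma_1$ is
$$\begin{aligned} Cov(\varepsilon_t, \varepsilon_{t-1}) = \gamma_1 &= \mathcal{E}\left(\left(\rho\varepsilon_{t-1} + u_t\right)\varepsilon_{t-1}\right) \\ &= \rho V(\varepsilon_t) \\ &= \frac{\rho\sigma_u^2}{1 - \rho^2} \end{aligned}$$
Using the same method, we find that for $s < t$
$$Cov(\varepsilon_t, \varepsilon_{t-s}) = \gamma_s = \frac{\rho^s\sigma_u^2}{1 - \rho^2}$$
The autocovariances don't depend on $t$: the process $\{\varepsilon_t\}$ is covariance stationary.
The correlation (in general, for r.v.'s $x$ and $y$) is defined as
$$corr(x, y) = \frac{cov(x, y)}{se(x)se(y)}$$
but in this case, the two standard errors are the same, so the $s$-order autocorrelation $\rho_s$ is
$$\rho_s = \rho^s$$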
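All this means that the overall matrix $\Sigma$ has the form
$$\Sigma = \underbrace{\frac{\sigma_u^2}{1 - \rho^2}}_{\text{this is the variance}}\underbrace{\begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \vdots & & \ddots & & \vdots \\ & & & \ddots & \rho \\ \rho^{n-1} & \cdots & & & 1 \end{bmatrix}}_{\text{this is the correlation matrix}}$$
So we have homoscedasticity, but elements off the main diagonal are not zero. All of this depends only on two parameters, $\rho$ and $\sigma_u^2$. If we can estimate these consistently, we can apply FGLS.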
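To make the structure of the AR(1) covariance matrix concrete, the following Python/NumPy sketch builds it from the two parameters. The function name `ar1_sigma` and the example values ($n = 4$, $\rho = 0.5$, $\sigma_u^2 = 1$) are assumptions chosen for illustration.

```python
import numpy as np

def ar1_sigma(n, rho, sigma2_u):
    """Covariance matrix of n consecutive AR(1) errors:
    sigma_u^2 / (1 - rho^2) times the matrix with (t, s) element rho^|t-s|."""
    t = np.arange(n)
    corr = rho ** np.abs(t[:, None] - t[None, :])   # correlation matrix
    return sigma2_u / (1.0 - rho ** 2) * corr

S = ar1_sigma(4, 0.5, 1.0)
print(S)
```

Every diagonal element equals $\sigma_u^2/(1-\rho^2)$, and the element $s$ places off the diagonal is that variance times $\rho^s$.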
It turns out that it's easy to estimate these consistently. The steps are
1. Estimate the model $y_t = x_t'\beta + \varepsilon_t$ by OLS.
2. Take the residuals, and estimate the model
$$\hat{\varepsilon}_t = \rho\hat{\varepsilon}_{t-1} + u_t^*$$
Since $\hat{\varepsilon}_t \overset{p}{\to} \varepsilon_t$, this regression is asymptotically equivalent to the regression
$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t$$
which satisfies the classical assumptions. Therefore, $\hat{\rho}$ obtained by applying OLS to $\hat{\varepsilon}_t = \rho\hat{\varepsilon}_{t-1} + u_t^*$ is consistent. Also, since $u_t^* \overset{p}{\to} u_t$, the estimator
$$\hat{\sigma}_u^2 = \frac{1}{n}\sum_{t=2}^{n}\left(\hat{u}_t^*\right)^2 \overset{p}{\to} \sigma_u^2$$
3. With the consistent estimators $\hat{\sigma}_u^2$ and $\hat{\rho}$, form $\hat{\Sigma} = \Sigma(\hat{\sigma}_u^2, \hat{\rho})$ using the previous structure of $\Sigma$, and estimate by FGLS. Actually, one can omit the factor $\hat{\sigma}_u^2/(1 - \hat{\rho}^2)$, since it cancels out in the formula
$$\hat{\beta}_{FGLS} = \left(X'\hat{\Sigma}^{-1}X\right)^{-1}\left(X'\hat{\Sigma}^{-1}y\right).$$
One can iterate the process, by taking the first FGLS estimator of $\beta$, re-estimating $\rho$ and $\sigma_u^2$, etc. If one iterates to convergence it's equivalent to MLE (supposing normal errors).
An asymptotically equivalent approach is to simply estimate the transformed model
$$y_t - \hat{\rho}y_{t-1} = \left(x_t - \hat{\rho}x_{t-1}\right)'\beta + u_t^*$$
using $n - 1$ observations (since $y_0$ and $x_0$ aren't available). This is the method of Cochrane and Orcutt. Dropping the first observation is asymptotically irrelevant, but it can be very important in small samples. One can recuperate the first observation by putting
$$y_1^* = y_1\sqrt{1 - \hat{\rho}^2}$$
$$x_1^* = x_1\sqrt{1 - \hat{\rho}^2}$$
This somewhat odd-looking result is related to the Cholesky factorization of $\Sigma^{-1}$. See Davidson and MacKinnon, pg. 348-49 for more discussion. Note that the variance of $y_1^*$ is $\sigma_u^2$, asymptotically, so we see that the transformed model will be homoscedastic (and nonautocorrelated, since the $u$'s are uncorrelated with the $y$'s, in different time periods).
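The steps above, using the transformed model and keeping the rescaled first observation, can be sketched as follows. This Python/NumPy sketch is an illustration under assumptions (simulated data with $\rho = 0.7$, $\sigma_u^2 = 1$, $n = 500$; the function name `ar1_fgls` is hypothetical), and it does a single pass rather than iterating to convergence.

```python
import numpy as np

def ar1_fgls(X, y):
    """Two-step FGLS for AR(1) errors, keeping the rescaled first observation."""
    # Step 1: OLS residuals
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b_ols
    # Step 2: regress residuals on their own lag (no constant)
    rho = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])
    u = e[1:] - rho * e[:-1]
    sigma2_u = u @ u / len(y)
    # Step 3: quasi-difference; first observation scaled by sqrt(1 - rho^2)
    ys = np.empty_like(y)
    Xs = np.empty_like(X)
    ys[0] = np.sqrt(1.0 - rho ** 2) * y[0]
    Xs[0] = np.sqrt(1.0 - rho ** 2) * X[0]
    ys[1:] = y[1:] - rho * y[:-1]
    Xs[1:] = X[1:] - rho * X[:-1]
    b_fgls, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return b_fgls, rho, sigma2_u

rng = np.random.default_rng(3)
n, rho_true = 500, 0.7                     # assumed values
u = rng.normal(size=n)
eps = np.empty(n)
eps[0] = u[0] / np.sqrt(1.0 - rho_true ** 2)   # stationary starting value
for t in range(1, n):
    eps[t] = rho_true * eps[t - 1] + u[t]
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + x + eps

b_fgls, rho_hat, sigma2_u_hat = ar1_fgls(X, y)
print(b_fgls, rho_hat, sigma2_u_hat)
```

Dropping the two lines that fill `ys[0]` and `Xs[0]`, and using only rows 1 onward, gives the Cochrane-Orcutt version that discards the first observation.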
MA(1)
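The linear regression model with moving average order 1 errors is
$$y_t = x_t'\beta + \varepsilon_t$$
$$\varepsilon_t = u_t + \phi u_{t-1}$$
$$u_t \sim iid(0, \sigma_u^2)$$
$$\mathcal{E}(\varepsilon_t u_s) = 0, t < s$$
In this case,
$$\begin{aligned} V(\varepsilon_t) = \gamma_0 &= \mathcal{E}\left[\left(u_t + \phi u_{t-1}\right)^2\right] \\ &= \sigma_u^2 + \phi^2\sigma_u^2 \\ &= \sigma_u^2(1 + \phi^2) \end{aligned}$$
Similarly
$$\begin{aligned} \gamma_1 &= \mathcal{E}\left[\left(u_t + \phi u_{t-1}\right)\left(u_{t-1} + \phi u_{t-2}\right)\right] \\ &= \phi\sigma_u^2 \end{aligned}$$
and
$$\begin{aligned} \gamma_2 &= \mathcal{E}\left[\left(u_t + \phi u_{t-1}\right)\left(u_{t-2} + \phi u_{t-3}\right)\right] \\ &= 0 \end{aligned}$$
so in this case
$$\Sigma = \sigma_u^2\begin{bmatrix} 1 + \phi^2 & \phi & 0 & \cdots & 0 \\ \phi & 1 + \phi^2 & \phi & & \vdots \\ 0 & \phi & \ddots & \ddots & 0 \\ \vdots & & \ddots & \ddots & \phi \\ 0 & \cdots & 0 & \phi & 1 + \phi^2 \end{bmatrix}$$
Note that the first order autocorrelation is
$$\rho_1 = \frac{\gamma_1}{\gamma_0} = \frac{\phi\sigma_u^2}{\sigma_u^2(1 + \phi^2)} = \frac{\phi}{1 + \phi^2}$$
This achieves a maximum at $\phi = 1$ and a minimum at $\phi = -1$, and the maximal and minimal autocorrelations are 1/2 and -1/2. Therefore, series that are more strongly autocorrelated can't be MA(1) processes.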
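The bound on the first order autocorrelation of an MA(1) process can be checked numerically. The Python/NumPy sketch below evaluates $\rho_1 = \phi/(1 + \phi^2)$ on a grid; the grid range $[-3, 3]$ and its spacing are arbitrary choices made for illustration.

```python
import numpy as np

# Evaluate rho_1 = phi / (1 + phi^2) on a grid of phi values
phi = np.linspace(-3.0, 3.0, 601)
rho1 = phi / (1.0 + phi ** 2)

i_max, i_min = np.argmax(rho1), np.argmin(rho1)
print(phi[i_max], rho1[i_max])   # maximizer and maximum
print(phi[i_min], rho1[i_min])   # minimizer and minimum
```

The same conclusion follows analytically: the derivative of $\phi/(1+\phi^2)$ is $(1-\phi^2)/(1+\phi^2)^2$, which is zero only at $\phi = \pm 1$.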
Again the covariance matrix has a simple structure that depends on only two parameters. The problem in this case is that one can't estimate $\phi$ using OLS on
$$\hat{\varepsilon}_t = u_t + \phi u_{t-1}$$
because the $u_t$ are unobservable and they can't be estimated consistently. However, there is a simple way to estimate the parameters.
Since the model is homoscedastic, we can estimate
$$V(\varepsilon_t) = \sigma_\varepsilon^2 = \sigma_u^2(1 + \phi^2)$$
using the typical estimator:
$$\hat{\sigma}_\varepsilon^2 = \widehat{\sigma_u^2(1 + \phi^2)} = \frac{1}{n}\sum_{t=1}^{n}\hat{\varepsilon}_t^2$$
By the Slutsky theorem, we can interpret this as defining an (unidentified) estimator of both $\sigma_u^2$ and $\phi$, e.g., use this as
$$\hat{\sigma}_u^2(1 + \hat{\phi}^2) = \frac{1}{n}\sum_{t=1}^{n}\hat{\varepsilon}_t^2$$
However, this isn't sufficient to define consistent estimators of the parameters, since it's unidentified - two unknowns, one equation.
To solve this problem, estimate the covariance of $\varepsilon_t$ and $\varepsilon_{t-1}$ using
$$\widehat{Cov}(\varepsilon_t, \varepsilon_{t-1}) = \widehat{\phi\sigma_u^2} = \frac{1}{n}\sum_{t=2}^{n}\hat{\varepsilon}_t\hat{\varepsilon}_{t-1}$$
This is a consistent estimator, following a LLN (and given that the epsilon hats are consistent for the epsilons). As above, this can be interpreted as defining an unidentified estimator of the two parameters:
$$\hat{\phi}\hat{\sigma}_u^2 = \frac{1}{n}\sum_{t=2}^{n}\hat{\varepsilon}_t\hat{\varepsilon}_{t-1}$$
Now solve these two equations to obtain identified (and therefore consistent) estimators of both $\phi$ and $\sigma_u^2$. Define the consistent estimator
$$\hat{\Sigma} = \Sigma(\hat{\phi}, \hat{\sigma}_u^2)$$
following the form we've seen above, and transform the model using the Cholesky decomposition. The transformed model satisfies the classical assumptions asymptotically.
Note: there is no guarantee that $\Sigma$ estimated using the above method will be positive definite, which may pose a problem. Another method would be to use ML estimation, if one is willing to make distributional assumptions regarding the white noise errors.
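Solving the two moment equations is a small exercise in algebra. Writing $g_0$ for the estimated variance and $g_1$ for the estimated first autocovariance, the ratio $r = g_1/g_0 = \phi/(1+\phi^2)$ gives the quadratic $r\phi^2 - \phi + r = 0$. The Python sketch below picks the root with $|\phi| \le 1$; that choice, and the function name `ma1_moments`, are assumptions that the text does not spell out.

```python
import math

def ma1_moments(g0, g1):
    """Solve sigma_u^2 (1 + phi^2) = g0 and phi * sigma_u^2 = g1, choosing |phi| <= 1."""
    if g1 == 0.0:
        return 0.0, g0
    r = g1 / g0
    disc = 1.0 - 4.0 * r * r
    if disc < 0.0:
        # |r| > 1/2 cannot come from an MA(1) process
        raise ValueError("no real solution: |g1/g0| exceeds 1/2")
    phi = (1.0 - math.sqrt(disc)) / (2.0 * r)
    sigma2_u = g0 / (1.0 + phi * phi)
    return phi, sigma2_u

# Moments implied by phi = 0.5 and sigma_u^2 = 2: g0 = 2.5, g1 = 1.0
phi_hat, sigma2_u_hat = ma1_moments(2.5, 1.0)
print(phi_hat, sigma2_u_hat)
```

The `ValueError` branch corresponds to sample moments with $|g_1/g_0| > 1/2$, which no MA(1) process can generate, so the moment equations have no real solution.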
Monte Carlo example: AR1
Let's look at a Monte Carlo study that compares OLS and GLS when we have AR1 errors. The model is
$$y_t = 1 + x_t + \varepsilon_t$$
$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t$$
with $\rho = 0.9$. The sample size is $n = 30$, and 1000 Monte Carlo replications are done. The Octave script is GLS/AR1Errors.m. Figure 9.6 shows histograms of the estimated coefficient of $x$ minus the true value. We can see that the GLS histogram is much more concentrated about 0, which is indicative of the efficiency of GLS relative to OLS.
Figure 9.6: Efficiency of OLS and FGLS, AR1 errors. Panels: (a) OLS, (b) GLS
Asymptotically valid inferences with autocorrelation of unknown form
See Hamilton Ch. 10, pp. 261-2 and 280-84.
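When the form of autocorrelation is unknown, one may decide to use the OLS estimator, without correction. We've seen that this estimator has the limiting distribution
$$\sqrt{n}\left(\hat{\beta} - \beta\right) \overset{d}{\to} N\left(0, Q_X^{-1}\Omega Q_X^{-1}\right)$$
where, as before, $\Omega$ is
$$\Omega = \lim_{n\to\infty}\mathcal{E}\left(\frac{X'\varepsilon\varepsilon'X}{n}\right)$$
We need a consistent estimate of $\Omega$. Define $m_t = x_t\varepsilon_t$ (recall that $x_t$ is defined as a $K \times 1$ vector). Note that
$$X'\varepsilon = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} = \sum_{t=1}^{n}x_t\varepsilon_t = \sum_{t=1}^{n}m_t$$
so that
$$\Omega = \lim_{n\to\infty}\frac{1}{n}\mathcal{E}\left[\left(\sum_{t=1}^{n}m_t\right)\left(\sum_{t=1}^{n}m_t'\right)\right]$$
We assume that $m_t$ is covariance stationary (so that the covariance between $m_t$ and $m_{t-s}$ does not depend on $t$).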
Define the $v$-th autocovariance of $m_t$ as
$$\Gamma_v = \mathcal{E}(m_t m_{t-v}').$$
Note that $\mathcal{E}(m_t m_{t+v}') = \Gamma_v'$. (show this with an example). In general, we expect that:
$m_t$ will be autocorrelated, since $\varepsilon_t$ is potentially autocorrelated:
$$\Gamma_v = \mathcal{E}(m_t m_{t-v}') \neq 0$$
Note that this autocovariance does not depend on $t$, due to covariance stationarity.
contemporaneously correlated ($\mathcal{E}(m_{it}m_{jt}) \neq 0$), since the regressors in $x_t$ will in general be correlated (more on this later).
and heteroscedastic ($\mathcal{E}(m_{it}^2) = \sigma_i^2$, which depends upon $i$), again since the regressors will have different variances.
While one could estimate $\Omega$ parametrically, we in general have little information upon which to base a parametric specification. Recent research has focused on consistent nonparametric estimators of $\Omega$.
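Now define
$$\Omega_n = \mathcal{E}\frac{1}{n}\left[\left(\sum_{t=1}^{n}m_t\right)\left(\sum_{t=1}^{n}m_t'\right)\right]$$
We have (show that the following is true, by expanding sum and shifting rows to left)
$$\Omega_n = \Gamma_0 + \frac{n-1}{n}\left(\Gamma_1 + \Gamma_1'\right) + \frac{n-2}{n}\left(\Gamma_2 + \Gamma_2'\right) + \cdots + \frac{1}{n}\left(\Gamma_{n-1} + \Gamma_{n-1}'\right)$$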
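The natural, consistent estimator of $\Gamma_v$ is
$$\hat{\Gamma}_v = \frac{1}{n}\sum_{t=v+1}^{n}\hat{m}_t\hat{m}_{t-v}'.$$
where
$$\hat{m}_t = x_t\hat{\varepsilon}_t$$
(note: one could put $1/(n - v)$ instead of $1/n$ here). So, a natural, but inconsistent, estimator of $\Omega_n$ would be
$$\begin{aligned} \hat{\Omega}_n &= \hat{\Gamma}_0 + \frac{n-1}{n}\left(\hat{\Gamma}_1 + \hat{\Gamma}_1'\right) + \frac{n-2}{n}\left(\hat{\Gamma}_2 + \hat{\Gamma}_2'\right) + \cdots + \frac{1}{n}\left(\hat{\Gamma}_{n-1} + \hat{\Gamma}_{n-1}'\right) \\ &= \hat{\Gamma}_0 + \sum_{v=1}^{n-1}\frac{n-v}{n}\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right). \end{aligned}$$
This estimator is inconsistent in general, since the number of parameters to estimate is more than the number of observations, and increases more rapidly than $n$, so information does not build up as $n \to \infty$.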
On the other hand, supposing that $\Gamma_v$ tends to zero sufficiently rapidly as $v$ tends to $\infty$, a modified estimator
$$\hat{\Omega}_n = \hat{\Gamma}_0 + \sum_{v=1}^{q(n)}\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right),$$
where $q(n) \overset{p}{\to} \infty$ as $n \to \infty$ will be consistent, provided $q(n)$ grows sufficiently slowly.
The assumption that autocorrelations die off is reasonable in many cases. For example, the AR(1) model with $|\rho| < 1$ has autocorrelations that die off.
The term $\frac{n-v}{n}$ can be dropped because it tends to one for $v < q(n)$, given that $q(n)$ increases slowly relative to $n$.
A disadvantage of this estimator is that it may not be positive definite. This could cause one to calculate a negative $\chi^2$ statistic, for example!
Newey and West proposed an estimator (Econometrica, 1987) that solves the problem of possible nonpositive definiteness of the above estimator. Their estimator is
$$\hat{\Omega}_n = \hat{\Gamma}_0 + \sum_{v=1}^{q(n)}\left[1 - \frac{v}{q+1}\right]\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right).$$
This estimator is p.d. by construction. The condition for consistency is that $n^{-1/4}q(n) \to 0$. Note that this is a very slow rate of growth for $q$. This estimator is nonparametric - we've placed no parametric restrictions on the form of $\Omega$. It is an example of a kernel estimator.
Finally, since $\Omega_n$ has $\Omega$ as its limit, $\hat{\Omega}_n \overset{p}{\to} \Omega$. We can now use $\hat{\Omega}_n$ and $\hat{Q}_X = \frac{1}{n}X'X$ to consistently estimate the limiting distribution of the OLS estimator under heteroscedasticity and autocorrelation of unknown form. With this, asymptotically valid tests are constructed in the usual way.
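The Newey-West formula translates almost directly into code. The following Python/NumPy sketch is an illustration under assumptions: the function name `newey_west_omega`, the simulated AR(1) errors with coefficient 0.5, and the lag length $q = 4$ are not from the text.

```python
import numpy as np

def newey_west_omega(X, e, q):
    """Newey-West estimate of Omega with weights 1 - v/(q+1)."""
    n = X.shape[0]
    m = X * e[:, None]                    # row t is m_t' = (x_t e_t)'
    omega = m.T @ m / n                   # Gamma_0 hat
    for v in range(1, q + 1):
        gamma_v = m[v:].T @ m[:-v] / n    # (1/n) sum_{t=v+1}^n m_t m_{t-v}'
        omega += (1.0 - v / (q + 1.0)) * (gamma_v + gamma_v.T)
    return omega

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
u = rng.normal(size=n)
eps = np.empty(n)
eps[0] = u[0]
for t in range(1, n):
    eps[t] = 0.5 * eps[t - 1] + u[t]      # assumed AR(1) errors
X = np.column_stack([np.ones(n), x])
y = 1.0 + x + eps

b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
q = 4                                     # assumed lag length
omega = newey_west_omega(X, e, q)
Q_inv = np.linalg.inv(X.T @ X / n)
se = np.sqrt(np.diag(Q_inv @ omega @ Q_inv / n))
print(b, se)
```

With $q = 0$ the function reduces to the heteroscedasticity-consistent estimator $\frac{1}{n}\sum_t x_t x_t'\hat{\varepsilon}_t^2$, which is a convenient check.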
Testing for autocorrelation
Durbin-Watson test
The Durbin-Watson test is not strictly valid in most situations where we would like to use it. Nevertheless, it is encountered often enough so that one should know something about it. The Durbin-Watson test statistic is
$$\begin{aligned} DW &= \frac{\sum_{t=2}^{n}\left(\hat{\varepsilon}_t - \hat{\varepsilon}_{t-1}\right)^2}{\sum_{t=1}^{n}\hat{\varepsilon}_t^2} \\ &= \frac{\sum_{t=2}^{n}\left(\hat{\varepsilon}_t^2 - 2\hat{\varepsilon}_t\hat{\varepsilon}_{t-1} + \hat{\varepsilon}_{t-1}^2\right)}{\sum_{t=1}^{n}\hat{\varepsilon}_t^2} \end{aligned}$$
The null hypothesis is that the first order autocorrelation of the errors is zero: $H_0: \rho_1 = 0$. The alternative is of course $H_A: \rho_1 \neq 0$. Note that the alternative is not that the errors are AR(1), since many general patterns of autocorrelation will have the first order autocorrelation different than zero. For this reason the test is useful for detecting autocorrelation in general. For the same reason, one shouldn't just assume that an AR(1) model is appropriate when the DW test rejects the null.
Under the null, the middle term tends to zero, and the other two tend to one, so $DW \overset{p}{\to} 2$.
Suppose that we had an AR(1) error process with $\rho = 1$. In this case the middle term tends to $-2$, so $DW \overset{p}{\to} 0$.
Suppose that we had an AR(1) error process with $\rho = -1$. In this case the middle term tends to 2, so $DW \overset{p}{\to} 4$.
These are the extremes: DW always lies between 0 and 4.
The distribution of the test statistic depends on the matrix of regressors, $X$, so tables can't give exact critical values. They give upper and lower bounds, which correspond to the extremes that are possible. See Figure 9.7. There are means of determining exact critical values conditional on $X$.
Note that DW can be used to test for nonlinearity (add discussion).
The DW test is based upon the assumption that the matrix $X$ is fixed in repeated samples. This is often unreasonable in the context of economic time series, which is precisely the context where the test would have application. It is possible to relate the DW test to other test statistics which are valid without strict exogeneity.
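The statistic itself is a one-liner. The Python/NumPy sketch below computes it for two artificial residual vectors, chosen only to illustrate the range of the statistic; the function name `durbin_watson` is an assumption.

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic from a vector of residuals."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

print(durbin_watson([1.0, -1.0, 1.0, -1.0]))   # alternating signs: prints 3.0
print(durbin_watson([1.0, 1.0, 1.0, 1.0]))     # no change between periods: prints 0.0
```

The alternating sequence gives 3 rather than 4 because the sum in the numerator has only $n - 1$ terms; the limit of 4 is approached as $n$ grows.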
Breusch-Godfrey test
This test uses an auxiliary regression, as does the White test for heteroscedasticity. The regression is
$$\hat{\varepsilon}_t = x_t'\delta + \gamma_1\hat{\varepsilon}_{t-1} + \gamma_2\hat{\varepsilon}_{t-2} + \cdots + \gamma_P\hat{\varepsilon}_{t-P} + v_t$$
and the test statistic is the $nR^2$ statistic, just as in the White test. There are $P$ restrictions, so the test statistic is asymptotically distributed as a $\chi^2(P)$.
The intuition is that the lagged errors shouldn't contribute to explaining the current error if there is no autocorrelation.
$x_t$ is included as a regressor to account for the fact that the $\hat{\varepsilon}_t$ are not independent even if the $\varepsilon_t$ are. This is a technicality that we won't go into here.
This test is valid even if the regressors are stochastic and contain lagged dependent variables, so it is considerably more useful than the DW test for typical time series data.
Figure 9.7: Durbin-Watson critical values
The alternative is not that the model is an AR(P), following the argument above. The alternative is simply that some or all of the first $P$ autocorrelations are different from zero. This is compatible with many specific forms of autocorrelation.
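The auxiliary regression can be sketched as follows. This Python/NumPy sketch is illustrative only: the function name `breusch_godfrey` is hypothetical, missing initial lags are set to zero (one common convention, assumed here), and the simulated errors are AR(1) with coefficient 0.7.

```python
import numpy as np

def breusch_godfrey(X, y, P):
    """n*R^2 from regressing OLS residuals on X and P lagged residuals.
    X is assumed to contain a constant; missing initial lags are set to zero."""
    n = X.shape[0]
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    lags = np.zeros((n, P))
    for p in range(1, P + 1):
        lags[p:, p - 1] = e[:-p]          # column p-1 holds e_{t-p}
    Z = np.column_stack([X, lags])
    g, *_ = np.linalg.lstsq(Z, e, rcond=None)
    v = e - Z @ g
    r2 = 1.0 - (v @ v) / np.sum((e - e.mean()) ** 2)
    return n * r2

rng = np.random.default_rng(0)
n = 300
u = rng.normal(size=n)
eps = np.empty(n)
eps[0] = u[0]
for t in range(1, n):
    eps[t] = 0.7 * eps[t - 1] + u[t]      # assumed AR(1) errors
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + x + eps

stat = breusch_godfrey(X, y, P=1)
print(stat)
```

With $P = 1$ the statistic is compared with a $\chi^2(1)$ critical value, about 3.84 at the 5% level.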
Lagged dependent variables and autocorrelation
We've seen that the OLS estimator is consistent under autocorrelation, as long as $plim\frac{X'\varepsilon}{n} = 0$. This will be the case when $\mathcal{E}(X'\varepsilon) = 0$, following a LLN. An important exception is the case where $X$ contains lagged $y$'s and the errors are autocorrelated.
Example 22. Dynamic model with MA1 errors. Consider the model
$$y_t = \alpha + \rho y_{t-1} + \beta x_t + \varepsilon_t$$
$$\varepsilon_t = \upsilon_t + \phi\upsilon_{t-1}$$
We can easily see that a regressor is not weakly exogenous:
$$\mathcal{E}(y_{t-1}\varepsilon_t) = \mathcal{E}\left\{\left(\alpha + \rho y_{t-2} + \beta x_{t-1} + \upsilon_{t-1} + \phi\upsilon_{t-2}\right)\left(\upsilon_t + \phi\upsilon_{t-1}\right)\right\} \neq 0$$
since one of the terms is $\mathcal{E}(\phi\upsilon_{t-1}^2)$ which is clearly nonzero. In this case $\mathcal{E}(x_t\varepsilon_t) \neq 0$, and therefore $plim\frac{X'\varepsilon}{n} \neq 0$. Since
$$plim\hat{\beta} = \beta + plim\frac{X'\varepsilon}{n}$$
the OLS estimator is inconsistent in this case. One needs to estimate by instrumental variables (IV), which we'll get to later.
The Octave script GLS/DynamicMA.m does a Monte Carlo study. The sample size is $n = 100$. The true coefficients are $\alpha = 1$, $\rho = 0.9$ and $\beta = 1$. The MA parameter is $\phi = 0.95$. Figure 9.8 gives the results. You can see that the constant and the autoregressive parameter have a lot of bias. By re-running the script with $\phi = 0$, you will see that much of the bias disappears (not all - why?).
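The inconsistency can also be seen in a small simulation. The Python/NumPy sketch below is not GLS/DynamicMA.m: to isolate the large-sample effect it uses a long sample ($n = 2000$), $\rho = 0.5$, an MA parameter of 0.9 and 50 replications, all of which are assumptions chosen for illustration, and the function name `simulate_rho_hat` is hypothetical.

```python
import numpy as np

def simulate_rho_hat(phi, rho=0.5, n=2000, reps=50, seed=4):
    """Average OLS estimate of rho in y_t = 1 + rho*y_{t-1} + x_t + eps_t,
    where eps_t = u_t + phi*u_{t-1}."""
    rng = np.random.default_rng(seed)
    est = np.empty(reps)
    for r in range(reps):
        u = rng.normal(size=n + 1)
        eps = u[1:] + phi * u[:-1]        # MA(1) errors
        x = rng.normal(size=n)
        y = np.zeros(n)
        for t in range(1, n):
            y[t] = 1.0 + rho * y[t - 1] + x[t] + eps[t]
        # Regress y_t on a constant, y_{t-1} and x_t
        Z = np.column_stack([np.ones(n - 1), y[:-1], x[1:]])
        b, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
        est[r] = b[1]
    return est.mean()

rho_hat_white = simulate_rho_hat(0.0)     # white noise errors
rho_hat_ma = simulate_rho_hat(0.9)        # MA(1) errors
print(rho_hat_white, rho_hat_ma)
```

With white noise errors the average estimate should be close to the true value of 0.5, while with MA(1) errors it stays away from 0.5 even in this long sample, because the lagged dependent variable is correlated with the error.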
Examples
Nerlove model, yet again The Nerlove model uses cross-sectional data, so one may not think of
performing tests for autocorrelation. However, specification error can induce autocorrelated errors.
Consider the simple Nerlove model
\[
\ln C = \beta_1 + \beta_2 \ln Q + \beta_3 \ln P_L + \beta_4 \ln P_F + \beta_5 \ln P_K + \varepsilon
\]
and the extended Nerlove model
\[
\ln C = \sum_{j=1}^{5} \alpha_j D_j + \sum_{j=1}^{5} \gamma_j D_j \ln Q + \beta_L \ln P_L + \beta_F \ln P_F + \beta_K \ln P_K + \varepsilon
\]
discussed around equation 5.5. If you have done the exercises, you have seen evidence that the extended
model is preferred. So if it is in fact the proper model, the simple model is misspecified. Let's check
if this misspecification might induce autocorrelated errors.
The Octave program GLS/NerloveAR.m estimates the simple Nerlove model, plots the residuals
as a function of ln Q, and calculates a Breusch-Godfrey test statistic. The residual plot is in
Figure 9.9, and the test results are:
                        Value    p-value
Breusch-Godfrey test   34.930      0.000
Clearly, there is a problem of autocorrelated residuals.
Repeat the autocorrelation tests using the extended Nerlove model (Equation 5.5) to see that the
problem is solved.
Figure 9.8: Dynamic model with MA(1) errors (panels (a), (b) and (c))
Figure 9.9: Residuals of simple Nerlove model (plotted series: Residuals; Quadratic fit to Residuals)
Klein model Klein's Model I is a simple macroeconometric model. One of the equations in the
model explains consumption (C) as a function of profits (P), both current and lagged, as well as the
sum of wages in the private sector ($W^p$) and wages in the government sector ($W^g$). Have a look at the
README file for this data set. This gives the variable names and other information.
Consider the model
\[
C_t = \alpha_0 + \alpha_1 P_t + \alpha_2 P_{t-1} + \alpha_3 (W_t^p + W_t^g) + \varepsilon_{1t}
\]
The Octave program GLS/Klein.m estimates this model by OLS, plots the residuals, and performs
the Breusch-Godfrey test, using 1 lag of the residuals. The estimation and test results are:
*********************************************************
OLS estimation results
Observations 21
R-squared 0.981008
Sigma-squared 1.051732
Results (Ordinary var-cov estimator)
estimate st.err. t-stat. p-value
Constant 16.237 1.303 12.464 0.000
Profits 0.193 0.091 2.115 0.049
Lagged Profits 0.090 0.091 0.992 0.335
Wages 0.796 0.040 19.933 0.000
*********************************************************
Value p-value
Breusch-Godfrey test 1.539 0.215
and the residual plot is in Figure 9.10. The test does not reject the null of nonautocorrelated errors,
but we should remember that we have only 21 observations, so power is likely to be fairly low. The
residual plot leads me to suspect that there may be autocorrelation - there are some significant runs
below and above the x-axis. Your opinion may differ.
Since it seems that there may be autocorrelation, let's try an AR(1) correction. The Octave
program GLS/KleinAR1.m estimates the Klein consumption equation assuming that the errors follow
the AR(1) pattern. The results, with the Breusch-Godfrey test for remaining autocorrelation, are:
*********************************************************
OLS estimation results
Observations 21
R-squared 0.967090
Sigma-squared 0.983171
Results (Ordinary var-cov estimator)
estimate st.err. t-stat. p-value
Constant 16.992 1.492 11.388 0.000
Profits 0.215 0.096 2.232 0.039
Lagged Profits 0.076 0.094 0.806 0.431
Wages 0.774 0.048 16.234 0.000
*********************************************************
Value p-value
Breusch-Godfrey test 2.129 0.345
Figure 9.10: OLS residuals, Klein consumption equation (axis label: Regression residuals; plotted series: Residuals)
The test is farther away from the rejection region than before, and the residual plot is a bit more
favorable for the hypothesis of nonautocorrelated residuals, IMHO. For this reason, it seems that
the AR(1) correction might have improved the estimation.
Nevertheless, there has not been much of an effect on the estimated coefficients nor on their
estimated standard errors. This is probably because the estimated AR(1) coefficient is not very
large (around 0.2).
The existence or not of autocorrelation in this model will be important later, in the section on
simultaneous equations.
9.6 Exercises
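1. Comparing the variances of the OLS and GLS estimators, I claimed that the following holds:
\[
Var(\hat{\beta}) - Var(\hat{\beta}_{GLS}) = A \Sigma A'
\]
Verify that this is true.
2. Show that the GLS estimator can be defined as
\[
\hat{\beta}_{GLS} = \arg\min\, (y - X\beta)' \Sigma^{-1} (y - X\beta)
\]
3. The limiting distribution of the OLS estimator with heteroscedasticity of unknown form is
\[
\sqrt{n}\left(\hat{\beta} - \beta\right) \overset{d}{\to} N\left(0, Q_X^{-1} \Omega Q_X^{-1}\right),
\]
where
\[
\lim_{n \to \infty} \mathcal{E}\left(\frac{X'\varepsilon\varepsilon'X}{n}\right) = \Omega
\]
Explain why
\[
\hat{\Omega} = \frac{1}{n} \sum_{t=1}^{n} x_t x_t' \hat{\varepsilon}_t^2
\]
is a consistent estimator of this matrix.
4. Define the v-th autocovariance of a covariance stationary process $m_t$, where $E(m_t) = 0$, as
\[
\Gamma_v = \mathcal{E}(m_t m_{t-v}').
\]
Show that $\mathcal{E}(m_t m_{t+v}') = \Gamma_v'$.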
5. For the Nerlove model with dummies and interactions discussed above (see Section 9.4 and
equation 5.5)
\[
\ln C = \sum_{j=1}^{5} \alpha_j D_j + \sum_{j=1}^{5} \gamma_j D_j \ln Q + \beta_L \ln P_L + \beta_F \ln P_F + \beta_K \ln P_K + \varepsilon
\]
we did a GLS correction based on the assumption that there is HET by groups ($V(\varepsilon_t | x_t) = \sigma_j^2$).
Let's assume that this model is correctly specified, except that there may or may not be
HET, and if it is present it may be of the form assumed, or perhaps of some other form. What
happens if the assumed form of HET is incorrect?
(a) Is the FGLS based on the assumed form of HET consistent?
(b) Is it efficient? Is it likely to be efficient with respect to OLS?
(c) Are hypothesis tests using the FGLS estimator valid? If not, can they be made valid
following some procedure? Explain.
(d) Are the t-statistics reported in Section 9.4 valid?
(e) Which estimator do you prefer, the OLS estimator or the FGLS estimator? Discuss.
6. Perhaps we can be a little more parsimonious with the Nerlove data (nerlove.data), rather
than using so many parameters to account for non-constant returns to scale, and to account for
heteroscedasticity. Consider the original model
\[
\ln C = \beta + \beta_Q \ln Q + \beta_L \ln P_L + \beta_F \ln P_F + \beta_K \ln P_K + \varepsilon
\]
(a) Estimate by OLS, plot the residuals, and test for autocorrelation and heteroscedasticity.
Explain your findings.
(b) Consider the model
\[
\ln C = \beta + \beta_Q \ln Q + \gamma_Q (\ln Q)^2 + \beta_L \ln P_L + \beta_F \ln P_F + \beta_K \ln P_K + \varepsilon
\]
i. Explain how this model can account for non-constant returns to scale.
ii. Estimate this model, and test for autocorrelation and heteroscedasticity. You should
find that there is HET, but no strong evidence of AUT. Why is this the case?
iii. Do a GLS correction where it is assumed that $V(\varepsilon_i) = \frac{\sigma^2}{(\ln Q_i)^2}$. In GRETL, there is a
weighted least squares option that you can use. Why does this assumed form of HET
make sense?
iv. Plot the weighted residuals versus output. Is there evidence of HET, or has the correction
eliminated the problem?
v. Plot the fitted values for returns to scale, for all of the firms.
7. The hall.csv or hall.gdt dataset contains monthly observations on 3 variables: the consumption
ratio $c_t/c_{t-1}$; the gross return of an equally weighted index of assets $ewr_t$; and the gross return
of the same index, but weighted by value, $vwr_t$. The idea is that a representative consumer may
finance consumption by investing in assets. Present wealth is used for two things: consumption
and investment. The return on investment defines wealth in the next period, and the process
repeats. For the moment, explore the properties of the variables.
(a) Are the variances constant over time?
(b) Do the variables appear to be autocorrelated? Hint: regress a variable on its own lags.
(c) Do the variables seem to be normally distributed?
(d) Look at the properties of the growth rates of the variables: repeat a-c for growth rates. The
growth rate of a variable $x_t$ is given by $\log(x_t/x_{t-1})$.
8. Consider the model
\[
\begin{aligned}
y_t &= C + A_1 y_{t-1} + \varepsilon_t \\
E(\varepsilon_t \varepsilon_t') &= \Sigma \\
E(\varepsilon_t \varepsilon_s') &= 0, \quad t \neq s
\end{aligned}
\]
where $y_t$ and $\varepsilon_t$ are $G \times 1$ vectors, C is a $G \times 1$ vector of constants, and $A_1$ is a $G \times G$ matrix of
parameters. The matrix $\Sigma$ is a $G \times G$ covariance matrix. Assume that we have n observations.
This is a vector autoregressive model, of order 1 - commonly referred to as a VAR(1) model.
(a) Show how the model can be written in the form $Y = X\beta + \nu$, where Y is a $Gn \times 1$ vector,
$\beta$ is a $(G + G^2) \times 1$ parameter vector, and the other items are conformable. What is the
structure of X? What is the structure of the covariance matrix of $\nu$?
(b) This model has HET and AUT. Verify this statement.
(c) Set G = 2, $C = (0 \ 0)'$, $A = \begin{bmatrix} 0.8 & 0.1 \\ 0.2 & 0.5 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$. Simulate data from this model,
then estimate the model using OLS and feasible GLS. You should find that the two estimators
are identical, which might seem surprising, given that there is HET and AUT.
(d) (optional, and advanced). Prove analytically that the OLS and GLS estimators are identical.
Hint: this model is of the form of seemingly unrelated regressions.
9. Consider the model
\[
y_t = \alpha + \rho_1 y_{t-1} + \rho_2 y_{t-2} + \varepsilon_t
\]
where $\varepsilon_t$ is a N(0, 1) white noise error. This is an autoregressive model of order 2 (AR2).
Suppose that data is generated from the AR2 model, but the econometrician mistakenly decides
to estimate an AR1 model ($y_t = \alpha + \rho_1 y_{t-1} + \varepsilon_t$).
(a) Simulate data from the AR2 model, setting $\rho_1 = 0.5$ and $\rho_2 = 0.4$, using a sample size of
n = 30.
(b) Estimate the AR1 model by OLS, using the simulated data.
(c) Test the hypothesis that $\rho_1 = 0.5$.
(d) Test for autocorrelation using the test of your choice.
(e) Repeat the above steps 10000 times.
i. What percentage of the time does a t-test reject the hypothesis that $\rho_1 = 0.5$?
ii. What percentage of the time is the hypothesis of no autocorrelation rejected?
(f) Discuss your findings. Include a residual plot for a representative sample.
10. Modify the script given in Subsection 9.5 so that the first observation is dropped, rather than
given special treatment. This corresponds to using the Cochrane-Orcutt method, whereas the
script as provided implements the Prais-Winsten method. Check if there is an efficiency loss
when the first observation is dropped.
Chapter 10
Endogeneity and simultaneity
Several times we've encountered cases where correlation between regressors and the error term leads
to biasedness and inconsistency of the OLS estimator. Cases include autocorrelation with lagged
dependent variables (Example 22), measurement error in the regressors (Example 19) and missing
regressors (Section 7.4). Another important case we have not seen yet is that of simultaneous equations.
The cause is different, but the effect is the same: bias and inconsistency when OLS is applied to a
single equation. The basic idea is presented in Figure 10.1. A simple regression will estimate the
overall effect of x on y. If we're interested in the direct effect, $\beta$, then we have a problem when the
overall effect and the direct effect differ.
10.1 Simultaneous equations
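Up until now our model is
\[
y = X\beta + \varepsilon
\]
where we assume weak exogeneity of the regressors, so that $E(x_t \varepsilon_t) = 0$. With weak exogeneity, the
OLS estimator has desirable large sample properties (consistency, asymptotic normality).
Figure 10.1: Exogeneity and Endogeneity (adapted from Cameron and Trivedi)
Simultaneous equations is a different prospect. An example of a simultaneous equation system is
a simple supply-demand system:
\[
\begin{aligned}
\text{Demand: } q_t &= \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} \\
\text{Supply: } q_t &= \beta_1 + \beta_2 p_t + \varepsilon_{2t} \\
\mathcal{E}\left( \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix} \begin{bmatrix} \varepsilon_{1t} & \varepsilon_{2t} \end{bmatrix} \right) &= \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \cdot & \sigma_{22} \end{bmatrix} \equiv \Sigma, \ \forall t
\end{aligned}
\]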
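The presumption is that $q_t$ and $p_t$ are jointly determined at the same time by the intersection of these
equations. We'll assume that $y_t$ is determined by some unrelated process. It's easy to see that we have
correlation between regressors and errors. Solving for $p_t$:
\[
\begin{aligned}
\alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} &= \beta_1 + \beta_2 p_t + \varepsilon_{2t} \\
\beta_2 p_t - \alpha_2 p_t &= \alpha_1 - \beta_1 + \alpha_3 y_t + \varepsilon_{1t} - \varepsilon_{2t} \\
p_t &= \frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2}
\end{aligned}
\]
Now consider whether $p_t$ is uncorrelated with $\varepsilon_{1t}$:
\[
\begin{aligned}
\mathcal{E}(p_t \varepsilon_{1t}) &= \mathcal{E}\left\{ \left( \frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \varepsilon_{1t} \right\} \\
&= \frac{\sigma_{11} - \sigma_{12}}{\beta_2 - \alpha_2}
\end{aligned}
\]
Because of this correlation, weak exogeneity does not hold, and OLS estimation of the demand equation
will be biased and inconsistent. The same applies to the supply equation, for the same reason.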
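The correlation between price and the structural errors can be checked by simulation. The sketch below is only illustrative: the parameter values ($\alpha_1 = \beta_1 = 0$, $\alpha_2 = -1$, $\alpha_3 = 1$, $\beta_2 = 1$), the independent standard normal draws for $y_t$, $\varepsilon_{1t}$ and $\varepsilon_{2t}$ (so $\sigma_{11} = \sigma_{22} = 1$, $\sigma_{12} = 0$), and the helper names are assumptions, not part of the notes. Under these values the formula above gives $\mathcal{E}(p_t \varepsilon_{1t}) = 1/2$.

```python
import random


def simulate_market(n, a1, a2, a3, b1, b2, rng):
    """Draw n observations (q, p, y, e1, e2) from the supply-demand system.

    Demand: q = a1 + a2*p + a3*y + e1
    Supply: q = b1 + b2*p + e2
    The errors and y are independent N(0,1) draws (an assumption of this sketch).
    """
    data = []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        e1 = rng.gauss(0.0, 1.0)
        e2 = rng.gauss(0.0, 1.0)
        p = (a1 - b1 + a3 * y + e1 - e2) / (b2 - a2)  # reduced form for price
        q = b1 + b2 * p + e2                           # quantity from the supply equation
        data.append((q, p, y, e1, e2))
    return data


def mean(v):
    return sum(v) / len(v)


def ols_slope(x, z):
    """Slope coefficient from a regression of z on a constant and x."""
    mx, mz = mean(x), mean(z)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxz / sxx


rng = random.Random(123)
data = simulate_market(50000, 0.0, -1.0, 1.0, 0.0, 1.0, rng)
q = [d[0] for d in data]
p = [d[1] for d in data]
e1 = [d[3] for d in data]

# Sample analogue of E(p_t * eps_1t); theory gives (sigma11 - sigma12)/(beta2 - alpha2)
moment = mean([pi * ei for pi, ei in zip(p, e1)])

# OLS applied directly to the supply equation (q on a constant and p); true slope is 1
supply_slope = ols_slope(p, q)

print("sample E(p*e1):", moment)
print("OLS supply slope:", supply_slope)
```

With a large sample the first number should sit close to 0.5, and the OLS slope of the supply equation should stay well away from the true value of 1 no matter how many observations are drawn, which is what inconsistency means.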
In this model, $q_t$ and $p_t$ are the endogenous variables (endogs), that are determined within the
system. $y_t$ is an exogenous variable (exog). These concepts are a bit tricky, and we'll return to them in
a minute. First, some notation. Suppose we group together current endogs in the vector $Y_t$. If there
are G endogs, $Y_t$ is $G \times 1$. Group current and lagged exogs, as well as lagged endogs in the vector
$X_t$, which is $K \times 1$. Stack the errors of the G equations into the error vector $E_t$. The model, with
additional assumptions, can be written as
\[
\begin{aligned}
Y_t' \Gamma &= X_t' B + E_t' \\
E_t &\sim N(0, \Sigma), \ \forall t \qquad (10.1) \\
\mathcal{E}(E_t E_s') &= 0, \ t \neq s
\end{aligned}
\]
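There are G equations here, and the parameters that enter into each equation are contained in the
columns of the matrices $\Gamma$ and B. We can stack all n observations and write the model as
\[
\begin{aligned}
Y\Gamma &= XB + E \\
\mathcal{E}(X'E) &= 0_{(K \times G)} \\
vec(E) &\sim N(0, \Psi)
\end{aligned}
\]
where
\[
Y = \begin{bmatrix} Y_1' \\ Y_2' \\ \vdots \\ Y_n' \end{bmatrix}, \quad
X = \begin{bmatrix} X_1' \\ X_2' \\ \vdots \\ X_n' \end{bmatrix}, \quad
E = \begin{bmatrix} E_1' \\ E_2' \\ \vdots \\ E_n' \end{bmatrix}
\]
Y is $n \times G$, X is $n \times K$, and E is $n \times G$.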
This system is complete, in that there are as many equations as endogs.
There is a normality assumption. This isn't necessary, but allows us to consider the relationship
between least squares and ML estimators.
Since there is no autocorrelation of the $E_t$'s, and since the columns of E are individually
homoscedastic, then
\[
\Psi = \begin{bmatrix}
\sigma_{11} I_n & \sigma_{12} I_n & \cdots & \sigma_{1G} I_n \\
 & \sigma_{22} I_n & & \vdots \\
 & & \ddots & \vdots \\
\cdot & & & \sigma_{GG} I_n
\end{bmatrix} = \Sigma \otimes I_n
\]
Because vec(E) stacks the columns of E, the blocks are of the form $\sigma_{ij} I_n$, which is the Kronecker product $\Sigma \otimes I_n$.
X may contain lagged endogenous and exogenous variables. These variables are predetermined.
We need to define what is meant by "endogenous" and "exogenous" when classifying the current
period variables. Remember the definition of weak exogeneity (Assumption 15): the regressors are
weakly exogenous if $E(E_t | X_t) = 0$. Endogenous regressors are those for which this assumption
does not hold. As long as there is no autocorrelation, lagged endogenous variables are weakly
exogenous.
10.2 Reduced form
Recall that the model is
\[
\begin{aligned}
Y_t' \Gamma &= X_t' B + E_t' \\
V(E_t) &= \Sigma
\end{aligned}
\]
This is the model in structural form.
Definition 23. [Structural form] An equation is in structural form when more than one current period
endogenous variable is included.
The solution for the current period endogs is easy to find. It is
\[
\begin{aligned}
Y_t' &= X_t' B \Gamma^{-1} + E_t' \Gamma^{-1} \\
&= X_t' \Pi + V_t'
\end{aligned}
\]
Now only one current period endog appears in each equation. This is the reduced form.
Definition 24. [Reduced form] An equation is in reduced form if only one current period endog is
included.
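An example is our supply/demand system. The reduced form for quantity is obtained by solving
the supply equation for price and substituting into demand:
\[
\begin{aligned}
q_t &= \alpha_1 + \alpha_2 \left( \frac{q_t - \beta_1 - \varepsilon_{2t}}{\beta_2} \right) + \alpha_3 y_t + \varepsilon_{1t} \\
\beta_2 q_t - \alpha_2 q_t &= \beta_2 \alpha_1 - \alpha_2 (\beta_1 + \varepsilon_{2t}) + \beta_2 \alpha_3 y_t + \beta_2 \varepsilon_{1t} \\
q_t &= \frac{\beta_2 \alpha_1 - \alpha_2 \beta_1}{\beta_2 - \alpha_2} + \frac{\beta_2 \alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\beta_2 \varepsilon_{1t} - \alpha_2 \varepsilon_{2t}}{\beta_2 - \alpha_2} \\
&= \pi_{11} + \pi_{21} y_t + V_{1t}
\end{aligned}
\]
Similarly, the rf for price is
\[
\begin{aligned}
\beta_1 + \beta_2 p_t + \varepsilon_{2t} &= \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} \\
\beta_2 p_t - \alpha_2 p_t &= \alpha_1 - \beta_1 + \alpha_3 y_t + \varepsilon_{1t} - \varepsilon_{2t} \\
p_t &= \frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \\
&= \pi_{12} + \pi_{22} y_t + V_{2t}
\end{aligned}
\]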
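The interesting thing about the rf is that the equations individually satisfy the classical assumptions,
since $y_t$ is uncorrelated with $\varepsilon_{1t}$ and $\varepsilon_{2t}$ by assumption, and therefore $\mathcal{E}(y_t V_{it}) = 0$, i = 1, 2, $\forall t$. The
errors of the rf are
\[
\begin{bmatrix} V_{1t} \\ V_{2t} \end{bmatrix} =
\begin{bmatrix} \dfrac{\beta_2 \varepsilon_{1t} - \alpha_2 \varepsilon_{2t}}{\beta_2 - \alpha_2} \\[2ex] \dfrac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \end{bmatrix}
\]
The variance of $V_{1t}$ is
\[
\begin{aligned}
V(V_{1t}) &= \mathcal{E}\left[ \left( \frac{\beta_2 \varepsilon_{1t} - \alpha_2 \varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \left( \frac{\beta_2 \varepsilon_{1t} - \alpha_2 \varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \right] \\
&= \frac{\beta_2^2 \sigma_{11} - 2 \beta_2 \alpha_2 \sigma_{12} + \alpha_2^2 \sigma_{22}}{(\beta_2 - \alpha_2)^2}
\end{aligned}
\]
This is constant over time, so the first rf equation is homoscedastic.
Likewise, since the $\varepsilon_t$ are independent over time, so are the $V_t$.
The variance of the second rf error is
\[
\begin{aligned}
V(V_{2t}) &= \mathcal{E}\left[ \left( \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \left( \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \right] \\
&= \frac{\sigma_{11} - 2 \sigma_{12} + \sigma_{22}}{(\beta_2 - \alpha_2)^2}
\end{aligned}
\]
and the contemporaneous covariance of the errors across equations is
\[
\begin{aligned}
\mathcal{E}(V_{1t} V_{2t}) &= \mathcal{E}\left[ \left( \frac{\beta_2 \varepsilon_{1t} - \alpha_2 \varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \left( \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \right) \right] \\
&= \frac{\beta_2 \sigma_{11} - (\beta_2 + \alpha_2) \sigma_{12} + \alpha_2 \sigma_{22}}{(\beta_2 - \alpha_2)^2}
\end{aligned}
\]
In summary the rf equations individually satisfy the classical assumptions, under the assumptions
we've made, but they are contemporaneously correlated.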
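The general form of the rf is
\[
\begin{aligned}
Y_t' &= X_t' B \Gamma^{-1} + E_t' \Gamma^{-1} \\
&= X_t' \Pi + V_t'
\end{aligned}
\]
so we have that
\[
V_t = \left(\Gamma^{-1}\right)' E_t \sim N\left(0, \left(\Gamma^{-1}\right)' \Sigma \Gamma^{-1}\right), \ \forall t
\]
and that the $V_t$ are timewise independent (note that this wouldn't be the case if the $E_t$ were
autocorrelated).
From the reduced form, we can easily see that the endogenous variables are correlated with the
structural errors:
\[
\begin{aligned}
E(E_t Y_t') &= E\left\{ E_t \left( X_t' B \Gamma^{-1} + E_t' \Gamma^{-1} \right) \right\} \\
&= E\left( E_t X_t' B \Gamma^{-1} + E_t E_t' \Gamma^{-1} \right) \\
&= \Sigma \Gamma^{-1} \qquad (10.2)
\end{aligned}
\]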
10.3 Estimation of the reduced form equations
From above, the RF equations are
\[
\begin{aligned}
Y_t' &= X_t' B \Gamma^{-1} + E_t' \Gamma^{-1} \\
&= X_t' \Pi + V_t'
\end{aligned}
\]
and
\[
V_t \sim N(0, \Xi), \ \forall t
\]
where we define $\Xi \equiv (\Gamma^{-1})' \Sigma \Gamma^{-1}$. The rf parameter estimator $\hat{\Pi}$ is simply OLS applied to this model,
equation by equation:
\[
\hat{\Pi} = (X'X)^{-1} X'Y
\]
which is simply
\[
\hat{\Pi} = (X'X)^{-1} X' \begin{bmatrix} y_1 & y_2 & \cdots & y_G \end{bmatrix}
\]
that is, OLS equation by equation using all the exogs in the estimation of each column of $\Pi$.
It may seem odd that we use OLS on the reduced form, since the rf equations are correlated,
because $\Xi = (\Gamma^{-1})' \Sigma \Gamma^{-1}$ is a full matrix. Why don't we do GLS to improve efficiency of estimation of
the RF parameters?
OLS equation by equation to get the rf is equivalent to
\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_G \end{bmatrix} =
\begin{bmatrix} X & 0 & \cdots & 0 \\ 0 & X & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & X \end{bmatrix}
\begin{bmatrix} \pi_1 \\ \pi_2 \\ \vdots \\ \pi_G \end{bmatrix} +
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_G \end{bmatrix}
\]
where $y_i$ is the $n \times 1$ vector of observations of the $i$th endog, X is the entire $n \times K$ matrix of exogs,
$\pi_i$ is the $i$th column of $\Pi$, and $v_i$ is the $i$th column of V. Use the notation
\[
y = \mathbf{X}\pi + v
\]
to indicate the pooled model, where $\mathbf{X}$ is the block-diagonal regressor matrix. Following this notation, the error covariance matrix is
\[
V(v) = \Xi \otimes I_n
\]
This is a special case of a type of model known as a set of seemingly unrelated equations (SUR)
since the parameter vector of each equation is different. The important feature of this special
case is that the regressors are the same in each equation. The equations are contemporaneously
correlated, because of the non-zero off-diagonal elements in $\Xi$.
Note that each equation of the system individually satisfies the classical assumptions.
Normally when doing SUR, one simply does GLS on the whole system $y = \mathbf{X}\pi + v$, where
$V(v) = \Xi \otimes I_n$, which is in general more efficient than OLS on each equation.
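However, when the regressors are the same in all equations, as is true in the present case of
estimation of the RF parameters, SUR $\equiv$ OLS. To show this note that in this case the block-diagonal regressor matrix of the pooled model is $\mathbf{X} = I_G \otimes X$.
Using the rules
1. $(A \otimes B)^{-1} = (A^{-1} \otimes B^{-1})$
2. $(A \otimes B)' = (A' \otimes B')$ and
3. $(A \otimes B)(C \otimes D) = (AC \otimes BD)$, we get
\[
\begin{aligned}
\hat{\pi}_{SUR} &= \left[ (I_G \otimes X)' (\Xi \otimes I_n)^{-1} (I_G \otimes X) \right]^{-1} (I_G \otimes X)' (\Xi \otimes I_n)^{-1} y \\
&= \left[ \left( \Xi^{-1} \otimes X' \right) (I_G \otimes X) \right]^{-1} \left( \Xi^{-1} \otimes X' \right) y \\
&= \left( \Xi \otimes (X'X)^{-1} \right) \left( \Xi^{-1} \otimes X' \right) y \\
&= \left[ I_G \otimes (X'X)^{-1} X' \right] y \\
&= \begin{bmatrix} \hat{\pi}_1 \\ \hat{\pi}_2 \\ \vdots \\ \hat{\pi}_G \end{bmatrix}
\end{aligned}
\]
Note that this provides the answer to exercise 8d in the chapter on GLS.
So the unrestricted rf coefficients can be estimated efficiently (assuming normality) by OLS, even
if the equations are correlated.
We have ignored any potential zeros in the matrix $\Pi$, which if they exist could potentially increase
the efficiency of estimation of the rf.
Another example where SUR $\equiv$ OLS is in estimation of vector autoregressions, which is discussed
in Section 15.2.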
10.4 Bias and inconsistency of OLS estimation of a structural
equation
Considering the first equation (this is without loss of generality, since we can always reorder the
equations) we can partition the Y matrix as
\[
Y = \begin{bmatrix} y & Y_1 & Y_2 \end{bmatrix}
\]
y is the first column
$Y_1$ are the other endogenous variables that enter the first equation
$Y_2$ are endogs that are excluded from this equation
Similarly, partition X as
\[
X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}
\]
$X_1$ are the included exogs, and $X_2$ are the excluded exogs.
Finally, partition the error matrix as
\[
E = \begin{bmatrix} \varepsilon & E_{12} \end{bmatrix}
\]
Assume that $\Gamma$ has ones on the main diagonal. These are normalization restrictions that simply
scale the remaining coefficients on each equation, and which scale the variances of the error terms.
Given this scaling and our partitioning, the coefficient matrices can be written as
\[
\Gamma = \begin{bmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \end{bmatrix}, \qquad
B = \begin{bmatrix} \beta_1 & B_{12} \\ 0 & B_{22} \end{bmatrix}
\]
With this, the first equation can be written as
\[
\begin{aligned}
y &= Y_1 \gamma_1 + X_1 \beta_1 + \varepsilon \qquad (10.3) \\
&= Z\delta + \varepsilon
\end{aligned}
\]
The problem, as we've seen, is that the columns of Z corresponding to $Y_1$ are correlated with $\varepsilon$,
because these are endogenous variables, and as we saw in equation 10.2, the endogenous variables are
correlated with the structural errors, so they don't satisfy weak exogeneity. So, $E(Z'\varepsilon) \neq 0$. What are
the properties of the OLS estimator in this situation?
\[
\begin{aligned}
\hat{\delta} &= (Z'Z)^{-1} Z'y \\
&= (Z'Z)^{-1} Z' \left( Z\delta^0 + \varepsilon \right) \\
&= \delta^0 + (Z'Z)^{-1} Z'\varepsilon
\end{aligned}
\]
It's clear that the OLS estimator is biased in general. Also,
\[
\hat{\delta} - \delta^0 = \left( \frac{Z'Z}{n} \right)^{-1} \frac{Z'\varepsilon}{n}
\]
Say that $\lim \frac{Z'\varepsilon}{n} = A$, a.s., and $\lim \frac{Z'Z}{n} = Q_Z$, a.s. Then
\[
\lim \left( \hat{\delta} - \delta^0 \right) = Q_Z^{-1} A \neq 0, \ a.s.
\]
So the OLS estimator of a structural equation is inconsistent. In general, correlation between regressors
and errors leads to this problem, whether due to measurement error, simultaneity, or omitted regressors.
10.5 Note about the rest of this chapter
In class, I will not teach the material in the rest of this chapter at this time, but instead we will go
on to GMM. The material that follows is easier to understand in the context of GMM, where we get
a nice unified theory.
10.6 Identification by exclusion restrictions
The material in the rest of this chapter is no longer used in classes, but I'm leaving it in the notes for
reference.
The identification problem in simultaneous equations is in fact of the same nature as the identification
problem in any estimation setting: does the limiting objective function have the proper curvature
so that there is a unique global minimum or maximum at the true parameter value? In the context
of IV estimation, this is the case if the limiting covariance of the IV estimator is positive definite and
$\text{plim}\, \frac{1}{n} W'\varepsilon = 0$. This matrix is
\[
V_\infty(\hat{\beta}_{IV}) = \left( Q_{XW} Q_{WW}^{-1} Q_{XW}' \right)^{-1} \sigma^2
\]
The necessary and sufficient condition for identification is simply that this matrix be positive
definite, and that the instruments be (asymptotically) uncorrelated with $\varepsilon$.
For this matrix to be positive definite, we need that the conditions noted above hold: $Q_{WW}$ must
be positive definite and $Q_{XW}$ must be of full rank (K).
These identification conditions are not that intuitive nor is it very obvious how to check them.
Necessary conditions
If we use IV estimation for a single equation of the system, the equation can be written as
\[
y = Z\delta + \varepsilon
\]
where
\[
Z = \begin{bmatrix} Y_1 & X_1 \end{bmatrix}
\]
Notation:
Let K be the total number of weakly exogenous variables.
Let $K^* = cols(X_1)$ be the number of included exogs, and let $K^{**} = K - K^*$ be the number of
excluded exogs (in this equation).
Let $G^* = cols(Y_1) + 1$ be the total number of included endogs, and let $G^{**} = G - G^*$ be the
number of excluded endogs.
Using this notation, consider the selection of instruments.
Now the $X_1$ are weakly exogenous and can serve as their own instruments.
It turns out that X exhausts the set of possible instruments, in that if the variables in X don't
lead to an identified model then no other instruments will identify the model either. Assuming
this is true (we'll prove it in a moment), then a necessary condition for identification is that
$cols(X_2) \geq cols(Y_1)$ since if not then at least one instrument must be used twice, so W will not
have full column rank:
\[
\rho(W) < K^* + G^* - 1 \Rightarrow \rho(Q_{ZW}) < K^* + G^* - 1
\]
This is the order condition for identification in a set of simultaneous equations. When the only
identifying information is exclusion restrictions on the variables that enter an equation, then
the number of excluded exogs must be greater than or equal to the number of included endogs,
minus 1 (the normalized lhs endog), i.e.,
\[
K^{**} \geq G^* - 1
\]
To show that this is in fact a necessary condition consider some arbitrary set of instruments W.
A necessary condition for identification is that
\[
\rho\left( \text{plim}\, \frac{1}{n} W'Z \right) = K^* + G^* - 1
\]
where
\[
Z = \begin{bmatrix} Y_1 & X_1 \end{bmatrix}
\]
Recall that we've partitioned the model
\[
Y\Gamma = XB + E
\]
as
\[
Y = \begin{bmatrix} y & Y_1 & Y_2 \end{bmatrix}, \qquad X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}
\]
Given the reduced form
\[
Y = X\Pi + V
\]
we can write the reduced form using the same partition
\[
\begin{bmatrix} y & Y_1 & Y_2 \end{bmatrix} = \begin{bmatrix} X_1 & X_2 \end{bmatrix}
\begin{bmatrix} \pi_{11} & \Pi_{12} & \Pi_{13} \\ \pi_{21} & \Pi_{22} & \Pi_{23} \end{bmatrix} +
\begin{bmatrix} v & V_1 & V_2 \end{bmatrix}
\]
so we have
\[
Y_1 = X_1 \Pi_{12} + X_2 \Pi_{22} + V_1
\]
so
\[
\frac{1}{n} W'Z = \frac{1}{n} W' \begin{bmatrix} X_1 \Pi_{12} + X_2 \Pi_{22} + V_1 & X_1 \end{bmatrix}
\]
Because the W's are uncorrelated with the $V_1$'s, by assumption, the cross between W and $V_1$ converges
in probability to zero, so
\[
\text{plim}\, \frac{1}{n} W'Z = \text{plim}\, \frac{1}{n} W' \begin{bmatrix} X_1 \Pi_{12} + X_2 \Pi_{22} & X_1 \end{bmatrix}
\]
Since the far rhs term is formed only of linear combinations of columns of X, the rank of this matrix
can never be greater than K, regardless of the choice of instruments. If Z has more than K columns,
then it is not of full column rank. When Z has more than K columns we have
\[
G^* - 1 + K^* > K
\]
or noting that $K^{**} = K - K^*$,
\[
G^* - 1 > K^{**}
\]
In this case, the limiting matrix is not of full column rank, and the identification condition fails.
Sufficient conditions
Identification essentially requires that the structural parameters be recoverable from the data. This
won't be the case, in general, unless the structural model is subject to some restrictions. We've
already identified necessary conditions. We now turn to sufficient conditions (again, we're only considering
identification through zero restrictions on the parameters, for the moment).
The model is
\[
\begin{aligned}
Y_t' \Gamma &= X_t' B + E_t' \\
V(E_t) &= \Sigma
\end{aligned}
\]
This leads to the reduced form
\[
\begin{aligned}
Y_t' &= X_t' B \Gamma^{-1} + E_t' \Gamma^{-1} \\
&= X_t' \Pi + V_t' \\
V(V_t) &= \left( \Gamma^{-1} \right)' \Sigma \Gamma^{-1} = \Xi
\end{aligned}
\]
The reduced form parameters are consistently estimable, but none of them are known a priori, and
there are no restrictions on their values. The problem is that more than one structural form has the
same reduced form, so knowledge of the reduced form parameters alone isn't enough to determine the
structural parameters. To see this, consider the model
\[
\begin{aligned}
Y_t' \Gamma F &= X_t' B F + E_t' F \\
V(E_t' F) &= F' \Sigma F
\end{aligned}
\]
where F is some arbitrary nonsingular $G \times G$ matrix. The rf of this new model is
\[
\begin{aligned}
Y_t' &= X_t' B F (\Gamma F)^{-1} + E_t' F (\Gamma F)^{-1} \\
&= X_t' B F F^{-1} \Gamma^{-1} + E_t' F F^{-1} \Gamma^{-1} \\
&= X_t' B \Gamma^{-1} + E_t' \Gamma^{-1} \\
&= X_t' \Pi + V_t'
\end{aligned}
\]
Likewise, the covariance of the rf of the transformed model is
\[
V\left( E_t' F (\Gamma F)^{-1} \right) = V\left( E_t' \Gamma^{-1} \right) = \Xi
\]
Since the two structural forms lead to the same rf, and the rf is all that is directly estimable, the
models are said to be observationally equivalent. What we need for identification are restrictions on
$\Gamma$ and B such that the only admissible F is an identity matrix (if all of the equations are to be identified).
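Take the coefficient matrices as partitioned before:
\[
\begin{bmatrix} \Gamma \\ B \end{bmatrix} =
\begin{bmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \\ \beta_1 & B_{12} \\ 0 & B_{22} \end{bmatrix}
\]
The coefficients of the first equation of the transformed model are simply these coefficients multiplied
by the first column of F. This gives
\[
\begin{bmatrix} \Gamma \\ B \end{bmatrix} \begin{bmatrix} f_{11} \\ F_2 \end{bmatrix} =
\begin{bmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \\ \beta_1 & B_{12} \\ 0 & B_{22} \end{bmatrix}
\begin{bmatrix} f_{11} \\ F_2 \end{bmatrix}
\]
For identification of the first equation we need that there be enough restrictions so that the only
admissible
\[
\begin{bmatrix} f_{11} \\ F_2 \end{bmatrix}
\]
be the leading column of an identity matrix, so that
\[
\begin{bmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \\ \beta_1 & B_{12} \\ 0 & B_{22} \end{bmatrix}
\begin{bmatrix} f_{11} \\ F_2 \end{bmatrix} =
\begin{bmatrix} 1 \\ -\gamma_1 \\ 0 \\ \beta_1 \\ 0 \end{bmatrix}
\]
Note that the third and fifth rows are
\[
\begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix} F_2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]
Supposing that the leading matrix is of full column rank, i.e.,
\[
\rho\left( \begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix} \right) = cols\left( \begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix} \right) = G - 1
\]
then the only way this can hold, without additional restrictions on the model's parameters, is if $F_2$ is
a vector of zeros. Given that $F_2$ is a vector of zeros, then the first equation
\[
\begin{bmatrix} 1 & \Gamma_{12} \end{bmatrix} \begin{bmatrix} f_{11} \\ F_2 \end{bmatrix} = 1 \Rightarrow f_{11} = 1
\]
Therefore, as long as
\[
\rho\left( \begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix} \right) = G - 1
\]
then
\[
\begin{bmatrix} f_{11} \\ F_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 0_{G-1} \end{bmatrix}
\]
The first equation is identified in this case, so the condition is sufficient for identification. It is also
necessary, since the condition implies that this submatrix must have at least G - 1 rows. Since this
matrix has
\[
G^{**} + K^{**} = G - G^* + K^{**}
\]
rows, we obtain
\[
G - G^* + K^{**} \geq G - 1
\]
or
\[
K^{**} \geq G^* - 1
\]
which is the previously derived necessary condition.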
The above result is fairly intuitive (draw picture here). The necessary condition ensures that there
are enough variables not in the equation of interest to potentially move the other equations, so as to
trace out the equation of interest. The sufficient condition ensures that those other equations in fact
do move around as the variables change their values. Some points:
When an equation has $K^{**} = G^* - 1$, it is exactly identified, in that omission of an identifying
restriction is not possible without losing consistency.
When $K^{**} > G^* - 1$, the equation is overidentified, since one could drop a restriction and
still retain consistency. Overidentifying restrictions are therefore testable. When an equation
is overidentified we have more instruments than are strictly necessary for consistent estimation.
Since estimation by IV with more instruments is more efficient asymptotically, one should employ
overidentifying restrictions if one is confident that they're true.
We can repeat this partition for each equation in the system, to see which equations are identified
and which aren't.
These results are valid assuming that the only identifying information comes from knowing
which variables appear in which equations, i.e., by exclusion restrictions, and through the use
of a normalization. There are other sorts of identifying information that can be used. These
include
1. Cross equation restrictions
2. Additional restrictions on parameters within equations (as in the Klein model discussed
below)
3. Restrictions on the covariance matrix of the errors
4. Nonlinearities in variables
When these sorts of information are available, the above conditions aren't necessary for
identification, though they are of course still sufficient.
To give an example of how other information can be used, consider the model
\[
Y\Gamma = XB + E
\]
where $\Gamma$ is an upper triangular matrix with 1's on the main diagonal. This is a triangular system of
equations. In this case, the first equation is
\[
y_1 = XB_{\cdot 1} + E_{\cdot 1}
\]
Since only exogs appear on the rhs, this equation is identified.
The second equation is
\[
y_2 = -\gamma_{21} y_1 + XB_{\cdot 2} + E_{\cdot 2}
\]
This equation has $K^{**} = 0$ excluded exogs, and $G^* = 2$ included endogs, so it fails the order (necessary)
condition for identification.
However, suppose that we have the restriction $\sigma_{21} = 0$, so that the first and second structural
errors are uncorrelated. In this case
\[
\mathcal{E}(y_{1t} \varepsilon_{2t}) = \mathcal{E}\left\{ \left( X_t' B_{\cdot 1} + \varepsilon_{1t} \right) \varepsilon_{2t} \right\} = 0
\]
so there's no problem of simultaneity. If the entire $\Sigma$ matrix is diagonal, then following the same
logic, all of the equations are identified. This is known as a fully recursive model.
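The counting behind the order condition of this section is easy to mechanize. The function below is a hypothetical helper written for illustration (its name, arguments and return strings are assumptions, not part of the notes). It only covers identification through exclusion restrictions and a normalization, so it would wrongly flag the second equation of the triangular system above, which is identified through a covariance restriction instead.

```python
def order_condition(K, K_star, G_star):
    """Classify an equation by the order condition K** >= G* - 1.

    K      : total number of weakly exogenous variables in the system
    K_star : number of exogs included in the equation (K*)
    G_star : number of endogs included in the equation, counting the lhs one (G*)
    """
    K_excluded = K - K_star  # K** = K - K*
    needed = G_star - 1      # included endogs other than the normalized lhs endog
    if K_excluded < needed:
        return "fails order condition"
    if K_excluded == needed:
        return "exactly identified if the rank condition holds"
    return "overidentified if the rank condition holds"


# Supply-demand example: the exogs are the constant and y_t, so K = 2.
# Demand includes the constant and y_t (K* = 2) and both endogs q_t, p_t (G* = 2).
demand = order_condition(K=2, K_star=2, G_star=2)
# Supply includes only the constant (K* = 1) and both endogs (G* = 2).
supply = order_condition(K=2, K_star=1, G_star=2)

print("demand:", demand)
print("supply:", supply)
```

In the supply-demand example the excluded income variable is what makes the supply equation a candidate for identification, while the demand equation excludes nothing and fails the count.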
10.7 2SLS
When we have no information regarding cross-equation restrictions or the structure of the error
covariance matrix, one can estimate the parameters of a single equation of the system without regard to
the other equations.
This isn't always efficient, as we'll see, but it has the advantage that misspecifications in other
equations will not affect the consistency of the estimator of the parameters of the equation of
interest.
Also, estimation of the equation won't be affected by identification problems in other equations.
The 2SLS estimator is very simple: it is the GIV estimator, using all of the weakly exogenous variables
as instruments. In the first stage, each column of $Y_1$ is regressed on all the weakly exogenous variables
in the system, i.e., the entire X matrix. The fitted values are
\[
\begin{aligned}
\hat{Y}_1 &= X(X'X)^{-1} X'Y_1 \\
&= P_X Y_1 \\
&= X\hat{\Pi}_1
\end{aligned}
\]
Since these fitted values are the projection of $Y_1$ on the space spanned by X, and since any vector
in this space is uncorrelated with $\varepsilon$ by assumption, $\hat{Y}_1$ is uncorrelated with $\varepsilon$. Since $\hat{Y}_1$ is simply the
reduced-form prediction, it is correlated with $Y_1$. The only other requirement is that the instruments
be linearly independent. This should be the case when the order condition is satisfied, since there are
at least as many columns in $X_2$ as in $Y_1$ in this case.
The second stage substitutes $\hat{Y}_1$ in place of $Y_1$, and estimates by OLS. The original model is
\[
\begin{aligned}
y &= Y_1 \gamma_1 + X_1 \beta_1 + \varepsilon \\
&= Z\delta + \varepsilon
\end{aligned}
\]
and the second stage model is
\[
y = \hat{Y}_1 \gamma_1 + X_1 \beta_1 + \varepsilon.
\]
Since $X_1$ is in the space spanned by X, $P_X X_1 = X_1$, so we can write the second stage model as
\[
\begin{aligned}
y &= P_X Y_1 \gamma_1 + P_X X_1 \beta_1 + \varepsilon \\
&\equiv P_X Z\delta + \varepsilon
\end{aligned}
\]
The OLS estimator applied to this model is
\[
\hat{\delta} = (Z' P_X Z)^{-1} Z' P_X y
\]
which is exactly what we get if we estimate using IV, with the reduced form predictions of the endogs
used as instruments. Note that if we define
\[
\begin{aligned}
\hat{Z} &= P_X Z \\
&= \begin{bmatrix} \hat{Y}_1 & X_1 \end{bmatrix}
\end{aligned}
\]
so that $\hat{Z}$ are the instruments for Z, then we can write
\[
\hat{\delta} = (\hat{Z}'Z)^{-1} \hat{Z}'y
\]
Important note: OLS on the transformed model can be used to calculate the 2SLS estimate of
$\delta$, since we see that it's equivalent to IV using a particular set of instruments. However the OLS
covariance formula is not valid. We need to apply the IV covariance formula already seen above.
Actually, there is also a simplification of the general IV variance formula. Define
$$\hat{Z} = P_X Z = [\hat{Y}_1 \;\; X_1]$$
The IV covariance estimator would ordinarily be
$$\hat{V}(\hat{\delta}) = (Z'\hat{Z})^{-1}(\hat{Z}'\hat{Z})(\hat{Z}'Z)^{-1}\hat{\sigma}^2_{IV}$$
However, looking at the last term in brackets
$$\hat{Z}'Z = [\hat{Y}_1 \;\; X_1]'[Y_1 \;\; X_1] = \begin{bmatrix} Y_1'(P_X)Y_1 & Y_1'(P_X)X_1 \\ X_1'Y_1 & X_1'X_1 \end{bmatrix}$$
but since $P_X$ is idempotent and since $P_X X = X$, we can write
$$[\hat{Y}_1 \;\; X_1]'[Y_1 \;\; X_1] = \begin{bmatrix} Y_1'P_X P_X Y_1 & Y_1'P_X X_1 \\ X_1'P_X Y_1 & X_1'X_1 \end{bmatrix} = [\hat{Y}_1 \;\; X_1]'[\hat{Y}_1 \;\; X_1] = \hat{Z}'\hat{Z}$$
Therefore, the second and last term in the variance formula cancel, so the 2SLS varcov estimator
simplifies to
$$\hat{V}(\hat{\delta}) = (Z'\hat{Z})^{-1}\hat{\sigma}^2_{IV}$$
which, following some algebra similar to the above, can also be written as
$$\hat{V}(\hat{\delta}) = (\hat{Z}'\hat{Z})^{-1}\hat{\sigma}^2_{IV} \qquad (10.4)$$
Finally, recall that though this is presented in terms of the first equation, it is general since any
equation can be placed first.
Properties of 2SLS:
1. Consistent
2. Asymptotically normal
3. Biased when the mean exists (the existence of moments is a technical issue we won't go into
here).
4. Asymptotically inefficient, except in special circumstances (more on this later).
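To make the algebra above concrete, here is a minimal numerical sketch in Python with NumPy (the programs that accompany these notes are written in Octave; Python is used here only for illustration). The data generating process, the variable names and the sample size are all invented for the example. The sketch checks that second-stage OLS and the IV formula with $\hat{Z}$ as instruments give the same $\hat{\delta}$, and it computes the covariance estimate of equation (10.4), using residuals based on $Z$ rather than $\hat{Z}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: X holds all exogs (constant plus two excluded exogs).
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
eps = rng.normal(size=n)
v = 0.8 * eps + rng.normal(size=n)        # correlation with eps makes Y1 endogenous
Y1 = X @ np.array([1.0, 1.0, -1.0]) + v   # reduced form for the endogenous regressor
y = 2.0 * Y1 + 1.0 + eps                  # structural equation, true delta = (2, 1)

Z = np.column_stack([Y1, X[:, 0]])        # Z = [Y1 X1], with X1 the constant
PX = X @ np.linalg.solve(X.T @ X, X.T)    # projection matrix P_X
Zhat = PX @ Z                             # Zhat = [Y1hat X1]

# (a) second-stage OLS of y on Zhat, (b) IV with Zhat as instruments for Z
delta_ols = np.linalg.solve(Zhat.T @ Zhat, Zhat.T @ y)
delta_iv = np.linalg.solve(Zhat.T @ Z, Zhat.T @ y)

# The residuals must be computed with Z, not Zhat
e = y - Z @ delta_iv
sigma2_iv = e @ e / n
V = sigma2_iv * np.linalg.inv(Zhat.T @ Zhat)   # equation (10.4)
std_err = np.sqrt(np.diag(V))
```

The naive OLS covariance from the second stage would instead use the residuals `y - Zhat @ delta_ols`, which is why it is not valid.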
10.8 Testing the overidentifying restrictions
The selection of which variables are endogs and which are exogs is part of the specification of the
model. As such, there is room for error here: one might erroneously classify a variable as exog when
it is in fact correlated with the error term. A general test for the specification of the model can be
formulated as follows:
The IV estimator can be calculated by applying OLS to the transformed model, so the IV objective
function at the minimized value is
$$s(\hat{\beta}_{IV}) = (y - X\hat{\beta}_{IV})'P_W(y - X\hat{\beta}_{IV}),$$
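but
$$\begin{aligned}
\hat{\varepsilon}_{IV} = y - X\hat{\beta}_{IV} &= y - X(X'P_WX)^{-1}X'P_Wy \\
&= \left(I - X(X'P_WX)^{-1}X'P_W\right)y \\
&= \left(I - X(X'P_WX)^{-1}X'P_W\right)(X\beta + \varepsilon) \\
&= A(X\beta + \varepsilon)
\end{aligned}$$
where
$$A \equiv I - X(X'P_WX)^{-1}X'P_W$$
so
$$s(\hat{\beta}_{IV}) = (\varepsilon' + \beta'X')A'P_WA(X\beta + \varepsilon)$$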
Moreover, $A'P_WA$ is idempotent, as can be verified by multiplication:
$$\begin{aligned}
A'P_WA &= \left(I - P_WX(X'P_WX)^{-1}X'\right)P_W\left(I - X(X'P_WX)^{-1}X'P_W\right) \\
&= \left(P_W - P_WX(X'P_WX)^{-1}X'P_W\right)\left(P_W - P_WX(X'P_WX)^{-1}X'P_W\right) \\
&= \left(I - P_WX(X'P_WX)^{-1}X'\right)P_W.
\end{aligned}$$
Furthermore, $A$ is orthogonal to $X$:
$$\begin{aligned}
AX &= \left(I - X(X'P_WX)^{-1}X'P_W\right)X \\
&= X - X \\
&= 0
\end{aligned}$$
so
$$s(\hat{\beta}_{IV}) = \varepsilon'A'P_WA\varepsilon$$
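Supposing the $\varepsilon$ are normally distributed, with variance $\sigma^2$, then the random variable
$$\frac{s(\hat{\beta}_{IV})}{\sigma^2} = \frac{\varepsilon'A'P_WA\varepsilon}{\sigma^2}$$
is a quadratic form of a $N(0,1)$ random variable with an idempotent matrix in the middle, so
$$\frac{s(\hat{\beta}_{IV})}{\sigma^2} \sim \chi^2\left(\rho(A'P_WA)\right)$$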
This isn't available, since we need to estimate $\sigma^2$. Substituting a consistent estimator,
$$\frac{s(\hat{\beta}_{IV})}{\hat{\sigma}^2} \overset{a}{\sim} \chi^2\left(\rho(A'P_WA)\right)$$
Even if the $\varepsilon$ aren't normally distributed, the asymptotic result still holds. The last thing we
need to determine is the rank of the idempotent matrix. We have
$$A'P_WA = P_W - P_WX(X'P_WX)^{-1}X'P_W$$
so
$$\begin{aligned}
\rho(A'P_WA) &= \mathrm{Tr}\left(P_W - P_WX(X'P_WX)^{-1}X'P_W\right) \\
&= \mathrm{Tr}\,P_W - \mathrm{Tr}\,X'P_WP_WX(X'P_WX)^{-1} \\
&= \mathrm{Tr}\,W(W'W)^{-1}W' - K_X \\
&= \mathrm{Tr}\,W'W(W'W)^{-1} - K_X \\
&= K_W - K_X
\end{aligned}$$
where $K_W$ is the number of columns of $W$ and $K_X$ is the number of columns of $X$. The degrees
of freedom of the test is simply the number of overidentifying restrictions: the number of
instruments we have beyond the number that is strictly necessary for consistent estimation.
This test is an overall specification test: the joint null hypothesis is that the model is correctly
specified and that the $W$ form valid instruments (e.g., that the variables classified as exogs really
are uncorrelated with $\varepsilon$). Rejection can mean that either the model $y = Z\delta + \varepsilon$ is misspecified,
or that there is correlation between $X$ and $\varepsilon$.
This is a particular case of the GMM criterion test, which is covered in the second half of the
course. See Section 14.9.
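Note that since
$$\hat{\varepsilon}_{IV} = A\varepsilon$$
and
$$s(\hat{\beta}_{IV}) = \varepsilon'A'P_WA\varepsilon$$
we can write
$$\begin{aligned}
\frac{s(\hat{\beta}_{IV})}{\hat{\sigma}^2} &= \frac{\left(\hat{\varepsilon}'W(W'W)^{-1}W'\right)\left(W(W'W)^{-1}W'\hat{\varepsilon}\right)}{\hat{\varepsilon}'\hat{\varepsilon}/n} \\
&= n\left(RSS_{\hat{\varepsilon}_{IV}|W}/TSS_{\hat{\varepsilon}_{IV}}\right) \\
&= nR^2_u
\end{aligned}$$
where $R^2_u$ is the uncentered $R^2$ from a regression of the IV residuals on all of the instruments
$W$. Here $RSS_{\hat{\varepsilon}_{IV}|W}$ is the explained sum of squares $\hat{\varepsilon}'P_W\hat{\varepsilon}$ of that regression and
$TSS_{\hat{\varepsilon}_{IV}} = \hat{\varepsilon}'\hat{\varepsilon}$. This is a convenient way to calculate the test statistic.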
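The two ways of computing the test statistic can be compared numerically. The following is a small sketch in Python with NumPy on invented data (four instruments, two regressors, so two overidentifying restrictions); none of the names come from the notes' own programs. It also checks that the trace of $A'P_WA$ equals $K_W - K_X$. The p-value line uses the fact that a $\chi^2(2)$ variable is exponential with mean 2, so it is only valid for two degrees of freedom.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 400
W = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # K_W = 4 instruments
eps = rng.normal(size=n)
x = W @ np.array([0.5, 1.0, 1.0, 1.0]) + 0.7 * eps + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])                          # K_X = 2 regressors
y = X @ np.array([1.0, 2.0]) + eps

PW = W @ np.linalg.solve(W.T @ W, W.T)
beta_iv = np.linalg.solve(X.T @ PW @ X, X.T @ PW @ y)
e = y - X @ beta_iv                                           # IV residuals

# Direct form: minimized objective divided by the variance estimate
sigma2_hat = e @ e / n
stat_direct = (e @ PW @ e) / sigma2_hat

# n times the uncentered R^2 of a regression of e on W
fitted = PW @ e
stat_nr2 = n * (fitted @ fitted) / (e @ e)

# Degrees of freedom: rank of A'P_W A, computed here as a trace
A = np.eye(n) - X @ np.linalg.solve(X.T @ PW @ X, X.T @ PW)
dof_trace = np.trace(A.T @ PW @ A)
dof = W.shape[1] - X.shape[1]

# Survival function of a chi-square with exactly 2 degrees of freedom
p_value = math.exp(-stat_direct / 2)
```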
On an aside, consider IV estimation of a just-identified model, using the standard notation
$$y = X\beta + \varepsilon$$
and $W$ is the matrix of instruments. If we have exact identification then $cols(W) = cols(X)$, so $W'X$
is a square matrix. The transformed model is
$$P_Wy = P_WX\beta + P_W\varepsilon$$
and the fonc are
$$X'P_W(y - X\hat{\beta}_{IV}) = 0$$
The IV estimator is
$$\hat{\beta}_{IV} = (X'P_WX)^{-1}X'P_Wy$$
Considering the inverse here
$$\begin{aligned}
(X'P_WX)^{-1} &= \left(X'W(W'W)^{-1}W'X\right)^{-1} \\
&= (W'X)^{-1}\left(X'W(W'W)^{-1}\right)^{-1} \\
&= (W'X)^{-1}(W'W)(X'W)^{-1}
\end{aligned}$$
Now multiplying this by $X'P_Wy$, we obtain
$$\begin{aligned}
\hat{\beta}_{IV} &= (W'X)^{-1}(W'W)(X'W)^{-1}X'P_Wy \\
&= (W'X)^{-1}(W'W)(X'W)^{-1}X'W(W'W)^{-1}W'y \\
&= (W'X)^{-1}W'y
\end{aligned}$$
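The objective function for the generalized IV estimator is
$$\begin{aligned}
s(\hat{\beta}_{IV}) &= (y - X\hat{\beta}_{IV})'P_W(y - X\hat{\beta}_{IV}) \\
&= y'P_W(y - X\hat{\beta}_{IV}) - \hat{\beta}_{IV}'X'P_W(y - X\hat{\beta}_{IV}) \\
&= y'P_W(y - X\hat{\beta}_{IV}) - \hat{\beta}_{IV}'X'P_Wy + \hat{\beta}_{IV}'X'P_WX\hat{\beta}_{IV} \\
&= y'P_W(y - X\hat{\beta}_{IV}) - \hat{\beta}_{IV}'\left(X'P_Wy - X'P_WX\hat{\beta}_{IV}\right) \\
&= y'P_W(y - X\hat{\beta}_{IV})
\end{aligned}$$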
by the fonc for generalized IV. However, when we're in the just identified case, this is
$$\begin{aligned}
s(\hat{\beta}_{IV}) &= y'P_W\left(y - X(W'X)^{-1}W'y\right) \\
&= y'P_W\left(I - X(W'X)^{-1}W'\right)y \\
&= y'\left(W(W'W)^{-1}W' - W(W'W)^{-1}W'X(W'X)^{-1}W'\right)y \\
&= 0
\end{aligned}$$
The value of the objective function of the IV estimator is zero in the just identified case. This makes
sense, since we've already shown that the objective function after dividing by $\sigma^2$ is asymptotically
$\chi^2$ with degrees of freedom equal to the number of overidentifying restrictions. In the present case,
there are no overidentifying restrictions, so we have a $\chi^2(0)$ rv, which has mean 0 and variance 0,
i.e., it's simply 0. This means we're not able to test the identifying restrictions in the case of exact
identification.
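A quick numerical check of the just-identified case, again as a sketch in Python with NumPy on invented data (two regressors, two instruments): the generalized IV formula collapses to $(W'X)^{-1}W'y$ and the minimized objective function is zero up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
W = np.column_stack([np.ones(n), rng.normal(size=n)])   # two instruments
eps = rng.normal(size=n)
x = W[:, 1] + 0.5 * eps + rng.normal(size=n)            # endogenous regressor
X = np.column_stack([np.ones(n), x])                    # two regressors
y = X @ np.array([1.0, -1.0]) + eps

PW = W @ np.linalg.solve(W.T @ W, W.T)

# Generalized IV formula and the simple IV formula for exact identification
beta_giv = np.linalg.solve(X.T @ PW @ X, X.T @ PW @ y)
beta_simple = np.linalg.solve(W.T @ X, W.T @ y)

# Minimized objective: e'P_W e, which should vanish because W'e = 0
e = y - X @ beta_giv
s_min = e @ PW @ e
```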
10.9 System methods of estimation
2SLS is a single equation method of estimation, as noted above. The advantage of a single equation
method is that it's unaffected by the other equations of the system, so they don't need to be specified
(except for defining what are the exogs, so 2SLS can use the complete set of instruments). The
disadvantage of 2SLS is that it's inefficient, in general.
Recall that overidentification improves efficiency of estimation, since an overidentified equation
can use more instruments than are necessary for consistent estimation.
Secondly, the assumption is that
$$\begin{aligned}
Y\Gamma &= XB + E \\
\mathcal{E}(X'E) &= 0_{(K \times G)} \\
vec(E) &\sim N(0, \Psi)
\end{aligned}$$
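Since there is no autocorrelation of the $E_t$'s, and since the columns of $E$ are individually
homoscedastic, then
$$\Psi = \begin{bmatrix}
\sigma_{11}I_n & \sigma_{12}I_n & \cdots & \sigma_{1G}I_n \\
 & \sigma_{22}I_n & & \vdots \\
 & & \ddots & \vdots \\
\cdot & & & \sigma_{GG}I_n
\end{bmatrix} = \Sigma \otimes I_n$$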
This means that the structural equations are heteroscedastic and correlated with one another.
In general, ignoring this will lead to inefficient estimation, following the section on GLS. When
equations are correlated with one another estimation should account for the correlation in order
to obtain efficiency.
Also, since the equations are correlated, information about one equation is implicitly information
about all equations. Therefore, overidentification restrictions in any equation improve efficiency
for all equations, even the just identified equations.
Single equation methods can't use these types of information, and are therefore inefficient (in
general).
3SLS
Note: It is easier and more practical to treat the 3SLS estimator as a generalized method of moments
estimator (see Chapter 14). I no longer teach the following section, but it is retained for its possible
historical interest. Another alternative is to use FIML (Subsection 10.9), if you are willing to make
distributional assumptions on the errors. This is computationally feasible with modern computers.
Following our above notation, each structural equation can be written as
$$\begin{aligned}
y_i &= Y_i\gamma_i + X_i\beta_i + \varepsilon_i \\
&= Z_i\delta_i + \varepsilon_i
\end{aligned}$$
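Grouping the $G$ equations together we get
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_G \end{bmatrix} =
\begin{bmatrix} Z_1 & 0 & \cdots & 0 \\ 0 & Z_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & Z_G \end{bmatrix}
\begin{bmatrix} \delta_1 \\ \delta_2 \\ \vdots \\ \delta_G \end{bmatrix} +
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_G \end{bmatrix}$$
or
$$y = Z\delta + \varepsilon$$
where we already have that
$$\mathcal{E}(\varepsilon\varepsilon') = \Psi = \Sigma \otimes I_n$$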
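The 3SLS estimator is just 2SLS combined with a GLS correction that takes advantage of the structure
of $\Psi$. Define $\hat{Z}$ as
$$\hat{Z} = \begin{bmatrix}
X(X'X)^{-1}X'Z_1 & 0 & \cdots & 0 \\
0 & X(X'X)^{-1}X'Z_2 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & X(X'X)^{-1}X'Z_G
\end{bmatrix} = \begin{bmatrix}
\hat{Y}_1 \; X_1 & 0 & \cdots & 0 \\
0 & \hat{Y}_2 \; X_2 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & \hat{Y}_G \; X_G
\end{bmatrix}$$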
These instruments are simply the unrestricted rf predictions of the endogs, combined with the
exogs. The distinction is that if the model is overidentified, then
$$\Pi = B\Gamma^{-1}$$
may be subject to some zero restrictions, depending on the restrictions on $\Gamma$ and $B$, and $\hat{\Pi}$ does not
impose these restrictions. Also, note that $\hat{\Pi}$ is calculated using OLS equation by equation, as was
discussed in Section 10.3.
The 2SLS estimator would be
$$\hat{\delta} = (\hat{Z}'Z)^{-1}\hat{Z}'y$$
as can be verified by simple multiplication, and noting that the inverse of a block-diagonal matrix is
just the matrix with the inverses of the blocks on the main diagonal. This IV estimator still ignores the
covariance information. The natural extension is to add the GLS transformation, putting the inverse
of the error covariance into the formula, which gives the 3SLS estimator
$$\begin{aligned}
\hat{\delta}_{3SLS} &= \left(\hat{Z}'(\Sigma \otimes I_n)^{-1}Z\right)^{-1}\hat{Z}'(\Sigma \otimes I_n)^{-1}y \\
&= \left(\hat{Z}'\left(\Sigma^{-1} \otimes I_n\right)Z\right)^{-1}\hat{Z}'\left(\Sigma^{-1} \otimes I_n\right)y
\end{aligned}$$
This estimator requires knowledge of $\Sigma$. The solution is to define a feasible estimator using a consistent
estimator of $\Sigma$. The obvious solution is to use an estimator based on the 2SLS residuals:
$$\hat{\varepsilon}_i = y_i - Z_i\hat{\delta}_{i,2SLS}$$
(IMPORTANT NOTE: this is calculated using $Z_i$, not $\hat{Z}_i$). Then the element $i,j$ of $\Sigma$ is estimated
by
$$\hat{\sigma}_{ij} = \frac{\hat{\varepsilon}_i'\hat{\varepsilon}_j}{n}$$
Substitute $\hat{\Sigma}$ into the formula above to get the feasible 3SLS estimator.
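Analogously to what we did in the case of 2SLS, the asymptotic distribution of the 3SLS estimator
can be shown to be
$$\sqrt{n}\left(\hat{\delta}_{3SLS} - \delta\right) \overset{a}{\sim} N\left(0, \lim_{n \to \infty}\mathcal{E}\left\{\left(\frac{\hat{Z}'(\Sigma \otimes I_n)^{-1}\hat{Z}}{n}\right)^{-1}\right\}\right)$$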
A formula for estimating the variance of the 3SLS estimator in finite samples (cancelling out the powers
of $n$) is
$$\hat{V}\left(\hat{\delta}_{3SLS}\right) = \left(\hat{Z}'\left(\hat{\Sigma}^{-1} \otimes I_n\right)\hat{Z}\right)^{-1}$$
This is analogous to the 2SLS formula in equation (10.4), combined with the GLS correction.
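The feasible 3SLS recipe can be sketched numerically as follows (Python with NumPy; the two-equation system, its parameter values and all names are invented for illustration and are not taken from the notes' programs). The first equation excludes two exogs and is overidentified; the second excludes one and is just identified. The Kronecker product is formed explicitly to mirror the formula, which is wasteful for large $n$ but keeps the correspondence with the algebra clear.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # const, x1, x2, x3

# Hypothetical structure Y Gamma = X B + E:
#   y1 = 0.4*y2 + 1 + x1 + e1,   y2 = 0.5*y1 + 2 + x2 - x3 + e2
Gamma = np.array([[1.0, -0.5], [-0.4, 1.0]])
B = np.array([[1.0, 2.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
E = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
Y = (X @ B + E) @ np.linalg.inv(Gamma)

Z1 = np.column_stack([Y[:, 1], X[:, [0, 1]]])      # regressors of equation 1
Z2 = np.column_stack([Y[:, 0], X[:, [0, 2, 3]]])   # regressors of equation 2
PX = X @ np.linalg.solve(X.T @ X, X.T)

def block_diag(a, b):
    """Stack two matrices on the main diagonal of a zero matrix."""
    out = np.zeros((a.shape[0] + b.shape[0], a.shape[1] + b.shape[1]))
    out[:a.shape[0], :a.shape[1]] = a
    out[a.shape[0]:, a.shape[1]:] = b
    return out

Z = block_diag(Z1, Z2)
Zhat = block_diag(PX @ Z1, PX @ Z2)
y = np.concatenate([Y[:, 0], Y[:, 1]])

# 2SLS on the stacked system, then residuals computed with Z (not Zhat)
d2sls = np.linalg.solve(Zhat.T @ Z, Zhat.T @ y)
resid = (y - Z @ d2sls).reshape(2, n).T            # n x 2, one column per equation
Sigma_hat = resid.T @ resid / n

# Feasible 3SLS and its estimated covariance matrix
Om_inv = np.kron(np.linalg.inv(Sigma_hat), np.eye(n))
d3sls = np.linalg.solve(Zhat.T @ Om_inv @ Z, Zhat.T @ Om_inv @ y)
V3sls = np.linalg.inv(Zhat.T @ Om_inv @ Zhat)
```

The true parameter vector in this made-up design is (0.4, 1, 1, 0.5, 2, 1, -1), stacked equation by equation.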
In the case that all equations are just identified, 3SLS is numerically equivalent to 2SLS. Proving
this is easiest if we use a GMM interpretation of 2SLS and 3SLS. GMM is presented in the next
econometrics course. For now, take it on faith.
FIML
Full information maximum likelihood is an alternative estimation method. FIML will be asymptotically
efficient, since ML estimators based on a given information set are asymptotically efficient w.r.t.
all other estimators that use the same information set, and in the case of the full-information ML
estimator we use the entire information set. The 2SLS and 3SLS estimators don't require distributional
assumptions, while FIML of course does. Our model is, recall
$$\begin{aligned}
Y_t'\Gamma &= X_t'B + E_t' \\
E_t &\sim N(0, \Sigma), \; \forall t \\
\mathcal{E}(E_tE_s') &= 0, \; t \neq s
\end{aligned}$$
The joint normality of $E_t$ means that the density for $E_t$ is the multivariate normal, which is
$$(2\pi)^{-G/2}\left(\det \Sigma^{-1}\right)^{1/2}\exp\left(-\frac{1}{2}E_t'\Sigma^{-1}E_t\right)$$
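The transformation from $E_t$ to $Y_t$ requires the Jacobian
$$\left|\det\frac{dE_t}{dY_t'}\right| = |\det \Gamma|$$
so the density for $Y_t$ is
$$(2\pi)^{-G/2}|\det \Gamma|\left(\det \Sigma^{-1}\right)^{1/2}\exp\left(-\frac{1}{2}\left(Y_t'\Gamma - X_t'B\right)\Sigma^{-1}\left(Y_t'\Gamma - X_t'B\right)'\right)$$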
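Given the assumption of independence over time, the joint log-likelihood function is
$$\ln L(B, \Gamma, \Sigma) = -\frac{nG}{2}\ln(2\pi) + n\ln(|\det \Gamma|) + \frac{n}{2}\ln\det \Sigma^{-1} - \frac{1}{2}\sum_{t=1}^{n}\left(Y_t'\Gamma - X_t'B\right)\Sigma^{-1}\left(Y_t'\Gamma - X_t'B\right)'$$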
This objective function is nonlinear in the parameters. Maximization of this can be done using
iterative numeric methods. We'll see how to do this in the next section.
It turns out that the asymptotic distribution of 3SLS and FIML are the same, assuming normality
of the errors.
One can calculate the FIML estimator by iterating the 3SLS estimator, thus avoiding the use of
a nonlinear optimizer. The steps are
1. Calculate $\hat{\Gamma}_{3SLS}$ and $\hat{B}_{3SLS}$ as normal.
2. Calculate $\hat{\Pi} = \hat{B}_{3SLS}\hat{\Gamma}^{-1}_{3SLS}$. This is new, we didn't estimate $\Pi$ in this way before. This
estimator may have some zeros in it. When Greene says iterated 3SLS doesn't lead to
FIML, he means this for a procedure that doesn't update $\hat{\Pi}$, but only updates $\hat{\Sigma}$ and $\hat{B}$
and $\hat{\Gamma}$. If you update $\hat{\Pi}$ you do converge to FIML.
3. Calculate the instruments $\hat{Y} = X\hat{\Pi}$ and calculate $\hat{\Sigma}$ using $\hat{\Gamma}$ and $\hat{B}$ to get the estimated
errors, applying the usual estimator.
4. Apply 3SLS using these new instruments and the estimate of $\Sigma$.
5. Repeat steps 2-4 until there is no change in the parameters.
FIML is fully efficient, since it's an ML estimator that uses all information. This implies that
3SLS is fully efficient when the errors are normally distributed. Also, if each equation is just
identified and the errors are normal, then 2SLS will be fully efficient, since in this case 2SLS $\equiv$ 3SLS.
When the errors aren't normally distributed, the likelihood function is of course different than
what's written above.
10.10 Example: Klein's Model 1
To give a practical example, consider the following (old-fashioned, but illustrative) macro model (this
is the widely known Klein's Model 1)
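$$\begin{aligned}
\text{Consumption: } C_t &= \alpha_0 + \alpha_1P_t + \alpha_2P_{t-1} + \alpha_3(W^p_t + W^g_t) + \varepsilon_{1t} \\
\text{Investment: } I_t &= \beta_0 + \beta_1P_t + \beta_2P_{t-1} + \beta_3K_{t-1} + \varepsilon_{2t} \\
\text{Private Wages: } W^p_t &= \gamma_0 + \gamma_1X_t + \gamma_2X_{t-1} + \gamma_3A_t + \varepsilon_{3t} \\
\text{Output: } X_t &= C_t + I_t + G_t \\
\text{Profits: } P_t &= X_t - T_t - W^p_t \\
\text{Capital Stock: } K_t &= K_{t-1} + I_t
\end{aligned}$$
$$\begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ \varepsilon_{3t} \end{bmatrix} \sim IID\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ & \sigma_{22} & \sigma_{23} \\ & & \sigma_{33} \end{bmatrix}\right)$$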
The other variables are the government wage bill, $W^g_t$, taxes, $T_t$, government nonwage spending, $G_t$, and
a time trend, $A_t$. The endogenous variables are the lhs variables,
$$Y_t' = \begin{bmatrix} C_t & I_t & W^p_t & X_t & P_t & K_t \end{bmatrix}$$
and the predetermined variables are all others:
$$X_t' = \begin{bmatrix} 1 & W^g_t & G_t & T_t & A_t & P_{t-1} & K_{t-1} & X_{t-1} \end{bmatrix}.$$
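The model assumes that the errors of the equations are contemporaneously correlated, but
nonautocorrelated. The model written as $Y\Gamma = XB + E$ gives
$$\Gamma = \begin{bmatrix}
1 & 0 & 0 & -1 & 0 & 0 \\
0 & 1 & 0 & -1 & 0 & -1 \\
-\alpha_3 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & -\gamma_1 & 1 & -1 & 0 \\
-\alpha_1 & -\beta_1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$
$$B = \begin{bmatrix}
\alpha_0 & \beta_0 & \gamma_0 & 0 & 0 & 0 \\
\alpha_3 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & \gamma_3 & 0 & 0 & 0 \\
\alpha_2 & \beta_2 & 0 & 0 & 0 & 0 \\
0 & \beta_3 & 0 & 0 & 0 & 1 \\
0 & 0 & \gamma_2 & 0 & 0 & 0
\end{bmatrix}$$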
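To check the identification of the consumption equation, we need to extract $\Gamma_{32}$ and $B_{22}$, the
submatrices of coefficients of endogs and exogs that don't appear in this equation. These are the rows that
have zeros in the first column, and we need to drop the first column. We get
$$\begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix} = \begin{bmatrix}
1 & 0 & -1 & 0 & -1 \\
0 & -\gamma_1 & 1 & -1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 \\
0 & \gamma_3 & 0 & 0 & 0 \\
\beta_3 & 0 & 0 & 0 & 1 \\
0 & \gamma_2 & 0 & 0 & 0
\end{bmatrix}$$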
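We need to find a set of 5 rows of this matrix that gives a full-rank $5 \times 5$ matrix. For example, selecting
rows 3, 4, 5, 6, and 7 we obtain the matrix
$$A = \begin{bmatrix}
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 \\
0 & \gamma_3 & 0 & 0 & 0 \\
\beta_3 & 0 & 0 & 0 & 1
\end{bmatrix}$$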
This matrix is of full rank, so the sufficient condition for identification is met. Counting included
endogs, $G^* = 3$, and counting excluded exogs, $K^{**} = 5$, so
$$\begin{aligned}
K^{**} - L &= G^* - 1 \\
5 - L &= 3 - 1 \\
L &= 3
\end{aligned}$$
The equation is over-identified by three restrictions, according to the counting rules, which are
correct when the only identifying information are the exclusion restrictions. However, there is
additional information in this case. Both $W^p_t$ and $W^g_t$ enter the consumption equation, and their
coefficients are restricted to be the same. For this reason the consumption equation is in fact
overidentified by four restrictions.
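The rank condition and the counting rule can be checked mechanically. The sketch below (Python with NumPy) plugs arbitrary nonzero placeholder values into $\gamma_3$ and $\beta_3$; these numbers are not estimates, and any nonzero values would give the same rank.

```python
import numpy as np

# Placeholder nonzero values for the two structural parameters in A
gamma3, beta3 = 0.13, -0.16

A = np.array([
    [0.0,   0.0,    0.0,  0.0, 1.0],
    [0.0,   0.0,    1.0,  0.0, 0.0],
    [0.0,   0.0,    0.0, -1.0, 0.0],
    [0.0,   gamma3, 0.0,  0.0, 0.0],
    [beta3, 0.0,    0.0,  0.0, 1.0],
])
rank = np.linalg.matrix_rank(A)      # full rank means the rank condition holds

# Order condition bookkeeping: K** - L = G* - 1
G_star, K_2star = 3, 5               # included endogs, excluded exogs
L = K_2star - (G_star - 1)           # overidentifying restrictions from exclusions
```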
The Octave program Simeq/Klein2SLS.m performs 2SLS estimation for the 3 equations of Klein's
model 1, assuming nonautocorrelated errors, so that lagged endogenous variables can be used as
instruments. The results are:
CONSUMPTION EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.976711
Sigma-squared 1.044059
estimate st.err. t-stat. p-value
Constant 16.555 1.321 12.534 0.000
Profits 0.017 0.118 0.147 0.885
Lagged Profits 0.216 0.107 2.016 0.060
Wages 0.810 0.040 20.129 0.000
*******************************************************
INVESTMENT EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.884884
Sigma-squared 1.383184
estimate st.err. t-stat. p-value
Constant 20.278 7.543 2.688 0.016
Profits 0.150 0.173 0.867 0.398
Lagged Profits 0.616 0.163 3.784 0.001
Lagged Capital -0.158 0.036 -4.368 0.000
*******************************************************
WAGES EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.987414
Sigma-squared 0.476427
estimate st.err. t-stat. p-value
Constant 1.500 1.148 1.307 0.209
Output 0.439 0.036 12.316 0.000
Lagged Output 0.147 0.039 3.777 0.002
Trend 0.130 0.029 4.475 0.000
*******************************************************
The above results are not valid (specifically, they are inconsistent) if the errors are autocorrelated,
since lagged endogenous variables will not be valid instruments in that case. You might consider
eliminating the lagged endogenous variables as instruments, and re-estimating by 2SLS, to obtain
consistent parameter estimates in this more complex case. Standard errors will still be estimated
inconsistently, unless one uses a Newey-West type covariance estimator. Food for thought...
Chapter 11
Numeric optimization methods
Readings: Hamilton, ch. 5, section 7 (pp. 133-139); Gourieroux and Monfort, Vol. 1, ch. 13, pp.
443-60; Goffe, et al. (1994).
The next chapter introduces extremum estimators, which are minimizers or maximizers of objective
functions. If we're going to be applying extremum estimators, we'll need to know how to find an
extremum. This section gives a very brief introduction to what is a large literature on numeric
optimization methods. We'll consider a few well-known techniques, and one fairly new technique that
may allow one to solve difficult problems. The main objective is to become familiar with the issues,
and to learn how to use the BFGS algorithm at the practical level.
The general problem we consider is how to find the maximizing element $\hat{\theta}$ (a $K$-vector) of a
function $s(\theta)$. This function may not be continuous, and it may not be differentiable. Even if it
is twice continuously differentiable, it may not be globally concave, so local maxima, minima and
saddlepoints may all exist. Supposing $s(\theta)$ were a quadratic function of $\theta$, e.g.,
$$s(\theta) = a + b'\theta + \frac{1}{2}\theta'C\theta,$$
the first order conditions would be linear:
$$D_\theta s(\theta) = b + C\theta$$
so the maximizing (minimizing) element would be $\hat{\theta} = -C^{-1}b$. This is the sort of problem we have with
linear models estimated by OLS. It's also the case for feasible GLS, since conditional on the estimate
of the varcov matrix, we have a quadratic objective function in the remaining parameters.
More general problems will not have linear f.o.c., and we will not be able to solve for the maximizer
analytically. This is when we need a numeric optimization method.
11.1 Search
The idea is to create a grid over the parameter space and evaluate the function at each point on the
grid. Select the best point. Then refine the grid in the neighborhood of the best point, and continue
until the accuracy is good enough. See Figure 11.1. One has to be careful that the grid is fine enough
in relationship to the irregularity of the function to ensure that sharp peaks are not missed entirely.
To check $q$ values in each dimension of a $K$ dimensional parameter space, we need to check $q^K$
points. For example, if $q = 100$ and $K = 10$, there would be $100^{10}$ points to check. If 1000 points
can be checked in a second, it would take $3.171 \times 10^9$ years to perform the calculations, which is
approximately 2/3 the age of the earth. The search method is a very reasonable choice if $K$ is small,
but it quickly becomes infeasible if $K$ is moderate or large.
Figure 11.1: Search method
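A bare-bones version of the search method, and the arithmetic behind the "age of the earth" remark, can be written in a few lines of Python. This is only a sketch: the objective function, the box and the refinement rule (shrink the box to one grid step either side of the best point) are choices made for the illustration.

```python
import itertools

def grid_search(f, bounds, q=11, rounds=6):
    """Maximize f over a box by evaluating a q-point grid per dimension,
    then repeatedly refining the grid around the best point found."""
    best = None
    for _ in range(rounds):
        steps = [(hi - lo) / (q - 1) for lo, hi in bounds]
        axes = [[lo + h * i for i in range(q)] for (lo, _), h in zip(bounds, steps)]
        best = max(itertools.product(*axes), key=f)       # q**K evaluations
        bounds = [(b - h, b + h) for b, h in zip(best, steps)]
    return best

# Hypothetical smooth objective with its maximum at (1, -0.5)
def f(t):
    return -(t[0] - 1.0) ** 2 - (t[1] + 0.5) ** 2

theta_hat = grid_search(f, [(-5.0, 5.0), (-5.0, 5.0)])

# Cost of one pass with q = 100 and K = 10 at 1000 evaluations per second
seconds = 100 ** 10 / 1000
years = seconds / (365 * 24 * 3600)
```

With $K = 2$ each round costs only $11^2 = 121$ evaluations, which is why the method is attractive in low dimensions.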
11.2 Derivative-based methods
Introduction
Derivative-based methods are defined by
1. the method for choosing the initial value, $\theta^1$
2. the iteration method for choosing $\theta^{k+1}$ given $\theta^k$ (based upon derivatives)
3. the stopping criterion.
The iteration method can be broken into two problems: choosing the stepsize $a_k$ (a scalar) and choosing
the direction of movement, $d^k$, which is of the same dimension as $\theta$, so that
$$\theta^{(k+1)} = \theta^{(k)} + a_kd^k.$$
A locally increasing direction of search $d$ is a direction such that
$$\frac{\partial s(\theta + ad)}{\partial a} > 0$$
for $a$ positive but small. That is, if we go in direction $d$, we will improve on the objective function, at
least if we don't go too far in that direction.
As long as the gradient at $\theta$ is not zero there exist increasing directions, and they can all be
represented as $Q^kg(\theta^k)$ where $Q^k$ is a symmetric pd matrix and $g(\theta) = D_\theta s(\theta)$ is the gradient at
$\theta$. To see this, take a T.S. expansion around $a^0 = 0$
$$\begin{aligned}
s(\theta + ad) &= s(\theta + 0d) + (a - 0)g(\theta + 0d)'d + o(1) \\
&= s(\theta) + ag(\theta)'d + o(1)
\end{aligned}$$
For small enough $a$ the $o(1)$ term can be ignored. If $d$ is to be an increasing direction, we need
$g(\theta)'d > 0$. Defining $d = Qg(\theta)$, where $Q$ is positive definite, we guarantee that
$$g(\theta)'d = g(\theta)'Qg(\theta) > 0$$
unless $g(\theta) = 0$. Every increasing direction can be represented in this way (p.d. matrices are
those such that the angle between $g$ and $Qg(\theta)$ is less than 90 degrees). See Figure 11.2.
With this, the iteration rule becomes
$$\theta^{(k+1)} = \theta^{(k)} + a_kQ^kg(\theta^k)$$
and we keep going until the gradient becomes zero, so that there is no increasing direction. The
problem is how to choose $a$ and $Q$.
Conditional on $Q$, choosing $a$ is fairly straightforward. A simple line search is an attractive
possibility, since $a$ is a scalar.
The remaining problem is how to choose $Q$.
Figure 11.2: Increasing directions of search
Note also that this gives no guarantees to find a global maximum.
Steepest descent
Steepest descent (ascent if we're maximizing) just sets $Q$ to an identity matrix, since the gradient
provides the direction of maximum rate of change of the objective function.
Advantages: fast - doesn't require anything more than first derivatives.
Disadvantages: This doesn't always work too well however: see the Rosenbrock, or "banana"
function: http://en.wikipedia.org/wiki/Rosenbrock_function.
Newton's method
Newton's method uses information about the slope and curvature of the objective function to determine
which direction and how far to move from an initial point. Supposing we're trying to maximize $s_n(\theta)$.
Take a second order Taylor's series approximation of $s_n(\theta)$ about $\theta^k$ (an initial guess).
$$s_n(\theta) \approx s_n(\theta^k) + g(\theta^k)'\left(\theta - \theta^k\right) + 1/2\left(\theta - \theta^k\right)'H(\theta^k)\left(\theta - \theta^k\right)$$
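To attempt to maximize $s_n(\theta)$, we can maximize the portion of the right-hand side that depends on
$\theta$, i.e., we can maximize
$$\tilde{s}(\theta) = g(\theta^k)'\theta + 1/2\left(\theta - \theta^k\right)'H(\theta^k)\left(\theta - \theta^k\right)$$
with respect to $\theta$. This is a much easier problem, since it is a quadratic function in $\theta$, so it has linear
first order conditions. These are
$$D_\theta\tilde{s}(\theta) = g(\theta^k) + H(\theta^k)\left(\hat{\theta} - \theta^k\right)$$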
So the solution for the next round estimate is
$$\theta^{k+1} = \theta^k - H(\theta^k)^{-1}g(\theta^k)$$
See http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization for more information.
This is illustrated in Figure 11.3.
However, it's good to include a stepsize, since the approximation to $s_n(\theta)$ may be bad far away
from the maximizer $\hat{\theta}$, so the actual iteration formula is
$$\theta^{k+1} = \theta^k - a_kH(\theta^k)^{-1}g(\theta^k)$$
A potential problem is that the Hessian may not be negative definite when we're far from the
maximizing point. So $-H(\theta^k)^{-1}$ may not be positive definite, and $-H(\theta^k)^{-1}g(\theta^k)$ may not
define an increasing direction of search. This can happen when the objective function has flat
regions, in which case the Hessian matrix is very ill-conditioned (e.g., is nearly singular), or
when we're in the vicinity of a local minimum, $H(\theta^k)$ is positive definite, and our direction
is a decreasing direction of search. Matrix inverses by computers are subject to large errors
when the matrix is ill-conditioned. Also, we certainly don't want to go in the direction of a
minimum when we're maximizing. To solve this problem, Quasi-Newton methods simply add
a positive definite component to $H(\theta)$ to ensure that the resulting matrix is positive definite,
e.g., $Q = -H(\theta) + bI$, where $b$ is chosen large enough so that $Q$ is well-conditioned and positive
definite. This has the benefit that improvement in the objective function is guaranteed. See
http://en.wikipedia.org/wiki/Quasi-Newton_method.
Figure 11.3: Newton iteration
Another variation of quasi-Newton methods is to approximate the Hessian by using successive
gradient evaluations. This avoids actual calculation of the Hessian, which is an order of
magnitude (in the dimension of the parameter vector) more costly than calculation of the gradient.
They can be done to ensure that the approximation is p.d. DFP and BFGS are two well-known
examples.
show bfgsmin_example.m to optimize Rosenbrock function
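As a companion to the discussion above, here is a minimal sketch in pure Python of Newton iteration with a stepsize and with the $Q = -H(\theta) + bI$ safeguard, applied to the negative of the Rosenbrock function (so that we are maximizing, with the maximum at (1, 1)). It is not a transcription of bfgsmin_example.m, whose contents are not shown in these notes. The eigenvalue floor of 1e-3 used to pick $b$, the step-halving rule and the starting point are all assumptions made for the example.

```python
import math

def s(t):
    x, y = t
    return -((1 - x) ** 2 + 100 * (y - x * x) ** 2)

def grad(t):
    x, y = t
    return (2 * (1 - x) + 400 * x * (y - x * x), -200 * (y - x * x))

def hess(t):
    x, y = t
    return ((-2 + 400 * y - 1200 * x * x, 400 * x), (400 * x, -200.0))

def newton_max(theta, tol=1e-8, max_iter=200):
    k = 0
    for k in range(max_iter):
        g = grad(theta)
        if max(abs(gj) for gj in g) < tol:
            break
        (h11, h12), (_, h22) = hess(theta)
        # Smallest eigenvalue of -H (2x2 symmetric case, closed form)
        tr, det = -(h11 + h22), h11 * h22 - h12 * h12
        lam_min = (tr - math.sqrt(max(tr * tr - 4 * det, 0.0))) / 2
        b = max(0.0, 1e-3 - lam_min)        # b = 0 when -H is already p.d.
        q11, q12, q22 = -h11 + b, -h12, -h22 + b
        dq = q11 * q22 - q12 * q12
        # Direction d = Q^{-1} g, which is increasing because Q is p.d.
        d = ((q22 * g[0] - q12 * g[1]) / dq, (-q12 * g[0] + q11 * g[1]) / dq)
        # Stepsize by simple halving until the objective does not get worse
        a = 1.0
        new = (theta[0] + d[0], theta[1] + d[1])
        while s(new) < s(theta) and a > 1e-12:
            a /= 2
            new = (theta[0] + a * d[0], theta[1] + a * d[1])
        theta = new
    return theta, k

theta_hat, iterations = newton_max((-1.2, 1.0))
```

When $-H$ is already positive definite, $b = 0$ and the step is the plain Newton step with a stepsize; the safeguard only matters away from the maximizer.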
Stopping criteria
The last thing we need is to decide when to stop. A digital computer is subject to limited machine
precision and round-off errors. For these reasons, it is unreasonable to hope that a program can
exactly find the point that maximizes a function. We need to define acceptable tolerances. Some
stopping criteria are:
Negligible change in parameters:
$$|\theta^k_j - \theta^{k-1}_j| < \varepsilon_1, \; \forall j$$
Negligible relative change:
$$\left|\frac{\theta^k_j - \theta^{k-1}_j}{\theta^{k-1}_j}\right| < \varepsilon_2, \; \forall j$$
Negligible change of function:
$$|s(\theta^k) - s(\theta^{k-1})| < \varepsilon_3$$
Gradient negligibly different from zero:
$$|g_j(\theta^k)| < \varepsilon_4, \; \forall j$$
Or, even better, check all of these.
Also, if we're maximizing, it's good to check that the last round (real, not approximate) Hessian
is negative definite.
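The four criteria translate directly into a small helper. The following pure Python sketch checks all of them at once; the default tolerances are arbitrary illustrative values, and the relative-change test skips components whose previous value is exactly zero in order to avoid dividing by zero (an assumption of this sketch, not something the notes specify).

```python
def converged(theta, theta_old, s_new, s_old, g,
              e1=1e-8, e2=1e-8, e3=1e-10, e4=1e-6):
    """Return True only if all four stopping criteria are satisfied."""
    pairs = list(zip(theta, theta_old))
    small_change = all(abs(a - b) < e1 for a, b in pairs)
    # |(a - b) / b| < e2, rewritten without division; zero b is skipped
    small_rel_change = all(abs(a - b) < e2 * abs(b) for a, b in pairs if b != 0)
    small_func_change = abs(s_new - s_old) < e3
    small_gradient = all(abs(gj) < e4 for gj in g)
    return (small_change and small_rel_change
            and small_func_change and small_gradient)
```

A typical use is to call `converged` at the end of each iteration of the optimizer and stop when it returns `True`.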
Starting values
The Newton-Raphson and related algorithms work well if the objective function is concave (when
maximizing), but not so well if there are convex regions and local minima or multiple local maxima.
The algorithm may converge to a local minimum or to a local maximum that is not optimal. The
algorithm may also have difficulties converging at all.
The usual way to "ensure" that a global maximum has been found is to use many different
starting values, and choose the solution that returns the highest objective function value. THIS
IS IMPORTANT in practice. More on this later.
Calculating derivatives
The Newton-Raphson algorithm requires first and second derivatives. It is often difficult to calculate
derivatives (especially the Hessian) analytically if the function $s_n(\cdot)$ is complicated. Possible solutions
are to calculate derivatives numerically, or to use programs such as MuPAD or Mathematica to
calculate analytic derivatives. For example, Figure 11.4 shows Sage [1] calculating a couple of derivatives.
[1] Sage is free software that has both symbolic and numeric computational capabilities. See http://www.sagemath.org/
The KAIST Sage cell server will let you try Sage online, its address is http://aleph.sagemath.org/.
Numeric derivatives are less accurate than analytic derivatives, and are usually more costly to
evaluate. Both factors usually cause optimization programs to be less successful when numeric
derivatives are used.
One advantage of numeric derivatives is that you don't have to worry about having made an
error in calculating the analytic derivative. When programming analytic derivatives it's a good
idea to check that they are correct by using numeric derivatives. This is a lesson I learned the
hard way when writing my thesis.
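Numeric second derivatives are much more accurate if the data are scaled so that the elements of
the gradient are of the same order of magnitude. Example: if the model is $y_t = h(\alpha x_t + \beta z_t) + \varepsilon_t$,
and estimation is by NLS, suppose that $D_\alpha s_n(\cdot) = 1000$ and $D_\beta s_n(\cdot) = 0.001$. One could define
$\alpha^* = 1000\alpha$; $x^*_t = x_t/1000$; $\beta^* = \beta/1000$; $z^*_t = 1000z_t$, so that $\alpha^*x^*_t = \alpha x_t$ and $\beta^*z^*_t = \beta z_t$.
By the chain rule, $D_{\alpha^*}s_n(\cdot) = D_\alpha s_n(\cdot)/1000$ and $D_{\beta^*}s_n(\cdot) = 1000D_\beta s_n(\cdot)$, so in this case
the gradients $D_{\alpha^*}s_n(\cdot)$ and $D_{\beta^*}s_n(\cdot)$ will both be 1.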
In general, estimation programs always work better if data is scaled in this way, since roundoff
errors are less likely to become important. This is important in practice.
There are algorithms (such as BFGS and DFP) that use the sequential gradient evaluations to
build up an approximation to the Hessian. The iterations are faster because the actual Hessian
isn't calculated, but more iterations usually are required for convergence. Versions of BFGS are
probably the most widely used optimizers in econometrics.
Switching between algorithms during iterations is sometimes useful.
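The advice to check analytic derivatives against numeric ones is easy to follow in practice. Here is a minimal pure Python sketch using central differences; the test function, the evaluation point and the step size rule are arbitrary choices for the illustration.

```python
import math

def numgrad(f, theta, h=1e-6):
    """Central-difference approximation to the gradient of f at theta."""
    g = []
    for j in range(len(theta)):
        step = h * max(1.0, abs(theta[j]))   # scale the step to the parameter
        up, dn = list(theta), list(theta)
        up[j] += step
        dn[j] -= step
        g.append((f(up) - f(dn)) / (2 * step))
    return g

# Hypothetical objective and its hand-derived gradient
def s(t):
    return -math.exp(t[0]) + t[0] * t[1] - t[1] ** 2

def analytic_grad(t):
    return [-math.exp(t[0]) + t[1], t[0] - 2 * t[1]]

theta = [0.3, -0.7]
max_gap = max(abs(a - b) for a, b in zip(analytic_grad(theta), numgrad(s, theta)))
```

A large value of `max_gap` is a strong hint that the analytic derivative was programmed incorrectly.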
Figure 11.4: Using Sage to get analytic derivatives
11.3 Simulated Annealing
Simulated annealing is an algorithm which can find an optimum in the presence of nonconcavities,
discontinuities and multiple local minima/maxima. Basically, the algorithm randomly selects
evaluation points, accepts all points that yield an increase in the objective function, but also accepts some
points that decrease the objective function. This allows the algorithm to escape from local minima.
As more and more points are tried, periodically the algorithm focuses on the best point so far, and
reduces the range over which random points are generated. Also, the probability that a negative move
is accepted reduces. The algorithm relies on many evaluations, as in the search method, but focuses
in on promising areas, which reduces function evaluations with respect to the search method. It does
not require derivatives to be evaluated. I have a program to do this if you're interested.
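The description above can be turned into a very small pure Python sketch. This is not the author's program and it is much cruder than the Goffe et al. algorithm: the cooling schedule, the shrink factor, the number of trials and the one-dimensional test function with many local maxima are all assumptions chosen for the illustration.

```python
import math
import random

def anneal(s, x0, width=5.0, temp=10.0, n_rounds=60, n_trials=200, seed=0):
    """Maximize s by random search with occasional acceptance of worse points."""
    rng = random.Random(seed)
    x, fx = list(x0), s(x0)
    best, fbest = x, fx
    for _ in range(n_rounds):
        for _ in range(n_trials):
            cand = [xi + rng.uniform(-width, width) for xi in x]
            fc = s(cand)
            # Always accept improvements; accept a worse point with a
            # probability that falls as the temperature falls
            if fc >= fx or rng.random() < math.exp((fc - fx) / temp):
                x, fx = cand, fc
                if fx > fbest:
                    best, fbest = x, fx
        x, fx = best, fbest     # refocus on the best point so far
        width *= 0.85           # narrow the range of the random points
        temp *= 0.85            # make negative moves less likely
    return best, fbest

# Hypothetical objective: global maximum at 0, local maxima near each integer
def s(t):
    return -t[0] ** 2 + 10 * math.cos(2 * math.pi * t[0])

best, fbest = anneal(s, [4.0])
```

A derivative-based method started at 4.0 would typically stop at the local maximum nearby, which is the situation simulated annealing is meant to handle.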
11.4 A practical example: Maximum likelihood estimation using count data: The MEPS data and the Poisson model
To show optimization methods in practice, using real economic data, this section presents maximum
likelihood estimation results for a particular model using real data. The focus at present is simply on
numeric optimization. Later, after studying maximum likelihood estimation, this section can be read
again.
Demand for health care is usually thought of as a derived demand: health care is an input to a home
production function that produces health, and health is an argument of the utility function. Grossman
(1972), for example, models health as a capital stock that is subject to depreciation (e.g., the effects
of ageing). Health care visits restore the stock. Under the home production framework, individuals
decide when to make health care visits to maintain their health stock, or to deal with negative shocks
to the stock in the form of accidents or illnesses. As such, individual demand will be a function of the
parameters of the individuals' utility functions.
The MEPS health data file, meps1996.data, contains 4564 observations on six measures of health
care usage. The data is from the 1996 Medical Expenditure Panel Survey (MEPS). You can get
more information at http://www.meps.ahrq.gov/. The six measures of use are office-based
visits (OBDV), outpatient visits (OPV), inpatient visits (IPV), emergency room visits (ERV),
dental visits (VDV), and number of prescription drugs taken (PRESCR). These form columns 1 - 6 of
meps1996.data. The conditioning variables are public insurance (PUBLIC), private insurance (PRIV),
sex (SEX), age (AGE), years of education (EDUC), and income (INCOME). These form columns 7 -
12 of the file, in the order given here. PRIV and PUBLIC are 0/1 binary variables, where a 1 indicates
that the person has access to public or private insurance coverage. SEX is also 0/1, where 1 indicates
that the person is female. This data will be used in examples fairly extensively in what follows.
The program ExploreMEPS.m shows how the data may be read in, and gives some descriptive
information about variables, which follows:
All of the measures of use are count data, which means that they take on the values 0, 1, 2, .... It
might be reasonable to try to use this information by specifying the density as a count data density.
One of the simplest count data densities is the Poisson density, which is
$$f_Y(y) = \frac{\exp(-\lambda)\lambda^y}{y!}.$$
For this density, $E(Y) = V(Y) = \lambda$. The Poisson average log-likelihood function is
$$s_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(-\lambda_i + y_i\ln\lambda_i - \ln y_i!\right)$$
We will parameterize the model as
$$\begin{aligned}
\lambda_i &= \exp(x_i'\beta) \\
x_i &= [1 \;\; PUBLIC \;\; PRIV \;\; SEX \;\; AGE \;\; EDUC \;\; INC]' \qquad (11.1)
\end{aligned}$$
This ensures that the mean is positive, as is required for the Poisson model, and now the mean (and
the variance) depend upon explanatory variables. Note that for this parameterization
$$\frac{\partial\lambda/\partial x_j}{\lambda} = \beta_j$$
so
$$\beta_jx_j = \frac{\partial\lambda}{\partial x_j}\frac{x_j}{\lambda} = \eta^\lambda_{x_j},$$
the elasticity of the conditional mean of $y$ with respect to the $j$th conditioning variable.
The program EstimatePoisson.m estimates a Poisson model using the full data set. The results of
the estimation, using OBDV as the dependent variable are here:
MPITB extensions found
OBDV
******************************************************
Poisson model, MEPS 1996 full data set
MLE Estimation Results
BFGS convergence: Normal convergence
Average Log-L: -3.671090
Observations: 4564
estimate st. err t-stat p-value
constant -0.791 0.149 -5.290 0.000
pub. ins. 0.848 0.076 11.093 0.000
priv. ins. 0.294 0.071 4.137 0.000
sex 0.487 0.055 8.797 0.000
age 0.024 0.002 11.471 0.000
edu 0.029 0.010 3.061 0.002
inc -0.000 0.000 -0.978 0.328
Information Criteria
CAIC : 33575.6881 Avg. CAIC: 7.3566
BIC : 33568.6881 Avg. BIC: 7.3551
AIC : 33523.7064 Avg. AIC: 7.3452
******************************************************
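To make the average log-likelihood concrete, here is a minimal sketch in Python (used here in place of the Octave program EstimatePoisson.m, whose source is not reproduced in this section). It assumes an intercept-only model, $\lambda_i = \exp(\beta)$, and made-up count data; the function names are hypothetical. In that special case the score is $\bar{y} - \exp(\beta)$, so the MLE of $\lambda$ should equal the sample mean.

```python
import math

def avg_loglik(beta, y):
    """Average Poisson log-likelihood with lambda_i = exp(beta) for every i."""
    lam = math.exp(beta)
    # math.lgamma(yi + 1) equals ln(yi!)
    return sum(-lam + yi * math.log(lam) - math.lgamma(yi + 1) for yi in y) / len(y)

def fit_intercept(y, beta=0.0, tol=1e-12, max_iter=100):
    """Newton's method applied to the score of the intercept-only model."""
    ybar = sum(y) / len(y)
    for _ in range(max_iter):
        lam = math.exp(beta)
        score = ybar - lam      # first derivative of s_n with respect to beta
        hessian = -lam          # second derivative of s_n with respect to beta
        step = score / hessian
        beta -= step
        if abs(step) < tol:
            break
    return beta

y = [0, 2, 1, 5, 3, 0, 7, 2]    # made-up counts, for illustration only
beta_hat = fit_intercept(y)
lambda_hat = math.exp(beta_hat)
print(lambda_hat, sum(y) / len(y))
```

With regressors, the same Newton (or BFGS) logic applies, with the scalar score and Hessian replaced by their vector and matrix counterparts.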
11.5 Numeric optimization: pitfalls
In this section we'll examine two common problems that can be encountered when doing numeric
optimization of nonlinear models, and some solutions.
Poor scaling of the data
When the data is scaled so that the magnitudes of the first and second derivatives are of different
orders, problems can easily result. If we uncomment the appropriate line in EstimatePoisson.m, the
data will not be scaled, and the estimation program will have difficulty converging (it seems to take
an infinite amount of time). With unscaled data, the elements of the score vector have very different
magnitudes at the initial value of $\theta$ (all zeros). To see this, run CheckScore.m. With unscaled data,
one element of the gradient is very large, and the maximum and minimum elements are 5 orders of
magnitude apart. This causes convergence problems due to serious numerical inaccuracy when doing
inversions to calculate the BFGS direction of search. With scaled data, none of the elements of the
gradient are very large, and the maximum difference in orders of magnitude is 3. Convergence is quick.
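The effect of scaling on the score can be sketched as follows. This is not CheckScore.m; it is a small Python illustration on simulated data, in which the income range, the rescaling factor of 10000 and all function names are assumptions. At $\beta = 0$ every $\lambda_i$ equals 1, so the score of the average Poisson log-likelihood is the average of $(y_i - 1)x_i$.

```python
import math
import random

def poisson_draw(rng, lam):
    """Knuth's multiplication method for a Poisson draw; adequate for small lam."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def score_at_zero(X, y):
    """Score of the average Poisson log-likelihood at beta = 0 (lambda_i = 1)."""
    n = len(y)
    return [sum((y[i] - 1.0) * X[i][j] for i in range(n)) / n
            for j in range(len(X[0]))]

def spread(g):
    """Ratio of the largest to the smallest absolute element of the score."""
    mags = [abs(v) for v in g]
    return max(mags) / min(mags)

rng = random.Random(0)
n = 500
income = [rng.uniform(10000.0, 50000.0) for _ in range(n)]  # hypothetical raw units
y = [poisson_draw(rng, 3.0) for _ in range(n)]

X_raw = [[1.0, inc] for inc in income]               # constant and unscaled income
X_scaled = [[1.0, inc / 10000.0] for inc in income]  # income rescaled to order 1

spread_raw = spread(score_at_zero(X_raw, y))
spread_scaled = spread(score_at_zero(X_scaled, y))
print(spread_raw, spread_scaled)
```

Rescaling a regressor by a constant rescales the corresponding score element by the same constant, which is why dividing income by a large number brings the gradient elements to comparable magnitudes.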
Figure 11.5: Mountains with low fog
Multiple optima
Multiple optima (one global, others local) can complicate life, since we have limited means of
determining if there is a higher maximum than the one we're at. Think of climbing a mountain in an
unknown range, in a very foggy place. A nice picture is Figure 11.5, but try to imagine the scene if
the clouds were 2000m thicker. A representation is Figure 11.6. You can go up until there's nowhere
else to go up, but since you're in the fog you don't know if the true summit is across the gap that's at
your feet. Do you claim victory and go home, or do you trudge down the gap and explore the other
side?
Figure 11.6: A foggy mountain
The best way to avoid stopping at a local maximum is to use many starting values, for example
on a grid, or randomly generated. Or perhaps one might have priors about possible values for the
parameters (e.g., from previous studies of similar data).
Let's try to find the true minimizer of minus 1 times the foggy mountain function (since the
algorithms are set up to minimize). From the picture, you can see it's close to (0, 0), but let's pretend
there is fog, and that we don't know that. The program FoggyMountain.m shows that poor start
values can lead to problems. It uses SA (simulated annealing), which finds the true global minimum,
and it shows that BFGS using a battery of random start values can also find the global minimum.
The output of one run is here:
MPITB extensions found
======================================================
BFGSMIN final results
Used numeric gradient
------------------------------------------------------
STRONG CONVERGENCE
Function conv 1 Param conv 1 Gradient conv 1
------------------------------------------------------
Objective function value -0.0130329
Stepsize 0.102833
43 iterations
------------------------------------------------------
param gradient change
15.9999 -0.0000 0.0000
-28.8119 0.0000 0.0000
The result with poor start values
ans =
16.000 -28.812
================================================
SAMIN final results
NORMAL CONVERGENCE
Func. tol. 1.000000e-10 Param. tol. 1.000000e-03
Obj. fn. value -0.100023
parameter search width
0.037419 0.000018
-0.000000 0.000051
================================================
Now try a battery of random start values and
a short BFGS on each, then iterate to convergence
The result using 20 randoms start values
ans =
3.7417e-02 2.7628e-07
The true maximizer is near (0.037,0)
In that run, the single BFGS run with bad start values converged to a point far from the true minimizer,
while simulated annealing and BFGS using a battery of random start values both found the true
minimizer. Using a battery of random start values, we managed to find the global optimum. The moral
of the story is to be cautious and not publish your results too quickly.
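The multi-start strategy can be sketched in a few lines of Python. This is not FoggyMountain.m: the one-dimensional objective below is a made-up function with many local minima and a global minimum at 0, the local optimizer is plain gradient descent rather than BFGS, and the start values lie on a grid (one of the options mentioned above). All names are hypothetical.

```python
import math

def f(x):
    """Made-up objective: global minimum at x = 0, local minima elsewhere."""
    return 0.05 * x * x - math.cos(3.0 * x)

def grad(x):
    """Analytic derivative of f."""
    return 0.1 * x + 3.0 * math.sin(3.0 * x)

def local_min(x, step=0.05, iters=2000):
    """Fixed-step gradient descent; it only finds the minimum of the start's basin."""
    for _ in range(iters):
        x -= step * grad(x)
    return x

# A single run from a poor start value stops at a local minimum.
x_single = local_min(5.0)

# A grid of start values, keeping the candidate with the lowest objective value.
starts = [-9.5 + i for i in range(20)]
candidates = [local_min(s) for s in starts]
x_best = min(candidates, key=f)

print(x_single, f(x_single))
print(x_best, f(x_best))
```

The step size is chosen below the reciprocal of the largest second derivative of f, so each run descends monotonically and cannot jump out of the basin it starts in; that is exactly why a single start is not enough.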
11.6 Exercises
1. In octave, type "help bfgsmin_example" to find out the location of the file. Edit the file to
examine it and learn how to call bfgsmin. Run it, and examine the output.
2. In octave, type "help samin_example" to find out the location of the file. Edit the file to
examine it and learn how to call samin. Run it, and examine the output.
3. Numerically minimize the function $\sin(x) + 0.01(x - a)^2$, setting $a = 0$, using the software of
your choice. Plot the function over the interval $(-2\pi, 2\pi)$. Does the software find the global
minimum? Does this depend on the starting value you use? Outline a strategy that would allow
you to find the minimum reliably, when $a$ can take on any given value in the interval $(-\pi, \pi)$.
4. Numerically compute the OLS estimator of the Nerlove model by using an iterative minimization
algorithm to minimize the sum of squared residuals. Verify that the results coincide with those
given in subsection 3.8. The important part of this problem is to learn how to minimize a
function that depends on both parameters and data. Try to write your function so that it is
easy to use it with an arbitrary data set.
Chapter 12
Asymptotic properties of
extremum estimators
Readings: Hayashi (2000), Ch. 7; Gourieroux and Monfort (1995), Vol. 2, Ch. 24; Amemiya, Ch.
4 section 4.1; Davidson and MacKinnon, pp. 591-96; Gallant, Ch. 3; Newey and McFadden (1994),
"Large Sample Estimation and Hypothesis Testing," in Handbook of Econometrics, Vol. 4, Ch. 36.
12.1 Extremum estimators
We'll begin with study of extremum estimators in general. Let $Z_n = \{z_1, z_2, \ldots, z_n\}$ be the available
data, arranged in an $n \times p$ matrix, based on a sample of size $n$ (there are $p$ variables). Our paradigm is
that data are generated as a draw from the joint density $f_{Z_n}(z)$. This density may not be known, but
it exists in principle. The draw from the density may be thought of as the outcome of a random
experiment that is characterized by the probability space $\{\Omega, \mathcal{F}, P\}$. When the experiment is performed,
$\omega \in \Omega$ is the result, and $Z_n(\omega) = \{Z_1(\omega), Z_2(\omega), \ldots, Z_n(\omega)\} = \{z_1, z_2, \ldots, z_n\}$ is the realized data. The
probability space is rich enough to allow us to consider events defined in terms of an infinite sequence
of data $Z = \{z_1, z_2, \ldots\}$.
Definition 25. [Extremum estimator] An extremum estimator $\hat{\theta}$ is the optimizing element of an
objective function $s_n(Z_n, \theta)$ over a set $\Theta$.
Because the data $Z_n(\omega)$ depends on $\omega$, we can emphasize this by writing $s_n(\omega, \theta)$. I'll be loose with
notation and interchange when convenient.
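Example 26. OLS. Let the d.g.p. be $y_t = x_t'\theta_0 + \varepsilon_t$, $t = 1, 2, \ldots, n$, $\theta_0 \in \Theta$. Stacking observations
vertically, $y_n = X_n\theta_0 + \varepsilon_n$, where $X_n = \left( x_1 \; x_2 \; \cdots \; x_n \right)'$. Let $Z_n = [y_n \; X_n]$. The least squares
estimator is defined as
$$ \hat{\theta} \equiv \arg\min_{\Theta} s_n(Z_n, \theta) $$
where
$$ s_n(Z_n, \theta) = \frac{1}{n}\sum_{t=1}^{n}\left(y_t - x_t'\theta\right)^2 $$
As you already know, $\hat{\theta} = (X'X)^{-1}X'y$.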
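The point that the OLS estimator is the minimizer of $s_n$ can be checked numerically. The sketch below, in Python with made-up data and a single regressor, minimizes $s_n(\theta)$ with a golden-section search (a stand-in for whatever iterative minimizer one prefers) and compares the result with the closed form, which in the scalar case is $\sum x_t y_t / \sum x_t^2$. The function names and the search interval are assumptions.

```python
import math

def s_n(theta, x, y):
    """Average squared residual for the one-regressor model y_t = x_t * theta + e_t."""
    return sum((yt - xt * theta) ** 2 for xt, yt in zip(x, y)) / len(y)

def golden_section_min(f, lo, hi, tol=1e-10):
    """Golden-section search for the minimizer of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b = d
        else:
            a = c
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2.0

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up regressor values
y = [1.1, 1.9, 3.2, 3.9, 5.1]   # made-up responses

theta_numeric = golden_section_min(lambda th: s_n(th, x, y), -10.0, 10.0)
theta_closed = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
print(theta_numeric, theta_closed)
```

The two numbers agree up to the accuracy of the numeric search, which illustrates that computing an extremum estimator treats the data as fixed and optimizes a nonstochastic function.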
Example 27. Maximum likelihood. Suppose that the continuous random variables $Y_t \sim IIN(\mu_0, \sigma_0^2)$,
$t = 1, 2, \ldots, n$. If $\varepsilon$ is a standard normal random variable, its density is
$$ f_\varepsilon(z) = (2\pi)^{-1/2}\exp\left(-\frac{z^2}{2}\right). $$
We have that $\varepsilon_t = (Y_t - \mu_0)/\sigma_0$ is standard normal, and the Jacobian $|\partial\varepsilon_t/\partial y_t| = 1/\sigma_0$. Thus, doing a
change of variable, the density of a single observation on $Y$ is
$$ f_Y(y_t; \mu, \sigma) = (2\pi)^{-1/2}(1/\sigma)\exp\left(-\frac{1}{2}\left(\frac{y_t - \mu}{\sigma}\right)^2\right). $$
The maximum likelihood estimator maximizes the joint density of the sample. Because the data are
i.i.d., the joint density of the sample $\{y_1, y_2, \ldots, y_n\}$ is the product of the densities of each observation,
and the ML estimator is
$$ \hat{\theta} \equiv \arg\max_{\Theta} L_n(\theta) = \prod_{t=1}^{n}(2\pi)^{-1/2}(1/\sigma)\exp\left(-\frac{(y_t - \mu)^2}{2\sigma^2}\right) $$
Because the natural logarithm is strictly increasing on $(0, \infty)$, maximization of the average logarithmic
likelihood function is achieved at the same $\hat{\theta}$ as for the likelihood function. So, the ML estimator is
$\hat{\theta} \equiv \arg\max_{\Theta} s_n(\theta)$ where
$$ s_n(\theta) = (1/n)\ln L_n(\theta) = -\ln\sqrt{2\pi} - \ln\sigma - \frac{1}{n}\sum_{t=1}^{n}\frac{(y_t - \mu)^2}{2\sigma^2} $$
Solution of the f.o.c. leads to the familiar result that $\hat{\mu} = \bar{y}$. We'll come back to this in more detail
later.
Example 28. Bayesian estimator
Bayesian point estimators such as the posterior mode, median or mean can be expressed as
extremum estimators. For example, the posterior mean $E(\theta|Z_n)$ is the minimizer (with respect to $\zeta$) of
the function
$$ s_n(\zeta) = \int_{\Theta}(\theta - \zeta)^2 f(Z_n; \theta)\pi(\theta)/f(Z_n)\,d\theta $$
where $f(Z_n; \theta)$ is the likelihood function, $\pi(\theta)$ is a prior density, and $f(Z_n)$ is the marginal likelihood
of the data. These concepts are explained later; for now the point is that Bayesian estimators can be
thought of as extremum estimators, and the theory for extremum estimators will apply.
Note that the objective function $s_n(Z_n, \theta)$ is a random function, because it depends on $Z_n(\omega) =
\{Z_1(\omega), Z_2(\omega), \ldots, Z_n(\omega)\} = \{z_1, z_2, \ldots, z_n\}$. We need to consider what happens as different outcomes
$\omega \in \Omega$ occur. These different outcomes lead to different data being generated, and the different
data causes the objective function to change. Note, however, that for a fixed $\omega \in \Omega$, the data
$Z_n(\omega) = \{Z_1(\omega), Z_2(\omega), \ldots, Z_n(\omega)\} = \{z_1, z_2, \ldots, z_n\}$ are a fixed realization, and the objective function
$s_n(Z_n, \theta)$ becomes a non-random function of $\theta$. When actually computing an extremum estimator,
we treat the data as fixed, and employ algorithms for optimization of nonstochastic functions. When
analyzing the properties of an extremum estimator, we need to investigate what happens throughout
$\Omega$: we do not focus only on the $\omega$ that generated the observed data. This is because we would like to
find estimators that work well on average for any data set that can result from $\omega \in \Omega$.
We'll often write the objective function suppressing the dependence on $Z_n$, as $s_n(\omega, \theta)$ or simply
$s_n(\theta)$, depending on context. The first of these emphasizes the fact that the objective function is
random, and the second is more compact. However, the data is still in there, and because the data is
randomly sampled, the objective function is random, too.
12.2 Existence
If $s_n(\theta)$ is continuous in $\theta$ and $\Theta$ is compact, then a maximizer exists, by the Weierstrass maximum
theorem (Debreu, 1959). In some cases of interest, $s_n(\theta)$ may not be continuous. Nevertheless, it
may still converge to a continuous function, in which case existence will not be a problem, at least
asymptotically. Henceforth in this course, we assume that $s_n(\theta)$ is continuous.
12.3 Consistency
The following theorem is patterned on a proof in Gallant (1987) (the article, ref. later), which we'll see
in its original form later in the course. It is interesting to compare the following proof with Amemiya's
Theorem 4.1.1, which is done in terms of convergence in probability.
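Theorem 29. [Consistency of e.e.] Suppose that $\hat{\theta}_n$ is obtained by maximizing $s_n(\theta)$ over $\bar{\Theta}$.
Assume
(a) Compactness: The parameter space $\Theta$ is an open bounded subset of Euclidean space $\mathbb{R}^K$. So the
closure of $\Theta$, $\bar{\Theta}$, is compact.
(b) Uniform Convergence: There is a nonstochastic function $s_\infty(\theta)$ that is continuous in $\theta$ on $\bar{\Theta}$
such that
$$ \lim_{n\to\infty}\sup_{\theta\in\bar{\Theta}}\left|s_n(\omega, \theta) - s_\infty(\theta)\right| = 0, \text{ a.s.} $$
(c) Identification: $s_\infty(\cdot)$ has a unique global maximum at $\theta_0 \in \Theta$, i.e., $s_\infty(\theta_0) > s_\infty(\theta)$, $\forall\theta \neq \theta_0$,
$\theta \in \bar{\Theta}$.
Then $\hat{\theta}_n \stackrel{a.s.}{\to} \theta_0$.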
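Proof: Select a $\omega \in \Omega$ and hold it fixed. Then $\{s_n(\omega, \theta)\}$ is a fixed sequence of functions. Suppose
that $\omega$ is such that $s_n(\omega, \theta)$ converges to $s_\infty(\theta)$. This happens with probability one by assumption (b).
The sequence $\{\hat{\theta}_n\}$ lies in the compact set $\bar{\Theta}$, by assumption (a) and the fact that maximization is over
$\bar{\Theta}$. Since every sequence from a compact set has at least one limit point (Bolzano-Weierstrass), say
that $\hat{\theta}$ is a limit point of $\{\hat{\theta}_n\}$. There is a subsequence $\{\hat{\theta}_{n_m}\}$ ($\{n_m\}$ is simply a sequence of increasing
integers) with $\lim_{m\to\infty}\hat{\theta}_{n_m} = \hat{\theta}$. By uniform convergence and continuity,
$$ \lim_{m\to\infty} s_{n_m}(\hat{\theta}_{n_m}) = s_\infty(\hat{\theta}). $$
To see this, first of all, select an element $\hat{\theta}_t$ from the sequence $\{\hat{\theta}_{n_m}\}$. Then uniform convergence implies
$$ \lim_{m\to\infty} s_{n_m}(\hat{\theta}_t) = s_\infty(\hat{\theta}_t) $$
Continuity of $s_\infty(\cdot)$ implies that
$$ \lim_{t\to\infty} s_\infty(\hat{\theta}_t) = s_\infty(\hat{\theta}) $$
since the limit as $t \to \infty$ of $\{\hat{\theta}_t\}$ is $\hat{\theta}$. So the above claim is true.
Next, by maximization
$$ s_{n_m}(\hat{\theta}_{n_m}) \geq s_{n_m}(\theta_0) $$
which holds in the limit, so
$$ \lim_{m\to\infty} s_{n_m}(\hat{\theta}_{n_m}) \geq \lim_{m\to\infty} s_{n_m}(\theta_0). $$
However,
$$ \lim_{m\to\infty} s_{n_m}(\hat{\theta}_{n_m}) = s_\infty(\hat{\theta}), $$
as seen above, and
$$ \lim_{m\to\infty} s_{n_m}(\theta_0) = s_\infty(\theta_0) $$
by uniform convergence, so
$$ s_\infty(\hat{\theta}) \geq s_\infty(\theta_0). $$
But by assumption (c), there is a unique global maximum of $s_\infty(\theta)$ at $\theta_0$, so we must have $s_\infty(\hat{\theta}) =
s_\infty(\theta_0)$, and $\hat{\theta} = \theta_0$ in the limit. Finally, all of the above limits hold almost surely, since so far we
have held $\omega$ fixed, but now we need to consider all $\omega \in \Omega$. Therefore $\{\hat{\theta}_n\}$ has only one limit point,
$\theta_0$, except on a set $C \subset \Omega$ with $P(C) = 0$.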
Discussion of the proof:
• This proof relies on the identification assumption of a unique global maximum at $\theta_0$. An
equivalent way to state this is
(c) Identification: Any point $\theta$ in $\bar{\Theta}$ with $s_\infty(\theta) \geq s_\infty(\theta_0)$ must be such that $\|\theta - \theta_0\| = 0$, which
matches the way we will write the assumption in the section on nonparametric inference.
• We assume that $\hat{\theta}_n$ is in fact a global maximum of $s_n(\theta)$. It is not required to be unique for $n$
finite, though the identification assumption requires that the limiting objective function have a
unique maximizing argument. The previous section on numeric optimization methods showed
that actually finding the global maximum of $s_n(\theta)$ may be a non-trivial problem.
• See Amemiya's Example 4.1.4 for a case where discontinuity leads to breakdown of consistency.
• The assumption that $\theta_0$ is in the interior of $\Theta$ (part of the identification assumption) has not been
used to prove consistency, so we could directly assume that $\theta_0$ is simply an element of a compact
set $\Theta$. The reason that we assume it's in the interior here is that this is necessary for subsequent
proof of asymptotic normality, and I'd like to maintain a minimal set of simple assumptions, for
clarity. Parameters on the boundary of the parameter set cause theoretical difficulties that we
will not deal with in this course. Just note that conventional hypothesis testing methods do not
apply in this case.
• Note that $s_n(\theta)$ is not required to be continuous, though $s_\infty(\theta)$ is.
The following figures illustrate why uniform convergence is important. In the second figure, the
function is not converging quickly enough around the lower of the two maxima. If the pointwise
convergence in this region is slow enough, there is no guarantee that the maximizer will be in the
neighborhood of the global maximizer of $s_\infty(\theta)$, even when $n$ is very large. Uniform convergence
means that we are in the situation of the top graphic. As long as $n$ is large enough, the maximum
will be in the neighborhood of the global maximum of $s_\infty(\theta)$.
[Figure, top panel: With uniform convergence, the maximum of the sample objective function eventually
must be in the neighborhood of the maximum of the limiting objective function]
[Figure, bottom panel: With pointwise convergence, the sample objective function may have its maximum
far away from that of the limiting objective function]
Sufficient conditions for assumption (b)
We need a uniform strong law of large numbers in order to verify assumption (b) of Theorem 29. To
verify the uniform convergence assumption, it is often feasible to employ the following set of stronger
assumptions:
• the parameter space is compact, which is given by assumption (a)
• the objective function $s_n(\theta)$ is continuous and bounded with probability one on the entire
parameter space
• a standard SLLN can be shown to apply to some point $\theta$ in the parameter space. That is, we
can show that $s_n(\theta) \stackrel{a.s.}{\to} s_\infty(\theta)$ for some $\theta$. Note that in most cases, the objective function will be
an average of terms, such as
$$ s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n} s_t(\theta) $$
As long as the $s_t(\theta)$ are not too strongly dependent, and have finite variances, we can usually
find a SLLN that will apply.
With these assumptions, it can be shown that pointwise convergence holds throughout the parameter
space, so we obtain the needed uniform convergence.
These are reasonable conditions in many cases, and henceforth when dealing with specific estimators
we'll simply assume that pointwise almost sure convergence can be extended to uniform almost sure
convergence in this way.
More on the limiting objective function
The limiting objective function in assumption (b) is $s_\infty(\theta)$. What is the nature of this function and
where does it come from?
• Remember our paradigm - data is presumed to be generated as a draw from $f_{Z_n}(z)$, and the
objective function is $s_n(Z_n, \theta)$.
• Usually, $s_n(Z_n, \theta)$ is an average of terms.
• The limiting objective function is found by applying a strong (weak) law of large numbers to
$s_n(Z_n, \theta)$.
• A strong (weak) LLN says that an average of terms converges almost surely (in probability) to
the limit of the expectation of the average.
Supposing one holds,
$$ s_\infty(\theta) = \lim_{n\to\infty}\mathcal{E}s_n(Z_n, \theta) = \lim_{n\to\infty}\int_{\mathcal{Z}_n} s_n(z, \theta)f_{Z_n}(z)\,dz $$
Now suppose that the density $f_{Z_n}(z)$ that characterizes the DGP is parametric: $f_{Z_n}(z; \rho)$, $\rho \in \varrho$, and
the data is generated by $\rho_0 \in \varrho$. Now we have two parameters to worry about, $\theta$ and $\rho$. We are
probably interested in learning about the true DGP, which means that $\rho_0$ is the item of interest.
When the DGP is parametric, the limiting objective function is
$$ s_\infty(\theta) = \lim_{n\to\infty}\mathcal{E}s_n(Z_n, \theta) = \lim_{n\to\infty}\int_{\mathcal{Z}_n} s_n(z, \theta)f_{Z_n}(z; \rho_0)\,dz $$
and we can write the limiting objective function as $s_\infty(\theta, \rho_0)$ to emphasize the dependence on the
parameter of the DGP. From the theorem, we know that $\hat{\theta}_n \stackrel{a.s.}{\to} \theta_0$. What is the relationship between
$\theta_0$ and $\rho_0$?
• $\rho$ and $\theta$ may have different dimensions. Often, the statistical model (with parameter $\theta$) only
partially describes the DGP. For example, the case of OLS with errors of unknown distribution.
In some cases, the dimension of $\theta$ may be greater than that of $\rho$. For example, fitting a polynomial
to an unknown nonlinear function.
• If knowledge of $\theta_0$ is sufficient for knowledge of $\rho_0$, we have a correctly and fully specified model.
$\theta_0$ is referred to as the true parameter value.
• If knowledge of $\theta_0$ is sufficient for knowledge of some but not all elements of $\rho_0$, we have a correctly
specified semiparametric model. $\theta_0$ is referred to as the true parameter value, understanding that
not all parameters of the DGP are estimated.
• If knowledge of $\theta_0$ is not sufficient for knowledge of any elements of $\rho_0$, or if it causes us to draw
false conclusions regarding at least some of the elements of $\rho_0$, our model is misspecified. $\theta_0$ is
referred to as the pseudo-true parameter value.
Summary
The theorem for consistency is really quite intuitive. It says that with probability one, an extremum
estimator converges to the value that maximizes the limit of the expectation of the objective function.
Because the objective function may or may not make sense, depending on how good or poor is the
model, we may or may not be estimating parameters of the DGP.
12.4 Example: Consistency of Least Squares
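We suppose that data is generated by random sampling of $(Y, X)$, where $y_t = \beta_0 x_t + \varepsilon_t$. $(X, \varepsilon)$ has the
common distribution function $F_Z = \mu_x\mu_\varepsilon$ ($x$ and $\varepsilon$ are independent) with support $\mathcal{Z} = \mathcal{X} \times \mathcal{E}$. Suppose
that the variances $\sigma_X^2$ and $\sigma_\varepsilon^2$ are finite. The sample objective function for a sample size $n$ is
$$ s_n(\beta) = \frac{1}{n}\sum_{t=1}^{n}(y_t - \beta x_t)^2 = \frac{1}{n}\sum_{t=1}^{n}(\beta_0 x_t + \varepsilon_t - \beta x_t)^2 $$
$$ = \frac{1}{n}\sum_{t=1}^{n}\left(x_t(\beta_0 - \beta)\right)^2 + \frac{2}{n}\sum_{t=1}^{n}x_t(\beta_0 - \beta)\varepsilon_t + \frac{1}{n}\sum_{t=1}^{n}\varepsilon_t^2 $$
• Considering the last term, by the SLLN,
$$ \frac{1}{n}\sum_{t=1}^{n}\varepsilon_t^2 \stackrel{a.s.}{\to} \int_{\mathcal{X}}\int_{\mathcal{E}}\varepsilon^2\,d\mu_X\,d\mu_{\mathcal{E}} = \sigma_\varepsilon^2. $$
• Considering the second term, since $E(\varepsilon) = 0$ and $X$ and $\varepsilon$ are independent, the SLLN implies
that it converges to zero.
• Finally, for the first term, for a given $\beta$, we assume that a SLLN applies so that
$$ \frac{1}{n}\sum_{t=1}^{n}\left(x_t(\beta_0 - \beta)\right)^2 \stackrel{a.s.}{\to} \int_{\mathcal{X}}\left(x(\beta_0 - \beta)\right)^2 d\mu_X \qquad (12.1) $$
$$ = (\beta_0 - \beta)^2\int_{\mathcal{X}}x^2\,d\mu_X = (\beta_0 - \beta)^2 E\left(X^2\right) $$
Finally, the objective function is clearly continuous, and the parameter space is assumed to be compact,
so the convergence is also uniform. Thus,
$$ s_\infty(\beta) = (\beta_0 - \beta)^2 E\left(X^2\right) + \sigma_\varepsilon^2 $$
A minimizer of this is clearly $\beta = \beta_0$.
Exercise 30. Show that in order for the above solution to be unique it is necessary that $E(X^2) \neq 0$.
Interpret this condition.
This example shows that Theorem 29 can be used to prove strong consistency of the OLS estimator.
There are easier ways to show this, of course - this is only an example of application of the theorem.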
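A small simulation makes the result tangible. The sketch below is in Python rather than Octave, and the specific choices ($\beta_0 = 2$, $X \sim N(1,1)$, $\varepsilon \sim N(0,1)$, the seed and the sample sizes) are assumptions made only for illustration. In this one-regressor model the minimizer of $s_n(\beta)$ is $\sum x_t y_t / \sum x_t^2$.

```python
import random

def ols_slope(x, y):
    """Minimizer of s_n(beta) for the no-intercept model y_t = beta * x_t + e_t."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def simulate(n, beta0, rng):
    """Draw n observations with X ~ N(1, 1) and an independent N(0, 1) error."""
    x = [rng.gauss(1.0, 1.0) for _ in range(n)]
    y = [beta0 * xi + rng.gauss(0.0, 1.0) for xi in x]
    return x, y

rng = random.Random(12345)
beta0 = 2.0
estimates = {}
for n in (100, 1000, 20000):
    x, y = simulate(n, beta0, rng)
    estimates[n] = ols_slope(x, y)
    print(n, estimates[n])
```

For large $n$ the estimate should sit close to $\beta_0$; any single small-sample estimate may of course be off by more.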
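12.5 Example: Inconsistency of Misspecified Least Squares
You already know that the OLS estimator is inconsistent when relevant variables are omitted. Let's
verify this result in the context of extremum estimators. We suppose that data is generated by random
sampling of $(Y, X)$, where $y_t = \beta_0 x_t + \varepsilon_t$. $(X, \varepsilon)$ has the common distribution function $F_Z = \mu_x\mu_\varepsilon$ ($x$
and $\varepsilon$ are independent) with support $\mathcal{Z} = \mathcal{X} \times \mathcal{E}$. Suppose that the variances $\sigma_X^2$ and $\sigma_\varepsilon^2$ are finite.
However, the econometrician is unaware of the true DGP, and instead proposes the misspecified model
$y_t = \gamma_0 w_t + \eta_t$. Suppose that $E(W\varepsilon) = 0$ but that $E(WX) \neq 0$.
The sample objective function for a sample size $n$ is
$$ s_n(\gamma) = \frac{1}{n}\sum_{t=1}^{n}(y_t - \gamma w_t)^2 = \frac{1}{n}\sum_{t=1}^{n}(\beta_0 x_t + \varepsilon_t - \gamma w_t)^2 $$
$$ = \frac{1}{n}\sum_{t=1}^{n}(\beta_0 x_t)^2 + \frac{1}{n}\sum_{t=1}^{n}(\gamma w_t)^2 + \frac{1}{n}\sum_{t=1}^{n}\varepsilon_t^2 + \frac{2}{n}\sum_{t=1}^{n}\beta_0 x_t\varepsilon_t - \frac{2}{n}\sum_{t=1}^{n}\beta_0\gamma x_t w_t - \frac{2}{n}\sum_{t=1}^{n}\gamma\varepsilon_t w_t $$
Using arguments similar to above,
$$ s_\infty(\gamma) = \gamma^2 E\left(W^2\right) - 2\beta_0\gamma E(WX) + C $$
where $C$ collects the terms that do not depend on $\gamma$. So, $\gamma_0 = \frac{\beta_0 E(WX)}{E(W^2)}$, which is the true
parameter of the DGP, multiplied by the pseudo-true value of a regression of $X$ on $W$. The OLS
estimator is not consistent for the true parameter, $\beta_0$.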
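The pseudo-true value can also be seen in a simulation. In the Python sketch below, the design ($\beta_0 = 2$, $W \sim N(0,1)$, $X = 0.5W + U$ with $U \sim N(0,1)$, standard normal errors, the seed and sample size) is an assumption chosen so that $E(WX) = 0.5$ and $E(W^2) = 1$, giving $\gamma_0 = \beta_0 E(WX)/E(W^2) = 1$, which differs from $\beta_0$.

```python
import random

rng = random.Random(2024)
n = 50000
beta0 = 2.0

# Hypothetical design: W is correlated with the true regressor X.
w = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [0.5 * wi + rng.gauss(0.0, 1.0) for wi in w]
y = [beta0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# OLS in the misspecified model y_t = gamma * w_t + eta_t.
gamma_hat = sum(wi * yi for wi, yi in zip(w, y)) / sum(wi * wi for wi in w)

# Sample analogue of beta0 * E(WX) / E(W^2).
e_wx = sum(wi * xi for wi, xi in zip(w, x)) / n
e_ww = sum(wi * wi for wi in w) / n
gamma0_sample = beta0 * e_wx / e_ww

print(gamma_hat, gamma0_sample, beta0)
```

The estimate settles near the pseudo-true value 1 rather than near $\beta_0 = 2$, in line with the limiting objective function derived above.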
12.6 Example: Linearization of a nonlinear model
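Ref. Gourieroux and Monfort, section 8.3.4. White, Intn'l Econ. Rev. 1980 is an earlier reference.
Suppose we have a nonlinear model
$$ y_i = h(x_i, \theta_0) + \varepsilon_i $$
where
$$ \varepsilon_i \sim iid(0, \sigma^2) $$
The nonlinear least squares estimator solves
$$ \hat{\theta}_n = \arg\min\frac{1}{n}\sum_{i=1}^{n}\left(y_i - h(x_i, \theta)\right)^2 $$
We'll study this more later, but for now it is clear that the foc for minimization will require solving
a set of nonlinear equations. A common approach to the problem seeks to avoid this difficulty by
linearizing the model. A first order Taylor's series expansion about the point $x_0$ with remainder gives
$$ y_i = h(x_0, \theta_0) + (x_i - x_0)'\frac{\partial h(x_0, \theta_0)}{\partial x} + \nu_i $$
where $\nu_i$ encompasses both $\varepsilon_i$ and the Taylor's series remainder. Note that $\nu_i$ is no longer a classical
error - its mean is not zero. We should expect problems.
Define
$$ \alpha^* = h(x_0, \theta_0) - x_0'\frac{\partial h(x_0, \theta_0)}{\partial x} $$
$$ \beta^* = \frac{\partial h(x_0, \theta_0)}{\partial x} $$
Given this, one might try to estimate $\alpha^*$ and $\beta^*$ by applying OLS to
$$ y_i = \alpha + \beta x_i + \nu_i $$
• Question, will $\hat{\alpha}$ and $\hat{\beta}$ be consistent for $\alpha^*$ and $\beta^*$?
• The answer is no, as one can see by interpreting $\hat{\alpha}$ and $\hat{\beta}$ as extremum estimators. Let $\gamma = (\alpha, \beta')'$.
$$ \hat{\gamma} = \arg\min s_n(\gamma) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \alpha - \beta x_i)^2 $$
The objective function converges to its expectation
$$ s_n(\gamma) \stackrel{u.a.s.}{\to} s_\infty(\gamma) = \mathcal{E}_X\mathcal{E}_{Y|X}(y - \alpha - \beta x)^2 $$
and $\hat{\gamma}$ converges a.s. to the $\gamma^0$ that minimizes $s_\infty(\gamma)$:
$$ \gamma^0 = \arg\min\mathcal{E}_X\mathcal{E}_{Y|X}(y - \alpha - \beta x)^2 $$
Noting that
$$ \mathcal{E}_X\mathcal{E}_{Y|X}\left(y - \alpha - x'\beta\right)^2 = \mathcal{E}_X\mathcal{E}_{Y|X}\left(h(x, \theta_0) + \varepsilon - \alpha - \beta x\right)^2 = \sigma^2 + \mathcal{E}_X\left(h(x, \theta_0) - \alpha - \beta x\right)^2 $$
since cross products involving $\varepsilon$ drop out. $\alpha^0$ and $\beta^0$ correspond to the hyperplane that is closest to
the true regression function $h(x, \theta_0)$ according to the mean squared error criterion. This depends on
both the shape of $h(\cdot)$ and the density function of the conditioning variables.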
[Figure: Inconsistency of the linear approximation, even at the approximation point. The plot shows
the function h(x, θ), the tangent line at x_0, and the fitted line through the data points.]
• It is clear that the tangent line does not minimize MSE, since, for example, if $h(x, \theta_0)$ is concave,
all errors between the tangent line and the true function are negative.
• Note that the true underlying parameter $\theta_0$ is not estimated consistently, either (it may be of a
different dimension than the dimension of the parameter of the approximating model, which is
2 in this example).
• Second order and higher-order approximations suffer from exactly the same problem, though
to a less severe degree, of course. For this reason, translog, Generalized Leontief and other
"flexible functional forms" based upon second-order approximations in general suffer from bias
and inconsistency. The bias may not be too important for analysis of conditional means, but it
can be very important for analyzing first and second derivatives. In production and consumer
analysis, first and second derivatives (e.g., elasticities of substitution) are often of interest, so
in this case, one should be cautious of unthinking application of models that impose strong
restrictions on second derivatives.
• This sort of linearization about a long run equilibrium is a common practice in dynamic
macroeconomic models. It is justified for the purposes of theoretical analysis of a model given the
model's parameters, but it is not justifiable for the estimation of the parameters of the model
using data. The section on simulation-based methods offers a means of obtaining consistent
estimators of the parameters of dynamic macro models that are too complex for standard methods
of analysis.
12.7 Asymptotic Normality
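A consistent estimator is oftentimes not very useful unless we know how fast it is likely to be converging
to the true value, and the probability that it is far away from the true value. Establishment of
asymptotic normality with a known scaling factor solves these two problems. The following theorem
is similar to Amemiya's Theorem 4.1.3 (pg. 111).
Theorem 31. [Asymptotic normality of e.e.] In addition to the assumptions of Theorem 29, assume
(a) $\mathcal{J}_n(\theta) \equiv D_\theta^2 s_n(\theta)$ exists and is continuous in an open, convex neighborhood of $\theta_0$.
(b) $\{\mathcal{J}_n(\theta_n)\} \stackrel{a.s.}{\to} \mathcal{J}_\infty(\theta_0)$, a finite negative definite matrix, for any sequence $\{\theta_n\}$ that converges
almost surely to $\theta_0$.
(c) $\sqrt{n}D_\theta s_n(\theta_0) \stackrel{d}{\to} N\left[0, \mathcal{I}_\infty(\theta_0)\right]$, where $\mathcal{I}_\infty(\theta_0) = \lim_{n\to\infty}Var\sqrt{n}D_\theta s_n(\theta_0)$
Then
$$ \sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\to} N\left[0, \mathcal{J}_\infty(\theta_0)^{-1}\mathcal{I}_\infty(\theta_0)\mathcal{J}_\infty(\theta_0)^{-1}\right] $$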
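Proof: By Taylor expansion:
$$ D_\theta s_n(\hat{\theta}_n) = D_\theta s_n(\theta_0) + D_\theta^2 s_n(\theta^*)\left(\hat{\theta} - \theta_0\right) $$
where $\theta^* = \lambda\hat{\theta} + (1 - \lambda)\theta_0$, $0 \leq \lambda \leq 1$.
• Note that $\hat{\theta}$ will be in the neighborhood where $D_\theta^2 s_n(\theta)$ exists with probability one as $n$ becomes
large, by consistency.
• Now the l.h.s. of this equation is zero, at least asymptotically, since $\hat{\theta}_n$ is a maximizer and the
f.o.c. must hold exactly since the limiting objective function is strictly concave in a neighborhood
of $\theta_0$.
• Also, since $\theta^*$ is between $\hat{\theta}_n$ and $\theta_0$, and since $\hat{\theta}_n \stackrel{a.s.}{\to} \theta_0$, assumption (b) gives
$$ D_\theta^2 s_n(\theta^*) \stackrel{a.s.}{\to} \mathcal{J}_\infty(\theta_0) $$
So
$$ 0 = D_\theta s_n(\theta_0) + \left[\mathcal{J}_\infty(\theta_0) + o_s(1)\right]\left(\hat{\theta} - \theta_0\right) $$
And
$$ 0 = \sqrt{n}D_\theta s_n(\theta_0) + \left[\mathcal{J}_\infty(\theta_0) + o_s(1)\right]\sqrt{n}\left(\hat{\theta} - \theta_0\right) $$
Now $\sqrt{n}D_\theta s_n(\theta_0) \stackrel{d}{\to} N\left[0, \mathcal{I}_\infty(\theta_0)\right]$ by assumption (c), so
$$ -\left[\mathcal{J}_\infty(\theta_0) + o_s(1)\right]\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\to} N\left[0, \mathcal{I}_\infty(\theta_0)\right] $$
Also, $\left[\mathcal{J}_\infty(\theta_0) + o_s(1)\right] \stackrel{a.s.}{\to} \mathcal{J}_\infty(\theta_0)$, so
$$ \sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\to} N\left[0, \mathcal{J}_\infty(\theta_0)^{-1}\mathcal{I}_\infty(\theta_0)\mathcal{J}_\infty(\theta_0)^{-1}\right] $$
by the Slutsky Theorem (see Gallant, Theorem 4.6).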
• Skip this in lecture. A note on the order of these matrices: Supposing that $s_n(\theta)$ is
representable as an average of $n$ terms, which is the case for all estimators we consider, $D_\theta^2 s_n(\theta)$ is
also an average of $n$ matrices, the elements of which are not centered (they do not have zero
expectation). Supposing a SLLN applies, the almost sure limit of $D_\theta^2 s_n(\theta_0)$, $\mathcal{J}_\infty(\theta_0) = O(1)$, as
we saw in Example 87. On the other hand, assumption (c): $\sqrt{n}D_\theta s_n(\theta_0) \stackrel{d}{\to} N\left[0, \mathcal{I}_\infty(\theta_0)\right]$ means
that
$$ \sqrt{n}D_\theta s_n(\theta_0) = O_p(1) $$
where we use the result of Example 85. If we were to omit the $\sqrt{n}$, we'd have
$$ D_\theta s_n(\theta_0) = n^{-\frac{1}{2}}O_p(1) = O_p\left(n^{-\frac{1}{2}}\right) $$
where we use the fact that $O_p(n^r)O_p(n^q) = O_p(n^{r+q})$. The sequence $D_\theta s_n(\theta_0)$ is centered, so we
need to scale by $\sqrt{n}$ to avoid convergence to zero.
12.8 Example: Classical linear model
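Let's use the results to get the asymptotic distribution of the OLS estimator applied to the classical
model, to verify that we obtain the results seen before. The OLS criterion is
$$ s_n(\beta) = \frac{1}{n}(y - X\beta)'(y - X\beta) $$
$$ = \frac{1}{n}\left(X\beta_0 + \varepsilon - X\beta\right)'\left(X\beta_0 + \varepsilon - X\beta\right) $$
$$ = \frac{1}{n}\left[(\beta - \beta_0)'X'X(\beta - \beta_0) - 2\varepsilon'X(\beta - \beta_0) + \varepsilon'\varepsilon\right] $$
The first derivative is
$$ D_\beta s_n(\beta) = \frac{1}{n}\left[2X'X(\beta - \beta_0) - 2X'\varepsilon\right] $$
so, evaluating at $\beta_0$,
$$ D_\beta s_n(\beta_0) = -2\frac{X'\varepsilon}{n} $$
This has expectation 0, so the variance is the expectation of the outer product:
$$ Var\sqrt{n}D_\beta s_n(\beta_0) = E\left[\left(-\sqrt{n}\,2\frac{X'\varepsilon}{n}\right)\left(-\sqrt{n}\,2\frac{X'\varepsilon}{n}\right)'\right] = E\,4\frac{X'\varepsilon\varepsilon'X}{n} = 4\sigma_\varepsilon^2 E\left(\frac{X'X}{n}\right) $$
(assuming regressors independent of errors). Therefore
$$ \mathcal{I}_\infty(\beta_0) = \lim_{n\to\infty}Var\sqrt{n}D_\beta s_n(\beta_0) = 4\sigma_\varepsilon^2 Q_X $$
where $Q_X = \lim E\left(\frac{X'X}{n}\right)$, a finite p.d. matrix, is obtained using a LLN.
The second derivative is
$$ \mathcal{J}_n(\beta) = D_\beta^2 s_n(\beta) = \frac{1}{n}\left[2X'X\right]. $$
A SLLN tells us that this converges almost surely to the limit of its expectation:
$$ \mathcal{J}_\infty(\beta_0) = 2Q_X $$
There's no parameter in that last expression, so uniformity is not an issue.
The asymptotic normality theorem (31) tells us that
$$ \sqrt{n}\left(\hat{\beta} - \beta_0\right) \stackrel{d}{\to} N\left[0, \mathcal{J}_\infty(\beta_0)^{-1}\mathcal{I}_\infty(\beta_0)\mathcal{J}_\infty(\beta_0)^{-1}\right] $$
which is, given the above,
$$ \sqrt{n}\left(\hat{\beta} - \beta_0\right) \stackrel{d}{\to} N\left[0, \left(\frac{Q_X^{-1}}{2}\right)\left(4\sigma_\varepsilon^2 Q_X\right)\left(\frac{Q_X^{-1}}{2}\right)\right] $$
or
$$ \sqrt{n}\left(\hat{\beta} - \beta_0\right) \stackrel{d}{\to} N\left[0, Q_X^{-1}\sigma_\varepsilon^2\right]. $$
This is the same thing we saw in equation 4.1, of course. So, the theory seems to work :-)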
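The algebra of the sandwich form can be checked numerically for a particular case. The Python sketch below uses a hypothetical $2 \times 2$ matrix $Q_X$ and a hypothetical error variance, builds the Hessian limit $2Q_X$ and the score variance $4\sigma_\varepsilon^2 Q_X$, and confirms that the sandwich collapses to $\sigma_\varepsilon^2 Q_X^{-1}$. The helper functions are written out only to keep the example self-contained.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def scale(m, s):
    """Multiply every element of a matrix by the scalar s."""
    return [[s * v for v in row] for row in m]

sigma2 = 1.5                       # hypothetical error variance
Q = [[2.0, 0.5], [0.5, 1.0]]       # hypothetical limit of E(X'X/n)

hessian_limit = scale(Q, 2.0)              # limiting second derivative, 2 Q_X
score_variance = scale(Q, 4.0 * sigma2)    # limiting score variance, 4 sigma^2 Q_X

h_inv = inv2(hessian_limit)
sandwich = matmul2(matmul2(h_inv, score_variance), h_inv)
direct = scale(inv2(Q), sigma2)            # sigma^2 Q_X^{-1}

print(sandwich)
print(direct)
```

The factors of 2 and 4 cancel exactly, which is the content of the last simplification in the derivation above.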
12.9 Exercises
1. Suppose that $x_i \sim$ uniform(0,1), and $y_i = 1 - x_i^2 + \varepsilon_i$, where $\varepsilon_i$ is iid$(0, \sigma^2)$. Suppose we estimate
the misspecified model $y_i = \alpha + \beta x_i + \eta_i$ by OLS. Find the numeric values of $\alpha^0$ and $\beta^0$ that are
the probability limits of $\hat{\alpha}$ and $\hat{\beta}$.
2. Verify your results using Octave by generating data that follows the above model, and calculating
the OLS estimator. When the sample size is very large the estimator should be very close to the
analytical results you obtained in question 1.
3. Use the asymptotic normality theorem to find the asymptotic distribution of the ML estimator
of $\beta_0$ for the model $y = x\beta_0 + \varepsilon$, where $\varepsilon \sim N(0, 1)$ and is independent of $x$. This means finding
$\frac{\partial^2}{\partial\beta\partial\beta'}s_n(\beta)$, $\mathcal{J}(\beta_0)$, $\frac{\partial s_n(\beta)}{\partial\beta}$, and $\mathcal{I}(\beta_0)$. The expressions may involve the unspecified density of $x$.
Chapter 13
Maximum likelihood estimation
The maximum likelihood estimator is important because it uses all of the information in a fully
specified statistical model. Its use of all of the information causes it to have a number of attractive
properties, foremost of which is asymptotic efficiency. For this reason, the ML estimator can serve as
a benchmark against which other estimators may be measured. The ML estimator requires that the
statistical model be fully specified, which essentially means that there is enough information to draw
data from the DGP, given the parameter. This is a fairly strong requirement, and for this reason we
need to be concerned about the possible misspecification of the statistical model. If the model is
misspecified, the ML estimator will not have the nice properties that it has under correct specification.
13.1 The likelihood function
Suppose we have a sample of size n of the random vectors y and z. Suppose the joint density of
Y =
_
y
1
. . . y
n
_
and Z =
_
z
1
. . . z
n
_
is characterized by a parameter vector
0
:
f
Y Z
(Y, Z,
0
).
This is the joint density of the sample. This density can be factored as
f
Y Z
(Y, Z,
0
) = f
Y [Z
(Y [Z,
0
)f
Z
(Z,
0
)
The likelihood function is just this density evaluated at other values
L(Y, Z, ) = f(Y, Z, ), ,
where is a parameter space.
The maximum likelihood estimator of
0
is the value of that maximizes the likelihood function.
Note that if
0
and
0
share no elements, then the maximizer of the conditional likelihood func-
tion f
Y [Z
(Y [Z, ) with respect to is the same as the maximizer of the overall likelihood function
f
Y Z
(Y, Z, ) = f
Y [Z
(Y [Z, )f
Z
(Z, ), for the elements of that correspond to . In this case, the
variables Z are said to be exogenous for estimation of , and we may more conveniently work with the
conditional likelihood function f
Y [Z
(Y [Z, ) for the purposes of estimating
0
.
When this is the case, the maximum likelihood estimator of
0
= arg max f
Y [Z
(Y [Z, ). Well
suppose this framework in what follows.
If the $n$ observations are independent, the likelihood function can be written as
\[ L(Y|Z, \theta) = \prod_{t=1}^{n} f(y_t|z_t, \theta). \]
If this is not possible, we can always factor the likelihood into contributions of observations, by
using the fact that a joint density can be factored into the product of a marginal and conditional
(doing this iteratively):
\[ L(Y, \theta) = f(y_1|z_1, \theta) f(y_2|y_1, z_2, \theta) f(y_3|y_1, y_2, z_3, \theta) \cdots f(y_n|y_1, y_2, \ldots, y_{n-1}, z_n, \theta). \]
To simplify notation, define
\[ x_t = \{ y_1, y_2, \ldots, y_{t-1}, z_t \} \]
so $x_1 = z_1$, $x_2 = \{y_1, z_2\}$, etc. - it contains exogenous and predetermined endogenous variables. Now
the likelihood function can be written as
\[ L(Y, \theta) = \prod_{t=1}^{n} f(y_t|x_t, \theta). \]
The criterion function can be defined as the average log-likelihood function:
\[ s_n(\theta) = \frac{1}{n} \ln L(Y, \theta) = \frac{1}{n} \sum_{t=1}^{n} \ln f(y_t|x_t, \theta). \]
The maximum likelihood estimator may thus be defined equivalently as
\[ \hat{\theta} = \arg\max s_n(\theta), \]
where the set maximized over is defined below. Since $\ln(\cdot)$ is a monotonic increasing function, $\ln L$
and $L$ maximize at the same value of $\theta$. Dividing by $n$ has no effect on $\hat{\theta}$.
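As a rough illustration of the last point, the following Python sketch compares a grid search over the
likelihood with one over the average log-likelihood. The data, the N(theta, 1) model and the grid are
assumptions made only for this example; they do not come from the text.

```python
import math

# Hypothetical i.i.d. sample; the assumed model is N(theta, 1) with unknown mean theta.
y = [1.2, 0.4, 2.1, 1.7, 0.9, 1.5]
n = len(y)

def density(y_t, theta):
    """N(theta, 1) density of a single observation."""
    return math.exp(-0.5 * (y_t - theta) ** 2) / math.sqrt(2.0 * math.pi)

def likelihood(theta):
    """L(Y, theta): product of the individual densities."""
    return math.prod(density(y_t, theta) for y_t in y)

def avg_loglik(theta):
    """s_n(theta): average of the log densities."""
    return sum(math.log(density(y_t, theta)) for y_t in y) / n

grid = [i / 1000.0 for i in range(0, 3001)]   # candidate values of theta
theta_L = max(grid, key=likelihood)           # maximizer of L
theta_s = max(grid, key=avg_loglik)           # maximizer of s_n
print(theta_L, theta_s, sum(y) / n)
```

Both searches should pick the same grid point, which for this assumed model is the grid point closest to
the sample mean.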
Example 32. Example: Bernoulli trial
Suppose that we are flipping a coin that may be biased, so that the probability of a heads may not
be 0.5. Maybe we're interested in estimating the probability of a heads. Let $Y = 1(\text{heads})$ be a
binary variable that indicates whether or not a heads is observed. The outcome of a toss is a Bernoulli
random variable:
\[ f_Y(y, p_0) = \begin{cases} p_0^{y} (1 - p_0)^{1-y}, & y \in \{0, 1\} \\ 0, & y \notin \{0, 1\} \end{cases} \]
So a representative term that enters the likelihood function is
\[ f_Y(y, p) = p^{y} (1 - p)^{1-y} \]
and
\[ \ln f_Y(y, p) = y \ln p + (1 - y) \ln (1 - p). \]
The derivative of this is
\[ \frac{\partial \ln f_Y(y, p)}{\partial p} = \frac{y}{p} - \frac{(1 - y)}{(1 - p)} = \frac{y - p}{p (1 - p)}. \]
Averaging this over a sample of size $n$ gives
\[ \frac{\partial s_n(p)}{\partial p} = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i - p}{p (1 - p)}. \]
Setting to zero and solving gives
\[ \hat{p} = \bar{y} \tag{13.1} \]
So it's easy to calculate the MLE of $p_0$ in this case. For future reference, note that
$E(Y) = \sum_{y=0}^{1} y p_0^{y} (1 - p_0)^{1-y} = p_0$ and $Var(Y) = E(Y^2) - [E(Y)]^2 = p_0 - p_0^2$.
For this example, $s_n(p) = \frac{1}{n} \sum_{t=1}^{n} \left[ y_t \ln p + (1 - y_t) \ln (1 - p) \right]$.
A LLN tells us that $s_n(p) \overset{a.s.}{\to} p_0 \ln p + (1 - p_0) \ln(1 - p)$.
The parameter space is compact ($p_0$ lies between 0 and 1),
the objective function is continuous,
thus, the a.s. convergence is also uniform.
The consistency theorem for extremum estimators tells us that the ML estimator converges to the
value that maximizes the limiting objective function. Because $s_\infty(p) = p_0 \ln p + (1 - p_0) \ln(1 - p)$, we
can easily check that the maximizer is $p_0$. So, the ML estimator is consistent for the true probability.
In practice, we need to ensure that $p$ stays between 0 and 1. To do this with an unconstrained
optimization algorithm, we can use a parameterization. See subsection 13.8 for an example.
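The idea of a parameterization can be sketched in a few lines of Python. Here the probability is written
as a logistic function of an unconstrained number a, and plain gradient ascent is used; the logistic map,
the step size, the seed and the simulated data are all assumptions chosen for illustration, not the method
used by the scripts that accompany the text.

```python
import math
import random

random.seed(0)
p0 = 0.3                       # assumed true probability of heads
n = 500
y = [1 if random.random() < p0 else 0 for _ in range(n)]
ybar = sum(y) / n              # closed-form MLE, equation 13.1

def p_of(a):
    """Logistic map from an unconstrained a to p in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

# Gradient ascent on s_n as a function of a. By the chain rule,
# ds_n/da = [(ybar - p) / (p (1 - p))] * [p (1 - p)] = ybar - p.
a = 0.0
for _ in range(2000):
    a += ybar - p_of(a)        # step size of 1 (an arbitrary choice)
p_hat = p_of(a)
print(ybar, p_hat)
```

Whatever value a takes, the implied probability stays strictly inside (0, 1), and the search should end
at the sample mean.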
Now imagine that we had a bag full of bent coins, each bent around a sphere of a different radius
(with the head pointing to the outside of the sphere). We might suspect that the probability of a heads
could depend upon the radius. Suppose that $p_i \equiv p(x_i, \beta) = (1 + \exp(-x_i' \beta))^{-1}$ where
$x_i = \begin{bmatrix} 1 & r_i \end{bmatrix}'$, so that $\beta$ is a $2 \times 1$ vector. Now
\[ \frac{\partial p_i(\beta)}{\partial \beta} = p_i (1 - p_i) x_i \]
so
\[ \frac{\partial \ln f_Y(y, \beta)}{\partial \beta} = \frac{y - p_i}{p_i (1 - p_i)} p_i (1 - p_i) x_i = (y_i - p(x_i, \beta)) x_i. \]
So the derivative of the average log likelihood function is now
\[ \frac{\partial s_n(\beta)}{\partial \beta} = \frac{\sum_{i=1}^{n} (y_i - p(x_i, \beta)) x_i}{n}. \]
This is a set of 2 nonlinear equations in the two unknown elements in $\beta$. There is no explicit solution
for the two elements that set the equations to zero. This is commonly the case with ML estimators:
they are often nonlinear, and finding the value of the estimate often requires use of numeric methods
to find solutions to the first order conditions. See Chapter 11 for more information on how to do this.
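To make the numeric step concrete, here is a minimal Python sketch that solves the two first order
conditions of the bent coin model by Newton's method. The true parameter values, the distribution of the
radii, the seed and the number of iterations are hypothetical choices made for this illustration; Newton's
method is just one of the numeric methods that could be used.

```python
import math
import random

random.seed(1)
beta_true = (-1.0, 0.8)        # assumed values, for illustration only
n = 2000
r = [random.uniform(0.5, 3.0) for _ in range(n)]   # hypothetical radii

def prob(b, r_i):
    """p(x_i, beta) = 1 / (1 + exp(-x_i' beta)) with x_i = (1, r_i)'."""
    return 1.0 / (1.0 + math.exp(-(b[0] + b[1] * r_i)))

y = [1 if random.random() < prob(beta_true, r_i) else 0 for r_i in r]

def score(b):
    """Average score: (1/n) sum (y_i - p_i) x_i."""
    g0 = g1 = 0.0
    for y_i, r_i in zip(y, r):
        e = y_i - prob(b, r_i)
        g0 += e
        g1 += e * r_i
    return g0 / n, g1 / n

def hessian(b):
    """Average Hessian: -(1/n) sum p_i (1 - p_i) x_i x_i' (distinct elements)."""
    h00 = h01 = h11 = 0.0
    for r_i in r:
        p = prob(b, r_i)
        w = p * (1.0 - p)
        h00 -= w
        h01 -= w * r_i
        h11 -= w * r_i * r_i
    return h00 / n, h01 / n, h11 / n

b = (0.0, 0.0)                 # starting values
for _ in range(25):
    g0, g1 = score(b)
    h00, h01, h11 = hessian(b)
    det = h00 * h11 - h01 * h01
    # Newton step b - H^{-1} g, with the 2x2 inverse written out by hand
    b = (b[0] - (h11 * g0 - h01 * g1) / det,
         b[1] - (-h01 * g0 + h00 * g1) / det)
print(b, score(b))
```

At the final value of b both elements of the average score should be numerically zero, which is exactly
the pair of first order conditions described above.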
Example 33. Example: Likelihood function of classical linear regression model
The classical linear regression model with normality is outlined in Section 3.6. The likelihood
function for this model is presented in Section 4.3. An Octave/Matlab example that shows how to
compute the maximum likelihood estimator for data that follows the CLRM with normality is in
NormalExample.m, which makes use of NormalLF.m.
13.2 Consistency of MLE
The MLE is an extremum estimator, so given basic assumptions it is consistent for the value that
maximizes the limiting objective function, following Theorem 29. The question is: what is the value that
maximizes $s_\infty(\theta)$?
Remember that $s_n(\theta) = \frac{1}{n} \ln L(Y, \theta)$, and $L(Y, \theta_0)$ is the true density of the
sample data. For any $\theta \neq \theta_0$
\[ \mathcal{E} \left( \ln \left( \frac{L(\theta)}{L(\theta_0)} \right) \right) \leq \ln \left( \mathcal{E} \left( \frac{L(\theta)}{L(\theta_0)} \right) \right) \]
by Jensen's inequality ($\ln(\cdot)$ is a concave function).
Now, the expectation on the RHS is
\[ \mathcal{E} \left( \frac{L(\theta)}{L(\theta_0)} \right) = \int \frac{L(\theta)}{L(\theta_0)} L(\theta_0) dy = 1, \]
since $L(\theta_0)$ is the density function of the observations, and since the integral of any density is 1.
Therefore, since $\ln(1) = 0$,
\[ \mathcal{E} \left( \ln \left( \frac{L(\theta)}{L(\theta_0)} \right) \right) \leq 0, \]
or
\[ \mathcal{E} (s_n(\theta)) - \mathcal{E} (s_n(\theta_0)) \leq 0. \]
A SLLN tells us that $s_n(\theta) \overset{a.s.}{\to} s_\infty(\theta, \theta_0) = \lim \mathcal{E} (s_n(\theta))$,
and with continuity and a compact parameter space, this is uniform, so
\[ s_\infty(\theta, \theta_0) - s_\infty(\theta_0, \theta_0) \leq 0 \]
except on a set of zero probability. Note: the $\theta_0$ appears because the expectation is taken with respect
to the true density $L(\theta_0)$.
By the identification assumption there is a unique maximizer, so the inequality is strict if $\theta \neq \theta_0$:
\[ s_\infty(\theta, \theta_0) - s_\infty(\theta_0, \theta_0) < 0, \quad \forall \theta \neq \theta_0, \text{ a.s.} \]
Therefore, $\theta_0$ is the unique maximizer of $s_\infty(\theta, \theta_0)$, and thus, Theorem 29 tells us that
\[ \lim_{n \to \infty} \hat{\theta} = \theta_0, \text{ a.s.} \]
So, the ML estimator is consistent for the true parameter value.
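The inequality at the heart of this argument can be checked numerically for the Bernoulli model of
Example 32, where the limiting objective function has a closed form. The value of p0 and the grid below
are assumptions made only for this sketch.

```python
import math

p0 = 0.3   # assumed true parameter

def s_inf(p):
    """Limiting objective for the Bernoulli model: E_{p0}[ln f(y, p)]."""
    return p0 * math.log(p) + (1.0 - p0) * math.log(1.0 - p)

grid = [i / 100.0 for i in range(1, 100)]          # candidate values in (0, 1)
gaps = [s_inf(p) - s_inf(p0) for p in grid]        # s_inf(p, p0) - s_inf(p0, p0)
best = max(grid, key=s_inf)
print(best, max(gaps))
```

Every gap should be non-positive, and the only grid point with a zero gap should be p0 itself.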
13.3 The score function
Assumption: (Differentiability) Assume that $s_n(\theta)$ is twice continuously differentiable in a
neighborhood $N(\theta_0)$ of $\theta_0$, at least when $n$ is large enough.
To maximize the log-likelihood function, take derivatives:
\[ g_n(Y, \theta) = D_\theta s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} D_\theta \ln f(y_t|x_t, \theta) \equiv \frac{1}{n} \sum_{t=1}^{n} g_t(\theta). \]
This is the score vector (with dim $K \times 1$). Note that the score function has $Y$ as an argument, which
implies that it is a random function. $Y$ (and any exogenous variables) will often be suppressed for
clarity, but one should not forget that they are still there.
The ML estimator $\hat{\theta}$ sets the derivatives to zero:
\[ g_n(\hat{\theta}) = \frac{1}{n} \sum_{t=1}^{n} g_t(\hat{\theta}) \equiv 0. \]
We will show that $\mathcal{E}_\theta [g_t(\theta)] = 0$, $\forall t$. This is the expectation taken with respect
to the density $f(\theta)$, not necessarily $f(\theta_0)$.
\[ \mathcal{E}_\theta [g_t(\theta)] = \int [D_\theta \ln f(y_t|x_t, \theta)] f(y_t|x_t, \theta) dy_t = \int \frac{1}{f(y_t|x_t, \theta)} [D_\theta f(y_t|x_t, \theta)] f(y_t|x_t, \theta) dy_t = \int D_\theta f(y_t|x_t, \theta) dy_t. \]
Given some regularity conditions on boundedness of $D_\theta f$, we can switch the order of integration and
differentiation, by the dominated convergence theorem. This gives
\[ \mathcal{E}_\theta [g_t(\theta)] = D_\theta \int f(y_t|x_t, \theta) dy_t = D_\theta 1 = 0 \tag{13.2} \]
where we use the fact that the integral of the density is 1.
So $\mathcal{E}_\theta (g_t(\theta)) = 0$: the expectation of the score vector is zero.
This holds for all $t$, so it implies that $\mathcal{E}_\theta g_n(Y, \theta) = 0$.
13.4 Asymptotic normality of MLE
Recall that we assume that the log-likelihood function $s_n(\theta)$ is twice continuously differentiable. Take
a first order Taylor's series expansion of $g(Y, \hat{\theta})$ about the true value $\theta_0$:
\[ 0 \equiv g(\hat{\theta}) = g(\theta_0) + \left( D_{\theta'} g(\theta^*) \right) \left( \hat{\theta} - \theta_0 \right) \]
or with appropriate definitions
\[ \mathcal{H}(\theta^*) \left( \hat{\theta} - \theta_0 \right) = -g(\theta_0), \]
where $\theta^* = \lambda \hat{\theta} + (1 - \lambda) \theta_0$, $0 < \lambda < 1$. Assume $\mathcal{H}(\theta^*)$
is invertible (we'll justify this in a minute). So
\[ \sqrt{n} \left( \hat{\theta} - \theta_0 \right) = -\mathcal{H}(\theta^*)^{-1} \sqrt{n} g(\theta_0) \tag{13.3} \]
Now consider $\mathcal{H}(\theta^*)$, the matrix of second derivatives of the average log likelihood function. This
is
\[ \mathcal{H}(\theta^*) = D_{\theta'} g(\theta^*) = D^2_\theta s_n(\theta^*) = \frac{1}{n} \sum_{t=1}^{n} D^2_\theta \ln f_t(\theta^*) \]
where the notation
\[ D^2_\theta s_n(\theta) \equiv \frac{\partial^2 s_n(\theta)}{\partial \theta \partial \theta'}. \]
Given that this is an average of terms, it should usually be the case that this satisfies a strong law
of large numbers (SLLN). Regularity conditions are a set of assumptions that guarantee that this will
happen. There are different sets of assumptions that can be used to justify appeal to different SLLNs.
For example, the $D^2_\theta \ln f_t(\theta^*)$ must not be too strongly dependent over time, and their variances must
not become infinite. We don't assume any particular set here, since the appropriate assumptions will
depend upon the particularities of a given model. However, we assume that a SLLN applies.
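Also, since we know that $\hat{\theta}$ is consistent, and since $\theta^* = \lambda \hat{\theta} + (1 - \lambda) \theta_0$,
we have that $\theta^* \overset{a.s.}{\to} \theta_0$. Also, by the above differentiability assumption,
$\mathcal{H}(\theta)$ is continuous in $\theta$. Given this, $\mathcal{H}(\theta^*)$ converges to the
limit of its expectation:
\[ \mathcal{H}(\theta^*) \overset{a.s.}{\to} \lim_{n \to \infty} \mathcal{E} \left( D^2_\theta s_n(\theta_0) \right) = \mathcal{H}_\infty(\theta_0) < \infty \]
This matrix converges to a finite limit.
Re-arranging orders of limits and differentiation, which is legitimate given certain regularity
conditions related to the boundedness of the log-likelihood function, we get
\[ \mathcal{H}_\infty(\theta_0) = D^2_\theta \lim_{n \to \infty} \mathcal{E} (s_n(\theta_0)) = D^2_\theta s_\infty(\theta_0, \theta_0) \]
We've already seen that
\[ s_\infty(\theta, \theta_0) < s_\infty(\theta_0, \theta_0) \]
i.e., $\theta_0$ maximizes the limiting objective function. Since there is a unique maximizer, and by the
assumption that $s_n(\theta)$ is twice continuously differentiable (which holds in the limit), then
$\mathcal{H}_\infty(\theta_0)$ must be negative definite, and therefore of full rank. Therefore the previous
inversion is justified, asymptotically, and we have
\[ \sqrt{n} \left( \hat{\theta} - \theta_0 \right) = -\mathcal{H}(\theta^*)^{-1} \sqrt{n} g(\theta_0). \tag{13.4} \]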
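Now consider $\sqrt{n} g(\theta_0)$. This is
\[ \sqrt{n} g_n(\theta_0) = \sqrt{n} D_\theta s_n(\theta_0) = \frac{\sqrt{n}}{n} \sum_{t=1}^{n} D_\theta \ln f_t(y_t|x_t, \theta_0) = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} g_t(\theta_0) \]
We've already seen that $\mathcal{E}_\theta [g_t(\theta)] = 0$. As such, it is reasonable to assume that a CLT applies.
Note that $g_n(\theta_0) \overset{a.s.}{\to} 0$, by consistency. To avoid this collapse to a degenerate r.v. (a constant
vector) we need to scale by $\sqrt{n}$. A generic CLT states that, for $X_n$ a random vector that satisfies
certain conditions,
\[ X_n - E(X_n) \overset{d}{\to} N(0, \lim V(X_n)) \]
The "certain conditions" that $X_n$ must satisfy depend on the case at hand. Usually, $X_n$ will be of the
form of an average, scaled by $\sqrt{n}$:
\[ X_n = \sqrt{n} \frac{\sum_{t=1}^{n} X_t}{n} \]
This is the case for $\sqrt{n} g(\theta_0)$ for example. Then the properties of $X_n$ depend on the properties of the
$X_t$. For example, if the $X_t$ have finite variances and are not too strongly dependent, then a CLT for
dependent processes will apply. Supposing that a CLT applies, and noting that $E(\sqrt{n} g_n(\theta_0)) = 0$, we
get
\[ \sqrt{n} g_n(\theta_0) \overset{d}{\to} N [0, \mathcal{I}_\infty(\theta_0)] \tag{13.5} \]
where
\[ \mathcal{I}_\infty(\theta_0) = \lim_{n \to \infty} \mathcal{E}_{\theta_0} \left( n [g_n(\theta_0)] [g_n(\theta_0)]' \right) = \lim_{n \to \infty} V_{\theta_0} \left( \sqrt{n} g_n(\theta_0) \right) \]
$\mathcal{I}_\infty(\theta_0)$ is known as the information matrix.
Combining [13.4] and [13.5], and noting that $\mathcal{H}(\theta^*) \overset{a.s.}{\to} \mathcal{H}_\infty(\theta_0)$, we get
\[ \sqrt{n} \left( \hat{\theta} - \theta_0 \right) \overset{a}{\sim} N \left[ 0, \mathcal{H}_\infty(\theta_0)^{-1} \mathcal{I}_\infty(\theta_0) \mathcal{H}_\infty(\theta_0)^{-1} \right]. \]
The MLE estimator is asymptotically normally distributed.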
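The result can be illustrated with a small Monte Carlo experiment in Python. The model below, i.i.d.
N(0, v) data with unknown variance v, is an assumption chosen because everything can be computed by hand:
the MLE is the average of the squared observations, the limiting Hessian is -1/(2 v^2) and the
information matrix is 1/(2 v^2), so the asymptotic variance is 2 v^2. The sample size, the number of
replications and the seed are arbitrary.

```python
import math
import random

random.seed(42)
v0 = 2.0          # assumed true variance
n = 200           # sample size
reps = 2000       # Monte Carlo replications

draws = []
for _ in range(reps):
    y = [random.gauss(0.0, math.sqrt(v0)) for _ in range(n)]
    v_hat = sum(y_t * y_t for y_t in y) / n      # MLE of the variance
    draws.append(math.sqrt(n) * (v_hat - v0))    # sqrt(n) (v_hat - v0)

mc_mean = sum(draws) / reps
mc_var = sum((d - mc_mean) ** 2 for d in draws) / (reps - 1)
asy_var = 2.0 * v0 ** 2                          # H^{-1} I H^{-1} for this model
# share of replications inside a nominal 95% band based on the asymptotic variance
inside = sum(abs(d) <= 1.96 * math.sqrt(asy_var) for d in draws) / reps
print(mc_mean, mc_var, asy_var, inside)
```

The Monte Carlo variance should be close to the asymptotic variance, and the share of draws inside the
band should be close to 0.95, up to simulation error.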
Definition 34. Consistent and asymptotically normal (CAN). An estimator $\hat{\theta}$ of a parameter $\theta_0$ is
$\sqrt{n}$-consistent and asymptotically normally distributed if
$\sqrt{n} \left( \hat{\theta} - \theta_0 \right) \overset{d}{\to} N (0, V_\infty)$ where $V_\infty$ is a
finite positive definite matrix.
There do exist, in special cases, estimators that are consistent such that
$\sqrt{n} \left( \hat{\theta} - \theta_0 \right) \overset{p}{\to} 0$. These are
known as superconsistent estimators, since in ordinary circumstances with stationary data, $\sqrt{n}$ is the
highest factor that we can multiply by and still get convergence to a stable limiting distribution.
Definition 35. Asymptotically unbiased. An estimator $\hat{\theta}$ of a parameter $\theta_0$ is asymptotically unbiased
if
\[ \lim_{n \to \infty} \mathcal{E}_\theta (\hat{\theta}) = \theta. \]
Estimators that are CAN are asymptotically unbiased, though not all consistent estimators are
asymptotically unbiased. Such cases are unusual, though.
13.5 The information matrix equality
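We will show that $\mathcal{H}_\infty(\theta) = -\mathcal{I}_\infty(\theta)$. Let $f_t(\theta)$ be short for $f(y_t|x_t, \theta)$.
\[ 1 = \int f_t(\theta) dy, \text{ so} \]
\[ 0 = \int D_\theta f_t(\theta) dy = \int (D_\theta \ln f_t(\theta)) f_t(\theta) dy \]
Now differentiate again:
\[ 0 = \int \left[ D^2_\theta \ln f_t(\theta) \right] f_t(\theta) dy + \int [D_\theta \ln f_t(\theta)] D_{\theta'} f_t(\theta) dy \]
\[ = \mathcal{E}_\theta \left[ D^2_\theta \ln f_t(\theta) \right] + \int [D_\theta \ln f_t(\theta)] [D_{\theta'} \ln f_t(\theta)] f_t(\theta) dy \]
\[ = \mathcal{E}_\theta \left[ D^2_\theta \ln f_t(\theta) \right] + \mathcal{E}_\theta [D_\theta \ln f_t(\theta)] [D_{\theta'} \ln f_t(\theta)] \]
\[ = \mathcal{E}_\theta [\mathcal{H}_t(\theta)] + \mathcal{E}_\theta [g_t(\theta)] [g_t(\theta)]' \tag{13.6} \]
Now sum over $n$ and multiply by $\frac{1}{n}$:
\[ \mathcal{E}_\theta \frac{1}{n} \sum_{t=1}^{n} [\mathcal{H}_t(\theta)] = -\mathcal{E}_\theta \left[ \frac{1}{n} \sum_{t=1}^{n} [g_t(\theta)] [g_t(\theta)]' \right] \tag{13.7} \]
The scores $g_t$ and $g_s$ are uncorrelated for $t \neq s$, since for $t > s$, $f_t(y_t|y_1, \ldots, y_{t-1}, \theta)$ has
conditioned on prior information, so what was random in $s$ is fixed in $t$. (This forms the basis for a
specification test proposed by White: if the scores appear to be correlated one may question the
specification of the model). This allows us to write
\[ \mathcal{E}_\theta [\mathcal{H}_n(\theta)] = -\mathcal{E}_\theta \left( n [g(\theta)] [g(\theta)]' \right) \]
since all cross products between different periods have expectation zero. Finally, taking limits, we get
\[ \mathcal{H}_\infty(\theta) = -\mathcal{I}_\infty(\theta). \tag{13.8} \]
This holds for all $\theta$, in particular, for $\theta_0$. Using this,
\[ \sqrt{n} \left( \hat{\theta} - \theta_0 \right) \overset{d}{\to} N \left[ 0, \mathcal{H}_\infty(\theta_0)^{-1} \mathcal{I}_\infty(\theta_0) \mathcal{H}_\infty(\theta_0)^{-1} \right] \]
simplifies to
\[ \sqrt{n} \left( \hat{\theta} - \theta_0 \right) \overset{d}{\to} N \left[ 0, \mathcal{I}_\infty(\theta_0)^{-1} \right] \tag{13.9} \]
or
\[ \sqrt{n} \left( \hat{\theta} - \theta_0 \right) \overset{d}{\to} N \left[ 0, -\mathcal{H}_\infty(\theta_0)^{-1} \right] \tag{13.10} \]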
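The equality can be verified exactly for a simple discrete model. The sketch below uses a geometric
density, which is an assumed example that does not appear in the text, and computes the expected Hessian
and the expected squared score by summing over a truncated support.

```python
def pmf(y, p):
    """Geometric density f(y, p) = p (1 - p)^y for y = 0, 1, 2, ..."""
    return p * (1.0 - p) ** y

def score(y, p):
    """g(y, p) = d ln f / dp = 1/p - y/(1 - p)."""
    return 1.0 / p - y / (1.0 - p)

def hess(y, p):
    """H(y, p) = d^2 ln f / dp^2 = -1/p^2 - y/(1 - p)^2."""
    return -1.0 / p ** 2 - y / (1.0 - p) ** 2

p = 0.3                        # assumed parameter value
support = range(1000)          # truncation; the omitted tail mass is negligible
E_H = sum(hess(y, p) * pmf(y, p) for y in support)          # expected Hessian
E_gg = sum(score(y, p) ** 2 * pmf(y, p) for y in support)   # expected squared score
print(E_H, E_gg)
```

The two numbers should be equal in absolute value and opposite in sign, as equation 13.6 requires term
by term.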
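To estimate the asymptotic variance, we need estimators of $\mathcal{H}_\infty(\theta_0)$ and $\mathcal{I}_\infty(\theta_0)$. We can use
\[ \hat{\mathcal{I}}_\infty(\theta_0) = \frac{1}{n} \sum_{t=1}^{n} g_t(\hat{\theta}) g_t(\hat{\theta})' \]
\[ \hat{\mathcal{H}}_\infty(\theta_0) = \mathcal{H}_n(\hat{\theta}), \]
as is intuitive if one considers equation 13.7. Note, one can't use
\[ \hat{\mathcal{I}}_\infty(\theta_0) = n \left[ g_n(\hat{\theta}) \right] \left[ g_n(\hat{\theta}) \right]' \]
to estimate the information matrix. Why not?
From this we see that there are alternative ways to estimate $V_\infty(\theta_0)$ that are all valid. These include
\[ \hat{V}_\infty(\theta_0) = -\hat{\mathcal{H}}_\infty(\theta_0)^{-1} \]
\[ \hat{V}_\infty(\theta_0) = \hat{\mathcal{I}}_\infty(\theta_0)^{-1} \]
\[ \hat{V}_\infty(\theta_0) = \hat{\mathcal{H}}_\infty(\theta_0)^{-1} \hat{\mathcal{I}}_\infty(\theta_0) \hat{\mathcal{H}}_\infty(\theta_0)^{-1} \]
These are known as the inverse Hessian, outer product of the gradient (OPG) and sandwich estimators,
respectively. The sandwich form is the most robust, since it coincides with the covariance estimator
of the quasi-ML estimator.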
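A minimal Python sketch of the three estimators for a scalar parameter follows. The model (i.i.d.
N(0, v) data with unknown variance v), the seed and the sample size are assumptions made for this
illustration only; the score and Hessian contributions are the analytic derivatives of the log density
of that assumed model.

```python
import math
import random

random.seed(7)
v0 = 2.0                                        # assumed true variance
n = 200
y = [random.gauss(0.0, math.sqrt(v0)) for _ in range(n)]

v_hat = sum(y_t ** 2 for y_t in y) / n          # MLE of v in the N(0, v) model

def score_t(y_t, v):
    """g_t(v) = d ln f / dv = -1/(2v) + y_t^2 / (2 v^2)."""
    return -0.5 / v + y_t ** 2 / (2.0 * v ** 2)

def hess_t(y_t, v):
    """H_t(v) = d^2 ln f / dv^2 = 1/(2 v^2) - y_t^2 / v^3."""
    return 0.5 / v ** 2 - y_t ** 2 / v ** 3

H_hat = sum(hess_t(y_t, v_hat) for y_t in y) / n        # average Hessian at the MLE
I_hat = sum(score_t(y_t, v_hat) ** 2 for y_t in y) / n  # average outer product of scores

V_hessian = -1.0 / H_hat            # inverse Hessian form
V_opg = 1.0 / I_hat                 # outer product of the gradient form
V_sandwich = I_hat / H_hat ** 2     # sandwich form, H^{-1} I H^{-1}
print(V_hessian, V_opg, V_sandwich)
```

In a finite sample the three numbers generally differ, even though they estimate the same limit when the
model is correctly specified.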
With a little more detail, the methods are:
The sandwich version:
\[ \hat{V}_\infty = n \left\{ \left[ \sum_{t=1}^{n} D^2_\theta \ln f(y_t|Y_{t-1}, \hat{\theta}) \right] \left[ \sum_{t=1}^{n} \left[ D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta}) \right] \left[ D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta}) \right]' \right]^{-1} \left[ \sum_{t=1}^{n} D^2_\theta \ln f(y_t|Y_{t-1}, \hat{\theta}) \right] \right\}^{-1} \]
or the inverse of the negative of the Hessian (since the middle and last term cancel, except for a
minus sign):
\[ \hat{V}_\infty = \left[ -\frac{1}{n} \sum_{t=1}^{n} D^2_\theta \ln f(y_t|Y_{t-1}, \hat{\theta}) \right]^{-1}, \]
or the inverse of the outer product of the gradient (since the middle and last cancel except for a
minus sign, and the first term converges to minus the inverse of the middle term, which is still
inside the overall inverse)
\[ \hat{V}_\infty = \left\{ \frac{1}{n} \sum_{t=1}^{n} \left[ D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta}) \right] \left[ D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta}) \right]' \right\}^{-1}. \]
This simplification is a special result for the MLE estimator - it doesn't apply to GMM estimators in
general.
Asymptotically, if the model is correctly specified, all of these forms converge to the same limit.
In small samples they will differ. In particular, there is evidence that the outer product of the
gradient formula does not perform very well in small samples (see Davidson and MacKinnon,
pg. 477).
White's Information matrix test (Econometrica, 1982) is based upon comparing the two ways to
estimate the information matrix: outer product of gradient or negative of the Hessian. If they
differ by too much, this is evidence of misspecification of the model.
Example, Coin flipping, again
In Example 32 we saw that the MLE for the parameter of a Bernoulli trial, with i.i.d. data, is the sample
mean: $\hat{p} = \bar{y}$ (equation 13.1). Now let's find the limiting variance of $\sqrt{n} (\hat{p} - p_0)$.
We can do this in a simple way:
\[ \lim Var \sqrt{n} (\hat{p} - p_0) = \lim n Var (\hat{p} - p_0) \]
\[ = \lim n Var (\hat{p}) \]
\[ = \lim n Var (\bar{y}) \]
\[ = \lim n Var \left( \frac{\sum y_t}{n} \right) \]
\[ = \lim \frac{1}{n} \sum Var(y_t) \quad \text{(by independence of obs.)} \]
\[ = \lim \frac{1}{n} n Var(y) \quad \text{(by identically distributed obs.)} \]
\[ = Var(y) \]
\[ = p_0 (1 - p_0) \]
While that is simple, let's verify that the methods of Chapter 12 give the same answer. The
log-likelihood function is
\[ s_n(p) = \frac{1}{n} \sum_{t=1}^{n} \left\{ y_t \ln p + (1 - y_t) \ln (1 - p) \right\} \]
so
\[ E s_n(p) = p_0 \ln p + \left( 1 - p_0 \right) \ln (1 - p) \]
by the fact that the observations are i.i.d. Thus, $s_\infty(p) = p_0 \ln p + (1 - p_0) \ln (1 - p)$. A bit of
calculation shows that
\[ D^2_p E s_n(p) \big|_{p = p_0} = \frac{-1}{p_0 (1 - p_0)}, \]
which doesn't depend upon $n$, so that $\mathcal{H}_\infty(p_0) = -1 / \left( p_0 (1 - p_0) \right)$. By results we've
seen on MLE, $\lim Var \sqrt{n} (\hat{p} - p_0) = -\mathcal{H}_\infty^{-1}(p_0)$. And
in this case, $-\mathcal{H}_\infty^{-1}(p_0) = p_0 (1 - p_0)$. So, we get the same limiting variance using both methods.
Exercise 36. For this example, find $\mathcal{I}_\infty(p_0)$.
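13.6 The Cramér-Rao lower bound
Theorem 37. [Cramér-Rao Lower Bound] The limiting variance of a CAN estimator of $\theta_0$, say $\tilde{\theta}$,
minus the inverse of the information matrix is a positive semidefinite matrix.
Proof: Since the estimator is CAN, it is asymptotically unbiased, so
\[ \lim_{n \to \infty} \mathcal{E}_\theta (\tilde{\theta} - \theta) = 0 \]
Differentiate wrt $\theta'$:
\[ D_{\theta'} \lim_{n \to \infty} \mathcal{E}_\theta (\tilde{\theta} - \theta) = \lim_{n \to \infty} \int D_{\theta'} \left[ f(Y, \theta) \left( \tilde{\theta} - \theta \right) \right] dy = 0 \quad \text{(this is a $K \times K$ matrix of zeros).} \]
Noting that $D_{\theta'} f(Y, \theta) = f(\theta) D_{\theta'} \ln f(\theta)$, we can write
\[ \lim_{n \to \infty} \int \left( \tilde{\theta} - \theta \right) f(\theta) D_{\theta'} \ln f(\theta) dy + \lim_{n \to \infty} \int f(Y, \theta) D_{\theta'} \left( \tilde{\theta} - \theta \right) dy = 0. \]
Now note that $D_{\theta'} \left( \tilde{\theta} - \theta \right) = -I_K$, and $\int f(Y, \theta) (-I_K) dy = -I_K$. With this we have
\[ \lim_{n \to \infty} \int \left( \tilde{\theta} - \theta \right) f(\theta) D_{\theta'} \ln f(\theta) dy = I_K. \]
Playing with powers of $n$ we get
\[ \lim_{n \to \infty} \int \sqrt{n} \left( \tilde{\theta} - \theta \right) \sqrt{n} \underbrace{\frac{1}{n} [D_{\theta'} \ln f(\theta)]} f(\theta) dy = I_K \]
Note that the bracketed part is just the transpose of the score vector, $g(\theta)$, so we can write
\[ \lim_{n \to \infty} \mathcal{E}_\theta \left[ \sqrt{n} \left( \tilde{\theta} - \theta \right) \sqrt{n} g(\theta)' \right] = I_K \]
This means that the covariance of the score function with $\sqrt{n} \left( \tilde{\theta} - \theta \right)$, for
$\tilde{\theta}$ any CAN estimator, is an identity matrix. Using this, suppose the variance of
$\sqrt{n} \left( \tilde{\theta} - \theta \right)$ tends to $V_\infty(\tilde{\theta})$. Therefore,
\[ V_\infty \begin{bmatrix} \sqrt{n} \left( \tilde{\theta} - \theta \right) \\ \sqrt{n} g(\theta) \end{bmatrix} = \begin{bmatrix} V_\infty(\tilde{\theta}) & I_K \\ I_K & \mathcal{I}_\infty(\theta) \end{bmatrix}. \tag{13.11} \]
Since this is a covariance matrix, it is positive semi-definite. Therefore, for any $K$-vector $\alpha$,
\[ \begin{bmatrix} \alpha' & -\alpha' \mathcal{I}_\infty^{-1}(\theta) \end{bmatrix} \begin{bmatrix} V_\infty(\tilde{\theta}) & I_K \\ I_K & \mathcal{I}_\infty(\theta) \end{bmatrix} \begin{bmatrix} \alpha \\ -\mathcal{I}_\infty(\theta)^{-1} \alpha \end{bmatrix} \geq 0. \]
This simplifies to
\[ \alpha' \left[ V_\infty(\tilde{\theta}) - \mathcal{I}_\infty^{-1}(\theta) \right] \alpha \geq 0. \]
Since $\alpha$ is arbitrary, $V_\infty(\tilde{\theta}) - \mathcal{I}_\infty^{-1}(\theta)$ is positive semidefinite. This concludes the proof.
This means that $\mathcal{I}_\infty^{-1}(\theta)$ is a lower bound for the asymptotic variance of a CAN estimator.
(Asymptotic efficiency) Given two CAN estimators of a parameter $\theta_0$, say $\tilde{\theta}$ and $\hat{\theta}$,
$\hat{\theta}$ is asymptotically efficient with respect to $\tilde{\theta}$ if
$V_\infty(\tilde{\theta}) - V_\infty(\hat{\theta})$ is a positive semidefinite matrix.
A direct proof of asymptotic efficiency of an estimator is infeasible, but if one can show that the
asymptotic variance is equal to the inverse of the information matrix, then the estimator is
asymptotically efficient. In particular, the MLE is asymptotically efficient with respect to any other CAN
estimator.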
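A small simulation can make the bound tangible. The sketch below assumes i.i.d. N(mu, 1) data, for which
the sample mean is the MLE of mu and the information matrix equals 1, and compares it with the sample
median, another CAN estimator whose asymptotic variance is known to be pi/2 in this model. The sample
size, the number of replications and the seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(3)
n, reps = 101, 2000            # arbitrary sample size and number of replications
z_mean, z_med = [], []
for _ in range(reps):
    y = [random.gauss(0.0, 1.0) for _ in range(n)]             # true mean is 0
    z_mean.append(math.sqrt(n) * statistics.fmean(y))          # sqrt(n) (mean - 0)
    z_med.append(math.sqrt(n) * statistics.median(y))          # sqrt(n) (median - 0)

var_mean = statistics.variance(z_mean)      # should be near the bound, 1
var_median = statistics.variance(z_med)     # should be near pi / 2
print(var_mean, var_median, math.pi / 2.0)
```

Up to simulation error, the scaled variance of the mean sits at the lower bound while that of the median
lies above it.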
13.7 Likelihood ratio-type tests
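Suppose we would like to test a set of $q$ possibly nonlinear restrictions $r(\theta) = 0$, where the $q \times k$ matrix
$D_{\theta'} r(\theta)$ has rank $q$. The Wald test can be calculated using the unrestricted model. The score test can
be calculated using only the restricted model. The likelihood ratio test, on the other hand, uses both
the restricted and the unrestricted estimators. The test statistic is
\[ LR = 2 \left( \ln L(\hat{\theta}) - \ln L(\tilde{\theta}) \right) \]
where $\hat{\theta}$ is the unrestricted estimate and $\tilde{\theta}$ is the restricted estimate. To show that it is asymptotically
$\chi^2$, take a second order Taylor's series expansion of $\ln L(\tilde{\theta})$ about $\hat{\theta}$:
\[ \ln L(\tilde{\theta}) \simeq \ln L(\hat{\theta}) + \frac{n}{2} \left( \tilde{\theta} - \hat{\theta} \right)' \mathcal{H}(\hat{\theta}) \left( \tilde{\theta} - \hat{\theta} \right) \]
(note, the first order term drops out since $D_\theta \ln L(\hat{\theta}) \equiv 0$ by the first order conditions, and we
need to multiply the second-order term by $n$ since $\mathcal{H}(\theta)$ is defined in terms of $\frac{1}{n} \ln L(\theta)$) so
\[ LR \simeq -n \left( \tilde{\theta} - \hat{\theta} \right)' \mathcal{H}(\hat{\theta}) \left( \tilde{\theta} - \hat{\theta} \right) \]
As $n \to \infty$, $\mathcal{H}(\hat{\theta}) \to \mathcal{H}_\infty(\theta_0) = -\mathcal{I}_\infty(\theta_0)$, by the information matrix equality. So
\[ LR \overset{a}{=} n \left( \tilde{\theta} - \hat{\theta} \right)' \mathcal{I}_\infty(\theta_0) \left( \tilde{\theta} - \hat{\theta} \right) \tag{13.12} \]
We also have that, from the theory on the asymptotic normality of the MLE and the information
matrix equality
\[ \sqrt{n} \left( \hat{\theta} - \theta_0 \right) \overset{a}{=} \mathcal{I}_\infty(\theta_0)^{-1} n^{1/2} g(\theta_0). \]
An analogous result for the restricted estimator is (this is unproven here; to prove this, set up the
Lagrangean for MLE subject to $R\theta = r$, and manipulate the first order conditions):
\[ \sqrt{n} \left( \tilde{\theta} - \theta_0 \right) \overset{a}{=} \mathcal{I}_\infty(\theta_0)^{-1} \left( I_K - R' \left( R \mathcal{I}_\infty(\theta_0)^{-1} R' \right)^{-1} R \mathcal{I}_\infty(\theta_0)^{-1} \right) n^{1/2} g(\theta_0). \]
Combining the last two equations
\[ \sqrt{n} \left( \tilde{\theta} - \hat{\theta} \right) \overset{a}{=} -n^{1/2} \mathcal{I}_\infty(\theta_0)^{-1} R' \left( R \mathcal{I}_\infty(\theta_0)^{-1} R' \right)^{-1} R \mathcal{I}_\infty(\theta_0)^{-1} g(\theta_0) \]
so, substituting into [13.12]
\[ LR \overset{a}{=} \left[ n^{1/2} g(\theta_0)' \mathcal{I}_\infty(\theta_0)^{-1} R' \right] \left[ R \mathcal{I}_\infty(\theta_0)^{-1} R' \right]^{-1} \left[ R \mathcal{I}_\infty(\theta_0)^{-1} n^{1/2} g(\theta_0) \right] \]
But since
\[ n^{1/2} g(\theta_0) \overset{d}{\to} N (0, \mathcal{I}_\infty(\theta_0)) \]
the linear function
\[ R \mathcal{I}_\infty(\theta_0)^{-1} n^{1/2} g(\theta_0) \overset{d}{\to} N (0, R \mathcal{I}_\infty(\theta_0)^{-1} R'). \]
We can see that LR is a quadratic form of this rv, with the inverse of its variance in the middle, so
\[ LR \overset{d}{\to} \chi^2(q). \]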
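As a worked illustration, the following Python sketch computes the LR statistic for a single restriction
on the Bernoulli model, H0: p = 0.5. The data (62 heads in 100 tosses) are hypothetical, and the p-value
uses the fact that a chi-square variable with one degree of freedom is the square of a standard normal.

```python
import math

def loglik(p, heads, n):
    """ln L for n independent Bernoulli trials with `heads` successes."""
    return heads * math.log(p) + (n - heads) * math.log(1.0 - p)

n, heads = 100, 62            # hypothetical data
p_hat = heads / n             # unrestricted MLE (the sample mean)
p_tilde = 0.5                 # restricted estimate under H0: p = 0.5

LR = 2.0 * (loglik(p_hat, heads, n) - loglik(p_tilde, heads, n))
# For q = 1, Pr(chi2(1) > LR) = Pr(|Z| > sqrt(LR)) = erfc(sqrt(LR / 2)).
p_value = math.erfc(math.sqrt(LR / 2.0))
reject_5pct = LR > 3.841      # approximate 5% critical value of chi-square(1)
print(LR, p_value, reject_5pct)
```

Because the unrestricted estimate maximizes the likelihood, LR can never be negative; large values are
evidence against the restriction.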
Summary of MLE
Consistent
Asymptotically normal (CAN)
Asymptotically efficient
Asymptotically unbiased
LR test is available for testing hypotheses
The presentation is for general MLE: we haven't specified the distribution or the
linearity/nonlinearity of the estimator
13.8 Examples
ML of Nerlove model, assuming normality
As we saw in Section 4.3, the ML and OLS estimators of $\beta$ in the linear model $y = X\beta + \epsilon$ coincide
when $\epsilon$ is assumed to be i.i.d. normally distributed. The Octave script NerloveMLE.m verifies this
result, for the basic Nerlove model (eqn. 3.10). The output of the script follows:
******************************************************
check MLE with normality, compare to OLS
MLE Estimation Results
BFGS convergence: Normal convergence
Average Log-L: -0.465806
Observations: 145
            estimate   st. err    t-stat   p-value
constant      -3.527     1.689    -2.088     0.037
output         0.720     0.032    22.491     0.000
labor          0.436     0.241     1.808     0.071
fuel           0.427     0.074     5.751     0.000
capital       -0.220     0.318    -0.691     0.490
sig            0.386     0.041     9.290     0.000
Information Criteria
CAIC : 170.9442 Avg. CAIC: 1.1789
BIC : 164.9442 Avg. BIC: 1.1375
AIC : 147.0838 Avg. AIC: 1.0144
******************************************************
Compare the output to that of Nerlove.m, which does OLS. The script also provides a basic example
of how to use the MLE estimation routine mle_results.m.
Example: Binary response models: theory
This section extends the Bernoulli trial model to binary response models with conditioning variables,
as such models arise in a variety of contexts.
Assume that
\[ y^* = x'\theta - \epsilon \]
\[ y = 1(y^* > 0) \]
\[ \epsilon \sim N(0, 1) \]
Here, $y^*$ is an unobserved (latent) continuous variable, and $y$ is a binary variable that indicates whether
$y^*$ is negative or positive. Then the probit model results, where
$\Pr(y = 1|x) = \Pr(\epsilon < x'\theta) = \Phi(x'\theta)$, where
\[ \Phi(x'\theta) = \int_{-\infty}^{x'\theta} (2\pi)^{-1/2} \exp \left( -\frac{\epsilon^2}{2} \right) d\epsilon \]
is the standard normal distribution function.
The logit model results if the errors $\epsilon$ are not normal, but rather have a logistic distribution. This
distribution is similar to the standard normal, but has fatter tails. The probability has the following
parameterization
\[ \Pr(y = 1|x) = \Lambda(x'\theta) = (1 + \exp(-x'\theta))^{-1}. \]
In general, a binary response model will require that the choice probability be parameterized in
some form which could be logit, probit, or something else. For a vector of explanatory variables $x$, the
response probability will be parameterized in some manner
\[ \Pr(y = 1|x) = p(x, \theta) \]
Again, if $p(x, \theta) = \Lambda(x'\theta)$, we have a logit model. If $p(x, \theta) = \Phi(x'\theta)$, where $\Phi(\cdot)$ is the standard
normal distribution function, then we have a probit model.
Regardless of the parameterization, we are dealing with a Bernoulli density,
\[ f_{Y_i}(y_i|x_i) = p(x_i, \theta)^{y_i} (1 - p(x_i, \theta))^{1 - y_i} \]
so as long as the observations are independent, the maximum likelihood (ML) estimator, $\hat{\theta}$, is the
maximizer of
\[ s_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i \ln p(x_i, \theta) + (1 - y_i) \ln [1 - p(x_i, \theta)] \right) \equiv \frac{1}{n} \sum_{i=1}^{n} s(y_i, x_i, \theta). \tag{13.13} \]
Following the above theoretical results, $\hat{\theta}$ tends in probability to the $\theta_0$ that maximizes the uniform
almost sure limit of $s_n(\theta)$. Noting that $\mathcal{E} y_i = p(x_i, \theta_0)$, and following a SLLN for i.i.d. processes, $s_n(\theta)$
converges almost surely to the expectation of a representative term $s(y, x, \theta)$. First one can take the
expectation conditional on $x$ to get
\[ \mathcal{E}_{y|x} \left\{ y \ln p(x, \theta) + (1 - y) \ln [1 - p(x, \theta)] \right\} = p(x, \theta_0) \ln p(x, \theta) + [1 - p(x, \theta_0)] \ln [1 - p(x, \theta)]. \]
Next taking expectation over $x$ we get the limiting objective function
\[ s_\infty(\theta) = \int_{\mathcal{X}} \left\{ p(x, \theta_0) \ln p(x, \theta) + [1 - p(x, \theta_0)] \ln [1 - p(x, \theta)] \right\} \mu(x) dx, \tag{13.14} \]
where $\mu(x)$ is the (joint - the integral is understood to be multiple, and $\mathcal{X}$ is the support of $x$) density
function of the explanatory variables $x$. This is clearly continuous in $\theta$, as long as $p(x, \theta)$ is continuous,
and if the parameter space is compact we therefore have uniform almost sure convergence. Note that
$p(x, \theta)$ is continuous for the logit and probit models, for example. The maximizing element of $s_\infty(\theta)$,
$\theta^*$, solves the first order conditions
\[ \int_{\mathcal{X}} \left\{ \frac{p(x, \theta_0)}{p(x, \theta^*)} \frac{\partial}{\partial \theta} p(x, \theta^*) - \frac{1 - p(x, \theta_0)}{1 - p(x, \theta^*)} \frac{\partial}{\partial \theta} p(x, \theta^*) \right\} \mu(x) dx = 0 \]
This is clearly solved by $\theta^* = \theta_0$. Provided the solution is unique, $\hat{\theta}$ is consistent. Question: what's
needed to ensure that the solution is unique?
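The asymptotic normality theorem tells us that
\[ \sqrt{n} \left( \hat{\theta} - \theta_0 \right) \overset{d}{\to} N \left[ 0, \mathcal{H}_\infty(\theta_0)^{-1} \mathcal{I}_\infty(\theta_0) \mathcal{H}_\infty(\theta_0)^{-1} \right]. \]
In the case of i.i.d. observations $\mathcal{I}_\infty(\theta_0) = \lim_{n \to \infty} Var \sqrt{n} D_\theta s_n(\theta_0)$ is simply the expectation of a
typical element of the outer product of the gradient.
There's no need to subtract the mean, since it's zero, following the f.o.c. in the consistency proof
above and the fact that observations are i.i.d.
The terms in $n$ also drop out by the same argument:
\[ \lim_{n \to \infty} Var \sqrt{n} D_\theta s_n(\theta_0) = \lim_{n \to \infty} Var \sqrt{n} D_\theta \frac{1}{n} \sum_t s(\theta_0) \]
\[ = \lim_{n \to \infty} Var \frac{1}{\sqrt{n}} D_\theta \sum_t s(\theta_0) \]
\[ = \lim_{n \to \infty} \frac{1}{n} Var \sum_t D_\theta s(\theta_0) \]
\[ = \lim_{n \to \infty} Var D_\theta s(\theta_0) \]
\[ = Var D_\theta s(\theta_0) \]
So we get
\[ \mathcal{I}_\infty(\theta_0) = \mathcal{E} \left\{ \frac{\partial}{\partial \theta} s(y, x, \theta_0) \frac{\partial}{\partial \theta'} s(y, x, \theta_0) \right\}. \]
Likewise,
\[ \mathcal{H}_\infty(\theta_0) = \mathcal{E} \frac{\partial^2}{\partial \theta \partial \theta'} s(y, x, \theta_0). \]
Expectations are jointly over $y$ and $x$, or equivalently, first over $y$ conditional on $x$, then over $x$. From
above, a typical element of the objective function is
\[ s(y, x, \theta_0) = y \ln p(x, \theta_0) + (1 - y) \ln [1 - p(x, \theta_0)]. \]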
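Now suppose that we are dealing with a correctly specified logit model:
\[ p(x, \theta) = (1 + \exp(-x'\theta))^{-1}. \]
We can simplify the above results in this case. We have that
\[ \frac{\partial}{\partial \theta} p(x, \theta) = (1 + \exp(-x'\theta))^{-2} \exp(-x'\theta) x \]
\[ = (1 + \exp(-x'\theta))^{-1} \frac{\exp(-x'\theta)}{1 + \exp(-x'\theta)} x \]
\[ = p(x, \theta) (1 - p(x, \theta)) x \]
\[ = \left( p(x, \theta) - p(x, \theta)^2 \right) x. \]
So
\[ \frac{\partial}{\partial \theta} s(y, x, \theta_0) = [y - p(x, \theta_0)] x \tag{13.15} \]
\[ \frac{\partial^2}{\partial \theta \partial \theta'} s(\theta_0) = -\left[ p(x, \theta_0) - p(x, \theta_0)^2 \right] x x'. \]
Taking expectations over $y$ then $x$ gives
\[ \mathcal{I}_\infty(\theta_0) = \int E_Y \left[ y^2 - 2 p(x, \theta_0) p(x, \theta_0) + p(x, \theta_0)^2 \right] x x' \mu(x) dx \tag{13.16} \]
\[ = \int \left[ p(x, \theta_0) - p(x, \theta_0)^2 \right] x x' \mu(x) dx. \tag{13.17} \]
where we use the fact that $E_Y(y) = E_Y(y^2) = p(x, \theta_0)$. Likewise,
\[ \mathcal{H}_\infty(\theta_0) = -\int \left[ p(x, \theta_0) - p(x, \theta_0)^2 \right] x x' \mu(x) dx. \tag{13.18} \]
Note that we arrive at the expected result: the information matrix equality holds (that is,
$\mathcal{H}_\infty(\theta_0) = -\mathcal{I}_\infty(\theta_0)$). With this,
\[ \sqrt{n} \left( \hat{\theta} - \theta_0 \right) \overset{d}{\to} N \left[ 0, \mathcal{H}_\infty(\theta_0)^{-1} \mathcal{I}_\infty(\theta_0) \mathcal{H}_\infty(\theta_0)^{-1} \right] \]
simplifies to
\[ \sqrt{n} \left( \hat{\theta} - \theta_0 \right) \overset{d}{\to} N \left[ 0, -\mathcal{H}_\infty(\theta_0)^{-1} \right] \]
which can also be expressed as
\[ \sqrt{n} \left( \hat{\theta} - \theta_0 \right) \overset{d}{\to} N \left[ 0, \mathcal{I}_\infty(\theta_0)^{-1} \right]. \]
On a final note, the logit and standard normal CDFs are very similar - the logit distribution is a
bit more fat-tailed. While coefficients will vary slightly between the two models, functions of interest
such as estimated probabilities $p(x, \hat{\theta})$ will be virtually identical for the two models.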
Estimation of the logit model
In this section we will consider maximum likelihood estimation of the logit model for binary 0/1
dependent variables. We will use the BFGS algorithm to find the MLE.
A binary response is a variable that takes on only two values, customarily 0 and 1, which can be
thought of as codes for whether or not a condition is satisfied. For example, 0=drive to work, 1=take
the bus. Often the observed binary variable, say $y$, is related to an unobserved (latent) continuous
variable, say $y^*$. We would like to know the effect of covariates, $x$, on $y$. The model can be represented
as
\[ y^* = g(x) - \epsilon \]
\[ y = 1(y^* > 0) \]
\[ \Pr(y = 1) = F_\epsilon [g(x)] \equiv p(x, \theta) \]
The log-likelihood function is
\[ s_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i \ln p(x_i, \theta) + (1 - y_i) \ln [1 - p(x_i, \theta)] \right) \]
For the logit model, the probability has the specific form
\[ p(x, \theta) = \frac{1}{1 + \exp(-x'\theta)} \]
You should download and examine LogitDGP.m, which generates data according to the logit
model, logit.m, which calculates the loglikelihood, and EstimateLogit.m, which sets things up and
calls the estimation routine, which uses the BFGS algorithm.
Here are some estimation results with $n = 100$, and the true $\theta = (0, 1)'$.
***********************************************
Trial of MLE estimation of Logit model
MLE Estimation Results
BFGS convergence: Normal convergence
Average Log-L: -0.607063
Observations: 100
            estimate   st. err    t-stat   p-value
constant      0.5400    0.2229    2.4224    0.0154
slope         0.7566    0.2374    3.1863    0.0014
Information Criteria
CAIC : 132.6230
BIC : 130.6230
AIC : 125.4127
***********************************************
The estimation program is calling mle_results(), which in turn calls a number of other routines.
Duration data and the Weibull model
In some cases the dependent variable may be the time that passes between the occurrence of two
events. For example, it may be the duration of a strike, or the time needed to find a job once one is
unemployed. Such variables take on values on the positive real line, and are referred to as duration
data.
A spell is the period of time between the occurrence of the initial event and the concluding event. For
example, the initial event could be the loss of a job, and the final event is the finding of a new job.
The spell is the period of unemployment.
Let $t_0$ be the time the initial event occurs, and $t_1$ be the time the concluding event occurs. For
simplicity, assume that time is measured in years. The random variable $D$ is the duration of the spell,
$D = t_1 - t_0$. Define the density function of $D$, $f_D(t)$, with distribution function $F_D(t) = \Pr(D < t)$.
Several questions may be of interest. For example, one might wish to know the expected time one
has to wait to find a job given that one has already waited $s$ years. The probability that a spell lasts
more than $s$ years is
\[ \Pr(D > s) = 1 - \Pr(D \leq s) = 1 - F_D(s). \]
The density of $D$ conditional on the spell being longer than $s$ years is
\[ f_D(t|D > s) = \frac{f_D(t)}{1 - F_D(s)}. \]
The expected additional time required for the spell to end given that it has already lasted $s$ years
is the expectation of $D$ with respect to this density, minus $s$.
\[ E = \mathcal{E}(D|D > s) - s = \left( \int_s^\infty z \frac{f_D(z)}{1 - F_D(s)} dz \right) - s \]
To estimate this function, one needs to specify the density $f_D(t)$ as a parametric density, then
estimate by maximum likelihood. There are a number of possibilities including the exponential density,
the lognormal, etc. A reasonably flexible model that is a generalization of the exponential density is
the Weibull density
\[ f_D(t|\theta) = e^{-(\lambda t)^\gamma} \lambda \gamma (\lambda t)^{\gamma - 1}. \]
According to this model, $\mathcal{E}(D) = \lambda^{-1} \Gamma(1 + 1/\gamma)$, where $\Gamma(\cdot)$ is the gamma
function. The log-likelihood is just the sum of the log densities.
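The expected additional duration can be evaluated numerically once the parameters are fixed. The
following Python sketch does this for the Weibull density, using the point estimates 0.559 and 0.867
that are reported below for the mongoose data. The integration limits, the number of steps and the use
of the midpoint rule are assumptions of the sketch, not a description of how the figures in the text were
produced.

```python
import math

lam, gam = 0.559, 0.867       # Weibull point estimates reported in the text

def density(t):
    """Weibull density exp(-(lam t)^gam) * lam * gam * (lam t)^(gam - 1)."""
    return math.exp(-(lam * t) ** gam) * lam * gam * (lam * t) ** (gam - 1.0)

def survival(t):
    """1 - F_D(t) = exp(-(lam t)^gam)."""
    return math.exp(-(lam * t) ** gam)

def remaining_life(s, upper=200.0, steps=200000):
    """E(D | D > s) - s, with the integral of z f(z) over [s, upper]
    approximated by the midpoint rule."""
    h = (upper - s) / steps
    total = 0.0
    for i in range(steps):
        z = s + (i + 0.5) * h
        total += z * density(z)
    return total * h / survival(s) - s

mean_closed_form = math.gamma(1.0 + 1.0 / gam) / lam   # E(D) from the gamma function
print(remaining_life(0.0), mean_closed_form, remaining_life(2.0))
```

At s = 0 the numerical value should agree with the closed-form mean, which is a convenient check on the
integration.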


To illustrate application of this model, 402 observations on the lifespan of dwarf mongooses (see
Figure 13.1) in Serengeti National Park (Tanzania) were used to fit a Weibull model. The spell in
this case is the lifetime of an individual mongoose. The parameter estimates and standard errors are
$\hat{\lambda} = 0.559$ (0.034) and $\hat{\gamma} = 0.867$ (0.033) and the log-likelihood value is -659.3. Figure 13.2 presents
fitted life expectancy (expected additional years of life) as a function of age, with 95% confidence
bands. The plot is accompanied by a nonparametric Kaplan-Meier estimate of life-expectancy. This
nonparametric estimator simply averages all spell lengths greater than age, and then subtracts age.
This is consistent by the LLN.
Figure 13.1: Dwarf mongooses
Figure 13.2: Life expectancy of mongooses, Weibull model
In the figure one can see that the model doesn't fit the data well, in that it predicts life expectancy
quite differently than does the nonparametric model. For ages 4-6, the nonparametric estimate is
outside the confidence interval that results from the parametric model, which casts doubt upon the
parametric model. Mongooses that are between 2-6 years old seem to have a lower life expectancy
than is predicted by the Weibull model, whereas young mongooses that survive beyond infancy have
a higher life expectancy, up to a bit beyond 2 years. Due to the dramatic change in the death rate as
a function of $t$, one might specify $f_D(t)$ as a mixture of two Weibull densities,
\[ f_D(t|\theta) = \delta \left( e^{-(\lambda_1 t)^{\gamma_1}} \lambda_1 \gamma_1 (\lambda_1 t)^{\gamma_1 - 1} \right) + (1 - \delta) \left( e^{-(\lambda_2 t)^{\gamma_2}} \lambda_2 \gamma_2 (\lambda_2 t)^{\gamma_2 - 1} \right). \]
The parameters $\gamma_i$ and $\lambda_i$, $i = 1, 2$ are the parameters of the two Weibull densities, and $\delta$ is the
parameter that mixes the two.
With the same data, $\theta$ can be estimated using the mixed model. The results are a log-likelihood =
-623.17. Note that a standard likelihood ratio test cannot be used to choose between the two models,
since under the null that $\delta = 1$ (single density), the two parameters $\lambda_2$ and $\gamma_2$ are not identified. It is
possible to take this into account, but this topic is out of the scope of this course. Nevertheless, the
improvement in the likelihood function is considerable. The parameter estimates are
Parameter     Estimate   St. Error
$\lambda_1$      0.233       0.016
$\gamma_1$       1.722       0.166
$\lambda_2$      1.731       0.101
$\gamma_2$       1.522       0.096
$\delta$         0.428       0.035
Note that the mixture parameter is highly significant. This model leads to the fit in Figure 13.3. Note
that the parametric and nonparametric fits are quite close to one another, up to around 6 years. The
disagreement after this point is not too important, since less than 5% of mongooses live more than 6
years, which implies that the Kaplan-Meier nonparametric estimate has a high variance (since it's an
average of a small number of observations).
Figure 13.3: Life expectancy of mongooses, mixed Weibull model
Mixture models are often an effective way to model complex responses, though they can suffer from
overparameterization. Alternatives will be discussed later.
For examples of MLE using logit and Poisson model applied to data, see Section ?? in the chapter
on Numerical Optimization. You should examine the scripts and run them to see how MLE is actually
done.
Estimation of a simple DSGE model
Dynamic stochastic general equilibrium models are widely used tools in macroeconomics. These are
models in which current decisions depend upon expectations of future events. An example is the simple
real business cycle model discussed in the file rbc.pdf, by Fernández-Villaverde, which is available on
the Dynare web page www.dynare.org. The file EstimateRBC_ML.mod shows how this model may
be estimated, using maximum likelihood methods. The estimation process involves forming a linear
approximation to the true model, which means that the estimator is not actually the true maximum
likelihood estimator; it is actually a "quasi-ML" estimator. The quasi-likelihood is computed by
putting the linearized model in state-space form, and then computing the likelihood iteratively using
Kalman filtering, which relies on the assumption that shocks to the model are normally distributed.
State space models and Kalman filtering are introduced in Section 15.4. Once the likelihood function
is available, the methods studied in this Chapter may be applied. The intention at the moment is
simply to show that ML is an estimation method that may be applied to complicated and more or less
realistic economic models. If you play around with the estimation program, you will see that difficulties
are encountered with estimating certain parameters. This may be due to an excessive information loss
due to the linearization.
13.9 Exercises
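1. Consider coin tossing with a single possibly biased coin. The density function for the random variable $y = 1(\text{heads})$ is
\[
f_Y(y, p_0) = \begin{cases} p_0^{y} (1 - p_0)^{1-y}, & y \in \{0, 1\} \\ 0, & y \notin \{0, 1\} \end{cases}
\]
Suppose that we have a sample of size $n$. We know from above that the ML estimator is $\hat{p}_0 = \bar{y}$. We also know from the theory above that
\[
\sqrt{n}\left(\bar{y} - p_0\right) \overset{a}{\sim} N\left[0, \mathcal{J}_\infty(p_0)^{-1} \mathcal{I}_\infty(p_0) \mathcal{J}_\infty(p_0)^{-1}\right]
\]
where $\mathcal{J}_\infty$ denotes the limiting Hessian and $\mathcal{I}_\infty$ the limiting variance of the scaled score.
a) find the analytic expression for $g_t(\theta)$ and show that $\mathcal{E}_\theta\left[g_t(\theta)\right] = 0$
b) find the analytical expressions for $\mathcal{J}_\infty(p_0)$ and $\mathcal{I}_\infty(p_0)$ for this problem
c) verify that the result for $\lim Var \sqrt{n}\left(\hat{p} - p\right)$ found in section 13.5 is equal to $\mathcal{J}_\infty(p_0)^{-1} \mathcal{I}_\infty(p_0) \mathcal{J}_\infty(p_0)^{-1}$
d) Write an Octave program that does a Monte Carlo study that shows that $\sqrt{n}\left(\bar{y} - p_0\right)$ is approximately normally distributed when $n$ is large. Please give me histograms that show the sampling frequency of $\sqrt{n}\left(\bar{y} - p_0\right)$ for several values of $n$.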
2. The exponential density is
\[
f_X(x) = \begin{cases} \lambda e^{-\lambda x}, & x \geq 0 \\ 0, & x < 0 \end{cases}
\]
Suppose we have an independently and identically distributed sample of size $n$, $\{x_i\}$, $i = 1, 2, ..., n$, where each $x_i$ follows this exponential distribution.
(a) write the log likelihood function
(b) compute the maximum likelihood estimator of the parameter $\lambda$.
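3. Suppose we have an i.i.d. sample of size $n$ from the Poisson density. The Poisson density is $f_y(y; \lambda) = \frac{e^{-\lambda} \lambda^{y}}{y!}$. Verify that the ML estimator is asymptotically distributed as $\sqrt{n}\left(\hat{\lambda} - \lambda_0\right) \overset{d}{\to} N(0, \lambda_0)$, where $\lambda_0$ is the true parameter value. Hint: compute the asymptotic variance using $-\mathcal{J}_\infty(\lambda_0)^{-1}$, where $\mathcal{J}_\infty$ is the limiting Hessian of the average log-likelihood.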
4. Consider the model $y_t = x_t'\beta + \alpha\epsilon_t$ where the errors follow the Cauchy (Student-t with 1 degree of freedom) density. So
\[
f(\epsilon_t) = \frac{1}{\pi\left(1 + \epsilon_t^2\right)}, \quad -\infty < \epsilon_t < \infty
\]
The Cauchy density has a shape similar to a normal density, but with much thicker tails. Thus, extremely small and large errors occur much more frequently with this density than would happen if the errors were normally distributed. Find the score function $g_n(\theta)$ where $\theta = \left(\beta' \;\; \alpha\right)'$.
5. Consider the classical linear regression model $y_t = x_t'\beta + \epsilon_t$ where $\epsilon_t \sim IIN(0, \sigma^2)$. Find the score function $g_n(\theta)$ where $\theta = \left(\beta' \;\; \sigma\right)'$.
6. Compare the first order conditions that define the ML estimators of problems 4 and 5 and interpret the differences. Why are the first order conditions that define an efficient estimator different in the two cases?
7. Assume a d.g.p. follows the logit model: $\Pr(y = 1|x) = \left(1 + \exp(-\beta_0 x)\right)^{-1}$.
(a) Assume that $x \sim$ uniform(-a,a). Find the asymptotic distribution of the ML estimator of $\beta_0$ (this is a scalar parameter).
(b) Now assume that $x \sim$ uniform(-2a,2a). Again find the asymptotic distribution of the ML estimator of $\beta_0$.
(c) Comment on the results
8. There is an ML estimation routine in the provided software that accompanies these notes. Edit
(to see what it does) then run the script mle_example.m. Interpret the output.
9. Estimate the simple Nerlove model discussed in section 3.8 by ML, assuming that the errors are i.i.d. $N(0, \sigma^2)$, and compare to the results you get from running Nerlove.m.
10. Using logit.m and EstimateLogit.m as templates, write a function to calculate the probit log
likelihood, and a script to estimate a probit model. Run it using data that actually follows a
logit model (you can generate it in the same way that is done in the logit example).
11. Study mle_results.m to see what it does. Examine the functions that mle_results.m calls,
and in turn the functions that those functions call. Write a complete description of how the
whole chain works.
12. In Subsection 11.4 a model is presented for data on health care usage, along with some Octave
scripts. Look at the Poisson estimation results for the OBDV measure of health care use and
give an economic interpretation. Estimate Poisson models for the other 5 measures of health
care usage, using the provided scripts.
Chapter 14
Generalized method of moments
Readings: Hamilton Ch. 14; Davidson and MacKinnon, Ch. 17 (see pg. 587 for refs. to applications); Newey and McFadden (1994), "Large Sample Estimation and Hypothesis Testing", in Handbook of Econometrics, Vol. 4, Ch. 36.
14.1 Motivation
The principle of the method of moments is to set the population moment equal to the sample moment, then invert to solve for the estimator of the parameter. This is illustrated in Figure 14.1. The sample moment $\hat{\mu}$ will converge to the true moment $\mu(\theta_0)$, so the estimator will converge to the true parameter value. We need the moment function to be invertible.
Figure 14.1: Method of Moments
Sampling from $\chi^2(\theta_0)$
Example 38. (Method of moments, v1) Suppose we draw a random sample of $y_t$ from the $\chi^2(\theta_0)$ distribution. Here, $\theta_0$ is the parameter of interest. The first moment (expectation), $\mu_1$, of a random variable will in general be a function of the parameters of the distribution: $\mu_1 = \mu_1(\theta_0)$.
In this example, if $Y \sim \chi^2(\theta_0)$, then $E(Y) = \theta_0$, so the relationship is the identity function $\mu_1(\theta_0) = \theta_0$, though in general the relationship may be more complicated. The sample first moment is
\[
\hat{\mu}_1 = \sum_{t=1}^{n} y_t / n.
\]
Define a moment condition as
\[
m_1(\theta) = \mu_1(\theta) - \hat{\mu}_1
\]
where the sample moment $\hat{\mu}_1 = \bar{y} = \sum_{t=1}^{n} y_t / n$.
The method of moments principle is to choose the estimator of the parameter to set the estimate of the population moment equal to the sample moment, or equivalently to make the moment condition equal to zero: $m_1(\hat{\theta}) \equiv 0$. Then the equation is solved for the estimator. In this case,
\[
m_1(\hat{\theta}) = \hat{\theta} - \sum_{t=1}^{n} y_t / n = 0
\]
is solved by $\hat{\theta} = \bar{y}$. Since $\bar{y} = \sum_{t=1}^{n} y_t / n \overset{p}{\to} \theta_0$ by the LLN, the estimator is consistent.
Example 39. (Method of moments, v2) The variance of a $\chi^2(\theta_0)$ r.v. is
\[
V(y_t) = E\left(y_t - \theta_0\right)^2 = 2\theta_0.
\]
The sample variance is $\hat{V}(y_t) = \frac{\sum_{t=1}^{n} (y_t - \bar{y})^2}{n}$. Define an alternative moment condition as the population moment minus the sample moment:
\[
m_2(\theta) = V(y_t) - \hat{V}(y_t) = 2\theta - \frac{\sum_{t=1}^{n} (y_t - \bar{y})^2}{n}
\]
We can see that the average moment condition is the average of the contributions
\[
m_{2t}(\theta) = V(y_t) - (y_t - \bar{y})^2
\]
The MM estimator using the variance would set
\[
m_2(\hat{\theta}) = 2\hat{\theta} - \frac{\sum_{t=1}^{n} (y_t - \bar{y})^2}{n} \equiv 0.
\]
Again, by the LLN, the sample variance is consistent for the true variance, that is,
\[
\frac{\sum_{t=1}^{n} (y_t - \bar{y})^2}{n} \overset{p}{\to} 2\theta_0.
\]
So, the estimator is half the sample variance:
\[
\hat{\theta} = \frac{1}{2} \frac{\sum_{t=1}^{n} (y_t - \bar{y})^2}{n}.
\]
This estimator is also consistent for $\theta_0$.
Example 40. Try some MM estimation yourself: here's an Octave script that implements the two MM estimators discussed above: GMM/chi2mm.m
Note that when you run the script, the two estimators give different results. Each of the two estimators is consistent.
With two moment-parameter equations and only one parameter, we have overidentification, which means that we have more information than is strictly necessary for consistent estimation of the parameter.
The idea behind GMM is to combine information from the two moment-parameter equations to form a new estimator which will be more efficient, in general (proof of this below).
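The two method of moments estimators for chi-square data can be sketched in a few lines. The following is an independent Python illustration, not the GMM/chi2mm.m script from the notes: the function names, the seed, the sample size and the true value 3 are all assumptions chosen for the example. It relies on the fact that a chi-square draw with $\theta_0$ degrees of freedom is a Gamma draw with shape $\theta_0/2$ and scale 2.

```python
import random


def chi2_sample(theta0, n, rng):
    # chi-square(theta0) is Gamma(shape=theta0/2, scale=2)
    return [rng.gammavariate(theta0 / 2.0, 2.0) for _ in range(n)]


def mm_from_mean(y):
    # Solves theta - ybar = 0 (first moment condition)
    return sum(y) / len(y)


def mm_from_variance(y):
    # Solves 2*theta - (1/n) * sum (y_t - ybar)^2 = 0 (second moment condition)
    ybar = sum(y) / len(y)
    return 0.5 * sum((yt - ybar) ** 2 for yt in y) / len(y)


rng = random.Random(42)          # hypothetical seed, for reproducibility only
y = chi2_sample(3.0, 20000, rng)  # hypothetical true value theta0 = 3
theta_mean = mm_from_mean(y)
theta_var = mm_from_variance(y)
print("MM estimate from the mean:    ", theta_mean)
print("MM estimate from the variance:", theta_var)
```

Both numbers should be close to the true value but not equal to each other, which is the overidentification that GMM exploits.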
Sampling from $t(\theta_0)$
Here's another example based upon the t-distribution. The density function of a t-distributed r.v. $Y_t$ is
\[
f_{Y_t}(y_t, \theta_0) = \frac{\Gamma\left[(\theta_0 + 1)/2\right]}{(\pi\theta_0)^{1/2}\, \Gamma(\theta_0/2)} \left[1 + \left(y_t^2/\theta_0\right)\right]^{-(\theta_0+1)/2}
\]
Given an iid sample of size $n$, one could estimate $\theta_0$ by maximizing the log-likelihood function
\[
\hat{\theta} \equiv \arg\max_{\Theta} \ln L_n(\theta) = \sum_{t=1}^{n} \ln f_{Y_t}(y_t, \theta)
\]
This approach is attractive since ML estimators are asymptotically efficient. This is because the ML estimator uses all of the available information (e.g., the distribution is fully specified up to a parameter). Recalling that a distribution is completely characterized by its moments, the ML estimator is interpretable as a GMM estimator that uses all of the moments. The method of moments estimator uses only $K$ moments to estimate a $K$-dimensional parameter. Since information is discarded, in general, by the MM estimator, efficiency is lost relative to the ML estimator.
Example 41. (Method of moments). A t-distributed r.v. with density $f_{Y_t}(y_t, \theta_0)$ has mean zero and variance $V(y_t) = \theta_0 / (\theta_0 - 2)$ (for $\theta_0 > 2$).
We can define a moment condition as the difference between the theoretical variance and the sample variance: $m_1(\theta) = \theta / (\theta - 2) - \frac{1}{n} \sum_{t=1}^{n} y_t^2$. When evaluated at the true parameter value $\theta_0$, we have $\mathcal{E}_{\theta_0}\left[m_1(\theta_0)\right] = 0$.
Choosing $\hat{\theta}$ to set $m_1(\hat{\theta}) \equiv 0$ yields a MM estimator:
\[
\hat{\theta} = \frac{2}{1 - \frac{n}{\sum_i y_i^2}} \qquad (14.1)
\]
This estimator is based on only one moment of the distribution - it uses less information than the ML estimator, so it is intuitively clear that the MM estimator will be inefficient relative to the ML estimator.
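The algebra that takes the variance-based moment condition to the estimator in equation 14.1 is short, and can be written out as follows. The symbol $s$ for the sample second moment is introduced here only as shorthand; it is not notation from the notes.

```latex
\begin{align*}
s &\equiv \frac{1}{n}\sum_{t=1}^{n} y_t^2, \qquad
m_1(\hat\theta) = \frac{\hat\theta}{\hat\theta - 2} - s = 0 \\
\hat\theta &= s\,(\hat\theta - 2)
  \;\Longrightarrow\; \hat\theta\,(s - 1) = 2s \\
\hat\theta &= \frac{2s}{s - 1} = \frac{2}{1 - 1/s}
  = \frac{2}{1 - \dfrac{n}{\sum_{t} y_t^2}}
\end{align*}
```

The solution only makes sense as an estimate of the degrees of freedom when $s > 1$, which mirrors the requirement $\theta_0 > 2$ for the variance to exist.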
Example 42. (Method of moments). An alternative MM estimator could be based upon the fourth moment of the t-distribution. The fourth moment of a t-distributed r.v. is
\[
\mu_4 \equiv E(y_t^4) = \frac{3\,\theta_0^2}{(\theta_0 - 2)(\theta_0 - 4)},
\]
provided that $\theta_0 > 4$. We can define a second moment condition
\[
m_2(\theta) = \frac{3\,\theta^2}{(\theta - 2)(\theta - 4)} - \frac{1}{n} \sum_{t=1}^{n} y_t^4
\]
A second, different MM estimator chooses $\hat{\theta}$ to set $m_2(\hat{\theta}) \equiv 0$. If you solve this you'll see that the estimate is different from that in equation 14.1.
This estimator isn't efficient either, since it uses only one moment. A GMM estimator would use the two moment conditions together to estimate the single parameter. The GMM estimator is overidentified, which leads to an estimator which is efficient relative to the just identified MM estimators (more on efficiency later).
14.2 Definition of GMM estimator
For the purposes of this course, the following definition of the GMM estimator is sufficiently general:
Definition 43. The GMM estimator of the $K$-dimensional parameter vector $\theta_0$ is $\hat{\theta} \equiv \arg\min_{\Theta} s_n(\theta) \equiv m_n(\theta)' W_n m_n(\theta)$, where $m_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} m_t(\theta)$ is a $g$-vector, $g \geq K$, with $\mathcal{E}_\theta m(\theta) = 0$, and $W_n$ converges almost surely to a finite $g \times g$ symmetric positive definite matrix $W_\infty$.
What's the reason for using GMM if MLE is asymptotically efficient?
Robustness: GMM is based upon a limited set of moment conditions. For consistency, only these moment conditions need to be correctly specified, whereas MLE in effect requires correct specification of every conceivable moment condition. GMM is robust with respect to distributional misspecification. The price for robustness is usually a loss of efficiency with respect to the MLE estimator. Keep in mind that the true distribution is not known, so if we erroneously specify a distribution and estimate by MLE, the estimator will be inconsistent in general (not always).
Feasibility: in some cases the MLE estimator is not available, because we are not able to deduce or compute the likelihood function. More on this in the section on simulation-based estimation. The GMM estimator may still be feasible even though MLE is not available.
Example 44. The Octave script GMM/chi2gmm.m implements GMM using the same $\chi^2$ data as was used in Example 40, above. The two moment conditions, based on the sample mean and sample variance, are combined. The weight matrix is an identity matrix, $I_2$. In Octave, type help gmm_estimate to get more information on how the GMM estimation routine works.
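To make the definition concrete, here is a minimal Python sketch of GMM with an identity weight matrix for chi-square data, combining the mean-based and variance-based moment conditions. It is not the notes' chi2gmm.m or gmm_estimate routine; the golden-section minimizer, the function names, the seed and the search interval are assumptions made for illustration. Because both moments are linear in $\theta$, the minimizer can be checked against the closed form $(\bar{y} + 2\hat{V})/5$.

```python
import random


def sample_moments(y):
    n = len(y)
    ybar = sum(y) / n
    v = sum((yt - ybar) ** 2 for yt in y) / n
    return ybar, v


def objective(theta, ybar, v, W):
    # s_n(theta) = m_n(theta)' W m_n(theta) with m_n = (theta - ybar, 2*theta - v)'
    m1 = theta - ybar
    m2 = 2.0 * theta - v
    return W[0][0] * m1 * m1 + (W[0][1] + W[1][0]) * m1 * m2 + W[1][1] * m2 * m2


def golden_section_min(f, lo, hi, tol=1e-8):
    # Derivative-free minimizer for a unimodal function on [lo, hi]
    ratio = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2


rng = random.Random(0)  # hypothetical seed
y = [rng.gammavariate(1.5, 2.0) for _ in range(5000)]  # chi-square(3) draws
ybar, v = sample_moments(y)
identity = [[1.0, 0.0], [0.0, 1.0]]
theta_gmm = golden_section_min(lambda th: objective(th, ybar, v, identity), 0.01, 50.0)
theta_closed = (ybar + 2.0 * v) / 5.0  # from the first order condition with W = I
print(theta_gmm, theta_closed)
```

With an identity weight matrix the variance-based moment gets four times the weight of the mean-based one in the first order condition, simply because its derivative with respect to $\theta$ is 2 rather than 1.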
14.3 Consistency
We simply assume that the assumptions of Theorem 29 hold, so the GMM estimator is strongly consistent. The only assumption that warrants additional comments is that of identification. In Theorem 29, the third assumption reads: (c) Identification: $s_\infty(\theta)$ has a unique global maximum at $\theta_0$, i.e., $s_\infty(\theta_0) > s_\infty(\theta)$, $\forall \theta \neq \theta_0$. (GMM minimizes its objective function, so here the condition is applied with a unique global minimum in place of the maximum.) Taking the case of a quadratic objective function $s_n(\theta) = m_n(\theta)' W_n m_n(\theta)$, first consider $m_n(\theta)$.
Applying a uniform law of large numbers, we get $m_n(\theta) \overset{a.s.}{\to} m_\infty(\theta)$.
Since $\mathcal{E}_{\theta_0} m_n(\theta_0) = 0$ by assumption, $m_\infty(\theta_0) = 0$.
Since $s_\infty(\theta_0) = m_\infty(\theta_0)' W_\infty m_\infty(\theta_0) = 0$, in order for asymptotic identification, we need that $m_\infty(\theta) \neq 0$ for $\theta \neq \theta_0$, for at least some element of the vector. This and the assumption that $W_n \overset{a.s.}{\to} W_\infty$, a finite positive definite $g \times g$ matrix, guarantee that $\theta_0$ is asymptotically identified.
Note that asymptotic identification does not rule out the possibility of lack of identification for a given data set - there may be multiple minimizing solutions in finite samples.
Example 45. Increase n in the Octave script GMM/chi2gmm.m to see evidence of the consistency of the GMM estimator.
14.4 Asymptotic normality
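We also simply assume that the conditions of Theorem 31 hold, so we will have asymptotic normality. However, we do need to find the structure of the asymptotic variance-covariance matrix of the estimator. From Theorem 31, we have
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \overset{d}{\to} N\left[0, \mathcal{J}_\infty(\theta_0)^{-1} \mathcal{I}_\infty(\theta_0) \mathcal{J}_\infty(\theta_0)^{-1}\right]
\]
where $\mathcal{J}_\infty(\theta_0)$ is the almost sure limit of $\frac{\partial^2}{\partial\theta \partial\theta'} s_n(\theta)$ when evaluated at $\theta_0$ and
\[
\mathcal{I}_\infty(\theta_0) = \lim_{n \to \infty} Var \sqrt{n} \frac{\partial}{\partial\theta} s_n(\theta_0).
\]
We need to determine the form of these matrices given the objective function $s_n(\theta) = m_n(\theta)' W_n m_n(\theta)$.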
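Now using the product rule from the introduction,
\[
\frac{\partial}{\partial\theta} s_n(\theta) = 2 \left[\frac{\partial}{\partial\theta} m_n'(\theta)\right] W_n m_n(\theta)
\]
(this is analogous to $\frac{\partial}{\partial\beta} \beta' X' X \beta = 2 X' X \beta$ which appears when computing the first order conditions for the OLS estimator)
Define the $K \times g$ matrix
\[
D_n(\theta) \equiv \frac{\partial}{\partial\theta} m_n'(\theta),
\]
so:
\[
\frac{\partial}{\partial\theta} s(\theta) = 2 D(\theta) W m(\theta). \qquad (14.2)
\]
(Note that $s_n(\theta)$, $D_n(\theta)$, $W_n$ and $m_n(\theta)$ all depend on the sample size $n$, but it is omitted to unclutter the notation).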
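To take second derivatives, let $D_i$ be the $i$-th row of $D(\theta)$. Using the product rule,
\[
\frac{\partial^2}{\partial\theta' \partial\theta_i} s(\theta) = \frac{\partial}{\partial\theta'} 2 D_i(\theta) W m(\theta) = 2 D_i W D' + 2 m' W \left[\frac{\partial}{\partial\theta'} D_i'\right]
\]
When evaluating the term
\[
2 m(\theta)' W \left[\frac{\partial}{\partial\theta'} D(\theta)_i'\right]
\]
at $\theta_0$, assume that $\frac{\partial}{\partial\theta'} D(\theta)_i'$ satisfies a LLN, so that it converges almost surely to a finite limit. In this case, we have
\[
2 m(\theta_0)' W \left[\frac{\partial}{\partial\theta'} D(\theta_0)_i'\right] \overset{a.s.}{\to} 0,
\]
since $m(\theta_0) = o_p(1)$ and $W \overset{a.s.}{\to} W_\infty$.
Stacking these results over the $K$ rows of $D$, we get
\[
\lim \frac{\partial^2}{\partial\theta \partial\theta'} s_n(\theta_0) = \mathcal{J}_\infty(\theta_0) = 2 D_\infty W_\infty D_\infty', \; a.s.,
\]
where we define $\lim D = D_\infty$, a.s., and $\lim W = W_\infty$, a.s. (we assume a LLN holds).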
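With regard to $\mathcal{I}_\infty(\theta_0)$, the limiting variance of the scaled score, following equation 14.2, and noting that the scores have mean zero at $\theta_0$ (since $\mathcal{E} m(\theta_0) = 0$ by assumption), we have
\[
\begin{aligned}
\mathcal{I}_\infty(\theta_0) &= \lim_{n \to \infty} Var \sqrt{n} \frac{\partial}{\partial\theta} s_n(\theta_0) \\
&= \lim_{n \to \infty} \mathcal{E}\, 4 n D W m(\theta_0) m(\theta_0)' W D' \\
&= \lim_{n \to \infty} \mathcal{E}\, 4 D W \left\{\sqrt{n} m(\theta_0)\right\} \left\{\sqrt{n} m(\theta_0)'\right\} W D'
\end{aligned}
\]
Now, given that $m(\theta_0)$ is an average of centered (mean-zero) quantities, it is reasonable to expect a CLT to apply, after multiplication by $\sqrt{n}$. Assuming this,
\[
\sqrt{n} m(\theta_0) \overset{d}{\to} N(0, \Omega_\infty),
\]
where
\[
\Omega_\infty = \lim_{n \to \infty} \mathcal{E}\left[n\, m(\theta_0) m(\theta_0)'\right].
\]
Using this, and the last equation, we get
\[
\mathcal{I}_\infty(\theta_0) = 4 D_\infty W_\infty \Omega_\infty W_\infty D_\infty'
\]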
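Using these results, the asymptotic normality theorem (31) gives us
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \overset{d}{\to} N\left[0, \left(D_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty \Omega_\infty W_\infty D_\infty' \left(D_\infty W_\infty D_\infty'\right)^{-1}\right],
\]
the asymptotic distribution of the GMM estimator for arbitrary weighting matrix $W_n$. Note that for $\mathcal{J}_\infty$ (the limiting Hessian, $2 D_\infty W_\infty D_\infty'$) to be positive definite, $D_\infty$ must have full row rank, $\rho(D_\infty) = K$. This is related to identification. If the rows of $m_n(\theta)$ were not linearly independent of one another, then neither $D_n$ nor $D_\infty$ would have full row rank. Identification plus two times differentiability of the objective function lead to $\mathcal{J}_\infty$ being positive definite.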
Example 46. The Octave script GMM/AsymptoticNormalityGMM.m does a Monte Carlo of the GMM estimator for the $\chi^2$ data. Histograms for 1000 replications of $\sqrt{n}\left(\hat{\theta} - \theta_0\right)$ are given in Figure 14.2. On the left are results for n = 100, on the right are results for n = 1000. Note that the two distributions are fairly similar. In both cases the distribution is approximately normal. The distribution for the small sample size is somewhat asymmetric. This has mostly disappeared for the larger sample size.
14.5 Choosing the weighting matrix
$W$ is a weighting matrix, which determines the relative importance of violations of the individual moment conditions. For example, if we are much more sure of the first moment condition, which is based upon the variance, than of the second, which is based upon the fourth moment, we could set
\[
W = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}
\]
with $a$ much larger than $b$. In this case, errors in the second moment condition have less weight in the objective function.
Figure 14.2: Asymptotic Normality of GMM estimator, $\chi^2$ example. (a) n = 10 (b) n = 1000
Since moments are not independent, in general, we should expect that there be a correlation between the moment conditions, so it may not be desirable to set the off-diagonal elements to 0. $W$ may be a random, data dependent matrix.
We have already seen that the choice of $W$ will influence the asymptotic distribution of the GMM estimator. Since the GMM estimator is already inefficient w.r.t. MLE, we might like to choose the $W$ matrix to make the GMM estimator efficient within the class of GMM estimators defined by $m_n(\theta)$.
To provide a little intuition, consider the linear model $y = x'\beta + \epsilon$, where $\epsilon \sim N(0, \Omega)$. That is, we have heteroscedasticity and autocorrelation.
Let $P$ be the Cholesky factorization of $\Omega^{-1}$, e.g., $P'P = \Omega^{-1}$.
Then the model $Py = PX\beta + P\epsilon$ satisfies the classical assumptions of homoscedasticity and nonautocorrelation, since $V(P\epsilon) = P V(\epsilon) P' = P \Omega P' = P (P'P)^{-1} P' = P P^{-1} (P')^{-1} P' = I_n$. (Note: we use $(AB)^{-1} = B^{-1} A^{-1}$ for $A$, $B$ both nonsingular). This means that OLS applied to the transformed model is efficient.
The OLS estimator of the model $Py = PX\beta + P\epsilon$ minimizes the objective function $(y - X\beta)' \Omega^{-1} (y - X\beta)$. Interpreting $(y - X\beta) = \epsilon(\beta)$ as moment conditions (note that they do have zero expectation when evaluated at $\beta_0$), the optimal weighting matrix is seen to be the inverse of the covariance matrix of the moment conditions. This result carries over to GMM estimation. (Note: this presentation of GLS is not a GMM estimator, because the number of moment conditions here is equal to the sample size, $n$. Later we'll see that GLS can be put into the GMM framework defined above).
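Theorem 47. If $\hat{\theta}$ is a GMM estimator that minimizes $m_n(\theta)' W_n m_n(\theta)$, the asymptotic variance of $\hat{\theta}$ will be minimized by choosing $W_n$ so that $W_n \overset{a.s.}{\to} W_\infty = \Omega_\infty^{-1}$, where $\Omega_\infty = \lim_{n \to \infty} \mathcal{E}\left[n\, m(\theta_0) m(\theta_0)'\right]$.
Proof: For $W_\infty = \Omega_\infty^{-1}$, the asymptotic variance
\[
\left(D_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty \Omega_\infty W_\infty D_\infty' \left(D_\infty W_\infty D_\infty'\right)^{-1}
\]
simplifies to $\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}$. Now, let $A$ be the difference between the general form and the simplified form:
\[
A = \left(D_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty \Omega_\infty W_\infty D_\infty' \left(D_\infty W_\infty D_\infty'\right)^{-1} - \left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}
\]
Set $B = \left(D_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty - \left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1} D_\infty \Omega_\infty^{-1}$. You can show that $A = B \Omega_\infty B'$. This is a quadratic form in a p.d. matrix, so it is p.s.d., which concludes the proof.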
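The result
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \overset{d}{\to} N\left[0, \left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}\right] \qquad (14.3)
\]
allows us to treat
\[
\hat{\theta} \approx N\left(\theta_0, \frac{\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}}{n}\right),
\]
where the $\approx$ means "approximately distributed as." To operationalize this we need estimators of $D_\infty$ and $\Omega_\infty$.
The obvious estimator of $D_\infty$ is simply $\frac{\partial}{\partial\theta} m_n'\left(\hat{\theta}\right)$, which is consistent by the consistency of $\hat{\theta}$, assuming that $\frac{\partial}{\partial\theta} m_n'$ is continuous in $\theta$. Stochastic equicontinuity results can give us this result even if $\frac{\partial}{\partial\theta} m_n'$ is not continuous.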
Example 48. To see the effect of using an efficient weight matrix, consider the Octave script GMM/EfficientGMM.m. This modifies the previous Monte Carlo for the $\chi^2$ data. This new Monte Carlo computes the GMM estimator in two ways:
1) based on an identity weight matrix
2) using an estimated optimal weight matrix. The estimated efficient weight matrix is computed as the inverse of the estimated covariance of the moment conditions, using the inefficient estimator of the first step. See the next section for more on how to do this.
Figure 14.3 shows the results, plotting histograms for 1000 replications of $\sqrt{n}\left(\hat{\theta} - \theta_0\right)$. Note that the use of the estimated efficient weight matrix leads to much better results in this case. This is a simple case where it is possible to get a good estimate of the efficient weight matrix. This is not always so. See the next section.
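The comparison between identity-weighted and efficiently weighted GMM can be sketched as a small Monte Carlo in Python. This is an independent illustration under stated assumptions, not the notes' EfficientGMM.m: the data are iid chi-square(3), the covariance of the moment contributions is estimated without any autocovariance terms (reasonable only because the draws are iid), and all function names, the seed, the sample size and the number of replications are hypothetical. Both moments are linear in $\theta$, so each GMM step has a closed form.

```python
import random

A = (1.0, 2.0)  # derivatives of the population moments (theta, 2*theta) w.r.t. theta


def gmm_linear(b, W):
    # Minimizer of (A*theta - b)' W (A*theta - b) for a symmetric 2x2 matrix W
    WA = (W[0][0] * A[0] + W[0][1] * A[1], W[1][0] * A[0] + W[1][1] * A[1])
    return (WA[0] * b[0] + WA[1] * b[1]) / (WA[0] * A[0] + WA[1] * A[1])


def omega_hat(theta, y, ybar):
    # (1/n) * sum of m_t m_t' with m_t = (theta - y_t, 2*theta - (y_t - ybar)^2)'
    n = len(y)
    s11 = s12 = s22 = 0.0
    for yt in y:
        m1 = theta - yt
        m2 = 2.0 * theta - (yt - ybar) ** 2
        s11 += m1 * m1
        s12 += m1 * m2
        s22 += m2 * m2
    return [[s11 / n, s12 / n], [s12 / n, s22 / n]]


def inverse_2x2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]


def two_step_gmm(y):
    n = len(y)
    ybar = sum(y) / n
    v = sum((yt - ybar) ** 2 for yt in y) / n
    b = (ybar, v)
    theta1 = gmm_linear(b, [[1.0, 0.0], [0.0, 1.0]])   # step 1: identity weights
    W = inverse_2x2(omega_hat(theta1, y, ybar))         # estimated efficient weights
    theta2 = gmm_linear(b, W)                           # step 2: re-estimate
    return theta1, theta2


def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


rng = random.Random(1)  # hypothetical seed
theta0, n, reps = 3.0, 500, 200
first, second = [], []
for _ in range(reps):
    y = [rng.gammavariate(theta0 / 2.0, 2.0) for _ in range(n)]
    t1, t2 = two_step_gmm(y)
    first.append(n ** 0.5 * (t1 - theta0))
    second.append(n ** 0.5 * (t2 - theta0))
var_identity = variance(first)
var_efficient = variance(second)
print("variance of sqrt(n)(theta_hat - theta0), identity weights: ", var_identity)
print("variance of sqrt(n)(theta_hat - theta0), efficient weights:", var_efficient)
```

The second variance should come out well below the first, in line with the histograms described in the example.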
14.6 Estimation of the variance-covariance matrix
(See Hamilton Ch. 10, pp. 261-2 and 280-84.)
Figure 14.3: Inefficient and Efficient GMM estimators, $\chi^2$ data. (a) inefficient (b) efficient
In the case that we wish to use the optimal weighting matrix, we need an estimate of $\Omega_\infty$, the limiting variance-covariance matrix of $\sqrt{n} m_n(\theta_0)$. While one could estimate $\Omega_\infty$ parametrically, we in general have little information upon which to base a parametric specification. In general, we expect that:
$m_t$ will be autocorrelated ($\Gamma_{ts} = \mathcal{E}(m_t m_{t-s}') \neq 0$). Note that this autocovariance will not depend on $t$ if the moment conditions are covariance stationary.
contemporaneously correlated, since the individual moment conditions will not in general be independent of one another ($\mathcal{E}(m_{it} m_{jt}) \neq 0$).
and have different variances ($\mathcal{E}(m_{it}^2) = \sigma_{it}^2$).
Since we need to estimate so many components if we are to take the parametric approach, it is unlikely that we would arrive at a correct parametric specification. For this reason, research has focused on consistent nonparametric estimators of $\Omega_\infty$.
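Henceforth we assume that $m_t$ is covariance stationary (the covariance between $m_t$ and $m_{t-s}$ does not depend on $t$). Define the $v$-th autocovariance of the moment conditions $\Gamma_v = \mathcal{E}(m_t m_{t-v}')$. Note that $\mathcal{E}(m_t m_{t+v}') = \Gamma_v'$. Recall that $m_t$ and $m$ are functions of $\theta$, so for now assume that we have some consistent estimator of $\theta_0$, so that $\hat{m}_t = m_t(\hat{\theta})$. Now
\[
\begin{aligned}
\Omega_n &= \mathcal{E}\left[n\, m(\theta_0) m(\theta_0)'\right] = \mathcal{E}\left[n \left(\frac{1}{n} \sum_{t=1}^{n} m_t\right) \left(\frac{1}{n} \sum_{t=1}^{n} m_t'\right)\right] \\
&= \mathcal{E}\left[\frac{1}{n} \left(\sum_{t=1}^{n} m_t\right) \left(\sum_{t=1}^{n} m_t'\right)\right] \\
&= \Gamma_0 + \frac{n-1}{n}\left(\Gamma_1 + \Gamma_1'\right) + \frac{n-2}{n}\left(\Gamma_2 + \Gamma_2'\right) + \cdots + \frac{1}{n}\left(\Gamma_{n-1} + \Gamma_{n-1}'\right)
\end{aligned}
\]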
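A natural, consistent estimator of $\Gamma_v$ is
\[
\hat{\Gamma}_v = \frac{1}{n} \sum_{t=v+1}^{n} \hat{m}_t \hat{m}_{t-v}'.
\]
(you might use $n - v$ in the denominator instead). So, a natural, but inconsistent, estimator of $\Omega_\infty$ would be
\[
\begin{aligned}
\hat{\Omega} &= \hat{\Gamma}_0 + \frac{n-1}{n}\left(\hat{\Gamma}_1 + \hat{\Gamma}_1'\right) + \frac{n-2}{n}\left(\hat{\Gamma}_2 + \hat{\Gamma}_2'\right) + \cdots + \frac{1}{n}\left(\hat{\Gamma}_{n-1} + \hat{\Gamma}_{n-1}'\right) \\
&= \hat{\Gamma}_0 + \sum_{v=1}^{n-1} \frac{n-v}{n}\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right).
\end{aligned}
\]
This estimator is inconsistent in general, since the number of parameters to estimate is more than the number of observations, and increases more rapidly than $n$, so information does not build up as $n \to \infty$.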
On the other hand, supposing that $\Gamma_v$ tends to zero sufficiently rapidly as $v$ tends to $\infty$, a modified estimator
\[
\hat{\Omega} = \hat{\Gamma}_0 + \sum_{v=1}^{q(n)} \left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right),
\]
where $q(n) \overset{p}{\to} \infty$ as $n \to \infty$, will be consistent, provided $q(n)$ grows sufficiently slowly. The term $\frac{n-v}{n}$ can be dropped because $q(n)$ must be $o_p(n)$. This allows information to accumulate at a rate that satisfies a LLN. A disadvantage of this estimator is that it may not be positive definite. This could cause one to calculate a negative $\chi^2$ statistic, for example!
Note: the formula for $\hat{\Omega}$ requires an estimate of $m(\theta_0)$, which in turn requires an estimate of $\theta$, which is based upon an estimate of $\Omega$! The solution to this circularity is to set the weighting matrix $W$ arbitrarily (for example to an identity matrix), obtain a first consistent but inefficient estimate of $\theta_0$, then use this estimate to form $\hat{\Omega}$, then re-estimate $\theta_0$. The process can be iterated until neither $\hat{\Omega}$ nor $\hat{\theta}$ change appreciably between iterations.
Newey-West covariance estimator
The Newey-West estimator (Econometrica, 1987) solves the problem of possible nonpositive definiteness of the above estimator. Their estimator is
\[
\hat{\Omega} = \hat{\Gamma}_0 + \sum_{v=1}^{q(n)} \left[1 - \frac{v}{q+1}\right] \left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right).
\]
This estimator is positive semi-definite by construction. The condition for consistency is that $n^{-1/4} q \to 0$. Note that this is a very slow rate of growth for $q$. This estimator is nonparametric - we've placed no parametric restrictions on the form of $\Omega$. It is an example of a kernel estimator.
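For a single (scalar) moment condition the Newey-West formula is easy to code directly, since $\hat{\Gamma}_v + \hat{\Gamma}_v' = 2\hat{\Gamma}_v$. The Python sketch below is an illustration under assumptions: the function names are hypothetical, the moment contributions are taken to be mean zero already, and the MA(1) test series and the lag length $q = 10$ are arbitrary choices. For that MA(1) process the true long-run variance is $(1 + 0.5)^2 = 2.25$.

```python
import random


def autocovariance(m, v):
    # Gamma_hat_v = (1/n) * sum_{t=v+1}^{n} m_t * m_{t-v}, scalar case
    n = len(m)
    return sum(m[t] * m[t - v] for t in range(v, n)) / n


def newey_west(m, q):
    # Gamma_hat_0 + sum_{v=1}^{q} [1 - v/(q+1)] * 2 * Gamma_hat_v
    omega = autocovariance(m, 0)
    for v in range(1, q + 1):
        weight = 1.0 - v / (q + 1.0)   # Bartlett weight, declines linearly in v
        omega += weight * 2.0 * autocovariance(m, v)
    return omega


rng = random.Random(7)  # hypothetical seed
n = 20000
e = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
ma1 = [e[t + 1] + 0.5 * e[t] for t in range(n)]  # mean-zero MA(1) series
omega_nw = newey_west(ma1, 10)
omega_naive = autocovariance(ma1, 0)  # ignores autocorrelation; tends to 1.25
print(omega_nw, omega_naive)
```

The declining weights are what keep the estimate from going negative, at the cost of some downward bias in finite samples when the autocovariances are positive.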
In a more recent paper, Newey and West (Review of Economic Studies, 1994) use pre-whitening before applying the kernel estimator. The idea is to fit a VAR model to the moment conditions. It is expected that the residuals of the VAR model will be more nearly white noise, so that the Newey-West covariance estimator might perform better with short lag lengths.
The VAR model is
\[
\hat{m}_t = \Theta_1 \hat{m}_{t-1} + \cdots + \Theta_p \hat{m}_{t-p} + u_t
\]
This is estimated, giving the residuals $\hat{u}_t$. Then the Newey-West covariance estimator is applied to these pre-whitened residuals, and the covariance $\Omega$ is estimated combining the fitted VAR
\[
\hat{m}_t = \hat{\Theta}_1 \hat{m}_{t-1} + \cdots + \hat{\Theta}_p \hat{m}_{t-p}
\]
with the kernel estimate of the covariance of the $u_t$. See Newey-West for details.
I have a program that does this if you're interested.
14.7 Estimation using conditional moments
So far, the moment conditions have been presented as unconditional expectations. One common way of defining unconditional moment conditions is based upon conditional moment conditions.
Suppose that a random variable $Y$ has zero expectation conditional on the random variable $X$
\[
\mathcal{E}_{Y|X} Y = \int Y f(Y|X)\, dY = 0
\]
Then the unconditional expectation of the product of $Y$ and a function $g(X)$ of $X$ is also zero. The unconditional expectation is
\[
\mathcal{E}\, Y g(X) = \int_{\mathcal{X}} \left(\int_{\mathcal{Y}} Y g(X) f(Y, X)\, dY\right) dX.
\]
This can be factored into a conditional expectation and an expectation w.r.t. the marginal density of $X$:
\[
\mathcal{E}\, Y g(X) = \int_{\mathcal{X}} \left(\int_{\mathcal{Y}} Y g(X) f(Y|X)\, dY\right) f(X)\, dX.
\]
Since $g(X)$ doesn't depend on $Y$ it can be pulled out of the integral
\[
\mathcal{E}\, Y g(X) = \int_{\mathcal{X}} \left(\int_{\mathcal{Y}} Y f(Y|X)\, dY\right) g(X) f(X)\, dX.
\]
But the term in parentheses on the rhs is zero by assumption, so
\[
\mathcal{E}\, Y g(X) = 0
\]
as claimed.
This is important econometrically, since models often imply restrictions on conditional moments. Suppose a model tells us that the function $K(y_t, x_t)$ has expectation, conditional on the information set $I_t$, equal to $k(x_t, \theta)$,
\[
\mathcal{E}_\theta K(y_t, x_t) | I_t = k(x_t, \theta).
\]
For example, in the context of the classical linear model $y_t = x_t'\beta + \epsilon_t$, we can set $K(y_t, x_t) = y_t$ so that $k(x_t, \theta) = x_t'\beta$.
With this, the error function
\[
\epsilon_t(\theta) = K(y_t, x_t) - k(x_t, \theta)
\]
has conditional expectation equal to zero
\[
\mathcal{E}_\theta\, \epsilon_t(\theta) | I_t = 0.
\]
This is a scalar moment condition, which isn't sufficient to identify a $K$-dimensional parameter $\theta$ ($K > 1$). However, the above result allows us to form various unconditional expectations
\[
m_t(\theta) = Z(w_t)\, \epsilon_t(\theta)
\]
where $Z(w_t)$ is a $g \times 1$-vector valued function of $w_t$ and $w_t$ is a set of variables drawn from the information set $I_t$. The $Z(w_t)$ are instrumental variables. We now have $g$ moment conditions, so as long as $g > K$ the necessary condition for identification holds.
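One can form the $n \times g$ matrix
\[
Z_n = \begin{bmatrix}
Z_1(w_1) & Z_2(w_1) & \cdots & Z_g(w_1) \\
Z_1(w_2) & Z_2(w_2) & \cdots & Z_g(w_2) \\
\vdots & & & \vdots \\
Z_1(w_n) & Z_2(w_n) & \cdots & Z_g(w_n)
\end{bmatrix}
= \begin{bmatrix} Z_1' \\ Z_2' \\ \vdots \\ Z_n' \end{bmatrix}
\]
With this we can form the $g$ moment conditions
\[
m_n(\theta) = \frac{1}{n} Z_n' \begin{bmatrix} \epsilon_1(\theta) \\ \epsilon_2(\theta) \\ \vdots \\ \epsilon_n(\theta) \end{bmatrix}
\]
Define the vector of error functions
\[
h_n(\theta) = \begin{bmatrix} \epsilon_1(\theta) \\ \epsilon_2(\theta) \\ \vdots \\ \epsilon_n(\theta) \end{bmatrix}
\]
With this, we can write
\[
m_n(\theta) = \frac{1}{n} Z_n' h_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} Z_t h_t(\theta) = \frac{1}{n} \sum_{t=1}^{n} m_t(\theta)
\]
where $Z_t'$ is the $t$-th row of $Z_n$ and $h_t(\theta) = \epsilon_t(\theta)$. This fits the previous treatment.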
14.8 Estimation using dynamic moment conditions
Note that dynamic moment conditions simplify the var-cov matrix, but are often harder to formulate. These will be added in future editions. For now, the Hansen application below is enough.
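14.9 A specification test
The first order conditions for minimization, using an estimate of the optimal weighting matrix, are
\[
\frac{\partial}{\partial\theta} s(\hat{\theta}) = 2 \left[\frac{\partial}{\partial\theta} m_n'\left(\hat{\theta}\right)\right] \hat{\Omega}^{-1} m_n\left(\hat{\theta}\right) \equiv 0
\]
or
\[
D(\hat{\theta})\, \hat{\Omega}^{-1} m_n(\hat{\theta}) \equiv 0
\]
Consider a Taylor expansion of $m(\hat{\theta})$:
\[
m(\hat{\theta}) = m_n(\theta_0) + D_n'(\theta^*) \left(\hat{\theta} - \theta_0\right) \qquad (14.4)
\]
where $\theta^*$ is between $\hat{\theta}$ and $\theta_0$. Multiplying by $D(\hat{\theta})\, \hat{\Omega}^{-1}$ we obtain
\[
D(\hat{\theta})\, \hat{\Omega}^{-1} m(\hat{\theta}) = D(\hat{\theta})\, \hat{\Omega}^{-1} m_n(\theta_0) + D(\hat{\theta})\, \hat{\Omega}^{-1} D(\theta^*)' \left(\hat{\theta} - \theta_0\right)
\]
The lhs is zero, so
\[
D(\hat{\theta})\, \hat{\Omega}^{-1} m_n(\theta_0) = -\left[D(\hat{\theta})\, \hat{\Omega}^{-1} D(\theta^*)'\right] \left(\hat{\theta} - \theta_0\right)
\]
or
\[
\left(\hat{\theta} - \theta_0\right) = -\left[D(\hat{\theta})\, \hat{\Omega}^{-1} D(\theta^*)'\right]^{-1} D(\hat{\theta})\, \hat{\Omega}^{-1} m_n(\theta_0)
\]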
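With this, and taking into account the original expansion (equation 14.4), we get
\[
\sqrt{n}\, m(\hat{\theta}) = \sqrt{n}\, m_n(\theta_0) - \sqrt{n}\, D_n'(\theta^*) \left[D(\hat{\theta})\, \hat{\Omega}^{-1} D(\theta^*)'\right]^{-1} D(\hat{\theta})\, \hat{\Omega}^{-1} m_n(\theta_0).
\]
With some factoring, this last can be written as
\[
\sqrt{n}\, m(\hat{\theta}) = \left(\hat{\Omega}^{1/2} - D_n'(\theta^*) \left[D(\hat{\theta})\, \hat{\Omega}^{-1} D(\theta^*)'\right]^{-1} D(\hat{\theta})\, \hat{\Omega}^{-1/2}\right) \left(\sqrt{n}\, \hat{\Omega}^{-1/2} m_n(\theta_0)\right)
\]
and then multiply by $\hat{\Omega}^{-1/2}$ to get
\[
\sqrt{n}\, \hat{\Omega}^{-1/2} m(\hat{\theta}) = \left(I_g - \hat{\Omega}^{-1/2} D_n'(\theta^*) \left[D(\hat{\theta})\, \hat{\Omega}^{-1} D(\theta^*)'\right]^{-1} D(\hat{\theta})\, \hat{\Omega}^{-1/2}\right) \left(\sqrt{n}\, \hat{\Omega}^{-1/2} m_n(\theta_0)\right)
\]
Now
\[
\sqrt{n}\, \hat{\Omega}^{-1/2} m_n(\theta_0) \overset{d}{\to} N(0, I_g)
\]
and the big matrix $I_g - \hat{\Omega}^{-1/2} D_n'(\theta^*) \left[D(\hat{\theta})\, \hat{\Omega}^{-1} D(\theta^*)'\right]^{-1} D(\hat{\theta})\, \hat{\Omega}^{-1/2}$ converges in probability to $P = I_g - \Omega_\infty^{-1/2} D_\infty' \left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1} D_\infty \Omega_\infty^{-1/2}$. However, one can easily verify that $P$ is idempotent and has rank $g - K$ (recall that the rank of an idempotent matrix is equal to its trace). We know that $N(0, I_g)' P\, N(0, I_g) \sim \chi^2(g - K)$. So, a quadratic form on the r.h.s. has an asymptotic chi-square distribution. The quadratic form on the l.h.s. must also have the same distribution, so we finally get
\[
\left(\sqrt{n}\, \hat{\Omega}^{-1/2} m(\hat{\theta})\right)' \left(\sqrt{n}\, \hat{\Omega}^{-1/2} m(\hat{\theta})\right) = n\, m(\hat{\theta})' \hat{\Omega}^{-1} m(\hat{\theta}) \overset{d}{\to} \chi^2(g - K)
\]
or
\[
n \cdot s_n(\hat{\theta}) \overset{d}{\to} \chi^2(g - K)
\]
supposing the model is correctly specified. This is a convenient test since we just multiply the optimized value of the objective function by $n$, and compare with a $\chi^2(g - K)$ critical value. The test is a general test of whether or not the moments used to estimate are correctly specified.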
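The claim that the limiting matrix $P$ is idempotent with rank $g - K$ can be checked in a few lines. The shorthand $A$ below is introduced only for this verification and is not notation from the notes; it assumes $D_\infty$ has full row rank $K$, so that $A'A$ is invertible.

```latex
\begin{align*}
A &\equiv \Omega_\infty^{-1/2} D_\infty' \quad (g \times K), \qquad
A'A = D_\infty \Omega_\infty^{-1} D_\infty', \qquad
P = I_g - A (A'A)^{-1} A' \\
P^2 &= I_g - 2A(A'A)^{-1}A' + A(A'A)^{-1}\underbrace{A'A(A'A)^{-1}}_{=\,I_K}A'
     = I_g - A(A'A)^{-1}A' = P \\
\operatorname{rank}(P) &= \operatorname{tr}(P)
     = \operatorname{tr}(I_g) - \operatorname{tr}\!\left((A'A)^{-1}A'A\right)
     = g - K
\end{align*}
```

Since $P$ is also symmetric, the quadratic form $z'Pz$ with $z \sim N(0, I_g)$ is a sum of $g - K$ squared independent standard normals, which gives the $\chi^2(g - K)$ limit.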
This won't work when the estimator is just identified. The f.o.c. (dropping the factor of 2) are
\[
D_\theta s_n(\hat{\theta}) = D\, \hat{\Omega}^{-1} m(\hat{\theta}) \equiv 0.
\]
But with exact identification, both $D$ and $\hat{\Omega}$ are square and invertible (at least asymptotically, assuming that asymptotic normality holds), so
\[
m(\hat{\theta}) \equiv 0.
\]
So the moment conditions are zero regardless of the weighting matrix used. As such, we might as well use an identity matrix and save trouble. Also $s_n(\hat{\theta}) = 0$, so the test breaks down.
A note: this sort of test often over-rejects in finite samples. One should be cautious in rejecting a model when this test rejects.
14.10 Example: Generalized instrumental variables estimator
The IV estimator may appear a bit unusual at first, but it will grow on you over time. We have in fact already seen the IV estimator above, in the discussion of conditional moments. Let's look at the special case of a linear model with iid errors, but with correlation between regressors and errors:
\[
y_t = x_t'\beta + \epsilon_t
\]
\[
\mathcal{E}(x_t' \epsilon_t) \neq 0
\]
Let's assume, just to keep things simple, that the errors are iid.
The model in matrix form is $y = X\beta + \epsilon$.
Let $K = \dim(x_t)$. Consider some vector $z_t$ of dimension $G \times 1$, where $G \geq K$. Assume that $E(z_t \epsilon_t) = 0$. The variables $z_t$ are instrumental variables. Consider the moment conditions
\[
m_t(\beta) = z_t \epsilon_t = z_t \left(y_t - x_t'\beta\right)
\]
We can arrange the instruments in the $n \times G$ matrix
\[
Z = \begin{bmatrix} z_1' \\ z_2' \\ \vdots \\ z_n' \end{bmatrix}
\]
The average moment conditions are
\[
m_n(\beta) = \frac{1}{n} Z'\epsilon = \frac{1}{n} \left(Z'y - Z'X\beta\right)
\]
The generalized instrumental variables estimator is just the GMM estimator based upon these moment conditions. When $G = K$, we have exact identification, and it is referred to as the instrumental variables estimator.
The first order conditions for GMM are $D_n W_n m_n(\hat{\beta}) = 0$, which imply that
\[
D_n W_n Z'X \hat{\beta}_{IV} = D_n W_n Z'y
\]
Exercise 49. Verify that $D_n = -\frac{X'Z}{n}$. Remember that (assuming differentiability) identification of the GMM estimator requires that this matrix must converge to a matrix with full row rank. Can just any variable that is uncorrelated with the error be used as an instrument, or is there some other condition?
Exercise 50. Verify that the efficient weight matrix is $W_n = \left(\frac{Z'Z}{n}\right)^{-1}$ (up to a constant).
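If we accept what is stated in these two exercises, then
\[
\frac{X'Z}{n} \left(\frac{Z'Z}{n}\right)^{-1} Z'X \hat{\beta}_{IV} = \frac{X'Z}{n} \left(\frac{Z'Z}{n}\right)^{-1} Z'y
\]
Noting that the powers of $n$ cancel, we get
\[
X'Z \left(Z'Z\right)^{-1} Z'X \hat{\beta}_{IV} = X'Z \left(Z'Z\right)^{-1} Z'y
\]
or
\[
\hat{\beta}_{IV} = \left(X'Z \left(Z'Z\right)^{-1} Z'X\right)^{-1} X'Z \left(Z'Z\right)^{-1} Z'y \qquad (14.5)
\]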
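A small simulation helps to see what the estimator in equation 14.5 buys us. The Python sketch below uses one regressor and one instrument ($K = G = 1$), in which case the formula collapses to $(z'x)^{-1} z'y$. The data generating process, the seed, the sample size and the true coefficient 2 are assumptions made up for the illustration: a common shock enters both the regressor and the error, so OLS is inconsistent, while the instrument is correlated with the regressor and unrelated to the error.

```python
import random


def simulate(n, beta0, rng):
    x, y, z = [], [], []
    for _ in range(n):
        zt = rng.gauss(0.0, 1.0)             # instrument
        ut = rng.gauss(0.0, 1.0)             # common shock: links x and the error
        xt = zt + ut + rng.gauss(0.0, 1.0)   # regressor, correlated with the error
        et = ut + rng.gauss(0.0, 1.0)        # error term
        x.append(xt)
        z.append(zt)
        y.append(beta0 * xt + et)
    return x, y, z


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


rng = random.Random(3)  # hypothetical seed
x, y, z = simulate(20000, 2.0, rng)

beta_ols = dot(x, y) / dot(x, x)   # inconsistent here: plim is 2 + 1/3
beta_iv = dot(z, y) / dot(z, x)    # simple IV estimator, (z'x)^{-1} z'y

# The general formula (14.5), written out for scalars
xz, zz, zy = dot(x, z), dot(z, z), dot(z, y)
beta_giv = (xz / zz * zy) / (xz / zz * xz)

print("OLS:", beta_ols)
print("IV: ", beta_iv)
print("GIV:", beta_giv)
```

In the exactly identified case the weighting terms cancel, which is why the last two numbers agree up to rounding.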
Another way of arriving at the same point is to define the projection matrix $P_Z$
\[
P_Z = Z (Z'Z)^{-1} Z'
\]
Anything that is projected onto the space spanned by $Z$ will be uncorrelated with $\epsilon$, by the definition of $Z$. Transforming the model with this projection matrix we get
\[
P_Z y = P_Z X \beta + P_Z \epsilon
\]
or
\[
y^* = X^* \beta + \epsilon^*
\]
Now we have that $\epsilon^*$ and $X^*$ are uncorrelated, since this is simply
\[
\mathcal{E}(X^{*\prime} \epsilon^*) = \mathcal{E}(X' P_Z' P_Z \epsilon) = \mathcal{E}(X' P_Z \epsilon)
\]
and
\[
P_Z X = Z (Z'Z)^{-1} Z'X
\]
is the fitted value from a regression of $X$ on $Z$. This is a linear combination of the columns of $Z$, so it must be uncorrelated with $\epsilon$. This implies that applying OLS to the model
\[
y^* = X^* \beta + \epsilon^*
\]
will lead to a consistent estimator, given a few more assumptions.
Exercise 51. Verify algebraically that applying OLS to the above model gives the IV estimator of equation 14.5.
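With the definition of $P_Z$, we can write
\[
\hat{\beta}_{IV} = (X' P_Z X)^{-1} X' P_Z y
\]
from which we obtain
\[
\hat{\beta}_{IV} = (X' P_Z X)^{-1} X' P_Z (X \beta_0 + \epsilon) = \beta_0 + (X' P_Z X)^{-1} X' P_Z \epsilon
\]
so
\[
\hat{\beta}_{IV} - \beta_0 = (X' P_Z X)^{-1} X' P_Z \epsilon = \left(X'Z (Z'Z)^{-1} Z'X\right)^{-1} X'Z (Z'Z)^{-1} Z'\epsilon
\]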
Now we can introduce factors of n to get
$$\hat\beta_{IV} - \beta^0 = \left(\left(\frac{X'Z}{n}\right)\left(\frac{Z'Z}{n}\right)^{-1}\left(\frac{Z'X}{n}\right)\right)^{-1}\left(\frac{X'Z}{n}\right)\left(\frac{Z'Z}{n}\right)^{-1}\left(\frac{Z'\varepsilon}{n}\right)$$
Assuming that each of the terms with an n in the denominator satisfies a LLN, so that
• $\frac{Z'Z}{n} \overset{p}{\to} Q_{ZZ}$, a finite pd matrix
• $\frac{X'Z}{n} \overset{p}{\to} Q_{XZ}$, a finite matrix with rank K (= cols(X)). That is to say, the instruments must be
correlated with the regressors.
• $\frac{Z'\varepsilon}{n} \overset{p}{\to} 0$
then the plim of the rhs is zero. This last term has plim 0 since we assume that Z and $\varepsilon$ are uncorrelated,
e.g.,
$$\mathcal{E}(z_t'\varepsilon_t) = 0,$$
Given these assumptions, the IV estimator is consistent:
$$\hat\beta_{IV} \overset{p}{\to} \beta^0.$$
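Furthermore, scaling by $\sqrt{n}$, we have
$$\sqrt{n}\left(\hat\beta_{IV} - \beta^0\right) = \left(\left(\frac{X'Z}{n}\right)\left(\frac{Z'Z}{n}\right)^{-1}\left(\frac{Z'X}{n}\right)\right)^{-1}\left(\frac{X'Z}{n}\right)\left(\frac{Z'Z}{n}\right)^{-1}\left(\frac{Z'\varepsilon}{\sqrt{n}}\right)$$
Assuming that the far right term satisfies a CLT, so that
• $\frac{Z'\varepsilon}{\sqrt{n}} \overset{d}{\to} N(0, Q_{ZZ}\sigma^2)$
then we get
$$\sqrt{n}\left(\hat\beta_{IV} - \beta^0\right) \overset{d}{\to} N\left(0, (Q_{XZ} Q_{ZZ}^{-1} Q_{XZ}')^{-1}\sigma^2\right)$$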
The estimators for $Q_{XZ}$ and $Q_{ZZ}$ are the obvious ones. An estimator for $\sigma^2$ is
$$\hat\sigma^2_{IV} = \frac{1}{n}\left(y - X\hat\beta_{IV}\right)'\left(y - X\hat\beta_{IV}\right).$$
This estimator is consistent following the proof of consistency of the OLS estimator of $\sigma^2$, when the
classical assumptions hold.
The formula used to estimate the variance of $\hat\beta_{IV}$ is
$$\hat V(\hat\beta_{IV}) = \left((X'Z)(Z'Z)^{-1}(Z'X)\right)^{-1}\hat\sigma^2_{IV}$$
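The point estimate and its estimated variance can be sketched for the simplest possible case: one regressor, one instrument, no constant. This is an illustration written for this passage, not one of the scripts distributed with these notes; the function name `iv_with_std_error` and the data generating process are assumptions chosen for the example.

```python
import math
import random

def iv_with_std_error(z, x, y):
    """Scalar IV estimate and standard error (one regressor, one instrument)."""
    n = len(y)
    szx = sum(a * b for a, b in zip(z, x))
    szy = sum(a * b for a, b in zip(z, y))
    szz = sum(a * a for a in z)
    b_iv = szy / szx
    # sigma^2 is estimated from the IV residuals y - x*b_iv (not from fitted x).
    sigma2 = sum((yi - b_iv * xi) ** 2 for xi, yi in zip(x, y)) / n
    # Scalar version of [(X'Z)(Z'Z)^{-1}(Z'X)]^{-1} * sigma2.
    var_b = sigma2 * szz / (szx * szx)
    return b_iv, math.sqrt(var_b)

# Hypothetical design: x is endogenous because it contains 0.8*eps.
rng = random.Random(1)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]
eps = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + 0.8 * ei + rng.gauss(0, 1) for zi, ei in zip(z, eps)]
y = [2.0 * xi + ei for xi, ei in zip(x, eps)]

b_iv, se = iv_with_std_error(z, x, y)
print("IV estimate:", round(b_iv, 3), "std. error:", round(se, 4))
```

In this design the standard error should be close to $1/\sqrt{n}$, since $Q_{ZZ} = Q_{XZ} = \sigma^2 = 1$.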
The GIV estimator is
1. Consistent
2. Asymptotically normally distributed
3. Biased in general, since even though $\mathcal{E}(X'P_Z\varepsilon) = 0$, $\mathcal{E}\left[(X'P_Z X)^{-1}X'P_Z\varepsilon\right]$ may not be zero, since
$(X'P_Z X)^{-1}$ and $X'P_Z\varepsilon$ are not independent.
An important point is that the asymptotic distribution of $\hat\beta_{IV}$ depends upon $Q_{XZ}$ and $Q_{ZZ}$, and these
depend upon the choice of Z. The choice of instruments influences the efficiency of the estimator. This
point was made above, when optimal instruments were discussed.
When we have two sets of instruments, $Z_1$ and $Z_2$ such that $Z_1 \subset Z_2$, then the IV estimator
using $Z_2$ is at least as efficient asymptotically as the estimator that uses $Z_1$. More instruments
lead to more asymptotically efficient estimation, in general.
The penalty for indiscriminate use of instruments is that the small sample bias of the IV estimator
rises as the number of instruments increases. The reason for this is that $P_Z X$ becomes closer
and closer to X itself as the number of instruments increases.
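The limiting case makes the last point concrete. Suppose, purely for illustration, that Z has n linearly independent columns, so that Z is square and invertible. Then
$$P_Z = Z(Z'Z)^{-1}Z' = Z Z^{-1}(Z')^{-1}Z' = I_n, \qquad P_Z X = X,$$
and therefore
$$\hat\beta_{IV} = (X'P_Z X)^{-1}X'P_Z y = (X'X)^{-1}X'y = \hat\beta_{OLS}.$$
With as many instruments as observations the IV estimator reproduces OLS exactly, including its bias. With fewer instruments the IV estimator lies somewhere between the two extremes.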
Exercise 52. How would one adapt the GIV estimator presented here to deal with the case of HET
and AUT?
Example 53. Recall Example 19 which deals with a dynamic model with measurement error. The
model is
$$y_t^* = \alpha + \rho y_{t-1}^* + \beta x_t + \epsilon_t$$
$$y_t = y_t^* + \upsilon_t$$
where $\epsilon_t$ and $\upsilon_t$ are independent Gaussian white noise errors. Suppose that $y_t^*$ is not observed, and
instead we observe $y_t$. If we estimate the equation
$$y_t = \alpha + \rho y_{t-1} + \beta x_t + \nu_t$$
by OLS, we have seen in Example 19 that the estimator is biased and inconsistent. What about using
the GIV estimator? Consider using as instruments $Z = [1 \; x_t \; x_{t-1} \; x_{t-2}]$. The lags of $x_t$ are correlated
with $y_{t-1}$ as long as $\rho$ and $\beta$ are different from zero, and by assumption $x_t$ and its lags are uncorrelated
with $\epsilon_t$ and $\upsilon_t$ (and thus they're also uncorrelated with $\nu_t$). Thus, these are legitimate instruments.
As we have 4 instruments and 3 parameters, this is an overidentified situation. The Octave script
GMM/MeasurementErrorIV.m does a Monte Carlo study using 1000 replications, with a sample size
of 100. The results are comparable with those in Example 19. Using the GIV estimator, descriptive
statistics for 1000 replications are
octave:3> MeasurementErrorIV
rm: cannot remove meas_error.out: No such file or directory
mean st. dev. min max
0.000 0.241 -1.250 1.541
-0.016 0.149 -0.868 0.827
-0.001 0.177 -0.757 0.876
octave:4>
If you compare these with the results for the OLS estimator, you will see that the bias of the GIV
estimator is much less for estimation of $\rho$. If you increase the sample size, you will see that the GIV
estimator is consistent, but that the OLS estimator is not.
A histogram for $\hat\rho$ is in Figure 14.4. You can compare with the similar figure for the OLS
estimator, Figure 7.5. As mentioned above, when the GMM estimator is overidentified and we use a
consistent estimate of the efficient weight matrix, we have the criterion-based specification test $n \cdot s_n(\hat\theta)$
available. The Octave script GMM/SpecTest.m does a Monte Carlo study of this test, for the dynamic
model with measurement error, and shows that it over-rejects a correctly specified model in this case.
This is a common result for this test.
2SLS
In the general discussion of GIV above, we haven't considered from where we get the instruments. Two
stage least squares is an example of a particular GIV estimator, where the instruments are obtained
in a particular way. Consider a single equation from a system of simultaneous equations. Refer back
to equation 10.3 for context. The model is
$$y = Y_1\gamma_1 + X_1\beta_1 + \varepsilon = Z\delta + \varepsilon$$
where $Y_1$ are current period endogenous variables that are correlated with the error term. $X_1$ are
exogenous and predetermined variables that are assumed not to be correlated with the error term.
Let X be all of the weakly exogenous variables (please refer back for context). The problem, recall, is
that the variables in $Y_1$ are correlated with $\varepsilon$.
Define $\hat Z = \left[\hat Y_1 \;\; X_1\right]$ as the vector of predictions of Z when regressed upon X:
$$\hat Z = X(X'X)^{-1}X'Z$$
Figure 14.4: GIV estimation results for $\hat\rho$, dynamic model with measurement error
Remember that X are all of the exogenous variables from all equations. The fitted values of
a regression of $X_1$ on X are just $X_1$, because X contains $X_1$. So, $\hat Y_1$ are the reduced form
predictions of $Y_1$.
Since $\hat Z$ is a linear combination of the weakly exogenous variables X, it must be uncorrelated
with $\varepsilon$. This suggests the K-dimensional moment condition $m_t(\delta) = \hat z_t\left(y_t - z_t'\delta\right)$ and so
$$m(\delta) = \frac{1}{n}\sum_t \hat z_t\left(y_t - z_t'\delta\right).$$
Since we have K parameters and K moment conditions, the GMM estimator will set m identically
equal to zero, regardless of W, so we have
$$\hat\delta = \left(\sum_t \hat z_t z_t'\right)^{-1}\sum_t\left(\hat z_t y_t\right) = \left(\hat Z'Z\right)^{-1}\hat Z'y$$
This is the standard formula for 2SLS. We use the exogenous variables and the reduced form predictions
of the endogenous variables as instruments, and apply IV estimation. See Hamilton pp. 420-21 for the
varcov formula (which is the standard formula for 2SLS), and for how to deal with $\varepsilon_t$ heterogeneous
and dependent (basically, just use the Newey-West or some other consistent estimator of $\Omega$, and apply
the usual formula).
Note that autocorrelation of $\varepsilon_t$ causes lagged endogenous variables to lose their status as legitimate
instruments. Some caution is warranted if this suspicion arises.
An example of 2SLS estimation is given in Section 10.10.
We can also estimate this same model using plain GMM estimation. This is done in
Simeq/KleinGMM.m. This script shows the use of the Newey-West covariance estimator.
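The two stages can be written out in a few lines for a toy problem. The sketch below is not the Klein model example referred to above: it assumes a single endogenous regressor `y1`, two excluded instruments `x1` and `x2`, and no constant or included exogenous regressors, so that every matrix inverse reduces to a 2x2 solve or a scalar division. All names and the simulated design are hypothetical.

```python
import random

def two_stage_least_squares(y, y1, x1, x2):
    """2SLS for y = delta*y1 + eps with instruments x1, x2 (no constant)."""
    # First stage: regress y1 on (x1, x2) by solving the 2x2 normal equations.
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    t1 = sum(a * c for a, c in zip(x1, y1))
    t2 = sum(b * c for b, c in zip(x2, y1))
    det = s11 * s22 - s12 * s12
    pi1 = (s22 * t1 - s12 * t2) / det
    pi2 = (s11 * t2 - s12 * t1) / det
    y1_hat = [pi1 * a + pi2 * b for a, b in zip(x1, x2)]
    # Second stage: delta_hat = (Zhat'Z)^{-1} Zhat'y, a scalar ratio here.
    num = sum(f * c for f, c in zip(y1_hat, y))
    den = sum(f * c for f, c in zip(y1_hat, y1))
    return num / den

def simulate(n, delta=1.5, seed=0):
    """Draw data in which y1 is correlated with the structural error."""
    rng = random.Random(seed)
    y, y1, x1, x2 = [], [], [], []
    for _ in range(n):
        a, b = rng.gauss(0, 1), rng.gauss(0, 1)
        eps = rng.gauss(0, 1)
        v = 0.8 * eps + rng.gauss(0, 1)  # shared shock makes y1 endogenous
        e = 0.7 * a + 0.5 * b + v
        x1.append(a)
        x2.append(b)
        y1.append(e)
        y.append(delta * e + eps)
    return y, y1, x1, x2

y, y1, x1, x2 = simulate(20000)
delta_2sls = two_stage_least_squares(y, y1, x1, x2)
delta_ols = sum(a * b for a, b in zip(y1, y)) / sum(a * a for a in y1)
print("2SLS:", round(delta_2sls, 3), "OLS:", round(delta_ols, 3))
```

With a true value of 1.5, the 2SLS estimate should be close to 1.5 while OLS settles well above it, since the regressor and the error share the shock `eps`.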
14.11 Nonlinear simultaneous equations
GMM provides a convenient way to estimate nonlinear systems of simultaneous equations. We have a
system of equations of the form
$$\begin{aligned}
y_{1t} &= f_1(z_t, \theta_1^0) + \varepsilon_{1t} \\
y_{2t} &= f_2(z_t, \theta_2^0) + \varepsilon_{2t} \\
&\;\;\vdots \\
y_{Gt} &= f_G(z_t, \theta_G^0) + \varepsilon_{Gt},
\end{aligned}$$
or in compact notation
$$y_t = f(z_t, \theta^0) + \varepsilon_t,$$
where $f(\cdot)$ is a G-vector valued function, and $\theta^0 = (\theta_1^{0\prime}, \theta_2^{0\prime}, \cdots, \theta_G^{0\prime})'$. We assume that $z_t$ contains the
current period endogenous variables, so we have a simultaneity problem.
We need to find an $A_i \times 1$ vector of instruments $x_{it}$, for each equation, that are uncorrelated with
$\varepsilon_{it}$. Typical instruments would be low order monomials in the exogenous variables in $z_t$, with their
lagged values. Then we can define the $\left(\sum_{i=1}^{G} A_i\right) \times 1$ orthogonality conditions
$$m_t(\theta) = \begin{bmatrix}
\left(y_{1t} - f_1(z_t, \theta_1)\right) x_{1t} \\
\left(y_{2t} - f_2(z_t, \theta_2)\right) x_{2t} \\
\vdots \\
\left(y_{Gt} - f_G(z_t, \theta_G)\right) x_{Gt}
\end{bmatrix}.$$
Once we have gotten this far, we can just proceed with GMM estimation, one-step, two-step,
CUE, or whatever.
A note on identification: selection of instruments that ensure identification is a non-trivial problem.
Identification in nonlinear models is not as easy to check as it is with linear models, where
counting zero restrictions works.
A note on efficiency: the selected set of instruments has important effects on the efficiency of
estimation. Unfortunately there is little theory offering guidance on what is the optimal set.
More on this later.
14.12 Maximum likelihood
In the introduction we argued that ML will in general be more efficient than GMM since ML implicitly
uses all of the moments of the distribution while GMM uses a limited number of moments. Actually, a
distribution with P parameters can be uniquely characterized by P moment conditions. However, some
sets of P moment conditions may contain more information than others, since the moment conditions
could be highly correlated. A GMM estimator that chose an optimal set of P moment conditions
would be fully efficient. Here we'll see that the optimal moment conditions are simply the scores of
the ML estimator.
Let $y_t$ be a G-vector of variables, and let $Y_t = (y_1', y_2', ..., y_t')'$. Then at time t, $Y_{t-1}$ has been observed
(refer to it as the information set, since we assume the conditioning variables have been selected to
take advantage of all useful information). The likelihood function is the joint density of the sample:
$$L(\theta) = f(y_1, y_2, ..., y_n, \theta)$$
which can be factored as
$$L(\theta) = f(y_n|Y_{n-1}, \theta) \cdot f(Y_{n-1}, \theta)$$
and we can repeat this to get
$$L(\theta) = f(y_n|Y_{n-1}, \theta) \cdot f(y_{n-1}|Y_{n-2}, \theta) \cdot ... \cdot f(y_1).$$
The log-likelihood function is therefore
$$\ln L(\theta) = \sum_{t=1}^{n} \ln f(y_t|Y_{t-1}, \theta).$$
Define
$$m_t(Y_t, \theta) \equiv D_\theta \ln f(y_t|Y_{t-1}, \theta)$$
as the score of the t-th observation. It can be shown that, under the regularity conditions, the
scores have conditional mean zero when evaluated at $\theta^0$ (see 13.2):
$$\mathcal{E}\left\{m_t(Y_t, \theta^0)|Y_{t-1}\right\} = 0$$
so one could interpret these as moment conditions to use to define a just-identified GMM estimator
(if there are K parameters there are K score equations). The GMM estimator sets
$$\frac{1}{n}\sum_{t=1}^{n} m_t(Y_t, \hat\theta) = \frac{1}{n}\sum_{t=1}^{n} D_\theta \ln f(y_t|Y_{t-1}, \hat\theta) = 0,$$
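which are precisely the first order conditions of MLE. Therefore, MLE can be interpreted as a GMM
estimator. The GMM varcov formula is $V_\infty = \left(D_\infty \Omega^{-1} D_\infty'\right)^{-1}$.
Consistent estimates of variance components are as follows
• $\hat D_\infty = \frac{\partial}{\partial\theta'} m(Y_t, \hat\theta) = \frac{1}{n}\sum_{t=1}^{n} D_\theta^2 \ln f(y_t|Y_{t-1}, \hat\theta)$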
• It is important to note that $m_t$ and $m_{t-s}$, s > 0 are both conditionally and unconditionally
uncorrelated. Conditional uncorrelation follows from the fact that $m_{t-s}$ is a function of $Y_{t-s}$,
which is in the information set at time t. Unconditional uncorrelation follows from the fact
that conditional uncorrelation holds regardless of the realization of $Y_{t-1}$, so marginalizing with
respect to $Y_{t-1}$ preserves uncorrelation (see the section on ML estimation, above). The fact that
the scores are serially uncorrelated implies that $\Omega$ can be estimated by the estimator of the 0th
autocovariance of the moment conditions:
$$\hat\Omega = \frac{1}{n}\sum_{t=1}^{n} m_t(Y_t, \hat\theta)\, m_t(Y_t, \hat\theta)' = \frac{1}{n}\sum_{t=1}^{n}\left[D_\theta \ln f(y_t|Y_{t-1}, \hat\theta)\right]\left[D_\theta \ln f(y_t|Y_{t-1}, \hat\theta)\right]'$$
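A one-parameter case shows these pieces at work. Assume, for illustration only, that the $y_t$ are i.i.d. Poisson with mean $\lambda$, so that conditioning on $Y_{t-1}$ plays no role. Then
$$\ln f(y_t, \lambda) = -\lambda + y_t \ln\lambda - \ln(y_t!), \qquad m_t(\lambda) = D_\lambda \ln f(y_t, \lambda) = \frac{y_t}{\lambda} - 1.$$
Setting the average score to zero gives the GMM (and ML) estimator:
$$\frac{1}{n}\sum_{t=1}^{n}\left(\frac{y_t}{\hat\lambda} - 1\right) = 0 \quad\Longrightarrow\quad \hat\lambda = \bar y.$$
The variance components are
$$\hat D_\infty = \frac{1}{n}\sum_{t=1}^{n}\left(-\frac{y_t}{\hat\lambda^2}\right) = -\frac{1}{\bar y}, \qquad \hat\Omega = \frac{1}{n}\sum_{t=1}^{n}\left(\frac{y_t}{\bar y} - 1\right)^2 = \frac{s^2}{\bar y^2},$$
where $s^2 = \frac{1}{n}\sum_t (y_t - \bar y)^2$. The GMM varcov formula then gives $\hat V_\infty = \hat\Omega / \hat D_\infty^2 = s^2$, which estimates $\lambda$ when the Poisson model is correct. This agrees with the usual ML result that the asymptotic variance of $\sqrt{n}(\hat\lambda - \lambda)$ is $\lambda$.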
14.13 Example: OLS as a GMM estimator - the Nerlove
model again
The simple Nerlove model can be estimated using GMM. The Octave script NerloveGMM.m estimates
the model by GMM and by OLS. It also illustrates that the weight matrix does not matter when the
moments just identify the parameter. You are encouraged to examine the script and run it.
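The reason the weight matrix does not matter can be seen directly. Assuming the moment conditions are the usual orthogonality conditions between regressors and errors (the script may organize them differently),
$$m_t(\beta) = x_t\left(y_t - x_t'\beta\right), \qquad m_n(\beta) = \frac{1}{n}\sum_{t=1}^{n} m_t(\beta) = \frac{1}{n}X'(y - X\beta).$$
There are K moments and K parameters, and $m_n(\hat\beta) = 0$ has the solution
$$\hat\beta = (X'X)^{-1}X'y,$$
which is the OLS estimator. For any positive definite W the objective function $m_n(\beta)'W m_n(\beta)$ is non-negative and equals zero at this $\hat\beta$, so the minimizer is the same whatever W is chosen.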
14.14 Example: The MEPS data
In section 11.4, a Poisson model for the MEPS data on health care usage was estimated by maximum
likelihood (probably misspecified). Perhaps the same latent factors (e.g., chronic illness) that induce
one to make doctor visits also influence the decision of whether or not to purchase insurance. If this
is the case, the PRIV variable could well be endogenous, in which case, the Poisson ML estimator
would be inconsistent, even if the conditional mean were correctly specified. The Octave script meps.m
estimates the parameters of the model presented in equation 11.1, using Poisson ML (better thought
of as quasi-ML), and IV estimation. (The validity of the instruments used may be debatable, but real
data sets often don't contain ideal instruments.) Both estimation methods are implemented using a GMM form.
Running that script gives the output
OBDV
******************************************************
IV
GMM Estimation Results
BFGS convergence: Normal convergence
Objective function value: 0.004273
Observations: 4564
No moment covariance supplied, assuming efficient weight matrix
Value df p-value
X^2 test 19.502 3.000 0.000
estimate st. err t-stat p-value
constant -0.441 0.213 -2.072 0.038
pub. ins. -0.127 0.149 -0.851 0.395
priv. ins. -1.429 0.254 -5.624 0.000
sex 0.537 0.053 10.133 0.000
age 0.031 0.002 13.431 0.000
edu 0.072 0.011 6.535 0.000
inc 0.000 0.000 4.500 0.000
******************************************************
******************************************************
Poisson QML
GMM Estimation Results
BFGS convergence: Normal convergence
Objective function value: 0.000000
Observations: 4564
No moment covariance supplied, assuming efficient weight matrix
Exactly identified, no spec. test
estimate st. err t-stat p-value
constant -0.791 0.149 -5.289 0.000
pub. ins. 0.848 0.076 11.092 0.000
priv. ins. 0.294 0.071 4.136 0.000
sex 0.487 0.055 8.796 0.000
age 0.024 0.002 11.469 0.000
edu 0.029 0.010 3.060 0.002
inc -0.000 0.000 -0.978 0.328
******************************************************
Note how the Poisson QML results, estimated here using a GMM routine, are the same as were
obtained using the ML estimation routine (see subsection 11.4). This is an example of how (Q)ML may
be represented as a GMM estimator. Also note that the IV and QML results are considerably different.
Treating PRIV as potentially endogenous causes the sign of its coefficient to change. Perhaps it is
logical that people who own private insurance make fewer visits, if they have to make a co-payment.
Note that income becomes positive and significant when PRIV is treated as endogenous.
Perhaps the difference in the results depending upon whether or not PRIV is treated as endogenous
can suggest a method for testing exogeneity. Onward to the Hausman test!
14.15 Example: The Hausman Test
This section discusses the Hausman test, which was originally presented in Hausman, J.A. (1978),
Specification tests in econometrics, Econometrica, 46, 1251-71.
Consider the simple linear regression model $y_t = x_t'\beta + \epsilon_t$. We assume that the functional form and
the choice of regressors is correct, but that some of the regressors may be correlated with the error
term, which as you know will produce inconsistency of $\hat\beta$. For example, this will be a problem if
• some regressors are endogenous
• some regressors are measured with error
• some relevant regressors are omitted (equivalent to imposing false restrictions)
• lagged values of the dependent variable are used as regressors and $\epsilon_t$ is autocorrelated.
To illustrate, the Octave program OLSvsIV.m performs a Monte Carlo experiment where errors are
correlated with regressors, and estimation is by OLS and IV. The true value of the slope coefficient
used to generate the data is $\beta = 2$. Figure 14.5 shows that the OLS estimator is quite biased, while
Figure 14.6 shows that the IV estimator is on average much closer to the true value. If you play with
the program, increasing the sample size, you can see evidence that the OLS estimator is asymptotically
biased, while the IV estimator is consistent. You can also play with the covariances of the instrument
and regressor, and the covariance of the regressor and the error.
Figure 14.5: OLS (histogram of OLS estimates)
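An experiment of this kind can be sketched in a few lines. The code below is not OLSvsIV.m: the data generating process (one regressor, one instrument, no constant, the coefficient 0.8 linking the regressor to the error) and all function names are assumptions made for this illustration. Only the true slope of 2 is taken from the text.

```python
import random

def ols_slope(x, y):
    """OLS slope for a model without a constant."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def iv_slope(z, x, y):
    """Simple IV slope using a single instrument z."""
    return sum(a * b for a, b in zip(z, y)) / sum(a * b for a, b in zip(z, x))

def draw_sample(n, rng, beta=2.0):
    """x depends on the error eps, so OLS is inconsistent; z is a valid instrument."""
    z = [rng.gauss(0, 1) for _ in range(n)]
    eps = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + 0.8 * ei + rng.gauss(0, 1) for zi, ei in zip(z, eps)]
    y = [beta * xi + ei for xi, ei in zip(x, eps)]
    return z, x, y

rng = random.Random(42)
reps, n = 200, 1000
ols_draws, iv_draws = [], []
for _ in range(reps):
    z, x, y = draw_sample(n, rng)
    ols_draws.append(ols_slope(x, y))
    iv_draws.append(iv_slope(z, x, y))

mean_ols = sum(ols_draws) / reps
mean_iv = sum(iv_draws) / reps
print("mean OLS:", round(mean_ols, 3), "mean IV:", round(mean_iv, 3))
```

In this design the OLS probability limit is $2 + 0.8/2.64 \approx 2.30$, so the OLS draws should cluster well above 2 while the IV draws center near 2. Changing the 0.8, or the coefficient on z, is the analogue of playing with the covariances mentioned above.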
We have seen that inconsistent and consistent estimators converge to different probability
limits. This is the idea behind the Hausman test - a pair of consistent estimators converge to the same
probability limit, while if one is consistent and the other is not they converge to different limits. If we
accept that one is consistent (e.g., the IV estimator), but we are doubting if the other is consistent
(e.g., the OLS estimator), we might try to check if the difference between the estimators is significantly
different from zero.
Figure 14.6: IV (histogram of IV estimates)
If we're doubting about the consistency of OLS (or QML, etc.), why should we be interested
in testing - why not just use the IV estimator? Because the OLS estimator is more efficient
when the regressors are exogenous and the other classical assumptions (including normality of
the errors) hold.
Play with the above script to convince yourself of this point.
When we have a more efficient estimator that relies on stronger assumptions (such as exogeneity)
than the IV estimator, we might prefer to use it, unless we have evidence that the assumptions
are false.
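So, let's consider the covariance between the MLE estimator $\hat\theta$ (or any other fully efficient estimator)
and some other CAN estimator, say $\tilde\theta$. Now, let's recall some results from MLE. Equation 13.4 is:
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \overset{a.s.}{\to} -\mathcal{J}_\infty(\theta_0)^{-1}\sqrt{n}\, g(\theta_0).$$
Equation 13.8 is
$$\mathcal{I}_\infty(\theta) = -\mathcal{J}_\infty(\theta).$$
Combining these two equations, we get
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \overset{a.s.}{\to} \mathcal{I}_\infty(\theta_0)^{-1}\sqrt{n}\, g(\theta_0).$$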
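Also, equation 13.11 tells us that the asymptotic covariance between any CAN estimator and the
MLE score vector is
$$V_\infty\begin{bmatrix} \sqrt{n}\left(\tilde\theta - \theta\right) \\ \sqrt{n}\, g(\theta) \end{bmatrix} = \begin{bmatrix} V_\infty(\tilde\theta) & I_K \\ I_K & \mathcal{I}_\infty(\theta) \end{bmatrix}.$$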
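Now, consider
$$\begin{bmatrix} I_K & 0_K \\ 0_K & \mathcal{I}_\infty(\theta)^{-1} \end{bmatrix}\begin{bmatrix} \sqrt{n}\left(\tilde\theta - \theta\right) \\ \sqrt{n}\, g(\theta) \end{bmatrix} \overset{a.s.}{\to} \begin{bmatrix} \sqrt{n}\left(\tilde\theta - \theta\right) \\ \sqrt{n}\left(\hat\theta - \theta\right) \end{bmatrix}.$$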
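The asymptotic covariance of this is
$$V_\infty\begin{bmatrix} \sqrt{n}\left(\tilde\theta - \theta\right) \\ \sqrt{n}\left(\hat\theta - \theta\right) \end{bmatrix} = \begin{bmatrix} I_K & 0_K \\ 0_K & \mathcal{I}_\infty(\theta)^{-1} \end{bmatrix}\begin{bmatrix} V_\infty(\tilde\theta) & I_K \\ I_K & \mathcal{I}_\infty(\theta) \end{bmatrix}\begin{bmatrix} I_K & 0_K \\ 0_K & \mathcal{I}_\infty(\theta)^{-1} \end{bmatrix} = \begin{bmatrix} V_\infty(\tilde\theta) & \mathcal{I}_\infty(\theta)^{-1} \\ \mathcal{I}_\infty(\theta)^{-1} & \mathcal{I}_\infty(\theta)^{-1} \end{bmatrix},$$
which, for clarity in what follows, we might write as (note that the 2,2 element is now written as
$V_\infty(\hat\theta)$)
$$V_\infty\begin{bmatrix} \sqrt{n}\left(\tilde\theta - \theta\right) \\ \sqrt{n}\left(\hat\theta - \theta\right) \end{bmatrix} = \begin{bmatrix} V_\infty(\tilde\theta) & \mathcal{I}_\infty(\theta)^{-1} \\ \mathcal{I}_\infty(\theta)^{-1} & V_\infty(\hat\theta) \end{bmatrix}.$$
So, the asymptotic covariance between the MLE and any other CAN estimator is equal to the MLE
asymptotic variance (the inverse of the information matrix).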
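Now, suppose we wish to test whether the two estimators are in fact both converging to $\theta_0$,
versus the alternative hypothesis that the MLE estimator is not in fact consistent (the consistency
of $\tilde\theta$ is a maintained hypothesis). Under the null hypothesis that they are,
$$\begin{bmatrix} I_K & -I_K \end{bmatrix}\begin{bmatrix} \sqrt{n}\left(\tilde\theta - \theta_0\right) \\ \sqrt{n}\left(\hat\theta - \theta_0\right) \end{bmatrix} = \sqrt{n}\left(\tilde\theta - \hat\theta\right)$$
will be asymptotically normally distributed. Because the covariance between the two estimators equals
$V_\infty(\hat\theta)$, the asymptotic variance of the difference is $V_\infty(\tilde\theta) - 2V_\infty(\hat\theta) + V_\infty(\hat\theta)$, so
$$\sqrt{n}\left(\tilde\theta - \hat\theta\right) \overset{d}{\to} N\left(0, V_\infty(\tilde\theta) - V_\infty(\hat\theta)\right).$$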
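So,
$$n\left(\tilde\theta - \hat\theta\right)'\left(V_\infty(\tilde\theta) - V_\infty(\hat\theta)\right)^{-1}\left(\tilde\theta - \hat\theta\right) \overset{d}{\to} \chi^2(\rho),$$
where $\rho$ is the rank of the difference of the asymptotic variances. A statistic that has the same
asymptotic distribution is
$$\left(\tilde\theta - \hat\theta\right)'\left(\hat V(\tilde\theta) - \hat V(\hat\theta)\right)^{-1}\left(\tilde\theta - \hat\theta\right) \overset{d}{\to} \chi^2(\rho).$$
This is the Hausman test statistic, in its original form. The reason that this test has power under
the alternative hypothesis is that in that case the MLE estimator will not be consistent, and will
converge to $\theta_A$, say, where $\theta_A \neq \theta_0$. Then the mean of the asymptotic distribution of vector
$\sqrt{n}\left(\tilde\theta - \hat\theta\right)$ will be $\theta_0 - \theta_A$, a non-zero vector, so the test statistic will eventually reject, regardless of how small a
significance level is used.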
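For a single coefficient the statistic is a one-line calculation. The sketch below compares OLS (efficient under the null when the errors are normal and the regressor is exogenous) with simple IV (consistent under both hypotheses) in a model with one regressor, one instrument and no constant. It is not the hausman.m script mentioned later; the function name, the simulated designs and the use of homoscedastic variance formulas are assumptions made for the example.

```python
import random

def hausman_scalar(z, x, y):
    """Hausman statistic comparing the IV and OLS slopes (chi-square, 1 df)."""
    n = len(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    szz = sum(a * a for a in z)
    szx = sum(a * b for a, b in zip(z, x))
    szy = sum(a * b for a, b in zip(z, y))
    b_ols = sxy / sxx
    b_iv = szy / szx
    s2_ols = sum((yi - b_ols * xi) ** 2 for xi, yi in zip(x, y)) / n
    s2_iv = sum((yi - b_iv * xi) ** 2 for xi, yi in zip(x, y)) / n
    v_ols = s2_ols / sxx                # estimated variance of the OLS slope
    v_iv = s2_iv * szz / (szx * szx)    # estimated variance of the IV slope
    return (b_iv - b_ols) ** 2 / (v_iv - v_ols)

def draw_sample(n, rng, endog):
    """endog is the coefficient linking the regressor to the error term."""
    z = [rng.gauss(0, 1) for _ in range(n)]
    eps = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + endog * ei + rng.gauss(0, 1) for zi, ei in zip(z, eps)]
    y = [2.0 * xi + ei for xi, ei in zip(x, eps)]
    return z, x, y

rng = random.Random(7)
h_endog = hausman_scalar(*draw_sample(2000, rng, endog=0.8))
h_exog = hausman_scalar(*draw_sample(2000, rng, endog=0.0))
critical_5pct = 3.841  # chi-square(1) critical value at the 5% level
print("endogenous regressor:", round(h_endog, 1), "exogenous regressor:", round(h_exog, 2))
```

With an endogenous regressor the statistic should be far above 3.841. With an exogenous regressor it should exceed that value in roughly 5% of samples. In this scalar setting the variance difference in the denominator is never negative, because OLS minimizes the sum of squared residuals and by the Cauchy-Schwarz inequality; in the vector case the estimated difference need not be positive definite.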
Note: if the test is based on a sub-vector of the entire parameter vector of the MLE, it is possible
that the inconsistency of the MLE will not show up in the portion of the vector that has been
used. If this is the case, the test may not have power to detect the inconsistency. This may occur,
for example, when the consistent but inefficient estimator is not identified for all the parameters
of the model, so that we estimate only some of the parameters using the inefficient estimator,
and the test does not include the others.
Some things to note:
The rank, $\rho$, of the difference of the asymptotic variances is often less than the dimension of the
matrices, and it may be difficult to determine what the true rank is. If the true rank is lower
than what is taken to be true, the test will be biased against rejection of the null hypothesis.
The contrary holds if we underestimate the rank.
A solution to this problem is to use a rank 1 test, by comparing only a single coefficient. For
example, if a variable is suspected of possibly being endogenous, that variable's coefficients may
be compared.
This simple formula only holds when the estimator that is being tested for consistency is fully
efficient under the null hypothesis. This means that it must be a ML estimator or a fully
efficient estimator that has the same asymptotic distribution as the ML estimator. This is
quite restrictive since modern estimators such as GMM, QML, or even OLS with heteroscedastic
consistent standard errors are not in general fully efficient.
Figure 14.7: Incorrect rank and the Hausman test
Following up on this last point, let's think of two not necessarily efficient estimators, $\hat\theta_1$ and $\hat\theta_2$, where
one is assumed to be consistent, but the other may not be. We assume for expositional simplicity
that both $\hat\theta_1$ and $\hat\theta_2$ belong to the same parameter space, and that each estimator can be expressed
as a generalized method of moments (GMM) estimator. The estimators are defined (suppressing the
dependence upon data) by
$$\hat\theta_i = \arg\min_{\theta_i \in \Theta} m_i(\theta_i)' W_i\, m_i(\theta_i)$$
where $m_i(\theta_i)$ is a $g_i \times 1$ vector of moment conditions, and $W_i$ is a $g_i \times g_i$ positive definite weighting
matrix, i = 1, 2. Consider the omnibus GMM estimator
$$\left(\hat\theta_1, \hat\theta_2\right) = \arg\min_{\Theta \times \Theta} \begin{bmatrix} m_1(\theta_1)' & m_2(\theta_2)' \end{bmatrix}\begin{bmatrix} W_1 & 0_{(g_1 \times g_2)} \\ 0_{(g_2 \times g_1)} & W_2 \end{bmatrix}\begin{bmatrix} m_1(\theta_1) \\ m_2(\theta_2) \end{bmatrix}. \qquad (14.6)$$
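Suppose that the asymptotic covariance of the omnibus moment vector is
$$\Sigma = \lim_{n \to \infty} Var\left\{\sqrt{n}\begin{bmatrix} m_1(\theta_1) \\ m_2(\theta_2) \end{bmatrix}\right\} \equiv \begin{pmatrix} \Sigma_1 & \Sigma_{12} \\ \cdot & \Sigma_2 \end{pmatrix}. \qquad (14.7)$$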
The standard Hausman test is equivalent to a Wald test of the equality of $\theta_1$ and $\theta_2$ (or subvectors of
the two) applied to the omnibus GMM estimator, but with the covariance of the moment conditions
estimated as
$$\hat\Sigma = \begin{pmatrix} \hat\Sigma_1 & 0_{(g_1 \times g_2)} \\ 0_{(g_2 \times g_1)} & \hat\Sigma_2 \end{pmatrix}.$$
While this is clearly an inconsistent estimator in general, the omitted $\Sigma_{12}$ term cancels out of the test
statistic when one of the estimators is asymptotically efficient, as we have seen above, and thus it need
not be estimated.
The general solution when neither of the estimators is efficient is clear: the entire $\Sigma$ matrix must
be estimated consistently, since the $\Sigma_{12}$ term will not cancel out. Methods for consistently estimating
the asymptotic covariance of a vector of moment conditions are well-known, e.g., the Newey-West
estimator discussed previously. The Hausman test using a proper estimator of the overall covariance
matrix will now have an asymptotic $\chi^2$ distribution when neither estimator is efficient.
However, the test suffers from a loss of power due to the fact that the omnibus GMM estimator
of equation 14.6 is defined using an inefficient weight matrix. A new test can be defined by using an
alternative omnibus GMM estimator
$$\left(\hat\theta_1, \hat\theta_2\right) = \arg\min_{\Theta \times \Theta} \begin{bmatrix} m_1(\theta_1)' & m_2(\theta_2)' \end{bmatrix}\left(\tilde\Sigma\right)^{-1}\begin{bmatrix} m_1(\theta_1) \\ m_2(\theta_2) \end{bmatrix}, \qquad (14.8)$$
where $\tilde\Sigma$ is a consistent estimator of the overall covariance matrix $\Sigma$ of equation 14.7. By standard
arguments, this is a more efficient estimator than that defined by equation 14.6, so the Wald test
using this alternative is more powerful. See my article in Applied Economics, 2004, for more details,
including simulation results. The Octave script hausman.m calculates the Wald test corresponding to
the efficient joint GMM estimator (the H2 test in my paper), for a simple linear model.
14.16 Application: Nonlinear rational expectations
Readings: Hansen and Singleton, Econometrica, 1982; Tauchen, Journal of Business and Economic
Statistics, 1986.
Though GMM estimation has many applications, application to rational expectations models is
elegant, since theory directly suggests the moment conditions. Hansen and Singleton's 1982 paper is
also a classic worth studying in itself. Though I strongly recommend reading the paper, I'll use a
simplified model with similar notation to Hamilton's. The literature on estimation of these models
has grown a lot since these early papers. After work like the cited papers, people moved to ML
estimation of linearized models, using Kalman filtering. Current methods are usually Bayesian, and
involve sophisticated filtering methods to compute the likelihood function for nonlinear models with
non-normal shocks. There is a lot of interesting stuff that is beyond the scope of this course. I have
done some work using simulation-based estimation methods applied to such models. The methods
explained in this section are intended to provide an example of GMM estimation. They are not the
state of the art for estimation of such models.
We assume a representative consumer maximizes expected discounted utility over an infinite horizon.
Expectations are rational, and the agent has full information (is fully aware of the history of the
world up to the current time period - how's that for an assumption!). Utility is temporally additive,
and the expected utility hypothesis holds. The future consumption stream is the stochastic sequence
$\{c_t\}_{t=0}^{\infty}$. The objective function at time t is the discounted expected utility
$$\sum_{s=0}^{\infty} \beta^s\, \mathcal{E}\left(u(c_{t+s})|I_t\right). \qquad (14.9)$$
The parameter $\beta$ is between 0 and 1, and reflects discounting.
$I_t$ is the information set at time t, and includes all realizations of all random variables indexed
t and earlier.
The choice variable is $c_t$ - current consumption, which is constrained to be less than or equal to
current wealth $w_t$.
Suppose the consumer can invest in a risky asset. A dollar invested in the asset yields a gross
return
$$(1 + r_{t+1}) = \frac{p_{t+1} + d_{t+1}}{p_t}$$
where $p_t$ is the price and $d_t$ is the dividend in period t. Thus, $r_{t+1}$ is the net return on a dollar
invested in period t.
The price of $c_t$ is normalized to 1.
Current wealth $w_t = (1 + r_t)\, i_{t-1}$, where $i_{t-1}$ is investment in period t − 1. So the problem
is to allocate current wealth between current consumption and investment to finance future
consumption: $w_t = c_t + i_t$.
Future net rates of return $r_{t+s}$, s > 0 are not known in period t: the asset is risky.
A partial set of necessary conditions for utility maximization have the form:
$$u'(c_t) = \beta\, \mathcal{E}\left\{(1 + r_{t+1})\, u'(c_{t+1})|I_t\right\}. \qquad (14.10)$$
To see that the condition is necessary, suppose that the lhs < rhs. Then reducing current consumption
marginally would cause equation 14.9 to drop by $u'(c_t)$, since there is no discounting of
the current period. At the same time, the marginal reduction in consumption finances investment,
which has gross return $(1 + r_{t+1})$, which could finance consumption in period t + 1. This increase in
consumption would cause the objective function to increase by $\beta\, \mathcal{E}\left\{(1 + r_{t+1})\, u'(c_{t+1})|I_t\right\}$. Therefore,
unless the condition holds, the expected discounted utility function is not maximized.
To use this we need to choose the functional form of utility. A constant relative risk aversion
(CRRA) form is
$$u(c_t) = \frac{c_t^{1-\gamma} - 1}{1 - \gamma}$$
where $\gamma$ is the coefficient of relative risk aversion. With this form,
$$u'(c_t) = c_t^{-\gamma}$$
so the foc are
$$c_t^{-\gamma} = \beta\, \mathcal{E}\left\{(1 + r_{t+1})\, c_{t+1}^{-\gamma}|I_t\right\}$$
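As a brief check of why this form carries the name, differentiate once more:
$$u''(c_t) = -\gamma\, c_t^{-\gamma - 1}, \qquad -\frac{c_t\, u''(c_t)}{u'(c_t)} = \frac{\gamma\, c_t^{-\gamma}}{c_t^{-\gamma}} = \gamma,$$
so relative risk aversion is the same constant at every level of consumption. The $-1$ in the numerator of $u(c_t)$ does not affect the first order conditions; it is there so that the limit $\gamma \to 1$ is well defined, in which case $u(c_t) \to \ln c_t$.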
While it is true that
$$\mathcal{E}\left(\left\{c_t^{-\gamma} - \beta (1 + r_{t+1})\, c_{t+1}^{-\gamma}\right\}|I_t\right) = 0$$
so that we could use this to define moment conditions, it is unlikely that $c_t$ is stationary, even though
it is in real terms, and our theory requires stationarity. To solve this, divide through by $c_t^{-\gamma}$
$$\mathcal{E}\left(\left\{1 - \beta (1 + r_{t+1})\left(\frac{c_{t+1}}{c_t}\right)^{-\gamma}\right\}|I_t\right) = 0$$
(note that $c_t$ can be passed through the conditional expectation since $c_t$ is chosen based only upon
information available in time t). That is to say, $c_t$ is in the information set $I_t$.
Now
$$1 - \beta (1 + r_{t+1})\left(\frac{c_{t+1}}{c_t}\right)^{-\gamma}$$
is analogous to $h_t(\theta)$ defined above: it's a scalar moment condition. To get a vector of moment conditions
we need some instruments. Suppose that $z_t$ is a vector of variables drawn from the information
set $I_t$. We can use the necessary conditions to form the expressions
$$\left\{1 - \beta (1 + r_{t+1})\left(\frac{c_{t+1}}{c_t}\right)^{-\gamma}\right\} z_t \equiv m_t(\theta)$$
$\theta$ represents $\beta$ and $\gamma$.
Therefore, the above expression may be interpreted as a moment condition which can be used
for GMM estimation of the parameters $\theta^0$.
Note that at time t, $m_{t-s}$ has been observed, and is therefore an element of the information set. By
rational expectations, the autocovariances of the moment conditions other than $\Gamma_0$ should be zero.
The optimal weighting matrix is therefore the inverse of the variance of the moment conditions:
$$\Omega_\infty = \lim E\left[n\, m(\theta^0)\, m(\theta^0)'\right]$$
which can be consistently estimated by
$$\hat\Omega = \frac{1}{n}\sum_{t=1}^{n} m_t(\hat\theta)\, m_t(\hat\theta)'$$
As before, this estimate depends on an initial consistent estimate of $\theta$, which can be obtained by
setting the weighting matrix W arbitrarily (to an identity matrix, for example). After obtaining $\hat\theta$, we
then minimize
$$s(\theta) = m(\theta)'\, \hat\Omega^{-1} m(\theta).$$
This process can be iterated, e.g., use the new estimate to re-estimate $\Omega$, use this to estimate $\theta^0$, and
repeat until the estimates don't change.
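The building blocks of this procedure (the moment contributions $m_t(\theta)$, their average, and $\hat\Omega$) can be sketched as follows. This is not the portfolio.m program used in the next section. The instrument vector $z_t = (1,\; c_t/c_{t-1},\; r_t)$ is one possible choice made for this illustration, and all function names are hypothetical. The minimization over $\theta$ is left out: any numerical optimizer could be applied to the objective.

```python
def euler_moment_contributions(c, r, beta, gamma):
    """Return the list of m_t(theta) vectors; c[t], r[t] are period-t values."""
    m = []
    for t in range(1, len(c) - 1):
        growth = c[t + 1] / c[t]
        # Scalar Euler equation error: 1 - beta*(1 + r_{t+1})*(c_{t+1}/c_t)^(-gamma)
        h = 1.0 - beta * (1.0 + r[t + 1]) * growth ** (-gamma)
        # Instruments known at time t (an assumed choice: constant and one lag).
        z = (1.0, c[t] / c[t - 1], r[t])
        m.append(tuple(h * zi for zi in z))
    return m

def average_moments(m):
    """m(theta): the sample average of the moment contributions."""
    n = len(m)
    return [sum(row[j] for row in m) / n for j in range(len(m[0]))]

def omega_hat(m):
    """(1/n) * sum of m_t m_t', with no autocovariance terms."""
    n, g = len(m), len(m[0])
    return [[sum(row[i] * row[j] for row in m) / n for j in range(g)]
            for i in range(g)]

def objective_identity_weight(c, r, beta, gamma):
    """First-round GMM objective m(theta)'m(theta), i.e. W = identity."""
    mbar = average_moments(euler_moment_contributions(c, r, beta, gamma))
    return sum(v * v for v in mbar)

# Toy check: constant consumption and a constant return with beta*(1+r) = 1,
# so the Euler equation holds exactly at beta = 0.95 for any gamma.
c = [1.0] * 6
r = [1.0 / 0.95 - 1.0] * 6
obj_true = objective_identity_weight(c, r, beta=0.95, gamma=2.0)
obj_wrong = objective_identity_weight(c, r, beta=0.90, gamma=2.0)
print(obj_true, obj_wrong)
```

In the toy data the objective is zero (up to rounding) at the value of $\beta$ that satisfies the Euler equation and positive elsewhere. Note that $\gamma$ is not identified in this toy example, because consumption growth never varies. This is a small-scale version of the identification problems discussed below.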
In principle, we could use a very large number of moment conditions in estimation, since any
current or lagged variable could be used in $z_t$. Since use of more moment conditions will lead
to a more (asymptotically) efficient estimator, one might be tempted to use many instrumental
variables. We will do a computer lab that will show that this may not be a good idea with
finite samples. This issue has been studied using Monte Carlos (Tauchen, JBES, 1986). The
reason for poor performance when using many instruments is that the estimate of $\Omega$ becomes
very imprecise.
Empirical papers that use this approach often have serious problems in obtaining precise estimates
of the parameters, and identification can be problematic. Note that we are basing
everything on a single partial first order condition. Probably this f.o.c. is simply not informative
enough.
14.17 Empirical example: a portfolio model
The Octave program portfolio.m performs GMM estimation of a portfolio model, using the data file
tauchen.data. The columns of this data file are c, p, and d in that order. There are 95 observations
(source: Tauchen, JBES, 1986). As instruments we use lags of c and r, as well as a constant. For a
single lag the estimation results are
MPITB extensions found
******************************************************
Example of GMM estimation of rational expectations model
GMM Estimation Results
BFGS convergence: Normal convergence
Objective function value: 0.000014
Observations: 94
Value df p-value
X^2 test 0.001 1.000 0.971
estimate st. err t-stat p-value
beta 0.915 0.009 97.271 0.000
gamma 0.569 0.319 1.783 0.075
******************************************************
For two lags the estimation results are
MPITB extensions found
******************************************************
Example of GMM estimation of rational expectations model
GMM Estimation Results
BFGS convergence: Normal convergence
Objective function value: 0.037882
Observations: 93
Value df p-value
X^2 test 3.523 3.000 0.318
estimate st. err t-stat p-value
beta 0.857 0.024 35.636 0.000
gamma -2.351 0.315 -7.462 0.000
******************************************************
Pretty clearly, the results are sensitive to the choice of instruments. Maybe there is some problem here:
poor instruments, or possibly a conditional moment that is not very informative. Moment conditions
formed from Euler conditions sometimes do not identify the parameter of a model. See Hansen, Heaton
and Yaron, (1996) JBES V14, N3. I believe that this is the case here, though I haven't checked it
carefully.
Aside on ML estimation of RBC model. A similar model is the RBC model discussed by
Fernández-Villaverde: Fernández-Villaverde's RBC example. Files to estimate this model by maximum
likelihood are provided here. The main point for the purposes of this course is that methods other
than GMM based on the Euler equation do exist, and work better. For those of you who go on to do
empirical macro work, this example may be useful in the future.
14.18 Exercises
1. Do the exercises in section 14.10.
2. Show how the GIV estimator presented in section 14.10 can be adapted to account for an error
term with HET and/or AUT.
3. For the GIV estimator presented in section 14.10, find the form of the expressions $\mathcal{J}_\infty(\theta^0)$ and
$\mathcal{I}_\infty(\theta^0)$ that appear in the asymptotic distribution of the estimator, assuming that an efficient
weight matrix is used.
4. The Octave script meps.m estimates a model for office-based doctor visits (OBDV) using two
different moment conditions, a Poisson QML approach and an IV approach. If all conditioning
variables are exogenous, both approaches should be consistent. If the PRIV variable is endogenous,
only the IV approach should be consistent. Neither of the two estimators is efficient in any
case, since we already know that this data exhibits variability that exceeds what is implied by the
Poisson model (e.g., negative binomial and other models fit much better). Test the exogeneity of
the variable PRIV with a GMM-based Hausman-type test, using the Octave script hausman.m
for hints about how to set up the test.
5. Using Octave, generate data from the logit dgp. The script EstimateLogit.m should prove quite
helpful.
(a) Recall that $E(y_t|x_t) = p(x_t,\theta) = [1+\exp(-x_t'\theta)]^{-1}$. Consider the moment conditions
(exactly identified) $m_t(\theta) = [y_t - p(x_t,\theta)]x_t$. Estimate by GMM (using gmm_results),
using these moments.
(b) Estimate by ML (using mle_results).
(c) The two estimators should coincide. Prove analytically that the estimators coincide.
6. Verify the missing steps needed to show that $n \cdot m(\hat\theta)'\hat\Omega^{-1}m(\hat\theta)$ has a $\chi^2(g-K)$ distribution.
That is, show that the monster matrix is idempotent and has trace equal to $g-K$.
7. For the portfolio example, experiment with the program using lags of 3 and 4 periods to define
instruments.
(a) Iterate the estimation of $\theta = (\beta, \gamma)$ and $\Omega$ to convergence.
(b) Comment on the results. Are the results sensitive to the set of instruments used? Look at
$\hat\Omega$ as well as $\hat\theta$. Are these good instruments? Are the instruments highly correlated with one
another? Is there something analogous to collinearity going on here?
8. Run the Octave script GMM/chi2gmm.m with several sample sizes. Do the results you obtain
seem to agree with the consistency of the GMM estimator? Explain.
9. The GMM estimator with an arbitrary weight matrix has the asymptotic distribution
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \overset{d}{\to} N\left[0, (D_\infty W_\infty D_\infty')^{-1} D_\infty W_\infty \Omega_\infty W_\infty D_\infty' (D_\infty W_\infty D_\infty')^{-1}\right]$$
Suppose that you compute a GMM estimator using an arbitrary weight matrix, so that this
result applies. Carefully explain how you could test the hypothesis $H_0: R\theta^0 = r$ versus $H_A: R\theta^0 \neq r$,
where $R$ is a given $q \times k$ matrix, and $r$ is a given $q \times 1$ vector. I suggest that you use
a Wald test. Explain exactly what the test statistic is, and how to compute every quantity that
appears in the statistic.
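10. (proof that the GMM optimal weight matrix is one such that $W_\infty = \Omega_\infty^{-1}$) Consider the difference
of the asymptotic variance using an arbitrary weight matrix, minus the asymptotic variance using
the optimal weight matrix:
$$A = (D_\infty W_\infty D_\infty')^{-1} D_\infty W_\infty \Omega_\infty W_\infty D_\infty' (D_\infty W_\infty D_\infty')^{-1} - \left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}$$
Set $B = (D_\infty W_\infty D_\infty')^{-1} D_\infty W_\infty - (D_\infty \Omega_\infty^{-1} D_\infty')^{-1} D_\infty \Omega_\infty^{-1}$. Verify that $A = B\Omega_\infty B'$. What is
the implication of this? Explain.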
11. Recall the dynamic model with measurement error that was discussed in class:
$$y^*_t = \alpha + \rho y^*_{t-1} + \beta x_t + \epsilon_t$$
$$y_t = y^*_t + \upsilon_t$$
where $\epsilon_t$ and $\upsilon_t$ are independent Gaussian white noise errors. Suppose that $y^*_t$ is not observed,
and instead we observe $y_t$. We can estimate the equation
$$y_t = \alpha + \rho y_{t-1} + \beta x_t + \nu_t$$
using GIV, as was done above. The Octave script GMM/SpecTest.m performs a Monte Carlo
study of the performance of the GMM criterion test,
$$n \cdot s_n(\hat\theta) \overset{d}{\to} \chi^2(g-K)$$
Examine the script and describe what it does. Run this script to verify that the test over-rejects.
Increase the sample size, to determine if the over-rejection problem becomes less severe. Discuss
your findings.
Chapter 15
Models for time series data
Hamilton, Time Series Analysis is a good reference for this chapter.
Up to now we've considered the behavior of the dependent variable $y_t$ as a function of other variables
$x_t$. These variables can of course contain lagged dependent variables, e.g., $x_t = (w_t, y_{t-1}, ..., y_{t-j})$. Pure
time series methods consider the behavior of $y_t$ as a function only of its own lagged values, unconditional
on other observable variables. One can think of this as modeling the behavior of $y_t$ after marginalizing
out all other variables. While it's not immediately clear why a model that has other explanatory
variables should marginalize to a linear in the parameters time series model, most applied time series
work is done with linear models, though nonlinear time series is also a large and growing field.
Basic concepts
Definition 54. [Stochastic process] A stochastic process is a sequence of random variables, indexed
by time: $\{Y_t\}_{t=-\infty}^{\infty}$
Definition 55. [Time series] A time series is one observation of a stochastic process, over a specific
interval: $\{y_t\}_{t=1}^{n}$.
So a time series is a sample of size n from a stochastic process. It's important to keep in mind that
conceptually, one could draw another sample, and that the values would be different.
Definition 56. [Autocovariance] The $j^{th}$ autocovariance of a stochastic process is
$\gamma_{jt} = \mathcal{E}(y_t - \mu_t)(y_{t-j} - \mu_{t-j})$ where $\mu_t = \mathcal{E}(y_t)$.
Definition 57. [Covariance (weak) stationarity] A stochastic process is covariance stationary if it has
time constant mean and autocovariances of all orders:
$$\mu_t = \mu, \ \forall t$$
$$\gamma_{jt} = \gamma_j, \ \forall t$$
As we've seen, this implies that $\gamma_j = \gamma_{-j}$: the autocovariances depend only on the interval between
observations, but not on the time of the observations.
Definition 58. [Strong stationarity] A stochastic process is strongly stationary if the joint distribution
of an arbitrary collection of the $\{Y_t\}$ doesn't depend on t.
Since moments are determined by the distribution, strong stationarity $\Rightarrow$ weak stationarity (provided that the first two moments exist).
What is the mean of $Y_t$? The time series is one sample from the stochastic process. One could
think of M repeated samples from the stochastic process, e.g., $\{y_{tm}\}$. By a LLN, we would expect that, as $M \to \infty$,
$$\frac{1}{M}\sum_{m=1}^{M} y_{tm} \overset{p}{\to} \mathcal{E}(Y_t)$$
The problem is, we have only one sample to work with, since we can't go back in time and collect
another. How can $\mathcal{E}(Y_t)$ be estimated then? It turns out that ergodicity is the needed property.
Definition 59. [Ergodicity]. A stationary stochastic process is ergodic (for the mean) if the time
average converges to the mean
$$\frac{1}{n}\sum_{t=1}^{n} y_t \overset{p}{\to} \mu \qquad (15.1)$$
A sufficient condition for ergodicity is that the autocovariances be absolutely summable:
$$\sum_{j=0}^{\infty} |\gamma_j| < \infty$$
This implies that the autocovariances die off, so that the $y_t$ are not so strongly dependent that they
don't satisfy a LLN.
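To make the idea of ergodicity concrete, here is a minimal Python sketch (not one of the Octave scripts that accompany these notes). It assumes a stationary Gaussian AR(1) process with mean 2 and autoregressive coefficient 0.5, values chosen only for illustration; for such a process the autocovariances decline geometrically, so they are absolutely summable. The function name `ar1_time_average` is hypothetical.

```python
import random

def ar1_time_average(n, mu=2.0, rho=0.5, sigma=1.0, seed=0):
    """Time average of ONE simulated path of a stationary AR(1) with mean mu."""
    rng = random.Random(seed)
    # Draw the starting value from the stationary distribution,
    # so that every y_t has mean mu and the same variance.
    y = mu + rng.gauss(0.0, sigma / (1.0 - rho**2) ** 0.5)
    total = 0.0
    for _ in range(n):
        y = mu + rho * (y - mu) + rng.gauss(0.0, sigma)
        total += y
    return total / n

# A single realization is enough: the time average approaches mu as n grows.
for n in (100, 10_000, 200_000):
    print(n, round(ar1_time_average(n), 3))
```

Only one path is used for each n, which mirrors the situation described above: we cannot resample the process, but the time average still settles near the mean.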
Definition 60. [Autocorrelation] The $j^{th}$ autocorrelation, $\rho_j$, is just the $j^{th}$ autocovariance divided by
the variance:
$$\rho_j = \frac{\gamma_j}{\gamma_0} \qquad (15.2)$$
Definition 61. [White noise] White noise is just the time series literature term for a classical error.
$\epsilon_t$ is white noise if i) $\mathcal{E}(\epsilon_t) = 0, \forall t$, ii) $V(\epsilon_t) = \sigma^2, \forall t$ and iii) $\epsilon_t$ and $\epsilon_s$ are independent, $t \neq s$. Gaussian
white noise just adds a normality assumption.
15.1 ARMA models
With these concepts, we can discuss ARMA models. These are closely related to the AR and MA
error processes that we've already discussed. The main difference is that the lhs variable is observed
directly now.
MA(q) processes
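A $q^{th}$ order moving average (MA) process is
$$y_t = \mu + \epsilon_t + \theta_1\epsilon_{t-1} + \theta_2\epsilon_{t-2} + \cdots + \theta_q\epsilon_{t-q}$$
where $\epsilon_t$ is white noise. The variance is
$$\gamma_0 = \mathcal{E}(y_t - \mu)^2 = \mathcal{E}(\epsilon_t + \theta_1\epsilon_{t-1} + \theta_2\epsilon_{t-2} + \cdots + \theta_q\epsilon_{t-q})^2 = \sigma^2\left(1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2\right)$$
Similarly, the autocovariances are
$$\gamma_j = \mathcal{E}[(y_t - \mu)(y_{t-j} - \mu)] = \sigma^2(\theta_j + \theta_{j+1}\theta_1 + \theta_{j+2}\theta_2 + \cdots + \theta_q\theta_{q-j}), \ j \le q$$
$$\gamma_j = 0, \ j > q$$
Therefore an MA(q) process is necessarily covariance stationary and ergodic, as long as $\sigma^2$ and all of
the $\theta_j$ are finite.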
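The autocovariance formula for the MA(q) process can be checked numerically. The following Python sketch is an illustration only; the function name `ma_autocovariances` and the parameter values are assumptions made for the example. It uses the convention that the coefficient on the current shock is 1.

```python
def ma_autocovariances(theta, sigma2=1.0, max_lag=None):
    """Autocovariances gamma_0, ..., gamma_max_lag of an MA(q) process.

    theta holds (theta_1, ..., theta_q); the coefficient on the
    current shock (theta_0 = 1) is added internally.
    """
    coef = [1.0] + list(theta)
    q = len(theta)
    if max_lag is None:
        max_lag = q + 1  # one lag past q, to show the cutoff at zero
    gammas = []
    for j in range(max_lag + 1):
        if j > q:
            gammas.append(0.0)
        else:
            # sigma^2 * sum_{i=0}^{q-j} theta_{i+j} * theta_i
            gammas.append(sigma2 * sum(coef[i + j] * coef[i] for i in range(q - j + 1)))
    return gammas

print(ma_autocovariances([0.5]))        # MA(1): gamma_0, gamma_1, gamma_2
print(ma_autocovariances([0.5, -0.3]))  # MA(2): gamma_0, ..., gamma_3
```

The autocovariances are exactly zero beyond lag q, which is why absolute summability, and hence ergodicity, holds whenever the parameters are finite.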
It's not immediately obvious how to estimate the parameters of the MA model, because the $\epsilon_t$
are not observable. If they were, we could estimate by OLS. One can work with the joint density
of y. Suppose we have a MA(1) model with Gaussian errors. Then $y \sim N(\mu, \Sigma)$, where $\Sigma$ is obtained from the above
autocovariances (of order 0 and 1 when q=1). This can be used to write the likelihood, at the cost of
using a lot of computer memory. A more economical approach uses the Kalman filter, which we'll see
in the discussion of state space models.
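As a rough illustration of working with the joint density, the sketch below evaluates the exact Gaussian log-likelihood of an MA(1) model in Python. It is not the Kalman filter approach and it is not taken from the scripts that accompany these notes. It assumes Gaussian white noise, and rather than storing the full n x n covariance matrix it exploits the fact that, for an MA(1), that matrix is tridiagonal (variance on the diagonal, first autocovariance next to it, zeros elsewhere), so an LDL' factorization can be computed recursively. The function name `ma1_loglik` is hypothetical.

```python
import math

def ma1_loglik(y, mu, theta, sigma2):
    """Exact Gaussian log-likelihood of y_t = mu + e_t + theta*e_{t-1}."""
    gamma0 = sigma2 * (1.0 + theta**2)  # variance
    gamma1 = sigma2 * theta             # first autocovariance
    loglik = -0.5 * len(y) * math.log(2.0 * math.pi)
    d_prev = z_prev = None
    for obs in y:
        e = obs - mu
        if d_prev is None:
            d, z = gamma0, e
        else:
            l = gamma1 / d_prev      # sub-diagonal element of L
            d = gamma0 - l * gamma1  # diagonal element of D
            z = e - l * z_prev       # forward substitution: L z = y - mu
        # log-determinant and quadratic form accumulate term by term
        loglik -= 0.5 * (math.log(d) + z * z / d)
        d_prev, z_prev = d, z
    return loglik

print(ma1_loglik([1.0, 2.0, 0.5, -0.3], mu=0.0, theta=0.5, sigma2=1.0))
```

Maximizing this function over (mu, theta, sigma2) with a numerical optimizer would give the ML estimates; for higher order MA models the covariance matrix has a wider band, and the Kalman filter is the more general tool.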
AR(p) processes
An AR(p) process can be represented as
$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \epsilon_t$$
This is just a linear regression model, and assuming stationarity, we can estimate the parameters by
OLS. What is needed for stationarity?
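The dynamic behavior of an AR(p) process can be studied by writing this $p^{th}$ order difference
equation as a vector first order difference equation (this is known as the companion form):
$$\begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{bmatrix} = \begin{bmatrix} c \\ 0 \\ \vdots \\ 0 \end{bmatrix} + \begin{bmatrix} \phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix} \begin{bmatrix} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \end{bmatrix} + \begin{bmatrix} \epsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$
or
$$Y_t = C + FY_{t-1} + E_t$$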
With this, we can recursively work forward in time:
$$Y_{t+1} = C + FY_t + E_{t+1} = C + F(C + FY_{t-1} + E_t) + E_{t+1} = C + FC + F^2Y_{t-1} + FE_t + E_{t+1}$$
and
$$Y_{t+2} = C + FY_{t+1} + E_{t+2} = C + F\left(C + FC + F^2Y_{t-1} + FE_t + E_{t+1}\right) + E_{t+2} = C + FC + F^2C + F^3Y_{t-1} + F^2E_t + FE_{t+1} + E_{t+2}$$
or in general
$$Y_{t+j} = C + FC + \cdots + F^jC + F^{j+1}Y_{t-1} + F^jE_t + F^{j-1}E_{t+1} + \cdots + FE_{t+j-1} + E_{t+j}$$
Consider the impact of a shock in period t on $y_{t+j}$. This is simply
$$\left(\frac{\partial Y_{t+j}}{\partial E_t'}\right)_{(1,1)} = F^j_{(1,1)}$$
If the system is to be stationary, then as we move forward in time this impact must die off. Otherwise
a shock causes a permanent change in the mean of $y_t$. Therefore, stationarity requires that
$$\lim_{j \to \infty} F^j_{(1,1)} = 0$$
Save this result, we'll need it in a minute.
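Consider the eigenvalues of the matrix F. These are the $\lambda$ such that
$$|F - \lambda I_p| = 0$$
The determinant here can be expressed as a polynomial. For example, for p = 1, the matrix F is
simply
$$F = \phi_1$$
so
$$|\phi_1 - \lambda| = 0$$
can be written as
$$\phi_1 - \lambda = 0$$
When p = 2, the matrix F is
$$F = \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix}$$
so
$$F - \lambda I_p = \begin{bmatrix} \phi_1 - \lambda & \phi_2 \\ 1 & -\lambda \end{bmatrix}$$
and
$$|F - \lambda I_p| = \lambda^2 - \lambda\phi_1 - \phi_2$$
So the eigenvalues are the roots of the polynomial
$$\lambda^2 - \lambda\phi_1 - \phi_2$$
which can be found using the quadratic formula. This generalizes. For a $p^{th}$ order AR process, the
eigenvalues are the roots of
$$\lambda^p - \lambda^{p-1}\phi_1 - \lambda^{p-2}\phi_2 - \cdots - \lambda\phi_{p-1} - \phi_p = 0$$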
Supposing that all of the roots of this polynomial are distinct, then the matrix F can be factored as
$$F = T\Lambda T^{-1}$$
where T is the matrix which has as its columns the eigenvectors of F, and $\Lambda$ is a diagonal matrix with
the eigenvalues on the main diagonal. Using this decomposition, we can write
$$F^j = \left(T\Lambda T^{-1}\right)\left(T\Lambda T^{-1}\right) \cdots \left(T\Lambda T^{-1}\right)$$
where $T\Lambda T^{-1}$ is repeated j times. This gives
$$F^j = T\Lambda^j T^{-1}$$
and
$$\Lambda^j = \begin{bmatrix} \lambda_1^j & 0 & \cdots & 0 \\ 0 & \lambda_2^j & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_p^j \end{bmatrix}$$
Supposing that the $\lambda_i$, $i = 1, 2, ..., p$ are all real valued, it is clear that
$$\lim_{j \to \infty} F^j_{(1,1)} = 0$$
requires that
$$|\lambda_i| < 1, \ i = 1, 2, ..., p$$
i.e., the eigenvalues must be less than one in absolute value.
It may be the case that some eigenvalues are complex-valued. The previous result generalizes
to the requirement that the eigenvalues be less than one in modulus, where the modulus of a
complex number a + bi is
$$\mathrm{mod}(a + bi) = \sqrt{a^2 + b^2}$$
This leads to the famous statement that stationarity requires the roots of the determinantal
polynomial to lie inside the complex unit circle.
When there are roots on the unit circle (unit roots) or outside the unit circle, we leave the world
of stationary processes.
Dynamic multipliers: $\partial y_{t+j}/\partial \epsilon_t = F^j_{(1,1)}$ is a dynamic multiplier or an impulse-response function.
Real eigenvalues lead to steady movements, whereas complex eigenvalues lead to oscillatory
behavior. Of course, when there are multiple eigenvalues the overall effect can be a mixture.
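The eigenvalue condition and the dynamic multipliers can be explored numerically. Below is a small Python sketch for the AR(2) case, offered as an illustration under assumed parameter values rather than as part of the course materials. The eigenvalues are the roots of the quadratic polynomial given above, and the impulse response is read off the (1,1) element of successive powers of the companion matrix. All function names are hypothetical.

```python
import cmath

def ar2_eigenvalues(phi1, phi2):
    """Roots of lambda^2 - phi1*lambda - phi2 = 0 (eigenvalues of F)."""
    disc = cmath.sqrt(phi1**2 + 4.0 * phi2)
    return (phi1 + disc) / 2.0, (phi1 - disc) / 2.0

def is_stationary(phi1, phi2):
    """True if both eigenvalues are strictly inside the unit circle."""
    return all(abs(lam) < 1.0 for lam in ar2_eigenvalues(phi1, phi2))

def impulse_response(phi1, phi2, horizon):
    """[F^0_(1,1), F^1_(1,1), ..., F^horizon_(1,1)] for the AR(2) companion matrix."""
    F = [[phi1, phi2], [1.0, 0.0]]
    P = [[1.0, 0.0], [0.0, 1.0]]  # F^0 is the identity
    out = []
    for _ in range(horizon + 1):
        out.append(P[0][0])
        # P <- P * F (2x2 matrix product)
        P = [[sum(P[i][k] * F[k][j] for k in range(2)) for j in range(2)]
             for i in range(2)]
    return out

for phi1, phi2 in [(0.5, 0.3), (1.0, -0.5), (0.5, 0.5)]:
    moduli = [round(abs(lam), 3) for lam in ar2_eigenvalues(phi1, phi2)]
    irf = [round(v, 3) for v in impulse_response(phi1, phi2, 6)]
    print((phi1, phi2), "moduli:", moduli, "stationary:", is_stationary(phi1, phi2))
    print("   impulse response:", irf)
```

The first parameter pair has real eigenvalues and a steadily decaying response, the second has complex eigenvalues and an oscillating response, and the third has an eigenvalue equal to one, so the effect of a shock does not die out.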
Moments of AR(p) process
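The AR(p) process is
$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \epsilon_t$$
Assuming stationarity, $\mathcal{E}(y_t) = \mu, \forall t$, so
$$\mu = c + \phi_1\mu + \phi_2\mu + ... + \phi_p\mu$$
so
$$\mu = \frac{c}{1 - \phi_1 - \phi_2 - ... - \phi_p}$$
and
$$c = \mu - \phi_1\mu - ... - \phi_p\mu$$
so
$$y_t - \mu = \mu - \phi_1\mu - ... - \phi_p\mu + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \epsilon_t - \mu$$
$$= \phi_1(y_{t-1} - \mu) + \phi_2(y_{t-2} - \mu) + ... + \phi_p(y_{t-p} - \mu) + \epsilon_t$$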
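With this, the second moments are easy to find: The variance is
$$\gamma_0 = \phi_1\gamma_1 + \phi_2\gamma_2 + ... + \phi_p\gamma_p + \sigma^2$$
The autocovariances of orders $j \ge 1$ follow the rule
$$\gamma_j = \mathcal{E}[(y_t - \mu)(y_{t-j} - \mu)]$$
$$= \mathcal{E}[(\phi_1(y_{t-1} - \mu) + \phi_2(y_{t-2} - \mu) + ... + \phi_p(y_{t-p} - \mu) + \epsilon_t)(y_{t-j} - \mu)]$$
$$= \phi_1\gamma_{j-1} + \phi_2\gamma_{j-2} + ... + \phi_p\gamma_{j-p}$$
Using the fact that $\gamma_{-j} = \gamma_j$, one can take the p + 1 equations for j = 0, 1, ..., p, which have p + 1
unknowns ($\gamma_0, \gamma_1, ..., \gamma_p$), taking $\sigma^2$ and the $\phi_i$ as given, and solve for the unknowns. With these, the $\gamma_j$ for j > p can be solved
for recursively.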
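For the AR(2) case the system of equations for the autocovariances can be solved by hand, which gives a simple check of the recursion. The Python sketch below is illustrative only; the function name `ar2_autocovariances` and the parameter values are assumptions for the example. It first solves the equations for j = 1 and j = 2 in terms of autocorrelations, then uses the variance equation, then applies the recursion for higher lags.

```python
def ar2_autocovariances(phi1, phi2, sigma2=1.0, max_lag=5):
    """Autocovariances gamma_0..gamma_max_lag of a stationary AR(2)."""
    # j = 1: gamma_1 = phi1*gamma_0 + phi2*gamma_1  ->  rho_1 = phi1 / (1 - phi2)
    rho1 = phi1 / (1.0 - phi2)
    # j = 2: gamma_2 = phi1*gamma_1 + phi2*gamma_0  ->  rho_2 = phi1*rho_1 + phi2
    rho2 = phi1 * rho1 + phi2
    # j = 0: gamma_0 = phi1*gamma_1 + phi2*gamma_2 + sigma2
    gamma0 = sigma2 / (1.0 - phi1 * rho1 - phi2 * rho2)
    gammas = [gamma0, rho1 * gamma0, rho2 * gamma0]
    # Higher orders follow gamma_j = phi1*gamma_{j-1} + phi2*gamma_{j-2}
    for j in range(3, max_lag + 1):
        gammas.append(phi1 * gammas[j - 1] + phi2 * gammas[j - 2])
    return gammas[: max_lag + 1]

print([round(g, 4) for g in ar2_autocovariances(0.5, 0.3)])
```

Setting phi2 = 0 reproduces the familiar AR(1) result, with variance sigma2 / (1 - phi1^2) and autocovariances that decline geometrically.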
ARMA model
An ARMA(p, q) model is $(1 - \phi_1 L - \phi_2 L^2 - ... - \phi_p L^p)y_t = c + (1 + \theta_1 L + \theta_2 L^2 + ... + \theta_q L^q)\epsilon_t$, where L is the lag operator and the signs are chosen so that q = 0 gives the AR(p) process as written above. These
are popular in applied time series analysis. A high order AR process may be well approximated by a
low order MA process, and a high order MA process may be well approximated by a low order AR
process. By combining low order AR and MA processes in the same model, one can hope to fit a wide
variety of time series using a parsimonious number of parameters. There is much literature on how to
choose p and q, which is outside the scope of this course. Estimation can be done using the Kalman
filter, assuming that the errors are normally distributed.
15.2 VAR models
Consider the model
$$y_t = C + A_1 y_{t-1} + \epsilon_t \qquad (15.3)$$
$$E(\epsilon_t\epsilon_t') = \Sigma$$
$$E(\epsilon_t\epsilon_s') = 0, \ t \neq s$$
where $y_t$ and $\epsilon_t$ are $G \times 1$ vectors, C is a $G \times 1$ vector of constants, and $A_1$ is a $G \times G$ matrix of parameters.
The matrix $\Sigma$ is a $G \times G$ covariance matrix. Assume that we have n observations. This is a vector
autoregressive model, of order 1, commonly referred to as a VAR(1) model. It is a collection of G
AR(1) models, augmented to include lags of other endogenous variables, and the G equations are
contemporaneously correlated. The extension to a VAR(p) model is quite obvious.
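To see what "a collection of G AR(1) models, augmented to include lags of other endogenous variables" means, it may help to write the VAR(1) model out for G = 2. The element names $c_i$, $a_{ik}$ and $\sigma_{ik}$ below are notation introduced only for this illustration.

```latex
\begin{aligned}
y_{1t} &= c_1 + a_{11}\, y_{1,t-1} + a_{12}\, y_{2,t-1} + \epsilon_{1t} \\
y_{2t} &= c_2 + a_{21}\, y_{1,t-1} + a_{22}\, y_{2,t-1} + \epsilon_{2t}
\end{aligned}
\qquad
\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}
```

Each equation is an AR(1) in its own variable plus the lag of the other variable, both equations have the same regressors, and the contemporaneous correlation between the two equations enters through $\sigma_{12}$.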
As shown in Section 10.3, it is efficient to estimate a VAR model using OLS equation by equation;
there is no need to use GLS, in spite of the cross equation correlations.
A VAR model of this form can be thought of as the reduced form of a dynamic simultaneous
equations system, with all of the variables treated as endogenous, and with lags of all of the endogenous
variables present. The simultaneous equations model is (see equation 10.1)
$$Y_t'\Gamma = X_t'B + E_t'$$
which can be written after transposing (and adapting notation to use small case, pulling the constant
out of $X_t$ and using $v_t$ for the error) as $\Gamma' y_t = a + B'x_t + v_t$. Let $x_t = y_{t-1}$. Then we have $\Gamma' y_t = a + B'y_{t-1} + v_t$. Premultiplying by the inverse of $\Gamma'$ gives
$$y_t = (\Gamma')^{-1}a + (\Gamma')^{-1}B'y_{t-1} + (\Gamma')^{-1}v_t.$$
Finally define $C = (\Gamma')^{-1}a$, $A_1 = (\Gamma')^{-1}B'$ and $\epsilon_t = (\Gamma')^{-1}v_t$, and we have the VAR(1) model of
equation 15.3. C. Sims originally proposed reduced form VAR models as an alternative to structural
simultaneous equations models, which were perceived to require too many unrealistic assumptions for
their identification. However, the search for structural interpretations of VAR models slowly crept back
into the literature, leading to structural VARs. A structural VAR model is really just a dynamic
linear simultaneous equations model, with other imaginative and hopefully more realistic methods used
for identification. The issue of identifying the structural parameters $\Gamma$ and B is more or less the same
problem that was studied in the context of simultaneous equations. There, identification was obtained
through zero restrictions. In the structural VAR literature, zero restrictions are often used, but other
information may also be used, such as covariance matrix restrictions or sign restrictions. Interest often
focuses on the impulse-response functions. Identification of the impact of structural shocks (how to
estimate the impulse-response functions) is complicated, with many alternative methodologies, and is
often a topic of much disagreement among practitioners. The estimated impulse response functions
are often sensitive to the identification strategy that is used. There is a large literature. Papers by C.
Sims are a good place to start, if one wants to learn more. He also offers a good deal of useful software
on his web page.
An issue which arises when a VAR(p) model $y_t = C + A_1 y_{t-1} + \cdots + A_p y_{t-p} + \epsilon_t$ is contemplated is
that the number of parameters increases rapidly in p, which introduces severe collinearity problems.
One can use Bayesian methods such as the Minnesota prior (Litterman), which is a prior that each
variable separately follows a random walk (an AR(1) model with $\rho = 1$). The prior on $A_1$ is that it is
an identity matrix, and the prior on the $A_j$, j > 1 is that they are zero matrices. This can be done
using stochastic restrictions similar to those in the discussion of collinearity and ridge regression.
There is now a substantial body of literature on Bayesian VARs. An introduction to more formal Bayesian
methods is given in a chapter that follows. For highly parameterized models, Bayesian methods can
help to impose structure.
15.3 ARCH, GARCH and Stochastic volatility
ARCH (autoregressive conditionally heteroscedastic) models appeared in the literature in 1982, in
Engle, Robert F. (1982). "Autoregressive Conditional Heteroscedasticity with Estimates of Variance
of United Kingdom Inflation", Econometrica 50:987-1008. This paper stimulated a very large growth
in the literature for a number of years afterward. The related GARCH (generalized ARCH) model is
now one of the most widely used models for financial time series.
Financial time series often exhibit several types of behavior:
• volatility clustering: periods of low variation can be followed by periods of high variation
• fat tails, or excess kurtosis: the marginal density of a series is more strongly peaked and has
fatter tails than does a normal distribution with the same mean and variance.
• other features, such as leverage (correlation between returns and volatility) and perhaps slight
autocorrelation within the bounds allowed by arbitrage.
The data set nysewk.gdt, which is provided with Gretl, provides an example. If we compute 100
times the growth rate of the series, using log differences, we can obtain the plots in Figure 15.1. In the
first we clearly see volatility clusters, and in the second, we see excess kurtosis and tails fatter than
the normal distribution. The skewness suggests that leverage may be present.
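The transformation used for the plots, and a simple check for fat tails, can be sketched in a few lines of Python. This is an illustration only: the price list below is made up and is not the nysewk.gdt data, and the function names are hypothetical. A normal distribution has kurtosis equal to 3, so sample values well above 3 suggest fat tails.

```python
import math

def log_returns_pct(prices):
    """100 times the log differences of a price series."""
    return [100.0 * (math.log(p1) - math.log(p0))
            for p0, p1 in zip(prices, prices[1:])]

def sample_kurtosis(x):
    """Fourth central moment divided by the squared variance."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / m2**2

# Hypothetical closing prices, for illustration only.
prices = [100.0, 101.0, 100.5, 103.0, 97.0, 98.0, 98.5, 104.0, 103.5, 103.0]
r = log_returns_pct(prices)
print([round(v, 2) for v in r])
print("sample kurtosis:", round(sample_kurtosis(r), 2))
```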
The presence of volatility clusters indicates that the variance of the series is not constant over time,
conditional on past events. Engle's ARCH paper was the first to model this feature.
ARCH
A basic ARCH specification is
$$y_t = \mu + \rho y_{t-1} + \epsilon_t \equiv g_t + \epsilon_t$$
$$\epsilon_t = \sigma_t u_t$$
$$\sigma_t^2 = \omega + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2$$
where the $u_t$ are Gaussian white noise shocks. The ARCH variance is a moving average process.
Previous large shocks to the series cause the conditional variance of the series to increase. There is no
leverage: negative shocks have the same impact on the future variance as do positive shocks.
• for $\sigma_t^2$ to be positive for all realizations of $\{\epsilon_t\}$, we need $\omega > 0$, $\alpha_i \ge 0$, $\forall i$.
• to ensure that the model is covariance stationary, we need $\sum_i \alpha_i < 1$. Otherwise, the variances
will explode off to infinity.
Figure 15.1: NYSE weekly close price, 100 × log differences
(a) Time series plot
(b) Frequency distribution
Suppose that $\epsilon_t$ is conditionally normally distributed. To find the likelihood in terms of the observable $y_t$ instead of
the unobservable $\epsilon_t$, first note that the series $u_t = (y_t - g_t)/\sigma_t = \epsilon_t/\sigma_t$ is iid Gaussian, so the likelihood
is simply the product of standard normal densities.
$u \sim N(0, I)$, so
$$f(u) = \prod_{t=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u_t^2}{2}\right)$$
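The joint density for y can be constructed using a change of variables. We have $u_t = (y_t - \mu - \rho y_{t-1})/\sigma_t$,
so $\frac{\partial u_t}{\partial y_t} = \frac{1}{\sigma_t}$ and $\left|\frac{\partial u}{\partial y'}\right| = \prod_{t=1}^{n} \frac{1}{\sigma_t}$, so doing a change of variables,
$$f(y; \theta) = \prod_{t=1}^{n} \frac{1}{\sqrt{2\pi}} \frac{1}{\sigma_t} \exp\left(-\frac{1}{2}\left(\frac{y_t - \mu - \rho y_{t-1}}{\sigma_t}\right)^2\right)$$
where $\theta$ includes the parameters in $g_t$ and the parameters ($\omega$ and the $\alpha_i$) of the ARCH specification. Taking
logs,
$$\ln L(\theta) = -n \ln\sqrt{2\pi} - \sum_{t=1}^{n} \ln\sigma_t - \frac{1}{2}\sum_{t=1}^{n}\left(\frac{y_t - \mu - \rho y_{t-1}}{\sigma_t}\right)^2.$$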
In principle, this is easy to maximize. Some complications can arise when the restrictions for
positivity and stationarity are imposed. Consider a fairly short data series with low volatility in
the initial part, and high volatility at the end. This data appears to have a nonstationary variance
sequence. If one attempts to estimate an ARCH model with stationarity imposed, the data and the
restrictions are saying two different things, which can make maximization of the likelihood function
difficult.
The Octave script ArchExample.m illustrates estimation of an ARCH(1) model, using the NYSE
closing price data.
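For readers who want to see the ARCH log-likelihood written as code, here is a minimal Python sketch for the ARCH(1) case. It is not a translation of ArchExample.m, whose contents are not reproduced here. Two conventions are assumed for the illustration: the first observation is used only to supply the initial lag, and the pre-sample squared shock is set to the unconditional variance. The function name `arch1_loglik` is hypothetical.

```python
import math

def arch1_loglik(y, mu, rho, omega, alpha):
    """Gaussian log-likelihood of y_t = mu + rho*y_{t-1} + eps_t with
    sigma_t^2 = omega + alpha*eps_{t-1}^2 (ARCH(1)).

    Assumed conventions: y[0] only supplies the initial lag, and the
    pre-sample squared shock equals omega / (1 - alpha).
    """
    if omega <= 0.0 or not 0.0 <= alpha < 1.0:
        return -math.inf  # outside the positivity/stationarity region
    eps2_prev = omega / (1.0 - alpha)
    loglik = 0.0
    for t in range(1, len(y)):
        sigma2 = omega + alpha * eps2_prev
        eps = y[t] - mu - rho * y[t - 1]
        # -ln sqrt(2*pi) - ln sigma_t - 0.5*(eps/sigma_t)^2
        loglik -= 0.5 * (math.log(2.0 * math.pi) + math.log(sigma2)
                         + eps * eps / sigma2)
        eps2_prev = eps * eps
    return loglik

y = [0.1, -0.4, 1.3, -2.0, 0.2, 0.1, -0.3, 2.5, -1.8, 0.4]  # made-up data
print(arch1_loglik(y, mu=0.0, rho=0.1, omega=0.5, alpha=0.4))
```

Returning minus infinity outside the admissible region is one simple way to impose the restrictions when the function is passed to a numerical optimizer; as noted above, such restrictions can make the maximization harder when the data disagree with them.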
GARCH
Note that an ARCH model specifies the variance process as a moving average. For the same reason
that an ARMA model may be used to parsimoniously model a series instead of a high order AR or
MA, one can do the same thing for the variance series. A basic GARCH(p,q) (Bollerslev, Tim (1986).
"Generalized Autoregressive Conditional Heteroskedasticity", Journal of Econometrics, 31:307-327)
specification is
$$y_t = \mu + \rho y_{t-1} + \epsilon_t$$
$$\epsilon_t = \sigma_t u_t$$
$$\sigma_t^2 = \omega + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{i=1}^{p} \beta_i \sigma_{t-i}^2$$
The idea is that a GARCH model with low values of p and q may fit the data as well or better than
an ARCH model with large q.
• the model also requires restrictions for positive variance and stationarity, which are:
  $\omega > 0$
  $\alpha_i \ge 0, \ i = 1, ..., q$
  $\beta_i \ge 0, \ i = 1, ..., p$
  $\sum_{i=1}^{q} \alpha_i + \sum_{i=1}^{p} \beta_i < 1$.
• to estimate a GARCH model, you need to initialize $\sigma_0^2$ at some value. The sample unconditional
variance is one possibility. Another choice could be the sample variance of the initial elements
of the sequence. One can also backcast the conditional variance.
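The role of the initial value is easy to see when the variance recursion is written out. The Python sketch below covers the GARCH(1,1) case and initializes the recursion at the sample variance, which is one of the options just mentioned; it is an illustration with made-up numbers, not a reproduction of any script from these notes, and the function name `garch11_variances` is hypothetical.

```python
def garch11_variances(eps, omega, alpha, beta, sigma2_init=None):
    """Conditional variances from
    sigma_t^2 = omega + alpha*eps_{t-1}^2 + beta*sigma_{t-1}^2.

    If sigma2_init is None, the recursion starts at the sample
    variance of eps (an assumed default for this sketch).
    """
    if sigma2_init is None:
        mean = sum(eps) / len(eps)
        sigma2_init = sum((e - mean) ** 2 for e in eps) / len(eps)
    sigma2 = [sigma2_init]
    for t in range(1, len(eps)):
        sigma2.append(omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1])
    return sigma2

eps = [0.2, -1.5, 0.3, 2.1, -0.4, 0.1]  # made-up residuals
print([round(s, 3) for s in garch11_variances(eps, omega=0.1, alpha=0.2, beta=0.7)])
```

Because beta multiplies the lagged variance, the influence of the starting value decays geometrically; it matters most in short samples or when beta is close to one.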
The GARCH model also requires restrictions on the parameters to ensure stationarity and positivity
of the variance. A useful modification is the EGARCH model (exponential GARCH, Nelson, D. B.
(1991). "Conditional heteroskedasticity in asset returns: A new approach", Econometrica 59: 347-
370). This model treats the logarithm of the variance as an ARMA process, so the variance will be
positive without restrictions on the parameters. It is also possible to introduce asymmetry (leverage)
and non-normality.
The Octave script GarchExample.m illustrates estimation of a GARCH(1,1) model, using the NYSE
closing price data. You can get the same results more quickly using Gretl, which takes advantage of
C code for the model. If you play with the example, you can see that the results are sensitive to start
values. The likelihood function does not appear to have a nice well-defined global maximum. Thus,
one needs to use care when estimating this sort of model, or rely on some software that is known to
work well.
Note that the test of homoscedasticity against ARCH or GARCH involves parameters being on
the boundary of the parameter space. Also, the reduction of GARCH to ARCH has the same problem.
Testing needs to be done taking this into account. See Demos and Sentana (1998) Journal of
Econometrics.
Stochastic volatility
In ARCH and GARCH models, the same shocks that affect the level also affect the variance. The
stochastic volatility model allows the variance to have its own random component. A simple example
is
$$y_t = \exp(h_t)\epsilon_t$$
$$h_t = \alpha + \rho h_{t-1} + \sigma\eta_t$$
In this model, the log of the standard error of the observed sequence follows an AR(1) model. One can
introduce leverage by allowing correlation between $\epsilon_t$ and $\eta_t$. Variants of this sort of model are widely
used to model financial data, competing with the GARCH(1,1) model for being the most popular
choice. Many estimation methods have been proposed.
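A quick simulation shows that even this simple stochastic volatility model generates fat tails. The Python sketch below assumes independent standard normal shocks in both equations (so there is no leverage), an AR(1) for the log standard deviation with constant 0, autoregressive coefficient 0.9 and shock scale 0.2; these values, and the function names, are assumptions chosen only for illustration.

```python
import math
import random

def simulate_sv(n, alpha=0.0, rho=0.9, sigma=0.2, seed=0):
    """Simulate y_t = exp(h_t)*eps_t with h_t = alpha + rho*h_{t-1} + sigma*eta_t."""
    rng = random.Random(seed)
    h = alpha / (1.0 - rho)  # start at the unconditional mean of h_t
    y = []
    for _ in range(n):
        h = alpha + rho * h + sigma * rng.gauss(0.0, 1.0)  # volatility shock
        y.append(math.exp(h) * rng.gauss(0.0, 1.0))        # level shock
    return y

def sample_kurtosis(x):
    """Fourth central moment divided by the squared variance."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / m2**2

y = simulate_sv(20_000)
print("sample kurtosis:", round(sample_kurtosis(y), 2))  # normal data would give about 3
```

The observed series is a scale mixture of normals, so its kurtosis exceeds 3 even though both shocks are Gaussian, and the persistence in the log standard deviation produces volatility clusters.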
15.4 State space models
See Fernández-Villaverde's Kalman filter notes and Mikusheva's MIT
OpenCourseWare notes, lectures 20 and 21. I will follow those notes
in class.
For nonlinear state space models, or non-Gaussian state space models, the basic Kalman filter
cannot be used, and the particle filter is becoming a widely-used means of computing the likelihood.
This is a fairly new, computationally demanding technique, and is currently (this was written in 2013)
an active area of research. Papers by Fernández-Villaverde and Rubio-Ramírez provide interesting and
reasonably accessible applications in the context of estimating macroeconomic (DSGE) models.
15.5 Nonstationarity and cointegration
I'm going to follow Karl Whelan's notes, which are available at Whelan notes.
15.6 Exercises
1. Use Matlab to estimate the same GARCH(1,1) model as in the GarchExample.m script provided
above. Also, estimate an ARCH(4) model for the same data. If unconstrained estimation
does not satisfy stationarity restrictions, then do constrained estimation. Compare likelihood
values. Which of the two models do you prefer? But do the models have the same number
of parameters? Find out what the consistent Akaike information criterion and the Bayes
information criterion are, and what they are used for. Compute one or the other, or both, and
discuss what they tell you about selecting between the two models.
Chapter 16
Bayesian methods
References I have used to prepare these notes: Cameron and Trivedi, Microeconometrics: Methods
and Applications, Chapter 13; Chernozhukov and Hong (2003), An MCMC approach to classical
estimation, Journal of Econometrics; Gallant and Tauchen, EMM: A program for efficient method
of moments estimation; Hoogerheide, van Dijk and van Oest (2007) Simulation Based Bayesian
Econometric Inference: Principles and Some Recent Computational Advances. You might also like to
read Mikusheva's MIT OpenCourseWare notes, lectures 23, 24 and 25.
This chapter provides a brief introduction to Bayesian methods, which form a large part of econometric
research, especially in the last two decades. Advances in computational methods (e.g., MCMC,
particle filtering), combined with practical advantages of Bayesian methods (e.g., no need for minimization
and improved identification coming from the prior) have contributed to the popularity of this
approach.
16.1 Denitions
The Bayesian approach treats the parameter of a model as a random vector. The parameter has a
density, $\pi(\theta)$, which is known as the prior. It is assumed that the econometrician can provide this
density, which reflects current beliefs about the parameter.
We also have sample information, $y = \{y_1, y_2, ..., y_n\}$. We're already familiar with the likelihood
function, $f(y|\theta)$, which is the density of the sample given a parameter value.
Given these two pieces, we can write the joint density of the sample and the parameter:
$$f(y, \theta) = f(y|\theta)\pi(\theta)$$
We can get the marginal likelihood by integrating out the parameter, integrating over its support $\Theta$:
$$f(y) = \int_\Theta f(y, \theta)d\theta$$
The last step is to get the posterior of the parameter. This is simply the density of the parameter
conditional on the sample, and we get it in the normal way we get a conditional density, using Bayes'
theorem
$$f(\theta|y) = \frac{f(y, \theta)}{f(y)} = \frac{f(y|\theta)\pi(\theta)}{f(y)}$$
The posterior reflects the learning that occurs about the parameter when one receives the sample
information. The sources of information used to make the posterior are the prior and the likelihood
function. Once we have the posterior, we can provide a complete probabilistic description of our
updated beliefs about the parameter, using quantiles or moments of the posterior. The posterior mean
or median provides the Bayesian analogue of the frequentist point estimator in the form of the ML
estimator. We can define regions analogous to confidence intervals by using quantiles of the posterior,
or the marginal posterior.
So far, this is pretty straightforward. The complications are often computational. To illustrate,
the posterior mean is
$$E(\theta|y) = \int_\Theta \theta f(\theta|y)d\theta = \frac{\int_\Theta \theta f(y|\theta)\pi(\theta)d\theta}{\int_\Theta f(y, \theta)d\theta}$$
One can see that a means of integrating will be needed. Only in very special cases will the integrals
have analytic solutions. Otherwise, computational methods will be needed.
16.2 Philosophy, etc.
So, the classical paradigm views the data as generated by a data generating process, which is a perhaps
unknown model characterized by a parameter vector, and the data is generated from the model at a
particular value of the parameter vector. Bayesians view data as given, and update beliefs about a
random parameter using the information about the parameter contained in the data.
Bayesians and frequentists have a long tradition of arguing about the meaning and interpretation of
their respective procedures. Here's my take on the debate. Fundamentally, I take the frequentist view:
I find it pleasing to think about a model with a fixed non-random parameter about which we would like
to learn. I like the idea of a point estimator that gives a best guess about the true parameter. However,
we shouldn't reinvent the wheel each time we get a new sample: previous samples have information
about the parameter, and we should use all of the available information. A pure frequentist approach
would require writing the joint likelihood of all samples, which would almost certainly constitute an
impossible task. The Bayesian approach concentrates all of the information coming from previous
work in the form of a prior. A fairly simple, easy to use prior may not exactly capture all previous
information, but it could offer a handy and reasonably accurate summary. So, the idea of a prior as
a summary of what we have learned may simply be viewed as a practical solution to the problem of
using all the available information. Given that it's a summary, one may as well use a convenient form,
as long as it's plausible and the results don't depend too heavily on the prior.
About the likelihood function, fortunately, Bayesians and frequentists are in agreement, so there's
no need for further comment.
When we get to how to generate and interpret results, there is some divergence. Frequentists
maximize the likelihood function, and compute standard errors, etc., using the methods already explained
in these notes. A frequentist could test the hypothesis that $\theta_0 = \theta^*$ by seeing if the data are
sufficiently likely conditional on the parameter value $\theta^*$. A Bayesian would check if $\theta^*$ is a plausible
value conditional on the observed data.
I have criticized the frequentist practice of using only the current sample, ignoring what previous
work has told us about the parameter, simply because it's too hard to write the overall joint likelihood
for all samples. So to be fair, here's a criticism of the Bayesian approach. If we're doing Bayesian
learning, what is it we're learning about? If it's not a fixed parameter value then what is it? What is
the process that generated the sample data? If the parameter is random, was the sample generated at a
single realization, or at many realizations? If we had an infinite sample, then the Bayesian estimators
(e.g., posterior mean or median) converge to a point. What is that point if it's not the same true
parameter value that the frequentists are trying to estimate? Why would one use noninformative
priors for one's whole career - don't we believe what we learned from the last paper we wrote? These
questions often receive no answer, or obscure answers.
It turns out that one can analyze Bayesian estimators from a classical (frequentist) perspective.
It also turns out that Bayesian estimators may be easier to compute reliably than analogous classical
estimators. These computational advantages, combined with the ability to use information from
previous work in an intelligent way, make the study of Bayesian methods attractive for frequentists.
If a Bayesian takes the view that there is a fixed data generating process, and Bayesian learning leads
in the limit to the same fixed true value that frequentists posit, then the study of frequentist theory
will be useful to a Bayesian practitioner.
For the rest of this, I will adopt the classical, frequentist perspective, and study the behavior of
Bayesian estimators in this context.
16.3 Example
Suppose data is generated by i.i.d. sampling from an exponential distribution with mean . An
exponential random variable takes values on the positive real numbers. Waiting times are often
modeled using the exponential distribution.
The density of a typical sample element is f(y[) =
1

e
y/
. The likelihood is simply the product
of the sample contributions.
Suppose the prior for is lognormal(1,1). This means that the logarithm of is standard
normal. We use a lognormal prior because it enforces the requirement that the parameter of the
exponential density be positive.
The Octave script BayesExample1.m implements Bayesian estimation for this setup.
With a sample of 10 observations, we obtain the results in panel (a) of Figure 16.1, while with a
sample of size 50 we obtain the results in panel (b). Note how the posterior is more concentrated
around the true parameter value in panel (b). Also note how the posterior mean is closer to
the prior mean when the sample is small. When the sample is small, the likelihood function has
less weight, and more of the information comes from the prior. When the sample is larger, the
likelihood function will have more weight, and its effect will dominate the prior's.
Figure 16.1: Bayesian estimation, exponential likelihood, lognormal prior. (a) N=10 (b) N=50
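To make the example concrete, here is a minimal Python sketch (not the BayesExample1.m script itself) that evaluates the posterior on a grid and computes the posterior mean by a simple Riemann sum. It assumes that lognormal(1,1) means that $\log\theta \sim N(1,1)$; the true value $\theta_0 = 3$, the grid limits and the function name `posterior_on_grid` are illustrative choices, not taken from the text.

```python
import numpy as np

def posterior_on_grid(y, grid, prior_mu=1.0, prior_sigma=1.0):
    """Normalized posterior density of theta on an evenly spaced grid."""
    n = y.size
    # exponential log-likelihood with mean theta: sum of -log(theta) - y_i/theta
    loglik = -n * np.log(grid) - y.sum() / grid
    # lognormal log-prior, up to a constant (assumed: log(theta) ~ N(prior_mu, prior_sigma^2))
    logprior = -np.log(grid) - (np.log(grid) - prior_mu) ** 2 / (2 * prior_sigma ** 2)
    logpost = loglik + logprior
    kernel = np.exp(logpost - logpost.max())  # subtract the max for numerical stability
    step = grid[1] - grid[0]
    return kernel / (kernel.sum() * step)     # normalize so the density integrates to 1

rng = np.random.default_rng(0)
theta_true = 3.0                              # illustrative true value
grid = np.linspace(0.01, 20.0, 4000)
step = grid[1] - grid[0]
results = {}
for n in (10, 50):
    y = rng.exponential(scale=theta_true, size=n)
    post = posterior_on_grid(y, grid)
    mean = np.sum(grid * post) * step
    sd = np.sqrt(np.sum((grid - mean) ** 2 * post) * step)
    results[n] = (mean, sd)
    print(f"n={n}: posterior mean {mean:.3f}, posterior sd {sd:.3f}")
```

Changing the sample size and the prior parameters shows how the weight shifts between the prior and the likelihood.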
16.4 Theory
Chernozhukov and Hong (2003), "An MCMC Approach to Classical Estimation", http://www.sciencedirect.com/science/article/pii/S0304407603001003,
is a very interesting article that shows how Bayesian
methods may be used with criterion functions that are associated with classical estimation techniques.
For example, it is possible to compute a posterior mean version of a GMM estimator. Chernozhukov
and Hong provide their Theorem 2, which proves consistency and asymptotic normality for a general
class of such estimators. When the criterion function $L_n(\theta)$ in their paper is set to the log-likelihood
function, the pseudo-prior $\pi(\theta)$ is a real Bayesian prior, and the penalty function $\rho_n$ is the squared
loss function (see the paper), then the class of estimators discussed by CH reduces to the ordinary
Bayesian posterior mean. As such, their Theorem 2, in Figure 16.2, tells us that this estimator is
consistent and asymptotically normally distributed. In particular, the Bayesian posterior mean has
the same asymptotic distribution as does the ordinary maximum likelihood estimator.
Figure 16.2: Chernozhukov and Hong, Theorem 2
The intuition is clear: as the amount of information coming from the sample increases, the
likelihood function brings an increasing amount of information relative to the prior. Eventually,
the prior is no longer important for determining the shape of the posterior.
When the sample is large, the shape of the posterior depends on the likelihood function. The
likelihood function collapses around $\theta_0$ when the sample is generated at $\theta_0$. The same is true
of the posterior: it narrows around $\theta_0$. This causes the posterior mean to converge to the true
parameter value. In fact, all quantiles of the posterior converge to $\theta_0$. Chernozhukov and Hong
discuss estimators defined using quantiles.
For an econometrician coming from the frequentist perspective, this is attractive. The Bayesian
estimator has the same asymptotic behavior as the MLE. There may be computational advantages
to using the Bayesian approach, because there is no need for optimization. If the objective
function that defines the classical estimator is irregular (multiple local optima, nondifferentiabilities,
discontinuities...), then optimization may be very difficult. However, Bayesian methods
that use integration may be more tractable. This is the main motivation of CH's paper. Additional
advantages include the benefits of an informative prior, if one is available. When this is the case,
the Bayesian estimator can have better small sample performance than the maximum likelihood
estimator.
16.5 Computational methods
To compute the posterior mean, we need to evaluate
$$E(\theta|y) = \int_\Theta \theta f(\theta|y)\,d\theta = \frac{\int_\Theta \theta f(y|\theta)\pi(\theta)\,d\theta}{\int_\Theta f(y,\theta)\,d\theta}.$$
Note that both of the integrals are multiple integrals, with the dimension given by that of the
parameter, $\theta$.
Under some special circumstances, the integrals may have analytic solutions: e.g., Gaussian
likelihood with a Gaussian prior leads to a Gaussian posterior.
When the dimension of the parameter is low, quadrature methods may be used. What was done
in BayesExample1.m is an unsophisticated example of this. More sophisticated
methods use an intelligently chosen grid to reduce the number of function evaluations. Still,
these methods only work for dimensions up to 3 or so.
Otherwise, some form of simulation-based "Monte Carlo" integration must be used. The basic
idea is that $E(\theta|y)$ can be approximated by $(1/S)\sum_{s=1}^{S}\theta^s$, where $\theta^s$ is a random draw from the
posterior distribution $f(\theta|y)$. The trick is how to make draws from the posterior when in general
we can't compute the posterior.
The law of large numbers tells us that this average will converge to the desired expectation
as $S$ gets large.
Convergence will be more rapid if the random draws are independent of one another, but
insisting on independence may have computational drawbacks.
Monte Carlo methods include importance sampling, Markov chain Monte Carlo (MCMC) and
sequential Monte Carlo (SMC, also known as particle filtering). The great expansion of these
methods over the years has caused Bayesian econometrics to become much more widely used
than it was in the not so distant (for some of us) past. There is much literature; here we will
only look at a basic example that captures the main ideas.
MCMC
Variants of Markov chain Monte Carlo have become a very widely used means of computing Bayesian
estimates. See Tierney (1994), "Markov Chains for Exploring Posterior Distributions", Annals of
Statistics, and Chib and Greenberg (1995), "Understanding the Metropolis-Hastings Algorithm", The American
Statistician.
Let's consider the basic Metropolis-Hastings MCMC algorithm. We will generate a long realization
of a Markov chain process for $\theta$, as follows:
The prior density is $\pi(\theta)$, as above. Let $g(\theta^*; \theta^s)$ be a proposal density, which generates a new trial
parameter value $\theta^*$ given the most recently accepted parameter value $\theta^s$. A proposal will be accepted
if
$$\frac{f(\theta^*|y)}{f(\theta^s|y)} \frac{g(\theta^s; \theta^*)}{g(\theta^*; \theta^s)} > \alpha$$
where $\alpha$ is a $U(0, 1)$ random variate.
There are two parts to the numerator and denominator: the posterior, and the proposal density.
Focusing on the numerator, when the trial value of the proposal has a higher posterior, acceptance
is favored. The other factor is the density associated with returning to $\theta^s$ when starting at $\theta^*$, which
has to do with the reversibility of the Markov chain. If this is too low, acceptance is not favored.
We don't want to jump to a new region if we will never get back, as we need to sample from the
entire support of the posterior. The two together mean that we will jump to a new area only if we
are able to eventually jump back with a reasonably high probability. The probability of jumping is
higher when the new area has a higher posterior density, but lower if it's hard to get back. The idea
is to sample from all regions of the posterior, those with high and low density, sampling more heavily
from regions of high density. We want to go occasionally to regions of low density, but it is important
not to get stuck there. Consider a bimodal density: we want to explore the area around both modes.
To be able to do that, it is important that the proposal density allows us to jump between
modes. Understanding in detail why this makes sense is the tricky and elegant part of the theory; see
the references for more information.
Note that the ratio of posteriors is equal to the ratio of likelihoods times the ratio of priors:
$$\frac{f(\theta^*|y)}{f(\theta^s|y)} = \frac{f(y|\theta^*)}{f(y|\theta^s)} \frac{\pi(\theta^*)}{\pi(\theta^s)}$$
because the marginal likelihood $f(y)$ is the same in both cases. We don't need to compute that
integral! We don't need to know the posterior, either. The acceptance criterion can be written
as: accept if
$$\frac{f(y|\theta^*)}{f(y|\theta^s)} \frac{\pi(\theta^*)}{\pi(\theta^s)} \frac{g(\theta^s; \theta^*)}{g(\theta^*; \theta^s)} > \alpha$$
otherwise, reject.
From this, we see that the information needed to determine if a proposal is accepted or rejected
is the prior, the proposal density, and the likelihood function $f(y|\theta)$.
The steps are:
1. the algorithm is initialized at some $\theta^1$
2. for $s = 1, 2, ..., S-1$,
(a) draw $\theta^*$ from $g(\theta^*; \theta^s)$
(b) according to the acceptance/rejection criterion, if the result is acceptance, set $\theta^{s+1} = \theta^*$,
otherwise set $\theta^{s+1} = \theta^s$
(c) iterate
Once the chain is considered to have stabilized, say at iteration $r$, the values of $\theta^s$ for $s > r$ are
taken to be draws from the posterior. The posterior mean is computed as the simple average of
these values. Quantiles, etc., can be computed in the appropriate fashion.
The art of applying these methods consists of providing a good candidate density so that the
acceptance rate is reasonably high. Otherwise, the chain will be highly autocorrelated, with long
intervals where the same value of $\theta$ appears. There is a vast literature on this, and the vastness
of the literature should serve as a warning that getting this to work in practice is not necessarily
a simple matter. If it were, there would be fewer papers on the topic.
Too high acceptance rate: this is usually due to a proposal density that gives proposals very
close to the current value, e.g., a random walk with very low variance. This means that the
posterior is being explored inefficiently: we travel around through the support at a very low
rate, which means the chain will have to run for a long time to do a thorough exploration.
Too low acceptance rate: this means that the steps are too large, and we move to low
posterior density regions too frequently. The chain will become highly autocorrelated, so
long periods convey little additional information relative to a subset of the values in the
interval.
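The algorithm and the tuning issue can be sketched in a few lines of Python. This is a minimal illustration, not the BayesExample2.m script: it assumes the exponential likelihood with a prior under which $\log\theta \sim N(1,1)$, and it uses a Gaussian random-walk proposal, which is symmetric, so the ratio of proposal densities equals 1 and drops out of the acceptance criterion. The names `log_posterior_kernel`, `metropolis_hastings` and `tuning` are hypothetical.

```python
import numpy as np

def log_posterior_kernel(theta, y, prior_mu=1.0, prior_sigma=1.0):
    """log f(y|theta) + log pi(theta), up to an additive constant."""
    if theta <= 0:
        return -np.inf                      # outside the support: never accepted
    loglik = -y.size * np.log(theta) - y.sum() / theta
    logprior = -np.log(theta) - (np.log(theta) - prior_mu) ** 2 / (2 * prior_sigma ** 2)
    return loglik + logprior

def metropolis_hastings(y, n_draws=20000, tuning=0.5, theta_init=1.0, seed=0):
    """Random-walk Metropolis-Hastings; returns the chain and the acceptance rate."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_draws)
    chain[0] = theta_init
    current_lp = log_posterior_kernel(theta_init, y)
    accepted = 0
    for s in range(1, n_draws):
        # symmetric proposal centered at the current value
        trial = chain[s - 1] + tuning * rng.standard_normal()
        trial_lp = log_posterior_kernel(trial, y)
        # accept if the ratio exceeds alpha ~ U(0,1); the comparison is done in logs
        if np.log(rng.uniform()) < trial_lp - current_lp:
            chain[s] = trial
            current_lp = trial_lp
            accepted += 1
        else:
            chain[s] = chain[s - 1]
    return chain, accepted / (n_draws - 1)

rng = np.random.default_rng(1)
y = rng.exponential(scale=3.0, size=50)     # illustrative data, true mean 3
chain, rate = metropolis_hastings(y)
burn = 2000                                 # draws discarded before the chain has stabilized
post_mean = chain[burn:].mean()
print(f"acceptance rate {rate:.2f}, posterior mean {post_mean:.3f}")
```

Rerunning with a very small or a very large value of `tuning` shows the two failure modes: an acceptance rate near one with tiny steps, or long runs of rejections.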
16.6 Examples
MCMC for the simple example
The simple exponential example with lognormal prior can be implemented using MH MCMC, and
this is done in the Octave script BayesExample2.m. Play around with the sample size and the tuning
parameter, and note the effects on the computed posterior mean and on the acceptance rate. An
example of output is given in Figure 16.3. In that figure, the chain shows relatively long periods of
rejection, meaning that the tuning parameter needs to be lowered, to cause the random walk to be a
little less random.
Figure 16.3: Metropolis-Hastings MCMC, exponential likelihood, lognormal prior
Bayesian VAR with Minnesota priors
Consider a VAR(p) model, where the data have been de-meaned:
$$y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + \epsilon_t$$
This follows an SUR structure with the same regressors in every equation, so OLS estimation is appropriate, even though we expect that $V(\epsilon_t) = \Sigma$
is a full $G \times G$ matrix (heteroscedasticity and autocorrelation, which would normally lead one to think
of GLS estimation). As was previously noted, a problem with the estimation of this model is that
the number of parameters increases rapidly in the number of lags, $p$. One can use Bayesian methods
such as the "Minnesota prior" (Doan, T., Litterman, R., Sims, C. (1984). "Forecasting and conditional
projection using realistic prior distributions". Econometric Reviews 3: 1-100), which is a prior that
each variable separately follows a random walk (an AR(1) model with $\rho = 1$). The prior on $A_1$ is that
it is an identity matrix, and the prior on the $A_j$, $j > 1$, is that they are zero matrices. This can be done
using stochastic restrictions similar to what was done in the discussion of collinearity and ridge regression.
To be specific, note that the model can be written as
$$Y = Y_{-1} A_1' + Y_{-2} A_2' + \cdots + Y_{-p} A_p' + E$$
where
$$Y = \begin{bmatrix} y_1' \\ y_2' \\ \vdots \\ y_n' \end{bmatrix}$$
is the $n \times G$ matrix of all the data, and the right hand side $Y$'s are this matrix, lagged the indicated
number of times. The initial data with missing lags has been dropped, and $n$ refers to the number of
complete observations, including all needed lags.
Exercise 62. Convince yourself that this matrix representation is the same as $y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + \epsilon_t$,
just writing all observations at once, and transposing.
Now, consider the prior that each variable separately follows a random walk. If this were exactly
true, then $A_1 = I_G$, and all the $A_s = 0_G$, a $G \times G$ matrix of zeros, for $s = 2, 3, ..., p$. Consider the prior
$$A_1 \sim N(I_G, \sigma_1^2 I_G)$$
$$A_2 \sim N(0_G, \sigma_2^2 I_G)$$
$$\vdots$$
$$A_p \sim N(0_G, \sigma_p^2 I_G)$$
and all of the matrices of parameters are independent of one another. In the same way we formulated
the ridge regression estimator in Section 7.1, we can write the model and priors as
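$$\begin{bmatrix} Y \\ I_G \\ 0_G \\ \vdots \\ 0_G \end{bmatrix} = \begin{bmatrix} Y_{-1} & Y_{-2} & \cdots & Y_{-p} \\ I_G & 0_G & \cdots & 0_G \\ 0_G & I_G & \cdots & 0_G \\ \vdots & & \ddots & \vdots \\ 0_G & \cdots & 0_G & I_G \end{bmatrix} \begin{bmatrix} A_1' \\ A_2' \\ \vdots \\ A_p' \end{bmatrix} + \begin{bmatrix} E \\ v_1 \\ v_2 \\ \vdots \\ v_p \end{bmatrix}$$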
The final blocks may be multiplied by a prior precision, to enforce the prior to the desired degree, and
then estimation may be done using OLS, just as we did when introducing ordinary ridge regression.
This is a simple example of a Bayesian VAR: the VAR(p) model, combined with a certain prior (random
walk, and Gaussian prior), implemented using mixed estimation.
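The mixed estimation idea can be sketched as follows in Python. This is only an illustration under simplifying assumptions, not the EstimateBVAR.m script: a single scalar `precision` multiplies all of the prior blocks (the text allows a different $\sigma_j$ per lag), the data are simulated independent random walks, and the function names `lagged_blocks` and `bvar_mixed` are hypothetical.

```python
import numpy as np

def lagged_blocks(data, p):
    """Return Y (n x G) and X = [Y_{-1} ... Y_{-p}] (n x Gp), dropping the first p rows."""
    T = data.shape[0]
    Y = data[p:]
    X = np.hstack([data[p - j:T - j] for j in range(1, p + 1)])
    return Y, X

def bvar_mixed(data, p, precision):
    """OLS on the data augmented with artificial observations that encode the prior."""
    Y, X = lagged_blocks(data, p)
    G = Y.shape[1]
    # prior means stacked like [A_1'; ...; A_p']: identity for lag 1, zeros for the rest
    prior_mean = np.vstack([np.eye(G)] + [np.zeros((G, G))] * (p - 1))
    Y_aug = np.vstack([Y, precision * prior_mean])
    X_aug = np.vstack([X, precision * np.eye(G * p)])
    coef, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)
    # coef stacks A_1', ..., A_p'; transpose each G x G block to recover A_j
    return [coef[j * G:(j + 1) * G].T for j in range(p)]

rng = np.random.default_rng(0)
data = np.cumsum(rng.normal(size=(200, 3)), axis=0)  # three independent random walks
data = data - data.mean(axis=0)                      # de-mean, as in the text
for precision in (0.0, 5.0, 50.0):
    A = bvar_mixed(data, p=2, precision=precision)
    print(f"precision {precision}: diagonal of A1 =", np.round(np.diag(A[0]), 3))
```

With `precision = 0` the artificial observations have no weight and the result is plain OLS; as the precision grows, the estimates are pulled toward the random walk prior.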
We have previously seen a simple RBC model, in Section 13.8. If you run rbc.mod using Dynare,
it will generate simulated data from this model. The data file rbcdata.m contains 400 observations
on consumption, investment and hours worked, generated by this model. The data are plotted in
Figure 16.4. Hours worked is quite stable around the steady state value of 1/3, but consumption
and investment fluctuate a little more. Let's estimate a Bayesian VAR, using this data. The script
EstimateBVAR.m gives the results
octave:1> EstimateBVAR
plain OLS results
A1
-0.463821 -9.234590 -10.554874
0.065112 1.650430 0.913775
0.300228 1.465854 2.509376
A2
1.47732 9.31190 10.57178
-0.18877 -1.20802 -1.30911
-0.10263 -0.64005 -0.78350
r-squares OLS fit: 0.98860 0.82362 0.79340
#################################
Minnesota prior results
A1
1.004195 0.037051 0.026717
-0.046089 0.706809 -0.273725
0.058231 0.160412 1.121387
A2
0.0066800 0.0739334 0.0674895
-0.0358829 -0.1631035 -0.1307285
0.0357671 0.1514610 0.1249041
r-squares Minnesota fit: 0.98859 0.82341 0.79320
Note how the $R^2$s hardly change, but the estimated coefficients are much more similar to AR(1) fits.
The prior seems to be imposing discipline on the coefficients, without affecting goodness of fit in any
serious way. The residuals are plotted in Figure 16.5. Note that the residuals for investment
and hours are obviously very highly correlated. This is because the model that generated the data
contains only one shock (a technology shock), so the stochastic behavior of the variables is necessarily
fairly tightly linked.
Bayesian estimation of DSGE model
In Section 13.8, a simple DSGE model was estimated by ML. EstimateRBC_Bayesian.mod is a Dynare
.mod file that lets you do the same thing using Bayesian methods, with MCMC. Another example of
Bayesian estimation of a DSGE model is given in Section 22.6.
Figure 16.4: Data from RBC model
Figure 16.5: BVAR residuals, with separation
16.7 Exercises
1. Experiment with the examples to learn about tuning, etc.
Chapter 17
Introduction to panel data
Reference: Cameron and Trivedi, 2005, Microeconometrics: Methods and Applications, Part V,
Chapters 21 and 22 (plus 23 if you have special interest in the topic).
In this chapter we'll look at panel data. Panel data is an important area in applied econometrics,
simply because much of the available data has this structure. Also, it provides an example where things
we've already studied (GLS, endogeneity, GMM, Hausman test) come into play. There has been much
work in this area, and the intention is not to give a complete overview, but rather to highlight the
issues and see how the tools we have studied can be applied.
17.1 Generalities
Panel data combines cross sectional and time series data: we have a time series for each of the
agents observed in a cross section. The addition of temporal information can in principle allow us to
investigate issues such as persistence, habit formation, and dynamics. Starting from the perspective
of a single time series, the addition of cross-sectional information allows investigation of heterogeneity.
In both cases, if parameters are common across units or over time, the additional data allows for more
precise estimation.
The basic idea is to allow variables to have two indices, $i = 1, 2, ..., n$ and $t = 1, 2, ..., T$. The simple
linear model
$$y_i = \alpha + x_i \beta + \epsilon_i$$
becomes
$$y_{it} = \alpha + x_{it} \beta + \epsilon_{it}$$
We could think of allowing the parameters to change over time and over cross sectional units. This
would give
$$y_{it} = \alpha_{it} + x_{it} \beta_{it} + \epsilon_{it}$$
The problem here is that there are more parameters than observations, so the model is not identified.
We need some restraint! The proper restrictions to use of course depend on the problem at hand, and
a single model is unlikely to be appropriate for all situations. For example, one could have time and
cross-sectional dummies, and slopes that vary by time:
$$y_{it} = \alpha_i + \gamma_t + x_{it} \beta_t + \epsilon_{it}$$
There is a lot of room for playing around here. We also need to consider whether or not $n$ and $T$ are
fixed or growing. We'll need at least one of them to be growing in order to do asymptotics.
To provide some focus, we'll consider common slope parameters, but agent-specific intercepts:
$$y_{it} = \alpha_i + x_{it} \beta + \epsilon_{it}$$ (17.1)
I will refer to this as the "simple linear panel model". This is the model most often encountered in the
applied literature. It is like the original cross-sectional model, in that the $\beta$'s are constant over time
for all $i$. However, we're now allowing for the constant to vary across $i$ (some individual heterogeneity).
The $\beta$'s are fixed over time, which is a testable restriction, of course. We can consider what happens
as $n \to \infty$ but $T$ is fixed. This would be relevant for microeconometric panels (e.g., the PSID data),
where a survey of a large number of individuals may be done for a limited number of time periods.
Macroeconometric applications might look at longer time series for a small number of cross-sectional
units (e.g., 40 years of quarterly data for 15 European countries). For that case, we could keep $n$
fixed (seems appropriate when dealing with the EU countries), and do asymptotics as $T$ increases, as
is normal for time series. The asymptotic results depend on how we do this, of course.
Why bother using panel data, what are the benefits? The model
$$y_{it} = \alpha_i + x_{it} \beta + \epsilon_{it}$$
is a restricted version of
$$y_{it} = \alpha_i + x_{it} \beta_i + \epsilon_{it}$$
which could be estimated for each $i$ in turn. Why use the panel approach?
Because the restrictions that $\beta_i = \beta_j = ... = \beta$, if true, lead to more efficient estimation.
Estimation for each $i$ in turn will be very uninformative if $T$ is small.
Another reason is that panel data allows us to estimate parameters that are not identified by
cross sectional (time series) data. For example, if the model is
$$y_{it} = \alpha_i + \gamma_t + x_{it} \beta_t + \epsilon_{it}$$
and we have only cross sectional data, we cannot estimate the $\alpha_i$. If we have only time series
data on a single cross sectional unit $i = 1$, we cannot estimate the $\gamma_t$. Cross-sectional variation
allows us to estimate parameters indexed by time, and time series variation allows us to estimate
parameters indexed by cross-sectional unit. Parameters indexed by both $i$ and $t$ will require
other forms of restrictions in order to be estimable.
The main issues are:
can $\beta$ be estimated consistently? This is almost always a goal.
can the $\alpha_i$ be estimated consistently? This is often of secondary interest.
sometimes, we're interested in estimating the distribution of $\alpha_i$ across $i$.
are the $\alpha_i$ correlated with $x_{it}$?
does the presence of $\alpha_i$ complicate estimation of $\beta$?
what about the covariance structure? We're likely to have HET and AUT, so GLS issues will
probably be relevant. Potential for efficiency gains.
17.2 Static models and correlations between variables
To begin with, assume that the $x_{it}$ are weakly exogenous variables (uncorrelated with $\epsilon_{it}$), and that
the model is static: $x_{it}$ does not contain lags of $y_{it}$. The basic problem we have in the panel data model
$y_{it} = \alpha_i + x_{it}\beta + \epsilon_{it}$ is the presence of the $\alpha_i$. These are individual-specific parameters. Or, possibly
more accurately, they can be thought of as individual-specific variables that are not observed (latent
variables). The reason for thinking of them as variables is because the agent may choose their values
following some process.
Define $\alpha = E(\alpha_i)$, so $E(\alpha_i - \alpha) = 0$. Our model $y_{it} = \alpha_i + x_{it}\beta + \epsilon_{it}$ may be written
$$y_{it} = \alpha_i + x_{it}\beta + \epsilon_{it}$$
$$= \alpha + x_{it}\beta + (\alpha_i - \alpha + \epsilon_{it})$$
$$= \alpha + x_{it}\beta + \eta_{it}$$
Note that $E(\eta_{it}) = 0$. A way of thinking about the data generating process is this: First, $\alpha_i$ is drawn,
either in turn from the set of $n$ fixed values, or randomly, and then $x$ is drawn from $f_X(z|\alpha_i)$. In either
case, the important point is that the distribution of $x$ may vary depending on the realization, $\alpha_i$. Thus,
there may be correlation between $\alpha_i$ and $x_{it}$, which means that $E(x_{it}\eta_{it}) \neq 0$ in the above equation. This
means that OLS estimation of the model would lead to biased and inconsistent estimates. However, it
is possible (but unlikely for economic data) that $x_{it}$ and $\eta_{it}$ are independent or at least uncorrelated,
if the distribution of $x_{it}$ is constant with respect to the realization of $\alpha_i$. In this case OLS estimation
would be consistent.
Fixed effects: when $E(x_{it}\eta_{it}) \neq 0$, the model is called the "fixed effects model".
Random effects: when $E(x_{it}\eta_{it}) = 0$, the model is called the "random effects model".
I find this to be pretty poor nomenclature, because the issue is not whether "effects" are fixed or
random (they are always random, unconditional on $i$). The issue is whether or not the "effects" are
correlated with the other regressors. In economics, it seems likely that the unobserved variable $\alpha$ is
probably correlated with the observed regressors, $x$ (this is simply the presence of collinearity between
observed and unobserved variables, and collinearity is usually the rule rather than the exception).
So, we expect that the "fixed effects" model is probably the relevant one unless special circumstances
mean that the $\alpha_i$ are uncorrelated with the $x_{it}$.
17.3 Estimation of the simple linear panel model
Fixed eects: The within estimator
How can we estimate the parameters of the simple linear panel model (equation 17.1) and what
properties do the estimators have? First, we assume that the $\alpha_i$ are correlated with the $x_{it}$ ("fixed
effects" model). The model can be written as $y_{it} = \alpha + x_{it}\beta + \eta_{it}$, and we have that $E(x_{it}\eta_{it}) \neq 0$. As
such, OLS estimation of this model will give biased and inconsistent estimates of the parameters $\alpha$ and
$\beta$. The "within" estimator is a solution: it involves subtracting the time series average from each
cross sectional unit.
$$\bar{x}_i = \frac{1}{T}\sum_{t=1}^{T} x_{it}$$
$$\bar{\epsilon}_i = \frac{1}{T}\sum_{t=1}^{T} \epsilon_{it}$$
$$\bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{it} = \alpha_i + \frac{1}{T}\sum_{t=1}^{T} x_{it}\beta + \frac{1}{T}\sum_{t=1}^{T} \epsilon_{it}$$
$$\bar{y}_i = \alpha_i + \bar{x}_i\beta + \bar{\epsilon}_i$$ (17.2)
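The transformed model is
$$y_{it} - \bar{y}_i = \alpha_i + x_{it}\beta + \epsilon_{it} - \alpha_i - \bar{x}_i\beta - \bar{\epsilon}_i$$ (17.3)
$$y^*_{it} = x^*_{it}\beta + \epsilon^*_{it}$$
where $y^*_{it} = y_{it} - \bar{y}_i$, $x^*_{it} = x_{it} - \bar{x}_i$ and $\epsilon^*_{it} = \epsilon_{it} - \bar{\epsilon}_i$. In this model, it is clear that $x^*_{it}$ and $\epsilon^*_{it}$ are uncorrelated,
as long as the original regressors $x_{it}$ are strongly exogenous with respect to the original error $\epsilon_{it}$
($E(x_{it}\epsilon_{is}) = 0$, $\forall t, s$). In this case, OLS will give consistent estimates of the parameters of this model,
$\beta$.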
What about the $\alpha_i$? Can they be consistently estimated? An estimator is
$$\hat{\alpha}_i = \frac{1}{T}\sum_{t=1}^{T}\left(y_{it} - x_{it}\hat{\beta}\right)$$
It's fairly obvious that this is a consistent estimator if $T \to \infty$. For a short panel with fixed $T$, this
estimator is not consistent. Nevertheless, the variation in the $\hat{\alpha}_i$ can be fairly informative about the
heterogeneity. A couple of notes:
An equivalent approach is to estimate the model
$$y_{it} = \sum_{j=1}^{n} d_{j,it}\alpha_j + x_{it}\beta + \epsilon_{it}$$
by OLS. The $d_j$, $j = 1, 2, ..., n$ are $n$ dummy variables that take on the value 1 if $j = i$, zero
otherwise. They are indicators of the cross sectional unit of the observation. (Write out form
of regressor matrix on blackboard). Estimating this model by OLS gives numerically exactly
the same results as the "within" estimator, and you get the $\hat{\alpha}_i$ automatically. See Cameron and
Trivedi, section 21.6.4 for details. An interesting and important result known as the
Frisch-Waugh-Lovell Theorem can be used to show that the two means of estimation give identical
results.
This last expression makes it clear why the "within" estimator cannot estimate slope coefficients
corresponding to variables that have no time variation. Such variables are perfectly collinear
with the cross sectional dummies $d_j$. The corresponding coefficients are not identified.
OLS estimation of the "within" model is consistent, but probably not efficient, because it is
highly probable that the $\epsilon_{it}$ are not iid. There is very likely heteroscedasticity across the $i$ and
autocorrelation between the $T$ observations corresponding to a given $i$. One needs to estimate
the covariance matrix of the parameter estimates taking this into account. It is possible to
use GLS corrections if you make assumptions regarding the het. and autocor. Quasi-GLS,
using a possibly misspecified model of the error covariance, can lead to more efficient estimates
than simple OLS. One can then combine it with subsequent panel-robust covariance estimation
to deal with the misspecification of the error covariance, which would invalidate inferences if
ignored. The White heteroscedasticity consistent covariance estimator is easily extended to
panel data with independence across $i$, but with heteroscedasticity and autocorrelation within $i$,
and heteroscedasticity between $i$. See Cameron and Trivedi, Section 21.2.3.
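The numerical equivalence of the "within" estimator and OLS with cross-sectional dummies can be checked with a short Python sketch. The data generating process below (a single regressor correlated with the individual effect, $n = 50$, $T = 6$, $\beta = 1.5$) is an illustrative assumption, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, beta = 50, 6, 1.5
alpha = rng.normal(size=n)
# regressor correlated with the individual effect (the "fixed effects" case)
x = alpha[:, None] + rng.normal(size=(n, T))
y = alpha[:, None] + beta * x + rng.normal(size=(n, T))

# within estimator: subtract unit-specific time averages, then OLS
x_star = x - x.mean(axis=1, keepdims=True)
y_star = y - y.mean(axis=1, keepdims=True)
beta_within = (x_star * y_star).sum() / (x_star ** 2).sum()
alpha_hat = y.mean(axis=1) - beta_within * x.mean(axis=1)

# dummy variable regression: n unit dummies plus x, observations stacked unit by unit
D = np.kron(np.eye(n), np.ones((T, 1)))          # nT x n matrix of dummies
Z = np.column_stack([D, x.reshape(-1)])
coef, *_ = np.linalg.lstsq(Z, y.reshape(-1), rcond=None)
beta_lsdv = coef[-1]

# pooled OLS with a common intercept, which ignores the alpha_i
Zp = np.column_stack([np.ones(n * T), x.reshape(-1)])
beta_pooled = np.linalg.lstsq(Zp, y.reshape(-1), rcond=None)[0][1]

print(f"within {beta_within:.4f}, dummies {beta_lsdv:.4f}, pooled OLS {beta_pooled:.4f}")
```

The first two estimates agree to numerical precision, and the dummy coefficients reproduce the $\hat{\alpha}_i$. Pooled OLS is pushed away from the true slope because the regressor is correlated with the omitted $\alpha_i$.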
Estimation with random eects
The original model is
$$y_{it} = \alpha_i + x_{it}\beta + \epsilon_{it}$$
This can be written as
$$y_{it} = \alpha + x_{it}\beta + (\alpha_i - \alpha + \epsilon_{it})$$
$$y_{it} = \alpha + x_{it}\beta + \eta_{it}$$ (17.4)
where $E(\eta_{it}) = 0$, and $E(x_{it}\eta_{it}) = 0$. As such, the OLS estimator of this model is consistent. We can
recover estimates of the $\alpha_i$ as discussed above. It is to be noted that the error $\eta_{it}$ is almost certainly
heteroscedastic and autocorrelated, so OLS will not be efficient, and inferences based on OLS need to
be done taking this into account. One could attempt to use GLS, or panel-robust covariance matrix
estimation, or both, as above.
There are other estimators when we have random effects, a well-known example being the "between"
estimator, which operates on the time averages of the cross sectional units. There is no advantage to
doing this, as the overall estimator is already consistent, and averaging loses information (efficiency
loss). One would still need to deal with cross sectional heteroscedasticity when using the between
estimator, so there is no gain in simplicity, either.
It is to be emphasized that "random effects" is not a plausible assumption with most economic data,
so use of this estimator is discouraged, even if your statistical package offers it as an option. Think
carefully about whether the assumption is warranted before trusting the results of this estimator.
Hausman test
Suppose you are unsure whether fixed or random effects are present. If we have fixed effects,
then the "within" estimator will be consistent, but the estimator of the previous section will not.
Evidence that the two estimators are converging to different limits is evidence in favor of fixed effects,
not random effects. A Hausman test statistic can be computed, using the difference between the two
estimators. The null hypothesis is "random effects", so that both estimators are consistent. When
the test rejects, we conclude that fixed effects are present, so the "within" estimator should be used.
Now, what happens if the test does not reject? One could optimistically turn to the random effects
model, but it's probably more realistic to conclude that the test may have low power. Failure to
reject does not mean that the null hypothesis is true. After all, estimation of the covariance matrices
needed to compute the Hausman test is a non-trivial issue, and is a source of considerable noise in
the test statistic (noise=low power). Finally, the simple version of the Hausman test requires that
the estimator under the null be fully efficient. Achieving this goal is probably a utopian prospect. A
conservative approach would acknowledge that neither estimator is likely to be efficient, and operate
accordingly. I have a little paper on this topic, Creel, Applied Economics, 2004. See also Cameron
and Trivedi, section 21.4.3.
17.4 Dynamic panel data
When we have panel data, we have information on both $y_{it}$ as well as $y_{i,t-1}$. One may naturally think
of including $y_{i,t-1}$ as a regressor, to capture dynamic effects that can't be analyzed with only
cross-sectional data. Excluding dynamic effects is often the reason for detection of spurious AUT of the
errors. With dynamics, there is likely to be less of a problem of autocorrelation, but one should still
be concerned that some might still be present. The model becomes
$$y_{it} = \alpha_i + \gamma y_{i,t-1} + x_{it}\beta + \epsilon_{it}$$
$$y_{it} = \alpha + \gamma y_{i,t-1} + x_{it}\beta + (\alpha_i - \alpha + \epsilon_{it})$$
$$y_{it} = \alpha + \gamma y_{i,t-1} + x_{it}\beta + \eta_{it}$$
We assume that the $x_{it}$ are uncorrelated with $\epsilon_{it}$.
Note that $\alpha_i$ is a component that determines both $y_{it}$ and its lag, $y_{i,t-1}$. Thus, $\alpha_i$ and $y_{i,t-1}$ are
correlated, even if the $\alpha_i$ are pure random effects (uncorrelated with $x_{it}$).
So, $y_{i,t-1}$ is correlated with $\eta_{it}$.
For this reason, OLS estimation is inconsistent even for the random effects model, and it's also
of course still inconsistent for the fixed effects model.
When regressors are correlated with the errors, the natural thing to do is start thinking of
instrumental variables estimation, or GMM.
To illustrate, consider a simple linear dynamic panel model
$$y_{it} = \alpha_i + \rho_0 y_{i,t-1} + \epsilon_{it}$$ (17.5)
where $\epsilon_{it} \sim N(0, 1)$, $\alpha_i \sim N(0, 1)$, $\rho_0 \in \{0, 0.3, 0.6, 0.9\}$, and $\alpha_i$ and $\epsilon_i$ are independently distributed.
Tables 17.1 and 17.2 present bias and RMSE for the "within" estimator (labeled as ML) and some
simulation-based estimators. Note that the "within" estimator is very biased, and has a large RMSE.
The overidentified SBIL estimator (see Creel and Kristensen, "Indirect Likelihood Inference") has the
lowest RMSE. Simulation-based estimators are discussed in a later Chapter. Perhaps these results will
stimulate your interest.
Table 17.1: Dynamic panel data model. Bias. Source for ML and II is Gouriéroux, Phillips and Yu,
2010, Table 2. SBIL, SMIL and II are exactly identified, using the ML auxiliary statistic. SBIL(OI)
and SMIL(OI) are overidentified, using both the naive and ML auxiliary statistics.
T   N     $\rho_0$   ML       II       SBIL     SBIL(OI)
5   100   0.0    -0.199   0.001    0.004    -0.000
5   100   0.3    -0.274   -0.001   0.003    -0.001
5   100   0.6    -0.362   0.000    0.004    -0.001
5   100   0.9    -0.464   0.000    -0.022   -0.000
5   200   0.0    -0.200   0.000    0.001    0.000
5   200   0.3    -0.275   -0.010   0.001    -0.001
5   200   0.6    -0.363   -0.000   0.001    -0.001
5   200   0.9    -0.465   -0.003   -0.010   0.001
Table 17.2: Dynamic panel data model. RMSE. Source for ML and II is Gouriéroux, Phillips and Yu,
2010, Table 2. SBIL, SMIL and II are exactly identified, using the ML auxiliary statistic. SBIL(OI)
and SMIL(OI) are overidentified, using both the naive and ML auxiliary statistics.
T   N     $\rho_0$   ML       II       SBIL     SBIL(OI)
5   100   0.0    0.204    0.057    0.059    0.044
5   100   0.3    0.278    0.081    0.065    0.041
5   100   0.6    0.365    0.070    0.071    0.036
5   100   0.9    0.467    0.076    0.059    0.033
5   200   0.0    0.203    0.041    0.041    0.031
5   200   0.3    0.277    0.074    0.046    0.029
5   200   0.6    0.365    0.050    0.050    0.025
5   200   0.9    0.467    0.054    0.046    0.027
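The bias of the "within" estimator in this model is easy to see in a small Monte Carlo, sketched here in Python. This is not a replication of the tables: the treatment of initial conditions (a burn-in of 50 periods starting from zero), the number of replications and the function names `simulate_panel` and `within_rho` are assumptions made for illustration.

```python
import numpy as np

def simulate_panel(n, T, rho, rng, burn=50):
    """Simulate y_it = alpha_i + rho*y_{i,t-1} + eps_it; returns n x (T+1) array y_i0..y_iT."""
    alpha = rng.normal(size=n)
    y = np.zeros(n)
    path = []
    for t in range(burn + T + 1):
        y = alpha + rho * y + rng.normal(size=n)
        if t >= burn:                 # keep only the last T+1 periods
            path.append(y)
    return np.column_stack(path)

def within_rho(y):
    """Within estimator: regress demeaned y_it on demeaned y_{i,t-1}."""
    lag, cur = y[:, :-1], y[:, 1:]
    lag_star = lag - lag.mean(axis=1, keepdims=True)
    cur_star = cur - cur.mean(axis=1, keepdims=True)
    return (lag_star * cur_star).sum() / (lag_star ** 2).sum()

rng = np.random.default_rng(2)
n, T, reps = 100, 5, 200
bias = {}
for rho in (0.0, 0.3, 0.6, 0.9):
    est = [within_rho(simulate_panel(n, T, rho, rng)) for _ in range(reps)]
    bias[rho] = np.mean(est) - rho
    print(f"rho0 = {rho}: average bias of the within estimator = {bias[rho]:.3f}")
```

The bias is negative for every value of $\rho_0$ and does not shrink as $n$ increases with $T$ fixed, which is why instrumental variables or simulation-based methods are needed.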
Arellano-Bond estimator
The first thing is to realize that the $\alpha_i$ that are a component of the error are correlated with all
regressors in the general case of fixed effects. Getting rid of the $\alpha_i$ is a step in the direction of solving
the problem. We could subtract the time averages, as above for the "within" estimator, but this
would give us problems later when we need to define instruments. Instead, consider the model in first
differences
$$y_{it} - y_{i,t-1} = \alpha_i + \gamma y_{i,t-1} + x_{it}\beta + \epsilon_{it} - \alpha_i - \gamma y_{i,t-2} - x_{i,t-1}\beta - \epsilon_{i,t-1}$$
$$y_{it} - y_{i,t-1} = \gamma(y_{i,t-1} - y_{i,t-2}) + (x_{it} - x_{i,t-1})\beta + \epsilon_{it} - \epsilon_{i,t-1}$$
or
$$\Delta y_{it} = \gamma\Delta y_{i,t-1} + \Delta x_{it}\beta + \Delta\epsilon_{it}$$
Now the pesky $\alpha_i$ are no longer in the picture. Note that we lose one observation when doing first
differencing. OLS estimation of this model will still be inconsistent, because $\Delta y_{i,t-1}$ is clearly correlated
with $\epsilon_{i,t-1}$, which is a component of $\Delta\epsilon_{it}$. Note also that the error $\Delta\epsilon_{it}$ is serially correlated even if the $\epsilon_{it}$ are not. There is no problem
of correlation between $\Delta x_{it}$ and $\Delta\epsilon_{it}$. Thus, to do GMM, we need to find instruments for $\Delta y_{i,t-1}$, but
the variables in $\Delta x_{it}$ can serve as their own instruments.
How about using $y_{i,t-2}$ as an instrument? It is clearly correlated with $\Delta y_{i,t-1} = (y_{i,t-1} - y_{i,t-2})$, and
as long as the $\epsilon_{it}$ are not serially correlated, then $y_{i,t-2}$ is not correlated with $\Delta\epsilon_{it} = \epsilon_{it} - \epsilon_{i,t-1}$. We can
also use additional lags $y_{i,t-s}$, $s \geq 2$, to increase efficiency, because GMM with additional instruments is
asymptotically more efficient than with fewer instruments (but small sample bias may become a serious
problem). This sort of estimator is widely known in the literature as an Arellano-Bond estimator, due
to the influential paper of Arellano and Bond (1991).
Note that this sort of estimator requires T = 3 at a minimum. Suppose T = 4. Then for
t = 1 and t = 2, we cannot compute the moment conditions. For t = 3, we can compute the
moment conditions using a single lag $y_{i,1}$ as an instrument. When t = 4, we can use both $y_{i,1}$
and $y_{i,2}$ as instruments. This sort of unbalancedness in the instruments requires a bit of care
when programming. Also, additional instruments increase asymptotic efficiency but can lead to
increased small sample bias, so one should be a little careful about using too many instruments.
Robustness checks that look at the stability of the estimates are one way to proceed.
One should note that serial correlation of the $\epsilon_{it}$ will cause this estimator to be inconsistent.
Serial correlation of the errors may be due to dynamic misspecification, and this can be solved
by including additional lags of the dependent variable. However, serial correlation may also be
due to factors not captured in lags of the dependent variable. If this is a possibility, then the
validity of the Arellano-Bond type instruments is in question.
A final note is that the error $\Delta\epsilon_{it}$ is serially correlated even when the $\epsilon_{it}$ are not, and very likely
heteroscedastic across i. One needs to take this into account when computing the covariance of
the GMM estimator. One can also attempt to use GLS style weighting to improve efficiency.
There are many possibilities.
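The unbalanced instrument sets described above can be made concrete with a small sketch. The following Python function is only an illustration, not part of the original materials: the name `arellano_bond_instruments` and the optional `max_lags` cap are hypothetical. For each period t it lists which lagged levels $y_{i,s}$ are available as instruments for $\Delta y_{i,t-1}$, assuming the $\epsilon_{it}$ are serially uncorrelated.

```python
def arellano_bond_instruments(T, max_lags=None):
    """Return {t: [s, ...]}: for each period t with a usable moment condition,
    the periods s of the lagged levels y_{i,s} that can instrument
    the differenced regressor Delta y_{i,t-1}."""
    if T < 3:
        raise ValueError("at least T = 3 periods are required")
    instruments = {}
    for t in range(3, T + 1):
        # y_{i,1}, ..., y_{i,t-2} are valid if the eps_{it} are serially uncorrelated
        lags = list(range(1, t - 1))
        if max_lags is not None:
            # keep only the most recent levels, to limit the instrument count
            lags = lags[max(0, len(lags) - max_lags):]
        instruments[t] = lags
    return instruments


if __name__ == "__main__":
    for t, lags in arellano_bond_instruments(4).items():
        print("t =", t, "instruments:", ["y_i,%d" % s for s in lags])
```

With T = 4 this reproduces the case in the text: one instrument at t = 3 and two at t = 4. Capping the number of lags is one simple way to trade asymptotic efficiency against small sample bias.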
17.5 Exercises
1. In the context of a dynamic model with fixed effects, why is the differencing used in the "within"
estimation approach (equation 17.3) problematic? That is, why does the Arellano-Bond estimator
operate on the model in first differences instead of using the within approach?
2. Consider the simple linear panel data model with random effects (equation 17.4). Suppose that
the $\epsilon_{it}$ are independent across cross sectional units, so that $E(\epsilon_{it}\epsilon_{js}) = 0$, $i \neq j$, $\forall t, s$. Within a
cross sectional unit, the errors are independently and identically distributed, so $E(\epsilon_{it}^2) = \sigma_i^2$, but
$E(\epsilon_{it}\epsilon_{is}) = 0$, $t \neq s$. More compactly, let $\epsilon_i = \left[\epsilon_{i1}\ \epsilon_{i2}\ \cdots\ \epsilon_{iT}\right]'$. Then the assumptions are
that $E(\epsilon_i\epsilon_i') = \sigma_i^2 I_T$, and $E(\epsilon_i\epsilon_j') = 0$, $i \neq j$.
(a) Write out the form of the entire covariance matrix ($nT \times nT$) of all errors, $\Sigma = E(\epsilon\epsilon')$, where
$\epsilon = \left[\epsilon_1'\ \epsilon_2'\ \cdots\ \epsilon_n'\right]'$ is the column vector of nT errors.
(b) Suppose that n is fixed, and consider asymptotics as T grows. Is it possible to estimate the
$\sigma_i^2$ consistently? If so, how?
(c) Suppose that T is fixed, and consider asymptotics as n grows. Is it possible to estimate the
$\sigma_i^2$ consistently? If so, how?
(d) For one of the two preceding parts (b) and (c), consistent estimation is possible. For that
case, outline how to do "within" estimation using a GLS correction.
Chapter 18
Quasi-ML
Quasi-ML is the estimator one obtains when a misspecified probability model is used to calculate an
"ML" estimator.
Given a sample of size $n$ of a random vector $y$ and a vector of conditioning variables $x$, suppose the
joint density of $Y = (y_1 \ldots y_n)$ conditional on $X = (x_1 \ldots x_n)$ is a member of the parametric
family $p_Y(Y|X,\rho)$, $\rho \in \Xi$. The true joint density is associated with the vector $\rho^0$:
$$p_Y(Y|X,\rho^0).$$
As long as the marginal density of $X$ doesn't depend on $\rho^0$, this conditional density fully characterizes
the random characteristics of samples: i.e., it fully describes the probabilistically important features
of the d.g.p. The likelihood function is just this density evaluated at other values $\rho$:
$$L(Y|X,\rho) = p_Y(Y|X,\rho),\quad \rho \in \Xi.$$
Let $Y_{t-1} = (y_1 \ldots y_{t-1})$, $Y_0 = 0$, and let $X_t = (x_1 \ldots x_t)$. The likelihood function,
taking into account possible dependence of observations, can be written as
$$L(Y|X,\rho) = \prod_{t=1}^{n} p_t(y_t|Y_{t-1},X_t,\rho) \equiv \prod_{t=1}^{n} p_t(\rho)$$
The average log-likelihood function is:
$$s_n(\rho) = \frac{1}{n}\ln L(Y|X,\rho) = \frac{1}{n}\sum_{t=1}^{n}\ln p_t(\rho)$$
Suppose that we do not have knowledge of the family of densities $p_t(\rho)$. Mistakenly, we may
assume that the conditional density of $y_t$ is a member of the family $f_t(y_t|Y_{t-1},X_t,\theta)$, $\theta \in \Theta$,
where there is no $\theta^0$ such that $f_t(y_t|Y_{t-1},X_t,\theta^0) = p_t(y_t|Y_{t-1},X_t,\rho^0)$, $\forall t$ (this is what we mean
by "misspecified").
This setup allows for heterogeneous time series data, with dynamic misspecification.
The QML estimator is the argument that maximizes the misspecified average log likelihood, which
we refer to as the "quasi-log likelihood function". This objective function is
$$s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}\ln f_t(y_t|Y_{t-1},X_t,\theta) \equiv \frac{1}{n}\sum_{t=1}^{n}\ln f_t(\theta)$$
and the QML is
$$\hat\theta_n = \arg\max_{\Theta} s_n(\theta)$$
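A SLLN for dependent sequences applies (we assume), so that
$$s_n(\theta) \overset{a.s.}{\to} \lim_{n\to\infty}\mathcal{E}\frac{1}{n}\sum_{t=1}^{n}\ln f_t(\theta) \equiv s_\infty(\theta)$$
We assume that this can be strengthened to uniform convergence, a.s., following the previous arguments.
The "pseudo-true" value of $\theta$ is the value that maximizes $s_\infty(\theta)$:
$$\theta^0 = \arg\max_{\Theta} s_\infty(\theta)$$
Given assumptions so that theorem 29 is applicable, we obtain
$$\lim_{n\to\infty}\hat\theta_n = \theta^0,\ \text{a.s.}$$
Applying the asymptotic normality theorem,
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \overset{d}{\to} N\left[0,\ \mathcal{J}_\infty(\theta^0)^{-1}\mathcal{I}_\infty(\theta^0)\mathcal{J}_\infty(\theta^0)^{-1}\right]$$
where
$$\mathcal{J}_\infty(\theta^0) = \lim_{n\to\infty}\mathcal{E}D_\theta^2 s_n(\theta^0)$$
and
$$\mathcal{I}_\infty(\theta^0) = \lim_{n\to\infty} Var\,\sqrt{n}D_\theta s_n(\theta^0).$$
Note that asymptotic normality only requires that the additional assumptions regarding $\mathcal{J}$ and
$\mathcal{I}$ hold in a neighborhood of $\theta^0$ for $\mathcal{J}$ and at $\theta^0$ for $\mathcal{I}$, not throughout $\Theta$. In this sense, asymptotic
normality is a local property.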
18.1 Consistent Estimation of Variance Components
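Consistent estimation of $\mathcal{J}_\infty(\theta^0)$ is straightforward. Assumption (b) of Theorem 31 implies that
$$\mathcal{J}_n(\hat\theta_n) = \frac{1}{n}\sum_{t=1}^{n}D_\theta^2\ln f_t(\hat\theta_n) \overset{a.s.}{\to} \lim_{n\to\infty}\mathcal{E}\frac{1}{n}\sum_{t=1}^{n}D_\theta^2\ln f_t(\theta^0) = \mathcal{J}_\infty(\theta^0).$$
That is, just calculate the Hessian using the estimate $\hat\theta_n$ in place of $\theta^0$.
Consistent estimation of $\mathcal{I}_\infty(\theta^0)$ is more difficult, and may be impossible.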
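Notation: Let $g_t \equiv D_\theta \ln f_t(\theta^0)$.
We need to estimate
$$
\begin{aligned}
\mathcal{I}_\infty(\theta^0) &= \lim_{n\to\infty} Var\,\sqrt{n}D_\theta s_n(\theta^0) \\
&= \lim_{n\to\infty} Var\,\sqrt{n}\,\frac{1}{n}\sum_{t=1}^{n}D_\theta\ln f_t(\theta^0) \\
&= \lim_{n\to\infty}\frac{1}{n}Var\sum_{t=1}^{n}g_t \\
&= \lim_{n\to\infty}\frac{1}{n}\mathcal{E}\left\{\left[\sum_{t=1}^{n}\left(g_t - \mathcal{E}g_t\right)\right]\left[\sum_{t=1}^{n}\left(g_t - \mathcal{E}g_t\right)\right]'\right\}
\end{aligned}
$$
This is going to contain a term
$$\lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}\left(\mathcal{E}g_t\right)\left(\mathcal{E}g_t\right)'$$
which will not tend to zero, in general. This term is not consistently estimable in general, since it
requires calculating an expectation using the true density under the d.g.p., which is unknown.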
There are important cases where $\mathcal{I}_\infty(\theta^0)$ is consistently estimable. For example, suppose that
the data come from a random sample (i.e., they are iid). This would be the case with cross
sectional data, for example. (Note: under i.i.d. sampling, the joint distribution of $(y_t, x_t)$ is
identical. This does not imply that the conditional density $f(y_t|x_t)$ is identical).
With random sampling, the limiting objective function is simply
$$s_\infty(\theta^0) = \mathcal{E}_X\mathcal{E}_0\ln f(y|x,\theta^0)$$
where $\mathcal{E}_0$ means expectation of $y|x$ and $\mathcal{E}_X$ means expectation with respect to the marginal density of
$x$.
By the requirement that the limiting objective function be maximized at $\theta^0$ we have
$$D_\theta\mathcal{E}_X\mathcal{E}_0\ln f(y|x,\theta^0) = D_\theta s_\infty(\theta^0) = 0$$
The dominated convergence theorem allows switching the order of expectation and differentiation,
so
$$D_\theta\mathcal{E}_X\mathcal{E}_0\ln f(y|x,\theta^0) = \mathcal{E}_X\mathcal{E}_0 D_\theta\ln f(y|x,\theta^0) = 0$$
The CLT implies that
$$\frac{1}{\sqrt{n}}\sum_{t=1}^{n}D_\theta\ln f(y|x,\theta^0) \overset{d}{\to} N\left(0,\ \mathcal{I}_\infty(\theta^0)\right).$$
That is, it's not necessary to subtract the individual means, since they are zero. Given this, and
due to independent observations, a consistent estimator is
$$\hat{\mathcal{I}} = \frac{1}{n}\sum_{t=1}^{n}D_\theta\ln f_t(\hat\theta)\,D_{\theta'}\ln f_t(\hat\theta)$$
This is an important case where consistent estimation of the covariance matrix is possible. Other cases
exist, even for dynamically misspecified time series models.
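To illustrate the i.i.d. case, here is a minimal Python sketch of the sandwich calculation for the simplest possible quasi-likelihood: a Poisson density with $\lambda = \exp(\theta)$ and no regressors. The function name and the toy data are hypothetical and are not taken from the original materials. For this density $D_\theta \ln f_t = y_t - \exp(\theta)$ and $D_\theta^2 \ln f_t = -\exp(\theta)$, so the Hessian and outer-product-of-gradient pieces can be computed directly.

```python
import math


def poisson_qml_intercept(y):
    """Poisson QML with lambda = exp(theta) and no regressors.

    Returns (theta_hat, sandwich, naive), where the last two are estimates of
    the asymptotic variance of sqrt(n) * (theta_hat - theta0).
    """
    n = len(y)
    lam = sum(y) / n                 # f.o.c.: sum(y_t - exp(theta)) = 0
    theta_hat = math.log(lam)
    J_hat = -lam                     # average Hessian of ln f_t at theta_hat
    I_hat = sum((yt - lam) ** 2 for yt in y) / n   # average squared score
    sandwich = I_hat / J_hat ** 2    # J^{-1} I J^{-1} in the scalar case
    naive = -1.0 / J_hat             # only valid if the information equality holds
    return theta_hat, sandwich, naive


if __name__ == "__main__":
    # hypothetical overdispersed counts (mean 4, variance far above 4)
    y = [0, 0, 0, 1, 1, 2, 3, 5, 8, 20]
    print(poisson_qml_intercept(y))
```

With overdispersed data the sandwich variance is much larger than the inverse Hessian, which is the sense in which the ordinary ML covariance estimator is unreliable under misspecification.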
18.2 Example: the MEPS Data
To check the plausibility of the Poisson model for the MEPS data, we can compare the sample
unconditional variance with the estimated unconditional variance according to the Poisson model:
$\hat V(y) = \frac{\sum_{t=1}^{n}\hat\lambda_t}{n}$. Using the program PoissonVariance.m, for OBDV and ERV, we get

Table 18.1: Marginal Variances, Sample and Estimated (Poisson)
            OBDV    ERV
Sample      38.09   0.151
Estimated   3.28    0.086

We see that even after conditioning, the overdispersion is not captured in either case. There is a huge problem with
OBDV, and a significant problem with ERV. In both cases the Poisson model does not appear to be
plausible. You can check this for the other use measures if you like.
Infinite mixture models: the negative binomial model
Reference: Cameron and Trivedi (1998) Regression analysis of count data, chapter 4.
The two measures seem to exhibit extra-Poisson variation. To capture unobserved heterogeneity,
a possibility is the random parameters approach. Consider the possibility that the constant term in a
Poisson model were random:
$$
\begin{aligned}
f_Y(y|x,\epsilon) &= \frac{\exp(-\theta)\theta^{y}}{y!} \\
\theta &= \exp(x'\beta + \epsilon) \\
&= \exp(x'\beta)\exp(\epsilon) \\
&= \lambda\nu
\end{aligned}
$$
where $\lambda = \exp(x'\beta)$ and $\nu = \exp(\epsilon)$. Now $\nu$ captures the randomness in the constant. The problem
is that we don't observe $\nu$, so we will need to marginalize it to get a usable density
$$f_Y(y|x) = \int_{-\infty}^{\infty}\frac{\exp[-\theta]\theta^{y}}{y!}f_\nu(z)\,dz$$
(with $\theta = \lambda z$ inside the integral).
This density can be used directly, perhaps using numerical integration to evaluate the likelihood
function. In some cases, though, the integral will have an analytic solution. For example, if $\nu$ follows
a certain one parameter gamma density, then
$$f_Y(y|x,\phi) = \frac{\Gamma(y+\psi)}{\Gamma(y+1)\Gamma(\psi)}\left(\frac{\psi}{\psi+\lambda}\right)^{\psi}\left(\frac{\lambda}{\psi+\lambda}\right)^{y} \qquad (18.1)$$
where $\phi = (\lambda, \psi)$. $\psi$ appears since it is the parameter of the gamma density.
For this density, $E(y|x) = \lambda$, which we have parameterized $\lambda = \exp(x'\beta)$.
The variance depends upon how $\psi$ is parameterized.
If $\psi = \lambda/\alpha$, where $\alpha > 0$, then $V(y|x) = \lambda + \alpha\lambda$. Note that $\lambda$ is a function of $x$, so that
the variance is too. This is referred to as the NB-I model.
If $\psi = 1/\alpha$, where $\alpha > 0$, then $V(y|x) = \lambda + \alpha\lambda^2$. This is referred to as the NB-II model.
So both forms of the NB model allow for overdispersion, with the NB-II model allowing for a more
radical form.
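The mean and variance statements for the two parameterizations can be checked numerically. The sketch below is an illustration only (the function names and the values $\lambda = 2$, $\alpha = 0.5$ are hypothetical): it evaluates the density in equation 18.1 on the log scale and sums over a long range of counts to approximate the first two moments.

```python
import math


def nb_pmf(y, lam, psi):
    """Negative binomial density of equation 18.1, evaluated on the log scale."""
    log_p = (math.lgamma(y + psi) - math.lgamma(y + 1) - math.lgamma(psi)
             + psi * math.log(psi / (psi + lam))
             + y * math.log(lam / (psi + lam)))
    return math.exp(log_p)


def nb_moments(lam, alpha, kind, y_max=2000):
    """Approximate E(y|x) and V(y|x) by summing the density over 0..y_max."""
    if kind == "NB-I":
        psi = lam / alpha
    elif kind == "NB-II":
        psi = 1.0 / alpha
    else:
        raise ValueError("kind must be 'NB-I' or 'NB-II'")
    probs = [nb_pmf(y, lam, psi) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var


if __name__ == "__main__":
    print(nb_moments(2.0, 0.5, "NB-I"))    # variance should be near lam + alpha*lam
    print(nb_moments(2.0, 0.5, "NB-II"))   # variance should be near lam + alpha*lam**2
```

Both parameterizations share the mean $\lambda$; only the variance function differs, which is why the choice between NB-I and NB-II is an empirical matter.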
Testing reduction of a NB model to a Poisson model cannot be done by testing $\alpha = 0$ using standard
Wald or LR procedures. The critical values need to be adjusted to account for the fact that $\alpha = 0$ is
on the boundary of the parameter space. Without getting into details, suppose that the data were in
fact Poisson, so there is equidispersion and the true $\alpha = 0$. Then about half the time the sample data
will be underdispersed, and about half the time overdispersed. When the data is underdispersed, the
MLE of $\alpha$ will be $\hat\alpha = 0$. Thus, under the null, there will be a probability spike in the asymptotic
distribution of $\sqrt{n}(\hat\alpha - \alpha) = \sqrt{n}\hat\alpha$ at 0, so standard testing methods will not be valid.


This program will do estimation using the NB model. Note how modelargs is used to select a NB-I
or NB-II density. Here are NB-I estimation results for OBDV:
MPITB extensions found
OBDV
======================================================
BFGSMIN final results
Used analytic gradient
------------------------------------------------------
STRONG CONVERGENCE
Function conv 1 Param conv 1 Gradient conv 1
------------------------------------------------------
Objective function value 2.18573
Stepsize 0.0007
17 iterations
------------------------------------------------------
param gradient change
1.0965 0.0000 -0.0000
0.2551 -0.0000 0.0000
0.2024 -0.0000 0.0000
0.2289 0.0000 -0.0000
0.1969 0.0000 -0.0000
0.0769 0.0000 -0.0000
0.0000 -0.0000 0.0000
1.7146 -0.0000 0.0000
******************************************************
Negative Binomial model, MEPS 1996 full data set
MLE Estimation Results
BFGS convergence: Normal convergence
Average Log-L: -2.185730
Observations: 4564
estimate st. err t-stat p-value
constant -0.523 0.104 -5.005 0.000
pub. ins. 0.765 0.054 14.198 0.000
priv. ins. 0.451 0.049 9.196 0.000
sex 0.458 0.034 13.512 0.000
age 0.016 0.001 11.869 0.000
edu 0.027 0.007 3.979 0.000
inc 0.000 0.000 0.000 1.000
alpha 5.555 0.296 18.752 0.000
Information Criteria
CAIC : 20026.7513 Avg. CAIC: 4.3880
BIC : 20018.7513 Avg. BIC: 4.3862
AIC : 19967.3437 Avg. AIC: 4.3750
******************************************************
Note that the parameter values of the last BFGS iteration are different from those reported in the
final results. This reflects two things. First, the data were scaled before doing the BFGS minimization,
but the mle_results script takes this into account and reports the results using the original
scaling. Second, the parameterization $\alpha = \exp(\alpha^*)$ is used to enforce the restriction that $\alpha > 0$.
The unrestricted parameter $\alpha^* = \log\alpha$ is used to define the log-likelihood function, since the BFGS
minimization algorithm does not do constrained minimization. To get the standard error and t-statistic
of the estimate of $\alpha$, we need to use the delta method. This is done inside mle_results, making use
of the function parameterize.m .
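As a rough illustration of the delta method step (this is not the code in mle_results or parameterize.m, whose internals are not shown here), the following Python sketch maps an estimate of $\alpha^*$ and its standard error into $\hat\alpha = \exp(\hat\alpha^*)$ and an approximate standard error. The function name is hypothetical. The value 1.7146 is the last entry of the BFGS parameter vector in the NB-I output above; the standard error of $\alpha^*$ is not reported there, so the sketch backs out the value implied by the reported standard error of $\alpha$.

```python
import math


def delta_method_exp(alpha_star, se_alpha_star):
    """Map an estimate of alpha* = log(alpha) into alpha = exp(alpha*).

    Delta method: se(alpha) is approximately |d exp(a)/da| * se(alpha*),
    and the derivative of exp(a) is exp(a) itself.
    """
    alpha = math.exp(alpha_star)
    se_alpha = alpha * se_alpha_star
    return alpha, se_alpha


if __name__ == "__main__":
    alpha_star = 1.7146              # last BFGS parameter in the NB-I output
    reported_se_alpha = 0.296        # st. err of alpha in the NB-I results
    alpha_hat = math.exp(alpha_star)
    # implied (not reported) standard error of alpha*, backed out for illustration
    implied_se_star = reported_se_alpha / alpha_hat
    print(delta_method_exp(alpha_star, implied_se_star))
```

The transformed point estimate agrees with the reported alpha of 5.555 up to rounding, which is consistent with the reparameterization described in the text.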
Likewise, here are NB-II results:
MPITB extensions found
OBDV
======================================================
BFGSMIN final results
Used analytic gradient
------------------------------------------------------
STRONG CONVERGENCE
Function conv 1 Param conv 1 Gradient conv 1
------------------------------------------------------
Objective function value 2.18496
Stepsize 0.0104394
13 iterations
------------------------------------------------------
param gradient change
1.0375 0.0000 -0.0000
0.3673 -0.0000 0.0000
0.2136 0.0000 -0.0000
0.2816 0.0000 -0.0000
0.3027 0.0000 0.0000
0.0843 -0.0000 0.0000
-0.0048 0.0000 -0.0000
0.4780 -0.0000 0.0000
******************************************************
Negative Binomial model, MEPS 1996 full data set
MLE Estimation Results
BFGS convergence: Normal convergence
Average Log-L: -2.184962
Observations: 4564
estimate st. err t-stat p-value
constant -1.068 0.161 -6.622 0.000
pub. ins. 1.101 0.095 11.611 0.000
priv. ins. 0.476 0.081 5.880 0.000
sex 0.564 0.050 11.166 0.000
age 0.025 0.002 12.240 0.000
edu 0.029 0.009 3.106 0.002
inc -0.000 0.000 -0.176 0.861
alpha 1.613 0.055 29.099 0.000
Information Criteria
CAIC : 20019.7439 Avg. CAIC: 4.3864
BIC : 20011.7439 Avg. BIC: 4.3847
AIC : 19960.3362 Avg. AIC: 4.3734
******************************************************
For the OBDV usage measure, the NB-II model does a slightly better job than the NB-I model, in
terms of the average log-likelihood and the information criteria (more on this last in a moment).
Note that both versions of the NB model fit much better than does the Poisson model (see 11.4).
The estimated $\alpha$ is highly significant.
To check the plausibility of the NB-II model, we can compare the sample unconditional variance
with the estimated unconditional variance according to the NB-II model:
$\hat V(y) = \frac{\sum_{t=1}^{n}\left[\hat\lambda_t + \hat\alpha\left(\hat\lambda_t\right)^2\right]}{n}$. For
OBDV and ERV (estimation results not reported), we get

Table 18.2: Marginal Variances, Sample and Estimated (NB-II)
            OBDV    ERV
Sample      38.09   0.151
Estimated   30.58   0.182

For OBDV, the overdispersion problem is
significantly better than in the Poisson case, but there is still some that is not captured. For ERV,
the negative binomial model seems to capture the overdispersion adequately.
Finite mixture models: the mixed negative binomial model
The finite mixture approach to fitting health care demand was introduced by Deb and Trivedi (1997).
The mixture approach has the intuitive appeal of allowing for subgroups of the population with different
health status. If individuals are classified as healthy or unhealthy then two subgroups are defined. A
finer classification scheme would lead to more subgroups. Many studies have incorporated objective
and/or subjective indicators of health status in an effort to capture this heterogeneity. The available
objective measures, such as limitations on activity, are not necessarily very informative about a person's
overall health status. Subjective, self-reported measures may suffer from the same problem, and may
also not be exogenous.
Finite mixture models are conceptually simple. The density is
$$f_Y(y,\phi_1,\ldots,\phi_p,\pi_1,\ldots,\pi_{p-1}) = \sum_{i=1}^{p-1}\pi_i f_Y^{(i)}(y,\phi_i) + \pi_p f_Y^{(p)}(y,\phi_p),$$
where $\pi_i > 0$, $i = 1, 2, \ldots, p$, $\pi_p = 1 - \sum_{i=1}^{p-1}\pi_i$, and $\sum_{i=1}^{p}\pi_i = 1$. Identification requires that the $\pi_i$ are
ordered in some way, for example, $\pi_1 \geq \pi_2 \geq \cdots \geq \pi_p$ and $\phi_i \neq \phi_j$, $i \neq j$. This is simple to accomplish
post-estimation by rearrangement and possible elimination of redundant component densities.
The properties of the mixture density follow in a straightforward way from those of the components.
In particular, the moment generating function is the same mixture of the moment
generating functions of the component densities, so, for example, $E(Y|x) = \sum_{i=1}^{p}\pi_i\mu_i(x)$, where
$\mu_i(x)$ is the mean of the $i$th component density.
Mixture densities may suffer from overparameterization, since the total number of parameters
grows rapidly with the number of component densities. It is possible to constrain parameters
across the mixtures.
Testing for the number of component densities is a tricky issue. For example, testing for p = 1
(a single component, which is to say, no mixture) versus p = 2 (a mixture of two components)
involves the restriction $\pi_1 = 1$, which is on the boundary of the parameter space. Note that when
$\pi_1 = 1$, the parameters of the second component can take on any value without affecting the
density. Usual methods such as the likelihood ratio test are not applicable when parameters are
on the boundary under the null hypothesis. Information criteria means of choosing the model
(see below) are valid.
The following results are for a mixture of 2 NB-II models, for the OBDV data, which you can replicate
using this program .
OBDV
******************************************************
Mixed Negative Binomial model, MEPS 1996 full data set
MLE Estimation Results
BFGS convergence: Normal convergence
Average Log-L: -2.164783
Observations: 4564
estimate st. err t-stat p-value
constant 0.127 0.512 0.247 0.805
pub. ins. 0.861 0.174 4.962 0.000
priv. ins. 0.146 0.193 0.755 0.450
sex 0.346 0.115 3.017 0.003
age 0.024 0.004 6.117 0.000
edu 0.025 0.016 1.590 0.112
inc -0.000 0.000 -0.214 0.831
alpha 1.351 0.168 8.061 0.000
constant 0.525 0.196 2.678 0.007
pub. ins. 0.422 0.048 8.752 0.000
priv. ins. 0.377 0.087 4.349 0.000
sex 0.400 0.059 6.773 0.000
age 0.296 0.036 8.178 0.000
edu 0.111 0.042 2.634 0.008
inc 0.014 0.051 0.274 0.784
alpha 1.034 0.187 5.518 0.000
Mix 0.257 0.162 1.582 0.114
Information Criteria
CAIC : 19920.3807 Avg. CAIC: 4.3647
BIC : 19903.3807 Avg. BIC: 4.3610
AIC : 19794.1395 Avg. AIC: 4.3370
******************************************************
It is worth noting that the mixture parameter is not significantly different from zero, but also note
that the coefficients of public insurance and age, for example, differ quite a bit between the two latent
classes.
Information criteria
As seen above, a Poisson model can't be tested (using standard methods) as a restriction of a negative
binomial model. But it seems, based upon the values of the likelihood functions and the fact that
the NB model fits the variance much better, that the NB model is more appropriate. How can we
determine which of a set of competing models is the best?
The information criteria approach is one possibility. Information criteria are functions of the log-likelihood,
with a penalty for the number of parameters used. Three popular information criteria are
the Akaike (AIC), Bayes (BIC) and consistent Akaike (CAIC). The formulae are
$$
\begin{aligned}
CAIC &= -2\ln L(\hat\theta) + k(\ln n + 1) \\
BIC &= -2\ln L(\hat\theta) + k\ln n \\
AIC &= -2\ln L(\hat\theta) + 2k
\end{aligned}
$$
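The formulae above are easy to check against the reported estimation output. The short Python sketch below is only an illustration (the function name is hypothetical); it recovers $\ln L(\hat\theta)$ as $n$ times the reported average log-likelihood and applies the three penalties. The inputs are taken from the NB-I results for OBDV shown earlier: average log-likelihood -2.185730, n = 4564, and k = 8 parameters (seven coefficients plus alpha).

```python
import math


def information_criteria(avg_loglik, n, k):
    """CAIC, BIC and AIC from the average log-likelihood, sample size n
    and number of parameters k."""
    minus_two_lnL = -2.0 * n * avg_loglik
    return {
        "CAIC": minus_two_lnL + k * (math.log(n) + 1.0),
        "BIC": minus_two_lnL + k * math.log(n),
        "AIC": minus_two_lnL + 2.0 * k,
    }


if __name__ == "__main__":
    ic = information_criteria(-2.185730, 4564, 8)   # NB-I, OBDV
    for name, value in ic.items():
        # also report the per-observation value, as in the "Avg." columns
        print(name, round(value, 4), round(value / 4564, 4))
```

Up to the rounding of the average log-likelihood, the values agree with the CAIC, BIC and AIC lines of the NB-I output, and dividing by n gives the averaged criteria.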
It can be shown that the CAIC and BIC will select the correctly specified model from a group of
models, asymptotically. This doesn't mean, of course, that the correct model is necessarily in the
group. The AIC is not consistent, and will asymptotically favor an over-parameterized model over the
correctly specified model. Here are information criteria values for the models we've seen, for OBDV.

Table 18.3: Information Criteria, OBDV
Model     AIC     BIC     CAIC
Poisson   7.345   7.355   7.357
NB-I      4.375   4.386   4.388
NB-II     4.373   4.385   4.386
MNB-II    4.337   4.361   4.365

Pretty clearly, the NB models are better than the Poisson. The one additional parameter gives a very
significant improvement in the likelihood function value. Between the NB-I and NB-II models, the
NB-II is slightly favored. But one should remember that information criteria values are statistics,
with variances. With another sample, it may well be that the NB-I model would be favored, since the
differences are so small. The MNB-II model is favored over the others, by all 3 information criteria.
Why is all of this in the chapter on QML? Let's suppose that the correct model for OBDV is in fact
the NB-II model. It turns out in this case that the Poisson model will give consistent estimates of the
slope parameters (if a model is a member of the linear-exponential family and the conditional mean
is correctly specified, then the parameters of the conditional mean will be consistently estimated).
So the Poisson estimator would be a QML estimator that is consistent for some parameters of the
true model. The ordinary OPG or inverse Hessian "ML" covariance estimators are however biased
and inconsistent, since the information matrix equality does not hold for QML estimators. But for
i.i.d. data (which is the case for the MEPS data) the QML asymptotic covariance can be consistently
estimated, as discussed above, using the sandwich form for the ML estimator. mle_results in fact
reports sandwich results, so the Poisson estimation results would be reliable for inference even if the
true model is the NB-I or NB-II. Note that they are in fact similar to the results for the NB models.
However, if we assume that the correct model is the MNB-II model, as is favored by the information
criteria, then both the Poisson and NB-x models will have misspecified mean functions, so the
parameters that influence the means would be estimated with bias and inconsistently.
18.3 Exercises
1. Considering the MEPS data (the description is in Section 11.4), for the OBDV (y) measure, let
$\eta$ be a latent index of health status that has expectation equal to unity [1]. We suspect that $\eta$ and
PRIV may be correlated, but we assume that $\eta$ is uncorrelated with the other regressors. We
assume that
$$E(y|PUB, PRIV, AGE, EDUC, INC, \eta) = \exp(\beta_1 + \beta_2 PUB + \beta_3 PRIV + \beta_4 AGE + \beta_5 EDUC + \beta_6 INC)\,\eta.$$
We use the Poisson QML estimator of the model
$$
\begin{aligned}
y &\sim \text{Poisson}(\lambda) \\
\lambda &= \exp(\beta_1 + \beta_2 PUB + \beta_3 PRIV + \beta_4 AGE + \beta_5 EDUC + \beta_6 INC). \qquad (18.2)
\end{aligned}
$$
Since much previous evidence indicates that health care services usage is overdispersed [2], this is
almost certainly not an ML estimator, and thus is not efficient. However, when $\eta$ and PRIV
are uncorrelated, this estimator is consistent for the $\beta_i$ parameters, since the conditional mean
is correctly specified in that case. When $\eta$ and PRIV are correlated, Mullahy's (1997) NLIV
estimator that uses the residual function
$$\epsilon = \frac{y}{\lambda} - 1,$$
where $\lambda$ is defined in equation 18.2, with appropriate instruments, is consistent. As instruments
we use all the exogenous regressors, as well as the cross products of PUB with the variables in
Z = {AGE, EDUC, INC}. That is, the full set of instruments is
W = {1, PUB, Z, PUB × Z}.
(a) Calculate the Poisson QML estimates.
(b) Calculate the generalized IV estimates (do it using a GMM formulation - see the portfolio
example for hints how to do this).
(c) Calculate the Hausman test statistic to test the exogeneity of PRIV.
(d) Comment on the results.
[1] A restriction of this sort is necessary for identification.
[2] Overdispersion exists when the conditional variance is greater than the conditional mean. If this is the case, the Poisson specification is not
correct.
Chapter 19
Nonlinear least squares (NLS)
Readings: Davidson and MacKinnon, Ch. 2 and 5; Gallant, Ch. 1
19.1 Introduction and definition
Nonlinear least squares (NLS) is a means of estimating the parameter of the model
$$y_t = f(x_t, \theta^0) + \epsilon_t.$$
In general, $\epsilon_t$ will be heteroscedastic and autocorrelated, and possibly nonnormally distributed.
However, dealing with this is exactly as in the case of linear models, so we'll just treat the iid
case here,
$$\epsilon_t \sim iid(0, \sigma^2)$$
If we stack the observations vertically, defining
$$y = (y_1, y_2, \ldots, y_n)'$$
$$f = (f(x_1,\theta), f(x_2,\theta), \ldots, f(x_n,\theta))'$$
and
$$\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)'$$
we can write the n observations as
$$y = f(\theta) + \epsilon$$
Using this notation, the NLS estimator can be defined as
$$\hat\theta \equiv \arg\min_{\Theta} s_n(\theta) = \frac{1}{n}\left[y - f(\theta)\right]'\left[y - f(\theta)\right] = \frac{1}{n}\left\| y - f(\theta)\right\|^2$$
The estimator minimizes the sum of squared errors (scaled by 1/n), which is the same as minimizing
the Euclidean distance between $y$ and $f(\theta)$.
The objective function can be written as
$$s_n(\theta) = \frac{1}{n}\left[y'y - 2y'f(\theta) + f(\theta)'f(\theta)\right],$$
which gives the first order conditions
$$-\left[\frac{\partial}{\partial\theta}f(\hat\theta)'\right]y + \left[\frac{\partial}{\partial\theta}f(\hat\theta)'\right]f(\hat\theta) \equiv 0.$$
Define the $n \times K$ matrix
$$F(\hat\theta) \equiv D_{\theta'}f(\hat\theta). \qquad (19.1)$$
In shorthand, use $\hat F$ in place of $F(\hat\theta)$. Using this, the first order conditions can be written as
$$-\hat F'y + \hat F'f(\hat\theta) \equiv 0,$$
or
$$\hat F'\left[y - f(\hat\theta)\right] \equiv 0. \qquad (19.2)$$
This bears a good deal of similarity to the f.o.c. for the linear model - the derivative of the prediction
is orthogonal to the prediction error. If $f(\theta) = X\theta$, then $\hat F$ is simply $X$, so the f.o.c. (with spherical
errors) simplify to
$$X'y - X'X\beta = 0,$$
the usual OLS f.o.c.
We can interpret this geometrically: INSERT drawings of geometrical depiction of OLS and NLS
(see Davidson and MacKinnon, pgs. 8,13 and 46).
Note that the nonlinearity of the manifold leads to potential multiple local maxima, minima and
saddlepoints: the objective function $s_n(\theta)$ is not necessarily well-behaved and may be difficult to
minimize.
19.2 Identification
As before, identification can be considered conditional on the sample, and asymptotically. The
condition for asymptotic identification is that $s_n(\theta)$ tend to a limiting function $s_\infty(\theta)$ such that
$s_\infty(\theta^0) < s_\infty(\theta)$, $\forall\theta \neq \theta^0$. This will be the case if $s_\infty(\theta^0)$ is strictly convex at $\theta^0$, which requires
that $D_\theta^2 s_\infty(\theta^0)$ be positive definite. Consider the objective function:
$$
\begin{aligned}
s_n(\theta) &= \frac{1}{n}\sum_{t=1}^{n}\left[y_t - f(x_t,\theta)\right]^2 \\
&= \frac{1}{n}\sum_{t=1}^{n}\left[f(x_t,\theta^0) + \epsilon_t - f(x_t,\theta)\right]^2 \\
&= \frac{1}{n}\sum_{t=1}^{n}\left[f_t(\theta^0) - f_t(\theta)\right]^2 + \frac{1}{n}\sum_{t=1}^{n}\left(\epsilon_t\right)^2 + \frac{2}{n}\sum_{t=1}^{n}\left[f_t(\theta^0) - f_t(\theta)\right]\epsilon_t
\end{aligned}
$$
As in example 12.4, which illustrated the consistency of extremum estimators using OLS, we
conclude that the second term will converge to a constant which does not depend upon $\theta$.
A LLN can be applied to the third term to conclude that it converges pointwise to 0, as long as
$f(\theta)$ and $\epsilon$ are uncorrelated.
Next, pointwise convergence needs to be strengthened to uniform almost sure convergence. There
are a number of possible assumptions one could use. Here, we'll just assume it holds.
Turning to the first term, we'll assume a pointwise law of large numbers applies, so
$$\frac{1}{n}\sum_{t=1}^{n}\left[f_t(\theta^0) - f_t(\theta)\right]^2 \overset{a.s.}{\to} \int\left[f(z,\theta^0) - f(z,\theta)\right]^2 d\mu(z), \qquad (19.3)$$
where $\mu(x)$ is the distribution function of $x$. In many cases, $f(x,\theta)$ will be bounded and continuous,
for all $\theta \in \Theta$, so strengthening to uniform almost sure convergence is immediate. For
example if $f(x,\theta) = \left[1 + \exp(-x\theta)\right]^{-1}$, $f: \Re^K \to (0,1)$, a bounded range, and the function is
continuous in $\theta$.
Given these results, it is clear that a minimizer is $\theta^0$. When considering identification (asymptotic),
the question is whether or not there may be some other minimizer. A local condition for identification
is that
$$\frac{\partial^2}{\partial\theta\partial\theta'}s_\infty(\theta) = \frac{\partial^2}{\partial\theta\partial\theta'}\int\left[f(x,\theta^0) - f(x,\theta)\right]^2 d\mu(x)$$
be positive definite at $\theta^0$. Evaluating this derivative, we obtain (after a little work)
$$\left.\frac{\partial^2}{\partial\theta\partial\theta'}\int\left[f(x,\theta^0) - f(x,\theta)\right]^2 d\mu(x)\right|_{\theta^0} = 2\int\left[D_\theta f(z,\theta^0)\right]\left[D_\theta f(z,\theta^0)\right]' d\mu(z)$$
the expectation of the outer product of the gradient of the regression function evaluated at $\theta^0$. (Note:
the uniform boundedness we have already assumed allows passing the derivative through the integral,
by the dominated convergence theorem.) This matrix will be positive definite (wp1) as long as the
gradient vector is of full rank (wp1). The tangent space to the regression manifold must span a
K-dimensional space if we are to consistently estimate a K-dimensional parameter vector. This is
analogous to the requirement that there be no perfect collinearity in a linear model. This is a necessary
condition for identification. Note that the LLN implies that the above expectation is equal to
$$\mathcal{J}_\infty(\theta^0) = 2\lim\mathcal{E}\frac{F'F}{n}$$
19.3 Consistency
We simply assume that the conditions of Theorem 29 hold, so the estimator is consistent. Given
that the strong stochastic equicontinuity conditions hold, as discussed above, and given the above
identification conditions and a compact estimation space (the closure of the parameter space $\Theta$), the
consistency proof's assumptions are satisfied.
19.4 Asymptotic normality
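As in the case of GMM, we also simply assume that the conditions for asymptotic normality as in
Theorem 31 hold. The only remaining problem is to determine the form of the asymptotic variance-covariance
matrix. Recall that the result of the asymptotic normality theorem is
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \overset{d}{\to} N\left[0,\ \mathcal{J}_\infty(\theta^0)^{-1}\mathcal{I}_\infty(\theta^0)\mathcal{J}_\infty(\theta^0)^{-1}\right],$$
where $\mathcal{J}_\infty(\theta^0)$ is the almost sure limit of $\frac{\partial^2}{\partial\theta\partial\theta'}s_n(\theta)$ evaluated at $\theta^0$, and
$$\mathcal{I}_\infty(\theta^0) = \lim Var\,\sqrt{n}D_\theta s_n(\theta^0)$$
The objective function is
$$s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}\left[y_t - f(x_t,\theta)\right]^2$$
So
$$D_\theta s_n(\theta) = -\frac{2}{n}\sum_{t=1}^{n}\left[y_t - f(x_t,\theta)\right]D_\theta f(x_t,\theta).$$
Evaluating at $\theta^0$,
$$D_\theta s_n(\theta^0) = -\frac{2}{n}\sum_{t=1}^{n}\epsilon_t D_\theta f(x_t,\theta^0).$$
Note that the expectation of this is zero, since $\epsilon_t$ and $x_t$ are assumed to be uncorrelated. So to calculate
the variance, we can simply calculate the second moment about zero. Also note that
$$\sum_{t=1}^{n}\epsilon_t D_\theta f(x_t,\theta^0) = \frac{\partial}{\partial\theta}\left[f(\theta^0)\right]'\epsilon = F'\epsilon$$
With this we obtain
$$
\begin{aligned}
\mathcal{I}_\infty(\theta^0) &= \lim Var\,\sqrt{n}D_\theta s_n(\theta^0) \\
&= \lim n\,\mathcal{E}\frac{4}{n^2}F'\epsilon\epsilon'F \\
&= 4\sigma^2\lim\mathcal{E}\frac{F'F}{n}
\end{aligned}
$$
We've already seen that
$$\mathcal{J}_\infty(\theta^0) = 2\lim\mathcal{E}\frac{F'F}{n},$$
where the expectation is with respect to the joint density of $x$ and $\epsilon$. Combining these expressions for
$\mathcal{J}_\infty(\theta^0)$ and $\mathcal{I}_\infty(\theta^0)$, and the result of the asymptotic normality theorem, we get
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \overset{d}{\to} N\left(0,\ \left(\lim\mathcal{E}\frac{F'F}{n}\right)^{-1}\sigma^2\right).$$
We can consistently estimate the variance covariance matrix using
$$\left(\frac{\hat F'\hat F}{n}\right)^{-1}\hat\sigma^2, \qquad (19.4)$$
where $\hat F$ is defined as in equation 19.1 and
$$\hat\sigma^2 = \frac{\left[y - f(\hat\theta)\right]'\left[y - f(\hat\theta)\right]}{n},$$
the obvious estimator. Note the close correspondence to the results for the linear model.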
19.5 Example: The Poisson model for count data
Suppose that $y_t$ conditional on $x_t$ is independently distributed Poisson. A Poisson random variable is
a count data variable, which means it can take the values {0,1,2,...}. This sort of model has been used
to study visits to doctors per year, number of patents registered by businesses per year, etc.
The Poisson density is
$$f(y_t) = \frac{\exp(-\lambda_t)\lambda_t^{y_t}}{y_t!},\quad y_t \in \{0, 1, 2, \ldots\}.$$
The mean of $y_t$ is $\lambda_t$, as is the variance. Note that $\lambda_t$ must be positive. Suppose that the true mean is
$$\lambda_t^0 = \exp(x_t'\beta^0),$$
which enforces the positivity of $\lambda_t$. Suppose we estimate $\beta^0$ by nonlinear least squares:
$$\hat\beta = \arg\min s_n(\beta) = \frac{1}{n}\sum_{t=1}^{n}\left(y_t - \exp(x_t'\beta)\right)^2$$
We can write
$$
\begin{aligned}
s_n(\beta) &= \frac{1}{n}\sum_{t=1}^{n}\left[\exp(x_t'\beta^0) + \epsilon_t - \exp(x_t'\beta)\right]^2 \\
&= \frac{1}{n}\sum_{t=1}^{n}\left[\exp(x_t'\beta^0) - \exp(x_t'\beta)\right]^2 + \frac{1}{n}\sum_{t=1}^{n}\epsilon_t^2 + 2\,\frac{1}{n}\sum_{t=1}^{n}\epsilon_t\left[\exp(x_t'\beta^0) - \exp(x_t'\beta)\right]
\end{aligned}
$$
The last term has expectation zero since the assumption that $\mathcal{E}(y_t|x_t) = \exp(x_t'\beta^0)$ implies that
$\mathcal{E}(\epsilon_t|x_t) = 0$, which in turn implies that functions of $x_t$ are uncorrelated with $\epsilon_t$. Applying a strong
LLN, and noting that the objective function is continuous on a compact parameter space, we get
$$s_\infty(\beta) = \mathcal{E}_x\left[\exp(x'\beta^0) - \exp(x'\beta)\right]^2 + \mathcal{E}_x\exp(x'\beta^0)$$
where the last term comes from the fact that the conditional variance of $\epsilon$ is the same as the variance
of $y$. This function is clearly minimized at $\beta = \beta^0$, so the NLS estimator is consistent as long as
identification holds.
Exercise 63. Determine the limiting distribution of $\sqrt{n}\left(\hat\beta - \beta^0\right)$. This means finding the specific
forms of $\frac{\partial^2}{\partial\beta\partial\beta'}s_n(\beta)$, $\mathcal{J}(\beta^0)$, $\frac{\partial s_n(\beta)}{\partial\beta}$, and $\mathcal{I}(\beta^0)$. Again, use a CLT as needed, no need to verify that it
can be applied.
19.6 The Gauss-Newton algorithm
Readings: Davidson and MacKinnon, Chapter 6, pgs. 201-207

.
The Gauss-Newton optimization technique is specically designed for nonlinear least squares. The
idea is to linearize the nonlinear model, rather than the objective function. The model is
y = f (
0
) + .
At some in the parameter space, not equal to
0
, we have
y = f () +
where is a combination of the fundamental error term and the error due to evaluating the regression
function at rather than the true value
0
. Take a rst order Taylors series approximation around a
point
1
:
y = f (
1
) +
_
D

f
_

1
__ _

1
_
+ + approximation error.
Dene z y f (
1
) and b (
1
). Then the last equation can be written as
z = F(
1
)b + ,
where, as above, F(
1
) D

f (
1
) is the n K matrix of derivatives of the regression function,
evaluated at
1
, and is plus approximation error from the truncated Taylors series.
Note that F is known, given
1
.
Note that one could estimate b simply by performing OLS on the above equation.
Given

b, we calculate a new round estimate of
0
as
2
=

b +
1
. With this, take a new Taylors
series expansion around
2
and repeat the process. Stop when

b = 0 (to within a specied
tolerance).
To see why this might work, consider the above approximation, but evaluated at the NLS estimator:
y = f (

) +F(

)
_

_
+
The OLS estimate of b

is

b =
_

F
/

F
_
1

F
/
_
y f (

)
_
.
This must be zero, since

F
/
_

_ _
y f (

)
_
0
by denition of the NLS estimator (these are the normal equations as in equation 19.2, Since

b 0
when we evaluate at

, updating would stop.
The Gauss-Newton method doesn't require second derivatives, as does the Newton-Raphson method, so it's faster.
The varcov estimator, as in equation 19.4, is simple to calculate, since we have $\hat F$ as a by-product of the estimation process (i.e., it's just the last round "regressor matrix"). In fact, a normal OLS program will give the NLS varcov estimator directly, since it's just the OLS varcov estimator from the last iteration.
The method can suffer from convergence problems since $F(\theta)'F(\theta)$ may be very nearly singular, even with an asymptotically identified model, especially if $\theta$ is very far from $\hat\theta$. Consider the example
\[ y = \beta_1 + \beta_2 x_t^{\beta_3} + \varepsilon_t \]
When evaluated at $\beta_2 \approx 0$, $\beta_3$ has virtually no effect on the NLS objective function, so $F$ will have rank that is "essentially" 2, rather than 3. In this case, $F'F$ will be nearly singular, so $(F'F)^{-1}$ will be subject to large roundoff errors.
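The near-singularity in this example can be seen numerically. The sketch below is a rough illustration: the grid of $x$ values, the parameter values and the helper names (`FtF`, `det3`) are assumptions made here. It compares the determinant of $F'F$ at $\beta_2 = 1$ with the determinant at $\beta_2 = 10^{-6}$.

```python
import math

def FtF(beta, x):
    # Columns of F: df/dbeta1 = 1, df/dbeta2 = x^beta3, df/dbeta3 = beta2 * x^beta3 * ln(x).
    rows = [[1.0, xi ** beta[2], beta[1] * xi ** beta[2] * math.log(xi)] for xi in x]
    return [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]

def det3(m):
    # Determinant of a 3 x 3 matrix by cofactor expansion along the first row.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

x = [1.0 + 0.1 * i for i in range(21)]        # hypothetical regressor values on [1, 3]
det_far = det3(FtF([1.0, 1.0, 0.5], x))       # beta2 well away from zero
det_near = det3(FtF([1.0, 1e-6, 0.5], x))     # beta2 close to zero
print(det_far, det_near, det_near / det_far)
```

Because the third column of $F$ is proportional to $\beta_2$, the determinant shrinks with $\beta_2^2$, which is why the inverse becomes numerically unreliable near $\beta_2 = 0$.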
19.7 Application: Limited dependent variables and sample selection
Readings: Davidson and MacKinnon, Ch. 15 (a quick reading is sufficient); J. Heckman, "Sample Selection Bias as a Specification Error", Econometrica, 1979 (This is a classic article, not required for reading, and which is a bit out-dated. Nevertheless it's a good place to start if you encounter sample selection problems in your research).
Sample selection is a common problem in applied research. The problem occurs when observations
used in estimation are sampled non-randomly, according to some selection scheme.
Example: Labor Supply
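Labor supply of a person is a positive number of hours per unit time supposing the offer wage is higher than the reservation wage, which is the wage at which the person prefers not to work. The model (very simple, with $t$ subscripts suppressed):
Characteristics of individual: $x$
Latent labor supply: $s^* = x'\beta + \omega$
Offer wage: $w^o = z'\gamma + \nu$
Reservation wage: $w^r = q'\delta + \eta$
Write the wage differential as
\[ w^* = (z'\gamma + \nu) - (q'\delta + \eta) \equiv r'\theta + \varepsilon \]
We have the set of equations
\[ s^* = x'\beta + \omega \]
\[ w^* = r'\theta + \varepsilon. \]
Assume that
\[ \begin{bmatrix} \omega \\ \varepsilon \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{bmatrix} \right). \]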
We assume that the offer wage and the reservation wage, as well as the latent variable $s^*$, are unobservable. What is observed is
\[ w = 1\left[ w^* > 0 \right] \]
\[ s = w s^*. \]
In other words, we observe whether or not a person is working. If the person is working, we observe labor supply, which is equal to latent labor supply, $s^*$. Otherwise, $s = 0 \neq s^*$. Note that we are using a simplifying assumption that individuals can freely choose their weekly hours of work.
Suppose we estimated the model
\[ s^* = x'\beta + \text{residual} \]
using only observations for which $s > 0$. The problem is that these observations are those for which $w^* > 0$, or equivalently, $-\varepsilon < r'\theta$ and
\[ E\left[ \omega \mid -\varepsilon < r'\theta \right] \neq 0, \]
since $\varepsilon$ and $\omega$ are dependent. Furthermore, this expectation will in general depend on $x$ since elements of $x$ can enter in $r$. Because of these two facts, least squares estimation is biased and inconsistent.
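Consider more carefully $E\left[ \omega \mid -\varepsilon < r'\theta \right]$. Given the joint normality of $\omega$ and $\varepsilon$, we can write (see for example Spanos, Statistical Foundations of Econometric Modelling, pg. 122)
\[ \omega = \rho\sigma\varepsilon + \eta, \]
where $\eta$ has mean zero and is independent of $\varepsilon$. With this we can write
\[ s^* = x'\beta + \rho\sigma\varepsilon + \eta. \]
If we condition this equation on $-\varepsilon < r'\theta$ we get
\[ s = x'\beta + \rho\sigma E(\varepsilon \mid -\varepsilon < r'\theta) + \eta \]
which may be written as
\[ s = x'\beta + \rho\sigma E(\varepsilon \mid \varepsilon > -r'\theta) + \eta \]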
A useful result is that for
\[ z \sim N(0, 1) \]
\[ E(z \mid z > z^*) = \frac{\phi(z^*)}{\Phi(-z^*)}, \]
where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal density and distribution function, respectively.
The quantity on the RHS above is known as the inverse Mills ratio:
\[ IMR(z^*) = \frac{\phi(z^*)}{\Phi(-z^*)} \]
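With this we can write (making use of the fact that the standard normal density is symmetric about zero, so that $\phi(-a) = \phi(a)$):
\[ s = x'\beta + \rho\sigma \frac{\phi(r'\theta)}{\Phi(r'\theta)} + \eta \qquad (19.5) \]
\[ \equiv \begin{bmatrix} x' & \frac{\phi(r'\theta)}{\Phi(r'\theta)} \end{bmatrix} \begin{bmatrix} \beta \\ \zeta \end{bmatrix} + \eta. \qquad (19.6) \]
where $\zeta = \rho\sigma$. The error term $\eta$ has conditional mean zero, and is uncorrelated with the regressors $x'$ and $\frac{\phi(r'\theta)}{\Phi(r'\theta)}$. At this point, we can estimate the equation by NLS.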
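The truncated-mean result behind the inverse Mills ratio, $E(z \mid z > z^*) = \phi(z^*)/\Phi(-z^*)$, is easy to check by simulation. The following is a minimal sketch: the truncation point, the seed, the number of draws and the function names (`phi`, `Phi`, `imr`) are arbitrary choices for illustration.

```python
import math
import random

def phi(a):
    # Standard normal density.
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def Phi(a):
    # Standard normal distribution function, via the error function.
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def imr(z_star):
    # Inverse Mills ratio: phi(z*) / Phi(-z*).
    return phi(z_star) / Phi(-z_star)

random.seed(0)
z_star = 0.5
draws = [random.gauss(0.0, 1.0) for _ in range(200000)]
kept = [z for z in draws if z > z_star]    # keep only draws above the truncation point
sim_mean = sum(kept) / len(kept)           # simulated E(z | z > z*)
print(imr(z_star), sim_mean)
```

The two printed numbers should agree up to simulation error.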
Heckman showed how one can estimate this in a two step procedure where first $\theta$ is estimated, then equation 19.6 is estimated by least squares using the estimated value of $\theta$ to form the regressors. This is inefficient and estimation of the covariance is a tricky issue. It is probably easier (and more efficient) just to do MLE.
The model presented above depends strongly on joint normality. There exist many alternative models which weaken the maintained assumptions. It is possible to estimate consistently without distributional assumptions. See Ahn and Powell, Journal of Econometrics, 1994.
Chapter 20
Nonparametric inference
20.1 Possible pitfalls of parametric inference: estimation
Readings: H. White (1980) "Using Least Squares to Approximate Unknown Regression Functions," International Economic Review, pp. 149-70.
In this section we consider a simple example, which illustrates why nonparametric methods may in some cases be preferred to parametric methods.
We suppose that data is generated by random sampling of $(y, x)$, where $y = f(x) + \varepsilon$, $x$ is uniformly distributed on $(0, 2\pi)$, and $\varepsilon$ is a classical error. Suppose that
\[ f(x) = 1 + \frac{3x}{2\pi} - \left( \frac{x}{2\pi} \right)^2 \]
The problem of interest is to estimate the elasticity of $f(x)$ with respect to $x$, throughout the range of $x$.
In general, the functional form of $f(x)$ is unknown. One idea is to take a Taylor's series approximation to $f(x)$ about some point $x_0$. Flexible functional forms such as the transcendental logarithmic (usually known as the translog) can be interpreted as second order Taylor's series approximations. We'll work with a first order approximation, for simplicity. Approximating about $x_0$:
\[ h(x) = f(x_0) + D_x f(x_0) (x - x_0) \]
If the approximation point is $x_0 = 0$, we can write
\[ h(x) = a + bx \]
The coefficient $a$ is the value of the function at $x = 0$, and the slope is the value of the derivative at $x = 0$. These are of course not known. One might try estimation by ordinary least squares. The objective function is
\[ s(a, b) = \frac{1}{n} \sum_{t=1}^{n} (y_t - h(x_t))^2. \]
The limiting objective function, following the argument we used to get equations 12.1 and 19.3 is
\[ s_\infty(a, b) = \int_0^{2\pi} (f(x) - h(x))^2 dx. \]
The theorem regarding the consistency of extremum estimators (Theorem 29) tells us that $\hat a$ and $\hat b$ will converge almost surely to the values that minimize the limiting objective function. Solving the first order conditions (see footnote 1) reveals that $s_\infty(a, b)$ obtains its minimum at $\left\{ a^0 = \frac{7}{6}, b^0 = \frac{1}{\pi} \right\}$. The estimated approximating function $\hat h(x)$ therefore tends almost surely to
\[ h_\infty(x) = 7/6 + x/\pi \]
Footnote 1: The following results were obtained using the free computer algebra system (CAS) Maxima. Unfortunately, I have lost the source code to get the results :-(
In Figure 20.1 we see the true function and the limit of the approximation to see the asymptotic bias as a function of $x$.
[Figure 20.1: True and simple approximating functions. Horizontal axis: x; curves: approx, true.]
(The approximating model is the straight line, the true model has curvature.) Note that the approximating model is in general inconsistent, even at the approximation point. This shows that "flexible functional forms" based upon Taylor's series approximations do not in general lead to consistent estimation of functions.
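The limiting values $a^0 = 7/6$ and $b^0 = 1/\pi$ can be recovered numerically without a computer algebra system. The sketch below is not the lost Maxima code; it is an independent check that approximates the integrals on $(0, 2\pi)$ by a midpoint rule (the number of grid points `m` and all function names are assumptions made here), solves the two first order conditions, and then evaluates the error measure for the elasticity that is discussed next in the text.

```python
import math

TWO_PI = 2.0 * math.pi

def f(x):
    # True regression function from the text.
    return 1.0 + 3.0 * x / TWO_PI - (x / TWO_PI) ** 2

def f_prime(x):
    # Derivative of the true regression function.
    return 3.0 / TWO_PI - 2.0 * x / TWO_PI ** 2

m = 20000                                   # number of midpoint-rule cells (arbitrary)
w = TWO_PI / m
xs = [(i + 0.5) * w for i in range(m)]

def integral(g):
    # Midpoint-rule approximation to the integral of g over (0, 2*pi).
    return w * sum(g(x) for x in xs)

# First order conditions of the limiting objective: a 2 x 2 linear system in (a, b).
s1 = integral(lambda x: 1.0)
sx = integral(lambda x: x)
sxx = integral(lambda x: x * x)
sf = integral(f)
sxf = integral(lambda x: x * f(x))
det = s1 * sxx - sx * sx
a0 = (sxx * sf - sx * sxf) / det
b0 = (s1 * sxf - sx * sf) / det

def true_elasticity(x):
    return x * f_prime(x) / f(x)

def approx_elasticity(x):
    return x * b0 / (a0 + b0 * x)

# Square root of the integrated squared elasticity error.
err = math.sqrt(integral(lambda x: (true_elasticity(x) - approx_elasticity(x)) ** 2))
print(a0, b0, err)
```

The printed coefficients should be close to $7/6 \approx 1.1667$ and $1/\pi \approx 0.3183$.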
The approximating model seems to fit the true model fairly well, asymptotically. However, we are interested in the elasticity of the function. Recall that an elasticity is the marginal function divided by the average function:
\[ \varepsilon(x) = x\phi'(x)/\phi(x) \]
Good approximation of the elasticity over the range of $x$ will require a good approximation of both $f(x)$ and $f'(x)$ over the range of $x$. The approximating elasticity is
\[ \eta(x) = x h'(x)/h(x) \]
In Figure 20.2 we see the true elasticity and the elasticity obtained from the limiting approximating model.
[Figure 20.2: True and approximating elasticities. Horizontal axis: x; curves: approx, true.]
The true elasticity is the line that has negative slope for large $x$. Visually we see that the elasticity is not approximated so well. Root mean squared error in the approximation of the elasticity is
\[ \left( \int_0^{2\pi} (\varepsilon(x) - \eta(x))^2 dx \right)^{1/2} = .31546 \]
Now suppose we use the leading terms of a trigonometric series as the approximating model. The use of a trigonometric series as an approximating model is motivated by the asymptotic properties of the Fourier flexible functional form (Gallant, 1981, 1982), which we will study in more detail below. Normally with this type of model the number of basis functions is an increasing function of the sample size. Here we hold the set of basis functions fixed. We will consider the asymptotic behavior of a fixed model, which we interpret as an approximation to the estimator's behavior in finite samples. Consider the set of basis functions:
\[ Z(x) = \begin{bmatrix} 1 & x & \cos(x) & \sin(x) & \cos(2x) & \sin(2x) \end{bmatrix}. \]
The approximating model is
\[ g_K(x) = Z(x)\alpha. \]
Maintaining these basis functions as the sample size increases, we find that the limiting objective function is minimized at
\[ \left\{ a_1 = \frac{7}{6}, a_2 = \frac{1}{\pi}, a_3 = -\frac{1}{\pi^2}, a_4 = 0, a_5 = -\frac{1}{4\pi^2}, a_6 = 0 \right\}. \]
Substituting these values into $g_K(x)$ we obtain the almost sure limit of the approximation
\[ g_\infty(x) = 7/6 + x/\pi + (\cos x)\left( -\frac{1}{\pi^2} \right) + (\sin x) 0 + (\cos 2x)\left( -\frac{1}{4\pi^2} \right) + (\sin 2x) 0 \qquad (20.1) \]
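The six limiting coefficients can be checked numerically in the same way as for the linear approximation: form the normal equations of the limiting least squares problem on $(0, 2\pi)$ and solve them. The sketch below is illustrative only; the midpoint rule, the grid size `m` and the small Gaussian elimination routine `solve` are assumptions made here, not the author's original computation.

```python
import math

TWO_PI = 2.0 * math.pi

def f(x):
    # True regression function from the text.
    return 1.0 + 3.0 * x / TWO_PI - (x / TWO_PI) ** 2

def Z(x):
    # Basis functions 1, x, cos(x), sin(x), cos(2x), sin(2x).
    return [1.0, x, math.cos(x), math.sin(x), math.cos(2 * x), math.sin(2 * x)]

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small square system.
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= fac * M[c][j]
    sol = [0.0] * n
    for r in reversed(range(n)):
        sol[r] = (M[r][n] - sum(M[r][j] * sol[j] for j in range(r + 1, n))) / M[r][r]
    return sol

m = 5000                       # number of midpoint-rule cells (arbitrary)
w = TWO_PI / m
ZtZ = [[0.0] * 6 for _ in range(6)]
Ztf = [0.0] * 6
for i in range(m):
    x = (i + 0.5) * w
    z, fx = Z(x), f(x)
    for r in range(6):
        Ztf[r] += w * z[r] * fx                 # approximates the integral of Z_r(x) f(x)
        for c in range(6):
            ZtZ[r][c] += w * z[r] * z[c]        # approximates the integral of Z_r(x) Z_c(x)

alpha = solve(ZtZ, Ztf)
print(alpha)
```

The sine coefficients come out at zero because $f(x)$ minus its linear projection is symmetric about $x = \pi$.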
[Figure 20.3: True function and more flexible approximation. Horizontal axis: x; curves: approx, true.]
In Figure 20.3 we have the approximation and the true function. Clearly the truncated trigonometric series model offers a better approximation, asymptotically, than does the linear model. In Figure 20.4 we have the more flexible approximation's elasticity and that of the true function. On average, the fit is better, though there is some implausible waviness in the estimate. Root mean squared error in the approximation of the elasticity is
\[ \left( \int_0^{2\pi} \left( \varepsilon(x) - \frac{g_\infty'(x) x}{g_\infty(x)} \right)^2 dx \right)^{1/2} = .16213, \]
about half that of the RMSE when the first order approximation is used. If the trigonometric series contained infinite terms, this error measure would be driven to zero, as we shall see.
[Figure 20.4: True elasticity and more flexible approximation. Horizontal axis: x; curves: approx, true.]
20.2 Possible pitfalls of parametric inference: hypothesis testing
What do we mean by the term "nonparametric inference"? Simply, this means inferences that are possible without restricting the functions of interest to belong to a parametric family.
Consider means of testing for the hypothesis that consumers maximize utility. A consequence of utility maximization is that the Slutsky matrix $D^2_p h(p, U)$, where $h(p, U)$ is a set of compensated demand functions, must be negative semi-definite. One approach to testing for utility maximization would estimate a set of normal demand functions $x(p, m)$.
Estimation of these functions by normal parametric methods requires specification of the functional form of demand, for example
\[ x(p, m) = x(p, m, \theta^0) + \varepsilon, \quad \theta^0 \in \Theta^0, \]
where $x(p, m, \theta^0)$ is a function of known form and $\theta^0$ is a finite dimensional parameter.
After estimation, we could use $\hat x = x(p, m, \hat\theta)$ to calculate (by solving the integrability problem, which is non-trivial) $\hat D^2_p h(p, U)$. If we can statistically reject that the matrix is negative semi-definite, we might conclude that consumers don't maximize utility.
The problem with this is that the reason for rejection of the theoretical proposition may be that our choice of functional form is incorrect. In the introductory section we saw that functional form misspecification leads to inconsistent estimation of the function and its derivatives.
Testing using parametric models always means we are testing a compound hypothesis. The hypothesis that is tested is 1) the economic proposition we wish to test, and 2) the model is correctly specified. Failure of either 1) or 2) can lead to rejection (as can a Type-I error, even when 2) holds). This is known as the "model-induced augmenting hypothesis."
Varian's WARP allows one to test for utility maximization without specifying the form of the demand functions. The only assumptions used in the test are those directly implied by theory, so rejection of the hypothesis calls into question the theory.
Nonparametric inference also allows direct testing of economic propositions, avoiding the "model-induced augmenting hypothesis". The cost of nonparametric methods is usually an increase in complexity, and a loss of power, compared to what one would get using a well-specified parametric model. The benefit is robustness against possible misspecification.
20.3 Estimation of regression functions
The Fourier functional form
Readings: Gallant, 1987, "Identification and consistency in semi-nonparametric regression," in Advances in Econometrics, Fifth World Congress, V. 1, Truman Bewley, ed., Cambridge.
Suppose we have a multivariate model
\[ y = f(x) + \varepsilon, \]
where $f(x)$ is of unknown form and $x$ is a $P$-dimensional vector. For simplicity, assume that $\varepsilon$ is a classical error. Let us take as our goal the estimation of the vector of elasticities with typical element
\[ \xi_{x_i} = \frac{x_i}{f(x)} \frac{\partial f(x)}{\partial x_i}, \]
at an arbitrary point $x_i$.
The Fourier form, following Gallant (1982), but with a somewhat different parameterization, may be written as
\[ g_K(x \mid \theta_K) = \alpha + x'\beta + \frac{1}{2} x'Cx + \sum_{\alpha=1}^{A} \sum_{j=1}^{J} \left( u_{j\alpha} \cos(j k_\alpha' x) - v_{j\alpha} \sin(j k_\alpha' x) \right). \qquad (20.2) \]
where the $K$-dimensional parameter vector
\[ \theta_K = \left\{ \alpha, \beta', vec^*(C)', u_{11}, v_{11}, \ldots, u_{JA}, v_{JA} \right\}'. \qquad (20.3) \]
We assume that the conditioning variables $x$ have each been transformed to lie in an interval that is shorter than $2\pi$. This is required to avoid periodic behavior of the approximation, which is desirable since economic functions aren't periodic. For example, subtract sample means, divide by the maxima of the conditioning variables, and multiply by $2\pi - eps$, where $eps$ is some positive number less than $2\pi$ in value.
The $k_\alpha$ are "elementary multi-indices" which are simply $P$-vectors formed of integers (negative, positive and zero). The $k_\alpha$, $\alpha = 1, 2, ..., A$ are required to be linearly independent, and we follow the convention that the first non-zero element be positive. For example
\[ \begin{bmatrix} 0 & 1 & -1 & 0 & 1 \end{bmatrix}' \]
is a potential multi-index to be used, but
\[ \begin{bmatrix} 0 & -1 & -1 & 0 & 1 \end{bmatrix}' \]
is not since its first nonzero element is negative. Nor is
\[ \begin{bmatrix} 0 & 2 & -2 & 0 & 2 \end{bmatrix}' \]
a multi-index we would use, since it is a scalar multiple of the original multi-index.
We parameterize the matrix $C$ differently than does Gallant because it simplifies things in practice. The cost of this is that we are no longer able to test a quadratic specification using nested testing.
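The vector of first partial derivatives is
\[ D_x g_K(x \mid \theta_K) = \beta + Cx + \sum_{\alpha=1}^{A} \sum_{j=1}^{J} \left[ \left( -u_{j\alpha} \sin(j k_\alpha' x) - v_{j\alpha} \cos(j k_\alpha' x) \right) j k_\alpha \right] \qquad (20.4) \]
and the matrix of second partial derivatives is
\[ D^2_x g_K(x \mid \theta_K) = C + \sum_{\alpha=1}^{A} \sum_{j=1}^{J} \left[ \left( -u_{j\alpha} \cos(j k_\alpha' x) + v_{j\alpha} \sin(j k_\alpha' x) \right) j^2 k_\alpha k_\alpha' \right] \qquad (20.5) \]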
To define a compact notation for partial derivatives, let $\lambda$ be an $N$-dimensional multi-index with no negative elements. Define $|\lambda|^*$ as the sum of the elements of $\lambda$. If we have $N$ arguments $x$ of the (arbitrary) function $h(x)$, use $D^\lambda h(x)$ to indicate a certain partial derivative:
\[ D^\lambda h(x) \equiv \frac{\partial^{|\lambda|^*}}{\partial x_1^{\lambda_1} \partial x_2^{\lambda_2} \cdots \partial x_N^{\lambda_N}} h(x) \]
When $\lambda$ is the zero vector, $D^\lambda h(x) \equiv h(x)$. Taking this definition and the last few equations into account, we see that it is possible to define a $K$-vector $z^\lambda(x)$ so that
\[ D^\lambda g_K(x \mid \theta_K) = z^\lambda(x)'\theta_K. \qquad (20.6) \]
Both the approximating model and the derivatives of the approximating model are linear in the parameters.
For the approximating model to the function (not derivatives), write $g_K(x \mid \theta_K) = z'\theta_K$ for simplicity.
The following theorem can be used to prove the consistency of the Fourier form.
Theorem 64. [Gallant and Nychka, 1987] Suppose that $\hat h_n$ is obtained by maximizing a sample objective function $s_n(h)$ over $\mathcal{H}_{K_n}$ where $\mathcal{H}_K$ is a subset of some function space $\mathcal{H}$ on which is defined a norm $\| h \|$. Consider the following conditions:
(a) Compactness: The closure of $\mathcal{H}$ with respect to $\| h \|$ is compact in the relative topology defined by $\| h \|$.
(b) Denseness: $\cup_K \mathcal{H}_K$, $K = 1, 2, 3, ...$ is a dense subset of the closure of $\mathcal{H}$ with respect to $\| h \|$ and $\mathcal{H}_K \subset \mathcal{H}_{K+1}$.
(c) Uniform convergence: There is a point $h^*$ in $\mathcal{H}$ and there is a function $s_\infty(h, h^*)$ that is continuous in $h$ with respect to $\| h \|$ such that
\[ \lim_{n \to \infty} \sup_{\mathcal{H}} \left| s_n(h) - s_\infty(h, h^*) \right| = 0 \]
almost surely.
(d) Identification: Any point $h$ in the closure of $\mathcal{H}$ with $s_\infty(h, h^*) \geq s_\infty(h^*, h^*)$ must have $\| h - h^* \| = 0$.
Under these conditions $\lim_{n \to \infty} \| h^* - \hat h_n \| = 0$ almost surely, provided that $\lim_{n \to \infty} K_n = \infty$ almost surely.
The modification of the original statement of the theorem that has been made is to set the parameter space $\Theta$ in Gallant and Nychka's (1987) Theorem 0 to a single point and to state the theorem in terms of maximization rather than minimization.
This theorem is very similar in form to Theorem 29. The main differences are:
1. A generic norm $\| h \|$ is used in place of the Euclidean norm. This norm may be stronger than the Euclidean norm, so that convergence with respect to $\| h \|$ implies convergence w.r.t. the Euclidean norm. Typically we will want to make sure that the norm is strong enough to imply convergence of all functions of interest.
2. The "estimation space" $\mathcal{H}$ is a function space. It plays the role of the parameter space $\Theta$ in our discussion of parametric estimators. There is no restriction to a parametric family, only a restriction to a space of functions that satisfy certain conditions. This formulation is much less restrictive than the restriction to a parametric family.
3. There is a denseness assumption that was not present in the other theorem.
We will not prove this theorem (the proof is quite similar to the proof of Theorem 29, see Gallant, 1987) but we will discuss its assumptions, in relation to the Fourier form as the approximating model.
Sobolev norm. Since all of the assumptions involve the norm $\| h \|$, we need to make explicit what norm we wish to use. We need a norm that guarantees that the errors in approximation of the functions we are interested in are accounted for. Since we are interested in first-order elasticities in the present case, we need close approximation of both the function $f(x)$ and its first derivative $f'(x)$, throughout the range of $x$. Let $\mathcal{X}$ be an open set that contains all values of $x$ that we're interested in. The Sobolev norm is appropriate in this case. It is defined, making use of our notation for partial derivatives, as:
\[ \| h \|_{m,\mathcal{X}} = \max_{|\lambda|^* \leq m} \sup_{\mathcal{X}} \left| D^\lambda h(x) \right| \]
To see whether or not the function $f(x)$ is well approximated by an approximating model $g_K(x \mid \theta_K)$, we would evaluate
\[ \| f(x) - g_K(x \mid \theta_K) \|_{m,\mathcal{X}}. \]
We see that this norm takes into account errors in approximating the function and partial derivatives up to order $m$. If we want to estimate first order elasticities, as is the case in this example, the relevant $m$ would be $m = 1$. Furthermore, since we examine the sup over $\mathcal{X}$, convergence w.r.t. the Sobolev norm means uniform convergence, so that we obtain consistent estimates for all values of $x$.
Compactness. Verifying compactness with respect to this norm is quite technical and unenlightening. It is proven by Elbadawi, Gallant and Souza, Econometrica, 1983. The basic requirement is that if we need consistency w.r.t. $\| h \|_{m,\mathcal{X}}$, then the functions of interest must belong to a Sobolev space which takes into account derivatives of order $m + 1$. A Sobolev space is the set of functions
\[ \mathcal{W}_{m,\mathcal{X}}(D) = \left\{ h(x) : \| h(x) \|_{m,\mathcal{X}} < D \right\}, \]
where $D$ is a finite constant. In plain words, the functions must have bounded partial derivatives of one order higher than the derivatives we seek to estimate.
The estimation space and the estimation subspace. Since in our case we're interested in consistent estimation of first-order elasticities, we'll define the estimation space as follows:
Definition 65. [Estimation space] The estimation space $\mathcal{H} = \mathcal{W}_{2,\mathcal{X}}(D)$. The estimation space is an open set, and we presume that $h^* \in \mathcal{H}$.
So we are assuming that the function to be estimated has bounded second derivatives throughout $\mathcal{X}$.
With seminonparametric estimators, we don't actually optimize over the estimation space. Rather, we optimize over a subspace, $\mathcal{H}_{K_n}$, defined as:
Definition 66. [Estimation subspace] The estimation subspace $\mathcal{H}_K$ is defined as
\[ \mathcal{H}_K = \left\{ g_K(x \mid \theta_K) : g_K(x \mid \theta_K) \in \mathcal{W}_{2,\mathcal{Z}}(D), \theta_K \in \Re^K \right\}, \]
where $g_K(x, \theta_K)$ is the Fourier form approximation as defined in Equation 20.2.
Denseness. The important point here is that $\mathcal{H}_K$ is a space of functions that is indexed by a finite dimensional parameter ($\theta_K$ has $K$ elements, as in equation 20.3). With $n$ observations, $n > K$, this parameter is estimable. Note that the true function $h^*$ is not necessarily an element of $\mathcal{H}_K$, so optimization over $\mathcal{H}_K$ may not lead to a consistent estimator. In order for optimization over $\mathcal{H}_K$ to be equivalent to optimization over $\mathcal{H}$, at least asymptotically, we need that:
1. The dimension of the parameter vector, $\dim \theta_{K_n} \to \infty$ as $n \to \infty$. This is achieved by making $A$ and $J$ in equation 20.2 increasing functions of $n$, the sample size. It is clear that $K$ will have to grow more slowly than $n$. The second requirement is:
2. We need that the $\mathcal{H}_K$ be dense subsets of $\mathcal{H}$.
The estimation subspace $\mathcal{H}_K$, defined above, is a subset of the closure of the estimation space, $\overline{\mathcal{H}}$. A set of subsets $\mathcal{A}_a$ of a set $\mathcal{A}$ is "dense" if the closure of the countable union of the subsets is equal to the closure of $\mathcal{A}$:
\[ \overline{\cup_{a=1}^{\infty} \mathcal{A}_a} = \overline{\mathcal{A}} \]
Use a picture here. The rest of the discussion of denseness is provided just for completeness: there's no need to study it in detail. To show that $\mathcal{H}_K$ is a dense subset of $\mathcal{H}$ with respect to $\| h \|_{1,\mathcal{X}}$, it is useful to apply Theorem 1 of Gallant (1982), who in turn cites Edmunds and Moscatelli (1977). We reproduce the theorem as presented by Gallant, with minor notational changes, for convenience of reference:
Theorem 67. [Edmunds and Moscatelli, 1977] Let the real-valued function $h^*(x)$ be continuously differentiable up to order $m$ on an open set containing the closure of $\mathcal{X}$. Then it is possible to choose a triangular array of coefficients $\theta_1, \theta_2, \ldots, \theta_K, \ldots$, such that for every $q$ with $0 \leq q < m$, and every $\varepsilon > 0$, $\| h^*(x) - h_K(x \mid \theta_K) \|_{q,\mathcal{X}} = o(K^{-m+q+\varepsilon})$ as $K \to \infty$.
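In the present application, $q = 1$, and $m = 2$. By definition of the estimation space, the elements of $\mathcal{H}$ are once continuously differentiable on $\mathcal{X}$, which is open and contains the closure of $\mathcal{X}$, so the theorem is applicable. Closely following Gallant and Nychka (1987), $\cup_\infty \mathcal{H}_K$ is the countable union of the $\mathcal{H}_K$. The implication of Theorem 67 is that there is a sequence of $\{h_K\}$ from $\cup_\infty \mathcal{H}_K$ such that
\[ \lim_{K \to \infty} \| h^* - h_K \|_{1,\mathcal{X}} = 0, \]
for all $h^* \in \mathcal{H}$. Therefore,
\[ \mathcal{H} \subset \overline{\cup_\infty \mathcal{H}_K}. \]
However,
\[ \cup_\infty \mathcal{H}_K \subset \mathcal{H}, \]
so
\[ \overline{\cup_\infty \mathcal{H}_K} \subset \overline{\mathcal{H}}. \]
Therefore
\[ \overline{\mathcal{H}} = \overline{\cup_\infty \mathcal{H}_K}, \]
so $\cup_\infty \mathcal{H}_K$ is a dense subset of $\mathcal{H}$, with respect to the norm $\| h \|_{1,\mathcal{X}}$.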
Uniform convergence. We now turn to the limiting objective function. We estimate by OLS. The sample objective function stated in terms of maximization is
\[ s_n(\theta_K) = -\frac{1}{n} \sum_{t=1}^{n} (y_t - g_K(x_t \mid \theta_K))^2 \]
With random sampling, as in the case of Equations 12.1 and 19.3, the limiting objective function is
\[ s_\infty(g, f) = -\int_{\mathcal{X}} (f(x) - g(x))^2 d\mu x - \sigma^2_\varepsilon. \qquad (20.7) \]
where the true function $f(x)$ takes the place of the generic function $h^*$ in the presentation of the theorem. Both $g(x)$ and $f(x)$ are elements of $\overline{\cup_\infty \mathcal{H}_K}$.
The pointwise convergence of the objective function needs to be strengthened to uniform convergence. We will simply assume that this holds, since the way to verify this depends upon the specific application. We also have continuity of the objective function in $g$, with respect to the norm $\| h \|_{1,\mathcal{X}}$ since
\[ \lim_{\| g^1 - g^0 \|_{1,\mathcal{X}} \to 0} \left\{ s_\infty(g^1, f) - s_\infty(g^0, f) \right\} = \lim_{\| g^1 - g^0 \|_{1,\mathcal{X}} \to 0} -\int_{\mathcal{X}} \left[ \left( g^1(x) - f(x) \right)^2 - \left( g^0(x) - f(x) \right)^2 \right] d\mu x. \]
By the dominated convergence theorem (which applies since the finite bound $D$ used to define $\mathcal{W}_{2,\mathcal{Z}}(D)$ is dominated by an integrable function), the limit and the integral can be interchanged, so by inspection, the limit is zero.
Identification. The identification condition requires that for any point $(g, f)$ in $\mathcal{H} \times \mathcal{H}$, $s_\infty(g, f) \geq s_\infty(f, f) \Rightarrow \| g - f \|_{1,\mathcal{X}} = 0$. This condition is clearly satisfied given that $g$ and $f$ are once continuously differentiable (by the assumption that defines the estimation space).
Review of concepts. For the example of estimation of first-order elasticities, the relevant concepts are:
Estimation space $\mathcal{H} = \mathcal{W}_{2,\mathcal{X}}(D)$: the function space in the closure of which the true function must lie.
Consistency norm $\| h \|_{1,\mathcal{X}}$. The closure of $\mathcal{H}$ is compact with respect to this norm.
Estimation subspace $\mathcal{H}_K$. The estimation subspace is the subset of $\mathcal{H}$ that is representable by a Fourier form with parameter $\theta_K$. These are dense subsets of $\mathcal{H}$.
Sample objective function $s_n(\theta_K)$, the negative of the sum of squares. By standard arguments this converges uniformly to the
Limiting objective function $s_\infty(g, f)$, which is continuous in $g$ and has a global maximum in its first argument, over the closure of the infinite union of the estimation subspaces, at $g = f$.
As a result of this, first order elasticities
\[ \frac{x_i}{f(x)} \frac{\partial f(x)}{\partial x_i} \]
are consistently estimated for all $x \in \mathcal{X}$.
Discussion. Consistency requires that the number of parameters used in the expansion increase with the sample size, tending to infinity. If parameters are added at a high rate, the bias tends relatively rapidly to zero. A basic problem is that a high rate of inclusion of additional parameters causes the variance to tend more slowly to zero. The issue of how to choose the rate at which parameters are added and which to add first is fairly complex. A problem is that the allowable rates for asymptotic normality to obtain (Andrews 1991; Gallant and Souza, 1991) are very strict. Supposing we stick to these rates, our approximating model is:
\[ g_K(x \mid \theta_K) = z'\theta_K. \]
Define $Z_K$ as the $n \times K$ matrix of regressors obtained by stacking observations. The LS estimator is
\[ \hat\theta_K = (Z_K'Z_K)^{+} Z_K'y, \]
where $(\cdot)^{+}$ is the Moore-Penrose generalized inverse.
This is used since $Z_K'Z_K$ may be singular, as would be the case for $K(n)$ large enough when some dummy variables are included.
The prediction, $z'\hat\theta_K$, of the unknown function $f(x)$ is asymptotically normally distributed:
\[ \sqrt{n} \left( z'\hat\theta_K - f(x) \right) \stackrel{d}{\to} N(0, AV), \]
where
\[ AV = \lim_{n \to \infty} E\left[ z' \left( \frac{Z_K'Z_K}{n} \right)^{+} z \hat\sigma^2 \right]. \]
Formally, this is exactly the same as if we were dealing with a parametric linear model. I emphasize, though, that this is only valid if $K$ grows very slowly as $n$ grows. If we can't stick to acceptable rates, we should probably use some other method of approximating the small sample distribution. Bootstrapping is a possibility. We'll discuss this in the section on simulation.
Kernel regression estimators
Readings: Bierens, 1987, "Kernel estimators of regression functions," in Advances in Econometrics, Fifth World Congress, V. 1, Truman Bewley, ed., Cambridge.
An alternative method to the semi-nonparametric method is a fully nonparametric method of estimation. Kernel regression estimation is an example (others are splines, nearest neighbor, etc.). We'll consider the Nadaraya-Watson kernel regression estimator in a simple case.
Suppose we have an iid sample from the joint density $f(x, y)$, where $x$ is $k$-dimensional. The model is
\[ y_t = g(x_t) + \varepsilon_t, \]
where
\[ E(\varepsilon_t \mid x_t) = 0. \]
The conditional expectation of $y$ given $x$ is $g(x)$. By definition of the conditional expectation, we have
\[ g(x) = \int y \frac{f(x, y)}{h(x)} dy = \frac{1}{h(x)} \int y f(x, y) dy, \]
where $h(x)$ is the marginal density of $x$:
\[ h(x) = \int f(x, y) dy. \]
This suggests that we could estimate $g(x)$ by estimating $h(x)$ and $\int y f(x, y) dy$.
Estimation of the denominator
A kernel estimator for $h(x)$ has the form
\[ \hat h(x) = \frac{1}{n} \sum_{t=1}^{n} \frac{K\left[ (x - x_t)/\gamma_n \right]}{\gamma_n^k}, \]
where $n$ is the sample size and $k$ is the dimension of $x$.
The function $K(\cdot)$ (the kernel) is absolutely integrable:
\[ \int |K(x)| dx < \infty, \]
and $K(\cdot)$ integrates to 1:
\[ \int K(x) dx = 1. \]
In this respect, $K(\cdot)$ is like a density function, but we do not necessarily restrict $K(\cdot)$ to be nonnegative.
The window width parameter, $\gamma_n$, is a sequence of positive numbers that satisfies
\[ \lim_{n \to \infty} \gamma_n = 0 \]
\[ \lim_{n \to \infty} n\gamma_n^k = \infty \]
So, the window width must tend to zero, but not too quickly.
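As an illustration of the estimator $\hat h(x)$, here is a small sketch for the scalar case $k = 1$ with a standard normal kernel. The data generating process (standard normal draws), the seed and the window width rule $\gamma_n = n^{-1/5}$ are assumptions chosen for the example; that rule satisfies both limit conditions above when $k = 1$.

```python
import math
import random

def gaussian_kernel(u):
    # Standard normal density used as the kernel K(.).
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_density(x0, sample, gamma):
    # h_hat(x0) = (1/n) * sum_t K[(x0 - x_t) / gamma] / gamma^k, with k = 1.
    n = len(sample)
    return sum(gaussian_kernel((x0 - xt) / gamma) for xt in sample) / (n * gamma)

random.seed(0)
n = 20000
sample = [random.gauss(0.0, 1.0) for _ in range(n)]
gamma_n = n ** (-0.2)          # tends to zero, while n * gamma_n tends to infinity
h_hat = kernel_density(0.0, sample, gamma_n)
h_true = 1.0 / math.sqrt(2.0 * math.pi)   # true N(0,1) density at zero
print(h_hat, h_true)
```

The estimate at $x = 0$ should be close to the true density value, with a small downward bias from smoothing.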
To show pointwise consistency of $\hat h(x)$ for $h(x)$, first consider the expectation of the estimator (since the estimator is an average of iid terms we only need to consider the expectation of a representative term):
\[ E\left[ \hat h(x) \right] = \int \gamma_n^{-k} K\left[ (x - z)/\gamma_n \right] h(z) dz. \]
Change variables as $z^* = (x - z)/\gamma_n$, so $z = x - \gamma_n z^*$ and $\left| \frac{dz}{dz^{*\prime}} \right| = \gamma_n^k$, we obtain
\[ E\left[ \hat h(x) \right] = \int \gamma_n^{-k} K(z^*) h(x - \gamma_n z^*) \gamma_n^k dz^* = \int K(z^*) h(x - \gamma_n z^*) dz^*. \]
Now, asymptotically,
\[ \lim_{n \to \infty} E\left[ \hat h(x) \right] = \lim_{n \to \infty} \int K(z^*) h(x - \gamma_n z^*) dz^* = \int \lim_{n \to \infty} K(z^*) h(x - \gamma_n z^*) dz^* = \int K(z^*) h(x) dz^* = h(x) \int K(z^*) dz^* = h(x), \]
since $\gamma_n \to 0$ and $\int K(z^*) dz^* = 1$ by assumption. (Note: that we can pass the limit through the integral is a result of the dominated convergence theorem. For this to hold we need that $h(\cdot)$ be dominated by an absolutely integrable function.)
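Next, considering the variance of $\hat h(x)$, we have, due to the iid assumption
\[ n\gamma_n^k V\left[ \hat h(x) \right] = n\gamma_n^k \frac{1}{n^2} \sum_{t=1}^{n} V\left\{ \frac{K\left[ (x - x_t)/\gamma_n \right]}{\gamma_n^k} \right\} = \gamma_n^{-k} \frac{1}{n} \sum_{t=1}^{n} V\left\{ K\left[ (x - x_t)/\gamma_n \right] \right\} \]
By the representative term argument, this is
\[ n\gamma_n^k V\left[ \hat h(x) \right] = \gamma_n^{-k} V\left\{ K\left[ (x - z)/\gamma_n \right] \right\} \]
Also, since $V(x) = E(x^2) - E(x)^2$ we have
\[ n\gamma_n^k V\left[ \hat h(x) \right] = \gamma_n^{-k} E\left\{ \left( K\left[ (x - z)/\gamma_n \right] \right)^2 \right\} - \gamma_n^{-k} \left\{ E\left( K\left[ (x - z)/\gamma_n \right] \right) \right\}^2 \]
\[ = \int \gamma_n^{-k} K\left[ (x - z)/\gamma_n \right]^2 h(z) dz - \gamma_n^k \left\{ \int \gamma_n^{-k} K\left[ (x - z)/\gamma_n \right] h(z) dz \right\}^2 \]
\[ = \int \gamma_n^{-k} K\left[ (x - z)/\gamma_n \right]^2 h(z) dz - \gamma_n^k E\left[ \hat h(x) \right]^2 \]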
The second term converges to zero:
\[ \gamma_n^k E\left[ \hat h(x) \right]^2 \to 0, \]
by the previous result regarding the expectation and the fact that $\gamma_n \to 0$. Therefore,
\[ \lim_{n \to \infty} n\gamma_n^k V\left[ \hat h(x) \right] = \lim_{n \to \infty} \int \gamma_n^{-k} K\left[ (x - z)/\gamma_n \right]^2 h(z) dz. \]
Using exactly the same change of variables as before, this can be shown to be
\[ \lim_{n \to \infty} n\gamma_n^k V\left[ \hat h(x) \right] = h(x) \int \left[ K(z^*) \right]^2 dz^*. \]
Since both $\int \left[ K(z^*) \right]^2 dz^*$ and $h(x)$ are bounded, this is bounded, and since $n\gamma_n^k \to \infty$ by assumption, we have that
\[ V\left[ \hat h(x) \right] \to 0. \]
Since the bias and the variance both go to zero, we have pointwise consistency (convergence in quadratic mean implies convergence in probability).
Estimation of the numerator
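To estimate $\int y f(x, y) dy$, we need an estimator of $f(x, y)$. The estimator has the same form as the estimator for $h(x)$, only with one dimension more:
\[ \hat f(x, y) = \frac{1}{n} \sum_{t=1}^{n} \frac{K^*\left[ (y - y_t)/\gamma_n, (x - x_t)/\gamma_n \right]}{\gamma_n^{k+1}} \]
The kernel $K^*(\cdot)$ is required to have mean zero:
\[ \int y K^*(y, x) dy = 0 \]
and to marginalize to the previous kernel for $h(x)$:
\[ \int K^*(y, x) dy = K(x). \]
With this kernel, we have
\[ \int y \hat f(y, x) dy = \frac{1}{n} \sum_{t=1}^{n} y_t \frac{K\left[ (x - x_t)/\gamma_n \right]}{\gamma_n^k} \]
by marginalization of the kernel, so we obtain
\[ \hat g(x) = \frac{1}{\hat h(x)} \int y \hat f(y, x) dy = \frac{\frac{1}{n} \sum_{t=1}^{n} y_t \frac{K\left[ (x - x_t)/\gamma_n \right]}{\gamma_n^k}}{\frac{1}{n} \sum_{t=1}^{n} \frac{K\left[ (x - x_t)/\gamma_n \right]}{\gamma_n^k}} = \frac{\sum_{t=1}^{n} y_t K\left[ (x - x_t)/\gamma_n \right]}{\sum_{t=1}^{n} K\left[ (x - x_t)/\gamma_n \right]}. \]
This is the Nadaraya-Watson kernel regression estimator.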
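The final expression translates almost directly into code. The following sketch uses a scalar regressor and a standard normal kernel; the simulated model $g(x) = \sin(x)$, the noise level, the seed and the window width are hypothetical choices made for illustration.

```python
import math
import random

def gaussian_kernel(u):
    # Standard normal density used as the kernel K(.).
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nadaraya_watson(x0, x, y, gamma):
    # g_hat(x0) = sum_t y_t K[(x0 - x_t)/gamma] / sum_t K[(x0 - x_t)/gamma]
    weights = [gaussian_kernel((x0 - xt) / gamma) for xt in x]
    return sum(wt * yt for wt, yt in zip(weights, y)) / sum(weights)

random.seed(1)
n = 2000
x = [random.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
y = [math.sin(xt) + random.gauss(0.0, 0.3) for xt in x]   # assumed model: g(x) = sin(x)

g_hat = nadaraya_watson(math.pi / 2.0, x, y, gamma=0.2)
print(g_hat)   # should be near sin(pi/2) = 1, up to smoothing bias and sampling noise
```

Since the same kernel values appear in the numerator and denominator, the factors $1/n$ and $\gamma_n^{-k}$ cancel, and so does any normalizing constant of the kernel.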
Discussion
The kernel regression estimator for $g(x_t)$ is a weighted average of the $y_j$, $j = 1, 2, ..., n$, where higher weights are associated with points that are closer to $x_t$. The weights sum to 1.
The window width parameter $\gamma_n$ imposes smoothness. The estimator is increasingly flat as $\gamma_n \to \infty$, since in this case each weight tends to $1/n$.
A large window width reduces the variance (strong imposition of flatness), but increases the bias.
A small window width reduces the bias, but makes very little use of information except points that are in a small neighborhood of $x_t$. Since relatively little information is used, the variance is large when the window width is small.
The standard normal density is a popular choice for $K(\cdot)$ and $K^*(y, x)$, though there are possibly better alternatives.
Choice of the window width: Cross-validation
The selection of an appropriate window width is important. One popular method is cross validation. This consists of splitting the sample into two parts (e.g., 50%-50%). The first part is the "in sample" data, which is used for estimation, and the second part is the "out of sample" data, used for evaluation of the fit through RMSE or some other criterion. The steps are:
1. Split the data. The out of sample data is $y^{out}$ and $x^{out}$.
2. Choose a window width $\gamma$.
3. With the in sample data, fit $\hat y_t^{out}$ corresponding to each $x_t^{out}$. This fitted value is a function of the in sample data, as well as the evaluation point $x_t^{out}$, but it does not involve $y_t^{out}$.
4. Repeat for all out of sample points.
5. Calculate RMSE($\gamma$).
6. Go to step 2, or to the next step if enough window widths have been tried.
7. Select the $\gamma$ that minimizes RMSE($\gamma$) (Verify that a minimum has been found, for example by plotting RMSE as a function of $\gamma$).
8. Re-estimate using the best $\gamma$ and all of the data.
This same principle can be used to choose $A$ and $J$ in a Fourier form model.
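The eight steps can be sketched as follows for the Nadaraya-Watson estimator with a scalar regressor. Everything specific here is an assumption for illustration: the simulated data ($g(x) = \sin(x)$ plus noise), the 50%-50% split, the grid of candidate window widths and the function names.

```python
import math
import random

def nadaraya_watson(x0, x, y, gamma):
    # Gaussian kernel weights; the kernel's normalizing constant cancels in the ratio.
    weights = [math.exp(-0.5 * ((x0 - xt) / gamma) ** 2) for xt in x]
    return sum(wt * yt for wt, yt in zip(weights, y)) / sum(weights)

def rmse_out_of_sample(gamma, x_in, y_in, x_out, y_out):
    # Steps 3-5: fit each out of sample point using only in sample data, then compute RMSE.
    errors = [yo - nadaraya_watson(xo, x_in, y_in, gamma) for xo, yo in zip(x_out, y_out)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

random.seed(2)
n = 400
x = [random.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
y = [math.sin(xt) + random.gauss(0.0, 0.3) for xt in x]

# Step 1: split the data 50%-50% (the draws are already in random order).
x_in, y_in, x_out, y_out = x[:200], y[:200], x[200:], y[200:]

# Steps 2-6: loop over a grid of candidate window widths.
grid = [0.1, 0.2, 0.4, 0.8, 1.6, 5.0]
rmse = {g: rmse_out_of_sample(g, x_in, y_in, x_out, y_out) for g in grid}

# Step 7: select the window width with the smallest out of sample RMSE.
best_gamma = min(rmse, key=rmse.get)

# Step 8: re-estimate using the best window width and all of the data.
g_hat_at_pi = nadaraya_watson(math.pi, x, y, best_gamma)
print(best_gamma, rmse[best_gamma], g_hat_at_pi)
```

A coarse grid is used only to keep the example short; in practice one would inspect RMSE over a finer grid to confirm that an interior minimum has been found.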
20.4 Density function estimation
Kernel density estimation
The previous discussion suggests that a kernel density estimator may easily be constructed. We have already seen how joint densities may be estimated. If we are interested in a conditional density, for example of $y$ conditional on $x$, then the kernel estimate of the conditional density is simply

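$$\hat{f}_{y|x} = \frac{\hat{f}(x, y)}{\hat{h}(x)} = \frac{\frac{1}{n}\sum_{t=1}^{n} \frac{K_*\left[(y - y_t)/\gamma_n, (x - x_t)/\gamma_n\right]}{\gamma_n^{k+1}}}{\frac{1}{n}\sum_{t=1}^{n} \frac{K\left[(x - x_t)/\gamma_n\right]}{\gamma_n^{k}}} = \frac{1}{\gamma_n}\,\frac{\sum_{t=1}^{n} K_*\left[(y - y_t)/\gamma_n, (x - x_t)/\gamma_n\right]}{\sum_{t=1}^{n} K\left[(x - x_t)/\gamma_n\right]}$$
where we obtain the expressions for the joint and marginal densities from the section on kernel regression.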
Semi-nonparametric maximum likelihood
Readings: Gallant and Nychka, Econometrica, 1987. For a Fortran program to do this and a useful discussion in the user's guide, see this link. See also Cameron and Johansson, Journal of Applied Econometrics, V. 12, 1997.
MLE is the estimation method of choice when we are confident about specifying the density. Is it possible to obtain the benefits of MLE when we're not so confident about the specification? In part, yes.
Suppose we're interested in the density of $y$ conditional on $x$ (both may be vectors). Suppose that the density $f(y|x, \phi)$ is a reasonable starting approximation to the true density. This density can be reshaped by multiplying it by a squared polynomial. The new density is
$$g_p(y|x, \phi, \gamma) = \frac{h_p^2(y|\gamma) f(y|x, \phi)}{\eta_p(x, \phi, \gamma)}$$
where
$$h_p(y|\gamma) = \sum_{k=0}^{p} \gamma_k y^k$$
and $\eta_p(x, \phi, \gamma)$ is a normalizing factor to make the density integrate (sum) to one. Because $h_p^2(y|\gamma)/\eta_p(x, \phi, \gamma)$ is a homogeneous function of $\gamma$ it is necessary to impose a normalization: $\gamma_0$ is set to 1. The normalization factor $\eta_p(\phi, \gamma)$ is calculated (following Cameron and Johansson) using
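$$\begin{aligned}
E(Y^r) &= \sum_{y=0}^{\infty} y^r f_Y(y|\phi, \gamma) \\
&= \sum_{y=0}^{\infty} y^r \frac{[h_p(y|\gamma)]^2}{\eta_p(\phi, \gamma)} f_Y(y|\phi) \\
&= \sum_{y=0}^{\infty} \sum_{k=0}^{p} \sum_{l=0}^{p} y^r f_Y(y|\phi)\, \gamma_k \gamma_l\, y^k y^l / \eta_p(\phi, \gamma) \\
&= \sum_{k=0}^{p} \sum_{l=0}^{p} \gamma_k \gamma_l \left\{ \sum_{y=0}^{\infty} y^{r+k+l} f_Y(y|\phi) \right\} / \eta_p(\phi, \gamma) \\
&= \sum_{k=0}^{p} \sum_{l=0}^{p} \gamma_k \gamma_l\, m_{k+l+r} / \eta_p(\phi, \gamma).
\end{aligned}$$
By setting $r = 0$ we get that the normalizing factor is
$$\eta_p(\phi, \gamma) = \sum_{k=0}^{p} \sum_{l=0}^{p} \gamma_k \gamma_l\, m_{k+l} \qquad (20.8)$$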
Recall that $\gamma_0$ is set to 1 to achieve identification. The $m_r$ in equation 20.8 are the raw moments of the baseline density. Gallant and Nychka (1987) give conditions under which such a density may be treated as correctly specified, asymptotically. Basically, the order of the polynomial must increase as the sample size increases. However, there are technicalities.
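To make equation 20.8 concrete, here is a small Python sketch that reshapes a baseline count density with a squared polynomial and checks that the result sums to one. It is an illustration under assumptions: a Poisson baseline is used as a simple stand-in for the baseline density, the raw moments are computed by truncated summation rather than analytically, and the polynomial coefficients are arbitrary.

```python
import math

def poisson_pmf(y, lam):
    """Baseline density f_Y(y|phi); a Poisson is assumed here purely for illustration."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def raw_moment(r, lam, y_max=200):
    """m_r = sum_y y^r f_Y(y|phi), truncated at y_max (assumed large enough)."""
    return sum((y ** r) * poisson_pmf(y, lam) for y in range(y_max + 1))

def eta(gamma, lam):
    """Normalizing factor of equation 20.8: sum_k sum_l gamma_k gamma_l m_{k+l}."""
    p = len(gamma) - 1
    m = [raw_moment(r, lam) for r in range(2 * p + 1)]
    return sum(gamma[k] * gamma[l] * m[k + l]
               for k in range(p + 1) for l in range(p + 1))

def snp_pmf(y, gamma, lam):
    """Reshaped density: squared polynomial times baseline, divided by eta."""
    h = sum(g * y ** k for k, g in enumerate(gamma))
    return h * h * poisson_pmf(y, lam) / eta(gamma, lam)

lam = 2.0
gamma = [1.0, 0.5, -0.1]   # gamma_0 = 1 is the normalization; the rest are arbitrary

norm_factor = eta(gamma, lam)
total = sum(snp_pmf(y, gamma, lam) for y in range(201))   # should be very close to 1
```

A second order polynomial ($p = 2$) needs the raw moments $m_0$ through $m_4$, which is why `eta` computes `2 * p + 1` moments.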
Similarly to Cameron and Johansson (1997), we may develop a negative binomial polynomial (NBP) density for count data. The negative binomial baseline density may be written (see equation ??) as
$$f_Y(y|\phi) = \frac{\Gamma(y + \psi)}{\Gamma(y + 1)\Gamma(\psi)} \left(\frac{\psi}{\psi + \lambda}\right)^{\psi} \left(\frac{\lambda}{\psi + \lambda}\right)^{y}$$
where $\phi = \{\lambda, \psi\}$, $\lambda > 0$ and $\psi > 0$. The usual means of incorporating conditioning variables $x$ is the parameterization $\lambda = e^{x'\beta}$. When $\psi = \lambda/\alpha$ we have the negative binomial-I model (NB-I). When $\psi = 1/\alpha$ we have the negative binomial-II (NB-II) model. For the NB-I density, $V(Y) = \lambda + \alpha\lambda$. In the case of the NB-II model, we have $V(Y) = \lambda + \alpha\lambda^2$. For both forms, $E(Y) = \lambda$.
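The reshaped density, with normalization to sum to one, is
$$f_Y(y|\phi, \gamma) = \frac{[h_p(y|\gamma)]^2}{\eta_p(\phi, \gamma)}\, \frac{\Gamma(y + \psi)}{\Gamma(y + 1)\Gamma(\psi)} \left(\frac{\psi}{\psi + \lambda}\right)^{\psi} \left(\frac{\lambda}{\psi + \lambda}\right)^{y}. \qquad (20.9)$$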
To get the normalization factor, we need the moment generating function:
$$M_Y(t) = \psi^{\psi}\left(\lambda - e^{t}\lambda + \psi\right)^{-\psi}. \qquad (20.10)$$
To illustrate, Figure 20.5 shows calculation of the first four raw moments of the NB density, calculated using MuPAD, which is a Computer Algebra System that (used to be?) free for personal use. These are the moments you would need to use a second order polynomial ($p = 2$). MuPAD will output these results in the form of C code, which is relatively easy to edit to write the likelihood function for the model. This has been done in NegBinSNP.cc, which is a C++ version of this model that can be compiled to use with octave using the mkoctfile command. Note the impressive length of the expressions when the degree of the expansion is 4 or 5! This is an example of a model that would be difficult to formulate without the help of a program like MuPAD.
Figure 20.5: Negative binomial raw moments
It is possible that there is conditional heterogeneity such that the appropriate reshaping should be more local. This can be accommodated by allowing the $\gamma_k$ parameters to depend upon the conditioning variables, for example using polynomials.
Gallant and Nychka, Econometrica, 1987 prove that this sort of density can approximate a wide variety of densities arbitrarily well as the degree of the polynomial increases with the sample size. This approach is not without its drawbacks: the sample objective function can have an extremely large number of local maxima that can lead to numeric difficulties. If someone could figure out how to do this in a way such that the sample objective function was nice and smooth, they would probably get the paper published in a good journal. Any ideas?
Here's a plot of true and the limiting SNP approximations (with the order of the polynomial fixed) to four different count data densities, which variously exhibit over and underdispersion, as well as excess zeros. The baseline model is a negative binomial density.
[Figure: four panels labeled Case 1, Case 2, Case 3 and Case 4]
20.5 Examples
MEPS health care usage data
We'll use the MEPS OBDV data to illustrate kernel regression and semi-nonparametric maximum likelihood.
Kernel regression estimation
Let's try a kernel regression fit for the OBDV data. The program OBDVkernel.m loads the MEPS OBDV data, scans over a range of window widths and calculates leave-one-out CV scores, and plots the fitted OBDV usage versus AGE, using the best window width. The plot is in Figure 20.6. Note that usage increases with age, just as we've seen with the parametric models. One could use bootstrapping to generate a confidence interval for the fit.
Figure 20.6: Kernel fitted OBDV usage versus AGE
Seminonparametric ML estimation and the MEPS data
Now let's estimate a seminonparametric density for the OBDV data. We'll reshape a negative binomial density, as discussed above. The program EstimateNBSNP.m loads the MEPS OBDV data and estimates the model, using a NB-I baseline density and a 2nd order polynomial expansion. The output is:
OBDV
======================================================
BFGSMIN final results
Used numeric gradient
------------------------------------------------------
STRONG CONVERGENCE
Function conv 1 Param conv 1 Gradient conv 1
------------------------------------------------------
Objective function value 2.17061
Stepsize 0.0065
24 iterations
------------------------------------------------------
param gradient change
1.3826 0.0000 -0.0000
0.2317 -0.0000 0.0000
0.1839 0.0000 0.0000
0.2214 0.0000 -0.0000
0.1898 0.0000 -0.0000
0.0722 0.0000 -0.0000
-0.0002 0.0000 -0.0000
1.7853 -0.0000 -0.0000
-0.4358 0.0000 -0.0000
0.1129 0.0000 0.0000
******************************************************
NegBin SNP model, MEPS full data set
MLE Estimation Results
BFGS convergence: Normal convergence
Average Log-L: -2.170614
Observations: 4564
estimate st. err t-stat p-value
constant -0.147 0.126 -1.173 0.241
pub. ins. 0.695 0.050 13.936 0.000
priv. ins. 0.409 0.046 8.833 0.000
sex 0.443 0.034 13.148 0.000
age 0.016 0.001 11.880 0.000
edu 0.025 0.006 3.903 0.000
inc -0.000 0.000 -0.011 0.991
gam1 1.785 0.141 12.629 0.000
gam2 -0.436 0.029 -14.786 0.000
lnalpha 0.113 0.027 4.166 0.000
Information Criteria
CAIC : 19907.6244 Avg. CAIC: 4.3619
BIC : 19897.6244 Avg. BIC: 4.3597
AIC : 19833.3649 Avg. AIC: 4.3456
******************************************************
Note that the CAIC and BIC are lower for this model than for the models presented in Table 18.3. This model fits well, while still being parsimonious. You can play around trying other use measures, using a NB-II baseline density, and using other orders of expansions. Density functions formed in this way may have MANY local maxima, so you need to be careful before accepting the results of a casual run. To guard against having converged to a local maximum, one can try using multiple starting values, or one could try simulated annealing as an optimization method. If you uncomment the relevant lines in the program, you can use SA to do the minimization. This will take a lot of time, compared to the default BFGS minimization. The chapter on parallel computations might be interesting to read before trying this.
Financial data and volatility
The data set rates contains the growth rate (100 × log difference) of the daily spot $/euro and $/yen exchange rates at New York, noon, from January 04, 1999 to February 12, 2008. There are 2291 observations. See the README file for details. Figures ?? and ?? show the data and their histograms.
Figure 20.7: Dollar-Euro
Figure 20.8: Dollar-Yen
At the center of the histograms, the bars extend above the normal density that best fits the data, and the tails are fatter than those of the best fit normal density. This feature of the data is known as leptokurtosis.
In the series plots, we can see that the variance of the growth rates is not constant over time. Volatility clusters are apparent, alternating between periods of stability and periods of more wild swings. This is known as conditional heteroscedasticity. ARCH and GARCH are well-known models that are often applied to this sort of data.
Many structural economic models cannot generate data that exhibits conditional heteroscedasticity without directly assuming shocks that are conditionally heteroscedastic. It would be nice to have an economic explanation for how conditional heteroscedasticity, leptokurtosis, and other (leverage, etc.) features of financial data result from the behavior of economic agents, rather than from a black box that provides shocks.
The Octave script kernelfit.m performs kernel regression to fit $E(y_t^2 | y_{t-1}^2, y_{t-2}^2)$, and generates the plots in Figure 20.9.
From the point of view of learning the practical aspects of kernel regression, note how the data is compactified in the example script.
In the Figure, note how current volatility depends on lags of the squared return rate: it is high when both of the lags are high, but drops off quickly when either of the lags is low.
The fact that the plots are not flat suggests that this conditional moment contains information about the process that generates the data. Perhaps attempting to match this moment might be a means of estimating the parameters of the dgp. We'll come back to this later.
Figure 20.9: Kernel regression fitted conditional second moments, Yen/Dollar and Euro/Dollar. (a) Yen/Dollar (b) Euro/Dollar
20.6 Exercises
1. In Octave, type edit kernel_example.
(a) Look this script over, and describe in words what it does.
(b) Run the script and interpret the output.
(c) Experiment with different bandwidths, and comment on the effects of choosing small and large values.
2. In Octave, type help kernel_regression.
(a) How can a kernel fit be done without supplying a bandwidth?
(b) How is the bandwidth chosen if a value is not provided?
(c) What is the default kernel used?
3. Using the Octave script OBDVkernel.m as a model, plot kernel regression fits for OBDV visits as a function of income and education.
Chapter 21
Quantile regression
References: Cameron and Trivedi, Chapter 4, and Chernozhukov's MIT OpenCourseWare notes, lecture 8 (Chernozhukov's quantile reg notes).
This chapter gives a brief outline of quantile regression. The intention is to learn what quantile
regression is, and its potential uses, but without going into the topic in depth.
The classical linear regression model $y_t = x_t'\beta + \epsilon_t$ with normal errors implies that the distribution of $y_t$ conditional on $x_t$ is
$$y_t \sim N(x_t'\beta, \sigma^2)$$
The $\tau$ conditional quantile of $Y$, conditional on $X = x$ (notation: $Y_{\tau|X=x}$) is the smallest value $z$ such that $\Pr(Y \le z|X = x) = \tau$. If $F_{Y|X=x}$ is the conditional CDF of $Y$, then the $\tau$-conditional quantile is $Y_{\tau|X=x} = \inf\{y : \tau \le F_{Y|X=x}(y)\}$. When $\tau = 0.5$, we are talking about the conditional median $Y_{0.5|X=x}$, but we could be interested in other quantiles, too.
Note that $\Pr(Y < x'\beta|X = x) = 0.5$ when the model follows the classical assumptions with normal errors, because the normal distribution is symmetric about the mean, so $Y_{0.5|X=x} = x'\beta$. One can estimate the conditional median just by using the fitted conditional mean, because the mean and median are the same given normality.
How about other quantiles? We have $y = x'\beta + \epsilon$ and $\epsilon \sim N(0, \sigma^2)$. Conditional on $x$, $x'\beta$ is given, and the distribution of $\epsilon$ does not depend on $x$. Note that $\epsilon/\sigma$ is standard normal, and the $\tau$ quantile of $\epsilon/\sigma$ is simply the inverse of the standard normal CDF evaluated at $\tau$, $\Phi^{-1}(\tau)$, where $\Phi$ is the standard normal CDF function. The probit function $\Phi^{-1}(\tau)$ is tabulated (or can be found in Octave using the norminv function). It is plotted in Figure 21.1.
The $\tau$ quantile of $\epsilon$ is $\sigma\Phi^{-1}(\tau)$. Thus, the $\tau$ conditional quantile of $y$ is $Y_{\tau|X=x} = x'\beta + \sigma\Phi^{-1}(\tau)$.
Draw figure here showing conditional quantiles for classical model.
Note that the conditional quantiles for the classical model are linear functions of $x$, and all have the same slope: the only thing that changes with $\tau$ is the intercept $\sigma\Phi^{-1}(\tau)$. If the error is heteroscedastic, quantiles can have different slopes. Draw a picture.
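The formula $x'\beta + \sigma\Phi^{-1}(\tau)$ is easy to evaluate numerically. The sketch below uses Python's standard library (`statistics.NormalDist().inv_cdf` plays the role that norminv plays in Octave); the parameter values are hypothetical and chosen only to show that the slope with respect to $x$ does not depend on $\tau$.

```python
from statistics import NormalDist

def classical_quantile(x, beta, sigma, tau):
    """tau-conditional quantile of y in the classical model: x'beta + sigma * Phi^{-1}(tau)."""
    xb = sum(xi * bi for xi, bi in zip(x, beta))
    return xb + sigma * NormalDist().inv_cdf(tau)

beta = [1.0, 2.0]    # hypothetical intercept and slope
sigma = 1.5          # hypothetical error standard deviation
taus = [0.1, 0.5, 0.9]

# Quantiles at x = 0 and at x = 1 (the first regressor is the constant).
q_at_0 = [classical_quantile([1.0, 0.0], beta, sigma, t) for t in taus]
q_at_1 = [classical_quantile([1.0, 1.0], beta, sigma, t) for t in taus]

# The slope with respect to x is the same for every tau; only the intercept moves.
slopes = [b - a for a, b in zip(q_at_0, q_at_1)]
```

Each entry of `slopes` equals the slope coefficient (2 in this example), while `q_at_0` increases with $\tau$ because the intercept $\sigma\Phi^{-1}(\tau)$ does.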
To compute conditional quantiles for the classical linear model, we used the assumption of normality. Can we estimate conditional quantiles without making distributional assumptions? Yes, we can! (nod to Obama). You can do fully nonparametric conditional density estimation, and use the fitted conditional density to compute quantiles. Note that estimating quantiles where $\tau$ is close to 0 or 1 is difficult, because you have few observations that lie in the neighborhood of the quantile, so you should expect a large variance if you go the nonparametric route. For this reason, we may go the semi-parametric route, which imposes more structure. When people talk about quantile regression, they usually mean the semi-parametric approach.
The assumption is that the $\tau$-conditional quantile of the dependent variable $Y$ is a linear function of the conditioning variables $X$: $Y_{\tau|X=x} = x'\beta_\tau$.
Figure 21.1: Inverse CDF for N(0,1)
This is a generalization of what we get from the classical model with normality, where the slopes of the quantiles with respect to the regressors are constant for all $\tau$.
For the classical model with normality, $\frac{\partial}{\partial x} Y_{\tau|X=x} = \beta$.
With the assumption of linear quantiles without distributional assumptions, $\frac{\partial}{\partial x} Y_{\tau|X=x} = \beta_\tau$, so the slopes are allowed to change with $\tau$.
This is a step in the direction of flexibility, but it also means we need to estimate many parameters if we're interested in many quantiles: there may be an efficiency loss due to using many parameters to avoid distributional assumptions.
The question is how to estimate $\beta_\tau$ when we don't make distributional assumptions.
It turns out that the problem can be expressed as an extremum estimator, $\hat{\beta}_\tau = \arg\min s_n(\beta)$ where
$$s_n(\beta) = \sum_{i=1}^{n} \left[1(y_i \ge x_i'\beta)\,\tau + 1(y_i < x_i'\beta)(1 - \tau)\right] |y_i - x_i'\beta|$$
First, suppose that $\tau = 0.5$, so we are estimating the median. Then every weight equals 0.5, and minimizing the objective is equivalent to minimizing the sum of absolute deviations:
$$\sum_{i=1}^{n} |y_i - x_i'\beta|$$
The presence of the weights in the general version accounts for the fact that if we're estimating the $\tau = 0.1$ quantile, we expect 90% of the $y_i$ to be greater than $x_i'\beta_\tau$, and only 10% to be smaller. We need to downweight the likely events and upweight the unlikely events so that the objective function minimizes at the appropriate place.
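A quick way to see that the weights put the minimum in the right place is to apply the objective to a model with only a constant, in which case the minimizer should be the sample $\tau$ quantile. The following Python sketch does this by brute force over the data points; the data and function names are assumptions made for the illustration, and a real application would use linear programming rather than a search.

```python
def check_loss(b, y, tau):
    """Quantile regression objective s_n for a constant-only model with constant b."""
    total = 0.0
    for yi in y:
        weight = tau if yi >= b else (1.0 - tau)
        total += weight * abs(yi - b)
    return total

# Hypothetical data: the integers 1, 2, ..., 99.
y = list(range(1, 100))

def constant_quantile_fit(tau):
    """Brute-force minimizer of the objective, searching over the observed values."""
    return min(y, key=lambda b: check_loss(b, y, tau))

q10 = constant_quantile_fit(0.1)
q50 = constant_quantile_fit(0.5)
q90 = constant_quantile_fit(0.9)
```

For these data the minimizers are the 10th, 50th and 90th of the 99 ordered values, so the asymmetric weights do move the minimum to the intended quantile.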
One note is that median regression may be a useful means of dealing with data that satisfies the classical assumptions, except for contamination by outliers. In class, use Gretl to show this.
Note that the quantile regression objective function is continuous but not differentiable everywhere: its gradient is discontinuous. Minimization can be done quickly using linear programming. Gradient-based methods such as BFGS won't work reliably.
The asymptotic distribution is normal, with the sandwich form typical of extremum estimators. Estimation of the terms is not completely straightforward, so methods like bootstrapping may be preferable.
The asymptotic variance depends upon which quantile we're estimating. When $\tau$ is close to 0 or 1, the asymptotic variance becomes large, and the asymptotic approximation is unreliable for the small sample distribution. Extreme quantiles are hard to estimate with precision, because the data is sparse in those regions.
The artificial data set quantile.gdt allows you to explore quantile regression with GRETL, and to see how median regression can help to deal with data contamination.
If you do quantile regression of the variable y versus x, we are in a situation where the assumptions of the classical model hold. Quantiles all have approximately the same slope (the true value is 1).
With heteroscedastic data, the quantiles have different slopes.
See Figure 21.2.
Figure 21.2: Quantile regression results. (a) homoscedastic data (b) heteroscedastic data
Chapter 22
Simulation-based methods for
estimation and inference
Readings: Gourieroux and Monfort (1996) Simulation-Based Econometric Methods (Oxford University Press). There are many articles. Some of the seminal papers are Gallant and Tauchen (1996), Which Moments to Match?, ECONOMETRIC THEORY, Vol. 12, 1996, pages 657-681; Gourieroux, Monfort and Renault (1993), Indirect Inference, J. Apl. Econometrics; Pakes and Pollard (1989) Econometrica; McFadden (1989) Econometrica.
Simulation-based methods use computer power as a major input to do econometrics. Of course, computer power has always been used, but when intensive use of computer power is contemplated, it is possible to do things that are otherwise infeasible. Examples include obtaining more accurate results than what asymptotic theory gives us, using methods like bootstrapping, or performing estimation using simulation, when analytic expressions for objective functions that define estimators are not available.
22.1 Motivation
Simulation methods are of interest when the DGP is fully characterized by a parameter vector, so that simulated data can be generated, but the likelihood function and moments of the observable variables are not calculable, so that MLE or GMM estimation is not possible. Many moderately complex models result in intractable likelihoods or moments, as we will see. Simulation-based estimation methods open up the possibility to estimate truly complex models. The desirability of introducing a great deal of complexity may be an issue (remember that a model is an abstraction from reality, and abstraction helps us to isolate the important features of a phenomenon), but at least it becomes a possibility.
Example: Multinomial and/or dynamic discrete response models
(following McFadden, 1989)
Let $y_i^*$ be a latent random vector of dimension $m$. Suppose that
$$y_i^* = X_i\beta + \epsilon_i$$
where $X_i$ is $m \times K$. Suppose that
$$\epsilon_i \sim N(0, \Omega) \qquad (22.1)$$
Henceforth drop the $i$ subscript when it is not needed for clarity.
$y^*$ is not observed. Rather, we observe a many-to-one mapping
$$y = \tau(y^*)$$
This mapping is such that each element of $y$ is either zero or one (in some cases only one element will be one).
Define
$$A_i = A(y_i) = \{y^* | y_i = \tau(y^*)\}$$
Suppose random sampling of $(y_i, X_i)$. In this case the elements of $y_i$ may not be independent of one another (and clearly are not if $\Omega$ is not diagonal). However, $y_i$ is independent of $y_j$, $i \ne j$.
Let $\theta = (\beta', (vec^*\Omega)')'$ be the vector of parameters of the model. The contribution of the $i$-th observation to the likelihood function is
$$p_i(\theta) = \int_{A_i} n(y_i^* - X_i\beta, \Omega)\, dy_i^*$$
where
$$n(\epsilon, \Omega) = (2\pi)^{-M/2} |\Omega|^{-1/2} \exp\left[\frac{-\epsilon'\Omega^{-1}\epsilon}{2}\right]$$
is the multivariate normal density of an $M$-dimensional random vector. The log-likelihood function is
$$\ln L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ln p_i(\theta)$$
and the MLE $\hat{\theta}$ solves the score equations
$$\frac{1}{n}\sum_{i=1}^{n} g_i(\hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \frac{D_\theta p_i(\hat{\theta})}{p_i(\hat{\theta})} \equiv 0.$$
The problem is that evaluation of $L_i(\theta)$ and its derivative w.r.t. $\theta$ by standard methods of numeric integration such as quadrature is computationally infeasible when $m$ (the dimension of $y$) is higher than 3 or 4 (as long as there are no restrictions on $\Omega$).
The mapping $\tau(y^*)$ has not been made specific so far. This setup is quite general: for different choices of $\tau(y^*)$ it nests the case of dynamic binary discrete choice models as well as the case of multinomial discrete choice (the choice of one out of a finite set of alternatives).
Multinomial discrete choice is illustrated by a (very simple) job search model. We have cross sectional data on individuals matching to a set of $m$ jobs that are available (one of which is unemployment). The utility of alternative $j$ is
$$u_j = X_j\beta + \epsilon_j$$
Utilities of jobs, stacked in the vector $u_i$, are not observed. Rather, we observe the vector formed of elements
$$y_j = 1\left[u_j > u_k, \forall k \in m, k \ne j\right]$$
Only one of these elements is different than zero.
Dynamic discrete choice is illustrated by repeated choices over time between two alternatives. Let alternative $j$ have utility
$$u_{jt} = W_{jt}\beta + \epsilon_{jt}, \quad j \in \{1, 2\}, \quad t \in \{1, 2, ..., m\}$$
Then
$$y^* = u_2 - u_1 = (W_2 - W_1)\beta + \epsilon_2 - \epsilon_1 \equiv X\beta + \epsilon$$
Now the mapping is (element-by-element)
$$y = 1\left[y^* > 0\right],$$
that is $y_{it} = 1$ if individual $i$ chooses the second alternative in period $t$, zero otherwise.
Example: Marginalization of latent variables
Economic data often presents substantial heterogeneity that may be difficult to model. A possibility is to introduce latent random variables. This can cause the problem that there may be no known closed form for the distribution of observable variables after marginalizing out the unobservable latent variables. For example, count data (that takes values 0, 1, 2, 3, ...) is often modeled using the Poisson distribution
$$\Pr(y = i) = \frac{\exp(-\lambda)\lambda^i}{i!}$$
The mean and variance of the Poisson distribution are both equal to $\lambda$:
$$E(y) = V(y) = \lambda.$$
Often, one parameterizes the conditional mean as
$$\lambda_i = \exp(X_i\beta).$$
This ensures that the mean is positive (as it must be). Estimation by ML is straightforward.
Often, count data exhibits overdispersion which simply means that
$$V(y) > E(y).$$
If this is the case, a solution is to use the negative binomial distribution rather than the Poisson. An alternative is to introduce a latent variable that reflects heterogeneity into the specification:
$$\lambda_i = \exp(X_i\beta + \eta_i)$$
where $\eta_i$ has some specified density with support $S$ (this density may depend on additional parameters). Let $d\mu(\eta_i)$ be the density of $\eta_i$. In some cases, the marginal density of $y$
$$\Pr(y = y_i) = \int_S \frac{\exp\left[-\exp(X_i\beta + \eta_i)\right] \left[\exp(X_i\beta + \eta_i)\right]^{y_i}}{y_i!}\, d\mu(\eta_i)$$
will have a closed-form solution (one can derive the negative binomial distribution in this way if $\exp(\eta)$ has a gamma distribution - see equation 18.1), but often this will not be possible. In this case, simulation is a means of calculating $\Pr(y = i)$, which is then used to do ML estimation. This would be an example of Simulated Maximum Likelihood (SML) estimation.
In this case, since there is only one latent variable, quadrature is probably a better choice. However, a more flexible model with heterogeneity would allow all parameters (not just the constant) to vary. For example
$$\Pr(y = y_i) = \int_S \frac{\exp\left[-\exp(X_i\beta_i)\right] \left[\exp(X_i\beta_i)\right]^{y_i}}{y_i!}\, d\mu(\beta_i)$$
entails a $K = \dim \beta_i$-dimensional integral, which will not be evaluable by quadrature when $K$ gets large.
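To illustrate how simulation replaces the integral, the sketch below approximates the marginal probabilities of the Poisson model with a single latent heterogeneity term by averaging the Poisson probability over draws of $\eta$. It is a minimal example under assumptions: $\eta$ is taken to be normal with a chosen standard deviation, and the index value and number of draws are arbitrary.

```python
import math
import random

def poisson_pmf(k, lam):
    """Poisson probability Pr(y = k) with mean lam."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def simulated_prob(k, xb, sd_eta, draws):
    """Average of the Poisson pmf over fixed draws of the latent variable eta."""
    return sum(poisson_pmf(k, math.exp(xb + sd_eta * e)) for e in draws) / len(draws)

rng = random.Random(1)
H = 2000
draws = [rng.gauss(0.0, 1.0) for _ in range(H)]   # standard normal draws, held fixed

xb = 0.5        # hypothetical value of the index X_i * beta
sd_eta = 0.7    # hypothetical standard deviation of eta (normality is an assumption)

probs = [simulated_prob(k, xb, sd_eta, draws) for k in range(60)]

# Mean and variance implied by the simulated marginal distribution.
mean = sum(k * p for k, p in enumerate(probs))
var = sum(k * k * p for k, p in enumerate(probs)) - mean ** 2
```

The simulated marginal distribution has variance greater than its mean, which is the overdispersion that the latent variable was introduced to capture.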
Estimation of models specified in terms of stochastic differential equations
It is often convenient to formulate models in terms of continuous time using differential equations. A realistic model should account for exogenous shocks to the system, which can be done by assuming a random component. This leads to a model that is expressed as a system of stochastic differential equations. Consider the process
$$dy_t = g(\theta, y_t)dt + h(\theta, y_t)dW_t$$
which is assumed to be stationary. $W_t$ is a standard Brownian motion (Wiener process), such that
$$W(T) = \int_0^T dW_t \sim N(0, T)$$
Brownian motion is a continuous-time stochastic process such that
$W(0) = 0$
$[W(s) - W(t)] \sim N(0, s - t)$
$[W(s) - W(t)]$ and $[W(j) - W(k)]$ are independent for $s > t > j > k$. That is, non-overlapping segments are independent.
One can think of Brownian motion as the accumulation of independent normally distributed shocks with infinitesimal variance.
The function $g(\theta, y_t)$ is the deterministic part.
$h(\theta, y_t)$ determines the variance of the shocks.
To estimate a model of this sort, we typically have data that are assumed to be observations of $y_t$ in discrete points $y_1, y_2, ..., y_T$. That is, though $y_t$ is a continuous process it is observed in discrete time.
To perform inference on $\theta$, direct ML or GMM estimation is not usually feasible, because one cannot, in general, deduce the transition density $f(y_t | y_{t-1}, \theta)$. This density is necessary to evaluate the likelihood function or to evaluate moment conditions (which are based upon expectations with respect to this density).
A typical solution is to discretize the model, by which we mean to find a discrete time approximation to the model. The discretized version of the model is
$$y_t - y_{t-1} = g(\phi, y_{t-1}) + h(\phi, y_{t-1})\epsilon_t$$
$$\epsilon_t \sim N(0, 1)$$
The discretization induces a new parameter, $\phi$ (that is, the $\phi^0$ which defines the best approximation of the discretization to the actual (unknown) discrete time version of the model is not equal to $\theta^0$ which is the true parameter value). This is an approximation, and as such ML estimation of $\phi$ (which is actually quasi-maximum likelihood, QML) based upon this equation is in general biased and inconsistent for the original parameter, $\theta$. Nevertheless, the approximation shouldn't be too bad, especially if the time interval between $t$ and $t - 1$ is small. Simulation-based inference using the discrete time approximation as an auxiliary model will allow inference on $\theta$, rather than on $\phi$.
The important point about these three examples is that computational difficulties prevent direct application of ML, GMM, etc. Nevertheless the model is fully specified in probabilistic terms up to a parameter vector. This means that the model is simulable, conditional on the parameter vector.
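The statement that such a model is simulable can be made concrete with the stochastic differential equation example. The Python sketch below simulates a discretized path; with one step per observation it is exactly the unit-interval discretization written above, and with more steps per observation it is the usual Euler scheme with step $\Delta$, in which the drift is scaled by $\Delta$ and the shock by $\sqrt{\Delta}$. The drift and diffusion functions, the parameter values, and the function names are hypothetical choices for illustration.

```python
import math
import random

def simulate_discretized(g, h, y0, n_obs, steps_per_obs, rng):
    """Simulate y at n_obs unit-spaced dates using steps_per_obs Euler steps per date."""
    dt = 1.0 / steps_per_obs
    y = y0
    path = []
    for _ in range(n_obs):
        for _ in range(steps_per_obs):
            y = y + g(y) * dt + h(y) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(y)   # only the value at the observation date is recorded
    return path

# Hypothetical mean-reverting example: g pulls y toward mu, h is a constant.
kappa, mu, sigma = 0.5, 1.0, 0.2

def drift(y):
    return kappa * (mu - y)

def diffusion(y):
    return sigma

# Coarse simulation: one step per observation, as in the discretized model above.
coarse = simulate_discretized(drift, diffusion, 0.0, 100, 1, random.Random(3))

# Finer simulation: ten Euler steps between observation dates.
fine = simulate_discretized(drift, diffusion, 0.0, 100, 10, random.Random(3))
```

Because the path depends on the parameters only through `drift` and `diffusion`, a simulation-based estimator can regenerate data at any trial parameter value, provided the random number seed is held fixed across trial values.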
22.2 Simulated maximum likelihood (SML)
For simplicity, consider cross-sectional data. An ML estimator solves
$$\hat{\theta}_{ML} = \arg\max s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n} \ln p(y_t | X_t, \theta)$$
where $p(y_t | X_t, \theta)$ is the density function of the $t$-th observation. When $p(y_t | X_t, \theta)$ does not have a known closed form, $\hat{\theta}_{ML}$ is an infeasible estimator. However, it may be possible to define a random function such that
$$E_\nu f(\nu, y_t, X_t, \theta) = p(y_t | X_t, \theta)$$
where the density of $\nu$ is known. If this is the case, the simulator
$$\tilde{p}(y_t, X_t, \theta) = \frac{1}{H}\sum_{s=1}^{H} f(\nu_{ts}, y_t, X_t, \theta)$$
is unbiased for $p(y_t | X_t, \theta)$.
The SML simply substitutes $\tilde{p}(y_t, X_t, \theta)$ in place of $p(y_t | X_t, \theta)$ in the log-likelihood function, that is
$$\hat{\theta}_{SML} = \arg\max s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n} \ln \tilde{p}(y_t, X_t, \theta)$$
Example: multinomial probit
Recall that the utility of alternative $j$ is
$$u_j = X_j\beta + \epsilon_j$$
and the vector $y$ is formed of elements
$$y_j = 1\left[u_j > u_k, k \in m, k \ne j\right]$$
The problem is that $\Pr(y_j = 1 | \theta)$ can't be calculated when $m$ is larger than 4 or 5. However, it is easy to simulate this probability.
Draw $\tilde{\epsilon}_i$ from the distribution $N(0, \Omega)$
Calculate $\tilde{u}_i = X_i\beta + \tilde{\epsilon}_i$ (where $X_i$ is the matrix formed by stacking the $X_{ij}$)
Define $\tilde{y}_{ij} = 1\left[\tilde{u}_{ij} > \tilde{u}_{ik}, k \in m, k \ne j\right]$
Repeat this $H$ times and define
$$\tilde{\pi}_{ij} = \frac{\sum_{h=1}^{H} \tilde{y}_{ijh}}{H}$$
Define $\tilde{\pi}_i$ as the $m$-vector formed of the $\tilde{\pi}_{ij}$. Each element of $\tilde{\pi}_i$ is between 0 and 1, and the elements sum to one.
Now $\tilde{p}(y_i, X_i, \theta) = y_i'\tilde{\pi}_i$
The SML multinomial probit log-likelihood function is
$$\ln L(\beta, \Omega) = \frac{1}{n}\sum_{i=1}^{n} y_i' \ln \tilde{p}(y_i, X_i, \theta)$$
This is to be maximized w.r.t. $\beta$ and $\Omega$.
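The simulation steps above amount to a frequency simulator, which can be sketched as follows in Python for a single observation. The covariance structure (supplied through a lower-triangular Cholesky factor), the utility indices, and the function names are assumptions chosen for the illustration.

```python
import random

def draw_errors(chol, rng):
    """One draw from N(0, Omega), where chol is lower triangular with chol * chol' = Omega."""
    m = len(chol)
    z = [rng.gauss(0.0, 1.0) for _ in range(m)]
    return [sum(chol[j][k] * z[k] for k in range(j + 1)) for j in range(m)]

def simulated_choice_probs(xb, draws):
    """Frequency simulator: share of draws in which each alternative has the highest utility."""
    m = len(xb)
    counts = [0] * m
    for eps in draws:
        u = [xb[j] + eps[j] for j in range(m)]
        counts[u.index(max(u))] += 1
    return [c / len(draws) for c in counts]

rng = random.Random(42)

# Hypothetical Cholesky factor of Omega for m = 3 alternatives.
chol = [[1.0, 0.0, 0.0],
        [0.5, 1.0, 0.0],
        [0.0, 0.0, 1.0]]

H = 5000
draws = [draw_errors(chol, rng) for _ in range(H)]   # drawn once and then reused

xb = [0.0, 0.0, 2.0]     # hypothetical values of X_ij * beta for one individual
pi_tilde = simulated_choice_probs(xb, draws)

y = [0, 0, 1]            # the individual chose the third alternative
p_tilde = sum(yj * pj for yj, pj in zip(y, pi_tilde))
```

The same `draws` would be passed in at every trial value of the parameters during optimization, so that the simulated log-likelihood changes only because the parameters change.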
Notes:
The $H$ draws of $\tilde{\epsilon}_i$ are drawn only once and are used repeatedly during the iterations used to find $\hat{\beta}$ and $\hat{\Omega}$. The draws are different for each $i$. If the $\tilde{\epsilon}_i$ are re-drawn at every iteration the estimator will not converge.
The log-likelihood function with this simulator is a discontinuous function of $\beta$ and $\Omega$. This does not cause problems from a theoretical point of view since it can be shown that $\ln L(\beta, \Omega)$ is stochastically equicontinuous. However, it does cause problems if one attempts to use a gradient-based optimization method such as Newton-Raphson.
It may be the case, particularly if few simulations, $H$, are used, that some elements of $\tilde{\pi}_i$ are zero. If the corresponding element of $y_i$ is equal to 1, there will be a log(0) problem.
Solutions to discontinuity:
1) Use an estimation method that doesn't require a continuous and differentiable objective function, for example, simulated annealing. This is computationally costly.
2) Smooth the simulated probabilities so that they are continuous functions of the parameters. For example, apply a kernel transformation such as
$$\tilde{y}_{ij} = \Phi\left(A \times \left[\tilde{u}_{ij} - \max_{k=1}^{m} \tilde{u}_{ik}\right]\right) + .5 \times 1\left[\tilde{u}_{ij} = \max_{k=1}^{m} \tilde{u}_{ik}\right]$$
where $A$ is a large positive number. This approximates a step function such that $\tilde{y}_{ij}$ is very close to zero if $\tilde{u}_{ij}$ is not the maximum, and $\tilde{y}_{ij}$ is very close to 1 if $\tilde{u}_{ij}$ is the maximum. This makes $\tilde{y}_{ij}$ a continuous function of $\beta$ and $\Omega$, so that $\tilde{p}_{ij}$ and therefore $\ln L(\beta, \Omega)$ will be continuous and differentiable. Consistency requires that $A(n) \xrightarrow{p} \infty$, so that the approximation to a step function becomes arbitrarily close as the sample size increases. There are alternative methods (e.g., Gibbs sampling) that may work better, but this is too technical to discuss here.
To solve the log(0) problem, one possibility is to search the web for the slog function. Also, increase $H$ if this is a serious problem.
Properties
The properties of the SML estimator depend on how $H$ is set. The following is taken from Lee (1995) Asymptotic Bias in Simulated Maximum Likelihood Estimation of Discrete Choice Models, Econometric Theory, 11, pp. 437-83.
Theorem 68. [Lee] 1) if $\lim_{n\to\infty} n^{1/2}/H = 0$, then
$$\sqrt{n}\left(\hat{\theta}_{SML} - \theta^0\right) \xrightarrow{d} N(0, \mathcal{I}^{-1}(\theta^0))$$
2) if $\lim_{n\to\infty} n^{1/2}/H = \lambda$, a finite constant, then
$$\sqrt{n}\left(\hat{\theta}_{SML} - \theta^0\right) \xrightarrow{d} N(B, \mathcal{I}^{-1}(\theta^0))$$
where $B$ is a finite vector of constants.
This means that the SML estimator is asymptotically biased if $H$ doesn't grow faster than $n^{1/2}$.
The varcov is the typical inverse of the information matrix, so that as long as $H$ grows fast enough the estimator is consistent and fully asymptotically efficient.
22.3 Method of simulated moments (MSM)
Suppose we have a DGP$(y|x, \theta)$ which is simulable given $\theta$, but is such that the density of $y$ is not calculable.
One could, in principle, base a GMM estimator upon the moment conditions
$$m_t(\theta) = \left[K(y_t, x_t) - k(x_t, \theta)\right] z_t$$
where
$$k(x_t, \theta) = \int K(y, x_t)\, p(y|x_t, \theta)\, dy,$$
$z_t$ is a vector of instruments in the information set and $p(y|x_t, \theta)$ is the density of $y$ conditional on $x_t$. The problem is that this density is not available.
However $k(x_t, \theta)$ is readily simulated using
$$\tilde{k}(x_t, \theta) = \frac{1}{H}\sum_{h=1}^{H} K(\tilde{y}_t^h, x_t)$$
By the law of large numbers, $\tilde{k}(x_t, \theta) \xrightarrow{a.s.} k(x_t, \theta)$, as $H \to \infty$, which provides a clear intuitive basis for the estimator, though in fact we obtain consistency even for $H$ finite, since a law of large numbers is also operating across the $n$ observations of real data, so errors introduced by simulation cancel themselves out.
This allows us to form the moment conditions
$$\tilde{m}_t(\theta) = \left[K(y_t, x_t) - \tilde{k}(x_t, \theta)\right] z_t \qquad (22.2)$$
where $z_t$ is drawn from the information set. As before, form
$$\tilde{m}(\theta) = \frac{1}{n}\sum_{t=1}^{n} \tilde{m}_t(\theta) = \frac{1}{n}\sum_{t=1}^{n} \left[K(y_t, x_t) - \frac{1}{H}\sum_{h=1}^{H} K(\tilde{y}_t^h, x_t)\right] z_t \qquad (22.3)$$
with which we form the GMM criterion and estimate as usual. Note that the unbiased simulator $K(\tilde{y}_t^h, x_t)$ appears linearly within the sums.
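A toy version of the MSM recipe can be written in a few lines. The Python sketch below is an illustration under strong simplifying assumptions: the DGP is $y = \theta e$ with $e$ exponentially distributed with mean one, there are no regressors, the moment function is $K(y) = y$, the instrument is the constant 1, and the criterion is minimized by a crude grid search. None of these choices come from the text.

```python
import random

def simulate_y(theta, shocks):
    """Simulable DGP (an assumption for this example): y = theta * e, with e ~ Exp(1)."""
    return [theta * e for e in shocks]

def msm_moment(theta, y, shocks_by_obs):
    """Simulated moment (1/n) sum_t [K(y_t) - (1/H) sum_h K(y~_t^h)], with K(y) = y, z_t = 1."""
    total = 0.0
    for yt, shocks in zip(y, shocks_by_obs):
        sims = simulate_y(theta, shocks)
        total += yt - sum(sims) / len(sims)
    return total / len(y)

rng = random.Random(7)
n, H, theta0 = 1000, 5, 2.0

# "Real" data generated at the true parameter value.
y = simulate_y(theta0, [rng.expovariate(1.0) for _ in range(n)])

# Simulation shocks: drawn once, H per observation, and held fixed across trial values.
shocks_by_obs = [[rng.expovariate(1.0) for _ in range(H)] for _ in range(n)]

# With a single moment, the GMM criterion is the squared moment; minimize it on a grid.
grid = [0.5 + 0.01 * i for i in range(300)]
theta_hat = min(grid, key=lambda th: msm_moment(th, y, shocks_by_obs) ** 2)
```

Even with only $H = 5$ simulations per observation the estimate lands near the true value, because the simulation errors average out across the $n$ observations.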
Properties
Suppose that the optimal weighting matrix is used. McFadden (ref. above) and Pakes and Pollard
(refs. above) show that the asymptotic distribution of the MSM estimator is very similar to that of
the infeasible GMM estimator. In particular, assuming that the optimal weighting matrix is used, and
for $H$ finite,
$$\sqrt{n}\left(\hat{\theta}_{MSM} - \theta_0\right) \overset{d}{\to} N\left[0, \left(1 + \frac{1}{H}\right)\left(D_\infty \Omega^{-1} D_\infty'\right)^{-1}\right] \qquad (22.4)$$
where $\left(D_\infty \Omega^{-1} D_\infty'\right)^{-1}$ is the asymptotic variance of the infeasible GMM estimator.
That is, the asymptotic variance is inflated by a factor $1 + 1/H$. For this reason the MSM
estimator is not fully asymptotically efficient relative to the infeasible GMM estimator, for $H$
finite, but the efficiency loss is small and controllable, by setting $H$ reasonably large.
The estimator is asymptotically unbiased even for $H = 1$. This is an advantage relative to SML.
If one doesn't use the optimal weighting matrix, the asymptotic varcov is just the ordinary GMM
varcov, inflated by $1 + 1/H$.
The above presentation is in terms of a specific moment condition based upon the conditional
mean. Simulated GMM can be applied to moment conditions of any form.
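To get a feel for how quickly the efficiency loss in equation 22.4 shrinks, the following small Python sketch tabulates the variance inflation factor $1 + 1/H$ and the implied inflation of standard errors, which is its square root. The function name `msm_inflation` is introduced here only for illustration.

```python
import math

def msm_inflation(H):
    """Variance inflation factor 1 + 1/H from equation 22.4."""
    return 1.0 + 1.0 / H

for H in (1, 5, 10, 50):
    factor = msm_inflation(H)
    # columns: H, variance inflation, standard error inflation
    print(H, round(factor, 3), round(math.sqrt(factor), 3))
```

With H = 10 the variance is inflated by 10 percent, so standard errors rise by a little under 5 percent relative to the infeasible GMM estimator.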
Comments
Why is SML inconsistent if $H$ is finite, while MSM is consistent? The reason is that SML is based upon an average
of logarithms of an unbiased simulator (the densities of the observations). To use the multinomial
probit model as an example, the log-likelihood function is
$$\ln \mathcal{L}(\beta,\Omega) = \frac{1}{n}\sum_{i=1}^{n} y_i' \ln p_i(\beta,\Omega)$$
The SML version is
$$\ln \mathcal{L}(\beta,\Omega) = \frac{1}{n}\sum_{i=1}^{n} y_i' \ln \tilde{p}_i(\beta,\Omega)$$
The problem is that
$$E \ln\left(\tilde{p}_i(\beta,\Omega)\right) \neq \ln\left(E\,\tilde{p}_i(\beta,\Omega)\right)$$
in spite of the fact that
$$E\,\tilde{p}_i(\beta,\Omega) = p_i(\beta,\Omega)$$
due to the fact that $\ln(\cdot)$ is a nonlinear transformation. The only way for the two to be equal (in the
limit) is if $H$ tends to infinity so that $\tilde{p}(\cdot)$ tends to $p(\cdot)$.
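The bias that comes from taking the logarithm of an unbiased simulator is easy to see numerically. The Python sketch below is a stylized illustration, not the multinomial probit simulator: it assumes a hypothetical simulator $\tilde{p} = p \cdot \frac{1}{H}\sum_h \exp(s e_h - s^2/2)$ with $e_h \sim N(0,1)$, which is unbiased for $p$ by construction, and compares the average of $\ln \tilde{p}$ with $\ln p$.

```python
import math
import random

def mean_log_simulator(p, H, reps, rng, s=1.0):
    """Average of ln(p_tilde) over reps, where p_tilde is an unbiased simulator of p built from H draws."""
    total = 0.0
    for _ in range(reps):
        # each factor exp(s*e - s^2/2) has expectation 1, so E[p_tilde] = p
        p_tilde = p * sum(math.exp(s * rng.gauss(0.0, 1.0) - 0.5 * s * s) for _ in range(H)) / H
        total += math.log(p_tilde)
    return total / reps

rng = random.Random(1)
p = 0.3
bias_H1 = mean_log_simulator(p, 1, 20000, rng) - math.log(p)      # large negative bias
bias_H100 = mean_log_simulator(p, 100, 2000, rng) - math.log(p)   # bias nearly gone
print(bias_H1, bias_H100)
```

The average log of the simulator falls well short of $\ln p$ when H = 1, and the gap closes as H grows, which is the sense in which SML needs H to tend to infinity.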
The reason that MSM does not suffer from this problem is that in this case the unbiased simulator
appears linearly within every sum of terms, and it appears within a sum over $n$ (see equation 22.3).
Therefore the SLLN applies to cancel out simulation errors, from which we get consistency. That is,
using simple notation for the random sampling case, the moment conditions
$$\tilde{m}(\theta) = \frac{1}{n}\sum_{t=1}^{n}\left[K(y_t,x_t) - \frac{1}{H}\sum_{h=1}^{H} K(\tilde{y}_t^h,x_t)\right] z_t \qquad (22.5)$$
$$= \frac{1}{n}\sum_{t=1}^{n}\left[k(x_t,\theta_0) + \varepsilon_t - \frac{1}{H}\sum_{h=1}^{H}\left[k(x_t,\theta) + \tilde{\varepsilon}_{ht}\right]\right] z_t \qquad (22.6)$$
converge almost surely to
$$\tilde{m}_\infty(\theta) = \int \left[k(x,\theta_0) - k(x,\theta)\right] z(x)\,d\mu(x).$$
(note: $z_t$ is assumed to be made up of functions of $x_t$). The objective function converges to
$$s_\infty(\theta) = \tilde{m}_\infty(\theta)'\,W_\infty\,\tilde{m}_\infty(\theta)$$
which obviously has a minimum at $\theta_0$, hence consistency.
If you look at equation 22.6 a bit, you will see why the variance inflation factor is $\left(1 + \frac{1}{H}\right)$.
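One way to see where the factor comes from, sketched under assumptions that are added here for illustration: suppose the simulated errors $\tilde{\varepsilon}_{ht}$ are independent of $\varepsilon_t$ and of each other, and that at $\theta = \theta_0$ they share the conditional variance of $\varepsilon_t$, written $\sigma^2(x_t)$ (this symbol is not in the original text). At $\theta_0$ the $k(\cdot)$ terms in equation 22.6 cancel, leaving
$$\mathrm{Var}\left(\varepsilon_t - \frac{1}{H}\sum_{h=1}^{H}\tilde{\varepsilon}_{ht} \,\Big|\, x_t\right) = \sigma^2(x_t) + \frac{1}{H^2}\,H\,\sigma^2(x_t) = \left(1 + \frac{1}{H}\right)\sigma^2(x_t)$$
so each moment contribution has $1 + 1/H$ times the variance it would have if $k(x_t,\theta)$ could be computed exactly.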
22.4 Efficient method of moments (EMM)
The choice of moments upon which to base a GMM estimator can have very pronounced effects
upon the efficiency of the estimator.
A poor choice of moment conditions may lead to very inefficient estimators, and can even cause
identification problems (as we've seen with the GMM problem set).
The drawback of the above approach (MSM) is that the moment conditions used in estimation are
selected arbitrarily. The asymptotic efficiency of the estimator may be low.
The asymptotically optimal choice of moments would be the score vector of the likelihood function,
$$m_t(\theta) = D_\theta \ln p_t(\theta \mid I_t)$$
As before, this choice is unavailable.
The efficient method of moments (EMM) (see Gallant and Tauchen (1996), "Which Moments to
Match?", Econometric Theory, Vol. 12, 1996, pages 657-681) seeks to provide moment conditions
that closely mimic the score vector. If the approximation is very good, the resulting estimator
will be very nearly fully efficient.
The DGP is characterized by random sampling from the density
$$p(y_t|x_t,\theta_0) \equiv p_t(\theta_0)$$
We can define an auxiliary model, called the "score generator", which simply provides a
(misspecified) parametric density
$$f(y|x_t,\lambda) \equiv f_t(\lambda)$$
This density is known up to a parameter $\lambda$. We assume that this density function is calculable.
Therefore quasi-ML estimation is possible. Specifically,
$$\hat{\lambda} = \arg\max_{\Lambda} s_n(\lambda) = \frac{1}{n}\sum_{t=1}^{n} \ln f_t(\lambda).$$
After determining $\hat{\lambda}$ we can calculate the score functions $D_\lambda \ln f(y_t|x_t,\hat{\lambda})$.
The important point is that even if the density is misspecified, there is a pseudo-true $\lambda_0$ for
which the true expectation, taken with respect to the true but unknown density of $y$, $p(y|x_t,\theta_0)$,
and then marginalized over $x$, is zero:
$$\exists \lambda_0 : E_X E_{Y|X}\left[D_\lambda \ln f(y|x,\lambda_0)\right] = \int_X \int_{Y|X} D_\lambda \ln f(y|x,\lambda_0)\,p(y|x,\theta_0)\,dy\,d\mu(x) = 0$$
We have seen in the section on QML that $\hat{\lambda} \to \lambda_0$; this suggests using the moment conditions
$$m_n(\theta,\hat{\lambda}) = \frac{1}{n}\sum_{t=1}^{n} \int D_\lambda \ln f_t(\hat{\lambda})\,p_t(\theta)\,dy \qquad (22.7)$$
These moment conditions are not calculable, since $p_t(\theta)$ is not available, but they are simulable
using
$$\tilde{m}_n(\theta,\hat{\lambda}) = \frac{1}{n}\sum_{t=1}^{n} \frac{1}{H}\sum_{h=1}^{H} D_\lambda \ln f(\tilde{y}_t^h|x_t,\hat{\lambda})$$
where $\tilde{y}_t^h$ is a draw from $DGP(\theta)$, holding $x_t$ fixed. By the LLN and the fact that $\hat{\lambda}$ converges
to $\lambda_0$,
$$\tilde{m}_\infty(\theta_0,\lambda_0) = 0.$$
This is not the case for other values of $\theta$, assuming that $\lambda_0$ is identified.
The advantage of this procedure is that if $f(y_t|x_t,\lambda)$ closely approximates $p(y|x_t,\theta)$, then
$\tilde{m}_n(\theta,\hat{\lambda})$ will closely approximate the optimal moment conditions which characterize maximum likelihood
estimation, which is fully efficient.
If one has prior information that a certain density approximates the data well, it would be a
good choice for $f(\cdot)$.
If one has no density in mind, there exist good ways of approximating unknown distributions
parametrically: Phillips' ERAs (Econometrica, 1983) and Gallant and Nychka's (Econometrica,
1987) SNP density estimator, which we saw before. Since the SNP density is consistent, the
efficiency of the indirect estimator is the same as the infeasible ML estimator.
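The EMM recipe described in this section (fit the score generator by QML, then choose the structural parameter so that the simulated scores of the score generator average to zero) can be sketched in Python for a deliberately tiny case. Everything specific here is an assumption made for illustration: the structural model $y = \exp(\theta + e)$, the score generator $N(\lambda, 1)$ whose score is $y - \lambda$, and the helper names. With one moment and one parameter the model is exactly identified, so a simple bisection replaces the GMM minimization.

```python
import math
import random

def dgp(theta, e):
    """Hypothetical structural model: y = exp(theta + e), with e ~ N(0, 1)."""
    return math.exp(theta + e)

def emm_moment(theta, lam_hat, draws):
    """Simulated average score of the auxiliary model N(lambda, 1), evaluated at lam_hat."""
    return sum(dgp(theta, e) - lam_hat for e in draws) / len(draws)

def solve_theta(lam_hat, draws, lo=-3.0, hi=3.0, iters=40):
    """Bisection on the single moment condition, which is increasing in theta."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if emm_moment(mid, lam_hat, draws) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

rng = random.Random(2)
theta0, n, H = 0.5, 5000, 5
y = [dgp(theta0, rng.gauss(0.0, 1.0)) for _ in range(n)]
lam_hat = sum(y) / n                                   # QML estimate of the score generator
draws = [rng.gauss(0.0, 1.0) for _ in range(n * H)]    # fixed across values of theta
theta_hat = solve_theta(lam_hat, draws)
print(theta_hat)
```

The score generator here is badly misspecified (the data are lognormal, not normal), yet the estimator still recovers the structural parameter, because only the pseudo-true value of the auxiliary parameter matters for consistency. A better score generator would mainly improve efficiency.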
Optimal weighting matrix
I will present the theory for $H$ finite, and possibly small. This is done because it is sometimes
impractical to estimate with $H$ very large. Gallant and Tauchen give the theory for the case of $H$ so
large that it may be treated as infinite (the difference being irrelevant given the numerical precision
of a computer). The theory for the case of $H$ infinite follows directly from the results presented here.
The moment condition $\tilde{m}(\theta,\hat{\lambda})$ depends on the pseudo-ML estimate $\hat{\lambda}$. We can apply Theorem 31
to conclude that
$$\sqrt{n}\left(\hat{\lambda} - \lambda_0\right) \overset{d}{\to} N\left[0, \mathcal{J}(\lambda_0)^{-1}\,\mathcal{I}(\lambda_0)\,\mathcal{J}(\lambda_0)^{-1}\right] \qquad (22.8)$$
If the density $f(y_t|x_t,\hat{\lambda})$ were in fact the true density $p(y|x_t,\theta)$, then $\hat{\lambda}$ would be the maximum likelihood
estimator, and $\mathcal{J}(\lambda_0)^{-1}\mathcal{I}(\lambda_0)$ would be an identity matrix, due to the information matrix equality.
However, in the present case we assume that $f(y_t|x_t,\hat{\lambda})$ is only an approximation to $p(y|x_t,\theta)$, so there
is no cancellation.
Recall that $\mathcal{J}(\lambda_0) \equiv p\lim\left(\frac{\partial^2}{\partial\lambda\,\partial\lambda'} s_n(\lambda_0)\right)$. Comparing the definition of $s_n(\lambda)$ with the definition of
the moment condition in Equation 22.7, we see that
$$\mathcal{J}(\lambda_0) = D_{\lambda'}\, m(\theta_0,\lambda_0).$$
As in Theorem 31,
$$\mathcal{I}(\lambda_0) = \lim_{n\to\infty} E\left[n \left.\frac{\partial s_n(\lambda)}{\partial\lambda}\right|_{\lambda_0} \left.\frac{\partial s_n(\lambda)}{\partial\lambda'}\right|_{\lambda_0}\right].$$
In this case, this is simply the asymptotic variance covariance matrix of the moment conditions, $\Omega$.
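Now take a first order Taylor's series approximation to $\sqrt{n}\,\tilde{m}_n(\theta_0,\hat{\lambda})$ about $\lambda_0$:
$$\sqrt{n}\,\tilde{m}_n(\theta_0,\hat{\lambda}) = \sqrt{n}\,\tilde{m}_n(\theta_0,\lambda_0) + \sqrt{n}\,D_{\lambda'}\tilde{m}(\theta_0,\lambda_0)\left(\hat{\lambda} - \lambda_0\right) + o_p(1)$$
First consider $\sqrt{n}\,\tilde{m}_n(\theta_0,\lambda_0)$. It is straightforward but somewhat tedious to show that the
asymptotic variance of this term is $\frac{1}{H}\mathcal{I}_\infty(\lambda_0)$.
Next consider the second term $\sqrt{n}\,D_{\lambda'}\tilde{m}(\theta_0,\lambda_0)\left(\hat{\lambda} - \lambda_0\right)$. Note that $D_{\lambda'}\tilde{m}_n(\theta_0,\lambda_0) \overset{a.s.}{\to} \mathcal{J}(\lambda_0)$, so we
have
$$\sqrt{n}\,D_{\lambda'}\tilde{m}(\theta_0,\lambda_0)\left(\hat{\lambda} - \lambda_0\right) = \sqrt{n}\,\mathcal{J}(\lambda_0)\left(\hat{\lambda} - \lambda_0\right), \text{ a.s.}$$
But noting equation 22.8
$$\sqrt{n}\,\mathcal{J}(\lambda_0)\left(\hat{\lambda} - \lambda_0\right) \overset{a}{\sim} N\left[0, \mathcal{I}(\lambda_0)\right]$$
Now, combining the results for the first and second terms,
$$\sqrt{n}\,\tilde{m}_n(\theta_0,\hat{\lambda}) \overset{a}{\sim} N\left[0, \left(1 + \frac{1}{H}\right)\mathcal{I}(\lambda_0)\right]$$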
Suppose that $\hat{\mathcal{I}}(\lambda_0)$ is a consistent estimator of the asymptotic variance-covariance matrix of the
moment conditions. This may be complicated if the score generator is a poor approximator, since the
individual score contributions may not have mean zero in this case (see the section on QML). Even
if this is the case, the individual means can be calculated by simulation, so it is always possible to
consistently estimate $\mathcal{I}(\lambda_0)$ when the model is simulable. On the other hand, if the score generator
is taken to be correctly specified, the ordinary estimator of the information matrix is consistent.
Combining this with the result on the efficient GMM weighting matrix in Theorem 47, we see that
defining $\hat{\theta}$ as
$$\hat{\theta} = \arg\min_{\Theta}\; m_n(\theta,\hat{\lambda})' \left[\left(1 + \frac{1}{H}\right)\hat{\mathcal{I}}(\lambda_0)\right]^{-1} m_n(\theta,\hat{\lambda})$$
gives the GMM estimator with the efficient choice of weighting matrix.
If one has used the Gallant-Nychka ML estimator as the auxiliary model, the appropriate weighting
matrix is simply the information matrix of the auxiliary model, since the scores are uncorrelated
(i.e., it really is ML estimation asymptotically, since the score generator can approximate
the unknown density arbitrarily well).
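Asymptotic distribution
Since we use the optimal weighting matrix, the asymptotic distribution is as in Equation 14.3, so we
have (using the result in Equation 22.8):
$$\sqrt{n}\left(\hat{\theta} - \theta_0\right) \overset{d}{\to} N\left[0, \left(D_\infty \left[\left(1 + \frac{1}{H}\right)\mathcal{I}(\lambda_0)\right]^{-1} D_\infty'\right)^{-1}\right],$$
where
$$D_\infty = \lim_{n\to\infty} E\left[D_\theta\, m_n'(\theta_0,\lambda_0)\right].$$
This can be consistently estimated using
$$\hat{D} = D_\theta\, m_n'(\hat{\theta},\hat{\lambda})$$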
Diagnostic testing
The fact that
$$\sqrt{n}\,m_n(\theta_0,\hat{\lambda}) \overset{a}{\sim} N\left[0, \left(1 + \frac{1}{H}\right)\mathcal{I}(\lambda_0)\right]$$
implies that
$$n\,m_n(\hat{\theta},\hat{\lambda})' \left[\left(1 + \frac{1}{H}\right)\mathcal{I}(\hat{\lambda})\right]^{-1} m_n(\hat{\theta},\hat{\lambda}) \overset{a}{\sim} \chi^2(q)$$
where $q$ is $\dim(\lambda) - \dim(\theta)$, since without $\dim(\theta)$ moment conditions the model is not identified, so
testing is impossible. One test of the model is simply based on this statistic: if it exceeds the $\chi^2(q)$
critical point, something may be wrong (the small sample performance of this sort of test would be a
topic worth investigating).
Information about what is wrong can be gotten from the pseudo-t-statistics:
$$\left(\mathrm{diag}\left[\left(\left(1 + \frac{1}{H}\right)\mathcal{I}(\hat{\lambda})\right)^{1/2}\right]\right)^{-1} \sqrt{n}\,m_n(\hat{\theta},\hat{\lambda})$$
can be used to test which moments are not well modeled. Since these moments are related to
parameters of the score generator, which are usually related to certain features of the model,
this information can be used to revise the model. These aren't actually distributed as $N(0,1)$,
since $\sqrt{n}\,m_n(\theta_0,\hat{\lambda})$ and $\sqrt{n}\,m_n(\hat{\theta},\hat{\lambda})$ have different distributions (that of $\sqrt{n}\,m_n(\hat{\theta},\hat{\lambda})$ is somewhat
more complicated). It can be shown that the pseudo-t statistics are biased toward nonrejection.
See Gouriéroux et al. or Gallant and Long, 1995, for more details.
22.5 Indirect likelihood inference
This method is something I've been working on for the last few years with Dennis Kristensen. The
main reference is Creel and Kristensen, 2013. The method is related to Approximate Bayesian
Computing. Our paper adds some formal results that relate the idea to GMM, and gives the
first applications in economics. It is a very useful method, in my opinion. We have used it to estimate
complicated models such as DSGE models and continuous time jump diffusions, with good success. It
combines simulation based estimation, nonparametric fitting, and Bayesian methods. The following is
a brief description, and an example.
Suppose we have a fully specified model indexed by a parameter $\theta \in \Theta \subset R^k$. Given a sample
$Y_n = (y_1, ..., y_n)$ generated at the unknown true parameter value $\theta_0$, the generalized method of moments
estimator is based on a vector of statistics $Z_n = Z_n(Y_n)$ that lead to moment conditions
$m_n(\theta) = Z_n - E_\theta(Z_n)$, where $E_\theta$ indicates expectations under the model. In a similar vein, Creel and Kristensen
(CK13) propose a Bayesian indirect likelihood (BIL) estimator
$$\hat{\theta}_{BIL} = E(\theta|Z_n) = \int_\Theta \theta\, f_n(\theta|Z_n)\,d\theta, \qquad (22.9)$$
where, for some prior density $\pi(\theta)$ on the parameter space $\Theta$, $f_n(\theta|Z_n)$ is the posterior distribution
given by
$$f_n(\theta|Z_n) = \frac{f_n(Z_n,\theta)}{f_n(Z_n)} = \frac{f_n(Z_n|\theta)\,\pi(\theta)}{\int_\Theta f_n(Z_n|\theta)\,\pi(\theta)\,d\theta}.$$
This is very much like the widely used Bayesian posterior mean, except that the likelihood is formulated
in terms of the density of the statistic, $f_n(Z_n|\theta)$, rather than the full sample. Advantages of the BIL
estimator over GMM are the avoidance of optimization, avoidance of the need to compute the efficient
weight matrix, and higher order efficiency relative to the GMM estimator that uses the optimal weight
matrix (CK13).
Computation of the BIL estimator requires knowledge of $f_n(Z_n|\theta)$, which is normally not known.
Just as the simulated method of moments may be required when GMM is not feasible, simulation and
nonparametric regression may be used to compute a simulated BIL (SBIL) estimator. The method
explored in this paper is k-nearest neighbors (KNN) nonparametric regression. This is implemented
as follows: Make i.i.d. draws $\theta^s$, $s = 1, ..., S$, from the pseudo-prior density $\pi(\theta)$, for each draw generate
a sample $Y_n(\theta^s)$ from the model at this parameter value, and then compute the corresponding statistic
$Z_n^s = Z(Y_n(\theta^s))$, $s = 1, ..., S$. Let $\mathcal{Z} = \{Z_n^s,\ s = 1, 2, ..., S\}$ be the set of $S$ draws of the statistic.
Given the i.i.d. draws $(\theta^s, Z_n^s)$, $s = 1, ..., S$, we can obtain the SBIL estimator using
$$\hat{\theta}_{SBIL} = \hat{E}_S[\theta|Z_n] = \frac{\sum_{s=1}^{S} \theta^s K_h(Z_n^s - Z_n)}{\sum_{s=1}^{S} K_h(Z_n^s - Z_n)}, \qquad (22.10)$$
where $K_h(z) \geq 0$ is a kernel function that depends on a bandwidth parameter $h$. This is most obviously
a kernel regression estimator, but it can also be a nearest neighbor estimator if the bandwidth is
adaptive, which is what we do.
In the literature on kernel regression, it is well-known that the specific kernel function chosen is of
less importance than is choosing the bandwidth appropriately, given the chosen kernel (REF). For this
reason, we use a truncated Gaussian kernel, and focus on choosing the bandwidth well. The specific
kernel function we use is
$$K_h(Z_n^s - Z_n) = \begin{cases} \phi\left(\dfrac{a\left\|\Sigma^{-1/2}(Z_n^s - Z_n)\right\|}{h}\right) & \text{if } \left\|\Sigma^{-1/2}(Z_n^s - Z_n)\right\| \leq h \\ 0 & \text{if } \left\|\Sigma^{-1/2}(Z_n^s - Z_n)\right\| > h \end{cases}$$
where $\phi(\cdot)$ is the multivariate standard normal density function. The matrix $\Sigma$ is the diagonal matrix
containing the sample variances of the $S$ replications of $Z_n^s$. This matrix plays the important role of
putting the elements of the auxiliary statistic vector on the same scale, so that statistics with larger
variances do not dominate the distance measure. The bandwidth $h$ is adaptive: it is the $k$th order
statistic of the $S$ distances
$$d_s = \left\|\Sigma^{-1/2}(Z_n^s - Z_n)\right\|, \quad s = 1, 2, ..., S. \qquad (22.11)$$
Define $d_s^k$ as the $k$th order statistic of these distances, so the kernel bandwidth is $h = d_s^k$. Thus, this
adaptive kernel regression estimator is a KNN estimator, where only the closest $k$ neighbors affect the
fit. The scalar tuning parameter $a$ influences how rapidly the weights decline as the distance between
the simulated statistic and the observed statistic increases. We set $a = 2$. With these choices, the
SBIL estimator is a weighted average of the $\theta^s$ such that the corresponding simulated (scaled) $Z_n^s$ is
among the $k$ nearest neighbors to $Z_n$, and the weights are declining as the distance from $Z_n$ increases.
The problem of choosing the bandwidth becomes one of choosing the number of neighbors to use.
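The steps described above (draw parameters from the pseudo-prior, simulate the statistic, scale the distances, take the kth smallest distance as the bandwidth, and average the draws with truncated Gaussian weights) can be sketched in Python as follows. The model $y_i \sim N(\theta, 1)$, the statistic (sample mean and median), the uniform prior on $(-3, 3)$ and the values of S and k are all assumptions chosen to keep the example small; they are not the settings used in CK13. The kernel is taken to be a univariate Gaussian in the scaled distance, whose normalizing constant cancels in the ratio of equation 22.10.

```python
import math
import random

def statistic(sample):
    """Auxiliary statistic Z: (sample mean, sample median)."""
    s = sorted(sample)
    n = len(s)
    median = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    return (sum(s) / n, median)

def simulate(theta, n, rng):
    """Hypothetical model: y_i ~ N(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def sbil(z_obs, thetas, zs, k, a=2.0):
    """Weighted average of the prior draws whose statistics are among the k nearest to z_obs."""
    S, dim = len(zs), len(z_obs)
    means = [sum(z[j] for z in zs) / S for j in range(dim)]
    variances = [sum((z[j] - means[j]) ** 2 for z in zs) / (S - 1) for j in range(dim)]
    # scaled Euclidean distances, as in equation 22.11
    d = [math.sqrt(sum((z[j] - z_obs[j]) ** 2 / variances[j] for j in range(dim))) for z in zs]
    h = sorted(d)[k - 1]                     # adaptive bandwidth: k-th order statistic
    num = den = 0.0
    for theta_s, d_s in zip(thetas, d):
        if d_s <= h:                         # truncation: only the k nearest neighbors count
            w = math.exp(-0.5 * (a * d_s / h) ** 2)
            num += w * theta_s
            den += w
    return num / den

rng = random.Random(3)
n, S, k, theta0 = 50, 4000, 100, 1.0
z_obs = statistic(simulate(theta0, n, rng))
thetas = [rng.uniform(-3.0, 3.0) for _ in range(S)]      # draws from the pseudo-prior
zs = [statistic(simulate(th, n, rng)) for th in thetas]  # simulated statistics
theta_sbil = sbil(z_obs, thetas, zs, k)
print(z_obs, theta_sbil)
```

No optimization is needed at any point, and the loop over simulations is trivially parallel, which are the convenience arguments made for the estimator.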
In the i.i.d. sample context that applies to the simulated pairs $(\theta^s, Z_n^s)$, the KNN estimator is
consistent for the true posterior mean $E(\theta|Z_n)$ as $S$ increases, as long as the chosen number of neighbors,
$k$, grows slowly with $S$ (Li and Racine, 2007, Ch. 14). Because $S$, the number of simulations, can be
made as large as needed, consistency of KNN regression means that the SBIL estimator can be made
arbitrarily close to the infeasible BIL estimator. Nevertheless, methods for choosing $k$ as a function of
$S$ to obtain good performance of the SBIL estimator without requiring the number of simulations to
be extremely large are desirable, to limit the computational demand. Another factor that obviously
affects the performance of the SBIL estimator is the choice of the vector of statistics that form $Z_n$.
These issues have been addressed, but they are beyond the scope of these notes.
This discussion makes clear the nature of the estimator. The infeasible BIL estimator is a posterior
mean, conditional on a statistic, rather than the full sample. It turns out that the BIL estimator
is first order asymptotically equivalent to the optimal GMM estimator that uses the same statistic
(CK13). Thus, the relationship between the BIL estimator and the ordinary posterior mean $E(\theta|Y_n)$
based on the full sample is essentially the same as the relationship between the GMM estimator
and the maximum likelihood estimator: the first is in general not fully efficient, and the issue of
the choice of statistics arises. The relationship between the SBIL estimator and the infeasible BIL
estimator is like that between an ordinary Bayesian posterior mean computed using Markov chain
Monte Carlo or some other computational technique, and the desired true posterior mean: the first
is a numeric approximation of the second, which can be made as precise as needed by means of
additional computational resources. Our argument for using the SBIL estimator is one of convenience
and performance. In terms of convenience, the SBIL estimator can be reliably computed using simple
means that are very amenable to parallel computing techniques. Regarding performance, we show by
example that statistics $Z_n$ can be found which lead to good estimation results, even for complicated models
(e.g., DSGE models) that traditionally have required sophisticated estimation techniques.
A Simple DSGE Model
CK13 shows that SBIL estimation is tractable and gives reliable results for estimating the parameters
of a simple DSGE model. SBIL estimation can be done quickly and easily enough so that it is possible
to show its good performance via Monte Carlo, and example software has been provided that allows
confirmation of these results in little time. Here we use the same model to illustrate the methods
we propose. The model is as follows: A single good can be consumed or used for investment, and a
single competitive firm maximizes profits. The variables are: $y$ output; $c$ consumption; $k$ capital; $i$
investment; $n$ labor; $w$ real wages; $r$ return to capital. The household maximizes expected discounted
utility
$$E_t \sum_{s=0}^{\infty} \beta^s \left[\frac{c_{t+s}^{1-\gamma}}{1-\gamma} + (1 - n_{t+s})\,\eta_t\,\psi\right]$$
subject to the budget constraint $c_t + i_t = r_t k_t + w_t n_t$ and the accumulation of capital $k_{t+1} = i_t + (1-\delta)k_t$.
There is a preference shock, $\eta_t$, that affects the desirability of leisure. The shock evolves according to
$\ln \eta_t = \rho_\eta \ln \eta_{t-1} + \sigma_\eta \epsilon_t$. The competitive firm produces the good $y_t$ using the technology $y_t = k_t^{\alpha} n_t^{1-\alpha} z_t$.
Technology shocks $z_t$ also follow an AR(1) process in logarithms: $\ln z_t = \rho_z \ln z_{t-1} + \sigma_z u_t$. The
innovations to the preference and technology shocks, $\epsilon_t$ and $u_t$, are mutually independent i.i.d. standard
normally distributed. The good $y_t$ can be allocated by the consumer to consumption or investment:
$y_t = c_t + i_t$. The consumer provides capital and labor to the firm, and is paid at the rates $r_t$ and $w_t$,
respectively. The unknown parameters are collected in $\theta = (\alpha, \beta, \delta, \gamma, \psi, \rho_z, \rho_\eta, \sigma_z, \sigma_\eta)$. In total, we
have seven variables and two shocks.
In the estimation, we treat capital stock $k$ as unobserved, while the remaining variables are observed.
The true parameter values are given in Table 22.1. Following Ruge-Murcia (2012), rather than set a
true value and prior for $\psi$ and estimate this parameter directly, we instead treat steady state hours
$\bar{n}$ as a parameter to estimate, along with the other parameters, excepting $\psi$. Following Ruge-Murcia
(2012), the true value for $\psi$ was found using the other true parameter values, along with the restriction
that true steady state hours $\bar{n} = 1/3$. The true value is $\psi = \bar{c}^{-\gamma}(1-\alpha)\bar{k}^{\alpha}\bar{n}^{-\alpha} = 3.417$, where overbars
indicate steady state values of the variables.
The solution method is third-order perturbation using Dynare (Adjemian et al., 2011). The pseudo-prior
$\pi(\theta)$ is a uniform distribution over the hypercube defined by the bounds of the parameter space,
which are found in columns 3 and 4 of Table 22.1. The chosen limits cause the pseudo-prior means
to be biased for the true parameter values (see column 5 of the Table), and they are intended to be
broad, in comparison to the fairly strongly informative priors that are often used when estimating
DSGE models (see column 6 of the Table). To generate simulations, a parameter value $\theta^s$ is drawn
from the prior, then the model is solved at this parameter value, and a simulated sample is drawn.
The sample size is $n = 80$, which mimics 20 years of quarterly data. With a simulated sample, we can
generate a realization of the vector of statistics, $Z_n^s(\theta^s)$.
Table 22.1 gives the true parameter values and the limits of the uniform priors, along with
information about the informativeness of the prior. Table 22.2 gives results for a Monte Carlo study of
the performance of the SBIL estimator. If you compare RMSE in the two tables, you'll see that the
SBIL estimator achieves a considerable reduction of RMSE relative to that of the prior. Also, the
SBIL estimator (with bootstrap-based bias correction) is essentially unbiased for all of the model's
parameters.
A simple script that does SBIL estimation of the same DSGE model as discussed above is at DSGE
by SBIL. This implements a less sophisticated version of the estimator than was used to make the tables
presented here (less careful choice of auxiliary statistics, no bias correction, etc), but it conveys the
main ideas.
Table 22.1: True parameter values and bounds of priors
                             Prior bounds
Parameter      True value    Lower    Upper    Prior Bias    Prior RMSE
$\alpha$       0.330         0.2      0.4      -0.030        0.065
$\beta$        0.990         0.97     0.999    -0.006        0.010
$\delta$       0.025         0.005    0.04      0.000        0.009
$\gamma$       2.000         0.0      5.0       0.500        1.527
$\rho_z$       0.900         0.5      0.999    -0.150        0.208
$\sigma_z$     0.010         0.001    0.1       0.041        0.049
$\rho_\eta$    0.700         0.5      0.999     0.049        0.152
$\sigma_\eta$  0.005         0.001    0.1       0.046        0.054
$\bar{n}$      8/24          7/24     9/24      0.000        0.024
Table 22.2: Monte Carlo results, bias corrected estimators
                             Mean                     Bias                     RMSE
Parameter      True value    1st round  2nd round    1st round  2nd round    1st round  2nd round
$\alpha$       0.330         0.329      0.330        -0.001     -0.000       0.002      0.002
$\beta$        0.990         0.990      0.990         0.000     -0.000       0.001      0.001
$\delta$       0.025         0.025      0.025        -0.000     -0.000       0.001      0.001
$\gamma$       2.000         2.065      2.027         0.065      0.027       0.290      0.292
$\rho_z$       0.900         0.891      0.899        -0.009     -0.001       0.052      0.052
$\sigma_z$     0.010         0.010      0.010         0.000     -0.000       0.002      0.002
$\rho_\eta$    0.700         0.707      0.701         0.007      0.001       0.071      0.076
$\sigma_\eta$  0.005         0.005      0.005         0.000     -0.000       0.001      0.002
$\bar{n}$      1/3           0.333      0.333        -0.000     -0.000       0.004      0.004
22.6 Examples
SML of a Poisson model with latent heterogeneity
We have seen (see equation 18.1) that a Poisson model with latent heterogeneity that follows a
gamma distribution leads to the negative binomial model. To illustrate SML, we can integrate
out the latent heterogeneity using Monte Carlo, rather than the analytical approach which leads to
the negative binomial model. In actual practice, one would not want to use SML in this case, but it is
a nice example since it allows us to compare SML to the actual ML estimator. The Octave function
defined by PoissonLatentHet.m calculates the simulated log likelihood for a Poisson model where
$\lambda = \exp(x_t'\beta + \sigma\eta)$, where $\eta \sim N(0,1)$. This model is similar to the negative binomial model, except
that the latent variable is normally distributed rather than gamma distributed. The Octave script
EstimatePoissonLatentHet.m estimates this model using the MEPS OBDV data that has already been
discussed. Note that simulated annealing is used to maximize the log likelihood function. Attempting
to use BFGS leads to trouble. I suspect that the log likelihood is approximately non-differentiable in
places, around which it is very flat, though I have not checked if this is true. If you run this script,
you will see that it takes a long time to get the estimation results, which are:
******************************************************
Poisson Latent Heterogeneity model, SML estimation, MEPS 1996 full data set
MLE Estimation Results
BFGS convergence: Max. iters. exceeded
Average Log-L: -2.171826
Observations: 4564
estimate st. err t-stat p-value
constant -1.592 0.146 -10.892 0.000
pub. ins. 1.189 0.068 17.425 0.000
priv. ins. 0.655 0.065 10.124 0.000
sex 0.615 0.044 13.888 0.000
age 0.018 0.002 10.865 0.000
edu 0.024 0.010 2.523 0.012
inc -0.000 0.000 -0.531 0.596
lnalpha 0.203 0.014 14.036 0.000
Information Criteria
CAIC : 19899.8396 Avg. CAIC: 4.3602
BIC : 19891.8396 Avg. BIC: 4.3584
AIC : 19840.4320 Avg. AIC: 4.3472
******************************************************
octave:3>
If you compare these results to the results for the negative binomial model, given in subsection (18.2),
you can see that the present model fits better according to the CAIC criterion. The present model
is considerably less convenient to work with, however, due to the computational requirements. The
chapter on parallel computing is relevant if you wish to use models of this sort.
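For readers who want to see the structure of the simulated log likelihood without opening PoissonLatentHet.m, here is a minimal Python sketch of the same idea. It is not a translation of that file: the single regressor, the parameter values, and the function names are assumptions made for illustration. For each observation the Poisson probability is averaged over H fixed draws of the latent heterogeneity, and the log of that average is summed over observations.

```python
import math
import random

def poisson_logpmf(y, lam):
    """Log of the Poisson probability of count y with mean lam."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def simulated_loglik(beta, sigma, y, x, eta):
    """Average simulated log likelihood; eta[t] holds the H fixed N(0,1) draws for observation t."""
    total = 0.0
    for yt, xt, draws in zip(y, x, eta):
        index = beta[0] + beta[1] * xt
        # Monte Carlo average of the Poisson probability over the latent heterogeneity
        p_tilde = sum(math.exp(poisson_logpmf(yt, math.exp(index + sigma * e))) for e in draws) / len(draws)
        total += math.log(p_tilde)
    return total / len(y)

def draw_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(4)
n, H = 300, 50
beta0, sigma0 = (0.5, 1.0), 0.5
x = [rng.random() for _ in range(n)]
y = [draw_poisson(math.exp(beta0[0] + beta0[1] * xt + sigma0 * rng.gauss(0.0, 1.0)), rng) for xt in x]
eta = [[rng.gauss(0.0, 1.0) for _ in range(H)] for _ in range(n)]   # held fixed during estimation

ll_true = simulated_loglik(beta0, sigma0, y, x, eta)
ll_wrong = simulated_loglik((1.5, 1.0), sigma0, y, x, eta)
print(ll_true, ll_wrong)
```

An estimator would maximize `simulated_loglik` over the parameters. Because the log is taken of a simulated average, H must be reasonably large for the estimator to behave well, in line with the discussion of SML earlier in the chapter.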
MSM
An example of estimation using the MSM is given in the script file MSM_Example.m. The first order
moving average (MA(1)) model has been widely used to investigate the performance of the indirect
inference estimator, and a pth-order autoregressive model is often used to generate the auxiliary
statistic (see, for example, Gouriéroux, Monfort and Renault, 1993; Chumacero, 2001). In this section
we estimate the MA(1) model
$$y_t = \epsilon_t + \psi\,\epsilon_{t-1}, \qquad \epsilon_t \sim \text{i.i.d. } N(0,\sigma^2)$$
The parameter vector is $\theta = (\psi, \sigma)$. We set the parameter space for the initial simulated annealing
stage (to get good start values for the gradient-based algorithm) to $\Theta = (-1, 1) \times (0, 2)$, which
imposes invertibility, which is needed for the parameter $\psi$ to be identified. The statistic $Z_n$ is the vector
of estimated parameters $(\hat{\rho}_0, \hat{\rho}_1, ..., \hat{\rho}_P, \hat{\sigma}_\nu^2)$ of an AR(P) model $y_t = \rho_0 + \sum_{p=1}^{P} \rho_p y_{t-p} + \nu_t$, fit to the
data using ordinary least squares.
We estimate $\theta$ using MSM implemented as II, using continuously updated GMM (Hansen, Heaton
and Yaron, 1996). The moment conditions that define the continuously updated indirect inference
(CU-II) estimator are $m_n(\theta) = Z_n - \bar{Z}_{S,n}(\theta)$ where $\bar{Z}_{S,n}(\theta) = \frac{1}{S}\sum_{s=1}^{S} Z_n^s(\theta)$, and the weight matrix at
each iteration is the inverse of
$$\Omega_n^S(\theta) = \frac{1}{S}\sum_{s=1}^{S}\left[Z_n^s(\theta) - \bar{Z}_{S,n}(\theta)\right]\left[Z_n^s(\theta) - \bar{Z}_{S,n}(\theta)\right]',$$
where $S = 100$.
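The moment conditions of this example can be sketched in Python as follows. This is not the code in MSM_Example.m: to stay short it assumes an AR(1) auxiliary model (P = 1), so that OLS has a closed form, and it stops at the moment vector rather than forming the continuously updated criterion. Function names and the values of n and S are choices made for illustration.

```python
import random

def simulate_ma1(psi, sigma, shocks):
    """y_t = e_t + psi * e_{t-1}, with e_t = sigma * shocks[t]; shocks has n + 1 elements."""
    e = [sigma * s for s in shocks]
    return [e[t] + psi * e[t - 1] for t in range(1, len(e))]

def ar1_statistic(y):
    """OLS fit of y_t = rho0 + rho1 * y_{t-1} + v_t; returns (rho0, rho1, residual variance)."""
    ylag, ycur = y[:-1], y[1:]
    m = len(ycur)
    mx, my = sum(ylag) / m, sum(ycur) / m
    sxx = sum((a - mx) ** 2 for a in ylag)
    sxy = sum((a - mx) * (b - my) for a, b in zip(ylag, ycur))
    rho1 = sxy / sxx
    rho0 = my - rho1 * mx
    s2 = sum((b - rho0 - rho1 * a) ** 2 for a, b in zip(ylag, ycur)) / m
    return (rho0, rho1, s2)

def ii_moments(theta, z_n, shock_sets):
    """m_n(theta) = Z_n minus the average over s of Z_n^s(theta), using fixed shocks."""
    psi, sigma = theta
    sims = [ar1_statistic(simulate_ma1(psi, sigma, sh)) for sh in shock_sets]
    zbar = [sum(z[j] for z in sims) / len(sims) for j in range(3)]
    return [z_n[j] - zbar[j] for j in range(3)]

rng = random.Random(5)
n, S = 2000, 20
y = simulate_ma1(0.5, 1.0, [rng.gauss(0.0, 1.0) for _ in range(n + 1)])
z_n = ar1_statistic(y)
# the same shocks are reused for every theta, so the moments vary smoothly with theta
shock_sets = [[rng.gauss(0.0, 1.0) for _ in range(n + 1)] for _ in range(S)]

m_true = ii_moments((0.5, 1.0), z_n, shock_sets)
m_wrong = ii_moments((-0.5, 1.0), z_n, shock_sets)
print(z_n, m_true, m_wrong)
```

With three auxiliary statistics and two structural parameters the model is overidentified, so an estimator would minimize a quadratic form in `ii_moments` with the weight matrix described in the text.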
Example: EMM estimation of a discrete choice model
In this section we consider EMM estimation. There is a sophisticated package by Gallant and Tauchen for
this, but here we'll look at some simple, but hopefully didactic code. The file probitdgp.m generates
data that follows the probit model. The file emm_moments.m defines EMM moment conditions,
where the DGP and score generator can be passed as arguments. Thus, it is a general purpose
moment condition for EMM estimation. This file is interesting enough to warrant some discussion.
A listing appears in Listing 22.1. Line 3 defines the DGP, and the arguments needed to evaluate it
are defined in line 4. The score generator is defined in line 5, and its arguments are defined in line 6.
The QML estimate of the parameter of the score generator is read in line 7. Note in line 10 how the
random draws needed to simulate data are passed with the data, and are thus fixed during estimation,
to avoid "chattering". The simulated data is generated in line 16, and the derivative of the score
generator using the simulated data is calculated in line 18. In line 20 we average the scores of the
score generator, which are the moment conditions that the function returns.
1 function scores = emm_moments(theta, data, momentargs)
2 k = momentargs{1};
3 dgp = momentargs{2}; # the data generating process (DGP)
4 dgpargs = momentargs{3}; # its arguments (cell array)
5 sg = momentargs{4}; # the score generator (SG)
6 sgargs = momentargs{5}; # SG arguments (cell array)
7 phi = momentargs{6}; # QML estimate of SG parameter
8 y = data(:,1);
9 x = data(:,2:k+1);
10 rand_draws = data(:,k+2:columns(data)); # passed with data to ensure fixed across iterations
11 n = rows(y);
12 scores = zeros(n,rows(phi)); # container for moment contributions
13 reps = columns(rand_draws); # how many simulations?
14 for i = 1:reps
15 e = rand_draws(:,i);
16 y = feval(dgp, theta, x, e, dgpargs); # simulated data
17 sgdata = [y x]; # simulated data for SG
18 scores = scores + numgradient(sg, {phi, sgdata, sgargs}); # gradient of SG
19 endfor
20 scores = scores / reps; # average over number of simulations
21 endfunction
Listing 22.1: emm_moments.m
The file emm_example.m performs EMM estimation of the probit model, using a logit model as
the score generator. The results we obtain are
Score generator results:
=====================================================
BFGSMIN final results
Used analytic gradient
------------------------------------------------------
STRONG CONVERGENCE
Function conv 1 Param conv 1 Gradient conv 1
------------------------------------------------------
Objective function value 0.281571
Stepsize 0.0279
15 iterations
------------------------------------------------------
param gradient change
1.8979 0.0000 0.0000
1.6648 -0.0000 0.0000
1.9125 -0.0000 0.0000
1.8875 -0.0000 0.0000
1.7433 -0.0000 0.0000
======================================================
Model results:
******************************************************
EMM example
GMM Estimation Results
BFGS convergence: Normal convergence
Objective function value: 0.000000
Observations: 1000
Exactly identified, no spec. test
estimate st. err t-stat p-value
p1 1.069 0.022 47.618 0.000
p2 0.935 0.022 42.240 0.000
p3 1.085 0.022 49.630 0.000
p4 1.080 0.022 49.047 0.000
p5 0.978 0.023 41.643 0.000
******************************************************
It might be interesting to compare the standard errors with those obtained from ML estimation,
to check efficiency of the EMM estimator. One could even do a Monte Carlo study.
Indirect likelihood inference
A simple script that does SBIL estimation of the same DSGE model as discussed above is at DSGE by
SBIL. This implements a less sophisticated version of the estimator than was used to make the tables
presented here (less careful choice of auxiliary statistics, no bias correction, etc), but it conveys the
main ideas.
22.7 Exercises
1. (basic) Examine the Octave script and function discussed in subsection 22.6 and describe what
they do.
2. (basic) Examine the Octave scripts and functions discussed in subsection 22.6 and describe what
they do.
3. (advanced, but even if you don't do this you should be able to describe what needs to be done)
Write Octave code to do SML estimation of the probit model. Do an estimation using data
generated by a probit model (probitdgp.m might be helpful). Compare the SML estimates to
ML estimates.
4. (more advanced) Do a little Monte Carlo study to compare ML, SML and EMM estimation of
the probit model. Investigate how the number of simulations affects the two simulation-based
estimators.
Chapter 23
Parallel programming for econometrics
The following borrows heavily from Creel (2005).
Parallel computing can offer an important reduction in the time to complete computations. This
is well-known, but it bears emphasis since it is the main reason that parallel computing may be
attractive to users. To illustrate, the Intel Pentium IV (Willamette) processor, running at 1.5GHz,
was introduced in November of 2000. The Pentium IV (Northwood-HT) processor, running at 3.06GHz,
was introduced in November of 2002. An approximate doubling of the performance of a commodity
CPU took place in two years. Extrapolating this admittedly rough snapshot of the evolution of the
performance of commodity processors, one would need to wait more than 6.6 years and then purchase
a new computer to obtain a 10-fold improvement in computational performance. The examples in
this chapter show that a 10-fold improvement in performance can be achieved immediately, using
distributed parallel computing on available computers.
Recent developments (this is written in 2005) may make parallel computing attractive to a
broader spectrum of researchers who do computations. The first is the fact that setting up a cluster
of computers for distributed parallel computing is not difficult. If you are using the ParallelKnoppix
bootable CD that accompanies these notes, you are less than 10 minutes away from creating a cluster,
supposing you have a second computer at hand and a crossover ethernet cable. See the ParallelKnoppix
tutorial. A second development is the existence of extensions to some of the high-level matrix
programming (HLMP) languages[1] that allow the incorporation of parallelism into programs written
in these languages. A third is the spread of dual and quad-core CPUs, so that an ordinary desktop
or laptop computer can be made into a mini-cluster. Those cores won't work together on a single
problem unless they are told how to.
Following are examples of parallel implementations of several mainstream problems in econometrics.
A focus of the examples is on the possibility of hiding parallelization from end users of programs. If
programs that run in parallel have an interface that is nearly identical to the interface of equivalent
serial versions, end users will find it easy to take advantage of parallel computing's performance. We
continue to use Octave, taking advantage of the MPI Toolbox (MPITB) for Octave, by Fernández
Baldomero et al. (2004). There are also parallel packages for Ox, R, and Python which may be
of interest to econometricians, but as of this writing, the following examples are the most accessible
introduction to parallel programming for econometricians.
[1] By "high-level matrix programming language" I mean languages such as MATLAB (TM the Mathworks, Inc.),
Ox (TM OxMetrics Technologies, Ltd.), and GNU Octave (www.octave.org), for example.
23.1 Example problems
This section introduces example problems from econometrics, and shows how they can be parallelized
in a natural way.
Monte Carlo
A Monte Carlo study involves repeating a random experiment many times under identical conditions.
Several authors have noted that Monte Carlo studies are obvious candidates for parallelization
(Doornik et al. 2002; Bruche, 2003) since blocks of replications can be done independently on different
computers. To illustrate the parallelization of a Monte Carlo study, we use the same trace test example
as do Doornik et al. (2002). tracetest.m is a function that calculates the trace test statistic for the
lack of cointegration of integrated time series. This function is illustrative of the format that we adopt
for Monte Carlo simulation of a function: it receives a single argument of cell type, and it returns a
row vector that holds the results of one random simulation. The single argument in this case is a cell
array that holds the length of the series in its first position, and the number of series in the second
position. It generates a random result through a process that is internal to the function, and it reports
some output in a row vector (in this case the result is a scalar).
mc_example1.m is an Octave script that executes a Monte Carlo study of the trace test by repeat-
edly evaluating the tracetest.m function. The main thing to notice about this script is that lines 7
and 10 call the function montecarlo.m. When called with 3 arguments, as in line 7, montecarlo.m
executes serially on the computer it is called from. In line 10, there is a fourth argument. When called
with four arguments, the last argument is the number of slave hosts to use. We see that running the
Monte Carlo study on one or more processors is transparent to the user - he or she must only indicate
the number of slave computers to be used.
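The idea of splitting replications into independent blocks can be sketched in a few lines of Octave. This is only an illustration under stated assumptions, not the code of montecarlo.m: the names toy_sim, block_sizes and run_block are hypothetical, the simulated statistic is a simple sample mean rather than the trace test, and the blocks are run one after another on a single machine instead of being sent to slave nodes.

```octave
1;  % tells Octave this file is a script, not a function file

% Hypothetical stand-in for a simulation function in the format described
% above: one cell-array argument in, one result (here a scalar) out.
function result = toy_sim(args)
  n = args{1};                 % length of the simulated series
  result = mean(randn(n, 1));  % outcome of one random replication
endfunction

% Split reps replications into nblocks blocks of nearly equal size.
function sizes = block_sizes(reps, nblocks)
  base = floor(reps / nblocks);
  sizes = base * ones(1, nblocks);
  extra = mod(reps, nblocks);
  sizes(1:extra) = sizes(1:extra) + 1;  % spread the remainder over the first blocks
endfunction

% Run one block of replications. On a cluster, each block would run on
% its own node; here the blocks simply run in sequence.
function out = run_block(f, args, nreps)
  out = zeros(nreps, 1);
  for i = 1:nreps
    out(i) = feval(f, args);
  endfor
endfunction

reps = 1000;
sizes = block_sizes(reps, 3);
results = [];
for b = 1:numel(sizes)
  results = [results; run_block(@toy_sim, {50}, sizes(b))];
endfor
printf("%d replications done in blocks of sizes %s\n", rows(results), mat2str(sizes));
```

Because the blocks share nothing except the function and its arguments, the only communication a parallel version would need is sending the arguments out and collecting the stacked results.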
ML
For a sample $\{(y_t, x_t)\}_n$ of n observations of a set of dependent and explanatory variables, the
maximum likelihood estimator of the parameter $\theta$ can be defined as
\[ \hat{\theta} = \arg\max s_n(\theta) \]
where
\[ s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \ln f(y_t | x_t, \theta) \]
Here, $y_t$ may be a vector of random variables, and the model may be dynamic since $x_t$ may contain
lags of $y_t$. As Swann (2002) points out, this can be broken into sums over blocks of observations, for
example two blocks:
\[ s_n(\theta) = \frac{1}{n} \left\{ \left( \sum_{t=1}^{n_1} \ln f(y_t | x_t, \theta) \right) + \left( \sum_{t=n_1+1}^{n} \ln f(y_t | x_t, \theta) \right) \right\} \]
Analogously, we can define up to n blocks. Again following Swann, parallelization can be done by
calculating each block on separate computers.
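As a minimal sketch of this decomposition, the following Octave lines evaluate the average log-likelihood of a Poisson model once over the full sample and once as the sum of two blocks. The function name poisson_logdens, the artificial data and the split point n1 are assumptions made for illustration; this is not the code used by mle_estimate.

```octave
1;  % tells Octave this file is a script, not a function file

% Log density of each observation for a Poisson model with conditional
% mean exp(x*theta). Hypothetical helper written for this illustration.
function ll = poisson_logdens(theta, y, x)
  lambda = exp(x * theta);
  ll = y .* log(lambda) - lambda - gammaln(y + 1);
endfunction

n = 200;                        % sample size
n1 = 120;                       % last observation of the first block
x = [ones(n, 1), randn(n, 1)];  % artificial regressors
y = floor(5 * rand(n, 1));      % artificial count data
theta = [0.5; -0.2];            % point at which s_n is evaluated

% Serial evaluation over the whole sample
s_serial = mean(poisson_logdens(theta, y, x));

% Blockwise evaluation: each sum could be computed on a different machine
block1 = sum(poisson_logdens(theta, y(1:n1), x(1:n1, :)));
block2 = sum(poisson_logdens(theta, y(n1+1:n), x(n1+1:n, :)));
s_blocks = (block1 + block2) / n;

printf("serial: %f  blockwise: %f\n", s_serial, s_blocks);
```

Each node only needs its own rows of y and x plus the current value of theta, and it returns a single number, so the communication cost per evaluation is small.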
mle_example1.m is an Octave script that calculates the maximum likelihood estimator of the
parameter vector of a model that assumes that the dependent variable is distributed as a Poisson
random variable, conditional on some explanatory variables. In lines 1-3 the data is read, the name
of the density function is provided in the variable model, and the initial value of the parameter vector
is set. In line 5, the function mle_estimate performs ordinary serial calculation of the ML estimator,
while in line 7 the same function is called with 6 arguments. The fourth and fifth arguments are empty
placeholders where options to mle_estimate may be set, while the sixth argument is the number of
slave computers to use for parallel execution, 1 in this case. A person who runs the program sees
no parallel programming code - the parallelization is transparent to the end user, beyond having to
select the number of slave computers. When executed, this script prints out the estimates theta_s
and theta_p, which are identical.
It is worth noting that a different likelihood function may be used by making the model variable
point to a different function. The likelihood function itself is an ordinary Octave function that is not
parallelized. The mle_estimate function is a generic function that can call any likelihood function
that has the appropriate input/output syntax for evaluation either serially or in parallel. Users need
only learn how to write the likelihood function using the Octave language.
GMM
For a sample as above, the GMM estimator of the parameter $\theta$ can be defined as
\[ \hat{\theta} \equiv \arg\min_{\Theta} s_n(\theta) \]
where
\[ s_n(\theta) = m_n(\theta)' W_n m_n(\theta) \]
and
\[ m_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} m_t(y_t | x_t, \theta) \]
Since $m_n(\theta)$ is an average, it can obviously be computed blockwise, using for example 2 blocks:
\[ m_n(\theta) = \frac{1}{n} \left\{ \left( \sum_{t=1}^{n_1} m_t(y_t | x_t, \theta) \right) + \left( \sum_{t=n_1+1}^{n} m_t(y_t | x_t, \theta) \right) \right\} \qquad (23.1) \]
Likewise, we may define up to n blocks, each of which could potentially be computed on a different
machine.
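The blockwise average of moment contributions feeds directly into the quadratic form of the GMM objective. The sketch below uses the moment conditions of a linear regression purely as an example; lin_moments, the identity weight matrix and the simulated data are assumptions for illustration and are not taken from gmm_estimate.

```octave
1;  % tells Octave this file is a script, not a function file

% Hypothetical moment contributions for a linear model: row t holds
% x_t' * (y_t - x_t' * theta).
function m = lin_moments(theta, y, x)
  e = y - x * theta;   % n x 1 residuals
  m = x .* e;          % n x k, each column of x scaled by e (broadcasting)
endfunction

n = 200;                        % sample size
n1 = 80;                        % last observation of the first block
x = [ones(n, 1), randn(n, 1)];
y = x * [1; 2] + randn(n, 1);
theta = [0.9; 2.1];             % point at which the objective is evaluated
W = eye(2);                     % weight matrix, identity for simplicity

% Serial average of the moment contributions
m_serial = mean(lin_moments(theta, y, x))';

% Blockwise: sum within each block, then combine and divide by n
sum1 = sum(lin_moments(theta, y(1:n1), x(1:n1, :)));
sum2 = sum(lin_moments(theta, y(n1+1:n), x(n1+1:n, :)));
m_blocks = (sum1 + sum2)' / n;

s_serial = m_serial' * W * m_serial;
s_blocks = m_blocks' * W * m_blocks;
printf("objective, serial: %f  blockwise: %f\n", s_serial, s_blocks);
```

Only the block sums are computed separately; the quadratic form itself is cheap and is evaluated once the sums have been combined.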
gmm_example1.m is a script that illustrates how GMM estimation may be done serially or in
parallel. When this is run, theta_s and theta_p are identical up to the tolerance for convergence of
the minimization routine. The point to notice here is that an end user can perform the estimation in
parallel in virtually the same way as it is done serially. Again, gmm_estimate, used in lines 8 and 10,
is a generic function that will estimate any model specified by the moments variable - a different model
can be estimated by changing the value of the moments variable. The function that moments points
to is an ordinary Octave function that uses no parallel programming, so users can write their models
using the simple and intuitive HLMP syntax of Octave. Whether estimation is done in parallel or
serially depends only on the seventh argument to gmm_estimate - when it is missing or zero, estimation
is by default done serially with one processor. When it is positive, it specifies the number of slave
nodes to use.
Kernel regression
The Nadaraya-Watson kernel regression estimator of a function $g(x)$ at a point $x$ is
\[ \hat{g}(x) = \frac{\sum_{t=1}^{n} y_t K[(x - x_t)/\gamma_n]}{\sum_{t=1}^{n} K[(x - x_t)/\gamma_n]} \equiv \sum_{t=1}^{n} w_t y_t \]
We see that the weight depends upon every data point in the sample. To calculate the fit at every
point in a sample of size n, on the order of $n^2 k$ calculations must be done, where k is the dimension
of the vector of explanatory variables, x. Racine (2002) demonstrates that MPI parallelization can
be used to speed up calculation of the kernel regression estimator by calculating the fits for portions
of the sample on different computers. We follow this implementation here. kernel_example1.m is a
script for serial and parallel kernel regression. Serial execution is obtained by setting the number of
slaves equal to zero, in line 15. In line 17, a single slave is specified, so execution is in parallel on the
master and slave nodes.
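A small sketch may help to show why the fits split so easily across machines: every evaluation point needs the whole sample, but the evaluation points themselves are independent of one another. The code below assumes a scalar regressor, a Gaussian kernel and a fixed window width h; the function nw_fit is hypothetical and is not the routine called by kernel_example1.m.

```octave
1;  % tells Octave this file is a script, not a function file

% Nadaraya-Watson fit at each element of x_eval, using the full sample
% (x, y), a Gaussian kernel and window width h. Illustrative helper only.
function g = nw_fit(x_eval, x, y, h)
  g = zeros(rows(x_eval), 1);
  for i = 1:rows(x_eval)
    w = exp(-0.5 * ((x_eval(i) - x) / h) .^ 2);  % unnormalized kernel weights
    g(i) = sum(w .* y) / sum(w);                 % weighted average of y
  endfor
endfunction

n = 100;
x = linspace(0, 1, n)';
y = sin(2 * pi * x) + 0.1 * randn(n, 1);
h = 0.1;

% Serial: fit at every sample point
g_serial = nw_fit(x, x, y, h);

% Blockwise: each half of the evaluation points could go to a different node,
% but every node needs the complete sample (x, y)
half = n / 2;
g_blocks = [nw_fit(x(1:half), x, y, h); nw_fit(x(half+1:n), x, y, h)];

printf("largest difference between serial and blockwise fits: %g\n", ...
       max(abs(g_serial - g_blocks)));
```

Unlike the ML and GMM cases, the data cannot be partitioned here: only the set of evaluation points is divided, so each node must hold a full copy of the sample.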
The example programs show that parallelization may be mostly hidden from end users. Users can
benefit from parallelization without having to write or understand parallel code. The speedups one
can obtain are highly dependent upon the specific problem at hand, as well as the size of the cluster,
the efficiency of the network, etc. Some examples of speedups are presented in Creel (2005). Figure
23.1 reproduces speedups for some econometric problems on a cluster of 12 desktop computers. The
speedup for k nodes is the time to finish the problem on a single node divided by the time to finish the
problem on k nodes. Note that you can get 10X speedups, as claimed in the introduction. It's pretty
obvious that much greater speedups could be obtained using a larger cluster, for the embarrassingly
parallel problems.
Figure 23.1: Speedups from parallelization (speedup, from 1 to 11, plotted against the number of
nodes, from 2 to 12, for the MONTECARLO, BOOTSTRAP, MLE, GMM and KERNEL problems)
Bibliography
[1] Bruche, M. (2003) "A note on embarassingly parallel computation using OpenMosix and Ox",
working paper, Financial Markets Group, London School of Economics.
[2] Creel, M. (2005) "User-friendly parallel computations with econometric examples",
Computational Economics, V. 26, pp. 107-128.
[3] Doornik, J.A., D.F. Hendry and N. Shephard (2002) "Computationally-intensive econometrics
using a distributed matrix-programming language", Philosophical Transactions of the Royal
Society of London, Series A, 360, 1245-1266.
[4] Fernández Baldomero, J. (2004) "LAM/MPI parallel computing under GNU Octave",
atc.ugr.es/javier-bin/mpitb.
[5] Racine, Jeff (2002) "Parallel distributed kernel estimation", Computational Statistics &
Data Analysis, 40, 293-302.
[6] Swann, C.A. (2002) "Maximum likelihood estimation using parallel computing: an introduction
to MPI", Computational Economics, 19, 145-178.
Chapter 24
Introduction to Octave
Why is Octave being used here, since it's not that well-known by econometricians? Well, because it is
a high quality environment that is easily extensible, uses well-tested and high performance numerical
libraries, it is licensed under the GNU GPL, so you can get it for free and modify it if you like, and it
runs on GNU/Linux, Mac OSX and Windows systems. It's also quite easy to learn.
24.1 Getting started
Get the ParallelKnoppix CD, as was described in Section 1.5. Then burn the image, and boot your
computer with it. This will give you this same PDF file, but with all of the example programs ready
to run. The editor is configured with a macro to execute the programs using Octave, which is of course
installed. From this point, I assume you are running the CD (or sitting in the computer room across
the hall from my office), or that you have configured your computer to be able to run the *.m files
mentioned below.
24.2 A short introduction
The objective of this introduction is to learn just the basics of Octave. There are other ways to use
Octave, which I encourage you to explore. These are just some rudiments. After this, you can look
at the example programs scattered throughout the document (and edit them, and run them) to learn
more about how Octave can be used to do econometrics. Students of mine: your problem sets will
include exercises that can be done by modifying the example programs in relatively minor ways. So
study the examples!
Octave can be used interactively, or it can be used to run programs that are written using a text
editor. We'll use this second method, preparing programs with NEdit, and calling Octave from within
the editor. The program first.m gets us started. To run this, open it up with NEdit (by finding the
correct file inside the /home/knoppix/Desktop/Econometrics folder and clicking on the icon) and
then type CTRL-ALT-o, or use the Octave item in the Shell menu (see Figure 24.1).
Note that the output is not formatted in a pleasing way. That's because printf() doesn't
automatically start a new line. Edit first.m so that the 8th line reads printf("hello world\n");
and re-run the program.
We need to know how to load and save data. The program second.m shows how. Once you have
run this, you will find the file x in the directory Econometrics/Examples/OctaveIntro/. You might
have a look at it with NEdit to see Octave's default format for saving data. Basically, if you have data
in an ASCII text file, named for example myfile.data, formed of numbers separated by spaces,
just use the command load myfile.data. After having done so, the matrix myfile (without
extension) will contain the data.
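For readers who want to try this without the CD, here is a minimal round trip through a plain ASCII file. It is a sketch and not the contents of second.m: the matrix, the temporary file name and the use of the function form of load (which returns the matrix instead of creating a variable named after the file) are choices made for this illustration.

```octave
x = [1 2 3; 4 5 6];              % some data to write out
fname = [tempname() ".data"];    % temporary file name, chosen only for this example

% Write x as plain numbers separated by spaces, with no header
save("-ascii", fname, "x");

% Read the numbers back; with an output argument, load returns the matrix
data = load(fname);

delete(fname);                   % clean up the temporary file
disp(data);
```

Typing load myfile.data at the prompt does the same reading step, except that the result is stored in a variable called myfile.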
Please have a look at CommonOperations.m for examples of how to do some basic things in Octave.
Now that we're done with the basics, have a look at the Octave programs that are included as examples.
Figure 24.1: Running an Octave program
If you are looking at the browsable PDF version of this document, then you should be able to click on
links to open them. If not, the example programs are available here and the support files needed to
run these are available here. Those pages will allow you to examine individual files, out of context. To
actually use these files (edit and run them), you should go to the home page of this document, since
you will probably want to download the pdf version together with all the support files and examples.
Or get the bootable CD.
There are some other resources for doing econometrics with Octave. You might like to check the
article Econometrics with Octave and the Econometrics Toolbox, which is for Matlab, but much of
which could be easily used with Octave.
24.3 If you're running a Linux installation...
Then to get the same behavior as found on the CD, you need to:
Get the collection of support programs and the examples, from the document home page.
Put them somewhere, and tell Octave how to find them, e.g., by putting a link to the MyOctaveFiles
directory in /usr/local/share/octave/site-m
Make sure nedit is installed and configured to run Octave and use syntax highlighting. Copy the
file /home/econometrics/.nedit from the CD to do this. Or, get the file NeditConfiguration
and save it in your $HOME directory with the name .nedit. Not to put too fine a point on
it, please note that there is a period in that name.
Associate *.m files with NEdit so that they open up in the editor when you click on them.
That should do it.
Chapter 25
Notation and Review
All vectors will be column vectors, unless they have a transpose symbol (or I forget to apply this
rule - your help catching typos and er0rors is much appreciated). For example, if $x_t$ is a $p \times 1$
vector, $x_t'$ is a $1 \times p$ vector. When I refer to a p-vector, I mean a column vector.
25.1 Notation for differentiation of vectors and matrices
[3, Chapter 1]
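Let $s(\cdot): \mathbb{R}^p \rightarrow \mathbb{R}$ be a real valued function of the p-vector $\theta$. Then
$\frac{\partial s(\theta)}{\partial \theta}$ is organized as a p-vector,
\[ \frac{\partial s(\theta)}{\partial \theta} = \left[ \begin{array}{c} \frac{\partial s(\theta)}{\partial \theta_1} \\ \frac{\partial s(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial s(\theta)}{\partial \theta_p} \end{array} \right] \]
Following this convention, $\frac{\partial s(\theta)}{\partial \theta'}$ is a $1 \times p$ vector, and
$\frac{\partial^2 s(\theta)}{\partial \theta \partial \theta'}$ is a $p \times p$ matrix. Also,
\[ \frac{\partial^2 s(\theta)}{\partial \theta \partial \theta'} = \frac{\partial}{\partial \theta} \left( \frac{\partial s(\theta)}{\partial \theta'} \right) = \frac{\partial}{\partial \theta'} \left( \frac{\partial s(\theta)}{\partial \theta} \right). \]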
Exercise 69. For a and x both p-vectors, show that $\frac{\partial a'x}{\partial x} = a$.
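Let $f(\theta): \mathbb{R}^p \rightarrow \mathbb{R}^n$ be a n-vector valued function of the p-vector $\theta$. Let
$f(\theta)'$ be the $1 \times n$ valued transpose of $f$. Then
\[ \left( \frac{\partial}{\partial \theta} f(\theta)' \right)' = \frac{\partial}{\partial \theta'} f(\theta). \]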
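Product rule: Let $f(\theta): \mathbb{R}^p \rightarrow \mathbb{R}^n$ and $h(\theta): \mathbb{R}^p \rightarrow \mathbb{R}^n$ be n-vector
valued functions of the p-vector $\theta$. Then
\[ \frac{\partial}{\partial \theta'} h(\theta)' f(\theta) = h' \left( \frac{\partial}{\partial \theta'} f \right) + f' \left( \frac{\partial}{\partial \theta'} h \right) \]
has dimension $1 \times p$. Applying the transposition rule we get
\[ \frac{\partial}{\partial \theta} h(\theta)' f(\theta) = \left( \frac{\partial}{\partial \theta} f' \right) h + \left( \frac{\partial}{\partial \theta} h' \right) f \]
which has dimension $p \times 1$.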
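The product rule is easy to check numerically. The following sketch takes two linear functions, f(theta) = A*theta and h(theta) = B*theta, for which the derivatives of f' and h' with respect to theta are simply A' and B', and compares the analytic gradient with a central finite difference. The matrices, the evaluation point and the helper num_grad are all made up for this illustration.

```octave
1;  % tells Octave this file is a script, not a function file

% Central-difference gradient of a scalar-valued function s at theta (p x 1).
% Small helper written for this example, returned as a p x 1 vector.
function g = num_grad(s, theta)
  p = numel(theta);
  g = zeros(p, 1);
  d = 1e-6;                      % step size
  for j = 1:p
    e = zeros(p, 1);
    e(j) = d;
    g(j) = (s(theta + e) - s(theta - e)) / (2 * d);
  endfor
endfunction

A = [1 2; 3 4; 5 6];             % f(theta) = A*theta, with n = 3 and p = 2
B = [0 1; 1 0; 2 -1];            % h(theta) = B*theta
theta = [0.3; -0.7];

s = @(t) (B * t)' * (A * t);     % the scalar h(theta)' f(theta)

g_numeric = num_grad(s, theta);
% Product rule: (d f'/d theta) h + (d h'/d theta) f, with d f'/d theta = A'
g_analytic = A' * (B * theta) + B' * (A * theta);

printf("largest discrepancy: %g\n", max(abs(g_numeric - g_analytic)));
```

Both gradients are p x 1, in line with the convention that a derivative with respect to a column vector is organized as a column.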
Exercise 70. For A a $p \times p$ matrix and x a $p \times 1$ vector, show that
$\frac{\partial^2 x'Ax}{\partial x \partial x'} = A + A'$.
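Chain rule: Let $f(\cdot): \mathbb{R}^p \rightarrow \mathbb{R}^n$ be a n-vector valued function of a p-vector argument,
and let $g(\cdot): \mathbb{R}^r \rightarrow \mathbb{R}^p$ be a p-vector valued function of an r-vector valued argument
$\rho$. Then
\[ \frac{\partial}{\partial \rho'} f[g(\rho)] = \left. \frac{\partial}{\partial \theta'} f(\theta) \right|_{\theta = g(\rho)} \frac{\partial}{\partial \rho'} g(\rho) \]
has dimension $n \times r$.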
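As a quick check of the dimensions (a worked example added for illustration, with made-up linear maps), take $f(\theta) = A\theta$ with $A$ an $n \times p$ matrix and $g(\rho) = B\rho$ with $B$ a $p \times r$ matrix. Then
\[ \frac{\partial}{\partial \rho'} f[g(\rho)] = \left. \frac{\partial}{\partial \theta'} A\theta \right|_{\theta = B\rho} \frac{\partial}{\partial \rho'} B\rho = AB, \]
which is $n \times r$, and which agrees with differentiating $f[g(\rho)] = AB\rho$ directly.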
Exercise 71. For x and $\beta$ both $p \times 1$ vectors, show that
$\frac{\partial \exp(x'\beta)}{\partial \beta} = \exp(x'\beta) x$.
25.2 Convergence modes
Readings: [1, Chapter 4];[4, Chapter 4].
We will consider several modes of convergence. The first three modes discussed are simply for
background. The stochastic modes are those which will be used later in the course.
Definition 72. A sequence is a mapping from the natural numbers $\{1, 2, ...\} = \{n\}_{n=1}^{\infty} = \{n\}$ to some
other set, so that the set is ordered according to the natural numbers associated with its elements.
Real-valued sequences:
Definition 73. [Convergence] A real-valued sequence of vectors $\{a_n\}$ converges to the vector $a$ if for
any $\varepsilon > 0$ there exists an integer $N_\varepsilon$ such that for all $n > N_\varepsilon$, $\| a_n - a \| < \varepsilon$. $a$ is the
limit of $a_n$, written $a_n \rightarrow a$.
Deterministic real-valued functions
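Consider a sequence of functions $\{f_n(\omega)\}$ where
\[ f_n : \Omega \rightarrow T \subseteq \mathbb{R}. \]
$\Omega$ may be an arbitrary set.
Definition 74. [Pointwise convergence] A sequence of functions $\{f_n(\omega)\}$ converges pointwise on
$\Omega$ to the function $f(\omega)$ if for all $\varepsilon > 0$ and $\omega \in \Omega$ there exists an integer $N_{\varepsilon\omega}$ such that
\[ |f_n(\omega) - f(\omega)| < \varepsilon, \quad \forall n > N_{\varepsilon\omega}. \]
It's important to note that $N_{\varepsilon\omega}$ depends upon $\omega$, so that convergence may be much more rapid for
certain $\omega$ than for others. Uniform convergence requires a similar rate of convergence throughout $\Omega$.
Definition 75. [Uniform convergence] A sequence of functions $\{f_n(\omega)\}$ converges uniformly on
$\Omega$ to the function $f(\omega)$ if for any $\varepsilon > 0$ there exists an integer $N$ such that
\[ \sup_{\omega \in \Omega} |f_n(\omega) - f(\omega)| < \varepsilon, \quad \forall n > N. \]
(insert a diagram here showing the envelope around $f(\omega)$ in which $f_n(\omega)$ must lie).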
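A standard textbook example (added here for illustration) separates the two definitions. Let $f_n(\omega) = \omega^n$ on $\Omega = [0, 1)$. For any fixed $\omega$, $\omega^n \rightarrow 0$, so the sequence converges pointwise to $f(\omega) = 0$. However, for $0 < \omega < 1$ and $0 < \varepsilon < 1$, $\omega^n < \varepsilon$ requires
\[ n > \frac{\ln \varepsilon}{\ln \omega}, \]
and this bound grows without limit as $\omega$ approaches 1, so no single $N$ works for all $\omega$. Equivalently,
\[ \sup_{\omega \in [0,1)} |\omega^n - 0| = 1 \quad \text{for every } n, \]
so the convergence is not uniform on $\Omega$.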
Stochastic sequences
In econometrics, we typically deal with stochastic sequences. Given a probability space $(\Omega, \mathcal{F}, P)$,
recall that a random variable maps the sample space to the real line, i.e., $X(\omega): \Omega \rightarrow \mathbb{R}$. A sequence
of random variables $\{X_n(\omega)\}$ is a collection of such mappings, i.e., each $X_n(\omega)$ is a random variable
with respect to the probability space $(\Omega, \mathcal{F}, P)$. For example, given the model $Y = X\beta_0 + \varepsilon$, the OLS
estimator $\hat{\beta}_n = (X'X)^{-1}X'Y$, where n is the sample size, can be used to form a sequence of random
vectors $\{\hat{\beta}_n\}$. A number of modes of convergence are in use when dealing with sequences of random
variables. Several such modes of convergence should already be familiar:
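Definition 76. [Convergence in probability] Let $X_n(\omega)$ be a sequence of random variables, and let
$X(\omega)$ be a random variable. Let $\mathcal{A}_n = \{\omega : |X_n(\omega) - X(\omega)| > \varepsilon\}$. Then $\{X_n(\omega)\}$ converges in
probability to $X(\omega)$ if
\[ \lim_{n \rightarrow \infty} P(\mathcal{A}_n) = 0, \quad \forall \varepsilon > 0. \]
Convergence in probability is written as $X_n \stackrel{p}{\rightarrow} X$, or plim $X_n = X$.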
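A familiar case (a worked illustration, assuming an iid sample with finite variance) is the sample mean. Let $X_n = \frac{1}{n} \sum_{i=1}^{n} x_i$, where the $x_i$ are iid with mean $\mu$ and variance $\sigma^2 < \infty$. Since $X_n$ has mean $\mu$ and variance $\sigma^2 / n$, Chebyshev's inequality gives, for any $\varepsilon > 0$,
\[ P\left( |X_n - \mu| > \varepsilon \right) \leq \frac{\sigma^2}{n \varepsilon^2} \rightarrow 0 \quad \text{as } n \rightarrow \infty, \]
so the sample mean converges in probability to $\mu$.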
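Definition 77. [Almost sure convergence] Let $X_n(\omega)$ be a sequence of random variables, and let $X(\omega)$
be a random variable. Let $\mathcal{A} = \{\omega : \lim_{n \rightarrow \infty} X_n(\omega) = X(\omega)\}$. Then $\{X_n(\omega)\}$ converges almost surely
to $X(\omega)$ if
\[ P(\mathcal{A}) = 1. \]
In other words, $X_n(\omega) \rightarrow X(\omega)$ (ordinary convergence of the two functions) except on a set $C = \Omega - \mathcal{A}$
such that $P(C) = 0$. Almost sure convergence is written as $X_n \stackrel{a.s.}{\rightarrow} X$, or $X_n \rightarrow X$, a.s. One can show
that
\[ X_n \stackrel{a.s.}{\rightarrow} X \Rightarrow X_n \stackrel{p}{\rightarrow} X. \]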
Definition 78. [Convergence in distribution] Let the r.v. $X_n$ have distribution function $F_n$ and the
r.v. $X$ have distribution function $F$. If $F_n \rightarrow F$ at every continuity point of $F$, then $X_n$ converges in
distribution to $X$.
Convergence in distribution is written as $X_n \stackrel{d}{\rightarrow} X$. It can be shown that convergence in probability
implies convergence in distribution.
Stochastic functions
Simple laws of large numbers (LLNs) allow us to directly conclude that $\hat{\beta}_n \stackrel{a.s.}{\rightarrow} \beta_0$ in the OLS example,
since
\[ \hat{\beta}_n = \beta_0 + \left( \frac{X'X}{n} \right)^{-1} \left( \frac{X'\varepsilon}{n} \right), \]
and $\frac{X'\varepsilon}{n} \stackrel{a.s.}{\rightarrow} 0$ by a SLLN. Note that this term is not a function of the parameter $\beta$. This easy proof is
a result of the linearity of the model, which allows us to express the estimator in a way that separates
parameters from random functions. In general, this is not possible. We often deal with the more
complicated situation where the stochastic sequence depends on parameters in a manner that is not
reducible to a simple sequence of random variables. In this case, we have a sequence of random
functions that depend on $\theta$: $\{X_n(\omega, \theta)\}$, where each $X_n(\omega, \theta)$ is a random variable with respect to a
probability space $(\Omega, \mathcal{F}, P)$ and the parameter $\theta$ belongs to a parameter space $\theta \in \Theta$.
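Definition 79. [Uniform almost sure convergence] $\{X_n(\omega, \theta)\}$ converges uniformly almost surely in
$\Theta$ to $X(\omega, \theta)$ if
\[ \lim_{n \rightarrow \infty} \sup_{\theta \in \Theta} |X_n(\omega, \theta) - X(\omega, \theta)| = 0, \quad \text{(a.s.)} \]
Implicit is the assumption that all $X_n(\omega, \theta)$ and $X(\omega, \theta)$ are random variables w.r.t. $(\Omega, \mathcal{F}, P)$
for all $\theta \in \Theta$. We'll indicate uniform almost sure convergence by $\stackrel{u.a.s.}{\rightarrow}$ and uniform convergence in
probability by $\stackrel{u.p.}{\rightarrow}$.
An equivalent definition, based on the fact that "almost sure" means "with probability one" is
\[ \Pr\left( \lim_{n \rightarrow \infty} \sup_{\theta \in \Theta} |X_n(\omega, \theta) - X(\omega, \theta)| = 0 \right) = 1 \]
This has a form similar to that of the definition of a.s. convergence - the essential difference is
the addition of the sup.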
25.3 Rates of convergence and asymptotic equality
It's often useful to have notation for the relative magnitudes of quantities. Quantities that are small
relative to others can often be ignored, which simplifies analysis.
Definition 80. [Little-o] Let $f(n)$ and $g(n)$ be two real-valued functions. The notation $f(n) = o(g(n))$
means $\lim_{n \rightarrow \infty} \frac{f(n)}{g(n)} = 0$.
Definition 81. [Big-O] Let $f(n)$ and $g(n)$ be two real-valued functions. The notation $f(n) = O(g(n))$
means there exists some $N$ such that for $n > N$, $\left| \frac{f(n)}{g(n)} \right| < K$, where $K$ is a finite constant.
This definition doesn't require that $\frac{f(n)}{g(n)}$ have a limit (it may fluctuate boundedly).
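A few simple cases (added as illustrations) show how the two definitions differ. With $f(n) = 3n^2 + n$ and $g(n) = n^2$,
\[ \frac{f(n)}{g(n)} = 3 + \frac{1}{n} < 4 \quad \text{for } n > 1, \]
so $f(n) = O(n^2)$, but the ratio tends to 3 rather than 0, so $f(n)$ is not $o(n^2)$. With $f(n) = n$ and $g(n) = n^2$, the ratio is $1/n \rightarrow 0$, so $n = o(n^2)$. Finally, $f(n) = 2 + \sin n$ satisfies $|f(n)/1| \leq 3$ for all $n$, so it is $O(1)$ even though the ratio has no limit.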
If $\{f_n\}$ and $\{g_n\}$ are sequences of random variables analogous definitions are
Definition 82. The notation $f(n) = o_p(g(n))$ means $\frac{f(n)}{g(n)} \stackrel{p}{\rightarrow} 0$.
Example 83. The least squares estimator
$\hat{\beta} = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta_0 + \varepsilon) = \beta_0 + (X'X)^{-1}X'\varepsilon$.
Since plim $\frac{(X'X)^{-1}X'\varepsilon}{1} = 0$, we can write $(X'X)^{-1}X'\varepsilon = o_p(1)$ and $\hat{\beta} = \beta_0 + o_p(1)$. Asymptotically, the
term $o_p(1)$ is negligible. This is just a way of indicating that the LS estimator is consistent.
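Definition 84. The notation $f(n) = O_p(g(n))$ means there exists some $N_\varepsilon$ such that for $\varepsilon > 0$ and all
$n > N_\varepsilon$,
\[ P\left( \left| \frac{f(n)}{g(n)} \right| < K_\varepsilon \right) > 1 - \varepsilon, \]
where $K_\varepsilon$ is a finite constant.
Example 85. If $X_n \sim N(0, 1)$ then $X_n = O_p(1)$, since, given $\varepsilon$, there is always some $K_\varepsilon$ such that
$P(|X_n| < K_\varepsilon) > 1 - \varepsilon$.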
Useful rules:
\[ O_p(n^p) O_p(n^q) = O_p(n^{p+q}) \]
\[ o_p(n^p) o_p(n^q) = o_p(n^{p+q}) \]
Example 86. Consider a random sample of iid r.v.'s with mean 0 and variance $\sigma^2$. The estimator of the
mean $\hat{\theta} = 1/n \sum_{i=1}^{n} x_i$ is asymptotically normally distributed, e.g., $n^{1/2} \hat{\theta} \stackrel{A}{\sim} N(0, \sigma^2)$. So $n^{1/2} \hat{\theta} = O_p(1)$,
so $\hat{\theta} = O_p(n^{-1/2})$. Before we had $\hat{\theta} = o_p(1)$, now we have the stronger result that relates the rate
of convergence to the sample size.
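Example 87. Now consider a random sample of iid r.v.'s with mean $\mu$ and variance $\sigma^2$. The estimator
of the mean $\hat{\theta} = 1/n \sum_{i=1}^{n} x_i$ is asymptotically normally distributed, e.g., $n^{1/2} (\hat{\theta} - \mu) \stackrel{A}{\sim} N(0, \sigma^2)$. So
$n^{1/2} (\hat{\theta} - \mu) = O_p(1)$, so $\hat{\theta} - \mu = O_p(n^{-1/2})$, so $\hat{\theta} = O_p(1)$.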
These two examples show that averages of centered (mean zero) quantities typically have plim 0,
while averages of uncentered quantities have finite nonzero plims. Note that the definition of $O_p$ does
not mean that $f(n)$ and $g(n)$ are of the same order. Asymptotic equality ensures that this is the case.
Definition 88. Two sequences of random variables $\{f_n\}$ and $\{g_n\}$ are asymptotically equal (written
$f_n \stackrel{a}{=} g_n$) if
\[ \mathrm{plim} \left( \frac{f(n)}{g(n)} \right) = 1 \]
Finally, analogous almost sure versions of $o_p$ and $O_p$ are defined in the obvious way.
For a and x both $p \times 1$ vectors, show that $D_x a'x = a$.
For A a $p \times p$ matrix and x a $p \times 1$ vector, show that $D_x^2 x'Ax = A + A'$.
For x and $\beta$ both $p \times 1$ vectors, show that $D_\beta \exp x'\beta = \exp(x'\beta) x$.
For x and $\beta$ both $p \times 1$ vectors, find the analytic expression for $D_\beta^2 \exp x'\beta$.
Write an Octave program that verifies each of the previous results by taking numeric derivatives.
For a hint, type help numgradient and help numhessian inside octave.
Chapter 26
Licenses
This document and the associated examples and materials are copyright Michael Creel, under the
terms of the GNU General Public License, ver. 2., or at your option, under the Creative Commons
Attribution-Share Alike License, Version 2.5. The licenses follow.
26.1 The GPL
GNU GENERAL PUBLIC LICENSE
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc.
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Library General Public License instead.) You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.
Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.
The precise terms and conditions for copying, distribution and
modification follow.
GNU GENERAL PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term "modification".) Each licensee is addressed as "you".
Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.
1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.
You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.
2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:
a) You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.
b) You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.
c) If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)
These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.
3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for noncommercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.
If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.
5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.
7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.
It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.
This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.
8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.
10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:
Gnomovision version 69, Copyright (C) year name of author
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type 'show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type 'show c' for details.
The hypothetical commands 'show w' and 'show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than 'show w' and 'show c'; they could even be
mouse-clicks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program
Gnomovision (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989
Ty Coon, President of Vice
This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Library General
Public License instead of this License.
26.2 Creative Commons
Legal Code
Attribution-ShareAlike 2.5
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
LEGAL SERVICES. DISTRIBUTION OF THIS LICENSE DOES NOT CREATE AN
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS INFORMATION
ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES REGARDING THE
INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
ITS USE.
License
THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS
CREATIVE COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED
BY COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER
THAN AS AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED.
BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND
AGREE TO BE BOUND BY THE TERMS OF THIS LICENSE. THE LICENSOR GRANTS YOU
THE RIGHTS CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH
TERMS AND CONDITIONS.
1. Definitions
1. "Collective Work" means a work, such as a periodical issue, anthology or encyclopedia, in which
the Work in its entirety in unmodified form, along with a number of other contributions, constituting
separate and independent works in themselves, are assembled into a collective whole. A work that
constitutes a Collective Work will not be considered a Derivative Work (as defined below) for the
purposes of this License.
2. "Derivative Work" means a work based upon the Work or upon the Work and other pre-existing
works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture
version, sound recording, art reproduction, abridgment, condensation, or any other form in which the
Work may be recast, transformed, or adapted, except that a work that constitutes a Collective Work
will not be considered a Derivative Work for the purpose of this License. For the avoidance of doubt,
where the Work is a musical composition or sound recording, the synchronization of the Work in
timed-relation with a moving image ("synching") will be considered a Derivative Work for the purpose
of this License.
3. "Licensor" means the individual or entity that offers the Work under the terms of this License.
4. "Original Author" means the individual or entity who created the Work.
5. "Work" means the copyrightable work of authorship offered under the terms of this License.
6. "You" means an individual or entity exercising rights under this License who has not previously
violated the terms of this License with respect to the Work, or who has received express permission
from the Licensor to exercise rights under this License despite a previous violation.
7. "License Elements" means the following high-level license attributes as selected by Licensor and
indicated in the title of this License: Attribution, ShareAlike.
2. Fair Use Rights. Nothing in this license is intended to reduce, limit, or restrict any rights
arising from fair use, first sale or other limitations on the exclusive rights of the copyright owner under
copyright law or other applicable laws.
3. License Grant. Subject to the terms and conditions of this License, Licensor hereby grants
You a worldwide, royalty-free, non-exclusive, perpetual (for the duration of the applicable copyright)
license to exercise the rights in the Work as stated below:
1. to reproduce the Work, to incorporate the Work into one or more Collective Works, and to
reproduce the Work as incorporated in the Collective Works;
2. to create and reproduce Derivative Works;
3. to distribute copies or phonorecords of, display publicly, perform publicly, and perform publicly
by means of a digital audio transmission the Work including as incorporated in Collective Works;
4. to distribute copies or phonorecords of, display publicly, perform publicly, and perform publicly
by means of a digital audio transmission Derivative Works.
5. For the avoidance of doubt, where the work is a musical composition:
1. Performance Royalties Under Blanket Licenses. Licensor waives the exclusive right to collect,
whether individually or via a performance rights society (e.g. ASCAP, BMI, SESAC), royalties for the
public performance or public digital performance (e.g. webcast) of the Work.
2. Mechanical Rights and Statutory Royalties. Licensor waives the exclusive right to collect,
whether individually or via a music rights society or designated agent (e.g. Harry Fox Agency),
royalties for any phonorecord You create from the Work ("cover version") and distribute, subject to
the compulsory license created by 17 USC Section 115 of the US Copyright Act (or the equivalent in
other jurisdictions).
6. Webcasting Rights and Statutory Royalties. For the avoidance of doubt, where the Work
is a sound recording, Licensor waives the exclusive right to collect, whether individually or via a
performance-rights society (e.g. SoundExchange), royalties for the public digital performance (e.g.
webcast) of the Work, subject to the compulsory license created by 17 USC Section 114 of the US
Copyright Act (or the equivalent in other jurisdictions).
The above rights may be exercised in all media and formats whether now known or hereafter
devised. The above rights include the right to make such modifications as are technically necessary to
exercise the rights in other media and formats. All rights not expressly granted by Licensor are hereby
reserved.
4. Restrictions. The license granted in Section 3 above is expressly made subject to and limited by
the following restrictions:
1. You may distribute, publicly display, publicly perform, or publicly digitally perform the Work
only under the terms of this License, and You must include a copy of, or the Uniform Resource Identifier
for, this License with every copy or phonorecord of the Work You distribute, publicly display, publicly
perform, or publicly digitally perform. You may not offer or impose any terms on the Work that
alter or restrict the terms of this License or the recipients' exercise of the rights granted hereunder.
You may not sublicense the Work. You must keep intact all notices that refer to this License and to
the disclaimer of warranties. You may not distribute, publicly display, publicly perform, or publicly
digitally perform the Work with any technological measures that control access or use of the Work in
a manner inconsistent with the terms of this License Agreement. The above applies to the Work as
incorporated in a Collective Work, but this does not require the Collective Work apart from the Work
itself to be made subject to the terms of this License. If You create a Collective Work, upon notice
from any Licensor You must, to the extent practicable, remove from the Collective Work any credit as
required by clause 4(c), as requested. If You create a Derivative Work, upon notice from any Licensor
You must, to the extent practicable, remove from the Derivative Work any credit as required by clause
4(c), as requested.
2. You may distribute, publicly display, publicly perform, or publicly digitally perform a Derivative
Work only under the terms of this License, a later version of this License with the same License
Elements as this License, or a Creative Commons iCommons license that contains the same License
Elements as this License (e.g. Attribution-ShareAlike 2.5 Japan). You must include a copy of, or the
Uniform Resource Identifier for, this License or other license specified in the previous sentence with
every copy or phonorecord of each Derivative Work You distribute, publicly display, publicly perform,
or publicly digitally perform. You may not offer or impose any terms on the Derivative Works that
alter or restrict the terms of this License or the recipients' exercise of the rights granted hereunder,
and You must keep intact all notices that refer to this License and to the disclaimer of warranties.
You may not distribute, publicly display, publicly perform, or publicly digitally perform the Derivative
Work with any technological measures that control access or use of the Work in a manner inconsistent
with the terms of this License Agreement. The above applies to the Derivative Work as incorporated
in a Collective Work, but this does not require the Collective Work apart from the Derivative Work
itself to be made subject to the terms of this License.
3. If you distribute, publicly display, publicly perform, or publicly digitally perform the Work or
any Derivative Works or Collective Works, You must keep intact all copyright notices for the Work
and provide, reasonable to the medium or means You are utilizing: (i) the name of the Original
Author (or pseudonym, if applicable) if supplied, and/or (ii) if the Original Author and/or Licensor
designate another party or parties (e.g. a sponsor institute, publishing entity, journal) for attribution
in Licensor's copyright notice, terms of service or by other reasonable means, the name of such party
or parties; the title of the Work if supplied; to the extent reasonably practicable, the Uniform Resource
Identifier, if any, that Licensor specifies to be associated with the Work, unless such URI does not refer
to the copyright notice or licensing information for the Work; and in the case of a Derivative Work,
a credit identifying the use of the Work in the Derivative Work (e.g., "French translation of the Work
by Original Author," or "Screenplay based on original Work by Original Author"). Such credit may be
implemented in any reasonable manner; provided, however, that in the case of a Derivative Work or
Collective Work, at a minimum such credit will appear where any other comparable authorship credit
appears and in a manner at least as prominent as such other comparable authorship credit.
5. Representations, Warranties and Disclaimer
UNLESS OTHERWISE AGREED TO BY THE PARTIES IN WRITING, LICENSOR OFFERS
THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND
CONCERNING THE MATERIALS, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE,
INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTIBILITY,
FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF
LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS,
WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE
EXCLUSION OF IMPLIED WARRANTIES, SO SUCH EXCLUSION MAY NOT APPLY TO YOU.
6. Limitation on Liability. EXCEPT TO THE EXTENT REQUIRED BY APPLICABLE LAW,
IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR ANY
SPECIAL, INCIDENTAL, CONSEQUENTIAL, PUNITIVE OR EXEMPLARY DAMAGES
ARISING OUT OF THIS LICENSE OR THE USE OF THE WORK, EVEN IF LICENSOR HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
7. Termination
1. This License and the rights granted hereunder will terminate automatically upon any breach
by You of the terms of this License. Individuals or entities who have received Derivative Works or
Collective Works from You under this License, however, will not have their licenses terminated provided
such individuals or entities remain in full compliance with those licenses. Sections 1, 2, 5, 6, 7, and 8
will survive any termination of this License.
2. Subject to the above terms and conditions, the license granted here is perpetual (for the duration
of the applicable copyright in the Work). Notwithstanding the above, Licensor reserves the right to
release the Work under different license terms or to stop distributing the Work at any time; provided,
however that any such election will not serve to withdraw this License (or any other license that has
been, or is required to be, granted under the terms of this License), and this License will continue in
full force and effect unless terminated as stated above.
8. Miscellaneous
1. Each time You distribute or publicly digitally perform the Work or a Collective Work, the
Licensor offers to the recipient a license to the Work on the same terms and conditions as the license
granted to You under this License.
2. Each time You distribute or publicly digitally perform a Derivative Work, Licensor offers to the
recipient a license to the original Work on the same terms and conditions as the license granted to
You under this License.
3. If any provision of this License is invalid or unenforceable under applicable law, it shall not affect
the validity or enforceability of the remainder of the terms of this License, and without further action
by the parties to this agreement, such provision shall be reformed to the minimum extent necessary
to make such provision valid and enforceable.
4. No term or provision of this License shall be deemed waived and no breach consented to unless
such waiver or consent shall be in writing and signed by the party to be charged with such waiver or
consent.
5. This License constitutes the entire agreement between the parties with respect to the Work
licensed here. There are no understandings, agreements or representations with respect to the Work
not specified here. Licensor shall not be bound by any additional provisions that may appear in any
communication from You. This License may not be modified without the mutual written agreement
of the Licensor and You.
Creative Commons is not a party to this License, and makes no warranty whatsoever in connection
with the Work. Creative Commons will not be liable to You or any party on any legal theory for any
damages whatsoever, including without limitation any general, special, incidental or consequential
damages arising in connection to this license. Notwithstanding the foregoing two (2) sentences, if
Creative Commons has expressly identified itself as the Licensor hereunder, it shall have all rights and
obligations of Licensor.
Except for the limited purpose of indicating to the public that the Work is licensed under the
CCPL, neither party will use the trademark "Creative Commons" or any related trademark or logo
of Creative Commons without the prior written consent of Creative Commons. Any permitted use
will be in compliance with Creative Commons' then-current trademark usage guidelines, as may be
published on its website or otherwise made available upon request from time to time.
Creative Commons may be contacted at http://creativecommons.org/.
Chapter 27
The attic
This holds material that is not really ready to be incorporated into the main body, but that I don't
want to lose. Basically, ignore it, unless you'd like to help get it ready for inclusion.
Invertibility of AR process
To begin with, define the lag operator $L$:
$$L y_t = y_{t-1}$$
The lag operator is defined to behave just as an algebraic quantity, e.g.,
$$L^2 y_t = L(L y_t) = L y_{t-1} = y_{t-2}$$
or
$$(1 - L)(1 + L) y_t = y_t - L y_t + L y_t - L^2 y_t = y_t - y_{t-2}$$
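A mean-zero AR(p) process can be written as
$$y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p} = \varepsilon_t$$
or
$$y_t (1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p) = \varepsilon_t$$
Factor this polynomial as
$$1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p = (1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L)$$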
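For the moment, just assume that the $\lambda_i$ are coefficients to be determined. Since $L$ is defined to
operate as an algebraic quantity, determining the $\lambda_i$ is the same as finding the $\lambda_i$ such
that the following two expressions are the same for all $z$:
$$1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = (1 - \lambda_1 z)(1 - \lambda_2 z) \cdots (1 - \lambda_p z)$$
Multiply both sides by $z^{-p}$:
$$z^{-p} - \phi_1 z^{1-p} - \phi_2 z^{2-p} - \cdots - \phi_{p-1} z^{-1} - \phi_p = (z^{-1} - \lambda_1)(z^{-1} - \lambda_2) \cdots (z^{-1} - \lambda_p)$$
and now define $\lambda = z^{-1}$ so we get
$$\lambda^p - \phi_1 \lambda^{p-1} - \phi_2 \lambda^{p-2} - \cdots - \phi_{p-1} \lambda - \phi_p = (\lambda - \lambda_1)(\lambda - \lambda_2) \cdots (\lambda - \lambda_p)$$
The LHS is precisely the determinantal polynomial that gives the eigenvalues of $F$. Therefore, the
$\lambda_i$ that are the coefficients of the factorization are simply the eigenvalues of the matrix $F$.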
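The claim that the polynomial in $\lambda$ is the determinantal polynomial of $F$ can be checked numerically. The sketch below assumes that $F$ is the usual companion matrix of the AR(p) process (AR coefficients in the first row, an identity block below it), since its definition lies outside this section; the coefficient values and the function names are illustrative only.

```python
def companion_matrix(phi):
    """Companion matrix F (assumed form): AR coefficients in row 1, shifted identity below."""
    p = len(phi)
    F = [[0.0] * p for _ in range(p)]
    F[0] = list(phi)
    for i in range(1, p):
        F[i][i - 1] = 1.0
    return F

def det(M):
    """Determinant by cofactor expansion along the first row (adequate for small p)."""
    if len(M) == 1:
        return M[0][0]
    total = 0.0
    for j, a in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * a * det(minor)
    return total

def char_poly_from_F(F, lam):
    """Evaluate det(lam*I - F)."""
    p = len(F)
    M = [[(lam if i == j else 0.0) - F[i][j] for j in range(p)] for i in range(p)]
    return det(M)

def char_poly_from_phi(phi, lam):
    """Evaluate lam^p - phi_1 lam^(p-1) - ... - phi_p."""
    p = len(phi)
    return lam ** p - sum(phi[i] * lam ** (p - 1 - i) for i in range(p))

phi = [0.5, 0.3, -0.1]            # illustrative AR(3) coefficients
F = companion_matrix(phi)
grid = [-1.5, -0.4, 0.0, 0.7, 2.0]
gaps = [abs(char_poly_from_F(F, x) - char_poly_from_phi(phi, x)) for x in grid]
print(max(gaps))                  # should be zero up to rounding error
```

Since the two polynomials agree, they have the same roots, which is why the $\lambda_i$ of the factorization can be read off as eigenvalues of $F$.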
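Now consider a different stationary process
$$(1 - \phi L) y_t = \varepsilon_t$$
Stationarity, as above, implies that $|\phi| < 1$.
Multiply both sides by $1 + \phi L + \phi^2 L^2 + \dots + \phi^j L^j$ to get
$$\left(1 + \phi L + \phi^2 L^2 + \dots + \phi^j L^j\right)(1 - \phi L) y_t = \left(1 + \phi L + \phi^2 L^2 + \dots + \phi^j L^j\right) \varepsilon_t$$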
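or, multiplying the polynomials on the LHS, we get
$$\left(1 + \phi L + \phi^2 L^2 + \dots + \phi^j L^j - \phi L - \phi^2 L^2 - \dots - \phi^j L^j - \phi^{j+1} L^{j+1}\right) y_t = \left(1 + \phi L + \phi^2 L^2 + \dots + \phi^j L^j\right) \varepsilon_t$$
and with cancellations we have
$$\left(1 - \phi^{j+1} L^{j+1}\right) y_t = \left(1 + \phi L + \phi^2 L^2 + \dots + \phi^j L^j\right) \varepsilon_t$$
so
$$y_t = \phi^{j+1} L^{j+1} y_t + \left(1 + \phi L + \phi^2 L^2 + \dots + \phi^j L^j\right) \varepsilon_t$$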
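Now as $j \to \infty$, $\phi^{j+1} L^{j+1} y_t \to 0$, since $|\phi| < 1$, so
$$y_t \cong \left(1 + \phi L + \phi^2 L^2 + \dots + \phi^j L^j\right) \varepsilon_t$$
and the approximation becomes better and better as $j$ increases. However, we started with
$$(1 - \phi L) y_t = \varepsilon_t$$
Substituting this into the above equation we have
$$y_t \cong \left(1 + \phi L + \phi^2 L^2 + \dots + \phi^j L^j\right)(1 - \phi L) y_t$$
so
$$\left(1 + \phi L + \phi^2 L^2 + \dots + \phi^j L^j\right)(1 - \phi L) \cong 1$$
and the approximation becomes arbitrarily good as $j$ increases arbitrarily. Therefore, for $|\phi| < 1$,
define
$$(1 - \phi L)^{-1} = \sum_{j=0}^{\infty} \phi^j L^j$$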
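A small simulation can illustrate how quickly the truncated sum approaches $y_t$. This is only a sketch: the value $\phi = 0.8$, the sample size, the zero starting value and the function name `truncated_ma` are all assumptions made for the illustration.

```python
import random

random.seed(0)
phi, n = 0.8, 500
eps = [random.gauss(0.0, 1.0) for _ in range(n)]

# Generate y_t = phi * y_{t-1} + eps_t, with the pre-sample value set to zero
y = [eps[0]]
for t in range(1, n):
    y.append(phi * y[t - 1] + eps[t])

def truncated_ma(eps, phi, t, j):
    """Order-j approximation: sum_{k=0}^{j} phi^k * eps_{t-k}."""
    return sum(phi ** k * eps[t - k] for k in range(j + 1))

t = n - 1
# Absolute approximation error for several truncation orders j
errors = {j: abs(y[t] - truncated_ma(eps, phi, t, j)) for j in (1, 5, 20, 50)}
# The omitted term is exactly phi^(j+1) * y_{t-j-1}; here for j = 20
remainder_20 = phi ** 21 * y[t - 21]
print(errors)
```

The error for each $j$ equals $|\phi^{j+1} y_{t-j-1}|$, so it shrinks geometrically in $j$ but need not fall monotonically, because $y_{t-j-1}$ itself varies.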
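Recall that our mean zero AR(p) process
$$y_t (1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p) = \varepsilon_t$$
can be written using the factorization
$$y_t (1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L) = \varepsilon_t$$
where the $\lambda_i$ are the eigenvalues of $F$, and given stationarity, all the $|\lambda_i| < 1$. Therefore, we can invert
each first order polynomial on the LHS to get
$$y_t = \left(\sum_{j=0}^{\infty} \lambda_1^j L^j\right) \left(\sum_{j=0}^{\infty} \lambda_2^j L^j\right) \cdots \left(\sum_{j=0}^{\infty} \lambda_p^j L^j\right) \varepsilon_t$$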
The RHS is a product of infinite-order polynomials in $L$, which can be represented as
$$y_t = (1 + \psi_1 L + \psi_2 L^2 + \cdots) \varepsilon_t$$
where the $\psi_i$ are real-valued and absolutely summable.
The $\psi_i$ are formed of products of powers of the $\lambda_i$, which are in turn functions of the $\phi_i$.
The $\psi_i$ are real-valued because any complex-valued $\lambda_i$ always occur in conjugate pairs. This
means that if $a + bi$ is an eigenvalue of $F$, then so is $a - bi$. In multiplication
$$(a + bi)(a - bi) = a^2 - abi + abi - b^2 i^2 = a^2 + b^2$$
which is real-valued.
This shows that an AR(p) process is representable as an infinite-order MA process.
Recall before that by recursive substitution, an AR(p) process can be written as
$$Y_{t+j} = C + FC + \cdots + F^j C + F^{j+1} Y_{t-1} + F^j E_t + F^{j-1} E_{t+1} + \cdots + F E_{t+j-1} + E_{t+j}$$
If the process is mean zero, then everything with a $C$ drops out. Take this and lag it by $j$ periods
to get
$$Y_t = F^{j+1} Y_{t-j-1} + F^j E_{t-j} + F^{j-1} E_{t-j+1} + \cdots + F E_{t-1} + E_t$$
As $j \to \infty$, the lagged $Y$ on the RHS drops out. The $E_{t-s}$ are vectors of zeros except for their
first element, so we see that the first equation here, in the limit, is just
$$y_t = \sum_{j=0}^{\infty} \left(F^j\right)_{1,1} \varepsilon_{t-j}$$
which makes explicit the relationship between the $\psi_i$ and the $\phi_i$ (and the $\lambda_i$ as well, recalling the
previous factorization of $F^j$).
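The link between the MA($\infty$) weights and the powers of $F$ can be illustrated numerically. The sketch below again assumes the companion form of $F$ (AR coefficients in the first row, identity block below), and compares $(F^j)_{1,1}$ with the weights obtained from the standard recursion $\psi_0 = 1$, $\psi_j = \sum_{i=1}^{\min(j,p)} \phi_i \psi_{j-i}$. The recursion is not taken from the text; it is a well-known equivalent way to expand the inverse lag polynomial.

```python
def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

phi = [0.5, 0.3, -0.1]            # illustrative AR(3) coefficients
p = len(phi)
# Assumed companion form: first row phi, then a shifted identity
F = [list(phi)] + [[1.0 if j == i else 0.0 for j in range(p)] for i in range(p - 1)]

J = 30
psi_from_F = []
Fj = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]   # F^0 = I
for _ in range(J):
    psi_from_F.append(Fj[0][0])   # (F^j)_{1,1}
    Fj = matmul(Fj, F)

# Same weights from the recursion psi_j = sum_i phi_i * psi_{j-i}
psi_rec = [1.0]
for j in range(1, J):
    psi_rec.append(sum(phi[i - 1] * psi_rec[j - i] for i in range(1, min(j, p) + 1)))

print(max(abs(a - b) for a, b in zip(psi_from_F, psi_rec)))
```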
Invertibility of MA(q) process
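An MA(q) can be written as
$$y_t - \mu = (1 + \theta_1 L + \dots + \theta_q L^q) \varepsilon_t$$
As before, the polynomial on the RHS can be factored as
$$(1 + \theta_1 L + \dots + \theta_q L^q) = (1 - \eta_1 L)(1 - \eta_2 L) \cdots (1 - \eta_q L)$$
and each of the $(1 - \eta_i L)$ can be inverted as long as each of the $|\eta_i| < 1$. If this is the case, then we
can write
$$(1 + \theta_1 L + \dots + \theta_q L^q)^{-1} (y_t - \mu) = \varepsilon_t$$
where $(1 + \theta_1 L + \dots + \theta_q L^q)^{-1}$ will be an infinite-order polynomial in $L$, so we get
$$\sum_{j=0}^{\infty} -\delta_j L^j (y_t - \mu) = \varepsilon_t$$
with $\delta_0 = -1$, or
$$(y_t - \mu) - \delta_1 (y_{t-1} - \mu) - \delta_2 (y_{t-2} - \mu) - \dots = \varepsilon_t$$
or
$$y_t = c + \delta_1 y_{t-1} + \delta_2 y_{t-2} + \dots + \varepsilon_t$$
where
$$c = \mu - \delta_1 \mu - \delta_2 \mu - \dots$$
So we see that an MA(q) has an infinite AR representation, as long as the $|\eta_i| < 1$, $i = 1, 2, \dots, q$.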
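The inversion can be sketched numerically for an invertible MA(2). The code below computes the coefficients of the inverse lag polynomial by matching powers of $L$ in $\theta(L)\pi(L) = 1$, then uses them to recover $\varepsilon_t$ from current and past $y$. The name $\pi_j$ is introduced here for illustration (in terms of the AR($\infty$) coefficients it corresponds to $\pi_0 = 1$ and $\pi_j = -\delta_j$); the parameter values and the zero pre-sample innovations are assumptions.

```python
import random

def inverse_ma_coeffs(theta, J):
    """First J coefficients pi_j of (1 + theta_1 L + ... + theta_q L^q)^(-1)."""
    q = len(theta)
    pi = [1.0]
    for j in range(1, J):
        # Coefficient on L^j in theta(L) * pi(L) must be zero
        pi.append(-sum(theta[i - 1] * pi[j - i] for i in range(1, min(j, q) + 1)))
    return pi

random.seed(1)
theta = [0.6, 0.2]                # roots of 1 + 0.6 z + 0.2 z^2 lie outside the unit circle
mu, n, J = 2.0, 400, 60
eps = [random.gauss(0.0, 1.0) for _ in range(n)]

# Simulate y_t - mu = eps_t + theta_1 eps_{t-1} + theta_2 eps_{t-2} (pre-sample eps = 0)
y = []
for t in range(n):
    value = mu + eps[t]
    for i, th in enumerate(theta, start=1):
        if t - i >= 0:
            value += th * eps[t - i]
    y.append(value)

pi = inverse_ma_coeffs(theta, J)
t = n - 1
# Truncated AR representation: eps_t is approximated by sum_j pi_j * (y_{t-j} - mu)
eps_hat = sum(pi[j] * (y[t - j] - mu) for j in range(J))
print(eps_hat, eps[t])
```

Because the $\pi_j$ decay geometrically when the process is invertible, truncating at a moderate number of lags already reproduces $\varepsilon_t$ to high accuracy.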
It turns out that one can always manipulate the parameters of an MA(q) process to find an
invertible representation. For example, the two MA(1) processes
$$y_t - \mu = (1 - \theta L) \varepsilon_t$$
and
$$y_t^* - \mu = (1 - \theta^{-1} L) \varepsilon_t^*$$
have exactly the same moments if
$$\sigma^2_{\varepsilon^*} = \sigma^2_{\varepsilon} \theta^2$$
For example, we've seen that
$$\gamma_0 = \sigma^2 (1 + \theta^2).$$
Given the above relationships amongst the parameters,
$$\gamma_0^* = \sigma^2_{\varepsilon} \theta^2 (1 + \theta^{-2}) = \sigma^2 (1 + \theta^2)$$
so the variances are the same. It turns out that all the autocovariances will be the same, as is
easily checked. This means that the two MA processes are observationally equivalent. As before,
it's impossible to distinguish between observationally equivalent processes on the basis of data.
For a given MA(q) process, it's always possible to manipulate the parameters to find an invertible
representation (which is unique).
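The observational equivalence of the two MA(1) parameterizations is easy to check numerically. The sketch below uses the standard MA(1) autocovariances for $y_t - \mu = (1 - \theta L)\varepsilon_t$, namely $\gamma_0 = \sigma^2(1 + \theta^2)$ and $\gamma_1 = -\theta\sigma^2$ (all higher autocovariances are zero); the numerical values and the function name are illustrative assumptions.

```python
def ma1_autocov(theta, sigma2):
    """(gamma_0, gamma_1) for y_t - mu = (1 - theta L) eps_t with Var(eps_t) = sigma2."""
    gamma0 = sigma2 * (1.0 + theta ** 2)
    gamma1 = -theta * sigma2
    return gamma0, gamma1

theta, sigma2 = 2.5, 1.0             # non-invertible parameterization: |theta| > 1
theta_star = 1.0 / theta             # invertible counterpart
sigma2_star = sigma2 * theta ** 2    # innovation variance that matches the moments

g = ma1_autocov(theta, sigma2)
g_star = ma1_autocov(theta_star, sigma2_star)
print(g, g_star)                     # the two pairs should coincide
```

Only the second parameterization has $|\theta^{-1}| < 1$, so it is the one that can be inverted into an AR($\infty$) in past values of $y$.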
It's important to find an invertible representation, since it's the only representation that allows
one to represent $\varepsilon_t$ as a function of past $y$'s. The other representations express $\varepsilon_t$ as a function
of future $y$'s.
Why is invertibility important? The most important reason is that it provides a justification
for the use of parsimonious models. Since an AR(1) process has an MA($\infty$) representation,
one can reverse the argument and note that at least some MA($\infty$) processes have an AR(1)
representation. Likewise, some AR($\infty$) processes have an MA(1) representation. At the time of
estimation, it's a lot easier to estimate the single AR(1) or MA(1) coefficient rather than the
infinite number of coefficients associated with the MA($\infty$) or AR($\infty$) representation.
This is the reason that ARMA models are popular. Combining low-order AR and MA models
can usually offer a satisfactory representation of univariate time series data using a reasonable
number of parameters.
Stationarity and invertibility of ARMA models is similar to what we've seen - we won't go into
the details. Likewise, calculating moments is similar.
Exercise 89. Calculate the autocovariances of an ARMA(1,1) model: $(1 + \phi L) y_t = c + (1 + \theta L) \varepsilon_t$
Optimal instruments for GMM
PLEASE IGNORE THE REST OF THIS SECTION: there is a flaw in the argument that needs
correction. In particular, it may be the case that $E(Z_t \varepsilon_t) \neq 0$ if instruments are chosen in the way
suggested here.
An interesting question that arises is how one should choose the instrumental variables $Z(w_t)$ to
achieve maximum efficiency.
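Note that with this choice of moment conditions, we have that $D_n \equiv \frac{\partial}{\partial \theta} m^{\prime}(\theta)$ (a $K \times g$ matrix) is
$$D_n(\theta) = \frac{\partial}{\partial \theta} \frac{1}{n} \left(Z_n^{\prime} h_n(\theta)\right)^{\prime} = \frac{1}{n} \left(\frac{\partial}{\partial \theta} h_n^{\prime}(\theta)\right) Z_n$$
which we can define to be
$$D_n(\theta) = \frac{1}{n} H_n Z_n,$$
where $H_n$ is a $K \times n$ matrix that has the derivatives of the individual moment conditions as its columns.
Likewise, define the var-cov. of the moment conditions
$$\Omega_n = E\left[n\, m_n(\theta^0) m_n(\theta^0)^{\prime}\right] = E\left[\frac{1}{n} Z_n^{\prime} h_n(\theta^0) h_n(\theta^0)^{\prime} Z_n\right] = Z_n^{\prime} E\left(\frac{1}{n} h_n(\theta^0) h_n(\theta^0)^{\prime}\right) Z_n \equiv Z_n^{\prime} \frac{\Phi_n}{n} Z_n$$
where we have defined $\Phi_n = V(h_n(\theta^0))$. Note that the dimension of this matrix is growing with the
sample size, so it is not consistently estimable without additional assumptions.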
The asymptotic normality theorem above says that the GMM estimator using the optimal weighting
matrix is distributed as
$$\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{d}{\to} N(0, V_{\infty})$$
where
$$V_{\infty} = \lim_{n \to \infty} \left[ \left(\frac{H_n Z_n}{n}\right) \left(\frac{Z_n^{\prime} \Phi_n Z_n}{n}\right)^{-1} \left(\frac{Z_n^{\prime} H_n^{\prime}}{n}\right) \right]^{-1}. \qquad (27.1)$$
Using an argument similar to that used to prove that $\Omega_{\infty}^{-1}$ is the efficient weighting matrix, we can
show that putting
$$Z_n = \Phi_n^{-1} H_n^{\prime}$$
causes the above var-cov matrix to simplify to
$$V_{\infty} = \lim_{n \to \infty} \left(\frac{H_n \Phi_n^{-1} H_n^{\prime}}{n}\right)^{-1}. \qquad (27.2)$$
and furthermore, this matrix is smaller than the limiting var-cov for any other choice of instrumental
variables. (To prove this, examine the difference of the inverses of the var-cov matrices with the
optimal instruments and with non-optimal instruments. As above, you can show that the difference is
positive semi-definite).
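The step from equation 27.1 to equation 27.2 is a direct substitution, which can be written out as follows. This is a sketch of the algebra only: it assumes $\Phi_n$ is symmetric and nonsingular, and $A_n$ is shorthand introduced here rather than notation from the text. It does not address the flaw in the argument flagged at the start of this section.

```latex
% Substitute Z_n = \Phi_n^{-1} H_n' (so Z_n' = H_n \Phi_n^{-1}) into (27.1).
\begin{align*}
A_n &\equiv \frac{H_n \Phi_n^{-1} H_n^{\prime}}{n} \\
\frac{H_n Z_n}{n} &= \frac{H_n \Phi_n^{-1} H_n^{\prime}}{n} = A_n \\
\frac{Z_n^{\prime} \Phi_n Z_n}{n} &= \frac{H_n \Phi_n^{-1} \Phi_n \Phi_n^{-1} H_n^{\prime}}{n} = A_n \\
\frac{Z_n^{\prime} H_n^{\prime}}{n} &= \frac{H_n \Phi_n^{-1} H_n^{\prime}}{n} = A_n \\
V_{\infty} &= \lim_{n \to \infty} \left[ A_n A_n^{-1} A_n \right]^{-1}
            = \lim_{n \to \infty} A_n^{-1}
            = \lim_{n \to \infty} \left( \frac{H_n \Phi_n^{-1} H_n^{\prime}}{n} \right)^{-1}
\end{align*}
```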
Note that both $H_n$, which we should write more properly as $H_n(\theta^0)$, since it depends on $\theta^0$, and
$\Phi_n$ must be consistently estimated to apply this.
Usually, estimation of $H_n$ is straightforward - one just uses
$$\hat{H} = \frac{\partial}{\partial \theta} h_n^{\prime}\left(\tilde{\theta}\right),$$
where $\tilde{\theta}$ is some initial consistent estimator based on non-optimal instruments.
Estimation of $\Phi_n$ may not be possible. It is an $n \times n$ matrix, so it has more unique elements than
$n$, the sample size, so without restrictions on the parameters it can't be estimated consistently.
Basically, you need to provide a parametric specification of the covariances of the $h_t(\theta)$ in order
to be able to use optimal instruments. A solution is to approximate this matrix parametrically
to define the instruments. Note that the simplified var-cov matrix in equation 27.2 will not
apply if approximately optimal instruments are used - it will be necessary to use an estimator
based upon equation 27.1, where the term $n^{-1} Z_n^{\prime} \Phi_n Z_n$ must be estimated consistently and separately, for
example by the Newey-West procedure.
27.1 Hurdle models
Returning to the Poisson model, let's look at actual and fitted count probabilities. Actual relative
frequencies are $f(y = j) = \sum_i 1(y_i = j)/n$ and fitted frequencies are
$\hat{f}(y = j) = \sum_{i=1}^{n} f_Y(j | x_i, \hat{\theta})/n$.

Table 27.1: Actual and Poisson fitted frequencies

            OBDV                ERV
Count   Actual   Fitted     Actual   Fitted
0       0.32     0.06       0.86     0.83
1       0.18     0.15       0.10     0.14
2       0.11     0.19       0.02     0.02
3       0.10     0.18       0.004    0.002
4       0.052    0.15       0.002    0.0002
5       0.032    0.10       0        2.4e-5

We see that for the OBDV measure, there are many more actual zeros than predicted. For ERV, there
are somewhat more actual zeros than fitted, but the difference is not too important.
Why might OBDV not fit the zeros well? Suppose that people who are sick decide whether to contact the
doctor for a first visit, and the doctor then decides whether or not follow-up visits are needed.
This is a principal/agent type situation, where the total number of visits depends upon the decisions
of both the patient and the doctor. Since different parameters may govern the two decision-makers'
choices, we might expect that different parameters govern the probability of zeros versus the other
counts. Let $\lambda_p$ be the parameters of the patient's demand for visits, and let $\lambda_d$ be the parameter of the
doctor's demand for visits. The patient will initiate visits according to a discrete choice model, for
example, a logit model:
$$\Pr(Y = 0) = f_Y(0, \lambda_p) = 1 - 1/[1 + \exp(-\lambda_p)]$$
$$\Pr(Y > 0) = 1/[1 + \exp(-\lambda_p)].$$
The above probabilities are used to estimate the binary 0/1 hurdle process. Then, for the observations
where visits are positive, a truncated Poisson density is estimated. This density is
$$f_Y(y, \lambda_d | y > 0) = \frac{f_Y(y, \lambda_d)}{\Pr(y > 0)} = \frac{f_Y(y, \lambda_d)}{1 - \exp(-\lambda_d)}$$
since according to the Poisson model with the doctor's parameters,
$$\Pr(y = 0) = \frac{\exp(-\lambda_d) \lambda_d^0}{0!} = \exp(-\lambda_d).$$
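A minimal Python sketch of the zero-truncated Poisson density, written in logs as it would enter a log-likelihood. The function names and the value of $\lambda_d$ are illustrative assumptions.

```python
import math

def poisson_logpmf(y, lam):
    """log of the ordinary Poisson probability of y with mean lam."""
    return -lam + y * math.log(lam) - math.lgamma(y + 1)

def truncated_poisson_logpmf(y, lam):
    """log f_Y(y, lam | y > 0), defined for y = 1, 2, ..."""
    if y < 1:
        raise ValueError("the zero-truncated density is defined for y >= 1")
    # -expm1(-lam) equals 1 - exp(-lam), computed accurately for small lam
    return poisson_logpmf(y, lam) - math.log(-math.expm1(-lam))

lam_d = 2.0   # hypothetical value of the doctor's parameter
total = sum(math.exp(truncated_poisson_logpmf(y, lam_d)) for y in range(1, 100))
```

Dividing by $1 - \exp(-\lambda_d)$ rescales the Poisson probabilities of the positive counts so that they sum to one, which is what `total` checks numerically.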
Since the hurdle and truncated components of the overall density for $Y$ share no parameters, they may
be estimated separately, which is computationally more efficient than estimating the overall model.
(Recall that the BFGS algorithm, for example, will have to invert the approximated Hessian. The
computational overhead is of order $K^2$, where $K$ is the number of parameters to be estimated.) The
expectation of $Y$ is
$$E(Y|x) = \Pr(Y > 0|x) E(Y|Y > 0, x) = \left[\frac{1}{1 + \exp(-\lambda_p)}\right] \left[\frac{\lambda_d}{1 - \exp(-\lambda_d)}\right]$$
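The expectation formula can be checked numerically with a short Python sketch that builds the full hurdle Poisson probability function and sums $y \Pr(Y = y)$ directly. The names `index_p` (standing for $\lambda_p$) and `lam_d`, and their values, are hypothetical.

```python
import math

def hurdle_poisson_pmf(y, index_p, lam_d):
    """Pr(Y = y): logit hurdle for zero, truncated Poisson for positive counts."""
    p_pos = 1.0 / (1.0 + math.exp(-index_p))     # Pr(Y > 0)
    if y == 0:
        return 1.0 - p_pos
    pois = math.exp(-lam_d + y * math.log(lam_d) - math.lgamma(y + 1))
    return p_pos * pois / (1.0 - math.exp(-lam_d))

def hurdle_poisson_mean(index_p, lam_d):
    """Closed form: Pr(Y > 0) times the mean of the truncated Poisson."""
    p_pos = 1.0 / (1.0 + math.exp(-index_p))
    return p_pos * lam_d / (1.0 - math.exp(-lam_d))

index_p, lam_d = 0.4, 2.5   # hypothetical parameter values
prob_total = sum(hurdle_poisson_pmf(y, index_p, lam_d) for y in range(200))
mean_by_sum = sum(y * hurdle_poisson_pmf(y, index_p, lam_d) for y in range(200))
mean_closed_form = hurdle_poisson_mean(index_p, lam_d)
```

The direct sum and the closed form agree up to floating point error, since the truncated Poisson has mean $\lambda_d / (1 - \exp(-\lambda_d))$.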
Here are hurdle Poisson estimation results for OBDV, obtained from this estimation program
**************************************************************************
MEPS data, OBDV
logit results
Strong convergence
Observations = 500
Function value -0.58939
t-Stats
params t(OPG) t(Sand.) t(Hess)
constant -1.5502 -2.5709 -2.5269 -2.5560
pub_ins 1.0519 3.0520 3.0027 3.0384
priv_ins 0.45867 1.7289 1.6924 1.7166
sex 0.63570 3.0873 3.1677 3.1366
age 0.018614 2.1547 2.1969 2.1807
educ 0.039606 1.0467 0.98710 1.0222
inc 0.077446 1.7655 2.1672 1.9601
Information Criteria
Consistent Akaike   639.89
Schwartz            632.89
Hannan-Quinn        614.96
Akaike              603.39
**************************************************************************
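The information criteria in the logit output above can be reproduced from the reported function value. The sketch below assumes, as an inference from the numbers rather than something the output states, that the function value is the average log-likelihood, and it uses the standard formulas with $n = 500$ observations and $K = 7$ parameters. The function name is hypothetical.

```python
import math

def information_criteria(avg_loglik, n, k):
    """Standard information criteria from an average log-likelihood."""
    minus_2_loglik = -2.0 * n * avg_loglik
    return {
        "CAIC": minus_2_loglik + k * (math.log(n) + 1.0),        # consistent Akaike
        "BIC": minus_2_loglik + k * math.log(n),                 # Schwarz
        "HQ": minus_2_loglik + 2.0 * k * math.log(math.log(n)),  # Hannan-Quinn
        "AIC": minus_2_loglik + 2.0 * k,                         # Akaike
    }

# Values taken from the logit output above
ic = information_criteria(avg_loglik=-0.58939, n=500, k=7)
```

Under these assumptions the results agree with the printed values up to the rounding of the function value. Smaller values indicate a better trade-off between fit and the number of parameters.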
The results for the truncated part:
**************************************************************************
MEPS data, OBDV
tpoisson results
Strong convergence
Observations = 500
Function value -2.7042
t-Stats
params t(OPG) t(Sand.) t(Hess)
constant 0.54254 7.4291 1.1747 3.2323
pub_ins 0.31001 6.5708 1.7573 3.7183
priv_ins 0.014382 0.29433 0.10438 0.18112
sex 0.19075 10.293 1.1890 3.6942
age 0.016683 16.148 3.5262 7.9814
educ 0.016286 4.2144 0.56547 1.6353
inc -0.0079016 -2.3186 -0.35309 -0.96078
Information Criteria
Consistent Akaike   2754.7
Schwartz            2747.7
Hannan-Quinn        2729.8
Akaike              2718.2
**************************************************************************
Fitted and actual probabilities (NB-II fits are provided as well) are:

Table 27.2: Actual and Hurdle Poisson fitted frequencies
                       OBDV                                  ERV
Count    Actual   Fitted HP   Fitted NB-II     Actual   Fitted HP   Fitted NB-II
0        0.32     0.32        0.34             0.86     0.86        0.86
1        0.18     0.035       0.16             0.10     0.10        0.10
2        0.11     0.071       0.11             0.02     0.02        0.02
3        0.10     0.10        0.08             0.004    0.006       0.006
4        0.052    0.11        0.06             0.002    0.002       0.002
5        0.032    0.10        0.05             0        0.0005      0.001
For the Hurdle Poisson models, the ERV fit is very accurate. The OBDV fit is not so good. Zeros
are exact, but 1's and 2's are underestimated, and higher counts are overestimated. For the NB-II fits,
performance is at least as good as the hurdle Poisson model, and one should recall that many fewer
parameters are used. Hurdle versions of the negative binomial model are also widely used.
Finite mixture models
The following are results for a mixture of 2 negative binomial (NB-I) models, for the OBDV data,
which you can replicate using this estimation program
**************************************************************************
MEPS data, OBDV
mixnegbin results
Strong convergence
Observations = 500
Function value -2.2312
t-Stats
params t(OPG) t(Sand.) t(Hess)
constant 0.64852 1.3851 1.3226 1.4358
pub_ins -0.062139 -0.23188 -0.13802 -0.18729
priv_ins 0.093396 0.46948 0.33046 0.40854
sex 0.39785 2.6121 2.2148 2.4882
age 0.015969 2.5173 2.5475 2.7151
educ -0.049175 -1.8013 -1.7061 -1.8036
inc 0.015880 0.58386 0.76782 0.73281
ln_alpha 0.69961 2.3456 2.0396 2.4029
constant -3.6130 -1.6126 -1.7365 -1.8411
pub_ins 2.3456 1.7527 3.7677 2.6519
priv_ins 0.77431 0.73854 1.1366 0.97338
sex 0.34886 0.80035 0.74016 0.81892
age 0.021425 1.1354 1.3032 1.3387
educ 0.22461 2.0922 1.7826 2.1470
inc 0.019227 0.20453 0.40854 0.36313
ln_alpha 2.8419 6.2497 6.8702 7.6182
logit_inv_mix 0.85186 1.7096 1.4827 1.7883
Information Criteria
Consistent Akaike   2353.8
Schwartz            2336.8
Hannan-Quinn        2293.3
Akaike              2265.2
**************************************************************************
Delta method for mix parameter st. err.
mix se_mix
0.70096 0.12043
The 95% confidence interval for the mix parameter is perilously close to 1, which suggests that
there may really be only one component density, rather than a mixture. Again, this is not the
way to test this - it is merely suggestive.
Education is interesting. For the subpopulation that is healthy, i.e., that makes relatively few
visits, education seems to have a positive effect on visits. For the unhealthy group, education
has a negative effect on visits. The other results are more mixed. A larger sample could help
clarify things.
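The delta method figures reported above can be reproduced with a few lines of Python. The sketch assumes that the mix parameter is the logistic transform of `logit_inv_mix`, and that the standard error of `logit_inv_mix` is the one implied by the sandwich t-statistic. Both are inferences from the printed numbers, not statements taken from the estimation program, and the function names are hypothetical.

```python
import math

def logistic(z):
    """Map an unrestricted value into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))

def delta_method_mix(logit_mix, se_logit):
    """Mix parameter and its delta-method standard error."""
    mix = logistic(logit_mix)
    # the derivative of the logistic function at logit_mix is mix * (1 - mix)
    se_mix = mix * (1.0 - mix) * se_logit
    return mix, se_mix

logit_mix = 0.85186                 # estimate from the mixnegbin output
se_logit = logit_mix / 1.4827       # standard error implied by t(Sand.)
mix, se_mix = delta_method_mix(logit_mix, se_logit)
lower, upper = mix - 1.96 * se_mix, mix + 1.96 * se_mix
```

With these inputs the sketch gives values that round to the reported 0.70096 and 0.12043, and the upper limit of the resulting interval is roughly 0.94.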
The following are results for a 2 component constrained mixture negative binomial model where all
the slope parameters in $\lambda_j = e^{x\beta_j}$ are the same across the two components. The constants and the
overdispersion parameters $\alpha_j$ are allowed to differ for the two components.
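To make the constraint concrete, here is a Python sketch of the density of such a constrained two-component mixture. It assumes the NB-I parameterization $\psi_j = \lambda_j / \alpha_j$ (so that the variance is $\lambda_j(1 + \alpha_j)$) and assumes that the mix parameter is the weight on the first component. All function names and numerical values are hypothetical, and this is not the estimation program used for the results.

```python
import math
import numpy as np

def negbin1_pmf(y, lam, alpha):
    """Negative binomial probability of y with mean lam, NB-I form (assumed)."""
    psi = lam / alpha
    logp = (math.lgamma(y + psi) - math.lgamma(y + 1) - math.lgamma(psi)
            + psi * math.log(psi / (psi + lam))
            + y * math.log(lam / (psi + lam)))
    return math.exp(logp)

def constrained_mixture_pmf(y, x, const_1, const_2, slopes, alpha_1, alpha_2, mix):
    """Two NB components sharing slopes but not constants or overdispersion."""
    xb = float(np.dot(x, slopes))          # common to both components
    lam_1 = math.exp(const_1 + xb)
    lam_2 = math.exp(const_2 + xb)
    return (mix * negbin1_pmf(y, lam_1, alpha_1)
            + (1.0 - mix) * negbin1_pmf(y, lam_2, alpha_2))

# Hypothetical regressors and parameter values
x = np.array([1.0, 0.5])
pars = dict(const_1=0.2, const_2=1.0, slopes=np.array([0.3, -0.2]),
            alpha_1=0.5, alpha_2=2.0, mix=0.7)
prob_total = sum(constrained_mixture_pmf(y, x, **pars) for y in range(400))
mean_by_sum = sum(y * constrained_mixture_pmf(y, x, **pars) for y in range(400))
mean_expected = 0.7 * math.exp(0.2 + 0.2) + 0.3 * math.exp(1.0 + 0.2)
```

Because the slopes are shared, the two component means differ only by the factor $\exp(\text{const}_2 - \text{const}_1)$ for every observation, which is what reduces the parameter count relative to the unconstrained mixture.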
**************************************************************************
MEPS data, OBDV
cmixnegbin results
Strong convergence
Observations = 500
Function value -2.2441
t-Stats
params t(OPG) t(Sand.) t(Hess)
constant -0.34153 -0.94203 -0.91456 -0.97943
pub_ins 0.45320 2.6206 2.5088 2.7067
priv_ins 0.20663 1.4258 1.3105 1.3895
sex 0.37714 3.1948 3.4929 3.5319
age 0.015822 3.1212 3.7806 3.7042
educ 0.011784 0.65887 0.50362 0.58331
inc 0.014088 0.69088 0.96831 0.83408
ln_alpha 1.1798 4.6140 7.2462 6.4293
const_2 1.2621 0.47525 2.5219 1.5060
lnalpha_2 2.7769 1.5539 6.4918 4.2243
logit_inv_mix 2.4888 0.60073 3.7224 1.9693
Information Criteria
Consistent Akaike   2323.5
Schwartz            2312.5
Hannan-Quinn        2284.3
Akaike              2266.1
**************************************************************************
Delta method for mix parameter st. err.
mix se_mix
0.92335 0.047318
Now the mixture parameter is even closer to 1.
The slope parameter estimates are pretty close to what we got with the NB-I model.
Bibliography
[1] Davidson, R. and J.G. MacKinnon (1993) Estimation and Inference in Econometrics,
Oxford Univ. Press.
[2] Davidson, R. and J.G. MacKinnon (2004) Econometric Theory and Methods, Oxford
Univ. Press.
[3] Gallant, A.R. (1985) Nonlinear Statistical Models, Wiley.
[4] Gallant, A.R. (1997) An Introduction to Econometric Theory, Princeton Univ. Press.
[5] Hamilton, J. (1994) Time Series Analysis, Princeton Univ. Press
[6] Hayashi, F. (2000) Econometrics, Princeton Univ. Press.
[7] Wooldridge (2003), Introductory Econometrics, Thomson. (undergraduate level, for
supplementary use only).
Index
A
ARCH, 572
asymptotic equality, 640
C
Chain rule, 633
Cobb-Douglas model, 29
conditional heteroscedasticity, 572
convergence, almost sure, 636
convergence, in distribution, 636
convergence, in probability, 635
Convergence, ordinary, 634
convergence, pointwise, 635
convergence, uniform, 635
convergence, uniform almost sure, 637
E
estimator, linear, 38, 51
estimator, OLS, 32
extremum estimator, 309
F
fitted values, 33
G
GARCH, 572
L
leptokurtosis, 571
leverage, 39
likelihood function, 333
M
matrix, idempotent, 37
matrix, projection, 36
matrix, symmetric, 37
O
observations, influential, 38
outliers, 38
own influence, 39
P
parameter space, 333
Product rule, 633
R
R-squared, uncentered, 42
residuals, 33
R-squared, centered, 44
