Econometrics
Bruce E. Hansen
© 2000, 2014
University of Wisconsin
Department of Economics
This manuscript may be printed and reproduced for individual or instructional use, but may not be printed for commercial purposes.
Contents

Preface

1 Introduction
1.1 What is Econometrics?
1.2 The Probability Approach to Econometrics
1.3 Econometric Terms and Notation
1.4 Observational Data
1.5 Standard Data Structures
1.6 Sources for Economic Data
1.7 Econometric Software
1.8 Reading the Manuscript
1.9 Common Symbols

2 Conditional Expectation and Projection
2.1 Introduction
2.2 The Distribution of Wages
2.3 Conditional Expectation
2.4 Log Differences*
2.5 Conditional Expectation Function
2.6 Continuous Variables
2.7 Law of Iterated Expectations
2.8 CEF Error
2.9 Intercept-Only Model
2.10 Regression Variance
2.11 Best Predictor
2.12 Conditional Variance
2.13 Homoskedasticity and Heteroskedasticity
2.14 Regression Derivative
2.15 Linear CEF
2.16 Linear CEF with Nonlinear Effects
2.17 Linear CEF with Dummy Variables
2.18 Best Linear Predictor
2.19 Linear Predictor Error Variance
2.20 Regression Coefficients
2.21 Regression Sub-Vectors
2.22 Coefficient Decomposition
2.23 Omitted Variable Bias
2.24 Best Linear Approximation
2.25 Normal Regression
2.26 Regression to the Mean
2.27 Reverse Regression
2.28 Limitations of the Best Linear Predictor
2.29 Random Coefficient Model
2.30 Causal Effects
2.31 Expectation: Mathematical Details*
2.32 Existence and Uniqueness of the Conditional Expectation*
2.33 Identification*
2.34 Technical Proofs*
Exercises

3 The Algebra of Least Squares
3.1 Introduction
3.2 Random Samples
3.3 Sample Means
3.4 Least Squares Estimator
3.5 Solving for Least Squares with One Regressor
3.6 Solving for Least Squares with Multiple Regressors
3.7 Illustration
3.8 Least Squares Residuals
3.9 Model in Matrix Notation
3.10 Projection Matrix
3.11 Orthogonal Projection
3.12 Estimation of Error Variance
3.13 Analysis of Variance
3.14 Regression Components
3.15 Residual Regression
3.16 Prediction Errors
3.17 Influential Observations
3.18 Normal Regression Model
3.19 CPS Data Set
3.20 Programming
3.21 Technical Proofs*
Exercises

4 Least Squares Regression
4.1 Introduction
4.2 Sample Mean
4.3 Linear Regression Model
4.4 Mean of Least-Squares Estimator
4.5 Variance of Least Squares Estimator
4.6 Gauss-Markov Theorem
4.7 Residuals
4.8 Estimation of Error Variance
4.9 Mean-Square Forecast Error
4.10 Covariance Matrix Estimation Under Homoskedasticity
4.11 Covariance Matrix Estimation Under Heteroskedasticity
4.12 Standard Errors
4.13 Computation
4.14 Measures of Fit
4.15 Empirical Example
4.16 Multicollinearity
4.17 Normal Regression Model
Exercises

5 An Introduction to Large Sample Asymptotics
5.1 Introduction
5.2 Asymptotic Limits
5.3 Convergence in Probability
5.4 Weak Law of Large Numbers
5.5 Almost Sure Convergence and the Strong Law*
5.6 Vector-Valued Moments
5.7 Convergence in Distribution
5.8 Higher Moments
5.9 Functions of Moments
5.10 Delta Method
5.11 Stochastic Order Symbols
5.12 Uniform Stochastic Bounds*
5.13 Semiparametric Efficiency
5.14 Technical Proofs*
Exercises

6 Asymptotic Theory for Least Squares
6.1 Introduction
6.2 Consistency of Least-Squares Estimation
6.3 Asymptotic Normality
6.4 Joint Distribution
6.5 Consistency of Error Variance Estimators
6.6 Homoskedastic Covariance Matrix Estimation
6.7 Heteroskedastic Covariance Matrix Estimation
6.8 Summary of Covariance Matrix Notation
6.9 Alternative Covariance Matrix Estimators*
6.10 Functions of Parameters
6.11 Asymptotic Standard Errors
6.12 t statistic
6.13 Confidence Intervals
6.14 Regression Intervals
6.15 Forecast Intervals
6.16 Wald Statistic
6.17 Homoskedastic Wald Statistic
6.18 Confidence Regions
6.19 Semiparametric Efficiency in the Projection Model
6.20 Semiparametric Efficiency in the Homoskedastic Regression Model*
6.21 Uniformly Consistent Residuals*
6.22 Asymptotic Leverage*
Exercises

7 Restricted Estimation
7.1 Introduction
7.2 Constrained Least Squares
7.3 Exclusion Restriction
7.4 Minimum Distance
7.5 Asymptotic Distribution
7.6 Efficient Minimum Distance Estimator
7.7 Exclusion Restriction Revisited
7.8 Variance and Standard Error Estimation
7.9 Misspecification
7.10 Nonlinear Constraints
7.11 Inequality Restrictions
7.12 Constrained MLE
7.13 Technical Proofs*
Exercises

8 Hypothesis Testing
8.1 Hypotheses
8.2 Acceptance and Rejection
8.3 Type I Error
8.4 t tests
8.5 Type II Error and Power
8.6 Statistical Significance
8.7 P-Values
8.8 t-ratios and the Abuse of Testing
8.9 Wald Tests
8.10 Homoskedastic Wald Tests
8.11 Criterion-Based Tests
8.12 Minimum Distance Tests
8.13 Minimum Distance Tests Under Homoskedasticity
8.14 F Tests
8.15 Likelihood Ratio Test
8.16 Problems with Tests of NonLinear Hypotheses
8.17 Monte Carlo Simulation
8.18 Confidence Intervals by Test Inversion
8.19 Power and Test Consistency
8.20 Asymptotic Local Power
8.21 Asymptotic Local Power, Vector Case
8.22 Technical Proofs*
Exercises

9 Regression Extensions
9.1 NonLinear Least Squares
9.2 Generalized Least Squares
9.3 Testing for Heteroskedasticity
9.4 Testing for Omitted NonLinearity
9.5 Least Absolute Deviations
9.6 Quantile Regression
Exercises

10 The Bootstrap
10.1 Definition of the Bootstrap
10.2 The Empirical Distribution Function
10.3 Nonparametric Bootstrap
10.4 Bootstrap Estimation of Bias and Variance
10.5 Percentile Intervals
10.6 Percentile-t Equal-Tailed Interval
10.7 Symmetric Percentile-t Intervals
10.8 Asymptotic Expansions
10.9 One-Sided Tests
10.10 Symmetric Two-Sided Tests
10.11 Percentile Confidence Intervals
10.12 Bootstrap Methods for Regression Models
Exercises

11 NonParametric Regression
11.1 Introduction
11.2 Binned Estimator
11.3 Kernel Regression
11.4 Local Linear Estimator
11.5 Nonparametric Residuals and Regression Fit
11.6 Cross-Validation Bandwidth Selection
11.7 Asymptotic Distribution
11.8 Conditional Variance Estimation
11.9 Standard Errors
11.10 Multiple Regressors

12 Series Estimation
12.1 Approximation by Series
12.2 Splines
12.3 Partially Linear Model
12.4 Additively Separable Models
12.5 Uniform Approximations
12.6 Runge's Phenomenon
12.7 Approximating Regression
12.8 Residuals and Regression Fit
12.9 Cross-Validation Model Selection
12.10 Convergence in Mean-Square
12.11 Uniform Convergence
12.12 Asymptotic Normality
12.13 Asymptotic Normality with Undersmoothing
12.14 Regression Estimation
12.15 Kernel Versus Series Regression
12.16 Technical Proofs

13 Generalized Method of Moments
13.1 Overidentified Linear Model
13.2 GMM Estimator
13.3 Distribution of GMM Estimator
13.4 Estimation of the Efficient Weight Matrix
13.5 GMM: The General Case
13.6 Over-Identification Test
13.7 Hypothesis Testing: The Distance Statistic
13.8 Conditional Moment Restrictions
13.9 Bootstrap GMM Inference
Exercises

14 Empirical Likelihood
14.1 Non-Parametric Likelihood
14.2 Asymptotic Distribution of EL Estimator
14.3 Overidentifying Restrictions
14.4 Testing
14.5 Numerical Computation

15 Endogeneity
15.1 Instrumental Variables
15.2 Reduced Form
15.3 Identification
15.4 Estimation
15.5 Special Cases: IV and 2SLS
15.6 Bekker Asymptotics
15.7 Identification Failure
Exercises

16 Univariate Time Series
16.1 Stationarity and Ergodicity
16.2 Autoregressions
16.3 Stationarity of AR(1) Process
16.4 Lag Operator
16.5 Stationarity of AR(k)
16.6 Estimation
16.7 Asymptotic Distribution
16.8 Bootstrap for Autoregressions
16.9 Trend Stationarity
16.10 Testing for Omitted Serial Correlation
16.11 Model Selection
16.12 Autoregressive Unit Roots

17 Multivariate Time Series
17.1 Vector Autoregressions (VARs)
17.2 Estimation
17.3 Restricted VARs
17.4 Single Equation from a VAR
17.5 Testing for Omitted Serial Correlation
17.6 Selection of Lag Length in a VAR
17.7 Granger Causality
17.8 Cointegration
17.9 Cointegrated VARs

18 Limited Dependent Variables
18.1 Binary Choice
18.2 Count Data
18.3 Censored Data
18.4 Sample Selection

19 Panel Data
19.1 Individual-Effects Model
19.2 Fixed Effects
19.3 Dynamic Panel Regression

20 Nonparametric Density Estimation
20.1 Kernel Density Estimation
20.2 Asymptotic MSE for Kernel Estimates

A Matrix Algebra
A.1 Notation
A.2 Matrix Addition
A.3 Matrix Multiplication
A.4 Trace
A.5 Rank and Inverse
A.6 Determinant
A.7 Eigenvalues
A.8 Positive Definiteness
A.9 Matrix Calculus
A.10 Kronecker Products and the Vec Operator
A.11 Vector and Matrix Norms
A.12 Matrix Inequalities

B Probability
B.1 Foundations
B.2 Random Variables
B.3 Expectation
B.4 Gamma Function
B.5 Common Distributions
B.6 Multivariate Random Variables
B.7 Conditional Distributions and Expectation
B.8 Transformations
B.9 Normal and Related Distributions
B.10 Inequalities
B.11 Maximum Likelihood

C Numerical Optimization
C.1 Grid Search
C.2 Gradient Methods
C.3 Derivative-Free Methods
Preface
This book is intended to serve as the textbook for a first-year graduate course in econometrics. It can be used as a stand-alone text, or as a supplement to another text.

Students are assumed to have an understanding of multivariate calculus, probability theory, linear algebra, and mathematical statistics. A prior course in undergraduate econometrics would be helpful, but is not required. Two excellent undergraduate textbooks are Wooldridge (2009) and Stock and Watson (2010).

For reference, some of the basic tools of matrix algebra, probability, and statistics are reviewed in the Appendix.

For students wishing to deepen their knowledge of matrix algebra in relation to their study of econometrics, I recommend Matrix Algebra by Abadir and Magnus (2005). An excellent introduction to probability and statistics is Statistical Inference by Casella and Berger (2002). For those wanting a deeper foundation in probability, I recommend Ash (1972) or Billingsley (1995). For more advanced statistical theory, I recommend Lehmann and Casella (1998), van der Vaart (1998), Shao (2003), and Lehmann and Romano (2005).

For further study in econometrics beyond this text, I recommend Davidson (1994) for asymptotic theory, Hamilton (1994) for time-series methods, Wooldridge (2002) for panel data and discrete response models, and Li and Racine (2007) for nonparametrics and semiparametric econometrics. Beyond these texts, the Handbook of Econometrics series provides advanced summaries of contemporary econometric methods and theory.

The end-of-chapter exercises are important parts of the text and are meant to help teach students of econometrics. Answers are not provided, and this is intentional.

I would like to thank Ying-Ying Lee for providing research assistance in preparing some of the empirical examples presented in the text.

As this is a manuscript in progress, some parts are quite incomplete, and there are many topics which I plan to add. In general, the earlier chapters are the most complete while the later chapters need significant work and revision.
Chapter 1
Introduction
1.1 What is Econometrics?
The term "econometrics" is believed to have been crafted by Ragnar Frisch (1895-1973) of Norway, one of the three principal founders of the Econometric Society, first editor of the journal Econometrica, and co-winner of the first Nobel Memorial Prize in Economic Sciences in 1969. It is therefore fitting that we turn to Frisch's own words in the introduction to the first issue of Econometrica to describe the discipline.

A word of explanation regarding the term econometrics may be in order. Its definition is implied in the statement of the scope of the [Econometric] Society, in Section I of the Constitution, which reads: "The Econometric Society is an international society for the advancement of economic theory in its relation to statistics and mathematics.... Its main object shall be to promote studies that aim at a unification of the theoretical-quantitative and the empirical-quantitative approach to economic problems...." But there are several aspects of the quantitative approach to economics, and no single one of these aspects, taken by itself, should be confounded with econometrics. Thus, econometrics is by no means the same as economic statistics. Nor is it identical with what we call general economic theory, although a considerable portion of this theory has a definitely quantitative character. Nor should econometrics be taken as synonymous with the application of mathematics to economics. Experience has shown that each of these three view-points, that of statistics, economic theory, and mathematics, is a necessary, but not by itself a sufficient, condition for a real understanding of the quantitative relations in modern economic life. It is the unification of all three that is powerful. And it is this unification that constitutes econometrics.

Ragnar Frisch, Econometrica, (1933), 1, pp. 1-2.

This definition remains valid today, although some terms have evolved somewhat in their usage. Today, we would say that econometrics is the unified study of economic models, mathematical statistics, and economic data.

Within the field of econometrics there are sub-divisions and specializations. Econometric theory concerns the development of tools and methods, and the study of the properties of econometric methods. Applied econometrics is a term describing the development of quantitative economic models and the application of econometric methods to these models using economic data.
1.2 The Probability Approach to Econometrics
The unifying methodology of modern econometrics was articulated by Trygve Haavelmo (1911-1999) of Norway, winner of the 1989 Nobel Memorial Prize in Economic Sciences, in his seminal
paper "The probability approach in econometrics," Econometrica (1944). Haavelmo argued that quantitative economic models must necessarily be probability models (by which today we would mean stochastic). Deterministic models are blatantly inconsistent with observed economic quantities, and it is incoherent to apply deterministic models to non-deterministic data. Economic models should be explicitly designed to incorporate randomness; stochastic errors should not be simply added to deterministic models to make them random. Once we acknowledge that an economic model is a probability model, it follows naturally that an appropriate way to quantify, estimate, and conduct inferences about the economy is through the powerful theory of mathematical statistics. The appropriate method for a quantitative economic analysis follows from the probabilistic construction of the economic model.

Haavelmo's probability approach was quickly embraced by the economics profession. Today no quantitative work in economics shuns its fundamental vision. While all economists embrace the probability approach, there has been some evolution in its implementation.

The structural approach is the closest to Haavelmo's original idea. A probabilistic economic model is specified, and the quantitative analysis is performed under the assumption that the economic model is correctly specified. Researchers often describe this as "taking their model seriously." The structural approach typically leads to likelihood-based analysis, including maximum likelihood and Bayesian estimation.

A criticism of the structural approach is that it is misleading to treat an economic model as correctly specified. Rather, it is more accurate to view a model as a useful abstraction or approximation. In this case, how should we interpret structural econometric analysis? The quasi-structural approach to inference views a structural economic model as an approximation rather than the truth. This theory has led to the concepts of the pseudo-true value (the parameter value defined by the estimation problem), the quasi-likelihood function, quasi-MLE, and quasi-likelihood inference.

Closely related is the semiparametric approach. A probabilistic economic model is partially specified but some features are left unspecified. This approach typically leads to estimation methods such as least-squares and the Generalized Method of Moments. The semiparametric approach dominates contemporary econometrics, and is the main focus of this textbook.

Another branch of quantitative structural economics is the calibration approach. Similar to the quasi-structural approach, the calibration approach interprets structural models as approximations and hence inherently false. The difference is that the calibrationist literature rejects mathematical statistics (deeming classical theory as inappropriate for approximate models) and instead selects parameters by matching model and data moments using non-statistical ad hoc¹ methods.
1.3 Econometric Terms and Notation
In a typical application, an econometrician has a set of repeated measurements on a set of variables. For example, in a labor application the variables could include weekly earnings, educational attainment, age, and other descriptive characteristics. We call this information the data, dataset, or sample. We use the term observations to refer to the distinct repeated measurements on the variables. An individual observation often corresponds to a specific economic unit, such as a person, household, corporation, firm, organization, country, state, city or other geographical region. An individual observation could also be a measurement at a point in time, such as quarterly GDP or a daily interest rate.
¹ Ad hoc means "for this purpose," that is, a method designed for a specific problem and not based on a generalizable principle.
Economists typically denote variables by the italicized roman characters y, x, and/or z. The convention in econometrics is to use the character y to denote the variable to be explained, while the characters x and z are used to denote the conditioning (explaining) variables. Following mathematical convention, real numbers (elements of the real line $\mathbb{R}$) are written using lower case italics such as y, and vectors (elements of $\mathbb{R}^k$) by lower case bold italics such as $\boldsymbol{x}$, e.g.

$$\boldsymbol{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_k \end{pmatrix}.$$
Upper case bold italics such as $\boldsymbol{X}$ are used for matrices. We typically denote the number of observations by the natural number n, and subscript the variables by the index i to denote the individual observation, e.g. $y_i$, $x_i$ and $z_i$. In some contexts we use indices other than i, such as in time-series applications, where the index t is common, and in panel studies, where we typically use the double index it to refer to individual i at time period t.
The i-th observation is the set $(y_i, x_i, z_i)$. The sample is the set $\{(y_i, x_i, z_i) : i = 1, \ldots, n\}$.

It is proper mathematical practice to use upper case X for random variables and lower case x for realizations or specific values. Since we use upper case to denote matrices, the distinction between random variables and their realizations is not rigorously followed in econometric notation. Thus the notation $y_i$ will in some places refer to a random variable, and in other places a specific realization. This is undesirable, but there is little to be done about it without terrifically complicating the notation. Hopefully there will be no confusion, as the use should be evident from the context.

We typically use Greek letters such as $\beta$, $\theta$ and $\sigma^2$ to denote unknown parameters of an econometric model, and will use boldface, e.g. $\boldsymbol{\beta}$ or $\boldsymbol{\theta}$, when these are vector-valued. Estimates are typically denoted by putting a hat "^", tilde "~" or bar "-" over the corresponding letter, e.g. $\hat{\beta}$ and $\tilde{\beta}$ are estimates of $\beta$.

The covariance matrix of an econometric estimator will typically be written using the capital boldface $\boldsymbol{V}$, often with a subscript to denote the estimator, e.g. $\boldsymbol{V}_{\hat{\beta}} = \operatorname{var}(\hat{\boldsymbol{\beta}})$ as the covariance matrix for $\hat{\boldsymbol{\beta}}$. Hopefully without causing confusion, we will use the notation $\boldsymbol{V}_{\beta} = \operatorname{avar}(\hat{\boldsymbol{\beta}})$ to denote the asymptotic covariance matrix of $\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ (the variance of the asymptotic distribution). Estimates will be denoted by appending hats or tildes, e.g. $\hat{\boldsymbol{V}}_{\beta}$ is an estimate of $\boldsymbol{V}_{\beta}$.
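To make this notation concrete, here is a minimal sketch in Python (the package choice, variable names, and dimensions are all illustrative assumptions, not part of the text): the observations $y_i$ are stored as entries of a length-n array, the vectors $x_i$ as rows of an n × k matrix, and a parameter vector plays the role of a Greek-letter object.

```python
import numpy as np

# Hypothetical sample: n observations and k regressors (sizes chosen for illustration).
n, k = 100, 3
rng = np.random.default_rng(0)

X = rng.normal(size=(n, k))        # matrix X; row i is the k-vector x_i
beta = np.array([1.0, 0.5, -0.3])  # an unknown-parameter stand-in (the Greek letter beta)
e = rng.normal(size=n)             # unobserved errors
y = X @ beta + e                   # y_i is a scalar; the array y stacks all n of them

y_5, x_5 = y[4], X[4]              # the 5th observation (y_5, x_5), using 0-based indexing
```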
1.4 Observational Data
A common econometric question is to quantify the impact of one set of variables on another variable. For example, a concern in labor economics is the returns to schooling: the change in earnings induced by increasing a worker's education, holding other variables constant. Another issue of interest is the earnings gap between men and women.

Ideally, we would use experimental data to answer these questions. To measure the returns to schooling, an experiment might randomly divide children into groups, mandate different levels of education to the different groups, and then follow the children's wage path after they mature and enter the labor force. The differences between the groups would be direct measurements of the effects of different levels of education. However, experiments such as this would be widely condemned as immoral! Consequently, we see few non-laboratory experimental data sets in economics.
Instead, most economic data is observational. To continue the above example, through data collection we can record the level of a person's education and their wage. With such data we can measure the joint distribution of these variables, and assess the joint dependence. But from observational data it is difficult to infer causality, as we are not able to manipulate one variable to see the direct effect on the other. For example, a person's level of education is (at least partially) determined by that person's choices. These factors are likely to be affected by their personal abilities and attitudes towards work. The fact that a person is highly educated suggests a high level of ability, which suggests a high relative wage. This is an alternative explanation for an observed positive correlation between educational levels and wages. High ability individuals do better in school, and therefore choose to attain higher levels of education, and their high ability is the fundamental reason for their high wages. The point is that multiple explanations are consistent with a positive correlation between schooling levels and wages. Knowledge of the joint distribution alone may not be able to distinguish between these explanations.

Most economic data sets are observational, not experimental. This means that all variables must be treated as random and possibly jointly determined.

This discussion means that it is difficult to infer causality from observational data alone. Causal inference requires identification, and this is based on strong assumptions. We will discuss these issues on occasion throughout the text.
1.5 Standard Data Structures
There are three major types of economic data sets: cross-sectional, time-series, and panel. They are distinguished by the dependence structure across observations.

Cross-sectional data sets have one observation per individual. Surveys are a typical source for cross-sectional data. In typical applications, the individuals surveyed are persons, households, firms or other economic agents. In many contemporary econometric cross-section studies the sample size n is quite large. It is conventional to assume that cross-sectional observations are mutually independent. Most of this text is devoted to the study of cross-section data.

Time-series data are indexed by time. Typical examples include macroeconomic aggregates, prices and interest rates. This type of data is characterized by serial dependence, so the random sampling assumption is inappropriate. Most aggregate economic data is only available at a low frequency (annual, quarterly or perhaps monthly), so the sample size is typically much smaller than in cross-section studies. The exception is financial data, where data are available at a high frequency (weekly, daily, hourly, or by transaction), so sample sizes can be quite large.

Panel data combines elements of cross-section and time-series. These data sets consist of a set of individuals (typically persons, households, or corporations) surveyed repeatedly over time. The common modeling assumption is that the individuals are mutually independent of one another, but a given individual's observations are mutually dependent. This is a modified random sampling environment.

Data Structures: cross-sectional, time-series, and panel.
Some contemporary econometric applications combine elements of cross-section, time-series, and panel data modeling. These include models of spatial correlation and clustering.

As we mentioned above, most of this text will be devoted to cross-sectional data under the assumption of mutually independent observations. By mutual independence we mean that the i-th observation $(y_i, x_i, z_i)$ is independent of the j-th observation $(y_j, x_j, z_j)$ for $i \ne j$. (Sometimes the label "independent" is misconstrued. It is a statement about the relationship between observations i and j, not a statement about the relationship between $y_i$ and $x_i$ and/or $z_i$.)

Furthermore, if the data is randomly gathered, it is reasonable to model each observation as a random draw from the same probability distribution. In this case we say that the data are independent and identically distributed, or iid. We call this a random sample. For most of this text we will assume that our observations come from a random sample.
Definition 1.5.1 The observations $(y_i, x_i, z_i)$ are a random sample if they are mutually independent and identically distributed (iid) across $i = 1, \ldots, n$.
In the random sampling framework, we think of an individual observation $(y_i, x_i, z_i)$ as a realization from a joint probability distribution $F(y, x, z)$, which we can call the population. This "population" is infinitely large. This abstraction can be a source of confusion as it does not correspond to a physical population in the real world. It is an abstraction, since the distribution F is unknown, and the goal of statistical inference is to learn about features of F from the sample.

The assumption of random sampling provides the mathematical foundation for treating economic statistics with the tools of mathematical statistics. The random sampling framework was a major intellectual breakthrough of the late 19th century, allowing the application of mathematical statistics to the social sciences. Before this conceptual development, methods from mathematical statistics had not been applied to economic data, as they were viewed as inappropriate. The random sampling framework enabled economic samples to be viewed as homogeneous and random, a necessary precondition for the application of statistical methods.
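As an illustration of Definition 1.5.1, the following minimal simulation sketch (the joint distribution F and every number below are invented purely for illustration) draws n observations independently from one fixed joint distribution, which is exactly the iid random-sampling assumption.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Each observation (y_i, x_i) is an independent draw from the same joint distribution F.
x = rng.normal(loc=12.0, scale=3.0, size=n)        # e.g. years of education (illustrative)
y = 1.0 + 0.1 * x + rng.normal(scale=0.5, size=n)  # e.g. log hourly wage (illustrative)

sample = list(zip(y, x))  # the random sample {(y_i, x_i) : i = 1, ..., n}
```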
1.6 Sources for Economic Data
Fortunately for economists, the internet provides a convenient forum for the dissemination of economic data. Many large-scale economic datasets are available without charge from governmental agencies. An excellent starting point is the Resources for Economists Data Links, available at rfe.org. From this site you can find almost every publicly available economic data set. Some specific data sources of interest include:

Bureau of Labor Statistics
US Census
Current Population Survey
Survey of Income and Program Participation
Panel Study of Income Dynamics
Federal Reserve System (Board of Governors and regional banks)
National Bureau of Economic Research
U.S. Bureau of Economic Analysis
CompuStat
International Financial Statistics
Another good source of data is from authors of published empirical studies. Most journals in economics require authors of published papers to make their datasets generally available. For example, in its instructions for submission, Econometrica states:

Econometrica has the policy that all empirical, experimental and simulation results must be replicable. Therefore, authors of accepted papers must submit data sets, programs, and information on empirical analysis, experiments and simulations that are needed for replication and some limited sensitivity analysis.

The American Economic Review states:

All data used in analysis must be made available to any researcher for purposes of replication.

The Journal of Political Economy states:

It is the policy of the Journal of Political Economy to publish papers only if the data used in the analysis are clearly and precisely documented and are readily available to any researcher for purposes of replication.

If you are interested in using the data from a published paper, first check the journal's website, as many journals archive data and replication programs online. Second, check the website(s) of the paper's author(s). Most academic economists maintain webpages, and some make available replication files complete with data and programs. If these investigations fail, email the author(s), politely requesting the data. You may need to be persistent.

As a matter of professional etiquette, all authors absolutely have the obligation to make their data and programs available. Unfortunately, many fail to do so, and typically for poor reasons. The irony of the situation is that it is typically in the best interests of a scholar to make as much of their work (including all data and programs) freely available, as this only increases the likelihood of their work being cited and having an impact.

Keep this in mind as you start your own empirical project. Remember that as part of your end product, you will need (and want) to provide all data and programs to the community of scholars. The greatest form of flattery is to learn that another scholar has read your paper, wants to extend your work, or wants to use your empirical methods. In addition, public openness provides a healthy incentive for transparency and integrity in empirical analysis.
1.7 Econometric Software
Economists use a variety of econometric, statistical, and programming software.

STATA (www.stata.com) is a powerful statistical program with a broad set of pre-programmed econometric and statistical tools. It is quite popular among economists, and is continuously being updated with new methods. It is an excellent package for most econometric analysis, but is limited when you want to use new or less-common econometric methods which have not yet been programmed.

R (www.r-project.org), GAUSS (www.aptech.com), MATLAB (www.mathworks.com), and Ox (www.oxmetrics.net) are high-level matrix programming languages with a wide variety of built-in statistical functions. Many econometric methods have been programmed in these languages and are available on the web. The advantage of these packages is that you are in complete control of your
analysis, and it is easier to program new methods than in STATA. Some disadvantages are that you have to do much of the programming yourself, programming complicated procedures takes significant time, and programming errors are hard to prevent and difficult to detect and eliminate. Of these languages, GAUSS used to be quite popular among econometricians, but now MATLAB is more popular. A smaller but growing group of econometricians are enthusiastic fans of R, which among these languages is uniquely open-source, user-contributed, and, best of all, completely free!

For highly-intensive computational tasks, some economists write their programs in a standard programming language such as Fortran or C. This can lead to major gains in computational speed, at the cost of increased time in programming and debugging.

As these different packages have distinct advantages, many empirical economists end up using more than one package. As a student of econometrics, you will learn at least one of these packages, and probably more than one.
1.8 Reading the Manuscript
I have endeavored to use a unified notation and nomenclature. The development of the material is cumulative, with later chapters building on the earlier ones. Nevertheless, every attempt has been made to make each chapter self-contained, so readers can pick and choose topics according to their interests.

To fully understand econometric methods, it is necessary to have a mathematical understanding of its mechanics, and this includes the mathematical proofs of the main results. Consequently, this text is self-contained, with nearly all results proved with full mathematical rigor. The mathematical development and proofs aim at brevity and conciseness (sometimes described as mathematical elegance), but also at pedagogy. To understand a mathematical proof, it is not sufficient to simply read the proof; you need to follow it, and re-create it for yourself.

Nevertheless, many readers will not be interested in each mathematical detail, explanation, or proof. This is okay. To use a method it may not be necessary to understand the mathematical details. Accordingly I have placed the more technical mathematical proofs and details in chapter appendices. These appendices and other technical sections are marked with an asterisk (*). These sections can be skipped without any loss in exposition.
1.9 Common Symbols
y                  scalar
x                  vector
X                  matrix
R                  real line
R^k                Euclidean k space
E(y)               mathematical expectation
var(y)             variance
cov(x, y)          covariance
var(x)             covariance matrix
corr(x, y)         correlation
Pr                 probability
lim (n→∞)          limit
→p                 convergence in probability
→d                 convergence in distribution
plim (n→∞)         probability limit
N(μ, σ²)           normal distribution
N(0, 1)            standard normal distribution
χ²_k               chi-square distribution with k degrees of freedom
I_n                identity matrix
tr A               trace
A'                 matrix transpose
A⁻¹                matrix inverse
A > 0, A ≥ 0       positive definite, positive semi-definite
‖a‖                Euclidean norm
‖A‖                matrix (Frobenius) norm
≈                  approximate equality
def=               definitional equality
~                  is distributed as
log                natural logarithm
Chapter 2
Conditional Expectation and Projection
2.1 Introduction
The most commonly applied econometric tool is least-squares estimation, also known as regression. As we will see, least-squares is a tool to estimate an approximate conditional mean of one variable (the dependent variable) given another set of variables (the regressors, conditioning variables, or covariates). In this chapter we abstract from estimation, and focus on the probabilistic foundation of the conditional expectation model and its projection approximation.
2.2 The Distribution of Wages
Suppose that we are interested in wage rates in the United States. Since wage rates vary across workers, we cannot describe wage rates by a single number. Instead, we can describe wages using a probability distribution. Formally, we view the wage of an individual worker as a random variable wage with the probability distribution

$$F(u) = \Pr(wage \le u).$$

When we say that a person's wage is random we mean that we do not know their wage before it is measured, and we treat observed wage rates as realizations from the distribution F. Treating unobserved wages as random variables and observed wages as realizations is a powerful mathematical abstraction which allows us to use the tools of mathematical probability.

A useful thought experiment is to imagine dialing a telephone number selected at random, and then asking the person who responds to tell us their wage rate. (Assume for simplicity that all workers have equal access to telephones, and that the person who answers your call will respond honestly.) In this thought experiment, the wage of the person you have called is a single draw from the distribution F of wages in the population. By making many such phone calls we can learn the distribution F of the entire population.

When a distribution function F is differentiable we define the probability density function

$$f(u) = \frac{d}{du} F(u).$$
The density contains the same information as the distribution function, but the density is typically easier to visually interpret.
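Since the density is the derivative of the distribution function, the two can be checked against each other numerically. The sketch below is illustrative only: it uses a log-normal distribution as a stand-in for a wage distribution (not the CPS data of Figure 2.1) and differentiates its CDF on a grid.

```python
import numpy as np
from scipy.stats import lognorm

dist = lognorm(s=0.7, scale=20.0)      # stand-in "wage" distribution (illustrative parameters)
u = np.linspace(0.1, 100.0, 2000)      # grid of wage values

F = dist.cdf(u)                        # distribution function F(u)
f_numeric = np.gradient(F, u)          # numerical derivative dF(u)/du
f_true = dist.pdf(u)                   # density f(u)

print(np.max(np.abs(f_numeric - f_true)))  # close to zero: f is the derivative of F
```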
Figure 2.1: Wage Distribution and Density. All full-time U.S. workers.

In Figure 2.1 we display estimates¹ of the probability distribution function (on the left) and density function (on the right) of U.S. wage rates in 2009. We see that the density is peaked around $15, and most of the probability mass appears to lie between $10 and $40. These are ranges for typical wage rates in the U.S. population.

Important measures of central tendency are the median and the mean. The median m of a continuous² distribution F is the unique solution to

$$F(m) = \frac{1}{2}.$$

The median U.S. wage ($19.23) is indicated in the left panel of Figure 2.1 by the arrow. The median is a robust³ measure of central tendency, but it is tricky to use for many calculations as it is not a linear operator.

The expectation or mean of a random variable y with density f is

$$\mu = E(y) = \int_{-\infty}^{\infty} u f(u)\, du.$$
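When the density is known, the integral defining the mean can be evaluated directly. The sketch below (again with an arbitrary log-normal stand-in, not the wage data) approximates E(y) = ∫ u f(u) du by a Riemann sum and compares it with the distribution's exact mean.

```python
import numpy as np
from scipy.stats import lognorm

dist = lognorm(s=0.7, scale=20.0)            # illustrative stand-in density f
u = np.linspace(0.01, 500.0, 200_000)        # grid wide enough to cover nearly all the mass
du = u[1] - u[0]

mean_numeric = np.sum(u * dist.pdf(u)) * du  # Riemann-sum approximation of E(y)
print(mean_numeric, dist.mean())             # the two values essentially coincide
```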
A general definition of the mean is presented in Section 2.31. The mean U.S. wage ($23.90) is indicated in the right panel of Figure 2.1 by the arrow. Here we have used the common and convenient convention of using the single character y to denote a random variable, rather than the more cumbersome label wage. We sometimes use the notation Ey instead of E(y) when the variable whose expectation is being taken is clear from the context. There is no distinction in meaning.

The mean is a convenient measure of central tendency because it is a linear operator and arises naturally in many economic models. A disadvantage of the mean is that it is not robust,⁴ especially in the presence of substantial skewness or thick tails, which are both features of the wage distribution, as can be seen easily in the right panel of Figure 2.1. Another way of viewing this is that 64% of workers earn less than the mean wage of $23.90, suggesting that it is incorrect to describe the mean as a "typical" wage rate.

¹ The distribution and density are estimated nonparametrically from the sample of 50,742 full-time non-military wage-earners reported in the March 2009 Current Population Survey. The wage rate is constructed as annual individual wage and salary earnings divided by hours worked.
² If F is not continuous the definition is $m = \inf\{u : F(u) \ge \tfrac{1}{2}\}$.
³ The median is not sensitive to perturbations in the tails of the distribution.
⁴ The mean is sensitive to perturbations in the tails of the distribution.
Figure 2.2: Log Wage Density

In this context it is useful to transform the data by taking the natural logarithm.[5] Figure 2.2 shows the density of log hourly wages log(wage) for the same population, with its mean 2.95 drawn in with the arrow. The density of log wages is much less skewed and fat-tailed than the density of the level of wages, so its mean E(log(wage)) = 2.95 is a much better (more robust) measure[6] of central tendency of the distribution. For this reason, wage regressions typically use log wages as a dependent variable rather than the level of wages.
Another useful way to summarize the probability distribution F(u) is in terms of its quantiles. For any α ∈ (0, 1), the αth quantile of the continuous[7] distribution F is the real number qα which satisfies

    F(qα) = α.

The quantile function qα, viewed as a function of α, is the inverse of the distribution function F. The most commonly used quantile is the median, that is, q0.5 = m. We sometimes refer to quantiles by the percentile representation of α, and in this case they are often called percentiles, e.g. the median is the 50th percentile.
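A short numerical illustration of these summary measures, sketched in Python (the simulated log-normal wage population is hypothetical and used only for illustration), compares the mean, median, quantiles, and the mean of log wages.

import numpy as np

rng = np.random.default_rng(1)
wage = rng.lognormal(mean=3.0, sigma=0.6, size=100_000)   # hypothetical right-skewed wages

mean = wage.mean()                        # E(wage): pulled up by the right tail
median = np.quantile(wage, 0.5)           # q_{0.5}: robust to the tail
q10, q90 = np.quantile(wage, [0.1, 0.9])  # other quantiles
mean_log = np.log(wage).mean()            # E(log(wage)): far less affected by skewness
share_below_mean = np.mean(wage < mean)   # typically well above 50% for skewed data

print(mean, median, q10, q90, mean_log, share_below_mean)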
2.3 Conditional Expectation
We saw in Figure 2.2 the density of log wages. Is this distribution the same for all workers, or does the wage distribution vary across subpopulations? To answer this question, we can compare wage distributions for different groups, for example men and women. The plot on the left in Figure 2.3 displays the densities of log wages for U.S. men and women with their means (3.05 and 2.81) indicated by the arrows. We can see that the two wage densities take similar shapes but the density for men is somewhat shifted to the right with a higher mean.
[5] Throughout the text, we will use log(y) to denote the natural logarithm of y.
[6] More precisely, the geometric mean exp(E(log w)) = $19.11 is a robust measure of central tendency.
[7] If F is not continuous the definition is qα = inf{u : F(u) ≥ α}.
Figure 2.3: Log Wage Density by Gender and Race

The values 3.05 and 2.81 are the mean log wages in the subpopulations of men and women workers. They are called the conditional means (or conditional expectations) of log wages given gender. We can write their specific values as

    E(log(wage) | gender = man) = 3.05        (2.1)
    E(log(wage) | gender = woman) = 2.81.     (2.2)
We call these means conditional as they are conditioning on a fixed value of the variable gender. While you might not think of a person's gender as a random variable, it is random from the viewpoint of econometric analysis. If you randomly select an individual, the gender of the individual is unknown and thus random. (In the population of U.S. workers, the probability that a worker is a woman happens to be 43%.) In observational data, it is most appropriate to view all measurements as random variables, and the means of subpopulations are then conditional means.
As the two densities in Figure 2.3 appear similar, a hasty inference might be that there is not a meaningful difference between the wage distributions of men and women. Before jumping to this conclusion let us examine the differences in the distributions of Figure 2.3 more carefully. As we mentioned above, the primary difference between the two densities appears to be their means. This difference equals

    E(log(wage) | gender = man) - E(log(wage) | gender = woman) = 3.05 - 2.81 = 0.24.    (2.3)
A difference in expected log wages of 0.24 implies an average 24% difference between the wages of men and women, which is quite substantial. (For an explanation of logarithmic and percentage differences see Section 2.4.)
Consider further splitting the men and women subpopulations by race, dividing the population into whites, blacks, and other races. We display the log wage density functions of four of these groups on the right in Figure 2.3. Again we see that the primary difference between the four density functions is their central tendency.
Table 2.1: Mean Log Wages by Sex and Race

              white    black    other
    men       3.07     2.86     3.03
    women     2.82     2.73     2.86

Focusing on the means of these distributions, Table 2.1 reports the mean log wage for each of the six sub-populations. The entries in Table 2.1 are the conditional means of log(wage) given gender and race. For example

    E(log(wage) | gender = man, race = white) = 3.07

and

    E(log(wage) | gender = woman, race = black) = 2.73.

One benefit of focusing on conditional means is that they reduce complicated distributions to a single summary measure, and thereby facilitate comparisons across groups. Because of this simplifying property, conditional means are the primary interest of regression analysis and are a major focus in econometrics.
Table 2.1 allows us to easily calculate average wage differences between groups. For example, we can see that the wage gap between men and women continues after disaggregation by race, as the average gap between white men and white women is 25%, and that between black men and black women is 13%. We also can see that there is a race gap, as the average wages of blacks are substantially less than the other race categories. In particular, the average wage gap between white men and black men is 21%, and that between white women and black women is 9%.
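Conditional means of this kind are simply subpopulation averages. A minimal Python sketch (the data are made up for illustration and do not reproduce Table 2.1) of how such cell means are computed:

import numpy as np

rng = np.random.default_rng(2)
n = 10_000
# Hypothetical categorical variables and log wages
gender = rng.choice(["man", "woman"], size=n, p=[0.57, 0.43])
race = rng.choice(["white", "black", "other"], size=n, p=[0.8, 0.1, 0.1])
logwage = 3.0 + 0.2 * (gender == "man") - 0.15 * (race == "black") + rng.normal(0, 0.5, n)

# Conditional mean E(log(wage) | gender = g, race = r) estimated by the cell average
for g in ["man", "woman"]:
    for r in ["white", "black", "other"]:
        cell = (gender == g) & (race == r)
        print(g, r, round(logwage[cell].mean(), 2))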
2.4 Log Differences*
For small x, the natural logarithm satisfies the approximation

    log(1 + x) ≈ x.    (2.4)
This can be derived from the infinite series expansion of log(1 + x):

    log(1 + x) = x - x²/2 + x³/3 - x⁴/4 + ···
               = x + O(x²).
The symbol O(x²) means that the remainder is bounded by Ax² as x → 0 for some A < ∞. A plot of log(1 + x) and the linear approximation x is shown in Figure 2.4. We can see that log(1 + x) and the linear approximation x are very close for |x| ≤ 0.1, and reasonably close for |x| ≤ 0.2, but the difference increases with |x|.
Now, if y* is c% greater than y, then

    y* = (1 + c/100) y.

Taking natural logarithms,

    log y* = log y + log(1 + c/100)

or

    log y* - log y = log(1 + c/100) ≈ c/100

where the approximation is (2.4). This shows that 100 multiplied by the difference in logarithms is approximately the percentage difference between y and y*, and this approximation is quite good for |c| ≤ 10.
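A quick numerical check of the approximation (2.4), sketched in Python (illustrative only):

import numpy as np

for c in [1, 5, 10, 25, 50]:                   # percentage differences
    x = c / 100
    exact = 100 * np.log(1 + x)                # 100 times the log difference
    print(c, round(exact, 2), "approx error:", round(c - exact, 2))

# The log-difference approximation is close for |c| <= 10 and deteriorates for larger |c|.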
2.5 Conditional Expectation Function
An important determinant of wage levels is education. In many empirical studies economists measure educational attainment by the number of years of schooling, and we will write this variable as education.[8] The conditional mean of log wages given gender, race, and education is a single number for each category. For example

    E(log(wage) | gender = man, race = white, education = 12) = 2.84.

We display in Figure 2.5 the conditional means of log(wage) for white men and white women as a function of education. The plot is quite revealing. We see that the conditional mean is increasing in years of education, but at a different rate for schooling levels above and below nine years. Another striking feature of Figure 2.5 is that the gap between men and women is roughly constant for all education levels. As the variables are measured in logs this implies a constant average percentage gap between men and women regardless of educational attainment.
In many cases it is convenient to simplify the notation by writing variables using single characters, typically y, x and/or z. It is conventional in econometrics to denote the dependent variable (e.g. log(wage)) by the letter y, a conditioning variable (such as gender) by the letter x, and multiple conditioning variables (such as race, education and gender) by the subscripted letters x1, x2, ..., xk.
Conditional expectations can be written with the generic notation

    E(y | x1, x2, ..., xk) = m(x1, x2, ..., xk).

We call this the conditional expectation function (CEF). The CEF is a function of (x1, x2, ..., xk) as it varies with the variables. For example, the conditional expectation of y = log(wage) given (x1, x2) = (gender, race) is given by the six entries of Table 2.1. The CEF is a function of (gender, race) as it varies across the entries.
[8] Here, education is defined as years of schooling beyond kindergarten. A high school graduate has education = 12, a college graduate has education = 16, a Master's degree has education = 18, and a professional degree (medical, law or PhD) has education = 20.
Figure 2.5: Mean Log Wage as a Function of Years of Education

For greater compactness, we will typically write the conditioning variables as a vector in R^k:

    x = (x1, x2, ..., xk)'.    (2.5)
Here we follow the convention of using lower case bold italics x to denote a vector. Given this notation, the CEF can be compactly written as E (y | x) = m (x) .
The CEF E (y | x) is a random variable as it is a function of the random variable x. It is also sometimes useful to view the CEF as a function of x. In this case we can write m (u) = E (y | x = u), which is a function of the argument u. The expression E (y | x = u) is the conditional expectation of y, given that we know that the random variable x equals the specic value u. However, sometimes in econometrics we take a notational shortcut and use E (y | x) to refer to this function. Hopefully, the use of E (y | x) should be apparent from the context.
2.6 Continuous Variables
In the previous sections, we implicitly assumed that the conditioning variables are discrete. However, many conditioning variables are continuous. In this section, we take up this case and assume that the variables (y, x) are continuously distributed with a joint density function f(y, x).
As an example, take y = log(wage) and x = experience, the number of years of potential labor market experience.[9] The contours of their joint density are plotted on the left side of Figure 2.6 for the population of white men with 12 years of education.
Given the joint density f(y, x) the variable x has the marginal density

    fx(x) = ∫_R f(y, x) dy.
[9] Here, experience is defined as potential labor market experience, equal to age - education - 6.
Figure 2.6: White men with education = 12

For any x such that fx(x) > 0 the conditional density of y given x is defined as

    fy|x(y | x) = f(y, x) / fx(x).    (2.6)
The conditional density is a slice of the joint density f(y, x) holding x fixed. We can visualize this by slicing the joint density function at a specific value of x parallel with the y-axis. For example, take the density contours on the left side of Figure 2.6 and slice through the contour plot at a specific value of experience. This gives us the conditional density of log(wage) for white men with 12 years of education and this level of experience. We do this for four levels of experience (5, 10, 25, and 40 years), and plot these densities on the right side of Figure 2.6. We can see that the distribution of wages shifts to the right and becomes more diffuse as experience increases from 5 to 10 years, and from 10 to 25 years, but there is little change from 25 to 40 years experience.
The CEF of y given x is the mean of the conditional density (2.6)

    m(x) = E(y | x) = ∫_R y fy|x(y | x) dy.    (2.7)
Intuitively, m(x) is the mean of y for the idealized subpopulation where the conditioning variables are fixed at x. This is idealized since x is continuously distributed so this subpopulation is infinitely small. In Figure 2.6 the CEF of log(wage) given experience is plotted as the solid line. We can see that the CEF is a smooth but nonlinear function. The CEF is initially increasing in experience, flattens out around experience = 30, and then decreases for high levels of experience.
2.7 Law of Iterated Expectations
An extremely useful tool from probability theory is the law of iterated expectations. An important special case is known as the Simple Law.
Theorem 2.7.1 Simple Law of Iterated Expectations
If E|y| < ∞ then for any random vector x,

    E(E(y | x)) = E(y).
The simple law states that the expectation of the conditional expectation is the unconditional expectation. In other words, the average of the conditional averages is the unconditional average. When x is discrete

    E(E(y | x)) = Σ_{j=1}^∞ E(y | xj) Pr(x = xj)

and when x is continuous

    E(E(y | x)) = ∫_{R^k} E(y | x) fx(x) dx.

Going back to our investigation of average log wages for men and women, the simple law states that

    E(log(wage) | gender = man) Pr(gender = man)
      + E(log(wage) | gender = woman) Pr(gender = woman)
      = E(log(wage)).

Or numerically,

    3.05 × 0.57 + 2.79 × 0.43 = 2.92.

The general law of iterated expectations allows two sets of conditioning variables.
Theorem 2.7.2 Law of Iterated Expectations
If E|y| < ∞ then for any random vectors x1 and x2,

    E(E(y | x1, x2) | x1) = E(y | x1).
Notice the way the law is applied. The inner expectation conditions on x1 and x2, while the outer expectation conditions only on x1. The iterated expectation yields the simple answer E(y | x1), the expectation conditional on x1 alone. Sometimes we phrase this as: "The smaller information set wins." As an example

    E(log(wage) | gender = man, race = white) Pr(race = white | gender = man)
      + E(log(wage) | gender = man, race = black) Pr(race = black | gender = man)
      + E(log(wage) | gender = man, race = other) Pr(race = other | gender = man)
      = E(log(wage) | gender = man)

or numerically

    3.07 × 0.84 + 2.86 × 0.08 + 3.05 × 0.08 = 3.05.

A property of conditional expectations is that when you condition on a random vector x you can effectively treat it as if it is constant. For example, E(x | x) = x and E(g(x) | x) = g(x) for any function g(·). The general property is known as the conditioning theorem.
Theorem 2.7.3 Conditioning Theorem
If E|g(x) y| < ∞ then

    E(g(x) y | x) = g(x) E(y | x)    (2.9)

and

    E(g(x) y) = E(g(x) E(y | x)).    (2.10)
The proofs of Theorems 2.7.1, 2.7.2 and 2.7.3 are given in Section 2.34.
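A small simulation sketch in Python (the group means and probabilities are hypothetical, chosen only for illustration) confirming both the simple and the general law of iterated expectations:

import numpy as np

rng = np.random.default_rng(3)
n = 200_000
gender = rng.choice(["man", "woman"], size=n, p=[0.57, 0.43])
race = rng.choice(["white", "black", "other"], size=n, p=[0.84, 0.08, 0.08])
y = 3.0 + 0.2 * (gender == "man") - 0.1 * (race == "black") + rng.normal(0, 0.5, n)

# Simple law: the average of the conditional averages equals the unconditional average
Ey_given_gender = {g: y[gender == g].mean() for g in ["man", "woman"]}
pr_gender = {g: np.mean(gender == g) for g in ["man", "woman"]}
iterated = sum(Ey_given_gender[g] * pr_gender[g] for g in ["man", "woman"])
print(iterated, y.mean())    # approximately equal

# General law: E(E(y | gender, race) | gender) = E(y | gender), illustrated for men
men = gender == "man"
inner = sum(y[men & (race == r)].mean() * np.mean(race[men] == r)
            for r in ["white", "black", "other"])
print(inner, y[men].mean())  # approximately equal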
2.8 CEF Error
The CEF error e is defined as the difference between y and the CEF evaluated at the random vector x:

    e = y - m(x).

By construction, this yields the formula

    y = m(x) + e.    (2.11)
In (2.11) it is useful to understand that the error e is derived from the joint distribution of (y, x), and so its properties are derived from this construction.
A key property of the CEF error is that it has a conditional mean of zero. To see this, by the linearity of expectations, the definition m(x) = E(y | x) and the Conditioning Theorem

    E(e | x) = E((y - m(x)) | x)
             = E(y | x) - E(m(x) | x)
             = m(x) - m(x)
             = 0.
This fact can be combined with the law of iterated expectations to show that the unconditional mean is also zero:

    E(e) = E(E(e | x)) = E(0) = 0.

We state this and some other results formally.

Theorem 2.8.1 Properties of the CEF error
If E|y| < ∞ then
1. E(e | x) = 0.
2. E(e) = 0.
3. If E|y|^r < ∞ for r ≥ 1 then E|e|^r < ∞.
4. For any function h(x) such that E|h(x) e| < ∞ then E(h(x) e) = 0.

The proof of the third result is deferred to Section 2.34. The fourth result, whose proof is left to Exercise 2.3, says that e is uncorrelated with any function of the regressors.
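The properties in Theorem 2.8.1 can be checked by simulation. A brief Python sketch (the nonlinear CEF m(x) and the error distribution below are invented for illustration):

import numpy as np

rng = np.random.default_rng(4)
n = 500_000
x = rng.uniform(0, 40, n)                        # e.g. years of experience
m = lambda x: 2.3 + 0.05 * x - 0.001 * x**2      # hypothetical CEF m(x) = E(y | x)
y = m(x) + rng.normal(0, 0.5, n) * (1 + 0.02 * x)   # heteroskedastic noise with conditional mean zero

e = y - m(x)                                     # CEF error
print(e.mean())                                  # E(e) approximately 0
print(np.mean(x * e), np.mean(np.sin(x) * e))    # E(h(x) e) approximately 0 for functions h(x)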
Figure 2.7: Joint density of CEF error e and experience for white men with education = 12

The equations

    y = m(x) + e
    E(e | x) = 0

together imply that m(x) is the CEF of y given x. It is important to understand that this is not a restriction. These equations hold true by definition.
The condition E(e | x) = 0 is implied by the definition of e as the difference between y and the CEF m(x). The equation E(e | x) = 0 is sometimes called a conditional mean restriction, since the conditional mean of the error e is restricted to equal zero. The property is also sometimes called mean independence, for the conditional mean of e is 0 and thus independent of x. However, it does not imply that the distribution of e is independent of x. Sometimes the assumption "e is independent of x" is added as a convenient simplification, but it is not a generic feature of the conditional mean. Typically and generally, e and x are jointly dependent, even though the conditional mean of e is zero.
As an example, the contours of the joint density of e and experience are plotted in Figure 2.7 for the same population as Figure 2.6. The error e has a conditional mean of zero for all values of experience, but the shape of the conditional distribution varies with the level of experience.
As a simple example of a case where x and e are mean independent yet dependent, let e = xε where x and ε are independent N(0, 1). Then conditional on x, the error e has the distribution N(0, x²). Thus E(e | x) = 0 and e is mean independent of x, yet e is not fully independent of x. Mean independence does not imply full independence.
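The example e = xε is easy to simulate directly; a short Python sketch (illustrative only):

import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
x = rng.normal(0, 1, n)
eps = rng.normal(0, 1, n)
e = x * eps                                  # mean independent of x, but not independent of x

print(np.mean(e))                            # approximately 0
# Conditional mean and variance of e within bins of x:
for lo, hi in [(-2, -1), (-0.5, 0.5), (1, 2)]:
    sel = (x > lo) & (x < hi)
    print(lo, hi, round(e[sel].mean(), 3), round(e[sel].var(), 3))
# The conditional mean stays near 0, while the conditional variance changes with x (it equals x^2 given x).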
2.9 Intercept-Only Model
A special case of the regression model is when there are no regressors x. In this case m(x) = E(y) = μ, the unconditional mean of y. We can still write an equation for y in the regression format:

    y = μ + e
    E(e) = 0.

This is useful for it unifies the notation.
2.10 Regression Variance
An important measure of the dispersion about the CEF function is the unconditional variance of the CEF error e. We write this as

    σ² = var(e) = E(e - Ee)² = E(e²).

Theorem 2.8.1.3 implies the following simple but useful result.
Theorem 2.10.1 If Ey² < ∞ then σ² < ∞.

We can call σ² the regression variance or the variance of the regression error. The magnitude of σ² measures the amount of variation in y which is not explained or accounted for in the conditional mean E(y | x).
The regression variance depends on the regressors x. Consider two regressions

    y = E(y | x1) + e1
    y = E(y | x1, x2) + e2.

We write the two errors distinctly as e1 and e2 because they are different: changing the conditioning information changes the conditional mean and therefore the regression error as well. In our discussion of iterated expectations, we have seen that by increasing the conditioning set, the conditional expectation reveals greater detail about the distribution of y. What is the implication for the regression error? It turns out that there is a simple relationship. We can think of the conditional mean E(y | x) as the explained portion of y. The remainder e = y - E(y | x) is the unexplained portion. The simple relationship we now derive shows that the variance of this unexplained portion decreases when we condition on more variables. This relationship is monotonic in the sense that increasing the amount of information always (weakly) decreases the variance of the unexplained portion.
Theorem 2.10.2 If Ey² < ∞ then

    var(y) ≥ var(y - E(y | x1)) ≥ var(y - E(y | x1, x2)).

Theorem 2.10.2 says that the variance of the difference between y and its conditional mean (weakly) decreases whenever an additional variable is added to the conditioning information. The proof of Theorem 2.10.2 is given in Section 2.34.
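The monotonicity described by Theorem 2.10.2 is easy to see in a simulation. A Python sketch (the data-generating process is hypothetical, chosen so that both CEFs are known exactly):

import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1 + x1 + 0.5 * x2 + rng.normal(0, 1, n)

# Here E(y | x1) = 1 + x1 and E(y | x1, x2) = 1 + x1 + 0.5*x2
e0 = y - y.mean()                 # no conditioning variables
e1 = y - (1 + x1)                 # condition on x1
e2 = y - (1 + x1 + 0.5 * x2)      # condition on (x1, x2)
print(e0.var(), e1.var(), e2.var())   # weakly decreasing: about 2.25 >= 1.25 >= 1.0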
2.11 Best Predictor
Suppose that given a realized value of x, we want to create a prediction or forecast of y. We can write any predictor as a function g(x) of x. The prediction error is the realized difference y - g(x). A non-stochastic measure of the magnitude of the prediction error is the expectation of its square

    E(y - g(x))².    (2.12)
We can define the best predictor as the function g(x) which minimizes (2.12). What function is the best predictor? It turns out that the answer is the CEF m(x). This holds regardless of the joint distribution of (y, x).
To see this, note that the mean squared error of a predictor g(x) is

    E(y - g(x))² = E(e + m(x) - g(x))²
                 = Ee² + 2E(e(m(x) - g(x))) + E(m(x) - g(x))²
                 = Ee² + E(m(x) - g(x))²
                 ≥ Ee²
                 = E(y - m(x))²

where the first equality makes the substitution y = m(x) + e and the third equality uses Theorem 2.8.1.4. The right-hand-side after the third equality is minimized by setting g(x) = m(x), yielding the inequality in the fourth line. The minimum is finite under the assumption Ey² < ∞ as shown by Theorem 2.10.1.
We state this formally in the following result.

Theorem 2.11.1 Conditional Mean as Best Predictor
If Ey² < ∞, then for any predictor g(x),

    E(y - g(x))² ≥ E(y - m(x))²

where m(x) = E(y | x).

It may be helpful to consider this result in the context of the intercept-only model

    y = μ + e
    E(e) = 0.

Theorem 2.11.1 shows that the best predictor for y (in the class of constant parameters) is the unconditional mean μ = E(y), in the sense that the mean minimizes the mean squared prediction error.
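Theorem 2.11.1 can be illustrated numerically: any predictor other than the CEF has (weakly) larger mean squared error. A Python sketch with an invented nonlinear CEF and an arbitrarily chosen competing linear predictor:

import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
x = rng.uniform(0, 10, n)
m = lambda x: np.log(1 + x)                      # hypothetical CEF
y = m(x) + rng.normal(0, 0.3, n)

mse_cef = np.mean((y - m(x))**2)                 # E(y - m(x))^2
mse_linear = np.mean((y - (0.8 + 0.15 * x))**2)  # some other (linear) predictor
mse_const = np.mean((y - y.mean())**2)           # the unconditional mean as predictor
print(mse_cef, mse_linear, mse_const)            # the CEF attains the smallest value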
2.12 Conditional Variance
While the conditional mean is a good measure of the location of a conditional distribution, it does not provide information about the spread of the distribution. A common measure of the dispersion is the conditional variance.

Definition 2.12.1 If Ey² < ∞, the conditional variance of y given x is

    σ²(x) = var(y | x) = E((y - E(y | x))² | x) = E(e² | x).

Generally, σ²(x) is a non-trivial function of x and can take any form subject to the restriction that it is non-negative. One way to think about σ²(x) is that it is the conditional mean of e² given x.
The variance is in a different unit of measurement than the original variable. To convert the variance back to the same unit of measure we define the conditional standard deviation as its square root, σ(x) = √(σ²(x)).
As an example of how the conditional variance depends on observables, compare the conditional log wage densities for men and women displayed in Figure 2.3. The difference between the densities is not purely a location shift, but is also a difference in spread. Specifically, we can see that the density for men's log wages is somewhat more spread out than that for women, while the density for women's wages is somewhat more peaked. Indeed, the conditional standard deviation for men's wages is 3.05 and that for women is 2.81. So while men have higher average wages, they are also somewhat more dispersed.
The unconditional error variance and the conditional variance are related by the law of iterated expectations

    σ² = E(e²) = E(E(e² | x)) = E(σ²(x)).

That is, the unconditional error variance is the average conditional variance.
Given the conditional variance, we can define a rescaled error

    ε = e / σ(x).    (2.13)
We can calculate that since σ(x) is a function of x,

    E(ε | x) = E(e/σ(x) | x) = (1/σ(x)) E(e | x) = 0

and

    var(ε | x) = E(ε² | x) = E(e²/σ²(x) | x) = (1/σ²(x)) E(e² | x) = σ²(x)/σ²(x) = 1.
Thus ε has a conditional mean of zero, and a conditional variance of 1. Notice that (2.13) can be rewritten as

    e = σ(x) ε,

and substituting this for e in the CEF equation (2.11), we find that

    y = m(x) + σ(x) ε.    (2.14)
This is an alternative (mean-variance) representation of the CEF equation.
Many econometric studies focus on the conditional mean m(x) and either ignore the conditional variance σ²(x), treat it as a constant σ²(x) = σ², or treat it as a nuisance parameter (a parameter not of primary interest). This is appropriate when the primary variation in the conditional distribution is in the mean, but can be short-sighted in other cases. Dispersion is relevant to many economic topics, including income and wealth distribution, economic inequality, and price dispersion. Conditional dispersion (variance) can be a fruitful subject for investigation.
The perverse consequences of a narrow-minded focus on the mean have been parodied in a classic joke:
An economist was standing with one foot in a bucket of boiling water and the other foot in a bucket of ice. When asked how he felt, he replied, "On average I feel just fine."
2.13 Homoskedasticity and Heteroskedasticity
An important special case obtains when the conditional variance σ²(x) is a constant and independent of x. This is called homoskedasticity.

Definition 2.13.1 The error is homoskedastic if E(e² | x) = σ² does not depend on x.

In the general case where σ²(x) depends on x we say that the error e is heteroskedastic.

Definition 2.13.2 The error is heteroskedastic if E(e² | x) = σ²(x) depends on x.

It is helpful to understand that the concepts homoskedasticity and heteroskedasticity concern the conditional variance, not the unconditional variance. By definition, the unconditional variance σ² is a constant and independent of the regressors x. So when we talk about the variance as a function of the regressors, we are talking about the conditional variance σ²(x).
Some older or introductory textbooks describe heteroskedasticity as the case where "the variance of e varies across observations." This is a poor and confusing definition. It is more constructive to understand that heteroskedasticity means that the conditional variance σ²(x) depends on observables.
Older textbooks also tend to describe homoskedasticity as a component of a correct regression specification, and describe heteroskedasticity as an exception or deviance. This description has influenced many generations of economists, but it is unfortunately backwards. The correct view is that heteroskedasticity is generic and standard, while homoskedasticity is unusual and exceptional. The default in empirical work should be to assume that the errors are heteroskedastic, not the converse.
In apparent contradiction to the above statement, we will still frequently impose the homoskedasticity assumption when making theoretical investigations into the properties of estimation and inference methods. The reason is that in many cases homoskedasticity greatly simplifies the theoretical calculations, and it is therefore quite advantageous for teaching and learning. It should always be remembered, however, that homoskedasticity is never imposed because it is believed to be a correct feature of an empirical model, but rather because of its simplicity.
2.14 Regression Derivative
One way to interpret the CEF m(x) = E(y | x) is in terms of how marginal changes in the regressors x imply changes in the conditional mean of the response variable y. It is typical to consider marginal changes in a single regressor, say x1, holding the remainder fixed. When a regressor x1 is continuously distributed, we define the marginal effect of a change in x1, holding the variables x2, ..., xk fixed, as the partial derivative of the CEF

    ∂m(x1, ..., xk)/∂x1.

When x1 is discrete we define the marginal effect as a discrete difference. For example, if x1 is binary, then the marginal effect of x1 on the CEF is

    m(1, x2, ..., xk) - m(0, x2, ..., xk).
We can unify the continuous and discrete cases with the notation

    ∇1 m(x) = ∂m(x1, ..., xk)/∂x1                       if x1 is continuous
    ∇1 m(x) = m(1, x2, ..., xk) - m(0, x2, ..., xk)     if x1 is binary.
Collecting the k effects into one k × 1 vector, we define the regression derivative with respect to x:

    ∇m(x) = (∇1 m(x), ∇2 m(x), ..., ∇k m(x))'.

When all elements of x are continuous, then we have the simplification ∇m(x) = ∂m(x)/∂x, the vector of partial derivatives.
There are two important points to remember concerning our definition of the regression derivative.
First, the effect of each variable is calculated holding the other variables constant. This is the ceteris paribus concept commonly used in economics. But in the case of a regression derivative, the conditional mean does not literally hold all else constant. It only holds constant the variables included in the conditional mean. This means that the regression derivative depends on which regressors are included. For example, in a regression of wages on education, experience, race and gender, the regression derivative with respect to education shows the marginal effect of education on mean wages, holding constant experience, race and gender. But it does not hold constant an individual's unobservable characteristics (such as ability), or variables not included in the regression (such as the quality of education).
Second, the regression derivative is the change in the conditional expectation of y, not the change in the actual value of y for an individual. It is tempting to think of the regression derivative as the change in the actual value of y, but this is not a correct interpretation. The regression derivative ∇m(x) is the change in the actual value of y only if the error e is unaffected by the change in the regressor x. We return to a discussion of causal effects in Section 2.30.
2.15 Linear CEF
An important special case is when the CEF m(x) = E(y | x) is linear in x. In this case we can write the mean equation as

    m(x) = x1β1 + x2β2 + ··· + xkβk + βk+1.

Notationally it is convenient to write this as a simple function of the vector x. An easy way to do so is to augment the regressor vector x by listing the number 1 as an element. We call this the constant and the corresponding coefficient is called the intercept. Equivalently, specify that the final element of the vector x is xk = 1. Thus (2.5) has been redefined as the k × 1 vector

    x = (x1, x2, ..., xk-1, 1)'.    (2.15)
With this redefinition, the CEF is

    m(x) = x1β1 + x2β2 + ··· + xkβk = x'β    (2.16)

where

    β = (β1, ..., βk)'    (2.17)

is a k × 1 coefficient vector. This is the linear CEF model. It is also often called the linear regression model, or the regression of y on x.
In the linear CEF model, the regression derivative is simply the coefficient vector. That is,

    ∇m(x) = β.

This is one of the appealing features of the linear CEF model. The coefficients have simple and natural interpretations as the marginal effects of changing one variable, holding the others constant.
If in addition the error is homoskedastic, we call this the homoskedastic linear CEF model.
2.16 Linear CEF with Nonlinear Effects
The linear CEF model of the previous section is less restrictive than it might appear, as we can include as regressors nonlinear transformations of the original variables. In this sense, the linear CEF framework is flexible and can capture many nonlinear effects.
For example, suppose we have two scalar variables x1 and x2. The CEF could take the quadratic form

    m(x1, x2) = x1β1 + x2β2 + x1²β3 + x2²β4 + x1x2β5 + β6.    (2.18)

This equation is quadratic in the regressors (x1, x2) yet linear in the coefficients β = (β1, ..., β6)'. We will descriptively call (2.18) a quadratic CEF, and yet (2.18) is also a linear CEF in the sense of being linear in the coefficients. The key is to understand that (2.18) is quadratic in the variables (x1, x2) yet linear in the coefficients β.
To simplify the expression, we define the transformations x3 = x1², x4 = x2², x5 = x1x2, and x6 = 1, and redefine the regressor vector as x = (x1, ..., x6)'. With this redefinition,
    m(x1, x2) = x'β

which is linear in β. For most econometric purposes (estimation and inference on β) the linearity in β is all that is important.
An exception is in the analysis of regression derivatives. In nonlinear equations such as (2.18), the regression derivative should be defined with respect to the original variables, not with respect to the transformed variables. Thus

    ∂m(x1, x2)/∂x1 = β1 + 2x1β3 + x2β5
    ∂m(x1, x2)/∂x2 = β2 + 2x2β4 + x1β5.

We see that in the model (2.18), the regression derivatives are not a simple coefficient, but are functions of several coefficients plus the levels of (x1, x2). Consequently it is difficult to interpret the coefficients individually. It is more useful to interpret them as a group.
We typically call β5 the interaction effect. Notice that it appears in both regression derivative equations, and has a symmetric interpretation in each. If β5 > 0 then the regression derivative with respect to x1 is increasing in the level of x2 (and the regression derivative with respect to x2 is increasing in the level of x1), while if β5 < 0 the reverse is true. It is worth noting that this symmetry is an artificial implication of the quadratic equation (2.18), and is not a general feature of nonlinear conditional means m(x1, x2).
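A short Python sketch (with made-up coefficient values β1, ..., β6) evaluating the regression derivatives of the quadratic CEF (2.18), both analytically and by numerical differentiation:

import numpy as np

b1, b2, b3, b4, b5, b6 = 0.10, 0.05, -0.002, -0.001, 0.03, 2.0   # hypothetical coefficients
m = lambda x1, x2: x1*b1 + x2*b2 + x1**2*b3 + x2**2*b4 + x1*x2*b5 + b6

x1, x2 = 5.0, 2.0
d1_analytic = b1 + 2*x1*b3 + x2*b5            # d m / d x1
d2_analytic = b2 + 2*x2*b4 + x1*b5            # d m / d x2

h = 1e-6                                      # numerical check by central finite differences
d1_numeric = (m(x1 + h, x2) - m(x1 - h, x2)) / (2*h)
d2_numeric = (m(x1, x2 + h) - m(x1, x2 - h)) / (2*h)
print(d1_analytic, d1_numeric)
print(d2_analytic, d2_numeric)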
2.17 Linear CEF with Dummy Variables
When all regressors take a finite set of values, it turns out the CEF can be written as a linear function of regressors.
The simplest example is a binary variable, which takes only two distinct values. For example, the variable gender takes only the values man and woman. Binary variables are extremely common in econometric applications, and are alternatively called dummy variables or indicator variables.
Consider the simple case of a single binary regressor. In this case, the conditional mean can only take two distinct values. For example,

    E(y | gender) = μ0 if gender = man
    E(y | gender) = μ1 if gender = woman.

To facilitate a mathematical treatment, we typically record dummy variables with the values {0, 1}. For example

    x1 = 0 if gender = man
    x1 = 1 if gender = woman.    (2.19)
Given this notation we can write the conditional mean as a linear function of the dummy variable x1, that is

    E(y | x1) = β1x1 + β2

where β1 = μ1 - μ0 and β2 = μ0. In this simple regression equation the intercept β2 is equal to the conditional mean of y for the x1 = 0 subpopulation (men) and the slope β1 is equal to the difference in the conditional means between the two subpopulations.
Equivalently, we could have defined x1 as

    x1 = 1 if gender = man
    x1 = 0 if gender = woman.    (2.20)
In this case, the regression intercept is the mean for women (rather than for men) and the regression slope has switched signs. The two regressions are equivalent but the interpretation of the coefficients has changed.
Therefore it is always important to understand the precise definitions of the variables, and illuminating labels are helpful. For example, labelling x1 as gender does not help distinguish between definitions (2.19) and (2.20). Instead, it is better to label x1 as women or female if definition (2.19) is used, or as men or male if (2.20) is used.
Now suppose we have two dummy variables x1 and x2. For example, x2 = 1 if the person is married, else x2 = 0. The conditional mean given x1 and x2 takes at most four possible values:

    E(y | x1, x2) = μ00 if x1 = 0 and x2 = 0 (unmarried men)
                    μ01 if x1 = 0 and x2 = 1 (married men)
                    μ10 if x1 = 1 and x2 = 0 (unmarried women)
                    μ11 if x1 = 1 and x2 = 1 (married women)

In this case we can write the conditional mean as a linear function of x1, x2 and their product x1x2:

    E(y | x1, x2) = β1x1 + β2x2 + β3x1x2 + β4
where β1 = μ10 - μ00, β2 = μ01 - μ00, β3 = μ11 - μ10 - μ01 + μ00, and β4 = μ00.
We can view the coefficient β1 as the effect of gender on expected log wages for unmarried wage earners, the coefficient β2 as the effect of marriage on expected log wages for men wage earners, and the coefficient β3 as the difference between the effects of marriage on expected log wages among women and among men. Alternatively, it can also be interpreted as the difference between the effects of gender on expected log wages among married and non-married wage earners. Both interpretations are equally valid. We often describe β3 as measuring the interaction between the two dummy variables, or the interaction effect, and describe β3 = 0 as the case when the interaction effect is zero.
In this setting we can see that the CEF is linear in the three variables (x1, x2, x1x2). Thus to put the model in the framework of Section 2.15, we would define the regressor x3 = x1x2 and the regressor vector as

    x = (x1, x2, x3, 1)'.

So even though we started with only 2 dummy variables, the number of regressors (including the intercept) is 4.
If there are 3 dummy variables x1, x2, x3, then E(y | x1, x2, x3) takes at most 2³ = 8 distinct values and can be written as the linear function

    E(y | x1, x2, x3) = β1x1 + β2x2 + β3x3 + β4x1x2 + β5x1x3 + β6x2x3 + β7x1x2x3 + β8

which has eight regressors including the intercept.
In general, if there are p dummy variables x1, ..., xp then the CEF E(y | x1, x2, ..., xp) takes at most 2^p distinct values, and can be written as a linear function of the 2^p regressors including x1, x2, ..., xp and all cross-products. This might be excessive in practice if p is modestly large. In the next section we will discuss projection approximations which yield more parsimonious parameterizations.
We started this section by saying that the conditional mean is linear whenever all regressors take only a finite number of possible values. How can we see this? Take a categorical variable,
such as race. For example, we earlier divided race into three categories. We can record categorical variables using numbers to indicate each category, for example

    x3 = 1 if white
         2 if black
         3 if other

When doing so, the values of x3 have no meaning in terms of magnitude; they simply indicate the relevant category. When the regressor is categorical the conditional mean of y given x3 takes a distinct value for each possibility:

    E(y | x3) = μ1 if x3 = 1
                μ2 if x3 = 2
                μ3 if x3 = 3

This is not a linear function of x3 itself, but it can be made a linear function by constructing dummy variables for two of the three categories. For example

    x4 = 1 if black          x5 = 1 if other
         0 if not black           0 if not other

In this case, the categorical variable x3 is equivalent to the pair of dummy variables (x4, x5). The explicit relationship is

    x3 = 1 if x4 = 0 and x5 = 0
         2 if x4 = 1 and x5 = 0
         3 if x4 = 0 and x5 = 1

Given these transformations, we can write the conditional mean of y as a linear function of x4 and x5:

    E(y | x3) = E(y | x4, x5) = β1x4 + β2x5 + β3.

We can write the CEF as either E(y | x3) or E(y | x4, x5) (they are equivalent), but it is only linear as a function of x4 and x5. This setting is similar to the case of two dummy variables, with the difference that we have not included the interaction term x4x5. This is because the event {x4 = 1 and x5 = 1} is empty by construction, so x4x5 = 0 by definition.
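A brief Python sketch (the three category means below are invented, and sample averages stand in for population moments) showing that the two dummy variables reproduce the three conditional means exactly:

import numpy as np

rng = np.random.default_rng(8)
n = 100_000
x3 = rng.choice([1, 2, 3], size=n, p=[0.8, 0.1, 0.1])   # 1 = white, 2 = black, 3 = other
mu = np.array([np.nan, 3.05, 2.86, 3.03])               # hypothetical group means, indexed by category
y = mu[x3] + rng.normal(0, 0.5, n)

x4 = (x3 == 2).astype(float)    # black dummy
x5 = (x3 == 3).astype(float)    # other dummy
X = np.column_stack([x4, x5, np.ones(n)])

beta = np.linalg.solve(X.T @ X, X.T @ y)     # fit of y on (x4, x5, 1) using sample moments
print(beta)                                  # approximately (mu2 - mu1, mu3 - mu1, mu1)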
2.18 Best Linear Predictor
While the conditional mean m(x) = E(y | x) is the best predictor of y among all functions of x, its functional form is typically unknown. In particular, the linear CEF model is empirically unlikely to be accurate unless x is discrete and low-dimensional so all interactions are included. Consequently in most cases it is more realistic to view the linear specification (2.16) as an approximation. In this section we derive a specific approximation with a simple interpretation.
Theorem 2.11.1 showed that the conditional mean m(x) is the best predictor in the sense that it has the lowest mean squared error among all predictors. By extension, we can define an approximation to the CEF by the linear function with the lowest mean squared error among all linear predictors.
For this derivation we require the following regularity condition.
Assumption 2.18.1
1. Ey² < ∞.
2. E||x||² < ∞.
3. Qxx = E(xx') is positive definite.

In Assumption 2.18.1.2 we use the notation ||x|| = (x'x)^(1/2) to denote the Euclidean length of the vector x. The first two parts of Assumption 2.18.1 imply that the variables y and x have finite means, variances, and covariances. The third part of the assumption is more technical, and its role will become apparent shortly. It is equivalent to imposing that the columns of Qxx = E(xx') are linearly independent, or equivalently that the matrix Qxx is invertible.
A linear predictor for y is a function of the form x'β for some β ∈ R^k. The mean squared prediction error is

    S(β) = E(y - x'β)².

The best linear predictor of y given x, written P(y | x), is found by selecting the vector β to minimize S(β).

Definition 2.18.1 The Best Linear Predictor of y given x is

    P(y | x) = x'β

where β minimizes the mean squared prediction error S(β) = E(y - x'β)². The minimizer

    β = argmin_{β ∈ R^k} S(β)    (2.21)

is called the Linear Projection Coefficient.
We now calculate an explicit expression for its value. The mean squared prediction error can be written out as a quadratic function of β:

    S(β) = Ey² - 2β'E(xy) + β'E(xx')β.

The quadratic structure of S(β) means that we can solve explicitly for the minimizer. The first-order condition for minimization (from Appendix A.9) is

    0 = ∂S(β)/∂β = -2E(xy) + 2E(xx')β.    (2.22)

Rewriting (2.22) as

    2E(xy) = 2E(xx')β

and dividing by 2, this equation takes the form

    Qxy = Qxx β    (2.23)

where Qxy = E(xy) is k × 1 and Qxx = E(xx') is k × k. The solution is found by inverting the matrix Qxx, and is written β = Qxx⁻¹ Qxy, or

    β = (E(xx'))⁻¹ E(xy).    (2.24)

It is worth taking the time to understand the notation involved in the expression (2.24). Qxx is a k × k matrix and Qxy is a k × 1 column vector. Therefore, alternative expressions such as E(xy)/E(xx') or E(xy)(E(xx'))⁻¹ are incoherent and incorrect. We also can now see the role of Assumption 2.18.1.3. It is necessary in order for the solution (2.24) to exist. Otherwise, there would be multiple solutions to the equation (2.23).
We now have an explicit expression for the best linear predictor:

    P(y | x) = x'(E(xx'))⁻¹ E(xy).    (2.25)

This expression is also referred to as the linear projection of y on x.
The projection error is

    e = y - x'β.

This equals the error from the regression equation when (and only when) the conditional mean is linear in x; otherwise they are distinct. Rewriting, we obtain a decomposition of y into linear predictor and error

    y = x'β + e.    (2.26)
In general we call equation (2.26) or x'β the best linear predictor of y given x, or the linear projection of y on x. Equation (2.26) is also often called the regression of y on x but this can sometimes be confusing as economists use the term regression in many contexts. (Recall that we said in Section 2.15 that the linear CEF model is also called the linear regression model.)
An important property of the projection error e is

    E(xe) = 0.    (2.27)
To see this, using the definitions (2.25) and (2.24) and the matrix properties AA⁻¹ = I and Ia = a,

    E(xe) = E(x(y - x'(E(xx'))⁻¹E(xy)))
          = E(xy) - E(xx')(E(xx'))⁻¹E(xy)
          = 0    (2.28)

as claimed. Equation (2.27) is a set of k equations, one for each regressor. In other words, (2.27) is equivalent to

    E(xj e) = 0    (2.29)

for j = 1, ..., k. As in (2.15), the regressor vector x typically contains a constant, e.g. xk = 1. In this case (2.29) for j = k is the same as

    E(e) = 0.    (2.30)

Thus the projection error has a mean of zero when the regressor vector contains a constant. (When x does not have a constant, (2.30) is not guaranteed. As it is desirable for e to have a zero mean, this is a good reason to always include a constant in any regression model.)
It is also useful to observe that since cov(xj, e) = E(xj e) - E(xj) E(e), then (2.29)-(2.30) together imply that the variables xj and e are uncorrelated.
This completes the derivation of the model. We summarize some of the most important properties.

Theorem 2.18.1 Properties of Linear Projection Model
Under Assumption 2.18.1,
1. The moments E(xx') and E(xy) exist with finite elements.
2. The Linear Projection Coefficient (2.21) exists, is unique, and equals

    β = (E(xx'))⁻¹ E(xy).

3. The best linear predictor of y given x is

    P(y | x) = x'(E(xx'))⁻¹ E(xy).

4. The projection error e = y - x'β exists and satisfies E(e²) < ∞ and E(xe) = 0.
5. If x contains a constant, then E(e) = 0.
6. If E|y|^r < ∞ and E||x||^r < ∞ for r ≥ 2 then E|e|^r < ∞.
A complete proof of Theorem 2.18.1 is given in Section 2.34.
It is useful to reflect on the generality of Theorem 2.18.1. The only restriction is Assumption 2.18.1. Thus for any random variables (y, x) with finite variances we can define a linear equation (2.26) with the properties listed in Theorem 2.18.1. Stronger assumptions (such as the linear CEF model) are not necessary. In this sense the linear model (2.26) exists quite generally. However, it is important not to misinterpret the generality of this statement. The linear equation (2.26) is defined as the best linear predictor. It is not necessarily a conditional mean, nor a parameter of a structural or causal economic model.
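A Python sketch of the population formula β = (E(xx'))⁻¹E(xy), with expectations replaced by sample averages from a large simulated population (the data-generating process is hypothetical, and the CEF is deliberately nonlinear so that the projection is only an approximation):

import numpy as np

rng = np.random.default_rng(9)
n = 1_000_000
x1 = rng.uniform(0, 10, n)
y = np.log(1 + x1) + rng.normal(0, 0.3, n)      # nonlinear CEF, so x'beta is an approximation
x = np.column_stack([x1, np.ones(n)])           # regressor vector includes a constant

Qxx = (x.T @ x) / n                             # approximates E(xx')
Qxy = (x.T @ y) / n                             # approximates E(xy)
beta = np.linalg.solve(Qxx, Qxy)                # beta = Qxx^{-1} Qxy

e = y - x @ beta                                # projection error
print(beta)
print((x.T @ e) / n)                            # E(xe) approximately 0 even though the CEF is not linear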
Linear Projection Model

    y = x'β + e
    E(xe) = 0
    β = (E(xx'))⁻¹ E(xy)

We illustrate projection using three log wage equations introduced in earlier sections.
For our first example, we consider a model with the two dummy variables for gender and race similar to Table 2.1. As we learned in Section 2.17, the entries in this table can be equivalently expressed by a linear CEF. For simplicity, let's consider the CEF of log(wage) as a function of Black and Female:

    E(log(wage) | Black, Female) = -0.20 Black - 0.24 Female + 0.10 Black × Female + 3.06.    (2.31)

This is a CEF as the variables are dummies and all interactions are included.
Now consider a simpler model omitting the interaction effect. This is the linear projection on the variables Black and Female:

    P(log(wage) | Black, Female) = -0.15 Black - 0.23 Female + 3.06.    (2.32)
What is the difference? The full CEF (2.31) shows that the race gap is differentiated by gender: it is 20% for black men (relative to non-black men) and 10% for black women (relative to non-black women). The projection model (2.32) simplifies this analysis, calculating an average 15% wage gap for blacks, ignoring the role of gender. Notice that this is despite the fact that the gender variable is included in (2.32).
Figure 2.8: Projections of log(wage) onto Education

For our second example we consider the CEF of log wages as a function of years of education for white men which was illustrated in Figure 2.5 and is repeated in Figure 2.8. Superimposed on the figure are two projections. The first (given by the dashed line) is the linear projection of log wages on years of education

    P(log(wage) | Education) = 0.11 Education + 1.5.

This simple equation indicates an average 11% increase in wages for every year of education. An inspection of the figure shows that this approximation works well for education ≥ 9, but underpredicts for individuals with lower levels of education. To correct this imbalance we use a linear spline equation which allows different rates of return above and below 9 years of education:

    P(log(wage) | Education, (Education - 9) × 1(Education > 9))
This equation is displayed in Figure 2.8 using the solid line, and appears to fit much better. It indicates a 2% increase in mean wages for every year of education below 9, and a 12% increase in
mean wages for every year of education above 9. It is still an approximation to the conditional mean but it appears to be fairly reasonable.

Figure 2.9: Linear and Quadratic Projections of log(wage) onto Experience

For our third example we take the CEF of log wages as a function of years of experience for white men with 12 years of education, which was illustrated in Figure 2.6 and is repeated as the solid line in Figure 2.9. Superimposed on the figure are two projections. The first (given by the dot-dashed line) is the linear projection on experience

    P(log(wage) | Experience) = 0.011 Experience + 2.5

and the second (given by the dashed line) is the linear projection on experience and its square

    P(log(wage) | Experience) = 0.046 Experience - 0.0007 Experience² + 2.3.

It is fairly clear from an examination of Figure 2.9 that the first linear projection is a poor approximation. It over-predicts wages for young and old workers, and under-predicts for the rest. Most importantly, it misses the strong downturn in expected wages for older wage-earners. The second projection fits much better. We can call this equation a quadratic projection since the function is quadratic in experience.
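A Python sketch in the spirit of Figure 2.9 (the simulated hump-shaped "experience profile" is invented, and sample averages stand in for population moments) contrasting the linear and quadratic projections:

import numpy as np

rng = np.random.default_rng(10)
n = 500_000
exper = rng.uniform(0, 45, n)
y = 2.3 + 0.04 * exper - 0.0007 * exper**2 + rng.normal(0, 0.5, n)   # hypothetical concave CEF

def project(y, X):
    # best linear predictor coefficients with moments replaced by sample averages
    return np.linalg.solve(X.T @ X / len(y), X.T @ y / len(y))

X_lin = np.column_stack([exper, np.ones(n)])
X_quad = np.column_stack([exper, exper**2, np.ones(n)])
b_lin, b_quad = project(y, X_lin), project(y, X_quad)

mse = lambda X, b: np.mean((y - X @ b)**2)
print(b_lin, mse(X_lin, b_lin))
print(b_quad, mse(X_quad, b_quad))   # the quadratic projection has smaller mean squared error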
Invertibility and Identification

The linear projection coefficient β = (E(xx'))⁻¹E(xy) exists and is unique as long as the k × k matrix Qxx = E(xx') is invertible. The matrix Qxx is sometimes called the design matrix, as in experimental settings the researcher is able to control Qxx by manipulating the distribution of the regressors x.
Observe that for any non-zero α ∈ R^k,

    α'Qxx α = E(α'xx'α) = E(α'x)² ≥ 0

so Qxx by construction is positive semi-definite. The assumption that it is positive definite means that this is a strict inequality, E(α'x)² > 0. Equivalently, there cannot exist a non-zero vector α such that α'x = 0 identically. This occurs when redundant variables are included in x. Positive semi-definite matrices are invertible if and only if they are positive definite. When Qxx is invertible then β = (E(xx'))⁻¹E(xy) exists and is uniquely defined. In other words, in order for β to be uniquely defined, we must exclude the degenerate situation of redundant variables.
Theorem 2.18.1 shows that the linear projection coefficient β is identified (uniquely determined) under Assumption 2.18.1. The key is invertibility of Qxx. Otherwise, there is no unique solution to the equation

    Qxx β = Qxy.    (2.33)

When Qxx is not invertible there are multiple solutions to (2.33), all of which yield an equivalent best linear predictor x'β. In this case the coefficient β is not identified as it does not have a unique value. Even so, the best linear predictor x'β is still identified. One solution is to set

    β = (E(xx'))⁻ E(xy)

where A⁻ denotes the generalized inverse of A (see Appendix A.5).
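The identification failure caused by redundant regressors is easy to reproduce numerically. A Python sketch (illustrative; the data-generating process is made up) in which one regressor is an exact linear combination of the others, so Qxx is singular, while a generalized (pseudo) inverse still delivers the same best linear predictor:

import numpy as np

rng = np.random.default_rng(11)
n = 100_000
x1 = rng.normal(0, 1, n)
x2 = 2.0 * x1 + 1.0                        # redundant: exact linear function of x1 and the constant
y = 1.0 + 0.5 * x1 + rng.normal(0, 1, n)
X = np.column_stack([x1, x2, np.ones(n)])

Qxx = X.T @ X / n
Qxy = X.T @ y / n
print(np.linalg.matrix_rank(Qxx))          # rank 2 < 3: Qxx is not invertible, beta is not identified

beta_pinv = np.linalg.pinv(Qxx) @ Qxy      # one of the many solutions to Qxx beta = Qxy
beta_alt = beta_pinv + np.array([2.0, -1.0, 1.0])   # another solution: (2, -1, 1)' is in the null space of X
print(beta_pinv, beta_alt)
print(np.max(np.abs(X @ beta_pinv - X @ beta_alt))) # identical predictions up to rounding: x'beta is identified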
2.19 Linear Predictor Error Variance
As in the CEF model, we define the error variance as

    σ² = E(e²).

Setting Qyy = E(y²) and Qyx = E(yx') we can write σ² as

    σ² = E(y - x'β)²
       = Ey² - 2E(yx')β + β'E(xx')β
       = Qyy - Qyx Qxx⁻¹ Qxy
       ≡ Qyy·x.    (2.34)

One useful feature of this formula is that it shows that Qyy·x = Qyy - Qyx Qxx⁻¹ Qxy equals the variance of the error from the linear projection of y on x.
2.20 Regression Coefficients
Sometimes it is useful to separate the constant from the other regressors, and write the linear projection equation in the format

    y = x'β + α + e    (2.35)

where α is the intercept and x does not contain a constant. Taking expectations of this equation, we find

    Ey = E(x'β) + Eα + Ee

or

    μy = μx'β + α

where μy = Ey and μx = Ex, since E(e) = 0 from (2.30). (While x does not contain a constant, the equation does, so (2.30) still applies.) Rearranging, we find

    α = μy - μx'β.

Subtracting this equation from (2.35) we find

    y - μy = (x - μx)'β + e,    (2.36)
a linear equation between the centered variables y - μy and x - μx. (They are centered at their means, so are mean-zero random variables.) Because x - μx is uncorrelated with e, (2.36) is also a linear projection, thus by the formula for the linear projection model,

    β = (E((x - μx)(x - μx)'))⁻¹ E((x - μx)(y - μy))
      = var(x)⁻¹ cov(x, y),

a function only of the covariances of x and y.
Theorem 2.20.1 In the linear projection model

    y = x'β + α + e,

then

    α = μy - μx'β    (2.37)

and

    β = var(x)⁻¹ cov(x, y).    (2.38)
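A Python sketch (simulated data, illustrative only; sample moments stand in for population moments) checking the formulas (2.37)-(2.38) against the coefficients from the direct projection on (x, 1):

import numpy as np

rng = np.random.default_rng(12)
n = 1_000_000
x = np.column_stack([rng.normal(1, 2, n), rng.uniform(0, 4, n)])   # two non-constant regressors
y = 0.5 + x @ np.array([0.3, -0.2]) + rng.normal(0, 1, n)

# Direct projection of y on (x, 1): coefficients ordered (beta, alpha)
X = np.column_stack([x, np.ones(n)])
coef = np.linalg.solve(X.T @ X, X.T @ y)

# Formulas (2.37)-(2.38): beta = var(x)^{-1} cov(x, y), alpha = mu_y - mu_x' beta
Vx = np.cov(x, rowvar=False)                                 # var(x)
Cxy = np.cov(np.column_stack([x, y]), rowvar=False)[:2, 2]   # cov(x, y)
beta = np.linalg.solve(Vx, Cxy)
alpha = y.mean() - x.mean(axis=0) @ beta
print(coef)
print(beta, alpha)                                           # essentially the same values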
2.21 Regression Sub-Vectors
Partition the regressor vector x into two sub-vectors as

    x = (x1', x2')'.    (2.39)