Econometric Data Science:
A Predictive Modeling Approach
Francis X. Diebold
University of Pennsylvania
Version 2024.08.04
Copyright © 2013-2024
by Francis X. Diebold.
This work is freely available for your use, although preliminary and always evolving. It is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. (Briefly: I retain copyright, but you can use, copy and distribute non-commercially, so long as you give me attribution and do not modify. To view a copy of the license, go to http://creativecommons.org/licenses/by-nc-nd/4.0/.) In return I ask that you please cite the book whenever appropriate, as: “Diebold, F.X. (2024), Econometric Data Science: A Predictive Modeling Approach, Department of Economics, University of Pennsylvania, http://www.ssc.upenn.edu/~fdiebold/Textbooks.html.”
To my undergraduates,
who continually surprise and inspire me
Brief Table of Contents
Preface xxix
I Beginnings 1
1 Introduction to Econometrics 3
II Cross Sections 29
4 Non-Normality 73
6 Nonlinearity 103
7 Heteroskedasticity 121
12 Forecasting 237
V Appendices 293
Bibliography 307
Index 307
Detailed Table of Contents
Preface xxix
I Beginnings 1
1 Introduction to Econometrics 3
1.1 Welcome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Who Uses Econometrics? . . . . . . . . . . . . . . . . . 3
1.1.2 What Distinguishes Econometrics? . . . . . . . . . . . 5
1.2 Types of Recorded Economic Data . . . . . . . . . . . . . . . 5
1.3 Online Information and Data . . . . . . . . . . . . . . . . . . 6
1.4 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Tips on How to use this book . . . . . . . . . . . . . . . . . . 8
1.6 Exercises, Problems and Complements . . . . . . . . . . . . . 10
1.7 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.7 Graphics Legend: Edward Tufte . . . . . . . . . . . . . . . . . 27
II Cross Sections 29
4 Non-Normality 73
4.0.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.1 Assessing Normality . . . . . . . . . . . . . . . . . . . . . . . 74
4.1.1 QQ Plots . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.1.2 Residual Sample Skewness and Kurtosis . . . . . . . . 75
4.1.3 The Jarque-Bera Test . . . . . . . . . . . . . . . . . . 75
4.2 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2.1 Outlier Detection . . . . . . . . . . . . . . . . . . . . . 76
Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Leave-One-Out and Leverage . . . . . . . . . . . . . . 76
4.3 Robust Estimation . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3.1 Robustness Iteration . . . . . . . . . . . . . . . . . . . 77
4.3.2 Least Absolute Deviations . . . . . . . . . . . . . . . . 78
4.4 Wage Determination . . . . . . . . . . . . . . . . . . . . . . . 79
4.4.1 WAGE . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.2 LWAGE . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.5 Exercises, Problems and Complements . . . . . . . . . . . . . 92
6 Nonlinearity 103
6.1 Models Linear in Transformed Variables . . . . . . . . . . . . 104
6.1.1 Logarithms . . . . . . . . . . . . . . . . . . . . . . . . 104
Log-Log Regression . . . . . . . . . . . . . . . . . . . . 104
Log-Lin Regression . . . . . . . . . . . . . . . . . . . . 105
Lin-Log Regression . . . . . . . . . . . . . . . . . . . . 105
6.1.2 Box-Cox and GLM . . . . . . . . . . . . . . . . . . . . 106
Box-Cox . . . . . . . . . . . . . . . . . . . . . . . . . . 106
GLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.2 Intrinsically Non-Linear Models . . . . . . . . . . . . . . . . . 107
6.2.1 Nonlinear Least Squares . . . . . . . . . . . . . . . . . 107
6.3 Series Expansions . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.4 A Final Word on Nonlinearity and the IC . . . . . . . . . . . 110
6.5 Selecting a Non-Linear Model . . . . . . . . . . . . . . . . . . 110
6.5.1 t and F Tests, and Information Criteria . . . . . . . . 110
6.5.2 The RESET Test . . . . . . . . . . . . . . . . . . . . 111
6.6 Non-Linearity in Wage Determination . . . . . . . . . . . . . . 111
6.6.1 Non-Linearity in Continuous and Discrete Variables Si-
multaneously . . . . . . . . . . . . . . . . . . . . . . . 113
6.7 Exercises, Problems and Complements . . . . . . . . . . . . . 115
7 Heteroskedasticity 121
7.1 Consequences of Heteroskedasticity for Estimation, Inference,
and Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.2 Detecting Heteroskedasticity . . . . . . . . . . . . . . . . . . . 122
7.2.1 Graphical Diagnostics . . . . . . . . . . . . . . . . . . 122
7.2.2 Formal Tests . . . . . . . . . . . . . . . . . . . . . . . 123
The Breusch-Pagan-Godfrey Test (BPG) . . . . . . . . 123
White’s Test . . . . . . . . . . . . . . . . . . . . . . . . 124
7.3 Dealing with Heteroskedasticity . . . . . . . . . . . . . . . . . 126
7.3.1 Adjusting Standard Errors . . . . . . . . . . . . . . . . 126
7.3.2 Adjusting Density Forecasts . . . . . . . . . . . . . . . 127
7.4 Exercises, Problems and Complements . . . . . . . . . . . . . 127
12 Forecasting 237
12.1 *** . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
12.2 Exercises, Problems and Complements . . . . . . . . . . . . . 241
V Appendices 293
Bibliography 307
Index 307
About the Author
Francis X. Diebold is Paul F. Miller, Jr. and E. Warren Shafer Miller Pro-
fessor of Social Sciences, and Professor of Economics, Finance, and Statistics,
University of Pennsylvania.
His research focuses on predictive modeling of financial asset markets,
macroeconomic fundamentals, and the interface. He has made well-known
contributions to the measurement and modeling of asset-return volatility,
business cycles, yield curves, and network connectedness. He has published
more than 150 scientific papers and 8 books, and he is regularly ranked among
the most-cited economists globally.
His academic research is firmly linked to practical matters: He has served
as Research Economist at the Board of Governors of the Federal Reserve
System, Executive Director at Morgan Stanley Investment Management, and
Chairman of the Federal Reserve System’s Model Validation Council.
He has won both undergraduate and graduate economics “teacher of the
year” awards, and his academic “family” includes thousands of undergraduate
students and approximately 75 Ph.D. students.
About the Cover
I used the painting mostly just because I like it. But econometrics is indeed
something of an enigma, part economics and part statistics, part science and
part art, hunting faint and fleeting signals buried in large amounts of noise.
Yet it often succeeds.
Preface
for the ideas that have congealed here. Relatedly, I am grateful to an army of
energetic and enthusiastic Penn graduate and undergraduate students, who
read and improved much of the manuscript and code over many years.
Finally, I apologize and accept full responsibility for the many errors and
shortcomings that undoubtedly remain – minor and major – despite ongoing
efforts to eliminate them.
Francis X. Diebold
Philadelphia
Part I
Beginnings

Chapter 1
Introduction to Econometrics

1.1 Welcome
and so on.
Sales modeling is a good example. Firms routinely use econometric models
of sales to help guide management decisions in inventory management, sales
force management, production planning, and new market entry.
More generally, business firms use econometric models to help decide what
to produce (What product or mix of products should be produced?), when to
produce (Should we build up inventories now in anticipation of high future
demand? How many shifts should be run?), how much to produce and how
much capacity to build (What are the trends in market size and market share?
Are there cyclical or seasonal effects? How quickly and with what pattern will
a newly-built plant or a newly-installed technology depreciate?), and where
to produce (Should we have one plant or many? If many, where should we
locate them?). Firms also use forecasts of future prices and availability of
inputs to guide production decisions.
Econometric models are also crucial in financial services, including asset
management, asset pricing, mergers and acquisitions, investment banking,
and insurance. Portfolio managers, for example, are keenly interested in
the empirical modeling and understanding of asset returns (stocks, bonds,
exchange rates, commodity prices, ...).
Econometrics is similarly central to financial risk management. In recent
decades, econometric methods for volatility modeling have been developed
and widely applied to evaluate and insure risks associated with asset portfo-
lios, and to price assets such as options and other derivatives.
Finally, econometrics is central to the work of a wide variety of consulting
firms, many of which support the business functions already mentioned. Lit-
igation support, for example, is also a very active area, in which econometric
models are routinely used for damage assessment (e.g., lost earnings), “but
for” analyses, and so on.
Indeed these examples are just the tip of the iceberg. Surely you can think of many more.
1.1.2 What Distinguishes Econometrics?
Econometrics is much more than just “statistics using economic data,” al-
though it is of course very closely related to statistics.
Econometrics must confront the special issues and features that arise
routinely in economic data, such as heteroskedasticity and serial corre-
lation. (Don’t worry if those terms mean nothing to you now.)
Econometrics must confront the special problems arising due to its largely
non-experimental nature: Model mis-specification, structural change,
etc.
take just two values, as with a 0-1 indicator for whether or not someone
purchased a particular product during the last month.
Another issue is whether the data are recorded over time, over space, or
some combination of the two. Time series data are recorded over time, as
for example with U.S. GDP, which is measured once per quarter. A GDP
dataset might contain quarterly data for, say, 1960 to the present.
Cross sectional data, in contrast, are recorded over space (at a point in
time), as with yesterday’s closing stock price for each of the U.S. S&P 500
firms. The data structures can be blended, as for example with a time series
of cross sections. If, moreover, the cross-sectional units are identical over
time, we speak of panel data, or longitudinal data. An example would
be the daily closing stock price for each of the U.S. S&P 500 firms, recorded
over each of the last 30 days.
1.3 Online Information and Data
Much useful information is available on the web. The best way to learn about
what’s out there is to spend a few hours searching the web for whatever inter-
ests you. Here we mention just a few key “must-know” sites. Resources for
Economists, maintained by the American Economic Association, is a fine por-
tal to almost anything of interest to economists. (See Figure 1.1.) It contains
hundreds of links to data sources, journals, professional organizations, and so
on. FRED (Federal Reserve Economic Data) is a tremendously convenient
source for economic data. The National Bureau of Economic Research site
has data on U.S. business cycles, and the Real-Time Data Research Center
at the Federal Reserve Bank of Philadelphia has real-time vintage macroeco-
nomic data.
1.4 Software
Econometric software tools are widely available. Two good and time-honored
high-level environments with extensive capabilities are Stata and EViews.
Stata has particular strength in cross sections and panels, and EViews
has particular strength in time series. Both reflect a balance of generality
and specialization well-suited to the sorts of tasks that will concern us. If you
feel more comfortable with another environment, however, that’s fine, as none
of our discussion is wed to Stata or EViews (or any computing environment)
in any way.
There are also many flexible and more open-ended “mid-level” environ-
ments in which you can quickly program, evaluate, and apply new tools
1.5 Tips on How to use this book
Many images in the digital version of this book are clickable to reach
related material.
Key concepts appear in bold, and they also appear in the book’s (hy-
perlinked) index.
Additional related materials appear on the book’s web page. These may
include book updates, presentation slides, datasets, and computer code.
The data that we use in the book – from national income accounts, firms,
people, financial and other markets, etc. – are fictitious. Sometimes
the data are based on real data for various real countries, firms, etc.,
and sometimes they are artificially constructed. Ultimately, however,
any resemblance to particular countries, firms, etc. should be viewed as
coincidental and irrelevant.
1.7 Notes
2 https://blog.revolutionanalytics.com/2012/12/coursera-videos.html
Chapter 2
Graphics and Graphical Style
It’s almost always a good idea to begin an econometric analysis with graphical
data analysis. When compared to the modern array of econometric methods,
graphical analysis might seem trivially simple, perhaps even so simple as to
be incapable of delivering serious insights. Such is not the case: in many
respects the human eye is a far more sophisticated tool for data analysis
and modeling than even the most sophisticated statistical techniques. Put
differently, graphics is a sophisticated technique. That’s not to say that
graphical analysis alone will get the job done – it has limitations of its
own – but it’s usually the best place to
start. With that in mind, we introduce in this chapter some simple graphical
techniques, and we consider some basic elements of graphical style.
2.1 Simple Techniques of Graphical Analysis
We will segment our discussion into two parts: univariate (one variable) and
multivariate (more than one variable). Because graphical analysis “lets the
data speak for themselves,” it is most useful when the dimensionality of
the data is low; that is, when dealing with univariate or low-dimensional
multivariate data.
First consider time series data. Graphics is used to reveal patterns in time
series data. The great workhorse of univariate time series graphics is the
simple time series plot, in which the series of interest is graphed against
time.
In the top panel of Figure 2.1, for example, we present a time series plot
of a 1-year Government bond yield over approximately 500 months. A num-
ber of important features of the series are apparent. Among other things,
its movements appear sluggish and persistent, it appears to trend gently up-
ward until roughly the middle of the sample, and it appears to trend gently
downward thereafter.
The bottom panel of Figure 2.1 provides a different perspective; we plot
the change in the 1-year bond yield, which highlights volatility fluctuations.
Interest rate volatility is very high in mid-sample.
Univariate graphical techniques are also routinely used to assess distri-
butional shape, whether in time series or cross sections. A histogram, for
example, provides a simple estimate of the probability density of a random
variable. The observed range of variation of the series is split into a number
of segments of equal length, and the height of the bar placed at a segment
is the percentage of observations falling in that segment.1 In Figure 2.2 we
show a histogram for the 1-year bond yield.
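For concreteness, here is a minimal Python sketch of the histogram construction just described. The simulated series standing in for the 1-year bond yield and the choice of 20 bins are assumptions for illustration only.

import numpy as np
import matplotlib.pyplot as plt

# Simulated stand-in for the 1-year bond yield series (roughly 500 months).
rng = np.random.default_rng(0)
y = 5.0 + np.cumsum(rng.normal(0.0, 0.2, 500))

# Split the observed range into equal-length segments and compute the
# percentage of observations falling in each segment.
counts, edges = np.histogram(y, bins=20)
percentages = 100 * counts / counts.sum()

plt.bar(edges[:-1], percentages, width=np.diff(edges), align="edge", edgecolor="black")
plt.xlabel("1-year bond yield")
plt.ylabel("Percent of observations")
plt.show()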
When two or more variables are available, the possibility of relations be-
tween the variables becomes important, and we use graphics to uncover the
existence and nature of such relationships. We use relational graphics to
1 In some software packages (e.g., EViews), the height of the bar placed at a segment is simply the number, not the percentage, of observations falling in that segment. Strictly speaking, such histograms are not density estimators, because the “area under the curve” doesn’t add to one, but they are equally useful for summarizing the shape of the density.
Figure 2.3: Bivariate Scatterplot, 1-Year and 10-Year Government Bond Yields
Let’s summarize and extend what we’ve learned about the power of graphics:
Figure 2.4: Scatterplot Matrix, 1-, 10-, 20- and 30-Year Government Bond Yields
We might add to this list another item of tremendous relevance in our age
of big data: Graphics enables us to summarize and learn from huge datasets.
b. Show the data, and only the data, within the bounds of reason.
c. Revise and edit, again and again (and again). Graphics produced using
software defaults are almost never satisfactory.
We can use a number of devices to show the data. First, avoid distorting
the data or misleading the viewer, in order to reveal true data variation rather
than spurious impressions created by design variation. Thus, for example,
avoid changing scales in midstream, use common scales when performing
multiple comparisons, and so on. The sizes of effects in graphics should match
their size in the data.
Second, minimize, within reason, non-data ink (ink used to depict any-
thing other than data points). Avoid chartjunk (elaborate shadings and
grids that are hard to decode, superfluous decoration including spurious 3-D
perspective, garish colors, etc.)
Third, choose a graph’s aspect ratio (the ratio of the graph’s height, h,
to its width, w) to maximize pattern revelation. A good aspect ratio often
makes the average absolute slope of line segments connecting the data points
approximately equal 45 degrees. This procedure is called banking to 45
degrees.
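A rough sketch of one way to implement banking to 45 degrees appears below; the particular rule used (set the median absolute on-screen segment slope to one) is an assumption, since the text does not commit to a specific algorithm.

import numpy as np

def bank_to_45(x, y):
    # Suggest an aspect ratio (height/width) that makes the median absolute
    # on-screen slope of the segments connecting the data roughly 45 degrees.
    # One simple banking rule among several possible ones.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = np.diff(x), np.diff(y)
    slopes = dy[dx != 0] / dx[dx != 0]             # segment slopes in data units
    rx, ry = x.max() - x.min(), y.max() - y.min()  # data ranges
    # On-screen slope = (data slope) * (rx/ry) * (h/w); set its median |.| to 1.
    return (ry / rx) / np.median(np.abs(slopes))

# Example with a noisy sine wave.
x = np.linspace(0, 20, 200)
y = np.sin(x) + 0.1 * np.random.default_rng(1).normal(size=x.size)
print("suggested height/width:", bank_to_45(x, y))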
Fourth, maximize graphical data density. Good graphs often display lots
of data, indeed so much data that it would be impossible to learn from them
in tabular form.3 Good graphics can present a huge amount of data in a
concise and digestible form, revealing facts and prompting new questions, at
both “micro” and “macro” levels.4
3 Conversely, for small amounts of data, a good table may be much more appropriate and informative than a graphic.
4 Note how maximization of graphical data density complements our earlier prescription to maximize the ratio of data ink to non-data ink, which deals with maximizing the relative amount of data ink. High data density involves maximizing as well the absolute amount of data ink.
Ultimately good graphics proceeds just like good writing, so if good writing
is good thinking, then so too is good graphics. So the next time you hear an
ignorant person blurt out something along the lines of “I don’t like to write; I
like to think,” rest assured, his/her writing, thinking, and graphics are likely
all poor.
2. (Empirical warm-up)
(a) Obtain time series of quarterly real GDP and quarterly real con-
sumption for a country of your choice. Provide details.
(b) Display time-series plots and a scatterplot (put consumption on the
vertical axis).
(c) Convert your series to growth rates in percent, and again display
time series plots.
(d) From now on use the growth rate series only.
(e) For each series, provide summary statistics (e.g., mean, standard
deviation, range, skewness, kurtosis, ...).
(f) For each series, perform t-tests of the null hypothesis that the pop-
ulation mean growth rate is 2 percent.
(g) For each series, calculate 90 and 95 percent confidence intervals for
the population mean growth rate. For each series, which interval is
wider, and why?
(h) Regress consumption on GDP. Discuss.
5. (Color)
There is a temptation to believe that color graphics is always better
than grayscale. That’s often far from the truth, and in any event, color
is typically best used sparingly.
Table 2.1

Maturity (Months)    ȳ      σ̂_y    ρ̂_y(1)   ρ̂_y(12)
6                    4.9    2.1    0.98     0.64
12                   5.1    2.1    0.98     0.65
24                   5.3    2.1    0.97     0.65
36                   5.6    2.0    0.97     0.65
60                   5.9    1.9    0.97     0.66
120                  6.5    1.8    0.97     0.68

Notes: We present descriptive statistics for end-of-month yields at various maturities. We show sample mean, sample standard deviation, and first- and twelfth-order sample autocorrelations. Data are from the Board of Governors of the Federal Reserve System. The sample period is January 1985 through December 2008.
The power of tables for displaying data and revealing patterns is very
limited compared to that of graphics, especially in this age of Big Data.
Nevertheless, tables are of course sometimes helpful, and there are prin-
ciples of tabular style, just as there are principles of graphical style.
Compare, for example, the nicely-formatted Table 2.1 (no need to worry
about what it is or from where it comes...) to what would be produced
by a spreadsheet such as Excel.
Try to formulate a set of principles of tabular style. (Hint: One principle
is that vertical lines should almost never appear in tables, as in the table
above.)
8. (The “golden” aspect ratio, visual appeal, and showing the data)
A time-honored approach to visual graphical appeal is use of an aspect
ratio such that height is to width as width is to the sum of height and
width. This turns out to correspond to height approximately sixty per-
cent of width, the so-called “golden ratio.” Graphics that conform to
the golden ratio, with height a bit less than two thirds of width, are
visually appealing. Other things the same, it’s a good idea to keep the
golden ratio in mind when producing graphics. Other things are not
always the same, however. In particular, the golden aspect ratio may
not be the one that maximizes pattern revelation (e.g., by banking to
45 degrees).
2.6 Notes
This chapter has been heavily influenced by Tufte (1983), as are all modern
discussions of statistical graphics.7 Tufte’s book is an insightful and enter-
taining masterpiece on graphical style, and I recommend it enthusiastically. Be
sure to check out his web page and other books, which go far beyond his 1983
work.
7 Photo details follow. Date: 7 February 2011. Source: http://www.flickr.com/photos/roebot/5429634725/in/set-72157625883623225. Author: Aaron Fulkerson. Originally posted to Flickr by Roebot at http://flickr.com/photos/40814689@N00/5429634725. Reviewed on 24 May 2011 by the FlickreviewR robot and confirmed to be licensed under the terms of the cc-by-sa-2.0. Licensed under the Creative Commons Attribution-Share Alike 2.0 Generic license.
Part II
Cross Sections
Chapter 3
Regression Under Ideal Conditions
You have already been introduced to probability and statistics, but chances
are that you could use a bit of review before plunging into regression, so
begin by studying Appendix A. Be warned, however: it is no substitute
for a full-course introduction to probability and statistics, which you should
have had already. Instead it is intentionally much more narrow, reviewing
some material related to moments of random variables, which we will use
repeatedly. It also introduces notation, and foreshadows certain ideas, that
we develop subsequently in greater detail.
In this chapter we’ll be working with cross-sectional data on log wages, ed-
ucation and experience. We already examined the distribution of log wages.
For convenience we reproduce it in Figure 3.1, together with the distributions
of the new data on education and experience.
Suppose that we have data on two variables, y and x, as in Figure 3.2, and
suppose that we want to find the linear function of x that best fits y, where
“best fits” means that the sum of squared (vertical) deviations of the data
points from the fitted line is as small as possible. When we “run a regression,”
or “fit a regression line,” that’s what we do. The estimation strategy is called
least squares, or sometimes “ordinary least squares” to distinguish it from
fancier versions that we’ll introduce later.
The specific data that we show in Figure 3.2 are log wages (LWAGE, y)
and education (EDUC, x) for a random sample of nearly 1500 people, as
described in Appendix B.
Let us elaborate on the fitting of regression lines, and the reason for the
name “least squares.” When we run the regression, we use a computer to fit
the line by solving the problem
$$\min_{\beta} \sum_{i=1}^{N} (y_i - \beta_1 - \beta_2 x_i)^2,$$
where β denotes the set of two model parameters. The fitted values are
$$\hat{y}_i = \hat{\beta}_1 + \hat{\beta}_2 x_i,$$
i = 1, ..., N. The residuals are the difference between actual and fitted values,
$$e_i = y_i - \hat{y}_i,$$
i = 1, ..., N.

Figure 3.3: (Log Wage, Education) Scatterplot with Superimposed Regression Line
In Figure 3.3, we illustrate graphically the results of regressing LWAGE on
EDUC. The best-fitting line slopes upward, reflecting the positive correlation
between LWAGE and EDUC.1 Note that the data points don’t satisfy the
fitted linear relationship exactly; rather, they satisfy it on average. To predict
LWAGE for any given value of EDUC, we use the fitted line to find the value
of LWAGE that corresponds to the given value of EDUC.
1 Note that use of log wage promotes several desiderata. First, it promotes normality, as we discussed in Chapter 2. Second, it enforces positivity of the fitted wage, because $\widehat{WAGE} = \exp(\widehat{LWAGE})$, and exp(x) > 0 for any x.
$$\widehat{LWAGE} = 1.273 + .081\, EDUC.$$
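To make the computations concrete, here is a minimal Python sketch of the simple least-squares fit. The file name wage_data.csv and the column names LWAGE and EDUC are assumptions standing in for however you store the data.

import numpy as np
import pandas as pd

df = pd.read_csv("wage_data.csv")                    # hypothetical data file

# Design matrix: a column of ones for the intercept, plus EDUC.
X = np.column_stack([np.ones(len(df)), df["EDUC"].to_numpy()])
y = df["LWAGE"].to_numpy()

# Solve min_b sum_i (y_i - X_i b)^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat                                # fitted values yhat_i
residuals = y - fitted                               # residuals e_i = y_i - yhat_i

print(f"LWAGE_hat = {beta_hat[0]:.3f} + {beta_hat[1]:.3f} * EDUC")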
Everything generalizes to allow for more than one RHS variable. This is
called multiple linear regression.
Suppose, for example, that we have two RHS variables, x2 and x3 . Before,
we fit a least-squares line to a two-dimensional data cloud; now we fit a least-
squares plane to a three-dimensional data cloud. We use the computer to
find the values of β1 , β2 , and β3 that solve the problem
$$\min_{\beta} \sum_{i=1}^{N} (y_i - \beta_1 - \beta_2 x_{2i} - \beta_3 x_{3i})^2,$$
where β denotes the set of three model parameters. We denote the set of
estimated parameters by β̂, with elements β̂1 , β̂2 , and β̂3 . The fitted values
are
ŷi = β̂1 + β̂2 x2i + β̂3 x3i ,
i = 1, ..., N .
$$\widehat{LWAGE} = .867 + .093\, EDUC + .013\, EXPER.$$
3.2.3 Onward
Before proceeding, two aspects of what we’ve done so far are worth noting.
First, we now have two ways to analyze data and reveal its patterns. One is
the graphical scatterplot of Figure 3.2, with which we started, which provides
a visual view of the data. The other is the fitted regression line of Figure 3.3,
which summarizes the data through the lens of a linear fit. Each approach
has its merit, and the two are complements, not substitutes, but note that
linear regression generalizes more easily to high dimensions.
Second, least squares as introduced thus far has little to do with statistics
or econometrics. Rather, it is simply a way of instructing a computer to
fit a line to a scatterplot in a way that’s rigorous, replicable and arguably
reasonable. We now turn to a probabilistic interpretation.
We work with the full multiple regression model (simple regression is of course
a special case). Collect the RHS variables into the vector x, where x′i =
(1, x2i , ..., xKi ).
Thus far we have not postulated a probabilistic model that relates yi and xi ;
instead, we simply ran a mechanical regression of yi on xi to find the best
fit to yi formed as a linear function of xi . It’s easy, however, to construct
a probabilistic framework that lets us make statistical assessments about
the properties of the fitted line. We assume that yi is linearly related to
an exogenously-determined xi , and we add an independent and identically
distributed zero-mean (iid) Gaussian disturbance:
$$y_i = \beta_1 + \beta_2 x_{2i} + \cdots + \beta_K x_{Ki} + \varepsilon_i$$
$$\varepsilon_i \sim iid\, N(0, \sigma^2),$$
i = 1, ..., N . The intercept of the line is β1 , the slope parameters are the
other β’s2 , and the variance of the disturbance is σ 2 . Collectively, we call the
β’s (and σ) the model’s parameters.
We assume that the linear model sketched is true in population; that is,
it is the data-generating process (DGP). But in practice, of course, we don’t
know the values of the model’s parameters, β1 , β2 , ..., βK and σ 2 . Our job is
to estimate them using a sample of data from the population. We estimate
the β’s precisely as before, using the computer to solve $\min_{\beta} \sum_{i=1}^{N} \varepsilon_i^2$.
2 We speak of the regression intercept and the regression slope.
The discussion thus far was intentionally a bit loose, focusing on motivation
and intuition. Let us now be more precise about what we assume and what
results we obtain.
y = Xβ + ε. (3.1)
In addition,
εi ∼ iid N (0, σ 2 )
becomes
ε ∼ N (0, σ 2 I). (3.2)
εi ∼ iidN (0, σ 2 ),
3. The coefficients (β’s) are fixed (whether over space or time, depending
on whether we’re working in a time-series or cross-section environment)
1. E(εi xik ) = 0, for all i, k (εi is uncorrelated with the xik ’s)
2. E(εi | xi1 , ..., xiK ) = 0, for all i (εi is conditional mean independent of
the xik ’s)
3. var(εi | xi1 , ..., xiK ) = σ 2 , for all i (εi is conditional variance independent
of the xik ’s)
IC2 is subtle, and it may seem obscure at the moment, but it is very important
in the context of causal estimation, which we will discuss in chapter 9.
The IC’s are surely heroic in many contexts, and much of econometrics
is devoted to detecting and dealing with various IC failures. But before we
worry about IC failures, it’s invaluable first to understand what happens
when they hold.3
$$\hat{\beta}_{LS} = (X'X)^{-1}X'y,$$
$$\hat{\beta}_{LS} \overset{a}{\sim} N(\beta, V).$$
3 Certain variations of the IC as stated above can be entertained, and in addition we have omitted some technical details.
Now let’s do more than a simple graphical analysis of the regression fit. In-
stead, let’s look in detail at the computer output, which we show in Figure 5.2
for a regression of log wages (LW AGE) on an intercept, education (EDU C)
and experience (EXP ER). We run regressions dozens of times in this book,
and the output format and interpretation are always the same, so it’s im-
portant to get comfortable with it quickly. The output is in Eviews format.
Other software will produce more-or-less the same information, which is fun-
damental and standard.
Before proceeding, note well that the IC may not be satisfied for this
dataset, yet we will proceed assuming that they are satisfied. As we proceed
through this book, we will confront violations of the various assumptions –
indeed that’s what econometrics is largely about – and we’ll return repeatedly
to this dataset and others. But we must begin at the beginning.
The software output begins by reminding us that we’re running a least-
squares (LS) regression, and that the left-hand-side (LHS) variable is the log
wage (LWAGE), using a total of 1323 observations.
Next comes a table listing each RHS variable together with four statistics.
The RHS variables EDUC and EXPER are education and experience, and the
C variable refers to the earlier-mentioned intercept. The C variable always
equals one, so the estimated coefficient on C is the estimated intercept of the
regression line.4
The four statistics associated with each RHS variable are the estimated
coefficient (“Coefficient”), its standard error (“Std. Error”), a t statistic,
and a corresponding probability value (“Prob.”). The standard errors of
the estimated coefficients indicate their likely sampling variability, and hence
their reliability. The estimated coefficient plus or minus one standard error is
approximately a 68% confidence interval for the true but unknown population
parameter, and the estimated coefficient plus or minus two standard errors
is approximately a 95% confidence interval, assuming that the estimated
coefficient is approximately normally distributed, which will be true if the
regression disturbance is normally distributed or if the sample size is large.
Thus large coefficient standard errors translate into wide confidence intervals.
Each t statistic provides a test of the hypothesis of variable irrelevance:
that the true but unknown population parameter is zero, so that the corre-
sponding variable contributes nothing to the regression and can therefore be
dropped. One way to test variable irrelevance, with, say, a 5% probability
of incorrect rejection, is to check whether zero is outside the 95% confidence
interval for the parameter. If so, we reject irrelevance. The t statistic is
just the ratio of the estimated coefficient to its standard error, so if zero is
4 Sometimes the population coefficient on C is called the constant term, and the regression estimate is called the estimated constant term.
outside the 95% confidence interval, then the t statistic must be bigger than
two in absolute value. Thus we can quickly test irrelevance at the 5% level
by checking whether the t statistic is greater than two in absolute value.5
Finally, associated with each t statistic is a probability value, which is the
probability of getting a value of the t statistic at least as large in absolute
value as the one actually obtained, assuming that the irrelevance hypothesis is
true. Hence if a t statistic were two, the corresponding probability value
would be approximately .05. The smaller the probability value, the stronger
the evidence against irrelevance. There’s no magic cutoff, but typically prob-
ability values less than 0.1 are viewed as strong evidence against irrelevance,
and probability values below 0.05 are viewed as very strong evidence against
irrelevance. Probability values are useful because they eliminate the need for
consulting tables of the t distribution. Effectively the computer does it for us
and tells us the significance level at which the irrelevance hypothesis is just
rejected.
Now let’s interpret the actual estimated coefficients, standard errors, t
statistics, and probability values. The estimated intercept is approximately
.867, so that conditional on zero education and experience, our best forecast
of the log wage would be .867. Moreover, the intercept is very precisely
estimated, as evidenced by the small standard error of .08 relative to the
estimated coefficient. An approximate 95% confidence interval for the true
but unknown population intercept is .867 ± 2(.08), or [.71, 1.03]. Zero is
far outside that interval, so the corresponding t statistic is huge, with a
probability value that’s zero to four decimal places.
The estimated coefficient on EDUC is .093, and the standard error is again
small in relation to the size of the estimated coefficient, so the t statistic is
large and its probability value small. The coefficient is positive, so that
5 If the sample size is small, or if we want a significance level other than 5%, we must refer to a table of critical values of the t distribution. We note that use of the t distribution in small samples also requires an assumption of normally distributed disturbances.
LWAGE tends to rise when EDUC rises. In fact, the interpretation of the
estimated coefficient of .093 is that, holding everything else constant, a one-
year increase in EDUC will produce a .093 increase in LWAGE.
The estimated coefficient on EXPER is .013. Its standard error is also
small, and hence its t statistic is large, with a very small probability value.
Hence we reject the hypothesis that EXPER contributes nothing to the fore-
casting regression. A one-year increase in EXP ER tends to produce a .013
increase in LWAGE.
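Readers working in Python can reproduce this kind of output with, for example, statsmodels; the sketch below assumes a DataFrame df with columns LWAGE, EDUC, and EXPER (and a hypothetical file name), and its coefficient table reports the same four statistics per regressor (estimate, standard error, t statistic, p-value), though the layout differs from EViews.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("wage_data.csv")                 # hypothetical data file

fit = smf.ols("LWAGE ~ EDUC + EXPER", data=df).fit()   # intercept added automatically

print(fit.summary())                 # full table: coefficients, std errors, t, p, R^2, F, ...
print(fit.params)                    # estimated coefficients
print(fit.bse)                       # standard errors
print(fit.tvalues, fit.pvalues)      # t statistics and probability values
print(fit.conf_int(alpha=0.05))      # approximate 95% confidence intervals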
A variety of diagnostic statistics follow; they help us to evaluate the ad-
equacy of the regression. We provide detailed discussions of many of them
elsewhere. Here we introduce them very briefly:
The likelihood function is the joint density function of the data, viewed as a
function of the model parameters. Hence a natural estimation strategy, called
maximum likelihood estimation, is to find (and use as estimates) the param-
eter values that maximize the likelihood function. After all, by construction,
those parameter values maximize the likelihood of obtaining the data that
were actually obtained. In the leading case of normally-distributed regres-
sion disturbances, maximizing the likelihood function (or equivalently, the
log likelihood function, because the log is a monotonic transformation) turns
out to be equivalent to minimizing the sum of squared residuals, hence the
maximum-likelihood parameter estimates are identical to the least-squares
parameter estimates. The number reported is the maximized value of the
log of the likelihood function.6 Like the sum of squared residuals, it’s not of
direct use, but it’s useful for comparing models and testing hypotheses.
Let us now dig a bit more deeply into the likelihood function, maximum-
likelihood estimation, and related hypothesis-testing procedures. A natural
estimation strategy with wonderful asymptotic properties, called maximum
likelihood estimation, is to find (and use as estimates) the parameter val-
6 Throughout this book, “log” refers to a natural (base e) logarithm.
ues that maximize the likelihood function. After all, by construction, those
parameter values maximize the likelihood of obtaining the data that were
actually obtained.
In the leading case of normally-distributed regression disturbances, max-
imizing the likelihood function turns out to be equivalent to minimizing the
sum of squared residuals, hence the maximum-likelihood parameter estimates
are identical to the least-squares parameter estimates.
To see why maximizing the Gaussian log likelihood gives the same pa-
rameter estimate as minimizing the sum of squared residuals, let us derive
the likelihood for the Gaussian linear regression model with non-stochastic
regressors,
$$y_i = x_i'\beta + \varepsilon_i$$
$$\varepsilon_i \sim iid\, N(0, \sigma^2).$$
Hence f (y1 , ..., yN ) = f (y1 )f (y2 ) · · · f (yN ) (by independence of the yi ’s). In
particular, the likelihood of the sample, denoted by L, is
$$L = \prod_{i=1}^{N} (2\pi\sigma^2)^{-\frac{1}{2}}\, e^{-\frac{1}{2\sigma^2}(y_i - x_i'\beta)^2}$$
so
$$\ln L = \ln\!\left[(2\pi\sigma^2)^{-\frac{N}{2}}\right] - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - x_i'\beta)^2
= -\frac{N}{2}\ln(2\pi) - \frac{N}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - x_i'\beta)^2.$$
Note in particular that the β vector that maximizes the likelihood (or log
likelihood – the optimizers must be identical because the log is a positive
monotonic transformation) is the β vector that minimizes the sum of squared
residuals.
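A quick numerical illustration of this equivalence appears below. It uses simulated data (not the book's wage dataset), maximizes the Gaussian log likelihood by general-purpose numerical optimization, and compares the result with the least-squares solution.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(scale=0.8, size=N)

def neg_log_likelihood(theta):
    # theta = (beta_1, ..., beta_K, log sigma); Gaussian linear regression.
    beta, log_sigma = theta[:K], theta[K]
    sigma2 = np.exp(2.0 * log_sigma)
    resid = y - X @ beta
    return 0.5 * N * np.log(2.0 * np.pi * sigma2) + resid @ resid / (2.0 * sigma2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(K + 1), method="BFGS").x[:K]
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_mle, 4))   # the two estimates agree up to optimizer tolerance
print(np.round(beta_ols, 4))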
The log likelihood is also useful for hypothesis testing via likelihood-ratio
tests. Under very general conditions we have asymptotically that:
$$-2(\ln L_0 - \ln L_1) \sim \chi^2_d,$$
We use the F statistic to test the hypothesis that the coefficients of all vari-
ables in the regression except the intercept are jointly zero.7 That is, we
test whether, taken jointly as a set, the variables included in the forecasting
model have any explanatory value. This contrasts with the t statistics, which
7 We don’t want to restrict the intercept to be zero, because under the hypothesis that all the other coefficients are zero, the intercept would equal the mean of y, which in general is not zero. See Problem 6.
$$F = \frac{(SSR_{res} - SSR)/(K-1)}{SSR/(N-K)},$$
where SSRres is the sum of squared residuals from a restricted regression that
contains only an intercept. Thus the test proceeds by examining how much
the SSR increases when all the variables except the constant are dropped. If
it increases by a great deal, there’s evidence that at least one of the variables
has explanatory content.
The probability value for the F statistic gives the significance level at which
we can just reject the hypothesis that the set of RHS variables has no pre-
dictive value. Here, the value is indistinguishable from zero, so we reject the
hypothesis overwhelmingly.
If we knew the elements of β and predicted yi using x′i β, then our prediction
errors would be the εi ’s, with variance σ 2 . We’d like an estimate of σ 2 ,
because it tells us whether our prediction errors are likely to be large or small.
The observed residuals, the ei ’s, are effectively estimates of the unobserved
population disturbances, the εi ’s. Thus the sample variance of the e’s, which
we denote s2 (read “s-squared”), is a natural estimator of σ 2 :
$$s^2 = \frac{\sum_{i=1}^{N} e_i^2}{N-K}.$$
8 In the degenerate case of only one RHS variable, the t and F statistics contain exactly the same information, and F = t². When there are two or more RHS variables, however, the hypotheses tested differ, and F ≠ t².
which makes clear that the numerator in the large fraction is very close to
s2 , and the denominator is very close to the sample variance of y.
The interpretation is the same as that of R2 , but the formula is a bit different.
Adjusted R2 incorporates adjustments for degrees of freedom used in fitting
the model, in an attempt to offset the inflated appearance of good fit if many
RHS variables are tried and the “best model” selected. Hence adjusted R2
is a more trustworthy goodness-of-fit measure than R2 . As long as there is
more than one RHS variable in the model fitted, adjusted R2 is smaller than
R2 ; here, however, the two are extremely close (23.1% vs. 23.2%). Adjusted
R2 is often denoted R̄2 ; the formula is
$$\bar{R}^2 = 1 - \frac{\frac{1}{N-K}\sum_{i=1}^{N} e_i^2}{\frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar{y})^2},$$
where K is the number of RHS variables, including the constant term. Here
the numerator in the large fraction is precisely s2 , and the denominator is
precisely the sample variance of y.
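These goodness-of-fit quantities are easy to compute directly from the residuals of any fitted regression. A small sketch (y, e, and the regressor count K, which includes the constant, are whatever your own regression produced):

import numpy as np

def goodness_of_fit(y, e, K):
    # s^2, R^2 and adjusted R^2 from residuals e of a regression of y
    # on K RHS variables (constant included in the count K).
    N = len(y)
    ssr = np.sum(e ** 2)                       # sum of squared residuals
    tss = np.sum((y - np.mean(y)) ** 2)        # total variation of y
    s2 = ssr / (N - K)
    r2 = 1.0 - ssr / tss
    r2_adj = 1.0 - (ssr / (N - K)) / (tss / (N - 1))
    return s2, r2, r2_adj

Applied to the wage regression (with K = 3), this should reproduce, up to rounding, the R² of roughly .23 and the adjusted R² discussed above.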
and “smaller is better”. That is, we select the model with smallest AIC. We
will discuss AIC in greater depth in Chapter 16.
The AIC and SIC are tremendously important for guiding model selection
in ways that avoid data mining and in-sample overfitting.
You will want to start using AIC and SIC immediately, so we provide
a bit more information here. Model selection by maximizing R2 , or equiv-
alently minimizing residual SSR, is ill-advised, because they don’t penalize
for degrees of freedom and therefore tend to prefer models that are “too big.”
Model selection by maximizing R̄2 , or equivalently minimizing residual s2 ,
is still ill-advised, even though R̄2 and s2 penalize somewhat for degrees of
freedom, because they don’t penalize harshly enough and therefore still tend
to prefer models that are too big. In contrast, AIC and SIC get things just
right. SIC has a wonderful asymptotic optimality property when the set of
candidate models is viewed as fixed: Basically SIC “gets it right” asymptoti-
cally, selecting either the DGP (if the DGP is among the models considered)
or the best predictive approximation to the DGP (if the DGP is not among
the models considered). AIC has a different and also-wonderful asymptotic
The residual scatter is often useful in both cross-section and time-series sit-
uations. It is a plot of y vs ŷ. A perfect fit (R2 = 1) corresponds to all
points on the 45 degree line, and no fit (R2 = 0) corresponds to all points on
a vertical line corresponding to y = ȳ.
In Figure 3.5 we show the residual scatter for the wage regression. It is
not a vertical line, but certainly also not the 45 degree line, corresponding to
the positive but relatively low R2 of .23.
In time-series settings, it’s always a good idea to assess visually the adequacy
of the model via time series plots of the actual data (yi ’s), the fitted values
(ŷi ’s), and the residuals (ei ’s). Often we’ll refer to such plots, shown together
in a single graph, as a residual plot.9 We’ll make use of residual plots through-
out this book. Note that even with many RHS variables in the regression
9 Sometimes, however, we’ll use “residual plot” to refer to a plot of the residuals alone. The intended meaning should be clear from context.
model, both the actual and fitted values of y, and hence the residuals, are
simple univariate series that can be plotted easily.
The reason we examine the residual plot is that patterns would indicate
violation of our iid assumption. In time series situations, we are particularly
interested in inspecting the residual plot for evidence of serial correlation
in the ei ’s, which would indicate failure of the assumption of iid regression
disturbances. More generally, residual plots can also help assess the overall
performance of a model by flagging anomalous residuals, due for example to
outliers, neglected variables, or structural breaks.
Our wage regression is cross-sectional, so there is no natural ordering of
the observations, and the residual plot is of limited value. But we can still
use it, for example, to check for outliers.
In Figure 3.6, we show the residual plot for the regression of LWAGE on
EDUC and EXPER. The actual and fitted values appear at the top of the
graph; their scale is on the right. The fitted values track the actual values
fairly well. The residuals appear at the bottom of the graph; their scale is
on the left. It’s important to note that the scales differ; the ei ’s are in fact
substantially smaller and less variable than either the yi ’s or the ŷi ’s. We
draw the zero line through the residuals for visual comparison. No outliers
are apparent.
The linear regression DGP under the ideal conditions implies the conditional
mean function,
E(yi | x1i = 1, x2i = x∗2i , ..., xKi = x∗Ki ) = β1 + β2 x∗2i + ... + βK x∗Ki
4. LS fails for causal prediction unless the IC hold, so credible causal pre-
diction is much harder.
strategy of econometrics for many decades, and it is very much at the center
of modern “data science” and “machine learning”.
Finally full density forecasts are of interest. The linear regression DGP
under the IC implies the conditional density function
$$y_i \mid x_i = x_i^* \sim N(x_i^{*\prime}\beta,\, \sigma^2).$$
Notice that the interval and density forecasts rely for validity on more parts
of the IC than do the point forecasts: Gaussian disturbances and constant
disturbance variances – which makes clear in even more depth why violations
of the IC are generally problematic even in non-causal forecasting situations.
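A sketch of the resulting point, interval, and density forecasts follows, with σ replaced by an estimate s and parameter-estimation uncertainty ignored. The coefficient values are those of the fitted wage equation reported earlier; the value of s and the conditioning vector are illustrative assumptions.

import numpy as np
from scipy import stats

def regression_forecasts(x_star, beta_hat, s, level=0.95):
    # Point, interval, and density forecasts implied by the linear regression
    # DGP under the IC, treating the parameters as known (sigma replaced by s).
    point = float(x_star @ beta_hat)
    z = stats.norm.ppf(0.5 + level / 2.0)          # 1.96 for a 95% interval
    interval = (point - z * s, point + z * s)
    density = stats.norm(loc=point, scale=s)       # full predictive density
    return point, interval, density

# Example: 12 years of education, 10 years of experience.
beta_hat = np.array([0.867, 0.093, 0.013])         # fitted wage equation coefficients
x_star = np.array([1.0, 12.0, 10.0])
point, interval, density = regression_forecasts(x_star, beta_hat, s=0.5)  # s = .5 illustrative
print(point, interval, density.pdf(point))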
In light of our predictive emphasis throughout this book, here we offer some
predictive perspective on the regression statistics discussed earlier.
The sample, or historical, mean of the dependent variable, ȳ, an estimate
of the unconditional mean of y, is a benchmark forecast. It is obtained by
regressing y on an intercept alone – no conditioning on other regressors.
The sample standard deviation of y is a measure of the in-sample accuracy
of the unconditional mean forecast ȳ.
The OLS fitted values, ŷi = x′i β̂, are effectively in-sample regression pre-
dictions.
The OLS residuals, ei = yi − ŷi , are effectively in-sample prediction errors
corresponding to use of those in-sample regression predictions.
OLS coefficient signs and sizes relate to the weights put on the various x
variables in forming the best in-sample prediction of y.
The standard errors, t statistics, and p-values let us do statistical inference
as to which regressors are most relevant for predicting y.
SSR measures “total” in-sample accuracy of the regression predictions. It
is closely related to in-sample M SE:
$$MSE = \frac{1}{N}\, SSR = \frac{1}{N}\sum_{i=1}^{N} e_i^2$$
Residual plots are useful for visually flagging neglected things that im-
pact forecasting. Residual correlation (in time-series contexts) indicates that
point forecasts could possibly be improved. Non-constant residual volatility
indicates that interval and density forecasts could be possibly improved.
3.8 Multicollinearity
where:
$$linlin(e) = \begin{cases} a|e| & \text{if } e \le 0 \\ b|e| & \text{if } e > 0 \end{cases}$$
I(·) stands for “indicator” variable where I(x) = 1 if x is true, and I(x) = 0
otherwise. “linlin” refers to linearity on each side of the origin.
QR is not as simple as OLS, but it is still simple (solves a linear program-
ming problem).
A key issue is what, precisely, quantile regression fits. QR fits the d · 100%
quantile:
$$\mathrm{quantile}_d(y \,|\, X) = x\beta$$
where
$$d = \frac{b}{a+b} = \frac{1}{1 + a/b}.$$
This is an important generalization of regression (e.g., How do the wages
of people in the far left tail of the wage distribution vary with education and
experience, and how does that compare to those in the center of the wage
distribution?)
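In Python, quantile regression is available in statsmodels; the sketch below fits the wage equation at the 10th percentile and at the median (the data file name and column names are assumptions, as before, and the 10th-percentile choice is ours for illustration).

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("wage_data.csv")                        # hypothetical data file

fit_10 = smf.quantreg("LWAGE ~ EDUC + EXPER", data=df).fit(q=0.10)   # far left tail
fit_50 = smf.quantreg("LWAGE ~ EDUC + EXPER", data=df).fit(q=0.50)   # conditional median

print(fit_10.params)   # education/experience effects in the left tail ...
print(fit_50.params)   # ... versus the center of the conditional distribution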
a. Coefficient
b. Standard error
c. t statistic
d. Probability value of the t statistic
e. R-squared
f. Adjusted R-squared
5. (Regression semantics)
Regression analysis is so important, and used so often by so many people,
that a variety of associated terms have evolved over the years, all of which
are the same for our purposes. You may encounter them in your reading,
so it’s important to be aware of them. Some examples:
7. (Dimensionality)
We have emphasized, particularly in Chapter 2, that graphics is a pow-
erful tool with a variety of uses in the construction and evaluation of
econometric models. We hasten to add, however, that graphics has its
limitations. In particular, graphics loses much of its power as the dimen-
sion of the data grows. If we have data in ten dimensions, and we try to
squash it into two or three dimensions to make graphs, there’s bound to
be some information loss.
Thus, in contrast to the analysis of data in two or three dimensions, where
learning about data by fitting models involves a loss of information whereas
graphical analysis does not, graphical methods lose their comparative
advantage in higher dimensions, where graphical analysis can become
comparatively laborious and less insightful.
8. (Wage regressions)
The relationship among wages and their determinants is one of the most
important in all of economics. In the text we have examined, and will
continue to examine, the relationship for 1995 using a CPS subsample.
Here you will thoroughly analyze the relationship for 2004 and 2012,
compare your results to those for 1995, and think hard about the mean-
ing and legitimacy of your results.
(a) Obtain the relevant 1995, 2004 and 2012 CPS subsamples.
(b) Discuss any differences in the datasets. Are the same people in each
dataset?
(c) For now, assume the validity of the ideal conditions. Using each
dataset, run the OLS regression WAGE → c, EDUC, EXPER.
(Note that the LHS variable is WAGE, not LWAGE.) Discuss and
compare the results in detail.
(d) Now think of as many reasons as possible to be skeptical of your
results. (This largely means think of as many reasons as possible
why the IC might fail.) Which of the IC might fail? One? A few?
All? Why? Insofar as possible, discuss the IC, one-by-one, how/why
failure could happen here, the implications of failure, how you might
detect failure, what you might do if failure is detected, etc.
(e) Repeat all of the above using LWAGE as the LHS variable.
(b) What is the OLS estimator, and what finite-sample properties does
it enjoy?
(c) Display and discuss the exact distribution of the OLS estimator.
3.11 Notes
Chapter 4
Non-Normality
4.0.1 Results
yi ∼ iid(µ, σ 2 ), i = 1, ..., N,
$$\bar{y} \overset{a}{\sim} N\!\left(\mu, \frac{\sigma^2}{N}\right).$$
This result forms the basis for asymptotic inference. It is a Gaussian central
limit theorem. We consistently estimate σ 2 using s2 .
Now consider the linear regression under the IC except that we allow non-
Gaussian disturbances. OLS remains consistent, asymptotically normal, and
asymptotically efficient, with
$$\hat{\beta}_{OLS} \overset{a}{\sim} N(\beta, V).$$
4.1.1 QQ Plots
$$S = \frac{E(y-\mu)^3}{\sigma^3}$$
$$K = \frac{E(y-\mu)^4}{\sigma^4}.$$
Obviously, each tells about a different aspect of non-normality. Kurtosis, in
particular, tells about fatness of distributional tails relative to the normal.
A simple strategy is to check various implications of residual normality,
such as S = 0 and K = 3, via informal examination of Ŝ and K̂.
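A sketch of this informal check (plus the Jarque-Bera test listed in the chapter outline) using scipy is shown below; note that scipy's kurtosis function returns excess kurtosis by default, so we request the non-excess version in order to compare with K = 3.

import numpy as np
from scipy import stats

def normality_check(e):
    # Sample skewness and kurtosis of residuals, plus the Jarque-Bera test.
    S_hat = stats.skew(e)                         # compare with 0 under normality
    K_hat = stats.kurtosis(e, fisher=False)       # compare with 3 under normality
    jb = stats.jarque_bera(e)                     # formal joint test of S = 0, K = 3
    return S_hat, K_hat, jb.statistic, jb.pvalue

e = np.random.default_rng(0).standard_t(df=4, size=1000)   # fat-tailed example residuals
print(normality_check(e))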
4.2 Outliers
attention because they can have substantial influence on the fitted regression
line.
On the one hand, OLS retains its magic in such outlier situations – it
is the Best Linear Unbiased Estimator (BLUE) regardless of the disturbance
distribution. On the other hand, the fully-optimal (MVUE) estimator may
be highly non-linear, so the fact that OLS remains BLUE is less than fully
comforting. Indeed OLS parameter estimates are particularly susceptible to
distortions from outliers, because the quadratic least-squares objective really
hates big errors (due to the squaring) and so goes out of its way to tilt the
fitted surface in a way that minimizes them.
How to identify and treat outliers is a time-honored problem in data anal-
ysis, and there’s no easy answer. If an outlier is simply a data-recording
mistake, then it may well be best to discard it if you can’t obtain the correct
data. On the other hand, every dataset, even a perfectly “clean” dataset,
has a “most extreme observation,” but it doesn’t follow that it should be dis-
carded. Indeed the most extreme observations are often the most informative
– precise estimation requires data variation.
The OLS estimator computed with observation i left out, $\hat{\beta}_{OLS}^{(-i)}$, satisfies
$$\hat{\beta}_{OLS}^{(-i)} - \hat{\beta}_{OLS} = -\frac{1}{1-h_i}\,(X'X)^{-1}x_i e_i,$$
where $h_i$ is the i-th diagonal element of the “hat matrix,” $X(X'X)^{-1}X'$. Hence the estimated coefficient change $\hat{\beta}_{OLS}^{(-i)} - \hat{\beta}_{OLS}$ is driven by $\frac{1}{1-h_i}$. The quantity $h_i$ is called the observation-i leverage. It can be shown to be in [0, 1], so that the larger is $h_i$, the larger is $\hat{\beta}_{OLS}^{(-i)} - \hat{\beta}_{OLS}$. Hence one really just needs to examine the leverage sequence, and scrutinize carefully observations with high leverage.
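Computing the leverage sequence directly from the design matrix is straightforward; a sketch (the simulated X is illustrative only):

import numpy as np

def leverages(X):
    # Diagonal of the hat matrix H = X (X'X)^{-1} X', one value per observation:
    # h_i = x_i' (X'X)^{-1} x_i.
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Example: flag the ten highest-leverage observations for scrutiny.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
h = leverages(X)
print(np.argsort(h)[-10:])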
4.3 Robust Estimation

4.3.1 Robustness Iteration

Fit at robustness iteration 0:
$$\hat{y}^{(0)} = X\hat{\beta}^{(0)}$$
where
$$\hat{\beta}^{(0)} = \arg\min_{\beta} \left[\sum_{i=1}^{N} (y_i - x_i'\beta)^2\right].$$
where
$$e_i^{(0)} = y_i - \hat{y}_i^{(0)},$$
and S(z) is a function such that S(z) = 1 for z ∈ [−1, 1] but downweights
outside that interval.
Fit at robustness iteration 1:
$$\hat{y}^{(1)} = X\hat{\beta}^{(1)}$$
where
$$\hat{\beta}^{(1)} = \arg\min_{\beta} \left[\sum_{i=1}^{N} \rho_i\, (y_i - x_i'\beta)^2\right].$$
Continue as desired.
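A sketch of the iteration appears below; the specific downweighting function S and the scaling of the residuals before applying it are our assumptions, since the text leaves them unspecified.

import numpy as np

def S(z):
    # S(z) = 1 on [-1, 1], downweights outside; one simple possible choice.
    z = np.abs(z)
    return np.where(z <= 1.0, 1.0, 1.0 / z)

def robustness_iterations(X, y, n_iter=5):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # iteration 0: ordinary OLS
    for _ in range(n_iter):
        e = y - X @ beta                              # residuals from current fit
        rho = S(e / (2.0 * np.std(e)))                # weights (scaling is an assumption)
        w = np.sqrt(rho)                              # weighted LS via rescaled data
        beta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return beta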
Alternatively, one can replace the least-squares objective with the sum of absolute errors, solving
$$\min_{\beta} \sum_{i=1}^{N} |\varepsilon_i|.$$
We call the new estimator “least absolute deviations” (LAD), and we write $\hat{\beta}_{LAD}$.²
By construction, β̂LAD is not influenced by outliers as much as β̂OLS . Put
differently, LAD is more robust to outliers than is OLS.
Of course nothing is free, and the price of LAD is a bit of extra compu-
tational complexity relative to OLS. In particular, the LAD estimator does
not have a tidy closed-form analytical expression like OLS, so we can’t just
plug into a simple formula to obtain it. Instead we need to use the computer
to find the optimal β directly. If that sounds complicated, rest assured that
it’s largely trivial using modern numerical methods, as embedded in modern
software.3
It is important to note that whereas OLS fits the conditional mean function,
$$\mathrm{mean}(y \,|\, X) = X\beta,$$
LAD fits the conditional median function,
$$\mathrm{median}(y \,|\, X) = X\beta.$$
The conditional mean and median are equal under symmetry and hence under
normality, but not under asymmetry, in which case the median is a better
measure of central tendency. Hence LAD delivers two kinds of robustness to
non-normality: it is robust to outliers and robust to asymmetry.
4.4 Wage Determination
Here we show some empirical results that make use of the ideas sketched
above. There are many tables and figures appearing at the end of the chapter.
We do not refer to them explicitly, but all will be clear upon examination.
2 Note that LAD regression is just quantile regression for d = .50.
3 Indeed computation of the LAD estimator turns out to be a linear programming problem, which is well-studied and simple.
4.4.1 WAGE
4.4.2 LWAGE
Now we run LWAGE → c, EDUC, EXPER. Again we show the regression results, the residual plot, the residual histogram and statistics, the residual Gaussian QQ plot, the leave-one-out plot, and the results of LAD estimation. Among other things, and in sharp contrast to the results for WAGE (as opposed to LWAGE), the residual histogram and Gaussian QQ plot indicate approximate residual normality.
[Leave-One-Out Plot: leave-one-out estimates of the education coefficient (vertical axis: “Coefficient (Education)”; horizontal axis: “Leave t out”).]
Figure 4.9: OLS Log Wage Regression: Residual Histogram and Statistics
and so on. Moreover, it takes a model to beat a model, and Taleb offers
little.
P(|y − µ| > 5σ)

P(y > y*) = k (y*)^{−γ}.
From one perspective we continue working under the IC. From another we
now begin relaxing the IC, effectively by recognizing RHS variables that were
omitted from, but should not have been omitted from, our original wage
regression.
In Figure 5.1 we show histograms and statistics for all potential determi-
nants of wages. Education (EDUC) and experience (EXPER) are standard
continuous variables, although we measure them only discretely (in years);
we have examined them before and there is nothing new to say. The new vari-
ables are 0-1 dummies, UNION (already defined) and NONWHITE, where
NONWHITE_i = 1 if observation i corresponds to a non-white person, and NONWHITE_i = 0 otherwise.
Note that the sample mean of a dummy variable is the fraction of the
sample with the indicated attribute. The histograms indicate that roughly
one-fifth of people in our sample are union members, and roughly one-fifth
are non-white.
We also have a third dummy, FEMALE, where
FEMALE_i = 1 if observation i corresponds to a female, and FEMALE_i = 0 otherwise.
We don’t show its histogram because it’s obvious that FEMALE should be
approximately 0 w.p. 1/2 and 1 w.p. 1/2, which it is.
The baseline wage regression on an intercept, education and experience is shown in Figure 5.2. Both explanatory variables are highly significant, with
expected signs.
Now consider the same regression, but with our three group dummies
added, as shown in Figure 5.3. All dummies are significant with the expected
signs, and R2 is higher. Both SIC and AIC favor including the group dum-
mies. We show the residual scatter in Figure 5.4. Of course it’s hardly the
forty-five degree line (the regression R² is higher but still only .31), but it's getting closer.

Figure 5.4: Residual Scatter from Wage Regression on Education, Experience and Group Dummies
1. (Slope dummies)
Consider the regression
yi = β1 + β2 xi + εi .
(a) How would you test the hypothesis that none of the four new fertil-
izers is effective?
(b) Assuming that you reject the null, how would you estimate the im-
provement (or worsening) due to using fertilizer A, B, C or D?
5.4 Notes
ANOVA traces to Sir Ronald Fisher's 1918 article, “The Correlation Between Relatives on the Supposition of Mendelian Inheritance,” and it was featured prominently in his classic 1925 book, Statistical Methods for Research Workers. Fisher is in many ways the “father” of much of modern statistics.
Nonlinearity
In general there is no reason why the conditional mean function should be lin-
ear. That is, the appropriate functional form may not be linear. Whether
linearity provides an adequate approximation is an empirical matter.
Non-linearity is related to non-normality, which we studied in chapter 4.
In particular, in the multivariate normal case, the conditional mean function
is linear in the conditioning variables. But once we leave the terra firma
of multivariate normality, anything goes. The conditional mean function
and disturbances may be linear and Gaussian, non-linear and Gaussian, linear
and non-Gaussian, or non-linear and non-Gaussian.
In the Gaussian case, because the conditional mean is a linear function
of the conditioning variable(s), it coincides with the linear projection. In
non-Gaussian cases, however, linear projections are best viewed as approxi-
mations to generally non-linear conditional mean functions. That is, we can
view the linear regression model as a linear approximation to a generally non-
linear conditional mean function. Sometimes the linear approximation may
be adequate, and sometimes not.
6.1.1 Logarithms
Log-Log Regression
First, consider log-log regression. We write it out for the simple regression case, but of course we could have more than one regressor. We have

ln y_i = β1 + β2 ln x_i + ε_i.

Because logs appear on both sides, β2 gives the elasticity of y with respect to x, that is, the approximate percent change in E(y_i|x_i) for a one-percent change in x_i. A classic multiple-regressor example is the logged Cobb-Douglas production function,

ln y_i = ln A + α ln L_i + β ln K_i + ε_i.
Log-Lin Regression
y_t = A e^{rt} e^{ε_t}.

Taking natural logs gives

ln y_t = ln A + rt + ε_t,
which is linear. The growth rate r gives the approximate percent change in
E(yt |t) for a one-unit change in time (because logs appear only on the left).
Lin-Log Regression
yi = β ln xi + εi
It’s a bit exotic but it sometimes arises. β gives the effect on E(yi |xi ) of a
one-percent change in xi , because logs appear only on the right.
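A brief sketch of estimation for such transformed-variable models, assuming Python with numpy and statsmodels and using simulated data; the data-generating parameters and variable names are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
N = 500
x = rng.lognormal(mean=1.0, sigma=0.5, size=N)
y = np.exp(0.2 + 0.8 * np.log(x) + 0.1 * rng.normal(size=N))   # log-log DGP

# Log-log: ln y = b1 + b2 ln x + e; b2 is the elasticity of y with respect to x
loglog = sm.OLS(np.log(y), sm.add_constant(np.log(x))).fit()

# Log-lin (exponential trend): ln y = b1 + b2 * t + e
t = np.arange(N, dtype=float)
y_trend = np.exp(1.0 + 0.01 * t + 0.1 * rng.normal(size=N))
loglin = sm.OLS(np.log(y_trend), sm.add_constant(t)).fit()

print(loglog.params, loglin.params)
```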
Box-Cox

The Box-Cox model transforms the left-hand-side variable before running the regression:

B(y_i) = β1 + β2 x_i + ε_i,

where

B(y_i) = (y_i^λ − 1) / λ.

Hence

E(y_i|x_i) = B^{−1}(β1 + β2 x_i).

Because

lim_{λ→0} (y_i^λ − 1)/λ = ln(y_i),

the Box-Cox model corresponds to the log-lin model in the special case of λ = 0.
GLM
The so-called “generalized linear model” (GLM) provides an even more flex-
ible framework. Almost all models with left-hand-side variable transforma-
tions are special cases of those allowed in the generalized linear model
(GLM). In the GLM, we have
G(y_i) = β1 + β2 x_i + ε_i,

so that

E(y_i|x_i) = G^{−1}(β1 + β2 x_i).
The least squares estimator is often called “ordinary” least squares, or OLS.
As we saw earlier, the OLS estimator has a simple closed-form analytic ex-
pression, which makes it trivial to implement on modern computers. Its
computation is fast and reliable.
The adjective “ordinary” distinguishes ordinary least squares from more
laborious strategies for finding the parameter configuration that minimizes
the sum of squared residuals, such as the non-linear least squares (NLS)
estimator. When we estimate by non-linear least squares, we use a computer
to find the minimum of the sum-of-squared-residuals function directly, using
numerical methods, by literally trying many (perhaps hundreds or even thou-
sands) of different β values until we find those that appear to minimize the
sum of squared residuals. This is not only more laborious (and hence slow),
but also less reliable, as, for example, one may arrive at a minimum that is
local but not global.
Why then would anyone ever use non-linear least squares as opposed to
OLS? Indeed, when OLS is feasible, we generally do prefer it. For example,
in all regression models discussed thus far OLS is applicable, so we prefer it.
Intrinsically non-linear models can’t be estimated using OLS, but they can
be estimated using non-linear least squares. We resort to non-linear least
squares in such cases.
Intrinsically non-linear models obviously violate the linearity assumption
of the IC. But the violation is not a big deal. Under the remaining IC (that
is, dropping only linearity), β̂N LS has a sampling distribution similar to that
under the IC.
y_i = g(x_i, ε_i)

y_i = f(x_i) + ε_i.

f(x_i) ≈ β1 + β2 x_i + β3 x_i².

In the multiple-regressor case, a low-order expansion also includes cross products of regressors (e.g., a term in x_i z_i), and that final term picks up interaction effects. Interaction effects are also relevant in situations involving dummy variables. There we capture interactions by including products of dummies.2
The ultimate point is that even so-called “intrinsically non-linear” models
are themselves linear when viewed from the series-expansion perspective. In
principle, of course, an infinite number of series terms are required, but in
practice nonlinearity is often quite gentle (e.g., quadratic) so that only a few
series terms are required. From this viewpoint non-linearity is in some sense
really an omitted-variables problem.
One can also use Fourier series approximations, regressing on sines and cosines of the regressors,
and one can also mix Taylor and Fourier approximations by regressing not
only on powers and cross products (“Taylor terms”), but also on various sines
and cosines (“Fourier terms”). Mixing may facilitate parsimony.
2
Notice that a product of dummies is one if and only if both individual dummies are one.
It is of interest to step back and ask what parts of the IC are violated in our
various non-linear models.
Models linear in transformed variables (e.g., log-log regression) actually
don’t violate the IC, after transformation. Neither do series expansion mod-
els, if the adopted expansion order is deemed correct, because they too are
linear in transformed variables.
The series approach to handling non-linearity is actually very general and
handles intrinsically non-linear models as well, and low-ordered expansions
are often adequate in practice, even if an infinite expansion is required in
theory. If series terms are needed, a purely linear model would suffer from
misspecification of the X matrix (a violation of the IC) due to the omitted
higher-order expansion terms. Hence the failure of the IC discussed in this chapter can be viewed either as non-linearity of the conditional mean function or, equivalently, as omission of relevant (higher-order) variables from a linear regression.
One can use the usual t and F tests for testing linear models against non-
linear alternatives in nested cases, and information criteria (AIC and SIC)
for testing against non-linear alternatives in non-nested cases. To test linear-
ity against a quadratic alternative in a simple regression case, for example,
we can simply run y → c, x, x2 and perform a t-test for the relevance of x2 .
And of course, use AIC and SIC as always.
One simple strategy is to regress y on the x variables, form the fitted values ŷ_i, and then ask whether powers of ŷ_i help explain y. Note that the powers of ŷ_i are linear combinations of powers and cross products of the x variables – just what the doctor ordered. There is no need to include the first power of ŷ_i, because that would be redundant with the included x variables. Instead we include the powers ŷ_i², ŷ_i³, ..., ŷ_i^m. Typically a small m is adequate. Significance of the included set of powers of ŷ_i can be checked using an F test. This procedure is called RESET (Regression Specification Error Test).
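A hedged sketch of a RESET-style check in Python, assuming statsmodels; the wage-like variable names and coefficients below are simulated stand-ins of ours, not the book's data.

```python
import numpy as np
import statsmodels.api as sm

def reset_test(y, X, powers=(2, 3)):
    """Regress y on X, then F-test whether powers of the fitted values add explanatory power."""
    base = sm.OLS(y, X).fit()
    yhat = base.fittedvalues
    X_aug = np.column_stack([X] + [yhat**p for p in powers])
    aug = sm.OLS(y, X_aug).fit()
    k = X.shape[1]
    R = np.zeros((len(powers), X_aug.shape[1]))   # restrictions: added powers jointly zero
    for j in range(len(powers)):
        R[j, k + j] = 1.0
    return aug.f_test(R)

# Illustrative use with simulated data (hypothetical names):
rng = np.random.default_rng(3)
N = 300
educ = rng.integers(8, 21, size=N).astype(float)
exper = rng.integers(0, 40, size=N).astype(float)
lwage = 0.1 * educ + 0.02 * exper - 0.0003 * exper**2 + 0.3 * rng.normal(size=N)
X = sm.add_constant(np.column_stack([educ, exper]))
print(reset_test(lwage, X))
```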
For convenience we reproduce in Figure 6.1 the results of our current linear
wage regression,
LWAGE → c, EDUC, EXPER.

The RESET test from that regression suggests neglected non-linearity; the p-value is .03 when using ŷ_i² and ŷ_i³ in the RESET test regression.
Non-Linearity in EDUC and EXPER: Powers and Interactions
Given the results of the RESET test, we proceed to allow for non-linearity. In Figure 6.2 we show the results of the quadratic regression. Now let's incorporate powers and interactions in EDUC and EXPER, as well as interactions involving FEMALE, UNION and NONWHITE.
Figure 6.3: Wage Regression on Education, Experience, Group Dummies, and Interactions
Figure 6.4: Wage Regression with Continuous Non-Linearities and Interactions, and Discrete
Interactions
The RESET statistic has a p-value of .19, so we would not reject adequacy
of functional form at conventional levels.
(a) Is tax revenue likely related to the tax rate? (That is, do you think
that the mean of tax revenue conditional on the tax rate actually is
a function of the tax rate?)
(b) Is the relationship likely linear? (Hint: how much revenue would be
collected at tax rates of zero or one hundred percent?)
(c) If not, is a linear regression nevertheless likely to produce a good
approximation to the true relationship?
Consider the regression

y_i = β1 + β2 x_i + β3 x_i² + β4 z_i + ε_i

under the full ideal conditions. Find the mean of y_i conditional upon x_i = x_i* and z_i = z_i*. Is the conditional mean linear in (x_i*, z_i*)?
Consider the following three models:

y_i = β1 + β2 x_i + ε_i

y_i = β1 e^{β2 x_i} ε_i

y_i = β1 + e^{β2 x_i} + ε_i.
a. For each model, determine whether OLS may be used for estimation
(perhaps after transforming the data), or whether NLS is required.
b. For those models for which OLS is feasible, do you expect NLS and
OLS estimation results to agree precisely? Why or why not?
c. For those models for which NLS is “required,” show how to avoid it
using series expansions.
(a) The model was estimated using ordinary least squares (OLS). What
loss function is optimized in calculating the OLS estimate? (Give
a formula and a graph.) What is the formula (if any) for the OLS
estimator?
(b) Consider instead estimating the same model numerically (i.e., by
NLS) rather than analytically (i.e., by OLS). What loss function is
optimized in calculating the NLS estimate? (Give a formula and a
graph.) What is the formula (if any) for the NLS estimator?
(c) Does the estimated equation indicate a statistically significant effect
of union status on log wage? An economically important effect?
What is the precise interpretation of the estimated coefficient on
UNION? How would the interpretation change if the wage were not
logged?
(d) Precisely what hypothesis does the F-statistic test? What are the
restricted and unrestricted sums of squared residuals to which it
is related, and what are the two OLS regressions to which they
correspond?
(e) Consider an additional regressor, AGE, where AGE = 6 + EDUC
+ EXPER. (The idea is that 6 years of early childhood, followed by
EDUC years of education, followed by EXPER years of work expe-
rience should, under certain assumptions, sum to a person’s age.)
Discuss the likely effects, if any, of adding AGE to the regression.
(f) The log wage may of course not be linear in EDUC and EXPER.
How would you assess the possibility of quadratic nonlinear effects
using t-tests? An F-test? The Schwarz criterion (SIC)? R2 ?
(g) Suppose you find that the log wage relationship is indeed non-linear but still very simple, with only EXPER² entering in addition to the variables in Figure 6.5. What is ∂E(LWAGE|X)/∂EXPER in the expanded model? How does it compare to ∂E(LWAGE|X)/∂EXPER in the original model of Figure 6.5? What are the economic interpretations of the two derivatives?
(X refers to the full set of included right-hand-side variables in a
regression.)
(h) Return to the original model of Figure 6.5. How would you assess
the overall adequacy of the fitted model using the standard error of
the regression? The model residuals? Which is likely to be more
useful/informative?
(i) Consider estimating the model not by OLS or NLS, but rather by
quantile regression (QR). What loss function is optimized in calcu-
lating the QR estimate? (Give a formula and a graph.) What is the
formula (if any) for the QR estimator? How is the least absolute
deviations (LAD) estimator related to the QR estimator? Under
the IC, are the OLS and LAD estimates likely very close? Why or
why not?
(j) Discuss whether and how you would incorporate trend and seasonal-
ity by using a linear time trend variable and a set of seasonal dummy
variables.
Heteroskedasticity
We continue exploring issues associated with possible failure of the ideal con-
ditions. This chapter’s issue is “Do we really believe that disturbance vari-
ances are constant?” As always, consider: ε ∼ N (0, Ω). Heteroskedasticity
corresponds to Ω diagonal but Ω ≠ σ²I:

Ω = diag(σ1², σ2², . . . , σN²),

that is, a diagonal covariance matrix with (possibly different) disturbance variances σi² on the diagonal and zeros off the diagonal.
Heteroskedasticity can arise for many reasons. A leading cause is that σi2
may depend on one or more of the xi ’s. A classic example is an “Engel curve”,
a regression relating food expenditure to income. Wealthy people have much
more discretion in deciding how much of their income to spend on food, so
their disturbances should be more variable, as routinely found.
The first thing we can do is graph e2i against xi , for various regressors, looking
for relationships. This makes sense because e2i is effectively a proxy for σi2 .
Recall, for example, our “Final” wage regression, shown in Figure 7.1.
In Figure 7.2 we graph the squared residuals against EDUC. There is appar-
ently a positive relationship, although it is noisy. This makes sense, because
very low education almost always leads to very low wage, whereas high ed-
ucation can produce a larger variety of wages (e.g., both neurosurgeons and
college professors are highly educated, but neurosurgeons typically earn much
more).
White’s Test
White’s test is a simple extension of BPG, replacing the linear BPG test
regression with a more flexible (quadratic) regression:
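A sketch of the auxiliary regression behind such tests, assuming Python with statsmodels; the function below builds the quadratic White-style regressor set by hand and returns the conventional N·R² statistic without further refinement (statsmodels also ships ready-made het_breuschpagan / het_white routines in stats.diagnostic).

```python
import numpy as np
import statsmodels.api as sm

def white_test(e, X):
    """White-style auxiliary regression: regress squared residuals on the
    regressors in X (supplied WITHOUT a constant), their squares, and their
    cross products; return the N*R^2 statistic and the fitted auxiliary model."""
    cols = [X]
    k = X.shape[1]
    for i in range(k):
        for j in range(i, k):
            cols.append((X[:, i] * X[:, j])[:, None])   # squares and cross products
    Z = sm.add_constant(np.column_stack(cols))
    aux = sm.OLS(e**2, Z).fit()
    return len(e) * aux.rsquared, aux
```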
We will consider both adjusting standard errors and adjusting density fore-
casts.
Using advanced methods, one can obtain consistent standard errors, even
when heteroskedasticity is present. Mechanically, it’s just a simple regression
option. e.g., in EViews, instead of “ls y,c,x”, use “ls(cov=white) y,c,x”.
Even if you’re only interested in prediction, you still might want to use
robust standard errors, in order to do credible inference regarding the con-
tributions of the various x variables to the point prediction.
In Figure 7.5 we show the final wage regression with robust standard errors.
Although the exact values of the standard errors change, it happens in this
case that significance of all coefficients is preserved.
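In Python's statsmodels the analogous regression option is a covariance-type argument; the sketch below, on simulated heteroskedastic data, compares conventional and heteroskedasticity-robust standard errors. The data and coefficient values are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
N = 400
x = rng.normal(size=N)
eps = rng.normal(size=N) * (0.5 + np.abs(x))      # disturbance variance depends on x
y = 1.0 + 2.0 * x + eps
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()                          # conventional standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")         # heteroskedasticity-robust (White-type) SEs
print(ols.bse, robust.bse)
```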
Recall the operational density forecast under the ideal conditions (which include, among other things, Gaussian homoskedastic disturbances), N(x_*′β̂, s²), where s² is the usual disturbance variance estimate. Under heteroskedasticity we instead use N(x_*′β̂, σ̂_*²), where σ̂_*² is the fitted value from the BPG or White test regression evaluated at x_*.
1. (Vocabulary)
All these have the same meaning:
(a) Show that when Ω = σ²I the GLS estimator is just the standard OLS estimator: β̂_GLS = (X′X)^{−1}X′y = β̂_OLS.

(b) Show that when Ω = σ²I the covariance matrix of the GLS estimator is just that of the standard OLS estimator: var(β̂_GLS) = σ²(X′X)^{−1} = var(β̂_OLS).
Consider the model

y_i = x_i′β + ε_i
ε_i ∼ iid N(0, σ_i²).

Dividing through by σ_i,

y_i/σ_i = (x_i′β)/σ_i + ε_i/σ_i,

makes the transformed disturbance homoskedastic, which suggests the weighted least squares (WLS) problem

min_β Σ_{i=1}^N ((y_i − x_i′β)/σ_i)² = min_β Σ_{i=1}^N (1/σ_i²)(y_i − x_i′β)².

Good idea: Use weights w_i = 1/ê_i², where the ê_i² are fitted values from the BPG test regression.

Good idea: Use weights w_i = 1/ê_i², where the ê_i² are fitted values from the White test regression.

Bad idea: Using w_i = 1/e_i² is not a good idea. e_i² is too noisy; we'd like to use not e_i² but rather E(e_i²|x_i). So we use an estimate of E(e_i²|x_i), namely the fitted value ê_i² from the regression e² → X.
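A sketch of this feasible-WLS recipe, assuming Python with statsmodels (whose WLS routine takes weights proportional to inverse variances) and simulated data; the small floor placed on the fitted variances is an ad hoc safeguard of ours, not part of the recipe above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
N = 400
x = rng.uniform(1.0, 5.0, size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N) * x          # disturbance variance grows with x
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
aux = sm.OLS(ols.resid**2, X).fit()                 # BPG-style auxiliary regression: e^2 -> X
sigma2_hat = np.clip(aux.fittedvalues, 1e-6, None)  # fitted E(e^2|x), floored at a small value
wls = sm.WLS(y, X, weights=1.0 / sigma2_hat).fit()  # weights w_i = 1 / sigma_i^2
print(ols.params, wls.params)
```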
6. (Robustness iteration)
Sometimes, after an OLS regression, people do a second-stage WLS with weights 1/|e_i|, or something similar. This is not a heteroskedasticity correction; rather, it is a device for downweighting influential outliers, in the spirit of the robustness iterations discussed in Chapter 4.
7. (Spatial Correlation)
So far we have studied a heteroskedastic situation (ε_i independent across i but not identically distributed across i). But do we really believe that the disturbances are uncorrelated over space (i)? Spatial correlation in cross sections is another type of violation of the IC. (This time it's “non-zero disturbance covariances” as opposed to “non-constant disturbance variances”.) As always, consider ε ∼ N(0, Ω). Spatial correlation (with possible heteroskedasticity as well) corresponds to a full covariance matrix,

Ω =
  σ1²   σ12   . . .   σ1N
  σ21   σ2²   . . .   σ2N
   ...    ...    . . .    ...
  σN1   σN2   . . .   σN²,

with variances on the diagonal and generally non-zero covariances off the diagonal.
yi = x′i β + εi
Limited Dependent Variables
Suppose now that the left-hand-side variable is a 0-1 indicator variable, I_i(z); that is,

I_i(z) = 1 if event z occurs, and I_i(z) = 0 otherwise.
In that case we have
E(Ii (z)|xi ) = x′i β.
That is, when the LHS variable is a 0-1 indicator variable, the model is effec-
tively a model relating a conditional probability to the conditioning variables.
There are numerous “events” that fit the 0-1 paradigm. Examples include purchasing behavior (does a certain consumer buy or not buy a certain product?), hiring behavior (does a certain firm hire or not hire a certain worker?), loan defaults (does a certain borrower default or not default on a loan?), and recessions (will a certain country have or not have a recession begin during the next year?).
But how should we “fit a line” when the LHS variable is binary? The
linear probability model does it by brute-force OLS regression Ii (z) → xi .
There are several econometric problems associated with such regressions, but
the one of particular relevance is simply that the linear probability model
fails to constrain the fitted values of E(Ii (z)|xi ) = P (Ii (z) = 1|xi ) to lie in
the unit interval, in which probabilities must of course lie. We now consider
models that impose that constraint by running x′i β through a “squashing
function,” F(·), that keeps P(I_i(z) = 1|x_i) in the unit interval. That is, we now entertain models of the form

P(I_i(z) = 1|x_i) = F(x_i′β),

where F(·) is monotone increasing, with lim_{w→−∞} F(w) = 0 and lim_{w→∞} F(w) =
1. Many squashing functions can be entertained, and many have been enter-
tained.
The most popular and useful squashing function for our purposes is the logis-
tic function, which takes us to the so-called “logit” model. There are several
varieties and issues, to which we now turn.
8.2.1 Logit
In the logit model, the squashing function F (·) is the logistic function,
F(w) = e^w / (1 + e^w) = 1 / (1 + e^{−w}),

so

P(I_i(z) = 1|x_i) = e^{x_i′β} / (1 + e^{x_i′β}).
At one level, there’s little more to say; it really is that simple. The likelihood
function can be derived, and the model can be immediately estimated by
numerical maximization of the likelihood function.
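A minimal sketch of that numerical maximization, assuming Python with numpy and scipy and simulated 0-1 data; a packaged logit routine (e.g., statsmodels' Logit) would do the same job. All names and parameter values here are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-0.5, 1.5])
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = (rng.uniform(size=N) < p).astype(float)        # simulated 0-1 outcomes

def neg_loglik(b):
    xb = X @ b
    # Bernoulli log likelihood with logistic probabilities, written in a stable form:
    # sum over i of [ log(1 + e^{x_i'b}) - y_i * x_i'b ]
    return np.sum(np.logaddexp(0.0, xb) - y * xb)

fit = minimize(neg_loglik, x0=np.zeros(X.shape[1]), method="BFGS")
print(fit.x)   # maximum-likelihood estimates of beta
```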
But an alternative latent variable formulation yields useful insights. In
particular, consider a latent variable, y_i*, where

y_i* = x_i′β + ε_i
ε_i ∼ logistic(0, 1),

and let I_i(z) be I_i(y_i* > 0), or equivalently, I_i(ε_i > −x_i′β). Interestingly, this formulation delivers exactly the logit model:

E(I_i(y_i* > 0)|x_i) = P(y_i* > 0 | x_i) = P(ε_i > −x_i′β)
                     = P(ε_i < x_i′β)   (by symmetry of the logistic density of ε)
                     = e^{x_i′β} / (1 + e^{x_i′β}),
where the last equality holds because the logistic cdf is ew /(1 + ew ).
This way of thinking about the logit DGP – a continuously-evolving latent
variable yi∗ with an observed indicator that turns “on” when yi∗ > 0 – is
very useful. For example, it helps us to think about consumer choice as a
function of continuous underlying utility, business cycle regime as a function
of continuous underlying macroeconomic conditions, etc.
The latent-variable approach also leads to natural generalizations like or-
dered logit, to which we now turn.
yi∗ = x′i β + εi
εi ∼ logistic(0, 1).
8.2.3 Complications
In logit regression, both the marginal effects and the R2 are hard to determine
and/or interpret directly.
Marginal Effects
∂E(y|x)/∂x_i = f(x′β) β_i,

where f(·) is the density corresponding to the squashing function F(·) (the logistic density in the logit case), so the marginal effect of x_i is not constant but varies with x.
R2
It's not clear how to define or interpret R² when the LHS variable is 0-1, but several variants have been proposed. The two most important are Efron's and McFadden's. Efron's R² is

1 − Σ_{i=1}^N (y_i − p̂_i)² / Σ_{i=1}^N (y_i − ȳ)²,

where p̂_i is the fitted probability for observation i.
Solving the log odds for P(I_i(z) = 1|x_i) yields the logit model,

P(I_i(z) = 1|x_i) = 1 / (1 + e^{−x_i′β}) = e^{x_i′β} / (1 + e^{x_i′β}).

Hence the logit model is simply a linear regression model for the log odds. A full statement of the model is

y_i ∼ Bern(p_i)
ln( p_i / (1 − p_i) ) = x_i′β.
5. (Probit and GLM Squashing Functions)
Other squashing functions are sometimes used for binary-response re-
gression.
6. (Multinomial Models)
In contrast to the binomial logit model, we can also have more than
two categories (e.g., what transportation method will I choose to get to
work: Private transportation, public transportation, or walking?), and
use multinomial logit.
yi∗ = β0 + β1 xi + εi .
Causal Estimation
Let B be a set of β’s and let β ∗ ∈ B minimize R(β). We will say that β̂
is consistent for a predictive effect (“P-consistent”) if plim R(β̂) = R(β ∗ ).
Hence in large samples β̂ provides a good way to predict y for any hypothet-
ical x: simply use x′ β̂. OLS is effectively always P-consistent; we require
almost no conditions of any kind! P-consistency is effectively induced by the
minimization problem that defines OLS, as the minimum-MSE predictor is
the conditional mean.
Thus far we have sketched why P-consistency holds very generally, while simply asserting that T-consistency is much more difficult to obtain and relies critically on IC2. We now sketch why.
Consider the following example. Suppose that y and z are in fact causally
unrelated, so that the true treatment effect of z on y is 0 by construction.
But suppose that z is correlated with an unobserved variable x that does
cause y. Then y and z will be correlated due to their joint dependence on x,
and that correlation can be used to predict y given z, despite the fact that, by
construction, z treatments (interventions) will have no effect on y. Clearly
this sort of situation – omission of a relevant variable correlated with an
included variable – may happen commonly, and it violates IC2. In the next
section we sketch several situations that produce violations of IC2, beginning
with an elaboration on the above-sketched omitted-variables problem.
IC2 can fail for several reasons, and we now sketch some of the most impor-
tant.
Suppose that the DGP is

y = βx + ε,

with all IC satisfied, but that we incorrectly regress y → z, where corr(x, z) >
0. If we erroneously interpret the regression y → z causally, then clearly
we’ll estimate a positive causal effect of z on y, in large as well as small
samples, even though it’s completely spurious and would vanish if x had been
included in the regression. The positive bias arises because in our example
corr(x, z) > 0; in general the sign of the bias could go either way, depending
on the sign of the correlation. We speak of “omitted variable bias”.
In this example the problem is that condition IC2 is violated in the regres-
sion y → z, because the disturbance is correlated with the regressor. β̂OLS
is P-consistent, as always. But it’s not T-consistent, because the omitted
variable x is lurking in the disturbance of the fitted regression y → z, which
makes the disturbance correlated with the regressor (i.e., IC2 fails in the fit-
ted regression). The fitted OLS regression coefficient on z will be non-zero
and may be very large, even asymptotically, despite that fact that the true
causal impact of z on y is zero by construction. The OLS estimated coeffi-
cient is reliable for predicting y given z, but not for assessing the effects on
y of treatments in z.
Suppose again that the DGP is

y = βx + ε,

but that we observe not x itself but a noisy measurement, x_m = x + v. Then, in terms of observables,

y = βx + ε
  = β(x_m − v) + ε
  = βx_m + (ε − βv)
  = βx_m + ν.

The composite disturbance ν = ε − βv is correlated with the regressor x_m (both involve v), so IC2 fails in the regression that we can actually run, y → x_m.
9.2.4 Simultaneity
Consider, for example, a regression of quantity on price,

Q = βP + ε,

but note that the ε shocks affect not only Q but also P. (Any shock to Q
is also a shock to P , since Q and P are determined jointly by supply and
demand!) That is, ε is correlated with P , violating IC2. So if you want
to estimate a demand curve (or a supply curve), simply running a mongrel
regression of Q on P will produce erroneous results. To estimate a demand
curve we need exogenous supply shifters, and to estimate a supply curve we
need exogenous demand shifters.
A naive regression of crop yield on a fertilizer-adoption indicator cannot be interpreted causally, because those farms that adopted the new fertilizer may have done so because their characteristics made them particularly likely to benefit from it. That
is, the sample of adopters may be systematic rather than random. This is
known as “sample selection”. Effectively it produces simultaneity (again, a
violation of IC2) since it produces clear reason for yield to cause adopt in
addition to adopt causing yield.
The remedy for omitted relevant variables is simple in principle: start includ-
ing them!2 Let’s continue with our earlier example. The DGP is
y = βx + ε,
y = βx + ε.
Following standard usage, let us call a regressor x that satisfies IC2 “ex-
ogenous” (satisfying IC2 at least insofar as x is uncorrelated with ε if not
conditionally independent or fully independent), and a regressor x that fails
IC2 “endogenous”.
If x is endogenous it means that IC2 fails, so we need to do something
about it. One solution is to find an acceptable “instrument” for x. An in-
strument inst is a new regressor that is both exogenous (uncorrelated with ε)
and “strong” or “relevant” (highly-correlated with x). IV estimation proceeds
as follows: (0) Find an exogenous and relevant inst, (1) In a first-stage regres-
sion run x → inst and get the fitted values x̂(inst), and (2) in a second-stage
regression run y → x̂(inst). So in the second-stage regression we replace the
endogenous x with our best linear approximation to x based on inst, namely
x̂(inst). This second-stage regression does not violate IC2 – inst is exogenous
so x̂(inst) must be as well.
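A sketch of the two-stage procedure on simulated data (numpy and statsmodels); the instrument strength and the confounding structure are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
N = 2000
inst = rng.normal(size=N)                      # exogenous, relevant instrument
u = rng.normal(size=N)                         # unobserved confounder making x endogenous
x = 0.8 * inst + u + 0.3 * rng.normal(size=N)
y = 1.0 + 0.5 * x + u + rng.normal(size=N)     # true causal effect of x is 0.5

# Stage 1: regress x on the instrument, keep fitted values
stage1 = sm.OLS(x, sm.add_constant(inst)).fit()
x_hat = stage1.fittedvalues

# Stage 2: regress y on the fitted values
stage2 = sm.OLS(y, sm.add_constant(x_hat)).fit()

naive = sm.OLS(y, sm.add_constant(x)).fit()    # IC2 violated, biased
print(naive.params[1], stage2.params[1])       # naive estimate vs. IV estimate
```

One caveat: the standard errors reported by the manual second-stage regression are not the correct IV standard errors; dedicated IV routines adjust for the generated regressor.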
In closing this section, we note that just as the prescription “start including
omitted variables” for a specific violation of IC2 (omitted variables) may at
first appear vapid, so too might the general prescription for violations of
IC2, “find a good instrument”. But plausible, if not unambiguously “good”,
instruments can often be found. Economists generally rely on a blend of
introspection and formal economic theory. Economic theory requires many
assumptions, but if the assumptions are plausible, then theory can be used
to suggest plausible instruments.
Recall the fertilizer example: you cannot simply regress yield on adoption, because those farms that adopted the new fertilizer may have done so because their characteristics made them particularly likely to benefit from it. Instead you'd need an instrument for adoption.
An RCT effectively creates an instrument for adopt in regression (9.2), by
randomizing. You randomly select some farms for adoption (treatment) and
some not (control), and you inspect the difference between yields for the two
groups. More formally, for the farms in the experiment you could run a regression of yield on an intercept and the adoption (treatment) dummy; call it regression (9.3).
The OLS estimate of c is the mean yield for the non-adoption (control) group,
and the OLS estimate of the coefficient on the treatment dummy is the mean
enhancement from adoption (treatment). You can test its significance with
the usual t test. The randomization guarantees that the regressor in (9.3) is
exogenous, so that IC2 is satisfied.
The key insight bears repeating: RCT’s, if successfully implemented, guar-
antee that IC2 is satisfied. But of course there’s no free lunch, and there
are various issues and potential problems with “successful implementation of
an RCT” in all but the simplest cases (like the example above), just as there
are many issues and potential problems with “finding a strong and exogenous
instrument”.
RCT’s can be expensive and wasteful when estimating the efficacy of a treat-
ment, as many people who don’t need treatment will be randomly assigned
treatment anyway. Hence alternative experimental designs are often enter-
tained. A leading example is the “regression discontinuity design” (RDD).
To understand the RDD, consider a famous scholarship example. You want
to know whether receipt of an academic scholarship causes enhanced aca-
demic performance among top academic performers. You can’t just regress
academic performance on a scholarship receipt dummy, because recipients
are likely to be strong academic performers even without the scholarship.
The question is whether scholarship receipt causes enhanced performance for
already-strong performers.
You could do an RCT, but in an RCT approach you’re going to give lots
of academic scholarships to lots of randomly-selected people, many of whom
are not strong performers. That’s wasteful.
An RDD design is an attractive alternative. You give scholarships only to
those who score above some threshold in the scholarship exam, and compare
the performances of students who scored just above and below the threshold.
In the RDD you don’t give any scholarships to weak performers, so it’s not
wasteful.
Notice how the RDD effectively attempts to approximate an RCT. People
just above and below the scholarship threshold are basically the same in terms
of academic talent – the only difference is that one group gets the scholarship
and one doesn’t. The RCT does the controlled experiment directly; it’s
statistically efficient but can be wasteful. The RDD does the controlled
experiment indirectly; it’s statistically less efficient but also less wasteful.
4. Who among the participants knows what about who is treated and who
is not, and related, are the behaviors of the groups evolving over time,
perhaps due to interaction with each other (“spillovers”)?
5. Are people entering and/or leaving the study at different times? If so,
when and why?
Even if an RCT study is internally valid, its results may not generalize to
other populations or situations. That is, even if internally valid, it may not
be externally valid.
Consider, for example, a study of the effects of fertilizer on crop yield done
for region X in India during a heat wave. Even if successfully randomized,
and hence internally valid, the estimated treatment effect is for the effects
of fertilizer on crop yield in region X during a heat wave. The results do
not necessarily generalize – and in this example surely do not generalize –
to times of “normal” weather, even in region X. And of course, for a variety
of reasons, they may not generalize to regions other than X, even in a heat
wave.
Hence, even if an RCT is internally valid, there is no guarantee that it
is externally valid, or “extensible”. That is, there is no guarantee that its
results will hold in other cross sections and/or time periods.
4. Panel models.
Our discussion of “differences-in-differences” designs, which involve com-
parisons over both time and space, raises the general issue of panel data
models.
Consider, for example, the bivariate panel regression case. (Extension
to multiple panel regression is immediate but notationally tedious.) The
data are (yit , xit ) (for “person” i at time t), i = 1, ..., N (cross section
dimension), t = 1, ..., T (time series dimension).
In panels we have N × T “observations,” or data points, which is a lot of
data. In a pure cross section we have just i = 1, ..., N observations, so we
could never allow for different intercepts across all people (“individual
effects”), because there are N people, and we have only N observations,
so we’d run out of degrees of freedom. Similarly, in a pure time series we
have just t = 1, ..., T observations, so we could never allow for different
intercepts (say) across all time periods (“time effects”), because there
are T time periods, and we have only T observations, so we’d run out
of degrees of freedom. But with panel data, allowing for such individual
and time effects is possible. There are N + T coefficients to be estimated
(N individual effects and T time effects), but we have N ×T observations!
One can estimate the individual effects directly by including a full set of individual dummies,

y_it → I(i = 1), I(i = 2), I(i = 3), ..., I(i = N), x_it,

or, equivalently, sweep them out by demeaning within each individual and running the “within” regression,

(y_it − ȳ_i) → (x_it − x̄_i).
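A sketch of both approaches on simulated panel data, assuming Python with pandas and statsmodels; time effects are omitted to keep the illustration short, and all names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(8)
N, T = 50, 10
alpha = rng.normal(size=N)                                  # individual effects
x = rng.normal(size=(N, T)) + alpha[:, None]                # regressor correlated with the effects
y = alpha[:, None] + 1.5 * x + rng.normal(size=(N, T))

df = pd.DataFrame({"i": np.repeat(np.arange(N), T), "y": y.ravel(), "x": x.ravel()})

# (a) Least-squares dummy variables: y on a full set of individual dummies plus x
dummies = pd.get_dummies(df["i"]).astype(float)
lsdv = sm.OLS(df["y"], pd.concat([dummies, df[["x"]]], axis=1)).fit()

# (b) Within (demeaning) transformation: (y_it - ybar_i) on (x_it - xbar_i)
yd = df["y"] - df.groupby("i")["y"].transform("mean")
xd = df["x"] - df.groupby("i")["x"].transform("mean")
within = sm.OLS(yd, xd).fit()

print(lsdv.params["x"], within.params.iloc[0])              # identical slope estimates
```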
Time Series
Chapter 10

Trend and Seasonality
The time series that we want to model vary over time, and we often men-
tally attribute that variation to unobserved underlying components related
to trend and seasonality.
Trend_t = β1 + β2 TIME_t.
value of the trend at time t=0. β2 is the slope; it’s positive if the trend is
increasing and negative if the trend is decreasing. The larger the absolute
value of β2, the steeper the trend's slope. In Figure 10.1, for example, we show
two linear trends, one increasing and one decreasing. The increasing trend
has an intercept of β1 = −50 and a slope of β2 = .8, whereas the decreasing
trend has an intercept of β1 = 10 and a gentler absolute slope of β2 = −.25.
In business, finance, and economics, linear trends are typically increasing,
corresponding to growth, but they don’t have to increase. In recent decades,
for example, male labor force participation rates have been falling, as have
the times between trades on stock exchanges. Morover, in some cases, such
as records (e.g., world records in the marathon), trends are decreasing by
definition.
Estimation of a linear trend model (for a series y, say) is easy. First we
need to create and store on the computer the variable TIME. Fortunately we don't have to type the TIME values (1, 2, 3, 4, ...) in by hand; in most good software environments, a command exists to create the trend automatically. Then we simply run the least squares regression y → c, TIME.
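A minimal sketch in Python (numpy and statsmodels), simulating a series with the increasing trend used as an example above and estimating it by the regression just described; the software choice and noise level are our own assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
T = 120
TIME = np.arange(1, T + 1, dtype=float)            # the trend variable: 1, 2, 3, ...
y = -50.0 + 0.8 * TIME + 5.0 * rng.normal(size=T)  # increasing linear trend plus noise

trend_fit = sm.OLS(y, sm.add_constant(TIME)).fit() # y -> c, TIME
print(trend_fit.params)                            # estimates of (beta1, beta2)
```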
The insight that exponential growth is non-linear in levels but linear in log-
arithms takes us to the idea of exponential trend, or log-linear trend,
which is very common in business, finance and economics.2
Exponential trend is common because economic variables often display
roughly constant real growth rates (e.g., two percent per year). If trend is
2
Throughout this book, logarithms are natural (base e) logarithms.
Trend_t = β1 e^{β2 TIME_t}.
10.3 Seasonality
In the last section we focused on the trends; now we’ll focus on seasonal-
ity. A seasonal pattern is one that repeats itself every year.3 The annual
seasonal repetition can be exact, in which case we speak of deterministic
seasonality. Here we focus exclusively on deterministic seasonality models.
Seasonality arises from links of technologies, preferences and institutions
to the calendar. The weather (e.g., daily high temperature) is a trivial but
very important seasonal series, as it’s always hotter in the summer than in
the winter. Any technology that involves the weather, such as production of
agricultural commodities, is likely to be seasonal as well.
Preferences may also be linked to the calendar. Consider, for example,
gasoline sales. People want to do more vacation travel in the summer, which
tends to increase both the price and quantity of summertime gasoline sales,
both of which feed into higher current-dollar sales.
3
Note therefore that seasonality is impossible, and therefore not an issue, in data recorded once per year,
or less often than once per year.
Figure 10.6: Liquor Sales Log-Quadratic Trend Estimation with Seasonal Dummies
Finally, social institutions that are linked to the calendar, such as holidays,
are responsible for seasonal variation in a variety of series. In Western coun-
tries, for example, sales of retail goods skyrocket every December, Christmas
season. In contrast, sales of durable goods fall in December, as Christmas
purchases tend to be nondurables. (You don’t buy someone a refrigerator for
Christmas.)
You might imagine that, although certain series are seasonal for the rea-
sons described above, seasonality is nevertheless uncommon. On the con-
trary, and perhaps surprisingly, seasonality is pervasive in business and eco-
nomics. Many industrialized economies, for example, expand briskly every
fourth quarter and contract every first quarter.
Figure 10.7: Residual Plot, Liquor Sales Log-Quadratic Trend Estimation With Seasonal
Dummies
SEAS1 indicates whether we’re in the first quarter (it’s 1 in the first quarter
and zero otherwise), SEAS2 indicates whether we’re in the second quarter
(it’s 1 in the second quarter and zero otherwise), and so on. At any given
time, we can be in only one of the four quarters, so one seasonal dummy is
1, and all others are zero.
To estimate the model for a series y, we simply run the least squares
regression,
y → SEAS1 , ..., SEASS .
We could also include an intercept, which is a variable whose value is always one, but note that the full set of S seasonal dummies sums to a variable whose value is always one, so the intercept would be completely redundant.

Trend may be included as well. For example, we can account for seasonality and linear trend by running5

y → TIME, SEAS_1, ..., SEAS_S.
In fact, you can think of what we’re doing in this section as a generalization
of what we did in the last, in which we focused exclusively on trend. We still
want to account for trend, if it’s present, but we want to expand the model
so that we can account for seasonality as well.
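A sketch of the seasonal-dummy regression with a linear trend, assuming Python with numpy and statsmodels and simulated quarterly data; the seasonal factors and noise level are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
T = 40                                              # ten years of quarterly data
TIME = np.arange(1, T + 1, dtype=float)
quarter = np.arange(T) % 4                          # 0,1,2,3,0,1,...
SEAS = np.zeros((T, 4))
SEAS[np.arange(T), quarter] = 1.0                   # full set of quarterly dummies
gamma = np.array([10.0, 8.0, 9.0, 14.0])            # seasonal factors, peaking in Q4
y = 0.3 * TIME + SEAS @ gamma + rng.normal(size=T)

# y -> TIME, SEAS1, ..., SEASS  (no intercept: the dummies sum to one)
X = np.column_stack([TIME, SEAS])
fit = sm.OLS(y, X).fit()
print(fit.params)    # trend slope followed by the four seasonal intercepts
```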
The idea of seasonality may be extended to allow for more general calendar
effects. “Standard” seasonality is just one type of calendar effect. Two
additional important calendar effects are holiday variation and trading-
day variation.
Holiday variation refers to the fact that some holidays’ dates change over
time. That is, although they arrive at approximately the same time each year,
the exact dates differ. Easter is a common example. Because the behavior
of many series, such as sales, shipments, inventories, hours worked, and so
on, depends in part on the timing of such holidays, we may want to keep
track of them in our forecasting models. As with seasonality, holiday effects
may be handled with dummy variables. In a monthly model, for example,
in addition to a full set of seasonal dummies, we might include an “Easter
dummy,” which is 1 if the month contains Easter and 0 otherwise.
Trading-day variation refers to the fact that different months contain dif-
ferent numbers of trading days or business days, which is an important con-
sideration when modeling and forecasting certain series. For example, in a
5
Note well that we drop the intercept!
Linear trend estimation results appear in Table 10.10. The trend is in-
creasing and highly significant. The adjusted R2 is 84%, reflecting the fact
that trend is responsible for a large part of the variation in liquor sales.
The residual plot (Figure 10.11) suggests, however, that linear trend is
inadequate. Instead, the trend in log liquor sales appears nonlinear, and the
neglected nonlinearity gets dumped in the residual. (We’ll introduce nonlin-
ear trend later.) The residual plot also reveals obvious residual seasonality.
In Figure 10.12 we show estimation results for a model with linear trend
and seasonal dummies. All seasonal dummies are of course highly significant
(no month has average sales of 0), and importantly the various seasonal
coefficients in many cases are significantly different from each other (that’s
the seasonality). R2 is higher.
In Figure 10.13 we show the corresponding residual plot. The model now
picks up much of the seasonality, as reflected in the seasonal fitted series
and the non-seasonal residuals. However, it clearly misses nonlinearity in the
trend, which therefore appears in the residuals.
In Figure 10.14 we plot the estimated seasonal pattern (the set of 12
estimated seasonal coefficients), which peaks during the winter holidays.
All of these results are crude approximations, because the linear trend
is clearly inadequate. We will subsequently allow for more sophisticated
(nonlinear) trends.
d. The residuals from your fitted model are effectively a linearly de-
trended version of your original series. Why? Discuss.
3. (Seasonal adjustment)
Just as we sometimes want to remove the trend from a series, sometimes
we want to seasonally adjust a series before modeling it. Seasonal
adjustment may be done using a variety of methods.
a. Discuss in detail how you’d use a linear trend plus seasonal dummies
model to seasonally adjust a series.
b. Seasonally adjust the log liquor sales data using a linear trend plus
seasonal dummy model. Discuss the patterns present and absent from
the seasonally adjusted series.
c. Search the Web (or the library) for information on the latest U.S.
Census Bureau seasonal adjustment procedure, and report what you
learned.
roughly the same for each month in a given three-month season. For
example, sales are similar in the winter months of January, February
and March, in the spring months of April, May and June, and so on.
b. A campus bookstore suspects that detrended sales are roughly the
same for all first, all second, all third, and all fourth months of
each trimester. For example, sales are similar in January, May, and
September, the first months of the first, second, and third trimesters,
respectively.
c. (Trading-day effects) A financial-markets trader suspects that de-
trended trading volume depends on the number of trading days in
the month, which differs across months.
d. (Time-varying holiday effects) A candy manufacturer suspects that
detrended candy sales tend to rise at Easter.
specific. For example, some software may use highly accurate analytic
derivatives whereas other software uses approximate numerical deriva-
tives. Even the same software package may change algorithms or details
of implementation across versions, leading to different results.
(a) Fit a linear trend plus seasonal dummy model to log liquor sales
(LSALES), using a full set of seasonal dummies.
(b) Find a “best” linear trend plus seasonal dummy LSALES model.
That is, consider tightening the seasonal specification to include
fewer than 12 seasonal dummies, and decide what’s best.
(c) Keeping the same seasonality specification as in (12b), re-estimate
the model in levels (that is, the LHS variable is now SALES rather
than LSALES) using exponential trend and nonlinear least squares.
Do your coefficient estimates match those from (12b)? Does the SIC
match that from (12b)?
(d) Repeat (12c), again using SALES and again leaving intact your
seasonal specification from (12b), but try linear and quadratic trend
instead of the exponential trend in (12c). What is your “final”
SALES model?
(e) Critique your final SALES model from (12d). In what ways is it
likely still deficient? You will of course want to discuss its residual
plot (actual values, fitted values, residuals), as well as any other
diagnostic plots or statistics that you deem relevant.
(f) Take your final estimated SALES model from (12d), and include
as regressors three lags of SALES (i.e., SALESt−1 , SALESt−2 and
SALESt−3 ). What role do the lags of SALES play? Consider this
new model to be your “final, final” SALES model, and repeat (12e).
where the wi are weights and m is an integer chosen by the user. The
“standard” one-sided moving average corresponds to a one-sided weighted
moving average with all weights equal to 1/(m + 1).
Switching mean:

f(y_t|s_t) = (1 / (√(2π) σ)) exp( −(y_t − µ_{s_t})² / (2σ²) ).

Switching regression:

f(y_t|s_t) = (1 / (√(2π) σ)) exp( −(y_t − x_t′β_{s_t})² / (2σ²) ).
Chapter 11
Serial Correlation
ε ∼ N (0, Ω),
yt = x′t β + εt
εt = ϕεt−1 + vt , |ϕ| < 1
vt ∼ iid N (0, σ 2 )
We’ve already considered models with trend and seasonal components. In this
chapter we consider a crucial third component, cycles. When you think of a
“cycle,” you might think of a rigid up-and-down pattern, as for example with
Typically the observations are ordered in time – hence the name time series
– but they don’t have to be. We could, for example, examine a spatial series,
such as office space rental rates as we move along a line from a point in
midtown Manhattan to a point in the New York suburbs thirty miles away.
But the most important case, by far, involves observations ordered in time,
so that’s what we’ll stress.
In theory, a time series realization begins in the infinite past and continues
into the infinite future. This perspective may seem abstract and of limited
practical applicability, but it will be useful in deriving certain very important
properties of the models we’ll be using shortly. In practice, of course, the data
we observe is just a finite subset of a realization, {y1 , ..., yT }, called a sample
path.
Shortly we’ll be building models for cyclical time series. If the underlying
probabilistic structure of the series were changing over time, we’d be doomed
– there would be no way to relate the future to the past, because the laws gov-
erning the future would differ from those governing the past. At a minimum
we’d like a series’ mean and covariance structure (that is, the covariances
between current and past values) to be stable over time, in which case we say
that the series is covariance stationary.
Let’s discuss covariance stationarity in greater depth. The first require-
ment for a series to be covariance stationary is that its mean be stable over
time. The mean of the series at time t is Eyt = µt . If the mean is stable over
time, as required by covariance stationarity, then we can write Eyt = µ, for
all t. Because the mean is constant over time, there’s no need to put a time
subscript on it.
The second requirement for a series to be covariance stationary is that
its covariance structure be stable over time. Quantifying stability of the
covariance structure is a bit tricky, but tremendously important, and we do
it using the autocovariance function. The autocovariance at displacement
τ is just the covariance between yt and yt−τ . It will of course depend on τ ,
corr(x, y) = cov(x, y) / (σ_x σ_y).

The autocorrelation at displacement τ is

ρ(τ) = γ(τ) / γ(0),   τ = 0, 1, 2, ....
The formula for the autocorrelation is just the usual correlation formula,
specialized to the correlation between yt and yt−τ . To see why, note that the
variance of yt is γ(0), and by covariance stationarity, the variance of y at any
other time y_{t−τ} is also γ(0). Thus,

ρ(τ) = cov(y_t, y_{t−τ}) / (√γ(0) √γ(0)) = γ(τ) / γ(0),

as claimed. Note that we always have ρ(0) = γ(0)/γ(0) = 1, because any series is
perfectly contemporaneously correlated with itself. Thus the autocorrelation
at displacement 0 isn’t of interest; rather, only the autocorrelations beyond
displacement 0 inform us about a series’ dynamic structure.
Finally, the partial autocorrelation function, p(τ ), is sometimes use-
ful. p(τ ) is just the coefficient of yt−τ in a population linear regression of
yt on yt−1 , ..., yt−τ .2 We call such regressions autoregressions, because the
variable is regressed on lagged values of itself. It’s easy to see that the
autocorrelations and partial autocorrelations, although related, differ in an
important way. The autocorrelations are just the “simple” or “regular” corre-
lations between yt and yt−τ . The partial autocorrelations, on the other hand,
measure the association between yt and yt−τ after controlling for the effects
of yt−1 , ..., yt−τ +1 ; that is, they measure the partial correlation between yt
and yt−τ .
As with the autocorrelations, we often graph the partial autocorrelations
as a function of τ and examine their qualitative shape, which we’ll do soon.
Like the autocorrelation function, the partial autocorrelation function pro-
vides a summary of a series’ dynamics, but as we’ll see, it does so in a different
way.3
All of the covariance stationary processes that we will study subsequently
have autocorrelation and partial autocorrelation functions that approach
zero, one way or another, as the displacement gets large. In Figure 11.1 we
show an autocorrelation function that displays gradual one-sided damping.
The precise decay patterns of autocorrelations and partial autocorrelations of
a covariance stationary series, however, depend on the specifics of the series.
2
To get a feel for what we mean by “population regression,” imagine that we have an infinite sample
of data at our disposal, so that the parameter estimates in the regression are not contaminated by sampling
variation – that is, they’re the true population values. The thought experiment just described is a population
regression.
3
Also in parallel to the autocorrelation function, the partial autocorrelation at displacement 0 is always
one and is therefore uninformative and uninteresting. Thus, when we graph the autocorrelation and partial
autocorrelation functions, we’ll begin at displacement 1 rather than displacement 0.
Figure 11.1
Now suppose we have a sample of data on a time series, and we don’t know
the true model that generated the data, or the mean, autocorrelation function
or partial autocorrelation function associated with that true model. Instead,
we want to use the data to estimate the mean, autocorrelation function, and
partial autocorrelation function, which we might then use to help us learn
about the underlying dynamics, and to decide upon a suitable model or set
of models to fit to the data.
Sample Mean
The mean of a covariance stationary series is µ = Ey_t; the natural estimator is the sample mean, ȳ = (1/T) Σ_{t=1}^T y_t.
Sample Autocorrelations
The sample autocorrelation at displacement τ is

ρ̂(τ) = Σ_{t=τ+1}^T (y_t − ȳ)(y_{t−τ} − ȳ) / Σ_{t=1}^T (y_t − ȳ)².

If a series is white noise, the approximate distribution of the sample autocorrelations in large samples is

ρ̂(τ) ∼ N(0, 1/T).
Note how simple the result is. The sample autocorrelations of a white noise
series are approximately normally distributed, and the normal is always a
convenient distribution to work with. Their mean is zero, which is to say the
sample autocorrelations are unbiased estimators of the true autocorrelations,
which are in fact zero. Finally, the variance of the sample autocorrelations
is approximately 1/T (equivalently, the standard deviation is 1/√T), which
is easy to construct and remember. Under normality, taking plus or minus
two standard errors yields an approximate 95% confidence interval. Thus, if
the series is white noise, approximately 95% of the sample autocorrelations
should fall in the interval 0 ± 2/√T. In practice, when we plot the sample
autocorrelations for a sample of data, we typically include the “two standard
error bands,” which are useful for making informal graphical assessments of
whether and how the series deviates from white noise.
The two-standard-error bands, although very useful, only provide 95%
bounds for the sample autocorrelations taken one at a time. Ultimately,
we’re often interested in whether a series is white noise, that is, whether all
its autocorrelations are jointly zero. A simple extension lets us test that
hypothesis. Rewrite the expression
ρ̂(τ) ∼ N(0, 1/T)

as

√T ρ̂(τ) ∼ N(0, 1).
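A sketch of the sample autocorrelations and the two-standard-error bands for a simulated white noise series, in Python with numpy; the cutoff of 20 lags is an arbitrary illustrative choice.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations rho_hat(1), ..., rho_hat(max_lag)."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    denom = np.sum(d**2)
    return np.array([np.sum(d[tau:] * d[:-tau]) / denom for tau in range(1, max_lag + 1)])

rng = np.random.default_rng(11)
T = 400
y = rng.normal(size=T)                  # white noise
rho_hat = sample_acf(y, max_lag=20)
band = 2.0 / np.sqrt(T)                 # approximate 95% band under white noise
print(np.mean(np.abs(rho_hat) > band))  # roughly 5% of lags should fall outside the band
```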
Recall that the partial autocorrelations are obtained from population linear
regressions, which correspond to a thought experiment involving linear re-
gression using an infinite sample of data. The sample partial autocorrelations
correspond to the same thought experiment, except that the linear regression
is now done on the (feasible) sample of size T . If the fitted regression is
ŷ_t = ĉ + β̂_1 y_{t−1} + · · · + β̂_τ y_{t−τ}, then p̂(τ) ≡ β̂_τ.
two diverge.
In this section we’ll study the population properties of certain important time
series models, or time series processes. Before we estimate time series
models, we need to understand their population properties, assuming that
the postulated model is true. The simplest of all such time series processes
is the fundamental building block from which all others are constructed. In
fact, it’s so important that we introduce it now. We use y to denote the
observed series of interest. Suppose that
yt = εt
εt ∼ (0, σ 2 ),
where the “shock,” εt , is uncorrelated over time. We say that εt , and hence
yt , is serially uncorrelated. Throughout, unless explicitly stated otherwise,
we assume that σ 2 < ∞. Such a process, with zero mean, constant variance,
and no serial correlation, is called zero-mean white noise, or simply white
noise.6 Sometimes for short we write
εt ∼ W N (0, σ 2 )
and hence
yt ∼ W N (0, σ 2 ).
Note that, although εt and hence yt are serially uncorrelated, they are
not necessarily serially independent, because they are not necessarily nor-
6
It’s called white noise by analogy with white light, which is composed of all colors of the spectrum,
in equal amounts. We can think of white noise as being composed of a wide variety of cycles of differing
periodicities, in equal amounts.
yt ∼ iid(0, σ 2 ),
yt ∼ iidN (0, σ 2 ).
Figure 11.2
unconditional mean and variance must be constant for any covariance sta-
tionary process. The reason is that constancy of the unconditional mean was
our first explicit requirement of covariance stationarity, and that constancy
of the unconditional variance follows implicitly from the second requirement
of covariance stationarity, that the autocovariances depend only on displace-
ment, not on time.10
To understand fully the linear dynamic structure of a covariance station-
ary time series process, we need to compute and examine its mean and its
autocovariance function. For white noise, we’ve already computed the mean
and the variance, which is the autocovariance at displacement 0. We have
yet to compute the rest of the autocovariance function; fortunately, however,
it’s very simple. Because white noise is, by definition, uncorrelated over time,
all the autocovariances, and hence all the autocorrelations, are zero beyond
displacement 0.11 Formally, then, the autocovariance function for a white
10
Recall that σ 2 = γ(0).
11
If the autocovariances are all zero, so are the autocorrelations, because the autocorrelations are propor-
tional to the autocovariances.
Figure 11.3
noise process is

γ(τ) = σ², τ = 0
γ(τ) = 0, τ ≥ 1,

and the autocorrelation function for a white noise process is

ρ(τ) = 1, τ = 0
ρ(τ) = 0, τ ≥ 1.
Figure 11.4
If forecast errors are serially correlated, then they're forecastable, and if forecast errors are forecastable
then the forecast can't be very good. Thus it's important that we understand
and be able to recognize white noise.
Thus far we’ve characterized white noise in terms of its mean, variance,
autocorrelation function and partial autocorrelation function. Another char-
acterization of dynamics involves the mean and variance of a process, condi-
tional upon its past. In particular, we often gain insight into the dynamics in
a process by examining its conditional mean.12 In fact, throughout our study
of time series, we’ll be interested in computing and contrasting the uncondi-
tional mean and variance and the conditional mean and variance of
various processes of interest. Means and variances, which convey information
about location and scale of random variables, are examples of what statisti-
cians call moments. For the most part, our comparisons of the conditional
and unconditional moment structure of time series processes will focus on
means and variances (they’re the most important moments), but sometimes
we’ll be interested in higher-order moments, which are related to properties
such as skewness and kurtosis.
For comparing conditional and unconditional means and variances, it will
simplify our story to consider independent white noise, yt ∼ iid(0, σ 2 ). By
the same arguments as before, the unconditional mean of y is 0 and the un-
conditional variance is σ 2 . Now consider the conditional mean and variance,
where the information set Ωt−1 upon which we condition contains either the
past history of the observed series, Ωt−1 = yt−1 , yt−2 , ..., or the past history of
the shocks, Ωt−1 = εt−1 , εt−2 .... (They’re the same in the white noise case.)
In contrast to the unconditional mean and variance, which must be constant
by covariance stationarity, the conditional mean and variance need not be
constant, and in general we’d expect them not to be constant. The uncondi-
tionally expected growth of laptop computer sales next quarter may be ten
12
If you need to refresh your memory on conditional means, consult any good introductory statistics book,
such as Wonnacott and Wonnacott (1990).
percent, but expected sales growth may be much higher, conditional upon
knowledge that sales grew this quarter by twenty percent. For the indepen-
dent white noise process, the conditional mean is
E(yt | Ωt−1) = 0,

and the conditional variance is

var(yt | Ωt−1) = E[(yt − E(yt | Ωt−1))² | Ωt−1] = σ².
Conditional and unconditional means and variances are identical for an inde-
pendent white noise series; there are no dynamics in the process, and hence
no dynamics in the conditional moments.
The lag operator and related constructs are the natural language in which
time series models are expressed. If you want to understand and manipulate
time series models – indeed, even if you simply want to be able to read the
software manuals – you have to be comfortable with the lag operator. The
lag operator, L, is very simple: it “operates” on a series by lagging it. Hence
Lyt = yt−1 . Similarly, L2 yt = L(L(yt )) = L(yt−1 ) = yt−2 , and so on. Typically
we’ll operate on a series not with the lag operator but with a polynomial
in the lag operator. A lag operator polynomial of degree m is just a linear
function of powers of L, up through the m-th power,
B(L) = b0 + b1 L + b2 L2 + ... + bm Lm.

A simple special case is B(L) = Lm, for which

Lm yt = yt−m.
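As a small illustration of the mechanics (a hypothetical Python sketch, not from the text; the series and coefficients are made up), the function below applies a lag operator polynomial to a series by shifting and summing:

import numpy as np

def apply_lag_polynomial(b, y):
    """Return B(L) y_t = b0*y_t + b1*y_{t-1} + ... + bm*y_{t-m}.
    The first m observations are lost, exactly as lagging loses them."""
    y = np.asarray(y, dtype=float)
    m = len(b) - 1
    return sum(b[j] * y[m - j: len(y) - j] for j in range(m + 1))

y = np.arange(1.0, 11.0)               # a simple series 1, 2, ..., 10
b = [1.0, -0.5]                        # B(L) = 1 - 0.5 L
print(apply_lag_polynomial(b, y))      # y_t - 0.5*y_{t-1} for t = 2, ..., 10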
11.2.3 Autoregression
When building models, we don’t want to pretend that the model we fit is
true. Instead, we want to be aware that we’re approximating a more complex
reality. That’s the modern view, and it has important implications for time-
series modeling. In particular, the key to successful time series modeling
is parsimonious, yet accurate, approximations. Here we emphasize a very
important class of approximations, the autoregressive (AR) model.
We begin by characterizing the autocorrelation function and related quan-
tities under the assumption that the AR model is “true.”13 These charac-
terizations have nothing to do with data or estimation, but they’re crucial
for developing a basic understanding of the properties of the models, which
is necessary to perform intelligent modeling. They enable us to make state-
ments such as “If the data were really generated by an autoregressive process,
13
Sometimes, especially when characterizing population properties under the assumption that the models
are correct, we refer to them as processes, which is short for stochastic processes.
then we’d expect its autocorrelation function to have property x.” Armed
with that knowledge, we use the sample autocorrelations and partial auto-
correlations, in conjunction with the AIC and the SIC, to suggest candidate
models, which we then estimate.
The autoregressive process is a natural approximation to time-series dy-
namics. It’s simply a stochastic difference equation, a simple mathematical
model in which the current value of a series is linearly related to its past
values, plus an additive stochastic shock. Stochastic difference equations are
a natural vehicle for discrete-time stochastic dynamic modeling.
The first-order autoregressive process, AR(1), is

yt = ϕyt−1 + εt
εt ∼ WN(0, σ²).

In lag operator form, we write

(1 − ϕL)yt = εt.
In Figure 11.5 we show simulated realizations of two AR(1) processes, with parameters ϕ = .4 and ϕ = .95; the same innovation sequence underlies each realization. The fluctuations
in the AR(1) with parameter ϕ = .95 appear much more persistent than those
Figure 11.5
of the AR(1) with parameter ϕ = .4. Thus the AR(1) model is capable of
capturing highly persistent dynamics.
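A quick way to see this persistence is to simulate the two processes with a common innovation sequence, as in the following Python sketch (the parameter values .4 and .95 match the discussion; everything else is illustrative):

import numpy as np

def simulate_ar1(phi, eps):
    """Simulate y_t = phi*y_{t-1} + eps_t, starting from y_0 = 0."""
    y = np.zeros(len(eps))
    for t in range(1, len(eps)):
        y[t] = phi * y[t - 1] + eps[t]
    return y

rng = np.random.default_rng(1)
eps = rng.standard_normal(150)          # one innovation sequence for both
y_low, y_high = simulate_ar1(0.4, eps), simulate_ar1(0.95, eps)

# A crude persistence check: first-order sample autocorrelation of each realization.
for name, y in [("phi = .40", y_low), ("phi = .95", y_high)]:
    r1 = np.corrcoef(y[1:], y[:-1])[0, 1]
    print(name, "sample rho(1) approx", round(r1, 2))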
Certain conditions must be satisfied for an autoregressive process to be
covariance stationary. If we begin with the AR(1) process,
yt = ϕyt−1 + εt ,
and substitute backward for lagged y's on the right side, we obtain the moving average representation

yt = εt + ϕεt−1 + ϕ²εt−2 + ....

This moving average representation for y is convergent if and only if |ϕ| < 1;
thus, |ϕ| < 1 is the condition for covariance stationarity in the AR(1) case.
Figure 11.6
Figure 11.7
Note in particular the simple way in which the conditional mean adapts
to the changing information set as the process evolves.
To find the autocovariances, we proceed as follows. The process is
yt = ϕyt−1 + εt ,
Figure 11.8
Figure 11.9
Thus we have

γ(0) = σ²/(1 − ϕ²)
γ(1) = ϕσ²/(1 − ϕ²)
γ(2) = ϕ²σ²/(1 − ϕ²),

and so on; in general, γ(τ) = ϕ^τ σ²/(1 − ϕ²). Dividing by γ(0) yields the autocorrelation function,

ρ(τ) = ϕ^τ, τ = 0, 1, 2, ....

The partial autocorrelation function, in contrast, cuts off abruptly: it equals ϕ at displacement 1 and is zero at all longer displacements.
It’s easy to see why. The partial autocorrelations are just the last coeffi-
cients in a sequence of successively longer population autoregressions. If the
true process is in fact an AR(1), the first partial autocorrelation is just the
autoregressive coefficient, and coefficients on all longer lags are zero.
In Figures 11.8 and 11.9 we show the partial autocorrelation functions for
our two AR(1) processes. At displacement 1, the partial autocorrelations are
simply the parameters of the process (.4 and .95, respectively), and at longer
displacements, the partial autocorrelations are zero.
The autocorrelation function for the general AR(p) process, as with that of
the AR(1) process, decays gradually with displacement. Finally, the AR(p)
partial autocorrelation function has a sharp cutoff at displacement p, for
the same reason that the AR(1) partial autocorrelation function has a sharp
cutoff at displacement 1.
Let’s discuss the AR(p) autocorrelation function in a bit greater depth.
The key insight is that, in spite of the fact that its qualitative behavior
(gradual damping) matches that of the AR(1) autocorrelation function, it
can nevertheless display a richer variety of patterns, depending on the order
and parameters of the process. It can, for example, have damped monotonic
decay, as in the AR(1) case with a positive coefficient, but it can also have
damped oscillation in ways that AR(1) can’t have. In the AR(1) case, the
only possible oscillation occurs when the coefficient is negative, in which case
the autocorrelations switch signs at each successively longer displacement. In
higher-order autoregressive models, however, the autocorrelations can oscil-
late with much richer patterns reminiscent of cycles in the more traditional
sense. This occurs when some roots of the autoregressive lag operator poly-
14
A necessary condition for covariance stationarity, which is often useful as a quick check, is ϕ1 + ϕ2 + ... + ϕp < 1.
If the condition is satisfied, the process may or may not be stationary, but if the condition is violated, the
process can't be stationary.
The corresponding lag operator polynomial is 1 − 1.5L + .9L2 , with two com-
plex conjugate roots, .83±.65i. The inverse roots are .75±.58i, both of which
are close to, but inside, the unit circle; thus the process is covariance station-
ary. It can be shown that the autocorrelation function for an AR(2) process
is
ρ(0) = 1
ρ(1) = ϕ1/(1 − ϕ2)
ρ(τ) = ϕ1 ρ(τ − 1) + ϕ2 ρ(τ − 2), τ = 2, 3, ....
Using this formula, we can evaluate the autocorrelation function for the
process at hand; we plot it in Figure 11.10. Because the roots are complex,
the autocorrelation function oscillates, and because the roots are close to the
unit circle, the oscillation damps slowly.
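The calculation just described is easy to reproduce. The sketch below (Python, illustrative only, not from the text) finds the inverse roots of 1 − 1.5L + .9L² and evaluates the AR(2) autocorrelation function by the recursion above:

import numpy as np

phi1, phi2 = 1.5, -0.9                 # y_t = 1.5 y_{t-1} - .9 y_{t-2} + eps_t

# Inverse roots of 1 - 1.5 L + .9 L^2, i.e. the roots of z^2 - 1.5 z + .9
inv_roots = np.roots([1.0, -phi1, -phi2])
print("inverse roots:", inv_roots, "moduli:", np.abs(inv_roots))  # inside the unit circle

# Autocorrelation function by the recursion rho(tau) = phi1*rho(tau-1) + phi2*rho(tau-2)
rho = np.zeros(25)
rho[0] = 1.0
rho[1] = phi1 / (1.0 - phi2)
for tau in range(2, 25):
    rho[tau] = phi1 * rho[tau - 1] + phi2 * rho[tau - 2]
print(np.round(rho[:12], 3))           # damped oscillation, since the roots are complex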
If a model has extracted all the systematic information from the data, then
what’s left – the residual – should be iid random noise. Hence the usefulness
of various residual-based tests of the hypothesis that regression disturbances
are white noise (i.e., not serially correlated).
15
Note that complex roots can’t occur in the AR(1) case.
Figure 11.10
Of course the most obvious thing is simply to inspect the residual plot.
For convenience we reproduce our liquor sales residual plot in Figure 11.11.
There is clear visual evidence of serial correlation in our liquor sales residu-
als. Sometimes, however, things are not so visually obvious. Hence we now
introduce some additional tools.
Residual Scatterplots
Figure 11.11
Figure 11.12
Durbin-Watson
yt = x′t β + εt
εt = ϕεt−1 + vt
vt ∼ iid N (0, σ 2 )
The Durbin-Watson test statistic is

DW = Σt (et − et−1)² / Σt et²,

where the et are the regression residuals. DW takes values in the interval [0, 4], and if all is well, DW should be
around 2. If DW is substantially less than 2, there is evidence of positive
serial correlation. As a rough rule of thumb, if DW is less than 1.5, there
may be cause for alarm, and we should consult the tables of the DW statistic,
available in many statistics and econometrics texts.
Hence as T → ∞:

DW ≈ [σ² + σ² − 2cov(et, et−1)] / σ² = 2(1 − corr(et, et−1)) = 2(1 − ρe(1)).
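For concreteness, here is a minimal sketch of the DW calculation and its link to the first residual autocorrelation, using an artificial residual series with positive serial correlation (all values hypothetical):

import numpy as np

def durbin_watson(e):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(2)
# Residuals with positive AR(1)-type serial correlation, for illustration
e = np.zeros(500)
v = rng.standard_normal(500)
for t in range(1, 500):
    e[t] = 0.6 * e[t - 1] + v[t]

dw = durbin_watson(e)
rho1 = np.corrcoef(e[1:], e[:-1])[0, 1]
print("DW =", round(dw, 2), "   2*(1 - rho_e(1)) =", round(2 * (1 - rho1), 2))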
yt = x′t β + εt
εt = ϕ1 εt−1 + ... + ϕp εt−p + vt
vt ∼ iidN (0, σ 2 )
ρ̂e(τ) = ĉov(et, et−τ) / v̂ar(et) = [ (1/T) Σt et et−τ ] / [ (1/T) Σt et² ].
The remaining issue is how to estimate a regression model with serially cor-
related disturbances. Let us illustrate with the AR(1) case. The model is:
yt = x′t β + εt
εt = ϕεt−1 + vt
vt ∼ iid N (0, σ 2 ).
Multiplying the first equation through by (1 − ϕL) and rearranging yields yt = ϕyt−1 + x′tβ − ϕx′t−1β + vt. This “new” model satisfies the IC (recall that v is iid), so dealing with
autocorrelated disturbances amounts to little more than including an autore-
gressive lag in the regression.17 The IC are satisfied so OLS is fine. AR(1)
disturbances require 1 lag, as we just showed. General AR(p) disturbances
require p lags.
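The following sketch illustrates the strategy on simulated data (a hypothetical DGP with made-up parameter values, not the text's liquor sales data): a static regression of y on x leaves serially correlated residuals, while adding a lag of y (and, per the footnote, a lag of x) largely cleans them out:

import numpy as np

rng = np.random.default_rng(3)
T = 400
x = rng.standard_normal(T)

# Simulate y with a regression part and AR(1) disturbances (hypothetical DGP)
eps = np.zeros(T)
v = rng.standard_normal(T)
for t in range(1, T):
    eps[t] = 0.7 * eps[t - 1] + v[t]
y = 1.0 + 2.0 * x + eps

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Static regression: y on c, x -> serially correlated residuals
X_static = np.column_stack([np.ones(T), x])
e_static = y - X_static @ ols(X_static, y)

# Augmented regression: y on c, y_{t-1}, x, x_{t-1} -> residuals close to white noise
X_dyn = np.column_stack([np.ones(T - 1), y[:-1], x[1:], x[:-1]])
e_dyn = y[1:] - X_dyn @ ols(X_dyn, y[1:])

for name, e in [("static", e_static), ("with lag of y", e_dyn)]:
    print(name, "residual rho(1) approx", round(np.corrcoef(e[1:], e[:-1])[0, 1], 2))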
For liquor sales, everything points to AR(4) disturbance dynamics – from
the residual correlogram of the original trend + seasonal model, to the DW
test results (DW is designed to detect AR(1) but of course it can also re-
ject against higher-order autoregressive alternatives), to the BG test results,
to the SIC pattern (AR(1) = −3.797, AR(2) = −3.941, AR(3) = −4.080,
AR(4) = −4.086, AR(5) = −4.071, AR(6) = −4.058, AR(7) = −4.057,
17
Note the constraint that the coefficient on xt−1 is the product of the coefficients on yt−1 and xt . One
may or may not want to rigidly impose that constraint; in any event the main thing is to “clean out” the
dynamics in εt by including lags of yt .
Figure 11.17: Trend + Seasonal Model with Four Lags of y, Residual Plot
Figure 11.18: Trend + Seasonal Model with Four Autoregressive Lags, Residual Scatterplot
Figure 11.19: Trend + Seasonal Model with Four Autoregressive Lags, Residual Autocorre-
lations
a. γ(t, τ ) = α
b. γ(t, τ ) = e−ατ
c. γ(t, τ ) = ατ
d. γ(t, τ ) = ατ , where α is a positive constant.
7. Dynamic logit.
Note that, in a logit regression, one or more of the RHS variables could
be lagged dependent variables, It−i (z), i = 1, 2, ...
8. IC2.1 in time-series.
In cross sections we wrote IC2 as “εi independent of xi ”. We did not
yet have occasion to state IC2 in time series, since we did not introduce
time series until now. In time series IC2 becomes “εt independent of
xt , xt−1 , ...”.
Event studies are a device for drawing causal inferences from the historical record. One would of course like
to watch the realizations of two universes, the one that actually oc-
curred with some treatment applied, and a parallel counterfactual uni-
verse without the treatment applied. That’s not possible in general,
but we can approximate the comparison by estimating a model on pre-
treatment data and using it to predict what would have happened in
the absence of the treatment, and comparing it to what happened in the
real data with the treatment.
“Treatment” sounds like active intervention, but again, the treatment is
usually passive in event study contexts. Consider the following example.
We want to know the effect of a new gold discovery on stock returns of a
certain gold mining firm. We can’t just look at the firm’s returns on the
announcement day, because daily stock returns vary greatly for lots of
reasons. Event studies proceed by (1) specifying and estimating a model
for the object of interest (in this case a firm’s daily stock returns) over
the pre-event period, using only pre-event data, 1, ..., T (in this case pre-
announcement data), (2) using the model to predict into the post-event
period T + 1, T + 2, ..., and (3) comparing the post-event forecast to the
post-event realization.
Chapter 12
Forecasting
12.1 ***
yt = x′t β + εt
yt+h,t = x′t+h,t β
yt = ϕyt−1 + εt
yt+h,t = ϕyt+h−1,t
yt+h | yt, yt−1, ... ∼ N(yt+h,t, σ²t+h,t)

σ²t+h,t = var(et+h,t) = var(yt+h − yt+h,t).
yt = ϕyt−1 + εt εt ∼ W N (0, σ 2 )
with variance

σ²t+1,t = var(et+1,t) = σ²

σ²t+h,t = var(et+h,t) = σ²(1 + b²1 + ... + b²h−1).
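Putting the pieces together for the AR(1) case (for which bi = ϕ^i), the sketch below uses illustrative parameter values, not the text's data, to compute the h-step-ahead point forecast ϕ^h yT, the forecast error variance σ²(1 + ϕ² + ... + ϕ^{2(h−1)}), and the associated 95% interval:

import numpy as np

def ar1_forecast(y_T, phi, sigma2, h_max):
    """Point forecasts, forecast-error variances, and 95% intervals for an AR(1)."""
    out = []
    for h in range(1, h_max + 1):
        point = phi ** h * y_T                                # y_{T+h,T} = phi^h * y_T
        var = sigma2 * sum(phi ** (2 * i) for i in range(h))  # sigma^2*(1 + phi^2 + ... + phi^{2(h-1)})
        half = 1.96 * np.sqrt(var)
        out.append((h, point, var, point - half, point + half))
    return out

for h, point, var, lo, hi in ar1_forecast(y_T=2.0, phi=0.95, sigma2=1.0, h_max=4):
    print(f"h={h}: forecast={point:.2f}, error variance={var:.2f}, 95% interval=({lo:.2f}, {hi:.2f})")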
LSALESt → c, T IM Et
SIC = 0.45
LSALESt → c, T IM Et , T IM Et2
SIC = 0.28
SIC = 0.04
SIC = 0.02
(We report exponentiated SIC's because the software actually reports ln(SIC).)
Structural Change
Recall the full ideal conditions, one of which was that the model coefficients are fixed. Violations of that
condition are of great concern in time series. The cross-section dummy variables that we already studied
effectively allow for structural change in the cross section (heterogeneity across groups). But structural
change is of special relevance in time series. It can be gradual (Lucas critique, learning, evolution of tastes,
...) or abrupt (e.g., new legislation).
Structural change is related to nonlinearity, because structural change is actually a type of nonlinearity.
Structural change is also related to outliers, because outliers can sometimes be viewed as a kind of structural
change – a quick intercept break and return.
For notational simplicity we consider the case of simple regression throughout, but the ideas extend
immediately to multiple regression.
yt = β1t + β2t xt + εt
where
β1t = γ1 + γ2 T IM Et
β2t = δ1 + δ2 T IM Et .
Then we have:
yt = (γ1 + γ2 T IM Et ) + (δ1 + δ2 T IM Et )xt + εt .
We simply run:
yt → c, T IM Et , xt , T IM Et · xt .
This is yet another important use of dummies. The regression can be used both to test for structural
change (F test of γ2 = δ2 = 0), and to accommodate it if present.
Let

Dt = 0 for t = 1, ..., T∗, and Dt = 1 for t = T∗ + 1, ..., T.

Then we can write the model as:

yt = (γ1 + γ2 Dt ) + (δ1 + δ2 Dt )xt + εt .
We simply run:
yt → c, Dt , xt , Dt · xt
The regression can be used both to test for structural change, and to accommodate it if present. It represents
yet another use of dummies. The no-break null corresponds to the joint hypothesis of zero coefficients on Dt
and Dt · xt , for which an F test is appropriate.
The Chow Test The dummy-variable setup and associated F test above is actually just a laborious way
of calculating the so-called Chow breakpoint test statistic,
Chow = [(SSRres − SSR)/K] / [SSR/(T − 2K)],
where SSRres is from the regression using sample t = 1, ..., T and SSR = SSR1 + SSR2 , where SSR1 is
from the regression using sample t = 1, ..., T ∗ and SSR2 is from the regression using sample t = T ∗ + 1, ...T .
Under the IC, Chow is distributed F , with K and T − 2K degrees of freedom.
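A minimal sketch of the Chow calculation (Python, with made-up data and an assumed break at T∗ = 100; nothing here comes from the text's datasets):

import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS regression of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    return float(e @ e)

def chow_stat(y, X, T_star):
    """Chow = [(SSR_res - SSR)/K] / [SSR/(T - 2K)], with SSR = SSR_1 + SSR_2."""
    T, K = X.shape
    ssr_res = ssr(X, y)                                  # restricted: one regression, full sample
    ssr_unres = ssr(X[:T_star], y[:T_star]) + ssr(X[T_star:], y[T_star:])
    return ((ssr_res - ssr_unres) / K) / (ssr_unres / (T - 2 * K))

# Hypothetical data with an intercept/slope break at T* = 100
rng = np.random.default_rng(4)
T, T_star = 200, 100
x = rng.standard_normal(T)
y = np.where(np.arange(T) < T_star, 1.0 + 1.0 * x, 2.0 + 0.5 * x) + 0.5 * rng.standard_normal(T)
X = np.column_stack([np.ones(T), x])
print("Chow statistic:", round(chow_stat(y, X, T_star), 2))  # compare to F(K, T-2K) critical value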
When the break date is unknown, we can calculate the Chow statistic for every possible break date in a trimmed middle portion of the sample and examine the maximal Chow statistic,

M axChow = max Chow(τ ), τ ∈ [τ1 , τ2 ],

where τ denotes sample fraction (typically we take τ1 = .15 and τ2 = .85). The distribution of M axChow has been tabulated.
************************
– Exogenously-specified break in log-linear trend model
– Endogenously-selected break in log-linear trend model
– SIC for best broken log-linear trend model vs. log-quadratic trend model
*********************************
1. If there are neglected group effects in cross-section regression, we fix the problem (of omitted group
dummies) by including the requisite group dummies.
2. If there is neglected trend or seasonality in time-series regression, we fix the problem (of omitted trend
or seasonal dummies) by including the requisite trend or seasonal dummies.
3. If there is neglected non-linearity, we fix the problem (effectively one of omitted Taylor series terms)
by including the requisite Taylor series terms.
4. If there is neglected structural change in time-series regression, we fix the problem (effectively one of
omitted parameter trend dummies or break dummies) by including the requisite trend dummies or
break dummies.
You can think of the basic “uber-strategy” as “If some systematic feature of the DGP is missing from the
model, then include it.” That is, if something is missing, then model what’s missing, and then the new uber-
model won’t have anything missing, and all will be well (i.e., the IC will be satisfied). This is an important
recognition. In a subsequent chapter, for example, we’ll study another violation of the IC known as serial
correlation (Chapter ??). The problem amounts to a feature of the DGP neglected by the initially-fitted
model, and we address the problem by incorporating the neglected feature into the model.
Vector Autoregression
**************
After introducing Granger causality, introduce time series causal est via event studies.
The idea behind a synthetic control is to create an artificial time series for the stock price that matches the behavior of the price before the
earnings announcement. If we construct such an artificial time series with data from companies unaffected by
the earnings announcement, and we construct it in such a way that it matches the behavior of the treated
company prior to treatment, then we can use this constructed time series as a control group post treatment.
The synthetic control will capture how the price would have evolved after the event, maintaining most of the
features of the environment except the earnings announcement. Then, using the synthetic control, we can
measure the effect of the earnings announcement by comparing the observed time series for the stock price
and the artificial price time series from the synthetic control.
This strategy requires that we can find price data for companies that are unaffected by the earnings
announcement and that their data can be used to construct the synthetic control. In other words,
we need to find good controls, and intuitively, this implies finding companies with features similar to the
company we are studying but for which we can claim the earnings announcement has no effect, that is,
companies that are not treated by the event.
In our example the synthetic control is constructed by combining the price time series of the control
companies. Typically, linear combination weights are carefully selected so that the resulting time series
matches the behavior of the price of the treated company prior to the earnings announcement.
——————————————————————
A univariate autoregression involves one variable. In a univariate autoregression of order p, we regress
a variable on p lags of itself. In contrast, a multivariate autoregression – that is, a vector autoregression, or
V AR – involves N variables. In an N -variable vector autoregression of order p, or V AR(p), we estimate N
different equations. In each equation, we regress the relevant left-hand-side variable on p lags of itself, and
p lags of every other variable.1 Thus the right-hand-side variables are the same in every equation – p lags
of every variable.
The key point is that, in contrast to the univariate case, vector autoregressions allow for cross-variable
dynamics. Each variable is related not only to its own past, but also to the past of all the other variables
in the system. In a two-variable V AR(1), for example, we have two equations, one for each variable (y1 and
y2 ) . We write
y1,t = ϕ11 y1,t−1 + ϕ12 y2,t−1 + ε1,t
y2,t = ϕ21 y1,t−1 + ϕ22 y2,t−1 + ε2,t .
Each variable depends on one lag of the other variable in addition to one lag of itself; that’s one obvious
source of multivariate interaction captured by the V AR that may be useful for forecasting. In addition, the
disturbances may be correlated, so that when one equation is shocked, the other will typically be shocked
as well, which is another type of multivariate interaction that univariate models miss. We summarize the
disturbance variance-covariance structure as

ε1,t ∼ WN(0, σ1²), ε2,t ∼ WN(0, σ2²), cov(ε1,t , ε2,t ) = σ12 .

The innovations could be uncorrelated, which occurs when σ12 = 0, but they needn't be.
1
Trends, seasonals, and other exogenous variables may also be included, as long as they’re all included in
every equation.
You might guess that V ARs would be hard to estimate. After all, they’re fairly complicated models,
with potentially many equations and many right-hand-side variables in each equation. In fact, precisely the
opposite is true. V ARs are very easy to estimate, because we need only run N linear regressions. That’s one
reason why V ARs are so popular – OLS estimation of autoregressive models is simple and stable. Equation-
by-equation OLS estimation also turns out to have very good statistical properties when each equation has
the same regressors, as is the case in standard V ARs. Otherwise, a more complicated estimation procedure
called seemingly unrelated regression, which explicitly accounts for correlation across equation disturbances,
would be required to obtain estimates with good statistical properties.
When fitting V AR’s to data, we can use the Schwarz criterion, just as in the univariate case. The formula
differs, however, because we’re now working with a multivariate system of equations rather than a single
equation. To get an SIC value for a V AR system, we could add up the equation-by-equation SIC’s, but
unfortunately, doing so is appropriate only if the innovations are uncorrelated across equations, which is
a very special and unusual situation. Instead, explicitly multivariate versions of information criteria are
required, which account for cross-equation innovation correlation. We interpret the SIC values computed
for V ARs of various orders in exactly the same way as in the univariate case: we select that order p such
that SIC is minimized.
We construct V AR forecasts in a way that precisely parallels the univariate case. We can construct
1-step-ahead point forecasts immediately, because all variables on the right-hand side are lagged by one
period. Armed with the 1-step-ahead forecasts, we can construct the 2-step-ahead forecasts, from which we
can construct the 3-step-ahead forecasts, and so on in the usual way, following the chain rule of forecasting.
We construct interval and density forecasts in ways that also parallel the univariate case. The multivariate
nature of V AR’s makes the derivations more tedious, however, so we bypass them. As always, to construct
practical forecasts we replace unknown parameters by estimates.
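The sketch below (Python; a hypothetical two-variable system with a made-up coefficient matrix and sample size) estimates a V AR(1) by equation-by-equation OLS and then builds forecasts by the chain rule, exactly as described:

import numpy as np

rng = np.random.default_rng(5)
T, N = 300, 2
A_true = np.array([[0.5, 0.2],
                   [0.3, 0.4]])                 # hypothetical VAR(1) coefficient matrix
Y = np.zeros((T, N))
for t in range(1, T):
    Y[t] = A_true @ Y[t - 1] + 0.5 * rng.standard_normal(N)

# Equation-by-equation OLS: regress each variable on a constant and one lag of every variable
X = np.column_stack([np.ones(T - 1), Y[:-1]])   # same regressors in every equation
coefs = np.column_stack([np.linalg.lstsq(X, Y[1:, j], rcond=None)[0] for j in range(N)])
c_hat, A_hat = coefs[0], coefs[1:].T            # intercepts and estimated coefficient matrix
print("estimated A:\n", np.round(A_hat, 2))

# Chain rule of forecasting: the 1-step forecast feeds the 2-step forecast, and so on
y_fore = Y[-1]
for h in range(1, 5):
    y_fore = c_hat + A_hat @ y_fore
    print(f"h={h} forecast:", np.round(y_fore, 2))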
In the bivariate V AR, the hypothesis that yi does not cause yj is simply that all lags of yi that appear on the right side of the yj equation have zero coefficients.2 Statistical causality
tests are based on this formulation of non-causality. We use an F -test to assess whether all coefficients on
lags of yi are jointly zero.
Note that we’ve defined non-causality in terms of 1-step-ahead prediction errors. In the bivariate V AR,
this implies non-causality in terms of h-step-ahead prediction errors, for all h. (Why?) In higher dimensional
cases, things are trickier; 1-step-ahead noncausality does not necessarily imply noncausality at other horizons.
For example, variable i may 1-step cause variable j, and variable j may 1-step cause variable k. Then variable
i may 2-step cause variable k, even though it does not 1-step cause variable k.
Causality tests are often used when building and assessing forecasting models, because they can inform
us about those parts of the workings of complicated multivariate models that are particularly relevant for
forecasting. Just staring at the coefficients of an estimated V AR (and in complicated systems there are many
coefficients) rarely yields insights into its workings. Thus we need tools that help us to see through to the
practical forecasting properties of the model that concern us. And we often have keen interest in the answers
to questions such as “Does yi contribute toward improving forecasts of yj ?,” and “Does yj contribute toward
improving forecasts of yi ?” If the results violate intuition or theory, then we might scrutinize the model
more closely. In a situation in which we can’t reject a certain noncausality hypothesis, and neither intuition
nor theory makes us uncomfortable with it, we might want to impose it, by omitting certain lags of certain
variables from certain equations.
Various types of causality hypotheses are sometimes entertained. In any equation (the j-th, say), we’ve
already discussed testing the simple noncausality hypothesis that no lags of a single other variable, yi , enter the j-th equation.
We can broaden the idea, however. Sometimes we test stronger noncausality hypotheses such as:
3. No variable in a set A causes any variable in a set B, in which case we say that the variables in A are
block non-causal for those in B.
For housing starts, the sample autocorrelation function decays slowly, whereas the sample partial autocorrelation function appears to cut
off at displacement 2. The patterns in the sample autocorrelations and partial autocorrelations are highly
statistically significant, as evidenced by both the Bartlett standard errors and the Ljung-Box Q-statistics.
The completions correlogram, in Table 14.4 and Figure 14.5, behaves similarly.
We’ve not yet introduced the cross correlation function. There’s been no need, because it’s not relevant
for univariate modeling. It provides important information, however, in the multivariate environments that
now concern us. Recall that the autocorrelation function is the correlation between a variable and lags of
itself. The cross-correlation function is a natural multivariate analog; it’s simply the correlation between a
variable and lags of another variable. We estimate those correlations using the usual estimator and graph
them as a function of displacement along with the Bartlett two-standard-error bands, which apply just as
in the univariate case.
The cross-correlation function (Figure 14.6) for housing starts and completions is very revealing. Starts
and completions are highly correlated at all displacements, and a clear pattern emerges as well: although
the contemporaneous correlation is high (.78), completions are maximally correlated with starts lagged by
roughly 6-12 months (around .90). Again, this makes good sense in light of the time it takes to build a
house.
Now we proceed to model starts and completions. We need to select the order, p, of our V AR(p). Based
on exploration using SIC, we adopt a V AR(4).
First consider the starts equation (Table 14.7a), residual plot (Figure 14.7b), and residual correlogram
(Table 14.8, Figure 14.9). The explanatory power of the model is good, as judged by the R2 as well as
the plots of actual and fitted values, and the residuals appear white, as judged by the residual sample
autocorrelations, partial autocorrelations, and Ljung-Box statistics. Note as well that no lag of completions
has a significant effect on starts, which makes sense – we obviously expect starts to cause completions,
but not conversely. The completions equation (Table 14.10a), residual plot (Figure 14.10b), and residual
correlogram (Table 14.11, Figure 14.12) appear similarly good. Lagged starts, moreover, most definitely do have a significant effect on completions, as expected.
Figure 14.9: VAR Starts Equation - Sample Autocorrelation and Partial Autocorrelation
Figure 14.12: VAR Completions Equation - Sample Autocorrelation and Partial Autocorre-
lation
Finally, we construct forecasts for the out-of-sample period, 1992.01-1996.06. The starts forecast appears
in Figure 14.14. Starts begin their recovery before 1992.01, and the V AR projects continuation of the
recovery. The V AR forecast captures the general pattern quite well, but it forecasts quicker mean reversion
than actually occurs, as is clear when comparing the forecast and realization in Figure 14.15. The figure
also makes clear that the recovery of housing starts from the recession of 1990 was slower than the previous
recoveries in the sample, which naturally makes for difficult forecasting. The completions forecast suffers the
same fate, as shown in Figures 14.16 and 14.17. Interestingly, however, completions had not yet turned by
1991.12, but the forecast nevertheless correctly predicts the turning point. (Why?)
Dynamic Heteroskedasticity
yt = B(L)εt

B(L) = Σ_{i=0}^{∞} bi L^i

Σ_{i=0}^{∞} bi² < ∞

b0 = 1

εt ∼ WN(0, σ²).
We will work with various cases of this process.
Suppose first that εt is strong white noise, εt ∼ iid(0, σ 2 ). Let us review some results already discussed
for the general linear process, which will prove useful in what follows. The unconditional mean and variance
of y are
E(yt ) = 0
and
E(yt²) = σ² Σ_{i=0}^{∞} bi²,
which are both time-invariant, as must be the case under covariance stationarity. However, the condi-
tional mean of y is time-varying:
E(yt | Ωt−1) = Σ_{i=1}^{∞} bi εt−i ,
The ability of the general linear process to capture covariance stationary conditional mean dynamics is
the source of its power.
Because the volatility of many economic time series varies, one would hope that the general linear
process could capture conditional variance dynamics as well, but such is not the case for the model as
presently specified: the conditional variance of y is constant at

var(yt | Ωt−1) = σ².
This potentially unfortunate restriction manifests itself in the properties of the h-step-ahead conditional
prediction error variance. The minimum mean squared error forecast is the conditional mean,
E(yt+h | Ωt) = Σ_{i=0}^{∞} b_{h+i} εt−i ,
and the corresponding h-step-ahead prediction error is

yt+h − E(yt+h | Ωt) = Σ_{i=0}^{h−1} bi εt+h−i ,

with conditional prediction error variance

E[(yt+h − E(yt+h | Ωt))² | Ωt] = σ² Σ_{i=0}^{h−1} bi².
The conditional prediction error variance is different from the unconditional variance, but it is not time-
varying: it depends only on h, not on the conditioning information Ωt . In the process as presently specified,
the conditional variance is not allowed to adapt to readily available and potentially useful conditioning
information.
So much for the general linear process with iid innovations. Now we extend it by allowing εt to be weak
rather than strong white noise, with a particular nonlinear dependence structure. In particular, suppose that,
as before,
yt = B(L)εt

B(L) = Σ_{i=0}^{∞} bi L^i

Σ_{i=0}^{∞} bi² < ∞

b0 = 1,

but now suppose that

εt | Ωt−1 ∼ N(0, σt²)

σt² = ω + γ(L)εt²

ω > 0,   γ(L) = Σ_{i=1}^{p} γi L^i,   γi ≥ 0 for all i,   Σ γi < 1.
Note that we parameterize the innovation process in terms of its conditional density,
εt |Ωt−1 ,
which we assume to be normal with a zero conditional mean and a conditional variance that depends
linearly on p past squared innovations. εt is serially uncorrelated but not serially independent, because the
current conditional variance σt2 depends on the history of εt .2 The stated regularity conditions are sufficient
to ensure that the conditional and unconditional variances are positive and finite, and that yt is covariance
stationary.
2
In particular, σt2 depends on the previous p values of εt via the distributed lag
γ(L)ε2t .
The unconditional mean and variance of εt are

E(εt) = 0

and

E(εt − E(εt))² = ω / (1 − Σ γi ).
The important result is not the particular formulae for the unconditional mean and variance, but the
fact that they are fixed, as required for covariance stationarity. As for the conditional moments of εt , its
conditional variance is time-varying,

var(εt | Ωt−1) = E[(εt − E(εt | Ωt−1))² | Ωt−1] = ω + γ(L)εt².
Thus, we now treat conditional mean and variance dynamics in a symmetric fashion by allowing for
movement in each, as determined by the evolving information set Ωt−1 . In the above development, εt
is called an ARCH(p) process, and the full model sketched is an infinite-ordered moving average with
ARCH(p) innovations, where ARCH stands for autoregressive conditional heteroskedasticity. Clearly εt is
conditionally heteroskedastic, because its conditional variance fluctuates. There are many models of condi-
tional heteroskedasticity, but most are designed for cross-sectional contexts, such as when the variance of a
cross-sectional regression disturbance depends on one or more of the regressors.3 However, heteroskedasticity
is often present as well in the time-series contexts relevant for forecasting, particularly in financial markets.
The particular conditional variance function associated with the ARCH process,
σt2 = ω + γ(L)ε2t ,
is tailor-made for time-series environments, in which one often sees volatility clustering, such that
large changes tend to be followed by large changes, and small by small, of either sign. That is, one may
see persistence, or serial correlation, in volatility dynamics (conditional variance dynamics), quite apart
from persistence (or lack thereof) in conditional mean dynamics. The ARCH process approximates volatility
dynamics in an autoregressive fashion; hence the name autoregressive conditional heteroskedasticity. To un-
derstand why, note that the ARCH conditional variance function links today’s conditional variance positively
to earlier lagged ε2t ’s, so that large ε2t ’s in the recent past produce a large conditional variance today, thereby
increasing the likelihood of a large ε2t today. Hence ARCH processes are to conditional variance dynamics
3
The variance of the disturbance in a model of household expenditure, for example, may depend on
income.
precisely as standard autoregressive processes are to conditional mean dynamics. The ARCH process may be
viewed as a model for the disturbance in a broader model, as was the case when we introduced it above as a
model for the innovation in a general linear process. Alternatively, if there are no conditional mean dynamics
of interest, the ARCH process may be used for an observed series. It turns out that financial asset returns
often have negligible conditional mean dynamics but strong conditional variance dynamics; hence in much
of what follows we will view the ARCH process as a model for an observed series, which for convenience we
will sometimes call a “return.”
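As a concrete illustration of volatility clustering, the following sketch (Python; ω and γ are made-up values, and the series is simulated rather than an actual return) generates an ARCH(1) process: the level of the series is essentially uncorrelated, while its square is strongly autocorrelated, and the sample variance is close to ω/(1 − γ):

import numpy as np

rng = np.random.default_rng(6)
T, omega, gamma = 20000, 0.2, 0.7                 # hypothetical ARCH(1) parameters
eps = np.zeros(T)
sigma2 = np.zeros(T)
sigma2[0] = omega / (1.0 - gamma)                 # start at the unconditional variance
for t in range(1, T):
    sigma2[t] = omega + gamma * eps[t - 1] ** 2   # conditional variance
    eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

print("sample variance:", round(eps.var(), 2),
      "  implied unconditional variance:", round(omega / (1 - gamma), 2))

# Volatility clustering: eps is (nearly) uncorrelated, but eps^2 is autocorrelated
def rho1(x):
    return np.corrcoef(x[1:], x[:-1])[0, 1]
print("rho(1) of eps:  ", round(rho1(eps), 3))
print("rho(1) of eps^2:", round(rho1(eps ** 2), 3))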
The stated conditions ensure that the conditional variance is positive and that yt is covariance stationary.
Back substitution on σt2 reveals that the GARCH(p,q) process can be represented as a restricted infinite-
ordered ARCH process,
σt² = ω / (1 − Σ βi ) + [α(L) / (1 − β(L))] εt² = ω / (1 − Σ βi ) + Σ_{i=1}^{∞} δi ε²t−i ,
which precisely parallels writing an ARMA process as a restricted infinite-ordered AR. Hence the
GARCH(p,q) process is a parsimonious approximation to what may truly be infinite-ordered ARCH volatility
dynamics.
It is important to note a number of special cases of the GARCH(p,q) process. First, of course, the
ARCH(p) process emerges when
β(L) = 0.
Second, if both α(L) and β(L) are zero, then the process is simply iid Gaussian noise with variance ω.
Hence, although ARCH and GARCH processes may at first appear unfamiliar and potentially ad hoc, they
4
By “pure” we mean that we have allowed only for conditional variance dynamics, by setting yt = εt . We
could of course also introduce conditional mean dynamics, but doing so would only clutter the discussion
while adding nothing new.
are in fact much more general than standard iid white noise, which emerges as a potentially highly-restrictive
special case.
Here we highlight some important properties of GARCH processes. All of the discussion of course applies
as well to ARCH processes, which are special cases of GARCH processes. First, consider the second-order
moment structure of GARCH processes. The first two unconditional moments of the pure GARCH process
are constant and given by
E(εt) = 0

and

E(εt − E(εt))² = ω / (1 − Σ αi − Σ βi ),
while the conditional moments are
E(εt |Ωt−1 ) = 0
and of course

var(εt | Ωt−1) = σt² = ω + α(L)εt² + β(L)σt².
In particular, the unconditional variance is fixed, as must be the case under covariance stationarity, while
the conditional variance is time-varying. It is no surprise that the conditional variance is time-varying – the
GARCH process was of course designed to allow for a time-varying conditional variance – but it is certainly
worth emphasizing: the conditional variance is itself a serially correlated time series process.
Second, consider the unconditional higher-order (third and fourth) moment structure of GARCH pro-
cesses. Real-world financial asset returns, which are often modeled as GARCH processes, are typically
unconditionally symmetric but leptokurtic (that is, more peaked in the center and with fatter tails than a
normal distribution). It turns out that the implied unconditional distribution of the conditionally Gaus-
sian GARCH process introduced above is also symmetric and leptokurtic. The unconditional leptokurtosis
of GARCH processes follows from the persistence in conditional variance, which produces clusters of “low
volatility” and “high volatility” episodes associated with observations in the center and in the tails of the
unconditional distribution, respectively. Both the unconditional symmetry and unconditional leptokurtosis
agree nicely with a variety of financial market data.
Third, consider the conditional prediction error variance of a GARCH process, and its dependence on the
conditioning information set. Because the conditional variance of a GARCH process is a serially correlated
random variable, it is of interest to examine the optimal h-step-ahead prediction, prediction error, and
conditional prediction error variance. Immediately, the h-step-ahead prediction is
E(εt+h | Ωt) = 0,

and the corresponding prediction error is simply εt+h . Its conditional variance, E(ε²t+h | Ωt), depends on the conditioning information set Ωt , because of the dynamics in the conditional variance. Simple calculations reveal that the expression for the
GARCH(p, q) process is given by
E(ε²t+h | Ωt) = ω Σ_{i=0}^{h−2} (α(1) + β(1))^i + (α(1) + β(1))^{h−1} σ²t+1 .
In the limit, this conditional variance reduces to the unconditional variance of the process,
lim_{h→∞} E(ε²t+h | Ωt) = ω / (1 − α(1) − β(1)).
For finite h, the dependence of the prediction error variance on the current information set Ωt can be
exploited to improve interval and density forecasts.
Fourth, consider the relationship between ε²t and σt². The relationship is important: GARCH dynamics
in σt² turn out to introduce ARMA dynamics in ε²t.5 More precisely, if εt is a GARCH(p,q) process, then ε²t has the ARMA representation

ε²t = ω + (α(L) + β(L))ε²t − β(L)νt + νt ,

where

νt = ε²t − σt²

is the difference between the squared innovation and the conditional variance at time t. To see this, note
that if εt is GARCH(p,q), then

σt² = ω + α(L)ε²t + β(L)σt².

Adding and subtracting β(L)ε²t on the right side and rearranging gives

σt² = ω + (α(L) + β(L))ε²t − β(L)(ε²t − σt²).

Adding ε²t − σt² to each side then gives

ε²t = ω + (α(L) + β(L))ε²t − β(L)(ε²t − σt²) + (ε²t − σt²),

so that

ε²t = ω + (α(L) + β(L))ε²t − β(L)νt + νt .

Thus ε²t follows an ARMA process with innovation νt , where νt ∈ [−σt², ∞).
ε²t is covariance stationary if the roots of α(L) + β(L) = 1 are outside the unit circle.
Fifth, consider in greater depth the similarities and differences between σt² and ε²t. Because

ε²t = σt² + νt ,

where νt = ε²t − σt², ε²t is effectively a “proxy” for σt², behaving similarly but not identically, with νt being the difference, or error.
In particular, ε2t is a noisy proxy: ε2t is an unbiased estimator of σt2 , but it is more volatile. It seems
reasonable, then, that reconciling the noisy proxy ε2t and the true underlying σt2 should involve some sort
of smoothing of ε2t . Indeed, in the GARCH(1,1) case σt2 is precisely obtained by exponentially smoothing
ε2t . To see why, consider the exponential smoothing recursion, which gives the current smoothed value as a
convex combination of the current unsmoothed value and the lagged smoothed value,

ε̄²t = γ ε²t + (1 − γ) ε̄²t−1 .

Back substitution yields an expression for the current smoothed value as an exponentially weighted
moving average of past actual values:

ε̄²t = Σ_j wj ε²t−j ,

where

wj = γ(1 − γ)^j .

Now compare this result to the GARCH(1,1) model, which gives the current volatility as a linear combination of lagged volatility and the lagged squared return,

σt² = ω + αε²t−1 + βσ²t−1 .

Back substitution yields σt² = ω/(1 − β) + α Σ_j β^{j−1} ε²t−j , so that the GARCH(1,1) process gives current volatility as an exponentially weighted moving average of past squared returns.
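The sketch below (Python; the returns and all parameter values are hypothetical) runs the two recursions side by side, exponential smoothing of squared returns and the GARCH(1,1) conditional variance filter, and the resulting volatility paths turn out to be highly correlated:

import numpy as np

rng = np.random.default_rng(7)
r = rng.standard_normal(1000) * (1 + 0.5 * np.sin(np.linspace(0, 20, 1000)))  # hypothetical returns

def exp_smooth_sq(r, gamma):
    """Exponential smoothing of squared returns: s_t = gamma*r_t^2 + (1-gamma)*s_{t-1}."""
    s = np.zeros(len(r))
    s[0] = r[0] ** 2
    for t in range(1, len(r)):
        s[t] = gamma * r[t] ** 2 + (1 - gamma) * s[t - 1]
    return s

def garch_filter(r, omega, alpha, beta):
    """GARCH(1,1) conditional variance recursion, given (hypothetical) parameters."""
    s = np.zeros(len(r))
    s[0] = r.var()
    for t in range(1, len(r)):
        s[t] = omega + alpha * r[t - 1] ** 2 + beta * s[t - 1]
    return s

smooth = exp_smooth_sq(r, gamma=0.05)
garch = garch_filter(r, omega=0.05, alpha=0.05, beta=0.90)
print("correlation of the two fitted volatility paths:",
      round(np.corrcoef(smooth, garch)[0, 1], 2))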
stock returns, which occur when γ <0.8 Asymmetric response may also be introduced via the exponential
GARCH (EGARCH) model,
Note that volatility is driven by both size and sign of shocks; hence the model allows for an asymmetric
response depending on the sign of news.9 The log specification also ensures that the conditional variance
is automatically positive, because σt2 is obtained by exponentiating ln(σt2 ) ; hence the name “exponential
GARCH.”
For example, we might specify σt² = ω + αε²t−1 + βσ²t−1 + γxt , where γ is a parameter and x is a positive exogenous variable.10 Allowance for exogenous variables in the
conditional variance function is sometimes useful. Financial market volume, for example, often helps to
explain market volatility.
σ²t+h,t = E(ε²t+h | Ωt) = ω Σ_{i=0}^{h−2} [α(1) + β(1)]^i + [α(1) + β(1)]^{h−1} σ²t+1 .
12
ω̄ is sometimes called the “long-run” variance, referring to the fact that the unconditional variance is
the long-run average of the conditional variance.
13
It turns out, moreover, that under suitable conditions the component GARCH model introduced here is
covariance stationary, and equivalent to a GARCH(2,2) process subject to certain nonlinear restrictions on
its parameters.
14
The precise form of the likelihood is complicated, and we will not give an explicit expression here, but
it may be found in various of the surveys mentioned in the Notes at the end of the chapter.
15
Routines for maximizing the GARCH likelihood are available in a number of modern software packages
such as Eviews. As with any numerical optimization, care must be taken with startup values and convergence
criteria to help insure convergence to a global, as opposed to merely local, maximum.
In words, the optimal h-step-ahead forecast is proportional to the optimal 1-step-ahead forecast. The
optimal 1-step-ahead forecast, moreover, is easily calculated: all of the determinants of σ²t+1 are lagged by
at least one period, so that there is no problem of forecasting the right-hand side variables. In practice, of
course, the underlying GARCH parameters α and β are unknown and so must be estimated, resulting in the
feasible forecast σ̂²t+h,t formed in the obvious way. In financial applications, volatility forecasts are often of
direct interest, and the GARCH model delivers the optimal h-step-ahead point forecast, σ²t+h,t . Alternatively,
and more generally, we might not be intrinsically interested in volatility; rather, we may simply want to use
GARCH volatility forecasts to improve h-step-ahead interval or density forecasts of εt , which are crucially
dependent on the h-step-ahead prediction error variance, σ²t+h,t . Consider, for example, the case of interval
forecasting. In the case of constant volatility, we earlier worked with Gaussian ninety-five percent interval
forecasts of the form
yt+h,t ± 1.96σh ,
where σh denotes the unconditional h-step-ahead standard deviation (which also equals the conditional h-
step-ahead standard deviation in the absence of volatility dynamics). Now, however, in the presence of
volatility dynamics we use
yt+h,t ± 1.96σt+h,t .
The ability of the conditional prediction interval to adapt to changes in volatility is natural and desirable:
when volatility is low, the intervals are naturally tighter, and conversely. In the presence of volatility
dynamics, the unconditional interval forecast is correct on average but likely incorrect at any given time,
whereas the conditional interval forecast is correct at all times. The issue arises as to how to detect GARCH
effects in observed returns, and related, how to assess the adequacy of a fitted GARCH model. A key and
simple device is the correlogram of squared returns, ε2t . As discussed earlier, ε2t is a proxy for the latent
conditional variance; if the conditional variance displays persistence, so too will ε²t .16 One can of course
also fit a GARCH model, and assess significance of the GARCH coefficients in the usual way.
Note that we can write the GARCH process for returns as εt = σt vt , where vt ∼ iid N(0, 1) and σt² = ω + αε²t−1 + βσ²t−1 . Equivalently, the standardized return is iid: εt /σt = vt ∼ iid N(0, 1).
This observation suggests a way to evaluate the adequacy of a fitted GARCH model: standardize returns
by the conditional standard deviation from the fitted GARCH model, σ̂t , and then check for volatility
dynamics missed by the fitted model by examining the correlogram of the squared standardized return, (εt /σ̂t )². This is routinely done in practice.
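A minimal version of that diagnostic (Python; the GARCH parameters and simulated returns are purely illustrative, and for simplicity the true σt is used in place of a fitted σ̂t):

import numpy as np

rng = np.random.default_rng(8)
T, omega, alpha, beta = 3000, 0.1, 0.1, 0.85      # hypothetical GARCH(1,1) parameters
r = np.zeros(T)
s2 = np.full(T, omega / (1 - alpha - beta))
for t in range(1, T):
    s2[t] = omega + alpha * r[t - 1] ** 2 + beta * s2[t - 1]
    r[t] = np.sqrt(s2[t]) * rng.standard_normal()

def acf(x, max_lag=5):
    xc = x - x.mean()
    d = np.sum(xc ** 2)
    return np.array([np.sum(xc[k:] * xc[:-k]) / d for k in range(1, max_lag + 1)])

v = r / np.sqrt(s2)                               # standardized returns
print("acf of r^2:        ", np.round(acf(r ** 2), 2))   # persistent
print("acf of (r/sigma)^2:", np.round(acf(v ** 2), 2))   # approximately zero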
As with the standard residual plot, the squared or absolute residual plot is always a simple univariate
plot, even when there are many right-hand side variables. Such plots feature prominently, for example,
in tracking and forecasting time-varying volatility.
a. Instead, first fit autoregressive models using the SIC to guide order selection, and then fit GARCH
models to the residuals. Redo the entire empirical analysis reported in the text in this way, and
discuss any important differences in the results.
b. Consider instead the simultaneous estimation of all parameters of AR(p)-GARCH models. That
is, estimate regression models where the regressors are lagged dependent variables and the distur-
bances display GARCH. Redo the entire empirical analysis reported in the text in this way, and
discuss any important differences in the results relative to those in the text and those obtained in
part a above.
3. (Variations on the basic ARCH and GARCH models) Using the stock return data, consider richer
models than the pure ARCH and GARCH models discussed in the text.
a. Construct, display and discuss the fitted volatility series from the AR(5) model.
b. Construct, display and discuss an alternative fitted volatility series obtained by exponential smooth-
ing, using a smoothing parameter of .10, corresponding to a large amount of smoothing, but less
than done in the text.
c. Construct, display and discuss the volatility series obtained by fitting an appropriate GARCH
model.
a. The conditional normality assumption may sometimes be violated. However, GARCH parameters
are consistently estimated by Gaussian maximum likelihood even when the normality assumption
is incorrect. Sketch some intuition for this result.
b. Fit an appropriate conditionally Gaussian GARCH model to the stock return data. How might you
use the histogram of the standardized returns to assess the validity of the conditional normality
assumption? Do so and discuss your results.
c. Sometimes the conditionally Gaussian GARCH model does indeed fail to explain all of the lep-
tokurtosis in returns; that is, especially with very high-frequency data, we sometimes find that
the conditional density is leptokurtic. Fortunately, leptokurtic conditional densities are easily in-
corporated into the GARCH model. For example, in the conditionally Student’s-t GARCH
model, the conditional density is assumed to be Student’s t, with the degrees-of-freedom d treated
as another parameter to be estimated. More precisely, we write
vt ∼ iid td / std(td)

εt = σt vt .
What is the reason for dividing the Student’s t variable, td , by its standard deviation, std(td ) ?
How might such a model be estimated?
a. Is the GARCH conditional variance specification introduced earlier, say for the i-th return, σ²it = ω + αε²i,t−1 + βσ²i,t−1 , still appealing in the multivariate case? Why or why not?
b. Consider the following specification for the conditional covariance between the i-th and j-th returns:
σij,t = ω + αεi,t−1 εj,t−1 + βσij,t−1 . Is it appealing? Why or why not?
c. Consider a fully general multivariate volatility model, in which every conditional variance and
covariance may depend on lags of every conditional variance and covariance, as well as lags of
every squared return and cross product of returns. What are the strengths and weaknesses of such
a model? Would it be useful for modeling, say, a set of five hundred returns? If not, how might
you proceed?
15.6 Notes
Part IV
Chapter 16
Misspecification and Model Selection
The IC’s of Chapter 3 are surely heroic in economic contexts, so let us begin to relax them. One aspect of
IC 1 is that the fitted model matches the true DGP. In reality we can never know the DGP, and surely any
model that we might fit fails to match it, so there is an issue of how to select and fit a “good” model.
Recall that the Akaike information criterion, or AIC, is effectively an estimate of the out-of-sample
forecast error variance, as is s2 , but it penalizes degrees of freedom more harshly. It is used to select among
competing forecasting models. The formula is:
AIC = e^(2K/N) · (Σ_{i=1}^{N} e²i) / N.
Also recall that the Schwarz information criterion, or SIC, is an alternative to the AIC with the
same interpretation, but a still harsher degrees-of-freedom penalty. The formula is:
SIC = N^(K/N) · (Σ_{i=1}^{N} e²i) / N.
Here we elaborate. We start with more on selection (“hard threshold” – variables are either kept
or discarded), and then we introduce shrinkage (“soft threshold” – all variables are kept, but parameter
estimates are coaxed in a certain direction), and then lasso, which blends selection and shrinkage.
Most model selection criteria attempt to find the model with the smallest out-of-sample 1-step-ahead mean squared prediction error. The criteria we examine fit this general approach;
the differences among criteria amount to different penalties for the number of degrees of freedom used in
estimating the model (that is, the number of parameters estimated). Because all of the criteria are effectively
estimates of out-of-sample mean square prediction error, they have a negative orientation – the smaller the
better.
First consider the mean squared error,

M SE = (Σ_{i=1}^{N} e²i) / N,
where N is the sample size and ei = yi − ŷi . M SE is intimately related to two other diagnostic statistics
routinely computed by regression software, the sum of squared residuals and R2 . Looking at the M SE
formula reveals that the model with the smallest M SE is also the model with smallest sum of squared
residuals, because scaling the sum of squared residuals by 1/N doesn’t change the ranking. So selecting
the model with the smallest M SE is equivalent to selecting the model with the smallest sum of squared
residuals. Similarly, recall the formula for R2 ,
R² = 1 − (Σ_{i=1}^{N} e²i) / (Σ_{i=1}^{N} (yi − ȳ)²) = 1 − M SE / [ (1/N) Σ_{i=1}^{N} (yi − ȳ)² ].
The denominator of the ratio that appears in the formula is just the sum of squared deviations of y from its
sample mean (the so-called “total sum of squares”), which depends only on the data, not on the particular
model fit. Thus, selecting the model that minimizes the sum of squared residuals – which as we saw is
equivalent to selecting the model that minimizes MSE – is also equivalent to selecting the model that
maximizes R2 .
Selecting forecasting models on the basis of MSE or any of the equivalent forms discussed above – that
is, using in-sample MSE to estimate the out-of-sample 1-step-ahead MSE – turns out to be a bad idea.
In-sample MSE can’t rise when more variables are added to a model, and typically it will fall continuously
as more variables are added, because the estimated parameters are explicitly chosen to minimize the sum of
squared residuals. Newly-included variables could get estimated coefficients of zero, but that’s a probability-
zero event, and to the extent that the estimate is anything else, the sum of squared residuals must fall. Thus,
the more variables we include in a forecasting model, the lower the sum of squared residuals will be, and
therefore the lower M SE will be, and the higher R2 will be. Again, the sum of squared residuals can’t rise,
and due to sampling error it’s very unlikely that we’d get a coefficient of exactly zero on a newly-included
variable even if the coefficient is zero in population.
The effects described above go under various names, including in-sample overfitting, reflecting the idea
that including more variables in a forecasting model won’t necessarily improve its out-of-sample forecasting
performance, although it will improve the model’s “fit” on historical data. The upshot is that in-sample
M SE is a downward biased estimator of out-of-sample M SE, and the size of the bias increases with the
number of variables included in the model. In-sample M SE provides an overly-optimistic (that is, too small)
assessment of out-of-sample M SE.
To reduce the bias associated with M SE and its relatives, we need to penalize for degrees of freedom
used. Thus let’s consider the mean squared error corrected for degrees of freedom,
s² = (Σ_{i=1}^{N} e²i) / (N − K),
where K is the number of degrees of freedom used in model fitting.1 s2 is just the usual unbiased estimate of
the regression disturbance variance. That is, it is the square of the usual standard error of the regression. So
selecting the model that minimizes s2 is equivalent to selecting the model that minimizes the standard error
of the regression. s2 is also intimately connected to the R2 adjusted for degrees of freedom (the “adjusted
R2 ,” or R̄2 ). Recall that
R̄² = 1 − [ Σ_{i=1}^{N} e²i / (N − K) ] / [ Σ_{i=1}^{N} (yi − ȳ)² / (N − 1) ] = 1 − s² / [ Σ_{i=1}^{N} (yi − ȳ)² / (N − 1) ].
The denominator of the R̄2 expression depends only on the data, not the particular model fit, so the
model that minimizes s2 is also the model that maximizes R̄2 . In short, the strategies of selecting the model
that minimizes s2 , or the model that minimizes the standard error of the regression, or the model that
maximizes R̄2 , are equivalent, and they do penalize for degrees of freedom used.
To highlight the degree-of-freedom penalty, let’s rewrite s2 as a penalty factor times the M SE,
s² = [ N / (N − K) ] · (Σ_{i=1}^{N} e²i) / N.
Note in particular that including more variables in a regression will not necessarily lower s2 or raise R̄2 – the MSE will fall, but the degrees-of-freedom penalty will rise, so the product could go either way.
As with s2, many of the most important forecast model selection criteria are of the form "penalty factor times MSE." The idea is simply that if we want to get an accurate estimate of the 1-step-ahead out-of-sample forecast MSE, we need to penalize the in-sample residual MSE to reflect the degrees of freedom used. Two very important such criteria are the Akaike Information Criterion (AIC) and the Schwarz Information Criterion (SIC). Their formulas are:
$$AIC = e^{\left(\frac{2K}{N}\right)} \frac{\sum_{i=1}^{N} e_i^2}{N}$$
and
$$SIC = N^{\left(\frac{K}{N}\right)} \frac{\sum_{i=1}^{N} e_i^2}{N}.$$
How do the penalty factors associated with MSE, s2, AIC and SIC compare in terms of severity? All of the penalty factors are functions of K/N, the number of parameters estimated per sample observation, and we can compare the penalty factors graphically as K/N varies. In Figure *** we show the penalties as K/N moves from 0 to .25, for a sample size of N = 100. The s2 penalty is small and rises slowly with K/N; the AIC penalty is a bit larger and still rises only slowly with K/N. The SIC penalty, on the other hand, is substantially larger and rises much more quickly with K/N.
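As a rough numerical stand-in for the comparison in that figure (a minimal sketch, assuming N = 100 and an illustrative grid of K values), the three penalty factors can simply be tabulated:

```python
import numpy as np

# Penalty factors implicit in s^2, AIC, and SIC for a sample of size N = 100 (illustrative).
N = 100
for K in [1, 5, 10, 15, 20, 25]:
    s2_penalty = N / (N - K)            # penalty built into s^2
    aic_penalty = np.exp(2 * K / N)     # AIC penalty factor
    sic_penalty = N ** (K / N)          # SIC penalty factor
    print(f"K/N = {K/N:.2f}:  s2 {s2_penalty:.3f}   AIC {aic_penalty:.3f}   SIC {sic_penalty:.3f}")
```

At K/N = .25, for example, the s2 penalty is roughly 1.33, the AIC penalty roughly 1.65, and the SIC penalty roughly 3.16, consistent with the ordering described above.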
It’s clear that the different criteria penalize degrees of freedom differently. In addition, we could propose
many other criteria by altering the penalty. How, then, do we select among the criteria? More generally,
what properties might we expect a “good” model selection criterion to have? Are s2 , AIC and SIC “good”
model selection criteria?
We evaluate model selection criteria in terms of a key property called consistency, also known as the
oracle property. A model selection criterion is consistent if:
a. when the true model (that is, the data-generating process, or DGP) is among a fixed set of models considered, the probability of selecting the true DGP approaches one as the sample size gets large, and
b. when the true model is not among a fixed set of models considered, so that it’s impossible to select the
true DGP, the probability of selecting the best approximation to the true DGP approaches one as the
sample size gets large.
We must of course define what we mean by “best approximation” above. Most model selection criteria
– including all of those discussed here – assess goodness of approximation in terms of out-of-sample mean
squared forecast error.
Consistency is of course desirable. If the DGP is among those considered, then we’d hope that as the
sample size gets large we’d eventually select it. Of course, all of our models are false – they’re intentional
simplifications of a much more complex reality. Thus the second notion of consistency is the more compelling.
MSE is inconsistent, because it doesn't penalize for degrees of freedom; that's why it's unattractive. s2 does penalize for degrees of freedom, but, as it turns out, not enough to render it a consistent model selection procedure. The AIC penalizes degrees of freedom more heavily than s2, but it too remains inconsistent; even as the sample size gets large, the AIC selects models that are too large ("overparameterized"). The SIC, which penalizes degrees of freedom most heavily, is consistent.
The discussion thus far conveys the impression that SIC is unambiguously superior to AIC for selecting
forecasting models, but such is not the case. Until now, we’ve implicitly assumed a fixed set of models.
In that case, SIC is a superior model selection criterion. However, a potentially more compelling thought experiment for forecasting is one in which we expand the set of models entertained as the sample size grows, to get progressively better approximations to the elusive DGP. We're then led to a different optimality property, called asymptotic efficiency. An asymptotically efficient model selection criterion chooses a sequence of models, as the sample size gets large, whose out-of-sample forecast MSE approaches
the one that would be obtained using the DGP at a rate at least as fast as that of any other model selection
criterion. The AIC, although inconsistent, is asymptotically efficient, whereas the SIC is not.
In practical forecasting we usually report and examine both AIC and SIC. Most often they select the
same model. When they don’t, and despite the theoretical asymptotic efficiency property of AIC, this author
recommends use of the more parsimonious model selected by the SIC, other things equal. This accords with
the parsimony principle of Chapter ?? and with the results of studies comparing out-of-sample forecasting
performance of models selected by various criteria.
The AIC and SIC have enjoyed widespread popularity, but they are not universally applicable, and
we’re still learning about their performance in specific situations. However, the general principle that we
need somehow to inflate in-sample loss estimates to get good out-of-sample loss estimates is universally
applicable.
The versions of AIC and SIC introduced above – and the claimed optimality properties in terms of
out-of-sample forecast MSE – are actually specialized to the Gaussian case, which is why they are written
in terms of minimized SSR’s rather than maximized lnL’s.2 More generally, AIC and SIC are written not
in terms of minimized SSR’s, but rather in terms of maximized lnL’s. We have:
$$AIC = -2 \ln L + 2K$$
and
$$SIC = -2 \ln L + K \ln N.$$
These are useful for any model estimated by maximum likelihood, Gaussian or non-Gaussian.
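As a minimal illustration, the following sketch (assuming simulated data and a Gaussian linear model fit with statsmodels; all variable names are hypothetical) computes AIC and SIC from the maximized log likelihood and compares them with the packaged values, noting that statsmodels labels the SIC as "BIC":

```python
import numpy as np
import statsmodels.api as sm

# Simulated illustrative data: three regressors plus an intercept.
rng = np.random.default_rng(0)
N = 200
X = sm.add_constant(rng.normal(size=(N, 3)))
y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(size=N)

fit = sm.OLS(y, X).fit()
K = fit.df_model + 1        # parameters estimated, including the intercept
lnL = fit.llf               # maximized Gaussian log likelihood

aic = -2 * lnL + 2 * K
sic = -2 * lnL + K * np.log(N)
print(aic, fit.aic)         # agrees with statsmodels' AIC
print(sic, fit.bic)         # statsmodels labels the SIC "BIC"
```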
16.2.1 Forward
Algorithm (forward stepwise regression):
– Begin by regressing only on an intercept.
– Move to a one-regressor model by including the candidate variable whose coefficient has the smallest t-statistic p-value.
– Move to a two-regressor model by including the remaining candidate variable with the smallest p-value.
– Move to a three-regressor model by including the remaining candidate variable with the smallest p-value, and so on.
This is a "greedy algorithm," producing an increasing sequence of candidate models. People often use information criteria or cross validation (CV) to select from the stepwise sequence of models; the selected model has no guaranteed optimality properties. A minimal sketch in code follows.
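One possible implementation of the forward algorithm, as a minimal sketch assuming the candidate regressors sit in a hypothetical pandas DataFrame X with response y:

```python
import statsmodels.api as sm

def forward_stepwise(y, X):
    """Greedy forward selection by smallest coefficient p-value.
    Returns the increasing sequence of variable lists, to be compared
    afterwards via information criteria or cross validation."""
    selected, remaining, path = [], list(X.columns), []
    while remaining:
        pvals = {}
        for var in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [var]])).fit()
            pvals[var] = fit.pvalues[var]
        best = min(pvals, key=pvals.get)     # smallest p-value enters
        selected.append(best)
        remaining.remove(best)
        path.append(list(selected))
    return path
```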
16.2.2 Backward
Algorithm (backward stepwise regression):
– Start with a regression that includes all K variables.
– Move to a (K − 1)-variable model by dropping the variable whose coefficient has the largest t-statistic p-value.
– Move to a (K − 2)-variable model by dropping the remaining variable with the largest p-value, and so on.
This is a "greedy algorithm," producing a decreasing sequence of candidate models. Again, people often use information criteria or CV to select from the stepwise sequence of models; the selected model has no guaranteed optimality properties. A minimal sketch in code follows.
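An analogous minimal sketch for the backward algorithm, under the same hypothetical X and y:

```python
import statsmodels.api as sm

def backward_stepwise(y, X):
    """Greedy backward elimination by largest coefficient p-value.
    Returns the decreasing sequence of candidate models."""
    selected = list(X.columns)
    path = [list(selected)]
    while len(selected) > 1:
        fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
        worst = fit.pvalues.drop("const").idxmax()   # largest p-value exits
        selected.remove(worst)
        path.append(list(selected))
    return path
```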
Now consider shrinkage. In standard settings the posterior mean is a weighted average of the maximum-likelihood estimate and the prior mean β0,
$$\hat{\beta}_{Bayes} = \omega_1 \hat{\beta}_{MLE} + \omega_2 \beta_0,$$
where the weights depend on prior precision. Hence the Bayes rule pulls, or "shrinks," the MLE toward the prior mean.
A classic shrinkage estimator is ridge regression,
$$\hat{\beta}_{ridge} = (X'X + \lambda I)^{-1} X' y.$$
The ridge estimator can be shown to be the posterior mean for a certain prior and likelihood.
λ → 0 produces OLS, whereas λ → ∞ shrinks the coefficients completely to 0. λ can be chosen by cross validation (CV). (Notice that λ cannot be chosen by information criteria, as all K regressors are included regardless of λ.) Hence CV is a more general selection procedure, useful for selecting various "tuning parameters" like λ, as opposed to just numbers of variables in hard-threshold procedures.
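A minimal ridge sketch using scikit-learn (hypothetical X and y; the λ grid, called alpha in scikit-learn, is illustrative), with λ chosen by cross validation; standardizing the regressors first is customary because the ridge penalty is not scale-invariant:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

alphas = np.logspace(-3, 3, 50)                      # illustrative grid for lambda ("alpha")
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=10))
ridge.fit(X, y)                                      # X, y assumed defined elsewhere

print(ridge.named_steps["ridgecv"].alpha_)           # lambda chosen by cross validation
print(ridge.named_steps["ridgecv"].coef_)            # shrunken coefficients
```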
Selection and shrinkage can be mixed via penalized estimation, which minimizes the sum of squared residuals plus a penalty on coefficient magnitude,
$$\hat{\beta}_{PEN} = \mathop{\mathrm{argmin}}_{\beta} \sum_{i=1}^{N} \left( y_i - \sum_{j=1}^{K} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{K} |\beta_j|^q,$$
or equivalently solves the constrained problem
$$\hat{\beta}_{PEN} = \mathop{\mathrm{argmin}}_{\beta} \sum_{i=1}^{N} \left( y_i - \sum_{j=1}^{K} \beta_j x_{ij} \right)^2 \quad \text{s.t.} \quad \sum_{j=1}^{K} |\beta_j|^q \le c.$$
Concave penalty functions that are non-differentiable at the origin produce selection, whereas smooth convex penalties produce shrinkage. Indeed one can show that taking q → 0 produces subset selection, and taking q = 2 produces ridge regression. Hence penalized estimation nests those situations and includes an intermediate case (q = 1) that produces the lasso, to which we now turn.
Setting q = 1 yields the lasso (least absolute shrinkage and selection operator),
$$\hat{\beta}_{LASSO} = \mathop{\mathrm{argmin}}_{\beta} \sum_{i=1}^{N} \left( y_i - \sum_{j=1}^{K} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{K} |\beta_j|,$$
or equivalently
$$\hat{\beta}_{LASSO} = \mathop{\mathrm{argmin}}_{\beta} \sum_{i=1}^{N} \left( y_i - \sum_{j=1}^{K} \beta_j x_{ij} \right)^2 \quad \text{s.t.} \quad \sum_{j=1}^{K} |\beta_j| \le c.$$
Ridge shrinks, but the lasso shrinks and selects. Figure ?? says it all. Notice that, like ridge and other Bayesian procedures, the lasso requires only one estimation. Moreover, the lasso minimization problem is convex (the lasso uses the smallest q for which the problem remains convex), which renders that single estimation highly tractable computationally.
The lasso also has a very convenient degrees-of-freedom result: the effective number of parameters is precisely the number of variables selected (the number of non-zero β's). This means that we can use information criteria to select among "lasso models" indexed by λ. That is, the lasso is another device for producing a nested sequence of candidate models, with larger λ's producing smaller models, and the "best" λ can then be chosen by information criteria (or cross validation, of course).
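A minimal lasso sketch (again with hypothetical X and y), choosing λ either by cross validation or, in the spirit of the degrees-of-freedom result above, by an information criterion via scikit-learn's LassoLarsIC:

```python
from sklearn.linear_model import LassoCV, LassoLarsIC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Lambda chosen by 10-fold cross validation.
lasso_cv = make_pipeline(StandardScaler(), LassoCV(cv=10))
lasso_cv.fit(X, y)                                   # X, y assumed defined elsewhere

# Lambda chosen by an information criterion; scikit-learn's "bic" plays the role of the SIC.
lasso_ic = make_pipeline(StandardScaler(), LassoLarsIC(criterion="bic"))
lasso_ic.fit(X, y)

coefs = lasso_ic.named_steps["lassolarsic"].coef_
print((coefs != 0).sum(), "variables selected")      # shrinkage *and* selection
```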
In general:
$$z_j = X v_j, \qquad z_j \perp z_{j'}, \; j' \neq j, \qquad \mathrm{var}(z_j) \le d_j^2 / N.$$
One often sees the AIC and SIC written in logarithmic form:
$$\ln(AIC) = \ln\left(\frac{\sum_{i=1}^{N} e_i^2}{N}\right) + \frac{2K}{N}$$
$$\ln(SIC) = \ln\left(\frac{\sum_{i=1}^{N} e_i^2}{N}\right) + \frac{K \ln(N)}{N}.$$
The practice is so common that ln(AIC) and ln(SIC) are often simply called the "AIC" and "SIC." AIC and SIC must be greater than zero, so ln(AIC) and ln(SIC) are always well-defined and can take on any real value. The important insight, however, is that although these variations will of course change the numerical values of AIC and SIC produced by your computer, they will not change the rankings of models under the various criteria. Consider, for example, selecting among three models. If AIC1 < AIC2 < AIC3, then it must also be true that ln(AIC1) < ln(AIC2) < ln(AIC3), so we would select model 1 regardless of the "definition" of the information criterion used.
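A tiny numerical check of the ranking-invariance point, using made-up AIC values:

```python
import numpy as np

aic = np.array([12.3, 15.8, 21.4])      # hypothetical AIC values for models 1-3
print(np.argsort(aic))                   # model ranking by AIC
print(np.argsort(np.log(aic)))           # identical ranking by ln(AIC)
```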
Appendices
Appendix A
Probability and Statistics Review
Here we review a few aspects of probability and statistics that we will rely upon at various times.
The moments of a random variable's distribution convey various sorts of information. You are already familiar with two crucially important moments, the mean and variance. In what follows we'll consider the first four moments: mean, variance, skewness and kurtosis. (In principle we could of course consider moments beyond the fourth, but in practice only the first four are typically examined.)
The mean, or expected value, of a discrete random variable is a probability-weighted average of the values it can assume,
$$E(y) = \sum_i p_i y_i.$$
(A similar formula holds in the continuous case, $E(y) = \int y \, f(y) \, dy$.)
i
Often we use the Greek letter µ to denote the mean, which measures the location, or central tendency,
of y.
The variance of y is its expected squared deviation from its mean,
$$\mathrm{var}(y) = E(y - \mu)^2.$$
We use σ2 to denote the variance, which measures the dispersion, or scale, of y around its mean.
Often we assess dispersion using the square root of the variance, which is called the standard deviation,
$$\sigma = \mathrm{std}(y) = \sqrt{E(y - \mu)^2}.$$
The standard deviation is more easily interpreted than the variance, because it has the same units of
measurement as y. That is, if y is measured in dollars (say), then so too is std(y). V ar(y), in contrast,
would be measured in rather hard-to-grasp units of “dollars squared”.
The skewness of y is its expected cubed deviation from its mean (scaled by σ 3 for technical reasons),
$$S = \frac{E(y - \mu)^3}{\sigma^3}.$$
Skewness measures the amount of asymmetry in a distribution. The larger the absolute size of the skewness,
the more asymmetric is the distribution. A large positive value indicates a long right tail, and a large negative
value indicates a long left tail. A zero value indicates symmetry around the mean.
The kurtosis of y is the expected fourth power of the deviation of y from its mean (scaled by σ 4 , again
for technical reasons),
$$K = \frac{E(y - \mu)^4}{\sigma^4}.$$
Kurtosis measures the thickness of the tails of a distribution. A kurtosis above three indicates “fat tails”
or leptokurtosis, relative to the normal, or Gaussian distribution that you studied earlier. Hence a
kurtosis above three indicates that extreme events (“tail events”) are more likely to occur than would be the
case under normality.
A.1.2 Multivariate
Suppose now that instead of a single random variable Y, we have two random variables Y and X. (We could of course consider more than two variables, but for pedagogical reasons we presently limit ourselves to two.) We can examine the distributions of Y or X in isolation, which are called marginal distributions. This is
effectively what we’ve already studied. But now there’s more: Y and X may be related and therefore move
together in various ways, characterization of which requires a joint distribution. In the discrete case the
joint distribution f (y, x) gives the probability associated with each possible pair of y and x values, and in
the continuous case the joint density f (y, x) is such that the area in any region under it gives the probability
of (y, x) falling in that region.
We can examine the moments of y or x in isolation, such as mean, variance, skewness and kurtosis. But
again, now there’s more: to help assess the dependence between y and x, we often examine a key moment
of relevance in multivariate environments, the covariance. The covariance between y and x is simply the expected product of the deviations of y and x from their respective means,
$$\mathrm{cov}(y, x) = E[(y - \mu_y)(x - \mu_x)].$$
A positive covariance means that y and x are positively related; that is, when y is above its mean x tends
to be above its mean, and when y is below its mean x tends to be below its mean. Conversely, a negative
covariance means that y and x are inversely related; that is, when y is below its mean x tends to be above
its mean, and vice versa. The covariance can take any value in the real numbers.
Frequently we convert the covariance to a correlation by standardizing by the product of σy and σx ,
$$\mathrm{corr}(y, x) = \frac{\mathrm{cov}(y, x)}{\sigma_y \sigma_x}.$$
The correlation takes values in [-1, 1]. Note that covariance depends on units of measurement (e.g., dollars,
cents, billions of dollars), but correlation does not. Hence correlation is more immediately interpretable,
which is the reason for its popularity.
Note also that covariance and correlation measure only linear dependence; in particular, a zero covariance
or correlation between y and x does not necessarily imply that y and x are independent. That is, they may be
non-linearly related. If, however, two random variables are jointly normally distributed with zero covariance,
then they are independent.
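A minimal sketch with simulated data (all values illustrative) showing that covariance depends on units of measurement while correlation does not:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)       # y and x positively related by construction

print(np.cov(y, x)[0, 1])                # covariance: depends on units of measurement
print(np.corrcoef(y, x)[0, 1])           # correlation: unit-free, lies in [-1, 1]

# Rescaling y (say, dollars to cents) scales the covariance but leaves the correlation unchanged.
print(np.cov(100 * y, x)[0, 1], np.corrcoef(100 * y, x)[0, 1])
```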
Our multivariate discussion has focused on the joint distribution f (y, x). In various chapters we will also
make heavy use of the conditional distribution f (y|x), that is, the distribution of the random variable Y
conditional upon X = x. Conditional moments are similarly important. In particular, the conditional
mean and conditional variance play key roles in econometrics, in which attention often centers on the
mean or variance of a series conditional upon the past.
A.2 Samples: Sample Moments
Thus far we have discussed aspects of population distributions. In practice we work instead with a sample of N observations drawn from an unknown population distribution f,
$$\{y_i\}_{i=1}^{N} \sim f(y),$$
and we want to learn from the sample about various aspects of f, such as its moments. To do so we use various estimators. (An estimator is an example of a statistic, or sample statistic, which is simply a function of the sample observations.) We can obtain estimators by replacing population expectations with sample averages,
because the arithmetic average is the sample analog of the population expectation. Such “analog estimators”
turn out to have good properties quite generally. The sample mean is simply the arithmetic average,
$$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i.$$
The sample variance is
$$s^2 = \frac{\sum_{i=1}^{N} (y_i - \bar{y})^2}{N-1},$$
and the sample standard deviation is its square root,
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \bar{y})^2}{N-1}}.$$
It provides an empirical measure of dispersion in the same units as y.
The sample skewness is
$$\hat{S} = \frac{\frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})^3}{\hat{\sigma}^3}.$$
It provides an empirical measure of the amount of asymmetry in the distribution of y.
The sample kurtosis is
$$\hat{K} = \frac{\frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})^4}{\hat{\sigma}^4}.$$
It provides an empirical measure of the fatness of the tails of the distribution of y relative to a normal
distribution.
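A minimal sketch computing the four sample moments for illustrative simulated data; note that scipy's skew and kurtosis use the 1/N moment estimators, matching the formulas above when σ̂ is the maximum-likelihood standard deviation, and fisher=False keeps the normal benchmark at 3:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
y = rng.normal(loc=10.0, scale=2.0, size=1000)   # illustrative data

ybar = y.mean()                                   # sample mean
s = y.std(ddof=1)                                 # sample standard deviation (N-1 divisor)
S_hat = skew(y)                                   # sample skewness
K_hat = kurtosis(y, fisher=False)                 # sample kurtosis (normal benchmark = 3)
print(ybar, s, S_hat, K_hat)
```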
Many of the most famous and important statistical sampling distributions arise in the context of sample
moments, and the normal distribution is the father of them all. In particular, the celebrated central limit
theorem establishes that under quite general conditions the sample mean ȳ will have a normal distribution
as the sample size gets large. The χ2 distribution arises from squared normal random variables, the t
distribution arises from ratios of normal and χ2 variables, and the F distribution arises from ratios of χ2
variables. Because of the fundamental nature of the normal distribution as established by the central limit
theorem, it has been studied intensively, a great deal is known about it, and a variety of powerful tools have
been developed for use in conjunction with it.
A.2.2 Multivariate
We also have sample versions of moments of multivariate distributions. In particular, the sample covariance
is
$$\widehat{\mathrm{cov}}(y, x) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})(x_i - \bar{x}).$$
A.3 Finite-Sample and Asymptotic Sampling Distributions of the Sample Mean
Suppose we have Gaussian simple random sampling,
$$y_i \sim iid\, N(\mu, \sigma^2), \quad i = 1, \ldots, N,$$
which corresponds to a special case of what we will later call the "full ideal conditions" for regression modeling. The sample mean ȳ is the natural estimator of the population mean µ. In this case, as you learned earlier, ȳ is unbiased, consistent, normally distributed with variance σ2/N, and efficient (minimum variance unbiased, MVUE). We write
$$\bar{y} \sim N\left(\mu, \frac{\sigma^2}{N}\right),$$
or equivalently
$$\sqrt{N}(\bar{y} - \mu) \sim N(0, \sigma^2).$$
We construct exact finite-sample (likelihood ratio) hypothesis tests of H0 : µ = µ0 against the two-sided alternative H1 : µ ≠ µ0 using
$$\frac{\bar{y} - \mu_0}{s / \sqrt{N}} \sim t(N-1),$$
rejecting at level α when the absolute value of the test statistic exceeds the critical value $t_{1-\alpha/2}(N-1)$.
Now suppose that the sample is simple random but not necessarily Gaussian,
$$y_i \sim iid(\mu, \sigma^2), \quad i = 1, \ldots, N.$$
Despite our dropping the normality assumption we still have that ȳ is consistent, asymptotically normally
distributed with variance σ 2 /N , and asymptotically efficient. We write,
$$\bar{y} \stackrel{a}{\sim} N\left(\mu, \frac{\sigma^2}{N}\right).$$
More precisely, as N → ∞,
$$\sqrt{N}(\bar{y} - \mu) \stackrel{d}{\rightarrow} N(0, \sigma^2).$$
In large samples we can therefore test H0 : µ = µ0 using
$$\frac{\bar{y} - \mu_0}{s / \sqrt{N}} \stackrel{a}{\sim} N(0, 1).$$
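A minimal sketch of the one-sample test on illustrative simulated data, computed both by hand and with scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
y = rng.normal(loc=0.2, scale=1.0, size=60)      # illustrative sample
mu0 = 0.0

t_stat = (y.mean() - mu0) / (y.std(ddof=1) / np.sqrt(len(y)))
p_val = 2 * (1 - stats.t.cdf(abs(t_stat), df=len(y) - 1))
print(t_stat, p_val)
print(stats.ttest_1samp(y, popmean=mu0))         # agrees with the manual computation
```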
Exercises, Problems and Complements
d. For high-quality pencils, the desired graphite content per batch is 1.8 grams, with low variation across batches. With that in mind, discuss the nature of the density f(y).
a. Based on the covariance, the claim is made that the revenues are “very strongly positively related.”
Evaluate the claim.
b. Suppose instead that, again based on the covariance, the claim is made that the revenues are
“positively related.” Evaluate the claim.
c. Suppose you learn that the revenues have a correlation of 0.93. In light of that new information,
re-evaluate the claims in parts a and b above.
3. (Simulation)
You will often need to simulate data of various types, such as iidN (µ, σ 2 ) (Gaussian simple random
sampling).
a. Using a random number generator, simulate a sample of size 30 for y, where y ∼ iidN (0, 1).
b. What is the sample mean? Sample standard deviation? Sample skewness? Sample kurtosis?
Discuss.
c. Form an appropriate 95 percent confidence interval for E(y).
d. Perform a t test of the hypothesis that E(y) = 0.
e. Perform a t test of the hypothesis that E(y) = 1.
a. Calculate the sample mean wage and test the hypothesis that it equals $9/hour.
b. Calculate sample skewness.
c. Calculate and discuss the sample correlation between wage and years of education.
Appendix B
Construction of the Wage Datasets
We construct our datasets by adjusting and sampling from the much-larger Current Population Survey (CPS)
datasets.
We extract the data from the March CPS for three years: I, II, and III, each approximately a decade
apart, using the National Bureau of Economic Research (NBER) front end (http://www.nber.org/data/
cps.html) and NBER SAS, SPSS, and Stata data definition file statements (http://www.nber.org/data/
cps_progs.html). Here we focus our discussion on the CPS-I dataset.
There are many CPS observations for which earnings data are missing. We drop those observations,
leaving 14363 observations. From those, we draw a random subsample with 1323 observations.
We use seven variables. From the CPS we obtain AGE (age), FEMALE (1 if female, 0 otherwise),
NONWHITE (1 if nonwhite, 0 otherwise), and UNION (1 if union member, 0 otherwise). We also create
EDUC (years of schooling) based on CPS variable PEEDUCA (educational attainment). Because the CPS
does not ask about years of experience, we create EXPER (potential working experience) as AGE minus
EDUC minus 6. Finally, we create WAGE as PRERNHLY (earnings per hour) in dollars for those paid
hourly, and PRERNWA (gross earnings last week) divided by PEHRUSL1 (usual working hours per week)
for those not paid hourly (PRERNHLY=0).
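A rough pandas sketch of the construction just described, assuming a CPS extract already loaded with the named columns and with EDUC already mapped from PEEDUCA; the file name and the missing-earnings filter are hypothetical stand-ins for the NBER extraction steps:

```python
import pandas as pd

cps = pd.read_csv("cps_extract.csv")                     # hypothetical CPS extract

# Drop observations with missing earnings data (illustrative filter), then subsample.
cps = cps.dropna(subset=["PRERNHLY", "PRERNWA", "PEHRUSL1"])
sample = cps.sample(n=1323, random_state=1)

# Potential experience: age minus years of schooling minus 6.
sample["EXPER"] = sample["AGE"] - sample["EDUC"] - 6

# Hourly wage: reported hourly earnings if paid hourly (PRERNHLY > 0),
# otherwise weekly earnings divided by usual weekly hours.
paid_hourly = sample["PRERNHLY"] > 0
sample["WAGE"] = sample["PRERNHLY"].where(paid_hourly,
                                          sample["PRERNWA"] / sample["PEHRUSL1"])
```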
Appendix C
Some Popular Books Worth Encountering
I have cited many of these books elsewhere, typically in various end-of-chapter complements. Here I list
them collectively.
Lewis (2003) [Michael Lewis, Moneyball ]. “Appearances may lie, but the numbers don’t, so pay attention
to the numbers.”
Gladwell (2000) [Malcolm Gladwell, The Tipping Point]. “Nonlinear phenomena are everywhere.”
Gladwell pieces together an answer to the puzzling question of why certain things "take off" whereas others languish (products, fashions, epidemics, etc.). More generally, he provides deep insights into nonlinear
environments, in which small changes in inputs can lead to small changes in outputs under some conditions,
and to huge changes in outputs under other conditions.
Taleb (2007) [Nassim Nicholas Taleb, The Black Swan]. "Warnings, and more warnings, and still more warnings, about non-normality and much else." See Chapter 4 EPC 1.
Angrist and Pischke (2009) [Joshua Angrist and Jorn-Steffen Pischke, Mostly Harmless Econometrics].
“Natural and quasi-natural experiments suggesting instruments.”
This is a fun and insightful treatment of instrumental-variables and related methods. Just don’t be
fooled by the book’s attempted landgrab, as discussed in a 2015 No Hesitations post.
Silver (2012) [Nate Silver, The Signal and the Noise]. “Pitfalls and opportunities in predictive modeling.”
Bibliography
Angrist, J.D. and J.-S. Pischke (2009), Mostly Harmless Econometrics, Princeton University Press.
Silver, N. (2012), The Signal and the Noise, Penguin Press.
Tufte, E.R. (1983), The Visual Display of Quantitative Information, Cheshire: Graphics Press.
Index
Variance, 296
Variance inflation factor, 62
Volatility clustering, 268
Volatility dynamics, 268