
Lecture Notes on

ADVANCED ECONOMETRICS

Yongmiao Hong
Department of Economics and
Department of Statistical Sciences
Cornell University

Emails: [email protected] & [email protected]

SPRING 2016

© 2016 Yongmiao Hong. All rights reserved.


Table of Contents
Chapter 1 Introduction to Econometrics
1.1 Introduction
1.2 Quantitative Features of Modern Economics
1.3 Mathematical Modeling
1.4 Empirical Validation
1.5 Illustrative Examples
1.6 Limitations of Econometric Analysis
1.7 Conclusion
Chapter 2 General Regression Analysis
2.1 Conditional Probability Distribution
2.2 Regression Analysis
2.3 Linear Regression Modeling
2.4 Correct Model Specification for Conditional Mean
2.5 Conclusion
Chapter 3 Classical Linear Regression Models
3.1 Framework and Assumptions
3.2 OLS Estimation
3.3 Goodness of Fit and Model Selection Criteria
3.4 Consistency and Efficiency of OLS
3.5 Sampling Distribution of OLS
3.6 Variance Matrix Estimator for OLS
3.7 Hypothesis Testing
3.8 Applications
3.9 Generalized Least Squares (GLS) Estimation
3.10 Conclusion
Chapter 4 Linear Regression Models with I.I.D. Observations
4.1 Introduction to Asymptotic Theory
4.2 Framework and Assumptions
4.3 Consistency of OLS
4.4 Asymptotic Normality of OLS
4.5 Asymptotic Variance Estimator for OLS
4.6 Hypothesis Testing
4.7 Testing for Conditional Homoskedasticity
4.8 Empirical Applications
4.9 Conclusion

Chapter 5 Linear Regression Models with Dependent Observations
5.1 Introduction to Time Series Analysis
5.2 Framework and Assumptions
5.3 Consistency of OLS
5.4 Asymptotic Normality of OLS
5.5 Asymptotic Variance Estimator for OLS
5.6 Hypothesis Testing
5.7 Testing for Conditional Heteroskedasticity and Autoregressive Conditional Heteroskedasticity
5.8 Testing for Serial Correlation
5.9 Conclusion
Chapter 6 Linear Regression Models under Conditional Heteroskedasticity and
Autocorrelation
6.1 Framework and Assumptions
6.2 Long-run Variance Estimation
6.3 Consistency of OLS
6.4 Asymptotic Normality of OLS
6.5 Hypothesis Testing
6.6 Testing Whether Long-run Variance Estimation Is Needed
6.7 The Classical Cochrane-Orcutt Procedure
6.8 Empirical Applications
6.9 Conclusion
Chapter 7 Instrumental Variables Regression
7.1 Framework and Assumptions
7.2 Two-Stage Least Squares (2SLS) Estimation
7.3 Consistency of 2SLS
7.4 Asymptotic Normality of 2SLS
7.5 Interpretation and Estimation of the 2SLS Asymptotic Variance
7.6 Hypothesis Testing
7.7 Hausman’s Test
7.8 Empirical Applications
7.9 Conclusion
Chapter 8 Generalized Method of Moments Estimation
8.1 Introduction to the Method of Moments Estimation
8.2 Generalized Method of Moments (GMM) Estimation
8.3 Consistency of GMM

8.4 Asymptotic Normality of GMM
8.5 Asymptotic Efficiency of GMM
8.6 Asymptotic Variance Estimation
8.7 Hypothesis Testing
8.8 Model Specification Testing
8.9 Empirical Applications
8.10 Conclusion
Chapter 9 Maximum Likelihood Estimation and Quasi-Maximum Likelihood
Estimation
9.1 Motivation
9.2 Maximum Likelihood Estimation (MLE) and Quasi-MLE
9.3 Statistical Properties of MLE/QMLE
9.3.1 Consistency
9.3.2 Implication of Correct Model Specification
9.3.3 Asymptotic Distribution
9.3.4 Efficiency of MLE
9.3.5 MLE-based Hypothesis Testing
9.4 Quasi-Maximum Likelihood Estimation
9.4.1 Asymptotic Variance Estimation
9.4.2 QMLE-based Hypothesis Testing
9.5 Model Specification Testing
9.6 Empirical Applications
9.7 Conclusion
Chapter 10 Conclusion
10.1 Summary
10.2 Directions for Further Study in Econometrics
References
Preface
Modern economies are full of uncertainties and risk. Economics studies resource allocations
in an uncertain market environment. As a generally applicable quantitative analytic tool for
uncertain events, probability and statistics have been playing an important role in economic
research. Econometrics is the statistical analysis of economic and financial data. It has become an
integral part of training in modern economics and business. This book develops a coherent set of
econometric theory and methods for economic models. It is written for an advanced econometrics
course for doctoral students in economics, business, management, statistics, applied mathematics,
and related fields. It can also be used as a reference book on econometric theory by scholars who
may be interested in both theoretical and applied econometrics.

The book is organized in a coherent manner. Chapter 1 is a general introduction to econo-


metrics. It first describes the two most important features of modern economics, namely math-
ematical modeling and empirical validation, and then discusses the role of econometrics as a
methodology in empirical studies. A few motivating economic examples are given to illustrate
how econometrics can be used in empirical studies. Finally, it points out the limitations of
econometrics and economics due to the fact that an economy is not a repeatedly controlled ex-
periment. Assumptions and careful interpretations are needed when conducting empirical studies
in economics and finance.

Chapter 2 introduces a general regression analysis. Regression analysis is modeling, esti-


mation, inference, and specification analysis of the conditional mean of economic variables of
interest given a set of explanatory variables. It is most widely applied in economics. Among
other things, this chapter interprets the mean squared error and its optimizer, which lays down
the probability-theoretic foundation for least squares estimation. In particular, it provides an
interpretation for the least squares estimator and its relationship with the true parameter value
of a correctly specified regression model.

Chapter 3 introduces the classical linear regression analysis. A set of classical assumptions
are given and discussed, and conventional statistical procedures for estimation, inference, and
hypothesis testing are introduced. The roles of conditional homoskedasticity, serial uncorrelat-
edness, and normality of the disturbance of a linear regression model are analyzed in a finite
sample econometric theory. We also discuss the generalized least squares estimation as an
efficient estimation method of a linear regression model when the variance-covariance matrix is
known up to a constant. In particular, the generalized least squares estimation is embedded as
an ordinary least squares estimation of a suitably transformed regression model via conditional
variance scaling and autocorrelation filtering.

The subsequent chapters 4–7 are the generalizations of classical linear regression analysis
when various classical assumptions fail. Chapter 4 first relaxes the normality and conditional
homoskedasticity assumptions, two key conditions assumed in the classical linear regression mod-
eling. A large sample theoretic approach is taken. For simplicity, it is assumed that the observed
data are generated from an independent and identically distributed random sample. It is shown
that while the finite distributional theory is no longer valid, the classical statistical procedures are
still approximately applicable when the sample size is large, provided conditional homoskedas-
ticity holds. In contrast, if the data display conditional heteroskedasticity, classical statistical
procedures are not applicable even for large samples, and heteroskedasticity-robust procedures
will be called for. Tests for existence of conditional heteroskedasticity in a linear regression
framework are introduced.

Chapter 5 extends the linear regression theory to time series data. First, it introduces a
variety of basic concepts in time series analysis. Then it shows that the large sample theory for
i.i.d. random samples carries over to stationary ergodic time series data if the regression error
follows a martingale difference sequence. We introduce tests for serial correlation, and tests for
conditional heteroskedasticity and autoregressive conditional heteroskedasticity in a time series
regression framework. We also discuss the impact of autoregressive conditional heteroskedasticity
on inferences of static time series regressions and dynamic time series regressions.

Chapter 6 extends the large sample theory to a very general case where there exist conditional
heteroskedasticity and autocorrelation. In this case, the classical regression theory cannot be
used, and a long-run variance-covariance matrix estimator is called for to validate statistical
inferences in a time series regression framework.

Chapter 7 covers instrumental variable estimation for linear regression models, where the
regression error is correlated with the regressors. This can arise due to measurement errors,
simultaneous equation biases, and various other reasons. Two-stage least squares estimation
and related statistical inference procedures are developed in detail. We describe tests for
endogeneity.

Chapter 8 introduces the generalized method of moments, which is a popular estimation


method for possibly nonlinear econometric models characterized as a set of moment conditions.
Indeed, most economic theories, such as rational expectations, can be formulated by a moment
condition. The generalized method of moments is particularly suitable to estimate model para-
meters contained in the moment conditions for which the conditional distribution is usually not
available.

Chapter 9 introduces the maximum likelihood estimation and the quasi-maximum likelihood
estimation methods for conditional probability models and other nonlinear econometric mod-

els. We exploit the important implications of correct specification of a conditional distribution
model, especially the analogy between the martingale difference sequence property of the score
function and serial uncorrelatedness, and the analogy between the conditional information equal-
ity and conditional homoskedasticity. These links can provide a great help in understanding the
large sample properties of the maximum likelihood estimator and the quasi-maximum likelihood
estimator.

Chapter 10 concludes the book by summarizing the main econometric theory and methods
covered in this book, and pointing out directions for further study in econometrics.

This book has several important features. It covers, in a progressive manner, various econo-
metric models and related methods, from conditional means to possibly nonlinear conditional
moments to entire conditional distributions, and this is achieved in a unified and coherent
framework. We also provide a brief review of asymptotic analytic tools and show how they are
used to develop the econometric theory in each chapter. By going through this book progres-
sively, readers will learn how to do asymptotic analysis for econometric models. Such skills are
useful not only for those students who intend to work on theoretical econometrics, but also for
those who intend to work on applied subjects in economics because with such analytic skills,
readers will be able to understand more specialized or more advanced econometrics textbooks.

This book is based on my lecture notes taught at Cornell University, Renmin University of
China, Shandong University, Shanghai Jiao Tong University, Tsinghua University, and Xiamen
University, where the graduate students provided detailed comments on my lecture notes.

CHAPTER 1 INTRODUCTION TO
ECONOMETRICS
Abstract: Econometrics has become an integral part of training in modern economics and
business. Together with microeconomics and macroeconomics, econometrics has been taught as
one of the three core courses in most undergraduate and graduate economic programs in North
America. This chapter discusses the philosophy and methodology of econometrics in economic
research, the roles and limitations of econometrics, and the differences between econometrics and
mathematical economics as well as mathematical statistics. A variety of illustrative econometric
examples are given, which cover various fields of economics and finance.

Key Words: Data generating process, Econometrics, Probability law, Quantitative analysis,
Statistics.

1.1 Introduction
Econometrics has become an integral part of teaching and research in modern economics
and business. The importance of econometrics has been increasingly recognized over the past
several decades. In this chapter, we will discuss the philosophy and methodology of econometrics
in economic research. First, we will discuss the quantitative features of modern economics, and the
differences between econometrics and mathematical economics as well as mathematical statistics.
Then we will focus on the important roles of econometrics as a fundamental methodology in
economic research via a variety of illustrative economic examples including the consumption
function, marginal propensity to consume and multipliers, rational expectations models and
dynamic asset pricing, the constant return to scale and regulations, evaluation of the effects of
economic reforms in a transitional economy, the efficient market hypothesis, modeling uncertainty
and volatility, and duration analysis in labor economics and finance. These examples range
from econometric analysis of the conditional mean to the conditional variance to the conditional
distribution of economic variables of interest. We will also discuss the limitations of econometrics,
due to the nonexperimental nature of economic data and the time-varying nature of econometric
structures.

1.2 Quantitative Features of Modern Economics


Modern market economies are full of uncertainties and risk. When economic agents make a
decision, the outcome is usually unknown in advance and economic agents will take this uncer-
tainty into account in their decision-making. Modern economics is a study on scarce resource
allocations in an uncertain market environment. Generally speaking, modern economics can be

roughly classified into four categories: macroeconomics, microeconomics, financial economics,
and econometrics. Of them, macroeconomics, microeconomics and econometrics now consti-
tute the core courses for most economic doctoral programs in North America, while financial
economics is now mainly being taught in business and management schools.
Most doctoral programs in economics in the U.S. emphasize quantitative analysis. Quantita-
tive analysis consists of mathematical modeling and empirical studies. To understand the roles
of quantitative analysis, it may be useful to first describe the general process of modern economic
research. Like most natural sciences, the general methodology of modern economic research can
be roughly summarized as follows:

Step 1: Data collection and summary of empirical stylized facts. The so-called stylized
facts are often summarized from observed economic data. For example, in microeconomics,
a well-known stylized fact is the Engel’s curve, which characterizes that the share of a
consumer’s expenditure on a commodity out of her or his total income will vary as his/her
income changes; in macroeconomics, a well-known stylized fact is the Phillips Curve, which
characterizes a negative correlation between the in‡ation rate and the unemployment rate
in an aggregate economy; and in …nance, a well-known stylized fact about …nancial markets
is volatility clustering, that is, a high volatility today tends to be followed by another high
volatility tomorrow, a low volatility today tends to be followed by another low volatility
tomorrow, and both alternate over time. The empirical stylized facts often serve as a
starting point for economic research. For example, the development of unit root and
cointegration econometrics was mainly motivated by the empirical study of Nelson and
Plosser (1982) who found that most macroeconomic time series are unit root processes.

Step 2: Development of economic theories/models. With the empirical stylized facts in


mind, economists then develop an economic theory or model in order to explain them. This
usually calls for specifying a mathematical model of economic theory. In fact, the objective
of economic modeling is not merely to explain the stylized facts, but to understand the
mechanism governing the economy and to forecast the future evolution of the economy.

Step 3: Empirical verification of economic models. Economic theory only suggests a quali-
tative economic relationship. It does not offer any concrete functional form. In the process
of transforming a mathematical model into a testable empirical econometric model, one of-
ten has to assume some functional form, up to some unknown model parameters. One needs
to estimate unknown model parameters based on the observed data, and check whether
the econometric model is adequate. An adequate model should be at least consistent with
the empirical stylized facts.

Step 4: Applications. After an econometric model passes the empirical evaluation, it can

then be used to test economic theory or hypotheses, to forecast future evolution of the
economy, and to make policy recommendations.

For an excellent example highlighting these four steps, see Gujarati (2006, Section 1.3) on
labor force participation. We note that not every economist or every research paper has to
complete these four steps. In fact, it is not uncommon that each economist may only work on
research belonging to a certain stage in his/her entire academic lifetime.
From the general methodology of economic research, we see that modern economics has two
important features: one is mathematical modeling for economic theory, and the other is empirical
analysis for economic phenomena. These two features arise from the effort of several generations
of economists to make economics a "science". To be a science, any theory must fulfill two criteria:
one is logical consistency and coherency in the theory itself, and the other is consistency between
theory and stylized facts. Mathematics and econometrics serve to help fulfill these two criteria
respectively. This has been the main objective of the Econometric Society. The establishment of the
Nobel Memorial Prize in economics in 1969 may be viewed as the recognition of economics as a
science in the academic profession.

1.3 Mathematical Modeling


We first discuss the role of mathematical modeling in economics. Why do we need mathe-
matics and mathematical models in economics? It should be pointed out that there are many
ways or tools (e.g., graphical methods, verbal discussions, mathematical models) to describe eco-
nomic theory. Mathematics is just one of them. To ensure logical consistency of the theory, it is
not necessary to use mathematics. Chinese medicine is an excellent example of science without
using mathematical modeling. However, mathematics is well-known as the most rigorous logical
language. Any theory, when it can be represented by the mathematical language, will ensure
its logical consistency and coherency, thus indicating that it has achieved a rather sophisticated
level. Indeed, as Karl Marx pointed out, the use of mathematics is an indication of the mature
development of a science.
The use of mathematics in economics has a long history. In his Mathematical Principles
of the Theory of Wealth, Cournot (1838) was among the earliest to use mathematics in economic
analysis. Although the marginal revolution, which provides a cornerstone for modern economics,
was not proposed using mathematics, it was quickly found in the economic profession that the
marginal concepts, such as marginal utility, marginal productivity, and marginal cost, correspond
to the derivative concepts in calculus. Walras (1874), a mathematical economist, heavily used
mathematics to develop his general equilibrium theory. Game theory, which was proposed
by Von Neumann and Morgenstern (1944) and has become a core part of modern microeconomics,
originated as a branch of mathematics.

Why does economics need mathematics? Briefly speaking, mathematics plays a number of
important roles in economics. First, the mathematical language can summarize the essence of
a theory in a very concise manner. For example, macroeconomics studies relationships between
aggregate economic variables (e.g., GDP, consumption, unemployment, inflation, interest rate,
exchange rate, etc.) A very important macroeconomic theory was proposed by Keynes (1936).
The classical Keynesian theory can be summarized by two simple mathematical equations:

National income identity: Y = C + I + G,

Consumption function: C = α + βY,

where Y is income, C is consumption, I is private investment, G is government spending, α is
the "survival level" consumption, and β is the marginal propensity to consume. Substituting the
consumption function into the income identity, rearranging terms, and taking a partial derivative,
we can obtain the multiplier effect of (e.g.) government spending:

∂Y/∂G = 1/(1 − β).

Thus, the Keynesian theory can be e¤ectively summarized by two mathematical equations.
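To make the substitution step explicit (a brief worked derivation using only the two equations above): substituting C = α + βY into Y = C + I + G gives

Y = α + βY + I + G  ⟹  (1 − β)Y = α + I + G  ⟹  Y = (α + I + G)/(1 − β),

and differentiating with respect to G yields ∂Y/∂G = 1/(1 − β).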
Second, complicated logical analysis in economics can be greatly simplified by using math-
ematics. In introductory economics, economic analysis can be done by verbal descriptions or
graphical methods. These methods are very intuitive and easy to grasp. One example is the
partial equilibrium analysis where a market equilibrium can be characterized by the intersection
of the demand curve and the supply curve. However, in many cases, economic analysis cannot
be done easily by verbal language or graphical methods. One example is the general equilib-
rium theory first proposed by Walras (1874). This theory addresses a fundamental problem in
economics, namely whether the market force can achieve an equilibrium for a competitive mar-
ket economy where there exist many markets and mutual interactions between
different markets. Suppose there are n goods, with demand D_i(P) and supply S_i(P) for good i,
where P = (P_1, P_2, ..., P_n)' is the price vector for the n goods. Then the general equilibrium analysis
addresses whether there exists an equilibrium price vector P such that all markets clear
simultaneously:

D_i(P) = S_i(P) for all i ∈ {1, ..., n}.

Conceptually simple, it is rather challenging to give a definite answer because both the demand
and supply functions could be highly nonlinear. Indeed, Walras was unable to establish this
theory formally. It was satisfactorily solved by Arrow and Debreu many years later, when they
used the fixed point theorem in mathematics to prove the existence of an equilibrium price vector.
The power and magic of mathematics was clearly demonstrated in the development of the general

equilibrium theory.
Third, mathematical modeling is a necessary path to empirical verification of an economic
theory. Most economic and financial phenomena are in the form of data (indeed we are in a digital
era!). We need to "digitize" economic theory so as to link the economic theory to data. In
particular, one needs to formulate economic theory into a testable mathematical model whose
functional form or important structural model parameters will be estimated from observed data.

1.4 Empirical Validation


We now turn to discuss the second feature of modern economics: empirical analysis of an
economic theory. Why is empirical analysis of an economic theory important? The use of
mathematics, although it can ensure logical consistency of a theory itself, cannot ensure that
economics is a science. An economic theory would be useless from a practical point of view if
the underlying assumptions are incorrect or unrealistic. This is the case even if the mathematical
treatment is free of errors and elegant. As pointed out earlier, to be a science, an economic theory
must be consistent with reality. That is, it must be able to explain historical stylized facts and
predict future economic phenomena.
How to check a theory or model empirically? Or how to validate an economic theory? In
practice, it is rather difficult or even impossible to check whether the underlying assumptions
of an economic theory or model are correct. Nevertheless, one can confront the implications of
an economic theory with the observed data to check if they are consistent. In the early stage
of economics, empirical validation was often conducted by case studies or indirect verifications.
For example, in his well-known Wealth of Nations, Adam Smith (1776) explained the advan-
tage of specialization using a case study example. Such a method is still useful nowadays, but
is no longer sufficient for modern economic analysis, because economic phenomena are much
more complicated while data may be limited. For rigorous empirical analysis, we need to use
econometrics. Econometrics is the field of economics that concerns itself with the application
of mathematical statistics and the tools of statistical inference to the empirical measurement of
relationships postulated by economic theory. It was founded as a scientific discipline around 1930,
as marked by the founding of the Econometric Society and the creation of the most influential
economics journal, Econometrica, in 1933.
Econometrics has witnessed a rather rapid development in the past several decades, for a
number of reasons. First, there is a need for empirical verification of economic theory, and for
forecasting using economic models. Second, there are more and more high-quality economic data
available. Third, advances in computing technology have made the cost of computation cheaper
and cheaper over time. The speed of computing grows faster than the speed of data accumulation.
Although not explicitly stated in most of the econometric literature, modern econometrics is
essentially built upon the following fundamental axioms:

Any economy can be viewed as a stochastic process governed by some probability law.

Economic phenomena, as often summarized in the form of data, can be viewed as a realiza-
tion of this stochastic data generating process.

tion of this stochastic data generating process.

There is no way to verify these axioms. They are the philosophic views of econometricians
toward an economy. Not every economist or even econometrician agrees with this view. For
example, some economists view an economy as a deterministic chaotic process which can generate
seemingly random numbers. However, most economists and econometricians (e.g., Granger and
Teräsvirta 1993, Lucas 1977) take the view that there is a great deal of uncertainty in an economy, and that it
is best described by stochastic factors rather than deterministic systems. For instance, the
multiplier-accelerator model of Samuelson (1939) is characterized by a deterministic second-
order difference equation for aggregate output. Over a certain range of parameters, this equation
produces deterministic cycles with a constant period of business cycles. Without doubt this
model sheds deep insight into macroeconomic fluctuations. Nevertheless, a stochastic framework
will provide a more realistic basis for analysis of periodicity in economics, because the observed
periods of business cycles never occur evenly in any economy. Frisch (1933) demonstrates that
a structural propagation mechanism can convert uncorrelated stochastic impulses into cyclical
outputs with uneven, stochastic periodicity. Indeed, although not all uncertainties can be well
characterized by probability theory, probability is the best quantitative analytic tool to describe
uncertainties. The probability law of this stochastic economic system, which characterizes the
evolution of the economy, can be viewed as the “law of economic motions.” Accordingly, the
tools and methods of mathematical statistics will provide the operating principles.
One important implication of the fundamental axioms is that one should not hope to de-
termine precise, deterministic economic relationships, as do the models of demand, production,
and aggregate consumption in standard micro- and macro-economic textbooks. No model could
encompass the myriad essentially random aspects of economic life (in statistical terminology, no
precise point forecast is possible). Instead, one can only postulate some stochastic
economic relationships. The purpose of econometrics is to infer the probability law of the eco-
nomic system using observed data. Economic theory usually takes a form of imposing certain
restrictions on the probability law. Thus, one can test economic theory or economic hypotheses
by checking the validity of these restrictions.
It should be emphasized that the role of mathematics is different from the role of econometrics.
The main task of mathematical economics is to express economic theory in the mathematical form
of equations (or models) without regard to measurability or empirical verification of economic
theory. Mathematics can check whether the reasoning process of an economic theory is correct
and sometimes can give surprising results and conclusions. However, it cannot check whether
an economic theory can explain reality. To check whether a theory is consistent with reality,
one needs econometrics. Econometrics is a fundamental methodology in the process of economic

analysis. Like the development of a natural science, the development of economic theory is a
process of refuting the existing theories which cannot explain newly arising empirical stylized facts
and developing new theories which can explain them. Econometrics rather than mathematics
plays a crucial role in this process. There is no absolutely correct and universally applicable
economic theory. Any economic theory can only explain the reality at a certain stage, and therefore,
is a "relative truth" in the sense that it is consistent with historical data available at that time.
An economic theory may not be rejected due to limited data information. It is possible that
more than one economic theory or model coexist simultaneously, because data does not contain
sufficient information to distinguish the true one (if any) from false ones. When new data become
available, a theory that can explain the historical data well may not explain the new data well
and thus will be refuted. In many cases, new econometric methods can lead to new discovery
and call for new development of economic theory.
Econometrics is not simply an application of a general theory of mathematical statistics.
Although mathematical statistics provides many of the operating tools used in econometrics,
econometrics often needs special methods because of the unique nature of economic data, and
the unique nature of economic problems at hand. One example is the generalized method of
moments estimation (Hansen 1982), which was proposed by econometricians aiming to estimate
rational expectations models, which only impose certain conditional moment restrictions charac-
terized by the Euler equation while the conditional distribution of the economic process is unknown
(thus, the classical maximum likelihood estimation cannot be used). The development of unit
root and cointegration (e.g., Engle and Granger 1987, Phillips 1987), which is at the core of modern
time series econometrics, has been mainly motivated by Nelson and Plosser's (1982) empirical
documentation that most macroeconomic time series display unit root behaviors.
necessary to provide an econometric theory for unit root and cointegrated systems because the
standard statistical inference theory is no longer applicable. The emergence of financial econo-
metrics is also due to the fact that financial time series display some unique features such as
persistent volatility clustering, heavy tails, infrequent but large jumps, and serially uncorrelated
but not independent asset returns. Financial applications, such as financial risk management,
hedging and derivatives pricing, often call for modeling of volatilities and the entire conditional
probability distributions of asset returns. The features of financial data and the objectives of
financial applications make the use of standard time series analysis quite limited, and therefore
call for the development of financial econometrics. Labor economics is another example which
shows how labor economics and econometrics have benefited from each other. Labor economics
has advanced quickly over the last few decades because of the availability of high-quality labor data
and rigorous empirical verification of hypotheses and theories in labor economics. On the other
hand, microeconometrics, particularly panel data econometrics, has also advanced quickly due
to the increasing availability of microeconomic data and the need to develop econometric theory

to accommodate the features of microeconomic data (e.g., censoring and endogeneity).
In the first issue of Econometrica, a founder of the Econometric Society, Frisch (1933), nicely
summarizes the objective of the society and the main features of econometrics: "Its main
object shall be to promote studies that aim at a unification of the theoretical-quantitative and the
empirical-quantitative approach to economic problems and that are penetrated by constructive
and rigorous thinking similar to that which has come to dominate the natural sciences.
But there are several aspects of the quantitative approach to economics, and no single one of
these aspects taken by itself, should be confounded with econometrics. Thus, econometrics is by
no means the same as economic statistics. Nor is it identical with what we call general economic
theory, although a considerable portion of this theory has a definitely quantitative character.
Nor should econometrics be taken as synonymous [sic] with the application of mathematics
to economics. Experience has shown that each of these three viewpoints, that of statistics,
economic theory, and mathematics, is a necessary, but not by itself a sufficient, condition for a
real understanding of the quantitative relations in modern economic life. It is the unification of
all three that is powerful. And it is this unification that constitutes econometrics."

1.5 Illustrative Examples


Specifically, econometrics can play the following roles in economics:
Examine how well an economic theory can explain historical economic data (particularly
the important stylized facts);

Test validity of economic theories and economic hypotheses;

Predict the future evolution of the economy.


To appreciate the roles of modern econometrics in economic analysis, we now discuss a number
of illustrative econometric examples in various fields of economics and finance.

Example 1: The Keynes Model, the Multiplier and Policy Recommendation


The simplest Keynes model can be described by the system of equations
Y_t = C_t + I_t + G_t,
C_t = α + βY_t + ε_t,

where Y_t is aggregate income, C_t is private consumption, I_t is private investment, G_t is gov-
ernment spending, and ε_t is a consumption shock. The parameters α and β have appealing
economic interpretations: α is the survival level consumption, and β is the marginal propensity to
consume. The multiplier of income with respect to government spending is

∂Y_t/∂G_t = 1/(1 − β),

which depends on the marginal propensity to consume β.
To assess the effect of fiscal policies on the economy, it is important to know the magnitude
of β. For example, suppose the Chinese government wants to maintain a steady growth rate
(e.g., an annual 8%) for its economy by active fiscal policy. It has to figure out how many
government bonds should be issued each year. Insufficient government spending will jeopardize the
goal of achieving the desired growth rate, but excessive government spending will cause a budget
deficit in the long run. The Chinese government has to balance these conflicting effects, and this
crucially depends on knowledge of the value of β. Economic theory can only suggest a positive
qualitative relationship between income and consumption. It never tells exactly what β should
be for a given economy. It is conceivable that β differs from country to country, because cultural
factors may have an impact on the consumption behavior of an economy. It is also conceivable that
β will depend on the stage of economic development of an economy. Fortunately, econometrics
offers a feasible way to estimate β from observed data. In fact, economic theory does not even
suggest a specific functional form for the consumption function. The linear functional form for
the consumption function is assumed for convenience, not implied by economic theory. Econometrics can
provide a consistent estimation procedure for the unknown consumption function. This is called
the nonparametric method (see, e.g., Hardle 1990, Pagan and Ullah 1999).
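To illustrate how β can be estimated in practice, the following minimal sketch fits the linear consumption function by ordinary least squares on simulated data; the data-generating values (α = 200, β = 0.8) and the use of NumPy are assumptions made only for this illustration.

import numpy as np

# Simulated income and consumption data (illustrative values, not real data)
rng = np.random.default_rng(0)
n = 200
Y = rng.uniform(1000.0, 5000.0, n)            # income
eps = rng.normal(0.0, 50.0, n)                # consumption shock
C = 200.0 + 0.8 * Y + eps                     # true alpha = 200, beta = 0.8

# OLS: regress C on a constant and Y
X = np.column_stack([np.ones(n), Y])
coef, _, _, _ = np.linalg.lstsq(X, C, rcond=None)
alpha_hat, beta_hat = coef
print("estimated alpha (survival level consumption):", round(float(alpha_hat), 2))
print("estimated beta (marginal propensity to consume):", round(float(beta_hat), 3))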

Example 2: Rational Expectations and Dynamic Asset Pricing Models


Suppose a representative agent has a constant relative risk aversion utility

U = Σ_{t=0}^{n} β^t u(C_t) = Σ_{t=0}^{n} β^t C_t^{1−γ}/(1 − γ),

where β > 0 is the agent's time discount factor, γ ≥ 0 is the risk aversion parameter, u(·) is
the agent's utility function in each time period, and C_t is consumption during period t. Let
the information available to the agent at time t be represented by the σ-algebra I_t, in the
sense that any variable whose value is known at time t is presumed to be I_t-measurable, and let
R_t = P_t/P_{t−1} be the gross return to an asset acquired at time t − 1 at a price of P_{t−1}. The agent's
optimization problem is to choose a sequence of consumptions {C_t} over time to

max_{C_t} E(U)

subject to the intertemporal budget constraint

C_t + P_t q_t ≤ W_t + P_t q_{t−1},

where q_t is the quantity of the asset purchased at time t and W_t is the agent's period t income.

Define the marginal rate of intertemporal substitution

MRS_{t+1}(θ) = [∂u(C_{t+1})/∂C_{t+1}] / [∂u(C_t)/∂C_t] = (C_{t+1}/C_t)^{−γ},

where the model parameter vector θ = (β, γ)'. Then the first order condition of the agent's optimiza-
tion problem can be characterized by

E[β MRS_{t+1}(θ) R_{t+1} | I_t] = 1.

That is, the marginal rate of intertemporal substitution discounts gross returns to unity. This
FOC is usually called the Euler equation of the economic system (see Hansen and Singleton 1982
for more discussion).
How can we estimate this model? How can we test the validity of a rational expectations model? Here, the
popular maximum likelihood estimation method cannot be used, because one does not
know the conditional distribution of the economic variables of interest. Nevertheless, econometricians
have developed a consistent estimation method based on the conditional moment condition or
the Euler equation, which does not require knowledge of the conditional distribution of the data
generating process. This method is called the generalized method of moments (see Hansen 1982).
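To convey the idea in its simplest form (a simulated sketch, not the estimator developed in Chapter 8): the Euler equation implies unconditional moment conditions E[(β MRS_{t+1}(θ)R_{t+1} − 1)Z_t] = 0 for any instrument Z_t measurable with respect to I_t, and GMM chooses θ to make the corresponding sample moments as close to zero as possible. The simulated series, the choice of instruments, and the identity weighting matrix below are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

# Illustrative consumption growth and gross return series (simulated, not real data)
rng = np.random.default_rng(1)
T = 500
cg = np.exp(rng.normal(0.02, 0.02, T))        # consumption growth C_{t+1}/C_t
R = np.exp(rng.normal(0.05, 0.10, T))         # gross asset return R_{t+1}

# Instruments Z_t known at time t: a constant and the lagged return
Z = np.column_stack([np.ones(T - 1), R[:-1]])
cg1, R1 = cg[1:], R[1:]

def gmm_objective(theta):
    beta, gamma = theta
    # Euler-equation error: beta * (C_{t+1}/C_t)^(-gamma) * R_{t+1} - 1
    u = beta * cg1 ** (-gamma) * R1 - 1.0
    gbar = Z.T @ u / len(u)                   # sample moment conditions
    return gbar @ gbar                        # identity-weighted quadratic form

res = minimize(gmm_objective, x0=[0.95, 2.0], method="Nelder-Mead")
print("GMM estimate of (beta, gamma):", res.x)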
In the empirical literature, it has been documented that the empirical estimates of the risk aversion
parameter γ are often too small to justify the substantial difference between the observed returns
on stock markets and bond markets (e.g., Mehra and Prescott 1985). This is the well-known
equity premium puzzle. To resolve this puzzle, effort has been devoted to the development of new
economic models with time-varying, large risk aversion. An example is Campbell and Cochrane's
(1999) consumption-based capital asset pricing model. This story confirms our earlier statement
that econometric analysis calls for new economic theory after documenting the inadequacy of the
existing model.
Example 3: The Production Function and the Hypothesis on Constant Return
to Scale
Suppose that for some industry, there are two inputs, labor L_i and capital stock K_i, and one
output Y_i, where i is the index for firm i. The production function of firm i is a mapping from
inputs (L_i, K_i) to output Y_i:

Y_i = exp(ε_i) F(L_i, K_i),

where ε_i is a stochastic factor (e.g., the uncertain weather condition if Y_i is an agricultural prod-
uct). An important economic hypothesis is that the production technology displays a constant
return to scale (CRS), which is defined as follows:

F(λL_i, λK_i) = λF(L_i, K_i) for all λ > 0.

CRS is a necessary condition for the existence of a long-run equilibrium of a competitive market
economy. If CRS does not hold for some industry, and the technology displays the increasing
return to scale (IRS), the industry will lead to a natural monopoly. Government regulation is then
necessary to protect consumers' welfare. Therefore, testing CRS versus IRS has an important policy
implication, namely whether regulation is necessary.
A conventional approach to testing CRS is to assume that the production function is a Cobb-
Douglas function:

F(L_i, K_i) = A L_i^α K_i^β.

Then CRS becomes a mathematical restriction on the parameters (α, β):

H_0: α + β = 1.

If α + β > 1, the production technology displays IRS.


In statistics, a popular procedure to test a one-dimensional parameter restriction is Student's t-
test. Unfortunately, this procedure is not suitable for many cross-sectional economic data, which
usually display conditional heteroskedasticity (e.g., a larger firm has a larger output variation).
One needs to use a robust, heteroskedasticity-consistent test procedure, originally proposed in
White (1980).
It should be emphasized that CRS is equivalent to the statistical hypothesis H_0: α + β = 1
under the assumption that the production technology is a Cobb-Douglas function. This addi-
tional condition is not part of the CRS hypothesis and is called an auxiliary assumption. If the
auxiliary assumption is incorrect, the statistical hypothesis H_0: α + β = 1 will not be equivalent
to CRS. Correct model specification is essential here for a valid conclusion and interpretation of
the econometric inference.
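A minimal sketch of such a robust procedure (simulated data; the variable names, parameter values, and the White-type covariance formula are assumptions made only for this illustration): estimate the log Cobb-Douglas regression by OLS and form a heteroskedasticity-robust t-statistic for H_0: α + β = 1.

import numpy as np

# Simulated firm-level data with error variance increasing in firm size
rng = np.random.default_rng(2)
n = 500
L = rng.lognormal(4.0, 0.5, n)
K = rng.lognormal(5.0, 0.5, n)
eps = rng.normal(0.0, 0.2, n) * (1.0 + 0.5 * np.log(L))   # conditional heteroskedasticity
lnY = 1.0 + 0.6 * np.log(L) + 0.4 * np.log(K) + eps       # true alpha + beta = 1

# OLS of ln Y on (1, ln L, ln K)
X = np.column_stack([np.ones(n), np.log(L), np.log(K)])
b = np.linalg.solve(X.T @ X, X.T @ lnY)
u = lnY - X @ b

# White (1980) heteroskedasticity-consistent covariance estimator
XtX_inv = np.linalg.inv(X.T @ X)
V = XtX_inv @ (X.T @ (X * (u ** 2)[:, None])) @ XtX_inv

# Robust t-statistic for H0: alpha + beta = 1, i.e. r'b = 1 with r = (0, 1, 1)'
r = np.array([0.0, 1.0, 1.0])
t_stat = (r @ b - 1.0) / np.sqrt(r @ V @ r)
print("alpha_hat + beta_hat =", r @ b, ", robust t-statistic =", t_stat)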

Example 4: Effect of Economic Reforms on a Transitional Economy


We now consider an extended Cobb-Douglas production function (after taking a logarithmic
transformation)

ln Y_it = ln A_it + α ln L_it + β ln K_it + γ Bonus_it + δ Contract_it + ε_it,

where i is the index for firm i ∈ {1, ..., N}, t is the index for year t ∈ {1, ..., T}, Bonus_it is
the proportion of bonus out of the total wage bill, and Contract_it is the proportion of workers who
have signed a fixed-term contract. This is an example of the so-called panel data model (see,
e.g., Hsiao 2003).
Paying bonuses and signing fixed-term contracts were two innovative incentive reforms in the
Chinese state-owned enterprises in the 1980s, compared to the fixed wage and lifetime employ-
ment systems in the pre-reform era. Economic theory predicts that the introduction of the bonus
and contract systems provides stronger incentives for workers to work harder, thus increasing

the productivity of a firm (see Groves, Hong, McMillan and Naughton 1994).
To examine the effects of these incentive reforms, we consider the null statistical hypothesis

H_0: γ = δ = 0.

It appears that conventional t-tests or F-tests would serve our purpose here, if we can assume
conditional homoskedasticity. Unfortunately, they cannot be used because there may well exist
reverse causation from Y_it to Bonus_it: a productive firm may pay its workers higher
bonuses regardless of their efforts. This will cause correlation between the bonuses and the error
term ε_it, rendering the OLS estimator inconsistent and invalidating the conventional t-tests or
F-tests. Fortunately, econometricians have developed an important estimation procedure called
Instrumental Variables estimation, which can effectively filter out the impact of the causation
from output to bonus and obtain a consistent estimator for the bonus parameter. Related
hypothesis test procedures can be used to check whether the bonus and contract reforms can increase
firm productivity.
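The logic of the instrumental variables remedy can be sketched in a few lines (a simulated cross-sectional illustration; the instrument, variable names, and parameter values are assumptions, and the panel structure is ignored for brevity):

import numpy as np

# Output depends on bonus, but bonus also responds to output, so OLS is inconsistent;
# z is an instrument correlated with bonus but uncorrelated with the structural error.
rng = np.random.default_rng(3)
n = 1000
z = rng.normal(size=n)                               # instrument
e = rng.normal(size=n)                               # structural error
bonus = 0.8 * z + 0.5 * e + rng.normal(size=n)       # endogenous regressor
y = 1.0 + 0.3 * bonus + e                            # true bonus effect = 0.3

X = np.column_stack([np.ones(n), bonus])
Z = np.column_stack([np.ones(n), z])

# OLS versus simple IV / 2SLS (exactly identified case): b_IV = (Z'X)^{-1} Z'y
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
print("OLS bonus coefficient:", b_ols[1])            # biased upward by reverse causation
print("IV  bonus coefficient:", b_iv[1])             # close to the true value 0.3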
In evaluating the effect of economic reforms, we have turned an economic hypothesis, that
introducing the bonus and contract systems has no effect on productivity, into a statistical hy-
pothesis H_0: γ = δ = 0. When the hypothesis H_0: γ = δ = 0 is not rejected, we should
not conclude that the reforms have no effect. This is because the extended production function
model, where the reforms are specified additively, is only one of many ways to check the effect
of the reforms. For example, one could also specify the model such that the reforms affect the
marginal productivities of labor and capital (i.e., the coefficients of labor and capital). Thus,
when the hypothesis H_0: γ = δ = 0 is not rejected, we can only say that we do not find evidence
against the economic hypothesis that the reforms have no effect. We should not conclude that
the reforms have no effect.

Example 5: The Efficient Market Hypothesis and Predictability of Financial Returns
Let Y_t be the stock return in period t, and let I_{t−1} = {Y_{t−1}, Y_{t−2}, ...} be the information set
containing the history of past stock returns. The weak form of the efficient market hypothesis (EMH)
states that it is impossible to predict future stock returns using the history of past stock returns:

E(Y_t | I_{t−1}) = E(Y_t).

The LHS, the so-called conditional mean of Y_t given I_{t−1}, is the expected return that can be
obtained when one fully uses the information available at time t − 1. The RHS, the uncondi-
tional mean of Y_t, is the expected market average return in the long run; it is the expected return
of a buy-and-hold trading strategy. When EMH holds, the past information on stock returns has
no predictive power for future stock returns. An important implication of EMH is that mutual

fund managers will have no informational advantage over layman investors.
One simple way to test EMH is to consider the following autoregression
Y_t = α_0 + Σ_{j=1}^{p} α_j Y_{t−j} + ε_t,

where p is a pre-selected number of lags, and ε_t is a random disturbance. EMH implies

H_0: α_1 = α_2 = ... = α_p = 0.

Any nonzero coefficient α_j, 1 ≤ j ≤ p, is evidence against EMH. Thus, to test EMH, one can test
whether the α_j are jointly zero. The classical F-test in a linear regression model can be used to
test the hypothesis H_0 when var(ε_t | I_{t−1}) = σ², i.e., when there exists conditional homoskedastic-
ity. However, EMH may coexist with volatility clustering (i.e., var(ε_t | I_{t−1}) may be time-varying),
which is one of the most important empirical stylized facts of financial markets (see Chen and
Hong (2003) for more discussion). This implies that the standard F-test statistic cannot be
used here, even asymptotically. Similarly, the popular Box and Pierce's (1970) portmanteau Q
test, which is based on the sum of the first p squared sample autocorrelations, also cannot be
used, because its asymptotic χ² distribution is invalid in the presence of autoregressive conditional
heteroskedasticity. One has to use procedures that are robust to conditional heteroskedasticity.
As in the discussion in Example 4 above, when one rejects the null hypothesis H_0 that the α_j are
jointly zero, we have evidence against EMH. Furthermore, the linear AR(p) model then has predictive
ability for asset returns. However, when one fails to reject the hypothesis H_0 that the α_j are
jointly zero, one can only conclude that we do not find evidence against EMH. One cannot
conclude that EMH holds. The reason is, again, that the linear AR(p) model is only one of many
possibilities to check EMH (see, e.g., Hong and Lee 2005, for more discussion).
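A sketch of one such robust procedure (simulated returns that satisfy EMH but display ARCH-type volatility clustering; the lag length p = 5, the White-type covariance estimator, and the chi-squared critical value are illustrative assumptions):

import numpy as np
from scipy import stats

# Simulate returns with no serial correlation in the mean but with volatility clustering
rng = np.random.default_rng(4)
T, p = 2000, 5
Y = np.zeros(T)
for t in range(1, T):
    sigma2 = 0.1 + 0.5 * Y[t - 1] ** 2
    Y[t] = np.sqrt(sigma2) * rng.normal()

# Regress Y_t on a constant and p lags
X = np.column_stack([np.ones(T - p)] + [Y[p - j:T - j] for j in range(1, p + 1)])
y = Y[p:]
b = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ b

# Heteroskedasticity-robust Wald test of H0: alpha_1 = ... = alpha_p = 0
XtX_inv = np.linalg.inv(X.T @ X)
V = XtX_inv @ (X.T @ (X * (u ** 2)[:, None])) @ XtX_inv
R = np.eye(p + 1)[1:]                                # selects the p lag coefficients
W = (R @ b) @ np.linalg.solve(R @ V @ R.T, R @ b)
print("robust Wald statistic:", W, ", 5% chi-squared critical value:", stats.chi2.ppf(0.95, p))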

Example 6: Volatility Clustering and ARCH Models


Since the 1970s, the oil crises, the floating foreign exchange system, and the high interest rate
policy in the U.S. have generated a great deal of uncertainty in the world economy. Economic agents
have to incorporate these uncertainties into their decision-making. How to measure uncertainty has
become an important issue.
In economics, volatility is a key instrument for measuring uncertainty and risk in finance. This
concept is important for investigating information flows and volatility spillovers, financial contagion
between financial markets, option pricing, and the calculation of Value at Risk.
Volatility can be measured by the conditional variance of the asset return Y_t given the information
available at time t − 1:

σ_t² ≡ var(Y_t | I_{t−1}) = E[(Y_t − E(Y_t | I_{t−1}))² | I_{t−1}].

An example of the conditional variance is the AutoRegressive Conditional Heteroskedasticity
(ARCH) model, originally proposed by Engle (1982). An ARCH(q) model assumes that
Y_t = μ_t + ε_t,
ε_t = σ_t z_t,
μ_t = E(Y_t | I_{t−1}),
σ_t² = α_0 + Σ_{j=1}^{q} α_j ε_{t−j}²,   α_0 > 0, α_j ≥ 0,
{z_t} ~ i.i.d.(0, 1).

This model can explain a well-known stylized fact in financial markets, volatility clustering: a
high volatility tends to be followed by another high volatility, and a small volatility tends to
be followed by another small volatility. It can also explain the non-Gaussian heavy tails of asset
returns. More sophisticated volatility models, such as Bollerslev's (1986) Generalized ARCH or
GARCH model, have been developed in time series econometrics.
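Both features can be seen in a few lines of simulation (an illustrative ARCH(1) with α_0 = 0.1 and α_1 = 0.7; the parameter values and the code below are assumptions made only for this sketch):

import numpy as np

# Simulate an ARCH(1) process: eps_t = sigma_t * z_t, sigma_t^2 = 0.1 + 0.7 * eps_{t-1}^2
rng = np.random.default_rng(5)
T = 5000
eps = np.zeros(T)
sigma2 = np.zeros(T)
sigma2[0] = 0.1 / (1 - 0.7)                  # unconditional variance as starting value
for t in range(1, T):
    sigma2[t] = 0.1 + 0.7 * eps[t - 1] ** 2
    eps[t] = np.sqrt(sigma2[t]) * rng.normal()

def acf1(x):
    x = x - x.mean()
    return (x[1:] @ x[:-1]) / (x @ x)

# Innovations are serially uncorrelated, but their squares are not (volatility clustering),
# and the unconditional distribution has heavier tails than the normal (kurtosis > 3).
print("lag-1 autocorrelation of eps:  ", acf1(eps))
print("lag-1 autocorrelation of eps^2:", acf1(eps ** 2))
print("sample kurtosis of eps:        ", ((eps - eps.mean()) ** 4).mean() / eps.var() ** 2)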
In practice, an important issue is how to estimate a volatility model. Here, the models for
the conditional mean μ_t and the conditional variance σ_t² are assumed to be correctly specified,
but the conditional distribution of Y_t is unknown, because the distribution of the standardized
innovation {z_t} is unknown. Thus, the popular maximum likelihood estimation (MLE) method
cannot be used. Nevertheless, one can assume that {z_t} is i.i.d. N(0,1) or follows some other plausible
distribution. Under this assumption, we can obtain a conditional distribution of Y_t given I_{t−1}
and estimate the model parameters using the MLE procedure. Although {z_t} is not necessarily
i.i.d. N(0,1), and we know this, the estimator obtained this way is still consistent for the true
model parameters. However, the asymptotic variance of this estimator is larger than that of
the MLE (i.e., when the true distribution of {z_t} is known), due to the effect of not knowing
the true distribution of {z_t}. This method is called the quasi-MLE, or QMLE (see, e.g., White
1994). Inference procedures based on the QMLE are different from those based on the MLE.
For example, the popular likelihood ratio test cannot be used. The difference comes from the
fact that the asymptotic variance of the QMLE is different from that of the MLE, just like the
fact that the asymptotic variance of the OLS estimator under conditional heteroskedasticity is
different from that of the OLS estimator under conditional homoskedasticity. Incorrect calculation of the
asymptotic variance estimator for the QMLE will lead to misleading inference and conclusions
(see White 1982, 1994 for more discussion).
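A minimal sketch of the Gaussian QMLE idea (simulated ARCH(1) data with Student-t innovations; all parameter values, the initial variance, and the optimizer below are assumptions made only for this illustration):

import numpy as np
from scipy.optimize import minimize

# Simulate ARCH(1) data whose standardized innovations are NOT normal (Student-t),
# then estimate (alpha_0, alpha_1) by maximizing the Gaussian likelihood anyway.
rng = np.random.default_rng(7)
T = 3000
z = rng.standard_t(df=8, size=T) / np.sqrt(8.0 / 6.0)   # unit-variance t innovations
y = np.zeros(T)
for t in range(1, T):
    y[t] = np.sqrt(0.1 + 0.6 * y[t - 1] ** 2) * z[t]

def neg_quasi_loglik(theta):
    a0, a1 = theta
    if a0 <= 0 or a1 < 0:
        return np.inf
    s2 = np.empty(T)
    s2[0] = y.var()                                      # initial variance (an assumption)
    for t in range(1, T):
        s2[t] = a0 + a1 * y[t - 1] ** 2
    return 0.5 * np.sum(np.log(s2) + y ** 2 / s2)        # negated Gaussian (quasi-)log-likelihood

res = minimize(neg_quasi_loglik, x0=[0.05, 0.3], method="Nelder-Mead")
print("Gaussian QMLE of (alpha_0, alpha_1):", res.x)     # close to the true values (0.1, 0.6)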

Example 7: Modeling Economic Durations


Suppose we are interested in the time it takes for an unemployed person to find a job, the
time that elapses between two trades or two price changes, the length of a strike, the length
before a cancer patient dies, or the length before a financial crisis (e.g., credit default risk)
materializes. Such analysis is called duration analysis.
In practice, the main interest often lies in the question of how long a duration will continue,

given that it has not finished yet. The so-called hazard rate measures the chance that the
duration will end now, given that it has not ended before. This hazard rate therefore can be
interpreted as the chance to find a job, to trade, to end a strike, etc.
Suppose T_i is the duration from a population with the probability density function f(t) and
probability distribution function F(t). Then the survival function is

S(t) = P(T_i > t) = 1 − F(t),

and the hazard rate is

λ(t) = lim_{δ→0+} P(t < T_i ≤ t + δ | T_i > t)/δ = f(t)/S(t).

Intuitively, the hazard rate λ(t) is the instantaneous probability that an event of interest will
end at time t given that it has lasted for period t. Note that the specification of λ(t) is equivalent
to a specification of the probability density f(t). But λ(t) is more interpretable from an economic
point of view.
The hazard rate may not be the same for all individuals. To control for heterogeneity across
individuals, we assume that the individual-specific hazard rate depends on some individual char-
acteristics X_i via the form

λ_i(t) = exp(X_i'β) λ_0(t),

where λ_0(t) is a baseline hazard function. This is called the proportional hazard model, originally proposed by Cox (1972). The parameter

β = ∂ ln λ_i(t)/∂X_i = [1/λ_i(t)] ∂λ_i(t)/∂X_i

can be interpreted as the marginal relative effect of X_i on the hazard rate of individual i. Inference
on β will allow one to examine how individual characteristics affect the duration of interest. For
example, suppose T_i is the unemployment duration for individual i; then inference on β will
allow us to examine how individual characteristics, such as age, education, and gender,
can affect the unemployment duration. This will provide important policy implications for labor
markets.
Because one can obtain the conditional probability density function of T_i given X_i,

f_i(t) = λ_i(t) S_i(t),

where the survival function S_i(t) = exp[−∫_0^t λ_i(s) ds], we can estimate β by the maximum likeli-
hood estimation method.
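For instance, with a constant baseline hazard λ_0(t) = λ_0 (an exponential-duration simplification assumed only for this sketch), the density becomes f_i(t) = λ_0 exp(X_i'β) exp[−λ_0 exp(X_i'β) t], and β can be estimated by maximizing the corresponding log-likelihood; the simulated data and parameter values below are illustrative.

import numpy as np
from scipy.optimize import minimize

# Simulated durations from an exponential proportional hazard model:
# lambda_i = lambda0 * exp(x_i * beta), with lambda0 = 0.2 and beta = 0.5.
rng = np.random.default_rng(6)
n = 1000
x = rng.normal(size=n)                        # individual characteristic
lam = 0.2 * np.exp(0.5 * x)                   # individual-specific hazard rate
T = rng.exponential(1.0 / lam)                # observed durations (no censoring)

def neg_loglik(theta):
    lam0, beta = theta
    if lam0 <= 0:
        return np.inf
    lam_i = lam0 * np.exp(beta * x)
    # log f_i(T_i) = log(lam_i) - lam_i * T_i for the exponential hazard model
    return -np.sum(np.log(lam_i) - lam_i * T)

res = minimize(neg_loglik, x0=[0.1, 0.0], method="Nelder-Mead")
print("ML estimates of (lambda0, beta):", res.x)   # close to (0.2, 0.5)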
For an excellent survey on duration analysis in labor economics, see Kiefer (1988), and for
a complete and detailed account, see Lancaster (1990). Duration analysis has also been widely
used in credit risk modeling in the recent financial literature.

The above examples, although not exhaustive, illustrate how econometric models and tools
can be used in economic analysis. As noted earlier, an economy can be completely characterized
by the probability law governing the economy. In practice, which attributes (e.g., conditional
moments) of the probability law should be used depends on the nature of the economic problem
at hand. In other words, different economic problems will require modeling different attributes of
the probability law and thus require different econometric models and methods. In particular, it
is not necessary to specify a model for the entire conditional distribution function for all economic
applications. This can be seen clearly from the above examples.

1.6 Limitations of Econometric Analysis


Although the general methodology of economic research is very similar to that of natural
science, in general, economics and finance have not reached the mature stage that natural science
(e.g., physics) has achieved. In particular, prediction in economics and finance is not as
precise as in natural science (see, e.g., Granger 2001, for an assessment of macroeconomic forecasting
practice).
Why?
Like any other statistical analysis, econometrics is the analysis of the “average behavior” of
a large number of realizations, or the outcomes of a large number of random experiments with
the same or similar features. However, economic data are not produced by a large number of
repeated random experiments, due to the fact that an economy is not a controlled experiment.
Most economic data are nonexperimental in their nature. This imposes some limitations on
econometric analysis.
First, as a simplification of reality, an economic theory or model can only capture the main or
most important factors, but the observed data are the joint outcome of many factors together, and
some of them are unknown and unaccounted for. These unknown factors are present but their
influences are ignored in economic modeling. This is unlike natural science, where one can remove
secondary factors via controlled experiments. In the realm of economics, we are only passive
observers; most data collected in economics are nonexperimental in that the data collecting
agency may not have direct control over the data. The recently emerging field of experimental
economics can help to some extent, because it studies the behavior of economic agents under controlled
experiments (see, e.g., Samuelson 2005). In other words, experimental economics controls the
data generating process so that data are produced by the factors under study. Nevertheless, the
scope of experimental economics is limited. One can hardly imagine how an economy with 1.3
billion people could be subjected to a controlled experiment. For example, can we repeat the economic reforms in
China and the former Eastern European socialist countries?
Second, an economy is an irreversible or non-repeatable system. A consequence of this is
that data observed are a single realization of economic variables. For example, we consider the

annual Chinese GDP growth rate {Y_t} over the past several years:

Year  1997  1998  1999  2000  2001  2002  2003  2004  2005
Y_t   9.3%  7.8%  7.6%  8.4%  8.3%  9.1%  10.0% 10.1% 9.9%

GDP growth rates in different years should be viewed as different random variables, and each variable
Y_t has only one realization! There is no way to conduct statistical analysis if one random
variable only has a single realization. As noted earlier, statistical analysis studies the “average”
behavior of a large number of realizations from the same data generating process. To conduct
statistical analysis of economic data, economists and econometricians often assume some time-
invariant "common features" of an economic system so as to use time series data or cross-
sectional data on different economic variables. These common features are usually termed
"stationarity" or "homogeneity" of the economic system. With these assumptions, one can
consider that the observed data are generated from the same population or from populations with
similar characteristics. Economists and econometricians assume that the conditions needed to employ
the tools of statistical inference hold, but this is rather difficult, if not impossible, to check in
practice.
Third, economic relationships often change over time. Regime shifts and structural changes
are the rule rather than the exception, due to technology shocks and changes in preferences,
population structure and institutional arrangements. An unstable economic relationship makes
out-of-sample forecasting and policy-making difficult. With a structural break, an economic
model that performed well in the past may not forecast well in the future. Over the past several
decades, econometricians have made some progress in coping with the time-varying features of
an economic system. Chow's (1960) test, for example, can be used to check whether structural
breaks exist. Engle's (1982) volatility model can be used to forecast time-varying volatility
using historical asset returns. Nevertheless, the time-varying feature of an economic system
always poses a challenge for economic forecasting. This is quite different from the natural
sciences, where structures and relationships are more or less stable over time.
Fourth, data quality matters. The success of any econometric study hinges on the quantity as
well as the quality of data. However, economic data may be subject to various defects. The data
may be badly measured or may correspond only vaguely to the economic variables defined in
the model. Some of the economic variables may be inherently unmeasurable, and some relevant
variables may be missing from the model. Moreover, sample selection bias will also cause
problems. In China, there may have been a tendency to over-report or over-estimate GDP
growth rates given the existing institutional promotion mechanism for local government officials.
Of course, advances in computer technology and the development of statistical sampling theory
and practice can help improve the quality of economic data. For example, the use of scanning
machines makes data on every transaction available.

The above features of economic data and economic systems together unavoidably impose
some limitations on econometrics, preventing it from achieving the same mature stage as the
natural sciences.

1.7 Conclusion
In this chapter, we have discussed the philosophy and methodology of econometrics in economic
research, and the differences between econometrics on the one hand and mathematical economics
and mathematical statistics on the other. We first discussed the two most important features of
modern economics, namely mathematical modeling and empirical analysis, which reflect the
efforts of several generations of economists to make economics a science. As the methodology
for empirical analysis in economics, econometrics is an interdisciplinary field. It uses insights
from economic theory, uses statistics to develop methods, and uses computers to estimate
models. We then discussed the roles of econometrics and its differences from mathematics, via
a variety of illustrative examples in economics and finance. Finally, we pointed out some
limitations of econometric analysis, which arise because an economy is not a controlled
experiment. It should be emphasized that these limitations are not only limitations of
econometrics, but of economics as a whole.

EXERCISES

1.1. Discuss the differences between the roles of mathematics and econometrics in economic research.

1.2. What are the fundamental axioms of econometrics? Discuss their roles and implications.

1.3. What are the limitations of econometric analysis? Discuss possible ways to alleviate the
impact of these limitations.

1.4. How do you perceive the roles of econometrics in decision-making in economics and business?

CHAPTER 2 GENERAL REGRESSION
ANALYSIS
Abstract: This chapter introduces regression analysis, the most popular statistical tool to
explore the dependence of one variable (say Y) on others (say X). The variable Y is called the
dependent variable, and X is called the independent variable or explanatory variable. The
regression relationship between X and Y can be used to study the effect of X on Y or to predict
Y using X. We motivate the importance of the regression function from both the economic and
statistical perspectives, and characterize the condition for correct specification of a linear model
for the regression function, which is shown to be crucial for a valid economic interpretation of
model parameters.

Key words: Conditional distribution, Conditional mean, Consumption function, Linear
regression model, Marginal propensity to consume, Model specification, Regression

2.1 Conditional Probability Distribution


Notational Convention: Throughout this book, capital letters (e.g., Y) denote random
variables or random vectors, and lower case letters (e.g., y) denote realizations of random variables.

We assume that Z = (Y, X')' is a random vector with E(Y²) < ∞, where Y is a scalar, X
is a (k+1) × 1 vector of economic variables with its first component being a constant, and X'
denotes the transpose of X. Given this assumption, the conditional mean E(Y|X) exists and is
well-defined.

Statistically speaking, the relationship between two random variables or vectors X (e.g., oil
price change) and Y (e.g., economic growth) can be characterized by their joint distribution
function. Suppose (X', Y)' is a continuous random vector, and the joint probability density
function (pdf) of (X', Y)' is f(x, y). Then the marginal pdf of X is

    f_X(x) = ∫_{-∞}^{∞} f(x, y) dy,

and the conditional pdf of Y given X = x is

    f_{Y|X}(y|x) = f(x, y) / f_X(x),

provided f_X(x) > 0. The conditional pdf f_{Y|X}(y|x) completely describes how Y depends on X.
In other words, it characterizes a predictive relationship of Y using X. With this conditional pdf
f_{Y|X}(y|x), we can compute the following quantities:

The conditional mean

    E(Y|x) ≡ E(Y|X = x) = ∫_{-∞}^{∞} y f_{Y|X}(y|x) dy;

the conditional variance

    var(Y|x) ≡ var(Y|X = x) = ∫_{-∞}^{∞} [y − E(Y|x)]² f_{Y|X}(y|x) dy = E(Y²|x) − [E(Y|x)]²;

the conditional skewness

    S(Y|x) = E{[Y − E(Y|x)]³ | x} / [var(Y|x)]^{3/2};

the conditional kurtosis

    K(Y|x) = E{[Y − E(Y|x)]⁴ | x} / [var(Y|x)]²;

and the α-conditional quantile Q(x, α):

    P[Y ≤ Q(X, α) | X = x] = α ∈ (0, 1).

Note that when α = 0.5, Q(x, 0.5) is the conditional median, which is the cutoff point or
threshold that divides the conditional population into two equal halves, conditional on X = x.

The class of conditional moments provides a summary characterization of the conditional
distribution f_{Y|X}(y|x). A mathematical model (i.e., an assumed functional form with a finite
number of unknown parameters) for a conditional moment is called an econometric model for
that conditional moment.
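
As a computational illustration of these definitions, the following Python sketch (an illustrative
addition, not part of the original derivations) evaluates the conditional mean, variance, skewness
and kurtosis by numerical integration for an assumed conditional pdf, here f_{Y|X}(y|x) = e^{−(y−x)}
for y > x, the conditional density that appears in Example 3 below. The function names and the
evaluation point x = 2 are arbitrary choices.

    # A minimal sketch (not from the lecture notes): conditional moments by
    # numerical integration for the assumed conditional pdf f(y|x) = exp(-(y-x)), y > x.
    import numpy as np
    from scipy.integrate import quad

    def cond_pdf(y, x):
        return np.exp(-(y - x)) if y > x else 0.0

    def cond_moment(x, power, center=0.0):
        # E[(Y - center)^power | X = x] by numerical integration over y in (x, infinity)
        integrand = lambda y: (y - center) ** power * cond_pdf(y, x)
        value, _ = quad(integrand, x, np.inf)
        return value

    x = 2.0
    mean = cond_moment(x, 1)                      # E(Y|x)
    var = cond_moment(x, 2, center=mean)          # var(Y|x)
    skew = cond_moment(x, 3, center=mean) / var ** 1.5
    kurt = cond_moment(x, 4, center=mean) / var ** 2
    print(mean, var, skew, kurt)                  # approximately 3.0, 1.0, 2.0, 9.0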

Question: Which moment should one model and use in practice?

It depends on the economic application. For some applications, we only need to model the first
conditional moment, namely the conditional mean. For example, asset pricing aims at explaining
excess asset returns by systematic risk factors. An asset pricing model is essentially a model for
the conditional mean of asset returns on risk factors. For others, we may have to model higher
order conditional moments and even the entire conditional distribution. In econometric practice,
the most popular models are models for the first two conditional moments, namely the conditional
mean and the conditional variance. There is no need to model the entire conditional distribution
of Y given X when only certain conditional moments are needed. For example, when the
conditional mean is of concern, there is no need to model the conditional variance or impose
restrictive conditions on it.

The conditional moments, and more generally the conditional probability distribution of Y
given X, do not represent a causal relationship from X to Y. They represent a predictive
relationship. That is, one can use the information on X to predict the distribution of Y or its
attributes. These probability concepts cannot tell whether a change in Y is caused by a change
in X. Such a causal interpretation has to rely on economic theory. Economic theory usually
hypothesizes that a change in Y is caused by a change in X, i.e., that there exists a causal
relationship from X to Y. If such an economic causal relationship exists, we will find a predictive
relationship from X to Y. On the other hand, a documented predictive relationship from X to Y
need not be generated by an economic causal relationship from X to Y. For example, it is
possible that X and Y are positively correlated due to their dependence on a common factor.
As a result, we will find a predictive relationship from X to Y, although they do not have any
causal relationship. In fact, it is well-known in econometrics that economic variables that trend
consistently upwards over time are highly correlated even in the absence of any causal
relationship between them. Such strong correlations are called spurious relationships.

2.2 Regression Analysis


We now focus on the first conditional moment, E(Y|X), which is called the regression function
of Y on X, where Y is called the regressand and X is called the regressor vector. The term
"regression" is used to signify a predictive relationship between Y and X.

Definition 2.1 [Regression Function]: The conditional mean E(Y|X) is called the regression
function of Y on X.

Many economic theories can be characterized by the conditional mean E(Y|X) of Y given X,
provided X and Y are suitably defined. Most, though not all, dynamic economic theories and/or
dynamic optimization models, such as rational expectations, the efficient markets hypothesis,
the expectations hypothesis, and optimal dynamic asset pricing, have important implications on
(and only on) the conditional mean of the underlying economic variables given the information
available to economic agents (e.g., Cochrane 2001, Sargent and Ljungqvist 2002). For example,
the classical efficient market hypothesis states that the expected asset return given the available
information is zero, or at most is constant over time; the optimal dynamic asset pricing theory
implies that the expectation of the pricing error given the available information is zero for each
asset (Cochrane 2001). Although economic theory may suggest a nonlinear relationship, it does
not give a completely specified functional form for the conditional mean of economic variables.
It is therefore important to model the conditional mean properly.

Before modeling E(Y|X), we first discuss some probabilistic properties of E(Y|X).

Lemma 2.1: E[E(Y|X)] = E(Y).

Proof: The result follows immediately from applying the law of iterated expectations below.

Lemma 2.2 [Law of Iterated Expectations (LIE)]: For any measurable function G(X, Y),

    E[G(X, Y)] = E{E[G(X, Y)|X]},

provided the expectation E[G(X, Y)] exists.

Proof: We consider only the case where (Y, X')' is continuously distributed. By the multiplication
rule that the joint pdf f(x, y) = f_{Y|X}(y|x) f_X(x), we have

    E[G(X, Y)] = ∫∫ G(x, y) f_{XY}(x, y) dx dy
               = ∫∫ G(x, y) f_{Y|X}(y|x) f_X(x) dx dy
               = ∫ [ ∫ G(x, y) f_{Y|X}(y|x) dy ] f_X(x) dx
               = ∫ E[G(X, Y)|X = x] f_X(x) dx
               = E{E[G(X, Y)|X]},

where the operator E(·|X) is the expectation with respect to f_{Y|X}(·|X), and the operator E(·)
is the expectation with respect to f_X(·). This completes the proof.

Interpretation of E(Y|X) and LIE:

Example 1: Suppose Y is the wage, and X is a gender dummy variable taking value 1 if an
employee is female and value 0 if an employee is male. Then

    E(Y|X = 1) = average wage of a female worker,
    E(Y|X = 0) = average wage of a male worker,

and the overall average wage is

    E(Y) = E[E(Y|X)] = P(X = 1) E(Y|X = 1) + P(X = 0) E(Y|X = 0),

where P(X = 1) is the proportion of female employees in the labor force, and P(X = 0) is the
proportion of male employees in the labor force. The use of LIE here thus provides some insight
into the income distribution between genders.
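
For a concrete numerical illustration of this identity, the sketch below (an illustrative addition;
the proportions and wage levels are made up, not taken from any data set) simulates a labor force
in which 40% of workers are female, with average wages of 20 and 25 for female and male workers
respectively, and verifies that the overall sample mean equals the weighted average of the two
group means:

    # A minimal sketch with hypothetical numbers: E(Y) = P(X=1)E(Y|X=1) + P(X=0)E(Y|X=0).
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    X = rng.binomial(1, 0.4, n)                              # X = 1 (female) with probability 0.4
    Y = np.where(X == 1, 20.0, 25.0) + rng.normal(0, 2, n)   # wages scattered around group means

    p1 = (X == 1).mean()
    lhs = Y.mean()                                           # overall average wage E(Y)
    rhs = p1 * Y[X == 1].mean() + (1 - p1) * Y[X == 0].mean()
    print(lhs, rhs)                                          # both approximately 23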

Example 2: Suppose Y is an asset return and we have two information sets X and X̃, where
X ⊆ X̃, so that all information in X is also in X̃ but X̃ contains some extra information. Then
a conditional version of the law of iterated expectations says that

    E(Y|X) = E[E(Y|X̃)|X],

or equivalently,

    E{[Y − E(Y|X̃)]|X} = 0,

where Y − E(Y|X̃) is the prediction error based on the superior information set X̃. The
conditional LIE says that one cannot use the limited information X to predict the prediction
error one would make if one had the superior information X̃. See Campbell, Lo and MacKinlay
(1997, p.23) for more discussion.

Question: Why is E(Y|X) important from a statistical perspective?

Suppose we are interested in predicting Y using some function g(X) of X, and we use the
so-called Mean Squared Error (MSE) criterion to evaluate how well g(X) approximates Y. Then
the optimal predictor under the MSE criterion is the conditional mean, as will be shown below.
We first define the MSE criterion. Intuitively, MSE is the average of the squared deviations
between the predictor g(X) and the actual outcome Y.

Definition 2.2 [MSE]: Suppose the function g(X) is used to predict Y. Then the mean squared
error of the function g(X) is defined as

    MSE(g) = E[Y − g(X)]²,

provided the expectation exists.

The theorem below states that E(Y|X) minimizes the MSE.
Theorem 2.3 [Optimality of E(Y|X)]: The regression function E(Y|X) is the solution to the
optimization problem

    E(Y|X) = arg min_{g ∈ F} MSE(g) = arg min_{g ∈ F} E[Y − g(X)]²,

where F is the space of all measurable and square-integrable functions,

    F = { g(·) : ∫ g²(x) f_X(x) dx < ∞ }.

Proof: We will use the variance and squared-bias decomposition technique. Put

    g_o(X) ≡ E(Y|X).

Then

    MSE(g) = E[Y − g(X)]²
           = E[Y − g_o(X) + g_o(X) − g(X)]²
           = E[Y − g_o(X)]² + E[g_o(X) − g(X)]² + 2 E{[Y − g_o(X)][g_o(X) − g(X)]}
           = E[Y − g_o(X)]² + E[g_o(X) − g(X)]²,

where the cross-product term

    E{[Y − g_o(X)][g_o(X) − g(X)]} = 0

by LIE and the fact that E{[Y − g_o(X)]|X} = 0 a.s.

In the above MSE decomposition, the first term E[Y − g_o(X)]² is the quadratic variation of
the prediction error of the regression function g_o(X). It does not depend on the choice of the
function g(X). The second term E[g_o(X) − g(X)]² is the quadratic variation of the
approximation error of g(X) for g_o(X). This term achieves its minimum of zero if and only if
one chooses g(X) = g_o(X) a.s. Because the first term E[Y − g_o(X)]² does not depend on g(X),
minimizing MSE(g) is equivalent to minimizing the second term E[g_o(X) − g(X)]². Therefore,
the optimal solution to minimizing MSE(g) is given by g*(X) = g_o(X). This completes the
proof.

Remarks:
MSE is a popular criterion for measuring the precision of a predictor g(X) for Y. It has at least
two advantages: first, it can be analyzed conveniently, and second, it admits a nice decomposition
into a variance component and a squared-bias component.
However, MSE is only one of many possible criteria for measuring the goodness of the predictor
g(X) for Y. In general, any increasing function of the absolute value |Y − g(X)| can be used to
measure the goodness of fit of the predictor g(X). For example, the Mean Absolute Error

    MAE(g) = E|Y − g(X)|

is also a reasonable criterion.

It should be emphasized that different criteria have different optimizers. For example, the
optimizer for MAE(g) is the conditional median, rather than the conditional mean. The
conditional median, say m(x), is defined as a solution to

    ∫_{-∞}^{m(x)} f_{Y|X}(y|x) dy = 0.5.

In other words, m(x) divides the conditional population into two equal halves.

Example 3: Let the joint pdf be f_{XY}(x, y) = e^{−y} for 0 < x < y < ∞. Find E(Y|X) and var(Y|X).

Solution: We first find the conditional pdf f_{Y|X}(y|x). The marginal pdf of X is

    f_X(x) = ∫_{-∞}^{∞} f_{XY}(x, y) dy = ∫_x^∞ e^{−y} dy = e^{−x}   for 0 < x < ∞.

Therefore,

    f_{Y|X}(y|x) = f_{XY}(x, y) / f_X(x) = e^{−(y−x)}   for 0 < x < y < ∞.

Then

    E(Y|x) = ∫_{-∞}^{∞} y f_{Y|X}(y|x) dy
           = ∫_x^∞ y e^{−(y−x)} dy
           = e^x ∫_x^∞ y e^{−y} dy
           = e^x [ x e^{−x} + ∫_x^∞ e^{−y} dy ]   (integration by parts)
           = 1 + x.

Thus, the regression function E(Y|X) is linear in X.

To compute var(Y|X), we will use the formula

    var(Y|X) = E(Y²|X) − [E(Y|X)]².

Because

    E(Y²|x) = ∫_{-∞}^{∞} y² f_{Y|X}(y|x) dy
            = ∫_x^∞ y² e^{−(y−x)} dy
            = e^x ∫_x^∞ y² e^{−y} dy
            = e^x [ x² e^{−x} + 2 ∫_x^∞ y e^{−y} dy ]   (integration by parts)
            = x² + 2 ∫_x^∞ y e^{−(y−x)} dy
            = x² + 2(1 + x),

we have

    var(Y|x) = E(Y²|x) − [E(Y|x)]² = x² + 2(1 + x) − (1 + x)² = 1.

The conditional variance of Y given X does not depend on X. That is, X has no effect on the
conditional variance of Y.
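
A quick Monte Carlo check of this example is possible because, under the joint pdf
f_{XY}(x, y) = e^{−y} for 0 < x < y, the marginal distribution of X is exponential with unit mean
and, given X = x, Y − x is again exponential with unit mean. The Python sketch below (an
illustrative addition under these assumptions) verifies numerically that E(Y|X = x) ≈ 1 + x and
var(Y|X = x) ≈ 1:

    # A minimal sketch: simulate from f(x,y) = exp(-y), 0 < x < y, by drawing
    # X ~ Exp(1) and Y = X + Exp(1), then check E(Y|X=x) = 1 + x and var(Y|X=x) = 1.
    import numpy as np

    rng = np.random.default_rng(42)
    n = 2_000_000
    X = rng.exponential(1.0, n)
    Y = X + rng.exponential(1.0, n)

    # condition on a narrow band around x = 1.5
    x0, h = 1.5, 0.02
    band = np.abs(X - x0) < h
    print(Y[band].mean(), Y[band].var())   # approximately 1 + x0 = 2.5 and 1.0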

The above example shows that while the conditional mean of Y given X is a linear function
of X, the conditional variance of Y need not depend on X. This is essentially the assumption
made in the classical linear regression model (see Chapter 3). Another example in which we
have a linear regression function with a constant conditional variance is when X and Y are
jointly normally distributed (see Exercise 2.3 at the end of this chapter).

Theorem 2.4 [Regression Identity]: Suppose E(Y|X) exists. Then we can always write

    Y = E(Y|X) + ε,

where ε is called the regression disturbance and has the property that

    E(ε|X) = 0.

Proof: Put ε = Y − E(Y|X). Then

    Y = E(Y|X) + ε,

where

    E(ε|X) = E{[Y − E(Y|X)]|X}
           = E(Y|X) − E[g_o(X)|X]
           = E(Y|X) − g_o(X)
           = 0.

Remarks:
The regression function E(Y|X) can be used to predict the expected value of Y using the
information on X. In regression analysis, an important issue is the direction of causation between
Y and X. In practice, one often hopes to check whether Y "depends" on or can be "explained"
by X, with the help of economic theory. For this reason, Y is called the dependent variable, and
X is called the explanatory variable or vector. However, it should be emphasized that the
regression function E(Y|X) itself does not tell us anything about a causal relationship between
Y and X.
The random variable ε represents the part of Y that is not captured by E(Y|X). It is usually
called a noise or a disturbance, because it "disturbs" an otherwise stable relationship between
Y and X. On the other hand, the regression function E(Y|X) is called a signal.
The property that E(ε|X) = 0 implies that the regression disturbance ε contains no systematic
information in X that can be used to predict the expected value of Y. In other words, all the
information in X that can be used to predict the expectation of Y has been completely
summarized by E(Y|X). The condition E(ε|X) = 0 is crucial for the validity of the economic
interpretation of model parameters, as will be seen shortly.
E(ε|X) = 0 implies that the unconditional mean of ε is zero:

    E(ε) = E[E(ε|X)] = 0,

and that ε is orthogonal to X:

    E(Xε) = E[E(Xε|X)] = E[X E(ε|X)] = E(X · 0) = 0.

Since E(ε) = 0, we have E(Xε) = cov(X, ε). Thus, orthogonality (E(Xε) = 0) means that X
and ε are uncorrelated.
In fact, ε is orthogonal to any measurable function of X, i.e., E[ε h(X)] = 0 for any measurable
function h(·). This implies that we cannot predict the mean of ε by using any possible model
h(X), no matter whether it is linear or nonlinear.

Question: Is E(ε|X) = 0 equivalent to E[ε h(X)] = 0 for all measurable h(·)?

Answer: Yes. How can one show it? See Exercises 2.11 and 2.13 at the end of this chapter for
more discussion.
It is possible that E(ε|X) = 0 but var(ε|X) is a function of X. If var(ε|X) = σ² > 0, we say
that there exists conditional homoskedasticity for ε. In this case, X cannot be used to predict
the (quadratic) variation of Y. On the other hand, if var(ε|X) ≠ σ² for any constant σ² > 0,
we say that there exists conditional heteroskedasticity. Econometric procedures for regression
analysis usually differ depending on whether there exists conditional heteroskedasticity. For
example, the conventional t-test and F-test are invalid under conditional heteroskedasticity (see
Chapter 3 for the introduction of the t-test and F-test). This will be discussed in detail in
subsequent chapters.

Example 4: Suppose

    ε = √(α0 + α1 X²) · v,

where the random variables X and v are independent, E(v) = 0, and var(v) = 1. Find E(ε|X)
and var(ε|X).

Solution:

    E(ε|X) = E[√(α0 + α1 X²) v | X]
           = √(α0 + α1 X²) E(v|X)
           = √(α0 + α1 X²) E(v)
           = √(α0 + α1 X²) · 0
           = 0.

Next,

    var(ε|X) = E{[ε − E(ε|X)]² | X}
             = E(ε²|X)
             = E[v²(α0 + α1 X²)|X]
             = (α0 + α1 X²) E(v²|X)
             = (α0 + α1 X²) · 1
             = α0 + α1 X².

Although the conditional mean of ε given X is identically zero, the conditional variance of ε
given X depends on X.

Regression analysis (conditional mean analysis) is the most popular statistical method in
econometrics and has been applied widely in economics. For example, it can be used to

    estimate the relationship between economic variables;

    test economic hypotheses;

    forecast future values of Y.

Example 5: Let Y = consumption and X = disposable income. Then the regression function
E(Y|X) = C(X) is the so-called consumption function, and the marginal propensity to consume
(MPC) is the derivative

    MPC = C'(X) = (d/dX) E(Y|X).

The MPC is an important concept in the "multiplier effect" analysis, and its magnitude is
important in macroeconomic policy analysis and forecasting. On the other hand, when Y is
consumption on food only, Engel's law implies that the MPC must be a decreasing function of
X. Therefore, we can test Engel's law by testing whether C'(X) = (d/dX) E(Y|X) is a decreasing
function of X.

Example 6: Let Y = output and X = (labor, capital, raw material)'. Then the regression function
E(Y|X) = F(X) is the so-called production function. This can be used to test the hypothesis of
constant returns to scale (CRS), which is defined as

    F(λX) = λ F(X)   for all λ > 0.

Example 7: Let Y be the cost of producing a certain output X. Then the regression function
E(Y|X) = C(X) is the cost function. For a monopoly firm or industry, the marginal cost must
be declining in output X. That is,

    (d/dX) E(Y|X) = C'(X) > 0,
    (d²/dX²) E(Y|X) = C''(X) < 0.

These conditions imply that the cost function of a monopoly is a nonlinear function of X.

Question: Why may there exist conditional heteroskedasticity?

Generally speaking, given that E(Y|X) depends on X, it is conceivable that var(Y|X) and other
higher order conditional moments may also depend on X. In fact, conditional heteroskedasticity
may arise from different sources. For example, a larger firm may have a larger output variation.
Granger and Machina (2006) explain why economic variables may display volatility clustering
from an econometric structural perspective.

The following example shows that conditional heteroskedasticity may arise due to random
coefficients in a data generating process.

Example 8 [Random Coefficient Process]: Suppose

    Y = β0 + (β1 + β2 v) X + v,

where X and v are independent, E(v) = 0, and var(v) = σ². Find the conditional mean E(Y|X)
and the conditional variance var(Y|X).

Solution: (i)

    E(Y|X) = β0 + E[(β1 + β2 v)X|X] + E(v|X)
           = β0 + β1 X + β2 X E(v|X) + E(v|X)
           = β0 + β1 X + β2 X E(v) + E(v)
           = β0 + β1 X + β2 X · 0 + 0
           = β0 + β1 X.

(ii)

    var(Y|X) = E{[Y − E(Y|X)]² | X}
             = E{[β0 + (β1 + β2 v)X + v − β0 − β1 X]² | X}
             = E[(β2 v X + v)² | X]
             = E[(β2 X + 1)² v² | X]
             = (1 + β2 X)² E(v²|X)
             = (1 + β2 X)² E(v²)
             = (1 + β2 X)² σ².

The random coefficient process has been used to explain why the conditional variance may
depend on the regressor X. We can write this process as

    Y = β0 + β1 X + ε,

where

    ε = (1 + β2 X) v.

Note that E(ε|X) = 0 but var(ε|X) = (1 + β2 X)² σ².
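
The conditional moment calculations in Example 8 can also be checked by simulation. The sketch
below (an illustrative addition; the parameter values β0 = 1, β1 = 0.5, β2 = 2 and σ = 1 are chosen
arbitrarily) estimates E(Y|X = x) and var(Y|X = x) on a narrow band around a chosen x and
compares them with β0 + β1 x and (1 + β2 x)² σ²:

    # A minimal sketch: simulating the random coefficient process
    #   Y = b0 + (b1 + b2*v)*X + v,  v independent of X, E(v)=0, var(v)=sigma^2,
    # and checking E(Y|X=x) = b0 + b1*x and var(Y|X=x) = (1 + b2*x)^2 * sigma^2.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 2_000_000
    b0, b1, b2, sigma = 1.0, 0.5, 2.0, 1.0

    X = rng.normal(0, 1, n)
    v = rng.normal(0, sigma, n)
    Y = b0 + (b1 + b2 * v) * X + v

    x0, h = 0.8, 0.02
    band = np.abs(X - x0) < h
    print(Y[band].mean(), b0 + b1 * x0)                     # conditional mean check
    print(Y[band].var(), (1 + b2 * x0) ** 2 * sigma ** 2)   # conditional variance check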

2.3 Linear Regression Modeling

As shown above, the conditional mean g_o(X) ≡ E(Y|X) is the solution to the MSE
optimization problem

    min_{g ∈ F} E[Y − g(X)]²,

where F is the class of all measurable and square-integrable functions, i.e.,

    F = { g(·): R^{k+1} → R | ∫ g²(x) f_X(x) dx < ∞ }.

In general, the regression function E(Y|X) is an unknown function of X. Economic theory
usually suggests a qualitative relationship between X and Y (e.g., the cost of production is an
increasing function of output X), but it never suggests a concrete functional form. One needs to
use some mathematical model to approximate g_o(X).

Question: How can we model g_o(X)?

In econometrics, the most popular modeling strategy is the parametric approach, which assumes
a known functional form for g_o(X), up to some unknown parameters. In particular, one usually
uses a class of linear functions to approximate g_o(X), which is simple and easy to interpret. This
is the approach we will take in most of this book.

We first introduce the class of affine functions.

Definition 2.3 [Affine Functions]: Denote

    X = (1, X1, ..., Xk)',   β = (β0, β1, ..., βk)'.

Then the class of affine functions is defined as

    A = { g: R^{k+1} → R : g(X) = β0 + Σ_{j=1}^{k} βj Xj, βj ∈ R }
      = { g: R^{k+1} → R | g(X) = X'β }.

Here, there is no restriction on the values of the parameter vector β. For this class of functions,
the functional form is known and is linear in both the explanatory variables X and the parameters
β; the unknown is the (k+1) × 1 parameter vector β.

Remarks:
From an econometric point of view, the key feature of A is that g(X) = X'β is linear in β, not
necessarily in X. Later, we will generalize A so that g(X) = X'β is linear in β but possibly
nonlinear in X. For example, when k = 1, we can generalize A to include

    g(X) = β0 + β1 X1 + β2 X1²,

or

    g(X) = β0 + β1 ln X1.

These possibilities are included in A if we properly redefine X as X = (1, X1, X1²)' or
X = (1, ln X1)'. Therefore, the econometric theory to be developed in subsequent chapters is
actually applicable to all regression models that are linear in β but not necessarily linear in X.
Such models are called linear regression models. Conversely, a nonlinear regression model for
g_o(X) means a known parametric functional form g(X, β) which is nonlinear in β. An example
is the so-called logistic regression model

    g(X, β) = 1 / [1 + exp(−X'β)].

Nonlinear regression models can be handled using the analytic tools developed in Chapter 8.
See more discussion there.

We now solve the constrained minimization problem

    min_{g ∈ A} E[Y − g(X)]² = min_{β ∈ R^{k+1}} E(Y − X'β)².

The solution g*(X) = X'β* is called the Best Linear Least Squares Predictor for Y, and β* is
called the best linear LS approximation coefficient vector.

Theorem 2.5 [Best Linear LS Prediction]: Suppose E(Y²) < ∞ and the (k+1) × (k+1)
matrix E(XX') is nonsingular. Then the best linear LS predictor that solves

    min_{g ∈ A} E[Y − g(X)]² = min_{β ∈ R^{k+1}} E(Y − X'β)²

is the linear function

    g*(X) = X'β*,

where the optimizing coefficient vector is

    β* = [E(XX')]^{−1} E(XY).

Proof: Noting that

    min_{g ∈ A} E[Y − g(X)]² = min_{β ∈ R^{k+1}} E(Y − X'β)²,

we first find the first order condition (FOC):

    (d/dβ) E(Y − X'β)² |_{β = β*} = 0.

The left hand side is

    (d/dβ) E(Y − X'β)² = E[ (∂/∂β)(Y − X'β)² ]
                       = E[ 2(Y − X'β) (∂/∂β)(−X'β) ]
                       = −2 E[X(Y − X'β)].

Therefore, the FOC implies that

    E[X(Y − X'β*)] = 0,   or   E(XY) = E(XX') β*.

Multiplying by the inverse of E(XX'), we obtain

    β* = [E(XX')]^{−1} E(XY).

It remains to check the second order condition (SOC): the Hessian matrix

    (d²/dβ dβ') E(Y − X'β)² = 2 E(XX')

is positive definite provided E(XX') is nonsingular (why?). Therefore, β* is a global minimizer.
This completes the proof.
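
The population formula β* = [E(XX')]^{−1} E(XY) has an exact sample analogue: replacing
population moments by sample averages yields the ordinary least squares estimator studied in
Chapter 3. The Python sketch below (an illustrative addition, using an arbitrarily chosen nonlinear
DGP Y = sin(X1) + ε with X1 and ε standard normal) computes this sample analogue and shows
that β* is well-defined even though E(Y|X) is nonlinear in X:

    # A minimal sketch: the sample analogue of beta* = [E(XX')]^{-1} E(XY)
    # for an arbitrarily chosen nonlinear DGP, Y = sin(X1) + eps.
    import numpy as np

    rng = np.random.default_rng(7)
    n = 1_000_000
    X1 = rng.normal(0, 1, n)
    Y = np.sin(X1) + rng.normal(0, 1, n)

    X = np.column_stack([np.ones(n), X1])     # regressor vector (1, X1)'
    Exx = X.T @ X / n                         # sample analogue of E(XX')
    Exy = X.T @ Y / n                         # sample analogue of E(XY)
    beta_star = np.linalg.solve(Exx, Exy)
    print(beta_star)    # roughly (0, 0.61); slope = cov(sin X1, X1)/var(X1) = exp(-1/2)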

Remarks:
The moment condition E(Y²) < ∞ ensures that E(Y|X) exists and is well-defined. When the
(k+1) × (k+1) matrix

    E(XX') = [ 1        E(X1)      E(X2)      ...   E(Xk)
               E(X1)    E(X1²)     E(X1X2)    ...   E(X1Xk)
               E(X2)    E(X2X1)    E(X2²)     ...   E(X2Xk)
               ...      ...        ...        ...   ...
               E(Xk)    E(XkX1)    E(XkX2)    ...   E(Xk²)  ]

is nonsingular and E(XY) exists, the best linear LS approximation coefficient β* is always
well-defined, no matter whether E(Y|X) is linear or nonlinear in X.
To gain insight into the nature of β*, we consider the simple case where β* = (β0*, β1*)' and
X = (1, X1)'. Then the slope coefficient and the intercept coefficient are, respectively,

    β1* = cov(Y, X1) / var(X1),
    β0* = E(Y) − β1* E(X1).

Thus, the best linear LS approximation coefficient β1* is proportional to cov(Y, X1). In other
words, β1* captures the dependence between Y and X1 that is measurable by cov(Y, X1); it will
miss any dependence between Y and X1 that cannot be measured by cov(Y, X1). Therefore,
linear regression analysis is essentially correlation analysis.

In general, the best linear LS predictor g*(X) ≡ X'β* ≠ E(Y|X). An important question is
what happens when g*(X) = X'β* ≠ E(Y|X). In particular, what is the interpretation of β* in
that case?

We now discuss the relationship between the best linear LS prediction and a linear regression
model.

Definition 2.4 [Linear Regression Model]: The specification

    Y = X'β + u,   β ∈ R^{k+1},

is called a linear regression model, where u is the regression model disturbance or regression
model error. If k = 1, it is called a bivariate linear regression model or a straight line regression
model. If k > 1, it is called a multiple linear regression model.

The linear regression model is an artificial specification. Nothing ensures that the regression
function is linear, namely that E(Y|X) = X'β^o for some β^o. In other words, the linear model
may not contain the true regression function g_o(X) ≡ E(Y|X). However, even if g_o(X) is not a
linear function of X, the linear regression model Y = X'β + u may still have some predictive
ability, although it is a misspecified model.

We first characterize the relationship between the best linear LS approximation and the linear
regression model.

Theorem 2.6: Suppose the conditions of Theorem 2.5 hold. Let

    Y = X'β + u,

and let β* be the best linear least squares approximation coefficient. Then

    β = β*

if and only if the following orthogonality condition holds:

    E(Xu) = 0.

Proof: From the linear regression model Y = X'β + u, we have u = Y − X'β, and so

    E(Xu) = E(XY) − E(XX')β.

(a) Necessity: If β = β*, then

    E(Xu) = E(XY) − E(XX')β*
          = E(XY) − E(XX')[E(XX')]^{−1} E(XY)
          = 0.

(b) Sufficiency: If E(Xu) = 0, then

    E(Xu) = E(XY) − E(XX')β = 0.

From this and the fact that E(XX') is nonsingular, we have

    β = [E(XX')]^{−1} E(XY) ≡ β*.

This completes the proof.

Remarks:
This theorem implies that no matter whether E(Y|X) is linear or nonlinear in X, we can always
write

    Y = X'β + u

for some β = β* such that the orthogonality condition E(Xu) = 0 holds, where u = Y − X'β*.
The orthogonality condition E(Xu) = 0 is fundamentally linked to the best linear least squares
optimizer. If β is the best linear LS coefficient β*, then the disturbance u must be orthogonal to
X. On the other hand, if X is orthogonal to u, then β must be the least squares minimizer β*.
Essentially, the orthogonality between X and u is the FOC of the best linear LS problem. In
other words, the orthogonality condition E(Xu) = 0 will always hold as long as the MSE criterion
is used to obtain the best linear prediction. Note that when X contains an intercept, the
orthogonality condition E(Xu) = 0 implies that E(u) = 0. In this case, we have E(Xu) = cov(X, u).
In other words, the orthogonality condition is equivalent to uncorrelatedness between X and u.
This implies that u does not contain any component that can be predicted by a linear function
of X.
The condition E(Xu) = 0 is fundamentally different from E(u|X) = 0. The latter implies the
former but not vice versa. In other words, E(u|X) = 0 implies E(Xu) = 0, but it is possible that
E(Xu) = 0 and E(u|X) ≠ 0. This can be illustrated by the following example.

Example 1: Suppose u = (X² − 1) + ε, where X and ε are independent N(0,1) random variables.
Then

    E(u|X) = X² − 1 ≠ 0,   but
    E(Xu) = E[X(X² − 1)] + E(Xε)
          = E(X³) − E(X) + E(X)E(ε)
          = 0.
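
A small simulation makes this distinction concrete. The sketch below (an illustrative addition
based on Example 1) generates u = (X² − 1) + ε and confirms that the sample analogue of E(Xu)
is close to zero, even though the mean of u clearly varies with X:

    # A minimal sketch: E(Xu) = 0 does not imply E(u|X) = 0.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 1_000_000
    X = rng.normal(0, 1, n)
    eps = rng.normal(0, 1, n)
    u = (X ** 2 - 1) + eps

    print(np.mean(X * u))                      # close to 0: orthogonality holds
    print(u[np.abs(X) < 0.1].mean())           # close to -1: E(u | X near 0) = X^2 - 1 < 0
    print(u[np.abs(X) > 2.0].mean())           # clearly positive: E(u|X) depends on X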

2.4 Correct Model Specification for Conditional Mean

Question: How can correct model specification in conditional mean be characterized?

Definition 2.5 [Correct Model Specification in Conditional Mean]: The linear regression
model

    Y = X'β + u,   β ∈ R^{k+1},

is said to be correctly specified for E(Y|X) if

    E(Y|X) = X'β^o   for some β^o ∈ R^{k+1}.

On the other hand, if

    E(Y|X) ≠ X'β   for all β ∈ R^{k+1},

then the linear model is said to be misspecified for E(Y|X).

Remarks:
The class of linear regression models contains an infinite number of linear functions, each
corresponding to a particular value of β. When the linear model is correctly specified, the linear
function corresponding to some β^o will coincide with g_o(X). The coefficient β^o is called the
"true parameter," because it then has a meaningful economic interpretation as the expected
marginal effect of X on Y:

    β^o = (d/dX) E(Y|X).

For example, when Y is consumption and X is income, β^o is the marginal propensity to consume
(MPC).
When β^o is a vector, the component

    βj^o = ∂E(Y|X)/∂Xj,   1 ≤ j ≤ k,

is the partial marginal effect of Xj on Y, holding all other explanatory variables in X fixed.

Question: What is the interpretation of the intercept coefficient β0^o when a linear regression
model is correctly specified for g_o(X)?

Answer: The intercept β0^o corresponds to the variable X0 = 1, which is always uncorrelated
with any other random variable. It captures the "average effect" on Y of all possible factors
other than the explanatory variables in X. For example, consider the standard Capital Asset
Pricing Model (CAPM)

    E(Y|X) = β0^o + β1^o X1,

where Y is the excess portfolio return (i.e., the difference between a portfolio return and a
risk-free rate) and X1 is the excess market portfolio return (i.e., the difference between the
market portfolio return and a risk-free rate). Here, β0^o represents the average pricing error.
When the CAPM holds, β0^o = 0. Thus, if the data generating process has β0^o > 0, the CAPM
underprices the portfolio; if β0^o < 0, the CAPM overprices the portfolio.
No economic theory ensures that the functional form of E(Y|X) must be linear in X. A
nonlinear functional form in X is a generic possibility. Therefore, we must be very cautious
about the economic interpretation of linear coefficients.

Theorem 2.7: If the linear model

    Y = X'β + u

is correctly specified for E(Y|X), then

(a) Y = X'β^o + ε for some β^o and ε, where E(ε|X) = 0;
(b) β* = β^o.

Proof: (a) If the linear model is correctly specified for E(Y|X), then E(Y|X) = X'β^o for some β^o.
On the other hand, we always have the regression identity Y = E(Y|X) + ε, where E(ε|X) = 0.
Combining these two equations gives result (a) immediately.
(b) From part (a) we have

    E(Xε) = E[X E(ε|X)] = E(X · 0) = 0.

It follows that the orthogonality condition holds for Y = X'β^o + ε. Therefore, we have β^o = β*
by the previous theorem (which one?).

Remarks:
Theorem 2.7(a) implies E(Y|X) = X'β^o under correct model specification for E(Y|X). This,
together with Theorem 2.7(b), implies that when a linear regression model is correctly specified,
the conditional mean E(Y|X) coincides with the best linear least squares predictor g*(X) = X'β*.
Under correct model specification, the best linear LS approximation coefficient β* is equal to
the true marginal effect parameter β^o. In other words, β* can be interpreted as the true
parameter β^o when (and only when) the linear regression model is correctly specified.

Question: What happens if the linear regression model

    Y = X'β* + u,

where E(Xu) = 0, is misspecified for E(Y|X)? In other words, what happens if E(Xu) = 0 but
E(u|X) ≠ 0?

Answer: The regression function is

    E(Y|X) = X'β* + E(u|X) ≠ X'β*.

There exists some neglected structure in u that could be exploited to improve the prediction of
Y using X. A misspecified model always yields suboptimal predictions; a correctly specified
model yields optimal predictions in terms of MSE.

Example 1: Consider the following data generating process (DGP):

    Y = 1 + (1/2) X1 + (1/4)(X1² − 1) + ε,

where X1 and ε are mutually independent N(0, 1) random variables.
(a) Find the conditional mean E(Y|X1) and (d/dX1) E(Y|X1), the marginal effect of X1 on Y.

Suppose now a linear regression model

    Y = β0 + β1 X1 + u = X'β + u,

where X = (X0, X1)' = (1, X1)', is specified to approximate this DGP.

(b) Find the best linear LS approximation coefficient β* and the best linear LS predictor
g_A(X) = X'β*.
(c) Let u = Y − X'β*. Show E(Xu) = 0.
(d) Check whether the true marginal effect (d/dX1) E(Y|X1) is equal to β1*, the model-implied
marginal effect.

Solution: (a) Given that X1 and ε are independent, we obtain

    E(Y|X1) = 1 + (1/2) X1 + (1/4)(X1² − 1),
    (d/dX1) E(Y|X1) = 1/2 + (1/2) X1.

(b) Using the best linear LS approximation formula, we have

    β* = [E(XX')]^{−1} E(XY)
       = [1 0; 0 1]^{−1} (1, 1/2)'
       = (1, 1/2)'.

Hence, we have

    g*(X) = X'β* = 1 + (1/2) X1.

(c) By definition and part (b), we have

    u = Y − X'β* = Y − (β0* + β1* X1) = (1/4)(X1² − 1) + ε.

It follows that

    E(Xu) = ( E[(1/4)(X1² − 1) + ε],  E[X1{(1/4)(X1² − 1) + ε}] )' = (0, 0)',

although

    E(u|X1) = (1/4)(X1² − 1) ≠ 0.

(d) No, because

    (d/dX1) E(Y|X1) = 1/2 + (1/2) X1 ≠ 1/2 = β1*.

The true marginal effect depends on the level of X1 rather than being a constant. Therefore,
the condition E(Xu) = 0 is not sufficient for the validity of the economic interpretation of β1*
as the marginal effect.
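
The algebra in this example can be confirmed numerically. The sketch below (an illustrative
addition reproducing the DGP above by simulation) recovers β* ≈ (1, 1/2)', checks the
orthogonality condition E(Xu) = 0, and shows that u nevertheless has a nonzero mean in regions
where X1 is large, i.e., E(u|X1) ≠ 0:

    # A minimal sketch: verifying parts (b)-(d) of Example 1 by simulation.
    import numpy as np

    rng = np.random.default_rng(10)
    n = 1_000_000
    X1 = rng.normal(0, 1, n)
    eps = rng.normal(0, 1, n)
    Y = 1 + 0.5 * X1 + 0.25 * (X1 ** 2 - 1) + eps

    X = np.column_stack([np.ones(n), X1])
    beta_star = np.linalg.solve(X.T @ X, X.T @ Y)     # approx. (1, 0.5), part (b)
    u = Y - X @ beta_star
    print(beta_star)
    print(X.T @ u / n)                                # approx. (0, 0): E(Xu) = 0, part (c)
    print(u[X1 > 1.5].mean())                         # > 0: E(u|X1) = (X1^2 - 1)/4 != 0, part (d)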

Any parametric regression model is subject to potential model misspecification. This can occur
due to the use of a misspecified functional form, as well as the existence of omitted variables
which are correlated with the included regressors, among other things. In econometrics, there
exists a modeling strategy which is free of model misspecification when the data set is sufficiently
large. This modeling strategy is called the nonparametric approach; it does not assume any
functional form for E(Y|X) but lets the data speak for the true relationship. We now introduce
the basic idea of the nonparametric approach.

Nonparametric modeling is a statistical method that can approximate the unknown function
arbitrarily well without having to know the functional form of E(Y|X). To illustrate the basic
idea of nonparametric modeling, suppose g_o(x) is a smooth function of x. Then we can expand
g_o(x) using a set of orthonormal "basis" functions {ψj(x), j = 0, 1, 2, ...}:

    g_o(x) = Σ_{j=0}^{∞} αj ψj(x)   for x ∈ support(X),

where the Fourier coefficient is

    αj = ∫ g_o(x) ψj(x) dx

and

    ∫ ψi(x) ψj(x) dx = δ_ij = { 1 if i = j, 0 if i ≠ j }.

The function δ_ij is called the Kronecker delta.
Example 2: Suppose g_o(x) = x², where x ∈ [−π, π]. Then

    g_o(x) = π²/3 − 4 [ cos(x) − cos(2x)/2² + cos(3x)/3² − ... ]
           = π²/3 − 4 Σ_{j=1}^{∞} (−1)^{j−1} cos(jx)/j².

Example 3: Suppose

    g_o(x) = −1 if −π < x < 0;   0 if x = 0;   1 if 0 < x < π.

Then

    g_o(x) = (4/π) [ sin(x) + sin(3x)/3 + sin(5x)/5 + ... ]
           = (4/π) Σ_{j=0}^{∞} sin[(2j+1)x] / (2j+1).
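
The quality of such truncated expansions can be inspected numerically. The Python sketch below
(an illustrative addition) evaluates partial sums of the series in Example 2 and shows that the
maximum approximation error for g_o(x) = x² on [−π, π] shrinks as the truncation order p grows:

    # A minimal sketch: truncated Fourier approximation of g(x) = x^2 on [-pi, pi].
    import numpy as np

    x = np.linspace(-np.pi, np.pi, 2001)
    g = x ** 2

    for p in (2, 5, 20, 100):
        j = np.arange(1, p + 1)
        # partial sum: pi^2/3 - 4 * sum_{j=1}^{p} (-1)^{j-1} cos(jx)/j^2
        gp = np.pi ** 2 / 3 - 4 * ((-1.0) ** (j - 1) / j ** 2) @ np.cos(np.outer(j, x))
        print(p, np.max(np.abs(g - gp)))   # maximum error decreases with p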

Generally, suppose g_o(x) is square-integrable. Then we have

    ∫ g_o²(x) dx = Σ_{j=0}^{∞} Σ_{k=0}^{∞} αj αk ∫ ψj(x) ψk(x) dx
                 = Σ_{j=0}^{∞} Σ_{k=0}^{∞} αj αk δ_jk   (by orthonormality of {ψj(·)})
                 = Σ_{j=0}^{∞} αj² < ∞.

Therefore, αj → 0 as j → ∞. That is, the Fourier coefficient αj eventually vanishes to zero as
the order j goes to infinity. This motivates the use of the following truncated approximation:

    g_p(x) = Σ_{j=0}^{p} αj ψj(x),

where p is the order of the basis. The approximation bias of g_p(x) for g_o(x) is

    B_p(x) = g_o(x) − g_p(x) = Σ_{j=p+1}^{∞} αj ψj(x).

The coefficients {αj} are unknown in practice, so we have to estimate them from an observed
data set {Yt, Xt}, t = 1, ..., n, where n is the sample size. We consider the linear regression

    Yt = Σ_{j=0}^{p} αj ψj(Xt) + ut,   t = 1, ..., n.

Obviously, we need to let p = p(n) → ∞ as n → ∞ to ensure that the bias B_p(x) vanishes to
zero as n → ∞. However, we should not let p grow to infinity too fast, because otherwise there
will be too much sampling variation in the parameter estimators (due to too many unknown
parameters). This requires p/n → 0 as n → ∞.

The nonparametric approach just described is called nonparametric series regression (see, e.g.,
Andrews 1991, Hong and White 1995). There are many nonparametric methods available in the
literature. Another popular nonparametric method is the kernel method, which is based on the
idea of a Taylor series expansion in a local region. See Härdle (1990), Applied Nonparametric
Regression, for more discussion of kernel smoothing. The key feature of nonparametric modeling
is that it does not specify a concrete functional form or model but rather estimates the unknown
true function from data. As can be seen above, nonparametric series regression is easy to use
and understand, because it is a natural extension of linear regression with the number of
regressors increasing with the sample size n.

The nonparametric approach is flexible and powerful, but it generally requires a large data set
for precise estimation because there is a large number of unknown parameters. Moreover, it
offers little economic interpretation (for example, it is difficult to give an economic interpretation
to the coefficients {αj}). Nonparametric analysis is usually treated in a separate, more advanced
econometrics course (see more discussion in Chapter 10).
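
To make the series idea concrete, the following Python sketch (an illustrative addition; it uses a
simple power-series basis rather than an orthonormal basis, and the DGP sin(1.5x) with noise
level 0.3 is chosen arbitrarily) fits the series regression by OLS for several truncation orders p.
With n fixed, very small p leaves a large approximation bias, while a p that is too large relative
to n typically inflates the estimation variance, which is the trade-off behind the requirement
p → ∞ with p/n → 0:

    # A minimal sketch: nonparametric series regression with a simple polynomial basis.
    import numpy as np

    rng = np.random.default_rng(5)
    n = 200
    X = rng.uniform(-2, 2, n)
    Y = np.sin(1.5 * X) + 0.3 * rng.normal(size=n)        # unknown true regression function

    X_new = np.linspace(-2, 2, 500)                        # evaluation points
    truth = np.sin(1.5 * X_new)

    for p in (1, 3, 6, 25):
        B = np.vander(X, p + 1, increasing=True)           # basis functions 1, x, ..., x^p
        alpha, *_ = np.linalg.lstsq(B, Y, rcond=None)      # OLS on the series terms
        fit = np.vander(X_new, p + 1, increasing=True) @ alpha
        print(p, np.mean((fit - truth) ** 2))              # approximation error vs. p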
2.5 Conclusion
Most economic theories (e.g., rational expectations theory) have implications on, and only on,
the conditional mean of the underlying economic variable given some suitable information set.
The conditional mean E(Y|X) is called the regression function of Y on X. In this chapter, we
have shown that the regression function E(Y|X) is the optimal solution to the MSE minimization
problem

    min_{g ∈ F} E[Y − g(X)]²,

where F is the space of measurable and square-integrable functions.

The regression function E(Y|X) is generally unknown, because economic theory usually does
not suggest a concrete functional form. In practice, one usually uses a parametric model for
E(Y|X) that has a known functional form but a finite number of unknown parameters. When we
restrict g(X) to A = {g: R^K → R | g(x) = x'β}, the class of affine functions, the optimal
predictor that solves

    min_{g ∈ A} E[Y − g(X)]² = min_{β ∈ R^K} E(Y − X'β)²

is g*(X) = X'β*, where

    β* = [E(XX')]^{−1} E(XY)

is called the best linear least squares approximation coefficient. The best linear least squares
predictor g_A(X) = X'β* is always well-defined, no matter whether E(Y|X) is linear in X.
Suppose we write

    Y = X'β + u.

Then β = β* if and only if

    E(Xu) = 0.

This orthogonality condition is actually the first order condition of the best linear least squares
minimization problem. It does not guarantee correct specification of a linear regression model.
A linear regression model is correctly specified for E(Y|X) if E(Y|X) = X'β^o for some β^o,
which is equivalent to the condition that

    E(u|X) = 0,

where u = Y − X'β^o. That is, correct model specification for E(Y|X) holds if and only if the
conditional mean of the linear regression model error is zero when evaluated at some parameter
β^o. Note that E(u|X) = 0 is equivalent to the condition that E[u h(X)] = 0 for all measurable
functions h(·). When E(Y|X) = X'β^o for some β^o, we have β* = β^o. That is, the best linear
least squares approximation coefficient β* coincides with the true model parameter β^o and can
be interpreted as the marginal effect of X on Y. The condition E(u|X) = 0 fundamentally differs
from E(Xu) = 0. The former is crucial for the validity of the economic interpretation of the
coefficient β* as the true coefficient β^o; the orthogonality condition E(Xu) = 0 does not
guarantee this interpretation. Correct model specification is important for the economic
interpretation of model coefficients and for optimal predictions.

An econometric model aims to provide a concise and reasonably accurate reflection of the data
generating process. By disregarding less relevant aspects of the data, the model helps to obtain
a better understanding of the main aspects of the DGP. This implies that an econometric model
will never provide a completely accurate description of the DGP. Therefore, the concept of a
"true model" does not make much practical sense. It reflects an idealized situation that allows
us to obtain mathematically exact results. The idea is that similar results hold approximately
if the model is a reasonably accurate approximation of the DGP.

The main purpose of this chapter is to provide a general idea of regression analysis and to shed
some light on the nature and limitations of linear regression models, which have been popularly
used in econometrics and will be the subject of study in Chapters 3 to 7.

EXERCISES
2.1. Put ε = Y − E(Y|X). Show var(Y|X) = var(ε|X).

2.2. Show var(Y) = var[E(Y|X)] + var[Y − E(Y|X)].

2.3. Suppose (X, Y) follows a bivariate normal distribution with joint pdf

    f_{XY}(x, y) = [2π σ1 σ2 √(1 − ρ²)]^{−1}
        × exp{ −[2(1 − ρ²)]^{−1} [ ((x − μ1)/σ1)² − 2ρ((x − μ1)/σ1)((y − μ2)/σ2) + ((y − μ2)/σ2)² ] },

where −1 < ρ < 1, −∞ < μ1, μ2 < ∞, and 0 < σ1, σ2 < ∞. Find

(a) E(Y|X).
(b) var(Y|X). (Hint: use the change of variable method for integration and the fact that
∫_{-∞}^{∞} (2π)^{−1/2} exp(−x²/2) dx = 1.)

2.4. Suppose Z ≡ (Y, X')' is a random vector such that the conditional mean g_o(X) ≡ E(Y|X)
exists, where X is a (k+1) × 1 random vector. Suppose one uses a model (or a function) g(X)
to predict Y. A popular evaluation criterion for the model g(X) is the mean squared error
MSE(g) ≡ E[Y − g(X)]².
(a) Show that the optimal predictor g*(X) for Y that minimizes MSE(g) is the conditional
mean g_o(X), namely g*(X) = g_o(X).
(b) Put ε ≡ Y − g_o(X), which is called the true regression disturbance. Show that E(ε|X) = 0
and interpret this result.

2.5. The choices of the model g(X) in Exercise 2.4 are very general. Suppose that we now restrict
the choice of g(X) to the class of linear (or affine) models {g_A(X) = X'β}, where β is a
(k+1) × 1 parameter vector. One can choose a linear function g_A(X) by choosing a value for the
parameter β; different values of β give different linear functions g_A(X). The best linear predictor
that minimizes the mean squared error criterion is defined as g_A*(X) ≡ X'β*, where

    β* ≡ arg min_{β ∈ R^{k+1}} E(Y − X'β)²

is called the optimal linear coefficient.
(a) Show that

    β* = [E(XX')]^{−1} E(XY).

(b) Define u* ≡ Y − X'β*. Show that E(Xu*) = 0, where 0 is a (k+1) × 1 zero vector.
(c) Suppose the conditional mean g_o(X) = X'β^o for some given β^o. Then we say that the
linear model g_A(X) is correctly specified for the conditional mean g_o(X), and β^o is the true
parameter of the data generating process. Show that β* = β^o and E(u*|X) = 0.
(d) Suppose the conditional mean g_o(X) ≠ X'β for any value of β. Then we say that the
linear model g_A(X) is misspecified for the conditional mean g_o(X). Check whether E(u*|X) = 0
and discuss its implication.

2.6. Suppose Y = β0* + β1* X1 + u*, where Y and X1 are scalars, and β* = (β0*, β1*)' is the best
linear least squares approximation coefficient.
(a) Show that β1* = cov(Y, X1)/σ²_{X1} and β0* = E(Y) − β1* E(X1), and that the mean squared
error is

    E[Y − (β0* + β1* X1)]² = σ²_Y (1 − ρ²_{X1 Y}),

where σ²_Y = var(Y) and ρ_{X1 Y} is the correlation coefficient between Y and X1.
(b) Suppose in addition Y and X1 follow a bivariate normal distribution. Show E(Y|X1) =
β0* + β1* X1 and var(Y|X1) = σ²_Y (1 − ρ²_{X1 Y}). That is, the conditional mean of Y given X1
coincides with the best linear least squares predictor, and the conditional variance of Y given X1
is equal to the mean squared error of the best linear least squares predictor.

2.7. Suppose

    Y = β0 + β1 X1 + |X1| ε,

where E(X1) = 0, var(X1) = σ²_{X1} > 0, E(ε) = 0, var(ε) = σ²_ε > 0, and ε and X1 are independent.
Both β0 and β1 are scalar constants.
(a) Find E(Y|X1).
(b) Find var(Y|X1).
(c) Show that β1 = 0 if and only if cov(X1, Y) = 0.

2.8. Suppose an aggregate consumption function is given by

    Y = 1 + 0.5 X1 + (1/4)(X1² − 1) + ε,

where X1 ~ N(0, 1), ε ~ N(0, 1), and X1 is independent of ε.
(a) Find the conditional mean g_o(X) ≡ E(Y|X), where X ≡ (1, X1)'.
(b) Find the marginal propensity to consume (MPC), (d/dX1) g_o(X).
(c) Suppose we use a linear model

    Y = X'β + u = β0 + β1 X1 + u,

where β ≡ (β0, β1)', to predict Y. Find the optimal linear coefficient β* and the optimal linear
predictor g_A*(X) ≡ X'β*.
(d) Compute the partial derivative (d/dX1) g_A*(X) of the linear model and compare it with the
MPC in part (b). Discuss the results you obtain.

2.9. Put g_o(X) = E(Y|X), where X = (1, X1)'. Then we have

    Y = g_o(X) + ε,

where E(ε|X) = 0.
Consider a first order Taylor series expansion of g_o(X) around μ1 = E(X1):

    g_o(X) ≈ g_o(μ1) + g_o'(μ1)(X1 − μ1)
           = [g_o(μ1) − μ1 g_o'(μ1)] + g_o'(μ1) X1.

Suppose β* = (β0*, β1*)' is the best linear least squares approximation coefficient. Is it true
that β1* = g_o'(μ1)? Provide your reasoning.

2.10. Suppose a data generating process is given by

    Y = 0.8 X1 X2 + ε,

where X1 ~ N(0, 1), X2 ~ N(0, 1), ε ~ N(0, 1), and X1, X2 and ε are mutually independent.
Put X = (1, X1, X2)'.
(a) Is Y predictable in mean using the information in X?
(b) Suppose we use a linear model

    Y = X'β + u = β0 + β1 X1 + β2 X2 + u

to predict Y. Does this linear model have any predictive power? Explain.

2.11. Show that E(u|X) = 0 if and only if E[h(X)u] = 0 for any measurable function h(·).

2.13. Suppose E(ε|X) exists, X is a bounded random variable, and h(X) is an arbitrary
measurable function. Put g(X) = E(ε|X) and assume that E[g²(X)] < ∞.
(a) Show that if g(X) = 0, then E[ε h(X)] = 0.
(b) Show that if E[ε h(X)] = 0, then E(ε|X) = 0. [Hint: Consider h(X) = e^{tX} for t in a small
neighborhood containing 0. Given that X is bounded, we can expand

    g(X) = Σ_{j=0}^{∞} δj X^j,

where δj = ∫_{-∞}^{∞} g(x) x^j f_X(x) dx is the Fourier coefficient. Then

    E(ε e^{tX}) = E[E(ε|X) e^{tX}]
                = E[g(X) e^{tX}]
                = Σ_{j=0}^{∞} (t^j / j!) E[g(X) X^j]
                = Σ_{j=0}^{∞} (t^j / j!) δj

for all t in a small neighborhood containing 0.]

2.14. Consider the following nonlinear least squares problem:

    min_{β ∈ R^{k+1}} E[Y − g(X, β)]²,

where g(X, β) is possibly a nonlinear function of β. [An example is the logistic regression model,
where g(X, β) = 1/(1 + exp(−X'β)).] Suppose E[ (∂/∂β)g(X, β) (∂/∂β')g(X, β) ] is a
(k+1) × (k+1) bounded and nonsingular matrix for all β ∈ R^{k+1}, where (∂/∂β')g(X, β) is
the transpose of the (k+1) × 1 column vector (∂/∂β)g(X, β).
(a) Derive the first order condition for the best nonlinear least squares approximation
coefficient β* (say).
(b) Put Y = g(X, β) + u. Show that β = β* if and only if E[u (∂/∂β)g(X, β)] = 0. Do we have
E(Xu) = 0 when g(X, β) is nonlinear in β?
(c) The nonlinear regression model g(X, β) is said to be correctly specified for E(Y|X) if there
exists some unknown β^o such that E(Y|X) = g(X, β^o) almost surely. Here, β^o can be
interpreted as a true model parameter. Show that β* = β^o if and only if the model g(X, β) is
correctly specified for E(Y|X).
(d) Do we have E(u|X) = 0, where u = Y − g(X, β^o) for some β^o, when the model g(X, β) is
correctly specified?
(e) If E(u|X) = 0, where u = Y − g(X, β^o) for some β^o, is g(X, β) correctly specified for
E(Y|X)?

2.15. Comment on the following statement: "All econometric models are approximations of the
economic system of interest and are therefore misspecified. Therefore, there is no need to check
correct model specification in practice."

CHAPTER 3 CLASSICAL LINEAR
REGRESSION MODELS
Abstract: In this chapter, we will introduce the classical linear regression theory,
including the classical model assumptions, the statistical properties of the OLS
estimator, the t-test and the F-test, as well as the GLS estimator and related statistical
procedures. This chapter will serve as a starting point from which we will develop the
modern econometric theory.

Key words: Classical linear regression, Conditional heteroskedasticity, Conditional
homoskedasticity, F-test, GLS, Hypothesis testing, Model selection criterion, OLS, R²,
t-test

3.1 Framework and Assumptions

Suppose we have an observed random sample {Zt}, t = 1, ..., n, of size n, where Zt = (Yt, Xt')',
Yt is a scalar, Xt = (1, X_{1t}, X_{2t}, ..., X_{kt})' is a (k+1) × 1 vector, t is an index (either a
cross-sectional unit or a time period) for observations, and n is the sample size. We are
interested in the conditional mean E(Yt|Xt) using an observed realization (i.e., a data
set) of the random sample {Yt, Xt'}', t = 1, ..., n.
Notation:
Throughout this book, we set K ≡ k + 1, the number of regressors, which consists of
k economic variables and an intercept. The index t may denote an individual unit (e.g.,
a firm, a household, a country) for cross-sectional data, or a time period (e.g.,
day, week, month, year) in a time series context.

We first list and discuss the assumptions of the classical linear regression theory.
Assumption 3.1 [Linearity]:

    Yt = Xt'β^o + εt,   t = 1, ..., n,

where β^o is a K × 1 unknown parameter vector, and εt is an unobservable disturbance.


Remarks:
In Assumption 3.1, Yt is the dependent variable (or regressand), Xt is the vector of
regressors (or independent variables, or explanatory variables), and β^o is the regression
coefficient vector. When the linear model is correctly specified for the conditional mean
E(Yt|Xt), i.e., when E(εt|Xt) = 0, the parameter β^o = (∂/∂Xt) E(Yt|Xt) can be interpreted
as the marginal effect of Xt on Yt.
The key notion of linearity in the classical linear regression model is that the
regression model is linear in β^o rather than in Xt. In other words, linear regression models
cover some models for Yt which have a nonlinear relationship with Xt.

Question: Does Assumption 3.1 imply a causal relationship from Xt to Yt?

Not necessarily. As Kendall and Stuart (1961, Vol. 2, Ch. 26, p. 279) point out,
"a statistical relationship, however strong and however suggestive, can never establish
causal connection. Our ideas of causation must come from outside statistics, ultimately
from some theory or other." Assumption 3.1 only implies a predictive relationship: given
Xt, can we predict Yt linearly?

Denote

    Y = (Y1, ..., Yn)',   n × 1,
    ε = (ε1, ..., εn)',   n × 1,
    X = (X1, ..., Xn)',   n × K,

where the t-th row of X is Xt' = (1, X_{1t}, ..., X_{kt}). With these matrix notations, we have
a compact expression for Assumption 3.1:

    Y = Xβ^o + ε,
    (n × 1) = (n × K)(K × 1) + (n × 1).

The second assumption is a strict exogeneity condition.

Assumption 3.2 [Strict Exogeneity]:

    E(εt|X) = E(εt|X1, ..., Xt, ..., Xn) = 0,   t = 1, ..., n.

Remarks:
Among other things, Assumption 3.2 implies correct model specification for E(Yt|Xt).
This is because Assumption 3.2 implies E(εt|Xt) = 0, by taking a further conditional
expectation given Xt. It also implies E(εt) = 0 by the law of iterated expectations.
Under Assumption 3.2, we have E(Xs εt) = 0 for any (t, s), where t, s ∈ {1, ..., n}.
This follows because

    E(Xs εt) = E[E(Xs εt|X)]
             = E[Xs E(εt|X)]
             = E(Xs · 0)
             = 0.

These two results imply cov(Xs, εt) = 0 for all t, s ∈ {1, ..., n}.
Because X contains the regressors {Xs} for both s ≤ t and s > t, Assumption 3.2
essentially requires that the error εt not depend on the past and future values of the
regressors when t is a time index. This rules out dynamic time series models for which
εt may be correlated with future values of the regressors (because the future values of
the regressors depend on current shocks), as illustrated in the following example.

Example 1: Consider a so-called AutoRegressive AR(1) model:

    Yt = β0 + β1 Y_{t-1} + εt,   t = 1, ..., n,
       = Xt'β + εt,
    {εt} ~ i.i.d.(0, σ²),

where Xt = (1, Y_{t-1})'. This is a dynamic regression model because the term β1 Y_{t-1}
represents the "memory" or "feedback" of the past into the present value of the process,
which induces a correlation between Yt and the past. The term autoregression refers to
the regression of Yt on its own past values. The parameter β1 determines the amount of
feedback, with a larger absolute value of β1 resulting in more feedback. The disturbance
εt can be viewed as representing the effect of "new information" that is revealed at time
t. Information that is truly new cannot be anticipated, so the effects of today's new
information should be unrelated to the effects of yesterday's news in the sense that
E(εt|Xt) = 0. Here, we make a stronger assumption that we can model the effect of new
information as an i.i.d.(0, σ²) sequence.
Obviously, E(Xt εt) = E(Xt)E(εt) = 0 but E(X_{t+1} εt) ≠ 0. Thus, we have E(εt|X) ≠ 0,
and so Assumption 3.2 does not hold. Here, the lagged dependent variable Y_{t-1} in the
regressor vector Xt is called a predetermined variable, since it is orthogonal to εt but
depends on the past history of {εt}.
In Chapter 5 later, we will consider linear regression models with dependent observations,
which will include this example as a special case. In fact, the main reason for
imposing Assumption 3.2 is to obtain a finite sample distribution theory. For a large
sample theory (i.e., an asymptotic theory), the strict exogeneity condition will not be needed.

In econometrics, there are some alternative definitions of strict exogeneity. For example,
one definition assumes that εt and X are independent. Another assumes that X is
nonstochastic. Both rule out conditional heteroskedasticity (i.e., var(εt|X) depending
on X). In Assumption 3.2, we still allow for conditional heteroskedasticity, because we
do not assume that εt and X are independent; we only assume that the conditional
mean E(εt|X) does not depend on X.

Question: What happens to Assumption 3.2 if X is nonstochastic?

If X is nonstochastic, Assumption 3.2 becomes

    E(εt|X) = E(εt) = 0.

An example of nonstochastic X is Xt = (1, t, ..., t^k)'. This corresponds to a time-trend
regression model

    Yt = Xt'β^o + εt = Σ_{j=0}^{k} βj^o t^j + εt.

Question: What happens to Assumption 3.2 if Zt = (Yt, Xt')' is an independent random
sample (i.e., Zt and Zs are independent whenever t ≠ s, although Yt and Xt may not be
independent)?

When {Zt} is i.i.d., Assumption 3.2 becomes

    E(εt|X) = E(εt|X1, X2, ..., Xt, ..., Xn) = E(εt|Xt) = 0.

In other words, when {Zt} is i.i.d., E(εt|X) = 0 is equivalent to E(εt|Xt) = 0.

Assumption 3.3 [Nonsingularity]: (a) The K × K square matrix X'X = Σ_{t=1}^n X_t X_t' is nonsingular, and (b)

λ_min(X'X) → ∞ as n → ∞

with probability one.

Remarks:
Assumption 3.3(a) rules out multicollinearity among the K = k + 1 regressors in X_t. We say that there exists multicollinearity (sometimes called exact or perfect multicollinearity in the literature) among the regressors if, for all t ∈ {1, ..., n}, the variable X_{jt} for some j ∈ {0, 1, ..., k} is a linear combination of the other K − 1 column variables {X_{it}, i ≠ j}. In this case, the matrix X'X is singular, and as a consequence, the true model parameter β^o in Assumption 3.1 is not identifiable.

The nonsingularity of X'X implies that X must be of full column rank K = k + 1. Thus, we need K ≤ n; that is, the number of regressors cannot be larger than the sample size. This is a necessary condition for identification of the parameter β^o.

The eigenvalues of a square matrix A are characterized by the characteristic equation

det(A − λI) = 0,

where det(·) denotes the determinant of a square matrix, and I is an identity matrix with the same dimension as A.
It is well known that eigenvalues can be used to summarize the information contained in a matrix (recall the popular principal component analysis). Assumption 3.3(b) implies that new information must become available as the sample size n → ∞ (i.e., X_t should not merely repeat the same values as t increases).
Intuitively, if there is no variation in the values of X_t, it will be difficult to determine the relationship between Y_t and X_t (indeed, the purpose of classical linear regression is to investigate how a change in X causes a change in Y). In a certain sense, one may call X'X the "information matrix" of the random sample X because it is a measure of the information contained in X. The magnitude of X'X will affect the precision of parameter estimation for β^o. Indeed, as will be shown below, the condition that λ_min(X'X) → ∞ as n → ∞ ensures that the variance of the OLS estimator vanishes to zero as n → ∞. This rules out a possibility called near-multicollinearity, in which there exists an approximate linear relationship among the sample values of the explanatory variables in X_t, so that although X'X is nonsingular, its minimum eigenvalue λ_min(X'X) does not grow with the sample size n. When λ_min(X'X) does not grow with n, the OLS estimator is well-defined and has a well-behaved finite sample distribution, but its variance never vanishes to zero as n → ∞. In other words, in the near-multicollinearity case where λ_min(X'X) does not grow with n, the OLS estimator will never converge to the true parameter value β^o, although it will still have a well-defined finite sample distribution.

Question: Why can the eigenvalues be used as a measure of the information contained in X'X?

Assumption 3.4 [Spherical Error Variance]:

(a) [conditional homoskedasticity]:

E(ε_t²|X) = σ² > 0,  t = 1, ..., n;

(b) [conditional non-autocorrelation]:

E(ε_t ε_s|X) = 0,  t ≠ s,  t, s ∈ {1, ..., n}.

Remarks:
We can write Assumption 3.4 compactly as

E(ε_t ε_s|X) = σ² δ_{ts},

where δ_{ts} = 1 if t = s and δ_{ts} = 0 otherwise. In mathematics, δ_{ts} is called the Kronecker delta function. Under this assumption, we have

var(ε_t|X) = E(ε_t²|X) − [E(ε_t|X)]²
           = E(ε_t²|X)
           = σ²

and

cov(ε_t, ε_s|X) = E(ε_t ε_s|X) = 0 for all t ≠ s.
By the law of iterated expectations, Assumption 3.4(a) implies that var(ε_t) = σ² for all t = 1, ..., n, the so-called unconditional homoskedasticity. Similarly, Assumption 3.4(b) implies cov(ε_t, ε_s) = 0 for all t ≠ s. Thus, there exists no serial correlation between ε_t and its lagged values when t is an index for time, and there exists no spatial correlation between the disturbances associated with different cross-sectional units when t is an index for the cross-sectional unit (e.g., consumer, firm, household, etc.).
Assumption 3.4 does not imply that ε_t and X are independent. It allows the possibility that the conditional higher order moments (e.g., skewness and kurtosis) of ε_t depend on X.
We can write Assumptions 3.2 and 3.4 compactly as follows:

E(ε|X) = 0 and E(εε'|X) = σ² I,

where I ≡ I_n is an n × n identity matrix.

3.2 OLS Estimation


Question: How to estimate β^o using an observed data set generated from the random sample {Z_t}_{t=1}^n, where Z_t = (Y_t, X_t')'?
Definition 3.1 [OLS Estimator]: Suppose Assumptions 3.1 and 3.3(a) hold. Define the sum of squared residuals (SSR) of the linear regression model Y_t = X_t'β + u_t as

SSR(β) ≡ (Y − Xβ)'(Y − Xβ)
       = Σ_{t=1}^n (Y_t − X_t'β)².

Then the Ordinary Least Squares (OLS) estimator β̂ is the solution to

β̂ = arg min_{β ∈ R^K} SSR(β).

Note that SSR(β) is the sum of squared model errors {u_t = Y_t − X_t'β}, with equal weighting for each t.

Theorem 3.1 [Existence of OLS]: Under Assumptions 3.1 and 3.3, the OLS estimator β̂ exists and

β̂ = (X'X)^{-1} X'Y
   = ( n^{-1} Σ_{t=1}^n X_t X_t' )^{-1} ( n^{-1} Σ_{t=1}^n X_t Y_t ).

The last expression will be useful for our asymptotic analysis in subsequent chapters.

Proof: Using the formula that for a K × 1 vector A and a K × 1 vector β, the derivative

∂(A'β)/∂β = A,

we have

dSSR(β)/dβ = (d/dβ) Σ_{t=1}^n (Y_t − X_t'β)²
           = Σ_{t=1}^n (∂/∂β)(Y_t − X_t'β)²
           = Σ_{t=1}^n 2(Y_t − X_t'β)(∂/∂β)(Y_t − X_t'β)
           = −2 Σ_{t=1}^n X_t (Y_t − X_t'β)
           = −2X'(Y − Xβ).

The OLS estimator must satisfy the FOC:

−2X'(Y − Xβ̂) = 0,
X'(Y − Xβ̂) = 0,
X'Y − (X'X)β̂ = 0.

It follows that
(X'X)β̂ = X'Y.
By Assumption 3.3, X'X is nonsingular. Thus,

β̂ = (X'X)^{-1} X'Y.

Checking the SOC, we have the K × K Hessian matrix

∂²SSR(β)/∂β∂β' = −2 Σ_{t=1}^n (∂/∂β')[(Y_t − X_t'β)X_t]
              = 2X'X,

which is positive definite given λ_min(X'X) > 0. Thus, β̂ is a global minimizer. Note that for the existence of β̂, we only need X'X to be nonsingular (i.e., λ_min(X'X) > 0), which is implied by the condition λ_min(X'X) → ∞ as n → ∞; the existence result itself does not require λ_min(X'X) → ∞ as n → ∞. This completes the proof.
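As a quick numerical companion to Theorem 3.1, the following minimal Python sketch (with simulated, hypothetical data) computes the OLS estimator from the closed form β̂ = (X'X)^{-1}X'Y and confirms that the FOC orthogonality X'e = 0 holds.

```python
# A minimal sketch (simulated, hypothetical data) of the closed-form OLS estimator
# beta_hat = (X'X)^{-1} X'Y and a check of the first-order condition X'e = 0.
import numpy as np

rng = np.random.default_rng(42)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])  # intercept + k regressors
beta_true = np.array([1.0, 2.0, -0.5])
eps = rng.standard_normal(n)
Y = X @ beta_true + eps

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ Y)   # solve X'X b = X'Y rather than forming the inverse
e = Y - X @ beta_hat                       # estimated residuals

print("beta_hat:", beta_hat)
print("X'e (should be ~0):", X.T @ e)
```

Solving the linear system X'Xβ = X'Y is numerically preferable to explicitly inverting X'X, although both implement the same estimator.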

Remarks:
Suppose Z_t = (Y_t, X_t')', t = 1, ..., n, is an independent and identically distributed (i.i.d.) random sample of size n. Consider the sum of squared residuals scaled by n^{-1}:

SSR(β)/n = n^{-1} Σ_{t=1}^n (Y_t − X_t'β)²

and its minimizer

β̂ = ( n^{-1} Σ_{t=1}^n X_t X_t' )^{-1} ( n^{-1} Σ_{t=1}^n X_t Y_t ).

These are the sample analogs of the population MSE criterion

MSE(β) = E(Y_t − X_t'β)²

and its minimizer

β* = [E(X_t X_t')]^{-1} E(X_t Y_t).

That is, SSR(β), after being scaled by n^{-1}, is the sample analogue of MSE(β), and the OLS estimator β̂ is the sample analogue of the best least squares approximation coefficient β*.
Put Ŷ_t ≡ X_t'β̂. This is called the fitted value (or predicted value) for observation Y_t, and e_t ≡ Y_t − Ŷ_t is the estimated residual (or prediction error) for observation Y_t. Note that

e_t = Y_t − Ŷ_t
    = (X_t'β^o + ε_t) − X_t'β̂
    = ε_t − X_t'(β̂ − β^o),

where ε_t is the unavoidable true disturbance, and X_t'(β̂ − β^o) is an estimation error, which becomes smaller when a larger data set is available (so that β̂ becomes closer to β^o).

The FOC implies that the estimated residual vector e = Y − Xβ̂ is orthogonal to the regressors X in the sense that

X'e = Σ_{t=1}^n X_t e_t = 0.

This is a consequence of the very nature of OLS, as implied by the FOC of min_{β ∈ R^K} SSR(β). It always holds regardless of whether E(ε_t|X) = 0 (recall that we do not impose Assumption 3.2 in the theorem above). Note that if X_t contains an intercept, then X'e = 0 implies Σ_{t=1}^n e_t = 0.

Some useful identities

To investigate the statistical properties of β̂, we first state some useful lemmas.

Lemma 3.2: Under Assumptions 3.1 and 3.3(a), we have:

(i)
X'e = 0;
(ii)
β̂ − β^o = (X'X)^{-1} X'ε;
(iii) Define the n × n projection matrix

P = X(X'X)^{-1} X'

and
M = I_n − P.

Then both P and M are symmetric (i.e., P = P' and M = M') and idempotent (i.e., P² = P, M² = M), with

PX = X,
MX = 0.

(iv)
SSR(β̂) = e'e = Y'MY = ε'Mε.
Proof: (i) The result follows immediately from the FOC of the OLS estimator.
(ii) Because β̂ = (X'X)^{-1}X'Y and Y = Xβ^o + ε, we have

β̂ − β^o = (X'X)^{-1}X'(Xβ^o + ε) − β^o
        = (X'X)^{-1}X'ε.

(iii) P is idempotent because

P² = PP
   = [X(X'X)^{-1}X'][X(X'X)^{-1}X']
   = X(X'X)^{-1}X'
   = P.

Similarly we can show M² = M.

(iv) By the definition of M, we have

e = Y − Xβ̂
  = Y − X(X'X)^{-1}X'Y
  = [I − X(X'X)^{-1}X']Y
  = MY
  = M(Xβ^o + ε)
  = MXβ^o + Mε
  = Mε,

given MX = 0. It follows that

SSR(β̂) = e'e
       = (Mε)'(Mε)
       = ε'M²ε
       = ε'Mε,

where the last equality follows from M² = M.

3.3 Goodness of Fit and Model Selection Criteria


Question: How well does the linear regression model fit the data? That is, how well does the linear regression model explain the variation of the observed data {Y_t}_{t=1}^n?

We need some criteria or measures to characterize goodness of fit.

We first introduce two measures of goodness of fit. The first measure is called the uncentered squared multi-correlation coefficient R².

Definition 3.2 [Uncentered R²]: The uncentered squared multi-correlation coefficient is defined as

R²_uc = Ŷ'Ŷ / Y'Y = 1 − e'e / Y'Y,

where the second equality follows from the first order condition of the OLS estimation.

Remarks:
The measure R²_uc has a nice interpretation: it is the proportion of the uncentered sample quadratic variation in the dependent variable {Y_t} that can be attributed to the uncentered sample quadratic variation of the fitted values {Ŷ_t}. Note that we always have 0 ≤ R²_uc ≤ 1.

Next, we define a closely related measure called the centered R².

Definition 3.3 [Centered R²: Coefficient of Determination]: The coefficient of determination is

R² ≡ 1 − Σ_{t=1}^n e_t² / Σ_{t=1}^n (Y_t − Ȳ)²,

where Ȳ = n^{-1} Σ_{t=1}^n Y_t is the sample mean.

Remarks:

When X_t contains an intercept, we have the following orthogonal decomposition:

Σ_{t=1}^n (Y_t − Ȳ)² = Σ_{t=1}^n (Ŷ_t − Ȳ + Y_t − Ŷ_t)²
                     = Σ_{t=1}^n (Ŷ_t − Ȳ)² + Σ_{t=1}^n e_t² + 2 Σ_{t=1}^n (Ŷ_t − Ȳ)e_t
                     = Σ_{t=1}^n (Ŷ_t − Ȳ)² + Σ_{t=1}^n e_t²,

where the cross-product term

Σ_{t=1}^n (Ŷ_t − Ȳ)e_t = Σ_{t=1}^n Ŷ_t e_t − Ȳ Σ_{t=1}^n e_t
                       = β̂' Σ_{t=1}^n X_t e_t − Ȳ Σ_{t=1}^n e_t
                       = β̂'(X'e) − Ȳ · 0
                       = β̂' · 0 − 0
                       = 0,

where we have made use of the facts that X'e = 0 and Σ_{t=1}^n e_t = 0 from the FOC of the OLS estimation and the fact that X_t contains an intercept (i.e., X_{0t} = 1). It follows that

R² ≡ 1 − e'e / Σ_{t=1}^n (Y_t − Ȳ)²
   = [Σ_{t=1}^n (Y_t − Ȳ)² − Σ_{t=1}^n e_t²] / Σ_{t=1}^n (Y_t − Ȳ)²
   = Σ_{t=1}^n (Ŷ_t − Ȳ)² / Σ_{t=1}^n (Y_t − Ȳ)²,

and consequently we have
0 ≤ R² ≤ 1.
Question: Can R² be negative?

Yes, it is possible! If X_t does not contain an intercept, then the orthogonal decomposition identity

Σ_{t=1}^n (Y_t − Ȳ)² = Σ_{t=1}^n (Ŷ_t − Ȳ)² + Σ_{t=1}^n e_t²

no longer holds. As a consequence, R² may be negative when there is no intercept! This is because the cross-product term

2 Σ_{t=1}^n (Ŷ_t − Ȳ)e_t

may be negative.
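The following minimal Python sketch (simulated, hypothetical data) computes the centered R² from its definition and illustrates that it can indeed fall below zero when the regression omits the intercept.

```python
# A minimal sketch (simulated data) of the centered R^2, with and without an intercept.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.standard_normal(n)
y = 5.0 + 0.1 * x + rng.standard_normal(n)   # large intercept, weak slope

def centered_r2(X, y):
    """Centered R^2 = 1 - SSR / total sum of squares around the sample mean."""
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta_hat
    return 1.0 - (e @ e) / np.sum((y - y.mean()) ** 2)

X_with = np.column_stack([np.ones(n), x])     # regression with an intercept
X_without = x.reshape(-1, 1)                  # regression without an intercept

print("R^2 with intercept   :", centered_r2(X_with, y))     # between 0 and 1
print("R^2 without intercept:", centered_r2(X_without, y))  # can be negative
```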

When X_t contains an intercept, the centered R² has a similar interpretation to the uncentered R²_uc: R² measures the proportion of the sample variance of {Y_t}_{t=1}^n that can be explained by the linear predictor of X_t.

Example 1 [Capital Asset Pricing Model (CAPM)]: The classical CAPM is characterized by the equation

r_pt − r_ft = α_p + β_p (r_mt − r_ft) + ε_pt,  t = 1, ..., n,

where r_pt is the return on portfolio (or asset) p, r_ft is the return on a risk-free asset, and r_mt is the return on the market portfolio. Here, r_pt − r_ft is the risk premium of portfolio p, r_mt − r_ft is the risk premium of the market portfolio, which is the only systematic market risk factor, and ε_pt is the individual-specific risk, which can be eliminated by diversification if the ε_pt are uncorrelated across different assets. In this model, R² has an interesting economic interpretation: it is the proportion of the risk of portfolio p (as measured by the sample variance of its risk premium r_pt − r_ft) that is attributed to the market risk factor (r_mt − r_ft). In contrast, 1 − R² is the proportion of the risk of portfolio p that is contributed by the individual-specific risk factor ε_pt.

For any given random sample {Y_t, X_t'}', t = 1, ..., n, R² is nondecreasing in the number of explanatory variables in X_t. In other words, the more explanatory variables are added to the linear regression, the higher R² is. This is true regardless of whether X_t has any true explanatory power for Y_t.

Theorem 3.3: Suppose {Y_t, X_{1t}, ..., X_{(k+q)t}}', t = 1, ..., n, is a random sample, and Assumptions 3.1 and 3.3(a) hold. Let R²_1 be the centered R² from the linear regression

Y_t = X_t'β + u_t,

where X_t = (1, X_{1t}, ..., X_{kt})' and β is a K × 1 parameter vector; also, let R²_2 be the centered R² from the extended linear regression

Y_t = X̃_t'γ + v_t,

where X̃_t = (1, X_{1t}, ..., X_{kt}, X_{(k+1)t}, ..., X_{(k+q)t})' and γ is a (K + q) × 1 parameter vector. Then R²_2 ≥ R²_1.

Proof: By definition, we have

R²_1 = 1 − e'e / Σ_{t=1}^n (Y_t − Ȳ)²,
R²_2 = 1 − ẽ'ẽ / Σ_{t=1}^n (Y_t − Ȳ)²,

where e is the estimated residual vector from the regression of Y on X, and ẽ is the estimated residual vector from the regression of Y on X̃. It suffices to show ẽ'ẽ ≤ e'e. Because the OLS estimator γ̂ = (X̃'X̃)^{-1}X̃'Y minimizes the SSR of the extended model, we have

ẽ'ẽ = Σ_{t=1}^n (Y_t − X̃_t'γ̂)² ≤ Σ_{t=1}^n (Y_t − X̃_t'γ)²  for all γ ∈ R^{K+q}.

Now we choose

γ = (β̂', 0')',

where β̂ = (X'X)^{-1}X'Y is the OLS estimator from the first regression. It follows that

ẽ'ẽ ≤ Σ_{t=1}^n ( Y_t − Σ_{j=0}^k β̂_j X_{jt} − Σ_{j=k+1}^{k+q} 0 · X_{jt} )²
    = Σ_{t=1}^n (Y_t − X_t'β̂)²
    = e'e.

Hence, we have R²_1 ≤ R²_2. This completes the proof.

Question: What is the implication of this theorem?

The measure R² can be used to compare models with the same number of predictors, but it is not a useful criterion for comparing models of different sizes, because it is biased in favor of large models.
The measure R² is not a suitable criterion for judging correct model specification. It is a measure of sample variation rather than a population property. A high value of R² does not necessarily imply correct model specification, and correct model specification does not necessarily imply a high value of R².
Strictly speaking, R² is merely a measure of association and says nothing about causality. High values of R² are often very easy to achieve when dealing with economic time series data, even when the causal link between two variables is extremely tenuous or perhaps nonexistent. For example, in spurious regressions, where the dependent variable Y_t and the regressors X_t have no causal relationship but display similar trending behaviors over time, it is often found that R² is close to unity.

Finally, R² is a measure of the strength of linear association between the dependent variable Y_t and the regressors X_t (see Exercise 3.2). It is not a suitable measure of goodness of fit for a nonlinear regression model in which E(Y_t|X_t) is a nonlinear function of X_t.
Question: How to interpret R² for the linear regression model

ln Y_t = β_0 + β_1 ln L_t + β_2 ln K_t + ε_t,

where Y_t is output, L_t is labor and K_t is capital?

Answer: R² is the proportion of the total sample variation in ln Y_t that can be attributed to the sample variations in ln L_t and ln K_t. It is not the proportion of the sample variation in Y_t that can be attributed to the sample variations of L_t and K_t.

Question: Does a high R² value imply a precise estimation for β^o?

Two popular model selection criteria


Often, a large number of potential predictors are available, but we do not necessarily want to include all of them. There are two conflicting factors to consider: on one hand, a larger model has less systematic bias and would give the best predictions if all parameters could be estimated without error. On the other hand, when unknown parameters are replaced by estimates, the prediction becomes less accurate, and this effect is worse when there are more parameters to estimate. An important idea in statistics is to use a simple model to capture as much of the essential information contained in the data as possible. This is often called the KISS principle, namely "Keep It Sophisticatedly Simple"!

Below, we introduce two popular model selection criteria that reflect this idea.

Akaike Information Criterion [AIC]:

A linear regression model can be selected by minimizing the following AIC criterion with a suitable choice of K:

AIC = ln(s²) + 2K/n
      [goodness of fit] + [model complexity],

where
s² = e'e/(n − K)
is the residual variance estimator for E(ε_t²) = σ², and K = k + 1 is the number of regressors. AIC was proposed by Akaike (1973).

Bayesian Information Criterion [BIC, Schwarz (1978)]:

A linear regression model can be selected by minimizing the following criterion with a suitable choice of K:

BIC = ln(s²) + K ln(n)/n.

This is called the Bayesian information criterion (BIC), proposed by Schwarz (1978).

Both AIC and BIC trade off the goodness of fit to the data, measured by ln(s²), against the desire to use as few parameters as possible. When ln n ≥ 2, which is the case when n > 7, BIC gives a heavier penalty for model complexity than AIC, where complexity is measured by the number of estimated parameters (relative to the sample size n). As a consequence, BIC will choose a more parsimonious linear regression model than AIC.

The difference between AIC and BIC is due to the way they are constructed. AIC is designed to select a model that will predict best and is less concerned than BIC with having a few too many parameters. BIC is designed to select the true value of K exactly. Under certain regularity conditions, BIC is strongly consistent in the sense that it determines the true model asymptotically (i.e., as n → ∞), whereas with AIC an overparameterized model will emerge no matter how large the sample is. Of course, such properties are not necessarily guaranteed in finite samples. In practice, the best AIC model is usually close to the best BIC model, and often they deliver the same model.

In addition to AIC and BIC, there are other criteria, such as the adjusted R², denoted R̄², that can also be used to select a linear regression model. The adjusted R² is defined as

R̄² = 1 − [e'e/(n − K)] / [(Y − Ȳ)'(Y − Ȳ)/(n − 1)].

This differs from

R² = 1 − e'e / [(Y − Ȳ)'(Y − Ȳ)].

In R̄², the adjustment is made according to the degrees of freedom, or the number of explanatory variables in X_t. It may be shown that

R̄² = 1 − (1 − R²)(n − 1)/(n − K).

We note that R̄² may take a negative value even when there is an intercept in X_t.
All model selection criteria are structured in terms of the estimated residual variance plus a penalty adjustment involving the number of estimated parameters, and it is in the extent of this penalty that the criteria differ. For more discussion of these and other selection criteria, see Judge et al. (1985, Section 7.5).
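As an illustration, the following minimal Python sketch (simulated, hypothetical data) computes AIC and BIC as defined above for a sequence of nested linear regressions and reports the values, so that the model minimizing each criterion can be identified.

```python
# A minimal sketch (simulated data) of model selection by the AIC and BIC
# formulas used in these notes: ln(s^2) + 2K/n and ln(s^2) + K*ln(n)/n,
# with s^2 = e'e/(n - K).
import numpy as np

rng = np.random.default_rng(7)
n = 200
Z = rng.standard_normal((n, 5))                                   # five candidate regressors
y = 1.0 + 2.0 * Z[:, 0] - 1.0 * Z[:, 1] + rng.standard_normal(n)  # only two of them matter

def aic_bic(X, y):
    n, K = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta_hat
    s2 = (e @ e) / (n - K)
    return np.log(s2) + 2 * K / n, np.log(s2) + K * np.log(n) / n

for k in range(1, 6):                                             # nested models with k regressors
    X = np.column_stack([np.ones(n), Z[:, :k]])
    aic, bic = aic_bic(X, y)
    print(f"k = {k}: AIC = {aic:.4f}, BIC = {bic:.4f}")
```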

Question: Why is it not good practice to use a complicated model?

A complicated model contains many unknown parameters. Given a fixed amount of data information, parameter estimation becomes less precise when more parameters have to be estimated. As a consequence, the out-of-sample forecasts for Y_t may become less precise than the forecasts of a simpler model; the latter may have a larger bias but more precise parameter estimates. Intuitively, a complicated model is too flexible in the sense that it may capture not only systematic components but also features of the data that will not show up again. Thus, it cannot forecast the future well.

3.4 Consistency and Efficiency of OLS


We now investigate the statistical properties of β̂. We are interested in addressing the following basic questions:

Is β̂ a good estimator for β^o (consistency)?

Is β̂ the best estimator (efficiency)?

What is the sampling distribution of β̂ (normality)?

Question: What is the sampling distribution of β̂?

The distribution of β̂ is called the sampling distribution of β̂, because β̂ is a function of the random sample {Z_t}_{t=1}^n, where Z_t = (Y_t, X_t')'.

The sampling distribution of β̂ is useful for any statistical inference involving β̂, such as confidence interval estimation and hypothesis testing.

We first investigate the statistical properties of β̂.

Theorem 3.4: Suppose Assumptions 3.1-3.3(a) and 3.4 hold. Then

(i) [Unbiasedness] E(β̂|X) = β^o and E(β̂) = β^o.
(ii) [Vanishing Variance]

var(β̂|X) = E[(β̂ − E β̂)(β̂ − E β̂)'|X]
          = σ²(X'X)^{-1}.

If in addition Assumption 3.3(b) holds, then for any K × 1 vector τ such that τ'τ = 1, we have

τ' var(β̂|X) τ → 0 as n → ∞.

(iii) [Orthogonality between e and β̂]

cov(β̂, e|X) = E{[β̂ − E(β̂|X)]e'|X} = 0.

(iv) [Gauss-Markov]

var(b̂|X) − var(β̂|X) is positive semi-definite (p.s.d.)

for any unbiased estimator b̂ that is linear in Y with E(b̂|X) = β^o.
(v) [Residual variance estimator]

s² = e'e/(n − K) = (n − K)^{-1} Σ_{t=1}^n e_t²

is unbiased for σ² = E(ε_t²). That is, E(s²|X) = σ².
:

Proof: (i) Given ^ o


= (X0 X) 1 X0 "; we have

E[( ^ o
)jX] = E[(X0 X) 1 X0 "jX]
= (X0 X) 1 X0 E("jX)
= (X0 X) 1 X0 0
= 0:

(ii) Given ^ o
= (X0 X) 1 X0 " and E(""0 jX) = 2 I; we have
h i
var( ^ jX) E ( ^ E ^ )( ^ E ^ )0 jX
h i
= E ( ^ o ^
)( o 0
) jX
= E[(X0 X) 1 X0 ""0 X(X0 X) 1 jX]
= (X0 X) 1 X0 E(""0 jX)X(X0 X) 1

= (X0 X) 1 X0 2
IX(X0 X) 1

2
= (X0 X) 1 X0 X(X0 X) 1

2
= (X0 X) 1 :

18
2
Note that Assumption 3.4 is crucial here to obtain the expression of (X0 X) 1
for
var( ^ jX): Moreover, for any 2 RK such that 0 = 1; we have

0
var( ^ jX) = 2 0
(X0 X) 1

2 0 1
max [(X X) ]
2 1 0
= min (X X)

! 0

given min (X0 X) ! 1 as n ! 1 with probability one. Note that the condition that
0 ^
min (X X) ! 1 ensures that var( jX) vanishes to zero as n ! 1:
(iii) Given ^ o
= (X0 X) 1 X0 "; e = Y X ^ = M Y = M " (since M X = 0); and
E(e) = 0; we have
h i
cov( ^ ; ejX) = E ( ^ E ^ )(e Ee)0 jX
h i
= E (^ o 0
)e jX
= E[(X0 X) 1 X0 ""0 M jX]
= (X0 X) 1 X0 E(""0 jX)M
= (X0 X) 1 X0 2
IM
2
= (X0 X) 1 X0 M
= 0:

Again, Assumption 3.4 plays a crucial role in ensuring zero correlation between ^
and e:
(iv) Consider a linear estimator

b̂ = C'Y,

where C = C(X) is an n × K matrix depending on X. It is unbiased for β^o regardless of the value of β^o if and only if

E(b̂|X) = C'Xβ^o + C'E(ε|X) = C'Xβ^o = β^o.

This holds if and only if

C'X = I.

Because

b̂ = C'Y = C'(Xβ^o + ε) = C'Xβ^o + C'ε = β^o + C'ε,

the variance of b̂ is

var(b̂|X) = E[(b̂ − β^o)(b̂ − β^o)'|X]
          = E[C'εε'C|X]
          = C'E(εε'|X)C
          = σ²C'C.

Using C'X = I, we now have

var(b̂|X) − var(β̂|X) = σ²C'C − σ²(X'X)^{-1}
                     = σ²[C'C − C'X(X'X)^{-1}X'C]
                     = σ²C'[I − X(X'X)^{-1}X']C
                     = σ²C'MC
                     = σ²C'MMC
                     = σ²C'M'MC
                     = σ²(MC)'(MC)
                     = σ²D'D, where D = MC,

which is p.s.d., where we have used the fact that for any real-valued matrix D, the squared matrix D'D is always p.s.d. [Question: How can we show this?]

(v) Now we show E[e'e/(n − K)|X] = σ². Because e'e = ε'Mε and tr(AB) = tr(BA), we have

E(e'e|X) = E(ε'Mε|X)
         = E[tr(ε'Mε)|X]
           [putting A = ε'M, B = ε]
         = E[tr(εε'M)|X]
         = tr[E(εε'|X)M]
         = tr(σ²IM)
         = σ² tr(M)
         = σ²(n − K),

where

tr(M) = tr(I_n) − tr[X(X'X)^{-1}X']
      = tr(I_n) − tr[(X'X)^{-1}X'X]
      = n − K,

using tr(AB) = tr(BA) again. It follows that

E(s²|X) = E(e'e|X)/(n − K)
        = σ²(n − K)/(n − K)
        = σ².

This completes the proof.
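A small Monte Carlo sketch in Python (hypothetical design and parameter values) can be used to corroborate parts (i) and (v) of Theorem 3.4: across many replications, the averages of β̂ and s² should be close to β^o and σ².

```python
# A minimal Monte Carlo sketch (hypothetical design) checking E(beta_hat) = beta_o
# and E(s^2) = sigma^2 under the classical assumptions.
import numpy as np

rng = np.random.default_rng(123)
n, reps = 50, 5000
beta_o = np.array([1.0, 0.5])
sigma2 = 4.0
X = np.column_stack([np.ones(n), rng.standard_normal(n)])  # keep X fixed across replications

beta_hats, s2s = [], []
for _ in range(reps):
    eps = np.sqrt(sigma2) * rng.standard_normal(n)
    Y = X @ beta_o + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ beta_hat
    beta_hats.append(beta_hat)
    s2s.append((e @ e) / (n - X.shape[1]))

print("average beta_hat:", np.mean(beta_hats, axis=0))  # close to (1.0, 0.5)
print("average s^2     :", np.mean(s2s))                # close to 4.0
```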

Remarks:
Theorem 3.4 (i) and (ii) together imply that the conditional MSE of β̂

MSE(β̂|X) = E[(β̂ − β^o)(β̂ − β^o)'|X]
          = var(β̂|X) + Bias(β̂|X) Bias(β̂|X)'
          = var(β̂|X)
          → 0 as n → ∞,

where we have used the fact that

Bias(β̂|X) ≡ E(β̂|X) − β^o = 0.

Recall that MSE measures how close an estimator β̂ is to the target parameter β^o.
Theorem 3.4 (iv) implies that β̂ is the best linear unbiased estimator (BLUE) for β^o, because var(β̂|X) is the smallest among all linear unbiased estimators of β^o.
Formally, we can define a related concept for comparing two unbiased estimators:

Definition 3.4 [Efficiency]: An unbiased estimator β̂ of parameter β^o is more efficient than another unbiased estimator b̂ of β^o if

var(b̂|X) − var(β̂|X) is p.s.d.

When β̂ is more efficient than b̂, we have that for any τ ∈ R^K such that τ'τ = 1,

τ'[var(b̂|X) − var(β̂|X)]τ ≥ 0.

Choosing τ = (1, 0, ..., 0)', for example, we have

var(b̂_1|X) − var(β̂_1|X) ≥ 0.

We note that the OLS estimator β̂ is still BLUE even when there exists near-multicollinearity, where λ_min(X'X) does not grow with the sample size n and var(β̂|X) does not vanish to zero as n → ∞. Near-multicollinearity is essentially a sample or data problem, which we cannot remedy or improve upon when the objective is to estimate the unknown parameter β^o.

3.5 Sampling Distribution of OLS

To obtain the finite sample distribution of β̂, we impose the normality assumption on ε.

Assumption 3.5: ε|X ~ N(0, σ²I).

Remarks:
Assumption 3.5 implies both Assumption 3.2 (E(ε|X) = 0) and Assumption 3.4 (E(εε'|X) = σ²I). Moreover, under Assumption 3.5, the conditional pdf of ε given X is

f(ε|X) = (2πσ²)^{-n/2} exp[−ε'ε/(2σ²)] = f(ε),

which does not depend on X, so the disturbance ε is independent of X. Thus, every conditional moment of ε given X does not depend on X.

The normal distribution is also called the Gaussian distribution, named after the German mathematician and astronomer Carl F. Gauss. It is assumed here so that we can derive the finite sample distributions of β̂ and related statistics, i.e., the distributions of β̂ and related statistics when the sample size n is a finite integer. This assumption may be reasonable for observations that are computed as averages of the outcomes of many repeated experiments, due to the effect of the so-called central limit theorem (CLT). This may occur in physics, for example. In economics, the normality assumption is not always reasonable. For example, many high-frequency financial time series display heavy tails (with kurtosis larger than 3).

Question: What is the sampling distribution of β̂?

We write

β̂ − β^o = (X'X)^{-1}X'ε
         = Σ_{t=1}^n (X'X)^{-1}X_t ε_t
         = Σ_{t=1}^n C_t ε_t,

where the weighting vector

C_t = (X'X)^{-1}X_t

is called the leverage of observation X_t.

Theorem 3.5 [Normality of β̂]: Under Assumptions 3.1, 3.3(a) and 3.5,

(β̂ − β^o)|X ~ N[0, σ²(X'X)^{-1}].

Proof: Conditional on X, β̂ − β^o is a weighted sum of independent normal random variables {ε_t}, and so it is also normally distributed.

We note that the OLS estimator β̂ still has the conditional finite sample normal distribution N(β^o, σ²(X'X)^{-1}) even when there exists near-multicollinearity, where λ_min(X'X) does not grow with the sample size n and var(β̂|X) does not vanish to zero as n → ∞.
The corollary below follows immediately.

Corollary 3.6 [Normality of R(β̂ − β^o)]: Suppose Assumptions 3.1, 3.3(a) and 3.5 hold. Then for any nonstochastic J × K matrix R, we have

R(β̂ − β^o)|X ~ N[0, σ²R(X'X)^{-1}R'].

Proof: Conditional on X, β̂ − β^o is normally distributed. Therefore, conditional on X, the linear combination R(β̂ − β^o) is also normally distributed, with

E[R(β̂ − β^o)|X] = R E[(β̂ − β^o)|X] = R · 0 = 0

and

var[R(β̂ − β^o)|X] = E[R(β̂ − β^o)(R(β̂ − β^o))'|X]
                   = E[R(β̂ − β^o)(β̂ − β^o)'R'|X]
                   = R E[(β̂ − β^o)(β̂ − β^o)'|X] R'
                   = R var(β̂|X) R'
                   = σ²R(X'X)^{-1}R'.

It follows that
R(β̂ − β^o)|X ~ N(0, σ²R(X'X)^{-1}R').

Question: What is the role of the J × K nonstochastic matrix R?

Answer: The J × K matrix R is a selection matrix. For example, when R = (1, 0, ..., 0), we have R(β̂ − β^o) = β̂_0 − β^o_0.

Question: Why would we like to know the sampling distribution of R(β̂ − β^o)?

This is mainly for confidence interval estimation and hypothesis testing.

3.6 Variance Matrix Estimator for OLS

Since var(ε_t) = σ² is unknown, var[R(β̂ − β^o)|X] = σ²R(X'X)^{-1}R' is unknown. We need to estimate σ². We can use the residual variance estimator

s² = e'e/(n − K).

Theorem 3.7 [Residual Variance Estimator]: Suppose Assumptions 3.1, 3.3(a) and 3.5 hold. Then we have, for all n > K, (i)

(n − K)s²/σ² |X = e'e/σ² |X ~ χ²_{n−K},

where χ²_{n−K} denotes the Chi-square distribution with n − K degrees of freedom;
(ii) conditional on X, s² and β̂ are independent.

Proof: (i) Because e = Mε, we have

e'e/σ² = ε'Mε/σ² = (ε/σ)'M(ε/σ).

In addition, because ε|X ~ N(0, σ²I_n), and M is a symmetric idempotent matrix whose rank equals its trace, tr(M) = n − K (as computed in the proof of Theorem 3.4), the quadratic form satisfies

e'e/σ² = ε'Mε/σ² |X ~ χ²_{n−K}

by the following lemma.

Lemma 3.8 [Quadratic form of normal random variables]: If v ~ N(0, I_n) and Q is an n × n nonstochastic symmetric idempotent matrix with rank q ≤ n, then the quadratic form

v'Qv ~ χ²_q.

In our application, we have v = ε/σ ~ N(0, I) conditional on X, and Q = M. Since rank(M) = n − K, we have

e'e/σ² |X ~ χ²_{n−K}.

(ii) Next, we show that s² and β̂ are independent conditional on X. Because s² = e'e/(n − K) is a function of e, it suffices to show that e and β̂ are independent. This follows immediately because, conditional on X, e and β̂ are jointly normally distributed and they are uncorrelated. It is well known that for a joint normal distribution, zero correlation is equivalent to independence.

It remains to show that e and β̂ are jointly normally distributed. For this purpose, we write

[ e        ]   [ M            ]
[ β̂ − β^o ] = [ (X'X)^{-1}X' ] ε.

Because ε|X ~ N(0, σ²I), this linear transformation of ε is also normally distributed conditional on X. It follows that e and β̂ are independent conditional on X, given cov(β̂, e|X) = 0. This completes the proof.

Question: What is a χ²_q distribution?

Definition 3.5 [Chi-square Distribution χ²_q]: Suppose {Z_i}_{i=1}^q are i.i.d. N(0,1) random variables. Then the random variable

χ² = Σ_{i=1}^q Z_i²

follows a χ²_q distribution.

The χ²_q distribution is nonsymmetric and has a long right tail. For a χ²_q random variable, we have E(χ²_q) = q and var(χ²_q) = 2q.
Based on these properties of the χ² distribution, Theorem 3.7(i) implies

E[(n − K)s²/σ² |X] = n − K,

so that E(s²|X) = σ². Note that we have shown this unbiasedness result before by a different method and under a more general condition.
Theorem 3.7(i) also implies

var[(n − K)s²/σ² |X] = 2(n − K),

so that

var(s²|X) = 2σ⁴/(n − K) → 0

as n → ∞.
These results imply that the conditional MSE of s²

MSE(s²|X) = E[(s² − σ²)²|X]
          = var(s²|X) + [E(s²|X) − σ²]²
          → 0.

Thus, s² is a good estimator for σ².

The independence between s² and β̂ (Theorem 3.7(ii)) is crucial for obtaining the sampling distributions of the popular t-test and F-test statistics, which will be introduced shortly.

The sample residual variance s² = e'e/(n − K) is a generalization of the sample variance S²_n = (n − 1)^{-1} Σ_{t=1}^n (Y_t − Ȳ)² for the random sample {Y_t}_{t=1}^n. The factor n − K is called the degrees of freedom of the estimated residual sample {e_t}_{t=1}^n. To gain intuition for why the degrees of freedom equals n − K, note that the original sample {Z_t}_{t=1}^n = {(Y_t, X_t')'}_{t=1}^n has n observations, which can be viewed as having n degrees of freedom. When estimating σ², we have to use the estimated residual sample {e_t}_{t=1}^n. These n estimated residuals are not linearly independent because they must satisfy the FOC of the OLS estimation, namely

X'e = 0,
(K × n)(n × 1) = K × 1.

The FOC imposes K restrictions on {e_t}_{t=1}^n, conditional on X. These K restrictions are needed in order to estimate the K unknown parameters in β^o. They can be used to recover the remaining K estimated residuals {e_{n−K+1}, ..., e_n} from the first n − K estimated residuals {e_1, ..., e_{n−K}} if the latter are available. Thus, the remaining degrees of freedom of e is n − K. Note that the sample variance S²_n is the residual variance estimator for the model Y_t = β^o_0 + ε_t.
Question: Why are these sampling distributions of β̂ and s² useful in practice?

They are useful for confidence interval estimation and hypothesis testing on model parameters. In this book, we will focus on hypothesis testing on model parameters. Statistically speaking, confidence interval estimation and hypothesis testing are just two sides of the same coin.

3.7 Hypothesis Testing

We now use the sampling distributions of β̂ and s² to develop test procedures for hypotheses of interest. We consider testing the linear hypothesis

H_0: Rβ^o = r,
(J × K)(K × 1) = J × 1,

where R is called the selection matrix, and J is the number of restrictions. We assume J ≤ K.

It is important to emphasize that we will test H_0 under correct model specification for E(Y_t|X_t).

Motivation

We first provide a few motivating examples for hypothesis testing.

Example 1 [Reforms have no effect]: Consider the extended production function

ln(Y_t) = β_0 + β_1 ln(L_t) + β_2 ln(K_t) + β_3 AU_t + β_4 PS_t + ε_t,

where AU_t is a dummy variable indicating whether firm t is granted autonomy, and PS_t is the profit share of firm t with the state.
Suppose we are interested in testing whether autonomy AU_t has an effect on productivity. Then we can write the null hypothesis as

H_0^a: β^o_3 = 0.

This is equivalent to the choices of

β^o = (β^o_0, β^o_1, β^o_2, β^o_3, β^o_4)',
R = (0, 0, 0, 1, 0),
r = 0.

If we are interested in testing whether profit sharing has an effect on productivity, we can consider the null hypothesis

H_0^b: β^o_4 = 0.

Alternatively, to test whether the production technology exhibits constant returns to scale (CRS), we can write the null hypothesis as

H_0^c: β^o_1 + β^o_2 = 1.

This is equivalent to the choice of R = (0, 1, 1, 0, 0) and r = 1.

Finally, if we are interested in examining the joint effect of both autonomy and profit sharing, we can test the hypothesis that neither autonomy nor profit sharing has an impact:

H_0^d: β^o_3 = β^o_4 = 0.

This is equivalent to the choice of

R = [0 0 0 1 0; 0 0 0 0 1],  r = (0, 0)'.

Example 2 [Optimal Predictor for Future Spot Exchange Rate]: Consider

S_{t+τ} = β_0 + β_1 F_t(τ) + ε_{t+τ},  t = 1, ..., n,

where S_{t+τ} is the spot exchange rate at period t + τ, and F_t(τ) is the forward exchange rate, namely the period-t price for the foreign currency to be delivered at period t + τ. The null hypothesis of interest is that the forward exchange rate F_t(τ) is an optimal predictor for the future spot rate S_{t+τ} in the sense that E(S_{t+τ}|I_t) = F_t(τ), where I_t is the information set available at time t. This is called the expectations hypothesis in economics and finance. Given the above specification, this hypothesis can be written as

H_0^e: β^o_0 = 0, β^o_1 = 1,

and E(ε_{t+τ}|I_t) = 0. This is equivalent to the choice of

R = [1 0; 0 1],  r = (0, 1)'.

All examples considered above can be formulated with a suitable specification of R, where R is a J × K matrix in the null hypothesis

H_0: Rβ^o = r,

where r is a J × 1 vector.

Basic Ideas of Hypothesis Testing

To test the null hypothesis

H_0: Rβ^o = r,

we can consider the statistic

Rβ̂ − r

and check whether this difference is significantly different from zero.
Under H_0: Rβ^o = r, we have

Rβ̂ − r = Rβ̂ − Rβ^o = R(β̂ − β^o) → 0 as n → ∞,

because β̂ − β^o → 0 as n → ∞ in terms of MSE.
Under the alternative to H_0, Rβ^o ≠ r, but we still have β̂ − β^o → 0 in terms of MSE. It follows that

Rβ̂ − r = R(β̂ − β^o) + Rβ^o − r → Rβ^o − r ≠ 0

as n → ∞, where the convergence is in terms of MSE. In other words, Rβ̂ − r converges to a nonzero limit, Rβ^o − r.

The fact that the behavior of Rβ̂ − r differs under H_0 and under the alternative provides a basis for constructing hypothesis tests. In particular, we can test H_0 by examining whether Rβ̂ − r is significantly different from zero.

Question: How large should the magnitude of the difference Rβ̂ − r be in order to claim that Rβ̂ − r is significantly different from zero?

For this purpose, we need a decision rule that specifies a threshold value with which we can compare the (absolute) value of Rβ̂ − r. Because Rβ̂ − r is a random variable, it can take many (possibly an infinite number of) values; given a data set, we only obtain one realization of Rβ̂ − r. Whether a realization of Rβ̂ − r is close to zero should be judged using the critical value of its sampling distribution, which depends on the sample size n and the significance level α ∈ (0, 1) that one preselects.

Question: What is the sampling distribution of Rβ̂ − r under H_0?

Because

R(β̂ − β^o)|X ~ N(0, σ²R(X'X)^{-1}R'),

we have that, conditional on X,

Rβ̂ − r = R(β̂ − β^o) + Rβ^o − r ~ N(Rβ^o − r, σ²R(X'X)^{-1}R').

Corollary 3.9: Under Assumptions 3.1, 3.3 and 3.5, and H_0: Rβ^o = r, we have for each n > K,

(Rβ̂ − r)|X ~ N(0, σ²R(X'X)^{-1}R').

The difference Rβ̂ − r cannot be used directly as a test statistic for H_0, because σ² is unknown and there is no way to calculate the critical values of the sampling distribution of Rβ̂ − r.

Question: How to construct a feasible (i.e., computable) test statistic?

The form of the test statistic differs depending on whether J = 1 or J > 1. We first consider the case of J = 1.

Case I: t-Test (J = 1)

Recall that we have

(Rβ̂ − r)|X ~ N(0, σ²R(X'X)^{-1}R').

When J = 1, the conditional variance

var[(Rβ̂ − r)|X] = σ²R(X'X)^{-1}R'

is a scalar (1 × 1). It follows that, conditional on X,

(Rβ̂ − r)/√(var[(Rβ̂ − r)|X]) = (Rβ̂ − r)/√(σ²R(X'X)^{-1}R') ~ N(0, 1).

Question: What is the unconditional distribution of

(Rβ̂ − r)/√(σ²R(X'X)^{-1}R')?

The unconditional distribution is also N(0, 1), because the conditional N(0, 1) distribution does not depend on X.

However, σ² is unknown, so we cannot use the ratio above as a test statistic. We have to replace σ² by s², which is a good estimator for σ². This gives a feasible (i.e., computable) test statistic

T = (Rβ̂ − r)/√(s²R(X'X)^{-1}R').

However, the test statistic T is no longer normally distributed. Instead,

T = (Rβ̂ − r)/√(s²R(X'X)^{-1}R')
  = [ (Rβ̂ − r)/√(σ²R(X'X)^{-1}R') ] / √[ ((n − K)s²/σ²)/(n − K) ]
  ~ N(0, 1) / √( χ²_{n−K}/(n − K) )
  ~ t_{n−K},

where t_{n−K} denotes a Student's t-distribution with n − K degrees of freedom. Note that the numerator and denominator are mutually independent conditional on X, because β̂ and s² are mutually independent conditional on X. The feasible statistic T is called a t-test statistic because it follows a t_{n−K} distribution.

Question: What is the Student's t_q distribution?

Definition 3.6 [Student's t-distribution]: Suppose Z ~ N(0, 1) and V ~ χ²_q, and Z and V are independent. Then the ratio

Z/√(V/q) ~ t_q.

The t_q distribution is symmetric about 0 with heavier tails than the N(0, 1) distribution. The smaller the number of degrees of freedom, the heavier its tails. As q → ∞, t_q →^d N(0, 1), where →^d denotes convergence in distribution. This implies that

T = (Rβ̂ − r)/√(s²R(X'X)^{-1}R') →^d N(0, 1) as n → ∞.

This result has a very important implication in practice: for a large sample size n, it makes essentially no difference whether one uses the critical values from t_{n−K} or from N(0, 1).

Question: What is convergence in distribution?

Definition 3.7 [Convergence in distribution]: Suppose {Z_n, n = 1, 2, ...} is a sequence of random variables/vectors with distribution functions F_n(z) = P(Z_n ≤ z), and Z is a random variable/vector with distribution function F(z) = P(Z ≤ z). We say that Z_n converges to Z in distribution if the distribution of Z_n converges to the distribution of Z at all continuity points, namely,

lim_{n→∞} F_n(z) = F(z), or F_n(z) → F(z) as n → ∞,

for any continuity point z (i.e., any point at which F(z) is continuous). We use the notation Z_n →^d Z. The distribution of Z is called the asymptotic or limiting distribution of Z_n.

In practice, Z_n is a test statistic or a parameter estimator, and often its sampling distribution F_n(z) is either unknown or very complicated, whereas F(z) is known or very simple. As long as Z_n →^d Z, we can use F(z) as an approximation to F_n(z). This gives a convenient procedure for statistical inference. The potential cost is that the approximation of F(z) to F_n(z) may not be good enough in finite samples (i.e., when n is finite). How good the approximation is will depend on the data generating process and the sample size n.

Example 3: Suppose {ε_n, n = 1, 2, ...} is an i.i.d. sequence with distribution function F(z). Let ε be a random variable with the same distribution function F(z). Then ε_n →^d ε.

With the sampling distribution obtained for the test statistic T, we can now describe a decision rule for testing H_0 when J = 1.

Decision Rule of the t-Test Based on Critical Values

(i) Reject H_0: Rβ^o = r at a prespecified significance level α ∈ (0, 1) if

|T| > C_{t_{n−K}, α/2},

where C_{t_{n−K}, α/2} is the upper-tailed critical value of the t_{n−K} distribution at level α/2, which is determined by

P[t_{n−K} > C_{t_{n−K}, α/2}] = α/2,

or equivalently

P[|t_{n−K}| > C_{t_{n−K}, α/2}] = α.

(ii) Do not reject H_0 at the significance level α if

|T| ≤ C_{t_{n−K}, α/2}.

Remarks:

In testing H_0, there exist two types of errors, due to the limited information about the population contained in a given random sample {Z_t}_{t=1}^n. One possibility is that H_0 is true but we reject it. This is called the "Type I error." The significance level α is the probability of making a Type I error. If

P[|T| > C_{t_{n−K}, α/2} | H_0] = α,

we say that the decision rule is a test with size α.

On the other hand, the probability P[|T| > C_{t_{n−K}, α/2} | H_0 is false] is called the power function of a size-α test. When

P[|T| > C_{t_{n−K}, α/2} | H_0 is false] < 1,

there exists a possibility that one may fail to reject H_0 when it is false. This is called the "Type II error."

Ideally one would like to minimize both the Type I error and the Type II error, but this is impossible for any given finite sample. In practice, one usually presets the level of the Type I error, the so-called significance level, and then seeks to minimize the Type II error. Conventional choices for the significance level α are 10%, 5% and 1%.

Next, we describe an alternative decision rule for testing H_0 when J = 1, using the so-called p-value of the test statistic T.

An Equivalent Decision Rule Based on p-values

Given a data set z^n = {y_t, x_t'}'_{t=1}^n, which is a realization of the random sample Z^n = {Y_t, X_t'}'_{t=1}^n, we can compute a realization (i.e., a number) of the t-test statistic T, namely

T(z^n) = (Rβ̂ − r)/√(s²R(x'x)^{-1}R').

Then the probability

p(z^n) = P[|t_{n−K}| > |T(z^n)|]

is called the p-value (i.e., probability value) of the test statistic T given the observed data z^n, where t_{n−K} is a Student's t random variable with n − K degrees of freedom, and T(z^n) is the realization of the test statistic T = T(Z^n) given the observed data z^n. Intuitively, the p-value is the smallest significance level α at which the null hypothesis is rejected. Here, it is the tail probability that the absolute value of a Student's t_{n−K} random variable exceeds the absolute value of the test statistic T(z^n). If this probability is very small relative to the significance level, then it is unlikely that the test statistic T(Z^n) follows a Student's t_{n−K} distribution. As a consequence, the null hypothesis is likely to be false.
The above decision rule can be described equivalently as follows:

Decision Rule Based on the p-value

(i) Reject H_0 at the significance level α if p(z^n) < α.
(ii) Do not reject H_0 at the significance level α if p(z^n) ≥ α.

Remarks:

A small p-value is evidence against the null hypothesis, while a large p-value shows that the data are consistent with the null hypothesis.

Question: What are the advantages and disadvantages of using p-values versus critical values?

p-values are more informative than merely rejecting or accepting the null hypothesis at some prespecified significance level α. A p-value is the smallest significance level at which a null hypothesis can be rejected. It not only tells us whether the null hypothesis should be accepted or rejected, but also whether the decision to accept or reject the null hypothesis is a close call.

Most statistical software reports p-values of parameter estimates. This is much more convenient than asking the user to specify a significance level α and then reporting whether the null hypothesis is accepted or rejected at that α.

When we reject a null hypothesis, we often say there is a statistically significant effect. This does not mean that there is an effect of practical importance (e.g., an effect of economic importance). When large samples are used, small and practically unimportant effects are likely to be statistically significant.

The t-test and associated procedures just introduced remain valid even when there exists near-multicollinearity, where λ_min(X'X) does not grow with the sample size n and var(β̂|X) does not vanish to zero as n → ∞. However, the degree of near-multicollinearity, as measured by the sample correlations between explanatory variables, will affect the precision of the OLS estimator β̂. Other things being equal, the higher the degree of near-multicollinearity, the larger the variance of β̂. As a result, the t-statistic is often insignificant even when the null hypothesis H_0 is false.

Examples of t-tests

Example 4 [Reforms have no effect (continued)]:
We first consider testing the null hypothesis

H_0^a: β_3 = 0,

where β_3 is the coefficient of the autonomy dummy AU_t in the extended production function regression model. This is equivalent to the selection R = (0, 0, 0, 1, 0) and r = 0. In this case, we have

s²R(X'X)^{-1}R' = s²[(X'X)^{-1}]_{(4,4)} = S²_{β̂_3},

which is the estimator of var(β̂_3|X). The square root of var(β̂_3|X) is called the standard error of the estimator β̂_3, and S_{β̂_3} is called the estimated standard error of β̂_3. The t-test statistic is

T = (Rβ̂ − r)/√(s²R(X'X)^{-1}R')
  = β̂_3 / S_{β̂_3}
  ~ t_{n−K}.

Next, we consider testing the CRS hypothesis

H_0^c: β_1 + β_2 = 1,

which corresponds to R = (0, 1, 1, 0, 0) and r = 1. In this case,

s²R(X'X)^{-1}R' = S²_{β̂_1} + S²_{β̂_2} + 2ĉov(β̂_1, β̂_2)
                = s²[(X'X)^{-1}]_{(2,2)} + s²[(X'X)^{-1}]_{(3,3)} + 2s²[(X'X)^{-1}]_{(2,3)}
                = S²_{β̂_1+β̂_2},

which is the estimator of var(β̂_1 + β̂_2|X). Here, ĉov(β̂_1, β̂_2) is the estimator for cov(β̂_1, β̂_2|X), the covariance between β̂_1 and β̂_2 conditional on X.
The t-test statistic is

T = (Rβ̂ − r)/√(s²R(X'X)^{-1}R')
  = (β̂_1 + β̂_2 − 1)/S_{β̂_1+β̂_2}
  ~ t_{n−K}.

Case II: F-Test (J > 1)

Question: How to construct a test statistic for H_0 if J > 1?

We first state a useful lemma.

Lemma 3.10: If Z ~ N(0, V), where V = var(Z) is a nonsingular J × J variance-covariance matrix, then the quadratic form

Z'V^{-1}Z ~ χ²_J.

Proof: Because V is symmetric and positive definite, we can find a symmetric and invertible matrix V^{1/2} such that

V^{1/2}V^{1/2} = V,
V^{-1/2}V^{-1/2} = V^{-1}.

(Question: What is this decomposition called?) Now, define

Y = V^{-1/2}Z.

Then we have E(Y) = 0, and

var(Y) = E{[Y − E(Y)][Y − E(Y)]'}
       = E(YY')
       = E(V^{-1/2}ZZ'V^{-1/2})
       = V^{-1/2}E(ZZ')V^{-1/2}
       = V^{-1/2}VV^{-1/2}
       = V^{-1/2}V^{1/2}V^{1/2}V^{-1/2}
       = I.

It follows that Y ~ N(0, I). Therefore, we have

Z'V^{-1}Z = Y'Y ~ χ²_J.

Applying this lemma, and using the result that

(Rβ̂ − r)|X ~ N[0, σ²R(X'X)^{-1}R']

under H_0, we have the quadratic form

(Rβ̂ − r)'[σ²R(X'X)^{-1}R']^{-1}(Rβ̂ − r) ~ χ²_J

conditional on X, or

(Rβ̂ − r)'[R(X'X)^{-1}R']^{-1}(Rβ̂ − r)/σ² ~ χ²_J

conditional on X. Because the χ²_J distribution does not depend on X, we also have this result unconditionally.

As in constructing the t-test statistic, we should replace σ² by s²:

(Rβ̂ − r)'[R(X'X)^{-1}R']^{-1}(Rβ̂ − r)/s².

The replacement of σ² by s² renders the distribution of the quadratic form no longer Chi-squared. Instead, after proper scaling, the quadratic form follows a so-called F-distribution with degrees of freedom (J, n − K).
To see why, observe that

(Rβ̂ − r)'[R(X'X)^{-1}R']^{-1}(Rβ̂ − r)/s²
  = J · { (Rβ̂ − r)'[R(X'X)^{-1}R']^{-1}(Rβ̂ − r)/(σ²J) } / { [(n − K)s²/σ²]/(n − K) }
  ~ J · [χ²_J/J] / [χ²_{n−K}/(n − K)]
  ~ J · F_{J,n−K},

where F_{J,n−K} denotes the F distribution with J and n − K degrees of freedom.

Question: What is an F_{J,n−K} distribution?

Definition 3.8: Suppose U ~ χ²_p and V ~ χ²_q, and U and V are independent. Then the ratio

(U/p)/(V/q) ~ F_{p,q}

follows an F_{p,q} distribution with degrees of freedom (p, q).

This distribution is called the F-distribution because it is named after R. A. Fisher, a well-known statistician of the 20th century. Its shape is similar to that of the χ² distribution, with a long right tail. An F_{p,q} random variable has the following properties:
(i) If F ~ F_{p,q}, then F^{-1} ~ F_{q,p}.
(ii) t²_q ~ F_{1,q}, because

t²_q = (χ²_1/1)/(χ²_q/q) ~ F_{1,q}.

(iii) For any fixed integer p, p·F_{p,q} →^d χ²_p as q → ∞.

Property (ii) implies that when J = 1, using either the t-test or the F-test delivers the same conclusion. Property (iii) implies that the conclusions based on F_{p,q} and on p·F_{p,q} using the χ²_p approximation will be approximately the same when q is sufficiently large.

We now define the following F-test statistic to test H_0:

F ≡ { (Rβ̂ − r)'[R(X'X)^{-1}R']^{-1}(Rβ̂ − r)/J } / s² ~ F_{J,n−K}.

Theorem 3.11: Suppose Assumptions 3.1, 3.3(a) and 3.5 hold. Then under H_0: Rβ^o = r, we have

F = (Rβ̂ − r)'[R(X'X)^{-1}R']^{-1}(Rβ̂ − r)/(Js²) ~ F_{J,n−K}

for all n > K.

Alternative Expression for the F-Test Statistic

A practical issue is how to compute the F-statistic. One can of course compute the F-test statistic using the definition above. However, there is a very convenient alternative, which we now introduce.

Theorem 3.12: Suppose Assumptions 3.1 and 3.3(a) hold. Let SSR_u = e'e be the sum of squared residuals from the unrestricted model

Y = Xβ^o + ε.

Let SSR_r = ẽ'ẽ be the sum of squared residuals from the restricted model

Y = Xβ^o + ε subject to Rβ^o = r,

where β̃ is the restricted OLS estimator. Then under H_0, the F-test statistic can be written as

F = [(ẽ'ẽ − e'e)/J] / [e'e/(n − K)] ~ F_{J,n−K}.
Proof: Let β̃ be the OLS estimator under H_0, that is,

β̃ = arg min_{β ∈ R^K} (Y − Xβ)'(Y − Xβ) subject to Rβ = r.

We first form the Lagrangian function

L(β, λ) = (Y − Xβ)'(Y − Xβ) + 2λ'(r − Rβ),

where λ is a J × 1 vector of Lagrange multipliers.

We have the following FOC:

∂L(β̃, λ̃)/∂β = −2X'(Y − Xβ̃) − 2R'λ̃ = 0,
∂L(β̃, λ̃)/∂λ = 2(r − Rβ̃) = 0.

With the unconstrained OLS estimator β̂ = (X'X)^{-1}X'Y, the first equation of the FOC yields

β̂ − β̃ = −(X'X)^{-1}R'λ̃,
R(X'X)^{-1}R'λ̃ = −R(β̂ − β̃).

Hence, the Lagrange multiplier is

λ̃ = −[R(X'X)^{-1}R']^{-1}R(β̂ − β̃)
   = −[R(X'X)^{-1}R']^{-1}(Rβ̂ − r),

where we have made use of the constraint Rβ̃ = r. It follows that

β̂ − β̃ = (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(Rβ̂ − r).

Now,

ẽ = Y − Xβ̃
  = Y − Xβ̂ + X(β̂ − β̃)
  = e + X(β̂ − β̃).

It follows that

ẽ'ẽ = e'e + (β̂ − β̃)'X'X(β̂ − β̃)
    = e'e + (Rβ̂ − r)'[R(X'X)^{-1}R']^{-1}(Rβ̂ − r),

where the cross-product term vanishes because X'e = 0. We therefore have

(Rβ̂ − r)'[R(X'X)^{-1}R']^{-1}(Rβ̂ − r) = ẽ'ẽ − e'e

and

F = (Rβ̂ − r)'[R(X'X)^{-1}R']^{-1}(Rβ̂ − r)/(Js²)
  = [(ẽ'ẽ − e'e)/J] / [e'e/(n − K)].

This completes the proof.
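The following minimal Python sketch (simulated, hypothetical data) computes the F-statistic from the restricted and unrestricted sums of squared residuals, as in Theorem 3.12, for the joint hypothesis that two slope coefficients are zero; scipy.stats supplies the F critical value.

```python
# A minimal sketch (simulated data) of the F-test computed from restricted and
# unrestricted SSRs: F = [(SSR_r - SSR_u)/J] / [SSR_u/(n - K)].
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 120
Z = rng.standard_normal((n, 3))
Y = 1.0 + 0.8 * Z[:, 0] + rng.standard_normal(n)   # only the first regressor matters

def ssr(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

X_u = np.column_stack([np.ones(n), Z])        # unrestricted: intercept + 3 regressors
X_r = np.column_stack([np.ones(n), Z[:, 0]])  # restricted: H0 sets the last 2 slopes to 0

J, K = 2, X_u.shape[1]
F = ((ssr(X_r, Y) - ssr(X_u, Y)) / J) / (ssr(X_u, Y) / (n - K))
crit = stats.f.ppf(0.95, J, n - K)
print(f"F = {F:.3f}, 5% critical value = {crit:.3f}")
```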

Remarks:

The F-statistic is convenient to compute: one only needs the two sums of squared residuals. Intuitively, the sum of squared residuals SSR_u of the unrestricted regression model is always no larger than that of the restricted regression model. When the null hypothesis H_0 is true (i.e., when the parameter restriction is valid), the sum of squared residuals SSR_r of the restricted model is more or less similar to that of the unrestricted model, subject to differences due to sampling variation. If SSR_r is sufficiently larger than SSR_u, then there exists evidence against H_0. How large a difference between SSR_r and SSR_u is considered sufficiently large to reject H_0 is determined by the critical value of the associated F distribution.

Question: What is the interpretation of the Lagrange multiplier λ̃?

Recall that we have obtained the relation

λ̃ = −[R(X'X)^{-1}R']^{-1}R(β̂ − β̃)
   = −[R(X'X)^{-1}R']^{-1}(Rβ̂ − r).

Thus, λ̃ is an indicator of the departure of Rβ̂ from r; that is, the value of λ̃ indicates whether Rβ̂ − r is significantly different from zero.

Question: What happens to the distribution of F when n → ∞?

Recall the important property of the F_{p,q} distribution that p·F_{p,q} →^d χ²_p as q → ∞. Since our F-statistic for H_0 follows an F_{J,n−K} distribution, it follows that under H_0, the quadratic form

J·F = (Rβ̂ − r)'[s²R(X'X)^{-1}R']^{-1}(Rβ̂ − r) →^d χ²_J

as n → ∞. We formally state this result below.

Theorem 3.13: Suppose Assumptions 3.1, 3.3(a) and 3.5 hold. Then under H_0, the Wald test statistic

W = (Rβ̂ − r)'[R(X'X)^{-1}R']^{-1}(Rβ̂ − r)/s² →^d χ²_J

as n → ∞.

This result implies that when n is sufficiently large, using the F-statistic with the exact F_{J,n−K} distribution and using the quadratic form W with the simpler χ²_J approximation make no essential difference for statistical inference.

3.8 Applications

We now consider some special but important cases often encountered in economics and finance.

Case I: Testing for the Joint Significance of Explanatory Variables

Consider the linear regression model

Y_t = X_t'β^o + ε_t = β^o_0 + Σ_{j=1}^k β^o_j X_{jt} + ε_t.

We are interested in testing the combined effect of all the regressors except the intercept. The null hypothesis is

H_0: β^o_j = 0 for all 1 ≤ j ≤ k,

which implies that none of the explanatory variables influences Y_t. The alternative hypothesis is

H_A: β^o_j ≠ 0 for at least some j ∈ {1, ..., k}.

One can use the F-test, with

F ~ F_{k, n−(k+1)}.

In fact, the restricted model under H_0 is very simple:

Y_t = β^o_0 + ε_t.

The restricted OLS estimator is β̃ = (Ȳ, 0, ..., 0)'. It follows that

ẽ = Y − Xβ̃ = Y − Ȳ,

and hence

ẽ'ẽ = (Y − Ȳ)'(Y − Ȳ).

Recall the definition of R²:

R² = 1 − e'e/[(Y − Ȳ)'(Y − Ȳ)] = 1 − e'e/(ẽ'ẽ).

It follows that

F = [(ẽ'ẽ − e'e)/k] / [e'e/(n − k − 1)]
  = [(1 − e'e/ẽ'ẽ)/k] / [(e'e/ẽ'ẽ)/(n − k − 1)]
  = [R²/k] / [(1 − R²)/(n − k − 1)].

Thus, it suffices to run one regression, namely the unrestricted model, in this case. We emphasize that this formula is valid only when one is testing H_0: β^o_j = 0 for all 1 ≤ j ≤ k.

Example 1 [Efficient Market Hypothesis]: Suppose Y_t is the exchange rate return in period t, and I_{t−1} is the information available at time t − 1. Then a classical version of the efficient market hypothesis (EMH) can be stated as follows:

E(Y_t|I_{t−1}) = E(Y_t).

To check whether exchange rate changes are unpredictable using the past history of exchange rate changes, we specify a linear regression model

Y_t = X_t'β^o + ε_t,

where
X_t = (1, Y_{t−1}, ..., Y_{t−k})'.

Under EMH, we have

H_0: β^o_j = 0 for all j = 1, ..., k.

If the alternative

H_A: β^o_j ≠ 0 for at least some j ∈ {1, ..., k}

holds, then exchange rate changes are predictable using past information.

Question: What is the appropriate interpretation if H_0 is not rejected?

Note that there exists a gap between the efficiency hypothesis and H_0, because the linear regression model is just one of many ways to check EMH. Thus, if H_0 is not rejected, at most we can say that no evidence against the efficiency hypothesis is found. We should not conclude that EMH holds.

Strictly speaking, the current theory (Assumption 3.2: E(ε_t|X) = 0) rules out this application, which is a dynamic time series regression model. However, we will justify in Chapter 5 that

k·F = R²/[(1 − R²)/(n − k − 1)] →^d χ²_k

under conditional homoskedasticity, even for a linear dynamic regression model.

In fact, we can use a simpler version when n is large:

(n − k − 1)R² →^d χ²_k.

This follows from the Slutsky theorem because R² →^p 0 under H_0. Although Assumption 3.5 is not needed for this result, conditional homoskedasticity is still needed, which rules out autoregressive conditional heteroskedasticity (ARCH) in the time series context.

Below is a concrete numerical example.

Example 2 [Consumption Function and Wealth Effect]: Let Y_t = consumption, X_{1t} = labor income, X_{2t} = liquidity asset wealth. A regression estimation gives

Y_t = 33.88 − 26.00 X_{1t} + 6.71 X_{2t} + e_t,  R² = 0.742, n = 25,
      [1.77]  [−0.74]        [0.77]

where the numbers inside [·] are t-statistics.

Suppose we are interested in whether labor income or liquidity asset wealth has an impact on consumption. We can use the F-test statistic,

F = [R²/2] / [(1 − R²)/(n − 3)]
  = (0.742/2)/[(1 − 0.742)/(25 − 3)]
  = 31.636 ~ F_{2,22}.

Comparing it with the critical value of F_{2,22} at the 5% significance level, we reject the null hypothesis that neither labor income nor liquidity asset wealth has an impact on consumption at the 5% significance level.
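As a quick check of the arithmetic in Example 2, the following Python snippet evaluates the R²-based F formula with the reported values (a sketch only; the underlying data are not reproduced here).

```python
# Recomputing the F-statistic in Example 2 from the reported R^2 and sample size.
R2, n, k = 0.742, 25, 2
F = (R2 / k) / ((1 - R2) / (n - k - 1))
print(F)  # approximately 31.6, to be compared with the F(2, 22) critical value
```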

Case II: Testing for Omitted Variables (or Testing for No Effect)

Suppose X = (X^(1), X^(2)), where X^(1) is an n × (k_1 + 1) matrix and X^(2) is an n × k_2 matrix.
A random vector X_t^(2) has no explanatory power for the conditional expectation of Y_t if

E(Y_t|X_t) = E(Y_t|X_t^(1)).

Alternatively, it has explanatory power for the conditional expectation of Y_t if

E(Y_t|X_t) ≠ E(Y_t|X_t^(1)).

When X_t^(2) has explanatory power for Y_t but is not included in the regression, we say that X_t^(2) is an omitted variable or vector.

Question: How to test whether X_t^(2) is an omitted variable in the linear regression context?

Consider the restricted model

Y_t = β_0 + β_1 X_{1t} + ... + β_{k_1} X_{k_1 t} + ε_t.

Suppose we have k_2 additional variables (X_{(k_1+1)t}, ..., X_{(k_1+k_2)t}), and so we consider the unrestricted regression model

Y_t = β_0 + β_1 X_{1t} + ... + β_{k_1} X_{k_1 t} + β_{k_1+1} X_{(k_1+1)t} + ... + β_{k_1+k_2} X_{(k_1+k_2)t} + ε_t.

The null hypothesis is that the additional variables have no effect on Y_t:

H_0: β_{k_1+1} = β_{k_1+2} = ... = β_{k_1+k_2} = 0.

The alternative is that at least one of the additional variables has an effect on Y_t.
The F-test statistic is

F = [(ẽ'ẽ − e'e)/k_2] / [e'e/(n − k_1 − k_2 − 1)] ~ F_{k_2, n−(k_1+k_2+1)}.

Question: Suppose we reject the null hypothesis. Then some important explanatory variables are omitted, and they should be included in the regression. On the other hand, if the F-test statistic does not reject the null hypothesis H_0, can we say that there is no omitted variable?

No. There may exist a nonlinear relationship between Y_t and the additional variables which a linear regression specification cannot capture.

Example 3 [Testing for the Effect of Reforms]:

Consider the extended production function

ln(Y_t) = β_0 + β_1 ln(L_t) + β_2 ln(K_t) + β_3 AU_t + β_4 PS_t + β_5 CM_t + ε_t,

where AU_t is the autonomy dummy, PS_t is the profit sharing ratio, and CM_t is the dummy for change of manager. The null hypothesis of interest here is that none of the three reforms has an impact:

H_0: β_3 = β_4 = β_5 = 0.

We can use the F-test, with F ~ F_{3,n−6} under H_0.

If rejection occurs, there exists evidence against H_0. However, if no rejection occurs, then we can only say that we find no evidence against H_0 (which is not the same as saying that the reforms have no effect). It is possible that the effect of X_t^(2) is of a nonlinear form; in this case, we may obtain zero coefficients for X_t^(2) because the linear specification cannot capture it.

Example 4 [Testing for Granger Causality]:

Consider two time series $\{Y_t, Z_t\}$, where $t$ is the time index, $I_{t-1}^Y = \{Y_{t-1}, \ldots, Y_1\}$ and $I_{t-1}^Z = \{Z_{t-1}, \ldots, Z_1\}$. For example, $Y_t$ is the GDP growth, and $Z_t$ is the money supply growth. We say that $Z_t$ does not Granger-cause $Y_t$ in conditional mean with respect to $I_{t-1} = \{I_{t-1}^{(Y)}, I_{t-1}^{(Z)}\}$ if
$$E(Y_t | I_{t-1}^{(Y)}, I_{t-1}^{(Z)}) = E(Y_t | I_{t-1}^{(Y)}).$$
In other words, the lagged variables of $Z_t$ have no impact on the level of $Y_t$.

In time series analysis, Granger causality is defined in terms of incremental predictability rather than a true cause-effect relationship. From an econometric point of view, it is a test of omitted variables in a time series context. It was first introduced by Granger (1969).

Question: How to test Granger causality?

Consider now a linear regression model
$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \cdots + \beta_p Y_{t-p} + \beta_{p+1} Z_{t-1} + \cdots + \beta_{p+q} Z_{t-q} + \varepsilon_t.$$
Under non-Granger causality, we have
$$H_0: \beta_{p+1} = \cdots = \beta_{p+q} = 0.$$
The F-test statistic
$$F \sim F_{q,\, n-(p+q+1)}.$$
The current econometric theory (Assumption 3.2: $E(\varepsilon_t|X) = 0$) actually rules out this application, because it is a dynamic regression model. However, we will justify in Chapter 5 that under $H_0$,
$$q \cdot F \stackrel{d}{\to} \chi^2_q$$
as $n \to \infty$ under conditional homoskedasticity even for a linear dynamic regression model.
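Below is a hedged Python sketch of how such a test could be carried out on two series y and z; the helper function, the lag orders p and q, and the chi-square version of the statistic are illustrative assumptions consistent with the large-sample result just stated:

```python
import numpy as np
from scipy import stats

def make_lags(x, k, start):
    """Columns x_{t-1}, ..., x_{t-k} for t = start, ..., T-1 (0-based indexing)."""
    return np.column_stack([x[start - j:len(x) - j] for j in range(1, k + 1)])

def granger_test(y, z, p=1, q=1):
    """Chi-square version of the Granger-causality test of H0: lags of z do not help
    predict y, given p lags of y itself (a sketch; p and q are chosen by the user)."""
    start = max(p, q)
    Y = y[start:]
    Xr = np.column_stack([np.ones(len(Y)), make_lags(y, p, start)])   # restricted model
    Xu = np.column_stack([Xr, make_lags(z, q, start)])                # adds q lags of z
    e_r = Y - Xr @ np.linalg.lstsq(Xr, Y, rcond=None)[0]
    e_u = Y - Xu @ np.linalg.lstsq(Xu, Y, rcond=None)[0]
    n = len(Y)
    F = ((e_r @ e_r - e_u @ e_u) / q) / (e_u @ e_u / (n - p - q - 1))
    return q * F, stats.chi2.sf(q * F, q)     # q*F is asymptotically chi2(q) under H0
```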

Example 5 [Testing for Structural Change (or Testing for Regime Shift)]:
Consider a bivariate regression model
$$Y_t = \beta_0 + \beta_1 X_{1t} + \varepsilon_t,$$
where $t$ is a time index, and $\{X_t\}$ and $\{\varepsilon_t\}$ are mutually independent. Suppose there exist changes after $t = t_0$; i.e., there exist structural changes. We can consider the extended regression model
$$Y_t = (\beta_0 + \alpha_0 D_t) + (\beta_1 + \alpha_1 D_t)X_{1t} + \varepsilon_t = \beta_0 + \beta_1 X_{1t} + \alpha_0 D_t + \alpha_1 (D_t X_{1t}) + \varepsilon_t,$$
where $D_t = 1$ if $t > t_0$ and $D_t = 0$ otherwise. The variable $D_t$ is called a dummy variable, indicating whether it is a pre- or post-structural break period.

The null hypothesis of no structural change is
$$H_0: \alpha_0 = \alpha_1 = 0.$$
The alternative hypothesis, that there exists a structural change, is
$$H_A: \alpha_0 \neq 0 \text{ or } \alpha_1 \neq 0.$$
The F-test statistic
$$F \sim F_{2, n-4}.$$
The idea of such a test was first proposed by Chow (1960).
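A sketch of the corresponding computation is given below, assuming a single regressor and a known candidate break date t0; the function name and inputs are illustrative:

```python
import numpy as np
from scipy import stats

def chow_test(y, x, t0):
    """F-test of H0: no structural change after period t0, via the dummy-interaction
    regression Y_t = b0 + b1 X_t + a0 D_t + a1 (D_t X_t) + e_t (single-regressor sketch)."""
    n = len(y)
    D = (np.arange(1, n + 1) > t0).astype(float)      # D_t = 1 for t > t0
    Xr = np.column_stack([np.ones(n), x])             # restricted: no break
    Xu = np.column_stack([Xr, D, D * x])              # unrestricted: shifts in intercept and slope
    e_r = y - Xr @ np.linalg.lstsq(Xr, y, rcond=None)[0]
    e_u = y - Xu @ np.linalg.lstsq(Xu, y, rcond=None)[0]
    F = ((e_r @ e_r - e_u @ e_u) / 2) / (e_u @ e_u / (n - 4))
    return F, stats.f.sf(F, 2, n - 4)
```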

Case III: Testing for Linear Restrictions

Example 6 [Testing for CRS]:
Consider the extended production function
$$\ln(Y_t) = \beta_0 + \beta_1 \ln(L_t) + \beta_2 \ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t.$$
We will test the null hypothesis of CRS:
$$H_0: \beta_1 + \beta_2 = 1.$$
The alternative hypothesis is
$$H_A: \beta_1 + \beta_2 \neq 1.$$
What is the restricted model under $H_0$? It is given by
$$\ln(Y_t) = \beta_0 + \beta_1 \ln(L_t) + (1 - \beta_1) \ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t,$$
or equivalently
$$\ln(Y_t/K_t) = \beta_0 + \beta_1 \ln(L_t/K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t.$$
The F-test statistic
$$F \sim F_{1, n-6}.$$
Because there is only one restriction, both the t-test and the F-test are applicable to test CRS.

Example 7 [Wage Determination]: Consider the wage function
$$W_t = \beta_0 + \beta_1 P_t + \beta_2 P_{t-1} + \beta_3 U_t + \beta_4 V_t + \beta_5 W_{t-1} + \varepsilon_t,$$
where $W_t$ = wage, $P_t$ = price, $U_t$ = unemployment, and $V_t$ = unfilled vacancies.
We will test the null hypothesis
$$H_0: \beta_1 + \beta_2 = 0, \; \beta_3 + \beta_4 = 0, \text{ and } \beta_5 = 1.$$

Question: What is the economic interpretation of the null hypothesis $H_0$?

Under $H_0$, we have the restricted wage equation
$$\Delta W_t = \beta_0 + \beta_1 \Delta P_t + \beta_4 D_t + \varepsilon_t,$$
where $\Delta W_t = W_t - W_{t-1}$ is the wage growth rate, $\Delta P_t = P_t - P_{t-1}$ is the inflation rate, and $D_t = V_t - U_t$ is an index for the job market situation (excess job supply). This implies that the wage increase depends on the inflation rate and the excess job supply.

The F-test statistic for $H_0$ is
$$F \sim F_{3, n-6}.$$

Case IV: Testing for Near-Multicollinearity

Example 8 [Consumption Function (Cont.)]:
Consider the following estimation results for three separate regressions based on the same data set with $n = 25$. The first is a regression of consumption on income:
$$Y_t = 36.74 + 0.832 X_{1t} + e_{1t}, \quad R^2 = 0.735,$$
$$\qquad [1.98] \quad [7.98]$$
The second is a regression of consumption on wealth:
$$Y_t = 36.61 + 0.208 X_{2t} + e_{2t}, \quad R^2 = 0.735,$$
$$\qquad [1.97] \quad [7.99]$$
The third is a regression of consumption on both income and wealth:
$$Y_t = 33.88 - 26.00 X_{1t} + 6.71 X_{2t} + e_t, \quad R^2 = 0.742,$$
$$\qquad [1.77] \quad\; [-0.74] \qquad [0.77]$$
Note that in the first two separate regressions, we find significant t-statistics for income and wealth, but in the third joint regression, both income and wealth are insignificant. This may be due to the fact that income and wealth are highly multicollinear! To test the null hypothesis that neither income nor wealth has an impact on consumption, we can use the F-test:
$$F = \frac{R^2/2}{(1 - R^2)/(n - 3)} = \frac{0.742/2}{(1 - 0.742)/(25 - 3)} = 31.636 \sim F_{2,22}.$$
This F-test shows that the null hypothesis is firmly rejected at the 5% significance level, because the critical value of $F_{2,22}$ at the 5% level is 3.44.
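As a rough illustration of the mechanism at work in this example, the following sketch simulates two nearly collinear regressors (hypothetical data, not the consumption data above) and computes their sample correlation and the resulting variance inflation factor $1/(1 - r^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)                        # hypothetical "income" series
x2 = 0.25 * x1 + 0.01 * rng.normal(size=100)     # hypothetical "wealth", almost collinear with income

r = np.corrcoef(x1, x2)[0, 1]
vif = 1.0 / (1.0 - r ** 2)                       # variance inflation factor 1/(1 - r^2)
print(round(r, 4), round(vif, 1))                # r is close to 1, so the VIF is very large
# A huge VIF inflates var(beta_hat_j | X), so individual t-statistics can be insignificant
# even when the joint F-test strongly rejects, exactly the pattern in the example above.
```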

3.9 Generalized Least Squares (GLS) Estimation

Question: The classical linear regression theory crucially depends on the assumption that $\varepsilon|X \sim N(0, \sigma^2 I)$, or equivalently $\{\varepsilon_t\} \sim$ i.i.d. $N(0, \sigma^2)$ and $\{X_t\}$ and $\{\varepsilon_t\}$ are mutually independent. What may happen if some classical assumptions do not hold?

Question: Under what conditions are the existing procedures and results still approximately true?

Assumption 3.5 is unrealistic for many economic and financial data. Suppose Assumption 3.5 is replaced by the following condition:

Assumption 3.6: $\varepsilon|X \sim N(0, \sigma^2 V)$, where $0 < \sigma^2 < \infty$ is unknown and $V = V(X)$ is a known $n \times n$ symmetric, finite and positive definite matrix.

Remarks:
Assumption 3.6 implies that
$$\mathrm{var}(\varepsilon|X) = E(\varepsilon\varepsilon'|X) = \sigma^2 V = \sigma^2 V(X)$$
is known up to a constant $\sigma^2$. It allows for conditional heteroskedasticity of known form.

In Assumption 3.6, it is possible that $V$ is not a diagonal matrix. Thus, $\mathrm{cov}(\varepsilon_t, \varepsilon_s|X)$ may not be zero. In other words, Assumption 3.6 allows conditional autocorrelation of known form. If $t$ is a time index, this implies that there exists serial correlation of known form. If $t$ is an index for cross-sectional units, this implies that there exists spatial correlation of known form.
However, the assumption that $V$ is known is still very restrictive from a practical point of view. In practice, $V$ usually has an unknown form.

Question: What is the statistical property of the OLS estimator $\hat\beta$ under Assumption 3.6?

Theorem 3.14: Suppose Assumptions 3.1, 3.3(a) and 3.6 hold. Then
(i) unbiasedness: $E(\hat\beta|X) = \beta^o$;
(ii) variance:
$$\mathrm{var}(\hat\beta|X) = \sigma^2 (X'X)^{-1} X'VX (X'X)^{-1} \neq \sigma^2 (X'X)^{-1};$$
(iii)
$$(\hat\beta - \beta^o)|X \sim N\left(0, \; \sigma^2 (X'X)^{-1} X'VX (X'X)^{-1}\right);$$
(iv) $\mathrm{cov}(\hat\beta, e|X) \neq 0$ in general.

Proof: (i) Using $\hat\beta - \beta^o = (X'X)^{-1}X'\varepsilon$, we have
$$E[(\hat\beta - \beta^o)|X] = (X'X)^{-1}X'E(\varepsilon|X) = (X'X)^{-1}X' \cdot 0 = 0.$$
(ii)
$$\mathrm{var}(\hat\beta|X) = E[(\hat\beta - \beta^o)(\hat\beta - \beta^o)'|X] = E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}|X] = (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)X(X'X)^{-1} = \sigma^2 (X'X)^{-1}X'VX(X'X)^{-1}.$$
Note that we cannot further simplify the expression here because $V \neq I$.
(iii) Because
$$\hat\beta - \beta^o = (X'X)^{-1}X'\varepsilon = \sum_{t=1}^n C_t\varepsilon_t,$$
where the weighting vector $C_t = (X'X)^{-1}X_t$, $\hat\beta - \beta^o$ follows a normal distribution given $X$, because it is a sum of normal random variables. As a result,
$$\hat\beta - \beta^o \sim N\left(0, \; \sigma^2 (X'X)^{-1}X'VX(X'X)^{-1}\right).$$
(iv)
$$\mathrm{cov}(\hat\beta, e|X) = E[(\hat\beta - \beta^o)e'|X] = E[(X'X)^{-1}X'\varepsilon\varepsilon'M|X] = (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)M = \sigma^2 (X'X)^{-1}X'VM \neq 0,$$
because $X'VM \neq 0$. We can see that it is conditional heteroskedasticity and/or autocorrelation in $\{\varepsilon_t\}$ that causes $\hat\beta$ to be correlated with $e$.

Remarks:
The OLS estimator $\hat\beta$ is still unbiased and one can show that its variance goes to zero as $n \to \infty$ (see Question 6, Problem Set 03). Thus, it converges to $\beta^o$ in the sense of MSE.
However, the variance of the OLS estimator $\hat\beta$ no longer has the simple expression $\sigma^2(X'X)^{-1}$ under Assumption 3.6. As a consequence, the classical t- and F-test statistics are invalid because they are based on an incorrect variance-covariance matrix of $\hat\beta$: they use the incorrect expression $\sigma^2(X'X)^{-1}$ rather than the correct variance formula $\sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$.
Theorem 3.14(iv) implies that even if we can obtain a consistent estimator for $\sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$ and use it to construct tests, we can no longer obtain the Student t-distribution and F-distribution, because the numerator and the denominator defining the t- and F-test statistics are no longer independent.

Generalized Least Squares (GLS) Estimation

To introduce GLS, we first state a useful lemma.

Lemma 3.15: For any symmetric positive definite matrix $V$, we can always write
$$V^{-1} = C'C, \qquad V = C^{-1}(C')^{-1},$$
where $C$ is an $n \times n$ nonsingular matrix.

Question: What is this decomposition called? Note that $C$ may not be symmetric.

Consider the original linear regression model:
$$Y = X\beta^o + \varepsilon.$$
If we multiply the equation by $C$, we obtain the transformed regression model
$$CY = (CX)\beta^o + C\varepsilon, \quad \text{or} \quad Y^* = X^*\beta^o + \varepsilon^*,$$
where $Y^* = CY$, $X^* = CX$ and $\varepsilon^* = C\varepsilon$. Then the OLS estimator of this transformed model,
$$\hat\beta^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'C'CX)^{-1}(X'C'CY) = (X'V^{-1}X)^{-1}X'V^{-1}Y,$$
is called the Generalized Least Squares (GLS) estimator.
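A minimal numerical sketch of this computation, assuming $V$ (up to scale) is known: the first function uses the closed form above, and the second uses the factorization $V^{-1} = C'C$ and runs OLS on the transformed data. The function names and any inputs are illustrative.

```python
import numpy as np

def gls(X, y, V):
    """GLS estimator (X'V^{-1}X)^{-1} X'V^{-1} y, assuming V = V(X) is known up to scale."""
    Vinv = np.linalg.inv(V)
    return np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

def gls_via_transform(X, y, V):
    """Equivalent route: factor V^{-1} = C'C (here via a Cholesky factor) and run OLS
    on the transformed data Y* = C y, X* = C X."""
    L = np.linalg.cholesky(np.linalg.inv(V))    # V^{-1} = L L', with L lower triangular
    C = L.T                                     # then C'C = L L' = V^{-1}
    return np.linalg.lstsq(C @ X, C @ y, rcond=None)[0]

# Both functions return the same estimate (up to rounding); with V = I they reduce to OLS.
```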

Question: What is the nature of GLS?

Observe that
$$E(\varepsilon^*|X) = E(C\varepsilon|X) = C\,E(\varepsilon|X) = C \cdot 0 = 0.$$
Also, note that
$$\mathrm{var}(\varepsilon^*|X) = E[\varepsilon^*\varepsilon^{*\prime}|X] = E[C\varepsilon\varepsilon'C'|X] = C\,E(\varepsilon\varepsilon'|X)\,C' = \sigma^2 CVC' = \sigma^2 C[C^{-1}(C')^{-1}]C' = \sigma^2 I.$$
It follows from Assumption 3.6 that
$$\varepsilon^*|X \sim N(0, \sigma^2 I).$$
The transformation makes the new error $\varepsilon^*$ conditionally homoskedastic and serially uncorrelated, while maintaining normality. Suppose that for some $t$, $\varepsilon_t$ has a large variance $\sigma_t^2$. The transformation will discount $\varepsilon_t$ by dividing it by its conditional standard deviation so that $\varepsilon_t^*$ becomes conditionally homoskedastic. In addition, the transformation also removes possible correlation between $\varepsilon_t$ and $\varepsilon_s$, $t \neq s$. As a consequence, GLS becomes the best linear LS estimator for $\beta^o$ in terms of the Gauss-Markov theorem.

To appreciate how the transformation by matrix $C$ removes conditional heteroskedasticity and eliminates serial correlation, we now consider two examples.

Example 1 [Removing Heteroskedasticity]: Suppose
$$V = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix}.$$
Then
$$C = \begin{bmatrix} \sigma_1^{-1} & 0 & \cdots & 0 \\ 0 & \sigma_2^{-1} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^{-1} \end{bmatrix},$$
where $\sigma_i^2 = \sigma_i^2(X)$, $i = 1, \ldots, n$, and
$$\varepsilon^* = C\varepsilon = \begin{bmatrix} \varepsilon_1/\sigma_1 \\ \varepsilon_2/\sigma_2 \\ \vdots \\ \varepsilon_n/\sigma_n \end{bmatrix}.$$
The transformed regression model is
$$Y_t^* = X_t^{*\prime}\beta^o + \varepsilon_t^*, \quad t = 1, \ldots, n,$$
where
$$Y_t^* = Y_t/\sigma_t, \quad X_t^* = X_t/\sigma_t, \quad \varepsilon_t^* = \varepsilon_t/\sigma_t.$$

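A sketch of the weighted least squares computation implied by this diagonal $V$, assuming the conditional standard deviations $\sigma_t$ are known (the function and its inputs are illustrative):

```python
import numpy as np

def weighted_ls(X, y, sigma):
    """GLS for the diagonal case V = diag(sigma_1^2, ..., sigma_n^2): divide each
    observation by its conditional standard deviation sigma_t (assumed known here)
    and run OLS on the rescaled data."""
    Xs = X / sigma[:, None]      # X*_t = X_t / sigma_t, row by row
    ys = y / sigma               # Y*_t = Y_t / sigma_t
    return np.linalg.lstsq(Xs, ys, rcond=None)[0]
```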
Example 2 [Eliminating Serial Correlation]: Suppose the errors follow an AR(1) process with autoregressive coefficient $\rho$, so that
$$V = \begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-2} & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-3} & \rho^{n-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{n-4} & \rho^{n-3} \\ \vdots & & & \ddots & & \vdots \\ \rho^{n-2} & \rho^{n-3} & \rho^{n-4} & \cdots & 1 & \rho \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & \rho & 1 \end{bmatrix}.$$
Then we have
$$V^{-1} = \frac{1}{1-\rho^2}\begin{bmatrix} 1 & -\rho & 0 & \cdots & 0 & 0 \\ -\rho & 1+\rho^2 & -\rho & \cdots & 0 & 0 \\ 0 & -\rho & 1+\rho^2 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & 1+\rho^2 & -\rho \\ 0 & 0 & 0 & \cdots & -\rho & 1 \end{bmatrix}$$
and
$$C = \begin{bmatrix} \sqrt{1-\rho^2} & 0 & 0 & \cdots & 0 & 0 \\ -\rho & 1 & 0 & \cdots & 0 & 0 \\ 0 & -\rho & 1 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 & 0 \\ 0 & 0 & 0 & \cdots & -\rho & 1 \end{bmatrix},$$
so that $C'C = (1-\rho^2)V^{-1}$; the proportionality factor $1-\rho^2$ can be absorbed into $\sigma^2$ and does not affect the GLS estimator. It follows that
$$\varepsilon^* = C\varepsilon = \begin{bmatrix} \sqrt{1-\rho^2}\,\varepsilon_1 \\ \varepsilon_2 - \rho\varepsilon_1 \\ \vdots \\ \varepsilon_n - \rho\varepsilon_{n-1} \end{bmatrix}.$$
The transformed regression model is
$$Y_t^* = X_t^{*\prime}\beta^o + \varepsilon_t^*, \quad t = 1, \ldots, n,$$
where
$$Y_1^* = \sqrt{1-\rho^2}\,Y_1, \qquad Y_t^* = Y_t - \rho Y_{t-1}, \quad t = 2, \ldots, n,$$
$$X_1^* = \sqrt{1-\rho^2}\,X_1, \qquad X_t^* = X_t - \rho X_{t-1}, \quad t = 2, \ldots, n,$$
$$\varepsilon_1^* = \sqrt{1-\rho^2}\,\varepsilon_1, \qquad \varepsilon_t^* = \varepsilon_t - \rho\varepsilon_{t-1}, \quad t = 2, \ldots, n.$$
The $\sqrt{1-\rho^2}$ transformation for $t = 1$ is called the Prais-Winsten transformation.
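A sketch of this transformation in code, assuming the AR(1) coefficient $\rho$ is known; OLS on the transformed data then gives the GLS estimator. The function name and inputs are illustrative.

```python
import numpy as np

def prais_winsten(X, y, rho):
    """Transform the data for AR(1) errors with known rho: the first observation is scaled
    by sqrt(1 - rho^2) (Prais-Winsten), later observations are quasi-differenced."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    Xs, ys = X.copy(), y.copy()
    Xs[0], ys[0] = np.sqrt(1 - rho ** 2) * X[0], np.sqrt(1 - rho ** 2) * y[0]
    Xs[1:], ys[1:] = X[1:] - rho * X[:-1], y[1:] - rho * y[:-1]
    return Xs, ys

# OLS on the transformed data (Xs, ys) is then the GLS estimator for this V.
```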

Theorem 3.16: Under Assumptions 3.1, 3.3(a) and 3.6,
(i) $E(\hat\beta^*|X) = \beta^o$;
(ii) $\mathrm{var}(\hat\beta^*|X) = \sigma^2(X^{*\prime}X^*)^{-1} = \sigma^2(X'V^{-1}X)^{-1}$;
(iii) $\mathrm{cov}(\hat\beta^*, e^*|X) = 0$, where $e^* = Y^* - X^*\hat\beta^*$;
(iv) $\hat\beta^*$ is BLUE;
(v) $E(s^{*2}|X) = \sigma^2$, where $s^{*2} = e^{*\prime}e^*/(n - K)$.

Proof: Results (i)-(iii) follow because the GLS estimator is the OLS estimator of the transformed model.
(iv) The transformed model satisfies Assumptions 3.1, 3.3 and 3.5 of the classical regression assumptions, with $\varepsilon^*|X \sim N(0, \sigma^2 I_n)$. It follows that GLS is BLUE by the Gauss-Markov theorem. Result (v) also follows immediately. This completes the proof.

Remarks:
Because $\hat\beta^*$ is the OLS estimator of the transformed regression model with i.i.d. $N(0, \sigma^2 I)$ errors, the t-test and F-test are applicable, and these test statistics are defined as follows:
$$T^* = \frac{R\hat\beta^* - r}{\sqrt{s^{*2}\,R(X^{*\prime}X^*)^{-1}R'}} \sim t_{n-K},$$
$$F^* = \frac{(R\hat\beta^* - r)'[R(X^{*\prime}X^*)^{-1}R']^{-1}(R\hat\beta^* - r)/J}{s^{*2}} \sim F_{J,\,n-K}.$$
It is very important to note that we still have to estimate the proportionality constant $\sigma^2$, in spite of the fact that $V = V(X)$ is known.

When testing whether all coefficients except the intercept are jointly zero, we have
$$(n - K)R^{*2} \stackrel{d}{\to} \chi^2_k.$$
Because the GLS estimator $\hat\beta^*$ is BLUE and the OLS estimator $\hat\beta$ differs from $\hat\beta^*$, OLS $\hat\beta$ cannot be BLUE:
$$\hat\beta^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'V^{-1}X)^{-1}X'V^{-1}Y, \qquad \hat\beta = (X'X)^{-1}X'Y.$$

In fact, the most important message of GLS is the insight it provides into the impact of conditional heteroskedasticity and serial correlation on the estimation and inference of the linear regression model. In practice, GLS is generally not feasible, because the $n \times n$ matrix $V$ is of unknown form, where $\mathrm{var}(\varepsilon|X) = \sigma^2 V$.

Question: What are feasible solutions?

Two Approaches

(i) First Approach: Adaptive Feasible GLS
In some cases, with additional assumptions, we can use a nonparametric estimator $\hat V$ to replace the unknown $V$; we then obtain the adaptive feasible GLS estimator
$$\hat\beta_a^* = (X'\hat V^{-1}X)^{-1}X'\hat V^{-1}Y,$$
where $\hat V$ is an estimator for $V$. Because $V$ is an $n \times n$ unknown matrix and we only have $n$ data points, it is impossible to estimate $V$ consistently using a sample of size $n$ if we do not impose any restriction on the form of $V$. In other words, we have to impose some restrictions on $V$ in order to estimate it consistently. For example, suppose we assume
$$\sigma^2 V = \mathrm{diag}\{\sigma_1^2(X), \ldots, \sigma_n^2(X)\} = \mathrm{diag}\{\sigma^2(X_1), \ldots, \sigma^2(X_n)\},$$
where $\mathrm{diag}\{\cdot\}$ is an $n \times n$ diagonal matrix and $\sigma^2(X_t) = E(\varepsilon_t^2|X_t)$ is unknown. The fact that $\sigma^2 V$ is a diagonal matrix can arise when $\mathrm{cov}(\varepsilon_t, \varepsilon_s|X) = 0$ for all $t \neq s$, i.e., when there is no serial correlation. Then we can use the nonparametric kernel estimator
$$\hat\sigma^2(x) = \frac{n^{-1}\sum_{t=1}^n e_t^2\, b^{-1}K\!\left(\frac{x - X_t}{b}\right)}{n^{-1}\sum_{t=1}^n b^{-1}K\!\left(\frac{x - X_t}{b}\right)} \;\stackrel{p}{\to}\; \sigma^2(x),$$
where $e_t$ is the estimated OLS residual, $K(\cdot)$ is a kernel function, which is a prespecified symmetric density function (e.g., $K(u) = (2\pi)^{-1/2}\exp(-\frac{1}{2}u^2)$ if $x$ is a scalar), and $b = b(n)$ is a bandwidth such that $b \to 0$, $nb \to \infty$ as $n \to \infty$. The finite sample distribution of $\hat\beta_a^*$ will be different from the finite sample distribution of $\hat\beta^*$, which assumes that $V$ is known. This is because the sampling errors of the estimator $\hat V$ have some impact on the estimator $\hat\beta_a^*$. However, under suitable conditions on $\hat V$, $\hat\beta_a^*$ will share the same asymptotic properties as the infeasible GLS $\hat\beta^*$ (i.e., the MSE of $\hat\beta_a^*$ is approximately equal to the MSE of $\hat\beta^*$). In other words, the first-stage estimation of $\sigma^2(\cdot)$ has no impact on the asymptotic distribution of $\hat\beta_a^*$. For more discussion, see Robinson (1988) and Stinchcombe and White (1991).
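A sketch of the kernel step, for a scalar regressor, assuming a Gaussian kernel and a user-supplied bandwidth b; the function name is illustrative:

```python
import numpy as np

def kernel_variance(x, e, b):
    """Nadaraya-Watson estimate of sigma^2(x_i) = E(e_t^2 | X_t = x_i) from the squared
    OLS residuals e, using a Gaussian kernel with bandwidth b (scalar-regressor sketch)."""
    u = (x[:, None] - x[None, :]) / b
    K = np.exp(-0.5 * u ** 2)            # Gaussian kernel; normalizing constants cancel in the ratio
    return (K @ e ** 2) / K.sum(axis=1)

# Feasible GLS then weights observation t by 1/sigma_hat(X_t), as in the WLS sketch above.
# For consistency the bandwidth must shrink: b -> 0 and n*b -> infinity as n -> infinity.
```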
(ii) Second Approach
Continue to use the OLS estimator $\hat\beta$, obtaining the correct formula
$$\mathrm{var}(\hat\beta|X) = \sigma^2 (X'X)^{-1}X'VX(X'X)^{-1}$$
as well as a consistent estimator for $\mathrm{var}(\hat\beta|X)$. The classical definitions of the t- and F-tests cannot be used, because they are based on an incorrect formula for $\mathrm{var}(\hat\beta|X)$. However, some modified tests can be obtained by using a consistent estimator for the correct formula for $\mathrm{var}(\hat\beta|X)$. The trick is to estimate $\sigma^2 X'VX$, which is a $K \times K$ unknown matrix, rather than to estimate $V$, which is an $n \times n$ unknown matrix. However, only asymptotic distributions can be used in this case.

Question: Suppose we assume
$$E(\varepsilon\varepsilon'|X) = \sigma^2 V = \mathrm{diag}\{\sigma_1^2(X), \ldots, \sigma_n^2(X)\}.$$
As pointed out earlier, this essentially assumes $E(\varepsilon_t\varepsilon_s|X) = 0$ for all $t \neq s$. That is, there is no serial correlation in $\{\varepsilon_t\}$ conditional on $X$. Instead of estimating $\sigma_t^2(X)$, one can estimate the $K \times K$ matrix $\sigma^2 X'VX$ directly.
Then, how to estimate
$$\sigma^2 X'VX = \sum_{t=1}^n X_t X_t'\,\sigma_t^2(X)?$$
We can use the following estimator:
$$X'D(e)D(e)'X = \sum_{t=1}^n X_t X_t'\,e_t^2,$$
where $D(e) = \mathrm{diag}(e_1, \ldots, e_n)$ is an $n \times n$ diagonal matrix with all off-diagonal elements being zero. This is called White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator. See more discussion in Chapter 4.

Question: For $J = 1$, do we have
$$\frac{R\hat\beta - r}{\sqrt{R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'}} \sim t_{n-K}?$$
For $J > 1$, do we have
$$(R\hat\beta - r)'\left[R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)/J \sim F_{J,\,n-K}?$$

No. Although we have standardized both test statistics by the correct variance estimators, we still have $\mathrm{cov}(\hat\beta, e|X) \neq 0$ under Assumption 3.6. This implies that $\hat\beta$ and $e$ are not independent, and therefore we no longer have a t-distribution or an F-distribution in finite samples.

However, when $n \to \infty$, we have:
(i) Case I ($J = 1$):
$$\frac{R\hat\beta - r}{\sqrt{R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'}} \stackrel{d}{\to} N(0, 1).$$
This can be called a robust t-test.

(ii) Case II ($J > 1$):
$$(R\hat\beta - r)'\left[R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r) \stackrel{d}{\to} \chi^2_J.$$
This is a robust Wald test statistic.
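A sketch of the robust Wald statistic based on White's estimator; the function and its inputs (X, y, R, r) are illustrative:

```python
import numpy as np
from scipy import stats

def robust_wald(X, y, R, r):
    """OLS with White's (1980) heteroskedasticity-consistent covariance estimator, and the
    robust Wald statistic for H0: R beta = r, compared with chi2(J) in large samples."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    meat = X.T @ (X * (e ** 2)[:, None])          # sum_t X_t X_t' e_t^2 = X'D(e)D(e)'X
    V_beta = XtX_inv @ meat @ XtX_inv             # estimator of var(beta_hat | X)
    d = R @ beta - r
    W = d @ np.linalg.solve(R @ V_beta @ R.T, d)
    return W, stats.chi2.sf(W, R.shape[0])
```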

The above two feasible solutions are based on the assumption that $E(\varepsilon_t\varepsilon_s|X) = 0$ for all $t \neq s$.
In fact, we can also consistently estimate the limit of $\sigma^2 X'VX$ when there exist both conditional heteroskedasticity and autocorrelation. This is called heteroskedasticity and autocorrelation consistent (HAC) variance-covariance estimation. When there exists serial correlation of unknown form, an alternative solution should be provided; this is discussed in Chapter 6. See also Andrews (1991) and Newey and West (1987, 1994).

3.10 Conclusion
In this chapter, we have presented the econometric theory for the classical linear regression models. We first provide and discuss a set of assumptions on which the classical linear regression model is built. This set of regularity conditions will serve as the starting point from which we will develop modern econometric theory for linear regression models.
We derive the statistical properties of the OLS estimator. In particular, we point out that $R^2$ is not a suitable model selection criterion, because it is always nondecreasing in the dimension of regressors. Suitable model selection criteria, such as AIC and BIC, are discussed. We show that conditional on the regressor matrix $X$, the OLS estimator $\hat\beta$ is unbiased, has a vanishing variance, and is BLUE. Under the additional conditional normality assumption, we derive the finite sample normal distribution for $\hat\beta$, the Chi-squared distribution for $(n-K)s^2/\sigma^2$, as well as the independence between $\hat\beta$ and $s^2$.
Many hypotheses encountered in economics can be formulated as linear restrictions on model parameters. Depending on the number of parameter restrictions, we derive the t-test and the F-test. In the special case of testing the hypothesis that all slope coefficients are jointly zero, we also derive an asymptotically Chi-squared test based on $R^2$.
When there exist conditional heteroskedasticity and/or autocorrelation, the OLS estimator is still unbiased and has a vanishing variance, but it is no longer BLUE, and $\hat\beta$ and $s^2$ are no longer mutually independent. Under the assumption of a known variance-covariance matrix up to some scale parameter, one can transform the linear regression model by correcting conditional heteroskedasticity and eliminating autocorrelation, so that the transformed regression model has conditionally homoskedastic and uncorrelated errors. The OLS estimator of this transformed linear regression model is called the GLS estimator, which is BLUE. The t-test and F-test are applicable. When the variance-covariance structure is unknown, the GLS estimator becomes infeasible. However, if the error in the original linear regression model is serially uncorrelated (as is the case with independent observations across $t$), there are two feasible solutions. The first is to use a nonparametric method to obtain a consistent estimator for the conditional variance $\mathrm{var}(\varepsilon_t|X_t)$ and then obtain a feasible plug-in GLS estimator. The second is to use White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator for the OLS estimator $\hat\beta$. Both of these methods are built on asymptotic theory. When the error of the original linear regression model is serially correlated, a feasible solution for estimating the variance-covariance matrix is provided in Chapter 6.

EXERCISES
3.1. Consider a bivariate linear regression model
o
Yt = Xt0 + "t ; t = 1; :::; n;

where Xt = (X0t ; X1t )0 = (1; X1t )0 ; and "t is a regression error.


(a) Let ^ = ( ^ 0 ; ^ 1 )0 be the OLS estimator. Show that ^ 0 = Y ^ X1 ; and
1
Pn
^1 = t=1 (X1t X1 )(Yt Y )
Pn
(X1t X1 )2
Pn t=1
(X1t X1 )Yt
= Pt=1 n
t=1 (X1t X1 )2
Xn
= Ct Yt ;
t=1
Pn
where Ct = (X1t X1 )= t=1 (X1t X1 )2 :
(b) Suppose X = (X11 ; :::; X1n )0 and " = ("1 ; :::; "n )0 are independent. Show that
var( ^ 1 jX) = 2" =[(n 1)SX2
1
]; where SX2
1
is the sample variance of fX1t gnt=1 and 2" is the
variance of "t :Thus, the more variations in fX1t g; the more accurate estimation for o1 :
(c) Let ^ denote the sample correlation between Yt and X1t ; namely,
Pn
(X1t X1 )(Yt Y )
^ = pPn t=1 P :
t=1 (X1t X1 )2 nt=1 (Yt Y )2
Show that R2 = ^2 :Thus, the squared sample correlation between Y and X1 is the
fraction of the sample variation in Y that can be predicted using the linear predictor
of X1 :This resuly also implies that R2 is a measure of the strength of sample linear
association between Yt and X1t :

3.2. For the OLS estimation of the linear regression model Yt = Xt0 o + "; where Xt is
a K 1 vector, show R2 = ^2Y Y^ ; the squared sample correlation between Yt and Y^t :

3.2. Suppose Xt = Q for all t m; where m is a …xed integer, and Q is a K 1 constant


vector. Do we have min (X0 X) ! 1 as n ! 1? Explain.

3.3. The adjusted $R^2$, denoted $\bar R^2$, is defined as follows:
$$\bar R^2 = 1 - \frac{e'e/(n-K)}{(Y - \bar Y)'(Y - \bar Y)/(n-1)}.$$
Show
$$\bar R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-K}.$$

3.4. [E¤ect of Multicolinearity] Consider a regression model

Yt = 0 + 1 X1t + 2 X2t + "t :

Suppose Assumptions 3.1–3.4 hold. Let ^ = ( ^ 0 ; ^ 1 ; ^ 2 )0 be the OLS estimator.


Show
2
var( ^ 1 jX) = Pn ;
(1 r^2 ) t=1 (X1t X1 )2
2
var( ^ 2 jX) = P ;
(1 r^2 ) nt=1 (X2t X2 )2
P P
where X1 = n 1 nt=1 X1t ; X2 = n 1 nt=1 X2t ; and
Pn 2
2 t=1 (X1t X1 )(X2t X2 )
r^ = Pn P :
t=1 (X1t X1 )2 nt=1 (X2t X2 )2
3.5. Consider the linear regression model
o
Yt = Xt0 + "t ;

where Xt = (1; X1t ; :::; Xkt )0 : Suppose Assumptions 31.–3.3 hold. Let Rj2 is the coef-
…cient of determination of regressing variable Xjt on all the other explanatory variables
fXit ; 0 i k; i 6= jg. Show
2
var( ^ j jX) = Pn ;
(1 Rj2 ) t=1 (Xjt Xj )2
P
where Xj = n 1 nt=1 Xjt : The factor 1=(1 Rj2 ) is called the variance in‡ation factor
(VIF); it is used to measure the degree of multicolinearity among explanatory variables
in Xt .

3.6. Consider the following linear regression model


o
Yt = Xt0 + ut ; t = 1; :::; n; (4.1)

where
ut = (Xt )"t ;
where fXt g is a nonstochastic process, and (Xt ) is a positive function of Xt such that
2 3
2
(X1 ) 0 0 ::: 0
6 2 7
6 0 (X2 ) 0 ::: 0 7
6 7
=66 0 0 2
(X3 ) ::: 0 7 = 12 12 ;
7
6 7
4 ::: ::: ::: ::: ::: 5
2
0 0 0 ::: (Xn )

with
2 3
(X1 ) 0 0 ::: 0
6 7
6 0 (X2 ) 0 ::: 0 7
1 6 7
2 =6
6 0 0 (X3 ) ::: 0 7:
7
6 7
4 ::: ::: ::: ::: ::: 5
0 0 0 ::: (Xn )
Assume that f"t g is i.i.d. N (0; 1): Then fut g is i.i.d. N (0; 2 (Xt )): This di¤ers from
Assumption 3.5 of the classical linear regression analysis, because now fut g exhibits
conditional heteroskedasticity.
Let ^ denote the OLS estimator for o :
(a) Is ^ unbiased for o ?
(b) Show that var( ^ ) = (X0 X) 1 X0 X(X0 X) 1 :
Consider an alternative estimator
~ = (X0 1
X) 1 X0 1
Y
" n # 1
X X
n
2
= (Xt )Xt Xt0 2
(Xt )Xt Yt :
t=1 t=1

(c) Is ~ unbiased for o ?


(d) Show that var( ~ ) = (X0 1 X) 1 :
(e) Is var( ^ ) var( ~ ) positive semi-de…nite (p.s.d.)? Which estimator, ^ or ~ ; is more
e¢ cient?
(f) Is ~ the Best Linear Unbiased Estimator (BLUE) for o ? [Hint: There are several
approaches to this question. A simple one is to consider the transformed model
o
Yt = Xt 0 + "t ; t = 1; :::; n; (4.2)

where Yt = Yt = (Xt ); Xt = Xt = (Xt ): This model is obtained from model (4.1) after
dividing by (Xt ): In matrix notation, model (4.2) can be written as
o
Y =X + ";
1 1
where the n 1 vector Y = 2 Y and the n k matrix X = 2 X:]

(g) Construct two test statistics for the null hypothesis of interest H0 : o2 = 0.
One test is based on ^ ; and the other test is based on ~ : What are the …nite sample
distributions of your test statistics under H0 ? Can you tell which test is better?
(h) Construct two test statistics for the null hypothesis of interest H0 : R o = r;
where R is a J k matrix with J > 0: One test is based on ^ ; and the other test is
based on ~ : What are the …nite sample distributions of your test statistics under H0 ?

3.7. Consider the following classical regression model

o
Yt = Xt0 + "t :

Suppose that we are interested in testing the null hypothesis

o
H0 : R = r;

where R is a J K matrix, and r is a J 1 vector. The F -test statistic is de…ned as

(R ^ r)0 [R(X 0 X) 1 R0 ] 1 (R ^ r)=J


F = :
s2
Show that
(~e0 e~ e0 e)=J
F = :
e0 e=(n k 1)
where e0 e is the sum of squared residuals from the unrestricted model, and e~0 e~ is the
sum of squared residuals from the restricted regression model subject to the restriction
R = r:

3.8. The F -test statistic is de…ned as follows:

(R ^ r)0 [R(X0 X) 1 R0 ] 1 (R ^ r)=J


F = :
s2
Show that
Pn ^
t=1 (Yt Y~t )2 =J
F =
s2
(^ ~ )0 X 0 X( ^ ~ )=J
= ;
s2

where Y^t = Xt0 ^ ; Y~t = Xt0 ~ ; and ^ ; ~ are the unrestricted and restricted OLS estimators
respectively.

3.9. Consider the following classical regression model

o
Yt = Xt0 + "t
Xk
o o
= 0 + j Xjt + "t ; t = 1; :::; n: (7.1)
j=1

Suppose that we are interested in testing the null hypothesis

o o o
H0 : 1 = 2 = = k = 0:

Then the F -statistic can be written as
e0 e~ e0 e)=k
(~
F = 0 :
e e=(n k 1)

where e0 e is the sum of squared residuals from the unrestricted model (7.1), and e~0 e~ is
the sum of squared residuals from the restricted model (7.2)

o
Yt = 0 + "t : (7.2)

(a) Show that under Assumptions 3.1 and 3.3,

R2 =k
F = ;
(1 R2 )=(n k 1)

where R2 is the coe¢ cient of determination of the unrestricted model (7.1).


(b) Suppose in addition Assumption 3.5 holds. Show that under H0 ;
d
(n k 1)R2 ! 2
k:

3.10. The F -test statistic is de…ned as folllows:

(R ^ r)0 [R(X0 X) 1 R0 ] 1 (R ^ r)=J


F = :
s2
Show that F
Pn
(1=J) t=1 (Yt
^ Y~t )2 (^ ~ )0 X 0 X( ^ ~ )=J
F = = ;
s2 s2

where Y^t = Xt0 ^ ; Y~t = Xt0 ~ ; and ^ ; ~ are the unrestricted and restricted OLS estimators
respectively.

3.11. [Structral Change] Suppose Assumptions 3.1 and 3.3 hold. Consider the fol-
lowing model on the whole sample:

o
Yt = Xt0 + (Dt Xt )0 o
+ "t ; t = 1; :::; n;

where the time dummy variable Dt = 0 if t n1 and Dt = 1 if t > n1 : This model


can be written as two separate models:

o
Yt = Xt0 + "t ; t = 1; :::; n1

and
o
Yt = Xt0 ( + o
) + "t ; t = n1 + 1; :::; n:

Let SSRu ; SSR1 ; SSR2 denotes the sums of squared residuals of the above three
regression models via OLS. Show

SSRu = SSR1 + SSR2 :

This identity implies that estimating the …rst regression mdoel with time dummy
variable Dt via OLS is equivalent to estimating two separate regression models over two
subsample periods respectively.

3.12. A quadratic polynomial regression model
$$Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_t^2 + \varepsilon_t$$
is fit to data. Suppose the p-value for the OLS estimator of $\beta_1$ is 0.67 and that for $\beta_2$ is 0.84. Can we accept the hypothesis that $\beta_1$ and $\beta_2$ are both 0? Explain.

3.13. Suppose X0 X is a K K matrix, and V is a n n matrix, and both X0 X and


V are symmetric and nonsingular, with the minimum eigenvalue min (X0 X) ! 1 as
n ! 1 and 0 < c max (V ) C < 1: Show that for any 2 RK such that 0 = 1;
0
var( ^ jX) = 2 0
(X0 X) 1 X0 V X(X0 X) 1
!0

as n ! 1: Thus, var( ^ jX) vanishes to zero as n ! 1 under conditional heteroskedas-


ticity.

3.14. Suppose the conditions in 3.9 hold. It can be shown that the variances of the
OLS ^ and GLS ^ are respectively:

var( ^ jX) = 2
(X0 X) 1 X0 V X(X0 X) 1 ;
var( ^ jX) = 2
(X0 V 1
X) 1 :

Show that var( ^ jX) var( ^ jX) is positive semi-de…nite.

3.15. Suppose a data generating process is given by


o o o
Yt = 1 X1t + 2 X2t + "t = Xt0 + "t ;

where Xt = (X1t ; X2t )0 ; E(Xt Xt0 ) is nonsingular, and E("t jXt ) = 0. For simplicity, we
further assume E(X2t ) = 0 and E(X1t X2t ) 6= 0:
Now consider the following bivariate linear regression model
o
Yt = 1 X1t + ut :

(a) Show that if o2 6= 0; then E(Y1 jXt ) = Xt0 o 6= E(Y1t jX1t ): That is, there exists
an omitted variable (X2t ) in the bivariate regression model.
(b) Show that E(Yt jX1t ) 6= 1 X1t for all 1 : That is, the bivariate linear regression
model is misspeci…ed for E(Yt jX1t ):
(c) Is the best linear least squares approximation co¢ cient 1 in the bivariate linear
regression model equal to o1 ?

3.16. Suppose a data generating process is given by


o o o
Yt = 1 X1t + 2 X2t + "t = Xt0 + "t ;

where Xt = (X1t ; X2t )0 ; and Assumptions 3.1–3.4 hold. (For simplicity, we have assumed
no intercept.) Denote the OLS estimator by ^ = ( ^ 1 ; ^ 2 )0 :
If o2 = 0 and we know it. Then we can consider a simpler regression
o
Yt = 1 X1t + "t :

Denote the OLS of this simpler regression as ~ 1 :


Please compare the relative e¢ ciency between ^ 1 and ~ 1 : That is, which estimator
is better for o1 ? Give your reasoning.

3.17. Suppose Assumption 3.6 is replaced by the following assumption:


Assumption 3.6 0 : "jX N (0; V ); where V = V (X) is a known n n …nite and positive
de…nite matrix.

Compared to Assumption 3.6, Assumption 3.60 assumes that var("jX) = V is com-


pletely known and there is no unknown proportionality 2 : De…ne GLS ^ = (X0 V 1 X) 1 X0 V 1
Y:
(a) Is ^ BLUE?
(b) Put X = CX and s 2 = e 0 e =(n K); where e = Y X ^ ; C 0 C = V: Do the
the usual t-test and F -test de…ned as
R^ r
T = p ; for J = 1;
s 2 R(X 0 X ) 1 R0
(R ^ r)0 [R(X 0 X ) 1 R0 ] 1 (R ^ r)=J
F = 2
s
follow the tn K and FJ;n K distributions respectively under the null hypothesis that
R = r? Explain.
(c) Construct two new test statistics:
R^ r
T~ = p ; for J = 1;
R(X 0 X ) 1 R0
~
Q = (R ^ r)0 [R(X 0 X ) 1 R0 ] 1 (R ^ r):

What distributions will these test statistics follow under the null hypothesis that
R = r? Explain.
(d) Which set of tests, (T ; F ) or (T~ ; Q ~ ); are more powerful at the same signi…-
cance level? Explain. [Hint: The t-distribution has a heavier tail than N (0; 1) and so
has a larger critical value at a given signi…cance level.]

3.18. Consider a linear regression model

0
Yt = Xt0 + "t ; t = 1; 2; :::; n;

where "t = (Xt )vt ; Xt is a K 1 nonstochastic vector, and (Xt ) is a positive function
of Xt ; and fvt g is i.i.d. N (0; 1):
Let ^ = (X 0 X) 1 X 0 Y denote the OLS estimator for 0 ; where X is a n K matrix
whose t-th row is Xt ; and Y is a n 1 vector whose t-th component is Yt :
(a) Is ^ unbiased for 0 ?
(b) Find var( ^ ) = E[( ^ E ^ )( ^ E ^ )0 ]: You may …nd the following notation useful:
= diagf 2 (X1 ); 2 (X2 ); :::; 2 (Xn )g; i.e., is a n n diagonal matrix with the t-th
diagonal component equal to 2 (Xt ) and all o¤-diagonal components equal to zero.
Consider the transformed regression model
1 1 0
Yt = X0 + vt
(Xt ) (Xt ) t
or
0
Yt = Xt 0 + vt ;
where Yt = 1 (Xt )Yt and Xt = 1 (Xt )Xt :
Denote the OLS estimator of this transformed model as ~ :
(c) Show

~ = (X 0 1
X) 1 X 0 1
Y:
(d) Is ~ unbiased for 0 ?
(e) Find var( ~ ):
(f) Which estimator, ^ or ~ ; is more e¢ cient in terms of the mean squared error
criterion? Give your reasoning.
(g) Use the di¤erence R ~ r to construct a test statistic for the null hypothesis of
interest H0 : R 0 = r; where R is a J K matrix, r is K 1; and J > 1: What is the
…nite sample distribution of your test statistic under H0 ?

CHAPTER 4 LINEAR REGRESSION MODELS
WITH I.I.D. OBSERVATIONS
Abstract: When the conditional normality assumption on the regression error does
not hold, the OLS estimator no longer has the …nite sample normal distribution, and
the t-test statistics and F -test statistics no longer follow the Student t-distribution and
a F -distribution in …nite samples respectively. In this chapter, we show that under the
assumption of i.i.d. observations with conditional homoskedasticity, the classical t-test
and F -test are approximately applicable in large samples. However, under conditional
heteroskedasticity, the t-test statistics and F -test statistics are not applicable even when
the sample size goes to in…nity. Instead, White’s (1980) heteroskedasticity-consistent
variance-covariance matrix estimator should be used, which yields asymptotically valid
hypothesis test procedures. A direct test for conditional heteroskedasticity due to White
(1980) is presented. To facilitate asymptotic analysis in this and subsequent chapters,
we …rst introduce some basic tools in asymptotic analysis.
Key words: Asymptotic analysis, Almost sure convergence, Central limit theorem,
Convergence in distribution, Convergence in quadratic mean, Convergence in probability,
I.I.D., Law of large numbers, the Slutsky theorem, White’s heteroskedasticity-consistent
variance-covariance matrix estimator.
Motivation

The assumptions of classical linear regression models are rather strong and one may
have a hard time …nding practical applications where all these assumptions hold exactly.
For example, it has been documented that most economic and …nancial data have heavy
tails, and so they are not normally distributed. An interesting question now is whether
the estimators and tests which are based on the same principles as before still make
sense in this more general setting. In particular, what happens to the OLS estimator,
the t- and F -tests if any of the following assumptions fails:

strict exogeneity ($E(\varepsilon_t|X) = 0$)?

normality ($\varepsilon|X \sim N(0, \sigma^2 I)$)?

conditional homoskedasticity ($\mathrm{var}(\varepsilon_t|X) = \sigma^2$)?

serial uncorrelatedness ($\mathrm{cov}(\varepsilon_t, \varepsilon_s|X) = 0$ for $t \neq s$)?

When classical assumptions are violated, we do not know the …nite sample statistical
properties of the estimators and test statistics anymore. A useful tool to obtain the

understanding of the properties of estimators and tests in this more general setting
is to pretend that we can obtain a limitless number of observations. We can then
pose the question how estimators and test statistics would behave when the number of
observations increases without limit. This is called asymptotic analysis. In practice, the
sample size is always …nite. However, the asymptotic properties translate into results
that hold true approximately in …nite samples, provided that the sample size is large
enough. We now need to introduce some basic analytic tools for asymptotic theory.
For more systematic introduction of asymptotic theory, see, for example, White (1994,
1999).

4.1 Introduction to Asymptotic Theory
In this section, we introduce some important convergence concepts and limit the-
orems. First, we introduce the concept of convergence in mean squares, which is a
distance measure of a sequence of random variables from a random variable.

De…nition 4.1 [Convergence in mean squares (or in quadratic mean)] A se-


quence of random variables/vectors/matrices Zn ; n = 1; 2; :::; is said to converge to Z in
mean squares as n ! 1 if

EjjZn Zjj2 ! 0 as n ! 1;

where jj jj is the sum of the absolute value of each component in Zn Z:

When Zn is a vector or matrix, convergence can be understood as convergence in


each element of Zn :When Zn Z is a l m matrix, where l and m are …xed positive
integers, then we can also de…ne the squared norm as

X
l X
m
jjZn Zjj2 = [Zn Z]2(t;s) :
t=1 s=1

Note that Zn converges to Z in mean squares if and only if each component of Zn


converges to the corresponding component of Z in mean squares.

2 1
Pn
Example 1: Suppose fZt g is i.i.d.( ; ); and Zn = n t=1 Zt : Then
q:m:
Zn ! :

Solution: Because E(Zn ) = ; we have

E(Zn )2 = var(Zn )
!
X
n
1
= var n Zt
t=1
!
1 X
n
= var Zt
n2 t=1

1 X
n
= var(Zt )
n2 t=1
2
=
n
! 0 as n ! 1:

It follows that
2
E(Zn )2 = ! 0 as n ! 1:
n
Next, we introduce the concept of convergence in probability that is another popular
distance measure between a sequence of random variables and a random variable.

De…nition 4.2 [Convergence in probability] Zn converges to Z in probability if for


any given constant > 0;
Pr[jjZn Zjj > ] ! 0 as n ! 1 or
Pr[jjZn Zjj ] ! 1 as n ! 1:
For convergence in probability, we can also write
p
Zn Z ! 0 or Zn Z = oP (1);
The notation oP (1) means that Zn Z vanishes to zero in probability. When Z = b
p
is a constant, we can write Zn ! b and b = p lim Zn is called the probability limit of Zn :
Convergence in probability is also called weak convergence or convergence with prob-
p
ability approaching one. When Zn ! Z; the probability that the di¤erence jjZn Zjj
exceeds any given small constant is rather small for all n su¢ ciently large. In other
words, Zn will be very close to Z with very high probability, when the sample size n is
su¢ ciently large.
To gain more intuition of the convergence in probability, we de…ne the event
An ( ) = f! 2 : jZn (!) Z(!)j > g;
where ! is a basic outcome in sample space : Then convergence in probability says that
the probability of event An ( ) may be nonzero for any …nite n; but such a probability
will eventually vanish to zero as n ! 1: In other words, it becomes less and less likely
that the di¤erence jZn Zj is larger than a prespeci…ed constant > 0: Or, we have
more and more con…dence that the di¤erence jZn Zj will be smaller than as n ! 1:
The constant can be viewed as a prespeci…ed tolerance level.

Lemma 4.1 [Weak Law of Large Numbers (WLLN) for I.I.D. Sample] Suppose
P
fZt g is i.i.d.( ; 2 ); and de…ne Zn = n 1 nt=1 Zt ; n = 1; 2; ::: : Then
p
Zn ! as n ! 1:
Proof: For any given constant > 0; we have by Chebyshev’s inequality
E(Zn )2
Pr(jZn j > ) 2
2
= 2
! 0 as n ! 1:
n
Hence,
p
Zn ! as n ! 1:
This is the so-called weak law of large numbers (WLLN). In fact, we can weaken the
moment condition.

We now provide an economic interpretation of the WLLN using an example. In


…nance, there is a popular trading strategy called buy-and-hold trading startegy. An
investor buys a stock at some day and then hold it for a long time period before he sells
it out. This is called a buy-and-hold trading strategy. How is the average return of this
trading strategy?

Suppose Zt is the return of the stock on period t; and the returns over di¤erent time
periods are i.i.d.( ; 2 ): Also assume the investor holds the stock for a total of n period.
Then the average return over each time period is the sample mean

1X
n
Z= Zt :
n t=1

When the number n of holding periods is large, we have


p
Z! = E(Zt )

as n ! 1: Thus, the average return of the buy-and-hold trading startegy is approx-


imately equal to when n is su¢ ciently large.

Lemma 4.2 [WLLN for I.I.D. Random Sample] Suppose fZt g is i.i.d. with
P
E(Zt ) = and EjZt j < 1: De…ne Zn = n 1 nt=1 Zt : Then
p
Zn ! as n ! 1:

Question: Why do we need the moment condition EjZt j < 1?


We can consider a counter example: Suppose fZt g is a sequence of i.i.d. Cauchy(0; 1)
random variables whose moments do not exist. Then Zn Cauchy(0; 1) for all n 1;
and so it does not converge in probability to some constant as n ! 1:

We now introduce a useful related concept:

De…nition 4.3 [Boundedness in Probability] A sequence of random variables/vectors/matrices


fZn g is bounded in probability if for any small constant > 0; there exists a constant
C < 1 such that
P (jjZn jj > C)

as n ! 1: We denote
Zn = OP (1):
Intuitively, when Zn = OP (1); the probability that jjZn jj exceeds a very large constant is
small as n ! 1. Or, equivalently, jjZn jj is smaller than C with a very high probability
as n ! 1:

2
Example 2: Suppose Zn N( ; ) for all n 1: Then

Zn = OP (1):

Solution: For any > 0; we always have a su¢ ciently large constant C = C( ) > 0
such that

P (jZn j > C) = 1 P( C Zn C)
C Zn C
= 1 P

C C+
= 1 +
;

where (z) = P (Z z) is the CDF of N (0; 1): [We can choose C such that [(C
)= ] 1 12 and [ (C + )= ] 12 :]

A Special Case: What happens to C if Zn N (0; 1)?

In this case,

P (jZn j > C) = 1 (C) + ( C)


= 2[1 (C)]:

Suppose we set
2[1 (C)] = ;
that is, we set
1
C= 1 ;
2
1
where ( ) is the inverse function of ( ): Then we have

P (jZn j > C) = :

The following lemma provides a convenient way to verify convergence in probability.

q:m: p
Lemma 4.3: If Zn Z ! 0; then Zn Z ! 0:

Proof: By Chebyshev’s inequality, we have

E[Zn Z]2
P (jZn Zj > ) 2
!0

for any given > 0 as n ! 1: This completes the proof.

Example 3: Suppose Assumptions 3.1–3.4 hold. Does the OLS estimator ^ converges
in probability to o ?

Solution: From Theorem 3.4, we have

0
E[( ^ o
)( ^ o 0
) jX] = 2 0
(X 0 X) 1

! 0

for any 2 RK ; 0 = 1 as n ! 1 with probability one. It follows that Ejj ^ o 2


jj =
p
EfE[jj ^ jj jX]g ! 0 as n ! 1: Therefore, by Lemma 4.3, we have ^ ! :
o 2 o

Example 4: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Does s2 converge in
probability to 2 ?

Solution: Under the given assumptions,

s2 2
(n K) 2 n K;

4
and therefore we have E(s2 ) = 2 and var(s2 ) = n2 K : It follows that E(s2 2 2
) =
4 2 q:m: 2 2 p 2
2 =(n K) ! 0; s ! and so s ! because convergence in quadratic mean
implies convergence in probability.

While convergence in mean squares implies convergence in probability, the converse


is not true. We now give an example.

Example 5: Suppose (
1
0 with prob 1 n
Zn =
n with prob n1 :
p
Then Zn ! 0 as n ! 1 but E(Zn 0)2 = n ! 1: Please verify it.

Solution:
(i) For any given " > 0; we have
1
P (jZn 0j > ") = P (Zn = n) = ! 0:
n
(ii)
X
E(Zn 0)2 = (zn 0)2 f (zn )
zn 2f0;ng

= (0 0)2 1 n 1
+ (n 0)2 n 1

= n ! 1:

Next, we provide another convergence concept called almost sure convergence.

De…nition 4.4 [Almost Sure Convergence] fZn g converges to Z almost surely if


h i
Pr lim jjZn Zjj = 0 = 1:
n!1

a:s:
We denote Zn Z ! 0:

To gain intuition for the concept of almost sure convergence, recall the de…nition of
a random variable: any random variable is a mapping from the sample space to the
real line, namely Z : ! R: Let ! be a basic outcome in the sample space : De…ne a
subset in :
Ac = f! 2 : lim Zn (!) = Z(!)g:
n!1
c
That is, A is the set of basic outcomes on which the sequence of fZn ( )g converges to
Z( ) as n ! 1: Then almost sure convergence can be stated as

P (Ac ) = 1:

In other words, the convergent set Ac has probability one to occur.

Example 6: Let ! be uniformly distributed on [0; 1]; and de…ne

Z(!) = ! for all ! 2 [0; 1]:

and
Zn (!) = ! + ! n for ! 2 [0; 1]:
a:s:
Is Zn Z ! 0?

Solution: Consider

Ac = f! 2 : lim jZn (!) Z(!)j = 0g:


n!1

Because for any given ! 2 [0; 1); we always have

lim jZn (!) Z(!)j = lim j(! + ! n ) !j


n!1 n!1
= lim ! n = 0:
n!1

In contrast, for ! = 1; we have

lim jZn (1) Z(1)j = 1n = 1 6= 0:


n!1

Thus, Ac = [0; 1) and P (Ac ) = 1: We also have P (A) = P (! = 1) = 0:

In probability theory, almost sure convergence is closely related to pointwise conver-


gence (almost everywhere). It is also called strong convergence.

Lemma 4.4 [Strong Law of Large Numbers (SLLN) for I.I.D. Random Sam-
ples] Suppose fZt g be i.i.d. with E(Zt ) = and EjZt j < 1: Then
a:s:
Zn ! as n ! 1:

Almost sure convergence implies convergence in probability but not vice versa.
p p
Question: If s2 ! 2
; do we have s ! ?

Answer: Yes. It follows from the following continuity lemma with the choice of g(s2 ) =
p
s2 = s:
p p
Lemma 4.5 [Continuity]: (i) Suppose an ! a and bn ! b; and g( ) and h( ) are
continuous functions. Then
p
g(an ) + h(bn ) ! g(a) + h(b); and
p
g(an )h(bn ) ! g(a)h(b):

(ii) Similar results hold for almost sure convergence.

The last convergence concept we will introduce is called convergenve in distribution.

It should be emphasized that convergence in mean squres, convergence in probability


and almost sure convergence all measure the closeness between the random variable Zn

and the random variable Z: This di¤ers from the concept of convergence in distribution
introduced in Chapter 3. There, convergence in distribution is de…ned in terms of the
closeness of the CDF Fn (z) of Zt to the CDF F (z) of Z; not between the closeness of
the random variable Zn to the random variable Z: As a result, for convergence in mean
squares, convergence in probability and almost sure convergence, Zn converges to Z if
and only if convergence of Zn to Z occurs element by element (that is, each element of
Zn converges to the corresponding element of Z). For the convergence in distribution
of Zn to Z, however, element by element convergence does not imply convergence in
distribution of Zn to Z;because element-wise convergence in distribution ignores the
d
relationships among the components of Zn : Nevertheless, Zn ! Z does imply element
by element convergence in distribution. That is, convergence in joint distribution implies
convergence in marginal distribution.
The main purpose of asymptotic analysis is to derive the large sample distribution
of the estimator or statistic of interest and use it as an approximation in statistical
inference. For this purpose, we need to make use of an important limit theorem, namely
Central Limit Theorem (CLT). We now state and prove the CLT for i.i.d. random
samples, a fundamental limit theorem in probability theory.

Lemma 4.6 [Central Limit Theorem (CLT) for I.I.D. Random Samples]: Suppose $\{Z_t\}$ is i.i.d.$(\mu, \sigma^2)$, and $\bar Z_n = n^{-1}\sum_{t=1}^n Z_t$. Then as $n \to \infty$,
$$\frac{\bar Z_n - E(\bar Z_n)}{\sqrt{\mathrm{var}(\bar Z_n)}} = \frac{\bar Z_n - \mu}{\sqrt{\sigma^2/n}} = \frac{\sqrt{n}(\bar Z_n - \mu)}{\sigma} \stackrel{d}{\to} N(0, 1).$$

Proof: Put
$$Y_t = \frac{Z_t - \mu}{\sigma},$$
and $\bar Y_n = n^{-1}\sum_{t=1}^n Y_t$. Then
$$\frac{\sqrt{n}(\bar Z_n - \mu)}{\sigma} = \sqrt{n}\,\bar Y_n.$$
The characteristic function of $\sqrt{n}\,\bar Y_n$ is
$$\phi_n(u) = E[\exp(iu\sqrt{n}\,\bar Y_n)], \qquad i = \sqrt{-1},$$
$$= E\left[\exp\left(\frac{iu}{\sqrt{n}}\sum_{t=1}^n Y_t\right)\right] = \prod_{t=1}^n E\left[\exp\left(\frac{iu}{\sqrt{n}}Y_t\right)\right] \quad \text{by independence}$$
$$= \left[\phi_Y\!\left(\frac{u}{\sqrt{n}}\right)\right]^n \quad \text{by identical distribution}$$
$$= \left[\phi_Y(0) + \phi_Y'(0)\frac{u}{\sqrt{n}} + \phi_Y''(0)\frac{u^2}{2n} + \cdots\right]^n = \left[1 - \frac{u^2}{2n} + o(n^{-1})\right]^n \to \exp\left(-\frac{u^2}{2}\right) \quad \text{as } n \to \infty,$$
where the second equality follows from independence, the third equality follows from identical distribution, the fourth equality follows from the Taylor series expansion, and $\phi_Y(0) = 1$, $\phi_Y'(0) = iE(Y_t) = 0$, $\phi_Y''(0) = -E(Y_t^2) = -1$. Note that $o(n^{-1})$ denotes a remainder term that vanishes faster than $n^{-1}$ as $n \to \infty$, and we have also made use of the fact that $\left(1 + \frac{a}{n} + o(n^{-1})\right)^n \to e^a$.
More rigorously, we can show
$$\ln\phi_n(u) = n\ln\phi_Y\!\left(\frac{u}{\sqrt{n}}\right) = \frac{\ln\phi_Y(u/\sqrt{n})}{n^{-1}} \to \frac{u}{2}\lim_{n\to\infty}\frac{\phi_Y'(u/\sqrt{n})}{n^{-1/2}\,\phi_Y(u/\sqrt{n})} = \frac{u^2}{2}\lim_{n\to\infty}\frac{\phi_Y''(u/\sqrt{n})\,\phi_Y(u/\sqrt{n}) - [\phi_Y'(u/\sqrt{n})]^2}{\phi_Y(u/\sqrt{n})^2} = -\frac{u^2}{2},$$
where the two limits follow from two successive applications of L'Hopital's rule. It follows that
$$\lim_{n\to\infty}\phi_n(u) = e^{-\frac{1}{2}u^2}.$$
This is the characteristic function of $N(0, 1)$. By the uniqueness of the characteristic function, the asymptotic distribution of $\sqrt{n}(\bar Z_n - \mu)/\sigma$ is $N(0, 1)$. This completes the proof.
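A small simulation, with an arbitrarily chosen (and clearly non-normal) exponential population, illustrating the result just proved: the standardized sample mean behaves approximately like $N(0,1)$ even though each $Z_t$ is far from normal. All numbers below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 500, 5000
mu, sigma = 1.0, 1.0                                 # mean and std. dev. of the Exponential(1) population
Z = rng.exponential(scale=1.0, size=(reps, n))       # a skewed, non-normal population
stat = np.sqrt(n) * (Z.mean(axis=1) - mu) / sigma    # sqrt(n)(Zbar_n - mu)/sigma, one value per replication
print(round(stat.mean(), 3), round(stat.std(), 3))   # close to 0 and 1
print(round(np.mean(np.abs(stat) > 1.96), 3))        # close to 0.05, as the N(0,1) limit predicts
```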
Lemma 4.7 [Cramer-Wold Device]: A $p \times 1$ random vector $Z_n \stackrel{d}{\to} Z$ if and only if for any nonzero $\lambda \in \mathbb{R}^p$ such that $\lambda'\lambda = \sum_{j=1}^p \lambda_j^2 = 1$, we have
$$\lambda' Z_n \stackrel{d}{\to} \lambda' Z.$$

This lemma is useful for obtaining asymptotic multivariate distributions.

Lemma 4.8 [Slutsky Theorem]: Let $Z_n \stackrel{d}{\to} Z$, $a_n \stackrel{p}{\to} a$ and $b_n \stackrel{p}{\to} b$, where $a$ and $b$ are constants. Then
$$a_n + b_n Z_n \stackrel{d}{\to} a + bZ \quad \text{as } n \to \infty.$$

Question: If $X_n \stackrel{d}{\to} X$ and $Y_n \stackrel{d}{\to} Y$, is $X_n + Y_n \stackrel{d}{\to} X + Y$?

Answer: No. We consider two examples:

Example 7: Xn and Yn are independent N(0,1). Then


d
Xn + Yn ! N (0; 2):

Example 8: Xn = Yn N (0; 1) for all n 1: Then

Xn + Yn = 2Xn N (0; 4):

Example 9: Suppose Assumptions 3.1, 3.3(a) and 3.5, and the hypothesis H0 : R o = r
hold, where R is a J K nonstochastic matrix with rank J, r is a J 1 nonstochastic
vector, and J K. Then the quadratic form
(R ^ r)0 [R(X0 X) 1 R0 ] 1 (R ^ r) 2
2 J:

2
Suppose now we replace by s2 : What is the asymptotic distribution of the quadratic
form
(R ^ r)0 [R(X0 X) 1 R0 ] 1 (R ^ r)
?
s2
Finally, we introduce a lemma which is very useful in deriving the asymptotic distributions of nonlinear statistics (i.e., nonlinear functions of the random sample).

Lemma 4.9 [Delta Method]: Suppose $\sqrt{n}(Z_n - \mu)/\sigma \stackrel{d}{\to} N(0, 1)$, and $g(\cdot)$ is continuously differentiable with $g'(\mu) \neq 0$. Then as $n \to \infty$,
$$\sqrt{n}[g(Z_n) - g(\mu)] \stackrel{d}{\to} N(0, [g'(\mu)]^2\sigma^2).$$

Proof: First, because $\sqrt{n}(Z_n - \mu)/\sigma \stackrel{d}{\to} N(0, 1)$ implies $\sqrt{n}(Z_n - \mu)/\sigma = O_P(1)$, we have $Z_n - \mu = O_P(n^{-1/2}) = o_P(1)$.
Next, by a first order Taylor series expansion, we have
$$Y_n \equiv g(Z_n) = g(\mu) + g'(\bar\mu_n)(Z_n - \mu),$$
where $\bar\mu_n = \lambda\mu + (1-\lambda)Z_n$ for some $\lambda \in [0, 1]$. It follows by the Slutsky theorem that
$$\sqrt{n}\,\frac{g(Z_n) - g(\mu)}{\sigma} = g'(\bar\mu_n)\,\frac{\sqrt{n}(Z_n - \mu)}{\sigma} \stackrel{d}{\to} N(0, [g'(\mu)]^2),$$
where $g'(\bar\mu_n) \stackrel{p}{\to} g'(\mu)$ given $\bar\mu_n \stackrel{p}{\to} \mu$.
By the Slutsky theorem again, we have
$$\sqrt{n}[Y_n - g(\mu)] \stackrel{d}{\to} N(0, [g'(\mu)]^2\sigma^2).$$
This completes the proof.

The Delta method is a Taylor series approximation in a statistical context. It linearizes a smooth (i.e., differentiable) nonlinear statistic so that the CLT can be applied to the linearized statistic. Therefore, it can be viewed as a generalization of the CLT from a sample average to a nonlinear statistic. This method is very useful when more than one parameter makes up the function to be estimated and more than one random variable is used in the estimator.

Example 10: Suppose $\sqrt{n}(Z_n - \mu)/\sigma \stackrel{d}{\to} N(0, 1)$, $\mu \neq 0$ and $0 < \sigma < \infty$. Find the limiting distribution of $\sqrt{n}(Z_n^{-1} - \mu^{-1})$.

Solution: Put $g(Z_n) = Z_n^{-1}$. Because $\mu \neq 0$, $g(\cdot)$ is continuous at $\mu$. By a first order Taylor series expansion, we have
$$g(Z_n) = g(\mu) + g'(\bar\mu_n)(Z_n - \mu), \quad\text{or}\quad Z_n^{-1} - \mu^{-1} = -\bar\mu_n^{-2}(Z_n - \mu),$$
where $\bar\mu_n = \lambda\mu + (1-\lambda)Z_n \stackrel{p}{\to} \mu$ given $Z_n \stackrel{p}{\to} \mu$ and $\lambda \in [0, 1]$. It follows that
$$\sqrt{n}(Z_n^{-1} - \mu^{-1}) = -\frac{\sqrt{n}(Z_n - \mu)}{\bar\mu_n^{2}} \stackrel{d}{\to} N(0, \sigma^2/\mu^4).$$
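A simulation check of this example, with hypothetical values $\mu = 2$ and $\sigma = 1$: the spread of $\sqrt{n}(\bar Z_n^{-1} - \mu^{-1})$ is close to $\sigma/\mu^2 = 0.25$, as the Delta method predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 2.0, 1.0, 2000, 5000
Z = rng.normal(mu, sigma, size=(reps, n))
stat = np.sqrt(n) * (1.0 / Z.mean(axis=1) - 1.0 / mu)   # sqrt(n)(Zbar_n^{-1} - mu^{-1})
print(round(stat.std(), 4), sigma / mu ** 2)            # sample spread vs. the predicted sigma/mu^2 = 0.25
```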

Taylor series expansions, various convergence concepts, laws of large numbers, central
limit theorems, and slutsky theorem constitute a tool kit of asymptotic analysis. We
now use these asymptotic tools to investigate the large sample behavior of the OLS
estimator and related statistics in subsequent chapters.

4.2 Framework and Assumptions
We first state the assumptions under which we will establish the asymptotic theory for linear regression models.

Assumption 4.1 [I.I.D.]: $\{Y_t, X_t'\}_{t=1}^n$ is an i.i.d. random sample.

Assumption 4.2 [Linearity]:
$$Y_t = X_t'\beta^o + \varepsilon_t, \quad t = 1, \ldots, n,$$
for some unknown $K \times 1$ parameter $\beta^o$ and some unobservable random variable $\varepsilon_t$.

Assumption 4.3 [Correct Model Specification]: $E(\varepsilon_t|X_t) = 0$ a.s. with $E(\varepsilon_t^2) = \sigma^2 < \infty$.

Assumption 4.4 [Nonsingularity]: The $K \times K$ matrix
$$Q = E(X_t X_t')$$
is nonsingular and finite.

Assumption 4.5: The $K \times K$ matrix $V \equiv \mathrm{var}(X_t\varepsilon_t) = E(X_t X_t'\varepsilon_t^2)$ is finite and positive definite (p.d.).

Remarks:
The i.i.d. observations assumption in Assumption 4.1 implies that the asymptotic
theory developed in this chapter will be applicable to cross-sectional data, but not time
series data. The observations of the later are usually correlated and will be considered
in Chapter 5. Put Zt = (Yt ; Xt0 )0 : Then I.I.D. implies that Zt and Zs are independent
when t 6= s, and the Zt have the same distribution for all t: The identical distribution
means that the observations are generated from the same data generating process, and
independence means that di¤erent observations contain new information about the data
generating process.
Assumptions 4.1 and 4.3 imply the strict exogeneity condition (Assumption 3.2)
holds, because we have

E("t jX) = E("t jX1 ; X2 ; :::Xt ; :::Xn )


= E("t jXt )
= 0 a:s:

As a most important feature of Assumptions 4.1–4.5 together, we allow for condi-
tional heteroskedasticity (i.e., var("t jXt ) 6= 2 a.s.); and do not assume normality for the
conditional distribution of "t jXt . It is possible that var("t jXt ) may be correlated with
Xt : For example, the variation of the output of a …rm may depend on the size of the
…rm, and the variation of a household may depend on its income level. In economics
and …nance, conditional heteroskedasticity is more likely to occur in cross-sectional ob-
servations than in time series observations, and for time series observations, conditional
heteroskedasticity is more likely to occur for high-frequency data than low-frequency
data. In this chapter, we will consider the e¤ect of conditional heteroskedasticity in
cross-section observations. The e¤ect of conditional heteroskedasticity in time series
observations will be considered in Chapter 5.
On the other hand, relaxation of the normality assumption is more realistic for
economic and …nancial data. For example, it has been well documented (Mandelbrot
1963, Fama 1965, Kon 1984) that returns on …nancial assets are not normally distributed.
However, the I.I.D. assumption implies that cov("t ; "s ) = 0 for all t 6= s: That is, there
exists no serial correlation in the regression disturbance.
2
Among other things, Assumption 4.4 implies E(Xjt ) < 1 for 0 j k: By the
SLLN for i.i.d. random samples, we have

1X
n
X0 X a:s:
= Xt Xt0 ! E(Xt Xt0 ) = Q
n n t=1

as n ! 1: Hence, when n is large, the matrix X0 X behaves approximately like nQ; whose
minimum eigenvalue min (nQ) = n min (Q) ! 1 at the rate of n: Thus, Assumption 4.4
implies Assumption 3.3.
When X0t = 1; Assumption 4.5 implies E("2t ) < 1: If E("2t jXt ) = 2 < 1 a.s.,
i.e., there exists conditional homoskedasticity, then Assumption 4.5 can be ensured by
Assumption 4.4. More generally, there exists conditional heteroskedasticity, the moment
condition in Assumption 4.5 can be ensured by the moment conditions that E("4t ) < 1
4
and E(Xjt ) < 1 for 0 j k; because by repeatedly using the Cauchy-Schwarz
inequality twice, we have

jE("2t Xjt Xlt )j [E("4t )]1=2 [E(Xjt


2
Xlt2 )]1=2
[E("4t )]1=2 [E(Xjt
4
)E(Xlt4 )]1=4

where 0 j; l k and 1 t n:

We now address the following questions:

Consistency of OLS?

Asymptotic normality?

Asymptotic e¢ ciency?

Con…dence interval estimation?

Hypothesis testing?

In particular, we are interested in knowing whether the statistical properties of OLS


^ and related test statistics derived under the classical linear regression setup are still
valid under the current setup, at least when n is large.

4.3 Consistency of OLS


Suppose we have a random sample $\{Y_t, X_t'\}_{t=1}^n$. Recall the OLS estimator:
$$\hat\beta = (X'X)^{-1}X'Y = \left(\frac{X'X}{n}\right)^{-1}\frac{X'Y}{n} = \hat Q^{-1}\,n^{-1}\sum_{t=1}^n X_t Y_t,$$
where
$$\hat Q = n^{-1}\sum_{t=1}^n X_t X_t'.$$
Substituting $Y_t = X_t'\beta^o + \varepsilon_t$, we obtain
$$\hat\beta = \beta^o + \hat Q^{-1}\,n^{-1}\sum_{t=1}^n X_t\varepsilon_t.$$

We will consider the consistency of $\hat\beta$ directly.

Theorem 4.10 [Consistency of OLS]: Under Assumptions 4.1-4.4, as $n \to \infty$,
$$\hat\beta \stackrel{p}{\to} \beta^o \quad\text{or}\quad \hat\beta - \beta^o = o_P(1).$$

Proof: Let C > 0 be some bounded constant. Also, recall Xt = (X0t ; X1t ; :::; Xkt )0 :
First, the moment condition holds: for all 0 j k;
1 1
2 2
EjXjt "t j (EXjt ) (E"2t ) 2 by the Cauchy-Schwarz inequality
1 1
C2C2
C

2
where E(Xjt ) C by Assumption 4.4, and E("2t ) C by Assumption 4.3. It follows
from WLLN (with Zt = Xt "t ) that

X
n
p
1
n Xt "t ! E(Xt "t ) = 0;
t=1

where

E(Xt "t ) = E[E(Xt "t jXt )] by the law of iterated expectations


= E[Xt E("t jXt )]
= E(Xt 0)
= 0:

Applying WLLN again (with Zt = Xt Xt0 ) and noting that


1
2
EjXjt Xlt j [E(Xjt )E(Xlt2 )] 2 C

by the Cauchy-Schwarz inequality for all pairs (j; l); where 0 j; l k; we have
p
^!
Q E(Xt Xt0 ) = Q:

^ 1 p 1
Hence, we have Q !Q by continuity. It follows that

^ o
= (X0 X) 1 X0 "
Xn
^
= Q n1 1
Xt "t
t=1
p 1
!Q 0 = 0:

This completes the proof.
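A simulation sketch of this consistency result, with a hypothetical data generating process satisfying Assumptions 4.1-4.4 (and exhibiting conditional heteroskedasticity): the OLS estimate approaches $\beta^o$ as $n$ grows. The parameter values and design are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
beta_o = np.array([1.0, 2.0])                            # true parameter (intercept, slope)
for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    eps = np.sqrt(0.5 + x ** 2) * rng.normal(size=n)     # E(eps|x) = 0 but var(eps|x) depends on x
    y = X @ beta_o + eps
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    print(n, np.round(beta_hat, 4))                      # approaches (1, 2) as n grows
```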

4.4 Asymptotic Normality of OLS


Next, we derive the asymptotic distribution of ^ : We …rst provide a multivariate CLT
for I.I.D. random samples.

Lemma 4.11 [Multivariate Central Limit Theorem (CLT) for I.I.D. Random
Samples]: Suppose fZt g is a sequence of i.i.d. random vectors with E(Zt ) = 0 and
var(Zt ) = E(Zt Zt0 ) = V is …nite and positive de…nite. De…ne

X
n
1
Zn = n Zt :
t=1

Then as n ! 1;
p d
nZn ! N (0; V )
or
1 p d
V 2 nZn ! N (0; I):
p
Question: What is the variance-covariance matrix of n Zn ?
Answer: Noting that E(Zt ) = 0; we have
!
p 1
Xn
var( nZn ) = var n 2 Zt
t=1
" ! !0 #
1
X
n
1
X
n
= E n 2 Zt n 2 Zs
t=1 s=1
X
n X
n
1
= n E(Zt Zs0 )
t=1 s=1
X
n
1
= n E(Zt Zt0 ) (because Zt and Zs are independent for t 6= s)
t=1
= E(Zt Zt0 )
= V:
p
In other words, the variance of nZn is identical to the variance of each individual
random vector Zt :

Theorem 4.12 [Asymptotic Normality of OLS] Under Assumptions 4.1-4.5, we


have
p d
n( ^ o
) ! N (0; Q 1 V Q 1 )
as n ! 1; where V var(Xt "t ) = E(Xt Xt0 "2t ):

Proof: Recall that


p 1
X
n
n( ^ o ^ 1n
)=Q 2 Xt "t :
t=1

First, we consider the second term

1
X
n
n 2 Xt "t :
t=1

Noting that E(Xt "t ) = 0 by Assumption 4.3, and var(Xt "t ) = E(Xt Xt0 "2t ) = V; which
is …nite and p.d. by Assumption 4.5. Then, by the CLT for i.i.d. random sequences

fZt = Xt "t g, we have
!
1
X
n
p X
n
1
n 2 Xt "t = n n Xt "t
t=1 t=1
p
= n Zn
d
! Z ~ N (0; V ):

On the other hand, as shown earlier, we have


p
^!
Q Q;

and so
^ 1 p 1
Q !Q
given that Q is nonsingular so that the inverse function is continuous and well de…ned.
It follows by the Slutsky Theorem that

p 1
X
n
n( ^ o ^ 1n
) = Q 2 Xt "t
t=1
d 1
!Q Z N (0; Q V Q 1 ):
1

This completes the proof.

Remarks:
The theorem implies that the asymptotic mean of $\sqrt{n}(\hat{\beta} - \beta^o)$ is equal to 0; that is, the mean of $\sqrt{n}(\hat{\beta} - \beta^o)$ is approximately 0 when $n$ is large.
It also implies that the asymptotic variance of $\sqrt{n}(\hat{\beta} - \beta^o)$ is $Q^{-1}VQ^{-1}$; that is, the variance of $\sqrt{n}(\hat{\beta} - \beta^o)$ is approximately $Q^{-1}VQ^{-1}$. Because the asymptotic variance is a different concept from the variance of $\sqrt{n}(\hat{\beta} - \beta^o)$, we denote the asymptotic variance of $\sqrt{n}(\hat{\beta} - \beta^o)$ as $\text{avar}(\sqrt{n}\,\hat{\beta}) = Q^{-1}VQ^{-1}$.
We now consider a special case under which we can simplify the expression of $\text{avar}(\sqrt{n}\,\hat{\beta})$.

Special Case: Conditional Homoskedasticity

Assumption 4.6: $E(\varepsilon_t^2|X_t) = \sigma^2$ a.s.

Theorem 4.13: Suppose Assumptions 4.1–4.6 hold. Then as $n \to \infty$,
$$\sqrt{n}(\hat{\beta} - \beta^o) \xrightarrow{d} N(0, \sigma^2 Q^{-1}).$$

Proof: Under Assumption 4.6, we can simplify
$$V = E(X_t X_t'\varepsilon_t^2) = E[E(X_t X_t'\varepsilon_t^2|X_t)] \quad\text{by the law of iterated expectations}$$
$$= E[X_t X_t' E(\varepsilon_t^2|X_t)] = \sigma^2 E(X_t X_t') = \sigma^2 Q.$$
The result follows immediately because
$$Q^{-1}VQ^{-1} = Q^{-1}\sigma^2 Q Q^{-1} = \sigma^2 Q^{-1}.$$

Remarks:
Under conditional homoskedasticity, the asymptotic variance of $\sqrt{n}(\hat{\beta} - \beta^o)$ is
$$\text{avar}(\sqrt{n}\,\hat{\beta}) = \sigma^2 Q^{-1}.$$

Question: Is the OLS estimator $\hat{\beta}$ the BLUE estimator asymptotically (i.e., when $n \to \infty$)?

4.5 Asymptotic Variance Estimator


To construct confidence interval estimators or hypothesis tests, we need to estimate the asymptotic variance of $\sqrt{n}(\hat{\beta} - \beta^o)$, $\text{avar}(\sqrt{n}\,\hat{\beta})$. Because the expression of $\text{avar}(\sqrt{n}\,\hat{\beta})$ differs under conditional homoskedasticity and conditional heteroskedasticity, we consider the estimator for $\text{avar}(\sqrt{n}\,\hat{\beta})$ under these two cases separately.

Case I: Conditional Homoskedasticity

In this case, the asymptotic variance of $\sqrt{n}(\hat{\beta} - \beta^o)$ is
$$\text{avar}(\sqrt{n}\,\hat{\beta}) = Q^{-1}VQ^{-1} = \sigma^2 Q^{-1}.$$

Question: How to estimate $Q$?

Lemma 4.14: Suppose Assumptions 4.1, 4.2 and 4.4 hold. Then
$$\hat{Q} = n^{-1}\sum_{t=1}^{n} X_t X_t' \xrightarrow{p} Q.$$

Question: How to estimate $\sigma^2$?

Recalling that $\sigma^2 = E(\varepsilon_t^2)$, we use the sample residual variance estimator
$$s^2 = \frac{e'e}{n-K} = \frac{1}{n-K}\sum_{t=1}^{n} e_t^2 = \frac{1}{n-K}\sum_{t=1}^{n}(Y_t - X_t'\hat{\beta})^2.$$

Theorem 4.15 [Consistent Estimator for $\sigma^2$]: Under Assumptions 4.1–4.4,
$$s^2 \xrightarrow{p} \sigma^2.$$

Proof: Given that $s^2 = e'e/(n-K)$ and
$$e_t = Y_t - X_t'\hat{\beta} = \varepsilon_t + X_t'\beta^o - X_t'\hat{\beta} = \varepsilon_t - X_t'(\hat{\beta} - \beta^o),$$
we have
$$s^2 = \frac{1}{n-K}\sum_{t=1}^{n}[\varepsilon_t - X_t'(\hat{\beta} - \beta^o)]^2$$
$$= \frac{n}{n-K}\left(n^{-1}\sum_{t=1}^{n}\varepsilon_t^2\right) + (\hat{\beta} - \beta^o)'\left[(n-K)^{-1}\sum_{t=1}^{n} X_t X_t'\right](\hat{\beta} - \beta^o) - 2(\hat{\beta} - \beta^o)'(n-K)^{-1}\sum_{t=1}^{n} X_t\varepsilon_t$$
$$\xrightarrow{p} 1\cdot\sigma^2 + 0'\,Q\,0 - 2\cdot 0'\cdot 0 = \sigma^2,$$
given that $K$ is a fixed number (i.e., $K$ does not grow with the sample size $n$), where we have made use of the WLLN in three places.

We can then consistently estimate $\sigma^2 Q^{-1}$ by $s^2\hat{Q}^{-1}$.
Theorem 4.16 [Asymptotic Variance Estimator of $\sqrt{n}(\hat{\beta} - \beta^o)$]: Under Assumptions 4.1–4.4, we have
$$s^2\hat{Q}^{-1} \xrightarrow{p} \sigma^2 Q^{-1}.$$

Remarks:
The asymptotic variance estimator of $\sqrt{n}(\hat{\beta} - \beta^o)$ is
$$s^2\hat{Q}^{-1} = s^2(X'X/n)^{-1}.$$
This is equivalent to saying that the variance estimator of $\hat{\beta} - \beta^o$ is approximately equal to
$$s^2\hat{Q}^{-1}/n = s^2(X'X)^{-1}$$
when $n$ is large. Thus, under conditional homoskedasticity and with $n \to \infty$, the variance estimator of $\hat{\beta} - \beta^o$ coincides with the form of the variance estimator for $\hat{\beta} - \beta^o$ in the classical regression case. Because of this, as will be seen below, the conventional t-test and F-test are still valid for large samples under conditional homoskedasticity.
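The following is a minimal Python/NumPy sketch (not part of the original notes) of how the classical variance estimator $s^2(X'X)^{-1}$ and the corresponding standard errors and t-ratios can be computed; the simulated data, parameter values, and all function and variable names are illustrative assumptions.

```python
import numpy as np

def ols_classical(Y, X):
    """OLS with the classical variance estimator s^2 (X'X)^{-1},
    appropriate as a large-sample variance estimator under conditional homoskedasticity."""
    n, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ Y)
    e = Y - X @ beta_hat                 # OLS residuals e_t
    s2 = (e @ e) / (n - K)               # s^2 = e'e/(n-K), consistent for sigma^2
    var_beta = s2 * XtX_inv              # estimator of var(beta_hat - beta^o)
    se = np.sqrt(np.diag(var_beta))      # conventional standard errors
    return beta_hat, se, s2

# Example usage with simulated homoskedastic data (illustrative):
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
beta_hat, se, s2 = ols_classical(Y, X)
print(beta_hat, se, beta_hat / se)       # t-ratios are approximately N(0,1) for large n
```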

Case II: Conditional Heteroskedasticity

In this case,
$$\text{avar}(\sqrt{n}\,\hat{\beta}) = Q^{-1}VQ^{-1},$$
which cannot be simplified.

Question: We can still use $\hat{Q}$ to estimate $Q$. How to estimate $V = E(X_t X_t'\varepsilon_t^2)$?

We can use its sample analog
$$\hat{V} = n^{-1}\sum_{t=1}^{n} X_t X_t' e_t^2 = \frac{X'D(e)D(e)'X}{n},$$
where
$$D(e) = \text{diag}(e_1, e_2, \ldots, e_n)$$
is an $n \times n$ diagonal matrix with diagonal elements equal to $e_t$ for $t = 1, \ldots, n$. To ensure consistency of $\hat{V}$ for $V$, we impose the following additional moment conditions.

Assumption 4.7: (i) $E(X_{jt}^4) < \infty$ for all $0 \le j \le k$; and (ii) $E(\varepsilon_t^4) < \infty$.

Lemma 4.17: Suppose Assumptions 4.1–4.5 and 4.7 hold. Then
$$\hat{V} \xrightarrow{p} V.$$

Proof: Because $e_t = \varepsilon_t - (\hat{\beta} - \beta^o)'X_t$, we have
$$\hat{V} = n^{-1}\sum_{t=1}^{n} X_t X_t'\varepsilon_t^2 + n^{-1}\sum_{t=1}^{n} X_t X_t'[(\hat{\beta} - \beta^o)'X_t X_t'(\hat{\beta} - \beta^o)] - 2n^{-1}\sum_{t=1}^{n} X_t X_t'[\varepsilon_t X_t'(\hat{\beta} - \beta^o)]$$
$$\xrightarrow{p} V + 0 - 2\cdot 0,$$
where for the first term, we have
$$n^{-1}\sum_{t=1}^{n} X_t X_t'\varepsilon_t^2 \xrightarrow{p} E(X_t X_t'\varepsilon_t^2) = V$$
by the WLLN and Assumption 4.7, which implies
$$E|X_{it}X_{jt}\varepsilon_t^2| \le [E(X_{it}^2 X_{jt}^2)E(\varepsilon_t^4)]^{1/2} < \infty.$$
For the second term, we have, element by element,
$$n^{-1}\sum_{t=1}^{n} X_{it}X_{jt}(\hat{\beta} - \beta^o)'X_t X_t'(\hat{\beta} - \beta^o) = \sum_{l=0}^{k}\sum_{m=0}^{k}(\hat{\beta}_l - \beta^o_l)(\hat{\beta}_m - \beta^o_m)\left(n^{-1}\sum_{t=1}^{n} X_{it}X_{jt}X_{lt}X_{mt}\right) \xrightarrow{p} 0,$$
given $\hat{\beta} - \beta^o \xrightarrow{p} 0$ and
$$n^{-1}\sum_{t=1}^{n} X_{it}X_{jt}X_{lt}X_{mt} \xrightarrow{p} E(X_{it}X_{jt}X_{lt}X_{mt}) = O(1)$$
by the WLLN and Assumption 4.7.
Similarly, for the last term, we have
$$n^{-1}\sum_{t=1}^{n} X_{it}X_{jt}\varepsilon_t X_t'(\hat{\beta} - \beta^o) = \sum_{l=0}^{k}(\hat{\beta}_l - \beta^o_l)\left(n^{-1}\sum_{t=1}^{n} X_{it}X_{jt}X_{lt}\varepsilon_t\right) \xrightarrow{p} 0,$$
given $\hat{\beta} - \beta^o \xrightarrow{p} 0$ and
$$n^{-1}\sum_{t=1}^{n} X_{it}X_{jt}X_{lt}\varepsilon_t \xrightarrow{p} E(X_{it}X_{jt}X_{lt}\varepsilon_t) = 0$$
by the WLLN and Assumption 4.7. This completes the proof.


We now construct a consistent estimator for $\text{avar}(\sqrt{n}\,\hat{\beta})$ under conditional heteroskedasticity.

Theorem 4.18 [Asymptotic Variance Estimator for $\sqrt{n}(\hat{\beta} - \beta^o)$]: Under Assumptions 4.1–4.5 and 4.7, we have
$$\hat{Q}^{-1}\hat{V}\hat{Q}^{-1} \xrightarrow{p} Q^{-1}VQ^{-1}.$$

Remarks:
This is the so-called White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator for $\sqrt{n}(\hat{\beta} - \beta^o)$. It follows that when there exists conditional heteroskedasticity, the estimator for the variance of $\hat{\beta} - \beta^o$ is
$$(X'X/n)^{-1}\hat{V}(X'X/n)^{-1}/n = (X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1},$$
which differs from the estimator $s^2(X'X)^{-1}$ used in the case of conditional homoskedasticity.
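The following is a minimal Python/NumPy sketch (not part of the original notes) of White's heteroskedasticity-consistent covariance estimator $(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}$; the function and variable names are illustrative assumptions.

```python
import numpy as np

def white_vcov(Y, X):
    """OLS with White's (1980) heteroskedasticity-consistent covariance estimator
    (X'X)^{-1} X'D(e)D(e)'X (X'X)^{-1} for beta_hat."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ Y)
    e = Y - X @ beta_hat
    # X'D(e)D(e)'X = sum_t e_t^2 X_t X_t', computed without forming the n x n matrix D(e)
    meat = (X * e[:, None]**2).T @ X
    vcov = XtX_inv @ meat @ XtX_inv      # robust variance estimator of beta_hat - beta^o
    return beta_hat, vcov

# Robust (heteroskedasticity-consistent) standard errors are the square roots of the diagonal:
# beta_hat, vcov = white_vcov(Y, X); se_robust = np.sqrt(np.diag(vcov))
```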

Question: What happens if we use $s^2\hat{Q}^{-1}$ as an estimator for $\text{avar}[\sqrt{n}(\hat{\beta} - \beta^o)]$ when there exists conditional heteroskedasticity?

Observe that
$$V \equiv E(X_t X_t'\varepsilon_t^2) = \sigma^2 Q + \text{cov}(X_t X_t', \varepsilon_t^2) = \sigma^2 Q + \text{cov}[X_t X_t', \sigma^2(X_t)],$$
where $\sigma^2 = E(\varepsilon_t^2)$, $\sigma^2(X_t) = E(\varepsilon_t^2|X_t)$, and the last equality follows from the LIE. Thus, if $\sigma^2(X_t)$ is positively correlated with $X_t X_t'$, $\sigma^2 Q$ will underestimate the true variance-covariance matrix $E(X_t X_t'\varepsilon_t^2)$ in the sense that $V - \sigma^2 Q$ is a positive definite matrix. Consequently, the standard t-test and F-test will overreject the correct null hypothesis at any given significance level; there will exist substantial Type I errors.

Question: What happens if one uses the asymptotic variance estimator $\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}$ when there exists conditional homoskedasticity?

The asymptotic variance estimator is asymptotically valid, but it will not perform as well as the estimator $s^2\hat{Q}^{-1}$ in finite samples, because the latter exploits the information of conditional homoskedasticity.

4.6 Hypothesis Testing


Question: How to construct a test statistic for the null hypothesis
$$H_0: R\beta^o = r,$$
where $R$ is a $J \times K$ constant matrix and $r$ is a $J \times 1$ constant vector?

We first consider
$$R\hat{\beta} - r = R(\hat{\beta} - \beta^o) + R\beta^o - r.$$
It follows that under $H_0: R\beta^o = r$, we have
$$\sqrt{n}(R\hat{\beta} - r) \xrightarrow{d} N(0, RQ^{-1}VQ^{-1}R').$$
The test procedures will differ depending on whether there exists conditional heteroskedasticity. We first consider the case of conditional homoskedasticity.

Case I: Conditional Homoskedasticity

Under conditional homoskedasticity, we have $V = \sigma^2 Q$ and so
$$\sqrt{n}(R\hat{\beta} - r) \xrightarrow{d} N(0, \sigma^2 RQ^{-1}R')$$
when $H_0$ holds.

When $J = 1$, we can use the conventional t-test statistic for large sample inference.

Theorem 4.19 [t-test]: Suppose Assumptions 4.1–4.4 and 4.6 hold. Then under $H_0$ with $J = 1$,
$$T = \frac{R\hat{\beta} - r}{\sqrt{s^2 R(X'X)^{-1}R'}} \xrightarrow{d} N(0, 1)$$
as $n \to \infty$.

Proof: Given $R\sqrt{n}(\hat{\beta} - \beta^o) \xrightarrow{d} N(0, \sigma^2 RQ^{-1}R')$, $R\beta^o = r$ under $H_0$, and $J = 1$, we have
$$\frac{\sqrt{n}(R\hat{\beta} - r)}{\sqrt{\sigma^2 RQ^{-1}R'}} = \frac{R\sqrt{n}(\hat{\beta} - \beta^o)}{\sqrt{\sigma^2 RQ^{-1}R'}} \xrightarrow{d} N(0, 1).$$
By the Slutsky theorem and $\hat{Q} = X'X/n$, we obtain
$$\frac{\sqrt{n}(R\hat{\beta} - r)}{\sqrt{s^2 R\hat{Q}^{-1}R'}} \xrightarrow{d} N(0, 1).$$
This ratio is the conventional t-test statistic we examined in Chapter 3, namely
$$\frac{\sqrt{n}(R\hat{\beta} - r)}{\sqrt{s^2 R\hat{Q}^{-1}R'}} = \frac{R\hat{\beta} - r}{\sqrt{s^2 R(X'X)^{-1}R'}} = T.$$
For $J > 1$, we use a quadratic form test statistic.

Theorem 4.20 [Asymptotic $\chi^2$ Test]: Suppose Assumptions 4.1–4.4 and 4.6 hold. Then under $H_0$,
$$J\cdot F \equiv (R\hat{\beta} - r)'\left[s^2 R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r) \xrightarrow{d} \chi^2_J$$
as $n \to \infty$.

Proof: Under $H_0$, the quadratic form
$$\sqrt{n}(R\hat{\beta} - r)'\left[\sigma^2 RQ^{-1}R'\right]^{-1}\sqrt{n}(R\hat{\beta} - r) \xrightarrow{d} \chi^2_J.$$
Also, $s^2\hat{Q}^{-1} \xrightarrow{p} \sigma^2 Q^{-1}$, so by the Slutsky theorem
$$\sqrt{n}(R\hat{\beta} - r)'\left[s^2 R\hat{Q}^{-1}R'\right]^{-1}\sqrt{n}(R\hat{\beta} - r) \xrightarrow{d} \chi^2_J,$$
or equivalently
$$J\cdot F = J\cdot\frac{(R\hat{\beta} - r)'[R(X'X)^{-1}R']^{-1}(R\hat{\beta} - r)/J}{s^2} \xrightarrow{d} \chi^2_J,$$
namely
$$J\cdot F \xrightarrow{d} \chi^2_J.$$
Remarks:
When $\{\varepsilon_t\}$ is not i.i.d. $N(0, \sigma^2)$ conditional on $X_t$, we cannot use the $F$ distribution, but we can still compute the F-statistic; the appropriate test statistic is then $J$ times the F-statistic, which is asymptotically $\chi^2_J$. That is,
$$J\cdot F = \frac{\tilde{e}'\tilde{e} - e'e}{e'e/(n-K)} \xrightarrow{d} \chi^2_J.$$
Because $J\cdot F_{J, n-K}$ approaches $\chi^2_J$ as $n \to \infty$, we may interpret the above theorem in the following way: the classical results for the F-test are still approximately valid under conditional homoskedasticity when $n$ is large.
When the null hypothesis is that all slope coefficients except the intercept are jointly zero, we can use a test statistic based on $R^2$.

A Special Case: Testing for Joint Significance of All Economic Variables

Theorem 4.21 [$(n-K)R^2$ Test]: Suppose Assumptions 4.1–4.6 hold, and we are interested in testing the null hypothesis that
$$H_0: \beta^o_1 = \beta^o_2 = \cdots = \beta^o_k = 0,$$
where the $\beta^o_j$ are the regression coefficients from
$$Y_t = \beta^o_0 + \beta^o_1 X_{1t} + \cdots + \beta^o_k X_{kt} + \varepsilon_t.$$
Let $R^2$ be the coefficient of determination from the unrestricted regression model
$$Y_t = X_t'\beta^o + \varepsilon_t.$$
Then under $H_0$,
$$(n-K)R^2 \xrightarrow{d} \chi^2_k,$$
where $K = k + 1$.

Proof: First, recall that in this special case we have
$$F = \frac{R^2/k}{(1-R^2)/(n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-K)}.$$
By the above theorem and noting $J = k$, we have
$$k\cdot F = \frac{(n-K)R^2}{1-R^2} \xrightarrow{d} \chi^2_k$$
under $H_0$. This implies that $k\cdot F$ is bounded in probability; that is,
$$\frac{(n-K)R^2}{1-R^2} = O_P(1).$$
Consequently, given that $k$ is a fixed integer,
$$\frac{R^2}{1-R^2} = O_P(n^{-1}) = o_P(1),$$
or
$$R^2 \xrightarrow{p} 0.$$
Therefore, $1 - R^2 \xrightarrow{p} 1$. By the Slutsky theorem, we have
$$(n-K)R^2 = \frac{(n-K)R^2}{1-R^2}\,(1-R^2) = (k\cdot F)(1-R^2) \xrightarrow{d} \chi^2_k,$$
or asymptotically equivalently,
$$(n-K)R^2 \xrightarrow{d} \chi^2_k.$$
This completes the proof.


Question: Do we also have $nR^2 \xrightarrow{d} \chi^2_k$?
Yes, because
$$nR^2 = \frac{n}{n-K}\,(n-K)R^2 \quad\text{and}\quad \frac{n}{n-K} \to 1.$$
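The following is a minimal Python sketch (not part of the original notes) of the $(n-K)R^2$ test for the joint significance of all slope coefficients under conditional homoskedasticity; the function and variable names are illustrative assumptions, and SciPy is used only for the chi-square distribution.

```python
import numpy as np
from scipy import stats

def joint_significance_test(Y, X):
    """(n-K)R^2 test of H0: all slope coefficients are zero.
    Assumes the first column of X is the intercept; valid under conditional homoskedasticity."""
    n, K = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ beta_hat
    R2 = 1.0 - (e @ e) / np.sum((Y - Y.mean())**2)   # coefficient of determination
    stat = (n - K) * R2                               # asymptotically chi^2 with K-1 degrees of freedom
    pval = 1.0 - stats.chi2.cdf(stat, df=K - 1)
    return stat, pval
```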

Case II: Conditional Heteroskedasticity

Recall that under $H_0$,
$$\sqrt{n}(R\hat{\beta} - r) = R\sqrt{n}(\hat{\beta} - \beta^o) + \sqrt{n}(R\beta^o - r) = R\sqrt{n}(\hat{\beta} - \beta^o) \xrightarrow{d} N(0, RQ^{-1}VQ^{-1}R'),$$
where
$$V = E(X_t X_t'\varepsilon_t^2).$$
Therefore, when $J = 1$, we have
$$\frac{\sqrt{n}(R\hat{\beta} - r)}{\sqrt{RQ^{-1}VQ^{-1}R'}} \xrightarrow{d} N(0, 1) \quad\text{as } n \to \infty.$$
Given $\hat{Q} \xrightarrow{p} Q$ and $\hat{V} \xrightarrow{p} V$, where $\hat{V} = X'D(e)D(e)'X/n$, and the Slutsky theorem, we can define a robust t-test statistic
$$T_r = \frac{\sqrt{n}(R\hat{\beta} - r)}{\sqrt{R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R'}} \xrightarrow{d} N(0, 1) \quad\text{as } n \to \infty$$
when $H_0$ holds. By robustness, we mean that $T_r$ is valid no matter whether there exists conditional heteroskedasticity.

Theorem 4.22 [Robust t-Test Under Conditional Heteroskedasticity]: Suppose Assumptions 4.1–4.5 and 4.7 hold. Then under $H_0$ with $J = 1$, as $n \to \infty$, the robust t-test statistic
$$T_r = \frac{\sqrt{n}(R\hat{\beta} - r)}{\sqrt{R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R'}} \xrightarrow{d} N(0, 1).$$

When $J > 1$, we have the quadratic form
$$W = \sqrt{n}(R\hat{\beta} - r)'\left[RQ^{-1}VQ^{-1}R'\right]^{-1}\sqrt{n}(R\hat{\beta} - r) \xrightarrow{d} \chi^2_J$$
under $H_0$. Given $\hat{Q} \xrightarrow{p} Q$ and $\hat{V} \xrightarrow{p} V$, the robust Wald test statistic
$$W = \sqrt{n}(R\hat{\beta} - r)'\left[R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R'\right]^{-1}\sqrt{n}(R\hat{\beta} - r) \xrightarrow{d} \chi^2_J$$
by the Slutsky theorem.

We can write $W$ equivalently as follows:
$$W = (R\hat{\beta} - r)'\left[R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r),$$
where we have used the fact that
$$\hat{V} = \frac{1}{n}\sum_{t=1}^{n} X_t e_t e_t X_t' = \frac{X'D(e)D(e)'X}{n},$$
where $D(e) = \text{diag}(e_1, e_2, \ldots, e_n)$.

Theorem 4.23 [Robust Wald Test Under Conditional Heteroskedasticity]: Suppose Assumptions 4.1–4.5 and 4.7 hold. Then under $H_0$, as $n \to \infty$,
$$W = n(R\hat{\beta} - r)'\left[R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R'\right]^{-1}(R\hat{\beta} - r) \xrightarrow{d} \chi^2_J.$$

Remarks:
Under conditional heteroskedasticity, the test statistics $J\cdot F$ and $(n-K)R^2$ cannot be used.

Question: What happens if there exists conditional heteroskedasticity but $J\cdot F$ or $(n-K)R^2$ is used?

There will exist Type I errors, because $J\cdot F$ and $(n-K)R^2$ are no longer asymptotically $\chi^2$-distributed under $H_0$.

Although the general form of the Wald test statistic developed here can be used no matter whether there exists conditional homoskedasticity, this general form of test statistic may perform poorly in small samples. Thus, if one has information that the error term is conditionally homoskedastic, one should use the test statistics derived under conditional homoskedasticity, which will perform better in small samples. For this reason, it is important to test whether conditional homoskedasticity holds.
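The following is a minimal Python sketch (not part of the original notes) of the robust Wald test of $H_0: R\beta^o = r$ based on White's covariance estimator; the function and variable names are illustrative assumptions, and SciPy is used only for the chi-square distribution.

```python
import numpy as np
from scipy import stats

def robust_wald_test(Y, X, R, r):
    """Robust Wald test of H0: R beta^o = r under (possible) conditional heteroskedasticity.
    W = (Rb - r)' [R (X'X)^{-1} X'D(e)D(e)'X (X'X)^{-1} R']^{-1} (Rb - r) ~ chi^2_J under H0."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ Y)
    e = Y - X @ b
    vcov = XtX_inv @ ((X * e[:, None]**2).T @ X) @ XtX_inv   # White covariance of b
    diff = R @ b - r
    W = diff @ np.linalg.solve(R @ vcov @ R.T, diff)
    J = R.shape[0]
    pval = 1.0 - stats.chi2.cdf(W, df=J)
    return W, pval

# Example (illustrative): test that the 2nd and 3rd coefficients are jointly zero in a model with K = 3:
# R = np.array([[0., 1., 0.], [0., 0., 1.]]); r = np.zeros(2)
# W, pval = robust_wald_test(Y, X, R, r)
```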

4.7 Testing for Conditional Homoskedasticity


We now introduce a method to test for conditional heteroskedasticity.

Question: How to test conditional homoskedasticity for $\{\varepsilon_t\}$ in a linear regression model?

There have been many tests for conditional homoskedasticity. Here, we introduce a popular one due to White (1980).

White's (1980) Test

The null hypothesis is
$$H_0: E(\varepsilon_t^2|X_t) = \sigma^2,$$
where $\varepsilon_t$ is the regression error in the linear regression model
$$Y_t = X_t'\beta^o + \varepsilon_t.$$

First, suppose $\varepsilon_t$ were observed, and consider the auxiliary regression
$$\varepsilon_t^2 = \alpha_0 + \sum_{j=1}^{k}\alpha_j X_{jt} + \sum_{1\le j\le l\le k}\alpha_{jl} X_{jt}X_{lt} + v_t = \text{vech}(X_t X_t')'\alpha + v_t = U_t'\alpha + v_t,$$
where $\text{vech}(X_t X_t')$ is an operator that stacks all lower triangular elements of the matrix $X_t X_t'$ into a $\frac{K(K+1)}{2}\times 1$ column vector. For example, when $X_t = (1, X_{1t}, X_{2t})'$, we have
$$\text{vech}(X_t X_t') = (1, X_{1t}, X_{2t}, X_{1t}^2, X_{1t}X_{2t}, X_{2t}^2)'.$$
For the auxiliary regression, there is a total of $\frac{K(K+1)}{2}$ regressors in $U_t$. This is essentially regressing $\varepsilon_t^2$ on the intercept, $X_t$, and the quadratic and cross-product terms of $X_t$. Under $H_0$, all coefficients except the intercept are jointly zero. Any nonzero coefficient will indicate the existence of conditional heteroskedasticity. Thus, we can test $H_0$ by checking whether all coefficients except the intercept are jointly zero. Assuming that $E(\varepsilon_t^4|X_t) = \mu_4$ (which implies $E(v_t^2|X_t) = \sigma^2_v$ under $H_0$), we can run an OLS regression and construct an $R^2$-based test statistic. Under $H_0$, we obtain
$$(n - J - 1)\tilde{R}^2 \xrightarrow{d} \chi^2_J,$$
where $J = \frac{K(K+1)}{2} - 1$ is the number of regressors excluding the intercept.
Unfortunately, $\varepsilon_t$ is not observable. However, we can replace $\varepsilon_t$ with $e_t = Y_t - X_t'\hat{\beta}$ and run the feasible auxiliary regression
$$e_t^2 = \alpha_0 + \sum_{j=1}^{k}\alpha_j X_{jt} + \sum_{1\le j\le l\le k}\alpha_{jl} X_{jt}X_{lt} + \tilde{v}_t = \text{vech}(X_t X_t')'\alpha + \tilde{v}_t.$$
The resulting test statistic satisfies
$$(n - J - 1)R^2 \xrightarrow{d} \chi^2_J.$$
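The following is a minimal Python sketch (not part of the original notes) of White's test based on the feasible auxiliary regression of $e_t^2$ on $\text{vech}(X_t X_t')$; the function and variable names are illustrative assumptions, and SciPy is used only for the chi-square distribution.

```python
import numpy as np
from scipy import stats

def white_test(Y, X):
    """White's (1980) test for conditional heteroskedasticity.
    Regresses squared OLS residuals on U_t = vech(X_t X_t') and uses (n-J-1)R^2 ~ chi^2_J.
    Assumes the first column of X is the intercept."""
    n, K = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    e2 = (Y - X @ b)**2                                   # squared OLS residuals e_t^2
    # Build U_t = vech(X_t X_t'): all distinct products X_{jt} X_{lt}, j <= l
    U = np.column_stack([X[:, j] * X[:, l] for j in range(K) for l in range(j, K)])
    J = U.shape[1] - 1                                    # regressors excluding the intercept
    alpha = np.linalg.lstsq(U, e2, rcond=None)[0]         # OLS in the auxiliary regression
    v = e2 - U @ alpha
    R2 = 1.0 - (v @ v) / np.sum((e2 - e2.mean())**2)
    stat = (n - J - 1) * R2
    pval = 1.0 - stats.chi2.cdf(stat, df=J)
    return stat, pval
```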

It can be shown that the replacement of $\varepsilon_t^2$ by $e_t^2$ has no impact on the asymptotic $\chi^2_J$ distribution of $(n-J-1)R^2$. The proof, however, is rather tedious; for the details, see White (1980). Below, we provide some intuition.

Question: Why does the use of $e_t^2$ in place of $\varepsilon_t^2$ have no impact on the asymptotic distribution of $(n-J-1)R^2$?

To explain this, put $U_t = \text{vech}(X_t X_t')$. Then the infeasible auxiliary regression is
$$\varepsilon_t^2 = U_t'\alpha^0 + v_t.$$
We have $\sqrt{n}(\tilde{\alpha} - \alpha^0) \xrightarrow{d} N(0, \sigma^2_v Q_{uu}^{-1})$, where $Q_{uu} = E(U_t U_t')$, and under $H_0: R\alpha^0 = 0$, where $R$ is a diagonal selection matrix whose first diagonal element is 0 and whose other diagonal elements are 1, we have
$$\sqrt{n}\,R\tilde{\alpha} \xrightarrow{d} N(0, \sigma^2_v RQ_{uu}^{-1}R'),$$
where $\tilde{\alpha}$ is the OLS estimator and $\sigma^2_v = E(v_t^2)$. This implies $R\tilde{\alpha} = O_P(n^{-1/2})$, which vanishes to zero in probability at rate $n^{-1/2}$. It is this term that yields the asymptotic $\chi^2_J$ distribution for $(n-J-1)\tilde{R}^2$, which is asymptotically equivalent to the test statistic
$$\sqrt{n}(R\tilde{\alpha})'\left[s^2_v R\hat{Q}_{uu}^{-1}R'\right]^{-1}\sqrt{n}\,R\tilde{\alpha}.$$

Now suppose we replace $\varepsilon_t^2$ with $e_t^2$ and consider the auxiliary regression
$$e_t^2 = U_t'\alpha^0 + \tilde{v}_t.$$
Denote the OLS estimator by $\hat{\alpha}$. We decompose
$$e_t^2 = \left[\varepsilon_t - X_t'(\hat{\beta} - \beta^o)\right]^2 = \varepsilon_t^2 + (\hat{\beta} - \beta^o)'X_t X_t'(\hat{\beta} - \beta^o) - 2(\hat{\beta} - \beta^o)'X_t\varepsilon_t = U_t'\alpha^0 + \tilde{v}_t.$$
Thus, $\hat{\alpha}$ can be written as
$$\hat{\alpha} = \tilde{\alpha} + \hat{\delta} + \hat{\eta},$$
where $\tilde{\alpha}$ is the OLS estimator of the infeasible auxiliary regression, $\hat{\delta}$ is the effect of the second term, and $\hat{\eta}$ is the effect of the third term. For the third term, $X_t\varepsilon_t$ is uncorrelated with $U_t$ given $E(\varepsilon_t|X_t) = 0$. Therefore this term, after being scaled by the factor $\hat{\beta} - \beta^o$ that itself vanishes to zero in probability at rate $n^{-1/2}$, will vanish to zero in probability at rate $n^{-1}$; that is, $\hat{\eta} = O_P(n^{-1})$. This is expected to have negligible impact on the asymptotic distribution of the test statistic. For the second term, $X_t X_t'$ is perfectly correlated with $U_t$. However, it is scaled by a factor of $\|\hat{\beta} - \beta^o\|^2$ rather than $\|\hat{\beta} - \beta^o\|$ only. As a consequence, the regression coefficient of $(\hat{\beta} - \beta^o)'X_t X_t'(\hat{\beta} - \beta^o)$ on $U_t$ will also vanish to zero at rate $n^{-1}$; that is, $\hat{\delta} = O_P(n^{-1})$. Therefore, it also has negligible impact on the asymptotic distribution of $(n-J-1)R^2$.

Question: How to test conditional homoskedasticity if $E(\varepsilon_t^4|X_t)$ is not a constant (i.e., $E(\varepsilon_t^4|X_t) \ne \mu_4$ for any constant $\mu_4$ under $H_0$)? This corresponds to the case where $v_t$ displays conditional heteroskedasticity.

Question: Suppose White's (1980) test rejects the null hypothesis of conditional homoskedasticity; one can then conclude that there exists evidence of conditional heteroskedasticity. What conclusion can one reach if White's test fails to reject $H_0: E(\varepsilon_t^2|X_t) = \sigma^2$?
Because White (1980) considers a quadratic alternative to test $H_0$, the test may have no power against some conditionally heteroskedastic alternatives for which $E(\varepsilon_t^2|X_t)$ does not depend on the quadratic form of $X_t$ but depends on cubic or higher order polynomials of $X_t$. Thus, when White's test fails to reject $H_0$, one can only say that we find no evidence against $H_0$.
However, when White's test fails to reject $H_0$, we have
$$E(\varepsilon_t^2 X_t X_t') = \sigma^2 E(X_t X_t') = \sigma^2 Q$$
even if $H_0$ is false. Therefore, one can use the conventional variance-covariance matrix estimator $s^2(X'X)^{-1}$ for $\hat{\beta}$. Indeed, the main motivation for White's (1980) test for conditional heteroskedasticity is to determine whether the heteroskedasticity-consistent variance-covariance matrix of $\hat{\beta}$ has to be used, not really whether conditional heteroskedasticity exists. For this purpose, it suffices to regress $\varepsilon_t^2$ or $e_t^2$ on the quadratic form of $X_t$. This can be seen from the decomposition
$$V = E(X_t X_t'\varepsilon_t^2) = \sigma^2 Q + \text{cov}(X_t X_t', \varepsilon_t^2),$$
which indicates that $V = \sigma^2 Q$ if and only if $\varepsilon_t^2$ is uncorrelated with $X_t X_t'$.

The validity of White's test procedure and the associated interpretations is built upon the assumption that the linear regression model is correctly specified for the conditional mean $E(Y_t|X_t)$. Suppose the linear regression model is not correctly specified, i.e., $E(Y_t|X_t) \ne X_t'\beta$ for all $\beta$. Then the OLS estimator $\hat{\beta}$ will converge to $\beta^* = [E(X_t X_t')]^{-1}E(X_t Y_t)$, the best linear least squares approximation coefficient, and $E(Y_t|X_t) \ne X_t'\beta^*$. In this case, the estimated residual is
$$e_t = Y_t - X_t'\hat{\beta} = \varepsilon_t + [E(Y_t|X_t) - X_t'\beta^*] + X_t'(\beta^* - \hat{\beta}),$$
where $\varepsilon_t = Y_t - E(Y_t|X_t)$ is the true disturbance with $E(\varepsilon_t|X_t) = 0$; the estimation error $X_t'(\beta^* - \hat{\beta})$ vanishes to 0 as $n \to \infty$, but the approximation error $E(Y_t|X_t) - X_t'\beta^*$ never disappears. In other words, when the linear regression model is misspecified for $E(Y_t|X_t)$, the estimated residual $e_t$ will contain not only the true disturbance but also the approximation error, which is a function of $X_t$. This will result in spurious conditional heteroskedasticity when White's test is used. Therefore, before using White's test or any other test for conditional heteroskedasticity, it is important to first check whether the linear regression model is correctly specified. For tests of correct specification of a linear regression model, see Hausman's test in Chapter 7 and the other specification tests mentioned there.
4.8 Empirical Applications
4.9 Conclusion
In this chapter, within the context of i.i.d. observations, we have relaxed some key assumptions of the classical linear regression model. In particular, we do not assume conditional normality for $\varepsilon_t$, and we allow for conditional heteroskedasticity. Because the exact finite sample distribution of the OLS estimator is generally unknown, we have relied on asymptotic analysis. It is found that for large samples, the results for the OLS estimator $\hat{\beta}$ and the related test statistics (e.g., the t-test and F-test statistics) are still applicable under conditional homoskedasticity. Under conditional heteroskedasticity, however, the statistical properties of $\hat{\beta}$ differ from those of $\hat{\beta}$ under conditional homoskedasticity, and as a consequence, the conventional t-test and F-test are invalid even when the sample size $n \to \infty$. One has to use White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator for the OLS estimator $\hat{\beta}$ and use it to construct robust test statistics. A direct test for conditional heteroskedasticity, due to White (1980), is described.

The asymptotic theory provides convenient inference procedures in practice. However, the finite sample distribution of $\hat{\beta}$ may differ from its asymptotic distribution. How well the asymptotic distribution approximates the unknown finite sample distribution depends on the data generating process and the sample size of the data. In econometrics, simulation studies have been used to examine how well asymptotic theory can approximate the finite sample distributions of econometric estimators and related statistics. They are the nearest approach that econometricians can make to the laboratory experiments of the physical sciences and are a very useful way of reinforcing or checking theoretical results. Alternatively, resampling methods, called the bootstrap, have been proposed in econometrics to approximate the finite sample distributions of econometric estimators or related statistics by simulating data on a computer. In this book, we focus on asymptotic theory.
EXERCISES

4.1. Suppose Assumptions 3.1, 3.3 and 3.5 hold. Show that (a) $s^2$ converges in probability to $\sigma^2$, and (b) $s$ converges in probability to $\sigma$.

4.2. Let $Z_1, \ldots, Z_n$ be a random sample from a population with mean $\mu$ and variance $\sigma^2$. Show that
$$E\left[\frac{\sqrt{n}(\bar{Z}_n - \mu)}{\sigma}\right] = 0 \quad\text{and}\quad \text{Var}\left[\frac{\sqrt{n}(\bar{Z}_n - \mu)}{\sigma}\right] = 1.$$

4.3. Suppose a sequence of random variables $\{Z_n, n = 1, 2, \ldots\}$ is defined by
$$P(Z_n = n) = \frac{1}{n}, \qquad P(Z_n = 0) = 1 - \frac{1}{n}.$$
(a) Does $Z_n$ converge in mean squares to 0? Give your reasoning clearly.
(b) Does $Z_n$ converge in probability to 0? Give your reasoning clearly.

4.4. Let the sample space $S$ be the closed interval $[0,1]$ with the uniform probability distribution. Define $Z(s) = s$ for all $s \in [0,1]$. Also, for $n = 1, 2, \ldots$, define a sequence of random variables
$$Z_n(s) = \begin{cases} s + s^n & \text{if } s \in [0, 1 - n^{-1}], \\ s + 1 & \text{if } s \in (1 - n^{-1}, 1]. \end{cases}$$
(a) Does $Z_n$ converge in quadratic mean to $Z$?
(b) Does $Z_n$ converge in probability to $Z$?
(c) Does $Z_n$ converge almost surely to $Z$?

4.5. Suppose $g(\cdot)$ is a real-valued continuous function, and $\{Z_n, n = 1, 2, \ldots\}$ is a sequence of real-valued random variables which converges in probability to a random variable $Z$. Show $g(Z_n) \xrightarrow{p} g(Z)$.

4.6. Suppose a stochastic process $\{Y_t, X_t'\}_{t=1}^{n}$ satisfies the following assumptions:

Assumption 1.1 [Linearity]: $\{Y_t, X_t'\}_{t=1}^{n}$ is an i.i.d. process with
$$Y_t = X_t'\beta^o + \varepsilon_t, \quad t = 1, \ldots, n,$$
for some unknown parameter $\beta^o$ and some unobservable disturbance $\varepsilon_t$;

Assumption 1.2 [Nonsingularity]: the $K \times K$ matrix $E(X_t X_t') = Q$ is nonsingular and finite;

Assumption 1.3 [Conditional Heteroskedasticity]:
(i) $E(X_t\varepsilon_t) = 0$;
(ii) $E(\varepsilon_t^2|X_t) \ne \sigma^2$;
(iii) $E(X_{jt}^4) \le C$ for all $0 \le j \le k$, and $E(\varepsilon_t^4) \le C$ for some $C < \infty$.

(a) Show that $\hat{\beta} \xrightarrow{p} \beta^o$.
(b) Show that $\sqrt{n}(\hat{\beta} - \beta^o) \xrightarrow{d} N(0, \Omega)$, where $\Omega = Q^{-1}VQ^{-1}$ and $V = E(X_t X_t'\varepsilon_t^2)$.
(c) Show that the asymptotic variance estimator
$$\hat{\Omega} = \hat{Q}^{-1}\hat{V}\hat{Q}^{-1} \xrightarrow{p} \Omega,$$
where $\hat{Q} = n^{-1}\sum_{t=1}^{n} X_t X_t'$ and $\hat{V} = n^{-1}\sum_{t=1}^{n} X_t X_t' e_t^2$. This is called White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator.
(d) Consider a test for the hypothesis $H_0: R\beta^o = r$. Do we have $J\cdot F \xrightarrow{d} \chi^2_J$, where
$$F = \frac{(R\hat{\beta} - r)'[R(X'X)^{-1}R']^{-1}(R\hat{\beta} - r)/J}{s^2}$$
is the usual F-test statistic? If it holds, give the reasoning. If it does not, could you provide an alternative test statistic that converges in distribution to $\chi^2_J$?

4.7. Put $Q = E(X_t X_t')$, $V = E(\varepsilon_t^2 X_t X_t')$ and $\sigma^2 = E(\varepsilon_t^2)$. Suppose there exists conditional heteroskedasticity, and $\text{cov}(\varepsilon_t^2, X_t X_t') = V - \sigma^2 Q$ is positive semi-definite, i.e., $\sigma^2(X_t)$ is positively correlated with $X_t X_t'$. Show that $Q^{-1}VQ^{-1} - \sigma^2 Q^{-1}$ is positive semi-definite.

4.8. Suppose the following assumptions hold:

Assumption 2.1: $\{Y_t, X_t'\}_{t=1}^{n}$ is an i.i.d. random sample with
$$Y_t = X_t'\beta^o + \varepsilon_t,$$
for some unknown parameter $\beta^o$ and unobservable random disturbance $\varepsilon_t$.

Assumption 2.2: $E(\varepsilon_t|X_t) = 0$ a.s.

Assumption 2.3:
(i) $W_t = W(X_t)$ is a positive function of $X_t$;
(ii) the $K \times K$ matrix $E(X_t W_t X_t') = Q_w$ is finite and nonsingular;
(iii) $E(W_t^8) \le C < \infty$, $E(X_{jt}^8) \le C < \infty$ for all $0 \le j \le k$, and $E(\varepsilon_t^4) \le C$.

Assumption 2.4: $V_w = E(W_t^2 X_t X_t'\varepsilon_t^2)$ is finite and nonsingular.

We consider the so-called weighted least squares (WLS) estimator for $\beta^o$:
$$\hat{\beta}_w = \left(n^{-1}\sum_{t=1}^{n} X_t W_t X_t'\right)^{-1} n^{-1}\sum_{t=1}^{n} X_t W_t Y_t.$$

(a) Show that $\hat{\beta}_w$ is the solution to the following problem:
$$\min_{\beta}\sum_{t=1}^{n} W_t(Y_t - X_t'\beta)^2.$$
(b) Show that $\hat{\beta}_w$ is consistent for $\beta^o$.
(c) Show that $\sqrt{n}(\hat{\beta}_w - \beta^o) \xrightarrow{d} N(0, \Omega_w)$ for some $K \times K$ finite and positive definite matrix $\Omega_w$. Obtain the expression of $\Omega_w$ under (i) conditional homoskedasticity $E(\varepsilon_t^2|X_t) = \sigma^2$ a.s. and (ii) conditional heteroskedasticity $E(\varepsilon_t^2|X_t) \ne \sigma^2$.
(d) Propose an estimator $\hat{\Omega}_w$ for $\Omega_w$, and show that $\hat{\Omega}_w$ is consistent for $\Omega_w$ under conditional homoskedasticity and under conditional heteroskedasticity, respectively.
(e) Construct a test statistic for $H_0: R\beta^o = r$, where $R$ is a $J \times K$ matrix and $r$ is a $J \times 1$ vector, under conditional homoskedasticity and under conditional heteroskedasticity, respectively. Derive the asymptotic distribution of the test statistic under $H_0$ in each case.
(f) Suppose $E(\varepsilon_t^2|X_t) = \sigma^2(X_t)$ is known, and we set $W_t = 1/\sigma^2(X_t)$. Construct a test statistic for $H_0: R\beta^o = r$, where $R$ is a $J \times K$ matrix and $r$ is a $J \times 1$ vector. Derive the asymptotic distribution of the test statistic under $H_0$.

4.9. Consider the problem of testing conditional homoskedasticity ($H_0: E(\varepsilon_t^2|X_t) = \sigma^2$) for a linear regression model
$$Y_t = X_t'\beta^o + \varepsilon_t,$$
where $X_t$ is a $K \times 1$ vector consisting of an intercept and explanatory variables. To test conditional homoskedasticity, we consider the auxiliary regression
$$\varepsilon_t^2 = \text{vech}(X_t X_t')'\alpha + v_t = U_t'\alpha + v_t.$$
Show that under $H_0: E(\varepsilon_t^2|X_t) = \sigma^2$, (a) $E(v_t|X_t) = 0$, and (b) $E(v_t^2|X_t) = \sigma^2_v$ if and only if $E(\varepsilon_t^4|X_t) = \mu_4$ for some constant $\mu_4$.

4.10. Consider the problem of testing conditional homoskedasticity ($H_0: E(\varepsilon_t^2|X_t) = \sigma^2$) for a linear regression model
$$Y_t = X_t'\beta^o + \varepsilon_t,$$
where $X_t$ is a $K \times 1$ vector consisting of an intercept and explanatory variables. To test conditional homoskedasticity, we consider the auxiliary regression
$$\varepsilon_t^2 = \text{vech}(X_t X_t')'\alpha + v_t = U_t'\alpha + v_t.$$
Suppose Assumptions 4.1, 4.2, 4.3, 4.4 and 4.7 hold, and $E(\varepsilon_t^4|X_t) \ne \mu_4$; that is, $E(\varepsilon_t^4|X_t)$ is a function of $X_t$.
(a) Show that $\text{var}(v_t|X_t) \ne \sigma^2_v$ under $H_0$; that is, the disturbance $v_t$ in the auxiliary regression model displays conditional heteroskedasticity.
(b) Suppose $\varepsilon_t$ is directly observable. Construct an asymptotically valid test for the null hypothesis $H_0$ of conditional homoskedasticity of $\varepsilon_t$. Justify your reasoning and test statistic.

CHAPTER 5 LINEAR REGRESSION MODELS WITH DEPENDENT OBSERVATIONS
Abstract: In this chapter, we will show that the asymptotic theory for linear regression models with i.i.d. observations carries over to linear time series regression models with martingale difference sequence disturbances. Some basic concepts in time series analysis are introduced, and some tests for serial correlation are described.

Key words: Dynamic regression model, Ergodicity, Martingale difference sequence, Random walk, Serial correlation, Static regression model, Stationarity, Time series, Unit root, White noise.

Motivation

The asymptotic theory developed in Chapter 4 is applicable to cross-sectional data (due to the i.i.d. random sample assumption). What happens if we have time series data? Could the asymptotic theory for linear regression models with i.i.d. observations be applicable to linear regression models with time series observations?
Consider a simple regression model
$$Y_t = X_t'\beta^o + \varepsilon_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t, \qquad \{\varepsilon_t\} \sim \text{i.i.d. } N(0, \sigma^2).$$
Here, $X_t = (1, Y_{t-1})'$. This is called an autoregression model, which violates the i.i.d. assumption for $\{Y_t, X_t'\}_{t=1}^{n}$ in Chapter 4. Here, we have
$$E(\varepsilon_t|X_t) = 0 \text{ a.s.},$$
but we no longer have
$$E(\varepsilon_t|\mathbf{X}) = E(\varepsilon_t|X_1, X_2, \ldots, X_n) = 0 \text{ a.s.},$$
because $X_{t+j}$ contains $\varepsilon_t$ when $j > 0$. Hence, Assumption 3.2 (strict exogeneity) fails.

In general, the i.i.d. assumption for $\{Y_t, X_t'\}_{t=1}^{n}$ in Chapter 4 rules out time series data. Most economic and financial data are time series observations.
Question: Under what conditions will the asymptotic theory developed in Chapter 4
carry over to linear regression models with dependent observations?

5.1 Introduction to Time Series Analysis


To establish the asymptotic theory for linear regression models with time series observations, we first need to introduce some basic concepts in time series.

Question: What is a time series process?

A time series process can be stochastic or deterministic. In this book, we only consider stochastic time series processes, which is consistent with the fundamental axiom of modern econometrics discussed in Chapter 1.

Definition 5.1 [Stochastic Time Series Process]: A stochastic time series $\{Z_t\}$ is a sequence of random variables or random vectors indexed by time $t \in \{\ldots, 0, 1, 2, \ldots\}$ and governed by some probability law $(\Omega, \mathcal{F}, P)$, where $\Omega$ is the sample space, $\mathcal{F}$ is a $\sigma$-field, and $P$ is a probability measure with $P: \mathcal{F} \to [0, 1]$.

Remarks:
More precisely, we can write $Z_t = Z(t, \cdot)$, and its realization $z_t = Z(t, \omega)$, where $\omega \in \Omega$ is a basic outcome in the sample space $\Omega$.
For each $\omega$, we can obtain a sample path $z_t = Z(t, \omega)$ of the process $\{Z_t\}$ as a deterministic function of time $t$. Different $\omega$'s will give different sample paths.
The dynamics of $\{Z_t\}$ is completely determined by the transition probability of $Z_t$, that is, the conditional probability of $Z_t$ given its past history $I_{t-1} = \{Z_{t-1}, Z_{t-2}, \ldots\}$.

Time Series Random Sample: Consider a subset (or a segment) of a time series process $\{Z_t\}$ for $t = 1, \ldots, n$. This is called a time series random sample of size $n$, denoted
$$Z^n = \{Z_1, \ldots, Z_n\}'.$$
Any realization of this random sample is called a data set, denoted
$$z^n = \{z_1, \ldots, z_n\}'.$$
This corresponds to the occurrence of some specific outcome $\omega \in \Omega$. In theory, a random sample $Z^n$ can generate many data sets, each corresponding to a specific $\omega \in \Omega$. In reality, however, one observes only one data set for any random sample of the economic process, due to the nonexperimental nature of the economic system.
Question: Why can the dynamics of $\{Z_t\}$ be completely captured by its conditional probability distribution?

Consider the random sample $Z^n$. It is well known from basic statistics courses that the joint probability distribution of the random sample $Z^n$,
$$f_{Z^n}(z^n) = f_{Z_1, Z_2, \ldots, Z_n}(z_1, z_2, \ldots, z_n), \qquad z^n \in \mathbb{R}^n,$$
completely captures all the sample information contained in $Z^n$. With $f_{Z^n}(z^n)$, we can, in theory, obtain the sampling distribution of any statistic (e.g., sample mean estimator, sample variance estimator, confidence interval estimator) that is a function of $Z^n$.
Now, by sequential partitioning (repeating the multiplication rule $P(A\cap B) = P(A|B)P(B)$ for any events $A$ and $B$), we can write
$$f_{Z^n}(z^n) = \prod_{t=1}^{n} f_{Z_t|I_{t-1}}(z_t|I_{t-1}),$$
where, by convention, for $t = 1$, $f(z_1|I_0) = f(z_1)$, the marginal density of $Z_1$. Thus, the conditional density function $f_{Z_t|I_{t-1}}(z|I_{t-1})$ completely describes the joint probability distribution of the random sample $Z^n$.

Example 1: Let $Z_t$ be the US Gross Domestic Product (GDP) in quarter $t$. Then the quarterly records of U.S. GDP from the first quarter of 1961 to the last quarter of 2001 constitute a time series data set, denoted $z^n = (z_1, \ldots, z_n)'$, with $n = 164$.

Example 2: Let $Z_t$ be the S&P 500 closing price index on day $t$. Then the daily records of the S&P 500 index from July 2, 1962 to December 31, 2001 constitute a time series data set, denoted $z^n = (z_1, \ldots, z_n)'$, with $n = 9987$.

Here is a fundamental feature of economic time series: each random variable $Z_t$ has only one observed realization $z_t$ in practice. It is impossible to obtain more realizations for each economic variable $Z_t$, due to the nonexperimental nature of an economic system. In order to "aggregate" realizations from the different random variables $\{Z_t\}_{t=1}^{n}$, we need to impose stationarity, a concept of stability for certain aspects of the probability law $f_{Z_t|I_{t-1}}(z_t|I_{t-1})$. For example, we may need to assume:
(i) The marginal probability distribution of each $Z_t$ shares some common features (e.g., the same mean, the same variance).
(ii) The relationship (joint distribution) between $Z_t$ and $I_{t-1}$ is time-invariant in certain aspects (e.g., $\text{cov}(Z_t, Z_{t-j}) = \gamma(j)$ does not depend on time $t$; it depends only on the time distance $j$).
With these assumptions, observations from the different random variables $\{Z_t\}$ can be viewed as containing some common features of the data generating process, so that one can conduct statistical inference by pooling them together.

Stationarity
A stochastic time series $\{Z_t\}$ can be stationary or nonstationary. There are at least two notions of stationarity. The first is strict stationarity.

Definition 5.2 [Strict Stationarity]: A stochastic time series process $\{Z_t\}$ is strictly stationary if, for any admissible $t_1, t_2, \ldots, t_m$, the joint probability distribution of $\{Z_{t_1}, Z_{t_2}, \ldots, Z_{t_m}\}$ is the same as the joint distribution of $\{Z_{t_1+k}, Z_{t_2+k}, \ldots, Z_{t_m+k}\}$ for all integers $k$. That is,
$$f_{Z_{t_1}Z_{t_2}\cdots Z_{t_m}}(z_1, \ldots, z_m) = f_{Z_{t_1+k}Z_{t_2+k}\cdots Z_{t_m+k}}(z_1, \ldots, z_m).$$

Remarks:
If $\{Z_t\}$ is strictly stationary, the conditional probability distribution of $Z_t$ given $I_{t-1}$ has a time-invariant functional form. In other words, the probabilistic structure of a completely stationary process is invariant under a shift of the time origin.
Strict stationarity is also called "complete stationarity", because it characterizes the time-invariance property of the entire joint probability distribution of the process $\{Z_t\}$.
No moment condition on $\{Z_t\}$ is needed when defining strict stationarity. Thus, a strictly stationary process may not have finite moments (e.g., $\text{var}(Z_t) = \infty$). However, if moments (e.g., $E(Z_t)$) and cross-moments (e.g., $E(Z_t Z_{t-j})$) of $\{Z_t\}$ exist, then they are time-invariant when $\{Z_t\}$ is strictly stationary.
Any measurable transformation of a strictly stationary process is still strictly stationary.
Strict stationarity implies an identical distribution for each $Z_t$. Thus, although strictly stationary time series data are realizations from different random variables, they can be viewed as realizations from the same (marginal) population distribution.

Example 3: Suppose $\{Z_t\}$ is an i.i.d. Cauchy$(0, 1)$ sequence with marginal pdf
$$f(z) = \frac{1}{\pi(1 + z^2)}, \qquad -\infty < z < \infty.$$
Note that $Z_t$ has no moments. Consider $\{Z_{t_1}, \ldots, Z_{t_m}\}$. Because the joint distribution
$$f_{Z_{t_1}Z_{t_2}\cdots Z_{t_m}}(z_1, \ldots, z_m) = \prod_{j=1}^{m} f(z_j)$$
is time-invariant, $\{Z_t\}$ is strictly stationary.
We now introduce another concept of stationarity based on the time-invariance property of the joint moments of $\{Z_{t_1}, Z_{t_2}, \ldots, Z_{t_m}\}$.

Definition 5.3 [N-th Order Stationarity]: The time series process $\{Z_t\}$ is said to be stationary up to order $N$ if, for any admissible $t_1, t_2, \ldots, t_m$ and any $k$, all the joint moments up to order $N$ of $\{Z_{t_1}, Z_{t_2}, \ldots, Z_{t_m}\}$ exist and equal the corresponding joint moments up to order $N$ of $\{Z_{t_1+k}, \ldots, Z_{t_m+k}\}$. That is,
$$E[(Z_{t_1})^{n_1}\cdots(Z_{t_m})^{n_m}] = E[(Z_{t_1+k})^{n_1}\cdots(Z_{t_m+k})^{n_m}]$$
for any $k$ and all nonnegative integers $n_1, \ldots, n_m$ satisfying $\sum_{j=1}^{m} n_j \le N$.

Remarks:
Setting $n_2 = n_3 = \cdots = n_m = 0$, we have
$$E[(Z_t)^{n_1}] = E[(Z_0)^{n_1}] \quad\text{for all } t.$$
On the other hand, for $n_1 + n_2 \le N$, we have the pairwise joint product moment
$$E[(Z_t)^{n_1}(Z_{t-j})^{n_2}] = E[(Z_0)^{n_1}(Z_{-j})^{n_2}] = \text{a function of } j \text{ only},$$
where $j$ is called a lag order.

We now consider the special case $N = 2$. This yields a concept called weak stationarity.

Definition 5.4 [Weak Stationarity]: A stochastic time series process $\{Z_t\}$ is weakly stationary if
(i) $E(Z_t) = \mu$ for all $t$;
(ii) $\text{var}(Z_t) = \sigma^2 < \infty$ for all $t$;
(iii) $\text{cov}(Z_t, Z_{t-j}) = \gamma(j)$ is only a function of the lag order $j$ for all $t$.

Remarks:
Strict stationarity is defined in terms of the "time-invariance" property of the entire distribution of $\{Z_t\}$, while weak stationarity is defined in terms of the "time-invariance" property of the first two moments (means, variances and covariances) of $\{Z_t\}$. Suppose all moments of $\{Z_t\}$ exist. Then it is possible that the first two moments are time-invariant but the higher order moments are time-varying. In other words, a process $\{Z_t\}$ can be weakly stationary but not strictly stationary. Conversely, Example 3 shows that a process can be strictly stationary but not weakly stationary, because the first two moments simply do not exist.
Weak stationarity is also called "covariance stationarity" or "second order stationarity", because it is based on the time-invariance property of the first two moments. It does not require an identical distribution for each $Z_t$: the higher order moments of $Z_t$ can be different for different $t$'s.

Question: Which is more restrictive, strict or weak stationarity?

We consider two cases:
(i) If $E(Z_t^2) < \infty$, then strict stationarity implies weak stationarity.
(ii) However, if $E(Z_t^2) = \infty$, strict stationarity does not imply weak stationarity. In other words, a time series process can be strictly stationary but not weakly stationary.

Example 4: An i.i.d. Cauchy$(0, 1)$ process is strictly stationary but not weakly stationary.

A special but important weakly stationary time series is a process with zero autocorrelations.

Definition 5.5 [White Noise]: A time series process $\{Z_t\}$ is a white noise (or serially uncorrelated) process if
(i) $E(Z_t) = 0$;
(ii) $\text{var}(Z_t) = \sigma^2$;
(iii) $\text{cov}(Z_t, Z_{t-j}) = \gamma(j) = 0$ for all $j > 0$.

Remarks:
Later we will explain why such a process is called a white noise (WN) process. WN is a basic building block for linear time series modeling.
When $\{Z_t\}$ is a white noise and $\{Z_t\}$ is a Gaussian process (i.e., any finite set $(Z_{t_1}, Z_{t_2}, \ldots, Z_{t_m})$ of $\{Z_t\}$ has a joint normal distribution), we call $\{Z_t\}$ a Gaussian white noise. A Gaussian white noise process is an i.i.d. sequence.

Example 5: A first order autoregressive (AR(1)) process
$$Z_t = \alpha Z_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{white noise}(0, \sigma^2),$$
is weakly stationary if $|\alpha| < 1$ ($Z_t$ is a unit root process if $\alpha = 1$), because $Z_t = \sum_{j=0}^{\infty}\alpha^j\varepsilon_{t-j}$, and
$$E(Z_t) = 0, \qquad \text{var}(Z_t) = \frac{\sigma^2}{1-\alpha^2}, \qquad \gamma(j) = \frac{\sigma^2\alpha^{|j|}}{1-\alpha^2}, \quad j = 0, 1, 2, \ldots.$$
Here, $\varepsilon_t$ may be interpreted as a random shock or an innovation that drives the movement of the process $\{Z_t\}$ over time.

More generally, $\{Z_t\}$ is an AR(p) process if
$$Z_t = \alpha_0 + \sum_{j=1}^{p}\alpha_j Z_{t-j} + \varepsilon_t, \qquad \varepsilon_t \sim \text{white noise}(0, \sigma^2).$$

Example 6: $\{Z_t\}$ is a q-th order moving-average (MA(q)) process if
$$Z_t = \mu + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j} + \varepsilon_t, \qquad \{\varepsilon_t\} \sim \text{white noise}(0, \sigma^2).$$
This is a weakly stationary process. For an MA(q) process, we have $\gamma(j) = 0$ for all $|j| > q$.

Example 7: $\{Z_t\}$ is an autoregressive-moving average (ARMA) process of orders $(p, q)$ if
$$Z_t = \alpha_0 + \sum_{j=1}^{p}\alpha_j Z_{t-j} + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j} + \varepsilon_t, \qquad \{\varepsilon_t\} \sim \text{white noise}(0, \sigma^2).$$
ARMA models include AR models and MA models as special cases. An estimation method for ARMA models can be found in Chapter 9. In practice, the orders $(p, q)$ can be selected according to the AIC or BIC criterion.
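As a simple numerical illustration (not part of the original notes), the following Python/NumPy sketch simulates a weakly stationary AR(1) process and computes its sample autocorrelation function, which should be close to $\alpha^{j}$; the parameter values and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_ar1(n, alpha=0.8, sigma=1.0, burn=500):
    """Simulate Z_t = alpha*Z_{t-1} + eps_t with |alpha| < 1 and white-noise shocks."""
    z = np.zeros(n + burn)
    eps = rng.normal(scale=sigma, size=n + burn)
    for t in range(1, n + burn):
        z[t] = alpha * z[t - 1] + eps[t]
    return z[burn:]                          # drop burn-in so the start-up effect is negligible

def sample_autocorr(z, max_lag=5):
    """Sample autocorrelation rho_hat(j) = gamma_hat(j)/gamma_hat(0), j = 0,...,max_lag."""
    z = z - z.mean()
    gamma0 = np.mean(z * z)
    return np.array([np.mean(z[j:] * z[:len(z) - j]) / gamma0 for j in range(max_lag + 1)])

z = simulate_ar1(20_000, alpha=0.8)
print(sample_autocorr(z))                    # should be close to 0.8**j for j = 0,...,5
```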

Under rather mild regularity conditions, a zero-mean weakly stationary process can be represented by an MA($\infty$) process
$$Z_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}, \qquad \varepsilon_t \sim \text{WN}(0, \sigma^2),$$
where $\sum_{j=1}^{\infty}\psi_j^2 < \infty$. This is called Wold's decomposition. The partial derivative
$$\frac{\partial Z_{t+j}}{\partial\varepsilon_t} = \psi_j, \qquad j = 0, 1, \ldots,$$
is called the impulse response function of the time series process $\{Z_t\}$ with respect to a random shock $\varepsilon_t$. This function characterizes the impact of a random shock $\varepsilon_t$ on the immediate and subsequent observations $\{Z_{t+j}, j \ge 0\}$. For a weakly stationary process, the impact of any shock on a future $Z_{t+j}$ always diminishes to zero as the lag order $j \to \infty$, because $\psi_j \to 0$. The ultimate cumulative impact of $\varepsilon_t$ on the process $\{Z_t\}$ is the sum $\sum_{j=0}^{\infty}\psi_j$.

The function $\gamma(j) = \text{cov}(Z_t, Z_{t-j})$ is called the autocovariance function of the weakly stationary process $\{Z_t\}$, where $j$ is a lag order. It characterizes the (linear) serial dependence of $Z_t$ on its own lagged value $Z_{t-j}$. Note that $\gamma(j) = \gamma(-j)$ for all integers $j$.
The normalized function $\rho(j) = \gamma(j)/\gamma(0)$ is called the autocorrelation function of $\{Z_t\}$. It has the property that $|\rho(j)| \le 1$. The plot of $\rho(j)$ as a function of $j$ is called the autocorrelogram of the time series process $\{Z_t\}$. It can be used to judge which linear time series model (e.g., AR, MA, or ARMA) should be used to fit a particular time series data set.

We now consider the Fourier transform of the autocovariance function $\gamma(j)$.

Definition 5.6 [Spectral Density Function]: The Fourier transform of $\gamma(j)$,
$$h(\omega) = \frac{1}{2\pi}\sum_{j=-\infty}^{\infty}\gamma(j)e^{-ij\omega}, \qquad \omega\in[-\pi, \pi],$$
where $i = \sqrt{-1}$, is called the power spectral density of the process $\{Z_t\}$.
The normalized version
$$f(\omega) = \frac{h(\omega)}{\gamma(0)} = \frac{1}{2\pi}\sum_{j=-\infty}^{\infty}\rho(j)e^{-ij\omega}, \qquad \omega\in[-\pi, \pi],$$
is called the standardized spectral density of $\{Z_t\}$.

Question: What are the properties of $f(\omega)$?

It can be shown that (i) $f(\omega)$ is real-valued and $f(\omega) \ge 0$; (ii) $\int_{-\pi}^{\pi}f(\omega)\,d\omega = 1$; (iii) $f(-\omega) = f(\omega)$.
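As a numerical check (not part of the original notes), the following Python/NumPy sketch evaluates the standardized spectral density of an AR(1) process from a truncated version of the Fourier sum above, using $\rho(j) = \alpha^{|j|}$; the truncation point and parameter value are illustrative assumptions.

```python
import numpy as np

def ar1_standardized_spectral_density(omega, alpha=0.5, J=200):
    """f(w) = (1/2pi) * sum_{|j|<=J} rho(j) e^{-i j w} with rho(j) = alpha^{|j|} (AR(1))."""
    js = np.arange(-J, J + 1)
    rho = alpha ** np.abs(js)                           # autocorrelation function of AR(1)
    f = np.sum(rho[None, :] * np.exp(-1j * np.outer(omega, js)), axis=1).real / (2 * np.pi)
    return f

omega = np.linspace(-np.pi, np.pi, 1001)
f = ar1_standardized_spectral_density(omega)
print(f.min() >= 0, np.trapz(f, omega))                 # nonnegative and integrates to about 1
```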

The spectral density $h(\omega)$ is widely used in economic analysis. For example, it can be used to search for business cycles. Specifically, a frequency $\omega_0$ corresponding to a spectral peak is closely associated with a business cycle with periodicity $T_0 = 2\pi/\omega_0$. Intuitively, a time series can be decomposed as the sum of many cyclical components with different frequencies $\omega$, and $h(\omega)$ is the strength or magnitude of the component with frequency $\omega$. When $h(\omega)$ has a peak at $\omega_0$, the cyclical component with frequency $\omega_0$, or periodicity $T_0 = 2\pi/\omega_0$, dominates all other frequencies. Consequently, the whole time series behaves as mainly having a cycle with periodicity $T_0$.

The functions $h(\omega)$ and $\gamma(j)$ are Fourier transforms of each other. Thus, they contain the same information on serial dependence in $\{Z_t\}$. In time series analysis, the use of $\gamma(j)$ is called time domain analysis, and the use of $h(\omega)$ is called frequency domain analysis. Which tool to use depends on the convenience of the user. In some applications, the use of $\gamma(j)$ is simpler and more intuitive, while in other applications, the use of $h(\omega)$ is more enlightening. This is exactly the same as the case that it is more convenient to use Chinese in China, while it is more convenient to use English in the U.S.

Example 8: Hamilton, James (1994, Time Series Analysis): business cycles of U.S. industrial production.

Example 9: Steven Durlauf (1990, Journal of Monetary Economics): income tax rate changes.

Reference: Sargent, T. (1987), Macroeconomic Theory, 2nd Edition. Academic Press: Orlando, U.S.A.

For a serially uncorrelated sequence, the spectral density $h(\omega)$ is flat as a function of frequency $\omega$:
$$h(\omega) = \frac{1}{2\pi}\gamma(0) = \frac{\sigma^2}{2\pi} \qquad\text{for all } \omega\in[-\pi, \pi].$$
This is analogous to the power (or energy) spectral density of physical white light. It is for this reason that we call a serially uncorrelated time series a white noise process.
Intuitively, white light can be decomposed via a prism as the sum of equal-magnitude components of different frequencies. That is, white light has a flat physical spectral density function.

It is important to point out that a white noise need not be i.i.d., as illustrated by the example below.

Example 10: Consider an autoregressive conditional heteroskedastic (ARCH) process
$$Z_t = \varepsilon_t h_t^{1/2}, \qquad h_t = \alpha_0 + \alpha_1 Z_{t-1}^2, \qquad \varepsilon_t \sim \text{i.i.d.}(0, 1).$$
This was first proposed by Engle (1982) and has been widely used to model volatility in economics and finance. We have $E(Z_t|I_{t-1}) = 0$ and $\text{var}(Z_t|I_{t-1}) = h_t$, where $I_{t-1} = \{Z_{t-1}, Z_{t-2}, \ldots\}$ is the information set containing the entire past history of $Z_t$.

It can be shown that
$$E(Z_t) = 0, \qquad \text{cov}(Z_t, Z_{t-j}) = 0 \text{ for } j > 0, \qquad \text{var}(Z_t) = \frac{\alpha_0}{1-\alpha_1}.$$
When $\alpha_1 < 1$, $\{Z_t\}$ is a stationary white noise. But it is not weakly stationary if $\alpha_1 = 1$, because then $\text{var}(Z_t) = \infty$. In both cases, $\{Z_t\}$ is strictly stationary (e.g., Nelson 1990, Journal of Econometrics).
Although $\{Z_t\}$ is a white noise, it is not an i.i.d. sequence, because $\{Z_t^2\}$ is autocorrelated, with $\text{corr}(Z_t^2, Z_{t-j}^2) = \alpha_1^{|j|}$ for $j = 0, 1, 2, \ldots$. In other words, an ARCH process is uncorrelated in levels but autocorrelated in squares.
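The following Python/NumPy sketch (not part of the original notes) simulates an ARCH(1) process and confirms numerically that it is essentially uncorrelated in levels but clearly correlated in squares; the parameter values and function names are illustrative assumptions (with $\alpha_1 = 0.5$ the fourth moment exists, so the sample autocorrelation of the squares is well behaved).

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_arch1(n, a0=0.2, a1=0.5, burn=500):
    """Simulate Z_t = eps_t * sqrt(h_t), h_t = a0 + a1*Z_{t-1}^2, eps_t ~ i.i.d. N(0,1)."""
    z = np.zeros(n + burn)
    eps = rng.normal(size=n + burn)
    for t in range(1, n + burn):
        z[t] = eps[t] * np.sqrt(a0 + a1 * z[t - 1]**2)
    return z[burn:]

def acf1(x):
    """First-order sample autocorrelation."""
    x = x - x.mean()
    return np.mean(x[1:] * x[:-1]) / np.mean(x * x)

z = simulate_arch1(50_000)
print("corr(Z_t, Z_{t-1})     :", round(acf1(z), 3))      # approximately 0: white noise in levels
print("corr(Z_t^2, Z_{t-1}^2) :", round(acf1(z**2), 3))   # close to a1 = 0.5: dependence in squares
```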

Nonstationarity
Usually, we call $\{Z_t\}$ a nonstationary time series when it is not covariance-stationary. In time series econometrics, there are two types of nonstationary processes that display similar sample paths when the sample size is not large, but that have quite different implications. We first discuss a nonstationary process called a trend-stationary process.

Example 11: $\{Z_t\}$ is called a trend-stationary process if
$$Z_t = \alpha_0 + \alpha_1 t + \varepsilon_t,$$
where $\varepsilon_t$ is a weakly stationary process with mean 0 and variance $\sigma^2$. To see why $\{Z_t\}$ is not weakly stationary, consider the simplest case where $\{\varepsilon_t\}$ is i.i.d.$(0, \sigma^2)$. Then
$$E(Z_t) = \alpha_0 + \alpha_1 t, \qquad \text{var}(Z_t) = \sigma^2, \qquad \text{cov}(Z_t, Z_{t-j}) = 0 \text{ for } j \ne 0.$$
Question: What happens if we take the first difference $\Delta Z_t = Z_t - Z_{t-1}$?

More generally, a trend-stationary time series process can be defined as
$$Z_t = \alpha_0 + \sum_{j=1}^{p}\alpha_j t^j + \varepsilon_t,$$
where $\{\varepsilon_t\}$ is a weakly stationary process. The reason that $\{Z_t\}$ is called trend-stationary is that it becomes weakly stationary after the deterministic trend is removed.

Next, we discuss the second type of nonstationary process, called a difference-stationary process. Again, we start with a special case.

Example 12: $\{Z_t\}$ is a random walk with drift if
$$Z_t = \alpha_0 + Z_{t-1} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ is i.i.d.$(0, \sigma^2)$. For simplicity, we assume $Z_0 = 0$. Then
$$E(Z_t) = \alpha_0 t, \qquad \text{var}(Z_t) = \sigma^2 t, \qquad \text{cov}(Z_t, Z_{t-j}) = \sigma^2(t-j).$$
Note that for any given $j$,
$$\text{corr}(Z_t, Z_{t-j}) = \sqrt{\frac{t-j}{t}} \to 1 \quad\text{as } t \to \infty,$$
which implies that the impact of an infinitely distant past event on today's behavior never dies out. Indeed, this can be seen more clearly if we write
$$Z_t = Z_0 + \alpha_0 t + \sum_{j=0}^{t-1}\varepsilon_{t-j}.$$
Note that $\{Z_t\}$ has a deterministic linear time trend but with a variance that increases over time. The impulse response function is $\partial Z_{t+j}/\partial\varepsilon_t = 1$ for all $j \ge 0$, which never dies off to zero as $j \to \infty$.
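The following Python/NumPy sketch (not part of the original notes) simulates many independent random-walk-with-drift paths and verifies that $\text{var}(Z_t)$ grows linearly in $t$, in contrast to a trend-stationary process whose variance stays constant; all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Many independent paths of Z_t = a0 + Z_{t-1} + eps_t with Z_0 = 0.
a0, sigma, T, n_paths = 0.1, 1.0, 400, 5_000
eps = rng.normal(scale=sigma, size=(n_paths, T))
Z = np.cumsum(a0 + eps, axis=1)              # Z_t = a0*t + cumulative sum of shocks

for t in [50, 100, 200, 400]:
    print(t, round(Z[:, t - 1].var(), 1), "vs sigma^2 * t =", sigma**2 * t)
```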

There is another nonstationary process, called a martingale process, which is closely related to a random walk.

Definition 5.7 [Martingale]: A time series process $\{Z_t\}$ is a martingale with drift if
$$Z_t = \mu + Z_{t-1} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ satisfies
$$E(\varepsilon_t|I_{t-1}) = 0 \text{ a.s.},$$
and $I_{t-1}$ is the $\sigma$-field generated by $\{\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots\}$. We call $\{\varepsilon_t\}$ a martingale difference sequence (MDS).

Question: Why is $\varepsilon_t$ called an MDS?

Because $\varepsilon_t$ is the difference of a martingale process (net of the drift); that is, $\varepsilon_t = Z_t - Z_{t-1} - \mu$.

Example 13 [Martingale and the Efficient Market Hypothesis]: Suppose a stock log-price $\ln P_t$ follows a martingale process, i.e.,
$$\ln P_t = \ln P_{t-1} + \varepsilon_t,$$
where $E(\varepsilon_t|I_{t-1}) = 0$. Then $\varepsilon_t = \ln P_t - \ln P_{t-1} \approx \frac{P_t - P_{t-1}}{P_{t-1}}$ is the stock's relative price change or stock return (if there is no dividend) from time $t-1$ to time $t$, which can be viewed as a proxy for the new information arriving between time $t-1$ and time $t$ that drives the stock price change in the same period. For this reason, $\varepsilon_t$ is also called an innovation sequence. The MDS property of $\varepsilon_t$ implies that the price change $\varepsilon_t$ is unpredictable using the information available at time $t-1$, and the market is then called informationally efficient. Thus, the best predictor of the stock log-price at time $t$ using the information available at time $t-1$ is $\ln P_{t-1}$; that is, $E(\ln P_t|I_{t-1}) = \ln P_{t-1}$.

Question: What is the relationship between a random walk and a martingale?

A random walk is a martingale, because i.i.d. shocks with zero mean imply $E(\varepsilon_t|I_{t-1}) = E(\varepsilon_t) = 0$. However, the converse is not true.

Example 14: Reconsider an ARCH(1) process
$$\varepsilon_t = h_t^{1/2}z_t, \qquad h_t = \alpha_0 + \alpha_1\varepsilon_{t-1}^2, \qquad \{z_t\} \sim \text{i.i.d.}(0, 1),$$
where $\alpha_0, \alpha_1 > 0$. It follows that
$$E(\varepsilon_t|I_{t-1}) = 0, \qquad \text{var}(\varepsilon_t|I_{t-1}) = h_t = \alpha_0 + \alpha_1\varepsilon_{t-1}^2,$$
where $I_{t-1}$ denotes the information available at time $t-1$. Clearly $\{\varepsilon_t\}$ is an MDS but not i.i.d., because its conditional variance $h_t$ is time-varying (depending on the past information set $I_{t-1}$).

Since the only condition for an MDS is $E(\varepsilon_t|I_{t-1}) = 0$ a.s., an MDS need not be strictly stationary or weakly stationary. However, if it is assumed that $\text{var}(\varepsilon_t) = \sigma^2$ exists, then an MDS is weakly stationary.

When the variance $E(\varepsilon_t^2)$ exists, we have the following directional relationships:
$$\text{IID (with mean 0)} \Longrightarrow \text{MDS} \Longrightarrow \text{WHITE NOISE}.$$

Lemma 5.1: If $\{\varepsilon_t\}$ is an MDS with $E(\varepsilon_t^2) = \sigma^2 < \infty$, then $\{\varepsilon_t\}$ is a white noise.

Proof: By the law of iterated expectations, we have
$$E(\varepsilon_t) = E[E(\varepsilon_t|I_{t-1})] = 0,$$
and for any $j > 0$,
$$\text{cov}(\varepsilon_t, \varepsilon_{t-j}) = E(\varepsilon_t\varepsilon_{t-j}) - E(\varepsilon_t)E(\varepsilon_{t-j}) = E[E(\varepsilon_t\varepsilon_{t-j}|I_{t-1})] = E[E(\varepsilon_t|I_{t-1})\varepsilon_{t-j}] = E(0\cdot\varepsilon_{t-j}) = 0.$$
This implies that an MDS, together with $\text{var}(\varepsilon_t) = \sigma^2$, is a white noise.

However, a white noise does not imply an MDS.

Example 15: Consider a nonlinear MA process
$$\varepsilon_t = z_{t-1}z_{t-2} + z_t, \qquad \{z_t\} \sim \text{i.i.d.}(0, 1).$$
Then it can be shown that $\{\varepsilon_t\}$ is a white noise but not an MDS, because $\text{cov}(\varepsilon_t, \varepsilon_{t-j}) = 0$ for all $j > 0$ but
$$E(\varepsilon_t|I_{t-1}) = z_{t-1}z_{t-2} \ne 0.$$

Question: When do the concepts of IID, MDS and white noise coincide?
When $\{\varepsilon_t\}$ is a stationary Gaussian process. A time series is a stationary Gaussian process if $\{\varepsilon_{t_1}, \varepsilon_{t_2}, \ldots, \varepsilon_{t_m}\}$ is multivariate normally distributed for any admissible set of integers $\{t_1, t_2, \ldots, t_m\}$. Unfortunately, an important stylized fact of economic and financial time series is that they are typically non-Gaussian. Therefore, it is important to emphasize the differences among the concepts of IID, MDS and white noise in time series econometrics.

When $\text{var}(\varepsilon_t)$ exists, both random walk and martingale processes are special cases of the so-called unit root process, which is defined below.

Definition 5.8 [Unit Root or Difference-Stationary Process]: $\{Z_t\}$ is a unit root process if
$$Z_t = \alpha_0 + Z_{t-1} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ is covariance-stationary $(0, \sigma^2)$.

The process $\{Z_t\}$ is called a unit root process because its autoregressive coefficient is unity. It is also called a difference-stationary process because its first difference,
$$\Delta Z_t = Z_t - Z_{t-1} = \alpha_0 + \varepsilon_t,$$
becomes weakly stationary. In fact, the first difference of a linear trend-stationary process $Z_t = \alpha_0 + \alpha_1 t + \varepsilon_t$ is also weakly stationary:
$$\Delta Z_t = \alpha_1 + \varepsilon_t - \varepsilon_{t-1}.$$

The inverse of differencing is "integrating". For the difference-stationary process $\{Z_t\}$, we can write it as the integral of the weakly stationary process $\{\varepsilon_t\}$ in the sense that
$$Z_t = \alpha_0 t + Z_0 + \sum_{j=0}^{t-1}\varepsilon_{t-j},$$
where $Z_0$ is the starting value of the process $\{Z_t\}$. This is analogous to differentiation and integration in calculus, which are inverses of each other. For this reason, $\{Z_t\}$ is also called an integrated process of order 1, denoted I(1). Obviously, a random walk and a martingale process are I(1) processes if the variance of the innovation $\varepsilon_t$ is finite.
We will assume strict stationarity in most cases in the present and subsequent chapters. This implies that some economic variables have to be transformed before being used in $Y_t = X_t'\beta^o + \varepsilon_t$; otherwise, the asymptotic theory developed here cannot be applied. Indeed, a different asymptotic theory has to be developed for unit root processes (see, e.g., Hamilton (1994), Time Series Analysis).

In macroeconomics, it is important to check whether a nonstationary macroeconomic time series is trend-stationary or difference-stationary. If it is a unit root process, then a shock to the economy will never die out to zero as time evolves. In contrast, a random shock to a trend-stationary process will die out to zero eventually.

Question: Why has unit root econometrics been so popular in econometrics?

It was found in empirical studies (e.g., Nelson and Plosser (1982, Journal of Monetary Economics)) that most macroeconomic time series display unit root properties.

Ergodicity
Next, we introduce a concept of asymptotic independence.

Question: Consider the following time series
$$Z^n = (Z_1, Z_2, \ldots, Z_n)' = (W, W, \ldots, W)',$$
where $W$ is a random variable that does not depend on the time index $t$. Obviously, the stationarity condition holds. However, any realization of this random sample $Z^n$ will be
$$z^n = (w, w, \ldots, w)',$$
i.e., it will contain the same realization $w$ for all $n$ observations (no new information as $n$ increases). In order to avoid this, we need to impose a condition called ergodicity, which requires that $(Z_t, \ldots, Z_{t+k})$ and $(Z_{m+t}, \ldots, Z_{m+t+l})$ be asymptotically independent as their time distance $m \to \infty$.

Statistically speaking, independence or little correlation generates new or more information as the sample size $n$ increases. Recall that $X$ and $Y$ are independent if and only if
$$E[f(X)g(Y)] = E[f(X)]E[g(Y)]$$
for any measurable functions $f(\cdot)$ and $g(\cdot)$. We now extend this idea to define ergodicity.

Definition 5.9 [Ergodicity]: A strictly stationary process $\{Z_t\}$ is said to be ergodic if, for any two bounded functions $f: \mathbb{R}^{k+1}\to\mathbb{R}$ and $g: \mathbb{R}^{l+1}\to\mathbb{R}$,
$$\lim_{m\to\infty}\left|E\left[f(Z_t, \ldots, Z_{t+k})g(Z_{m+t}, \ldots, Z_{m+t+l})\right]\right| = \left|E\left[f(Z_t, \ldots, Z_{t+k})\right]\right|\,\left|E\left[g(Z_{m+t}, \ldots, Z_{m+t+l})\right]\right|.$$

Remarks:
Clearly, ergodicity is a concept of asymptotic independence. A strictly stationary process that is ergodic is called ergodic stationary. If $\{Z_t\}$ is ergodic stationary, then $\{f(Z_t)\}$ is also ergodic stationary for any measurable function $f(\cdot)$.

Theorem 5.2 [WLLN for Ergodic Stationary Random Samples]: Let $\{Z_t\}$ be an ergodic stationary process with $E(Z_t) = \mu$ and $E|Z_t| < \infty$. Then the sample mean
$$\bar{Z}_n = n^{-1}\sum_{t=1}^{n} Z_t \xrightarrow{p} \mu \quad\text{as } n \to \infty.$$

Question: Why do we need to assume ergodicity?

Consider a counterexample that does not satisfy the ergodicity condition: $Z_t = W$ for all $t$. Then $\bar{Z}_n = W$, a random variable, which will not converge to $\mu$ as $n \to \infty$.

Next, we state a CLT for ergodic stationary MDS random samples.

Theorem 5.3 [Central Limit Theorem for Ergodic Stationary MDS]: Suppose $\{Z_t\}$ is a stationary ergodic MDS process with $\text{var}(Z_t) \equiv E(Z_t Z_t') = V$ finite, symmetric and positive definite. Then as $n \to \infty$,
$$\sqrt{n}\,\bar{Z}_n = n^{-1/2}\sum_{t=1}^{n} Z_t \xrightarrow{d} N(0, V),$$
or equivalently,
$$V^{-1/2}\sqrt{n}\,\bar{Z}_n \xrightarrow{d} N(0, I).$$

Question: Is $\text{avar}(\sqrt{n}\,\bar{Z}_n) = V = \text{var}(Z_t)$? That is, does the asymptotic variance of $\sqrt{n}\,\bar{Z}_n$ coincide with the individual variance $\text{var}(Z_t)$?

To check this, we have
$$\text{var}(\sqrt{n}\,\bar{Z}_n) = E[\sqrt{n}\,\bar{Z}_n\,\sqrt{n}\,\bar{Z}_n'] = E\left[\left(n^{-1/2}\sum_{t=1}^{n} Z_t\right)\left(n^{-1/2}\sum_{s=1}^{n} Z_s\right)'\right] = n^{-1}\sum_{t=1}^{n}\sum_{s=1}^{n} E(Z_t Z_s')$$
$$= n^{-1}\sum_{t=1}^{n} E(Z_t Z_t') \quad[E(Z_t Z_s') = 0 \text{ for } t \ne s \text{, by the LIE}] = E(Z_t Z_t') = V.$$

Here, the MDS property plays a crucial role in simplifying the asymptotic variance of $\sqrt{n}\,\bar{Z}_n$, because it implies $\text{cov}(Z_t, Z_s) = 0$ for all $t \ne s$. The MDS is one of the most important concepts in modern economics, particularly in macroeconomics, finance, and econometrics. For example, rational expectations theory can be characterized by an expectational error being an MDS.

5.2 Framework and Assumptions


With the basic time series concepts and analytic tools introduced above, we can now develop an asymptotic theory for linear regression models with time series observations. We first state the assumptions, which allow for time series observations.

Assumption 5.1 [Ergodic Stationarity]: The stochastic process $\{Y_t, X_t'\}_{t=1}^{n}$ is jointly stationary and ergodic.

Assumption 5.2 [Linearity]:
$$Y_t = X_t'\beta^o + \varepsilon_t,$$
where $\beta^o$ is a $K \times 1$ unknown parameter vector and $\varepsilon_t$ is the unobservable disturbance.

Assumption 5.3 [Correct Model Specification]: $E(\varepsilon_t|X_t) = 0$ a.s. with $E(\varepsilon_t^2) = \sigma^2 < \infty$.

Assumption 5.4 [Nonsingularity]: The $K \times K$ matrix
$$Q = E(X_t X_t')$$
is finite and nonsingular.

Assumption 5.5 [MDS]: $\{X_t\varepsilon_t\}$ is an MDS process with respect to the $\sigma$-field generated by $\{X_s\varepsilon_s, s < t\}$, and the $K \times K$ matrix $V \equiv \text{var}(X_t\varepsilon_t) = E(X_t X_t'\varepsilon_t^2)$ is finite and positive definite.

Remarks:
In Assumption 5.1, the ergodic stationary process $Z_t = (Y_t, X_t')'$ can be independent or serially dependent across different time periods. We thus allow for time series observations from a stationary stochastic process.
It is important to emphasize that the asymptotic theory to be developed below and in subsequent chapters is not applicable to nonstationary time series. A problem associated with nonstationary time series is the so-called spurious regression or spurious correlation problem. If the dependent variable $Y_t$ and the regressors $X_t$ display similar trending behaviors over time, one is likely to obtain seemingly highly "significant" regression coefficients and high values of $R^2$, even if they do not have any causal relationship. Such results are completely spurious. In fact, the OLS estimator for a nonstationary time series regression model does not follow the asymptotic theory developed below; a different asymptotic theory for nonstationary time series regression models has to be used (see, e.g., Hamilton 1994). Using the correct asymptotic theory, the seemingly highly "significant" regression coefficient estimates would become insignificant in the spurious regression models.
Unlike the i.i.d. case, where $E(\varepsilon_t|X_t) = 0$ is equivalent to the strict exogeneity condition that
$$E(\varepsilon_t|\mathbf{X}) = E(\varepsilon_t|X_1, \ldots, X_t, \ldots, X_n) = 0,$$
the condition $E(\varepsilon_t|X_t) = 0$ is weaker than $E(\varepsilon_t|\mathbf{X}) = 0$ in a time series context. In other words, it is possible that $E(\varepsilon_t|X_t) = 0$ but $E(\varepsilon_t|\mathbf{X}) \ne 0$. Assumption 5.3 allows for the inclusion of predetermined variables in $X_t$, such as the lagged dependent variables $Y_{t-1}, Y_{t-2}$, etc.
For example, suppose $X_t = (1, Y_{t-1})'$. Then we obtain an AR(1) model
$$Y_t = X_t'\beta^o + \varepsilon_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t, \quad t = 2, \ldots, n, \qquad \{\varepsilon_t\} \sim \text{MDS}(0, \sigma^2).$$
Then $E(\varepsilon_t|X_t) = 0$ holds if $E(\varepsilon_t|I_{t-1}) = 0$, namely if $\{\varepsilon_t\}$ is an MDS, where $I_{t-1}$ is the $\sigma$-field generated by $\{\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots\}$. However, we generally have $E(\varepsilon_t|\mathbf{X}) \ne 0$, because $E(\varepsilon_t X_{t+1}) \ne 0$.
When $X_t$ contains an intercept, the MDS condition for $\{X_t\varepsilon_t\}$ in Assumption 5.5 implies that $E(\varepsilon_t|I_{t-1}) = 0$; that is, $\{\varepsilon_t\}$ is an MDS, where $I_{t-1} = \{\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots\}$.

Question: When can an MDS disturbance $\varepsilon_t$ arise in economics and finance?

Example 1: Rational Expectations Economics

Recall the dynamic asset pricing model under a rational expectations framework in Chapter 1. The behavior of the economic agent is characterized by the Euler equation:
\[
E\left[\beta \frac{u'(C_t)}{u'(C_{t-1})} R_t \,\Big|\, I_{t-1}\right] = 1 \quad \text{or} \quad E[M_t R_t | I_{t-1}] = 1,
\]
where $\beta$ is the time discount factor of the representative economic agent, $C_t$ is the consumption, $R_t$ is the asset gross return, and $M_t$ is the stochastic discount factor defined as follows:
\[
M_t = \beta \frac{u'(C_t)}{u'(C_{t-1})} = \beta + \beta\frac{u''(C_{t-1})}{u'(C_{t-1})}\,\Delta C_t + \text{higher order terms} \approx \text{risk adjustment factor}.
\]
Using the formula $\mathrm{cov}(X_t, Y_t|I_{t-1}) = E(X_t Y_t|I_{t-1}) - E(X_t|I_{t-1})E(Y_t|I_{t-1})$ and rearranging, we can write the Euler equation as
\[
E(M_t|I_{t-1})E(R_t|I_{t-1}) + \mathrm{cov}(M_t, R_t|I_{t-1}) = 1.
\]
It follows that
\[
E(R_t|I_{t-1}) = \frac{1}{E(M_t|I_{t-1})} + \left[\frac{-\mathrm{cov}(M_t, R_t|I_{t-1})}{\mathrm{var}(M_t|I_{t-1})}\right]\frac{\mathrm{var}(M_t|I_{t-1})}{E(M_t|I_{t-1})} = \gamma_t + \beta_t\lambda_t,
\]
where $\gamma_t = \gamma(I_{t-1})$ is the riskfree interest rate, $\beta_t = \beta(I_{t-1})$ is the market risk, and $\lambda_t = \lambda(I_{t-1})$ is the price of market risk, or the so-called investment beta factor.
Equivalently, we can write a regression equation for the asset return
\[
R_t = \gamma_t + \beta_t\lambda_t + \varepsilon_t, \quad \text{where } E(\varepsilon_t|I_{t-1}) = 0.
\]
A conventional CAPM usually assumes $\gamma_t = \gamma$ and $\lambda_t = \lambda$, and uses some proxies for $\beta_t$.


As in Chapter 4, no normality on $\{\varepsilon_t\}$ is imposed. Furthermore, no conditional homoskedasticity is imposed: we now allow $\mathrm{var}(\varepsilon_t|X_t)$ to be a function of $X_t$. Because $X_t$ may contain lagged dependent variables $Y_{t-1}, Y_{t-2},\ldots$, $\mathrm{var}(\varepsilon_t|X_t)$ may change over time (e.g., volatility clustering). Volatility clustering is a well-known financial phenomenon where a large volatility today tends to be followed by another large volatility tomorrow, and a small volatility today tends to be followed by another small volatility tomorrow.

Although Assumptions 5.1–5.5 allow for temporal dependence between observations, we will still obtain the same asymptotic properties for the OLS estimator and related test procedures as in the i.i.d. case. Put differently, all the large sample properties for the OLS estimator and related tests established under the i.i.d. assumption in Chapter 4 remain applicable to time series observations under the stationary MDS assumption for $\{X_t\varepsilon_t\}$. We show that this is indeed the case in the subsequent sections.

5.3 Consistency of OLS

We first investigate the consistency of the OLS estimator $\hat\beta$. Recall
\[
\hat\beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \hat{Q}^{-1} n^{-1}\sum_{t=1}^n X_t Y_t,
\]
where, as before,
\[
\hat{Q} = n^{-1}\sum_{t=1}^n X_t X_t'.
\]
Substituting $Y_t = X_t'\beta^o + \varepsilon_t$ from Assumption 5.2, we have
\[
\hat\beta - \beta^o = \hat{Q}^{-1} n^{-1}\sum_{t=1}^n X_t\varepsilon_t.
\]

Theorem 5.4: Suppose Assumptions 5.1–5.5 hold. Then
\[
\hat\beta - \beta^o \overset{p}{\to} 0 \quad \text{as } n\to\infty.
\]

Proof: Because $\{X_t\}$ is ergodic stationary, $\{X_t X_t'\}$ is also ergodic stationary. Thus, given Assumption 5.4, which implies $E|X_{it}X_{jt}| \le C < \infty$ for $0\le i,j\le k$ and some constant $C$, we have
\[
\hat{Q} \overset{p}{\to} E(X_t X_t') = Q
\]
by the WLLN for ergodic stationary processes. Because $Q^{-1}$ exists, by continuity we have
\[
\hat{Q}^{-1} \overset{p}{\to} Q^{-1} \quad \text{as } n\to\infty.
\]
Next, we consider $n^{-1}\sum_{t=1}^n X_t\varepsilon_t$. Because $\{(Y_t,X_t')'\}_{t=1}^n$ is ergodic stationary, $\varepsilon_t = Y_t - X_t'\beta^o$ is ergodic stationary, and so is $X_t\varepsilon_t$. In addition,
\[
E|X_{jt}\varepsilon_t| \le \left[E(X_{jt}^2)E(\varepsilon_t^2)\right]^{1/2} \le C < \infty \quad \text{for } 0\le j\le k
\]
by the Cauchy–Schwarz inequality and Assumptions 5.3 and 5.4. It follows that
\[
n^{-1}\sum_{t=1}^n X_t\varepsilon_t \overset{p}{\to} E(X_t\varepsilon_t) = 0
\]
by the WLLN for ergodic stationary processes, where
\[
E(X_t\varepsilon_t) = E[E(X_t\varepsilon_t|X_t)] = E[X_t E(\varepsilon_t|X_t)] = E(X_t\cdot 0) = 0
\]
by the law of iterated expectations and Assumption 5.3. Therefore, we have
\[
\hat\beta - \beta^o = \hat{Q}^{-1} n^{-1}\sum_{t=1}^n X_t\varepsilon_t \overset{p}{\to} Q^{-1}\cdot 0 = 0.
\]
This completes the proof.
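To make the consistency result concrete, the following sketch (Python, with an assumed AR(1) design $Y_t = 1 + 0.5\,Y_{t-1} + \varepsilon_t$ and i.i.d. standard normal errors, so the MDS Assumption 5.5 holds; all names are illustrative) shows the OLS estimate approaching $\beta^o = (1, 0.5)'$ as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1 = 1.0, 0.5          # true parameters (hypothetical design)

def ols_ar1(n):
    """Simulate Y_t = beta0 + beta1*Y_{t-1} + eps_t and run OLS of Y_t on (1, Y_{t-1})."""
    eps = rng.standard_normal(n)
    Y = np.zeros(n)
    for t in range(1, n):
        Y[t] = beta0 + beta1 * Y[t - 1] + eps[t]
    X = np.column_stack([np.ones(n - 1), Y[:-1]])   # regressors (1, Y_{t-1})
    y = Y[1:]
    return np.linalg.solve(X.T @ X, X.T @ y)        # OLS estimator

for n in (100, 1000, 10000, 100000):
    print(n, ols_ar1(n))   # estimates converge to (1.0, 0.5) as n grows
```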

5.4 Asymptotic Normality of OLS

Next, we derive the asymptotic distribution of $\hat\beta$.

Theorem 5.5: Suppose Assumptions 5.1–5.5 hold. Then
\[
\sqrt{n}(\hat\beta - \beta^o) \overset{d}{\to} N(0, Q^{-1}VQ^{-1}) \quad \text{as } n\to\infty.
\]

Proof: Recall
\[
\sqrt{n}(\hat\beta - \beta^o) = \hat{Q}^{-1} n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t.
\]
First, we consider the second factor
\[
n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t.
\]
Because $\{(Y_t,X_t')'\}_{t=1}^n$ is stationary ergodic, $X_t\varepsilon_t$ is also stationary ergodic. Also, $\{X_t\varepsilon_t\}$ is an MDS with $\mathrm{var}(X_t\varepsilon_t) = E(X_tX_t'\varepsilon_t^2) = V$ finite and positive definite (Assumption 5.5). By the CLT for stationary ergodic MDS processes, we have
\[
n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t \overset{d}{\to} N(0, V).
\]
Moreover, $\hat{Q}^{-1} \overset{p}{\to} Q^{-1}$, as shown earlier. It follows from the Slutsky theorem that
\[
\sqrt{n}(\hat\beta - \beta^o) = \hat{Q}^{-1} n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t \overset{d}{\to} Q^{-1} N(0, V) \equiv N(0, Q^{-1}VQ^{-1}).
\]
This completes the proof.

The asymptotic distribution of $\hat\beta$ under Assumptions 5.1–5.5 is exactly the same as that of $\hat\beta$ in Chapter 4. In particular, the asymptotic mean of $\sqrt{n}(\hat\beta - \beta^o)$ is 0, and the asymptotic variance of $\sqrt{n}(\hat\beta - \beta^o)$ is $Q^{-1}VQ^{-1}$; we denote
\[
\mathrm{avar}(\sqrt{n}\,\hat\beta) = Q^{-1}VQ^{-1}.
\]
Special Case: Conditional Homoskedasticity

The asymptotic variance of $\sqrt{n}\,\hat\beta$ can be simplified if there exists conditional homoskedasticity.

Assumption 5.6: $E(\varepsilon_t^2|X_t) = \sigma^2$ a.s.

This assumption rules out the possibility that the conditional variance of $\varepsilon_t$ changes with $X_t$. For low-frequency macroeconomic time series, this might be a reasonable assumption. For high-frequency financial time series, however, this assumption will be rather restrictive.

Theorem 5.6: Suppose Assumptions 5.1–5.6 hold. Then
\[
\sqrt{n}(\hat\beta - \beta^o) \overset{d}{\to} N(0, \sigma^2 Q^{-1}).
\]

Proof: Under Assumption 5.6, we can simplify
\[
V = E(X_tX_t'\varepsilon_t^2) = E[E(X_tX_t'\varepsilon_t^2|X_t)] = E[X_tX_t' E(\varepsilon_t^2|X_t)] = \sigma^2 E(X_tX_t') = \sigma^2 Q.
\]
The desired result follows immediately from the previous theorem. This completes the proof.

Under conditional homoskedasticity, the asymptotic variance of $\sqrt{n}(\hat\beta - \beta^o)$ is
\[
\mathrm{avar}(\sqrt{n}\,\hat\beta) = Q^{-1}VQ^{-1} = \sigma^2 Q^{-1}.
\]
This is rather convenient to estimate.


5.5 Asymptotic Variance Estimator for OLS

To construct confidence interval estimators or hypothesis test statistics, we need to estimate the asymptotic variance of $\sqrt{n}(\hat\beta - \beta^o)$, namely $\mathrm{avar}(\sqrt{n}\,\hat\beta)$. We consider consistent estimation of $\mathrm{avar}(\sqrt{n}\,\hat\beta)$ under conditional homoskedasticity and conditional heteroskedasticity respectively.

Case I: Conditional Homoskedasticity

In this case, the asymptotic variance of $\sqrt{n}(\hat\beta - \beta^o)$ is
\[
\mathrm{avar}(\sqrt{n}\,\hat\beta) = Q^{-1}VQ^{-1} = \sigma^2 Q^{-1}.
\]
It suffices to have consistent estimators for $\sigma^2$ and $Q$ respectively.

Question: How to estimate $Q$?

Lemma 5.7: Suppose Assumptions 5.1 and 5.4 hold. Then
\[
\hat{Q} \overset{p}{\to} Q \quad \text{as } n\to\infty.
\]

Question: How to estimate $\sigma^2$?

We can use the residual sample variance estimator
\[
s^2 = \frac{\mathbf{e}'\mathbf{e}}{n-K}.
\]

Theorem 5.8 [Consistent Estimator for $\sigma^2$]: Under Assumptions 5.1–5.5, as $n\to\infty$,
\[
s^2 \overset{p}{\to} \sigma^2.
\]
Proof: The proof is analogous to the proof of Theorem 4.15 in Chapter 4. We have
\[
s^2 = \frac{1}{n-K}\sum_{t=1}^n e_t^2
= (n-K)^{-1}\sum_{t=1}^n \varepsilon_t^2
+ (\hat\beta - \beta^o)'\left(\frac{1}{n-K}\sum_{t=1}^n X_tX_t'\right)(\hat\beta - \beta^o)
- 2(\hat\beta - \beta^o)'\frac{1}{n-K}\sum_{t=1}^n X_t\varepsilon_t
\overset{p}{\to} \sigma^2 + 0'\cdot Q\cdot 0 - 2\cdot 0'\cdot 0 = \sigma^2,
\]
given that $K$ is a fixed number, where we have made use of the WLLN for ergodic stationary processes in several places. This completes the proof.

We can then estimate $\mathrm{avar}(\sqrt{n}\,\hat\beta) = \sigma^2 Q^{-1}$ by $s^2\hat{Q}^{-1}$.

Theorem 5.9 [Asymptotic Variance Estimator of $\hat\beta$]: Under Assumptions 5.1–5.6, we can consistently estimate the asymptotic variance $\mathrm{avar}(\sqrt{n}\,\hat\beta)$ by
\[
s^2\hat{Q}^{-1} \overset{p}{\to} \sigma^2 Q^{-1}.
\]
This implies that the variance estimator of $\hat\beta$ is computed as
\[
s^2\hat{Q}^{-1}/n = s^2(\mathbf{X}'\mathbf{X})^{-1},
\]
which is the same as in the classical linear regression case.
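The sketch below (Python, reusing the simulated AR(1) data-generating process from the earlier consistency example; all variable names are illustrative) computes $\hat\beta$, $s^2$, and the classical variance estimator $s^2(\mathbf{X}'\mathbf{X})^{-1}$, from which the usual standard errors follow.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
eps = rng.standard_normal(n)
Y = np.zeros(n)
for t in range(1, n):                      # AR(1): Y_t = 1 + 0.5*Y_{t-1} + eps_t
    Y[t] = 1.0 + 0.5 * Y[t - 1] + eps[t]

X = np.column_stack([np.ones(n - 1), Y[:-1]])
y = Y[1:]
K = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y               # OLS estimator
e = y - X @ beta_hat                       # residuals
s2 = e @ e / (len(y) - K)                  # s^2 = e'e/(n-K)
var_beta = s2 * XtX_inv                    # classical variance estimator s^2 (X'X)^{-1}
se = np.sqrt(np.diag(var_beta))

print("beta_hat:", beta_hat)
print("classical standard errors:", se)
```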

Case II: Conditional Heteroskedasticity

In this case,
\[
\mathrm{avar}(\sqrt{n}\,\hat\beta) = Q^{-1}VQ^{-1}
\]
cannot be further simplified.

Question: How to estimate $Q^{-1}VQ^{-1}$?

It is straightforward to estimate $Q$ by $\hat{Q}$. How to estimate $V = E(X_tX_t'\varepsilon_t^2)$?

We can use its sample analog
\[
\hat{V} = n^{-1}\sum_{t=1}^n X_tX_t'e_t^2.
\]
To ensure consistency of $\hat{V}$ for $V$, we impose the following moment condition:

Assumption 5.7: $E(X_{jt}^4) < \infty$ for $0\le j\le k$ and $E(\varepsilon_t^4) < \infty$.

Lemma 5.10: Suppose Assumptions 5.1–5.5 and 5.7 hold. Then
\[
\hat{V} \overset{p}{\to} V \quad \text{as } n\to\infty.
\]

Proof: The proof is analogous to the proof of Lemma 4.17 in Chapter 4. Because $e_t = \varepsilon_t - (\hat\beta - \beta^o)'X_t$, we have
\[
\hat{V} = n^{-1}\sum_{t=1}^n X_tX_t'\varepsilon_t^2
+ n^{-1}\sum_{t=1}^n X_tX_t'[(\hat\beta - \beta^o)'X_tX_t'(\hat\beta - \beta^o)]
- 2n^{-1}\sum_{t=1}^n X_tX_t'[\varepsilon_t X_t'(\hat\beta - \beta^o)]
\overset{p}{\to} V + 0 - 2\cdot 0,
\]
where for the first term, we have
\[
n^{-1}\sum_{t=1}^n X_tX_t'\varepsilon_t^2 \overset{p}{\to} E(X_tX_t'\varepsilon_t^2) = V
\]
by the WLLN for ergodic stationary processes and Assumption 5.5. For the second term, it suffices to show that for any combination $(i,j,l,m)$, where $0\le i,j,l,m\le k$,
\[
n^{-1}\sum_{t=1}^n X_{it}X_{jt}[(\hat\beta - \beta^o)'X_tX_t'(\hat\beta - \beta^o)]
= \sum_{l=0}^k\sum_{m=0}^k (\hat\beta_l - \beta_l^o)(\hat\beta_m - \beta_m^o)\left(n^{-1}\sum_{t=1}^n X_{it}X_{jt}X_{lt}X_{mt}\right)
\overset{p}{\to} 0,
\]
which follows from $\hat\beta - \beta^o \overset{p}{\to} 0$ and $n^{-1}\sum_{t=1}^n X_{it}X_{jt}X_{lt}X_{mt} \overset{p}{\to} E(X_{it}X_{jt}X_{lt}X_{mt}) = O(1)$ by the WLLN and Assumption 5.7.
For the last term, it suffices to show
\[
n^{-1}\sum_{t=1}^n X_{it}X_{jt}[\varepsilon_t X_t'(\hat\beta - \beta^o)]
= \sum_{l=0}^k (\hat\beta_l - \beta_l^o)\left(n^{-1}\sum_{t=1}^n X_{it}X_{jt}X_{lt}\varepsilon_t\right)
\overset{p}{\to} 0,
\]
which follows from $\hat\beta - \beta^o \overset{p}{\to} 0$ and $n^{-1}\sum_{t=1}^n X_{it}X_{jt}X_{lt}\varepsilon_t \overset{p}{\to} E(X_{it}X_{jt}X_{lt}\varepsilon_t) = 0$ by the WLLN for ergodic stationary processes, the law of iterated expectations, and $E(\varepsilon_t|X_t) = 0$ a.s.
We have proved the following result.

Theorem 5.11 [Asymptotic Variance Estimator for $\sqrt{n}(\hat\beta - \beta^o)$]: Under Assumptions 5.1–5.5 and 5.7, we can estimate $\mathrm{avar}(\sqrt{n}\,\hat\beta)$ by
\[
\hat{Q}^{-1}\hat{V}\hat{Q}^{-1} \overset{p}{\to} Q^{-1}VQ^{-1}.
\]
The variance estimator $\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}$ is the so-called White's heteroskedasticity-consistent variance-covariance matrix estimator of $\sqrt{n}(\hat\beta - \beta^o)$ in a linear time series regression model with MDS disturbances.
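A minimal numpy sketch of the estimator $\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}$ (assuming `X`, `y`, and residuals `e` as in the earlier OLS example; the names are illustrative) is given below; dividing by $n$ gives the White heteroskedasticity-consistent covariance matrix for $\hat\beta$ itself.

```python
import numpy as np

def white_cov(X, e):
    """White's heteroskedasticity-consistent estimator Q_hat^{-1} V_hat Q_hat^{-1}."""
    n = X.shape[0]
    Q_hat = X.T @ X / n                      # Q_hat = n^{-1} sum X_t X_t'
    V_hat = (X * e[:, None] ** 2).T @ X / n  # V_hat = n^{-1} sum X_t X_t' e_t^2
    Q_inv = np.linalg.inv(Q_hat)
    return Q_inv @ V_hat @ Q_inv             # avar estimate for sqrt(n)*(beta_hat - beta)

# Example usage (X, y from the previous block):
# beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
# e = y - X @ beta_hat
# cov_beta_hat = white_cov(X, e) / len(y)    # variance estimate for beta_hat
# robust_se = np.sqrt(np.diag(cov_beta_hat))
```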

5.6 Hypothesis Testing

Question: How to construct a test for the null hypothesis
\[
H_0: R\beta^o = r,
\]
where $R$ is a $J\times K$ constant matrix, and $r$ is a $J\times 1$ constant vector?

Because
\[
\sqrt{n}(\hat\beta - \beta^o) \overset{d}{\to} N(0, Q^{-1}VQ^{-1}),
\]
we have under $H_0$,
\[
\sqrt{n}\,R(\hat\beta - \beta^o) \overset{d}{\to} N(0, RQ^{-1}VQ^{-1}R').
\]
When $E(\varepsilon_t^2|X_t) = \sigma^2$ a.s., we have $V = \sigma^2 Q$, and so
\[
R\sqrt{n}(\hat\beta - \beta^o) \overset{d}{\to} N(0, \sigma^2 RQ^{-1}R').
\]
The test statistics differ in the two cases. We first construct a test under conditional homoskedasticity.

Case I: Conditional Homoskedasticity

When $J = 1$, we can use the conventional $t$-test statistic for large sample inference.

Theorem 5.12 [t-test]: Suppose Assumptions 5.1–5.6 hold. Then under $H_0$ with $J = 1$,
\[
T = \frac{R\hat\beta - r}{\sqrt{s^2 R(\mathbf{X}'\mathbf{X})^{-1}R'}} \overset{d}{\to} N(0,1)
\]
as $n\to\infty$.

Proof: Given $R\sqrt{n}(\hat\beta - \beta^o) \overset{d}{\to} N(0, \sigma^2 RQ^{-1}R')$, $R\beta^o = r$ under $H_0$, and $J = 1$, we have
\[
\frac{\sqrt{n}(R\hat\beta - r)}{\sqrt{\sigma^2 RQ^{-1}R'}} \overset{d}{\to} N(0,1).
\]
By the Slutsky theorem and $\hat{Q} = \mathbf{X}'\mathbf{X}/n$, we obtain
\[
\frac{\sqrt{n}(R\hat\beta - r)}{\sqrt{s^2 R\hat{Q}^{-1}R'}} \overset{d}{\to} N(0,1).
\]
This ratio is the conventional $t$-test statistic we examined in Chapter 3, namely:
\[
\frac{\sqrt{n}(R\hat\beta - r)}{\sqrt{s^2 R\hat{Q}^{-1}R'}} = \frac{R\hat\beta - r}{\sqrt{s^2 R(\mathbf{X}'\mathbf{X})^{-1}R'}} = T.
\]
For $J > 1$, we can consider an asymptotic $\chi^2$ test that is based on the conventional $F$-statistic.

Theorem 5.13 [Asymptotic $\chi^2$ Test]: Suppose Assumptions 5.1–5.6 hold. Then under $H_0$,
\[
J\cdot F \overset{d}{\to} \chi_J^2
\]
as $n\to\infty$.

Proof: We write
\[
R\hat\beta - r = R(\hat\beta - \beta^o) + R\beta^o - r.
\]
Under $H_0: R\beta^o = r$, we have
\[
\sqrt{n}(R\hat\beta - r) = R\sqrt{n}(\hat\beta - \beta^o) \overset{d}{\to} N(0, \sigma^2 RQ^{-1}R').
\]
It follows that the quadratic form
\[
\sqrt{n}(R\hat\beta - r)'[\sigma^2 RQ^{-1}R']^{-1}\sqrt{n}(R\hat\beta - r) \overset{d}{\to} \chi_J^2.
\]
Also, because $s^2\hat{Q}^{-1} \overset{p}{\to} \sigma^2 Q^{-1}$, we have the Wald test statistic
\[
W = \sqrt{n}(R\hat\beta - r)'[s^2 R\hat{Q}^{-1}R']^{-1}\sqrt{n}(R\hat\beta - r) \overset{d}{\to} \chi_J^2
\]
by the Slutsky theorem. This can be written equivalently as follows:
\[
W = \frac{(R\hat\beta - r)'[R(\mathbf{X}'\mathbf{X})^{-1}R']^{-1}(R\hat\beta - r)}{s^2} \overset{d}{\to} \chi_J^2,
\]
namely
\[
W = J\cdot F \overset{d}{\to} \chi_J^2,
\]
where $F$ is the conventional $F$-test statistic derived in Chapter 3.

Remarks:
We cannot use the $F$ distribution for a finite sample size $n$, but we can still compute the $F$-statistic; the appropriate test statistic is $J$ times the $F$-statistic, which is asymptotically $\chi_J^2$ as $n\to\infty$. That is,
\[
J\cdot F = \frac{\tilde{\mathbf{e}}'\tilde{\mathbf{e}} - \mathbf{e}'\mathbf{e}}{\mathbf{e}'\mathbf{e}/(n-K)} \overset{d}{\to} \chi_J^2.
\]
Put differently, the classical $F$-test is still approximately applicable under Assumptions 5.1–5.6 for a large $n$.

We now give two examples that are not covered by the assumptions of classical linear regression models.

Example 1 [Testing for Granger Causality]: Consider a bivariate time series $\{Y_t, X_t\}$, where $t$ is the time index, $I_{t-1}^{(Y)} = \{Y_{t-1},\ldots,Y_1\}$ and $I_{t-1}^{(X)} = \{X_{t-1},\ldots,X_1\}$. For example, $Y_t$ is the GDP growth, and $X_t$ is the money supply growth. We say that $X_t$ does not Granger-cause $Y_t$ in conditional mean with respect to $I_{t-1} = \{I_{t-1}^{(Y)}, I_{t-1}^{(X)}\}$ if
\[
E(Y_t|I_{t-1}^{(Y)}, I_{t-1}^{(X)}) = E(Y_t|I_{t-1}^{(Y)}).
\]
In other words, the lagged variables of $X_t$ have no impact on the level of $Y_t$.

Granger causality is defined in terms of incremental predictability rather than a real cause-effect relationship. From an econometric point of view, it is a test of omitted variables in a time series context. It was first introduced by Granger (1969).
Question: How to test Granger causality?

We consider two approaches to testing Granger causality. The first test is proposed by Granger (1969). Consider the linear regression model
\[
Y_t = \beta_0 + \beta_1 Y_{t-1} + \cdots + \beta_p Y_{t-p} + \beta_{p+1}X_{t-1} + \cdots + \beta_{p+q}X_{t-q} + \varepsilon_t.
\]
Under non-Granger causality, we have
\[
H_0: \beta_{p+1} = \cdots = \beta_{p+q} = 0.
\]
The $F$-test statistic
\[
F \sim F_{q,\,n-(p+q+1)}.
\]
The classical regression theory of Chapter 3 (Assumption 3.2: $E(\varepsilon_t|\mathbf{X}) = 0$) rules out this application, because it is a dynamic regression model. However, we have justified in this chapter that under $H_0$,
\[
q\cdot F \overset{d}{\to} \chi_q^2
\]
as $n\to\infty$ under conditional homoskedasticity, even for a linear dynamic regression model.
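As an illustration, the sketch below (Python, on simulated data where $X_t$ truly Granger-causes $Y_t$; the lag orders $p = q = 2$ and all series are hypothetical) computes the unrestricted and restricted regressions and the statistic $q\cdot F$, compared against the $\chi^2_q$ critical value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 1000
X = np.zeros(n); Y = np.zeros(n)
for t in range(1, n):                        # X_t Granger-causes Y_t by construction
    X[t] = 0.5 * X[t - 1] + rng.standard_normal()
    Y[t] = 0.3 * Y[t - 1] + 0.4 * X[t - 1] + rng.standard_normal()

p = q = 2
m = max(p, q)
y = Y[m:]
ones = np.ones(len(y))
Ylags = np.column_stack([Y[m - j:n - j] for j in range(1, p + 1)])  # Y_{t-1},...,Y_{t-p}
Xlags = np.column_stack([X[m - j:n - j] for j in range(1, q + 1)])  # X_{t-1},...,X_{t-q}

def ssr(Z, y):
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    u = y - Z @ b
    return u @ u

ssr_u = ssr(np.column_stack([ones, Ylags, Xlags]), y)   # unrestricted SSR
ssr_r = ssr(np.column_stack([ones, Ylags]), y)          # restricted SSR (no X lags)
K = 1 + p + q
F = ((ssr_r - ssr_u) / q) / (ssr_u / (len(y) - K))
print("q*F =", q * F, "  chi2(q) 5% critical value:", stats.chi2.ppf(0.95, q))
```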

There is another well-known test for Granger causality, proposed by Sims (1980), which is based on the fact that the future cannot cause the present in any notion of causality. To test whether $\{X_t\}$ Granger-causes $\{Y_t\}$, we consider the following linear regression model
\[
X_t = \beta_0 + \sum_{j=1}^p \beta_j X_{t-j} + \sum_{j=1}^J \delta_j Y_{t+j} + \sum_{j=1}^q \gamma_j Y_{t-j} + \varepsilon_t.
\]
Here, the dependent variable is $X_t$ rather than $Y_t$. If $\{X_t\}$ Granger-causes $\{Y_t\}$, we expect some relationship between the current $X_t$ and the future values of $Y_t$. Note that nonzero values for any of $\{\delta_j\}_{j=1}^J$ cannot be interpreted as causality from the future values of $Y_t$ to the current $X_t$, simply because the future cannot cause the present. Nonzero values of any $\delta_j$ must instead imply that there exists causality from the current $X_t$ to future values of $Y_t$. Therefore, we test the null hypothesis
\[
H_0: \delta_j = 0 \quad \text{for } 1\le j\le J.
\]
Let $F$ be the associated $F$-test statistic. Then under $H_0$,
\[
J\cdot F \overset{d}{\to} \chi_J^2
\]
as $n\to\infty$ under conditional homoskedasticity.

Example 2 [Wage Determination]: Consider the wage function
\[
W_t = \beta_0 + \beta_1 P_t + \beta_2 P_{t-1} + \beta_3 U_t + \beta_4 V_t + \beta_5 W_{t-1} + \varepsilon_t,
\]
where $W_t$ = wage, $P_t$ = price, $U_t$ = unemployment, and $V_t$ = unfilled vacancies. We will test the null hypothesis
\[
H_0: \beta_1 + \beta_2 = 0,\quad \beta_3 + \beta_4 = 0,\quad \text{and } \beta_5 = 1.
\]

Question: What is the economic interpretation of the null hypothesis $H_0$?

Under $H_0$, we have the restricted wage equation:
\[
\Delta W_t = \beta_0 + \beta_1\Delta P_t + \beta_4 D_t + \varepsilon_t,
\]
where $\Delta W_t = W_t - W_{t-1}$ is the wage growth rate, $\Delta P_t = P_t - P_{t-1}$ is the inflation rate, and $D_t = V_t - U_t$ is an index of the job market situation (excess job supply). This implies that the wage increase depends on the inflation rate and the excess job supply.

Under $H_0$, we have
\[
3F \overset{d}{\to} \chi_3^2.
\]

A Special Case: Testing for Joint Significance of All Economic Variables

Theorem 5.14 [$(n-K)R^2$ Test]: Suppose Assumptions 5.1–5.6 hold, and we are interested in testing the null hypothesis that
\[
H_0: \beta_1^o = \beta_2^o = \cdots = \beta_k^o = 0,
\]
where the $\beta_j^o$, $1\le j\le k$, are the slope coefficients in the linear regression model $Y_t = X_t'\beta^o + \varepsilon_t$. Let $R^2$ be the coefficient of determination from the unrestricted regression model $Y_t = X_t'\beta^o + \varepsilon_t$. Then under $H_0$,
\[
(n-K)R^2 \overset{d}{\to} \chi_k^2.
\]

Proof: First, note that as shown earlier, we have in this case
\[
F = \frac{R^2/k}{(1-R^2)/(n-K)}.
\]
Here, we have $J = k$, and under $H_0$,
\[
k\cdot F = \frac{(n-K)R^2}{1-R^2} \overset{d}{\to} \chi_k^2.
\]
This implies that $k\cdot F$ is bounded in probability; that is,
\[
\frac{(n-K)R^2}{1-R^2} = O_P(1).
\]
Consequently, given that $k$ is fixed (i.e., does not grow with the sample size $n$), we have
\[
R^2/(1-R^2) \overset{p}{\to} 0,
\]
or equivalently,
\[
R^2 \overset{p}{\to} 0.
\]
Therefore, $1 - R^2 \overset{p}{\to} 1$. By the Slutsky theorem, we have
\[
(n-K)R^2 = \frac{(n-K)R^2}{1-R^2}\,(1-R^2) \overset{d}{\to} \chi_k^2.
\]

This completes the proof.

Example 3 [Efficient Market Hypothesis]: Suppose $Y_t$ is the exchange rate return in period $t$, and $I_{t-1}$ is the information available at time $t-1$. Then a classical version of the efficient market hypothesis (EMH) can be stated as follows:
\[
E(Y_t|I_{t-1}) = E(Y_t).
\]
To check whether exchange rate changes are unpredictable using the past history of exchange rate changes, we specify a linear regression model:
\[
Y_t = X_t'\beta^o + \varepsilon_t, \quad \text{where } X_t = (1, Y_{t-1}, \ldots, Y_{t-k})'.
\]
Under EMH, we have
\[
H_0: \beta_j^o = 0 \quad \text{for all } j = 1,\ldots,k.
\]
If the alternative
\[
H_A: \beta_j^o \neq 0 \quad \text{for at least some } j\in\{1,\ldots,k\}
\]
holds, then exchange rate changes are predictable using past information.

Remarks:
What is the appropriate interpretation if $H_0$ is not rejected? Note that there exists a gap between the efficiency hypothesis and $H_0$, because the linear regression model is just one of many ways to check EMH. Thus, if $H_0$ is not rejected, at most we can say that no evidence against the efficiency hypothesis is found. We should not conclude that EMH holds.

In using the $k\cdot F$ or $(n-K)R^2$ statistic to test EMH, although the normality assumption is not needed for this result, we still require conditional homoskedasticity, which rules out autoregressive conditional heteroskedasticity (ARCH) in the dynamic time series regression framework. ARCH effects often arise in high-frequency financial time series.

Case II: Conditional Heteroskedasticity

Next, we construct hypothesis tests for $H_0$ under conditional heteroskedasticity. Recall that under $H_0$,
\[
\sqrt{n}(R\hat\beta - r) = R\sqrt{n}(\hat\beta - \beta^o) + \sqrt{n}(R\beta^o - r) = \sqrt{n}\,R(\hat\beta - \beta^o) \overset{d}{\to} N(0, RQ^{-1}VQ^{-1}R'),
\]
where $V = E(X_tX_t'\varepsilon_t^2)$.

For $J = 1$, we have
\[
\frac{\sqrt{n}(R\hat\beta - r)}{\sqrt{RQ^{-1}VQ^{-1}R'}} \overset{d}{\to} N(0,1) \quad \text{as } n\to\infty.
\]
Because $\hat{Q} \overset{p}{\to} Q$ and $\hat{V} \overset{p}{\to} V$, where $\hat{V} = \mathbf{X}'D(\mathbf{e})D(\mathbf{e})'\mathbf{X}/n$, we have by the Slutsky theorem that the robust $t$-test statistic
\[
T_r = \frac{\sqrt{n}(R\hat\beta - r)}{\sqrt{R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R'}} \overset{d}{\to} N(0,1) \quad \text{as } n\to\infty.
\]

Theorem 5.15 [Robust t-Test Under Conditional Heteroskedasticity]: Suppose Assumptions 5.1–5.5 and 5.7 hold. Then under $H_0$ with $J = 1$, as $n\to\infty$, the robust $t$-test statistic
\[
T_r = \frac{\sqrt{n}(R\hat\beta - r)}{\sqrt{R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R'}} \overset{d}{\to} N(0,1).
\]

For $J > 1$, the quadratic form
\[
\sqrt{n}(R\hat\beta - r)'[RQ^{-1}VQ^{-1}R']^{-1}\sqrt{n}(R\hat\beta - r) \overset{d}{\to} \chi_J^2
\]
under $H_0$. Given $\hat{Q} \overset{p}{\to} Q$ and $\hat{V} \overset{p}{\to} V$, where $\hat{V} = \mathbf{X}'D(\mathbf{e})D(\mathbf{e})'\mathbf{X}/n$, we have a robust Wald test statistic
\[
W = n(R\hat\beta - r)'[R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R']^{-1}(R\hat\beta - r) \overset{d}{\to} \chi_J^2
\]
by the Slutsky theorem. We can equivalently write
\[
W = (R\hat\beta - r)'[R(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'D(\mathbf{e})D(\mathbf{e})'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}R']^{-1}(R\hat\beta - r) \overset{d}{\to} \chi_J^2.
\]

Theorem 5.16 [Robust Wald Test Under Conditional Heteroskedasticity]: Suppose Assumptions 5.1–5.5 and 5.7 hold. Then under $H_0$, as $n\to\infty$,
\[
W = n(R\hat\beta - r)'[R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R']^{-1}(R\hat\beta - r) \overset{d}{\to} \chi_J^2.
\]

Remarks:
Under conditional heteroskedasticity, $J\cdot F$ and $(n-K)R^2$ cannot be used, even when $n\to\infty$.
On the other hand, although the general form of the test statistic $W$ developed here can be used whether or not there exists conditional homoskedasticity, $W$ may perform poorly in small samples (i.e., the asymptotic $\chi_J^2$ approximation may be poor in small samples, so Type I errors can be large). Thus, if one has information that the error term is conditionally homoskedastic, one should use the test statistics derived under conditional homoskedasticity, which perform better in small samples. For this reason, it is important to test whether conditional homoskedasticity holds in a time series context.
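The sketch below (Python; `X`, `y`, `R`, and `r` are placeholders the user supplies) computes the robust Wald statistic $W = n(R\hat\beta - r)'[R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R']^{-1}(R\hat\beta - r)$ and its asymptotic $\chi^2_J$ p-value.

```python
import numpy as np
from scipy import stats

def robust_wald(X, y, R, r):
    """Robust Wald test of H0: R beta = r under conditional heteroskedasticity (MDS errors)."""
    n, K = X.shape
    Q_hat = X.T @ X / n
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta_hat
    V_hat = (X * e[:, None] ** 2).T @ X / n          # V_hat = n^{-1} sum X_t X_t' e_t^2
    Q_inv = np.linalg.inv(Q_hat)
    avar = Q_inv @ V_hat @ Q_inv                     # Q^{-1} V Q^{-1}
    diff = R @ beta_hat - r
    W = n * diff @ np.linalg.solve(R @ avar @ R.T, diff)
    J = R.shape[0]
    return W, 1.0 - stats.chi2.cdf(W, J)

# Example usage: test that the second and third coefficients are jointly zero.
# R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]); r = np.zeros(2)
# W, pval = robust_wald(X, y, R, r)
```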

5.7 Testing for Conditional Heteroskedasticity and Autoregressive Conditional Heteroskedasticity

Question: How to test conditional heteroskedasticity in a time series regression context?

Question: Can we still use White's (1980) test for conditional heteroskedasticity?

Yes. Although White's (1980) test is developed under the independence assumption, it is still applicable to a time series linear regression model when $\{X_t\varepsilon_t\}$ is an MDS process. Thus, the procedure to implement White's (1980) test discussed in Chapter 4 can be used here.

In time series econometrics, there is an alternative approach to testing conditional heteroskedasticity in an autoregressive time series context. This is Engle's (1982, Econometrica) Lagrange Multiplier test for autoregressive conditional heteroskedasticity (ARCH) in $\{\varepsilon_t\}$.

Consider the regression model
\[
Y_t = X_t'\beta^o + \varepsilon_t, \quad \varepsilon_t = \sigma_t z_t, \quad \{z_t\} \sim \text{i.i.d.}(0,1).
\]
The null hypothesis is
\[
H_0: \sigma_t^2 = \sigma^2 \quad \text{for some } \sigma^2 > 0,
\]
where $I_{t-1} = \{\varepsilon_{t-1},\varepsilon_{t-2},\ldots\}$.

Here, to allow for a possibly time-varying conditional variance of the regression disturbance $\varepsilon_t$ given $I_{t-1}$, $\varepsilon_t$ is formulated as the product of a random shock $z_t$ and $\sigma_t = \sigma(I_{t-1})$. When the random shock series $\{z_t\}$ is i.i.d.$(0,1)$, we have
\[
\mathrm{var}(\varepsilon_t|I_{t-1}) = E(z_t^2\sigma_t^2|I_{t-1}) = \sigma_t^2 E(z_t^2|I_{t-1}) = \sigma_t^2.
\]

That is, $\sigma_t^2$ is the conditional variance of $\varepsilon_t$ given $I_{t-1}$. The null hypothesis $H_0$ says that the conditional variance of $\varepsilon_t$ given $I_{t-1}$ does not change over time.

The alternative hypothesis to $H_0$ is that $\sigma_t^2$ is a function of $I_{t-1}$, so it changes over time. In particular, we consider the following auxiliary regression for $\varepsilon_t^2$:
\[
\varepsilon_t^2 = \alpha_0 + \sum_{j=1}^q \alpha_j\varepsilon_{t-j}^2 + v_t,
\]
where $E(v_t|I_{t-1}) = 0$ a.s. This is called an ARCH($q$) process in Engle (1982). ARCH models can capture a well-known empirical stylized fact called volatility clustering in financial markets: a high volatility today tends to be followed by another large volatility tomorrow, a small volatility today tends to be followed by another small volatility tomorrow, and such patterns alternate over time. To see this more clearly, we consider an ARCH(1) model where
\[
\sigma_t^2 = \alpha_0 + \alpha_1\varepsilon_{t-1}^2,
\]
where, to ensure nonnegativity of $\sigma_t^2$, both $\alpha_0$ and $\alpha_1$ are required to be nonnegative. Suppose $\alpha_1 > 0$. Then if $\varepsilon_{t-1}$ is an unusually large deviation from its expectation of 0, so that $\varepsilon_{t-1}^2$ is large, the conditional variance of $\varepsilon_t$ is larger than usual. Therefore, $\varepsilon_t$ is expected to have an unusually large deviation from its mean of 0, in either direction. Similarly, if $\varepsilon_{t-1}^2$ is unusually small, then $\sigma_t^2$ is small, and $\varepsilon_t^2$ is expected to be small as well. Because of this behavior, volatility clustering arises.

In addition to volatility clustering, the ARCH(1) model can also generate heavy tails for $\varepsilon_t$ even when the random shock $z_t$ is i.i.d.$N(0,1)$. This can be seen from its kurtosis
\[
K = \frac{E(\varepsilon_t^4)}{[E(\varepsilon_t^2)]^2} = \frac{E(z_t^4)(1-\alpha_1^2)}{1-3\alpha_1^2} > 3
\]
given $\alpha_1 > 0$ (and $3\alpha_1^2 < 1$).
Within the ARCH modeling framework, all autoregressive coefficients $\alpha_j$, $1\le j\le q$, are identically zero when $H_0$ holds. Thus, we can test $H_0$ by checking whether all $\alpha_j$, $1\le j\le q$, are jointly zero. If $\alpha_j\neq 0$ for some $1\le j\le q$, then there exists autocorrelation in $\{\varepsilon_t^2\}$ and $H_0$ is false.

Observe that with $\varepsilon_t = \sigma_t z_t$ and $\{z_t\}$ i.i.d.$(0,1)$, the disturbance $v_t$ in the auxiliary autoregression is an i.i.d. sequence under $H_0$, which implies $E(v_t^2|I_{t-1}) = \sigma_v^2$; that is, $\{v_t\}$ is conditionally homoskedastic. Thus, when $H_0$ holds, we have
\[
(n-q-1)\tilde{R}^2 \overset{d}{\to} \chi_q^2,
\]
where $\tilde{R}^2$ is the centered $R^2$ from the auxiliary regression.

The auxiliary regression for $\varepsilon_t^2$, unfortunately, is infeasible because $\varepsilon_t$ is not observable. However, we can replace $\varepsilon_t$ by the estimated residual $e_t$ and consider the regression
\[
e_t^2 = \alpha_0 + \sum_{j=1}^q \alpha_j e_{t-j}^2 + \tilde{v}_t.
\]
Then we have
\[
(n-q-1)R^2 \overset{d}{\to} \chi_q^2.
\]
Note that the replacement of $\varepsilon_t$ by $e_t$ has no impact on the asymptotic distribution of the test statistic, for the same reason as in White's (1980) direct test for conditional heteroskedasticity. See Chapter 4 for more discussion.
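A minimal sketch of Engle's ARCH LM test (Python; `e` is an assumed vector of OLS residuals and `q` a chosen lag order) regresses $e_t^2$ on its own lags and forms the statistic $(n-q-1)R^2$ compared with $\chi^2_q$.

```python
import numpy as np
from scipy import stats

def arch_lm_test(e, q):
    """Engle's ARCH LM test: regress e_t^2 on e_{t-1}^2,...,e_{t-q}^2 and use (n-q-1)*R^2."""
    e2 = e ** 2
    y = e2[q:]                                            # dependent variable e_t^2
    Z = np.column_stack([np.ones(len(y))] +
                        [e2[q - j:len(e2) - j] for j in range(1, q + 1)])
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    u = y - Z @ b
    R2 = 1.0 - (u @ u) / np.sum((y - y.mean()) ** 2)      # centered R^2
    stat = (len(e) - q - 1) * R2
    return stat, 1.0 - stats.chi2.cdf(stat, q)

# Example usage:
# stat, pval = arch_lm_test(e, q=4)
```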

Remarks:
The existence of an ARCH effect in $\{\varepsilon_t\}$ does not automatically imply that we have to use White's heteroskedasticity-consistent variance-covariance matrix $Q^{-1}VQ^{-1}$ for the OLS estimator $\hat\beta$. Suppose $Y_t = X_t'\beta^o + \varepsilon_t$ is a static time series model such that the two time series $\{X_t\}$ and $\{\varepsilon_t\}$ are independent of each other, and $\{\varepsilon_t\}$ displays an ARCH effect, i.e.,
\[
\mathrm{var}(\varepsilon_t|I_{t-1}) = \alpha_0 + \sum_{j=1}^p \alpha_j\varepsilon_{t-j}^2
\]
with at least some $\alpha_j\neq 0$. Then Assumption 5.6 still holds because $\mathrm{var}(\varepsilon_t|X_t) = \mathrm{var}(\varepsilon_t) = \sigma^2$ given the assumption that $\{X_t\}$ and $\{\varepsilon_t\}$ are independent. In this case, we have $\mathrm{avar}(\sqrt{n}\,\hat\beta) = \sigma^2 Q^{-1}$.
Next, suppose $Y_t = X_t'\beta^o + \varepsilon_t$ is a dynamic time series regression model such that $X_t$ contains some lagged dependent variables (say $Y_{t-1}$). Then if $\{\varepsilon_t\}$ displays an ARCH effect, Assumption 5.6 may fail because we may have $E(\varepsilon_t^2|X_t)\neq\sigma^2$, which generally occurs when $X_t$ and $\{\varepsilon_{t-j}^2, j=1,\ldots,p\}$ are not independent. In this case, we have to use $\mathrm{avar}(\sqrt{n}\,\hat\beta) = Q^{-1}VQ^{-1}$.

5.8 Testing for Serial Correlation

Question: Why is it important to test for serial correlation in $\{\varepsilon_t\}$?

We first provide some motivation for doing so. Recall that under Assumptions 5.1–5.5,
\[
\sqrt{n}(\hat\beta - \beta^o) \overset{d}{\to} N(0, Q^{-1}VQ^{-1}),
\]
where $V = \mathrm{var}(X_t\varepsilon_t)$. Among other things, this implies that the asymptotic variance of $n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t$ is the same as the variance of $X_t\varepsilon_t$. This follows from the MDS assumption for $\{X_t\varepsilon_t\}$:
\[
\mathrm{var}\left(n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t\right)
= n^{-1}\sum_{t=1}^n\sum_{s=1}^n E(X_t\varepsilon_t\varepsilon_s X_s')
= n^{-1}\sum_{t=1}^n E(X_tX_t'\varepsilon_t^2)
= E(X_tX_t'\varepsilon_t^2) = V.
\]
This result will not generally hold if the MDS property for $\{X_t\varepsilon_t\}$ is violated.

Question: How to check $E(X_t\varepsilon_t|I_{t-1}) = 0$, where $I_{t-1}$ is the $\sigma$-field generated by $\{X_s\varepsilon_s, s<t\}$?
When $X_t$ contains an intercept, the MDS property of $\{X_t\varepsilon_t\}$ implies that $\{\varepsilon_t\}$ is an MDS with respect to the $\sigma$-field generated by $\{\varepsilon_s, s<t\}$, which in turn implies that $\{\varepsilon_t\}$ is serially uncorrelated (i.e., is a white noise).

If $\{\varepsilon_t\}$ is serially correlated, then $\{X_t\varepsilon_t\}$ will not be an MDS, and consequently we will generally have $\mathrm{var}(n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t)\neq V$. Therefore, serial uncorrelatedness is an important necessary condition for the validity of $\mathrm{avar}(\sqrt{n}\,\hat\beta) = Q^{-1}VQ^{-1}$ with $V = E(X_tX_t'\varepsilon_t^2)$.

On the other hand, let us revisit the correct model specification condition
\[
E(\varepsilon_t|X_t) = 0 \quad \text{a.s.}
\]
in a time series context. Note that this condition does not necessarily imply that $\{\varepsilon_t\}$ or $\{X_t\varepsilon_t\}$ is an MDS in a time series context.

To see this, consider the case where $Y_t = X_t'\beta^o + \varepsilon_t$ is a static regression model (i.e., $\{X_t\}$ and $\{\varepsilon_t\}$ are mutually independent, or at least $\mathrm{cov}(X_t,\varepsilon_s) = 0$ for all $t,s$); it is possible that $E(\varepsilon_t|X_t) = 0$ but $\{\varepsilon_t\}$ is serially correlated. An example is that $\{\varepsilon_t\}$ is an AR(1) process while $\{\varepsilon_t\}$ and $\{X_t\}$ are mutually independent. In this case, serial dependence in $\{\varepsilon_t\}$ does not cause inconsistency of the OLS estimator $\hat\beta$ for $\beta^o$, but we no longer have $\mathrm{var}(n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t) = V = E(X_tX_t'\varepsilon_t^2)$. In other words, the MDS property of $\{\varepsilon_t\}$ is crucial for $\mathrm{var}(n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t) = V$ in a static regression model, although it is not needed to ensure $E(\varepsilon_t|X_t) = 0$. For a static regression model, the regressors $X_t$ are usually called exogenous variables. In particular, if $\{X_t\}$ and $\{\varepsilon_t\}$ are mutually independent, then $X_t$ is called strictly exogenous.

On the other hand, when $Y_t = X_t'\beta^o + \varepsilon_t$ is a dynamic model (i.e., when $X_t$ includes lagged dependent variables such as $\{Y_{t-1},\ldots,Y_{t-k}\}$, so that $X_t$ and $\varepsilon_{t-j}$ are generally not independent for $j>0$), the correct model specification condition
\[
E(\varepsilon_t|X_t) = 0 \quad \text{a.s.}
\]
holds when $\{\varepsilon_t\}$ is an MDS. If $\{\varepsilon_t\}$ is not an MDS, the condition $E(\varepsilon_t|X_t) = 0$ a.s. generally does not hold. To see this, consider, for example, an AR(1) model
\[
Y_t = \beta_0^o + \beta_1^o Y_{t-1} + \varepsilon_t = X_t'\beta^o + \varepsilon_t.
\]
Suppose $\{\varepsilon_t\}$ is an MA(1) process. Then $E(X_t\varepsilon_t)\neq 0$, and so $E(\varepsilon_t|X_t)\neq 0$. Thus, to ensure correct specification ($E(Y_t|X_t) = X_t'\beta^o$ a.s.) of a dynamic regression model in a time series context, it is important to check the MDS property of $\{\varepsilon_t\}$. In this case, tests for MDS can be viewed as specification tests for dynamic regression models.

In time series econometrics, such as rational expectations econometrics, correct model specification usually requires that $\varepsilon_t$ be an MDS:
\[
E(\varepsilon_t|I_{t-1}) = 0 \quad \text{a.s.},
\]
where $I_{t-1}$ is the information set available to the economic agent at time $t-1$. In this context, $X_t$ is usually a subset of $I_{t-1}$, namely $X_t\in I_{t-1}$. Thus both Assumptions 5.3 and 5.5 hold simultaneously:
\[
E(\varepsilon_t|X_t) = E[E(\varepsilon_t|I_{t-1})|X_t] = 0 \quad \text{a.s.}
\]
and
\[
E(X_t\varepsilon_t|I_{t-1}) = X_t E(\varepsilon_t|I_{t-1}) = 0 \quad \text{a.s.}
\]
because $X_t$ belongs to $I_{t-1}$.

To check the MDS property of $\{\varepsilon_t\}$, one may check whether there exists serial correlation in $\{\varepsilon_t\}$: evidence of serial correlation in $\{\varepsilon_t\}$ indicates that $\{\varepsilon_t\}$ is not an MDS. The existence of serial correlation may be due to various sources of model misspecification. For example, it may be that an important explanatory variable is missing from the linear regression model (omitted variables), or that the functional relationship is nonlinear (functional form misspecification), or that lagged dependent variables or lagged explanatory variables should be included as regressors (neglected dynamics or dynamic misspecification). Therefore, tests for serial correlation can also be viewed as a model specification check in a dynamic time series regression context.

Question: How to check serial dependence in $\{\varepsilon_t\}$?

We now introduce a number of tests for serial correlation of the disturbance $\{\varepsilon_t\}$ in a linear regression model.

Breusch and Godfrey's Lagrange Multiplier Test for Serial Correlation

The null hypothesis is
\[
H_0: E(\varepsilon_t|I_{t-1}) = 0,
\]
where $\varepsilon_t$ is the regression error in the linear regression model
\[
Y_t = X_t'\beta^o + \varepsilon_t,
\]
$I_{t-1} = \{\varepsilon_{t-1},\varepsilon_{t-2},\ldots\}$, and $E(\varepsilon_t^2|X_t) = \sigma^2$ a.s.

Below, following the vast literature, we first assume conditional homoskedasticity in testing for serial correlation in $\{\varepsilon_t\}$. Thus, this method is not suitable for high-frequency financial time series, where volatility clustering has been well documented. Extensions to conditional heteroskedasticity will be discussed later.

First, suppose $\varepsilon_t$ is observed, and consider the auxiliary regression model (an AR($p$))
\[
\varepsilon_t = \sum_{j=1}^p \alpha_j\varepsilon_{t-j} + u_t, \quad t = p+1,\ldots,n,
\]
where $\{u_t\}$ is an MDS. Under $H_0$, we have $\alpha_j = 0$ for $1\le j\le p$. Thus, we can test $H_0$ by checking whether the $\alpha_j$ are jointly equal to 0. Assuming $E(\varepsilon_t^2|X_t) = \sigma^2$ (which implies $E(u_t^2|X_t) = \sigma^2$ under $H_0$), we can run an OLS regression and obtain
\[
(n-2p)\tilde{R}^2_{uc} \overset{d}{\to} \chi_p^2,
\]
where $\tilde{R}^2_{uc}$ is the uncentered $R^2$ in the auxiliary regression (note that there is no intercept), and $p$ is the number of regressors. The reason that we use $(n-2p)\tilde{R}^2_{uc}$ is that $t$ begins from $p+1$.
Unfortunately, $\varepsilon_t$ is not observable. However, we can replace $\varepsilon_t$ with the estimated residual $e_t = Y_t - X_t'\hat\beta$. Unlike in White's (1980) test for heteroskedasticity of unknown form, this replacement will generally change the asymptotic $\chi_p^2$ distribution for $(n-2p)R^2_{uc}$ here. To remove the impact of the estimation error $X_t'(\hat\beta - \beta^o)$, we have to modify the auxiliary regression as follows:
\[
e_t = \sum_{j=1}^K \gamma_j X_{jt} + \sum_{j=1}^p \alpha_j e_{t-j} + u_t
= X_t'\gamma + \sum_{j=1}^p \alpha_j e_{t-j} + u_t, \quad t = p+1,\ldots,n,
\]
where $X_t$ contains the intercept. The inclusion of the regressors $X_t$ in the auxiliary regression purges the impact of the estimation error $X_t'(\hat\beta - \beta^o)$ from the test statistic, because $X_t$ and $X_t'(\hat\beta - \beta^o)$ are perfectly correlated. Therefore, the resulting statistic
\[
(n-2p-K)R^2 \overset{d}{\to} \chi_p^2
\]
under $H_0$, where $R^2$ is the centered squared multi-correlation coefficient in the feasible auxiliary regression model.
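A minimal sketch of this feasible Breusch–Godfrey regression (Python; `X`, `e`, and the lag order `p` are assumed inputs, with `X` including the intercept) is given below.

```python
import numpy as np
from scipy import stats

def breusch_godfrey(X, e, p):
    """LM test for serial correlation: regress e_t on X_t and e_{t-1},...,e_{t-p}."""
    n, K = X.shape
    y = e[p:]
    elags = np.column_stack([e[p - j:n - j] for j in range(1, p + 1)])
    Z = np.column_stack([X[p:], elags])                  # regressors X_t and lagged residuals
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    u = y - Z @ b
    R2 = 1.0 - (u @ u) / np.sum((y - y.mean()) ** 2)     # centered R^2
    stat = (n - 2 * p - K) * R2
    return stat, 1.0 - stats.chi2.cdf(stat, p)

# Example usage:
# stat, pval = breusch_godfrey(X, e, p=4)
```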

Question: Why should $X_t$ generally be included in the auxiliary regression?

When we replace $\varepsilon_t$ by $e_t = \varepsilon_t - X_t'(\hat\beta - \beta^o)$, the estimation error $X_t'(\hat\beta - \beta^o)$ will have a nontrivial impact on the asymptotic distribution of a test statistic for $H_0$, because $X_t$ may be correlated with $\varepsilon_{t-j}$ at least for some lag order $j>0$ (this occurs when the regression model is dynamic). To remove the impact of $X_t'(\hat\beta - \beta^o)$, we add the regressor $X_t$ to the auxiliary regression; it is perfectly correlated with the estimation error $X_t'(\hat\beta - \beta^o)$ and thus can extract its impact. This can be proven rigorously, but we do not attempt to do so here, because it would be very tedious and offer little insight beyond the above intuition. Below, we provide a heuristic explanation.

First, consider the infeasible auxiliary autoregression. Under the null hypothesis of no serial correlation, the OLS estimator satisfies
\[
\sqrt{n}(\tilde\alpha - 0) = \sqrt{n}\,\tilde\alpha \overset{d}{\to} \text{normal},
\]
which implies that $\tilde\alpha = O_P(n^{-1/2})$ vanishes in probability at rate $n^{-1/2}$. The test statistic $n\tilde{R}^2_{uc}$ is asymptotically equivalent to a quadratic form in $\sqrt{n}\,\tilde\alpha$, which follows an asymptotic $\chi_p^2$ distribution. In other words, the asymptotic distribution of $n\tilde{R}^2_{uc}$ is determined by the asymptotic distribution of $\sqrt{n}\,\tilde\alpha$.
Now, suppose we replace $\varepsilon_t$ by $e_t = \varepsilon_t - (\hat\beta - \beta^o)'X_t$ and consider the feasible autoregression
\[
e_t = \sum_{j=1}^p \alpha_j e_{t-j} + v_t.
\]
Suppose the OLS estimator is $\hat\alpha$. We can then decompose
\[
\hat\alpha = \tilde\alpha + \hat\delta + \text{remainder term},
\]
where $\tilde\alpha$, as discussed above, is the OLS estimator from regressing $\varepsilon_t$ on $\varepsilon_{t-1},\ldots,\varepsilon_{t-p}$, and $\hat\delta$ is the OLS estimator from regressing $-(\hat\beta - \beta^o)'X_t$ on $\varepsilon_{t-1},\ldots,\varepsilon_{t-p}$. For a dynamic regression model, the regressor $X_t$ contains lagged dependent variables, so $E(X_t\varepsilon_{t-j})$ is likely nonzero for some $j\in\{1,\ldots,p\}$. It follows that $\hat\delta$ will converge to zero at the same rate as $\tilde\alpha - 0$, which is $n^{-1/2}$. Because $\hat\delta \overset{p}{\to} 0$ at the same rate as $\tilde\alpha$, $\hat\delta$ will affect the asymptotic distribution of $nR^2_{uc}$, where $R^2_{uc}$ is the uncentered $R^2$ in the auxiliary autoregression. To remove the impact of $\hat\delta$, we need to include $X_t$ as additional regressors in the auxiliary regression.

Question: When do we not need to include $X_t$ in the auxiliary regression?

Answer: When we have a static regression model, $\mathrm{cov}(X_t,\varepsilon_s) = 0$ for all $t,s$ (so $E(X_t\varepsilon_{t-j}) = 0$ for all $j = 1,\ldots,p$), and the estimation error $X_t'(\hat\beta - \beta^o)$ has no impact on the asymptotic distribution of $nR^2_{uc}$. It follows that we do not need to include $X_t$ in the auxiliary autoregression. In other words, we can test for serial correlation in $\{\varepsilon_t\}$ by running the auxiliary regression model
\[
e_t = \sum_{j=1}^p \alpha_j e_{t-j} + u_t.
\]
The resulting $nR^2_{uc}$ is asymptotically $\chi_p^2$ under the null hypothesis of no serial correlation.

Question: Suppose we have a static regression model, and we include $X_t$ in the auxiliary regression when testing for serial correlation in $\{\varepsilon_t\}$. What will happen?

For a static regression model, whether $X_t$ is included in the auxiliary regression has no impact on the asymptotic $\chi_p^2$ distribution of $(n-2p)R^2_{uc}$ or $(n-2p)R^2$ under the null hypothesis of no serial correlation in $\{\varepsilon_t\}$. Thus, we still obtain an asymptotically valid test statistic $(n-2p)R^2$ under $H_0$. In fact, the size performance of the test can be better in finite samples. However, the test may be less powerful than the test without $X_t$, because $X_t$ may take away some serial correlation in $\{\varepsilon_t\}$ under the alternative to $H_0$.

Question: What happens if we include an intercept in the auxiliary regression
\[
e_t = \alpha_0 + \sum_{j=1}^p \alpha_j e_{t-j} + u_t,
\]
where $e_t$ is the OLS residual from a static regression model?

With the inclusion of the intercept, we can then use $(n-2p)R^2$ to test for serial correlation in $\{\varepsilon_t\}$, which is more convenient to compute than $(n-2p)R^2_{uc}$. (Most statistical software reports $R^2$ but not $R^2_{uc}$.) Under $H_0$, $(n-2p)R^2 \overset{d}{\to} \chi_p^2$. However, the inclusion of the intercept $\alpha_0$ may have some adverse impact on the power of the test in small samples, because there is an additional parameter to estimate.

As discussed at the beginning of this section, a test for serial correlation can be viewed as a specification test for dynamic regression models in a time series context, because the existence of serial correlation in the estimated model residuals $\{e_t\}$ will generally indicate misspecification of a dynamic regression model.

On the other hand, for static regression models with time series observations, it is possible that a static regression model $Y_t = X_t'\beta^o + \varepsilon_t$ is correctly specified in the sense that $E(\varepsilon_t|X_t) = 0$ but $\{\varepsilon_t\}$ displays serial correlation. In this case, the existence of serial correlation in $\{\varepsilon_t\}$ does not affect the consistency of the OLS estimator $\hat\beta$ but affects the asymptotic variance and therefore the efficiency of $\hat\beta$. However, since $\varepsilon_t$ is unobservable, one has to use the estimated residual $e_t$ in testing for serial correlation in a static regression model in the same way as in a dynamic regression model. Because the estimated residual
\[
e_t = Y_t - X_t'\hat\beta = \varepsilon_t + [E(Y_t|X_t) - X_t'\beta^*] + X_t'(\beta^* - \hat\beta),
\]
it contains the true disturbance $\varepsilon_t = Y_t - E(Y_t|X_t)$ and the model approximation error $E(Y_t|X_t) - X_t'\beta^*$, where $\beta^* = [E(X_tX_t')]^{-1}E(X_tY_t)$ is the best linear least squares approximation coefficient to which the OLS estimator $\hat\beta$ always converges as $n\to\infty$. If the linear regression model is misspecified for $E(Y_t|X_t)$, then the approximation error $E(Y_t|X_t) - X_t'\beta^*$ never vanishes, and this term can cause serial correlation in $e_t$ if $X_t$ is a time series process. Thus, when one finds serial correlation in the estimated residuals $\{e_t\}$ of a static linear regression model, it may also be due to misspecification of the static regression model. In this case, the OLS estimator $\hat\beta$ is generally not consistent. Therefore, one has to first check correct specification of a static regression model in order to give a correct interpretation to any documented serial correlation in the estimated residuals.
In the development of tests for serial correlation in regression disturbances, two tests have been very popular and are of historical importance. One is the Durbin–Watson test and the other is Durbin's h test. The Durbin–Watson test is the first formal procedure developed for testing first-order serial correlation,
\[
\varepsilon_t = \rho\varepsilon_{t-1} + u_t, \quad \{u_t\} \sim \text{i.i.d.}(0,\sigma_u^2),
\]
using the OLS residuals $\{e_t\}_{t=1}^n$ in a static linear regression model $Y_t = X_t'\beta^o + \varepsilon_t$. Durbin and Watson (1950, 1951) propose the test statistic
\[
d = \frac{\sum_{t=2}^n (e_t - e_{t-1})^2}{\sum_{t=1}^n e_t^2}.
\]
Durbin and Watson present tables of bounds at the 0.05, 0.025 and 0.01 significance levels of the $d$ statistic for static regressions with an intercept. Against the one-sided alternative that $\rho > 0$, if $d$ is less than the lower bound $d_L$, the null hypothesis that $\rho = 0$ is rejected; if $d$ is greater than the upper bound $d_U$, the null hypothesis is not rejected. Otherwise, the test is inconclusive. Against the one-sided alternative that $\rho < 0$, $4-d$ can be used in place of $d$ in the above procedure.
The Durbin–Watson test has been extended to test for lag 4 autocorrelation by Wallis (1972) and for autocorrelation at any lag by Vinod (1973).

The Durbin–Watson $d$ test is not applicable to dynamic linear regression models, because parameter estimation uncertainty in the OLS estimator $\hat\beta$ will have a nontrivial impact on the distribution of $d$. Durbin (1970) developed the so-called $h$ test for first-order autocorrelation in $\{\varepsilon_t\}$ that takes into account parameter estimation uncertainty in $\hat\beta$. Consider a simple dynamic linear regression model
\[
Y_t = \beta_0^o + \beta_1^o Y_{t-1} + \beta_2^o X_t + \varepsilon_t,
\]
where $X_t$ is strictly exogenous. Durbin's $h$ statistic is defined as
\[
h = \hat\rho\sqrt{\frac{n}{1 - n\,\widehat{\mathrm{var}}(\hat\beta_1)}},
\]
where $\widehat{\mathrm{var}}(\hat\beta_1)$ is an estimator of the asymptotic variance of $\hat\beta_1$, and $\hat\rho$ is the OLS estimator from regressing $e_t$ on $e_{t-1}$ (in fact, $\hat\rho \approx 1 - d/2$). Durbin (1970) shows that $h \overset{d}{\to} N(0,1)$ as $n\to\infty$ under the null hypothesis that $\rho = 0$. In fact, Durbin's $h$ test is asymptotically equivalent to the Lagrange Multiplier test introduced above.
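A small Python sketch of both statistics (assuming `e` is a vector of OLS residuals and `var_beta1` the estimated variance of the coefficient on $Y_{t-1}$; all names are illustrative):

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson d = sum (e_t - e_{t-1})^2 / sum e_t^2."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def durbin_h(e, var_beta1):
    """Durbin's h statistic; requires n*var_beta1 < 1, otherwise it is undefined."""
    n = len(e)
    rho_hat = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)   # OLS slope of e_t on e_{t-1}
    return rho_hat * np.sqrt(n / (1.0 - n * var_beta1))

# Example usage:
# d = durbin_watson(e)
# h = durbin_h(e, var_beta1)   # compare with N(0,1) critical values
```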

The Box–Pierce Portmanteau Test

Define the sample autocovariance function
\[
\hat\gamma(j) = n^{-1}\sum_{t=j+1}^n (e_t - \bar{e})(e_{t-j} - \bar{e}),
\]
where $\bar{e} = n^{-1}\sum_{t=1}^n e_t$ (this is zero when $X_t$ contains an intercept). The Box–Pierce portmanteau test statistic is defined as
\[
Q(p) = n\sum_{j=1}^p \hat\rho^2(j),
\]
where the sample autocorrelation function
\[
\hat\rho(j) = \hat\gamma(j)/\hat\gamma(0).
\]
When $\{e_t\}$ is a directly observed series or is the estimated residual from a static regression model, we can show
\[
Q(p) \overset{d}{\to} \chi_p^2
\]
under the null hypothesis of no serial correlation.

On the other hand, when $e_t$ is an estimated residual from an ARMA($r,s$) model
\[
Y_t = \beta_0 + \sum_{j=1}^r \beta_j Y_{t-j} + \sum_{j=1}^s \theta_j\varepsilon_{t-j} + \varepsilon_t,
\]
then
\[
Q(p) \overset{d}{\to} \chi_{p-(r+s)}^2,
\]
where $p > r+s$. See Box and Pierce (1970).

To improve the small sample performance of the $Q(p)$ test, Ljung and Box (1978) propose a modified $Q(p)$ test statistic:
\[
Q^*(p) \equiv n(n+2)\sum_{j=1}^p (n-j)^{-1}\hat\rho^2(j) \overset{d}{\to} \chi_{p-(r+s)}^2.
\]
The modification matches the first two moments of $Q^*(p)$ with those of the $\chi^2$ distribution. This improves the size of the test in small samples, although not its power.
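A hand-rolled sketch of the Box–Pierce and Ljung–Box statistics (Python; `e` is an assumed residual series):

```python
import numpy as np
from scipy import stats

def portmanteau(e, p, df_adjust=0):
    """Box-Pierce Q(p) and Ljung-Box Q*(p); df_adjust = r+s when e comes from an ARMA(r,s) fit."""
    n = len(e)
    e_c = e - e.mean()
    gamma0 = np.sum(e_c ** 2) / n
    rho = np.array([np.sum(e_c[j:] * e_c[:-j]) / n / gamma0 for j in range(1, p + 1)])
    Q_bp = n * np.sum(rho ** 2)
    Q_lb = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, p + 1)))
    df = p - df_adjust
    return {"Q_BP": Q_bp, "Q_LB": Q_lb,
            "pval_BP": 1 - stats.chi2.cdf(Q_bp, df),
            "pval_LB": 1 - stats.chi2.cdf(Q_lb, df)}

# Example usage:
# portmanteau(e, p=12)
```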

When $\{e_t\}$ is an estimated residual series from a dynamic regression model with regressors including both lagged dependent variables and exogenous variables, the asymptotic distribution of $Q(p)$ is generally unknown (Breusch and Pagan 1980). One solution is to modify the $Q(p)$ test statistic as follows:
\[
\hat{Q}(p) \equiv n\hat\rho'(I - \hat\Phi)^{-1}\hat\rho \overset{d}{\to} \chi_p^2 \quad \text{as } n\to\infty,
\]
where $\hat\rho = [\hat\rho(1),\ldots,\hat\rho(p)]'$, and $\hat\Phi$ captures the impact caused by nonzero correlation between $\{X_t\}$ and $\{\varepsilon_{t-j}, 1\le j\le p\}$. See Hayashi (2000, Section 2.10) for more discussion and the expression of $\hat\Phi$.

Like the $(n-p)R^2$ test, the $Q(p)$ test also assumes conditional homoskedasticity. In fact, it can be shown to be asymptotically equivalent to the $(n-p)R^2$ test statistic when $e_t$ is the estimated residual of a static regression model.

The Kernel-Based Test for Serial Correlation

Hong (1996, Econometrica)
Let $k: \mathbb{R}\to[-1,1]$ be a symmetric function that is continuous at all points except a finite number of points on $\mathbb{R}$, with $k(0) = 1$ and $\int_{-\infty}^{\infty}k^2(z)dz < \infty$.

Examples of $k(\cdot)$:

(i) The truncated kernel
\[
k(z) = \mathbf{1}(|z|\le 1);
\]
(ii) The Bartlett kernel
\[
k(z) = (1-|z|)\mathbf{1}(|z|\le 1);
\]
(iii) The Daniell kernel
\[
k(z) = \frac{\sin(\pi z)}{\pi z}, \quad z\in\mathbb{R}.
\]
Here, $\mathbf{1}(|z|\le 1)$ is the indicator function that takes value 1 if $|z|\le 1$ and 0 otherwise.
Define the test statistic
\[
M(p) = \left[n\sum_{j=1}^{n-1}k^2(j/p)\hat\rho^2(j) - C(p)\right]\Big/\sqrt{D(p)},
\]
where $\hat\rho(j)$ is the sample autocorrelation function,
\[
C(p) = \sum_{j=1}^{n-1}k^2(j/p), \qquad D(p) = 2\sum_{j=1}^{n-2}k^4(j/p).
\]
Under the null hypothesis of no serial correlation with conditional homoskedasticity, it can be shown that
\[
M(p) \overset{d}{\to} N(0,1)
\]
as $p = p(n)\to\infty$, $p/n\to 0$. This holds no matter whether $e_t$ is the estimated residual from a static regression model or a dynamic regression model.
To appreciate why $M(p) \overset{d}{\to} N(0,1)$, consider the special case of the truncated kernel $k(z) = \mathbf{1}(|z|\le 1)$, which assigns an equal weight to each of the first $p$ lags. In this case, $M(p)$ becomes
\[
M_T(p) = \frac{n\sum_{j=1}^p \hat\rho^2(j) - p}{\sqrt{2p}}.
\]

This can be viewed as a generalized version of the Box–Pierce test. In other words, the Box–Pierce test can be viewed as a kernel-based test with the choice of the truncated kernel.
For a static regression model, we have $n\sum_{j=1}^p\hat\rho^2(j) \overset{d}{\to} \chi_p^2$ under the null hypothesis of no serial correlation. When $p$ is large, we can obtain a normal approximation for the $\chi_p^2$ distribution by subtracting its mean $p$ and dividing by its standard deviation $\sqrt{2p}$:
\[
\frac{\chi_p^2 - p}{\sqrt{2p}} \overset{d}{\to} N(0,1) \quad \text{as } p\to\infty.
\]
In fact, when $p\to\infty$ as $n\to\infty$, we have the same asymptotic result even when the regression model is dynamic.

Question: Why is it not necessary to correct for the impact of the estimation error contained in $e_t$, even when the regression model is dynamic?

Answer: The estimation error does have some impact, but this impact becomes asymptotically negligible when $p$ grows to infinity as $n\to\infty$. In contrast, the Box–Pierce portmanteau test has this problem because it uses a fixed lag order $p$ (i.e., $p$ is fixed when $n\to\infty$).

Question: What is the advantage of using a kernel function?

For a weakly stationary process $\{\varepsilon_t\}$, the autocorrelation function $\rho(j)$ typically decays to zero as $j$ increases. Consequently, a test is more powerful if one discounts higher order lags rather than treating all lags equally. This can be achieved by using a downward weighting kernel function such as the Bartlett kernel or the Daniell kernel. Hong (1996) shows that the Daniell kernel gives the most powerful test among a class of kernel functions.
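The sketch below (Python) implements $M(p)$ for the Bartlett and truncated kernels under the conditional homoskedasticity version stated above; `e` is an assumed residual series and `p` a bandwidth, both illustrative.

```python
import numpy as np
from scipy import stats

def kernel_M_test(e, p, kernel="bartlett"):
    """Hong's (1996) kernel-based test M(p) for serial correlation (homoskedastic version)."""
    n = len(e)
    e_c = e - e.mean()
    gamma0 = np.sum(e_c ** 2) / n
    rho = np.array([np.sum(e_c[j:] * e_c[:-j]) / n / gamma0 for j in range(1, n)])
    z = np.arange(1, n) / p
    if kernel == "bartlett":
        k = np.where(np.abs(z) <= 1, 1 - np.abs(z), 0.0)
    else:                                   # truncated kernel
        k = (np.abs(z) <= 1).astype(float)
    k2 = k ** 2
    C = np.sum(k2[: n - 1])
    D = 2 * np.sum((k ** 4)[: n - 2])
    M = (n * np.sum(k2 * rho ** 2) - C) / np.sqrt(D)
    return M, 1 - stats.norm.cdf(M)         # one-sided test: reject for large M

# Example usage:
# M, pval = kernel_M_test(e, p=int(len(e) ** 0.3))
```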

Testing Serial Correlation Under Conditional Heteroskedasticity

So far we have been testing serial correlation under conditional homoskedasticity. All aforementioned tests assume conditional homoskedasticity, or even that $\{\varepsilon_t\}$ is i.i.d., under the null hypothesis of no serial correlation; this rules out high-frequency financial time series, which have been documented to display persistent volatility clustering. To test for serial correlation under conditional heteroskedasticity, we need different procedures, because the $F$-test and the $(n-p)R^2$ test are no longer valid.

Question: Under what conditions will conditional homoskedasticity be a reasonable assumption? And under what conditions will it not be a reasonable assumption?

Answer: It is a reasonable assumption for low-frequency macroeconomic time series. It is not a reasonable assumption for high-frequency financial time series.

Question: How to construct a test for serial correlation under conditional heteroskedasticity?

Wooldridge's (1991) Robust Test

Some effort has been devoted to robustifying tests for serial correlation. Wooldridge (1990, 1991) proposes regression-based procedures to test for serial correlation that are robust to conditional heteroskedasticity. Specifically, Wooldridge (1990, 1991) proposes a two-stage procedure to robustify the $nR^2$ test for serial correlation in the estimated residuals $\{e_t\}$ of a linear regression model:

Step 1: Regress $(e_{t-1},\ldots,e_{t-p})$ on $X_t$ and save the estimated $p\times 1$ residual vector $\hat{v}_t$;

Step 2: Regress 1 on $\hat{v}_t e_t$ and obtain $SSR$, the sum of squared residuals;

Step 3: Compare the $n - SSR$ statistic with the asymptotic $\chi_p^2$ distribution.

The first auxiliary regression purges the impact of parameter estimation uncertainty in the OLS estimator $\hat\beta$, and the second auxiliary regression delivers a test statistic robust to conditional heteroskedasticity of unknown form.
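A sketch of this two-stage procedure (Python; `X`, `e`, and `p` are assumed inputs, with `X` containing the intercept; the effective sample after lagging is used as $n$):

```python
import numpy as np
from scipy import stats

def wooldridge_robust_sc(X, e, p):
    """Wooldridge's heteroskedasticity-robust test for serial correlation up to lag p."""
    n = len(e)
    elags = np.column_stack([e[p - j:n - j] for j in range(1, p + 1)])  # (e_{t-1},...,e_{t-p})
    Xt = X[p:]
    et = e[p:]
    # Step 1: residuals from regressing each lagged residual on X_t
    B = np.linalg.lstsq(Xt, elags, rcond=None)[0]
    v_hat = elags - Xt @ B
    # Step 2: regress 1 on v_hat_t * e_t (no intercept) and get SSR
    Z = v_hat * et[:, None]
    ones = np.ones(len(et))
    b = np.linalg.lstsq(Z, ones, rcond=None)[0]
    u = ones - Z @ b
    SSR = u @ u
    # Step 3: n - SSR is asymptotically chi^2_p
    stat = len(et) - SSR
    return stat, 1.0 - stats.chi2.cdf(stat, p)

# Example usage:
# stat, pval = wooldridge_robust_sc(X, e, p=4)
```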

The Robust Kernel-Based Test

Hong and Lee (2006) have robustified Hong's (1996) spectral density-based consistent test for serial correlation of unknown form:
\[
\hat{M} \equiv \left[n\sum_{j=1}^{n-1}k^2(j/p)\hat\gamma^2(j) - \hat{C}(p)\right]\Big/\sqrt{\hat{D}(p)},
\]
where the centering and scaling factors are
\[
\hat{C}(p) \equiv \hat\gamma^2(0)\sum_{j=1}^{n-1}k^2(j/p) + \sum_{j=1}^{n-1}k^2(j/p)\hat\gamma_{22}(j),
\]
\[
\hat{D}(p) \equiv 2\hat\gamma^4(0)\sum_{j=1}^{n-2}k^4(j/p) + 4\hat\gamma^2(0)\sum_{j=1}^{n-2}k^4(j/p)\hat\gamma_{22}(j)
+ 2\sum_{j=1}^{n-2}\sum_{l=1}^{n-2}k^2(j/p)k^2(l/p)\hat{C}(0,j,l)^2,
\]
with
\[
\hat\gamma_{22}(j) \equiv n^{-1}\sum_{t=j+1}^n [e_t^2 - \hat\gamma(0)][e_{t-j}^2 - \hat\gamma(0)]
\]
and
\[
\hat{C}(0,j,l) \equiv n^{-1}\sum_{t=\max(j,l)+1}^n [e_t^2 - \hat\gamma(0)]e_{t-j}e_{t-l}.
\]

Intuitively, the centering and scaling factors take into account possible volatility clustering and asymmetric features of volatility dynamics, so the $\hat{M}$ test is robust to these effects. It allows for various volatility processes, including GARCH models, Nelson's (1991) EGARCH, and Glosten et al.'s (1993) Threshold GARCH models.

5.9 Conclusion
In this chapter, after introducing some basic concepts in time series analysis, we show that the asymptotic theory established under the i.i.d. assumption in Chapter 4 carries over to linear ergodic stationary time series regression models with MDS disturbances. The MDS assumption for the regression disturbances plays a key role here. For a static linear regression model, the MDS assumption is crucial for the validity of White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator. For a dynamic linear regression model, the MDS assumption is crucial for correct model specification of the conditional mean $E(Y_t|I_{t-1})$.
To check the validity of the MDS assumption, one can test for serial correlation in the regression disturbances. We introduce a number of tests for serial correlation and discuss the differences in testing for serial correlation between a static regression model and a dynamic regression model.

EXERCISES
5.1. (a) Suppose that, using the Lagrange Multiplier test, one finds that there exists serial correlation in $\{\varepsilon_t\}$. Can we conclude that $\{\varepsilon_t\}$ is not a martingale difference sequence (MDS)? Give your reasoning.
(b) Suppose one finds that there exists no serial correlation in $\{\varepsilon_t\}$. Can we conclude that $\{\varepsilon_t\}$ is an MDS? Give your reasoning. [Hint: Consider a process $\varepsilon_t = z_{t-1}z_{t-2} + z_t$, where $z_t \sim$ i.i.d.$(0,\sigma^2)$.]

5.2. Suppose $\{Z_t\}$ is a zero-mean weakly stationary process with spectral density function $h(\omega)$ and normalized spectral density function $f(\omega)$. Show that:
(a) $f(\omega)$ is real-valued for all $\omega\in[-\pi,\pi]$;
(b) $f(\omega)$ is a symmetric function, i.e., $f(-\omega) = f(\omega)$;
(c) $\int_{-\pi}^{\pi}f(\omega)d\omega = 1$;
(d) $f(\omega)\ge 0$ for all $\omega\in[-\pi,\pi]$. [Hint: Consider the limit of $E|n^{-1/2}\sum_{t=1}^n Z_t e^{it\omega}|^2$, the variance of the complex-valued random variable $n^{-1/2}\sum_{t=1}^n Z_t e^{it\omega}$.]
5.3. Suppose a time series linear regression model
\[
Y_t = X_t'\beta^o + \varepsilon_t,
\]
where the disturbance $\varepsilon_t$ is directly observable, satisfies Assumptions 5.1–5.3. This class of models contains both static regression models and dynamic regression models.
(a) Does the condition $E(\varepsilon_t|X_t) = 0$ imply that $\{\varepsilon_t\}$ is a white noise? Explain.
(b) If $\{\varepsilon_t\}$ is an MDS, does it imply $E(\varepsilon_t|X_t) = 0$? Explain.
(c) If $\{\varepsilon_t\}$ is serially correlated, does it necessarily imply $E(\varepsilon_t|X_t)\neq 0$, i.e., that the linear regression model is misspecified for $E(Y_t|X_t)$? Explain.

5.4. Suppose that in a linear regression model
\[
Y_t = X_t'\beta^o + \varepsilon_t,
\]
the disturbance $\varepsilon_t$ is directly observable. We are interested in testing the null hypothesis $H_0$ that $\{\varepsilon_t\}$ is serially uncorrelated. Suppose Assumptions 5.1–5.6 hold.
(a) Consider the auxiliary regression
\[
\varepsilon_t = \sum_{j=1}^p \alpha_j\varepsilon_{t-j} + u_t, \quad t = p+1,\ldots,n.
\]
Let $\tilde{R}^2_{uc}$ be the uncentered $R^2$ from the OLS estimation of this auxiliary regression. Show that $(n-2p)\tilde{R}^2_{uc} \overset{d}{\to} \chi_p^2$ as $n\to\infty$ under $H_0$.
(b) Now consider another auxiliary regression
\[
\varepsilon_t = \alpha_0 + \sum_{j=1}^p \alpha_j\varepsilon_{t-j} + u_t, \quad t = p+1,\ldots,n.
\]
Let $\tilde{R}^2$ be the centered $R^2$ from this auxiliary regression model. Show that $(n-2p)\tilde{R}^2 \overset{d}{\to} \chi_p^2$ as $n\to\infty$ under $H_0$.
(c) Which test statistic, $(n-2p)\tilde{R}^2_{uc}$ or $(n-2p)\tilde{R}^2$, performs better in finite samples? Give your heuristic reasoning.

5.5. Suppose that in a linear regression model
\[
Y_t = X_t'\beta^o + \varepsilon_t,
\]
the disturbance $\varepsilon_t$ is directly observable. We are interested in testing the null hypothesis $H_0$ that $\{\varepsilon_t\}$ is serially uncorrelated. Suppose Assumptions 5.1–5.5 hold, and $E(\varepsilon_t^2|X_t)\neq\sigma^2$.
(a) Consider the auxiliary regression
\[
\varepsilon_t = \sum_{j=1}^p \alpha_j\varepsilon_{t-j} + u_t, \quad t = p+1,\ldots,n.
\]
Construct an asymptotically valid test statistic for the null hypothesis that there exists no serial correlation in $\{\varepsilon_t\}$.

5.6. Suppose $\varepsilon_t$ follows an ARCH(1) process
\[
\varepsilon_t = z_t\sigma_t, \quad \sigma_t^2 = \alpha_0 + \alpha_1\varepsilon_{t-1}^2, \quad \{z_t\}\sim\text{i.i.d.}N(0,1).
\]
(a) Show $E(\varepsilon_t|I_{t-1}) = 0$ and $\mathrm{cov}(\varepsilon_t,\varepsilon_{t-j}) = 0$ for all $j>0$, where $I_{t-1} = \{\varepsilon_{t-1},\varepsilon_{t-2},\ldots\}$.
(b) Show $\mathrm{cov}(\varepsilon_t^2,\varepsilon_{t-1}^2) = \alpha_1\,\mathrm{var}(\varepsilon_{t-1}^2)$.
(c) Show that the kurtosis of $\varepsilon_t$ is given by
\[
K = \frac{E(\varepsilon_t^4)}{[E(\varepsilon_t^2)]^2} = \frac{3(1-\alpha_1^2)}{1-3\alpha_1^2} > 3 \quad \text{if } \alpha_1 > 0.
\]

5.7. Suppose a time series linear regression model
\[
Y_t = X_t'\beta^o + \varepsilon_t,
\]
where the disturbance $\varepsilon_t$ is directly observable, satisfies Assumptions 5.1–5.5. Both static and dynamic regression models are covered.
Suppose there exists autoregressive conditional heteroskedasticity (ARCH) in $\{\varepsilon_t\}$, namely,
\[
E(\varepsilon_t^2|I_{t-1}) = \alpha_0 + \sum_{j=1}^q \alpha_j\varepsilon_{t-j}^2,
\]
where $I_{t-1}$ is the sigma-field generated by $\{\varepsilon_{t-1},\varepsilon_{t-2},\ldots\}$. Does this imply that one has to use the asymptotic variance formula $Q^{-1}VQ^{-1}$ for $\mathrm{avar}(\sqrt{n}\,\hat\beta)$? Explain.

5.8. Suppose a time series linear regression model
\[
Y_t = X_t'\beta^o + \varepsilon_t,
\]
where the disturbance $\varepsilon_t$ is directly observable, satisfies Assumptions 5.1–5.5, and the two time series $\{X_t\}$ and $\{\varepsilon_t\}$ are independent of each other.
Suppose there exists autoregressive conditional heteroskedasticity in $\{\varepsilon_t\}$, namely,
\[
E(\varepsilon_t^2|I_{t-1}) = \alpha_0 + \sum_{j=1}^q \alpha_j\varepsilon_{t-j}^2,
\]
where $I_{t-1}$ is the sigma-field generated by $\{\varepsilon_{t-1},\varepsilon_{t-2},\ldots\}$.
What is the form of $\mathrm{avar}(\sqrt{n}\,\hat\beta)$, where $\hat\beta$ is the OLS estimator?

5.9. Suppose a dynamic time series linear regression model
\[
Y_t = \beta_0^o + \beta_1^o Y_{t-1} + \varepsilon_t = X_t'\beta^o + \varepsilon_t
\]
satisfies Assumptions 5.1–5.5. Suppose further that there exists autoregressive conditional heteroskedasticity in $\{\varepsilon_t\}$ of the following form:
\[
E(\varepsilon_t^2|I_{t-1}) = \alpha_0 + \alpha_1 Y_{t-1}^2.
\]
What is the form of $\mathrm{avar}(\sqrt{n}\,\hat\beta)$, where $\hat\beta$ is the OLS estimator?

5.10. Suppose a time series linear regression model
\[
Y_t = X_t'\beta^o + \varepsilon_t
\]
satisfies Assumptions 5.1, 5.2 and 5.4, the two time series $\{X_t\}$ and $\{\varepsilon_t\}$ are independent of each other, and $E(\varepsilon_t) = 0$. Suppose further that there exists serial correlation in $\{\varepsilon_t\}$.
(a) Does the presence of serial correlation in $\{\varepsilon_t\}$ affect the consistency of $\hat\beta$ for $\beta^o$? Explain.
(b) Does the presence of serial correlation in $\{\varepsilon_t\}$ affect the form of the asymptotic variance $\mathrm{avar}(\sqrt{n}\,\hat\beta) = Q^{-1}VQ^{-1}$, where $V = \lim_{n\to\infty}\mathrm{var}(n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t)$? In particular, do we still have $V = E(X_tX_t'\varepsilon_t^2)$? Explain.

5.11. Suppose a dynamic time series linear regression model
\[
Y_t = \beta_0^o + \beta_1^o Y_{t-1} + \varepsilon_t = X_t'\beta^o + \varepsilon_t,
\]
where $X_t = (1, Y_{t-1})'$, satisfies Assumptions 5.1, 5.2 and 5.4. Suppose further that $\{\varepsilon_t\}$ follows an MA(1) process:
\[
\varepsilon_t = v_{t-1} + v_t,
\]
where $\{v_t\}$ is i.i.d.$(0,\sigma_v^2)$. Thus, there exists first-order serial correlation in $\{\varepsilon_t\}$.
Is the OLS estimator $\hat\beta$ consistent for $\beta^o$? Explain.

CHAPTER 6 LINEAR REGRESSION
MODELS UNDER CONDITIONAL
HETEROSKEDASTICITY AND
AUTOCORRELATION
Abstract: When the regression disturbance $\{\varepsilon_t\}$ displays serial correlation, the asymptotic results in Chapter 5 are no longer applicable, because the asymptotic variance of the OLS estimator will depend on the serial correlation in $\{X_t\varepsilon_t\}$. In this chapter, we introduce a method to estimate the asymptotic variance of the OLS estimator in the presence of heteroskedasticity and autocorrelation, and then develop test procedures based on it. Some empirical applications are considered.

Key words: Heteroskedasticity and Autocorrelation (HAC) consistent variance-


covariance matrix, Kernel function, Long-run variance-covariance matrix, Newey-West
estimator, Nonparametric estimation, Spectral density matrix.

Motivation

In Chapter 5, we assumed that $\{X_t\varepsilon_t\}$ is an MDS. In many economic applications, there may exist serial correlation in the regression error $\{\varepsilon_t\}$. As a consequence, $\{X_t\varepsilon_t\}$ is generally no longer an MDS. We now provide a few examples where $\{\varepsilon_t\}$ is serially correlated.

Example 1 [Testing a Zero Population Mean]: Suppose the daily stock return $\{Y_t\}$ is a stationary ergodic process with $E(Y_t) = \mu$. We are interested in testing the null hypothesis
\[
H_0: \mu = 0
\]
versus the alternative hypothesis
\[
H_A: \mu\neq 0.
\]
A test for $H_0$ can be based on the sample mean
\[
\bar{Y}_n = n^{-1}\sum_{t=1}^n Y_t.
\]
By a suitable CLT (White 1999), the sampling distribution of the sample mean $\bar{Y}_n$ scaled by $\sqrt{n}$ satisfies, under $H_0$,
\[
\sqrt{n}\,\bar{Y}_n \overset{d}{\to} N(0, V),
\]
where $V \equiv \mathrm{avar}(\sqrt{n}\,\bar{Y}_n)$ is the asymptotic variance of the scaled sample mean. Because
\[
\mathrm{var}(\sqrt{n}\,\bar{Y}_n) = n^{-1}\sum_{t=1}^n \mathrm{var}(Y_t) + 2n^{-1}\sum_{t=2}^n\sum_{j=1}^{t-1}\mathrm{cov}(Y_t, Y_{t-j}),
\]
serial correlation in $\{Y_t\}$ is expected to affect the asymptotic variance of $\sqrt{n}\,\bar{Y}_n$. Thus, unlike in Chapter 5, $\mathrm{avar}(\sqrt{n}\,\bar{Y}_n)$ is no longer equal to $\mathrm{var}(Y_t)$.
Suppose there exists a variance estimator $\hat{V}$ such that $\hat{V} \overset{p}{\to} V$. Then, by the Slutsky theorem, we can construct a test statistic which is asymptotically N(0,1) under $H_0$:
\[
\frac{\sqrt{n}\,\bar{Y}_n}{\sqrt{\hat{V}}} \overset{d}{\to} N(0,1).
\]
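As a preview of the long-run variance estimation developed in this chapter, the sketch below (Python) computes a Newey–West (Bartlett-kernel) estimate $\hat{V}$ of the long-run variance of $\{Y_t\}$ and the resulting robust $t$-statistic for $H_0: \mu = 0$; the simulated series and the bandwidth choice are illustrative.

```python
import numpy as np

def newey_west_lrv(Y, L):
    """Bartlett-kernel (Newey-West) long-run variance estimate with bandwidth L."""
    Y_c = Y - Y.mean()
    n = len(Y)
    V = np.sum(Y_c ** 2) / n                              # gamma_hat(0)
    for j in range(1, L + 1):
        gamma_j = np.sum(Y_c[j:] * Y_c[:-j]) / n          # gamma_hat(j)
        V += 2 * (1 - j / (L + 1)) * gamma_j              # Bartlett weights
    return V

rng = np.random.default_rng(4)
n = 2000
u = rng.standard_normal(n)
Y = np.zeros(n)
for t in range(1, n):                                     # serially correlated returns (AR(1))
    Y[t] = 0.4 * Y[t - 1] + u[t]

V_hat = newey_west_lrv(Y, L=int(4 * (n / 100) ** (2 / 9)))  # a common rule-of-thumb bandwidth
t_stat = np.sqrt(n) * Y.mean() / np.sqrt(V_hat)
print("robust t-statistic:", t_stat)                      # compare with N(0,1) critical values
```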
Example 2 [Unbiasedness Hypothesis]: Consider the following linear regression model
\[
S_{t+\tau} = \alpha + \beta F_t(\tau) + \varepsilon_{t+\tau},
\]
where $S_{t+\tau}$ is the spot foreign exchange rate at time $t+\tau$, $F_t(\tau)$ is the forward exchange rate (with maturity $\tau > 0$) at time $t$, and the disturbance $\varepsilon_{t+\tau}$ is not observable. Forward currency contracts are agreements to exchange, in the future, fixed amounts of two currencies at prices set today. No money changes hands until the contract expires or is offset.
It has been a longstanding controversy whether the current forward rate $F_t(\tau)$, as opposed to the current spot rate $S_t$, is a better predictor of the future spot rate $S_{t+\tau}$. The unbiasedness hypothesis states that the forward exchange rate (with maturity $\tau$) at time $t$ is the optimal predictor of the spot exchange rate at time $t+\tau$, namely,
\[
E(S_{t+\tau}|I_t) = F_t(\tau) \quad \text{a.s.},
\]
where $I_t$ is the information set available at time $t$. This implies
\[
H_0: \alpha = 0,\ \beta = 1,
\]
and
\[
E(\varepsilon_{t+\tau}|I_t) = 0 \quad \text{a.s.}, \quad t = 1,2,\ldots
\]

However, with $\tau > 1$, we generally do not have $E(\varepsilon_{t+j}|I_t) = 0$ a.s. for $1\le j\le \tau-1$. Consequently, there exists serial correlation in $\{\varepsilon_t\}$ up to $\tau-1$ lags under $H_0$.

Example 3 [Long Horizon Return Predictability]: There has been much interest in regressions of asset returns, measured over various horizons, on various forecasting variables. The latter include ratios of price to dividends or earnings, various interest rate measures such as the yield spread between long- and short-term rates, the quality yield spread between low- and high-grade corporate bonds, and the short-term interest rate.
Consider a regression
\[
Y_{t+h,h} = \beta_0 + \beta_1 r_t + \beta_2(d_t - p_t) + \varepsilon_{t+h,h},
\]
where $Y_{t+h,h}$ is the cumulative return over the holding period from time $t$ to time $t+h$, namely,
\[
Y_{t+h,h} = \sum_{j=1}^h R_{t+j},
\]
where $R_{t+j}$ is the asset return in period $t+j$, $r_t$ is the short-term interest rate at time $t$, and $d_t - p_t$ is the log dividend-price ratio, which is expected to be a good proxy for market expectations of future stock returns, because $d_t - p_t$ is equal to the expectation of the sum of all discounted future returns and dividend growth rates. In empirical finance, there has been interest in investigating how the predictability of asset returns by various forecasting variables depends on the time horizon $h$. For example, it is expected that $d_t - p_t$ is a better proxy for expectations of long horizon returns than for expectations of short horizon returns. When monthly data are used and $h > 1$, the observations on $Y_{t+h,h}$ overlap. As a result, the regression disturbance $\varepsilon_{t+h,h}$ is expected to display serial correlation up to lag order $h-1$.

Example 4 [Relationship between GDP and Money Supply]: Consider the linear macroeconomic regression model
\[
Y_t = \alpha + \beta M_t + \varepsilon_t,
\]
where $Y_t$ is GDP at time $t$, $M_t$ is the money supply at time $t$, and $\varepsilon_t$ is an unobservable disturbance such that $E(\varepsilon_t|M_t) = 0$ but there may exist strong serial correlation of unknown form in $\{\varepsilon_t\}$.

Question: What happens to the OLS estimator $\hat\beta$ if the disturbance $\{\varepsilon_t\}$ displays conditional heteroskedasticity (i.e., $E(\varepsilon_t^2|X_t) = \sigma^2$ a.s. fails) and/or autocorrelation (i.e., $\mathrm{cov}(\varepsilon_t,\varepsilon_{t-j})\neq 0$ for some $j>0$)? In particular,

3
Is the OLS estimator ^ consistent for o
?

Is ^ asymptotically most e¢ cient?

Is ^ ; after properly scaled, asymptotically normal?

Are the t-test and F -test statistics are applicable for large sample inference?

6.1 Framework and Assumptions


We now state the set of assumptions which allow for serial correlation and conditional
heteroskedasticity of unknown form.

Assumption 6.1 [Ergodic Stationarity]: {(Y_t, X_t')'}_{t=1}^{n} is a stationary ergodic process.

Assumption 6.2 [Linearity]:
    Y_t = X_t'\beta^o + \varepsilon_t,
where \beta^o is a K x 1 unknown parameter vector and \varepsilon_t is the unobservable disturbance.

Assumption 6.3 [Correct Model Specification]: E(\varepsilon_t | X_t) = 0 a.s.

Assumption 6.4 [Nonsingularity]: The K x K matrix
    Q = E(X_t X_t')
is finite and nonsingular.

Assumption 6.5 [Long-run Variance]: (i) For j = 0, \pm 1, ..., put the K x K matrix
    \Gamma(j) = cov(X_t\varepsilon_t, X_{t-j}\varepsilon_{t-j}) = E[X_t\varepsilon_t\varepsilon_{t-j}X_{t-j}'].
Then \sum_{j=-\infty}^{\infty} ||\Gamma(j)|| < \infty, where ||A|| = \sum_{i=1}^{K}\sum_{j=1}^{K} |A_{(i,j)}| for any K x K matrix A, and the
long-run variance-covariance matrix
    V = \sum_{j=-\infty}^{\infty} \Gamma(j)
is p.d.
(ii) The conditional expectation
    E(X_t\varepsilon_t | X_{t-j}\varepsilon_{t-j}, X_{t-j-1}\varepsilon_{t-j-1}, ...) \xrightarrow{q.m.} 0 as j \to \infty;
(iii) \sum_{j=0}^{\infty} [E(r_j'r_j)]^{1/2} < \infty, where
    r_j = E(X_t\varepsilon_t | X_{t-j}\varepsilon_{t-j}, X_{t-j-1}\varepsilon_{t-j-1}, ...) - E(X_t\varepsilon_t | X_{t-j-1}\varepsilon_{t-j-1}, X_{t-j-2}\varepsilon_{t-j-2}, ...).

Remarks:
Assumptions 6.1-6.4 have been assumed in Chapter 5, but Assumption 6.5 is new.
Assumption 6.5(i) allows for both conditional heteroskedasticity and autocorrelation of
unknown form in {\varepsilon_t}, and no normality assumption is imposed on {\varepsilon_t}.
We do not assume that {X_t\varepsilon_t} is an MDS, although E(X_t\varepsilon_t) = 0 is implied by
E(\varepsilon_t | X_t) = 0 a.s. Note that E(\varepsilon_t | X_t) = 0 a.s. does not necessarily imply that {X_t\varepsilon_t} is an
MDS in a time series context. See the aforementioned examples for which {X_t\varepsilon_t} is not an
MDS.

Assumptions 6.5(ii, iii) imply that the serial dependence of X_t\varepsilon_t on its past history,
in terms of mean and variance respectively, vanishes to zero as the lag order j \to \infty. Intu-
itively, r_j in Assumption 6.5(iii) may be viewed as the net effect of X_{t-j}\varepsilon_{t-j} on the conditional
mean of X_t\varepsilon_t, and the assumption implies that E(r_j'r_j) \to 0 as j \to \infty.

6.2 Long-run Variance Estimation


Question: Why are we interested in V?

Recall that for the OLS estimator \hat{\beta}, we have
    \sqrt{n}(\hat{\beta} - \beta^o) = \hat{Q}^{-1} n^{-1/2} \sum_{t=1}^{n} X_t\varepsilon_t.
Suppose the CLT holds for {X_t\varepsilon_t}. That is, suppose
    n^{-1/2} \sum_{t=1}^{n} X_t\varepsilon_t \xrightarrow{d} N(0, V),
where V is the asymptotic variance, namely
    V \equiv avar(n^{-1/2} \sum_{t=1}^{n} X_t\varepsilon_t) = \lim_{n\to\infty} var(n^{-1/2} \sum_{t=1}^{n} X_t\varepsilon_t).
Then, by the Slutsky theorem, we have
    \sqrt{n}(\hat{\beta} - \beta^o) \xrightarrow{d} N(0, Q^{-1} V Q^{-1})
under suitable regularity conditions.


Put
    g_t = X_t\varepsilon_t.
Note that E(g_t) = 0 given E(\varepsilon_t | X_t) = 0 and the law of iterated expectations. Because
{g_t} is not an MDS, it may be serially correlated. Thus, the autocovariance function
\Gamma(j) = cov(g_t, g_{t-j}) may not be zero, at least for some lag order j > 0.
Now we consider the variance
    var(n^{-1/2} \sum_{t=1}^{n} X_t\varepsilon_t) = var(n^{-1/2} \sum_{t=1}^{n} g_t)
        = E[(n^{-1/2} \sum_{t=1}^{n} g_t)(n^{-1/2} \sum_{s=1}^{n} g_s)']
        = n^{-1} \sum_{t=1}^{n} \sum_{s=1}^{n} E(g_t g_s')
        = n^{-1} \sum_{t=1}^{n} E(g_t g_t') + n^{-1} \sum_{t=2}^{n} \sum_{s=1}^{t-1} E(g_t g_s') + n^{-1} \sum_{t=1}^{n-1} \sum_{s=t+1}^{n} E(g_t g_s')
        = n^{-1} \sum_{t=1}^{n} E(g_t g_t') + \sum_{j=1}^{n-1} n^{-1} \sum_{t=j+1}^{n} E(g_t g_{t-j}') + \sum_{j=-(n-1)}^{-1} n^{-1} \sum_{t=1}^{n+j} E(g_t g_{t-j}')
        = \sum_{j=-(n-1)}^{n-1} (1 - |j|/n) \Gamma(j)
        \to \sum_{j=-\infty}^{\infty} \Gamma(j)   as n \to \infty
by dominated convergence. Therefore, we have V = \sum_{j=-\infty}^{\infty} \Gamma(j).

In contrast, when {g_t} is an MDS, we have
    V \equiv avar(n^{-1/2} \sum_{t=1}^{n} X_t\varepsilon_t) = E(g_t g_t') = E(X_t X_t' \varepsilon_t^2) = \Gamma(0).

When cov(g_t, g_{t-j}) is p.s.d. for all j > 0, the difference \sum_{j=-\infty}^{\infty} \Gamma(j) - \Gamma(0) is a p.s.d.
matrix. Intuitively, when \Gamma(j) is p.s.d., a large deviation of g_t from its mean will tend
to be followed by another large deviation. As a result, V - \Gamma(0) is p.s.d.
To explore the link between the long-run variance V and the spectral density matrix
of {X_t\varepsilon_t}, which is crucial for consistent estimation of V, we now extend the concept of
the spectral density of a univariate time series to a multivariate time series context.

Definition 6.1 [Spectral Density Matrix]: Suppose {g_t = X_t\varepsilon_t} is a K x 1 weakly
stationary process with E(g_t) = 0 and autocovariance function \Gamma(j) \equiv cov(g_t, g_{t-j}) =
E(g_t g_{t-j}'), which is a K x K matrix. Suppose
    \sum_{j=-\infty}^{\infty} ||\Gamma(j)|| < \infty.
Then the Fourier transform of the autocovariance function \Gamma(j) exists and is given by
    H(\omega) = \frac{1}{2\pi} \sum_{j=-\infty}^{\infty} \Gamma(j) \exp(-ij\omega),   \omega \in [-\pi, \pi],
where i = \sqrt{-1}. The K x K matrix-valued function H(\omega) is called the spectral density
matrix of the weakly stationary vector-valued process {g_t}.

Remarks:

The inverse Fourier transform of the spectral density matrix is
    \Gamma(j) = \int_{-\pi}^{\pi} H(\omega) e^{ij\omega} d\omega.
H(\omega) and \Gamma(j) are Fourier transforms of each other. They contain the same amount
of information on the serial dependence of the process {g_t = X_t\varepsilon_t}. The spectral density
matrix H(\omega) is useful for identifying business cycles (see Sargent 1987, Dynamic Macroeco-
nomics, 2nd Edition). For example, if g_t is the GDP growth rate at time t, then H(\omega)
can be used to identify business cycles of the economy.
When \omega = 0, the long-run variance-covariance matrix is
    V = 2\pi H(0) = \sum_{j=-\infty}^{\infty} \Gamma(j).
That is, the long-run variance V is 2\pi times the spectral density matrix of the time
series process {g_t} at frequency zero. As will be seen below, this link provides a basis
for consistent nonparametric estimation of V.

Question: What are the elements of the K x K matrix \Gamma(j)?

Recall that g_t = (g_{0t}, g_{1t}, ..., g_{kt})', where g_{lt} = X_{lt}\varepsilon_t for 0 \leq l \leq k. Then the (l+1, m+1)-th
element of \Gamma(j) is
    [\Gamma(j)]_{(l+1,m+1)} = \gamma_{lm}(j) = cov[g_{lt}, g_{m(t-j)}] = cov[X_{lt}\varepsilon_t, X_{m(t-j)}\varepsilon_{t-j}],
which is the cross-covariance between X_{lt}\varepsilon_t and X_{m(t-j)}\varepsilon_{t-j}. We note that in general
    \gamma_{lm}(j) \neq \gamma_{lm}(-j),
because g_t is a vector, not a scalar. Instead, we have
    \Gamma(j) = \Gamma(-j)',
which implies \gamma_{lm}(j) = \gamma_{ml}(-j).

Question: What is the (l+1, m+1)-th element of H(\omega) when l \neq m? The function
    H_{lm}(\omega) = \frac{1}{2\pi} \sum_{j=-\infty}^{\infty} \gamma_{lm}(j) e^{-ij\omega}
is called the cross-spectral density between {g_{lt}} and {g_{mt}}. The cross-spectrum is very
useful in investigating comovements between different economic time series. The
popular concept of Granger causality was first defined using the cross-spectrum (see
Granger 1969, Econometrica). In general, H_{lm}(\omega) is complex-valued.

Question: How can we estimate V?

Recall the important identity
    V = 2\pi H(0) = \sum_{j=-\infty}^{\infty} \Gamma(j),
where \Gamma(j) = cov(g_t, g_{t-j}). The long-run variance V is 2\pi times H(0), the spectral density
matrix at frequency zero. This provides the basis for a nonparametric approach to
estimating V.

A possible naive estimation method:

Given a random sample {(Y_t, X_t')'}_{t=1}^{n}, we can obtain the estimated OLS residual e_t
from the linear regression model Y_t = X_t'\beta^o + \varepsilon_t. Because
    V = \sum_{j=-\infty}^{\infty} \Gamma(j),
we first consider a naive estimator
    \hat{V} = \sum_{j=-(n-1)}^{n-1} \hat{\Gamma}(j),
where the sample autocovariance function is
    \hat{\Gamma}(j) = n^{-1} \sum_{t=j+1}^{n} X_t e_t X_{t-j}' e_{t-j},   j = 0, 1, ..., n-1,
    \hat{\Gamma}(j) = \hat{\Gamma}(-j)',   j = -1, -2, ..., -(n-1).
There is no need to subtract the sample mean from X_t e_t and X_{t-j} e_{t-j} because X'e =
\sum_{t=1}^{n} X_t e_t = 0. Also, note that the summation over lag orders in \hat{V} extends to the
maximum lag order n-1 for the sample autocovariance function \hat{\Gamma}(j). Unfortunately,
although \hat{\Gamma}(j) is consistent for \Gamma(j) for each given j as n \to \infty, the estimator \hat{V} is not
consistent for V.

Question: Why?
There are too many estimated terms in the summation over lag orders. In fact, there
are n estimated matrices {\hat{\Gamma}(j)}_{j=0}^{n-1} in \hat{V}. The asymptotic variance of the estimator
\hat{V} defined above is proportional to the ratio of the number of estimated autocovariance
matrices {\hat{\Gamma}(j)} to the sample size n, which will not vanish to zero if the number of
estimated covariances is the same as or close to the sample size n.
Nonparametric Kernel Estimation
The above explanation motivates us to consider the following truncated sum
    \hat{V} = \sum_{j=-p}^{p} \hat{\Gamma}(j),
where p is a positive integer. If p is fixed (i.e., p does not grow when the sample size n
increases), however, we expect
    \hat{V} \xrightarrow{p} \sum_{j=-p}^{p} \Gamma(j) \neq 2\pi H(0) = V,
because the resulting bias
    2\pi H(0) - \sum_{j=-p}^{p} \Gamma(j) = \sum_{|j|>p} \Gamma(j)
will never vanish to zero as n \to \infty when p is fixed. Hence, we should let p grow to
infinity as n \to \infty, that is, let p = p(n) \to \infty as n \to \infty. The bias will then vanish to
zero as n \to \infty. However, we cannot let p grow as fast as the sample size n; otherwise,
the variance of \hat{V} will never vanish to zero. Therefore, to ensure consistency of \hat{V} for
V, we should balance the bias and the variance of \hat{V} properly. This requires using a
truncated variance estimator
    \hat{V} = \sum_{j=-p_n}^{p_n} \hat{\Gamma}(j),
where p_n \to \infty and p_n/n \to 0. An example is p_n = n^{1/3}.

Although this estimator is consistent for V, it may not be positive semi-definite for
all n. To ensure that it is always positive semi-definite, we can use a weighted average
estimator
    \hat{V} = \sum_{j=-p_n}^{p_n} k(j/p_n) \hat{\Gamma}(j),
where the weighting function k(\cdot) is called a kernel function. An example of such kernels
is the Bartlett kernel
    k(z) = (1 - |z|) 1(|z| \leq 1),
where 1(\cdot) is the indicator function, which takes value 1 if the condition inside holds and
takes value 0 otherwise. Newey and West (1987, Econometrica; 1994, Review of Economic
Studies) first used this kernel function to estimate V in econometrics. The truncated
variance estimator \hat{V} can be viewed as a kernel-based estimator with the use of the
truncated kernel k(z) = 1(|z| \leq 1), which assigns equal weight to each of the first p_n lags.

Most kernels are downward-weighting in the sense that k(z) \to 0 as |z| \to \infty. The
use of a downward-weighting kernel may enhance the estimation efficiency of \hat{V} because,
when \sum_{j=-\infty}^{\infty} ||\Gamma(j)|| < \infty, we have \Gamma(j) \to 0 as j \to \infty, and so it is more efficient to
assign a larger weight to a lower order j and a smaller weight to a higher order j.
In fact, we can consider a more general form of estimator for V:
    \hat{V} = \sum_{j=1-n}^{n-1} k(j/p_n) \hat{\Gamma}(j),
where k(\cdot) may have unbounded support. Although the lag order j runs from 1-n to
n-1, the variance of the estimator \hat{V} still vanishes to zero, provided p_n \to \infty, p_n/n \to 0,
and k(\cdot) discounts higher order lags as j \to \infty. An example of k(\cdot) with unbounded
support is the Quadratic-Spectral kernel
    k(z) = \frac{3}{(\pi z)^2} \left( \frac{\sin(\pi z)}{\pi z} - \cos(\pi z) \right),   -\infty < z < \infty.
Andrews (1991, Econometrica) uses it to estimate V. This kernel also delivers a
p.s.d. matrix. Moreover, it minimizes the asymptotic MSE of the estimator \hat{V} over a
class of kernel functions.
Under certain regularity conditions on the random sample {(Y_t, X_t')'}_{t=1}^{n}, the kernel function
k(\cdot), and the lag order p_n (Newey and West 1987, Andrews 1991), we have
    \hat{V} \xrightarrow{p} V
provided p_n \to \infty and p_n/n \to 0. Intuitively, although the summation over lag orders in \hat{V}
extends to the maximum lag order n-1, lag orders much larger than p_n are
expected to have negligible contributions to \hat{V}, given that k(\cdot) discounts higher order lags.
As a consequence, we have \hat{V} \xrightarrow{p} V. There are many rules that satisfy p_n \to \infty and p_n/n \to 0.
Andrews (1991) and Newey and West (1994) discuss data-driven methods to choose p_n.
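To make the kernel estimator concrete, the following sketch (a minimal illustration, not a full implementation: it uses the Bartlett kernel in the Newey-West convention with weight 1 - j/(p_n + 1), a user-supplied bandwidth, and hypothetical simulated data; all names are illustrative) computes \hat{V} = \sum_j k(j/p_n)\hat{\Gamma}(j) from OLS residuals and the resulting HAC standard errors based on Q^{-1} V Q^{-1}:

```python
import numpy as np

def hac_long_run_variance(X, e, p_n):
    """Bartlett-kernel estimator of V = sum_j Gamma(j), with Gamma_hat(j) = n^{-1} sum_t X_t e_t e_{t-j} X_{t-j}'."""
    n, K = X.shape
    g = X * e[:, None]                 # g_t = X_t e_t, an n x K array
    V = g.T @ g / n                    # Gamma_hat(0)
    for j in range(1, p_n + 1):
        w = 1.0 - j / (p_n + 1)        # Bartlett weight (Newey-West convention)
        Gj = g[j:].T @ g[:-j] / n      # Gamma_hat(j)
        V += w * (Gj + Gj.T)           # add Gamma_hat(j) + Gamma_hat(-j)
    return V

# usage sketch on hypothetical simulated data
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
eps = np.convolve(rng.standard_normal(n), [1.0, 0.5, 0.25], mode="same")  # serially correlated errors
y = X @ np.array([1.0, 2.0]) + eps
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
V_hat = hac_long_run_variance(X, e, int(n ** (1 / 3)))
Q_inv = np.linalg.inv(X.T @ X / n)
avar = Q_inv @ V_hat @ Q_inv           # estimate of avar(sqrt(n)(beta_hat - beta^o))
print(np.sqrt(np.diag(avar) / n))      # HAC standard errors
```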

Question: What are the regularity conditions on k(\cdot)?

Assumption on the kernel function: k : R \to [-1, 1] is symmetric about 0 and is
continuous at all points except possibly a finite number of points on R, with k(0) = 1 and
\int_{-\infty}^{\infty} k^2(z) dz < \infty.

At the point 0, k(\cdot) attains its maximal value, and the fact that k(\cdot) is square-integrable
implies k(z) \to 0 as |z| \to \infty.

For derivations of the asymptotic variance and asymptotic bias of the long-run variance
estimator \hat{V}, see Newey and West (1987) and Andrews (1991).

6.3 Consistency of OLS

When there exist conditional heteroskedasticity and autocorrelation of unknown
form in {\varepsilon_t}, it is very difficult, if not impossible, to use GLS estimation. Instead,
the OLS estimator \hat{\beta} is convenient to use in practice. We now investigate the asymptotic
properties of the OLS estimator \hat{\beta} when there exist conditional heteroskedasticity and
autocorrelation of unknown form.

Theorem 6.1: Suppose Assumptions 6.1-6.5(i) hold. Then
    \hat{\beta} \xrightarrow{p} \beta^o   as n \to \infty.

Proof: Recall that we have
    \hat{\beta} - \beta^o = \hat{Q}^{-1} n^{-1} \sum_{t=1}^{n} X_t\varepsilon_t.
By Assumptions 6.1, 6.2 and 6.4 and the WLLN for stationary ergodic processes, we
have
    \hat{Q} \xrightarrow{p} Q   and   \hat{Q}^{-1} \xrightarrow{p} Q^{-1}.
Similarly, by Assumptions 6.1-6.3 and 6.5(i), we have
    n^{-1} \sum_{t=1}^{n} X_t\varepsilon_t \xrightarrow{p} E(X_t\varepsilon_t) = 0
using the WLLN for ergodic stationary processes, where E(X_t\varepsilon_t) = 0 follows from
Assumption 6.3 (E(\varepsilon_t | X_t) = 0 a.s.) and the LIE.

6.4 Asymptotic Normality of OLS


Next, we derive the asymptotic distribution of \sqrt{n}(\hat{\beta} - \beta^o).

Theorem 6.2: Suppose Assumptions 6.1-6.5 hold. Then
    \sqrt{n}(\hat{\beta} - \beta^o) \xrightarrow{d} N(0, Q^{-1} V Q^{-1}),
where V = \sum_{j=-\infty}^{\infty} \Gamma(j) is as in Assumption 6.5.

The proof of this theorem calls for the use of a new CLT.

Lemma 6.3 [CLT for Zero Mean Ergodic Stationary Processes (White 1984,
Theorem 5.15)]: Suppose {Z_t} is a stationary ergodic process with
(i) E(Z_t) = 0;
(ii) V = \sum_{j=-\infty}^{\infty} \Gamma(j) is finite and nonsingular, where \Gamma(j) = E(Z_t Z_{t-j}');
(iii) E(Z_t | Z_{t-j}, Z_{t-j-1}, ...) \xrightarrow{q.m.} 0 as j \to \infty;
(iv) \sum_{j=0}^{\infty} [E(r_j'r_j)]^{1/2} < \infty, where
    r_j = E(Z_t | Z_{t-j}, Z_{t-j-1}, ...) - E(Z_t | Z_{t-j-1}, Z_{t-j-2}, ...).
Then as n \to \infty,
    n^{1/2} \bar{Z}_n = n^{-1/2} \sum_{t=1}^{n} Z_t \xrightarrow{d} N(0, V).

We now use this CLT to derive the asymptotic distribution of \sqrt{n}(\hat{\beta} - \beta^o).

Proof: Recall that
    \sqrt{n}(\hat{\beta} - \beta^o) = \hat{Q}^{-1} n^{-1/2} \sum_{t=1}^{n} X_t\varepsilon_t.
By Assumptions 6.1-6.3 and 6.5 and the CLT for stationary ergodic processes, we have
    n^{-1/2} \sum_{t=1}^{n} X_t\varepsilon_t \xrightarrow{d} N(0, V),
where V = \sum_{j=-\infty}^{\infty} \Gamma(j) is as in Assumption 6.5. Also, \hat{Q} \xrightarrow{p} Q and \hat{Q}^{-1} \xrightarrow{p} Q^{-1} by
Assumption 6.4 and the WLLN for ergodic stationary processes. We then have by the
Slutsky theorem
    \sqrt{n}(\hat{\beta} - \beta^o) \xrightarrow{d} N(0, Q^{-1} V Q^{-1}).

6.5 Hypothesis Testing


We now consider testing the null hypothesis
    H_0: R\beta^o = r,
where R is a nonstochastic J x K matrix and r is a J x 1 nonstochastic vector.

When there exists autocorrelation in {X_t\varepsilon_t}, there is no need (and in fact there is no
way) to consider the cases of conditional homoskedasticity and conditional heteroskedas-
ticity separately (why?).

Corollary 6.4: Suppose Assumptions 6.1-6.5 hold. Then under H_0, as n \to \infty,
    \sqrt{n}(R\hat{\beta} - r) \xrightarrow{d} N(0, R Q^{-1} V Q^{-1} R').

We directly assume a consistent estimator \hat{V} for V.

Assumption 6.6: \hat{V} \xrightarrow{p} V.

When there exists serial correlation of unknown form, we can estimate V using the
nonparametric kernel estimator \hat{V} described in Section 6.2. In some special scenarios,
we may have \Gamma(j) = 0 for all j > p_0, where p_0 is a fixed lag order. An example of this
case is Example 2 in Section 6.1. In this case, we can use the following estimator:
    \hat{V} = \sum_{j=-p_0}^{p_0} \hat{\Gamma}(j).
It can be shown that \hat{V} \xrightarrow{p} V in this case.
For the case where J = 1, a robust t-type test statistic is
    \frac{\sqrt{n}(R\hat{\beta} - r)}{\sqrt{R \hat{Q}^{-1} \hat{V} \hat{Q}^{-1} R'}} \xrightarrow{d} N(0, 1),
where the convergence to N(0,1) in distribution holds under H_0.

Question: Why is it called a "robust" t-type test?

This statistic uses an asymptotic variance estimator that is robust to conditional
heteroskedasticity and autocorrelation of unknown form.

For the case where J > 1, we consider a "robust" Wald test.

Theorem 6.5: Under Assumptions 6.1-6.6, the Wald test statistic
    \hat{W} = n^{-1} (R\hat{\beta} - r)' [R (X'X)^{-1} \hat{V} (X'X)^{-1} R']^{-1} (R\hat{\beta} - r) \xrightarrow{d} \chi^2_J
as n \to \infty under H_0: R\beta^o = r.

Proof: Because
    \sqrt{n}(R\hat{\beta} - r) \xrightarrow{d} N(0, R Q^{-1} V Q^{-1} R'),
we have the quadratic form
    \sqrt{n}(R\hat{\beta} - r)' [R Q^{-1} V Q^{-1} R']^{-1} \sqrt{n}(R\hat{\beta} - r) \xrightarrow{d} \chi^2_J.
By the Slutsky theorem, we have the Wald test statistic
    \hat{W} = n (R\hat{\beta} - r)' [R \hat{Q}^{-1} \hat{V} \hat{Q}^{-1} R']^{-1} (R\hat{\beta} - r) \xrightarrow{d} \chi^2_J.

Using the expression \hat{Q} = X'X/n, we have an equivalent expression for \hat{W}:
    \hat{W} = n^{-1} (R\hat{\beta} - r)' [R (X'X)^{-1} \hat{V} (X'X)^{-1} R']^{-1} (R\hat{\beta} - r) \xrightarrow{d} \chi^2_J.
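As an illustration (not from the original text), here is a minimal sketch of the robust Wald statistic in the second form above; V_hat is assumed to come from a HAC estimator such as the kernel sketch given in Section 6.2, and all names are illustrative:

```python
import numpy as np

def robust_wald(X, y, R, r, V_hat):
    """Robust Wald statistic W = n^{-1}(Rb - r)'[R(X'X)^{-1} V_hat (X'X)^{-1} R']^{-1}(Rb - r)."""
    n = len(y)
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ y)                       # OLS estimate
    diff = R @ b - r
    middle = R @ XtX_inv @ V_hat @ XtX_inv @ R.T
    W = diff @ np.linalg.solve(middle, diff) / n
    return W, b

# usage sketch: test H0: beta_1 = 0 in a two-regressor model
# R = np.array([[0.0, 1.0]]); r = np.array([0.0])
# W, b = robust_wald(X, y, R, r, V_hat)
# reject H0 at level 5% if W exceeds the chi-square(J) critical value, J = R.shape[0]
```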

Remarks:

The standard t-statistic and F-statistic cannot be used when there exist autocorre-
lation and conditional heteroskedasticity in {X_t\varepsilon_t}.
Question: Can we use this Wald test when \Gamma(j) = 0 for all nonzero j?
Yes. But it is not a good test statistic because it may perform poorly in finite
samples. In particular, it usually overrejects the correct null hypothesis H_0 in finite
samples even if \Gamma(j) = 0 for all j \neq 0. In the case where \Gamma(j) = 0 for all j \neq 0, a better
estimator to use is
    \hat{V} = \hat{\Gamma}(0) = n^{-1} \sum_{t=1}^{n} X_t e_t^2 X_t' = X' D(e) D(e)' X / n,
where D(e) = diag(e_1, ..., e_n). This is essentially White's heteroskedasticity-consistent
variance estimator (also see Chapter 5).

Question: Why do the robust t- and Wald tests tend to overreject H_0 in the presence
of conditional heteroskedasticity and autocorrelation?

We use the robust t-test as an example. Recall that \hat{V} is an estimator for H(0) up to a
factor of 2\pi. When there exists strong positive serial correlation in {\varepsilon_t}, as is typical of
economic time series, H(\omega) will display a peak or mode at frequency zero. The kernel
estimator, which is a local averaging estimator, tends to underestimate H(0)
because it has an asymptotic negative bias. Consequently, the robust t-statistic tends
to take larger values, because it is the ratio of R\hat{\beta} - r to the square root of a
variance estimator which tends to be smaller than the true variance.

Simulation Evidence

6.6 Testing Whether Long-run Variance Estimation Is Needed
Because of the notoriously poor finite sample performance of the robust t- and Wald tests even when
\Gamma(j) = 0 for all j \neq 0, it is very important to test whether we really have to use a
long-run variance estimator.

Question: How can we test whether we need to use the long-run variance-covariance matrix
estimator? That is, how can we test the null hypothesis
    H_0: 2\pi H(0) \equiv \sum_{j=-\infty}^{\infty} \Gamma(j) = \Gamma(0)?

The null hypothesis H_0 can be equivalently written as follows:
    H_0: \sum_{j=1}^{\infty} \Gamma(j) = 0.

It can arise from two cases:

(i) \Gamma(j) = 0 for all j \neq 0;
(ii) \Gamma(j) \neq 0 for some j \neq 0, but \sum_{j=1}^{\infty} \Gamma(j) = 0. For simplicity, we will consider the
first case only. Case (ii) is pathological, although it could occur in practice.

We now provide a test for H_0 under case (i). See Hong (1997) for a related univariate
context.

To test the null hypothesis that \sum_{j=1}^{\infty} \Gamma(j) = 0, we can use a consistent estimator \hat{A}
(say) for \sum_{j=1}^{\infty} \Gamma(j) and then check whether \hat{A} is close to zero. Any significant
difference of \hat{A} from zero will indicate a violation of the null hypothesis, and thus a
long-run variance estimator is needed.
To estimate \sum_{j=1}^{\infty} \Gamma(j) consistently, we can use a nonparametric kernel estimator
    \hat{A} = \sum_{j=1}^{n-1} k(j/p_n) vech[\hat{\Gamma}(j)],
where p_n = p(n) \to \infty at a suitable rate as n \to \infty. We shall derive the asymptotic
distribution of \hat{A} (with suitable scaling) under the assumption that {g_t = X_t\varepsilon_t} is an MDS,
which implies the null hypothesis H_0 that \sum_{j=1}^{\infty} \Gamma(j) = 0. First, we consider the case
when {g_t = X_t\varepsilon_t} is autoregressively conditionally homoskedastic, namely var(g_t | I_{t-1}) =
var(g_t), where I_{t-1} = {g_{t-1}, g_{t-2}, ...}. In this case, we can show
    [p_n \int_0^{\infty} k^2(z) dz]^{-1/2} \Omega^{-1/2} \sqrt{n}\,\hat{A} \xrightarrow{d} N(0, I_{K(K+1)/2}),
where \Omega denotes the asymptotic variance of \sqrt{n}\,vech[\hat{\Gamma}(j)], which does not depend on j and
is determined by \Gamma(0) under autoregressive conditional homoskedasticity. We can then construct a
test statistic
    \hat{M} = [p_n \int_0^{\infty} k^2(z) dz]^{-1} n \hat{A}' \hat{\Omega}^{-1} \hat{A} \xrightarrow{d} \chi^2_{K(K+1)/2},
where \hat{\Omega} is a consistent estimator of \Omega constructed from \hat{\Gamma}(0).
Next, we consider the case when {g_t = X_t\varepsilon_t} is autoregressively conditionally het-
eroskedastic, namely var(g_t | I_{t-1}) \neq var(g_t). In this case, the test statistic is
    \hat{M} = n \hat{A}' \hat{B}^{-1} \hat{A},
where
    \hat{B} = \sum_{j=1}^{n-1} \sum_{l=1}^{n-1} k(j/p_n) k(l/p_n) \hat{C}(j, l),
    \hat{C}(j, l) = n^{-1} \sum_{t=1+\max(j,l)}^{n} vech(\hat{g}_t \hat{g}_{t-j}') vech'(\hat{g}_t \hat{g}_{t-l}'),
with \hat{g}_t = X_t e_t. Under the assumption that {g_t = X_t\varepsilon_t} is an MDS, we have
    \hat{M} \xrightarrow{d} \chi^2_{K(K+1)/2}.
This test is robust to autoregressive conditional heteroskedasticity of unknown form in
{g_t = X_t\varepsilon_t}.

A Related Test: Variance Ratio Test

In fact, the above test is closely related to a variance ratio test that is popular
in financial econometrics. Extending an idea of Cochrane (1988), Lo and MacKinlay
(1988) first rigorously presented an asymptotic theory for a variance ratio test of the
MDS hypothesis for asset returns {Y_t}. Recall that \sum_{j=1}^{p} Y_{t-j} is the cumulative asset
return over a total of p periods. Then under the MDS hypothesis, which implies \gamma(j) \equiv
cov(Y_t, Y_{t-j}) = 0 for all j > 0, one has
    \frac{var(\sum_{j=1}^{p} Y_{t-j})}{p\,var(Y_t)} = \frac{p\gamma(0) + 2p \sum_{j=1}^{p-1} (1 - j/p)\gamma(j)}{p\gamma(0)} = 1.
This unity property of the variance ratio can be used to test the MDS hypothesis because
any departure from unity is evidence against the MDS hypothesis.
The variance ratio test is essentially based on the statistic
    VR_o \equiv \sqrt{n/p} \sum_{j=1}^{p-1} (1 - j/p)\hat{\rho}(j) = \sqrt{n/p}\, [\pi \hat{f}(0) - 1/2],
where
    \hat{f}(0) = \frac{1}{2\pi} \sum_{j=-p}^{p} (1 - |j|/p)\hat{\rho}(j)
is a kernel-based normalized spectral density estimator at frequency 0, with the Bartlett
kernel k(z) = (1 - |z|)1(|z| \leq 1) and a lag order equal to p. Thus, the variance ratio test
is the same as checking whether the long-run variance is equal to the individual variance
\gamma(0). Because VR_o is based on a spectral density estimator at frequency 0, it is
particularly powerful against long memory processes, whose spectral density
at frequency 0 is infinite (see Robinson 1994 for a discussion of long memory processes).
Under the MDS hypothesis with conditional homoskedasticity for {Y_t}, Lo and
MacKinlay (1988) show that for any fixed p,
    VR_o \xrightarrow{d} N[0, 2(2p-1)(p-1)/3p]   as n \to \infty.
When {Y_t} displays conditional heteroskedasticity, Lo and MacKinlay (1988) also con-
sider a heteroskedasticity-consistent variance ratio test:
    VR \equiv \sqrt{n/p} \sum_{j=1}^{p-1} (1 - j/p)\hat{\gamma}(j)/\hat{\sigma}^2(j),
where \hat{\sigma}^2(j) is a consistent estimator for the asymptotic variance of \hat{\gamma}(j) under condi-
tional heteroskedasticity. Lo and MacKinlay (1988) assume a fourth order cumulant
condition that
    E[(Y_t - \mu)^2 (Y_{t-j} - \mu)(Y_{t-l} - \mu)] = 0,   j, l > 0, j \neq l.
Intuitively, this condition ensures that the sample autocovariances at different lags are
asymptotically uncorrelated; that is, cov[\sqrt{n}\hat{\gamma}(j), \sqrt{n}\hat{\gamma}(l)] \to 0 for all j \neq l. As a result,
the heteroskedasticity-consistent VR has the same asymptotic distribution as VR_o. How-
ever, this condition rules out many important volatility processes,
such as EGARCH and Threshold GARCH models. Moreover, the variance ratio test
only exploits the implication of the MDS hypothesis for the spectral density at fre-
quency 0; it does not check the spectral density at nonzero frequencies. As a result, it is
not consistent against serial correlation of unknown form. See Durlauf (1991) for more
discussion.
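For concreteness, here is a minimal sketch (not from the original text) of the homoskedastic Lo-MacKinlay statistic, written in the equivalent and more common form \sqrt{n}(VR(p) - 1) divided by its asymptotic standard deviation, so it is compared with N(0,1) critical values; the input series is hypothetical and all names are illustrative:

```python
import numpy as np

def variance_ratio_stat(y, p):
    """Lo-MacKinlay variance ratio statistic under conditional homoskedasticity."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    yc = y - y.mean()
    gamma0 = np.dot(yc, yc) / n
    # sample autocorrelations rho_hat(1), ..., rho_hat(p-1)
    rho = np.array([np.dot(yc[j:], yc[:-j]) / n / gamma0 for j in range(1, p)])
    weights = 1.0 - np.arange(1, p) / p
    vr_minus_1 = 2.0 * np.dot(weights, rho)              # VR(p) - 1
    se = np.sqrt(2.0 * (2 * p - 1) * (p - 1) / (3.0 * p * n))
    return vr_minus_1 / se                               # compare with N(0,1)

# usage: z = variance_ratio_stat(returns, p=10)
```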

6.7 A Classical Cochrane-Orcutt Procedure


Long-run variance estimators are necessary for statistical inference about the OLS esti-
mator in a linear regression model when there exists serial correlation of unknown form.
If serial correlation in the regression error has a known special pattern, then simpler sta-
tistical inference procedures are possible. One example is the classical Cochrane-Orcutt
procedure. Consider a linear regression model with serially correlated errors:
    Y_t = X_t'\beta^o + \varepsilon_t,
where E(\varepsilon_t | X_t) = 0 but {\varepsilon_t} follows an AR(p) process
    \varepsilon_t = \sum_{j=1}^{p} \alpha_j \varepsilon_{t-j} + v_t,   {v_t} ~ i.i.d.(0, \sigma_v^2).
The OLS estimator \hat{\beta} is consistent for \beta^o given E(X_t\varepsilon_t) = 0, but its asymptotic variance
depends on the serial correlation in {\varepsilon_t}. We can consider the following transformed linear
regression model:
    Y_t - \sum_{j=1}^{p} \alpha_j Y_{t-j} = (X_t - \sum_{j=1}^{p} \alpha_j X_{t-j})'\beta^o + (\varepsilon_t - \sum_{j=1}^{p} \alpha_j \varepsilon_{t-j})
                                = (X_t - \sum_{j=1}^{p} \alpha_j X_{t-j})'\beta^o + v_t.

We can write it as follows:
    Y_t^* = X_t^{*'}\beta^o + v_t,
where
    Y_t^* = Y_t - \sum_{j=1}^{p} \alpha_j Y_{t-j},
    X_t^* = X_t - \sum_{j=1}^{p} \alpha_j X_{t-j}.
The OLS estimator \tilde{\beta} from this transformed regression will be consistent for \beta^o and
asymptotically normal:
    \sqrt{n}(\tilde{\beta} - \beta^o) \xrightarrow{d} N(0, \sigma_v^2 Q_{x^*x^*}^{-1}),
where Q_{x^*x^*} = E(X_t^* X_t^{*'}). Moreover, it is asymptotically BLUE. However, the OLS
estimator \tilde{\beta} is infeasible, because (Y_t^*, X_t^*) is not available due to the unknown parame-
ters {\alpha_j}_{j=1}^{p}. As a solution, one can use a feasible multi-step procedure, sketched in code after this section:

Step 1: Regress
    Y_t = X_t'\beta^o + \varepsilon_t,   t = 1, ..., n,
that is, regress Y_t on X_t, and obtain the estimated OLS residual e_t = Y_t - X_t'\hat{\beta};

Step 2: Estimate by OLS an AR(p) model
    e_t = \sum_{j=1}^{p} \alpha_j e_{t-j} + \tilde{v}_t,   t = p+1, ..., n,
and obtain the OLS estimators {\hat{\alpha}_j}_{j=1}^{p};

Step 3: Estimate by OLS the transformed model
    \hat{Y}_t^* = \hat{X}_t^{*'}\beta^o + v_t,   t = p+1, ..., n,
where \hat{Y}_t^* and \hat{X}_t^* are defined in the same way as Y_t^* and X_t^* respectively, with
{\hat{\alpha}_j}_{j=1}^{p} replacing {\alpha_j}_{j=1}^{p}. The resulting OLS estimator is denoted \tilde{\beta}_a.

It can be shown that the adaptive feasible OLS estimator \tilde{\beta}_a has the same asymptotic
properties as the infeasible OLS estimator \tilde{\beta}. In other words, the sampling error resulting
from the earlier estimation steps has no impact on the asymptotic properties of the OLS
estimator in the final step. The asymptotic variance estimator of \tilde{\beta}_a is given by
    \hat{s}_v^2 \hat{Q}_{x^*x^*}^{-1},
where
    \hat{s}_v^2 = \frac{1}{n - K} \sum_{t=p+1}^{n} \hat{v}_t^2,
    \hat{Q}_{x^*x^*} = n^{-1} \sum_{t=p+1}^{n} \hat{X}_t^* \hat{X}_t^{*'},
with \hat{v}_t = \hat{Y}_t^* - \hat{X}_t^{*'}\tilde{\beta}_a. The t-test statistic, which is asymptotically N(0,1), and
J times the F-test statistic, which is asymptotically \chi^2_J, from the last stage regression are
applicable when the sample size n is large.
The estimator \tilde{\beta}_a is essentially the adaptive feasible GLS estimator described in
Chapter 3, and it is asymptotically BLUE. This estimation method is therefore asymp-
totically more efficient than the robust inference procedures developed earlier in this chapter,
but it is based on the assumption that the disturbance {\varepsilon_t} follows an AR(p) process with a
known order p. The robust procedures are applicable when {\varepsilon_t} has conditional het-
eroskedasticity and serial correlation of unknown form.
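As a minimal sketch of the feasible procedure (not from the original text), the following illustrates the AR(1) case p = 1; the data arrays are assumed to be supplied by the user and all names are illustrative:

```python
import numpy as np

def cochrane_orcutt_ar1(X, y):
    """Feasible Cochrane-Orcutt estimation with AR(1) errors."""
    n, K = X.shape
    # Step 1: OLS and residuals
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b_ols
    # Step 2: estimate the AR(1) coefficient of the residuals
    alpha = np.dot(e[1:], e[:-1]) / np.dot(e[:-1], e[:-1])
    # Step 3: quasi-difference the data and re-run OLS
    y_star = y[1:] - alpha * y[:-1]
    X_star = X[1:] - alpha * X[:-1]
    b_co = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)
    v = y_star - X_star @ b_co
    s2_v = np.dot(v, v) / (len(v) - K)
    cov_b = s2_v * np.linalg.inv(X_star.T @ X_star)   # classical variance from the final regression
    return b_co, np.sqrt(np.diag(cov_b)), alpha
```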

6.8 Empirical Applications


6.9 Conclusion
In this chapter, we have first discussed some motivating economic examples where
a long-run variance estimator is needed. Then we discussed consistent estimation of a
long-run variance-covariance matrix by a nonparametric kernel method. The asymptotic
properties of the OLS estimator are investigated, which calls for the use of a new CLT
because {X_t\varepsilon_t} is not an MDS. Robust t- and Wald test statistics that are valid under
conditional heteroskedasticity and autocorrelation of unknown form are then derived.
When there exists serial correlation of unknown form, there is no need (and no way)
to separate the cases of conditional homoskedasticity and conditional heteroskedasticity.
Because robust t- and Wald tests have very poor finite sample performance even if
{X_t\varepsilon_t} is an MDS, it is desirable to first check whether we really need a long-run variance
estimator. We provide such a test. Finally, some empirical applications are considered.
We also introduce a classical estimation method, the Cochrane-Orcutt procedure, for the case
where it is known that the regression disturbance follows an AR process with a known order.
Long-run variances have also been widely used in nonstationary time series econo-
metrics, such as in unit root and cointegration analysis (e.g., Phillips 1987).

EXERCISES

6.1. Suppose Assumptions 6.1-6.3 and 6.5(i) hold. Show
    avar(n^{-1/2} \sum_{t=1}^{n} X_t\varepsilon_t) \equiv \lim_{n\to\infty} var(n^{-1/2} \sum_{t=1}^{n} X_t\varepsilon_t) = \sum_{j=-\infty}^{\infty} \Gamma(j).

6.2. Suppose \Gamma(j) = 0 for all j > p_0, where p_0 is a fixed lag order. An example of this
case is Example 2 in Section 6.1. In this case, the long-run variance is V = \sum_{j=-p_0}^{p_0} \Gamma(j),
and we can estimate it by using the following estimator:
    \hat{V} = \sum_{j=-p_0}^{p_0} \hat{\Gamma}(j),
where \hat{\Gamma}(j) is defined as in Section 6.2. Show that for each given j, \hat{\Gamma}(j) \xrightarrow{p} \Gamma(j) as
n \to \infty.
Given that p_0 is a fixed integer, an important implication of \hat{\Gamma}(j) \xrightarrow{p} \Gamma(j) for each
given j as n \to \infty is that \hat{V} \xrightarrow{p} V as n \to \infty.

6.3. Suppose {Y_t} is a stationary time series process for which the following spectral density
function exists:
    h(\omega) = \frac{1}{2\pi} \sum_{j=-\infty}^{\infty} \gamma(j) e^{-ij\omega}.
Show that
    p^{-1} var(\sum_{j=1}^{p} Y_{t-j}) \to 2\pi h(0)   as p \to \infty.

6.4. Suppose {Y_t} is a weakly stationary process with \gamma(j) = cov(Y_t, Y_{t-j}).

(a) Find an example of {Y_t} such that \sum_{j=1}^{\infty} \gamma(j) = 0 but there exists at least one
j > 0 such that \gamma(j) \neq 0.
(b) Can the variance ratio test detect the time series process in part (a) with
high probability?
CHAPTER 7 INSTRUMENTAL
VARIABLES REGRESSION
Abstract: In this chapter we first discuss possibilities that the condition E(\varepsilon_t | X_t) = 0
a.s. may fail, which generally renders the OLS estimator inconsistent for the true
model parameters. We then introduce a consistent two-stage least squares (2SLS) esti-
mator, investigate its statistical properties, and provide intuition for the nature of
the 2SLS estimator. Hypothesis tests are constructed. We consider various test proce-
dures corresponding to the cases for which the disturbance is an MDS with conditional
homoskedasticity, an MDS with conditional heteroskedasticity, and a non-MDS process,
respectively. The latter case requires consistent estimation of a long-run variance-
covariance matrix. It is important to emphasize that the t-test and F-test obtained
from the second stage regression estimation cannot be used even in large samples. Fi-
nally, we consider some empirical applications and conclude this chapter by presenting
a brief summary of the comprehensive econometric theory for linear regression models
developed in Chapters 2-7.

Key Words: Endogeneity, Instrumental variables, Hausman's test, 2SLS.

Motivation

In all previous chapters, we always assumed that E(\varepsilon_t | X_t) = 0 holds even when there
exist conditional heteroskedasticity and autocorrelation.

Questions: When may the condition E(\varepsilon_t | X_t) = 0 fail? And what will happen to
the OLS estimator \hat{\beta} if E(\varepsilon_t | X_t) = 0 fails?
There are at least three possibilities where E(\varepsilon_t | X_t) = 0 may fail. The first is model
misspecification (e.g., functional form misspecification or omitted variables). The second
is the existence of measurement errors in regressors (also called errors in variables). The
third is the estimation of a subset of a simultaneous equation system. We will consider
the last two possibilities in this chapter. For the first case (i.e., model misspecifica-
tion), it may not be meaningful to discuss consistent estimation of the parameters in a
misspecified regression model.

Some Motivating Examples

We first provide some examples in which E(\varepsilon_t | X_t) \neq 0.


Example 1 [Errors of Measurements or Errors in Variables]:
Often, economic data measure concepts that differ somewhat from those of economic
theory. It is therefore important to take into account errors of measurement. This is
usually called errors in variables in econometrics. Consider a data generating process
(DGP)
    Y_t^* = \beta_0^o + \beta_1^o X_t^* + u_t,                                    (7.1)
where X_t^* is the (true) income, Y_t^* is the (true) consumption, and {u_t} is i.i.d. (0, \sigma_u^2) and is inde-
pendent of {X_t^*}.
Suppose both X_t^* and Y_t^* are not observable. The observed variables X_t and Y_t contain
measurement errors in the sense that
    X_t = X_t^* + v_t,                                                            (7.2)
    Y_t = Y_t^* + w_t,                                                            (7.3)
where {v_t} and {w_t} are measurement errors independent of {X_t^*} and {Y_t^*}, such that
{v_t} ~ i.i.d. (0, \sigma_v^2) and {w_t} ~ i.i.d. (0, \sigma_w^2). We assume that the series {v_t}, {w_t} and
{u_t} are all mutually independent of each other.
Because we only observe (X_t, Y_t), we are forced to estimate the following regression
model:
    Y_t = \beta_0^o + \beta_1^o X_t + \varepsilon_t,                               (7.4)
where \varepsilon_t is some unobservable disturbance.
Clearly, the disturbance \varepsilon_t is different from the original (true) disturbance u_t. Al-
though the linear regression model is correctly specified, we no longer have E(\varepsilon_t | X_t) = 0,
due to the existence of the measurement errors. This is explained below.
Question: If we use the OLS estimator \hat{\beta} to estimate this model, is \hat{\beta} consistent for \beta^o?
From the general regression analysis in Chapter 2, we know that the key to the
consistency of the OLS estimator \hat{\beta} for \beta^o is whether E(X_t\varepsilon_t) = 0. From Eqs.
(7.1)-(7.3), we have
    Y_t = Y_t^* + w_t = (\beta_0^o + \beta_1^o X_t^* + u_t) + w_t,
    X_t = X_t^* + v_t.
Therefore, from Eq. (7.4), we obtain
    \varepsilon_t = Y_t - \beta_0^o - \beta_1^o X_t
                 = [\beta_0^o + \beta_1^o X_t^* + u_t + w_t] - \beta_0^o - \beta_1^o (X_t^* + v_t)
                 = u_t + w_t - \beta_1^o v_t.
The regression error \varepsilon_t contains the true disturbance u_t and a linear combination of the
measurement errors.

Now, the expectation
    E(X_t\varepsilon_t) = E[(X_t^* + v_t)\varepsilon_t]
                = E(X_t^*\varepsilon_t) + E(v_t\varepsilon_t)
                = 0 - \beta_1^o E(v_t^2)
                = -\beta_1^o \sigma_v^2
                \neq 0.
Consequently, by the WLLN, the OLS estimator satisfies
    \hat{\beta} - \beta^o = \hat{Q}_{xx}^{-1} n^{-1} \sum_{t=1}^{n} X_t\varepsilon_t
                 \xrightarrow{p} Q_{xx}^{-1} E(X_t\varepsilon_t)
                 = -\beta_1^o \sigma_v^2 Q_{xx}^{-1} \neq 0.
In other words, \hat{\beta} is not consistent for \beta^o, due to the existence of the measurement errors
in the regressors {X_t}.
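The inconsistency can be seen numerically. The following Monte Carlo sketch (not from the original text; the parameter values and variable names are hypothetical) shows that OLS on the mismeasured regressor is attenuated toward zero, with probability limit \beta_1^o var(X_t^*)/(var(X_t^*) + \sigma_v^2):

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta0, beta1 = 5000, 1.0, 2.0
x_star = rng.normal(0.0, 1.0, n)          # true (latent) regressor
u = rng.normal(0.0, 1.0, n)
v = rng.normal(0.0, 0.5, n)               # measurement error in X, variance 0.25
y = beta0 + beta1 * x_star + u            # no measurement error in Y in this illustration
x = x_star + v                            # observed regressor
X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b[1])                               # noticeably below beta1 = 2
print(beta1 * 1.0 / (1.0 + 0.25))         # plim of the OLS slope: 1.6
```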

Question: What is the effect of the measurement errors {w_t} in the dependent variable
Y_t?

Example 2 [Errors of Measurement in the Dependent Variable]: Now we consider
a data generating process (DGP) given by
    Y_t^* = \beta_0^o + \beta_1^o X_t^* + u_t,
where X_t^* is the income, Y_t^* is the consumption, and {u_t} is i.i.d. (0, \sigma_u^2) and is inde-
pendent of {X_t^*}.
Suppose X_t^* is now observed, while Y_t^* is still not observable, such that
    X_t = X_t^*,
    Y_t = Y_t^* + w_t,
where {w_t} is an i.i.d. (0, \sigma_w^2) sequence of measurement errors independent of {X_t^*} and {Y_t^*}. We as-
sume that the two series {w_t} and {u_t} are mutually independent.
Because we only observe (X_t, Y_t), we are forced to estimate the following model:
    Y_t = \beta_0^o + \beta_1^o X_t + \varepsilon_t.

Question: If we use the OLS estimator \hat{\beta} to estimate this model, is \hat{\beta} consistent for \beta^o?

Answer: Yes! The measurement errors in Y_t do not cause any trouble for consistent
estimation of \beta^o.
The measurement error in Y_t can be regarded as part of the true regression distur-
bance. It increases the asymptotic variance of \sqrt{n}(\hat{\beta} - \beta^o); that is, the existence of
measurement errors in Y_t makes the estimation of \beta^o less precise.
o
measurement errors in Yt renders the estimation of less precise.

Example 3 [Errors in Expectations]: Consider a linear regression model
    Y_t = \beta_0 + \beta_1 X_t^* + \varepsilon_t,
where X_t^* is the economic agent's conditional expectation of X_t formed at time t-1, and {\varepsilon_t}
is an i.i.d. (0, \sigma^2) sequence with E(\varepsilon_t | X_t^*) = 0. The conditional expectation X_t^* is a latent
variable. When the economic agent has rational expectations, then X_t^* = E(X_t | I_{t-1})
and we have
    X_t = X_t^* + v_t,
where
    E(v_t | I_{t-1}) = 0,
and I_{t-1} is the information available to the economic agent at time t-1. Assume that the
two error series {\varepsilon_t} and {v_t} are independent of each other.
We can consider the following regression model:
    Y_t = \beta_0^o + \beta_1^o X_t + u_t,
where the error term is
    u_t = \varepsilon_t - \beta_1^o v_t.
Since
    E(X_t u_t) = E[(X_t^* + v_t)(\varepsilon_t - \beta_1^o v_t)] = -\beta_1^o \sigma_v^2 \neq 0
provided \beta_1^o \neq 0, the OLS estimator is not consistent for \beta_1^o.

Example 4 [Endogeneity due to Omitted Variables]: Consider an earnings data
generating process
    Y_t = X_t'\beta^o + A_t + u_t,
where Y_t is earnings, X_t is a vector consisting of working experience and schooling,
A_t is ability, which is unobservable, and the disturbance u_t satisfies the condition
that E(u_t | X_t, A_t) = 0. Because one does not observe A_t, one is forced to consider the
regression model
    Y_t = X_t'\beta^o + \varepsilon_t
and is interested in knowing \beta^o, the marginal effect of schooling and working experience.
However, we have E(X_t\varepsilon_t) \neq 0 because A_t is usually correlated with X_t.

Example 5 [Production-Bonus Causality; Groves, Hong, McMillan and Naughton
1994]: Consider a production function data generating process
    ln(Y_t) = \beta_0^o + \beta_1^o ln(L_t) + \beta_2^o ln(K_t) + \beta_3^o B_t + \varepsilon_t,
where Y_t, L_t, K_t are the output, labor and capital stock, B_t is the proportion of bonus
in total pay, and t is a time index. Without loss of generality, we assume that
    E(\varepsilon_t) = 0,
    E[ln(L_t)\varepsilon_t] = 0,
    E[ln(K_t)\varepsilon_t] = 0.
Economic theory suggests that the use of bonuses in addition to the basic wage provides
a stronger incentive for workers to work harder in a transitional economy. This theory
can be tested by checking whether \beta_3^o = 0. However, the test procedure is complicated because
there exists a possibility that when a firm is more productive, it will pay more bonus to
workers regardless of the effort of its workers. In this case, the OLS estimator \hat{\beta}_3 cannot
consistently estimate \beta_3^o and cannot be used to test the null hypothesis.

Why?
To reflect the fact that a more productive firm pays more bonus to its workers, we
can assume a structural equation for bonus:
    B_t = \alpha_0^o + \alpha_1^o ln(Y_t) + w_t,                                   (7.5)
where \alpha_1^o > 0, and {w_t} is an i.i.d. (0, \sigma_w^2) sequence that is independent of {Y_t}. For
simplicity, we assume that {w_t} is independent of {\varepsilon_t}.
Put X_t = [1, ln(L_t), ln(K_t), B_t]'. Now, from Eq. (7.5) and then the production function equation, we have
    E(B_t\varepsilon_t) = E[(\alpha_0^o + \alpha_1^o ln(Y_t) + w_t)\varepsilon_t]
                = \alpha_1^o E[ln(Y_t)\varepsilon_t]
                = \alpha_1^o \beta_3^o E(B_t\varepsilon_t) + \alpha_1^o E(\varepsilon_t^2).
It follows that
    E(B_t\varepsilon_t) = \frac{\alpha_1^o \sigma^2}{1 - \alpha_1^o \beta_3^o} \neq 0,
where \sigma^2 = var(\varepsilon_t). Consequently, the OLS estimator \hat{\beta}_3 is inconsistent for \beta_3^o, due to
the existence of the causality from productivity ln(Y_t) to bonus B_t.
The bias of the OLS estimator for \beta_3^o in the above model is usually called the si-
multaneous equation bias because it arises from the fact that the productivity function is
but one of two relationships that hold simultaneously. This is a common phenomenon in
economics. It is the rule rather than the exception for economic relationships to be
embedded in a simultaneous system of equations. We now consider two more examples
with simultaneous equation bias.
Example 6 [Simultaneous Equation Bias]: We consider the following simple
model of national income determination:
    C_t = \beta_0^o + \beta_1^o I_t + \varepsilon_t,                              (7.6)
    I_t = C_t + D_t,                                                             (7.7)
where I_t is income, C_t is consumption expenditure, and D_t is non-consumption
expenditure. The variables I_t and C_t are called endogenous variables, as they are
determined within the two-equation model. The variable D_t is called an exogenous variable,
because it is determined outside the model (or the system considered). We assume that
{D_t} and {\varepsilon_t} are mutually independent, and {\varepsilon_t} is i.i.d. (0, \sigma^2).

Question: If the OLS estimator \hat{\beta} is applied to the first equation, is it consistent for
\beta^o?
To answer this question, we have from Eq. (7.7)
    E(I_t\varepsilon_t) = E[(C_t + D_t)\varepsilon_t]
                = E(C_t\varepsilon_t) + E(D_t\varepsilon_t)
                = \beta_1^o E(I_t\varepsilon_t) + E(\varepsilon_t^2) + 0.
It follows that
    E(I_t\varepsilon_t) = \frac{\sigma^2}{1 - \beta_1^o} \neq 0.
Thus, \hat{\beta} is not consistent for \beta^o.

In fact, this bias problem can also be seen from the so-called reduced form model.

Question: What is the reduced form?

Solving Eqs. (7.6) and (7.7) simultaneously, we can obtain the "reduced forms"
that express the endogenous variables in terms of the exogenous variable and the disturbance:
    C_t = \frac{\beta_0^o}{1 - \beta_1^o} + \frac{\beta_1^o}{1 - \beta_1^o} D_t + \frac{1}{1 - \beta_1^o} \varepsilon_t,
    I_t = \frac{\beta_0^o}{1 - \beta_1^o} + \frac{1}{1 - \beta_1^o} D_t + \frac{1}{1 - \beta_1^o} \varepsilon_t.
Obviously, I_t is positively correlated with \varepsilon_t (i.e., E(I_t\varepsilon_t) \neq 0). Thus, the OLS estimator
from the regression of C_t on I_t in Eq. (7.6) will not be consistent for \beta_1^o, the marginal
propensity to consume. Generally speaking, the OLS estimator for the reduced
form is consistent for new parameters, which are functions of the original parameters.
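A short simulation (not from the original text; parameter values and names are hypothetical) makes the simultaneity bias visible and shows that the 2SLS estimator introduced later in this chapter, with instruments (1, D_t), recovers the structural slope:

```python
import numpy as np

rng = np.random.default_rng(3)
n, b0, b1 = 20000, 1.0, 0.6
D = rng.normal(0.0, 1.0, n)                  # exogenous non-consumption expenditure
eps = rng.normal(0.0, 1.0, n)
C = (b0 + b1 * D + eps) / (1.0 - b1)         # reduced form for C_t
I = C + D                                    # I_t = C_t + D_t

X = np.column_stack([np.ones(n), I])
b_ols = np.linalg.solve(X.T @ X, X.T @ C)    # OLS of C on I: slope biased away from b1
Z = np.column_stack([np.ones(n), D])         # instruments: constant and D_t
A = X.T @ Z @ np.linalg.inv(Z.T @ Z) @ Z.T
b_2sls = np.linalg.solve(A @ X, A @ C)       # 2SLS: approximately recovers b1
print(b_ols[1], b_2sls[1])
```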

Example 7 [Wage-Price Spiral Model]: Consider the system of equations
    W_t = \beta_0^o + \beta_1^o P_t + \beta_2^o D_t + \varepsilon_t,              (7.8)
    P_t = \alpha_0^o + \alpha_1^o W_t + v_t,                                      (7.9)
where W_t, P_t, D_t are the wage, the price, and the excess demand in the labor market, respectively.
Eq. (7.8) describes the mechanism by which the wage is determined. In particular, the wage
depends on the price and the excess demand for labor. Eq. (7.9) describes how the price depends
on the wage (or income).
Suppose D_t is an exogenous variable, with E(\varepsilon_t | D_t) = 0. There are two endogenous
variables, W_t and P_t, in the system of equations (7.8) and (7.9).
Question: Will W_t be correlated with v_t? And will P_t be correlated with \varepsilon_t?
To answer these questions, we first obtain the reduced form equations:
    W_t = \frac{\beta_0^o + \beta_1^o \alpha_0^o}{1 - \beta_1^o \alpha_1^o} + \frac{\beta_2^o}{1 - \beta_1^o \alpha_1^o} D_t + \frac{\varepsilon_t + \beta_1^o v_t}{1 - \beta_1^o \alpha_1^o},
    P_t = \frac{\alpha_0^o + \alpha_1^o \beta_0^o}{1 - \beta_1^o \alpha_1^o} + \frac{\alpha_1^o \beta_2^o}{1 - \beta_1^o \alpha_1^o} D_t + \frac{\alpha_1^o \varepsilon_t + v_t}{1 - \beta_1^o \alpha_1^o}.
Conditional on the exogenous variable D_t, both W_t and P_t are correlated with \varepsilon_t and v_t.
As a consequence, both the OLS estimator for \beta_1^o in Eq. (7.8) and the OLS estimator
for \alpha_1^o in Eq. (7.9) will be inconsistent.
In this chapter, we will consider a method called two-stage least squares estima-
tion to obtain consistent estimators for the unknown parameters in all the above examples
except for the parameter \beta_2^o in Eq. (7.8) of Example 7. No method can deliver a con-
sistent estimator for \beta_2^o in Eq. (7.8) because it is not identifiable. This is the so-called
identification problem of simultaneous equations.
A Digression: Identification Problem in Simultaneous Equation Models

To see why there is no way to obtain a consistent estimator for \beta_2^o in Eq. (7.8), from
Eq. (7.9) we can write
    W_t = -\frac{\alpha_0^o}{\alpha_1^o} + \frac{1}{\alpha_1^o} P_t - \frac{v_t}{\alpha_1^o}.                     (7.10)
Let a and b be two arbitrary constants. We multiply Eq. (7.8) by a, multiply Eq.
(7.10) by b, and add them together:
    (a + b) W_t = a\beta_0^o - \frac{b\alpha_0^o}{\alpha_1^o} + \left( a\beta_1^o + \frac{b}{\alpha_1^o} \right) P_t + a\beta_2^o D_t + \left( a\varepsilon_t - \frac{b}{\alpha_1^o} v_t \right),
or
    W_t = \frac{a\beta_0^o - b\alpha_0^o/\alpha_1^o}{a + b} + \frac{a\beta_1^o + b/\alpha_1^o}{a + b} P_t + \frac{a\beta_2^o}{a + b} D_t + \frac{a\varepsilon_t - b v_t/\alpha_1^o}{a + b}.     (7.11)
This new equation, (7.11), is a combination of the original wage equation (7.8) and the
price equation (7.9). It is of the same statistical form as Eq. (7.8). Since a and b are
arbitrary, there is an infinite number of parameter values that can satisfy Eq. (7.11), and they
are all indistinguishable from those of Eq. (7.8). Consequently, if we use OLS to run a regression
of W_t on P_t and D_t, or more generally use any other method to estimate equation
(7.8) or (7.11), there is no way to know which model, Eq. (7.8) or Eq. (7.11),
is being estimated. Therefore, there is no way to estimate \beta_2^o. This is the so-called
identification problem with simultaneous equation models. To avoid such identification
problems in simultaneous equations, certain conditions are required to make the system
of simultaneous equations identifiable. For example, if an extra variable, say the money
supply growth rate M_t, is added to the price equation in (7.9), we obtain
    P_t = \alpha_0^o + \alpha_1^o W_t + \alpha_2^o M_t + v_t,                     (7.12)
and the system of equations (7.8) and (7.12) is identifiable provided \alpha_2^o \neq 0, so the
parameters in Eqs. (7.8) and (7.12) can be consistently estimated. [Question: Check
why the system of equations (7.8) and (7.12) is identifiable.]

We note that for the system of equations (7.8) and (7.9), although Eq. (7.8) cannot be
consistently estimated by any method, Eq. (7.9) can still be consistently estimated using
the method proposed below. For an identifiable system of simultaneous equations with
simultaneous equation bias, we can use various methods to estimate the equations consistently,
including 2SLS, the generalized method of moments, and the maximum likelihood or
quasi-maximum likelihood estimation methods. These methods will be introduced below
and in subsequent chapters.

7.1 Framework and Assumptions
We now provide a set of regularity conditions for our formal analysis in this chapter.

Assumption 7.1 [Ergodic Stationarity]: {(Y_t, X_t', Z_t')'}_{t=1}^{n} is an ergodic stationary
stochastic process, where X_t is a K x 1 vector, Z_t is an l x 1 vector, and l \geq K.

Assumption 7.2 [Linearity]:
    Y_t = X_t'\beta^o + \varepsilon_t,   t = 1, ..., n,
for some unknown parameter \beta^o and some unobservable disturbance \varepsilon_t.

Assumption 7.3 [Nonsingularity]: The K x K matrix
    Q_{xx} = E(X_t X_t')
is nonsingular and finite.

Assumption 7.4 [IV Conditions]:

(i) E(X_t\varepsilon_t) \neq 0;
(ii) E(Z_t\varepsilon_t) = 0;
(iii) the l x l matrix
    Q_{zz} = E(Z_t Z_t')
is finite and nonsingular, and the l x K matrix
    Q_{zx} = E(Z_t X_t')
is finite and of full rank.

Assumption 7.5 [CLT]: n^{-1/2} \sum_{t=1}^{n} Z_t\varepsilon_t \xrightarrow{d} N(0, V) for some l x l symmetric matrix
V \equiv avar(n^{-1/2} \sum_{t=1}^{n} Z_t\varepsilon_t) that is finite and nonsingular.

Remarks:
Assumption 7.1 allows for i.i.d. and stationary time series observations.

Assumption 7.5 directly assumes that the CLT holds. This is often called a "high level
assumption." It covers three cases: IID, MDS and non-MDS processes for {Z_t\varepsilon_t}, respectively. For
an IID or MDS sequence {Z_t\varepsilon_t}, we have V = var(Z_t\varepsilon_t) = E(Z_t Z_t' \varepsilon_t^2). For a non-MDS
process {Z_t\varepsilon_t}, V = \sum_{j=-\infty}^{\infty} cov(Z_t\varepsilon_t, Z_{t-j}\varepsilon_{t-j}) is a long-run variance-covariance matrix.
The random vector Z_t that satisfies Assumption 7.4 is called a vector of instruments. The
condition l \geq K in Assumption 7.1 implies that the number of instruments in Z_t is
larger than or at least equal to the number of regressors in X_t.

Question: Why is the condition l \geq K required?

Question: How can we choose instruments Z_t in practice?

First of all, one should analyze which explanatory variables in X_t are endogenous
or exogenous. If an explanatory variable is exogenous, then this variable should be
included in Z_t, the set of instruments. For example, the constant term should always
be included, because a constant is uncorrelated with any random variable. All other
exogenous variables in X_t should also be included in Z_t. If k_0 of the K regressors
are endogenous, one should find at least k_0 additional instruments.
Most importantly, we should choose an instrument vector Z_t that is as closely related
to X_t as possible. As we will see below, the strength of the correlation between
Z_t and X_t affects the magnitude of the asymptotic variance of the 2SLS estimator for
\beta^o that we will propose, although it does not affect consistency provided the
correlation between Z_t and X_t is not zero.
In time series regression models, it is often reasonable to assume that lagged values
of X_t are not correlated with \varepsilon_t. Therefore, we can use lagged values of X_t, for example
X_{t-1}, as instruments. Such an instrument is expected to be highly correlated with X_t
if {X_t} is a time series process. In light of this, we can choose the set of instruments
Z_t = (1, ln L_t, ln K_t, B_{t-1})' in estimating the production function equation in Example 5,
choose Z_t = (1, D_t, I_{t-1})' in estimating Eq. (7.6) in Example 6, and choose Z_t = (1, D_t, P_{t-1})'
in estimating Eq. (7.8) in Example 7. For examples with measurement errors or expectational errors, where
E(X_t\varepsilon_t) \neq 0 due to the presence of measurement errors or expectational errors, we can
choose Z_t = X_{t-1} if the measurement errors or expectational errors in X_t are serially un-
correlated (check this!). The expectational errors in X_t are an MDS and so are serially uncorrelated
in Example 3 when the economic agent has rational expectations.

7.2 Two-Stage Least Squares (2SLS) Estimation


Question: Because E(\varepsilon_t | X_t) \neq 0, the OLS estimator \hat{\beta} is not consistent for \beta^o. How
can we obtain consistent estimators for \beta^o in situations similar to the examples described
above?

We now introduce a two-stage least squares (2SLS) procedure, which can consistently
estimate the true parameter \beta^o. The 2SLS procedure can be described as follows:

Stage 1: Regress X_t on Z_t via OLS and save the predicted value \hat{X}_t.

Here, the artificial linear regression model is
    X_t = \gamma' Z_t + v_t,   t = 1, ..., n,
where \gamma is an l x K parameter matrix and v_t is a K x 1 regression error. From the
results in Chapter 2, we have E(Z_t v_t') = 0 if and only if \gamma is the best LS approximation
coefficient, i.e., if and only if
    \gamma = [E(Z_t Z_t')]^{-1} E(Z_t X_t').

In matrix form, we can write
    X = Z\gamma + v,
where X is an n x K matrix, Z is an n x l matrix, \gamma is an l x K matrix, and v is an n x K
matrix.
The OLS estimator of \gamma is
    \hat{\gamma} = (Z'Z)^{-1} Z'X = \left( n^{-1} \sum_{t=1}^{n} Z_t Z_t' \right)^{-1} n^{-1} \sum_{t=1}^{n} Z_t X_t'.
The predicted value, or the sample projection of X_t on Z_t, is
    \hat{X}_t = \hat{\gamma}' Z_t,
or in matrix form
    \hat{X} = Z\hat{\gamma} = Z(Z'Z)^{-1} Z'X.

Stage 2: Use the predicted value \hat{X}_t as the regressor vector for Y_t. Regress Y_t on \hat{X}_t; the
resulting OLS estimator is called the 2SLS estimator, denoted \hat{\beta}_{2sls}.

Question: Why use the fitted value \hat{X}_t = \hat{\gamma}' Z_t as regressors?

We first consider
    X_t = \gamma' Z_t + v_t,
where \gamma is the best linear LS approximation coefficient, so that v_t is orthogonal to Z_t in
the sense that E(Z_t v_t') = 0. Because E(Z_t\varepsilon_t) = 0, the population projection \gamma' Z_t is orthogonal
to \varepsilon_t. In general, v_t = X_t - \gamma' Z_t, which is orthogonal to Z_t, is correlated with \varepsilon_t. In other
words, the auxiliary regression in Stage 1 decomposes X_t into two components, \gamma' Z_t and
v_t, where \gamma' Z_t is orthogonal to \varepsilon_t and v_t is correlated with \varepsilon_t.

Since the best linear LS approximation coefficient \gamma is unknown, we have to replace
it with \hat{\gamma}. The fitted value \hat{X}_t = \hat{\gamma}' Z_t is the sample projection of X_t onto Z_t. The re-
gression of X_t on Z_t purges the component of X_t that is correlated with \varepsilon_t, so that the
projection \hat{X}_t is approximately orthogonal to \varepsilon_t given that Z_t is orthogonal to \varepsilon_t. (The
word "approximately" is used here because \hat{\gamma} is an estimator of \gamma and thus contains
some estimation error.)

The regression model in the second stage can be written as
    Y_t = \hat{X}_t'\beta^o + \tilde{\varepsilon}_t,
or in matrix form
    Y = \hat{X}\beta^o + \tilde{\varepsilon}.
Note that the disturbance \tilde{\varepsilon}_t is not \varepsilon_t because \hat{X}_t is not X_t.
Using \hat{X} = Z\hat{\gamma} = Z(Z'Z)^{-1}Z'X, we can write the second stage OLS estimator,
namely the 2SLS estimator, as follows:
    \hat{\beta}_{2sls} = (\hat{X}'\hat{X})^{-1} \hat{X}'Y
        = [(Z\hat{\gamma})'(Z\hat{\gamma})]^{-1} (Z\hat{\gamma})'Y
        = \{[Z(Z'Z)^{-1}Z'X]'[Z(Z'Z)^{-1}Z'X]\}^{-1} [Z(Z'Z)^{-1}Z'X]'Y
        = [X'Z(Z'Z)^{-1}Z'Z(Z'Z)^{-1}Z'X]^{-1} X'Z(Z'Z)^{-1}Z'Y
        = [X'Z(Z'Z)^{-1}Z'X]^{-1} X'Z(Z'Z)^{-1}Z'Y
        = \left[ \frac{X'Z}{n} \left( \frac{Z'Z}{n} \right)^{-1} \frac{Z'X}{n} \right]^{-1} \frac{X'Z}{n} \left( \frac{Z'Z}{n} \right)^{-1} \frac{Z'Y}{n}.
Using the expression Y = X\beta^o + \varepsilon from Assumption 7.2, we have
    \hat{\beta}_{2sls} - \beta^o = \left[ \frac{X'Z}{n} \left( \frac{Z'Z}{n} \right)^{-1} \frac{Z'X}{n} \right]^{-1} \frac{X'Z}{n} \left( \frac{Z'Z}{n} \right)^{-1} \frac{Z'\varepsilon}{n}
        = \left[ \hat{Q}_{xz} \hat{Q}_{zz}^{-1} \hat{Q}_{zx} \right]^{-1} \hat{Q}_{xz} \hat{Q}_{zz}^{-1} \frac{Z'\varepsilon}{n},
where
    \hat{Q}_{zz} = \frac{Z'Z}{n} = n^{-1} \sum_{t=1}^{n} Z_t Z_t',
    \hat{Q}_{xz} = \frac{X'Z}{n} = n^{-1} \sum_{t=1}^{n} X_t Z_t',
    \hat{Q}_{zx} = \frac{Z'X}{n} = n^{-1} \sum_{t=1}^{n} Z_t X_t' = \hat{Q}_{xz}'.
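As a compact sketch of the closed-form 2SLS estimator above (not from the original text; all names are illustrative and the data arrays are assumed to be supplied by the user):

```python
import numpy as np

def two_stage_least_squares(y, X, Z):
    """2SLS: beta_hat = [X'Z (Z'Z)^{-1} Z'X]^{-1} X'Z (Z'Z)^{-1} Z'y."""
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    A = (X.T @ Z) @ ZtZ_inv @ Z.T          # X'Z(Z'Z)^{-1}Z'
    beta_hat = np.linalg.solve(A @ X, A @ y)
    X_hat = Z @ ZtZ_inv @ (Z.T @ X)        # first-stage fitted values (projection of X on Z)
    return beta_hat, X_hat

# usage sketch: y is (n,), X is (n, K) with an endogenous column, Z is (n, l) with l >= K
# beta_2sls, X_hat = two_stage_least_squares(y, X, Z)
```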
Question: What are the statistical properties of \hat{\beta}_{2sls}?

7.3 Consistency of 2SLS


By the WLLN for stationary ergodic processes, we have
    \hat{Q}_{zz} \xrightarrow{p} Q_{zz}   (l x l),
    \hat{Q}_{xz} \xrightarrow{p} Q_{xz}   (K x l),
    \frac{Z'\varepsilon}{n} \xrightarrow{p} E(Z_t\varepsilon_t) = 0   (l x 1).
Also, Q_{xz} Q_{zz}^{-1} Q_{zx} is a K x K symmetric and nonsingular matrix because Q_{xz} is of full
rank, Q_{zz} is nonsingular, and l \geq K. It follows from continuity that
    [\hat{Q}_{xz} \hat{Q}_{zz}^{-1} \hat{Q}_{zx}]^{-1} \xrightarrow{p} [Q_{xz} Q_{zz}^{-1} Q_{zx}]^{-1}.
Consequently, we have
    \hat{\beta}_{2sls} - \beta^o \xrightarrow{p} [Q_{xz} Q_{zz}^{-1} Q_{zx}]^{-1} Q_{xz} Q_{zz}^{-1} \cdot 0 = 0.
We now state this consistency result in the following theorem.

Theorem 7.1 [Consistency of 2SLS]: Under Assumptions 7.1-7.4, as n \to \infty,
    \hat{\beta}_{2sls} \xrightarrow{p} \beta^o.

To provide intuition for why the 2SLS estimator \hat{\beta}_{2sls} is consistent for \beta^o, we consider
    Y_t = X_t'\beta^o + \varepsilon_t.
The OLS estimator \hat{\beta} is not consistent for \beta^o because E(X_t\varepsilon_t) \neq 0. Suppose we decom-
pose the regressor X_t into two terms:
    X_t = \tilde{X}_t + v_t,
where \tilde{X}_t = \gamma' Z_t is the projection of X_t on Z_t and so is orthogonal to \varepsilon_t. The other
component, v_t = X_t - \tilde{X}_t, is generally correlated with \varepsilon_t. Then consistent estimation of
\beta^o would be possible if we observed \tilde{X}_t and ran the following regression:
    Y_t = X_t'\beta^o + \varepsilon_t = \tilde{X}_t'\beta^o + (v_t'\beta^o + \varepsilon_t) = \tilde{X}_t'\beta^o + u_t,
where u_t = v_t'\beta^o + \varepsilon_t is the disturbance when regressing Y_t on \tilde{X}_t. Because
    E(\tilde{X}_t u_t) = \gamma' E(Z_t u_t) = \gamma' E(Z_t v_t')\beta^o + \gamma' E(Z_t\varepsilon_t) = 0,
the OLS estimator from regressing Y_t on \tilde{X}_t would be consistent for \beta^o.

However, \tilde{X}_t = \gamma' Z_t is not observable, so we need to use a proxy, i.e., \hat{X}_t = \hat{\gamma}' Z_t,
where \hat{\gamma} is the OLS estimator from regressing X_t on Z_t. This results in the 2SLS estimator
\hat{\beta}_{2sls}. The estimation error of \hat{\gamma} does not affect the consistency of the 2SLS estimator \hat{\beta}_{2sls}.

7.4 Asymptotic Normality of 2SLS


We now derive the asymptotic distribution of \hat{\beta}_{2sls}. Write
    \sqrt{n}(\hat{\beta}_{2sls} - \beta^o) = [\hat{Q}_{xz} \hat{Q}_{zz}^{-1} \hat{Q}_{zx}]^{-1} \hat{Q}_{xz} \hat{Q}_{zz}^{-1} \frac{Z'\varepsilon}{\sqrt{n}} = \hat{A} \frac{Z'\varepsilon}{\sqrt{n}},
where the K x l matrix
    \hat{A} = [\hat{Q}_{xz} \hat{Q}_{zz}^{-1} \hat{Q}_{zx}]^{-1} \hat{Q}_{xz} \hat{Q}_{zz}^{-1}.
By the CLT assumption (Assumption 7.5), we have
    \frac{Z'\varepsilon}{\sqrt{n}} = n^{-1/2} \sum_{t=1}^{n} Z_t\varepsilon_t \xrightarrow{d} N(0, V) \equiv G,
where V is a finite and nonsingular l x l matrix and we denote the limiting random vector by
G ~ N(0, V). Then by the Slutsky theorem, we have
    \sqrt{n}(\hat{\beta}_{2sls} - \beta^o) \xrightarrow{d} [Q_{xz} Q_{zz}^{-1} Q_{zx}]^{-1} Q_{xz} Q_{zz}^{-1} G ~ N(0, A V A') \equiv N(0, \Omega),
where A = (Q_{xz} Q_{zz}^{-1} Q_{zx})^{-1} Q_{xz} Q_{zz}^{-1}. The asymptotic variance of \sqrt{n}(\hat{\beta}_{2sls} - \beta^o) is
    avar(\sqrt{n}\,\hat{\beta}_{2sls}) = \Omega = A V A'
        = [Q_{xz} Q_{zz}^{-1} Q_{zx}]^{-1} Q_{xz} Q_{zz}^{-1} V Q_{zz}^{-1} Q_{zx} [Q_{xz} Q_{zz}^{-1} Q_{zx}]^{-1}.

Theorem 7.2 [Asymptotic Normality of 2SLS]: Under Assumptions 7.1-7.5, as
n \to \infty,
    \sqrt{n}(\hat{\beta}_{2sls} - \beta^o) \xrightarrow{d} N(0, \Omega).
The estimation of V depends on whether {Z_t\varepsilon_t} is an MDS. We first consider the
case where {Z_t\varepsilon_t} is an MDS process. In this case, V = E(Z_t Z_t' \varepsilon_t^2) and so we need not
estimate a long-run variance-covariance matrix.

Case I: {Z_t\varepsilon_t} is a Stationary Ergodic MDS

Assumption 7.6 [MDS]: (i) {Z_t\varepsilon_t} is an MDS; (ii) var(Z_t\varepsilon_t) = E(Z_t Z_t' \varepsilon_t^2) is finite
and nonsingular.

Corollary 7.3: Under Assumptions 7.1-7.4 and 7.6, we have as n \to \infty,
    \sqrt{n}(\hat{\beta}_{2sls} - \beta^o) \xrightarrow{d} N(0, \Omega),
where \Omega is defined as above with V = E(Z_t Z_t' \varepsilon_t^2).

There is no need to estimate a long-run variance-covariance matrix, but consistent
estimation of the heteroskedasticity-robust variance-covariance matrix V is still required.

When {Z_t\varepsilon_t} is an MDS with conditional homoskedasticity, the asymptotic variance
can be greatly simplified.

Special Case: Conditional Homoskedasticity

Assumption 7.7 [Conditional Homoskedasticity]: E(\varepsilon_t^2 | Z_t) = \sigma^2 a.s.

Note that the conditional expectation in Assumption 7.7 is conditional on Z_t, not on
X_t.
Under this assumption, by the law of iterated expectations, we obtain
    V = E(Z_t Z_t' \varepsilon_t^2) = E[Z_t Z_t' E(\varepsilon_t^2 | Z_t)] = \sigma^2 E(Z_t Z_t') = \sigma^2 Q_{zz}.
It follows that
    \Omega = (Q_{xz} Q_{zz}^{-1} Q_{zx})^{-1} Q_{xz} Q_{zz}^{-1} \sigma^2 Q_{zz} Q_{zz}^{-1} Q_{zx} (Q_{xz} Q_{zz}^{-1} Q_{zx})^{-1}
        = \sigma^2 (Q_{xz} Q_{zz}^{-1} Q_{zx})^{-1}.
Corollary 7.4 [Asymptotic Normality of 2SLS under MDS with Conditional
Homoskedasticity]: Under Assumptions 7.1-7.4, 7.6 and 7.7, we have as n \to \infty,
    \sqrt{n}(\hat{\beta}_{2sls} - \beta^o) \xrightarrow{d} N(0, \Omega),
where
    \Omega = \sigma^2 [Q_{xz} Q_{zz}^{-1} Q_{zx}]^{-1}.
Case II: {Z_t\varepsilon_t} is a Stationary Ergodic non-MDS

In this general case, we have
    V \equiv avar\left( n^{-1/2} \sum_{t=1}^{n} Z_t\varepsilon_t \right) = \sum_{j=-\infty}^{\infty} \Gamma(j),
where \Gamma(j) = cov(Z_t\varepsilon_t, Z_{t-j}\varepsilon_{t-j}). We need to use a long-run variance-covariance matrix
estimator for V. When {Z_t\varepsilon_t} is not an MDS, there is no need (and in fact there is no way)
to consider conditional homoskedasticity and conditional heteroskedasticity separately.

7.5 Interpretation and Estimation of the 2SLS Asymptotic Variance
The asymptotic variance of \hat{\beta}_{2sls} is so complicated that it is highly desirable
to find an interpretation that helps us understand its structure. What is the nature of
\hat{\beta}_{2sls}? What is \Omega?

Let us revisit the second stage regression model
    Y_t = \hat{X}_t'\beta^o + \tilde{\varepsilon}_t,
where the regressor
    \hat{X}_t = \hat{\gamma}' Z_t
is the sample projection of X_t on Z_t, and the disturbance is \tilde{\varepsilon}_t = Y_t - \hat{X}_t'\beta^o. Note that
\tilde{\varepsilon}_t \neq \varepsilon_t because \hat{X}_t \neq X_t. Given Y_t = X_t'\beta^o + \varepsilon_t from Assumption 7.2, we have
    \tilde{\varepsilon}_t = Y_t - \hat{X}_t'\beta^o = \varepsilon_t + (X_t - \hat{X}_t)'\beta^o = \varepsilon_t + \hat{v}_t'\beta^o,
where \varepsilon_t is the true disturbance and \hat{v}_t \equiv X_t - \hat{X}_t = X_t - \hat{\gamma}' Z_t. Since \hat{v}_t is the estimated
residual from the first stage OLS regression
    X = Z\gamma + v,
the following FOC holds:
    Z'(X - \hat{X}) = Z'\hat{v} = 0.
It follows that the 2SLS estimator satisfies
    \hat{\beta}_{2sls} = (\hat{X}'\hat{X})^{-1} \hat{X}'Y
        = (\hat{X}'\hat{X})^{-1} \hat{X}'(\hat{X}\beta^o + \tilde{\varepsilon})
        = \beta^o + (\hat{X}'\hat{X})^{-1} \hat{X}'[\varepsilon + \hat{v}\beta^o]
        = \beta^o + (\hat{X}'\hat{X})^{-1} \hat{X}'\varepsilon
because \hat{X}'\hat{v} = 0 (why?). Therefore, the asymptotic properties of \hat{\beta}_{2sls} are determined
by
    \hat{\beta}_{2sls} - \beta^o = (\hat{X}'\hat{X})^{-1} \hat{X}'\varepsilon = \left( \frac{\hat{X}'\hat{X}}{n} \right)^{-1} \frac{\hat{X}'\varepsilon}{n}.
In other words, the estimated residual \hat{v} = X - \hat{X} from the first stage regression has no
impact on the statistical properties of \hat{\beta}_{2sls}, although it is a component of \tilde{\varepsilon}_t. Thus, when
analyzing the asymptotic properties of \hat{\beta}_{2sls}, we can proceed as if we were estimating
Y = \hat{X}\beta^o + \varepsilon by OLS.
Next, recall that we have
    \hat{X} = Z\hat{\gamma},   \hat{\gamma} = (Z'Z)^{-1} Z'X \xrightarrow{p} Q_{zz}^{-1} Q_{zx} = \gamma.
By the WLLN, the sample projection \hat{X}_t "converges" to the population projection \tilde{X}_t \equiv
\gamma' Z_t as n \to \infty. That is, \hat{X}_t will become arbitrarily close to \tilde{X}_t as n \to \infty. In fact, the
estimation error of \hat{\gamma} in the first stage has no impact on the asymptotic properties of
\hat{\beta}_{2sls}.

Thus, we can consider the following artificial regression model:
    Y_t = \tilde{X}_t'\beta^o + \varepsilon_t,                                     (7.13)
whose infeasible OLS estimator is
    \tilde{\beta} = (\tilde{X}'\tilde{X})^{-1} \tilde{X}'Y.
As we will show below, the asymptotic properties of \hat{\beta}_{2sls} are the same as those of the
infeasible OLS estimator \tilde{\beta}. This helps a lot in understanding the variance-covariance
structure of \hat{\beta}_{2sls}. It is important to emphasize that the equation in (7.13) is not derived
from the other equations. It is just a convenient device for understanding the nature of \hat{\beta}_{2sls}.

We now show that the asymptotic properties of \hat{\beta}_{2sls} are the same as the asymptotic
properties of \tilde{\beta}. For the asymptotic normality, observe that
    \sqrt{n}(\tilde{\beta} - \beta^o) = \hat{Q}_{\tilde{x}\tilde{x}}^{-1} \frac{\tilde{X}'\varepsilon}{\sqrt{n}}
        \xrightarrow{d} Q_{\tilde{x}\tilde{x}}^{-1} N(0, \tilde{V}) ~ N(0, Q_{\tilde{x}\tilde{x}}^{-1} \tilde{V} Q_{\tilde{x}\tilde{x}}^{-1}),
using the asymptotic theory in Chapters 5 and 6, where
    Q_{\tilde{x}\tilde{x}} \equiv E(\tilde{X}_t \tilde{X}_t'),
    \tilde{V} \equiv avar\left( n^{-1/2} \sum_{t=1}^{n} \tilde{X}_t \varepsilon_t \right).

We first consider the case where {Z_t\varepsilon_t} is an MDS with conditional homoskedasticity.

Case I: MDS with Conditional Homoskedasticity

Suppose {\tilde{X}_t\varepsilon_t} is an MDS and E(\varepsilon_t^2 | \tilde{X}_t) = \sigma^2 a.s. Then we have
    \tilde{V} = E(\tilde{X}_t \tilde{X}_t' \varepsilon_t^2) = \sigma^2 Q_{\tilde{x}\tilde{x}}
by the law of iterated expectations (LIE). It follows that
    \sqrt{n}(\tilde{\beta} - \beta^o) \xrightarrow{d} N(0, \sigma^2 Q_{\tilde{x}\tilde{x}}^{-1}).
Because \tilde{X}_t = \gamma' Z_t with \gamma = Q_{zz}^{-1} Q_{zx}, we have
    Q_{\tilde{x}\tilde{x}} = E(\tilde{X}_t \tilde{X}_t') = \gamma' E(Z_t Z_t') \gamma = \gamma' Q_{zz} \gamma
        = Q_{xz} Q_{zz}^{-1} Q_{zz} Q_{zz}^{-1} Q_{zx} = Q_{xz} Q_{zz}^{-1} Q_{zx}.
Therefore,
    \sigma^2 Q_{\tilde{x}\tilde{x}}^{-1} = \sigma^2 [Q_{xz} Q_{zz}^{-1} Q_{zx}]^{-1} = avar(\sqrt{n}\,\hat{\beta}_{2sls}).

This implies that the asymptotic distribution of \tilde{\beta} is indeed the same as the asymptotic
distribution of \hat{\beta}_{2sls} under MDS with conditional homoskedasticity.

The asymptotic variance formula
    avar(\sqrt{n}\,\hat{\beta}_{2sls}) = \sigma^2 Q_{\tilde{x}\tilde{x}}^{-1} = \sigma^2 (\gamma' Q_{zz} \gamma)^{-1}
indicates that the asymptotic variance of \sqrt{n}\,\hat{\beta}_{2sls} will be large if the correlation between
Z_t and X_t, as measured by \gamma, is weak. Thus, more precise estimation of \beta^o will be
obtained if one chooses the instrument vector Z_t such that Z_t is highly correlated with
X_t.

Question: How to estimate $\Omega$ under MDS disturbances with conditional homoskedasticity?

Consider the asymptotic variance estimator
$$\hat{\Omega} = \hat{s}^2\hat{Q}_{\hat{x}\hat{x}}^{-1} = \hat{s}^2\left(\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx}\right)^{-1},$$
where $\hat{s}^2 = \hat{e}'\hat{e}/(n-K)$, $\hat{e} = Y - X\hat{\beta}_{2sls}$,
$$\hat{Q}_{\hat{x}\hat{x}} = n^{-1}\sum_{t=1}^n \hat{X}_t\hat{X}_t',$$
and $\hat{X}_t = \hat{\gamma}'Z_t$ is the sample projection of $X_t$ on $Z_t$. Note that we have to use $\hat{X}_t$ rather than $\tilde{X}_t$ because $\tilde{X}_t = \gamma'Z_t$ is unknown.

It should be emphasized that $\hat{e}$ is not the estimated residual from the second stage regression (i.e., not from the regression of $Y$ on $\hat{X}$). This implies that even under conditional homoskedasticity, the conventional t-statistic in the second stage regression does not converge to $N(0,1)$ in distribution, and $J\cdot\hat{F}$ does not converge to $\chi^2_J$, where $\hat{F}$ is the F-statistic in the second stage regression.

To show $\hat{\Omega} \xrightarrow{p} \Omega$, we shall show (i) $\hat{Q}_{\hat{x}\hat{x}}^{-1} \xrightarrow{p} Q_{\tilde{x}\tilde{x}}^{-1}$ and (ii) $\hat{s}^2 \xrightarrow{p} \sigma^2$.

We first show (i). There are two methods for proving this.

Method 1: We shall show $\hat{Q}_{\hat{x}\hat{x}}^{-1} \xrightarrow{p} Q_{\tilde{x}\tilde{x}}^{-1}$. Because $\hat{X}_t = \hat{\gamma}'Z_t$ and $\hat{\gamma} \xrightarrow{p} \gamma$, we have
$$\hat{Q}_{\hat{x}\hat{x}} = n^{-1}\sum_{t=1}^n \hat{X}_t\hat{X}_t' = \hat{\gamma}'\left(n^{-1}\sum_{t=1}^n Z_tZ_t'\right)\hat{\gamma} = \hat{\gamma}'\hat{Q}_{zz}\hat{\gamma} \xrightarrow{p} \gamma'Q_{zz}\gamma = E[(\gamma'Z_t)(Z_t'\gamma)] = E(\tilde{X}_t\tilde{X}_t') = Q_{\tilde{x}\tilde{x}}.$$

Method 2: We shall show $(\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx})^{-1} \xrightarrow{p} (Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1}$, which follows immediately from $\hat{Q}_{xz} \xrightarrow{p} Q_{xz}$ and $\hat{Q}_{zz} \xrightarrow{p} Q_{zz}$ by the WLLN. This method is more straightforward but less intuitive than the first method.
p
Next, we shall show (ii) $\hat{s}^2 \xrightarrow{p} \sigma^2$. We decompose
$$\begin{aligned}
\hat{s}^2 &= \frac{\hat{e}'\hat{e}}{n-K} = \frac{1}{n-K}\sum_{t=1}^n (Y_t - X_t'\hat{\beta}_{2sls})^2 = \frac{1}{n-K}\sum_{t=1}^n [\varepsilon_t - X_t'(\hat{\beta}_{2sls} - \beta^o)]^2 \\
&= \frac{1}{n-K}\sum_{t=1}^n \varepsilon_t^2 + (\hat{\beta}_{2sls} - \beta^o)'\frac{1}{n-K}\sum_{t=1}^n X_tX_t'(\hat{\beta}_{2sls} - \beta^o) - 2(\hat{\beta}_{2sls} - \beta^o)'\frac{1}{n-K}\sum_{t=1}^n X_t\varepsilon_t \\
&\xrightarrow{p} \sigma^2 + 0'Q_{xx}0 - 2\cdot 0'E(X_t\varepsilon_t) = \sigma^2.
\end{aligned}$$
Note that although $E(X_t\varepsilon_t) \neq 0$, the last term still vanishes in probability, because $\hat{\beta}_{2sls} - \beta^o \xrightarrow{p} 0$.

Question: What happens if we use $s^2 = e'e/(n-K)$, where $e = Y - \hat{X}\hat{\beta}_{2sls}$ is the estimated residual from the second stage regression? Do we still have $s^2 \xrightarrow{p} \sigma^2$?

We have proved the following theorem.

Theorem 7.5 [Consistency of $\hat{\Omega}$ under MDS with Conditional Homoskedasticity]: Under Assumptions 7.1–7.4, 7.6 and 7.7, we have as $n \to \infty$,
$$\hat{\Omega} = \hat{s}^2\hat{Q}_{\hat{x}\hat{x}}^{-1} \xrightarrow{p} \Omega = \sigma^2 Q_{\tilde{x}\tilde{x}}^{-1} = \sigma^2\left(Q_{xz}Q_{zz}^{-1}Q_{zx}\right)^{-1}.$$
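As a minimal numerical sketch of Theorem 7.5 (hypothetical simulated data; Y, X, Z stand for an $n\times 1$ outcome, an $n\times K$ regressor matrix, and an $n\times l$ instrument matrix), the 2SLS estimate and this variance estimator can be computed as follows. Note that the residual uses $X$, not $\hat{X}$.

```python
import numpy as np

def tsls_homoskedastic(Y, X, Z):
    """2SLS estimate and homoskedastic variance estimate (sketch).

    Returns beta_2sls and an estimate of Var(beta_2sls) = s^2 * Q_xhat_xhat^{-1} / n.
    """
    n, K = X.shape
    Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)                 # projection onto columns of Z
    X_hat = Pz @ X                                         # first-stage fitted values
    beta = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ Y)   # second-stage OLS
    e_hat = Y - X @ beta                                   # residual uses X, NOT X_hat
    s2 = (e_hat @ e_hat) / (n - K)
    Omega_hat = s2 * np.linalg.inv(X_hat.T @ X_hat / n)    # s^2 * Q_xhat_xhat^{-1}
    return beta, Omega_hat / n                             # variance of beta_2sls itself

# hypothetical simulated example
rng = np.random.default_rng(0)
n = 500
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
u = rng.normal(size=n)
X1 = Z[:, 1] + Z[:, 2] + u + rng.normal(size=n)            # endogenous regressor
eps = u + rng.normal(size=n)                               # correlated with X1 through u
X = np.column_stack([np.ones(n), X1])
Y = X @ np.array([1.0, 2.0]) + eps
beta_hat, var_hat = tsls_homoskedastic(Y, X, Z)
```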

Case II: $\{Z_t\varepsilon_t\}$ is an MDS with Conditional Heteroskedasticity

When there exists conditional heteroskedasticity but $\{Z_t\varepsilon_t\}$ is still an MDS, the infeasible OLS estimator $\tilde{\beta}$ in the artificial regression
$$Y = \tilde{X}\beta^o + \varepsilon$$
has the following asymptotic distribution:
$$\sqrt{n}(\tilde{\beta} - \beta^o) \xrightarrow{d} N(0, Q_{\tilde{x}\tilde{x}}^{-1}\tilde{V}Q_{\tilde{x}\tilde{x}}^{-1}),$$
where
$$\tilde{V} = E(\tilde{X}_t\tilde{X}_t'\varepsilon_t^2).$$
Given $\tilde{X}_t = \gamma'Z_t$, $\gamma = Q_{zz}^{-1}Q_{zx}$, $Q_{\tilde{x}\tilde{x}} = \gamma'Q_{zz}\gamma$, and $\tilde{V} = \gamma'E(Z_tZ_t'\varepsilon_t^2)\gamma = \gamma'V\gamma$, where $V = E(Z_tZ_t'\varepsilon_t^2)$ under the MDS assumption with conditional heteroskedasticity, we have
$$\begin{aligned}
\operatorname{avar}(\sqrt{n}\,\tilde{\beta}) &= Q_{\tilde{x}\tilde{x}}^{-1}\tilde{V}Q_{\tilde{x}\tilde{x}}^{-1} \\
&= [E(\tilde{X}_t\tilde{X}_t')]^{-1}E[\tilde{X}_t\tilde{X}_t'\varepsilon_t^2][E(\tilde{X}_t\tilde{X}_t')]^{-1} \\
&= [\gamma'E(Z_tZ_t')\gamma]^{-1}\gamma'E(Z_tZ_t'\varepsilon_t^2)\gamma[\gamma'E(Z_tZ_t')\gamma]^{-1} \\
&= (Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1}Q_{xz}Q_{zz}^{-1}VQ_{zz}^{-1}Q_{zx}(Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1} \\
&= \operatorname{avar}(\sqrt{n}\,\hat{\beta}_{2sls}).
\end{aligned}$$
This implies that the asymptotic distribution of the infeasible OLS estimator $\tilde{\beta}$ is the same as the asymptotic distribution of $\hat{\beta}_{2sls}$ under MDS with conditional heteroskedasticity. Therefore, the estimator for $\Omega$ is
$$\hat{\Omega} = \hat{Q}_{\hat{x}\hat{x}}^{-1}\hat{V}_{\hat{x}\hat{x}}\hat{Q}_{\hat{x}\hat{x}}^{-1},$$
where
$$\hat{V}_{\hat{x}\hat{x}} = n^{-1}\sum_{t=1}^n \hat{X}_t\hat{X}_t'\hat{e}_t^2 = \hat{\gamma}'\left(n^{-1}\sum_{t=1}^n Z_tZ_t'\hat{e}_t^2\right)\hat{\gamma},$$
with $\hat{\gamma} = (Z'Z)^{-1}Z'X = \hat{Q}_{zz}^{-1}\hat{Q}_{zx}$ and $\hat{e}_t = Y_t - X_t'\hat{\beta}_{2sls}$. This is White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator for $\hat{\beta}_{2sls}$.
Now, put
$$\hat{V} \equiv n^{-1}\sum_{t=1}^n Z_tZ_t'\hat{e}_t^2.$$
Then
$$\hat{\Omega} = [\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx}]^{-1}\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{V}\hat{Q}_{zz}^{-1}\hat{Q}_{zx}[\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx}]^{-1},$$
where (please check it!)
$$\hat{V} = n^{-1}\sum_{t=1}^n Z_tZ_t'\hat{e}_t^2 \xrightarrow{p} V = E(Z_tZ_t'\varepsilon_t^2)$$
under suitable regularity conditions.


Question: How to show $\hat{\Omega} \xrightarrow{p} \Omega$ under MDS with conditional heteroskedasticity?

Again, there are two methods to show $\hat{\Omega} \xrightarrow{p} \Omega$ here.

Method 1: We shall show $\hat{Q}_{\hat{x}\hat{x}} \xrightarrow{p} Q_{\tilde{x}\tilde{x}}$ and $\hat{V}_{\hat{x}\hat{x}} \xrightarrow{p} \tilde{V}$. The fact that $\hat{Q}_{\hat{x}\hat{x}} \xrightarrow{p} Q_{\tilde{x}\tilde{x}}$ has been shown earlier in the case of conditional homoskedasticity. To show $\hat{V}_{\hat{x}\hat{x}} \xrightarrow{p} \tilde{V}$, we write
$$\hat{V}_{\hat{x}\hat{x}} = n^{-1}\sum_{t=1}^n \hat{X}_t\hat{X}_t'\hat{e}_t^2 = \hat{\gamma}'\left(n^{-1}\sum_{t=1}^n Z_tZ_t'\hat{e}_t^2\right)\hat{\gamma} = \hat{\gamma}'\hat{V}\hat{\gamma}.$$
Because $\hat{\gamma} \xrightarrow{p} \gamma$, and following the consistency proof for $n^{-1}\sum_{t=1}^n X_tX_t'e_t^2$ in Chapter 4, we can show (please verify!) that
$$\hat{V} = n^{-1}\sum_{t=1}^n Z_tZ_t'\hat{e}_t^2 \xrightarrow{p} E(Z_tZ_t'\varepsilon_t^2) = V$$
under the following additional moment condition:

Assumption 7.8: (i) $E(Z_{jt}^4) < \infty$ for all $0 \le j \le l$; and (ii) $E(\varepsilon_t^4) < \infty$.

It follows that
$$\hat{V}_{\hat{x}\hat{x}} \xrightarrow{p} \gamma'E(Z_tZ_t'\varepsilon_t^2)\gamma = E(\tilde{X}_t\tilde{X}_t'\varepsilon_t^2) = \tilde{V}.$$
This and $\hat{Q}_{\hat{x}\hat{x}} \xrightarrow{p} Q_{\tilde{x}\tilde{x}}$ imply $\hat{\Omega} \xrightarrow{p} \Omega$.

Method 2: Given that
$$\hat{\Omega} = \left(\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx}\right)^{-1}\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{V}\hat{Q}_{zz}^{-1}\hat{Q}_{zx}\left(\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx}\right)^{-1},$$
it suffices to show $\hat{Q}_{xz} \xrightarrow{p} Q_{xz}$, $\hat{Q}_{zz} \xrightarrow{p} Q_{zz}$ and $\hat{V} \xrightarrow{p} V$. The first two results immediately follow by the WLLN. The last result follows by a similar reasoning to the consistency proof for $n^{-1}\sum_{t=1}^n X_tX_t'e_t^2$ in Chapter 4 or 5.
We now summarize the result derived above.

Theorem 7.6 [Consistency of $\hat{\Omega}$ under MDS with Conditional Heteroskedasticity]: Under Assumptions 7.1–7.4, 7.6 and 7.8, we have as $n \to \infty$,
$$\hat{\Omega} = \hat{Q}_{\hat{x}\hat{x}}^{-1}\hat{V}_{\hat{x}\hat{x}}\hat{Q}_{\hat{x}\hat{x}}^{-1} \xrightarrow{p} \Omega = Q_{\tilde{x}\tilde{x}}^{-1}\tilde{V}Q_{\tilde{x}\tilde{x}}^{-1} = (Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1}Q_{xz}Q_{zz}^{-1}VQ_{zz}^{-1}Q_{zx}(Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1},$$
where $\tilde{V} = E(\tilde{X}_t\tilde{X}_t'\varepsilon_t^2)$ and $V = E(Z_tZ_t'\varepsilon_t^2)$.
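A minimal sketch of the White-type estimator in Theorem 7.6 (same hypothetical arrays Y, X, Z as in the earlier sketch, again reusing numpy):

```python
import numpy as np

def tsls_white_variance(Y, X, Z):
    """Heteroskedasticity-consistent variance estimate for 2SLS (sketch)."""
    n, K = X.shape
    Qzz = Z.T @ Z / n
    Qzx = Z.T @ X / n
    gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)            # first-stage coefficients
    X_hat = Z @ gamma_hat
    beta = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ Y)
    e_hat = Y - X @ beta                                     # residual uses X, not X_hat
    V_hat = (Z * e_hat[:, None] ** 2).T @ Z / n              # n^-1 sum Z_t Z_t' e_t^2
    A = np.linalg.inv(Qzx.T @ np.linalg.solve(Qzz, Qzx))     # (Qxz Qzz^-1 Qzx)^-1
    B = Qzx.T @ np.linalg.solve(Qzz, V_hat) @ np.linalg.solve(Qzz, Qzx)
    Omega_hat = A @ B @ A                                    # sandwich form
    return beta, Omega_hat / n                               # variance of beta_2sls itself
```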

Case III: $\{Z_t\varepsilon_t\}$ is a Stationary Ergodic non-MDS

Finally, we consider a general case where $\{Z_t\varepsilon_t\}$ is not an MDS, which may arise as in the examples discussed in Chapter 6.
In this case, we have $\sqrt{n}(\hat{\beta}_{2sls} - \beta^o) \xrightarrow{d} N(0, \Omega)$ as $n \to \infty$, where
$$\Omega = Q_{\tilde{x}\tilde{x}}^{-1}\tilde{V}Q_{\tilde{x}\tilde{x}}^{-1} = (Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1}Q_{xz}Q_{zz}^{-1}VQ_{zz}^{-1}Q_{zx}(Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1},$$
with
$$\tilde{V} = \sum_{j=-\infty}^{\infty}\tilde{\Gamma}(j), \quad \tilde{\Gamma}(j) = \operatorname{cov}(\tilde{X}_t\varepsilon_t, \tilde{X}_{t-j}\varepsilon_{t-j}),$$
$$V = \sum_{j=-\infty}^{\infty}\Gamma(j), \quad \Gamma(j) = \operatorname{cov}(Z_t\varepsilon_t, Z_{t-j}\varepsilon_{t-j}).$$
On the other hand, we have
$$\operatorname{avar}(\sqrt{n}\,\tilde{\beta}) = Q_{\tilde{x}\tilde{x}}^{-1}\tilde{V}Q_{\tilde{x}\tilde{x}}^{-1} = (\gamma'Q_{zz}\gamma)^{-1}\gamma'V\gamma(\gamma'Q_{zz}\gamma)^{-1} = \operatorname{avar}(\sqrt{n}\,\hat{\beta}_{2sls}).$$
Thus, the asymptotic variance of $\sqrt{n}\,\hat{\beta}_{2sls}$ is the same as the asymptotic variance of $\tilde{\beta}$ in this general case.

Question: How to estimate $\Omega$?

Answer: Use a long-run variance-covariance matrix estimator for $V$ or $\tilde{V}$.

We directly assume that we have a consistent estimator $\hat{V}$ for $V$:

Assumption 7.9: $\hat{V} \xrightarrow{p} V \equiv \sum_{j=-\infty}^{\infty}\Gamma(j)$, where $\Gamma(j) = \operatorname{cov}(Z_t\varepsilon_t, Z_{t-j}\varepsilon_{t-j})$.

Question: How to estimate $\tilde{V} = \sum_{j=-\infty}^{\infty}\tilde{\Gamma}(j)$?

Recall that $\tilde{\Gamma}(j) = \gamma'\Gamma(j)\gamma$. A consistent estimator for $\tilde{V}$ is thus given by
$$\hat{\gamma}'\hat{V}\hat{\gamma} \xrightarrow{p} \tilde{V}.$$

Theorem 7.7 [Consistency of $\hat{\Omega}$ under Non-MDS]: Under Assumptions 7.1–7.4 and 7.9, we have as $n \to \infty$,
$$\hat{\Omega} = \hat{Q}_{\hat{x}\hat{x}}^{-1}\hat{V}_{\hat{x}\hat{x}}\hat{Q}_{\hat{x}\hat{x}}^{-1} = (\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx})^{-1}\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{V}\hat{Q}_{zz}^{-1}\hat{Q}_{zx}(\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx})^{-1} \xrightarrow{p} \Omega = Q_{\tilde{x}\tilde{x}}^{-1}\tilde{V}Q_{\tilde{x}\tilde{x}}^{-1},$$
where $\hat{V}_{\hat{x}\hat{x}} = \hat{\gamma}'\hat{V}\hat{\gamma}$ and
$$\Omega = (Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1}Q_{xz}Q_{zz}^{-1}VQ_{zz}^{-1}Q_{zx}(Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1}.$$
With a consistent estimator of $\Omega$, we can develop various confidence interval estimators and various tests for the null hypothesis $H_0: R\beta^o = r$. We consider the latter now.
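For Assumption 7.9, one common (though not the only) choice is a Bartlett-kernel, Newey–West type long-run variance estimator. A minimal sketch, assuming the array M stacks $Z_t\hat{e}_t$ row by row and p is a user-chosen truncation lag:

```python
import numpy as np

def long_run_variance(M, p):
    """Bartlett-kernel long-run variance estimator (sketch).

    M: (n, l) array whose t-th row is Z_t * e_t; p: truncation lag.
    Estimates V = sum_j Gamma(j), Gamma(j) = E[Z_t e_t (Z_{t-j} e_{t-j})'].
    """
    n, l = M.shape
    V = M.T @ M / n                      # Gamma_hat(0)
    for j in range(1, p + 1):
        w = 1.0 - j / (p + 1.0)          # Bartlett weight
        G = M[j:].T @ M[:-j] / n         # Gamma_hat(j)
        V += w * (G + G.T)               # add Gamma(j) + Gamma(j)'
    return V
```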

7.6 Hypothesis Testing


Now, consider the null hypothesis of interest
$$H_0: R\beta^o = r,$$
where $R$ is a $J \times K$ nonstochastic matrix, and $r$ is a $J \times 1$ nonstochastic vector. The test statistics will differ depending on whether $\{Z_t\varepsilon_t\}$ is an MDS, and, when $\{Z_t\varepsilon_t\}$ is an MDS, on whether $\{\varepsilon_t\}$ is conditionally homoskedastic. To save space, we do not present the results on t-type test statistics when $J = 1$.

Case I: $\{Z_t\varepsilon_t\}$ is an MDS with Conditional Homoskedasticity

Theorem 7.8 [Hypothesis Testing]: Put $\hat{e} \equiv Y - X\hat{\beta}_{2sls}$. Then under Assumptions 7.1–7.4, 7.6 and 7.7, the Wald test statistic
$$\hat{W} = \frac{n(R\hat{\beta}_{2sls} - r)'[R\hat{Q}_{\hat{x}\hat{x}}^{-1}R']^{-1}(R\hat{\beta}_{2sls} - r)}{\hat{e}'\hat{e}/(n-K)} \xrightarrow{d} \chi^2_J$$
as $n \to \infty$ under $H_0$.

Proof: The result follows immediately from the asymptotic normality theorem for $\sqrt{n}(\hat{\beta}_{2sls} - \beta^o)$, the null hypothesis $H_0$ (which implies $\sqrt{n}(R\hat{\beta}_{2sls} - r) = R\sqrt{n}(\hat{\beta}_{2sls} - \beta^o)$), the consistent asymptotic variance estimation theorem, and the Slutsky theorem.

Remarks:
Question: Is $\hat{W}/J$ the F-statistic from the second stage regression?
Answer: No, because $\hat{e}$ is not the estimated residual from the second stage regression.
Question: Do we still have
$$\hat{F} = \frac{(e_r'e_r - e_u'e_u)/J}{e_u'e_u/(n-K)},$$
where $e_r$ and $e_u$ are the estimated residuals from the restricted and unrestricted regression models in the second stage regression respectively?
Answer: No. (Why?)
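A minimal sketch of the Wald test computation (beta_hat and var_hat denote the 2SLS estimate and a consistent estimate of the variance of $\hat{\beta}_{2sls}$ itself, e.g. as returned by the earlier sketches; R and r specify the hypothesis):

```python
import numpy as np
from scipy.stats import chi2

def wald_test(beta_hat, var_beta, R, r):
    """Wald statistic W = (R b - r)' [R Var(b) R']^{-1} (R b - r) and its chi2_J p-value."""
    d = R @ beta_hat - r
    W = d @ np.linalg.solve(R @ var_beta @ R.T, d)
    J = R.shape[0]
    return W, 1.0 - chi2.cdf(W, df=J)
```

The same routine applies in Cases II and III below, once var_beta is replaced by the heteroskedasticity-consistent or long-run variance estimate, respectively.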
Case II: $\{Z_t\varepsilon_t\}$ is a Stationary Ergodic MDS with Conditional Heteroskedasticity

Theorem 7.9 [Hypothesis Testing]: Under Assumptions 7.1–7.4, 7.6 and 7.8, the Wald test statistic
$$\hat{W} \equiv n(R\hat{\beta}_{2sls} - r)'[R\hat{Q}_{\hat{x}\hat{x}}^{-1}\hat{V}_{\hat{x}\hat{x}}\hat{Q}_{\hat{x}\hat{x}}^{-1}R']^{-1}(R\hat{\beta}_{2sls} - r) \xrightarrow{d} \chi^2_J$$
under $H_0$, where $\hat{V}_{\hat{x}\hat{x}} = n^{-1}\sum_{t=1}^n \hat{X}_t\hat{X}_t'\hat{e}_t^2$ and $\hat{e}_t = Y_t - X_t'\hat{\beta}_{2sls}$.

Question: Suppose there exists conditional homoskedasticity but we use $\hat{W}$ above. Is $\hat{W}$ an asymptotically valid procedure in this case?
Answer: Yes, $\hat{W}$ is asymptotically valid. However, the finite sample performance of $\hat{W}$ will generally be less satisfactory than that of the test statistic in Case I.

Case III: $\{Z_t\varepsilon_t\}$ is a Stationary Ergodic non-MDS

When $\{Z_t\varepsilon_t\}$ is non-MDS, we can still construct a Wald test which is robust to conditional heteroskedasticity and autocorrelation, as stated below.

Theorem 7.10 [Hypothesis Testing]: Under Assumptions 7.1–7.5 and 7.9, the Wald test statistic
$$\hat{W} = n(R\hat{\beta}_{2sls} - r)'[R\hat{Q}_{\hat{x}\hat{x}}^{-1}\hat{V}_{\hat{x}\hat{x}}\hat{Q}_{\hat{x}\hat{x}}^{-1}R']^{-1}(R\hat{\beta}_{2sls} - r) \xrightarrow{d} \chi^2_J$$
under $H_0$, where $\hat{V}_{\hat{x}\hat{x}} = \hat{\gamma}'\hat{V}\hat{\gamma}$, $\hat{\gamma} = (Z'Z)^{-1}Z'X$, and $\hat{V}$ is a long-run variance-covariance estimator for $V = \sum_{j=-\infty}^{\infty}\Gamma(j)$ with $\Gamma(j) = \operatorname{cov}(Z_t\varepsilon_t, Z_{t-j}\varepsilon_{t-j})$.

7.7 Hausman’s Test


When there exists endogeneity so that E(Xt "t ) 6= 0; the OLS estimator ^ is inconsis-
tent for o : Instead, the 2SLS estimator ^ 2sls should be used, which involves the choice
of the instrumental vector Zt which in turn a¤ects the e¢ ciency of ^ 2sls . In practice, it
is not uncommon that practitioners are not sure whether there exists endogeneity. In
this section, we introduce Hausman’s (1978) test for endogeneity. The null hypothesis
of interest is:
H0 : E("t jXt ) = 0:
If this null hypothesis is rejected, one has to use the 2SLS estimator ^ 2sls provided
that one can …nd a set of instruments Zt that satis…es Assumption 7.4.

26
For simplicity, we impose the following conditions.

Assumption 7.10: (i) $\{(X_t', Z_t')'\varepsilon_t\}$ is an MDS process; and (ii) $E(\varepsilon_t^2|X_t, Z_t) = \sigma^2$ a.s.

Assumption 7.10 is made for simplicity. It could be relaxed to allow a non-MDS process with conditional heteroskedasticity, but Hausman's (1978) test statistic to be introduced below would then have to be generalized.

Question: How to test the conditional homoskedasticity assumption that $E(\varepsilon_t^2|X_t, Z_t) = \sigma^2$?

Answer: Put $\hat{e}_t = Y_t - X_t'\hat{\beta}_{2sls}$. (Question: Can we use $e_t = Y_t - \hat{X}_t'\hat{\beta}_{2sls}$?) Then run an auxiliary regression of $\hat{e}_t^2$ on $\operatorname{vech}(U_tU_t')$, where $U_t = (X_t', Z_t')'$ is a $(K+l)\times 1$ vector. Then, under the condition that $E(\varepsilon_t^4|X_t, Z_t) = \mu_4$ is a constant, we have $nR^2 \xrightarrow{d} \chi^2_J$ under the null hypothesis of conditional homoskedasticity, where $J = (K+l)(K+l+1)/2 - 1$.
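A minimal sketch of this auxiliary regression (assuming $\hat{e}_t$, X, Z are available as numpy arrays; here the degrees of freedom are taken as the number of non-constant regressors, which matches the J above when $U_t$ contains exactly one constant):

```python
import numpy as np
from scipy.stats import chi2

def cond_homoskedasticity_test(e_hat, X, Z):
    """nR^2 test of E(e^2 | X, Z) = const via an auxiliary regression (sketch)."""
    U = np.column_stack([X, Z])
    n, m = U.shape
    rows, cols = np.triu_indices(m)
    regressors = np.column_stack([U[:, i] * U[:, j] for i, j in zip(rows, cols)])
    keep = np.std(regressors, axis=0) > 1e-12          # drop duplicated constants
    R = np.column_stack([np.ones(n), regressors[:, keep]])
    y = e_hat ** 2
    coef, *_ = np.linalg.lstsq(R, y, rcond=None)
    resid = y - R @ coef
    R2 = 1.0 - resid.var() / y.var()
    J = R.shape[1] - 1
    return n * R2, 1.0 - chi2.cdf(n * R2, df=J)
```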

The basic idea of Hausman's test is that under $H_0: E(\varepsilon_t|X_t) = 0$, both the OLS estimator $\hat{\beta} = (X'X)^{-1}X'Y$ and the 2SLS estimator $\hat{\beta}_{2sls}$ are consistent for $\beta^o$. They converge to the same limit $\beta^o$, but it can be shown that $\hat{\beta}$ is an asymptotically efficient estimator while $\hat{\beta}_{2sls}$ is not. Under the alternatives to $H_0$, $\hat{\beta}_{2sls}$ remains consistent for $\beta^o$ but $\hat{\beta}$ does not. Hausman (1978) considers a test for $H_0$ based on the difference between the two estimators,
$$\hat{\beta}_{2sls} - \hat{\beta},$$
which converges to zero under $H_0$ but generally to a nonzero constant under the alternatives to $H_0$, giving the test its power against $H_0$ when the sample size $n$ is sufficiently large.

To construct Hausman's (1978) test statistic, we need to derive the asymptotic distribution of $\hat{\beta}_{2sls} - \hat{\beta}$. For this purpose, we first state a lemma.

Lemma 7.11: Suppose $\hat{A} \xrightarrow{p} A$ and $\hat{B} = O_P(1)$. Then $(\hat{A} - A)\hat{B} \xrightarrow{p} 0$.

We first consider the OLS estimator $\hat{\beta}$. Note that
$$\sqrt{n}(\hat{\beta} - \beta^o) = \hat{Q}_{xx}^{-1}\,n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t,$$
where $\hat{Q}_{xx}^{-1} \xrightarrow{p} Q_{xx}^{-1}$ and
$$n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t \xrightarrow{d} N(0, \sigma^2 Q_{xx})$$
as $n \to \infty$ (see Chapter 5). It follows that $n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t = O_P(1)$, and by Lemma 7.11, we have
$$\sqrt{n}(\hat{\beta} - \beta^o) = Q_{xx}^{-1}\,n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t + o_P(1).$$
Similarly, we can obtain
$$\sqrt{n}(\hat{\beta}_{2sls} - \beta^o) = \hat{A}\,n^{-1/2}\sum_{t=1}^n Z_t\varepsilon_t = A\,n^{-1/2}\sum_{t=1}^n Z_t\varepsilon_t + o_P(1),$$
where $\hat{A} = (\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx})^{-1}\hat{Q}_{xz}\hat{Q}_{zz}^{-1} \xrightarrow{p} A = (Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1}Q_{xz}Q_{zz}^{-1}$ and $n^{-1/2}\sum_{t=1}^n Z_t\varepsilon_t \xrightarrow{d} N(0, \sigma^2 Q_{zz})$ (see Corollary 7.4). It follows that
$$\sqrt{n}(\hat{\beta}_{2sls} - \hat{\beta}) = n^{-1/2}\sum_{t=1}^n \left[(Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1}Q_{xz}Q_{zz}^{-1}Z_t - Q_{xx}^{-1}X_t\right]\varepsilon_t + o_P(1) \xrightarrow{d} N\!\left(0, \sigma^2(Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1} - \sigma^2 Q_{xx}^{-1}\right)$$
by the CLT for stationary ergodic MDS processes and Assumption 7.10. Therefore, under the null hypothesis $H_0$, the quadratic form
$$H = \frac{n(\hat{\beta}_{2sls} - \hat{\beta})'\left[(\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx})^{-1} - \hat{Q}_{xx}^{-1}\right]^{-1}(\hat{\beta}_{2sls} - \hat{\beta})}{s^2} \xrightarrow{d} \chi^2_K$$
as $n \to \infty$ by the Slutsky theorem, where $s^2 = e'e/n$ is the residual variance estimator based on the OLS residual $e = Y - X\hat{\beta}$. This is called Hausman's test statistic.

Question: Can we replace the residual variance estimator $s^2$ by $\hat{s}^2 = \hat{e}'\hat{e}/n$, where $\hat{e} = Y - X\hat{\beta}_{2sls}$?

Theorem 7.12 [Hausman's Test for Endogeneity]: Suppose Assumptions 7.1–7.4, 7.10 and $H_0$ hold, and $Q_{xx} - Q_{xz}Q_{zz}^{-1}Q_{zx}$ is strictly positive definite. Then as $n \to \infty$,
$$H \xrightarrow{d} \chi^2_K.$$
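A minimal sketch of this statistic under Assumption 7.10 (hypothetical numpy arrays Y, X, Z; $s^2$ is based on the OLS residuals, as in the derivation above):

```python
import numpy as np
from scipy.stats import chi2

def hausman_test(Y, X, Z):
    """Hausman statistic H under conditional homoskedasticity (sketch)."""
    n, K = X.shape
    Qxx = X.T @ X / n
    Qzz = Z.T @ Z / n
    Qzx = Z.T @ X / n
    beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    beta_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ Y)
    e = Y - X @ beta_ols
    s2 = (e @ e) / n                                            # OLS residual variance
    D = np.linalg.inv(Qzx.T @ np.linalg.solve(Qzz, Qzx)) - np.linalg.inv(Qxx)
    d = beta_2sls - beta_ols
    H = n * d @ np.linalg.solve(D, d) / s2
    return H, 1.0 - chi2.cdf(H, df=K)
```

In practice D may be (near-)singular, in which case the generalized-inverse version discussed below, with degrees of freedom equal to the rank of the variance matrix, should be used instead.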

Remarks:
We note that in the above theorem,
$$\operatorname{avar}[\sqrt{n}(\hat{\beta}_{2sls} - \hat{\beta})] = \sigma^2(Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1} - \sigma^2 Q_{xx}^{-1} = \operatorname{avar}(\sqrt{n}\,\hat{\beta}_{2sls}) - \operatorname{avar}(\sqrt{n}\,\hat{\beta}).$$
This simple asymptotic variance-covariance structure is made possible by Assumption 7.10. Suppose there exists conditional heteroskedasticity (i.e., $E(\varepsilon_t^2|X_t, Z_t) \neq \sigma^2$). Then we no longer have the above simple variance-covariance structure for $\operatorname{avar}[\sqrt{n}(\hat{\beta}_{2sls} - \hat{\beta})]$.

The variance-covariance matrix $(Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1} - Q_{xx}^{-1}$ may become singular when its rank $J < K$. In this case, we have to modify Hausman's test statistic by using the generalized inverse of the variance estimator:
$$H = \frac{n(\hat{\beta}_{2sls} - \hat{\beta})'\left[(\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx})^{-1} - \hat{Q}_{xx}^{-1}\right]^{-}(\hat{\beta}_{2sls} - \hat{\beta})}{s^2}.$$
Note that now $H \xrightarrow{d} \chi^2_J$ under $H_0$, where $J < K$.

Question: What is the generalized inverse $A^-$ of a matrix $A$?

Question: How to modify Hausman's test statistic so that it remains asymptotically $\chi^2_K$ when there exists conditional heteroskedasticity (i.e., $E(\varepsilon_t^2|X_t, Z_t) \neq \sigma^2$) but $\{(X_t', Z_t')'\varepsilon_t\}$ is still an MDS process?

In fact, Hausman's (1978) test is a general approach to testing model specification, not merely whether endogeneity exists. For example, it can be used to test whether a fixed effect panel regression model or a random effect panel regression model should be used. In Hausman (1978), two estimators are compared, one of which is asymptotically efficient under the null hypothesis but inconsistent under the alternative, and the other of which is asymptotically inefficient but consistent under the alternative hypothesis. This approach was extended by White (1981) to compare any two different estimators, neither of which need be asymptotically most efficient. The methods of Hausman and White were further extended by Newey (1985), Tauchen (1985) and White (1990) to construct moment-based tests for model specification.

Hausman's test is used to check whether $E(\varepsilon_t|X_t) = 0$. If this condition fails, one has to choose an instrumental vector $Z_t$ that satisfies Assumption 7.4. When we choose a set of variables $Z_t$, how can we check the validity of $Z_t$ as instruments? In particular, how can we check whether $E(\varepsilon_t|Z_t) = 0$? For this purpose, we will consider a so-called overidentification test, which will be introduced in Chapter 8.

7.8 Empirical Applications


Application I: Incentives in Chinese State-owned Enterprises

Groves, Hong, McMillan and Naughton (1994, Quarterly Journal of Economics)

Application II: The Consumption Function

Campbell and Mankiw (1989, 1991)

The consumption function is
$$C_t = \alpha + \lambda Y_t + \varepsilon_t, \qquad Y_t = Z_t'\delta + v_t,$$
where $Y_t$ is income growth and $C_t$ is consumption growth.

7.9 Conclusion

In this chapter, we discuss the possibility that the condition $E(\varepsilon_t|X_t) = 0$ may fail in practice, which renders the OLS estimator inconsistent for the true model parameters. With the use of instrumental variables, we introduce a consistent two-stage least squares (2SLS) estimator. We investigate the statistical properties of the 2SLS estimator and provide interpretations that enhance a deeper understanding of its nature. We discuss how to construct consistent estimators for the asymptotic variance of the 2SLS estimator under various scenarios, including MDS with conditional homoskedasticity, MDS with conditional heteroskedasticity, and non-MDS possibly with conditional heteroskedasticity. For the latter, consistent estimation of the long-run variance-covariance matrix is needed. With these consistent asymptotic variance estimators, various hypothesis test procedures are proposed. It is important to emphasize that the conventional t-test and F-test cannot be used even for large samples. Finally, some empirical applications that employ 2SLS are considered.

In fact, the 2SLS procedure is one of several approaches to consistent estimation of model parameters when the condition $E(\varepsilon_t|X_t) = 0$ fails. There are alternative estimation procedures that also yield consistent estimators. For example, suppose the correlation between $X_t$ and $\varepsilon_t$ is caused by the omitted variables problem, namely
$$\varepsilon_t = g(W_t) + u_t,$$
where $E(u_t|X_t, W_t) = 0$ and $W_t$ is an omitted variable which is correlated with $X_t$. This delivers a partially linear regression model
$$Y_t = X_t'\beta^o + g(W_t) + u_t.$$
Because $E(Y_t|W_t) = E(X_t|W_t)'\beta^o + g(W_t)$, we obtain
$$Y_t - E(Y_t|W_t) = [X_t - E(X_t|W_t)]'\beta^o + u_t$$
or
$$Y_t^* = X_t^{*\prime}\beta^o + u_t,$$
where $Y_t^* = Y_t - E(Y_t|W_t)$ and $X_t^* = X_t - E(X_t|W_t)$. Because $E(X_t^*u_t) = 0$, the OLS estimator $\tilde{\beta}$ of regressing $Y_t^*$ on $X_t^*$ would be consistent for $\beta^o$. However, $(Y_t^*, X_t^*)$ are not observable, so $\tilde{\beta}$ is infeasible. Nevertheless, one can first estimate $E(Y_t|W_t)$ and $E(X_t|W_t)$ nonparametrically, and then obtain a feasible OLS estimator which will be consistent for the true model parameter (e.g., Robinson 1988). Specifically, let $\hat{m}_Y(W_t)$ and $\hat{m}_X(W_t)$ be consistent nonparametric estimators for $E(Y_t|W_t)$ and $E(X_t|W_t)$ respectively. Then we can obtain a feasible OLS estimator
$$\tilde{\beta}_a = \left[\sum_{t=1}^n \hat{X}_t^*\hat{X}_t^{*\prime}\right]^{-1}\sum_{t=1}^n \hat{X}_t^*\hat{Y}_t^*,$$
where $\hat{X}_t^* = X_t - \hat{m}_X(W_t)$ and $\hat{Y}_t^* = Y_t - \hat{m}_Y(W_t)$. It can be shown that $\tilde{\beta}_a \xrightarrow{p} \beta^o$ and
$$\sqrt{n}(\tilde{\beta}_a - \beta^o) \xrightarrow{d} N(0, Q_*^{-1}V_*Q_*^{-1}),$$
where $Q_* = E(X_t^*X_t^{*\prime})$ and $V_* = \operatorname{var}(n^{-1/2}\sum_{t=1}^n X_t^*u_t)$. The first stage nonparametric estimation has no impact on the asymptotic properties of the feasible OLS estimator $\tilde{\beta}_a$.
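As a rough illustration (a minimal sketch, assuming a scalar $W_t$ and using a simple Nadaraya-Watson kernel smoother; the bandwidth choice is glossed over), the feasible estimator could be computed as follows.

```python
import numpy as np

def nw_smooth(W, V, h):
    """Nadaraya-Watson estimate of E(V|W) at the sample points (Gaussian kernel)."""
    Kmat = np.exp(-0.5 * ((W[:, None] - W[None, :]) / h) ** 2)   # n x n kernel weights
    den = Kmat.sum(axis=1)
    num = Kmat @ V
    return num / (den[:, None] if num.ndim > 1 else den)

def robinson_estimator(Y, X, W, h=0.5):
    """Feasible OLS after nonparametrically partialling out W (sketch)."""
    Y_star = Y - nw_smooth(W, Y, h)      # Y_t - m_Y(W_t)
    X_star = X - nw_smooth(W, X, h)      # X_t - m_X(W_t)
    return np.linalg.solve(X_star.T @ X_star, X_star.T @ Y_star)
```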
Another method to consistently estimate the true model parameters is to make use of panel data. A panel data set is a collection of observations on a total of $n$ cross-sectional units, each of which has $T$ time series observations over the same time period. This is called a balanced panel. In contrast, an unbalanced panel is a collection of observations on $n$ cross-sectional units where each unit may have a different length of time series observations with some common overlapping time periods.
With a balanced panel, we have
$$Y_{it} = X_{it}'\beta^o + \varepsilon_{it} = X_{it}'\beta^o + \alpha_i + u_{it},$$
where $\alpha_i$ is called the individual-specific effect and $u_{it}$ is called the idiosyncratic disturbance, with $E(u_{it}|X_{it}, \alpha_i) = 0$. When $\alpha_i$ is correlated with $X_{it}$, which may be caused by omitted variables that do not change over time, the panel data model is called a fixed effect panel data model. When $\alpha_i$ is uncorrelated with $X_{it}$, the panel data model is called a random effect panel data model. Here, we consider a fixed effect panel data model with strictly exogenous regressors $X_{it}$. Because $\varepsilon_{it}$ is correlated with $X_{it}$, the OLS estimator of regressing $Y_{it}$ on $X_{it}$ is not consistent for $\beta^o$. However, one can consider the demeaned model
$$Y_{it} - \bar{Y}_{i\cdot} = (X_{it} - \bar{X}_{i\cdot})'\beta^o + (\varepsilon_{it} - \bar{\varepsilon}_{i\cdot}),$$
where $\bar{Y}_{i\cdot} = T^{-1}\sum_{t=1}^T Y_{it}$, and similarly for $\bar{X}_{i\cdot}$ and $\bar{\varepsilon}_{i\cdot}$. The demeaning procedure removes the unobservable individual-specific effect, and as a result the OLS estimator for the demeaned model, called the within estimator in the panel data literature, is consistent for the true model parameter $\beta^o$. (It should be noted that for a dynamic panel data model where $X_{it}$ is not strictly exogenous, the within estimator is not consistent for $\beta^o$ when the number of time periods $T$ is fixed; different estimation methods have to be used.) See Hsiao (2002) for a detailed discussion of panel data econometric models.
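A minimal sketch of the within transformation (hypothetical stacked numpy arrays; ids labels the cross-sectional unit of each row):

```python
import numpy as np

def within_estimator(Y, X, ids):
    """Fixed-effect (within) estimator: demean Y and X by unit means, then OLS (sketch)."""
    Y_dm = Y.astype(float).copy()
    X_dm = X.astype(float).copy()
    for i in np.unique(ids):
        sel = ids == i
        Y_dm[sel] -= Y[sel].mean()
        X_dm[sel] -= X[sel].mean(axis=0)
    return np.linalg.solve(X_dm.T @ X_dm, X_dm.T @ Y_dm)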

Chapters 2 to 7 present a relatively comprehensive econometric theory for linear regression models often encountered in economics and finance. We start with a general regression analysis, discussing the interpretation of a linear regression model, which depends on whether the linear regression model is correctly specified. After discussing the classical linear regression model in Chapter 3, Chapters 4 to 7 discuss various extensions and generalizations when some assumptions of the classical linear regression model are violated. In particular, we consider the scenarios under which the results for classical linear regression models are approximately applicable for large samples. The key conditions here are conditional homoskedasticity and serial uncorrelatedness of the regression disturbance. When there exists conditional heteroskedasticity or serial correlation in the regression disturbance, the results for classical linear regression models are no longer applicable; we provide robust asymptotically valid procedures under these scenarios.

The asymptotic theory developed for linear regression models in Chapters 4–7 can be easily extended to more complicated, nonlinear models. For example, consider a nonlinear regression model
$$Y_t = g(X_t, \theta^o) + \varepsilon_t,$$
where $E(\varepsilon_t|X_t) = 0$ a.s. The nonlinear least squares estimator solves the minimization of the sum of squared residuals:
$$\hat{\theta} = \arg\min_{\theta}\sum_{t=1}^n [Y_t - g(X_t, \theta)]^2.$$
The first order condition is
$$D(\hat{\theta})'e = 0,$$
where $D(\theta)$ is an $n \times K$ matrix whose t-th row is $\partial g(X_t, \theta)/\partial\theta'$, and $e$ is the residual vector. Although one generally does not have a closed form expression for $\hat{\theta}$, all asymptotic theory and procedures in Chapters 4–7 are applicable to the nonlinear least squares estimator if one replaces $X_t$ by $(\partial/\partial\theta)g(X_t, \theta)$. See also the discussion in Chapters 8 and 9.
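As a minimal illustration (a hypothetical model $g(X, \theta) = \theta_0 e^{\theta_1 X}$ with simulated data; scipy's generic least-squares routine stands in for a purpose-built NLS estimator):

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(theta, X, Y):
    """Residuals Y_t - g(X_t, theta) for the hypothetical model g = theta0 * exp(theta1 * X)."""
    return Y - theta[0] * np.exp(theta[1] * X)

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=300)
Y = 1.5 * np.exp(0.7 * X) + rng.normal(scale=0.3, size=300)
fit = least_squares(residuals, x0=np.array([1.0, 0.5]), args=(X, Y))
theta_hat = fit.x
```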
The asymptotic theory in Chapters 4–7, however, cannot be directly applied to some other popular nonlinear models. Examples of such nonlinear models are:

Rational Expectations Model:
$$E[m(Z_t, \theta^o)] = 0;$$

Conditional Variance Model:
$$Y_t = g(X_t, \theta^o) + \sigma(X_t, \theta^o)u_t,$$
where $g(X_t, \theta)$ is a parametric model for $E(Y_t|X_t)$, $\sigma^2(X_t, \theta)$ is a parametric model for $\operatorname{var}(Y_t|X_t)$, and $\{u_t\}$ is i.i.d.$(0,1)$;

Conditional probability model of $Y_t$ given $X_t$:
$$f(y|X_t, \theta).$$

These nonlinear models are not models for the conditional mean or regression; they also model other characteristics of the conditional distribution of $Y_t$ given $X_t$. For these models, we need to develop new estimation methods and new asymptotic theory, to which we turn in subsequent chapters.
One important topic that we do not discuss in Chapters 2–7 is model specification testing. Chapter 2 emphasizes the importance of correct model specification for the validity of the economic interpretation of model parameters. How can we check whether a linear regression model is correctly specified for the conditional mean $E(Y_t|X_t)$? This is called model specification testing. Some popular specification tests in econometrics are Hausman's (1978) test and White's (1981) test, which compare two parameter estimators of the same model parameter. Also, see Hong and White's (1995) specification test using a nonparametric series regression approach.
EXERCISES

7.1. Consider the following simple Keynesian national income model
$$C_t = \beta_1^o + \beta_2^o(Y_t - T_t) + \varepsilon_t, \tag{1.1}$$
$$T_t = \alpha_1^o + \alpha_2^o Y_t + v_t, \tag{1.2}$$
$$Y_t = C_t + G_t, \tag{1.3}$$
where $C_t, Y_t, T_t, G_t$ are consumption, income, tax, and government spending respectively, and $\{\varepsilon_t\}$ and $\{v_t\}$ are i.i.d. $(0, \sigma_\varepsilon^2)$ and $(0, \sigma_v^2)$ respectively. Model (1.1) is the consumption function which we are interested in, (1.2) is a tax function, and (1.3) is an income identity.
(a) Can the OLS estimator $\hat{\beta}$ of model (1.1) give consistent estimation of the marginal propensity to consume? Explain.
(b) Suppose $G_t$ is an exogenous variable (i.e., $G_t$ does not depend on either $C_t$ or $Y_t$). Can $G_t$ be used as a valid instrumental variable? If yes, describe a 2SLS procedure. If not, explain.
(c) Suppose the government has to maintain a budget balance such that
$$G_t = T_t + w_t, \tag{1.4}$$
where $\{w_t\}$ is i.i.d. $(0, \sigma_w^2)$. Could $G_t$ be used as a valid instrumental variable? If yes, describe a 2SLS procedure. If not, explain.

7.2. Consider the data generating process
$$Y_t = X_t'\beta^o + \varepsilon_t, \tag{2.1}$$
where $X_t = (1, X_{1t})'$,
$$X_{1t} = v_t + u_t, \tag{2.2}$$
$$\varepsilon_t = w_t + u_t, \tag{2.3}$$
and $\{v_t\}$, $\{u_t\}$ and $\{w_t\}$ are all i.i.d. $N(0,1)$ and mutually independent.
(a) Is the OLS estimator $\hat{\beta}$ consistent for $\beta^o$? Explain.
(b) Suppose that $Z_{1t} = w_t - \varepsilon_t$. Is $Z_t = (1, Z_{1t})'$ a valid instrumental vector? Explain.
(c) Find an instrumental vector and the asymptotic distribution of $\hat{\beta}_{2sls}$ using this instrumental vector. [Note: you need to find $\sqrt{n}(\hat{\beta}_{2sls} - \beta^o) \xrightarrow{d} N(0, V)$ for some $V$, where the expression of $V$ should be given.]
(d) Consider testing the hypothesis
$$H_0: R\beta^o = r,$$
where $R$ is a $J \times 2$ matrix and $r$ is a $J \times 1$ vector. Suppose that $\tilde{F}$ is the F-statistic in the second stage regression of 2SLS. Could we use $J\tilde{F}$ as an asymptotic $\chi^2_J$ test? Explain.

7.3. Consider the following demand-supply system:
$$Y_t = \alpha_0^o + \alpha_1^o P_t + \alpha_2^o S_t + \varepsilon_t,$$
$$Y_t = \beta_0^o + \beta_1^o P_t + \beta_2^o C_t + v_t,$$
where the first equation is a model for the demand of a certain good, where $Y_t$ is the quantity demanded, $P_t$ is the price of the good, $S_t$ is the price of a substitute, and $\varepsilon_t$ is a shock to demand. The second equation is a model for the supply of the good, where $Y_t$ is the quantity supplied, $C_t$ is the cost of production, and $v_t$ is a shock to supply. Suppose $S_t$ and $C_t$ are exogenous variables, $\{\varepsilon_t\}$ is i.i.d. $(0, \sigma_\varepsilon^2)$, $\{v_t\}$ is i.i.d. $(0, \sigma_v^2)$, and the two series $\{\varepsilon_t\}$ and $\{v_t\}$ are independent of each other. We have also assumed that the market always clears, so the quantity demanded is equal to the quantity supplied.
(a) Suppose we use a 2SLS estimator to estimate the demand model with the instruments $Z_t = (S_t, C_t)'$. Describe the 2SLS procedure. Is the resulting 2SLS estimator $\hat{\alpha}_{2sls}$ consistent for $\alpha^o = (\alpha_0^o, \alpha_1^o, \alpha_2^o)'$? Explain.
(b) Suppose we use a 2SLS estimator to estimate the supply equation with the instruments $Z_t = (S_t, C_t)'$. Describe the 2SLS procedure. Is the resulting 2SLS estimator $\hat{\beta}_{2sls}$ consistent for $\beta^o = (\beta_0^o, \beta_1^o, \beta_2^o)'$? Explain.
(c) Suppose $\{\varepsilon_t\}$ and $\{v_t\}$ are contemporaneously correlated, namely $E(\varepsilon_t v_t) \neq 0$. This can occur when there is a common shock to both the demand and supply of the good. Does this affect the conclusions in parts (a) and (b)? Explain.
7.4. Show that under Assumptions 7.1–7.4, $\hat{\beta}_{2sls} \xrightarrow{p} \beta^o$ as $n \to \infty$.

7.5. Suppose Assumptions 7.1–7.5 hold.
(a) Show that $\sqrt{n}(\hat{\beta}_{2sls} - \beta^o) \xrightarrow{d} N(0, \Omega)$ as $n \to \infty$, where
$$\Omega = [Q_{xz}Q_{zz}^{-1}Q_{zx}]^{-1}Q_{xz}Q_{zz}^{-1}VQ_{zz}^{-1}Q_{zx}[Q_{xz}Q_{zz}^{-1}Q_{zx}]^{-1},$$
and $V$ is given in Assumption 7.5.
(b) If in addition $\{Z_t\varepsilon_t\}$ is an ergodic stationary MDS process with $E(\varepsilon_t^2|Z_t) = \sigma^2$, show that
$$\Omega = \sigma^2[Q_{xz}Q_{zz}^{-1}Q_{zx}]^{-1}.$$
7.6. Suppose Assumptions 7.1–7.4, 7.6 and 7.7 hold.
(a) Define
$$\hat{s}^2 = \frac{\hat{e}'\hat{e}}{n},$$
where $\hat{e} = Y - X\hat{\beta}_{2sls}$. Show $\hat{s}^2 \xrightarrow{p} \sigma^2 = \operatorname{var}(\varepsilon_t)$ as $n \to \infty$.
(b) Define
$$s^2 = \frac{e'e}{n},$$
where $e = Y - \hat{X}\hat{\beta}_{2sls}$ is the estimated residual from the second stage regression of $Y_t$ on $\hat{X}_t = \hat{\gamma}'Z_t$. Show that $s^2$ is not a consistent estimator for $\sigma^2$.

7.7. [2SLS Hypothesis Testing] Suppose Assumptions 7.1–7.5 hold. Define an F-statistic
$$F = \frac{n(R\hat{\beta}_{2sls} - r)'[R\hat{Q}_{\hat{x}\hat{x}}^{-1}R']^{-1}(R\hat{\beta}_{2sls} - r)/J}{e'e/(n-K)},$$
where $e_t = Y_t - \hat{X}_t'\hat{\beta}_{2sls}$ is the estimated residual from the second stage regression of $Y_t$ on $\hat{X}_t$. Does $J\cdot F \xrightarrow{d} \chi^2_J$ under the null hypothesis $H_0: R\beta^o = r$? If yes, give your reasoning. If not, provide a modification so that the modified test statistic converges to $\chi^2_J$ under $H_0$.

7.8. Let
$$\hat{V} = \frac{1}{n}\sum_{t=1}^n Z_tZ_t'\hat{e}_t^2,$$
where $\hat{e}_t = Y_t - X_t'\hat{\beta}_{2sls}$. Show $\hat{V} \xrightarrow{p} V$ under Assumptions 7.1–7.8.

7.9. Suppose the following assumptions hold:

Assumption 3.1 [Linearity]: $\{(Y_t, X_t')'\}_{t=1}^n$ is a stationary ergodic process with
$$Y_t = X_t'\beta^o + \varepsilon_t, \quad t = 1, \ldots, n,$$
for some unknown parameter $\beta^o$ and some unobservable disturbance $\varepsilon_t$;

Assumption 3.2 [Nonsingularity]: The $K \times K$ matrix
$$Q_{xx} = E(X_tX_t')$$
is nonsingular and finite;

Assumption 3.3 [Orthogonality]:
(i) $E(X_t\varepsilon_t) = 0$;
(ii) $E(Z_t\varepsilon_t) = 0$, where $Z_t$ is an $l \times 1$ random vector with $l \ge K$;
(iii) the $l \times l$ matrix
$$Q_{zz} = E(Z_tZ_t')$$
is finite and nonsingular, and the $l \times K$ matrix
$$Q_{zx} = E(Z_tX_t')$$
is finite and of full rank;

Assumption 3.4: $\{(X_t', Z_t')'\varepsilon_t\}$ is a martingale difference sequence;

Assumption 3.5: $E(\varepsilon_t^2|X_t, Z_t) = \sigma^2$ a.s.

Under these assumptions, both the OLS estimator
$$\hat{\beta} = (X'X)^{-1}X'Y$$
and the 2SLS estimator
$$\hat{\beta}_{2sls} = [(X'Z)(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'Y$$
are consistent for $\beta^o$.
(a) Show that $\hat{\beta}$ is a special 2SLS estimator $\hat{\beta}_{2sls}$ with some proper choice of instrumental vector $Z_t$.
(b) Which estimator, $\hat{\beta}$ or $\hat{\beta}_{2sls}$, is asymptotically more efficient? [Hint: if $\sqrt{n}(\hat{\beta}_1 - \beta^o) \xrightarrow{d} N(0, \Omega_1)$ and $\sqrt{n}(\hat{\beta}_2 - \beta^o) \xrightarrow{d} N(0, \Omega_2)$, then $\hat{\beta}_1$ is asymptotically more efficient than $\hat{\beta}_2$ if and only if $\Omega_2 - \Omega_1$ or $\Omega_1^{-1} - \Omega_2^{-1}$ is positive semi-definite.]

7.10. Consider the linear regression model
$$Y_t = X_t'\beta^o + \varepsilon_t,$$
where $E(X_t\varepsilon_t) \neq 0$. Our purpose is to find a consistent estimation procedure for $\beta^o$.
First, consider the artificial regression
$$X_t = \gamma'Z_t + v_t,$$
where $X_t$ is the regressor vector, $Z_t$ is the instrumental vector, $\gamma = [E(Z_tZ_t')]^{-1}E(Z_tX_t')$ is the best linear LS approximation coefficient, and $v_t$ is the $K \times 1$ regression error.
Now, suppose instead of decomposing $X_t$, we decompose the regression error $\varepsilon_t$ as follows:
$$\varepsilon_t = v_t'\alpha^o + u_t,$$
where $\alpha^o = [E(v_tv_t')]^{-1}E(v_t\varepsilon_t)$ is the best linear LS approximation coefficient.
Now, assuming that $v_t$ is observable, we consider the augmented linear regression model
$$Y_t = X_t'\beta^o + v_t'\alpha^o + u_t.$$
Show $E[(X_t', v_t')'u_t] = 0$. One important implication of this orthogonality condition is that if $v_t$ were observable, then the OLS estimator of regressing $Y_t$ on $X_t$ and $v_t$ would be consistent for $(\beta^{o\prime}, \alpha^{o\prime})'$.

7.11. In practice, $v_t$ is unobservable. However, it can be estimated by the estimated residual
$$\hat{v}_t = X_t - \hat{\gamma}'Z_t = X_t - \hat{X}_t.$$
We now consider the following feasible augmented linear regression model
$$Y_t = X_t'\beta^o + \hat{v}_t'\alpha + \tilde{u}_t,$$
and we denote the resulting OLS estimator as $\hat{\theta} = (\hat{\beta}', \hat{\alpha}')'$, where $\hat{\beta}$ is the OLS estimator for $\beta^o$ and $\hat{\alpha}$ is the OLS estimator for $\alpha$.
Show $\hat{\beta} = \hat{\beta}_{2sls}$. [Hint: The following decomposition may be useful. Suppose
$$A = \begin{bmatrix} B & C' \\ C & D \end{bmatrix}$$
is a nonsingular square matrix, where $B$ is $k_1 \times k_1$, $C$ is $k_2 \times k_1$ and $D$ is $k_2 \times k_2$. Then
$$A^{-1} = \begin{bmatrix} B^{-1}(I + C'E^{-1}CB^{-1}) & -B^{-1}C'E^{-1} \\ -E^{-1}CB^{-1} & E^{-1} \end{bmatrix},$$
where $E = D - CB^{-1}C'$.]

7.12. Suppose $\hat{Y}$ is the $n \times 1$ vector of fitted values from regressing $Y_t$ on $Z_t$, and $\hat{X}$ is the $n \times K$ matrix of fitted values from regressing $X_t$ on $Z_t$. Show that $\hat{\beta}_{2sls}$ is equal to the OLS estimator of regressing $\hat{Y}$ on $\hat{X}$.

7.13. [Hausman's Test] Suppose Assumptions 3.1, 3.2, 3.3(ii, iii), 3.4 and 3.5 in Problem 7.9 hold. A test for the null hypothesis $H_0: E(X_t\varepsilon_t) = 0$ can be constructed by comparing $\hat{\beta}$ and $\hat{\beta}_{2sls}$, because they will converge in probability to the same limit $\beta^o$ under $H_0$ and to different limits under the alternatives to $H_0$. Assume $H_0$ holds.
(a) Show that
$$\sqrt{n}(\hat{\beta} - \beta^o) - Q_{xx}^{-1}\frac{1}{\sqrt{n}}\sum_{t=1}^n X_t\varepsilon_t \xrightarrow{p} 0,$$
or equivalently
$$\sqrt{n}(\hat{\beta} - \beta^o) = Q_{xx}^{-1}\frac{1}{\sqrt{n}}\sum_{t=1}^n X_t\varepsilon_t + o_P(1),$$
where $Q_{xx} = E(X_tX_t')$. [Hint: If $\hat{A} \xrightarrow{p} A$ and $\hat{B} = O_P(1)$, then $\hat{A}\hat{B} - A\hat{B} \xrightarrow{p} 0$, or $\hat{A}\hat{B} = A\hat{B} + o_P(1)$.]
(b) Show that
$$\sqrt{n}(\hat{\beta}_{2sls} - \beta^o) = Q_{\tilde{x}\tilde{x}}^{-1}\frac{1}{\sqrt{n}}\sum_{t=1}^n \tilde{X}_t\varepsilon_t + o_P(1),$$
where $Q_{\tilde{x}\tilde{x}} = E(\tilde{X}_t\tilde{X}_t')$, $\tilde{X}_t = \gamma'Z_t$ and $\gamma = [E(Z_tZ_t')]^{-1}E(Z_tX_t')$.
(c) Show that
$$\sqrt{n}(\hat{\beta}_{2sls} - \hat{\beta}) = \frac{1}{\sqrt{n}}\sum_{t=1}^n \left\{Q_{\tilde{x}\tilde{x}}^{-1}\tilde{X}_t - Q_{xx}^{-1}X_t\right\}\varepsilon_t + o_P(1).$$
(d) The asymptotic distribution of $\sqrt{n}(\hat{\beta}_{2sls} - \hat{\beta})$ is determined by the leading term only in part (c). Find its asymptotic distribution.
(e) Construct an asymptotically $\chi^2$ test statistic. What is the degree of freedom of the asymptotic $\chi^2$ distribution? Assume that $Q_{xx} - Q_{\tilde{x}\tilde{x}}$ is strictly positive definite.

7.14. Suppose Assumptions 3.1, 3.2, 3.3(ii, iii) and 3.4 in Problem 7.9 hold, $E(X_{jt}^4) < \infty$ for $1 \le j \le K$, $E(Z_{jt}^4) < \infty$ for $1 \le j \le l$, and $E(\varepsilon_t^4) < \infty$. Construct a Hausman's test statistic for $H_0: E(\varepsilon_t|X_t) = 0$ and derive its asymptotic distribution under $H_0$.
CHAPTER 8 GENERALIZED METHOD OF
MOMENTS ESTIMATION
Abstract: Many economic theories and hypotheses have implications on and only on a moment condition or a set of moment conditions. A popular method to estimate model parameters contained in the moment condition is the Generalized Method of Moments (GMM). In this chapter, we first provide some economic examples for the moment condition and define the GMM estimator. We then establish the consistency and asymptotic normality of the GMM estimator. Since the asymptotic variance of a GMM estimator depends on the choice of a weighting matrix, we introduce an asymptotically optimal two-stage GMM estimator with a suitable choice of the weighting matrix. With the construction of a consistent asymptotic variance estimator, we then propose an asymptotically $\chi^2$ Wald test statistic for the hypothesis of interest, and a model specification test for the moment condition.

Key words: CAPM, GMM, IV estimation, Model specification test, Moment condition, Moment matching, Optimal estimation, Overidentification, Rational expectations.

8.1 Introduction to the Method of Moments Estimation (MME)

To motivate the generalized method of moments (GMM) estimation, we first consider a traditional method in statistics called the method of moments estimation (MME).

MME Procedure: Suppose $f(y, \theta^o)$ is the probability density function (pdf) or the probability mass function (pmf) of a univariate random variable $Y_t$.

Question: How to estimate the unknown parameter $\theta^o$ using a realization of the random sample $\{Y_t\}_{t=1}^n$?

The basic idea of MME is to match the sample moments with the population moments implied by the probability distributional model. Specifically, MME can be implemented as follows:
o o
Step 1: Compute the population moments $\mu_k(\theta^o) \equiv E(Y_t^k)$ under the model density $f(y, \theta^o)$.

For example, for $k = 1, 2$, we have
$$E(Y_t) = \int_{-\infty}^{\infty} y f(y, \theta^o)\,dy = \mu_1(\theta^o),$$
$$E(Y_t^2) = \int_{-\infty}^{\infty} y^2 f(y, \theta^o)\,dy = \sigma^2(\theta^o) + \mu_1^2(\theta^o),$$
where $\sigma^2(\theta^o)$ is the variance of $Y_t$.
Step 2: Compute the sample moments from the random sample $\mathbf{Y}^n = (Y_1, \ldots, Y_n)'$.

For example, for $k = 1, 2$, we have
$$\hat{m}_1 = \bar{Y}_n \xrightarrow{p} \mu_1(\theta^o),$$
$$\hat{m}_2 = n^{-1}\sum_{t=1}^n Y_t^2 \xrightarrow{p} E(Y_t^2) = \sigma^2(\theta^o) + \mu_1^2(\theta^o),$$
where $\sigma^2(\theta^o) = \mu_2(\theta^o) - \mu_1^2(\theta^o)$, and the weak convergence follows from the WLLN.

Step 3: Match the sample moments with the corresponding population moments evaluated at some parameter value $\hat{\theta}$.

For example, for $k = 1, 2$, we set
$$\hat{m}_1 = \mu_1(\hat{\theta}), \qquad \hat{m}_2 = \sigma^2(\hat{\theta}) + \mu_1^2(\hat{\theta}).$$

Step 4: Solve the system of equations. The solution $\hat{\theta}$ is called the method of moments estimator for $\theta^o$.

Remarks: In general, if $\theta$ is a $K \times 1$ parameter vector, we need $K$ moment-matching equations.

Question: Is MME consistent for $\theta^o$?

Answer: Because $\mu_k(\hat{\theta}) = \hat{m}_k \xrightarrow{p} \mu_k(\theta^o)$ by the WLLN, we expect that $\hat{\theta} \xrightarrow{p} \theta^o$ as $n \to \infty$.

We now illustrate MME by two simple examples.

Example 1: Suppose the random sample $\{Y_t\}_{t=1}^n \sim$ i.i.d. EXP($\lambda$). Find an estimator for $\lambda$ using the method of moments estimation.

Solution: In this application, $\theta = \lambda$. Because the exponential pdf is
$$f(y, \lambda) = \lambda e^{-\lambda y} \quad \text{for } y > 0,$$
it can be shown that
$$\mu_1(\lambda) = E(Y_t) = \int_0^{\infty} y f(y, \lambda)\,dy = \int_0^{\infty} y\lambda e^{-\lambda y}\,dy = \frac{1}{\lambda}.$$
On the other hand, the first sample moment is the sample mean:
$$\hat{m}_1 = \bar{Y}_n.$$
Matching the sample mean with the population mean evaluated at $\hat{\lambda}$,
$$\hat{m}_1 = \mu_1(\hat{\lambda}) = \frac{1}{\hat{\lambda}},$$
we obtain the method of moments estimator
$$\hat{\lambda} = \frac{1}{\hat{m}_1} = \frac{1}{\bar{Y}_n}.$$
o
Example 2: Suppose the random sample $\{Y_t\}_{t=1}^n \sim$ i.i.d. $N(\mu, \sigma^2)$. Find the MME for $\theta^o = (\mu, \sigma^2)'$.

Solution: The first two population moments are
$$E(Y_t) = \mu, \qquad E(Y_t^2) = \mu^2 + \sigma^2.$$
The first two sample moments are
$$\hat{m}_1 = \bar{Y}_n, \qquad \hat{m}_2 = \frac{1}{n}\sum_{t=1}^n Y_t^2.$$
Matching the first two moments, we have
$$\bar{Y}_n = \hat{\mu}, \qquad \frac{1}{n}\sum_{t=1}^n Y_t^2 = \hat{\mu}^2 + \hat{\sigma}^2.$$
It follows that the MME is
$$\hat{\mu} = \bar{Y}_n, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{t=1}^n Y_t^2 - \bar{Y}_n^2 = \frac{1}{n}\sum_{t=1}^n (Y_t - \bar{Y}_n)^2.$$
It is well known that $\hat{\mu} \xrightarrow{p} \mu$ and $\hat{\sigma}^2 \xrightarrow{p} \sigma^2$ as $n \to \infty$.
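A minimal numerical sketch of these two examples (simulated data with hypothetical parameter values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Example 1: exponential with rate lambda = 2; the MME is 1 / sample mean
y_exp = rng.exponential(scale=1 / 2, size=10_000)
lam_hat = 1.0 / y_exp.mean()

# Example 2: normal; the MME matches the first two moments
y_norm = rng.normal(loc=1.0, scale=3.0, size=10_000)
mu_hat = y_norm.mean()
sigma2_hat = (y_norm ** 2).mean() - mu_hat ** 2      # = (1/n) sum (Y_t - Ybar)^2
```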

8.2 Generalized Method of Moments (GMM) Estimation


Suppose $\theta$ is a $K \times 1$ unknown parameter vector, and there exists an $l \times 1$ moment function $m_t(\theta)$ such that
$$E[m_t(\theta^o)] = 0,$$
where the subindex $t$ denotes that $m_t(\theta)$ is a function of both $\theta$ and some random variables indexed by $t$. For example, we may have
$$m_t(\theta) = X_t(Y_t - X_t'\theta)$$
in the OLS estimation, or
$$m_t(\theta) = Z_t(Y_t - X_t'\theta)$$
in the 2SLS estimation, or more generally in the instrumental variable (IV) estimation, where $Z_t$ is an $l \times 1$ instrument vector.

If $l = K$, that is, if the number of moment conditions is the same as the number of unknown parameters, the model $E[m_t(\theta^o)] = 0$ is said to be exactly identified. If $l > K$, that is, if the number of moment conditions is greater than the number of unknown parameters, the model is said to be overidentified.

The moment condition $E[m_t(\theta^o)] = 0$ may follow from economic and financial theory (e.g., rational expectations and correct asset pricing). We now illustrate this by the following example.

Example 1 [Capital Asset Pricing Model (CAPM)]: Define $Y_t$ as an $L \times 1$ vector of excess returns for $L$ assets (or portfolios of assets) in period $t$. For these $L$ assets, the excess returns can be described using the excess-return market model:
$$Y_t = \alpha^o + \beta^o R_{mt} + \varepsilon_t = \theta^{o\prime}X_t + \varepsilon_t,$$
where $X_t = (1, R_{mt})'$ is a bivariate vector, $R_{mt}$ is the excess market portfolio return, $\theta^o$ is a $2 \times L$ parameter matrix, and $\varepsilon_t$ is an $L \times 1$ disturbance with $E(\varepsilon_t|X_t) = 0$.
Define the $l \times 1$ moment function
$$m_t(\theta) = X_t \otimes (Y_t - \theta'X_t),$$
where $l = 2L$ and $\otimes$ denotes the Kronecker product. When the CAPM holds, we have
$$E[m_t(\theta^o)] = 0.$$
These $l \times 1$ moment conditions form a basis to estimate and test the CAPM.
In fact, for any measurable function $h: \mathbb{R}^2 \to \mathbb{R}^l$, the CAPM implies
$$E[h(X_t)\otimes(Y_t - \theta^{o\prime}X_t)] = 0.$$
This can also be used to estimate the CAPM model.

Question: How to choose the instruments $h(X_t)$?

Example 2 [Hansen and Singleton (1982, Econometrica), Dynamic Capital Asset Pricing Model]:

Suppose a representative economic agent has a constant relative risk aversion utility over his lifetime,
$$U = \sum_{t=0}^n \beta^t u(C_t) = \sum_{t=0}^n \beta^t \frac{C_t^{\gamma} - 1}{\gamma},$$
where $u(\cdot)$ is the time-invariant utility function of the economic agent in each time period (here we assume $u(c) = (c^{\gamma} - 1)/\gamma$), $\beta$ is the agent's time discount factor, $\gamma$ is the economic agent's risk aversion parameter, and $C_t$ is the consumption during period $t$. Let the information available to the agent at time $t-1$ be represented by the sigma-algebra $I_{t-1}$, in the sense that any variable whose value is known at time $t-1$ is presumed to be $I_{t-1}$-measurable, and let
$$R_t = \frac{P_t}{P_{t-1}} = 1 + \frac{P_t - P_{t-1}}{P_{t-1}}$$
be the gross return to an asset acquired at time $t-1$ at the price $P_{t-1}$ (we assume no dividend on the asset). The agent's optimization problem is to
$$\max_{\{C_t\}} E(U)$$
subject to the intertemporal budget constraint
$$C_t + P_tq_t = Y_t + P_tq_{t-1},$$
where $q_t$ is the quantity of the asset purchased at time $t$ and $Y_t$ is the agent's labor income during period $t$. Define the marginal rate of intertemporal substitution
$$MRS_t(\theta) = \frac{\partial u(C_t)/\partial C_t}{\partial u(C_{t-1})/\partial C_{t-1}} = \left(\frac{C_t}{C_{t-1}}\right)^{\gamma-1}.$$
The first order conditions of the agent's optimization problem are characterized by the Euler equation:
$$E[\beta^o MRS_t(\theta^o)R_t \mid I_{t-1}] = 1 \quad \text{for some } \theta^o = (\beta^o, \gamma^o)'.$$
That is, the marginal rate of intertemporal substitution discounts gross returns to unity.

Remark: Any dynamic asset pricing model is equivalent to a specification of $MRS_t$.

We may write the Euler equation as follows:
$$E[\{\beta^o MRS_t(\theta^o)R_t - 1\}\mid I_{t-1}] = 0.$$
Thus, one may view $\{\beta MRS_t(\theta)R_t - 1\}$ as a generalized model residual which has the MDS property when evaluated at the true structural parameters $\theta^o = (\beta^o, \gamma^o)'$.

Question: How to estimate the unknown parameter $\theta^o$ in an asset pricing model?

More generally, how to estimate $\theta^o$ from any linear or nonlinear econometric model which can be formulated as a set of moment conditions? Note that the joint distribution of the random sample is not given or implied by economic theory; only a set of conditional moments is given.

From the Euler equation, we can deduce the following moment restrictions:
$$E[\beta^o MRS_t(\theta^o)R_t - 1] = 0,$$
$$E\left[\frac{C_{t-1}}{C_{t-2}}\left(\beta^o MRS_t(\theta^o)R_t - 1\right)\right] = 0,$$
$$E[R_{t-1}(\beta^o MRS_t(\theta^o)R_t - 1)] = 0.$$
Therefore, we can consider the $3 \times 1$ sample moments
$$\hat{m}(\theta) = \frac{1}{n}\sum_{t=1}^n m_t(\theta),$$
where
$$m_t(\theta) = [\beta MRS_t(\theta)R_t - 1]\left(1, \frac{C_{t-1}}{C_{t-2}}, R_{t-1}\right)'$$
can serve as the basis for estimation. The elements of the vector
$$Z_t \equiv \left(1, \frac{C_{t-1}}{C_{t-2}}, R_{t-1}\right)'$$
are called instrumental variables; they are a subset of the information set $I_{t-1}$.
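As a minimal sketch (hypothetical arrays C and R holding consumption levels and gross returns; the function simply evaluates the sample analogue of the three moment conditions above):

```python
import numpy as np

def euler_moments(theta, C, R):
    """Sample moments m_hat(theta) for the consumption-based Euler equation (sketch).

    theta = (beta, gamma); C: (T,) consumption levels; R: (T,) gross asset returns.
    Uses the instruments Z_t = (1, C_{t-1}/C_{t-2}, R_{t-1})'.
    """
    beta, gamma = theta
    mrs = (C[2:] / C[1:-1]) ** (gamma - 1.0)       # MRS_t for t = 2, ..., T-1
    resid = beta * mrs * R[2:] - 1.0               # generalized residual
    Z = np.column_stack([np.ones(len(resid)), C[1:-1] / C[:-2], R[1:-1]])
    return (Z * resid[:, None]).mean(axis=0)       # 3 x 1 sample moment vector
```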

We now define the GMM estimator.

Definition 8.1 [GMM Estimator]: The generalized method of moments (GMM) estimator is
$$\hat{\theta} = \arg\min_{\theta\in\Theta} \hat{m}(\theta)'\hat{W}^{-1}\hat{m}(\theta),$$
where
$$\hat{m}(\theta) = n^{-1}\sum_{t=1}^n m_t(\theta)$$
is an $l \times 1$ sample moment vector, $\hat{W}$ is an $l \times l$ symmetric nonsingular matrix which is possibly data-dependent, $\theta$ is a $K \times 1$ unknown parameter vector, and $\Theta$ is a $K$-dimensional parameter space. Here, we assume $l \ge K$; i.e., the number of moment conditions may be larger than or at least equal to the number of parameters.

Question: Why do we require $l \ge K$ in GMM estimation?

Question: Why is the GMM estimator $\hat{\theta}$ not defined by setting the $l \times 1$ sample moments to zero jointly, namely $\hat{m}(\hat{\theta}) = 0$?

Remarks: When $l > K$, i.e., when the number of equations is larger than the number of unknown parameters, we generally cannot find a $\hat{\theta}$ such that $\hat{m}(\hat{\theta}) = 0$. However, we can find a $\hat{\theta}$ which makes $\hat{m}(\hat{\theta})$ as close to an $l \times 1$ zero vector as possible by minimizing the quadratic form
$$\hat{m}(\theta)'\hat{m}(\theta) = \sum_{i=1}^l \hat{m}_i^2(\theta),$$
where $\hat{m}_i(\theta) = n^{-1}\sum_{t=1}^n m_{it}(\theta)$, $i = 1, \ldots, l$. Since each sample moment component $\hat{m}_i(\theta)$ has a different variance, and $\hat{m}_i(\theta)$ and $\hat{m}_j(\theta)$ may be correlated, we can introduce a weighting matrix $\hat{W}$ and choose $\hat{\theta}$ to minimize a weighted quadratic form in $\hat{m}(\theta)$, namely
$$\hat{m}(\theta)'\hat{W}^{-1}\hat{m}(\theta).$$
Question: What is the role of $\hat{W}$?

When $\hat{W} = I$, an identity matrix, each of the $l$ sample moment components is weighted equally. If $\hat{W} \neq I$, then the $l$ sample moment components are weighted differently. A suitable choice of weighting matrix $\hat{W}$ can improve the efficiency of the resulting estimator. Here, a natural question is: what is the optimal choice of $\hat{W}$?

Intuitively, the sample moment components which have large sampling variations should be discounted. This is an idea similar to GLS, which discounts noisy observations by dividing by the conditional standard deviation of the disturbance term and differencing out serial correlations.
Special Case: Linear IV Estimation


Question: Does the GMM estimator have a closed form expression?

In general, when the moment function $m_t(\theta)$ is nonlinear in the parameter $\theta$, there is no closed form solution for $\hat{\theta}$. However, there is an important special case where the GMM estimator $\hat{\theta}$ has a closed form. This is the case of so-called linear IV estimation, where we have
$$m_t(\theta) = Z_t(Y_t - X_t'\theta)$$
and
$$E[Z_t(Y_t - X_t'\theta^o)] = 0 \quad \text{for some } \theta^o,$$
where $Y_t$ is a scalar, $X_t$ is a $K \times 1$ vector, and $Z_t$ is an $l \times 1$ vector, with $l \ge K$.

In this case, the GMM estimator, or more precisely, the linear IV estimator $\hat{\theta}$, solves the following minimization problem:
$$\min_{\theta\in\mathbb{R}^K} \hat{m}(\theta)'\hat{W}^{-1}\hat{m}(\theta) = n^{-2}\min_{\theta\in\mathbb{R}^K} (Y - X\theta)'Z\hat{W}^{-1}Z'(Y - X\theta),$$
where
$$\hat{m}(\theta) = \frac{Z'(Y - X\theta)}{n} = \frac{1}{n}\sum_{t=1}^n Z_t(Y_t - X_t'\theta).$$
The FOC is given by
$$\frac{\partial}{\partial\theta}\left[(Y - X\theta)'Z\hat{W}^{-1}Z'(Y - X\theta)\right]\Big|_{\theta=\hat{\theta}} = -2X'Z\hat{W}^{-1}Z'(Y - X\hat{\theta}) = 0.$$
It follows that
$$X'Z\hat{W}^{-1}Z'X\hat{\theta} = X'Z\hat{W}^{-1}Z'Y.$$
When the $K \times l$ matrix $Q_{xz} = E(X_tZ_t')$ is of full rank $K$, the $K \times K$ matrix $Q_{xz}W^{-1}Q_{zx}$ is nonsingular. Therefore, $X'Z\hat{W}^{-1}Z'X$ is nonsingular at least for large samples, and consequently the GMM estimator $\hat{\theta}$ has the closed form expression
$$\hat{\theta} = (X'Z\hat{W}^{-1}Z'X)^{-1}X'Z\hat{W}^{-1}Z'Y.$$
This is called a linear IV estimator because it estimates the parameter $\theta^o$ in the linear model $Y_t = X_t'\theta^o + \varepsilon_t$ with $E(\varepsilon_t|Z_t) = 0$.
Interestingly, the 2SLS estimator $\hat{\beta}_{2sls}$ considered in Chapter 7 is a special case of the IV estimator obtained by choosing
$$\hat{W} = Z'Z,$$
or more generally, by choosing $\hat{W} = c(Z'Z)$ for any constant $c \neq 0$.
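A minimal numerical sketch of this closed form (Y, X, Z, W are assumed to be conformable numpy arrays):

```python
import numpy as np

def linear_iv_gmm(Y, X, Z, W):
    """Closed-form linear IV/GMM estimator (X'Z W^-1 Z'X)^-1 X'Z W^-1 Z'Y (sketch)."""
    XZ = X.T @ Z
    A = XZ @ np.linalg.solve(W, Z.T @ X)     # X'Z W^-1 Z'X
    b = XZ @ np.linalg.solve(W, Z.T @ Y)     # X'Z W^-1 Z'Y
    return np.linalg.solve(A, b)

# 2SLS arises from the weighting matrix W = Z'Z:
# beta_2sls = linear_iv_gmm(Y, X, Z, Z.T @ Z)
```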

Question: Is the choice of $\hat{W} = Z'Z$ optimal? In other words, is the 2SLS estimator $\hat{\beta}_{2sls}$ asymptotically efficient in estimating $\theta^o$?

When $l = K$ and $Q_{xz} = E(X_tZ_t')$ is nonsingular, the $K \times K$ matrix $X'Z$ is nonsingular at least for large samples. Consequently,
$$\hat{\theta} = (Z'X)^{-1}Z'Y.$$

Theorem 8.1: Suppose $m_t(\theta) = Z_t(Y_t - X_t'\theta)$, where $Y_t$ is a scalar, $Z_t$ is an $l \times 1$ vector, and $X_t$ is a $K \times 1$ vector, with $l \ge K$. Also, the $K \times l$ matrix $X'Z$ is of full rank $K$ and the $l \times l$ weighting matrix $\hat{W}$ is nonsingular. Then the resulting GMM estimator $\hat{\theta}$ is called a linear IV estimator and has the closed form expression
$$\hat{\theta} = (X'Z\hat{W}^{-1}Z'X)^{-1}X'Z\hat{W}^{-1}Z'Y.$$
When $l = K$ and $Q_{xz}$ is nonsingular,
$$\hat{\theta} = (Z'X)^{-1}Z'Y.$$
Note that the IV estimator $\hat{\theta}$ generally depends on the choice of instruments $Z_t$ and weighting matrix $\hat{W}$. However, when $l = K$, the exact identification case, the IV estimator $\hat{\theta}$ does not depend on the choice of $\hat{W}$. This is because in this case the FOC $X'Z\hat{W}^{-1}Z'(Y - X\hat{\theta}) = 0$ becomes
$$Z'(Y - X\hat{\theta}) = 0,$$
$$(K\times n)(n\times 1) = K\times 1,$$
given that $X'Z$ and $\hat{W}$ are nonsingular at least for large samples. Obviously, the OLS estimator $\hat{\theta} = (X'X)^{-1}X'Y$ is a special case of the linear IV estimator obtained by choosing $Z_t = X_t$.

8.3 Consistency of GMM


Question: What are the statistical properties of the GMM estimator $\hat{\theta}$?

To investigate the asymptotic properties of the GMM estimator $\hat{\theta}$, we first provide a set of regularity conditions.

Assumption 8.1 [Compactness]: The parameter space $\Theta$ is compact (closed and bounded).

Assumption 8.2 [Uniform Convergence]: (i) The moment function $m_t(\theta)$ is a measurable function of a random vector indexed by $t$ for each $\theta \in \Theta$, and given each $t$, $m_t(\theta)$ is continuous in $\theta \in \Theta$; (ii) $\{m_t(\theta)\}$ is a stationary ergodic process; (iii) $\hat{m}(\theta)$ converges uniformly over $\Theta$ to $m(\theta) \equiv E[m_t(\theta)]$ in probability, in the sense that
$$\sup_{\theta\in\Theta} \|\hat{m}(\theta) - m(\theta)\| \xrightarrow{p} 0,$$
where $\|\cdot\|$ is the Euclidean norm; (iv) $m(\theta)$ is continuous in $\theta \in \Theta$.

Assumption 8.3 [Identification]: There exists a unique parameter $\theta^o$ in $\Theta$ such that $m(\theta^o) = 0$.

Assumption 8.4 [Weighting Matrix]: $\hat{W} \xrightarrow{p} W$, where $W$ is a nonstochastic $l \times l$ symmetric, finite and nonsingular matrix.

Remarks:
Assumption 8.3 is an identification condition. If the moment condition $m(\theta^o) = 0$ is implied by economic theory, $\theta^o$ can be viewed as the true model parameter value. Assumptions 8.1 and 8.3 imply that the true model parameter $\theta^o$ lies inside the compact parameter space $\Theta$. Compactness is sometimes restrictive, but it greatly simplifies our asymptotic analysis and is sometimes necessary (as in the case of estimating GARCH models), where some parameters must be restricted to ensure a positive conditional variance estimator.

In many applications, the moment function $m_t(\theta)$ usually has the form
$$m_t(\theta) = h_t\varepsilon_t(\theta)$$
for some weighting function $h_t$ and some error or generalized error term $\varepsilon_t(\theta)$. Assumption 8.2 allows but does not require such a multiplicative form for $m_t(\theta)$. Also, in Assumption 8.2, we impose a uniform WLLN for $\hat{m}(\theta)$ over $\Theta$. Intuitively, uniform convergence implies that the largest (or worst) deviation between $\hat{m}(\theta)$ and $m(\theta)$ over $\Theta$ vanishes to 0 in probability as $n \to \infty$.

Question: How to ensure uniform convergence in probability?

This can be achieved by a suitable uniform weak law of large numbers (UWLLN). For example, when $\{(Y_t, X_t')'\}_{t=1}^n$ is i.i.d., we have the following:

Lemma 8.2 [Uniform Strong Law of Large Numbers for IID Processes (USLLN)]: Let $\{Z_t, t = 1, 2, \ldots\}$ be an i.i.d. sequence of random $d \times 1$ vectors with common cumulative distribution function $F$. Let $\Theta$ be a compact subset of $\mathbb{R}^K$, and let $q: \mathbb{R}^d \times \Theta \to \mathbb{R}$ be a function such that $q(\cdot, \theta)$ is measurable for each $\theta \in \Theta$ and $q(z, \cdot)$ is continuous on $\Theta$ for each $z \in \mathbb{R}^d$. Suppose there exists a measurable function $D: \mathbb{R}^d \to \mathbb{R}^+$ such that $|q(z, \theta)| \le D(z)$ for all $\theta \in \Theta$ and $z \in S$, where $S$ is the support of $Z_t$ and $E[D(Z_t)] < \infty$. Then
(i) $Q(\theta) = E[q(Z_t, \theta)]$ is continuous on $\Theta$;
(ii) $\sup_{\theta\in\Theta}|\hat{Q}(\theta) - Q(\theta)| \to 0$ a.s. as $n \to \infty$, where $\hat{Q}(\theta) = n^{-1}\sum_{t=1}^n q(Z_t, \theta)$.

Proof: See Jennrich (1969, Theorem 2).

A USLLN for stationary ergodic processes is the following:

Lemma 8.3 [Uniform Strong Law of Large Numbers for Stationary Ergodic Processes (Ranga Rao 1962)]: Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $T: \Omega \to \Omega$ be a one-to-one measure-preserving transformation. Let $\Theta$ be a compact subset of $\mathbb{R}^K$, and let $q: \Omega \times \Theta \to \mathbb{R}$ be a function such that $q(\cdot, \theta)$ is measurable for each $\theta \in \Theta$ and $q(\omega, \cdot)$ is continuous on $\Theta$ for each $\omega \in \Omega$. Suppose there exists a measurable function $D: \Omega \to \mathbb{R}^+$ such that $|q(\omega, \theta)| \le D(\omega)$ for all $\theta \in \Theta$ and $\omega \in \Omega$, and $E(D) = \int D\,dP < \infty$. If for each $\theta \in \Theta$, $q_t(\theta) = q(T^t\omega, \theta)$ is ergodic, then
(i) $Q(\theta) = E[q_t(\theta)]$ is continuous on $\Theta$;
(ii) $\sup_{\theta\in\Theta}|\hat{Q}(\theta) - Q(\theta)| \to 0$ a.s. as $n \to \infty$, where $\hat{Q}(\theta) = n^{-1}\sum_{t=1}^n q_t(\theta)$.

Proof: See Ranga Rao (1962).

Remark: Uniform almost sure convergence implies uniform convergence in probability.

We first state the consistency result for the GMM estimator $\hat{\theta}$.

Theorem 8.4 [Consistency of the GMM Estimator]: Suppose Assumptions 8.1–8.4 hold. Then $\hat{\theta} \xrightarrow{p} \theta^o$.

To show this consistency theorem, we need the following extremum estimator lemma.

Lemma 8.5 [White, 1994, Consistency of Extremum Estimators]: Let $\hat{Q}(\theta)$ be a stochastic real-valued function of $\theta \in \Theta$, and let $Q(\theta)$ be a nonstochastic real-valued continuous function of $\theta$, where $\Theta$ is a compact parameter space. Suppose that for each $\theta$, $\hat{Q}(\theta)$ is a measurable function of the random sample of size $n$, and for each $n$, $\hat{Q}(\theta)$ is continuous in $\theta \in \Theta$ with probability one. Also suppose $\hat{Q}(\theta) - Q(\theta) \xrightarrow{p} 0$ uniformly in $\theta \in \Theta$. Let $\hat{\theta} = \arg\max_{\theta\in\Theta}\hat{Q}(\theta)$, and let $\theta^o = \arg\max_{\theta\in\Theta}Q(\theta)$ be the unique maximizer. Then $\hat{\theta} - \theta^o \xrightarrow{p} 0$.

Remark: This lemma continues to hold if we change all convergences in probability to almost sure convergences.

We now show the consistency of the GMM estimator $\hat{\theta}$ by applying the above lemma.

Proof: Put
$$\hat{Q}(\theta) = -\hat{m}(\theta)'\hat{W}^{-1}\hat{m}(\theta)$$
and
$$Q(\theta) = -m(\theta)'W^{-1}m(\theta).$$
Then
$$\begin{aligned}
\left|\hat{Q}(\theta) - Q(\theta)\right|
&= \left|\hat{m}(\theta)'\hat{W}^{-1}\hat{m}(\theta) - m(\theta)'W^{-1}m(\theta)\right| \\
&= \left|[\hat{m}(\theta) - m(\theta) + m(\theta)]'\hat{W}^{-1}[\hat{m}(\theta) - m(\theta) + m(\theta)] - m(\theta)'W^{-1}m(\theta)\right| \\
&\le \left|[\hat{m}(\theta) - m(\theta)]'\hat{W}^{-1}[\hat{m}(\theta) - m(\theta)]\right| + 2\left|m(\theta)'\hat{W}^{-1}[\hat{m}(\theta) - m(\theta)]\right| + \left|m(\theta)'(\hat{W}^{-1} - W^{-1})m(\theta)\right|.
\end{aligned}$$
It follows from Assumptions 8.1, 8.2 and 8.4 that
$$\hat{Q}(\theta) \xrightarrow{p} Q(\theta)$$
uniformly over $\Theta$, and $Q(\theta) = -m(\theta)'W^{-1}m(\theta)$ is continuous in $\theta$ over $\Theta$. Moreover, Assumption 8.3 implies that $\theta^o$ is the unique maximizer of $Q(\theta)$ over $\Theta$. It follows that $\hat{\theta} \xrightarrow{p} \theta^o$ by the extremum estimator lemma. Note that the proof of the consistency theorem does not require the existence of the FOC; this is made possible by using the extremum estimator lemma. This completes the proof of consistency.

8.4 Asymptotic Normality of GMM


To derive the asymptotic distribution of the GMM estimator, we impose two additional
regularity conditions.

o
Assumption 8.5 [Interiorness]: 2 int( ):

Assumption 8.6 [CLT]:


(i) For each t; mt ( ) is continuously di¤erentiable with respect to 2 with probability
one.
(ii) As n ! 1;
p X
n
d
o
nm(
^ ) n 1=2
mt ( o ) ! N (0; Vo );
t=1
p
where Vo avar[ nm( ^ o )] is …nite and p.d.
(iii) f @m@t ( ) g obeys the uniform weak law of large numbers (UWLLN), i.e.,

Xn
@mt ( ) p
1
sup n D( ) ! 0,
2 t=1
@
where the l K matrix

@mt ( )
D( ) E
@
dm( )
=
d

is continuous in 2 and is of full rank K.

Remarks:

Question: Why do we need to assume that o is an interior point in ?


This is because we will have to use a Taylor series expansion. We need to make use of the
FOC for GMM in order to derive the asymptotic distribution of ^ :

In Assumption 8.6, we assume both CLT and UWLLN directly. These are called “high-
level assumptions." They can be ensured by imposing more primitive conditions on the data
generating processes (e.g., i.i.d. random samples or MDS random samples), and the moment and
smoothness conditions of mt ( ). Fore more discussion, see White (1994).

13
We now establish the asymptotic normality of the GMM estimator $\hat{\theta}$.

Theorem 8.6 [Asymptotic Normality]: Suppose Assumptions 8.1–8.6 hold. Then as $n \to \infty$,
$$\sqrt{n}(\hat{\theta} - \theta^o) \xrightarrow{d} N(0, \Omega),$$
where
$$\Omega = (D_o'W^{-1}D_o)^{-1}D_o'W^{-1}V_oW^{-1}D_o(D_o'W^{-1}D_o)^{-1},$$
and $D_o \equiv D(\theta^o) = dm(\theta^o)/d\theta'$.
Proof: Because $\theta^o$ is an interior point of $\Theta$ and $\hat{\theta} \xrightarrow{p} \theta^o$ as $n \to \infty$, $\hat{\theta}$ is an interior point of $\Theta$ with probability approaching one as $n \to \infty$.
For $n$ sufficiently large, the first order conditions for the minimization of $\hat{m}(\theta)'\hat{W}^{-1}\hat{m}(\theta)$ are
$$0 = \frac{d\hat{Q}(\theta)}{d\theta}\Big|_{\theta=\hat{\theta}} = -2\frac{d\hat{m}(\hat{\theta})'}{d\theta}\hat{W}^{-1}\hat{m}(\hat{\theta}),$$
so that
$$0 = \frac{d\hat{m}(\hat{\theta})'}{d\theta}\hat{W}^{-1}\sqrt{n}\,\hat{m}(\hat{\theta}),$$
$$K\times 1 = (K\times l)(l\times l)(l\times 1).$$
Note that $\hat{W}$ is not a function of $\theta$. Also, this FOC does not necessarily imply $\hat{m}(\hat{\theta}) = 0$; instead, it only says that a set (of dimension $K$, with $K \le l$) of linear combinations of the $l$ components of $\hat{m}(\hat{\theta})$ is equal to zero. Here, the $l \times K$ matrix $d\hat{m}(\theta)/d\theta'$ is the gradient of the $l \times 1$ vector $\hat{m}(\theta)$ with respect to the $K \times 1$ vector $\theta$.

Using a Taylor series expansion around the true parameter value $\theta^o$, we have
$$\sqrt{n}\,\hat{m}(\hat{\theta}) = \sqrt{n}\,\hat{m}(\theta^o) + \frac{d\hat{m}(\bar{\theta})}{d\theta'}\sqrt{n}(\hat{\theta} - \theta^o),$$
where $\bar{\theta} = \lambda\hat{\theta} + (1-\lambda)\theta^o$ lies between $\hat{\theta}$ and $\theta^o$, with $\lambda \in [0,1]$. Here, for notational simplicity, we have abused notation in the expression $d\hat{m}(\bar{\theta})/d\theta'$; precisely speaking, a different $\bar{\theta}$ is needed for each partial derivative of $\hat{m}(\theta)$ with respect to each parameter $\theta_i$, $i = 1, \ldots, K$.
The first term in the above Taylor series expansion is contributed by the sampling randomness of the sample average of the moment functions evaluated at the true parameter $\theta^o$, and the second term is contributed by the randomness of the parameter estimator $\hat{\theta} - \theta^o$.
It follows from the FOC that
$$0 = \frac{d\hat{m}(\hat{\theta})'}{d\theta}\hat{W}^{-1}\sqrt{n}\,\hat{m}(\hat{\theta}) = \frac{d\hat{m}(\hat{\theta})'}{d\theta}\hat{W}^{-1}\sqrt{n}\,\hat{m}(\theta^o) + \frac{d\hat{m}(\hat{\theta})'}{d\theta}\hat{W}^{-1}\frac{d\hat{m}(\bar{\theta})}{d\theta'}\sqrt{n}(\hat{\theta} - \theta^o).$$
Now let us show that $d\hat{m}(\hat{\theta})/d\theta' \xrightarrow{p} D_o \equiv D(\theta^o)$. To show this, consider
$$\left\|\frac{d\hat{m}(\hat{\theta})}{d\theta'} - D_o\right\| \le \left\|\frac{d\hat{m}(\hat{\theta})}{d\theta'} - D(\hat{\theta})\right\| + \left\|D(\hat{\theta}) - D(\theta^o)\right\| \le \sup_{\theta\in\Theta}\left\|\frac{d\hat{m}(\theta)}{d\theta'} - D(\theta)\right\| + \left\|D(\hat{\theta}) - D(\theta^o)\right\| \xrightarrow{p} 0$$
by the triangle inequality and Assumption 8.6 (the UWLLN, the continuity of $D(\cdot)$, and $\hat{\theta} - \theta^o \xrightarrow{p} 0$).
Similarly, because $\bar{\theta} = \lambda\hat{\theta} + (1-\lambda)\theta^o$ for $\lambda \in [0,1]$, we have
$$\|\bar{\theta} - \theta^o\| = \|\lambda(\hat{\theta} - \theta^o)\| \le \|\hat{\theta} - \theta^o\| \xrightarrow{p} 0.$$
It follows that
$$\frac{d\hat{m}(\bar{\theta})}{d\theta'} \xrightarrow{p} D_o.$$
Then the $K \times K$ matrix
$$D_o'W^{-1}D_o$$
is nonsingular by Assumptions 8.4 and 8.6. Therefore, for $n$ sufficiently large, the inverse
$$\left[\frac{d\hat{m}(\hat{\theta})'}{d\theta}\hat{W}^{-1}\frac{d\hat{m}(\bar{\theta})}{d\theta'}\right]^{-1}$$

exists and converges in probability to $(D_o'W^{-1}D_o)^{-1}$. Therefore, when $n$ is sufficiently large, we have
$$\sqrt{n}(\hat{\theta} - \theta^o) = -\left[\frac{d\hat{m}(\hat{\theta})'}{d\theta}\hat{W}^{-1}\frac{d\hat{m}(\bar{\theta})}{d\theta'}\right]^{-1}\frac{d\hat{m}(\hat{\theta})'}{d\theta}\hat{W}^{-1}\sqrt{n}\,\hat{m}(\theta^o) = -\hat{A}\sqrt{n}\,\hat{m}(\theta^o),$$
where
$$\hat{A} = \left[\frac{d\hat{m}(\hat{\theta})'}{d\theta}\hat{W}^{-1}\frac{d\hat{m}(\bar{\theta})}{d\theta'}\right]^{-1}\frac{d\hat{m}(\hat{\theta})'}{d\theta}\hat{W}^{-1}.$$
By Assumption 8.6(ii), the CLT for $\{m_t(\theta^o)\}$, we have
$$\sqrt{n}\,\hat{m}(\theta^o) \xrightarrow{d} N(0, V_o),$$
where $V_o \equiv \operatorname{avar}[n^{-1/2}\sum_{t=1}^n m_t(\theta^o)]$. Moreover,
$$\hat{A} \xrightarrow{p} (D_o'W^{-1}D_o)^{-1}D_o'W^{-1} \equiv A.$$
It follows from the Slutsky theorem that
$$\sqrt{n}(\hat{\theta} - \theta^o) \xrightarrow{d} -A\,N(0, V_o) \sim N(0, \Omega),$$
where
$$\Omega = AV_oA' = (D_o'W^{-1}D_o)^{-1}D_o'W^{-1}V_oW^{-1}D_o(D_o'W^{-1}D_o)^{-1}.$$
This completes the proof.

Remarks:
The structure of $\operatorname{avar}(\sqrt{n}\,\hat{\theta})$ is very similar to that of $\operatorname{avar}(\sqrt{n}\,\hat{\beta}_{2sls})$. In fact, as pointed out earlier, 2SLS is a special case of the GMM estimator with the choice of
$$m_t(\theta) = Z_t(Y_t - X_t'\theta), \qquad W = E(Z_tZ_t') = Q_{zz}.$$
Similarly, the OLS estimator is a special case of GMM with the choice of
$$m_t(\theta) = X_t(Y_t - X_t'\theta), \qquad W = E(X_tX_t') = Q_{xx}.$$
Most econometric estimators can be viewed as special cases of GMM, at least asymptotically. In other words, GMM provides a convenient unified framework within which to view most econometric estimators. See White (1994) for more discussion.

8.5 Asymptotic Efficiency of GMM


Question: There are many possible choices of $\hat{W}$. Is there an optimal choice for $\hat{W}$? If so, what is the optimal choice of $\hat{W}$?

The following theorem shows that the optimal choice of $W$ is given by $W = V_o \equiv \operatorname{avar}[\sqrt{n}\,\hat{m}(\theta^o)]$.

Theorem 8.7 [Asymptotic Efficiency]: Suppose Assumptions 8.4 and 8.6 hold. Define $\Omega_o = (D_o'V_o^{-1}D_o)^{-1}$, which is obtained from $\Omega$ by choosing $W = V_o \equiv \operatorname{avar}[\sqrt{n}\,\hat{m}(\theta^o)]$. Then
$$\Omega - \Omega_o \text{ is p.s.d.}$$
for any finite, symmetric and nonsingular weighting matrix $W$.

Proof: Observe that $\Omega - \Omega_o$ is p.s.d. if and only if $\Omega_o^{-1} - \Omega^{-1}$ is p.s.d. We therefore consider
$$\begin{aligned}
\Omega_o^{-1} - \Omega^{-1}
&= D_o'V_o^{-1}D_o - D_o'W^{-1}D_o(D_o'W^{-1}V_oW^{-1}D_o)^{-1}D_o'W^{-1}D_o \\
&= D_o'V_o^{-1/2}\left[I - V_o^{1/2}W^{-1}D_o(D_o'W^{-1}V_oW^{-1}D_o)^{-1}D_o'W^{-1}V_o^{1/2}\right]V_o^{-1/2}D_o \\
&= D_o'V_o^{-1/2}GV_o^{-1/2}D_o,
\end{aligned}$$
where $V_o = V_o^{1/2}V_o^{1/2}$ for some symmetric and nonsingular matrix $V_o^{1/2}$, and
$$G \equiv I - V_o^{1/2}W^{-1}D_o(D_o'W^{-1}V_oW^{-1}D_o)^{-1}D_o'W^{-1}V_o^{1/2}$$
is a symmetric idempotent matrix (i.e., $G = G'$ and $G^2 = G$). It follows that
$$\Omega_o^{-1} - \Omega^{-1} = (D_o'V_o^{-1/2}G)(GV_o^{-1/2}D_o) = (GV_o^{-1/2}D_o)'(GV_o^{-1/2}D_o) = B'B,$$
which is p.s.d. (why?), where $B = GV_o^{-1/2}D_o$ is an $l \times K$ matrix. This completes the proof.

Remarks:

The optimal choice of $W = V_o$ is not unique. The choice of $W = cV_o$ for any nonzero constant $c$ is also optimal.
In practice, the matrix $V_o$ is unavailable. However, we can use a feasible asymptotically optimal choice $\hat{W} = \tilde{V}$, a consistent estimator for $V_o \equiv \operatorname{avar}[\sqrt{n}\,\hat{m}(\theta^o)]$.

Question: What is the intuition for $\hat{W} = \tilde{V}$ being an optimal weighting matrix?

Answer: $\hat{W} \xrightarrow{p} V_o$, and $V_o$ is the variance-covariance matrix of the sample moments $\sqrt{n}\,\hat{m}(\theta^o)$. The use of $\hat{W}^{-1} \xrightarrow{p} V_o^{-1}$, therefore, downweights the sample moments which have large sampling variations and differences out correlations between different components $\sqrt{n}\,\hat{m}_i(\theta^o)$ and $\sqrt{n}\,\hat{m}_j(\theta^o)$ for $i \neq j$, where $i, j = 1, \ldots, l$. This is similar in spirit to the GLS estimator in the linear regression model. It also corrects serial correlations between different sample moments when they exist.

Optimality of the 2SLS Estimator ^ 2sls

As pointed out earlier, the 2SLS estimator ^ 2sls is a special case of the GMM estimator with
mt ( ) = Zt (Yt Xt0 ) and the the choice of weighting matrix W = E(Zt Zt0 ) = Qzz : Suppose
fmt ( o )g is an MDS and E("2t jZt ) = 2 ; where "t = Yt Xt0 o : Then
p
^ o )]
Vo = avar[ nm(
= E [mt ( o )mt ( o )0 ]
2
= Qzz

where the last equality follows from the law of iterated expectations and conditional ho-
moskedasticity. Because W = Qzz is proportional to Vo ; the 2SLS estimator ^ is asymptotically
optimal in this case. In contrast, when fmt ( o )g is an MDS with conditional heteroskedasticity
(i.e., E("2t jZt ) 6= 2 ) or fmt ( o )g is not an MDS, then the choice of W = Qzz does not de-
liver an asymptotically optimal 2SLS estimator. Instead, the GMM estimator with the choice of
W = Vo = E(Zt Zt0 "2t ) is asymptotically optimal.

Two-Stage GMM Estimator

The previous theorem suggests that the following two-stage GMM estimator will be asymp-
totically optimal.

18
Step 1: Find a consistent preliminary estimator ~ :

~ = arg min m( ~
^ )0 W 1
m(
^ );
2

for some prespeci…ed W ~ which converges in probability to some …nite and p.d. matrix. For
convenience, we can set W ~ = I; an l l identity matrix. This is not an optimal estimator, but
it is a consistent estimator for o .
p
Step 2: Find a preliminary consistent estimator V~ for Vo ^ = V~ :
^ o )]; and choose W
avar[ nm(

The construction of V~ di¤ers in the following two cases, depending on whether fmt ( o )g is
an MDS:

Case (i): fmt ( o )g is an ergodic stationary MDS process. In this case,


p
Vo ^ o )] = E[mt ( o )mt ( o )0 ]:
avar[ nm(

The asymptotic variance estimator

X
n
V~ = n 1
mt ( ~ )mt ( ~ )0
t=1

will be consistent for


Vo = E[mt ( o )mt ( o )0 ]:

Question: How to show this?


P
Answer: We need to assume that fn 1 nt=1 mt ( )mt ( )0 E[mt ( )mt ( )0 ]g satis…es the uniform
convergence:
X n
p
sup n 1 mt ( )mt ( )0 E[mt ( )mt ( )0 ] ! 0:
2 t=1

Also, we need to assume that E[mt ( )mt ( )0 ] is continuous in 2 :


p
Case (ii): fmt ( o )g is not MDS. In this case, a long-run variance estimator for Vo ^ o )]
avar[ nm(
is needed:
Xn 1
V~ = k(j=p) ~ (j);
j=1 n

where k( ) is a kernel function, p = p(n) is a smoothing parameter,

X
n
~ (j) = n 1
mt ( ~ )mt j ( ~ )0 for j 0;
t=j+1

19
and ~ (j) = ~ ( j)0 if j < 0: Under regularity conditions, it can be shown that V~ is consistent for
the long-run variance
X1
Vo = (j);
j= 1

where (j) =cov[mt ( ); mt j ( )] = E[mt ( )mt j ( o )0 ]: See more discussion in Chapter 6.


o o o

Question: Why do not we need demean when de…ning ~ (j)?

Step 3: Find an asymptotically optimal estimator ^ :

^ = arg min m(
^ )0 V~ 1
m(
^ ):
2

Remarks: The weighting matrix V~ does not involve the unknown parameter : It is a given (sto-
chastic) weighting matrix. This two-stage GMM estimator ^ is asymptotically optimal because
p p
V~ ! Vo = avar[ nm(^ o )].

Theorem 8.8 [Two-Stage Asymptotically Most E¢ cient GMM]: Suppose Assumptions


p p
8.1–8.3, 8.5 and 8.6 hold, V~ ! V; and W
~ ! W for some symmetric …nite and positive de…nite
matrix W: Then
p d
n( ^ o
) ! N (0; o ) as n ! 1;

where o = (Do0 Vo 1 Do ) 1 :

Question: Why do we need the asymptotically two-stage GMM estimator?

First, most macroeconomic time series data sets are usually short, and second, the use of
instruments Zt is usually ine¢ cient. These factors lead to a large estimation error so it is
desirable to have an asymptotically e¢ cient estimator.

Although the two-stage GMM procedure is asymptotically e¢ cient, one may like to iterate the
procedure further until the GMM parameter estimates and the values of the minimized objective
function converge. This will eliminate any dependence of the GMM estimator on the choice of
~ ; and it may improve the …nite sample performance of the GMM
the initial weighting matrix W
estimator when the number of parameters is large (e.g., Ferson and Foerster 1994).

8.6 Asymptotic Variance Estimator


To construct con…dence interval estimators and conduct hypothesis tests, we need to estimate
the asymptotic variance o of the optimal GMM estimator.

Question: How to estimate o (Do0 Vo 1 Do ) 1 ?

20
We need to estimate both Do and Vo :
o
(i) To estimate Do = E[ @m@t ( )
]; we can use

^ ^)
^ = dm(
D :
d

We have shown earlier that


p
^!
D Do .

(ii) To estimate Vo ; we need to consider two cases— MDS and non-MDS separately:

Case I: fmt ( o )g is ergodic stationary MDS. In this case,

Vo = E[mt ( o )mt ( o )0 ]:

A consistent variance estimator is


X
n
V^ = n 1
mt ( ^ )mt ( ^ )0 :
t=1

Assuming the UWLLN for fmt ( )mt ( )0 g; we can show that V^ is consistent for

Vo = E[mt ( o )mt ( o )0 ]:

Case II: fmt ( o )g is not MDS. In this case,

X
1
V0 = (j);
j= 1

where (j) = E[mt ( o )mt j ( o )0 ]: A consistent variance estimator is

X
n 1
V^ = k(j=p) ^ (j);
j=1 n

where k( ) is a kernel function, and

X
n
^ (j) = n 1
mt ( ^ )mt j ( ^ )0 for j 0;
t=j+1

21
Under suitable conditions (e.g., Newey and West 1994, Andrews 1991), we can show

p
V^ ! Vo

but the proof of this is beyond the scope of this course.

To cover both cases, we directly impose the following “high-level assumption”:


p p
Assumption 8.7: V^ Vo ! 0, where Vo ^ o )]:
avar[ nm(

Theorem 8.9 [Asymptotic Variance Estimator for the Optimal GMM Estimator]:
Suppose Assumptions 8.1–8.7 hold. Then

^o ^ 0 V^ 1 ^ 1 p
(D D) ! o as n ! 1.

8.7 Hypothesis Testing


We now consider testing the hypothesis of interest

H0 : R( o ) = r;

where R( ) is a J 1 continuously di¤erentiable vector-valued function, J K; and the J K


o
matrix dR(d ) = R0 ( o ) is of full rank J. Note that R( o ) = r covers both linear and nonlinear
restrictions on model parameters. An example of nonlinear restriction on o is o1 o2 = 1:

Remarks: We need J K: The number of restrictions is less than the number of parameters.
We now allow hypotheses of both linear and nonlinear restrictions on o :

Question: How to construct a test statistic for H0 ?

The basic idea is to check whether R( ^ ) r is close to 0: By the Taylor series expansion and
R( o ) = r under H0 ; we have

p p
n[R( ^ ) r] = n[R( o ) r]
p
+R0 ( ) n( ^ o
)
p
= R0 ( ) n( ^ o
)
d
! R0 ( o ) N (0; o)

N [0; R0 ( o ) oR
0
( o )0 ]:

22
where lies between ^ and o ; i.e., = ^ + (1 ) o for some 2 [0; 1]:
p o p
Because R0 ( ) ! R0 ( o ) given continuity of R0 ( ) and ! 0, and
p d
n( ^ o
) ! N (0; o );

we have
p d
n[R( ^ ) r] ! N [0; R0 ( o ) oR
0
( o )0 ]:

by the Slutsky theorem. It follows that the quadratic form


p p d
n[R( ^ ) r]0 [R0 ( o ) oR
0
( o )0 ] 1
n[R( ^ ) r] ! 2
J:

The Wald test statistic is then

d
W = n[R( ^ ) r]0 [R0 ( ^ ) ^ o R0 ( ^ )0 ] 1 [R( ^ ) r] ! 2
J

where the convergence in distribution to 2J follows from the Slutsky theorem.


When J = 1; we can have an asymptotically N(0,1) test statistic
p
n[R( ^ ) r)] d
T =q ! N (0; 1) as n ! 1:
0 ^ ^ 0 ^
R ( ) oR ( ) 0

Theorem 8.10 [Wald Test Statistic]: Suppose Assumptions 8.1–8.7 hold. Then under H0 :
R( o ) = r; we have

d
W = n[R( ^ ) r]0 [R0 ( ^ ) ^ o R0 ( ^ )0 ] 1 [R( ^ ) r] ! 2
J:

Remarks: This can be used for hypothesis testing. This Wald test is built upon an asymp-
totically optimal GMM estimator. One could also construct a Wald test using a consistent but
suboptimal GMM estimator (how?).

8.8 Model Speci…cation Testing


As pointed out earlier, many dynamic economic theories can be formulated as a moment
condition or a set of moment conditions. Thus, to test validity of an economic theory, one can
check whether the related moment condition holds.

Question: How to test whether the econometric model as characterized by

E [mt ( o )] = 0 for some o

is correctly speci…ed?

23
Answer: We can check correct model speci…cation by testing whether the above moment con-
dition holds.

Question: How to check if the moment condition

E[mt ( o )] = 0

holds?

Answer: Use the sample moment

X
n
^ ^) = n
m( 1
mt ( ^ )
t=1

and see if it is signi…cantly di¤erent from zero (the value of the population moment evaluated at
the true parameter value o ). For this purpose, we need to know the asymptotic distribution of
p
^ ^ ):
nm(

Consider the test statistic


p p
^ ^) =
nm( ^ o)
nm(
^ )p ^
dm( o
+ n( )
d

which follows from a …rst order Taylor series expansion, and lies between ^ and o
. The
p
asymptotic distribution of nm(^ ^ ) is contributed from two sources.
Recall that the two-stage GMM

^ = arg min m(
^ )0 V~ 1
m(
^ ):
2

The FOC of the two-stage GMM estimation is given by

d h ^ 0~ 1 ^
i
0= m(
^ )V m(
^ ) :
d

It is very important to note that V~ is not a function of , so it has nothing to do with the
di¤erentiation with respect to : We then have

^ ^ ) ~ 1p
dm(
0 = 0 V ^ o)
nm(
d
^ ^ ) ~ 1 dm(
dm( ^ )p ^ o
+ 0 V n( ):
d d

24
It follows that for n su¢ ciently large, we have
p
n( ^ o
)
" # 1
^ ^) ~
dm( 1 dm(
^ )
= V
d 0 d
^ ^) ~
dm( p
V 1
^ o ):
nm(
d 0

Hence,

p
V~ 1=2
^ ^)
nm(
p
= V~ 1=2
^ o)
nm(
^ )p ^
dm(
+V~ 1=2 n( o
)
d
2 " # 3
1
dm(
^ ) ^ ^) ~
dm( 1 dm(
^ ) ^ ^) ~
dm( p
= 4I V~ 1=2 V V 1=2 5
V~ 1=2
^ o)
nm(
d d 0 d d 0
p
= ^ [V~ 1=2 nm(
^ o )]:

By the CLT for fmt ( o )g and the Slutsky theorem, we have


p d
V~ 1=2
^ o ) ! N (0; I):
nm(

where I is a l l identity matrix. Also, we have


" # 1
^ = I dm(
^ ) ^ ^) ~
dm( 1 dm(
^ ) ^ ^) ~
dm(
V~ 1=2 V V 1=2
d d 0 d d 0
p 1=2
!I Vo Do (Do0 Vo 1 Do ) 1 Do0 Vo 1=2

= ;

where
1=2
=I Vo Do (Do0 Vo 1 Do ) 1 Do0 Vo 1=2

2
is a l l symmetric matrix which is also idempotent (i.e., = ) with tr( ) = l K (why?
Use tr(AB) = tr(BA)!):

25
It follows that under correct model speci…cation, we have
p p
^ ^ )0 V~
n[m( 1
^ ^ )] = [V~
m( 1=2
^ o )]0 ^ 2 [V~
nm( 1=2
^ o )] + oP (1)
nm(
d
! G0 G
2
l K

by the following lemma, where G N (0; I):

Lemma 8.11 [Quadratic Form in Normal Random Variables]: If v N (0; I) and is


an l l symmetric and idempotent with rank q l; then the quadratic form

v0 v 2
q:

Remarks: The adjustment of degrees of freedom from l to l K is due to the impact of the
asymptotically optimal parameter estimator ^ :
p
Theorem 8.12 [Overidenti…cation Test] Suppose Assumptions 8.1–8.6 hold, and V~ ! Vo
as n ! 1. Then under the null hypothesis that E[mt ( o )] = 0 for some unknown o ; the test
statistic
d
^ ^ )0 V~ 1 m(
n m( ^ ^ ) ! 2l K :

Remarks: This is often called the J-test or the test for overidenti…cation in the GMM litera-
ture, because it requires l > K. This test can be used to check if the model characterized as
E[mt ( o )] = 0 is correctly speci…ed.

It is important to note that the fact that

^ ^ )0 V~
nm( 1
^ ^ ) ! G0 G
m(

where is an idempotent matrix is due to the fact that ^ is an asymptotically optimal GMM
estimator that minimizes the objective function nm( ^ )0 V~ 1 m(
^ ). If a suboptimal GMM esti-
mator is used, we would have no above result. Instead, we need to use a di¤erent asymptotic
variance estimator to replace V~ and obtain an asymptotically 2l distribution under correct model
speci…cation. Because the critical value of 2l K is smaller than that of 2l when K > 0; the use
of the asymptotically optimal estimator ^ leads to an asymptotically more e¢ cient test.

Remarks: When l = K; the exactly identi…ed case, the moment conditions cannot be tested by
the asymptotically optimal GMM ^ ; because m(
^ ^ ) will be identically zero, no matter whether
E[m( o )] = 0:

Question: Why is the degree of freedom equal to l K?

26
Answer: The adjustment of degrees of freedom (minus K) is due to the impact of the sam-
pling variation of the asymptotically optimal GMM estimator. In other words, the use of an
asymptotically optimal GMM estimator ^ instead of ~ renders the degrees of freedom to change
from l to l K: Note that if ^ is not an asymptotically optimal GMM estimator, the asymptotic
^ ^ )0 V~ 1 m(
distribution of nm( ^ ^ ) will be changed.

Question: In the J test, why do we use the preliminary weighting matrix V~ ; which is evaluated
at a preliminary parameter estimator ~ ? Why not use V^ ; a consistent estimator for V that is
evaluated at the asymptotically optimal estimator ^ ?

Answer: With the preliminary matrix V~ ; the J-test statistic is n times the minimum value
of the objective function— the quadratic form in the second stage of GMM estimation. Thus,
^ ^ )0 V~ 1 m(
the value of the test statistic nm( ^ ^ ) is directly available as a by-product of the second
stage GMM estimation. For this reason and for its asymptotic 2 distribution, the J-test is also
called the minimum chi-square test.

Question: Can we use V^ to replace V~ in the J-test statistic?

Answer: Yes. The test statistic nm( ^ ^ ) is also asymptotically 2l K under correct
^ ^ )0 V^ 1 m(
^ ^ )0 V~ 1 m(
model speci…cation (please verify!), but this statistic is less convenient to compute than nm( ^ ^ );
because the latter is the objective function of the second stage GMM estimation. This is analo-
gous to the F -test statistic, which is based on the sums of squared residuals of linear regression
models.

Question: Can we replace ^ by some suboptimal but consistent GMM estimator ~ ; say?

Answer: No. We cannot obtain the asymptotically 2l K distribution. We need to replace V~ in


the nm( ^ ^ ) with a suitable asymptotic variance estimator and will obtain an asymptotic
^ ^ )0 V~ 1 m(
2
p ^ p ~ ~ is a consistent but suboptimal estimator
l distribution. Note that avar( n ) 6= avar( n ) if
for o :

Testing for Validity of Instruments

In the linear IV estimation context, where

mt ( ) = Zt (Yt Xt0 );

the overidenti…cation test can be used to check the validity of the moment condition

E[mt ( o )] = E[Zt (Yt Xt0 o


)]
o
= 0 for some :

27
This is essentially to check whether Zt is a valid instrument vector, that is, whether Zt is
orthogonal to "t = Yt Xt0 o . Put e^t = Yt Xt0 ^ 2sls : We can use the following test statistic

e^0 Z(Z 0 Z) 1 Z 0 e^
e^0 e^=n

Note that the numerator

^ ^ 2sls )0 W
e^0 Z(Z 0 Z) 1 Z 0 e^ = n m( ^ 1
^ ^ 2sls )
m(

is n times the value of the objective function of the GMM minimization with the choice of W ^ =
(Z 0 Z=n); which is an optimal choice when fmt ( o )g is an MDS with conditional homoskedasticity
(i.e., E("2t jZt ) = 2 ): In this case,

e^0 e^ Z 0 Z p 2
! Qzz = Vo :
n n
It follows that the test statistic
e^0 Z(Z 0 Z) 1 Z 0 e^ d 2
! l K
e^0 e^=n
o
under the null hypothesis that E("t jZt ) = 0 for some :

Corollary 8.13: Suppose Assumptions 7.1–7.4, 7.6 and 7.7 hold, and l > K. Then under the
null hypothesis that E("t jZt ) = 0, the test statistic

e^0 Z(Z 0 Z) 1 Z 0 e^ d 2
! l K;
e^0 e^=n

where e^ = Y X ^ 2sls :

2 2
In fact, the overidenti…cation test statistic is equal to nRuc , where Ruc is the uncentered R2
from the auxiliary regression
e^t = 0 Zt + wt :
2
In fact, it can be shown that under the null hypothesis of E("t jZt ) = 0; nRuc is asymptotically
equivalent to nR2 in the sense that nRuc 2
= nR2 + oP (1); where R2 is the uncentered R2 of
regressing e^t on Zt :This provides a convenient way to calculate the test statistic. However,
it is important to emphasize that this convenient procedure is asymptotically valid only when
E("2t jZt ) = 2 :

8.9 Empirical Applications


8.10 Conclusion
28
Most economic and …nancial theories have implications on and only on a moment restriction

E[mt ( o )] = 0;

where mt ( ) is a l 1 moment function. This moment condition can be used to estimate model
parameter o via the so-called GMM estimation method. The GMM estimator is de…ned as:

^ = arg min m( ^
^ )0 W 1
m(
^ );
2

where
X
n
1
m(
^ )=n mt ( ):
t=1

Under a set of regularity conditions, it can be shown that

p
^ ! o

and
p d
n( ^ o
) ! N (0; );

where
= (Do0 W 1
Do ) 1 Do0 W 1
Vo W 1
Do (Do0 W 1
Do ) 1 :

The asymptotic variance of the GMM estimator ^ depends on the choice of weighting matrix
p
W: An asymptotically most e¢ cient GMM estimator is to choose W = Vo avar[ nm( ^ o )]: In
this case, the asymptotic variance of the GMM estimator is given by

o = (Do0 Vo 1 Do ) 1

which is a minimum variance. This is similar in spirit to the GLS estimator in a linear regression
model. This suggests a two-stage asymptotically optimal GMM estimator ^ : First, one can
obtain a consistent but suboptimal GMM estimator ~ by choosing some convenient weighting
matrix W ~ : Then one can use ~ to construct a consistent estimator V~ for Vo ; and use it as a
weighting matrix to obtain the second stage GMM estimator ^ :
To construct con…dence interval estimators and hypothesis tests, one has to obtain consistent
asymptotic variance estimators for GMM estimators. A consistent asymptotic variance estimator
for an asymptotically optimal GMM estimator is

^ o = (D
^ 0 V^ 1 ^ 1;
D)

29
where
X
n
dmt ( ^ )
^ =n
D 1
;
t=1
d

and the construction of V^ depends on the properties of fmt ( o )g; particularly on whether
fmt ( o )g is an ergodic stationary MDS process.
Suppose a two-stage asymptotically optimal GMM estimator is used. Then the associated
Wald test statistic for the null hypothesis

H0 : R( o ) = r:

is given by

^ = n[R( ^ ) d
W r]0 [R0 ( ^ )(D
^ 0 V^ 1 ^ 1 R0 ( ^ )0 ] 1 [R( ^ )
D) r] ! 2
J :

The moment condition E[mt ( o )] = 0 also provides a basis to check whether an economic
theory or economic model is correctly speci…ed. This can be done by checking whether the sample
moment m( ^ ^ ) is close to zero. A popular model speci…cation test in the GMM framework is the
J-test statistic
d
^ ^ )0 V~ 1 m(
nm( ^ ^ ) ! 2l K

under correct model speci…cation, where ^ is an asymptotically optimal GMM estimator (ques-
tion: what will happen if a consistent but suboptimal GMM estimator is used). This is also
^ ^ )V~ 1 m(
called the overidenti…cation test. The J-test statistic nm( ^ ^ ) is rather convenient to
compute, because it is the objective function of the GMM estimator.
GMM provides a convenient uni…ed framework to view most econometric estimators. In other
words, most econometric estimators can be viewed as a special case of the GMM framework with
suitable choice of moment function and weighting matrix. In particular, the OLS and 2SLS
estimators are special cases of the class of GMM estimators.

30
EXERCISES
8.1. A generalized method of moment (GMM) estimator is de…ned as

^ = arg min m( ^
^ )0 W 1
m(
^ );
2

where is a K ^ is a possibly stochastic l


1 vector, W l symmetric and nonsingular matrix,

X
n
1
m(
^ )=n mt ( );
t=1

and mt ( ) is a l 1 moment function of random vector Zt , and l K: We make the following


assumptions:
o o o
Assumption 1.1: is the unique solution to E[m(Zt ; )] = 0; and is an interior point in
:
o
Assumption 1.2: fZt g is a stationary time series process and m(Zt ; ) is a martingale di¤er-
ence sequence in the sense that

o
E m(Zt ; )j Z t 1
= 0;

where Z t 1
= fZt 1 ; Zt 2 ; :::; Z1 g is the information available at time t 1:.

Assumption 1.3: m(Zt ; ) is continuously di¤erentiable with respect to 2 such that

p
^ 0( )
sup km m0 ( )k ! 0,
2

d d
^ 0( ) =
where m d
^ ) and m0 ( ) =
m( d
E[m(Zt ; )] = E[ @@ m(Zt ; )]:
p d
Assumption 1.4: ^ o ) ! N (0; Vo ) for some …nite and positive de…nite matrix Vo :
nm(
p
^ !
Assumption 1.5: W W , where W is a …nite and positive de…nite matrix.
p
From these assumptions, one can show that ^ ! o , and this result can be used in answering
the following questions in parts (a)–(d). Moreover, you can make additional assumptions if you
feel appropriate and necessary.

(a) Find the expression of Vo in terms of m(Zt ; o ):


(b) Find the …rst order condition of the above GMM minimization problem.
p
(c) Derive the asymptotic distribution of n( ^ o
):

31
^ : Explain why your choice of W
(d) Find the optimal choice of W ^ is optimal.

8.2. (a) Show that the 2SLS ^ 2sls for the parameter o in the regression model Yt = Xt0 o + "t
is a special case of the GMM estimator with suitable choices of moment function mt ( ) and
weighting matrix W ^;
(b) Assume that fZt "t g is a stationary ergodic process and other regularity conditions hold.
Compare the relative e¢ ciency between an asymptotically optimal GMM estimator (with the
optimal choice of the weighting matrix) and ^ 2sls under conditional homoskedasticity and con-
ditional heteroskedasticity respectively.
p
8.3. Use a suboptimal GMM estimator ^ with a given weighting function W ^ ! W to construct
o
a Wald test statistic for the null hypothesis H0 : R = r; and justify your reasoning. Assume
all necessary regularity conditions hold.

8.4. Suppose that fmt ( )g is an ergodic stationary MDS process, where mt ( ) is continuous on
a compact parameter set ; and fmt ( )mt ( )0 g follows a uniform weak law of large numbers,
P
and Vo = E[mt ( o )mt ( o )0 ] is …nite and nonsingular. Let V^ = n 1 nt=1 mt ( ^ )mt ( ^ )0 ; where ^ is
p
a consistent estimator of o . Show V^ ! Vo :
p
8.5. Suppose V^ is a consistent estimator for Vo = avar[ nm(^ o )]: Show that replacing V~ by V^
has no impact on the asymptotic distribution of the overidenti…cation test statistic, that is, show

p
^ ^ )V~
nm( 1
^ ^)
m( ^ ^ )V^
nm( 1
^ ^ ) ! 0:
m(

Assume all necessary regularity conditions hold.

8.6. Suppose ~ is a suboptimal but consistent GMM estimator. Could we simply replace ^ by
~ and still obtain the asymptotic 2 distribution for the overidenti…cation test statistic? Give
l K
your reasoning. Assume all necessary regularity conditions hold.

8.7. Suppose Assumptions 7.1–7.4, 7.6 and 7.7 hold. To test the null hypothesis that E("t jZt ) =
0, where Zt is a l 1 instrumental vector, one can consider the auxiliary regression

0
e^t = Zt + wt ;

where e^t = Yt Xt0 ^ 2sls : Show nRuc


2
= nR2 + oP (1) as n ! 1 under the null hypothesis. [Hint:
2
Recall the de…nitions of Ruc and R2 in Chapter 3.]

8.8 [Nonlinear Least Squares Estimation]. Consider a nonlinear regression model

o
Yt = g(Xt ; ) + "t ;

32
where o is an unknown K 1 parameter vector and E("t jXt ) = 0 a.s. Assume that g(Xt ; ) is
twice continuously di¤erentiable with respect to with the K K matrices E[ @g(X
@
t ; ) @g(Xt ; )
@ 0
]
2
and E[ @ @g(Xt; )
@ 0
] …nite and nonsingular for all 2 :
The nonlinear least squares (NLS) estimator solves the minimization of the sum of squared
residual problem
Xn
^ = arg min [Yt g(Xt ; )]2 :
t=1

The …rst order condition is


X
n
@g(Xt ; ^ )
D( ^ )0 e = [Yt g(Xt ; ^ )] = 0;
t=1
@

@
where D( ) is a n K matrix, with the t-th row being @
g(Xt ; ): This FOC can be viewed as
the FOC
^ ^) = 0
m(

for an GMM estimation with


@g(Xt ; )
mt ( ) = [Yt g(Xt ; )]
@

in an exact identi…cation case (l = K). Generally, there exists no closed form expression for ^ :
Assume all necessary regularities conditions hold.
p
(a) Show that ^ ! o as n ! 1:
p
(b) Derive the asymptotic distribution of n( ^ o
):
p ^
(c) What is the asymptotic variance of n( o
) if f @g(X
@
t; )
"t g is an MDS with conditional
2 2
homoskedasticity (i.e., E("t jXt ) = a.s.)? Give your reasoning.
p
(d) What is the asymptotic variance of n( ^ o
) if f @g(X
@
t; )
"t g is an MDS with conditional
2 2
heteroskedasticity (i.e., E("t jXt ) 6= a.s.)? Give your reasoning.
@g(Xt ; )
(e) Suppose f @ "t g is an MDS with conditional homoskedasticity (i.e., E("2t jXt ) = 2
a.s.). Construct a test for the null hypothesis H0 : R( o ) = r; where R( ) is a J K nonstochastic
matrix such that R0 ( o ) = @@ R( o ) is a J L matrix with full rank J L; and r is a J 1
nonstochastic vector.

8.9. [Nonlinear IV Estimation] Consider a nonlinear regression model

o
Yt = g(Xt ; ) + "t ;

where g(Xt ; ) is twice continuously di¤erentiable with respect to ; E("t jXt ) 6= 0 but
E("t jZt ) = 0; where Yt is a scalar, Xt is a K 1 vector and Zt is a l 1 vector with l K:

33
Suppose fYt ; Xt0 ; Zt0 g0n
t=1 is a stationary ergodic process, and fZt "t g is an MDS.
The unknown parameter o can be consistently estimated based on the moment condition

E[mt ( o )] = 0;

o
where mt ( ) = Zt [Yt g(Xt ; )]: Suppose a nonlinear IV estimator solves the minimization
problem
^ = arg min m( ^
^ )0 W 1
m(
^ );
P ^ !p
^ ) = n 1 nt=1 Zt [Yt g(Xt ; )]; and W
where m( W; a …nite and positive de…nite matrix.
p
^
(a) Show ! : o

(b) Derive FOC.


(c) Derive the asymptotic distribution of ^ : Discuss the cases of conditional homoskedasticity
and conditional heteroskedasticity respectively.
(d) What is the optimal choice of W so that ^ is asymptotically most e¢ cient?
(e) Construct a test for the null hypothesis that H0 : R( o ) = r; where R( ) is a J K
nonstochastic matrix with R0 ( o ) of full rank, r is a J 1 nonstochastic vector, and J K.
(f) Suppose f @g(X@
t; )
"t g is an MDS with conditional heteroskedasticity (i.e., E("2t jXt ) 6= 2
a.s.). Construct a test for the null hypothesis H0 : R( o ) = r; where R( ) is a J K nonstochastic
matrix such that R0 ( o ) = @@ R( o ) is a J L matrix with full rank J L; and r is a J 1
nonstochastic vector.

8.10. Consider testing the hypothesis of interest H0 : R( o ) = r under the GMM framework,
where R( o ) is a J K nonstochastic matrix, r is a J 1 nonstochastic vector, and R0 ( o ) is a
J K matrix with full rank J; where J K: We can construct a Lagrangian multiplier test based
on the Lagrangian multiplier ^ , where ^ is the optimal solution of the following constrained
GMM minimization problem:
h i
( ^ ; ^ ) = arg min ^ )0 V~
m( 1
m(
^ )+ 0
[r R( ) ;
2 ; 2R

p
where V~ is a preliminary consistent estimator for Vo = avar[ nm(
^ o )] that does not depend :
Construct the LM test statistic and derive its asymptotic distribution. Assume all regularity
conditions hold.

34
CHAPTER 9 MAXIMUM LIKELIHOOD
ESTIMATION AND QUASI-MAXIMUM
LIKELIHOOD ESTIMATION
Abstract: Conditional distribution models have been widely used in economics and …nance. In
this chapter, we introduce two closely related popular methods to estimate conditional proba-
bility distribution models— Maximum Likelihood Estimation (MLE) and Quasi-MLE (QMLE).
MLE is a parameter estimator that maximizes the model likelihood function of the random sam-
ple when the conditional probability distribution model is correctly speci…ed, and QMLE is a
parameter estimator that maximizes the model likelihood function of the random sample when
the conditional probability distribution model is misspeci…ed. Because the score function is an
MDS process and the dynamic information matrix equality holds when a conditional distribution
model is correctly speci…ed, the asymptotic properties of the MLE is analogous to those of the
OLS estimator when the regression disturbance is an MDS with conditional homoskedasticity,
and we can use the Wald test, Lagrange Multiplier test and Likelihood Ratio test for hypothesis
testing, where the Likelihood Ratio test is analogous to the J F test statistic. On the other hand,
when the conditional distributional model is misspeci…ed, the score function has mean zero, but
it may no longer be an MDS process and the dynamic information matrix equality may fail. As a
result, the asymptotic properties of the QMLE are analogous to those of the OLS estimator when
the regression disturbance displays serial correlation and conditional heteroskedasticity. Robust
Wald tests and Lagrange Multiplier tests can be constructed for hypothesis testing, but the Like-
lihood ratio test can no longer be used, for a reason similar to the failure of the F -test statistic
when the regression disturbance displays conditional heteroskedasticity and serial correlation.
We discuss methods to test the MDS properties of the score function, and the dynamic informa-
tion matrix equality, and correct speci…cation of the entire conditional distribution model. Some
empirical applications are considered.

Key words: ARMA model, Censored data, Conditional probability distribution model,
Discrete choice model, Dynamic information matrix test, GARCH model, Hessian matrix, Infor-
mation matrix equality, Information matrix test, Lagrange multiplier test, Likelihood, Likelihood
ratio test, Martingale, MLE, Pesudo likelihood function, QMLE, Score function, Truncated data,
Wald test.
9.1 Motivation
So far we have focused on the econometric models for conditional mean or conditional ex-
pectation, either linear or nonlinear. When do we need to model the conditional probability
distribution of Yt given Xt ?

1
We …rst provide a number of economic examples which call for the use of a conditional
probability distribution model.

Example 1 [Value at Risk, VaR]

In …nancial risk management, how to quantify extreme downside market risk has been an
important issue. Let It 1 = (Yt 1 ; Yt 2 ; :::; Y1 ) be the information set available at time t 1;
where Yt is the return on a portfolio in period t: Suppose

o
Yt = t( ) + "t
o o
= t( )+ t( )zt ;

where t ( o ) = E(Yt jIt 1 ); 2t ( o ) = var(Yt jIt 1 ); fzt g is an i.i.d. sequence with E(zt ) = 0,
var(zt ) = 1; and pdf fz ( j o ): An example is that fzt g i:i:d:N (0; 1):

The value at risk (VaR), Vt ( ) = V ( ; It 1 ); at the signi…cance level 2 (0; 1); is de…ned as

P [Yt < Vt ( )jIt 1 ] = = 0:01 (say).

Intuitively, VaR is the threshold that the actual loss will exceed with probability : Given that
Yt = t + t zt ; where for simplicity we have put t = t ( o ) and t = t ( o ); we have

= P( t + t zt< Vt ( )jIt 1 )
Vt ( ) t
= P zt < It 1
t
Vt ( ) t
= Fz ;
t

where the last equality follows from the independence assumption of fzt g: It follows that

Vt ( ) t
= C( ):
t

Vt ( ) = t + t C( );

where C( ) is the left-tailed critical value of the distribution Fz ( ) at level ; namely

P [zt < C( )] =

or Z C( )
0
fz (zj )dz = :
1

2
For example, C(0:05) = 1:65 and C(0:01) = 2:33:

Obviously, we need to model the conditional distribution of Yt given It 1 in order to calculate


Vt ( ), which is a popular quantitative measure for downside market risk.

For example, J.P. Morgan’s RiskMetrics uses a simple conditionally normal distribution model
for asset returns:

Yt = t zt ;
X
t 1
2 j
t = (1 ) Yt2 j ; 0< < 1;
j=1
fzt g i:i:d:N (0; 1):

2
Here, the conditional probability distribution of Yt jIt 1 is N (0; t ); from which we can obtain

Vt (0:05) = 1:65 t :

Example 2 [Binary Probability Modeling] Suppose Yt is a binary variable taking values 1


and 0 respectively. For example, a business turning point or a currency crisis may occur under
certain circumstance; households may buy a fancy new product; and default risk may occur for
some …nancial …rms. In all these scenarios, the variables of interest can take only two possible
values. Such variables are called binary.
We are interested in the probability that some economic event of interest occurs (Yt = 1)
and how it depends on some economic characteristics Xt : It may well be that the probability of
Yt = 1 di¤ers among individuals or across di¤erent time periods. For example, the probability
of students’success depends on their intelligence, motivation, e¤ort, and the environment. The
probability of buying a product may depend on income, age, and preference.
To capture such individual e¤ects (denoted as Xt ), we consider a model

o
P (Yt = 1jXt ) = F (Xt0 );

where F ( ) is a prespeci…ed CDF. An example of F ( ) is the logistic function, namely,

1
F (u) = ; 1 < u < 1:
1 + exp( u)

This is the so-called logistic regression model. This model is useful for modeling (e.g.) credit
default risk and currency crisis.

3
An economic interpretation for the binary outcome Yt is a story of a latent variable process.
De…ne (
1 if Yt c;
Yt =
0 if Yt > c;
where c is a constant, the latent variable

o
Yt = Xt0 + "t ;

and F ( ) is the CDF of the i.i.d. error term "t : If f"t g i:i:d:N (0; 2 ) and c = 0; the resulting
model is called a probit model. If f"t g i:i:d: Logistic(0; 2 ) and c = 0, the resulting model
is called a logit model. The latent variable could be the actual economic decision process. For
example, Yt can be the credit score and c is the threshold with which a lending institute makes
its decision on loan approvals.

This model can be extended to the multinomial model, where Yt takes discrete multiple
integers instead of only two values.

Example 3 [Duration Models]

Suppose we are interested in the time it takes for an unemployed person to …nd a job, the
time that elapses between two trades or two price changes, the length of a strike, the length
before a cancer patient dies, and the length before a …nancial crisis (e.g., credit default risk)
comes out. Such analysis is called duration analysis or survival analysis.
In practice, the main interest often lies in the question of how long a duration of an economic
event will continue, given that it has not …nished yet. An important concept called the hazard
rate measures the chance that the duration will end now, given that it has not ended before.
This hazard rate therefore can be interpreted as the chance to …nd a job, to trade, to end a
strike, etc.
Suppose Yt is the duration from a population with the probability density function f (y) and
probability distribution function F (y): Then the survival function is de…ned as

S(y) = P (Yt > y) = 1 F (y);

4
and the hazard rate is de…ned as
P (y < Yt y + jYt > y)
(y) = lim+
!0
P (y < Yt y + )=P (Yt > y)
= lim+
!0
f (y)
=
S(y)
d
= ln S(y):
dy

Hence, we have f (y) = (y)S(y): The speci…cation of (y) is equivalent to a speci…cation of


f (y): But (y) is more interpretable in economics. For example, suppose we have (y) = r; a
constant; that is, the hazard rate does not depend on the length of duration. Then

f (y) = r exp( ry)

is an exponential probability density.


The hazard rate may not be the same for all individuals (i.e., it may depend on individual
characteristics Xt ). To control heterogeneity across individuals, we assume a conditional hazard
function
0
t (y) = exp(Xt ) 0 (y);

where 0 (y) is called the baseline hazard rate. This speci…cation is called the proportional hazard
model, proposed by Cox (1962). The parameter

@
= ln t (y)
@Xt
1 @
= t (y)
t (y) @Xt

is the marginal relative e¤ect of Xt on the hazard rate of individual t: The survival function of
the proportional hazard model is
0
St (t) = [So (t)]exp(Xt )

where So (t) is the survival function of the baseline hazard rate 0 (t):
The probability density function of Yt given Xt is

f (yjXt ) = t (y)St (y):

To estimate parameter ; we need to use the maximum likelihood estimation (MLE) method,

5
which will be introduced below.

Example 4 [Ultra-High Frequency Financial Econometrics and Engle and Russell’s


(1998) Autoregressive conditional duration model]

Suppose we have a sequence of tick-by-tick …nancial data fPi ; ti g; where Pi is the price traded
at time ti ; where i is the index for the i-th price change. De…ne the time interval between price
changes
Yi = ti ti 1 ; i = 1; :::; n:

Question: How to model the serial dependence of the duration Yi ?

Engle and Russell (1998) propose a class of autoregressive conditional duration model:
8
o
>
< Yi = i ( )zi ;
o
i ( ) = E(Yi jIi 1 );
>
:
fzi g i:i:d:EXP(1),

o
where Ii 1 is the information set available at time ti 1 : Here, i = i( ) is called the conditional
expected duration given Ii 1 : A model for i is

i =!+ i 1 + Yi 1 ;

where = (!; ; )0 :
From this model, we can write down the model-implied conditional probability density of Yi
given Ii 1 :
1 y
f (yjIi 1 ) = exp ; y > 0:
i i

From this conditional density, we can compute the conditional intensity of Yi (i.e., the instanta-
neous probability that the next price change will occur at time ti ); which is important for (e.g.)
options pricing.

Example 5 [Continuous-time Di¤usion models] The dynamics of the spot interest rate Yt
is fundamental to pricing …xed income securities. Consider a di¤usion model for the spot interest
rate
dYt = (Yt ; o )dt + (Yt ; o )dWt ;

where (Yt ; o ) is the drift model, and (Yt ; o ) is the di¤usion (or volatility) model, o is an
unknown K 1 parameter vector, and Wt is the standard Brownian motion. Note that the time
t is a continuous variable here.

Question: What is the Brownian motion?

6
Continuous-time models have been rather popular in mathematical …nance and …nancial
engineering. First, …nancial economists have the belief that informational ‡ow into …nancial
markets is continuous in time. Second, the mathematical treatment of derivative pricing is
elegant when a continuous-time model is used.

The following are three well-known examples of the di¤usion model:

The random walk model with drift

dYt = dt + dWt ;

Vasicek’s (1977) model


dYt = ( + Yt )dt + dWt ;

Cox, Ingersoll, and Ross’(1985) model

1=2
dYt = ( + Yt )dt + Yt dWt :

These di¤usion models are important for hedging, derivatives pricing and …nancial risk manage-
ment.

Question: How to estimate model parameters of a di¤usion model using a discretely sampled
data fYt gnt=1 ?

Given (Yt ; ) and (Yt ; ); we can determine the conditional probability density fYt jIt 1 (yt jIt 1 ; )
of Yt given It 1 : Thus, we can estimate o by the maximum likelihood estimation (MLE) or as-
ymptotically equivalent methods using discretely observed data. For the random walk model,
the conditional pdf of Yt given It 1 is

1 (y t)2
f (yjIt 1 ; ) = p exp :
2 2t 2 2t

For Vasicek’s (1977) model, the conditional pdf of Yt given It 1 is

f (yjIt 1 ; ) = :

For the Cox, Ingersoll and Ross’(1985) model, the conditional pdf of Yt given It 1 is

f (yjIt 1 ; ) = :

It may be noted that many continuous-time di¤usion models do not have a closed form
expression for their conditional pdf, which makes the MLE estimation infeasible. Methods have

7
been proposed in the literature to obtain some accurate approximations to the conditional pdf
so that MLE becomes feasible.

9.2 Maximum Likelihood Estimation (MLE) and Quasi-


MLE
Recall a random sample of size n is a collection of random vectors fZ1 ; ; Zn g; where
Zt = (Yt ; Xt0 )0 : We denote the random sample as follows:

Z n = (Z10 ; ; Zn0 )0 :

A realization of Z n is a data set, denoted as z n = (z10 ; ; zn0 )0 . A random sample Z n can generate
many realizations (i.e., data sets).

Question: How to characterize the random sample Z n ?

All information in Z n is completely described by its joint probability density function (pdf) or
probability mass function (pmf) fZ n (z n ): [For discrete r.v.’s, we have fZ n (z n ) = P (Z n = z n ):] By
sequential partitioning (repeatedly using the multiplication rule that P (A \ B) = P (AjB)P (B)
for any two events A and B), we have

fZ n (z n ) = fZn jZ n 1 (zn jz n 1 )fZ n 1 (z n 1 )


Yn
= fZt jZ t 1 (zt jz t 1 ):
t=1

where Z t 1 = (Zt0 1 ; Zt0 2 ; ; Z10 )0 ; and fZt jZ t 1 (zt jz t 1 ) is the conditional pdf of Zt given Z t 1 :
Also, given Zt = (Yt ; Xt0 )0 and using the formula that P (A \ BjC) = P (AjB \ C)P (BjC) for any
events A; B and C; we have

fZt jZ t 1 (zt jz t 1 ) = fYt j(Xt ;Z t 1) (yt jxt ; z t 1 )fXt jZ t 1 (xt jz t 1 )


= fYt j t (yt j t )fXt jZ t 1 (xt jz t 1 );

where
t = (Xt0 ; Z t 10 0
);

an extended information set which contains not only the past history Z t 1
but also the current

8
Xt : It follows that

Y
n
n
fZ n (z ) = fYt j t (yt j t )fXt jZ t 1 (xt jz t 1 )
t=1
Y
n Y
n
= fYt j t (yt j t) fXt jZ t 1 (xt jz t 1 ):
t=1 t=1

0 10 0
Often, the interest is in modeling the conditional distribution of Yt given t = (Xt ; Z t ):

Some Important Special Cases

Case I [Cross-Sectional Observations]: Suppose fZt g is i.i.d. Then fYt j t (yt jxt ; z t 1 ) =
fYt jXt (yt jxt ) and fXt jZ t 1 (xt jz t 1 ) = fXt (xt ): It follows that

Y
n Y
n
n
f (z ) =
Zn fYt jXt (yt jxt ) fXt (xt );
t=1 t=1

where fXt (xt ) is the marginal pdf/pmf of Xt :

Case II: [Univariate Time Series Analysis] Suppose Xt does not exist, namely Zt = Yt .
Then t = (Xt0 ; Z t 10 )0 = Z t 1 = (Yt 1 ; :::; Y1 )0 ; and as a consequence,

Y
n
n
fZ n (z ) = fYt jY t 1 (yt jy t 1 ):
t=1

Variation-Free Parameters Assumption

We assume a parametric conditional probability model

fZt jZ t 1 (zt jz t 1 ) = fYt j t (yt j t; )fXt jZ t 1 (xt jz t 1 ; );

where fYt j t ( j t ; ) is a known functional form up to some unknown K 1 parameter vector ;


and fXt jZ t 1 ( jz t 1 ; ) is a known or unknown parametric function with some unknown parameter
. Note that fYt j t (yt j t ; ) is a function of rather than while fXt jZ t 1 (xt jz t 1 ; ) is a function
of rather than : This is called a variation-free parameters assumption. It follows that the
model log-likelihood function

X
n
ln fZ n (z n ) = ln fYt j t (yt j t; )
t=1
X
n
+ ln fXt jZ t 1 (xt jz t 1 ; ):
t=1

9
If we are interested in using the extended information set t = (Xt0 ; Z t 10 )0 to predict the dis-
tribution of Yt ; then is called the parameter of interest, and is called the nuisance
parameter. In this case, to estimate , we only need to focus on modeling the conditional
pdf/pmf fYt j t (yj t ; ): This follows because the second part of the likelihood function does
not depend on so that the maximization of ln fZ n (z n ) with respect to is equivalent to the
maximization of the …rst part of the likelihood with respect to :
We now introduce various conditional distributional models. For simplicity, we only consider
i.i.d. observations so that fYt j t (yj t ; ) = fYt jXt (yjXt ; ).

Example 1 [Linear Regression Model with Normal Errors]: Suppose Zt = (Yt ; Xt0 )0 is
i.i.d., Yt = Xt0 o + "t ; where "t jXt N (0; 2o ): Then the conditional pdf of Yt jXt is

1 1
(y x0 ) 2
fYt jXt (yjx; ) = p e 2 2 ;
2 2

where = ( 0; 2 0
) : This is a classical linear regression model discussed in Chapter 3.

Example 2 [Logit Model]: Suppose Zt = (Yt ; Xt0 )0 is i.i.d., Yt is a binary random variable
taking either value 1 or value 0, and
(
o
(Xt0 ) if yt = 1;
P (Yt = yt jXt ) = o
1 (Xt0 ) if yt = 0;

where
1
(u) = ; 1 < u < 1;
1 + exp( u)
is the CDF of the logistic distribution. We have

fYt jXt (yt jXt ; ) = (Xt0 )yt [1 (Xt0 )]1 yt


:

Example 3 [Probit Model]: Suppose Zt = (Yt ; Xt0 )0 is i.i.d., and Yt is a binary random
variable such that (
(Xt0 o ) if yt = 1
P (Yt = yt jXt ) = o
1 (Xt0 ) if yt = 0;
where ( ) is the CDF of the N(0,1) distribution. We have

fYt jXt (yt jXt ; ) = (Xt0 )yt [1 (Xt0 )]1 yt


:

There are wide applications of the logit and probit models. For example, a consumer chooses
a particular brand of car; a student decides to go to PHD study, etc.

10
Example 4 [Censored Regression (Tobit) Models]: A dependent variable Yt is called
censored when the response Yt cannot take values below (left censored) or above (right censored)
a certain threshold value. For example, the investment can only be zero or positive (when no
borrowing is allowed). The censored data are mixed continuous-discrete. Suppose the data
generating process is
Yt = Xt0 o + "t ;

where f"t g i:i:d:N (0; 2o ): When Yt > c; we observe Yt = Yt . When Yt c; we only have the
o
record Yt = c: The parameter should not be estimated by regressing Yt on Xt based on the
subsample with Yt > c; because the data with Yt = c contain relevant information about o and
2
o : More importantly, in the subsample with Yt > c, "t is a truncated distribution with nonzero
mean (i.e., E("t jYt > c) 6= 0 and E(Xt "t jYt > c) 6= 0). Therefore, OLS is not consistent for o
if one only uses the subsample consisting of observations of Yt > c and throw away observations
with Yt = c:

Question: How to estimate o given an observed sample fYt ; Xt0 gnt=1 where some observations
of Yt are censored? Suppose Zt = (Yt ; Xt0 )0 is i.i.d., with the observed dependent variable
(
Yt if Yt > c
Yt =
c if Yt c;

where Yt = Xt0 o + "t and "t jXt i:i:d:N (0; 2


o ): We assume that the threshold c is known.
Then we can write

Yt = max(Yt ; c)
= max(Xt0 o
+ "t ; c):

De…ne a dummy variable indicating whether Yt > c or Yt c;


(
1 if Yt > c (i.e., if Yt > c)
Dt =
0 if Yt = c (i.e., if Yt c):

Then the pdf of Yt jXt is

fYt jXt (yt jxt ; )


Dt
1 1
(yt x0t )2
= p e 2 2
2 2
1 Dt
c x0t
;

11
where ( ) is the N (0; 1) CDF, and the second part is the conditional probability

P (Yt = cjXt )
= P (Yt cjXt )
= P ("t c Xt0 jXt )
"t c Xt0
= P jXt

c Xt0
= ;

"t
given jXt N (0; 1):

Question: Can you give some examples where this model can be applied?

One example is a survey on unemployment spells. At the terminal date of the survey, the
recorded time length of an unemployed worker is not the duration when his layo¤ will last.
Another example is a survey on cancer patients. Those who have survived up to the ending date
of the survey will usually live longer than the survival duration recorded.

Example 5 [Truncated Regression Models]: A random sample is called truncated if we


know before hand that observations can come only from a restricted part of the underlying
population distribution. The truncation can come from below, from above, or from both sides.
We now consider an example where the truncation is from below with a known truncation point.
More speci…cally, assume that the data generating process is

Yt = Xt0 o
+ "t ;

where "t jXt i:i:d:N (0; 2o ): Suppose only those of Yt whose values are larger than or equal
to constant c are observed, where c is known. That is, we observe Yt = Yt if and only if
Yt = Xt0 o + "t c: The observations with Yt < c are not recorded. Assume the resulting
n
sample is fYt ; Xt gt=1 ; where fYt ; Xt g is i.i.d. We now analyze the e¤ect of truncation for this
model. For the observed sample, Yt c and so "t comes from the truncated version of the
2 0 o
distribution N (0; o ) with "t c Xt : It follows that E(Xt "t jYt c) 6= 0 and therefore the
0
OLS estimator based on the observed sample fYt ; Xt g is not consistent.
Because the observation Yt is recorded if and only if Yt c; the conditional probability
distribution of Yt given Xt is the same as the probability distribution of Yt given Xt and Yt > c:

12
Hence, for any observed sample point (yt ; xt ); we have

fYt jXt (yt jxt ; ) = fYt jXt ;(Yt >c) (yt jxt ; Yt > c)
fYt jXt ;(Yt >c) (yt jxt ; Yt > c)P (Yt > cjxt )
=
P (Yt > cjxt )
fYt jXt (yt jxt )
=
P (Yt > cjxt )
1 1 0 2
= p e 2 2 (yt xt )
2 2
1
;
c x0t
1

where = ( 0; 2
);and the conditional probability

P (Yt > cjXt ) = 1 P (Yt cjXt )


0
"t c Xt
= 1 P jXt

c Xt0
= 1 :

Question: Can you give some examples where this model can be applied?

Example 6 [Loan applications]: Only those successful loan applications will be recorded.

Example 7 [Students and Examination Scores]:


Suppose we are interested in investigating how the examination scores of students depend
on their e¤ort, family support, and high schools, and we have a sample from those who have
been admitted to colleges. This sample is obviously a truncated sample because we do not
observe those who are not admitted to colleges because their scores are below certain minimum
requirements.

Question: How to estimate in a conditional distribution model fYt j t (yj t; )?

We …rst introduce the likelihood function.

De…nition 9.1 [Likelihood Function]: The joint pdf/pmf of the random sample Z n =
(Z1 ; Z2 ; :::; Zn ) as a function of ( ; )

Ln ( ; ; z n ) = fZ n (z n ; ; )

13
is called the likelihood function of Z n when z n is observed. Moreover, ln Ln ( ; ; z n ) is called the
log-likelihood function of Z n when z n is observed.

Remarks:
The likelihood function Ln ( ; ; z n ) is algebraically identical to the joint probability density
function fZ n (z n ; ; ) of the random sample Z n taking value z n : Thus, given ( ; ); Ln ( ; ; z n )
can be viewed as a measure of the probability or likelihood with which the observed sample z n
will occur.

Lemma 9.1 [Variation-Free Parameter Spaces]: Suppose and are variation-free over
parameter spaces ; in the sense that for all ( ; ) 2 ; we have

fZt j t (zt j t; ; ) = fYt j t (yt j t; )fXt jZ t 1 (xt jZ t 1 ; );

where t = (Xt0 ; Z t ) : Then the likelihood function of Z n given Z n = z n can be written as


10 0

Y
n Y
n
n
Ln ( ; ; z ) = fYt j t (yt j t; ) fXt jZ t 1 (xt jZ t 1 ; );
t=1 t=1

and the log-likelihood function

X
n
n
ln Ln ( ; ; z ) = ln fYt j t (yt j t; )
t=1
X
n
+ ln fXt jZ t 1 (xt jZ t 1 ; ):
t=1

Suppose we are interested in predicting Yt using the extended information set t = (Xt0 ; Z t 10 )0 :
Then only the …rst part of the log-likelihood is relevant, and is called the parameter of interest.
The other parameter ; appearing in the second part of the log-likelihood function, is called the
nuisance parameter.

We now de…ne an estimation method based on maximizing the conditional log-likelihood


P
function nt=1 ln fYt j t (yt j t ; ):

De…nition 9.2 [(Quasi-)Maximum Likelihood Estimator for Parameters of Interest


; (Q)MLE]: The MLE ^ for 2 is de…ned as

Y
n
^ = arg max fYt j t (Yt j t; )
2
t=1
X
n
= arg max ln fYt j t (Yt j t; );
2
t=1

14
where is a parameter space. When the conditional probability distribution model fYt j t (yj t ; )
is correctly speci…ed in the sense that there exists some parameter value 2 such that
fYt j t (yj t ; ) coincides with the true conditional distribution of Yt given t , then ^ is called the
maximum likelihood estimator (MLE); when fYt j t (yj t ; ) is misspeci…ed in the sense that there
exists no parameter value 2 such that fYt j t (yj t ; ) coincides with the true conditional
distribution of Yt given t , ^ is called the quasi-maximum likelihood estimator (QMLE).

Remarks:

By the nature of the objective function, the MLE gives a parameter estimate which makes
the observed sample z n most likely to occur. By choosing a suitable parameter ^ 2 ; MLE
maximizes the probability that Z n = z n ; that is, the probability that the random sample Z n
takes the value of the observed data z n : Note that MLE and QMLE may not be unique.
The MLE is obtained over ; where may be subject to some restriction. An example is
the GARCH model where some parameters have to be restricted in order to ensure that the
estimated conditional variance is nonnegative (e.g., Nelson and Cao 1992).
Under regularity conditions, we can characterize the MLE by a …rst order condition. Like
the GMM estimator, However, there is usually no closed form for the MLE ^ : The solution ^
has to be searched by computers. The most popular methods used in economics are BHHH, and
Gauss-Newton.

Question: When does the MLE exist?

Suppose the likelihood function is continuous in 2 and parameter space is compact. Then
a global maximizer ^ 2 exists.

Theorem 9.2 [Existence of MLE/QMLE] Suppose for each 2 ; where is a compact pa-
rameter space, fYt j t (Yt j t ; ) is a measurable function of (Yt ; t ), and for each t; fYt j t (Yt j t ; )
is continuous in 2 : Then MLE/QMLE ^ exists.

This result is analogous to the Weierstrass Theorem in multivariate calculus that any contin-
uous function over a compact support always has a maximum and a minimum.

9.3 Statistical Properties of MLE/QMLE


For notational simplicity, from now on we will write the conditional pdf/pmf of Yt given t
as
fYt j t (yj t; ) = f (yj t; ); 1<y<1

We …rst provide a set of regularity conditions.

15
0
Assumption 9.1 [Parametric Distribution Model]: (i) fZt = (Yt ; Xt0 ) gnt=1 is a stationary
ergodic process, and (ii) f (yt j t ; ) is a conditional pdf/pmf model of Yt given t = (Xt0 ; Z t 10 )0 ;
where Z t 1 = (Zt0 1 ; Zt0 2 ; ; Z10 )0 : For each ; ln f (Yt j t ; ) is measurable with respect to
observations (Yt ; t ), and for each t; ln f (Yt j t ; ) is continuous in 2 ; where is a …nite-
dimensional parameter space.

Assumption 9.2 [Compactness]: Parameter space is compact.

Assumption 9.3 [Uniform WLLN]: fln f (Yt j t; ) E ln f (Yt j t; )g obeys the uniform weak
law of large numbers (UWLLN), i.e.,

X
n
p
1
sup n ln f (Yt j t; ) l( ) ! 0
2 t=1

where the population log-likelihood function

l( ) = E [ln f (Yt j t; )]

is continuous in 2 :

Assumption 9.4 [Identi…cation]:

= arg max l( )
2

is the unique maximizer of l( ) over .

Question: What is the interpretation of ?

Assumption 9.4 is an identi…cation condition which states that is a unique solution that
maximizes l( ); the expected value of the logarithmic conditional likelihood function ln f (Yt j t ; ).
So far, there is no economic interpretation for : This is analogous to the best linear least squares
approximation coe¢ cient = arg min E(Y X 0 )2 in Chapter 2.

9.3.1 Consistency
We …rst consider the consistency property of ^ for : Because we assume that is compact,
^ and may be corner solutions. Thus, we have to use the extrema estimator lemma to prove
the consistency of the MLE/QMLE ^ :

Theorem 9.3 [Consistency of MLE/QMLE]: Suppose Assumptions 9.1–9.4 hold. Then as


n ! 1;
p
^ ! 0:

16
Proof: Applying the extrema estimator lemma in Chapter 8, with

X
n
^ )=n
Q( 1
ln f (Yt j t; )
t=1

and
Q( ) = l( ) E[ln f (Yt j t; )]:
^ ) and Q( ) in the extrema estimator
Assumptions 9.1–9.4 ensure that all conditions for Q(
p
lemma are satis…ed. It follows that ^ ! as n ! 1:

Model Speci…cation and Interpretation of

De…nition 9.3 [Correct Speci…cation for Conditional Distribution] The model f (yt j t ; )
is correctly speci…ed for the conditional distribution of Yt given t if there exists some parameter
value o 2 such that f (yt j t ; o ) coincides with the true conditional pdf/pmf of Yt given t :

Under correct speci…cation of f (yj t ; ); the parameter value o is usually called the true
model parameter value. It will usually have economic interpretation.

Question: What are the implications of correct speci…cation of a conditional distributional


model f (yj t ; )?

Lemma 9.4: Suppose Assumption 9.4 holds, and the model f (yt j t ; ) is correctly speci…ed for
the conditional distribution of Yt given t : Then f (yt j t ; ) coincides with the true conditional
pdf/pmf f (yt j t ; o ) of Yt given t ; where is as given in Assumption 9.4 : In other words, the
population likelihood maximizer coincides with the true parameter value o when the model
f (yt j t ; ) is correctly speci…ed for the conditional distribution of Yt given t .

Proof: Because f (yj t ; ) is correctly speci…ed for the conditional distribution of Yt given t;
there exists some o 2 such that

l( ) = E[ln f (Yt j t; )]
= EfE[ln f (Yt j t ; )j t ]g by LIE
Z
= E ln[f (yj t ; )]f (yj t ; o )dy;

where the second equality follows from LIE and the expectation E( ) in the third equality is
taken with respect to the true distribution of the random variables in t :

17
By Assumption 9.4, we have l( ) l( ) for all 2 : By the law of iterated expectations,
it follows that
Z
E ln[f (yj t ; )]f (yj t ; o )dy
Z
E ln[f (yj t ; )]f (yj t ; o )dy;

o o
where f (yt j t; ) is the true conditional pdf/pmf. Hence, by choosing = ; we have
Z
o o
E ln[f (yj t; )]f (yj t; )dy
Z
o
E ln[f (yj t; )]f (yj t; )dy:

On the other hand, by Jensen’s inequality and the concavity of the logarithmic function, we have
Z Z
o
ln[f (yj t ; )]f (yj t ; )dy ln[f (yj t ; o )]f (yj t ; o )dy
Z
f (yj ; )
= ln o f (yj t ; o )dy
f (yj t ; )
Z
f (yj ; )
ln f (yj t ; o )dy
f (yj t ; o )
Z
= ln f (yj ; )dy

= ln(1)
= 0;
R
where we have made use of the fact that f (yj t; )dy = 1 for all 2 : Therefore, we have
Z
o
ln [f (yj t; )] f (yj t; )dy
Z
o
ln[f (yj t; )]f (yj t; )dy:

Therefore, by taking the expectation with respect to the distribution of t; we obtain


Z
E ln[f (yj t ; )]f (yj t ; o )dy
Z
E ln[f (yj t ; o )]f (yj t ; o )dy:

o
It follows that we must have = ; otherwise cannot be the the maximizer of l( ) over :
This completes the proof.

18
Remarks:

This lemma provides an interpretation of in Assumption 9.4: That is, the population
likelihood maximizer coincides with the true model parameter o when f (yj t ; ) is correctly
speci…ed. Thus, by maximizing the population model log-likelihood function l( ); we can obtain
the true parameter value o :
p
Under Theorem 9.3, we have ^ ! as n ! 1. Furthermore, by correct speci…cation
for conditional distribution (i.e., Lemma 9.4), we know = o , where o is the true model
p
parameter. Thus, we have ^ ! o as n ! 1.
This is essentially equivalent to the consistency in the linear regression context, in which,
^
OLS always converges to no matter whether the model is correctly speci…ed. And only when
the model we have coincides with the true model, we have = o and then ^ OLS will converge
to the true model parameter o :Otherwise, our estimation will be biased since ^ OLS does not
converge to o , as n ! 1:

9.3.2 Implication of Correct Model Speci…cation


We now examine some important implications of correct model speci…cation. For this pur-
pose, we assume that o is an interior point of the parameter space ; so that we can impose
di¤erentiability condition on the log-likelihood function ln f (yj t ; ) at o :
o
Assumption 9.5: 2 int( ) :

Question: Why do we need this assumption? This assumption is needed for the purpose of
taking a Taylor series expansion.

We first state an important implication of a correctly specified conditional distribution model for $Y_t$ given $\Psi_t$.

Lemma 9.5 [The MDS Property of the Score Function of a Correctly Specified Conditional Distribution Model]: Suppose that for each $t$, $\ln f(Y_t|\Psi_t,\theta)$ is continuously differentiable with respect to $\theta\in\Theta$. Define the $K\times 1$ score function
$$S_t(\theta) = \frac{\partial}{\partial\theta}\ln f(Y_t|\Psi_t,\theta).$$
If $f(y|\Psi_t,\theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$, then
$$E[S_t(\theta^o)|\Psi_t] = 0 \quad\text{a.s.},$$
where $\theta^o$ is as in Assumption 9.4 and satisfies Assumption 9.5, and $E(\cdot|\Psi_t)$ is the expectation taken over the true conditional distribution of $Y_t$ given $\Psi_t$.

Proof: Note that for any given $\theta\in\Theta$, $f(y|\Psi_t,\theta)$ is a valid pdf. Thus we have
$$\int_{-\infty}^{\infty} f(y|\Psi_t,\theta)\,dy = 1.$$
When $\theta\in\mathrm{int}(\Theta)$, by differentiation we have
$$\frac{\partial}{\partial\theta}\int_{-\infty}^{\infty} f(y|\Psi_t,\theta)\,dy = 0.$$
By exchanging differentiation and integration (assuming that we can do so), we have
$$\int_{-\infty}^{\infty} \frac{\partial}{\partial\theta} f(y|\Psi_t,\theta)\,dy = 0,$$
which can be further written as
$$\int_{-\infty}^{\infty} \frac{\partial\ln f(y|\Psi_t,\theta)}{\partial\theta}\, f(y|\Psi_t,\theta)\,dy = 0.$$
This relationship holds for all $\theta\in\mathrm{int}(\Theta)$, including $\theta^o$. It follows that
$$\int_{-\infty}^{\infty} \frac{\partial\ln f(y|\Psi_t,\theta^o)}{\partial\theta}\, f(y|\Psi_t,\theta^o)\,dy = 0,$$
where
$$\frac{\partial\ln f(y|\Psi_t,\theta^o)}{\partial\theta} = \left.\frac{\partial\ln f(y|\Psi_t,\theta)}{\partial\theta}\right|_{\theta=\theta^o}.$$
Because $f(y|\Psi_t,\theta^o)$ is the true conditional pdf/pmf of $Y_t$ given $\Psi_t$ when $f(y|\Psi_t,\theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$, we have
$$E[S_t(\theta^o)|\Psi_t] = 0.$$
This completes the proof.

Note that $E[S_t(\theta^o)|\Psi_t]=0$ implies that $E[S_t(\theta^o)|Z^{t-1}]=0$; namely, $\{S_t(\theta^o)\}$ is an MDS.
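The zero conditional mean of the score can also be checked numerically. Below is a minimal Monte Carlo sketch (not part of the original notes; all names and parameter values are illustrative) for a correctly specified Gaussian model: conditioning information is fixed, many draws of $Y_t$ are generated, and the sample averages of both score components are close to zero, as Lemma 9.5 implies.

```python
# A Monte Carlo sketch (illustrative assumptions) of Lemma 9.5: for a correctly
# specified model Y_t | Psi_t ~ N(mu, sigma^2), the score evaluated at the true
# parameters has (conditional) mean zero. We "condition" on a fixed Psi_t by
# fixing mu and sigma, draw many Y's, and average the two score components.
import numpy as np

rng = np.random.default_rng(0)
mu_true, sig2_true = 1.5, 0.8
y = rng.normal(mu_true, np.sqrt(sig2_true), size=1_000_000)

# Score components of the Gaussian log-density with respect to (mu, sigma^2)
score_mu = (y - mu_true) / sig2_true
score_sig2 = -0.5 / sig2_true + 0.5 * (y - mu_true)**2 / sig2_true**2

print(score_mu.mean(), score_sig2.mean())   # both close to zero
```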

Question: Suppose $E[S_t(\theta^o)|\Psi_t]=0$ for some $\theta^o\in\Theta$. Can we claim that the conditional pdf/pmf model is correctly specified?

Answer: No. The MDS property is only one of many implications of correct model specification. In a certain sense, the MDS property is equivalent to correct specification of the conditional mean. Misspecification of $f(y|\Psi_t,\theta)$ may occur in higher order conditional moments of $Y_t$ given $\Psi_t$. Below is an example in which $\{S_t(\theta^o)\}$ is an MDS but the model $f(y_t|\Psi_t,\theta)$ is misspecified.

Example 1: Suppose $\{Y_t\}$ is a univariate time series process such that
$$Y_t = \mu_t(\theta) + \sigma_t(\theta)z_t,$$
where $\mu_t(\theta^o)=E(Y_t|I_{t-1})$ for some $\theta^o$ and $I_{t-1}=\{Y_{t-1},Y_{t-2},\ldots,Y_1\}$, but $\sigma_t^2(\theta)\neq\mathrm{var}(Y_t|I_{t-1})$ for all $\theta$. Then correct model specification for the conditional mean $E(Y_t|I_{t-1})$ implies that $E(z_t|I_{t-1})=0$. Assume that $\{z_t\}\sim$ i.i.d. $N(0,1)$. Then the conditional probability density model is
$$f(y|\Psi_t,\theta) = \frac{1}{\sqrt{2\pi\sigma_t^2(\theta)}}\exp\left[-\frac{(y-\mu_t(\theta))^2}{2\sigma_t^2(\theta)}\right],$$
where $\Psi_t=I_{t-1}$. It is straightforward to verify that
$$E[S_t(\theta^o)|\Psi_t] = E[S_t(\theta^o)|I_{t-1}] = 0,$$
although the conditional variance $\sigma_t^2(\theta)$ is misspecified for $\mathrm{var}(Y_t|I_{t-1})$.

Next, we state another important implication of a correctly specified conditional distribution model for $Y_t$ given $\Psi_t$.

Lemma 9.6 [Conditional Information Matrix Equality]: Suppose Assumptions 9.1–9.5 hold, $f(y|\Psi_t,\theta)$ is twice continuously differentiable with respect to $\theta\in\mathrm{int}(\Theta)$, and $f(y_t|\Psi_t,\theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$. Then
$$E[S_t(\theta^o)S_t(\theta^o)' + H_t(\theta^o)\,|\,\Psi_t] = 0,$$
where
$$H_t(\theta) \equiv \frac{d}{d\theta'}S_t(\theta) = \frac{\partial^2}{\partial\theta\,\partial\theta'}\ln f(Y_t|\Psi_t,\theta),$$
or equivalently,
$$E\left[\frac{\partial}{\partial\theta}\ln f(Y_t|\Psi_t,\theta^o)\frac{\partial}{\partial\theta'}\ln f(Y_t|\Psi_t,\theta^o)\,\Big|\,\Psi_t\right]
= -E\left[\frac{\partial^2}{\partial\theta\,\partial\theta'}\ln f(Y_t|\Psi_t,\theta^o)\,\Big|\,\Psi_t\right].$$

Proof: For all $\theta\in\Theta$, we have
$$\int_{-\infty}^{\infty} f(y|\Psi_t,\theta)\,dy = 1.$$
By differentiation with respect to $\theta\in\mathrm{int}(\Theta)$, we obtain
$$\frac{\partial}{\partial\theta}\int_{-\infty}^{\infty} f(y|\Psi_t,\theta)\,dy = 0.$$
Exchanging differentiation and integration, we have
$$\int_{-\infty}^{\infty} \frac{\partial f(y|\Psi_t,\theta)}{\partial\theta}\,dy = 0,
\qquad
\int_{-\infty}^{\infty} \frac{\partial\ln f(y|\Psi_t,\theta)}{\partial\theta}\, f(y|\Psi_t,\theta)\,dy = 0.$$
Differentiating the last equation once more, we have
$$\frac{\partial}{\partial\theta'}\int_{-\infty}^{\infty} \frac{\partial\ln f(y|\Psi_t,\theta)}{\partial\theta}\, f(y|\Psi_t,\theta)\,dy
= \int_{-\infty}^{\infty} \frac{\partial}{\partial\theta'}\left[\frac{\partial\ln f(y|\Psi_t,\theta)}{\partial\theta}\, f(y|\Psi_t,\theta)\right]dy$$
$$= \int_{-\infty}^{\infty} \frac{\partial^2\ln f(y|\Psi_t,\theta)}{\partial\theta\,\partial\theta'}\, f(y|\Psi_t,\theta)\,dy
+ \int_{-\infty}^{\infty} \frac{\partial\ln f(y|\Psi_t,\theta)}{\partial\theta}\,\frac{\partial f(y|\Psi_t,\theta)}{\partial\theta'}\,dy$$
$$= \int_{-\infty}^{\infty} \frac{\partial^2\ln f(y|\Psi_t,\theta)}{\partial\theta\,\partial\theta'}\, f(y|\Psi_t,\theta)\,dy
+ \int_{-\infty}^{\infty} \frac{\partial\ln f(y|\Psi_t,\theta)}{\partial\theta}\,\frac{\partial\ln f(y|\Psi_t,\theta)}{\partial\theta'}\, f(y|\Psi_t,\theta)\,dy = 0.$$
The above relation holds for all $\theta\in\mathrm{int}(\Theta)$, including $\theta^o$. This and the fact that $f(y|\Psi_t,\theta^o)$ is the true conditional pdf/pmf of $Y_t$ given $\Psi_t$ imply the desired conditional information matrix equality stated in the lemma. This completes the proof.

Remarks:

The $K\times K$ matrix
$$E[S_t(\theta^o)S_t(\theta^o)'\,|\,\Psi_t]
= E\left[\frac{\partial\ln f(Y_t|\Psi_t,\theta^o)}{\partial\theta}\frac{\partial\ln f(Y_t|\Psi_t,\theta^o)}{\partial\theta'}\,\Big|\,\Psi_t\right]$$
is called the conditional Fisher's information matrix of $Y_t$ given $\Psi_t$. It measures the content of the information contained in the random variable $Y_t$ conditional on $\Psi_t$: the larger this expectation is, the more information $Y_t$ contains.

Question: What is the implication of the conditional information matrix equality?

In a certain sense, the IM equality can be viewed as equivalent to correct specification of the conditional variance. It has important implications for the form of the asymptotic variance of the MLE. More specifically, the IM equality will simplify the asymptotic variance of the MLE in the same way as conditional homoskedasticity simplifies the asymptotic variance of the OLS estimator.

To investigate the asymptotic distribution of $\sqrt{n}(\hat\theta-\theta^o)$, we need the following conditions.

Assumption 9.6: (i) For each $t$, $\ln f(y_t|\Psi_t,\theta)$ is twice continuously differentiable with respect to $\theta\in\Theta$; (ii) $\{S_t(\theta^o)\}$ obeys a CLT, i.e.,
$$\sqrt{n}\,\hat S(\theta^o) \equiv n^{-1/2}\sum_{t=1}^{n}S_t(\theta^o) \overset{d}{\to} N(0,V_o)$$
for some $K\times K$ matrix $V_o \equiv \mathrm{avar}[n^{-1/2}\sum_{t=1}^{n}S_t(\theta^o)]$ which is symmetric, finite and positive definite; (iii) $\{H_t(\theta)\equiv\frac{\partial^2}{\partial\theta\,\partial\theta'}\ln f(y_t|\Psi_t,\theta)\}$ obeys a uniform weak law of large numbers (UWLLN) over $\Theta$. That is, as $n\to\infty$,
$$\sup_{\theta\in\Theta}\left\|n^{-1}\sum_{t=1}^{n}H_t(\theta) - H(\theta)\right\| \overset{p}{\to} 0,$$
where the $K\times K$ Hessian matrix
$$H(\theta) \equiv E[H_t(\theta)] = E\left[\frac{\partial^2\ln f(Y_t|\Psi_t,\theta)}{\partial\theta\,\partial\theta'}\right]$$
is symmetric, finite and nonsingular, and is continuous in $\theta\in\Theta$.


Question: What is the form of the asymptotic variance $V_o$ of $\sqrt{n}\,\hat S(\theta^o)$ when $f(y|\Psi_t,\theta)$ is correctly specified?

By the stationary MDS property of $S_t(\theta^o)$ with respect to $\Psi_t$, we have
$$V_o \equiv \mathrm{avar}\left[n^{-1/2}\sum_{t=1}^{n}S_t(\theta^o)\right]
= E\left\{\left[n^{-1/2}\sum_{t=1}^{n}S_t(\theta^o)\right]\left[n^{-1/2}\sum_{\tau=1}^{n}S_\tau(\theta^o)\right]'\right\}
= n^{-1}\sum_{t=1}^{n}\sum_{\tau=1}^{n}E[S_t(\theta^o)S_\tau(\theta^o)']
= E[S_t(\theta^o)S_t(\theta^o)'],$$
where the expectations of the cross-products, $E[S_t(\theta^o)S_\tau(\theta^o)']$, are identically zero for all $t\neq\tau$, as implied by the MDS property of $\{S_t(\theta^o)\}$ from the lemma on the score function.

Furthermore, from the conditional information matrix equality, we have
$$V_o = E[S_t(\theta^o)S_t(\theta^o)'] = -H_o.$$
Note that $H_o$ is a $K\times K$ symmetric negative definite matrix.

9.3.3 Asymptotic Distribution

Next, we derive the asymptotic normality of the MLE.

Theorem 9.7 [Asymptotic Normality of MLE]: Suppose Assumptions 9.1–9.6 hold, and $f(y_t|\Psi_t,\theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$. Then
$$\sqrt{n}(\hat\theta-\theta^o) \overset{d}{\to} N(0,\,-H_o^{-1}).$$

Proof: Because $\theta^o$ is an interior point in $\Theta$ and $\hat\theta-\theta^o\overset{p}{\to}0$ as $n\to\infty$, we have $\hat\theta\in\mathrm{int}(\Theta)$ for $n$ sufficiently large. It follows that the FOC of maximizing the log-likelihood holds when $n$ is sufficiently large:
$$\hat S(\hat\theta) \equiv n^{-1}\sum_{t=1}^{n}\frac{\partial\ln f(Y_t|\Psi_t,\hat\theta)}{\partial\theta}
= n^{-1}\sum_{t=1}^{n}S_t(\hat\theta) = 0.$$

The FOC provides a link between MLE and GMM: MLE can be viewed as a GMM estimation with the moment condition
$$E[m_t(\theta^o)] = E[S_t(\theta^o)] = 0 \quad\text{for some } \theta^o$$
in an exact identification case.


By a first order Taylor series expansion of $\hat S(\hat\theta)$ around the true parameter $\theta^o$, we have
$$0 = \sqrt{n}\,\hat S(\hat\theta) = \sqrt{n}\,\hat S(\theta^o) + \hat H(\bar\theta)\,\sqrt{n}(\hat\theta-\theta^o),$$
where $\bar\theta$ lies between $\hat\theta$ and $\theta^o$, namely $\bar\theta = a\hat\theta + (1-a)\theta^o$ for some $a\in[0,1]$, and
$$\hat H(\bar\theta) = n^{-1}\sum_{t=1}^{n}H_t(\bar\theta) = n^{-1}\sum_{t=1}^{n}\frac{\partial^2\ln f(Y_t|\Psi_t,\bar\theta)}{\partial\theta\,\partial\theta'}$$
is the derivative of $\hat S(\theta)$ evaluated at $\bar\theta$. Given that $\hat\theta-\theta^o\overset{p}{\to}0$, we have
$$\|\bar\theta-\theta^o\| = \|a(\hat\theta-\theta^o)\| \le \|\hat\theta-\theta^o\| \overset{p}{\to} 0.$$
Also, by the triangle inequality, the UWLLN for $\{H_t(\theta)\}$ over $\Theta$ and the continuity of $H(\theta)$, we obtain
$$\|\hat H(\bar\theta) - H_o\|
= \|\hat H(\bar\theta) - H(\bar\theta) + H(\bar\theta) - H(\theta^o)\|
\le \sup_{\theta\in\Theta}\|\hat H(\theta) - H(\theta)\| + \|H(\bar\theta) - H(\theta^o)\|
\overset{p}{\to} 0.$$
Because $H_o$ is nonsingular, so is $\hat H(\bar\theta)$ for $n$ sufficiently large. Therefore, from the FOC we have
$$\sqrt{n}(\hat\theta-\theta^o) = -\hat H^{-1}(\bar\theta)\,\sqrt{n}\,\hat S(\theta^o)$$
for $n$ sufficiently large. [Compare with the OLS estimator $\sqrt{n}(\hat\beta-\beta^o)=\hat Q^{-1}n^{-1/2}X'\varepsilon$.]

Next, we consider $\sqrt{n}\,\hat S(\theta^o)$. By the CLT, we have
$$\sqrt{n}\,\hat S(\theta^o) \overset{d}{\to} N(0,V_o),$$

where, as we have shown above,
$$V_o \equiv \mathrm{avar}[\sqrt{n}\,\hat S(\theta^o)] = E[S_t(\theta^o)S_t(\theta^o)'],$$
given that $\{S_t(\theta^o)\}$ is an MDS with respect to $\Psi_t$.

It follows by the Slutsky theorem that
$$\sqrt{n}(\hat\theta-\theta^o) = -\hat H^{-1}(\bar\theta)\,\sqrt{n}\,\hat S(\theta^o)
\overset{d}{\to} N(0,\,H_o^{-1}V_oH_o^{-1}) \equiv N(0,\,-H_o^{-1}),$$
or equivalently
$$\sqrt{n}(\hat\theta-\theta^o) \overset{d}{\to} N(0,\,H_o^{-1}V_oH_o^{-1}) \equiv N(0,\,V_o^{-1}),$$
using the information matrix equality $V_o=E[S_t(\theta^o)S_t(\theta^o)']=-H_o$. This completes the proof.

Remarks:

Now it is easy to understand why $V_o=E[S_t(\theta^o)S_t(\theta^o)']=-H_o$ is called the information matrix of $Y_t$ given $\Psi_t$: the larger $-H_o$ is, the smaller the asymptotic variance of $\hat\theta$ is (i.e., the more precise the estimator $\hat\theta$ is). Intuitively, as a measure of the curvature of the population log-likelihood function, the magnitude of $H_o$ characterizes the sharpness of the peak of the population log-likelihood function at $\theta^o$.

The simplification of $H_o^{-1}V_oH_o^{-1}$ to $-H_o^{-1}$ by the information matrix equality is similar in spirit to the simplification of the asymptotic variance of the OLS estimator under conditional homoskedasticity.

9.3.4 Efficiency of MLE

From statistical theory, it is well known that the asymptotic variance of the MLE $\hat\theta$ achieves the Cramer-Rao lower bound. Therefore, the MLE $\hat\theta$ is asymptotically most efficient.

Question: What is the Cramer-Rao lower bound?

We now discuss consistent estimation of the asymptotic variance-covariance matrix of the MLE.

Consistent Estimation of the Asymptotic Variance of the MLE

Because $\mathrm{avar}(\sqrt{n}\,\hat\theta)=V_o^{-1}=-H_o^{-1}$, there are two methods to estimate $\mathrm{avar}[\sqrt{n}(\hat\theta-\theta^o)]$.

Method 1: Use $-\hat H^{-1}(\hat\theta)$, where
$$\hat H(\hat\theta) = \frac{1}{n}\sum_{t=1}^{n}\frac{\partial^2\ln f(Y_t|\Psi_t,\hat\theta)}{\partial\theta\,\partial\theta'}.$$
This requires taking second derivatives of the log-likelihood function. By Assumption 9.6(iii) and $\hat\theta\overset{p}{\to}\theta^o$, we have $-\hat H^{-1}(\hat\theta)\overset{p}{\to}-H_o^{-1}$.

Method 2: Use $\hat V^{-1}$, where
$$\hat V \equiv \frac{1}{n}\sum_{t=1}^{n}S_t(\hat\theta)S_t(\hat\theta)'.$$
This requires computing only the first derivatives (i.e., score functions) of the log-likelihood function.

Suppose the $K\times K$ process $\{S_t(\theta)S_t(\theta)'\}$ obeys the UWLLN, namely
$$\sup_{\theta\in\Theta}\left\|n^{-1}\sum_{t=1}^{n}S_t(\theta)S_t(\theta)' - V(\theta)\right\| \overset{p}{\to} 0,$$
where $V(\theta)=E[S_t(\theta)S_t(\theta)']$ is continuous in $\theta$. Then if $\hat\theta\overset{p}{\to}\theta^o$, we can show that $\hat V\overset{p}{\to}V_o$. Note that $V_o=V(\theta^o)$.

Question: Which asymptotic variance estimator (Method 1 or Method 2) is better in finite samples?
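The two estimators can be compared directly on simulated data. The sketch below (not from the notes; the model and all names are illustrative assumptions) uses an i.i.d. $N(\mu,\sigma^2)$ model, where the score and Hessian are available in closed form, and computes the Hessian-based (Method 1) and outer-product-of-gradients (Method 2) variance estimators.

```python
# A minimal sketch comparing the two asymptotic variance estimators for the MLE
# of an i.i.d. N(mu, sigma^2) model: Method 1 uses the inverse negative Hessian,
# Method 2 uses the outer product of the scores (OPG).
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.normal(loc=1.0, scale=2.0, size=n)

mu_hat = y.mean()
sig2_hat = y.var()                      # MLE of sigma^2 (divides by n)
u = y - mu_hat

# Score of each observation with respect to (mu, sigma^2)
S = np.column_stack([u / sig2_hat,
                     -0.5 / sig2_hat + 0.5 * u**2 / sig2_hat**2])

# Average Hessian H_hat = n^{-1} sum_t H_t(theta_hat)
H_hat = np.array([
    [-1.0 / sig2_hat,            -np.mean(u) / sig2_hat**2],
    [-np.mean(u) / sig2_hat**2,   0.5 / sig2_hat**2 - np.mean(u**2) / sig2_hat**3],
])

V_hat = S.T @ S / n                     # OPG estimator

avar_method1 = -np.linalg.inv(H_hat)    # Method 1: -H_hat^{-1}
avar_method2 = np.linalg.inv(V_hat)     # Method 2: V_hat^{-1}
print(avar_method1)
print(avar_method2)                     # close to each other under correct specification
```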

9.3.5 MLE-based Hypothesis Testing

We now consider the hypothesis of interest
$$\mathbb{H}_0: R(\theta^o) = r,$$
where $R(\cdot)$ is a $J\times 1$ continuously differentiable vector function with the $J\times K$ matrix $R'(\theta^o)\equiv\partial R(\theta^o)/\partial\theta'$ being of full rank. We allow both linear and nonlinear restrictions on parameters. Note that in order for $R'(\theta^o)$ to be of full rank, we need the condition $J\le K$; that is, the number of restrictions is smaller than or at most equal to the number of unknown parameters.

We will introduce three test procedures, namely the Wald test, the Likelihood Ratio (LR) test, and the Lagrange Multiplier (LM) test. We now derive these tests respectively.
Wald Test

By the Taylor series expansion, $\mathbb{H}_0$, and the Slutsky theorem, we have
$$\sqrt{n}[R(\hat\theta)-r] = \sqrt{n}[R(\theta^o)-r] + R'(\bar\theta)\sqrt{n}(\hat\theta-\theta^o)
= R'(\bar\theta)\sqrt{n}(\hat\theta-\theta^o)
\overset{d}{\to} N[0,\,-R'(\theta^o)H_o^{-1}R'(\theta^o)'],$$
where $\bar\theta=a\hat\theta+(1-a)\theta^o$ for some $a\in[0,1]$. It follows that the quadratic form
$$n[R(\hat\theta)-r]'[-R'(\theta^o)H_o^{-1}R'(\theta^o)']^{-1}[R(\hat\theta)-r] \overset{d}{\to} \chi^2_J.$$
By the Slutsky theorem, we have the Wald test statistic
$$W = n[R(\hat\theta)-r]'[-R'(\hat\theta)\hat H^{-1}(\hat\theta)R'(\hat\theta)']^{-1}[R(\hat\theta)-r] \overset{d}{\to} \chi^2_J,$$
where again
$$\hat H(\hat\theta) = n^{-1}\sum_{t=1}^{n}\frac{\partial^2}{\partial\theta\,\partial\theta'}\ln f(Y_t|\Psi_t,\hat\theta).$$
Note that only the unconstrained MLE $\hat\theta$ is needed in constructing the Wald test statistic.

Theorem 9.8 [MLE-based Hypothesis Testing: Wald Test]: Suppose Assumptions 9.1–9.6 hold, and the model $f(y_t|\Psi_t,\theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$. Then under $\mathbb{H}_0: R(\theta^o)=r$, we have, as $n\to\infty$,
$$W \equiv n[R(\hat\theta)-r]'[-R'(\hat\theta)\hat H^{-1}(\hat\theta)R'(\hat\theta)']^{-1}[R(\hat\theta)-r] \overset{d}{\to} \chi^2_J.$$

Question: Do we have the following result: under $\mathbb{H}_0$,
$$\tilde W = n[R(\hat\theta)-r]'[R'(\hat\theta)\hat V^{-1}R'(\hat\theta)']^{-1}[R(\hat\theta)-r]
= [R(\hat\theta)-r]'\{R'(\hat\theta)[S(\hat\theta)'S(\hat\theta)]^{-1}R'(\hat\theta)'\}^{-1}[R(\hat\theta)-r] \overset{d}{\to} \chi^2_J$$
as $n\to\infty$, where
$$\hat V = n^{-1}\sum_{t=1}^{n}S_t(\hat\theta)S_t(\hat\theta)' = S(\hat\theta)'S(\hat\theta)/n,$$
and $S(\theta)=[S_1(\theta),S_2(\theta),\ldots,S_n(\theta)]'$ is an $n\times K$ matrix?

Answer: Yes. But Why?

Likelihood Ratio Test

Theorem 9.9 [Likelihood Ratio Test]: Suppose Assumptions 9.1–9.6 hold, and $f(y|\Psi_t,\theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$. Define the average log-likelihoods
$$\hat l(\hat\theta) = n^{-1}\sum_{t=1}^{n}\ln f(Y_t|\Psi_t,\hat\theta),
\qquad
\hat l(\tilde\theta) = n^{-1}\sum_{t=1}^{n}\ln f(Y_t|\Psi_t,\tilde\theta),$$
where $\hat\theta$ is the unconstrained MLE and $\tilde\theta$ is the constrained MLE subject to the constraint $R(\tilde\theta)=r$. Then under $\mathbb{H}_0: R(\theta^o)=r$, we have
$$LR = 2n[\hat l(\hat\theta) - \hat l(\tilde\theta)] \overset{d}{\to} \chi^2_J \quad\text{as } n\to\infty.$$

Proof: We shall use the following strategy of proof:

(i) Use a second order Taylor series expansion to approximate $2n[\hat l(\hat\theta)-\hat l(\tilde\theta)]$ by a quadratic form in $\sqrt{n}(\tilde\theta-\hat\theta)$.
(ii) Link $\sqrt{n}(\tilde\theta-\hat\theta)$ with $\sqrt{n}\,\tilde\lambda$, where $\tilde\lambda$ is the Lagrange multiplier of the constrained MLE.
(iii) Derive the asymptotic distribution of $\sqrt{n}\,\tilde\lambda$.
Then combining (i)–(iii) will give an asymptotic $\chi^2_J$ distribution for the LR test statistic $LR=2n[\hat l(\hat\theta)-\hat l(\tilde\theta)]$.

The unconstrained MLE $\hat\theta$ solves
$$\max_{\theta\in\Theta}\hat l(\theta).$$
The corresponding FOC is
$$\hat S(\hat\theta) = 0.$$
On the other hand, the constrained MLE $\tilde\theta$ solves the maximization problem
$$\max_{\theta\in\Theta}\left\{\hat l(\theta) + \lambda'[r - R(\theta)]\right\},$$

where $\lambda$ is a $J\times 1$ Lagrange multiplier vector. The corresponding FOCs are
$$\hat S(\tilde\theta) - R'(\tilde\theta)'\tilde\lambda = 0 \quad\big((K\times 1) - (K\times J)(J\times 1) = K\times 1\big),$$
$$R(\tilde\theta) - r = 0.$$
[Recall that $R'(\theta)'$ is a $K\times J$ matrix.] We now take a second order Taylor series expansion of $\hat l(\tilde\theta)$ around the unconstrained MLE $\hat\theta$:
$$2n[\hat l(\tilde\theta) - \hat l(\hat\theta)]
= 2n\,\hat S(\hat\theta)'(\tilde\theta-\hat\theta)
+ \sqrt{n}(\tilde\theta-\hat\theta)'\hat H(\bar\theta_a)\sqrt{n}(\tilde\theta-\hat\theta)
= \sqrt{n}(\tilde\theta-\hat\theta)'\hat H(\bar\theta_a)\sqrt{n}(\tilde\theta-\hat\theta),$$
where $\bar\theta_a$ lies between $\tilde\theta$ and $\hat\theta$, namely $\bar\theta_a=a\tilde\theta+(1-a)\hat\theta$ for some $a\in[0,1]$, and we have used $\hat S(\hat\theta)=0$. It follows that
$$2n[\hat l(\hat\theta) - \hat l(\tilde\theta)] = \sqrt{n}(\tilde\theta-\hat\theta)'[-\hat H(\bar\theta_a)]\sqrt{n}(\tilde\theta-\hat\theta). \tag{9.1}$$
This establishes the link between the LR test statistic and $\tilde\theta-\hat\theta$.


Next, we consider $\sqrt{n}(\tilde\theta-\hat\theta)$. By a Taylor expansion of $\hat S(\tilde\theta)$ around the unconstrained MLE $\hat\theta$ in the FOC $\hat S(\tilde\theta)-R'(\tilde\theta)'\tilde\lambda=0$, we have
$$\hat S(\hat\theta) + \hat H(\bar\theta_b)(\tilde\theta-\hat\theta) - R'(\tilde\theta)'\tilde\lambda = 0,$$
where $\bar\theta_b=b\hat\theta+(1-b)\tilde\theta$ for some $b\in[0,1]$. Given $\hat S(\hat\theta)=0$, we have
$$\hat H(\bar\theta_b)\sqrt{n}(\tilde\theta-\hat\theta) - R'(\tilde\theta)'\sqrt{n}\,\tilde\lambda = 0,$$
or
$$\sqrt{n}(\tilde\theta-\hat\theta) = \hat H^{-1}(\bar\theta_b)R'(\tilde\theta)'\sqrt{n}\,\tilde\lambda \tag{9.2}$$
for $n$ sufficiently large. This establishes the link between $\tilde\lambda$ and $\tilde\theta-\hat\theta$. In particular, it implies that the Lagrange multiplier $\tilde\lambda$ is an indicator of the magnitude of the difference $\tilde\theta-\hat\theta$.

Next, we derive the asymptotic distribution of $\sqrt{n}\,\tilde\lambda$. By a Taylor expansion of $\hat S(\tilde\theta)$ around the true parameter $\theta^o$ in the FOC $\sqrt{n}\,\hat S(\tilde\theta)-R'(\tilde\theta)'\sqrt{n}\,\tilde\lambda=0$, we have
$$R'(\tilde\theta)'\sqrt{n}\,\tilde\lambda = \sqrt{n}\,\hat S(\tilde\theta)
= \sqrt{n}\,\hat S(\theta^o) + \hat H(\bar\theta_c)\sqrt{n}(\tilde\theta-\theta^o),$$
where $\bar\theta_c$ lies between $\tilde\theta$ and $\theta^o$,
where c lies between ~ and o
; namely, c = c ~ + (1 c) o
for some c 2 [0; 1]: It follows that
p p p
^
H 1
( c )R0 ( ~ )0 n ~ = H
^ 1 ^ o ) + n( ~
( c ) nS( o
) (9.3)

for n su¢ ciently large. Now, we consider a Taylor series expansion of R( ~ ) r = 0 around o
:
p p
n[R( o ) r] + R0 ( d) n( ~ o
) = 0;

where d lies between ~ and o


. Given that R( o ) = r under H0 ; we have
p
R0 ( d) n( ~ o
) = 0: (9.4)

It follows from Eq. (9.3) and Eq. (9.4) that


p
R0 ( ( c )R0 ( ~ )0 n ~
d )H
^ 1

p
^ 1 ( c ) nS(
= R0 ( d )H ^ o)
p
+R0 ( d ) n( ~ o
)
p
0
= R ( d )H^ ( c ) nS(
1 ^ o)
d
! N (0; R0 ( o )Ho 1 Vo Ho 1 R0 ( o )0 )

and therefore for n su¢ ciently large, we have

p h i 1 p
n~ = R0 ( ^
d )H
1
( c )R0 ( ~ )0 R0 ( d )H
^ 1 ^ o)
( c ) nS(
d
! N (0; [ R0 ( o )H0 1 R0 ( o )0 ] 1 ) (9.5)
p ^ o
by the CLT for nS( ); the MDS property of fSt ( o )g; the information matrix equality, and
the Slutsky theorem.
Therefore, from Eq. (9.2) and Eq. (9.5), we have
p
^
H( a)
1=2
n( ~ ^)
p
^
= H( a) H ( b )R0 ( ~ )0
1=2 ^ 1
n~
d
! N (0; )
1=2
N (0; I); (9.6)

where
= Ho 1=2 R0 ( o )0 [ R0 ( o )Ho 1 R0 ( o )0 ] 1 R0 ( o )Ho 1=2
2
is a K K symmetric and idempotent matrix ( = ) with rank equal to J (using the formula

31
that tr(ABC) =tr(BCA)):
Recall that if $v\sim N(0,\Pi)$, where $\Pi$ is a symmetric and idempotent matrix with rank $J$, then the quadratic form $v'v\sim\chi^2_J$. It follows from Eq. (9.1) and Eq. (9.6) that
$$2n[\hat l(\hat\theta) - \hat l(\tilde\theta)] = \sqrt{n}(\tilde\theta-\hat\theta)'[-\hat H(\bar\theta_a)]^{1/2}[-\hat H(\bar\theta_a)]^{1/2}\sqrt{n}(\tilde\theta-\hat\theta)
\overset{d}{\to} \chi^2_J.$$
This completes the proof.

Remarks:

The LR test is based on comparing the objective functions, namely the log-likelihood functions under the null hypothesis $\mathbb{H}_0$ and under the alternative to $\mathbb{H}_0$. Intuitively, when $\mathbb{H}_0$ holds, the likelihood $\hat l(\hat\theta)$ of the unrestricted model is similar to the likelihood $\hat l(\tilde\theta)$ of the restricted model, with the small difference attributable to sampling variation. If the likelihood $\hat l(\hat\theta)$ of the unrestricted model is sufficiently larger than the likelihood $\hat l(\tilde\theta)$ of the restricted model, there is evidence that $\mathbb{H}_0$ is false. How large a difference between $\hat l(\hat\theta)$ and $\hat l(\tilde\theta)$ is considered sufficiently large to reject $\mathbb{H}_0$ is determined by the associated asymptotic $\chi^2_J$ distribution.

The likelihood ratio test statistic is similar in spirit to the $F$-test statistic in the classical linear regression model, which compares the objective functions, namely the sums of squared residuals under the null hypothesis $\mathbb{H}_0$ and under the alternative to $\mathbb{H}_0$ respectively. In other words, the negative log-likelihood is analogous to the sum of squared residuals. In fact, the LR test statistic and the $J\cdot F$ statistic are asymptotically equivalent under $\mathbb{H}_0$ for a linear regression model
$$Y_t = X_t'\beta^o + \varepsilon_t,$$
where $\varepsilon_t|\Psi_t\sim N(0,\sigma_o^2)$. To see this, put $\theta=(\beta',\sigma^2)'$ and note that
$$f(Y_t|\Psi_t,\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(Y_t-X_t'\beta)^2},$$
$$\hat l(\theta) = n^{-1}\sum_{t=1}^{n}\ln f(Y_t|\Psi_t,\theta)
= -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}n^{-1}\sum_{t=1}^{n}(Y_t-X_t'\beta)^2.$$
It is straightforward to show (please show it!) that, up to a constant that cancels in the difference,
$$\hat l(\hat\theta) = -\frac{1}{2}\ln(e'e), \qquad \hat l(\tilde\theta) = -\frac{1}{2}\ln(\tilde e'\tilde e),$$
where $e$ and $\tilde e$ are the $n\times 1$ unconstrained and constrained estimated residual vectors respectively. Therefore, under $\mathbb{H}_0$, we have
$$2n[\hat l(\hat\theta) - \hat l(\tilde\theta)] = n\ln(\tilde e'\tilde e/e'e)
= \frac{\tilde e'\tilde e - e'e}{e'e/n} + o_P(1)
= J\cdot F + o_P(1),$$
where we have used the inequality $|\ln(1+z)-z|\le z^2$ for small $z$, and the asymptotically negligible ($o_P(1)$) remainder term comes from the quadratic term in the expansion.

In the proof of the above theorem, we see that the asymptotic distribution of the LR test statistic depends on correct model specification of $f(y|\Psi_t,\theta)$, because it uses the MDS property of the score function and the IM equality. In other words, if the conditional distribution model $f(y|\Psi_t,\theta)$ is misspecified such that the MDS property of the score function or the IM equality does not hold, then the LR test statistic will not be asymptotically $\chi^2$-distributed.
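The asymptotic equivalence $LR\approx J\cdot F$ is easy to verify numerically. The simulation sketch below is illustrative only (the data-generating process and all names are assumptions, not from the notes): it restricts the last $J$ slope coefficients of a Gaussian linear regression to zero and compares the two statistics.

```python
# A small simulation checking LR = n*ln(SSR_r/SSR_u) ≈ J*F in a Gaussian linear
# regression, where the null restricts the last J slope coefficients to zero.
import numpy as np

rng = np.random.default_rng(1)
n, K, J = 500, 4, 2                       # K regressors (incl. intercept), J restrictions
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, 0.5, 0.0, 0.0])     # last J slopes are truly zero (H0 holds)
y = X @ beta + rng.normal(size=n)

def ssr(Xmat, yvec):
    b = np.linalg.lstsq(Xmat, yvec, rcond=None)[0]
    resid = yvec - Xmat @ b
    return resid @ resid

ssr_u = ssr(X, y)                         # unrestricted SSR  (e'e)
ssr_r = ssr(X[:, :K - J], y)              # restricted SSR    (e~'e~)

LR = n * np.log(ssr_r / ssr_u)
F = ((ssr_r - ssr_u) / J) / (ssr_u / (n - K))
print(LR, J * F)                          # close to each other for large n
```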
Lagrange Multiplier (LM) or Efficient Score Test

We can also use the Lagrange multiplier $\tilde\lambda$ to construct a Lagrange Multiplier (LM) test, which is also called Rao's efficient score test. Recall that the Lagrange multiplier is introduced in the constrained MLE problem:
$$\max_{\theta\in\Theta}\ \hat l(\theta) + \lambda'[r - R(\theta)].$$
The $J\times 1$ Lagrange multiplier vector $\tilde\lambda$ measures the effect of the restriction of $\mathbb{H}_0$ on the maximized value of the model likelihood. When $\mathbb{H}_0$ holds, imposing the restriction results in little change in the maximized likelihood. Thus the value of the Lagrange multiplier $\tilde\lambda$ for a correct restriction should be small. If a sufficiently large Lagrange multiplier $\tilde\lambda$ is obtained, it implies that the maximized likelihood value of the restricted model is sufficiently smaller than that of the unrestricted model, thus leading to the rejection of $\mathbb{H}_0$. Therefore, we can use $\tilde\lambda$ to construct a test for $\mathbb{H}_0$.

In deriving the asymptotic distribution of the LR test statistic, we have obtained
$$\sqrt{n}\,\tilde\lambda = \left[R'(\bar\theta_d)\hat H^{-1}(\bar\theta_c)R'(\tilde\theta)'\right]^{-1}R'(\bar\theta_d)\hat H^{-1}(\bar\theta_c)\sqrt{n}\,\hat S(\theta^o)
\overset{d}{\to} N\big(0,\,[-R'(\theta^o)H_o^{-1}R'(\theta^o)']^{-1}\big).$$
It follows that the quadratic form
$$n\,\tilde\lambda'[-R'(\theta^o)H_o^{-1}R'(\theta^o)']\,\tilde\lambda \overset{d}{\to} \chi^2_J,$$
and so by the Slutsky theorem, we have
$$n\,\tilde\lambda'[-R'(\tilde\theta)\hat H^{-1}(\tilde\theta)R'(\tilde\theta)']\,\tilde\lambda \overset{d}{\to} \chi^2_J.$$
We have actually proven the following theorem.

Theorem 9.10 [LM/Efficient Score Test]: Suppose Assumptions 9.1–9.6 hold, and the model $f(y|\Psi_t,\theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$. Then under $\mathbb{H}_0$,
$$LM_0 \equiv n\,\tilde\lambda' R'(\tilde\theta)[-\hat H^{-1}(\tilde\theta)]R'(\tilde\theta)'\,\tilde\lambda \overset{d}{\to} \chi^2_J.$$

Because the LM test statistic only involves estimation of the model $f(y_t|\Psi_t,\theta)$ under $\mathbb{H}_0$, its computation may be simpler than the computation of the Wald test statistic or the LR test statistic in many cases.
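A computational sketch follows. It is illustrative only (all inputs are assumed to be supplied by the user): using the constrained-MLE first-order condition $\hat S(\tilde\theta)=R'(\tilde\theta)'\tilde\lambda$ stated above, the quadratic form in Theorem 9.10 can equivalently be evaluated from the average score at the restricted estimate.

```python
# Sketch of the LM statistic under correct specification, computed from the
# average score S_bar at the constrained MLE via S_bar = R'(theta_tilde)' lambda_tilde,
# so that n * lambda' R (-Hhat^{-1}) R' lambda = n * S_bar' (-Hhat)^{-1} S_bar.
import numpy as np
from scipy import stats

def lm_test(S_bar, H_hat, J, n):
    """S_bar: K-vector average score at theta_tilde; H_hat: K x K average Hessian at theta_tilde."""
    LM = n * S_bar @ np.linalg.solve(-H_hat, S_bar)   # n * S' (-H)^{-1} S
    p_value = 1.0 - stats.chi2.cdf(LM, df=J)
    return LM, p_value
```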

Question: Is it true that under $\mathbb{H}_0$,
$$n\,\tilde\lambda' R'(\tilde\theta)\tilde V^{-1}R'(\tilde\theta)'\,\tilde\lambda
= n^2\,\tilde\lambda' R'(\tilde\theta)[S(\tilde\theta)'S(\tilde\theta)]^{-1}R'(\tilde\theta)'\,\tilde\lambda \overset{d}{\to} \chi^2_J,$$
where
$$\tilde V = n^{-1}\sum_{t=1}^{n}S_t(\tilde\theta)S_t(\tilde\theta)' = S(\tilde\theta)'S(\tilde\theta)/n?$$

Question: What is the advantage of the LM test?

Question: What is the relationship among the Wald, LR and LM test statistics?

9.4 Quasi-Maximum Likelihood Estimation


When $f(y_t|\Psi_t,\theta)$ is misspecified, there is no $\theta\in\Theta$ such that $f(y|\Psi_t,\theta)$ equals the true conditional pdf/pmf of $Y_t$ given $\Psi_t$.

Question: What happens if $f(y_t|\Psi_t,\theta)$ is not correctly specified for the conditional pdf/pmf of $Y_t$ given $\Psi_t$?

Question: What is the interpretation of $\theta^*$, where $\theta^*=\arg\max_{\theta\in\Theta}l(\theta)$ is as in Assumption 9.4, when $f(y|\Psi_t,\theta)$ is misspecified?

We can no longer interpret $\theta^*$ as the true model parameter, because $f(y|\Psi_t,\theta^*)$ does not coincide with the true conditional probability distribution of $Y_t$ given $\Psi_t$. It should be noted that under misspecification we no longer have the equality $\theta^*=\theta^o$, where $\theta^*$ is as defined in Assumption 9.4 and $\theta^o$ is the true model parameter. Although it always holds that $\hat\theta_{QMLE}\overset{p}{\to}\theta^*$ as $n\to\infty$, we no longer have $\hat\theta_{QMLE}\overset{p}{\to}\theta^o$ as $n\to\infty$ when the conditional probability distribution is misspecified. Below, we provide an alternative interpretation for $\theta^*$ when $f(y|\Psi_t,\theta)$ is misspecified.

Lemma 9.11: Suppose Assumption 9.4 holds. Define the conditional relative entropy
$$I(f:p\,|\,\Psi) = \int \ln\left[\frac{p(y|\Psi)}{f(y|\Psi,\theta)}\right]p(y|\Psi)\,dy,$$
where $p(y|\Psi)$ is the true conditional pdf/pmf of $Y$ given $\Psi$. Then $I(f:p\,|\,\Psi)$ is nonnegative almost surely for all $\theta$, and
$$\theta^* = \arg\min_{\theta\in\Theta}E[I(f:p\,|\,\Psi)],$$
where $E(\cdot)$ is taken over the probability distribution of $\Psi$.

Remarks:

The parameter value $\theta^*$ minimizes the "distance" of $f(\cdot|\Psi,\theta)$ from the true conditional density $p(\cdot|\Psi)$ in terms of conditional relative entropy. Relative entropy is a divergence measure for two alternative distributions; it is zero if and only if the two distributions coincide with each other. There are many distance/divergence measures for two distributions. Relative entropy has an appealing information-theoretic interpretation and the invariance property with respect to data transformation. It has been widely used in economics and econometrics.

Question: Why is a misspecified pdf/pmf model $f(y_t|\Psi_t,\theta)$ still useful in economic applications?

In many applications, misspecification of higher order conditional moments does not render inconsistent the estimator for the parameters appearing in the lower order conditional moments. For example, suppose a conditional mean model is correctly specified but the higher order conditional moments are misspecified. We can still obtain a consistent estimator for the parameters appearing in the conditional mean model. Of course, the parameters appearing in the higher order conditional moments cannot be consistently estimated.

In other words, even though $\theta^*$ does not equal $\theta^o$ element by element, equality can hold for some parameters of interest. For example, the first two elements of $\theta^*$ (e.g., $\theta_0^*=\mu_0$ and $\theta_1^*=\sigma_0^2$, where $\mu_0$ and $\sigma_0^2$ are the parameters we are interested in) could be equal to the corresponding elements of $\theta^o$, i.e., $\theta_0^*=\theta_0^o$ and $\theta_1^*=\theta_1^o$. Therefore, the QMLE $\hat\theta_{QMLE}$ may be inconsistent as a whole while its components $\hat\theta_0$ and $\hat\theta_1$ remain consistent for the population mean and variance. See Example 1 below.
We now consider a few illustrative examples.

Example 1 [Nonlinear Regression Model]: Suppose $\{(Y_t,X_t')'\}$ is i.i.d.,
$$Y_t = g(X_t,\beta^o) + \varepsilon_t,$$
where $E(\varepsilon_t|X_t)=0$ a.s.

Here, the regression model $g(X_t,\beta)$ is correctly specified for $E(Y_t|X_t)$ if and only if $E(\varepsilon_t|X_t)=0$ a.s. We need not know the distribution of $\varepsilon_t|X_t$.

Question: How can we estimate the true parameter $\beta^o$ when the conditional mean model $g(X_t,\beta)$ is correctly specified for $E(Y_t|X_t)$?

In order to estimate $\beta^o$, we assume that $\varepsilon_t|X_t\sim$ i.i.d. $N(0,\sigma^2)$, which is likely to be incorrect (and we know this). Then we can obtain the pseudo conditional likelihood function
$$f(y_t|x_t,\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}[y_t-g(x_t,\beta)]^2},$$
where $\theta=(\beta',\sigma^2)'$. Define the quasi-MLE
$$\hat\theta = (\hat\beta',\hat\sigma^2)' = \arg\max_{\beta,\sigma^2}\sum_{t=1}^{n}\ln f(Y_t|X_t,\theta).$$
Then $\hat\beta$ is a consistent estimator for $\beta^o$. In this example, misspecification of i.i.d. $N(0,\sigma^2)$ for $\varepsilon_t|X_t$ does not render the estimator for $\beta^o$ inconsistent. The QMLE $\hat\beta$ is consistent for $\beta^o$ as long as the conditional mean of $Y_t$ is correctly specified by $f(y|X_t,\theta)$. Of course, the parameter estimator $\hat\theta=(\hat\beta',\hat\sigma^2)'$ cannot consistently estimate the true conditional distribution of $Y_t$ given $\Psi_t$ if the conditional distribution of $\varepsilon_t|X_t$ is misspecified.

Suppose the true conditional distribution is $\varepsilon_t|X_t\sim N(0,\sigma_t^2)$, where $\sigma_t^2=\sigma^2(X_t)$ is a function of $X_t$, but we assume $\varepsilon_t|X_t\sim$ i.i.d. $N(0,\sigma^2)$. Then we still have $E[S_t(\theta^*)|X_t]=0$ a.s., but the conditional information matrix equality does not hold.
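The following is a minimal sketch of Gaussian QMLE for a nonlinear regression; it is not part of the notes, and the particular mean function $g(x,b)=b_0+b_1e^{b_2x}$, the error law, and all variable names are illustrative assumptions. It shows that the mean parameters are recovered even though the Gaussian, homoskedastic error assumption is false.

```python
# Gaussian QMLE sketch for Y_t = g(X_t, beta) + eps_t with g(x, b) = b0 + b1*exp(b2*x).
# Even if eps_t|X_t is not N(0, sigma^2), maximizing this pseudo likelihood still
# estimates beta consistently as long as the conditional mean is correct.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 1000
x = rng.uniform(0.0, 2.0, size=n)
eps = (0.5 + 0.5 * x) * rng.standard_t(df=5, size=n)   # heteroskedastic, non-normal errors
y = 1.0 + 2.0 * np.exp(-1.0 * x) + eps                 # true beta = (1, 2, -1)

def g(x, b):
    return b[0] + b[1] * np.exp(b[2] * x)

def neg_quasi_loglik(theta):
    b, log_sig2 = theta[:3], theta[3]
    sig2 = np.exp(log_sig2)                            # keep sigma^2 > 0
    u = y - g(x, b)
    return 0.5 * np.sum(np.log(2 * np.pi * sig2) + u**2 / sig2)

theta0 = np.array([0.0, 1.0, -0.5, 0.0])
res = minimize(neg_quasi_loglik, theta0, method="BFGS")
print(res.x[:3])   # QMLE of beta, close to (1, 2, -1) despite the wrong error law
```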

Example 2 [Capital Asset Pricing Model (CAPM)]:

Define $Y_t$ as an $L\times 1$ vector of excess returns for $L$ assets (or portfolios of assets). For these $L$ assets, the excess returns can be described using the excess-return market model:
$$Y_t = \alpha_0^o + \alpha_1^o Z_{mt} + \varepsilon_t = \alpha^{o\prime}X_t + \varepsilon_t,$$
where $X_t=(1,Z_{mt})'$ is a bivariate vector, $Z_{mt}$ is the excess market return, $\alpha^o$ is a $2\times L$ parameter matrix, and $\varepsilon_t$ is an $L\times 1$ disturbance with $E(\varepsilon_t|X_t)=0$. With this condition, the CAPM is correctly specified for the expected excess return $E(Y_t|X_t)$.

To estimate the unknown parameter matrix $\alpha^o$, one can assume
$$\varepsilon_t|\Psi_t \sim N(0,\Sigma),$$
where $\Psi_t=\{X_t,Y_{t-1},X_{t-1},Y_{t-2},\ldots\}$ and $\Sigma$ is an $L\times L$ symmetric and positive definite matrix. Then we can write the conditional pdf of $Y_t$ given $\Psi_t$ as
$$f(Y_t|\Psi_t,\theta) = (2\pi)^{-L/2}|\Sigma|^{-1/2}\exp\left[-\frac{1}{2}(Y_t-\alpha'X_t)'\Sigma^{-1}(Y_t-\alpha'X_t)\right],$$
where $\theta=(\alpha',\mathrm{vech}(\Sigma)')'$.

Although the i.i.d. normality assumption for $\{\varepsilon_t\}$ may not hold, the estimator based on the pseudo Gaussian likelihood function will be consistent for the parameter matrix $\alpha^o$ appearing in the CAPM model.

Example 3 [Univariate ARMA(p,q) Model]: In Chapter 5, we introduced a class of time series models called ARMA(p,q). Suppose
$$Y_t = \phi_0 + \sum_{j=1}^{p}\phi_j Y_{t-j} + \sum_{j=1}^{q}\psi_j\varepsilon_{t-j} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ is an MDS with mean 0 and variance $\sigma^2$. Then this ARMA(p,q) model is correctly specified for $E(Y_t|I_{t-1})$, where $I_{t-1}=\{Y_{t-1},Y_{t-2},\ldots,Y_1\}$ is the information set available at time $t-1$. Note that the distribution of $\varepsilon_t$ is not specified. How can we estimate the parameters $\phi_0,\phi_1,\ldots,\phi_p,\psi_1,\ldots,\psi_q$?

Assuming that $\{\varepsilon_t\}\sim$ i.i.d. $N(0,\sigma^2)$, the conditional pdf of $Y_t$ given $\Psi_t=I_{t-1}$ is
$$f(y|\Psi_t,\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y-\mu_t(\theta))^2}{2\sigma^2}\right],$$
where $\theta=(\phi_0,\phi_1,\ldots,\phi_p,\psi_1,\ldots,\psi_q,\sigma^2)'$ and
$$\mu_t(\theta) = \phi_0 + \sum_{j=1}^{p}\phi_j Y_{t-j} + \sum_{j=1}^{q}\psi_j\varepsilon_{t-j}.$$
Although the i.i.d. normality assumption for $\{\varepsilon_t\}$ may be false, the estimator based on the above pseudo Gaussian likelihood function will be consistent for the parameters $(\phi^o,\psi^o)$ appearing in the ARMA(p,q) model.

In practice, we have a random sample $\{Y_t\}_{t=1}^{n}$ of size $n$ to estimate an ARMA(p,q) model and need to assume some initial values for $\{Y_t\}_{t=-p}^{0}$ and $\{\varepsilon_t\}_{t=-q}^{0}$. For example, we can set $Y_t=\bar Y$ for $-p\le t\le 0$ and $\varepsilon_t=0$ for $-q\le t\le 0$. When the ARMA(p,q) process is stationary, this choice of initial values does not affect the asymptotic properties of the QMLE $\hat\theta$ under regularity conditions.

Example 4 [Vector Autoregression Model]: Suppose $Y_t=(Y_{1t},\ldots,Y_{Lt})'$ is an $L\times 1$ stationary ergodic autoregressive process of order $p$:
$$Y_t = A_0^o + \sum_{j=1}^{p}A_j^{o\prime}Y_{t-j} + \varepsilon_t, \qquad t=p+1,\ldots,n,$$
where $A_0^o$ is an $L\times 1$ parameter vector, $A_j^o$ is an $L\times L$ parameter matrix for $j\in\{1,\ldots,p\}$, and $\{\varepsilon_t=(\varepsilon_{1t},\ldots,\varepsilon_{Lt})'\}$ is an $L\times 1$ MDS with $E(\varepsilon_t)=0$ and $E(\varepsilon_t\varepsilon_t')=\Sigma^o$, an $L\times L$ finite and positive definite matrix. When $\Sigma^o$ is not a diagonal matrix, there exist contemporaneous correlations between different components of $\varepsilon_t$. This implies that a shock on $\varepsilon_{1t}$ will spill over to other variables. With the MDS condition for $\{\varepsilon_t\}$, the VAR(p) model is correctly specified for $E(Y_t|I_{t-1})$, where $I_{t-1}=\{Y_{t-1},Y_{t-2},\ldots,Y_1\}$. Note that the VAR(p) model can be equivalently represented as
$$Y_{1t} = A_{10} + \sum_{j=1}^{p}A_{11j}Y_{1t-j} + \cdots + \sum_{j=1}^{p}A_{1Lj}Y_{Lt-j} + \varepsilon_{1t},$$
$$Y_{2t} = A_{20} + \sum_{j=1}^{p}A_{21j}Y_{1t-j} + \cdots + \sum_{j=1}^{p}A_{2Lj}Y_{Lt-j} + \varepsilon_{2t},$$
$$\vdots$$
$$Y_{Lt} = A_{L0} + \sum_{j=1}^{p}A_{L1j}Y_{1t-j} + \cdots + \sum_{j=1}^{p}A_{LLj}Y_{Lt-j} + \varepsilon_{Lt}.$$

Let $\theta^o$ denote a parameter vector containing all components of the unknown parameters from $A_0^o,A_1^o,\ldots,A_p^o$ and $\Sigma^o$. To estimate $\theta^o$, one can assume
$$\varepsilon_t|I_{t-1}\sim N(0,\Sigma).$$
Then $Y_t|I_{t-1}\sim N\big(A_0+\sum_{j=1}^{p}A_j'Y_{t-j},\,\Sigma\big)$, and the pseudo conditional pdf of $Y_t$ given $\Psi_t=Y^{t-1}$ is
$$f(Y_t|\Psi_t,\theta) = \frac{1}{\sqrt{(2\pi)^L\det(\Sigma)}}\exp\left\{-\frac{1}{2}[Y_t-\mu_t(\theta)]'\Sigma^{-1}[Y_t-\mu_t(\theta)]\right\},$$
where $\mu_t(\theta)=A_0+\sum_{j=1}^{p}A_j'Y_{t-j}$.

Example 5 [GARCH Model]: Time-varying volatility is an important empirical stylized fact for many economic and financial time series. For example, it is well known that there exists volatility clustering in financial markets: a large volatility today tends to be followed by another large volatility tomorrow, a small volatility today tends to be followed by another small volatility tomorrow, and these patterns alternate over time. In financial econometrics, the following GARCH framework has been used to capture volatility clustering or, more generally, time-varying volatility. Suppose $(Y_t,X_t)$ is a strictly stationary process with
$$Y_t = \mu(\Psi_t,\theta) + \sigma(\Psi_t,\theta)z_t, \qquad
E(z_t|\Psi_t)=0 \text{ a.s.}, \qquad
E(z_t^2|\Psi_t)=1 \text{ a.s.}$$
The models $\mu(\Psi_t,\theta)$ and $\sigma^2(\Psi_t,\theta)$ are correctly specified for $E(Y_t|\Psi_t)$ and $\mathrm{var}(Y_t|\Psi_t)$ if and only if $E(z_t|\Psi_t)=0$ a.s. and $\mathrm{var}(z_t|\Psi_t)=1$ a.s. We need not know the conditional distribution of $z_t|\Psi_t$ (in particular, we need not know the higher order conditional moments of $z_t$ given $\Psi_t$). An example of $\mu(\Psi_t,\theta)$ is the ARMA(p,q) model in Example 3. We now give some popular models for $\sigma^2(\Psi_t,\theta)$. For notational simplicity, we write $\sigma_t^2=\sigma^2(\Psi_t,\theta)$.

Engle's (1982) ARCH(q) model:
$$\sigma_t^2 = \alpha_0 + \sum_{j=1}^{q}\alpha_j\varepsilon_{t-j}^2,$$
where $\varepsilon_t=\sigma_t z_t$.

Bollerslev's (1986) GARCH(p,q) model:
$$\sigma_t^2 = \omega + \sum_{j=1}^{p}\beta_j\sigma_{t-j}^2 + \sum_{j=1}^{q}\alpha_j\varepsilon_{t-j}^2.$$

Nelson's (1990) EGARCH(p,q) model:
$$\ln\sigma_t^2 = \omega + \sum_{j=1}^{p}\beta_j\ln\sigma_{t-j}^2 + \sum_{j=0}^{q}\alpha_j g(z_{t-j}),$$
where $g(z_t)$ is a nonlinear function defined as
$$g(z_t) = \gamma_1(|z_t| - E|z_t|) + \gamma_2 z_t.$$

Threshold GARCH(p,q) model:
$$\sigma_t^2 = \omega + \sum_{j=1}^{p}\beta_j\sigma_{t-j}^2 + \sum_{j=1}^{q}\alpha_j\varepsilon_{t-j}^2\,1(z_{t-j}>0) + \sum_{j=1}^{q}\gamma_j\varepsilon_{t-j}^2\,1(z_{t-j}\le 0),$$
where $1(\cdot)$ is the indicator function.

Question: How can we estimate $\theta$, the parameters appearing in the first two conditional moments?

A most popular approach is to assume that $z_t|\Psi_t\sim$ i.i.d. $N(0,1)$. Then $Y_t|\Psi_t\sim N(\mu(\Psi_t,\theta),\sigma^2(\Psi_t,\theta))$, and the pseudo conditional pdf of $Y_t$ given $\Psi_t$ is
$$f(y|\Psi_t,\theta) = \frac{1}{\sqrt{2\pi\sigma^2(\Psi_t,\theta)}}\,e^{-\frac{[y-\mu(\Psi_t,\theta)]^2}{2\sigma^2(\Psi_t,\theta)}}.$$
It follows that the log-likelihood function is
$$\sum_{t=1}^{n}\ln f(Y_t|\Psi_t,\theta)
= -\frac{n}{2}\ln 2\pi - \frac{1}{2}\sum_{t=1}^{n}\ln\sigma^2(\Psi_t,\theta)
- \frac{1}{2}\sum_{t=1}^{n}\frac{[Y_t-\mu(\Psi_t,\theta)]^2}{\sigma^2(\Psi_t,\theta)}.$$

The i.i.d. N(0,1) innovation assumption does not affect the specification of the conditional mean $\mu(\Psi_t,\theta)$ and conditional variance $\sigma^2(\Psi_t,\theta)$, so it does not affect the consistency of the QMLE $\hat\theta$ for the true parameter value appearing in the conditional mean and conditional variance specifications. In other words, $z_t$ may not be i.i.d. N(0,1), but this does not affect the consistency of the Gaussian QMLE $\hat\theta$.
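To make the Gaussian quasi-log-likelihood concrete, here is a minimal sketch for a GARCH(1,1) model with a constant mean. It is not from the notes; the data-generating process, initial values, starting values, and all names are illustrative assumptions. It also illustrates the point made below about initializing the variance recursion at the unconditional variance.

```python
# Gaussian quasi-log-likelihood for Y_t = mu + eps_t, eps_t = sigma_t z_t,
# sigma_t^2 = omega + alpha*eps_{t-1}^2 + beta*sigma_{t-1}^2.
import numpy as np
from scipy.optimize import minimize

def garch11_neg_quasi_loglik(params, y):
    mu, omega, alpha, beta = params
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return 1e10                              # crude penalty keeping the search valid
    eps = y - mu
    n = len(y)
    sig2 = np.empty(n)
    sig2[0] = np.var(y)                          # initial value for the variance recursion
    for t in range(1, n):
        sig2[t] = omega + alpha * eps[t - 1]**2 + beta * sig2[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi * sig2) + eps**2 / sig2)

# Example usage on simulated data with non-normal (Student-t) unit-variance innovations
rng = np.random.default_rng(3)
n, mu0, omega0, alpha0, beta0 = 2000, 0.05, 0.1, 0.1, 0.8
z = rng.standard_t(df=7, size=n) / np.sqrt(7 / 5)
y = np.empty(n); sig2 = omega0 / (1 - alpha0 - beta0); eps_prev = 0.0
for t in range(n):
    sig2 = omega0 + alpha0 * eps_prev**2 + beta0 * sig2
    eps_prev = np.sqrt(sig2) * z[t]
    y[t] = mu0 + eps_prev

res = minimize(garch11_neg_quasi_loglik, x0=[0.0, 0.05, 0.05, 0.9], args=(y,),
               method="Nelder-Mead", options={"maxiter": 2000})
print(res.x)   # Gaussian QMLE remains consistent although z_t is Student-t
```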

In addition to the i.i.d. N(0,1) assumption, the following two error distributions have also been popularly used in practice:

Standardized Student's $t(\nu)$ Distribution

The scale factor $\sqrt{(\nu-2)/\nu}$ ensures that $z_t$ has unit variance. The pdf of $z_t$ is
$$f(z) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\pi(\nu-2)}}\left(1+\frac{z^2}{\nu-2}\right)^{-\frac{\nu+1}{2}}, \qquad -\infty<z<\infty.$$

Generalized Error Distribution
$$f(z_t) = \frac{b}{2a\,\Gamma(1/b)}\exp\left[-\left|\frac{z_t-\mu}{a}\right|^{b}\right], \qquad -\infty<z_t<\infty,$$
where $\mu$, $a$ and $b$ are location, scale and shape parameters respectively. Note that both the standardized $t$-distribution and the generalized error distribution include N(0,1) as a special case.

Like estimation of an ARMA(p,q) model, we may have to choose initial values for some variables in estimating GARCH models. For example, in estimating GARCH(1,1) models, we encounter the initial value problem for the conditional variance $\sigma_0^2$ and $\varepsilon_0$. One can set $\sigma_0^2$ to be the unconditional variance $E(\sigma_t^2)=\omega/(1-\beta_1-\alpha_1)$, and set $\varepsilon_0=0$.

We note that the ARMA model in Example 3 can be estimated via QMLE as a special case of the GARCH framework by setting $\sigma^2(\Psi_t,\theta)=\sigma^2$.

Question: What is the implication of a misspecified probability distribution model?

Although misspecification of $f(y_t|\Psi_t,\theta)$ may not affect the consistency of the QMLE (or the consistency of a subset of parameters) under suitable regularity conditions, it does affect the asymptotic variance (and so the efficiency) of the QMLE $\hat\theta$.

Remarks: The true parameter $\theta^o$ is not always consistently estimable by QMLE when the likelihood function is misspecified. In some cases, $\theta^o$ cannot be consistently estimated when the likelihood model is misspecified.

We first investigate the implication of a misspecified conditional distribution model $f(y|\Psi_t,\theta)$ on the score function and the IM equality.

Lemma 9.12: Suppose Assumptions 9.4–9.6(i) hold. Then
$$E[S_t(\theta^*)] = 0,$$
where $E(\cdot)$ is taken over the true distribution of the data generating process.

Proof: Because $\theta^*$ maximizes $l(\theta)$ and is an interior point in $\Theta$, the FOC holds at $\theta=\theta^*$:
$$\frac{dl(\theta^*)}{d\theta} = 0.$$
By differentiating, we have
$$\frac{d}{d\theta}E[\ln f(Y_t|\Psi_t,\theta^*)] = 0.$$
Exchanging differentiation and integration yields the desired result:
$$E\left[\frac{\partial\ln f(Y_t|\Psi_t,\theta^*)}{\partial\theta}\right] = 0.$$
This completes the proof.

Remarks:
No matter whether the conditional distributional model $f(y|\Psi_t,\theta)$ is correctly specified, the score function $S_t(\theta^*)$ evaluated at $\theta^*$ always has mean zero. This is a consequence of the FOC of the maximization of $l(\theta)$. It is analogous to the FOC of the best linear least squares approximation, where one always has $E(X_tu_t)=0$ with $u_t=Y_t-X_t'\beta^*$ and $\beta^*=[E(X_tX_t')]^{-1}E(X_tY_t)$.

When $\{Z_t=(Y_t,X_t')'\}$ is i.i.d., or $\{Z_t\}$ is not independent but $\{S_t(\theta^*)\}$ is an MDS (note that $S_t(\theta^*)$ can still be an MDS when $f(Y_t|\Psi_t,\theta)$ is misspecified for the conditional distribution of $Y_t$ given $\Psi_t$), we have
$$V_* \equiv V(\theta^*) \equiv \mathrm{avar}\left(n^{-1/2}\sum_{t=1}^{n}S_t(\theta^*)\right)
= \lim_{n\to\infty}E\left[\left(n^{-1/2}\sum_{t=1}^{n}S_t(\theta^*)\right)\left(n^{-1/2}\sum_{\tau=1}^{n}S_\tau(\theta^*)\right)'\right]
= E[S_t(\theta^*)S_t(\theta^*)'].$$
Thus, even when $f(y|\Psi_t,\theta)$ is a misspecified conditional distribution model, we do not have to estimate a long-run variance-covariance matrix for $V_*$ as long as $\{S_t(\theta^*)\}$ is an MDS process.
Question: Can you give a time series example in which $f(y_t|\Psi_t,\theta)$ is misspecified but $\{S_t(\theta^*)\}$ is an MDS?

Answer: Consider a conditional distribution model which correctly specifies the conditional mean of $Y_t$ but misspecifies the higher order conditional moments (e.g., the conditional variance).

Question: Is $\{S_t(\theta^*)\}$ always an MDS when $\{S_t(\theta^*)\}$ is stationary ergodic?

Answer: In the time series context, when the conditional pdf/pmf $f(y_t|\Psi_t,\theta)$ is misspecified, then $\{S_t(\theta^*)\}$ may not be an MDS. In this case, we have
$$V_* \equiv \mathrm{avar}\left[\sqrt{n}\,\hat S(\theta^*)\right]
= \lim_{n\to\infty}n^{-1}\sum_{t=1}^{n}\sum_{\tau=1}^{n}E[S_t(\theta^*)S_\tau(\theta^*)']
= \sum_{j=-\infty}^{\infty}E[S_t(\theta^*)S_{t-j}(\theta^*)']
= \sum_{j=-\infty}^{\infty}\Gamma(j),$$
where
$$\Gamma(j) = E[S_t(\theta^*)S_{t-j}(\theta^*)'].$$
In other words, we have to estimate the long-run variance-covariance matrix for $V_*$ when $\{S_t(\theta^*)\}$ is not an MDS.

Question: If the model $f(y|\Psi_t,\theta)$ is misspecified for the conditional distribution of $Y_t$ given $\Psi_t$, do we still have the conditional information matrix equality?

Generally, no. That is, we generally have neither $E[S_t(\theta^*)|I_{t-1}]=0$ nor
$$E[S_t(\theta^*)S_t(\theta^*)'\,|\,\Psi_t] + E\left[\frac{\partial^2\ln f(Y_t|\Psi_t,\theta^*)}{\partial\theta\,\partial\theta'}\,\Big|\,\Psi_t\right] = 0,$$
where $E(\cdot|\Psi_t)$ is taken under the true conditional distribution, which differs from the model $f(y_t|\Psi_t,\theta^*)$ when $f(y_t|\Psi_t,\theta)$ is misspecified. Please check.

Question: What is the impact of the failure of the MDS property of the score function and the failure of the conditional information matrix equality?

Theorem 9.13 [Asymptotic Normality of QMLE]: Suppose Assumptions 9.1–9.6 hold. Then
$$\sqrt{n}(\hat\theta-\theta^*) \overset{d}{\to} N(0,\,H_*^{-1}V_*H_*^{-1}),$$
where $V_*=V(\theta^*)\equiv\mathrm{avar}[\sqrt{n}\,\hat S(\theta^*)]$ and $H_*=H(\theta^*)\equiv E\left[\frac{\partial^2\ln f(Y_t|\Psi_t,\theta^*)}{\partial\theta\,\partial\theta'}\right]$.

Remarks:

Without the MDS property of the score function, we have to estimate $V_*\equiv\mathrm{avar}[\sqrt{n}\,\hat S(\theta^*)]$ by (e.g.) a Newey-West (1987, 1994) type estimator in the time series context. Without the conditional information matrix equality (even if the MDS property holds), we cannot simplify the asymptotic variance of the QMLE from $H_*^{-1}V_*H_*^{-1}$ to $-H_*^{-1}$ even if the score function is i.i.d. or an MDS.

In a certain sense, the MDS property of the score function is analogous to serial uncorrelatedness in a regression disturbance, and the information matrix equality is analogous to conditional homoskedasticity.

Compared with the asymptotic variance $-H_o^{-1}$ of the MLE, the asymptotic variance $H_*^{-1}V_*H_*^{-1}$ of the QMLE is more complicated, because we cannot use the information matrix equality to simplify it. In addition, $V_*$ has to be estimated using a kernel-based method when $\{S_t(\theta^*)\}$ is not an MDS.

In the literature, the variance $H_*^{-1}V_*H_*^{-1}$ is usually called the robust asymptotic variance-covariance matrix of the QMLE $\hat\theta$. It is robust to misspecification of the model $f(y_t|\Psi_t,\theta)$: no matter whether $f(y_t|\Psi_t,\theta)$ is correctly specified, $H_*^{-1}V_*H_*^{-1}$ is always the correct asymptotic variance of $\sqrt{n}(\hat\theta-\theta^*)$.

Question: Is the QMLE asymptotically less efficient than the MLE?

Yes. The asymptotic variance of the MLE, equal to $-H_o^{-1}$, the inverse of the negative Hessian matrix, achieves the Cramer-Rao lower bound and is therefore asymptotically most efficient. On the other hand, the asymptotic variance $H_*^{-1}V_*H_*^{-1}$ of the QMLE is not the same as the asymptotic variance $-H_o^{-1}$ of the MLE and thus does not achieve the Cramer-Rao lower bound. It is asymptotically less efficient than the MLE. This is the price one has to pay for using a misspecified pdf/pmf model, even though some model parameters can still be consistently estimated.

9.4.1 Asymptotic Variance Estimation

Question: How can we estimate the asymptotic variance $H_*^{-1}V_*H_*^{-1}$ of the QMLE?

First, it is straightforward to estimate $H_*$:
$$\hat H(\hat\theta) = n^{-1}\sum_{t=1}^{n}\frac{\partial^2\ln f(Y_t|\Psi_t,\hat\theta)}{\partial\theta\,\partial\theta'}.$$
The UWLLN for $\{H_t(\theta)\}$ and the continuity of $H(\theta)$ ensure that $\hat H(\hat\theta)\overset{p}{\to}H_*$.

Next, how can we estimate $V_*=\mathrm{avar}[n^{-1/2}\sum_{t=1}^{n}S_t(\theta^*)]$?

We consider two cases, depending on whether $\{S_t(\theta^*)\}$ is an MDS.

Case I: $\{Z_t=(Y_t,X_t')'\}$ is i.i.d., or $\{Z_t\}$ is not independent but $\{S_t(\theta^*)\}$ is an MDS.

In this case,
$$V_* = E[S_t(\theta^*)S_t(\theta^*)'],$$
so we can use
$$\hat V = n^{-1}\sum_{t=1}^{n}S_t(\hat\theta)S_t(\hat\theta)',$$
which is consistent for $V_*$.

Case II: When $\{Z_t\}$ is not independent, $\{S_t(\theta^*)\}$ may not be an MDS. In this case, we can use the kernel method
$$\hat V = \sum_{j=1-n}^{n-1}k(j/p)\,\hat\Gamma(j),$$
where
$$\hat\Gamma(j) = n^{-1}\sum_{t=j+1}^{n}S_t(\hat\theta)S_{t-j}(\hat\theta)' \quad\text{if } j\ge 0,$$
and $\hat\Gamma(j)=\hat\Gamma(-j)'$ for $j<0$.
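As a concrete illustration of the kernel method, here is a sketch of a Bartlett-kernel (Newey-West type) long-run variance estimator applied to the score series; the function name, bandwidth choice, and inputs are assumptions for illustration, not the notes' own code.

```python
# Bartlett-kernel estimator of V = sum_j k(j/p) * Gamma_hat(j) for the scores.
# S is an (n x K) array whose t-th row is S_t(theta_hat); p is the lag truncation.
import numpy as np

def long_run_variance(S, p):
    n, K = S.shape
    S = S - S.mean(axis=0)            # center; scores average to ~0 at the QMLE
    V = S.T @ S / n                   # Gamma_hat(0)
    for j in range(1, p + 1):
        w = 1.0 - j / (p + 1)         # Bartlett weight
        Gamma_j = S[j:].T @ S[:-j] / n
        V += w * (Gamma_j + Gamma_j.T)
    return V
```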

We directly assume that $\hat V$ is consistent for $V_*$:

Assumption 9.7: $\hat V\overset{p}{\to}V_*$, where $V_*$ is finite and nonsingular.

Lemma 9.14 [Asymptotic Variance Estimator for QMLE]: Suppose Assumptions 9.1–9.7 hold. Then as $n\to\infty$,
$$\hat H^{-1}(\hat\theta)\hat V\hat H^{-1}(\hat\theta) \overset{p}{\to} H_*^{-1}V_*H_*^{-1}.$$

9.4.2 QMLE-based Hypothesis Testing

With the consistent asymptotic variance estimator, we can now construct suitable hypothesis tests under a misspecified conditional distributional model. Again, we consider the null hypothesis
$$\mathbb{H}_0: R(\theta^*) = r,$$
where $R(\cdot)$ is a $J\times 1$ continuously differentiable vector function with the $J\times K$ matrix $R'(\theta^*)$ being of full rank, and $r$ is a $J\times 1$ vector.

Wald Test Under Model Misspecification

We first consider a Wald test.

Theorem 9.15 [QMLE-based Hypothesis Testing, Wald Test]: Suppose Assumptions 9.1–9.7 hold. Then under $\mathbb{H}_0: R(\theta^*)=r$, we have
$$\hat W = n[R(\hat\theta)-r]'\left\{R'(\hat\theta)[\hat H^{-1}(\hat\theta)\hat V\hat H^{-1}(\hat\theta)]R'(\hat\theta)'\right\}^{-1}[R(\hat\theta)-r] \overset{d}{\to} \chi^2_J.$$

Proof: By the first order Taylor series expansion, we obtain
$$\sqrt{n}[R(\hat\theta)-r] = \sqrt{n}[R(\theta^*)-r] + R'(\bar\theta)\sqrt{n}(\hat\theta-\theta^*)
= R'(\bar\theta)\sqrt{n}(\hat\theta-\theta^*)
\overset{d}{\to} N\big(0,\,R'(\theta^*)H_*^{-1}V_*H_*^{-1}R'(\theta^*)'\big),$$
where we have made use of the fact that $\sqrt{n}(\hat\theta-\theta^*)\overset{d}{\to}N(0,H_*^{-1}V_*H_*^{-1})$, and the Slutsky theorem. The desired result for $\hat W$ follows immediately.

Remarks:

Only the unconstrained QMLE $\hat\theta$ is used in constructing the robust Wald test statistic. The Wald test statistic under model misspecification is similar in structure to the Wald test in linear regression modeling that is robust to conditional heteroskedasticity (under the i.i.d. or MDS assumption) or robust to conditional heteroskedasticity and autocorrelation (under the non-MDS assumption).

LM/Score Test Under Model Misspecification

Question: Can we use the LM test principle for $\mathbb{H}_0$ when $f(y|\Psi_t,\theta)$ is misspecified?

Yes. We can still derive the asymptotic distribution of $\sqrt{n}\,\tilde\lambda$, with a suitable (i.e., robust) asymptotic variance, which of course will generally differ from that under correct model specification.

Recall from the FOC of the constrained QMLE $\tilde\theta$:
$$\hat S(\tilde\theta) - R'(\tilde\theta)'\tilde\lambda = 0, \qquad R(\tilde\theta) - r = 0.$$
In deriving the asymptotic distribution of the LR test statistic, we obtained
$$\sqrt{n}\,\tilde\lambda = \left[R'(\bar\theta_d)\hat H^{-1}(\bar\theta_c)R'(\tilde\theta)'\right]^{-1}R'(\bar\theta_d)\hat H^{-1}(\bar\theta_c)\sqrt{n}\,\hat S(\theta^*)$$
for $n$ sufficiently large. By the CLT, we have $\sqrt{n}\,\hat S(\theta^*)\overset{d}{\to}N(0,V_*)$, where $V_*=\mathrm{avar}[\sqrt{n}\,\hat S(\theta^*)]$. Using the Slutsky theorem, we can obtain
$$\sqrt{n}\,\tilde\lambda \overset{d}{\to} N(0,\Omega),$$
where
$$\Omega = [R'(\theta^*)H_*^{-1}R'(\theta^*)']^{-1}\,
R'(\theta^*)H_*^{-1}V_*H_*^{-1}R'(\theta^*)'\,
[R'(\theta^*)H_*^{-1}R'(\theta^*)']^{-1}.$$
Then a robust LM test statistic is
$$LM \equiv n\,\tilde\lambda'\tilde\Omega^{-1}\tilde\lambda \overset{d}{\to} \chi^2_J$$
by the Slutsky theorem, where the asymptotic variance estimator is
$$\tilde\Omega = [R'(\tilde\theta)\hat H^{-1}(\tilde\theta)R'(\tilde\theta)']^{-1}\,
[R'(\tilde\theta)\hat H^{-1}(\tilde\theta)\tilde V\hat H^{-1}(\tilde\theta)R'(\tilde\theta)']\,
[R'(\tilde\theta)\hat H^{-1}(\tilde\theta)R'(\tilde\theta)']^{-1},$$
and $\tilde V$ satisfies the following condition:

Assumption 9.8: $\tilde V\overset{p}{\to}V_*$, where $\tilde V$ is defined as $\hat V$ in Assumption 9.7 with $\hat\theta$ replaced by $\tilde\theta$.

With this assumption, the LM test statistic only involves estimation of the conditional pdf/pmf model $f(y|\Psi_t,\theta)$ under the null hypothesis $\mathbb{H}_0$.

Theorem 9.16 [QMLE-based LM Test]: Suppose Assumptions 9.1–9.6 and 9.8 hold, and $\mathbb{H}_0: R(\theta^*)=r$ holds. Then as $n\to\infty$,
$$LM \equiv n\,\tilde\lambda'\tilde\Omega^{-1}\tilde\lambda \overset{d}{\to} \chi^2_J.$$

Remarks:

The $LM_0$ test statistic under MLE and the LM test statistic under QMLE differ in the sense that they use different asymptotic variance estimators. The LM test statistic here is robust to misspecification of the conditional pdf/pmf model $f(y|\Psi_t,\theta)$.

Question: Can we use the likelihood ratio (LR) test under model misspecification,
$$LR = 2n[\hat l(\hat\theta) - \hat l(\tilde\theta)]?$$

No. This is because in deriving the asymptotic distribution of the LR test statistic, we used the MDS property of the score function $\{S_t(\theta^o)\}$ and the information matrix equality ($V_o=-H_o$), which may not hold when the conditional distribution model $f(y|\Psi_t,\theta)$ is misspecified. If the MDS property of the score function or the information matrix equality fails, the LR statistic is not asymptotically $\chi^2_J$ under $\mathbb{H}_0$. This is similar to the fact that $J$ times the $F$-test statistic does not converge to $\chi^2_J$ when there exists serial correlation in $\{\varepsilon_t\}$ or when there exists conditional heteroskedasticity.

In many applications (e.g., estimating CAPM models), both GMM and QMLE can be used to estimate the same parameter vector. In general, by making fewer assumptions on the DGP, GMM will be less efficient than QMLE if the pseudo-likelihood function is close to the true conditional distribution of $Y_t$ given $\Psi_t$.

9.5 Model Specification Testing

It is important to check whether a conditional probability distribution model $f(y|\Psi_t,\theta)$ is correctly specified. There are various reasons:
(i) A misspecified pdf/pmf model $f(y|\Psi_t,\theta)$ implies suboptimal forecasts of the true probability distribution of the underlying process.
(ii) The QMLE based on a misspecified pdf/pmf model $f(y|\Psi_t,\theta)$ is less efficient than the MLE based on a correctly specified pdf/pmf model.
(iii) A misspecified pdf/pmf model $f(y|\Psi_t,\theta)$ implies that we have to use a robust version of the asymptotic variance of the QMLE, because, among other things, the conditional information matrix equality no longer holds. As a consequence, the resulting statistical inference procedures are more tedious.

Question: How can we check whether a conditional distribution model $f(y|\Psi_t,\theta)$ is correctly specified?

We now introduce a number of specification tests for the conditional distributional model $f(y|\Psi_t,\theta)$.

Case I: $\{Z_t=(Y_t,X_t')'\}$ is i.i.d.

When the data generating process is an i.i.d. sequence, we have
$$\sqrt{n}(\hat\theta-\theta^o) \overset{d}{\to} N(0,\,H_o^{-1}V_oH_o^{-1}),$$
where $V_o=E[S_t(\theta^o)S_t(\theta^o)']$.

White's (1982) Information Matrix Test

In the i.i.d. random sample context, White (1982) proposes a specification test for $f(y|\Psi_t,\theta)=f(y|X_t,\theta)$ by checking whether the information matrix equality holds:
$$E[S_t(\theta^o)S_t(\theta^o)'] + E[H_t(\theta^o)] = 0.$$
This equality is implied by correct model specification. If the information matrix equality does not hold, then there is evidence of model misspecification for the conditional distribution of $Y$ given $X$. Define the $\frac{K(K+1)}{2}\times 1$ sample average
$$\hat m(\theta) = \frac{1}{n}\sum_{t=1}^{n}m_t(\theta),$$
where
$$m_t(\theta) = \mathrm{vech}[S_t(\theta)S_t(\theta)' + H_t(\theta)].$$
Then one can check whether the sample average $\hat m(\hat\theta)$ is close to zero (the population moment).

How large the magnitude of $\hat m(\hat\theta)$ should be in order to be considered significantly larger than zero can be determined by the asymptotic distribution of $\sqrt{n}\,\hat m(\hat\theta)$.

Question: How can we derive the asymptotic distribution of $\sqrt{n}\,\hat m(\hat\theta)$?
White (1982) proposes an information matrix test using a suitable quadratic form of $\sqrt{n}\,\hat m(\hat\theta)$ that is asymptotically $\chi^2_{K(K+1)/2}$ under correct model specification. Specifically, White (1982) shows that
$$\sqrt{n}\,\hat m(\hat\theta) = n^{-1/2}\sum_{t=1}^{n}[m_t(\theta^o) - D_oH_o^{-1}S_t(\theta^o)] + o_P(1)
\overset{d}{\to} N(0,W),$$
where $D_o\equiv D(\theta^o)=E\left[\frac{\partial m_t(\theta^o)}{\partial\theta'}\right]$, and the asymptotic variance is
$$W = \mathrm{var}[m_t(\theta^o) - D_oH_o^{-1}S_t(\theta^o)].$$
It follows that a test statistic can be constructed by using the quadratic form
$$M = n\,\hat m(\hat\theta)'\hat W^{-1}\hat m(\hat\theta) \overset{d}{\to} \chi^2_{K(K+1)/2}$$
for some consistent variance estimator $\hat W$ for $W$. Putting $\hat W_t=m_t(\hat\theta)-\hat D(\hat\theta)\hat H^{-1}(\hat\theta)S_t(\hat\theta)$, we can use the variance estimator
$$\hat W = \frac{1}{n}\sum_{t=1}^{n}\hat W_t\hat W_t'.$$
Question: If the information matrix equality holds, is the model $f(y|X_t,\theta)$ correctly specified for the conditional distribution of $Y_t$ given $X_t$?

Answer: No. Correct model specification implies the information matrix equality, but the converse need not be true. The information matrix equality is only one of many (indeed infinitely many) implications of correct specification of $f(y|\Psi_t,\theta)$.

Although White (1982) considers i.i.d. random samples only, his IM test is applicable to both cross-sectional and time series models as long as the score function $\{S_t(\theta^o)\}$ is an MDS.

Case II: $\{Z_t=(Y_t,X_t')'\}$ is a serially dependent process.

White's (1994) Dynamic Information Matrix Test

In a time series context, White (1994) proposes a dynamic information matrix test that essentially checks the MDS property of the score function $\{S_t(\theta^o)\}$:
$$E[S_t(\theta^o)|\Psi_t] = 0,$$
which is implied by correct model specification of $f(y|\Psi_t,\theta)$.

Let
$$m_t(\theta) = \mathrm{vech}[S_t(\theta)\otimes W_t(\theta)],$$
where $W_t(\theta)=[S_{t-1}(\theta)',S_{t-2}(\theta)',\ldots,S_{t-p}(\theta)']'$ and $\otimes$ is the Kronecker product. Then the MDS property implies
$$E[m_t(\theta^o)] = 0.$$
This test essentially checks whether $\{S_t(\theta^o)\}$ is a white noise process up to lag order $p$. If $E[m_t(\theta^o)]\neq 0$, i.e., if there exist serial correlations in $\{S_t(\theta^o)\}$, then there is evidence of model misspecification.

White (1994) considers the sample average
$$\hat m = n^{-1}\sum_{t=1}^{n}m_t(\hat\theta)$$
and checks whether it is close to zero. White (1994) develops a so-called dynamic information matrix test by using a suitable quadratic form of $\sqrt{n}\,\hat m$ that is asymptotically chi-square distributed under correct dynamic model specification.

Question: If $\{S_t(\theta^o)\}$ is an MDS, is $f(y|\Psi_t,\theta)$ correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$?

No. Correct model specification implies that $\{S_t(\theta^o)\}$ is an MDS, but the converse need not be true. It is possible that $\{S_t(\theta^o)\}$ is an MDS even when the model $f(y|\Psi_t,\theta)$ is misspecified for the conditional distribution of $Y_t$ given $\Psi_t$. A better approach is to test the conditional density model itself, rather than properties of its derivatives (e.g., the MDS property of the score function or the information matrix equality).

Next, we consider a test that directly checks the conditional distribution of $Y_t$ given $\Psi_t$.

Hong and Li's (2005) Nonparametric Test for Time Series Conditional Distribution Models

Suppose $Y_t$ is a univariate continuous random variable, and $f(y|\Psi_t,\theta)$ is a conditional distribution model of $Y_t$ given $\Psi_t$. Define the dynamic probability integral transform
$$U_t(\theta) = \int_{-\infty}^{Y_t}f(y|\Psi_t,\theta)\,dy.$$

Lemma 9.17: If $f(y|\Psi_t,\theta^o)$ coincides with the true conditional pdf of $Y_t$ given $\Psi_t$, then $\{U_t(\theta^o)\}\sim$ i.i.d. U[0,1].

Thus, one can test whether $\{U_t(\theta^o)\}$ is i.i.d. U[0,1]. If it is not, there is evidence of model misspecification.

Question: Suppose $\{U_t(\theta^o)\}$ is i.i.d. U[0,1]; is the model $f(y|\Psi_t,\theta)$ correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$?

For a univariate time series (so that $\Psi_t=\{Y_{t-1},Y_{t-2},\ldots\}$), the i.i.d. U[0,1] property holds if and only if the conditional pdf model $f(y_t|\Psi_t,\theta)$ is correctly specified.

Hong and Li (2005) use a nonparametric kernel estimator for the joint density of $\{U_t(\theta^o),U_{t-j}(\theta^o)\}$ and compare the joint density estimator with $1=1\times 1$, the product of the marginal densities of $U_t(\theta^o)$ and $U_{t-j}(\theta^o)$ under correct model specification. The test statistic follows an asymptotic N(0,1) distribution. See Hong and Li (2005) for more discussion.
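The dynamic probability integral transform itself is easy to compute once a model is fitted. The sketch below is illustrative only (a Gaussian AR(1) model with assumed fitted parameters, not the Hong-Li test statistic itself); it computes $U_t$ and inspects uniformity and first-order dependence informally.

```python
# Dynamic PIT for a fitted Gaussian AR(1)-type model:
# U_t = Phi((Y_t - c - phi*Y_{t-1}) / sigma). Under correct specification, {U_t}
# should look i.i.d. U[0,1]; we inspect its histogram and lag-1 autocorrelation.
import numpy as np
from scipy import stats

def pit_gaussian_ar1(y, c, phi, sigma):
    resid = y[1:] - c - phi * y[:-1]
    return stats.norm.cdf(resid / sigma)

# Example usage on simulated data with assumed "fitted" parameters (c, phi, sigma)
rng = np.random.default_rng(4)
y = np.empty(1000); y[0] = 0.0
for t in range(1, 1000):
    y[t] = 0.2 + 0.5 * y[t - 1] + rng.normal()
U = pit_gaussian_ar1(y, c=0.2, phi=0.5, sigma=1.0)
print(np.histogram(U, bins=10, range=(0, 1))[0])        # roughly uniform counts
print(np.corrcoef(U[1:], U[:-1])[0, 1])                 # near zero if i.i.d.
```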

9.6 Empirical Applications

Empirical Application I: China's Evolving Managerial Labor Market

Groves, Hong, McMillan and Naughton (1995, Journal of Political Economy)

Question: How does the industrial bureau decide whether to use a competitive auction to select firm managers?

We define a binary variable as follows: $Y_t=1$ if the current manager of firm $t$ was selected by competitive auction, and $Y_t=0$ otherwise. We use the past performance of a firm and the size of a firm to predict the probability of $Y_t=1$. Thus, we put $X_t=(1,X_{1t},X_{2t})'$, where $X_{1t}$ is the past performance of firm $t$ (the average output per worker in the past three years relative to the industry average) and $X_{2t}$ is the size of firm $t$ (the number of employees of firm $t$ relative to the industry).

We specify a probit model:
$$P(Y_t=1|X_t) = \Phi(X_t'\beta),$$
where $\Phi(\cdot)$ is the N(0,1) CDF.
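For reference, probit estimation by maximum likelihood can be sketched as follows. This is not the paper's code or data; the simulated sample and all names are illustrative assumptions.

```python
# Probit MLE sketch for P(Y=1|X) = Phi(X'beta), maximizing
# sum_t [Y_t*log(Phi(X_t'beta)) + (1-Y_t)*log(1-Phi(X_t'beta))].
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def probit_neg_loglik(beta, X, y):
    p = stats.norm.cdf(X @ beta)
    p = np.clip(p, 1e-10, 1 - 1e-10)        # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Example usage on simulated data with an intercept and two regressors
rng = np.random.default_rng(5)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.3, -0.5, -0.2])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)

res = minimize(probit_neg_loglik, x0=np.zeros(3), args=(X, y), method="BFGS")
print(res.x)    # MLE of beta
```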

Estimation Results:

    Variable         $X_{1t}$        $X_{2t}$        $n$
    Estimate         -0.2769**       -0.2467**       645
    ($t$-statistic)  (-7.485)        (-7.584)

where ** indicates significance at the 5% level. These results suggest that poor-performing and/or smaller firms are more likely to have their managers selected by competitive auction.

Empirical Application II: Full Dynamics of the Short-Term Interest Rate

Data: Daily series of 7-day Eurodollar rates $\{r_t\}$ from June 1, 1973 to February 25, 1995. The sample size is $T=5050$.

We are interested in modeling the conditional probability distribution of the short-term interest rate. There are two popular discrete-time models for the spot interest rate: one is the GARCH model, and the other is the Markov chain regime-switching model.

Model 1: GARCH(1,1)-Level Effect with an i.i.d. N(0,1) innovation:
$$\Delta r_t = \alpha_{-1}r_{t-1}^{-1} + \alpha_0 + \alpha_1 r_{t-1} + \alpha_2 r_{t-1}^2 + r_{t-1}^{\gamma}h_t^{1/2}z_t,$$
$$h_t = \beta_0 + \beta_1 h_{t-1} + \beta_2 h_{t-1}z_{t-1}^2,$$
$$\{z_t\}\sim\text{i.i.d. }N(0,1).$$
Here, the conditional mean of the interest rate change is a nonlinear function of the interest rate level:
$$\mu_t = E(\Delta r_t|I_{t-1}) = \alpha_{-1}r_{t-1}^{-1} + \alpha_0 + \alpha_1 r_{t-1} + \alpha_2 r_{t-1}^2.$$
This specification can capture nonlinear dynamics in the interest rate movement. The conditional variance model of the interest rate change is
$$\sigma_t^2 = \mathrm{var}(\Delta r_t|I_{t-1}) = r_{t-1}^{2\gamma}h_t,$$
where $r_{t-1}^{\gamma}$ captures the so-called "level effect," in the sense that when $\gamma>0$, volatility increases when the interest rate level is high. On the other hand, the GARCH component $h_t$ captures volatility clustering.

Estimation Results

Parameter Estimates for the GARCH Model (with nonlinear drift and level effect)

    Parameter              Estimate (GARCH)    Std. Error (GARCH)
    $\alpha_{-1}$                 -0.0984             0.1249
    $\alpha_0$ (1e-02)             5.0494             6.3231
    $\alpha_1$ (1e-03)            -4.4132             9.2876
    $\alpha_2$                     0.0000             0.0004
    $\gamma$                       1.0883             0.0408
    $\beta_0$ (1e-03)              0.0738             0.0119
    $\beta_2$ (1e-01)              6.4117             0.1359
    $\beta_1$ (1e-01)              3.5260             0.2181
    Log-Likelihood               654.13

Model 2: Regime-Switching Model with GARCH and Level Effects
$$r_t = \mu(S_{t-1}) + \phi(S_{t-1})r_{t-1} + \sigma(S_{t-1})\,r_{t-1}^{\gamma(S_{t-1})}h_t^{1/2}z_t,$$
$$h_t = b_0 + b_1h_{t-1} + b_2z_{t-1}^2,$$
$$\{z_t\}\sim\text{i.i.d. }N(0,1),$$
where the state variable $S_t$ is a latent process that is assumed to follow a two-state Markov chain with time-varying transition probabilities, as specified in Ang and Bekaert (1998):
$$P(S_t=1|S_{t-1}=1) = [1+\exp(-a_{01}-a_{11}r_{t-1})]^{-1},$$
$$P(S_t=0|S_{t-1}=0) = [1+\exp(-a_{00}-a_{10}r_{t-1})]^{-1}.$$

Question: What is the model likelihood function? That is, what is the conditional density of $r_t$ given $I_{t-1}=\{r_{t-1},r_{t-2},\ldots\}$, the observed information set available at time $t-1$?

The difficulty arises because the state variable $S_t$ is not observable. See Hamilton (1994, Chapter 22) for the treatment.

Estimation Results

Parameter Estimates for the Regime-Switching Model (with GARCH and level effect)

Parameters Estimates (RS) Std. Error (RS)

0 1.5378 1.5378

0 -1.0646 0.4207

1 -0.0013 0.0351

1 -0.0076 0.0484

1 0.3355 0.0483

0 0.3566 0.0693

1 0.0064 0.0512

b0 (1e-03) 6.5126 1.9898

b1 0.0224 0.0034

b2 0.7810 0.0254

a00 0.2350 0.2192

a01 4.5398 0.2691

a10 0.0208 0.0184

a11 -0.2800 0.0296

Log-Likelihood 2712.97

Empirical Application III: Volatility Models of Foreign Exchange Returns

Hong (2001, Journal of Econometrics)

Suppose one is interested in studying volatility spillover between two exchange rates, the German Deutschmark and the Japanese Yen. A first step is to specify a univariate volatility model for the German Deutschmark and the Japanese Yen respectively. Hong fits an AR(3)-GARCH(1,1) model to weekly German Deutschmark exchange rate changes and Japanese Yen exchange rate changes:

Model: AR(3)-GARCH(1,1)-i.i.d. N(0,1)
$$X_t = \mu_t + \varepsilon_t,$$
$$\mu_t = b_0 + \sum_{j=1}^{3}b_jX_{t-j},$$
$$\varepsilon_t = h_t^{1/2}z_t,$$
$$h_t = \omega + \alpha\varepsilon_{t-1}^2 + \beta h_{t-1},$$
$$\theta = (b_0,b_1,b_2,b_3,\omega,\alpha,\beta)'.$$
Assuming that $\{z_t\}\sim$ i.i.d. $N(0,1)$, we obtain the following QMLE.

Data: First week of 1976:1 to last week of 1995:11, 1039 observations in total.

Estimation Results

                            DM                          YEN
    Parameter       Estimate    s.d.          Estimate    s.d.
    $b_0$             0.073      0.041          0.097      0.042
    $b_1$             0.049      0.033          0.051      0.034
    $b_2$             0.067      0.033          0.093      0.034
    $b_3$             0.028      0.033          0.066      0.033
    $\omega$          0.051      0.030          0.116      0.068
    $\alpha$          0.114      0.027          0.084      0.026
    $\beta$           0.873      0.033          0.863      0.055
    Sample Size       1038                      1038
    Log-Likelihood    1862.307                  1813.625

The standard errors reported here are robust standard errors.

9.7 Conclusion
Conditional probability distribution models have wide applications in economics and finance. For some applications, one is required to specify the entire conditional distribution of the underlying process. If the distribution model is correct, the resulting estimator $\hat\theta$ which maximizes the likelihood function is called the MLE.

For some other applications, on the other hand, one is only required to specify certain aspects (e.g., conditional mean and conditional variance) of the distribution. One important example is volatility modeling for financial time series. To estimate model parameters, one usually makes some auxiliary assumptions on the distribution that may be incorrect, so that one can estimate $\theta$ by maximizing the pseudo likelihood function. This is called QMLE. MLE is asymptotically more efficient than QMLE, because the asymptotic variance of MLE attains the Cramer-Rao lower bound.

The likelihood function of a correctly specified conditional distributional model has different properties from that of a misspecified conditional distributional model. In particular, for a correctly specified distributional model, the score function is an MDS and the conditional information matrix equality holds. As a consequence, the asymptotic distributions of MLE and QMLE are different (more precisely, their asymptotic variances are different). In particular, the asymptotic variance of MLE is analogous to that of the OLS estimator under MDS regression errors with conditional homoskedasticity, and the asymptotic variance of QMLE is analogous to that of the OLS estimator under possibly non-MDS errors with conditional heteroskedasticity.

Hypothesis tests can be developed using MLE or QMLE. For hypothesis testing under a correctly specified conditional distributional model, the Wald test, Lagrange Multiplier test, and Likelihood Ratio test can be used. When a conditional distributional model is misspecified, robust Wald tests and LM tests can be constructed. Like the F-test in the regression context, Likelihood Ratio tests are valid only when the distribution model is correctly specified. The reason is that they exploit the MDS property of the score function and the information matrix equality, which may not hold under model misspecification.

It is important to test correct specification of a conditional distributional model. We introduce some specification tests for conditional distributional models under i.i.d. observations and time series observations respectively. In particular, White (1982) proposes an Information Matrix test for i.i.d. observations, and White (1994) proposes a dynamic information matrix test that essentially checks the MDS property of the score function of a correctly specified conditional distribution model with time series observations.

EXERCISES
9.1. For the probit model P(Y_t = y|X_t) = Φ(X_t'θ^o)^y [1 − Φ(X_t'θ^o)]^{1−y}, where y = 0, 1, show that
(a) E(Y_t|X_t) = Φ(X_t'θ^o);
(b) var(Y_t|X_t) = Φ(X_t'θ^o)[1 − Φ(X_t'θ^o)].

9.2. For a censored regression model, show that E(X_t ε_t | Y_t > c) ≠ 0. Thus, the OLS estimator based on a censored random sample cannot be consistent for the true model parameter β^o.

9.3. Suppose f(y|Ψ, θ) is a conditional pdf model for Y given Ψ, where θ ∈ Θ, a parameter space. Show that for all θ, θ̇ ∈ Θ and all Ψ,

∫ ln[f(y|Ψ, θ)] f(y|Ψ, θ̇) dy ≤ ∫ ln[f(y|Ψ, θ̇)] f(y|Ψ, θ̇) dy.

9.4. (a) Suppose f(y|Ψ, θ), θ ∈ Θ, is a correctly specified model for the conditional probability density of Y given Ψ, such that f(y|Ψ, θ^o) coincides with the true conditional probability density of Y given Ψ. We assume that f(Y|Ψ, θ) is continuously differentiable with respect to θ and θ^o is an interior point in Θ. Please show that

E[∂ ln f(Y|Ψ, θ^o)/∂θ] = 0.

(b) Suppose Part (a) is true. Can we conclude that f(y|Ψ, θ) is correctly specified for the conditional distribution of Y given Ψ? If yes, give your reasoning. If not, give a counter example.

9.5. Suppose f(y|x, θ), θ ∈ Θ ⊂ R^K, is a correctly specified model for the conditional probability density of Y given X, such that for some parameter value θ^o, f(y|x, θ^o) coincides with the true conditional probability density of Y given X. We assume that f(Y|x, θ) is continuously differentiable with respect to θ and θ^o is an interior point in Θ. Please show that

E[(∂ ln f(Y|X, θ^o)/∂θ)(∂ ln f(Y|X, θ^o)/∂θ') | X] + E[∂² ln f(Y|X, θ^o)/∂θ∂θ' | X] = 0,

where ∂ ln f/∂θ is a K × 1 vector, ∂ ln f/∂θ' is the transpose of ∂ ln f/∂θ, ∂² ln f/∂θ∂θ' is a K × K matrix, and the expectation E(·|X) is taken under the true conditional distribution of Y given X.

9.6. Put V_o = E[S_t(θ^o)S_t(θ^o)'] and H_o = E[∂S_t(θ^o)/∂θ'] = E[∂² ln f_{Y_t|Ψ_t}(Y_t|Ψ_t, θ^o)/∂θ∂θ'], where S_t(θ) = ∂ ln f(Y_t|Ψ_t, θ)/∂θ and θ^o = arg max_{θ∈Θ} l(θ), with l(θ) = E[ln f_{Y_t|Ψ_t}(Y_t|Ψ_t, θ)]. Is H_o^{-1} V_o H_o^{-1} − (−H_o^{-1}) always positive semi-definite? Give your reasoning and any necessary regularity conditions. Note that the first term H_o^{-1} V_o H_o^{-1} is the formula for the asymptotic variance of √n(θ̂_QMLE − θ^o) and the second term −H_o^{-1} is the formula for the asymptotic variance of √n(θ̂_MLE − θ^o).

9.7. Suppose a conditional pdf/pmf model f(y|x, θ) is misspecified for the conditional distribution of Y given X, namely, there exists no θ ∈ Θ such that f(y|x, θ) coincides with the true conditional distribution of Y given X. Show that generally,

E[(∂ ln f(Y|X, θ^o)/∂θ)(∂ ln f(Y|X, θ^o)/∂θ') | X] + E[∂² ln f(Y|X, θ^o)/∂θ∂θ' | X] = 0

does not hold, where θ^o satisfies Assumptions 9.4 and 9.5. In other words, the conditional information matrix equality generally does not hold when the conditional pdf/pmf model f(y|x, θ) is misspecified for the conditional distribution of Y given X.

9.8. Consider the following maximum likelihood estimation problem:

Assumption 7.1: {(Y_t, X_t')'} is a stationary ergodic process, and f(Y_t|Ψ_t, θ) is a correctly specified conditional probability density model of Y_t given Ψ_t = (X_t', Z^{t-1}')', where Z^{t-1} = (Z_{t-1}', Z_{t-2}', ..., Z_1')' and Z_t = (Y_t, X_t')'. For each θ, ln f(Y_t|Ψ_t, θ) is a measurable function of the data, and for each t, ln f(Y_t|Ψ_t, θ) is twice continuously differentiable with respect to θ ∈ Θ, where Θ is a compact set.

Assumption 7.2: l(θ) = E[ln f(Y_t|Ψ_t, θ)] is continuous in θ ∈ Θ.

Assumption 7.3: (i) θ^o = arg max_{θ∈Θ} l(θ) is the unique maximizer of l(θ) over Θ, and (ii) θ^o is an interior point of Θ.

Assumption 7.4: (i) {S_t(θ^o) ≡ ∂ ln f(Y_t|Ψ_t, θ^o)/∂θ} obeys a CLT, i.e.,

√n Ŝ(θ^o) = n^{-1/2} Σ_{t=1}^n S_t(θ^o)

converges to a multivariate normal distribution with some K × K variance-covariance matrix;

(ii) {H_t(θ) ≡ ∂² ln f(Y_t|Ψ_t, θ)/∂θ∂θ'} obeys a uniform weak law of large numbers (UWLLN) over Θ. That is,

lim_{n→∞} sup_{θ∈Θ} || n^{-1} Σ_{t=1}^n H_t(θ) − H(θ) || = 0 a.s.,

where the K × K Hessian matrix H(θ) ≡ E[H_t(θ)] is symmetric, finite and nonsingular, and is continuous in θ ∈ Θ.

The maximum likelihood estimator is defined as θ̂ = arg max_{θ∈Θ} l̂_n(θ), where l̂_n(θ) ≡ n^{-1} Σ_{t=1}^n ln f(Y_t|Ψ_t, θ). Suppose we have shown that θ̂ → θ^o almost surely; this consistency result can be used in answering the following questions in parts (a)-(d). Show your reasoning in each step.
(a) Find the first order condition of the MLE.
(b) Derive the asymptotic distribution of √n(θ̂ − θ^o). Note that the asymptotic variance of √n(θ̂ − θ^o) should be expressed in terms of the Hessian matrix H(θ^o).
(c) Find a consistent estimator for the asymptotic variance of √n(θ̂ − θ^o) and justify why it is consistent.
(d) Construct a Wald test statistic for the null hypothesis H_0: R(θ^o) = r, where r is a J × 1 constant vector, and R(θ) is a J × 1 vector whose derivative R'(θ) is continuous in θ and R'(θ^o) is of full rank. Derive the asymptotic distribution of the Wald test statistic under H_0.

9.9. Consider the linear regression model Y_t = X_t'β^o + ε_t, where ε_t|Ψ_t ~ N(0, σ_o²). Put θ = (β', σ²)' and note that

f(Y_t|X_t, θ) = (2πσ²)^{-1/2} exp[−(Y_t − X_t'β)²/(2σ²)],

l̂(θ) = n^{-1} Σ_{t=1}^n ln f(Y_t|X_t, θ)
     = −(1/2) ln(2πσ²) − (2σ²)^{-1} n^{-1} Σ_{t=1}^n (Y_t − X_t'β)².

Suppose H_0: Rβ^o = r is the hypothesis of interest.
(a) Show that, up to an additive constant that is the same under H_0 and under the alternative,

l̂(θ̂) = −(1/2) ln(e'e),
l̂(θ̃) = −(1/2) ln(ẽ'ẽ),

where θ̃ is the MLE under H_0.
(b) Show that under H_0,

−2n[l̂(θ̃) − l̂(θ̂)] = n ln(ẽ'ẽ/e'e)
= J·[(ẽ'ẽ − e'e)/J]/(e'e/n) + o_P(1)
= J·F + o_P(1).

9.10. Show that the dynamic probability integral transforms {U_t(θ^o)} are i.i.d. U[0,1] if the conditional probability density model f(y|Ψ_t, θ) is correctly specified for the conditional distribution of Y_t given Ψ_t.
CHAPTER 10 CONCLUSION
Abstract: In this chapter, we first review what we have covered in the previous chapters, and then discuss other econometric courses needed for various fields of economics and finance.

Key words: Microeconometrics, Financial econometrics, Nonparametric econometrics, Panel data econometrics, Time series econometrics.
10.1 Summary
Question: What have we learnt from this course?

In this chapter, we will first summarize what we have learnt in this book.
The modern econometric theory developed in this book is built upon the following fundamental axioms:

Any economy can be viewed as a stochastic process governed by some probability law.

Any economic phenomenon can be viewed as a realization of the stochastic economic process.

The probability law of the data generating process can be called the “law of economic motions.”
The objective of econometrics is to infer the probability law of economic motions using observed
data, and then use the obtained knowledge to explain what has happened, to predict what will
happen, and to test economic theories and economic hypotheses.

Suppose the conditional pdf f(y_t|Ψ_t) of Y_t given Ψ_t = (X_t, Z^{t-1}) is available. Then we can obtain various attributes of the conditional distribution of Y_t given Ψ_t (their standard definitions are recalled below), such as

 conditional mean;

 conditional variance;

 conditional skewness;

 conditional kurtosis;

 conditional quantile.
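These attributes can be written explicitly in terms of the conditional pdf; the following display records the standard definitions (added here for reference, not verbatim from the notes):

\[
\begin{aligned}
\mu_t &= E(Y_t\mid \Psi_t) = \int y\, f(y\mid \Psi_t)\,dy, \\
\sigma_t^2 &= \mathrm{var}(Y_t\mid \Psi_t) = \int (y-\mu_t)^2 f(y\mid \Psi_t)\,dy, \\
s_t &= E\big[(Y_t-\mu_t)^3\mid \Psi_t\big]/\sigma_t^3, \qquad
\kappa_t = E\big[(Y_t-\mu_t)^4\mid \Psi_t\big]/\sigma_t^4, \\
Q_t(\alpha) &= \inf\Big\{y: \int_{-\infty}^{y} f(u\mid \Psi_t)\,du \ge \alpha\Big\}, \quad \alpha\in(0,1).
\end{aligned}
\]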

An important question in economic analysis is: what aspect of the conditional pdf will be important in economics and finance? Generally speaking, the answer is dictated by the nature of the economic problem one has at hand. For example, the efficient market hypothesis states that the conditional expected asset return given the past information is equal to the long-run market average return; rational expectations theory suggests that conditional expectational errors given the past information should be zero. In unemployment duration analysis, one should model the entire conditional distribution of the unemployment duration given the economic characteristics of the unemployed workers.

It should be emphasized that the conditional pdf or its various aspects only indicate a predictive relationship between economic variables, that is, when one can use some economic variables to predict others. The predictive relationship may or may not be the causal relationship between or among economic variables, which is often of central interest to economists. Economic theory often hypothesizes a causal relationship, and such theory is used to interpret the predictive relationship as a causal relationship.

An economic theory or economic model is not a general framework that embeds an econometric model. Rather, economic theory is often formulated as a restriction on the conditional pdf or on certain aspects of it. Such a restriction can be used to validate the economic theory, and to improve forecasts if the restriction is valid or approximately valid.
Question: What is the role that economic theory plays in economic modeling?

Indication of the nature (e.g., conditional mean, conditional variance, etc.) of the relationship between Y_t and X_t: which moments are important and of interest?

Choice of economic variables Xt :

Restriction on the functional form or parameters of the relationship.

Helping judge causal relationships.

In summary, any economic theory can be formulated as a restriction on the conditional probability distribution of the economic stochastic process. Economic theory plays an important role in simplifying statistical relationships so that a parsimonious econometric model can eventually capture the essential economic relationships.

Motivated by the fact that economic theory often has implications on, and only on, the conditional mean of the economic variables of interest, we first develop a comprehensive econometric theory for linear regression models, where by linearity we mean that the conditional mean is linear in parameters but not necessarily linear in explanatory variables. We start in Chapter 3 with the classical linear regression model, for which we develop a finite sample statistical theory when the regression disturbance is i.i.d. normally distributed and is independent of the regressors. The normality assumption is crucial for the finite sample statistical theory. The essence of the classical theory for linear regression models is the i.i.d. assumption, which implies conditional homoskedasticity and serial uncorrelatedness and thus ensures the BLUE property of the OLS estimator. When conditional heteroskedasticity and autocorrelation exist, the GLS estimator illustrates how to restore the BLUE property by correcting conditional heteroskedasticity and differencing out serial correlation.
With the classical linear regression model as a benchmark, we have developed a modern econometric theory for linear regression models by relaxing the classical assumptions in subsequent chapters. First of all, we relax the normality assumption in Chapter 4. This calls for asymptotic analysis because a finite sample theory is no longer possible. It is shown that when the sample size is large, the classical results are still approximately applicable for linear regression models with independent observations under conditional homoskedasticity. However, under conditional heteroskedasticity, the classical results, such as the popular t-test and F-test statistics, are no longer applicable, even if the sample size goes to infinity. This is due to the fact that the asymptotic variance of the OLS estimator has a different structure under conditional heteroskedasticity. We need White's (1980) heteroskedasticity-consistent variance-covariance estimator to develop robust hypothesis tests. It is therefore important to test conditional homoskedasticity, and White (1980) develops a regression-based test procedure.
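For reference, a minimal sketch of such a heteroskedasticity-consistent covariance estimator (illustrative code under assumed inputs, not taken from the text) is:

import numpy as np

def white_hc_cov(X, y):
    """OLS with White's (1980) heteroskedasticity-consistent covariance matrix.

    X: (n, K) regressor matrix; y: (n,) dependent variable.
    Returns the OLS estimate and the HC0 covariance estimate
    (X'X)^{-1} (sum_t e_t^2 x_t x_t') (X'X)^{-1}.
    """
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta_hat                      # OLS residuals
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = (X * e[:, None]**2).T @ X          # sum_t e_t^2 x_t x_t'
    return beta_hat, XtX_inv @ meat @ XtX_inv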
The asymptotic theory developed for linear regression models with independent observations in Chapter 4 is extended to linear regression models with time series observations. This covers two types of regression models: one is called a static regression model, whose explanatory variables or regressors are exogenous variables; the other is called a dynamic regression model, whose regressors include lagged dependent variables and exogenous variables. It is shown in Chapter 5 that the asymptotic theory of Chapter 4 remains applicable when the regression disturbance is a martingale difference sequence. Because of its importance, we introduce tests for the martingale difference sequence property of the regression disturbance by checking for serial correlation in the disturbance. The tests include the popular Lagrange multiplier test for serial correlation. We have also considered a Lagrange multiplier test for autoregressive conditional heteroskedasticity (ARCH) and discussed its implications for the inference of static and dynamic regression models respectively.

For many static regression models, it is evident that the regression disturbance displays serial correlation. This affects the asymptotic variance of the OLS estimator. When serial correlation has a known structure up to a few unknown parameters, we can use the Cochrane-Orcutt procedure to obtain an asymptotically efficient estimator of the regression parameters. When serial correlation is of unknown form, we have to use a long-run variance estimator to estimate the asymptotic variance of the OLS estimator. A leading example is the kernel-based estimator such as the Newey-West variance estimator. With such a variance estimator, robust test procedures for hypotheses of interest can be constructed. These are discussed in Chapter 6.
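A minimal sketch of a Bartlett-kernel (Newey-West type) long-run variance estimator, written for the matrix of moment contributions x_t e_t (illustrative code with assumed inputs, not from the original notes):

import numpy as np

def newey_west_lrv(U, bandwidth):
    """Newey-West (Bartlett kernel) long-run variance estimator.

    U: (n, K) array whose rows are, e.g., x_t * e_t for an OLS regression.
    bandwidth: truncation lag p; autocovariances beyond p receive zero weight.
    Returns Omega_hat = Gamma_0 + sum_{j=1}^{p} w_j (Gamma_j + Gamma_j'),
    with Bartlett weights w_j = 1 - j/(p+1).
    """
    n, _ = U.shape
    U = U - U.mean(axis=0)                      # demean the moment contributions
    omega = U.T @ U / n                         # Gamma_0
    for j in range(1, bandwidth + 1):
        gamma_j = U[j:].T @ U[:-j] / n          # j-th sample autocovariance
        w = 1.0 - j / (bandwidth + 1.0)
        omega += w * (gamma_j + gamma_j.T)
    return omega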
The estimation and inference of linear regression models are complicated when the condition E(ε_t|X_t) = 0 does not hold, which can arise from measurement errors, simultaneous equations bias, omitted variables, and so on. In Chapter 7 we discuss a popular method, two-stage least squares, to estimate model parameters in such scenarios.
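A compact sketch of the two-stage least squares estimator under assumed variable names (instruments Z, regressors X, dependent variable y); the code is illustrative, not part of the original text:

import numpy as np

def two_stage_least_squares(Z, X, y):
    """2SLS: project the regressors on the instruments, then run OLS.

    Z: (n, L) instrument matrix with L >= K; X: (n, K) regressors; y: (n,).
    Returns beta_hat = (X' P_Z X)^{-1} X' P_Z y, where P_Z = Z (Z'Z)^{-1} Z'.
    """
    Pz_X = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)         # first stage: fitted values of X
    beta_hat = np.linalg.solve(Pz_X.T @ X, Pz_X.T @ y)   # second stage
    return beta_hat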
Chapter 8 introduces the GMM method, which is particularly suitable for estimating both linear and nonlinear econometric models that can be characterized by a set of moment conditions. A prime economic example is rational expectations theory, which is often characterized by an Euler equation. In fact, the GMM framework provides a convenient way to view most econometric estimators, including least squares and instrumental variables estimators.
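As a minimal illustration of this view (a hedged sketch with assumed names, not from the original notes), a GMM estimator minimizes a quadratic form in the sample moment conditions:

import numpy as np

def gmm_objective(theta, g, data, W):
    """Quadratic-form GMM objective: gbar(theta)' W gbar(theta).

    g(theta, data) returns an (n, L) array of moment contributions whose
    expectation is zero at the true parameter value; W is an (L, L) weighting matrix.
    """
    gbar = g(theta, data).mean(axis=0)   # sample moment conditions
    return gbar @ W @ gbar

# Example: OLS viewed as GMM with moment function x_t * (y_t - x_t' beta);
# the objective can be passed to any numerical optimizer.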
Chapter 9 discusses conditional probability distribution models and other econometric models that can be estimated by pseudo likelihood methods. Conditional distribution models have found wide applications in economics and finance, and MLE is the most popular and most efficient method to estimate parameters in conditional distribution models. On the other hand, many econometric models can be conveniently estimated by using a pseudo likelihood function. These include nonlinear least squares, ARMA and GARCH models, as well as limited dependent variable and discrete choice models. Such an estimation method is called the Quasi-MLE. There is an important difference between MLE and QMLE: the forms of their asymptotic variances are different. In a certain sense, the asymptotic variance of the MLE is similar in structure to the asymptotic variance of the OLS estimator under conditional homoskedasticity and serial uncorrelatedness, while the asymptotic variance of the QMLE is similar in structure to the asymptotic variance of the OLS estimator under conditional heteroskedasticity and autocorrelation.
Chapters 2 to 9 are treated in a unified and coherent manner. The theory is constructed progressively from the simplest classical linear regression models to nonlinear expectations models and then to conditional distributional models. The book has emphasized the important implications of conditional heteroskedasticity and autocorrelation, as well as of misspecification of conditional distributional models, for the asymptotic variance of the related econometric estimators. With a good command of the econometric theory developed in Chapters 2 to 9, we can conduct a variety of empirical analyses in economics and finance, including all the motivating examples introduced in Chapter 1. In addition to asymptotic theory, the book has also shown students how to do asymptotic analysis via the progressive development of the asymptotic theory in Chapters 2 to 9. Moreover, we have also introduced a variety of basic asymptotic analytic tools and concepts, including various convergence concepts, limit theorems, and basic time series concepts and models.

10.2 Directions for Further Study in Econometrics
The econometric theory presented in this book has laid down a solid foundation for further study in econometrics. However, it does not cover all of econometric theory. For example, we only cover stationary time series models; nonstationary time series models, such as unit root models and cointegrated models, have not been covered and call for a different asymptotic theory (see, e.g., Hamilton 1994). Panel data models also require a separate and independent treatment (see, e.g., Hsiao 2003). Due to the unique features of financial time series, particularly high-frequency financial time series, financial econometrics has emerged as a new field in econometrics that is not covered by standard time series econometrics. On the other hand, although our theory can be applied to models for limited dependent variables and discrete choice variables, more detailed treatment and comprehensive coverage are needed. Moreover, topics on asymptotic analytic tools may be covered to train students' ability in asymptotic analysis in a more comprehensive manner.
References
Bollerslev, T. (1986), "Generalized Autoregressive Conditional Heteroskedasticity," Journal of Econometrics 31, 307-327.
Box, G.E.P. and D.A. Pierce (1970), "Distribution of Residual Autocorrelations in Autoregressive Moving Average Time Series Models," Journal of the American Statistical Association 65, 1509-1526.
Campbell, J.Y. and J. Cochrane (1999), "By Force of Habit: A Consumption-Based Explanation of Aggregate Stock Market Behavior," Journal of Political Economy 107, 205-251.
Chen, D. and Y. Hong (2003), "Has Chinese Stock Market Become Efficient? Evidence from a New Approach," China Economic Quarterly 1 (2), 249-268.
Chow, G.C. (1960), "Tests of Equality Between Sets of Coefficients in Two Linear Regressions," Econometrica 28, 591-605.
Cournot, A. (1838), Researches into the Mathematical Principles of the Theory of Wealth, trans. Nathaniel T. Bacon, with an essay and a biography by Irving Fisher, 2nd edition, Macmillan: New York, 1927.
Cox, D.R. (1972), "Regression Models and Life Tables (with Discussion)," Journal of the Royal Statistical Society, Series B, 34, 187-220.
Engle, R. (1982), "Autoregressive Conditional Heteroskedasticity with Estimates of the Variance of United Kingdom Inflation," Econometrica 50, 987-1008.
Engle, R. and C.W.J. Granger (1987), "Co-Integration and Error Correction: Representation, Estimation and Testing," Econometrica 55, 251-276.
Fisher, I. (1933), "Report of the Meeting," Econometrica 1, 92-93.
Frisch, R. (1933), "Propagation Problems and Impulse Problems in Dynamic Economics," in Economic Essays in Honour of Gustav Cassel, Allen and Unwin: London.
Granger, C.W.J. (2001), "Overview of Nonlinear Macroeconometric Empirical Models," Macroeconomic Dynamics 5, 466-481.
Granger, C.W.J. and T. Teräsvirta (1993), Modelling Nonlinear Economic Relationships, Oxford University Press: Oxford.
Groves, T., Hong, Y., McMillan, J. and B. Naughton (1994), "Incentives in Chinese State-owned Enterprises," Quarterly Journal of Economics CIX, 183-209.
Gujarati, D.N. (2006), Essentials of Econometrics, 3rd Edition, McGraw-Hill: Boston.
Hansen, L.P. (1982), "Large Sample Properties of Generalized Method of Moments Estimators," Econometrica 50, 1029-1054.
Hansen, L.P. and K. Singleton (1982), "Generalized Instrumental Variables Estimation of Nonlinear Rational Expectations Models," Econometrica 50, 1269-1286.
Härdle, W. (1990), Applied Nonparametric Regression, Cambridge University Press: Cambridge.
Hong, Y. and Y.J. Lee (2005), "Generalized Spectral Testing for Conditional Mean Models in Time Series with Conditional Heteroskedasticity of Unknown Form," Review of Economic Studies 72, 499-541.
Hsiao, C. (2003), Analysis of Panel Data, 2nd Edition, Cambridge University Press: Cambridge.
Keynes, J.M. (1936), The General Theory of Employment, Interest and Money, Macmillan: London.
Kiefer, N. (1988), "Economic Duration Data and Hazard Functions," Journal of Economic Literature 26, 646-679.
Lancaster, T. (1990), The Econometric Analysis of Transition Data, Cambridge University Press: Cambridge, U.K.
Lucas, R. (1977), "Understanding Business Cycles," in Stabilization of the Domestic and International Economy, Karl Brunner and Allan Meltzer (eds.), Carnegie-Rochester Conference Series on Public Policy, Vol. 5, North-Holland: Amsterdam.
Mehra, R. and E. Prescott (1985), "The Equity Premium: A Puzzle," Journal of Monetary Economics 15, 145-161.
Nelson, C.R. and C.I. Plosser (1982), "Trends and Random Walks in Macroeconomic Time Series: Some Evidence and Implications," Journal of Monetary Economics 10, 139-162.
Pagan, A. and A. Ullah (1999), Nonparametric Econometrics, Cambridge University Press: Cambridge.
Phillips, P.C.B. (1987), "Time Series Regression with a Unit Root," Econometrica 55, 277-301.
Samuelson, L. (2005), "Economic Theory and Experimental Economics," Journal of Economic Literature XLIII, 65-107.
Samuelson, P. (1939), "Interactions Between the Multiplier Analysis and the Principle of Acceleration," Review of Economics and Statistics 21, 75-78.
Smith, A. (1776), An Inquiry into the Nature and Causes of the Wealth of Nations, edited, with an Introduction, Notes, Marginal Summary and an Enlarged Index, by Edwin Cannan; with an Introduction by Max Lerner, The Modern Library: New York, 1937.
Von Neumann, J. and O. Morgenstern (1944), Theory of Games and Economic Behavior, Princeton University Press: Princeton.
Walras, L. (1874), Elements of Pure Economics, or, The Theory of Social Wealth, translated by William Jaffé, Kelley: Fairfield, PA, 1977.
White, H. (1980), "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity," Econometrica 48, 817-838.
White, H. (1982), "Maximum Likelihood Estimation of Misspecified Models," Econometrica 50, 1-26.
White, H. (1994), Estimation, Inference and Specification Analysis, Cambridge University Press: Cambridge.
About the Author: Yongmiao Hong received his bachelor's degree in Physics in 1985 and his MA degree in Economics in 1988, both from Xiamen University. He received his PhD in Economics from the University of California, San Diego, in 1993. In the same year, he became a tenure-track assistant professor in the Department of Economics, Cornell University, where he became a tenured faculty member in 1998 and a full professor in 2001. He has also been a special-term visiting professor in the School of Economics and Management, Tsinghua University, since 2002, and a Cheung Kong Visiting Professor in the Wang Yanan Institute for Studies in Economics (WISE), Xiamen University, since 2005. He served as President of the Chinese Economists Society in North America in 2009-2010. Yongmiao Hong's research interests have been econometric theory, time series analysis, financial econometrics, and empirical studies of the Chinese economy and financial markets. He has published dozens of academic papers in a number of top academic journals in economics, finance and statistics, such as Econometrica, Journal of Political Economy, Quarterly Journal of Economics, Review of Economic Studies, Review of Economics and Statistics, Review of Financial Studies, Journal of Econometrics, Econometric Theory, Biometrika, Journal of the Royal Statistical Society Series B, and Journal of the American Statistical Association.