Lecture Notes On Advanced Econometrics
ADVANCED ECONOMETRICS
Yongmiao Hong
Department of Economics and
Department of Statistical Sciences
Cornell University
SPRING 2016
Chapter 5 Linear Regression Models with Dependent Observations
5.1 Introduction to Time Series Analysis
5.2 Framework and Assumptions
5.3 Consistency of OLS
5.4 Asymptotic Normality of OLS
5.5 Asymptotic Variance Estimator for OLS
5.6 Hypothesis Testing
5.7 Testing for Conditional Heteroskedasticity and Autoregressive Conditional Heteroskedasticity
5.8 Testing for Serial Correlation
5.9 Conclusion
Chapter 6 Linear Regression Models under Conditional Heteroskedasticity and
Autocorrelation
6.1 Framework and Assumptions
6.2 Long-run Variance Estimation
6.3 Consistency of OLS
6.4 Asymptotic Normality of OLS
6.5 Hypothesis Testing
6.6 Testing Whether Long-run Variance Estimation Is Needed
6.7 A Classical Cochrane-Orcutt Procedure
6.8 Empirical Applications
6.9 Conclusion
Chapter 7 Instrumental Variables Regression
7.1 Framework and Assumptions
7.2 Two-Stage Least Squares (2SLS) Estimation
7.3 Consistency of 2SLS
7.4 Asymptotic Normality of 2SLS
7.5 Interpretation and Estimation of the 2SLS Asymptotic Variance
7.6 Hypothesis Testing
7.7 Hausman’s Test
7.8 Empirical Applications
7.9 Conclusion
Chapter 8 Generalized Method of Moments Estimation
8.1 Introduction to the Method of Moments Estimation
8.2 Generalized Method of Moments (GMM) Estimation
8.3 Consistency of GMM
8.4 Asymptotic Normality of GMM
Chapter 3 introduces the classical linear regression analysis. A set of classical assumptions
is given and discussed, and conventional statistical procedures for estimation, inference, and
hypothesis testing are introduced. The roles of conditional homoskedasticity, serial uncorrelatedness,
and normality of the disturbance of a linear regression model are analyzed in a finite
sample econometric theory. We also discuss the generalized least squares estimation as an
efficient estimation method for a linear regression model when the variance-covariance matrix is
known up to a constant. In particular, the generalized least squares estimation is embedded as
an ordinary least squares estimation of a suitably transformed regression model via conditional
variance scaling and autocorrelation filtering.
The subsequent Chapters 4-7 are generalizations of classical linear regression analysis
when various classical assumptions fail. Chapter 4 first relaxes the normality and conditional
homoskedasticity assumptions, two key conditions assumed in the classical linear regression modeling.
A large sample theoretic approach is taken. For simplicity, it is assumed that the observed
data are generated from an independent and identically distributed random sample. It is shown
that while the finite sample distributional theory is no longer valid, the classical statistical procedures are
still approximately applicable when the sample size is large, provided conditional homoskedasticity
holds. In contrast, if the data display conditional heteroskedasticity, classical statistical
procedures are not applicable even for large samples, and heteroskedasticity-robust procedures
will be called for. Tests for the existence of conditional heteroskedasticity in a linear regression
framework are introduced.
Chapter 5 extends the linear regression theory to time series data. First, it introduces a
variety of basic concepts in time series analysis. Then it shows that the large sample theory for
i.i.d. random samples carries over to stationary ergodic time series data if the regression error
follows a martingale difference sequence. We introduce tests for serial correlation, and tests for
conditional heteroskedasticity and autoregressive conditional heteroskedasticity in a time series
regression framework. We also discuss the impact of autoregressive conditional heteroskedasticity
on inferences of static time series regressions and dynamic time series regressions.
Chapter 6 extends the large sample theory to a very general case where there exist conditional
heteroskedasticity and autocorrelation. In this case, the classical regression theory cannot be
used, and a long-run variance-covariance matrix estimator is called for to validate statistical
inferences in a time series regression framework.
Chapter 7 covers instrumental variables estimation for linear regression models, where the
regression error is correlated with the regressors. This can arise due to measurement errors,
simultaneous equation biases, and various other reasons. The two-stage least squares estimation
method and related statistical inference procedures are developed in detail. We also describe tests for
endogeneity.
Chapter 9 introduces the maximum likelihood estimation and the quasi-maximum likelihood
estimation methods for conditional probability models and other nonlinear econometric models.
We exploit the important implications of correct specification of a conditional distribution
model, especially the analogy between the martingale difference sequence property of the score
function and serial uncorrelatedness, and the analogy between the conditional information equality
and conditional homoskedasticity. These links provide great help in understanding the
large sample properties of the maximum likelihood estimator and the quasi-maximum likelihood
estimator.
Chapter 10 concludes the book by summarizing the main econometric theory and methods
covered in this book and pointing out directions for further development in econometrics.
This book has several important features. It covers, in a progressive manner, various econometric
models and related methods, from conditional means to possibly nonlinear conditional
moments to the entire conditional distribution, and this is achieved in a unified and coherent
framework. We also provide a brief review of asymptotic analytic tools and show how they are
used to develop the econometric theory in each chapter. By going through this book progressively,
readers will learn how to do asymptotic analysis for econometric models. Such skills are
useful not only for those students who intend to work on theoretical econometrics, but also for
those who intend to work on applied subjects in economics, because with such analytic skills,
readers will be able to understand more specialized or more advanced econometrics textbooks.
This book is based on my lecture notes taught at Cornell University, Renmin University of
China, Shandong University, Shanghai Jiao Tong University, Tsinghua University, and Xiamen
University, where the graduate students provided detailed comments on my lecture notes.
CHAPTER 1 INTRODUCTION TO
ECONOMETRICS
Abstract: Econometrics has become an integral part of training in modern economics and
business. Together with microeconomics and macroeconomics, econometrics has been taught as
one of the three core courses in most undergraduate and graduate economic programs in North
America. This chapter discusses the philosophy and methodology of econometrics in economic
research, the roles and limitations of econometrics, and the differences between econometrics and
mathematical economics as well as mathematical statistics. A variety of illustrative econometric
examples are given, which cover various fields of economics and finance.
Key Words: Data generating process, Econometrics, Probability law, Quantitative analysis,
Statistics.
1.1 Introduction
Econometrics has become an integral part of teaching and research in modern economics
and business. The importance of econometrics has been increasingly recognized over the past
several decades. In this chapter, we will discuss the philosophy and methodology of econometrics
in economic research. First, we will discuss the quantitative feature of modern economics, and the
differences between econometrics and mathematical economics as well as mathematical statistics.
Then we will focus on the important roles of econometrics as a fundamental methodology in
economic research via a variety of illustrative economic examples, including the consumption
function, marginal propensity to consume and multipliers, rational expectations models and
dynamic asset pricing, the constant return to scale and regulations, evaluation of effects of
economic reforms in a transitional economy, the efficient market hypothesis, modeling uncertainty
and volatility, and duration analysis in labor economics and finance. These examples range
from econometric analysis of the conditional mean to the conditional variance to the conditional
distribution of economic variables of interest. We will also discuss the limitations of econometrics,
due to the nonexperimental nature of economic data and the time-varying nature of econometric
structures.
Broadly speaking, the subjects taught in modern economics programs can be
roughly classified into four categories: macroeconomics, microeconomics, financial economics,
and econometrics. Of them, macroeconomics, microeconomics and econometrics now constitute
the core courses for most economic doctoral programs in North America, while financial
economics is now mainly being taught in business and management schools.
Most doctoral programs in economics in the U.S. emphasize quantitative analysis. Quantitative
analysis consists of mathematical modeling and empirical studies. To understand the roles
of quantitative analysis, it may be useful to first describe the general process of modern economic
research. Like most natural sciences, the general methodology of modern economic research can
be roughly summarized as follows:
Step 1: Data collection and summary of empirical stylized facts. The so-called stylized
facts are often summarized from observed economic data. For example, in microeconomics,
a well-known stylized fact is the Engel curve, which characterizes how the share of a
consumer's expenditure on a commodity out of her or his total income varies as that
income changes; in macroeconomics, a well-known stylized fact is the Phillips curve, which
characterizes a negative correlation between the inflation rate and the unemployment rate
in an aggregate economy; and in finance, a well-known stylized fact about financial markets
is volatility clustering, that is, a high volatility today tends to be followed by another high
volatility tomorrow, a low volatility today tends to be followed by another low volatility
tomorrow, and the two alternate over time. The empirical stylized facts often serve as a
starting point for economic research. For example, the development of unit root and
cointegration econometrics was mainly motivated by the empirical study of Nelson and
Plosser (1982), who found that most macroeconomic time series are unit root processes.
Step 3: Empirical verification of economic models. Economic theory only suggests a qualitative
economic relationship. It does not offer any concrete functional form. In the process
of transforming a mathematical model into a testable empirical econometric model, one often
has to assume some functional form, up to some unknown model parameters. One needs
to estimate the unknown model parameters based on the observed data and check whether
the econometric model is adequate. An adequate model should be at least consistent with
the empirical stylized facts.
Step 4: Applications. After an econometric model passes the empirical evaluation, it can
then be used to test economic theory or hypotheses, to forecast future evolution of the
economy, and to make policy recommendations.
For an excellent example highlighting these four steps, see Gujarati (2006, Section 1.3) on
labor force participation. We note that not every economist or every research paper has to
complete these four steps. In fact, it is not uncommon that each economist may only work on
research belonging to a certain stage in his/her entire academic lifetime.
From the general methodology of economic research, we see that modern economics has two
important features: one is mathematical modeling for economic theory, and the other is empirical
analysis of economic phenomena. These two features arise from the effort of several generations
of economists to make economics a "science". To be a science, any theory must fulfill two criteria:
one is logical consistency and coherency of the theory itself, and the other is consistency between
the theory and the stylized facts. Mathematics and econometrics serve to help fulfill these two criteria
respectively. This has been the main objective of the Econometric Society. The establishment of the
Nobel Memorial Prize in economics in 1969 may be viewed as the recognition of economics as a
science in the academic profession.
Why does economics need mathematics? Briefly speaking, mathematics plays a number of
important roles in economics. First, the mathematical language can summarize the essence of
a theory in a very concise manner. For example, macroeconomics studies relationships between
aggregate economic variables (e.g., GDP, consumption, unemployment, inflation, interest rate,
exchange rate, etc.). A very important macroeconomic theory was proposed by Keynes (1936).
The classical Keynesian theory can be summarized by two simple mathematical equations: the national
income identity Y = C + I + G, and a consumption function C = a + bY, where b is the marginal
propensity to consume. Solving these two equations yields the fiscal multiplier

    dY/dG = 1/(1 - b).

Thus, the Keynesian theory can be effectively summarized by two mathematical equations.
Second, complicated logical analysis in economics can be greatly simplified by using mathematics.
In introductory economics, economic analysis can be done by verbal descriptions or
graphical methods. These methods are very intuitive and easy to grasp. One example is the
partial equilibrium analysis where a market equilibrium can be characterized by the intersection
of the demand curve and the supply curve. However, in many cases, economic analysis cannot
be done easily by verbal language or graphical methods. One example is the general equilibrium
theory first proposed by Walras (1874). This theory addresses a fundamental problem in
economics, namely whether the market force can achieve an equilibrium for a competitive market
economy where there exist many markets and mutual interactions between
different markets. Suppose there are n goods, with demand D_i(P) and supply S_i(P) for good i,
where P = (P_1, P_2, ..., P_n)' is the price vector for the n goods. Then the general equilibrium analysis
addresses whether there exists an equilibrium price vector P such that all markets clear
simultaneously:

    D_i(P) = S_i(P) for all i in {1, ..., n}.

Conceptually simple, it is rather challenging to give a definite answer, because both the demand
and supply functions could be highly nonlinear. Indeed, Walras was unable to establish this
theory formally. It was satisfactorily solved by Arrow and Debreu many years later, when they
used the fixed point theorem in mathematics to prove the existence of an equilibrium price vector.
The power and magic of mathematics was clearly demonstrated in the development of the general
equilibrium theory.
Third, mathematical modeling is a necessary path to empirical verification of an economic
theory. Most economic and financial phenomena are in the form of data (indeed we are in a digital
era!). We need to "digitize" economic theory so as to link the economic theory to data. In
particular, one needs to formulate economic theory into a testable mathematical model whose
functional form or important structural model parameters will be estimated from observed data.
Any economy can be viewed as a stochastic process governed by some probability law.
There is no way to verify these axioms. They are the philosophical views of econometricians
toward an economy. Not every economist or even every econometrician agrees with this view. For
example, some economists view an economy as a deterministic chaotic process which can generate
seemingly random numbers. However, most economists and econometricians (e.g., Granger and
Teräsvirta 1993, Lucas 1977) hold the view that there is a great deal of uncertainty in an economy, and that
it is best described by stochastic factors rather than deterministic systems. For instance, the
multiplier-accelerator model of Samuelson (1939) is characterized by a deterministic second-order
difference equation for aggregate output. Over a certain range of parameters, this equation
produces deterministic cycles with a constant period of business cycles. Without doubt this
model sheds deep insight into macroeconomic fluctuations. Nevertheless, a stochastic framework
provides a more realistic basis for the analysis of periodicity in economics, because the observed
periods of business cycles never occur evenly in any economy. Frisch (1933) demonstrates that
a structural propagation mechanism can convert uncorrelated stochastic impulses into cyclical
outputs with uneven, stochastic periodicity. Indeed, although not all uncertainties can be well
characterized by probability theory, probability is the best quantitative analytic tool to describe
uncertainties. The probability law of this stochastic economic system, which characterizes the
evolution of the economy, can be viewed as the "law of economic motions." Accordingly, the
tools and methods of mathematical statistics will provide the operating principles.
One important implication of the fundamental axioms is that one should not hope to de-
termine precise, deterministic economic relationships, as do the models of demand, production,
and aggregate consumption in standard micro- and macro-economic textbooks. No model could
encompass the myriad essentially random aspects of economic life (i.e., no precise point forecast
is possible, using a statistical terminology). Instead, one can only postulate some stochastic
economic relationships. The purpose of econometrics is to infer the probability law of the eco-
nomic system using observed data. Economic theory usually takes a form of imposing certain
restrictions on the probability law. Thus, one can test economic theory or economic hypotheses
by checking the validity of these restrictions.
It should be emphasized that the role of mathematics is different from the role of econometrics.
The main task of mathematical economics is to express economic theory in the mathematical form
of equations (or models) without regard to measurability or empirical verification of economic
theory. Mathematics can check whether the reasoning process of an economic theory is correct
and sometimes can give surprising results and conclusions. However, it cannot check whether
an economic theory can explain reality. To check whether a theory is consistent with reality,
one needs econometrics. Econometrics is a fundamental methodology in the process of economic
analysis. Like the development of a natural science, the development of economic theory is a
process of refuting existing theories that cannot explain newly arising empirical stylized facts
and developing new theories that can explain them. Econometrics rather than mathematics
plays a crucial role in this process. There is no absolutely correct and universally applicable
economic theory. Any economic theory can only explain the reality at a certain stage, and therefore
is a "relative truth" in the sense that it is consistent with the historical data available at that time.
An economic theory may not be rejected due to limited data information. It is possible that
more than one economic theory or model coexist simultaneously, because the data do not contain
sufficient information to distinguish the true one (if any) from false ones. When new data become
available, a theory that can explain the historical data well may not explain the new data well
and thus will be refuted. In many cases, new econometric methods can lead to new discoveries
and call for new developments of economic theory.
Econometrics is not simply an application of a general theory of mathematical statistics.
Although mathematical statistics provides many of the operating tools used in econometrics,
econometrics often needs special methods because of the unique nature of economic data and
the unique nature of the economic problems at hand. One example is the generalized method of
moments estimation (Hansen 1982), which was proposed by econometricians aiming to estimate
rational expectations models which only impose certain conditional moment restrictions characterized
by the Euler equation, while the conditional distribution of the underlying economic processes is unknown
(thus, the classical maximum likelihood estimation cannot be used). The development of unit
root and cointegration econometrics (e.g., Engle and Granger 1987, Phillips 1987), which is a core of modern
time series econometrics, has been mainly motivated by Nelson and Plosser's (1982) empirical
documentation that most macroeconomic time series display unit root behaviors. Thus, it is
necessary to provide an econometric theory for unit root and cointegrated systems, because the
standard statistical inference theory is no longer applicable. The emergence of financial econometrics
is also due to the fact that financial time series display some unique features such as
persistent volatility clustering, heavy tails, infrequent but large jumps, and serially uncorrelated
but not independent asset returns. Financial applications, such as financial risk management,
hedging and derivatives pricing, often call for modeling of volatilities and the entire conditional
probability distributions of asset returns. The features of financial data and the objectives of
financial applications make the use of standard time series analysis quite limited, and therefore
call for the development of financial econometrics. Labor economics is another example which
shows how labor economics and econometrics have benefited from each other. Labor economics
has advanced quickly over the last few decades because of the availability of high-quality labor data
and rigorous empirical verification of hypotheses and theories on labor economics. On the other
hand, microeconometrics, particularly panel data econometrics, has also advanced quickly due
to the increasing availability of microeconomic data and the need to develop econometric theory
to accommodate the features of microeconomic data (e.g., censoring and endogeneity).
In the first issue of Econometrica, the founder of the Econometric Society, Frisch (1933), nicely
summarizes the objective of the Econometric Society and the main features of econometrics: "Its main
object shall be to promote studies that aim at a unification of the theoretical-quantitative and the
empirical-quantitative approach to economic problems and that are penetrated by constructive
and rigorous thinking similar to that which has come to dominate the natural sciences.
But there are several aspects of the quantitative approach to economics, and no single one of
these aspects taken by itself, should be confounded with econometrics. Thus, econometrics is by
no means the same as economic statistics. Nor is it identical with what we call general economic
theory, although a considerable portion of this theory has a definitely quantitative character.
Nor should econometrics be taken as synonymous [sic] with the application of mathematics
to economics. Experience has shown that each of these three viewpoints, that of statistics,
economic theory, and mathematics, is a necessary, but not by itself a sufficient, condition for a
real understanding of the quantitative relations in modern economic life. It is the unification of
all three that is powerful. And it is this unification that constitutes econometrics."
    dY_t/dG_t = 1/(1 - b),

which depends on the marginal propensity to consume b.
To assess the effect of fiscal policies on the economy, it is important to know the magnitude
of b. For example, suppose the Chinese government wants to maintain a steady growth rate
(e.g., an annual 8%) for its economy by active fiscal policy. It has to figure out how many
government bonds to issue each year. Insufficient government spending will jeopardize the
goal of achieving the desired growth rate, but excessive government spending will cause budget
deficits in the long run. The Chinese government has to balance these conflicting effects, and this
crucially depends on knowledge of the value of b. Economic theory can only suggest a positive
qualitative relationship between income and consumption. It never tells exactly what b should
be for a given economy. It is conceivable that b differs from country to country, because cultural
factors may have an impact on the consumption behavior of an economy. It is also conceivable that
b will depend on the stage of economic development of an economy. Fortunately, econometrics
offers a feasible way to estimate b from observed data. In fact, economic theory does not even
suggest a specific functional form for the consumption function. The linear functional form for
the consumption function is assumed for convenience, not implied by economic theory. Econometrics can
provide a consistent estimation procedure for the unknown consumption function. This is called
the nonparametric method (see, e.g., Hardle 1990, Pagan and Ullah 1999).
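As a purely numerical illustration of why the magnitude of b matters (the values of b below are hypothetical, chosen only to show the arithmetic): if b = 0.8, the fiscal multiplier is

    dY_t/dG_t = 1/(1 - b) = 1/(1 - 0.8) = 5,

so one additional unit of government spending raises equilibrium income by about five units, whereas b = 0.6 would imply a multiplier of only 2.5. A credible estimate of b from observed data is therefore essential for gauging how much government spending is needed to reach a given growth target.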
    U = sum_{t=0}^{n} b^t u(C_t) = sum_{t=0}^{n} b^t (C_t^{1-g} - 1)/(1 - g),
where b > 0 is the agent's time discount factor, g >= 0 is the risk aversion parameter, u(.) is
the agent's utility function in each time period, and C_t is consumption during period t. Let
the information available to the agent at time t be represented by the sigma-algebra I_t, in the
sense that any variable whose value is known at time t is presumed to be I_t-measurable, and let
R_t = P_t/P_{t-1} be the gross return to an asset acquired at time t-1 at a price of P_{t-1}. The agent's
optimization problem is to choose a sequence of consumptions {C_t} over time to

    max_{C_t} E(U) subject to C_t + P_t q_t <= W_t + P_t q_{t-1},

where q_t is the quantity of the asset purchased at time t and W_t is the agent's period t income.
Define the marginal rate of intertemporal substitution

    MRS_{t+1}(theta) = b [du(C_{t+1})/dC_{t+1}] / [du(C_t)/dC_t] = b (C_{t+1}/C_t)^{-g},

where the model parameter vector theta = (b, g)'. Then the first order condition of the agent's optimization
problem can be characterized by

    E[MRS_{t+1}(theta) R_{t+1} | I_t] = 1.

That is, the marginal rate of intertemporal substitution discounts gross returns to unity. This
FOC is usually called the Euler equation of the economic system (see Hansen and Singleton 1982
for more discussion).
How can one estimate this model, and how can one test the validity of a rational expectations model? Here, the
traditionally popular maximum likelihood estimation method cannot be used, because one does not
know the conditional distribution of economic variables of interest. Nevertheless, econometricians
have developed a consistent estimation method based on the conditional moment condition or
the Euler equation, which does not require knowledge of the conditional distribution of the data
generating process. This method is called the generalized method of moments (see Hansen 1982).
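To make the GMM idea concrete, the following Python sketch estimates theta = (b, g)' from the Euler equation moment conditions E{[b (C_{t+1}/C_t)^{-g} R_{t+1} - 1] z_t} = 0 using a two-step procedure. The data are simulated, and the choice of instruments z_t (a constant and lagged consumption growth and returns), the starting values, and the optimizer are illustrative assumptions, not part of any particular empirical study.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
T = 500
# Simulated data: gross consumption growth and gross asset returns (illustrative only)
cgrowth = np.exp(rng.normal(0.02, 0.02, T))      # C_{t+1}/C_t
returns = np.exp(rng.normal(0.05, 0.15, T))      # R_{t+1}

# Instruments z_t known at time t: a constant and lagged values
Z = np.column_stack([np.ones(T - 1), cgrowth[:-1], returns[:-1]])
g = cgrowth[1:]          # consumption growth dated t+1
R = returns[1:]          # gross returns dated t+1

def moments(theta):
    """Sample moment vector: average of [beta*g^(-gamma)*R - 1] * z_t."""
    beta, gamma = theta
    u = beta * g ** (-gamma) * R - 1.0   # Euler equation error
    return (Z * u[:, None]).mean(axis=0)

def gmm_objective(theta, W):
    m = moments(theta)
    return m @ W @ m

# First step: identity weighting matrix
W1 = np.eye(Z.shape[1])
step1 = minimize(gmm_objective, x0=np.array([0.95, 2.0]), args=(W1,), method="Nelder-Mead")

# Second step: weight by the inverse of the moment covariance implied by step one
beta1, gamma1 = step1.x
u1 = beta1 * g ** (-gamma1) * R - 1.0
S = (Z * u1[:, None]).T @ (Z * u1[:, None]) / len(u1)
W2 = np.linalg.inv(S)
step2 = minimize(gmm_objective, x0=step1.x, args=(W2,), method="Nelder-Mead")

print("two-step GMM estimate of (beta, gamma):", step2.x)
```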
In the empirical literature, it has been documented that the empirical estimates of the risk aversion
parameter are often too small to justify the substantial difference between the observed returns
on stock markets and bond markets (e.g., Mehra and Prescott 1985). This is the well-known
equity premium puzzle. To resolve this puzzle, effort has been devoted to the development of new
economic models with time-varying, large risk aversion. An example is Campbell and Cochrane's
(1999) consumption-based capital asset pricing model. This story confirms our earlier statement
that econometric analysis calls for new economic theory after documenting the inadequacy of the
existing model.
Example 3: The Production Function and the Hypothesis on Constant Return
to Scale
Suppose that for some industry, there are two inputs, labor L_i and capital stock K_i, and one
output Y_i, where i is the index for firm i. The production function of firm i is a mapping from
inputs (L_i, K_i) to output Y_i:

    Y_i = exp(e_i) F(L_i, K_i),

where e_i is a stochastic factor (e.g., the uncertain weather condition if Y_i is an agricultural product).
An important economic hypothesis is that the production technology displays constant
return to scale (CRS), which is defined as follows: F(cL_i, cK_i) = c F(L_i, K_i) for all c > 0.
CRS is a necessary condition for the existence of a long-run equilibrium of a competitive market
economy. If CRS does not hold for some industry, and the technology instead displays increasing
return to scale (IRS), the industry will tend toward natural monopoly. Government regulation is then
necessary to protect consumers' welfare. Therefore, testing CRS versus IRS has an important policy
implication, namely whether regulation is necessary.
A conventional approach to testing CRS is to assume that the production function is a Cobb-Douglas
function:

    F(L_i, K_i) = A L_i^a K_i^b,

so that Y_i = A exp(e_i) L_i^a K_i^b. Under this specification, CRS is equivalent to the hypothesis

    H_0: a + b = 1.
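As an illustration of how the CRS hypothesis can be tested in practice, the Python sketch below simulates data from the log-linear Cobb-Douglas regression ln Y_i = c + a ln L_i + b ln K_i + e_i (the numerical parameter values are arbitrary), estimates (a, b) by OLS, and forms a t-statistic for H_0: a + b = 1. The homoskedastic variance formula used here is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
logL = rng.normal(4.0, 0.5, n)                 # log labor input
logK = rng.normal(5.0, 0.7, n)                 # log capital stock
eps = rng.normal(0.0, 0.1, n)
logY = 0.3 + 0.6 * logL + 0.4 * logK + eps     # simulated log output (CRS holds: 0.6 + 0.4 = 1)

# OLS on log Y = c + a*log L + b*log K + e
X = np.column_stack([np.ones(n), logL, logK])
XtX_inv = np.linalg.inv(X.T @ X)
coef = XtX_inv @ X.T @ logY
resid = logY - X @ coef
s2 = resid @ resid / (n - X.shape[1])
V = s2 * XtX_inv                               # homoskedastic covariance estimate

# Test H0: a + b = 1 via the linear restriction R*coef = 1 with R = [0, 1, 1]
R = np.array([0.0, 1.0, 1.0])
t_stat = (R @ coef - 1.0) / np.sqrt(R @ V @ R)
print("a-hat + b-hat =", coef[1] + coef[2], " t-statistic for CRS:", t_stat)
```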
To evaluate the effects of economic reforms, one may consider an extended production function model in which
the reform variables enter additively, where i is the index for firm i in {1, ..., N}, t is the index for year
t in {1, ..., T}, Bonus_it is the proportion of bonus out of the total wage bill, and Contract_it is the
proportion of workers who have signed a fixed-term contract. This is an example of the so-called panel data
model (see, e.g., Hsiao 2003).
Paying bonuses and signing fixed-term contracts were two innovative incentive reforms in the
Chinese state-owned enterprises in the 1980s, compared to the fixed wage and life-time employment
systems of the pre-reform era. Economic theory predicts that the introduction of the bonus
and contract systems provides stronger incentives for workers to work harder, thus increasing
the productivity of a firm (see Groves, Hong, McMillan and Naughton 1994).
To examine the effects of these incentive reforms, we consider the null statistical hypothesis

    H_0: the coefficients of Bonus_it and Contract_it are jointly zero.

It appears that conventional t-tests or F-tests would serve our purpose here, if we can assume
conditional homoskedasticity. Unfortunately, they cannot be used, because there may well exist
a reverse causation from Y_it to Bonus_it: a productive firm may pay its workers higher
bonuses regardless of their efforts. This will cause correlation between the bonuses and the error
term u_it, rendering the OLS estimator inconsistent and invalidating the conventional t-tests or
F-tests. Fortunately, econometricians have developed an important estimation procedure called
instrumental variables estimation, which can effectively filter out the impact of the causation
from output to bonus and obtain a consistent estimator for the bonus parameter. Related
hypothesis test procedures can be used to check whether the bonus and contract reforms can increase
firm productivity.
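The following Python sketch illustrates, with simulated data, why OLS fails under this kind of reverse causation and how an instrumental variables (two-stage least squares) estimator corrects it. The unobserved "ability" term, the instrument z, and all parameter values are hypothetical constructs for illustration; they are not taken from the Chinese enterprise data discussed above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
# Simulated endogeneity: unobserved firm productivity drives both output and bonus.
ability = rng.normal(0, 1, n)                          # unobserved productivity shock
z = rng.normal(0, 1, n)                                # instrument: moves bonus, unrelated to the error
bonus = 0.5 * z + 0.8 * ability + rng.normal(0, 1, n)
y = 1.0 + 0.0 * bonus + ability + rng.normal(0, 1, n)  # true bonus effect is zero

X = np.column_stack([np.ones(n), bonus])
Z = np.column_stack([np.ones(n), z])

# OLS (biased because bonus is correlated with the error through 'ability')
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# 2SLS: project X on Z, then use the fitted values as regressors
PzX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b_2sls = np.linalg.solve(PzX.T @ X, PzX.T @ y)

print("OLS bonus coefficient :", b_ols[1])    # spuriously positive
print("2SLS bonus coefficient:", b_2sls[1])   # close to the true value of zero
```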
In evaluating the effect of economic reforms, we have turned an economic hypothesis, that
introducing bonus and contract systems has no effect on productivity, into a statistical hypothesis
that the corresponding coefficients are jointly zero. When this hypothesis H_0 is not rejected, we should
not conclude that the reforms have no effect. This is because the extended production function
model, where the reforms are specified additively, is only one of many ways to check the effect
of the reforms. For example, one could also specify the model such that the reforms affect the
marginal productivities of labor and capital (i.e., the coefficients of labor and capital). Thus,
when the hypothesis H_0 is not rejected, we can only say that we do not find evidence
against the economic hypothesis that the reforms have no effect. We should not conclude that
the reforms have no effect.
The LHS, the so-called conditional mean of Y_t given I_{t-1}, is the expected return that can be
obtained when one fully uses the information available at time t-1. The RHS, the unconditional
mean of Y_t, is the expected market average return in the long run; it is the expected return
of a buy-and-hold trading strategy. When EMH holds, the past information of stock returns has
no predictive power for future stock returns. An important implication of EMH is that mutual
fund managers will have no informational advantage over layman investors.
One simple way to test EMH is to consider the following autoregression:

    Y_t = a_0 + sum_{j=1}^{p} a_j Y_{t-j} + e_t,

where p is a pre-selected number of lags and e_t is a random disturbance. EMH implies

    H_0: a_1 = a_2 = ... = a_p = 0.

Any nonzero coefficient a_j, 1 <= j <= p, is evidence against EMH. Thus, to test EMH, one can test
whether the a_j are jointly zero. The classical F-test in a linear regression model can be used to
test the hypothesis H_0 when var(e_t | I_{t-1}) = s^2, i.e., when there exists conditional homoskedasticity.
However, EMH may coexist with volatility clustering (i.e., var(e_t | I_{t-1}) may be time-varying),
which is one of the most important empirical stylized facts of financial markets (see Chen and
Hong (2003) for more discussion). This implies that the standard F-test statistic cannot be
used here, even asymptotically. Similarly, the popular Box and Pierce (1970) portmanteau Q
test, which is based on the sum of the first p squared sample autocorrelations, also cannot be
used, because its asymptotic chi-square distribution is invalid in the presence of autoregressive conditional
heteroskedasticity. One has to use procedures that are robust to conditional heteroskedasticity.
As in the discussion in Subsection 5.4, when one rejects the null hypothesis H_0 that the a_j are
jointly zero, we have evidence against EMH, and furthermore the linear AR(p) model has predictive
ability for asset returns. However, when one fails to reject the hypothesis H_0 that the a_j are
jointly zero, one can only conclude that we do not find evidence against EMH. One cannot
conclude that EMH holds. The reason is, again, that the linear AR(p) model is only one of many
possibilities to check EMH (see, e.g., Hong and Lee 2005, for more discussion).
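A minimal sketch of such a robust procedure is given below: it runs the AR(p) regression and tests H_0: a_1 = ... = a_p = 0 with a Wald statistic based on a White-type heteroskedasticity-robust covariance matrix. The data generating process used for illustration (returns that are serially uncorrelated but have ARCH-type volatility) and the lag length p = 5 are assumptions made for the example only.

```python
import numpy as np
from scipy import stats

def robust_ar_test(y, p):
    """Wald test of H0: a_1 = ... = a_p = 0 in an AR(p) regression,
    using a White-type heteroskedasticity-robust covariance matrix."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    X = np.column_stack([np.ones(T - p)] + [y[p - j - 1:T - j - 1] for j in range(p)])
    yy = y[p:]
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ yy
    u = yy - X @ b
    meat = (X * (u ** 2)[:, None]).T @ X
    V = XtX_inv @ meat @ XtX_inv              # robust covariance of b
    R = np.eye(p + 1)[1:]                     # select the p lag coefficients
    W = (R @ b) @ np.linalg.inv(R @ V @ R.T) @ (R @ b)
    pval = 1 - stats.chi2.cdf(W, df=p)
    return W, pval

# Example: returns that are serially uncorrelated but conditionally heteroskedastic
rng = np.random.default_rng(3)
z = rng.normal(size=2000)
sigma2 = np.empty(2000); sigma2[0] = 1.0
ret = np.empty(2000); ret[0] = z[0]
for t in range(1, 2000):
    sigma2[t] = 0.2 + 0.7 * ret[t - 1] ** 2   # ARCH(1)-type volatility
    ret[t] = np.sqrt(sigma2[t]) * z[t]
print(robust_ar_test(ret, p=5))
```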
    s_t^2 = var(Y_t | I_{t-1}) = E{[Y_t - E(Y_t | I_{t-1})]^2 | I_{t-1}}.
An example of the conditional variance is the AutoRegressive Conditional Heteroskedasticity
(ARCH) model, originally proposed by Engle (1982). An ARCH(q) model assumes that
    Y_t = m_t + e_t,
    e_t = s_t z_t,
    m_t = E(Y_t | I_{t-1}),
    s_t^2 = a_0 + sum_{j=1}^{q} a_j e_{t-j}^2,   a_0 > 0, a_j >= 0,
    {z_t} ~ i.i.d.(0, 1).
This model can explain a well-known stylized fact in financial markets, volatility clustering: a
high volatility tends to be followed by another high volatility, and a small volatility tends to
be followed by another small volatility. It can also explain the non-Gaussian heavy tails of asset
returns. More sophisticated volatility models, such as Bollerslev's (1986) Generalized ARCH or
GARCH model, have been developed in time series econometrics.
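The sketch below simulates an ARCH(q) process exactly as specified above and verifies two of the stylized facts it is meant to capture: the squared innovations are positively autocorrelated (volatility clustering) while the innovations themselves are approximately serially uncorrelated, and the unconditional distribution has heavier tails than the normal. The parameter values are arbitrary choices for illustration.

```python
import numpy as np

def simulate_arch(T, alpha0, alphas, mu=0.0, seed=0):
    """Simulate an ARCH(q) process: y_t = mu + eps_t, eps_t = sigma_t * z_t,
    sigma_t^2 = alpha0 + sum_j alphas[j] * eps_{t-j}^2, with z_t i.i.d. N(0, 1)."""
    rng = np.random.default_rng(seed)
    q = len(alphas)
    eps = np.zeros(T + q)
    sigma2 = np.full(T + q, alpha0 / max(1e-12, 1 - sum(alphas)))  # start near unconditional variance
    z = rng.standard_normal(T + q)
    for t in range(q, T + q):
        sigma2[t] = alpha0 + sum(a * eps[t - j - 1] ** 2 for j, a in enumerate(alphas))
        eps[t] = np.sqrt(sigma2[t]) * z[t]
    return mu + eps[q:], sigma2[q:]

y, s2 = simulate_arch(T=5000, alpha0=0.2, alphas=[0.5])
eps = y - y.mean()
# Volatility clustering: squared innovations are positively autocorrelated ...
acf1_sq = np.corrcoef(eps[1:] ** 2, eps[:-1] ** 2)[0, 1]
# ... while the innovations themselves are (approximately) serially uncorrelated.
acf1 = np.corrcoef(eps[1:], eps[:-1])[0, 1]
kurtosis = np.mean(eps ** 4) / np.mean(eps ** 2) ** 2   # > 3 indicates heavy tails
print(f"corr(eps_t^2, eps_(t-1)^2) = {acf1_sq:.3f}, corr(eps_t, eps_(t-1)) = {acf1:.3f}, "
      f"kurtosis = {kurtosis:.2f}")
```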
In practice, an important issue is how to estimate a volatility model. Here, the models for
the conditional mean m_t and the conditional variance s_t^2 are assumed to be correctly specified,
but the conditional distribution of Y_t is unknown, because the distribution of the standardized
innovation {z_t} is unknown. Thus, the popular maximum likelihood estimation (MLE) method
cannot be used. Nevertheless, one can assume that {z_t} is i.i.d. N(0,1) or follows some other plausible
distribution. Under this assumption, we can obtain a conditional distribution of Y_t given I_{t-1}
and estimate the model parameters using the MLE procedure. Although {z_t} is not necessarily
i.i.d. N(0,1), so that the assumed likelihood may be misspecified, the estimator obtained this way is still
consistent for the true model parameters. However, the asymptotic variance of this estimator is larger than that of
the MLE (i.e., when the true distribution of {z_t} is known), due to the effect of not knowing
the true distribution of {z_t}. This method is called the quasi-MLE, or QMLE (see, e.g., White
1994). Inference procedures based on the QMLE are different from those based on the MLE.
For example, the popular likelihood ratio test cannot be used. The difference comes from the
fact that the asymptotic variance of the QMLE is different from that of the MLE, just like the
fact that the asymptotic variance of the OLS estimator under conditional heteroskedasticity is
different from that of the OLS estimator under conditional homoskedasticity. Incorrect calculation of the
asymptotic variance estimator for the QMLE will lead to misleading inferences and conclusions
(see White 1982, 1994 for more discussion).
given that it has not finished yet. The so-called hazard rate measures the chance that the
duration will end now, given that it has not ended before. This hazard rate can therefore be
interpreted as the chance to find a job, to trade, to end a strike, etc.
Suppose T_i is a duration from a population with probability density function f(t) and
probability distribution function F(t). Then the survival function is S(t) = 1 - F(t), and the hazard rate is
h(t) = f(t)/S(t). Suppose the hazard rate of individual i with characteristics X_i takes the form

    h_i(t) = h_0(t) exp(X_i' b).

This is called the proportional hazard model, originally proposed by Cox (1972). The parameter

    b = d ln h_i(t)/dX_i = [1/h_i(t)] dh_i(t)/dX_i

can be interpreted as the marginal relative effect of X_i on the hazard rate of individual i. Inference
on b will allow one to examine how individual characteristics affect the duration of interest. For
example, suppose T_i is the unemployment duration for individual i; then inference on b will
allow us to examine how individual characteristics, such as age, education and gender,
can affect the unemployment duration. This will provide important policy implications for labor
markets.
Because one can obtain the conditional probability density function of Y_i given X_i,

    f_i(t) = h_i(t) S_i(t),

where the survival function S_i(t) = exp[-integral_0^t h_i(s) ds], we can estimate b by the maximum likelihood
estimation method.
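As a minimal illustration of this likelihood-based approach, the Python sketch below assumes a constant baseline hazard h_0(t) = h_0 (so each duration is exponentially distributed with rate h_0 exp(X_i b)), simulates durations, and recovers (h_0, b) by maximizing the log-likelihood sum_i [ln h_i - h_i T_i]. The single scalar covariate and all parameter values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 2000
x = rng.normal(size=n)                       # individual characteristic (e.g., years of schooling)
beta_true, lam0 = 0.5, 0.2
rate = lam0 * np.exp(beta_true * x)          # hazard of individual i: lam0 * exp(x_i * beta)
t_dur = rng.exponential(1.0 / rate)          # durations implied by a constant baseline hazard

def neg_loglik(par):
    """Negative log-likelihood: f_i(t) = lambda_i * exp(-lambda_i * t) under a constant baseline."""
    log_lam0, beta = par
    lam = np.exp(log_lam0 + beta * x)
    return -(np.log(lam) - lam * t_dur).sum()

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="BFGS")
print("MLE of (lambda0, beta):", np.exp(res.x[0]), res.x[1])
```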
For an excellent survey on duration analysis in labor economics, see Kiefer (1988), and for
a complete and detailed account, see Lancaster (1990). Duration analysis has also been widely
used in credit risk modeling in the recent financial literature.
The above examples, although not exhaustive, illustrate how econometric models and tools
can be used in economic analysis. As noted earlier, an economy can be completely characterized
by the probability law governing the economy. In practice, which attributes (e.g., conditional
moments) of the probability law should be used depends on the nature of the economic problem
at hand. In other words, different economic problems will require modeling different attributes of
the probability law and thus require different econometric models and methods. In particular, it
is not necessary to specify a model for the entire conditional distribution function for all economic
applications. This can be seen clearly from the above examples.
annual Chinese GDP growth rate {Y_t} over the past several years:
GDP growth rates in different years should be viewed as different random variables, and each variable
Y_t only has one realization! There is no way to conduct statistical analysis if one random
variable only has a single realization. As noted earlier, statistical analysis studies the "average"
behavior of a large number of realizations from the same data generating process. To conduct
statistical analysis of economic data, economists and econometricians often assume some time-invariant
"common features" of an economic system so as to use time series data or cross-sectional
data on different economic variables. These common features are usually termed
"stationarity" or "homogeneity" of the economic system. With these assumptions, one can
consider that the observed data are generated from the same population or from populations with
similar characteristics. Economists and econometricians assume that the conditions needed to employ
the tools of statistical inference hold, but this is rather difficult, if not impossible, to check in
practice.
Third, economic relationships are often changing over time for an economy. Regime shifts and
structural changes are a rule rather than an exception, due to technology shocks and changes in
preferences, population structure and institutional arrangements. An unstable economic relationship
makes out-of-sample forecasting and policy-making difficult. With a structural break,
an economic model that was performing well in the past may not forecast well in the future.
Over the past several decades, econometricians have made some progress in coping with the time-varying
features of an economic system. Chow's (1960) test, for example, can be used to check
whether there exist structural breaks. Engle's (1982) volatility model can be used to forecast
time-varying volatility using historical asset returns. Nevertheless, the time-varying feature of
an economic system always imposes a challenge for economic forecasting. This is quite different
from the natural sciences, where the structure and relationships are more or less stable over time.
Fourth, data quality. The success of any econometric study hinges on the quantity as well as
the quality of data. However, economic data may be subject to various defects. The data may be
badly measured or may correspond only vaguely to the economic variables defined in the model.
Some of the economic variables may be inherently unmeasurable, and some relevant variables
may be missing from the model. Moreover, sample selection bias will also cause a problem. In
China, there may have been a tendency to over-report or over-estimate GDP growth rates given
the existing institutional promotion mechanism for local government officials. Of course, the
advances in computer technology and the development of statistical sampling theory and practice
can help improve the quality of economic data. For example, the use of scanning machines makes
data on every transaction available.
The above features of economic data and economic systems together unavoidably impose
some limitations on econometrics, preventing it from achieving the same mature stage as the natural sciences.
1.7 Conclusion
In this chapter, we have discussed the philosophy and methodology of econometrics in economic
research, and the differences between econometrics and mathematical economics and mathematical
statistics. I first discussed the two most important features of modern economics, namely
mathematical modeling and empirical analysis. These features arise from the effort of several generations
of economists to make economics a science. As the methodology for empirical analysis in economics,
econometrics is an interdisciplinary field. It uses the insights from economic theory, uses
statistics to develop methods, and uses computers to estimate models. We then discussed the
roles of econometrics and its differences from mathematics, via a variety of illustrative examples
in economics and finance. Finally, we pointed out some limitations of econometric analysis, due
to the fact that an economy is not a controlled experiment. It should be emphasized that these
limitations are not only the limitations of econometrics, but of economics as a whole.
EXERCISES
1.1. Discuss the differences between the roles of mathematics and econometrics in economic research.
1.2. What are the fundamental axioms of econometrics? Discuss their roles and implications.
1.3. What are the limitations of econometric analysis? Discuss possible ways to alleviate the
impact of these limits.
1.4. How do you perceive the roles of econometrics in decision-making in economics and business?
CHAPTER 2 GENERAL REGRESSION
ANALYSIS
Abstract: This chapter introduces regression analysis, the most popular statistical tool for exploring
the dependence of one variable (say Y) on others (say X). The variable Y is called the
dependent variable, and X is called the independent variable or explanatory variable. The regression
relationship between X and Y can be used to study the effect of X on Y or to predict
Y using X. We motivate the importance of the regression function from both the economic
and statistical perspectives, and characterize the condition for correct specification of a linear
model for the regression function, which is shown to be crucial for a valid economic interpretation
of model parameters.
We assume that Z = (Y, X')' is a random vector with E(Y^2) < infinity, where Y is a scalar, X
is a (k+1) x 1 vector of economic variables with its first component being a constant, and X'
denotes the transpose of X. Given this assumption, the conditional mean E(Y|X) exists and is
well-defined.
Statistically speaking, the relationship between two random variables or vectors X (e.g., oil
price change) and Y (e.g., economic growth) can be characterized by their joint distribution
function. Suppose (X', Y)' is a continuous random vector, and the joint probability density
function (pdf) of (X', Y)' is f(x, y). Then the marginal pdf of X is

    f_X(x) = integral f(x, y) dy,

and the conditional pdf of Y given X = x is

    f_{Y|X}(y|x) = f(x, y) / f_X(x),

provided f_X(x) > 0. The conditional pdf f_{Y|X}(y|x) completely describes how Y depends on X.
In other words, it characterizes a predictive relationship of Y using X. With this conditional pdf
f_{Y|X}(y|x), we can compute the following quantities:
The conditional mean E(Y|X = x), the conditional variance var(Y|X = x), and the conditional quantile Q(x, a) of Y given X = x, among other quantities.
Note that when a = 0.5, Q(x, 0.5) is the conditional median, which is the cutoff point or
threshold that divides the population into two equal halves, conditional on X = x.
and conditional variance. There is no need to model the entire conditional distribution of Y given
X when only certain conditional moments are needed. For example, when the conditional mean
is of concern, there is no need to model the conditional variance or impose restrictive conditions
on it.
The conditional moments, and more generally the conditional probability distribution of Y
given X, do not represent a causal relationship from X to Y. They represent a predictive relationship. That
is, one can use the information on X to predict the distribution of Y or its attributes. These
probability concepts cannot tell whether a change in Y is caused by a change in X. Such a
causal interpretation has to rely on economic theory. Economic theory usually hypothesizes
that a change in Y is caused by a change in X, i.e., there exists a causal relationship from X to
Y. If such an economic causal relationship exists, we will find a predictive relationship from X
to Y. On the other hand, a documented predictive relationship from X to Y may not be generated
by an economic causal relationship from X to Y. For example, it is possible that both X and
Y are positively correlated due to their dependence on a common factor. As a result, we will
find a predictive relationship from X to Y, although they do not have any causal relationship.
In fact, it is well known in econometrics that some economic variables that trend consistently
upwards over time are highly correlated even in the absence of any causal relationship between
them. Such strong correlations are called spurious relationships.
Definition 2.1 [Regression Function]: The conditional mean E(Y|X) is called a regression
function of Y on X.
Many economic theories can be characterized by the conditional mean E(Y|X) of Y given
X, provided X and Y are suitably defined. Most, though not all, dynamic economic theories
and/or dynamic optimization models, such as rational expectations, the efficient markets hypothesis,
the expectations hypothesis, and optimal dynamic asset pricing, have important implications on (and
only on) the conditional mean of the underlying economic variables given the information available to
economic agents (e.g., Cochrane 2001, Sargent and Ljungqvist 2002). For example, the classical
efficient market hypothesis states that the expected asset return given the information available
is zero, or at most, is constant over time; the optimal dynamic asset pricing theory implies
that the expectation of the pricing error given the information available is zero for each asset
(Cochrane 2001). Although economic theory may suggest a nonlinear relationship, it does not
give a completely specified functional form for the conditional mean of economic variables. It is
therefore important to model the conditional mean properly.
Before modeling E(Y|X), we first discuss some probabilistic properties of E(Y|X).
Proof: The result follows immediately from applying the law of iterated expectations below.
Lemma 2.2 [Law of Iterated Expectations (LIE)]: For any measurable function G(X, Y) with E|G(X, Y)| < infinity,

    E[G(X, Y)] = E{E[G(X, Y)|X]}.
Proof: We consider the case of a continuous distribution of (Y, X')' only. By the multiplication
rule that the joint pdf f(x, y) = f_{Y|X}(y|x) f_X(x), we have

    E[G(X, Y)] = integral integral G(x, y) f_{XY}(x, y) dx dy
               = integral integral G(x, y) f_{Y|X}(y|x) f_X(x) dx dy
               = integral [ integral G(x, y) f_{Y|X}(y|x) dy ] f_X(x) dx
               = integral E[G(X, Y)|X = x] f_X(x) dx
               = E{E[G(X, Y)|X]},

where the operator E(.|X) is the expectation with respect to f_{Y|X}(.|X), and the operator E(.)
is the expectation with respect to f_X(.). This completes the proof.
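The law of iterated expectations can also be checked numerically. The short simulation below (with an arbitrarily chosen joint distribution in which E(Y|X) = 2X and G(X, Y) = XY) verifies that the two sides of the identity agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 1_000_000
X = rng.exponential(1.0, N)                 # any distribution for X will do
Y = 2.0 * X + rng.normal(0.0, 1.0, N)       # so that E(Y | X) = 2X

G = X * Y                                    # an arbitrary measurable function G(X, Y)
lhs = G.mean()                               # E[G(X, Y)]
rhs = (X * 2.0 * X).mean()                   # E{E[G(X,Y)|X]} = E[X * E(Y|X)] = E[2 X^2]
print(f"E[G(X,Y)] = {lhs:.3f}   E{{E[G(X,Y)|X]}} = {rhs:.3f}")
```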
and the overall average wage is

    E(Y) = E(Y|X = 1) P(X = 1) + E(Y|X = 0) P(X = 0),

where P(X = 1) is the proportion of female employees in the labor force, and P(X = 0) is the
proportion of male employees in the labor force. The use of LIE here thus provides some
insight into the income distribution between genders.
Example 2: Suppose Y is an asset return and we have two information sets, X and X~, where
X is contained in X~, so that all information in X is also in X~ but X~ contains some extra information. Then
a conditional version of the law of iterated expectations says that

    E(Y|X) = E[E(Y|X~)|X],

or equivalently

    E{[Y - E(Y|X~)]|X} = 0,

where Y - E(Y|X~) is the prediction error made when using the superior information set X~. The conditional
LIE says that one cannot use the limited information X to predict the prediction error one would
make if one had the superior information X~. See Campbell, Lo and MacKinlay (1997, p.23) for more
discussion.
Suppose we are interested in predicting Y using some function g(X) of X, and we use the
so-called mean squared error (MSE) criterion to evaluate how well g(X) approximates Y. Then
the optimal predictor under the MSE criterion is the conditional mean, as will be shown below.
We first define the MSE criterion. Intuitively, MSE is the average of the squared deviations
between the predictor g(X) and the actual outcome Y.
Definition 2.2 [MSE]: Suppose function g(X) is used to predict Y. Then the mean squared
error of the predictor g(X) is defined as

    MSE(g) = E[Y - g(X)]^2.
The theorem below states that E(Y|X) minimizes the MSE.
Theorem 2.3 [Optimality of E(Y|X)]: The regression function E(Y|X) is the solution to the
optimization problem

    min_g E[Y - g(X)]^2,

where g ranges over all measurable functions with E[g(X)^2] < infinity.
Proof: We will use the variance and squared-bias decomposition technique. Put e = Y - E(Y|X).
Then

    MSE(g) = E[e + (E(Y|X) - g(X))]^2
           = E(e^2) + 2E{e [E(Y|X) - g(X)]} + E[E(Y|X) - g(X)]^2
           = E(e^2) + E[E(Y|X) - g(X)]^2
           >= E(e^2),

where the cross term vanishes by the law of iterated expectations, because E(e|X) = E[Y - E(Y|X)|X] = 0.
The lower bound E(e^2) is attained when g(X) = E(Y|X). This completes the proof.
Remarks:
MSE is a popular criterion for measuring the precision of a predictor g(X) for Y. It has at least
two advantages: first, it can be analyzed conveniently, and second, it has a nice decomposition
into a variance component and a squared-bias component.
However, MSE is only one of many possible criteria for measuring the goodness of the predictor g(X)
for Y. In general, any increasing function of the absolute value |Y - g(X)| can be used to measure
the goodness of fit of the predictor g(X). One example is the mean absolute error criterion, MAE(g) = E|Y - g(X)|;
the optimal predictor under this criterion is the conditional median m(x) of Y given X = x.
In other words, m(x) divides the conditional population into two equal halves.
Example 3: Let the joint pdf be f_{XY}(x, y) = e^{-y} for 0 < x < y < infinity. Find E(Y|X) and var(Y|X).

Solution: We first find the conditional pdf f_{Y|X}(y|x). The marginal pdf of X is

    f_X(x) = integral f_{XY}(x, y) dy = integral_x^infinity e^{-y} dy = e^{-x}   for 0 < x < infinity.

Therefore,

    f_{Y|X}(y|x) = f_{XY}(x, y) / f_X(x) = e^{-(y-x)}   for 0 < x < y < infinity.
Then

    E(Y|x) = integral y f_{Y|X}(y|x) dy
           = integral_x^infinity y e^{-(y-x)} dy
           = e^x integral_x^infinity y e^{-y} dy
           = 1 + x,

using integration by parts. Because

    E(Y^2|x) = integral y^2 f_{Y|X}(y|x) dy
             = integral_x^infinity y^2 e^{-(y-x)} dy
             = e^x integral_x^infinity y^2 e^{-y} dy
             = x^2 + 2 integral_x^infinity y e^{-(y-x)} dy
             = x^2 + 2(1 + x),
we have

    var(Y|X) = E(Y^2|X) - [E(Y|X)]^2 = X^2 + 2(1 + X) - (1 + X)^2 = 1.

The conditional variance of Y given X does not depend on X. That is, X has no effect on the
conditional variance of Y.
The above example shows that while the conditional mean of Y given X is a linear function
of X, the conditional variance of Y given X may not depend on X. This is essentially the assumption
made in the classical linear regression model (see Chapter 3). Another example in which we
have a linear regression function with constant conditional variance is when X and Y are jointly
normally distributed (see Exercise 2 at the end of this chapter).
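The analytical results of Example 3 can also be verified by simulation: since f_X(x) = e^{-x} and f_{Y|X}(y|x) = e^{-(y-x)}, one can draw X from an exponential distribution and set Y equal to X plus an independent exponential variable. The sketch below checks that, within narrow bins of X, the conditional mean of Y is close to 1 + X and the conditional variance is close to 1.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 2_000_000
X = rng.exponential(1.0, N)              # marginal pdf f_X(x) = e^{-x}
Y = X + rng.exponential(1.0, N)          # conditional pdf f_{Y|X}(y|x) = e^{-(y-x)}, y > x

# Check E(Y|X) = 1 + X and var(Y|X) = 1 on a few narrow bins of X
for lo in (0.5, 1.0, 2.0):
    mask = (X > lo) & (X < lo + 0.05)
    print(f"X ~ {lo:.2f}: mean(Y) = {Y[mask].mean():.3f} (theory {1 + lo + 0.025:.3f}), "
          f"var(Y) = {Y[mask].var():.3f} (theory 1)")
```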
Theorem 2.4 [Regression Identity]: Suppose E(Y|X) exists. Then we can always write

    Y = E(Y|X) + e,

where e is called the regression disturbance and has the property that

    E(e|X) = 0.
Remarks:
The regression function E(Y|X) can be used to predict the expected value of Y using the
information on X. In regression analysis, an important issue is the direction of causation between
Y and X. In practice, one often hopes to check whether Y "depends" on or can be "explained" by
X, with the help of economic theory. For this reason, Y is called the dependent variable, and X is
called the explanatory variable or vector. However, it should be emphasized that the regression
function E(Y|X) itself does not tell us anything about a causal relationship between Y and X.
The random variable e represents the part of Y that is not captured by E(Y|X). It is usually
called a noise or a disturbance, because it "disturbs" an otherwise stable relationship between
Y and X. On the other hand, the regression function E(Y|X) is called a signal.
The property that E(e|X) = 0 implies that the regression disturbance e contains no systematic
information of X that can be used to predict the expected value of Y. In other words, all
information of X that can be used to predict the expectation of Y has been completely summarized
by E(Y|X). The condition E(e|X) = 0 is crucial for the validity of the economic interpretation
of model parameters, as will be seen shortly.
E("jX) = 0 implies that the unconditional mean of " is zero:
E(") = E[E("jX)] = 0
E(X") = E [E(X"jX)]
= E [XE("jX)]
= E(X 0)
= 0:
Since E(") = 0; we have E(X") = cov(X; "): Thus, orthogonality (E(X") = 0) means that
X and " are uncorrelated.
In fact, " is orthogonal to any measurable function of X; i.e., E["h(X)] = 0 for any measurable
function h( ): This implies that we cannot predict the mean of " by using any possible model
h(X); no matter it is linear or nonlinear.
Example 4: Suppose

    e = sqrt(a_0 + a_1 X^2) v,

where the random variables X and v are independent, E(v) = 0, and var(v) = 1. Find E(e|X) and
var(e|X).

Solution:

    E(e|X) = E[sqrt(a_0 + a_1 X^2) v | X]
           = sqrt(a_0 + a_1 X^2) E(v|X)
           = sqrt(a_0 + a_1 X^2) E(v)
           = sqrt(a_0 + a_1 X^2) * 0
           = 0.

Next,

    var(e|X) = E(e^2|X) = (a_0 + a_1 X^2) E(v^2|X) = (a_0 + a_1 X^2) var(v) = a_0 + a_1 X^2.

Although the conditional mean of e given X is identically zero, the conditional variance of e given
X depends on X.
Regression analysis (conditional mean analysis) is the most popular statistical method in
econometrics. It has been applied widely in economics, as the following examples illustrate.
Example 5: Let Y = consumption and X = disposable income. Then the regression function E(Y|X) =
C(X) is the so-called consumption function, and the marginal propensity to consume (MPC) is
the derivative

    MPC = C'(X) = (d/dX) E(Y|X).
MPC is an important concept in the "multiplier effect" analysis. The magnitude of MPC
is important in macroeconomic policy analysis and forecasting. On the other hand, when Y is
consumption on food only, Engel's law implies that MPC must be a decreasing function of
X. Therefore, we can test Engel's law by testing whether C'(X) = (d/dX) E(Y|X) is a decreasing
function of X.
Example 6: Let Y = output and X = (labor, capital, raw material)'. Then the regression function E(Y|X) = F(X)
is the so-called production function. This can be used to test the hypothesis of constant return
to scale (CRS), which is defined as F(cX) = c F(X) for all c > 0.
Example 7: Let Y be the cost of producing a certain output X. Then the regression function
E(Y|X) = C(X) is the cost function. For a monopoly firm or industry, the marginal cost must
be declining in output X. That is,

    (d/dX) E(Y|X) = C'(X) > 0,
    (d^2/dX^2) E(Y|X) = C''(X) < 0.

These imply that the cost function of a monopoly is a nonlinear function of X.
Generally speaking, given that E(Y|X) depends on X, it is conceivable that var(Y|X) and other
higher order conditional moments may also depend on X. In fact, conditional heteroskedasticity
may arise from different sources. For example, a larger firm may have a larger output variation.
Granger and Machina (2006) explain why economic variables may display volatility clustering
from an econometric structural perspective.
The following example shows that conditional heteroskedasticity may arise due to random
coefficients in a data generating process.
    Y = b_0 + (b_1 + b_2 v) X + v,

where X and v are independent, E(v) = 0, and var(v) = s^2. Find the conditional mean E(Y|X)
and the conditional variance var(Y|X).
Solution: (i) Because X and v are independent,

    E(Y|X) = b_0 + b_1 X + b_2 X E(v|X) + E(v|X) = b_0 + b_1 X.

(ii)

    var(Y|X) = E{[Y - E(Y|X)]^2 | X}
             = E[(b_2 X v + v)^2 | X]
             = E[(b_2 X + 1)^2 v^2 | X]
             = (1 + b_2 X)^2 E(v^2|X)
             = (1 + b_2 X)^2 E(v^2)
             = (1 + b_2 X)^2 s^2.
The random coefficient process has been used to explain why the conditional variance may
depend on the regressor X. We can write this process as

    Y = b_0 + b_1 X + e,

where

    e = (1 + b_2 X) v.

Note that E(e|X) = 0 but var(e|X) = (1 + b_2 X)^2 s^2.
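The conditional heteroskedasticity implied by the random coefficient process can be seen directly in a simulation. The sketch below (with arbitrary parameter values) generates data from Y = b_0 + (b_1 + b_2 v)X + v and confirms that the conditional variance of the disturbance varies with X according to (1 + b_2 X)^2 s^2.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 2_000_000
b0, b1, b2, sigma = 1.0, 2.0, 0.5, 1.5
X = rng.normal(0.0, 1.0, N)
v = rng.normal(0.0, sigma, N)            # E(v) = 0, var(v) = sigma^2, independent of X
Y = b0 + (b1 + b2 * v) * X + v           # random-coefficient data generating process
eps = Y - (b0 + b1 * X)                  # regression disturbance eps = (1 + b2*X) * v

# The conditional variance should be (1 + b2*x)^2 * sigma^2, so it depends on X
for x0 in (-1.0, 0.0, 1.0):
    mask = np.abs(X - x0) < 0.02
    print(f"x = {x0:+.1f}: var(eps | X ~ x) = {eps[mask].var():.3f}, "
          f"theory = {(1 + b2 * x0) ** 2 * sigma ** 2:.3f}")
```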
where F is a class of functions that includes all measurable and square-integrable functions, i.e.,

    F = { g(.): R^{k+1} -> R | integral g^2(x) f_X(x) dx < infinity }.
In general, the regression function E(Y|X) is an unknown functional form of X. Economic
theory usually suggests a qualitative relationship between X and Y (e.g., the cost of production
is an increasing function of output X), but it never suggests a concrete functional form. One
needs to use some mathematical model to approximate g_o(X).
In econometrics, the most popular modeling strategy is the parametric approach, which assumes
a known functional form for g_o(X), up to some unknown parameters. In particular, one usually
uses a class of linear functions to approximate g_o(X), which is simple and easy to interpret. This
is the approach we will take in most of this book.
Consider the class of affine functions
A = { g : R^{k+1} → R : g(X) = β₀ + Σ_{j=1}^{k} β_j X_j, β_j ∈ R }
  = { g : R^{k+1} → R | g(X) = X'β }.
Here, there is no restriction on the values of the parameter vector β. For this class of functions, the
functional form is known to be linear in both the explanatory variables X and the parameters β; the
unknown is the (k + 1) × 1 vector β.
Remarks:
From an econometric point of view, the key feature of A is that g(X) = X'β is linear in β,
not in X. Later, we will generalize A so that g(X) = X'β is linear in β but possibly nonlinear
in X. For example, when k = 1, we can generalize A to include
g(X) = β₀ + β₁X₁ + β₂X₁²,
or
g(X) = β₀ + β₁ ln X₁.
These possibilities are included in A if we properly redefine X as X = (1, X₁, X₁²)' or X =
(1, ln X₁)'. Therefore, the econometric theory to be developed in subsequent chapters is actually
applicable to all regression models that are linear in β but not necessarily linear in X. Such models
are called linear regression models. Conversely, a nonlinear regression model for g_o(X) means a
known parametric functional form g(X, β) which is nonlinear in β. An example is the so-called
logistic regression model
g(X, β) = 1 / [1 + exp(−X'β)].
Nonlinear regression models can be handled using the analytic tools developed in Chapter 8. See
more discussion there.
The solution g*(X) = X'β* is called the best linear least squares predictor for Y, and β*
is called the best linear LS approximation coefficient vector.
Theorem 2.5 [Best Linear LS Prediction]: Suppose E(Y²) < ∞ and the (k + 1) × (k + 1)
matrix E(XX') is nonsingular. Then the best linear LS predictor that solves min_{g∈A} E[Y − g(X)]²
is g*(X) = X'β*, where
β* = [E(XX')]⁻¹ E(XY).
Proof: The first order condition of min_{β} E(Y − X'β)² sets the left hand side
(d/dβ) E(Y − X'β)² = E[(∂/∂β)(Y − X'β)²]
                   = E[2(Y − X'β)(∂/∂β)(Y − X'β)]
                   = −2E[X(Y − X'β)]
equal to zero at β = β*, i.e., E[X(Y − X'β*)] = 0, or
E(XY) = E(XX')β*.
Since E(XX') is nonsingular,
β* = [E(XX')]⁻¹ E(XY).
The second order condition holds because
(d²/dβ dβ') E(Y − X'β)² = 2E(XX')
is positive definite, given that E(XX') is nonsingular.
Remarks:
The moment condition E(Y²) < ∞ ensures that E(Y|X) exists and is well-defined. When
the (k + 1) × (k + 1) matrix
E(XX') =
  [ 1        E(X₁)      E(X₂)      ···  E(X_k)    ]
  [ E(X₁)    E(X₁²)     E(X₁X₂)    ···  E(X₁X_k)  ]
  [ E(X₂)    E(X₂X₁)    E(X₂²)     ···  E(X₂X_k)  ]
  [ ⋮         ⋮           ⋮          ⋱    ⋮         ]
  [ E(X_k)   E(X_kX₁)   E(X_kX₂)   ···  E(X_k²)   ]
is nonsingular and E(XY) exists, the best linear LS approximation coefficient β* is always
well-defined, no matter whether E(Y|X) is linear or nonlinear in X.
To gain insight into the nature of β*, consider a simple case where β = (β₀, β₁)' and
X = (1, X₁)'. Then the slope coefficient and the intercept coefficient are, respectively,
β₁* = cov(Y, X₁) / var(X₁),
β₀* = E(Y) − β₁* E(X₁).
Thus, the best linear LS approximation coefficient β₁* is proportional to cov(Y, X₁). In other
words, β₁* captures the dependence between Y and X₁ that is measurable by cov(Y, X₁). It will
miss any dependence between Y and X₁ that cannot be measured by cov(Y, X₁). Therefore, linear
regression analysis is essentially correlation analysis.
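The population formula β* = [E(XX')]⁻¹E(XY) has an exact sample analogue. The sketch below is an added illustration (not from the original notes; the data-generating process is an arbitrary choice): it estimates β₀* and β₁* from simulated data and checks them against the covariance/variance formulas.

import numpy as np

# Sample analogue of beta* = [E(XX')]^{-1} E(XY) for X = (1, X1)'.
rng = np.random.default_rng(1)
n = 100_000
X1 = rng.normal(0.0, 1.0, n)
Y = 2.0 + 0.7 * X1 + rng.normal(0.0, 1.0, n)     # hypothetical DGP

X = np.column_stack([np.ones(n), X1])             # regressor matrix with intercept
beta_star = np.linalg.solve(X.T @ X / n, X.T @ Y / n)

slope = np.cov(Y, X1)[0, 1] / X1.var()
intercept = Y.mean() - slope * X1.mean()
print("moment formula :", beta_star)              # approximately (2.0, 0.7)
print("cov/var formula:", [intercept, slope])     # approximately the same numbers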
In general, the best linear LS predictor g*(X) ≡ X'β* ≠ E(Y|X). An important question is:
what happens if g*(X) = X'β* ≠ E(Y|X)? In particular, what is the interpretation of β*?
We now discuss the relationship between the best linear LS prediction and a linear regression
model.
The specification
Y = X'β + u,  β ∈ R^{k+1},
is called a linear regression model, where u is the regression model disturbance or regression
model error. If k = 1, it is called a bivariate linear regression model or a straight-line regression
model. If k > 1, it is called a multiple linear regression model.
The linear regression model is an artificial specification. Nothing ensures that the regression
function is linear, namely that E(Y|X) = X'β^o for some β^o. In other words, the linear model may
not contain the true regression function g_o(X) ≡ E(Y|X). However, even if g_o(X) is not a linear
function of X, the linear regression model Y = X'β + u may still have some predictive ability,
although it is a misspecified model.
We …rst characterize the relationship between the best linear LS approximation and the linear
regression model.
Theorem 2.6: Suppose the conditions of the previous theorem hold. Let
Y = X'β + u,
and let β* be the best linear least squares approximation coefficient. Then
β = β* = [E(XX')]⁻¹ E(XY)
if and only if the following orthogonality condition holds:
E(Xu) = 0.
Remarks:
This theorem implies that no matter whether E(Y|X) is linear or nonlinear in X, we can
always write
Y = X'β* + u
for some β = β* such that the orthogonality condition E(Xu) = 0 holds, where u = Y − X'β*.
The orthogonality condition E(Xu) = 0 is fundamentally linked with the best linear least squares
optimization. If β is the best linear LS coefficient β*, then the disturbance u must be orthogonal to
X. On the other hand, if X is orthogonal to u, then β must be the least squares minimizer β*.
Essentially, orthogonality between X and u is the FOC of the best linear LS problem. In other
words, the orthogonality condition E(Xu) = 0 will always hold as long as the MSE criterion is
used to obtain the best linear prediction. Note that when X contains an intercept, the orthog-
onality condition E(Xu) = 0 implies that E(u) = 0. In this case, we have E(Xu) = cov(X, u).
In other words, the orthogonality condition is equivalent to uncorrelatedness between X and u.
This implies that u does not contain any component that can be predicted by a linear function
of X.
The condition E(Xu) = 0 is fundamentally different from E(u|X) = 0. The latter implies
the former but not vice versa. In other words, E(u|X) = 0 implies E(Xu) = 0, but it is possible
that E(Xu) = 0 while E(u|X) ≠ 0. This can be illustrated by the following example.
Example 1: Suppose u = (X² − 1) + ε, where X and ε are independent N(0,1) random variables.
Then
E(u|X) = X² − 1 ≠ 0, but
E(Xu) = E[X(X² − 1)] + E(Xε)
      = E(X³) − E(X) + E(X)E(ε)
      = 0.
Definition 2.5 [Correct Model Specification in Conditional Mean]: The linear regression
model
Y = X'β + u,  β ∈ R^{k+1},
is correctly specified for E(Y|X) if
E(Y|X) = X'β^o for some β^o ∈ R^{k+1}.
Remarks:
The class of linear regression models contains an infinite number of linear functions, each
corresponding to a particular value of β. When the linear model is correctly specified, the linear
function corresponding to some β^o will coincide with g_o(X). The coefficient β^o is called the "true
parameter", because it now has a meaningful economic interpretation as the expected marginal
effect of X on Y:
β^o = (d/dX) E(Y|X).
For example, when Y is consumption and X is income, β^o is the marginal propensity to consume
(MPC).
When β^o is a vector, the component
β_j^o = ∂E(Y|X)/∂X_j,  1 ≤ j ≤ k,
is the partial marginal effect of X_j on Y, holding all other explanatory variables in X
fixed.
Question: What is the interpretation of the intercept coefficient β₀^o when a linear regression
model is correctly specified for g_o(X)?
Example: Consider the linear regression model Y = β₀^o + β₁^oX₁ + ε suggested by the Capital
Asset Pricing Model (CAPM), where Y is the excess portfolio return (i.e., the difference between a
portfolio return and a risk-free rate) and X₁ is the excess market portfolio return (i.e., the difference
between the market portfolio return and a risk-free rate). Here, β₀^o represents the average pricing
error. When CAPM holds, β₀^o = 0. Thus, if the data generating process has β₀^o > 0, CAPM
underprices the portfolio; if β₀^o < 0, CAPM overprices the portfolio.
No economic theory ensures that the functional form of E(Y|X) must be linear in X. A non-
linear functional form in X is a generic possibility. Therefore, we must be very cautious about
the economic interpretation of linear coefficients.
Proof: (a) If the linear model is correctly specified for E(Y|X), then E(Y|X) = X'β^o for some
β^o.
On the other hand, we always have the regression identity Y = E(Y|X) + ε, where E(ε|X) = 0.
Combining these two equations gives result (a) immediately.
(b) From part (a) we have
E(Xε) = E[X E(ε|X)]
      = E(X · 0)
      = 0.
It follows that the orthogonality condition holds for Y = X'β^o + ε. Therefore, we have β* = β^o
by the previous theorem (which one?).
Remarks:
Part (a) implies E(Y|X) = X'β^o under correct model specification for E(Y|X). This,
together with part (b), implies that when a linear regression model is correctly specified, the
conditional mean E(Y|X) coincides with the best linear least squares predictor g*(X) = X'β*.
Under correct model specification, the best linear LS approximation coefficient β* is equal to
the true marginal effect parameter β^o. In other words, β* can be interpreted as the true parameter
β^o when (and only when) the linear regression model is correctly specified.
Question: What happens if the linear regression model
Y = X'β + u,
where E(Xu) = 0, is misspecified for E(Y|X)? In other words, what happens if E(Xu) = 0 but
E(u|X) ≠ 0?
In this case, there exists some neglected structure in u that could be exploited to improve the prediction
of Y using X. A misspecified model always yields suboptimal predictions; a correctly specified model
yields optimal predictions in terms of MSE.
Example: Suppose
Y = 1 + (1/2)X₁ + (1/4)(X₁² − 1) + ε,
where X₁ and ε are mutually independent N(0,1).
(a) Find the conditional mean E(Y|X₁) and (d/dX₁)E(Y|X₁), the marginal effect of X₁ on Y.
Suppose now a linear regression model
Y = β₀ + β₁X₁ + u = X'β + u,
where X = (1, X₁)', is used to predict Y.
Solution: (a) The conditional mean and marginal effect are
E(Y|X₁) = 1 + (1/2)X₁ + (1/4)(X₁² − 1),
(d/dX₁) E(Y|X₁) = 1/2 + (1/2)X₁.
(b) By Theorem 2.5, the best linear LS approximation coefficient is
β* = [E(XX')]⁻¹ E(XY)
   = [ 1 0 ; 0 1 ]⁻¹ (1, 1/2)'
   = (1, 1/2)'.
Hence, we have
g*(X) = X'β* = 1 + (1/2)X₁.
(c) By definition and part (b), we have
u = Y − X'β*
  = Y − (β₀* + β₁*X₁)
  = (1/4)(X₁² − 1) + ε.
It follows that
E(Xu) = E[ (1, X₁)' ( (1/4)(X₁² − 1) + ε ) ]
      = ( E[(1/4)(X₁² − 1) + ε],  E[X₁((1/4)(X₁² − 1) + ε)] )'
      = (0, 0)',
although
E(u|X₁) = (1/4)(X₁² − 1) ≠ 0.
(d) No, because
(d/dX₁) E(Y|X₁) = 1/2 + (1/2)X₁ ≠ 1/2 = β₁*.
The marginal effect depends on the level of X₁, rather than being a constant. Therefore, the
condition E(Xu) = 0 is not sufficient for the validity of the economic interpretation of β₁* as
the marginal effect.
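To see this numerically, the sketch below (an illustration added here, not part of the original notes) simulates the DGP Y = 1 + 0.5X₁ + 0.25(X₁² − 1) + ε and confirms that the fitted OLS slope is close to β₁* = 0.5, even though the true marginal effect 0.5 + 0.5X₁ varies with X₁.

import numpy as np

# Simulate Y = 1 + 0.5*X1 + 0.25*(X1**2 - 1) + eps and fit a misspecified linear model.
rng = np.random.default_rng(2)
n = 200_000
X1 = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n)
Y = 1.0 + 0.5 * X1 + 0.25 * (X1**2 - 1.0) + eps

X = np.column_stack([np.ones(n), X1])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
u = Y - X @ beta_hat

print("OLS coefficients           :", beta_hat)        # close to (1, 0.5) = beta*
print("E(X1*u) estimate           :", np.mean(X1 * u)) # close to 0 (orthogonality)
print("E(u | X1 > 1) estimate     :", u[X1 > 1].mean())# not 0: E(u|X1) != 0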
Any parametric regression model is subject to potential model misspecification. This can
occur due to the use of a misspecified functional form, or to the existence of omitted variables
that are correlated with the included regressors, among other things. In econometrics, there
exists a modeling strategy that is free of model misspecification when the data set is sufficiently
large. This modeling strategy is called the nonparametric approach; it does not assume any
functional form for E(Y|X) but lets the data speak for the true relationship. We now introduce the
basic idea of the nonparametric approach.
Nonparametric modeling is a statistical method that can approximate the unknown function arbi-
trarily well without knowing the functional form of E(Y|X). To illustrate the basic idea
of nonparametric modeling, suppose g_o(x) is a smooth function of x. Then we can expand g_o(x)
using a set of orthonormal "basis" functions {ψ_j(x)}_{j=0}^{∞}:
g_o(x) = Σ_{j=0}^{∞} β_j ψ_j(x)  for x ∈ support(X),
where the Fourier coefficients are
β_j = ∫ g_o(x) ψ_j(x) dx,
and orthonormality means
∫ ψ_i(x) ψ_j(x) dx = δ_ij = { 1 if i = j; 0 if i ≠ j }.
The function δ_ij is called the Kronecker delta.
Example 2: Suppose g_o(x) = x², where x ∈ [−π, π]. Then
g_o(x) = π²/3 − 4[ cos(x) − cos(2x)/2² + cos(3x)/3² − ··· ]
       = π²/3 − 4 Σ_{j=1}^{∞} (−1)^{j−1} cos(jx)/j².
Example 3: Suppose
g_o(x) = { −1 if −π < x < 0;  0 if x = 0;  1 if 0 < x < π }.
Then
g_o(x) = (4/π)[ sin(x) + sin(3x)/3 + sin(5x)/5 + ··· ]
       = (4/π) Σ_{j=0}^{∞} sin[(2j + 1)x] / (2j + 1).
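As a quick numerical illustration (added here as a sketch, not part of the original notes), the code below evaluates the truncated Fourier expansion of g_o(x) = x² on [−π, π] and shows that the approximation error shrinks as more terms are included.

import numpy as np

# Truncated Fourier approximation of g(x) = x^2 on [-pi, pi]:
# g_p(x) = pi^2/3 - 4 * sum_{j=1}^{p} (-1)^(j-1) * cos(j*x) / j^2
def g_p(x, p):
    j = np.arange(1, p + 1)
    terms = (-1.0) ** (j - 1) * np.cos(np.outer(x, j)) / j**2
    return np.pi**2 / 3 - 4.0 * terms.sum(axis=1)

x = np.linspace(-np.pi, np.pi, 2001)
for p in (1, 3, 10, 50):
    err = np.max(np.abs(x**2 - g_p(x, p)))
    print(f"p = {p:3d}: max approximation error = {err:.4f}")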
Therefore, β_j → 0 as j → ∞. That is, the Fourier coefficient β_j eventually vanishes as
the order j goes to infinity. This motivates the following truncated approximation:
g_p(x) = Σ_{j=0}^{p} β_j ψ_j(x),
where p is the order of the basis. The approximation bias of g_p(x) for g_o(x) is
B_p(x) = g_o(x) − g_p(x) = Σ_{j=p+1}^{∞} β_j ψ_j(x).
The coefficients {β_j} are unknown in practice, so we have to estimate them from an observed
data set {Y_t, X_t}_{t=1}^{n}, where n is the sample size. We consider the linear regression
Y_t = Σ_{j=0}^{p} β_j ψ_j(X_t) + u_t,  t = 1, ..., n.
Obviously, we need to let p = p(n) → ∞ as n → ∞ to ensure that the bias B_p(x) vanishes
as n → ∞. However, we should not let p grow to infinity too fast, because otherwise
there will be too much sampling variation in the parameter estimators (due to too many unknown
parameters). This requires p/n → 0 as n → ∞.
The nonparametric approach is flexible and powerful, but it generally requires a large data set
for precise estimation because there is a large number of unknown parameters. Moreover, it offers
little economic interpretation (for example, it is difficult to give an economic interpretation
to the coefficients {β_j}). Nonparametric analysis is usually treated in a separate, more advanced
econometrics course (see more discussion in Chapter 10).
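The sketch below (an added illustration, not from the original notes) implements this series idea with a simple power basis ψ_j(x) = x^j: it regresses Y_t on (1, X_t, ..., X_t^p) by OLS and compares the fit for small and larger p. The DGP and the choice of basis are hypothetical.

import numpy as np

# Series (sieve) regression: approximate E(Y|X) with a truncated basis expansion.
rng = np.random.default_rng(3)
n = 5_000
X = rng.uniform(-2.0, 2.0, n)
Y = np.sin(1.5 * X) + 0.3 * rng.normal(size=n)    # unknown nonlinear regression function

def series_fit(X, Y, p):
    """OLS on the power basis (1, X, ..., X^p); returns fitted values."""
    Z = np.column_stack([X**j for j in range(p + 1)])
    beta = np.linalg.lstsq(Z, Y, rcond=None)[0]
    return Z @ beta

for p in (1, 3, 7):
    fit = series_fit(X, Y, p)
    mse = np.mean((np.sin(1.5 * X) - fit) ** 2)   # error relative to the true E(Y|X)
    print(f"p = {p}: approximation MSE = {mse:.4f}")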
2.5 Conclusion
Most economic theories (e.g., rational expectations theory) have implications on, and only on,
the conditional mean of the underlying economic variable given some suitable information set.
The conditional mean E(Y|X) is called the regression function of Y on X. In this chapter, we
have shown that the regression function E(Y|X) is the optimal solution to the MSE minimization
problem
min_{g∈F} E[Y − g(X)]²,
where F is the class of measurable and square-integrable functions. When g is restricted to the
class of linear (affine) functions of X, the optimal solution is g*(X) = X'β*, where
β* = [E(XX')]⁻¹ E(XY)
is called the best linear least squares approximation coefficient. The best linear least squares
predictor g_A(X) = X'β* is always well-defined, no matter whether E(Y|X) is linear in X.
Suppose we write
Y = X'β* + u.
Then the orthogonality condition E(Xu) = 0 holds.
This orthogonality condition is actually the first order condition of the best linear least squares
minimization problem. It does not guarantee correct specification of a linear regression model.
A linear regression model is correctly specified for E(Y|X) if E(Y|X) = X'β^o for some β^o, which
is equivalent to the condition that
E(u|X) = 0,
where u = Y − X'β^o. That is, correct model specification for E(Y|X) holds if and only if the
conditional mean of the linear regression model error is zero when evaluated at some parameter
β^o. Note that E(u|X) = 0 is equivalent to the condition that E[uh(X)] = 0 for all measurable
functions h(·). When E(Y|X) = X'β^o for some β^o, we have β* = β^o. That is, the best linear least
squares approximation coefficient β* coincides with the true model parameter β^o and can be
interpreted as the marginal effect of X on Y. The condition E(u|X) = 0 fundamentally differs
from E(Xu) = 0. The former is crucial for the validity of the economic interpretation of the
coefficient β* as the true coefficient β^o; the orthogonality condition E(Xu) = 0 does not guarantee
this interpretation. Correct model specification is important for the economic interpretation of model
coefficients and for optimal prediction.
An econometric model aims to provide a concise and reasonably accurate reflection of the
data generating process. By disregarding less relevant aspects of the data, the model helps us
obtain a better understanding of the main aspects of the DGP. This implies that an econometric
model will never provide a completely accurate description of the DGP. Therefore, the concept of
a "true model" does not make much practical sense. It reflects an idealized situation that allows
us to obtain mathematically exact results. The idea is that similar results hold approximately
if the model is a reasonably accurate approximation of the DGP.
The main purpose of this chapter is to provide a general idea of regression analysis and to shed
some light on the nature and limitations of linear regression models, which have been popularly
used in econometrics and will be the subject of study in Chapters 3 to 7.
EXERCISES
2.1. Put ε = Y − E(Y|X). Show var(Y|X) = var(ε|X).
2.2. Show var(Y) = var[E(Y|X)] + var[Y − E(Y|X)].
2.3. Suppose (X, Y) follows a bivariate normal distribution with joint pdf
f_{XY}(x, y) = [1 / (2π σ₁ σ₂ √(1 − ρ²))]
  × exp{ −[1/(2(1 − ρ²))] [ ((x − μ₁)/σ₁)² − 2ρ((x − μ₁)/σ₁)((y − μ₂)/σ₂) + ((y − μ₂)/σ₂)² ] }.
2.4. Suppose Z ≡ (Y, X')' is a random vector such that the conditional mean g_o(X) ≡
E(Y|X) exists, where X is a (k + 1) × 1 random vector. Suppose one uses a model (or a
function) g(X) to predict Y. A popular evaluation criterion for model g(X) is the mean squared
error MSE(g) ≡ E[Y − g(X)]².
(a) Show that the optimal predictor g*(X) for Y that minimizes MSE(g) is the conditional
mean g_o(X), namely g*(X) = g_o(X).
(b) Put ε ≡ Y − g_o(X), which is called the true regression disturbance. Show that E(ε|X) = 0
and interpret this result.
2.5. The choices of model g(X) in Exercise 2.4 are very general. Suppose that we now restrict the
choice of g(X) to linear (or affine) models {g_A(X) = X'β}, where β is a (k + 1) × 1 parameter vector.
One can choose a linear function g_A(X) by choosing a value for the parameter β. Different values
of β give different linear functions g_A(X). The best linear predictor that minimizes the mean
squared error criterion is defined as g_A*(X) ≡ X'β*, where β* ≡ arg min_β E(Y − X'β)².
(c) Suppose the conditional mean g_o(X) = X'β^o for some given β^o. Then we say that the
linear model g_A(X) is correctly specified for the conditional mean g_o(X), and β^o is the true parameter
of the data generating process. Show that β* = β^o and E(u*|X) = 0, where u* ≡ Y − X'β*.
(d) Suppose the conditional mean g_o(X) ≠ X'β for any value of β. Then we say that the
linear model g_A(X) is misspecified for the conditional mean g_o(X). Check whether E(u*|X) = 0 and
discuss its implication.
2.6. Show that
min_{α₀, α₁} E[Y − (α₀ + α₁X₁)]² = σ²_Y (1 − ρ²_{X₁Y}),
where σ²_Y = var(Y) and ρ_{X₁Y} is the correlation between X₁ and Y.
2.7. Suppose
Y = α₀ + α₁X₁ + |X₁|ε.
2.8. Suppose
Y = 1 + 0.5X₁ + (1/4)(X₁² − 1) + ε,
where X₁ ~ N(0,1), ε ~ N(0,1), and X₁ is independent of ε.
(a) Find the conditional mean g_o(X) ≡ E(Y|X), where X ≡ (1, X₁)'.
(b) Find the marginal propensity to consume (MPC), (d/dX₁) g_o(X).
(c) Suppose we use a linear model
Y = X'β + u = β₀ + β₁X₁ + u,
where β ≡ (β₀, β₁)', to predict Y. Find the optimal linear coefficient β* and the optimal linear
predictor g_A*(X) ≡ X'β*.
(d) Compute the partial derivative of the linear model, (d/dX₁) g_A*(X), and compare it with the
MPC in part (b). Discuss the results you obtain.
2.9. Suppose
Y = g_o(X) + ε,
where E(ε|X) = 0. Consider a first order Taylor series expansion of g_o(X) around μ₁ = E(X₁):
g_o(X) ≈ g_o(μ₁) + g_o'(μ₁)(X₁ − μ₁).
Suppose β* = (β₀*, β₁*)' is the best linear least squares approximation coefficient. Is it true
that β₁* = g_o'(μ₁)? Provide your reasoning.
2.10. Suppose
Y = 0.8X₁X₂ + ε,
where X₁ ~ N(0,1), X₂ ~ N(0,1), ε ~ N(0,1), and X₁, X₂ and ε are mutually independent.
Put X = (1, X₁, X₂)'.
(a) Is Y predictable in mean using the information in X?
(b) Suppose we use a linear model
Y = X'β + u = β₀ + β₁X₁ + β₂X₂ + u
to predict Y. Does this linear model have any predictive power? Explain.
2.11. Show that E(u|X) = 0 if and only if E[h(X)u] = 0 for any measurable function h(·).
2.13. Suppose E(ε|X) exists, X is a bounded random variable, and h(X) is an arbitrary
measurable function. Put g(X) = E(ε|X) and assume that E[g²(X)] < ∞.
(a) Show that if g(X) = 0, then E[εh(X)] = 0.
(b) Show that if E[εh(X)] = 0, then E(ε|X) = 0. [Hint: Consider h(X) = e^{tX} for t in a small
neighborhood of zero, and use the expansion
e^{tX} = Σ_{j=0}^{∞} (t^j / j!) X^j,
where δ_j = ∫_{−∞}^{∞} g(x) x^j f_X(x) dx is the Fourier coefficient. Then
E(ε e^{tX}) = E[E(ε|X) e^{tX}]
           = E[g(X) e^{tX}]
           = Σ_{j=0}^{∞} (t^j / j!) E[g(X) X^j]
           = Σ_{j=0}^{∞} (t^j / j!) δ_j.]
2.15. Comment on the following statement: "All econometric models are approximations of the
economic system of interest and are therefore misspecified. Therefore, there is no need to check
correct model specification in practice."
CHAPTER 3 CLASSICAL LINEAR
REGRESSION MODELS
Abstract: In this chapter, we will introduce the classical linear regression theory,
including the classical model assumptions, the statistical properties of the OLS esti-
mator, the t-test and the F -test, as well as the GLS estimator and related statistical
procedures. This chapter will serve as a starting point from which we will develop the
modern econometric theory.
We first list and discuss the assumptions of the classical linear regression theory.
Assumption 3.1 [Linearity]:
Y_t = X_t'β^o + ε_t,  t = 1, ..., n.
The key notion of linearity in the classical linear regression model is that the re-
gression model is linear in β^o rather than in X_t. In other words, linear regression models
cover some models for Y_t that have a nonlinear relationship with X_t.
Question: Does Assumption 3.1 imply a causal relationship from X_t to Y_t?
Not necessarily. As Kendall and Stuart (1961, Vol. 2, Ch. 26, p. 279) point out,
"a statistical relationship, however strong and however suggestive, can never establish
causal connection. Our ideas of causation must come from outside statistics ultimately,
from some theory or other." Assumption 3.1 only implies a predictive relationship: given
X_t, can we predict Y_t linearly?
Denote
Y = (Y₁, ..., Y_n)',  n × 1,
ε = (ε₁, ..., ε_n)',  n × 1,
X = (X₁, ..., X_n)',  n × K,
where the t-th row of X is X_t' = (1, X_{1t}, ..., X_{kt}). With these matrix notations, we have
a compact expression for Assumption 3.1:
Y = Xβ^o + ε,
n × 1 = (n × K)(K × 1) + n × 1.
Remarks:
Among other things, Assumption 3.2 implies correct model specification for E(Y_t|X_t).
This is because Assumption 3.2 implies E(ε_t|X_t) = 0 by the law of iterated expectations. It also
implies E(ε_t) = 0.
Under Assumption 3.2, we have E(X_sε_t) = 0 for any (t, s), where t, s ∈ {1, ..., n}.
This follows because
E(X_sε_t) = E[X_s E(ε_t|X)] = E(X_s · 0) = 0.
Note that (i) and (ii) imply cov(X_s, ε_t) = 0 for all t, s ∈ {1, ..., n}.
Because X contains the regressors {X_s} for both s ≤ t and s > t, Assumption 3.2
essentially requires that the error ε_t not depend on the past and future values of the
regressors if t is a time index. This rules out dynamic time series models, for which
ε_t may be correlated with future values of the regressors (because the future values of the
regressors depend on current shocks), as is illustrated in the following example.
Y_t = β₀ + β₁Y_{t−1} + ε_t,  t = 1, ..., n,
    = X_t'β^o + ε_t,
where X_t = (1, Y_{t−1})' and
{ε_t} ~ i.i.d.(0, σ²).
In econometrics, there are alternative definitions of strict exogeneity. For ex-
ample, one definition assumes that ε_t and X are independent. Another example is that
X is nonstochastic. Both rule out conditional heteroskedasticity (i.e., var(ε_t|X) depending
on X). In Assumption 3.2, we still allow for conditional heteroskedasticity, because we
do not assume that ε_t and X are independent. We only assume that the conditional
mean E(ε_t|X) does not depend on X.
For example, consider the time trend regression
Y_t = X_t'β^o + ε_t = Σ_{j=0}^{k} β_j^o t^j + ε_t,
where X_t = (1, t, ..., t^k)' is nonstochastic.
In other words, when {Z_t} is i.i.d., E(ε_t|X) = 0 is equivalent to E(ε_t|X_t) = 0.
Assumption 3.3: (a) The K × K matrix X'X is nonsingular; (b) λ_min(X'X) → ∞ as n → ∞ with probability one.
Remarks:
Assumption 3.3(a) rules out multicollinearity among the (k + 1) regressors in X_t. We
say that there exists multicollinearity (sometimes called exact or perfect multicollinearity
in the literature) among the X_t if, for all t ∈ {1, ..., n}, the variable X_{jt} for some
j ∈ {0, 1, ..., k} is a linear combination of the other K − 1 column variables {X_{it}, i ≠ j}.
In this case, the matrix X'X is singular, and as a consequence, the true model parameter
β^o in Assumption 3.1 is not identifiable.
Assumption 3.4: (a) [conditional homoskedasticity]:
E(ε_t²|X) = σ² > 0,  t = 1, ..., n;
(b) [conditional non-autocorrelation]:
E(ε_tε_s|X) = 0 for all t ≠ s, t, s ∈ {1, ..., n}.
Remarks:
We can write Assumption 3.4 compactly as
E(ε_tε_s|X) = σ²δ_ts,  t, s ∈ {1, ..., n},
where δ_ts is the Kronecker delta, or in matrix form, E(εε'|X) = σ²I_n.
Definition 3.1 [OLS estimator]: Suppose Assumptions 3.1 and 3.3(a) hold. Define
the sum of squared residuals (SSR) of the linear regression model Y_t = X_t'β + u_t as
SSR(β) ≡ (Y − Xβ)'(Y − Xβ) = Σ_{t=1}^{n} (Y_t − X_t'β)².
Then the OLS estimator β̂ is the solution to min_{β∈R^K} SSR(β).
Note that SSR(β) is the sum of squared model errors {u_t = Y_t − X_t'β}, with equal
weighting for each t.
Theorem 3.1 [Existence of OLS]: Under Assumptions 3.1 and 3.3, the OLS estimator
β̂ exists and
β̂ = (X'X)⁻¹X'Y = ( n⁻¹ Σ_{t=1}^{n} X_tX_t' )⁻¹ ( n⁻¹ Σ_{t=1}^{n} X_tY_t ).
The last expression will be useful for our asymptotic analysis in subsequent chapters.
Proof: Using the formula that for a K × 1 vector A and a K × 1 vector β the derivative
∂(A'β)/∂β = A,
we have
dSSR(β)/dβ = (d/dβ) Σ_{t=1}^{n} (Y_t − X_t'β)²
           = Σ_{t=1}^{n} (∂/∂β)(Y_t − X_t'β)²
           = Σ_{t=1}^{n} 2(Y_t − X_t'β)(∂/∂β)(Y_t − X_t'β)
           = −2 Σ_{t=1}^{n} X_t(Y_t − X_t'β)
           = −2X'(Y − Xβ).
The OLS estimator must satisfy the FOC:
−2X'(Y − Xβ̂) = 0,
X'(Y − Xβ̂) = 0,
X'Y − (X'X)β̂ = 0.
It follows that
(X'X)β̂ = X'Y.
By Assumption 3.3, X'X is nonsingular. Thus,
β̂ = (X'X)⁻¹X'Y.
Moreover, the Hessian
∂²SSR(β)/∂β∂β' = −2 (∂/∂β') Σ_{t=1}^{n} [(Y_t − X_t'β)X_t]
               = 2X'X
is positive definite given λ_min(X'X) > 0. Thus, β̂ is a global minimizer. Note that for the existence of β̂ we
only need X'X to be nonsingular, which is implied by the condition λ_min(X'X) → ∞ as n → ∞;
the existence result itself does not require λ_min(X'X) → ∞ as n → ∞. This completes
the proof.
Remarks:
Suppose Z_t = (Y_t, X_t')', t = 1, ..., n, is an independent and identically distributed
(i.i.d.) random sample of size n. Consider the sum of squared residuals scaled by n⁻¹:
SSR(β)/n = n⁻¹ Σ_{t=1}^{n} (Y_t − X_t'β)².
That is, SSR(β), after being scaled by n⁻¹, is the sample analogue of MSE(β), and the OLS
estimator β̂ is the sample analogue of the best linear LS approximation coefficient β*.
Put Ŷ_t ≡ X_t'β̂. This is called the fitted value (or predicted value) for observation Y_t,
and e_t ≡ Y_t − Ŷ_t is the estimated residual (or prediction error) for observation Y_t. Note
that
e_t = Y_t − Ŷ_t
    = (X_t'β^o + ε_t) − X_t'β̂
    = ε_t − X_t'(β̂ − β^o).
The FOC of the OLS estimation implies X'e = 0, where e = (e₁, ..., e_n)'.
This is a consequence of the very nature of OLS, as implied by the FOC of min_{β∈R^K} SSR(β).
It always holds no matter whether E(ε_t|X) = 0 (recall that we do not impose Assump-
tion 3.2 in the theorem above). Note that if X_t contains an intercept, then X'e = 0
implies Σ_{t=1}^{n} e_t = 0.
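The following short sketch (added for illustration; the simulated data are hypothetical) computes the OLS estimator from the normal equations and verifies the FOC X'e = 0 numerically.

import numpy as np

# OLS via the normal equations (X'X) beta_hat = X'Y, plus a check of X'e = 0.
rng = np.random.default_rng(4)
n, k = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # intercept + k regressors
beta_o = np.array([1.0, 0.5, -0.3])                           # hypothetical true parameters
Y = X @ beta_o + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # (X'X)^{-1} X'Y without explicit inversion
e = Y - X @ beta_hat                           # estimated residuals
print("beta_hat:", beta_hat)
print("X'e (should be ~ 0):", X.T @ e)
print("sum of residuals (intercept included):", e.sum())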
Define
P = X(X'X)⁻¹X'
and
M = I_n − P.
Then both P and M are symmetric (i.e., P = P' and M = M') and idempotent (i.e.,
P² = P, M² = M), with
PX = X,  MX = 0.
(iv)
SSR(β̂) = e'e = Y'MY = ε'Mε.
Proof: (i) The result follows immediately from the FOC of the OLS estimator.
(ii) Because β̂ = (X'X)⁻¹X'Y and Y = Xβ^o + ε, we have
β̂ − β^o = (X'X)⁻¹X'(Xβ^o + ε) − β^o = (X'X)⁻¹X'ε.
(iii) P is idempotent because
P² = PP = [X(X'X)⁻¹X'][X(X'X)⁻¹X'] = X(X'X)⁻¹X' = P,
and similarly M² = M. Moreover,
e = Y − Xβ̂
  = Y − X(X'X)⁻¹X'Y
  = [I − X(X'X)⁻¹X']Y
  = MY
  = M(Xβ^o + ε)
  = MXβ^o + Mε
  = Mε,
where the last equality uses MX = 0.
(iv) Finally,
SSR(β̂) = e'e = (Mε)'(Mε) = ε'M²ε = ε'Mε,
where the last equality follows from M² = M.
We first introduce two measures of goodness of fit. The first measure is called the
uncentered squared multi-correlation coefficient,
R²_uc ≡ Σ_{t=1}^{n} Ŷ_t² / Σ_{t=1}^{n} Y_t² = 1 − e'e / Σ_{t=1}^{n} Y_t².
Remarks:
The measure R²_uc has a nice interpretation: it is the proportion of the uncentered sample
quadratic variation in the dependent variable {Y_t} that can be attributed to the un-
centered sample quadratic variation of the predicted values {Ŷ_t}. Note that we always
have 0 ≤ R²_uc ≤ 1.
Remarks:
When X_t contains an intercept, we have the following orthogonal decomposition:
Σ_{t=1}^{n} (Y_t − Ȳ)² = Σ_{t=1}^{n} (Ŷ_t − Ȳ + Y_t − Ŷ_t)²
                      = Σ_{t=1}^{n} (Ŷ_t − Ȳ)² + Σ_{t=1}^{n} e_t² + 2 Σ_{t=1}^{n} (Ŷ_t − Ȳ)e_t
                      = Σ_{t=1}^{n} (Ŷ_t − Ȳ)² + Σ_{t=1}^{n} e_t²,
because the cross-product term
Σ_{t=1}^{n} (Ŷ_t − Ȳ)e_t = β̂' Σ_{t=1}^{n} X_te_t − Ȳ Σ_{t=1}^{n} e_t
                         = β̂'(X'e) − Ȳ · 0
                         = 0,
where we have used the facts that X'e = 0 and Σ_{t=1}^{n} e_t = 0 from the FOC of
the OLS estimation and the fact that X_t contains an intercept (i.e., X_{0t} = 1). It follows
that the centered R², defined as
R² ≡ 1 − e'e / Σ_{t=1}^{n} (Y_t − Ȳ)²,
can be written as
R² = [ Σ_{t=1}^{n} (Y_t − Ȳ)² − Σ_{t=1}^{n} e_t² ] / Σ_{t=1}^{n} (Y_t − Ȳ)²
   = Σ_{t=1}^{n} (Ŷ_t − Ȳ)² / Σ_{t=1}^{n} (Y_t − Ȳ)²,
and consequently we have
0 ≤ R² ≤ 1.
Question: Can R² be negative?
Yes, it is possible! If X_t does not contain an intercept, then the orthogonal decom-
position identity
Σ_{t=1}^{n} (Y_t − Ȳ)² = Σ_{t=1}^{n} (Ŷ_t − Ȳ)² + Σ_{t=1}^{n} e_t²
no longer holds. As a consequence, R² may be negative when there is no intercept! This
is because the cross-product term
2 Σ_{t=1}^{n} (Ŷ_t − Ȳ)e_t
may be negative.
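The sketch below (an added illustration with an arbitrary DGP) computes the centered R² with and without an intercept, showing that R² can indeed become negative once the intercept is dropped.

import numpy as np

def centered_r2(X, Y):
    """Centered R^2 = 1 - e'e / sum (Y_t - Ybar)^2 from an OLS fit of Y on X."""
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - X @ beta_hat
    return 1.0 - (e @ e) / np.sum((Y - Y.mean()) ** 2)

rng = np.random.default_rng(5)
n = 300
X1 = rng.normal(size=n)
Y = 5.0 + 0.1 * X1 + rng.normal(size=n)          # large intercept, weak slope

X_with = np.column_stack([np.ones(n), X1])       # regression with intercept
X_without = X1.reshape(-1, 1)                    # regression without intercept
print("R^2 with intercept   :", centered_r2(X_with, Y))     # in [0, 1]
print("R^2 without intercept:", centered_r2(X_without, Y))  # can be negative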
Example 1 [Capital Asset Pricing Model (CAPM)]: The classical CAPM is char-
acterized by the regression equation
r_pt − r_ft = β₀ + β₁(r_mt − r_ft) + ε_pt,
where r_pt is the return on portfolio (or asset) p, r_ft is the return on a risk-free asset,
and r_mt is the return on the market portfolio. Here, r_pt − r_ft is the risk premium of
portfolio p, r_mt − r_ft is the risk premium of the market portfolio, which is the only sys-
tematic market risk factor, and ε_pt is the individual-specific risk, which can be eliminated
by diversification if the ε_pt are uncorrelated across different assets. In this model, R²
has an interesting economic interpretation: it is the proportion of the risk of portfolio
p (as measured by the sample variance of its risk premium r_pt − r_ft) that is attributed
to the market risk factor (r_mt − r_ft). In contrast, 1 − R² is the proportion of the risk of
portfolio p that is contributed by the individual-specific risk factor ε_pt.
For any given random sample {Y_t, X_t'}', t = 1, ..., n, R² is nondecreasing in the number
of explanatory variables in X_t. In other words, the more explanatory variables are added
to the linear regression, the higher R² is. This is true no matter whether X_t has
any true explanatory power for Y_t.
Theorem 3.3: Suppose {Y_t, X_{1t}, ..., X_{(k+q)t}}', t = 1, ..., n, is a random sample, and
Assumptions 3.1 and 3.3(a) hold. Let R₁² be the centered R² from the linear regression
Y_t = X_t'β + u_t,
where X_t = (1, X_{1t}, ..., X_{kt})' and β is a K × 1 parameter vector, and let R₂² be the centered
R² from the extended linear regression
Y_t = X̃_t'β + v_t,
where X̃_t = (1, X_{1t}, ..., X_{kt}, X_{(k+1)t}, ..., X_{(k+q)t})' and β is a (K + q) × 1 parameter vector.
Then R₂² ≥ R₁².
Proof: Let e be the estimated residual vector from the regression of Y on X, and let ẽ be the
estimated residual vector from the regression of Y on X̃. It suffices to show ẽ'ẽ ≤ e'e.
Because the OLS estimator (X̃'X̃)⁻¹X̃'Y minimizes SSR(β) for the extended model,
we have
ẽ'ẽ ≤ Σ_{t=1}^{n} (Y_t − X̃_t'β)²  for all β ∈ R^{K+q}.
Now choose
β = (β̂', 0')',
where β̂ = (X'X)⁻¹X'Y is the OLS estimator from the first regression. It follows that
ẽ'ẽ ≤ Σ_{t=1}^{n} ( Y_t − Σ_{j=0}^{k} β̂_jX_{jt} − Σ_{j=k+1}^{k+q} 0 · X_{jt} )²
    = Σ_{t=1}^{n} (Y_t − X_t'β̂)²
    = e'e.
The measure R² can be used to compare models with the same number of predictors,
but it is not a useful criterion for comparing models of different sizes because it is biased
in favor of large models.
The measure R² is also not a suitable criterion for assessing correct model specification. It is a
measure of sampling variation rather than a population measure. A high value of R²
does not necessarily imply correct model specification, and correct model specification
does not necessarily imply a high value of R².
Strictly speaking, R² is merely a measure of association and says nothing about
causality. High values of R² are often easy to achieve when dealing with economic
time series data, even when the causal link between two variables is extremely tenuous
or perhaps nonexistent. For example, in spurious regressions, where the dependent
variable Y_t and the regressors X_t have no causal relationship but display similar
trending behaviors over time, it is often found that R² is close to unity.
Question: What is the interpretation of R² in the Cobb-Douglas regression
ln Y_t = β₀ + β₁ ln L_t + β₂ ln K_t + ε_t?
Answer: R² is the proportion of the total sample variation in ln Y_t that can be at-
tributed to the sample variations in ln L_t and ln K_t. It is not the proportion of the sample
quadratic variation in Y_t that can be attributed to the sample variations of L_t and K_t.
Question: Does a high R² value imply a precise estimation for β^o?
Below, we introduce two popular model selection criteria that reflect such an idea.
A linear regression model can be selected by minimizing the following AIC criterion
with a suitable choice of K:
AIC = ln(s²) + 2K/n
    = goodness of fit + model complexity,
where
s² = e'e/(n − K)
is the residual variance estimator for E(ε_t²) = σ², and K = k + 1 is the number of
regressors. AIC was proposed by Akaike (1973). Similarly, a model can be selected by
minimizing the BIC criterion,
BIC = ln(s²) + K ln(n)/n.
Both AIC and BIC trade off goodness of fit to the data, measured by ln(s²),
against the desire to use as few parameters as possible. When ln n ≥ 2, which is the
case when n > 7, BIC gives a heavier penalty for model complexity than AIC, where complexity is
measured by the number of estimated parameters relative to the sample size n. As a
consequence, BIC will choose a more parsimonious linear regression model than AIC.
The difference between AIC and BIC is due to the way they are constructed. AIC
is designed to select a model that will predict best and is less concerned than BIC
with having a few too many parameters. BIC is designed to select the true value of
K exactly. Under certain regularity conditions, BIC is strongly consistent in the sense
that it determines the true model asymptotically (i.e., as n → ∞), whereas for AIC
an overparameterized model will emerge no matter how large the sample is. Of course,
such properties are not necessarily guaranteed in finite samples. In practice, the best AIC
model is usually close to the best BIC model, and often they deliver the same model.
In addition to AIC and BIC, there are other criteria, such as the so-called adjusted
R², denoted R̄², that can also be used to select a linear regression model. The adjusted R² is
defined as
R̄² = 1 − [e'e/(n − K)] / [(Y − Ȳ)'(Y − Ȳ)/(n − 1)].
This differs from
R² = 1 − e'e / [(Y − Ȳ)'(Y − Ȳ)].
In R̄², the adjustment is made according to the degrees of freedom, or the number of
explanatory variables in X_t. It may be shown that
R̄² = 1 − (1 − R²)(n − 1)/(n − K).
Since (n − 1)/(n − K) > 1 when K > 1,
we note that R̄² may take a negative value even though there is an intercept in X_t.
All model selection criteria are structured in terms of the estimated residual variance plus
a penalty adjustment involving the number of estimated parameters, and it is in the
extent of this penalty that the criteria differ. For more discussion about these and
other selection criteria, see Judge et al. (1985, Section 7.5).
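A minimal sketch of how these criteria might be computed and compared in practice (added here for illustration; the candidate models and DGP are hypothetical, and BIC is taken in the ln(s²) + K ln(n)/n form discussed above):

import numpy as np

def fit_and_criteria(X, Y):
    """Return (AIC, BIC, adjusted R^2) for an OLS fit of Y on X."""
    n, K = X.shape
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - X @ beta_hat
    s2 = (e @ e) / (n - K)                       # residual variance estimator
    aic = np.log(s2) + 2 * K / n
    bic = np.log(s2) + K * np.log(n) / n
    tss = np.sum((Y - Y.mean()) ** 2)
    adj_r2 = 1.0 - (e @ e / (n - K)) / (tss / (n - 1))
    return aic, bic, adj_r2

rng = np.random.default_rng(6)
n = 200
X1, X2 = rng.normal(size=(2, n))
Y = 1.0 + 0.8 * X1 + rng.normal(size=n)          # X2 is an irrelevant regressor

candidates = {
    "Y ~ 1 + X1":      np.column_stack([np.ones(n), X1]),
    "Y ~ 1 + X1 + X2": np.column_stack([np.ones(n), X1, X2]),
}
for name, X in candidates.items():
    aic, bic, adj_r2 = fit_and_criteria(X, Y)
    print(f"{name}: AIC = {aic:.4f}, BIC = {bic:.4f}, adj R^2 = {adj_r2:.4f}")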
A complicated model contains many unknown parameters. Given a fixed amount of data
information, parameter estimation becomes less precise when more parameters have to be
estimated. As a consequence, the out-of-sample forecast for Y_t may become less precise
than the forecast of a simpler model; the latter may have a larger bias but more precise
parameter estimates. Intuitively, a complicated model is too flexible in the sense that it
may capture not only systematic components but also features of the data that will
not show up again. Thus, it cannot forecast the future well.
The sampling distribution of β̂ is useful for any statistical inference involving β̂, such as
confidence interval estimation and hypothesis testing.
Theorem 3.4: Suppose Assumptions 3.1, 3.2, 3.3(a) and 3.4 hold. Then:
(i) [Unbiasedness] E(β̂|X) = β^o.
(ii) [Vanishing Variance] var(β̂|X) = σ²(X'X)⁻¹. If in addition Assumption 3.3(b) holds, then for
any K × 1 vector τ such that τ'τ = 1, we have
τ'var(β̂|X)τ → 0 as n → ∞.
(iii) [Orthogonality between e and β̂] cov(β̂, e|X) = 0.
(iv) [Gauss-Markov] β̂ is the best linear unbiased estimator (BLUE) of β^o.
(v) The residual variance estimator
s² = e'e/(n − K) = (n − K)⁻¹ Σ_{t=1}^{n} e_t²
is unbiased for σ² = E(ε_t²); that is, E(s²|X) = σ².
Proof: (i) Given β̂ − β^o = (X'X)⁻¹X'ε and E(ε|X) = 0,
E[(β̂ − β^o)|X] = E[(X'X)⁻¹X'ε|X]
              = (X'X)⁻¹X'E(ε|X)
              = (X'X)⁻¹X' · 0
              = 0.
(ii) Given β̂ − β^o = (X'X)⁻¹X'ε and E(εε'|X) = σ²I, we have
var(β̂|X) ≡ E[(β̂ − Eβ̂)(β̂ − Eβ̂)'|X]
          = E[(β̂ − β^o)(β̂ − β^o)'|X]
          = E[(X'X)⁻¹X'εε'X(X'X)⁻¹|X]
          = (X'X)⁻¹X'E(εε'|X)X(X'X)⁻¹
          = (X'X)⁻¹X'σ²IX(X'X)⁻¹
          = σ²(X'X)⁻¹X'X(X'X)⁻¹
          = σ²(X'X)⁻¹.
Note that Assumption 3.4 is crucial here to obtain the expression σ²(X'X)⁻¹ for
var(β̂|X). Moreover, for any τ ∈ R^K such that τ'τ = 1, we have
τ'var(β̂|X)τ = σ²τ'(X'X)⁻¹τ
            ≤ σ²λ_max[(X'X)⁻¹]
            = σ²/λ_min(X'X)
            → 0
given λ_min(X'X) → ∞ as n → ∞ with probability one. Note that the condition
λ_min(X'X) → ∞ ensures that var(β̂|X) vanishes to zero as n → ∞.
(iii) Given β̂ − β^o = (X'X)⁻¹X'ε, e = Y − Xβ̂ = MY = Mε (since MX = 0), and
E(e|X) = 0, we have
cov(β̂, e|X) ≡ E[(β̂ − Eβ̂)(e − Ee)'|X]
            = E[(β̂ − β^o)e'|X]
            = E[(X'X)⁻¹X'εε'M|X]
            = (X'X)⁻¹X'E(εε'|X)M
            = (X'X)⁻¹X'σ²IM
            = σ²(X'X)⁻¹X'M
            = 0,
where the last equality uses X'M = (MX)' = 0.
Again, Assumption 3.4 plays a crucial role in ensuring zero correlation between β̂
and e.
(iv) Consider a linear estimator
b̂ = C'Y,
where C = C(X) is an n × K matrix depending on X. It is unbiased for β^o regardless of
the value of β^o if and only if C'X = I, because
E(b̂|X) = C'Xβ^o + C'E(ε|X) = C'Xβ^o,
which equals β^o for all β^o if and only if C'X = I.
Because
b̂ = C'Y = C'(Xβ^o + ε) = C'Xβ^o + C'ε = β^o + C'ε,
the variance of b̂ is
var(b̂|X) = E[(b̂ − β^o)(b̂ − β^o)'|X]
         = E[C'εε'C|X]
         = C'E(εε'|X)C
         = C'σ²IC
         = σ²C'C.
It follows that
var(b̂|X) − var(β̂|X) = σ²[C'C − C'X(X'X)⁻¹X'C]    (using C'X = I)
                     = σ²C'[I − X(X'X)⁻¹X']C
                     = σ²C'MC
                     = σ²C'MMC
                     = σ²C'M'MC
                     = σ²(MC)'(MC)
                     = σ²D'D
                     = σ² Σ_{t=1}^{n} D_tD_t',
which is p.s.d.,
where D = MC and we have used the fact that for any real-valued matrix D, the square matrix D'D
is always p.s.d. [Question: how can one show this?]
(v) Using e = Mε and E(ε'Mε|X) = σ²tr(M) = σ²(n − K), we have
E(s²|X) = E(e'e|X)/(n − K)
        = σ²(n − K)/(n − K)
        = σ².
Remarks:
Theorems 3.4(i) and (ii) imply that the conditional MSE of β̂ vanishes to zero as n → ∞.
Recall that MSE measures how close an estimator β̂ is to the target parameter β^o.
Theorem 3.4(iv) implies that β̂ is the best linear unbiased estimator (BLUE) for β^o,
because var(β̂|X) is the smallest among all linear unbiased estimators of β^o.
Formally, we can define a related concept for comparing two unbiased estimators:
when β̂ is more efficient than b̂, we have, for any τ ∈ R^K such that τ'τ = 1,
τ'[var(b̂|X) − var(β̂|X)]τ ≥ 0.
In particular, var(b̂₁|X) − var(β̂₁|X) ≥ 0.
We note that the OLS estimator β̂ is still BLUE even when there exists near-
multicollinearity, where λ_min(X'X) does not grow with the sample size n and var(β̂|X)
does not vanish to zero as n → ∞. Near-multicollinearity is essentially a sample or data
problem, which we cannot remedy or improve upon when the objective is to estimate the
unknown parameter β^o.
Assumption 3.5: ε|X ~ N(0, σ²I).
Remarks:
Assumption 3.5 implies both Assumption 3.2 (E(ε|X) = 0) and Assumption 3.4 (E(εε'|X) =
σ²I). Moreover, under Assumption 3.5, the conditional pdf of ε given X is
f(ε|X) = (2πσ²)^{−n/2} exp[−ε'ε/(2σ²)] = f(ε),
which does not depend on X, so the disturbance ε is independent of X. Thus, every
conditional moment of ε given X does not depend on X.
The normal distribution is also called the Gaussian distribution, named after the
German mathematician and astronomer Carl F. Gauss. It is assumed here so that we
can derive the finite sample distributions of β̂ and related statistics, i.e., the distributions
of β̂ and related statistics when the sample size n is a finite integer. This assumption
may be reasonable for observations that are computed as averages of the outcomes
of many repeated experiments, due to the effect of the so-called central limit theorem
(CLT). This may occur in physics, for example. In economics, the normality assumption
is not always reasonable. For example, many high-frequency financial time series
display heavy tails (with kurtosis larger than 3).
We write
β̂ − β^o = (X'X)⁻¹X'ε = (X'X)⁻¹ Σ_{t=1}^{n} X_tε_t = Σ_{t=1}^{n} C_tε_t,
where C_t = (X'X)⁻¹X_t. Under Assumptions 3.1, 3.3(a) and 3.5, we then have
(β̂ − β^o)|X ~ N[0, σ²(X'X)⁻¹].
Proof: Conditional on X, β̂ − β^o is a weighted sum of independent normal random
variables {ε_t}, and so it is also normally distributed.
We note that the OLS estimator β̂ still has the conditional finite sample normal distri-
bution N(β^o, σ²(X'X)⁻¹) even when there exists near-multicollinearity, where λ_min(X'X)
does not grow with the sample size n and var(β̂|X) does not vanish to zero as n → ∞.
The corollary below follows immediately.
R(β̂ − β^o)|X ~ N[0, σ²R(X'X)⁻¹R'].
Proof: Conditional on X, β̂ − β^o is normally distributed. Therefore, conditional on X,
the linear combination R(β̂ − β^o) is also normally distributed, with
E[R(β̂ − β^o)|X] = R E[(β̂ − β^o)|X] = R · 0 = 0
and
var[R(β̂ − β^o)|X] = E[R(β̂ − β^o)(R(β̂ − β^o))'|X]
                  = E[R(β̂ − β^o)(β̂ − β^o)'R'|X]
                  = R E[(β̂ − β^o)(β̂ − β^o)'|X] R'
                  = R var(β̂|X) R'
                  = σ²R(X'X)⁻¹R'.
It follows that
R(β̂ − β^o)|X ~ N(0, σ²R(X'X)⁻¹R').
Recall the residual variance estimator
s² = e'e/(n − K).
Theorem 3.7 [Residual Variance Estimator]: Suppose Assumptions 3.1, 3.3(a) and
3.5 hold. Then for all n > K: (i)
[(n − K)s²/σ²] | X = (e'e/σ²) | X ~ χ²_{n−K};
(ii) s² and β̂ are independent conditional on X.
Proof: (i) Because e = Mε, we have
e'e/σ² = ε'Mε/σ² = (ε/σ)'M(ε/σ).
Since M is symmetric and idempotent with rank tr(M) = n − K, and (ε/σ)|X ~ N(0, I_n), this
quadratic form follows a χ²_{n−K} distribution conditional on X.
Question: What is a χ²_q distribution?
If Z ~ N(0, I_q), then the quadratic form Z'Z will follow a χ²_q distribution.
(ii) It remains to show that e and β̂ are jointly normally distributed. For this purpose, we
write
[ e ; β̂ − β^o ] = [ Mε ; (X'X)⁻¹X'ε ] = [ M ; (X'X)⁻¹X' ] ε.
Because ε|X ~ N(0, σ²I), the linear transformation [ M ; (X'X)⁻¹X' ]ε is jointly normally
distributed conditional on X. Given cov(β̂, e|X) = 0 from Theorem 3.4(iii), β̂ and e are then
independent conditional on X, and so s² = e'e/(n − K) and β̂ are independent conditional on X.
Theorem 3.7(i) implies
E[(n − K)s²/σ² | X] = n − K,
that is,
[(n − K)/σ²] E(s²|X) = n − K.
It follows that E(s²|X) = σ². Note that we have already shown this result by a different
method and under a more general condition.
Theorem 3.7(i) also implies
var[(n − K)s²/σ² | X] = 2(n − K),
so that
var(s²|X) = 2σ⁴/(n − K) → 0
as n → ∞.
These results imply that the conditional MSE of s², E[(s² − σ²)²|X], vanishes to zero as n → ∞.
The independence between s² and β̂ is crucial for obtaining the sampling distri-
butions of the popular t-test and F-test statistics, which will be introduced shortly.
The sample residual variance s² = e'e/(n − K) is a generalization of the sample
variance S_n² = (n − 1)⁻¹ Σ_{t=1}^{n} (Y_t − Ȳ)² for the random sample {Y_t}_{t=1}^{n}. The factor n − K
is called the degrees of freedom of the estimated residual sample {e_t}_{t=1}^{n}. To gain intuition
for why the degrees of freedom equal n − K, note that the original sample {Z_t}_{t=1}^{n} =
{Y_t, X_t'}'_{t=1}^{n} has n observations, which can be viewed as having n degrees of freedom. When
estimating σ², we have to use the estimated residual sample {e_t}_{t=1}^{n}. These n
estimated residuals are not linearly independent because they must satisfy the FOC
of the OLS estimation, namely
X'e = 0,
(K × n)(n × 1) = K × 1.
This imposes K restrictions on {e_t}_{t=1}^{n}, leaving n − K degrees of freedom.
They are useful in confidence interval estimation and hypothesis testing on model para-
meters. In this book, we will focus on hypothesis testing on model parameters. Statisti-
cally speaking, confidence interval estimation and hypothesis testing on model parame-
ters are just two sides of the same coin.
Consider the null hypothesis H₀: Rβ^o = r,
where R is called the selection matrix, r is a J × 1 vector, and J is the number of restrictions. We assume
J ≤ K.
Motivation
We first provide a few motivating examples for hypothesis testing.
Example 1: Consider the extended production function regression
ln Y_t = β₀ + β₁ ln L_t + β₂ ln K_t + β₃AU_t + β₄PS_t + ε_t,
where AU_t is a dummy variable indicating whether firm t is granted autonomy, and PS_t
is the profit share of firm t with the state.
Suppose we are interested in testing whether autonomy AU_t has an effect on produc-
tivity. Then we can write the null hypothesis as
H₀^a: β₃^o = 0.
This is equivalent to the choices of
β^o = (β₀^o, β₁^o, β₂^o, β₃^o, β₄^o)',
R = (0, 0, 0, 1, 0),
r = 0.
Alternatively, to test whether the production technology exhibits constant returns
to scale (CRS), we can write the null hypothesis as
H₀^c: β₁^o + β₂^o = 1.
Example 2 [Optimal Predictor for Future Spot Exchange Rate]: Consider the regression
S_{t+τ} = β₀ + β₁F_t(τ) + ε_{t+τ},
where S_{t+τ} is the spot exchange rate at period t + τ, and F_t(τ) is the forward exchange
rate, namely period t's price of the foreign currency to be delivered at period t + τ.
The null hypothesis of interest is that the forward exchange rate F_t(τ) is an optimal
predictor for the future spot rate S_{t+τ} in the sense that E(S_{t+τ}|I_t) = F_t(τ), where I_t is
the information set available at time t. This is called the expectations hypothesis
in economics and finance. Given the above specification, this hypothesis can be written
as
H₀^e: β₀^o = 0, β₁^o = 1,
and E(ε_{t+τ}|I_t) = 0. This is equivalent to the choice of
R = [ 1 0 ; 0 1 ],  r = (0, 1)',
where r is a J × 1 vector.
Under H₀: Rβ^o = r, we have
Rβ̂ − r = Rβ̂ − Rβ^o = R(β̂ − β^o) → 0 as n → ∞,
because β̂ − β^o → 0 as n → ∞ in terms of MSE.
Under the alternative to H₀, Rβ^o ≠ r, but we still have β̂ − β^o → 0 in terms of MSE.
It follows that
Rβ̂ − r = R(β̂ − β^o) + Rβ^o − r → Rβ^o − r ≠ 0.
The fact that the behavior of Rβ̂ − r differs under H₀ and under the alternative
hypothesis to H₀ provides a basis for constructing hypothesis tests. In particular, we can
test H₀ by examining whether Rβ̂ − r is significantly different from zero.
Question: How large should the magnitude of the difference Rβ̂ − r
be in order to claim that Rβ̂ − r is significantly different from zero?
For this purpose, we need a decision rule that specifies a threshold value with which
we can compare the (absolute) value of Rβ̂ − r. Rβ̂ − r is a random variable,
and so it can take many (possibly infinitely many) values. Given a data set, we
only obtain one realization of Rβ̂ − r. Whether a realization of Rβ̂ − r is close to zero
should be judged using the critical value of its sampling distribution, which depends on
the sample size n and the significance level α ∈ (0, 1) one preselects.
Because
R(β̂ − β^o)|X ~ N(0, σ²R(X'X)⁻¹R'),
we have that, conditional on X,
Rβ̂ − r = R(β̂ − β^o) + Rβ^o − r ~ N(Rβ^o − r, σ²R(X'X)⁻¹R').
Corollary 3.9: Under Assumptions 3.1, 3.3 and 3.5, and H₀: Rβ^o = r, we have for
each n > K,
(Rβ̂ − r)|X ~ N(0, σ²R(X'X)⁻¹R').
The difference Rβ̂ − r cannot be used directly as a test statistic for H₀, because σ² is unknown
and so there is no way to calculate the critical values of the sampling distribution of Rβ̂ − r.
Question: How can we construct a feasible (i.e., computable) test statistic?
The forms of test statistics will differ depending on whether J = 1 or J > 1.
We first consider the case of J = 1.
Since
(Rβ̂ − r)|X ~ N(0, σ²R(X'X)⁻¹R'),
with
var[(Rβ̂ − r)|X] = σ²R(X'X)⁻¹R',
the standardized ratio satisfies
(Rβ̂ − r) / √(var[(Rβ̂ − r)|X]) = (Rβ̂ − r) / √(σ²R(X'X)⁻¹R') ~ N(0, 1).
However, this ratio is infeasible because σ² is unknown. Replacing σ² with its unbiased
estimator s² gives the feasible test statistic
T = (Rβ̂ − r) / √(s²R(X'X)⁻¹R').
However, the test statistic T will no longer be normally distributed. Instead,
T = (Rβ̂ − r) / √(s²R(X'X)⁻¹R')
  = [ (Rβ̂ − r) / √(σ²R(X'X)⁻¹R') ] / √{ [(n − K)s²/σ²] / (n − K) }
  ~ N(0, 1) / √( χ²_{n−K}/(n − K) )
  ~ t_{n−K},
where the numerator and denominator are independent conditional on X.
Definition 3.6 [Student's t-distribution]: Suppose Z ~ N(0, 1) and V ~ χ²_q, and
Z and V are independent. Then the ratio
Z / √(V/q) ~ t_q.
The t_q-distribution is symmetric about 0 with heavier tails than the N(0, 1) distri-
bution. The smaller the number of degrees of freedom, the heavier its tails. When
q → ∞, t_q →^d N(0, 1), where →^d denotes convergence in distribution. This implies that
T = (Rβ̂ − r) / √(s²R(X'X)⁻¹R') →^d N(0, 1) as n → ∞.
This result has a very important implication in practice: for a large sample size n, it
makes no difference whether one uses the critical values from t_{n−K} or from N(0, 1).
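A compact sketch of this t-test for a single restriction Rβ = r (added as an illustration; the simulated design and the tested restriction are hypothetical):

import numpy as np
from scipy import stats

# t-test of a single linear restriction R*beta = r in the classical linear model.
rng = np.random.default_rng(7)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # (1, X1, X2)
beta_o = np.array([1.0, 0.5, 0.0])
Y = X @ beta_o + rng.normal(size=n)

K = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
e = Y - X @ beta_hat
s2 = (e @ e) / (n - K)

R = np.array([0.0, 0.0, 1.0])   # test H0: beta_2 = 0
r = 0.0
T = (R @ beta_hat - r) / np.sqrt(s2 * R @ XtX_inv @ R)
p_value = 2 * stats.t.sf(abs(T), df=n - K)       # two-sided p-value from t_{n-K}
print(f"T = {T:.3f}, p-value = {p_value:.3f}")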
Definition [Convergence in Distribution]: A sequence of random variables {Z_n}
converges to Z in distribution if the distribution function F_n(z) of Z_n converges to the distribution
function F(z) of Z at all continuity points, namely
F_n(z) → F(z) as n → ∞
for any continuity point z (i.e., any point at which F(z) is continuous). We use the
notation Z_n →^d Z. The distribution of Z is called the asymptotic or limiting distribution
of Z_n.
In practice, Z_n is a test statistic or a parameter estimator, and often its sampling distri-
bution F_n(z) is either unknown or very complicated, whereas F(z) is known or very simple.
As long as Z_n →^d Z, we can use F(z) as an approximation to F_n(z). This gives
a convenient procedure for statistical inference. The potential cost is that the approx-
imation of F(z) to F_n(z) may not be good enough in finite samples (i.e., when n is
finite). How good the approximation is will depend on the data generating process and
the sample size n.
With the sampling distribution of the test statistic T, we can now describe
a decision rule for testing H₀ when J = 1:
(i) Reject H₀ at the significance level α if
|T| > C_{t_{n−K}, α/2},
where C_{t_{n−K}, α/2} is the upper-tailed critical value of the t_{n−K} distribution at level
α/2, which is determined by
P[ t_{n−K} > C_{t_{n−K}, α/2} ] = α/2,
or equivalently
P[ |t_{n−K}| > C_{t_{n−K}, α/2} ] = α.
(ii) Do not reject H₀ at the significance level α if
|T| ≤ C_{t_{n−K}, α/2}.
Remarks:
In testing H₀, there exist two types of errors, due to the limited information about
the population in a given random sample {Z_t}_{t=1}^{n}. One possibility is that H₀ is true but
we reject it. This is called a "Type I error". The significance level α is the probability
of making a Type I error:
P[ |T| > C_{t_{n−K}, α/2} | H₀ ] = α.
The other possibility is that one fails to reject H₀ when it is false. This is called
a "Type II error".
Ideally one would like to minimize both the Type I error and the Type II error, but
this is impossible for any given finite sample. In practice, one usually presets the level
of Type I error, the so-called significance level, and then minimizes the Type II error.
Conventional choices for the significance level α are 10%, 5% and 1%.
Next, we describe an alternative decision rule for testing H0 when J = 1; using the
so-called p-value of test statistic T:
If the p-value is very small, it is unlikely that the observed test statistic T(Z^n) was generated from a
Student's t_{n−K} distribution. As a consequence, the null hypothesis is likely to be false.
The above decision rule can be described equivalently as follows:
Remarks:
A small p-value is evidence against the null hypothesis. A large p-value shows that
the data are consistent with the null hypothesis.
Question: What are the advantages/disadvantages of using p-values versus using
critical values?
p-values are more informative than simply rejecting or accepting the null hypothesis at
some significance level α. A p-value is the smallest significance level at which a null
hypothesis can be rejected. It not only tells us whether the null hypothesis should be
accepted or rejected, but also whether the decision to accept or reject the null
hypothesis is a close call.
Most statistical software reports p-values of parameter estimates. This is much more
convenient than asking the user to specify a significance level α and then reporting whether
the null hypothesis is accepted or rejected at that α.
The t-test and associated procedures just introduced remain valid even when there ex-
ists near-multicollinearity, where λ_min(X'X) does not grow with the sample size n and
var(β̂|X) does not vanish to zero as n → ∞. However, the degree of near-multicollinearity,
as measured by sample correlations between explanatory variables, will affect the
precision of the OLS estimator β̂. Other things being equal, the higher the degree of near-
multicollinearity, the larger the variance of β̂. As a result, the t-statistic is often insignif-
icant even when the null hypothesis H₀ is false.
Examples of t-tests
Example 4 [Reforms have no effects (continued)]:
We first consider testing the null hypothesis
H₀^a: β₃^o = 0,
where β₃ is the coefficient of the autonomy dummy AU_t in the extended production function
regression model. This is equivalent to the selection R = (0, 0, 0, 1, 0). In this case,
we have
s²R(X'X)⁻¹R' = s²[(X'X)⁻¹]_{(4,4)} ≡ S²_{β̂₃},
which is the estimator of var(β̂₃|X). The square root of var(β̂₃|X) is called the standard
error of the estimator β̂₃, and S_{β̂₃} is called the estimated standard error of β̂₃. The t-test
statistic is
T = (Rβ̂ − r) / √(s²R(X'X)⁻¹R') = β̂₃ / S_{β̂₃} ~ t_{n−K}.
Next, consider testing the CRS hypothesis H₀^c: β₁^o + β₂^o = 1, which corresponds to
R = (0, 1, 1, 0, 0) and r = 1. In this case,
s²R(X'X)⁻¹R' = s²[(X'X)⁻¹]_{(2,2)} + s²[(X'X)⁻¹]_{(3,3)} + 2s²[(X'X)⁻¹]_{(2,3)}
             ≡ S²_{β̂₁+β̂₂},
which is the estimator of var(β̂₁ + β̂₂|X). Here, the term s²[(X'X)⁻¹]_{(2,3)} is the estimator of
cov(β̂₁, β̂₂|X), the covariance between β̂₁ and β̂₂ conditional on X.
The t-test statistic is
T = (Rβ̂ − r) / √(s²R(X'X)⁻¹R') = (β̂₁ + β̂₂ − 1) / S_{β̂₁+β̂₂} ~ t_{n−K}.
Case II: F-testing (J > 1)
Lemma: Suppose V is a symmetric, positive definite J × J matrix and the J × 1 random vector Y ~ N(0, V). Then
Y'V⁻¹Y ~ χ²_J.
Proof: Because V is symmetric and positive definite, we can find a symmetric and
invertible matrix V^{1/2} such that
V^{1/2}V^{1/2} = V,
V^{−1/2}V^{−1/2} = V⁻¹.
Then Z = V^{−1/2}Y is normally distributed with E(Z) = 0 and
var(Z) = V^{−1/2} var(Y) V^{−1/2}
       = V^{−1/2} V V^{−1/2}
       = V^{−1/2} V^{1/2} V^{1/2} V^{−1/2}
       = I.
Hence Z ~ N(0, I_J) and
Y'V⁻¹Y = Z'Z ~ χ²_J.
Applying this lemma with
(Rβ̂ − r)|X ~ N[0, σ²R(X'X)⁻¹R']
under H₀, we obtain
(Rβ̂ − r)'[σ²R(X'X)⁻¹R']⁻¹(Rβ̂ − r) ~ χ²_J
conditional on X, or
(Rβ̂ − r)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − r)/σ² ~ χ²_J
conditional on X.
Because the χ²_J distribution does not depend on X, we also have
(Rβ̂ − r)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − r)/σ² ~ χ²_J
unconditionally.
Replacing σ² by s² and dividing by J, we obtain the F-statistic
F = (Rβ̂ − r)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − r) / (J s²)
  = { (Rβ̂ − r)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − r)/(σ²J) } / { [(n − K)s²/σ²]/(n − K) }
  ~ F_{J, n−K},
because the numerator χ²_J/J and the denominator χ²_{n−K}/(n − K) are independent.
(ii) t²_q ~ F_{1,q}. This follows because
t²_q = (χ²₁/1) / (χ²_q/q) ~ F_{1,q}.
(iii) For any fixed integer p, p·F_{p,q} →^d χ²_p as q → ∞.
Property (ii) implies that when J = 1, using either the t-test or the F-test delivers
the same conclusion. Property (iii) implies that the conclusions based on F_{p,q} and on
p·F_{p,q} using the χ²_p approximation will be approximately the same when q is sufficiently
large.
Theorem 3.11: Suppose Assumptions 3.1, 3.3(a) and 3.5 hold. Let SSR_u = e'e be the sum of
squared residuals from the unrestricted model
Y = Xβ^o + ε,
and let SSR_r = ẽ'ẽ be the sum of squared residuals from the restricted model
Y = Xβ^o + ε subject to Rβ^o = r,
where β̃ is the restricted OLS estimator. Then under H₀: Rβ^o = r, the F-test statistic can be
written as
F = [(ẽ'ẽ − e'e)/J] / [e'e/(n − K)] ~ F_{J, n−K}.
Proof: Let β̃ be the restricted OLS estimator under H₀, that is,
β̃ = arg min_{β∈R^K} (Y − Xβ)'(Y − Xβ) subject to Rβ = r.
The Lagrangian is
L(β, λ) = (Y − Xβ)'(Y − Xβ) + 2λ'(Rβ − r),
where λ is a J × 1 vector of Lagrange multipliers. The FOCs are
∂L(β̃, λ̃)/∂β = −2X'(Y − Xβ̃) + 2R'λ̃ = 0,
∂L(β̃, λ̃)/∂λ = 2(Rβ̃ − r) = 0.
With the unconstrained OLS estimator β̂ = (X'X)⁻¹X'Y, the first equation of the FOC gives
β̂ − β̃ = (X'X)⁻¹R'λ̃,
so that
R(X'X)⁻¹R'λ̃ = R(β̂ − β̃),
λ̃ = [R(X'X)⁻¹R']⁻¹R(β̂ − β̃) = [R(X'X)⁻¹R']⁻¹(Rβ̂ − r),
using Rβ̃ = r. Now,
ẽ = Y − Xβ̃
  = Y − Xβ̂ + X(β̂ − β̃)
  = e + X(β̂ − β̃).
It follows that
ẽ'ẽ = e'e + (β̂ − β̃)'X'X(β̂ − β̃)
    = e'e + (Rβ̂ − r)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − r),
where the cross term vanishes because X'e = 0. We therefore have
(Rβ̂ − r)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − r) = ẽ'ẽ − e'e,
and the stated form of the F-statistic follows by substituting this into the expression derived earlier.
Remarks:
The F-statistic is a convenient test statistic: one only needs the sums of squared residuals from
the restricted and unrestricted regressions to compute it. Intuitively, the sum of squared residuals
SSR_r of the restricted regression model is always at least as large as that of the unrestricted
regression model, SSR_u. When the null hypothesis H₀ is true (i.e., when the parameter restriction
is valid), SSR_r is more or less similar to SSR_u, with the difference due only to sampling variation.
If SSR_r is sufficiently larger than SSR_u, then there is evidence against H₀. How large a difference
between SSR_r and SSR_u is considered sufficiently large to reject H₀ is determined by the critical
value of the associated F distribution.
The Lagrange multiplier
λ̃ = [R(X'X)⁻¹R']⁻¹R(β̂ − β̃) = [R(X'X)⁻¹R']⁻¹(Rβ̂ − r)
is an indicator of the departure of Rβ̂ from r. That is, the value of λ̃ indicates
whether Rβ̂ − r is significantly different from zero.
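The following sketch (an added illustration with hypothetical data) computes the F-statistic from the restricted and unrestricted sums of squared residuals, here for the joint restriction β₁ = β₂ = 0:

import numpy as np
from scipy import stats

def ssr(X, Y):
    """Sum of squared residuals from an OLS fit of Y on X."""
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - X @ beta_hat
    return e @ e

rng = np.random.default_rng(8)
n = 120
X1, X2 = rng.normal(size=(2, n))
Y = 1.0 + 0.4 * X1 + 0.0 * X2 + rng.normal(size=n)

X_u = np.column_stack([np.ones(n), X1, X2])   # unrestricted: intercept, X1, X2
X_r = np.ones((n, 1))                          # restricted under H0: beta_1 = beta_2 = 0
J, K = 2, X_u.shape[1]

F = ((ssr(X_r, Y) - ssr(X_u, Y)) / J) / (ssr(X_u, Y) / (n - K))
p_value = stats.f.sf(F, J, n - K)
print(f"F = {F:.3f}, p-value = {p_value:.4f}")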
Recall the important property of the F_{p,q} distribution that p·F_{p,q} →^d χ²_p as q → ∞.
Since our F-statistic for H₀ follows an F_{J,n−K} distribution, it follows that under H₀, the
quadratic form
J·F = (Rβ̂ − r)'[s²R(X'X)⁻¹R']⁻¹(Rβ̂ − r) →^d χ²_J.
Theorem 3.13: Suppose Assumptions 3.1, 3.3(a) and 3.5 hold. Then under H₀, the Wald test
statistic
W = (Rβ̂ − r)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − r)/s² →^d χ²_J
as n → ∞.
This result implies that when n is sufficiently large, using the F-statistic with the exact
F_{J,n−K} distribution and using the quadratic form W with the simpler χ²_J approximation
makes no essential difference for statistical inference.
3.8 Applications
We now consider some special but important cases often encountered in economics
and finance.
Case I: Testing for the Joint Significance of Explanatory Variables
Consider the linear regression model
Y_t = X_t'β^o + ε_t = β₀^o + Σ_{j=1}^{k} β_j^o X_{jt} + ε_t.
We are interested in testing the combined effect of all the regressors except the intercept.
The null hypothesis is
H₀: β_j^o = 0 for all 1 ≤ j ≤ k,
which implies that none of the explanatory variables influences Y_t.
The alternative hypothesis is
H_A: β_j^o ≠ 0 for at least some j, 1 ≤ j ≤ k.
In fact, the restricted model under H₀ is very simple:
Y_t = β₀^o + ε_t.
The restricted OLS estimator is β̃₀ = Ȳ, so
ẽ = Y − Xβ̃ = Y − Ȳ.
Hence, we have
ẽ'ẽ = (Y − Ȳ)'(Y − Ȳ).
Recall the definition of R²:
R² = 1 − e'e / [(Y − Ȳ)'(Y − Ȳ)] = 1 − e'e/ẽ'ẽ.
It follows that
F = [(ẽ'ẽ − e'e)/k] / [e'e/(n − k − 1)]
  = [(1 − e'e/ẽ'ẽ)/k] / [(e'e/ẽ'ẽ)/(n − k − 1)]
  = (R²/k) / [(1 − R²)/(n − k − 1)].
Thus, it suffices to run just one regression, namely the unrestricted model. We
emphasize that this formula is valid only when one is testing H₀: β_j^o = 0 for all
1 ≤ j ≤ k.
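For instance (an added numerical check using scipy, with the same numbers as the consumption example later in this section): with R² = 0.742, n = 25 and k = 2, the formula gives F ≈ 31.6, far above the 5% critical value of F_{2,22}.

from scipy import stats

R2, n, k = 0.742, 25, 2
F = (R2 / k) / ((1 - R2) / (n - k - 1))
crit = stats.f.ppf(0.95, k, n - k - 1)   # 5% critical value of F_{2,22}
print(f"F = {F:.3f}, 5% critical value = {crit:.2f}")   # ~31.6 vs ~3.44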
Example 1 [Efficient Market Hypothesis]: Suppose Y_t is the exchange rate return
in period t, and I_{t−1} is the information available at time t − 1. Then a classical version
of the efficient market hypothesis (EMH) states that the return Y_t is not predictable in mean
using the past information I_{t−1}.
To check whether exchange rate changes are unpredictable using the past history of
exchange rate changes, we specify a linear regression model
Y_t = X_t'β^o + ε_t,
where
X_t = (1, Y_{t−1}, ..., Y_{t−k})'.
Under the EMH, we have
H₀: β_j^o = 0 for all j = 1, ..., k.
If the alternative
H_A: β_j^o ≠ 0 for at least some j ∈ {1, ..., k}
holds, then exchange rate changes are predictable using past information.
Note that there exists a gap between the efficiency hypothesis and H₀, because the
linear regression model is just one of many ways to check the EMH. Thus, if H₀ is not rejected,
at most we can say that no evidence against the efficiency hypothesis is found; we
should not conclude that the EMH holds.
Strictly speaking, the current theory (Assumption 3.2: E(ε_t|X) = 0) rules out this
application, which is a dynamic time series regression model. However, we will justify
in Chapter 5 that
k·F = R² / [(1 − R²)/(n − k − 1)] →^d χ²_k
as n → ∞. This follows from the Slutsky theorem, because R² →^p 0 under H₀. Although Assumption
3.5 is not needed for this result, conditional homoskedasticity is still needed, which rules
out autoregressive conditional heteroskedasticity (ARCH) in the time series context.
Suppose we are interested in whether labor income or liquidity asset wealth has an
impact on consumption. With R² = 0.742 from a regression of consumption on income and
wealth based on n = 25 observations, the F-test statistic is
F = (R²/2) / [(1 − R²)/(n − 3)]
  = (0.742/2) / [(1 − 0.742)/(25 − 3)]
  = 31.636 ~ F_{2,22}.
Comparing this with the critical value of F_{2,22} at the 5% significance level, we reject the
null hypothesis that neither income nor liquidity asset wealth has an impact on consumption at
the 5% significance level.
Case II: Testing for Omitted Variables (or Testing for No Effect)

Suppose $X = (X^{(1)}, X^{(2)})$, where $X^{(1)}$ is an $n \times (k_1+1)$ matrix and $X^{(2)}$ is an $n \times k_2$ matrix.
A random vector $X_t^{(2)}$ has no explanatory power for the conditional expectation of $Y_t$ if
$$E(Y_t|X_t) = E(Y_t|X_t^{(1)}).$$
Alternatively, it has explanatory power for the conditional expectation of $Y_t$ if
$$E(Y_t|X_t) \ne E(Y_t|X_t^{(1)}).$$
When $X_t^{(2)}$ has explanatory power for $Y_t$ but is not included in the regression, we say that $X_t^{(2)}$ is an omitted random variable or vector.

Question: How to test whether $X_t^{(2)}$ is an omitted variable in the linear regression context?

Suppose we have additional $k_2$ variables $(X_{(k_1+1)t}, \ldots, X_{(k_1+k_2)t})$, and so we consider the unrestricted regression model
$$Y_t = \beta_0 + \sum_{j=1}^{k_1}\beta_j X_{jt} + \sum_{j=k_1+1}^{k_1+k_2}\beta_j X_{jt} + \varepsilon_t.$$
The null hypothesis is that the additional variables have no effect on $Y_t$. If this is the case, then
$$H_0: \beta_{k_1+1} = \beta_{k_1+2} = \cdots = \beta_{k_1+k_2} = 0.$$
The alternative is that at least one of the additional variables has an effect on $Y_t$.
The F-test statistic is
$$F = \frac{(\tilde e'\tilde e - e'e)/k_2}{e'e/(n-k_1-k_2-1)} \sim F_{k_2,\ n-(k_1+k_2+1)}.$$
Question: Suppose we reject the null hypothesis. Then some important explanatory variables are omitted, and they should be included in the regression. On the other hand, if the F-test statistic does not reject the null hypothesis $H_0$, can we say that there is no omitted variable?
No. There may exist a nonlinear relationship in the additional variables which a linear regression specification cannot capture.
For example, consider the extended production function
$$Y_t = \beta_0 + \beta_1\ln(L_t) + \beta_2\ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t,$$
where $AU_t$ is the autonomy dummy, $PS_t$ is the profit sharing ratio, and $CM_t$ is the dummy for change of manager. The null hypothesis of interest here is that none of the three reforms has an impact:
$$H_0: \beta_3 = \beta_4 = \beta_5 = 0.$$
We can use the F-test, and $F \sim F_{3,\,n-6}$ under $H_0$.
Suppose rejection occurs. Then there exists evidence against $H_0$. However, if no rejection occurs, then we can only say that we find no evidence against $H_0$ (which is not the same as the statement that the reforms have no effect). It is possible that the effect of $X_t^{(2)}$ is of nonlinear form. In this case, we may obtain a zero coefficient for $X_t^{(2)}$, because the linear specification may not be able to capture it.
In time series analysis, Granger causality is defined in terms of incremental predictability rather than a real cause-effect relationship. From an econometric point of view, it is a test of omitted variables in a time series context. It was first introduced by Granger (1969). Consider the regression
$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \cdots + \beta_p Y_{t-p} + \beta_{p+1}X_{t-1} + \cdots + \beta_{p+q}X_{t-q} + \varepsilon_t.$$
The null hypothesis that $\{X_t\}$ does not Granger-cause $\{Y_t\}$ is
$$H_0: \beta_{p+1} = \cdots = \beta_{p+q} = 0.$$
The current econometric theory (Assumption 3.2: $E(\varepsilon_t|X) = 0$) actually rules out this application, because it is a dynamic regression model. However, we will justify in Chapter 5 that under $H_0$,
$$q \cdot F \stackrel{d}{\to} \chi^2_q$$
as $n \to \infty$ under conditional homoskedasticity, even for a linear dynamic regression model.
Example 5 [Testing for Structural Change (or Testing for Regime Shift)]:
Consider a bivariate regression model
$$Y_t = \beta_0 + \beta_1 X_{1t} + \varepsilon_t,$$
where $t$ is a time index, and $\{X_t\}$ and $\{\varepsilon_t\}$ are mutually independent. Suppose there exist changes after $t = t_0$, i.e., there exist structural changes. Let $D_t$ be a time dummy with $D_t = 1$ if $t > t_0$ and $D_t = 0$ otherwise. We can consider the extended regression model:
$$Y_t = (\beta_0 + \alpha_0 D_t) + (\beta_1 + \alpha_1 D_t)X_{1t} + \varepsilon_t = \beta_0 + \beta_1 X_{1t} + \alpha_0 D_t + \alpha_1 (D_t X_{1t}) + \varepsilon_t.$$
The null hypothesis of no structural change is
$$H_0: \alpha_0 = \alpha_1 = 0,$$
and the alternative hypothesis is
$$H_A: \alpha_0 \ne 0 \text{ or } \alpha_1 \ne 0.$$
Some economic hypotheses take the form of linear restrictions across coefficients, for example
$$H_0: \beta_1 + \beta_2 = 1,$$
or, equivalently, a restriction of the form $R\beta^o = r$. As another illustration, consider the wage equation
$$W_t = \beta_0 + \beta_1 P_t + \beta_2 P_{t-1} + \beta_3 U_t + \beta_4 V_t + \beta_5 W_{t-1} + \varepsilon_t,$$
where $W_t$ denotes the wage, $P_t$ the price level, $U_t$ unemployment, and $V_t$ unfilled vacancies. We will test the null hypothesis
$$H_0: \beta_1 + \beta_2 = 0,\ \beta_3 + \beta_4 = 0, \text{ and } \beta_5 = 1.$$
$$W_t = \beta_0 + \beta_1 P_t + \beta_4 D_t + \varepsilon_t.$$
Note that in the first two separate regressions, we can find significant t-test statistics for income and wealth, but in the third joint regression, both income and wealth are insignificant. This may be due to the fact that income and wealth are highly multicollinear. To test whether neither income nor wealth has an impact on consumption, we can use the F-test:
$$F = \frac{R^2/2}{(1-R^2)/(n-3)} = \frac{0.742/2}{(1-0.742)/(25-3)} = 31.636 \sim F_{2,22}.$$
This F-test shows that the null hypothesis is firmly rejected at the 5% significance level, because the critical value of $F_{2,22}$ at the 5% level is 3.44.
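For completeness, here is a hedged sketch of how the comparison could be carried out with standard software; the numbers are those of the example above, and the use of `scipy` is only one possible choice of tool.

```python
# Compare the F-statistic 31.636 from the example with the F(2,22) distribution.
from scipy.stats import f

F_stat, df1, df2 = 31.636, 2, 22
crit = f.ppf(0.95, df1, df2)          # 5% critical value, approximately 3.44
p_value = f.sf(F_stat, df1, df2)      # upper-tail p-value
print(crit, p_value, F_stat > crit)   # rejection at the 5% level
```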
Question: Under what conditions are the existing procedures and results still approximately true?

Assumption 3.5 is unrealistic for many economic and financial data. Suppose Assumption 3.5 is replaced by the following condition:

Assumption 3.6: $\varepsilon|X \sim N(0, \sigma^2 V)$, where $0 < \sigma^2 < \infty$ is unknown and $V = V(X)$ is a known $n \times n$ symmetric, finite and positive definite matrix.

Remarks:
Assumption 3.6 implies that the disturbances may display conditional heteroskedasticity and, when $t$ is a time index, serial correlation of known form. If $t$ is an index for cross-sectional units, this implies that there exists spatial correlation of known form.
However, the assumption that $V$ is known is still very restrictive from a practical point of view. In practice, $V$ usually has an unknown form.
Theorem 3.14: Suppose Assumptions 3.1, 3.3(a) and 3.6 hold. Then
(i) Unbiasedness: $E(\hat\beta|X) = \beta^o$.
(ii) Variance:
$$\mathrm{var}(\hat\beta|X) = \sigma^2(X'X)^{-1}X'VX(X'X)^{-1} \ne \sigma^2(X'X)^{-1}.$$
(iii) Conditional normality:
$$(\hat\beta - \beta^o)|X \sim N\!\left(0,\ \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}\right).$$
(iv) $\mathrm{cov}(\hat\beta, e|X) \ne 0$ in general.

Proof: (i)
$$E[(\hat\beta - \beta^o)|X] = (X'X)^{-1}X'E(\varepsilon|X) = (X'X)^{-1}X'\cdot 0 = 0.$$
(ii)
$$\mathrm{var}(\hat\beta|X) = (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}.$$
(iii) Because
$$\hat\beta - \beta^o = (X'X)^{-1}X'\varepsilon = \sum_{t=1}^{n}C_t\varepsilon_t,$$
where the weighting vector $C_t = (X'X)^{-1}X_t$, $\hat\beta - \beta^o$ follows a normal distribution given $X$, because it is a weighted sum of jointly normal random variables. As a result,
$$(\hat\beta - \beta^o)|X \sim N\!\left(0,\ \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}\right).$$
(iv) Writing $e = (I-P)\varepsilon$ with $P = X(X'X)^{-1}X'$, we have
$$\mathrm{cov}(\hat\beta, e|X) = E\!\left[(X'X)^{-1}X'\varepsilon\varepsilon'(I-P)\,\big|\,X\right] = \sigma^2(X'X)^{-1}X'V(I-P),$$
which is generally nonzero unless $V$ is proportional to $I_n$.
Remarks:
The OLS estimator $\hat\beta$ is still unbiased, and one can show that its variance goes to zero as $n \to \infty$ (see Question 6, Problem Set 03). Thus, it converges to $\beta^o$ in the sense of MSE.
However, the variance of the OLS estimator $\hat\beta$ no longer has the simple expression $\sigma^2(X'X)^{-1}$ under Assumption 3.6. As a consequence, the classical t- and F-test statistics are invalid, because they are based on an incorrect variance-covariance matrix of $\hat\beta$: they use the incorrect expression $\sigma^2(X'X)^{-1}$ rather than the correct variance formula $\sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$.
Part (iv) of Theorem 3.14 implies that even if we can obtain a consistent estimator for $\sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$ and use it to construct tests, we can no longer obtain the Student t-distribution and F-distribution, because the numerator and the denominator defining the t- and F-test statistics are no longer independent.
Lemma 3.15: For any symmetric positive definite matrix $V$, we can always write
$$V^{-1} = C'C, \qquad V = C^{-1}(C')^{-1},$$
where $C$ is an $n \times n$ nonsingular matrix.

Question: What is this decomposition called? Note that $C$ may not be symmetric.
Consider the original linear regression model:
$$Y = X\beta^o + \varepsilon.$$
Premultiplying both sides by $C$ gives the transformed model
$$Y^* = X^*\beta^o + \varepsilon^*,$$
where $Y^* = CY$, $X^* = CX$ and $\varepsilon^* = C\varepsilon$. Then the OLS estimator of this transformed model is
$$\hat\beta^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'C'CX)^{-1}(X'C'CY) = (X'V^{-1}X)^{-1}X'V^{-1}Y,$$
which is called the generalized least squares (GLS) estimator. Observe that
$$\mathrm{var}(\varepsilon^*|X) = C\,\mathrm{var}(\varepsilon|X)\,C' = \sigma^2 CVC' = \sigma^2 I_n.$$
The transformation makes the new error $\varepsilon^*$ conditionally homoskedastic and serially uncorrelated, while maintaining normality. Suppose that for some $t$, $\varepsilon_t$ has a large variance $\sigma_t^2$. The transformation $\varepsilon^* = C\varepsilon$ will discount $\varepsilon_t$ by dividing it by its conditional standard deviation so that $\varepsilon_t^*$ becomes conditionally homoskedastic. In addition, the transformation also removes possible correlation between $\varepsilon_t$ and $\varepsilon_s$, $t \ne s$. As a consequence, GLS becomes the best linear LS estimator for $\beta^o$ in terms of the Gauss-Markov theorem.
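A minimal sketch of this idea, assuming $V$ is known as in Assumption 3.6: one valid choice of $C$ is the transposed Cholesky factor of $V^{-1}$, and OLS on the transformed data reproduces the GLS formula. The function name `gls` is illustrative, not part of the notes.

```python
# GLS computed two equivalent ways: OLS on transformed data, and the closed form.
import numpy as np

def gls(y, X, V):
    V_inv = np.linalg.inv(V)
    C = np.linalg.cholesky(V_inv).T           # C'C = V^{-1}
    y_star, X_star = C @ y, C @ X
    # OLS on the transformed model equals (X'V^{-1}X)^{-1} X'V^{-1} y
    beta_ols_star = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)
    beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
    return beta_ols_star, beta_gls            # numerically identical
```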
For the case of pure conditional heteroskedasticity, $\sigma^2 V = \mathrm{diag}\{\sigma_1^2(X), \ldots, \sigma_n^2(X)\}$, we then have
$$C = \begin{bmatrix} \frac{1}{\sigma_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & \frac{1}{\sigma_n} \end{bmatrix},$$
where $\sigma_i^2 = \sigma_i^2(X)$, $i = 1, \ldots, n$, and
$$\varepsilon^* = C\varepsilon = \begin{bmatrix} \varepsilon_1/\sigma_1 \\ \varepsilon_2/\sigma_2 \\ \vdots \\ \varepsilon_n/\sigma_n \end{bmatrix}.$$
In this case, the GLS estimator is simply the OLS estimator applied to the scaled observations
$$Y_t^* = Y_t/\sigma_t, \qquad X_t^* = X_t/\sigma_t, \qquad \varepsilon_t^* = \varepsilon_t/\sigma_t.$$
Suppose next that the disturbance follows an AR(1) process with coefficient $\rho$, $|\rho| < 1$, so that, up to a proportionality factor,
$$V = \frac{1}{1-\rho^2}\begin{bmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{n-2} & \rho^{n-1} \\
\rho & 1 & \rho & \cdots & \rho^{n-3} & \rho^{n-2} \\
\rho^2 & \rho & 1 & \cdots & \rho^{n-4} & \rho^{n-3} \\
\vdots & & & \ddots & & \vdots \\
\rho^{n-2} & \rho^{n-3} & \rho^{n-4} & \cdots & 1 & \rho \\
\rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & \rho & 1
\end{bmatrix}.$$
Then we have
$$V^{-1} = \begin{bmatrix}
1 & -\rho & 0 & \cdots & 0 & 0 \\
-\rho & 1+\rho^2 & -\rho & \cdots & 0 & 0 \\
0 & -\rho & 1+\rho^2 & \cdots & 0 & 0 \\
\vdots & & & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & 1+\rho^2 & -\rho \\
0 & 0 & 0 & \cdots & -\rho & 1
\end{bmatrix}
\quad\text{and}\quad
C = \begin{bmatrix}
\sqrt{1-\rho^2} & 0 & 0 & \cdots & 0 & 0 \\
-\rho & 1 & 0 & \cdots & 0 & 0 \\
0 & -\rho & 1 & \cdots & 0 & 0 \\
\vdots & & & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0 \\
0 & 0 & 0 & \cdots & -\rho & 1
\end{bmatrix}.$$
It follows that
$$\varepsilon^* = C\varepsilon = \begin{bmatrix} \sqrt{1-\rho^2}\,\varepsilon_1 \\ \varepsilon_2 - \rho\varepsilon_1 \\ \vdots \\ \varepsilon_n - \rho\varepsilon_{n-1} \end{bmatrix},$$
and the transformed observations are
$$Y_1^* = \sqrt{1-\rho^2}\,Y_1, \qquad Y_t^* = Y_t - \rho Y_{t-1}, \quad t = 2, \ldots, n,$$
$$X_1^* = \sqrt{1-\rho^2}\,X_1, \qquad X_t^* = X_t - \rho X_{t-1}, \quad t = 2, \ldots, n,$$
$$\varepsilon_1^* = \sqrt{1-\rho^2}\,\varepsilon_1, \qquad \varepsilon_t^* = \varepsilon_t - \rho\varepsilon_{t-1}, \quad t = 2, \ldots, n.$$
The $\sqrt{1-\rho^2}$ transformation for $t = 1$ is called the Prais-Winsten transformation.
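The following sketch, which assumes the AR(1) coefficient $\rho$ is known, implements the quasi-differencing and the Prais-Winsten treatment of the first observation described above; in practice $\rho$ would itself have to be estimated.

```python
# Prais-Winsten / quasi-differencing transformation for a known rho.
import numpy as np

def prais_winsten_transform(y, X, rho):
    y_star = np.empty_like(y, dtype=float)
    X_star = np.empty_like(X, dtype=float)
    scale = np.sqrt(1.0 - rho**2)
    y_star[0] = scale * y[0]           # Prais-Winsten treatment of t = 1
    X_star[0] = scale * X[0]
    y_star[1:] = y[1:] - rho * y[:-1]  # quasi-differencing for t = 2, ..., n
    X_star[1:] = X[1:] - rho * X[:-1]
    return y_star, X_star              # run OLS on (y_star, X_star) to get GLS
```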
Under Assumptions 3.1, 3.3(a) and 3.6, the GLS estimator $\hat\beta^*$ has the following properties:
(i) $E(\hat\beta^*|X) = \beta^o$;
(ii) $\mathrm{var}(\hat\beta^*|X) = \sigma^2(X^{*\prime}X^*)^{-1} = \sigma^2(X'V^{-1}X)^{-1}$;
(iii) $\mathrm{cov}(\hat\beta^*, e^*|X) = 0$, where $e^* = Y^* - X^*\hat\beta^*$;
(iv) $\hat\beta^*$ is BLUE;
(v) $E(s^{*2}|X) = \sigma^2$, where $s^{*2} = e^{*\prime}e^*/(n-K)$.

Proof: Results (i)–(iii) follow because the GLS is the OLS of the transformed model.
(iv) The transformed model satisfies Assumptions 3.1, 3.3 and 3.5 of the classical regression assumptions, with $\varepsilon^*|X \sim N(0, \sigma^2 I_n)$. It follows that GLS is BLUE by the Gauss-Markov theorem. Result (v) also follows immediately. This completes the proof.

Remarks:
Because $\hat\beta^*$ is the OLS estimator of the transformed regression model with i.i.d. $N(0, \sigma^2 I)$ errors, the t-test and F-test are applicable, and these test statistics are defined as follows:
$$T^* = \frac{R\hat\beta^* - r}{\sqrt{s^{*2}R(X^{*\prime}X^*)^{-1}R'}} \sim t_{n-K},$$
$$F^* = \frac{(R\hat\beta^* - r)'[R(X^{*\prime}X^*)^{-1}R']^{-1}(R\hat\beta^* - r)/J}{s^{*2}} \sim F_{J,\,n-K}.$$
It is very important to note that we still have to estimate the proportionality factor $\sigma^2$, in spite of the fact that $V = V(X)$ is known.
When testing whether all coefficients except the intercept are jointly zero, we also have
$$(n-K)R^{*2} \stackrel{d}{\to} \chi^2_k.$$
Because GLS $\hat\beta^*$ is BLUE and OLS $\hat\beta$ differs from $\hat\beta^*$, OLS $\hat\beta$ cannot be BLUE. Note that
$$\hat\beta^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'V^{-1}X)^{-1}X'V^{-1}Y,$$
whereas
$$\hat\beta = (X'X)^{-1}X'Y.$$
In fact, the most important message of GLS is the insight it provides into the impact of conditional heteroskedasticity and serial correlation on the estimation and inference of the linear regression model. In practice, GLS is generally not feasible, because the $n \times n$ matrix $V$ is of unknown form, where $\mathrm{var}(\varepsilon|X) = \sigma^2 V$.
Two Approaches

(i) First Approach: Adaptive Feasible GLS
In some cases, with additional assumptions, we can use a nonparametric estimator $\hat V$ to replace the unknown $V$; we then obtain the adaptive feasible GLS
$$\hat\beta_a = (X'\hat V^{-1}X)^{-1}X'\hat V^{-1}Y.$$
For example, suppose
$$\sigma^2 V = \mathrm{diag}\{\sigma_1^2(X), \ldots, \sigma_n^2(X)\},$$
where $\mathrm{diag}\{\cdot\}$ denotes an $n \times n$ diagonal matrix and $\sigma^2(X_t) = E(\varepsilon_t^2|X_t)$ is unknown. The fact that $\sigma^2 V$ is a diagonal matrix can arise when $\mathrm{cov}(\varepsilon_t, \varepsilon_s|X) = 0$ for all $t \ne s$, i.e., when there is no serial correlation. Then we can use the nonparametric kernel estimator
$$\hat\sigma^2(x) = \frac{n^{-1}\sum_{t=1}^{n} e_t^2\, b^{-1}K\!\left(\frac{x - X_t}{b}\right)}{n^{-1}\sum_{t=1}^{n} b^{-1}K\!\left(\frac{x - X_t}{b}\right)} \stackrel{p}{\to} \sigma^2(x),$$
where $e_t$ is the estimated OLS residual, $K(\cdot)$ is a kernel function which is a prespecified symmetric density function (e.g., $K(u) = (2\pi)^{-1/2}\exp(-\tfrac{1}{2}u^2)$ if $x$ is a scalar), and $b = b(n)$ is a bandwidth such that $b \to 0$ and $nb \to \infty$ as $n \to \infty$. The finite sample distribution of $\hat\beta_a$ will differ from the finite sample distribution of $\hat\beta^*$, which assumes that $V$ is known. This is because the sampling errors of the estimator $\hat V$ have some impact on the estimator $\hat\beta_a$. However, under suitable conditions on $\hat V$, $\hat\beta_a$ will share the same asymptotic properties as the infeasible GLS $\hat\beta^*$ (i.e., the MSE of $\hat\beta_a$ is approximately equal to the MSE of $\hat\beta^*$). In other words, the first stage estimation of $\sigma^2(\cdot)$ has no impact on the asymptotic distribution of $\hat\beta_a$. For more discussion, see Robinson (1988) and Stinchcombe and White (1991).
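A hedged sketch of the kernel estimator $\hat\sigma^2(x)$ for a scalar regressor with a Gaussian kernel follows; the function name and the bandwidth choice are illustrative, not the notes' own implementation.

```python
# Nadaraya-Watson style estimator of sigma^2(x) = E(eps^2 | X = x) from OLS residuals.
import numpy as np

def kernel_var(x, X, e, b):
    """X: 1-D array of regressor values, e: OLS residuals, b: bandwidth."""
    u = (x - X) / b
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return np.sum(e**2 * K) / np.sum(K)             # local weighted average of e_t^2
```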
(ii) Second Approach
Continue to use the OLS estimator $\hat\beta$, obtaining the correct formula
$$\mathrm{var}(\hat\beta|X) = \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$$
as well as a consistent estimator for $\mathrm{var}(\hat\beta|X)$. The classical definitions of the t- and F-tests cannot be used, because they are based on an incorrect formula for $\mathrm{var}(\hat\beta|X)$. However, some modified tests can be obtained by using a consistent estimator for the correct formula for $\mathrm{var}(\hat\beta|X)$. The trick is to estimate $\sigma^2 X'VX$, which is a $K \times K$ unknown matrix, rather than to estimate $V$, which is an $n \times n$ unknown matrix. However, only asymptotic distributions can be used in this case.
Suppose
$$E(\varepsilon\varepsilon'|X) = \sigma^2 V = \mathrm{diag}\{\sigma_1^2(X), \ldots, \sigma_n^2(X)\}.$$
As pointed out earlier, this essentially assumes $E(\varepsilon_t\varepsilon_s|X) = 0$ for all $t \ne s$; that is, there is no serial correlation in $\{\varepsilon_t\}$ conditional on $X$. Instead of estimating $\sigma_t^2(X)$, one can estimate the $K \times K$ matrix $\sigma^2 X'VX$ directly.
Then, how to estimate
$$\sigma^2 X'VX = \sum_{t=1}^{n}X_tX_t'\sigma_t^2(X)?$$
The answer is to use
$$X'D(e)D(e)'X = \sum_{t=1}^{n}X_tX_t'e_t^2,$$
where $D(e) = \mathrm{diag}(e_1, \ldots, e_n)$ is an $n \times n$ diagonal matrix with all off-diagonal elements being zero. This is called White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator. See more discussion in Chapter 4.
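A short sketch of the resulting estimate of $\mathrm{var}(\hat\beta|X)$, namely $(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}$, written in the notation above; the helper name `white_vcov` is hypothetical.

```python
# White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator.
import numpy as np

def white_vcov(X, e):
    """Return (X'X)^{-1} X'D(e)D(e)'X (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = X.T @ (X * (e**2)[:, None])   # sum_t X_t X_t' e_t^2 = X'D(e)D(e)'X
    return XtX_inv @ meat @ XtX_inv
```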
Question: For $J = 1$, do we have
$$\frac{R\hat\beta - r}{\sqrt{R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'}} \sim t_{n-K}?$$
For $J > 1$, do we have
$$(R\hat\beta - r)'\left[R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)/J \sim F_{J,\,n-K}?$$
No. Although we have standardized both test statistics by the correct variance estimators, we still have $\mathrm{cov}(\hat\beta, e|X) \ne 0$ under Assumption 3.6. This implies that $\hat\beta$ and $e$ are not independent, and therefore we no longer have a t-distribution or an F-distribution in finite samples.
However, when $n \to \infty$, we have
(i) Case I ($J = 1$):
$$\frac{R\hat\beta - r}{\sqrt{R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'}} \stackrel{d}{\to} N(0, 1).$$
(ii) Case II ($J > 1$): the quadratic form above, multiplied by $J$, converges in distribution to $\chi^2_J$.
The above two feasible solutions are based on the assumption that $E(\varepsilon_t\varepsilon_s|X) = 0$ for all $t \ne s$.
In fact, we can also consistently estimate the limit of $X'VX$ when there exist both conditional heteroskedasticity and autocorrelation. This is called heteroskedasticity and autocorrelation consistent variance-covariance estimation. When there exists serial correlation of unknown form, an alternative solution should be provided. This is discussed in Chapter 6. See also Andrews (1991) and Newey and West (1987, 1994).
3.10 Conclusion

In this chapter, we have presented the econometric theory for the classical linear regression models. We first provide and discuss a set of assumptions on which the classical linear regression model is built. This set of regularity conditions will serve as the starting point from which we will develop modern econometric theory for linear regression models.
We derive the statistical properties of the OLS estimator. In particular, we point out that $R^2$ is not a suitable model selection criterion, because it is always nondecreasing in the dimension of regressors. Suitable model selection criteria, such as AIC and BIC, are discussed. We show that conditional on the regressor matrix $X$, the OLS estimator $\hat\beta$ is unbiased, has a vanishing variance, and is BLUE. Under the additional conditional normality assumption, we derive the finite sample normal distribution for $\hat\beta$, the Chi-squared distribution for $(n-K)s^2/\sigma^2$, as well as the independence between $\hat\beta$ and $s^2$.
Many hypotheses encountered in economics can be formulated as linear restrictions on model parameters. Depending on the number of parameter restrictions, we derive the t-test and the F-test. In the special case of testing the hypothesis that all slope coefficients are jointly zero, we also derive an asymptotically Chi-squared test based on $R^2$.
When there exist conditional heteroskedasticity and/or autocorrelation, the OLS estimator is still unbiased and has a vanishing variance, but it is no longer BLUE, and $\hat\beta$ and $s^2$ are no longer mutually independent. Under the assumption of a known variance-covariance matrix up to some scale parameter, one can transform the linear regression model by correcting conditional heteroskedasticity and eliminating autocorrelation, so that the transformed regression model has conditionally homoskedastic and uncorrelated errors. The OLS estimator of this transformed linear regression model is called the GLS estimator, which is BLUE. The t-test and F-test are applicable. When the variance-covariance structure is unknown, the GLS estimator becomes infeasible. However, if the error in the original linear regression model is serially uncorrelated (as is the case with independent observations across $t$), there are two feasible solutions. The first is to use a nonparametric method to obtain a consistent estimator for the conditional variance $\mathrm{var}(\varepsilon_t|X_t)$, and then obtain a feasible plug-in GLS. The second is to use White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator for the OLS estimator $\hat\beta$. Both of these methods are built on asymptotic theory. When the error of the original linear regression model is serially correlated, a feasible solution to estimate the variance-covariance matrix is provided in Chapter 6.
EXERCISES

3.1. Consider a bivariate linear regression model
$$Y_t = X_t'\beta^o + \varepsilon_t, \quad t = 1, \ldots, n.$$

3.2. For the OLS estimation of the linear regression model $Y_t = X_t'\beta^o + \varepsilon_t$, where $X_t$ is a $K \times 1$ vector, show $R^2 = \hat\rho^2_{Y\hat Y}$, the squared sample correlation between $Y_t$ and $\hat Y_t$.
3.4. [Effect of Multicollinearity] Consider a regression model
$$Y_t = X_t'\beta^o + \varepsilon_t,$$
where $X_t = (1, X_{1t}, \ldots, X_{kt})'$. Suppose Assumptions 3.1–3.3 hold. Let $R_j^2$ be the coefficient of determination from regressing $X_{jt}$ on all the other explanatory variables $\{X_{it}, 0 \le i \le k, i \ne j\}$. Show
$$\mathrm{var}(\hat\beta_j|X) = \frac{\sigma^2}{(1-R_j^2)\sum_{t=1}^{n}(X_{jt} - \bar X_j)^2},$$
where $\bar X_j = n^{-1}\sum_{t=1}^{n}X_{jt}$. The factor $1/(1-R_j^2)$ is called the variance inflation factor (VIF); it is used to measure the degree of multicollinearity among the explanatory variables in $X_t$.
where
ut = (Xt )"t ;
where fXt g is a nonstochastic process, and (Xt ) is a positive function of Xt such that
2 3
2
(X1 ) 0 0 ::: 0
6 2 7
6 0 (X2 ) 0 ::: 0 7
6 7
=66 0 0 2
(X3 ) ::: 0 7 = 12 12 ;
7
6 7
4 ::: ::: ::: ::: ::: 5
2
0 0 0 ::: (Xn )
with
2 3
(X1 ) 0 0 ::: 0
6 7
6 0 (X2 ) 0 ::: 0 7
1 6 7
2 =6
6 0 0 (X3 ) ::: 0 7:
7
6 7
4 ::: ::: ::: ::: ::: 5
0 0 0 ::: (Xn )
Assume that f"t g is i.i.d. N (0; 1): Then fut g is i.i.d. N (0; 2 (Xt )): This di¤ers from
Assumption 3.5 of the classical linear regression analysis, because now fut g exhibits
conditional heteroskedasticity.
Let ^ denote the OLS estimator for o :
(a) Is ^ unbiased for o ?
(b) Show that var( ^ ) = (X0 X) 1 X0 X(X0 X) 1 :
Consider an alternative estimator
~ = (X0 1
X) 1 X0 1
Y
" n # 1
X X
n
2
= (Xt )Xt Xt0 2
(Xt )Xt Yt :
t=1 t=1
where Yt = Yt = (Xt ); Xt = Xt = (Xt ): This model is obtained from model (4.1) after
dividing by (Xt ): In matrix notation, model (4.2) can be written as
o
Y =X + ";
1 1
where the n 1 vector Y = 2 Y and the n k matrix X = 2 X:]
(g) Construct two test statistics for the null hypothesis of interest H0 : o2 = 0.
One test is based on ^ ; and the other test is based on ~ : What are the …nite sample
distributions of your test statistics under H0 ? Can you tell which test is better?
(h) Construct two test statistics for the null hypothesis of interest H0 : R o = r;
where R is a J k matrix with J > 0: One test is based on ^ ; and the other test is
based on ~ : What are the …nite sample distributions of your test statistics under H0 ?
3.7. Consider the following classical regression model
o
Yt = Xt0 + "t :
o
H0 : R = r;
where Y^t = Xt0 ^ ; Y~t = Xt0 ~ ; and ^ ; ~ are the unrestricted and restricted OLS estimators
respectively.
o
Yt = Xt0 + "t
Xk
o o
= 0 + j Xjt + "t ; t = 1; :::; n: (7.1)
j=1
o o o
H0 : 1 = 2 = = k = 0:
Then the F -statistic can be written as
e0 e~ e0 e)=k
(~
F = 0 :
e e=(n k 1)
where e0 e is the sum of squared residuals from the unrestricted model (7.1), and e~0 e~ is
the sum of squared residuals from the restricted model (7.2)
o
Yt = 0 + "t : (7.2)
R2 =k
F = ;
(1 R2 )=(n k 1)
where Y^t = Xt0 ^ ; Y~t = Xt0 ~ ; and ^ ; ~ are the unrestricted and restricted OLS estimators
respectively.
3.11. [Structral Change] Suppose Assumptions 3.1 and 3.3 hold. Consider the fol-
lowing model on the whole sample:
o
Yt = Xt0 + (Dt Xt )0 o
+ "t ; t = 1; :::; n;
o
Yt = Xt0 + "t ; t = 1; :::; n1
and
o
Yt = Xt0 ( + o
) + "t ; t = n1 + 1; :::; n:
Let SSRu ; SSR1 ; SSR2 denotes the sums of squared residuals of the above three
regression models via OLS. Show
This identity implies that estimating the …rst regression mdoel with time dummy
variable Dt via OLS is equivalent to estimating two separate regression models over two
subsample periods respectively.
is …t to data. Suppose the p-value for the OLS estimator of 1 was 0.67 and for 2
was 0.84. Can we accept the hypothesis that 1 and 2 are both 0? Explain.
3.14. Suppose the conditions in 3.9 hold. It can be shown that the variances of the
OLS ^ and GLS ^ are respectively:
var( ^ jX) = 2
(X0 X) 1 X0 V X(X0 X) 1 ;
var( ^ jX) = 2
(X0 V 1
X) 1 :
where Xt = (X1t ; X2t )0 ; E(Xt Xt0 ) is nonsingular, and E("t jXt ) = 0. For simplicity, we
further assume E(X2t ) = 0 and E(X1t X2t ) 6= 0:
Now consider the following bivariate linear regression model
o
Yt = 1 X1t + ut :
(a) Show that if o2 6= 0; then E(Y1 jXt ) = Xt0 o 6= E(Y1t jX1t ): That is, there exists
an omitted variable (X2t ) in the bivariate regression model.
(b) Show that E(Yt jX1t ) 6= 1 X1t for all 1 : That is, the bivariate linear regression
model is misspeci…ed for E(Yt jX1t ):
(c) Is the best linear least squares approximation co¢ cient 1 in the bivariate linear
regression model equal to o1 ?
where Xt = (X1t ; X2t )0 ; and Assumptions 3.1–3.4 hold. (For simplicity, we have assumed
no intercept.) Denote the OLS estimator by ^ = ( ^ 1 ; ^ 2 )0 :
If o2 = 0 and we know it. Then we can consider a simpler regression
o
Yt = 1 X1t + "t :
What distributions will these test statistics follow under the null hypothesis that
R = r? Explain.
(d) Which set of tests, (T ; F ) or (T~ ; Q ~ ); are more powerful at the same signi…-
cance level? Explain. [Hint: The t-distribution has a heavier tail than N (0; 1) and so
has a larger critical value at a given signi…cance level.]
0
Yt = Xt0 + "t ; t = 1; 2; :::; n;
where "t = (Xt )vt ; Xt is a K 1 nonstochastic vector, and (Xt ) is a positive function
of Xt ; and fvt g is i.i.d. N (0; 1):
Let ^ = (X 0 X) 1 X 0 Y denote the OLS estimator for 0 ; where X is a n K matrix
whose t-th row is Xt ; and Y is a n 1 vector whose t-th component is Yt :
(a) Is ^ unbiased for 0 ?
(b) Find var( ^ ) = E[( ^ E ^ )( ^ E ^ )0 ]: You may …nd the following notation useful:
= diagf 2 (X1 ); 2 (X2 ); :::; 2 (Xn )g; i.e., is a n n diagonal matrix with the t-th
diagonal component equal to 2 (Xt ) and all o¤-diagonal components equal to zero.
Consider the transformed regression model
1 1 0
Yt = X0 + vt
(Xt ) (Xt ) t
or
0
Yt = Xt 0 + vt ;
where Yt = 1 (Xt )Yt and Xt = 1 (Xt )Xt :
Denote the OLS estimator of this transformed model as ~ :
(c) Show
~ = (X 0 1
X) 1 X 0 1
Y:
(d) Is ~ unbiased for 0 ?
(e) Find var( ~ ):
(f) Which estimator, ^ or ~ ; is more e¢ cient in terms of the mean squared error
criterion? Give your reasoning.
(g) Use the di¤erence R ~ r to construct a test statistic for the null hypothesis of
interest H0 : R 0 = r; where R is a J K matrix, r is K 1; and J > 1: What is the
…nite sample distribution of your test statistic under H0 ?
CHAPTER 4 LINEAR REGRESSION MODELS
WITH I.I.D. OBSERVATIONS
Abstract: When the conditional normality assumption on the regression error does
not hold, the OLS estimator no longer has the …nite sample normal distribution, and
the t-test statistics and F -test statistics no longer follow the Student t-distribution and
a F -distribution in …nite samples respectively. In this chapter, we show that under the
assumption of i.i.d. observations with conditional homoskedasticity, the classical t-test
and F -test are approximately applicable in large samples. However, under conditional
heteroskedasticity, the t-test statistics and F -test statistics are not applicable even when
the sample size goes to in…nity. Instead, White’s (1980) heteroskedasticity-consistent
variance-covariance matrix estimator should be used, which yields asymptotically valid
hypothesis test procedures. A direct test for conditional heteroskedasticity due to White
(1980) is presented. To facilitate asymptotic analysis in this and subsequent chapters,
we …rst introduce some basic tools in asymptotic analysis.
Key words: Asymptotic analysis, Almost sure convergence, Central limit theorem,
Convergence in distribution, Convergence in quadratic mean, Convergence in probability,
I.I.D., Law of large numbers, the Slutsky theorem, White’s heteroskedasticity-consistent
variance-covariance matrix estimator.
Motivation
The assumptions of classical linear regression models are rather strong and one may
have a hard time …nding practical applications where all these assumptions hold exactly.
For example, it has been documented that most economic and …nancial data have heavy
tails, and so they are not normally distributed. An interesting question now is whether
the estimators and tests which are based on the same principles as before still make
sense in this more general setting. In particular, what happens to the OLS estimator,
the t- and F -tests if any of the following assumptions fails:
When classical assumptions are violated, we do not know the …nite sample statistical
properties of the estimators and test statistics anymore. A useful tool to obtain the
understanding of the properties of estimators and tests in this more general setting
is to pretend that we can obtain a limitless number of observations. We can then
pose the question how estimators and test statistics would behave when the number of
observations increases without limit. This is called asymptotic analysis. In practice, the
sample size is always …nite. However, the asymptotic properties translate into results
that hold true approximately in …nite samples, provided that the sample size is large
enough. We now need to introduce some basic analytic tools for asymptotic theory.
For more systematic introduction of asymptotic theory, see, for example, White (1994,
1999).
4.1 Introduction to Asymptotic Theory

In this section, we introduce some important convergence concepts and limit theorems. First, we introduce the concept of convergence in mean squares (or quadratic mean), which is a distance measure of a sequence of random variables from a random variable: a sequence of random vectors $\{Z_n\}$ converges to $Z$ in quadratic mean if
$$E\|Z_n - Z\|^2 \to 0 \text{ as } n \to \infty,$$
where, for an $l \times m$ matrix argument,
$$\|Z_n - Z\|^2 = \sum_{t=1}^{l}\sum_{s=1}^{m}[Z_n - Z]^2_{(t,s)}.$$

Example 1: Suppose $\{Z_t\}$ is i.i.d. $(\mu, \sigma^2)$, and $\bar Z_n = n^{-1}\sum_{t=1}^{n}Z_t$. Then
$$\bar Z_n \stackrel{q.m.}{\to} \mu.$$
Solution:
$$E(\bar Z_n - \mu)^2 = \mathrm{var}(\bar Z_n) = \mathrm{var}\!\left(n^{-1}\sum_{t=1}^{n}Z_t\right) = \frac{1}{n^2}\sum_{t=1}^{n}\mathrm{var}(Z_t) = \frac{\sigma^2}{n} \to 0 \text{ as } n \to \infty.$$
It follows that
$$E(\bar Z_n - \mu)^2 = \frac{\sigma^2}{n} \to 0 \text{ as } n \to \infty.$$
Next, we introduce the concept of convergence in probability, which is another popular distance measure between a sequence of random variables and a random variable: $Z_n$ converges to $Z$ in probability if, for any given $\epsilon > 0$, $\Pr(\|Z_n - Z\| > \epsilon) \to 0$ as $n \to \infty$.

Lemma 4.1 [Weak Law of Large Numbers (WLLN) for I.I.D. Samples]: Suppose $\{Z_t\}$ is i.i.d. $(\mu, \sigma^2)$, and define $\bar Z_n = n^{-1}\sum_{t=1}^{n}Z_t$, $n = 1, 2, \ldots$. Then
$$\bar Z_n \stackrel{p}{\to} \mu \text{ as } n \to \infty.$$
Proof: For any given constant $\epsilon > 0$, we have by Chebyshev's inequality
$$\Pr(|\bar Z_n - \mu| > \epsilon) \le \frac{E(\bar Z_n - \mu)^2}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0 \text{ as } n \to \infty.$$
Hence,
$$\bar Z_n \stackrel{p}{\to} \mu \text{ as } n \to \infty.$$
This is the so-called weak law of large numbers (WLLN). In fact, we can weaken the moment condition.
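A small Monte Carlo illustration (not part of the notes; the exponential population and the seed are illustrative) of the WLLN: sample means of i.i.d. draws settle down around the population mean as $n$ grows.

```python
# Sample means of i.i.d. draws approach the population mean as n increases.
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0
for n in (10, 100, 10_000):
    Z = rng.exponential(scale=mu, size=n)   # i.i.d. with E(Z_t) = 2
    print(n, Z.mean())                      # sample means approach 2
```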
Suppose $Z_t$ is the return on a stock in period $t$, and the returns over different time periods are i.i.d. $(\mu, \sigma^2)$. Also assume the investor holds the stock for a total of $n$ periods. Then the average return over each time period is the sample mean
$$\bar Z = \frac{1}{n}\sum_{t=1}^{n}Z_t.$$

Lemma 4.2 [WLLN for I.I.D. Random Samples]: Suppose $\{Z_t\}$ is i.i.d. with $E(Z_t) = \mu$ and $E|Z_t| < \infty$. Define $\bar Z_n = n^{-1}\sum_{t=1}^{n}Z_t$. Then
$$\bar Z_n \stackrel{p}{\to} \mu \text{ as } n \to \infty.$$
as n ! 1: We denote
Zn = OP (1):
Intuitively, when Zn = OP (1); the probability that jjZn jj exceeds a very large constant is
small as n ! 1. Or, equivalently, jjZn jj is smaller than C with a very high probability
as n ! 1:
2
Example 2: Suppose Zn N( ; ) for all n 1: Then
Zn = OP (1):
Solution: For any > 0; we always have a su¢ ciently large constant C = C( ) > 0
such that
P (jZn j > C) = 1 P( C Zn C)
C Zn C
= 1 P
C C+
= 1 +
;
where (z) = P (Z z) is the CDF of N (0; 1): [We can choose C such that [(C
)= ] 1 12 and [ (C + )= ] 12 :]
In this case,
Suppose we set
2[1 (C)] = ;
that is, we set
1
C= 1 ;
2
1
where ( ) is the inverse function of ( ): Then we have
P (jZn j > C) = :
The following lemma provides a convenient way to verify convergence in probability.
q:m: p
Lemma 4.3: If Zn Z ! 0; then Zn Z ! 0:
E[Zn Z]2
P (jZn Zj > ) 2
!0
Example 3: Suppose Assumptions 3.1–3.4 hold. Does the OLS estimator ^ converges
in probability to o ?
0
E[( ^ o
)( ^ o 0
) jX] = 2 0
(X 0 X) 1
! 0
Example 4: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Does s2 converge in
probability to 2 ?
s2 2
(n K) 2 n K;
4
and therefore we have E(s2 ) = 2 and var(s2 ) = n2 K : It follows that E(s2 2 2
) =
4 2 q:m: 2 2 p 2
2 =(n K) ! 0; s ! and so s ! because convergence in quadratic mean
implies convergence in probability.
Example 5: Suppose (
1
0 with prob 1 n
Zn =
n with prob n1 :
p
Then Zn ! 0 as n ! 1 but E(Zn 0)2 = n ! 1: Please verify it.
Solution:
(i) For any given " > 0; we have
1
P (jZn 0j > ") = P (Zn = n) = ! 0:
n
(ii)
X
E(Zn 0)2 = (zn 0)2 f (zn )
zn 2f0;ng
= (0 0)2 1 n 1
+ (n 0)2 n 1
= n ! 1:
a:s:
We denote Zn Z ! 0:
To gain intuition for the concept of almost sure convergence, recall the de…nition of
a random variable: any random variable is a mapping from the sample space to the
real line, namely Z : ! R: Let ! be a basic outcome in the sample space : De…ne a
subset in :
Ac = f! 2 : lim Zn (!) = Z(!)g:
n!1
c
That is, A is the set of basic outcomes on which the sequence of fZn ( )g converges to
Z( ) as n ! 1: Then almost sure convergence can be stated as
P (Ac ) = 1:
and
Zn (!) = ! + ! n for ! 2 [0; 1]:
a:s:
Is Zn Z ! 0?
Solution: Consider
Lemma 4.4 [Strong Law of Large Numbers (SLLN) for I.I.D. Random Sam-
ples] Suppose fZt g be i.i.d. with E(Zt ) = and EjZt j < 1: Then
a:s:
Zn ! as n ! 1:
Almost sure convergence implies convergence in probability but not vice versa.
p p
Question: If s2 ! 2
; do we have s ! ?
Answer: Yes. It follows from the following continuity lemma with the choice of g(s2 ) =
p
s2 = s:
p p
Lemma 4.5 [Continuity]: (i) Suppose an ! a and bn ! b; and g( ) and h( ) are
continuous functions. Then
p
g(an ) + h(bn ) ! g(a) + h(b); and
p
g(an )h(bn ) ! g(a)h(b):
and the random variable Z: This di¤ers from the concept of convergence in distribution
introduced in Chapter 3. There, convergence in distribution is de…ned in terms of the
closeness of the CDF Fn (z) of Zt to the CDF F (z) of Z; not between the closeness of
the random variable Zn to the random variable Z: As a result, for convergence in mean
squares, convergence in probability and almost sure convergence, Zn converges to Z if
and only if convergence of Zn to Z occurs element by element (that is, each element of
Zn converges to the corresponding element of Z). For the convergence in distribution
of Zn to Z, however, element by element convergence does not imply convergence in
distribution of Zn to Z;because element-wise convergence in distribution ignores the
d
relationships among the components of Zn : Nevertheless, Zn ! Z does imply element
by element convergence in distribution. That is, convergence in joint distribution implies
convergence in marginal distribution.
The main purpose of asymptotic analysis is to derive the large sample distribution
of the estimator or statistic of interest and use it as an approximation in statistical
inference. For this purpose, we need to make use of an important limit theorem, namely
Central Limit Theorem (CLT). We now state and prove the CLT for i.i.d. random
samples, a fundamental limit theorem in probability theory.
Lemma 4.6 [Central Limit Theorem (CLT) for I.I.D. Random Samples]: Suppose $\{Z_t\}$ is i.i.d. $(\mu, \sigma^2)$, and $\bar Z_n = n^{-1}\sum_{t=1}^{n}Z_t$. Then as $n \to \infty$,
$$\frac{\bar Z_n - E(\bar Z_n)}{\sqrt{\mathrm{var}(\bar Z_n)}} = \frac{\bar Z_n - \mu}{\sqrt{\sigma^2/n}} = \frac{\sqrt{n}(\bar Z_n - \mu)}{\sigma} \stackrel{d}{\to} N(0, 1).$$
Proof: Put
$$Y_t = \frac{Z_t - \mu}{\sigma},$$
and $\bar Y_n = n^{-1}\sum_{t=1}^{n}Y_t$. Then
$$\frac{\sqrt{n}(\bar Z_n - \mu)}{\sigma} = \sqrt{n}\,\bar Y_n.$$
The characteristic function of $\sqrt{n}\,\bar Y_n$ is
$$\varphi_n(u) = E[\exp(iu\sqrt{n}\,\bar Y_n)], \qquad i = \sqrt{-1},$$
$$= E\!\left[\exp\!\left(\frac{iu}{\sqrt{n}}\sum_{t=1}^{n}Y_t\right)\right]
= \prod_{t=1}^{n}E\!\left[\exp\!\left(\frac{iu}{\sqrt{n}}Y_t\right)\right] \quad \text{by independence}$$
$$= \left[\varphi_Y\!\left(\frac{u}{\sqrt{n}}\right)\right]^n \quad \text{by identical distribution}$$
$$= \left[\varphi_Y(0) + \varphi_Y'(0)\frac{u}{\sqrt{n}} + \frac{1}{2}\varphi_Y''(0)\frac{u^2}{n} + \cdots\right]^n
= \left[1 - \frac{u^2}{2n}(1 + o(1))\right]^n \to \exp\!\left(-\frac{u^2}{2}\right) \text{ as } n \to \infty,$$
where the second equality follows from independence, the third equality follows from identical distribution, the fourth equality follows from a Taylor series expansion, and $\varphi_Y(0) = 1$, $\varphi_Y'(0) = 0$, $\varphi_Y''(0) = -1$. Note that $o(1)$ denotes a remainder term that vanishes to zero as $n \to \infty$, and we have also made use of the fact that $(1 + \frac{a}{n})^n \to e^a$.
More rigorously, we can show
$$\ln\varphi_n(u) = n\ln\varphi_Y\!\left(\frac{u}{\sqrt{n}}\right) = \frac{\ln\varphi_Y(u/\sqrt{n})}{n^{-1}},$$
and applying L'Hopital's rule twice (treating $n^{-1/2}$ as a continuous variable) gives
$$\lim_{n\to\infty}\ln\varphi_n(u)
= \frac{u}{2}\lim_{n\to\infty}\frac{\varphi_Y'(u/\sqrt{n})}{n^{-1/2}\varphi_Y(u/\sqrt{n})}
= \frac{u^2}{2}\lim_{n\to\infty}\frac{\varphi_Y''(u/\sqrt{n})\varphi_Y(u/\sqrt{n}) - [\varphi_Y'(u/\sqrt{n})]^2}{\varphi_Y(u/\sqrt{n})^2}
= -\frac{u^2}{2}.$$
It follows that
$$\lim_{n\to\infty}\varphi_n(u) = e^{-\frac{1}{2}u^2}.$$
This is the characteristic function of $N(0, 1)$. By the uniqueness of the characteristic function, the asymptotic distribution of
$$\frac{\sqrt{n}(\bar Z_n - \mu)}{\sigma}$$
is $N(0, 1)$. This completes the proof.
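The CLT can also be visualized by simulation; the sketch below (the uniform population and the seed are illustrative choices) standardizes sample means of non-normal draws and checks that they behave approximately like $N(0,1)$.

```python
# Standardized sample means of uniform draws are approximately standard normal.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 10_000
mu, sigma = 0.5, np.sqrt(1 / 12)                    # mean and sd of Uniform(0,1)
Zbar = rng.uniform(size=(reps, n)).mean(axis=1)
T = np.sqrt(n) * (Zbar - mu) / sigma                # standardized sample means
print(T.mean(), T.std(), np.mean(np.abs(T) > 1.96)) # approx 0, 1, and 0.05
```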
d
Lemma 4.7 [Cramer-Wold Device] A p 1 random vector Zn ! Z if and only if
for any nonzero 2 Rp such that 0 = pj=1 2j = 1; we have
0 d 0
Zn ! Z:
Example 9: Suppose Assumptions 3.1, 3.3(a) and 3.5, and the hypothesis H0 : R o = r
hold, where R is a J K nonstochastic matrix with rank J, r is a J 1 nonstochastic
vector, and J K. Then the quadratic form
(R ^ r)0 [R(X0 X) 1 R0 ] 1 (R ^ r) 2
2 J:
2
Suppose now we replace by s2 : What is the asymptotic distribution of the quadratic
form
(R ^ r)0 [R(X0 X) 1 R0 ] 1 (R ^ r)
?
s2
Finally, we introduce a lemma which is very useful in deriving the asymptotic distributions of nonlinear statistics (i.e., nonlinear functions of the random sample).

Lemma 4.9 [Delta Method]: Suppose $\sqrt{n}(\bar Z_n - \mu)/\sigma \stackrel{d}{\to} N(0,1)$, and $g(\cdot)$ is continuously differentiable with $g'(\mu) \ne 0$. Then as $n \to \infty$,
$$\sqrt{n}[g(\bar Z_n) - g(\mu)] \stackrel{d}{\to} N(0, [g'(\mu)]^2\sigma^2).$$
Proof: First, because $\sqrt{n}(\bar Z_n - \mu)/\sigma \stackrel{d}{\to} N(0,1)$ implies $\sqrt{n}(\bar Z_n - \mu)/\sigma = O_P(1)$, we have $\bar Z_n - \mu = O_P(n^{-1/2}) = o_P(1)$.
Next, by a first order Taylor series expansion, we have
$$Y_n \equiv g(\bar Z_n) = g(\mu) + g'(\bar\lambda_n)(\bar Z_n - \mu),$$
where $\bar\lambda_n = \lambda\mu + (1-\lambda)\bar Z_n$ for some $\lambda \in [0,1]$. It follows by the Slutsky theorem that
$$\sqrt{n}\,\frac{g(\bar Z_n) - g(\mu)}{\sigma} = g'(\bar\lambda_n)\,\sqrt{n}\,\frac{\bar Z_n - \mu}{\sigma} \stackrel{d}{\to} N(0, [g'(\mu)]^2),$$
where $g'(\bar\lambda_n) \stackrel{p}{\to} g'(\mu)$ given $\bar\lambda_n \stackrel{p}{\to} \mu$.
By the Slutsky theorem again, we have
$$\sqrt{n}[Y_n - g(\mu)] \stackrel{d}{\to} N(0, [g'(\mu)]^2\sigma^2).$$

Example: Consider $g(z) = 1/z$ with $\mu \ne 0$. A first order Taylor series expansion gives
$$g(\bar Z_n) = g(\mu) + g'(\bar\lambda_n)(\bar Z_n - \mu), \quad \text{or} \quad \frac{1}{\bar Z_n} - \frac{1}{\mu} = -\bar\lambda_n^{-2}(\bar Z_n - \mu),$$
where $\bar\lambda_n = \lambda\mu + (1-\lambda)\bar Z_n \stackrel{p}{\to} \mu$ given $\bar Z_n \stackrel{p}{\to} \mu$ and $\lambda \in [0,1]$. It follows that
$$\sqrt{n}\left(\frac{1}{\bar Z_n} - \frac{1}{\mu}\right) = -\bar\lambda_n^{-2}\,\sqrt{n}(\bar Z_n - \mu) \stackrel{d}{\to} N(0, \sigma^2/\mu^4).$$
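A Monte Carlo check of the example above (the population parameters and seed are illustrative): for $g(z) = 1/z$, the simulated variance of $\sqrt{n}(1/\bar Z_n - 1/\mu)$ should be close to the delta-method value $\sigma^2/\mu^4$.

```python
# Delta method check for g(z) = 1/z.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 1.0, 500, 10_000
Zbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
stat = np.sqrt(n) * (1 / Zbar - 1 / mu)
print(stat.var(), sigma**2 / mu**4)   # simulated variance close to the delta-method value
```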
Taylor series expansions, various convergence concepts, laws of large numbers, central
limit theorems, and slutsky theorem constitute a tool kit of asymptotic analysis. We
now use these asymptotic tools to investigate the large sample behavior of the OLS
estimator and related statistics in subsequent chapters.
4.2 Framework and Assumptions

We first state the assumptions under which we will establish the asymptotic theory for linear regression models.

Assumption 4.1 [I.I.D. Observations]: $\{Y_t, X_t'\}'$, $t = 1, \ldots, n$, is an i.i.d. random sample.

Assumption 4.2 [Linearity]:
$$Y_t = X_t'\beta^o + \varepsilon_t, \quad t = 1, \ldots, n,$$
for some unknown $K \times 1$ parameter $\beta^o$ and some unobservable random variable $\varepsilon_t$.

Assumption 4.3 [Correct Model Specification]: $E(\varepsilon_t|X_t) = 0$ a.s. with $E(\varepsilon_t^2) = \sigma^2 < \infty$.

Assumption 4.4 [Nonsingularity]: The $K \times K$ matrix $Q = E(X_tX_t')$ is finite and nonsingular.

Assumption 4.5: The $K \times K$ matrix $V \equiv \mathrm{var}(X_t\varepsilon_t) = E(X_tX_t'\varepsilon_t^2)$ is finite and positive definite (p.d.).
Remarks:
The i.i.d. observations assumption in Assumption 4.1 implies that the asymptotic
theory developed in this chapter will be applicable to cross-sectional data, but not time
series data. The observations of the later are usually correlated and will be considered
in Chapter 5. Put Zt = (Yt ; Xt0 )0 : Then I.I.D. implies that Zt and Zs are independent
when t 6= s, and the Zt have the same distribution for all t: The identical distribution
means that the observations are generated from the same data generating process, and
independence means that di¤erent observations contain new information about the data
generating process.
Assumptions 4.1 and 4.3 imply the strict exogeneity condition (Assumption 3.2)
holds, because we have
$$E(\varepsilon_t|X) = E(\varepsilon_t|X_1, \ldots, X_n) = E(\varepsilon_t|X_t) = 0,$$
where the second equality follows from the i.i.d. assumption (Assumption 4.1) and the third from Assumption 4.3.
As a most important feature of Assumptions 4.1–4.5 together, we allow for condi-
tional heteroskedasticity (i.e., var("t jXt ) 6= 2 a.s.); and do not assume normality for the
conditional distribution of "t jXt . It is possible that var("t jXt ) may be correlated with
Xt : For example, the variation of the output of a …rm may depend on the size of the
…rm, and the variation of a household may depend on its income level. In economics
and …nance, conditional heteroskedasticity is more likely to occur in cross-sectional ob-
servations than in time series observations, and for time series observations, conditional
heteroskedasticity is more likely to occur for high-frequency data than low-frequency
data. In this chapter, we will consider the e¤ect of conditional heteroskedasticity in
cross-section observations. The e¤ect of conditional heteroskedasticity in time series
observations will be considered in Chapter 5.
On the other hand, relaxation of the normality assumption is more realistic for
economic and …nancial data. For example, it has been well documented (Mandelbrot
1963, Fama 1965, Kon 1984) that returns on …nancial assets are not normally distributed.
However, the I.I.D. assumption implies that cov("t ; "s ) = 0 for all t 6= s: That is, there
exists no serial correlation in the regression disturbance.
2
Among other things, Assumption 4.4 implies E(Xjt ) < 1 for 0 j k: By the
SLLN for i.i.d. random samples, we have
1X
n
X0 X a:s:
= Xt Xt0 ! E(Xt Xt0 ) = Q
n n t=1
as n ! 1: Hence, when n is large, the matrix X0 X behaves approximately like nQ; whose
minimum eigenvalue min (nQ) = n min (Q) ! 1 at the rate of n: Thus, Assumption 4.4
implies Assumption 3.3.
When X0t = 1; Assumption 4.5 implies E("2t ) < 1: If E("2t jXt ) = 2 < 1 a.s.,
i.e., there exists conditional homoskedasticity, then Assumption 4.5 can be ensured by
Assumption 4.4. More generally, there exists conditional heteroskedasticity, the moment
condition in Assumption 4.5 can be ensured by the moment conditions that E("4t ) < 1
4
and E(Xjt ) < 1 for 0 j k; because by repeatedly using the Cauchy-Schwarz
inequality twice, we have
where 0 j; l k and 1 t n:
Consistency of OLS?
Asymptotic normality?
Asymptotic efficiency?
Hypothesis testing?

We first study the consistency of the OLS estimator, which can be written as
$$\hat\beta = \hat Q^{-1}\,n^{-1}\sum_{t=1}^{n}X_tY_t, \quad \text{where} \quad \hat Q = n^{-1}\sum_{t=1}^{n}X_tX_t'.$$
Substituting $Y_t = X_t'\beta^o + \varepsilon_t$, we obtain
$$\hat\beta = \beta^o + \hat Q^{-1}\,n^{-1}\sum_{t=1}^{n}X_t\varepsilon_t.$$
We will show that $\hat\beta \stackrel{p}{\to} \beta^o$ as $n \to \infty$.
Proof: Let $C > 0$ be some bounded constant. Also, recall $X_t = (X_{0t}, X_{1t}, \ldots, X_{kt})'$.
First, the moment condition holds: for all $0 \le j \le k$,
$$E|X_{jt}\varepsilon_t| \le (EX_{jt}^2)^{1/2}(E\varepsilon_t^2)^{1/2} \le C^{1/2}C^{1/2} = C \quad \text{by the Cauchy-Schwarz inequality},$$
where $E(X_{jt}^2) \le C$ by Assumption 4.4, and $E(\varepsilon_t^2) \le C$ by Assumption 4.3. It follows from the WLLN (with $Z_t = X_t\varepsilon_t$) that
$$n^{-1}\sum_{t=1}^{n}X_t\varepsilon_t \stackrel{p}{\to} E(X_t\varepsilon_t) = 0,$$
where $E(X_t\varepsilon_t) = E[X_tE(\varepsilon_t|X_t)] = 0$ by the law of iterated expectations and Assumption 4.3.
Similarly, because $E|X_{jt}X_{lt}| \le C$ by the Cauchy-Schwarz inequality for all pairs $(j, l)$, where $0 \le j, l \le k$, we have by the WLLN
$$\hat Q \stackrel{p}{\to} E(X_tX_t') = Q.$$
Hence, we have $\hat Q^{-1} \stackrel{p}{\to} Q^{-1}$ by continuity. It follows that
$$\hat\beta - \beta^o = (X'X)^{-1}X'\varepsilon = \hat Q^{-1}\,n^{-1}\sum_{t=1}^{n}X_t\varepsilon_t \stackrel{p}{\to} Q^{-1}\cdot 0 = 0.$$
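A simulation sketch of this consistency result under conditional heteroskedasticity; the data generating process below is hypothetical and is only meant to show $\hat\beta$ approaching $\beta^o$ as $n$ grows.

```python
# OLS converges to beta^o as n grows, even with heteroskedastic errors.
import numpy as np

rng = np.random.default_rng(0)
beta_o = np.array([1.0, -2.0])
for n in (50, 500, 50_000):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    eps = rng.normal(size=n) * (0.5 + np.abs(X[:, 1]))  # E(eps|X)=0, heteroskedastic
    y = X @ beta_o + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(n, beta_hat)                                   # approaches (1, -2)
```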
Lemma 4.11 [Multivariate Central Limit Theorem (CLT) for I.I.D. Random
Samples]: Suppose fZt g is a sequence of i.i.d. random vectors with E(Zt ) = 0 and
var(Zt ) = E(Zt Zt0 ) = V is …nite and positive de…nite. De…ne
X
n
1
Zn = n Zt :
t=1
Then as n ! 1;
p d
nZn ! N (0; V )
or
1 p d
V 2 nZn ! N (0; I):
p
Question: What is the variance-covariance matrix of n Zn ?
Answer: Noting that E(Zt ) = 0; we have
!
p 1
Xn
var( nZn ) = var n 2 Zt
t=1
" ! !0 #
1
X
n
1
X
n
= E n 2 Zt n 2 Zs
t=1 s=1
X
n X
n
1
= n E(Zt Zs0 )
t=1 s=1
X
n
1
= n E(Zt Zt0 ) (because Zt and Zs are independent for t 6= s)
t=1
= E(Zt Zt0 )
= V:
p
In other words, the variance of nZn is identical to the variance of each individual
random vector Zt :
We now apply this CLT to the scaled sum
$$n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t.$$
Note that $E(X_t\varepsilon_t) = 0$ by Assumption 4.3, and $\mathrm{var}(X_t\varepsilon_t) = E(X_tX_t'\varepsilon_t^2) = V$, which is finite and p.d. by Assumption 4.5. Then, by the CLT for i.i.d. random sequences applied to $\{Z_t = X_t\varepsilon_t\}$, we have
$$n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t = \sqrt{n}\,\bar Z_n \stackrel{d}{\to} Z \sim N(0, V).$$
Also,
$$\hat Q^{-1} \stackrel{p}{\to} Q^{-1},$$
given that $Q$ is nonsingular, so that the inverse function is continuous and well defined. It follows by the Slutsky theorem that
$$\sqrt{n}(\hat\beta - \beta^o) = \hat Q^{-1}\,n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t \stackrel{d}{\to} Q^{-1}Z \sim N(0, Q^{-1}VQ^{-1}).$$
Remarks:
p
The theorem implies that the asymptotic mean of n( ^ o
) is equal to 0. That is,
p ^ o
the mean of n( ) is approximately 0 when n is large.
p
It also implies that the asymptotic variance of n( ^ o
) is Q 1 V Q 1 : That is, the
p ^ o
variance of n( ) is approximately Q 1 V Q 1 : Because the asymptotic variance is
p
a di¤erent concept from the variance of n( ^ o
); we denote the asymptotic variance
p ^ o p ^ 1 1
of n( ) as follows: avar( n ) = Q V Q :
p
We now consider a special case under which we can simplfy the expression of avar( n ^ ):
Proof: Under Assumption 4.6, we can simplify
Q 1V Q 1
=Q 1 2
QQ 1
= 2
Q 1:
Remarks:
p
Under conditional homoskedasticity, the asymptotic variance of n( ^ o
) is
p
avar( n ^ ) = 2
Q 1:
Question: Is the OLS estimator ^ the BLUE estimator asymptotically (i.e., when
n ! 1)?
Lemma 4.14: Suppose Assumptions 4.1, 4.2 and 4.4 hold. Then
$$\hat Q = n^{-1}\sum_{t=1}^{n}X_tX_t' \stackrel{p}{\to} Q.$$

Question: How to estimate $\sigma^2$?

Recalling that $\sigma^2 = E(\varepsilon_t^2)$, we use the sample residual variance estimator
$$s^2 = \frac{e'e}{n-K} = \frac{1}{n-K}\sum_{t=1}^{n}e_t^2 = \frac{1}{n-K}\sum_{t=1}^{n}(Y_t - X_t'\hat\beta)^2.$$

Theorem 4.15 [Consistent Estimator for $\sigma^2$]: Under Assumptions 4.1–4.4,
$$s^2 \stackrel{p}{\to} \sigma^2.$$
Proof: Writing
$$e_t = Y_t - X_t'\hat\beta = \varepsilon_t + X_t'\beta^o - X_t'\hat\beta = \varepsilon_t - X_t'(\hat\beta - \beta^o),$$
we have
$$s^2 = \frac{1}{n-K}\sum_{t=1}^{n}\left[\varepsilon_t - X_t'(\hat\beta - \beta^o)\right]^2
= \frac{n}{n-K}\left(n^{-1}\sum_{t=1}^{n}\varepsilon_t^2\right)
+ (\hat\beta - \beta^o)'\left[(n-K)^{-1}\sum_{t=1}^{n}X_tX_t'\right](\hat\beta - \beta^o)
- 2(\hat\beta - \beta^o)'(n-K)^{-1}\sum_{t=1}^{n}X_t\varepsilon_t$$
$$\stackrel{p}{\to} 1\cdot\sigma^2 + 0'\,Q\,0 - 2\cdot 0'\cdot 0 = \sigma^2,$$
given that $K$ is a fixed number (i.e., $K$ does not grow with the sample size $n$), where we have made use of the WLLN in three places.

Remarks:
Under conditional homoskedasticity, the asymptotic variance estimator of $\sqrt{n}(\hat\beta - \beta^o)$ is
$$s^2\hat Q^{-1} = s^2(X'X/n)^{-1},$$
which is $n$ times the estimated variance $s^2(X'X)^{-1}$ used in the classical regression case. Because of this, as will be seen below, the conventional t-test and F-test are still valid for large samples under conditional homoskedasticity.
where
D(e) = diag(e1 ; e2 ; :::; en )
is an n n diagonal matrix with diagonal elements equal to et for t = 1; :::; n: To ensure
consistency of V^ to V; we impose the following additional moment conditions.
4
Assumption 4.7: (i) E(Xjt ) < 1 for all 0 j k; and (ii) E("4t ) < 1:
Lemma 4.17: Suppose Assumptions 4.1–4.5 and 4.7 hold. Then
p
V^ ! V:
23
X
n
V^ = n 1
Xt Xt0 "2t
t=1
Xn
+n 1
Xt Xt0 [( ^ o 0
) Xt Xt0 ( ^ o
)]
t=1
Xn
2n 1
Xt Xt0 ["t Xt0 ( ^ o
)]
t=1
p
! V +0 2 0;
24
o p
given ^ ! 0; and
X
n
p
1
n Xit Xjt Xlt "t ! E (Xit Xjt Xlt "t ) = 0
t=1
^ 1 V^ Q
^ 1 p
Q ! Q 1V Q 1:
Remarks:
Observe that
where 2 = E("2t ); 2 (Xt ) = E("2t jXt ); and the last equality follows from the LIE.
Thus, if 2 (Xt ) is positively correlated with Xt Xt0 ; 2 Q will underestimate the true
variance-covariance E(Xt Xt0 "2t ) in the sense that V 2
Q is a positive de…nite matrix.
Consequently, the standard t-test and F -test will overreject the correct null hypothesis
at any given signi…cance level. There will exist substantial Type I errors.
^ 1 V^ Q
Question: What happens if one use the asymptotic variance estimator Q ^ 1
but
there exists conditional homoskedasticity?
The asymptotic variance estimator is asymptotically valid, but it will not perform as
^ 1 in …nite samples, because the latter exploits the information
well as the estimator s2 Q
of conditonal homoskedasticity.
We first consider the decomposition
$$R\hat\beta - r = R(\hat\beta - \beta^o) + R\beta^o - r.$$
It follows that under $H_0: R\beta^o = r$, we have
$$\sqrt{n}(R\hat\beta - r) \stackrel{d}{\to} N(0, RQ^{-1}VQ^{-1}R')$$
when $H_0$ holds. The test procedures will differ depending on whether there exists conditional heteroskedasticity. We first consider the case of conditional homoskedasticity.
When $J = 1$, we can use the conventional t-test statistic for large sample inference.

Theorem 4.19 [t-test]: Suppose Assumptions 4.1–4.4 and 4.6 hold. Then under $H_0$ with $J = 1$,
$$T = \frac{R\hat\beta - r}{\sqrt{s^2R(X'X)^{-1}R'}} \stackrel{d}{\to} N(0, 1)$$
as $n \to \infty$.
Proof: Given $R\sqrt{n}(\hat\beta - \beta^o) \stackrel{d}{\to} N(0, \sigma^2RQ^{-1}R')$, $R\beta^o = r$ under $H_0$, and $J = 1$, we have
$$\frac{\sqrt{n}(R\hat\beta - r)}{\sqrt{\sigma^2RQ^{-1}R'}} = \frac{R\sqrt{n}(\hat\beta - \beta^o)}{\sqrt{\sigma^2RQ^{-1}R'}} \stackrel{d}{\to} N(0, 1).$$
By the Slutsky theorem and $\hat Q = X'X/n$, we obtain
$$\frac{\sqrt{n}(R\hat\beta - r)}{\sqrt{s^2R\hat Q^{-1}R'}} \stackrel{d}{\to} N(0, 1).$$

Theorem 4.20 [Asymptotic $\chi^2$ Test]: Suppose Assumptions 4.1–4.4 and 4.6 hold. Then under $H_0$,
$$J\cdot F \equiv (R\hat\beta - r)'\left[s^2R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r) \stackrel{d}{\to} \chi^2_J$$
as $n \to \infty$.
Proof: Under $H_0$, $\sqrt{n}(R\hat\beta - r) \stackrel{d}{\to} N(0, \sigma^2RQ^{-1}R')$. Also, $s^2\hat Q^{-1} \stackrel{p}{\to} \sigma^2Q^{-1}$, so we have by the Slutsky theorem
$$\sqrt{n}(R\hat\beta - r)'\left[s^2R\hat Q^{-1}R'\right]^{-1}\sqrt{n}(R\hat\beta - r) \stackrel{d}{\to} \chi^2_J,$$
or, equivalently, $J\cdot F \stackrel{d}{\to} \chi^2_J$.

Remarks:
When $\{\varepsilon_t\}$ is not i.i.d. $N(0, \sigma^2)$ conditional on $X_t$, we cannot use the F distribution, but we can still compute the F-statistic, and the appropriate test statistic is $J$ times the F-statistic, which is asymptotically $\chi^2_J$. That is,
$$J\cdot F = \frac{\tilde e'\tilde e - e'e}{e'e/(n-K)} \stackrel{d}{\to} \chi^2_J.$$
Because $J\cdot F_{J,\,n-K}$ approaches $\chi^2_J$ as $n \to \infty$, we may interpret the above theorem in the following way: the classical results for the F-test are still approximately valid under conditional homoskedasticity when $n$ is large.
When the null hypothesis is that all slope coefficients except the intercept are jointly zero, we can use a test statistic based on $R^2$.

Theorem 4.21 [$(n-K)R^2$ Test]: Suppose Assumptions 4.1–4.6 hold, and we are interested in testing the null hypothesis that
$$H_0: \beta_1^o = \beta_2^o = \cdots = \beta_k^o = 0$$
in the model
$$Y_t = \beta_0^o + \beta_1^oX_{1t} + \cdots + \beta_k^oX_{kt} + \varepsilon_t.$$
Let $R^2$ be the coefficient of determination from the unrestricted regression model
$$Y_t = X_t'\beta^o + \varepsilon_t.$$
Then under $H_0$,
$$(n-K)R^2 \stackrel{d}{\to} \chi^2_k,$$
where $K = k + 1$.
Proof: Recall that
$$F = \frac{R^2/k}{(1-R^2)/(n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-K)},$$
so that
$$k\cdot F = \frac{(n-K)R^2}{1-R^2} \stackrel{d}{\to} \chi^2_k$$
under $H_0$. This implies that $k\cdot F$ is bounded in probability; that is,
$$\frac{(n-K)R^2}{1-R^2} = O_P(1).$$
Consequently, given that $k$ is a fixed integer,
$$\frac{R^2}{1-R^2} = O_P(n^{-1}) = o_P(1), \quad \text{or} \quad R^2 \stackrel{p}{\to} 0.$$
Therefore, $1 - R^2 \stackrel{p}{\to} 1$. By the Slutsky theorem, we have
$$(n-K)R^2 = \frac{(n-K)R^2}{1-R^2}(1-R^2) = (k\cdot F)(1-R^2) \stackrel{d}{\to} \chi^2_k,$$
or, asymptotically equivalently,
$$(n-K)R^2 \stackrel{d}{\to} \chi^2_k.$$
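A sketch of how the $(n-K)R^2$ test could be computed in practice; the simulated data and the use of `scipy` for the $\chi^2_k$ critical value are illustrative.

```python
# (n-K)R^2 test of the joint significance of all slope coefficients.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)                       # H0 true: no regressor matters
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
R2 = 1 - (e @ e) / np.sum((y - y.mean())**2)
stat = (n - (k + 1)) * R2                    # K = k + 1
print(stat, chi2.ppf(0.95, k))               # reject H0 if stat exceeds the critical value
```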
where
V = E(Xt Xt0 "2t ):
Therefore, when J = 1;we have
p
n(R ^ r) d
p ! N (0; 1) as n ! 1:
1
RQ V Q R 1 0
p p
Given Q^! Q and V^ ! V; where V^ = X0 D(e)D(e)0 X=n; and the Slutsky theorem,
we can de…ne a robust t-test statistic
p
n(R ^ r) d
Tr = q ! N (0; 1) as n ! 1
RQ ^ 1 V^ Q
^ 1 R0
29
when H0 holds. By robustness, we mean that Tr is valid no matter whether there
exists conditional heteroskedasticity.
p p
^!
under H0 : Given Q Q and V^ ! V; the robust Wald test statistic
p p
W = n(R ^ ^ 1 V^ Q
r)0 [RQ ^ 1 R0 ] 1
n(R ^ r)
d 2
! J
1X
n
V^ = Xt et et Xt0
n t=1
X0 D(e)D(e)0 X
= ;
n
where D(e)= diag(e1 ; e2 ; :::; en ):
Remarks:
Under conditional heteroskedasticity, the test statistics J F and (n K)R2 cannot
be used.
There will exist Type I errors because J F or (n K)R2 will be no longer asymp-
totically 2 -distributed under H0 .
Although the general form of the Wald test statistic developed here can be used
no matter whether there exists conditional homoskedasticity, this general form of test
statistic may perform poorly in small samples. Thus, if one has information that the
error term is conditionally homoskedastic, one should use the test statistics derived under
conditional homoskedasticity, which will perform better in small sample sizes. Because
of this reason, it is important to test whether conditional homoskedasticity holds.
There have been many tests for conditional homoskedasticity. Here, we introduce a popular one due to White (1980).
First, suppose $\varepsilon_t$ were observed, and consider the auxiliary regression
$$\varepsilon_t^2 = \gamma_0 + \sum_{j=1}^{k}\gamma_jX_{jt} + \sum_{1\le j\le l\le k}\gamma_{jl}X_{jt}X_{lt} + v_t
= \mathrm{vech}(X_tX_t')'\gamma + v_t = U_t'\gamma + v_t,$$
where $\mathrm{vech}(X_tX_t')$ is an operator that stacks all lower triangular elements of the matrix $X_tX_t'$ into a $\frac{K(K+1)}{2} \times 1$ column vector. For example, when $X_t = (1, X_{1t}, X_{2t})'$, we have $\mathrm{vech}(X_tX_t') = (1, X_{1t}, X_{2t}, X_{1t}^2, X_{1t}X_{2t}, X_{2t}^2)'$. Under the null hypothesis of conditional homoskedasticity, the slope coefficients in this auxiliary regression are all zero, and
$$(n-J-1)\tilde R^2 \stackrel{d}{\to} \chi^2_J,$$
where $J = \frac{K(K+1)}{2} - 1$ is the number of regressors except the intercept and $\tilde R^2$ is the coefficient of determination from the auxiliary regression.
Unfortunately, $\varepsilon_t$ is not observable. However, we can replace $\varepsilon_t$ with $e_t = Y_t - X_t'\hat\beta$ and run the following feasible auxiliary regression:
$$e_t^2 = \gamma_0 + \sum_{j=1}^{k}\gamma_jX_{jt} + \sum_{1\le j\le l\le k}\gamma_{jl}X_{jt}X_{lt} + \tilde v_t = \mathrm{vech}(X_tX_t')'\gamma + \tilde v_t.$$
It can be shown that the replacement of $\varepsilon_t^2$ by $e_t^2$ has no impact on the asymptotic $\chi^2_J$ distribution of $(n-J-1)R^2$. The proof, however, is rather tedious. For the details of the proof, see White (1980). Below, we provide some intuition.
Question: Why does the use of e2t in place of "2t have no impact on the asymptotic
distribution of (n J 1)R2 ?
To explain this, we put Ut = vech(Xt Xt0 ): Then the infeasible auxiliary regression is
"2t = Ut0 0
+ vt :
p 0 d
We have n(~ ) ! N (0; 2v Quu1 ); where Quu = E(Ut Ut0 ); and under H0 : R 0 = 0;
where R is a J J diagonal matrix with the …rst diagonal element being 0 and other
diagonal elements being 1, we have
p d
nR~ ! N (0; 2v RQuu1 R0 );
where ~ is the OLS estimator and 2v = E(vt2 ): This implies R~ = OP (n 1=2 ); which
vanishes to zero in probability at rate n 1=2 : It is this term that yields the asymptotic
2 ~ 2 ; which is asymptotically equivalent to the test statistic
J distribution for (n J 1)R
p p
^ uu1 R0 ]
n(R~ )0 [s2v RQ 1
nR~ :
Now suppose we replace "2t with e2t ; and consider the auxiliary regression
e2t = Ut0 0
+ v~t :
^ = ~ + ^ + ^;
where ~ is the OLS estimator of the infeasible auxiliary regression, ^ is the e¤ect of
the second term, and ^ is the e¤ect of the third term. For the third term, Xt "t is
uncorrelated with Ut given E("t jXt ) = 0: Therefore, this term, after scaled by the factor
^ o
that itself vanishes to zero in probability at the rate n 1=2 ; will vanish to zero
in probability at a rate n 1 ; that is, ^ = OP (n 1 ): This is expected to have negligible
impact on the asymptotic distribution of the test statistic. For the second term, Xt Xt0
is perfectly correlated with Ut : However, it is scaled by a factor of jj ^ o 2
jj rather than
by jj ^ o
jj only. As a consequence, the regression coe¢ cient of ( ^ ) Xt Xt0 ( ^
o 0 o
)
on Ut will also vanish to zero at rate n ; that is, ^ = OP (n ): Therefore, it also has
1 1
Question: How to test conditional homoskedasticity if E("4t jXt ) is not a constant (i.e.,
E("4t jXt ) 6= 4 for some 4 under H0 )? This corresponds to the case when vt displays
conditional heteroskedasticity.
Question: Suppose White’s (1980) test rejects the null hypothesis of conditional ho-
moskedasticity, one can then conclude that there exists evidence of conditional het-
eroskedasticity. What conclusion can one reach if White’s test fails to reject H0 :
E("2t jXt ) = 2 ?
Because White (1980) considers a quadratic alternative to test H0 ; it may have no
power against some conditional heteroskedastic alternatives for which E("2t jXt ) does not
depend on the quadratic form of Xt but depends on cubic or higher order polynomials of
Xt : Thus, when White’s test fails to reject H0 ; one can only say that we …nd no evidence
against H0 :
However, when White’s test fails to reject H0 ; we have
E("2t Xt Xt0 ) = 2
E(Xt Xt0 ) = 2
Q
The validity of White’s test procedure and associated interpretations is built upon
the assumption that the linear regression model is correctly speci…ed for the condi-
tional mean E(Yt jXt ):Suppose the linear regression model is not correctly speci…ed, i.e.,
E(Yt jXt ) 6= Xt0 for all : Then the OLS ^ will converge to = [E(Xt Xt0 )] 1 E(Xt Yt );
the best linear least squares approximation coe¢ cient, and E(Yt jXt ) 6= Xt0 . In this
case, the estimated residual
et = Yt Xt0 ^
= "t + [E(Yt jXt ) Xt0 ] + Xt0 ( ^ );
where "t = Yt E(Yt jXt ) is the true disturbance with E("t jXt ) = 0; the estimation
error Xt0 ( ^ ) vanishes to 0 as n ! 1; but the approximation error E(Yt jXt ) X 0
t
never disappears. In other words, when the linear regression model is misspeci…ed for
E(Yt jXt ); the estimated residual et will contain not only the true disturbance but also the
approximation error which is a function of Xt : This will result in a spurious conditional
heteroskedasticity when White’s test is used. Therefore, before using White’s test or
any other tests for conditional heteroskedasticity, it is important to …rst check whether
the linear regression model is correctly speci…ed. For tests of correct speci…cation of a
linear regression model, see Hausman’s test in Chapter 7 and other speci…cation tests
mentioned there.
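Before turning to empirical applications, the following sketch shows one way White's test could be coded for the case $X_t = (1, X_{1t}, X_{2t})'$; the helper name `white_test` and the 5% level are illustrative choices, not part of White (1980) itself.

```python
# White's test: regress squared OLS residuals on levels, squares, and cross products,
# then compare (n - J - 1)R^2 with the chi-squared(J) critical value.
import numpy as np
from scipy.stats import chi2

def white_test(y, X1, X2):
    n = len(y)
    X = np.column_stack([np.ones(n), X1, X2])
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)          # OLS residuals
    U = np.column_stack([np.ones(n), X1, X2, X1**2, X1*X2, X2**2])
    g = np.linalg.solve(U.T @ U, U.T @ e**2)               # auxiliary regression of e_t^2
    v = e**2 - U @ g
    R2 = 1 - (v @ v) / np.sum((e**2 - (e**2).mean())**2)
    J = U.shape[1] - 1                                     # regressors excluding intercept
    stat = (n - J - 1) * R2
    return stat, chi2.ppf(0.95, J), chi2.sf(stat, J)
```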
4.8 Empirical Applications
4.9 Conclusion
In this chapter, within the context of i.i.d. observations, we have relaxed some key
assumptions of the classical linear regression model. In particular, we do not assume
conditional normality for "t and allow for conditional heteroskedasticity. Because the
exact …nite sample distribution of the OLS is generally unknown, we have relied on as-
ymptotic analysis. It is found that for large samples, the results of the OLS estimator
^ and related test statistics (e.g., t-test statistic and F -test statistic) are still applicable
under conditional homoskedasticity. Under conditional heteroskedasticity, however, the
statistic properties of ^ are di¤erent from those of ^ under conditional homoskedas-
ticity, and as a consequence, the conventional t-test and F -test are invalid even when
the sample size n ! 1: One has to use White’s (1980) heteroskedasticity-consistent
variance-covariance matrix estimator for the OLS estimator ^ and use it to construct
robust test statistics. A direct test for conditional heteroskedasticity, due to White
(1980), is described.
EXERCISES
4.1. Suppose Assumptions 3.1, 3.3 and 3.5 hold. Show (a) s2 converges in probability
to 2 ; and (b) s converges in probability to .
4.2. Let Z1 ; :::; Zn be a random sample from a population with mean and variance
2
. Show that
p p
n(Zn ) n(Zn )
E = 0 and V ar = 1:
4.4. Let the sample space S be the closed interval [0,1] with the uniform probability
distribution. De…ne Z(s) = s for all s 2 [0; 1]: Also, for n = 1; 2; :::; de…ne a sequence of
random variables (
s + sn if s 2 [0; 1 n 1 ]
Zn (s) =
s + 1 if s 2 (1 n 1 ; 1]:
(a) Does Zn converge in quadratic mean to Z?
(a) Does Zn converge in probability to Z?
(b) Does Zn converge almost surely to Z?
o
Yt = Xt0 + "t ; t = 1; :::; n;
o
for some unknown parameter and some unobservable disturbance "t ;
Assumption 1.2 [i.i.d.] The K K matrix E(Xt Xt0 ) = Q is nonsingular and …nite;
^ =Q
^ 1 V^ Q
^ 1 p
! ;
4.7. Put Q = E(Xt Xt0 ); V = E("2t Xt Xt0 ) and 2 = E("2t ): Suppose there exists con-
ditional heteroskedasticity, and cov("2t ; Xt Xt0 ) = V 2
Q is positive semi-de…nite, i.e,
2
(Xt ) is positively correlated with Xt Xt : Show that Q 1 V Q 1
0 2
Q 1 is positive
semi-de…nite.
o
Yt = Xt0 + "t ;
o
for some unknown parameter and unobservable random disturbance "t :
Assumption 2.3:
(i) Wt = W (Xt ) is a positive function of Xt ;
(ii) The K K matrix E (Xt Wt Xt0 ) = Qw is …nite and nonsingular.
(iii) E(Wt8 ) C < 1; E(Xjt 8
) C < 1 for all 0 j k; and E("4t ) C;
Assumption 2.4: Vw = E(Wt2 Xt Xt0 "2t ) is …nite and nonsingular.
o
We consider the so-called weighted least squares (WLS) estimator for :
! 1
X
n X
n
^w = n 1
Xt Wt Xt0 n 1
Xt Wt Yt :
t=1 t=1
X
n
min Wt (Yt Xt0 )2 :
t=1
4.9. Consider the problem of testing conditional homoskedasticity (H0 : E("2t jXt ) = 2
)
for a linear regression model
Yt = Xt0 o + "t ;
where Xt is a K 1 vector consisting of an intercept and explanatory variables. To
test conditional homoskedasticity, we consider the auxiliary regression
Show that under H0 : E("2t jXt ) = 2 ; (a) E(vt jXt ) = 0, and (b) E(vt2 jXt ) = 2
v if
and only if E("4t jXt ) = 4 for some constant 4 :
4.10. Consider the problem of testing conditional homoskedasticity (H0 : E("2t jXt ) =
2
) for a linear regression model
o
Yt = Xt0 + "t ;
Suppose Assumptions 4.1, 4.2, 4.3, 4.4, 4.7 hold, and E("4t jXt ) 6= 4 : That is,
E("4t jXt ) is a function of Xt :
(a) Show var(vt jXt ) 6= 2v under H0 : That is, the disturbance vt in the auxiliary
regression model displays conditional heteroskedasticity.
(b) Suppose "t is directly observable. Construct an asymptotically valid test for the
null hypothesis H0 of conditional homoskedasticity of "t . Justify your reasoning and test
statistic.
CHAPTER 5 LINEAR REGRESSION
MODELS WITH DEPENDENT
OBSERVATIONS
Abstract: In this chapter, we will show that the asymptotic theory for linear regression
models with i.i.d. observations carries over to linear time series regression models with
martingale di¤erence sequence disturbances. Some basic concepts in time series analysis
are introduced, and some tests for serial correlation are described.
Motivation

Consider a regression model with a lagged dependent variable as regressor; here, $X_t = (1, Y_{t-1})'$. This is called an autoregression model, which violates the i.i.d. assumption for $\{Y_t, X_t'\}'_{t=1}^{n}$ in Chapter 4. Here, we have $E(\varepsilon_t|X) \ne 0$, because $X_{t+j}$ contains $\varepsilon_t$ when $j > 0$. Hence, Assumption 3.2 (strict exogeneity) fails.
Question: Under what conditions will the asymptotic theory developed in Chapter 4
carry over to linear regression models with dependent observations?
De…nition 5.1 [Stochastic Time Series Process]: A stochastic time series fZt g is a
sequence of random variables or random vectors indexed by time t 2 f:::; 0; 1; 2; :::g and
governed by some probability law ( ; F; P ); where is the sample space, F is a -…eld,
and P is a probability measure, with P : F ! [0; 1]:
Remarks:
More precisely, we can write Zt = Z(t; ); and its realization zt = Z(t; !); where
! 2 is a basic outcome in sample space .
For each !; we can obtain a sample path zt = Z(t; !) of the process fZt g as a
deterministic function of time t: Di¤erent !’s will give di¤erent sample paths.
The dynamics of fZt g is completely determined by the transition probability of Zt ;
that is, the conditional probability of Zt given its past history It 1 = fZt 1 ; Zt 2 ; :::g.
Time Series Random sample: Consider a subset (or a segment) of a time series
process fZt g for t = 1; ; n: This is called a time series random sample of size n;
denoted as
Z n = fZ1 ; ; Zn g0 :
Any realization of this random sample is called a data set, denoted as
z n = fz1 ; ; zn g0 :
Question: Why can the dynamics of fZt g be completely captured by its conditional
probability distribution?
Consider the random sample $Z^n$. It is well known from basic statistics courses that the joint probability distribution of the random sample $Z^n$ can be factored as
$$f_{Z^n}(z^n) = \prod_{t=1}^{n}f_{Z_t|I_{t-1}}(z_t|I_{t-1}),$$
where by convention, for $t = 1$, $f(z_1|I_0) = f(z_1)$ is the marginal density of $Z_1$. Thus, the conditional density function $f_{Z_t|I_{t-1}}(z|I_{t-1})$ completely describes the joint probability distribution of the random sample $Z^n$.
Example 1: Let Zt be the US Gross Domestic Product (GDP) in quarter t: Then the
quarterly records of U.S. GDP from the …rst quarter of 1961 to the last quarter of 2001
constitute a time series data set, denoted as z n = (z1 ; ; zn )0 ; with n = 164.
Example 2: Let Zt be the S&P 500 closing price index at day t: Then the daily records
of S & P 500 index from July 2, 1962 to December 31, 2001 constitute a time series data
set, denoted as z n = (z1 ; ; zn )0 ; with n = 9987.
Here is a fundamental feature of economic time series: each random variable Zt only
has one observed realization zt in practice. It is impossible to obtain more realizations
for each economic variable Zt ; due to the nonexperimental nature of an economic system.
In order to “aggregate” realizations from di¤erent random variables fZt gnt=1 ; we need
to impose stationarity— a concept of stability for certain aspects of the probability law
fZt jIt 1 (zt jIt 1 ). For example, we may need to assume:
(i) The marginal probability of each Zt shares some common features (e.g., the same
mean, the same variance).
(ii) The relationship (joint distribution) between Zt and It 1 is time-invariant in certain
aspects (e.g., cov(Zt ; Zt j ) = (j) does not depend on time t; it only depends on the
time distance j).
With these assumptions, observations from di¤erent random variables fZt g can be
viewed to contain some common features of the data generating process, so that one can
conduct statistical inference by pooling them together.
Stationarity
A stochastic time series fZt g can be stationary or nonstationary. There are at least
two notions for stationarity. The …rst is strict stationarity.
Definition 5.2 [Strict Stationarity]: A stochastic time series process $\{Z_t\}$ is strictly stationary if for any admissible $t_1, t_2, \ldots, t_m$, the joint probability distribution of $\{Z_{t_1}, Z_{t_2}, \ldots, Z_{t_m}\}$ is the same as the joint distribution of $\{Z_{t_1+k}, Z_{t_2+k}, \ldots, Z_{t_m+k}\}$ for all integers $k$. That is,
$$f_{Z_{t_1} Z_{t_2} \cdots Z_{t_m}}(z_1, \ldots, z_m) = f_{Z_{t_1+k} Z_{t_2+k} \cdots Z_{t_m+k}}(z_1, \ldots, z_m).$$
Remarks:
If Zt is strictly stationary, the conditional probability of Zt given It 1 will have a time-
invariant functional form. In other words, the probabilistic structure of a completely
stationary process is invariant under a shift of the time origin.
Strict stationarity is also called “complete stationarity”, because it characterizes the
time-invariance property of the entire joint probability distribution of the process fZt g.
No moment condition on fZt g is needed when de…ning strict stationarity. Thus, a
strictly stationary process may not have …nite moments (e.g., var(Zt ) = 1). However,
if moments (e.g., E(Zt )) and cross-moments (e:g:; E(Zt Zt j )) of fZt g exist, then they
are time-invariant when fZt g is strictly stationary.
Any measurable transformation of a strictly stationary process is still strictly sta-
tionary.
Strict stationarity implies identical distribution for each of the Zt : Thus, although a
strictly stationary time series data are realizations from di¤erent random variables, they
can be viewed as realizations from the same (marginal) population distribution.
Example 3: Suppose $\{Z_t\}$ is an i.i.d. Cauchy$(0,1)$ sequence with marginal pdf
$$f(z) = \frac{1}{\pi(1+z^2)}, \quad -\infty < z < \infty.$$
Note that $Z_t$ has no moments. Consider $\{Z_{t_1}, \ldots, Z_{t_m}\}$. Because the joint distribution
$$f_{Z_{t_1}Z_{t_2}\cdots Z_{t_m}}(z_1, \ldots, z_m) = \prod_{j=1}^{m} f(z_j)$$
does not depend on the time indices, $\{Z_t\}$ is strictly stationary.
We now introduce another concept of stationarity based on the time-invariance prop-
erty of the joint moments of fZt1 ; Zt2 ; :::; Ztm g:
Definition 5.3 [N-th order stationarity]: The time series process $\{Z_t\}$ is said to be stationary up to order $N$ if, for any admissible $t_1, t_2, \ldots, t_m$ and any $k$, all the joint moments up to order $N$ of $\{Z_{t_1}, Z_{t_2}, \ldots, Z_{t_m}\}$ exist and equal the corresponding joint moments up to order $N$ of $\{Z_{t_1+k}, \ldots, Z_{t_m+k}\}$. That is,
$$E\left(Z_{t_1}^{n_1}Z_{t_2}^{n_2}\cdots Z_{t_m}^{n_m}\right) = E\left(Z_{t_1+k}^{n_1}Z_{t_2+k}^{n_2}\cdots Z_{t_m+k}^{n_m}\right)$$
for all nonnegative integers $n_1, n_2, \ldots, n_m$ with $n_1 + n_2 + \cdots + n_m \leq N$.

Remarks:

Setting $n_2 = n_3 = \cdots = n_m = 0$, the marginal moments of $Z_t$ up to order $N$ are time-invariant. On the other hand, for $n_1 + n_2 \leq N$, the pairwise joint product moment $E(Z_{t_1}^{n_1}Z_{t_2}^{n_2}) = E(Z_{t_1+k}^{n_1}Z_{t_2+k}^{n_2})$ depends only on the time distance $t_2 - t_1$.

We now consider a special case: $N = 2$. This yields a concept called weak stationarity.
Definition 5.4 [Weak Stationarity]: A stochastic time series process $\{Z_t\}$ is weakly stationary if
(i) $E(Z_t) = \mu$ for all $t$;
(ii) $\mathrm{var}(Z_t) = \sigma^2 < \infty$ for all $t$;
(iii) $\mathrm{cov}(Z_t, Z_{t-j}) = \gamma(j)$ is only a function of the lag order $j$ for all $t$.
Remarks:
Strict stationarity is de…ned in terms of the “time invariance”property of the entire
distribution of fZt g; while weak-stationarity is de…ned in terms of the “time-invariance”
property in the …rst two moments (means, variances and covariances) of fZt g. Suppose
all moments of fZt g exist. Then it is possible that the …rst two moments are time-
invariant but the higher order moments are time-varying. In other words, a process
$\{Z_t\}$ can be weakly stationary but not strictly stationary. Conversely, Example 3 (the i.i.d. Cauchy process) shows that a process can be strictly stationary but not weakly stationary, because the first two moments simply do not exist.
Weak stationarity is also called “covariance-stationarity”, or “2nd order stationarity”
because it is based on the time-invariance property of the …rst two moments. It does
not require identical distribution for each of the Zt : The higher order moments of Zt can
be di¤erent for di¤erent t’s:
Example 4: An i.i.d. Cauchy(0; 1) process is strictly stationary but not weakly sta-
tionary.
A special but important weakly stationary time series is a process with zero auto-
correlations.
Definition 5.5 [White Noise]: A time series process $\{Z_t\}$ is a white noise (or serially uncorrelated) process if
(i) $E(Z_t) = 0$;
(ii) $\mathrm{var}(Z_t) = \sigma^2$;
(iii) $\mathrm{cov}(Z_t, Z_{t-j}) = \gamma(j) = 0$ for all $j > 0$.
Remarks:
Later we will explain why such a process is called a white noise (WN) process. WN
is a basic building block for linear time series modeling.
When fZt g is a white noise and fZt g is a Gaussian process (i.e., any …nite set
(Zt1 ; Zt2 ; :::; Ztm ) of fZt g has a joint normal distribution), we call fZt g is a Gaussian
white noise. For a Gaussian white noise process, fZt g is an i.i.d. sequence.
Consider an AR(1) process
$$Z_t = \rho Z_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim \text{white noise}(0, \sigma^2),$$
with $|\rho| < 1$. Then $Z_t = \sum_{j=0}^{\infty}\rho^j\varepsilon_{t-j}$, and
$$E(Z_t) = 0, \quad \mathrm{var}(Z_t) = \frac{\sigma^2}{1-\rho^2}, \quad \gamma(j) = \frac{\sigma^2\rho^{|j|}}{1-\rho^2}, \quad j = 0, \pm 1, \pm 2, \ldots.$$
Here, $\varepsilon_t$ may be interpreted as a random shock or an innovation that drives the movement of the process $\{Z_t\}$ over time.

This is a weakly stationary process. For an MA(q) process, we have $\gamma(j) = 0$ for all $|j| > q$.
Under rather mild regularity conditions, a zero-mean weakly stationary process can be represented by an MA($\infty$) process
$$Z_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}, \quad \varepsilon_t \sim \mathrm{WN}(0, \sigma^2),$$
where $\sum_{j=0}^{\infty}\psi_j^2 < \infty$. This is called Wold's decomposition. The partial derivative
$$\frac{\partial Z_{t+j}}{\partial\varepsilon_t} = \psi_j, \quad j = 0, 1, \ldots,$$
is called the impulse response function of the time series process $\{Z_t\}$ with respect to a random shock $\varepsilon_t$. This function characterizes the impact of a random shock $\varepsilon_t$ on the immediate and subsequent observations $\{Z_{t+j}, j \geq 0\}$. For a weakly stationary process, the impact of any shock on a future $Z_{t+j}$ will always diminish to zero as the lag order $j \to \infty$, because $\psi_j \to 0$. The ultimate cumulative impact of $\varepsilon_t$ on the process $\{Z_t\}$ is the sum $\sum_{j=0}^{\infty}\psi_j$.
The function $\gamma(j) = \mathrm{cov}(Z_t, Z_{t-j})$ is called the autocovariance function of the weakly stationary process $\{Z_t\}$, where $j$ is a lag order. It characterizes the (linear) serial dependence of $Z_t$ on its own lagged value $Z_{t-j}$. Note that $\gamma(j) = \gamma(-j)$ for all integers $j$.

The normalized function $\rho(j) = \gamma(j)/\gamma(0)$ is called the autocorrelation function of $\{Z_t\}$. It has the property that $|\rho(j)| \leq 1$. The plot of $\rho(j)$ as a function of $j$ is called the autocorrelogram of the time series process $\{Z_t\}$. It can be used to judge which linear time series model (e.g., AR, MA, or ARMA) should be used to fit a particular time series data set.
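For concreteness, here is a minimal Python sketch (not part of the original notes; the AR(1) coefficient, sample size and seed are illustrative) that simulates an AR(1) process and compares its sample autocorrelogram with the theoretical value $\rho(j) = \rho^j$:

```python
# Minimal sketch: sample autocorrelogram of a simulated AR(1) process.
import numpy as np

rng = np.random.default_rng(0)
n, rho = 1000, 0.6
eps = rng.standard_normal(n)
z = np.zeros(n)
for t in range(1, n):
    z[t] = rho * z[t - 1] + eps[t]

def sample_acf(x, max_lag):
    """rho_hat(j) = gamma_hat(j)/gamma_hat(0), gamma_hat(j) = n^{-1} sum (x_t - xbar)(x_{t-j} - xbar)."""
    x = x - x.mean()
    n = len(x)
    gamma0 = np.sum(x * x) / n
    return np.array([np.sum(x[j:] * x[:n - j]) / n / gamma0 for j in range(1, max_lag + 1)])

for j, r in enumerate(sample_acf(z, 10), start=1):
    print(f"lag {j}: sample {r:+.3f}   theory {rho**j:+.3f}")
```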
The function
$$h(\omega) = \frac{1}{2\pi}\sum_{j=-\infty}^{\infty}\gamma(j)e^{-ij\omega}, \quad \omega \in [-\pi, \pi],$$
where $i = \sqrt{-1}$, is called the power spectral density of the process $\{Z_t\}$. The normalized version
$$f(\omega) = \frac{h(\omega)}{\gamma(0)} = \frac{1}{2\pi}\sum_{j=-\infty}^{\infty}\rho(j)e^{-ij\omega}, \quad \omega \in [-\pi, \pi],$$
is called the normalized (standardized) spectral density of $\{Z_t\}$.
The spectral density $h(\omega)$ is widely used in economic analysis. For example, it can be used to search for business cycles. Specifically, a frequency $\omega_0$ corresponding to a spectral peak is closely associated with a business cycle with periodicity $T_0 = 2\pi/\omega_0$. Intuitively, a time series can be decomposed as the sum of many cyclical components with different frequencies $\omega$, and $h(\omega)$ is the strength or magnitude of the component with frequency $\omega$. When $h(\omega)$ has a peak at $\omega_0$, it means that the cyclical component with frequency $\omega_0$ or periodicity $T_0 = 2\pi/\omega_0$ dominates all other frequencies. Consequently, the whole time series behaves as mainly having a cycle with periodicity $T_0$.
The functions h(!) and (j) are Fourier transforms of each other. Thus, they contain
the same information on serial dependence in fZt g: In time series analysis, the use of
(j) is called the time domain analysis, and the use of h(!) is called the frequency
domain analysis. Which tool to use depends on the convenience of the user. In some
applications, the use of (j) is simpler and more intuitive, while in other applications,
the use of h(!) is more enlightening. This is exactly the same as the case that it is more
convenient to use Chinese in China, while it is more convenient to use English in U.S.
Example 8: Hamilton, James (1994, Time Series Analysis): Business cycles of U.S.
industrial production
Example 9: Steven Durlauf (1990, Journal of Monetary Economics): Income tax rate
changes
For a serially uncorrelated sequence, the spectral density $h(\omega)$ is flat as a function of frequency $\omega$:
$$h(\omega) = \frac{1}{2\pi}\gamma(0) = \frac{\sigma^2}{2\pi} \quad \text{for all } \omega \in [-\pi, \pi].$$
This is analogous to the power (or energy) spectral density of a physical white color light. It is for this reason that we call a serially uncorrelated time series a white noise process.

Intuitively, a white color light can be decomposed via a lens as the sum of equal-magnitude components of different frequencies. That is, a white color light has a flat physical spectral density function.
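The following Python sketch (an illustration under assumed parameter values, not from the notes) evaluates $h(\omega)$ for an AR(1) process by truncating the sum of autocovariances, and contrasts it with the flat white noise spectrum $\sigma^2/(2\pi)$:

```python
# Minimal sketch: spectral density of an AR(1) versus the flat white noise spectrum.
import numpy as np

def ar1_spectral_density(omega, rho, sigma2=1.0, max_lag=200):
    """Approximate h(omega) = (1/2pi) * sum_j gamma(j) exp(-i j omega) with AR(1) autocovariances."""
    j = np.arange(-max_lag, max_lag + 1)
    gamma = sigma2 * rho ** np.abs(j) / (1.0 - rho ** 2)
    return np.real(np.sum(gamma * np.exp(-1j * j * omega))) / (2 * np.pi)

for w in np.linspace(0.0, np.pi, 5):
    h_ar1 = ar1_spectral_density(w, rho=0.5)
    h_wn = 1.0 / (2 * np.pi)                       # white noise with sigma2 = 1 is flat
    print(f"omega={w:.2f}  AR(1): {h_ar1:.3f}   white noise: {h_wn:.3f}")
```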
It is important to point out that a white noise need not be i.i.d., as is illustrated by the example below. Consider the ARCH(1) process
$$Z_t = h_t^{1/2}\varepsilon_t, \quad h_t = \alpha_0 + \alpha_1 Z_{t-1}^2, \quad \varepsilon_t \sim \text{i.i.d.}(0,1).$$
This was first proposed by Engle (1982) and has been widely used to model volatility in economics and finance. We have $E(Z_t|I_{t-1}) = 0$ and $\mathrm{var}(Z_t|I_{t-1}) = h_t$, where $I_{t-1} = \{Z_{t-1}, Z_{t-2}, \ldots\}$ is the information set containing all past history of $Z_t$.

When $\alpha_1 < 1$, $\{Z_t\}$ is a stationary white noise. But it is not weakly stationary if $\alpha_1 = 1$, because $\mathrm{var}(Z_t) = \infty$. In both cases, $\{Z_t\}$ is strictly stationary (e.g., Nelson 1990, Journal of Econometrics).

Although $\{Z_t\}$ is a white noise, it is not an i.i.d. sequence because the autocorrelation in $\{Z_t^2\}$ is $\mathrm{corr}(Z_t^2, Z_{t-j}^2) = \alpha_1^{|j|}$ for $j = 0, 1, 2, \ldots$. In other words, an ARCH process is uncorrelated in levels but is autocorrelated in squares.
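A minimal simulation sketch (illustrative parameters, not from the notes) makes this point numerically: the lag-1 autocorrelation of the level of an ARCH(1) process is near zero, while that of its square is near $\alpha_1$:

```python
# Minimal sketch: ARCH(1) is uncorrelated in levels but autocorrelated in squares.
import numpy as np

rng = np.random.default_rng(1)
n, a0, a1 = 5000, 0.2, 0.5
z = np.zeros(n)
for t in range(1, n):
    h_t = a0 + a1 * z[t - 1] ** 2            # conditional variance
    z[t] = np.sqrt(h_t) * rng.standard_normal()

def acf1(x):
    x = x - x.mean()
    return np.sum(x[1:] * x[:-1]) / np.sum(x * x)

print("lag-1 autocorrelation of Z_t   :", round(acf1(z), 3))       # near 0
print("lag-1 autocorrelation of Z_t^2 :", round(acf1(z ** 2), 3))  # near alpha_1 = 0.5
```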
Nonstationarity
Usually, we call fZt g a nonstationary time series when it is not covariance-stationary.
In time series econometrics, there have been two types of nonstationary processes that
display similar sample paths when the sample size is not large but have quite di¤erent
implications. We …rst discuss a nonstationary process called trend-stationary process.
A trend-stationary process takes the form $Z_t = g(t) + \varepsilon_t$ (e.g., a linear trend $g(t) = \alpha_0 + \alpha_1 t$), where $\{\varepsilon_t\}$ is a weakly stationary process. The reason that $\{Z_t\}$ is called trend-stationary is that it becomes weakly stationary after the deterministic trend is removed.

Question: What happens if we difference the process, $\Delta Z_t = Z_t - Z_{t-1}$?

Next, consider a random walk with drift,
$$Z_t = \beta_0 + Z_{t-1} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ is i.i.d.$(0, \sigma^2)$. For simplicity, we assume $Z_0 = 0$. Then
$$E(Z_t) = \beta_0 t, \quad \mathrm{var}(Z_t) = \sigma^2 t, \quad \mathrm{cov}(Z_t, Z_{t-j}) = \sigma^2(t-j).$$
Note that $\{Z_t\}$ has a deterministic linear time trend but with an increasing variance over time. The impulse response function $\partial Z_{t+j}/\partial\varepsilon_t = 1$ for all $j \geq 0$, which never dies off to zero as $j \to \infty$.
Definition 5.7 [Martingale]: A time series process $\{Z_t\}$ is a martingale with drift if
$$Z_t = \mu + Z_{t-1} + \varepsilon_t,$$
and $\{\varepsilon_t\}$ satisfies
$$E(\varepsilon_t|I_{t-1}) = 0 \quad \text{a.s.},$$
where $I_{t-1}$ is the $\sigma$-field generated by $\{\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots\}$. We say that $\{\varepsilon_t\}$ is a martingale difference sequence (MDS).
For example, suppose the log stock price follows
$$\ln P_t = \ln P_{t-1} + \varepsilon_t,$$
where $E(\varepsilon_t|I_{t-1}) = 0$. Then $\varepsilon_t = \ln P_t - \ln P_{t-1} \approx (P_t - P_{t-1})/P_{t-1}$ is the stock relative price change or stock return (if no dividend) from time $t-1$ to time $t$, which can be viewed as a proxy for the new information arrival from time $t-1$ to time $t$ that drives the stock price change in the same period. For this reason, $\varepsilon_t$ is also called an innovation sequence. The MDS property of $\varepsilon_t$ implies that the price change $\varepsilon_t$ is unpredictable using the past information available at time $t-1$, and the market is called informationally efficient. Thus, the best predictor for the stock price at time $t$ using the information available at time $t-1$ is $P_{t-1}$; that is, $E(P_t|I_{t-1}) = P_{t-1}$.
A random walk is a martingale because IID with zero mean implies E("t jIt 1 ) =
E("t ) = 0: However, the converse is not true.
Consider, for example, an ARCH(1) process
$$\varepsilon_t = z_t h_t^{1/2}, \quad h_t = \alpha_0 + \alpha_1\varepsilon_{t-1}^2, \quad \{z_t\} \sim \text{i.i.d.}(0,1).$$
Then
$$E(\varepsilon_t|I_{t-1}) = 0, \quad \mathrm{var}(\varepsilon_t|I_{t-1}) = h_t = \alpha_0 + \alpha_1\varepsilon_{t-1}^2,$$
where $I_{t-1}$ denotes the information available at time $t-1$. Clearly $\{\varepsilon_t\}$ is an MDS but not i.i.d., because its conditional variance $h_t$ is time-varying (depending on the past information set $I_{t-1}$).
Since the only condition for an MDS is $E(\varepsilon_t|I_{t-1}) = 0$ a.s., an MDS need not be strictly stationary or weakly stationary. However, if it is assumed that $\mathrm{var}(\varepsilon_t) = \sigma^2$ exists and is constant, then a zero-mean MDS is weakly stationary.

When the variance $E(\varepsilon_t^2)$ exists, we have the following directional relationships: a zero-mean i.i.d. sequence is an MDS, and an MDS is a white noise; the converses are not true. For example, consider
$$\varepsilon_t = z_{t-1}z_{t-2} + z_t, \quad \{z_t\} \sim \text{i.i.d.}(0,1).$$
Then it can be shown that $\{\varepsilon_t\}$ is a white noise but not an MDS, because $\mathrm{cov}(\varepsilon_t, \varepsilon_{t-j}) = 0$ for all $j > 0$ but
$$E(\varepsilon_t|I_{t-1}) = z_{t-1}z_{t-2} \neq 0.$$
Question: When will the concepts of IID, MDS and White noise coincide?
When f"t g is a stationary Gaussian process. A time series is a stationary Gaussian
process if f"t1 ; "t2 ; :::; "tm g is multivariate normally distributed for any admissible sets
of integers ft1 ; t2 ; :::; tm g: Unfortunately, an important stylized fact for economic and
…nancial time series is that they are typically non-Gaussian. Therefore, it is important
to emphasize the di¤erence among the concepts of IID, MDS and White Noise in time
series econometrics.
When $\mathrm{var}(\varepsilon_t)$ exists, both random walk and martingale processes are special cases of the so-called unit root process, which is defined below.

Definition 5.8 [Unit root or difference-stationary process]: $\{Z_t\}$ is a unit root process if
$$Z_t = \beta_0 + Z_{t-1} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ is covariance-stationary $(0, \sigma^2)$.

The process $\{Z_t\}$ is called a unit root process because its autoregressive coefficient is unity. It is also called a difference-stationary process because its first difference,
$$\Delta Z_t = Z_t - Z_{t-1} = \beta_0 + \varepsilon_t,$$
is weakly stationary. Inverting the difference yields
$$Z_t = Z_0 + \beta_0 t + \sum_{s=1}^{t}\varepsilon_s,$$
where $Z_0$ is the starting value of the process $\{Z_t\}$. This is analogous to differentiation and integration in calculus, which are inverses of each other. For this reason, $\{Z_t\}$ is also called an integrated process of order 1, denoted as I(1). Obviously, a random walk and a martingale process are I(1) processes if the variance of the innovation $\varepsilon_t$ is finite.
We will assume strict stationarity in most cases in the present and subsequent chap-
ters. This implies that some economic variables have to be transformed before used in
Yt = Xt0 o + "t : Otherwise, the asymptotic theory developed here cannot be applied.
Indeed, a di¤erent asymptotic theory should be developed for unit root processes (see,
e.g., Hamilton (1994), Time Series Analysis).
Question: Why has the unit root econometrics been so popular in econometrics?
It was found in empirical studies (e.g., Nelson and Plosser (1982, Journal of Monetary
Economics)) that most macroeconomic time series display unit root properties.
Ergodicity
Next, we introduce a concept of asymptotic independence.
Consider the time series random sample
$$Z^n = (Z_1, Z_2, \ldots, Z_n)' = (W, W, \ldots, W)',$$
where $W$ is a random variable that does not depend on the time index $t$. Obviously, the stationarity condition holds. However, any realization of this random sample $Z^n$ will be
$$z^n = (w, w, \ldots, w)',$$
i.e., it will contain the same realization $w$ for all $n$ observations (no new information as $n$ increases). To avoid this, we need to impose a condition called ergodicity, which requires that $(Z_t, \ldots, Z_{t+k})$ and $(Z_{m+t}, \ldots, Z_{m+t+l})$ be asymptotically independent as their time distance $m \to \infty$.
Remarks:
Clearly, ergodicity is a concept of asymptotic independence. A strictly stationary
process that is ergodic is called ergodic stationary. If fZt g is ergodic stationary, then
ff (Zt )g is also ergodic stationary for any measurable function f ( ):
Theorem 5.2 [WLLN for Ergodic Stationary Random Samples]: Let $\{Z_t\}$ be an ergodic stationary process with $E(Z_t) = \mu$ and $E|Z_t| < \infty$. Then the sample mean
$$\bar{Z}_n = n^{-1}\sum_{t=1}^{n}Z_t \overset{p}{\to} \mu \quad \text{as } n \to \infty.$$
Consider a counterexample which does not satisfy the ergodicity condition: $Z_t = W$ for all $t$. Then $\bar{Z}_n = W$, a random variable which will not converge to $\mu$ as $n \to \infty$.
Theorem 5.3 [Central Limit Theorem for Ergodic Stationary MDS]: Suppose $\{Z_t\}$ is a stationary ergodic MDS process, with $\mathrm{var}(Z_t) \equiv E(Z_tZ_t') = V$ finite, symmetric and positive definite. Then as $n \to \infty$,
$$\sqrt{n}\,\bar{Z}_n = n^{-1/2}\sum_{t=1}^{n}Z_t \overset{d}{\to} N(0, V),$$
or equivalently,
$$V^{-1/2}\sqrt{n}\,\bar{Z}_n \overset{d}{\to} N(0, I).$$

Question: Is $\mathrm{avar}(\sqrt{n}\,\bar{Z}_n) = V = \mathrm{var}(Z_t)$? That is, is the asymptotic variance of $\sqrt{n}\,\bar{Z}_n$ equal to the variance of $Z_t$? Yes, because
$$
\mathrm{var}(\sqrt{n}\,\bar{Z}_n) = E\left[\sqrt{n}\,\bar{Z}_n\,\sqrt{n}\,\bar{Z}_n'\right]
= E\left[\left(n^{-1/2}\sum_{t=1}^{n}Z_t\right)\left(n^{-1/2}\sum_{s=1}^{n}Z_s\right)'\right]
= n^{-1}\sum_{t=1}^{n}\sum_{s=1}^{n}E(Z_tZ_s')
= n^{-1}\sum_{t=1}^{n}E(Z_tZ_t')
= E(Z_tZ_t') = V,
$$
where $E(Z_tZ_s') = 0$ for $t \neq s$ by the MDS property and the law of iterated expectations.
Here, the MDS property plays a crucial role in simplifying the asymptotic variance of $\sqrt{n}\,\bar{Z}_n$, because it implies $\mathrm{cov}(Z_t, Z_s) = 0$ for all $t \neq s$. MDS is one of the most important concepts in modern economics, particularly in macroeconomics, finance, and econometrics. For example, rational expectations theory can be characterized by an expectational error being an MDS.
Assumption 5.1 [Ergodic Stationarity]: The stochastic process $\{Y_t, X_t'\}'_{t=1}^{n}$ is jointly stationary and ergodic.

Assumption 5.2 [Linearity]: $Y_t = X_t'\beta^o + \varepsilon_t$ for some unknown $K \times 1$ parameter $\beta^o$ and unobservable disturbance $\varepsilon_t$.

Assumption 5.3 [Correct Model Specification]: $E(\varepsilon_t|X_t) = 0$ a.s. with $E(\varepsilon_t^2) = \sigma^2 < \infty$.

Assumption 5.4 [Nonsingularity]: The $K \times K$ matrix $Q = E(X_tX_t')$ is finite and nonsingular.

Assumption 5.5 [MDS]: $\{X_t\varepsilon_t\}$ is an MDS process with respect to the $\sigma$-field generated by $\{X_s\varepsilon_s, s < t\}$, and the $K \times K$ matrix $V \equiv \mathrm{var}(X_t\varepsilon_t) = E(X_tX_t'\varepsilon_t^2)$ is finite and positive definite.
Remarks:
In Assumption 5.1, the ergodic stationary process Zt = fYt ; Xt0 g0n
t=1 can be indepen-
dent or serially dependent across di¤erent time periods. we thus allow for time series
observations from a stationary stochastic process.
It is important to emphasize that the asymptotic theory to be developed below
and in subsequent chapters is not applicable to nonstationary time series. A problem
associated with nonstationary time series is the so-called spurious regression or spurious
correlation problem. If the dependent variable Yt and the regressors Xt display similar
trending behaviors over time, one is likely to obtain seemingly highly "significant" regression coefficients and high values for $R^2$, even if they do not have any causal relationship. Such results are completely spurious. In fact, the OLS estimator for a nonstationary time series regression model does not follow the asymptotic theory to be developed below. A different asymptotic theory for nonstationary time series regression models has to be used (see, e.g., Hamilton 1994). Using the correct asymptotic theory, the seemingly highly "significant" regression coefficient estimators would become insignificant in the spurious regression models.
Unlike the i.i.d. case, where E("t jXt ) = 0 is equivalent to the strict exogeneity
condition that
E("t jX) = E("t jX1 ; :::; Xt ; :::; Xn ) = 0;
the condition E("t jXt ) = 0 is weaker than E("t jX) = 0 in a time series context. In other
words, it is possible that E("t jXt ) = 0 but E("t jX) 6= 0: Assumption 5.3 allows for the
inclusion of predetermined variables in Xt ; the lagged dependent variables Yt 1 ; Yt 2 ;
etc.
For example, suppose $X_t = (1, Y_{t-1})'$. Then we obtain an AR(1) model
$$Y_t = X_t'\beta^o + \varepsilon_t = \beta_0 + \beta_1Y_{t-1} + \varepsilon_t, \quad t = 2, \ldots, n, \quad \{\varepsilon_t\} \sim \mathrm{MDS}(0, \sigma^2).$$
Then $E(\varepsilon_t|X_t) = 0$ holds if $E(\varepsilon_t|I_{t-1}) = 0$, namely if $\{\varepsilon_t\}$ is an MDS, where $I_{t-1}$ is the $\sigma$-field generated by $\{\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots\}$. However, we generally have $E(\varepsilon_t|\mathbf{X}) \neq 0$ because $E(\varepsilon_tX_{t+1}) \neq 0$.
When Xt contains an intercept the MDS condition for fXt "t g in Assumption 5.5
implies that E("t jIt 1 ) = 0; that is, f"t g is an MDS, where It 1 = f"t 1 ; "t 2 ; :::g.
Question: When can an MDS disturbance "t arise in economics and …nance?
Recall the dynamic asset pricing model under a rational expectations framework in Chapter 1. The behavior of the economic agent is characterized by the Euler equation:
$$E\left[\beta\frac{u'(C_t)}{u'(C_{t-1})}R_t \,\Big|\, I_{t-1}\right] = 1 \quad \text{or} \quad E[M_tR_t|I_{t-1}] = 1,$$
where $\beta$ is the time discount factor of the representative economic agent, $C_t$ is the consumption, $R_t$ is the asset gross return, and $M_t$ is the stochastic discount factor defined as
$$M_t = \beta\frac{u'(C_t)}{u'(C_{t-1})} = \beta + \beta\frac{u''(C_{t-1})}{u'(C_{t-1})}\Delta C_t + \text{higher order terms},$$
which may be interpreted as a risk adjustment factor.

Using the formula $\mathrm{cov}(M_t, R_t|I_{t-1}) = E(M_tR_t|I_{t-1}) - E(M_t|I_{t-1})E(R_t|I_{t-1})$ and rearranging, we can write the Euler equation as
$$E(M_t|I_{t-1})E(R_t|I_{t-1}) + \mathrm{cov}(M_t, R_t|I_{t-1}) = 1.$$
It follows that
$$E(R_t|I_{t-1}) = \frac{1}{E(M_t|I_{t-1})} + \frac{\mathrm{cov}(M_t, R_t|I_{t-1})}{\mathrm{var}(M_t|I_{t-1})}\cdot\frac{-\mathrm{var}(M_t|I_{t-1})}{E(M_t|I_{t-1})} \equiv r_t + \beta_t\lambda_t,$$
where $r_t = r(I_{t-1})$ is the riskfree interest rate, $\beta_t = \beta(I_{t-1})$ measures the market risk (the so-called investment beta factor), and $\lambda_t = \lambda(I_{t-1})$ is the price of market risk.

Equivalently, we can write a regression equation for the asset return
$$R_t = r_t + \beta_t\lambda_t + \varepsilon_t, \quad \text{where } E(\varepsilon_t|I_{t-1}) = 0.$$
5.3 Consistency of OLS

We first investigate the consistency of the OLS estimator $\hat{\beta}$. Recall
$$\hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \hat{Q}^{-1}n^{-1}\sum_{t=1}^{n}X_tY_t,$$
where, as before,
$$\hat{Q} = n^{-1}\sum_{t=1}^{n}X_tX_t'.$$
Substituting $Y_t = X_t'\beta^o + \varepsilon_t$ from Assumption 5.2, we have
$$\hat{\beta} - \beta^o = \hat{Q}^{-1}n^{-1}\sum_{t=1}^{n}X_t\varepsilon_t.$$

Proof: Because $\{X_t\}$ is ergodic stationary, $\{X_tX_t'\}$ is also ergodic stationary. Thus, given Assumption 5.4, which implies $E|X_{it}X_{jt}| \leq C < \infty$ for $0 \leq i, j \leq k$ and some constant $C$, we have
$$\hat{Q} \overset{p}{\to} E(X_tX_t') = Q$$
by the WLLN for ergodic stationary processes. Because $Q^{-1}$ exists, by continuity we have
$$\hat{Q}^{-1} \overset{p}{\to} Q^{-1} \quad \text{as } n \to \infty.$$
Next, we consider $n^{-1}\sum_{t=1}^{n}X_t\varepsilon_t$. Because $\{Y_t, X_t'\}'_{t=1}^{n}$ is ergodic stationary, $\varepsilon_t = Y_t - X_t'\beta^o$ is ergodic stationary, and so is $X_t\varepsilon_t$. In addition,
$$E|X_{jt}\varepsilon_t| \leq [E(X_{jt}^2)E(\varepsilon_t^2)]^{1/2} \leq C < \infty \quad \text{for } 0 \leq j \leq k$$
by the Cauchy–Schwarz inequality and Assumptions 5.3 and 5.4. It follows that
$$n^{-1}\sum_{t=1}^{n}X_t\varepsilon_t \overset{p}{\to} E(X_t\varepsilon_t) = 0$$
by the WLLN for ergodic stationary processes, the law of iterated expectations, and Assumption 5.3. Therefore, we have
$$\hat{\beta} - \beta^o = \hat{Q}^{-1}n^{-1}\sum_{t=1}^{n}X_t\varepsilon_t \overset{p}{\to} Q^{-1}\cdot 0 = 0.$$
Proof: Recall
$$\sqrt{n}(\hat{\beta} - \beta^o) = \hat{Q}^{-1}n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t.$$
First, we consider the term $n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t$. Because $\{X_t\varepsilon_t\}$ is an ergodic stationary MDS with finite and positive definite variance $V = E(X_tX_t'\varepsilon_t^2)$ (Assumption 5.5), the CLT for ergodic stationary MDS implies
$$n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t \overset{d}{\to} N(0, V).$$
Moreover, $\hat{Q}^{-1} \overset{p}{\to} Q^{-1}$, as shown earlier. It follows from the Slutsky theorem that
$$\sqrt{n}(\hat{\beta} - \beta^o) = \hat{Q}^{-1}n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t \overset{d}{\to} Q^{-1}N(0, V) \equiv N(0, Q^{-1}VQ^{-1}).$$
Special Case: Conditional Homoskedasticity

The asymptotic variance of $\sqrt{n}(\hat{\beta} - \beta^o)$ can be simplified if there exists conditional homoskedasticity.

Assumption 5.6 [Conditional Homoskedasticity]: $E(\varepsilon_t^2|X_t) = \sigma^2$ a.s.

This assumption rules out the possibility that the conditional variance of $\varepsilon_t$ changes with $X_t$. For low-frequency macroeconomic time series, this might be a reasonable assumption. For high-frequency financial time series, however, this assumption will be rather restrictive.

Under Assumption 5.6, $V = E(X_tX_t'\varepsilon_t^2) = E[X_tX_t'E(\varepsilon_t^2|X_t)] = \sigma^2Q$ by the law of iterated expectations. The desired result then follows immediately from the previous theorem. This completes the proof.

Under conditional homoskedasticity, the asymptotic variance of $\sqrt{n}(\hat{\beta} - \beta^o)$ is therefore
$$\mathrm{avar}(\sqrt{n}\,\hat{\beta}) = Q^{-1}VQ^{-1} = \sigma^2Q^{-1}.$$
Case I: Conditional Homoskedasticity

In this case, the asymptotic variance of $\sqrt{n}(\hat{\beta} - \beta^o)$ is
$$\mathrm{avar}(\sqrt{n}\,\hat{\beta}) = Q^{-1}VQ^{-1} = \sigma^2Q^{-1}.$$
It suffices to have consistent estimators for $\sigma^2$ and $Q$ respectively. For $\sigma^2$, consider the residual variance estimator $s^2 = (n-K)^{-1}\sum_{t=1}^{n}e_t^2$. Because $e_t = \varepsilon_t - (\hat{\beta} - \beta^o)'X_t$,
$$
s^2 = \frac{1}{n-K}\sum_{t=1}^{n}\varepsilon_t^2 + \frac{n}{n-K}(\hat{\beta} - \beta^o)'\hat{Q}(\hat{\beta} - \beta^o) - \frac{2}{n-K}(\hat{\beta} - \beta^o)'\sum_{t=1}^{n}X_t\varepsilon_t
\overset{p}{\to} \sigma^2 + 0'Q\,0 - 2\cdot 0'\cdot 0 = \sigma^2,
$$
given that $K$ is a fixed number, where we have made use of the WLLN for ergodic stationary processes in several places. This completes the proof.

We can then estimate $\mathrm{avar}(\sqrt{n}\,\hat{\beta}) = \sigma^2Q^{-1}$ by $s^2\hat{Q}^{-1}$.
This implies that the variance estimator of $\hat{\beta}$ is computed as
$$s^2\hat{Q}^{-1}/n = s^2(\mathbf{X}'\mathbf{X})^{-1}.$$
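The following Python sketch (with simulated, purely illustrative data) computes the OLS estimator and the classical variance estimator $s^2(\mathbf{X}'\mathbf{X})^{-1}$ for an AR(1) regression:

```python
# Minimal sketch: OLS and the homoskedastic variance estimator s^2 (X'X)^{-1}.
import numpy as np

rng = np.random.default_rng(2)
n = 500
y = np.zeros(n)
for t in range(1, n):
    y[t] = 1.0 + 0.5 * y[t - 1] + rng.standard_normal()

Y = y[1:]                                       # dependent variable
X = np.column_stack([np.ones(n - 1), y[:-1]])   # regressors (1, Y_{t-1})
K = X.shape[1]

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
s2 = e @ e / (len(Y) - K)
var_beta = s2 * np.linalg.inv(X.T @ X)          # classical variance estimator

print("beta_hat   :", beta_hat)
print("std errors :", np.sqrt(np.diag(var_beta)))
```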
Assumption 5.7: $E(X_{jt}^4) < \infty$ for $0 \leq j \leq k$ and $E(\varepsilon_t^4) < \infty$.

Lemma 5.10: Suppose Assumptions 5.1–5.5 and 5.7 hold. Then
$$\hat{V} \overset{p}{\to} V \quad \text{as } n \to \infty.$$
Proof: The proof is analogous to the proof of Lemma 4.17 in Chapter 4. Because $e_t = \varepsilon_t - (\hat{\beta} - \beta^o)'X_t$, we have
$$
\hat{V} = n^{-1}\sum_{t=1}^{n}X_tX_t'e_t^2
= n^{-1}\sum_{t=1}^{n}X_tX_t'\varepsilon_t^2
+ n^{-1}\sum_{t=1}^{n}X_tX_t'[(\hat{\beta} - \beta^o)'X_tX_t'(\hat{\beta} - \beta^o)]
- 2n^{-1}\sum_{t=1}^{n}X_tX_t'[\varepsilon_tX_t'(\hat{\beta} - \beta^o)]
\overset{p}{\to} V + 0 - 2\cdot 0 = V,
$$
where for the first term we have
$$n^{-1}\sum_{t=1}^{n}X_tX_t'\varepsilon_t^2 \overset{p}{\to} E(X_tX_t'\varepsilon_t^2) = V$$
by the WLLN for ergodic stationary processes and Assumption 5.5. For the second term, it suffices to show that for any combination $(i, j, l, m)$, where $0 \leq i, j, l, m \leq k$,
$$
n^{-1}\sum_{t=1}^{n}X_{it}X_{jt}[(\hat{\beta} - \beta^o)'X_tX_t'(\hat{\beta} - \beta^o)]
= \sum_{l=0}^{k}\sum_{m=0}^{k}(\hat{\beta}_l - \beta^o_l)(\hat{\beta}_m - \beta^o_m)\,n^{-1}\sum_{t=1}^{n}X_{it}X_{jt}X_{lt}X_{mt}
\overset{p}{\to} 0,
$$
which follows from $\hat{\beta} - \beta^o \overset{p}{\to} 0$ and $n^{-1}\sum_{t=1}^{n}X_{it}X_{jt}X_{lt}X_{mt} \overset{p}{\to} E(X_{it}X_{jt}X_{lt}X_{mt}) = O(1)$ by the WLLN and Assumption 5.7. For the last term, it suffices to show
$$
n^{-1}\sum_{t=1}^{n}X_{it}X_{jt}[\varepsilon_tX_t'(\hat{\beta} - \beta^o)]
= \sum_{l=0}^{k}(\hat{\beta}_l - \beta^o_l)\,n^{-1}\sum_{t=1}^{n}X_{it}X_{jt}X_{lt}\varepsilon_t
\overset{p}{\to} 0,
$$
which follows from $\hat{\beta} - \beta^o \overset{p}{\to} 0$ and $n^{-1}\sum_{t=1}^{n}X_{it}X_{jt}X_{lt}\varepsilon_t \overset{p}{\to} E(X_{it}X_{jt}X_{lt}\varepsilon_t) = 0$ by the WLLN for ergodic stationary processes, the law of iterated expectations, and $E(\varepsilon_t|X_t) = 0$ a.s.

We have proved the following result.
Theorem 5.11 [Asymptotic variance estimator for $\sqrt{n}(\hat{\beta} - \beta^o)$]: Under Assumptions 5.1–5.5 and 5.7, we can estimate $\mathrm{avar}(\sqrt{n}\,\hat{\beta})$ by
$$\hat{Q}^{-1}\hat{V}\hat{Q}^{-1} \overset{p}{\to} Q^{-1}VQ^{-1}.$$
The variance estimator $\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}$ is the so-called White heteroskedasticity-consistent variance-covariance matrix estimator for $\sqrt{n}(\hat{\beta} - \beta^o)$ in a linear time series regression model with MDS disturbances.
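As a minimal illustration (hypothetical helper, continuing the simulated OLS setup sketched earlier), $\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}$ can be computed directly from the regressor matrix and residuals:

```python
# Minimal sketch: White's heteroskedasticity-consistent estimator Q_hat^{-1} V_hat Q_hat^{-1}.
import numpy as np

def white_avar(X, e):
    """Return Q_hat^{-1} V_hat Q_hat^{-1}, with Q_hat = X'X/n and V_hat = sum_t X_t X_t' e_t^2 / n."""
    n = X.shape[0]
    Q_hat = X.T @ X / n
    V_hat = (X * e[:, None] ** 2).T @ X / n
    Q_inv = np.linalg.inv(Q_hat)
    return Q_inv @ V_hat @ Q_inv

# Usage with the X, e from the earlier OLS sketch:
# avar_hat = white_avar(X, e)
# robust_se = np.sqrt(np.diag(avar_hat) / X.shape[0])
```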
We now consider testing the null hypothesis
$$H_0: R\beta^o = r,$$
where $R$ is a $J \times K$ matrix and $r$ is a $J \times 1$ vector. The test statistics differ in two cases. We first construct a test under conditional homoskedasticity.
When $J = 1$, we can use the conventional t-test statistic for large sample inference.

Theorem 5.12 [t-test]: Suppose Assumptions 5.1–5.6 hold. Then under $H_0$ with $J = 1$,
$$T = \frac{R\hat{\beta} - r}{\sqrt{s^2R(\mathbf{X}'\mathbf{X})^{-1}R'}} \overset{d}{\to} N(0, 1)$$
as $n \to \infty$.

Proof: Given $R\sqrt{n}(\hat{\beta} - \beta^o) \overset{d}{\to} N(0, \sigma^2RQ^{-1}R')$, $R\beta^o = r$ under $H_0$, and $J = 1$, we have
$$\frac{\sqrt{n}(R\hat{\beta} - r)}{\sqrt{\sigma^2RQ^{-1}R'}} \overset{d}{\to} N(0, 1).$$
By the Slutsky theorem and $\hat{Q} = \mathbf{X}'\mathbf{X}/n$, we obtain
$$\frac{\sqrt{n}(R\hat{\beta} - r)}{\sqrt{s^2R\hat{Q}^{-1}R'}} \overset{d}{\to} N(0, 1).$$
For $J > 1$, we can consider an asymptotic $\chi^2$ test that is based on the conventional F-statistic.

Theorem 5.13 [Asymptotic $\chi^2$ Test]: Suppose Assumptions 5.1–5.6 hold. Then under $H_0$,
$$J\cdot F \overset{d}{\to} \chi^2_J \quad \text{as } n \to \infty.$$

Proof: We write
$$R\hat{\beta} - r = R(\hat{\beta} - \beta^o) + R\beta^o - r.$$
Under $H_0: R\beta^o = r$, we have
$$\sqrt{n}(R\hat{\beta} - r) = R\sqrt{n}(\hat{\beta} - \beta^o) \overset{d}{\to} N(0, \sigma^2RQ^{-1}R').$$
Also, because $s^2\hat{Q}^{-1} \overset{p}{\to} \sigma^2Q^{-1}$, we have the Wald test statistic
$$W = \sqrt{n}(R\hat{\beta} - r)'[s^2R\hat{Q}^{-1}R']^{-1}\sqrt{n}(R\hat{\beta} - r) \overset{d}{\to} \chi^2_J,$$
which can be written equivalently as
$$W = \frac{(R\hat{\beta} - r)'[R(\mathbf{X}'\mathbf{X})^{-1}R']^{-1}(R\hat{\beta} - r)}{s^2} \overset{d}{\to} \chi^2_J,$$
namely
$$W = J\cdot F \overset{d}{\to} \chi^2_J.$$

Remarks:

We cannot use the F distribution for a finite sample size $n$, but we can still compute the F-statistic, and the appropriate test statistic is $J$ times the F-statistic, which is asymptotically $\chi^2_J$ as $n \to \infty$. That is,
$$J\cdot F = \frac{\tilde{e}'\tilde{e} - e'e}{e'e/(n-K)} \overset{d}{\to} \chi^2_J.$$
Put differently, the classical F-test is still approximately applicable under Assumptions 5.1–5.6 for a large $n$.
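The following Python sketch (hypothetical helper name and inputs, shown only to make the computation concrete) computes the Wald statistic $W = J\cdot F$ under conditional homoskedasticity and its asymptotic $\chi^2_J$ p-value:

```python
# Minimal sketch: W = (R b - r)'[s^2 R (X'X)^{-1} R']^{-1}(R b - r), compared with chi2(J).
import numpy as np
from scipy import stats

def wald_homoskedastic(X, Y, R, r):
    n, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y
    e = Y - X @ b
    s2 = e @ e / (n - K)
    diff = R @ b - r
    W = diff @ np.linalg.solve(s2 * R @ XtX_inv @ R.T, diff)
    J = R.shape[0]
    return W, 1 - stats.chi2.cdf(W, df=J)

# Example: test H0: slope = 0 in a bivariate regression,
# W, pval = wald_homoskedastic(X, Y, R=np.array([[0.0, 1.0]]), r=np.array([0.0]))
```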
We now give two examples that are not covered under the assumptions of classical
linear regression models.
We consider two approaches to testing Granger causality. The first test is proposed by Granger (1969). Consider the linear regression model
$$Y_t = \gamma_0 + \gamma_1Y_{t-1} + \cdots + \gamma_pY_{t-p} + \gamma_{p+1}X_{t-1} + \cdots + \gamma_{p+q}X_{t-q} + \varepsilon_t,$$
and test the null hypothesis that $\{X_t\}$ does not Granger-cause $\{Y_t\}$:
$$H_0: \gamma_{p+1} = \cdots = \gamma_{p+q} = 0.$$
The classical regression theory of Chapter 3 (Assumption 3.2: $E(\varepsilon_t|\mathbf{X}) = 0$) rules out this application, because it is a dynamic regression model. However, we have justified in this chapter that under $H_0$,
$$q\cdot F \overset{d}{\to} \chi^2_q$$
as $n \to \infty$ under conditional homoskedasticity, even for a linear dynamic regression model.
There is another well-known test for Granger causality proposed by Sims (1980), which is based on the fact that the future cannot cause the present in any notion of causality. To test whether $\{X_t\}$ Granger-causes $\{Y_t\}$, we consider the following linear regression model:
$$X_t = \gamma_0 + \sum_{j=1}^{p}\gamma_jX_{t-j} + \sum_{j=1}^{J}\delta_jY_{t+j} + \sum_{j=1}^{q}\alpha_jY_{t-j} + \varepsilon_t.$$
Here, the dependent variable is $X_t$ rather than $Y_t$. If $\{X_t\}$ Granger-causes $\{Y_t\}$, we expect some relationship between the current $X_t$ and the future values of $Y_t$. Note that nonzero values for any of $\{\delta_j\}_{j=1}^{J}$ cannot be interpreted as causality from the future values of $Y_t$ to the current $X_t$, simply because the future cannot cause the present. Nonzero values of any $\delta_j$ must imply that there exists causality from the current $X_t$ to future values of $Y_t$. Therefore, we test the null hypothesis
$$H_0: \delta_j = 0 \quad \text{for } 1 \leq j \leq J.$$
As another illustration, consider a wage equation
$$W_t = \lambda_0 + \lambda_1P_t + \lambda_2P_{t-1} + \lambda_3U_t + \lambda_4V_t + \lambda_5W_{t-1} + \varepsilon_t,$$
with the joint null hypothesis
$$H_0: \lambda_1 + \lambda_2 = 0, \quad \lambda_3 + \lambda_4 = 0, \quad \text{and } \lambda_5 = 1.$$
Under these restrictions the model can be written in the differenced form
$$\Delta W_t = \lambda_0 + \lambda_1\Delta P_t + \lambda_4D_t + \varepsilon_t,$$
where $D_t \equiv V_t - U_t$. Under $H_0$, we have
$$3F \overset{d}{\to} \chi^2_3.$$
Theorem 5.14 [$(n-K)R^2$ Test]: Suppose Assumptions 5.1–5.6 hold, and we are interested in testing the null hypothesis
$$H_0: \beta_1^o = \beta_2^o = \cdots = \beta_k^o = 0,$$
where the $\beta_j^o$, $1 \leq j \leq k$, are the slope coefficients in the linear regression model $Y_t = X_t'\beta^o + \varepsilon_t$. Let $R^2$ be the coefficient of determination from the unrestricted regression model $Y_t = X_t'\beta^o + \varepsilon_t$. Then under $H_0$,
$$(n-K)R^2 \overset{d}{\to} \chi^2_k.$$

Proof: Recall that the F-statistic for $H_0$ can be written as
$$F = \frac{R^2/k}{(1-R^2)/(n-K)}.$$
Here we have $J = k$, and under $H_0$,
$$k\cdot F = \frac{(n-K)R^2}{1-R^2} \overset{d}{\to} \chi^2_k.$$
This implies that $k\cdot F$ is bounded in probability; that is,
$$\frac{(n-K)R^2}{1-R^2} = O_P(1).$$
Consequently, given that $k$ is fixed (i.e., does not grow with the sample size $n$), we have $R^2/(1-R^2) \overset{p}{\to} 0$, or equivalently $R^2 \overset{p}{\to} 0$. Therefore, $1 - R^2 \overset{p}{\to} 1$. By the Slutsky theorem, we have
$$(n-K)R^2 = \frac{(n-K)R^2}{1-R^2}\,(1-R^2) \overset{d}{\to} \chi^2_k.$$
This completes the proof.
Example 3 [E¢ cient Market Hypothesis]: Suppose Yt is the exchange rate return
in period t; and It 1 is the information available at time t 1: Then a classical version
of the e¢ cient market hypothesis (EMH) can be stated as follows:
To check whether exchange rate changes are unpredictable using the past history of
exchange rate changes, we specify a linear regression model:
o
Yt = Xt0 + "t ;
where
Xt = (1; Yt 1 ; :::; Yt k )0 :
Under EMH, we have
o
H0 : j = 0 for all j = 1; :::; k:
If the alternative
o
HA : j 6= 0 at least for some j 2 f1; :::; kg
holds, then exchange rate changes are predictable using the past information.
Remarks:
What is the appropriate interpretation if $H_0$ is not rejected? Note that there exists a gap between the efficiency hypothesis and $H_0$, because the linear regression model is just one of many ways to check EMH. Thus, if $H_0$ is not rejected, at most we can only say that no evidence against the efficiency hypothesis is found. We should not conclude that EMH holds.
Next, we construct hypothesis tests for $H_0$ under conditional heteroskedasticity. Recall that under $H_0$,
$$\sqrt{n}(R\hat{\beta} - r) = R\sqrt{n}(\hat{\beta} - \beta^o) + \sqrt{n}(R\beta^o - r) = R\sqrt{n}(\hat{\beta} - \beta^o) \overset{d}{\to} N(0, RQ^{-1}VQ^{-1}R').$$
For $J = 1$, we have
$$\frac{\sqrt{n}(R\hat{\beta} - r)}{\sqrt{RQ^{-1}VQ^{-1}R'}} \overset{d}{\to} N(0, 1) \quad \text{as } n \to \infty.$$
Because $\hat{Q} \overset{p}{\to} Q$ and $\hat{V} \overset{p}{\to} V$, where $\hat{V} = \mathbf{X}'D(e)D(e)'\mathbf{X}/n$, we have by the Slutsky theorem that the robust t-test statistic
$$T_r = \frac{\sqrt{n}(R\hat{\beta} - r)}{\sqrt{R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R'}} \overset{d}{\to} N(0, 1) \quad \text{as } n \to \infty$$
under $H_0$. More generally, given $\hat{Q} \overset{p}{\to} Q$ and $\hat{V} \overset{p}{\to} V$, we have a robust Wald test statistic
$$W = n(R\hat{\beta} - r)'[R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R']^{-1}(R\hat{\beta} - r) \overset{d}{\to} \chi^2_J.$$
Theorem 5.16 [Robust Wald Test Under Conditional Heteroskedasticity]: Suppose Assumptions 5.1–5.5 and 5.7 hold. Then under $H_0$, as $n \to \infty$,
$$W = n(R\hat{\beta} - r)'[R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R']^{-1}(R\hat{\beta} - r) \overset{d}{\to} \chi^2_J.$$
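A minimal Python sketch (hypothetical helper, assuming data arrays X, Y and restriction matrices R, r are supplied) of this robust Wald statistic:

```python
# Minimal sketch: heteroskedasticity-robust Wald statistic based on Q_hat^{-1} V_hat Q_hat^{-1}.
import numpy as np
from scipy import stats

def robust_wald(X, Y, R, r):
    n = X.shape[0]
    Q_hat = X.T @ X / n
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ b
    V_hat = (X * e[:, None] ** 2).T @ X / n
    Q_inv = np.linalg.inv(Q_hat)
    avar = Q_inv @ V_hat @ Q_inv
    diff = R @ b - r
    W = n * diff @ np.linalg.solve(R @ avar @ R.T, diff)
    return W, 1 - stats.chi2.cdf(W, df=R.shape[0])
```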
Remarks:
Under conditional heteroskedasticity, J F and (n K)R2 cannot be used even when
n ! 1.
On the other hand, although the general form of the test statistic W developed
here can be used no matter whether there exists conditional homoskedasticity, W may
perform poorly in small samples (i.e., the asymptotic 2J approximation may be poor in
small samples, or Type I errors are large). Thus, if one has information that the error
term is conditionally homoskedastic, one should use the test statistics derived under
conditional homoskedasticity, which will perform better in small sample sizes. Because
of this reason, it is important to test whether conditional homoskedasticity holds in a
time series context.
Question: Can we still use White’s (1980) test for conditional heteroskedasticity?
Yes. Although White’s (1980) test is developed under the independence assumption,
it is still applicable to a time series linear regression model when fXt "t g is an MDS
process. Thus, the test procedure to implement White’s (1980) test as is discussed in
Chapter 4 can be used here.
The null hypothesis is
$$H_0: \sigma_t^2 = \sigma^2 \quad \text{for some constant } \sigma^2 > 0.$$
Here, to allow for a possibly time-varying conditional variance of the regression disturbance $\varepsilon_t$ given $I_{t-1}$, $\varepsilon_t$ is formulated as the product between a random shock $z_t$ and $\sigma_t = \sigma(I_{t-1})$:
$$\varepsilon_t = \sigma_tz_t.$$
When the random shock series $\{z_t\}$ is i.i.d.$(0,1)$, we have $E(\varepsilon_t|I_{t-1}) = 0$ and $\mathrm{var}(\varepsilon_t|I_{t-1}) = \sigma_t^2$. That is, $\sigma_t^2$ is the conditional variance of $\varepsilon_t$ given $I_{t-1}$. The null hypothesis $H_0$ says that the conditional variance of $\varepsilon_t$ given $I_{t-1}$ does not change over time.

A popular alternative specifies $\sigma_t^2$ as an autoregression in past squared disturbances, which can equivalently be written as the auxiliary regression
$$\varepsilon_t^2 = \alpha_0 + \sum_{j=1}^{q}\alpha_j\varepsilon_{t-j}^2 + v_t,$$
where $E(v_t|I_{t-1}) = 0$ a.s. This is called an ARCH(q) process in Engle (1982). ARCH models can capture a well-known empirical stylized fact called volatility clustering in financial markets: a large volatility today tends to be followed by another large volatility tomorrow, a small volatility today tends to be followed by another small volatility tomorrow, and such patterns alternate over time. To see this more clearly, we consider an ARCH(1) model where
$$\sigma_t^2 = \alpha_0 + \alpha_1\varepsilon_{t-1}^2.$$
In addition to volatility clustering, the ARCH(1) model can also generate heavy tails for $\varepsilon_t$ even when the random shock $z_t$ is i.i.d. $N(0,1)$. This can be seen from its kurtosis
$$K = \frac{E(\varepsilon_t^4)}{[E(\varepsilon_t^2)]^2} = \frac{E(z_t^4)(1-\alpha_1^2)}{1-3\alpha_1^2} = \frac{3(1-\alpha_1^2)}{1-3\alpha_1^2} > 3$$
given $\alpha_1 > 0$.
With an ARCH modeling framework, all autoregressive coe¢ cients j ; 1 j q;
are identically zero when H0 holds. Thus, we can test H0 by checking whether all
j; 1 j q; are jointly zero. If j 6= 0 for some 1 j q; then there exists
2
autocorrelation in f"t g and H0 is false.
Observe that with $\varepsilon_t = \sigma_tz_t$ and $\{z_t\}$ i.i.d.$(0,1)$, the disturbance $v_t$ in the auxiliary autoregression model is an i.i.d. sequence under $H_0$, which implies that $E(v_t^2|I_{t-1}) = \sigma_v^2$; that is, $\{v_t\}$ is conditionally homoskedastic. Thus, when $H_0$ holds, we have
$$(n-q-1)\tilde{R}^2 \overset{d}{\to} \chi^2_q,$$
where $\tilde{R}^2$ is from the (infeasible) auxiliary regression of $\varepsilon_t^2$ on an intercept and $\varepsilon_{t-1}^2, \ldots, \varepsilon_{t-q}^2$.
The auxiliary regression for $\varepsilon_t^2$, unfortunately, is infeasible because $\varepsilon_t$ is not observable. However, we can replace $\varepsilon_t$ by the estimated residual $e_t$ and consider the regression
$$e_t^2 = \alpha_0 + \sum_{j=1}^{q}\alpha_je_{t-j}^2 + \tilde{v}_t.$$
Then we have
$$(n-q-1)R^2 \overset{d}{\to} \chi^2_q.$$
Note that the replacement of $\varepsilon_t$ by $e_t$ has no impact on the asymptotic distribution of the test statistic, for the same reason as in White's (1980) direct test for conditional heteroskedasticity. See Chapter 4 for more discussion.
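A minimal Python sketch of this feasible ARCH test (helper name and inputs are illustrative; it regresses $e_t^2$ on an intercept and $q$ lagged squared residuals and forms $(n-q-1)R^2$):

```python
# Minimal sketch: ARCH LM-type test, (n - q - 1) * R^2 compared with chi2(q).
import numpy as np
from scipy import stats

def arch_lm_test(e, q):
    e2 = e ** 2
    Y = e2[q:]
    X = np.column_stack([np.ones(len(Y))] + [e2[q - j:len(e2) - j] for j in range(1, q + 1)])
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    resid = Y - X @ b
    R2 = 1.0 - resid @ resid / np.sum((Y - Y.mean()) ** 2)
    stat = (len(e) - q - 1) * R2
    return stat, 1 - stats.chi2.cdf(stat, df=q)
```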
Remarks:

The existence of ARCH effects for $\{\varepsilon_t\}$ does not automatically imply that we have to use White's heteroskedasticity-consistent variance-covariance matrix $Q^{-1}VQ^{-1}$ for the OLS estimator $\hat{\beta}$. Suppose $Y_t = X_t'\beta^o + \varepsilon_t$ is a static time series model such that the two time series $\{X_t\}$ and $\{\varepsilon_t\}$ are independent of each other, and $\{\varepsilon_t\}$ displays ARCH effects, i.e.,
$$\mathrm{var}(\varepsilon_t|I_{t-1}) = \alpha_0 + \sum_{j=1}^{p}\alpha_j\varepsilon_{t-j}^2$$
with at least some $\alpha_j \neq 0$. Then Assumption 5.6 still holds, because $\mathrm{var}(\varepsilon_t|X_t) = \mathrm{var}(\varepsilon_t) = \sigma^2$ given that $\{X_t\}$ and $\{\varepsilon_t\}$ are independent. In this case, we have $\mathrm{avar}(\sqrt{n}\,\hat{\beta}) = \sigma^2Q^{-1}$.

Next, suppose $Y_t = X_t'\beta^o + \varepsilon_t$ is a dynamic time series regression model such that $X_t$ contains some lagged dependent variables (say $Y_{t-1}$). Then if $\{\varepsilon_t\}$ displays ARCH effects, Assumption 5.6 may fail because we may have $E(\varepsilon_t^2|X_t) \neq \sigma^2$, which generally occurs when $X_t$ and $\{\varepsilon_{t-j}^2, j = 1, \ldots, p\}$ are not independent. In this case, we have to use $\mathrm{avar}(\sqrt{n}\,\hat{\beta}) = Q^{-1}VQ^{-1}$.
We first provide some motivation for doing so. Recall that under Assumptions 5.1–5.5,
$$\sqrt{n}(\hat{\beta} - \beta^o) \overset{d}{\to} N(0, Q^{-1}VQ^{-1}),$$
where $V = \mathrm{var}(X_t\varepsilon_t)$. Among other things, this implies that the asymptotic variance of $n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t$ is the same as the variance of $X_t\varepsilon_t$. This follows from the MDS assumption for $\{X_t\varepsilon_t\}$:
$$
\mathrm{var}\left(n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t\right)
= n^{-1}\sum_{t=1}^{n}\sum_{s=1}^{n}E(X_t\varepsilon_tX_s'\varepsilon_s)
= n^{-1}\sum_{t=1}^{n}E(X_tX_t'\varepsilon_t^2)
= E(X_tX_t'\varepsilon_t^2) = V.
$$
This result will not generally hold if the MDS property for $\{X_t\varepsilon_t\}$ is violated.
Question: How can we check $E(X_t\varepsilon_t|I_{t-1}) = 0$, where $I_{t-1}$ is the $\sigma$-field generated by $\{X_s\varepsilon_s, s < t\}$?
When Xt contains the intercept, we have that f"t g is MDS with respect to the -…eld
generated by f"s ; s < tg, which implies that f"t g is serially uncorrelated (or is a white
noise).
If $\{\varepsilon_t\}$ is serially correlated, then $\{X_t\varepsilon_t\}$ will not be an MDS, and consequently we will generally have $\mathrm{var}(n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t) \neq V$. Therefore, serial uncorrelatedness is an important necessary condition for the validity of $\mathrm{avar}(\sqrt{n}\,\hat{\beta}) = Q^{-1}VQ^{-1}$ with $V = E(X_tX_t'\varepsilon_t^2)$.
On the other hand, let us revisit the correct model specification condition $E(\varepsilon_t|X_t) = 0$ a.s. in a time series context. Note that this condition does not necessarily imply that $\{\varepsilon_t\}$ or $\{X_t\varepsilon_t\}$ is an MDS.

To see this, consider the case where $Y_t = X_t'\beta^o + \varepsilon_t$ is a static regression model (i.e., $\{X_t\}$ and $\{\varepsilon_t\}$ are mutually independent, or at least $\mathrm{cov}(X_t, \varepsilon_s) = 0$ for all $t, s$). It is possible that $E(\varepsilon_t|X_t) = 0$ but $\{\varepsilon_t\}$ is serially correlated. An example is that $\{\varepsilon_t\}$ is an AR(1) process while $\{\varepsilon_t\}$ and $\{X_t\}$ are mutually independent. In this case, serial dependence in $\{\varepsilon_t\}$ does not cause inconsistency of the OLS estimator $\hat{\beta}$ for $\beta^o$, but we no longer have $\mathrm{var}(n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t) = V = E(X_tX_t'\varepsilon_t^2)$. In other words, the MDS property of $\{\varepsilon_t\}$ is crucial for $\mathrm{var}(n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t) = V$ in a static regression model, although it is not needed to ensure $E(\varepsilon_t|X_t) = 0$. For a static regression model, the regressors $X_t$ are usually called exogenous variables. In particular, if $\{X_t\}$ and $\{\varepsilon_t\}$ are mutually independent, then $X_t$ is called strictly exogenous.
On the other hand, when $Y_t = X_t'\beta^o + \varepsilon_t$ is a dynamic model (i.e., $X_t$ includes lagged dependent variables such as $\{Y_{t-1}, \ldots, Y_{t-k}\}$, so that $X_t$ and $\varepsilon_{t-j}$ are generally not independent for $j > 0$), the correct model specification condition $E(\varepsilon_t|X_t) = 0$ a.s. holds when $\{\varepsilon_t\}$ is an MDS. If $\{\varepsilon_t\}$ is not an MDS, the condition $E(\varepsilon_t|X_t) = 0$ a.s. generally does not hold. To see this, consider, for example, an AR(1) model
$$Y_t = \beta_0^o + \beta_1^oY_{t-1} + \varepsilon_t = X_t'\beta^o + \varepsilon_t.$$
Suppose $\{\varepsilon_t\}$ is an MA(1) process. Then $E(X_t\varepsilon_t) \neq 0$, and so $E(\varepsilon_t|X_t) \neq 0$. Thus, to ensure correct specification ($E(Y_t|X_t) = X_t'\beta^o$ a.s.) of a dynamic regression model in a time series context, it is important to check the MDS property of $\{\varepsilon_t\}$. In this case, tests for MDS can be viewed as specification tests for dynamic regression models.
where $I_{t-1}$ is the information set available to the economic agent at time $t-1$. In this context, $X_t$ is usually a subset of $I_{t-1}$, namely $X_t \in I_{t-1}$. Thus both Assumptions 5.3 and 5.5 hold simultaneously:
$$E(\varepsilon_t|X_t) = E[E(\varepsilon_t|I_{t-1})|X_t] = 0 \quad \text{a.s.},$$
because $X_t$ belongs to $I_{t-1}$.
To check the MDS property of f"t g; one may check whether there exists serial corre-
lation in f"t g: Evidence of serial correlation in f"t g will indicate that f"t g is not MDS.
The existence of serial correlation may be due to various sources of model misspec-
i…cation. For example, it may be that in the linear regression model, an important
explanatory variable is missing (omitted variables), or that the functional relationship
is nonlinear (functional form misspeci…cation), or that lagged dependent variables or
lagged explanatory variables should be included as regressors (neglected dynamics or
dynamic misspeci…cation). Therefore, tests for serial correlation can also be viewed as
a model speci…cation check in a dynamic time series regression context.
We now introduce a number of tests for serial correlation of the disturbance f"t g in
a linear regression model.
Throughout, $I_{t-1} = \{\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots\}$, and we assume $E(\varepsilon_t^2|X_t) = \sigma^2$ a.s. Below, following the vast literature, we will first assume conditional homoskedasticity in testing serial correlation for $\{\varepsilon_t\}$. Thus, this method is not suitable for high-frequency financial time series, where volatility clustering has been well documented. Extensions to conditional heteroskedasticity will be discussed later.
First, suppose $\varepsilon_t$ were observed, and consider the auxiliary regression model (an AR(p))
$$\varepsilon_t = \sum_{j=1}^{p}\alpha_j\varepsilon_{t-j} + u_t, \quad t = p+1, \ldots, n.$$
Under the null hypothesis of no serial correlation,
$$(n-2p)\tilde{R}^2_{uc} \overset{d}{\to} \chi^2_p,$$
where $\tilde{R}^2_{uc}$ is the uncentered $R^2$ in the auxiliary regression (note that there is no intercept), and $p$ is the number of regressors. The reason that we use $(n-2p)\tilde{R}^2_{uc}$ is that $t$ begins from $p+1$.

Unfortunately, $\varepsilon_t$ is not observable. However, we can replace $\varepsilon_t$ with the estimated residual $e_t = Y_t - X_t'\hat{\beta}$. Unlike White's (1980) test for heteroskedasticity of unknown form, this replacement will generally change the asymptotic $\chi^2_p$ distribution of $(n-2p)R^2_{uc}$. To handle this, we consider the feasible auxiliary regression
$$e_t = \sum_{j=1}^{p}\alpha_je_{t-j} + \gamma'X_t + u_t,$$
where $X_t$ contains the intercept. The inclusion of the regressors $X_t$ in the auxiliary regression purges the impact of the estimation error $X_t'(\hat{\beta} - \beta^o)$ on the test statistic, because $X_t$ and $X_t'(\hat{\beta} - \beta^o)$ are perfectly correlated. Therefore, the resulting statistic
$$(n-2p-K)R^2 \overset{d}{\to} \chi^2_p$$
under $H_0$, where $R^2$ is the centered squared multi-correlation coefficient in the feasible auxiliary regression model.
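A minimal Python sketch of this feasible auxiliary regression test (helper name and inputs are illustrative; it regresses $e_t$ on its own lags and on $X_t$, then forms $(n-2p-K)R^2$):

```python
# Minimal sketch: serial correlation test with the original regressors X_t included.
import numpy as np
from scipy import stats

def serial_correlation_test(e, X, p):
    n, K = X.shape
    Y = e[p:]
    lags = np.column_stack([e[p - j:n - j] for j in range(1, p + 1)])
    Z = np.column_stack([lags, X[p:]])        # lagged residuals plus original regressors
    b = np.linalg.lstsq(Z, Y, rcond=None)[0]
    u = Y - Z @ b
    R2 = 1.0 - u @ u / np.sum((Y - Y.mean()) ** 2)
    stat = (n - 2 * p - K) * R2
    return stat, 1 - stats.chi2.cdf(stat, df=p)
```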
Question: Why should $X_t$ generally be included in the auxiliary regression?

First, consider the auxiliary autoregression based on the estimated residuals alone. Under the null hypothesis of no serial correlation, the OLS estimator from regressing $e_t$ on $e_{t-1}, \ldots, e_{t-p}$ satisfies
$$\hat{\alpha} = \tilde{\alpha} + \hat{\delta} + \text{remainder term},$$
where $\tilde{\alpha}$, as discussed above, is the OLS estimator of regressing $\varepsilon_t$ on $\varepsilon_{t-1}, \ldots, \varepsilon_{t-p}$, and $\hat{\delta}$ is the OLS estimator of regressing $(\hat{\beta} - \beta^o)'X_t$ on $\varepsilon_{t-1}, \ldots, \varepsilon_{t-p}$. For a dynamic regression model, the regressor $X_t$ contains lagged dependent variables, and so $E(X_t\varepsilon_{t-j})$ is likely nonzero for some $j \in \{1, \ldots, p\}$. It follows that $\hat{\delta}$ will converge to zero at the same rate as $\tilde{\alpha} - 0$, which is $n^{-1/2}$. Because $\hat{\delta} \overset{p}{\to} 0$ at the same rate as $\tilde{\alpha}$, $\hat{\delta}$ will have an impact on the asymptotic distribution of $nR^2_{uc}$, where $R^2_{uc}$ is the uncentered $R^2$ in the auxiliary autoregression. To remove the impact of $\hat{\delta}$, we need to include $X_t$ as additional regressors in the auxiliary regression.
Answer: When we have a static regression model, $\mathrm{cov}(X_t, \varepsilon_s) = 0$ for all $t, s$ (so $E(X_t\varepsilon_{t-j}) = 0$ for all $j = 1, \ldots, p$), and the estimation error $X_t'(\hat{\beta} - \beta^o)$ has no impact on the asymptotic distribution of $nR^2_{uc}$. It follows that we do not need to include $X_t$ in the auxiliary autoregression. In other words, we can test serial correlation in $\{\varepsilon_t\}$ by running the auxiliary regression
$$e_t = \sum_{j=1}^{p}\alpha_je_{t-j} + u_t.$$
The resulting $nR^2_{uc}$ is asymptotically $\chi^2_p$ under the null hypothesis of no serial correlation.
Question: Suppose we have a static regression model, and we include Xt in the auxiliary
regression in testing serial correlation of f"t g: What will happen?
For a static regression model, whether Xt is included in the auxiliary regression has
no impact on the asymptotic 2p distribution of (n 2p)Ruc 2
or (n 2p)R2 under the null
hypothesis of no serial correlation in f"t g: Thus, we will still obtain an asymptotic valid
test statistic (n 2p)R2 under H0 : In fact, the size performance of the test can be better
in …nite samples. However, the test may be less powerful than the test without including
Xt ; because Xt may take away some serial correlation in f"t g under the alternative to
H0 :
With the inclusion of an intercept, i.e., using the auxiliary regression $e_t = \alpha_0 + \sum_{j=1}^{p}\alpha_je_{t-j} + u_t$, we can then use $(n-2p)R^2$ to test serial correlation in $\{\varepsilon_t\}$, which is more convenient to compute than $(n-2p)R^2_{uc}$. (Most statistical software report $R^2$ but not $R^2_{uc}$.) Under $H_0$, $(n-2p)R^2 \overset{d}{\to} \chi^2_p$. However, the inclusion of the intercept $\alpha_0$ may have some adverse impact on the power of the test in small samples, because there is an additional parameter to estimate.
As discussed at the beginning of this section, a test for serial correlation can be
viewed as a speci…cation test for dynamic regression models in a time series context,
because existence of serial correlation in the estimated model residual fet g will generally
indicate misspeci…cation of a dynamic regression model.
On the other hand, for static regression models with time series observations, it is
possible that a static regression model Yt = Xt0 o + "t is correctly speci…ed in the sense
that E("t jXt ) = 0 but f"t g displays serial correlation. In this case, existence of serial
correlation in f"t g does not a¤ect the consistency of the OLS estimator ^ but a¤ects the
asymptotic variance and therefore the e¢ ciency of the OLS estimator ^ : However, since
"t is unobservable, one has to use the estimated residual et in testing for serial correlation
in a static regression model in the same way as in a dynamic regression model. Because
the estimated residual
$$e_t = Y_t - X_t'\hat{\beta} = \varepsilon_t + [E(Y_t|X_t) - X_t'\beta^*] + X_t'(\beta^* - \hat{\beta}),$$
it contains the true disturbance $\varepsilon_t = Y_t - E(Y_t|X_t)$ and the model approximation error $E(Y_t|X_t) - X_t'\beta^*$, where $\beta^* = [E(X_tX_t')]^{-1}E(X_tY_t)$ is the best linear least squares approximation coefficient, to which the OLS estimator $\hat{\beta}$ always converges as $n \to \infty$. If the linear regression model is misspecified for $E(Y_t|X_t)$, then the approximation error $E(Y_t|X_t) - X_t'\beta^*$ will never vanish to zero, and this term can cause serial correlation in $e_t$ if $X_t$ is a time series process. Thus, when one finds that there exists serial correlation in the estimated
cation of the static regression model. In this case, the OLS estimator ^ is generally not
consistent. Therefore, one has to …rst check correct speci…cation of a static regression
model in order to give correct interpretation of any documented serial correlation in the
estimated residuals.
In the development of tests for serial correlation in regression disturbances, there have been two very popular tests of historical importance. One is the Durbin–Watson test and the other is Durbin's h test. The Durbin–Watson test is the first formal procedure developed for testing first order serial correlation,
$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t, \quad \{u_t\} \sim \text{i.i.d.}(0, \sigma_u^2),$$
using the OLS residuals $\{e_t\}_{t=1}^{n}$ in a static linear regression model $Y_t = X_t'\beta^o + \varepsilon_t$. Durbin and Watson (1950, 1951) propose the test statistic
$$d = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n}e_t^2}.$$
Durbin and Watson present tables of bounds at the 0.05, 0.025 and 0.01 significance levels of the $d$ statistic for static regressions with an intercept. Against the one-sided alternative that $\rho > 0$, if $d$ is less than the lower bound $d_L$, the null hypothesis that $\rho = 0$ is rejected; if $d$ is greater than the upper bound $d_U$, the null hypothesis is accepted. Otherwise, the test is inconclusive. Against the one-sided alternative that $\rho < 0$, $4 - d$ can be used to replace $d$ in the above procedure.

The Durbin–Watson test has been extended to test for lag 4 autocorrelation by Wallis (1972) and for autocorrelation at any lag by Vinod (1973).
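A minimal Python sketch of the $d$ statistic (helper name is illustrative; the rough link $d \approx 2(1-\hat{\rho})$ is standard):

```python
# Minimal sketch: Durbin-Watson statistic from OLS residuals e_t.
import numpy as np

def durbin_watson(e):
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Values near 2 suggest no first-order serial correlation; d << 2 suggests rho > 0,
# d >> 2 suggests rho < 0 (roughly, d is approximately 2 * (1 - rho_hat)).
```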
Durbin's h test is designed for a dynamic regression model such as
$$Y_t = \beta_0^o + \beta_1^oY_{t-1} + \beta_2^oX_t + \varepsilon_t.$$
The h statistic is
$$h = \hat{\rho}\sqrt{\frac{n}{1 - n\,\widehat{\mathrm{var}}(\hat{\beta}_1)}},$$
where $\widehat{\mathrm{var}}(\hat{\beta}_1)$ is an estimator for the asymptotic variance of $\hat{\beta}_1$, and $\hat{\rho}$ is the OLS estimator from regressing $e_t$ on $e_{t-1}$ (in fact, $\hat{\rho} \approx 1 - d/2$). Durbin (1970) shows that $h \overset{d}{\to} N(0,1)$ as $n \to \infty$ under the null hypothesis that $\rho = 0$. In fact, Durbin's h test is asymptotically equivalent to the Lagrange multiplier test introduced above.
Another popular approach is based on the sample autocorrelations of the estimated residuals,
$$\hat{\rho}(j) = \frac{\hat{\gamma}(j)}{\hat{\gamma}(0)}, \quad \hat{\gamma}(j) = n^{-1}\sum_{t=j+1}^{n}(e_t - \bar{e})(e_{t-j} - \bar{e}),$$
where $\bar{e} = n^{-1}\sum_{t=1}^{n}e_t$ (this is zero when $X_t$ contains an intercept). The Box–Pierce portmanteau test statistic is defined as
$$Q(p) = n\sum_{j=1}^{p}\hat{\rho}^2(j).$$
When $\{e_t\}$ is directly observed data or is the estimated residual from a static regression model, we can show
$$Q(p) \overset{d}{\to} \chi^2_p$$
under the null hypothesis of no serial correlation.
However, if $\{e_t\}$ is the estimated residual from an ARMA(r, s) model,
$$Y_t = \gamma_0 + \sum_{j=1}^{r}\gamma_jY_{t-j} + \sum_{j=1}^{s}\theta_j\varepsilon_{t-j} + \varepsilon_t,$$
then
$$Q(p) \overset{d}{\to} \chi^2_{p-(r+s)}.$$
To improve the small sample performance of the $Q(p)$ test, Ljung and Box (1978) propose a modified $Q(p)$ test statistic:
$$Q^*(p) \equiv n(n+2)\sum_{j=1}^{p}(n-j)^{-1}\hat{\rho}^2(j) \overset{d}{\to} \chi^2_{p-(r+s)}.$$
The modification matches the first two moments of $Q^*(p)$ with those of the $\chi^2$ distribution. This improves the size in small samples, although not the power of the test.
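A minimal Python sketch computing both statistics from a residual series (helper name is illustrative; the $\chi^2_p$ reference distribution applies to observed data or static-regression residuals, as discussed above):

```python
# Minimal sketch: Box-Pierce Q(p) and Ljung-Box Q*(p) statistics.
import numpy as np
from scipy import stats

def box_pierce_ljung_box(e, p):
    n = len(e)
    x = e - e.mean()
    gamma0 = np.sum(x * x) / n
    rho = np.array([np.sum(x[j:] * x[:n - j]) / n / gamma0 for j in range(1, p + 1)])
    Q = n * np.sum(rho ** 2)
    Q_star = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, p + 1)))
    pval = 1 - stats.chi2.cdf(Q_star, df=p)
    return Q, Q_star, pval
```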
When $\{e_t\}$ is an estimated residual from a dynamic regression model with regressors including both lagged dependent variables and exogenous variables, the asymptotic distribution of $Q(p)$ is generally unknown (Breusch and Pagan 1980). One solution is to modify the $Q(p)$ test statistic as follows:
$$\hat{Q}(p) \equiv n\hat{\rho}'(I - \hat{\Phi})^{-1}\hat{\rho} \overset{d}{\to} \chi^2_p \quad \text{as } n \to \infty,$$
where $\hat{\rho} = [\hat{\rho}(1), \ldots, \hat{\rho}(p)]'$, and $\hat{\Phi}$ captures the impact caused by nonzero correlation between $\{X_t\}$ and $\{\varepsilon_{t-j}, 1 \leq j \leq p\}$. See Hayashi (2000, Section 2.10) for more discussion and the expression of $\hat{\Phi}$.

Like the $(n-p)R^2$ test, the $Q(p)$ test also assumes conditional homoskedasticity. In fact, it can be shown to be asymptotically equivalent to the $(n-p)R^2$ test statistic when $e_t$ is the estimated residual of a static regression model.
Hong (1996, Econometrica)

Let $k: \mathbb{R} \to [-1, 1]$ be a symmetric function that is continuous at all points except a finite number of points on $\mathbb{R}$, with $k(0) = 1$ and $\int_{-\infty}^{\infty}k^2(z)dz < \infty$.

Examples of $k(\cdot)$ include the truncated kernel $k(z) = 1(|z| \leq 1)$, the Bartlett kernel $k(z) = (1-|z|)1(|z| \leq 1)$, and the Daniell kernel
$$k(z) = \frac{\sin(\pi z)}{\pi z}, \quad z \in \mathbb{R}.$$
Here, $1(|z| \leq 1)$ is the indicator function that takes value 1 if $|z| \leq 1$ and 0 otherwise.

Define the test statistic
$$M(p) = \left[n\sum_{j=1}^{n-1}k^2(j/p)\hat{\rho}^2(j) - C(p)\right]\Big/\sqrt{D(p)},$$
where
$$C(p) = \sum_{j=1}^{n-1}k^2(j/p), \qquad D(p) = 2\sum_{j=1}^{n-2}k^4(j/p).$$
This can be viewed as a generalized version of the Box–Pierce test. In other words, the Box–Pierce test can be viewed as a kernel-based test with the choice of the truncated kernel.
For a static regression model, we have $n\sum_{j=1}^{p}\hat{\rho}^2(j) \overset{d}{\to} \chi^2_p$ under the null hypothesis of no serial correlation. When $p$ is large, we can obtain a normal approximation for $\chi^2_p$ by subtracting its mean $p$ and dividing by its standard deviation $\sqrt{2p}$:
$$\frac{\chi^2_p - p}{\sqrt{2p}} \overset{d}{\to} N(0, 1) \quad \text{as } p \to \infty.$$
In fact, when $p \to \infty$ as $n \to \infty$, we have the same asymptotic result even when the regression model is dynamic.
Question: Why is it not needed to correct for the impact of the estimation error
contained in et even when the regression model is dynamic?
Answer: The estimation error indeed does have some impact but such impact becomes
asymptotically negligible when p grows to in…nity as n ! 1: In contrast, the Box-Pierce
portmanteau test has some problem because it uses a …xed lag order p (i.e., p is …xed
when n ! 1:)
For a weakly stationary process f"t g, the autocorrelation function (j) typically decays
to zero as j increases. Consequently, it is more powerful if one can discount higher
order lags rather than treat all lags equally. This can be achieved by using a downward
weighting kernel function such as the Bartlett kernel and the Daniell kernel. Hong
(1996) shows that the Daniell kernel gives a most powerful test among a class of kernel
functions.
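A minimal Python sketch of the $M(p)$ statistic under the conditional homoskedasticity case (helper name is illustrative; the Bartlett kernel is used here, although other kernels satisfying the conditions above may be substituted), which is compared with $N(0,1)$ critical values:

```python
# Minimal sketch: kernel-based statistic M(p) with the Bartlett kernel.
import numpy as np

def bartlett(z):
    return np.where(np.abs(z) <= 1.0, 1.0 - np.abs(z), 0.0)

def hong_M(e, p):
    n = len(e)
    x = e - e.mean()
    gamma0 = np.sum(x * x) / n
    rho = np.array([np.sum(x[j:] * x[:n - j]) / n / gamma0 for j in range(1, n)])
    k2 = bartlett(np.arange(1, n) / p) ** 2           # k^2(j/p), j = 1, ..., n-1
    C = np.sum(k2)
    D = 2.0 * np.sum(bartlett(np.arange(1, n - 1) / p) ** 4)
    return (n * np.sum(k2 * rho ** 2) - C) / np.sqrt(D)
```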
Answer: It is a reasonable assumption for low-frequency macroeconomic time series. It is not a reasonable assumption for high-frequency financial time series.

Question: How do we construct a test for serial correlation under conditional heteroskedasticity?

One heteroskedasticity-robust procedure works as follows:

Step 1: Purge the estimation uncertainty by regressing the lagged residuals $(e_{t-1}, \ldots, e_{t-p})$ on $X_t$ and saving the $p \times 1$ residual vector $\hat{v}_t$;

Step 2: Regress 1 on $\hat{v}_te_t$ and obtain $SSR$, the sum of squared residuals;

Step 3: Compare the $n - SSR$ statistic with the asymptotic $\chi^2_p$ distribution.

The first auxiliary regression purges the impact of parameter estimation uncertainty in the OLS estimator $\hat{\beta}$, and the second auxiliary regression delivers a test statistic robust to conditional heteroskedasticity of unknown form.
A heteroskedasticity-robust version of the kernel-based test replaces the centering and scaling factors $C(p)$ and $D(p)$ by
$$
\hat{C}(p) \equiv \hat{\gamma}^2(0)\sum_{j=1}^{n-1}k^2(j/p) + \sum_{j=1}^{n-1}k^2(j/p)\hat{\gamma}_{22}(j),
$$
$$
\hat{D}(p) \equiv 2\hat{\gamma}^4(0)\sum_{j=1}^{n-2}k^4(j/p) + 4\hat{\gamma}^2(0)\sum_{j=1}^{n-2}k^4(j/p)\hat{\gamma}_{22}(j)
+ 2\sum_{j=1}^{n-2}\sum_{l=1}^{n-2}k^2(j/p)k^2(l/p)\hat{C}(0; j, l)^2,
$$
with
$$\hat{\gamma}_{22}(j) \equiv n^{-1}\sum_{t=j+1}^{n}[e_t^2 - \hat{\gamma}(0)][e_{t-j}^2 - \hat{\gamma}(0)]$$
and
$$\hat{C}(0; j, l) \equiv n^{-1}\sum_{t=\max(j,l)+1}^{n}[e_t^2 - \hat{\gamma}(0)]e_{t-j}e_{t-l}.$$
Intuitively, the centering and scaling factors have taken into account possible volatility clustering and asymmetric features of volatility dynamics, so the resulting $\hat{M}$ test is robust to these effects. It allows for various volatility processes, including GARCH models, Nelson's (1991) EGARCH, and Glosten et al.'s (1993) Threshold GARCH models.
5.9 Conclusion
In this chapter, after introducing some basic concepts in time series analysis, we show
that the asymptotic theory established under the i.i.d. assumption in Chapter 4 carries
over to linear ergodic stationary time series regression models with MDS disturbances.
The MDS assumption for the regression disturbances plays a key role here. For a static
linear regression model, the MDS assumption is crucial for the validity of White’s (1980)
heteroskedasticity-consistent variance-covariance matrix estimator. For a dynamic linear
regression model, the MDS assumption is crucial for correct model speci…cation for the
conditional mean E(Yt jIt 1 ):
To check the validity of the MDS assumption, one can test serial correlation in
the regression disturbances. We introduce a number of tests for serial correlation and
discuss the di¤erence in testing serial correlation between a static regression model and
a dynamic regression model.
EXERCISES
5.1. (a) Suppose that using the Lagrange Multiplier test, one finds that there exists serial correlation in $\{\varepsilon_t\}$. Can we conclude that $\{\varepsilon_t\}$ is not a martingale difference sequence (MDS)? Give your reasoning.
(b) Suppose one finds that there exists no serial correlation in $\{\varepsilon_t\}$. Can we conclude that $\{\varepsilon_t\}$ is an MDS? Give your reasoning. [Hint: Consider a process $\varepsilon_t = z_{t-1}z_{t-2} + z_t$, where $z_t \sim$ i.i.d.$(0, \sigma^2)$.]

5.2. Suppose $\{Z_t\}$ is a zero-mean weakly stationary process with spectral density function $h(\omega)$ and normalized spectral density function $f(\omega)$. Show that:
(a) $f(\omega)$ is real-valued for all $\omega \in [-\pi, \pi]$;
(b) $f(\omega)$ is a symmetric function, i.e., $f(-\omega) = f(\omega)$;
(c) $\int_{-\pi}^{\pi}f(\omega)d\omega = 1$;
(d) $f(\omega) \geq 0$ for all $\omega \in [-\pi, \pi]$. [Hint: Consider the limit of $E|n^{-1/2}\sum_{t=1}^{n}Z_te^{it\omega}|^2$, the variance of the complex-valued random variable $n^{-1/2}\sum_{t=1}^{n}Z_te^{it\omega}$.]
5.3. Suppose a time series linear regression model
$$Y_t = X_t'\beta^o + \varepsilon_t,$$
where the disturbance $\varepsilon_t$ is directly observable, satisfies Assumptions 5.1–5.3. This class of models contains both static regression models and dynamic regression models.
(a) Does the condition $E(\varepsilon_t|X_t) = 0$ imply that $\{\varepsilon_t\}$ is a white noise? Explain.
(b) If $\{\varepsilon_t\}$ is an MDS, does it imply $E(\varepsilon_t|X_t) = 0$? Explain.
(c) If $\{\varepsilon_t\}$ is serially correlated, does it necessarily imply $E(\varepsilon_t|X_t) \neq 0$, i.e., the linear regression model is misspecified for $E(Y_t|X_t)$? Explain.
Suppose a time series linear regression model $Y_t = X_t'\beta^o + \varepsilon_t$, where the disturbance $\varepsilon_t$ is directly observable. We are interested in testing the null hypothesis $H_0$ that $\{\varepsilon_t\}$ is serially uncorrelated. Suppose Assumptions 5.1–5.6 hold.
(a) Consider the auxiliary regression
$$\varepsilon_t = \sum_{j=1}^{p}\alpha_j\varepsilon_{t-j} + u_t, \quad t = p+1, \ldots, n.$$
Let $\tilde{R}^2_{uc}$ be the uncentered $R^2$ from the OLS estimation of this auxiliary regression. Show that $(n-2p)\tilde{R}^2_{uc} \overset{d}{\to} \chi^2_p$ as $n \to \infty$ under $H_0$.
(b) Now consider another auxiliary regression
$$\varepsilon_t = \alpha_0 + \sum_{j=1}^{p}\alpha_j\varepsilon_{t-j} + u_t, \quad t = p+1, \ldots, n.$$
Let $\tilde{R}^2$ be the centered $R^2$ from this auxiliary regression model. Show that $(n-2p)\tilde{R}^2 \overset{d}{\to} \chi^2_p$ as $n \to \infty$ under $H_0$.
(c) Which test statistic, $(n-2p)\tilde{R}^2_{uc}$ or $(n-2p)\tilde{R}^2$, performs better in finite samples? Give your heuristic reasoning.
Suppose a time series linear regression model $Y_t = X_t'\beta^o + \varepsilon_t$, where the disturbance $\varepsilon_t$ is directly observable. We are interested in testing the null hypothesis $H_0$ that $\{\varepsilon_t\}$ is serially uncorrelated. Suppose Assumptions 5.1–5.5 hold, and $E(\varepsilon_t^2|X_t) \neq \sigma^2$.
(a) Consider the auxiliary regression
$$\varepsilon_t = \sum_{j=1}^{p}\alpha_j\varepsilon_{t-j} + u_t, \quad t = p+1, \ldots, n.$$
Construct an asymptotically valid test statistic for the null hypothesis that there exists
no serial correlation in $\{\varepsilon_t\}$.

Suppose $\{\varepsilon_t\}$ follows an ARCH(1) process
$$\varepsilon_t = z_t\sigma_t, \quad \sigma_t^2 = \alpha_0 + \alpha_1\varepsilon_{t-1}^2,$$
where $\{z_t\}$ is i.i.d. $N(0,1)$.
(a) Show $E(\varepsilon_t|I_{t-1}) = 0$ and $\mathrm{cov}(\varepsilon_t, \varepsilon_{t-j}) = 0$ for all $j > 0$, where $I_{t-1} = \{\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots\}$.
(b) Show $\mathrm{corr}(\varepsilon_t^2, \varepsilon_{t-1}^2) = \alpha_1$.
(c) Show that the kurtosis of $\varepsilon_t$ is given by
$$K = \frac{E(\varepsilon_t^4)}{[E(\varepsilon_t^2)]^2} = \frac{3(1-\alpha_1^2)}{1-3\alpha_1^2} > 3 \quad \text{if } \alpha_1 > 0.$$
Suppose a time series linear regression model $Y_t = X_t'\beta^o + \varepsilon_t$, where the disturbance $\varepsilon_t$ is directly observable, satisfies Assumptions 5.1–5.5. Both static and dynamic regression models are covered.
Suppose there exists autoregressive conditional heteroskedasticity (ARCH) for $\{\varepsilon_t\}$, namely,
$$E(\varepsilon_t^2|I_{t-1}) = \alpha_0 + \sum_{j=1}^{q}\alpha_j\varepsilon_{t-j}^2,$$
where $I_{t-1}$ is the sigma-field generated by $\{\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots\}$. Does this imply that one has to use the asymptotic variance formula $Q^{-1}VQ^{-1}$ for $\mathrm{avar}(\sqrt{n}\,\hat{\beta})$? Explain.
Suppose a time series linear regression model $Y_t = X_t'\beta^o + \varepsilon_t$, where the disturbance $\varepsilon_t$ is directly observable, satisfies Assumptions 5.1–5.5, and the two time series $\{X_t\}$ and $\{\varepsilon_t\}$ are independent of each other.
Suppose there exists autoregressive conditional heteroskedasticity for $\{\varepsilon_t\}$, namely,
$$E(\varepsilon_t^2|I_{t-1}) = \alpha_0 + \sum_{j=1}^{q}\alpha_j\varepsilon_{t-j}^2;$$
in particular, suppose
$$E(\varepsilon_t^2|I_{t-1}) = \alpha_0 + \alpha_1Y_{t-1}^2.$$
What is the form of $\mathrm{avar}(\sqrt{n}\,\hat{\beta})$, where $\hat{\beta}$ is the OLS estimator?
Suppose a time series linear regression model $Y_t = X_t'\beta^o + \varepsilon_t$ satisfies Assumptions 5.1, 5.2 and 5.4, the two time series $\{X_t\}$ and $\{\varepsilon_t\}$ are independent of each other, and $E(\varepsilon_t) = 0$. Suppose further that there exists serial correlation in $\{\varepsilon_t\}$.
(a) Does the presence of serial correlation in $\{\varepsilon_t\}$ affect the consistency of $\hat{\beta}$ for $\beta^o$? Explain.
(b) Does the presence of serial correlation in $\{\varepsilon_t\}$ affect the form of the asymptotic variance $\mathrm{avar}(\sqrt{n}\,\hat{\beta}) = Q^{-1}VQ^{-1}$, where $V = \lim_{n\to\infty}\mathrm{var}(n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t)$? In particular, do we still have $V = E(X_tX_t'\varepsilon_t^2)$? Explain.
Suppose an AR(1) model
$$Y_t = \beta_0^o + \beta_1^oY_{t-1} + \varepsilon_t = X_t'\beta^o + \varepsilon_t,$$
where $X_t = (1, Y_{t-1})'$, satisfies Assumptions 5.1, 5.2 and 5.4. Suppose further $\{\varepsilon_t\}$ follows an MA(1) process:
$$\varepsilon_t = \theta v_{t-1} + v_t,$$
where $\{v_t\}$ is i.i.d.$(0, \sigma_v^2)$. Thus, there exists first order serial correlation in $\{\varepsilon_t\}$.
Is the OLS estimator $\hat{\beta}$ consistent for $\beta^o$? Explain.
CHAPTER 6 LINEAR REGRESSION MODELS UNDER CONDITIONAL HETEROSKEDASTICITY AND AUTOCORRELATION

Abstract: When the regression disturbance $\{\varepsilon_t\}$ displays serial correlation, the asymptotic results in Chapter 5 are no longer applicable, because the asymptotic variance of the OLS estimator will depend on serial correlation in $\{X_t\varepsilon_t\}$. In this chapter, we introduce a method to estimate the asymptotic variance of the OLS estimator in the presence of heteroskedasticity and autocorrelation, and then develop test procedures based on it. Some empirical applications are considered.
Motivation
Example 1 [Testing a zero population mean]: Suppose the daily stock return fYt g
is a stationary ergodic process with E(Yt ) = : We are interested in testing the null
hypothesis
H0 : = 0
versus the alternative hypothesis
HA : 6= 0:
A test for H0 can be based on the sample mean
$$\bar{Y}_n = n^{-1}\sum_{t=1}^{n} Y_t.$$
By a suitable CLT (White (1999)), the sampling distribution of the sample mean Ȳ_n scaled by √n satisfies
$$\sqrt{n}\,\bar{Y}_n \xrightarrow{d} N(0, V),$$
where the asymptotic variance of the sample mean is
$$V \equiv \mathrm{avar}(\sqrt{n}\,\bar{Y}_n).$$
Because
$$\mathrm{var}(\sqrt{n}\,\bar{Y}_n) = n^{-1}\sum_{t=1}^{n}\mathrm{var}(Y_t) + 2n^{-1}\sum_{t=2}^{n}\sum_{j=1}^{t-1}\mathrm{cov}(Y_t, Y_{t-j}),$$
serial correlation in {Y_t} is expected to affect the asymptotic variance of √n Ȳ_n. Thus, unlike in Chapter 5, avar(√n Ȳ_n) is no longer equal to var(Y_t).
Suppose there exists a variance-covariance estimator V̂ such that V̂ →p V. Then, by the Slutsky theorem, we can construct a test statistic which is asymptotically N(0,1) under H0:
$$\frac{\sqrt{n}\,\bar{Y}_n}{\sqrt{\hat{V}}} \xrightarrow{d} N(0, 1).$$
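For concreteness, the following is a minimal Python (NumPy) sketch of this test, using a Bartlett-kernel estimator for V̂. The simulated return series, the function name, and the fixed lag order are illustrative assumptions only.

```python
import numpy as np

def bartlett_lrv(y, p):
    """Bartlett-kernel (Newey-West type) estimate of the long-run variance of y."""
    y = np.asarray(y, dtype=float)
    yc = y - y.mean()
    lrv = np.mean(yc * yc)                            # gamma(0)
    for j in range(1, p + 1):
        gamma_j = np.mean(yc[j:] * yc[:-j])           # sample autocovariance at lag j
        lrv += 2.0 * (1.0 - j / (p + 1)) * gamma_j    # Bartlett weight
    return lrv

# Illustrative "daily returns" with mild serial correlation; mean zero under H0
rng = np.random.default_rng(0)
n = 1000
e = rng.standard_normal(n + 1)
y = 0.3 * e[:-1] + e[1:]                              # MA(1) returns

V_hat = bartlett_lrv(y, p=10)
t_stat = np.sqrt(n) * y.mean() / np.sqrt(V_hat)
print(t_stat)                                          # approximately N(0,1) under H0
```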
Example 2 [Unbiasedness Hypothesis]: Consider the following linear regression model
$$S_{t+\tau} = \alpha + \beta F_t(\tau) + \varepsilon_{t+\tau},$$
where S_{t+τ} is the spot foreign exchange rate at time t + τ, F_t(τ) is the forward exchange rate (with maturity τ > 0) at time t, and the disturbance ε_{t+τ} is not observable. Forward currency contracts are agreements to exchange, in the future, fixed amounts of two currencies at prices set today. No money changes hands until the contract expires or is offset.
It has been a longstanding controversy whether the current forward rate F_t(τ), as opposed to the current spot rate S_t, is a better predictor of the future spot rate S_{t+τ}. The unbiasedness hypothesis states that the forward exchange rate (with maturity τ) at time t is the optimal predictor for the spot exchange rate at time t + τ, namely,
$$H_0: \alpha = 0,\ \beta = 1, \quad \text{and} \quad E(\varepsilon_{t+\tau} \mid I_t) = 0 \ \text{a.s.}, \quad t = 1, 2, \ldots.$$
However, with τ > 1, we generally do not have E(ε_{t+j}|I_t) = 0 a.s. for 1 ≤ j ≤ τ − 1. Consequently, there exists serial correlation in {ε_t} up to τ − 1 lags under H0.
Example 3 [Long Horizon Return Predictability]: There has been much interest
in regressions of asset returns, measured over various horizons, on various forecasting variables. The latter include ratios of price to dividends or earnings, various interest rate measures such as the yield spread between long- and short-term rates and the quality yield spread between low- and high-grade corporate bonds, and the short-term interest rate.
Consider a regression of Y_{t+h,h} on such forecasting variables, where Y_{t+h,h} is the cumulative return over the holding period from time t to time t + h, namely,
$$Y_{t+h,h} = \sum_{j=1}^{h} R_{t+j},$$
where R_{t+j} is the asset return in period t + j, r_t is the short-term interest rate at time t, and d_t − p_t is the log dividend-price ratio, which is expected to be a good proxy for market expectations of future stock returns, because d_t − p_t is equal to the expectation of the sum of all discounted future returns and dividend growth rates. In empirical finance, there has been an interest in investigating how the predictability of asset returns by various forecasting variables depends on the time horizon h. For example, it is expected that d_t − p_t is a better proxy for expectations of long horizon returns than for expectations of short horizon returns. When monthly data are used and h > 1, observations on Y_{t+h,h} overlap. As a result, the regression disturbance ε_{t+h,h} is expected to display serial correlation up to lag order h − 1.
Example 4 [Relationship between GDP and Money Supply]: Consider the linear macroeconomic regression model
$$Y_t = \alpha + \beta M_t + \varepsilon_t,$$
where Y_t is GDP at time t, M_t is the money supply at time t, and ε_t is an unobservable disturbance such that E(ε_t|M_t) = 0 but there may exist strong serial correlation of unknown form in {ε_t}.
Question: What happens to the OLS estimator β̂ if the disturbance {ε_t} displays conditional heteroskedasticity (i.e., E(ε_t²|X_t) = σ² a.s. fails) and/or autocorrelation (i.e., cov(ε_t, ε_{t−j}) ≠ 0 for some j > 0)? In particular,
Is the OLS estimator β̂ consistent for β°?
Are the t-test and F-test statistics applicable for large sample inference?
Q ≡ E(X_t X_t') is p.d.
Assumption 6.5 [Long-run Variance]: (i) For j = 0, 1, ..., put the K × K matrix Γ(j) = cov(X_tε_t, X_{t−j}ε_{t−j}); the long-run variance V = Σ_{j=−∞}^{∞} Γ(j) is finite and p.d.
(ii) The conditional expectation satisfies
$$E(X_t\varepsilon_t \mid X_{t-j}\varepsilon_{t-j}, X_{t-j-1}\varepsilon_{t-j-1}, \ldots) \xrightarrow{q.m.} 0 \quad \text{as } j \to \infty;$$
(iii) Σ_{j=0}^{∞} [E(r_j'r_j)]^{1/2} < ∞, where r_j = E(X_tε_t | X_{t−j}ε_{t−j}, X_{t−j−1}ε_{t−j−1}, ...) − E(X_tε_t | X_{t−j−1}ε_{t−j−1}, X_{t−j−2}ε_{t−j−2}, ...).
Remarks:
Assumptions 6.1–6.4 have been assumed in Chapter 5, but Assumption 6.5 is new. Assumption 6.5(i) allows for both conditional heteroskedasticity and autocorrelation of unknown form in {ε_t}, and no normality assumption is imposed on {ε_t}.
We do not assume that {X_tε_t} is an MDS, although E(X_tε_t) = 0 as implied by E(ε_t|X_t) = 0 a.s. Note that E(ε_t|X_t) = 0 a.s. does not necessarily imply that {X_tε_t} is an MDS in a time series context. See the aforementioned examples for which {X_tε_t} is not an MDS.
Assumptions 6.5(ii, iii) imply that the serial dependence of X_tε_t on its past history, in terms of mean and variance respectively, vanishes as the lag order j → ∞. Intuitively, r_j may be viewed as the net effect of X_{t−j}ε_{t−j} on the conditional mean of X_tε_t; Assumption 6.5(iii) requires in particular that E(r_j'r_j) → 0 as j → ∞.
$$\sqrt{n}(\hat{\beta} - \beta^o) = \hat{Q}^{-1}\, n^{-1/2}\sum_{t=1}^{n} X_t\varepsilon_t.$$
Suppose the CLT holds for {X_tε_t}. That is, suppose
$$n^{-1/2}\sum_{t=1}^{n} X_t\varepsilon_t \xrightarrow{d} N(0, V).$$
Then, by the Slutsky theorem, we have
$$\sqrt{n}(\hat{\beta} - \beta^o) \xrightarrow{d} N(0, Q^{-1}VQ^{-1}).$$
$$V_n \equiv \mathrm{var}\Big(n^{-1/2}\sum_{t=1}^{n} X_t\varepsilon_t\Big) = \sum_{j=-(n-1)}^{n-1}\big(1 - |j|/n\big)\,\Gamma(j) \to \sum_{j=-\infty}^{\infty}\Gamma(j) \quad \text{as } n \to \infty$$
by dominated convergence. Therefore, we have V = Σ_{j=−∞}^{∞} Γ(j).
When cov(g_t, g_{t−j}) is p.s.d. for all j > 0, the difference Σ_{j=−∞}^{∞} Γ(j) − Γ(0) is a p.s.d. matrix. Intuitively, when Γ(j) is p.s.d., a large deviation of g_t from its mean will tend to be followed by another large deviation. As a result, V − Γ(0) is p.s.d.
To explore the link between the long-run variance V and the spectral density matrix of {X_tε_t}, which is crucial for consistent estimation of V, we now extend the concept of the spectral density of a univariate time series to a multivariate time series context. Suppose
$$\sum_{j=-\infty}^{\infty}\|\Gamma(j)\| < \infty.$$
Then the Fourier transform of the autocovariance function Γ(j) exists and is given by
$$H(\omega) = \frac{1}{2\pi}\sum_{j=-\infty}^{\infty}\Gamma(j)\exp(-ij\omega), \qquad \omega \in [-\pi, \pi],$$
where i = √−1. The K × K matrix-valued function H(ω) is called the spectral density matrix of the weakly stationary vector-valued time series process {g_t}.
Remarks:
7
Both H(ω) and Γ(j) are Fourier transforms of each other. They contain the same amount of information on the serial dependence of the process {g_t = X_tε_t}. The spectral density matrix H(ω) is useful to identify business cycles (see Sargent 1987, Dynamic Macroeconomics, 2nd Edition). For example, if g_t is the GDP growth rate at time t, then H(ω) can be used to identify business cycles of the economy.
When ω = 0, we obtain the long-run variance-covariance matrix
$$V = 2\pi H(0) = \sum_{j=-\infty}^{\infty}\Gamma(j).$$
That is, the long-run variance V is 2π times the spectral density matrix of the time series process {g_t} at frequency zero. As will be seen below, this link provides a basis for consistent nonparametric estimation of V.
Recall that g_t = (g_{0t}, g_{1t}, ..., g_{kt})', where g_{lt} = X_{lt}ε_t for 0 ≤ l ≤ k. Then the (l + 1, m + 1)-th element of Γ(j) is
$$[\Gamma(j)]_{(l+1,m+1)} = \gamma_{lm}(j) = \mathrm{cov}[g_{lt}, g_{m(t-j)}],$$
which is the cross-covariance between X_{lt}ε_t and X_{m(t−j)}ε_{t−j}. We note that in general
$$\gamma_{lm}(j) \neq \gamma_{lm}(-j), \qquad \text{although} \qquad \Gamma(j) = \Gamma(-j)'.$$
The function
$$H_{lm}(\omega) = \frac{1}{2\pi}\sum_{j=-\infty}^{\infty}\gamma_{lm}(j)e^{-ij\omega}$$
is called the cross-spectral density between {g_{lt}} and {g_{mt}}. The cross-spectrum is very useful in investigating comovements between different economic time series. The popular concept of Granger causality was first defined using the cross-spectrum (see Granger 1969, Econometrica). In general, H_{lm}(ω) is complex-valued.
8
Question: How to estimate V?
Recall that
$$V = \sum_{j=-\infty}^{\infty}\Gamma(j),$$
where Γ(j) = cov(g_t, g_{t−j}). The long-run variance V is 2π times H(0), the spectral density matrix at frequency zero. This provides the basis for using a nonparametric approach to estimating V.
Consider first the plug-in estimator that replaces each Γ(j) by its sample analogue:
$$\hat{V} = \sum_{j=-(n-1)}^{n-1}\hat{\Gamma}(j), \qquad \hat{\Gamma}(j) = n^{-1}\sum_{t=j+1}^{n}\hat{g}_t\hat{g}_{t-j}' \ \ (j \ge 0), \quad \hat{\Gamma}(-j) = \hat{\Gamma}(j)',$$
with ĝ_t = X_t e_t and e_t the OLS residual. This estimator is not consistent for V.
Question: Why?
There are too many estimated terms in the summation over lag orders. In fact, there are n estimated autocovariance matrices {Γ̂(j)}_{j=0}^{n−1} in V̂. The asymptotic variance of the estimator V̂ defined above is proportional to the ratio of the number of estimated autocovariance matrices {Γ̂(j)} to the sample size n, which will not vanish to zero if the number of estimated covariances is the same as or close to the sample size n.
Nonparametric Kernel Estimation
The above explanation motivates us to consider the following truncated sum
$$\hat{V} = \sum_{j=-p}^{p}\hat{\Gamma}(j),$$
where p is a positive integer. If p is fixed (i.e., p does not grow when the sample size n increases), however, the bias implied by
$$\hat{V} \xrightarrow{p} \sum_{j=-p}^{p}\Gamma(j) \neq 2\pi H(0) = V$$
will never vanish as n → ∞. Hence, we should let p grow to infinity as n → ∞; that is, let p = p(n) → ∞ as n → ∞. The bias will then vanish to zero as n → ∞. However, we cannot let p grow as fast as the sample size n; otherwise, the variance of V̂ will never vanish to zero. Therefore, to ensure consistency of V̂ for V, we should balance the bias and the variance of V̂ properly. This leads to the weighted variance estimator
$$\hat{V} = \sum_{j=-p_n}^{p_n} k(j/p_n)\,\hat{\Gamma}(j),$$
where the weighting function k(·) is called a kernel function. An example of such kernels is the Bartlett kernel
$$k(z) = (1 - |z|)\,\mathbf{1}(|z| \le 1),$$
where 1(·) is the indicator function, which takes value 1 if the condition inside holds, and takes value 0 if the condition inside does not hold. Newey and West (1987, Econometrica; 1994, Review of Economic Studies) first used this kernel function to estimate V in econometrics. The truncated variance estimator V̂ can be viewed as a kernel-based estimator with the use of the truncated kernel k(z) = 1(|z| ≤ 1), which assigns equal weight to each of the first p_n lags.
Most kernels are downward-weighting in the sense that k(z) → 0 as |z| → ∞. The use of a downward weighting kernel may enhance estimation efficiency of V because when Σ_{j=−∞}^{∞} ||Γ(j)|| < ∞, we have Γ(j) → 0 as j → ∞, and so it is more efficient to assign a larger weight to a lower order j and a smaller weight to a higher order j.
In fact, we can consider a more general form of estimator for V:
$$\hat{V} = \sum_{j=1-n}^{n-1} k(j/p_n)\,\hat{\Gamma}(j),$$
where k(·) may have unbounded support. Although the lag order j sums from 1 − n to n − 1, the variance of the estimator V̂ still vanishes to zero, provided p_n → ∞, p_n/n → 0, and k(·) discounts higher order lags as j → ∞. An example of k(·) that has unbounded support is the Quadratic-Spectral kernel:
$$k(z) = \frac{3}{(\pi z)^2}\left[\frac{\sin(\pi z)}{\pi z} - \cos(\pi z)\right], \qquad -\infty < z < \infty.$$
Andrews (1991, Econometrica) uses it to estimate V. This kernel also delivers a p.s.d. matrix. Moreover, it minimizes the asymptotic MSE of the estimator V̂ over a class of kernel functions.
Under certain regularity conditions on the sample {Y_t, X_t'}'_{t=1}^{n}, the kernel function k(·), and the lag order p_n (Newey and West 1987, Andrews 1991), we have
$$\hat{V} \xrightarrow{p} V.$$
At the point 0, k(·) attains its maximal value, and the fact that k(·) is square-integrable implies k(z) → 0 as |z| → ∞.
For derivations of the asymptotic variance and asymptotic bias of the long-run variance estimator V̂, see Newey and West (1987) and Andrews (1991).
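For concreteness, the following is a minimal Python (NumPy) sketch of the kernel estimator V̂ = Σ_j k(j/p_n) Γ̂(j) with the Bartlett kernel, applied to a K-dimensional series g_t = X_t e_t. The function name and the fixed bandwidth are illustrative assumptions.

```python
import numpy as np

def hac_long_run_variance(G, p):
    """Bartlett-kernel estimator of V = sum_j Gamma(j) for the rows g_t of G (n x K)."""
    G = np.asarray(G, dtype=float)
    n, K = G.shape
    V = G.T @ G / n                            # Gamma_hat(0)
    for j in range(1, min(p, n - 1) + 1):
        w = 1.0 - j / (p + 1)                  # Bartlett weight k(j/(p+1))
        Gamma_j = G[j:].T @ G[:-j] / n         # Gamma_hat(j)
        V += w * (Gamma_j + Gamma_j.T)         # add Gamma_hat(j) and Gamma_hat(-j)
    return V

# Usage sketch: G = X * e[:, None] where e are OLS residuals and X is the regressor matrix.
```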
By Assumptions 6.1, 6.2 and 6.4 and the WLLN for stationary ergodic processes, we have
$$\hat{Q} \xrightarrow{p} Q \quad \text{and} \quad \hat{Q}^{-1} \xrightarrow{p} Q^{-1}.$$
Similarly, by Assumptions 6.1–6.3 and 6.5(i), we have
$$n^{-1}\sum_{t=1}^{n} X_t\varepsilon_t \xrightarrow{p} E(X_t\varepsilon_t) = 0,$$
using the WLLN for ergodic stationary processes, where E(X_tε_t) = 0 follows from Assumption 6.2 (E(ε_t|X_t) = 0 a.s.) and the LIE.
The proof of this theorem calls for the use of a new CLT.
12
Lemma 6.3 [CLT for Zero Mean Ergodic Stationary Processes (White 1984, Theorem 5.15)]: Suppose {Z_t} is a stationary ergodic process with
(i) E(Z_t) = 0;
(ii) V = Σ_{j=−∞}^{∞} Γ(j) is finite and nonsingular, where Γ(j) = E(Z_t Z_{t−j}');
(iii) E(Z_t | Z_{t−j}, Z_{t−j−1}, ...) → 0 in quadratic mean as j → ∞;
(iv) Σ_{j=0}^{∞} [E(r_j'r_j)]^{1/2} < ∞, where r_j = E(Z_t | Z_{t−j}, Z_{t−j−1}, ...) − E(Z_t | Z_{t−j−1}, Z_{t−j−2}, ...).
Then as n → ∞,
$$n^{1/2}\bar{Z}_n = n^{-1/2}\sum_{t=1}^{n} Z_t \xrightarrow{d} N(0, V).$$
We now use this CLT to derive the asymptotic distribution of √n(β̂ − β°).
By Assumptions 6.1–6.3 and 6.5 and the CLT for stationary ergodic processes, we have
$$n^{-1/2}\sum_{t=1}^{n} X_t\varepsilon_t \xrightarrow{d} N(0, V),$$
where V = Σ_{j=−∞}^{∞} Γ(j) is as in Assumption 6.5. Also, Q̂ →p Q and Q̂⁻¹ →p Q⁻¹ by Assumption 6.4 and the WLLN for ergodic stationary processes. We then have by the Slutsky theorem
$$\sqrt{n}(\hat{\beta} - \beta^o) \xrightarrow{d} N(0, Q^{-1}VQ^{-1}).$$
13
We directly assume a consistent estimator V̂ for V.
Assumption 6.6: V̂ →p V.
When there exists serial correlation of unknown form, we can estimate V using the nonparametric kernel estimator V̂ as described in Section 6.3. In some special scenarios, we may have Γ(j) = 0 for all j > p₀, where p₀ is a fixed lag order. An example of this case is Example 2 in Section 6.1. In this case, we can use the following estimator:
$$\hat{V} = \sum_{j=-p_0}^{p_0}\hat{\Gamma}(j).$$
It can be shown that V̂ →p V in this case.
For the case where J = 1, a robust t-type test statistic is
$$\frac{\sqrt{n}(R\hat{\beta} - r)}{\sqrt{R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R'}} \xrightarrow{d} N(0, 1).$$
This statistic uses an asymptotic variance estimator that is robust to conditional heteroskedasticity and autocorrelation of unknown form.
Theorem 6.5: Under Assumptions 6.1–6.6, the Wald test statistic
$$\hat{W} = n(R\hat{\beta} - r)'[R\hat{Q}^{-1}\hat{V}\hat{Q}^{-1}R']^{-1}(R\hat{\beta} - r) \xrightarrow{d} \chi^2_J$$
as n → ∞ under H0: Rβ° = r.
Proof: Because
$$\sqrt{n}(R\hat{\beta} - r) \xrightarrow{d} N(0, RQ^{-1}VQ^{-1}R'),$$
we have the quadratic form
$$n(R\hat{\beta} - r)'[RQ^{-1}VQ^{-1}R']^{-1}(R\hat{\beta} - r) \xrightarrow{d} \chi^2_J.$$
Using the expression Q̂ = X'X/n, we have an equivalent expression for Ŵ:
$$\hat{W} = n^{-1}(R\hat{\beta} - r)'[R(X'X)^{-1}\hat{V}(X'X)^{-1}R']^{-1}(R\hat{\beta} - r) \xrightarrow{d} \chi^2_J.$$
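A minimal Python (NumPy) sketch of this robust Wald statistic is given below, reusing a Bartlett-kernel long-run variance estimator. The function name, the array layout (Y of shape (n,), X of shape (n,K), R of shape (J,K), r of shape (J,)), and the fixed lag order are illustrative assumptions.

```python
import numpy as np

def hac_wald(Y, X, R, r, p):
    """Robust Wald statistic W = n (Rb - r)' [R Qinv Vhat Qinv R']^{-1} (Rb - r)."""
    n, K = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ Y)          # OLS estimator
    e = Y - X @ b
    G = X * e[:, None]                              # g_t = X_t e_t
    V = G.T @ G / n                                 # Bartlett-kernel long-run variance
    for j in range(1, p + 1):
        w = 1.0 - j / (p + 1)
        Gj = G[j:].T @ G[:-j] / n
        V += w * (Gj + Gj.T)
    Qinv = np.linalg.inv(X.T @ X / n)
    avar = Qinv @ V @ Qinv                          # Q^{-1} V Q^{-1}
    d = R @ b - r
    return n * d @ np.linalg.solve(R @ avar @ R.T, d)   # approx chi^2_J under H0
```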
Remarks:
The standard t-statistic and F-statistic cannot be used when there exists autocorrelation and conditional heteroskedasticity in {X_tε_t}.
Question: Can we use this Wald test when Γ(j) = 0 for all nonzero j?
Yes. But this is not a good test statistic because it may perform poorly in finite samples. In particular, it usually overrejects the correct null hypothesis H0 in finite samples even if Γ(j) = 0 for all j ≠ 0. In the case where Γ(j) = 0 for all j ≠ 0, a better estimator to use is
$$\hat{V} = \hat{\Gamma}(0) = n^{-1}\sum_{t=1}^{n} X_tX_t'e_t^2 = X'D(e)D(e)'X/n.$$
Question: Why do the robust t- and Wald tests tend to overreject H0 in the presence
of HAC?
Simulation Evidence
15
Question: How to test whether we need to use the long-run variance-covariance matrix estimator? That is, how to test the null hypothesis
$$H_0: 2\pi H(0) \equiv \sum_{j=-\infty}^{\infty}\Gamma(j) = \Gamma(0)?$$
We now provide a test for H0 under case (i). See Hong (1997) in a related univariate context.
To test the null hypothesis that Σ_{j=1}^{∞} Γ(j) = 0, we can use a consistent estimator Â (say) for Σ_{j=1}^{∞} Γ(j) and then check whether Â is close to zero. Any significant difference of Â from zero will indicate the violation of the null hypothesis, and thus a long-run variance estimator is needed.
To estimate Σ_{j=1}^{∞} Γ(j) consistently, we can use a nonparametric kernel estimator
$$\hat{A} = \sum_{j=1}^{n-1} k(j/p_n)\,\mathrm{vech}[\hat{\Gamma}(j)].$$
Next, we consider the case when {g_t = X_tε_t} is autoregressively conditionally heteroskedastic, namely var(g_t|I_{t−1}) ≠ var(g_t). In this case, the test statistic is
$$\hat{M} = \hat{A}'\hat{B}^{-1}\hat{A},$$
where
$$\hat{B} = \sum_{j=1}^{n-1}\sum_{l=1}^{n-1} k(j/p)k(l/p)\,\hat{C}(j, l), \qquad \hat{C}(j, l) = \frac{1}{n}\sum_{t=1+\max(j,l)}^{n}\mathrm{vech}(\hat{g}_t\hat{g}_{t-j}')\,\mathrm{vech}'(\hat{g}_t\hat{g}_{t-l}'),$$
with ĝ_t = X_t e_t. Under the assumption that {g_t = X_tε_t} is an MDS, we have
$$\hat{M} \xrightarrow{d} \chi^2_{K(K+1)/2}.$$
In fact, the above test is closely related to a variance ratio test that is popular in financial econometrics. Extending an idea of Cochrane (1988), Lo and MacKinlay (1988) first rigorously presented an asymptotic theory for a variance ratio test for the MDS hypothesis of asset returns {Y_t}. Recall that Σ_{j=1}^{p} Y_{t−j} is the cumulative asset return over a total of p periods. Then under the MDS hypothesis, which implies γ(j) ≡ cov(Y_t, Y_{t−j}) = 0 for all j > 0, one has
$$\frac{\mathrm{var}\big(\sum_{j=1}^{p} Y_{t-j}\big)}{p\,\mathrm{var}(Y_t)} = \frac{p\,\gamma(0) + 2p\sum_{j=1}^{p-1}(1 - j/p)\gamma(j)}{p\,\gamma(0)} = 1.$$
This unity property of the variance ratio can be used to test the MDS hypothesis because any departure from unity is evidence against the MDS hypothesis.
The variance ratio test is essentially based on the statistic
$$VR_o \equiv \sqrt{n/p}\sum_{j=1}^{p-1}(1 - j/p)\,\hat{\rho}(j) = \sqrt{n/p}\left[\pi\hat{f}(0) - \tfrac{1}{2}\right],$$
where
$$\hat{f}(0) = \frac{1}{2\pi}\sum_{j=-p}^{p}\Big(1 - \frac{|j|}{p}\Big)\hat{\rho}(j)$$
is a kernel-based normalized spectral density estimator at frequency 0, with the Bartlett kernel k(z) = (1 − |z|)1(|z| ≤ 1) and a lag order equal to p. Thus, the variance ratio test is the same as checking whether the long-run variance is equal to the individual variance γ(0). Because VR_o is based on a spectral density estimator at frequency 0, it is particularly powerful against long memory processes, whose spectral density at frequency 0 is infinite (see Robinson 1994 for discussion of long memory processes).
Under the MDS hypothesis with conditional homoskedasticity for {Y_t}, Lo and MacKinlay (1988) show that for any fixed p,
$$VR_o \xrightarrow{d} N\left[0,\ \frac{2(2p-1)(p-1)}{3p}\right] \quad \text{as } n \to \infty.$$
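The following is a minimal Python (NumPy) sketch of a variance ratio statistic of this type under conditional homoskedasticity. The particular normalization used below (standardizing VR(p) − 1 by the asymptotic standard error above) is one common convention and is an illustrative assumption rather than the exact statistic defined in these notes.

```python
import numpy as np

def variance_ratio_stat(y, p):
    """Standardized variance-ratio statistic for the MDS null, fixed horizon p."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    yc = y - y.mean()
    gamma0 = np.mean(yc * yc)
    vr = 1.0
    for j in range(1, p):
        rho_j = np.mean(yc[j:] * yc[:-j]) / gamma0    # sample autocorrelation at lag j
        vr += 2.0 * (1.0 - j / p) * rho_j             # VR(p) = 1 + 2 sum (1-j/p) rho(j)
    se = np.sqrt(2.0 * (2 * p - 1) * (p - 1) / (3.0 * p * n))
    return (vr - 1.0) / se                            # approx N(0,1) under the MDS null
```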
When {Y_t} displays conditional heteroskedasticity, Lo and MacKinlay (1988) also consider a heteroskedasticity-consistent variance ratio test:
$$VR \equiv \sqrt{n/p}\sum_{j=1}^{p-1}(1 - j/p)\,\hat{\rho}(j)/\hat{\sigma}^2(j),$$
where σ̂²(j) is a consistent estimator for the asymptotic variance of ρ̂(j) under conditional heteroskedasticity. Lo and MacKinlay (1988) assume a fourth order cumulant condition that
procedure. Consider a linear regression model with serially correlated errors:
$$Y_t = X_t'\beta^o + \varepsilon_t,$$
where the asymptotic variance of the OLS estimator depends on serial correlation in {ε_t}. We can consider the following transformed linear regression model:
$$Y_t - \sum_{j=1}^{p}\rho_j Y_{t-j} = \Big(X_t - \sum_{j=1}^{p}\rho_j X_{t-j}\Big)'\beta^o + \Big(\varepsilon_t - \sum_{j=1}^{p}\rho_j\varepsilon_{t-j}\Big) = \Big(X_t - \sum_{j=1}^{p}\rho_j X_{t-j}\Big)'\beta^o + v_t.$$
Step 1: Regress
$$Y_t = X_t'\beta^o + \varepsilon_t, \qquad t = 1, \ldots, n,$$
that is, Y_t on X_t, and obtain the estimated OLS residual e_t = Y_t − X_t'β̂.
Step 2: Estimate an AR(p) model for the residuals,
$$e_t = \sum_{j=1}^{p}\rho_j e_{t-j} + \tilde{v}_t, \qquad t = p+1, \ldots, n,$$
by OLS, obtaining the estimates {ρ̂_j}_{j=1}^{p}. Then run the OLS regression
$$\hat{Y}_t = \hat{X}_t'\beta^o + v_t, \qquad t = p+1, \ldots, n,$$
where Ŷ_t and X̂_t are defined in the same way as the transformed variables Y_t − Σ_{j=1}^{p} ρ_j Y_{t−j} and X_t − Σ_{j=1}^{p} ρ_j X_{t−j} respectively, with {ρ̂_j}_{j=1}^{p} replacing {ρ_j}_{j=1}^{p}. The resulting OLS estimator is denoted as β̃_a.
It can be shown that the adaptive feasible OLS estimator β̃_a has the same asymptotic properties as the infeasible OLS estimator β̃. In other words, the sampling error from the first-step estimation has no impact on the asymptotic properties of the OLS estimator in the second step. The asymptotic variance estimator of β̃_a is given by
$$\hat{s}_v^2\,\hat{Q}_{\hat{x}\hat{x}}^{-1},$$
where
$$\hat{s}_v^2 = \frac{1}{n-K}\sum_{t=p+1}^{n}\hat{v}_t^2, \qquad \hat{Q}_{\hat{x}\hat{x}} = \frac{1}{n}\sum_{t=p+1}^{n}\hat{X}_t\hat{X}_t',$$
with v̂_t = Ŷ_t − X̂_t'β̃_a. The t-test statistic, which is asymptotically N(0,1), and the J·F-test statistic, which is asymptotically χ²_J, from the last stage regression are applicable when the sample size n is large.
The estimator β̃_a is essentially the adaptive feasible GLS estimator described in Chapter 3, and it is asymptotically BLUE. This estimation method is therefore asymptotically more efficient than the robust test procedures developed earlier in this chapter, but it is based on the assumption that the AR(p) process for the disturbance {ε_t} is known. The robust test procedures are applicable when {ε_t} has conditional heteroskedasticity and serial correlation of unknown form.
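A minimal Python (NumPy) sketch of this two-step adaptive procedure is given below. The function name, the data layout, and the AR order are illustrative assumptions.

```python
import numpy as np

def cochrane_orcutt(Y, X, p):
    """Adaptive feasible GLS: OLS, AR(p) fit on residuals, OLS on filtered data."""
    n, K = X.shape
    # Step 1: OLS and residuals
    b_ols = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ b_ols
    # Step 2: fit AR(p) to residuals e_t = sum_j rho_j e_{t-j} + v_t by OLS
    E = np.column_stack([e[p - j:n - j] for j in range(1, p + 1)])
    rho = np.linalg.solve(E.T @ E, E.T @ e[p:])
    # Filter Y and X with the estimated rho's and re-run OLS
    Yf = Y[p:] - sum(rho[j - 1] * Y[p - j:n - j] for j in range(1, p + 1))
    Xf = X[p:] - sum(rho[j - 1] * X[p - j:n - j] for j in range(1, p + 1))
    b_a = np.linalg.solve(Xf.T @ Xf, Xf.T @ Yf)
    v = Yf - Xf @ b_a
    s2_v = v @ v / (len(Yf) - K)
    avar = s2_v * np.linalg.inv(Xf.T @ Xf)     # estimated var-cov matrix of b_a
    return b_a, avar
```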
EXERCISES
6.2. Suppose Γ(j) = 0 for all j > p₀, where p₀ is a fixed lag order. An example of this case is Example 2 in Section 6.1. In this case, the long-run variance V = Σ_{j=−p₀}^{p₀} Γ(j), and we can estimate it using the following estimator:
$$\hat{V} = \sum_{j=-p_0}^{p_0}\hat{\Gamma}(j),$$
where Γ̂(j) is defined as in Section 6.1. Show that for each given j, Γ̂(j) →p Γ(j) as n → ∞.
Given that p₀ is a fixed integer, an important implication of Γ̂(j) →p Γ(j) for each given j as n → ∞ is that V̂ →p V as n → ∞.
6.3. Suppose {Y_t} is a stationary time series process for which the following spectral density function exists:
$$h(\omega) = \frac{1}{2\pi}\sum_{j=-\infty}^{\infty}\gamma(j)e^{-ij\omega}.$$
Show that
$$p^{-1}\,\mathrm{var}\Big(\sum_{j=1}^{p} Y_{t-j}\Big) \to 2\pi h(0) \quad \text{as } p \to \infty.$$
CHAPTER 7 INSTRUMENTAL
VARIABLES REGRESSION
Abstract: In this chapter we …rst discuss possibilities that the condition E("t jXt ) = 0
a.s. may fail, which will generally render inconsistent the OLS estimator for the true
model parameters. We then introduce a consistent two-stage least squares (2SLS) esti-
mator, investigating its statistical properties and providing intuitions for the nature of
the 2SLS estimator. Hypothesis tests are constructed. We consider various test proce-
dures corresponding to the cases for which the disturbance is an MDS with conditional
homoskedasticity, an MDS with conditional heteroskedasticity, and a non-MDS process,
respectively. The latter case will require consistent estimation of a long-run variance-
covariance matrix. It is important to emphasize that the t-test and F -test obtained
from the second stage regression estimation cannot be used even for large samples. Fi-
nally, we consider some empirical applications and conclude this chapter by presenting
a brief summary of the comprehensive econometric theory for linear regression models
developed in Chapters 2–7.
Motivation
In all previous chapters, we always assumed that E("t jXt ) = 0 holds even when there
exist conditional heteroskedasticity and autocorrelation.
Questions: When may the condition E("t jXt ) = 0 fail? And, what will happen to
the OLS estimator ^ if E("t jXt ) = 0 fails?
There are at least three possibilities where E("t jXt ) = 0 may fail. The …rst is model
misspeci…cation (e.g., functional form misspeci…cation or omitted variables). The second
is the existence of measurement errors in regressors (also called errors in variables). The
third is the estimation of a subset of a simultaneous equation system. We will consider
the last two possibilities in this chapter. For the …rst case (i.e., model misspeci…ca-
tion), it may not be meaningful to discuss consistent estimation of the parameters in a
misspeci…ed regression model.
Some Motivating Examples
1
usually called errors in variables in econometrics. Consider a data generating process (DGP)
$$Y_t^* = \beta_0^o + \beta_1^o X_t^* + u_t, \qquad (7.1)$$
where X_t^* is income, Y_t^* is consumption, and {u_t} is i.i.d. (0, σ_u²) and independent of {X_t^*}.
Suppose both X_t^* and Y_t^* are not observable. The observed variables X_t and Y_t contain measurement errors in the sense that
$$X_t = X_t^* + v_t, \qquad (7.2)$$
$$Y_t = Y_t^* + w_t, \qquad (7.3)$$
where {v_t} and {w_t} are measurement errors independent of {X_t^*} and {Y_t^*}, such that {v_t} ∼ i.i.d. (0, σ_v²) and {w_t} ∼ i.i.d. (0, σ_w²). We assume that the series {v_t}, {w_t} and {u_t} are all mutually independent of each other.
Because we only observe (X_t, Y_t), we are forced to estimate the following regression model
$$Y_t = \beta_0^o + \beta_1^o X_t + \varepsilon_t, \qquad (7.4)$$
where ε_t is some unobservable disturbance.
Clearly, the disturbance ε_t is different from the original (true) disturbance u_t. Although the linear regression model is correctly specified, we no longer have E(ε_t|X_t) = 0 due to the existence of the measurement errors. This is explained below.
Question: If we use the OLS estimator β̂ to estimate this model, is β̂ consistent for β°?
From the general regression analysis in Chapter 2, we know that the key for the consistency of the OLS estimator β̂ for β° is to check whether E(X_tε_t) = 0. From Eqs. (7.1)–(7.3), we have
$$Y_t = Y_t^* + w_t = (\beta_0^o + \beta_1^o X_t^* + u_t) + w_t = \beta_0^o + \beta_1^o X_t + (u_t + w_t - \beta_1^o v_t),$$
using X_t^* = X_t − v_t. The regression error ε_t = u_t + w_t − β_1^o v_t thus contains the true disturbance u_t and a linear combination of the measurement errors.
2
Now, the expectation
$$E(X_t\varepsilon_t) = E[(X_t^* + v_t)(u_t + w_t - \beta_1^o v_t)] = -\beta_1^o\sigma_v^2 \neq 0.$$
Question: What is the effect of the measurement errors {w_t} in the dependent variable Y_t?
where X_t is income, Y_t^* is consumption, and {u_t} is i.i.d. (0, σ_u²) and independent of {X_t}.
Suppose X_t is now observed without error, and Y_t^* is still not observable, such that
$$X_t = X_t^*, \qquad Y_t = Y_t^* + w_t,$$
where {w_t} is an i.i.d. (0, σ_w²) measurement error independent of {X_t} and {Y_t^*}. We assume that the two series {w_t} and {u_t} are mutually independent.
Because we only observe (X_t, Y_t), we are forced to estimate the following model:
$$Y_t = \beta_0^o + \beta_1^o X_t + \varepsilon_t.$$
Question: If we use the OLS estimator β̂ to estimate this model, is β̂ consistent for β°?
Answer: Yes! The measurement errors in Y_t do not cause any trouble for consistent estimation of β°.
The measurement error in Y_t can be regarded as part of the true regression disturbance. It increases the asymptotic variance of √n(β̂ − β°); that is, the existence of measurement errors in Y_t renders the estimation of β° less precise.
$$Y_t = \beta_0^o + \beta_1^o X_t + \varepsilon_t.$$
Since
$$E(X_t\varepsilon_t) = E[(X_t^* + v_t)\varepsilon_t] = -\beta_1^o\sigma_v^2 \neq 0$$
provided β_1^o ≠ 0, the OLS estimator is not consistent for β_1^o.
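A small simulation, sketched below in Python (NumPy), illustrates this attenuation bias. All parameter values and variable names are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta0, beta1 = 1.0, 2.0
x_star = rng.standard_normal(n)                  # true regressor X_t*
u = rng.standard_normal(n)
y = beta0 + beta1 * x_star + u                   # Y_t measured without error here
x = x_star + rng.standard_normal(n)              # observed X_t = X_t* + v_t, var(v_t)=1

X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
# Slope is close to beta1 * var(X*)/(var(X*)+var(v)) = 1.0, not the true 2.0
print(b[1])
```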
that E(u_t|X_t, A_t) = 0. Because one does not observe A_t, one is forced to consider the regression model
$$Y_t = X_t'\beta^o + \varepsilon_t$$
and is interested in knowing β°, the marginal effects of schooling and working experience. However, we have E(X_tε_t) ≠ 0 because A_t is usually correlated with X_t.
$$\ln(Y_t) = \beta_0^o + \beta_1^o\ln(L_t) + \beta_2^o\ln(K_t) + \beta_3^o B_t + \varepsilon_t,$$
where Y_t, L_t, K_t are the output, labor and capital stock, B_t is the proportion of bonus out of total pay, and t is a time index. Without loss of generality, we assume that
$$E(\varepsilon_t) = 0, \quad E[\ln(L_t)\varepsilon_t] = 0, \quad E[\ln(K_t)\varepsilon_t] = 0.$$
Economic theory suggests that the use of bonuses in addition to the basic wage will provide a stronger incentive for workers to work harder in a transitional economy. This theory can be tested by checking whether β_3^o = 0. However, the test procedure is complicated because there exists a possibility that when a firm is more productive, it will pay more bonus to workers regardless of the effort of its workers. In this case, the OLS estimator β̂_3 cannot consistently estimate β_3^o and cannot be used to test the null hypothesis.
Why?
To reflect the fact that a more productive firm pays more bonus to its workers, we can assume a structural equation for bonus:
$$B_t = \gamma_0 + \gamma_1\ln(Y_t) + w_t, \qquad (7.5)$$
where γ_1 > 0, and {w_t} is an i.i.d. (0, σ_w²) sequence that is independent of {Y_t}. For simplicity, we assume that {w_t} is independent of {ε_t}.
Put X_t = [1, ln(L_t), ln(K_t), B_t]'. Now, from Eq. (7.5) and then Eq. (7.4), we have
$$E(B_t\varepsilon_t) = E[(\gamma_0 + \gamma_1\ln(Y_t) + w_t)\varepsilon_t] = \gamma_1 E[\ln(Y_t)\varepsilon_t] = \gamma_1\beta_3^o E(B_t\varepsilon_t) + \gamma_1 E(\varepsilon_t^2).$$
It follows that
$$E(B_t\varepsilon_t) = \frac{\gamma_1\sigma^2}{1 - \gamma_1\beta_3^o} \neq 0.$$
$$C_t = \alpha_0^o + \alpha_1^o I_t + \varepsilon_t, \qquad (7.6)$$
$$I_t = C_t + D_t. \qquad (7.7)$$
Question: If the OLS estimator α̂ is applied to the first equation, is it consistent for α°?
To answer this question, we have from Eq. (7.7) that I_t depends on C_t and hence on ε_t. It follows that
$$E(I_t\varepsilon_t) = \frac{\sigma^2}{1 - \alpha_1^o} \neq 0.$$
In fact, this bias problem can also be seen from the so-called reduced form model.
Solving Eqs. (7.6) and (7.7) simultaneously, we can obtain the "reduced forms" that express the endogenous variables in terms of the exogenous variables and disturbances:
$$C_t = \frac{\alpha_0^o}{1-\alpha_1^o} + \frac{\alpha_1^o}{1-\alpha_1^o}D_t + \frac{1}{1-\alpha_1^o}\varepsilon_t,$$
$$I_t = \frac{\alpha_0^o}{1-\alpha_1^o} + \frac{1}{1-\alpha_1^o}D_t + \frac{1}{1-\alpha_1^o}\varepsilon_t.$$
Obviously, I_t is positively correlated with ε_t (i.e., E(I_tε_t) ≠ 0). Thus, the OLS estimator for the regression of C_t on I_t in Eq. (7.6) will not be consistent for α_1^o, the marginal propensity to consume. Generally speaking, the OLS estimator for the reduced form is consistent, but for new parameters which are functions of the original parameters.
where Wt ; Pt ; Dt are the wage, price, and excess demand in the labor market respectively.
Eq. (7.8) describes the mechanism of how wage is determined. In particular, wage
depends on price and excess demand for labor. Eq. (7.9) describes how price depends
on wage (or income).
Suppose Dt is an exogenous variable, with E("t jDt ) = 0: There are two endogenous
variables, Wt and Pt ; in the system of equations (7.8) and (7.9):
Question: Will W_t be correlated with v_t? And will P_t be correlated with ε_t?
To answer these questions, we first obtain the reduced form equations:
$$W_t = \frac{\alpha_1^o + \alpha_2^o\beta_1^o}{1-\alpha_2^o\beta_2^o} + \frac{\alpha_3^o}{1-\alpha_2^o\beta_2^o}D_t + \frac{\varepsilon_t + \alpha_2^o v_t}{1-\alpha_2^o\beta_2^o},$$
$$P_t = \frac{\beta_1^o + \beta_2^o\alpha_1^o}{1-\alpha_2^o\beta_2^o} + \frac{\beta_2^o\alpha_3^o}{1-\alpha_2^o\beta_2^o}D_t + \frac{\beta_2^o\varepsilon_t + v_t}{1-\alpha_2^o\beta_2^o}.$$
Conditional on the exogenous variable D_t, both W_t and P_t are correlated with ε_t and v_t. As a consequence, both the OLS estimator for Eq. (7.8) and the OLS estimator for Eq. (7.9) will be inconsistent.
In this chapter, we will consider a method called two-stage least squares estimation to obtain consistent estimators for the unknown parameters in all of the above examples except for the parameter α_2^o in Eq. (7.8) of Example 7. No method can deliver a consistent estimator for α_2^o in Eq. (7.8), because it is not identifiable. This is the so-called identification problem of simultaneous equations.
A Digression: Identi…cation Problem in Simultaneous Equation Models
7
To see why there is no way to obtain a consistent estimator for α_2^o in Eq. (7.8), note that from Eq. (7.9) we can write
$$W_t = -\frac{\beta_1^o}{\beta_2^o} + \frac{1}{\beta_2^o}P_t - \frac{v_t}{\beta_2^o}. \qquad (7.10)$$
Let a and b be two arbitrary constants. We multiply Eq. (7.8) by a, multiply Eq. (7.10) by b, and add them together:
$$(a+b)W_t = a\alpha_1^o - \frac{b\beta_1^o}{\beta_2^o} + \Big(a\alpha_2^o + \frac{b}{\beta_2^o}\Big)P_t + a\alpha_3^o D_t + \Big(a\varepsilon_t - \frac{b}{\beta_2^o}v_t\Big),$$
or
$$W_t = \frac{a\alpha_1^o}{a+b} - \frac{b\beta_1^o}{(a+b)\beta_2^o} + \frac{1}{a+b}\Big(a\alpha_2^o + \frac{b}{\beta_2^o}\Big)P_t + \frac{a\alpha_3^o}{a+b}D_t + \frac{1}{a+b}\Big(a\varepsilon_t - \frac{b}{\beta_2^o}v_t\Big). \qquad (7.11)$$
This new equation, (7.11), is a combination of the original wage equation (7.8) and the price equation (7.9). It has the same statistical form as Eq. (7.8). Since a and b are arbitrary, there is an infinite number of parameter values that satisfy Eq. (7.11), and they are all indistinguishable from those of Eq. (7.8). Consequently, if we use OLS to run a regression of W_t on P_t and D_t, or more generally use any other method to estimate equation (7.8) or (7.11), there is no way to know which model, Eq. (7.8) or Eq. (7.11), is being estimated. Therefore, there is no way to estimate α_2^o. This is the so-called identification problem with simultaneous equation models. To avoid such identification problems in simultaneous equations, certain conditions are required to make the system of simultaneous equations identifiable. For example, if an extra variable, say the money supply growth rate, is added to the price equation in (7.9), we obtain
$$P_t = \beta_1^o + \beta_2^o W_t + \beta_3^o M_t + v_t; \qquad (7.12)$$
then the system of equations (7.8) and (7.12) is identifiable provided β_3^o ≠ 0, and so the parameters in Eqs. (7.8) and (7.12) can be consistently estimated. [Question: Check why the system of equations (7.8) and (7.12) is identifiable.]
We note that for the system of equations (7.8) and (7.9), although Eq. (7.8) cannot be
consistently estimated by any method, Eq. (7.9) can still be consistently estimated using
the method proposed below. For an identi…able system of simultaneous equations with
simultaneous equation bias, we can use various methods to estimate them consistently,
including 2SLS, the generalized method of moments and the maximum likelihood or
quasi-maximum likelihood estimation methods. These methods will be introduced below
and in subsequent chapters.
8
7.1 Framework and Assumptions
We now provide a set of regularity conditions for our formal analysis in this chapter.
$$Y_t = X_t'\beta^o + \varepsilon_t, \qquad t = 1, \ldots, n,$$
for some unknown parameter β° and some unobservable disturbance ε_t;
Remarks:
Assumption 7.1 allows for i.i.d. and stationary time series observations.
Assumption 7.5 directly assumes that the CLT holds. This is often called a “high level
assumption.”It covers three cases: IID, MDS and non-MDS for fXt "t g; respectively. For
an IID or MDS sequence fZt "t g; we have V = var(Zt "t ) = E(Zt Zt0 "2t ): For a non-MDS
process fZt "t g; V = 1
j= 1 cov(Zt "t ; Zt j "t j ) is a long-run variance-covariance matrix.
9
The random vector Z_t that satisfies Assumption 7.4 is called a vector of instruments. The condition that l ≥ K in Assumption 7.1 implies that the number of instruments in Z_t is larger than or at least equal to the number of regressors in X_t.
First of all, one should analyze which explanatory variables in X_t are endogenous or exogenous. If an explanatory variable is exogenous, then this variable should be included in Z_t, the set of instruments. For example, the constant term should always be included, because a constant is uncorrelated with any random variable. All other exogenous variables in X_t should also be included in Z_t. If k₀ of the K regressors are endogenous, one should find at least k₀ additional instruments.
Most importantly, we should choose an instrument vector Z_t that is as closely related to X_t as possible. As we will see below, the strength of the correlation between Z_t and X_t affects the magnitude of the asymptotic variance of the 2SLS estimator for β° which we will propose, although it does not affect consistency provided the correlation between Z_t and X_t is not zero.
In time series regression models, it is often reasonable to assume that lagged values of X_t are not correlated with ε_t. Therefore, we can use lagged values of X_t, for example X_{t−1}, as instruments. Such instruments are expected to be highly correlated with X_t if {X_t} is a time series process. In light of this, we can choose the set of instruments Z_t = (1, ln L_t, ln K_t, B_{t−1})' in estimating Eq. (7.4) in Example 5, Z_t = (1, D_t, I_{t−1})' in estimating Eq. (7.6) in Example 6, and Z_t = (1, D_t, P_{t−1})' in estimating Eq. (7.8) in Example 7. For examples with measurement errors or expectational errors, where E(X_tε_t) ≠ 0 due to the presence of measurement or expectational errors, we can choose Z_t = X_{t−1} if the measurement or expectational errors in X_t are serially uncorrelated (check this!). The expectational errors in X_t are MDS, and so are serially uncorrelated, in Example 3 when the economic agent has rational expectations.
We now introduce a two-stage least squares (2SLS) procedure, which can consistently
estimate the true parameter o . The 2SLS procedure can be described as follows:
10
Stage 1: Regress X_t on Z_t via OLS and save the predicted value X̂_t. In matrix form,
$$\hat{\gamma} = (Z'Z)^{-1}Z'X = \Big(n^{-1}\sum_{t=1}^{n}Z_tZ_t'\Big)^{-1} n^{-1}\sum_{t=1}^{n}Z_tX_t',$$
$$\hat{X}_t = \hat{\gamma}'Z_t, \qquad \text{or} \qquad \hat{X} = Z\hat{\gamma} = Z(Z'Z)^{-1}Z'X.$$
Stage 2: Use the predicted value X̂_t as the regressor for Y_t. Regress Y_t on X̂_t; the resulting OLS estimator is called the 2SLS estimator, denoted β̂_2sls.
Question: Why use the fitted value X̂_t = γ̂'Z_t as the regressor?
We first consider the population linear projection
$$X_t = \gamma'Z_t + v_t,$$
where γ is the best linear LS approximation coefficient, so that v_t is orthogonal to Z_t in the sense that E(Z_tv_t') = 0. Because E(Z_tε_t) = 0, the population projection γ'Z_t is orthogonal to ε_t. In general, v_t = X_t − γ'Z_t, which is orthogonal to Z_t, is correlated with ε_t. In other words, the auxiliary regression in Stage 1 decomposes X_t into two components: γ'Z_t and v_t, where γ'Z_t is orthogonal to ε_t, and v_t is correlated with ε_t.
Since the best linear LS approximation coefficient γ is unknown, we have to replace it with γ̂. The fitted value X̂_t = γ̂'Z_t is the (sample) projection of X_t onto Z_t. The regression of X_t on Z_t purges the component of X_t that is correlated with ε_t, so that the projection X̂_t is approximately orthogonal to ε_t given that Z_t is orthogonal to ε_t. (The word "approximately" is used here because γ̂ is an estimator of γ and thus contains some estimation error.)
or in matrix form
$$Y = \hat{X}\beta^o + \tilde{\varepsilon}.$$
Note that the disturbance ε̃_t is not ε_t, because X̂_t is not X_t.
Using X̂ = Zγ̂ = Z(Z'Z)⁻¹Z'X, we can write the second stage OLS estimator, namely the 2SLS estimator, as follows:
$$\hat{\beta}_{2sls} = (\hat{X}'\hat{X})^{-1}\hat{X}'Y = [(Z\hat{\gamma})'(Z\hat{\gamma})]^{-1}(Z\hat{\gamma})'Y = \big\{[Z(Z'Z)^{-1}Z'X]'[Z(Z'Z)^{-1}Z'X]\big\}^{-1}[Z(Z'Z)^{-1}Z'X]'Y$$
$$= [X'Z(Z'Z)^{-1}Z'Z(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'Y = [X'Z(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'Y$$
$$= \left[\frac{X'Z}{n}\Big(\frac{Z'Z}{n}\Big)^{-1}\frac{Z'X}{n}\right]^{-1}\frac{X'Z}{n}\Big(\frac{Z'Z}{n}\Big)^{-1}\frac{Z'Y}{n}.$$
Using the expression Y = Xβ° + ε from Assumption 7.2, we have
$$\hat{\beta}_{2sls} - \beta^o = \left[\frac{X'Z}{n}\Big(\frac{Z'Z}{n}\Big)^{-1}\frac{Z'X}{n}\right]^{-1}\frac{X'Z}{n}\Big(\frac{Z'Z}{n}\Big)^{-1}\frac{Z'\varepsilon}{n} = \left[\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx}\right]^{-1}\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\frac{Z'\varepsilon}{n},$$
where
$$\hat{Q}_{zz} = \frac{Z'Z}{n} = n^{-1}\sum_{t=1}^{n}Z_tZ_t', \qquad \hat{Q}_{xz} = \frac{X'Z}{n} = n^{-1}\sum_{t=1}^{n}X_tZ_t', \qquad \hat{Q}_{zx} = \frac{Z'X}{n} = n^{-1}\sum_{t=1}^{n}Z_tX_t' = \hat{Q}_{xz}'.$$
Question: What are the statistical properties of β̂_2sls?
Consequently, we have
$$\hat{\beta}_{2sls} - \beta^o \xrightarrow{p} \left[Q_{xz}Q_{zz}^{-1}Q_{zx}\right]^{-1}Q_{xz}Q_{zz}^{-1}\cdot 0 = 0.$$
Next, write
$$Y_t = X_t'\beta^o + \varepsilon_t, \qquad X_t = \tilde{X}_t + v_t,$$
so that
$$Y_t = \tilde{X}_t'\beta^o + u_t,$$
where u_t = v_t'β° + ε_t is the disturbance when regressing Y_t on X̃_t. Because
$$E(\tilde{X}_t u_t) = \gamma'E(Z_t u_t) = \gamma'E(Z_t v_t')\beta^o + \gamma'E(Z_t\varepsilon_t) = 0,$$
this infeasible regression is well behaved. Next, write
$$\sqrt{n}(\hat{\beta}_{2sls} - \beta^o) = \left[\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx}\right]^{-1}\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\frac{Z'\varepsilon}{\sqrt{n}} = \hat{A}\,\frac{Z'\varepsilon}{\sqrt{n}},$$
where the K × l matrix
$$\hat{A} = \left[\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx}\right]^{-1}\hat{Q}_{xz}\hat{Q}_{zz}^{-1}.$$
Suppose
$$\frac{Z'\varepsilon}{\sqrt{n}} = n^{-1/2}\sum_{t=1}^{n}Z_t\varepsilon_t \xrightarrow{d} N(0, V) \equiv G,$$
where V is a finite and nonsingular l × l matrix, and we denote the random vector G ∼ N(0, V). Then by the Slutsky theorem, we have
$$\sqrt{n}(\hat{\beta}_{2sls} - \beta^o) \xrightarrow{d} \left[Q_{xz}Q_{zz}^{-1}Q_{zx}\right]^{-1}Q_{xz}Q_{zz}^{-1}\,N(0, V) \sim N(0, AVA') \equiv N(0, \Omega),$$
where A = (Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1}Q_{xz}Q_{zz}^{-1}. The asymptotic variance of √n(β̂_2sls − β°) is
$$\mathrm{avar}(\sqrt{n}\,\hat{\beta}_{2sls}) = \Omega = AVA' = \left[Q_{xz}Q_{zz}^{-1}Q_{zx}\right]^{-1}Q_{xz}Q_{zz}^{-1}\,V\,Q_{zz}^{-1}Q_{zx}\left[Q_{xz}Q_{zz}^{-1}Q_{zx}\right]^{-1}.$$
Theorem 7.2 [Asymptotic Normality of 2SLS]: Under Assumptions 7.1–7.5, as n → ∞,
$$\sqrt{n}(\hat{\beta}_{2sls} - \beta^o) \xrightarrow{d} N(0, \Omega).$$
The estimation of V depends on whether {Z_tε_t} is an MDS. We first consider the case where {Z_tε_t} is an MDS process. In this case, V = E(Z_tZ_t'ε_t²), and so we need not estimate a long-run variance-covariance matrix.
Assumption 7.6 [MDS]: (i) {Z_tε_t} is an MDS; (ii) var(Z_tε_t) = E(Z_tZ_t'ε_t²) is finite and nonsingular.
When {Z_tε_t} is an MDS with conditional homoskedasticity, the asymptotic variance can be greatly simplified. It follows that
$$\Omega = \sigma^2\left(Q_{xz}Q_{zz}^{-1}Q_{zx}\right)^{-1}.$$
Corollary 7.4 [Asymptotic Normality of 2SLS under MDS with Conditional Homoskedasticity]: Under Assumptions 7.1–7.4, 7.6 and 7.7, we have as n → ∞,
$$\sqrt{n}(\hat{\beta}_{2sls} - \beta^o) \xrightarrow{d} N(0, \Omega), \qquad \text{where } \Omega = \sigma^2\left[Q_{xz}Q_{zz}^{-1}Q_{zx}\right]^{-1}.$$
Case II: fZt "t g is a Stationary Ergodic non-MDS
where (j) =cov(Zt "t ; Zt j "t j ): We need to use a long-run variance-covariance matrix
estimator for V: When fZt "t g is not an MDS, there is no need (and in fact there is no way)
to consider conditional homoskedasticity and conditional heteroskedasticity separately.
~"t = Yt ^0
X o
t
= "t + (Xt ^ t )0
X o
o
= "t + v^t0 ;
X = Z + v;
16
the first-stage OLS first order condition holds:
$$Z'(X - \hat{X}) = Z'\hat{v} = 0.$$
Therefore,
$$\hat{\beta}_{2sls} = (\hat{X}'\hat{X})^{-1}\hat{X}'Y = (\hat{X}'\hat{X})^{-1}\hat{X}'(\hat{X}\beta^o + \tilde{\varepsilon}) = \beta^o + (\hat{X}'\hat{X})^{-1}\hat{X}'[\varepsilon + \hat{v}\beta^o] = \beta^o + (\hat{X}'\hat{X})^{-1}\hat{X}'\varepsilon,$$
that is,
$$\hat{\beta}_{2sls} - \beta^o = (\hat{X}'\hat{X})^{-1}\hat{X}'\varepsilon = \Big(\frac{\hat{X}'\hat{X}}{n}\Big)^{-1}\frac{\hat{X}'\varepsilon}{n}.$$
In other words, the estimated residual v̂ = X − X̂ from the first stage regression has no impact on the statistical properties of β̂_2sls, although it is a component of ε̃_t. Thus, when analyzing the asymptotic properties of β̂_2sls, we can proceed as if we were estimating Y = X̂β° + ε by OLS.
Next, recall that we have
$$\hat{X} = Z\hat{\gamma}, \qquad \hat{\gamma} = (Z'Z)^{-1}Z'X \xrightarrow{p} Q_{zz}^{-1}Q_{zx} = \gamma.$$
By the WLLN, the sample projection X̂_t "converges" to the population projection X̃_t ≡ γ'Z_t as n → ∞. That is, X̂_t will become arbitrarily close to X̃_t as n → ∞. In fact, the estimation error of γ̂ in the first stage has no impact on the asymptotic properties of β̂_2sls.
Consider therefore the infeasible regression
$$Y_t = \tilde{X}_t'\beta^o + \varepsilon_t, \qquad (7.13)$$
with the infeasible OLS estimator
$$\tilde{\beta} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'Y.$$
As we will show below, the asymptotic properties of β̂_2sls are the same as those of the infeasible OLS estimator β̃. This helps a lot in understanding the variance-covariance structure of β̂_2sls. It is important to emphasize that the equation in (7.13) is not derived from other equations; it is just a convenient way to understand the nature of β̂_2sls.
We now show that the asymptotic properties of β̂_2sls are the same as the asymptotic properties of β̃. For the asymptotic normality, observe that
$$\sqrt{n}(\tilde{\beta} - \beta^o) = \hat{Q}_{\tilde{x}\tilde{x}}^{-1}\frac{\tilde{X}'\varepsilon}{\sqrt{n}} \xrightarrow{d} Q_{\tilde{x}\tilde{x}}^{-1}\,N(0, \tilde{V}) \sim N(0, Q_{\tilde{x}\tilde{x}}^{-1}\tilde{V}Q_{\tilde{x}\tilde{x}}^{-1}),$$
where
$$Q_{\tilde{x}\tilde{x}} \equiv E(\tilde{X}_t\tilde{X}_t'), \qquad \tilde{V} \equiv \mathrm{avar}\Big(n^{-1/2}\sum_{t=1}^{n}\tilde{X}_t\varepsilon_t\Big).$$
We first consider the case where {Z_tε_t} is an MDS with conditional homoskedasticity. Suppose {X̃_tε_t} is an MDS and E(ε_t²|X̃_t) = σ² a.s. Then we have
$$\tilde{V} = E(\tilde{X}_t\tilde{X}_t'\varepsilon_t^2) = \sigma^2 Q_{\tilde{x}\tilde{x}}.$$
Because X̃_t = γ'Z_t with γ = Q_{zz}^{-1}Q_{zx}, we have
$$Q_{\tilde{x}\tilde{x}} = E(\tilde{X}_t\tilde{X}_t') = \gamma'E(Z_tZ_t')\gamma = \gamma'Q_{zz}\gamma = Q_{xz}Q_{zz}^{-1}Q_{zz}Q_{zz}^{-1}Q_{zx} = Q_{xz}Q_{zz}^{-1}Q_{zx}.$$
Therefore,
$$\sigma^2 Q_{\tilde{x}\tilde{x}}^{-1} = \sigma^2\left[Q_{xz}Q_{zz}^{-1}Q_{zx}\right]^{-1} = \mathrm{avar}(\sqrt{n}\,\hat{\beta}_{2sls}).$$
This implies that the asymptotic distribution of β̃ is indeed the same as the asymptotic distribution of β̂_2sls under MDS with conditional homoskedasticity.
p
indicates that the asymptotic variance of n ^ 2sls will be large if the correlation between
Zt and Xt ; as measured by ; is weak. Thus, more precise estimation of o will be
obtained if one chooses the instrument vector Zt such that Zt is highly correlated with
Xt :
Question: How to estimate Ω under MDS disturbances with conditional homoskedasticity?
We can use
$$\hat{\Omega} = \hat{s}^2\hat{Q}_{\hat{x}\hat{x}}^{-1} = \hat{s}^2\left[\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx}\right]^{-1}, \qquad \hat{Q}_{\hat{x}\hat{x}} = n^{-1}\sum_{t=1}^{n}\hat{X}_t\hat{X}_t',$$
where ŝ² is based on ê_t = Y_t − X_t'β̂_2sls. It should be emphasized that ê is not the estimated residual from the second stage regression (i.e., not from the regression of Y on X̂). This implies that even under conditional homoskedasticity, the conventional t-statistic in the second stage regression does not converge to N(0,1) in distribution, and J·F̂ does not converge to χ²_J, where F̂ is the F-statistic in the second stage regression.
To show Ω̂ →p Ω, we shall show (i) Q̂_{x̂x̂}⁻¹ →p Q_{x̃x̃}⁻¹ and (ii) ŝ² →p σ².
We first show (i). There are two methods for proving this.
Method 1: We shall show Q̂_{x̂x̂} →p Q_{x̃x̃}. Because X̂_t = γ̂'Z_t and γ̂ →p γ, we have
$$\hat{Q}_{\hat{x}\hat{x}} = n^{-1}\sum_{t=1}^{n}\hat{X}_t\hat{X}_t' = \hat{\gamma}'\Big(n^{-1}\sum_{t=1}^{n}Z_tZ_t'\Big)\hat{\gamma} = \hat{\gamma}'\hat{Q}_{zz}\hat{\gamma} \xrightarrow{p} \gamma'Q_{zz}\gamma = E[(\gamma'Z_t)(Z_t'\gamma)] = E(\tilde{X}_t\tilde{X}_t') = Q_{\tilde{x}\tilde{x}}.$$
Method 2: We shall show (Q̂_{xz}Q̂_{zz}⁻¹Q̂_{zx})⁻¹ →p (Q_{xz}Q_{zz}⁻¹Q_{zx})⁻¹, which follows immediately from Q̂_{xz} →p Q_{xz} and Q̂_{zz} →p Q_{zz} by the WLLN. This method is more straightforward but less intuitive than the first method.
p
Next, we shall show (ii) s^2 ! 2
. We decompose
e^0 e^
s^2 =
n K
1 X
n
= (Yt Xt0 ^ 2sls )2
n K t=1
1 X
n
= ["t Xt0 ( ^ 2sls o
)]2
n K t=1
1 X
n
= "2t
n K t=1
1 X
n
+( ^ 2sls
o 0
) Xt Xt0 ( ^ 2sls o
)
n K t=1
1 Xn
2( ^ 2sls o 0
) Xt "t
n K t=1
p 2
! + 0 Qxx 0 2 0 E(Xt "t )
2
= :
Note that although E(Xt "t ) 6= 0; the last term still vanishes to zero in probability,
o p
because ^ 2sls ! 0:
20
Question: What happens if we use s² = e'e/(n − K), where e = Y − X̂β̂_2sls is the estimated residual from the second stage regression? Do we still have s² →p σ²?
Now consider the case where {Z_tε_t} is an MDS with conditional heteroskedasticity. For the infeasible regression Y = X̃β° + ε, we have
$$\tilde{V} = E(\tilde{X}_t\tilde{X}_t'\varepsilon_t^2),$$
and
$$Q_{\tilde{x}\tilde{x}}^{-1}\tilde{V}Q_{\tilde{x}\tilde{x}}^{-1} = [\gamma'E(Z_tZ_t')\gamma]^{-1}\gamma'E(Z_tZ_t'\varepsilon_t^2)\gamma[\gamma'E(Z_tZ_t')\gamma]^{-1} = \left[Q_{xz}Q_{zz}^{-1}Q_{zx}\right]^{-1}Q_{xz}Q_{zz}^{-1}VQ_{zz}^{-1}Q_{zx}\left[Q_{xz}Q_{zz}^{-1}Q_{zx}\right]^{-1} = \mathrm{avar}(\sqrt{n}\,\hat{\beta}_{2sls}).$$
This implies that the asymptotic distribution of the infeasible OLS estimator β̃ is the same as the asymptotic distribution of β̂_2sls under MDS with conditional heteroskedasticity. Therefore, the estimator for Ω is
$$\hat{\Omega} = \hat{Q}_{\hat{x}\hat{x}}^{-1}\hat{V}_{\hat{x}\hat{x}}\hat{Q}_{\hat{x}\hat{x}}^{-1},$$
where
$$\hat{V}_{\hat{x}\hat{x}} = n^{-1}\sum_{t=1}^{n}\hat{X}_t\hat{X}_t'\hat{e}_t^2 = \hat{\gamma}'\Big(n^{-1}\sum_{t=1}^{n}Z_tZ_t'\hat{e}_t^2\Big)\hat{\gamma} = \hat{\gamma}'\hat{V}\hat{\gamma}.$$
Because γ̂ →p γ, and following the consistency proof for n⁻¹Σ_{t=1}^{n} X_tX_t'e_t² in Chapter 4, we can show (please verify!) that
$$\hat{V} = n^{-1}\sum_{t=1}^{n}Z_tZ_t'\hat{e}_t^2 \xrightarrow{p} E(Z_tZ_t'\varepsilon_t^2) = V$$
under the following additional moment condition:
Assumption 7.8: (i) E(Z_{jt}⁴) < ∞ for all 0 ≤ j ≤ l; and (ii) E(ε_t⁴) < ∞.
It follows that
$$\hat{V}_{\hat{x}\hat{x}} \xrightarrow{p} \gamma'E(Z_tZ_t'\varepsilon_t^2)\gamma = E(\tilde{X}_t\tilde{X}_t'\varepsilon_t^2) = \tilde{V}.$$
This and Q̂_{x̂x̂} →p Q_{x̃x̃} imply Ω̂ →p Ω.
To show Ω̂ →p Ω in this case, it suffices to show Q̂_{xz} →p Q_{xz}, Q̂_{zz} →p Q_{zz} and V̂ →p V. The first two results immediately follow by the WLLN. The last result follows by reasoning similar to the consistency proof for n⁻¹Σ_{t=1}^{n} X_tX_t'e_t² in Chapter 4 or 5.
We now summarize the results derived above. Under conditional heteroskedasticity of MDS form,
$$\Omega = Q_{\tilde{x}\tilde{x}}^{-1}\tilde{V}Q_{\tilde{x}\tilde{x}}^{-1} = (Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1}Q_{xz}Q_{zz}^{-1}VQ_{zz}^{-1}Q_{zx}(Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1},$$
where Ṽ = E(X̃_tX̃_t'ε_t²) and V = E(Z_tZ_t'ε_t²).
In the general case where {Z_tε_t} is a non-MDS process, the same expression holds with
$$\tilde{V} = \sum_{j=-\infty}^{\infty}\tilde{\Gamma}(j), \quad \tilde{\Gamma}(j) = \mathrm{cov}(\tilde{X}_t\varepsilon_t, \tilde{X}_{t-j}\varepsilon_{t-j}), \qquad V = \sum_{j=-\infty}^{\infty}\Gamma(j), \quad \Gamma(j) = \mathrm{cov}(Z_t\varepsilon_t, Z_{t-j}\varepsilon_{t-j}),$$
and again Ω = avar(√n β̂_2sls). Thus, the asymptotic variance of √n β̂_2sls is the same as the asymptotic variance of √n β̃ in this general case. A consistent estimator is
$$\hat{\Omega} = \hat{Q}_{\hat{x}\hat{x}}^{-1}\hat{V}_{\hat{x}\hat{x}}\hat{Q}_{\hat{x}\hat{x}}^{-1} = (\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx})^{-1}\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{V}\hat{Q}_{zz}^{-1}\hat{Q}_{zx}(\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx})^{-1}$$
$$\xrightarrow{p} \Omega = Q_{\tilde{x}\tilde{x}}^{-1}\tilde{V}Q_{\tilde{x}\tilde{x}}^{-1} = (Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1}Q_{xz}Q_{zz}^{-1}VQ_{zz}^{-1}Q_{zx}(Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1},$$
where V̂ is now a long-run variance estimator for {Z_tε_t}.
With a consistent estimator of Ω, we can develop various confidence interval estimators and various tests for the null hypothesis
$$H_0: R\beta^o = r.$$
We consider the latter now.
Proof: The result follows immediately from the asymptotic normality theorem for √n(β̂_2sls − β°), H0 (which implies √n(Rβ̂_2sls − r) = R√n(β̂_2sls − β°)), the consistent asymptotic variance estimation theorem, and the Slutsky theorem.
Remarks:
Question: Is Ŵ/J the F-statistic from the second stage regression?
Answer: No, because ê is not the estimated residual from the second stage regression.
Question: Do we still have
$$\hat{F} = \frac{(e_r'e_r - e_u'e_u)/J}{e_u'e_u/(n-K)},$$
where e_r and e_u are the estimated residuals from the restricted and unrestricted regression models in the second stage regression, respectively?
Case II: fZt "t g is a Stationary Ergodic MDS with Conditional Heteroskedas-
ticity
Theorem 7.9 [Hypothesis Testing]: Under Assumptions 7.1-7.4, 7.6 and 7.8, the
Wald test statistic
d
^
W n(R ^ 2sls x^x
^
^ 1 R0 ] 1 (R ^ 2sls
^ 1 V^x^x^ Q
r)0 [RQ x^x
^ r) ! 2
J
When fZt "t g is non-MDS, we can still construct a Wald test which is robust to
conditional heteroskedasticity and autocorrelation, as is stated below.
Theorem 7.10 [Hypothesis Testing]: Under Assumptions 7.1-7.5 and 7.9, the Wald
test statistic
26
For simplicity, we impose the following conditions.
Assumption 7.10: (i) {(X_t', Z_t')'ε_t} is an MDS process; and (ii) E(ε_t²|X_t, Z_t) = σ² a.s.
Question: How to test the conditional homoskedasticity assumption that E(ε_t²|X_t, Z_t) = σ²?
Answer: Put ê_t = Y_t − X̂_t'β̂_2sls. (Question: Can we use e_t = Y_t − X_t'β̂_2sls?) Then run an auxiliary regression of ê_t² on vech(U_tU_t'), where U_t = (X_t', Z_t')' is a (K + l) × 1 vector. Then, under the condition that E(ε_t⁴|X_t, Z_t) is a constant, we have nR² →d χ²_J under the null hypothesis of conditional homoskedasticity, where J = (K + l)(K + l + 1)/2 − 1.
The basic idea of Hausman’s test is under H0 : E("t jXt ) = 0; both the OLS estimator
^ = (X 0 X) 1 X 0 Y and the 2SLS estimator ^ o
2sls are consistent for : They converge to
the same limit o ^
but it can be shown that is an asymptotically e¢ cient estimator
while ^ 2sls is not. Under the alternatives to H0 ; ^ 2sls remains to be consistent for o
but ^ is not. Hausman (1978) considers a test for H0 based on the di¤erence between
the two estimators:
^ ^;
2sls
which converges to zero under H0 but generally to a nonzero constant under the alter-
natives to H0 ; giving the test its power against H0 when the sample size n is su¢ ciently
large.
To construct Hausman's (1978) test statistic, we need to derive the asymptotic distribution of β̂_2sls − β̂. For this purpose, we first state a lemma.
Lemma 7.11: Suppose Â →p A and B̂ = O_P(1). Then (Â − A)B̂ →p 0.
Under H0, we have
$$\sqrt{n}(\hat{\beta} - \beta^o) = \hat{Q}_{xx}^{-1}\,n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t,$$
where Q̂_{xx}⁻¹ →p Q_{xx}⁻¹ and
$$n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t \xrightarrow{d} N(0, \sigma^2 Q_{xx})$$
as n → ∞ (see Chapter 5). It follows that n^{-1/2}Σ_{t=1}^{n} X_tε_t = O_P(1), and by Lemma 7.11, we have
$$\sqrt{n}(\hat{\beta} - \beta^o) = Q_{xx}^{-1}\,n^{-1/2}\sum_{t=1}^{n}X_t\varepsilon_t + o_P(1).$$
Similarly, we can obtain
$$\sqrt{n}(\hat{\beta}_{2sls} - \beta^o) = \hat{A}\,n^{-1/2}\sum_{t=1}^{n}Z_t\varepsilon_t = A\,n^{-1/2}\sum_{t=1}^{n}Z_t\varepsilon_t + o_P(1),$$
where Â = (Q̂_{xz}Q̂_{zz}⁻¹Q̂_{zx})⁻¹Q̂_{xz}Q̂_{zz}⁻¹ →p A = (Q_{xz}Q_{zz}⁻¹Q_{zx})⁻¹Q_{xz}Q_{zz}⁻¹ and n^{-1/2}Σ_{t=1}^{n} Z_tε_t →d N(0, σ²Q_{zz}) (see Corollary 7.4). It follows that
$$\sqrt{n}(\hat{\beta}_{2sls} - \hat{\beta}) = n^{-1/2}\sum_{t=1}^{n}\left[(Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1}Q_{xz}Q_{zz}^{-1}Z_t - Q_{xx}^{-1}X_t\right]\varepsilon_t + o_P(1) \xrightarrow{d} N\!\left(0,\ \sigma^2\left[(Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1} - Q_{xx}^{-1}\right]\right)$$
by the CLT for stationary ergodic MDS processes and Assumption 7.10. Therefore, under the null hypothesis H0, the quadratic form
$$H = \frac{n(\hat{\beta}_{2sls} - \hat{\beta})'\left[(\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx})^{-1} - \hat{Q}_{xx}^{-1}\right]^{-1}(\hat{\beta}_{2sls} - \hat{\beta})}{s^2} \xrightarrow{d} \chi^2_K.$$
Question: Can we replace the residual variance estimator s² by ŝ² = ê'ê/n, where ê = Y − Xβ̂_2sls?
Remarks:
We note that in the above theorem,
$$\mathrm{avar}[\sqrt{n}(\hat{\beta}_{2sls} - \hat{\beta})] = \sigma^2(Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1} - \sigma^2 Q_{xx}^{-1} = \mathrm{avar}(\sqrt{n}\,\hat{\beta}_{2sls}) - \mathrm{avar}(\sqrt{n}\,\hat{\beta}).$$
This simple asymptotic variance-covariance structure is made possible by Assumption 7.10. Suppose there exists conditional heteroskedasticity (i.e., E(ε_t²|X_t, Z_t) ≠ σ²). Then we no longer have the above simple variance-covariance structure for avar[√n(β̂_2sls − β̂)].
The variance-covariance matrix (Q_{xz}Q_{zz}^{-1}Q_{zx})^{-1} − Q_{xx}^{-1} may be singular, with rank J < K. In this case, we have to modify Hausman's test statistic by using the generalized inverse of the variance estimator:
$$H = \frac{n(\hat{\beta}_{2sls} - \hat{\beta})'\left[(\hat{Q}_{xz}\hat{Q}_{zz}^{-1}\hat{Q}_{zx})^{-1} - \hat{Q}_{xx}^{-1}\right]^{-}(\hat{\beta}_{2sls} - \hat{\beta})}{s^2}.$$
Note that now H →d χ²_J under H0, where J < K.
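A minimal Python (NumPy) sketch of this Hausman statistic under Assumption 7.10, using a generalized inverse as just described, is given below; the function name and data layout are illustrative assumptions.

```python
import numpy as np

def hausman_test(Y, X, Z):
    """Hausman statistic comparing OLS and 2SLS under conditional homoskedasticity."""
    n, K = X.shape
    b_ols = np.linalg.solve(X.T @ X, X.T @ Y)
    gamma = np.linalg.solve(Z.T @ Z, Z.T @ X)
    X_hat = Z @ gamma
    b_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ Y)
    e = Y - X @ b_ols
    s2 = e @ e / (n - K)                                # residual variance under H0
    D = np.linalg.inv(X_hat.T @ X_hat / n) - np.linalg.inv(X.T @ X / n)
    d = b_2sls - b_ols
    H = n * d @ np.linalg.pinv(s2 * D) @ d              # generalized inverse handles rank < K
    return H                                            # approx chi^2 with rank(D) dof under H0
```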
Question: How to modify the Hausman’s test statistic so that it remains asymptot-
ically 2K when there exists conditional heteroskedasticity (i.e., E("2t jXt ; Zt ) 6= 2 ) but
f(Xt0 ; Zt0 )0 "t g is still an MDS process?
Hausman’s test is used to check whether E("t jXt ) = 0: Suppose this condition fails,
one has to choose an instrumental vector Zt that satis…es Assumption 7.4. When we
choose a set of variables Zt ; how can we check the validity of Zt as instruments? In
particular, how to check whether E("t jZt ) = 0? For this purpose, we will consider a
so-called overidenti…cation test, which will be introduced in Chapter 8.
29
Groves, Hong, McMillan and Naughton (1994, Quarterly Journal of Economics) consider the system
$$C_t = \alpha + \beta Y_t + \varepsilon_t, \qquad Y_t = Z_t'\gamma + v_t.$$
7.9 Conclusion
In this chapter, we discuss the possibilities that the condition of E("t jXt ) = 0 may
fail in practice, which will render inconsistent the OLS estimator for the true model
parameters. With the use of instrumental variables, we introduce a consistent two-stage
least squares (2SLS) estimator. We investigate the statistical properties of the 2SLS
estimator and provide some interpretations that can enhance deeper understanding of
the nature of the 2SLS estimator. We discuss how to construct consistent estimators for
the asymptotic variance of the 2SLS estimator under various scenarios, including MDS
with conditional homoskedasticity, MDS with conditional heteroskedasticity, and non-
MDS possibly with conditional heteroskedasticity. For the latter, consistent estimation
for the long-run variance covariance matrix is needed. With these consistent asymptotic
variance estimators, various hypothesis test procedures are proposed. It is important to
emphasize that the conventional t-test and F -test cannot be used even for large samples.
Finally, some empirical applications that employ 2SLS are considered.
"t = g(Wt ) + ut ;
when E(ut jXt ; Wt ) = 0 and Wt is an omitted variable which is correlated with Xt . This
delivers a partially linear regression model
o
Yt = Xt0 + g(Wt ) + ut :
30
o
Because E(Yt jWt ) = E(Xt jWt )0 + g(Wt ); we obtain
o
Yt E(Yt jWt ) = [Xt E(Xt jWt )]0 + ut
or
o
Yt = Xt 0 + ut ;
where Yt = Yt E(Yt jWt ) and Xt = Xt E(Xt jWt ): Because E(Xt ut ) = 0; the OLS
estimator ~ of regressing Yt on Xt would be consistent for o : However, (Yt ; Xt ) are
not observable, so β̃ is infeasible. Nevertheless, one can first estimate E(Y_t|W_t) and E(X_t|W_t) nonparametrically, and then obtain a feasible OLS estimator which will be consistent for the true model parameter (e.g., Robinson 1988). Specifically, let m̂_Y(W_t) and m̂_X(W_t) be consistent nonparametric estimators for E(Y_t|W_t) and E(X_t|W_t), respectively. Then we can obtain a feasible OLS estimator
$$\tilde{\beta}_a = \left[\sum_{t=1}^{n}\hat{X}_t^*\hat{X}_t^{*\prime}\right]^{-1}\sum_{t=1}^{n}\hat{X}_t^*\hat{Y}_t^*,$$
where X̂_t^* = X_t − m̂_X(W_t) and Ŷ_t^* = Y_t − m̂_Y(W_t). It can be shown that β̃_a →p β° and
$$\sqrt{n}(\tilde{\beta}_a - \beta^o) \xrightarrow{d} N(0, Q^{*-1}V^*Q^{*-1}),$$
where Q^* = E(X_t^*X_t^{*\prime}) and V^* = var(n^{-1/2}Σ_{t=1}^{n} X_t^*u_t). The first stage nonparametric estimation has no impact on the asymptotic properties of the feasible OLS estimator β̃_a.
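A minimal Python (NumPy) sketch of this feasible estimator is given below, using a simple Nadaraya–Watson kernel estimator for the conditional means. The kernel choice, the fixed bandwidth, and all names are illustrative assumptions; W is taken to be scalar for simplicity.

```python
import numpy as np

def nw_conditional_mean(target, W, h):
    """Nadaraya-Watson estimate of E(target_t | W_t) at each sample point (scalar W)."""
    W = np.asarray(W, dtype=float)
    weights = np.exp(-0.5 * ((W[:, None] - W[None, :]) / h) ** 2)   # Gaussian kernel
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ target

def robinson_estimator(Y, X, W, h=0.5):
    """Feasible OLS for the partially linear model Y = X'beta + g(W) + u."""
    Y_star = Y - nw_conditional_mean(Y, W, h)          # Y_t - m_Y(W_t)
    X_star = X - nw_conditional_mean(X, W, h)          # X_t - m_X(W_t)
    return np.linalg.solve(X_star.T @ X_star, X_star.T @ Y_star)
```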
Another method to consistently estimate the true model parameters is to make use of panel data. A balanced panel dataset is a collection of observations on n cross-sectional units, each observed over the same T time periods. In contrast, an unbalanced panel dataset is a collection of observations on n cross-sectional units where each unit may have a different length of time series observations, with some common overlapping time periods.
With a balanced panel dataset, we have
$$Y_{it} = X_{it}'\beta^o + \varepsilon_{it} = X_{it}'\beta^o + \alpha_i + u_{it},$$
where α_i is an unobservable individual-specific effect, in a model with strictly exogenous variables X_it. Because ε_it is correlated with X_it, the OLS estimator of regressing Y_it on X_it is not consistent for β°. However, one can consider the demeaned model
$$Y_{it} - \bar{Y}_{i\cdot} = (X_{it} - \bar{X}_{i\cdot})'\beta^o + (\varepsilon_{it} - \bar{\varepsilon}_{i\cdot}),$$
where Ȳ_{i·} = T^{-1}Σ_{t=1}^{T} Y_{it}, and similarly for X̄_{i·} and ε̄_{i·}. The demeaning procedure removes the unobservable individual-specific effect and, as a result, the OLS estimator for the demeaned model, which is called the within estimator in the panel data literature, will be consistent for the true model parameter β°. (It should be noted that for a dynamic panel data model where X_it is not strictly exogenous, the within estimator is not consistent for β° when the number of time periods T is fixed. Different estimation methods have to be used.) See Hsiao (2002) for detailed discussion of panel data econometric models.
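The demeaning step is straightforward to implement; a minimal Python (NumPy) sketch of the within estimator for a balanced panel is given below. The assumed data layout (Y of shape (n, T), X of shape (n, T, K)) and the function name are illustrative assumptions.

```python
import numpy as np

def within_estimator(Y, X):
    """Within (fixed-effects) estimator for a balanced panel.

    Y has shape (n, T); X has shape (n, T, K).
    """
    Y_dm = Y - Y.mean(axis=1, keepdims=True)            # remove individual means
    X_dm = X - X.mean(axis=1, keepdims=True)
    Xf = X_dm.reshape(-1, X.shape[2])                    # stack (i,t) observations
    Yf = Y_dm.reshape(-1)
    return np.linalg.solve(Xf.T @ Xf, Xf.T @ Yf)
```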
where D(β) is an n × K matrix with t-th row ∂g(X_t, β)/∂β'. Although one generally does not have a closed form expression for β̂, all asymptotic theory and procedures in Chapters 4–7 are applicable to the nonlinear least squares estimator if one replaces X_t by (∂/∂β)g(X_t, β). See also the discussion in Chapters 8 and 9.
The asymptotic theory in Chapters 4–7, however, cannot be directly applied to some popular nonlinear models. Examples of such nonlinear models are the moment condition model
$$E[m(Z_t, \beta^o)] = 0;$$
the conditional mean-variance model
$$Y_t = g(X_t, \beta^o) + \sigma(X_t, \beta^o)u_t,$$
where g(X_t, β) is a parametric model for E(Y_t|X_t), σ²(X_t, β) is a parametric model for var(Y_t|X_t), and {u_t} is i.i.d.(0, 1); and the conditional density model
$$f(y|X_t, \beta).$$
These nonlinear models are not models for the conditional mean or regression only; they also model other characteristics of the conditional distribution of Y_t given X_t. For these models, we need to develop new estimation methods and new asymptotic theory, which we will turn to in subsequent chapters.
One important part that we do not discuss in Chapters 2–7 is model speci…cation
testing. Chapter 2 emphasizes the importance of correct model speci…cation for the
validity of economic interpretation of model parameters. How to check whether a lin-
ear regression model is correctly speci…ed for conditional mean E(Yt jXt )? This is called
model speci…cation testing. Some popular speci…cation tests in econometrics are Haus-
man’s (1978) test and White’s (1981) test which compares two parameter estimators for
the same model parameter. Also, see Hong and White’s (1995) speci…cation test using
a nonparametric series regression approach.
33
EXERCISES
$$C_t = \alpha_1^o + \alpha_2^o(Y_t - T_t) + \varepsilon_t, \qquad (1.1)$$
$$T_t = \gamma_1^o + \gamma_2^o Y_t + v_t, \qquad (1.2)$$
$$Y_t = C_t + G_t, \qquad (1.3)$$
where C_t, Y_t, T_t, G_t are consumption, income, tax, and government spending respectively, and {ε_t} and {v_t} are i.i.d. (0, σ_ε²) and (0, σ_v²) respectively. Model (1.1) is a consumption function which we are interested in, (1.2) is a tax function, and (1.3) is an income identity.
(a) Can the OLS estimator ^ of model (1.1) give consistent estimation for the mar-
ginal propensity to consume? Explain.
(b) Suppose Gt is an exogenous variable (i.e., Gt does not depend on both Ct and
Yt ). Can Gt be used as a valid instrumental variable? If yes, describe a 2SLS procedure.
If not, explain.
(c) Suppose the government has to maintain a budget balance such that
$$G_t = T_t + w_t, \qquad (1.4)$$
where {w_t} is i.i.d. (0, σ_w²). Could G_t be used as a valid instrumental variable? If yes, describe a 2SLS procedure. If not, explain.
$$Y_t = X_t'\beta^o + \varepsilon_t, \qquad (2.1)$$
$$X_{1t} = v_t + u_t, \qquad (2.2)$$
$$\varepsilon_t = w_t + u_t, \qquad (2.3)$$
where {v_t}, {u_t} and {w_t} are all i.i.d. N(0,1), and they are mutually independent.
(a) Is the OLS estimator β̂ consistent for β°? Explain.
(b) Suppose that Z_{1t} = w_t − ε_t. Is Z_t = (1, Z_{1t})' a valid instrumental vector? Explain.
(c) Find an instrumental vector and the asymptotic distribution of β̂_2sls using this instrumental vector. [Note: you need to find √n(β̂_2sls − β°) →d N(0, V) for some V, where the expression of V should be given.]
34
(d) Consider testing the hypothesis
$$H_0: R\beta^o = r,$$
where R is a J × 2 matrix and r is a J × 1 vector. Suppose that F̃ is the F-statistic in the second stage regression of 2SLS. Could we use J·F̃ as an asymptotic χ²_J test? Explain.
where the first equation is a model for the demand of a certain good, where Y_t is the quantity demanded, P_t is the price of the good, S_t is the price of a substitute, and ε_t is a shock to demand. The second equation is a model for the supply of the good, where Y_t is the quantity supplied, C_t is the cost of production, and v_t is a shock to supply. Suppose S_t and C_t are exogenous variables, {ε_t} is i.i.d.(0, σ_ε²) and {v_t} is i.i.d.(0, σ_v²), and the two series {ε_t} and {v_t} are independent of each other. We have also assumed that the market always clears, so the quantity demanded is equal to the quantity supplied.
(a) Suppose we use a 2SLS estimator to estimate the demand model with the instruments Z_t = (S_t, C_t)'. Describe the 2SLS procedure. Is the resulting 2SLS estimator β̂_2sls consistent for β° = (β_0^o, β_1^o, β_2^o)'? Explain.
(b) Suppose we use a 2SLS estimator to estimate the supply equation with instruments Z_t = (S_t, C_t)'. Describe the 2SLS procedure. Is the resulting 2SLS estimator α̂_2sls consistent for α° = (α_0^o, α_1^o, α_2^o)'? Explain.
(c) Suppose {ε_t} and {v_t} are contemporaneously correlated, namely, E(ε_tv_t) ≠ 0. This can occur when there is a common shock to both the demand and supply of the good. Does this affect the conclusions in parts (a) and (b)? Explain.
7.4. Show that under Assumptions 7.1-7.4, β̂_2sls →p β° as n → ∞.
7.5. Show that the asymptotic variance of √n(β̂_2sls − β°) is
Ω = [Qxz Qzz⁻¹ Qzx]⁻¹ Qxz Qzz⁻¹ V Qzz⁻¹ Qzx [Qxz Qzz⁻¹ Qzx]⁻¹.
7.6. Suppose Assumptions 7.1-7.4, 7.6 and 7.7 hold.
(a) Define
ŝ² = ê'ê/n,
where ê = Y − Xβ̂_2sls. Show ŝ² →p σ² = var(εt) as n → ∞.
(b) Define
s² = e'e/n,
where e = Y − X̂β̂_2sls is the estimated residual from the second stage regression of Yt on X̂t = Π̂'Zt. Show that s² is not a consistent estimator for σ².
7.8. Let
V̂ = n⁻¹ Σ_{t=1}^n Zt Zt' êt²,
where êt = Yt − Xt'β̂_2sls. Show V̂ →p V under Assumptions 7.1-7.8.
Consider the model
Yt = Xt'β° + εt,   t = 1, ..., n,
for some unknown parameter β° and some unobservable disturbance εt, where
(i) E(Xt εt) = 0;
(ii) E(Zt εt) = 0, where Zt is an l×1 random vector, with l ≥ K;
(iii) the l×l matrix
Qzz = E(Zt Zt')
is finite and nonsingular, and the l×K matrix Qzx = E(Zt Xt') is finite and of full rank K.
Consider the OLS estimator
β̂ = (X'X)⁻¹X'Y
and the 2SLS estimator
β̂_2sls = [(X'Z)(Z'Z)⁻¹Z'X]⁻¹ X'Z(Z'Z)⁻¹Z'Y,
where Xt is the regressor vector, Zt is the instrumental vector, Π = [E(Zt Zt')]⁻¹E(Zt Xt') is the best linear LS approximation coefficient in the projection Xt = Π'Zt + vt, and vt is the K×1 regression error.
Now, suppose instead of decomposing Xt, we decompose the regression error εt as follows:
εt = vt'δ° + ut,
where δ° = [E(vt vt')]⁻¹E(vt εt) is the best linear LS approximation coefficient.
Now, assuming that vt is observable, consider the augmented linear regression model
Yt = Xt'β° + vt'δ° + ut.
Show that E[(Xt', vt')'ut] = 0. One important implication of this orthogonality condition is that if vt were observable, then the OLS estimator from regressing Yt on Xt and vt would be consistent for (β°', δ°')'.
where E = D − C B⁻¹ C'.]
7.13 [Hausman's Test] Suppose Assumptions 3.1, 3.2, 3.3(ii, iii), 3.4 and 3.5 in Problem 7.8 hold. A test for the null hypothesis H0: E(Xt εt) = 0 can be constructed by comparing β̂ and β̂_2sls, because they will converge in probability to the same limit β° under H0 and to different limits under the alternatives to H0. Assume H0 holds.
(a) Show that
√n(β̂ − β°) − Qxx⁻¹ n^{-1/2} Σ_{t=1}^n Xt εt →p 0,
or equivalently
√n(β̂ − β°) = Qxx⁻¹ n^{-1/2} Σ_{t=1}^n Xt εt + oP(1),
where Qxx = E(Xt Xt'). [Hint: If Â →p A and B̂ = OP(1), then ÂB̂ − AB̂ →p 0, or ÂB̂ = AB̂ + oP(1).]
(b) Show that
√n(β̂_2sls − β°) = Qx̃x̃⁻¹ n^{-1/2} Σ_{t=1}^n X̃t εt + oP(1),
where Qx̃x̃ = E(X̃t X̃t'), X̃t = Π'Zt, and Π = [E(Zt Zt')]⁻¹E(Zt Xt').
(c) Show that
√n(β̂ − β̂_2sls) = n^{-1/2} Σ_{t=1}^n {Qxx⁻¹ Xt − Qx̃x̃⁻¹ X̃t} εt + oP(1).
(d) The asymptotic distribution of √n(β̂ − β̂_2sls) is determined by the leading term in part (c). Find its asymptotic distribution.
(e) Construct an asymptotically χ² test statistic. What is the degree of freedom of the asymptotic χ² distribution? Assume that Qxx − Qx̃x̃ is strictly positive definite.
7.14. Suppose Assumptions 3.1, 3.2, 3.3(ii, iii) and 3.4 in Problem 7.8 hold, E(Xjt⁴) < ∞ for 1 ≤ j ≤ K, E(Zjt⁴) < ∞ for 1 ≤ j ≤ l, and E(εt⁴) < ∞. Construct a Hausman's test statistic for H0: E(εt|Xt) = 0 and derive its asymptotic distribution under H0.
CHAPTER 8 GENERALIZED METHOD OF MOMENTS ESTIMATION
Abstract: Many economic theories and hypotheses have implications only on a moment condition or a set of moment conditions. A popular method to estimate model parameters contained in the moment condition is the Generalized Method of Moments (GMM). In this chapter, we first provide some economic examples for the moment condition, and define the GMM estimator. We then establish the consistency and asymptotic normality of the GMM estimator. Since the asymptotic variance of a GMM estimator depends on the choice of a weighting matrix, we introduce an asymptotically optimal two-stage GMM estimator with a suitable choice of the weighting matrix. With the construction of a consistent asymptotic variance estimator, we then propose an asymptotically χ² Wald test statistic for the hypothesis of interest, and a model specification test for the moment condition.
Key words: CAPM, GMM, IV Estimation, Model specification test, Moment condition, Moment matching, Optimal estimation, Overidentification, Rational expectations.
MME Procedure: Suppose f(y, θ°) is the probability density function (pdf) or the probability mass function (pmf) of a univariate random variable Yt.
Question: How can we estimate the unknown parameter θ° using a realization of the random sample {Yt}_{t=1}^n?
The basic idea of MME is to match the sample moments with the population moments obtained under the probability distributional model. Specifically, MME can be implemented as follows:
Step 1: Compute the population moments μk(θ°) ≡ E(Yt^k) under the model density f(y, θ°):
E(Yt) = ∫_{−∞}^{∞} y f(y, θ°) dy = μ1(θ°),
E(Yt²) = ∫_{−∞}^{∞} y² f(y, θ°) dy = σ²(θ°) + μ1²(θ°),
where σ²(θ°) is the variance of Yt.
Step 2: Compute the sample moments from the random sample Y^n = (Y1, ..., Yn)':
m̂1 = Ȳn →p μ1(θ°),
m̂2 = n⁻¹ Σ_{t=1}^n Yt² →p E(Yt²) = σ²(θ°) + μ1²(θ°),
where σ²(θ°) = μ2(θ°) − μ1²(θ°), and the convergence in probability follows from the WLLN.
Step 3: Match the sample moments with the corresponding population moments evaluated at some parameter value θ̂:
m̂1 = μ1(θ̂),
m̂2 = σ²(θ̂) + μ1²(θ̂).
Step 4: Solve the system of equations. The solution θ̂ is called the method of moments estimator for θ°.
Example 1: Suppose the random sample {Yt}_{t=1}^n is i.i.d. EXP(λ). Find an estimator for λ using the method of moments estimation.
Given the density
f(y, λ) = λ e^{−λy} for y > 0,
it can be shown that
μ1(λ) = E(Yt) = ∫_0^∞ y f(y, λ) dy = ∫_0^∞ y λ e^{−λy} dy = 1/λ.
On the other hand, the first sample moment is the sample mean:
m̂1 = Ȳn.
Matching
m̂1 = μ1(λ̂) = 1/λ̂
and solving for λ̂ gives
λ̂ = 1/m̂1 = 1/Ȳn.
Example 2: Suppose the random sample {Yt}_{t=1}^n is i.i.d. N(μ, σ²). Find the MME for θ° = (μ, σ²)'.
The population moments are
E(Yt) = μ,
E(Yt²) = σ² + μ².
The sample moments are
m̂1 = Ȳn,
m̂2 = n⁻¹ Σ_{t=1}^n Yt².
Matching gives
Ȳn = μ̂,
n⁻¹ Σ_{t=1}^n Yt² = σ̂² + μ̂².
It follows that the MME is
μ̂ = Ȳn,
σ̂² = n⁻¹ Σ_{t=1}^n Yt² − Ȳn² = n⁻¹ Σ_{t=1}^n (Yt − Ȳn)².
It is well known that μ̂ →p μ and σ̂² →p σ² as n → ∞.
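As a small numerical illustration, the following minimal Python sketch (variable names are illustrative, not from the text) matches the first two sample moments of a simulated i.i.d. N(μ, σ²) sample to the population moments E(Yt) = μ and E(Yt²) = σ² + μ², reproducing the MME formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true, n = 1.5, 2.0, 5000
y = rng.normal(mu_true, sigma_true, size=n)   # i.i.d. N(mu, sigma^2) sample

# Sample moments
m1_hat = y.mean()                  # m^_1 = sample mean
m2_hat = np.mean(y ** 2)           # m^_2 = second sample moment

# Moment matching: m1 = mu, m2 = sigma^2 + mu^2
mu_mme = m1_hat
sigma2_mme = m2_hat - m1_hat ** 2  # equals n^{-1} sum (Yt - Ybar)^2

print(mu_mme, sigma2_mme)
```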
where the sub-index t denotes that mt(θ) is a function of both θ and some random variables indexed by t. For example, we may have
mt(β) = Zt(Yt − Xt'β)
in the 2SLS estimation, or more generally in the instrumental variable (IV) estimation, where Zt is an l×1 instrument vector.
If l = K, that is, if the number of moment conditions is the same as the number of unknown parameters, the model E[mt(θ°)] = 0 is called exactly identified. If l > K, that is, if the number of moment conditions is larger than the number of unknown parameters, the model is called overidentified.
The moment condition E[mt(θ°)] = 0 may follow from economic and financial theory (e.g., rational expectations and correct asset pricing). We now illustrate this with the following examples.
Example [CAPM]: Let Yt be an L×1 vector of portfolio excess returns satisfying
Yt = β0° + β1° Rmt + εt = θ°'Xt + εt,
where Xt = (1, Rmt)' is a bivariate vector, Rmt is the excess market portfolio return, θ° is a 2×L parameter matrix, and εt is an L×1 disturbance, with E(εt|Xt) = 0.
Define the l×1 moment function
mt(θ) = Xt ⊗ (Yt − θ'Xt),
where l = 2L and ⊗ denotes the Kronecker product. When CAPM holds, we have
E[mt(θ°)] = 0.
These l×1 moment conditions form a basis to estimate and test the CAPM.
In fact, for any measurable function h: R² → R^l, CAPM implies
E[h(Xt) ⊗ (Yt − θ°'Xt)] = 0.
Suppose a representative economic agent has a constant relative risk aversion utility over his lifetime,
U = Σ_{t=0}^n β^t u(Ct) = Σ_{t=0}^n β^t (Ct^γ − 1)/γ,
where u(·) is the time-invariant utility function of the economic agent in each time period (here we assume u(c) = (c^γ − 1)/γ), β is the agent's time discount factor, γ is the economic agent's risk aversion parameter, and Ct is the consumption during period t. Let the information available to the agent at time t−1 be represented by the sigma-algebra It−1, in the sense that any variable whose value is known at time t−1 is presumed to be It−1-measurable, and let
Rt = Pt/Pt−1 = 1 + (Pt − Pt−1)/Pt−1
be the gross return to an asset acquired at time t−1 at the price of Pt−1 (we assume no dividend on the asset). The agent's optimization problem is to
max_{Ct} E(U)
subject to the intertemporal budget constraint
Ct + Pt qt = Yt + Pt qt−1,
where qt is the quantity of the asset purchased at time t and Yt is the agent's labor income during period t. Define the marginal rate of intertemporal substitution
MRSt(θ) = [∂u(Ct)/∂Ct] / [∂u(Ct−1)/∂Ct−1] = (Ct/Ct−1)^{γ−1}.
The first order conditions of the agent's optimization problem are characterized by the Euler equation:
E[β° MRSt(θ°) Rt | It−1] = 1 for some θ° = (β°, γ°)'.
That is, the marginal rate of intertemporal substitution discounts gross returns to unity. Thus, one may view {β MRSt(θ) Rt − 1} as a generalized model residual which has the MDS property when evaluated at the true structural parameters θ° = (β°, γ°)'.
Question: How can we estimate the unknown parameter θ° in an asset pricing model?
More generally, how can we estimate θ° from any linear or nonlinear econometric model which can be formulated as a set of moment conditions? Note that the joint distribution of the random sample is not given or implied by economic theory; only a set of conditional moments is given.
From the Euler equation, we can induce the following conditional moment restrictions:
E[β° MRSt(θ°) Rt − 1] = 0,
E[(Ct−1/Ct−2)(β° MRSt(θ°) Rt − 1)] = 0,
E[Rt−1(β° MRSt(θ°) Rt − 1)] = 0.
The sample moment
m̂(θ) = n⁻¹ Σ_{t=1}^n mt(θ),
where
mt(θ) = [β MRSt(θ) Rt − 1] · (1, Ct−1/Ct−2, Rt−1)',
can serve as the basis for estimation. The elements of the vector
Zt ≡ (1, Ct−1/Ct−2, Rt−1)'
serve as instrumental variables contained in the information set It−1.
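To make this concrete, here is a minimal sketch of how the moment function mt(θ) for the Euler equation could be coded; the consumption and return series are hypothetical and the function names are illustrative, not part of the text.

```python
import numpy as np

def euler_moments(theta, C, R):
    """Moment function m_t(theta) for the Euler equation.

    theta = (beta, gamma); C and R are consumption and gross-return series.
    Each row is [beta*MRS_t*R_t - 1] * (1, C_{t-1}/C_{t-2}, R_{t-1}).
    """
    beta, gamma = theta
    mrs = (C[2:] / C[1:-1]) ** (gamma - 1.0)      # MRS_t = (C_t/C_{t-1})^(gamma-1)
    resid = beta * mrs * R[2:] - 1.0              # generalized residual
    instruments = np.column_stack([
        np.ones_like(resid),
        C[1:-1] / C[:-2],                         # C_{t-1}/C_{t-2}
        R[1:-1],                                  # R_{t-1}
    ])
    return instruments * resid[:, None]

# Hypothetical data, purely for illustration
rng = np.random.default_rng(1)
n = 500
C = np.exp(np.cumsum(0.01 + 0.02 * rng.standard_normal(n)))  # consumption level
R = 1.02 + 0.05 * rng.standard_normal(n)                     # gross returns

m_bar = euler_moments((0.97, 2.0), C, R).mean(axis=0)         # sample moment m^(theta)
print(m_bar)
```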
Definition 8.1 [GMM Estimator]: The generalized method of moments (GMM) estimator is
θ̂ = arg min_{θ∈Θ} m̂(θ)'Ŵ⁻¹m̂(θ),
where
m̂(θ) = n⁻¹ Σ_{t=1}^n mt(θ).
Question: Why is the GMM estimator θ̂ not defined by setting the l×1 sample moments to zero jointly, namely m̂(θ̂) = 0?
Remarks: When l > K, i.e., when the number of equations is larger than the number of unknown parameters, we generally cannot find a θ̂ such that m̂(θ̂) = 0. However, we can find a θ̂ which makes m̂(θ̂) as close to the l×1 zero vector as possible by minimizing the quadratic form
m̂(θ)'m̂(θ) = Σ_{i=1}^l m̂i²(θ),
where m̂i(θ) = n⁻¹ Σ_{t=1}^n mit(θ), i = 1, ..., l. Since each sample moment component m̂i(θ) has a different variance, and m̂i(θ) and m̂j(θ) may be correlated, we can introduce a weighting matrix Ŵ and choose θ̂ to minimize a weighted quadratic form in m̂(θ), namely
m̂(θ)'Ŵ⁻¹m̂(θ).
Question: What is the role of Ŵ?
Intuitively, the sample moment components which have large sampling variations should be discounted. This idea is similar to GLS, which discounts noisy observations by dividing by the conditional standard deviation of the disturbance term and differencing out serial correlations.
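A minimal numerical sketch of the GMM definition (all names and the simulated data are hypothetical): the estimator is obtained by minimizing the weighted quadratic form m̂(θ)'Ŵ⁻¹m̂(θ) with a generic optimizer, here illustrated with the linear IV moment function.

```python
import numpy as np
from scipy.optimize import minimize

def gmm_objective(beta, Y, X, Z, W_inv):
    # m^(beta) = n^{-1} sum Z_t (Y_t - X_t' beta)
    resid = Y - X @ beta
    m_bar = Z.T @ resid / len(Y)
    return m_bar @ W_inv @ m_bar          # m^' W^{-1} m^

# Simulated overidentified linear IV example (illustrative only)
rng = np.random.default_rng(2)
n, beta_true = 1000, np.array([1.0, -0.5])
Z = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])   # l = 3 instruments
u = rng.standard_normal(n)
x1 = Z[:, 1] + Z[:, 2] + 0.5 * u + 0.5 * rng.standard_normal(n)  # endogenous regressor
X = np.column_stack([np.ones(n), x1])                            # K = 2 regressors
Y = X @ beta_true + u

W_inv = np.linalg.inv(Z.T @ Z / n)        # W^ = Z'Z/n, the 2SLS choice
res = minimize(gmm_objective, x0=np.zeros(2), args=(Y, X, Z, W_inv), method="BFGS")
print(res.x)                               # numerically close to the 2SLS estimate
```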
Suppose
mt(β) = Zt(Yt − Xt'β)
and
E[Zt(Yt − Xt'β°)] = 0 for some β°.
In this case, the GMM estimator, or more precisely, the linear IV estimator β̂, solves the following minimization problem:
min_{β∈R^K} m̂(β)'Ŵ⁻¹m̂(β) = n⁻² min_{β∈R^K} (Y − Xβ)'Z Ŵ⁻¹ Z'(Y − Xβ),
where
m̂(β) = Z'(Y − Xβ)/n = n⁻¹ Σ_{t=1}^n Zt(Yt − Xt'β).
The FOC is given by
∂[(Y − Xβ)'Z Ŵ⁻¹ Z'(Y − Xβ)]/∂β |_{β=β̂} = −2X'Z Ŵ⁻¹ Z'(Y − Xβ̂) = 0.
It follows that
X'Z Ŵ⁻¹ Z'X β̂ = X'Z Ŵ⁻¹ Z'Y.
When the K×l matrix Qxz = E(Xt Zt') is of full rank K, the K×K matrix Qxz W⁻¹ Qzx is nonsingular. Therefore, X'Z Ŵ⁻¹ Z'X is nonsingular at least for large samples, and consequently the GMM estimator β̂ has the closed form expression
β̂ = (X'Z Ŵ⁻¹ Z'X)⁻¹ X'Z Ŵ⁻¹ Z'Y.
This is called a linear IV estimator because it estimates the parameter β° in the linear model Yt = Xt'β° + εt with E(εt|Zt) = 0.
Interestingly, the 2SLS estimator β̂_2sls considered in Chapter 7 is a special case of the IV estimator obtained by choosing
Ŵ = Z'Z,
or more generally, by choosing Ŵ = c(Z'Z) for any constant c ≠ 0.
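This equivalence can be checked numerically; the sketch below (simulated data, names illustrative) evaluates the closed form β̂ = (X'ZŴ⁻¹Z'X)⁻¹X'ZŴ⁻¹Z'Y with Ŵ = Z'Z and compares it with an explicit two-stage least squares computation.

```python
import numpy as np

def linear_iv(Y, X, Z, W):
    """Closed-form linear IV/GMM estimator with weighting matrix W."""
    W_inv = np.linalg.inv(W)
    A = X.T @ Z @ W_inv @ Z.T @ X
    b = X.T @ Z @ W_inv @ Z.T @ Y
    return np.linalg.solve(A, b)

rng = np.random.default_rng(3)
n = 2000
Z = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
u = rng.standard_normal(n)
x1 = Z[:, 1] - Z[:, 2] + 0.6 * u + rng.standard_normal(n)
X = np.column_stack([np.ones(n), x1])
Y = X @ np.array([0.5, 2.0]) + u

beta_iv = linear_iv(Y, X, Z, Z.T @ Z)            # W^ = Z'Z

# Explicit two-stage least squares for comparison
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)    # first-stage fitted values
beta_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ Y)
print(np.allclose(beta_iv, beta_2sls))           # True: the two estimators coincide
```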
Question: Is the choice of Ŵ = Z'Z optimal? In other words, is the 2SLS estimator β̂_2sls asymptotically efficient in estimating β°?
When l = K, so that Qxz = E(Xt Zt') is nonsingular, the K×K matrix X'Z is nonsingular at least for large samples. Consequently, the closed-form expression
β̂ = (X'Z Ŵ⁻¹ Z'X)⁻¹ X'Z Ŵ⁻¹ Z'Y
simplifies to
β̂ = (Z'X)⁻¹ Z'Y.
Note that the IV estimator β̂ generally depends on the choice of instruments Zt and weighting matrix Ŵ. However, when l = K, the exact identification case, the IV estimator β̂ does not depend on the choice of Ŵ. This is because in this case the FOC X'Z Ŵ⁻¹ Z'(Y − Xβ̂) = 0 becomes
Z'(Y − Xβ̂) = 0,   (K×n)(n×1) = K×1,
given that X'Z and Ŵ are nonsingular at least for large samples. Obviously, the OLS estimator β̂ = (X'X)⁻¹X'Y is a special case of the linear IV estimator by choosing Zt = Xt.
To investigate the asymptotic properties of the GMM estimator θ̂, we first provide a set of regularity conditions.
Assumption 8.1 [Compactness]: The parameter space Θ is compact (closed and bounded).
Assumption 8.2 [Uniform Convergence]:
sup_{θ∈Θ} ||m̂(θ) − m(θ)|| →p 0,
where m(θ) ≡ E[mt(θ)].
Assumption 8.3 [Identification]: There exists a unique parameter θ° in Θ such that m(θ°) = 0.
Assumption 8.4 [Weighting Matrix]: Ŵ →p W, where W is a nonstochastic l×l symmetric, finite and nonsingular matrix.
Remarks:
Assumption 8.3 is an identification condition. If the moment condition m(θ°) = 0 is implied by economic theory, θ° can be viewed as the true model parameter value. Assumptions 8.1 and 8.3 imply that the true model parameter θ° lies inside the compact parameter space Θ. Compactness is sometimes restrictive, but it greatly simplifies our asymptotic analysis and is sometimes necessary (as in the case of estimating GARCH models), where some parameters must be restricted to ensure a positive conditional variance estimator.
In many applications the moment function has the multiplicative form
mt(θ) = ht εt(θ)
for some weighting function ht and some error or generalized error term εt(θ). Assumption 8.2 allows but does not require such a multiplicative form for mt(θ). Also, in Assumption 8.2, we impose a uniform WLLN for m̂(θ) over Θ. Intuitively, uniform convergence implies that the largest (or worst) deviation between m̂(θ) and m(θ) over Θ vanishes in probability as n → ∞.
This can be ensured by a suitable uniform weak law of large numbers (UWLLN). For example, when {Yt, Xt'}'_{t=1}^n is i.i.d., we have the following result.
Lemma 8.2 [Uniform Strong Law of Large Numbers for IID Processes (USLLN)]: Let
fZt ; t = 1; 2; :::g be an IID sequence of random d 1 vectors, with common cumulative distribution
function F:
Let be a compact subset of RK ; and let q : Rd ! R be a function such that q( ; ) is
measurable for each 2 and q(z; ) is continuous on for each z 2 Rd :
Suppose there exists a measurable function D : Rd ! R+ such that jq(z; )j D(z) for all
2 and z 2 S; where S is the support of Zt and E[D(Zt )] < 1:
Then
(i) Q(θ) ≡ E[q(Zt, θ)] is continuous on Θ;
(ii) sup_{θ∈Θ} |Q̂n(θ) − Q(θ)| → 0 a.s. as n → ∞, where Q̂n(θ) = n⁻¹ Σ_{t=1}^n q(Zt, θ).
Lemma 8.3 [Uniform Strong Law of Large Numbers for Stationary Ergodic Processes
{Ranga Rao (1962)}]: Let ( ; F; P ) be a probability space, and let T : ! be a one-to-one
measure preserving transformation.
Let be a compact subset of RK ; and let q : ! R be a function such that q( ; ) is
measurable for each 2 and q(!; ) is continuous on for each ! 2 :
Suppose there exists a measurable function D : ! R+ such that jq(!; )j D(!) for all
R
2 and ! 2 ; and E(D) = DdP < 1:
If for each θ ∈ Θ, qt(θ) = q(T^t ω, θ) is ergodic, then
(i) Q(θ) ≡ E[qt(θ)] is continuous on Θ;
(ii) sup_{θ∈Θ} |Q̂n(θ) − Q(θ)| → 0 a.s. as n → ∞, where Q̂n(θ) = n⁻¹ Σ_{t=1}^n qt(θ).
Theorem 8.4 [Consistency of the GMM Estimator]: Suppose Assumptions 8.1-8.4 hold. Then θ̂ →p θ° as n → ∞.
To show this consistency theorem, we need an extremum estimator lemma.
Remarks: This lemma continues to hold if we change all convergences in probability to almost sure convergences.
We now show the consistency of the GMM estimator θ̂ by applying the above lemma.
Proof: Put
Q̂(θ) = m̂(θ)'Ŵ⁻¹m̂(θ)
and
Q(θ) = m(θ)'W⁻¹m(θ).
Then
|Q̂(θ) − Q(θ)| = |m̂(θ)'Ŵ⁻¹m̂(θ) − m(θ)'W⁻¹m(θ)|
= |[m̂(θ) − m(θ) + m(θ)]'Ŵ⁻¹[m̂(θ) − m(θ) + m(θ)] − m(θ)'W⁻¹m(θ)|
≤ |[m̂(θ) − m(θ)]'Ŵ⁻¹[m̂(θ) − m(θ)]| + 2|m(θ)'Ŵ⁻¹[m̂(θ) − m(θ)]| + |m(θ)'(Ŵ⁻¹ − W⁻¹)m(θ)|.
By Assumptions 8.2 and 8.4, each term on the right-hand side vanishes in probability uniformly in θ ∈ Θ, so that
sup_{θ∈Θ} |Q̂(θ) − Q(θ)| →p 0.
Together with Assumptions 8.1 and 8.3, this verifies the conditions of the extremum estimator lemma. Note that the proof of the consistency theorem does not require the existence of the FOC; this is made possible by using the extremum estimator lemma. This completes the proof of consistency.
Assumption 8.5 [Interiorness]: θ° ∈ int(Θ).
Assumption 8.6: √n m̂(θ°) →d N(0, Vo) for some finite and nonsingular l×l matrix Vo, and
sup_{θ∈Θ} || n⁻¹ Σ_{t=1}^n ∂mt(θ)/∂θ − D(θ) || →p 0,
where the l×K matrix
D(θ) ≡ E[∂mt(θ)/∂θ] = dm(θ)/dθ
is continuous in θ ∈ Θ.
Remarks:
In Assumption 8.6, we assume both the CLT and the UWLLN directly. These are called "high-level assumptions." They can be ensured by imposing more primitive conditions on the data generating processes (e.g., i.i.d. random samples or MDS random samples) and on the moment and smoothness conditions of mt(θ). For more discussion, see White (1994).
We now establish the asymptotic normality of the GMM estimator θ̂: under Assumptions 8.1-8.6, √n(θ̂ − θ°) →d N(0, Ω), where
Ω = (Do'W⁻¹Do)⁻¹ Do'W⁻¹ Vo W⁻¹ Do (Do'W⁻¹Do)⁻¹
and Do ≡ D(θ°) = ∂m(θ°)/∂θ.
Proof: Because θ° is an interior element of Θ and θ̂ →p θ° as n → ∞, θ̂ is an interior element of Θ with probability approaching one as n → ∞.
For n sufficiently large, the first order conditions for the minimization of Q̂(θ) = m̂(θ)'Ŵ⁻¹m̂(θ) are
0 = dQ̂(θ)/dθ |_{θ=θ̂} = 2 [dm̂(θ̂)/dθ]' Ŵ⁻¹ m̂(θ̂),
or equivalently
0 = [dm̂(θ̂)/dθ]' Ŵ⁻¹ √n m̂(θ̂),
with dimensions (K×1) = (K×l)(l×l)(l×1).
Note that Ŵ is not a function of θ. Also, this FOC does not necessarily imply m̂(θ̂) = 0; instead, it only says that a set of K (≤ l) linear combinations of the l components of m̂(θ̂) is equal to zero. Here, the l×K matrix dm̂(θ)/dθ is the gradient of the l×1 vector m̂(θ) with respect to the K×1 vector θ.
Using a Taylor series expansion around the true parameter value θ°, we have
√n m̂(θ̂) = √n m̂(θ°) + [dm̂(θ̄)/dθ] √n(θ̂ − θ°),
where θ̄ = λθ̂ + (1−λ)θ° lies between θ̂ and θ°, with λ ∈ [0,1]. Here, for notational simplicity, we have abused notation in the expression dm̂(θ̄)/dθ; precisely speaking, a different θ̄ is needed for each partial derivative of m̂(θ) with respect to each parameter θi, i = 1, ..., K.
The first term in the above Taylor series expansion is contributed by the sampling randomness of the sample average of the moment functions evaluated at the true parameter θ°, and the second term is contributed by the randomness of the parameter estimator θ̂ − θ°.
It follows from the FOC that
0 = [dm̂(θ̂)/dθ]' Ŵ⁻¹ √n m̂(θ̂)
= [dm̂(θ̂)/dθ]' Ŵ⁻¹ √n m̂(θ°) + [dm̂(θ̂)/dθ]' Ŵ⁻¹ [dm̂(θ̄)/dθ] √n(θ̂ − θ°).
Now let us show that dm̂(θ̂)/dθ →p Do ≡ D(θ°). To show this, consider
|| dm̂(θ̂)/dθ − Do || ≤ || dm̂(θ̂)/dθ − D(θ̂) || + || D(θ̂) − D(θ°) ||
≤ sup_{θ∈Θ} || dm̂(θ)/dθ − D(θ) || + || D(θ̂) − D(θ°) ||
→p 0
by the triangle inequality and Assumption 8.6 (the UWLLN, the continuity of D(·), and θ̂ − θ° →p 0).
Similarly, because θ̄ = λθ̂ + (1−λ)θ° for λ ∈ [0,1], we have
|| θ̄ − θ° || = || λ(θ̂ − θ°) || ≤ || θ̂ − θ° || →p 0.
It follows that
dm̂(θ̄)/dθ →p Do.
Then the K×K matrix
Do'W⁻¹Do
is nonsingular by Assumptions 8.4 and 8.6. Therefore, for n sufficiently large, the inverse
{[dm̂(θ̂)/dθ]' Ŵ⁻¹ [dm̂(θ̄)/dθ]}⁻¹
exists with probability approaching one, and we have
√n(θ̂ − θ°) = −{[dm̂(θ̂)/dθ]' Ŵ⁻¹ [dm̂(θ̄)/dθ]}⁻¹ [dm̂(θ̂)/dθ]' Ŵ⁻¹ √n m̂(θ°)
= −Â √n m̂(θ°),
where
Â = {[dm̂(θ̂)/dθ]' Ŵ⁻¹ [dm̂(θ̄)/dθ]}⁻¹ [dm̂(θ̂)/dθ]' Ŵ⁻¹.
By Assumption 8.6, √n m̂(θ°) →d N(0, Vo), where Vo ≡ avar[n^{−1/2} Σ_{t=1}^n mt(θ°)]. Moreover,
Â →p (Do'W⁻¹Do)⁻¹ Do'W⁻¹ ≡ A.
It follows that √n(θ̂ − θ°) →d N(0, Ω), where
Ω = A Vo A'
= (Do'W⁻¹Do)⁻¹ Do'W⁻¹ Vo W⁻¹ Do (Do'W⁻¹Do)⁻¹.
Remarks:
The structure of avar(√n θ̂) is very similar to that of avar(√n β̂_2sls). In fact, as pointed out earlier, 2SLS is a special case of the GMM estimator with the choices
mt(β) = Zt(Yt − Xt'β),   W = E(Zt Zt') = Qzz.
Similarly, the OLS estimator is a special case of GMM with the choices
mt(β) = Xt(Yt − Xt'β),   W = E(Xt Xt') = Qxx.
Most econometric estimators can be viewed as special cases of GMM, at least asymptotically. In other words, GMM provides a convenient unified framework to view most econometric estimators. See White (1994) for more discussion.
Theorem 8.7 [Asymptotic Efficiency]: Suppose Assumptions 8.4 and 8.6 hold. Define Ω° = (Do'Vo⁻¹Do)⁻¹, which is obtained from Ω by choosing W = Vo ≡ avar[√n m̂(θ°)]. Then Ω − Ω° is p.s.d.
Proof: Observe that Ω − Ω° is p.s.d. if and only if Ω°⁻¹ − Ω⁻¹ is p.s.d. We therefore consider
Ω°⁻¹ − Ω⁻¹ = Do'Vo⁻¹Do − Do'W⁻¹Do (Do'W⁻¹VoW⁻¹Do)⁻¹ Do'W⁻¹Do
= Do'Vo^{−1/2} [I − Vo^{1/2}W⁻¹Do (Do'W⁻¹VoW⁻¹Do)⁻¹ Do'W⁻¹Vo^{1/2}] Vo^{−1/2}Do
= Do'Vo^{−1/2} G Vo^{−1/2} Do,
where
G ≡ I − Vo^{1/2}W⁻¹Do (Do'W⁻¹VoW⁻¹Do)⁻¹ Do'W⁻¹Vo^{1/2}
is symmetric and idempotent (G'G = G² = G). Hence
Ω°⁻¹ − Ω⁻¹ = (Do'Vo^{−1/2}G)(GVo^{−1/2}Do)
= (GVo^{−1/2}Do)'(GVo^{−1/2}Do)
= B'B,
which is p.s.d. (why?), where B = GVo^{−1/2}Do is an l×K matrix. This completes the proof.
Remarks:
The optimal choice W = Vo is not unique. The choice W = cVo for any nonzero constant c is also optimal.
In practice, the matrix Vo is unavailable. However, we can use a feasible asymptotically optimal choice Ŵ = Ṽ, a consistent estimator for Vo ≡ avar[√n m̂(θ°)].
Question: Why does the choice W = Vo deliver an asymptotically optimal GMM estimator?
Answer: Ŵ →p Vo, and Vo is the variance-covariance matrix of the sample moments √n m̂(θ°). The use of Ŵ⁻¹ →p Vo⁻¹ therefore downweighs the sample moments which have large sampling variations and differences out correlations between the components √n m̂i(θ°) and √n m̂j(θ°) for i ≠ j, where i, j = 1, ..., l. This is similar in spirit to the GLS estimator in the linear regression model. It also corrects serial correlations between different sample moments when they exist.
As pointed out earlier, the 2SLS estimator β̂_2sls is a special case of the GMM estimator with mt(β) = Zt(Yt − Xt'β) and the choice of weighting matrix W = E(Zt Zt') = Qzz. Suppose {mt(β°)} is an MDS and E(εt²|Zt) = σ², where εt = Yt − Xt'β°. Then
Vo = avar[√n m̂(β°)] = E[mt(β°)mt(β°)'] = σ² Qzz,
where the last equality follows from the law of iterated expectations and conditional homoskedasticity. Because W = Qzz is proportional to Vo, the 2SLS estimator is asymptotically optimal in this case. In contrast, when {mt(β°)} is an MDS with conditional heteroskedasticity (i.e., E(εt²|Zt) ≠ σ²) or {mt(β°)} is not an MDS, the choice W = Qzz does not deliver an asymptotically optimal 2SLS estimator. Instead, the GMM estimator with the choice W = Vo = E(Zt Zt' εt²) is asymptotically optimal.
The previous theorem suggests that the following two-stage GMM estimator will be asymptotically optimal.
Step 1: Find a consistent preliminary estimator θ̃:
θ̃ = arg min_{θ∈Θ} m̂(θ)'W̃⁻¹m̂(θ),
for some prespecified W̃ which converges in probability to some finite and p.d. matrix. For convenience, we can set W̃ = I, an l×l identity matrix. This does not deliver an optimal estimator, but it is a consistent estimator for θ°.
Step 2: Find a preliminary consistent estimator Ṽ for Vo ≡ avar[√n m̂(θ°)], and choose Ŵ = Ṽ.
The construction of Ṽ differs in the following two cases, depending on whether {mt(θ°)} is an MDS. If {mt(θ°)} is an MDS, we can use
Ṽ = n⁻¹ Σ_{t=1}^n mt(θ̃)mt(θ̃)'.
If {mt(θ°)} is not an MDS, Ṽ is a kernel-weighted long-run variance estimator based on the sample autocovariances
Γ̃(j) = n⁻¹ Σ_{t=j+1}^n mt(θ̃)mt−j(θ̃)' for j ≥ 0,
and Γ̃(j) = Γ̃(−j)' if j < 0. Under regularity conditions, it can be shown that Ṽ is consistent for the long-run variance
Vo = Σ_{j=−∞}^{∞} Γ(j).
Step 3: Obtain the second-stage GMM estimator
θ̂ = arg min_{θ∈Θ} m̂(θ)'Ṽ⁻¹m̂(θ).
Remarks: The weighting matrix Ṽ does not involve the unknown parameter θ; it is a given (stochastic) weighting matrix. This two-stage GMM estimator θ̂ is asymptotically optimal because Ṽ →p Vo = avar[√n m̂(θ°)].
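A minimal sketch of the two-stage procedure in the MDS case (all names and the simulated data are illustrative): stage one uses W̃ = I, and stage two reweights with Ṽ built from the first-stage moment outer products.

```python
import numpy as np
from scipy.optimize import minimize

def m_bar(beta, Y, X, Z):
    return Z.T @ (Y - X @ beta) / len(Y)        # sample moment m^(beta)

def gmm_obj(beta, Y, X, Z, W_inv):
    m = m_bar(beta, Y, X, Z)
    return m @ W_inv @ m

def two_stage_gmm(Y, X, Z):
    k, l = X.shape[1], Z.shape[1]
    # Stage 1: preliminary estimator with W~ = I
    stage1 = minimize(gmm_obj, np.zeros(k), args=(Y, X, Z, np.eye(l)), method="BFGS")
    # Stage 2: V~ = n^{-1} sum m_t(theta~) m_t(theta~)'   (MDS case)
    resid = Y - X @ stage1.x
    m_t = Z * resid[:, None]
    V_tilde = m_t.T @ m_t / len(Y)
    stage2 = minimize(gmm_obj, stage1.x, args=(Y, X, Z, np.linalg.inv(V_tilde)),
                      method="BFGS")
    return stage2.x, V_tilde

# Illustrative simulated data with conditional heteroskedasticity
rng = np.random.default_rng(4)
n = 3000
Z = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
u = rng.standard_normal(n) * np.exp(0.5 * Z[:, 1])
x1 = Z[:, 1] + Z[:, 2] + 0.5 * u
X = np.column_stack([np.ones(n), x1])
Y = X @ np.array([1.0, 1.0]) + u

beta_hat, V_tilde = two_stage_gmm(Y, X, Z)
print(beta_hat)
```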
The resulting two-stage GMM estimator satisfies √n(θ̂ − θ°) →d N(0, Ω°), where Ω° = (Do'Vo⁻¹Do)⁻¹.
Asymptotic efficiency matters in practice: first, most macroeconomic time series data sets are usually short, and second, the use of instruments Zt is usually inefficient. These factors lead to a large estimation error, so it is desirable to have an asymptotically efficient estimator.
Although the two-stage GMM procedure is asymptotically efficient, one may like to iterate the procedure further until the GMM parameter estimates and the values of the minimized objective function converge. This will eliminate any dependence of the GMM estimator on the choice of the initial weighting matrix W̃, and it may improve the finite sample performance of the GMM estimator when the number of parameters is large (e.g., Ferson and Foerster 1994).
We need to estimate both Do and Vo.
(i) To estimate Do = E[∂mt(θ°)/∂θ], we can use
D̂ = dm̂(θ̂)/dθ.
(ii) To estimate Vo, we need to consider the MDS and non-MDS cases separately.
Case I [MDS]: Here Vo = E[mt(θ°)mt(θ°)'], and we can use
V̂ = n⁻¹ Σ_{t=1}^n mt(θ̂)mt(θ̂)'.
Assuming the UWLLN for {mt(θ)mt(θ)'}, we can show that V̂ is consistent for Vo = E[mt(θ°)mt(θ°)'].
Case II [non-MDS]: Here Vo is the long-run variance
Vo = Σ_{j=−∞}^{∞} Γ(j),
and we can use a kernel-based estimator
V̂ = Σ_{j=−(n−1)}^{n−1} k(j/p) Γ̂(j),
where k(·) is a kernel function, p is a bandwidth,
Γ̂(j) = n⁻¹ Σ_{t=j+1}^n mt(θ̂)mt−j(θ̂)' for j ≥ 0,
and Γ̂(j) = Γ̂(−j)' for j < 0.
Under suitable conditions (e.g., Newey and West 1994, Andrews 1991), we can show V̂ →p Vo as n → ∞.
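For the non-MDS case, the following minimal sketch illustrates a Bartlett-kernel (Newey-West type) long-run variance estimator applied to a given (n×l) array of moment values; the bandwidth choice and the simulated autocorrelated series are purely illustrative assumptions.

```python
import numpy as np

def long_run_variance(m, p):
    """Bartlett-kernel estimator of Vo from an (n x l) array of moments m_t.

    V^ = Gamma^(0) + sum_{j=1}^{p} k(j/p) [Gamma^(j) + Gamma^(j)'],
    with Bartlett weights k and Gamma^(j) = n^{-1} sum_{t=j+1}^{n} m_t m_{t-j}'.
    """
    n, l = m.shape
    V = m.T @ m / n                          # Gamma^(0)
    for j in range(1, p + 1):
        gamma_j = m[j:].T @ m[:-j] / n       # Gamma^(j)
        weight = 1.0 - j / (p + 1.0)         # Bartlett weight
        V += weight * (gamma_j + gamma_j.T)
    return V

# Illustrative autocorrelated moment series
rng = np.random.default_rng(5)
n, l = 2000, 3
e = rng.standard_normal((n, l))
m = e.copy()
for t in range(1, n):
    m[t] = 0.5 * m[t - 1] + e[t]             # AR(1) dependence across t

print(long_run_variance(m, p=10))
```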
Theorem 8.9 [Asymptotic Variance Estimator for the Optimal GMM Estimator]: Suppose Assumptions 8.1-8.7 hold. Then
Ω̂° ≡ (D̂'V̂⁻¹D̂)⁻¹ →p Ω° as n → ∞.
We now consider testing the null hypothesis
H0: R(θ°) = r.
Remarks: We need J ≤ K: the number of restrictions cannot exceed the number of parameters. We now allow hypotheses with both linear and nonlinear restrictions on θ°.
The basic idea is to check whether R(θ̂) − r is close to 0. By a Taylor series expansion and R(θ°) = r under H0, we have
√n[R(θ̂) − r] = √n[R(θ°) − r] + R'(θ̄)√n(θ̂ − θ°)
= R'(θ̄)√n(θ̂ − θ°)
→d R'(θ°) · N(0, Ω°) ≡ N[0, R'(θ°) Ω° R'(θ°)'],
where θ̄ lies between θ̂ and θ°, i.e., θ̄ = λθ̂ + (1−λ)θ° for some λ ∈ [0,1].
Because R'(θ̄) →p R'(θ°) given continuity of R'(·) and θ̄ − θ° →p 0, and
√n(θ̂ − θ°) →d N(0, Ω°),
we have
√n[R(θ̂) − r] →d N[0, R'(θ°) Ω° R'(θ°)'].
It follows that the Wald statistic
W = n[R(θ̂) − r]'[R'(θ̂) Ω̂° R'(θ̂)']⁻¹[R(θ̂) − r] →d χ²_J.
Theorem 8.10 [Wald Test Statistic]: Suppose Assumptions 8.1-8.7 hold. Then under H0: R(θ°) = r, we have
W = n[R(θ̂) − r]'[R'(θ̂) Ω̂° R'(θ̂)']⁻¹[R(θ̂) − r] →d χ²_J.
Remarks: This can be used for hypothesis testing. This Wald test is built upon an asymp-
totically optimal GMM estimator. One could also construct a Wald test using a consistent but
suboptimal GMM estimator (how?).
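A minimal numerical sketch of the Wald statistic and its χ²_J p-value (all inputs below are hypothetical placeholders), written for the linear restriction R(θ) = Rθ:

```python
import numpy as np
from scipy.stats import chi2

def wald_test(theta_hat, omega_hat, n, R, r):
    """Wald statistic for H0: R theta = r (linear restriction).

    omega_hat estimates the asymptotic variance of sqrt(n)(theta_hat - theta0);
    for the optimal GMM estimator it is (D'V^{-1}D)^{-1}.
    """
    diff = R @ theta_hat - r
    middle = R @ omega_hat @ R.T              # R'(theta) Omega R'(theta)' for linear R
    stat = n * diff @ np.linalg.solve(middle, diff)
    J = len(r)
    return stat, 1.0 - chi2.cdf(stat, df=J)

# Illustrative numbers only
theta_hat = np.array([0.95, 2.1])
omega_hat = np.array([[0.04, 0.01], [0.01, 0.25]])
R = np.array([[1.0, 0.0]])                     # test H0: theta_1 = 1
r = np.array([1.0])
print(wald_test(theta_hat, omega_hat, n=500, R=R, r=r))
```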
Question: How can we check whether the moment condition model is correctly specified?
Answer: We can check correct model specification by testing whether the moment condition
E[mt(θ°)] = 0 for some θ°
holds. The idea is to compute the sample moment
m̂(θ̂) = n⁻¹ Σ_{t=1}^n mt(θ̂)
and see if it is significantly different from zero (the value of the population moment evaluated at the true parameter value θ°). For this purpose, we need to know the asymptotic distribution of √n m̂(θ̂).
By a first order Taylor series expansion,
√n m̂(θ̂) = √n m̂(θ°) + [dm̂(θ̄)/dθ] √n(θ̂ − θ°),
where θ̄ lies between θ̂ and θ°. The asymptotic distribution of √n m̂(θ̂) is therefore contributed from two sources: the sampling randomness of √n m̂(θ°) and the randomness of the parameter estimator θ̂.
Recall that the two-stage GMM estimator is
θ̂ = arg min_{θ∈Θ} m̂(θ)'Ṽ⁻¹m̂(θ),
with the FOC
0 = d[m̂(θ)'Ṽ⁻¹m̂(θ)]/dθ |_{θ=θ̂}.
It is very important to note that Ṽ is not a function of θ, so it has nothing to do with the differentiation with respect to θ. We then have
0 = [dm̂(θ̂)/dθ]' Ṽ⁻¹ √n m̂(θ°) + [dm̂(θ̂)/dθ]' Ṽ⁻¹ [dm̂(θ̄)/dθ] √n(θ̂ − θ°).
It follows that for n sufficiently large, we have
√n(θ̂ − θ°) = −{[dm̂(θ̂)/dθ]' Ṽ⁻¹ [dm̂(θ̄)/dθ]}⁻¹ [dm̂(θ̂)/dθ]' Ṽ⁻¹ √n m̂(θ°).
Hence,
Ṽ^{−1/2} √n m̂(θ̂) = Ṽ^{−1/2} √n m̂(θ°) + Ṽ^{−1/2} [dm̂(θ̄)/dθ] √n(θ̂ − θ°)
= [I − Ṽ^{−1/2}[dm̂(θ̄)/dθ]{[dm̂(θ̂)/dθ]'Ṽ⁻¹[dm̂(θ̄)/dθ]}⁻¹[dm̂(θ̂)/dθ]'Ṽ^{−1/2}] Ṽ^{−1/2} √n m̂(θ°)
≡ Π̂ [Ṽ^{−1/2} √n m̂(θ°)],
where Π̂ →p Π, and
Π = I − Vo^{−1/2} Do (Do'Vo⁻¹Do)⁻¹ Do'Vo^{−1/2}
is an l×l symmetric matrix which is also idempotent (i.e., Π² = Π) with tr(Π) = l − K (why? Use tr(AB) = tr(BA)!).
It follows that under correct model specification, we have
n m̂(θ̂)'Ṽ⁻¹m̂(θ̂) = [Ṽ^{−1/2}√n m̂(θ°)]' Π̂² [Ṽ^{−1/2}√n m̂(θ°)] + oP(1)
→d G'ΠG ~ χ²_{l−K},
where G ~ N(0, I_l), using the fact that for v ~ N(0, I) and a symmetric idempotent matrix Π with rank q, v'Πv ~ χ²_q.
Remarks: The adjustment of the degrees of freedom from l to l − K is due to the impact of the asymptotically optimal parameter estimator θ̂.
Theorem 8.12 [Overidentification Test]: Suppose Assumptions 8.1-8.6 hold, and Ṽ →p Vo as n → ∞. Then under the null hypothesis that E[mt(θ°)] = 0 for some unknown θ°, the test statistic
n m̂(θ̂)'Ṽ⁻¹m̂(θ̂) →d χ²_{l−K}.
Remarks: This is often called the J-test or the test for overidentification in the GMM literature, because it requires l > K. This test can be used to check whether the model characterized as E[mt(θ°)] = 0 is correctly specified.
The result that
n m̂(θ̂)'Ṽ⁻¹m̂(θ̂) →d G'ΠG,
where Π is an idempotent matrix, is due to the fact that θ̂ is an asymptotically optimal GMM estimator that minimizes the objective function n m̂(θ)'Ṽ⁻¹m̂(θ). If a suboptimal GMM estimator is used, we would not obtain the above result; instead, we would need to use a different asymptotic variance estimator to replace Ṽ and obtain an asymptotically χ²_l distribution under correct model specification. Because the critical value of χ²_{l−K} is smaller than that of χ²_l when K > 0, the use of the asymptotically optimal estimator θ̂ leads to an asymptotically more efficient test.
Remarks: When l = K, the exactly identified case, the moment conditions cannot be tested by the asymptotically optimal GMM estimator θ̂, because m̂(θ̂) will be identically zero, no matter whether E[mt(θ°)] = 0 holds.
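A minimal sketch of computing the J statistic and its p-value (inputs are hypothetical placeholders; in practice m̂(θ̂) and Ṽ come from the second stage of the two-stage GMM estimation, e.g. as in the earlier sketch):

```python
import numpy as np
from scipy.stats import chi2

def j_test(m_bar_hat, V_tilde, n, K):
    """Overidentification (J) test: n * m^(theta^)' V~^{-1} m^(theta^) ~ chi2_{l-K}."""
    l = len(m_bar_hat)
    stat = n * m_bar_hat @ np.linalg.solve(V_tilde, m_bar_hat)
    dof = l - K
    return stat, dof, 1.0 - chi2.cdf(stat, df=dof)

# Illustrative values: l = 3 moment conditions, K = 2 parameters
m_bar_hat = np.array([0.010, -0.006, 0.004])   # sample moments at theta^
V_tilde = np.diag([1.1, 0.9, 1.3])             # preliminary variance estimate
print(j_test(m_bar_hat, V_tilde, n=1000, K=2))
```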
Answer: The adjustment of the degrees of freedom (minus K) is due to the impact of the sampling variation of the asymptotically optimal GMM estimator. In other words, the use of an asymptotically optimal GMM estimator θ̂ instead of θ̃ changes the degrees of freedom from l to l − K. Note that if θ̂ is not an asymptotically optimal GMM estimator, the asymptotic distribution of n m̂(θ̂)'Ṽ⁻¹m̂(θ̂) will be changed.
Question: In the J-test, why do we use the preliminary weighting matrix Ṽ, which is evaluated at a preliminary parameter estimator θ̃? Why not use V̂, a consistent estimator for Vo that is evaluated at the asymptotically optimal estimator θ̂?
Answer: With the preliminary matrix Ṽ, the J-test statistic is n times the minimum value of the objective function, the quadratic form in the second stage of GMM estimation. Thus, the value of the test statistic n m̂(θ̂)'Ṽ⁻¹m̂(θ̂) is directly available as a by-product of the second stage GMM estimation. For this reason and for its asymptotic χ² distribution, the J-test is also called the minimum chi-square test.
Question: Could we use V̂ in place of Ṽ?
Answer: Yes. The test statistic n m̂(θ̂)'V̂⁻¹m̂(θ̂) is also asymptotically χ²_{l−K} under correct model specification (please verify!), but this statistic is less convenient to compute than n m̂(θ̂)'Ṽ⁻¹m̂(θ̂), because the latter is the objective function of the second stage GMM estimation. This is analogous to the F-test statistic, which is based on the sums of squared residuals of linear regression models.
Question: Can we replace θ̂ by some suboptimal but consistent GMM estimator θ̃, say?
When
mt(β) = Zt(Yt − Xt'β),
the overidentification test can be used to check the validity of the moment condition
E[Zt(Yt − Xt'β°)] = 0 for some β°.
This is essentially to check whether Zt is a valid instrument vector, that is, whether Zt is orthogonal to εt = Yt − Xt'β°. Put êt = Yt − Xt'β̂_2sls. We can use the following test statistic:
ê'Z(Z'Z)⁻¹Z'ê / (ê'ê/n).
Note that
ê'Z(Z'Z)⁻¹Z'ê = n m̂(β̂_2sls)'Ŵ⁻¹m̂(β̂_2sls)
is n times the value of the objective function of the GMM minimization with the choice Ŵ = Z'Z/n, which is an optimal choice when {mt(β°)} is an MDS with conditional homoskedasticity (i.e., E(εt²|Zt) = σ²). In this case,
(ê'ê/n)(Z'Z/n) →p σ² Qzz = Vo.
It follows that the test statistic
ê'Z(Z'Z)⁻¹Z'ê / (ê'ê/n) →d χ²_{l−K}
under the null hypothesis that E(εt|Zt) = 0 for some β°.
Corollary 8.13: Suppose Assumptions 7.1-7.4, 7.6 and 7.7 hold, and l > K. Then under the null hypothesis that E(εt|Zt) = 0, the test statistic
ê'Z(Z'Z)⁻¹Z'ê / (ê'ê/n) →d χ²_{l−K},
where ê = Y − Xβ̂_2sls.
In fact, the overidentification test statistic is equal to nR²_uc, where R²_uc is the uncentered R² from the auxiliary regression
êt = γ'Zt + wt.
It can also be shown that under the null hypothesis of E(εt|Zt) = 0, nR²_uc is asymptotically equivalent to nR², where R² is the centered R² of regressing êt on Zt, in the sense that nR²_uc = nR² + oP(1). This provides a convenient way to calculate the test statistic. However, it is important to emphasize that this convenient procedure is asymptotically valid only when E(εt²|Zt) = σ².
To conclude this chapter: many economic theories and hypotheses can be formulated as a moment condition
E[mt(θ°)] = 0,
where mt(θ) is an l×1 moment function. This moment condition can be used to estimate the model parameter θ° via the so-called GMM estimation method. The GMM estimator is defined as
θ̂ = arg min_{θ∈Θ} m̂(θ)'Ŵ⁻¹m̂(θ),
where
m̂(θ) = n⁻¹ Σ_{t=1}^n mt(θ).
Under regularity conditions, θ̂ →p θ° and
√n(θ̂ − θ°) →d N(0, Ω),
where
Ω = (Do'W⁻¹Do)⁻¹ Do'W⁻¹ Vo W⁻¹ Do (Do'W⁻¹Do)⁻¹.
The asymptotic variance of the GMM estimator θ̂ depends on the choice of the weighting matrix W. An asymptotically most efficient GMM estimator is obtained by choosing W = Vo ≡ avar[√n m̂(θ°)]. In this case, the asymptotic variance of the GMM estimator is
Ω° = (Do'Vo⁻¹Do)⁻¹,
which is the minimum variance. This is similar in spirit to the GLS estimator in a linear regression model. This suggests a two-stage asymptotically optimal GMM estimator θ̂: first, one can obtain a consistent but suboptimal GMM estimator θ̃ by choosing some convenient weighting matrix W̃; then one can use θ̃ to construct a consistent estimator Ṽ for Vo and use it as the weighting matrix to obtain the second-stage GMM estimator θ̂.
To construct confidence interval estimators and hypothesis tests, one has to obtain consistent asymptotic variance estimators for GMM estimators. A consistent asymptotic variance estimator for an asymptotically optimal GMM estimator is
Ω̂° = (D̂'V̂⁻¹D̂)⁻¹,
where
D̂ = n⁻¹ Σ_{t=1}^n dmt(θ̂)/dθ,
and the construction of V̂ depends on the properties of {mt(θ°)}, particularly on whether {mt(θ°)} is an ergodic stationary MDS process.
Suppose a two-stage asymptotically optimal GMM estimator is used. Then the associated Wald test statistic for the null hypothesis
H0: R(θ°) = r
is given by
W = n[R(θ̂) − r]'[R'(θ̂)(D̂'V̂⁻¹D̂)⁻¹R'(θ̂)']⁻¹[R(θ̂) − r] →d χ²_J.
The moment condition E[mt(θ°)] = 0 also provides a basis to check whether an economic theory or economic model is correctly specified. This can be done by checking whether the sample moment m̂(θ̂) is close to zero. A popular model specification test in the GMM framework is the J-test statistic
n m̂(θ̂)'Ṽ⁻¹m̂(θ̂) →d χ²_{l−K}
under correct model specification, where θ̂ is an asymptotically optimal GMM estimator (question: what will happen if a consistent but suboptimal GMM estimator is used?). This is also called the overidentification test. The J-test statistic n m̂(θ̂)'Ṽ⁻¹m̂(θ̂) is rather convenient to compute, because it is the objective function of the GMM estimation.
GMM provides a convenient unified framework to view most econometric estimators. In other words, most econometric estimators can be viewed as special cases of the GMM framework with suitable choices of moment function and weighting matrix. In particular, the OLS and 2SLS estimators are special cases of the class of GMM estimators.
EXERCISES
8.1. A generalized method of moments (GMM) estimator is defined as
θ̂ = arg min_{θ∈Θ} m̂(θ)'Ŵ⁻¹m̂(θ),
where
m̂(θ) = n⁻¹ Σ_{t=1}^n m(Zt, θ),
and the moment condition is
E[m(Zt, θ°)|Z^{t−1}] = 0,
where Z^{t−1} = {Zt−1, Zt−2, ..., Z1} is the information available at time t−1. Suppose that
sup_{θ∈Θ} || m̂'(θ) − m'(θ) || →p 0,
where m̂'(θ) = (d/dθ)m̂(θ) and m'(θ) = (d/dθ)E[m(Zt, θ)] = E[∂m(Zt, θ)/∂θ];
Assumption 1.4: √n m̂(θ°) →d N(0, Vo) for some finite and positive definite matrix Vo;
Assumption 1.5: Ŵ →p W, where W is a finite and positive definite matrix.
From these assumptions, one can show that θ̂ →p θ°, and this result can be used in answering the following questions in parts (a)-(d). Moreover, you can make additional assumptions if you feel it appropriate and necessary.
(d) Find the optimal choice of Ŵ. Explain why your choice of Ŵ is optimal.
8.2. (a) Show that the 2SLS estimator β̂_2sls for the parameter β° in the regression model Yt = Xt'β° + εt is a special case of the GMM estimator with suitable choices of moment function mt(β) and weighting matrix Ŵ.
(b) Assume that {Zt εt} is a stationary ergodic process and other regularity conditions hold. Compare the relative efficiency between an asymptotically optimal GMM estimator (with the optimal choice of the weighting matrix) and β̂_2sls under conditional homoskedasticity and conditional heteroskedasticity respectively.
8.3. Use a suboptimal GMM estimator θ̂ with a given weighting matrix Ŵ →p W to construct a Wald test statistic for the null hypothesis H0: Rθ° = r, and justify your reasoning. Assume all necessary regularity conditions hold.
8.4. Suppose that {mt(θ)} is an ergodic stationary MDS process, where mt(θ) is continuous on a compact parameter set Θ, {mt(θ)mt(θ)'} follows a uniform weak law of large numbers, and Vo = E[mt(θ°)mt(θ°)'] is finite and nonsingular. Let V̂ = n⁻¹ Σ_{t=1}^n mt(θ̂)mt(θ̂)', where θ̂ is a consistent estimator of θ°. Show V̂ →p Vo.
8.5. Suppose V̂ is a consistent estimator for Vo = avar[√n m̂(θ°)]. Show that replacing Ṽ by V̂ has no impact on the asymptotic distribution of the overidentification test statistic, that is, show
n m̂(θ̂)'Ṽ⁻¹m̂(θ̂) − n m̂(θ̂)'V̂⁻¹m̂(θ̂) →p 0.
8.6. Suppose θ̃ is a suboptimal but consistent GMM estimator. Could we simply replace θ̂ by θ̃ and still obtain the asymptotic χ²_{l−K} distribution for the overidentification test statistic? Give your reasoning. Assume all necessary regularity conditions hold.
8.7. Suppose Assumptions 7.1-7.4, 7.6 and 7.7 hold. To test the null hypothesis that E(εt|Zt) = 0, where Zt is an l×1 instrumental vector, one can consider the auxiliary regression
êt = γ'Zt + wt.
Consider the nonlinear regression model
Yt = g(Xt, θ°) + εt,
where θ° is an unknown K×1 parameter vector and E(εt|Xt) = 0 a.s. Assume that g(Xt, θ) is twice continuously differentiable with respect to θ, with the K×K matrices E[∂g(Xt, θ)/∂θ · ∂g(Xt, θ)/∂θ'] and E[∂²g(Xt, θ)/∂θ∂θ'] finite and nonsingular for all θ ∈ Θ.
The nonlinear least squares (NLS) estimator solves the minimization of the sum of squared residuals problem
θ̂ = arg min_{θ∈Θ} Σ_{t=1}^n [Yt − g(Xt, θ)]².
The FOC is
D(θ̂)'[Y − g(θ̂)] = 0,
where g(θ) = (g(X1, θ), ..., g(Xn, θ))' and D(θ) is an n×K matrix, with the t-th row being ∂g(Xt, θ)/∂θ'. This FOC can be viewed as the FOC
m̂(θ̂) = 0
in an exact identification case (l = K). Generally, there exists no closed form expression for θ̂. Assume all necessary regularity conditions hold.
(a) Show that θ̂ →p θ° as n → ∞.
(b) Derive the asymptotic distribution of √n(θ̂ − θ°).
(c) What is the asymptotic variance of √n(θ̂ − θ°) if {∂g(Xt, θ°)/∂θ · εt} is an MDS with conditional homoskedasticity (i.e., E(εt²|Xt) = σ² a.s.)? Give your reasoning.
(d) What is the asymptotic variance of √n(θ̂ − θ°) if {∂g(Xt, θ°)/∂θ · εt} is an MDS with conditional heteroskedasticity (i.e., E(εt²|Xt) ≠ σ² a.s.)? Give your reasoning.
(e) Suppose {∂g(Xt, θ°)/∂θ · εt} is an MDS with conditional homoskedasticity (i.e., E(εt²|Xt) = σ² a.s.). Construct a test for the null hypothesis H0: R(θ°) = r, where R(·) is a J×1 vector of restriction functions such that R'(θ°) = ∂R(θ°)/∂θ' is a J×K matrix with full rank J ≤ K, and r is a J×1 nonstochastic vector.
Consider the model
Yt = g(Xt, θ°) + εt,
where g(Xt, θ) is twice continuously differentiable with respect to θ, E(εt|Xt) ≠ 0 but E(εt|Zt) = 0, where Yt is a scalar, Xt is a K×1 vector and Zt is an l×1 vector with l ≥ K. Suppose {Yt, Xt', Zt'}'_{t=1}^n is a stationary ergodic process, and {Zt εt} is an MDS.
The unknown parameter θ° can be consistently estimated based on the moment condition
E[mt(θ°)] = 0,
where mt(θ) = Zt[Yt − g(Xt, θ)]. Suppose a nonlinear IV estimator θ̂ solves the minimization problem
θ̂ = arg min_{θ∈Θ} m̂(θ)'Ŵ⁻¹m̂(θ),
where m̂(θ) = n⁻¹ Σ_{t=1}^n Zt[Yt − g(Xt, θ)], and Ŵ →p W, a finite and positive definite matrix.
(a) Show θ̂ →p θ°.
8.10. Consider testing the hypothesis of interest H0: R(θ°) = r under the GMM framework, where R(·) is a J×1 vector of restriction functions, r is a J×1 nonstochastic vector, and R'(θ°) is a J×K matrix with full rank J, where J ≤ K. We can construct a Lagrange multiplier test based on the Lagrange multiplier λ̂, where λ̂ is the optimal solution of the following constrained GMM minimization problem:
(θ̂, λ̂) = arg min_{θ∈Θ, λ∈R^J} { m̂(θ)'Ṽ⁻¹m̂(θ) + λ'[r − R(θ)] },
where Ṽ is a preliminary consistent estimator for Vo = avar[√n m̂(θ°)] that does not depend on θ. Construct the LM test statistic and derive its asymptotic distribution. Assume all regularity conditions hold.
CHAPTER 9 MAXIMUM LIKELIHOOD
ESTIMATION AND QUASI-MAXIMUM
LIKELIHOOD ESTIMATION
Abstract: Conditional distribution models have been widely used in economics and finance. In this chapter, we introduce two closely related popular methods to estimate conditional probability distribution models: Maximum Likelihood Estimation (MLE) and Quasi-MLE (QMLE). MLE is a parameter estimator that maximizes the model likelihood function of the random sample when the conditional probability distribution model is correctly specified, and QMLE is a parameter estimator that maximizes the model likelihood function of the random sample when the conditional probability distribution model is misspecified. Because the score function is an MDS process and the dynamic information matrix equality holds when a conditional distribution model is correctly specified, the asymptotic properties of the MLE are analogous to those of the OLS estimator when the regression disturbance is an MDS with conditional homoskedasticity, and we can use the Wald test, Lagrange Multiplier test and Likelihood Ratio test for hypothesis testing, where the Likelihood Ratio test is analogous to the J·F test statistic. On the other hand, when the conditional distribution model is misspecified, the score function has mean zero, but it may no longer be an MDS process and the dynamic information matrix equality may fail. As a result, the asymptotic properties of the QMLE are analogous to those of the OLS estimator when the regression disturbance displays serial correlation and conditional heteroskedasticity. Robust Wald tests and Lagrange Multiplier tests can be constructed for hypothesis testing, but the Likelihood Ratio test can no longer be used, for a reason similar to the failure of the F-test statistic when the regression disturbance displays conditional heteroskedasticity and serial correlation. We discuss methods to test the MDS property of the score function, the dynamic information matrix equality, and correct specification of the entire conditional distribution model. Some empirical applications are considered.
Key words: ARMA model, Censored data, Conditional probability distribution model, Discrete choice model, Dynamic information matrix test, GARCH model, Hessian matrix, Information matrix equality, Information matrix test, Lagrange multiplier test, Likelihood, Likelihood ratio test, Martingale, MLE, Pseudo likelihood function, QMLE, Score function, Truncated data, Wald test.
9.1 Motivation
So far we have focused on econometric models for the conditional mean or conditional expectation, either linear or nonlinear. When do we need to model the conditional probability distribution of Yt given Xt?
We first provide a number of economic examples which call for the use of a conditional probability distribution model.
In financial risk management, how to quantify extreme downside market risk has been an important issue. Let It−1 = (Yt−1, Yt−2, ..., Y1) be the information set available at time t−1, where Yt is the return on a portfolio in period t. Suppose
Yt = μt(θ°) + εt = μt(θ°) + σt(θ°)zt,
where μt(θ°) = E(Yt|It−1), σt²(θ°) = var(Yt|It−1), {zt} is an i.i.d. sequence with E(zt) = 0, var(zt) = 1, and pdf fz(·|θ°). An example is {zt} ~ i.i.d. N(0,1).
The value at risk (VaR), Vt(α) = V(α, It−1), at the significance level α ∈ (0,1), is defined by
P(Yt < Vt(α)|It−1) = α.
Intuitively, VaR is the threshold that the actual loss will exceed with probability α. Given that Yt = μt + σt zt, where for simplicity we have put μt = μt(θ°) and σt = σt(θ°), we have
α = P(μt + σt zt < Vt(α)|It−1)
= P(zt < (Vt(α) − μt)/σt | It−1)
= Fz((Vt(α) − μt)/σt),
where the last equality follows from the independence assumption on {zt}. It follows that
(Vt(α) − μt)/σt = C(α),
that is,
Vt(α) = μt + σt C(α),
where C(α) is the α-quantile of zt, determined by
P[zt < C(α)] = α, or ∫_{−∞}^{C(α)} fz(z|θ°) dz = α.
For example, for zt ~ N(0,1), C(0.05) = −1.65 and C(0.01) = −2.33.
J.P. Morgan's RiskMetrics uses a simple conditionally normal distribution model for asset returns:
Yt = σt zt,
σt² = (1 − λ) Σ_{j=1}^{t−1} λ^{j−1} Yt−j²,   0 < λ < 1,
{zt} ~ i.i.d. N(0,1).
Here, the conditional probability distribution of Yt given It−1 is N(0, σt²), from which we can obtain Vt(0.05) = −1.65 σt.
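As a small numerical illustration (the return series and the smoothing parameter λ = 0.94, a conventional RiskMetrics choice, are assumptions for the sketch), the code below computes σt² with the equivalent recursive EWMA form and the corresponding 5% VaR, Vt(0.05) = −1.65 σt.

```python
import numpy as np

def riskmetrics_var(returns, lam=0.94, quantile=-1.65):
    """RiskMetrics-style 5% VaR: V_t = -1.65 * sigma_t, with
    sigma_t^2 = lam * sigma_{t-1}^2 + (1 - lam) * Y_{t-1}^2 (recursive EWMA form)."""
    sigma2 = np.empty(len(returns))
    sigma2[0] = returns.var()                   # initialize with the sample variance
    for t in range(1, len(returns)):
        sigma2[t] = lam * sigma2[t - 1] + (1 - lam) * returns[t - 1] ** 2
    return quantile * np.sqrt(sigma2)           # V_t(0.05) = -1.65 * sigma_t

rng = np.random.default_rng(7)
y = 0.01 * rng.standard_normal(750)             # hypothetical daily returns
var_5pct = riskmetrics_var(y)
print(var_5pct[-5:])                             # most recent VaR values (as returns)
```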
As another example, suppose Yt is a binary variable with
P(Yt = 1|Xt) = F(Xt'β°),
where
F(u) = 1/(1 + exp(−u)),   −∞ < u < ∞.
This is the so-called logistic regression model. This model is useful for modeling (e.g.) credit default risk and currency crises.
An economic interpretation for the binary outcome Yt is a story of a latent variable process. Define
Yt = 1 if Yt* ≤ c, and Yt = 0 if Yt* > c,
where c is a constant, the latent variable
Yt* = Xt'β° + εt,
and F(·) is the CDF of the i.i.d. error term εt. If {εt} ~ i.i.d. N(0, σ²) and c = 0, the resulting model is called a probit model. If {εt} ~ i.i.d. Logistic(0, σ²) and c = 0, the resulting model is called a logit model. The latent variable could be the actual economic decision process. For example, Yt* can be the credit score and c the threshold with which a lending institution makes its decision on loan approvals.
This model can be extended to the multinomial model, where Yt takes multiple discrete integer values instead of only two values.
Suppose we are interested in the time it takes for an unemployed person to find a job, the time that elapses between two trades or two price changes, the length of a strike, the length of time before a cancer patient dies, or the length of time before a financial crisis (e.g., a credit default) breaks out. Such analysis is called duration analysis or survival analysis.
In practice, the main interest often lies in the question of how long the duration of an economic event will continue, given that it has not finished yet. An important concept called the hazard rate measures the chance that the duration will end now, given that it has not ended before. This hazard rate can therefore be interpreted as the chance to find a job, to trade, to end a strike, etc.
Suppose Yt is a duration from a population with probability density function f(y) and probability distribution function F(y). Then the survival function is defined as
S(y) = P(Yt > y) = 1 − F(y),
and the hazard rate is defined as
λ(y) = lim_{δ→0+} P(y < Yt ≤ y + δ | Yt > y)/δ
= lim_{δ→0+} P(y < Yt ≤ y + δ)/[δ P(Yt > y)]
= f(y)/S(y)
= −(d/dy) ln S(y).
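For intuition, the short sketch below (illustrative values only) evaluates λ(y) = f(y)/S(y) for an exponential duration, for which the hazard rate is constant.

```python
import numpy as np

def hazard_rate(pdf, survival, y):
    """Hazard rate lambda(y) = f(y) / S(y)."""
    return pdf(y) / survival(y)

lam = 0.5                                       # exponential duration with mean 2
f = lambda y: lam * np.exp(-lam * y)            # density f(y)
S = lambda y: np.exp(-lam * y)                  # survival S(y) = P(Y > y)

y_grid = np.array([0.5, 1.0, 5.0, 10.0])
print(hazard_rate(f, S, y_grid))                # constant hazard equal to lam
```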
Suppose the hazard rate of individual t takes the form
λt(y) = λ0(y) exp(Xt'β),
where λ0(y) is called the baseline hazard rate. This specification is called the proportional hazard model, proposed by Cox (1972). The parameter
β = ∂ ln λt(y)/∂Xt = [1/λt(y)] ∂λt(y)/∂Xt
is the marginal relative effect of Xt on the hazard rate of individual t. The survival function of the proportional hazard model is
St(t) = [S0(t)]^{exp(Xt'β)},
where S0(t) is the survival function associated with the baseline hazard rate λ0(t).
The probability density function of Yt given Xt is then ft(y) = λt(y)St(y). To estimate the parameter β, we need to use the maximum likelihood estimation (MLE) method, which will be introduced below.
Suppose we have a sequence of tick-by-tick financial data {Pi, ti}, where Pi is the price traded at time ti, and i is the index for the i-th price change. Define the time interval between price changes
Yi = ti − ti−1,   i = 1, ..., n.
Engle and Russell (1998) propose a class of autoregressive conditional duration (ACD) models:
Yi = ψi(θ°) zi,
ψi(θ°) = E(Yi|Ii−1),
{zi} ~ i.i.d. EXP(1),
where Ii−1 is the information set available at time ti−1. Here, ψi = ψi(θ°) is called the conditional expected duration given Ii−1. A model for ψi is
ψi = ω + βψi−1 + γYi−1,
where θ = (ω, β, γ)'.
From this model, we can write down the model-implied conditional probability density of Yi given Ii−1:
f(y|Ii−1) = (1/ψi) exp(−y/ψi),   y > 0.
From this conditional density, we can compute the conditional intensity of Yi (i.e., the instantaneous probability that the next price change will occur at time ti), which is important for (e.g.) option pricing.
Example 5 [Continuous-time Diffusion Models]: The dynamics of the spot interest rate Yt are fundamental to pricing fixed income securities. Consider a diffusion model for the spot interest rate,
dYt = μ(Yt, θ°)dt + σ(Yt, θ°)dWt,
where μ(Yt, θ°) is the drift model, σ(Yt, θ°) is the diffusion (or volatility) model, θ° is an unknown K×1 parameter vector, and Wt is the standard Brownian motion. Note that the time t is a continuous variable here.
Continuous-time models have been rather popular in mathematical finance and financial engineering. First, financial economists hold the belief that the informational flow into financial markets is continuous in time. Second, the mathematical treatment of derivative pricing is elegant when a continuous-time model is used.
Two leading examples are the random walk model with drift,
dYt = μ dt + σ dWt,
and the Cox, Ingersoll and Ross (1985) square-root model,
dYt = (α + βYt)dt + σ Yt^{1/2} dWt.
These diffusion models are important for hedging, derivatives pricing and financial risk management.
Question: How can we estimate the model parameters of a diffusion model using discretely sampled data {Yt}_{t=1}^n?
Given μ(Yt, θ) and σ(Yt, θ), we can determine the conditional probability density f_{Yt|It−1}(yt|It−1, θ) of Yt given It−1. Thus, we can estimate θ° by the maximum likelihood estimation (MLE) or asymptotically equivalent methods using discretely observed data. For the random walk model, the conditional pdf of Yt given It−1 is
f(y|It−1, θ) = (2πσ²t)^{−1/2} exp(−(y − μt)²/(2σ²t)).
For the Cox, Ingersoll and Ross (1985) model, the conditional pdf of Yt given It−1 also has a closed form (a noncentral chi-square density).
It may be noted that many continuous-time diffusion models do not have a closed form expression for their conditional pdf, which makes MLE estimation infeasible. Methods have been proposed in the literature to obtain accurate approximations to the conditional pdf so that MLE becomes feasible.
Consider a random sample Z^n = (Z1', ..., Zn')', where Zt = (Yt, Xt')'. A realization of Z^n is a data set, denoted as z^n = (z1', ..., zn')'. A random sample Z^n can generate many realizations (i.e., data sets).
All information in Z^n is completely described by its joint probability density function (pdf) or probability mass function (pmf) f_{Z^n}(z^n). [For discrete random variables, we have f_{Z^n}(z^n) = P(Z^n = z^n).] By sequential partitioning (repeatedly using the multiplication rule that P(A∩B) = P(A|B)P(B) for any two events A and B), we have
f_{Z^n}(z^n) = Π_{t=1}^n f_{Zt|Z^{t−1}}(zt|z^{t−1}),
where Z^{t−1} = (Z'_{t−1}, Z'_{t−2}, ..., Z'_1)' and f_{Zt|Z^{t−1}}(zt|z^{t−1}) is the conditional pdf of Zt given Z^{t−1}. Also, given Zt = (Yt, Xt')' and using the formula that P(A∩B|C) = P(A|B∩C)P(B|C) for any events A, B and C, we have
f_{Zt|Z^{t−1}}(zt|z^{t−1}) = f_{Yt|Ψt}(yt|ψt) f_{Xt|Z^{t−1}}(xt|z^{t−1}),
where
Ψt = (Xt', Z^{t−1}')',
an extended information set which contains not only the past history Z^{t−1} but also the current Xt. It follows that
f_{Z^n}(z^n) = Π_{t=1}^n f_{Yt|Ψt}(yt|ψt) f_{Xt|Z^{t−1}}(xt|z^{t−1})
= Π_{t=1}^n f_{Yt|Ψt}(yt|ψt) · Π_{t=1}^n f_{Xt|Z^{t−1}}(xt|z^{t−1}).
Often, the interest is in modeling the conditional distribution of Yt given Ψt = (Xt', Z^{t−1}')'.
Case I [Cross-Sectional Observations]: Suppose {Zt} is i.i.d. Then f_{Yt|Ψt}(yt|xt, z^{t−1}) = f_{Yt|Xt}(yt|xt) and f_{Xt|Z^{t−1}}(xt|z^{t−1}) = f_{Xt}(xt). It follows that
f_{Z^n}(z^n) = Π_{t=1}^n f_{Yt|Xt}(yt|xt) · Π_{t=1}^n f_{Xt}(xt).
Case II [Univariate Time Series Analysis]: Suppose Xt does not exist, namely Zt = Yt. Then Ψt = (Xt', Z^{t−1}')' = Z^{t−1} = (Yt−1, ..., Y1)', and as a consequence,
f_{Z^n}(z^n) = Π_{t=1}^n f_{Yt|Y^{t−1}}(yt|y^{t−1}).
Suppose we use a parametric model f_{Yt|Ψt}(yt|ψt, θ) for the conditional distribution of Yt given Ψt, and a parametric model f_{Xt|Z^{t−1}}(xt|z^{t−1}, γ) for the conditional distribution of Xt given Z^{t−1}. Then the log-likelihood decomposes as
ln f_{Z^n}(z^n) = Σ_{t=1}^n ln f_{Yt|Ψt}(yt|ψt, θ) + Σ_{t=1}^n ln f_{Xt|Z^{t−1}}(xt|z^{t−1}, γ).
If we are interested in using the extended information set Ψt = (Xt', Z^{t−1}')' to predict the distribution of Yt, then θ is called the parameter of interest, and γ is called the nuisance parameter. In this case, to estimate θ, we only need to focus on modeling the conditional pdf/pmf f_{Yt|Ψt}(y|ψt, θ). This follows because the second part of the likelihood function does not depend on θ, so that the maximization of ln f_{Z^n}(z^n) with respect to θ is equivalent to the maximization of the first part of the likelihood with respect to θ.
We now introduce various conditional distribution models. For simplicity, we only consider i.i.d. observations, so that f_{Yt|Ψt}(y|ψt, θ) = f_{Yt|Xt}(y|Xt, θ).
Example 1 [Linear Regression Model with Normal Errors]: Suppose Zt = (Yt, Xt')' is i.i.d., Yt = Xt'β° + εt, where εt|Xt ~ N(0, σo²). Then the conditional pdf of Yt given Xt is
f_{Yt|Xt}(y|x, θ) = (2πσ²)^{−1/2} exp(−(y − x'β)²/(2σ²)),
where θ = (β', σ²)'. This is the classical linear regression model discussed in Chapter 3.
Example 2 [Logit Model]: Suppose Zt = (Yt, Xt')' is i.i.d., Yt is a binary random variable taking either value 1 or value 0, and
P(Yt = yt|Xt) = Λ(Xt'β°) if yt = 1, and 1 − Λ(Xt'β°) if yt = 0,
where
Λ(u) = 1/(1 + exp(−u)),   −∞ < u < ∞,
is the CDF of the logistic distribution. We have
f_{Yt|Xt}(yt|Xt, β) = [Λ(Xt'β)]^{yt}[1 − Λ(Xt'β)]^{1−yt},   yt = 0, 1.
Example 3 [Probit Model]: Suppose Zt = (Yt, Xt')' is i.i.d., and Yt is a binary random variable such that
P(Yt = yt|Xt) = Φ(Xt'β°) if yt = 1, and 1 − Φ(Xt'β°) if yt = 0,
where Φ(·) is the CDF of the N(0,1) distribution. We have
f_{Yt|Xt}(yt|Xt, β) = [Φ(Xt'β)]^{yt}[1 − Φ(Xt'β)]^{1−yt},   yt = 0, 1.
There are wide applications of the logit and probit models. For example, a consumer chooses a particular brand of car; a student decides whether to go on to PhD study, etc.
Example 4 [Censored Regression (Tobit) Models]: A dependent variable Yt is called censored when the response Yt* cannot take values below (left censored) or above (right censored) a certain threshold value. For example, investment can only be zero or positive (when no borrowing is allowed). Censored data are mixed continuous-discrete. Suppose the data generating process is
Yt* = Xt'β° + εt,
where {εt} ~ i.i.d. N(0, σo²). When Yt* > c, we observe Yt = Yt*. When Yt* ≤ c, we only have the record Yt = c. The parameter β° should not be estimated by regressing Yt on Xt using the subsample with Yt > c, because the data with Yt = c contain relevant information about β° and σo². More importantly, in the subsample with Yt > c, εt has a truncated distribution with nonzero mean (i.e., E(εt|Yt > c) ≠ 0 and E(Xt εt|Yt > c) ≠ 0). Therefore, OLS is not consistent for β° if one only uses the subsample consisting of observations with Yt > c and throws away observations with Yt = c.
Question: How can we estimate β° given an observed sample {Yt, Xt'}_{t=1}^n where some observations of Yt are censored? Suppose Zt = (Yt, Xt')' is i.i.d., with the observed dependent variable
Yt = Yt* if Yt* > c, and Yt = c if Yt* ≤ c,
that is,
Yt = max(Yt*, c) = max(Xt'β° + εt, c).
where Φ(·) is the N(0,1) CDF, and the second part is the conditional probability
P(Yt = c|Xt) = P(Yt* ≤ c|Xt)
= P(εt ≤ c − Xt'β°|Xt)
= P(εt/σo ≤ (c − Xt'β°)/σo | Xt)
= Φ((c − Xt'β°)/σo),
given (εt/σo)|Xt ~ N(0,1).
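Putting the two pieces together, the following minimal sketch (hypothetical data and names) evaluates the censored-regression log-likelihood: uncensored observations contribute the normal log-density of Yt − Xt'β, while censored observations contribute ln Φ((c − Xt'β)/σ).

```python
import numpy as np
from scipy.stats import norm

def tobit_loglik(params, y, X, c=0.0):
    """Log-likelihood of the censored (Tobit) regression with left-censoring at c."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)                    # enforce sigma > 0
    xb = X @ beta
    censored = y <= c                            # observations recorded at the bound
    ll_uncensored = norm.logpdf(y[~censored], loc=xb[~censored], scale=sigma)
    ll_censored = norm.logcdf((c - xb[censored]) / sigma)
    return ll_uncensored.sum() + ll_censored.sum()

# Hypothetical censored data
rng = np.random.default_rng(8)
n = 1000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y_star = X @ np.array([0.5, 1.0]) + rng.standard_normal(n)
y = np.maximum(y_star, 0.0)                      # left-censored at c = 0
print(tobit_loglik(np.array([0.5, 1.0, 0.0]), y, X))
```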
Question: Can you give some examples where this model can be applied?
One example is a survey on unemployment spells: at the terminal date of the survey, the recorded spell of a still-unemployed worker understates how long his layoff will actually last. Another example is a survey on cancer patients: those who have survived up to the ending date of the survey will usually live longer than the survival duration recorded.
Example [Truncated Regression Model]: Suppose
Yt* = Xt'β° + εt,
where εt|Xt ~ i.i.d. N(0, σo²). Suppose only those Yt* whose values are larger than or equal to a constant c are observed, where c is known. That is, we observe Yt = Yt* if and only if Yt* = Xt'β° + εt ≥ c. The observations with Yt* < c are not recorded. Assume the resulting sample is {Yt, Xt'}_{t=1}^n, where {Yt, Xt'} is i.i.d. We now analyze the effect of truncation for this model. For the observed sample, Yt ≥ c, and so εt comes from the truncated version of the distribution N(0, σo²) with εt ≥ c − Xt'β°. It follows that E(Xt εt|Yt ≥ c) ≠ 0, and therefore the OLS estimator based on the observed sample {Yt, Xt'} is not consistent.
Because the observation Yt is recorded if and only if Yt* ≥ c, the conditional probability distribution of Yt given Xt is the same as the probability distribution of Yt* given Xt and Yt* > c. Hence, for any observed sample point (yt, xt), we have
f_{Yt|Xt}(yt|xt, θ) = f_{Yt*|Xt, Yt*>c}(yt|xt, Yt* > c)
= f_{Yt*|Xt}(yt|xt) / P(Yt* > c|xt)
= (2πσ²)^{−1/2} exp(−(yt − xt'β)²/(2σ²)) · [P(Yt* > c|xt)]⁻¹,
where θ = (β', σ²)', and the conditional probability is
P(Yt* > c|xt) = 1 − Φ((c − xt'β)/σ).
Question: Can you give some examples where this model can be applied?
Example 6 [Loan Applications]: Only those loan applications which are successful will be recorded.
Definition 9.1 [Likelihood Function]: The joint pdf/pmf of the random sample Z^n = (Z1, Z2, ..., Zn), viewed as a function of (θ, γ),
Ln(θ, γ; z^n) = f_{Z^n}(z^n; θ, γ),
is called the likelihood function of Z^n when z^n is observed. Moreover, ln Ln(θ, γ; z^n) is called the log-likelihood function of Z^n when z^n is observed.
Remarks:
The likelihood function Ln(θ, γ; z^n) is algebraically identical to the joint probability density function f_{Z^n}(z^n; θ, γ) of the random sample Z^n taking value z^n. Thus, given (θ, γ), Ln(θ, γ; z^n) can be viewed as a measure of the probability or likelihood with which the observed sample z^n would occur.
Lemma 9.1 [Variation-Free Parameter Spaces]: Suppose θ and γ are variation-free over the parameter spaces Θ and Γ, in the sense that for all (θ, γ) ∈ Θ × Γ, we have
Ln(θ, γ; z^n) = Π_{t=1}^n f_{Yt|Ψt}(yt|ψt, θ) · Π_{t=1}^n f_{Xt|Z^{t−1}}(xt|z^{t−1}, γ),
or equivalently
ln Ln(θ, γ; z^n) = Σ_{t=1}^n ln f_{Yt|Ψt}(yt|ψt, θ) + Σ_{t=1}^n ln f_{Xt|Z^{t−1}}(xt|z^{t−1}, γ).
Suppose we are interested in predicting Yt using the extended information set Ψt = (Xt', Z^{t−1}')'. Then only the first part of the log-likelihood is relevant, and θ is called the parameter of interest. The other parameter γ, appearing in the second part of the log-likelihood function, is called the nuisance parameter.
Define
$$\hat\theta = \arg\max_{\theta\in\Theta} \prod_{t=1}^n f_{Y_t|\Psi_t}(Y_t \mid \Psi_t; \theta) = \arg\max_{\theta\in\Theta} \sum_{t=1}^n \ln f_{Y_t|\Psi_t}(Y_t \mid \Psi_t; \theta),$$
where $\Theta$ is a parameter space. When the conditional probability distribution model $f_{Y_t|\Psi_t}(y \mid \Psi_t; \theta)$ is correctly specified in the sense that there exists some parameter value $\theta \in \Theta$ such that $f_{Y_t|\Psi_t}(y \mid \Psi_t; \theta)$ coincides with the true conditional distribution of $Y_t$ given $\Psi_t$, then $\hat\theta$ is called the maximum likelihood estimator (MLE); when $f_{Y_t|\Psi_t}(y \mid \Psi_t; \theta)$ is misspecified in the sense that there exists no parameter value $\theta \in \Theta$ such that $f_{Y_t|\Psi_t}(y \mid \Psi_t; \theta)$ coincides with the true conditional distribution of $Y_t$ given $\Psi_t$, $\hat\theta$ is called the quasi-maximum likelihood estimator (QMLE).
Remarks:
By the nature of the objective function, the MLE gives a parameter estimate which makes the observed sample $z^n$ most likely to occur. By choosing a suitable parameter estimate $\hat\theta \in \Theta$, MLE maximizes the probability that $Z^n = z^n$, that is, the probability that the random sample $Z^n$ takes the value of the observed data $z^n$. Note that the MLE and QMLE may not be unique.

The MLE is obtained over $\Theta$, where $\Theta$ may be subject to some restrictions. An example is the GARCH model, where some parameters have to be restricted in order to ensure that the estimated conditional variance is nonnegative (e.g., Nelson and Cao 1992).
Under regularity conditions, we can characterize the MLE by a first order condition. However, like the GMM estimator, the MLE $\hat\theta$ usually has no closed form; the solution $\hat\theta$ has to be searched for numerically. The most popular methods used in economics are BHHH and Gauss-Newton.
Suppose the likelihood function is continuous in $\theta \in \Theta$ and the parameter space $\Theta$ is compact. Then a global maximizer $\hat\theta \in \Theta$ exists.

Theorem 9.2 [Existence of MLE/QMLE]: Suppose for each $\theta \in \Theta$, where $\Theta$ is a compact parameter space, $f_{Y_t|\Psi_t}(Y_t \mid \Psi_t; \theta)$ is a measurable function of $(Y_t, \Psi_t)$, and for each $t$, $f_{Y_t|\Psi_t}(Y_t \mid \Psi_t; \theta)$ is continuous in $\theta \in \Theta$. Then the MLE/QMLE $\hat\theta$ exists.

This result is analogous to the Weierstrass theorem in multivariate calculus: any continuous function over a compact support attains a maximum and a minimum.

Assumption 9.1 [Parametric Distribution Model]: (i) $\{Z_t = (Y_t, X_t')'\}_{t=1}^n$ is a stationary ergodic process, and (ii) $f(y_t \mid \Psi_t; \theta)$ is a conditional pdf/pmf model of $Y_t$ given $\Psi_t = (X_t', Z^{t-1\prime})'$, where $Z^{t-1} = (Z_{t-1}', Z_{t-2}', \ldots, Z_1')'$. For each $\theta$, $\ln f(Y_t \mid \Psi_t; \theta)$ is measurable with respect to the observations $(Y_t, \Psi_t)$, and for each $t$, $\ln f(Y_t \mid \Psi_t; \theta)$ is continuous in $\theta \in \Theta$, where $\Theta$ is a finite-dimensional parameter space.
Assumption 9.3 [Uniform WLLN]: $\{\ln f(Y_t \mid \Psi_t; \theta) - E\ln f(Y_t \mid \Psi_t; \theta)\}$ obeys the uniform weak law of large numbers (UWLLN), i.e.,
$$\sup_{\theta\in\Theta}\left| n^{-1}\sum_{t=1}^n \ln f(Y_t \mid \Psi_t; \theta) - l(\theta)\right| \xrightarrow{p} 0,$$
where
$$l(\theta) = E[\ln f(Y_t \mid \Psi_t; \theta)]$$
is continuous in $\theta \in \Theta$. Define
$$\theta^* = \arg\max_{\theta\in\Theta} l(\theta).$$

Assumption 9.4 is an identification condition which states that $\theta^*$ is the unique solution that maximizes $l(\theta)$, the expected value of the logarithmic conditional likelihood function $\ln f(Y_t \mid \Psi_t; \theta)$. So far, there is no economic interpretation for $\theta^*$. This is analogous to the best linear least squares approximation coefficient $\beta^* = \arg\min_\beta E(Y - X'\beta)^2$ in Chapter 2.
9.3.1 Consistency

We first consider the consistency of $\hat\theta$ for $\theta^*$. Because we assume that $\Theta$ is compact, $\hat\theta$ and $\theta^*$ may be corner solutions. Thus, we have to use the extremum estimator lemma to prove the consistency of the MLE/QMLE $\hat\theta$.

Theorem 9.3 [Consistency of MLE/QMLE]: Suppose Assumptions 9.1-9.4 hold. Then $\hat\theta \xrightarrow{p} \theta^*$ as $n \to \infty$.

Proof: Apply the extremum estimator lemma in Chapter 8, with
$$\hat Q(\theta) = n^{-1}\sum_{t=1}^n \ln f(Y_t \mid \Psi_t; \theta)$$
and
$$Q(\theta) = l(\theta) \equiv E[\ln f(Y_t \mid \Psi_t; \theta)].$$
Assumptions 9.1-9.4 ensure that all conditions on $\hat Q(\theta)$ and $Q(\theta)$ in the extremum estimator lemma are satisfied. It follows that $\hat\theta \xrightarrow{p} \theta^*$ as $n \to \infty$.
Definition 9.3 [Correct Specification for Conditional Distribution]: The model $f(y_t \mid \Psi_t; \theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$ if there exists some parameter value $\theta^o \in \Theta$ such that $f(y_t \mid \Psi_t; \theta^o)$ coincides with the true conditional pdf/pmf of $Y_t$ given $\Psi_t$.

Under correct specification of $f(y \mid \Psi_t; \theta)$, the parameter value $\theta^o$ is usually called the true model parameter value. It will usually have an economic interpretation.

Lemma 9.4: Suppose Assumption 9.4 holds, and the model $f(y_t \mid \Psi_t; \theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$. Then $f(y_t \mid \Psi_t; \theta^*)$ coincides with the true conditional pdf/pmf $f(y_t \mid \Psi_t; \theta^o)$ of $Y_t$ given $\Psi_t$, where $\theta^*$ is as given in Assumption 9.4. In other words, the population likelihood maximizer $\theta^*$ coincides with the true parameter value $\theta^o$ when the model $f(y_t \mid \Psi_t; \theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$.

Proof: Because $f(y \mid \Psi_t; \theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$, there exists some $\theta^o \in \Theta$ such that, for any $\theta \in \Theta$,
$$l(\theta) = E[\ln f(Y_t \mid \Psi_t; \theta)] = E\{E[\ln f(Y_t \mid \Psi_t; \theta) \mid \Psi_t]\} = E\int \ln[f(y \mid \Psi_t; \theta)]\, f(y \mid \Psi_t; \theta^o)\, dy,$$
where the second equality follows from the law of iterated expectations (LIE), and the expectation $E(\cdot)$ in the third expression is taken with respect to the true distribution of the random variables in $\Psi_t$.
By Assumption 9.4, we have $l(\theta^*) \ge l(\theta)$ for all $\theta \in \Theta$, so that
$$E\int \ln[f(y \mid \Psi_t; \theta^*)]\, f(y \mid \Psi_t; \theta^o)\, dy \;\ge\; E\int \ln[f(y \mid \Psi_t; \theta)]\, f(y \mid \Psi_t; \theta^o)\, dy,$$
where $f(y \mid \Psi_t; \theta^o)$ is the true conditional pdf/pmf. Hence, by choosing $\theta = \theta^o$, we have
$$E\int \ln[f(y \mid \Psi_t; \theta^*)]\, f(y \mid \Psi_t; \theta^o)\, dy \;\ge\; E\int \ln[f(y \mid \Psi_t; \theta^o)]\, f(y \mid \Psi_t; \theta^o)\, dy.$$
On the other hand, by Jensen's inequality and the concavity of the logarithmic function, we have
$$\begin{aligned}
\int \ln[f(y \mid \Psi_t; \theta^*)]\, f(y \mid \Psi_t; \theta^o)\, dy - \int \ln[f(y \mid \Psi_t; \theta^o)]\, f(y \mid \Psi_t; \theta^o)\, dy
&= \int \ln\!\left[\frac{f(y \mid \Psi_t; \theta^*)}{f(y \mid \Psi_t; \theta^o)}\right] f(y \mid \Psi_t; \theta^o)\, dy \\
&\le \ln \int \frac{f(y \mid \Psi_t; \theta^*)}{f(y \mid \Psi_t; \theta^o)}\, f(y \mid \Psi_t; \theta^o)\, dy \\
&= \ln \int f(y \mid \Psi_t; \theta^*)\, dy = \ln(1) = 0,
\end{aligned}$$
where we have made use of the fact that $\int f(y \mid \Psi_t; \theta)\, dy = 1$ for all $\theta \in \Theta$. Therefore, we have
$$E\int \ln[f(y \mid \Psi_t; \theta^*)]\, f(y \mid \Psi_t; \theta^o)\, dy \;\le\; E\int \ln[f(y \mid \Psi_t; \theta^o)]\, f(y \mid \Psi_t; \theta^o)\, dy.$$
It follows that we must have $\theta^* = \theta^o$; otherwise $\theta^*$ cannot be the maximizer of $l(\theta)$ over $\Theta$. This completes the proof.
Remarks:

This lemma provides an interpretation of $\theta^*$ in Assumption 9.4: the population likelihood maximizer $\theta^*$ coincides with the true model parameter $\theta^o$ when $f(y \mid \Psi_t; \theta)$ is correctly specified. Thus, by maximizing the population model log-likelihood function $l(\theta)$, we can obtain the true parameter value $\theta^o$.

By Theorem 9.3, we have $\hat\theta \xrightarrow{p} \theta^*$ as $n \to \infty$. Furthermore, under correct specification of the conditional distribution (i.e., Lemma 9.4), we know $\theta^* = \theta^o$, where $\theta^o$ is the true model parameter. Thus, we have $\hat\theta \xrightarrow{p} \theta^o$ as $n \to \infty$.

This is essentially the same situation as consistency in the linear regression context: the OLS estimator $\hat\beta$ always converges to the best linear approximation coefficient $\beta^*$, no matter whether the model is correctly specified. Only when the model coincides with the true model do we have $\beta^* = \beta^o$, in which case $\hat\beta$ converges to the true model parameter $\beta^o$. Otherwise the estimation is biased for $\beta^o$, since $\hat\beta$ does not converge to $\beta^o$ as $n \to \infty$.
Question: Why do we need this assumption? This assumption is needed for the purpose of
taking a Taylor series expansion.
Lemma 9.5 [The MDS Property of the Score Function of a Correctly Specified Conditional Distribution Model]: Suppose that for each $t$, $\ln f(Y_t \mid \Psi_t; \theta)$ is continuously differentiable with respect to $\theta \in \Theta$. Define the $K \times 1$ score function
$$S_t(\theta) = \frac{\partial}{\partial\theta}\ln f(Y_t \mid \Psi_t; \theta).$$
Then
$$E[S_t(\theta^o) \mid \Psi_t] = 0 \quad \text{a.s.},$$
where $\theta^o$ is as in Assumption 9.4 and satisfies Assumption 9.5, and $E(\cdot \mid \Psi_t)$ is the expectation taken over the true conditional distribution of $Y_t$ given $\Psi_t$.

Proof: Note that for any given $\theta \in \Theta$, $f(y \mid \Psi_t; \theta)$ is a valid pdf. Thus we have
$$\int_{-\infty}^{\infty} f(y \mid \Psi_t; \theta)\, dy = 1.$$
Differentiating both sides with respect to $\theta$ and evaluating at $\theta = \theta^o$ gives
$$\int_{-\infty}^{\infty} \frac{\partial \ln f(y \mid \Psi_t; \theta^o)}{\partial\theta}\, f(y \mid \Psi_t; \theta^o)\, dy = 0,
\qquad \text{where } \frac{\partial \ln f(y \mid \Psi_t; \theta^o)}{\partial\theta} = \left.\frac{\partial \ln f(y \mid \Psi_t; \theta)}{\partial\theta}\right|_{\theta=\theta^o}.$$
Because $f(y \mid \Psi_t; \theta^o)$ is the true conditional pdf/pmf of $Y_t$ given $\Psi_t$ when $f(y \mid \Psi_t; \theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$, we have
$$E[S_t(\theta^o) \mid \Psi_t] = 0.$$
Note that $E[S_t(\theta^o) \mid \Psi_t] = 0$ implies $E[S_t(\theta^o) \mid Z^{t-1}] = 0$; namely, $\{S_t(\theta^o)\}$ is an MDS.

Question: Does the MDS property of the score function imply that the model is correctly specified?

Answer: No. The MDS property is one of many implications of correct model specification. In a certain sense, the MDS property is equivalent to correct specification of the conditional mean. Misspecification of $f(y \mid \Psi_t; \theta)$ may occur in higher order conditional moments of $Y_t$ given $\Psi_t$. Below is an example in which $\{S_t(\theta^o)\}$ is an MDS but the model $f(y_t \mid \Psi_t; \theta)$ is misspecified.
20
Example 1: Suppose $\{Y_t\}$ is a univariate time series process such that
$$Y_t = \mu_t(\theta) + \sigma_t(\theta) z_t,$$
where $\mu_t(\theta^o) = E(Y_t \mid I_{t-1})$ for some $\theta^o$ and $I_{t-1} = (Y_{t-1}, Y_{t-2}, \ldots, Y_1)$, but $\sigma_t^2(\theta) \ne \operatorname{var}(Y_t \mid I_{t-1})$ for all $\theta$. Then correct model specification for the conditional mean $E(Y_t \mid I_{t-1})$ implies that $E(z_t \mid I_{t-1}) = 0$. Assume that $\{z_t\} \sim$ i.i.d. $N(0,1)$. Then the conditional probability density model
$$f(y \mid \Psi_t; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma_t(\theta)}\exp\!\left[-\frac{(y - \mu_t(\theta))^2}{2\sigma_t^2(\theta)}\right]$$
has a score function $\{S_t(\theta^o)\}$ that is an MDS, although the conditional variance $\sigma_t^2(\theta)$ is misspecified for $\operatorname{var}(Y_t \mid I_{t-1})$.
The next result is the conditional information matrix equality: under correct model specification,
$$E[S_t(\theta^o)S_t(\theta^o)' + H_t(\theta^o) \mid \Psi_t] = 0,$$
where
$$H_t(\theta) \equiv \frac{d}{d\theta'}S_t(\theta) = \frac{\partial^2}{\partial\theta\,\partial\theta'}\ln f(Y_t \mid \Psi_t; \theta),$$
or equivalently,
$$E\!\left[\frac{\partial \ln f(Y_t\mid\Psi_t;\theta^o)}{\partial\theta}\frac{\partial \ln f(Y_t\mid\Psi_t;\theta^o)}{\partial\theta'}\,\Big|\,\Psi_t\right]
= -\,E\!\left[\frac{\partial^2 \ln f(Y_t\mid\Psi_t;\theta^o)}{\partial\theta\,\partial\theta'}\,\Big|\,\Psi_t\right].$$
Proof: By differentiating $\int_{-\infty}^{\infty} f(y \mid \Psi_t; \theta)\, dy = 1$ with respect to $\theta \in \operatorname{int}(\Theta)$, we obtain
$$\int_{-\infty}^{\infty}\frac{\partial}{\partial\theta}f(y \mid \Psi_t; \theta)\, dy = 0.$$
Differentiating once more with respect to $\theta'$ and rearranging yields, for all $\theta \in \Theta$ (including $\theta^o$),
$$\int\left[\frac{\partial \ln f(y\mid\Psi_t;\theta)}{\partial\theta}\frac{\partial \ln f(y\mid\Psi_t;\theta)}{\partial\theta'} + \frac{\partial^2\ln f(y\mid\Psi_t;\theta)}{\partial\theta\,\partial\theta'}\right]f(y\mid\Psi_t;\theta)\, dy = 0.$$
This and the fact that $f(y \mid \Psi_t; \theta^o)$ is the true conditional pdf/pmf of $Y_t$ given $\Psi_t$ imply the desired conditional information matrix equality stated above. This completes the proof.
Remarks:

The $K \times K$ matrix
$$E[S_t(\theta^o)S_t(\theta^o)' \mid \Psi_t] = E\!\left[\frac{\partial \ln f(Y_t\mid\Psi_t;\theta^o)}{\partial\theta}\frac{\partial \ln f(Y_t\mid\Psi_t;\theta^o)}{\partial\theta'}\,\Big|\,\Psi_t\right]$$
is called the conditional Fisher information matrix of $Y_t$ given $\Psi_t$. It measures the information contained in the random variable $Y_t$ conditional on $\Psi_t$: the larger this expectation is, the more information $Y_t$ contains.
22
Question: What is the implication of the conditional information matrix equality?
Assumption 9.6: (i) For each $t$, $\ln f(y_t \mid \Psi_t; \theta)$ is twice continuously differentiable with respect to $\theta \in \Theta$; (ii) $\{S_t(\theta^o)\}$ obeys a CLT, i.e.,
$$\sqrt{n}\,\hat S(\theta^o) \equiv n^{-1/2}\sum_{t=1}^n S_t(\theta^o) \xrightarrow{d} N(0, V_o)$$
for some $K \times K$ matrix $V_o \equiv \operatorname{avar}\!\left[n^{-1/2}\sum_{t=1}^n S_t(\theta^o)\right]$ which is symmetric, finite and positive definite; (iii) $\left\{H_t(\theta) \equiv \frac{\partial^2}{\partial\theta\,\partial\theta'}\ln f(y_t \mid \Psi_t; \theta)\right\}$ obeys a uniform weak law of large numbers (UWLLN) over $\Theta$. That is, as $n \to \infty$,
$$\sup_{\theta\in\Theta}\left\| n^{-1}\sum_{t=1}^n H_t(\theta) - H(\theta)\right\| \xrightarrow{p} 0,$$
where
$$H(\theta) \equiv E[H_t(\theta)] = E\!\left[\frac{\partial^2\ln f(Y_t\mid\Psi_t;\theta)}{\partial\theta\,\partial\theta'}\right].$$
Note that
$$\begin{aligned}
V_o &\equiv \operatorname{avar}\!\left[n^{-1/2}\sum_{t=1}^n S_t(\theta^o)\right]
= E\left\{\left[n^{-1/2}\sum_{t=1}^n S_t(\theta^o)\right]\left[n^{-1/2}\sum_{\tau=1}^n S_\tau(\theta^o)\right]'\right\} \\
&= n^{-1}\sum_{t=1}^n\sum_{\tau=1}^n E[S_t(\theta^o)S_\tau(\theta^o)'] = E[S_t(\theta^o)S_t(\theta^o)'],
\end{aligned}$$
where the expectations of the cross-products, $E[S_t(\theta^o)S_\tau(\theta^o)']$, are identically zero for all $t \ne \tau$, as implied by the MDS property of $\{S_t(\theta^o)\}$ from the lemma on the score function. Moreover, by the information matrix equality,
$$V_o = E[S_t(\theta^o)S_t(\theta^o)'] = -H_o.$$
Theorem 9.7 [Asymptotic Normality of MLE]: Suppose Assumptions 9.1-9.6 hold, and $f(y_t \mid \Psi_t; \theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$. Then
$$\sqrt{n}(\hat\theta - \theta^o) \xrightarrow{d} N(0, -H_o^{-1}).$$

Proof: Because $\theta^o$ is an interior point in $\Theta$ and $\hat\theta - \theta^o \xrightarrow{p} 0$ as $n \to \infty$, we have $\hat\theta \in \operatorname{int}(\Theta)$ for $n$ sufficiently large. It follows that the FOC of maximizing the log-likelihood holds when $n$ is sufficiently large:
$$\hat S(\hat\theta) \equiv n^{-1}\sum_{t=1}^n \frac{\partial\ln f(Y_t\mid\Psi_t;\hat\theta)}{\partial\theta} = n^{-1}\sum_{t=1}^n S_t(\hat\theta) = 0.$$
The FOC provides a link between MLE and GMM: MLE can be viewed as a GMM estimation with the moment condition $E[S_t(\theta^o)] = 0$. By a first-order Taylor expansion of the FOC around $\theta^o$,
$$0 = \hat S(\hat\theta) = \hat S(\theta^o) + \hat H(\bar\theta)(\hat\theta - \theta^o),$$
where $\bar\theta = a\hat\theta + (1-a)\theta^o$ for some $a \in [0,1]$, and
$$\hat H(\theta) = n^{-1}\sum_{t=1}^n H_t(\theta) = n^{-1}\sum_{t=1}^n \frac{\partial^2\ln f(Y_t\mid\Psi_t;\theta)}{\partial\theta\,\partial\theta'}$$
is the derivative of $\hat S(\theta)$. Given that $\hat\theta - \theta^o \xrightarrow{p} 0$, we have
$$\|\bar\theta - \theta^o\| = \|a(\hat\theta - \theta^o)\| \le \|\hat\theta - \theta^o\| \xrightarrow{p} 0.$$
Also, by the triangle inequality, the UWLLN for $\{H_t(\theta)\}$ over $\Theta$ and the continuity of $H(\theta)$, we obtain
$$\|\hat H(\bar\theta) - H_o\| \le \|\hat H(\bar\theta) - H(\bar\theta)\| + \|H(\bar\theta) - H(\theta^o)\| \le \sup_{\theta\in\Theta}\|\hat H(\theta) - H(\theta)\| + \|H(\bar\theta) - H(\theta^o)\| \xrightarrow{p} 0.$$
It then follows from the Taylor expansion of the FOC that
$$\sqrt{n}(\hat\theta - \theta^o) = -\hat H^{-1}(\bar\theta)\sqrt{n}\,\hat S(\theta^o) \xrightarrow{d} N(0, H_o^{-1}V_oH_o^{-1}),$$
where, as we have shown above,
$$V_o \equiv \operatorname{avar}\!\left[\sqrt{n}\,\hat S(\theta^o)\right] = E[S_t(\theta^o)S_t(\theta^o)'],$$
or equivalently
$$\sqrt{n}(\hat\theta - \theta^o) \xrightarrow{d} N(0, H_o^{-1}V_oH_o^{-1}) = N(0, V_o^{-1}) = N(0, -H_o^{-1}),$$
using the information matrix equality $V_o = E[S_t(\theta^o)S_t(\theta^o)'] = -H_o$. This completes the proof.
Remarks:

To use Theorem 9.7 in practice, we need a consistent estimator of the asymptotic variance $-H_o^{-1}$. There are two methods.

Method 1: Use $\hat\Omega \equiv -\hat H^{-1}(\hat\theta)$, where
$$\hat H(\hat\theta) = \frac{1}{n}\sum_{t=1}^n \frac{\partial^2\ln f(Y_t\mid\Psi_t;\hat\theta)}{\partial\theta\,\partial\theta'}.$$
This requires taking second derivatives of the log-likelihood function. By Assumption 9.6(iii) and $\hat\theta \xrightarrow{p} \theta^o$, we have $-\hat H^{-1}(\hat\theta) \xrightarrow{p} -H_o^{-1}$.

Method 2: Use $\hat\Omega \equiv \hat V^{-1}$, where
$$\hat V \equiv \frac{1}{n}\sum_{t=1}^n S_t(\hat\theta)S_t(\hat\theta)'.$$
This requires the computation of only the first derivatives (i.e., score functions) of the log-likelihood function. Suppose, in addition, that
$$\sup_{\theta\in\Theta}\left\| n^{-1}\sum_{t=1}^n S_t(\theta)S_t(\theta)' - V(\theta)\right\| \xrightarrow{p} 0,$$
where
$$V(\theta) = E[S_t(\theta)S_t(\theta)']$$
is continuous in $\theta$. Then, if $\hat\theta \xrightarrow{p} \theta^o$, we can show that $\hat V \xrightarrow{p} V_o$. Note that $V_o = V(\theta^o)$.
Consider the null hypothesis $H_0: R(\theta^o) = r$.
We will introduce three test procedures, namely the Wald test, the Likelihood Ratio (LR)
test, and the Lagrange Multiplier (LM) test. We now derive these tests respectively.
Wald Test

By a first-order Taylor expansion of $R(\hat\theta)$ around $\theta^o$, with $\bar\theta = a\hat\theta + (1-a)\theta^o$ for some $a \in [0,1]$, it follows that the quadratic form
$$n[R(\hat\theta) - r]'\left[-R'(\theta^o)H_o^{-1}R'(\theta^o)'\right]^{-1}[R(\hat\theta) - r] \xrightarrow{d} \chi^2_J.$$
Replacing $\theta^o$ and $H_o$ by consistent estimators yields the Wald statistic
$$W = n[R(\hat\theta) - r]'\left[-R'(\hat\theta)\hat H^{-1}(\hat\theta)R'(\hat\theta)'\right]^{-1}[R(\hat\theta) - r] \xrightarrow{d} \chi^2_J,$$
where again
$$\hat H(\theta) = n^{-1}\sum_{t=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta'}\ln f(Y_t\mid\Psi_t;\theta).$$
Note that only the unconstrained MLE $\hat\theta$ is needed in constructing the Wald test statistic.
Note that only the unconstrained MLE ^ is needed in constructing the Wald test statistic.
Theorem 9.8 [MLE-based Hypothesis Testing: Wald Test]: Suppose Assumptions 9.1-9.6 hold, and the model $f(y_t \mid \Psi_t; \theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$. Then under $H_0: R(\theta^o) = r$, we have, as $n \to \infty$,
$$W \equiv n[R(\hat\theta) - r]'\left[-R'(\hat\theta)\hat H^{-1}(\hat\theta)R'(\hat\theta)'\right]^{-1}[R(\hat\theta) - r] \xrightarrow{d} \chi^2_J.$$
By the information matrix equality, an asymptotically equivalent version replaces $-\hat H(\hat\theta)$ with
$$\hat V = n^{-1}\sum_{t=1}^n S_t(\hat\theta)S_t(\hat\theta)' = S(\hat\theta)'S(\hat\theta)/n,$$
where $S(\hat\theta)$ denotes the $n \times K$ matrix whose $t$-th row is $S_t(\hat\theta)'$.
Answer: Yes. But Why?
Theorem 9.9 [Likelihood Ratio Test]: Suppose Assumptions 9.1-9.6 hold, and $f(y \mid \Psi_t; \theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$. Define the average log-likelihoods
$$\hat l(\hat\theta) = n^{-1}\sum_{t=1}^n \ln f(Y_t\mid\Psi_t;\hat\theta), \qquad \hat l(\tilde\theta) = n^{-1}\sum_{t=1}^n \ln f(Y_t\mid\Psi_t;\tilde\theta),$$
where $\hat\theta$ is the unconstrained MLE and $\tilde\theta$ is the constrained MLE subject to the constraint $R(\tilde\theta) = r$. Then under $H_0: R(\theta^o) = r$, we have
$$LR = 2n[\hat l(\hat\theta) - \hat l(\tilde\theta)] \xrightarrow{d} \chi^2_J \quad \text{as } n \to \infty.$$
Proof: The unconstrained MLE $\hat\theta$ solves $\max_{\theta\in\Theta}\hat l(\theta)$. On the other hand, the constrained MLE $\tilde\theta$ solves the maximization problem
$$\max_{\theta\in\Theta}\left\{\hat l(\theta) + \lambda'[r - R(\theta)]\right\},$$
where $\lambda$ is a $J \times 1$ Lagrange multiplier vector. The corresponding FOCs are
$$\hat S(\tilde\theta) - R'(\tilde\theta)'\tilde\lambda = 0 \quad (K\times 1 \;-\; (K\times J)(J\times 1) = K\times 1),$$
$$R(\tilde\theta) - r = 0.$$
[Recall that $R'(\theta)$ is a $J \times K$ matrix.] We now take a second-order Taylor series expansion of $\hat l(\tilde\theta)$ around the unconstrained MLE $\hat\theta$:
$$2n[\hat l(\tilde\theta) - \hat l(\hat\theta)] = 2n\hat S(\hat\theta)'(\tilde\theta - \hat\theta) + \sqrt{n}(\tilde\theta - \hat\theta)'\hat H(\bar\theta_a)\sqrt{n}(\tilde\theta - \hat\theta) = \sqrt{n}(\tilde\theta - \hat\theta)'\hat H(\bar\theta_a)\sqrt{n}(\tilde\theta - \hat\theta),$$
where $\bar\theta_a$ lies between $\tilde\theta$ and $\hat\theta$, namely $\bar\theta_a = a\tilde\theta + (1-a)\hat\theta$ for some $a \in [0,1]$, and the second equality uses the FOC $\hat S(\hat\theta) = 0$. It follows that
$$2n[\hat l(\hat\theta) - \hat l(\tilde\theta)] = \sqrt{n}(\tilde\theta - \hat\theta)'\left[-\hat H(\bar\theta_a)\right]\sqrt{n}(\tilde\theta - \hat\theta). \tag{9.1}$$
Next, a first-order Taylor expansion of $\hat S(\tilde\theta)$ around $\hat\theta$ in the FOC $\hat S(\tilde\theta) - R'(\tilde\theta)'\tilde\lambda = 0$ gives
$$\hat S(\hat\theta) + \hat H(\bar\theta_b)(\tilde\theta - \hat\theta) - R'(\tilde\theta)'\tilde\lambda = 0,$$
and, using $\hat S(\hat\theta) = 0$,
$$\hat H(\bar\theta_b)\sqrt{n}(\tilde\theta - \hat\theta) - R'(\tilde\theta)'\sqrt{n}\tilde\lambda = 0,$$
or
$$\sqrt{n}(\tilde\theta - \hat\theta) = \hat H^{-1}(\bar\theta_b)R'(\tilde\theta)'\sqrt{n}\tilde\lambda \tag{9.2}$$
for $n$ sufficiently large. This establishes the link between $\tilde\lambda$ and $\tilde\theta - \hat\theta$. In particular, it implies that the Lagrange multiplier $\tilde\lambda$ is an indicator of the magnitude of the difference $\tilde\theta - \hat\theta$.

Next, we derive the asymptotic distribution of $\sqrt{n}\tilde\lambda$. By a Taylor expansion of $\hat S(\tilde\theta)$ around the true parameter $\theta^o$ in the FOC $\sqrt{n}\hat S(\tilde\theta) - R'(\tilde\theta)'\sqrt{n}\tilde\lambda = 0$, we have
$$R'(\tilde\theta)'\sqrt{n}\tilde\lambda = \sqrt{n}\hat S(\tilde\theta) = \sqrt{n}\hat S(\theta^o) + \hat H(\bar\theta_c)\sqrt{n}(\tilde\theta - \theta^o),$$
where $\bar\theta_c$ lies between $\tilde\theta$ and $\theta^o$, namely $\bar\theta_c = c\tilde\theta + (1-c)\theta^o$ for some $c \in [0,1]$. It follows that
$$\hat H^{-1}(\bar\theta_c)R'(\tilde\theta)'\sqrt{n}\tilde\lambda = \hat H^{-1}(\bar\theta_c)\sqrt{n}\hat S(\theta^o) + \sqrt{n}(\tilde\theta - \theta^o) \tag{9.3}$$
for $n$ sufficiently large. Now, we consider a Taylor series expansion of $R(\tilde\theta) - r = 0$ around $\theta^o$:
$$\sqrt{n}[R(\theta^o) - r] + R'(\bar\theta_d)\sqrt{n}(\tilde\theta - \theta^o) = 0.$$
Under $H_0: R(\theta^o) = r$, premultiplying (9.3) by $R'(\bar\theta_d)$ and using this expansion yields
$$R'(\bar\theta_d)\hat H^{-1}(\bar\theta_c)R'(\tilde\theta)'\sqrt{n}\tilde\lambda
= R'(\bar\theta_d)\hat H^{-1}(\bar\theta_c)\sqrt{n}\hat S(\theta^o) + R'(\bar\theta_d)\sqrt{n}(\tilde\theta - \theta^o)
= R'(\bar\theta_d)\hat H^{-1}(\bar\theta_c)\sqrt{n}\hat S(\theta^o)
\xrightarrow{d} N(0, R'(\theta^o)H_o^{-1}V_oH_o^{-1}R'(\theta^o)').$$
Therefore,
$$\sqrt{n}\tilde\lambda = \left[R'(\bar\theta_d)\hat H^{-1}(\bar\theta_c)R'(\tilde\theta)'\right]^{-1}R'(\bar\theta_d)\hat H^{-1}(\bar\theta_c)\sqrt{n}\hat S(\theta^o)
\xrightarrow{d} N\!\left(0, \left[-R'(\theta^o)H_o^{-1}R'(\theta^o)'\right]^{-1}\right) \tag{9.5}$$
by the CLT for $\sqrt{n}\hat S(\theta^o)$, the MDS property of $\{S_t(\theta^o)\}$, the information matrix equality, and the Slutsky theorem.
Therefore, from Eq. (9.2) and Eq. (9.5), we have
$$\left[-\hat H(\bar\theta_a)\right]^{1/2}\sqrt{n}(\tilde\theta - \hat\theta)
= \left[-\hat H(\bar\theta_a)\right]^{1/2}\hat H^{-1}(\bar\theta_b)R'(\tilde\theta)'\sqrt{n}\tilde\lambda
\xrightarrow{d} N(0, \Pi) \equiv \Pi^{1/2}N(0, I), \tag{9.6}$$
where
$$\Pi = (-H_o)^{-1/2}R'(\theta^o)'\left[-R'(\theta^o)H_o^{-1}R'(\theta^o)'\right]^{-1}R'(\theta^o)(-H_o)^{-1/2}$$
is a $K \times K$ symmetric and idempotent matrix ($\Pi^2 = \Pi$) with rank equal to $J$ (using the formula $\operatorname{tr}(ABC) = \operatorname{tr}(BCA)$).
Recall that if $v \sim N(0, \Pi)$, where $\Pi$ is a symmetric and idempotent matrix with rank $J$, then the quadratic form $v'v \sim \chi^2_J$. It follows from Eq. (9.1) and Eq. (9.6) that
$$2n[\hat l(\hat\theta) - \hat l(\tilde\theta)] = \sqrt{n}(\tilde\theta - \hat\theta)'\left[-\hat H(\bar\theta_a)\right]^{1/2}\left[-\hat H(\bar\theta_a)\right]^{1/2}\sqrt{n}(\tilde\theta - \hat\theta) \xrightarrow{d} \chi^2_J.$$
Remarks:

The LR test is based on comparing the objective functions, namely the log-likelihood functions under the null hypothesis $H_0$ and under the alternative to $H_0$. Intuitively, when $H_0$ holds, the likelihood $\hat l(\hat\theta)$ of the unrestricted model is similar to the likelihood $\hat l(\tilde\theta)$ of the restricted model, with the small difference attributable to sampling variation. If the likelihood $\hat l(\hat\theta)$ of the unrestricted model is sufficiently larger than the likelihood $\hat l(\tilde\theta)$ of the restricted model, there is evidence that $H_0$ is false. How large a difference between $\hat l(\hat\theta)$ and $\hat l(\tilde\theta)$ is considered sufficiently large to reject $H_0$ is determined by the associated asymptotic $\chi^2_J$ distribution.

The likelihood ratio test statistic is similar in spirit to the $F$-test statistic in the classical linear regression model, which compares the objective functions, namely the sums of squared residuals, under the null hypothesis $H_0$ and under the alternative to $H_0$. In other words, the negative log-likelihood is analogous to the sum of squared residuals. In fact, the LR test statistic and the $J\cdot F$ statistic are asymptotically equivalent under $H_0$ for a linear regression model
$$Y_t = X_t'\beta^o + \varepsilon_t,$$
where $\varepsilon_t \mid \Psi_t \sim N(0, \sigma_o^2)$. To see this, put $\theta = (\beta', \sigma^2)'$ and note that
$$f(Y_t\mid\Psi_t;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(Y_t - X_t'\beta)^2},$$
$$\hat l(\theta) = n^{-1}\sum_{t=1}^n \ln f(Y_t\mid\Psi_t;\theta) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\, n^{-1}\sum_{t=1}^n (Y_t - X_t'\beta)^2.$$
It is straightforward to show (please show it!) that the LR statistic and the $J\cdot F$ statistic differ only by an asymptotically negligible ($o_P(1)$) term, where we have used the inequality $|\ln(1+z) - z| \le z^2$ for small $z$; the asymptotically negligible remainder term is contributed by the quadratic term in the expansion.
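The following is a small numerical sketch (illustrative only, with simulated data and names of our own choosing) comparing the LR statistic computed from the concentrated Gaussian likelihood with $J$ times the $F$-statistic for a linear regression; under $H_0$ the two are close for moderately large $n$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, J = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([1.0, 0.5, 0.0, 0.0])   # H0: the last two coefficients are zero
y = X @ beta_true + rng.normal(size=n)

def ssr(Xmat):
    """Sum of squared residuals from OLS of y on Xmat."""
    b = np.linalg.lstsq(Xmat, y, rcond=None)[0]
    e = y - Xmat @ b
    return e @ e

ssr_u = ssr(X)            # unrestricted model
ssr_r = ssr(X[:, :2])     # restricted model: drop the two regressors under H0

# Concentrating the Gaussian likelihood over sigma^2 gives
# l_hat = -0.5*[log(2*pi) + log(SSR/n) + 1], so LR = n*[log(SSR_r/n) - log(SSR_u/n)].
LR = n * (np.log(ssr_r / n) - np.log(ssr_u / n))

K = X.shape[1]
F = ((ssr_r - ssr_u) / J) / (ssr_u / (n - K))
print(f"LR = {LR:.3f},  J*F = {J * F:.3f}")
```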
In the proof of the above theorem, we see that the asymptotic distribution of the LR test
statistic depends on correct model speci…cation of f (yj t ; ), because it uses the MDS property
of the score function and the IM equality. In other words, if the conditional distribution model
f (yj t ; ) is misspeci…ed such that the MDS property of the score function or the IM equality
does not hold, then the LR test statistic will not be asymptotically 2 -distributed.
Lagrange Multiplier (LM) or Efficient Score Test

We can also use the Lagrange multiplier $\tilde\lambda$ to construct a Lagrange Multiplier (LM) test, which is also called Rao's efficient score test. Recall that the Lagrange multiplier is introduced in the constrained MLE problem
$$\max_{\theta\in\Theta}\left\{\hat l(\theta) + \lambda'[r - R(\theta)]\right\}.$$
The $J \times 1$ Lagrange multiplier vector $\tilde\lambda$ measures the effect of the restriction of $H_0$ on the maximized value of the model likelihood. When $H_0$ holds, the imposition of the restriction results in little change in the maximized likelihood; thus the value of the Lagrange multiplier $\tilde\lambda$ for a correct restriction should be small. If a sufficiently large Lagrange multiplier $\tilde\lambda$ is obtained, the maximized likelihood of the restricted model is sufficiently smaller than that of the unrestricted model, thus leading to the rejection of $H_0$. Therefore, we can use $\tilde\lambda$ to construct a test for $H_0$.
In deriving the asymptotic distribution of the LR test statistic, we have obtained
$$\sqrt{n}\tilde\lambda = \left[R'(\bar\theta_d)\hat H^{-1}(\bar\theta_c)R'(\tilde\theta)'\right]^{-1}R'(\bar\theta_d)\hat H^{-1}(\bar\theta_c)\sqrt{n}\hat S(\theta^o)
\xrightarrow{d} N\!\left(0, \left[-R'(\theta^o)H_o^{-1}R'(\theta^o)'\right]^{-1}\right).$$

Theorem 9.10 [LM/Efficient Score Test]: Suppose Assumptions 9.1-9.6 hold, and the model $f(y \mid \Psi_t; \theta)$ is correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$. Then we have
$$LM_0 \equiv n\tilde\lambda' R'(\tilde\theta)\left[-\hat H^{-1}(\tilde\theta)\right]R'(\tilde\theta)'\tilde\lambda \xrightarrow{d} \chi^2_J$$
under $H_0$.

Because the LM test statistic only involves estimation of the model $f(y_t \mid \Psi_t; \theta)$ under $H_0$, its computation may be simpler than that of the Wald test statistic or the LR test statistic in many cases.
An asymptotically equivalent version replaces $-\hat H(\tilde\theta)$ with
$$\tilde V = n^{-1}\sum_{t=1}^n S_t(\tilde\theta)S_t(\tilde\theta)' = S(\tilde\theta)'S(\tilde\theta)/n.$$
Question: What is the relationship among the Wald, LR and LM test statistics?
We can no longer interpret $\theta^*$ as the true model parameter, because $f(y \mid \Psi_t; \theta)$ does not coincide with the true conditional probability distribution of $Y_t$ given $\Psi_t$. It should be noted that under QMLE we no longer have the equality
$$\theta^* = \theta^o,$$
where $\theta^*$ is as defined in Assumption 9.4 and $\theta^o$ is the true model parameter. Although it always holds that $\hat\theta_{QMLE} \xrightarrow{p} \theta^*$ as $n \to \infty$, we no longer have $\hat\theta_{QMLE} \xrightarrow{p} \theta^o$ when the conditional probability distribution is misspecified.
Below, we provide an alternative interpretation for when f (yj t ; ) is misspeci…ed.
Lemma 9.11: Suppose Assumption 9.4 holds. Define the conditional relative entropy
$$I(f : p \mid \Psi) = \int \ln\!\left[\frac{p(y \mid \Psi)}{f(y \mid \Psi; \theta)}\right] p(y \mid \Psi)\, dy,$$
where $p(y \mid \Psi)$ is the true conditional pdf/pmf of $Y$ given $\Psi$. Then $I(f : p \mid \Psi)$ is nonnegative almost surely for all $\theta$, and
$$\theta^* = \arg\min_{\theta\in\Theta} E[I(f : p \mid \Psi)].$$
Remarks:
The parameter value minimizes the “distance”of f ( j ; ) from the true conditional density
p( j ) in terms of conditional relative entropy. Relative entropy is a divergence measure for two
alternative distributions. It is zero if and only if two distributions coincide with each other.
There are many distance/divergence measures for two distributions. Relative entropy has the
appealing information-theoretic interpretation and the invariance property with respect to data
transformation. It has been widely used in economics and econometrics.
Question: Why is a misspeci…ed pdf/pmf model f (yt j t ; ) still useful in economic applications?
In many applications, misspeci…cation of higher order conditional moments does not render
inconsistent the estimator for the parameters appearing in the lower order conditional moments.
For example, suppose a conditional mean model is correctly speci…ed but the conditional higher
order moments are misspeci…ed. We can still obtain a consistent estimator for the parameter
appearing in the conditional mean model. Of course, the parameters appearing in the higher
order conditional moments cannot be consistently estimated.
In other words, even though $\theta^*$ does not equal $\theta^o$ element by element, some elements of interest can still coincide. For example, the first two elements of $\theta^*$ (say, those indexing the population mean and variance, which are the parameters we are interested in) could be equal to the corresponding elements of $\theta^o$. Therefore, the QMLE $\hat\theta_{QMLE}$ may be inconsistent for $\theta^o$ as a whole, while the components estimating the correctly specified lower order moments (e.g., the mean and variance) remain consistent. See Example 1.
We now consider a few illustrative examples.
Suppose
$$Y_t = g(X_t, \theta^o) + \varepsilon_t.$$
Here, the regression model $g(X_t, \theta)$ is correctly specified for $E(Y_t \mid X_t)$ if and only if $E(\varepsilon_t \mid X_t) = 0$ a.s. We need not know the distribution of $\varepsilon_t \mid X_t$.

Question: How do we estimate the true parameter $\theta^o$ when the conditional mean model $g(X_t, \theta)$ is correctly specified for $E(Y_t \mid X_t)$?
In order to estimate $\theta^o$, we assume that $\varepsilon_t \mid X_t \sim$ i.i.d. $N(0, \sigma^2)$, which is likely to be incorrect (and we know this). Then we can obtain the pseudo conditional likelihood function
$$f(y_t \mid x_t; \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}[y_t - g(x_t,\theta)]^2},$$
where the full parameter vector is $(\theta', \sigma^2)'$. Define the quasi-MLE
$$(\hat\theta', \hat\sigma^2)' = \arg\max_{\theta,\sigma^2}\sum_{t=1}^n \ln f(Y_t \mid X_t; \theta, \sigma^2).$$
Then $\hat\theta$ is a consistent estimator for $\theta^o$. In this example, misspecification of i.i.d. $N(0, \sigma^2)$ for $\varepsilon_t \mid X_t$ does not render the estimator of $\theta^o$ inconsistent. The QMLE $\hat\theta$ is consistent for $\theta^o$ as long as the conditional mean of $Y_t$ is correctly specified by $g(X_t, \theta)$. Of course, the parameter estimator $(\hat\theta', \hat\sigma^2)'$ cannot consistently estimate the true conditional distribution of $Y_t$ given $X_t$ if the conditional distribution of $\varepsilon_t \mid X_t$ is misspecified.
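Below is a minimal sketch (illustrative code, not part of the original notes; the exponential mean function, simulated data, and names are our own assumptions) of such a Gaussian QMLE: the pseudo log-likelihood is maximized even though the true errors are deliberately non-normal and heteroskedastic, and the conditional-mean parameters are still recovered.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 4000
x = rng.uniform(0, 2, size=n)
theta_true = np.array([1.0, 0.7])
g = lambda x, th: th[0] * np.exp(th[1] * x)        # assumed conditional mean model
# True errors are heteroskedastic Student-t, so the N(0, sigma^2) assumption is wrong.
eps = rng.standard_t(df=5, size=n) * (0.5 + 0.5 * x)
y = g(x, theta_true) + eps

def neg_pseudo_loglik(params):
    """Negative Gaussian pseudo log-likelihood with params = (theta1, theta2, log sigma)."""
    th, log_s = params[:2], params[2]
    s2 = np.exp(2 * log_s)
    resid = y - g(x, th)
    return 0.5 * np.sum(np.log(2 * np.pi * s2) + resid ** 2 / s2)

res = minimize(neg_pseudo_loglik, np.array([0.5, 0.5, 0.0]), method="BFGS")
print("Gaussian QMLE of the conditional-mean parameters:", res.x[:2])
```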
Suppose the true conditional distribution is $\varepsilon_t \mid X_t \sim$ i.i.d. $N(0, \sigma_t^2)$, where $\sigma_t^2 = \sigma^2(X_t)$ is a function of $X_t$, but we assume $\varepsilon_t \mid X_t \sim$ i.i.d. $N(0, \sigma^2)$. Then we still have $E[S_t(\theta^*) \mid X_t] = 0$ a.s., but the conditional information matrix equality does not hold.
Consider the CAPM
$$Y_t = \beta_0^o + \beta_1^o Z_{mt} + \varepsilon_t = B^{o\prime}X_t + \varepsilon_t,$$
where $X_t = (1, Z_{mt})'$ is a bivariate vector, $Z_{mt}$ is the excess market return, $B^o$ is a $2 \times L$ parameter matrix, and $\varepsilon_t$ is an $L \times 1$ disturbance with $E(\varepsilon_t \mid X_t) = 0$. With this condition, the CAPM is correctly specified for the expected excess return $E(Y_t \mid X_t)$.

To estimate the unknown parameter matrix $B^o$, one can assume
$$\varepsilon_t \mid \Psi_t \sim N(0, \Sigma),$$
so that the pseudo conditional likelihood is
$$f(Y_t \mid \Psi_t; \theta) = \frac{1}{\sqrt{(2\pi)^L\det(\Sigma)}}\exp\!\left[-\frac{1}{2}(Y_t - B'X_t)'\Sigma^{-1}(Y_t - B'X_t)\right],$$
where $\theta = (\operatorname{vec}(B)', \operatorname{vech}(\Sigma)')'$.

Although the i.i.d. normality assumption for $\{\varepsilon_t\}$ may not hold, the estimator based on the pseudo Gaussian likelihood function will be consistent for the parameter matrix $B^o$ appearing in the CAPM.
where "t is an MDS with mean 0 and variance 2 : Then this ARMA(p; q) model is correctly
speci…ed for E(Yt jIt 1 ); where It 1 = fYt 1 ; Yt 2 ; :::; Y1 g is the information set available at
time t 1: Note that the distribution of "t is not speci…ed. How can we estimate parameters
0 ; 1 ; :::; p ; 1 ; :::; and q ?
37
2
Assuming that f"t g i:i:d:N (0; ); then the conditional pdf of Yt given t = It 1 is
1 (y t( ; ))2
f (yj t; )= p exp 2
;
2 2 2
2 0
where =( 0; 1 ; :::; p; 1 ; :::; q; ) ; and
p q
X X
t( )= 0 + j Yt j + j "t j :
j=1 j=1
Although the i.i.d. normality assumption for f"t g may be false, the estimator based on the above
pesudo Gaussian likelihood function will be consistent for parameters ( o ; o ) appearing in the
ARMA(p; q) model.
In practice, we have a random sample fYt gnt=1 of size n to estimate an ARMA(p; q) model and
need to assume some initial values for fYt g0t= p and f"t g0t= q : For example, we can set Yt = Y
for p t 0 and "t = 0 for q t 0: When an ARMA(p; q) is a stationary process, these
choice of initial values does not a¤ect the asymptotic properties of the QMLE ^ under regularity
conditions.
For a multivariate example, consider an $L$-variate VAR($p$) model whose $L$-th equation is
$$Y_{Lt} = A_{L0} + \sum_{j=1}^p A_{L1,j}Y_{1,t-j} + \cdots + \sum_{j=1}^p A_{LL,j}Y_{L,t-j} + \varepsilon_{Lt}.$$
Let $\theta^o$ denote a parameter vector containing all components of the unknown parameters in $A_0^o, A_1^o, \ldots, A_p^o$ and $\Sigma^o$. To estimate $\theta^o$, one can assume
$$f(Y_t \mid \Psi_t; \theta) = \frac{1}{\sqrt{(2\pi)^L\det(\Sigma)}}\exp\!\left[-\frac{1}{2}(Y_t - \mu_t(\theta))'\Sigma^{-1}(Y_t - \mu_t(\theta))\right],$$
where $\mu_t(\theta) = A_0 + \sum_{j=1}^p A_j Y_{t-j}$.
Consider next the general location-scale model
$$Y_t = \mu(\Psi_t, \theta) + \sigma(\Psi_t, \theta)z_t,$$
$$E(z_t \mid \Psi_t) = 0 \quad \text{a.s.}, \qquad E(z_t^2 \mid \Psi_t) = 1 \quad \text{a.s.}$$
The models $\mu(\Psi_t, \theta)$ and $\sigma^2(\Psi_t, \theta)$ are correctly specified for $E(Y_t \mid \Psi_t)$ and $\operatorname{var}(Y_t \mid \Psi_t)$ if and only if $E(z_t \mid \Psi_t) = 0$ a.s. and $\operatorname{var}(z_t \mid \Psi_t) = 1$ a.s. We need not know the conditional distribution of $z_t \mid \Psi_t$ (in particular, we need not know the higher order conditional moments of $z_t$ given $\Psi_t$). An example of $\mu(\Psi_t, \theta)$ is the ARMA($p,q$) model in Example 2. We now give some popular models for $\sigma^2(\Psi_t, \theta)$. For notational simplicity, we put $\sigma_t^2 = \sigma^2(\Psi_t, \theta)$ and write $\varepsilon_t = \sigma_t z_t$.

Bollerslev's (1986) GARCH($p,q$) model:
$$\sigma_t^2 = \omega + \sum_{j=1}^p \beta_j\sigma_{t-j}^2 + \sum_{j=1}^q \alpha_j\varepsilon_{t-j}^2.$$
Question: How do we estimate $\theta$, the parameters appearing in the first two conditional moments?

A most popular approach is to assume that $z_t \mid \Psi_t \sim$ i.i.d. $N(0,1)$. Then $Y_t \mid \Psi_t \sim N(\mu(\Psi_t, \theta), \sigma^2(\Psi_t, \theta))$, and the pseudo conditional pdf of $Y_t$ given $\Psi_t$ is
$$f(y \mid \Psi_t; \theta) = \frac{1}{\sqrt{2\pi\sigma^2(\Psi_t,\theta)}}\, e^{-\frac{[y - \mu(\Psi_t,\theta)]^2}{2\sigma^2(\Psi_t,\theta)}}.$$
The pseudo log-likelihood is therefore
$$\sum_{t=1}^n \ln f(Y_t \mid \Psi_t; \theta) = -\frac{n}{2}\ln(2\pi) - \sum_{t=1}^n \ln\sigma(\Psi_t,\theta) - \frac{1}{2}\sum_{t=1}^n \frac{[Y_t - \mu(\Psi_t,\theta)]^2}{\sigma^2(\Psi_t,\theta)}.$$
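For concreteness, below is a minimal sketch (illustrative, not the notes' code; the simulated data, names, and the crude penalty for the parameter constraints are assumptions) of this Gaussian quasi-log-likelihood for a constant-mean GARCH(1,1), using the unconditional variance as the initial value for the conditional variance recursion (as discussed later in this section).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
# Simulate Y_t = mu + eps_t, eps_t = sqrt(h_t) z_t, h_t = w + a*eps_{t-1}^2 + b*h_{t-1}
n, mu, w, a, b = 3000, 0.1, 0.05, 0.10, 0.85
h, e = w / (1 - a - b), 0.0
y = np.zeros(n)
for t in range(n):
    h = w + a * e ** 2 + b * h
    e = np.sqrt(h) * rng.normal()
    y[t] = mu + e

def neg_quasi_loglik(params):
    """Negative Gaussian quasi-log-likelihood of a GARCH(1,1) with constant mean."""
    mu_, w_, a_, b_ = params
    if w_ <= 0 or a_ < 0 or b_ < 0 or a_ + b_ >= 1:
        return 1e10                       # crude way to impose the GARCH restrictions
    eps = y - mu_
    h_t = w_ / (1 - a_ - b_)              # initial value: unconditional variance
    e_prev, ll = 0.0, 0.0
    for t in range(n):
        h_t = w_ + a_ * e_prev ** 2 + b_ * h_t
        ll += -0.5 * (np.log(2 * np.pi * h_t) + eps[t] ** 2 / h_t)
        e_prev = eps[t]
    return -ll

res = minimize(neg_quasi_loglik, np.array([0.0, 0.1, 0.05, 0.80]),
               method="Nelder-Mead", options={"maxiter": 5000})
print("GARCH(1,1) QMLE (mu, omega, alpha, beta):", res.x)
```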
The i.i.d. N(0,1) innovation assumption does not affect the specification of the conditional mean $\mu(\Psi_t, \theta)$ and conditional variance $\sigma^2(\Psi_t, \theta)$, so it does not affect the consistency of the QMLE $\hat\theta$ for the true parameter value appearing in the conditional mean and conditional variance specifications. In other words, $\{z_t\}$ may not be i.i.d. N(0,1), but this does not affect the consistency of the Gaussian QMLE $\hat\theta$.

In addition to the i.i.d. N(0,1) assumption, the following two error distributions have also been popularly used in practice:
Standardized Student's $t$ distribution, $z_t = \sqrt{(\nu-2)/\nu}\; t(\nu)$.

The scale factor $\sqrt{(\nu-2)/\nu}$ ensures that $z_t$ has unit variance. The pdf of $z_t$ is
$$f(z) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\pi(\nu-2)}}\left(1 + \frac{z^2}{\nu-2}\right)^{-\frac{\nu+1}{2}}, \qquad -\infty < z < \infty,$$
where $\mu$, $a$ and $b$ are location, scale and shape parameters of the generalized error distribution, respectively. Note that both the standardized $t$-distribution and the generalized error distribution include N(0,1) as a special case.
Like the estimation of an ARMA($p,q$) model, we may have to choose initial values for some variables when estimating GARCH models. For example, in estimating a GARCH(1,1) model, we encounter the initial value problem for the conditional variance $\sigma_0^2$ and the innovation $\varepsilon_0$. One can set $\sigma_0^2$ equal to the unconditional variance $E(\sigma_t^2) = \omega/(1 - \alpha_1 - \beta_1)$ and set $\varepsilon_0 = 0$. We note that the ARMA model in Example 2 can be estimated via QMLE as a special case of the GARCH framework by setting $\sigma^2(\Psi_t, \theta) = \sigma^2$.

Although misspecification of $f(y_t \mid \Psi_t; \theta)$ may not affect the consistency of the QMLE (or the consistency of a subset of parameters) under suitable regularity conditions, it does affect the asymptotic variance (and so the efficiency) of the QMLE $\hat\theta$.

Remarks: The true parameter $\theta^o$ is not always consistently estimable by QMLE when the likelihood function is misspecified; in some cases, $\theta^o$ cannot be consistently estimated at all when the likelihood model is misspecified.
Lemma 9.12: Suppose Assumptions 9.4-9.6(i) hold. Then
$$E[S_t(\theta^*)] = 0,$$
where $E(\cdot)$ is taken over the true distribution of the data generating process.

Proof: Because $\theta^*$ maximizes $l(\theta)$ and is an interior point of $\Theta$, the FOC
$$\frac{d\, l(\theta^*)}{d\theta} = 0$$
holds. By differentiating $l(\theta) = E[\ln f(Y_t \mid \Psi_t; \theta)]$, we have
$$\frac{d}{d\theta}E[\ln f(Y_t \mid \Psi_t; \theta^*)] = 0.$$
Exchanging differentiation and integration yields the desired result:
$$E\!\left[\frac{\partial\ln f(Y_t \mid \Psi_t; \theta^*)}{\partial\theta}\right] = 0.$$
Remarks:
No matter whether the conditional distribution model $f(y \mid \Psi_t; \theta)$ is correctly specified, the score function $S_t(\theta^*)$ evaluated at $\theta^*$ always has mean zero. This is a consequence of the FOC of the maximization of $l(\theta)$. It is analogous to the FOC of the best linear least squares approximation, where one always has $E(X_t u_t) = 0$ with $u_t = Y_t - X_t'\beta^*$ and $\beta^* = [E(X_tX_t')]^{-1}E(X_tY_t)$.

When $\{Z_t = (Y_t, X_t')'\}$ is i.i.d., or $\{Z_t\}$ is not independent but $\{S_t(\theta^*)\}$ is an MDS (we note that $S_t(\theta^*)$ could still be an MDS when $f(Y_t \mid \Psi_t; \theta)$ is misspecified for the conditional distribution of $Y_t$ given $\Psi_t$), we have
$$\begin{aligned}
V^* = V(\theta^*) &\equiv \operatorname{avar}\!\left(n^{-1/2}\sum_{t=1}^n S_t(\theta^*)\right) \\
&= \lim_{n\to\infty}E\!\left[\left(n^{-1/2}\sum_{t=1}^n S_t(\theta^*)\right)\left(n^{-1/2}\sum_{\tau=1}^n S_\tau(\theta^*)\right)'\right] \\
&= E[S_t(\theta^*)S_t(\theta^*)'].
\end{aligned}$$
Thus, even when $f(y \mid \Psi_t; \theta)$ is a misspecified conditional distribution model, we do not have to use a long-run variance estimator for $V^*$ as long as $\{S_t(\theta^*)\}$ is an MDS.
Question: Can you give a time series example in which f (yt j t; ) is misspeci…ed but
fSt ( )g is MDS?
Answer: Consider a conditional distribution model which correctly speci…es the conditional
mean of Yt but misspeci…es the higher order conditional moments (e.g., conditional variance).
Answer: In the time series context, when the conditional pdf/pmf $f(y_t \mid \Psi_t; \theta)$ is misspecified, $\{S_t(\theta^*)\}$ may not be an MDS. In this case, we have
$$V^* \equiv \operatorname{avar}\!\left[\sqrt{n}\,\hat S(\theta^*)\right]
= \lim_{n\to\infty} n^{-1}\sum_{t=1}^n\sum_{\tau=1}^n E[S_t(\theta^*)S_\tau(\theta^*)']
= \sum_{j=-\infty}^{\infty} E[S_t(\theta^*)S_{t-j}(\theta^*)']
= \sum_{j=-\infty}^{\infty}\Gamma(j),$$
where
$$\Gamma(j) = E[S_t(\theta^*)S_{t-j}(\theta^*)'].$$
In other words, we have to estimate the long-run variance-covariance matrix for $V^*$ when $\{S_t(\theta^*)\}$ is not an MDS.
Question: If the model f (yj t ; ) is misspeci…ed for the conditional distribution of Yt given t,
do we have the conditional information matrix equality?
Generally, no. That is, we generally have neither $E[S_t(\theta^*) \mid I_{t-1}] = 0$ nor
$$E[S_t(\theta^*)S_t(\theta^*)' \mid \Psi_t] + E\!\left[\frac{\partial^2\ln f(Y_t\mid\Psi_t;\theta^*)}{\partial\theta\,\partial\theta'}\,\Big|\,\Psi_t\right] = 0,$$
where $E(\cdot \mid \Psi_t)$ is taken under the true conditional distribution, which differs from the model $f(y_t \mid \Psi_t; \theta)$ when $f(y_t \mid \Psi_t; \theta)$ is misspecified. Please check.
Question: What is the impact of the failure of the MDS property for the score function and
the failure of the conditional information matrix equality?
Under suitable regularity conditions, the QMLE satisfies
$$\sqrt{n}(\hat\theta - \theta^*) \xrightarrow{d} N(0, H^{*-1}V^*H^{*-1}),$$
where $V^* = V(\theta^*) \equiv \operatorname{avar}[\sqrt{n}\,\hat S(\theta^*)]$ and $H^* = H(\theta^*) \equiv E\!\left[\frac{\partial^2\ln f(Y_t\mid\Psi_t;\theta^*)}{\partial\theta\,\partial\theta'}\right]$.
j t :
Remarks:
Without the MDS property of the score function, we have to estimate $V^* \equiv \operatorname{avar}[\sqrt{n}\,\hat S(\theta^*)]$ by (e.g.) a Newey-West (1987, 1994) type estimator in the time series context. Without the conditional information matrix equality (even if the MDS property holds), we cannot simplify the asymptotic variance of the QMLE from $H^{*-1}V^*H^{*-1}$ to $-H^{*-1}$, even if the score function is i.i.d. or an MDS.

In a certain sense, the MDS property of the score function is analogous to serial uncorrelatedness of a regression disturbance, and the information matrix equality is analogous to conditional homoskedasticity.

Yes. The asymptotic variance of the MLE, equal to $-H_o^{-1}$, the inverse of the negative Hessian matrix, achieves the Cramer-Rao lower bound and is therefore asymptotically most efficient. On the other hand, the asymptotic variance $H^{*-1}V^*H^{*-1}$ of the QMLE is not the same as the asymptotic variance $-H_o^{-1}$ of the MLE and thus does not achieve the Cramer-Rao lower bound; the QMLE is asymptotically less efficient than the MLE. This is the price one has to pay for using a misspecified pdf/pmf model, even though some model parameters can still be consistently estimated.
A natural estimator of $H^*$ is
$$\hat H(\hat\theta) = n^{-1}\sum_{t=1}^n \frac{\partial^2\ln f(Y_t\mid\Psi_t;\hat\theta)}{\partial\theta\,\partial\theta'}.$$
The UWLLN for $\{H_t(\theta)\}$ and the continuity of $H(\theta)$ ensure that $\hat H(\hat\theta) \xrightarrow{p} H^*$.

Next, how do we estimate $V^* = \operatorname{avar}\!\left[n^{-1/2}\sum_{t=1}^n S_t(\theta^*)\right]$?
Case I: $\{Z_t = (Y_t, X_t')'\}$ is i.i.d., or $\{Z_t\}$ is not independent but $\{S_t(\theta^*)\}$ is an MDS. In this case,
$$V^* = E[S_t(\theta^*)S_t(\theta^*)'],$$
so we can use
$$\hat V = n^{-1}\sum_{t=1}^n S_t(\hat\theta)S_t(\hat\theta)'.$$

Case II: When $\{Z_t\}$ is not independent, $\{S_t(\theta^*)\}$ may not be an MDS. In this case, we can use the kernel method
$$\hat V = \sum_{j=1-n}^{n-1} k(j/p)\,\hat\Gamma(j),$$
where $k(\cdot)$ is a kernel function, $p$ is a bandwidth (lag truncation order), and
$$\hat\Gamma(j) = n^{-1}\sum_{t=j+1}^n S_t(\hat\theta)S_{t-j}(\hat\theta)' \quad \text{if } j \ge 0,$$
with $\hat\Gamma(j) = \hat\Gamma(-j)'$ for $j < 0$.
Lemma 9.14 [Asymptotic Variance Estimator for QMLE]: Suppose Assumptions 9.1-9.7 hold. Then as $n \to \infty$,
$$\hat H^{-1}(\hat\theta)\,\hat V\,\hat H^{-1}(\hat\theta) \xrightarrow{p} H^{*-1}V^*H^{*-1}.$$

Consider the null hypothesis
$$H_0: R(\theta^*) = r,$$
where $R(\theta)$ is a $J \times 1$ continuously differentiable vector function with the $J \times K$ derivative matrix $R'(\theta)$ of full rank, and $r$ is a $J \times 1$ vector. The robust Wald test statistic is
$$\hat W = n[R(\hat\theta) - r]'\left[R'(\hat\theta)\,\hat H^{-1}(\hat\theta)\hat V\hat H^{-1}(\hat\theta)\,R'(\hat\theta)'\right]^{-1}[R(\hat\theta) - r] \xrightarrow{d} \chi^2_J.$$
Remarks:
Only the unconstrained QMLE ^ is used in constructing the robust Wald test statistic. The
Wald test statistic under model misspeci…cation is similar in structure to the Wald test in linear
regression modeling that is robust to conditional heteroskedasticity (under the i.i.d. or MDS
assumption) or that is robust to conditional heteroskedasticity and autocorrelation (under the
non-MDS assumption).
Question: Can we use the LM test principle for H0 when f (yj t; ) is misspeci…ed?
p
Yes, we can still derive the asymptotic distribution of n ~ ; with a suitable (i.e., robust)
asymptotic variance, which of course will be generally di¤erent from that under correct model
speci…cation.
Recall that from the FOC of the constrained estimator $\tilde\theta$,
$$\hat S(\tilde\theta) - R'(\tilde\theta)'\tilde\lambda = 0, \qquad R(\tilde\theta) - r = 0,$$
and, as before,
$$\sqrt{n}\tilde\lambda = \left[R'(\bar\theta_d)\hat H^{-1}(\bar\theta_c)R'(\tilde\theta)'\right]^{-1}R'(\bar\theta_d)\hat H^{-1}(\bar\theta_c)\sqrt{n}\hat S(\theta^*)$$
for $n$ sufficiently large. By the CLT, we have $\sqrt{n}\hat S(\theta^*) \xrightarrow{d} N(0, V^*)$, where $V^* = \operatorname{avar}[\sqrt{n}\hat S(\theta^*)]$. Using the Slutsky theorem, we obtain
$$\sqrt{n}\tilde\lambda \xrightarrow{d} N(0, \Omega^*),$$
where
$$\Omega^* = \left[R'(\theta^*)H^{*-1}R'(\theta^*)'\right]^{-1}R'(\theta^*)H^{*-1}V^*H^{*-1}R'(\theta^*)'\left[R'(\theta^*)H^{*-1}R'(\theta^*)'\right]^{-1}.$$
A consistent estimator of $\Omega^*$ is
$$\tilde\Omega = \left[R'(\tilde\theta)\hat H^{-1}(\tilde\theta)R'(\tilde\theta)'\right]^{-1}\left[R'(\tilde\theta)\hat H^{-1}(\tilde\theta)\tilde V\hat H^{-1}(\tilde\theta)R'(\tilde\theta)'\right]\left[R'(\tilde\theta)\hat H^{-1}(\tilde\theta)R'(\tilde\theta)'\right]^{-1}.$$
With this assumption, the LM test statistic involves estimation of the conditional pdf/pmf model $f(y \mid \Psi_t; \theta)$ only under the null hypothesis $H_0$.

Theorem 9.16 [QMLE-based LM Test]: Suppose Assumptions 9.1-9.6 and 9.8 hold, and $H_0: R(\theta^*) = r$ holds. Then as $n \to \infty$,
$$LM \equiv n\tilde\lambda'\tilde\Omega^{-1}\tilde\lambda \xrightarrow{d} \chi^2_J.$$
Remarks:

The $LM_0$ test statistic under MLE and the $LM$ test statistic under QMLE differ in the sense that they use different asymptotic variance estimators. The LM test statistic here is robust to misspecification of the conditional pdf/pmf model $f(y \mid \Psi_t; \theta)$.

Question: Could we use the likelihood ratio (LR) test under model misspecification?

No. This is because, in deriving the asymptotic distribution of the LR test statistic, we have used the MDS property of the score function $\{S_t(\theta^*)\}$ and the information matrix equality ($V^* = -H^*$), which may not hold when the conditional distribution model $f(y \mid \Psi_t; \theta)$ is misspecified. If the MDS property of the score function or the information matrix equality fails, the LR statistic is not asymptotically $\chi^2_J$ under $H_0$. This is similar to the fact that $J$ times the $F$-test statistic does not converge to $\chi^2_J$ when there exists serial correlation in $\{\varepsilon_t\}$ or when there exists conditional heteroskedasticity.

In many applications (e.g., estimating CAPM models), both GMM and QMLE can be used to estimate the same parameter vector. In general, by making fewer assumptions on the DGP, GMM will be less efficient than QMLE if the pseudo-model likelihood function is close to the true conditional distribution of $Y_t$ given $\Psi_t$.
Question: How to check whether a conditional distribution model f (yj t; ) is correctly speci-
…ed?
We now introduce a number of speci…cation tests for conditional distributional model f (yj t; ):
Case I: When $\{Z_t = (Y_t, X_t')'\}$ is i.i.d., we have
$$\sqrt{n}(\hat\theta - \theta^o) \xrightarrow{d} N(0, H_o^{-1}V_oH_o^{-1}),$$
where
$$V_o = E[S_t(\theta^o)S_t(\theta^o)'].$$
In the i.i.d. random sample context, White (1982) proposes a specification test for $f(y \mid \Psi_t; \theta) = f(y \mid X_t; \theta)$ by checking whether the information matrix equality
$$E[S_t(\theta^o)S_t(\theta^o)' + H_t(\theta^o)] = 0$$
holds. This is implied by correct model specification. If the information matrix equality does not hold, then there is evidence of model misspecification for the conditional distribution of $Y$ given $X$. Define the $\frac{K(K+1)}{2} \times 1$ sample average
$$\hat m(\theta) = \frac{1}{n}\sum_{t=1}^n m_t(\theta),$$
where
$$m_t(\theta) = \operatorname{vech}\!\left[S_t(\theta)S_t(\theta)' + H_t(\theta)\right].$$
By a Taylor expansion and the CLT, $\sqrt{n}\,\hat m(\hat\theta) \xrightarrow{d} N(0, W_o)$,
where $D_o \equiv D(\theta^o) = E\!\left[\frac{\partial m_t(\theta^o)}{\partial\theta'}\right]$, and the asymptotic variance
$$W_o = \operatorname{var}\!\left[m_t(\theta^o) - D_oH_o^{-1}S_t(\theta^o)\right].$$
It follows that a test statistic can be constructed by using the quadratic form
$$M = n\,\hat m(\hat\theta)'\hat W^{-1}\hat m(\hat\theta) \xrightarrow{d} \chi^2_{K(K+1)/2}.$$
Question: If the information matrix equality holds, is the model f (yjXt ; ) correctly speci…ed
for the conditional distribution of Yt given Xt ?
Answer: No. Correct model speci…cation implies the information matrix equality but the con-
verse may not be true. The information matrix equality is only one of many (in…nite) implications
of the correct speci…cation for f (yj t ; ):
Although White (1982) considers i.i.d. random samples only, his IM test is applicable for
both cross-sectional and time series models as long as the score function fSt ( o )g is an MDS.
In a time series context, White (1994) proposes a dynamic information matrix test that
essentially checks the MDS property of the score function fSt ( o )g:
$$E[S_t(\theta^o) \mid \Psi_t] = 0.$$
Define $m_t(\theta) = S_t(\theta)\otimes W_t(\theta)$, where $W_t(\theta) = [S_{t-1}(\theta)', S_{t-2}(\theta)', \ldots, S_{t-p}(\theta)']'$ and $\otimes$ is the Kronecker product. Then the MDS property implies
$$E[m_t(\theta^o)] = 0.$$
This test essentially checks whether $\{S_t(\theta^o)\}$ is a white noise process up to lag order $p$. If $E[m_t(\theta^o)] \ne 0$, i.e., if there exist serial correlations in $\{S_t(\theta^o)\}$, then there is evidence of model misspecification.

White (1994) considers the sample average
$$\hat m = n^{-1}\sum_{t=1}^n m_t(\hat\theta)$$
and checks whether it is close to zero. White (1994) develops a so-called dynamic information matrix test by using a suitable quadratic form of $\sqrt{n}\,\hat m$ that is asymptotically chi-square distributed under correct dynamic model specification.
Question: If fSt ( o )g is MDS, is f (yj t; ) correctly speci…ed for the conditional distribution
of Yt given t ?
No. Correct model speci…cation implies that fSt ( o )g is a MDS but the converse may not be
true. It is possible that St ( o ) is an MDS even when the model f (yj t ; ) is misspeci…ed for
the conditional distribution of Yt given t : A better approach is to test the conditional density
model itself, rather than the properties of its derivatives (e.g., the MDS of the score function or
the information matrix equality).
Next, we consider a test that directly checks the conditional distribution of $Y_t$ given $\Psi_t$.

Hong and Li's (2005) Nonparametric Test for Time Series Conditional Distribution Models

Define the dynamic probability integral transform $U_t(\theta) = \int_{-\infty}^{Y_t} f(y \mid \Psi_t; \theta)\, dy$.

Lemma 9.17: If $f(y \mid \Psi_t; \theta^o)$ coincides with the true conditional pdf of $Y_t$ given $\Psi_t$, then $\{U_t(\theta^o)\} \sim$ i.i.d. U[0,1].

Thus, one can test whether $\{U_t(\theta^o)\}$ is i.i.d. U[0,1]. If it is not, there exists evidence of model misspecification.

Question: Suppose $\{U_t(\theta^o)\}$ is i.i.d. U[0,1]; is the model $f(y \mid \Psi_t; \theta)$ correctly specified for the conditional distribution of $Y_t$ given $\Psi_t$?

For a univariate time series (so that $\Psi_t = \{Y_{t-1}, Y_{t-2}, \ldots\}$), the i.i.d. U[0,1] property holds if and only if the conditional pdf model $f(y_t \mid \Psi_t; \theta)$ is correctly specified.
Hong and Li (2005) use a nonparametric kernel estimator for the joint density of $\{U_t(\theta^o), U_{t-j}(\theta^o)\}$ and compare the joint density estimator with $1 = 1\times 1$, the product of the marginal densities of $U_t(\theta^o)$ and $U_{t-j}(\theta^o)$ under correct model specification. The test statistic follows an asymptotic N(0,1) distribution. See Hong and Li (2005) for more discussion.
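As a simple illustration of the underlying idea (a crude diagnostic with simulated data and names of our own choosing, assuming a fitted Gaussian AR(1) model; this is not the Hong-Li (2005) statistic itself), one can compute the dynamic probability integral transforms and informally check uniformity and serial dependence:

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(8)
# Data from an AR(1) with Student-t errors; the fitted model assumes Gaussian errors,
# so the conditional distribution model is misspecified (in the tails).
n, phi = 3000, 0.5
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.standard_t(df=4)

# "Fitted" Gaussian AR(1) model: Y_t | I_{t-1} ~ N(phi_hat * Y_{t-1}, s2_hat)
x_lag, y_cur = y[:-1], y[1:]
phi_hat = (x_lag @ y_cur) / (x_lag @ x_lag)
s2_hat = np.mean((y_cur - phi_hat * x_lag) ** 2)

# Dynamic probability integral transforms U_t = F(Y_t | I_{t-1}; theta_hat)
U = norm.cdf((y_cur - phi_hat * x_lag) / np.sqrt(s2_hat))

print("KS test of U ~ Uniform[0,1]:", kstest(U, "uniform"))
print("lag-1 autocorrelation of U:", np.corrcoef(U[1:], U[:-1])[0, 1])
```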
Question: How does the industrial bureau decide to use the competitive auction to select …rm
managers?
Estimation Results:
X1t X2t n
0:2769 0:2467 645 ;
( 7:485) ( 7:584)
where ** indicates signi…cance at the 5% level. These results suggest that the poor-performing
and/or smaller …rms are more likely to have their managers selected by competitive auction.
We are interested in modeling the conditional probability distribution of the short-term in-
terest rate. There are two popular discrete-time models for the spot interest rate: one is the
GARCH model, and the other is the Markov chain regime-switching model.
52
8 1: GARCH(1,1)-Level E¤ect with an i.i.d. N(0,1)
Model
1=2
innovation:
1 2
>
< rt = 1 rt 1 + 0 + 1 rt 1 + 2 rt 2 + rt 1 ht zt ;
ht = 0 + 1 ht 1 + 2 ht 1 zt2 1 ;
>
:
fzt g i:i:d:N (0; 1):
Here, the conditional mean of the interest rate change is a nonlinear function of the interest
rate level:
1 2
t = E( rt jIt 1 ) = 1 rt 1 + 0 + 1 rt 1 + 2 rt 2 :
This speci…cation can capture nonlinear dynamics in the interest rate movement.
The conditional variance model of the interest rate change is
2 2 2
t = var( rt jIt 1 ) = rt 1 ht ;
where rt 1 captures the so-called “level e¤ect” in the sense that when > 0; volatility will
increase when the interest rate level is high. On the other hand, the GARCH component ht
captures volatility clustering.
Estimation Results
Parameter Estimates for the GARCH Model (with nonlinear drift and level e¤ect)
1 -0.0984 0.1249
2 0.0000 0.0004
1.0883 0.0408
Log-Likelihood 654.13
53
Model 2: Regime-Switching Model with GARCH and Level E¤ects
(St 1) 1=2
rt = (St 1 ) + (St 1 ) rt 1 + (St 1 ) rt 1t ht zt 1 ;
2
ht = 0 + ht 1 1 + 2 zt 1 ;
fzt g i:i:d:N (0; 1);
where the state variable St is a latent process that is assumed to follow a two-state Markov chain
with time-varying transition matrix, as speci…ed in Ang and Bekaert (1998):
1
P (St = 1jSt 1 = 1) = [1 + exp( a01 a11 rt 1 )] ;
1
P (St 1 = 0jSt 1 = 0) = [1 + exp( a00 a10 rt 1 )] :
Question: What is the model likelihood function? That is, what is the conditional density of
rt given It 1 = frt 1 ; rt 2 ; :::g; the observed information set available at time t 1?
The di¢ culty arises because the state variable St is not observable. See Hamilton (1994,
Chapter 22) for treatment.
Estimation Results
Parameter estimates for the Regime Switching Model (with GARCH and level e¤ect)
0 1.5378 1.5378
0 -1.0646 0.4207
1 -0.0013 0.0351
1 -0.0076 0.0484
1 0.3355 0.0483
0 0.3566 0.0693
1 0.0064 0.0512
b1 0.0224 0.0034
b2 0.7810 0.0254
Log-Likelihood 2712.97
54
Empirical III: Volatility Models of Foreign Exchange Returns
Suppose one is interested in studying volatility spillover between two exchange rates, the German Deutschmark and the Japanese Yen. A first step is to specify a univariate volatility model for the German Deutschmark and the Japanese Yen respectively. Hong fits an AR(3)-GARCH(1,1) model to weekly German Deutschmark exchange rate changes and Japanese Yen exchange rate changes:

Model: AR(3)-GARCH(1,1) with i.i.d. N(0,1) innovations,
$$\begin{cases}
X_t = \mu_t + \varepsilon_t, \\
\mu_t = b_0 + \sum_{j=1}^3 b_j X_{t-j}, \\
\varepsilon_t = h_t^{1/2} z_t, \\
h_t = \omega + \alpha\varepsilon_{t-1}^2 + \beta h_{t-1}, \\
\theta = (b_0, b_1, b_2, b_3, \omega, \alpha, \beta)'.
\end{cases}$$
Assuming that $\{z_t\} \sim$ i.i.d. $N(0,1)$, we obtain the following QMLE.

Data: First week of 1976:1 to last week of 1995:11, a total of 1,039 observations.
Estimation results
DM Y EN
9.7 Conclusion
Conditional probability distribution models have wide applications in economics and …nance.
For some applications, one is required to specify the entire distribution of the underlying process.
If the distribution model is correct, the resulting estimator ^ which maximizes the likelihood
function is called MLE.
For some other applications, on the other hand, one is only required to specify certain aspects
(e.g., conditional mean and conditional variance) of the distribution. One important example is
volatility modeling for …nancial time series. To estimate model parameters, one usually makes
some auxiliary assumptions on the distribution that may be incorrect so that one can estimate
by maximizing the pseudo likelihood function. This is called QMLE. MLE is asymptotically
more e¢ cient than QMLE, because the asymptotic variance of MLE attains the Cramer-Rao
lower bound.
The likelihood function of a correctly speci…ed conditional distributional model has di¤er-
ent properties from that of a misspeci…ed conditional distributional model. In particular, for a
correctly speci…ed distributional model, the score function is an MDS and the conditional in-
formation matrix equality holds. As a consequence, the asymptotic distributions of MLE and
QMLE are different (more precisely, their asymptotic variances are different). In particular, the asymptotic variance of the MLE is analogous to that of the OLS estimator under MDS regression errors with conditional homoskedasticity, while the asymptotic variance of the QMLE is analogous to that of the OLS estimator under possibly non-MDS errors with conditional heteroskedasticity and autocorrelation.

Hypothesis tests can be developed using the MLE or the QMLE. For hypothesis testing under a correctly specified conditional distributional model, the Wald test, Lagrange Multiplier test, and Likelihood Ratio test can be used. When a conditional distributional model is misspecified,
Likelihood ratio tests are valid only when the distribution model is correctly speci…ed. The
reasons are that they exploit the MDS property of the score function and the information matrix
equality which may not hold under model misspeci…cation.
It is important to test correct speci…cation of a conditional distributional model. We introduce
some speci…cation tests for conditional distributional models under i.i.d. observations and time
series observations respectively. In particular, White (1982) proposes an Information Matrix
test for i.i.d. observations and White (1994) proposes a dynamic information matrix test that
essentially checks the MDS property of the score function of a correctly speci…ed conditional
distribution model with time series observations.
EXERCISES
9.1. For the probit model $P(Y_t = y \mid X_t) = \Phi(X_t'\beta^o)^y\,[1 - \Phi(X_t'\beta^o)]^{1-y}$, where $y = 0, 1$, show that
(a) $E(Y_t \mid X_t) = \Phi(X_t'\beta^o)$;
(b) $\operatorname{var}(Y_t \mid X_t) = \Phi(X_t'\beta^o)[1 - \Phi(X_t'\beta^o)]$.
9.2. For a censored regression model, show that E(Xt "t jYt > c) 6= 0: Thus, the OLS estimator
based on a censored random sample cannot be consistent for the true model parameter o :
9.3. Suppose f (yj ; ) is a conditional pdf model for Y given ; where 2 ; a parameter
9.4. (a) Suppose f (yj ; ); 2 ; is a correctly speci…ed model for the conditional probability
density of Y given ; such that f (yj ; o ) coincides with the true conditional probability density
of Y given : We assume that f (Y j ; ) is continuously di¤erentiable with respect to and o
is an interior point in . Please show that
o
@ ln f (Y j ; )
E = 0:
@
(b) Suppose Part (a) is true. Can we conclude that f (yj ; ) is correctly speci…ed for the
conditional distribution of Y given ? If yes, give your reasoning. If not, give a counter example.
9.5. Suppose f (yjx; ); 2 RK ; is a correctly speci…ed model for the conditional probability
density of Y given X; such that for some parameter value o ; f (yjx; o ) coincides with the
true conditional probability density of Y given X: We assume that f (Y jx; ) is continuously
di¤erentiable with respect to and o is an interior point in . Please show that
o o o
@ ln f (Y jX; ) @ ln f (Y jX; ) @ 2 ln f (Y jX; )
E X +E X = 0;
@ @ 0 @ @ 0
2
where @ @ln f is a K 1 vector, @@ln0f is the transpose of @ @ln f ; @@ ln
@ 0
f
is a K K matrix, and the
expectation E( ) is taken under the true conditional distribution of Y given X.
2
9.6. Put Vo = E[St ( o )St ( o )0 ] and Ho = E[ @@ St ( o )] = E[ @ @@ 0 ln fYt j t (yj t ; o )]; where
St ( ) = @@ ln f (Yt j t ; ); and o = arg min 2 l( ) = E[ln fYt j t (Yt j t ; )]: Is Ho 1 Vo Ho 1
( Ho 1 ) always positive semi-de…nite? Give your reasoning and any necessary regularity condi-
p
tions. Note that the …rst term Ho 1 Vo Ho 1 is the formula for the asymptotic variance of n ^ QM LE
p
and the second term Ho 1 is the formula for the asymptotic variance of n ^ M LE :
9.7. Suppose a conditional pdf/pmf model f (yjx; ) is misspeci…ed for the conditional distrib-
ution of Y given X; namely, there exists no 2 such that f (yjx; ) coincides with the true
57
conditional distribution of Y given X: Show that generally,
o o o
@ ln f (Y jX; ) @ ln f (Y jX; ) @ 2 ln f (Y jX; )
E X +E X = 0;
@ @ 0 @ @ 0
does not hold, where o satis…es Assumptions 9.4 and 9.5. In other words, the conditional infor-
mation matrix equality generally does not hold when the conditional pdf/pmf model f (yjx; ) is
misspeci…ed for the conditional distribution of Y given X:
Assumption 7.1: fYt ; Xt0 g0 is a stationary ergodic process, and f (Yt j t ; ) is a correctly speci…ed
conditional probability density model of Yt given t = (Xt0 ; Z t 10 )0 ; where Z t 1 = (Zt0 1 ; Zt0 2 ; ; Z10 )0
and Zt = (Yt ; Xt0 )0 : For each ; ln f (Yt j t ; ) is measurable of the data, and for each t; ln f (Yt j t ; )
is twice continuously di¤erentiable with respect to 2 ; where is a compact set:
Assumption 7.3: (i) o = arg max 2 l( ) is the unique maximizer of l( ) over ; and (ii) o
is an interior point of .
p X
n
^ o) = n
nS( 1=2
St ( o )
t=1
where the K K Hessian matrix H( ) E [Ht ( )] is symmetric, …nite and nonsingular, and is
continuous in 2 :
The maximum likelihood estimator is de…ned as ^ = arg max 2 ^ln ( ); where ^ln ( )
P
n 1 nt=1 ln f (Yt j t ; ): Suppose we have had ^ ! o almost surely, and this consistency re-
sult can be used in answering the following questions in parts (a)–(d). Show your reasoning in
each step.
(a) Find the …rst order condition of the MLE.
p
(b) Derive the asymptotic distribution of n( ^ o
): Note that the asymptotic variance of
p ^ o
n( ) should be expressed as the Hessian matrix H( o ):
58
p
(c) Find a consistent estimator for the asymptotic variance of n( ^ o
) and justify why it
is consistent.
(d) Construct a Wald test statistic for the null hypothesis H0 : R( o ) = r; where r is a J 1
constant vector, and R( ) is a J 1 vector with the derivative R0 ( ) is continuous in and R0 ( o )
is of full rank. Derive the asymptotic distribution of the Wald test under H0 :
X
n
^l( ) = n 1
ln f (Yt jXt ; )
t=1
1 1 X
n
1
= 2
ln(2 ) 2
n (Yt Xt0 )2 :
2 2 t=1
o
Suppose H0 : R = r is the hypothesis of interest.
(a) Show
9.10. Show the dynamic probability integral transforms fUt ( o )g is i.i.d.U[0,1] if the conditional
probability density model f (yj t ; ) is correctly speci…ed for the conditional distribution of Yt
given t :
59
CHAPTER 10 CONCLUSION
Abstract: In this chapter, we …rst review what we have covered in the previous chapters, and
then discuss other econometric courses needed for various …elds of economics and …nance.
In this chapter, we first summarize what we have learned in this book.
The modern econometric theory developed in this book is built upon the following funda-
mental axioms:
Any economy can be viewed as a stochastic process governed by some probability law.
Any economic phenomena can be viewed as a realization of the stochastic economic process.
The probability law of the data generating process can be called the “law of economic motions.”
The objective of econometrics is to infer the probability law of economic motions using observed
data, and then use the obtained knowledge to explain what has happened, to predict what will
happen, and to test economic theories and economic hypotheses.
Suppose the conditional pdf f (yt j t ) of Yt given t = (Xt ; Z t 1 ); is available. Then we can
obtain various attributes of the conditional distribution of Yt given t , such as
conditional mean;
conditional variance;
conditional skewness;
conditional kurtosis;
conditional quantile.
An important question in economic analysis is: what aspect of the conditional pdf will be im-
portant in economics and …nance? Generally speaking, the answer is dictated by the nature of
the economic problem one has at hand. For example, the e¢ cient market hypothesis states that
the conditional expected asset return given the past information is equal to the long-run market
1
average return; rational expectations theory suggests that conditional expectational errors given
the past information should be zero. In unemployment duration analysis, one should model the
entire conditional distribution of the unemployment duration given the economic characteristics
of the unemployed workers.
It should be emphasized that the conditional pdf or its various aspects only indicate a predic-
tive relationship between economic variables, that is, when one can use some economic variables
to predict other variables. The predictive relationship may or may not be the causal relationship
between or among economic variables, which is often of central interest to economists. Economic
theory often hypothesizes a causal relationship and such economic theory is used to interpret the
predictive relationship as a causal relationship.
Economic theory or economic model is not a general framework that embeds an econometric
model. In contrast, economic theory is often formulated as a restriction on the conditional pdf
or its certain aspect. Such a restriction can be used to validate economic theory, and to improve
forecasts if the restriction is valid or approximately valid.
Question: What is the role that economic theory plays in economic modeling?
Indication of the nature (e.g., conditional mean, conditional variance, etc) of the relation-
ship between Yt and Xt : Which moments are important and of interest?
In summary, any economic theory can be formulated as a restriction on the conditional probability distribution of the economic stochastic process. Economic theory plays an important role in simplifying statistical relationships so that a parsimonious econometric model can eventually capture the essential economic relationships.
Motivated by the fact that economic theory often has implication on and only on the con-
ditional mean of economic variables of interest, we …rst develop a comprehensive econometric
theory for linear regression models where by linearity we mean the conditional mean is linear in
parameters and not necessarily linear in explanatory variables. We start in Chapter 3 with the
classical linear regression model, for which we develop a …nite sample statistical theory when
the regression disturbance is i.i.d. normally distributed, and is independent of the regressor.
The normality assumption is crucial for the …nite sample statistical theory. The essence of the
2
classical theory for linear regression models is i.i.d., which implies conditional homoskedasticity
and serial uncorrelatedness, which ensures the BLUE property for the OLS estimator. When
conditional heteroskedasticity and autocorrelation exist, the GLS estimator illustrates how to re-
store the BLUE property by correcting conditional heteroskedasticity and di¤erencing out serial
correlation.
With the classical linear regression model as a benchmark, we have developed a modern econo-
metric theory for linear regression models by relaxing the classical assumptions in subsequent
chapters. First of all, we relax the normality assumption in Chapter 4. This calls for asymptotic
analysis because …nite sample theory is no longer possible. It is shown that when the sample size
is large, the classical results are still approximately applicable for linear regression models with
independent observations under conditional homoskedasticity. However, under conditional het-
eroskedasticity, the classical results, such as the popular t-test and F -test statistics, are no longer
applicable, even if the sample size goes to in…nity. This is due to the fact that the asymptotic
variance of the OLS estimator has a di¤erent structure under conditional heteroskedasticity. We
need to use White’s (1980) heteroskedasticity-consistent variance-covariance estimator and use it
to develop robust hypothesis tests. It is therefore important to test conditional homoskedasticity,
and White (1980) develops a regression-based test procedure.
The asymptotic theory developed for linear regression models with independent observations
in Chapter 4 is extended to linear regression models with time series observations. This covers
two types of regression models: one is called a static regression model where the explanatory
variables or regressors are exogenous variables. The other is called a dynamic regression model
whose regressors include lagged dependent variables and exogenous variables. It is shown in Chapter 5 that the asymptotic theory of Chapter 4 remains applicable when the regression disturbance is a martingale difference sequence. Because of its importance, we introduce tests
for martingale di¤erence sequence of regression disturbances by checking serial correlation in the
disturbance. The tests include the popular Lagrange multiplier test for serial correlation. We
have also considered a Lagrange multiplier test for autoregressive conditional heteroskedasticity
(ARCH) and discussed its implication on the inference of static and dynamic regression models
respectively.
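As a rough illustration of the regression-based form such tests take (the auxiliary regressions below are generic sketches; the exact constructions and their asymptotic justifications are given in Chapter 5), the Lagrange multiplier test for serial correlation of order $p$ runs the auxiliary regression of the OLS residual $e_t$ on the original regressors and $p$ lagged residuals,
$$e_t = X_t'\gamma + \sum_{j=1}^{p}\alpha_j e_{t-j} + v_t,$$
and rejects the null of no serial correlation when $nR^2$ from this regression exceeds the $\chi^2_p$ critical value. The ARCH Lagrange multiplier test has the same structure with squared residuals: regress $e_t^2$ on an intercept and $e_{t-1}^2,\ldots,e_{t-q}^2$, and compare $nR^2$ with the $\chi^2_q$ critical value.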
For many static regression models, it is evident that the regression disturbance displays serial
correlation. This affects the asymptotic variance of the OLS estimator. When serial correlation
has a known structure up to a few unknown parameters, we can use the Cochrane-Orcutt procedure
to obtain an asymptotically efficient estimator of the regression parameters. When serial correlation
is of unknown form, we have to use a long-run variance estimator to estimate the asymptotic
variance of the OLS estimator. A leading example is a kernel-based estimator such as the
Newey-West variance estimator. With such a variance estimator, robust test procedures for
hypotheses of interest can be constructed. These are discussed in Chapter 6.
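For concreteness, here is a minimal sketch of the kernel-based estimator referred to above, in generic notation (the bandwidth $m$ and the symbols $\hat\Gamma_j$ are illustrative and may differ from those used in Chapter 6). With OLS residuals $e_t$, the Newey-West estimator of the long-run variance $V = \sum_{j=-\infty}^{\infty} E(\varepsilon_t \varepsilon_{t-j} X_t X_{t-j}')$ uses Bartlett kernel weights:
$$\hat V = \hat\Gamma_0 + \sum_{j=1}^{m}\left(1-\frac{j}{m+1}\right)\left(\hat\Gamma_j + \hat\Gamma_j'\right), \qquad \hat\Gamma_j = \frac{1}{n}\sum_{t=j+1}^{n} e_t e_{t-j} X_t X_{t-j}'.$$
The bandwidth $m$ grows with the sample size but at a slower rate, and the declining Bartlett weights ensure that $\hat V$ is positive semi-definite in finite samples.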
The estimation and inference of linear regression models are complicated when the condition
$E(\varepsilon_t|X_t) = 0$ does not hold, which can arise due to measurement errors, simultaneous equations
bias, omitted variables, and so on. In Chapter 7 we discuss a popular method, two-stage
least squares, to estimate model parameters in such scenarios.
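To recall the form of this estimator (generic matrix notation with an instrument matrix $Z$; the symbols are illustrative), the two-stage least squares estimator first projects the regressors onto the instruments and then applies OLS with the fitted values:
$$\hat\beta_{2SLS} = \left[X'Z(Z'Z)^{-1}Z'X\right]^{-1} X'Z(Z'Z)^{-1}Z'Y.$$
It is consistent provided the instruments are valid, $E(\varepsilon_t|Z_t) = 0$, and relevant, i.e., sufficiently correlated with the endogenous regressors.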
Chapter 8 introduces the GMM method, which is particularly suitable for estimating both
linear and nonlinear econometric models that can be characterized by a set of moment conditions.
A prime economic example is the rational expectations theory, which is often characterized by
an Euler equation. In fact, the GMM method provides a convenient framework in which to view most
econometric estimators, including the least squares and instrumental variables estimators.
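As a sketch of the general setup (in generic notation; the moment function $g$ and weighting matrix $\hat W$ below are illustrative), suppose the model implies the moment conditions $E[g(Z_t,\theta^0)] = 0$. The GMM estimator minimizes a quadratic form in the sample moments:
$$\hat\theta = \arg\min_{\theta}\ \bar g_n(\theta)'\,\hat W\,\bar g_n(\theta), \qquad \bar g_n(\theta) = \frac{1}{n}\sum_{t=1}^{n} g(Z_t,\theta),$$
where $\hat W$ is a positive definite weighting matrix. Choosing $\hat W$ as a consistent estimator of the inverse of the long-run variance of the moments delivers the asymptotically efficient GMM estimator; OLS, IV, and 2SLS arise as special cases for particular choices of $g$ and $\hat W$.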
Chapter 9 discusses conditional probability distribution models and other econometric models
that can be estimated by maximum likelihood or quasi-maximum likelihood methods. Conditional distribution
models have found wide applications in economics and finance, and MLE is the most popular
and most efficient method to estimate parameters in conditional distribution models. On the
other hand, many econometric models can be conveniently estimated by using a pseudo like-
lihood function. These include nonlinear least squares, ARMA and GARCH models, as well as
limited dependent variable and discrete choice models. Such an estimation method is called
the quasi-MLE (QMLE). There is an important difference between MLE and QMLE: the forms of their
asymptotic variances are different. In a certain sense, the asymptotic variance of the MLE is similar in
structure to the asymptotic variance of the OLS estimator under conditional homoskedasticity
and serial uncorrelatedness, while the asymptotic variance of the QMLE is similar in struc-
ture to the asymptotic variance of the OLS estimator under conditional heteroskedasticity and
autocorrelation.
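To make this analogy concrete (in generic notation; the symbols $H$ and $\Omega$ are illustrative), let $s_t(\theta)$ denote the score of the (quasi-)log-likelihood and define
$$H = -E\left[\frac{\partial^2 \ln f(Y_t|X_t,\theta^0)}{\partial\theta\,\partial\theta'}\right], \qquad \Omega = E\left[s_t(\theta^0)\,s_t(\theta^0)'\right].$$
Under a correctly specified conditional distribution the information matrix equality $H = \Omega$ holds and the MLE has asymptotic variance $H^{-1}$, analogous to the OLS variance under conditional homoskedasticity. Under misspecification the equality generally fails and the QMLE has the sandwich asymptotic variance $H^{-1}\Omega H^{-1}$, analogous to the OLS variance under conditional heteroskedasticity and autocorrelation (White 1982, 1994).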
Chapters 2 to 9 are treated in a unified and coherent manner. The theory is constructed
progressively, from the simplest classical linear regression models to nonlinear expectations mod-
els and then to conditional distributional models. The book has emphasized the important
implications of conditional heteroskedasticity and autocorrelation, as well as misspecification of
conditional distributional models, for the asymptotic variances of the related econometric esti-
mators. With a good command of the econometric theory developed in Chapters 2 to 9, we
can conduct a variety of empirical analyses in economics and finance, including all motivating
examples introduced in Chapter 1. In addition to asymptotic theory, the book has also shown
students how to do asymptotic analysis via the progressive development of the asymptotic theory
in Chapters 2 to 9. Moreover, we have also introduced a variety of basic asymptotic analytic tools
and concepts, including various convergence concepts, limit theorems, and basic time series concepts
and models.
References
Bollerslev, T. (1986), "Generalized Autoregressive Conditional Heteroskedasticity," Journal of
Econometrics 31, 307-327.
Box, G.E.P. and D.A. Pierce (1970), "Distribution of Residual Autocorrelations in Autoregressive-
Integrated Moving Average Time Series Models," Journal of the American Statistical Association 65,
1509-1526.
Campbell, J.Y. and J. Cochrane (1999), "By Force of Habit: A Consumption-Based
Explanation of Aggregate Stock Market Behavior," Journal of Political Economy 107, 205-251.
Chen, D. and Y. Hong (2003), "Has Chinese Stock Market Become Efficient? Evidence from
a New Approach," China Economic Quarterly 1 (2), 249-268.
Chow, G. C. (1960), "Tests of Equality Between Sets of Coefficients in Two Linear Regressions,"
Econometrica 28, 591-605.
Cournot, A. (1838), Researches into the Mathematical Principles of the Theory of Wealth,
trans. Nathaniel T. Bacon, with an essay and a biography by Irving Fisher. Macmillan: New
York, 2nd edition, 1927.
Cox, D. R. (1972), "Regression Models and Life Tables (with Discussion)," Journal of the Royal
Statistical Society, Series B, 34, 187-220.
Engle, R. (1982), "Autoregressive Conditional Heteroskedasticity with Estimates of the Vari-
ance of United Kingdom Inflation," Econometrica 50, 987-1007.
Engle, R. and C.W.J. Granger (1987), "Co-Integration and Error Correction: Representation,
Estimation and Testing," Econometrica 55, 251-276.
Fisher, I. (1933), "Report of the Meeting," Econometrica 1, 92-93.
Frisch, R. (1933), “Propagation Problems and Impulse Problems in Dynamic Economics.” In
Economic Essays in Honour of Gustav Cassel. London: Allen and Unwin, 1933.
Granger, C.W.J. (2001), "Overview of Nonlinear Macroeconometric Empirical Models," Macro-
economic Dynamics 5, 466-481.
Granger, C.W.J. and T. Teräsvirta (1993), Modelling Nonlinear Economic Relationships,
Oxford University Press: Oxford.
Groves, T., Hong, Y., McMillan, J. and B. Naughton (1994), "Incentives in Chinese
State-owned Enterprises," Quarterly Journal of Economics CIX, 183-209.
Gujarati, D.N. (2006), Essentials of Econometrics, 3rd Edition, McGraw-Hill: Boston.
Hansen, L.P. (1982), "Large Sample Properties of Generalized Method of Moments Estima-
tors," Econometrica 50, 1029-1054.
Hansen, L.P. and K. Singleton (1982), "Generalized Instrumental Variables Estimation of
Nonlinear Rational Expectations Models," Econometrica 50, 1269-1286.
Hardle, W. (1990), Applied Nonparametric Regression. Cambridge University Press: Cam-
bridge.
Hong, Y. and Y.J. Lee (2005), "Generalized Spectral Testing for Conditional Mean Models
in Time Series with Conditional Heteroskedasticity of Unknown Form," Review of Economic
Studies 72, 499-541.
Hsiao, C. (2003), Analysis of Panel Data, 2nd Edition, Cambridge University Press: Cambridge.
Keynes, J. M. (1936), The General Theory of Employment, Interest and Money, Macmillan:
London.
Kiefer, N. (1988), "Economic Duration Data and Hazard Functions," Journal of Economic
Literature 26, 646-679.
Lancaster, T. (1990), The Econometric Analysis of Transition Data, Cambridge University
Press: Cambridge, U.K.
Lucas, R. (1977), "Understanding Business Cycles," in Stabilization of the Domestic and Inter-
national Economy, Karl Brunner and Allan Meltzer (eds.), Carnegie-Rochester Conference Series
on Public Policy, Vol. 5. North-Holland: Amsterdam.
Mehra, R. and E. Prescott (1985), "The Equity Premium: A Puzzle," Journal of Monetary
Economics 15, 145-161.
Nelson, C.R. and C. I. Plosser (1982), "Trends and Random Walks in Macroeconomic Time
Series: Some Evidence and Implications," Journal of Monetary Economics 10, 139-162.
Pagan, A. and A. Ullah (1999), Nonparametric Econometrics, Cambridge University Press:
Cambridge.
Phillips, P.C. (1987), "Time Series Regression with a Unit Root," Econometrica 55, 277-301.
Samuelson, L. (2005), "Economic Theory and Experimental Economics," Journal of Economic
Literature XLIII, 65-107.
Samuelson, P. (1939), "Interactions Between the Multiplier Analysis and the Principle of Ac-
celeration," Review of Economics and Statistics 21, 75-78.
Smith, A. (1776), An Inquiry into the Nature and Causes of the Wealth of Nations, edited, with
an Introduction, Notes, Marginal Summary and an Enlarged Index, by Edwin Cannan; with an
Introduction by Max Lerner. New York: The Modern Library, 1937.
Von Neumann, J. and O. Morgenstern (1944), Theory of Games and Economic Behavior,
Princeton University Press: Princeton.
Walras, L. (1874), Elements of Pure Economics, or, The Theory of Social Wealth, translated
by William Jaffé. Fairfield, PA: Kelley, 1977.
White, H. (1980), "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct
Test for Heteroskedasticity," Econometrica 48, 817-838.
White, H. (1982), "Maximum Likelihood Estimation of Misspecified Models," Econometrica
50, 1-26.
White, H. (1994), Estimation, Inference and Specification Analysis. Cambridge University
Press: Cambridge.
About the Author: Yongmiao Hong received his Bachelor's degree in Physics in 1985 and
his MA degree in Economics in 1988, both from Xiamen University. He received his Ph.D. in
Economics from the University of California, San Diego, in 1993. In the same year, he became a
tenure-track assistant professor in the Department of Economics, Cornell University, where he became
a tenured faculty member in 1998 and a full professor in 2001. He has also been a special-term visiting
professor in the School of Economics and Management, Tsinghua University, since 2002, and a
Cheung Kong Visiting Professor in the Wang Yanan Institute for Studies in Economics (WISE),
Xiamen University, since 2005. He was the President of the Chinese Economists Society in North
America in 2009-2010. Yongmiao Hong's research interests are econometric theory, time
series analysis, financial econometrics, and empirical studies of the Chinese economy and financial
markets. He has published dozens of academic papers in a number of top academic journals in
economics, finance, and statistics, such as Econometrica, Journal of Political Economy, Quarterly
Journal of Economics, Review of Economic Studies, Review of Economics and Statistics, Review
of Financial Studies, Journal of Econometrics, Econometric Theory, Biometrika, Journal of the
Royal Statistical Society Series B, and Journal of the American Statistical Association.