
Chapter 2

STATISTICAL THEORY AND ECONOMETRICS

ARNOLD ZELLNER*
University of Chicago

Contents
1. Introduction and overview
2. Elements of probability theory
2.1. Probability models for observations
2.2. Definitions of probability
2.3. Axiom systems for probability theory
2.4. Random variables and probability models
2.5. Elements of asymptotic theory
3. Estimation theory
3.1. Point estimation
3.2. Criteria for point estimation
4. Interval estimation: Confidence bounds, intervals, and regions
4.1. Confidence bounds
4.2. Confidence intervals
4.3. Confidence regions
5. Prediction
5.1. Sampling theory prediction techniques
5.2. Bayesian prediction techniques
6. Statistical analysis of hypotheses
6.1. Types of hypotheses
6.2. Sampling theory testing procedures
6.3. Bayesian analysis of hypotheses
7. Summary and concluding remarks
References

*Research for this paper was financed by NSF Grant SES 7913414 and by income from the H.G.B.
Alexander Endowment Fund, Graduate School of Business, University of Chicago. Part of this work
was done while the author was on leave at the National Bureau of Economic Research and the Hoover
Institution, Stanford, California.

Handbook of Econometrics, Volume I, Edited by Z. Griliches and M.D. Intriligator


© North-Holland Publishing Company, 1983

1. Introduction and overview

Econometricians, as well as other scientists, are engaged in learning from their experience and data - a fundamental objective of science. Knowledge so obtained
may be desired for its own sake, for example to satisfy our curiosity about aspects
of economic behavior and/or for use in solving practical problems, for example
to improve economic policymaking. In the process of learning from experience
and data, description and generalization both play important roles. Description
helps us to understand "what is" and what is to be explained by new or old
economic generalizations or theories. Economic generalizations or theories are not
only instrumental in obtaining understanding of past data and experience but
also are most important in predicting as yet unobserved outcomes, for example
next year's rate of inflation or the possible effects of an increase in government
spending. Further, the ability to predict by use of economic generalizations or
theories is intimately related to the formulation of economic policies and solution
of problems involving decision-making.
The methods and procedures by which econometricians and other scientists
learn from their data and use such knowledge to predict as yet unobserved data
and outcomes and to solve decision problems constitute the subject-matter of
statistical theory. A principal objective of work in statistical theory is to formulate
methods and procedures for learning from data, making predictions, and solving
decision problems that are generally applicable, work well in applications and are
consistent with generally accepted principles of scientific induction and decision-
making under uncertainty. Current statistical theories provide a wide range of
methods applicable to many problems faced by econometricians and other
scientists. In subsequent sections, many theories and methods will be reviewed.
It should be appreciated that probability theory plays a central role in statisti-
cal theory. Indeed, it is generally hypothesized that economic and other types of
data are generated stochastically, that is by an assumed probabilistic process or
model. This hypothesis is a key one which has been found fruitful in econometrics
and other sciences. Thus, under this assumption, most operational economic
generalizations or theories are probabilistic, and in view of this fact some
elements of probability theory and probabilistic models will be reviewed in
Section 2.
The use of probability models as a basis for economic generalizations and
theories is widespread. If the form of a probability model and the values of its
parameters were known, one could use such a model to make probability
statements about as yet unobserved outcomes, as for example in connection with
games of chance. When the probability model's form and nature are completely
known, using it in the way described above is a problem in direct probability. That
is, with complete knowledge of the probability model, it is usually "direct" to
compute probabilities associated with as yet unobserved possible outcomes.
On the other hand, problems usually encountered in science are not those of
direct probability but those of inverse probability. That is, we usually observe data
which are assumed to be the outcome or output of some probability process or
model, the properties of which are not completely known. The scientist's problem
is to infer or learn the properties of the probability model from observed data, a
problem in the realm o f inverse probability. For example, we may have data on
individuals' incomes and wish to determine whether they can be considered as
drawn or generated from a normal probability distribution or by some other
probability distribution. Questions like these involve considering alternative prob-
ability models and using observed data to try to determine from which hypothe-
sized probability model the data probably came, a problem in the area of
statistical analysis of hypotheses. Further, for any of the probability models
considered, there is the problem of using data to determine or estimate the values
of parameters appearing in it, a problem of statistical estimation. Finally, the
problem of using probability models to make predictions about as yet unobserved
data arises, a problem of statistical prediction. Aspects of these major topics,
namely (a) statistical estimation, (b) statistical prediction, and (c) statistical analysis
of hypotheses, will be reviewed and discussed.
Different statistical theories can yield different solutions to the problems of
statistical estimation, prediction, and analysis of hypotheses. Also, different
statistical theories provide different justifications for their associated methods.
Thus, it is important to understand alternative statistical theories, and in what
follows attention is given to features of several major statistical theories. Selected
examples are provided to illustrate differences in results and rationalizations of
them provided by alternative statistical theories.1
Finally, in a concluding section a number of additional topics are mentioned
and some concluding remarks are presented.

2. Elements of probability theory

We commence this section with a discussion of the elements of probability models for observations. Then a brief consideration of several views and definitions of
probability is presented. A summary of some properties of axiom systems for
probability theory is given followed by a review of selected results from probabil-
ity theory that are closely related to the formulation of econometric models.

1For valuable discussions of many of the statistical topics considered below and references to the statistical literature, see Kruskal and Tanur (1978).

2.1. Probability models for observations

As remarked in Section 1, probability models are generally employed in analyzing data in econometrics and other sciences. Lindley (1971, p. 1) explains: "The mathematical model that has been found convenient for most statistical problems contains a sample space X of elements x endowed with an appropriate σ-field of sets over which is given a family of probability measures. These measures are indexed by a quantity θ, called a parameter belonging to the parameter space Θ. The values x are referred to variously as the sample, observations or data." Thus, a statistical model is often represented by the triplet (X, Θ, P_θ(x)), where X denotes the sample space, Θ the parameter space, and P_θ(x) the probability measures indexed by the parameter θ belonging to Θ. For many problems, it is possible to describe a probability measure P_θ(x) through its probability density function (pdf), denoted by p(x|θ). For example, in the case of n independent, identically distributed normal observations, the sample space is -∞ < x_i < ∞, i = 1,2,...,n, the pdf is p(x|θ) = ∏_{i=1}^{n} f(x_i|θ), where x' = (x1, x2,...,x_n) and f(x_i|θ) = (2πσ²)^{-1/2} exp{-(x_i - μ)²/2σ²}, with θ' = (μ, σ) the parameter vector, and Θ: -∞ < μ < ∞ and 0 < σ < ∞ the parameter space.
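For concreteness, the normal example just given can be evaluated numerically. The following minimal Python sketch (the data values and parameter settings are hypothetical, chosen only for illustration) computes the joint pdf p(x|θ) = ∏f(x_i|θ) for given θ' = (μ, σ):

```python
import math

def normal_pdf(x, mu, sigma):
    # f(x_i | theta) = (2*pi*sigma^2)^(-1/2) * exp(-(x_i - mu)^2 / (2*sigma^2))
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

def joint_pdf(xs, mu, sigma):
    # p(x | theta) = product over i of f(x_i | theta), for iid observations
    value = 1.0
    for x in xs:
        value *= normal_pdf(x, mu, sigma)
    return value

data = [1.2, -0.4, 0.7, 2.1]                 # hypothetical observations from X
print(joint_pdf(data, mu=0.5, sigma=1.0))    # p(x | theta) at theta' = (mu, sigma)
```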
While the triplet (X, Θ, p(x|θ)) contains the major elements of many statistical problems, there are additional elements that are often relevant. Some augment the triplet by introducing a decision space D of elements d and a non-negative convex loss function L(d, θ). For example, in connection with the normal model described at the end of the previous paragraph, L(d, θ) might be the following "squared error" loss function, L(d, μ) = c(μ - d)², where c is a given positive constant and d = d(x) is some estimate of μ belonging to a decision space D. The problem is then to choose a d from D that is in some sense optimal relative to the loss function L(d, μ).
An element that is added to the triplet (X, Θ, p(x|θ)) in the Bayesian approach is a probability measure defined on the σ-field supported by Θ that we assume can be described by its pdf, π(θ). Usually π(θ) is called a prior pdf and its introduction is considered by many to be the distinguishing feature of the Bayesian approach. The prior pdf, π(θ), represents initial or prior information about the possible values of θ available before obtaining the observations.
In summary, probabilistic models for observations can be represented by (X, Θ, p(x|θ)). This representation is often extended to include a loss function, L(d, θ), a decision space D, and a prior pdf, π(θ). As will be seen in what
follows, these elements play very important roles in statistical theories.
Statistical theories indicate how the data x are to be employed to make
inferences about the possible value of θ, an estimation problem, how to test hypotheses about possible values of θ, e.g. θ = 0 vs. θ ≠ 0, a testing problem, and
how to make inferences about as yet unobserved or future data, the problem of
prediction. Also, very importantly, the information in the data x can be employed
to explore the adequacy of the probability model (X, Θ, p(x|θ)), a procedure called "model criticism" by Box (1980). Model criticism, involving diagnostic checks of the form of p(x|θ) and other assumptions, may indicate that the model
is adequate or inadequate. If the model is found to be inadequate, then it has to
be reformulated. Thus, work with probability models for observations has an
important iterative aspect, as emphasized by Box (1976), Box and Tiao (1973),
and Zellner (1975, 1979). While some elements of the theory of hypothesis testing
are relevant for this process of iterating in on an adequate probability model for
the observations, additional research is needed to provide formalizations of the
many heuristic procedures employed by applied researchers to iterate in on an
adequate model, that is a model that achieves the objectives of an analysis. See
Leamer (1978) for a thoughtful discussion of related issues.

2.2. Definitions of probability

Above, we have utilized the word "probability" without providing a definition of it. Many views and/or definitions of probability have appeared in the literature.
On this matter Savage (1954) has written, "It is unanimously agreed that statistics
depends somehow on probability. But, as to what probability is and how it is
connected with statistics, there has seldom been such complete disagreement and
breakdown of communication since the Tower of Babel. Doubtless, much of the
disagreement is merely terminological and would disappear under sufficiently
sharp analysis" (p. 2). He distinguishes three main classes of views on the
interpretation of probability as follows (p. 3):
Objectivistic views hold that some repetitive events, such as tosses of a
penny, prove to be in reasonably close agreement with the mathematical
concept of independently repeated random events, all with the same
probability. According to such views, evidence for the quality of agreement
between the behavior of the repetitive event and the mathematical concept,
and for the magnitude of the probability that applies (in case any does), is to
be obtained by observation of some repetitions of the event, and from no
other source whatsoever.
Personalistic views hold that probability measures the confidence that a
particular individual has in the truth of a particular proposition, for example,
the proposition that it will rain tomorrow. These views postulate that the
individual concerned is in some way "reasonable", but they do not deny the
possibility that two reasonable individuals faced with the same evidence may
have different degrees of confidence in the truth of the same proposition.
Necessary views hold that probability measures the extent to which one set
of propositions, out of logical necessity and apart from human opinion,
confirms the truth of another. They are generally regarded by their holders as
extensions of logic, which tells when one set of propositions necessitates the
truth of another.
While Savage's classification scheme probably will not satisfy all students of
the subject, it does bring out critical differences of alternative views regarding the
meaning of probability. To illustrate further, consider the following definitions of
probability, some of which are reviewed by Jeffreys (1967, p. 369 ff.).
1. Classical or Axiomatic Definition
If there are n possible alternatives, for m of which a proposition denoted by p is true, then the probability of p is m/n.
2. Venn Limit Definition
If an event occurs a large number of times, then the probability of p is the limit of
the ratio of the number of times when p will be true to the whole number of trials,
when the number of trials tends to infinity.
3. Hypothetical Infinite Population Definition
An actually infinite number of possible trials is assumed. Then the probability of
p is defined as the ratio of the number of cases where p is true to the whole
number.
4. Degree of Reasonable Belief Definition
Probability is the degree of confidence that we may reasonably have in a
proposition.
5. Value of an Expectation Definition
If for an individual the utility of the uncertain outcome of getting a sum of s
dollars or zero dollars is the same as getting a sure payment of one dollar, the
probability of the uncertain outcome of getting s dollars is defined to be
u(1)/u(s), where u(·) is a utility function. If u(·) can be taken proportional to returns, the probability of receiving s is 1/s.
Jeffreys notes that Definition 1 appeared in work of De Moivre in 1738 and of
J. Neyman in 1937; that R. Mises advocates Definition 2; and that Definition 3 is
usually associated with R. A. Fisher. Definition 4 is Jeffreys' definition (1967, p.
20) and close to Keynes' (1921, p. 3). The second part of Definition 5 is involved
in Bayes (1763). The first part of Definition 5, embodying utility comparisons, is
central in work by Ramsey (1931), Savage (1954), Pratt, Raiffa and Schlaifer
(1964), DeGroot (1970), and others.
Definition 1 can be shown to be defective, as it stands, by consideration of
particular examples-see Jeffreys (1967, p. 370ff.). For example, if a six-sided die
is thrown, by Definition 1 the probability that any particular face will appear is
1/6. Clearly, this will not be the case if the die is biased. To take account of this
possibility, some have altered the definition to read, "If there are n equally likely possible alternatives, for m of which p is true, then the probability of p is m/n."
If the phrase "equally likely" is interpreted as "equally probable," then the
definition is defective since the term to be defined is involved in the definition.
Also, Jeffreys (1967) points out in connection with the Venn Limit Definition that, "For continuous distributions there are an infinite number of possible cases, and the definition makes the probability, on the face of it, the ratio of two infinite numbers and therefore meaningless" (p. 371). He states that attempts by Neyman and Cramér to avoid this problem are unsatisfactory.
With respect to Definitions 2 and 3, it must be recognized that they are both
non-operational. As Jeffreys (1967) puts it:

No probability has ever been assessed in practice, or ever will be, by counting an infinite number of trials or finding the limit of a ratio in an
infinite series. Unlike the first definition, which gave either an unacceptable
assessment or numerous different assessments, these two give none at all. A
definite value is got on them only by making a hypothesis about what the
result would be. The proof even of the existence is impossible. On the limit
definition, without some rule restricting the possible orders of occurrence,
there might be no limit at all. The existence of the limit is taken as a
postulate by Mises, whereas Venn hardly considered it as needing a
postulate...the necessary existence of the limit denies the possibility of
complete randomness, which would permit the ratio in an infinite series to
tend to no limit (p. 373, fn. omitted).

Further, with respect to Definition 3, Jeffreys (1967) writes, "On the infinite
population definition, any finite probability is the ratio of two infinite numbers
and therefore is indeterminate" (p. 373, fn. omitted). Thus, Definitions 2 and 3
have some unsatisfactory features.
Definition 4, which defines probability in terms of the degree of confidence
that we may reasonably have in a proposition, is a primitive concept. It is
primitive in the sense that it is not produced by any axiom system; however, it is
accepted by some on intuitive grounds. Furthermore, while nothing in the
definition requires that probability be measurable, say on a scale from zero to
one, Jeffreys (1967, p. 19) does assume measurability [see Keynes (1921) for a
critique of this assumption] and explores the consequences of the use of this
assumption in a number of applications. By use of this definition, it becomes
possible to associate probabilities with hypotheses, e.g. it is considered meaning-
ful to state that the probability that the marginal propensity to consume is
between 0.7 and 0.9 is 0.8, a statement that is meaningless in terms of Definitions
1-3. However, the meaningfulness of the metric employed for such statements is
a key issue which, as with many measurement problems, will probably be resolved
by noting how well procedures based on particular metrics perform in practice.

Definition 5, which views probability as a subjective, personal concept, involves
the use of a benefit or utility metric. For many, but not all problems, one or the
other of these metrics may be considered satisfactory in terms of producing useful
results. There may, however, be some scientific and other problems for which a
utility or loss (negative utility) function formulation is inadequate.
In summary, several definitions of probability have been briefly reviewed.
While the definitions are radically different, it is the case that operations with
probabilities, reviewed below, are remarkably similar even though their interpre-
tations differ considerably.

2.3. Axiom systems for probability theory

Various axiom systems for probability theory have appeared in the literature.
Herein Jeffreys' axiom system is reviewed that was constructed to formalize
inductive logic in such a way that it includes deductive logic as a special limiting
case. His definition of probability as a degree of reasonable belief, Definition 4
above, allows for the fact that in induction, propositions are usually uncertain
and only in the limit may be true or false in a deductive sense. With respect to
probability, Jeffreys, along with Keynes (1921), Uspensky (1937), Rényi (1970),
and others, emphasizes that all probabilities are conditional on an initial informa-
tion set, denoted by A. For example, let B represent the proposition that a six will
be observed on a single flip of a coin. The degree of reasonable belief or
probability that one attaches to B depends on the initial information concerning
the shape and other features of the coin and the way in which it is thrown, all of
which are included in the initial information set, A. Thus, the probability of B is
written P(B|A), a conditional probability. The probability of B without specify-
ing A is meaningless. Further, failure to specify A clearly and precisely can lead to
confusion and meaningless results; for an example, see Jaynes (1980).
Let propositions be denoted by A, B, C,.... Then Jeffreys' (1967) first four
axioms are:
Axiom 1 (Comparability)
Given A, B is either more, equally, or less probable than C, and no two of these
alternatives can be true.
Axiom 2 (Transitivity)
If A, B, C, and D are four propositions and given A, B is more probable than C
and C is more probable than D, then given A, B is more probable than D.
Axiom 3 (Deducibility)
All propositions deducible from a proposition A have the same probability given
A. All propositions inconsistent with A have the same probability given data A.

Axiom 4

If, given A, B1 and B2 cannot both be true and if, given A, C1 and C2 cannot both be true, and if, given A, B1 and C1 are equally probable and B2 and C2 are equally probable, then, given A, B1 or B2 and C1 or C2 are equally probable.
Jeffreys states that Axiom 4 is required to prove the addition rule given below.
DeGroot (1970, p. 71) introduces a similar axiom.
Axiom 1 permits the comparison of probabilities or degrees of reasonable belief
or confidence in alternative propositions. Axiom 2 imposes a transitivity condi-
tion on probabilities associated with alternative propositions based on a common information set A. The third axiom is needed to insure consistency with deductive
logic in cases in which inductive and deductive logic are both applicable. The
extreme degrees of probability are certainty and impossibility. As Jeffreys (1967,
p. 17) mentions, certainty on data A and impossibility on data A "do not refer to
mental certainty of any particular individual, but to the relations of deductive
logic..." expressed by B is deducible from A and not-B is deducible from A, or in
other words, A entails B in the former case and A entails not-B in the latter.
Axiom 4 is needed in what follows to deal with pairs of exclusive propositions
relative to a given information set A. Jeffreys' Theorem 1 extends Axiom 4 to
relate to more than two pairs of exclusive propositions with the same probabilities
on the same data A.
Jeffreys (1967) remarks that it has "...not yet been assumed that probabilities can be expressed by numbers. I do not think that the introduction of numbers is strictly necessary to the further development; but it has the enormous advantage that it permits us to use mathematical technique. Without it, while we might obtain a set of propositions that would have the same meanings, their expression would be much more cumbersome" (pp. 18-19). Thus, Jeffreys recognizes that it is possible to have a "non-numerical" theory of probability but opts for a "numerical" theory in order to take advantage of less cumbersome mathematics, which he believes leads to propositions with about the same meanings.
The following notation and definitions are introduced to facilitate further
analysis.

Definitions 2

(1) - A means "not-A", that is, A is false.


(2) A ∩ B means "A and B", that is, both A and B are true. The proposition A ∩ B is also termed the "intersection" or the "joint assertion" or "conjunction" or "logical product" of A and B.

2These are presented in Jeffreys (1967, pp. 17-18) using different notation.



(3) A ∪ B means "A or B", that is, at least one of A and B is true. The proposition A ∪ B is also referred to as the "union" or "disjunction" or "logical sum" of A and B.
(4) A ∩ B ∩ C ∩ D means "A and B and C and D", that is, A, B, C, and D are all true.
(5) A ∪ B ∪ C ∪ D means "A or B or C or D", that is, at least one of A, B, C, and D is true.
(6) Propositions B_i, i = 1,2,...,n, are said to be exclusive on data A if not more than one of them can be true given A.
(7) Propositions B_i, i = 1,2,...,n, are said to be exhaustive on data A if at least one of them must be true given A.
Note that a set of propositions can be both exclusive and exhaustive. Also, for example, Axiom 4 can be restated using the above notation and concepts as:
Axiom 4
If B1 and B2 are exclusive and C1 and C2 are exclusive, given data A, and if, given A, B1 and C1 are equally probable and B2 and C2 are equally probable, then, given A, B1 ∪ B2 and C1 ∪ C2 are equally probable.
At this point in the development of his axiom system, Jeffreys introduces
numbers associated with or measuring probabilities by the following conventions.
Convention 1
A larger number is assigned to the more probable proposition (and therefore
equal numbers to equally probable propositions).
Convention 2
If, given A, B1 and B2 are exclusive, then the number assigned on data A to "B1 or B2", that is B1 ∪ B2, is the sum of those assigned to B1 and to B2.
The following axiom is needed to insure that there are enough numbers
available to associate with probabilities.
Axiom 5
The set of possible probabilities on given data, ordered in terms of the relation "more probable than", can be put into a one-one correspondence with a set of real numbers in increasing order.
It is important to realize that the notation P(B|A) stands for the number associated with the probability of the proposition B on data A. The number expresses or measures the reasonable degree of confidence in B given A, that is, the probability of B given A, but is not identical to it.
The following theorem that Jeffreys derives from Axiom 3 and Convention 2
relates to the numerical assessment of impossible propositions.

Theorem 2

If proposition A entails -B, then P(B|A) = 0. Thus, Theorem 2 in conjunction with Convention 1 provides the result that all probability numbers are ≥ 0.

The number associated with certainty is given in the following convention.

Convention 3
If A entails B, then P(B|A) = 1.
The use of 1 to represent certainty is a pure convention. In some cases it is useful to allow numerical probabilities to range from 0 to ∞ rather than from 0 to 1. On
given data, however, it is necessary to use the same numerical value for certainty.

Axiom 6
If A ∩ B entails C, then P(B ∩ C|A) = P(B|A).

That is, given A throughout, if B is false, then B ∩ C is false. If B is true, since A ∩ B entails C, C is true and therefore B ∩ C is true. Similarly, if B ∩ C is true, it entails B, and if B ∩ C is false, B must be false on data A since if it were true, B ∩ C would be true. Thus, it is impossible, given A, that either B or B ∩ C should be true without the other. This is an extension of Axiom 3 that results in all equivalent propositions having the same probability on given data.

Theorem 3
If B and C are equivalent in the sense that each entails the other, then each entails B ∩ C, and the probabilities of B and C on any given data must be equal. Similarly, if A ∩ B entails C and A ∩ C entails B, P(B|A) = P(C|A), since both are equal to P(B ∩ C|A).

A theorem following from Theorem 3 is:

Theorem 4
P(B|A) = P(B ∩ C|A) + P(B ∩ -C|A).

Further, since P(B ∩ -C|A) ≥ 0, P(B|A) ≥ P(B ∩ C|A). Also, by using B ∪ C for B in Theorem 4, it follows that P(B ∪ C|A) ≥ P(C|A).
The addition rule for numerical probabilities is given by Theorem 5.

Theorem 5
If B and C are two propositions, not necessarily exclusive on data A, the addition
rule is given by

P(B|A) + P(C|A) = P(B ∩ C|A) + P(B ∪ C|A).


It follows that

P(B ∪ C|A) ≤ P(B|A) + P(C|A),

since P(B ∩ C|A) ≥ 0. Further, if B and C are exclusive, then P(B ∩ C|A) = 0 and P(B ∪ C|A) = P(B|A) + P(C|A).
Theorems 4 and 5 together express upper and lower bounds on the possible values of P(B ∪ C|A) irrespective of exclusiveness, that is

max[P(B|A), P(C|A)] ≤ P(B ∪ C|A) ≤ P(B|A) + P(C|A).

Theorem 6
If B1, B2,...,B_n are a set of equally probable and exclusive alternatives on data A, and if Q and R are unions of two subsets of these alternatives, of numbers m and n, then P(Q|A)/P(R|A) = m/n. This follows from Convention 2 since P(Q|A) = ma and P(R|A) = na, where a = P(B_i|A) for all i.
Theorem 7
Under the conditions of Theorem 6, if B1, B2,...,B_n are exhaustive on data A, and R denotes their union, then R is entailed by A and by Convention 3, P(R|A) = 1, and it follows that P(Q|A) = m/n.
As Jeffreys notes, Theorem 7 is virtually Laplace's rule stated at the beginning of his Théorie Analytique. Since R entails itself and is a possible value of A, it is possible to write P(Q|R) = m/n, which Jeffreys (1967) interprets as, "...given that a set of alternatives are equally probable, exclusive and exhaustive, the probability that some one of any subset is true is the ratio of the number in that subset to the whole number of possible cases" (p. 23). Also, Theorem 6 is consistent with the possibility that the number of alternatives is infinite, since it requires only that Q and R shall be finite subsets.
Theorems 6 and 7 indicate how to assess the ratios of probabilities and their
actual values. Such assessments will always be rational fractions that Jeffreys calls
R-probabilities. If all probabilities were R-probabilities, there would be no need
for Axiom 5. But, as Jeffreys points out, many propositions are of a form that a
magnitude capable of a continuous range of values lies within a specified part of
the range and it may not be possible to express them in the required form. He
explains how to deal with this problem and puts forward the following theorem:
Theorem 8
Any probability can be expressed by a real number. For a variable z that can assume a continuous set of values, given A, the probability that z's value is less than a given value z0 is P(z < z0|A) = F(z0), where F(z0) is referred to as the cumulative distribution function (cdf). If F(z0) is differentiable, P(z0 < z < z0 + dz|A) = f(z0)dz + o(dz), where f(z0) = F'(z0) is the probability density function (pdf), and this last expression gives the probability that z lies in the interval z0 to z0 + dz.
Theorem 9
If Q is the union of a set of exclusive alternatives, given A, and if R and S are subsets of Q (possibly overlapping), and if the alternatives in Q are all equally probable on data A and also on data R ∩ A, then

P(R ∩ S|A) = P(R|A)P(S|R ∩ A)/P(R|R ∩ A).

Note that if Convention 3 is employed, P(R|R ∩ A) = 1, since R ∩ A entails R, and then Theorem 9 reads

P(R ∩ S|A) = P(R|A)P(S|R ∩ A).

In other words, given A throughout, the probability that the true proposition is in the intersection of R and S is equal to the probability that it is in R times the probability that it is in S, given that it is in R. Theorem 9 involves the assumption that the alternatives in Q are equally probable, both given A and also given R ∩ A. Jeffreys notes that it has not been possible to relax this assumption in proving Theorem 9. However, he regards this theorem as suggestive of the simplest rule that relates probabilities based on different data, here denoted by A and R ∩ A, and puts forward the following axiom.
Axiom 7
P(B ∩ C|A) = P(B|A)P(C|B ∩ A)/P(B|B ∩ A).

If Convention 3 is used in Axiom 7, P(B|B ∩ A) = 1 and

P(B ∩ C|A) = P(B|A)P(C|B ∩ A),

which is the product rule. Thus, the product rule relates to probability statements regarding the logical product or intersection B ∩ C, often also written as BC, while the addition or sum rule relates to probability statements regarding the logical sum or union, B ∪ C.
Since Axiom 7 is just suggested by Theorem 9, Jeffreys shows that it holds in several extreme cases and concludes that, "The product rule may therefore be taken as general unless it can be shown to lead to contradictions" (p. 26). Also, he states, "When the probabilities...are chances, they satisfy the product rule automatically" (p. 51).³

³Jeffreys (1967) defines "chance" as follows: "If q1, q2,...,q_n are a set of alternatives, mutually exclusive and exhaustive on data r, and if the probabilities of p given any of them and r are the same, each of these probabilities is called the chance of p on data r" (p. 51).

In general, if P(C|B ∩ A) = P(C|A), B is said to be irrelevant to or independent of C, given A. In this special case, the product rule can be written as P(B ∩ C|A) = P(B|A)P(C|A), a form of the product rule that is valid only when B is irrelevant to or independent of C.
Theorem 10
If q1, q2,...,q_n are a set of alternatives, A the information already available, and x some additional information, then the ratio

P(q_r|x ∩ A)P(q_r|q_r ∩ A) / [P(q_r|A)P(x|q_r ∩ A)]

is the same for all the q_r.
If we use Convention 3, P(q_r|q_r ∩ A) = 1, then

P(q_r|x ∩ A) = cP(q_r|A)P(x|q_r ∩ A),

where 1/c = Σ_r P(q_r|A)P(x|q_r ∩ A). This is the principle of inverse probability or Bayes' Theorem, first given in Bayes (1763). The result can also be expressed as

Posterior probability ∝ (prior probability)×(likelihood function),

where ∝ denotes "is proportional to", P(q_r|x ∩ A) is the posterior probability, P(q_r|A) is the prior probability, and P(x|q_r ∩ A) is the likelihood function.
In general terms, Jeffreys describes the use of Bayes' Theorem by stating that if several hypotheses q1, q2,...,q_n are under consideration and, given background information A, there is no reason to prefer any one of them, the prior probabilities, P(q_r|A), r = 1,2,...,n, will be taken equal. Then, the most probable hypothesis after observing the data x, that is, the one with the largest P(q_r|x ∩ A), will be that with the largest value for P(x|q_r ∩ A), the likelihood function. On the other hand, if the data x are equally probable on each hypothesis, the prior views with respect to alternative hypotheses, whatever they were, will be unchanged. Jeffreys (1967) concludes: "The principle will deal with more complicated circumstances also; the immediate point is that it does provide us with what we want, a formal rule in general accord with common sense, that will guide us in our use of experience to decide between hypotheses" (p. 29). Jeffreys (1967, p. 43) also shows that the theory can be utilized to indicate how an inductive inference can approach certainty, though it cannot reach it, and thus explains the usual confidence that most scientists have in inductive inference. These conclusions are viewed as controversial by those who question the appropriateness of introducing prior probabilities and associating probabilities with hypotheses. It appears that these issues can only be settled by close comparative study of the results yielded by various approaches to statistical inference as Anscombe (1961) and
Jaynes (1974), among others, have emphasized. See also Fisher (1959), Savage et
al. (1962), Lindley (1971), Barnett (1973), Cox and Hinkley (1974), Rothenberg
(1975), and Zellner (1975) for further discussion of these and related issues.
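The mechanics of Theorem 10 can be illustrated with a small numerical sketch in Python; the hypotheses, equal prior probabilities, and likelihood values below are hypothetical:

```python
# Posterior probability is proportional to (prior probability) x (likelihood),
# normalized so the posterior probabilities sum to one, as in Theorem 10.
priors = {"q1": 1.0 / 3, "q2": 1.0 / 3, "q3": 1.0 / 3}   # P(q_r | A), taken equal
likelihoods = {"q1": 0.10, "q2": 0.40, "q3": 0.25}        # P(x | q_r and A), hypothetical

unnormalized = {q: priors[q] * likelihoods[q] for q in priors}
c = 1.0 / sum(unnormalized.values())          # 1/c = sum_r P(q_r|A) P(x|q_r and A)
posterior = {q: c * unnormalized[q] for q in unnormalized}

print(posterior)   # with equal priors, the largest posterior goes to the largest likelihood
```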
Jeffreys' theory of probability, described above, did not mention utility or benefit since it is primarily a theory of what it is reasonable to believe. However, Jeffreys notes that his theory permits him to define the expectation of a function u(x̃), say a utility function, as follows:

E[u(x̃)|A] = Σ_{i=1}^{m} u(x_i)P(x_i|A),

when x̃ is a discrete random variable with possible values x1, x2,...,x_m, or by

E[u(x̃)|A] = ∫_a^b u(x)f(x|A) dx,

when x̃ is a continuous random variable with probability density function f(x|A),
a < x < b. Thus, utility considerations can be incorporated in Jeffreys' theory. On
the other hand, Bayes (1763), Ramsey (1931), Savage (1954), Pratt, Raiffa and
Schlaifer (1964), and some others take as the fundamental idea that of expecta-
tion of benefit or utility, as pointed out in Definition 5 in Section 2.2. Generally
speaking, it is assumed that expectations of benefit or utility can be placed in an
order. As Jeffreys (1967) points out, Bayes speaks in terms of monetary stakes,
and would say that 1/100 chance of receiving $100 is as valuable as a certainty of
receiving $1. Bayes' definition of a probability of 1/100 would be that it is the
probability such that the value of the chance of $100 is the same as the value of a
certain $1. This requires a postulate that the value of the expectation is propor-
tional to the value to be received. In more modern treatments of the problem, as
mentioned in Definition 5 and described below, utility considerations enter into
the definition of probability.
Raiffa and Schlaifer (1961) remark that "...when one is forced to compare utility characteristics because one is forced to act, a few basic principles of logically consistent behavior necessarily lead to the introduction of a weighting function over Θ [the parameter space]...if this weighting function is normalized it has all the properties of a probability measure on Θ..." (p. 25). Raiffa and Schlaifer (1961, pp. 25-27) provide a simple informal proof of this result based on three assumptions for the case that Θ is finite with elements θ_i, i = 1,2,...,r. The utility characteristic of any decision d is represented by u = [u(d, θ1), u(d, θ2),...,u(d, θ_r)] ≡ (u1, u2,...,u_r). Their three assumptions are:
Assumption 1
Let u = (u1, u2,...,u_r) and v = (v1, v2,...,v_r) be the utility characteristics of decision functions d_a and d_b, respectively. If u_i ≥ v_i for all i and if u_i > v_i for some i, then d_a is preferred to d_b.

Assumption 2
Indifference surfaces extend smoothly from boundary to boundary in the r-space, R, of the u_i's, i = 1,2,...,r.

Assumption 3

If d_a, d_b, and d_c are three decision functions such that d_a and d_b are indifferent, then, given any p such that 0 ≤ p ≤ 1, a mixed strategy that selects d_a with objective probability p and d_c with objective probability 1 - p is indifferent to a strategy which selects d_b with objective probability p and d_c with objective probability 1 - p.

From these three assumptions that are discussed at length in Luce and Raiffa (1957), Raiffa and Schlaifer (1961) show that "...the decision-maker's indifference surfaces must be parallel hyper-planes with a common normal going into the interior of the first orthant, from which it follows that all utility characteristics u = (u1, u2,...,u_r) in R can in fact be ranked by an index which applies a predetermined set of weights P = (P1, P2,...,P_r) to their r components" (p. 25). That is, Σ_{i=1}^{r} P_i u_i and Σ_{i=1}^{r} P_i v_i can be employed to rank decision functions d_a and d_b with utility characteristics u and v, respectively, where the P_i's are the predetermined set of non-negative weights that can be normalized and have all the properties of a probability measure on Θ. Since the P_i's are intimately related to a person's indifference surfaces, it is clear why some refer to the normalized P_i's as "personal probabilities". For more discussion of this topic see Blackwell and Girshick (1954, ch. 4), Luce and Raiffa (1957, ch. 13), Savage (1954, ch. 1-5), and DeGroot (1970, ch. 6-7). Further, Jeffreys (1967) remarks:

The difficulty about the separation of propositions into disjunctions of equally possible and exclusive alternatives is avoided by this [Bayes, Ramsey
et al.] treatment, but is replaced by difficulties concerning additive
expectations [and utility comparisons]. These are hardly practical ones in
either case... In my method expectation would be defined in terms of value
[or utility] and probability; in theirs [Bayes, Ramsey et al.], probability is
defined in terms of values [or utilities] and expectations. The actual
propositions [of probability theory] are of course identical (p. 33).

2.4. Random variables and probability models

As mentioned in Section 1, econometric and statistical models are usually stochastic, involving random variables. In this section several important probabil-
ity models are reviewed and some of their properties are indicated.

2.4.1. Random variables

A random variable (rv) will be denoted by x̃. There are discrete, continuous, and mixed rvs. If x̃ is a discrete rv, it can, by assumption, assume just particular values, that is x̃ = x_j, j = 0,1,2,...,m, where m can be finite or infinite and the x_j's are given values, for example x0 = 0, x1 = 1, x2 = 2, and x_m = m. These x_j values may represent quantitative characteristics, for example the number of purchases in a given period, or qualitative characteristics, for example different occupational categories. If x̃ can assume just two possible values, it is termed a dichotomous rv, if three, a trichotomous rv, if more than three, a polytomous rv. For quantitative discrete rvs the ordering x0 < x1 < x2 < ··· < x_m is meaningful, while for some qualitative discrete rvs such an ordering is meaningless.
A continuous rv, x̃, such that a < x̃ < b, where a and b are given values, possibly with a = -∞ and/or b = ∞, can assume a continuum of values in the interval a to b, the range of the rv. A mixed rv, x̃, a < x̃ < b, assumes a continuum of values over part of its range, say for a < x̃ < c, and discrete values over the remainder of its range, c ≤ x̃ < b. Some econometric models incorporate just continuous or just discrete rvs while others involve mixtures of continuous, discrete, and mixed rvs.

2.4.2. Discrete random variables

For a discrete rv, x̃, that can assume the values x1, x2,...,x_m, where the x_j's are distinct and exhaustive, the probability that x̃ = x_j, given the initial information A, is

P(x̃ = x_j|A) = p_j,   j = 1,2,...,m,   (2.1)

with

p_j ≥ 0  and  Σ_{j=1}^{m} p_j = 1.   (2.2)

The collection of p_j's in (2.1), subject to (2.2), defines the probability mass function (pmf) for the discrete rv, x̃. A plot of the p_j's against the x_j's may be unimodal, bimodal, U-shaped, J-shaped, uniform (p1 = p2 = ··· = p_m), etc. If it is unimodal, the pmf's modal value is the value of x̃ associated with the largest p_j. Further, the mean of x̃, denoted by μ ≡ Ex̃, is:

μ = Ex̃ = Σ_{j=1}^{m} p_j x_j.   (2.3)

In general, the zeroth and higher order moments about zero are given by

μ'_r = Ex̃^r = Σ_{j=1}^{m} p_j x_j^r,   r = 0,1,2,....   (2.4)

Moments about the mean, Ex̃, called central moments, are given by

μ_r = E(x̃ - Ex̃)^r = Σ_{j=1}^{m} p_j(x_j - Ex̃)^r,   r = 1,2,....   (2.5)

Note that μ_1 = 0 and that μ_2, defined as the variance of x̃, V(x̃), is

μ_2 = V(x̃) = E(x̃ - Ex̃)²
    = Σ_{j=1}^{m} p_j(x_j - Ex̃)².   (2.6)

From the first line of (2.6), V(x̃) = Ex̃² - 2(Ex̃)² + (Ex̃)² = Ex̃² - (Ex̃)² = μ'_2 - (μ'_1)². Similar relations can be obtained that relate higher order central moments to moments about zero. Also, the following unitless measures of skewness are available to characterize unimodal pmfs: sk = (mean - mode)/μ_2^{1/2}, β_1 = μ_3²/μ_2³, and γ = μ_3/μ_2^{3/2}. Further, a unitless measure of kurtosis for unimodal pmfs is given by β_2 = μ_4/μ_2², which frequently measures the peakedness of a pmf although its value is sensitive to the characteristics of the tails of a pmf.
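The quantities in (2.3)-(2.6) and the skewness and kurtosis measures just defined can be computed directly; a minimal Python sketch, using a hypothetical pmf:

```python
xs = [0, 1, 2, 3]                  # support of the discrete rv (hypothetical)
ps = [0.1, 0.4, 0.3, 0.2]          # p_j, nonnegative and summing to one, as in (2.2)

mean = sum(p * x for p, x in zip(ps, xs))                        # (2.3)
mu = lambda r: sum(p * (x - mean) ** r for p, x in zip(ps, xs))  # central moments (2.5)

variance = mu(2)                   # (2.6)
gamma = mu(3) / mu(2) ** 1.5       # skewness  gamma = mu_3 / mu_2^(3/2)
beta2 = mu(4) / mu(2) ** 2         # kurtosis  beta_2 = mu_4 / mu_2^2
print(mean, variance, gamma, beta2)
```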
If x̃ is a discrete rv assuming only non-negative integer values with P(x̃ = j) = p_j, j = 0,1,2,..., then p(z) = Σ_{j=0}^{∞} p_j z^j is called the probability generating function, with the obvious property that p(1) = 1, given property (2.2). Further, the ath derivative of p(z), evaluated at z = 0, is just a!p_a, where a! = a(a - 1)(a - 2)···1, and it is in this sense that the probability generating function "generates" the probabilities, the p_j's of a pmf.
If in p(z) = Σ_{j=0}^{∞} p_j z^j we set z = e^t, the result is the moment-generating function associated with a pmf, p0, p1, p2,..., namely

p(e^t) = Σ_{j=0}^{∞} μ'_j t^j/j!,   (2.7)

where the μ'_j are given in (2.4). The expression in (2.7) is obtained by noting that

p(e^t) = Σ_{j=0}^{∞} p_j e^{tj}

and

Σ_{j=0}^{∞} p_j e^{tj} = Σ_{j=0}^{∞} p_j Σ_{a=0}^{∞} (tj)^a/a! = Σ_{a=0}^{∞} (t^a/a!) Σ_{j=0}^{∞} p_j j^a = Σ_{a=0}^{∞} μ'_a t^a/a!,

where μ'_a = Σ_{j=0}^{∞} p_j j^a. On taking the jth derivative of (2.7) with respect to t and evaluating it at t = 0, the result is just μ'_j, and it is in this sense that (2.7) "generates" the moments of a pmf. Upon taking z = e^{it} in p(z), where i = √(-1), by similar analysis the characteristic function for a pmf can be obtained, namely

p(e^{it}) = Σ_{j=0}^{∞} μ'_j (it)^j/j!,   (2.8)

from which the moments, μ'_j, j = 0,1,2,..., can be obtained by differentiation with respect to t and evaluating the derivatives at t = 0. It can be shown by complex Fourier series analysis that a specific pmf has a unique characteristic function and that a specific characteristic function implies a unique pmf. This is important since, on occasion, manipulating characteristic functions is simpler than manipulating pmfs.
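These generating-function relations can be checked symbolically; a minimal sketch using sympy and a hypothetical three-point pmf on {0, 1, 2}:

```python
import sympy as sym

z, t = sym.symbols("z t")
ps = [sym.Rational(1, 5), sym.Rational(1, 2), sym.Rational(3, 10)]   # p_0, p_1, p_2 (hypothetical)

pgf = sum(p * z ** j for j, p in enumerate(ps))   # p(z) = sum_j p_j z^j
print(pgf.subs(z, 1))                             # equals 1, since the p_j sum to one

# a-th derivative of p(z) at z = 0 equals a! * p_a
a = 2
print(sym.diff(pgf, z, a).subs(z, 0) / sym.factorial(a))   # recovers p_2

# substituting z = e^t gives the moment-generating function; its j-th derivative
# at t = 0 is the j-th moment about zero, mu'_j
mgf = pgf.subs(z, sym.exp(t))
print(sym.diff(mgf, t, 1).subs(t, 0))   # mu'_1 = 0*p_0 + 1*p_1 + 2*p_2
```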
We now turn to consider some specific pmfs for discrete rvs.

2.4.2.1. The binomial process. Consider a dichotomous rv, ỹ_i, such that ỹ_i = 1 with probability p and ỹ_i = 0 with probability 1 - p. For example, ỹ_i = 1 might denote the appearance of a head on a flip of a coin and ỹ_i = 0 the appearance of a tail. Then Eỹ_i = 1·p + 0·(1 - p) = p and V(ỹ_i) = E(ỹ_i - Eỹ_i)² = (1 - p)²p + (0 - p)²(1 - p) = p(1 - p). Now consider a sequence of such ỹ_i's, i = 1,2,...,n, such that the value of any member of the sequence provides no information about the values of others, that is, the ỹ_i's are independent rvs. Then any particular realization of r ones and n - r zeros has probability p^r(1 - p)^{n-r}. On the other hand, the probability of obtaining r ones and n - r zeros is

P(r̃ = r|n, p) = C(n, r) p^r (1 - p)^{n-r},   (2.9)

where

C(n, r) ≡ n!/r!(n - r)!.

Note that the total number of realizations with r ones and n - r zeros is obtained by recognizing that the first one can occur in n ways, the second in n - 1 ways, the third in n - 2 ways, and the rth in n - (r - 1) ways. Thus, there are n(n - 1)(n - 2)···(n - r + 1) ways of getting r ones. However, r(r - 1)(r - 2)···2·1 of these ways are indistinguishable. Then n(n - 1)(n - 2)···(n - r + 1)/r! = n!/r!(n - r)! is the number of ways of obtaining r ones in n realizations. Since p^r(1 - p)^{n-r} is the probability of each one, (2.9) provides the total probability of obtaining r ones in n realizations.
The expression in (2.9) can be identified with coefficients in a binomial expansion,

1 = (p + q)^n = Σ_{r=0}^{n} C(n, r) p^r q^{n-r} = Σ_{r=0}^{n} P(r̃ = r|p, n),

where q = 1 - p, and hence the name "binomial" distribution. Given the value of p, it is possible to compute various probabilities from (2.9). For example,

Pr(r̃ ≤ r0|p, n) = Σ_{r=0}^{r0} C(n, r) p^r (1 - p)^{n-r},

where r0 is a given value of r. Further, moments of r̃ can be evaluated directly from

E(r̃^a|p, n) = Σ_{r=0}^{n} r^a P(r̃ = r|p, n).

By such direct evaluation:

Er̃ = μ'_1 = np,
μ'_2 = Er̃² = (np)² + np(1 - p),   (2.10)
E(r̃ - Er̃)² = np(1 - p).

Further, higher central moments can be evaluated directly or computed from the Romanovsky recursion formula, μ_{a+1} = pq[an μ_{a-1} - dμ_a/dq], a = 1,2,..., with q = 1 - p. From these results the skewness measure γ, introduced above, is γ = μ_3/μ_2^{3/2} = (q - p)/√(npq), while the kurtosis measure β_2 = μ_4/μ_2² = 1/npq + 3(n - 2)/n and the "excess" is β_2 - 3 = 1/npq - 6/n. For p = q = 1/2, γ = 0, that is, the binomial pmf is symmetric.
From (2.10), the moments of the proportion of ones, r̃/n, are easily obtained: E(r̃/n) = p, E(r̃/n)² = p² + p(1 - p)/n, and E[r̃/n - E(r̃/n)]² = p(1 - p)/n. Also note that E(r̃/n)^a = (Er̃^a)/n^a, a = 1,2,....
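A minimal Python sketch of the binomial pmf (2.9) and the moments in (2.10), with hypothetical values of n and p:

```python
from math import comb

def binomial_pmf(r, n, p):
    # P(r~ = r | n, p) = C(n, r) * p^r * (1 - p)^(n - r), as in (2.9)
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 10, 0.3                                        # hypothetical settings
probs = [binomial_pmf(r, n, p) for r in range(n + 1)]

mean = sum(r * pr for r, pr in zip(range(n + 1), probs))
var = sum((r - mean) ** 2 * pr for r, pr in zip(range(n + 1), probs))
print(sum(probs))          # 1, since the pmf sums to one over r = 0, 1, ..., n
print(mean, n * p)         # both equal np, as in (2.10)
print(var, n * p * (1 - p))
```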
It is of great interest to determine the form of the binomial pmf when both r and n - r are large and p is fixed, the problem solved in the De Moivre-Laplace Theorem. With m ≡ n - r,

-log P(r̃ = r|p, n) = log r! + log m! - log n! - r log p - m log(1 - p).

Stirling's formula, log n! = (n + 1/2)log n - n + (1/2)log 2π + O(n⁻¹), can be applied for large r and m to yield

-log P(r̃ = r|n, p) ≅ (1/2)log(2πrm/n) + r log(r/np) + m log[m/n(1 - p)]

or⁴

P(r̃ = r|p, n) ≅ [2πnp(1 - p)]^{-1/2} exp{-(r - np)²/2np(1 - p)},   (2.11)

which is a normal approximation⁵ to the binomial pmf when r and m = n - r are both large. In (2.11), the mean and variance of r̃ are np and np(1 - p), respectively, the same as the exact values for these moments. See, for example, Kenney and Keeping (1951, p. 36ff.) and Jeffreys (1967, pp. 61-62) for discussions of the quality of this approximation. If we let f̃ = r̃/n, with the condition underlying (2.11), f̃ has an approximate normal distribution with mean p and variance p(1 - p)/n. Thus, (2.11) is an important example of a case in which a discrete rv's pmf can be well approximated by a continuous probability density function (pdf).
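A quick numerical check of the De Moivre-Laplace approximation (2.11) against the exact pmf (2.9), with hypothetical n and p:

```python
from math import comb, exp, pi, sqrt

def exact(r, n, p):
    return comb(n, r) * p ** r * (1 - p) ** (n - r)             # (2.9)

def normal_approx(r, n, p):
    v = n * p * (1 - p)                                          # variance np(1 - p)
    return exp(-(r - n * p) ** 2 / (2 * v)) / sqrt(2 * pi * v)   # (2.11)

n, p = 100, 0.4                       # hypothetical; r and n - r both large near np
for r in (30, 40, 50):
    print(r, exact(r, n, p), normal_approx(r, n, p))             # the values are close
```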

2.4.2.2. The Poisson process. The Poisson process can be developed as an approximation to the binomial process when n is large and p (or q = 1 - p) is small. Such situations are often encountered, for example, in considering the number of children born blind in a large population of mothers, or the number of times the volume of trading on a stock exchange exceeds a large number, etc. For such rare (low p) events from a large number of trials, (2.11) provides a poor approximation to the probabilities of observing a particular number of such rare events and thus another approximation is needed. If n is large but np is of moderate size, say approximately of order 10, the Poisson exponential function can be employed to approximate

⁴Let r/n = p + ε/n^{1/2}, where ε is small, or r = np + n^{1/2}ε and n - r = m = n(1 - p) - n^{1/2}ε. On substituting these expressions for r and m in the logarithmic terms, this produces terms involving log[1 + ε/pn^{1/2}] and log[1 - ε/(1 - p)n^{1/2}]. Expanding these as log(1 + x) ≅ x - x²/2 and collecting dominant terms in ε², the result is (2.11).
⁵See below for a discussion of the normal distribution.
That is, with θ = np, if θ and r are fixed and n → ∞,

C(n, r)(θ/n)^r (1 - θ/n)^{n-r} → θ^r e^{-θ}/r!,   (2.12)

which is the Poisson approximation to the probability of r occurrences of the rare event in a large number of trials [see, for example, Kenney and Keeping (1951, p. 44ff.) for a discussion of the quality of the approximation]. In the limit as n → ∞, P(r̃ = r|θ) = θ^r e^{-θ}/r! is the exact Poisson pmf. Note that Σ_{r=0}^{∞} θ^r e^{-θ}/r! = 1 and E(r̃|θ) = θ, E(r̃²|θ) = θ(θ + 1), E(r̃ - Er̃)² = θ, E(r̃ - Er̃)³ = θ, and E(r̃ - Er̃)⁴ = 3θ² + θ. It is interesting that the mean, variance, and third central moment are all equal to θ. From these moments, measures of skewness and kurtosis can be evaluated.
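A companion sketch of the Poisson approximation (2.12) for a rare event, again with hypothetical n and p:

```python
from math import comb, exp, factorial

n, p = 1000, 0.005                   # large n, small p, so theta = np = 5 is moderate
theta = n * p

for r in range(4):
    binom = comb(n, r) * p ** r * (1 - p) ** (n - r)           # exact binomial probability
    poisson = theta ** r * exp(-theta) / factorial(r)          # theta^r e^(-theta) / r!, (2.12)
    print(r, binom, poisson)                                    # the two are close
```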

2.4.2.3. Other variants of the binomial process. Two interesting variants of the binomial process are the Poisson and Lexis schemes. In the Poisson scheme, the probability that ỹ_i = 1 is p_i, and not p as in the binomial process. That is, the probability of a one (or "success") varies from trial to trial. As before, the ỹ_i's are assumed independent. Then the expectation of r̃, the number of ones, is

Er̃ = E Σ_{i=1}^{n} ỹ_i = Σ_{i=1}^{n} p_i = np̄,

where

p̄ = Σ_{i=1}^{n} p_i/n,

and

V(r̃) = E(r̃ - Er̃)² = Σ_{i=1}^{n} E(ỹ_i - Eỹ_i)² = Σ_{i=1}^{n} p_i(1 - p_i) = n(p̄q̄ - σ_p²),

where

q̄ = 1 - p̄  and  σ_p² = Σ_{i=1}^{n} (p_i - p̄)²/n.

Note that V(r̃) is less than the variance of r̃ associated with independent binomial trials with a fixed probability p̄ at each trial.
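The mean and variance formulas for the Poisson scheme can be verified directly; a small Python sketch with hypothetical trial probabilities p_i:

```python
ps = [0.2, 0.5, 0.3, 0.6, 0.4]                 # p_i for n = 5 independent trials (hypothetical)
n = len(ps)

p_bar = sum(ps) / n                            # p-bar
mean_r = sum(ps)                               # E r~ = n * p-bar
var_r = sum(p * (1 - p) for p in ps)           # V(r~) = sum_i p_i (1 - p_i)

sigma2_p = sum((p - p_bar) ** 2 for p in ps) / n
print(var_r, n * (p_bar * (1 - p_bar) - sigma2_p))   # the two expressions agree
print(var_r <= n * p_bar * (1 - p_bar))              # smaller than the fixed-p binomial variance
```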
Extensions of the Poisson scheme that are widely used in practice involve the assumption that p_i = f(x_i, β), where x_i is a given vector of observed variables and β is a vector of parameters. The function f(·) is chosen so that 0 < f(·) < 1 for all i. For example, in the probit model,

p_i = (2π)^{-1/2} ∫_{-∞}^{x'_i β} e^{-t²/2} dt,   (2.13)

while in the logit model,

p_i = 1/(1 + e^{-x'_i β}).   (2.14)

Then the probability of any particular realization of the ỹ_i, i = 1,2,...,n, is given by

∏_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1-y_i},   (2.15)

where y_i = 0 or 1 are the observations. By inserting (2.13) or (2.14) in (2.15), the probit and logit models, respectively, for the observations are obtained. Of course, other functions f(·) that satisfy 0 < f(·) < 1 for all i can be employed as well, for example p_i = f(x_i, β) = 1 - e^{-βx_i²}, with β > 0 and 0 < x_i² < ∞, etc.
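A minimal sketch of the likelihood (2.15) under the logit (2.14) and probit (2.13) specifications; the 0/1 observations and the values of x'_i β are hypothetical, and the probit integral is evaluated through the error function:

```python
from math import exp, erf, sqrt

def logit_p(xb):
    return 1.0 / (1.0 + exp(-xb))                       # (2.14)

def probit_p(xb):
    return 0.5 * (1.0 + erf(xb / sqrt(2.0)))            # standard normal cdf, as in (2.13)

def likelihood(ys, xbs, link):
    # product over i of p_i^y_i * (1 - p_i)^(1 - y_i), as in (2.15)
    value = 1.0
    for y, xb in zip(ys, xbs):
        p = link(xb)
        value *= p ** y * (1 - p) ** (1 - y)
    return value

ys = [1, 0, 1, 1, 0]                                    # hypothetical 0/1 observations
xbs = [0.3, -1.1, 0.8, 1.5, -0.2]                       # hypothetical values of x_i' * beta
print(likelihood(ys, xbs, logit_p), likelihood(ys, xbs, probit_p))
```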
In the Lexis scheme, m sets of n trials each are considered. The probability of obtaining a one (or a "success") is assumed constant within each set of trials but varies from one set to another. The random number of ones in the jth set is r̃_j, with expectation Er̃_j = np_j. Then, with r̃ = Σ_{j=1}^{m} r̃_j, Er̃ = nΣ_{j=1}^{m} p_j = nmp̄, where here p̄ = Σ_{j=1}^{m} p_j/m. Also, by direct computation,

var(r̃) = np̄q̄ + n(n - 1)σ_p²,   (2.16)

where q̄ = 1 - p̄ and σ_p² = Σ_{j=1}^{m} (p_j - p̄)²/m. It is seen from (2.16) that var(r̃) is larger than from binomial trials with a fixed probability p̄ on each trial.
If σ² is the variance of r̃, the number of ones or successes in a set of n trials, and if σ_B² is the variance calculated on the basis of a binomial process, then the ratio L = σ/σ_B is called the Lexis ratio. The dispersion is said to be subnormal if L < 1, and supernormal if L > 1.
The negative binomial process involves observing independent ỹ_i's, as in the binomial process, but with the condition that a preassigned number of ones be observed. Thus r, the number of ones (or successes) to be observed, is fixed and the number of observations or trials, n, is random. Since the last observation is a one with probability p and the probability of getting r - 1 ones in the first n - 1 trials is

C(n - 1, r - 1) p^{r-1}(1 - p)^{n-r},

the desired probability of observing r ones, with r fixed beforehand, in n trials is

P(ñ = n|r, p) = C(n - 1, r - 1) p^r (1 - p)^{n-r},   (2.17)

with n ≥ r ≥ 1, which should be compared with the pmf for the usual binomial process in (2.9).
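A brief sketch of the negative binomial pmf (2.17), i.e. the probability that exactly n trials are needed to obtain a preassigned number r of ones (values hypothetical):

```python
from math import comb

def neg_binomial_pmf(n, r, p):
    # P(n~ = n | r, p) = C(n - 1, r - 1) * p^r * (1 - p)^(n - r), n >= r >= 1, as in (2.17)
    return comb(n - 1, r - 1) * p ** r * (1 - p) ** (n - r)

r, p = 3, 0.4                                  # hypothetical: wait for 3 ones with p = 0.4
print(neg_binomial_pmf(5, r, p))               # probability the third one arrives on trial 5
print(sum(neg_binomial_pmf(n, r, p) for n in range(r, 200)))   # close to one
```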

2.4.2.4. Multinomial process. Let p_j be the probability that a discrete rv, ỹ_i, assumes the value j, j = 1,2,...,J. If we observe n independent realizations of ỹ_i, i = 1,2,...,n, the probability that r_1 values have j = 1, r_2 have j = 2,..., and r_J have j = J, with n = Σ_{j=1}^{J} r_j, is given by

P(r̃ = r|p) = [n!/(r_1!r_2!···r_J!)] p_1^{r_1} p_2^{r_2} ··· p_J^{r_J},   (2.18)

with r̃' = (r̃_1, r̃_2,...,r̃_J), r' = (r_1, r_2,...,r_J), p' = (p_1, p_2,...,p_J), 0 ≤ p_j, and Σ_{j=1}^{J} p_j = 1. If J = 2, the multinomial pmf in (2.18) becomes identical to the binomial pmf in (2.9). As with the binomial pmf, for large n and r_j's, we can take the logarithm of both sides of (2.18), use Stirling's approximation, and obtain an approximating multivariate normal distribution [see Kenney and Keeping (1951, pp. 113-114) for analysis of this problem]. Also, as with (2.13) and (2.14), it is possible to develop multivariate probit and logit models.
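A short sketch of the multinomial pmf (2.18) with hypothetical category probabilities and counts:

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    # P(r~ = r | p) = n!/(r_1! ... r_J!) * p_1^r_1 ... p_J^r_J, as in (2.18)
    n = sum(counts)
    coef = factorial(n)
    for r in counts:
        coef //= factorial(r)
    return coef * prod(p ** r for p, r in zip(probs, counts))

probs = [0.2, 0.5, 0.3]            # p_j, summing to one (hypothetical, J = 3)
counts = [2, 5, 3]                 # r_1, r_2, r_3 with n = 10
print(multinomial_pmf(counts, probs))
```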
The pmfs reviewed above are some leading examples of probability models for independent discrete rvs. For further examples, see Johnson and Kotz (1969). When non-independent discrete rvs are considered, it is necessary to take account of the nature of dependencies, as is done in the literature on time series point processes [see, for example, Cox and Lewis (1966) for a discussion of this topic].

2.4.3. Continuous random variables

We first describe some properties of models for a single continuous rv, that is, univariate probability density functions (pdfs), and then turn to some models for two or more continuous rvs, that is, bivariate or multivariate pdfs.

Let \tilde x denote a continuous rv that can assume a continuum of values in the interval a to b and let f(x) be a non-negative function for a < x < b such that Pr(x < \tilde x < x + dx) = f(x)dx, where a < x < b and \int_a^b f(x)dx = 1. Then f(x) is the normalized pdf for the continuous rv, \tilde x. In this definition, a may be equal to -\infty and/or b = \infty. Further, the cumulative distribution function (cdf) for \tilde x is given by F(x) = \int_a^x f(t)dt with a < x < b. Given that \int_a^b f(t)dt = 1, 0 \le F(x) \le 1. Further, Pr(c < \tilde x < d) = F(d) - F(c), where a \le c < d \le b.

The moments around zero of a continuous rv, \tilde x, with pdf f(x), are given by

\mu_r' = E\tilde x^r = \int_a^b x^r f(x)dx,    r = 0,1,2,...,    (2.19)

with \mu_1' \equiv the mean of \tilde x. The central moments are given by

\mu_r = E(\tilde x - E\tilde x)^r = \int_a^b (x - E\tilde x)^r f(x)dx,    r = 0,1,2,...,    (2.20)

with \mu_2 = \sigma^2, the variance, and \sigma, the standard deviation. Note that \mu_1 = 0. Also, moments are said to exist when the integrals in (2.19) and (2.20) converge; in cases in which particular integrals in (2.19) and (2.20) fail to converge, the associated moments do not exist or are infinite.^6

^6 For analysis regarding the convergence and divergence of integrals, see, for example, Widder (1961, p. 271 ff.), or other advanced calculus texts.

For unimodal pdfs, unitless measures of skewness are sk = (mean - mode)/\sigma, \beta_1 = \mu_3^2/\mu_2^3, and \gamma_1 = \mu_3/\mu_2^{3/2}. For symmetric, unimodal pdfs, mean = modal value and thus sk = 0. Since all odd order central moments are equal to zero, given symmetry, \beta_1 = \gamma_1 = 0. Measures of kurtosis are given by \beta_2 = \mu_4/\mu_2^2 and \gamma_2 = \beta_2 - 3, the "excess". For a normal pdf, \beta_2 = 3 and \gamma_2 = 0. When \gamma_2 > 0, a pdf is called leptokurtic, and platykurtic when \gamma_2 < 0.
The moment-generating function for \tilde x with pdf f(x) is

M(t) = \int_a^b f(x)e^{tx}dx,    (2.21)

from which

d^r M(t)/dt^r |_{t=0} = E\tilde x^r,

under the assumption that the integral in (2.21) converges for some t = t_0 > 0. In (2.21), a may equal -\infty and/or b may equal \infty. Thus, knowing the form of M(t) allows one to compute moments conveniently.

The characteristic function associated with the continuous rv, \tilde x, with pdf f(x), is given by

C(t) = \int_a^b f(x)e^{itx}dx,    (2.22)

where i = \sqrt{-1}. It is known that the integral in (2.22) converges uniformly in t. Thus, the rth derivative of C(t) with respect to t is i^r\int_a^b x^r f(x)e^{itx}dx. On setting t = 0, C^{(r)}(0) = i^r\mu_r', which provides a useful expression for evaluating moments when they exist. See Kendall and Stuart (1958, ch. 4) for further discussion and uses of characteristic functions.

For each characteristic function there exists a unique pdf and vice versa. On the other hand, even if moments of all orders exist, it is only under certain conditions that a set of moments uniquely determines a pdf or cdf. However, as Kendall and Stuart (1958) point out, "...fortunately for statisticians, those conditions are obeyed by all the distributions arising in statistical practice" (p. 86; see also p. 109 ff.).
Several examples of univariate pdfs for continuous rvs follow.

2.4.3.1. Uniform. A rv \tilde x has a uniform pdf if and only if its pdf is

f(x) = \frac{1}{b-a},    a \le x \le b,    (2.23)

and f(x) = 0 elsewhere. That (2.23) is a normalized pdf is apparent since \int_a^b f(x)dx = 1. By direct evaluation,

\mu_r' = \int_a^b \frac{x^r}{b-a}dx = \frac{1}{b-a}\,\frac{b^{r+1}-a^{r+1}}{r+1},    r = 1,2,...,

and thus E\tilde x = (a+b)/2 and E\tilde x^2 = (b^3 - a^3)/3(b-a) = (a^2 + ab + b^2)/3. Also, from V(\tilde x) = E\tilde x^2 - (E\tilde x)^2, V(\tilde x) = (b-a)^2/12. Note too that the moment-generating function is

M(t) = \int_a^b \frac{e^{tx}}{b-a}dx = \frac{e^{bt}-e^{at}}{(b-a)t} = 1 + \tfrac{1}{2}(a+b)t + \tfrac{1}{3}(a^2+ab+b^2)\frac{t^2}{2!} + \cdots,

where e^{bt} = 1 + bt + (bt)^2/2! + \cdots and a similar expression for e^{at} have been employed. Then, for example, \mu_1' = M'(0) = (a+b)/2, \mu_2' = M''(0) = (a^2+ab+b^2)/3, etc. Finally, observe that (2.23) can be expressed as df/dx = 0, a < x < b. The solution of this differential equation subject to the normalization condition and f(x) = 0 for x < a and x > b leads to (2.23).
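The uniform moment results above can be verified by direct numerical integration of (2.19), as in the following sketch with hypothetical endpoints a and b.

```python
# A numerical check of the uniform moments via integration; a, b hypothetical.
from scipy.integrate import quad

a, b = 2.0, 5.0
f = lambda x: 1.0 / (b - a)
mean = quad(lambda x: x * f(x), a, b)[0]
second = quad(lambda x: x**2 * f(x), a, b)[0]
print(mean, (a + b) / 2)                 # E x
print(second, (a**2 + a*b + b**2) / 3)   # E x^2
print(second - mean**2, (b - a)**2 / 12) # variance
```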

2.4.3.2. Cauchy. A rv \tilde x is distributed in the Cauchy form if and only if its pdf has the following form:

f(x|\theta, \sigma) = \frac{1}{\pi\sigma}\,\frac{1}{1 + (x-\theta)^2/\sigma^2},    -\infty < x < \infty,  -\infty < \theta < \infty,  0 < \sigma < \infty.    (2.24)

That (2.24) is a normalized pdf can be established by making a change of variable, z = (x-\theta)/\sigma, and noting that \int_{-\infty}^{\infty}(1+z^2)^{-1}dz = \pi. Further, note that (2.24) is symmetric about \theta, the location parameter which is the modal value and median of the Cauchy pdf. However, \theta is not the mean since the mean and higher order moments of the Cauchy pdf do not exist. The non-existence of moments is fundamentally due to the fact that the pdf does not rapidly approach zero as (x-\theta)^2/\sigma^2 grows large; that is, the Cauchy pdf has heavy tails. A useful measure of dispersion for such a pdf is the inter-quartile range (IQR), that is, the value of 2c, with c > 0, such that F(\theta + c) - F(\theta - c) = 0.5, where F(\cdot) is the Cauchy cdf. For the Cauchy pdf, IQR = 2\sigma.

On making a change of variable in (2.24), z = (x-\theta)/\sigma, the standardized Cauchy pdf is obtained, namely

f(z) = \frac{1}{\pi}\,\frac{1}{1+z^2},    -\infty < z < \infty,    (2.25)

which is symmetric about z = 0, the modal value and median. Further, it is interesting to note that (2.25) can be generated by assuming that an angle \omega ranging from -\pi/2 to \pi/2 is uniformly distributed, that is, \omega = \tan^{-1}z has the pdf p(\omega)d\omega = d\omega/\pi, -\pi/2 < \omega < \pi/2. Since d\omega = d\tan^{-1}z = dz/(1+z^2), this uniform pdf for \omega implies (2.25).
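Two features of the Cauchy pdf noted above, IQR = 2\sigma and the non-existence of the mean, are illustrated in the sketch below for hypothetical values of \theta and \sigma; the running sample mean wanders rather than settling down.

```python
# A sketch of Cauchy features; theta and sigma are hypothetical.
import numpy as np
from scipy.stats import cauchy

theta, sigma = 1.0, 2.0
q75, q25 = cauchy.ppf([0.75, 0.25], loc=theta, scale=sigma)
print("IQR:", q75 - q25, "  2*sigma:", 2*sigma)

rng = np.random.default_rng(2)
x = theta + sigma * rng.standard_cauchy(size=100_000)
for n in (100, 1_000, 10_000, 100_000):
    print(n, x[:n].mean())  # wanders rather than converging
```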

2.4.3.3. Normal. A rv \tilde x is said to be normally distributed if and only if its pdf is

f(x|\theta, \sigma) = (1/\sigma\sqrt{2\pi})\exp\{-(x-\theta)^2/2\sigma^2\},
    -\infty < x < \infty,  -\infty < \theta < \infty,  0 < \sigma < \infty.    (2.26)

The pdf in (2.26) is the normal pdf that integrates to one and thus is normalized. It is symmetric about \theta, a location parameter that is the modal value, median, and mean. The parameter \sigma is a scale parameter, the standard deviation of the normal pdf as indicated below. Note that from numerical evaluation, Pr\{|\tilde x - \theta| \le 1.96\sigma\} = 0.95 for the normal pdf (2.26), indicating that its tails are rather thin or, equivalently, that (2.26) decreases very rapidly as (x-\theta)^2 grows in value, a fact that accounts for the existence of moments of all orders. Since (2.26) is symmetric about \theta, all odd order central moments are zero, that is, \mu_{2r+1} = 0, r = 0,1,2,.... From E(\tilde x - \theta) = 0, E\tilde x = \theta, the mean of the normal pdf. As regards even central moments, they satisfy

\mu_{2r} = E(\tilde x - E\tilde x)^{2r} = \sigma^{2r}2^r\Gamma(r+1/2)/\sqrt{\pi}
        = \sigma^{2r}(2r)!/2^r r!,    (2.27)

where \Gamma(r+1/2) is the gamma function, \Gamma(q), with argument q = r + 1/2, that is, \Gamma(q) = \int_0^{\infty}u^{q-1}e^{-u}du, with 0 < q < \infty.^7 From (2.27), \mu_2 = \sigma^2 and \mu_4 = 3\sigma^4. Thus, the kurtosis measure \beta_2 = \mu_4/\mu_2^2 = 3 and \gamma_2 = \beta_2 - 3 = 0 for the normal pdf.

^7 See, for example, Zellner (1971, p. 365) for a derivation of (2.27). From the calculus, \Gamma(q+1) = q\Gamma(q), \Gamma(1) = 1, and \Gamma(1/2) = \sqrt{\pi}. Using these relations, the second line of (2.27) can be derived from the first.
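The even central moment formula (2.27) can be checked numerically, as in the following sketch with hypothetical \theta and \sigma; both lines of (2.27) are compared with a numerically integrated central moment.

```python
# A numerical check of (2.27); theta and sigma are hypothetical.
import numpy as np
from math import gamma, factorial, sqrt, pi
from scipy.stats import norm
from scipy.integrate import quad

theta, sigma = 0.5, 1.7
for r in (1, 2, 3):
    closed1 = sigma**(2*r) * 2**r * gamma(r + 0.5) / sqrt(pi)        # first line of (2.27)
    closed2 = sigma**(2*r) * factorial(2*r) / (2**r * factorial(r))  # second line of (2.27)
    numeric = quad(lambda x: (x - theta)**(2*r) * norm.pdf(x, theta, sigma),
                   -np.inf, np.inf)[0]
    print(r, closed1, closed2, numeric)
```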
The standardized form of (2.26) may be obtained by making a change of variable, z = (x-\theta)/\sigma, to yield

f(z) = (1/\sqrt{2\pi})\exp(-z^2/2),    -\infty < z < \infty,    (2.28)

with modal value, median, and mean at z = 0. Also, V(\tilde z) = 1 and E\tilde z^4 = 3 from (2.27).
As was shown in Section 2.4.2.1, the normal pdf can be viewed as a limiting form of the binomial process. Below it will be indicated that central limit theorems show that more general sequences of rvs have limiting normal distributions. These results underline the importance of the normal distribution in theoretical and applied statistics.
2.4.3.4. Student-t. A rv \tilde x is distributed in the univariate Student-t (US-t) form if and only if it has the following pdf:

f(x|\theta, h, \nu) = c\,(h/\nu)^{1/2}\big[1 + (h/\nu)(x-\theta)^2\big]^{-(\nu+1)/2},
    -\infty < x < \infty,  -\infty < \theta < \infty,  0 < \nu, h < \infty,    (2.29)

with c = \Gamma[(\nu+1)/2]/\sqrt{\pi}\,\Gamma(\nu/2), where \Gamma(\cdot) denotes the gamma function. From inspection of (2.29) it is seen that the US-t pdf has a single mode at x = \theta and is symmetric about the modal value. Thus, x = \theta is the median and mean (which exists for \nu > 1 - see below) of the US-t pdf. As will be seen, the parameter h is intimately linked to the dispersion of the US-t, while the parameter \nu, often called the "degrees of freedom" parameter, is involved both in the dispersion as well as the kurtosis of the pdf. Note that if \nu = 1, the US-t is identical to the Cauchy pdf in (2.24) with h = 1/\sigma^2. On the other hand, as \nu grows in value, the US-t approaches the normal pdf (2.26) with mean \theta and variance 1/h.

In Zellner (1971, p. 367 ff.) it is shown that the US-t pdf in (2.29) is a normalized pdf. The odd order moments about \theta, \mu_{2r-1} = E(\tilde x - \theta)^{2r-1}, r = 1,2,..., exist when \nu > 2r - 1 and are all equal to zero given the symmetry of the pdf about \theta. In particular, for \nu > 1, E(\tilde x - \theta) = 0 and E\tilde x = \theta, the mean, which exists given \nu > 1. The even order central moments, \mu_{2r} = E(\tilde x - \theta)^{2r}, r = 1,2,...,

exist given that \nu > 2r and are given by

\mu_{2r} = E(\tilde x - \theta)^{2r} = \frac{\Gamma(r+1/2)\Gamma(\nu/2 - r)}{\Gamma(1/2)\Gamma(\nu/2)}\Big(\frac{\nu}{h}\Big)^r,    r = 1,2,...,  \nu > 2r.    (2.30)

From (2.30), the second and fourth central moments are \mu_2 = E(\tilde x - \theta)^2 = \nu/(\nu-2)h, \nu > 2, and \mu_4 = E(\tilde x - \theta)^4 = 3\nu^2/(\nu-2)(\nu-4)h^2, \nu > 4. The kurtosis measure is then \gamma_2 = \mu_4/\mu_2^2 - 3 = 6/(\nu-4), for \nu > 4, and thus the US-t is leptokurtic (\gamma_2 > 0). As \nu gets large, \gamma_2 \to 0, and the US-t approaches a normal form with mean \theta and variance 1/h. When \nu > 30, the US-t's form is very close to that of a normal pdf. However, for small \nu, the US-t pdf has much heavier tails than a normal pdf with the same mean and variance.
The standardized form of (2.29) is obtained by making the change of variable, t = \sqrt{h}(x-\theta), that yields

f(t|\nu) = (c/\sqrt{\nu})\big(1 + t^2/\nu\big)^{-(\nu+1)/2},    -\infty < t < \infty,    (2.31)

where c has been defined in connection with (2.29). The standardized US-t pdf in (2.31) has its modal value at t = 0, which is also the median. The moments of (2.31) may easily be obtained from those of \tilde x - \theta presented above.
Finally, it is of interest to note that the US-t pdf in (2.29) can be generated as a "continuous mixture" of normal pdfs, that is,

f(x|\theta, h, \nu) = \int_0^{\infty}f_N(x|\theta, \sigma)f_{IG}(\sigma|h, \nu)d\sigma,    (2.32)

where f_N(x|\theta, \sigma) is the normal pdf in (2.26) and f_{IG}(\sigma|h, \nu) is an inverted gamma pdf "mixing distribution" for \sigma, 0 < \sigma < \infty, given by

f_{IG}(\sigma|h, \nu) = \frac{2}{\Gamma(\nu/2)}\Big(\frac{\nu}{2h}\Big)^{\nu/2}\sigma^{-(\nu+1)}\exp\Big\{-\frac{\nu}{2h\sigma^2}\Big\},    (2.33)

where \nu and h are the parameters in the US-t pdf in (2.29).^8 From (2.32), it is seen that f_N(x|\theta, \sigma) is averaged over possible values of \sigma. This is an example in which the standard deviation \sigma of a normal pdf can be viewed as random with the pdf shown in (2.33). The fact that (2.32) yields the US-t pdf is a useful interpretation of the US-t pdf. Many well-known pdfs can be generated as continuous mixtures of underlying pdfs.

^8 See Zellner (1971, p. 371 ff.) for properties of this and other inverted gamma pdfs.
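The mixture representation (2.32)-(2.33) also provides a convenient way to draw US-t variates: draw \sigma from the inverted gamma pdf (equivalently, \nu/(h\sigma^2) is a \chi^2_\nu variate) and then x given \sigma from N(\theta, \sigma^2). The sketch below uses hypothetical \theta, h, and \nu.

```python
# A simulation sketch of the mixture (2.32)-(2.33); theta, h, nu hypothetical.
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(3)
theta, h, nu = 2.0, 4.0, 5.0
reps = 500_000

w = rng.chisquare(nu, size=reps)         # nu/(h*sigma^2) ~ chi^2 with nu d.f.
sigma = np.sqrt(nu / (h * w))            # draws from the inverted gamma (2.33)
x = rng.normal(theta, sigma)             # x | sigma ~ N(theta, sigma^2)

std = np.sqrt(h) * (x - theta)           # should behave as standard Student-t
print("simulated variance:", std.var(), "  nu/(nu-2):", nu / (nu - 2))
print("P(|t| > 2) simulated:", (np.abs(std) > 2).mean(),
      "  exact:", 2 * student_t.sf(2, nu))
```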

2.4.3.5. Other important univariate pdfs. Among many pdfs that are important in theoretical and applied statistics and econometrics, the following are some leading examples.

The gamma pdf, f(x|\gamma, \alpha) = x^{\alpha-1}e^{-x/\gamma}/\gamma^{\alpha}\Gamma(\alpha), with 0 < x < \infty and parameters 0 < \alpha, \gamma, is a rich class of pdfs. With a change of variable, z = x/\gamma, it can be brought into standardized form, p(z|\alpha) = z^{\alpha-1}e^{-z}/\Gamma(\alpha), 0 < z < \infty. In this form its relation to the gamma function is apparent. If in the non-standardized gamma pdf \alpha = \nu/2 and \gamma = 2, the result is the chi-squared pdf with \nu "degrees of freedom", p(x|\nu) = x^{\nu/2-1}e^{-x/2}/2^{\nu/2}\Gamma(\nu/2), with 0 < x < \infty and 0 < \nu < \infty. If the transformation x = 1/y^2 is made, the pdf for y is p(y|\gamma, \alpha) = 2e^{-1/\gamma y^2}/y^{2\alpha+1}\Gamma(\alpha)\gamma^{\alpha}, 0 < y < \infty. The particular inverted gamma pdf in (2.33) can be obtained from p(y|\gamma, \alpha) by setting \sigma = y, \alpha = \nu/2, and \gamma = 2h/\nu. Properties of these and other gamma-related densities are discussed in Raiffa and Schlaifer (1961), Zellner (1971), and Johnson and Kotz (1970).
For a continuous rv that has a range 0 \le x \le c, the beta pdf, f(x|a, b, c) = (x/c)^{a-1}(1 - x/c)^{b-1}/cB(a, b), where B(a, b) is the beta function^9 with a, b > 0, is a flexible and useful pdf that can assume a variety of shapes. By a change of variable, y = x - d, the range of the beta pdf above can be changed to y = -d to y = c - d. Also, by taking z = x/c, the standardized form is f(z|a, b) = z^{a-1}(1-z)^{b-1}/B(a, b), with 0 \le z \le 1. There are various pdfs associated with the beta pdf. The inverted beta pdf is obtained from the standardized beta by the change of variable z = 1/(1+u), so that 0 < u < \infty and f(u|a, b) = u^{b-1}/(1+u)^{a+b}B(a, b) is the inverted beta pdf. Another form of the inverted beta pdf is obtained by letting u = y/c, with 0 < c < \infty. Then f(y|a, b, c) = (y/c)^{b-1}/(1 + y/c)^{a+b}cB(a, b), with 0 < y < \infty. The Fisher-Snedecor F distribution is a special case of this last density with a = \nu_2/2, b = \nu_1/2, and c = \nu_2/\nu_1. The parameters \nu_1 and \nu_2 are referred to as "degrees of freedom" parameters. Properties of the pdfs mentioned in this paragraph are given in the references cited at the end of the previous paragraph.
The discussion above has emphasized the importance of the normal, Student-t, beta, and gamma distributions. For each of the distributions mentioned above there are often several ways of generating them that are useful, lead to greater insight, and are of value in analysis and applications. For example, generation of the US-t as a special continuous mixture of normal pdfs was mentioned above. A rv with the chi-squared pdf with \nu degrees of freedom, say \chi_{\nu}^2, can be considered as the sum of \nu squared independent, standardized normal variables, \chi_{\nu}^2 = \sum_{i=1}^{\nu}\tilde z_i^2, with \tilde z_i = (\tilde x_i - \theta)/\sigma. If \chi_{\nu_1}^2 and \chi_{\nu_2}^2 are two independent chi-squared variables with \nu_1 and \nu_2 degrees of freedom, respectively, then \tilde F_{\nu_1,\nu_2} = (\chi_{\nu_1}^2/\nu_1)/(\chi_{\nu_2}^2/\nu_2) has an F-pdf with \nu_1 and \nu_2 degrees of freedom. These are just some of the ways in which particular pdfs can be generated. Further examples are provided in the references mentioned above.

^9 From the calculus, B(a, b) = B(b, a) = \Gamma(a)\Gamma(b)/\Gamma(a+b). Also, B(a, b) \equiv \int_0^1 z^{a-1}(1-z)^{b-1}dz, a, b > 0.
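The generation results just described can be illustrated by simulation, as in the following sketch with hypothetical degrees of freedom.

```python
# A simulation sketch: chi-squared as a sum of squared standardized normals,
# and F as a ratio of independent chi-squared variables over their d.f.
# nu1 and nu2 are hypothetical.
import numpy as np
from scipy.stats import chi2, f as f_dist

rng = np.random.default_rng(4)
nu1, nu2, reps = 4, 8, 200_000

z = rng.normal(size=(reps, nu1))
chi_1 = (z**2).sum(axis=1)                 # chi-squared with nu1 d.f.
chi_2 = rng.chisquare(nu2, size=reps)      # chi-squared with nu2 d.f.
F = (chi_1 / nu1) / (chi_2 / nu2)          # F with (nu1, nu2) d.f.

print("P(chi2 > 9) simulated:", (chi_1 > 9).mean(), "  exact:", chi2.sf(9, nu1))
print("P(F > 3)    simulated:", (F > 3).mean(), "  exact:", f_dist.sf(3, nu1, nu2))
```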
Above, the reciprocal transformation was employed to produce "inverted" pdfs. Many other transformations can be fruitfully utilized. For example, if the continuous rv \tilde y is such that 0 < \tilde y < \infty and \tilde z = \ln\tilde y, -\infty < \tilde z < \infty, has a normal pdf with mean \theta and variance \sigma^2, \tilde y is said to have a "log-normal" pdf whose form can be obtained from the normal pdf for \tilde z by a simple change of variable. The median of the pdf for \tilde y = e^{\tilde z} is e^{\theta}, while the mean of \tilde y is E\tilde y = Ee^{\tilde z} = e^{\theta+\sigma^2/2}. Thus, there is an interesting dependence of the mean of \tilde y on the variance of \tilde z. This and many other transformations have been analyzed in the literature.
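A brief numerical check of the log-normal median and mean formulas, with hypothetical \theta and \sigma:

```python
# A check of the log-normal median and mean; theta and sigma are hypothetical.
import numpy as np

rng = np.random.default_rng(5)
theta, sigma = 0.4, 0.9
y = np.exp(rng.normal(theta, sigma, size=1_000_000))
print(np.median(y), np.exp(theta))               # median e^theta
print(y.mean(), np.exp(theta + sigma**2 / 2))    # mean e^{theta + sigma^2/2}
```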
Finally, it should be noted that many of the pdfs mentioned in this section can be obtained as solutions to differential equations. For example, the normal pdf in (2.26) is the solution to (1/f)df/dx = -(x-\mu)/\sigma^2. The generalization of this differential equation that yields the Pearson system of pdfs is given by

\frac{1}{f}\frac{df}{dx} = \frac{-(x-a)}{b_0 + b_1 x + b_2 x^2}.    (2.34)

The integral of (2.34) is

f(x) = A(x - c_1)^{m_1}(c_2 - x)^{m_2},    (2.35)

where the value of A is fixed by \int f(x)dx = 1 and c_1 and c_2 are the roots, possibly complex, of b_0 + b_1 x + b_2 x^2 = 0. See Jeffreys (1967, p. 74 ff.) for a discussion of the solutions to (2.34) that constitute the Pearson system, which includes many frequently encountered pdfs. For a discussion of other systems of pdfs, see Kendall and Stuart (1958, p. 167 ff.).

2.4.3.6. Multivariate pdfs for continuous random variables. Consider a random vector \tilde x' = (\tilde x_1, \tilde x_2,...,\tilde x_m) with elements \tilde x_i, i = 1,2,...,m, that are scalar continuous rvs. Assume that \tilde x \in R_x, the sample space. For example, R_x might be -\infty < \tilde x_i < \infty, i = 1,2,...,m. The pdf for \tilde x, or equivalently the joint pdf for the elements of \tilde x, f(x) = f(x_1, x_2,...,x_m), is a non-negative continuous and single-valued function such that f(x)dx = f(x_1, x_2,...,x_m)dx_1 dx_2 \cdots dx_m is the probability that \tilde x is contained in the infinitesimal element of volume dx = dx_1 dx_2 \cdots dx_m. If

\int_{R_x} f(x)dx = \int_{R_x}\cdots\int f(x_1, x_2,...,x_m)dx_1 dx_2 \cdots dx_m = 1,    (2.36)

then f(x) is a normalized pdf for \tilde x. When \tilde x has just two elements, m = 2, f(x) is a bivariate pdf, if three elements, m = 3, a trivariate pdf, and if m > 3, a multivariate pdf.

When R_x is -\infty < \tilde x_i < \infty, i = 1,2,...,m, the cumulative distribution function associated with f(x) is F(a), given by

F(a) = Pr(\tilde x \le a) = \int_{-\infty}^{a_1}\int_{-\infty}^{a_2}\cdots\int_{-\infty}^{a_m} f(x_1, x_2,...,x_m)dx_1 dx_2 \cdots dx_m,

where a' = (a_1, a_2,...,a_m) is a given vector and Pr(\tilde x \le a) is the probability of the intersection of the events \tilde x_i \le a_i, i = 1,2,...,m.

The mean, assuming that it exists, of an m \times 1 random vector \tilde x is

E\tilde x = (E\tilde x_1, E\tilde x_2,...,E\tilde x_m)' = (\theta_1, \theta_2,...,\theta_m)' = \theta,    (2.37)

where, if \tilde x has pdf f(x) and \tilde x \in R_x,

E\tilde x_i = \theta_i = \int_{R_x} x_i f(x)dx,    i = 1,2,...,m.    (2.38)

This means that the \theta_i's exist and are finite if and only if each integral in (2.38) converges to a finite value.
Second order moments about the mean vector \theta are given by

V(\tilde x) = E(\tilde x - \theta)(\tilde x - \theta)' = (E(\tilde x_i - \theta_i)(\tilde x_j - \theta_j)),    i, j = 1,2,...,m,    (2.39)

and the typical element of the symmetric m \times m matrix V(\tilde x) is given by

E(\tilde x_i - \theta_i)(\tilde x_j - \theta_j) \equiv \sigma_{ij} = \int_{R_x}(x_i - \theta_i)(x_j - \theta_j)f(x)dx.    (2.40)

If, in (2.40), i = j, \sigma_{ii} = E(\tilde x_i - \theta_i)^2 is the variance of \tilde x_i, i = 1,2,...,m, while if i \ne j, \sigma_{ij}, given in (2.40), is the covariance of \tilde x_i and \tilde x_j. Clearly, \sigma_{ij} = \sigma_{ji} and thus the m \times m matrix of variances and covariances,

\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2m} \\ \vdots & \vdots & & \vdots \\ \sigma_{m1} & \sigma_{m2} & \cdots & \sigma_{mm} \end{pmatrix},    (2.41)

the "covariance matrix" for \tilde x, is symmetric with m(m+1)/2 distinct elements.


Ch. 2." Statistical Theory and Econometrics 99

The "correlation matrix," denoted by P, associated with V ( 2 ) is given by

1 P12 P13 "'" Plm i


1021 1 P23 """ P2m
P = P31 P32 1 " " " P3m

lore1 lore2 Pm3 "'" 1

where pij = o , j / ~ , i, j = l , 2 , 3 , . . . , m . N o t e that P is symmetric and that


P = D- l E D - 1, with N given in (2.41) and D is an m × m diagonal matrix with
typical element a]i/2. In general, mixed central m o m e n t s are given b y ~l,, t:,. ,tin=
E(21 - 01)l,(22 - 0 2 ) 1 2 . . . ( 2 m - On) l,., I i = 0, 1,2 ..... i = 1,2,... ,m.
To illustrate linear transformations of the elements of \tilde x - \theta, consider the m \times 1 random vector \tilde z = H(\tilde x - \theta), where H is an m \times m non-stochastic matrix of rank m. Then from the linearity property of the expectation operator, E\tilde z = HE(\tilde x - \theta) = 0, since E\tilde x = \theta from (2.37). Thus, \tilde z has a zero mean vector. By definition from (2.39), V(\tilde z) = E\tilde z\tilde z' = HE(\tilde x - \theta)(\tilde x - \theta)'H' = H\Sigma H', the covariance matrix of \tilde z. Now if \Sigma is positive definite, there exists a unique H such that H\Sigma H' = I_m.^10 If H is so chosen, then V(\tilde z) = E\tilde z\tilde z' = I_m; that is, E\tilde z_i^2 = 1, i = 1,2,...,m, and E\tilde z_i\tilde z_j = 0 for all i \ne j. From H\Sigma H' = I_m, \Sigma = H^{-1}(H')^{-1} and \Sigma^{-1} = H'H. Furthermore, from z = H(x - \theta), the Jacobian of the transformation is

J = mod|\partial z/\partial x| = mod|H|,    (2.42)

where "mod" denotes absolute value, and the Jacobian matrix is

\partial z/\partial x = (\partial z_i/\partial x_j) = H,    i, j = 1,2,...,m,

and dz = Jdx shows how the transformation from z to x modifies the infinitesimal unit of volume. Thus, the pdf for z, f(z), gets transformed as follows: f(z)dz = Jf[H(x-\theta)]dx, and Jf[H(x-\theta)] is the pdf for x.^11

^10 That is, given that \Sigma is a positive definite symmetric matrix, \Sigma can be diagonalized as follows: P'\Sigma P = D(\lambda_i), where P is an m \times m orthogonal matrix and the \lambda_i are the roots of \Sigma. Then D^{-1/2}P'\Sigma PD^{-1/2} = I and H = D^{-1/2}P', where D^{-1/2} = D(\lambda_i^{-1/2}), an m \times m diagonal matrix with typical element \lambda_i^{-1/2}.

^11 If z_i = \phi_i(x), i = 1,2,...,m, or z = \phi(x) represents a set of one-to-one transformations from x to z and if \tilde z has pdf f(z), then f(z)dz = Jf[\phi(x)]dx, where J = mod|\partial\phi/\partial x| is assumed not equal to zero in R_x.
Associated with bivariate and multivariate pdfs are marginal and conditional pdfs. For example, in terms of a bivariate pdf, f(x_1, x_2), the marginal pdf for \tilde x_1 is given by

g(x_1) = \int_{R_{x_2}} f(x_1, x_2)dx_2,    (2.43)

while the marginal pdf for \tilde x_2 is

h(x_2) = \int_{R_{x_1}} f(x_1, x_2)dx_1.

Note that

\int_{R_{x_1}} g(x_1)dx_1 = \int_{R_{x_2}} h(x_2)dx_2 = 1,

given that f(x_1, x_2) satisfies (2.36). Also, g(x_1)dx_1 is the probability that \tilde x_1 is in the interval (x_1, x_1 + dx_1) for all values of \tilde x_2, and similarly for h(x_2).

Definition

The rvs \tilde x_1 and \tilde x_2 are independent if and only if their joint pdf, f(x_1, x_2), satisfies f(x_1, x_2) = g(x_1)h(x_2).

The conditional pdf for \tilde x_1 given x_2, denoted by f(x_1|x_2), is

f(x_1|x_2) = f(x_1, x_2)/h(x_2),    (2.44)

provided that h(x_2) > 0. Similarly, the conditional pdf for \tilde x_2 given \tilde x_1, denoted by f(x_2|x_1), is f(x_2|x_1) = f(x_1, x_2)/g(x_1), provided that g(x_1) > 0. From (2.44), f(x_1, x_2) = f(x_1|x_2)h(x_2) which, when inserted in (2.43), shows that the marginal pdf

g(x_1) = \int_{R_{x_2}} f(x_1|x_2)h(x_2)dx_2

can be interpreted as an average of the conditional pdf f(x_1|x_2) with the marginal pdf h(x_2) serving as the weighting function. Also, from (2.44),

\int_{R_{x_1}} f(x_1|x_2)dx_1 = 1,

since

\int_{R_{x_1}} f(x_1, x_2)dx_1 = h(x_2).

An instructive interpretation of (2.44) is obtained by writing

f(x_1, x_2)dx_1 dx_2 = [h(x_2)dx_2][f(x_1|x_2)dx_1]
                    = Pr(x_2 < \tilde x_2 < x_2 + dx_2)Pr(x_1 < \tilde x_1 < x_1 + dx_1|\tilde x_2 = x_2).

Some of these general features of multivariate pdfs will be illustrated below in terms of specific, important pdfs that are frequently employed in econometrics.

(1) Bivariate Normal (BN). A two-element random vector, \tilde x' = (\tilde x_1, \tilde x_2), has a BN distribution if and only if its pdf is

f(x_1, x_2|\theta) = \frac{1}{2\pi\sigma_1\sigma_2(1-\rho^2)^{1/2}}\exp(-Q/2),    -\infty < x_1, x_2 < \infty,    (2.45)

where

Q = [(x_1-\mu_1)^2/\sigma_1^2 + (x_2-\mu_2)^2/\sigma_2^2 - 2\rho(x_1-\mu_1)(x_2-\mu_2)/\sigma_1\sigma_2]/(1-\rho^2)    (2.46)

and \theta' = (\mu_1, \mu_2, \sigma_1, \sigma_2, \rho), with |\rho| < 1, -\infty < \mu_i < \infty, and 0 < \sigma_i < \infty, i = 1,2. Under these conditions, Q is a positive definite quadratic form.

To obtain the standardized form of (2.45), let z_1 = (x_1-\mu_1)/\sigma_1 and z_2 = (x_2-\mu_2)/\sigma_2, with dz_1 dz_2 = dx_1 dx_2/\sigma_1\sigma_2. Further, let the 2 \times 2 matrix P^{-1} be defined by

P^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}.

Then Q = z'P^{-1}z, where z' = (z_1, z_2), and (2.45) becomes

f(z|\rho) = (2\pi)^{-1}|P|^{-1/2}\exp(-z'P^{-1}z/2),    (2.47)

with -\infty < z_1, z_2 < \infty, and where |P|^{-1/2} = 1/(1-\rho^2)^{1/2}. This is the standardized BN pdf.

Now in (2.47) let z = Hv, where v' = (v_1, v_2) and H is a 2 \times 2 non-singular matrix such that H'P^{-1}H = I_2. Then dz = |H|dv, and, from H'P^{-1}H = I_2, |H'P^{-1}H| = |H|^2|P^{-1}| = 1 or |H| = |P|^{1/2}. Using these results, (2.47) becomes

f(v) = (2\pi)^{-1}\exp\{-v'v/2\}
     = [(2\pi)^{-1/2}\exp(-v_1^2/2)][(2\pi)^{-1/2}\exp(-v_2^2/2)].    (2.48)

Thus, \tilde v_1 and \tilde v_2 are independent, standardized normal rvs and \int f(v)dv = 1, implying that \int f(x|\theta)dx = 1. Furthermore, from (2.48), E\tilde v = 0, so that E\tilde z = HE\tilde v = 0, and from the definition of z, E\tilde x_1 = \mu_1 and E\tilde x_2 = \mu_2. Thus, E\tilde x = \mu, \mu' = (\mu_1, \mu_2), is the mean of \tilde x. Also, from E\tilde v\tilde v' = I_2 and \tilde z = H\tilde v, E\tilde z\tilde z' = HE\tilde v\tilde v'H' = HH' = P, since from H'P^{-1}H = I_2, P = HH'. Therefore,

V(\tilde z) = E\tilde z\tilde z' = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},    (2.49)

where the matrix involving \rho is the inverse of P^{-1}, introduced above. Then (2.49) yields

E\tilde z_1^2 = E\Big(\frac{\tilde x_1 - \mu_1}{\sigma_1}\Big)^2 = 1    or    E(\tilde x_1 - \mu_1)^2 = \sigma_1^2,

E\tilde z_2^2 = E\Big(\frac{\tilde x_2 - \mu_2}{\sigma_2}\Big)^2 = 1    or    E(\tilde x_2 - \mu_2)^2 = \sigma_2^2,

and

E\tilde z_1\tilde z_2 = E\Big(\frac{\tilde x_1 - \mu_1}{\sigma_1}\Big)\Big(\frac{\tilde x_2 - \mu_2}{\sigma_2}\Big) = \rho    or    E(\tilde x_1 - \mu_1)(\tilde x_2 - \mu_2) = \sigma_1\sigma_2\rho.

From these results, V(\tilde x_1) = \sigma_1^2, V(\tilde x_2) = \sigma_2^2, and \rho = E(\tilde x_1 - \mu_1)(\tilde x_2 - \mu_2)/\sigma_1\sigma_2 is the correlation coefficient for \tilde x_1 and \tilde x_2.

To obtain the marginal and conditional pdfs associated with the BN pdf, write Q in (2.46) as

Q = [(z_2 - \rho z_1)^2 + (1-\rho^2)z_1^2]/(1-\rho^2),    (2.50a)
  = [(z_1 - \rho z_2)^2 + (1-\rho^2)z_2^2]/(1-\rho^2),    (2.50b)
Ch. 2." Statistical Theory and Econometrics 103

where z_i = (x_i - \mu_i)/\sigma_i, i = 1,2. Substituting (2.50a) into (2.45) and noting that dz_1 dz_2 = dx_1 dx_2/\sigma_1\sigma_2, the pdf for z_1 and z_2 is

f(z_1, z_2|\rho) = f(z_2|z_1, \rho)g(z_1),    (2.51)

where

f(z_2|z_1, \rho) = [2\pi(1-\rho^2)]^{-1/2}\exp\{-(z_2 - \rho z_1)^2/2(1-\rho^2)\}    (2.51a)

and

g(z_1) = (2\pi)^{-1/2}\exp(-z_1^2/2).    (2.51b)

That (2.51b) is the marginal pdf for \tilde z_1 can be established by integrating (2.51) with respect to z_2. Given this result, f(z_2|z_1, \rho) = f(z_1, z_2|\rho)/g(z_1) is the conditional pdf for z_2 given z_1 and is shown explicitly in (2.51a).

From (2.51b), it is seen that the marginal pdf for \tilde z_1 is a standardized normal pdf with zero mean and unit variance. Since \tilde z_1 = (\tilde x_1 - \mu_1)/\sigma_1, the marginal pdf for \tilde x_1 is

g(x_1|\mu_1, \sigma_1) = (2\pi\sigma_1^2)^{-1/2}\exp\{-(x_1 - \mu_1)^2/2\sigma_1^2\},

a normal pdf with mean \mu_1 and variance \sigma_1^2.

From (2.51a), the conditional pdf for \tilde z_2, given z_1, is normal with conditional mean \rho z_1 and conditional variance 1 - \rho^2. Since \tilde z_2 = (\tilde x_2 - \mu_2)/\sigma_2, the conditional pdf for \tilde x_2, given \tilde x_1, is normal, that is,

f(x_2|x_1, \theta) = [2\pi\sigma_2^2(1-\rho^2)]^{-1/2}\exp\{-[x_2 - \mu_2 - \beta_{2.1}(x_1 - \mu_1)]^2/2(1-\rho^2)\sigma_2^2\},    (2.52)

where \theta' = (\rho, \mu_1, \mu_2, \beta_{2.1}, \sigma_2), with \beta_{2.1} \equiv \sigma_2\rho/\sigma_1. From (2.52),

E(\tilde x_2|\tilde x_1 = x_1) = \mu_2 + \beta_{2.1}(x_1 - \mu_1)    (2.53)

and

V(\tilde x_2|\tilde x_1 = x_1) = \sigma_2^2(1-\rho^2),    (2.54)

where E(\tilde x_2|\tilde x_1 = x_1) is the conditional mean of \tilde x_2, given \tilde x_1, and V(\tilde x_2|\tilde x_1 = x_1) is the conditional variance of \tilde x_2, given \tilde x_1. Note from (2.53) that the conditional mean of \tilde x_2 is linear in x_1 with slope or "regression" coefficient \beta_{2.1} = \sigma_2\rho/\sigma_1.
The marginal pdf for \tilde x_2 and the conditional pdf for \tilde x_1, given \tilde x_2, may be obtained by substituting (2.50b) into (2.45) and performing the operations in the preceding paragraphs. The results are:

f(x_1, x_2|\theta) = f(x_1|x_2, \theta_1)g(x_2|\theta_2),    (2.55)

with

g(x_2|\cdot) = (2\pi\sigma_2^2)^{-1/2}\exp\{-(x_2 - \mu_2)^2/2\sigma_2^2\}    (2.56)

and

f(x_1|x_2, \cdot) = [2\pi\sigma_1^2(1-\rho^2)]^{-1/2}\exp\{-[x_1 - \mu_1 - \beta_{1.2}(x_2 - \mu_2)]^2/2(1-\rho^2)\sigma_1^2\},    (2.57)

where \beta_{1.2} \equiv \sigma_1\rho/\sigma_2. From (2.56), it is seen that the marginal pdf for \tilde x_2 is normal with mean \mu_2 and variance \sigma_2^2, while from (2.57) the conditional pdf for \tilde x_1, given \tilde x_2, is normal with

E(\tilde x_1|\tilde x_2 = x_2) = \mu_1 + \beta_{1.2}(x_2 - \mu_2)    (2.58)

and

V(\tilde x_1|\tilde x_2 = x_2) = \sigma_1^2(1-\rho^2),    (2.59)

where \beta_{1.2} = \sigma_1\rho/\sigma_2 is the regression coefficient.

From what has been presented above, it is the case that (1) all marginal and conditional pdfs are in the normal form and (2) both conditional means in (2.53) and (2.58) are linear in the conditioning variable. Since E(\tilde x_1|\tilde x_2) and E(\tilde x_2|\tilde x_1) define the "regression functions" for a bivariate pdf in general, the bivariate normal pdf is seen to have both regression functions linear. Further, from the definitions of \beta_{2.1} and \beta_{1.2}, \beta_{2.1}\beta_{1.2} = \rho^2, the squared correlation coefficient, so that the regression coefficients have the same algebraic signs. Further, if \rho = 0, the joint pdf in (2.45) factors into

\prod_{i=1}^{2}(2\pi\sigma_i^2)^{-1/2}\exp\{-(x_i - \mu_i)^2/2\sigma_i^2\},

showing that with \rho = 0, \tilde x_1 and \tilde x_2 are independent. Thus, for the BN distribution, \rho = 0 implies independence and also, as is true in general, independence implies \rho = 0. Note also that with \rho = 0, the conditional variances reduce to

marginal variances and \beta_{1.2} = \beta_{2.1} = 0, that is, the regressions in (2.53) and (2.58) have zero slope coefficients.
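The linear regression functions (2.53) and (2.58) and the relation \beta_{2.1}\beta_{1.2} = \rho^2 can be illustrated by simulating a bivariate normal sample with hypothetical parameter values, as in the following sketch.

```python
# A simulation sketch of the BN regression results; parameter values hypothetical.
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, 2.0])
sigma1, sigma2, rho = 2.0, 1.5, 0.6
cov = np.array([[sigma1**2, rho*sigma1*sigma2],
                [rho*sigma1*sigma2, sigma2**2]])
x = rng.multivariate_normal(mu, cov, size=500_000)

c12 = np.cov(x[:, 0], x[:, 1])[0, 1]
b21 = c12 / x[:, 0].var()    # slope of x2 on x1, estimates beta_{2.1}
b12 = c12 / x[:, 1].var()    # slope of x1 on x2, estimates beta_{1.2}
print(b21, sigma2*rho/sigma1)
print(b12, sigma1*rho/sigma2)
print(b21*b12, rho**2)
```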

(2) Multivariate normal (MVN). An m-element random vector, \tilde x' = (\tilde x_1, \tilde x_2,...,\tilde x_m), has a MVN distribution if and only if its pdf is

f(x|\theta, \Sigma) = (2\pi)^{-m/2}|\Sigma|^{-1/2}\exp\{-(x - \theta)'\Sigma^{-1}(x - \theta)/2\},    (2.60)

with -\infty < x_i < \infty, i = 1,2,...,m, \theta' = (\theta_1, \theta_2,...,\theta_m), -\infty < \theta_i < \infty, i = 1,2,...,m, and \Sigma = (\sigma_{ij}) an m \times m positive definite symmetric matrix. When m = 1, (2.60) is a univariate normal pdf, and when m = 2 it is a bivariate normal pdf.

If H is an m \times m non-singular matrix such that H'\Sigma^{-1}H = I_m and x - \theta = Hz, then the pdf for z' = (z_1, z_2,...,z_m) is^12

f(z) = (2\pi)^{-m/2}\exp\{-z'z/2\}.    (2.61)

From (2.61), the \tilde z_i's are independent, standardized normal variables and therefore (2.60) and (2.61) integrate to one. In addition, (2.61) implies E\tilde z = H^{-1}E(\tilde x - \theta) = 0, or

E\tilde x = \theta.    (2.62)

Thus, \theta is the mean of the MVN pdf. Also, (2.61) implies that E\tilde z\tilde z' = I_m, since the elements of \tilde z are independent standardized normal rvs. It follows that H^{-1}E(\tilde x - \theta)(\tilde x - \theta)'(H')^{-1} = I_m; that is, E(\tilde x - \theta)(\tilde x - \theta)' = HH', or

V(\tilde x) = E(\tilde x - \theta)(\tilde x - \theta)' = \Sigma,    (2.63)

since from H'\Sigma^{-1}H = I_m, \Sigma = HH'. Thus, \Sigma is the covariance matrix of \tilde x.

To obtain the marginal and conditional pdfs associated with (2.60), let G \equiv \Sigma^{-1} and partition x - \theta and G correspondingly as

x - \theta = \begin{pmatrix} x_1 - \theta_1 \\ x_2 - \theta_2 \end{pmatrix}    and    G = \begin{pmatrix} G_{11} & G_{12} \\ G_{21} & G_{22} \end{pmatrix}.

Then the exponent of (2.60) can be expressed as

(x - \theta)'G(x - \theta) = Q_1 + Q_2,    (2.64)

^12 Note that the Jacobian of the transformation from x - \theta to z is |H| and |\Sigma|^{1/2} = |H| from |H'\Sigma^{-1}H| = |I_m| = 1. Thus, |\Sigma|^{-1/2}|H| = 1.

with

Q_1 = [x_1 - \theta_1 + G_{11}^{-1}G_{12}(x_2 - \theta_2)]'G_{11}[x_1 - \theta_1 + G_{11}^{-1}G_{12}(x_2 - \theta_2)]    (2.65)

and

Q_2 = (x_2 - \theta_2)'(G_{22} - G_{21}G_{11}^{-1}G_{12})(x_2 - \theta_2).    (2.66)

From

|\Sigma|^{-1} = |G| = |G_{11}||G_{22} - G_{21}G_{11}^{-1}G_{12}|,

(2.60) can be expressed as

f(x_1, x_2|\theta, G) = f(x_1|x_2, G, \theta)g(x_2|G, \theta_2),    (2.67)

where

f(x_1|x_2, G, \theta) = (2\pi)^{-m_1/2}|G_{11}|^{1/2}\exp(-Q_1/2)    (2.67a)

and

g(x_2|G, \theta_2) = (2\pi)^{-m_2/2}|G_{22} - G_{21}G_{11}^{-1}G_{12}|^{1/2}\exp(-Q_2/2),    (2.67b)

with m_i the number of elements in x_i, i = 1,2, m_1 + m_2 = m, Q_1 and Q_2 as defined in (2.65) and (2.66), respectively, and G = \Sigma^{-1}.

On integrating (2.67) with respect to x_1, the result is g(x_2|G, \theta_2), the marginal pdf for x_2, where f(x_1|x_2, G, \theta) is the conditional pdf for \tilde x_1, given \tilde x_2. Both (2.67a) and (2.67b) are normal pdfs. The mean and covariance matrices of the marginal pdf in (2.67b) are^13

E\tilde x_2 = \theta_2,
V(\tilde x_2) = (G_{22} - G_{21}G_{11}^{-1}G_{12})^{-1} = \Sigma_{22}.

^13 On partitioning G = \Sigma^{-1} and \Sigma correspondingly as

G = \begin{pmatrix} G_{11} & G_{12} \\ G_{21} & G_{22} \end{pmatrix}    and    \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},

then from results on inverting partitioned matrices, \Sigma_{22} = (G_{22} - G_{21}G_{11}^{-1}G_{12})^{-1}, \Sigma_{12}\Sigma_{22}^{-1} = -G_{11}^{-1}G_{12}, and G_{11}^{-1} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.

The mean and covariance matrices of the conditional pdf in (2.67a) are

E(\tilde x_1|\tilde x_2) = \theta_1 - G_{11}^{-1}G_{12}(x_2 - \theta_2) = \theta_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \theta_2)    (2.68)

and

V(\tilde x_1|\tilde x_2) = G_{11}^{-1} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.    (2.69)

Similar operations can be utilized to derive the marginal pdf for \tilde x_1 and the conditional pdf for \tilde x_2, given \tilde x_1. As in the case of (2.67a) and (2.67b), the marginal and conditional pdfs are MVN pdfs. In addition, just as E(\tilde x_1|\tilde x_2) in (2.68) is linear in x_2, E(\tilde x_2|\tilde x_1) = \theta_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \theta_1) is linear in x_1. Thus, both conditional expectations, or regression functions, are linear in the conditioning variables.
The conditional expectation of \tilde x_1, given \tilde x_2, is called the regression function for \tilde x_1, given \tilde x_2. As (2.68) indicates, in the case of the MVN distribution this regression function is linear and B' \equiv \Sigma_{12}\Sigma_{22}^{-1} is the m_1 \times m_2 matrix of partial regression coefficients. If x_1 has just one element, then the vector of partial regression coefficients is \beta' = \sigma_{12}'\Sigma_{22}^{-1}, where \sigma_{12}' is the 1 \times m_2 vector of covariances of \tilde x_1 and the elements of \tilde x_2, that is, the first row of \Sigma_{12}. With respect to partial regression coefficients, it is instructive to write

(\tilde x_1 - \theta_1)' = (\tilde x_2 - \theta_2)'B + \tilde u',    (2.70)

where \tilde u' is a 1 \times m_1 random vector with E\tilde u|\tilde x_2 = 0 and E(\tilde x_2 - \theta_2)\tilde u' = 0. Then on multiplying (2.70) on the left by x_2 - \theta_2 and taking the expectation of both sides, the result is \Sigma_{21} = \Sigma_{22}B or

B = \Sigma_{22}^{-1}\Sigma_{21}.    (2.71)

Note that (2.71) was obtained from (2.70) without assuming normality but, of course, it also holds in the normal case. Without normality, it is not true in general that both E(\tilde x_1|\tilde x_2 = x_2) and E(\tilde x_2|\tilde x_1 = x_1) are linear in x_2 and x_1, respectively. In cases of non-normality, it may be that one of these conditional expectations is linear, but in general both will not be linear except in special cases, for example the multivariate Student-t distribution discussed below.
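The conditional moment formulas (2.68), (2.69), and (2.71) are easily computed from a partitioned covariance matrix. The sketch below uses a hypothetical \theta and \Sigma, with x_1 taken to be the first element and x_2 the remaining two.

```python
# A numerical sketch of the MVN conditional moments; theta and Sigma hypothetical.
import numpy as np

theta = np.array([1.0, 0.0, 2.0])
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.5],
                  [0.8, 0.5, 1.0]])
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

B = np.linalg.solve(S22, S21)                                        # (2.71)
x2 = np.array([0.5, 2.5])                                            # hypothetical conditioning value
cond_mean = theta[:1] + S12 @ np.linalg.solve(S22, x2 - theta[1:])   # (2.68)
cond_var = S11 - S12 @ np.linalg.solve(S22, S21)                     # (2.69)
print(B.ravel(), cond_mean, cond_var)
```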
If in the MVN distribution the elements of \tilde x are mutually uncorrelated, that is, E(\tilde x_i - \theta_i)(\tilde x_j - \theta_j) = 0 for all i \ne j, then \Sigma in (2.63) is a diagonal matrix, \Sigma = D(\sigma_{ii}), and from (2.60) f(x|\theta, D) = \prod_{i=1}^{m}g(x_i|\theta_i, \sigma_{ii}), where g(x_i|\theta_i, \sigma_{ii}) is a univariate normal pdf with mean \theta_i and variance \sigma_{ii}. Thus, diagonality of \Sigma implies that the elements of \tilde x are independently distributed; and when rvs are independently distributed they are mutually uncorrelated, a result that holds in general. Thus, for the MVN distribution, diagonality of \Sigma implies independence, and independence of the elements of \tilde x implies diagonality of \Sigma. Also, if \Sigma is diagonal, \Sigma_{12} = \Sigma_{21}' = 0 and thus B = \Sigma_{22}^{-1}\Sigma_{21} = 0; that is, with \Sigma diagonal, all partial regression coefficients are equal to zero. Further, with \Sigma diagonal, V(\tilde x_1|\tilde x_2) in (2.69) specializes to V(\tilde x_1|\tilde x_2) = D_{11}(\sigma_{ii}), an m_1 \times m_1 diagonal matrix with typical element \sigma_{ii}, the variance of \tilde x_i.
Among many other results for the MVN distribution, the fact that linear combinations of normal variables are normally distributed is of great importance. That is, let \tilde y = C\tilde x, where C is a given q \times m matrix of rank q and thus \tilde y is a q \times 1 random vector. It can be established that \tilde y has a q-dimensional MVN distribution with mean vector E\tilde y = C\theta and covariance matrix E(\tilde y - E\tilde y)(\tilde y - E\tilde y)' = CE(\tilde x - \theta)(\tilde x - \theta)'C' = C\Sigma C', a q \times q positive definite symmetric matrix.

Another important result is that the random variable \tilde y = (\tilde x - \theta)'\Sigma^{-1}(\tilde x - \theta) has a \chi^2 distribution with m degrees of freedom. Let \tilde x - \theta = H\tilde z, where H and \tilde z are as defined in connection with (2.61). Then \tilde y = \tilde z'\tilde z = \sum_{i=1}^{m}\tilde z_i^2 is a sum of independent, squared, standardized normal rvs and thus has a \chi^2 pdf with m degrees of freedom. A generalization of this result is that if the random m \times m symmetric matrix A = \sum_{j=1}^{\nu}\tilde x_j\tilde x_j', where the \tilde x_j's are mutually independent normal random vectors, each with zero mean vector and common pds covariance matrix \Sigma, then the m(m+1)/2 distinct elements of A have a central Wishart pdf and those of A^{-1} an inverted Wishart pdf; for A a scalar, its pdf is a \chi^2 pdf with \nu degrees of freedom, while its reciprocal has an inverted \chi^2 pdf.

(3) Multivariate Student-t (MVS). An m-element random vector \tilde x' = (\tilde x_1, \tilde x_2,...,\tilde x_m), with -\infty < \tilde x_i < \infty, i = 1,2,...,m, has a MVS distribution if and only if its pdf is

f(x|\theta, V, \nu) = c|V|^{1/2}/[\nu + (x - \theta)'V(x - \theta)]^{(\nu+m)/2},    (2.72)

where c = \nu^{\nu/2}\Gamma[(\nu+m)/2]/\pi^{m/2}\Gamma(\nu/2), \nu > 0, V is an m \times m pds matrix, and \theta' = (\theta_1, \theta_2,...,\theta_m), with -\infty < \theta_i < \infty, i = 1,2,...,m. The pdf in (2.72) is an m-dimensional MVS pdf with \nu degrees of freedom and is denoted MVS_m(\theta, V, \nu). If m = 1, (2.72) is the US-t pdf. Further, if \nu = 1, (2.72) is the multivariate Cauchy pdf and the univariate Cauchy pdf when m = 1. As \nu grows large, the MVS approaches a limiting MVN form with mean \theta and covariance matrix V^{-1}.

From the form of (2.72), it is the case that the pdf is symmetric about \theta, the modal value. A standardized form of (2.72) is obtained by making the change of variable x - \theta = Cz, where C is an m \times m non-singular matrix such that C'VC = I_m. Then the pdf for z is

f(z|\nu) = c/(\nu + z'z)^{(m+\nu)/2},    (2.73)

an m-dimensional, standardized MVS density. It can be shown that (2.72) and (2.73) are normalized pdfs, that is, that they integrate to one. Also, the first two moments of \tilde z are E\tilde z = 0 for \nu > 1 and E\tilde z\tilde z' = I_m\nu/(\nu - 2) for \nu > 2, which imply that

E\tilde x = \theta,    \nu > 1,    (2.74)

and

V(\tilde x) = E(\tilde x - \theta)(\tilde x - \theta)' = V^{-1}\nu/(\nu - 2),    \nu > 2.    (2.75)

The conditions \nu > 1 and \nu > 2 are needed for the existence of the respective moments.
The conditions v > 1 and v > 2 are needed for the existence of moments.
If \tilde x is partitioned, \tilde x' = (\tilde x_1', \tilde x_2'), marginal and conditional pdfs can be obtained by methods similar to those employed in connection with the MVN. Let V in (2.72) be partitioned, V = (V_{ij}), i, j = 1,2, to correspond to the partitioning of \tilde x into \tilde x_1 with m_1 elements and \tilde x_2 with m_2 elements. Then the marginal pdf for \tilde x_2 is in the MVS form, that is, MVS_{m_2}(\theta_2, V_{2.1}, \nu), where \theta_2 is a subvector of \theta' = (\theta_1', \theta_2'), partitioned to correspond to the partitioning of \tilde x, and V_{2.1} = V_{22} - V_{21}V_{11}^{-1}V_{12}. As regards the conditional pdf for \tilde x_1, given x_2, it too is a MVS pdf, namely MVS_{m_1}(\theta_{1.2}, M, \nu'), with \nu' = m_2 + \nu,

\theta_{1.2} = \theta_1 - V_{11}^{-1}V_{12}(x_2 - \theta_2)    (2.76)

and

M = \nu'V_{11}/[\nu + (x_2 - \theta_2)'V_{2.1}(x_2 - \theta_2)].    (2.77)

For \nu' > 1, \theta_{1.2} in (2.76) is the conditional mean of \tilde x_1, given \tilde x_2; note its similarity to the conditional mean for the MVN pdf in (2.68) in that it is linear in x_2. Also, E\tilde x_2|x_1 is linear in x_1. Thus, the MVS pdf with \nu > 1 has all conditional means or regression functions linear. In addition, if V is diagonal, (2.75) indicates that all elements of \tilde x are uncorrelated, given \nu > 2, and from (2.76), \theta_{1.2} = \theta_1 when V is diagonal. From the form of (2.72), it is clear that diagonality of V, or lack of correlation, does not imply independence for the MVS pdf, in contrast to the MVN case. This feature of the MVS pdf can be understood by recognizing that the MVS can be generated as a continuous mixture of normal pdfs. That is, let \tilde x have a MVN pdf with mean vector \theta and covariance matrix \sigma^2V^{-1}, denoted by f_N(\theta, V^{-1}\sigma^2), and consider \sigma to be random with the inverted gamma pdf in (2.33) with h = 1. Then

MVS_m(\theta, V, \nu) = \int_0^{\infty}f_N(\theta, V^{-1}\sigma^2)f_{IG}(\sigma|\nu, 1)d\sigma.

Thus, if V is diagonal, independence does not result because of the common random \sigma in f_N(\theta, V^{-1}\sigma^2).

Three other important features of the MVS pdf are: (1) if \tilde x is MVS, then \tilde y = L\tilde x, where L is a given q \times m matrix of rank q, is MVS; (2) the quadratic form in (2.72) divided by m, that is, (\tilde x - \theta)'V(\tilde x - \theta)/m, has an F_{m,\nu} pdf [see, for example, Zellner (1971, p. 385)]; and (3) the quantity t_i = (\tilde x_i - \theta_i)/\sqrt{v^{ii}} has a univariate Student-t pdf with \nu degrees of freedom, where v^{ii} is the (i, i)th element of V^{-1}.

2.5. Elements of asymptotic theory^14

In previous sections specific pdfs for rvs were described and discussed, the use of which permits one to make probability statements about values of rvs. This section is devoted to a review of some results relevant for situations in which the exact pdfs for rvs are assumed unknown. Use of asymptotic theory provides, among other results, approximate pdfs for rvs under relatively weak assumptions. As Cox and Hinkley (1974, p. 247) state: "The numerical accuracy of such an [asymptotic] approximation always needs consideration...." For example, consider a random sample mean, \bar X_n = \sum_{i=1}^{n}X_i/n. Without completely specifying the pdfs for the X_i's, central limit theorems (CLTs) yield the result that \bar X_n is approximately normally distributed for large finite n given certain assumptions about the X_i. Then it is possible to use this approximate normal distribution to make probability statements regarding possible values of \bar X_n. Similar results are available relating to other functions of the X_i's, for example, S_n^2 = \sum_{i=1}^{n}(X_i - \bar X_n)^2/n, etc. The capability of deducing the "large sample" properties of functions of rvs such as \bar X_n or S_n^2 is very useful in econometrics, especially when exact pdfs for the X_i's are not known and/or when the X_i's have known pdfs but functions of the X_i's have complicated or unknown pdfs.

^14 For simplicity of notation, in this section rvs are denoted by capital letters, e.g. X, Y, Z, etc. See, for example, Cramér (1946), Loève (1963), Rao (1973), Cox and Hinkley (1974), and the references cited in these works for further consideration of topics in asymptotic theory.

The following are some useful inequalities for rvs that are frequently employed in asymptotic theory and elsewhere.

2.5.1. Selected inequalities for random variables

Chebychev's Inequality: Let X be any rv with EX^2 < \infty. Then, with \epsilon an arbitrary constant such that \epsilon > 0,

P(|X| > \epsilon) \le EX^2/\epsilon^2.    (2.78)

As an example of (2.78), let X_n = Z_n - \theta, where Z_n is the random proportion of successes in n independent binomial trials with EZ_n = \theta, the common probability of success on individual trials, and EX_n^2 = E(Z_n - \theta)^2 = \theta(1-\theta)/n, the variance of Z_n. Then P(|Z_n - \theta| > \epsilon) \le E(Z_n - \theta)^2/\epsilon^2 = \theta(1-\theta)/n\epsilon^2, which approaches 0 as n \to \infty. In this sense, the rv Z_n approaches \theta as n grows large, a type of convergence that is defined below as "convergence in probability (i.p.)" and denoted by plim Z_n = \theta or Z_n \to^{i.p.} \theta. Note further that if X_n were known to be N(0, 1), Pr(|X_n| > 1.96) = 0.05, whereas (2.78) yields Pr(|X_n| > 1.96) \le 1/(1.96)^2 = 0.26. Thus, in this case, while (2.78) is true, it does not yield a very strong bound. On the other hand, if X_n has a symmetric pdf with mean zero, finite second moment, and tails sufficiently heavier than the normal pdf's, P(|X_n| > 1.96) exceeds 0.05 and the bound yielded by (2.78) is closer to P(|X_n| > 1.96).
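The binomial-proportion example above can be made concrete: for hypothetical \theta and \epsilon, the following sketch compares the Chebychev bound \theta(1-\theta)/n\epsilon^2 with the exact probability P(|Z_n - \theta| > \epsilon) for several n.

```python
# A sketch comparing the Chebychev bound with the exact binomial tail
# probability; theta and eps are hypothetical.
import numpy as np
from scipy.stats import binom

theta, eps = 0.3, 0.05
for n in (100, 400, 1600):
    bound = theta * (1 - theta) / (n * eps**2)
    k = np.arange(n + 1)
    exact = binom.pmf(k, n, theta)[np.abs(k / n - theta) > eps].sum()
    print(n, round(bound, 4), round(exact, 4))
```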
A useful generalization of (2.78) is

P\{Y > \epsilon\} \le Eg(Y)/g(\epsilon),    (2.79)

where \epsilon > 0 is given, Y is any rv, and g(\cdot) is a non-negative, even function, defined on the range of Y, such that g(\cdot) is non-decreasing and Eg(\cdot) < \infty. If in (2.79) Y = |X| and g(|X|) = X^2 for |X| \ge 0 and zero otherwise, (2.79) reduces to (2.78). If in (2.79) Y = |X| and g(|X|) = |X| for |X| \ge 0 and zero otherwise, (2.79) reduces to P(|X| > \epsilon) \le E|X|/\epsilon. Other choices of g(\cdot), for example g(|X|) = |X|^r, r > 0, produce inequalities involving higher order moments, P(|X| > \epsilon) \le E|X|^r/\epsilon^r, Markov's Inequality, which includes (2.78) as a special case.

Some additional inequalities are:^15

(1) E|X+Y|^r \le c_r(E|X|^r + E|Y|^r), where c_r = 1 for r \le 1 and c_r = 2^{r-1} for r \ge 1. Thus, if the rth absolute moments of X and Y exist and are finite, so is the rth absolute moment of X + Y.
(2) Hölder Inequality: E|XY| \le [E|X|^r]^{1/r}[E|Y|^s]^{1/s}, where r > 1 and 1/r + 1/s = 1.
(3) Minkowski Inequality: If r \ge 1, then [E|X+Y|^r]^{1/r} \le [E|X|^r]^{1/r} + [E|Y|^r]^{1/r}.
(4) Schwarz Inequality: E|XY| \le [E|X|^2E|Y|^2]^{1/2} or [E|XY|]^2 \le E|X|^2E|Y|^2, which is a special case of Hölder's Inequality with r = s = 2.

These inequalities are useful in establishing properties of functions of rvs.

^15 For proofs of these and other inequalities, see Loève (1963, p. 154 ff.).
Now various types of convergence of sequences of rvs will be defined.

2.5.2. Convergence of sequences of random variables

Consider a sequence of rvs (X_n), n = 1,2,.... A specific example of such a sequence is \bar X_1, \bar X_2, \bar X_3,...,\bar X_n,..., where \bar X_1 = X_1, \bar X_2 = \sum_{i=1}^{2}X_i/2, \bar X_3 = \sum_{i=1}^{3}X_i/3,..., \bar X_n = \sum_{i=1}^{n}X_i/n, a sequence of means. Since the members of such

sequences are rvs, the usual mathematical limit of a sequence of non-stochastic quantities does not apply. Thus, it is necessary to define appropriate limits of sequences of rvs. Sequences of rvs can converge, in senses to be defined below, to a non-stochastic constant or to a random variable. The following are the major modes of convergence for sequences of rvs.

1. Weak convergence or convergence in probability (i.p.). If for the sequence (X_n), n = 1,2,...,

\lim_{n\to\infty} P(|X_n - c| > \epsilon) = 0    (2.80)

for every given \epsilon > 0, then the sequence converges weakly or i.p. to the constant c. Alternative ways of writing (2.80) are X_n \to^{i.p.} c, or X_n \to^{p} c, or plim X_n = c, where "plim" represents the particular limit given in (2.80) and is the notation most frequently employed in econometrics.

2. Strong convergence or convergence almost surely (a.s.). If for the sequence (X_n), n = 1,2,...,

P(\lim_{n\to\infty} X_n = c) = 1,    (2.81)

then the sequence converges strongly or a.s. to c, denoted by X_n \to^{a.s.} c. An alternative way of expressing (2.81) is

\lim_{N\to\infty} P\{\sup_{n \ge N}|X_n - c| > \epsilon\} = 0

for every given \epsilon > 0.

3. Convergence in quadratic mean (q.m.). If for the sequence (X_n), n = 1,2,...,

\lim_{n\to\infty} E(X_n - c)^2 = 0,    (2.82)

then the sequence converges in quadratic mean to c, also expressed as X_n \to^{q.m.} c.

A sequence of rvs (X_n), n = 1,2,..., is said to converge to a rv X in the sense of (2.80), (2.81), or (2.82) if and only if the sequence (X_n - X), n = 1,2,..., converges to c = 0 according to (2.80), (2.81), or (2.82). In the case of (2.80) such convergence is denoted by X_n \to^{i.p.} X or plim(X_n - X) = 0; in the case of (2.81) by X_n \to^{a.s.} X; and in the case of (2.82) by X_n \to^{q.m.} X.
Ch. 2." Statistical Theory and Econometrics 113

Rao (1973, p. 110 f.) proves the following relations:

(a) Convergence in q.m. (2.82) implies convergence i.p. (2.80).
(b) Convergence a.s. (2.81) implies convergence i.p. (2.80).
(c) If X_n \to^{q.m.} c in such a way that \sum_{n=1}^{\infty}E(X_n - c)^2 < \infty, then X_n \to^{a.s.} c.

In connection with a sequence of sample means, (\bar X_n), n = 1,2,..., with E\bar X_n = \mu and Var(\bar X_n) = \sigma^2/n, Chebychev's Inequality (2.78) yields P(|\bar X_n - \mu| > \epsilon) \le E(\bar X_n - \mu)^2/\epsilon^2 = \sigma^2/n\epsilon^2. Thus, \lim_{n\to\infty}P(|\bar X_n - \mu| > \epsilon) = 0; that is, plim \bar X_n = \mu or \bar X_n \to^{i.p.} \mu. Further, on applying (2.82), \lim_{n\to\infty}E(\bar X_n - \mu)^2 = \lim_{n\to\infty}\sigma^2/n = 0 and thus \bar X_n \to^{q.m.} \mu.

In the case of a sequence of sample means, and in many other cases, it is valuable to know under what general conditions and in what sense \bar X_n converges to a limiting constant. Laws of large numbers (LLN) provide answers to these questions.

2.5.3. Laws of large numbers (LLN)

In this section several LLN are reviewed. Weak laws of large numbers (WLLN) relate to cases in which sequences of rvs converge weakly or in probability, while strong laws of large numbers (SLLN) relate to cases in which sequences converge strongly, that is, almost surely or with probability 1. In what follows, the sequence of rvs is (X_n), n = 1,2,..., and sequences of averages \bar X_n = \sum_{i=1}^{n}X_i/n, n = 1,2,..., are considered.

Chebychev's WLLN
If EX_i = \mu_i, V(X_i) = E(X_i - \mu_i)^2 = \sigma_i^2, and cov(X_i, X_j) = E(X_i - \mu_i)(X_j - \mu_j) = 0, i \ne j, for all i, j = 1,2,..., then \lim_{n\to\infty}\bar\sigma^2/n = 0, where \bar\sigma^2 = \sum_{i=1}^{n}\sigma_i^2/n, implies that

\bar X_n - \bar\mu_n \to^{i.p.} 0,

or plim(\bar X_n - \bar\mu_n) = 0, with \bar\mu_n = \sum_{i=1}^{n}\mu_i/n.^16

As a special case of this WLLN, if \mu_i = \mu and \sigma_i = \sigma for all i and cov(X_i, X_j) = 0, then \bar X_n \to^{i.p.} \mu or plim \bar X_n = \mu.

^16 Proof is by use of Chebychev's Inequality (2.78) with X = \bar X_n - \bar\mu_n since E(\bar X_n - \bar\mu_n)^2 = \bar\sigma^2/n. Therefore P(|\bar X_n - \bar\mu_n| > \epsilon) \le \bar\sigma^2/n\epsilon^2 and thus \lim_{n\to\infty}P(|\bar X_n - \bar\mu_n| > \epsilon) = 0. For proofs and discussion of this and other LLN see, for example, Rao (1973, p. 111 ff.).

Khintchin's WLLN
If X_1, X_2,... are independent and identically distributed (i.i.d.) rvs and EX_i = \mu < \infty, then \bar X_n \to^{i.p.} \mu (or plim \bar X_n = \mu).

In Khintchin's WLLN, there is no requirement that second moments exist as in Chebychev's WLLN; however, the former WLLN does require that the X_i's be i.i.d. and have a common finite mean. As an example in which Khintchin's WLLN applies but Chebychev's does not, consider the X_i's to be i.i.d., each with a univariate Student-t pdf with degrees of freedom \nu = 2. For \nu = 2, EX_i = \mu < \infty but the second moment does not exist, and thus Chebychev's WLLN cannot be applied but Khintchin's WLLN can. On the other hand, for the X_i's i.i.d. univariate Cauchy, the mean does not exist and thus neither law can be applied. While these exceptions are worth noting, there are many sequences to which these WLLN can be applied. However, special results are needed to handle cases in which the X_i's do not possess moments and/or are correlated.
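The two cases just mentioned are illustrated in the following simulation sketch: running means of i.i.d. Student-t draws with \nu = 2 (mean exists, variance does not) settle down toward the mean, while running means of i.i.d. Cauchy draws do not.

```python
# A simulation sketch of the Student-t(2) and Cauchy cases discussed above.
import numpy as np

rng = np.random.default_rng(7)
t2 = rng.standard_t(2, size=1_000_000)       # mean 0 exists, variance does not
cy = rng.standard_cauchy(size=1_000_000)     # mean does not exist
for n in (10**3, 10**4, 10**5, 10**6):
    print(n, t2[:n].mean(), cy[:n].mean())
```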
Kolmogorov's First SLLN
If X_1, X_2,... is a sequence of independent rvs such that EX_i = \mu_i and V(X_i) = \sigma_i^2, i = 1,2,..., and if \sum_{i=1}^{\infty}\sigma_i^2/i^2 < \infty, then, with \bar\mu_n = \sum_{i=1}^{n}\mu_i/n,

\bar X_n - \bar\mu_n \to^{a.s.} 0,

and the sequence X_1, X_2,... is said to obey the SLLN. Further, if \mu_i = \mu for all i, \bar X_n \to^{a.s.} \mu.

Kolmogorov's Second SLLN
If X_1, X_2,... is a sequence of i.i.d. rvs, then a necessary and sufficient condition that \bar X_n \to^{a.s.} \mu is that EX_i = \mu < \infty for all i.

Kolmogorov's Second SLLN does not require the existence of the second moments of the independent X_i's as in his first law; however, in the second law the X_i must be independently and identically distributed, which is not assumed in his first law. In the first law, if \mu_i = \mu and \sigma_i^2 = \sigma^2 for all i, the X_i's need not be identically distributed and still \bar X_n \to^{a.s.} \mu, since \sigma^2\sum_{i=1}^{\infty}1/i^2 < \infty.

2.5.4. Convergence of sequences of distributions and density functions and central limit theorems (CLTs)

Let (F_n), n = 1,2,..., be a sequence of cumulative distribution functions (cdfs) for the rvs (X_n), n = 1,2,..., respectively. Then (X_n) converges in distribution or in law to a rv X with cdf F if F_n(t) \to F(t) as n \to \infty for every point t such that F(t) is continuous at t. This convergence in distribution or law is denoted by X_n \to^{L} X. The cdf F of the rv X is called the limiting or asymptotic distribution of X_n. Further, if X_n has pdf f_n(x) and f_n(x) \to f(x) as n \to \infty and if f(x) is a pdf, then \int|f_n(x) - f(x)|dx \to 0 as n \to \infty. In addition, if |f_n(x)| < q(x) and \int q(x)dx exists and is finite, this implies that f(x) is a pdf such that \int|f_n(x) - f(x)|dx \to 0 as n \to \infty.
Several additional results that are very useful in practice are:

(1) Helly-Bray Theorem: F_n \to F implies that \int g\,dF_n \to \int g\,dF as n \to \infty for every bounded continuous function g. For example, the Helly-Bray Theorem can be employed to approximate Eg^r = \int g^r dF_n, r = 1,2,..., when g^r satisfies the conditions of the theorem and F's form is known.
(2) With g a continuous function, (a) if X_n \to^{L} X, then g(X_n) \to^{L} g(X), and (b) if X_n \to^{i.p.} X, then g(X_n) \to^{i.p.} g(X). As a special case of (b), if X_n \to^{i.p.} c, a constant, then g(X_n) \to^{i.p.} g(c).
(3) Continuity Theorem: Let c_n(t) be the characteristic function (cf) of X_n. If X_n \to^{L} X, then c_n(t) \to c(t), where c(t) is the cf of X. Also, if c_n(t) \to c(t) and c(t) is continuous at t = 0, then X_n \to^{L} X with the distribution function of X having cf c(t).

By the Continuity Theorem, derivation of the form of c(t) = \lim_{n\to\infty}c_n(t) often permits one to determine the form of the limiting distribution of X_n.
The following convergence results relating to (X_n, Y_n), n = 1,2,..., a sequence of pairs of rvs, are frequently employed:

(a) If |X_n - Y_n| \to^{i.p.} 0 and Y_n \to^{L} Y, then X_n \to^{L} Y; that is, the limiting cdf of X_n exists and is the same as that of Y_n.
(b) X_n \to^{L} X and Y_n \to^{i.p.} 0 implies that X_nY_n \to^{i.p.} 0.
(c) X_n \to^{L} X and Y_n \to^{i.p.} c implies that (i) X_n + Y_n \to^{L} X + c; (ii) X_nY_n \to^{L} cX; and (iii) X_n/Y_n \to^{L} X/c if c \ne 0.
(d) X_n - Y_n \to^{i.p.} 0 and X_n \to^{L} X implies that Y_n \to^{L} X.

The results (a)-(d) can be generalized to apply to cases in which X_n and Y_n are vectors of rvs.

The following lemma relates to the convergence of sequences of random vectors, (X_n^{(1)}, X_n^{(2)},...,X_n^{(k)}), n = 1,2,....

Lemma
If for any real \lambda_1, \lambda_2,...,\lambda_k,

\lambda_1X_n^{(1)} + \lambda_2X_n^{(2)} + \cdots + \lambda_kX_n^{(k)} \to^{L} \lambda_1X^{(1)} + \lambda_2X^{(2)} + \cdots + \lambda_kX^{(k)},

where X^{(1)}, X^{(2)},...,X^{(k)} have a joint cdf F(x_1, x_2,...,x_k), then the limiting joint cdf of the sequence of random vectors exists and is equal to F(x_1, x_2,...,x_k).

Central Limit Theorems (CLTs) establish particular limiting cdfs for sequences of rvs. While only CLTs yielding limiting normal cdfs will be reviewed below, it is the case that non-normal limiting cdfs are sometimes encountered.

Lindeberg-Levy CLT
Let (X_n), n = 1,2,..., be a sequence of i.i.d. rvs such that EX_n = \mu and V(X_n) = \sigma^2 \ne 0 exist. Then the cdf of Y_n = \sqrt{n}(\bar X_n - \mu)/\sigma \to \Phi, where \Phi is the standard normal cdf, \Phi(y) = (2\pi)^{-1/2}\int_{-\infty}^{y}e^{-t^2/2}dt, and \bar X_n = \sum_{i=1}^{n}X_i/n.
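The Lindeberg-Levy CLT can be illustrated by simulation; in the sketch below the X_i are taken, purely for illustration, to be i.i.d. exponential with \mu = \sigma = 1, and the empirical cdf of Y_n is compared with \Phi at a few points.

```python
# A simulation sketch of the Lindeberg-Levy CLT; the exponential choice is
# purely illustrative (mu = sigma = 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
n, reps = 200, 100_000
x = rng.exponential(scale=1.0, size=(reps, n))
y = np.sqrt(n) * (x.mean(axis=1) - 1.0)       # Y_n = sqrt(n)*(Xbar_n - mu)/sigma
for c in (-1.0, 0.0, 1.0, 1.96):
    print(c, (y <= c).mean(), norm.cdf(c))
```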
Liapunov CLT
Let (X_n), n = 1,2,..., be a sequence of independent rvs. Let EX_n = \mu_n, E(X_n - \mu_n)^2 = \sigma_n^2 \ne 0, and E|X_n - \mu_n|^3 = \beta_n exist for each n. Furthermore, let B_n = (\sum_{i=1}^{n}\beta_i)^{1/3} and C_n = (\sum_{i=1}^{n}\sigma_i^2)^{1/2}. Then if \lim(B_n/C_n) = 0 as n \to \infty, the cdf of Y_n = \sum_{i=1}^{n}(X_i - \mu_i)/C_n \to \Phi(y), a normal cdf.

Lindeberg-Feller CLT
Let (X_n), n = 1,2,..., be a sequence of independent rvs and G_n be the cdf of X_n. Further, let EX_n = \mu_n and V(X_n) = \sigma_n^2 \ne 0 exist. Define Y_n = \sum_{i=1}^{n}(X_i - \mu_i)/C_n, where C_n = \sqrt{n}\bar\sigma_n, with \bar\sigma_n^2 = \sum_{i=1}^{n}\sigma_i^2/n. Then the relations

\lim_{n\to\infty}\max_{1\le i\le n}\sigma_i/C_n = 0    and    F_{Y_n} \to \Phi(y)

hold if and only if for every \epsilon > 0,

\lim_{n\to\infty}\frac{1}{C_n^2}\sum_{i=1}^{n}\int_{|x-\mu_i| > \epsilon C_n}(x - \mu_i)^2 dG_i(x) = 0.
Multivariate CLT
Let F_n denote the joint cdf of the k-dimensional random vector (X_n^{(1)}, X_n^{(2)},...,X_n^{(k)}), n = 1,2,..., and F_{\lambda n} the cdf of the linear function \lambda_1X_n^{(1)} + \lambda_2X_n^{(2)} + \cdots + \lambda_kX_n^{(k)}. A necessary and sufficient condition that F_n tend to a k-variate cdf F is that F_{\lambda n} converges to a limit for each vector \lambda.

With F_n, F_{\lambda n}, and F as defined in the Multivariate CLT, if for each vector \lambda, F_{\lambda n} \to F_{\lambda}, the cdf of \lambda_1X^{(1)} + \lambda_2X^{(2)} + \cdots + \lambda_kX^{(k)}, then F_n \to F. As an application, consider the random vector U_n' = (U_{1n}, U_{2n},...,U_{kn}) with EU_n = \mu and V(U_n) = \Sigma, a k \times k matrix. Define \bar U_n' = (\bar U_{1n}, \bar U_{2n},...,\bar U_{kn}), n = 1,2,..., with \bar U_{in} = \sum_{j=1}^{n}U_{ij}/n. Then the asymptotic cdf of \sqrt{n}(\bar U_n - \mu) is that of a random normal vector with zero mean and covariance matrix \Sigma.
For use of these and related theorems in proving the asymptotic normality of
maximum likelihood estimators and posterior distributions, see, for example,
Heyde and Johnstone (1979) and the references in this paper that relate both to cases in which the rvs (X_n) are i.i.d. and to cases in which they are statistically dependent. Finally, for a
description of Edgeworth and other asymptotic expansion approaches for ap-
proximating finite sample distributions and moments of random variables, see,
for example, Kendall and Stuart (1958), Jeffreys (1967, appendix A), Copson
(1965), and Phillips (1977a, 1977b). Such asymptotic expansions and numerical
integration approaches are useful in checking the quality of approximate asymp-
totic results and in obtaining more accurate approximations.

3. Estimation theory

Learning the values of parameters appearing in econometric models is important


for checking the implications of economic theories and for practical uses of
econometric models for prediction and policy-making. Thus, much research in
statistical and econometric theory has been concentrated on developing and
rationalizing procedures for using data to infer or estimate the values of parame-
ters. It is to a review of the major elements of this work on estimation that we
now turn.

3.1. Point estimation

Consider a parameter θ contained in Θ, the parameter space. It is often assumed that there exists a true, unknown value of θ, θ_0. Whether the true value θ_0 exists in nature or just in the mind of an investigator is a philosophical issue that will not be discussed. Given θ and its associated parameter space, assume that a sample of data, denoted by x' = (x_1, x_2,..., x_n), is randomly drawn or generated and that it has probability density function (pdf) p(x|θ). With this conception of how the data are drawn or generated, x is a particular value of a random vector x̃ with pdf p(x|θ) and is referred to as a random sample. The problem of point estimation is how to form a function of the sample observations x, denoted by θ̂(x), that will, in some sense, be a good approximation to or close to the true, unknown value of θ. The function of the data, θ̂(x), whatever it is, is by definition
a point estimate of the true, unknown value of θ. A point estimate is a non-random quantity since it depends just on the given, observed data x and, by definition, the function θ̂(x) does not involve θ. On the other hand, a point estimator is the random quantity θ̂(x̃), where the tilde denotes that x̃ is considered to be random. Also, the random function or estimator, θ̂(x̃), does not depend on θ and its stochastic properties can be determined before observing the value of x̃, namely x. As a slight generalization, we may be interested in estimating g(θ), a single-valued function of θ. A special case is g(θ) = θ. Then ĝ(x) is a point estimate of g(θ), and ĝ(x̃) is a point estimator of g(θ). The problem of point estimation is how to pick the form of ĝ(x) so that ĝ(x) will be close to the true, unknown value of g(θ). Various definitions of "closeness" and criteria for choosing the functional forms of estimates have been put forward, and are discussed below.

3.2. Criteria for point estimation

There are two general types of criteria that are employed in evaluating properties
of point estimates. First, sampling criteria involve properties of the sample space
and relate to sampling or frequency properties of particular or alternative
estimates. The overriding considerations with the use of sampling criteria are
properties of estimates in actual or hypothetical repeated samples. Second,
non-sampling criteria involve judging particular or alternative estimates just on the
basis of their properties relative to the given, actually observed data, x. With
non-sampling criteria, other as yet unobserved samples of data and long-run
frequency properties are considered irrelevant for the estimation of a parameter's
value from the actually observed data. The issue of whether to use sampling or
non-sampling criteria for constructing and evaluating estimates is a crucial one
since these different criteria can lead to different estimates of a parameter's value
from the same set of data. However, it is the case, as will be seen, that some
non-sampling-based procedures yield estimates that have good sampling proper-
ties as well as optimality properties relative to the actually observed data.

3.2.1. Sampling criteria for point estimation

According to sampling theory criteria, a particular point estimate, say θ̂(x), is judged with respect to the properties of θ̂(x̃), the point estimator. For example, the point estimate θ̂(x) = Σ_{i=1}^n x_i/n, the sample mean, is judged by reference to the sampling properties of θ̂(x̃) = Σ_{i=1}^n x̃_i/n, the random sample mean or estimator.

3.2.1.1. "Perfect" estimation criteria. One sampling criterion in point estimation is given by P[θ̂(x̃) = θ] = 1 for whatever the value of θ. Unfortunately, this
criterion of perfection cannot be realized since x̃ can assume many different values and thus it is impossible to estimate the value of θ without error. That is, the probability that the random sampling error ẽ = θ̂(x̃) − θ is equal to zero, is zero.
The fact that estimation error is generally unavoidable has led some to introduce the criterion of mean squared error (MSE), Eẽ² = E[θ̂(x̃) − θ]², and to seek the estimator θ̂(x̃) that minimizes MSE for all possible values of θ. Again, unfortunately, such an estimator does not exist. For example, if we use the "estimator" θ̂ = 5, no matter what the sample data are, we will experience a lower MSE when θ = 5 than that associated with any other estimator. Thus, no one estimator can dominate all others in terms of MSE for all possible values of θ. Some other conditions have to be put on the problem in order to obtain a unique, optimal estimator relative to the MSE criterion.
Another sampling criterion for estimation is the highest degree of concentration about the value of the parameter being estimated. That is, we might seek an estimator θ̂(x̃) such that

P[θ − λ_1 < θ̂(x̃) < θ + λ_2] ≥ P[θ − λ_1 < θ̂_a(x̃) < θ + λ_2]   (3.1)

for all θ, with λ_1 and λ_2 in the interval 0 to λ, and where θ̂_a(x̃) is any other estimator. A necessary condition for (3.1) to hold is that E[θ̂(x̃) − θ]² ≤ E[θ̂_a(x̃) − θ]² for all θ. As mentioned above, it is not possible to satisfy this necessary MSE condition and thus the criterion of highest degree of concentration cannot be realized.
Since the strong sampling theory criteria of error-free, minimal MSE, and
highest degree of concentration cannot be realized, several weaker criteria for
estimators have been put forward. One of these is the criterion of unbiasedness.

3.2.1.2. Unbiasedness.
Definition
An estimator θ̂(x̃) of a parameter θ is unbiased if and only if E[θ̂(x̃)|θ] = θ for all θ ∈ Θ.
Thus, if an estimator is unbiased, its mean is equal to the value of the parameter being estimated.

Example 3.1
As an example of an unbiased estimator, consider the model x̃_i = θ + ε̃_i, with E(x̃_i|θ) = θ for i = 1,2,...,n. Then θ̂(x̃) = Σ_{i=1}^n x̃_i/n is an unbiased estimator since E[θ̂(x̃)|θ] = Σ_{i=1}^n E(x̃_i|θ)/n = θ.

Example 3.2
Consider the multiple regression model, 3~= X13 + ~, where j~'= (Yi, 372..... Yn), X
is an n × k non-stochastic matrix of rank k, 13'= (fll,fl2 ..... 13~), and t / ' =
( / ~ 1 , ~ 2 . . . . , ~n), a vector of unobservable random errors or disturbances. Assume

that E(J~IX13) = X13. Then/~ = ( X ' X ) -1X'~, the "least squares" estimator of 13, is
unbiased since E(/~I X13) = ( X ' X ) - lX'E(yl X13) = ( X X ) - iX'X13 =/3 for all 13.
While unbiasedness is often regarded as a desirable property of estimators, the following qualifications should be noted. First, there are usually many unbiased estimators for a particular parameter. With respect to Example 3.1, the estimator θ̂_w(x̃) = Σ_{i=1}^n w_i x̃_i, with the w_i's given, has mean E[θ̂_w(x̃)|θ] = Σ_{i=1}^n w_i E(x̃_i|θ) = θ Σ_{i=1}^n w_i, and is unbiased for all w_i's satisfying Σ_{i=1}^n w_i = 1. Similarly, with respect to Example 3.2, β̂_c = [(X'X)^{-1}X' + C']ỹ, where C' is a non-stochastic k × n matrix, has mean E(β̂_c|Xβ, C) = β + C'Xβ = β for all C such that C'X = 0. Thus, unbiased estimators are not unique.
Secondly, imposing the condition of unbiasedness can lead to unacceptable results in frequently encountered problems. If we wish to estimate a parameter, such as a squared correlation coefficient, that satisfies 0 ≤ θ < 1, an unbiased estimator θ̂(x̃) must assume negative as well as positive values in order to satisfy E[θ̂(x̃)|θ] = θ for all θ, 0 ≤ θ < 1. Similarly, an unbiased estimator for a variance parameter τ² that is related to two other variances, σ_1² and σ_2², by τ² = σ_1² − σ_2², with σ_1² ≥ σ_2² > 0, has to assume negative as well as positive values in order to be unbiased for all values of τ². Negative estimates of a variance that is known to be non-negative are unsatisfactory.
Third, and perhaps of most general importance, the criterion of unbiasedness does not take account of the dispersion or degree of concentration of estimators. Biased estimators can be more closely concentrated about a parameter's value than are unbiased estimators. In this connection, the criterion of MSE can be expressed in general as:17

MSE(θ̂) = V(θ̂) + (Bias)²,   (3.2)

where V(θ̂) = E(θ̂ − Eθ̂)², the variance of θ̂, and Bias = Eθ̂ − θ is the bias of θ̂. Thus, MSE depends on both dispersion, as measured by V(θ̂), and squared bias, and gives them equal weights. In terms of (3.2), the criterion of unbiasedness gives zero weight to V(θ̂) and unit weight to the squared bias term, which is considered unsatisfactory by many.
Fourth, on considering just unbiased estimators for a parameter, denoted by θ̂(x̃), it is clear from (3.2) that MSE = var[θ̂(x̃)]. While the restriction that an

17Note: MSE(θ̂) = E(θ̂ − θ)² = E[(θ̂ − Eθ̂) + (Eθ̂ − θ)]² = E(θ̂ − Eθ̂)² + [Eθ̂ − θ]², since E(θ̂ − Eθ̂)(Eθ̂ − θ) = 0.

estimator belong to the class of unbiased estimators can be costly in terms of


MSE, such a restriction permits estimators to be ordered in terms of their
variances and suggests seeking the unbiased estimator with minimal variance.
Such an estimator, if it exists, is called a minimum variance unbiased estimator.
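
To make the decomposition in (3.2) concrete, here is a minimal simulation sketch (added for illustration only; the parameter value, sample size, and the deliberately shrunken estimator are assumptions, not from the text) that estimates bias, variance, and MSE for an unbiased and a biased estimator of a normal mean and checks that MSE equals variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, reps = 2.0, 1.0, 20, 200000   # illustrative values

x = rng.normal(theta, sigma, size=(reps, n))
xbar = x.mean(axis=1)            # unbiased estimator of theta
shrunk = 0.8 * xbar              # deliberately biased estimator, for illustration

for name, est in [("sample mean", xbar), ("0.8 * sample mean", shrunk)]:
    bias = est.mean() - theta
    var = est.var()
    mse = np.mean((est - theta) ** 2)
    # Check the decomposition MSE = V(theta_hat) + Bias^2 of eq. (3.2)
    print(f"{name:18s} bias={bias:+.4f}  var={var:.4f}  "
          f"var+bias^2={var + bias**2:.4f}  mse={mse:.4f}")
```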

3.2.1.3. Criterion of minimum variance unbiased estimation. Some fundamental


results on minimum variance unbiased estimators (MVUEs) have been provided by Rao (1973), among others. With reference to the triplet (X, P_θ, θ ∈ Θ), where θ may be vector-valued, consider estimation of g(θ), a real-valued function of θ. Let U_g denote the class of unbiased estimators of g(θ); that is, an estimator ĝ belongs to U_g if and only if E(ĝ|θ) = g(θ) for each θ ∈ Θ. Also, define U_0 as the class of all functions with zero expectation; that is, f ∈ U_0 if and only if E(f|θ) = 0 for each θ. Then the following results, proved in Rao (1973), are available.
Rao Theorem
A necessary and sufficient condition that an estimator ĝ ∈ U_g, that is, E(ĝ|θ) = g(θ), has minimum variance at the value θ = θ_0 is that cov(ĝ, f|θ_0) = 0 for every f ∈ U_0 such that var(f|θ_0) < ∞, provided that var(ĝ|θ_0) < ∞.
From the Rao Theorem it follows that: (a) the correlation between a MVUE and any other estimator is non-negative; (b) if there are two unbiased estimators with the same minimum variance, their correlation is equal to one, that is, ρ(θ_0) = 1, and therefore they are the same except for a set of samples of probability measure zero for θ = θ_0; and (c) if ĝ_1 and ĝ_2 are MVUEs of g_1(θ) and g_2(θ), respectively, then b_1ĝ_1 + b_2ĝ_2 is a MVUE of b_1g_1(θ) + b_2g_2(θ), where b_1 and b_2 are fixed constants. Also, if ĝ has minimum variance for each θ, ĝ is called a uniformly MVUE.
As an example of Rao's Theorem, consider Example 3.1 with the x̃_i's assumed normally and independently distributed, each with mean θ and variance σ². Then f(x̃) ∈ U_0 implies

∫ f(x)(2πσ²)^{-n/2} exp{−Σ_{i=1}^n (x_i − θ)²/2σ²} dx = 0.

Differentiating this last expression with respect to θ yields

(n/σ²) ∫ f(x)(x̄ − θ) p(x|θ, σ²) dx = 0,

where x̄ = Σ_{i=1}^n x_i/n, the sample mean, and this result implies that cov(f, x̄) = 0, the necessary and sufficient condition for x̄ to be a MVUE. By similar analysis, Rao shows that for the multiple regression model in Example 3.2, with ỹ assumed

to be normally distributed with mean Xβ and covariance matrix σ²I_n, β̂ = (X'X)^{-1}X'ỹ and s² = (ỹ − Xβ̂)'(ỹ − Xβ̂)/(n − k) are MVUEs for β and σ², respectively.
The existence of M V U E s is closely linked to the concepts of sufficiency,
sufficient statistics, and completeness, concepts that are important generally.
With respect to sufficiency, for some problems certain aspects of the sample data x provide no information about the value of a parameter. For example, in n independent binomial trials with probability of "success" θ on each trial, the order of the occurrence of successes and failures provides no information about the value of θ. The number of successes, t(x) = Σ_{i=1}^n x_i, where x_i = 1 denotes a success and x_i = 0 a failure, is a statistic that maps many different sequences of observations into a single value, t(x), and thus represents a reduction of the sample data. The basic idea of sufficiency involves a reduction of the dimensionality of the data without loss of sample information.
As regards estimation of a scalar parameter θ with a sample of size n ≥ 2, consider the joint distribution of a set of r functionally independent statistics, p_r(t, t_1, t_2,..., t_{r−1}|θ), r = 2,3,...,n, where t is a statistic of particular interest. Then

p_r(t, t_1, t_2,..., t_{r−1}|θ) = g(t|θ) h_{r−1}(t_1, t_2,..., t_{r−1}|t, θ).   (3.3)

As Kendall and Stuart (1961, pp. 22-23) point out, if h_{r−1}(t_1, t_2,..., t_{r−1}|t, θ) is independent of θ, then t_1, t_2,..., t_{r−1} contribute nothing to our knowledge of the value of θ. Thus, formally, t is sufficient for θ if and only if

p_r(t, t_1, t_2,..., t_{r−1}|θ) = g(t|θ) h_{r−1}(t_1, t_2,..., t_{r−1}|t),   (3.4)

where h_{r−1} is independent of θ for r = 2,3,...,n and any choice of t_1, t_2,..., t_{r−1}. Then t is said to be sufficient for θ. A similar definition applies when θ is a vector of parameters and the t's are vectors. A minimal sufficient statistic is one that achieves the greatest reduction of the sample data without loss of information.18
While (3.4) defines sufficiency in general, it does not indicate how sufficient statistics can be found in specific problems. The Factorization Theorem is helpful in this regard.
Factorization Theorem19
A necessary and sufficient condition for a statistic t = t(x) to be sufficient for a parameter vector θ appearing in f(x|θ), the probability density function for the

18See Kendall and Stuart (1961, pp. 193-194) and Silvey (1970, p. 29) for a discussion of minimal
sufficient statistics.
19Proofs may be found in Kendall and Stuart (1961, p. 23) and Lehmann (1959, p. 47).

sample data x is that f(x|θ) can be expressed as

f(x|θ) = g(t|θ)h(x),   (3.5)

where h(x) does not depend on θ.
Let us apply the Factorization Theorem to Examples 3.1 and 3.2 under the assumption that in both examples the random observations are normally and independently distributed. For Example 3.1, with the x̃_i's normally and independently distributed, each with mean θ and variance σ²,

f(x|θ, σ²) = (2πσ²)^{-n/2} exp{−Σ_{i=1}^n (x_i − θ)²/2σ²}
= (2πσ²)^{-n/2} exp{−[νs² + n(x̄ − θ)²]/2σ²} = g(x̄, s²|θ, σ²),

where nx̄ = Σ_{i=1}^n x_i, νs² = Σ_{i=1}^n (x_i − x̄)², and ν = n − 1.20 Thus, x̄ and s² are sufficient statistics.
In Example 3.2, with ỹ normal with mean Xβ and covariance matrix σ²I_n,

f(y|β, σ²) = (2πσ²)^{-n/2} exp{−(y − Xβ)'(y − Xβ)/2σ²}
= (2πσ²)^{-n/2} exp{−[νs² + (β − β̂)'X'X(β − β̂)]/2σ²} = g(β̂, s²|β, σ²),

where β̂ = (X'X)^{-1}X'y, νs² = (y − Xβ̂)'(y − Xβ̂), and ν = n − k.21 Thus, β̂ and s² are sufficient statistics.
The fundamental Rao-Blackwell Theorem provides a link between sufficiency and MVUEs.
Rao-Blackwell Theorem22
Let t be a sufficient statistic for θ, where both t and θ may be vector-valued, and t_1 any other statistic. If g is any function of θ, then

E[t_1 − g(θ)]² ≥ E[h(t) − g(θ)]²,   (3.6)

20Note that with all summations extending from i = 1 to i = n,

Σ(x_i − θ)² = Σ[x_i − x̄ − (θ − x̄)]² = Σ(x_i − x̄)² + n(x̄ − θ)² = νs² + n(x̄ − θ)²,

since Σ(x_i − x̄) = 0.
21Note:

(y − Xβ)'(y − Xβ) = [y − Xβ̂ − X(β − β̂)]'[y − Xβ̂ − X(β − β̂)]
= (y − Xβ̂)'(y − Xβ̂) + (β − β̂)'X'X(β − β̂),

since X'(y − Xβ̂) = 0.


22See Rao (1973, pp. 320-321) and also Rao (1945) and Blackwell (1947).

where h(t) = E(t_1|t) is independent of θ. Furthermore, Eh(t) = g(θ), that is, h(t) is unbiased if Et_1 = g(θ).
See Rao (1973, p. 321) for a proof of this theorem. As Rao (1973) notes: "Given any statistic [t_1], we can find a function of the sufficient statistic [h(t)] which is uniformly better in the sense of mean square error or minimum variance (if no bias is imposed)" (p. 321). He also notes that if a complete sufficient statistic exists, that is, one such that no function of it has zero expectation unless it is zero almost everywhere with respect to each of the measures P_θ, then every function of it is a uniformly MVUE of its expected value. In view of the Rao-Blackwell Theorem and assuming the existence of complete sufficient statistics, to find a MVUE it is enough to start with any unbiased estimator and take its conditional expectation given the sufficient statistic, that is, E(t_1|t) = h(t).
Since these results depend strongly on the existence of complete statistics, it is relevant to ask which classes of distribution functions possess sufficient statistics for their parameters. The Pitman-Koopman Theorem [Pitman (1936) and Koopman (1936)] provides an answer to this question.

Pitman-Koopman Theorem
For a parameter θ, scalar or vector-valued, if the range of f(x|θ) is independent of θ, a distribution will have a sufficient statistic (or statistics) if and only if it is a member of the exponential class of distributions, that is,

f(x|θ) = exp{A(θ)B(x) + C(x) + D(θ)}   (3.7)

in the scalar case, and

f(x|θ) = exp{Σ_{j=1}^k A_j(θ)B_j(x) + C(x) + D(θ)}   (3.8)

in the case that θ is a k-element vector.


The exponential class includes many distributions that arise in practice, for example binomial, normal, geometric, exponential, and Poisson distributions.23 However, there are many cases in which minimal sufficient statistics are not complete and then it is not possible to use the Rao-Blackwell Theorem to establish the existence of a MVUE. In such cases there may be several different

23For example, the binomial pdf,

f(x|θ, n) = {n!/[x!(n − x)!]} θ^x (1 − θ)^{n−x} = exp{x log[θ/(1 − θ)] + n log(1 − θ) + C(x)}

is in the exponential form (3.7) with B(x) = x, A(θ) = log[θ/(1 − θ)], D(θ) = n log(1 − θ), and C(x) = log{n!/[x!(n − x)!]}.

functions of the minimal sufficient statistic which are unbiased estimators of the parameter, and there is no general means of comparing their variances. Silvey (1970, pp. 34-35) presents the following example to illustrate this problem. Suppose that n independent binomial trials, each with probability θ of success, are carried out; then trials are continued until an additional k successes are obtained, this requiring s additional trials.24 Let the sample be denoted by x = (x_1, x_2,..., x_n, x_{n+1}, x_{n+2},..., x_{n+s−1}, 1), where x_i = 1 for a success and x_i = 0 for a failure. Then f(x|θ) = θ^{r+k}(1 − θ)^{n+s−r−k}, where r = Σ_{i=1}^n x_i and s depends on x also. The statistic t = (r, s) is sufficient for θ and is also a minimal sufficient statistic. However, t is not complete because if

f(t) = r/n − (k − 1)/(s − 1),

then

Ef(t) = E(r/n) − E[(k − 1)/(s − 1)] = θ − θ = 0   for all θ.

However, f(t) is not identically zero, as would be required for a complete sufficient statistic, and thus there are problems in applying the Rao-Blackwell Theorem to this and similar problems.
In problems in which it is difficult or impossible to obtain a MVUE, it is useful to consider the Cramér-Rao Inequality that provides a lower bound for the variance of an unbiased estimator.
Cramér-Rao Inequality
Given (X, p(x|θ), θ ∈ Θ), with Θ an interval on the real line, then subject to certain regularity conditions, the variance of any unbiased estimator ĝ of g(θ) satisfies the following inequality:

var(ĝ) ≥ [g'(θ)]²/I_θ,   (3.9)

where I_θ = E[∂ log p(x|θ)/∂θ]² was interpreted by R. A. Fisher as the amount of information about θ contained in x. If g(θ) = θ and ĝ = θ̂, then (3.9) becomes

var(θ̂) ≥ 1/I_θ.   (3.10)

24This latter part of the process is a negative binomial process.



Proof
Differentiate Eĝ = ∫_X ĝ p(x|θ) dx = g(θ) with respect to θ, assuming that it is permissible to differentiate under the integral, to obtain:

g'(θ) = ∫_X ĝ [∂ log p(x|θ)/∂θ] p(x|θ) dx

or

g'(θ) = ∫_X [ĝ − g(θ)][∂ log p(x|θ)/∂θ] p(x|θ) dx,   (3.11)

since from ∫_X p(x|θ) dx = 1, ∫_X [∂p(x|θ)/∂θ] dx = 0, and E[∂ log p(x|θ)/∂θ] = ∫_X [∂ log p(x|θ)/∂θ] p(x|θ) dx = 0.
On applying the Cauchy-Schwarz integral inequality to (3.11),

[g'(θ)]² ≤ ∫_X [ĝ − g(θ)]² p(x|θ) dx ∫_X [∂ log p(x|θ)/∂θ]² p(x|θ) dx

or

var(ĝ) = ∫_X [ĝ − g(θ)]² p(x|θ) dx
≥ [g'(θ)]² / ∫_X [∂ log p(x|θ)/∂θ]² p(x|θ) dx
= [g'(θ)]²/I_θ.

The following lemma provides an alternative expression for I_θ, the Fisher information measure.

Lemma
I_θ = E[∂ log p(x|θ)/∂θ]² = −E[∂² log p(x|θ)/∂θ²].
Proof
Differentiate ∫_X [∂ log p(x|θ)/∂θ] p(x|θ) dx = 0 with respect to θ to obtain:

∫_X {[∂² log p(x|θ)/∂θ²] p(x|θ) + [∂ log p(x|θ)/∂θ][∂p(x|θ)/∂θ]} dx = 0

or

∫_X {∂² log p(x|θ)/∂θ² + [∂ log p(x|θ)/∂θ]²} p(x|θ) dx = 0

or

E[∂ log p(x|θ)/∂θ]² = −E[∂² log p(x|θ)/∂θ²].


In (3.9), [g'(θ)]²/I_θ is called the minimum variance bound (MVB) for the unbiased estimator ĝ. As Kendall and Stuart (1961, p. 10) point out, the MVB was obtained by application of the Cauchy-Schwarz inequality and thus the necessary and sufficient condition that equality holds in (3.9), that is, that the MVB is attained, is that ĝ − g(θ) is proportional to ∂ log p(x|θ)/∂θ for all sets of observations, that is,

∂ log p(x|θ)/∂θ = A(θ)[ĝ − g(θ)],   (3.12)

where A(θ) may depend on θ but does not depend on x, the observations. From (3.12), var[∂ log p(x|θ)/∂θ] = A²(θ)var(ĝ) and then from the equality form of (3.9),

var(ĝ) = g'(θ)/A(θ),   (3.13)

or in terms of (3.10), var(θ̂) = 1/A(θ).


To illustrate use of the MVB in (3.13), consider

p(x̄|θ, σ_0²) ∝ exp{−n(x̄ − θ)²/2σ_0²},

where σ_0² is a known value for σ² and x̄ = Σ_{i=1}^n x_i/n. Then ∂ log p(x̄|θ, σ_0²)/∂θ = n(x̄ − θ)/σ_0². With ĝ = x̄ and g(θ) = θ, A(θ) = n/σ_0². Thus, since ∂ log p(x̄|θ, σ_0²)/∂θ is proportional to x̄ − θ, with the factor of proportionality A(θ) = n/σ_0², x̄ is the MVB unbiased estimator with var(x̄) = 1/A(θ) = σ_0²/n.
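
The MVB example above can be checked numerically. The following sketch (illustrative settings assumed; not from the text) simulates samples with known σ_0², verifies that the sampling variance of x̄ is close to σ_0²/n, and checks that the mean squared score equals the Fisher information n/σ_0².

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma0, n, reps = 0.5, 2.0, 25, 100000   # illustrative values

x = rng.normal(theta, sigma0, size=(reps, n))
xbar = x.mean(axis=1)

mvb = sigma0**2 / n                        # 1/I_theta = sigma_0^2/n when sigma_0^2 is known
print("simulated var(x_bar):", xbar.var())
print("minimum variance bound:", mvb)

# Fisher information check: the score is d log p/d theta = n(x_bar - theta)/sigma_0^2
score = n * (xbar - theta) / sigma0**2
print("E[score^2] =", np.mean(score**2), " vs  n/sigma_0^2 =", n / sigma0**2)
```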
While MVUEs are ingenious constructs and useful in a variety of situations, as Silvey (1970) points out: "There are many situations where either no MVUE exists or where we cannot establish whether or not such an estimator exists" (p. 43). For example, even in the case of a simple binomial parameter, θ, there exists no unbiased estimator of η = θ/(1 − θ), the odds in favor of success. Also, problems arise in obtaining unbiased estimators of the reciprocals and ratios of means and regression coefficients and coefficients of structural econometric models. For these and other problems there is a need for alternative estimation principles.

3.2.1.4. Least squares (LS) and other goodness of fit criteria. With LS and other goodness of fit criteria, a sample estimate of a parameter θ, scalar or vector-valued, appearing in a model is determined so that a given sample of observations is most closely approximated by the estimated or fitted model, a heuristic, non-sampling criterion. The sampling properties of the estimate θ̂(x) so obtained are then shown to be optimal in certain senses to be reviewed below, and it is the sampling properties that are usually utilized to justify the LS or goodness of fit estimation procedure.
To illustrate LS and other goodness of fit criteria, let x_i = θ + e_i, i = 1,2,...,n, be the model for the actual, given observations x' = (x_1, x_2,..., x_n), where θ is a scalar parameter and the e_i's are unobserved, non-random errors. To estimate θ, the LS principle involves finding the value of θ that minimizes the sum of squared errors, SS = Σ_{i=1}^n e_i² = Σ_{i=1}^n (x_i − θ)². The value of θ that minimizes SS is θ̂ = Σ_{i=1}^n x_i/n, the sample mean.25 To this point the x_i's and e_i's are non-random. Indeed, it is meaningless to attempt to minimize S̃S = Σ_{i=1}^n ẽ_i² = Σ_{i=1}^n (x̃_i − θ)², a random function. However, now that we have the LS estimate θ̂(x), the sample mean, it is possible to explore properties of θ̂(x̃) = Σ_{i=1}^n x̃_i/n, the random sample mean or estimator, given various assumptions about the probability model generating the observations, the x̃_i's. Similarly, consider the multiple regression model, y = Xβ + u, where y is a given n × 1 vector of observations, X is a given n × k non-stochastic matrix of rank k, β is a k × 1 vector of regression parameters with unknown values, and u is an n × 1 vector of unobserved, realized error terms. The sum of squared error terms to be minimized with respect to the value of β is SS = u'u = (y − Xβ)'(y − Xβ). The minimizing, LS value of β is β̂ = (X'X)^{-1}X'y.26 Note that β̂ depends on the given sample data and thus is a non-random estimate. Also, û = y − Xβ̂, the LS residual vector, is an estimate of the unobserved, non-random vector u. To determine the sampling properties of β̂, û, and other quantities, it is necessary to provide stochastic assumptions about the model for the observations, ỹ = Xβ + ũ.27 Before considering this problem, it is relevant to ask: Why minimize the sum of squared errors, and not some other function of the errors?
In general, the sum of squared errors, SS = Σ_{i=1}^n e_i², is employed because, as shown in the previous paragraph, it leads in many cases to simple expressions for

25Note that dSS/dθ = −2Σ_{i=1}^n (x_i − θ) and d²SS/dθ² = 2n > 0. Thus, the value of θ for which dSS/dθ = 0 is θ̂ = Σ_{i=1}^n x_i/n, and this is a minimizing value since d²SS/dθ² > 0 at θ = θ̂.
26Note ∂SS/∂β = −2X'y + 2X'Xβ and ∂²SS/∂β∂β' = 2X'X. The value of β setting ∂SS/∂β = 0 is obtained from −2X'y + 2X'Xβ = 0 or X'Xβ = X'y, the so-called "normal equations", the solution of which is β̂ = (X'X)^{-1}X'y. Since ∂²SS/∂β∂β' is a positive definite symmetric matrix, β̂ is a minimizing value of β.
27While these examples involve models linear in the parameters and error terms, it is also possible to use the LS principle in connection with non-linear models, for example y_i = f(z_i, θ) + u_i, i = 1,2,...,n, where f(z_i, θ) is a known function of a vector of given variables, z_i, and a vector of parameters θ. In this case, the LS principle involves finding the value of θ such that Σ_{i=1}^n [y_i − f(z_i, θ)]² is minimized.

parameter estimates. Further, as will be shown below, in important problems minimizing SS leads to estimates that are identical to maximum likelihood and Bayesian estimates. However, criteria other than minimizing SS are available, for example minimizing SAD = Σ_{i=1}^n |e_i|, the sum of absolute deviations, which leads to minimum absolute deviation (MAD) estimates, or minimizing WSS = Σ_{i=1}^n w_i e_i², the weighted sum of squares, where the w_i's are weights, which leads to weighted least squares (WLS) estimates. Basically, the choice among these alternative criteria for generating estimates depends importantly on what is assumed about the probability model for the observations. Given a probability model for the observations and general principles of estimation, it is possible to obtain a unique estimate that is optimal according to the general principle of estimation adopted.
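
As a small illustration of how the LS and SAD criteria lead to different estimates of the same location parameter, the following sketch (illustrative data and settings assumed; not from the text) minimizes both criteria over a grid and confirms that the LS minimizer is the sample mean while the SAD minimizer is the sample median.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true = 5.0
x = theta_true + rng.standard_t(df=3, size=200)   # heavy-tailed errors, purely illustrative

grid = np.linspace(x.min(), x.max(), 20001)
ss = ((x[None, :] - grid[:, None]) ** 2).sum(axis=1)   # sum of squared errors at each grid value
sad = np.abs(x[None, :] - grid[:, None]).sum(axis=1)   # sum of absolute deviations

print("grid LS  minimizer:", grid[ss.argmin()], "  sample mean  :", x.mean())
print("grid MAD minimizer:", grid[sad.argmin()], "  sample median:", np.median(x))
```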
The Gauss-Markov Theorem involves specifying a probability model for the observations and applying a principle of estimation, that of minimum variance linear unbiased estimation, to obtain an estimator. A version of the Gauss-Markov Theorem is:

Gauss-Markov Theorem
Assume that the random n × 1 observation vector ỹ is generated by the model ỹ = Xβ + ũ, where X is a non-stochastic, known n × k matrix of rank k, β a k × 1 vector of parameters with unknown values, and ũ an n × 1 vector of unobserved random errors. Further, assume Eũ = 0 and Eũũ' = σ²I_n, where σ² is the common unknown variance of the errors. Then, in the class of linear unbiased estimators of l'β, where l is a given k × 1 vector of rank one, the minimum variance linear unbiased estimator28 of l'β is l'β̂, where β̂ = (X'X)^{-1}X'ỹ, the LS estimator.

Proof
Consider β̃ = [(X'X)^{-1}X' + C']ỹ, where C' is an arbitrary k × n matrix. This defines a class of linear (in ỹ) estimators. For l'β̃ to be an unbiased estimator of l'β, C must be such that C'X = 0 since El'β̃ = l'β + l'C'Xβ. With the restriction C'X = 0 imposed,

var(l'β̃) = E l'(β̃ − β)(β̃ − β)'l = l'[(X'X)^{-1} + C'C]l σ²,

since

β̃ − β = [(X'X)^{-1}X' + C']ũ,

28Some use the term "best linear unbiased estimator" (BLUE) rather than "minimum variance linear unbiased estimator" (MVLUE), with "best" referring to the minimal variance property.

on utilizing Eũũ' = σ²I_n and C'X = 0. Thus,

var(l'β̃) = l'[(X'X)^{-1} + C'C]l σ²

attains a minimum for C = 0, which results in l'β̃ = l'β̂, where β̂ = (X'X)^{-1}X'ỹ is the LS estimator with covariance matrix V(β̂) = (X'X)^{-1}σ².
Thus, the Gauss-Markov (GM) Theorem provides a justification for the LS estimator for the regression coefficient vector β under the hypotheses that the regression model ỹ = Xβ + ũ is properly specified, Eỹ = Xβ and V(ỹ) = V(ũ) = σ²I_n, that is, that the errors or observations have a common variance and are uncorrelated. Further, the GM Theorem restricts the class of estimators to be linear and unbiased, restrictions that limit the range of candidate estimators, and involves the use of the MSE criterion that here is equivalent to variance since only unbiased estimators are considered. As will be shown below, dropping the restrictions of linearity and unbiasedness can lead to biased, non-linear estimators with smaller MSE than that of the LS estimator under frequently encountered conditions.
While the GM Theorem is remarkable in providing a justification for the LS
estimator in terms of its properties in repeated (actual or hypothetical) samples, it
does not provide direct justification for the LS estimate that is based on a given
sample of data. Obviously, good performance on average does not always insure
good performance in a single instance.
An expanded version of the GM Theorem, in which the assumption that Eũũ' = σ²I_n is replaced by Eũũ' = Vσ², with V an n × n known positive definite symmetric matrix, shows that l'β̂ is the MVLUE of l'β, where β̂ = (X'V^{-1}X)^{-1}X'V^{-1}ỹ is the generalized least squares (GLS) estimator. On substituting ỹ = Xβ + ũ into β̂, β̂ = β + (X'V^{-1}X)^{-1}X'V^{-1}ũ, and thus Eβ̂ = β and V(β̂) = (X'V^{-1}X)^{-1}σ². Also, the GLS estimate can be given a weighted least squares interpretation by noting that minimizing the weighted SS, (y − Xβ)'V^{-1}(y − Xβ), with respect to β yields the GLS estimate. Various forms of the GM Theorem are available in the literature for cases in which X is not of full column rank and/or there are linear restrictions on the elements of β, that is, Aβ = a, where A is a q × k given matrix of rank q and a is a q × 1 given vector.
The parameter σ² appears in the GM Theorem and in the covariance matrices of the LS and GLS estimators. The GM Theorem provides no guidance with respect to the estimation of σ². Since û = y − Xβ̂, the n × 1 LS residual vector, is an estimate of u, the true unobserved error vector, it seems natural to use the average value of the sum of squared residuals as an estimate of σ², that is, σ̂² = û'û/n. As will be seen, σ̂² is the maximum likelihood estimate of σ² in the regression model with normally distributed errors. However, when σ̂² is viewed as

an estimator, Eσ̂² = σ²(1 − k/n), that is, σ̂² is biased downward.29 The unbiased estimator of σ², s² = û'û/(n − k), is widely used even though it has a larger MSE than that of σ̂²_m = û'û/(n − k + 2), the minimal MSE estimator in the class σ̂²_c = c û'û, where c > 0 is a constant. That is, with ũ assumed N(0, σ²I_n), χ² = û'û/σ² has a chi-squared pdf with ν = n − k degrees of freedom. Then MSE = E(σ̂²_c − σ²)² = c²σ⁴E(χ²)² − 2cσ⁴Eχ² + σ⁴, and the minimizing value of c, c_m, is c_m = Eχ²/E(χ²)² = ν/(ν² + 2ν) = 1/(ν + 2), so that σ̂²_m = û'û/(ν + 2) is the minimum MSE estimator in this class. The MSE of σ̂²_m is MSE(σ̂²_m) = 2σ⁴/(ν + 2), while MSE(s²) = 2σ⁴/ν. Thus, MSE(σ̂²_m)/MSE(s²) = ν/(ν + 2), which for small ν is appreciably below one.30 This is an example illustrating that a biased estimator, σ̂²_m, can have a smaller MSE than an unbiased estimator s².
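
The MSE comparison above is easy to reproduce numerically, using the fact stated in the text that û'û/σ² is chi-squared with ν = n − k degrees of freedom. The following sketch (parameter values are illustrative assumptions) simulates the three estimators and compares their simulated MSEs with the theoretical values 2σ⁴/ν and 2σ⁴/(ν + 2).

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma2, reps = 12, 3, 4.0, 500000   # small regression, illustrative values
nu = n - k

# u'u / sigma^2 ~ chi-square with nu = n - k degrees of freedom
rss = sigma2 * rng.chisquare(nu, size=reps)

estimators = {
    "sig2_hat = rss/n      ": rss / n,          # ML estimator, biased downward
    "s2       = rss/(n-k)  ": rss / nu,         # unbiased estimator
    "sig2_m   = rss/(n-k+2)": rss / (nu + 2),   # minimum MSE in the class c * rss
}
for name, vals in estimators.items():
    print(f"{name} mean={vals.mean():.3f}  MSE={np.mean((vals - sigma2) ** 2):.3f}")

print("theory: MSE(s2) = 2*sigma^4/nu =", 2 * sigma2**2 / nu,
      ";  MSE(sig2_m) = 2*sigma^4/(nu+2) =", 2 * sigma2**2 / (nu + 2))
```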
Above, the LS approach was considered in relation to linear models for which minimization of the SS led to solutions, say β̂ = (X'X)^{-1}X'y or β̂ = (X'V^{-1}X)^{-1}X'V^{-1}y, that do not depend on parameters with unknown values and hence are estimates that can be viewed as estimators. In a number of generally encountered problems, this is not the case. For example, if in the usual regression model ỹ = Xβ + ũ, Eũ = 0, and Eũũ' = σ²V(θ), where V(θ) is an n × n matrix with elements depending on some parameters θ with unknown values, then a problem arises in applying the LS or GLS approach. That is, minimization of u'V^{-1}(θ)u = (y − Xβ)'V^{-1}(θ)(y − Xβ) with respect to β yields β̂ = (X'V^{-1}(θ)X)^{-1}X'V^{-1}(θ)y, which is not an estimate since β̂ depends on θ, a vector of parameters with unknown values. In some problems it is possible to estimate θ from the LS residuals, û = y − Xβ̂, where β̂ = (X'X)^{-1}X'y. Let θ̂ be this estimate. Then a "feasible" or "operational" or approximate GLS estimator, β̂_a, is defined by β̂_a = (X'V^{-1}(θ̂)X)^{-1}X'V^{-1}(θ̂)y. Since β̂_a is just an approximation to β̂, it is not exactly a GLS estimate. Often when n is large, the large sample distributions of β̂ and β̂_a coincide. However, when n is small, the distributions of β̂ and β̂_a are different and further analysis is required to establish the sampling properties of β̂_a. To illustrate, consider the following example:
Example 3.3
Let y' = (y_1' y_2'), X' = (X_1' X_2'), and ũ' = (ũ_1' ũ_2'), and consider the regression model ỹ = Xβ + ũ with β a k × 1 vector, Eũ = 0, Eũ_1ũ_1' = σ_1²I_{n_1}, Eũ_2ũ_2' = σ_2²I_{n_2}, and Eũ_1ũ_2' = 0. For this specification

V(θ) = Eũũ' = diag(σ_1²I_{n_1}, σ_2²I_{n_2}),

29From û = ỹ − Xβ̂ = [I_n − X(X'X)^{-1}X']ũ, Eû'û = Eũ'Mũ = σ²trM = σ²(n − k), where M = I_n − X(X'X)^{-1}X' and trM = n − k. Thus, Eσ̂² = σ²(n − k)/n.
30The MSE of σ̂² = û'û/n is MSE(σ̂²) = σ⁴(2/ν)[(1 + k²/2ν)/(1 + k/ν)²], which is smaller than MSE(s²) = 2σ⁴/ν when k(ν − 2) < 4ν.

with θ' = (σ_1², σ_2²). Then (y − Xβ)'V^{-1}(θ)(y − Xβ) has minimal value for

β̂ = (X'V^{-1}(θ)X)^{-1}X'V^{-1}(θ)y
= [X_1'X_1/σ_1² + X_2'X_2/σ_2²]^{-1}(X_1'y_1/σ_1² + X_2'y_2/σ_2²),

which is clearly a function of σ_1² and σ_2². Estimates of σ_1² and σ_2² are s_i² = (y_i − X_iβ̂_i)'(y_i − X_iβ̂_i)/ν_i, with ν_i = n_i − k and β̂_i = (X_i'X_i)^{-1}X_i'y_i for i = 1,2. Then the approximate GLS estimate is

β̂_a = [X_1'X_1/s_1² + X_2'X_2/s_2²]^{-1}[X_1'y_1/s_1² + X_2'y_2/s_2²].

Further, V(β̂) = [X_1'X_1/σ_1² + X_2'X_2/σ_2²]^{-1}, which is often approximated by V_a(β̂_a) = [X_1'X_1/s_1² + X_2'X_2/s_2²]^{-1}. For large sample sizes, these approximations have been shown to be very good. For small samples, some of the properties of these approximations have been studied in the literature. Also, above, the unbiased estimates s_1² and s_2² have been and usually are inserted for σ_1² and σ_2². Whether other estimates, for example ML or minimum MSE estimates of σ_1² and σ_2², would produce better or worse results in estimating β requires additional analysis.
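
A minimal computational sketch of the two-step procedure in Example 3.3 follows (the design matrices, parameter values, and error variances are illustrative assumptions, not from the text): group-wise LS residuals give s_1² and s_2², which are then inserted into the GLS formula to obtain β̂_a and its approximate covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n2, k = 30, 40, 2
beta = np.array([1.0, 2.0])
sig1, sig2 = 1.0, 3.0                          # error std. devs. in the two groups (assumed)

X1 = np.column_stack([np.ones(n1), rng.normal(size=n1)])
X2 = np.column_stack([np.ones(n2), rng.normal(size=n2)])
y1 = X1 @ beta + sig1 * rng.normal(size=n1)
y2 = X2 @ beta + sig2 * rng.normal(size=n2)

def ols(X, y):
    # Group-wise LS estimate and unbiased variance estimate s_i^2
    b = np.linalg.solve(X.T @ X, X.T @ y)
    s2 = np.sum((y - X @ b) ** 2) / (len(y) - X.shape[1])
    return b, s2

b1, s2_1 = ols(X1, y1)
b2, s2_2 = ols(X2, y2)

# beta_a = [X1'X1/s1^2 + X2'X2/s2^2]^{-1}[X1'y1/s1^2 + X2'y2/s2^2]
A = X1.T @ X1 / s2_1 + X2.T @ X2 / s2_2
c = X1.T @ y1 / s2_1 + X2.T @ y2 / s2_2
beta_a = np.linalg.solve(A, c)
V_a = np.linalg.inv(A)                          # approximate covariance matrix of beta_a

print("feasible GLS estimate:", beta_a)
print("approx. covariance   :", V_a)
```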
The basic problem brought out in the above example is that the GLS approach
can lead to a non-operational result. Then approximations to the non-operational
result are introduced. Fortunately, in many cases these approximations are rather
good in large samples. However, in small samples, the definition of which
depends on various properties of the model as well as the sample size, there is no
assurance that these approximate GLS estimates will have good sampling proper-
ties. Each case must be considered carefully using analytical and/or Monte Carlo
techniques. For examples of such studies, see Rao and Griliches (1969), Fomby
and Guilkey (1978), Taylor (1978), Revankar (1974), Mehta and Swamy (1976),
and Srivastava and Dwivedi (1979). From a practical point of view, it would be
desirable to have estimation principles that yield estimators with good small and
large sample properties.

3.2.1.5. Maximum likelihood estimation. In maximum likelihood (ML) estimation, a basic element is the likelihood function. Let the sample observations be denoted by x and the joint pdf for the observations be p(x|θ), with x ∈ R_x the sample space, and θ ∈ Θ the parameter space. The likelihood function is p(x|θ) viewed as a function of θ defined on the parameter space Θ. To emphasize this point, the likelihood function is denoted by l(θ|x) and is clearly not a pdf for θ. To make this point explicit, the joint pdf of n independent observations from a

normal distribution with mean μ and variance σ² is

p(x|μ, σ²) = (2πσ²)^{-n/2} exp{−Σ_{i=1}^n (x_i − μ)²/2σ²}.

The likelihood function for this problem is

l(μ, σ²|x) = (2πσ²)^{-n/2} exp{−Σ_{i=1}^n (x_i − μ)²/2σ²},   (3.14)

with −∞ < x_i < ∞, i = 1,2,...,n, the sample space, and −∞ < μ < ∞ and 0 < σ² < ∞, the parameter space. In (3.14), the likelihood function, l(μ, σ²|x), is a function of μ and σ² given x.
According to the ML estimation principle, estimates of parameters are obtained by maximizing the likelihood function given the data. That is, the ML estimate is the quantity θ̂(x) ∈ Θ such that l(θ̂|x) = max_{θ∈Θ} l(θ|x). For a very broad range of problems the ML estimate θ̂(x) exists and is unique. In the likelihood approach, l(θ|x) expresses the "plausibility" of various values of θ, and θ̂, the ML estimate, is regarded as the "most plausible" or "most likely" value of θ. This view is a basic non-sampling argument for ML estimation, although the terms "most plausible" or "most likely" cannot be equated with "most probable" since the likelihood function, l(θ|x), is not a pdf for θ. As with LS estimates, ML estimates can be viewed as estimators and their properties studied to determine whether they are good or optimal in some senses.
From the likelihood function in (3.14), log l(μ, σ²|x) = −n log σ − Σ_{i=1}^n (x_i − μ)²/2σ² + constant. The necessary conditions for a maximum are ∂ log l/∂μ = 0 and ∂ log l/∂σ = 0, which yield Σ_{i=1}^n (x_i − μ)/σ² = 0 and −n/σ + Σ_{i=1}^n (x_i − μ)²/σ³ = 0, the solutions of which are μ̂ = Σ_{i=1}^n x_i/n and σ̂² = Σ_{i=1}^n (x_i − μ̂)²/n, and these values can be shown to be global maximizing values and hence are ML estimates. In this particular case, μ̂ is the sample mean which, as mentioned above, is a minimum variance unbiased estimator of μ. With respect to σ̂² = Σ_{i=1}^n (x_i − μ̂)²/n, it was shown above that this estimator is biased. Thus, with ML estimation, there is no assurance that ML estimators will be unbiased. Indeed, the ML estimator of θ = 1/μ is θ̂ = 1/μ̂, which does not possess a mean. However, as emphasized above, the criterion of unbiasedness is subject to important limitations. Further, note that in this problem the ML estimates are functions of the sufficient statistics, μ̂ and Σ_{i=1}^n (x_i − μ̂)². Given the Factorization Theorem described above, it is the case that ML estimates are always functions of minimal sufficient statistics when they exist. This is not to say, however, that the ML estimator necessarily makes the best possible use of the information contained in minimal sufficient statistics.
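
The closed-form ML solutions for the normal mean and variance can be checked by direct numerical maximization of the likelihood (3.14). The sketch below (illustrative data; the use of scipy's general-purpose optimizer and the log-σ parameterization are implementation choices, not from the text) confirms that the numerical maximizer reproduces μ̂ = x̄ and σ̂² = Σ(x_i − μ̂)²/n.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
x = rng.normal(loc=3.0, scale=2.0, size=100)   # illustrative sample

def neg_loglik(params):
    mu, log_sigma = params                     # sigma parameterized on the log scale
    sigma2 = np.exp(2.0 * log_sigma)
    n = x.size
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="BFGS")
mu_hat, sigma2_hat = res.x[0], np.exp(2.0 * res.x[1])

print("numerical ML :", mu_hat, sigma2_hat)
print("closed forms :", x.mean(), np.mean((x - x.mean()) ** 2))
```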

For the usual multiple regression model, ỹ = Xβ + ũ, with ũ assumed normal with mean 0 and covariance matrix σ²I_n, the joint pdf for the observations is p(y|X, β, σ²) = (2πσ²)^{-n/2} exp{−(y − Xβ)'(y − Xβ)/2σ²} and the likelihood function, l(β, σ²|X, y), is p(y|X, β, σ²) viewed as a function of β and σ², that is,

l(β, σ²|X, y) = (2πσ²)^{-n/2} exp{−(y − Xβ)'(y − Xβ)/2σ²}
= (2πσ²)^{-n/2} exp{−[û'û + (β − β̂)'X'X(β − β̂)]/2σ²},

where β̂ = (X'X)^{-1}X'y and û = y − Xβ̂. On maximizing log l(β, σ²|X, y) with respect to β and σ², the ML estimates are β̂ = (X'X)^{-1}X'y and σ̂² = û'û/n. The ML estimate β̂ is just the LS estimate, (X'X)^{-1}X'y, while again the ML estimator for σ², σ̂², is biased. Above, the GM Theorem led to the LS estimate β̂ without a normality assumption. In this connection it is interesting to observe that for all likelihood functions of the form σ^{-n}f(u'u/σ²), where the function f(·) is monotonically decreasing in u'u/σ², minimizing u'u = (y − Xβ)'(y − Xβ) with respect to β produces a ML estimate equal to the LS estimate. Therefore, the normal case f(u'u/σ²) ∝ exp{−u'u/2σ²} is just a special case of a class of likelihood functions for which the ML estimate is identical to the LS estimate.31
In the case where the error vector in the multiple regression model is normal, with Eũ = 0 and Eũũ' = σ²V(θ), the likelihood function is

l(β, σ², θ|y, X) = (2πσ²)^{-n/2}|V(θ)|^{-1/2} exp{−(y − Xβ)'V^{-1}(θ)(y − Xβ)/2σ²}.

The quadratic form in the exponential is minimized for any given value of θ by β̂_θ = [X'V^{-1}(θ)X]^{-1}X'V^{-1}(θ)y, the GLS quantity. Also, the conditional maximizing value of σ², given θ, is σ̂²_θ = (y − Xβ̂_θ)'V^{-1}(θ)(y − Xβ̂_θ)/n. On substituting these conditional maximizing values in the likelihood function, the result is the so-called concentrated log-likelihood function, log l_c(θ|y, X) = constant − (n/2) log σ̂²_θ − ½ log|V(θ)|. By numerical evaluation of this function for various values of θ it is possible to find a maximizing value for θ, say θ̂, which when substituted into β̂_θ and σ̂²_θ provides ML estimates of all of the parameters. This

31On the other hand, if the likelihood function is a monotonically decreasing function of Σ_{i=1}^n |u_i|/σ, then minimizing the sum of the absolute deviations produces ML estimates. Such a likelihood function is encountered when the ũ_i's are identically and independently distributed, each with a double exponential pdf, p(u_i) ∝ σ^{-1} exp{−|u_i|/σ}, −∞ < u_i < ∞. Then the joint pdf of the u_i's is

Π_{i=1}^n p(u_i) ∝ σ^{-n} exp{−Σ_{i=1}^n |u_i|/σ}

and minimizing Σ_{i=1}^n |u_i| maximizes the likelihood function.



procedure is useful for computing ML estimates only when there are a few, say one or two, parameters in θ.
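
The concentrated-likelihood procedure just described can be sketched as follows for a single parameter θ; here θ is taken, purely for illustration, to be the autocorrelation parameter of a stationary AR(1) error covariance V(θ), and all numerical settings are assumptions rather than content from the text.

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta_true, rho_true, sigma = 80, np.array([1.0, 0.5]), 0.6, 1.0

# Simulate a regression with AR(1) errors, an illustrative case of Euu' = sigma^2 V(theta)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = np.zeros(n)
u[0] = sigma * rng.normal()
for t in range(1, n):
    u[t] = rho_true * u[t - 1] + sigma * np.sqrt(1 - rho_true**2) * rng.normal()
y = X @ beta_true + u

def V(theta):
    # AR(1)-type covariance structure, V_ij = theta^{|i-j|}/(1 - theta^2)
    idx = np.arange(n)
    return theta ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - theta**2)

def concentrated_loglik(theta):
    Vi = np.linalg.inv(V(theta))
    b = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)        # GLS quantity beta_theta
    r = y - X @ b
    sig2 = (r @ Vi @ r) / n                                 # conditional ML value of sigma^2
    _, logdet = np.linalg.slogdet(V(theta))
    return -0.5 * n * np.log(sig2) - 0.5 * logdet, b, sig2

grid = np.linspace(-0.95, 0.95, 191)
values = [concentrated_loglik(t)[0] for t in grid]
theta_hat = grid[int(np.argmax(values))]
_, beta_hat, sig2_hat = concentrated_loglik(theta_hat)
print("theta_hat:", theta_hat, " beta_hat:", beta_hat, " sigma2_hat:", sig2_hat)
```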
A more general procedure, the Newton method, for maximizing log-likelihood functions, L ≡ log l(θ|x), where θ is an m × 1 vector of parameters and x a vector of observations, commences with the first-order conditions, ∂L/∂θ = 0. Given an initial estimate, θ^{(0)}, of the solution θ̂ of ∂L/∂θ = 0, expand ∂L/∂θ in a Taylor's Series about θ^{(0)}, that is,

∂L/∂θ = 0 ≅ ∂L/∂θ + [∂²L/∂θ∂θ'](θ − θ^{(0)}),

where all partial derivatives on the right side are evaluated at θ^{(0)}. Then,

θ^{(1)} = θ^{(0)} − [∂²L/∂θ∂θ']^{-1}[∂L/∂θ],   (3.15)

with the derivatives evaluated at θ = θ^{(0)}, is an approximation to the ML estimate, θ̂. By repeating this process, the sequence θ^{(1)}, θ^{(2)},... usually converges to the ML estimate. Other numerical algorithms for solving non-linear optimization problems are described in Goldfeld and Quandt (1972). What is of great importance is that the ML method provides estimates of all parameters of a model in accord with a well-defined criterion and is generally applicable to most econometric estimation problems. Some further properties of ML estimation are summarized in the next paragraph.
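
Equation (3.15) can be illustrated with a short Newton iteration. The binary logit likelihood used below is chosen only because its gradient and Hessian are simple; it is an assumed example, not one taken from the text.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # illustrative design matrix
theta_true = np.array([-0.5, 1.2])
p = 1.0 / (1.0 + np.exp(-X @ theta_true))
y = rng.binomial(1, p)

# Newton iterations theta^(j+1) = theta^(j) - [d2L/dtheta dtheta']^{-1} dL/dtheta, as in (3.15)
theta = np.zeros(2)                                      # initial estimate theta^(0)
for j in range(25):
    mu = 1.0 / (1.0 + np.exp(-(X @ theta)))
    grad = X.T @ (y - mu)                                # dL/dtheta for the logit log-likelihood
    hess = -(X * (mu * (1.0 - mu))[:, None]).T @ X       # d2L/dtheta dtheta'
    step = np.linalg.solve(hess, grad)
    theta = theta - step
    if np.max(np.abs(step)) < 1e-10:
        break

print("ML estimate via Newton iterations:", theta)
```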
First, as pointed out above, ML estimators are not necessarily unbiased in finite samples. However, insistence on the property of unbiasedness in general is not necessarily desirable. The discussion of the criterion of unbiasedness, presented above, and the examples considered are relevant. Second, when an unbiased estimator, θ̂, exists that attains the Cramér-Rao lower bound, it was mentioned above that in such a case ∂ log l(θ|x)/∂θ = A(θ)(θ̂ − θ) and thus the only solution to ∂ log l(θ|x)/∂θ = 0 is θ̂. In this case, the ML estimator is identical to the MVB unbiased estimator. Third, in some cases, but not all, when a ML estimator is unbiased, its variance may be close to the Cramér-Rao lower bound, a property that has to be checked in individual cases. Fourth, ML estimates have an invariance property, namely if θ̂ is the ML estimate of θ and η = g(θ) is a one-to-one transformation, then η̂ = g(θ̂) is the ML estimate of η. This property also applies when θ and η are vectors of parameters. Finally, the most important sampling justification for ML estimators is that they usually have very good properties in large samples. That is, under certain regularity conditions,32 ML estimators are consistent in the sense that the sequence of ML estimators, depending on n, the sample size, {θ̂_{(n)}} converges to θ as n → ∞ in

32These regularity conditions are presented and discussed in Kendall and Stuart (1961) and Heyde and Johnstone (1979).

" i.p. " a.s.


either a weak probability sense, 0(,)~ 0, or a strong probability sense, 0(,)~ 0,
depending upon whether a weak or strong law of large numbers is employed in
the derivation. Further, in large samples (large n), ML estimators are approxi-
mately unbiased and have variances close to the Cramrr-Rao lower bound under
regularity conditions. Also, as n grows large, the ML estimator's distribution is
approximately a normal distribution with mean 0 and variance I o l, where I o is
the Fisher information matrix or Cramrr-Rao lower bound for an unbiased
estimator based on n independent and identically distributed observations. In the
^

case of a vector ML estimator, 0(n), for large n and under regularity conditions, its
approximate distribution is multivariate normal with mean 0 and covariance
17 l, where I 0 is n times the Fisher information matrix for a single observation.
For proofs of these properties that generally assume that observations, vector or
scalar, are independently and identically distributed and impose certain condi-
tions on the higher moments or other features of the observations' common
distribution, see Cramrr (1946, p. 500), Wald (1949), and Anderson (1971). For
dependent observations, for example those generated by time series processes,
additional assumptions are required to establish the large sample properties of
ML estimators; see, for example, Anderson (1971) and Heyde and Johnstone
(1979).
A basic issue with respect to the large sample properties of ML estimators is the determination of what constitutes a "large sample". For particular problems, mathematical analysis and/or Monte Carlo experiments can be performed to shed light on this issue. It must be emphasized that not only is the sample size relevant, but also other features of the models, including parameter values, the properties of independent variables, and the distributional properties of error terms. Sometimes the convergence to large sample properties of ML estimators is rapid, while in other cases it can be quite slow. Also, in "irregular" cases, the above large sample properties of ML estimators may not hold. One such simple case is ỹ_i = θ_i + ε̃_i, i = 1,2,...,n, where the ε̃_i's are NID(0, σ²). The ML estimate of θ_i is θ̂_i = y_i, and it is clear that θ̂_i does not converge to θ_i as n grows since there is just one observation for each θ_i. The irregular aspect of this problem is that the number of parameters grows with the sample size, a so-called "incidental parameter" problem. Incidental parameters also appear in the functional form of the errors-in-the-variables model and affect asymptotic properties of ML estimators; see, for example, Neyman and Scott (1948) and Kendall and Stuart (1961). Thus, such "irregular" cases, and also others in which the ranges of observations depend on parameters with unknown values or observations are dependent and generated by non-stationary time series processes, have to be analyzed very carefully since regular large sample ML properties, including consistency, normality, and efficiency, may not hold.

3.2.1.6. Admissibility criterion. The admissibility criterion is a sampling crite-


rion for alternative estimators that involves separating estimators into two classes,

namely those that are admissible and those that are inadmissible with respect to
estimators' risk properties relative to given loss functions. In this approach,
inadmissible estimators are regarded as unacceptable and attention is con-
centrated on the class of admissible estimators. Since this class usually contains
many estimators, additional criteria are required to choose a preferred estimator
from the class of admissible estimators.
The basic elements in applying the admissibility criterion are (1) loss functions, (2) risk functions, and (3) comparisons of risk functions associated with alternative estimators. Consider a scalar parameter θ, and θ̂ an estimator for θ. Some examples of loss functions are given below, where the c's are given positive constants:

(1) Quadratic (or squared error): L(θ, θ̂) = c_1(θ − θ̂)².
(2) Absolute error: L(θ, θ̂) = c_2|θ − θ̂|.
(3) Relative squared error: L(θ, θ̂) = c_3(θ − θ̂)²/θ².
(4) Generalized quadratic: L(θ, θ̂) = h(θ)(θ − θ̂)².
(5) Exponential: L(θ, θ̂) = c_5[1 − exp{−c_6(θ − θ̂)²}].

These are but a few of many possible loss functions that can be employed. Note that they all are monotonically increasing functions of the absolute error of estimation, |e| = |θ − θ̂|. The first three loss functions are unbounded while the fifth is an example of a bounded loss function that attains a maximal value of c_5 as |θ − θ̂| → ∞. The relative squared error loss function (3) is a special case of the generalized loss function (4) with h(θ) = c_3/θ². Note too that, as is customary, these loss functions have been scaled so that minimal loss equals zero when θ̂ − θ = 0. Also, negative loss can be interpreted as utility, that is, U(θ, θ̂) ≡ −L(θ, θ̂).
In the case of a vector of parameters, θ, and a vector estimator, θ̂, a quadratic loss function is given by L(θ, θ̂) = (θ − θ̂)'Q(θ − θ̂), where Q is a given pds matrix. A generalized quadratic loss function is L(θ, θ̂) = h(θ)(θ − θ̂)'Q(θ − θ̂), where h(θ) is a given function of θ. One example is h(θ) = 1/(θ'θ)^m, where m is a given non-negative constant.
In a particular estimation problem, the choice of an appropriate loss function is important. Sometimes subject-matter considerations point to a particular form for a loss (or utility) function. The widespread use of quadratic loss functions can perhaps be rationalized by noting that a Taylor's Series expansion of any loss function, L(e), about e = θ̂ − θ = 0, such that L(0) = 0 and L'(0) = 0, yields L(e) ≅ L''(0)e²/2, an approximate quadratic loss function. This, it must be emphasized, is a local approximation which may not be very good for asymmetric loss functions and/or bounded loss functions [see, for example, Zellner and Geisel (1968)].
Given that a loss function, L(θ, θ̂), has been selected, the next step in applying the admissibility criterion is to evaluate the risk function, denoted by r_θ̂(θ) and

defined by

r~(O) = fRxL(/9, O)p(xiO ) dx, (3.16)

where p(xlO ) is the pdf for the observations given 0. It is seen that the risk
function in (3.16) is defined for a particular estimator, 19, and a particular loss
function, L(0, 0). While it would be desirable to choose an estimator 0 so as to
minimize r~(O) for all values of 0, unfortunately this is impossible. For example,
an "estimator" 0 = 5 will have lower risk when 0 = 5 than any other estimator and
thus no one estimator can minimize risk for all possible values of 0. In view of
this fact, all that is possible at this point is to compare the risk functions of
alternative estimators, say 0 l, 02 ..... with risk functions r~,(O), rG2(O) .... , relative
to a given loss function. From such a comparison, it may" be thatr~,(0) _< rd2(O)
for all 0 with the inequality being strict for some 0. In such a case, 02 is said to be
dominated by 0 I, and 02 is termed an inadmissible estimator. That 01 dominates 02
does not necessarily imply that 01 is itself admissible. To be admissible, an
estimator, say 0 l, must have a risk function r~,(O) such that rs~(O) < r~a(0) for all
0, where r~a(O) is the risk function associated with any other estimator 0a. Work
on proof of admissibility of estimators is given in Brown (1966).
A leading example of the inadmissibility of a maximum likelihood and least
squares estimator has been given by Stein (1956) and James and Stein (1961).
Let fii=Oi+gi, i = 1 , 2 ..... n, with the gi's NID(0, o 2) and the Oi's the means
of the yi's, - o e < 0i < o e . Further, let the loss function be L ( 0 , / 9 ) =
( 0 - / 9 ) ' ( 0 - /9), a quadratic loss function. The likelihood function is
p(ylO) = (2~ro2)-"/2exp(-( y - 0)'( y - 0 ) / 2 o 2 ) , where y ' = (Yl, Y 2 , ' " ,Yn) and
0 ' = (01, 02,a..,0n). Then the ME estimator is/9o = h, with risk function rro(O) =
E(O o - 0 ) ' ( 0 o - 0 ) = no 2. When 02 has a known value, say o 2 = 1, James and
Stein (1961) put forward the following estimator for 0 when n > 3,

/91 = [1 ~ (n - 2)/y' y] y, (3.17)

that has uniformly lower risk than the ML (and LS) estimator, /9o = Y; that is,
rg, < rg° or E(/9~- 0)'(/9! - 0) < E(/90 - 0)'(/90 - 0) for 0 < 0'0 < m [see James and
Stein (1961) for a proof]. As James and Stein show, use of/91 in (3.17) rather than
the ML estimator results in a large reduction in risk, particularly in the vicinity of
0 = 0. They also develop an estimator similar to (3.17) for the case of o 2 unknown
and show that it dominates the ML estimator uniformly. For details, see James
and Stein (1961), Zellner and Vandaele (1975), and the references in the latter
paper. Also, as shown in Stein (1960), Sclove (1968), and Zellner and Vandaele
(1975), Stein's result on the inadmissibility of the M L (and LS) mean estimator
carries over to apply to regression estimation problems when the regression
coefficients number three or more and an unbounded quadratic loss function is

utilized. It is also the case that for certain problems, say estimating the reciprocal
of a population mean, the ratio of regression coefficients, and coefficients of
simultaneous equation models, ML and other estimators' moments usually or
often do not exist, implying that such estimators are inadmissible relative to
quadratic and many other unbounded loss functions [see Zellner (1978) and
Zellner and Park (1979)]. While use of bounded loss functions will result in
bounded risk for these estimators, see Zaman (1981), it is not clear that the ML
and other estimators for these problems are admissible.
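
The risk dominance of (3.17) is easy to see by simulation. The following sketch (dimension, replication count, and the placement of θ are illustrative assumptions, not from the text) compares empirical risks of θ̂_0 = y and the James-Stein estimator θ̂_1 under quadratic loss with σ² = 1.

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps = 10, 100000                 # n >= 3 is required for (3.17); values are illustrative

for scale in (0.0, 1.0, 3.0):        # ||theta|| small, moderate, large
    theta = scale * np.ones(n) / np.sqrt(n)
    y = theta + rng.normal(size=(reps, n))               # sigma^2 = 1
    ml_risk = np.mean(np.sum((y - theta) ** 2, axis=1))  # risk of theta_hat_0 = y
    shrink = 1.0 - (n - 2) / np.sum(y ** 2, axis=1)      # factor [1 - (n-2)/y'y] of (3.17)
    js = shrink[:, None] * y
    js_risk = np.mean(np.sum((js - theta) ** 2, axis=1))
    print(f"||theta||={scale:.1f}  risk(ML)={ml_risk:.2f}  risk(James-Stein)={js_risk:.2f}")
```

The risk reduction is largest near θ = 0 and shrinks as ||θ|| grows, in line with the discussion above.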
Another broad class of inadmissible estimators consists of those that are discontinuous functions of the sample data, for example certain "pre-test" estimators. That is, define an estimator by θ̂ = θ̂_1 if t > a and θ̂ = θ̂_2 if t ≤ a, where t is a test statistic. If Pr(t > a) = w, then the risk of this estimator relative to quadratic loss is r_θ̂(θ) = wE(θ̂_1 − θ)² + (1 − w)E(θ̂_2 − θ)². As an alternative estimator, consider θ̂_3 = wθ̂_1 + (1 − w)θ̂_2 with risk function

r_θ̂_3(θ) = E(θ̂_3 − θ)² = E[w(θ̂_1 − θ) + (1 − w)(θ̂_2 − θ)]²
= w²E(θ̂_1 − θ)² + (1 − w)²E(θ̂_2 − θ)² + 2w(1 − w)E(θ̂_1 − θ)(θ̂_2 − θ).

Then r_θ̂(θ) − r_θ̂_3(θ) = w(1 − w)E[(θ̂_1 − θ) − (θ̂_2 − θ)]² ≥ 0, and thus the discontinuous estimator θ̂ is inadmissible. For further properties of "preliminary-test" estimators, see Judge and Bock (1978).
Since the class of admissible estimators relative to a specific loss function contains many estimators, further conditions have to be provided in order to choose among them. As seen above, the conditions of the Gauss-Markov Theorem limit the choice to linear and unbiased estimators and thus rule out, for example, the non-linear, biased James-Stein estimator in (3.17) and many others. The limitation to linear and unbiased estimators is not only arbitrary but can lead to poor results in practice [see, for example, Efron and Morris (1975)].
Another criterion for choosing among admissible estimators is the Wald minimax criterion; that is, choose the estimator that minimizes the maximum expected loss. Formally, find θ̂ such that max_θ r_θ̂(θ) ≤ max_θ r_θ̂_a(θ), where θ̂_a is any other estimator. While this rule provides a unique solution in many problems, its very conservative nature has been criticized; see, for example, Ferguson (1967, p. 58) and Silvey (1970, p. 165). A much less conservative rule is to choose the estimator, when it exists, that minimizes the minimum risk or, equivalently, maximizes the maximum utility. While these rules may have some uses in particular cases, in many others they lead to solutions that are not entirely satisfactory.
To illustrate the use of risk functions, consider Figure 3.1 in which the risk functions associated with three estimators, θ̂_1, θ̂_2, and θ̂_3, have been plotted. As
140 A. Zellner

risk

r~3 (0)

r°lll I
d

Figure 3.1

drawn, 0, and 02 clearly dominate 03 since rd3 lies everywhere above the other two
risk functions. Thus, 03 is inadmissible. In choosing between 01 and 02, it is clearly
important to know whether 0's value is to the right or left of the point of
intersection, 8 = d. Without this information, choice between 01 and 02 is difficult,
if not impossible. Further, unless admissibility is proved, there is no assurance that either $\hat\theta_1$ or $\hat\theta_2$ is admissible. There may be some other estimator, say $\hat\theta_4$, that dominates both $\hat\theta_1$ and $\hat\theta_2$. Given these conditions, there is uncertainty about the choice between $\hat\theta_1$ and $\hat\theta_2$ and, as stated above, without a proof of admissibility there is no assurance that either is admissible. For a practical illustration of these problems in the context of estimating the parameter $\rho$ in a stationary, normal, first-order autoregressive process, $y_t = \rho y_{t-1} + \varepsilon_t$, see Thornber (1967). He provides estimated risk functions for ML and several other estimators for $\rho$. These risk functions cross and thus no one estimator uniformly dominates the others. The shapes of the estimated risk functions are also of interest. See also Fomby and Guilkey (1978).
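In the spirit of such calculations (though not a reproduction of Thornber's computations), the following sketch estimates by simulation the quadratic-loss risk of the conditional least squares/ML estimator of $\rho$ and of a simple shrunken alternative over a grid of $\rho$ values; the sample size, number of replications, and shrinkage factor are hypothetical choices. The two estimated risk functions typically cross, so that neither estimator uniformly dominates.

```python
# Illustrative sketch: Monte Carlo estimates of quadratic-loss risk for the
# conditional LS/ML estimator of rho in y_t = rho*y_{t-1} + e_t and for a
# simple shrunken alternative, evaluated over a grid of rho values.
import numpy as np

rng = np.random.default_rng(1)
T, reps = 30, 20_000
shrink = 0.8          # hypothetical shrinkage factor for the alternative estimator

def ls_estimates(rho, T, reps):
    e = rng.normal(size=(reps, T))
    y = np.empty((reps, T))
    y[:, 0] = e[:, 0] / np.sqrt(1 - rho**2)        # stationary starting values
    for t in range(1, T):
        y[:, t] = rho * y[:, t-1] + e[:, t]
    num = (y[:, 1:] * y[:, :-1]).sum(axis=1)
    den = (y[:, :-1]**2).sum(axis=1)
    return num / den                               # conditional LS/ML estimates of rho

for rho in (0.0, 0.3, 0.6, 0.9):
    ls = ls_estimates(rho, T, reps)
    alt = shrink * ls
    print(f"rho={rho:3.1f}  est. risk(LS)={np.mean((ls - rho)**2):.4f}"
          f"  est. risk(shrunk)={np.mean((alt - rho)**2):.4f}")
```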
In summary, the criterion of admissibility, a sampling criterion, provides a
basis for ruling out some estimators. Indeed, according to this criterion, Stein's
results indicate that many ML and LS estimators are inadmissible relative to
quadratic loss. In other cases in which estimators do not possess finite moments,
they are inadmissible relative to quadratic and other loss functions that require
estimators' moments to be finite in order for risk to be finite. Even if just
bounded loss functions are considered, there is no assurance that ML and LS
estimators are admissible relative to them without explicit proofs that they do
indeed possess this property. As regards admissible estimators, they are not in
general unique so that the problem of choice among them remains difficult. If
information is available about the range of "plausible" or "reasonable" values of
parameters, a choice among alternative admissible estimators can sometimes be
made. In terms of Figure 3.1, if it is known that $\hat\theta_1$ and $\hat\theta_2$ are admissible estimators and if it is known that $\theta > d$, then $\hat\theta_2$ would be preferred to $\hat\theta_1$. Below, in the Bayesian approach, it is shown how such information can be employed in obtaining estimators.

3.2.1.7. Bayesian approach. In the Bayesian approach to estimation, both observations and parameters are considered random. Let $p(x, \theta)$ be the joint pdf for an observation vector $\tilde x \in R_x$ and $\theta \in \Theta$. Then, according to the usual rules for analyzing joint pdfs, the joint pdf can be expressed as

$$ p(x, \theta) = p(x|\theta)p(\theta) = p(\theta|x)p(x), \qquad (3.18) $$

where the functions $p(\cdot)$ are labelled by their arguments. From (3.18), $p(\theta|x) = p(\theta)p(x|\theta)/p(x)$ or

$$ p(\theta|x) \propto p(\theta)p(x|\theta), \qquad (3.19) $$

where the factor of proportionality in (3.19) is the reciprocal of $\int_\Theta p(\theta)p(x|\theta)\,\mathrm{d}\theta = p(x)$. The result in (3.19) is Bayes' Theorem with $p(\theta|x)$ the posterior pdf for $\theta$, $p(\theta)$ the prior pdf for $\theta$, and $p(x|\theta)$ the likelihood function. Thus, (3.19) can be expressed as

posterior pdf $\propto$ (prior pdf) $\times$ (likelihood function).

As an example of the application of (3.19), consider $n$ independent binomial trials with likelihood function

$$ p(r|n, \theta) = \binom{n}{r}\theta^{r}(1 - \theta)^{n-r}, $$

with $0 < \theta < 1$. As prior pdf for $\theta$, assume that it is given by $p(\theta|a, b) = \theta^{a-1}(1 - \theta)^{b-1}/B(a, b)$, a beta pdf with $a, b > 0$ having given values so as to represent the available information regarding possible values of $\theta$. Then the posterior pdf for $\theta$
is given by

$$ p(\theta|D) \propto p(\theta|a, b)p(r|n, \theta) \propto \theta^{r+a-1}(1 - \theta)^{n-r+b-1}, \qquad (3.20) $$

where $D$ denotes the prior and sample information and the factor of proportionality, the normalizing constant, is $1/B(r + a, n - r + b)$. It is seen that the posterior pdf in (3.20) is a beta pdf with parameters $r + a$ and $n - r + b$. The sample information enters the posterior pdf through the likelihood function, while the prior information is introduced via the prior pdf. Note that the complete posterior pdf for $\theta$ is available. It can be employed to make probability statements about $\theta$, e.g. $\Pr(c_1 < \theta < c_2|D) = \int_{c_1}^{c_2}p(\theta|D)\,\mathrm{d}\theta$. Also, the mean and other moments of the posterior pdf are easily evaluated from properties of the beta distribution. Thus, the prior pdf, $p(\theta|a, b)$, has been transformed into a posterior pdf, $p(\theta|D)$, that incorporates both sample and prior information.
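A minimal numerical sketch of this updating, with hypothetical values of $a$, $b$, $n$, and $r$, is given below; since the posterior is Beta$(r+a,\, n-r+b)$, posterior probabilities and moments follow directly from standard beta-distribution results.

```python
# Sketch of the beta-binomial updating in (3.19)-(3.20) with assumed values
# of the prior parameters and of the sample outcome.
from scipy import stats

a, b = 2.0, 2.0          # prior parameters (assumed values)
n, r = 20, 7             # sample size and number of successes (assumed data)

posterior = stats.beta(r + a, n - r + b)     # posterior is Beta(r+a, n-r+b)
print("posterior mean           :", posterior.mean())
print("posterior std. deviation :", posterior.std())
print("Pr(0.2 < theta < 0.5 | D):", posterior.cdf(0.5) - posterior.cdf(0.2))
```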
As mentioned in Section 2.2, the added element in the Bayesian approach is the prior pdf, $p(\theta)$, in (3.19), or $p(\theta|a, b)$ in (3.20). Given a prior pdf, standard mathematical operations yield the posterior pdf as in (3.19). Explicit posterior distributions for parameters of many models encountered in econometrics have been derived and applied in the literature; see, for example, Jeffreys (1967), Lindley (1965), DeGroot (1970), Box and Tiao (1973), Leamer (1978), and Zellner (1971). Further, from (3.19), the marginal pdf for a single element or a subset of the elements of $\theta$ can be obtained by integration. That is, if $\theta' = (\theta_1'\,\theta_2')$, the marginal posterior pdf for $\theta_1$ is given by

$$ p(\theta_1|D) = \int p(\theta_1, \theta_2|D)\,\mathrm{d}\theta_2 = \int p(\theta_1|\theta_2, D)p(\theta_2|D)\,\mathrm{d}\theta_2, \qquad (3.21) $$

where, in the second expression, the integration over the elements of $\theta_2$ can be interpreted as an averaging of the conditional posterior pdf for $\theta_1$ given $\theta_2$, $p(\theta_1|\theta_2, D)$, with the marginal posterior pdf for $\theta_2$, $p(\theta_2|D)$, serving as the weight function. This integration with respect to the elements of $\theta_2$ is a way of getting rid of parameters that are not of special interest to an investigator, the so-called nuisance parameters. In addition, the conditional posterior pdf, $p(\theta_1|\theta_2, D)$, can be employed to determine how sensitive inferences about $\theta_1$ are to what is assumed about the value of $\theta_2$; that is, $p(\theta_1|\theta_2, D)$ can be computed for various values of $\theta_2$; see, for example, Box and Tiao (1973) and Zellner (1971) for examples of such sensitivity analyses. Finally, as will be explained below, given a loss function point estimates can be obtained.
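The integration in (3.21) can often be carried out by simulation rather than analytically. The following sketch, for the normal model with both mean and variance unknown and the diffuse prior $p(\mu, \sigma) \propto 1/\sigma$ (an assumed setting; the data are simulated), draws $\sigma^2$ from its marginal posterior and then $\mu$ from its conditional posterior given $\sigma^2$, so that the retained draws of $\mu$ represent the marginal posterior $p(\mu|D)$ with the nuisance parameter $\sigma^2$ integrated out.

```python
# Sketch of (3.21) by Monte Carlo composition: sigma^2 is drawn from its
# marginal posterior (nu*s^2 / chi^2_nu) and mu from its conditional posterior
# N(ybar, sigma^2/n); the mu draws then represent p(mu | D).
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(5.0, 2.0, size=25)            # hypothetical data
n, ybar = y.size, y.mean()
nu, s2 = n - 1, y.var(ddof=1)

ndraw = 100_000
sigma2 = nu * s2 / rng.chisquare(nu, ndraw)  # marginal posterior draws of sigma^2
mu = rng.normal(ybar, np.sqrt(sigma2 / n))   # conditional posterior draws of mu
print("posterior mean of mu :", mu.mean())
print("posterior s.d. of mu :", mu.std())
print("95% interval         :", np.percentile(mu, [2.5, 97.5]))
```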
The prior pdf, $p(\theta)$, in (3.19) is formulated to reflect an investigator's prior information, that is, information available about possible values of $\theta$ before observing a current sample. The information represented in a prior distribution may be past sample information and/or non-sample information derived perhaps from economic theory or other sources. The problem of representing such information accurately and adequately is not an easy one even though considerable effort has been devoted to the problem of assessing or determining the forms of prior pdfs [see, for example, Winkler (1980), Kadane et al. (1980), and Zellner (1972, 1980)]. In some cases, particularly when the sample size is moderately large, the properties of posterior pdfs are not very sensitive to minor alterations in the forms of prior pdfs. In terms of the binomial example above, when $n$ and $r$ are moderately large, altering slightly the values of the prior parameters $a$ and $b$ does not change features of the posterior pdf very much.
As regards the often-mentioned issue that different investigators may have
different prior pdfs and thus will obtain different posterior distributions from the
same likelihood function, this is hardly surprising since they have different initial
information. On pooling their initial information, they will obtain similar in-
ferences. Or if it is a matter of comparing the compatibility of prior information
with sample information, as explained below predictive and posterior odds
techniques can be employed. Given that researchers tend to be individualistic in
their thinking, it is not surprising that initial views differ. Generally, the informa-
tion in data, as reflected in the fikelihood, will modify prior views and dominate
as the sample size grows large. In fact, for any non-degenerate prior, as the
sample size grows, the posterior pdf in (3.19) assumes a normal shape centered at
the ML estimate with posterior covariance matrix approximately equal to the
inverse of the Fisher information matrix evaluated at the ML estimate; see, for
example, Jeffreys (1967, p. 193ff.) for details. Jeffreys (1967, p. 194) regards this
as a justification of ML estimation in large samples, a non-sampling argument.
Thus, in large samples, the information in the sample dominates the posterior pdf
in the sense that the prior pdf's influence on the shape of the posterior pdf
becomes negligible.
In some cases there may be little or practically no prior information available about the possible values of parameters, as in the early stages of an investigation. In such cases Bayesians employ so-called "non-informative" or "diffuse" prior pdfs. For some work on the formulation of such prior distributions, see Jeffreys (1967), Box and Tiao (1973), Jaynes (1968, 1980), Savage (1961), and Zellner (1971, 1975). In the case of a parameter $\theta$ such that $-\infty < \theta < \infty$, Jeffreys recommends using $p(\theta) \propto$ constant, while for a parameter with a semi-infinite range, such as a standard deviation $\sigma$ satisfying $0 < \sigma < \infty$, he recommends taking $\log\sigma$ uniformly distributed, which implies $p(\sigma) \propto 1/\sigma$. It is the case that these are improper priors since they do not integrate to a finite constant and hence are
termed "improper". 33 Others, notably Savage (1961) and Box and Tiao (1973),
define a "diffuse" prior for O as uniform over a very wide, finite interval, that is,
p(O) oc constant for - M < 0 < M with M large but finite. In this case the prior is
proper but a choice of the value of M is required. An example will be presented to
illustrate the use of diffuse prior pdf.
Example 3.4

Consider the normal mean problem $y_i = \mu + \varepsilon_i$, $i = 1, 2, \ldots, n$, where the $\varepsilon_i$'s are NID(0, 1). The likelihood function is $(2\pi)^{-n/2}\exp\{-[\nu s^2 + n(\mu - \bar y)^2]/2\}$, where $\bar y$ is the sample mean, $\nu s^2 = \sum_{i=1}^{n}(y_i - \bar y)^2$, and $\nu = n - 1$. Let the diffuse prior be $p(\mu) \propto$ constant. Then the posterior pdf for $\mu$, $p(\mu|D)$, where $D$ represents the prior and sample information, is

$$ p(\mu|D) \propto (2\pi)^{-n/2}\exp\{-[\nu s^2 + n(\mu - \bar y)^2]/2\} \propto \exp\{-n(\mu - \bar y)^2/2\}, $$

which is in the normal form with posterior mean $\bar y$, the sample mean, and posterior variance $1/n$.34

34 If the prior pdf $p(\mu) \propto$ constant, $-M < \mu < M$, had been used, the posterior pdf is $p(\mu|D) \propto \exp\{-n(\mu - \bar y)^2/2\}$ with $-M < \mu < M$. For $M$ large relative to $1/n$, the posterior is very closely normal.
In this example it is seen that the mean and mode of the posterior pdf are equal to the sample mean, $\bar y$, the ML estimate. Some have crudely generalized this and similar results to state that with the use of diffuse prior pdfs, Bayesian and non-Bayesian estimation results are equivalent, aside from their differing interpretations. This generalization is not true in general. If a prior pdf is uniform, $p(\theta) \propto$ constant, then the posterior pdf in (3.19) is given by $p(\theta|x) \propto p(x|\theta)$, that is, it is proportional to the likelihood function. Thus, the modal value of the posterior pdf will be exactly equal to the ML estimate and in this sense there is an exact correspondence between Bayesian and non-Bayesian results. However, as shown below, the posterior mean of $\theta$ is optimal relative to a quadratic loss function. If a posterior pdf (and likelihood function) is asymmetric, the posterior mean of $\theta$ can be far different from the modal value. Thus, the optimal Bayesian point estimate can be quite different from the ML estimate in finite samples. Asymmetric likelihood functions are frequently encountered in econometric analyses.
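A simple illustration of this point (an assumed setting with simulated data): with exponentially distributed observations having rate $\theta$ and a uniform prior, the posterior is Gamma$(n+1, \sum x_i)$, whose mode equals the ML estimate $n/\sum x_i$ while its mean, the optimal estimate under quadratic loss, is $(n+1)/\sum x_i$; in small samples the two can differ appreciably because the posterior is skewed.

```python
# Sketch (assumed setting): exponential data with rate theta and a uniform
# prior give a Gamma(n+1, sum(x)) posterior; its mode equals the ML estimate,
# but its mean (the quadratic-loss optimal estimate) differs when n is small.
import numpy as np

rng = np.random.default_rng(3)
n, theta_true = 5, 2.0
x = rng.exponential(1 / theta_true, size=n)   # hypothetical small sample
s = x.sum()

ml_and_mode = n / s          # ML estimate = posterior mode under uniform prior
posterior_mean = (n + 1) / s # optimal point estimate under quadratic loss
print("ML estimate / posterior mode:", ml_and_mode)
print("posterior mean              :", posterior_mean)
```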
As regards point estimation, a part of the Bayesian approach, given a loss function, $L(\theta, \hat\theta)$, wherein $\theta$ is viewed as random and $\hat\theta$ is any non-random estimate, $\hat\theta = \hat\theta(x)$, a non-sampling criterion is to find the value of $\hat\theta$ that
minimizes the posterior expectation of the loss function. Explicitly the problem is as follows:

$$ \min_{\hat\theta}EL(\theta, \hat\theta) = \min_{\hat\theta}\int_\Theta L(\theta, \hat\theta)p(\theta|x)\,\mathrm{d}\theta, \qquad (3.22) $$

where $p(\theta|x)$ is the posterior pdf in (3.19). The minimizing value of $\hat\theta$, say $\hat\theta_B$, is the optimal Bayesian estimate, optimal in a non-sampling sense since the observation vector $x$ is given. In the case of a quadratic loss function, $L(\theta, \hat\theta) = (\theta - \hat\theta)'Q(\theta - \hat\theta)$, where $Q$ is a given pds matrix, $\hat\theta_B = \bar\theta = E(\theta|x)$, the posterior mean vector. That is,

$$ E(\theta - \hat\theta)'Q(\theta - \hat\theta) = E[(\theta - \bar\theta) - (\hat\theta - \bar\theta)]'Q[(\theta - \bar\theta) - (\hat\theta - \bar\theta)] = E(\theta - \bar\theta)'Q(\theta - \bar\theta) + (\hat\theta - \bar\theta)'Q(\hat\theta - \bar\theta), $$

since $E(\theta - \bar\theta) = 0$. Thus, since $Q$ is pds, taking $\hat\theta = \bar\theta$ leads to minimal expected loss. For a scalar parameter, $\theta$, and an absolute error loss function, $L(\theta, \hat\theta) = c|\theta - \hat\theta|$, it can be shown that the median of the posterior pdf for $\theta$ is the optimal point estimate in the sense of minimizing posterior expected loss.35 When the minimization problem in (3.22) cannot be solved analytically, it is often possible to determine the solution by use of numerical integration procedures. Thus, the optimal Bayesian estimate is tailored to be optimal relative to the loss function that is considered appropriate.
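The following sketch illustrates (3.22) numerically for a hypothetical, skewed posterior represented by simulated draws: a grid search over candidate point estimates confirms that expected quadratic loss is minimized near the posterior mean and expected absolute-error loss near the posterior median.

```python
# Sketch of (3.22) by simulation: expected loss is approximated by averaging
# over posterior draws and minimized over a grid of candidate estimates.
import numpy as np

rng = np.random.default_rng(4)
draws = rng.gamma(shape=2.0, scale=1.5, size=50_000)   # stand-in posterior draws

grid = np.linspace(0.1, 8.0, 400)
exp_quad = np.array([np.mean((draws - d)**2) for d in grid])
exp_abs = np.array([np.mean(np.abs(draws - d)) for d in grid])

print("posterior mean   :", draws.mean(),
      " grid minimizer of expected quadratic loss:", grid[exp_quad.argmin()])
print("posterior median :", np.median(draws),
      " grid minimizer of expected absolute loss :", grid[exp_abs.argmin()])
```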
Above, the optimal Bayesian estimate is defined as the solution to the minimization problem in (3.22). Before the data are observed, it is of interest to consider the sampling properties of the Bayesian estimate $\hat\theta_B$. Given a loss function $L(\theta, \hat\theta)$, the risk function, discussed above, is $r_{\hat\theta}(\theta) = \int_{R_x}L(\theta, \hat\theta)p(x|\theta)\,\mathrm{d}x$ for $\theta \in \Theta$. The Bayesian estimator is defined as the solution to the following problem:

$$ \min_{\hat\theta}Er_{\hat\theta}(\theta) = \min_{\hat\theta}\int_\Theta r_{\hat\theta}(\theta)p(\theta)\,\mathrm{d}\theta. \qquad (3.23) $$

That is, choose the estimator $\hat\theta$ so as to minimize average risk, $Er_{\hat\theta}(\theta)$, where the expectation is taken with respect to the prior pdf $p(\theta)$. On substituting the integral expression for $r_{\hat\theta}(\theta)$ in (3.23), the minimand is

$$ \int_\Theta\int_{R_x}L(\theta, \hat\theta)p(\theta)p(x|\theta)\,\mathrm{d}x\,\mathrm{d}\theta = \int_\Theta\int_{R_x}L(\theta, \hat\theta)p(\theta|x)p(x)\,\mathrm{d}x\,\mathrm{d}\theta, \qquad (3.24) $$
where $p(\theta)p(x|\theta) = p(x)p(\theta|x)$ from (3.18) has been employed. On interchanging the order of integration in (3.24), the right side becomes

$$ \int_{R_x}\Bigl[\int_\Theta L(\theta, \hat\theta)p(\theta|x)\,\mathrm{d}\theta\Bigr]p(x)\,\mathrm{d}x. \qquad (3.25) $$

When this multiple integral converges, the quantity $\hat\theta_B$ that minimizes the expression in square brackets will minimize the entire expression given that $p(x) > 0$ for $x \in R_x$.36 Thus, $\hat\theta_B$, the solution to the problem in (3.22), is the Bayesian estimator that minimizes average risk in (3.23).

Some properties of $\hat\theta_B$ follow:

(1) $\hat\theta_B$ has the optimal non-sampling property in (3.22) and the optimal sampling property in (3.23).

(2) Since $\hat\theta_B$ minimizes average risk, it is admissible relative to $L(\theta, \hat\theta)$. This is so because if there were another estimator, say $\hat\theta_A$, that uniformly dominates $\hat\theta_B$ in terms of risk, it would have lower average risk and this contradicts the fact that $\hat\theta_B$ is the estimator with minimal average risk. Hence, no such $\hat\theta_A$ exists.

(3) The class of Bayesian estimators is complete in the sense that in the class of all estimators there is no estimator outside the subset of Bayesian estimators that has lower average risk than every member of the subset of Bayesian estimators.

(4) Bayesian estimators are consistent and normally distributed in large samples with mean equal to the ML estimate and covariance matrix equal to the inverse of the estimated information matrix. Further, in large samples the Bayesian estimator (as well as the ML estimator) is "third-order" asymptotically efficient.37 These results require certain regularity conditions [see, for example, Heyde and Johnstone (1979)].

35 See, for example, Zellner (1971, p. 25) for a proof. Also, the particular loss structure that yields the modal value of a posterior pdf as an optimal point estimate is described in Blackwell and Girshick (1954, p. 305). This loss structure implies zero loss for very small estimation errors and constant positive loss for errors that are not small.

36 See Blackwell and Girshick (1954), Ferguson (1967), and DeGroot (1970) for consideration of this and the following topics.

37 See, for example, Takeuchi (1978) and Pfanzagl and Wefelmeyer (1978).
A key point in establishing these sampling properties of the Bayesian estimator, $\hat\theta_B$, is the assumption that the multiple integral in (3.24) converges. It usually does when prior pdfs are proper, although exceptions are possible. One such case occurs in the estimation of the reciprocal of a normal mean, $\theta = 1/\mu$, using quadratic loss, $(\theta - \hat\theta)^2$. The posterior pdf for $\mu$, based on a proper normal prior for $\mu$, is normal. Thus, $\theta = 1/\mu$, the reciprocal of a normal variable, possesses no finite moments and the integral defining posterior expected loss does not converge. With more information, say $\theta > 0$, this problem becomes amenable to solution. Also, if the loss function is $(\theta - \hat\theta)^2/\theta^2$, a relative squared error loss
function, there is a well-defined Bayesian estimator that minimizes average risk [see Zellner (1978) for details]. Also, if the loss function is bounded, solutions exist [see Zaman (1981)]. In terms of the Stein normal vector-mean problem, $y = \theta + \varepsilon$, considered above, if the prior pdf is $p(\theta) \propto$ constant, the posterior pdf is $p(\theta|D) \propto \exp\{-(\theta - y)'(\theta - y)/2\}$, which has posterior mean $y$, the inadmissible ML and LS estimator relative to $L(\theta, \hat\theta) = (\hat\theta - \theta)'(\hat\theta - \theta)$ when $n \ge 3$. However, when $n = 1$ or $n = 2$ the posterior mean is admissible even though it is associated with a posterior pdf based on an improper prior pdf. While inadmissible, the estimate $\hat\theta = y$ with $n > 2$ does satisfy the optimality criterion in (3.22). Also, as Hill (1975) points out, if the elements of the mean vector $\theta$ are independently distributed a priori, then the joint prior pdf is $p(\theta) = \prod_{i=1}^{n}p_i(\theta_i)$ and the associated posterior pdf is $p(\theta|x) \propto \prod_{i=1}^{n}p_i(\theta_i)p(y_i|\theta_i)$. Thus, the $\theta_i$'s are independently distributed a posteriori and, using any separable loss function, for example $L(\theta, \hat\theta) = \sum_{i=1}^{n}(\theta_i - \hat\theta_i)^2$, the Bayesian estimate of $\theta_i$ is its posterior mean, which just depends on $y_i$ and not on the other $y_j$'s.38 If the priors, $p_i(\theta_i)$, are normal with very large dispersion, Hill shows that the admissible Bayesian estimates, the posterior means of the $\theta_i$'s, are not far different from the Bayesian diffuse prior and ML estimates, $\hat\theta_i = y_i$, $i = 1, 2, \ldots, n$.
The important point brought out by Hill's (1975) cogent analysis is that if the means in the Stein problem are mean rainfall in Calcutta, mean income in Palo Alto, and mean annual attendance at the Milan opera, these $\theta_i$'s are reasonably considered independent a priori. Given this property and Hill's analysis, the estimate of a single $\theta_i$, say mean rainfall in Calcutta, will just depend on observed rainfall in Calcutta and not on observed income in Palo Alto and attendance at Milan's opera. Therefore, the Stein–James estimate (3.17) is inappropriate for such data and assumptions. On the other hand, there are many situations in which the $\theta_i$'s are dependent a priori39 and for them use of an appropriate prior pdf reflecting such dependence can lead to substantial improvement in estimation and prediction results. Specific prior assumptions leading to a Bayesian estimate close to or equal to the Stein–James estimate in (3.17) are reviewed in Zellner and Vandaele (1975).
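The contrast can be illustrated by simulation. In the sketch below the usual James–Stein form $(1 - (n-2)/y'y)\,y$ is assumed as a stand-in for (3.17), and all settings are hypothetical: when the true means are drawn close together, shrinkage reduces average risk substantially relative to the componentwise estimate $\hat\theta = y$; when they are widely dispersed and effectively unrelated, as in Hill's rainfall-income-opera example, it gains very little.

```python
# Sketch comparing the componentwise estimate theta_hat = y with an assumed
# James-Stein-type shrinkage estimate (1 - (n-2)/||y||^2) * y, for means that
# are either similar (small spread) or effectively unrelated (large spread).
import numpy as np

rng = np.random.default_rng(5)
n, reps = 10, 20_000

def avg_risks(spread):
    theta = rng.normal(0.0, spread, size=(reps, n))        # true means
    y = theta + rng.normal(size=(reps, n))                 # y ~ N(theta, I_n)
    shrink = 1.0 - (n - 2) / (y**2).sum(axis=1, keepdims=True)
    js = shrink * y                                        # assumed James-Stein form
    risk_ml = ((y - theta)**2).sum(axis=1).mean()
    risk_js = ((js - theta)**2).sum(axis=1).mean()
    return risk_ml, risk_js

for spread in (1.0, 10.0):
    ml, js = avg_risks(spread)
    print(f"spread of true means={spread:5.1f}  risk(ML)={ml:6.2f}  risk(James-Stein)={js:6.2f}")
```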
In summary, the Bayesian prescription, i.e. choose the estimate that minimizes
expected loss, is a general principle that is widely applicable. Its use in finite
samples does involve the choice of an appropriate prior pdf for parameters. While
this is difficult, particularly in multi-parameter problems, a basic issue is whether
it is possible to get sensible estimation results from any point of view without
information as to what the probable values of parameters are. Bayesians formally

38This implies that the Stein-James estimate in (3.17) is suboptimal for this specification.
39 Lindley (1962) provides the following model to rationalize dependent $\theta_i$'s: $y_i = \theta_i + \varepsilon_i$ and $\theta_i = \theta + v_i$, $i = 1, 2, \ldots, n$, where the $\varepsilon_i$'s and $v_i$'s are independent normal error terms and $\theta$ is interpreted as a "common effect". Analysis of this model produces estimates of the $\theta_i$'s very similar to those in (3.17) [see, for example, Zellner and Vandaele (1975)].
represent such information by use of prior pdfs, while non-Bayesians often use
such information informally. Evidence is being accumulated on the relative merits
of these alternative approaches to parameter estimation and other inference
problems.

3.2.1.8. Robustness criterion. The robustness criterion relates to the sensitivity


of point estimation and other inference procedures to departures from specifying
assumptions regarding models and prior distributions and to unusual or outlying
data. Since specifying assumptions are usually only approximately valid, it is
important that the sensitivity of inference techniques to departures from specify-
ing assumptions and to outlying observations be understood and that methods be
available that are relatively robust to possible departures and outlying observa-
tions. For example, it is well known that least squares estimates can be vitally
affected by one or a few outlying data points. Also, in some cases, Bayesian
inferences are sensitive to slight changes in the formulation of prior distributions.
In dealing with robustness issues two general approaches have been pursued. In
the first, attempts are made to formulate estimation and other inference proce-
dures that retain desirable properties over a range of alternative models and/or in the presence of outlying observations. For example, in estimating a population mean, the sample median is less sensitive to outlying observations than is the sample mean. Such procedures are called "blanket procedures" by Barnett and Lewis (1978, p. 47).40 The second approach, which may be called a "nesting approach", involves broadening an initial model to accommodate suspected departures from specifying assumptions and/or possible outlying observations and then proceeding with an analysis of the broader model. In both approaches,
the nature of alternatives to the initially entertained model must be given careful
consideration in order to obtain sensible results in estimation. Mechanical down-
weighting of outlying observations does not necessarily lead to satisfactory
results. For example, use of the median to estimate the location of a distribution
when outlying observations are present may suggest a unimodal distribution when
in fact the true distribution is bimodal. In this case, the outlying observations may
give some information about the location of a second mode. Or in some cases,
outlying observations in regression analysis may indicate that the assumed
functional form for the regression equation is incorrect and thus such outlying
points should not be carelessly discarded or down-weighted. On the other hand, if
outlying observations are in some sense spurious, say the result of transcription
errors, then down-weighting them in estimation can lead to more sensible results.
An example of the first approach, the blanket approach, is Huber's (1972) linear order statistics estimators or L-estimators for a location parameter $\mu$ based
on a sample of independent observations, $x_1, x_2, \ldots, x_n$. The ordered observations, $x_{(1)} < x_{(2)} < \cdots < x_{(n)}$, are combined with weights $c_i$ to yield the estimate $\hat\mu = \sum_{i=1}^{n}c_ix_{(i)}$, with the $c_i$'s smaller in value for the extreme observations than for the central observations. The L-class of estimators includes various "trimmed-mean" estimates (those that disregard extreme observations and just average central observations), the sample median, and the sample mean ($c_i = 1/n$) as special cases. Judicious choice of the $c_i$'s can lead to estimates that have better sampling properties than the sample mean when the underlying distribution departs from, for example, a normal distribution and in this sense are more robust than the sample mean. Other robust estimates, discussed by Jeffreys (1967, pp. 214-216) and Huber (1972), are maximum likelihood-type estimates, called M-estimates by Huber, for estimating a location parameter from $n$ independent observations with log-likelihood function $\sum_{i=1}^{n}\log f(x_i - \mu)$. The necessary condition for a maximum of the likelihood function can be written as

$$ \sum_{i=1}^{n}f'(x_i - \hat\mu)/f(x_i - \hat\mu) = \sum_{i=1}^{n}\omega_i(x_i - \hat\mu) = 0, $$

where $\hat\mu$ is the ML estimate and $\omega_i = f'(x_i - \hat\mu)/(x_i - \hat\mu)f(x_i - \hat\mu)$, as Jeffreys (1967, p. 214) explains. Or, in Huber's (1972) notation, it may be written as $\sum_{i=1}^{n}\psi(x_i - \hat\mu) = 0$ with $\psi(x_i - \hat\mu) \equiv (x_i - \hat\mu)\omega_i$. Thus, $\hat\mu = \sum_{i=1}^{n}\omega_ix_i/\sum_{i=1}^{n}\omega_i$ is the form of the estimate with the $\omega_i$'s data dependent. Choice of the form of $\psi(\cdot)$ or of the $\omega_i$'s depends on the nature of the underlying distribution which usually is not known exactly. If the underlying distribution is normal, then $\omega_i =$ constant and equal weights are appropriate. If $f'/f$ does not increase as fast as $|x_i - \mu|$, then Jeffreys (1967) remarks: "...the appropriate treatment will give reduced weights to the large residuals. If it [$f'/f$] increases faster, they should receive more weight than the smaller ones" (p. 214). See Jeffreys (1967, pp. 214-215) for an application of this approach, and Huber (1972) for further discussion of the appropriate choice of weights. For independently and identically distributed (i.i.d.) observations $\psi(x_i - \mu) = f'(x_i - \mu)/f(x_i - \mu)$, $i = 1, 2, \ldots, n$, and on integration $\log f(v_i) = \int\psi(v_i)\,\mathrm{d}v_i +$ constant, where $v_i = x_i - \mu$. Thus, choice of a particular form for $\psi(v_i)$ implies a form of the likelihood function when $\int\psi(v_i)\,\mathrm{d}v_i$ converges and the i.i.d. assumption is satisfied. Viewed in this way, the M-estimation approach is a "nested approach". However, this interpretation is not generally possible if the observations are not i.i.d.

40 For examples of this approach, see Tukey (1977), Huber (1964, 1972), and Belsley, Kuh and Welsch (1980).
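As a small numerical sketch of these ideas (hypothetical data containing a few outliers), the sample mean is compared below with three robust location estimates of the kinds just discussed: the median, a 10 percent trimmed mean (an L-estimate), and a simple Huber-type M-estimate computed by iteratively reweighted averaging with an assumed tuning constant $k$ and unit scale.

```python
# Sketch contrasting the sample mean with robust location estimates on data
# that contain a few large outliers (all settings hypothetical).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(0.0, 1.0, 50), [8.0, 9.0, 12.0]])  # 3 outliers

def huber_m(x, k=1.345, iters=50):
    mu = np.median(x)                          # robust starting value
    for _ in range(iters):
        r = x - mu
        w = np.minimum(1.0, k / np.maximum(np.abs(r), 1e-12))  # Huber weights
        mu = (w * x).sum() / w.sum()           # reweighted average
    return mu

print("sample mean      :", x.mean())
print("median           :", np.median(x))
print("10% trimmed mean :", stats.trim_mean(x, 0.10))
print("Huber M-estimate :", huber_m(x))
```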
The second approach, the nesting approach, involves representing suspected
departures and/or outlying observations from an initial model by formulating a
broader model and analyzing it. There are many examples of this approach in
econometrics and statistics. Student-t distributions that include Cauchy and
normal distributions as limiting cases can be employed in analyzing regression
and other models [see, for example, Jeffreys (1973, p. 68) and Zellner (1976)]. The
heavy tails of Student distributions for low degrees of freedom accommodate
outlying observations. Also, see Barnett and Lewis (1978) for a review of a
number of models for particular kinds of outlying observations. Many production
function models including the CES, trans-log, and other generalized production
function models include the Cobb-Douglas and other models as special cases.
Box–Cox (1964) and other transformations [see, for example, Tukey (1957) and Zellner and Revankar (1969)] can be employed to broaden specifying assumptions and thus to guard against possible specification errors. In regression analysis, it is common practice to consider models for error terms, say autoregressive and/or moving average processes, when departures from independence are
thought to be present. Such broadened models can of course be analyzed in either
sampling theory or Bayesian approaches. With respect to Bayesian considerations
of robustness, see, for example, Savage et al. (1963), Box and Tiao (1973), and
DeRobertis (1978).

3.2.1.9. Invariance criterion. The invariance criterion, discussed for example in Cox and Hinkley (1974, pp. 41-45) and Arnold (1981, pp. 20-24), relates to properties of estimation and other inference procedures when sample observations and parameters are subjected to certain types of transformations. For example, if in a model for i.i.d. observations the observations are reordered or permuted, the results obtained with most estimation procedures will be unaffected and are thus invariant to such transformations of the data. Further, if the $\tilde x_i$'s are i.i.d., each with pdf $f(x - \theta)$, and the $\tilde x_i$'s are transformed to $\tilde y_i = \tilde x_i + a$, where $-\infty < a < \infty$, then each $\tilde y_i$ has pdf $f(y - \theta^*) = f(x - \theta)$, where $\theta^* = \theta + a$. If the parameter spaces for $\theta$ and $\theta^*$ are identical, say the real line, then the invariance criterion requires that an estimate of $\theta$, $\hat\theta(x)$, based on the $x_i$'s, be identical to that obtained by using the $y_i$'s to estimate $\theta^*$, $\hat\theta^*(y)$, and then obtaining an estimate of $\theta$ from $\hat\theta(y) = \hat\theta^*(y) - a$. That is, an invariant estimate must satisfy $\hat\theta(x + a\iota) = \hat\theta(x) + a$ for all values of $a$, where $\iota$ denotes a column of ones. As Cox and Hinkley (1974, p. 43) point out, a crucial point is that there are no external reasons for preferring some values of $\theta$ to others. For example, they mention that if $\theta \ge 0$, the invariance condition above would not hold since the parameter space is not invariant under the transformation when $\theta \ge 0$ is imposed.
Conditions under which estimates are invariant in the above sense to more general transformations of the data, say $\tilde y_i = c\tilde x_i + a$ or $\tilde y_i = g\tilde x_i$, where $g$ is a member of a class of transformations $G$, have been analyzed in the literature. Also, Arnold (1981, p. 20 ff.) defines conditions under which an estimation problem and associated loss function are invariant under transformations of the data, parameters, and loss function and goes on to discuss "best invariant" or "minimum risk invariant" estimators. See also Cox and Hinkley (1974, p. 443) for a discussion of the famous Pitman minimum risk invariant estimate of a location
parameter $\theta$ in the likelihood function $\prod_{i=1}^{n}f(x_i - \theta)$, $-\infty < \theta < \infty$. The result is that the Pitman estimate is "... the mean of the normalized likelihood function" (p. 444), that is, the mean of a posterior distribution based on a uniform, improper prior for $\theta$.
In the Bayesian approach, invariance of estimation results to transformations
of the data and parameters has been considered in Jeffreys (1967), Hartigan
(1964), and Zellner (1971). Hartigan, building on Jeffreys' pioneering work,
defines various kinds of invariance and provides classes of prior distributions,
including Jeffreys' that lead to invariant estimation results.
Requiring that estimation procedures be invariant places restrictions on the
forms of estimators. Having invariance with respect to changes in units of
measurement and some other types of transformations suggested by the nature of
specific problems seems desirable. However, as Cox and Hinkley (1974) state,
"... in a decision-making context [as in choice of an estimator relative to a given
loss or utility function]... there is sometimes a clash between the invariance
principle and other apparently more compelling requirements; there can be a
uniform loss of expected utility from following the invariance principle" (p. 45).
Thus, for each problem it is important to consider carefully the types of
transformations for which invariance is required and their effects on estimation
and other inference procedures.

3.2.1.10. Cost criterion. Practically speaking, the cost of applying alternative


estimation techniques is of importance. Some estimation procedures involve
difficult numerical procedures. Generally, cost-benefit analysis is relevant in
choosing among alternative estimation procedures. While this is recognized, it is
difficult to generalize about the range of considerations. Each case has to be
considered separately. In some cases, cost of computation can be formally
introduced in loss functions and these broadened loss functions can be employed
to choose among alternative estimation procedures. However, in many cases, cost
considerations are dealt with in an informal, heuristic manner.
In this section various approaches and criteria for point estimation have been
considered. While point estimation is important, it must be emphasized that a
point estimate unaccompanied by a measure of precision is very unsatisfactory. In
the sampling theory approach, point estimates are supplemented by their associ-
ated standard errors (estimates of standard deviations of estimators). In the
Bayesian approach, point estimates are usually accompanied by a measure of the
dispersion of posterior distributions, e.g. posterior standard deviations or com-
plete posterior distributions. In the next section attention is directed toward
explaining methods for computing intervals or regions that in one of several
senses probably include the values of parameters being estimated.
4. Interval estimation: Confidence bounds, intervals, and regions

To provide a quantitative expression of the uncertainty associated with a scalar point estimate $\hat\theta$ of a parameter $\theta$, confidence bounds and intervals are available both in the sampling theory and Bayesian approaches. Similarly, for a vector point estimate $\hat\theta$ of a parameter vector $\theta$, sampling theory and Bayesian confidence regions for $\theta$ can be computed. As will be seen, the probabilistic interpretations of sampling theory and Bayesian confidence bounds, intervals, and regions are radically different.

4.1. Confidence bounds

Let $\theta \in \Theta$ be a scalar parameter appearing in a probability pdf, $f(x|\theta)$, for a random observation vector $\tilde x$. Further, let $\tilde a_\alpha = a_\alpha(\tilde x)$ be a statistic, that is, a function of $\tilde x$, such that

$$ P(\tilde a_\alpha \ge \theta\,|\,\theta) = 1 - \alpha. \qquad (4.1a) $$

In addition, it is required that if $\alpha_1 > \alpha_2$ and if $\tilde a_{\alpha_1}$ and $\tilde a_{\alpha_2}$ are both defined in accord with (4.1), then $\tilde a_{\alpha_1} \le \tilde a_{\alpha_2}$, that is, the larger $1 - \alpha$, the larger is the upper bound. Then $\tilde a_\alpha$ is called a $(1 - \alpha)\times 100$ percent upper confidence bound for $\theta$. From (4.1a) the random event $\tilde a_\alpha \ge \theta$ has probability $1 - \alpha$ of occurrence and this is the sense in which $\tilde a_\alpha$ is a probabilistic bound for $\theta$. When $\tilde x$ is observed, $a_\alpha(\tilde x)$ can be evaluated with the given sample data $x$. The result is $a_\alpha(x)$, a non-stochastic quantity, say $a_\alpha(x) = 1.82$, and the computed upper confidence bound is 1.82. In a similar way a $(1 - \alpha)\times 100$ percent lower confidence bound for $\theta$ is $\tilde b_\alpha = b_\alpha(\tilde x)$ such that

$$ P(\tilde b_\alpha \le \theta\,|\,\theta) = 1 - \alpha, \qquad (4.1b) $$

with $\tilde b_{\alpha_1} \ge \tilde b_{\alpha_2}$ when $\alpha_1 > \alpha_2$; that is, the larger is $1 - \alpha$, the smaller is the lower bound.

Bayesian confidence bounds are based on the posterior pdf for $\theta$, $p(\theta|x) \propto \pi(\theta)f(x|\theta)$, where $\pi(\theta)$ is a prior pdf for $\theta$. A $(1 - \alpha)\times 100$ percent Bayesian upper bound, $c_\alpha = c_\alpha(x)$, is defined by

$$ P(\theta \le c_\alpha\,|\,x) = 1 - \alpha, \qquad (4.2a) $$

where $\theta$ is considered random and the sample data $x$ are given. Note that

$$ P(\theta \le c_\alpha\,|\,x) = \int_{-\infty}^{c_\alpha}p(\theta|x)\,\mathrm{d}\theta $$

is just the posterior cdf evaluated at $c_\alpha$.
A $(1 - \alpha)\times 100$ percent Bayesian lower bound is $d_\alpha = d_\alpha(x)$ satisfying

$$ P(\theta \ge d_\alpha\,|\,x) = 1 - \alpha, \qquad (4.2b) $$

where

$$ P(\theta \ge d_\alpha\,|\,x) = \int_{d_\alpha}^{\infty}p(\theta|x)\,\mathrm{d}\theta. $$

The fundamental differences in the probability statements in (4.1) and (4.2) must be emphasized. A sampling theory bound has the interpretation that in repeated samples the bound so computed will be correct in about $1 - \alpha$, say $1 - \alpha = 0.95$, of the cases. The Bayesian bound states that the random parameter $\theta$ will satisfy the bound with posterior probability $1 - \alpha$ given the sample and prior information. The following example is a case in which Bayesian and sampling theory confidence bounds are numerically identical.
Example 4.1

Let the $n$ $\tilde x_i$'s be i.i.d. $N(\theta, 1)$. Then

$$ f(x|\theta) = (2\pi)^{-n/2}\exp\{-\textstyle\sum(x_i - \theta)^2/2\} \propto \exp\{-n(\theta - \bar x)^2/2\}, $$

where $\bar x$ is the sample mean. Then with the prior $\pi(\theta) \propto$ constant, $-\infty < \theta < \infty$, the posterior pdf is $f(\theta|x) \propto \exp\{-n(\theta - \bar x)^2/2\}$, a normal pdf. Thus, $z = \sqrt{n}(\theta - \bar x)$ is $N(0, 1)$ a posteriori and the constant $c_\alpha$ can be found such that $P(z \le c_\alpha\,|\,\bar x) = 1 - \alpha$. $z \le c_\alpha$ is equivalent to $\sqrt{n}(\theta - \bar x) \le c_\alpha$ or $\theta \le \bar x + c_\alpha/\sqrt{n}$. Thus, $P(\theta \le \bar x + c_\alpha/\sqrt{n}\,|\,\bar x) = 1 - \alpha$ and $\bar x + c_\alpha/\sqrt{n}$ is the Bayesian upper confidence bound. Now from a sampling theory point of view, the sample mean $\bar x$ has a normal sampling pdf with mean $\theta$ and variance $1/n$. Thus, $z = \sqrt{n}(\bar x - \theta)$ is $N(0, 1)$, given $\theta$. From $P(z \ge -c_\alpha\,|\,\theta) = 1 - \alpha$ it follows that $P[\sqrt{n}(\bar x - \theta) \ge -c_\alpha\,|\,\theta] = P(\bar x + c_\alpha/\sqrt{n} \ge \theta\,|\,\theta) = 1 - \alpha$ and $\bar x + c_\alpha/\sqrt{n}$ is the sampling theory upper confidence bound that is numerically identical to the Bayesian bound.
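The coincidence of the two bounds in Example 4.1 is easily checked numerically; in the sketch below the data are simulated and $n$ and $\alpha$ are hypothetical choices.

```python
# Sketch of Example 4.1: the 95% upper confidence bound xbar + c_alpha/sqrt(n)
# and a check of its posterior probability under theta | x ~ N(xbar, 1/n).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, alpha = 25, 0.05
x = rng.normal(1.0, 1.0, n)                 # hypothetical N(theta, 1) sample
c = stats.norm.ppf(1 - alpha)               # c_alpha with Pr(z <= c_alpha) = 1 - alpha
upper = x.mean() + c / np.sqrt(n)
print("95% upper confidence bound for theta:", upper)
print("posterior Pr(theta <= bound | x)    :",
      stats.norm.cdf(upper, x.mean(), 1 / np.sqrt(n)))
```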
The example indicates that when a uniform prior pdf for the parameter $\theta$ is appropriate and when a "pivotal quantity", such as $z = \sqrt{n}(\bar x - \theta)$, that has a pdf not involving the parameter $\theta$ exists,41 Bayesian and sampling theory confidence bounds are numerically identical. Other examples involving different pivotal quantities will be presented below. Also, a connection of confidence bounds with construction of tests of hypotheses will be discussed below in the section on hypothesis testing.

41 Note that the pdf for $z = \sqrt{n}(\bar x - \theta)$ in the example is $N(0, 1)$ both from the sampling theory and Bayesian points of view.
4.2. Confidence intervals

By use of both a lower and an upper confidence bound for a scalar parameter $\theta$, an interval estimate or a confidence interval is obtained. In the sampling theory approach the random quantities $\tilde a = a(\tilde x)$ and $\tilde b = b(\tilde x)$ such that

$$ P(\tilde b < \theta < \tilde a\,|\,\theta) = 1 - \alpha \qquad (4.3) $$

yield a random interval that has probability $1 - \alpha$ of including or covering the fixed unknown value of $\theta$. On combining a lower confidence bound, $\tilde b_{\alpha_1}$, with an upper confidence bound $\tilde a_{\alpha_2}$, with $\alpha_1 + \alpha_2 = \alpha$, (4.3) will be satisfied. Similar considerations apply to combinations of Bayesian lower and upper confidence bounds to obtain a Bayesian confidence interval. In general, since there are many values for $\alpha_1$ and $\alpha_2$ satisfying $\alpha_1 + \alpha_2 = \alpha$, confidence intervals with probability content $1 - \alpha$ are not unique.

Example 4.2

Consider the standard normal regression model $\tilde y = X\beta + \tilde u$, where the $n\times 1$ disturbance vector $\tilde u$ is MVN$(0, \sigma^2I_n)$. Then $\hat\beta = (X'X)^{-1}X'\tilde y$ has a pdf that is MVN$[\beta, (X'X)^{-1}\sigma^2]$ and $\nu s^2/\sigma^2$, where $\nu = n - k$ and $\nu s^2 = (\tilde y - X\hat\beta)'(\tilde y - X\hat\beta)$, has a $\chi^2$ pdf with $\nu$ degrees of freedom (d.f.). It follows that $t = (\hat\beta_i - \beta_i)/s_{\hat\beta_i}$ has a univariate Student-t (US-t) pdf with $\nu$ d.f., where $\hat\beta_i$ and $\beta_i$ are the $i$th elements of $\hat\beta$ and $\beta$, respectively, and $s^2_{\hat\beta_i} = m^{ii}s^2$, with $m^{ii}$ the $i,i$th element of $(X'X)^{-1}$. Then from tables of the Student-t distribution with $\nu$ d.f., a constant $c_\alpha > 0$ can be found such that with given probability $1 - \alpha$, $P(|t| < c_\alpha) = P(|\hat\beta_i - \beta_i|/s_{\hat\beta_i} < c_\alpha\,|\,\beta_i) = 1 - \alpha$. Since $|\hat\beta_i - \beta_i|/s_{\hat\beta_i} < c_\alpha$ is equivalent to $\hat\beta_i - c_\alpha s_{\hat\beta_i} < \beta_i < \hat\beta_i + c_\alpha s_{\hat\beta_i}$, $P(\hat\beta_i - c_\alpha s_{\hat\beta_i} < \beta_i < \hat\beta_i + c_\alpha s_{\hat\beta_i}\,|\,\beta_i) = 1 - \alpha$ and $\hat\beta_i \pm c_\alpha s_{\hat\beta_i}$ is a $(1 - \alpha)\times 100$ percent confidence interval for $\beta_i$. Note that the interval is random and $\beta_i$ has a fixed unknown value. With given data $\hat\beta_i \pm c_\alpha s_{\hat\beta_i}$ can be evaluated to yield, for example, $0.56 \pm 0.12$.
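A numerical sketch of Example 4.2 with simulated data (all settings hypothetical) follows; it computes the 95 percent interval $\hat\beta_i \pm c_\alpha s_{\hat\beta_i}$ for one coefficient using the Student-t distribution with $\nu = n - k$ d.f.

```python
# Sketch of Example 4.2: a 95% confidence interval for one regression
# coefficient computed from simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 0.5, -0.2])
y = X @ beta_true + rng.normal(0, 2.0, n)

bhat = np.linalg.solve(X.T @ X, X.T @ y)          # least squares estimate
nu = n - k
s2 = ((y - X @ bhat) @ (y - X @ bhat)) / nu       # s^2 with nu = n - k d.f.
cov = s2 * np.linalg.inv(X.T @ X)
i = 1                                             # coefficient of interest
se = np.sqrt(cov[i, i])                           # s_{beta_i_hat}
c = stats.t.ppf(0.975, nu)                        # c_alpha for 1 - alpha = 0.95
print(f"beta_{i} hat = {bhat[i]:.3f},  95% interval: "
      f"({bhat[i] - c*se:.3f}, {bhat[i] + c*se:.3f})")
```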
Example 4.3

If the regression model in the previous example is analyzed with a diffuse prior pdf, $p(\beta, \sigma) \propto 1/\sigma$, the posterior pdf is $p(\beta, \sigma|y) \propto \sigma^{-(n+1)}\exp\{-[\nu s^2 + (\beta - \hat\beta)'X'X(\beta - \hat\beta)]/2\sigma^2\}$ and on integrating over $\sigma$, $0 < \sigma < \infty$, the marginal posterior pdf for $\beta$ is $p(\beta|y) \propto [\nu s^2 + (\beta - \hat\beta)'X'X(\beta - \hat\beta)]^{-(\nu+k)/2}$, a pdf in the MVS-t form with $\nu = n - k$, $\hat\beta = (X'X)^{-1}X'y$, and $\nu s^2 = (y - X\hat\beta)'(y - X\hat\beta)$. Then it follows that $t = (\beta_i - \hat\beta_i)/s_{\hat\beta_i}$ has a US-t pdf with $\nu$ d.f., where $\beta_i$ and $\hat\beta_i$ are the $i$th elements of $\beta$ and $\hat\beta$, respectively, and $s^2_{\hat\beta_i} = m^{ii}s^2$, where $m^{ii}$ is the $i,i$th element of $(X'X)^{-1}$. Thus, $c_\alpha$ can be found such that for given probability $1 - \alpha$, $P(|t| < c_\alpha) = P(|\beta_i - \hat\beta_i|/s_{\hat\beta_i} < c_\alpha\,|\,y) = 1 - \alpha$. Equivalently, $P(\hat\beta_i - c_\alpha s_{\hat\beta_i} < \beta_i < \hat\beta_i + c_\alpha s_{\hat\beta_i}\,|\,y) = 1 - \alpha$ and $\hat\beta_i \pm c_\alpha s_{\hat\beta_i}$ is a $(1 - \alpha)\times 100$ percent Bayesian confidence interval for $\beta_i$ in the sense that the posterior probability that the random $\beta_i$ lies in the fixed interval $\hat\beta_i \pm c_\alpha s_{\hat\beta_i}$ is $1 - \alpha$.
In these two examples $t = (\hat\beta_i - \beta_i)/s_{\hat\beta_i}$ is a pivotal quantity. Its pdf, $p(t) = c(\nu)/(1 + t^2/\nu)^{(\nu+1)/2}$, with $c(\nu)$ a normalizing constant, does not involve $\beta$ and $\sigma$. Also, the pdf for $t$ is the same in the sampling theory and Bayesian approaches when the diffuse prior $p(\beta, \sigma) \propto 1/\sigma$ is employed and in this case sampling theory and Bayesian confidence intervals are numerically identical. However, if an informative prior pdf were employed, reflecting additional information, the intervals would not be numerically identical. Generally the Bayesian interval, incorporating more information, will be shorter in length for a given confidence level $1 - \alpha$. Generally, as $1 - \alpha$ is increased in value both sampling theory and Bayesian confidence intervals get broader; that is, to be more confident (higher $1 - \alpha$) that the interval probably covers or includes the value of a parameter, it will have to be of greater length.

The intervals discussed in the two examples above, $\hat\beta_i \pm c_\alpha s_{\hat\beta_i}$, are "central" intervals. As mentioned above, $c_{\alpha_1}$ and $c_{\alpha_2}$ can be found such that $\hat\beta_i - c_{\alpha_1}s_{\hat\beta_i}$ to $\hat\beta_i + c_{\alpha_2}s_{\hat\beta_i}$ is a $1 - \alpha$ confidence interval for $\beta_i$. In this case, the interval is not symmetric with respect to $\hat\beta_i$. Similarly, a confidence interval for $\sigma^2$ need not be a central interval. That is, the sampling pdf of $z = \nu s^2/\sigma^2$ is $\chi^2$ with $\nu$ d.f. Then constants $c_{\alpha_1}$ and $c_{\alpha_2}$ exist such that $P(c_{\alpha_1} < z < c_{\alpha_2}) = P(c_{\alpha_1} < \nu s^2/\sigma^2 < c_{\alpha_2}\,|\,\sigma) = 1 - \alpha$ or, equivalently, $P(\nu s^2/c_{\alpha_2} < \sigma^2 < \nu s^2/c_{\alpha_1}\,|\,\sigma) = 1 - \alpha$. The interval $\nu s^2/c_{\alpha_2}$ to $\nu s^2/c_{\alpha_1}$ is not centered at $s^2$.42 Also, there are many ways of selecting $c_{\alpha_1}$ and $c_{\alpha_2}$ such that the probability associated with an interval for $\sigma^2$ is $1 - \alpha$.
The problem of obtaining a unique confidence interval for a scalar parameter $\theta$ can be solved in many cases by applying the criterion that for a given confidence level $1 - \alpha$ the interval selected be of shortest, in some sense, length. In a Bayesian context with a posterior pdf for $\theta$, $p(\theta|D)$, where $D$ denotes the sample and prior information, an interval $b$ to $a$ is sought such that $a - b$ is minimized subject to $\int_b^a p(\theta|D)\,\mathrm{d}\theta = 1 - \alpha$, where $1 - \alpha$ is given. The solution to this constrained minimization problem is to select $a = a_*$ and $b = b_*$ such that $p(a_*|D) = p(b_*|D)$.43 Then the interval $b_*$ to $a_*$ has probability content $1 - \alpha$ and is of minimal length. For a unimodal $p(\theta|D)$, the ordinate of $p(\theta|D)$ in the interval $b_*$ to $a_*$ is everywhere higher than the ordinates outside this interval and thus the interval $b_*$ to $a_*$ is often called a highest posterior density (HPD) interval with probability content $1 - \alpha$. If $p(\theta|D)$ is unimodal and symmetric about the modal value $\theta_m$, then the HPD interval can be expressed as $\theta_m \pm c_\alpha$, a central interval with $p(\theta_m + c_\alpha|D) = p(\theta_m - c_\alpha|D)$ and $\Pr(\theta_m - c_\alpha < \theta < \theta_m + c_\alpha\,|\,D) = 1 - \alpha$. For bimodal and some other types of posterior pdfs, a single interval is not very useful in characterizing a range of probable values for $\theta$.

42 $\nu s^2/c_{\alpha_2}$ to $\nu s^2/c_{\alpha_1}$ is also a $1 - \alpha$ Bayesian interval when the diffuse prior $p(\beta, \sigma) \propto 1/\sigma$ is employed, since then $\nu s^2/\sigma^2$ has a $\chi^2$ posterior pdf with $\nu$ d.f. In this problem the pivotal quantity is $\nu s^2/\sigma^2$, which has a $\chi^2$ pdf not involving $\beta$ or $\sigma$ in both the sampling theory and Bayesian approaches.

43 Write $a - b + \lambda\int_b^a p(\theta|D)\,\mathrm{d}\theta$, where $\lambda$ is a Lagrange multiplier. Differentiating this expression partially with respect to $a$ and to $b$ and setting the first partial derivatives equal to zero yields $1 + \lambda p(a|D) = 0$ and $-1 - \lambda p(b|D) = 0$, so that $p(a|D) = p(b|D)$ is necessary for $a - b$ to be minimized subject to the restriction. Under weak conditions, this condition is also sufficient. Also, this interval can be obtained by minimizing expected loss with a loss function of the following type: $L = q(a - b) - 1$ if $b \le \theta \le a$ and $L = q(a - b)$ otherwise, with $q > 0$ a given constant. This loss function depends on the length of the interval, $a - b$.
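A sketch of the HPD computation just described, for a beta posterior of the kind obtained in the binomial example (parameter values hypothetical), is given below; among all intervals with posterior probability $1 - \alpha$ the shortest is retained, and its endpoints have (approximately) equal posterior density.

```python
# Sketch of a highest posterior density (HPD) interval for a beta posterior:
# scan over the probability allocated to the lower tail and keep the shortest
# interval with total posterior probability 1 - alpha.
import numpy as np
from scipy import stats

post = stats.beta(9.0, 15.0)        # e.g. a posterior Beta(r+a, n-r+b)
alpha = 0.05

lower_tail = np.linspace(0.0, alpha, 2001)
lo = post.ppf(lower_tail)
hi = post.ppf(lower_tail + 1 - alpha)
best = np.argmin(hi - lo)           # shortest interval among the candidates
print("95% HPD interval        :", (lo[best], hi[best]))
print("density at the endpoints:", post.pdf(lo[best]), post.pdf(hi[best]))
print("95% central interval    :", (post.ppf(0.025), post.ppf(0.975)))
```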
In the sampling theory approach various definitions of interval shortness have been proposed. Since the sampling theory confidence interval is random, its length is random. Attempts to obtain confidence intervals with minimum expected length have not been successful in general. Another criterion is to maximize the probability of coverage, that is, to find $\tilde b_\alpha$ and $\tilde a_\alpha$ such that $1 - \alpha = P(\tilde b_\alpha \le \theta \le \tilde a_\alpha\,|\,\theta) \ge P(\tilde b_\alpha \le \theta' \le \tilde a_\alpha\,|\,\theta)$ for every $\theta$ and $\theta' \in \Theta$, where $\theta$ is the true value and $\theta'$ is some other value. That is, the interval must be at least as likely to cover the true value as any other value. An interval satisfying this criterion is called an unbiased confidence interval of level $1 - \alpha$. Pratt (1961) has shown that in many standard estimation problems there exist $1 - \alpha$ level confidence intervals which have uniformly minimum expected length among all $1 - \alpha$ level unbiased confidence intervals. Also, a concept of shortness related to properties of uniformly most powerful unbiased tests will be discussed below.

In summary, for a scalar parameter $\theta$, or for a function of $\theta$, $g(\theta)$, results are available to compute upper and lower confidence bounds and confidence intervals in both the sampling theory and Bayesian approaches. For some problems, for example $g(\theta) = 1/\theta$, where $\tilde x_i = \theta + \tilde\varepsilon_i$, with the $\tilde\varepsilon_i$'s NID$(0, \sigma^2)$, both the sampling distribution of $1/\bar x$ and the posterior pdf for $1/\theta$ can be markedly bimodal and in such cases a single interval is not very useful. Some other pathological cases are discussed in Lindley (1971) and Cox and Hinkley (1974, p. 232 ff.). The relationship of sampling properties of Bayesian and sampling theory intervals is discussed in Pratt (1965).

4.3. Confidence regions

A confidence region is a generalization of a confidence interval in the sense that it relates to a vector of parameters rather than a scalar parameter. A sampling theory $1 - \alpha$ confidence region for a vector of parameters $\theta \in \Theta$ is a nested set of regions in the sample space denoted by $\tilde\omega_\alpha = \omega_\alpha(\tilde x)$, where $\tilde x$ is the random observation vector, such that for all $\theta \in \Theta$,

$$ P(\theta \in \tilde\omega_\alpha\,|\,\theta) = 1 - \alpha, \qquad (4.4) $$

and $\omega_{\alpha_1}(\tilde x) \subset \omega_{\alpha_2}(\tilde x)$ when $\alpha_1 > \alpha_2$. This last condition insures that the confidence region will be larger the larger is $1 - \alpha$. In particular problems, as with confidence intervals, some additional considerations are usually required to determine a unique form for $\tilde\omega_\alpha$. If $\tilde\omega_\alpha$ is formed so that all parameter values in $\tilde\omega_\alpha$ have higher likelihood than those outside, such a region is called a likelihood-based confidence region by Cox and Hinkley (1974, p. 218).
A Bayesian $1 - \alpha$ confidence region for a parameter vector $\theta$ is based on the posterior pdf for $\theta$. That is, a nested set of regions $\omega_\alpha = \omega_\alpha(x)$, where $x$ is a given vector of observations, such that

$$ P(\theta \in \omega_\alpha(x)\,|\,x) = 1 - \alpha, \qquad (4.5) $$

and $\omega_{\alpha_1}(x) \subset \omega_{\alpha_2}(x)$ for $\alpha_1 > \alpha_2$, is a $1 - \alpha$ Bayesian confidence region for $\theta$. On comparing (4.4) and (4.5) it is the case that they involve different probability statements, (4.4) relating to properties of the random region $\tilde\omega_\alpha$ given $\theta$, and (4.5) relating to posterior properties of the random $\theta$ given the region $\omega_\alpha = \omega_\alpha(x)$. Two examples will be provided to illustrate these concepts.
Example 4.4

From Example 4.2 the sampling distribution of $\tilde F = (\hat\beta - \beta)'X'X(\hat\beta - \beta)/ks^2$ is known to be an F pdf with $k$ and $\nu$ d.f. From tables of this distribution, $c_\alpha$ can be determined such that $P(\tilde F \le c_\alpha) = 1 - \alpha$. Then $F = (\hat\beta - \beta)'X'X(\hat\beta - \beta)/ks^2 \le c_\alpha$ defines a set of nested regions (ellipsoids) that constitute a confidence region for $\beta$. In the case where $\beta$ has two elements, $(\hat\beta - \beta)'X'X(\hat\beta - \beta)/ks^2 \le c_\alpha$ defines a set of nested ellipses. For $k = 1$, the result $(\hat\beta - \beta)^2\sum x_i^2/s^2 \le c_\alpha$ is consistent with the confidence interval for a single parameter discussed earlier. The bounding contours of these confidence regions have constant marginal likelihood.
Example 4.5

From the marginal posterior pdf for $\beta$ in Example 4.3, it is the case that $F = (\beta - \hat\beta)'X'X(\beta - \hat\beta)/ks^2$ has a posterior pdf in the form of an F pdf with $k$ and $\nu$ d.f. Thus, $c_\alpha$ can be obtained such that $P[(\beta - \hat\beta)'X'X(\beta - \hat\beta)/ks^2 \le c_\alpha\,|\,D] = 1 - \alpha$. Then $(\beta - \hat\beta)'X'X(\beta - \hat\beta)/ks^2 \le c_\alpha$ defines a set of nested regions or confidence region for $\beta$. For $\beta$ having two elements, the nested regions are ellipses. Also, it is the case that these confidence regions are highest posterior density regions.
That the regions in Examples 4.4 and 4.5 are identical is due to the fact that there exists a pivotal quantity, namely $F = (\hat\beta - \beta)'X'X(\hat\beta - \beta)/ks^2$, that has the same pdf under the sampling theory and Bayesian assumptions and does not involve any parameters with unknown values. These confidence regions relate to the entire coefficient vector $\beta$. Similar results can be obtained for any subvector of $\beta$. Further, there are several types of "simultaneous" confidence intervals for all differences of means or contrasts for various analysis of variance models, including that for independent, normal observations, $\tilde y_{ij}$, with $E\tilde y_{ij} = \theta + \gamma_i$, $\mathrm{var}(\tilde y_{ij}) = \sigma^2$, $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, n$, and $\sum_{i=1}^{k}\gamma_i = 0$; see, for example, Arnold (1981, p. 135 ff. and ch. 12) for derivations of simultaneous confidence intervals for regression and analysis of variance models.

For many problems involving a parameter vector $\theta' = (\theta_1', \theta_2')$, if pivotal quantities do not exist, it is difficult to obtain an exact $1 - \alpha$ confidence region for $\theta_1$
without additional conditions; see Cox and Hinkley (1974, p. 230 ff.) for analysis
of this problem. In a Bayesian approach the marginal posterior pdf for $\theta_1$, $h(\theta_1|D)$, is obtained from the joint posterior pdf $p(\theta_1, \theta_2|D)$ and confidence regions can be based on $h(\theta_1|D)$. Another serious problem arises if the sampling distribution of an estimator $\hat\theta$ or the posterior pdf for $\theta$ is multi-modal or has
some other unusual features. In such cases sampling theory and Bayesian confi-
dence regions can be misleading. Finally, in large samples, maximum likelihood
and other estimators are often approximately normally distributed and the large
sample normal distribution can be employed to obtain approximate confidence
intervals and regions in a sampling theory approach. Similarly, in large samples,
posterior pdfs assume an approximate normal form and approximate Bayesian
intervals and regions can be computed from approximate normal posterior
distributions. For n large enough, these approximations will be satisfactory.

5. Prediction

Prediction is a most important part of econometrics and other sciences. Indeed,


Jeffreys (1967) defines induction to be the process "... of making inferences from
past experience to predict future experience" (p. 1). Also, causation has been
defined to be confirmed predictability from a law or set of laws by Feigl (1953)
and other philosophers. Since induction and causation are directly linked to
prediction, and since prediction is intimately involved in economic research, 44
econometric modelling, and policy analysis, it is a topic that deserves considerable
emphasis in econometrics and other sciences, a point of view stressed strongly by
Geisser (1980).
Prediction usually involves the study of past data, denoted by $x$, and formulation of a probability model for them. For simplicity, assume that the $n$ elements of $\tilde x$ are independently and identically distributed and that the model for $\tilde x$ is $p(x|\theta) = \prod_{i=1}^{n}f(x_i|\theta)$, where $\theta$ is a vector of parameters. Now consider some future or as yet unobserved data. These future or as yet unobserved data are denoted by $\tilde z$, a $q\times 1$ random vector. If it is further assumed that the elements of $\tilde z$ are generated by the same probability model that generated $\tilde x$, then the probability model for $\tilde z$ is $g(z|\theta) = \prod_{i=1}^{q}f(z_i|\theta)$. If the form of $f(\cdot)$ and the value of $\theta$ were known exactly, then $g(z|\theta)$ would be completely determined and could be employed to make various probability statements about the elements of $\tilde z$. Unfortunately, the value of $\theta$ is not usually known and this fact makes it important to have prediction techniques that are operational when the value of $\theta$
is uncertain. Also, if there are serious errors in the assumptions underlying the probability models for $\tilde x$ and for $\tilde z$, then predictions of $\tilde z$ will usually be adversely affected.45

44 See, for example, the predictions that Friedman (1957, pp. 214-219) derived from his theory of the consumption function.

5.1. Sampling theory prediction techniques 46

With past data $x$ and future or as yet unobserved data $\tilde z$, a point prediction of the random vector $\tilde z$ is defined to be $\hat z = \varphi(x)$, where $\varphi(x)' = [\varphi_1(x), \varphi_2(x), \ldots, \varphi_q(x)]$ is a function of just $x$ and thus can be evaluated given $x$. When the value of $\tilde z$ is observed, say $z_0$, then $e_0 = \hat z - z_0$ is the observed forecast error vector. In general perfect prediction in the sense $e_0 = 0$ is impossible and thus some other criteria have to be formulated to define good prediction procedures. The parallelism with the problem of defining good estimation procedures, discussed in Section 3, is very close except that here the object of interest, $\tilde z$, is random.

In the sampling theory approach, prediction procedures are evaluated in terms of their sampling properties in repeated (actual or hypothetical) samples. That is, the sampling properties of a predictor, $\varphi(\tilde x)$, are considered in defining good or optimal prediction procedures, which involves the choice of an explicit functional form for $\varphi(\tilde x)$. Note that use of the term "point predictor" or "predictor" implies that the random function $\varphi(\tilde x)$ is being considered, while use of the term "point prediction" or "prediction" implies that the non-random function $\varphi(x)$ is being considered.
Some properties of predictors are reviewed below with $\tilde e = \tilde z - \varphi(\tilde x)$ the random forecast error vector. For brevity of notation, $\varphi(\tilde x)$ will be denoted by $\hat\varphi$.

(1) Minimal MSE predictor. If $l'\hat\varphi$, where $l$ is a given vector of rank one, has minimal MSE as a predictor of $l'\tilde z$, then $\hat\varphi$ is a minimal MSE predictor.

(2) Unbiasedness. If $E\tilde e = 0$, then $\hat\varphi$ is an unbiased predictor. If $E\tilde e \ne 0$, then $\hat\varphi$ is a biased predictor.

(3) Linearity. If $\hat\varphi = A\tilde x$, where $A$ is a given matrix, then $\hat\varphi$ is a linear predictor.

(4) Minimum variance linear unbiased (MVLU) predictor. Consider $l'\tilde z$, where $l$ is a given $q\times 1$ vector of rank one, and the class of linear, unbiased predictors, $\hat\varphi_u = A_u\tilde x$, with $A_u$, not unique, such that $E(\tilde z - A_u\tilde x) = 0$. If $\mathrm{var}[l'(\tilde z - A_u\tilde x)]$ is minimized by taking $A_u = A_*$, then $\hat\varphi_* = A_*\tilde x$ is the MVLU predictor.

(5) Prediction risk. If $L(\tilde e)$ is a convex loss function, then the risk associated with $\hat\varphi$ relative to $L(\tilde e)$ is $r(\theta) = EL(\tilde e)$. For example, if $L(\tilde e) = \tilde e'Q\tilde e$, where $Q$ is a given $q\times q$ positive definite symmetric matrix, the risk associated with $\hat\varphi$ is $E\tilde e'Q\tilde e = \Delta'Q\Delta + \mathrm{tr}\,QV(\tilde e)$, where $\Delta = E\tilde e$, the bias vector, and $V(\tilde e) = E(\tilde e - \Delta)(\tilde e - \Delta)'$, the covariance matrix of $\tilde e$. In the case of a scalar forecast error $\tilde e$, with $L(\tilde e) = \tilde e^2$,

$$ E\tilde e^2 = \Delta^2 + \mathrm{var}(\tilde e), \qquad (5.1) $$

where $E\tilde e = \Delta$ is the prediction bias and $\mathrm{var}(\tilde e)$ is the variance of the forecast error.

(6) Admissible predictor. Let $r_1(\theta) = EL(\tilde e_1)$ be the risk associated with predictor $\hat\varphi_1$ and $r_a(\theta) = EL(\tilde e_a)$ be the risk associated with any other predictor. If there does not exist another predictor such that $r_a(\theta) \le r_1(\theta)$ for all $\theta$ in the parameter space, then the predictor $\hat\varphi_1$ is admissible relative to the loss function $L$. If another predictor exists such that $r_a(\theta) \le r_1(\theta)$ for all $\theta$ in the parameter space, with $r_a(\theta) < r_1(\theta)$ for some values of $\theta$, then $\hat\varphi_1$ is inadmissible relative to $L$.

(7) Robust predictor. A robust predictor is a predictor that performs well in the presence of model specification errors and/or in the presence of unusual data.

45 Statistical tests of these assumptions, e.g. the i.i.d. assumption, can be performed.

46 For further discussion of sampling theory prediction and forecasting techniques for a range of problems, see Granger and Newbold (1977).
Much of what has been said above with respect to criteria for choice of estimators is also applicable to choice of predictors. Unfortunately, minimal MSE predictors do not in general exist. The unbiasedness property alone does not lead to a unique predictor, and insisting on unbiasedness may be costly in terms of prediction MSE. In terms of (5.1), it is clear that a slightly biased predictor with a very small prediction error variance can be better in terms of MSE than an unbiased predictor with a large prediction error variance. Also, as with admissible estimators, there usually are many admissible predictors relative to a given loss function. Imposing the condition that a predictor be linear and unbiased in order to obtain a MVLU predictor can be costly in terms of MSE. For many prediction problems, non-linear biased Stein-like predictors have lower MSE than do MVLU predictors; see, for example, Efron and Morris (1975). Finally, it is desirable that predictors be robust, and what has been said above about robust estimators can be adapted to apply to predictors' properties.
To illustrate a close connection between estimation and prediction, consider the standard multiple regression model $\tilde y = X\beta + \tilde u$, where $\tilde y$ is $n \times 1$, $X$ is a given non-stochastic $n \times k$ matrix of rank $k$, $\beta$ is a $k \times 1$ vector of parameters, and $\tilde u$ is an $n \times 1$ disturbance vector with $E\tilde u = 0$ and $E\tilde u\tilde u' = \sigma^2 I_n$. Let a future scalar observation $\tilde z$ be generated by $\tilde z = w'\beta + \tilde v$, where $w'$ is a $1 \times k$ given⁴⁷ vector of rank one and $\tilde v$ is a future disturbance term, uncorrelated with the elements of $\tilde u$, with $E\tilde v = 0$ and $E\tilde v^2 = \sigma^2$. Then a predictor of $\tilde z$, denoted by $\hat z$, is given by $\hat z = w'\hat\beta$, where $\hat\beta$ is an estimator for $\beta$. Then $\tilde e = \hat z - \tilde z = w'(\hat\beta - \beta) - \tilde v$ and $E\tilde e = w'E(\hat\beta - \beta)$, so that if $E(\hat\beta - \beta) = 0$, that is, $\hat\beta$ is unbiased, $E\tilde e = 0$ and the predictor $\hat z = w'\hat\beta$ is unbiased. Furthermore, if $\hat\beta$ is a linear estimator, say $\hat\beta = A\tilde y$, then $\hat z$ is a linear predictor. Further, relative to $L(\tilde e) = \tilde e^2$, the risk of $\hat z$ in general is $E\tilde e^2 = E[w'(\hat\beta - \beta) - \tilde v][(\hat\beta - \beta)'w - \tilde v] = w'E(\hat\beta - \beta)(\hat\beta - \beta)'w + \sigma^2$.⁴⁸ If in this last expression $\hat\beta = (X'X)^{-1}X'\tilde y$, the MVLU least squares estimator, then $E\tilde e = 0$ and $E\tilde e^2$ assumes a minimal value in the class of unbiased linear predictors, that is, $\hat z = w'A_u\tilde y$, where $A_u = (X'X)^{-1}X' + C'$ and $C'$ is such that $C'X = 0$. Finally, if $\hat\beta$ is an estimator such that $w'E(\hat\beta - \beta)(\hat\beta - \beta)'w < w'E(\bar\beta - \beta)(\bar\beta - \beta)'w$ for all values of $\beta$, where $\bar\beta$ is another estimator, then $\hat z = w'\bar\beta$ is an inadmissible predictor relative to $L(\tilde e) = \tilde e^2$.

47 For some analysis of this problem when $w$ is random, see Feldstein (1971).
48 Note that $Ew'(\hat\beta - \beta)\tilde v = 0$ if the elements of $\hat\beta - \beta$ and $\tilde v$ are uncorrelated, as they are under the above assumptions if $\hat\beta$ is a linear estimator. On the other hand, if $\hat\beta$ is a non-linear estimator, sufficient conditions for this result to hold are that the elements of $\tilde u$ and $\tilde v$ are independently distributed and $Ew'(\hat\beta - \beta)\tilde v$ is finite.
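As an illustration, the following sketch computes the least squares point prediction $\hat z = w'\hat\beta$ and the forecast error variance $\sigma^2[1 + w'(X'X)^{-1}w]$ implied by the risk expression above; the simulated data, the parameter values, and the vector $w$ are illustrative assumptions, not quantities taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generating process (illustrative values)
n, k = 50, 3
sigma = 2.0
beta = np.array([1.0, 0.5, -0.25])
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ beta + rng.normal(scale=sigma, size=n)

# Least squares estimator b = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
s2 = (y - X @ b) @ (y - X @ b) / (n - k)   # unbiased estimate of sigma^2

# Point prediction z_hat = w'b for a given w, and the estimated risk
# E e~^2 = sigma^2 [1 + w'(X'X)^{-1} w] under quadratic loss
w = np.array([1.0, 1.0, 2.0])              # illustrative regressor values
z_hat = w @ b
pred_error_var = s2 * (1.0 + w @ XtX_inv @ w)

print(f"point prediction    : {z_hat:.3f}")
print(f"forecast error var. : {pred_error_var:.3f}")
```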
As is seen from the above discussion, much traditional analysis has been carried forward under the assumption that the appropriate loss function is quadratic, for example $L(\tilde e) = \tilde e^2$ or $L(\tilde e) = \tilde e'Q\tilde e$. While such quadratic loss functions are appropriate or are a reasonable approximation in a number of problems, there are problems for which they are not. For example, Varian (1975) analyzed a problem employing an asymmetric "linex" loss function, $L(\tilde e) = b\exp(a\tilde e) - c\tilde e - b$, with $a = 0.0004$, $b = 2500$, and $c = 0.1$. He considered this asymmetric loss function more appropriate for his problem than $L(\tilde e) = \tilde e^2$. Also, as pointed out in the estimation section, bounded loss functions may be appropriate for some problems. Use of an appropriate loss function for prediction and other problems is important and solutions can be sensitive to the form of the loss function utilized [see, for example, Varian (1975), Zellner and Geisel (1968), and Zellner (1973)].
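To see how the solution can depend on the form of the loss function, the following sketch assumes a normal predictive distribution for $\tilde z$ and compares the point prediction minimizing expected quadratic loss with the one minimizing expected loss under a linex-type loss $L(e) = b[\exp(ae) - ae - 1]$; the constants $a = 1$ and $b = 1$, the normal predictive distribution, and the error convention $e = \hat z - z$ are illustrative assumptions (they are not Varian's values).

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed normal predictive distribution for the future observation z~
m, s = 2.0, 1.5
z_draws = rng.normal(m, s, size=100_000)

def linex(e, a=1.0, b=1.0):
    # Asymmetric linex-type loss in the forecast error e = z_hat - z
    return b * (np.exp(a * e) - a * e - 1.0)

# Grid search for the point prediction minimizing Monte Carlo expected loss
grid = np.linspace(m - 3 * s, m + 3 * s, 401)
exp_quad = [np.mean((zh - z_draws) ** 2) for zh in grid]
exp_linex = [np.mean(linex(zh - z_draws)) for zh in grid]

print("optimal under quadratic loss:", grid[np.argmin(exp_quad)])   # near the predictive mean m
print("optimal under linex loss    :", grid[np.argmin(exp_linex)])  # shifted below m, near m - a*s^2/2
```

The two optimal point predictions differ noticeably, which is the sense in which solutions are sensitive to the assumed loss structure.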
Sampling theory prediction bounds, intervals, and regions are available that relate to future observations; for discussion of these topics see, for example, Christ (1966, pp. 557-564) and Guttman (1970). It must be emphasized that sampling theory prediction intervals or regions are subsets of the sample space that have a specified probability of including the random future observations. For example, $\hat\varphi \pm c_\alpha(\tilde x)$ is a $(1 - \alpha) \times 100$ percent central predictive interval for $\tilde z$, a future random scalar observation, if $P[\hat\varphi - c_\alpha(\tilde x) < \tilde z < \hat\varphi + c_\alpha(\tilde x)] = 1 - \alpha$. Note that $\hat\varphi = \varphi(\tilde x)$, the predictor, $c_\alpha(\tilde x)$, and $\tilde z$ are all random in this probability statement. For a particular sample, $x$, the computed prediction interval is $\varphi(x) \pm c_\alpha(x)$, for example $50.2 \pm 1.1$, where $\varphi(x) = 50.2$ is the point prediction and $c_\alpha(x) = 1.1$. Other types of prediction intervals, for example random intervals that are constructed so that in repeated sampling the proportion of cases in which $\tilde z$ is included in the interval has a specified expected value with a given probability, or such that the proportion is not less than a specified value with given probability, are called tolerance intervals. See Christ (1966) and Guttman (1970) for further discussion and examples of tolerance intervals. Finally, in many econometric problems, exact prediction intervals and regions are not available and large sample approximate intervals and regions are often employed.
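A sketch of such an interval for the regression set-up used earlier follows; the data are simulated, and the half-width formula $t_{\alpha/2,\,n-k}\, s\,[1 + w'(X'X)^{-1}w]^{1/2}$ is the usual central prediction interval under normality, stated here as a standard result rather than one derived above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulated normal regression data (illustrative)
n, k = 40, 2
beta, sigma = np.array([1.0, 0.8]), 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ beta + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
s = np.sqrt((y - X @ b) @ (y - X @ b) / (n - k))

# Central (1 - alpha) prediction interval for a future z~ = w'beta + v~
alpha = 0.05
w = np.array([1.0, 0.5])
z_hat = w @ b
half_width = stats.t.ppf(1 - alpha / 2, df=n - k) * s * np.sqrt(1.0 + w @ XtX_inv @ w)

print(f"point prediction {z_hat:.2f}, 95% prediction interval "
      f"({z_hat - half_width:.2f}, {z_hat + half_width:.2f})")
```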

5.2. Bayesian prediction techniques

Central in the Bayesian approach to prediction is the predictive pdf for $\tilde z$, $p(z|D)$, where $\tilde z$ is a vector of future observations and $D$ denotes the sample, $x$, and prior information. To derive the predictive pdf, let $\tilde z$ and $\tilde x$ be independent⁴⁹ with pdfs $g(z|\theta)$ and $f(x|\theta)$, where $\theta \in \Theta$ is a vector of parameters with posterior pdf $h(\theta|D) = c\pi(\theta)f(x|\theta)$, where $c$ is a normalizing constant and $\pi(\theta)$ is the prior pdf. Then

$$p(z|D) = \int_\Theta g(z|\theta)\,h(\theta|D)\,\mathrm{d}\theta \qquad (5.2)$$

is the predictive pdf for $\tilde z$. Note that (5.2) involves an averaging of the conditional pdfs $g(z|\theta)$, with $h(\theta|D)$, the posterior pdf for $\theta$, serving as the weight function. Also, $p(z|D)$ incorporates both sample and prior information reflected in $h(\theta|D)$. For examples of explicit predictive pdfs for regression and other models, see Aitchison and Dunsmore (1975), Box and Tiao (1973), Guttman (1970), and Zellner (1971).

49 Independence, assumed here for simplicity, can be relaxed.
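The averaging in (5.2) can be carried out numerically by drawing $\theta$ from the posterior and averaging the conditional pdfs $g(z|\theta)$. The sketch below does this for a normal mean with known variance and a conjugate normal prior, all of which are illustrative assumptions, and checks the result against the exact predictive pdf available in this conjugate case.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Illustrative model: x_i | theta ~ N(theta, sigma^2), sigma known,
# prior theta ~ N(mu0, tau0^2); a future z | theta ~ N(theta, sigma^2).
sigma, mu0, tau0 = 1.0, 0.0, 2.0
x = rng.normal(1.0, sigma, size=30)
n, xbar = len(x), x.mean()

# Normal posterior h(theta | D) from conjugate updating
post_prec = 1 / tau0**2 + n / sigma**2
post_var = 1 / post_prec
post_mean = post_var * (mu0 / tau0**2 + n * xbar / sigma**2)

# Predictive pdf p(z | D) = integral of g(z|theta) h(theta|D) dtheta,
# approximated by averaging g(z|theta) over posterior draws of theta
theta_draws = rng.normal(post_mean, np.sqrt(post_var), size=50_000)
z_grid = np.linspace(xbar - 4, xbar + 4, 9)
p_mc = np.array([stats.norm.pdf(z, theta_draws, sigma).mean() for z in z_grid])

# Exact predictive pdf in this conjugate case: N(post_mean, post_var + sigma^2)
p_exact = stats.norm.pdf(z_grid, post_mean, np.sqrt(post_var + sigma**2))
print(np.round(p_mc, 4))
print(np.round(p_exact, 4))
```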
If $z$ is partitioned as $z' = (z_1', z_2')$, the marginal predictive pdf for $z_1$ can be obtained from (5.2) by analytical or numerical integration. Also, a pdf for $z_2$ given $z_1$ and/or the distribution of functions of $z$ can be derived from (5.2).
If a point prediction of $\tilde z$ is desired, the mean or modal value of (5.2) might be used. If a convex prediction loss function $L(\tilde z, \hat z)$ is available, where $\hat z = \hat z(D)$ is some point prediction depending on the given sample $x$ and prior information, Bayesians choose $\hat z$ so as to minimize expected loss, that is, solve the following problem:

$$\min_{\hat z} \int L(z, \hat z)\,p(z|D)\,\mathrm{d}z. \qquad (5.3)$$

The solution, say $\hat z = \hat z^*(D)$, is the Bayesian point prediction relative to the loss function $L(\tilde z, \hat z)$. For example, if $L(\tilde z, \hat z) = (\tilde z - \hat z)'Q(\tilde z - \hat z)$, with $Q$ a given positive definite symmetric matrix, the optimal $\hat z$ is $\hat z^* = E(\tilde z|D)$, the mean of the predictive pdf in (5.2).⁵⁰ For other loss functions, Bayesian minimum expected loss point predictions can be obtained [see Aitchison and Dunsmore (1975), Litterman (1980), and Varian (1975) for examples]. Prediction intervals and regions can be computed from (5.2) in the same way that posterior intervals and regions are computed for parameters, as described above. These prediction intervals and regions are dependent on the given data $D$ and hence are not viewed as random. For example, in the case of a scalar future observation, $\tilde z$, given the predictive pdf for $\tilde z$, $p(z|D)$, the probability that $b < \tilde z < a$ is just $\int_b^a p(z|D)\,\mathrm{d}z = 1 - \alpha$. If $a$ and $b$ are given, $1 - \alpha$ can be calculated. If $1 - \alpha$ is given, then $a$ and $b$ are not uniquely determined; however, by requiring that the length $a - b$ be a minimum subject to a given $1 - \alpha$, unique values of $a$ and $b$ can be obtained.
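Given draws from a predictive pdf, the quadratic-loss point prediction (the predictive mean) and the shortest interval of given content can be computed directly, as in the following sketch; the normal predictive distribution used to generate the draws is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Draws from an assumed predictive pdf p(z | D) (illustrative normal case)
z = np.sort(rng.normal(2.0, 1.5, size=100_000))

# Point prediction under quadratic loss: the predictive mean E(z | D)
z_star = z.mean()

# Shortest interval (b, a) with predictive probability 1 - alpha:
# among intervals containing a fraction 1 - alpha of the sorted draws,
# pick the one of minimal length a - b
alpha = 0.05
m = int(np.floor((1 - alpha) * len(z)))
widths = z[m:] - z[: len(z) - m]
i = np.argmin(widths)
b, a = z[i], z[i + m]

print(f"point prediction {z_star:.2f}; shortest 95% interval ({b:.2f}, {a:.2f})")
```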
To this point, all results in this subsection are for given data $x$ and given prior information. The sampling properties of Bayesian procedures are of interest, particularly before $\tilde x$ is observed and also in characterizing average properties of procedures. In this regard, the solution $\hat z^*$ to the problem in (5.3) can be viewed as a Bayesian predictor, random since it is a function of $\tilde x$. For brevity, write a predictor as $\hat z = \hat z(\tilde x)$. Then the prediction risk function, $r(\theta)$, relative to the loss function $L(\tilde z, \hat z)$, is

$$r(\theta) = \int\!\!\int L(z, \hat z)\,f(x|\theta)\,g(z|\theta)\,\mathrm{d}x\,\mathrm{d}z, \qquad (5.4)$$

where the integrations are over the sample spaces of $\tilde x$ and $\tilde z$. Risk, $r(\theta)$, can be computed for alternative predictors, $\hat z = \hat z(\tilde x)$. The Bayesian predictor is the one, if it exists, that minimizes average risk, $AR = \int_\Theta r(\theta)\pi(\theta)\,\mathrm{d}\theta$, where $\pi(\theta)$ is a prior for $\theta$. If $AR$ is finite, then the Bayesian predictor is admissible and also is given by the solution to the problem in (5.3).
From what has been presented, it is the case that both sampling theory and
Bayesian techniques are available for predictive inference. As with estimation, the
approaches differ in terms of justifications for procedures and in that the
Bayesian approach employs a prior distribution, whereas it is not employed in
the sampling theory approach. Further, in both approaches predictive inference
has been discussed in terms of given models for the observations. Since there is
often uncertainty about models' properties, it is important to have testing
procedures that help determine the forms of models for observations. In the next section, general features of testing procedures are presented.

50 The proof is very similar to that presented above in connection with Bayesian parameter estimation with a quadratic loss function.

6. Statistical analysis of hypotheses

Statistical procedures for analyzing and testing hypotheses, that is, hypothesized
probability models for observations, are important in work to obtain satisfactory
econometric models that explain past economic behavior well, predict future
behavior reliably, and are dependable for use in analyzing economic policies. In
this connection, statistical theory has yielded general procedures for analyzing
hypotheses and various justifications for them. In what follows, some basic results
in this area will be reviewed.

6.1. Types of hypotheses

Relative to the general probability model $(X, \Theta, p(x|\theta))$, hypotheses can relate to the value of $\theta$, or a subvector of $\theta$, and/or to the form of $p(x|\theta)$. For example, $\theta = 0$ or $\theta = c$, a given vector, are examples of simple hypotheses, that is, hypotheses that completely specify the parameter vector $\theta$ appearing in $p(x|\theta)$. On the other hand, some hypotheses about the value of $\theta$ do not completely specify $p(x|\theta)$. For example, $\theta \in \omega$, a subspace of $\Theta$, does not imply a particular value for $\theta$ and thus is not a simple hypothesis but rather is termed a composite hypothesis. In terms of a scalar parameter $\theta \in \Theta$, where $\Theta$ is the entire real line, $\theta = 0$ is a simple hypothesis, whereas $\theta < 0$ and $\theta > 0$ are composite hypotheses. Further, it is often the case that two or more hypotheses are considered. For example, $\theta = 0$ and $\theta = 1$, two simple hypotheses, or $\theta = 0$ and $\theta > 0$, a simple hypothesis and a composite hypothesis, or $\theta > 0$ and $\theta < 0$, two composite hypotheses, may be under study. Finally, various forms for $p(x|\theta)$ may be hypothesized; for example, $p[(x - \mu_1)/\sigma_1]$ normal or $p[(x - \mu_2)/\sigma_2]$ double-exponential are two alternative hypotheses regarding the form of $p(\cdot)$ with the same parameter space $\Theta$: $-\infty < \mu_i < \infty$ and $0 < \sigma_i < \infty$, $i = 1, 2$. In other cases, $p(x|\theta)$ and $g(x|\eta)$ may be two alternative hypothesized forms for the pdf for the observations involving parameter vectors $\theta$ and $\eta$ and their associated parameter spaces. Finally, if the probability model is expanded to include a prior pdf for $\theta$, denoted by $p(\theta)$, different $p(\theta)$'s can be viewed as hypotheses. For example, for a scalar $\theta$, $p_1(\theta)$ in the form of a normal pdf with given mean $\bar\theta_1$ and given variance $\sigma_1^2$, that is $\theta \sim N(\bar\theta_1, \sigma_1^2)$, and $p_2(\theta)$ with $\theta \sim N(\bar\theta_2, \sigma_2^2)$, where $\bar\theta_2$ and $\sigma_2^2$ are given, can be viewed as hypotheses.
Whatever the hypothesis or hypotheses, statistical testing theory provides
procedures for deciding whether sample observations are consistent or incon-
sistent with a hypothesis or set of hypotheses. Just as with estimation and
prediction procedures, it is desirable that testing procedures have reasonable
justifications and work well in practice. It is to these issues that we now turn.

6.2. Sampling theory testing procedures

The Neyman-Pearson (NP) sampling theory testing procedures are widely utilized and described in most statistics and econometrics textbooks.⁵¹ In the NP approach, two hypotheses are considered, which in terms of a scalar parameter $\theta \in \Theta$ can be described by $\theta \in \omega$ and $\theta \in \Theta - \omega$, where $\omega$ is a subspace of $\Theta$ and $\Theta - \omega$ denotes the region of $\Theta$ not including $\omega$, that is, the complement of $\omega$. For example, $\theta = 0$ and $\theta \neq 0$, with $\Theta$ the entire real line, might be two hypotheses under consideration. Usually $\theta \in \omega$ is called the "null hypothesis" and $\theta \in \Theta - \omega$ the "alternative hypothesis", nomenclature that suggests an asymmetric view of the hypotheses. In NP theory, the sample space $X$ is partitioned into two regions: (1) the "region of acceptance", or the region in which outcomes are thought to be consistent with $\theta \in \omega$, the null hypothesis, and (2) a "region of rejection", or a region, complementary to the "acceptance region", in which outcomes are thought to be inconsistent with the null hypothesis. This "rejection region" is usually called "the critical region" of the test. Example 6.1 illustrates these concepts.

Example 6.1
Let $\tilde x_i$, $i = 1, 2, \ldots, n$, be NID$(\theta, 1)$ with $\Theta$: $-\infty < \theta < \infty$, and consider the null hypothesis $H_0$: $\theta = \theta_0$ and the alternative hypothesis $H_1$: $\theta \neq \theta_0$. Here $\omega \subset \Theta$ is $\theta = \theta_0$ and $\Theta - \omega$ is $\theta \neq \theta_0$. Suppose that we consider the random sample mean $\bar{\tilde x} = \sum_{i=1}^n \tilde x_i/n$. A "region of acceptance" might be $|\bar{\tilde x} - \theta_0| \le c$ and a "region of rejection", or critical region, $|\bar{\tilde x} - \theta_0| > c$, where $c$ is a given constant. Thus, given the value of $c$, the sample space is partitioned into two regions.
Two major questions raised by Example 6.1 are: Why use $\bar{\tilde x}$ in constructing the regions and on what grounds can the value of $c$ be selected? In regard to these questions, NP theory recognizes two types of errors that can be made in testing $\theta \in \omega$ and $\theta \in \Theta - \omega$. An error of type I, or of the first kind, is rejecting⁵² $\theta \in \omega$ when it is true, while an error of type II, or of the second kind, is accepting $\theta \in \omega$ when it is false. The operating characteristics of a NP test are functions that describe probabilities of type I and type II errors. Let $\tilde t = t(\tilde x)$ be a test statistic, for example $\bar{\tilde x}$ in Example 6.1, and let $R$ be the region of rejection, a subset of the sample space. Then $\alpha(\theta) = P(\tilde t \in R \mid \theta \in \omega)$ is the probability of a type I error expressed as a function of $\theta$, which specializes to $\alpha(\theta) = P(|\bar{\tilde x} - \theta_0| \ge c \mid \theta = \theta_0)$ in terms of Example 6.1. The probability of a type II error is given by $\beta(\theta) = P(\tilde t \in \bar R \mid \theta \in \Theta - \omega) = 1 - P(\tilde t \in R \mid \theta \in \Theta - \omega)$, where $\bar R$ is the region of acceptance, the complement of $R$. In terms of Example 6.1, $\beta(\theta) = P(|\bar{\tilde x} - \theta_0| \le c \mid \theta \neq \theta_0) = 1 - P(|\bar{\tilde x} - \theta_0| > c \mid \theta \neq \theta_0)$. The function $1 - \beta(\theta)$, the probability of rejecting $\theta \in \omega$ when $\theta \in \Theta - \omega$ is true, is called the power function of the test.

51 For a detailed account see, for example, Lehmann (1959).
52 The common terminology "reject" and "accept" will be employed even though "inconsistent with" and "consistent with" the data appear to be preferable.
A test with minimal probabilities of type I and type II errors, that is, minimal $\alpha(\theta)$ and $\beta(\theta)$, would be ideal in the NP framework. Unfortunately, such tests do not exist. What NP do to meet this problem is to look for tests with minimal value for $\beta(\theta)$, the probability of type II error, subject to the condition that for all $\theta \in \omega$, $\alpha(\theta) \le \alpha$, a given value, usually small, say 0.05, the "significance level of the test".⁵³ By minimizing $\beta(\theta)$, of course, the power of the test, $1 - \beta(\theta)$, is maximized. A test meeting these requirements is called a uniformly most powerful (UMP) test.
Unfortunately, except in special cases, uniformly most powerful tests do not exist. In the case of two simple hypotheses, that is, $\Theta = (\theta_1, \theta_2)$ with $\omega$: $\theta = \theta_1$ and $\Theta - \omega$: $\theta = \theta_2$, and data pdf $p(x|\theta)$, the famous NP lemma⁵⁴ indicates that a test based on the rejection region $t(x) = p(x|\theta_1)/p(x|\theta_2) \ge k_\alpha$, where $k_\alpha$ satisfies $P[t(\tilde x) \ge k_\alpha \mid \theta = \theta_1] = \alpha$, with $\alpha$ given, is a UMP test. This is of great interest since in this case $t(x)$ is the likelihood ratio and thus the NP lemma provides a justification for use of the likelihood ratio in appraising two simple hypotheses. When composite hypotheses are considered, say $\theta \neq \theta_0$, it is usually the case that UMP tests do not exist. One important exception to this statement is, in terms of Example 6.1, testing $\theta = \theta_0$ against $\theta > \theta_0$. Then with $\sqrt{n}(\bar{\tilde x} - \theta_0)$ as the test statistic and using $\sqrt{n}(\bar{\tilde x} - \theta_0) > k_\alpha$ as the region of rejection, where $k_\alpha$ is determined so that $P(\sqrt{n}(\bar{\tilde x} - \theta_0) > k_\alpha \mid \theta = \theta_0) = \alpha$, for given $\alpha$, this test can be shown to be UMP.⁵⁵ Similarly, a UMP test of $\theta = \theta_0$ against $\theta < \theta_0$ exists for this problem. However, a UMP test of $\theta = \theta_0$ against $\theta \neq \theta_0$ does not exist. That is, using $\sqrt{n}\,|\bar{\tilde x} - \theta_0| \ge k_\alpha$ as a region of rejection with $k_\alpha$ such that $P(\sqrt{n}\,|\bar{\tilde x} - \theta_0| > k_\alpha \mid \theta = \theta_0) = \alpha$, given $\alpha$, is not a UMP test.
Given that UMP tests are not usually available for many testing problems, two further conditions have been utilized to narrow the range of candidate tests. First, only unbiased tests are considered. A test is an unbiased $\alpha$-level test if its operating characteristics satisfy $\alpha(\theta) \le \alpha$ for all $\theta \in \omega$ and $1 - \beta(\theta) \ge \alpha$ for all $\theta \in \Theta - \omega$. This requirement seems reasonable since it implies $1 - \alpha \ge \beta(\theta)$, that is, that the probability, $1 - \alpha$, of accepting $\theta \in \omega$ when it is true is greater than or equal to the probability, $\beta(\theta)$, $\theta \in \Theta - \omega$, of accepting it when it is false. Many tests of a null hypothesis, $\theta = \theta_1$, with $\theta_1$ given, against composite hypotheses $\theta \neq \theta_1$ are UMP unbiased tests. In terms of Example 6.1, the test statistic $|\bar{\tilde x} - \theta_0|\sqrt{n}$ with rejection region $|\bar{\tilde x} - \theta_0|\sqrt{n} \ge k_\alpha$ is a UMP unbiased test of $\theta = \theta_0$ against $\theta \neq \theta_0$. See Lehmann (1959, ch. 4-5) for many examples of UMP unbiased tests.

53 Some call $\alpha$ the "size of the test".
54 See Lehmann (1959, p. 65 ff.) for proof of the NP lemma.
55 Note that $\bar{\tilde x}$ is normal with mean $\theta_0$ and variance $1/n$ under the null hypothesis $\theta = \theta_0$. Thus, $\sqrt{n}(\bar{\tilde x} - \theta_0)$ is N(0,1) under $\theta = \theta_0$ and tables of the normal distribution can be utilized to evaluate $k_\alpha$ for given $\alpha$, say $\alpha = 0.10$.
It is also interesting to note that $|\bar{\tilde x} - \theta_0|\sqrt{n} < k_\alpha$ can be written as $\bar{\tilde x} - k_\alpha/\sqrt{n} < \theta_0 < \bar{\tilde x} + k_\alpha/\sqrt{n}$ and that, given $\theta = \theta_0$, the probability that $\bar{\tilde x} \pm k_\alpha/\sqrt{n}$ covers $\theta_0$ is $1 - \alpha$. Thus, there is a close mathematical relationship between test statistics and confidence intervals, discussed above, and in many cases optimal tests produce optimal intervals (in a shortness sense). However, there is a fundamental difference in that in testing $\theta = \theta_0$, $\theta_0$ is given a specific value, often $\theta_0 = 0$, which is of special interest. On the other hand, with a confidence interval or interval estimation problem the value of $\theta_0$ is not specified; that is, $\theta_0$ is the true unknown value of $\theta$. Thus, if $\bar{\tilde x} \pm k_\alpha/\sqrt{n}$, with $\alpha = 0.05$, assumes the value $0.32 \pm 0.40$, this is a 95 percent confidence interval for $\theta_0$ that extends from $-0.08$ to $0.72$. That the interval includes the value 0 does not necessarily imply that $\theta_0 = 0$. It may be that $\theta_0 \neq 0$ and the precision of estimation is low. In terms of testing $\theta = \theta_0 = 0$, the result $0.32 \pm 0.40$ implies that the test statistic assumes a value in the region of acceptance, $|\bar{\tilde x}|\sqrt{n} < k_\alpha$, and would lead to the conclusion that the data are consistent with $\theta = \theta_0 = 0$. In NP theory, however, this is an incomplete reporting of results. The power of the test must be considered. For example, if $\theta = \pm 0.20$ represent important departures from $\theta = 0$, the probabilities of rejecting $\theta = 0$ when $\theta = \pm 0.20$, that is, $1 - \beta(0.2)$ and $1 - \beta(-0.2)$, should be reported. Under the above conditions, these probabilities are quite low and thus the test is not very powerful relative to important departures from $\theta = 0$. More data are needed to get a more powerful test and more precise estimates.
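The claim that these probabilities are quite low can be checked directly. Treating the reported interval $0.32 \pm 0.40$ with $\alpha = 0.05$ as given, so that $k_\alpha/\sqrt{n} = 0.40$, the following sketch evaluates the power function of the two-sided test at $\theta = \pm 0.20$.

```python
from scipy import stats

# Implied by the interval 0.32 +/- 0.40 with alpha = 0.05:
k_alpha = stats.norm.ppf(0.975)        # approximately 1.96
sd_xbar = 0.40 / k_alpha               # since k_alpha / sqrt(n) = 0.40, this is 1/sqrt(n)

def power(theta, c=0.40, sd=sd_xbar):
    # P(|xbar| >= c) when xbar ~ N(theta, sd^2): probability of rejecting theta = 0
    return stats.norm.sf((c - theta) / sd) + stats.norm.cdf((-c - theta) / sd)

print(f"size  at theta =  0.0 : {power(0.0):.3f}")    # ~ 0.05 by construction
print(f"power at theta =  0.2 : {power(0.2):.3f}")
print(f"power at theta = -0.2 : {power(-0.2):.3f}")
```

With these numbers the power at $\theta = \pm 0.20$ is only about 0.16, which is the sense in which the test is not very powerful against these departures.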
The above discussion reveals an important dependence of a test's power on the sample size. Generally, for given $\alpha$, the power increases with $n$. Thus, to "balance" the probabilities of errors of type I and II as $n$ increases requires some adjustment of $\alpha$. See DeGroot (1970) for a discussion of this problem.⁵⁶
A second way of delimiting candidate tests is to require that tests be invariant, that is, invariant to a certain group of transformations. See Lehmann (1959, ch. 6-7) for discussion of UMP invariant tests, which include the standard t and F tests employed to test hypotheses about regression coefficients; these are UMP invariant tests under particular groups of linear transformations. They are also UMP unbiased tests. In a remarkable theorem, Lehmann (1959, p. 229) shows that if there exists a unique UMP unbiased test for a given testing problem and there also exists a UMP almost invariant⁵⁷ test with respect to some group of transformations G, then the latter is also unique and the two coincide almost everywhere.

56 Very few econometrics and statistics texts treat this problem.
57 See Lehmann (1959, p. 225) for a definition of "almost invariant".

Example 6.2
In terms of the normal regression model of Example 4.2, to test $H_0$: $\beta_i = \beta_{i0}$, a given value, against $H_1$: $\beta_i \neq \beta_{i0}$ with all other regression coefficients and $\sigma$ unrestricted, the test statistic $t = (\hat\beta_i - \beta_{i0})/s_{\hat\beta_i}$ has a univariate Student-t pdf with $\nu$ d.f. Then $|t| \ge k_\alpha$, where $k_\alpha$ is such that $\Pr(|t| \ge k_\alpha \mid \beta_i = \beta_{i0}) = \alpha$, with $\alpha$, the significance level, given, is a rejection region. Such a test is a UMP unbiased and invariant (with respect to a group of linear transformations) $\alpha$-level test. In a similar fashion, from Example 4.4, the statistic $F = (\hat\beta - \beta_0)'X'X(\hat\beta - \beta_0)/ks^2$, which has an F pdf with $k$ and $\nu$ d.f. under $H_0$: $\beta = \beta_0$, can be used to test $H_0$ against $H_1$: $\beta \neq \beta_0$ with $\sigma^2$ unrestricted under both hypotheses. The rejection region is $F \ge k_\alpha$, with $k_\alpha$ such that $P(F \ge k_\alpha) = \alpha$, a given value. This test is a UMP unbiased and invariant $\alpha$-level test.
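A sketch of the t-test of Example 6.2 on simulated data follows; the design matrix, the parameter values, and the coefficient tested are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Simulated normal regression data (illustrative)
n, k = 60, 3
beta, sigma = np.array([0.5, 1.0, 0.0]), 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ beta + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
s2 = (y - X @ b) @ (y - X @ b) / (n - k)

# Test H0: beta_i = 0 against H1: beta_i != 0 for the last coefficient
i, beta_i0, alpha = 2, 0.0, 0.05
se_i = np.sqrt(s2 * XtX_inv[i, i])
t_stat = (b[i] - beta_i0) / se_i
k_alpha = stats.t.ppf(1 - alpha / 2, df=n - k)

print(f"t = {t_stat:.2f}, critical value {k_alpha:.2f}, "
      f"reject H0: {abs(t_stat) > k_alpha}")
```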
In many testing problems, say those involving hypotheses about the values of parameters of time series models, or of simultaneous equations models, exact tests are generally not available. In these circumstances, approximate large sample tests, for example approximate likelihood ratio (LR), Wald (W), and Lagrange Multiplier (LM) tests, are employed. For example, let $\theta' = (\theta_1', \theta_2')$ be a parameter vector appearing in a model with likelihood function $p(x|\theta)$ and let the null hypothesis be $H_0$: $\theta_1 = \theta_{10}$ and $\theta_2$ unrestricted, and $H_1$: $\theta_1$ and $\theta_2$ both unrestricted. Then $\lambda(x)$, the approximate LR, is defined to be

$$\lambda(x) = p(x|\theta_{10}, \hat{\hat\theta}_2)/p(x|\hat\theta_1, \hat\theta_2), \qquad (6.1)$$

where $\hat{\hat\theta}_2$ is the value of $\theta_2$ that maximizes $p(x|\theta_{10}, \theta_2)$, the restricted likelihood function (LF), while $(\hat\theta_1, \hat\theta_2)$ is the value of $(\theta_1, \theta_2)$ that maximizes $p(x|\theta_1, \theta_2)$, the unrestricted LF. Since the numerator of (6.1) is less than or equal to the denominator, given that the numerator is the result of a restricted maximization, $0 < \lambda(x) \le 1$. The larger $\lambda(x)$, the "more likely" that the restriction $\theta_1 = \theta_{10}$ is consistent with the data, using a relative maximized LF criterion for the meaning of "more likely". In large samples, under regularity conditions and $H_0$, $-2\log\lambda(\tilde x) = \tilde\chi^2$ has an approximate $\chi^2$ pdf with $q$ d.f., where $q$ is the number of restrictions implied by $H_0$, here equal to the number of elements of $\theta_1$. Then a rejection region is $\tilde\chi^2 \ge k_\alpha$, where $k_\alpha$ is such that $P(\tilde\chi^2 \ge k_\alpha|H_0) \doteq \alpha$, the given significance level.⁵⁸ Many hypotheses can be tested approximately in this approach given that regularity conditions needed for $-2\log\lambda(\tilde x)$ to be approximately distributed as $\chi^2_q$ in large samples under $H_0$ are satisfied.

58 $k_\alpha$ is obtained from the tables for $\chi^2_q$. Since $\tilde\chi^2$ is only approximately distributed in the $\chi^2_q$ form, the significance level is approximately $\alpha$.
In the Wald large sample approach to the test of, for example, $H_0$: $\theta = \theta_0$ against $H_1$: $\theta \neq \theta_0$, let $\hat\theta$ be a ML estimator for $\theta$ that, under $H_0$, is known to be approximately normally distributed in large samples with asymptotic mean $\theta_0$ and large sample approximate covariance matrix $V = I_{\hat\theta}^{-1}$, where $I_{\hat\theta}$ is Fisher's information matrix evaluated at the ML estimate $\hat\theta$. Then under $H_0$ the test statistic $\tilde W = (\hat\theta - \theta_0)'\hat V^{-1}(\hat\theta - \theta_0)$ has an approximate large sample $\chi^2_q$ pdf, where $q$ is the number of restrictions implied by $H_0$.⁵⁹ Then a rejection region is $\tilde W \ge k_\alpha$, where $k_\alpha$ is determined such that $P(\tilde W \ge k_\alpha) \doteq \alpha$, the given significance level, with $k_\alpha$ obtained from tables of the $\chi^2_q$ pdf.

59 Any consistent, asymptotically efficient estimator can be employed in place of the ML estimator.
In the LM large sample approach to testing restrictions on $\theta$, for example $H_0$: $\theta_1 = \theta_{10}$ against $H_1$: $\theta_1 \neq \theta_{10}$ with $\theta_2$ unrestricted under both hypotheses, where $\theta' = (\theta_1', \theta_2')$, use is made of the fact that if $H_0$ is true, then the restricted and unrestricted estimates of $\theta$ will be very close to each other in large samples. If the log LF is regular, then the partial derivatives of this function at the restricted maximizing values will tend to be small. On the other hand, if $H_0$ is false these partial derivatives will not in general be small. Let $\hat\theta_r' = (\theta_{10}', \hat{\hat\theta}_2')$ be the ML estimate for the restricted log LF, $L_R = \log p(x|\theta_{10}, \theta_2)$. Then it can be shown that

$$(\partial \log L_R/\partial\theta)'_{\hat\theta_r}\, I_{\hat\theta_r}^{-1}\, (\partial \log L_R/\partial\theta)_{\hat\theta_r},$$

where the partial derivatives are evaluated at $\hat\theta_r$, the restricted ML estimate, and $I_{\hat\theta_r}$ is the information matrix evaluated at $\hat\theta_r$, has an approximate $\chi^2_q$ pdf in large samples under $H_0$ and regularity conditions, and this fact can be employed to construct an approximate $\alpha$-level test of $H_0$. The LM test requires just the computation of the restricted ML estimate, $\hat\theta_r$, and is thus occasionally much less computationally burdensome than the LR and W tests that require the unrestricted ML estimate. On the other hand, it seems important in applications to view and study both the unrestricted and restricted estimates.
Finally, it is the case that for a given pair of hypotheses the LR, W, and LM test statistics have the same large sample $\chi^2$ pdf, so that in large samples there are no grounds for preferring any one. In small samples, however, their properties are somewhat different and in fact use of large sample test results based on them can give conflicting results [see, for example, Berndt and Savin (1977)]. Fortunately, research is in progress on this problem. Some approximations to the finite sample distributions of these large sample test statistics have been obtained that appear useful; see, for example, Box (1949) and Lawley (1956). Also, Edgeworth expansion techniques to approximate distributions of various test statistics are currently being investigated by several researchers.
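To make the three large sample statistics concrete, the following sketch computes the LR, W, and LM statistics for a one-parameter problem, an exponential model chosen purely for illustration (it is not an example from the text); here $q = 1$ and each statistic is referred to the $\chi^2_1$ distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Illustrative setting: x_i i.i.d. exponential with rate theta,
# log L(theta) = n log(theta) - theta * sum(x); test H0: theta = theta0
theta_true, theta0, n = 1.3, 1.0, 200
x = rng.exponential(scale=1 / theta_true, size=n)

def loglik(th):
    return n * np.log(th) - th * x.sum()

theta_hat = 1 / x.mean()                                   # unrestricted ML estimate

lr = -2 * (loglik(theta0) - loglik(theta_hat))             # likelihood ratio statistic
wald = (theta_hat - theta0) ** 2 * n / theta_hat**2        # Wald, information at theta_hat
score = n / theta0 - x.sum()                               # d log L / d theta at theta0
lm = score**2 * theta0**2 / n                              # Lagrange multiplier (score) statistic

for name, stat in [("LR", lr), ("Wald", wald), ("LM", lm)]:
    print(f"{name:4s} = {stat:6.2f},  approx. p-value = {stats.chi2.sf(stat, df=1):.4f}")
```

In large samples the three statistics are close to one another, while in small samples they can lead to conflicting conclusions, as noted above.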

6.3. Bayesian analysis of hypotheses

Bayesian procedures are available for analyzing various types of hypotheses that yield posterior probabilities and posterior odds ratios associated with alternative hypotheses which incorporate both sample and prior information. Further, given a loss structure, it is possible to choose between or among hypotheses in such a manner as to minimize expected loss. These procedures, which are discussed in Jeffreys (1967), DeGroot (1970), Leamer (1978), Bernardo et al. (1980), and Zellner (1971), are briefly reviewed below.
With respect to hypotheses relating to a scalar parameter $\theta$, $-\infty < \theta < \infty$, of the form $H_1$: $\theta > c$ and $H_2$: $\theta < c$, where $c$ has a given value, e.g. $c = 0$, assume that a posterior pdf for $\theta$, $p(\theta|D)$, is available, where $D$ denotes the sample and prior information. Then, in what has been called the Laplacian Approach, the posterior probabilities relating to $H_1$ and to $H_2$ are given by $\Pr(\theta > c|D) = \int_c^\infty p(\theta|D)\,\mathrm{d}\theta$ and $\Pr(\theta < c|D) = \int_{-\infty}^c p(\theta|D)\,\mathrm{d}\theta$, respectively. The posterior odds ratio for $H_1$ and $H_2$, denoted by $K_{12}$, is then $K_{12} = \Pr(\theta > c|D)/\Pr(\theta < c|D)$. Other hypotheses, e.g. $|\theta| < 1$ and $|\theta| > 1$, can be appraised in a similar fashion. That is, $\Pr(|\theta| < 1 \mid D) = \int_{-1}^1 p(\theta|D)\,\mathrm{d}\theta$ is the posterior probability that $|\theta| < 1$ and $1 - \Pr(|\theta| < 1 \mid D)$ is the posterior probability that $|\theta| > 1$.
Example 6.3
Let $\tilde y_i$, $i = 1, 2, \ldots, n$, be independent observations from a normal distribution with mean $\theta$ and unit variance. If a diffuse prior for $\theta$, $p(\theta) \propto$ const., is employed, the posterior pdf is $p(\theta|D) \propto \exp\{-n(\theta - \bar y)^2/2\}$, where $\bar y$ is the sample mean; that is, $z = \sqrt{n}(\theta - \bar y)$ has a N(0,1) posterior pdf. Then for the hypothesis $\theta > 0$, $\Pr(\theta > 0|D) = \Pr(z > -\sqrt{n}\,\bar y|D) = 1 - \Phi(-\sqrt{n}\,\bar y)$, where $\Phi(\cdot)$ is the cumulative normal distribution function. Thus, $\Pr(\theta > 0|D)$ can be evaluated from tables of $\Phi(\cdot)$.
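A sketch of this calculation on simulated data (the sample values are illustrative) follows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Illustrative data for Example 6.3: y_i ~ N(theta, 1), diffuse prior for theta
y = rng.normal(0.3, 1.0, size=25)
n, ybar = len(y), y.mean()

# Posterior: sqrt(n) (theta - ybar) ~ N(0, 1), so
# Pr(theta > 0 | D) = 1 - Phi(-sqrt(n) * ybar)
prob_pos = 1 - stats.norm.cdf(-np.sqrt(n) * ybar)
print(f"ybar = {ybar:.3f},  Pr(theta > 0 | D) = {prob_pos:.3f}")
```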
When a vector of parameters, $\theta$, with $\theta \in \Theta$ and posterior pdf $p(\theta|D)$, is considered, and the hypotheses are $H_A$: $\theta \in \omega$ and $H_B$: $\theta \in \Theta - \omega$, where $\omega$ is a subspace of $\Theta$, $\Pr(\theta \in \omega|D) = \int_\omega p(\theta|D)\,\mathrm{d}\theta$ is the posterior probability associated with $H_A$, while $1 - \Pr(\theta \in \omega|D)$ is that associated with $H_B$, and the posterior odds ratio is the ratio of these posterior probabilities. The above posterior probabilities can be evaluated either analytically, or by the use of tabled values of integrals, or by numerical integration. For an example of this type of analysis applied to hypotheses about properties of a second order autoregression, see Zellner (1971, p. 194 ff.).
For a very wide range of different types of hypotheses, the following Jeffreys Approach, based on Bayes' Theorem, can be employed in analyzing alternative hypotheses or models for observations. Let $p(y, H)$ be the joint distribution for the data $y$ and an indicator variable $H$. Then $p(y, H) = p(H)p(y|H) = p(y)p(H|y)$ and $p(H|y) = p(H)p(y|H)/p(y)$. If $H$ can assume values $H_1$ and $H_2$, it follows that the posterior odds ratio, $K_{12}$, is

$$K_{12} = \frac{p(H_1|y)}{p(H_2|y)} = \frac{p(H_1)}{p(H_2)} \cdot \frac{p(y|H_1)}{p(y|H_2)}, \qquad (6.2)$$

where $p(H_i)$ is the prior probability assigned to $H_i$, $i = 1, 2$, $p(H_1)/p(H_2)$ is the prior odds ratio for $H_1$ versus $H_2$, and $p(y|H_i)$ is the marginal pdf for $y$ under hypothesis $H_i$, $i = 1, 2$. The ratio $p(y|H_1)/p(y|H_2)$ is called the Bayes Factor (BF). In the case that both $H_1$ and $H_2$ are simple hypotheses, the BF $p(y|H_1)/p(y|H_2)$ is just the Likelihood Ratio (LR).
Example 6.4
Let $\tilde y_i = \theta + \tilde\varepsilon_i$, $i = 1, 2, \ldots, n$, with the $\tilde\varepsilon_i$'s assumed independently drawn from a normal distribution with zero mean and unit variance. Consider two simple hypotheses, $H_1$: $\theta = 0$ and $H_2$: $\theta = 1$, with prior probabilities $p(H_1) = 1/2$ and $p(H_2) = 1/2$. Then from (6.2),

$$K_{12} = \frac{1/2}{1/2} \cdot \frac{p(y|\theta = 0)}{p(y|\theta = 1)} = \frac{\exp\{-y'y/2\}}{\exp\{-(y - \iota)'(y - \iota)/2\}} = \exp\{n(\tfrac{1}{2} - \bar y)\},$$

where $y' = (y_1, y_2, \ldots, y_n)$, $\iota' = (1, 1, \ldots, 1)$, and $\bar y$ is the sample mean. In this case $K_{12} = LR$ and its value is determined by the value of $n(\tfrac{1}{2} - \bar y)$.
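A sketch of Example 6.4 on simulated data follows; the data are illustrative, and the closed form printed as a check is the expression derived above.

```python
import numpy as np

rng = np.random.default_rng(8)

# Illustrative data for Example 6.4: y_i = theta + e_i, e_i ~ N(0, 1)
theta_true, n = 0.8, 20
y = theta_true + rng.normal(size=n)

# With p(H1) = p(H2) = 1/2, K12 equals the likelihood ratio
log_k12 = -0.5 * (y @ y) + 0.5 * ((y - 1) @ (y - 1))   # log p(y|theta=0) - log p(y|theta=1)
k12 = np.exp(log_k12)

# Closed form used as a check: exp{ n (1/2 - ybar) }
print(f"K12 = {k12:.4f},  exp(n(1/2 - ybar)) = {np.exp(n * (0.5 - y.mean())):.4f}")
print(f"posterior probability of H1: {k12 / (1 + k12):.4f}")
```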
In cases in which non-simple hypotheses are considered, that is, hypotheses that do not involve assigning values to all parameters of a pdf for the data $y$, $p(y|\theta_i, H_i)$, given that a prior pdf for $\theta_i$ is available, $p(\theta_i|H_i)$, it follows that $p(y|H_i) = \int p(y|\theta_i, H_i)p(\theta_i|H_i)\,\mathrm{d}\theta_i$ and in such cases (6.2) becomes

$$K_{12} = \frac{p(H_1)}{p(H_2)} \cdot \frac{\int p(y|\theta_1, H_1)p(\theta_1|H_1)\,\mathrm{d}\theta_1}{\int p(y|\theta_2, H_2)p(\theta_2|H_2)\,\mathrm{d}\theta_2}. \qquad (6.3)$$

Thus, in this case, $K_{12}$ is equal to the prior odds ratio, $p(H_1)/p(H_2)$, times a BF that is a ratio of averaged likelihood functions. For discussion and applications of (6.3) to a variety of problems, see, for example, Jeffreys (1967), DeGroot (1970), Leamer (1978), and Zellner (1971).
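For a simple illustration of (6.3), the sketch below computes the Bayes factor for a sharp hypothesis $\theta = 0$ against an alternative under which $\theta$ has a normal prior; the model, the prior, and the data are illustrative assumptions, and the closed-form reduction used relies on the sufficiency of $\bar y$ in this normal case.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

# Illustrative case of (6.3): y_i ~ N(theta, 1); H1: theta = 0 (simple),
# H2: theta unrestricted with an assumed prior theta ~ N(mu0, tau^2)
mu0, tau, n = 0.0, 1.0, 40
y = rng.normal(0.25, 1.0, size=n)
ybar = y.mean()

# Since ybar | theta ~ N(theta, 1/n) and the remaining factor of the likelihood
# does not involve theta, the Bayes factor reduces to
#   BF12 = N(ybar | 0, 1/n) / N(ybar | mu0, tau^2 + 1/n)
bf12 = stats.norm.pdf(ybar, 0.0, np.sqrt(1 / n)) / stats.norm.pdf(
    ybar, mu0, np.sqrt(tau**2 + 1 / n)
)
k12 = 1.0 * bf12   # posterior odds with prior odds p(H1)/p(H2) = 1
print(f"ybar = {ybar:.3f},  BF12 = {bf12:.3f},  K12 = {k12:.3f}")
```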
In (6.2) and (6.3) it is seen that a prior odds ratio gets transformed into a posterior odds ratio. If a loss structure is available, it is possible to choose between or among hypotheses so as to minimize expected loss. To illustrate, consider two mutually exclusive and exhaustive hypotheses, $H_1$ and $H_2$, with posterior odds ratio $K_{12} = p_1/(1 - p_1)$, where $p_1$ is the posterior probability for $H_1$ and $1 - p_1$ is the posterior probability for $H_2$. Suppose that the following two-action, two-state loss structure is relevant:

                              State of world
                               H1        H2
    Acts   A1: Choose H1        0       L12
           A2: Choose H2       L21       0

The two "states of the world" are: H~ is in accord with the data or H 2 is in accord
with the data; while the two possible actions are: choose H 1 and choose H 2.
LI2 > 0 and L:~ > 0 are losses associated with incorrect actions. Then using the
posterior probabilities, p 1 and 1 - p ~, posterior expected losses associated with A 1
and A 2 are:

E(LIA1) = ( 1 - p~)L,2 and E(LIA2) = plL21,


and thus

E(LIA2) Pl L21
E(LtA,) 1 - Pl LI2

(6.4)
LI2 "

If this ratio of expected losses is larger than one, choosing A 1 minimizes expected
loss, while if it is less than one, choosing A 2 leads to minimal expected loss. Note
from the second line of (6.4) that both K12 and L21/L12 affect the decision. In the
very special case Lzl/L12 = 1, the symmetric loss structure, if K12 > 1 choose HI,
while if K12 < 1 choose H 2. The analysis can be generalized to apply to more than
two hypotheses. Also, there are intriguing relations between the results provided
by the Bayesian approach and sampling theory approaches to testing hypotheses
that are discussed in the references cited above.
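A minimal sketch of this decision rule, with illustrative values for the posterior odds and the losses, shows how the loss ratio can overturn the verdict suggested by the posterior odds alone.

```python
# Choosing between H1 and H2 so as to minimize posterior expected loss,
# using the two-action, two-state loss structure above (numbers illustrative)
K12 = 3.0               # posterior odds for H1 versus H2 (favors H1)
L12, L21 = 10.0, 2.0    # loss of choosing H1 when H2 holds, and vice versa

ratio = K12 * L21 / L12            # E(L|A2) / E(L|A1), as in (6.4)
choice = "H1" if ratio > 1 else "H2"
print(f"expected-loss ratio = {ratio:.2f}; choose {choice}")
# Here the ratio is 0.6 < 1, so H2 is chosen despite K12 > 1,
# because an incorrect choice of H1 is much more costly.
```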
Finally, given the posterior probabilities associated with the hypotheses, it is possible to use them not only in testing but also in estimation and prediction. That is, if two hypotheses are $H_1$: $\theta = c$ and $H_2$: $\theta \neq c$, where $c$ has a given value and $p_1$ and $1 - p_1$ are the posterior probabilities associated with $H_1$ and $H_2$, respectively, then relative to quadratic loss an optimal estimate is

$$\hat\theta = p_1 c + (1 - p_1)\bar\theta = c + (1 - p_1)(\bar\theta - c) = c + \frac{1}{1 + K_{12}}(\bar\theta - c),$$

where $\bar\theta$ is the posterior mean of $\theta$ under $H_2$ and $K_{12} = p_1/(1 - p_1)$, the posterior odds ratio [see Zellner and Vandaele (1975) for details]. Also, in Geisel (1975) posterior probabilities associated with two different models for the same set of observations are employed to average their predictions and thus to obtain an optimal prediction relative to quadratic loss.

7. Summary and concluding remarks

Research in statistical theory has yielded very useful procedures for learning from
data, one of the principal objectives of econometrics and science. In addition, this
research has produced a large number of probability models for observations that
are widely utilized in econometrics and other sciences. Some of them were
reviewed above. Also, techniques for estimation, prediction, and testing were
reviewed that enable investigators to solve inference problems in a scientific
manner. The importance of utilizing sound, scientific methods in analyzing data
and drawing conclusions from them is obvious since such conclusions often have
crucial implications for economic policy-making and the progress of economic
science. On the other hand, it is a fact that statistical and econometric analysis
frequently is a mixture of science and art. In particular, the formulation of
appropriate theories and models is largely an art. A challenge for statistical theory
is to provide fruitful, formal procedures that are helpful in solving model
formulation problems.
While many topics were discussed in this chapter, it is necessary to point to
some that were not. These include non-parametric statistics, survey methodology,
design of experiments, time series analysis, random parameter models, statistical
control theory, sequential and simultaneous testing procedures, empirical Bayes
procedures, and fiducial and structural theories of inference. Some of these topics
are treated in other parts of this Handbook. Also, readers may refer to Kruskal
and Tanur (1978) for good discussions of these topics that provide references to
the statistical literature. The annual issues of the ASA/IMS Current Index to
Statistics are a very useful guide to the current statistical literature.
In the course of this chapter a number of controversial issues were mentioned
that deserve further thought and study. First, there is the issue of which concept
of probability is most fruitful in econometric work. This is a critical issue since
probability statements play a central role in econometric analyses.
Second, there are major controversies concerning the most appropriate ap-
proach to statistical inference to employ in econometrics. The two major ap-
proaches to statistical inference discussed in this chapter are the sampling theory
approach and the Bayesian approach. Examples illustrating both approaches
were presented. For further discussion of the issues involved see, for example,
Barnett (1973), Bernardo et al. (1980), Cox and Hinkley (1974), Lindley (1971),
Rothenberg (1975), Zellner (1975), and the references in these works.
Third, with respect to both sampling theory and Bayesian approaches, while
there are many problems for which both approaches yield similar solutions, there
are some problems for which solutions differ markedly. Further attention to such
problems, some of which are discussed in Bernardo et al. (1980), Cox and Hinkley
(1974), Jaynes (1980), Lindley (1971), and the references cited in these works,
would be worthwhile.
Fourth, there is controversy regarding the implications of the likelihood principle for econometric and statistical practice. Briefly, the likelihood principle states that if $x$ and $y$ are two data sets such that $p(x|\theta) = cf(y|\theta)$, with $\theta \in \Theta$ and $c$ not depending on $\theta$, then inferences and decisions based on $x$ and on $y$ should be identical. The Bayesian approach satisfies this condition since, for a given prior pdf $\pi(\theta)$, the posterior pdfs for $\theta$ based on $p(x|\theta)$ and on $cf(y|\theta)$ are identical given $p(x|\theta) = cf(y|\theta)$. On the other hand, sampling theory properties and procedures that involve integrations over the sample space, as in the case of unbiasedness, MVU estimation, confidence intervals, and tests of significance, violate the likelihood principle. Discussions of this range of issues are provided in Cox and Hinkley (1974, ch. 2) and Lindley (1971, p. 10 ff.) with references to important work by Birnbaum, Barnard, Durbin, Savage, and others.
Fifth, the importance of Bayesian logical consistency and coherence is em-
phasized by most Bayesians but is disputed by some who argue that these
concepts fail to capture all aspects of the art of data analysis. Essentially, what is
being criticized here is the Bayesian learning model and/or the precept, "act so as
to maximize expected utility (or equivalently minimize expected loss)". If im-
provements can be made to Bayesian and other learning and decision procedures,
they would constitute major research contributions.
Sixth, some object to the introduction of prior distributions in statistical
analyses and point to the difficulty in formulating prior distributions in multi-
parameter problems. Bayesians point to the fact that non-Bayesians utilize prior
information informally in assessing the "reasonableness" of estimation results,
choosing significance levels, etc. and assert that formal, careful use of prior
information provides more satisfactory results in estimation, prediction, and
testing.
Seventh, frequentists assert that statistical procedures are to be assessed in
terms of their behavior in hypothetical repetitions under the same conditions.
Others dispute this assertion by stating that statistical procedures must be
justified in terms of the actually observed data and not in terms of hypothetical,
fictitious repetitions. This range of issues is very relevant for analyses of non-
experimental data, for example macro-economic data.
The above controversial points are just some of the issues that arise in judging
alternative approaches to inference in econometrics and statistics. Furthermore,
Good has suggested in a 1980 address at the University of Chicago and in Good
and Crook (1974) some elements of a Bayes/non-Bayes synthesis that he expects
to see emerge in the future. In a somewhat different suggested synthesis, Box
(1980) proposes Bayesian estimation procedures for parameters of given models
and a form of sampling theory testing procedures for assessing the adequacy of
models. While these proposals for syntheses of different approaches to statistical
inference are still being debated, they do point toward possible major innovations
in statistical theory and practice that will probably be of great value in economet-
ric analyses.

References
Aitchison, J. and I. R. Dunsmore (1975) Statistical Prediction Analysis. Cambridge: Cambridge
University Press.
Anderson, T. W. (1971) The Statistical Analysis of Time Series. New York: John Wiley & Sons, Inc.

Anscombe, F. J. (1961) "Bayesian Statistics", American Statistician, 15, 21-24.


Arnold, S. F. (1981) The Theory of Linear Models and Multivariate Analysis. New York: John Wiley &
Sons, Inc.
Barnett, V. D. (1973) Comparative Statistical Inference. New York: John Wiley & Sons, Inc.
Barnett, V. and T. Lewis (1978) Outliers in Statistical Data. New York: John Wiley & Sons, Inc.
Bayes, T. (1763) "An Essay Toward Solving a Problem in the Doctrine of Chances", Philosophical
Transactions of the Royal Society (London), 53, 370-418; reprinted in Biometrika, 45 (1958),
293-315.
Belsley, D. A., E. Kuh and R. E. Welsch (1980) Regression Diagnostics. New York: John Wiley &
Sons, Inc.
Bernardo, J. M., M. H. DeGroot, D. V. Lindley and A.F.M. Smith (eds.) (1980) Bayesian Statistics.
Valencia, Spain: University Press.
Berndt, E. and N. E. Savin (1977) "Conflict Among Criteria for Testing Hypotheses in the
Multivariate Linear Regression Model", Econometrica, 45, 1263-1272.
Blackwell, D. (1947) "Conditional Expectation and Unbiased Sequential Estimation," Annals of
Mathematical Statistics, 18, 105-110.
Blackwell, D. and M. A. Girshick (1954) Theory of Games and Statistical Decisions. New York: John
Wiley & Sons, Inc.
Box, G. E. P. (1949) " A General Distribution Theory for a Class of Likelihood Criteria", Biometrika,
36, 317-346.
Box, G. E. P. (1976) "Science and Statistics", Journal of the American Statistical Association, 71,
791-799.
Box, G. E. P. (1980) "Sampling and Bayes' Inference in Scientific Modelling and Robustness", Journal
of the Royal Statistical Society A, 143, 383-404.
Box, G. E. P. and D. R. Cox (1964) "An Analysis of Transformations", Journal of the Royal Statistical
Society B, 26, 211-243.
Box, G. E. P. and G. C. Tiao (1973) Bayesian Inference in Statistical Analysis. Reading, Mass.:
Addison-Wesley Publishing Co.
Brown, L. (1966) " O n the Admissibility of Estimators of One or More Location Parameters", Annals
of Mathematical Statistics, 37, 1087-1136.
Christ, C. (1966) Econometric Models and Methods. New York: John Wiley & Sons, Inc.
Copson, E. T. (1965) Asymptotic Expansions. Cambridge: Cambridge University Press.
Cox, D. R. and D. V. Hinkley (1974) Theoretical Statistics. London: Chapman and Hall.
Cox, D. R. and P. A. W. Lewis (1966) The Statistical Analysis of Series of Events. London: Methuen.
Cramer, H. (1946) Mathematical Methods of Statistics. Princeton: Princeton University Press.
DeGroot, M. H. (1970) Optimal Statistical Decisions. New York: McGraw-Hill Book Co.
DeRobertis, L. (1978) "The Use of Partial Prior Knowledge in Bayesian Inference", unpublished
doctoral dissertation, Department of Statistics, Yale University.
Efron, B. and C. Morris (1975) "Data Analysis Using Stein's Estimator and Its Generalizations",
Journal of the American Statistical Association, 70, 311-319.
Feigl, H. (1953) "Notes on Causality", in: H. Feigl and M. Brodbeck (eds.), Readings in the Philosophy
of Science. New York: Appleton-Century-Crofts, Inc., pp. 408-418.
Feldstein, M. S. (1971) "The Error of Forecast in Econometric Models when the Forecast-Period
Exogenous Variables are Stochastic", Econometrica, 39, 55-60.
Ferguson, T. S. (1967) Mathematical Statistics: A Decision Theory Approach. New York: Academic
Press, Inc.
Fisher, R. A. (1959) Statistical Methods and Scientific Inference (2nd edn.). New York: Hafner
Publishing Co.
Fomby, T. B. and D. K. Guilkey (1978) " O n Choosing the Optimal Level of Significance for the
Durbin-Watson Test and the Bayesian Alternative", Journal of Econometrics, 8, 203-214.
Friedman, M. (1957) A Theory of the Consumption Function. Princeton: Princeton University Press.
Geisel, M. S. (1975) "Bayesian Comparisons of Simple Macroeconomic Models", in: S. E. Fienberg
and A. Zellner (eds.), Studies in Bayesian Econometrics and Statistics in Honor of Leonard J. Savage.
Amsterdam: North-Holland Publishing Co., pp. 227-256.
Geisser, S. (1980) "A Predictivistic Primer", in: A. Zellner (ed.), Bayesian Analysis in Econometrics and
Statistics: Essays in Honor of Harold Jeffreys. Amsterdam: North-Holland Publishing Co., pp.
363-381.

Goldfeld, S. M. and R. E. Quandt (1972) Nonlinear Methods in Econometrics. Amsterdam: North-Hol-


land Publishing Co.
Good, I. J. and J. F. Crook (1974) "The Bayes/Non-Bayes Compromise and the Multinomial
Distribution", Journal of the American Statistical Association, 69, 711-720.
Granger, C. W. J. and P. Newbold (1977) Forecasting Economic Time Series. New York: Academic
Press, Inc.
Guttman, I. (1970) Statistical Tolerance Regions: Classical and Bayesian. London: Charles Griffen &
Co., Ltd.
Hartigan, J. (1964) "Invariant Prior Distributions", Annals of Mathematical Statistics, 35, 836-845.
Heyde, C. C. and I. M. Johnstone (1979) " O n Asymptotic Posterior Normality for Stochastic
Processes", Journal of the Royal Statistical Association B, 41, 184-189.
Hill, B. M. (1975) " O n Coherence, Inadmissibility and Inference about Many Parameters in the
Theory of Least Squares", in: S. E. Fienberg and A. Zellner (eds.), Studies in Bayesian Econometrics
and Statistics. Amsterdam: North-Holland Publishing Co., pp. 555-584.
Huber, P. J. (1964) "Robust Estimation of a Location Parameter," Annals of Mathematical Statistics,
35, 73-101.
Huber, P. J. (1972) "Robust Statistics: A Review", Annals of Mathematical Statistics, 43, 1041-1067.
James, W. and C. Stein (1961) "Estimation with Quadratic Loss", in: Proceedings of the Fourth
Berkeley Symposium on Mathematical Statistics and Probability Theory, vol. I. Berkeley: University
of California Press, pp. 361-397.
Jaynes, E. T. (1968) "Prior Probabilities", IEEE Transactions on Systems Science and Cybernetics,
SSC-4, 227-241.
Jaynes, E. T. (1974) "Probability Theory", manuscript. St. Louis: Department of Physics, Washington
University.
Jaynes, E. T. (1980) "Marginalization and Prior Probabilities", in: A. Zellner (ed.), Bayesian Analysis
in Econometrics and Statistics: Essays in Honor of Harold Jeffreys. Amsterdam: North-Holland
Publishing Co., pp. 43-78.
Jeffreys, H. (1967) Theory of Probability (3rd rev. edn.; 1st ed. 1939). London: Oxford University
Press.
Jeffreys, H. (1973) Scientific Inference (3rd edn.). Cambridge: Cambridge University Press.
Johnson, N. L. and S. Kotz (1969) Discrete Distributions. Boston: Houghton Mifflin Publishing Co.
Johnson, N. L. and S. Kotz (1970) Continuous Univariate Distributions, vols. 1 and 2. New York: John
Wiley & Sons, Inc.
Judge, G. G. and M. E. Bock (1978) The Statistical Implications of Pre-Test and Stein-Rule Estimators
in Econometrics. Amsterdam: North-Holland Publishing Co.
Kadane, J. B., J. M, Dickey, R. L. Winkler, W. S. Smith and S. C. Peters (1980) "Interactive
Elicitation of Opinion for a Normal Linear Model", Journal of the American Statistical Association,
75, 845-854.
Kendall, M. G. and A. Stuart (1958) The Advanced Theory of Statistics, vol. 1. London: C. Griffen &
Co., Ltd.
Kendall, M. G. and'A. Stuart (1961) The Advanced Theory of Statistics, vol. 2. London: C. Griffen &
Co., Ltd.
Kenney, J. F. and E. S. Keeping (1951) Mathematics of Statistics (Part Two). New York: D. Van
Nostrand Company, Inc.
Keynes, J. M. (1921) Treatise on Probability. London: Macmillan and Co., Ltd.
Koopman, B. O. (1936) " O n Distributions Admitting a Sufficient Statistic", Transactions of the
American Mathematical Society, 39, 399-409.
Kruskal, W. J. and J. M. Tanur (eds.) (1978), International Encyclopedia of Statistics, vols. 1 and 2.
New York: The Free Press (Division of Macmillan Publishing Co., Inc.).
Lawley, D. N. (1956) "A General Method for Approximating to the Distribution of Likelihood Ratio
Criteria", Biometrika, 43,295-303.
Leamer, E. E. (1978) Specification Searches. New York: John Wiley & Sons, Inc.
Lehmann, E. L. (1959) Testing Statistical Hypotheses. New York: John Wiley & Sons, Inc.
Lindley, D. V. (1962) "Discussion on Professor Stein's Paper", Journal of the Royal Statistical
Society B, 24, 285-287.
Lindley, D. V. (1965) Introduction to Probability and Statistics from a Bayesian Viewpoint. Part 2:
Inference. Cambridge: Cambridge University Press.

Lindley, D. V. (1971) Bayesian Statistics, A Review. Philadelphia: Society for Industrial and Applied
Mathematics.
Litterman, R. (1980) "A Bayesian Procedure for Forecasting with Vector Autoregressions", manuscript.
Department of Economics, MIT; to appear in Journal of Econometrics.
Loève, M. (1963) Probability Theory (3rd edn.). Princeton: D. Van Nostrand Co., Inc.
Luce, R. D. and H. Raiffa (1957) Games and Decisions. New York: John Wiley & Sons, Inc.
Mehta, J. S. and P. A. V. B. Swamy (1976) "Further Evidence on the Relative Efficiencies of Zellner's
Seemingly Unrelated Regressions Estimators", Journal of the American Statistical Association, 71,
634-639.
Neyman, J. and E. L. Scott (1948) "Consistent Estimates Based on Partially Consistent Observations",
Econometrica, 16, 1-32.
Pfanzagl, J. and W. Wefelmeyer (1978) "A Third Order Optimum Property of the Maximum
Likelihood Estimator", Journal of Multivariate Analysis, 8, 1-29.
Phillips, P. C. B. (1977a) "A General Theorem in the Theory of Asymptotic Expansions for
Approximations to the Finite Sample Distribution of Econometric Estimators", Econometrica, 45,
1517-1534.
Phillips, P. C. B. (1977b) "An Approximation to the Finite Sample Distribution of Zellner's Seemingly
Unrelated Regression Estimator", Journal of Econometrics, 6, 147-164.
Pitman, E. J. G. (1936) "Sufficient Statistics and Intrinsic Accuracy", Proceedings of the Cambridge
Philosophical Society, 32, 567-579.
Pratt, J. W. (1961) "Length of Confidence Intervals", Journal of the American Statistical Association,
56, 549-567.
Pratt, J. W. (1965) "Bayesian Interpretation of Standard Inference Statements", Journal of the Royal
Statistical Society B, 27, 169-203.
Pratt, J. W., H. Raiffa and R. Schlaifer (1964) "The Foundations of Decision Under Uncertainty: An
Elementary Exposition", Journal of the American Statistical Association, 59, 353-375.
Raiffa, H. and R. Schlaifer (1961) Applied Statistical Decision Theory. Boston: Graduate School of
Business Administration, Harvard University.
Ramsey, F. P. (1931) The Foundations of Mathematics and Other Essays. London: Kegan, Paul,
Trench, Trubner & Co., Ltd.
Rao, C. R. (1945) "Information and Accuracy Attainable in Estimation of Statistical Parameters",
Bulletin of the Calcutta Mathematical Society, 37, 81-91.
Rao, C. R. (1973) Linear Statistical Inference and Its Applications. New York: John Wiley & Sons, Inc.
Rao, P. and Z. Griliches (1969) "Small-Sample Properties of Two-Stage Regression Methods in the
Context of Auto-correlated Errors", Journal of the American Statistical Association, 64, 253-272.
Rényi, A. (1970) Foundations of Probability. San Francisco: Holden-Day, Inc.
Revankar, N. S. (1974) "Some Finite Sample Results in the Context of Two Seemingly Unrelated
Regression Equations", Journal of the American Statistical Association, 69, 187-190.
Rothenberg, T. J. (1975) "The Bayesian Approach and Alternatives", in: S. E. Fienberg and A.
Zellner (eds.), Studies in Bayesian Econometrics and Statistics. Amsterdam: North-Holland Publish-
ing Co., pp. 55-67.
Savage, L. J. (1954) The Foundations of Statistics. New York: John Wiley & Sons, Inc.
Savage, L. J. (1961) "The Subjective Basis of Statistical Practice", manuscript. University of Michigan,
Ann Arbor.
Savage, L. J., et al. (1962) The Foundations of Statistical Inference. London: Methuen.
Savage, L. J., N. Edwards and H. Lindman (1963) "Bayesian Statistical Inference for Psychological
Research", Psychological Review, 70, 193-242.
Sclove, S. L. (1968) "Improved Estimators for Coefficients in Linear Regression", Journal of the
American Statistical Association, 63, 596-606.
Silvey, S. D. (1970) Statistical Inference. Baltimore, Md.: Penguin Books.
Srivastava, V. K. and T. D. Dwivedi (1979) "Estimation of Seemingly Unrelated Regression
Equations: A Brief Survey", Journal of Econometrics, 10, 15-32.
Stein, C. (1956) "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal
Distribution", in: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and
Probability, vol. I. Berkeley: University of California Press, pp. 197-206.
Stein, C. (1960) "Multiple Regression", in: L Olkin (ed.), Contributions to Probability and Statistics:
Essays in Honour of Harold Hotelling. Stanford: Stanford University Press.

Takeuchi, K. (1978) "Asymptotic Higher Order Efficiency of ML Estimators of Parameters in Linear


Simultaneous Equations", paper presented at the Kyoto Econometrics Seminar Meeting, University
of Kyoto, 27-30 June.
Taylor, W. E. (1978) "The Heteroscedastic Linear Model: Exact Finite Sample Results", Econometrica,
46, 663-675.
Thornber, E. H. (1967) "Finite Sample Monte Carlo Studies: An Autoregressive Illustration", Journal
of the American Statistical Association, 62, 801-818.
Tukey, J. W. (1957) "On the Comparative Anatomy of Transformations", Annals of Mathematical
Statistics, 28, 602-632.
Tukey, J. W. (1977) Exploratory Data Analysis. Reading, Mass.: Addison-Wesley Publishing Co.
Uspensky, J. V. (1937) Introduction to Mathematical Probability. New York: McGraw-Hill Book Co.,
Inc.
Varian, H. R. (1975) "A Bayesian Approach to Real Estate Assessment", in: S. E. Fienberg and A.
Zellner (eds.), Studies in Bayesian Econometrics and Statistics in Honor of Leonard J. Savage.
Amsterdam: North-Holland Publishing Co., pp. 195-208.
Wald, A. (1949) "Note on the Consistency of Maximum Likelihood Estimate", Annals of Mathemati-
cal Statistics, 20, 595-601.
Widder, D. V. (1961) Advanced Calculus (2nd edn.). New York: Prentice-Hall, Inc.
Winkler, R. L. (1980) "Prior Information, Predictive Distributions and Bayesian Model-Building", in:
A. Zellner (ed.), Bayesian Analysis in Econometrics and Statistics: Essays in Honor of Harold Jeffreys.
Amsterdam: North-Holland Publishing Co., pp. 95-109.
Zaman, A. (1981) "Estimators Without Moments: The Case of the Reciprocal of a Normal Mean",
Journal of Econometrics, 15, 289-298.
Zellner, A. (1971) An Introduction to Bayesian Inference in Econometrics. New York: John Wiley &
Sons, Inc.
Zellner, A. (1972) "On Assessing Informative Prior Distributions for Regression Coefficients",
manuscript. H. G. B. Alexander Research Foundation, Graduate School of Business, University of
Chicago.
Zellner, A. (1973) "The Quality of Quantitative Economic Policy-making When Targets and Costs of
Change are Misspecified", in: W. Sellekaerts (ed.), Selected Readings in Econometrics and Economic
Theory: Essays in Honor of Jan Tinbergen, Part II. London: Macmillan Publishing Co., pp.
147-164.
Zellner, A. (1975) "Time Series Analysis and Econometric Model Construction", in: R. P. Gupta
(ed.), Applied Statistics. Amsterdam: North-Holland Publishing Co., pp. 373-398.
Zellner, A. (1976) "Bayesian and Non-Bayesian Analysis of the Regression Model with Multivariate
Student-t Error Terms", Journal of the American Statistical Association, 71, 400-405.
Zellner, A. (1978) "Estimation of Functions of Population Means and Regression Coefficients
Including Structural Coefficients: A Minimum Expected Loss (MELO) Approach", Journal of
Econometrics, 8, 127-158.
Zellner, A. (1979) "Statistical Analysis of Econometric Models", Journal of the American Statistical
Association, 74, 628-651.
Zellner, A. (1980) "On Bayesian Regression Analysis with g-Prior Distributions", paper presented at
the Econometric Society Meeting, Denver, Colorado.
Zellner, A. and M. S. Geisel (1968) "Sensitivity of Control to Uncertainty and Form of the Criterion
Function", in: D. G. Watts (ed.), The Future of Statistics. New York: Academic Press, Inc., pp.
269-289.
Zellner, A. and S. B. Park (1979) "Minimum Expected Loss (MELO) Estimators for Functions of
Parameters and Structural Coefficients of Econometric Models", Journal of the American Statistical
Association, 74, 185-193.
Zellner, A. and N. S. Revankar (1969) "Generalized Production Functions", Review of Economic
Studies, 36, 241-250.
Zellner, A. and W. Vandaele (1975) "Bayes-Stein Estimators for k-Means, Regression and Simulta-
neous Equation Models", in: S. E. Fienberg and A. Zellner (eds.), Studies in Bayesian Econometrics
and Statistics. Amsterdam: North-Holland Publishing Co., pp. 627-653.
