ARNOLD ZELLNER*
University of Chicago
Contents
1. Introduction and overview 68
2. Elements of probability theory 69
2.1. Probability models for observations 70
2.2. Definitions of probability 71
2.3. Axiom systems for probability theory 74
2.4. Random variables and probability models 82
2.5. Elements of asymptotic theory 110
3. Estimation theory 117
3.1. Point estimation 117
3.2. Criteria for point estimation 118
4. Interval estimation: Confidence bounds, intervals, and regions 152
4.1. Confidence bounds 152
4.2. Confidence intervals 154
4.3. Confidence regions 156
5. Prediction 158
5.1. Sampling theory prediction techniques 159
5.2. Bayesian prediction techniques 162
6. Statistical analysis of hypotheses 164
6.1. Types of hypotheses 164
6.2. Sampling theory testing procedures 165
6.3. Bayesian analysis of hypotheses 169
7. Summary and concluding remarks 172
References 174
*Research for this paper was financed by NSF Grant SES 7913414 and by income from the H.G.B.
Alexander Endowment Fund, Graduate School of Business, University of Chicago. Part of this work
was done while the author was on leave at the National Bureau of Economic Research and the Hoover
Institution, Stanford, California.
¹For valuable discussions of many of the statistical topics considered below and references to the statistical literature, see Kruskal and Tanur (1978).
extensions of logic, which tells when one set of propositions necessitates the
truth of another.
While Savage's classification scheme probably will not satisfy all students of
the subject, it does bring out critical differences of alternative views regarding the
meaning of probability. To illustrate further, consider the following definitions of
probability, some of which are reviewed by Jeffreys (1967, p. 369 ff.).
1. Classical or Axiomatic Definition
If there are n possible alternatives, for m of which a proposition denoted by p is
true, then the probability of p is m/n.
2. Venn Limit Definition
If an event occurs a large number of times, then the probability of p is the limit of
the ratio of the number of times when p will be true to the whole number of trials,
when the number of trials tends to infinity.
3. Hypothetical Infinite Population Definition
An actually infinite number of possible trials is assumed. Then the probability of
p is defined as the ratio of the number of cases where p is true to the whole
number.
4. Degree of Reasonable Belief Definition
Probability is the degree of confidence that we may reasonably have in a
proposition.
5. Value of an Expectation Definition
If for an individual the utility of the uncertain outcome of getting a sum of s
dollars or zero dollars is the same as getting a sure payment of one dollar, the
probability of the uncertain outcome of getting s dollars is defined to be
u(1)/u(s), where u(·) is a utility function. If u(·) can be taken proportional to
returns, the probability of receiving s is 1/s.
Jeffreys notes that Definition 1 appeared in work of De Moivre in 1738 and of
J. Neyman in 1937; that R. Mises advocates Definition 2; and that Definition 3 is
usually associated with R. A. Fisher. Definition 4 is Jeffreys' definition (1967, p.
20) and close to Keynes' (1921, p. 3). The second part of Definition 5 is involved
in Bayes (1763). The first part of Definition 5, embodying utility comparisons, is
central in work by Ramsey (1931), Savage (1954), Pratt, Raiffa and Schlaifer
(1964), DeGroot (1970), and others.
Definition 1 can be shown to be defective, as it stands, by consideration of
particular examples; see Jeffreys (1967, p. 370 ff.). For example, if a six-sided die
is thrown, by Definition 1 the probability that any particular face will appear is
1/6. Clearly, this will not be the case if the die is biased. To take account of this
possibility, some have altered the definition to read, "If there are n equally likely possible alternatives, for m of which p is true, then the probability of p is m/n."
If the phrase "equally likely" is interpreted as "equally probable," then the definition is defective since the term to be defined is involved in the definition.
Also, Jeffreys (1967) points out in connection with the Venn Limit Definition that, "For continuous distributions there are an infinite number of possible cases, and the definition makes the probability, on the face of it, the ratio of two infinite numbers and therefore meaningless" (p. 371). He states that attempts by Neyman and Cramér to avoid this problem are unsatisfactory.
With respect to Definitions 2 and 3, it must be recognized that they are both non-operational. As Jeffreys (1967) puts it:
Further, with respect to Definition 3, Jeffreys (1967) writes, "On the infinite population definition, any finite probability is the ratio of two infinite numbers and therefore is indeterminate" (p. 373, fn. omitted). Thus, Definitions 2 and 3
have some unsatisfactory features.
Definition 4, which defines probability in terms of the degree of confidence
that we may reasonably have in a proposition, is a primitive concept. It is
primitive in the sense that it is not produced by any axiom system; however, it is
accepted by some on intuitive grounds. Furthermore, while nothing in the
definition requires that probability be measurable, say on a scale from zero to
one, Jeffreys (1967, p. 19) does assume measurability [see Keynes (1921) for a
critique of this assumption] and explores the consequences of the use of this
assumption in a number of applications. By use of this definition, it becomes
possible to associate probabilities with hypotheses, e.g. it is considered meaning-
ful to state that the probability that the marginal propensity to consume is
between 0.7 and 0.9 is 0.8, a statement that is meaningless in terms of Definitions
1-3. However, the meaningfulness of the metric employed for such statements is a key issue which, as with many measurement problems, will probably be resolved by noting how well procedures based on particular metrics perform in practice.
Various axiom systems for probability theory have appeared in the literature.
Herein, Jeffreys' axiom system, which was constructed to formalize inductive logic in such a way that it includes deductive logic as a special limiting case, is reviewed. His definition of probability as a degree of reasonable belief, Definition 4 above, allows for the fact that in induction, propositions are usually uncertain and only in the limit may be true or false in a deductive sense. With respect to probability, Jeffreys, along with Keynes (1921), Uspensky (1937), Rényi (1970), and others, emphasizes that all probabilities are conditional on an initial information set, denoted by A. For example, let B represent the proposition that a head will be observed on a single flip of a coin. The degree of reasonable belief or probability that one attaches to B depends on the initial information concerning the shape and other features of the coin and the way in which it is thrown, all of which are included in the initial information set, A. Thus, the probability of B is written P(B|A), a conditional probability. The probability of B without specifying A is meaningless. Further, failure to specify A clearly and precisely can lead to
confusion and meaningless results; for an example, see Jaynes (1980).
Let propositions be denoted by A, B, C,.... Then Jeffreys' (1967) first four
axioms are:
Axiom 1 (Comparability)
Given A, B is either more, equally, or less probable than C, and no two of these
alternatives can be true.
Axiom 2 (Transitivity)
If A, B, C, and D are four propositions and given A, B is more probable than C
and C is more probable than D, then given A, B is more probable than D.
Axiom 3 (Deducibility)
All propositions deducible from a proposition A have the same probability given
A. All propositions inconsistent with A have the same probability given data A.
Axiom 4
If, given A, B1 and B2 cannot both be true and if, given A, C1 and C2 cannot both be true, and if, given A, B1 and C1 are equally probable and B2 and C2 are equally probable, then, given A, B1 or B2 and C1 or C2 are equally probable.
Jeffreys states that Axiom 4 is required to prove the addition rule given below.
DeGroot (1970, p. 71) introduces a similar axiom.
Axiom 1 permits the comparison of probabilities or degrees of reasonable belief
or confidence in alternative propositions. Axiom 2 imposes a transitivity condi-
tion on probabilities associated with alternative propositions based on a common information set A. The third axiom is needed to insure consistency with deductive
logic in cases in which inductive and deductive logic are both applicable. The
extreme degrees of probability are certainty and impossibility. As Jeffreys (1967,
p. 17) mentions, certainty on data A and impossibility on data A "do not refer to
mental certainty of any particular individual, but to the relations of deductive
logic..." expressed by B is deducible from A and not-B is deducible from A , or in
other words, A entails B in the former case and A entails not-B in the latter.
Axiom 4 is needed in what follows to deal with pairs of exclusive propositions
relative to a given information set A. Jeffreys' Theorem 1 extends Axiom 4 to
relate to more than two pairs of exclusive propositions with the same probabilities
on the same data A.
Jeffreys (1967) remarks that it has "...not yet been assumed that probabilities can be expressed by numbers. I do not think that the introduction of numbers is strictly necessary to the further development; but it has the enormous advantage that it permits us to use mathematical technique. Without it, while we might obtain a set of propositions that would have the same meanings, their expression would be much more cumbersome" (pp. 18-19). Thus, Jeffreys recognizes that it is possible to have a "non-numerical" theory of probability but opts for a "numerical" theory in order to take advantage of less cumbersome mathematics that, he believes, leads to propositions with about the same meanings.
The following notation and definitions are introduced to facilitate further
analysis.
Definitions 2
(3) A ∪ B means "A or B", that is, at least one of A and B is true. The proposition A ∪ B is also referred to as the "union" or "disjunction" or "logical sum" of A and B.
(4) A ∩ B ∩ C ∩ D means "A and B and C and D", that is, A, B, C, and D are all true.
(5) A ∪ B ∪ C ∪ D means "A or B or C or D", that is, at least one of A, B, C, and D is true.
(6) Propositions Bi, i = 1,2,...,n, are said to be exclusive on data A if not more than one of them can be true given A.
(7) Propositions Bi, i = 1,2,...,n, are said to be exhaustive on data A if at least one of them must be true given A.
Note that a set of propositions can be both exclusive and exhaustive. Also, for example, Axiom 4 can be restated using the above notation and concepts as:
Axiom 4
If B1 and B2 are exclusive and C1 and C2 are exclusive, given data A, and if, given A, B1 and C1 are equally probable and B2 and C2 are equally probable, then, given A, B1 ∪ B2 and C1 ∪ C2 are equally probable.
At this point in the development of his axiom system, Jeffreys introduces
numbers associated with or measuring probabilities by the following conventions.
Convention 1
A larger number is assigned to the more probable proposition (and therefore
equal numbers to equally probable propositions).
Convention 2
If, given A, B1 and B2 are exclusive, then the number assigned on data A to "B1 or B2", that is B1 ∪ B2, is the sum of those assigned to B1 and to B2.
The following axiom is needed to insure that there are enough numbers
available to associate with probabilities.
Axiom 5
The set of possible probabilities on given data, ordered in terms of the relation "more probable than", can be put into a one-one correspondence with a set of real numbers in increasing order.
It is important to realize that the notation P(B|A) stands for the number associated with the probability of the proposition B on data A. The number
expresses or measures the reasonable degree of confidence in B given A, that is,
the probability of B given A, but is not identical to it.
The following theorem that Jeffreys derives from Axiom 3 and Convention 2
relates to the numerical assessment of impossible propositions.
Theorem 2
If B is impossible on data A, that is, if A entails not-B, then P(B|A) = 0.
Convention 3
If A entails B, then P(B I A) = 1.
The use of 1 to represent certainty is a pure convention. In some cases it is useful
to allow numerical probabilities to range from 0 to ∞ rather than from 0 to 1. On
given data, however, it is necessary to use the same numerical value for certainty.
Axiom 6
If A ∩ B entails C, then P(B ∩ C|A) = P(B|A).
Theorem 3
If B and C are equivalent in the sense that each entails the other, then each entails B ∩ C, and the probabilities of B and C on any given data must be equal. Similarly, if A ∩ B entails C and A ∩ C entails B, P(B|A) = P(C|A), since both are equal to P(B ∩ C|A).
Theorem 4
P(B|A) = P(B ∩ C|A) + P(B ∩ ~C|A).
Further, since P(B ∩ ~C|A) ≥ 0, P(B|A) ≥ P(B ∩ C|A). Also, by using B ∪ C for B in Theorem 4, it follows that P(B ∪ C|A) ≥ P(C|A).
The addition rule for numerical probabilities is given by Theorem 5.
Theorem 5
If B and C are two propositions, not necessarily exclusive on data A, the addition rule is given by

P(B ∪ C|A) = P(B|A) + P(C|A) - P(B ∩ C|A).

It follows that

P(B ∪ C|A) ≤ P(B|A) + P(C|A),

since P(B ∩ C|A) ≥ 0. Further, if B and C are exclusive, then P(B ∩ C|A) = 0 and P(B ∪ C|A) = P(B|A) + P(C|A).
Theorems 4 and 5 together express upper and lower bounds on the possible values of P(B ∪ C|A) irrespective of exclusiveness, that is,

max[P(B|A), P(C|A)] ≤ P(B ∪ C|A) ≤ P(B|A) + P(C|A).
Theorem 6
If B1, B2,...,Bn are a set of equally probable and exclusive alternatives on data A, and if Q and R are unions of two subsets of these alternatives, of numbers m and n, then P(Q|A)/P(R|A) = m/n. This follows from Convention 2 since P(Q|A) = ma and P(R|A) = na, where a = P(Bi|A) for all i.
Theorem 7
Under the conditions of Theorem 6, if B1, B2,...,Bn are exhaustive on data A, and R denotes their union, then R is entailed by A and, by Convention 3, P(R|A) = 1, and it follows that P(Q|A) = m/n.
As Jeffreys notes, Theorem 7 is virtually Laplace's rule stated at the beginning of his Théorie Analytique. Since R entails itself and is a possible value of A, it is possible to write P(Q|R) = m/n, which Jeffreys (1967) interprets as, "...given
that a set of alternatives are equally probable, exclusive and exhaustive, the
probability that some one of any subset is true is the ratio of the number in that
subset to the whole number of possible cases" (p. 23). Also, Theorem 6 is
consistent with the possibility that the number of alternatives is infinite, since it
requires only that Q and R shall be finite subsets.
Theorems 6 and 7 indicate how to assess the ratios of probabilities and their
actual values. Such assessments will always be rational fractions that Jeffreys calls
R-probabilities. If all probabilities were R-probabilities, there would be no need
for Axiom 5. But, as Jeffreys points out, many propositions are of a form that a
magnitude capable of a continuous range of values lies within a specified part of
the range and it may not be possible to express them in the required form. He
explains how to deal with this problem and puts forward the following theorem:
Theorem 8
Any probability can be expressed by a real number. For a variable z that can assume a continuous set of values, given A, the probability that z's value is less than a given value z0 is P(z < z0|A) = F(z0), where F(z0) is referred to as the cumulative distribution function (cdf). If F(z0) is differentiable, P(z0 < z < z0 + dz0|A) = F'(z0)dz0 to the first order in dz0, where F'(z0) is the probability density function.
In other words, given A throughout, the probability that the true proposition is in the intersection of R and S is equal to the probability that it is in R times the probability that it is in S, given that it is in R. Theorem 9 involves the assumption that the alternatives in Q are equally probable, both given A and also given R ∩ A. Jeffreys notes that it has not been possible to relax this assumption in proving Theorem 9. However, he regards this theorem as suggestive of the simplest rule that relates probabilities based on different data, here denoted by A and R ∩ A, and puts forward the following axiom.
Axiom 7
P(B ∩ C|A) = P(B|A)P(C|B ∩ A)/P(B|B ∩ A).

If Convention 3 is used in Axiom 7, P(B|B ∩ A) = 1 and

P(B ∩ C|A) = P(B|A)P(C|B ∩ A),

which is the product rule. Thus, the product rule relates to probability statements regarding the logical product or intersection B ∩ C, often also written as BC, while the addition or sum rule relates to probability statements regarding the logical sum or union, B ∪ C.
Since Axiom 7 is just suggested by Theorem 9, Jeffreys shows that it holds in
several extreme cases and concludes that, "The product rule may therefore be taken as general unless it can be shown to lead to contradictions" (p. 26). Also, he states, "When the probabilities...are chances, they satisfy the product rule automatically" (p. 51).³
³Jeffreys (1967) defines "chance" as follows: "If q1, q2,...,qn are a set of alternatives, mutually exclusive and exhaustive on data r, and if the probabilities of p given any of them and r are the same, each of these probabilities is called the chance of p on data r" (p. 51).
where 1/c = Σr P(qr|A)P(x|qr ∩ A). This is the principle of inverse probability or Bayes' Theorem, first given in Bayes (1763). The result can also be expressed as: the posterior probability, P(qr|x ∩ A), is proportional to the prior probability, P(qr|A), times the likelihood, P(x|qr ∩ A), as Jaynes (1974), among others, has emphasized. See also Fisher (1959), Savage et
al. (1962), Lindley (1971), Barnett (1973), Cox and Hinkley (1974), Rothenberg
(1975), and Zellner (1975) for further discussion of these and related issues.
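As an illustrative sketch of the discrete form of Bayes' Theorem just described, the following Python fragment computes posterior probabilities for a set of exclusive and exhaustive propositions qr; the prior probabilities and likelihoods used are hypothetical values chosen only to show the computation.

import math  # not strictly needed; standard library only

# Discrete Bayes' Theorem: P(q_r | x, A) = c * P(q_r | A) * P(x | q_r, A),
# with 1/c = sum_r P(q_r | A) * P(x | q_r, A).
priors = [0.2, 0.5, 0.3]          # P(q_r | A), hypothetical values
likelihoods = [0.10, 0.40, 0.80]  # P(x | q_r, A), hypothetical values

joint = [p * l for p, l in zip(priors, likelihoods)]
c_inv = sum(joint)                # 1/c, the normalizing constant
posteriors = [j / c_inv for j in joint]

print(posteriors)                 # posterior probabilities, which sum to one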
Jeffreys' theory of probability, described above, did not mention utility or
benefit since it is primarily a theory of what it is reasonable to believe. However,
Jeffreys notes that his theory permits him to define the expectation of a function u(x̃), say a utility function, as follows:

E[u(x̃)|A] = Σ_{i=1}^{m} u(xi)P(xi|A),

when x̃ is a discrete random variable with possible values x1, x2,...,xm, or by

E[u(x̃)|A] = ∫_a^b u(x)f(x|A)dx,

when x̃ is a continuous random variable with pdf f(x|A).
Assumption 2
Indifference surfaces extend smoothly from boundary to boundary in the r-space,
R, of the ui's, i = 1,2,...,r.
Assumption 3
If da, db, and dc are three decision functions such that da and db are indifferent, then, given any p such that 0 ≤ p ≤ 1, a mixed strategy that selects da with objective probability p and dc with objective probability 1 - p is indifferent to a strategy which selects db with objective probability p and dc with objective probability 1 - p.
From these three assumptions that are discussed at length in Luce and Raiffa
(1957), Raiffa and Schlaifer (1961) show that "...the decision-maker's indifference surfaces must be parallel hyper-planes with a common normal going into the interior of the first orthant, from which it follows that all utility characteristics u = (u1, u2,...,ur) in R can in fact be ranked by an index which applies a predetermined set of weights P = (p1, p2,...,pr) to their r components" (p. 25). That is, Σ_{i=1}^{r} pi ui and Σ_{i=1}^{r} pi vi can be employed to rank decision functions da and db with utility characteristics u and v, respectively, where the pi's are the predetermined set of non-negative weights that can be normalized and have all the properties of a probability measure on Θ. Since the pi's are intimately related to a person's indifference surfaces, it is clear why some refer to the normalized pi's as "personal probabilities". For more discussion of this topic see Blackwell and Girshick (1954, ch. 4), Luce and Raiffa (1957, ch. 13), Savage (1954, ch. 1-5), and DeGroot (1970, ch. 6-7). Further, Jeffreys (1967) remarks:
For a discrete rv, x̃, that can assume the values x1, x2,...,xm, where the xj's are distinct and exhaustive, the probability that x̃ = xj, given the initial information A, is

P(x̃ = xj|A) = pj, j = 1,2,...,m, (2.1)

with

Σ_{j=1}^{m} pj = 1, 0 ≤ pj, j = 1,2,...,m. (2.2)

The collection of pj's in (2.1), subject to (2.2), defines the probability mass function (pmf) for the discrete rv, x̃. A plot of the pj's against the xj's may be unimodal, bimodal, U-shaped, J-shaped, uniform (p1 = p2 = ··· = pm), etc. If it is unimodal, the pmf's modal value is the value of x̃ associated with the largest pj. Further, the mean of x̃, denoted by μ ≡ Ex̃, is:

μ = Ex̃ = Σ_{j=1}^{m} pj xj. (2.3)
The central moments of x̃, or moments about the mean, are

μr = E(x̃ - μ)^r = Σ_{j=1}^{m} pj(xj - μ)^r, r = 1,2,..., (2.5)

with, in particular, the second central moment or variance given by

μ2 = σ² = E(x̃ - μ)² = Σ_{j=1}^{m} pj(xj - μ)². (2.6)

The central moments can be expressed in terms of the moments about zero, the μ'_r given in (2.4). The expression in (2.7) is obtained by noting that (x̃ - μ)^r = Σ_{j=0}^{r} (r choose j) x̃^j (-μ)^{r-j}.
Ch. 2." Statistical Theory and Econometrics 85
and
2.4.2.1. The binomial process. Consider a dichotomous rv, ỹi, such that ỹi = 1 with probability p and ỹi = 0 with probability 1 - p. For example, ỹi = 1 might denote the appearance of a head on a flip of a coin and ỹi = 0 the appearance of a tail. Then Eỹi = 1·p + 0·(1 - p) = p and V(ỹi) = E(ỹi - Eỹi)² = (1 - p)²p + (0 - p)²(1 - p) = p(1 - p). Now consider a sequence of such ỹi's, i = 1,2,...,n, such that the value of any member of the sequence provides no information about the values of others, that is, the ỹi's are independent rvs. Then any particular realization of r ones and n - r zeros has probability p^r(1 - p)^{n-r}. On the other hand, the probability of obtaining r ones and n - r zeros is

P(r̃ = r|n, p) = (n r) p^r (1 - p)^{n-r}, (2.9)

where

(n r) ≡ n!/r!(n - r)!.

Note that the total number of realizations with r ones and n - r zeros is obtained by recognizing that the first one can occur in n ways, the second in n - 1 ways, the third in n - 2 ways, and the rth in n - (r - 1) ways. Thus, there are n(n - 1)(n - 2)···(n - r + 1) ordered ways in which the r ones can occur and, since their order is immaterial, n(n - 1)···(n - r + 1)/r! = n!/r!(n - r)! distinct realizations.
Summing (2.9) over r gives Σ_{r=0}^{n} (n r) p^r q^{n-r} = (p + q)^n = 1, where q = 1 - p, and hence the name "binomial" distribution. Given the value of p, it is possible to compute various probabilities from (2.9). For example,

Pr(r̃ ≤ r0|p, n) = Σ_{r=0}^{r0} (n r) p^r (1 - p)^{n-r},

the probability that r̃ is less than or equal to a given value r0. Further, the mean of r̃ is

E(r̃|p, n) = Σ_{r=0}^{n} r P(r̃ = r|p, n) = np.
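A minimal Python sketch of these computations, using only the standard library and illustrative values n = 10 and p = 0.3, is given below.

from math import comb

# Binomial pmf as in (2.9): P(r | n, p) = C(n, r) * p^r * (1-p)^(n-r).
def binom_pmf(r, n, p):
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.3                        # illustrative values
pmf = [binom_pmf(r, n, p) for r in range(n + 1)]

r0 = 4
cdf_r0 = sum(pmf[: r0 + 1])           # Pr(r <= r0 | p, n)
mean = sum(r * pmf[r] for r in range(n + 1))

print(sum(pmf))                       # the probabilities sum to one
print(cdf_r0)                         # cumulative probability up to r0
print(mean, n * p)                    # the mean equals n*p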
Theorem

With m ≡ n - r,

-log P(r̃ = r|n, p) = log r! + log m! - log n! - r log p - m log(1 - p).

On applying Stirling's approximation to the factorials,

-log P(r̃ = r|n, p) ≈ (1/2)log(2πrm/n) + r log(r/np) + m log[m/n(1 - p)].
2.4.2.3. Other variants of the binomial process. Two interesting variants of the
binomial process are the Poisson and Lexis schemes. In the Poisson scheme, the
probability that ỹi = 1 is pi, and not p as in the binomial process. That is, the probability of a one (or "success") varies from trial to trial. As before, the ỹi's are assumed independent. Then the expectation of r̃, the number of ones, is

E(r̃) = Σ_{i=1}^{n} pi = np̄,

where

p̄ = Σ_{i=1}^{n} pi/n,

and the variance is

V(r̃) = Σ_{i=1}^{n} pi(1 - pi) = np̄(1 - p̄) - nσ_p²,

where

σ_p² = Σ_{i=1}^{n} (pi - p̄)²/n.

Note that V(r̃) is less than the variance of r̃ associated with independent binomial trials with a fixed probability p̄ at each trial.
Extensions of the Poisson scheme that are widely used in practice involve the assumption that pi = f(xi, β), where xi is a given vector of observed variables and
Ch. 2." Statistical Theory and Econometrics 89
P(ñ = n|r, p) = [(n - 1)!/(r - 1)!(n - r)!] p^{r-1}(1 - p)^{n-r}·p
             = [(n - 1)!/(r - 1)!(n - r)!] p^r (1 - p)^{n-r}, (2.17)

with n ≥ r ≥ 1, which should be compared with the pmf for the usual binomial process in (2.9).
2.4.2.4. Multinomial process. Let pj be the probability that a discrete rv, ỹi, assumes the value j, j = 1,2,...,J. If we observe n independent realizations of ỹi, i = 1,2,...,n, the probability that r1 values have j = 1, r2 have j = 2,..., and rJ have j = J, with n = Σ_{j=1}^{J} rj, is given by

P(r̃ = r|p) = [n!/r1!r2!···rJ!] p1^{r1} p2^{r2} ··· pJ^{rJ}, (2.18)

with r̃' = (r̃1, r̃2,...,r̃J), r' = (r1, r2,...,rJ), p' = (p1, p2,...,pJ), 0 ≤ pj, and Σ_{j=1}^{J} pj = 1.
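A direct computation of the multinomial probability in (2.18) is sketched below in Python; the values of p and r are hypothetical and serve only to illustrate the formula.

from math import factorial

# Multinomial pmf as in (2.18): n!/(r1! r2! ... rJ!) * p1^r1 * ... * pJ^rJ.
def multinomial_pmf(r, p):
    n = sum(r)
    coef = factorial(n)
    for rj in r:
        coef //= factorial(rj)        # exact integer arithmetic for the coefficient
    prob = 1.0
    for rj, pj in zip(r, p):
        prob *= pj**rj
    return coef * prob

p = [0.2, 0.3, 0.5]     # illustrative cell probabilities, summing to one
r = [2, 3, 5]           # observed counts with n = 10
print(multinomial_pmf(r, p))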
We first describe some properties of models for a single continuous rv, that is, univariate probability density functions (pdfs), and then turn to some models for two or more continuous rvs, that is, bivariate or multivariate pdfs.
Let x̃ denote a continuous rv that can assume a continuum of values in the interval a to b and let f(x) be a non-negative function for a < x < b such that Pr(x < x̃ < x + dx) = f(x)dx, where a < x < b, and ∫_a^b f(x)dx = 1. Then f(x) is the normalized pdf for the continuous rv, x̃. In this definition, a may be equal to -∞ and/or b = ∞. Further, the cumulative distribution function (cdf) for x̃ is given by F(x) = ∫_a^x f(t)dt with a < x < b. Given that ∫_a^b f(t)dt = 1, 0 ≤ F(x) ≤ 1. Further, Pr(c < x̃ < d) = F(d) - F(c), where a ≤ c < d ≤ b.
The moments about zero of a continuous rv, x̃, with pdf f(x), are given by

μ'_r = Ex̃^r = ∫_a^b x^r f(x)dx, r = 1,2,..., (2.19)

and the moments about the mean μ = Ex̃, or central moments, by

μr = E(x̃ - μ)^r = ∫_a^b (x - μ)^r f(x)dx, r = 1,2,..., (2.20)

with μ2 = σ², the variance, and σ, the standard deviation. Note that μ1 = 0. Also, moments are said to exist when the integrals in (2.19) and (2.20) converge; in cases in which particular integrals in (2.19) and (2.20) fail to converge, the associated moments do not exist or are infinite.⁶
For unimodal pdfs, unitless measures of skewness are sk = (mean - mode)/σ, β1 = μ3²/μ2³, and γ1 = μ3/μ2^{3/2}. For symmetric, unimodal pdfs, mean = modal value and thus sk = 0. Since all odd order central moments are equal to zero, given symmetry, β1 = γ1 = 0. Measures of kurtosis are given by β2 = μ4/μ2² and γ2 = β2 - 3, the "excess". For a normal pdf, β2 = 3 and γ2 = 0. When γ2 > 0, a pdf is called leptokurtic, and platykurtic when γ2 < 0.
The moment-generating function for x̃ with pdf f(x) is

M(t) = Ee^{tx̃} = ∫_a^b e^{tx} f(x)dx, (2.21)

from which

d^r M(t)/dt^r |_{t=0} = Ex̃^r = μ'_r, r = 1,2,...,

under the assumption that the integral in (2.21) converges for some t = t0 > 0. In (2.21), a may equal -∞ and/or b may equal ∞. Thus, knowing the form of M(t) allows one to compute moments conveniently.
The characteristic function associated with the continuous rv, x̃, with pdf f(x), is given by

c(t) = Ee^{itx̃} = ∫_a^b e^{itx} f(x)dx, where i = √(-1).

⁶For analysis regarding the convergence and divergence of integrals, see, for example, Widder (1961, p. 271 ff.), or other advanced calculus texts.
= 1 + (1/2)(a + b)t + (1/3)(a² + ab + b²)t²/2! + ···.

f(x|θ, σ) = (1/πσ)·1/[1 + ((x - θ)/σ)²], -∞ < x < ∞, -∞ < θ < ∞, 0 < σ < ∞. (2.24)
f(z) = (1/π)·1/(1 + z²), -∞ < z < ∞, (2.25)

the standardized form obtained with z = (x - θ)/σ. Another important pdf is

f(x|θ, σ) = (2πσ²)^{-1/2} exp{-(x - θ)²/2σ²}, -∞ < x < ∞, -∞ < θ < ∞, 0 < σ < ∞. (2.26)

The pdf in (2.26) is the normal pdf that integrates to one and thus is normalized. It is symmetric about θ, a location parameter that is the modal value, median, and mean. The parameter σ is a scale parameter, the standard deviation of the normal pdf as indicated below. Note that from numerical evaluation, Pr{|x̃ - θ| ≤ 1.96σ} = 0.95 for the normal pdf (2.26), indicating that its tails are rather thin or, equivalently, that (2.26) decreases very rapidly as (x - θ)² grows in value, a fact that accounts for the existence of moments of all orders. Since (2.26) is symmetric about θ, all odd order central moments are zero, that is, μ_{2r+1} = 0, r = 0,1,2,.... From E(x̃ - θ) = 0, Ex̃ = θ, the mean of the normal pdf. As regards even central moments, they satisfy⁷

μ_{2r} = E(x̃ - θ)^{2r} = 2^r Γ(r + 1/2)σ^{2r}/Γ(1/2)
       = (2r)! σ^{2r}/2^r r!, r = 1,2,.... (2.27)
f(x|θ, v, h) = c[1 + (h/v)(x - θ)²]^{-(v+1)/2}, -∞ < x < ∞, -∞ < θ < ∞, 0 < v, h < ∞, (2.29)

the univariate Student-t (US-t) pdf, where c = Γ[(v + 1)/2]h^{1/2}/Γ(v/2)Γ(1/2)v^{1/2} is the normalizing constant.
⁷See, for example, Zellner (1971, p. 365) for a derivation of (2.27). From the calculus, Γ(q + 1) = qΓ(q), Γ(1) = 1, and Γ(1/2) = √π. Using these relations, the second line of (2.27) can be derived from the first.
From (2.30), the second and fourth central moments are μ2 = E(x̃ - θ)² = v/(v - 2)h, v > 2, and μ4 = E(x̃ - θ)⁴ = 3v²/(v - 2)(v - 4)h², v > 4. The kurtosis measure is then γ2 = μ4/μ2² - 3 = 6/(v - 4), for v > 4, and thus the US-t is leptokurtic (γ2 > 0). As v gets large, γ2 → 0, and the US-t approaches a normal form with mean θ and variance 1/h. When v > 30, the US-t's form is very close to that of a normal pdf. However, for small v, the US-t pdf has much heavier tails than a normal pdf with the same mean and variance.
The standardized form of (2.29) is obtained by making the change of variable t = √h(x - θ), that yields

f(t|v) = (c/√h)[1 + t²/v]^{-(v+1)/2}, -∞ < t < ∞, (2.31)

where c has been defined in connection with (2.29). The standardized US-t pdf in (2.31) has its modal value at t = 0 which is also the median. The moments of (2.31) may easily be obtained from those of x̃ - θ presented above.
Finally, it is of interest to note that the US-t pdf in (2.29) can be generated as a "continuous mixture" of normal pdfs, that is,

f(x|θ, v, h) = ∫_0^∞ f_N(x|θ, σ) f_IG(σ|v, h) dσ, (2.32)

where f_N(x|θ, σ) is a normal pdf with mean θ and standard deviation σ and f_IG(σ|v, h) is the inverted gamma pdf

f_IG(σ|v, h) = [2/Γ(v/2)](v/2h)^{v/2} σ^{-(v+1)} exp{-v/2hσ²}, 0 < σ < ∞, (2.33)

where v and h are the parameters in the US-t pdf in (2.29).⁸ From (2.32), it is seen that f_N(x|θ, σ) is averaged over possible values of σ. This is an example in which the standard deviation σ of a normal pdf can be viewed as random with the pdf shown in (2.33). The fact that (2.32) yields the US-t pdf is a useful interpretation of the US-t pdf. Many well-known pdfs can be generated as continuous mixtures of underlying pdfs.
⁸See Zellner (1971, p. 371 ff.) for properties of this and other inverted gamma pdfs.
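The mixture representation in (2.32) and (2.33) can be checked by simulation: drawing σ from the inverted gamma pdf and then x from a normal pdf with that σ should reproduce a Student-t variable. The sketch below uses the fact (an assumption verified analytically, not stated in the text) that under (2.33) the quantity v/(hσ²) has a chi-square distribution with v degrees of freedom; the parameter values are illustrative.

import numpy as np

rng = np.random.default_rng(0)

theta, v, h = 2.0, 5.0, 4.0      # illustrative parameter values
n = 200_000

# Under (2.33), w = v/(h*sigma^2) is chi-square with v d.f., so sigma can be
# obtained by transforming chi-square draws.
w = rng.chisquare(v, size=n)
sigma = np.sqrt(v / (h * w))

# Mixture draw: x given sigma is normal with mean theta and std. dev. sigma.
x = rng.normal(theta, sigma)

# Direct Student-t draw with the same location and scale for comparison.
t_direct = theta + rng.standard_t(v, size=n) / np.sqrt(h)

# Both samples should have variance close to v/((v-2)h), the second moment below (2.30).
print(x.var(), t_direct.var(), v / ((v - 2) * h))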
⁹From the calculus, B(a, b) = B(b, a) = Γ(a)Γ(b)/Γ(a + b). Also, B(a, b) ≡ ∫_0^1 z^{a-1}(1 - z)^{b-1} dz, a, b > 0.
in which particular pdfs can be generated. Further examples are provided in the references mentioned above.
Above, the reciprocal transformation was employed to produce "inverted" pdfs. Many other transformations can be fruitfully utilized. For example, if the continuous rv ỹ is such that 0 < ỹ < ∞ and z̃ = ln ỹ, -∞ < z̃ < ∞, has a normal pdf with mean θ and variance σ², ỹ is said to have a "log-normal" pdf whose form can be obtained from the normal pdf for z̃ by a simple change of variable. The median of the pdf for ỹ = e^{z̃} is e^θ, while the mean of ỹ is Eỹ = Ee^{z̃} = e^{θ+σ²/2}. Thus, there is an interesting dependence of the mean of ỹ on the variance of z̃. This and many other transformations have been analyzed in the literature.
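The dependence of the mean of the log-normal rv on the variance of z̃ can be verified numerically; the sketch below uses illustrative values θ = 0.5 and σ = 1.2.

import numpy as np

rng = np.random.default_rng(1)

theta, sigma = 0.5, 1.2          # illustrative mean and std. dev. of z = ln y
z = rng.normal(theta, sigma, size=1_000_000)
y = np.exp(z)                    # y is log-normal

print(np.median(y), np.exp(theta))                 # median of y is exp(theta)
print(y.mean(), np.exp(theta + sigma**2 / 2))      # mean of y is exp(theta + sigma^2/2)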
Finally, it should be noted that many of the pdfs mentioned in this section can be obtained as solutions to differential equations. For example, the normal pdf in (2.26) is the solution to (1/f)df/dx = -(x - μ)/σ². The generalization of this differential equation that yields the Pearson system of pdfs is given by

(1/f)df/dx = -(x - a)/(b0 + b1x + b2x²). (2.34)

The integral of (2.34) is

f(x) = A(x - c1)^{m1}(x - c2)^{m2}, (2.35)

where the value of A is fixed by ∫f(x)dx = 1, c1 and c2 are the roots, possibly complex, of b0 + b1x + b2x² = 0, and the exponents m1 and m2 depend on a, b2, c1, and c2. See Jeffreys (1967, p. 74 ff.) for a discussion of the solutions to (2.35) that constitute the Pearson system which includes many frequently encountered pdfs. For a discussion of other systems of pdfs, see Kendall and Stuart (1958, p. 167 ff.).
Pr(x̃ ≤ a) = ∫_{-∞}^{a1} ∫_{-∞}^{a2} ··· ∫_{-∞}^{am} f(x1, x2,...,xm) dx1 dx2 ··· dxm,

where a' = (a1, a2,...,am) is a given vector and Pr(x̃ ≤ a) is the probability of the intersection of the events x̃i ≤ ai, i = 1,2,...,m.
The mean, assuming that it exists, of an m × 1 random vector x̃ is

Ex̃ = (Ex̃1, Ex̃2,...,Ex̃m)' = (θ1, θ2,...,θm)' = θ. (2.37)
J = mod|∂z/∂x| = mod|H|, (2.42)

where "mod" denotes absolute value, and the Jacobian matrix is

∂z/∂x = [∂zi/∂xj] = H,

the m × m matrix whose (i, j)th element is ∂zi/∂xj, i, j = 1,2,...,m.
¹⁰That is, given that Σ is a positive definite symmetric matrix, Σ can be diagonalized as follows: P'ΣP = D(λi), where P is an m × m orthogonal matrix and the λi are the roots of Σ. Then D^{-1/2}P'ΣPD^{-1/2} = I and H = D^{-1/2}P', where D^{-1/2} = D(λi^{-1/2}), an m × m diagonal matrix with typical element λi^{-1/2}.
g(x1) = ∫_{R_{x2}} f(x1, x2) dx2 and h(x2) = ∫_{R_{x1}} f(x1, x2) dx1, (2.43)

where R_{x1} and R_{x2} denote the regions of x1 and x2, respectively. Note that ∫_{R_{x1}} g(x1)dx1 = ∫_{R_{x2}} h(x2)dx2 = 1, given that f(x1, x2) satisfies (2.36). Also, g(x1)dx1 is the probability that x̃1 is in the interval (x1, x1 + dx1) for all values of x̃2 and similarly for h(x2).

Definition

The rvs x̃1 and x̃2 are independent if and only if their joint pdf, f(x1, x2), satisfies f(x1, x2) = g(x1)h(x2).

The conditional pdf for x̃1 given x2, denoted by f(x1|x2), is

f(x1|x2) = f(x1, x2)/h(x2), (2.44)

provided that h(x2) > 0. Similarly, the conditional pdf for x̃2 given x̃1, denoted by f(x2|x1), is f(x2|x1) = f(x1, x2)/g(x1), provided that g(x1) > 0. From (2.44), f(x1, x2) = f(x1|x2)h(x2) which, when inserted in (2.43), shows that the marginal pdf

g(x1) = ∫_{R_{x2}} f(x1|x2)h(x2) dx2.

Also, ∫_{R_{x1}} f(x1|x2) dx1 = 1, since ∫_{R_{x1}} f(x1, x2) dx1 = h(x2). Further,

f(x1, x2)dx1dx2 = [h(x2)dx2][f(x1|x2)dx1]
                = Pr(x2 < x̃2 < x2 + dx2)Pr(x1 < x̃1 < x1 + dx1|x̃2 = x2).
(1) Bivariate Normal (BN). A two-element random vector, x̃' = (x̃1, x̃2), has a BN distribution if and only if its pdf is

f(x1, x2|θ) = [2πσ1σ2(1 - ρ²)^{1/2}]^{-1} exp{-Q/2}, (2.45)

where

Q = [(x1 - μ1)²/σ1² + (x2 - μ2)²/σ2² - 2ρ(x1 - μ1)(x2 - μ2)/σ1σ2]/(1 - ρ²) (2.46)

and θ' = (μ1, μ2, σ1, σ2, ρ), with |ρ| < 1, -∞ < μi < ∞, and 0 < σi < ∞, i = 1,2. Under these conditions, Q is a positive definite quadratic form.
To obtain the standardized form of (2.45), let z1 = (x1 - μ1)/σ1 and z2 = (x2 - μ2)/σ2, with dz1dz2 = dx1dx2/σ1σ2. Further, let the 2 × 2 matrix P^{-1} be defined by

P^{-1} = [1/(1 - ρ²)] (  1   -ρ )
                      ( -ρ    1 ),

where the matrix involving ρ is the inverse of P^{-1}, introduced above. Then (2.49) yields

Ez̃1² = E[(x̃1 - μ1)/σ1]² = 1 or E(x̃1 - μ1)² = σ1²,

and

Ez̃1z̃2 = E{[(x̃1 - μ1)/σ1][(x̃2 - μ2)/σ2]} = ρ or E(x̃1 - μ1)(x̃2 - μ2) = σ1σ2ρ,
where zi = (xi - μi)/σi, i = 1,2. Substituting (2.50a) into (2.45) and noting that dz1dz2 = dx1dx2/σ1σ2, the pdf for z̃1 and z̃2 is

f(z1, z2|ρ) = f(z2|z1, ρ)g(z1), (2.51)

where

f(z2|z1, ρ) = [2π(1 - ρ²)]^{-1/2} exp{-(z2 - ρz1)²/2(1 - ρ²)} (2.51a)

and

g(z1) = (2π)^{-1/2} exp{-z1²/2}. (2.51b)

That (2.51b) is the marginal pdf for z̃1 can be established by integrating (2.51) with respect to z2. Given this result, f(z2|z1, ρ) = f(z1, z2|ρ)/g(z1) is the conditional pdf for z̃2 given z̃1 and is shown explicitly in (2.51a).
From (2.51b), it is seen that the marginal pdf for z̃1 is a standardized normal pdf with zero mean and unit variance. Since z̃i = (x̃i - μi)/σi, the marginal pdf for x̃1 is normal with mean μ1 and variance σ1², and the conditional pdf for x̃2, given x̃1 = x1, is

f(x2|x1, θ) = [2πσ2²(1 - ρ²)]^{-1/2} exp{-[x2 - μ2 - β2.1(x1 - μ1)]²/2σ2²(1 - ρ²)}, (2.52)

where θ' = (ρ, μ1, μ2, β2.1, σ2), with β2.1 ≡ σ2ρ/σ1. From (2.52),

E(x̃2|x̃1 = x1) = μ2 + β2.1(x1 - μ1) (2.53)

and

V(x̃2|x̃1 = x1) = σ2²(1 - ρ²), (2.54)

where E(x̃2|x̃1 = x1) is the conditional mean of x̃2, given x̃1, and V(x̃2|x̃1 = x1) is the conditional variance of x̃2, given x̃1. Note from (2.53) that the conditional mean of x̃2 is linear in x1 with slope or "regression" coefficient β2.1 = σ2ρ/σ1.
The marginal pdf for x̃2 and the conditional pdf for x̃1, given x̃2, may be obtained by substituting (2.50b) into (2.45) and performing the operations in the preceding paragraphs. The results are:

h(x2) = (2πσ2²)^{-1/2} exp{-(x2 - μ2)²/2σ2²} (2.56)

and

f(x1|x2, θ) = [2πσ1²(1 - ρ²)]^{-1/2} exp{-[x1 - μ1 - β1.2(x2 - μ2)]²/2σ1²(1 - ρ²)}, (2.57)

where β1.2 ≡ σ1ρ/σ2. From (2.56), it is seen that the marginal pdf for x̃2 is normal with mean μ2 and variance σ2², while from (2.57) the conditional pdf for x̃1, given x̃2, is normal with

E(x̃1|x̃2 = x2) = μ1 + β1.2(x2 - μ2) (2.58)

and

V(x̃1|x̃2 = x2) = σ1²(1 - ρ²). (2.59)

When ρ = 0, the conditional pdfs in (2.52) and (2.57) reduce to normal pdfs with means μ2 and μ1 and the marginal variances, and β1.2 = β2.1 = 0, that is, the regressions in (2.53) and (2.58) have zero slope coefficients.
Ex̃ = θ. (2.62)

Thus, θ is the mean of the MVN pdf. Also, (2.61) implies that Ez̃z̃' = Im, since the elements of z̃ are independent standardized normal rvs. It follows that H^{-1}E(x̃ - θ)(x̃ - θ)'(H')^{-1} = Im; that is, E(x̃ - θ)(x̃ - θ)' = HH', or

E(x̃ - θ)(x̃ - θ)' = Σ, (2.63)

since from H'Σ^{-1}H = Im, Σ = HH'. Thus, Σ is the covariance matrix of x̃.
To obtain the marginal and conditional pdfs associated with (2.60), let G ≡ Σ^{-1} and partition x - θ and G correspondingly as

x - θ = ( x1 - θ1 )           G = ( G11  G12 )
        ( x2 - θ2 )    and        ( G21  G22 ),

where x1 - θ1 has m1 elements and x2 - θ2 has m2 elements, with m1 + m2 = m. Then

(x - θ)'G(x - θ) = Q1 + Q2, (2.64)

¹²Note that the Jacobian of the transformation from x - θ to z is |H| and |Σ|^{1/2} = |H| from |H'Σ^{-1}H| = |Im| = 1. Thus, |Σ|^{-1/2}|H| = 1.
with

Q1 = [x1 - θ1 + G11^{-1}G12(x2 - θ2)]'G11[x1 - θ1 + G11^{-1}G12(x2 - θ2)]

and

Q2 = (x2 - θ2)'(G22 - G21G11^{-1}G12)(x2 - θ2).

From this decomposition, the joint pdf in (2.60) factors into the conditional pdf for x̃1, given x̃2, shown in (2.67a), and the marginal pdf for x̃2, shown in (2.67b), the latter being an MVN pdf with

Ex̃2 = θ2 and E(x̃2 - θ2)(x̃2 - θ2)' = Σ22,

where Σ = (Σij), i, j = 1,2, is partitioned conformably with G; from results on inverting partitioned matrices, Σ22 = (G22 - G21G11^{-1}G12)^{-1}, Σ12Σ22^{-1} = -G11^{-1}G12, and G11^{-1} = Σ11 - Σ12Σ22^{-1}Σ21.
The mean and covariance matrices of the conditional pdf in (2.67a) are

E(x̃1|x̃2) = θ1 - G11^{-1}G12(x2 - θ2) = θ1 + Σ12Σ22^{-1}(x2 - θ2) (2.68)

and

V(x̃1|x̃2) = G11^{-1} = Σ11 - Σ12Σ22^{-1}Σ21.

Similar operations can be utilized to derive the marginal pdf for x̃1 and the conditional pdf for x̃2, given x̃1. As in the case of (2.67a) and (2.67b), the marginal and conditional pdfs are MVN pdfs. In addition, just as E(x̃1|x̃2) in (2.68) is linear in x2, E(x̃2|x̃1) = θ2 + Σ21Σ11^{-1}(x1 - θ1) is linear in x1. Thus, both conditional expectations, or regression functions, are linear in the conditioning variables.
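The conditional mean and covariance formulas in (2.68) can be computed directly from a partitioned covariance matrix, as in the following numpy sketch; the mean vector, covariance matrix, and conditioning value are hypothetical and serve only to illustrate the algebra.

import numpy as np

# Hypothetical mean vector and covariance matrix of a 3-dimensional MVN,
# partitioned so that x1 holds the first element and x2 the last two.
theta = np.array([1.0, 0.0, 2.0])
Sigma = np.array([[2.0, 0.6, 0.4],
                  [0.6, 1.0, 0.3],
                  [0.4, 0.3, 1.5]])

theta1, theta2 = theta[:1], theta[1:]
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

x2 = np.array([0.5, 1.0])                       # conditioning value for x2

B_prime = S12 @ np.linalg.inv(S22)              # Sigma12 * Sigma22^{-1}, partial regression coefficients
cond_mean = theta1 + B_prime @ (x2 - theta2)    # E(x1 | x2), linear in x2
cond_cov = S11 - B_prime @ S21                  # Sigma11 - Sigma12 Sigma22^{-1} Sigma21

print(cond_mean, cond_cov)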
The conditional expectation of x̃1, given x̃2, is called the regression function for x̃1, given x̃2. As (2.68) indicates, in the case of the MVN distribution, this regression function is linear and B' ≡ Σ12Σ22^{-1} is the m1 × m2 matrix of partial regression coefficients. If x̃1 has just one element, then the vector of partial regression coefficients is β' = σ12'Σ22^{-1}, where σ12' is a 1 × m2 vector of covariances of x̃1 with the elements of x̃2, that is, the first row of Σ12. With respect to partial regression coefficients, it is instructive to write

x̃1' = θ1' + (x̃2 - θ2)'B + ũ', (2.70)

where ũ' is a 1 × m1 random vector with E(ũ'|x̃2) = 0 and E(x̃2 - θ2)ũ' = 0. Then on multiplying (2.70) on the left by x̃2 - θ2 and taking the expectation of both sides, the result is Σ21 = Σ22B or

B = Σ22^{-1}Σ21. (2.71)
Note that (2.71) was obtained from (2.70) without assuming normality but, of course, it also holds in the normal case. Without normality, it is not true in general that both E(x̃1|x̃2 = x2) and E(x̃2|x̃1 = x1) are linear in x2 and x1, respectively. For the cases of non-normality, it may be that one of these conditional expectations is linear, but in general both will not be linear except in special cases, for example the multivariate Student-t distribution discussed below.
If in the MVN distribution the elements of x̃ are mutually uncorrelated, that is, E(x̃i - θi)(x̃j - θj) = 0 for all i ≠ j, then Σ in (2.63) is a diagonal matrix, Σ = D(σii), and from (2.60) f(x|θ, D) = Π_{i=1}^{m} g(xi|θi, σii), where g(xi|θi, σii) is a univariate normal pdf with mean θi and variance σii. Thus, diagonality of Σ implies that the elements of x̃ are independently distributed, and independently distributed rvs are mutually uncorrelated, a result that holds in general. Thus, for the MVN distribution, diagonality of Σ implies independence, and independence of the elements of x̃ implies diagonality of Σ.
f(z|v) = c(1 + z'z/v)^{-(m+v)/2}, (2.73)

where c is the normalizing constant.
The mean and covariance matrix of the MVS pdf in (2.72) are

Ex̃ = θ, v > 1, (2.74)

and

E(x̃ - θ)(x̃ - θ)' = [v/(v - 2)]V^{-1}, v > 2. (2.75)

The conditions v > 1 and v > 2 are needed for the existence of moments.
If x̃ is partitioned, x̃' = (x̃1', x̃2'), marginal and conditional pdfs can be obtained by methods similar to those employed in connection with the MVN. Let V in (2.72) be partitioned, V = (Vij), i, j = 1,2, to correspond to the partitioning of x̃ into x̃1 with m1 elements and x̃2 with m2 elements. Then the marginal pdf for x̃2 is in the MVS form, that is, MVS_{m2}(θ2, V2.1, v), where θ2 is a subvector of θ' = (θ1', θ2'), partitioned to correspond to the partitioning of x̃, and V2.1 = V22 - V21V11^{-1}V12. As regards the conditional pdf for x̃1, given x̃2, it too is a MVS pdf, namely MVS_{m1}(δ1.2, M, v'), with v' = m2 + v and

δ1.2 = θ1 - V11^{-1}V12(x2 - θ2). (2.76)

For v' > 1, δ1.2 in (2.76) is the conditional mean of x̃1, given x̃2; note its similarity to the conditional mean for the MVN pdf in (2.68) in that it is linear in x2. Also, E(x̃2|x1) is linear in x1. Thus, the MVS pdf with v > 1 has all conditional means or regression functions linear. In addition, if V is diagonal, (2.75) indicates that all elements of x̃ are uncorrelated, given v > 2, and from (2.76), δ1.2 = θ1 when V is diagonal. From the form of (2.72), it is clear that diagonality of V, or lack of correlation, does not imply independence for the MVS pdf, in contrast to the MVN case. This feature of the MVS pdf can be understood by recognizing that the MVS can be generated as a continuous mixture of normal pdfs. That is, let x̃ have a MVN pdf with mean vector θ and covariance matrix σ²V^{-1}, denoted by f_N(x|θ, V^{-1}σ²), and consider σ to be random with the inverted gamma pdf in (2.33). Then integrating f_N(x|θ, V^{-1}σ²) over σ with respect to the pdf in (2.33) yields the MVS pdf in (2.72).
In previous sections specific pdfs for rvs were described and discussed, the use of
which permits one to make probability statements about values of rvs. This
section is devoted to a review of some results relevant for situations in which the
exact pdfs for rvs are assumed unknown. Use of asymptotic theory provides,
among other results, approximate pdfs for rvs under relatively weak assumptions.
As Cox and Hinkley (1974, p. 247) state: "The numerical accuracy of such an [asymptotic] approximation always needs consideration...." For example, consider a random sample mean, X̄n = Σ_{i=1}^{n} Xi/n. Without completely specifying the pdfs for the Xi's, central limit theorems (CLTs) yield the result that X̄n is approximately normally distributed for large finite n given certain assumptions about the Xi. Then it is possible to use this approximate normal distribution to make probability statements regarding possible values of X̄n. Similar results are available relating to other functions of the Xi's, for example, Sn² = Σ_{i=1}^{n}(Xi - X̄n)²/n, etc. The capability of deducing the "large sample" properties of functions of rvs such as X̄n or Sn² is very useful in econometrics, especially when exact pdfs for the Xi's are not known and/or when the Xi's have known pdfs but functions of the Xi's have complicated or unknown pdfs.
The following are some useful inequalities for rvs that are frequently employed
in asymptotic theory and elsewhere.
¹⁴For simplicity of notation, in this section rvs are denoted by capital letters, e.g. X, Y, Z, etc. See, for example, Cramér (1946), Loève (1963), Rao (1973), Cox and Hinkley (1974) and the references cited in these works for further consideration of topics in asymptotic theory.
Ch. 2." Statistical Theory and Econometrics 111
P(|X| > ε) ≤ EX²/ε² (2.78)

and, more generally,

P(|Y| > ε) ≤ Eg(Y)/g(ε), (2.79)

where ε > 0 is given, Y is any rv, g(·) is a non-negative, even function, defined on the range of Y such that g(·) is non-decreasing, and Eg(·) < ∞. If in (2.79) Y = |X| and g(|X|) = X² for |X| ≥ 0 and zero otherwise, (2.79) reduces to (2.78). If in (2.79) Y = |X| and g(|X|) = |X| for |X| ≥ 0 and zero otherwise, (2.79) reduces to P(|X| > ε) ≤ E|X|/ε. Other choices of g(·), for example g(|X|) = |X|^r, r > 0, produce inequalities involving higher order moments, P(|X| > ε) ≤ E|X|^r/ε^r, Markov's Inequality, which includes (2.78) as a special case.
Some additional inequalities are: 15
(1) EIX+YIr~cfiEIXIr+E[yIr), where c ~ = l for r~<l and Cr=2 r-1 for
r >/1. Thus, if the r t h absolute m o m e n t s of X and Y exist and are finite, so is
the r th absolute m o m e n t of X + Y.
(2) HOlder lnequality: E I X Y I <~[ElSlrll/~[EIYIS] l/s, where r > 1 and 1/r + 1/s
~1.
(3) Minkowskilnequality: If r >/1, then [ E [ X + y[r]l/r<~ [ElSlr]l/~+[E[Ylr]l/r.
(4) Schwarz Inequality: EIXYI <~[EIXI2EIY] 2] or [EIXYI] 2 ~ EIXI2EIY[ 2,
which is a special case of H61der's Inequality with r = s = 2.
These inequalities are useful in establishing properties of functions of rvs.
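A simple numerical check of the Markov and Chebychev inequalities in (2.78) and (2.79) is sketched below; the exponential distribution is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(2)

x = rng.exponential(scale=1.0, size=1_000_000)   # draws from an illustrative distribution
eps = 2.5

# Markov: P(|X| > eps) <= E|X|^r / eps^r for r > 0 (r = 1 and r = 2 shown).
p_tail = np.mean(np.abs(x) > eps)
print(p_tail, np.mean(np.abs(x)) / eps, np.mean(x**2) / eps**2)

# Chebychev applied to X - EX: P(|X - EX| > eps) <= Var(X) / eps^2.
xc = x - x.mean()
print(np.mean(np.abs(xc) > eps), x.var() / eps**2)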
Now various types of convergence of sequences of rvs will be defined.
¹⁵For proofs of these and other inequalities, see Loève (1963, p. 154 ff.).
1. Convergence in probability (i.p.). If for the sequence (Xn), n = 1,2,...,

lim_{n→∞} P(|Xn - c| > ε) = 0 (2.80)

for every given ε > 0, then the sequence converges weakly or i.p. to the constant c. Alternative ways of writing (2.80) are Xn → c (i.p.), or Xn → c (p), or plim Xn = c, where "plim" represents the particular limit given in (2.80) and is the notation most frequently employed in econometrics.

2. Almost sure (a.s.) convergence. If for the sequence (Xn), n = 1,2,...,

P(lim_{n→∞} Xn = c) = 1, (2.81)

then the sequence converges strongly or a.s. to c, denoted by Xn → c (a.s.). An alternative way of expressing (2.81) is lim_{n→∞} P(|Xm - c| < ε for all m ≥ n) = 1, for every ε > 0.

3. Convergence in quadratic mean (q.m.). If for the sequence (Xn), n = 1,2,...,

lim_{n→∞} E(Xn - c)² = 0, (2.82)

then the sequence converges in quadratic mean to c, also expressed as Xn → c (q.m.).

A sequence of rvs (Xn), n = 1,2,..., is said to converge to a rv X in the sense of (2.80), (2.81), or (2.82) if and only if the sequence (Xn - X), n = 1,2,..., converges to c = 0 according to (2.80), (2.81), or (2.82). In the case of (2.80) such convergence is denoted by Xn → X (i.p.) or plim(Xn - X) = 0; in the case of (2.81) by Xn → X (a.s.); and in the case of (2.82) by Xn → X (q.m.).
Ch. 2." Statistical Theory and Econometrics 113
In connection with a sequence of sample means, (X̄n), n = 1,2,..., with EX̄n = μ and Var(X̄n) = σ²/n, Chebychev's Inequality (2.78) yields P(|X̄n - μ| > ε) ≤ E(X̄n - μ)²/ε² = σ²/nε². Thus, lim_{n→∞} P(|X̄n - μ| > ε) = 0; that is, plim X̄n = μ or X̄n → μ (i.p.). Further, on applying (2.82), lim_{n→∞} E(X̄n - μ)² = lim_{n→∞} σ²/n = 0 and thus X̄n → μ (q.m.).
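The convergence of X̄n to μ can be illustrated by simulation; the sketch below estimates P(|X̄n - μ| > ε) across increasing n for i.i.d. exponential draws with μ = 1 and σ² = 1 (illustrative choices) and compares it with the Chebychev bound.

import numpy as np

rng = np.random.default_rng(3)

mu, eps, reps = 1.0, 0.05, 1_000        # illustrative values
for n in (10, 100, 1_000, 10_000):
    # 'reps' independent sample means, each based on n i.i.d. exponential(1) draws.
    xbar = rng.exponential(mu, size=(reps, n)).mean(axis=1)
    freq = np.mean(np.abs(xbar - mu) > eps)   # estimate of P(|Xbar_n - mu| > eps)
    bound = mu**2 / (n * eps**2)              # Chebychev bound sigma^2/(n*eps^2)
    print(n, freq, bound)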
In the case of a sequence of sample means, and in many other cases, it is valuable to know under what general conditions, and in what sense, X̄n converges to a limiting constant. Laws of large numbers (LLN) provide answers to these questions.

Chebychev's WLLN

If EXi = μi, V(Xi) = E(Xi - μi)² = σi², and cov(Xi, Xj) = E(Xi - μi)(Xj - μj) = 0, i ≠ j, for all i, j = 1,2,..., then lim_{n→∞} σ̄²/n = 0, where σ̄² = Σ_{i=1}^{n} σi²/n, implies that

X̄n - μ̄n → 0 (i.p.),

where μ̄n = Σ_{i=1}^{n} μi/n.¹⁶

¹⁶Proof is by use of Chebychev's Inequality (2.78) with X = X̄n - μ̄n since E(X̄n - μ̄n)² = σ̄²/n. Therefore P(|X̄n - μ̄n| > ε) ≤ σ̄²/nε² and thus lim_{n→∞} P(|X̄n - μ̄n| > ε) = 0. For proofs and discussion of this and other LLN see, for example, Rao (1973, p. 111 ff.).
Khintchin's WLLN

If X1, X2,... are independent and identically distributed (i.i.d.) rvs and EXi = μ < ∞, then X̄n → μ (i.p.) (or plim X̄n = μ).

Kolmogorov's First SLLN

If X1, X2,... is a sequence of independent rvs with EXi = μi and V(Xi) = σi², and if Σ_{i=1}^{∞} σi²/i² < ∞, then

X̄n - μ̄n → 0 (a.s.),

and the sequence X1, X2,... is said to obey the SLLN. Further, if μi = μ for all i, X̄n → μ (a.s.).

Kolmogorov's Second SLLN

If X1, X2,... is a sequence of i.i.d. rvs, then a necessary and sufficient condition that X̄n → μ (a.s.) is that EXi exists and is equal to μ.

Kolmogorov's Second SLLN does not require the existence of the second moments of the independent Xi's as in his first law; however, in the second law the Xi must be independently and identically distributed, which is not assumed in his first law. In the first law, if μi = μ and σi² = σ², the Xi's need not be identically distributed and still X̄n → μ (a.s.) since σ²Σ_{i=1}^{∞} 1/i² < ∞.
Let (Fn), n = 1,2,..., be a sequence of cumulative distribution functions (cdfs) for the rvs (Xn), n = 1,2,..., respectively. Then (Xn) converges in distribution or in law to a rv X with cdf F if Fn(t) → F(t) as n → ∞ for every point t such that F(t) is continuous at t. This convergence in distribution or law is denoted by Xn → X (L). The cdf F of the rv X is called the limiting or asymptotic distribution of Xn. Further, if Xn has pdf fn(x) and fn(x) → f(x) as n → ∞ and if f(x) is a pdf, then ∫|fn(x) - f(x)|dx → 0 as n → ∞. In addition, if |fn(x)| < q(x) and ∫q(x)dx exists and is finite, this implies that f(x) is a pdf such that ∫|fn(x) - f(x)|dx → 0 as n → ∞.
Several additional results that are very useful in practice are:

(1) Helly-Bray Theorem: Fn → F implies that ∫g dFn → ∫g dF as n → ∞ for every bounded continuous function g.

For example, the Helly-Bray Theorem can be employed to approximate Egr = ∫gr dFn, r = 1,2,..., when gr satisfies the conditions of the theorem and F's form is known.

(2) With g a continuous function, (a) if Xn → X (L), then g(Xn) → g(X) (L), and (b) if Xn → X (i.p.), then g(Xn) → g(X) (i.p.).

As a special case of (b), if Xn → c (i.p.), a constant, then g(Xn) → g(c) (i.p.).

(3) Continuity Theorem. Let cn(t) be the characteristic function (cf) of Xn. If Xn → X (L), then cn(t) → c(t), where c(t) is the cf of X. Also, if cn(t) → c(t) and c(t) is continuous at t = 0, then Xn → X (L) with the distribution function of X having cf c(t).

By the Continuity Theorem, derivation of the form of c(t) = lim_{n→∞} cn(t) often permits one to determine the form of the limiting distribution of Xn.
The following convergence results relating to (Xn, Yn), n = 1,2,..., a sequence of pairs of rvs, are frequently employed:

(a) If |Xn - Yn| → 0 (i.p.) and Yn → Y (L), then Xn → Y (L), that is, the limiting cdf of Xn exists and is the same as that of Yn.
(b) Xn → X (L) and Yn → 0 (i.p.) implies that XnYn → 0 (i.p.).
(c) Xn → X (L) and Yn → c (i.p.) implies that (i) Xn + Yn → X + c (L); (ii) XnYn → cX (L); and (iii) Xn/Yn → X/c (L) if c ≠ 0.
(d) Xn - Yn → 0 (i.p.) and Xn → X (L) implies that Yn → X (L).
A similar result holds for sequences of random vectors: if (Xn(1), Xn(2),...,Xn(k)), n = 1,2,..., is such that Xn(j) - X(j) → 0 (i.p.), j = 1,2,...,k, where X(1), X(2),...,X(k) have a joint cdf F(x1, x2,...,xk), then the limiting joint cdf of the sequence of random vectors exists and is equal to F(x1, x2,...,xk).
Central Limit Theorems (CLTs) establish particular limiting cdfs for sequences
of rvs. While only CLTs yielding limiting normal cdfs will be reviewed below, it is
the case that non-normal limiting cdfs are sometimes encountered.
Lindeberg-Levy CLT

Let (Xn), n = 1,2,..., be a sequence of i.i.d. rvs such that EXn = μ and V(Xn) = σ² ≠ 0 exist. Then the cdf of Yn = √n(X̄n - μ)/σ → Φ, where Φ is the normal cdf, Φ(y) = (2π)^{-1/2}∫_{-∞}^{y} e^{-t²/2}dt, and X̄n = Σ_{i=1}^{n} Xi/n.
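The Lindeberg-Levy CLT can be illustrated by simulating the standardized sample mean for i.i.d. draws from a markedly non-normal distribution and comparing a tail frequency with the standard normal cdf value; the exponential distribution and sample sizes below are illustrative choices.

import numpy as np

rng = np.random.default_rng(4)

mu, sigma, reps = 1.0, 1.0, 20_000       # exponential(1): mu = 1, sigma = 1, strongly skewed

for n in (5, 50, 500):
    x = rng.exponential(mu, size=(reps, n))
    y = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma   # standardized sample mean Y_n
    # P(Y_n <= 1.96) should approach Phi(1.96), approximately 0.975, as n grows.
    print(n, np.mean(y <= 1.96))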
Liapunov CLT

Let (Xn), n = 1,2,..., be a sequence of independent rvs. Let EXn = μn, E(Xn - μn)² = σn² ≠ 0, and E|Xn - μn|³ = βn exist for each n. Furthermore, let Bn = (Σ_{i=1}^{n} βi)^{1/3} and Cn = (Σ_{i=1}^{n} σi²)^{1/2}. Then if lim(Bn/Cn) = 0 as n → ∞, the cdf of Yn = Σ_{i=1}^{n}(Xi - μi)/Cn → Φ(y), a normal cdf.

Lindeberg-Feller CLT

Let (Xn), n = 1,2,..., be a sequence of independent rvs and Gn be the cdf of Xn. Further, let EXn = μn and V(Xn) = σn² ≠ 0 exist. Define Yn = Σ_{i=1}^{n}(Xi - μi)/Cn, where Cn = √n σ̄n, with σ̄n² = Σ_{i=1}^{n} σi²/n. Then the relations

lim_{n→∞} max_{1≤i≤n} σi²/Cn² = 0 and Yn → Φ(y) (L)

hold if and only if, for every ε > 0,

lim_{n→∞} (1/Cn²) Σ_{i=1}^{n} ∫_{|x-μi|>εCn} (x - μi)² dGi(x) = 0.

Multivariate CLT

Let Fn denote the joint cdf of the k-dimensional random vector (Xn(1), Xn(2),...,Xn(k)), n = 1,2,..., and Fλn the cdf of the linear function λ1Xn(1) + λ2Xn(2) + ··· + λkXn(k). A necessary and sufficient condition that Fn tend to a k-variate cdf F is that Fλn converges to a limit for each vector λ.
With Fn, Fλn, and F as defined in the Multivariate CLT, if for each vector λ, Fλn → Fλ, the cdf of λ1X(1) + λ2X(2) + ··· + λkX(k), then Fn → F. As an application, consider i.i.d. random vectors Uj' = (U1j, U2j,...,Ukj), j = 1,2,...,n, with EUj = μ and V(Uj) = Σ, a k × k matrix. Define Ūn = Σ_{j=1}^{n} Uj/n, n = 1,2,.... Then the asymptotic cdf of √n(Ūn - μ) is that of a random normal vector with zero mean and covariance matrix Σ.
For use of these and related theorems in proving the asymptotic normality of
maximum likelihood estimators and posterior distributions, see, for example,
Heyde and Johnstone (1979) and the references in this paper that relate both to
cases in which rvs (Xi) are i.i.d. and statistically dependent. Finally, for a
description of Edgeworth and other asymptotic expansion approaches for ap-
proximating finite sample distributions and moments of random variables, see,
for example, Kendall and Stuart (1958), Jeffreys (1967, appendix A), Copson
(1965), and Phillips (1977a, 1977b). Such asymptotic expansions and numerical
integration approaches are useful in checking the quality of approximate asymp-
totic results and in obtaining more accurate approximations.
3. Estimation theory
There are two general types of criteria that are employed in evaluating properties
of point estimates. First, sampling criteria involve properties of the sample space
and relate to sampling or frequency properties of particular or alternative
estimates. The overriding considerations with the use of sampling criteria are
properties of estimates in actual or hypothetical repeated samples. Second,
non-sampling criteria involve judging particular or alternative estimates just on the
basis of their properties relative to the given, actually observed data, x. With
non-sampling criteria, other as yet unobserved samples of data and long-run
frequency properties are considered irrelevant for the estimation of a parameter's
value from the actually observed data. The issue of whether to use sampling or
non-sampling criteria for constructing and evaluating estimates is a crucial one
since these different criteria can lead to different estimates of a parameter's value
from the same set of data. However, it is the case, as will be seen, that some
non-sampling-based procedures yield estimates that have good sampling proper-
ties as well as optimality properties relative to the actually observed data.
P[θ - λ1 < θ̂(x̃) < θ + λ2] ≥ P[θ - λ1 < θa(x̃) < θ + λ2] (3.1)

for all θ, with λ1 and λ2 in the interval 0 to λ, and where θa(x̃) is any other estimator. A necessary condition for (3.1) to hold is that E[θ̂(x̃) - θ]² ≤ E[θa(x̃) - θ]² for all θ. As mentioned above, it is not possible to satisfy this necessary MSE condition and thus the criterion of highest degree of concentration cannot be realized.
Since the strong sampling theory criteria of error-free, minimal MSE, and
highest degree of concentration cannot be realized, several weaker criteria for
estimators have been put forward. One of these is the criterion of unbiasedness.
3.2.1.2. Unbiasedness.
Definition
An estimator θ̂(x̃) of a parameter θ is unbiased if and only if E[θ̂(x̃)|θ] = θ for all θ ∈ Θ.
Thus, if an estimator is unbiased, its mean is equal to the value of the parameter
being estimated.
Example 3.1
As an example of an unbiased estimator, consider the model x̃i = θ + ε̃i, with E(x̃i|θ) = θ for i = 1,2,...,n. Then θ̂(x̃) = Σ_{i=1}^{n} x̃i/n is an unbiased estimator since E[θ̂(x̃)|θ] = Σ_{i=1}^{n} E(x̃i|θ)/n = θ.
Example 3.2
Consider the multiple regression model ỹ = Xβ + ũ, where ỹ' = (ỹ1, ỹ2,...,ỹn), X is an n × k non-stochastic matrix of rank k, β' = (β1, β2,...,βk), and ũ' = (ũ1, ũ2,...,ũn), a vector of unobservable random errors or disturbances. Assume that E(ỹ|Xβ) = Xβ. Then β̂ = (X'X)^{-1}X'ỹ, the "least squares" estimator of β, is unbiased since E(β̂|Xβ) = (X'X)^{-1}X'E(ỹ|Xβ) = (X'X)^{-1}X'Xβ = β for all β.
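The unbiasedness of the least squares estimator in Example 3.2 can be checked by simulation: averaging β̂ = (X'X)^{-1}X'ỹ over many samples drawn with E(ỹ|Xβ) = Xβ should reproduce β. The design matrix, β, and error distribution below are hypothetical choices for this sketch.

import numpy as np

rng = np.random.default_rng(5)

n, k, reps = 50, 3, 20_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # fixed, full-rank design matrix
beta = np.array([1.0, -2.0, 0.5])                               # hypothetical true coefficients

XtX_inv_Xt = np.linalg.inv(X.T @ X) @ X.T
estimates = np.empty((reps, k))
for r in range(reps):
    y = X @ beta + rng.normal(scale=2.0, size=n)   # E(y | X*beta) = X*beta
    estimates[r] = XtX_inv_Xt @ y                  # least squares estimate for this sample

print(estimates.mean(axis=0))   # close to beta, illustrating unbiasedness
print(beta)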
While unbiasedness is often regarded as a desirable property of estimators, the
following qualifications should be noted. First, there are usually many unbiased
estimators for a particular parameter. With respect to Example 3.1, the estimator θ̂w(x̃) = Σ_{i=1}^{n} wi x̃i, with the wi's given, has mean E[θ̂w(x̃)|θ] = Σ_{i=1}^{n} wi E(x̃i|θ) = θΣ_{i=1}^{n} wi, and is unbiased for all wi's satisfying Σ_{i=1}^{n} wi = 1. Similarly, with respect to Example 3.2, β̂c = [(X'X)^{-1}X' + C']ỹ, where C' is a non-stochastic k × n matrix, has mean E(β̂c|Xβ, C) = β + C'Xβ = β for all C such that C'X = 0. Thus, unbiased estimators are not unique.
Secondly, imposing the condition of unbiasedness can lead to unacceptable
results in frequently encountered problems. If we wish to estimate a parameter,
such as a squared correlation coefficient, that satisfies 0 ≤ θ < 1, an unbiased estimator θ̂(x̃) must assume negative as well as positive values in order to satisfy Eθ̂(x̃) = θ for all θ, 0 ≤ θ < 1. Similarly, an unbiased estimator for a variance parameter τ² that is related to two other variances, σ1² and σ2², by τ² = σ1² - σ2², with σ1² ≥ σ2² > 0, has to assume negative as well as positive values in order to be unbiased for all values of τ². Negative estimates of a variance that is known to be non-negative are unsatisfactory.
Third, and perhaps of most general importance, the criterion of unbiasedness does not take account of the dispersion or degree of concentration of estimators. Biased estimators can be more closely concentrated about a parameter's value than are unbiased estimators. In this connection, the criterion of MSE can be expressed in general as:¹⁷

MSE(θ̂) = E(θ̂ - θ)² = V(θ̂) + (bias)², (3.2)

where V(θ̂) = E(θ̂ - Eθ̂)², the variance of θ̂, and bias = Eθ̂ - θ is the bias of θ̂. Thus, MSE depends on both dispersion, as measured by V(θ̂), and squared bias, and gives them equal weights. In terms of (3.2), the criterion of unbiasedness gives zero weight to V(θ̂) and unit weight to the bias squared term, which is considered unsatisfactory by many.
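The decomposition in (3.2) implies that a biased but less dispersed estimator can have smaller MSE than an unbiased one. The sketch below compares the unbiased sample mean with a simple shrinkage estimator cX̄ (c < 1) for a normal mean; the parameter values and shrinkage factor are illustrative choices, not taken from the text.

import numpy as np

rng = np.random.default_rng(6)

theta, sigma, n, reps = 1.0, 2.0, 10, 200_000   # illustrative values
c = 0.8                                         # shrinkage factor for the biased estimator

x = rng.normal(theta, sigma, size=(reps, n))
xbar = x.mean(axis=1)          # unbiased estimator of theta
shrunk = c * xbar              # biased estimator: bias = (c - 1)*theta, smaller variance

for est in (xbar, shrunk):
    bias = est.mean() - theta
    var = est.var()
    # MSE = V + bias^2, as in (3.2); compare with the direct estimate of E(est - theta)^2.
    print(bias, var, var + bias**2, np.mean((est - theta)**2))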
Fourth, on considering just unbiased estimators for a parameter, denoted by θ̂u(x̃), it is clear from (3.2) that MSE = var[θ̂u(x̃)]. While the restriction that an

where x̄ = Σ_{i=1}^{n} xi/n, the sample mean, and this result implies that cov(x̄, z̃) = 0, the necessary and sufficient condition for x̄ to be a MVUE. By similar analysis, Rao shows that for the multiple regression model in Example 3.2, with ỹ assumed
¹⁸See Kendall and Stuart (1961, pp. 193-194) and Silvey (1970, p. 29) for a discussion of minimal sufficient statistics.
¹⁹Proofs may be found in Kendall and Stuart (1961, p. 23) and Lehmann (1959, p. 47).
= (2πσ²)^{-n/2} exp{-[νs² + (β̂ - β)'X'X(β̂ - β)]/2σ²},

since Σ_{i=1}^{n}(xi - x̄) = 0.
where h(t) = E(t1|t) is independent of θ. Furthermore, Eh(t) = g(θ), that is, h(t) is unbiased if Et1 = g(θ).
See Rao (1973, p. 321) for a proof of this theorem. As Rao (1973) notes: "Given any statistic [t1], we can find a function of the sufficient statistic [h(t)] which is uniformly better in the sense of mean square error or minimum variance (if no bias is imposed)" (p. 321). He also notes that if a complete sufficient statistic exists, that is, one such that no function of it has zero expectation unless it is zero almost everywhere with respect to each of the measures Pθ, then every function of it is a uniformly MVUE of its expected value. In view of the Rao-Blackwell Theorem and assuming the existence of complete sufficient statistics, to find a MVUE it is enough to start with any unbiased estimator and take its conditional expectation given the sufficient statistic, that is, E(t1|t) = h(t).
Since these results depend strongly on the existence of complete sufficient statistics, it is relevant to ask which classes of distribution functions possess sufficient statistics for their parameters. The Pitman–Koopman Theorem [Pitman (1936) and Koopman (1936)] provides an answer to this question: under regularity conditions, sufficient statistics of fixed dimension exist essentially only for distributions with pdfs of the exponential form
$$f(x|\theta) = \exp\Big\{\sum_{j=1}^k A_j(\theta)B_j(x) + C(x) + D(\theta)\Big\}. \tag{3.8}$$
When a complete sufficient statistic does not exist, there may be several functions of the minimal sufficient statistic which are unbiased estimators of the parameter and there is no general means of comparing their variances. Silvey (1970, pp. 34-35) presents the following example to illustrate this problem.
Suppose that $n$ independent binomial trials, each with probability $\theta$ of success, are carried out; then trials are continued until an additional $k$ successes are obtained, this requiring $s$ additional trials.24 Let the sample be denoted by $x = (x_1, x_2, \ldots, x_n, x_{n+1}, x_{n+2}, \ldots, x_{n+s-1}, 1)$, where $x_i = 1$ for a success and $x_i = 0$ for a failure. Then $f(x|\theta) = \theta^{r+k}(1-\theta)^{n+s-r-k}$, where $r = \sum_{i=1}^n x_i$ and $s$ depends on $x$ also. The statistic $t = (r, s)$ is sufficient for $\theta$ and is also a minimal sufficient statistic. However, $t$ is not complete because if
$$f(t) = \frac{r}{n} - \frac{k-1}{s-1},$$
then
$$Ef(t) = E\Big(\frac{r}{n}\Big) - E\Big(\frac{k-1}{s-1}\Big) = \theta - \theta = 0 \quad\text{for all } \theta.$$
However, $f(t) \not\equiv 0$, as is required for a complete sufficient statistic, and thus there are problems in applying the Rao–Blackwell Theorem to this and similar problems.
In problems in which it is difficult or impossible to obtain a MVUE, it is useful to consider the Cramér–Rao Inequality, which provides a lower bound for the variance of an unbiased estimator.
Cramér–Rao Inequality
Given $(X, p(x|\theta), \theta \in \Theta)$, with $\Theta$ an interval on the real line, then subject to certain regularity conditions, the variance of any unbiased estimator $\hat\theta$ of $g(\theta)$ satisfies the following inequality:
$$\mathrm{var}(\hat\theta) \ge [g'(\theta)]^2/I_\theta. \tag{3.10}$$
Proof
Differentiate $E\hat\theta = \int_X \hat\theta\, p(x|\theta)\,dx = g(\theta)$ with respect to $\theta$, assuming that it is permissible to differentiate under the integral, to obtain
$$\int_X \hat\theta\,[\partial p(x|\theta)/\partial\theta]\,dx = g'(\theta),$$
or, since $\partial p/\partial\theta = [\partial\log p/\partial\theta]\,p$ and $E[\partial\log p(x|\theta)/\partial\theta] = 0$,
$$\mathrm{cov}[\hat\theta,\, \partial\log p(x|\theta)/\partial\theta] = g'(\theta),$$
or, applying the Cauchy–Schwarz inequality, $[g'(\theta)]^2 \le \mathrm{var}(\hat\theta)\,I_\theta$, so that
$$\mathrm{var}(\hat\theta) \ge [g'(\theta)]^2/I_\theta.$$
The following lemma provides an alternative expression for $I_\theta$, the Fisher information measure.
Lemma
$$I_\theta = E[\partial\log p(x|\theta)/\partial\theta]^2 = -E[\partial^2\log p(x|\theta)/\partial\theta^2].$$
Proof
Differentiate $\int_X [\partial\log p(x|\theta)/\partial\theta]\,p(x|\theta)\,dx = 0$ with respect to $\theta$ to obtain
$$\int_X [\partial^2\log p(x|\theta)/\partial\theta^2]\,p(x|\theta)\,dx + \int_X [\partial\log p(x|\theta)/\partial\theta]^2\,p(x|\theta)\,dx = 0,$$
or $E[\partial\log p(x|\theta)/\partial\theta]^2 = -E[\partial^2\log p(x|\theta)/\partial\theta^2]$, which is the stated result.
The bound in (3.10) is attained, and $\hat\theta$ is a minimum variance bound (MVB) estimator, if and only if
$$\partial\log p(x|\theta)/\partial\theta = A(\theta)[\hat\theta - g(\theta)], \tag{3.12}$$
where $A(\theta)$ may depend on $\theta$ but does not depend on $x$, the observations. From (3.12), $\mathrm{var}[\partial\log p(x|\theta)/\partial\theta] = A^2(\theta)\mathrm{var}(\hat\theta)$ and then, from the equality form of (3.9), $\mathrm{var}(\hat\theta) = g'(\theta)/A(\theta)$.
For example, suppose the observations $\tilde x_i$, $i = 1,2,\ldots,n$, are NID$(\theta, \sigma_0^2)$, where $\sigma_0^2$ is a known value for $\sigma^2$, and let $\bar x = \sum_{i=1}^n x_i/n$. Then $\partial\log p(x|\theta, \sigma_0^2)/\partial\theta = n(\bar x - \theta)/\sigma_0^2$. With $\hat\theta = \bar x$ and $g(\theta) = \theta$, $A(\theta) = n/\sigma_0^2$. Thus, since $\partial\log p(x|\theta, \sigma_0^2)/\partial\theta$ is proportional to $\bar x - \theta$, with the factor of proportionality $A(\theta) = n/\sigma_0^2$, $\bar x$ is the MVB unbiased estimator with $\mathrm{var}(\bar{\tilde x}) = 1/A(\theta) = \sigma_0^2/n$.
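As an illustrative check of this result (not an example from the text), the following sketch computes $I_\theta = n/\sigma_0^2$ for the normal mean with known variance and compares $1/I_\theta$ with the simulated variance of the sample mean, which attains the bound:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma0, n, reps = 2.0, 3.0, 25, 100_000

# Fisher information for the mean of n NID(theta, sigma0^2) observations
info = n / sigma0**2                      # I_theta = n / sigma0^2
cramer_rao_bound = 1.0 / info             # lower bound for unbiased estimators of theta

x = rng.normal(theta, sigma0, size=(reps, n))
var_xbar = x.mean(axis=1).var()

print(f"Cramer-Rao bound : {cramer_rao_bound:.5f}")   # sigma0^2 / n
print(f"var(x-bar)       : {var_xbar:.5f}")           # attained by the sample mean
```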
While MVUEs are ingenious constructs and useful in a variety of situations, as Silvey (1970) points out: "There are many situations where either no MVUE exists or where we cannot establish whether or not such an estimator exists" (p. 43). For example, even in the case of a simple binomial parameter, $\theta$, there exists no unbiased estimator of $\eta = \theta/(1-\theta)$, the odds in favor of success. Also, problems arise in obtaining unbiased estimators of the reciprocals and ratios of means and regression coefficients and coefficients of structural econometric models. For these and other problems there is a need for alternative estimation principles.
25Note that $\mathrm{d}SS/\mathrm{d}\theta = -2\sum_{i=1}^n(x_i - \theta)$ and $\mathrm{d}^2SS/\mathrm{d}\theta^2 = 2n > 0$. Thus, the value of $\theta$ for which $\mathrm{d}SS/\mathrm{d}\theta = 0$ is $\hat\theta = \sum_{i=1}^n x_i/n$ and this is a minimizing value since $\mathrm{d}^2SS/\mathrm{d}\theta^2 > 0$ at $\theta = \hat\theta$.
26Note $\partial SS/\partial\beta = -2X'y + 2X'X\beta$ and $\partial^2SS/\partial\beta\,\partial\beta' = 2X'X$. The value of $\beta$ setting $\partial SS/\partial\beta = 0$ is obtained from $-2X'y + 2X'X\beta = 0$ or $X'X\beta = X'y$, the so-called "normal equations", the solution of which is $\hat\beta = (X'X)^{-1}X'y$. Since $\partial^2SS/\partial\beta\,\partial\beta'$ is a positive definite symmetric matrix, $\hat\beta$ is a minimizing value of $\beta$.
27While these examples involve models linear in the parameters and error terms, it is also possible to use the LS principle in connection with non-linear models, for example $y_i = f(z_i, \theta) + u_i$, $i = 1,2,\ldots,n$, where $f(z_i, \theta)$ is a known function of a vector of given variables, $z_i$, and a vector of parameters, $\theta$. In this case, the LS principle involves finding the value of $\theta$ such that $\sum_{i=1}^n[y_i - f(z_i, \theta)]^2$ is minimized.
Proof
Consider $\hat\beta_C = [(X'X)^{-1}X' + C']\tilde y$, where $C'$ is an arbitrary $k \times n$ matrix. This defines a class of linear (in $\tilde y$) estimators. For $l'\hat\beta_C$ to be an unbiased estimator of $l'\beta$, $C$ must be such that $C'X = 0$ since $El'\hat\beta_C = l'\beta + l'C'X\beta$. With the restriction $C'X = 0$ imposed,
$$\mathrm{var}(l'\hat\beta_C) = \sigma^2\, l'[(X'X)^{-1} + C'C]\,l,$$
since $[(X'X)^{-1}X' + C'][X(X'X)^{-1} + C] = (X'X)^{-1} + C'C$ when $C'X = 0$. This variance
28Some use the term "best linear unbiased estimator" (BLUE) rather than "minimum variance linear unbiased estimator" (MVLUE), with "best" referring to the minimal variance property.
attains a minimum for $C = 0$, which results in $l'\hat\beta_C = l'\hat\beta$, where $\hat\beta = (X'X)^{-1}X'\tilde y$ is the LS estimator with covariance matrix $V(\hat\beta) = (X'X)^{-1}\sigma^2$.
Thus, the Gauss–Markov (GM) Theorem provides a justification for the LS estimator of the regression coefficient vector $\beta$ under the hypotheses that the regression model $\tilde y = X\beta + \tilde u$ is properly specified, $E\tilde y = X\beta$ and $V(\tilde y) = V(\tilde u) = \sigma^2 I_n$, that is, that the errors or observations have a common variance and are uncorrelated. Further, the GM Theorem restricts the class of estimators to be linear and unbiased, restrictions that limit the range of candidate estimators, and involves the use of the MSE criterion, which here is equivalent to variance since only unbiased estimators are considered. As will be shown below, dropping the restrictions of linearity and unbiasedness can lead to biased, non-linear estimators with smaller MSE than that of the LS estimator under frequently encountered conditions.
While the GM Theorem is remarkable in providing a justification for the LS
estimator in terms of its properties in repeated (actual or hypothetical) samples, it
does not provide direct justification for the LS estimate that is based on a given
sample of data. Obviously, good performance on average does not always insure
good performance in a single instance.
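As a small computational sketch (with a simulated design matrix and assumed parameter values), the LS estimate can be obtained from the normal equations of footnote 26 and its Gauss–Markov covariance matrix $(X'X)^{-1}\sigma^2$ evaluated directly:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma = 50, 3, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # design, held fixed
beta = np.array([1.0, 0.5, -2.0])
y = X @ beta + rng.normal(0.0, sigma, size=n)

# Least squares via the normal equations X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Gauss-Markov covariance matrix of the LS estimator, (X'X)^{-1} sigma^2
cov_beta_hat = np.linalg.inv(X.T @ X) * sigma**2

print("beta_hat:", np.round(beta_hat, 3))
print("std errors (known sigma):", np.round(np.sqrt(np.diag(cov_beta_hat)), 3))
```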
An expanded version of the GM Theorem, in which the assumption that $E\tilde u\tilde u' = \sigma^2 I_n$ is replaced by $E\tilde u\tilde u' = V\sigma^2$, with $V$ an $n \times n$ known positive definite symmetric matrix, shows that $l'\hat\beta$ is the MVLUE of $l'\beta$, where $\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}\tilde y$ is the generalized least squares (GLS) estimator. On substituting $\tilde y = X\beta + \tilde u$ into $\hat\beta$, $\hat\beta = \beta + (X'V^{-1}X)^{-1}X'V^{-1}\tilde u$ and thus $E\hat\beta = \beta$ and $V(\hat\beta) = (X'V^{-1}X)^{-1}\sigma^2$. Also, the GLS estimate can be given a weighted least squares interpretation by noting that minimizing the weighted SS, $(y - X\beta)'V^{-1}(y - X\beta)$, with respect to $\beta$ yields the GLS estimate. Various forms of the GM Theorem are available in the literature for cases in which $X$ is not of full column rank and/or there are linear restrictions on the elements of $\beta$, that is, $A\beta = a$, where $A$ is a $q \times k$ given matrix of rank $q$ and $a$ is a $q \times 1$ given vector.
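A minimal sketch of the GLS formula, assuming for illustration a known diagonal $V$ (unequal error variances), is:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([2.0, -1.0])

# Known positive definite V: a diagonal matrix giving unequal error variances
v_diag = np.linspace(0.5, 4.0, n)
u = rng.normal(0.0, np.sqrt(v_diag))
y = X @ beta + u

V_inv = np.diag(1.0 / v_diag)
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)   # (X'V^-1 X)^-1 X'V^-1 y
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)                    # ordinary LS for comparison

print("GLS:", np.round(beta_gls, 3), " LS:", np.round(beta_ls, 3))
```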
The parameter $\sigma^2$ appears in the GM Theorem and in the covariance matrices of the LS and GLS estimators. The GM Theorem provides no guidance with respect to the estimation of $\sigma^2$. Since $\hat u = y - X\hat\beta$, the $n \times 1$ LS residual vector, is an estimate of $u$, the true unobserved error vector, it seems natural to use the average value of the sum of squared residuals as an estimate of $\sigma^2$, that is, $\hat\sigma^2 = \hat u'\hat u/n$. As will be seen, $\hat\sigma^2$ is the maximum likelihood estimate of $\sigma^2$ in the regression model with normally distributed errors. However, when $\hat\sigma^2$ is viewed as
29From $\hat u = \tilde y - X\hat\beta = [I_n - X(X'X)^{-1}X']\tilde u$, $E\hat u'\hat u = E\tilde u'M\tilde u = \sigma^2\mathrm{tr}\,M = \sigma^2(n - k)$, where $M = I_n - X(X'X)^{-1}X'$ and $\mathrm{tr}\,M = n - k$. Thus, $E\hat\sigma^2 = \sigma^2(n - k)/n$.
30The MSE of $\hat\sigma^2 = \hat u'\hat u/n$ is $\mathrm{MSE}(\hat\sigma^2) = \sigma^4(2/\nu)[(1 + k^2/2\nu)/(1 + k/\nu)^2]$, which is smaller than $\mathrm{MSE}(s^2) = 2\sigma^4/\nu$ for $k \le 4$ and, more generally, whenever $\nu(k-4) < 2k$.
with $\theta' = (\sigma_1^2, \sigma_2^2)$. Then $(y - X\beta)'V^{-1}(\theta)(y - X\beta)$ has minimal value for
$$\hat\beta_\theta = [X'V^{-1}(\theta)X]^{-1}X'V^{-1}(\theta)y = [X_1'X_1/\sigma_1^2 + X_2'X_2/\sigma_2^2]^{-1}(X_1'y_1/\sigma_1^2 + X_2'y_2/\sigma_2^2),$$
which is clearly a function of $\sigma_1^2$ and $\sigma_2^2$. Estimates of $\sigma_1^2$ and $\sigma_2^2$ are $s_i^2 = (y_i - X_i\hat\beta_i)'(y_i - X_i\hat\beta_i)/\nu_i$, with $\nu_i = n_i - k$ and $\hat\beta_i = (X_i'X_i)^{-1}X_i'y_i$ for $i = 1,2$. Then the approximate GLS estimate is
$$\hat\beta_a = [X_1'X_1/s_1^2 + X_2'X_2/s_2^2]^{-1}[X_1'y_1/s_1^2 + X_2'y_2/s_2^2].$$
For a random sample $x' = (x_1, x_2, \ldots, x_n)$ from a normal distribution with mean $\mu$ and variance $\sigma^2$, the joint pdf for the observations is
$$p(x|\mu, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\Big\{-\sum_{i=1}^n(x_i - \mu)^2/2\sigma^2\Big\} \equiv l(\mu, \sigma^2|x), \tag{3.14}$$
with $-\infty < x_i < \infty$, $i = 1,2,\ldots,n$, the sample space, and $-\infty < \mu < \infty$ and $0 < \sigma^2 < \infty$, the parameter space. In (3.14), the likelihood function, $l(\mu, \sigma^2|x)$, is a function of $\mu$ and $\sigma^2$ given $x$.
According to the ML estimation principle, estimates of parameters are obtained by maximizing the likelihood function given the data. That is, the ML estimate is the quantity $\hat\theta(x) \in \Theta$ such that $l(\hat\theta|x) = \max_{\theta\in\Theta} l(\theta|x)$. For a very broad range of problems the ML estimate $\hat\theta(x)$ exists and is unique. In the likelihood approach, $l(\theta|x)$ expresses the "plausibility" of various values of $\theta$, and $\hat\theta$, the ML estimate, is regarded as the "most plausible" or "most likely" value of $\theta$. This view is a basic non-sampling argument for ML estimation, although the terms "most plausible" or "most likely" cannot be equated with "most probable" since the likelihood function, $l(\theta|x)$, is not a pdf for $\theta$. As with LS estimates, ML estimates can be viewed as estimators and their properties studied to determine whether they are good or optimal in some senses.
From the likelihood function in (3.14), $\log l(\mu, \sigma^2|x) = -n\log\sigma - \sum_{i=1}^n(x_i - \mu)^2/2\sigma^2 + \text{constant}$. The necessary conditions for a maximum are $\partial\log l/\partial\mu = 0$ and $\partial\log l/\partial\sigma = 0$, which yield $\sum_{i=1}^n(x_i - \mu)/\sigma^2 = 0$ and $-n/\sigma + \sum_{i=1}^n(x_i - \mu)^2/\sigma^3 = 0$, with solutions $\hat\mu = \bar x$ and $\hat\sigma^2 = \sum_{i=1}^n(x_i - \bar x)^2/n$, the ML estimates.
For the usual multiple regression model, $\tilde y = X\beta + \tilde u$, with $\tilde u$ assumed normal with mean 0 and covariance matrix $\sigma^2 I_n$, the joint pdf for the observations is $p(y|X, \beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\{-(y - X\beta)'(y - X\beta)/2\sigma^2\}$ and the likelihood function, $l(\beta, \sigma^2|X, y)$, is $p(y|X, \beta, \sigma^2)$ viewed as a function of $\beta$ and $\sigma^2$, that is, $l(\beta, \sigma^2|X, y) \propto (2\pi\sigma^2)^{-n/2}\exp\{-(y - X\beta)'(y - X\beta)/2\sigma^2\}$.
For the more general model in which the error covariance matrix is $\sigma^2 V(\theta)$, with $V(\theta)$ a positive definite matrix depending on a parameter vector $\theta$, the quadratic form in the exponential of the likelihood function is minimized for any given value of $\theta$ by $\hat\beta_\theta = [X'V^{-1}(\theta)X]^{-1}X'V^{-1}(\theta)y$, the GLS quantity. Also, the conditional maximizing value of $\sigma^2$, given $\theta$, is $\hat\sigma^2_\theta = (y - X\hat\beta_\theta)'V^{-1}(\theta)(y - X\hat\beta_\theta)/n$. On substituting these conditional maximizing values in the likelihood function, the result is the so-called concentrated log-likelihood function, $\log l_c(\theta|y, X) = \text{constant} - (n/2)\log\hat\sigma^2_\theta - \tfrac12\log|V(\theta)|$. By numerical evaluation of this function for various values of $\theta$ it is possible to find a maximizing value for $\theta$, say $\hat\theta$, which when substituted into $\hat\beta_\theta$ and $\hat\sigma^2_\theta$ provides ML estimates of all of the parameters. This procedure is useful for computing ML estimates only when there are a few, say one or two, parameters in $\theta$.
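The concentrated likelihood procedure can be sketched as follows for an assumed one-parameter specification $V(\theta) = \mathrm{diag}\{\exp(\theta z_i)\}$ (a multiplicative heteroscedasticity form chosen only for illustration); $\hat\theta$ is found by evaluating $\log l_c(\theta)$ on a grid:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
z = np.linspace(0.0, 1.0, n)                     # variable driving the error variance
beta_true, theta_true = np.array([1.0, 2.0]), 1.5

V_diag = lambda th: np.exp(th * z)               # V(theta): diagonal, known form
y = X @ beta_true + rng.normal(0.0, np.sqrt(V_diag(theta_true)))

def concentrated_loglik(th):
    vinv = 1.0 / V_diag(th)
    XtVinv = X.T * vinv                          # X' V(theta)^{-1}
    b = np.linalg.solve(XtVinv @ X, XtVinv @ y)  # GLS estimate given theta
    e = y - X @ b
    sigma2 = (e * vinv * e).sum() / n            # conditional ML estimate of sigma^2
    return -0.5 * n * np.log(sigma2) - 0.5 * np.log(V_diag(th)).sum()

grid = np.linspace(-1.0, 4.0, 501)
theta_hat = grid[np.argmax([concentrated_loglik(t) for t in grid])]
print("theta_hat:", round(theta_hat, 3))
```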
A more general procedure, the Newton method, for maximizing log-likelihood functions, $L \equiv \log l(\theta|x)$, where $\theta$ is an $m \times 1$ vector of parameters and $x$ a vector of observations, commences with the first-order conditions, $\partial L/\partial\theta = 0$. Given an initial estimate, $\theta^{(0)}$, of the solution $\hat\theta$ of $\partial L/\partial\theta = 0$, expand $\partial L/\partial\theta$ in a Taylor's series about $\theta^{(0)}$, that is,
$$\partial L/\partial\theta \doteq \partial L/\partial\theta\big|_{\theta^{(0)}} + \partial^2 L/\partial\theta\,\partial\theta'\big|_{\theta^{(0)}}\,(\theta - \theta^{(0)}),$$
where all partial derivatives on the right side are evaluated at $\theta^{(0)}$. Setting this approximation equal to zero yields the updated estimate
$$\theta^{(1)} = \theta^{(0)} - \big[\partial^2 L/\partial\theta\,\partial\theta'\big]^{-1}\,\partial L/\partial\theta\,\Big|_{\theta^{(0)}},$$
and the process is iterated until convergence.
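A hedged sketch of these Newton iterations, applied for illustration to a binary logit log-likelihood (a model assumed here only for concreteness), is:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([-0.5, 1.0, 2.0])
p = 1.0 / (1.0 + np.exp(-X @ theta_true))
y = rng.binomial(1, p)

theta = np.zeros(3)                          # initial estimate theta^(0)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ theta))
    score = X.T @ (y - mu)                   # dL/dtheta
    hessian = -(X.T * (mu * (1 - mu))) @ X   # d2L/dtheta dtheta'
    step = np.linalg.solve(hessian, score)
    theta = theta - step                     # Newton update
    if np.max(np.abs(step)) < 1e-10:
        break

print("ML estimate:", np.round(theta, 3))
```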
32These regularity conditions are presented and discussed in Kendall and Stuart (1961) and Heyde and Johnstone (1979).
In the case of a vector ML estimator, $\hat\theta_{(n)}$, for large $n$ and under regularity conditions, its approximate distribution is multivariate normal with mean $\theta$ and covariance matrix $I_\theta^{-1}$, where $I_\theta$ is $n$ times the Fisher information matrix for a single observation. For proofs of these properties, which generally assume that observations, vector or scalar, are independently and identically distributed and impose certain conditions on the higher moments or other features of the observations' common distribution, see Cramér (1946, p. 500), Wald (1949), and Anderson (1971). For dependent observations, for example those generated by time series processes, additional assumptions are required to establish the large sample properties of ML estimators; see, for example, Anderson (1971) and Heyde and Johnstone (1979).
A basic issue with respect to the large sample properties of ML estimators is
the determination of what constitutes a "large sample". For particular problems,
mathematical analysis a n d / o r Monte Carlo experiments can be performed to
shed light on this issue. It must be emphasized that not only is the sample size
relevant, but also other features of the models, including parameter values, the
properties of independent variables, and the distributional properties of error
terms. Sometimes the convergence to large sample properties of ML estimators
is rapid, while in other cases it can be quite slow. Also, in "irregular" cases,
the above large sample properties of ML estimators may not hold. One such
simple case is $\tilde y_i = \theta_i + \tilde\varepsilon_i$, $i = 1,2,\ldots,n$, where the $\tilde\varepsilon_i$'s are NID$(0, \sigma^2)$. The ML estimate of $\theta_i$ is $\hat\theta_i = y_i$, and it is clear that $\hat\theta_i$ does not converge to $\theta_i$ as $n$ grows since there is just one observation for each $\theta_i$. The irregular aspect of this problem
is that the number of parameters grows with the sample size, a so-called
"incidental parameter" problem. Incidental parameters also appear in the func-
tional form of the errors-in-the-variables model and affect asymptotic properties
of ML estimators; see, for example, Neyman and Scott (1948) and Kendall and
Stuart (1961). Thus, such "irregular" cases, and also others in which the ranges of observations depend on parameters with unknown values or in which observations are dependent and generated by non-stationary time series processes, have to be analyzed very carefully since regular large sample ML properties, including consistency, normality, and efficiency, may not hold.
Estimators may be divided into two classes, namely those that are admissible and those that are inadmissible with respect to
estimators' risk properties relative to given loss functions. In this approach,
inadmissible estimators are regarded as unacceptable and attention is con-
centrated on the class of admissible estimators. Since this class usually contains
many estimators, additional criteria are required to choose a preferred estimator
from the class of admissible estimators.
The basic elements in applying the admissibility criterion are (1) loss functions,
(2) risk functions, and (3) comparisons of risk functions associated with alterna-
tive estimators. Consider a scalar parameter $\theta$ and an estimator $\hat\theta$ of $\theta$. Some examples of loss functions are given below, where the $c$'s are given positive constants:
These are but a few of many possible loss functions that can be employed. Note that they all are monotonically increasing functions of the absolute error of estimation, $|e| = |\hat\theta - \theta|$. The first three loss functions are unbounded while the fifth is an example of a bounded loss function that attains a maximal value of $c_5$ as $|\hat\theta - \theta| \to \infty$. The relative squared error loss function (3) is a special case of the generalized loss function (4) with $h(\theta) = c_3/\theta^2$. Note too that, as is customary, these loss functions have been scaled so that minimal loss equals zero when $\hat\theta - \theta = 0$. Also, negative loss can be interpreted as utility, that is, $U(\theta, \hat\theta) = -L(\theta, \hat\theta)$.
In the case of a vector of parameters, $\theta$, and a vector estimator, $\hat\theta$, a quadratic loss function is given by $L(\theta, \hat\theta) = (\theta - \hat\theta)'Q(\theta - \hat\theta)$, where $Q$ is a given pds matrix. A generalized quadratic loss function is $L(\theta, \hat\theta) = h(\theta)(\theta - \hat\theta)'Q(\theta - \hat\theta)$, where $h(\theta)$ is a given function of $\theta$. One example is $h(\theta) = 1/(\theta'\theta)^m$, where $m$ is a given non-negative constant.
In a particular estimation problem, the choice of an appropriate loss function is
important. Sometimes subject-matter considerations point to a particular form for
a loss (or utility) function. The widespread use of quadratic loss functions can
perhaps be rationalized by noting that a Taylor's series expansion of any loss function, $L(e)$, about $e = \hat\theta - \theta = 0$, such that $L(0) = 0$ and $L'(0) = 0$, yields $L(e) \doteq L''(0)e^2/2$, an approximate quadratic loss function. This, it must be emphasized, is a local approximation which may not be very good for asymmetric loss functions and/or bounded loss functions [see, for example, Zellner and Geisel (1968)].
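For concreteness, the following sketch codes a few loss functions of the general kinds discussed above (squared error, absolute error, relative squared error, and one possible bounded form); the particular functional forms, especially the bounded one, are illustrative assumptions rather than the text's list:

```python
import numpy as np

# Illustrative loss functions of the estimation error e = theta_hat - theta;
# the c's play the role of the given positive constants in the text.
def squared_error(e, c=1.0):            return c * e**2
def absolute_error(e, c=1.0):           return c * np.abs(e)
def relative_squared(e, theta, c=1.0):  return c * e**2 / theta**2
def bounded_loss(e, c=1.0):             return c * (1.0 - np.exp(-e**2))  # approaches c as |e| grows

e = np.linspace(-3, 3, 7)
print(squared_error(e))
print(bounded_loss(e))                  # bounded above by c = 1
```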
Given that a loss function, $L(\theta, \hat\theta)$, has been selected, the next step in applying the admissibility criterion is to evaluate the risk function, denoted by $r_{\hat\theta}(\theta)$ and defined by
$$r_{\hat\theta}(\theta) = \int_X L[\theta, \hat\theta(x)]\,p(x|\theta)\,dx, \tag{3.16}$$
where $p(x|\theta)$ is the pdf for the observations given $\theta$. It is seen that the risk function in (3.16) is defined for a particular estimator, $\hat\theta$, and a particular loss function, $L(\theta, \hat\theta)$. While it would be desirable to choose an estimator $\hat\theta$ so as to minimize $r_{\hat\theta}(\theta)$ for all values of $\theta$, unfortunately this is impossible. For example, an "estimator" $\hat\theta \equiv 5$ will have lower risk when $\theta = 5$ than any other estimator and thus no one estimator can minimize risk for all possible values of $\theta$. In view of
this fact, all that is possible at this point is to compare the risk functions of alternative estimators, say $\hat\theta_1, \hat\theta_2, \ldots$, with risk functions $r_{\hat\theta_1}(\theta), r_{\hat\theta_2}(\theta), \ldots$, relative to a given loss function. From such a comparison, it may be that $r_{\hat\theta_1}(\theta) \le r_{\hat\theta_2}(\theta)$ for all $\theta$ with the inequality being strict for some $\theta$. In such a case, $\hat\theta_2$ is said to be dominated by $\hat\theta_1$, and $\hat\theta_2$ is termed an inadmissible estimator. That $\hat\theta_1$ dominates $\hat\theta_2$ does not necessarily imply that $\hat\theta_1$ is itself admissible. To be admissible, an estimator, say $\hat\theta_1$, must be such that there is no other estimator $\hat\theta_a$ with risk function $r_{\hat\theta_a}(\theta)$ satisfying $r_{\hat\theta_a}(\theta) \le r_{\hat\theta_1}(\theta)$ for all $\theta$, with strict inequality for some $\theta$. Work on proofs of admissibility of estimators is given in Brown (1966).
A leading example of the inadmissibility of a maximum likelihood and least squares estimator has been given by Stein (1956) and James and Stein (1961). Let $\tilde y_i = \theta_i + \tilde\varepsilon_i$, $i = 1,2,\ldots,n$, with the $\tilde\varepsilon_i$'s NID$(0, \sigma^2)$ and the $\theta_i$'s the means of the $y_i$'s, $-\infty < \theta_i < \infty$. Further, let the loss function be $L(\theta, \hat\theta) = (\theta - \hat\theta)'(\theta - \hat\theta)$, a quadratic loss function. The likelihood function is $p(y|\theta) = (2\pi\sigma^2)^{-n/2}\exp\{-(y - \theta)'(y - \theta)/2\sigma^2\}$, where $y' = (y_1, y_2, \ldots, y_n)$ and $\theta' = (\theta_1, \theta_2, \ldots, \theta_n)$. Then the ML estimator is $\hat\theta_0 = \tilde y$, with risk function $r_{\hat\theta_0}(\theta) = E(\hat\theta_0 - \theta)'(\hat\theta_0 - \theta) = n\sigma^2$. When $\sigma^2$ has a known value, say $\sigma^2 = 1$, James and Stein (1961) put forward the following estimator for $\theta$ when $n \ge 3$:
$$\hat\theta_1 = \Big(1 - \frac{n-2}{y'y}\Big)y, \tag{3.17}$$
that has uniformly lower risk than the ML (and LS) estimator, $\hat\theta_0 = \tilde y$; that is, $r_{\hat\theta_1} < r_{\hat\theta_0}$, or $E(\hat\theta_1 - \theta)'(\hat\theta_1 - \theta) < E(\hat\theta_0 - \theta)'(\hat\theta_0 - \theta)$ for $0 \le \theta'\theta < \infty$ [see James and Stein (1961) for a proof]. As James and Stein show, use of $\hat\theta_1$ in (3.17) rather than the ML estimator results in a large reduction in risk, particularly in the vicinity of $\theta = 0$. They also develop an estimator similar to (3.17) for the case of $\sigma^2$ unknown and show that it dominates the ML estimator uniformly. For details, see James and Stein (1961), Zellner and Vandaele (1975), and the references in the latter paper. Also, as shown in Stein (1960), Sclove (1968), and Zellner and Vandaele (1975), Stein's result on the inadmissibility of the ML (and LS) mean estimator carries over to apply to regression estimation problems when the regression coefficients number three or more and an unbounded quadratic loss function is
utilized. It is also the case that for certain problems, say estimating the reciprocal
of a population mean, the ratio of regression coefficients, and coefficients of
simultaneous equation models, ML and other estimators' moments usually or
often do not exist, implying that such estimators are inadmissible relative to
quadratic and many other unbounded loss functions [see Zellner (1978) and
Zellner and Park (1979)]. While use of bounded loss functions will result in
bounded risk for these estimators, see Zaman (1981), it is not clear that the ML
and other estimators for these problems are admissible.
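The risk dominance of the James–Stein estimator is easy to verify by simulation; the sketch below assumes the shrinkage form of (3.17) given above, with $\sigma^2 = 1$ and an arbitrary true mean vector chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 10, 50_000
theta = np.full(n, 0.5)                       # true mean vector, sigma^2 = 1 assumed known

y = rng.normal(theta, 1.0, size=(reps, n))
ml = y                                        # ML / LS estimator: theta_hat = y
shrink = 1.0 - (n - 2) / (y**2).sum(axis=1, keepdims=True)
js = shrink * y                               # James-Stein estimator (3.17)

risk_ml = ((ml - theta)**2).sum(axis=1).mean()    # close to n sigma^2 = 10
risk_js = ((js - theta)**2).sum(axis=1).mean()
print(f"risk of ML estimator : {risk_ml:.2f}")
print(f"risk of James-Stein  : {risk_js:.2f}")
```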
Another broad class of estimators that are inadmissible are those that are discontinuous functions of the sample data, for example certain "pre-test" estimators. That is, define an estimator by $\tilde\theta = \hat\theta_1$ if $t > a$ and $\tilde\theta = \hat\theta_2$ if $t \le a$, where $t$ is a test statistic. If $\Pr(t > a) = w$, then the risk of this estimator relative to quadratic loss is $r_{\tilde\theta}(\theta) = wE(\hat\theta_1 - \theta)^2 + (1 - w)E(\hat\theta_2 - \theta)^2$. As an alternative estimator, consider $\hat\theta_3 = w\hat\theta_1 + (1 - w)\hat\theta_2$ with risk function
$$r_{\hat\theta_3}(\theta) = w^2E(\hat\theta_1 - \theta)^2 + (1 - w)^2E(\hat\theta_2 - \theta)^2 + 2w(1 - w)E(\hat\theta_1 - \theta)(\hat\theta_2 - \theta).$$
Then $r_{\tilde\theta}(\theta) - r_{\hat\theta_3}(\theta) = w(1 - w)E[(\hat\theta_1 - \theta) - (\hat\theta_2 - \theta)]^2 > 0$ and thus the discontinuous estimator $\tilde\theta$ is inadmissible. For further properties of "preliminary-test" estimators, see Judge and Bock (1978).
Since the class of admissible estimators relative to a specific loss function
contains many estimators, further conditions have to be provided in order to
choose among them. As seen above, the conditions of the Gauss–Markov Theorem limit the choice to linear and unbiased estimators and thus rule out, for example, the non-linear, biased James–Stein estimator in (3.17) and many others. The limitation to linear and unbiased estimators is not only arbitrary but can lead to poor results in practice [see, for example, Efron and Morris (1975)].
Another criterion for choosing among admissible estimators is the Wald
minimax criterion; that is, choose the estimator that minimizes the maximum expected loss. Formally, find $\hat\theta$ such that $\max_\theta r_{\hat\theta}(\theta) \le \max_\theta r_{\hat\theta_a}(\theta)$, where $\hat\theta_a$ is any
other estimator. While this rule provides a unique solution in many problems, its
very conservative nature has been criticized; see, for example, Ferguson (1967, p.
58) and Silvey (1970, p. 165). A much less conservative rule is to choose the
estimator, when it exists, that minimizes the minimum risk or, equivalently,
maximizes the maximum utility. While these rules may have some uses in
particular cases, in many others they lead to solutions that are not entirely
satisfactory.
To illustrate the use of risk functions, consider Figure 3.1, in which the risk functions associated with three estimators, $\hat\theta_1$, $\hat\theta_2$, and $\hat\theta_3$, have been plotted.
[Figure 3.1. Risk functions $r_{\hat\theta_1}(\theta)$, $r_{\hat\theta_2}(\theta)$, and $r_{\hat\theta_3}(\theta)$ plotted against $\theta$; the risk functions of $\hat\theta_1$ and $\hat\theta_2$ cross at $\theta = d$.]
As
drawn, $\hat\theta_1$ and $\hat\theta_2$ clearly dominate $\hat\theta_3$ since $r_{\hat\theta_3}$ lies everywhere above the other two risk functions. Thus, $\hat\theta_3$ is inadmissible. In choosing between $\hat\theta_1$ and $\hat\theta_2$, it is clearly important to know whether $\theta$'s value is to the right or left of the point of intersection, $\theta = d$. Without this information, choice between $\hat\theta_1$ and $\hat\theta_2$ is difficult, if not impossible. Further, unless admissibility is proved, there is no assurance that either $\hat\theta_1$ or $\hat\theta_2$ is admissible. There may be some other estimator, say $\hat\theta_4$, that dominates both $\hat\theta_1$ and $\hat\theta_2$. Given these conditions, there is uncertainty about the choice between $\hat\theta_1$ and $\hat\theta_2$ and, as stated above, without a proof of admissibility there is no assurance that either is admissible. For a practical illustration of these problems in the context of estimating the parameter $\rho$ in a stationary, normal, first-order autoregressive process, $y_t = \rho y_{t-1} + \varepsilon_t$, see Thornber (1967). He provides estimated risk functions for ML and several other estimators of $\rho$. These risk functions cross and thus no one estimator uniformly dominates the others.
The shapes of the estimated risk functions are also of interest. See also Fomby
and Guilkey (1978).
In summary, the criterion of admissibility, a sampling criterion, provides a
basis for ruling out some estimators. Indeed, according to this criterion, Stein's
results indicate that many ML and LS estimators are inadmissible relative to
quadratic loss. In other cases in which estimators do not possess finite moments,
they are inadmissible relative to quadratic and other loss functions that require
estimators' moments to be finite in order for risk to be finite. Even if just
bounded loss functions are considered, there is no assurance that ML and LS
estimators are admissible relative to them without explicit proofs that they do
indeed possess this property. As regards admissible estimators, they are not in
general unique so that the problem of choice among them remains difficult. If
information is available about the range of "plausible" or "reasonable" values of
parameters, a choice among alternative admissible estimators can sometimes be
made. In terms of Figure 3.1, if it is known that t~1 and /~2 are admissible
estimators and if it is known that 0 > d, then t~2 would be preferred to 01- Below,
in the Bayesian approach, it is shown how such information can be employed in
obtaining estimators.
Central to the Bayesian approach is Bayes' Theorem. For the probability model,
$$p(x, \theta) = p(x|\theta)p(\theta) = p(\theta|x)p(x), \tag{3.18}$$
where the functions $p(\cdot)$ are labelled by their arguments, it follows from (3.18) that $p(\theta|x) = p(\theta)p(x|\theta)/p(x)$, or
$$p(\theta|x) \propto p(\theta)p(x|\theta), \tag{3.19}$$
where the factor of proportionality in (3.19) is the reciprocal of $\int_\Theta p(\theta)p(x|\theta)\,d\theta = p(x)$. The result in (3.19) is Bayes' Theorem, with $p(\theta|x)$ the posterior pdf for $\theta$, $p(\theta)$ the prior pdf for $\theta$, and $p(x|\theta)$ the likelihood function. Thus, (3.19) can be expressed as: posterior pdf $\propto$ prior pdf $\times$ likelihood function.
As an example, consider $n$ independent binomial trials resulting in $r$ successes, so that
$$P(r|n, \theta) = \binom{n}{r}\theta^r(1-\theta)^{n-r},$$
with $0 < \theta < 1$. As prior pdf for $\theta$, assume that it is given by $p(\theta|a, b) = \theta^{a-1}(1-\theta)^{b-1}/B(a, b)$, a beta pdf with $a, b > 0$ having given values so as to represent the available information regarding possible values of $\theta$. Then the posterior pdf for $\theta$
is given by
$$p(\theta|D) \propto \theta^{r+a-1}(1-\theta)^{n-r+b-1}, \tag{3.20}$$
where $D$ denotes the prior and sample information and the factor of proportionality, the normalizing constant, is $1/B(r+a, n-r+b)$. It is seen that the posterior pdf in (3.20) is a beta pdf with parameters $r+a$ and $n-r+b$. The sample information enters the posterior pdf through the likelihood function, while the prior information is introduced via the prior pdf. Note that the complete posterior pdf for $\theta$ is available. It can be employed to make probability statements about $\theta$, e.g. $\Pr(c_1 < \theta < c_2|D) = \int_{c_1}^{c_2} p(\theta|D)\,d\theta$. Also, the mean and other moments of the posterior pdf are easily evaluated from properties of the beta distribution. Thus, the prior pdf, $p(\theta|a, b)$, has been transformed into a posterior pdf, $p(\theta|D)$, that incorporates both sample and prior information.
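Numerically, the beta posterior in (3.20) can be manipulated directly; the sketch below (with illustrative values for $a$, $b$, $n$, and $r$) computes the posterior mean, a posterior probability statement, and a central interval:

```python
from scipy import stats

# Prior beta(a, b) combined with r successes in n binomial trials
a, b = 2.0, 3.0          # illustrative prior parameter values
n, r = 20, 7             # illustrative data

posterior = stats.beta(r + a, n - r + b)      # posterior pdf in (3.20)

print("posterior mean       :", posterior.mean())            # (r + a) / (n + a + b)
print("Pr(0.2 < theta < 0.5):", posterior.cdf(0.5) - posterior.cdf(0.2))
print("95% central interval :", posterior.ppf([0.025, 0.975]))
```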
As mentioned in Section 2.2, the added element in the Bayesian approach is the prior pdf, $p(\theta)$, in (3.19), or $p(\theta|a, b)$ in (3.20). Given a prior pdf, standard mathematical operations yield the posterior pdf as in (3.19). Explicit posterior distributions for parameters of many models encountered in econometrics have been derived and applied in the literature; see, for example, Jeffreys (1967), Lindley (1965), DeGroot (1970), Box and Tiao (1973), Leamer (1978), and Zellner
(1971). Further, from (3.19), the marginal pdf for a single element or a subset of the elements of $\theta$ can be obtained by integration. That is, if $\theta' = (\theta_1', \theta_2')$, the marginal posterior pdf for $\theta_1$ is given by
$$p(\theta_1|D) = \int p(\theta_1, \theta_2|D)\,d\theta_2 = \int p(\theta_1|\theta_2, D)\,p(\theta_2|D)\,d\theta_2,$$
where the second integral shows that the integration over the elements of $\theta_2$ can be interpreted as an averaging of the conditional posterior pdf for $\theta_1$ given $\theta_2$, $p(\theta_1|\theta_2, D)$, with the marginal posterior pdf for $\theta_2$, $p(\theta_2|D)$, serving as the weight function. This integration with respect to the elements of $\theta_2$ is a way of getting rid of parameters that are not of special interest to an investigator, the so-called nuisance parameters. In addition, the conditional posterior pdf, $p(\theta_1|\theta_2, D)$, can be employed to determine how sensitive inferences about $\theta_1$ are to what is assumed about the value of $\theta_2$; that is, $p(\theta_1|\theta_2, D)$ can be computed for various values of $\theta_2$; see, for example, Box and Tiao (1973) and Zellner (1971) for examples of such sensitivity analyses. Finally, as will be explained below, given a loss function, point estimates can be obtained.
The posterior pdf is in the normal form with posterior mean $\bar x$, the sample mean, and posterior variance $1/n$.34
In this example it is seen that the mean and mode of the posterior pdf are equal to the sample mean, $\bar x$, the ML estimate. Some have crudely generalized this and similar results to state that with the use of diffuse prior pdfs, Bayesian and non-Bayesian estimation results are equivalent, aside from their differing interpretations. This generalization is not true in general. If a prior pdf is uniform, $p(\theta) \propto$ constant, then the posterior pdf in (3.19) is given by $p(\theta|x) \propto p(x|\theta)$, that is, it is proportional to the likelihood function. Thus, the modal value of the posterior pdf will be exactly equal to the ML estimate and in this sense there is an exact correspondence between Bayesian and non-Bayesian results. However, as shown below, the posterior mean of $\theta$ is optimal relative to a quadratic loss function. If a posterior pdf (and likelihood function) is asymmetric, the posterior mean of $\theta$ can be far different from the modal value. Thus, the optimal Bayesian point estimate can be quite different from the ML estimate in finite samples. Asymmetric likelihood functions are frequently encountered in econometric analyses.
As regards point estimation, a part of the Bayesian approach, given a loss function, $L(\theta, \hat\theta)$, wherein $\theta$ is viewed as random and $\hat\theta$ is any non-random estimate, $\hat\theta = \hat\theta(x)$, a non-sampling criterion is to find the value of $\hat\theta$ that
33Jeffreys (1967) interprets such improper priors as implying that $\infty$ rather than 1 is being employed to represent the certain event, $\Pr(-\infty < \theta < \infty)$. Then the probability that $\theta$ lies in any finite interval, $\Pr(a < \theta < b) = 0$, and $\Pr(a < \theta < b)/\Pr(c < \theta < d)$, being of the form 0/0, is indeterminate.
34If the prior pdf $p(\mu) \propto$ constant, $-M < \mu < M$, had been used, the posterior pdf is $p(\mu|D) \propto \exp\{-n(\mu - \bar x)^2/2\}$ with $-M < \mu < M$. For $M$ large relative to $1/n$, the posterior is very closely normal.
minimizes the posterior expectation of the loss function. Explicitly, the problem is as follows:
$$\min_{\hat\theta}\; E[L(\theta, \hat\theta)\,|\,x] = \min_{\hat\theta}\int_\Theta L(\theta, \hat\theta)\,p(\theta|x)\,d\theta, \tag{3.22}$$
where $p(\theta|x)$ is the posterior pdf in (3.19). The minimizing value of $\hat\theta$, say $\hat\theta_B$, is the optimal Bayesian estimate, optimal in a non-sampling sense since the observation vector $x$ is given. In the case of a quadratic loss function, $L(\theta, \hat\theta) = (\theta - \hat\theta)'Q(\theta - \hat\theta)$, where $Q$ is a given pds matrix, $\hat\theta_B = \bar\theta = E(\theta|x)$, the posterior mean vector.35 That is, $\hat\theta_B = \int_\Theta \theta\,p(\theta|x)\,d\theta$.
The Bayesian estimator can also be viewed as minimizing average risk,
$$\mathrm{AR} = \int_\Theta r_{\hat\theta}(\theta)\,p(\theta)\,d\theta. \tag{3.23}$$
That is, choose the estimator $\hat\theta$ so as to minimize average risk, $Er_{\hat\theta}(\theta)$, where the expectation is taken with respect to the prior pdf $p(\theta)$. On substituting the integral expression for $r_{\hat\theta}(\theta)$ in (3.23), the minimand is
$$\int_\Theta\int_X L[\theta, \hat\theta(x)]\,p(x|\theta)\,p(\theta)\,dx\,d\theta = \int_\Theta\int_X L[\theta, \hat\theta(x)]\,p(\theta|x)\,p(x)\,dx\,d\theta, \tag{3.24}$$
35See, for example, Zellner (1971, p. 25) for a proof. Also, the particular loss structure that yields the modal value of a posterior pdf as an optimal point estimate is described in Blackwell and Girshick (1954, p. 305). This loss structure implies zero loss for very small estimation errors and constant positive loss for errors that are not small.
where $p(\theta)p(x|\theta) = p(x)p(\theta|x)$ from (3.18) has been employed. On interchanging the order of integration in (3.24), the right side becomes
$$\int_X\Big[\int_\Theta L[\theta, \hat\theta(x)]\,p(\theta|x)\,d\theta\Big]\,p(x)\,dx.$$
When this multiple integral converges, the quantity $\hat\theta_B$ that minimizes the expression in square brackets will minimize the entire expression given that $p(x) > 0$ for $x \in R_x$.36 Thus, $\hat\theta_B$, the solution to the problem in (3.22), is the Bayesian estimator that minimizes average risk in (3.23).
Some properties of $\hat\theta_B$ follow:
(1) $\hat\theta_B$ has the optimal non-sampling property in (3.22) and the optimal sampling property in (3.23).
(2) Since $\hat\theta_B$ minimizes average risk, it is admissible relative to $L(\theta, \hat\theta)$. This is so because if there were another estimator, say $\hat\theta_A$, that uniformly dominated $\hat\theta_B$ in terms of risk, it would have lower average risk, and this contradicts the fact that $\hat\theta_B$ is the estimator with minimal average risk. Hence, no such $\hat\theta_A$ exists.
(3) The class of Bayesian estimators is complete in the sense that in the class of all estimators there is no estimator outside the subset of Bayesian estimators that has lower average risk than every member of the subset of Bayesian estimators.
(4) Bayesian estimators are consistent and normally distributed in large samples with mean equal to the ML estimate and covariance matrix equal to the inverse of the estimated information matrix. Further, in large samples the Bayesian estimator (as well as the ML estimator) is "third-order" asymptotically efficient.37 These results require certain regularity conditions [see, for example, Heyde and Johnstone (1979)].
A key point in establishing these sampling properties of the Bayesian estimator, $\hat\theta_B$, is the assumption that the multiple integral in (3.24) converges. It usually does when prior pdfs are proper, although exceptions are possible. One such case occurs in the estimation of the reciprocal of a normal mean, $\theta = 1/\mu$, using quadratic loss, $(\theta - \hat\theta)^2$. The posterior pdf for $\mu$, based on a proper normal prior for $\mu$, is normal. Thus, $\theta = 1/\mu$, the reciprocal of a normal variable, possesses no finite moments and the integral defining posterior expected loss does not converge. With more information, say $\theta > 0$, this problem becomes amenable to solution. Also, if the loss function is $(\theta - \hat\theta)^2/\theta^2$, a relative squared error loss
36See Blackwell and Girshick (1954), Ferguson (1967), and DeGroot (1970) for consideration of this and the following topics.
37See, for example, Takeuchi (1978) and Pfanzagl and Wefelmeyer (1978).
38This implies that the Stein–James estimate in (3.17) is suboptimal for this specification.
39Lindley (1962) provides the following model to rationalize dependent $\theta_i$'s: $y_i = \theta_i + \varepsilon_i$ and $\theta_i = \theta + v_i$, $i = 1,2,\ldots,n$, where the $\varepsilon_i$'s and $v_i$'s are independent normal error terms and $\theta$ is interpreted as a "common effect". Analysis of this model produces estimates of the $\theta_i$'s very similar to those in (3.17) [see, for example, Zellner and Vandaele (1975)].
represent such information by use of prior pdfs, while non-Bayesians often use
such information informally. Evidence is being accumulated on the relative merits
of these alternative approaches to parameter estimation and other inference
problems.
4°For examples of this approach, see Tukey (1977), Huber (1964, 1972), and Belsley, Kuh and
Welsch (1980).
outlying observations. Also, see Barnett and Lewis (1978) for a review of a
number of models for particular kinds of outlying observations. Many production
function models including the CES, trans-log, and other generalized production
function models include the Cobb-Douglas and other models as special cases.
Box–Cox (1964) and other transformations [see, for example, Tukey (1957) and
Zellner and Revankar (1969)] can be employed to broaden specifying assump-
tions and thus to guard against possible specification errors. In regression
analysis, it is common practice to consider models for error terms, say autoregres-
sive and/or moving average processes, when departures from independence are
thought to be present. Such broadened models can of course be analyzed in either
sampling theory or Bayesian approaches. With respect to Bayesian considerations
of robustness, see, for example, Savage et al. (1963), Box and Tiao (1973), and
DeRobertis (1978).
An upper confidence bound for a scalar parameter $\theta$ is a statistic $\tilde a_\alpha = a_\alpha(\tilde x)$ such that
$$P(\tilde a_\alpha \ge \theta\,|\,\theta) = 1 - \alpha. \tag{4.1a}$$
In addition, it is required that if $\alpha_1 > \alpha_2$ and if $\tilde a_{\alpha_1}$ and $\tilde a_{\alpha_2}$ are both defined in accord with (4.1), then $\tilde a_{\alpha_1} \le \tilde a_{\alpha_2}$, that is, the larger $1-\alpha$, the larger is the upper bound. Then $\tilde a_\alpha$ is called a $(1-\alpha)100$ percent upper confidence bound for $\theta$. From (4.1a) the random event $\tilde a_\alpha \ge \theta$ has probability $1-\alpha$ of occurrence and this is the sense in which $\tilde a_\alpha$ is a probabilistic bound for $\theta$. When $\tilde x$ is observed, $a_\alpha(\tilde x)$ can be evaluated with the given sample data $x$. The result is $a_\alpha(x)$, a non-stochastic quantity, say $a_\alpha(x) = 1.82$, and the computed upper confidence bound is 1.82. In a similar way a $(1-\alpha)\times 100$ percent lower confidence bound for $\theta$ is $\tilde b_\alpha = b_\alpha(\tilde x)$ such that
$$P(\tilde b_\alpha \le \theta\,|\,\theta) = 1 - \alpha, \tag{4.1b}$$
with $\tilde b_{\alpha_1} \ge \tilde b_{\alpha_2}$ when $\alpha_1 > \alpha_2$; that is, the larger is $1-\alpha$, the smaller is the lower bound.
Bayesian confidence bounds are based on the posterior pdf for $\theta$, $p(\theta|x) \propto \pi(\theta)f(x|\theta)$, where $\pi(\theta)$ is a prior pdf for $\theta$. A $(1-\alpha)\times 100$ percent Bayesian upper bound, $c_\alpha = c_\alpha(x)$, is defined by $P(\theta \le c_\alpha\,|\,x) = 1 - \alpha$, where $\theta$ is considered random and the sample data $x$ are given.
As an example, let $\tilde x_i$, $i = 1,2,\ldots,n$, be NID$(\theta, 1)$, so that, viewed as a function of $\theta$, $f(x|\theta) \propto \exp\{-n(\theta - \bar x)^2/2\}$, where $\bar x$ is the sample mean. Then with the prior $\pi(\theta) \propto$ constant, $-\infty < \theta < \infty$,
the posterior pdf is $f(\theta|x) \propto \exp\{-n(\theta - \bar x)^2/2\}$, a normal pdf. Thus, $z = \sqrt n(\theta - \bar x)$ is $N(0,1)$ a posteriori and the constant $c_\alpha$ can be found such that $P(z \le c_\alpha\,|\,x) = 1 - \alpha$. $z \le c_\alpha$ is equivalent to $\sqrt n(\theta - \bar x) \le c_\alpha$ or $\theta \le \bar x + c_\alpha/\sqrt n$. Thus, $P(\theta \le \bar x + c_\alpha/\sqrt n\,|\,x) = 1 - \alpha$ and $\bar x + c_\alpha/\sqrt n$ is the Bayesian upper confidence bound. Now from a sampling theory point of view, $\bar{\tilde x}$ has a normal sampling pdf with mean $\theta$ and variance $1/n$. Thus, $z = \sqrt n(\bar{\tilde x} - \theta)$ is $N(0,1)$, given $\theta$. From $P(z \ge -c_\alpha\,|\,\theta) = 1 - \alpha$ it follows that $P[\sqrt n(\bar{\tilde x} - \theta) \ge -c_\alpha\,|\,\theta] = P(\bar{\tilde x} + c_\alpha/\sqrt n \ge \theta\,|\,\theta) = 1 - \alpha$ and $\bar{\tilde x} + c_\alpha/\sqrt n$ is the sampling theory upper confidence bound that is numerically identical to the Bayesian bound.
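The coincidence of the two bounds is easy to reproduce numerically; the sketch below (with simulated data and an assumed true mean) computes $\bar x + c_\alpha/\sqrt n$, which serves here as both the Bayesian and the sampling theory upper bound:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
theta_true, n, alpha = 1.0, 25, 0.05
x = rng.normal(theta_true, 1.0, size=n)
xbar = x.mean()

c = stats.norm.ppf(1 - alpha)          # c_alpha with P(z <= c_alpha) = 1 - alpha
upper_bound = xbar + c / np.sqrt(n)    # numerically identical under either interpretation

print(f"x-bar = {xbar:.3f}, 95% upper confidence bound = {upper_bound:.3f}")
```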
The example indicates that when a uniform prior pdf for the parameter $\theta$ is appropriate and when a "pivotal quantity", such as $z = \sqrt n(\bar x - \theta)$, that has a pdf not involving the parameter $\theta$ exists,41 Bayesian and sampling theory confidence
bounds are numerically identical. Other examples involving different pivotal
quantities will be presented below. Also, a connection of confidence bounds with
construction of tests of hypotheses will be discussed below in the section on
hypothesis testing.
41Note that the pdf for $z = \sqrt n(\bar x - \theta)$ in the example is $N(0,1)$ both from the sampling theory and Bayesian points of view.
Example 4.2
Consider the standard normal regression model $\tilde y = X\beta + \tilde u$, where the $n\times 1$ vector $\tilde u$ is MVN$(0, \sigma^2 I_n)$. Then $\hat\beta = (X'X)^{-1}X'\tilde y$ has a pdf that is MVN$[\beta, (X'X)^{-1}\sigma^2]$ and $\nu s^2/\sigma^2$, where $\nu = n - k$ and $\nu s^2 = (\tilde y - X\hat\beta)'(\tilde y - X\hat\beta)$, has a $\chi^2$ pdf with $\nu$ degrees of freedom (d.f.). It follows that $t = (\hat\beta_i - \beta_i)/s_{\hat\beta_i}$ has a univariate Student-$t$ (US-$t$) pdf with $\nu$ d.f., where $\hat\beta_i$ and $\beta_i$ are the $i$th elements of $\hat\beta$ and $\beta$, respectively, and $s^2_{\hat\beta_i} = m^{ii}s^2$, with $m^{ii}$ the $i,i$th element of $(X'X)^{-1}$. Then from tables of the Student-$t$ distribution with $\nu$ d.f., a constant $c_\alpha > 0$ can be found such that with given probability $1-\alpha$, $P(|t| < c_\alpha) = P(|\hat\beta_i - \beta_i|/s_{\hat\beta_i} < c_\alpha\,|\,\beta_i) = 1 - \alpha$. Since $|\hat\beta_i - \beta_i|/s_{\hat\beta_i} < c_\alpha$ is equivalent to $\hat\beta_i - c_\alpha s_{\hat\beta_i} < \beta_i < \hat\beta_i + c_\alpha s_{\hat\beta_i}$, $P(\hat\beta_i - c_\alpha s_{\hat\beta_i} < \beta_i < \hat\beta_i + c_\alpha s_{\hat\beta_i}\,|\,\beta_i) = 1 - \alpha$ and $\hat\beta_i \pm c_\alpha s_{\hat\beta_i}$ is a $(1-\alpha)\times 100$ percent confidence interval for $\beta_i$. Note that the interval is random and $\beta_i$ has a fixed unknown value. With given data $\hat\beta_i \pm c_\alpha s_{\hat\beta_i}$ can be evaluated to yield, for example, $0.56 \pm 0.12$.
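The interval of Example 4.2 can be computed as follows (simulated data and parameter values are assumed only for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, k, alpha = 40, 3, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 0.5, -0.25])
y = X @ beta + rng.normal(0.0, 2.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
nu = n - k
s2 = ((y - X @ beta_hat) ** 2).sum() / nu          # nu * s^2 = residual sum of squares
m = np.linalg.inv(X.T @ X)                         # m^{ii} are its diagonal elements
se = np.sqrt(np.diag(m) * s2)                      # s_{beta_i}

c = stats.t.ppf(1 - alpha / 2, df=nu)              # c_alpha from the Student-t distribution
for i in range(k):
    lo, hi = beta_hat[i] - c * se[i], beta_hat[i] + c * se[i]
    print(f"beta_{i}: {beta_hat[i]: .3f}  95% interval [{lo: .3f}, {hi: .3f}]")
```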
Example 4.3
If the regression model in the previous example is analyzed with a diffuse prior pdf, $p(\beta, \sigma) \propto 1/\sigma$, the posterior pdf is $p(\beta, \sigma|y) \propto \sigma^{-(n+1)}\exp\{-[\nu s^2 + (\beta - \hat\beta)'X'X(\beta - \hat\beta)]/2\sigma^2\}$ and on integrating over $\sigma$, $0 < \sigma < \infty$, the marginal posterior pdf for $\beta$ is $p(\beta|y) \propto [\nu s^2 + (\beta - \hat\beta)'X'X(\beta - \hat\beta)]^{-(\nu+k)/2}$, a pdf in the MVS-$t$ form with $\nu = n - k$, $\hat\beta = (X'X)^{-1}X'y$, and $\nu s^2 = (y - X\hat\beta)'(y - X\hat\beta)$. Then it follows that $t = (\beta_i - \hat\beta_i)/s_{\hat\beta_i}$ has a US-$t$ pdf with $\nu$ d.f., where $\beta_i$ and $\hat\beta_i$ are the $i$th elements of $\beta$ and $\hat\beta$, respectively, and $s^2_{\hat\beta_i} = m^{ii}s^2$, where $m^{ii}$ is the $i,i$th element of $(X'X)^{-1}$. Thus, $c_\alpha$ can be found such that for given probability $1-\alpha$, $P(|t| < c_\alpha) = P(|\beta_i - \hat\beta_i|/s_{\hat\beta_i} < c_\alpha\,|\,y) = 1 - \alpha$. Equivalently, $P(\hat\beta_i - c_\alpha s_{\hat\beta_i} < \beta_i < \hat\beta_i + c_\alpha s_{\hat\beta_i}\,|\,y) = 1 - \alpha$ and $\hat\beta_i \pm c_\alpha s_{\hat\beta_i}$ is a $(1-\alpha)100$ percent Bayesian confidence interval for $\beta_i$ in the sense that the posterior probability that the random $\beta_i$ lies in the fixed interval $\hat\beta_i \pm c_\alpha s_{\hat\beta_i}$ is $1-\alpha$.
42The interval $\nu s^2/\bar c$ to $\nu s^2/\underline c$ is also a $1-\alpha$ Bayesian interval when the diffuse prior $p(\beta, \sigma) \propto 1/\sigma$ is employed, since then $\nu s^2/\sigma^2$ has a $\chi^2$ posterior pdf with $\nu$ d.f. In this problem the pivotal quantity is $\nu s^2/\sigma^2$, which has a $\chi^2$ pdf not involving $\beta$ or $\sigma$ in both the sampling theory and Bayesian approaches.
43Write $a - b + \lambda\int_b^a p(\theta|D)\,d\theta$, where $\lambda$ is a Lagrange multiplier. Differentiating this expression partially with respect to $a$ and to $b$ and setting the first partial derivatives equal to zero yields $1 + \lambda p(a|D) = 0$ and $1 + \lambda p(b|D) = 0$, so that $p(a|D) = p(b|D)$ is necessary for $a - b$ to be minimized subject to the restriction. Under weak conditions, this condition is also sufficient. Also, this interval can be obtained by minimizing expected loss with a loss function of the following type: $L = q(a - b) - 1$ if $b \le \theta \le a$ and $L = q(a - b)$ otherwise, with $q > 0$ a given constant. This loss function depends on the length of the interval, $a - b$.
with $p(\theta_m + c_\alpha|D) = p(\theta_m - c_\alpha|D)$ and $\Pr(\theta_m - c_\alpha < \theta < \theta_m + c_\alpha\,|\,D) = 1 - \alpha$. For
bimodal and some other types of posterior pdfs, a single interval is not very useful
in characterizing a range of probable values for 0.
In the sampling theory approach various definitions of interval shortness have
been proposed. Since the sampling theory confidence interval is random, its
length is random. Attempts to obtain confidence intervals with minimum ex-
pected length have not been successful in general. Another criterion is to
maximize the probability of coverage, that is, to find /~ and a~ such that
1-a=P(b~<~O<<.gl~lO)>~P([~<~O'<<.a~lO' ) for every 0 and 0 ' c O , where 0 is
the true value and 0' is some other value. That is, the interval must be at least as
likely to cover the true value as any other value. An interval satisfying this
criterion is called an unbiased confidence interval of level 1 - a. Pratt (1961) has
shown that in many standard estimation problems there exist 1 - a level confi-
dence intervals which have uniformly minimum expected length among all 1 - a
level unbiased confidence intervals. Also, a concept of shortness related to
properties of uniformly most powerful unbiased tests will be discussed below.
In summary, for a scalar parameter $\theta$, or for a function of $\theta$, $g(\theta)$, results are available to compute upper and lower confidence bounds and confidence intervals in both the sampling theory and Bayesian approaches. For some problems, for example $g(\theta) = 1/\theta$, where $\tilde x_i = \theta + \tilde\varepsilon_i$, with the $\tilde\varepsilon_i$'s NID$(0, \sigma^2)$, both the sampling distribution of $1/\bar{\tilde x}$ and the posterior pdf for $1/\theta$ can be markedly bimodal and in such cases a single interval is not very useful. Some other
pathological cases are discussed in Lindley (1971) and Cox and Hinkley (1974, p.
232 ff.). The relationship of sampling properties of Bayesian and sampling theory
intervals is discussed in Pratt (1965).
4.3. Confidence regions
A $(1-\alpha)\times 100$ percent confidence region for a parameter vector $\theta$ is a random region $\tilde\omega_\alpha = \omega_\alpha(\tilde x)$ such that
$$P(\theta \in \tilde\omega_\alpha\,|\,\theta) = 1 - \alpha, \tag{4.4}$$
and $\omega_{\alpha_1}(\tilde x) \subset \omega_{\alpha_2}(\tilde x)$ when $\alpha_1 > \alpha_2$. This last condition insures that the confidence region will be larger the larger is $1-\alpha$. In particular problems, as with confidence intervals, some additional considerations are usually required to determine a unique form for $\tilde\omega_\alpha$. If $\tilde\omega_\alpha$ is formed so that all parameter values in $\tilde\omega_\alpha$ have higher likelihood than those outside, such a region is called a likelihood-based confidence region by Cox and Hinkley (1974, p. 218).
$$P[\theta \in \omega_\alpha(x)\,|\,x] = 1 - \alpha, \tag{4.5}$$
without additional conditions; see Cox and Hinkley (1974, p. 230 ff.) for analysis of this problem. In a Bayesian approach the marginal posterior pdf for $\theta_1$, $h(\theta_1|D)$, is obtained from the joint posterior pdf $p(\theta_1, \theta_2|D)$ and confidence regions can be based on $h(\theta_1|D)$. Another serious problem arises if the sampling
distribution of an estimator $\hat\theta$ or the posterior pdf for $\theta$ is multi-modal or has
some other unusual features. In such cases sampling theory and Bayesian confi-
dence regions can be misleading. Finally, in large samples, maximum likelihood
and other estimators are often approximately normally distributed and the large
sample normal distribution can be employed to obtain approximate confidence
intervals and regions in a sampling theory approach. Similarly, in large samples,
posterior pdfs assume an approximate normal form and approximate Bayesian
intervals and regions can be computed from approximate normal posterior
distributions. For n large enough, these approximations will be satisfactory.
5. Prediction
44See, for example, the predictions that Friedman (1957, pp. 214-219) derived from his theory of the consumption function.
probability models for $\tilde x$ and for $\tilde z$, then predictions of $\tilde z$ will usually be adversely affected.45
With past data $x$ and future or as yet unobserved data $\tilde z$, a point prediction of the random vector $\tilde z$ is defined to be $\hat z = \varphi(x)$, where $\varphi(x)' = [\varphi_1(x), \varphi_2(x), \ldots, \varphi_q(x)]$ is a function of just $x$ and thus can be evaluated given $x$. When the value of $\tilde z$ is observed, say $z_0$, then $e_0 = \hat z - z_0$ is the observed forecast error vector. In general perfect prediction in the sense $e_0 = 0$ is impossible and thus some other criteria have to be formulated to define good prediction procedures. The parallelism with the problem of defining good estimation procedures, discussed in Section 3, is very close except that here the object of interest, $\tilde z$, is random.
In the sampling theory approach, prediction procedures are evaluated in terms of their sampling properties in repeated (actual or hypothetical) samples. That is, the sampling properties of a predictor, $\varphi(\tilde x)$, are considered in defining good or optimal prediction procedures, which involves the choice of an explicit functional form for $\varphi(\tilde x)$. Note that use of the term "point predictor" or "predictor" implies that the random function $\varphi(\tilde x)$ is being considered, while use of the term "point prediction" or "prediction" implies that the non-random function $\varphi(x)$ is being considered.
Some properties of predictors are reviewed below with $\tilde e = \tilde z - \varphi(\tilde x)$ the random forecast error vector. For brevity of notation, $\varphi(\tilde x)$ will be denoted by $\tilde\varphi$.
(1) Minimal MSE predictor. If $l'\tilde\varphi$, where $l$ is a given vector of rank one, has minimal MSE, then $\tilde\varphi$ is a minimal MSE predictor.
(2) Unbiasedness. If $E\tilde e = 0$, then $\tilde\varphi$ is an unbiased predictor. If $E\tilde e \ne 0$, then $\tilde\varphi$ is a biased predictor.
(3) Linearity. If $\tilde\varphi = A\tilde x$, where $A$ is a given matrix, then $\tilde\varphi$ is a linear predictor.
(4) Minimum variance linear unbiased (MVLU) predictor. Consider $l'\tilde e$, where $l$ is a given $q\times 1$ vector of rank one, and the class of linear, unbiased predictors, $\tilde\varphi_u = A_u\tilde x$, with $A_u$, not unique, such that $E(\tilde z - A_u\tilde x) = 0$. If $\mathrm{var}(l'\tilde e)$ is minimized by taking $A_u = A_*$, then $\tilde\varphi_* = A_*\tilde x$ is the MVLU predictor.
(5) Prediction risk. If $L(\tilde e)$ is a convex loss function, then the risk associated with $\tilde\varphi$ relative to $L(\tilde e)$ is $r(\theta) = EL(\tilde e)$. For example, if $L(\tilde e) = \tilde e'Q\tilde e$, where $Q$ is a given $q\times q$ positive definite symmetric matrix, the risk associated with
45Statistical tests of these assumptions, e.g. the i.i.d. assumption, can be performed.
46For further discussion of sampling theory prediction and forecasting techniques for a range of
problems, see Granger and Newbold (1977).
$\tilde\varphi$ is $r(\theta) = E\tilde e'Q\tilde e$; in the scalar case this becomes
$$E\tilde e^2 = \mathrm{var}(\tilde e) + (E\tilde e)^2, \tag{5.1}$$
where $E\tilde e$ is the prediction bias and $\mathrm{var}(\tilde e)$ is the variance of the forecast error.
(6) Admissible predictor. Let $r_1(\theta) = EL(\tilde e_1)$ be the risk associated with predictor $\tilde\varphi_1$ and $r_a(\theta) = EL(\tilde e_a)$ be the risk associated with any other predictor. If there does not exist another predictor such that $r_a(\theta) \le r_1(\theta)$ for all $\theta$ in the parameter space, with $r_a(\theta) < r_1(\theta)$ for some values of $\theta$, then the predictor $\tilde\varphi_1$ is admissible relative to the loss function $L$. If such another predictor exists, then $\tilde\varphi_1$ is inadmissible relative to $L$.
(7) Robust predictor. A robust predictor is a predictor that performs well in the
presence of model specification errors and/or in the presence of unusual
data.
Much of what has been said above with respect to criteria for choice of
estimators is also applicable to choice of predictors. Unfortunately, minimal MSE
predictors do not in general exist. The unbiasedness property alone does not lead
to a unique predictor and insisting on unbiasedness may be costly in terms of
prediction MSE. In terms of (5.1), it is clear that a slightly biased predictor with a
very small prediction error variance can be better in terms of MSE than an
unbiased predictor with a large prediction error variance. Also, as with admissible estimators, there usually are many admissible predictors relative to a given loss function. Imposing the condition that a predictor be linear and unbiased in order to obtain a MVLU predictor can be costly in terms of MSE. For many prediction problems, non-linear biased Stein-like predictors have lower MSE than do MVLU predictors; see, for example, Efron and Morris (1975). Finally, it is desirable that
predictors be robust and what has been said above about robust estimators can be
adapted to apply to predictors' properties.
To illustrate a close connection between estimation and prediction, consider the standard multiple regression model $\tilde y = X\beta + \tilde u$, where $\tilde y$ is $n\times 1$, $X$ is a given non-stochastic $n\times k$ matrix of rank $k$, $\beta$ is a $k\times 1$ vector of parameters, and $\tilde u$ is an $n\times 1$ disturbance vector with $E\tilde u = 0$ and $E\tilde u\tilde u' = \sigma^2 I_n$. Let a future scalar observation $\tilde z$ be generated by $\tilde z = w'\beta + \tilde v$, where $w'$ is a $1\times k$ given47 vector of rank one and $\tilde v$ is a future disturbance term, uncorrelated with the elements of $\tilde u$, with $E\tilde v = 0$ and $E\tilde v^2 = \sigma^2$. Then a predictor of $\tilde z$, denoted by $\hat{\tilde z}$, is given by $\hat{\tilde z} = w'\hat\beta$, where $\hat\beta$ is an estimator of $\beta$, for example the LS estimator.48
47For some analysis of this problem when w is random, see Feldstein (1971).
48Note that $Ew'(\hat\beta - \beta)\tilde v = 0$ if the elements of $\hat\beta - \beta$ and $\tilde v$ are uncorrelated, as they are under the above assumptions if $\hat\beta$ is a linear estimator. On the other hand, if $\hat\beta$ is a non-linear estimator, sufficient conditions for this result to hold are that the elements of $\tilde u$ and $\tilde v$ are independently distributed and $Ew'(\hat\beta - \beta)\tilde v$ is finite.
such that the proportion is not less than a specified value with given probability,
are called tolerance intervals. See Christ (1966) and Guttman (1970) for further
discussion and examples of tolerance intervals. Finally, in many econometric
problems, exact prediction intervals and regions are not available and large
sample approximate intervals and regions are often employed.
Central in the Bayesian approach to prediction is the predictive pdf for $\tilde z$, $p(z|D)$, where $\tilde z$ is a vector of future observations and $D$ denotes the sample, $x$, and prior information. To derive the predictive pdf, let $\tilde z$ and $\tilde x$ be independent49 with pdfs $g(z|\theta)$ and $f(x|\theta)$, where $\theta \in \Theta$ is a vector of parameters with posterior pdf $h(\theta|D) = c\pi(\theta)f(x|\theta)$, where $c$ is a normalizing constant and $\pi(\theta)$ is the prior pdf. Then
$$p(z|D) = \int_\Theta g(z|\theta)\,h(\theta|D)\,d\theta \tag{5.2}$$
is the predictive pdf for $\tilde z$. Note that (5.2) involves an averaging of the conditional pdfs $g(z|\theta)$, with $h(\theta|D)$, the posterior pdf for $\theta$, serving as the weight function. Also, $p(z|D)$ incorporates both sample and prior information reflected in $h(\theta|D)$. For examples of explicit predictive pdfs for regression and other models, see Aitchison and Dunsmore (1975), Box and Tiao (1973), Guttman (1970), and Zellner (1971).
If $z$ is partitioned as $z' = (z_1', z_2')$, the marginal predictive pdf for $z_1$ can be obtained from (5.2) by analytical or numerical integration. Also, a pdf for $z_2$ given $z_1$ and/or the distribution of functions of $z$ can be derived from (5.2).
If a point prediction of $\tilde z$ is desired, the mean or modal value of (5.2) might be used. If a convex prediction loss function $L(\tilde z, \hat z)$ is available, where $\hat z = \hat z(D)$ is some point prediction depending on the given sample $x$ and prior information, Bayesians choose $\hat z$ so as to minimize expected loss, that is, solve the following problem:
$$\min_{\hat z}\int L(z, \hat z)\,p(z|D)\,dz. \tag{5.3}$$
The solution, say $\hat z = \hat z^*(D)$, is the Bayesian point prediction relative to the loss function $L(\tilde z, \hat z)$. For example, if $L(\tilde z, \hat z) = (\tilde z - \hat z)'Q(\tilde z - \hat z)$, with $Q$ a given positive definite symmetric matrix, the optimal $\hat z$ is $\hat z^* = E(\tilde z|D)$, the mean of the predictive pdf in (5.2).50 For other loss functions, Bayesian minimum expected loss point predictions can be obtained [see Aitchison and Dunsmore (1975), Litterman (1980), and Varian (1975) for examples]. Prediction intervals and regions can be computed from (5.2) in the same way that posterior intervals and regions are computed for parameters, as described above. These prediction intervals and regions are dependent on the given data $D$ and hence are not viewed as random. For example, in the case of a scalar future observation, $\tilde z$, given the predictive pdf for $\tilde z$, $p(z|D)$, the probability that $b < \tilde z < a$ is just $\int_b^a p(z|D)\,dz = 1 - \alpha$. If $a$ and $b$ are given, $1-\alpha$ can be calculated. If $1-\alpha$ is given, then $a$ and $b$ are not uniquely determined; however, by requiring that $a - b$ be a minimum subject to a given $1-\alpha$, unique values of $a$ and $b$ can be obtained.
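As an illustration of (5.2) and (5.3), consider again NID$(\theta, 1)$ observations with a diffuse prior for $\theta$; the predictive pdf for a single future observation is then normal with mean $\bar x$ and variance $1 + 1/n$, and the sketch below (with simulated data assumed for illustration) computes the optimal point prediction under quadratic loss and a predictive interval:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n = 30
x = rng.normal(2.0, 1.0, size=n)       # past data, assumed NID(theta, 1)
xbar = x.mean()

# With a diffuse (uniform) prior for theta, the posterior is N(xbar, 1/n) and
# the predictive pdf (5.2) for a single future z is N(xbar, 1 + 1/n).
predictive = stats.norm(loc=xbar, scale=np.sqrt(1.0 + 1.0 / n))

point_prediction = predictive.mean()                # optimal under quadratic loss in (5.3)
interval = predictive.ppf([0.025, 0.975])           # shortest 95% interval (symmetric here)

print("point prediction:", round(point_prediction, 3))
print("95% predictive interval:", np.round(interval, 3))
```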
To this point, all results in this subsection are for given data $x$ and given prior information. The sampling properties of Bayesian procedures are of interest, particularly before $\tilde x$ is observed and also in characterizing average properties of procedures. In this regard, the solution $\hat z^*$ to the problem in (5.3) can be viewed as a Bayesian predictor, random since it is a function of $\tilde x$. For brevity, write a predictor as $\hat{\tilde z} = \hat z(\tilde x)$. Then the prediction risk function, $r(\theta)$, relative to the loss function $L(\tilde z, \hat{\tilde z})$, is
$$r(\theta) = \int_X\int_Z L[z, \hat z(x)]\,g(z|\theta)\,f(x|\theta)\,dz\,dx,$$
where the integrations are over the sample spaces of $\tilde x$ and $\tilde z$. Risk, $r(\theta)$, can be computed for alternative predictors, $\hat{\tilde z} = \hat z(\tilde x)$. The Bayesian predictor is the one, if it exists, that minimizes average risk, $\mathrm{AR} = \int_\Theta r(\theta)\pi(\theta)\,d\theta$, where $\pi(\theta)$ is a prior for $\theta$. If AR is finite, then the Bayesian predictor is admissible and also is given by the solution to the problem in (5.3).
From what has been presented, it is the case that both sampling theory and
Bayesian techniques are available for predictive inference. As with estimation, the
approaches differ in terms of justifications for procedures and in that the
Bayesian approach employs a prior distribution, whereas it is not employed in
the sampling theory approach. Further, in both approaches predictive inference
has been discussed in terms of given models for the observations. Since there is
often uncertainty about models' properties, it is important to have testing
procedures that help determine the forms of models for observations. In the next
Section general features of testing procedures are presented.
5°The proof is very similar to that presented above in connection with Bayesian parameter
estimation with a quadratic loss function.
Statistical procedures for analyzing and testing hypotheses, that is, hypothesized
probability models for observations, are important in work to obtain satisfactory
econometric models that explain past economic behavior well, predict future
behavior reliably, and are dependable for use in analyzing economic policies. In
this connection, statistical theory has yielded general procedures for analyzing
hypotheses and various justifications for them. In what follows, some basic results
in this area will be reviewed.
Relative to the general probability model, $(X, \Theta, p(x|\theta))$, hypotheses can relate to the value of $\theta$, or a subvector of $\theta$, and/or to the form of $p(x|\theta)$. For example, $\theta = 0$ or $\theta = c$, a given vector, are examples of simple hypotheses, that is, hypotheses that completely specify the parameter vector $\theta$ appearing in $p(x|\theta)$. On the other hand, some hypotheses about the value of $\theta$ do not completely specify $p(x|\theta)$. For example, $\theta \in \omega$, a subspace of $\Theta$, does not imply a particular value for $\theta$ and thus is not a simple hypothesis but rather is termed a composite hypothesis. In terms of a scalar parameter $\theta \in \Theta$, where $\Theta$ is the entire real line, $\theta = 0$ is a simple hypothesis, whereas $\theta < 0$ and $\theta > 0$ are composite hypotheses. Further, it is often the case that two or more hypotheses are considered. For example $\theta = 0$ and $\theta = 1$, two simple hypotheses, or $\theta = 0$ and $\theta > 0$, a simple hypothesis and a composite hypothesis, or $\theta > 0$ and $\theta < 0$, two composite hypotheses, may be under study. Finally, various forms for $p(x|\theta)$ may be hypothesized; for example, $p[(x - \mu_1)/\sigma_1]$ normal or $p[(x - \mu_2)/\sigma_2]$ double-exponential are two alternative hypotheses regarding the form of $p(\cdot)$ with the same parameter space $\Theta$: $-\infty < \mu_i < \infty$ and $0 < \sigma_i < \infty$, $i = 1,2$. In other cases, $p(x|\theta)$ and $g(x|\eta)$ may be two alternative hypothesized forms for the pdf for the observations involving parameter vectors $\theta$ and $\eta$ and their associated parameter spaces. Finally, if the probability model is expanded to include a prior pdf for $\theta$, denoted by $p(\theta)$, different $p(\theta)$'s can be viewed as hypotheses. For example, for a scalar $\theta$, $p_1(\theta)$ in the form of a normal pdf with given mean $\bar\theta_1$ and given variance $\sigma_1^2$, $\theta \sim N(\bar\theta_1, \sigma_1^2)$, and $\theta \sim N(\bar\theta_2, \sigma_2^2)$, with $\bar\theta_2$ and $\sigma_2^2$ given, can be viewed as alternative hypotheses.
Whatever the hypothesis or hypotheses, statistical testing theory provides
procedures for deciding whether sample observations are consistent or incon-
sistent with a hypothesis or set of hypotheses. Just as with estimation and
prediction procedures, it is desirable that testing procedures have reasonable
justifications and work well in practice. It is to these issues that we now turn.
6.2. Sampling theory testing procedures
Example 6.1

Let x̃ᵢ, i = 1, 2, …, n, be NID(θ, 1) with Θ: −∞ < θ < ∞, and consider the null
hypothesis, H₀: θ = θ₀, and the alternative hypothesis, H₁: θ ≠ θ₀. Here ω ⊂ Θ is
θ = θ₀ and Θ − ω is θ ≠ θ₀. Suppose that we consider the random sample mean
x̄ = Σⁿᵢ₌₁ x̃ᵢ/n. A "region of acceptance" might be |x̄ − θ₀| ≤ c and a "region of
rejection", or critical region, |x̄ − θ₀| > c, where c is a given constant. Thus, given
the value of c, the sample space is partitioned into two regions.
Two major questions raised by Example 6.1 are: Why use x̄ in constructing the
regions, and on what grounds can the value of c be selected? In regard to these
questions, Neyman–Pearson (NP) theory recognizes two types of errors that can be made in testing
θ ∈ ω and θ ∈ Θ − ω. An error of type I, or of the first kind, is rejecting⁵² θ ∈ ω
when it is true, while an error of type II, or of the second kind, is accepting θ ∈ ω
when it is false. The operating characteristics of a NP test are functions that
describe the probabilities of type I and type II errors. Let t̃ = t(x̃) be a test statistic,
for example x̄ in Example 6.1, and let R be the region of rejection, a subset of the
sample space. Then α(θ) = P(t̃ ∈ R|θ ∈ ω) is the probability of a type I error
expressed as a function of θ, which specializes to α(θ) = P(|x̄ − θ₀| ≥ c|θ = θ₀) in
terms of Example 6.1. The probability of a type II error is given by β(θ) = P(t̃ ∈
R̄|θ ∈ Θ − ω) = 1 − P(t̃ ∈ R|θ ∈ Θ − ω), where R̄ is the region of acceptance, the
complement of R. In terms of Example 6.1, β(θ) = P(|x̄ − θ₀| ≤ c|θ ≠ θ₀) =
1 − P(|x̄ − θ₀| > c|θ ≠ θ₀). The function 1 − β(θ), the probability of rejecting θ ∈ ω
when θ ∈ Θ − ω is true, is called the power function of the test.
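A small numerical sketch of these operating characteristics for Example 6.1 follows; the sample size n = 25 and the choice of c that makes α(θ₀) = 0.05 are illustrative assumptions.

```python
# Operating characteristics of the two-sided test |xbar - theta0| > c in Example 6.1.
import numpy as np
from scipy import stats

n, theta0 = 25, 0.0
c = stats.norm.ppf(0.975) / np.sqrt(n)        # c chosen so that alpha(theta0) = 0.05

def power(theta):
    """1 - beta(theta) = P(|xbar - theta0| > c | theta), with xbar ~ N(theta, 1/n)."""
    sd = 1.0 / np.sqrt(n)
    return (stats.norm.cdf((theta0 - c - theta) / sd)
            + stats.norm.sf((theta0 + c - theta) / sd))

print(power(theta0))                          # type I error probability, about 0.05
for theta in (0.1, 0.3, 0.5):
    print(theta, power(theta))                # power 1 - beta(theta) rises away from theta0
```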
A test with minimal probabilities of type I and type II errors, that is, minimal
α(θ) and β(θ), would be ideal in the NP framework. Unfortunately, such tests do
not exist. What NP do to meet this problem is to look for tests with minimal
value for β(θ), the probability of type II error, subject to the condition that for all
θ ∈ ω, α(θ) ≤ α, a given value, usually small, say 0.05, the "significance level of
the test".⁵³ By minimizing β(θ), of course, the power of the test, 1 − β(θ), is
maximized. A test meeting these requirements is called a uniformly most powerful
(UMP) test.
Unfortunately, except in special cases, uniformly most powerful tests do not
exist. In the case of two simple hypotheses, that is, Θ = (θ₁, θ₂) with ω: θ = θ₁ and
Θ − ω: θ = θ₂, and data pdf p(x|θ), the famous NP lemma⁵⁴ indicates that a test
based on the rejection region t(x) = p(x|θ₁)/p(x|θ₂) ≥ k_α, where k_α satisfies
P[t(x̃) ≥ k_α|θ = θ₁] = α, with α given, is a UMP test. This is of great interest since
in this case t(x) is the likelihood ratio and thus the NP lemma provides a
justification for use of the likelihood ratio in appraising two simple hypotheses.
When composite hypotheses are considered, say θ ≠ θ₀, it is usually the case that
UMP tests do not exist. One important exception to this statement is, in terms of
Example 6.1, testing θ = θ₀ against θ > θ₀. Then with √n(x̄ − θ₀) as the test statistic
and using √n(x̄ − θ₀) > k_α as the region of rejection, where k_α is determined so
that P(√n(x̄ − θ₀) > k_α|θ = θ₀) = α, for given α, this test can be shown to be
UMP.⁵⁵ Similarly, a UMP test of θ = θ₀ against θ < θ₀ exists for this problem.
However, a UMP test of θ = θ₀ against θ ≠ θ₀ does not exist. That is, using
√n|x̄ − θ₀| ≥ k_α as a region of rejection, with k_α such that P(√n|x̄ − θ₀| > k_α) = α,
given α, does not yield a UMP test.
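The following sketch illustrates why no single test can be UMP against the two-sided alternative: at the same significance level, the one-sided test is more powerful for every θ > θ₀ (and correspondingly less powerful for θ < θ₀). The values of n and α are illustrative.

```python
# Power of the one-sided test (UMP against theta > theta0) versus the two-sided test.
import numpy as np
from scipy import stats

n, alpha, theta0 = 25, 0.05, 0.0
k_one = stats.norm.ppf(1 - alpha)        # reject if sqrt(n)(xbar - theta0) > k_one
k_two = stats.norm.ppf(1 - alpha / 2)    # reject if sqrt(n)|xbar - theta0| > k_two

def power_one(theta):
    return stats.norm.sf(k_one - np.sqrt(n) * (theta - theta0))

def power_two(theta):
    z = np.sqrt(n) * (theta - theta0)
    return stats.norm.sf(k_two - z) + stats.norm.cdf(-k_two - z)

for theta in (0.1, 0.3, 0.5):
    print(theta, power_one(theta), power_two(theta))
# For each theta > theta0 the one-sided test has the higher power, which is why
# no single test is UMP against the two-sided alternative theta != theta0.
```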
Given that UMP tests are not usually available for many testing problems, two
further conditions have been utilized to narrow the range of candidate tests. First,
only unbiased tests are considered. A test is an unbiased α-level test if its
operating characteristics satisfy α(θ) ≤ α for all θ ∈ ω and 1 − β(θ) ≥ α for all
θ ∈ Θ − ω. This requirement seems reasonable since it implies 1 − α ≥ β(θ), that
is, that the probability, 1 − α, of accepting θ ∈ ω when it is true is greater than or
equal to the probability β(θ), θ ∈ Θ − ω, of accepting it when it is false. Many
tests of a null hypothesis, θ = θ₁, with θ₁ given, against composite hypotheses
θ ≠ θ₁ are UMP unbiased tests. In terms of Example 6.1, the test statistic √n|x̄ −
θ₀| with rejection region √n|x̄ − θ₀| ≥ k_α is a UMP unbiased test of θ = θ₀
against θ ≠ θ₀. See Lehmann (1959, chs. 4-5) for many examples of UMP
unbiased tests.
It is also interesting to note that √n|x̄ − θ₀| < k_α can be written as x̄ − k_α/√n
< θ₀ < x̄ + k_α/√n and that, given θ = θ₀, the probability that x̄ ± k_α/√n covers
θ₀ is 1 − α. Thus, there is a close mathematical relationship between test statistics
and confidence intervals, discussed above, and in many cases optimal tests
produce optimal intervals (in a shortness sense). However, there is a fundamental
difference in that in testing θ = θ₀, θ₀ is given a specific value, often θ₀ = 0, which
is of special interest. On the other hand, with a confidence interval or interval
estimation problem the value of θ₀ is not specified; that is, θ₀ is the true unknown
value of θ. Thus, if x̄ ± k_α/√n, with α = 0.05, assumes the value 0.32 ± 0.40, this is
a 95 percent confidence interval for θ₀ that extends from −0.08 to 0.72. That the
interval includes the value 0 does not necessarily imply that θ₀ = 0. It may be that
θ₀ ≠ 0 and the precision of estimation is low. In terms of testing θ = θ₀ = 0, the
result 0.32 ± 0.40 implies that the test statistic assumes a value in the region of
acceptance, √n|x̄| < k_α, and would lead to the conclusion that the data are
consistent with θ = θ₀ = 0. In NP theory, however, this is an incomplete reporting
of results. The power of the test must be considered. For example, if θ = ±0.20
represent important departures from θ = 0, the probabilities of rejecting θ = 0
when θ = ±0.20, that is, 1 − β(0.2) and 1 − β(−0.2), should be reported. Under
the above conditions, these probabilities are quite low and thus the test is not very
powerful relative to important departures from θ = 0. More data are needed to
get a more powerful test and more precise estimates.
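The figures quoted above can be checked numerically: with α = 0.05, a half-width of 0.40 implies √n ≈ 1.96/0.40, so n is roughly 24, and the power against θ = ±0.20 is then on the order of 0.17, which is indeed quite low. A short sketch of the calculation:

```python
# Numerical check of the 0.32 +/- 0.40 example and the power at theta = +/- 0.20.
from scipy import stats

k = stats.norm.ppf(0.975)          # 1.96 for alpha = 0.05
root_n = k / 0.40                  # implied sqrt(n), about 4.9 (n about 24)
z = root_n * 0.20                  # shift of the standardized statistic when theta = 0.20
power = stats.norm.sf(k - z) + stats.norm.cdf(-k - z)
print(round(power, 3))             # roughly 0.17: little power against +/- 0.20
```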
The above discussion reveals an important dependence of a test's power on the
sample size. Generally, for given α, the power increases with n. Thus, to
"balance" the probabilities of errors of types I and II as n increases requires some
adjustment of α. See DeGroot (1970) for a discussion of this problem.⁵⁶

A second way of delimiting candidate tests is to require that tests be invariant,
that is, invariant to a certain group of transformations. See Lehmann (1959, chs.
6-7) for discussion of UMP invariant tests, including the standard t and F tests
employed to test hypotheses about regression coefficients, which are UMP invariant
tests under particular groups of linear transformations. They are also UMP
unbiased tests. In a remarkable theorem, Lehmann (1959, p. 229) shows that if there
exists a unique UMP unbiased test for a given testing problem and there also
exists a UMP almost invariant⁵⁷ test with respect to some group of transforma-
tions G, then the latter is also unique and the two tests coincide almost
everywhere.
Example 6.2

In terms of the normal regression model of Example 4.2, to test H₀: βᵢ = βᵢ₀, a
given value, against H₁: βᵢ ≠ βᵢ₀ with all other regression coefficients and σ
unrestricted, the test statistic t = (β̂ᵢ − βᵢ₀)/s_β̂ᵢ has a univariate Student-t pdf with
ν d.f. Then |t| ≥ k_α, where k_α is such that Pr(|t| ≥ k_α|βᵢ = βᵢ₀) = α, with α, the
significance level, given, is a rejection region. Such a test is a UMP unbiased and
invariant (with respect to a group of linear transformations) α-level test. In a
similar fashion, from Example 4.4, the statistic F = (β̂ − β₀)′X′X(β̂ − β₀)/ks²,
which has an F pdf with k and ν d.f. under H₀: β = β₀, can be used to test H₀
against H₁: β ≠ β₀ with σ² unrestricted under both hypotheses. The rejection
region is F ≥ k_α, with k_α such that P(F ≥ k_α) = α, a given value. This test is a
UMP unbiased and invariant α-level test.
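A sketch of the t and F calculations of Example 6.2 for a normal regression, using simulated data; the design matrix, coefficient values, and seed are illustrative assumptions with no connection to any data set discussed in the chapter.

```python
# t test of a single coefficient and F test of the full coefficient vector
# in the normal regression model y = X beta + u.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
nu = n - k                                   # degrees of freedom
s2 = resid @ resid / nu
cov = s2 * np.linalg.inv(X.T @ X)

# t test of H0: beta_3 = 0 against beta_3 != 0
t = beta_hat[2] / np.sqrt(cov[2, 2])
p_t = 2 * stats.t.sf(abs(t), df=nu)

# F test of H0: beta = beta_0 (here beta_0 = beta_true) against beta != beta_0
d = beta_hat - beta_true
F = d @ (X.T @ X) @ d / (k * s2)
p_F = stats.f.sf(F, dfn=k, dfd=nu)
print(t, p_t, F, p_F)
```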
In many testing problems, say those involving hypotheses about the values of
parameters of time series models or of simultaneous equations models, exact tests
are generally not available. In these circumstances, approximate large sample
tests, for example approximate likelihood ratio (LR), Wald (W), and Lagrange
Multiplier (LM) tests, are employed. For example, let θ′ = (θ₁′, θ₂′) be a parameter
vector appearing in a model with likelihood function p(x|θ), and let the null
hypothesis be H₀: θ₁ = θ₁₀ with θ₂ unrestricted and H₁: θ₁ and θ₂ both unrestricted.
Then λ(x), the approximate LR, is defined to be

    λ(x) = p(x|θ₁₀, θ̃₂)/p(x|θ̂₁, θ̂₂),    (6.1)

where θ̃₂ is the value of θ₂ that maximizes p(x|θ₁₀, θ₂), the restricted likelihood
function (LF), while (θ̂₁, θ̂₂) is the value of (θ₁, θ₂) that maximizes p(x|θ₁, θ₂), the
unrestricted LF. Since the numerator of (6.1) is less than or equal to the
denominator, given that the numerator is the result of a restricted maximization,
0 < λ(x) ≤ 1. The larger λ(x), the "more likely" that the restriction θ₁ = θ₁₀ is
consistent with the data, using a relative maximized LF criterion for the meaning
of "more likely". In large samples, under regularity conditions and H₀,
−2 log λ(x̃) = χ̃² has an approximate χ² pdf with q d.f., where q is the number of
restrictions implied by H₀, here equal to the number of elements of θ₁. Then a
rejection region is χ̃² ≥ k_α, where k_α is such that P(χ̃² ≥ k_α|H₀) ≐ α, the given
significance level.⁵⁸ Many hypotheses can be tested approximately in this ap-
proach, given that the regularity conditions needed for −2 log λ(x̃) to be approxi-
mately distributed as χ²_q in large samples under H₀ are satisfied.
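The following sketch computes the LR statistic of (6.1) and its approximate χ² p-value for a simple illustrative problem, a normal model with θ′ = (μ, σ) and H₀: μ = 0 with σ unrestricted; the data are simulated and the problem is chosen only for concreteness.

```python
# Approximate likelihood ratio test, q = 1 restriction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.3, 1.0, size=100)
mu0 = 0.0

def loglik(mu, sigma):
    return np.sum(stats.norm.logpdf(x, mu, sigma))

ll_unr = loglik(x.mean(), x.std())                     # unrestricted ML: (xbar, sigma_hat)
ll_res = loglik(mu0, np.sqrt(np.mean((x - mu0) ** 2))) # restricted ML of sigma given mu = mu0

lr_stat = -2 * (ll_res - ll_unr)                       # -2 log lambda(x)
p_value = stats.chi2.sf(lr_stat, df=1)
print(lr_stat, p_value)                                # reject H0 at level alpha if p_value < alpha
```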
In the Wald large sample approach to the test of, for example, H₀: θ = θ₀ against
H₁: θ ≠ θ₀, let θ̂ be a ML estimator for θ that, under H₀, is known to be
approximately normally distributed in large samples with asymptotic mean θ₀ and
an asymptotic covariance matrix that can be estimated consistently from the
sample; the Wald statistic, a quadratic form in θ̂ − θ₀ weighted by the inverse of
the estimated asymptotic covariance matrix, then has an approximate χ²_q pdf in
large samples under H₀. Similarly, the LM statistic, a quadratic form in the vector
of partial derivatives of the log likelihood function,
where the partial derivatives are evaluated at θ̂ᵣ, the restricted ML estimate, and
the weight matrix is the inverse of I_θ̂ᵣ, the information matrix evaluated at θ̂ᵣ, has an approximate χ²_q pdf in large
samples under H₀ and regularity conditions, and these facts can be employed to
construct approximate α-level tests of H₀. The LM test requires just the
computation of the restricted ML estimate, θ̂ᵣ, and is thus occasionally much less
computationally burdensome than the LR and W tests that require the unre-
stricted ML estimate. On the other hand, it seems important in applications to
view and study both the unrestricted and restricted estimates.
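For the same illustrative normal-mean problem used in the LR sketch above, the Wald and LM statistics take a particularly simple form, since the information for μ is n/σ²; the numbers are again purely illustrative.

```python
# Wald and LM statistics for H0: mu = mu0 in a normal model, sigma unrestricted (q = 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.3, 1.0, size=100)
mu0, n = 0.0, x.size
xbar = x.mean()

sig2_unr = np.mean((x - xbar) ** 2)     # unrestricted ML estimate of sigma^2
sig2_res = np.mean((x - mu0) ** 2)      # restricted ML estimate of sigma^2 under H0

W = n * (xbar - mu0) ** 2 / sig2_unr    # quadratic form in (theta_hat - theta_0)
score = n * (xbar - mu0) / sig2_res     # d log L / d mu at the restricted estimate
LM = score ** 2 * sig2_res / n          # score * I(theta_r)^{-1} * score

for name, stat in (("Wald", W), ("LM", LM)):
    print(name, stat, stats.chi2.sf(stat, df=1))
# For this problem the ordering W >= LR >= LM holds in finite samples, so the
# three tests can give conflicting verdicts at a common critical value, in line
# with the conflicts noted by Berndt and Savin (1975).
```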
Finally, it is the case that for a given pair of hypotheses, the LR, W, and LM
test statistics have the same large sample χ² pdf, so that in large samples there are
no grounds for preferring any one. In small samples, however, their properties are
somewhat different and in fact use of large sample test results based on them can
give conflicting results [see, for example, Berndt and Savin (1975)]. Fortunately,
research is in progress on this problem. Some approximations to the finite sample
distributions of these large sample test statistics have been obtained that appear
useful; see, for example, Box (1949) and Lawley (1956). Also, Edgeworth expan-
sion techniques to approximate distributions of various test statistics are currently
being investigated by several researchers.
6.3. Bayesian analysis of hypotheses

Bayesian procedures are available for analyzing various types of hypotheses; they
yield posterior probabilities and posterior odds ratios associated with alternative
hypotheses that incorporate both sample and prior information. Further, given
a loss structure, it is possible to choose between or among hypotheses in such a
manner as to minimize expected loss. These procedures, which are discussed in
Jeffreys (1967), DeGroot (1970), Leamer (1978), Bernardo et al. (1980), and
Zellner (1971), are briefly reviewed below.
With respect to hypotheses relating to a scalar parameter θ, −∞ < θ < ∞, of
the form H₁: θ > c and H₂: θ < c, where c has a given value, e.g. c = 0, assume
that a posterior pdf for θ, p(θ|D), is available, where D denotes the sample and
prior information. Then, in what has been called the Laplacian Approach, the
posterior probabilities relating to H₁ and to H₂ are given by Pr(θ > c|D) =
∫_c^∞ p(θ|D) dθ and Pr(θ < c|D) = ∫_{−∞}^c p(θ|D) dθ, respectively. The posterior odds
ratio for H₁ and H₂, denoted by K₁₂, is then K₁₂ = Pr(θ > c|D)/Pr(θ < c|D).
Other hypotheses, e.g. |θ| < 1 and |θ| > 1, can be appraised in a similar fashion.
That is, Pr(|θ| < 1|D) = ∫_{−1}^{1} p(θ|D) dθ is the posterior probability that |θ| < 1
and 1 − Pr(|θ| < 1|D) is the posterior probability that |θ| > 1.
Example 6.3

Let yᵢ, i = 1, 2, …, n, be independent observations from a normal distribution with
mean θ and unit variance. If a diffuse prior for θ, p(θ) ∝ const., is employed, the
posterior pdf is p(θ|D) ∝ exp{−n(θ − ȳ)²/2}, where ȳ is the sample mean; that
is, z = √n(θ − ȳ) has a N(0, 1) posterior pdf. Then for the hypothesis θ > 0,
Pr(θ > 0|D) = Pr(z > −√n ȳ|D) = 1 − Φ(−√n ȳ), where Φ(·) is the cumulative
normal distribution function. Thus, Pr(θ > 0|D) can be evaluated from tables of Φ(·).
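A numerical sketch of Example 6.3 follows, with simulated data standing in for the observations; the sample size and seed are illustrative.

```python
# Posterior probability Pr(theta > 0 | D) = 1 - Phi(-sqrt(n) * ybar) under a diffuse prior.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.normal(0.2, 1.0, size=30)
n, ybar = y.size, y.mean()

prob = 1.0 - stats.norm.cdf(-np.sqrt(n) * ybar)   # posterior probability of H1: theta > 0
odds = prob / (1.0 - prob)                        # posterior odds K12 for theta > 0 vs theta < 0
print(prob, odds)
```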
When a vector of parameters, θ, with θ ∈ Θ and posterior pdf p(θ|D), is
considered, and the hypotheses are H_A: θ ∈ ω and H_B: θ ∈ Θ − ω, where ω is a
subspace of Θ, Pr(θ ∈ ω|D) = ∫_ω p(θ|D) dθ is the posterior probability associated
with H_A, while 1 − Pr(θ ∈ ω|D) is that associated with H_B, and the posterior odds
ratio is the ratio of these posterior probabilities. The above posterior probabilities
can be evaluated analytically, by the use of tabled values of integrals, or
by numerical integration. For an example of this type of analysis applied to
hypotheses about properties of a second order autoregression, see Zellner (1971,
p. 194 ff.).
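When ω is an awkward region, Pr(θ ∈ ω|D) can also be approximated by drawing from the posterior and counting the draws that fall in ω. The sketch below uses an illustrative bivariate normal posterior and, as ω, the stationarity region of a second-order autoregression; both the posterior and the region are assumptions made purely for illustration, not taken from the analysis cited above.

```python
# Simulation approximation to Pr(theta in omega | D) for a vector parameter.
import numpy as np

rng = np.random.default_rng(4)
mean = np.array([0.6, 0.2])                   # illustrative posterior mean
cov = np.array([[0.04, 0.01], [0.01, 0.04]])  # illustrative posterior covariance
draws = rng.multivariate_normal(mean, cov, size=100_000)

t1, t2 = draws[:, 0], draws[:, 1]
# stationarity region for an AR(2): theta1 + theta2 < 1, theta2 - theta1 < 1, |theta2| < 1
in_omega = (t1 + t2 < 1) & (t2 - t1 < 1) & (np.abs(t2) < 1)
p = in_omega.mean()
print(p, p / (1 - p))                         # Pr(H_A | D) and the posterior odds ratio
```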
For a very wide range of different types of hypotheses, the following Jeffreys
Approach, based on Bayes' Theorem, can be employed in analyzing alternative
hypotheses or models for observations. Let p(y, H) be the joint distribution for
the data y and an indicator variable H. Then p(y, H) = p(H)p(y|H) =
p(y)p(H|y) and p(H|y) = p(H)p(y|H)/p(y). If H can assume values H₁ and
H₂, it follows that the posterior odds ratio, K₁₂, is

    K₁₂ = p(H₁|y)/p(H₂|y) = [p(H₁)/p(H₂)]·[p(y|H₁)/p(y|H₂)],    (6.2)

where p(Hᵢ) is the prior probability assigned to Hᵢ, i = 1, 2, p(H₁)/p(H₂) is the
prior odds ratio for H₁ versus H₂, and p(y|Hᵢ) is the marginal pdf for y under
hypothesis Hᵢ, i = 1, 2. The ratio p(y|H₁)/p(y|H₂) is called the Bayes Factor
(BF). In the case that both H₁ and H₂ are simple hypotheses, the BF
p(y|H₁)/p(y|H₂) is just the Likelihood Ratio (LR).
Example 6.4

Let yᵢ = θ + εᵢ, i = 1, 2, …, n, with the εᵢ's assumed independently drawn from a
normal distribution with zero mean and unit variance. Consider two simple
hypotheses, H₁: θ = 0 and H₂: θ = 1, with prior probabilities p(H₁) = 1/2 and
p(H₂) = 1/2. Then from (6.2),

    K₁₂ = exp{−½[y′y − (y − ι)′(y − ι)]},

where y′ = (y₁, y₂, …, yₙ), ι′ = (1, 1, …, 1), and ȳ is the sample mean. In this case
K₁₂ = LR and its value is determined by the value of 2n(½ − ȳ).
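A numerical sketch of Example 6.4, with the data simulated under H₂ purely for illustration; the direct likelihood-ratio computation and the simplified expression exp{n(½ − ȳ)}, obtained from the exponent above, agree.

```python
# Posterior odds K12 for two simple hypotheses with equal prior probabilities.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = rng.normal(1.0, 1.0, size=20)            # data generated under H2: theta = 1 (illustrative)
n, ybar = y.size, y.mean()

def lik(theta):
    return np.exp(np.sum(stats.norm.logpdf(y, theta, 1.0)))

K12 = lik(0.0) / lik(1.0)                    # prior odds are one, so K12 equals the LR
print(K12, np.exp(n * (0.5 - ybar)))         # the two expressions agree
```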
In cases in which non-simple hypotheses are considered, that is, hypotheses that
do not involve assigning values to all parameters of a pdf for the data y,
p(y|θᵢ, Hᵢ), given that a prior pdf for θᵢ is available, p(θᵢ|Hᵢ), it follows that
p(y|Hᵢ) = ∫p(y|θᵢ, Hᵢ)p(θᵢ|Hᵢ)dθᵢ and in such cases (6.2) becomes

    K₁₂ = [p(H₁)/p(H₂)]·[∫p(y|θ₁, H₁)p(θ₁|H₁)dθ₁ / ∫p(y|θ₂, H₂)p(θ₂|H₂)dθ₂].    (6.3)

Thus, in this case, K₁₂ is equal to the prior odds ratio, p(H₁)/p(H₂), times a BF
that is a ratio of averaged likelihood functions. For discussion and applications of
(6.3) to a variety of problems, see, for example, Jeffreys (1967), DeGroot (1970),
Leamer (1978), and Zellner (1971).
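A sketch of (6.3) in which H₂ is non-simple: the averaged likelihood p(y|H₂) is computed by numerical integration under an illustrative N(0, τ²) prior for θ; the data, the prior, and the unit prior odds are assumptions made only for the example.

```python
# Bayes factor with a simple H1: theta = 0 against H2: theta unrestricted, prior N(0, tau^2).
import numpy as np
from scipy import stats
from scipy.integrate import quad

rng = np.random.default_rng(6)
y = rng.normal(0.0, 1.0, size=25)
tau = 1.0

def lik(theta):
    return np.exp(np.sum(stats.norm.logpdf(y, theta, 1.0)))

p_y_H1 = lik(0.0)                                                   # likelihood under H1
p_y_H2, _ = quad(lambda th: lik(th) * stats.norm.pdf(th, 0.0, tau), -10, 10)  # averaged likelihood
K12 = p_y_H1 / p_y_H2                                               # prior odds taken as one
print(K12)
```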
In (6.2) and (6.3) it is seen that a prior odds ratio gets transformed into a
posterior odds ratio. If a loss structure is available, it is possible to choose
between or among hypotheses so as to minimize expected loss. To illustrate,
consider two mutually exclusive and exhaustive hypotheses, H₁ and H₂, with
posterior odds ratio K₁₂ = p₁/(1 − p₁), where p₁ is the posterior probability for
H₁ and 1 − p₁ is the posterior probability for H₂. Suppose that the following
two-action, two-state loss structure is relevant:

                            State of world
                              H₁        H₂
    Acts   A₁: Choose H₁      0         L₁₂
           A₂: Choose H₂     L₂₁         0
The two "states of the world" are: H~ is in accord with the data or H 2 is in accord
with the data; while the two possible actions are: choose H 1 and choose H 2.
LI2 > 0 and L:~ > 0 are losses associated with incorrect actions. Then using the
posterior probabilities, p 1 and 1 - p ~, posterior expected losses associated with A 1
and A 2 are:
E(LIA2) Pl L21
E(LtA,) 1 - Pl LI2
(6.4)
LI2 "
If this ratio of expected losses is larger than one, choosing A₁ minimizes expected
loss, while if it is less than one, choosing A₂ leads to minimal expected loss. Note
from the second line of (6.4) that both K₁₂ and L₂₁/L₁₂ affect the decision. In the
very special case L₂₁/L₁₂ = 1, the symmetric loss structure, if K₁₂ > 1 choose H₁,
while if K₁₂ < 1 choose H₂. The analysis can be generalized to apply to more than
two hypotheses. Also, there are intriguing relations between the results provided
by the Bayesian approach and sampling theory approaches to testing hypotheses
that are discussed in the references cited above.
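The decision rule implied by (6.4) reduces to comparing K₁₂·L₂₁/L₁₂ with unity; the loss values in the sketch below are illustrative.

```python
# Expected-loss-minimizing choice between H1 and H2 for the two-state loss table.
def choose(K12, L12, L21):
    """Return the act that minimizes posterior expected loss, per (6.4)."""
    return "H1" if K12 * L21 / L12 > 1.0 else "H2"

print(choose(K12=3.0, L12=1.0, L21=1.0))   # symmetric losses: K12 > 1 favours H1
print(choose(K12=3.0, L12=10.0, L21=1.0))  # wrongly choosing H1 is costly, so H2 is chosen
```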
Finally, given the posterior probabilities associated with the hypotheses, it is
possible to use them not only in testing but also in estimation and prediction.
That is, if two hypotheses are H₁: θ = c and H₂: θ ≠ c, where c has a given value
and p₁ and 1 − p₁ are the posterior probabilities associated with H₁ and H₂,
respectively, then relative to quadratic loss an optimal estimate is the overall
posterior mean, p₁c + (1 − p₁)E(θ|D, H₂), a weighted average of the value
specified under H₁ and the posterior mean under H₂.
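A minimal sketch of this estimation use of posterior probabilities, with illustrative values of p₁, c, and the posterior mean under H₂:

```python
# Optimal estimate under quadratic loss as a posterior-probability-weighted average.
p1, c = 0.7, 0.0            # posterior probability of H1: theta = c, and the value c
post_mean_H2 = 0.9          # illustrative posterior mean of theta under H2: theta != c
theta_opt = p1 * c + (1 - p1) * post_mean_H2
print(theta_opt)            # 0.27: shrunk toward c in proportion to p1
```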
7. Summary and concluding remarks

Research in statistical theory has yielded very useful procedures for learning from
data, one of the principal objectives of econometrics and science. In addition, this
research has produced a large number of probability models for observations that
are widely utilized in econometrics and other sciences. Some of them were
reviewed above. Also, techniques for estimation, prediction, and testing were
reviewed that enable investigators to solve inference problems in a scientific
manner. The importance of utilizing sound, scientific methods in analyzing data
and drawing conclusions from them is obvious since such conclusions often have
crucial implications for economic policy-making and the progress of economic
science. On the other hand, it is a fact that statistical and econometric analysis
frequently is a mixture of science and art. In particular, the formulation of
appropriate theories and models is largely an art. A challenge for statistical theory
is to provide fruitful, formal procedures that are helpful in solving model
formulation problems.
While many topics were discussed in this chapter, it is necessary to point to
some that were not. These include non-parametric statistics, survey methodology,
design of experiments, time series analysis, random parameter models, statistical
control theory, sequential and simultaneous testing procedures, empirical Bayes
procedures, and fiducial and structural theories of inference. Some of these topics
are treated in other parts of this Handbook. Also, readers may refer to Kruskal
and Tanur (1978) for good discussions of these topics that provide references to
the statistical literature. The annual issues of the ASA/IMS Current Index to
Statistics are a very useful guide to the current statistical literature.
In the course of this chapter a number of controversial issues were mentioned
that deserve further thought and study. First, there is the issue of which concept
of probability is most fruitful in econometric work. This is a critical issue since
probability statements play a central role in econometric analyses.
Second, there are major controversies concerning the most appropriate ap-
proach to statistical inference to employ in econometrics. The two major ap-
proaches to statistical inference discussed in this chapter are the sampling theory
approach and the Bayesian approach. Examples illustrating both approaches
were presented. For further discussion of the issues involved see, for example,
Barnett (1973), Bernardo et al. (1980), Cox and Hinkley (1974), Lindley (1971),
Rothenberg (1975), Zellner (1975), and the references in these works.
Third, with respect to both sampling theory and Bayesian approaches, while
there are many problems for which both approaches yield similar solutions, there
are some problems for which solutions differ markedly. Further attention to such
problems, some of which are discussed in Bernardo et al. (1980), Cox and Hinkley
(1974), Jaynes (1980), Lindley (1971), and the references cited in these works,
would be worthwhile.
Fourth, there is controversy regarding the implications of the likelihood princi-
ple for econometric and statistical practice. Briefly, the likelihood principle states
that if x and y are two data sets such that p(x|θ) = cf(y|θ), with θ ∈ Θ and c not
depending on θ, then inferences and decisions based on x and on y should be
identical. The Bayesian approach satisfies this condition since, for a given prior
pdf π(θ), the posterior pdfs for θ based on p(x|θ) and on cf(y|θ) are identical
given p(x|θ) = cf(y|θ). On the other hand, sampling theory properties and
procedures that involve integrations over the sample space, as in the case of
unbiasedness, MVU estimation, confidence intervals, and tests of significance,
violate the likelihood principle. Discussions of this range of issues are provided in
Cox and Hinkley (1974, ch. 2) and Lindley (1971, p. 10 ff.) with references to
important work by Birnbaum, Barnard, Durbin, Savage, and others.
Fifth, the importance of Bayesian logical consistency and coherence is em-
phasized by most Bayesians but is disputed by some who argue that these
concepts fail to capture all aspects of the art of data analysis. Essentially, what is
being criticized here is the Bayesian learning model and/or the precept, "act so as
to maximize expected utility (or equivalently minimize expected loss)". If im-
provements can be made to Bayesian and other learning and decision procedures,
they would constitute major research contributions.
Sixth, some object to the introduction of prior distributions in statistical
analyses and point to the difficulty in formulating prior distributions in multi-
parameter problems. Bayesians point to the fact that non-Bayesians utilize prior
information informally in assessing the "reasonableness" of estimation results,
choosing significance levels, etc. and assert that formal, careful use of prior
information provides more satisfactory results in estimation, prediction, and
testing.
Seventh, frequentists assert that statistical procedures are to be assessed in
terms of their behavior in hypothetical repetitions under the same conditions.
Others dispute this assertion by stating that statistical procedures must be
justified in terms of the actually observed data and not in terms of hypothetical,
fictitious repetitions. This range of issues is very relevant for analyses of non-
experimental data, for example macro-economic data.
The above controversial points are just some of the issues that arise in judging
alternative approaches to inference in econometrics and statistics. Furthermore,
Good has suggested in a 1980 address at the University of Chicago and in Good
and Crook (1974) some elements of a Bayes/non-Bayes synthesis that he expects
to see emerge in the future. In a somewhat different suggested synthesis, Box
(1980) proposes Bayesian estimation procedures for parameters of given models
and a form of sampling theory testing procedures for assessing the adequacy of
models. While these proposals for syntheses of different approaches to statistical
inference are still being debated, they do point toward possible major innovations
in statistical theory and practice that will probably be of great value in economet-
ric analyses.
References
Aitchison, J. and I. R. Dunsmore (1975) Statistical Prediction Analysis. Cambridge: Cambridge
University Press.
Anderson, T. W. (1971) The Statistical Analysis of Time Series. New York: John Wiley & Sons, Inc.
Lindley, D. V. (1971) Bayesian Statistics, A Review. Philadelphia: Society for Industrial and Applied
Mathematics.
Litterman, R. (1980) "A Bayesian Procedure for Forecasting with Vector Autoregressions", manuscript.
Department of Economics, MIT; to appear in Journal of Econometrics.
Loève, M. (1963) Probability Theory (3rd edn.). Princeton: D. Van Nostrand Co., Inc.
Luce, R. D. and H. Raiffa (1957) Games and Decisions. New York: John Wiley & Sons, Inc.
Mehta, J. S. and P. A. V. B. Swamy (1976) "Further Evidence on the Relative Efficiencies of Zellner's
Seemingly Unrelated Regressions Estimators", Journal of the American Statistical Association, 71,
634-639.
Neyman, J. and E. L. Scott (1948) "Consistent Estimates Based on Partially Consistent Observations",
Econometrica, 16, 1-32.
Pfanzagl, J. and W. Wefelmeyer (1978) "A Third Order Optimum Property of the Maximum
Likelihood Estimator", Journal of Multivariate Analysis, 8, 1-29.
Phillips, P. C. B. (1977a) "A General Theorem in the Theory of Asymptotic Expansions for
Approximations to the Finite Sample Distribution of Econometric Estimators", Econometrica, 45,
1517-1534.
Phillips, P. C. B. (1977b) "An Approximation to the Finite Sample Distribution of Zellner's Seemingly
Unrelated Regression Estimator", Journal of Econometrics, 6, 147-164.
Pitman, E. J. G. (1936) "Sufficient Statistics and Intrinsic Accuracy", Proceedings of the Cambridge
Philosophical Society, 32, 567-579.
Pratt, J. W. (1961) "Length of Confidence Intervals", Journal of the American Statistical Association,
56, 549-567.
Pratt, J. W. (1965) "Bayesian Interpretation of Standard Inference Statements", Journal of the Royal
Statistical Society B, 27, 169-203.
Pratt, J. W., H. Raiffa and R. Schlaifer (1964) "The Foundations of Decision Under Uncertainty: An
Elementary Exposition", Journal of the American Statistical Association, 59, 353-375.
Raiffa, H. and R. Schlaifer (1961) Applied Statistical Decision Theory. Boston: Graduate School of
Business Administration, Harvard University.
Ramsey, F. P. (1931) The Foundations of Mathematics and Other Essays. London: Kegan, Paul,
Trench, Trübner & Co., Ltd.
Rao, C. R. (1945) "Information and Accuracy Attainable in Estimation of Statistical Parameters",
Bulletin of the Calcutta Mathematical Society, 37, 81-91.
Rao, C. R. (1973) Linear Statistical Inference and Its Applications. New York: John Wiley & Sons, Inc.
Rao, P. and Z. Griliches (1969) "Small-Sample Properties of Two-Stage Regression Methods in the
Context of Auto-correlated Errors", Journal of the American Statistical Association, 64, 253-272.
Rényi, A. (1970) Foundations of Probability. San Francisco: Holden-Day, Inc.
Revankar, N. S. (1974) "Some Finite Sample Results in the Context of Two Seemingly Unrelated
Regression Equations", Journal of the American Statistical Association, 69, 187-190.
Rothenberg, T. J. (1975) "The Bayesian Approach and Alternatives", in: S. E. Fienberg and A.
Zellner (eds.), Studies in Bayesian Econometrics and Statistics. Amsterdam: North-Holland Publish-
ing Co., pp. 55-67.
Savage, L. J. (1954) The Foundations of Statistics. New York: John Wiley & Sons, Inc.
Savage, L. J. (1961) "The Subjective Basis of Statistical Practice", manuscript. University of Michigan,
Ann Arbor.
Savage, L. J., et al. (1962) The Foundations of Statistical Inference. London: Meuthen.
Savage, L. J., W. Edwards and H. Lindman (1963) "Bayesian Statistical Inference for Psychological
Research", Psychological Review, 70, 193-242.
Sclove, S. L. (1968) "Improved Estimators for Coefficients in Linear Regression", Journal of the
American Statistical Association, 63, 596-606.
Silvey, S. D. (1970) Statistical Inference. Baltimore, Md.: Penguin Books.
Srivastava, V. K. and T. D. Dwivedi (1979) "Estimation of Seemingly Unrelated Regression
Equations: A Brief Survey", Journal of Econometrics, 10, 15-32.
Stein, C. (1956) "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal
Distribution", in: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and
Probability, vol. I. Berkeley: University of California Press, pp. 197-206.
Stein, C. (1960) "Multiple Regression", in: L Olkin (ed.), Contributions to Probability and Statistics:
Essays in Honour of Harold Hotelling. Stanford: Stanford University Press.