
Bayes Empirical Bayes Estimation for Natural Exponential Families with Quadratic Variance Functions
Author(s): G. G. Walter and G. G. Hamedani
Source: The Annals of Statistics, Vol. 19, No. 3 (Sep., 1991), pp. 1191-1224
Published by: Institute of Mathematical Statistics
Stable URL: https://www.jstor.org/stable/2241946
Accessed: 22-01-2019 12:43 UTC

JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide
range of content in a trusted digital archive. We use information technology and tools to increase productivity and
facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected].

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at
https://about.jstor.org/terms

Institute of Mathematical Statistics is collaborating with JSTOR to digitize, preserve and extend access to The Annals of Statistics

This content downloaded from 118.185.164.5 on Tue, 22 Jan 2019 12:43:00 UTC
All use subject to https://about.jstor.org/terms
The Annals of Statistics
1991, Vol. 19, No. 3, 1191-1224

BAYES EMPIRICAL BAYES ESTIMATION FOR NATURAL
EXPONENTIAL FAMILIES WITH QUADRATIC
VARIANCE FUNCTIONS

BY G. G. WALTER AND G. G. HAMEDANI

University of Wisconsin-Milwaukee and Marquette University


Certain orthogonal polynomials are employed to estimate the prior
distribution of the parameter of natural exponential families with quadratic
variance functions in an approach which combines Bayesian and nonpara-
metric empirical Bayesian methods. These estimates are based on samples
from the marginal distribution rather than the conditional distribution.

1. Introduction. The univariate natural exponential families (NEF) with
quadratic variance functions (QVF) include many of the most widely used
distributions (normal, Poisson, gamma, binomial, and negative binomial; in-
deed these are five of the six basic NEF-QVF distributions). These were
studied by Morris (1982, 1983), who presented many of their properties in a
unified way. Among other things, Morris constructed an associated family of
orthogonal polynomials which in each particular case reduced to a family of
standard classical orthogonal polynomials. These polynomials were then used
to find estimators for arbitrary analytic functions.
The conjugate families needed for the prior distribution in Bayesian analysis
were also studied by Morris (1983). These are not themselves NEF-QVF
distributions, but belong to a Pearson family and have a simple form which
can be exploited to obtain formulas for Bayes and parametric empirical Bayes
estimation.
In previous papers, Walter and Hamedani (1987, 1989) have exploited
certain classical orthogonal polynomials to obtain estimates for a prior distri-
bution in an approach which combined Bayesian and nonparametric empirical
Bayesian methods. These estimates are based on samples from the marginal
distribution.
In this work we shall show that the Bayes empirical Bayes procedure works in
this general setting of NEF-QVF. However, the orthogonal polynomials must
be related to the prior distribution rather than the conditional distribution,
and therefore must be defined differently than those of Morris (1982).
We shall suppose that an initial prior distribution, based on subjective
knowledge, has been selected from a member of the conjugate family. This is
the best one can do [Morris (1983), Theorem 5.5] if only the first two moments
of the prior distribution are known. We then use our sample to improve the

Received September 1988; revised July 1990.


AMS 1980 subject classifications. Primary 60E05, 62F10, 62F15.
Key words and phrases. Exponential families, natural exponential families, quadratic variance
function, normal distribution, Poisson distribution, gamma distribution, binomial distribution,
negative binomial distribution, hyperbolic secant distribution, orthogonal polynomials, moments.


estimate, that is, by obtaining a better approximation to the true prior distribution as the partial sum of a series of these orthogonal polynomials.
The orthogonal polynomials defined here are exactly the "classical" orthog-
onal polynomials considered by Tricomi (1955). In each of the six NEF-QVF
distributions, the polynomials are identified as particular types of classical
orthogonal polynomials. In some cases, however, only a finite number of them
can be used, since the conjugate prior distributions may not have moments of
all orders.
In Section 2 we shall review, for subsequent use, some of the properties of
NEF-QVF distributions given in Morris (1982). In Section 3 we define our
family of orthogonal polynomials and show their relation to those defined by
Morris (1982). Some basic properties are also discussed, including the differ-
ential equation and recurrence formulas satisfied by the polynomials. More
detailed properties are relegated to Appendix A. Section 4 introduces a
biorthogonal system related to the polynomials which is used to recover the
prior distribution from the marginal distribution. This is applied to the
empirical Bayes estimation problems in Section 5. Appendix B contains
the results for each of the six basic NEF-QVF distributions. These results are
summarized in Table 1.
In the standard Bayesian approach it is assumed that the parameter, say θ,
is fixed but not precisely known. The prior probability law g(θ) has a different
character than the probability law f(x|θ) of the random variable X. It is
assumed to be a subjective measure of the investigator's prior knowledge of θ.
The observations are of the function f(x|θ), and a sample X = (X₁, X₂, ..., X_N)
therefore has the probability law

f(x|θ) = ∏_{i=1}^N f(x_i|θ),

and the marginal distribution of X is determined by

f(x) = ∫ ∏_{i=1}^N f(x_i|θ) g(θ) dθ.

The nonparametric empirical Bayes procedure referred to earlier is due
principally to Robbins (1956). It assumes that the parameter θ is a bona fide
random variable. A sample consists of independent pairs

(X₁, θ₁), (X₂, θ₂), ..., (X_N, θ_N)

with the joint probability law

∏_{i=1}^N f(x_i|θ_i) g(θ_i).

The X₁, X₂, ..., X_N are observable, but the θ₁, θ₂, ..., θ_N are not. The


marginal distribution of X = (X₁, X₂, ..., X_N) is

f(x) = ∫ ⋯ ∫ ∏_{i=1}^N f(x_i|θ_i) g(θ_i) dθ₁ dθ₂ ⋯ dθ_N

= ∏_{i=1}^N ∫ f(x_i|θ) g(θ) dθ.

The assumption here is that (X₁, θ₁), (X₂, θ₂), ..., (X_N, θ_N) is an independent
sample from the distribution with density function f(x|θ)g(θ). The
conditional probability law of X given θ, namely f(x|θ), is assumed known;
g(·) is assumed unknown but a smooth density.
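The data-generating mechanism above can be simulated directly. A small sketch (ours, not the paper's, for the Poisson case with a gamma prior standing in for the smooth g; the sampler, seed, and constants are illustrative):

```python
import random, math

# Robbins setup, Poisson case: draw theta_i from a smooth prior g
# (here a gamma density), then X_i ~ f(.|theta_i).
# Only the X_i would be observed; the theta_i are latent.
random.seed(0)

def poisson_draw(lam):
    # inverse-CDF Poisson sampler (adequate for small lam)
    u, p, k = random.random(), math.exp(-lam), 0
    c = p
    while u > c:
        k += 1
        p *= lam / k
        c += p
    return k

def sample_pairs(N, shape=6.0, rate=2.0):
    pairs = []
    for _ in range(N):
        theta = random.gammavariate(shape, 1.0 / rate)  # theta_i ~ g
        x = poisson_draw(theta)                         # X_i | theta_i
        pairs.append((x, theta))
    return pairs

pairs = sample_pairs(500)
xs = [x for x, _ in pairs]        # observable
thetas = [t for _, t in pairs]    # latent, unobserved
```

With this prior the marginal mean of X is E(θ) = 3, which the simulated sample reflects.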
Most approaches to the problem of estimating g(O) have been indirect in
that estimators are obtained not for g(O) itself, but for the moments of g(O);
see Maritz (1970). These approaches, while simple, often suffer from excess
"jumpiness" [as was observed by Berger (1985)] and should be smoothed. The
direct methods in which g(O) itself is estimated have usually been based on
step functions [see, e.g., Deely and Kruse (1968)] or Dirichlet processes [see,
e.g., Berry and Christensen (1979)] or maximum likelihood [see, e.g., Laird
(1978) or Leonard (1984)]. Laird pointed out that her method is equivalent to
the simultaneous estimation of several exchangeable parameters and leads to
an estimator with finite support. Since we shall assume that g is a smooth
density, such estimators suffer from the same difficulty as the empirical
distribution, viz. they are not mean-squared consistent.
At the other extreme lie the parametric methods in which g depends only
on a finite number of parameters. The simplest method depends on the
assumption that g belongs to a class of conjugate densities, for example, the
assumption that g is a beta density in the binomial case.
The method presented here is intermediate between the two, and may
involve a finite or infinite number of parameters. It is similar to that in Walter
and Hamedani (1987, 1989) and is based on orthogonal polynomials. It in-
volves a preliminary choice of a conjugate prior and of two parameters, the
prior mean and variance of which may be subjective (Bayesian) or estimated
from the data (parametric empirical Bayesian), followed by an improved
estimate of g based on the sample from the mixture. Then Bayes and
empirical Bayes methods are combined, but in a fashion somewhat different
than that of Deely and Lindley (1981).

2. A review of certain properties of NEF-QVF. A natural exponential
family is one with a cumulative distribution function (CDF) F_θ given by

(2.1) dF_θ(x) = exp(θx − ψ(θ)) dF₀(x), x ∈ ℝ,

where θ ∈ Θ ⊂ ℝ, F₀ is a univariate CDF possessing a moment-generating
function in a neighborhood of zero and

ψ(θ) = log ∫ exp(θx) dF₀(x), θ ∈ Θ.


The mean and variance of F_θ are given by

(2.2) μ = ψ′(θ)

and

(2.3) V(μ) = dμ/dθ = ψ″(θ).

A NEF has quadratic variance function if V has the form

(2.4) V(μ) = v₀ + v₁μ + v₂μ².
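The six basic NEF-QVF families differ only in the coefficients (v₀, v₁, v₂). A minimal sketch of (2.4) (the function names are ours; σ² and r are the fixed constants appearing in Table 1):

```python
# The six basic NEF-QVF families and their variance functions
# V(mu) = v0 + v1*mu + v2*mu^2.
def qvf_coefficients(family, sigma2=1.0, r=1.0):
    """Return (v0, v1, v2) for V(mu) = v0 + v1*mu + v2*mu**2."""
    table = {
        "normal":            (sigma2, 0.0, 0.0),    # V(mu) = sigma^2
        "poisson":           (0.0, 1.0, 0.0),       # V(mu) = mu
        "gamma":             (0.0, 0.0, 1.0 / r),   # V(mu) = mu^2/r
        "binomial":          (0.0, 1.0, -1.0 / r),  # V(mu) = mu - mu^2/r
        "negative binomial": (0.0, 1.0, 1.0 / r),   # V(mu) = mu + mu^2/r
        "hyperbolic secant": (r, 0.0, 1.0 / r),     # V(mu) = r + mu^2/r
    }
    return table[family]

def V(mu, v0, v1, v2):
    return v0 + v1 * mu + v2 * mu ** 2
```

For example, V(2, *qvf_coefficients("binomial", r=4)) gives 2 − 4/4 = 1.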


The orthogonal polynomials defined by Morris (1982) are given by the
Rodrigues formulas

(2.5) P_n(x, μ) = V^n(μ) [d^n f(x, θ)/dμ^n] / f(x, θ),

where f(x, θ) = exp(θx − ψ(θ)) is a multiple of a NEF-QVF probability law.
These are polynomials of exact degree n in both x and μ which are orthogonal
with respect to (2.1) as a function of x. Their normalizing factor is

(2.6) E_θ(P_n²(X, μ)) = a_n V^n(μ),

where a_n = n! ∏_{k=0}^{n−1}(1 + kv₂). If the parameter μ is changed to μ₀, the expected
value with respect to θ(μ) is

(2.7) E_θ(P_n(X, μ₀)) = (a_n/n!)(μ − μ₀)^n.


This is used to obtain the unique minimum variance unbiased estimator of an
analytic function [Morris (1983), Theorem 3.1]

(2.8) g(μ) = ∑_{n=0}^∞ c_n (μ − μ₀)^n / n!

by

(2.9) ĝ(X) = ∑_{n=0}^∞ (c_n / a_n) P_n(X, μ₀),

where c_n = g^{(n)}(μ₀), the nth derivative of g at μ₀.

This unbiased estimator (2.9) leads to the standard moment estimators
when g(μ) = (μ − μ₀)^n; for example, if n = 1, ĝ₁(X) = X − μ₀. This estimator
is of course sufficient for θ as well by the Rao-Blackwell theorem. This may
also be shown directly by the factorization theorem.
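The low-order cases of (2.5) can be written out and checked numerically. The sketch below is ours: a short hand-derivation gives P₁(x, μ) = x − μ and P₂(x, μ) = (x − μ)² − (x − μ)V′(μ) − V(μ) for any QVF, and we verify E_θ(P₁) = E_θ(P₂) = E_θ(P₁P₂) = 0 in the Poisson case (V(μ) = μ), assuming truncation of the Poisson sum at 60 terms is negligible for μ = 3:

```python
import math

# First two Morris polynomials, written out from the Rodrigues
# formula (2.5):
#   P_1(x, mu) = x - mu
#   P_2(x, mu) = (x - mu)^2 - (x - mu) V'(mu) - V(mu)
def P1(x, mu):
    return x - mu

def P2(x, mu, V, dV):
    return (x - mu) ** 2 - (x - mu) * dV(mu) - V(mu)

# Check orthogonality under Poisson(mu), V(mu) = mu, V'(mu) = 1,
# by summing against a truncated Poisson pmf.
mu = 3.0
pmf = [math.exp(-mu) * mu ** x / math.factorial(x) for x in range(60)]
e_p1 = sum(p * P1(x, mu) for x, p in enumerate(pmf))
e_p2 = sum(p * P2(x, mu, lambda m: m, lambda m: 1.0) for x, p in enumerate(pmf))
e_p12 = sum(p * P1(x, mu) * P2(x, mu, lambda m: m, lambda m: 1.0)
            for x, p in enumerate(pmf))
```

All three sums should be numerically zero, reflecting the orthogonality of {P_n} with respect to (2.1).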
A conjugate prior distribution has a density with respect to dμ of the form

(2.10) g(μ) = K exp{mμ₀θ − mψ(θ)}/V(μ),

where θ is now treated as a function of μ and m > 0 is a convolution
parameter. This now depends upon two prior parameters μ₀ and m, and is a
two-parameter exponential family; in fact, it is a Pearson family, but is not in
general a NEF.
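As a worked example of (2.10) (our own, not the paper's): for the Poisson family θ = log μ, ψ(θ) = μ and V(μ) = μ, so (2.10) reduces to g(μ) ∝ μ^{mμ₀−1}e^{−mμ}, a Gamma(shape mμ₀, rate m) density, whose mean is μ₀. A crude numerical integration confirms this; the grid and the illustrative values m = 2, μ₀ = 3 are arbitrary choices:

```python
import math

# Conjugate prior (2.10) for the Poisson family:
# Gamma(shape = m*mu0, rate = m).
def conjugate_prior_poisson(mu, m, mu0):
    a = m * mu0                      # gamma shape
    return m ** a * mu ** (a - 1) * math.exp(-m * mu) / math.gamma(a)

# Sanity check by Riemann summation: total mass 1, mean mu0.
m, mu0 = 2.0, 3.0
h = 0.001
grid = [h * i for i in range(1, 40000)]        # mu in (0, 40) covers the mass
w = [conjugate_prior_poisson(mu, m, mu0) for mu in grid]
mass = h * sum(w)
mean = h * sum(mu * wi for mu, wi in zip(grid, w))
```

The prior variance here is μ₀/m = V(μ₀)/m, consistent with v₂ = 0 for the Poisson family.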


3. A system of orthogonal polynomials. Since the system of polynomials
given by Morris (2.5) is orthogonal with respect to x and not μ, we must
define a different system for use with prior distributions. We define

(3.1) r_n(μ) = r_n(μ; m, μ₀) = (−1)^n [d^n/dμ^n (V^n(μ)g(μ))] / g(μ), n = 0, 1, 2, ...,

where g(μ) is the prior distribution given by (2.10). Then

r₀(μ) = 1,

r₁(μ) = m(μ − μ₀),

r₂(μ) = m²(μ − μ₀)² − m(μ − μ₀)V′(μ) + (2v₂ − m)V(μ),

and r_n(μ) is a polynomial of exact degree n except for exceptional values of the
parameters. For example, if m = v₂ or m = 2v₂, then the leading coefficient of
r₂(μ) is zero; otherwise it is not.
We next observe that the polynomials of Morris may be obtained from the
prior distribution by means of formula (2.5) since g(μ) = K f₁(mμ₀, θ)/V(μ),
where f₁(x, θ) is the modification of f(x, θ) which includes the convolution
parameter m [Morris (1983), (4.1)],

f₁(x, θ) = exp(θx − mψ(θ)).

This is equivalent to multiplying μ and V(μ) by m. Indeed we have

(3.2) P_n(mμ₀, mμ) = V^{n−1}(μ) [d^n/dμ^n (V(μ)g(μ))] / g(μ),

n = 1, 2, ...,

and hence find that r_n is given by

r_n = (−1)^n (Vg · V^{n−1})^{(n)} / g

= (−1)^n ∑_{k=0}^n C(n, k) (V^{n−1})^{(k)} (Vg)^{(n−k)} / g

= (−1)^n ∑_{k=0}^n C(n, k) P_{n−k}(mμ₀, mμ) {(V^{n−1})^{(k)} V^{1+k−n}}.

Since the expression in the braces is a polynomial of degree at most k, it
follows that r_n is a polynomial of degree at most n. It is of exact degree n
except possibly in a discrete set of values of the parameters.
We now consider the orthogonality. We assume that g(μ) has moments up
to the 2nth order. Then by Theorem 5.2 of Morris (1983) and the fact that the
integrated terms in integration by parts vanish for NEF-QVF distributions,

(3.4) ∫_a^b μ^k r_n(μ)g(μ) dμ = (−1)^n ∫_a^b μ^k (d^n/dμ^n)(V^n(μ)g(μ)) dμ
= 0, k < n,
= n! ∫_a^b V^n(μ)g(μ) dμ, k = n,

and hence r_n and r_k are orthogonal. Let r_n have leading coefficient k_n. Then
the normalizing factor is

(3.5) γ_n = ∫_a^b r_n²(μ)g(μ) dμ = k_n ∫_a^b μ^n r_n(μ)g(μ) dμ
= k_n n! ∫_a^b V^n(μ)g(μ) dμ = k_n n! v_n.
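The orthogonality (3.4) and the normalization (3.5) can be spot-checked numerically. A sketch for the Poisson case (our own choice; m = 2 and μ₀ = 3 are arbitrary), using the explicit r₁ and r₂ from the start of this section with V(μ) = μ, V′(μ) = 1 and v₂ = 0; here γ₁ should equal k₁ · 1! · v₁ = m · μ₀:

```python
import math

# Conjugate prior (2.10) for the Poisson family: Gamma(m*mu0, rate m).
m, mu0 = 2.0, 3.0
a = m * mu0

def g(mu):
    return m ** a * mu ** (a - 1) * math.exp(-m * mu) / math.gamma(a)

def r1(mu):                        # r_1(mu) = m (mu - mu0)
    return m * (mu - mu0)

def r2(mu):                        # r_2 for V(mu) = mu, v2 = 0
    return m ** 2 * (mu - mu0) ** 2 - m * (mu - mu0) - m * mu

h = 0.001
grid = [h * i for i in range(1, 40000)]              # mu in (0, 40)
dot12 = h * sum(r1(u) * r2(u) * g(u) for u in grid)  # orthogonality (3.4)
gamma1 = h * sum(r1(u) ** 2 * g(u) for u in grid)    # normalization (3.5)
```

With these values k₁ = m, v₁ = E(V(μ)) = μ₀, so γ₁ = mμ₀ = 6, while the cross term vanishes.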

This sort of Rodrigues formula is satisfied by the classical orthogonal


polynomials of Jacobi, Laguerre and Hermite. In fact polynomials defined by
(3.1) have a long history and many of their properties appear in Tricomi
(1955). He showed not only the orthogonality mentioned in (3.4), but also
derived the form of g(,u) in the case of finite, semiinfinite and infinite
intervals. These forms correspond to the three classical cases. However, he
assumed special forms of V(,u) which do not always hold for us. In particular,
V(,u) may have a nonzero leading coefficient in some cases of infinite intervals.
All orthogonal polynomials satisfy a three-term recurrence formula of the
form [Szego (1967), page 42]

(3.6) μ r_n(μ) = A_n r_{n+1}(μ) + B_n r_n(μ) + C_n r_{n−1}(μ).


The coefficients may be evaluated by observing that

∫_a^b μ r_n²(μ)g(μ) dμ = B_n ∫_a^b r_n²(μ)g(μ) dμ = B_n γ_n,

∫_a^b μ r_n(μ)r_{n+1}(μ)g(μ) dμ = A_n γ_{n+1} = ∫_a^b (k_n/k_{n+1}) r_{n+1}²(μ)g(μ) dμ = (k_n/k_{n+1})γ_{n+1},

∫_a^b μ r_n(μ)r_{n−1}(μ)g(μ) dμ = C_n γ_{n−1} = (k_{n−1}/k_n)γ_n.


In addition, the r_n(μ) are related to the P_n(x, μ):

(3.8) ∫_a^b f(x, θ)P_n(x, μ)g(μ) dμ = ∫_a^b [d^n f(x, θ)/dμ^n] V^n(μ)g(μ) dμ
= (−1)^n ∫_a^b f(x, θ)(d^n/dμ^n)(V^n(μ)g(μ)) dμ
= ∫_a^b f(x, θ)r_n(μ)g(μ) dμ.

Then r_n(μ) correspond to particular classical orthogonal polynomials in the
special cases (see Table 1). These classical polynomials usually are defined as
eigenfunctions of a second-order differential operator. Therefore, it is not
surprising that the following proposition holds.

TABLE 1
Natural exponential families with quadratic variance functions, their conjugate prior distributions
and associated orthogonal polynomials

Name                 | Normal                                | Poisson
Density              | e^{-(x-μ)²/2σ²}/(σ√(2π))              | μ^x e^{-μ}/x!
θ(μ)                 | μ/σ²                                  | log μ
ψ(θ)                 | σ²θ²/2                                | e^θ
V(μ)                 | σ²                                    | μ
(a, b)               | (-∞, ∞)                               | (0, ∞)
Zeros of V(μ)        | none                                  | 0
Std. g₀(μ)†          | e^{-μ²}                               | μ^α e^{-μ}
for m =              | 2σ²                                   | 1
μ₀ =                 | 0                                     | α + 1
v_n = ∫ V^n g₀†      | σ^{2n}                                | Γ(n+α+1)/Γ(α+1)
Standard polynomial  | Hermite                               | Laguerre
Usual symbol         | H_n(x)                                | L_n^{(α)}(x)
r_n(μ)               | (mσ²/2)^{n/2} H_n(√(m/2σ²)(μ-μ₀))     | (-1)^n n! L_n^{(α)}(mμ)
r_n(a)               | —                                     | r_n(0) = (-1)^n Γ(n+α+1)/Γ(α+1)
k_n                  | m^n                                   | m^n
n₀ ≥ max n           | ∞                                     | ∞

†Up to a multiplicative constant.


TABLE 1
Continued

Name                 | Gamma                                 | Binomial
Density              | (r/μ)^r x^{r-1} e^{-rx/μ}/Γ(r)        | C(r, x) p^x (1-p)^{r-x}, μ = rp
θ(μ)                 | -r/μ                                  | log(μ/(r-μ))
ψ(θ)                 | -r log(-θ)                            | r log(1 + e^θ)
V(μ)                 | μ²/r                                  | μ - μ²/r
(a, b)               | (0, ∞)                                | (0, r)
Zeros of V(μ)        | 0                                     | 0, r
Std. g₀(μ)†          | μ^α e^{-1/μ}                          | μ^β (r-μ)^α
for m =              | -(2+α)/r                              | (α+β+2)/r
μ₀ =                 | -1/(2+α)                              | (β+1)/m
v_n = ∫ V^n g₀†      | r^{-n} Γ(-1-α-2n)                     | r^n B(n+α+1, n+β+1)
Standard polynomial  | Bessel on (0, ∞)                      | Jacobi on (-1, 1)
Usual symbol         | y_n(x; a, b)                          | P_n^{(α,β)}(x)
r_n(μ)               | (-r)^{-n} y_n(μ; α+2, 1)              | n! P_n^{(α,β)}(2μ/r - 1)
r_n(a)               | r_n(0) = (-r)^{-n}                    | r_n(0) = (-1)^n Γ(n+β+1)/Γ(β+1)
k_n                  | (-1)^n Γ(2n+α+1)/(r^n Γ(n+α+1))       | Γ(2n+α+β+1)/(r^n Γ(n+α+β+1))
n₀ ≥ max n           | (-1-α)/2                              | ∞

†Up to a multiplicative constant.

PROPOSITION 3.1. Let v_n = ∫_a^b V^n(μ)g(μ) dμ < ∞ for n ≤ n₀. Then the r_n(μ)
given by (3.1) satisfy the differential equation

(3.9) (d/dμ)[g(μ)V(μ)r_n′(μ)] = ξ_n r_n(μ)g(μ),

where ξ_n = n((n − 1)v₂ − m).
PROOF. Since r_n′ is a polynomial of degree n − 1, the left-hand side of (3.9)
may be expressed as

(gV)′r_n′ + gVr_n″ = g(Vr_n″ − r₁r_n′) = gp,

where p is a polynomial of degree less than or equal to n. Let k < n. We shall


TABLE 1
Continued

Name                 | Negative binomial                      | Hyperbolic secant
Density              | C(x+r-1, x) p^x (1-p)^r, μ = rp/(1-p)  | (1+λ²)^{-r/2} e^{x arctan λ} f₀(x), λ = μ/r
θ(μ)                 | log(μ/(r+μ))                           | arctan(μ/r)
ψ(θ)                 | -r log(1 - e^θ)                        | -r log(cos θ)
V(μ)                 | μ + μ²/r                               | r + μ²/r
(a, b)               | (0, ∞)                                 | (-∞, ∞)
Zeros of V(μ)        | 0, -r                                  | ±ir
Std. g₀(μ)†          | μ^β (r+μ)^α                            | (r+iμ)^α (r-iμ)^β
for m =              | -(α+β+2)/r                             | -(α+β+2)/r
μ₀ =                 | (β+1)/m                                | (β-α)/(im)
v_n = ∫ V^n g₀†      | r^n B(n+β+1, -2n-α-β-1)                | (4r)^n Γ(-2n-α-β-1)/[Γ(-n-α)Γ(-n-β)]
Standard polynomial  | Jacobi on (1, ∞)                       | Jacobi on (-i∞, i∞)
Usual symbol         | P_n^{(α,β)}(x)                         | P_n^{(α,β)}(x)
r_n(μ)               | (-1)^n n! P_n^{(α,β)}(2μ/r + 1)        | (-i)^n n! 2^n P_n^{(α,β)}(iμ/r)
r_n(a)               | r_n(0) = (-1)^n Γ(n+α+1)/Γ(α+1)        | r_n(ir) = (-2i)^n Γ(n+α+1)/Γ(α+1)
k_n                  | (-1)^n Γ(2n+α+β+1)/(r^n Γ(n+α+β+1))    | (-1)^n Γ(2n+α+β+1)/(r^n Γ(n+α+β+1))
n₀ ≥ max n           | (-1-α-β)/2                             | (-1-α-β)/2

†Up to a multiplicative constant and possibly complex parameters.

show that

∫_a^b g(μ)p(μ)μ^k dμ = 0

and hence that p(μ) is orthogonal to all polynomials of degree less than n.
Thus p(μ) is a multiple of r_n(μ), say ξ_n r_n(μ). To see this we integrate by parts
twice,

(3.10) ∫_a^b (g(μ)V(μ)r_n′(μ))′μ^k dμ = −k∫_a^b g(μ)V(μ)r_n′(μ)μ^{k−1} dμ
= k∫_a^b r_n(μ)[(g(μ)V(μ)μ^{k−1})′/g(μ)]g(μ) dμ = 0,

since (g(μ)V(μ)μ^{k−1})′/g(μ) is a polynomial of degree less than or equal to k < n.

The leading coefficient of p(μ) is the same as that of Vr_n″ − r₁r_n′, namely
−mnk_n + v₂n(n − 1)k_n = nk_n((n − 1)v₂ − m). This in turn equals ξ_n k_n,
whence the conclusion. □

A more detailed version similar to this proof may be found in Tricomi (1955).

COROLLARY 3.2. The polynomials {r_n} satisfy the differential equation

V(μ)y″ − m(μ − μ₀)y′ − n((n − 1)v₂ − m)y = 0.
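Corollary 3.2 can be verified pointwise for small n. A minimal check (ours, not the paper's) for n = 2 in the Poisson case, where V(μ) = μ and v₂ = 0, so r₂ should satisfy μy″ − m(μ − μ₀)y′ + 2my = 0:

```python
# Spot-check of Corollary 3.2, n = 2, Poisson case.
m, mu0 = 2.0, 3.0

def r2(mu):                         # r_2 for V(mu) = mu, v2 = 0
    return m ** 2 * (mu - mu0) ** 2 - m * (mu - mu0) - m * mu

def r2p(mu):                        # r_2'
    return 2 * m ** 2 * (mu - mu0) - 2 * m

def residual(mu):
    r2pp = 2 * m ** 2               # r_2'' is constant
    return mu * r2pp - m * (mu - mu0) * r2p(mu) + 2 * m * r2(mu)

checks = [residual(mu) for mu in (0.5, 1.0, 2.5, 7.0)]
```

The residual vanishes identically in μ, so every test point returns zero up to rounding.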

We may also obtain an expression for the derivatives of r_n. Indeed r_n′V is a
polynomial of degree less than or equal to n + 1; it satisfies the recurrence
formula

(3.11) r_n′V = a_n r_{n+1} + b_n r_n + c_n r_{n−1},

where

a_n = nv₂k_n/k_{n+1},

b_n = ∫_a^b r_n′(μ)r_n(μ)V(μ)g(μ) dμ/γ_n = (m/2)(B_n − μ₀),

c_n = ∫_a^b r_n′(μ)r_{n−1}(μ)V(μ)g(μ) dμ/γ_{n−1} = (k_{n−1}/k_n)(γ_n/γ_{n−1})(m − (n − 1)v₂).
All the coefficients of these two recurrence formulas may be given in terms
of k_n, v_n and B_n. This last coefficient may be found in terms of the others as
well if the polynomials are known at one point [usually a zero of V(μ)]. Then

a r_n(a) = A_n r_{n+1}(a) + B_n r_n(a) + C_n r_{n−1}(a)

or

(3.12) B_n = −A_n r_{n+1}(a)/r_n(a) − C_n r_{n−1}(a)/r_n(a) + a.

In Table 1, we give the values of k_n, v_n and r_n(a) for the six basic
families. More detailed general calculations are found in Appendix A and the
particular cases in Appendix B.

4. A related biorthogonal sequence. In Bayesian simultaneous estimation
methods [see Leonard (1984) for references] it is assumed that the
density g belongs to a parametrized family, and then assumptions about the
parameters of g are introduced. As pointed out by Leonard (1984), the choice
of g very often involves a unimodal density with thin tails (e.g., normal or
gamma). While these choices of prior will be adequate in numerous situations,


they will be less appropriate in many others. Dawid (1973) investigated prior
densities with thicker tails than normal and showed that it is unreasonable to
expect the same results from analysis based upon a normal prior. Alterna-
tively, g might possess more than one mode, in which case fairly complex
analysis might be involved. In view of these observations due to Leonard
(1984), he studied the empirical estimation of the general prior density g, that
is, under no prior information about g. He pointed out that if some partial
information about g were available, then it could be used for smoothing
densities.
We are therefore interested in prior distributions which are not necessarily
conjugate distributions but are more general. In this section we shall denote
the conjugate density by g₀(μ) and shall allow g(μ) to be any density in the
(topological) span of {r_n} in L²(g₀). This is not a restriction if the {r_n} are
complete, as they are for finite intervals.
In this case, if

(4.1) g(μ) = ∑_{n=0}^∞ a_n r_n(μ)g₀(μ),

then the marginal distribution is formally given by

(4.2) f(x) = ∫_a^b f(x, θ) ∑_n a_n r_n(μ)g₀(μ) dμ = ∑_n a_n ∫_a^b f(x, θ)r_n(μ)g₀(μ) dμ,

times some fixed measure dF₀(x). We denote

(4.3) I_n(x) = ∫_a^b f(x, θ)r_n(μ)g₀(μ) dμ.

These may also be expressed in terms of Morris's polynomials as

(4.4) I_n(x) = ∫_a^b f(x, θ)P_n(x, μ)g₀(μ) dμ,

by (3.8).
We shall be interested in turning the problem around and going from f(x)
in (4.2) to g(μ), that is, in finding coefficients a_n such that

(4.5) f(x) = ∑_n a_n I_n(x),

which we may then use to recover g(μ) by using (4.1). To do so we find a
sequence {A_n(x)} of polynomials biorthogonal to {I_n(x)} by using (2.8) for r_n(μ).
The μ₀ used there is taken to be the parameter in g₀(μ) given by (2.10). Thus
by (2.9) we have the unique minimum variance unbiased estimator of r_n(μ) in

the form

(4.6) r_n(μ) = ∑_{k=0}^n c_{nk}(μ − μ₀)^k/k!,

given by

(4.7) A_n(x) = ∑_{k=0}^n (c_{nk}/a_k)P_k(x, μ₀).

Then we have

(4.8) E₀(A_k(X)I_n(X)) = ∫ A_k(x){∫_a^b f(x, θ)r_n(μ)g₀(μ) dμ} dF₀(x)
= ∫_a^b E_θ(A_k(X))r_n(μ)g₀(μ) dμ
= ∫_a^b r_k(μ)r_n(μ)g₀(μ) dμ = δ_{nk}γ_n,

by Morris (1983), page 517.
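Relation (4.8) can be illustrated numerically for the Poisson-gamma pair with k = n = 1 (our own sketch; the grid and truncation constants are arbitrary). Here A₁(x) = m(x − μ₀) by (4.6)-(4.7), I₁(x) is the integral (4.3) against the Poisson law, and the result should be γ₁ = mμ₀:

```python
import math

# Check E0[A_1(X) I_1(X)] = gamma_1 = m*mu0 for the Poisson family
# with conjugate Gamma(shape m*mu0, rate m) prior g0.
m, mu0 = 2.0, 3.0
a = m * mu0

def g0(mu):
    return m ** a * mu ** (a - 1) * math.exp(-m * mu) / math.gamma(a)

h = 0.01
grid = [h * i for i in range(1, 4000)]               # mu in (0, 40)
weights = [m * (u - mu0) * g0(u) for u in grid]      # r_1(mu) g0(mu)

def I1(x):                                           # formula (4.3)
    fact = math.factorial(x)
    return h * sum(u ** x * math.exp(-u) / fact * w
                   for u, w in zip(grid, weights))

inner = sum(m * (x - mu0) * I1(x) for x in range(60))  # E0[A_1 I_1]
```

The double sum collapses analytically to m² Var(μ) = m²·μ₀/m = mμ₀ = 6, which the computation reproduces.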


Thus if f(x) dF₀(x) denotes the marginal distribution, we may express f(x)
in the form (4.5) by taking

(4.9) a_n = E₀(A_n(X)f(X))/γ_n

and subsequently use (4.1) to find g(μ).
We have been rather cavalier with questions of convergence in this section.
A number of problems may arise which we shall allude to in the next sections.

1. g₀(μ) may not have moments of all orders, so that r_n(μ)g₀(μ) may not be
integrable for large n.
2. g(μ) may not be identifiable. This may happen if the I_n are not linearly
independent.
3. The topological span of {r_n} in L²(g₀) may not include all the prior
distributions of interest.

The variance of A_n(X) may also be calculated from the general formula in
Morris (1983). It is

(4.10) Var_θ{A_n(X)} = {r_n′(μ)}²V(μ) + {r_n″(μ)}²V²(μ)/2(1 + v₂)
+ ∑_{k=3}^n {r_n^{(k)}(μ)}²V^k(μ)/a_k,

where a_k = k! ∏_{j=1}^{k−1}(1 + jv₂).


However, we shall be interested in the variance of A_k(X) when X is the
random variable with distribution

∫_a^b f(x, θ)g₀(μ) dμ dF₀(x),

which we shall denote by Var_{g₀}. We shall use repeated integration by parts to

evaluate integrals of the form

(4.11) I_{nk} = ∫_a^b {r_n^{(k)}(μ)}²V^k(μ)g₀(μ) dμ.

The appropriate formula is

(4.12) ∫_a^b p(μ)q′(μ)V(μ)g₀(μ) dμ = −∫_a^b p′(μ)q(μ)V(μ)g₀(μ) dμ
+ ∫_a^b r₁(μ)p(μ)q(μ)g₀(μ) dμ,

where p and q are polynomials such that

∫_a^b |p(μ)q(μ)V(μ)|g₀(μ) dμ < ∞.

For n = k, the integral (4.11) is easy to evaluate. It is

(4.13) I_{nn} = (n!k_n)² ∫_a^b V^n(μ)g₀(μ) dμ = (n!k_n)²v_n = n!k_nγ_n.

For k = 0, I_{n0} = γ_n. In the other cases we apply (4.12) repeatedly to obtain:

LEMMA 4.1. Let g₀(μ) have a finite 2nth moment. Then

(4.14) I_{nk} = [n!/(n − k)!] ∏_{j=0}^{k−1}(m − (n + j − 1)v₂) γ_n, k = 1, 2, ..., n.

For k = 1 we use (4.12) with p = r_n′ and q = r_n. Then we find that

∫_a^b r_n′(μ)²V(μ)g₀(μ) dμ = −∫_a^b r_n″(μ)r_n(μ)V(μ)g₀(μ) dμ
+ ∫_a^b r₁(μ)r_n′(μ)r_n(μ)g₀(μ) dμ = −ξ_nγ_n.

For general k we take q = r_n^{(k−1)} and p = r_n^{(k)}V^{k−1} to find

(4.15) ∫_a^b {r_n^{(k)}(μ)}²V^k(μ)g₀(μ) dμ
= −∫_a^b r_n^{(k−1)}(μ)(r_n^{(k)}(μ)V^{k−1}(μ))′V(μ)g₀(μ) dμ
+ ∫_a^b r₁(μ)r_n^{(k−1)}(μ)r_n^{(k)}(μ)V^{k−1}(μ)g₀(μ) dμ
= −∫_a^b r_n^{(k−1)}(μ)V^{k−1}(μ)
× (r_n^{(k+1)}(μ)V(μ) + r_n^{(k)}(μ)(k − 1)V′(μ) − r₁(μ)r_n^{(k)}(μ))g₀(μ) dμ,


which is simplified by using the differential equation satisfied by y = r_n^{(k−1)},

(4.16) Vy″ − (r₁ − (k − 1)V′)y′
= {(n(n − 1) − (k − 1)(k − 2))v₂ + m(k − 1) − mn}y
= (n − k + 1)((n + k − 2)v₂ − m)y = η_{n,k−1}y.

By substituting (4.16) into the right-hand side of (4.15), we find that

(4.17) ∫_a^b {r_n^{(k)}(μ)}²V^k(μ)g₀(μ) dμ = −η_{n,k−1} ∫_a^b {r_n^{(k−1)}(μ)}²V^{k−1}(μ)g₀(μ) dμ,

and the conclusion follows by induction.

COROLLARY 4.2. The variance of A_n(X) is given by

(4.18) Var_{g₀}{A_n(X)} = γ_n {1 + ∑_{k=1}^n C(n, k) ∏_{j=0}^{k−1} [(m − (n + j − 1)v₂)/(1 + jv₂)]}.

Since the I_{nk} of (4.11) must be nonnegative, it follows by (4.14) that
η_{n,k} ≤ 0 and in particular that

η_{n,n} = n((2n − 1)v₂ − m) ≤ 0

or

(4.19) m ≥ (2n − 1)v₂.

This is not a contradiction, since in those cases in which v₂ > 0, the conjugate
prior distribution has only a finite number of moments. If v₂ < 0, the binomial
case only, then we must have 1 + jv₂ > 0 for j = 0, 1, ..., k − 1, that is,
r ≥ k, where V(μ) = μ − μ²/r.

5. Estimation. In this section we suppose that we have an i.i.d. sample
X₁, X₂, ..., X_N of the mixture with the probability law given by

(5.1) dF(x) = ∫_a^b f(x, θ(μ))g(μ) dμ dF₀(x) = f(x) dF₀(x).

We shall first estimate f(x) by using density estimators similar to those used
with orthogonal functions. Then we estimate g(μ) by employing the procedure
mentioned in the last section. Finally we obtain Bayes empirical Bayes estimates
of the moments.

If g(μ) is a conjugate prior density, then the (Bayesian) posterior estimate
of the mean is a weighted average of μ₀ and the sample mean X̄, as is well
known. However, this assumption is excessively restrictive, since such conjugate
priors are usually unimodal. This excludes the common assumption that
mixtures consist of a linear combination of the f(x|θ_i). This in turn corresponds
to prior distributions of the form ∑ p_i δ(θ − θ_i). A "smeared" smooth


version of this would be ∑ p_i δ_λ(θ − θ_i), where {δ_λ} is a smooth delta family
[Walter and Blum (1979)]. Prior distributions of this form arise from MLE
[Laird (1978)] and are the form considered in Leonard (1984). If g(μ) is not
the conjugate prior density, this is no longer necessarily true and the posterior
mean is

(5.2) ∫_a^b μ f(X̄, θ(μ))g(μ) dμ / ∫_a^b f(X̄, θ(μ))g(μ) dμ,

where f is the probability law of the sample mean, which is also a NEF-QVF.
This can either be estimated directly or by first estimating g(μ) from the
sample. We shall adopt the latter approach, which has the advantage of giving
estimates of other moments as well.
We shall assume that g₀(μ), an initial Bayesian conjugate prior distribution,
has been found and has moments up to order 2n₀, which may be infinite. If our
Bayesian is reluctant to specify μ₀ and m based on his subjective knowledge,
other procedures may be used. One such is to assume a noninformative prior
distribution as the initial guess for g₀. This only works if the interval (a, b) is
bounded. Another procedure is to estimate μ₀ and m from a portion of the
data by using MLE or other methods and then using the conjugate prior
distribution

ĝ₀(μ) = K exp{m̂μ̂₀θ(μ) − m̂ψ(θ(μ))}/V(μ)

as the estimate. If the true prior density g(μ) is of the form

(5.3) g(μ) = h(μ)g₀(μ), μ ∈ Ω,

where h(μ) is a polynomial of degree less than or equal to n₁ =
min{n₀, card(supp F(x))} if n₁ < ∞, and an element of L²(Ω; g₀) otherwise,
but is unknown, we assume that f(x) is given by (5.1) with that g(μ).
We use the sample to estimate the coefficients in the expression (4.1) of g
and (4.2) of f. They are

(5.4) â_k = (1/N) ∑_{i=1}^N A_k(X_i)γ_k^{−1}.

The estimators of g(μ) and of f(x) are, respectively,

(5.5) ĝ_p(μ) = ĥ_p(μ)g₀(μ) = ∑_{k=0}^p â_k r_k(μ)g₀(μ), p = 0, 1, ..., n₁,

and

(5.6) f̂_p(x) = ∑_{k=0}^p â_k I_k(x), p = 0, 1, ..., n₁.
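The estimation step (5.4)-(5.5) can be sketched for the Poisson-gamma case with p = 1. Everything below is our own minimal illustration, not the paper's code; the four-point integer "sample" is only to make the arithmetic transparent. Since its mean equals μ₀, the estimate â₁ vanishes and ĝ₁ collapses to the conjugate prior g₀:

```python
import math

# Poisson-gamma pieces: A_1(x) = m(x - mu0), r_1(mu) = m(mu - mu0),
# gamma_0 = 1, gamma_1 = m*mu0 by (3.5); g0 is the conjugate gamma prior.
m, mu0 = 2.0, 3.0
gam = [1.0, m * mu0]                       # normalizing factors gamma_k

def A(k, x):                               # biorthogonal polynomials (4.7)
    return [1.0, m * (x - mu0)][k]

def r(k, mu):                              # orthogonal polynomials (3.1)
    return [1.0, m * (mu - mu0)][k]

def g0(mu):                                # conjugate prior (2.10)
    a = m * mu0
    return m ** a * mu ** (a - 1) * math.exp(-m * mu) / math.gamma(a)

def fit(sample, p=1):                      # coefficient estimates (5.4)
    n = len(sample)
    return [sum(A(k, x) for x in sample) / (n * gam[k]) for k in range(p + 1)]

def g_hat(mu, coef):                       # prior estimate (5.5)
    return sum(c * r(k, mu) for k, c in enumerate(coef)) * g0(mu)

coef = fit([2, 3, 4, 3])                   # sample mean = mu0, so a_hat_1 = 0
```

With a sample whose mean differs from μ₀, the r₁ term would tilt g₀ toward the data.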

Here, as in the orthogonal function estimation, p is a smoothness parameter
with decreasing value corresponding to increasing smoothness. If one is
interested primarily in smoothness, p should be chosen as small as possible
consistent with the maximum number of anticipated modes in g(μ). Since
ĥ_p(μ) is a polynomial of degree less than or equal to p, it can have only a
limited number of modes.

In general p = p(N) will increase with the sample size and may approach
infinity. This happens only if n₁ = ∞, in which case we obtain mean-squared
consistency for general prior densities (below). However, it is also possible to
restrict p to some value less than n₁. In this case the problem becomes
parametric with parameters μ₀, m, a₀, a₁, ..., a_p. The choice of p again may
be subjective, chosen on the basis of smoothness, or it may be data-based. For
the latter, the choice of p based on the penalized MLE method of Schwarz
(1978) seems the most promising, but has not as yet been explored.

PROPOSITION 5.1. If h(μ) is a polynomial given by

h(μ) = ∑_{k=0}^{n₂} a_k r_k(μ), n₂ ≤ n₁,

then

(i) â_k is an unbiased estimator of a_k, k = 0, 1, ..., n₂;
(ii) ĥ_p(μ) is an unbiased estimator of h(μ), n₂ ≤ p ≤ n₁;
(iii) f̂_p(x) is an unbiased estimator of f(x), n₂ ≤ p ≤ n₁.

PROOF. By (5.4) we have

(5.7) E_g â_k = γ_k^{−1} E_g A_k(X)
= γ_k^{−1} ∫∫ A_k(x)f(x, θ)h(μ)g₀(μ) dμ dF₀(x)
= γ_k^{−1} ∫_a^b r_k(μ)h(μ)g₀(μ) dμ = a_k,

where E_g denotes the expectation with respect to the distribution given in
(5.1). The proofs of (ii) and (iii) follow from (i). □

The variance of â_k may be obtained from that of A_k(X), which in turn may
be based on (4.10). Indeed we have

Var_g â_k = Var_g A_k(X)/(Nγ_k²) = ∫_a^b p_{2k}(μ)h(μ)g₀(μ) dμ/(Nγ_k²),

where p_{2k}(μ) is the polynomial on the right-hand side of (4.10).
From this we obtain:

COROLLARY 5.2. Let $h(\mu)$ be a polynomial of degree $p \le n_1$. Then the integrated mean-squared errors (IMSE) of $\hat h_p$ and $\hat f_p$ satisfy
$$\int_a^b E\big[\hat h_p(\mu) - h(\mu)\big]^2 g_0(\mu)\,d\mu = O\Big(\frac{1}{N}\Big),$$
$$\int_a^b E\big[\hat f_p(x) - f(x)\big]^2\,dF_0(x) = O\Big(\frac{1}{N}\Big).$$


However, if $h$ is not a polynomial of degree less than or equal to $p$, then the estimators (ii) and (iii) will be biased and the mean-squared error will not converge at the same rates. In fact we shall now assume that $h$ is not a polynomial but is a bounded continuous function. This requires that $g_0(\mu)$ have moments of every order if $h$ is to be approximated by polynomials.

THEOREM 5.3. Let $g_0(\mu)$ have moments of every order and let $h(\mu)$ be a bounded continuous function such that $T^q h \in L^2(g_0)$ for some positive integer $q$, where $T$ is the differential operator given by
$$T\varphi = V\varphi'' - m(\mu - \mu_0)\varphi'.$$
Let $\hat h_p$ and $\hat f_p$ be given by (5.5) and (5.6), respectively. Then for some constants $C_1$ and $C_2$ independent of $N$ and $p$ and each $\varepsilon > 0$,
$$\int_a^b E\big[\hat h_p(\mu) - h(\mu)\big]^2 g_0(\mu)\,d\mu \le \frac{C_1}{N}\big(m + p|v_2|\big)^{p+1} + C_2(p+1)^{1+\varepsilon-2q},$$
$$\int_a^b E\big[\hat f_p(x) - f(x)\big]^2\,dF_0(x) \le \frac{C_1}{N}\big(m + p|v_2|\big)^{p+1} + C_2(p+1)^{1+\varepsilon-2q}.$$

PROOF. The mean-squared error of $\hat h_p$ is
$$E\big[\hat h_p - h\big]^2 = E\Big[\sum_{k=0}^{p}(\hat a_k - a_k)r_k\Big]^2 + \Big[\sum_{k=p+1}^{\infty}a_k r_k\Big]^2,$$
where now $h$ is given by the convergent series
$$h(\mu) = \sum_{k=0}^{\infty}a_k r_k(\mu).$$

The coefficients are given by

(5.8) $a_k = \displaystyle\int h(\mu)r_k(\mu)g_0(\mu)\,d\mu\Big/\gamma_k = \displaystyle\int T^q h(\mu)\,r_k(\mu)g_0(\mu)\,d\mu\Big/\big(\theta_k^q\gamma_k\big)$

and
$$a_k^2 \le \int_a^b\big[T^q h(\mu)\big]^2 g_0(\mu)\,d\mu\Big/\big(\theta_k^{2q}\gamma_k\big),$$
by Schwarz's inequality. Thus the integrated bias term satisfies

(5.10) $\displaystyle\int_a^b\Big[\sum_{k=p+1}^{\infty}a_k r_k(\mu)\Big]^2 g_0(\mu)\,d\mu = \sum_{k=p+1}^{\infty}a_k^2\gamma_k \le \sum_{k=p+1}^{\infty}\big\|T^q h\big\|^2\,\theta_k^{-2q}.$

Since by (3.9), $\theta_k = k((k-1)v_2 - m)$, it follows that (5.10) is dominated by $(p+1)^{1+\varepsilon-2q}$ for each $\varepsilon > 0$.


The integrated variance (IV) term is given by

(5.11) $\mathrm{IV} = \displaystyle\int_a^b E\Big[\sum_{k=1}^{p}(\hat a_k - a_k)r_k(\mu)\Big]^2 g_0(\mu)\,d\mu = \sum_{k=1}^{p}\operatorname{Var}_g(\hat a_k)\,\gamma_k.$

In order to evaluate this expression, we use (4.18) to find that, for $1 + iv_2 > 0$, $i = 1, 2, \ldots, p$,

(5.12) $\mathrm{IV} \le \dfrac{1}{N}\displaystyle\sum_{k=1}^{p}\big(1 + m - (k-1)v_2\big)^k\,\|h\|^2 \le \mathrm{const.}\,\dfrac{1}{N}\big(m + (p-1)|v_2|\big)^{p+1}.$

Hence by combining (5.12) and (5.10) we reach the first conclusion. The second follows from the first by Schwarz's inequality. $\Box$

This IMSE can be made to converge to 0 as $N \to \infty$ provided $p = p(N)$ converges to infinity at such a rate that $p^{p+1}/N \to 0$ as $N \to \infty$ for $v_2 \ne 0$, and at a rate such that $m^{p+1}/N \to 0$ as $N \to \infty$ for $v_2 = 0$. In the first case the rate of convergence will be extremely slow. However, in the case of $v_2 > 0$ there are only a finite number of possible terms in the expansion of $h(\mu)$, while for $v_2 < 0$ there are only a finite number of possible values of $X$. Hence in neither case is it possible to allow $p \to \infty$.

If $v_2 = 0$, the convergence is more rapid, though still quite slow. Indeed, if $p + 1 = O(\log N/(2\log m))$, we have:

COROLLARY 5.3. If $v_2 = 0$, then the IMSE satisfies
$$\mathrm{IMSE} = O\big([\log N]^{1-2q}\big)$$
for both $\hat h_p$ and $\hat f_p$, where $p + 1 = O(\log N/(2\log m))$.


The estimates of the posterior mean and variance arising from the estimate of $g(\mu) = h(\mu)g_0(\mu)$ are
$$\hat\mu_p = \int_a^b \mu f(x,\theta)\hat h_p(\mu)g_0(\mu)\,d\mu\Big/\int_a^b f(x,\theta)\hat h_p(\mu)g_0(\mu)\,d\mu$$
and
$$\hat V_p = \int_a^b(\mu - \hat\mu_p)^2 f(x,\theta)\hat h_p(\mu)g_0(\mu)\,d\mu\Big/\int_a^b f(x,\theta)\hat h_p(\mu)g_0(\mu)\,d\mu.$$
Both may be shown to be asymptotically optimal [see Walter and Hamedani (1989)] with rate $O(N^{-1/2+\varepsilon})$ by Corollary 5.2 if $h(\mu)$ is a polynomial in $L^2((a,b);\,g_0)$. If $v_2 = 0$ and $h(\mu)$ is a bounded continuous function, then both are again asymptotically optimal, but with a slow rate, by Corollary 5.3.

We can use the properties of the orthogonal polynomials to find an expression for the posterior mean [$h(\mu)$ a polynomial]:

(5.13) $\hat\mu = \displaystyle\int_a^b \mu f(x,\theta)\sum_{k=0}^{n}a_k r_k(\mu)g_0(\mu)\,d\mu\Big/\sum_{k=0}^{n}a_k l_k(x) = \displaystyle\sum_{k=0}^{n}a_k\big(A_k l_{k+1}(x) + B_k l_k(x) + C_k l_{k-1}(x)\big)\Big/\sum_{k=0}^{n}a_k l_k(x).$

The estimate can then be obtained by replacing $a_k$ by $\hat a_k$ and truncating the sum to $p$.
An alternative point of view is found by observing that
$$l_k(x) = E\big(r_k(\mu)\mid X = x\big)$$
and using the moment calculations obtained from Theorem 5.2 of Morris (1983),

(5.14) $E\big((\mu - x_0)r_k(\mu)\mid X = x\big) = \dfrac{1}{m+1}E\big(r_k'(\mu)V(\mu)\mid X = x\big),$

where $x_0 = (x + m\mu_0)/(m+1)$. This is useful when $p$ is small, for example, $p = 1$. Then we have
$$l_0(x) = 1, \qquad l_1(x) = r_1(x_0), \qquad E(\mu\mid X = x) = x_0$$


and

E(A rl(i)IX = x) = E((A - xo)r1(lt) + xor(I) IX = x)


m

m + 1 E(V(A) IX = x) + xorl(x0)
m
V(Xo) (m + 1)(m + 1 - v2) + x0r1(xO).


Hence for $p = 1$, the posterior mean is

(5.15) $\hat\mu = \dfrac{a_0 x_0 + a_1\big(x_0 r_1(x_0) + mV(x_0)/(m+1-v_2)\big)}{a_0 + a_1 r_1(x_0)} = x_0 + \dfrac{a_1\,m\,V(x_0)}{(m+1-v_2)\big(a_0 + a_1 r_1(x_0)\big)}.$
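The degree-one correction is easy to evaluate and to check against a direct computation in the normal case. The sketch below is our illustration, not the authors' code; the function name and arguments are ours, and the formula assumed is the simplified form $x_0 + a_1 m V(x_0)/\big((m+1-v_2)(a_0 + a_1 r_1(x_0))\big)$.

```python
def posterior_mean_p1(x, mu0, m, v2, V, r1, a0, a1):
    """Posterior mean for the degree-one prior correction h = a0 + a1*r1.

    x0 is the posterior mean under the conjugate trial prior; the second
    term is the p = 1 correction of (5.15)."""
    x0 = (x + m * mu0) / (m + 1.0)
    return x0 + a1 * m * V(x0) / ((m + 1.0 - v2) * (a0 + a1 * r1(x0)))

# Normal check: X | mu ~ N(mu, 1), trial prior N(0, 1) (m = 1, mu0 = 0),
# so V = 1, v2 = 0 and r1(t) = t.  For h = 1 + 0.5*r1 and x = 2 the exact
# posterior mean is (a0*x0 + a1*(x0**2 + 1/2)) / (a0 + a1*x0) with x0 = 1.
exact = (1.0 * 1.0 + 0.5 * (1.0 ** 2 + 0.5)) / (1.0 + 0.5 * 1.0)
approx = posterior_mean_p1(2.0, 0.0, 1.0, 0.0, lambda t: 1.0, lambda t: t, 1.0, 0.5)
```

With $a_1 = 0$ the correction vanishes and the formula returns the conjugate-prior value $x_0$, as it should.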
We have proposed a general method which encompasses six particular cases.
Our results have been primarily theoretical and of an asymptotic nature.
However, the small sample behavior of our method is potentially more useful.
A simulation study is being undertaken, but is not as yet complete. We present
here an example from our previous work and results of two simulations.

EXAMPLE 1. In the binomial case, since the interval (a, b) is bounded, the
use of a noninformative prior is possible. If the interval is normalized to (0, 1)
by using p = ,/r (see Table 1) as the parameter, the setting is exactly the
same as in Walter and Hamedani (1987). The resulting polynomials are the
Legendre polynomials. These were used to estimate p from the past data
(5, 4, 5, 5, 0) and current value 5 from a binomial mixture with r = 5. The
results were

$$\hat p_1 = 0.865, \qquad \hat p_2 = 0.930, \qquad \hat p_3 = 0.939, \qquad \hat p_4 = 0.945,$$


where the subscript denotes the number of terms in the estimator. This may
be compared to the estimates ranging from 0.886 to 0.936 from the same data
arising from an estimator based on a Dirichlet process [Berry and Christensen
(1979)].

EXAMPLE 2. A simulation in the binomial case r = 5 in which the prior was


bimodal was also done (J. Letelier, personal communication). The prior was
assumed to be the density

$$g(p) = \frac{3\pi}{14}\big(2\sin(\pi p) + \sin(3\pi p)\big),$$
and samples of size 15 were taken from the resulting marginal distribution.
The results were

$$\hat p_1 = 0.509, \qquad \hat p_2 = 0.605, \qquad \hat p_3 = 0.769, \qquad \hat p_4 = 0.528,$$

in which the subscript denotes the number of terms in the estimate of g(p). In
this example a different sample was used in each of these cases as well as a
different current value. These were also generated randomly and were, respectively, $x = 3, 5, 5, 3$ for the four cases. The true value of $E(p)$ was of course 0.5.

The expression for $\hat g(p)$ was compared to that of $g(p)$. For approximation by a fourth-degree polynomial (five terms in the series), the correct shape was observed even when samples as small as 5 were taken.
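The Example 2 prior can be checked numerically. In the sketch below the normalizing constant $3\pi/14$ is our reconstruction (the printed constant is illegible in our copy); it is the unique constant making the stated sine combination integrate to 1, and symmetry then forces $E(p) = 1/2$.

```python
import math

def g(p):
    # bimodal prior density of Example 2 on (0, 1); the constant 3*pi/14
    # is inferred from the requirement that g integrate to 1
    return (3.0 * math.pi / 14.0) * (2.0 * math.sin(math.pi * p)
                                     + math.sin(3.0 * math.pi * p))

steps = 100000
mass = sum(g((i + 0.5) / steps) for i in range(steps)) / steps
mean = sum(((i + 0.5) / steps) * g((i + 0.5) / steps)
           for i in range(steps)) / steps
```

The density has modes near $p = 0.25$ and $p = 0.75$ with a dip at $p = 0.5$, matching the bimodality described in the text.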


EXAMPLE 3. We consider a simulation in which the prior distribution has a point mass at 0 and at 2, that is, $g(\mu) = \frac12\delta(\mu) + \frac12\delta(\mu - 2)$, and the NEF-QVF is $N(\mu, 1)$, that is,
$$f(x,\mu) = \exp\Big\{x\mu - \frac{\mu^2}{2}\Big\}.$$
Since the conjugate prior is also normal in this case, it has the form
$$g_0(\mu) = K\exp\Big\{m\Big(\mu_0\theta - \frac{\theta^2}{2}\Big)\Big\},$$
where $\theta = \mu$. In this case we cannot take the trial prior to be noninformative, since the interval $(a, b)$ is infinite. Accordingly, we take it to be as simple as possible, with $\mu_0 = m = 1$. The polynomials $r_n(\mu)$ are, by (B.1.3),
$$r_n(\mu) = \Big(\frac12\Big)^{n/2}H_n\Big(\frac{\mu - 1}{\sqrt2}\Big)$$

for
$$g_0(\mu) = \frac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{(\mu-1)^2}{2}\Big\}.$$
The polynomials $p_n(x,\mu)$ similarly may be found to be
$$p_n(x,\mu) = \Big(\frac12\Big)^{n/2}H_n\Big(\frac{x-\mu}{\sqrt2}\Big).$$
From this a simple calculation gives us
$$l_n(x) = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\Big\{x\mu - \frac{\mu^2}{2} - \frac{x^2}{2}\Big\}\Big(\frac12\Big)^{n/2}H_n\Big(\frac{\mu-1}{\sqrt2}\Big)g_0(\mu)\,d\mu = 2^{-n}H_n\Big(\frac{x-1}{2}\Big)\frac{1}{2\sqrt\pi}\exp\Big\{-\frac{(x-1)^2}{4}\Big\}.$$


The biorthogonal system $A_n(x)$, which satisfies
$$\int_{-\infty}^{+\infty}A_k(x)\,l_n(x)\,dF_0(x) = \delta_{kn}\gamma_n,$$
must therefore satisfy
$$\int_{-\infty}^{+\infty}A_k(x)\,H_n\Big(\frac{x-1}{2}\Big)\exp\Big\{-\frac{(x-1)^2}{4}\Big\}\,dx = \delta_{kn}\,2^{n+1}n!\sqrt\pi.$$
Hence
$$A_k(x) = H_k\Big(\frac{x-1}{2}\Big).$$


The estimator of the prior, given a sample $X_1, \ldots, X_N$, is then
$$\hat g_p(\mu) = \sum_{n=0}^{p}\hat a_n\Big(\frac12\Big)^{n/2}H_n\Big(\frac{\mu-1}{\sqrt2}\Big)\frac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{(\mu-1)^2}{2}\Big\},$$
where
$$\hat a_n = \frac{1}{n!\,N}\sum_{i=1}^{N}H_n\Big(\frac{X_i - 1}{2}\Big).$$

A number of samples of the mixture were generated and used to recover $g(\mu)$. The first was $(0.639, 0.0049, 1.456, -1.083, \ldots)$; the estimates of the $a_n$ were
$$\hat a_0 = 1, \quad \hat a_1 = -0.521, \quad \hat a_2 = -0.419, \quad \hat a_3 = 0.191, \quad \hat a_4 = 0.085,$$
which gave an estimator of
$$\hat g(\mu) = \frac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{(\mu-1)^2}{2}\Big\}\Big[1 - \frac{0.521}{2^{1/2}}H_1\Big(\frac{\mu-1}{\sqrt2}\Big) - \frac{0.419}{2}H_2\Big(\frac{\mu-1}{\sqrt2}\Big) + \frac{0.191}{2^{3/2}}H_3\Big(\frac{\mu-1}{\sqrt2}\Big) + \frac{0.085}{4}H_4\Big(\frac{\mu-1}{\sqrt2}\Big)\Big].$$
This estimate is very crude, given the small sample size and the small number of terms used. The mean is just 1 plus the coefficient of $H_1$, in this case $\bar\mu = 0.64$. (This is not the posterior mean, but rather an estimate of the prior mean.)
For a sample of size 10 with the same seed we have
$$\hat a_0 = 1, \quad \hat a_1 = -0.736, \quad \hat a_2 = -0.356, \quad \hat a_3 = 0.402, \quad \hat a_4 = -0.0052,$$
while for a sample of size 20 we have
$$\hat a_0 = 1, \quad \hat a_1 = -0.015, \quad \hat a_2 = -0.140, \quad \hat a_3 = 0.034, \quad \hat a_4 = -0.135.$$
In this last case, the estimate for $\bar\mu$ is $\bar\mu = 0.989$, while the correct value is of course 1.
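The Example 3 computation is easy to reproduce in outline. The sketch below is our code, not the authors'; it includes the $1/n!$ factor that the normalization in (5.7), with $\gamma_n = n!$ for this trial prior, implies for the coefficient estimates.

```python
import math

def hermite(n, t):
    """Physicists' Hermite polynomial H_n(t) via
    H_{k+1}(t) = 2 t H_k(t) - 2 k H_{k-1}(t)."""
    if n == 0:
        return 1.0
    prev, cur = 1.0, 2.0 * t
    for k in range(1, n):
        prev, cur = cur, 2.0 * t * cur - 2.0 * k * prev
    return cur

def a_hat(sample, n):
    # coefficient estimate for the N(1, 1) trial prior (mu0 = m = 1),
    # using A_n(x) = H_n((x - 1)/2) and gamma_n = n!
    return sum(hermite(n, (x - 1.0) / 2.0) for x in sample) / (math.factorial(n) * len(sample))

def g_hat(sample, p, mu):
    # truncated orthogonal-series estimate of the prior density at mu
    g0 = math.exp(-(mu - 1.0) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)
    return g0 * sum(a_hat(sample, n) * 2.0 ** (-n / 2.0)
                    * hermite(n, (mu - 1.0) / math.sqrt(2.0))
                    for n in range(p + 1))
```

Since $H_0 = 1$, the estimate $\hat a_0 = 1$ for every sample, and the $p = 0$ truncation returns the trial prior $g_0$ itself.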

REMARK. Leonard (1984) has proposed an estimator of the form
$$\hat g(\theta) = \frac{1}{m}\sum_{i=1}^{m}\delta(\theta - \hat\theta_i)$$
for the prior density. The $\hat\theta_i$ are chosen, on the basis of a sample $x_1, x_2, \ldots, x_m$, by maximizing the likelihood function
$$L = \prod_{j=1}^{m}\sum_{i=1}^{m}\exp\big(x_j\theta_i - \psi(\theta_i)\big)$$
with respect to each $\theta_i$. Since this estimator shares the shortcomings (and advantages) of the empirical distribution, it cannot be mean-squared consistent. However, a smoothed version should be. This can easily be obtained in terms of our orthogonal polynomials by approximating $\delta(\mu - \mu_i)$ by the partial sums of its polynomial expansion,
$$\delta(\mu - \mu_i) = \sum_{n=0}^{\infty}r_n(\mu)\,r_n(\mu_i)\,g_0(\mu)/\gamma_n;$$
that is, since $\delta(\theta - \theta_i) = \delta(\mu - \mu_i)V(\mu_i)$,

(5.16) $\hat g_p(\mu(\theta)) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}\sum_{n=0}^{p}r_n(\mu(\theta))\,r_n(\mu(\theta_i))\,g_0(\mu(\theta))/\gamma_n.$

This approach has not yet been explored but shows promise.
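For the normal trial prior of Example 3 ($\mu_0 = m = 1$), the truncated delta expansion can be written out concretely. The sketch below is our illustration under those assumptions ($r_n = 2^{-n/2}H_n((\mu-1)/\sqrt2)$, $\gamma_n = n!$), not the authors' code; because every $r_n$ with $n \ge 1$ integrates to zero against $g_0$, the smoothed kernel has unit mass at any truncation level.

```python
import math

def hermite(n, t):
    # physicists' Hermite polynomials by the standard recurrence
    if n == 0:
        return 1.0
    prev, cur = 1.0, 2.0 * t
    for k in range(1, n):
        prev, cur = cur, 2.0 * t * cur - 2.0 * k * prev
    return cur

def delta_kernel(p, mu, mu_i):
    """Partial sum sum_{n <= p} r_n(mu) r_n(mu_i) g0(mu) / gamma_n for the
    N(1, 1) trial prior: a polynomial-smoothed version of delta(mu - mu_i)."""
    g0 = math.exp(-(mu - 1.0) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)
    total = 0.0
    for n in range(p + 1):
        rn = 2.0 ** (-n / 2.0) * hermite(n, (mu - 1.0) / math.sqrt(2.0))
        rni = 2.0 ** (-n / 2.0) * hermite(n, (mu_i - 1.0) / math.sqrt(2.0))
        total += rn * rni / math.factorial(n)
    return total * g0

# numerical check of unit mass for the p = 4 kernel centered at 0.3
step, lo, hi = 0.01, -9.0, 11.0
mass = sum(delta_kernel(4, lo + (i + 0.5) * step, 0.3)
           for i in range(int((hi - lo) / step))) * step
```

Averaging such kernels over the $\hat\mu_i$ gives the smoothed version of Leonard's estimator suggested by (5.16).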

APPENDIX A

Estimates of parameters associated with the orthogonal polynomials $\{r_n\}$. The basic relations are the Rodrigues formula

(3.1) $r_n = (-1)^n\big(V^n g_0\big)^{(n)}\big/g_0,$

the differential equation

(3.9) $\big(Vg_0 r_n'\big)' = \theta_n r_n g_0, \qquad \theta_n = n\big((n-1)v_2 - m\big),$

the recurrence formula

(3.6) $\mu\,r_n(\mu) = A_n r_{n+1}(\mu) + B_n r_n(\mu) + C_n r_{n-1}(\mu)$

and the derivative expression

(3.11) $V r_n' = \alpha_n r_{n+1} + \beta_n r_n + \delta_n r_{n-1}.$

We have also taken the leading coefficient of $r_n$ to be $k_n$ and the norm to be $\gamma_n$. A central quantity that occurs repeatedly is the expression

(A.1) $\displaystyle\int_a^b V^n(\mu)\,g_0(\mu)\,d\mu = v_n,$

which may be calculated explicitly if the QVF $V(\mu)$ and $g_0(\mu)$ are given (see Table 1).

The normalizing factor may be calculated easily: it is given by (3.5),

(A.2) $\gamma_n = k_n\,n!\,v_n,$

and may be found explicitly if $k_n$ is calculated.


A recurrence formula for the leading coefficients may be found by means of (3.1) [Tricomi (1955), page 136],

(A.3) $\dfrac{k_{n+1}}{k_n} = \dfrac{\big(m - 2nv_2\big)\big(m - (2n-1)v_2\big)}{m - (n-1)v_2}.$

This gives us expressions for $A_n$ and $C_n$ in (3.6):

(A.4) $A_n = \dfrac{k_n}{k_{n+1}}, \qquad C_n = \dfrac{k_{n-1}}{k_n}\cdot\dfrac{\gamma_n}{\gamma_{n-1}} = \dfrac{n\,v_n}{v_{n-1}}.$

However, $B_n$ involves both the leading coefficient and the coefficient $k_n'$ of $\mu^{n-1}$ in $r_n(\mu)$. It is [Tricomi (1955), page 126]

(A.5) $B_n = \dfrac{k_n'}{k_n} - \dfrac{k_{n+1}'}{k_{n+1}}.$

By again using the differential equation, we can find $k_n'$ to be [Tricomi (1955), page 137]

(A.6) $\dfrac{k_n'}{k_n} = \dfrac{n\big(m\mu_0 + (n-1)v_1\big)}{-m + 2(n-1)v_2}.$

The three coefficients in the other recurrence formula (3.11) may be found in terms of $k_n$ and $v_n$ by using these expressions. Alternatively, one can use (3.12), which gives $B_n$ in terms of $r_n(a)$ if it is known.

Since the differential equation has as its highest-order coefficient $V(\mu)$, a quadratic function, it is possible to convert it into a standard form by a linear change of variable. This form depends on whether or not $v_2 = 0$ and, if it does, whether $v_1 = 0$ as well.
whether v1 = 0 as well.
In case $v_2 \ne 0$, we may divide (3.9) by $-v_2$ to obtain the differential equation satisfied by $y = r_n$:

(A.7) $\Big({-\dfrac{v_0}{v_2}} - \dfrac{v_1}{v_2}\mu - \mu^2\Big)\dfrac{d^2y}{d\mu^2} + \dfrac{m}{v_2}(\mu - \mu_0)\dfrac{dy}{d\mu} + \dfrac{\theta_n}{v_2}\,y = 0.$

We then change scale and location by letting $\mu = a + bt$, where $a = -(v_1 + d)/(2v_2)$ [i.e., $V(a) = 0$] and $b = d/v_2$, where $d^2$ is the discriminant of $V(\mu)$, $d^2 = v_1^2 - 4v_0v_2$. Then (A.7) becomes the hypergeometric equation

(A.8) $t(1-t)\dfrac{d^2y}{dt^2} + \Big(c + \dfrac{m}{v_2}t\Big)\dfrac{dy}{dt} + n\Big(n - 1 - \dfrac{m}{v_2}\Big)y = 0, \qquad c = \dfrac{m\big(d + v_1 + 2v_2\mu_0\big)}{2dv_2},$

with polynomial solution expressible


as the hypergeometric function

(A.9) $r_n(\mu) = c_n\,F\Big(-n,\ n - 1 - \dfrac{m}{v_2};\ \dfrac{m\big(d + v_1 + 2v_2\mu_0\big)}{2dv_2};\ \dfrac{\mu - a}{b}\Big),$

where $c_n$ is a constant. Since $F(a, b; c; 0) = 1$,
$$c_n = r_n(a).$$

In the case $v_2 = 0$ but $v_1 \ne 0$, the equation in (3.9) may be expressed as

(A.10) $\Big(\dfrac{v_0}{v_1} + \mu\Big)\dfrac{d^2y}{d\mu^2} - \dfrac{m}{v_1}(\mu - \mu_0)\dfrac{dy}{d\mu} + \dfrac{mn}{v_1}\,y = 0.$

By letting $\mu = a + bt$, $a = -v_0/v_1$, $b = v_1/m$, we obtain the confluent hypergeometric equation

(A.11) $t\dfrac{d^2y}{dt^2} + \Big(\dfrac{mv_0}{v_1^2} + \dfrac{m\mu_0}{v_1} - t\Big)\dfrac{dy}{dt} + ny = 0,$

whence it follows that
$$r_n(\mu) = {}_1F_1\Big(-n;\ \dfrac{mv_0}{v_1^2} + \dfrac{m\mu_0}{v_1};\ \dfrac{\mu - a}{b}\Big),$$
where ${}_1F_1$ is the confluent hypergeometric function.


In case both v1 and v2 = 0, then V(,ui) is just a constant v
by the transformation I.L = a + bt, a = 1uo, b = 2vo/m, be
Hermite equation

d 2y dy
dt2 - 2t- + 2ny = 0.

These formulas can be used to obtain an expression for the derivatives of the $r_n$. Indeed, in the case $v_2 \ne 0$, we use the fact that

(A.12) $\dfrac{d}{dx}F(a, b; c; x) = \displaystyle\sum_{v=0}^{\infty}\dfrac{(a)_{v+1}(b)_{v+1}}{v!\,(c)_{v+1}}x^v = \dfrac{ab}{c}\sum_{v=0}^{\infty}\dfrac{(a+1)_v(b+1)_v}{v!\,(c+1)_v}x^v = \dfrac{ab}{c}\,F(a+1, b+1; c+1; x).$

Hence for our polynomials we have

(A.13) $\dfrac{d}{d\mu}r_n(\mu, \mu_0) = r_n(a, \mu_0)\,e_n\,F\Big(-n+1,\ n - \dfrac{m}{v_2};\ \dfrac{m\big(d + v_1 + 2v_2\mu_0\big)}{2dv_2} + 1;\ \dfrac{\mu - a}{b}\Big),$


where
$$e_n = -\frac{2n\big(n - 1 - m/v_2\big)v_2^2}{m\big(d + v_1 + 2v_2\mu_0\big)}.$$
But $r_{n-1}(\mu, \tilde\mu_0)$, formed with suitably shifted prior parameters $(\tilde m, \tilde\mu_0)$ in place of $(m, \mu_0)$, is proportional to
$$F\Big(-n+1,\ n - \frac{m}{v_2};\ \frac{m\big(d + v_1 + 2v_2\mu_0\big)}{2dv_2} + 1;\ \frac{\mu - a}{b}\Big),$$
since $a$ and $b$ depend only on $V(\mu)$ and not on the prior parameters $m$ and $\mu_0$. Thus we have

(A.14) $\dfrac{d}{d\mu}r_n(\mu, \mu_0) = \dfrac{r_n(a, \mu_0)\,e_n}{r_{n-1}(a, \tilde\mu_0)}\,r_{n-1}(\mu, \tilde\mu_0).$

The constant coefficients in (A.14) may be evaluated by using the fact that the leading coefficients of both sides must be equal. It should also be noted that this leading coefficient does not involve the parameter $\mu_0$. Indeed, from (3.3),

(3.3) $r_n(\mu) = \displaystyle\sum_{k=0}^{n}\binom{n}{k}\,p_{n-k}(m\mu_0, m\mu)\,\big\{\big(V^{n-1}(\mu)\big)^{(k)}V^{1+k-n}(\mu)\big\},$

where $p_{n-k}(m\mu_0, m\mu)$ is a polynomial of exact degree $n-k$ in both $m\mu_0$ and $m\mu$ whose leading coefficient in $\mu$ does not depend upon $\mu_0$ [Morris (1982)]. The expression $(V^{n-1}(\mu))^{(k)}V^{1+k-n}(\mu)$ is a polynomial of degree $k$ whenever $v_2 \ne 0$. Hence each term on the right-hand side of (3.3) is a polynomial of degree $n$ whose leading coefficient is independent of $\mu_0$, and so is $r_n(\mu, \mu_0)$. Thus, by equating the leading coefficients in (A.14), we find

(A.15) $\dfrac{r_n(a, \mu_0)\,e_n}{r_{n-1}(a, \tilde\mu_0)} = \dfrac{n\,k_n}{\tilde k_{n-1}},$

where $\tilde k_{n-1}$ denotes the leading coefficient of $r_{n-1}$ with the shifted parameters.

Other formulas involving the derivative may be found in Tricomi [(1955), page 136]. As before, the coefficients may be expressed in terms of $v_n$ and the parameters of $V(\mu)$ and $g_0(\mu)$.

For the case $v_2 = 0$, the polynomials are either Laguerre or Hermite and the derivative expressions are well known. Another expression for the derivative of $r_n g_0$ is easily derived from the definition:

(A.16) $\big(r_n g_0\big)' = p_n\big(r_{n-1}g_0\big)' + n\,p_n'\big(r_{n-1}g_0\big),$

where $p_n = r_1 - (n-1)V'$.


APPENDIX B

The six individual cases. In this section we present detailed calculations for each of the six basic NEF-QVF distributions of Morris (1982).

B.1. Normal. In the normal case we have $\mu = \theta\sigma^2$, $\psi(\theta) = \theta^2\sigma^2/2$ and $V(\mu) = \sigma^2$. The conjugate prior distribution is a constant times

(B.1.1) $g_0(\mu) = \dfrac{1}{\sigma^2}\exp\big(m(\theta\mu_0 - \theta^2\sigma^2/2)\big) = \dfrac{1}{\sigma^2}\exp\Big(\dfrac{m\mu}{2\sigma^2}(2\mu_0 - \mu)\Big).$

The polynomials satisfy the Rodrigues formula

(B.1.2) $r_n(\mu) = (-1)^n\exp\Big(\dfrac{m\mu}{2\sigma^2}(\mu - 2\mu_0)\Big)\dfrac{d^n}{d\mu^n}\Big(\sigma^{2n}\exp\Big({-\dfrac{m\mu}{2\sigma^2}}(\mu - 2\mu_0)\Big)\Big).$

For $m = 2\sigma^2$ and $\mu_0 = 0$, $r_n(\mu) = \sigma^{2n}H_n(\mu)$, where the $H_n$ are the Hermite polynomials. In general, $r_n$ may be obtained from $H_n$ by a change of scale and location:

(B.1.3) $r_n(\mu) = \Big(\dfrac{m}{2\sigma^2}\Big)^{n/2}\sigma^{2n}\,H_n\Big(\Big(\dfrac{m}{2\sigma^2}\Big)^{1/2}(\mu - \mu_0)\Big).$

The leading coefficient of $H_n$ is $2^n$ and hence that of $r_n(\mu)$ is

(B.1.4) $k_n = m^n.$

Since $V(\mu) = \sigma^2$ is a constant, $v_n = \sigma^{2n}$.

The formulas for $H_n$ are well known:

(B.1.5) $xH_n(x) = \tfrac12 H_{n+1}(x) + nH_{n-1}(x)$

and

(B.1.6) $H_n'(x) = 2nH_{n-1}(x),$

while the normalizing factor is

(B.1.7) $\displaystyle\int_{-\infty}^{\infty}e^{-x^2}H_n^2(x)\,dx = 2^n\,n!\,\pi^{1/2}.$
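The Hermite relations (B.1.5)-(B.1.7) are easy to confirm numerically. The sketch below is our check, not part of the paper; it evaluates $H_n$ by the standard recurrence and tests the recurrence, the derivative identity (via a central difference), and the norm (via a midpoint-rule integral).

```python
import math

def hermite(n, t):
    # H_{k+1}(t) = 2 t H_k(t) - 2 k H_{k-1}(t)
    if n == 0:
        return 1.0
    prev, cur = 1.0, 2.0 * t
    for k in range(1, n):
        prev, cur = cur, 2.0 * t * cur - 2.0 * k * prev
    return cur

x, n = 0.7, 4

# (B.1.5): x H_n = (1/2) H_{n+1} + n H_{n-1}
lhs_b15 = x * hermite(n, x)
rhs_b15 = 0.5 * hermite(n + 1, x) + n * hermite(n - 1, x)

# (B.1.6): H_n' = 2 n H_{n-1}, checked by a central difference
h = 1e-6
deriv = (hermite(n, x + h) - hermite(n, x - h)) / (2.0 * h)

# (B.1.7): integral of exp(-x^2) H_n(x)^2 equals 2^n n! sqrt(pi)
step, lo = 0.001, -10.0
norm = sum(math.exp(-(lo + (i + 0.5) * step) ** 2)
           * hermite(n, lo + (i + 0.5) * step) ** 2
           for i in range(int(20.0 / step))) * step
```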

B.2. Poisson. In the Poisson case the parameters $\theta$ and $\mu$ are related by $\mu = e^\theta = \psi'(\theta) = V(\mu)$, where $\mu$ may take values in $\Omega = (0, \infty)$. There is an immense literature in this case, most of which deals with estimation of the parameter $\theta$ [Hudson (1978)]. That their mean-squared error is often better than ours is not surprising, given the generality of our method. The conjugate prior is a multiple of

(B.2.1) $g_0(\mu) = \dfrac{1}{\mu}\exp\big(m\mu_0\log\mu - m\mu\big) = \mu^{m\mu_0 - 1}\exp(-m\mu).$

The Rodrigues formula is

(B.2.2) $r_n(\mu) = (-1)^n\mu^{1 - m\mu_0}e^{m\mu}\dfrac{d^n}{d\mu^n}\big(\mu^{n + m\mu_0 - 1}e^{-m\mu}\big),$

from which it follows that $r_n$, with a change of scale and normalization, is related to the standard Laguerre polynomial $L_n^{(\alpha)}$ with $\alpha = m\mu_0 - 1$:

(B.2.3) $r_n(\mu) = (-1)^n\,n!\,L_n^{(\alpha)}(m\mu),$

and hence $r_n(0) = (-1)^n\Gamma(n+\alpha+1)/\Gamma(\alpha+1)$ and $k_n = m^n$.

The recurrence formulas are

(B.2.4) $xL_n^{(\alpha)}(x) = -(n+1)L_{n+1}^{(\alpha)}(x) + (2n+\alpha+1)L_n^{(\alpha)}(x) - (n+\alpha)L_{n-1}^{(\alpha)}(x)$

and

(B.2.5) $x\big(L_n^{(\alpha)}(x)\big)' = nL_n^{(\alpha)}(x) - (n+\alpha)L_{n-1}^{(\alpha)}(x).$

The $v_n$ again are straightforward:

(B.2.6) $v_n = \displaystyle\int_0^\infty \mu^{n}\,\mu^{\alpha}e^{-\mu}\,d\mu\Big/\Gamma(\alpha+1) = \Gamma(n+\alpha+1)\big/\Gamma(\alpha+1).$
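The Laguerre relations (B.2.4)-(B.2.6) can likewise be checked numerically. In this sketch (ours, with arbitrary test values of $\alpha$ and $x$), $L_n^{(\alpha)}$ is generated by its standard three-term recurrence, of which (B.2.4) is a rearrangement; (B.2.5) is checked independently by a central difference, and (B.2.6) by a midpoint-rule integral.

```python
import math

def genlaguerre(n, alpha, x):
    """Generalized Laguerre polynomial L_n^{(alpha)}(x) via
    (k+1) L_{k+1} = (2k + alpha + 1 - x) L_k - (k + alpha) L_{k-1}."""
    if n == 0:
        return 1.0
    prev, cur = 1.0, 1.0 + alpha - x
    for k in range(1, n):
        prev, cur = cur, ((2 * k + alpha + 1 - x) * cur - (k + alpha) * prev) / (k + 1)
    return cur

alpha, n, x = 1.5, 4, 0.9

# (B.2.4): x L_n = -(n+1) L_{n+1} + (2n+alpha+1) L_n - (n+alpha) L_{n-1}
lhs = x * genlaguerre(n, alpha, x)
rhs = (-(n + 1) * genlaguerre(n + 1, alpha, x)
       + (2 * n + alpha + 1) * genlaguerre(n, alpha, x)
       - (n + alpha) * genlaguerre(n - 1, alpha, x))

# (B.2.5): x (L_n)' = n L_n - (n+alpha) L_{n-1}, via a central difference
h = 1e-6
deriv = (genlaguerre(n, alpha, x + h) - genlaguerre(n, alpha, x - h)) / (2.0 * h)

# (B.2.6) for n = 2: v_2 = Gamma(2+alpha+1)/Gamma(alpha+1), numerically
step = 0.001
v2_num = sum(((i + 0.5) * step) ** (2 + alpha) * math.exp(-(i + 0.5) * step)
             for i in range(int(60.0 / step))) * step / math.gamma(alpha + 1)
```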

B.3. Gamma. In the case of the gamma distribution, the parameter $\theta$ is given by $\theta = -r/\mu$, $\psi(\theta) = -r\log(-\theta)$ and $V(\mu) = \mu^2/r$. Hence the conjugate prior is a constant times

(B.3.1) $g_0(\mu) = (r/\mu^2)\exp\big({-m\mu_0 r/\mu} + mr\log(r/\mu)\big) = r^{mr+1}\mu^{-mr-2}\exp\big({-m\mu_0 r/\mu}\big), \qquad \mu \in (0, \infty).$

The Rodrigues formula is

(B.3.2) $r_n(\mu) = (-1)^n r^{-n}\mu^{2+mr}e^{m\mu_0 r/\mu}\dfrac{d^n}{d\mu^n}\big(\mu^{2n - mr - 2}e^{-m\mu_0 r/\mu}\big).$

If we take $m\mu_0 r/\mu = 2/x$ and let $mr + 2 = -a$, we obtain the Rodrigues formula (with a different constant) for the generalized Bessel polynomials $y_n^{(a)}$ [Chihara (1978), page 183],

(B.3.3) $y_n^{(a)}(x) = 2^{-n}x^{-a}e^{2/x}\dfrac{d^n}{dx^n}\big(x^{2n+a}e^{-2/x}\big).$

Hence we have, since $mr = -(a+2)$,

(B.3.4) $r_n(\mu) = \big({-(2+a)\mu_0/2}\big)^n(-1)^n 2^n\,y_n^{(a)}\Big({-\dfrac{2\mu}{(2+a)\mu_0}}\Big) = \big((2+a)\mu_0\big)^n\,y_n^{(a)}\Big({-\dfrac{2\mu}{(2+a)\mu_0}}\Big).$

Thus $r_n(0) = \big((2+a)\mu_0\big)^n y_n^{(a)}(0) = \big((2+a)\mu_0\big)^n$, and the leading coefficient may be calculated from that of $y_n^{(a)}$ or directly from (B.3.2). It is
$$k_n = (-1)^n\,\Gamma(2n+a+1)\big/\Gamma(n+a+1).$$


In the normalized case $\mu_0 = 1/mr = -1/(2+a)$, the $v_n$ are found to be

(B.3.5) $v_n = \displaystyle\int_0^\infty\big(\mu^{2n+a}/r^n\big)e^{-1/\mu}\,d\mu\Big/\Gamma(-a-1) = \Gamma(-2n-a-1)\big/\big(r^n\Gamma(-a-1)\big).$

The recurrence formulas for $y_n^{(a)}$ are given in Chihara (1978).

The $r_n(\mu)$ may also be expressed in terms of Laguerre polynomials, which satisfy the Rodrigues-type formula [Szegő (1967), page 388]

(B.3.6) $e^{-y}y^{\beta}L_n^{(\beta)}(y) = \dfrac{1}{n!}\dfrac{d^n}{dy^n}\big(y^{n+\beta}e^{-y}\big),$

where $y = 1/\mu$. Hence for $mr + 2 = \beta + 1$ and $\mu_0 = 1/mr$, we have

(B.3.7) $r_n(\mu) = n!\,\mu^n\displaystyle\sum_{k=0}^{n}\binom{\beta - n}{n - k}\dfrac{(-1/\mu)^k}{k!}.$

This last expression can also be given by [Tricomi (1955), page 218]

(B.3.8) $r_n(\mu) = n!\,\mu^n L_n^{(\beta - 2n)}(1/\mu).$

In this case only a finite number of the polynomials have finite norms. Indeed,
$$\int V^n(\mu)g_0(\mu)\,d\mu = \int_0^\infty\big(\mu^2/r\big)^n r^{mr+1}\mu^{-mr-2}e^{-m\mu_0 r/\mu}\,d\mu = \int_0^\infty r^{mr+1-n}x^{-2n-a-2}e^{-m\mu_0 r x}\,dx,$$
which converges if $-2n - a - 2 > -1$ and diverges otherwise. Thus only those polynomials with $n$ satisfying this inequality have finite norms.

B.4. Binomial. In the binomial case the appropriate interval is finite, $\Omega = (0, r)$. The mean is $\mu = r/(1 + e^{-\theta})$, $\psi(\theta) = r\log(1 + e^{\theta})$ and $V(\mu) = \mu - \mu^2/r$. Here $r$ is the total number of trials and $1/(1 + e^{-\theta}) = p$ is the probability of success. The conjugate prior distribution will be

(B.4.1) $g_0(\mu) = \exp\big(m\big(\mu_0\theta - r\log(1 + e^{\theta})\big)\big)\big/\big(\mu - \mu^2/r\big) = \exp\big(m\mu_0\log\big(\mu/(r-\mu)\big) - mr\log\big(r/(r-\mu)\big)\big)\big/\big(\mu - \mu^2/r\big) = \mu^{m\mu_0 - 1}(r-\mu)^{mr - m\mu_0 - 1}r^{1-mr} = \mu^{\beta}(r-\mu)^{\alpha}r^{-(\alpha+\beta+1)},$

where $\beta = m\mu_0 - 1$ and $\alpha = m(r - \mu_0) - 1$. With a change of scale (i.e., $r = 1$), this leads to the usual Rodrigues formula for the Jacobi polynomials on $(0, 1)$:

(B.4.2) $r_n(\mu) = (-1)^n\mu^{-\beta}(1-\mu)^{-\alpha}\dfrac{d^n}{d\mu^n}\big(\mu^{n+\beta}(1-\mu)^{n+\alpha}\big) = n!\,P_n^{(\alpha,\beta)}(2\mu - 1).$

The Bayes empirical Bayes problem has already been treated for this case in Walter and Hamedani (1987), who also considered the case of a noninformative initial prior, which leads to the Legendre polynomials. The more general problem, in which the indices $n_i$ of the binomial distributions are allowed to vary, was not considered, but may be attacked by the method of Leonard (1976). The recurrence formulas for $r = 1$ are well known [see Szegő (1967), pages 71-72], as is the differential equation. We observe merely that $r_n(1) = \Gamma(n+\alpha+1)/\Gamma(\alpha+1)$, that the leading coefficient is
$$k_n = \frac{\Gamma(2n+\alpha+\beta+1)}{\Gamma(n+\alpha+\beta+1)},$$
and that
$$\gamma_n = \int_0^1 r_n^2(\mu)\,\mu^{\beta}(1-\mu)^{\alpha}\,d\mu = \frac{n!\,\Gamma(n+\alpha+1)\Gamma(n+\beta+1)}{(2n+\alpha+\beta+1)\Gamma(n+\alpha+\beta+1)}.$$
These are the standard Jacobi polynomials transplanted to $(0, 1)$ by the change of scale and location $x = 2\mu - 1$.

The $\{r_n(\mu)\}$ are complete in $L^2\big((0,r),\ \mu^{\beta}(r-\mu)^{\alpha}\big)$, but the corresponding $\{l_n(x)\}$ given by (4.3) are not linearly independent, since $x$ takes only $r+1$ distinct values. Hence, to avoid problems with identifiability of $g(\mu) = h(\mu)g_0(\mu)$, we must restrict $h(\mu)$ to the span of $\{r_0, r_1, \ldots, r_r\}$.

B.5. Negative binomial. In the case of the negative binomial the mean is given by $\mu = r/(e^{-\theta} - 1)$, $\psi(\theta) = -r\log(1 - e^{\theta})$ and $V(\mu) = \mu + \mu^2/r$. The conjugate prior distribution may be expressed as a constant times

(B.5.1) $g_0(\mu) = \exp\big(m\mu_0\log\big(\mu/(r+\mu)\big) - mr\log\big((r+\mu)/r\big)\big)\big/\big(\mu + \mu^2/r\big) = \mu^{m\mu_0 - 1}(r+\mu)^{-mr - m\mu_0 - 1}r^{mr+1} = \mu^{\beta}(r+\mu)^{\alpha}r^{-(\alpha+\beta+1)},$

which is similar to (B.4.1), but the interval is $\Omega = (0, \infty)$. Since $g_0$ is a density, $\beta > -1$, which must hold since both $\mu_0$ and $m$ are positive. However, $\alpha + \beta < -1$ as well, and therefore $\alpha < 0$. We can change the scale again, which is equivalent to setting $r = 1$, and find the Rodrigues formula to be

(B.5.2) $r_n(\mu) = (-1)^n\mu^{-\beta}(1+\mu)^{-\alpha}\dfrac{d^n}{d\mu^n}\big(\mu^{n+\beta}(1+\mu)^{n+\alpha}\big).$

This can be converted to the Rodrigues formula for Jacobi polynomials on $(-1, 1)$ by the change of variable $x = 1 + 2\mu$, to obtain

(B.5.3) $r_n\Big(\dfrac{x-1}{2}\Big) = (-2)^n\Big(\dfrac{x-1}{2}\Big)^{-\beta}\Big(\dfrac{x+1}{2}\Big)^{-\alpha}\dfrac{d^n}{dx^n}\Big[\Big(\dfrac{x-1}{2}\Big)^{n+\beta}\Big(\dfrac{x+1}{2}\Big)^{n+\alpha}\Big] = (-1)^n n!\,P_n^{(\alpha,\beta)}(x) = n!\,P_n^{(\beta,\alpha)}(-x).$

However, since the interval in $x$ is $(1, \infty)$, many of the standard calculations do not hold. The moments $v_n$ are given by

(B.5.4) $v_n = \displaystyle\int_0^\infty\mu^n(1 + \mu/r)^n\,\mu^{\beta}(r+\mu)^{\alpha}\,d\mu\Big/\displaystyle\int_0^\infty\mu^{\beta}(r+\mu)^{\alpha}\,d\mu,$

which by the change of variable $x = \mu/(r+\mu)$ are seen to be

(B.5.5) $v_n = r^n\,B\big(n+\beta+1,\ -2n-\beta-\alpha-1\big)\big/B\big(\beta+1,\ -\beta-\alpha-1\big).$

These moments clearly exist only if $\beta > -1$ and $2n + \beta + \alpha < -1$, and hence we do not have a complete set of polynomials.

Other parameters may be calculated in terms of $P_n^{(\alpha,\beta)}$. We find $r_n(0)$ by (B.5.3) to be

(B.5.6) $r_n(0) = n!\,P_n^{(\beta,\alpha)}(-1) = (-1)^n\,\Gamma(n+\alpha+1)\big/\Gamma(\alpha+1)$

and the leading coefficient to be
$$k_n = (-1)^n\,\Gamma(2n+\alpha+\beta+1)\big/\Gamma(n+\alpha+\beta+1).$$

B.6. Generalized hyperbolic secant. The generalized hyperbolic secant distributions introduced by Morris (1982) have as their interval of mean values $\Omega = (-\infty, \infty)$, with $\mu = r\tan\theta$, $\psi(\theta) = -r\log(\cos\theta)$ and $V(\mu) = r + \mu^2/r$. The conjugate prior therefore is

(B.6.1) $g_0(\mu) = \big(r/(r^2+\mu^2)\big)\exp\big(m\mu_0\tan^{-1}(\mu/r) + mr\log\big(\cos(\tan^{-1}(\mu/r))\big)\big) = \big(r/(r^2+\mu^2)\big)(\cos\theta)^{mr}e^{m\mu_0\theta}, \qquad \theta = \tan^{-1}(\mu/r),$
$$= r^{mr+1}\big(r^2+\mu^2\big)^{-1-mr/2}\big((r - i\mu)/(r + i\mu)\big)^{m\mu_0 i/2}.$$

The Rodrigues formula is

(B.6.2) $r_n(\mu) = (-1/r)^n\big(r^2+\mu^2\big)^{1+mr/2}e^{-m\mu_0\theta}\dfrac{d^n}{d\mu^n}\Big(\big(r^2+\mu^2\big)^{n-1-mr/2}e^{m\mu_0\theta}\Big).$

However, the differential equation is easier to interpret in this case. It is

(B.6.3) $\big(r^2+\mu^2\big)y'' - mr(\mu - \mu_0)y' = n(n - 1 - mr)y.$

This may be converted into the equation

(B.6.4) $\big(1 - x^2\big)\dfrac{d^2y}{dx^2} + \big[\beta - \alpha - (\alpha+\beta+2)x\big]\dfrac{dy}{dx} + n(n+\alpha+\beta+1)y = 0$

by the change of variable $\mu = ixr$. But this is the equation of the Jacobi polynomials on $(-1, 1)$, with the solution
$$y = P_n^{(\alpha,\beta)}(x),$$
where $\alpha = -(m/2)(r + \mu_0 i) - 1$ and $\beta = -(m/2)(r - \mu_0 i) - 1$. Under the same change of variable, (B.6.2) becomes

(B.6.5) $r_n(ixr) = i^n(1-x)^{-\alpha}(1+x)^{-\beta}\dfrac{d^n}{dx^n}\big[(1-x)^{n+\alpha}(1+x)^{n+\beta}\big] = (-i)^n 2^n n!\,P_n^{(\alpha,\beta)}(x).$

Since $P_n^{(\alpha,\beta)}(1) = \binom{n+\alpha}{n}$, it follows that

(B.6.6) $r_n(ri) = (-i)^n 2^n\,\Gamma(n+\alpha+1)\big/\Gamma(\alpha+1).$

The leading coefficient of $P_n^{(\alpha,\beta)}(x)$ is $2^{-n}\binom{2n+\alpha+\beta}{n}$ [Szegő (1967), page 63]. Hence that of $r_n(\mu)$ is

(B.6.7) $k_n = \big((-1)^n/r^n\big)\,\Gamma(2n+\alpha+\beta+1)\big/\Gamma(n+\alpha+\beta+1).$

The recurrence formulas can be found from those of the Jacobi polynomials.

The moments $v_n$ are given by

(B.6.8) $v_n = \displaystyle\int_{-\infty}^{\infty}\dfrac{(r^2+\mu^2)^n}{r^n}(r+i\mu)^{\alpha}(r-i\mu)^{\beta}\,d\mu\Big/\displaystyle\int_{-\infty}^{\infty}(r+i\mu)^{\alpha}(r-i\mu)^{\beta}\,d\mu$

and may be calculated by means of the formula [Erdélyi (1954)]

(B.6.9) $\dfrac{2}{\pi}\displaystyle\int_0^{\pi/2}(\cos\theta)^{v-1}\cos(y\theta)\,d\theta = \dfrac{2^{1-v}\,\Gamma(v)}{\Gamma\big(\frac{v+y+1}{2}\big)\Gamma\big(\frac{v-y+1}{2}\big)}, \qquad \operatorname{Re} v > 0.$

By the change of variable $\mu/r = \tan\theta$, each integral in (B.6.8) reduces to this form; taking $y = \alpha - \beta$, with $v = -2n - \alpha - \beta - 1$ in the numerator and $v = -\alpha - \beta - 1$ in the denominator, we find (B.6.8) to be

(B.6.10) $v_n = \dfrac{(4r)^n\,\Gamma(-2n-\alpha-\beta-1)\,\Gamma(-\beta)\,\Gamma(-\alpha)}{\Gamma(-n-\alpha)\,\Gamma(-n-\beta)\,\Gamma(-\alpha-\beta-1)}.$

Again the moments exist only when $2n + \alpha + \beta < -1$.

Acknowledgments. The authors wish to thank Richard Askey for point-


ing out certain references and for saving them the pain (and pleasure) of
rediscovering properties of classical orthogonal polynomials. We are also grate-
ful to a referee for calling our attention to several important references and for
his suggestions which improved our presentation.

REFERENCES

BERGER, J. 0. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer,
New York.
BERRY, D. A. and CHRISTENSEN, R. (1979). Empirical Bayes estimation of binomial parameter via
mixtures of Dirichlet processes. Ann. Statist. 7 558-568.
CHIHARA, T. S. (1978). An Introduction to Orthogonal Polynomials. Gordon and Breach,
New York.
DAWID, A. P. (1973). Posterior expectations for large observations. Biometrika 60 664-667.
DEELY, J. J. and KRUSE, R. L. (1968). Construction of sequences estimating the mixing distribu-
tion. Ann. Math. Statist. 39 268-288.
DEELY, J. J. and LINDLEY, D. V. (1981). Bayes Empirical Bayes. J. Amer. Statist. Assoc. 76
833-841.
ERDÉLYI, A. (ed.) (1954). Tables of Integral Transforms 1. McGraw-Hill, New York.
HUDSON, H. M. (1978). A natural identity for exponential families with applications in multipa-
rameter estimation. Ann. Statist. 6 473-484.


LAIRD, N. M. (1978). Nonparametric maximum likelihood estimation of a mixing distribution. J. Amer. Statist. Assoc. 73 805-811.
LEONARD, T. (1976). Some alternative approaches to multiparameter estimation. Biometrika 63
69-75.
LEONARD, T. (1984). Some data-analytic modifications to Bayes-Stein estimation. Ann. Inst. Statist. Math. 36 11-21.
MARITZ, J. S. (1970). Empirical Bayes Methods. Methuen, London.
MORRIS, C. N. (1982). Natural exponential families with quadratic variance functions. Ann.
Statist. 10 65-80.
MORRIS, C. N. (1983). Natural exponential families with quadratic variance functions: Statistical
theory. Ann. Statist. 11 515-529.
ROBBINS, H. (1956). An empirical Bayes approach to statistics. In Proc. Third Berkeley Symp.
Math. Statist. Prob. 1 157-164. Univ. California Press, Berkeley.
SCHWARZ, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464.
SZEGŐ, G. (1967). Orthogonal Polynomials. Amer. Math. Soc., Providence, R.I.
TRICOMI, F. G. (1955). Vorlesungen über Orthogonalreihen [translation of Serie Ortogonali di Funzioni, Torino (1948)]. Springer, Berlin.
WALTER, G. G. and HAMEDANI, G. G. (1987). Empiric Bayes estimation of binomial probability.
Commun. Statist. Theory Methods 16 559-577.
WALTER, G. G. and HAMEDANI, G. G. (1989). Bayes empirical Bayes estimation for discrete
exponential families. Ann. Inst. Statist. Math. 41 101-119.
WALTER, G. G. and BLUM, J. (1979). Probability density estimation using delta sequences. Ann.
Statist. 7 328-340.

DEPARTMENT OF MATHEMATICAL SCIENCES
UNIVERSITY OF WISCONSIN-MILWAUKEE
MILWAUKEE, WISCONSIN 53201

DEPARTMENT OF MATHEMATICS, STATISTICS,
AND COMPUTER SCIENCE
MARQUETTE UNIVERSITY
MILWAUKEE, WISCONSIN 53233

